4 Best Practices for Data Lakes

Data lakes are a still-evolving way for companies to better leverage Big Data. Understanding data lake use cases is a good starting point.

Data lakes sound simple: Pool data or information into a Big Data system that combines processing speed with storage -- a Hadoop cluster or an in-memory solution -- so the business can access it for new insight. As with so much in technology, though, the reality is much more challenging than the dream.

Part of that is a misunderstanding of what a data lake should be, said the man who coined the term, Pentaho founder and CTO James Dixon. He never intended data lakes to describe a huge Hadoop repository that pulled data from all enterprise applications.

"When people ask what a data lake is, I tell them it's what you used to have on tape. Take what you have on tape and pour it into a data lake and start exploring that data," Dixon said. "Our story was always only put into Hadoop what you need to; if you want to combine information from the data lake with information in your CRM system, well just do a join, do that blending of data only when you need to."

Despite Dixon's intentions, the term took on a broader meaning with bigger promises. Folks began viewing Big Data lakes as a way to solve integration headaches by bringing all data into one super-fast, easy-to-access repository.

Instead, the repositories turned into slow and unyielding data swamps. Big Data required special expertise to analyze. The conclusions that resulted from using raw data raised red flags about data quality and governance.

"Everybody wanted to look at a data lake as the silver bullet for IT. Has there ever been one? I'm still waiting," said Nick Heudecker, who researches data management for Gartner's IT Leaders (ITL) Data and Analytics group. "I think once you get beyond that discovery phase, you need to do more. Data lakes, that same infrastructure can help, but you need to go into more of a professional information management world once you used that data to answer the questions that you generated."

So given the reality of data lakes, how can you use them to your organization's advantage? Experts say there are four key data lake best practices:

To build a successful data lake, enterprises need to throw out the idea that a data lake will let them collect all their data in one place. It's also important to understand that data lakes are not a replacement for enterprise data management systems and practices -- at least, not given the current state of Big Data technology.

"Organizations are still talking about data lakes but they're also recognizing that all lakes are not equal," said Jack Norris, senior VP of Data and Applications with MapR. "There's a certain amount of capabilities you need or we've heard people talk about data swamps, where it's hard to get data to flow out or in, it's just stagnating there."

Given that the data lake didn't work out as planned, is it still viable? Yes, provided you understand its limits, experts say.

"I have a pretty scoped view - I don't want to say narrow - but a very scoped view of what a data lake is," Heudecker said. "To me, it's a data science sandbox. It's where you play with data and you try to find new insights. Once you've found that new insight, does it make sense to leave data in its raw format? I would argue that it doesn't because you now need to optimize the data. You need to insure that it's governed, that it's semantically consistent, that it will meet the needs of the business consumers so to me the data lake is a lab. And you can do other things with it but for me, when I'm advising clients that's how I try to advise them to think about their data lake."

That isn't as limiting as it may sound. For instance, Heudecker notes enterprises use data lakes to extract insight from Internet of Things deployments.

 


