Data lakes are a still-evolving way for companies to better leverage Big Data. Understanding data lake use cases is a good starting point.
Data lakes sound simple: Pool data or information into a Big Data system that combines processing speed with storage — a Hadoop cluster or an in-memory solution — so the business can access it for new insight. As with so much in technology, though, the reality is much more challenging than the dream.
Part of the problem is a misunderstanding of what a data lake should be, said the man who coined the term, Pentaho founder and CTO James Dixon. He never intended the term to describe a huge Hadoop repository that pulls data from every enterprise application.
“When people ask what a data lake is, I tell them it’s what you used to have on tape. Take what you have on tape and pour it into a data lake and start exploring that data,” Dixon said. “Our story was always: only put into Hadoop what you need to. If you want to combine information from the data lake with information in your CRM system, well, just do a join; do that blending of data only when you need to.”
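Dixon's "just do a join" advice can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the event export, the CRM records, and the `join_events_with_crm` helper are hypothetical, and a real deployment would query Hadoop and the CRM system rather than in-memory samples. The point is only that the CRM data stays where it is and is blended with lake data at query time.

```python
import csv
import io

# Hypothetical raw export sitting in the data lake (for example, archived logs).
LAKE_EVENTS_CSV = """customer_id,event,timestamp
101,login,2015-03-01T09:00:00
102,purchase,2015-03-01T09:05:00
101,purchase,2015-03-01T09:10:00
999,login,2015-03-01T09:15:00
"""

# Hypothetical CRM records. They are never copied into the lake;
# they are only looked up when a question needs them.
CRM_CUSTOMERS = {
    "101": {"name": "Acme Corp", "segment": "enterprise"},
    "102": {"name": "Bolt Ltd", "segment": "smb"},
}


def join_events_with_crm(events_csv, crm):
    """Inner-join lake events to CRM records on customer_id at query time."""
    joined = []
    for row in csv.DictReader(io.StringIO(events_csv)):
        customer = crm.get(row["customer_id"])
        if customer is None:
            continue  # no CRM match: drop the row, as an inner join would
        joined.append({**row, **customer})
    return joined


blended = join_events_with_crm(LAKE_EVENTS_CSV, CRM_CUSTOMERS)
for record in blended:
    print(record["name"], record["event"], record["timestamp"])
```

The event for customer `999` has no CRM match and is dropped, so only three blended rows remain.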
Despite Dixon’s intentions, the term took on a broader meaning with bigger promises. Folks began viewing Big Data lakes as a way to solve integration headaches by bringing all data into one super-fast, easy-to-access repository.
Instead, the repositories turned into slow and unyielding data swamps. Big Data required special expertise to analyze. The conclusions that resulted from using raw data raised red flags about data quality and governance.
“Everybody wanted to look at a data lake as the silver bullet for IT. Has there ever been one? I’m still waiting,” said Nick Heudecker, who researches data management for Gartner’s IT Leaders (ITL) Data and Analytics group. “I think once you get beyond that discovery phase, you need to do more. Data lakes, that same infrastructure can help, but you need to go into more of a professional information management world once you’ve used that data to answer the questions that you generated.”
So given the reality of data lakes, how can you use them to your organization’s advantage? Experts say there are four key data lake best practices.
To build a successful data lake, enterprises need to throw out the idea that a data lake will let them collect all their data in one place. It’s also important to understand that data lakes are not a replacement for enterprise data management systems and practices, at least not given the current state of Big Data technology.
“Organizations are still talking about data lakes but they’re also recognizing that all lakes are not equal,” said Jack Norris, senior VP of Data and Applications with MapR. “There’s a certain amount of capabilities you need or we’ve heard people talk about data swamps, where it’s hard to get data to flow out or in, it’s just stagnating there.”
Given that the data lake didn’t work out as planned, is it still viable? Yes, provided you understand its limits, experts say.
“I have a pretty scoped view – I don’t want to say narrow – but a very scoped view of what a data lake is,” Heudecker said. “To me, it’s a data science sandbox. It’s where you play with data and you try to find new insights. Once you’ve found that new insight, does it make sense to leave data in its raw format? I would argue that it doesn’t, because you now need to optimize the data. You need to ensure that it’s governed, that it’s semantically consistent, that it will meet the needs of the business consumers. So to me, the data lake is a lab. And you can do other things with it, but for me, when I’m advising clients, that’s how I try to advise them to think about their data lake.”
That isn’t as limiting as it may sound. For instance, Heudecker notes enterprises use data lakes to extract insight from Internet of Things deployments.
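Heudecker's step from raw data in the "lab" to governed, semantically consistent data can be sketched with a small example. It is a sketch under stated assumptions, not a description of any vendor's tooling: the raw sensor records, their field names, the `SensorReading` schema, and the `curate` function are all hypothetical. The idea is that once an insight is found, differently shaped raw records are mapped onto a single agreed schema and records that cannot satisfy it are rejected.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SensorReading:
    """Hypothetical curated schema: one name and one unit per field."""
    device_id: str
    celsius: float
    recorded_at: datetime


# Hypothetical raw records as they might land in the lake:
# inconsistent field names and units, some fields missing.
RAW_RECORDS = [
    {"device": "a1", "temp_f": "71.6", "ts": "2015-03-01T09:00:00"},
    {"device_id": "b2", "temp_c": 21.5, "ts": "2015-03-01T09:01:00"},
    {"device": "c3", "ts": "2015-03-01T09:02:00"},  # no temperature
]


def curate(raw):
    """Map one raw record onto SensorReading, or return None to reject it."""
    device_id = raw.get("device_id") or raw.get("device")
    if not device_id or "ts" not in raw:
        return None
    if "temp_c" in raw:
        celsius = float(raw["temp_c"])
    elif "temp_f" in raw:
        # Convert Fahrenheit to Celsius so every reading shares one unit.
        celsius = (float(raw["temp_f"]) - 32.0) * 5.0 / 9.0
    else:
        return None
    return SensorReading(
        device_id=device_id,
        celsius=round(celsius, 2),
        recorded_at=datetime.fromisoformat(raw["ts"]),
    )


curated = [r for r in map(curate, RAW_RECORDS) if r is not None]
for reading in curated:
    print(reading)
```

The raw records stay available in the lake for further exploration; only the curated readings would be handed to business consumers.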