In mid-2014, a pair of Gartner analysts levelled some trenchant criticisms at the increasingly hyped concept of data lakes.
“The fundamental issue with the data lake is that it makes certain assumptions about the users of information,” said Gartner research director Nick Heudecker.
“It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without ‘a priori knowledge’ and that they understand the incomplete nature of datasets, regardless of structure.”
A year and a half later, Gartner’s concerns do not appear to have eased. While there are successful projects, there are also failures, and the key success factor appears to be a strong understanding of the different roles of a data lake and a data warehouse.
Heudecker said a data lake, often marketed as a means of tackling big data challenges, is a great place to figure out new questions to ask of your data, “provided you have the skills”.
“If that’s what you want to do, I’m less concerned about a data lake implementation. However, a higher risk scenario is if your intent is to reimplement your data warehousing service level agreements (SLAs) on the data lake.”
Heudecker said a data lake is typically optimised for different use cases, levels of concurrency and multi-tenancy.
“In other words, don’t use a data lake for data warehousing in anger.”
It’s perfectly reasonable to need both, he said, because each is optimised for different SLAs, users and skills.