One of the consequences of the hype and exaggeration that surround Big Data is that the labels and definitions we use to describe the field quickly become overloaded. One of the Big Data concepts that we currently risk overloading to the point of complete abstraction is the “Data Lake”.
Data Lake discussions are everywhere right now; to read some of these commentaries, the Data Lake is almost the prototypical use-case for the Hadoop technology stack. But there are far fewer actual, referenceable Data Lake implementations than there are Hadoop deployments – and even less documented best practice to tell you how you might actually go about building one.
So if the Data Lake is more architectural concept than physical reality in most organisations today, now seems like a good time to ask: What is a Data Lake anyway? What do we want it to be? And what do we want it not to be?
When you cut through the hype, most proponents of the Data Lake concept are promoting three big ideas:
1) It should capture all data (whatever “all” means) in a centralised, Hadoop-based repository
2) It should store that data in a raw, un-modelled format
3) Doing so should enable you to break down the barriers that still inhibit end-to-end, cross-functional Analytics in too many organisations
Now those are lofty and worthwhile ambitions, but at this point many of you could be forgiven a certain sense of déjà vu – because improving data accessibility and integration is what many of you thought you were building the Data Warehouse for.
In fact, many production Hadoop applications are built according to an application-specific design pattern (in technical jargon, a “star schema”), rather than an application-neutral one that allows multiple applications to be brought to a single copy of the data. And whilst there is a legitimate place in most organisations for at least some application-specific data stores, many of these solutions, far from breaking down barriers to Enterprise-wide Analytics, risk creating a new generation of data silos.
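To make the distinction concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and invented for illustration: the `SALES` rows, the column names and the `total_by` helper. It only contrasts a store that was pre-shaped for one report with a single detailed copy that several applications can query.

```python
from collections import defaultdict

# Hypothetical detail-level sales events: the single, shared copy of the data.
SALES = [
    {"store": "north", "product": "tea", "amount": 4.0},
    {"store": "north", "product": "coffee", "amount": 6.0},
    {"store": "south", "product": "tea", "amount": 5.0},
]


def total_by(rows, key):
    """Sum the 'amount' column of the detail rows, grouped by any other column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


# Application-specific store: pre-aggregated for one store-level report.
# Once only this summary is kept, the product dimension is gone.
store_report_silo = total_by(SALES, "store")

# Application-neutral approach: each application is brought to the same
# detail rows and asks its own question of them.
revenue_by_store = total_by(SALES, "store")
revenue_by_product = total_by(SALES, "product")

print(store_report_silo)
print(revenue_by_product)
```

The store-level summary can never answer the product-level question, which is how a second, third and twentieth purpose-built copy comes into being.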
A few short years after starting their Hadoop journey, a leading Teradata customer has already deployed more than twenty sizeable application-specific Hadoop clusters.