What is a “Data Lake” Anyway?

One consequence of the hype and exaggeration that surround Big Data is that the labels and definitions we use to describe the field quickly become overloaded. One of the Big Data concepts that we currently risk overloading to the point of complete abstraction is the “Data Lake”.

Data Lake discussions are everywhere right now; to read some of these commentaries, you would think that the Data Lake is almost the prototypical use case for the Hadoop technology stack. But there are far fewer actual, referenceable Data Lake implementations than there are Hadoop deployments, and even less documented best practice to tell you how you might actually go about building one.

So if the Data Lake is more architectural concept than physical reality in most organisations today, now seems like a good time to ask: What is a Data Lake anyway? What do we want it to be? And what do we want it not to be?

When you cut through the hype, most proponents of the Data Lake concept are promoting three big ideas:

1) It should capture all data (whatever “all” means) in a centralised, Hadoop-based repository

2) It should store that data in a raw, un-modelled format

3) Doing so should enable you to break down the barriers that still inhibit end-to-end, cross-functional Analytics in too many organisations

Now those are lofty and worthwhile ambitions, but at this point many of you could be forgiven a certain sense of déjà vu, because improving data accessibility and integration is what many of you thought you were building the Data Warehouse for.

In fact, many production Hadoop applications are built according to an application-specific design pattern, rather than an application-neutral one that allows multiple applications to be brought to a single copy of data (in technical jargon, this is called a “star schema” design pattern). And whilst there is a legitimate place in most organisations for at least some application-specific data stores, far from breaking down barriers to Enterprise-wide Analytics, many of these solutions risk creating a new generation of data silos.
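To make the idea of bringing multiple applications to a single copy of data a little more concrete, here is a minimal sketch in Python. It is an illustration only, not a Hadoop implementation: the sample records, the field names and the two "applications" (a finance view and a marketing view) are all hypothetical. The raw, un-modelled records are stored once, and each application applies its own structure when it reads them, rather than keeping its own private copy.

```python
import json

# Hypothetical raw events, stored once in un-modelled form
# (one JSON document per line). All names here are assumptions.
RAW_EVENTS = [
    '{"customer": "c1", "channel": "web", "amount": 20.0}',
    '{"customer": "c2", "channel": "store", "amount": 35.5}',
    '{"customer": "c1", "channel": "store", "amount": 10.0}',
]


def read_events(raw_lines):
    """Parse the single shared copy of the raw data at read time."""
    return [json.loads(line) for line in raw_lines]


def revenue_by_channel(events):
    """View used by a hypothetical finance application."""
    totals = {}
    for event in events:
        channel = event["channel"]
        totals[channel] = totals.get(channel, 0.0) + event["amount"]
    return totals


def channels_by_customer(events):
    """View used by a hypothetical marketing application."""
    channels = {}
    for event in events:
        channels.setdefault(event["customer"], set()).add(event["channel"])
    return channels


# Both applications read the same copy; neither keeps its own silo.
events = read_events(RAW_EVENTS)
finance_view = revenue_by_channel(events)
marketing_view = channels_by_customer(events)
```

In the application-specific alternative, each application would load and reshape its own extract of these records, which is how a new generation of silos can emerge.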

A few short years after starting their Hadoop journey, a leading Teradata customer has already deployed more than twenty sizeable application-specific Hadoop clusters.

 


