In my last post, I started to look at the use of Hadoop in general and the data lake concept in particular as part of a plan for modernizing the data environment. There are surely benefits to the data lake, especially when it's deployed using a low-cost, scalable hardware platform. The significant issue we began to explore is this: the more prolific you become at loading data into the data lake, the greater the chance that entropy will overtake any attempt at proactive management.
Let's presume that you plan to migrate all corporate data to the data lake, and recall that the idea of the data lake is to provide a resting place for raw data in its native format until it's needed. Now, imagine what you need to know when you decide that the data truly is needed.
In short, you need to know a lot about that data. And here is the most confounding part: you may not even know which data you want! That is part of the promise of the data lake: data is kept around until someone needs it, and it's up to the data consumer to determine what data they need and when they need it.
In reality, the simplistic approach to the data lake just won’t work. You need a means for creating a catalog of the data in the data lake so that data consumers have a way to browse through the inventory of data assets to determine which are usable for a particular application or analysis.
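To make the catalog idea concrete, here is a minimal sketch of what such an inventory might look like. The record fields (source system, format, ingestion date, description, tags) and the class names are illustrative assumptions, not taken from any particular catalog product; a production catalog would also track lineage, ownership, and quality metrics.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """One hypothetical inventory record describing a raw data asset."""
    name: str
    source_system: str
    fmt: str                      # e.g. "csv", "parquet", "json"
    ingested_on: date
    description: str
    tags: list = field(default_factory=list)

class DataCatalog:
    """In-memory inventory that data consumers can browse or search."""
    def __init__(self):
        self._entries = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, keyword: str) -> list:
        # Match against description text and tags, case-insensitively.
        kw = keyword.lower()
        return [
            e for e in self._entries
            if kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)
        ]

catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="pos_transactions_raw",
    source_system="retail-pos",
    fmt="parquet",
    ingested_on=date(2017, 3, 1),
    description="Raw point-of-sale transactions, stored as landed",
    tags=["sales", "transactions"],
))
matches = catalog.search("sales")
```

Even this toy version shows the point: without a registration step at load time, the `search` call has nothing to work with, and consumers are left guessing at what sits in the lake.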