In my last post, I started to look at the use of Hadoop in general and the data lake concept in particular as part of a plan for modernizing the data environment. There are surely benefits to the data lake, especially when it's deployed using a low-cost, scalable hardware platform. The significant issue we began to explore is this: the more prolific you become at loading data into the data lake, the greater the chance that entropy will overtake any attempt at proactive management.
Let's presume that you plan to migrate all corporate data to the data lake. The idea of the data lake is to provide a resting place for raw data in its native format until it's needed. Now, let's imagine what you need to know when you decide that the data truly is needed.
In other words, you need to know a lot about that data. And here is the most confusing part: you may not even know which data you want! That is part of the promise of the data lake: data is kept around until someone needs it, and it's up to the data consumers to determine what data they need, when they need it.
In reality, the simplistic approach to the data lake just won’t work. You need a means for creating a catalog of the data in the data lake so that data consumers have a way to browse through the inventory of data assets to determine which are usable for a particular application or analysis.
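To make the catalog idea a little more concrete, here is a minimal sketch in Python of what an inventory of data assets might look like. Everything in it is an assumption made for illustration: the `CatalogEntry` fields, the `DataCatalog` class, the sample paths and the keyword search are hypothetical, not a description of any particular product or of how a production catalog would be built.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Metadata for one data asset in the lake (all fields are illustrative)."""
    name: str
    location: str      # hypothetical path to the raw data in the lake
    data_format: str   # native format the data was landed in
    description: str   # human-readable summary for data consumers
    tags: List[str] = field(default_factory=list)


class DataCatalog:
    """A toy in-memory inventory that data consumers can browse."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Refuse duplicate names so each asset has exactly one description.
        if entry.name in self._entries:
            raise ValueError(f"asset already cataloged: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        # Case-insensitive match against the name, description, or tags.
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle == t.lower() for t in e.tags)
        ]


def build_example_catalog() -> DataCatalog:
    """Populate a catalog with two made-up assets for demonstration."""
    catalog = DataCatalog()
    catalog.register(CatalogEntry(
        name="crm_accounts",
        location="/lake/raw/crm/accounts.csv",
        data_format="csv",
        description="Customer account records exported from the CRM system",
        tags=["customer", "crm"],
    ))
    catalog.register(CatalogEntry(
        name="web_clickstream",
        location="/lake/raw/web/clicks.json",
        data_format="json",
        description="Raw web server click events",
        tags=["web", "behavior"],
    ))
    return catalog


if __name__ == "__main__":
    for entry in build_example_catalog().search("customer"):
        print(entry.name, entry.location, entry.data_format)
```

The point of the sketch is the registration step: an asset becomes findable only once someone has described it when it was loaded. Without that step, a consumer is back to guessing which files in the lake might be relevant.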