Data lakes are reaching adolescence. The wild emotional years and peer pressure aren’t quite over yet. Many mainstream organizations are still going through growing pains, where enthusiasm ends in embarrassment. Channeling Hadoop and data lake enthusiasm in the right direction is a huge CIO challenge. Here are a few paths toward data lake maturity.
Adolescent enthusiasm is no way to run a business. Every data lake project must have a line-of-business champion and goal, just like every other project. Projects without a funding business sponsor tend to fail. Improved customer experience, cost efficiency, and new business opportunities should top the list of data lake projects.
Think of the ROI plan as guardrails on a mountain road that keep projects from going over the edge. Million-dollar projects that the business people refuse to touch are career-limiting. And for the first few projects, skip ‘cost efficiency’ justifications that only benefit IT. Those projects are often false growth spurts.
One pitfall that data lake enthusiasts fall into is “put everything in the data lake.” We know one worldwide corporate giant that ran up a colossal expense this way. Hundreds of times a day they store a terabyte file in the lake. Hadoop then replicates that file twice for availability. Then they derive seven files from the first. That’s eight terabytes times three copies, or 24 terabytes of disk. Multiply that by dozens of files daily. Soon the data lake is twenty petabytes and a thousand servers.
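The arithmetic behind that example can be sketched in a few lines of Python. This is a rough estimate, not a sizing tool: the function name is made up for illustration, every derived file is assumed to be as large as the source, and "dozens of files daily" is taken as 24 purely as a placeholder.

```python
def lake_footprint_tb(source_tb, derived_files, replication_factor):
    """Raw disk consumed by one source file plus its derived files.

    Assumes every derived file is as large as the source and that each
    file is stored replication_factor times in total.
    """
    logical_tb = source_tb * (1 + derived_files)
    return logical_tb * replication_factor


# Figures from the example: a 1 TB file, seven derived files, and
# three copies in total (the original plus two replicas).
per_source_tb = lake_footprint_tb(source_tb=1, derived_files=7, replication_factor=3)

# Assumption: "dozens of files daily" taken as 24 source files a day.
files_per_day = 24
daily_tb = per_source_tb * files_per_day

# Days until the lake reaches 20 PB, using 1 PB = 1000 TB.
days_to_20_pb = 20 * 1000 / daily_tb

print(f"Per source file: {per_source_tb} TB")
print(f"Per day: {daily_tb} TB")
print(f"Days to reach 20 PB: {days_to_20_pb:.1f}")
```

Under those assumptions each source file costs 24 TB of disk, and the lake passes twenty petabytes in about five weeks. Changing `derived_files` shows how much of the bill comes from derivatives rather than from the raw data.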
Nope, disk storage is not free, especially in the data lake. Start here: every file placed in the data lake must be a line-of-business necessity. Avoid polluting the lake. Next, minimize derived files. That means programmers must coordinate designs. That’s common-sense project management. Optimizing spending from the beginning is easier and cheaper than cleaning up a swamp.
A first principle of data lakes is to capture the original raw data files. Raw files mean the data will have flaws, inconsistencies, and missing values. But dirty data begets muddy answers. Broken data begets broken answers. Refuse to clean up the data and the business users will refuse to use it.