Data Lakes are Growing Up

Data lakes are reaching adolescence. The wild emotional years and peer pressure aren’t quite over yet. Many mainstream organizations are still suffering growing pains of enthusiasm that end in embarrassment. Channeling Hadoop and data lake enthusiasm in the right direction is a major CIO challenge. Here are a few paths toward data lake maturity.

Adolescent enthusiasm is no way to run a business. Every data lake project must have a line-of-business champion and a goal, just like every other project. Projects without a funding business sponsor tend to fail. Improved customer experience, cost efficiency, and new business opportunities should top the list of data lake projects.

Think of the ROI plan as guardrails on a mountain road, keeping projects from going over the edge. Million-dollar projects that business people refuse to touch are career limiting. And for the first few projects, skip ‘cost efficiency’ justifications that only benefit IT. Those projects are often false growth spurts.


One pitfall data lake enthusiasts fall into is “put everything in the data lake.” We know one worldwide corporate giant that ran up colossal expense this way. Dozens of times a day it lands a terabyte file in the lake. Hadoop then replicates each file twice more for availability. Seven more files are then derived from each original. That’s eight terabytes per source file, or twenty-four terabytes once replicated. Multiply that by dozens of files a day and the data lake soon reaches twenty petabytes on a thousand servers.
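To see how quickly this compounds, here is a back-of-the-envelope sketch in Python. The file size, derivative count, and replication factor come from the example above; the forty-files-per-day figure is an assumption standing in for “dozens of files a day.”

```python
# Back-of-the-envelope data lake footprint, using the numbers from the
# example above. FILES_PER_DAY is an assumption, not a figure from the article.

SOURCE_FILE_TB = 1        # each incoming raw file is about a terabyte
DERIVATIVES_PER_FILE = 7  # files derived from each raw file
REPLICATION_FACTOR = 3    # HDFS default: the original plus two replicas
FILES_PER_DAY = 40        # assumed stand-in for "dozens of files a day"

logical_tb_per_file = SOURCE_FILE_TB * (1 + DERIVATIVES_PER_FILE)  # 8 TB
physical_tb_per_file = logical_tb_per_file * REPLICATION_FACTOR    # 24 TB
physical_tb_per_day = physical_tb_per_file * FILES_PER_DAY         # 960 TB

target_pb = 20
days_to_target = target_pb * 1000 / physical_tb_per_day            # ~3 weeks

print(f"Physical footprint per source file: {physical_tb_per_file} TB")
print(f"Daily growth: {physical_tb_per_day} TB")
print(f"Days to reach {target_pb} PB: {days_to_target:.0f}")
```

Under those assumptions, each terabyte landed becomes twenty-four terabytes on disk, and the lake crosses twenty petabytes in about three weeks.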

No, disk storage is not free, especially in the data lake. Start here: every file placed in the data lake must be a line-of-business necessity. Avoid polluting the lake. Next, minimize derivatives, which means programmers must coordinate designs. That’s common sense project management. Optimizing spending from the beginning is easier and cheaper than cleaning up a swamp later.

A first principle of data lakes is to capture the original raw data files. Raw files mean the data will have flaws, inconsistencies, and missing values. But dirty data begets muddy answers, and broken data begets broken answers. Refuse to clean up the data and business users will refuse to use it.
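Keeping raw files and serving clean data are not in conflict: a common pattern is to land the raw file untouched and publish a cleaned, curated copy alongside it for business users. Below is a minimal sketch of that idea using pandas; the paths, column names, and cleaning rules are hypothetical.

```python
import pandas as pd

# Hypothetical raw-zone file and curated-zone destination.
RAW_PATH = "lake/raw/orders/2017-06-01.csv"
CURATED_PATH = "lake/curated/orders/2017-06-01.parquet"

# The raw file stays in the lake exactly as it arrived.
raw = pd.read_csv(RAW_PATH)

# The curated copy is the one business users query: keys present,
# dates typed, duplicates removed.
cleaned = (
    raw
    .dropna(subset=["order_id", "customer_id"])                       # drop rows missing keys
    .assign(order_date=lambda df: pd.to_datetime(df["order_date"],
                                                 errors="coerce"))    # coerce bad dates to NaT
    .drop_duplicates(subset=["order_id"])                             # remove duplicate orders
)

cleaned.to_parquet(CURATED_PATH, index=False)
```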
