Data Lakes are Growing Up

Data Lakes are Growing Up

Data Lakes are Growing Up

Data lakes are reaching adolescence. The wild emotional years and peer pressure aren’t quite over yet. Many mainstream organizations are still having enthusiastic growing pains that end in embarrassment. Channeling Hadoop and data lake enthusiasm in the right direction is a huge CIO challenge. Here are a few paths toward data lake maturity.

Adolescent enthusiasm is no way to run a business. Every data lake project must have a line of business champion and goal. Just like every other project. Projects without a funding business sponsor tend to fail. Improved customer experience, cost efficiency, and new business opportunities should top the list of data lake projects.

Think of the ROI plan as guide rails on a mountain road keeping projects on the road. Million dollar projects that the business people refuse to touch are career limiting. And for the first few projects, skip ‘cost efficiency’ justifications that only benefit IT. Those projects are often false growth spurts.

Read Also:
How Big Data Is Helping the NYPD Solve Crimes Faster

One pitfall that Data Lake enthusiasts fall into is “put everything in the data lake.” We know one worldwide corporate giant that got into colossal expense this way. Hundreds of times a day they store a terabyte file in the lake. Hadoop then replicates that file twice for availability. Then they derived seven files from the first. That’s eight terabytes times three terabytes. Multiply that by dozens of files daily. Soon the data lake is twenty petabytes and a thousand servers.

Nope, disk storage is not free --especially in the data lake. Start here: every file placed in the data lake must be a line of business necessity. Avoid polluting the lake. Next minimize derivatives. That means programmers must coordinate designs. That’s common sense project management. Optimizing spending from the beginning is easier and cheaper than cleaning up a swamp.

A first principle of data lakes is to capture the original raw data files. Raw files means the data will have flaws, inconsistencies, and missing values. But dirty data begets muddy answers. Broken data begets broken answers. Refuse to clean up the data and the business users will refuse to use it.

Read Also:
5 Big Data Projects You Can No Longer Overlook

 



Data Innovation Summit 2017

30
Mar
2017
Data Innovation Summit 2017

30% off with code 7wData

Read Also:
5 mistakes to avoid when deploying an analytical database

Big Data Innovation Summit London

30
Mar
2017
Big Data Innovation Summit London

$200 off with code DATA200

Read Also:
Identifying the Obstacles to Analytics Success

Enterprise Data World 2017

2
Apr
2017
Enterprise Data World 2017

$200 off with code 7WDATA

Read Also:
Business Intelligence with GPU Databases + Visual Analytics – MapD

Data Visualisation Summit San Francisco

19
Apr
2017
Data Visualisation Summit San Francisco

$200 off with code DATA200

Read Also:
9 IoT global trends for 2017

Chief Analytics Officer Europe

25
Apr
2017
Chief Analytics Officer Europe

15% off with code 7WDCAO17

Read Also:
Cloud Stampede Is On, But Who's Watching Security?

Leave a Reply

Your email address will not be published. Required fields are marked *