Data Lakes are Growing Up

Data lakes are reaching adolescence. The wild emotional years and peer pressure aren’t quite over yet. Many mainstream organizations are still having enthusiastic growing pains that end in embarrassment. Channeling Hadoop and data lake enthusiasm in the right direction is a huge CIO challenge. Here are a few paths toward data lake maturity.

Adolescent enthusiasm is no way to run a business. Every data lake project must have a line-of-business champion and goal, just like every other project. Projects without a funding business sponsor tend to fail. Improved customer experience, cost efficiency, and new business opportunities should top the list of data lake projects.

Think of the ROI plan as guardrails on a mountain road that keep projects on the road. Million-dollar projects that the business people refuse to touch are career-limiting. And for the first few projects, skip ‘cost efficiency’ justifications that only benefit IT. Those projects are often false growth spurts.

One pitfall that data lake enthusiasts fall into is “put everything in the data lake.” We know one worldwide corporate giant that ran up colossal expenses this way. Hundreds of times a day they store a terabyte file in the lake. Hadoop then replicates that file twice for availability. Then they derive seven files from the first, and each of those is replicated too. That’s eight terabytes of data times three copies, or 24 terabytes on disk. Multiply that by dozens of files daily. Soon the data lake is twenty petabytes and a thousand servers.
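The arithmetic behind that story is easy to check. The sketch below is only a back-of-the-envelope estimate under stated assumptions: each derived file is taken to be as large as the one-terabyte original, every file is held as three copies (the original plus two replicas), and the figure of 24 ingests per day is a hypothetical stand-in for "dozens." The function and variable names are illustrative, not part of any Hadoop API.

```python
# Back-of-the-envelope storage estimate for the scenario described above.
# All inputs are assumptions for illustration, not measurements.

def footprint_tb(raw_tb, derived_files, derived_tb, copies):
    """Physical terabytes consumed by one ingested file plus its derivatives."""
    logical_tb = raw_tb + derived_files * derived_tb  # data before replication
    return logical_tb * copies                        # data as stored on disk

# One 1 TB raw file, seven derived files assumed to be 1 TB each, three copies.
per_ingest = footprint_tb(raw_tb=1, derived_files=7, derived_tb=1, copies=3)

ingests_per_day = 24                 # hypothetical value for "dozens"
daily_tb = per_ingest * ingests_per_day
days_to_20_pb = 20_000 / daily_tb    # 20 PB = 20,000 TB in decimal units

print(per_ingest, daily_tb, round(days_to_20_pb, 1))
```

Under these assumptions each ingest costs 24 terabytes of disk, and the lake passes twenty petabytes in roughly five weeks. Halving the number of derived files or the ingest rate moves that date out proportionally, which is why the spending decisions belong at the start of the project.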

Nope, disk storage is not free, especially in the data lake. Start here: every file placed in the data lake must be a line-of-business necessity. Avoid polluting the lake. Next, minimize derivatives. That means programmers must coordinate designs. That’s common-sense project management. Optimizing spending from the beginning is easier and cheaper than cleaning up a swamp.

A first principle of data lakes is to capture the original raw data files. Raw files mean the data will have flaws, inconsistencies, and missing values. But dirty data begets muddy answers. Broken data begets broken answers. Refuse to clean up the data and the business users will refuse to use it.