Six Key Principles to Accelerate Data Preparation in the Data Lake

In the digital economy, most companies are looking to significantly increase their use of data and analytics for competitive advantage. Many are making investments in new technologies like Hadoop that promise the speed and flexibility they need, and early pilots look promising. However, companies are struggling to scale these platforms to broad enterprise adoption, because the longest part of the process, getting data ready for business consumption, consistently takes too much time. A 2015 O’Reilly Data Scientist Salary Survey, for example, reported that most data scientists spend a third of their time performing basic extraction/transformation/load (ETL), data cleaning, and basic data exploration rather than true analytics or data modeling.

If we think of these new data platforms as a marketplace, preparing data for business use is the “transaction cost” of delivering value.  In an economic marketplace, high transaction costs limit the creation and exchange of value.  In a data marketplace, time-consuming data preparation dramatically limits the use of data and analytics in business processes, and more significantly, the rate at which companies can innovate.

We have identified several root causes of slow data preparation and six key principles that consistently accelerate it. These principles have proven very effective, with some companies achieving a 30x increase in analytics productivity and a 100x increase in user adoption. These results have convinced us that the most important driver for innovating with data is making data access and preparation frictionless.

There are two common methods of preparing data in the data lake. Data scientists often use programming interfaces such as Spark, Python, or R to work with data in the lake. Data wrangling tools can help lighten their burden. Using a graphical interface rather than requiring programming skills, wrangling tools permit data scientists to search data already in the lake and lightly cleanse and prepare new data sets for analytic projects. What wrangling tools don’t deliver is support for loading complex data into the lake and cleansing it, or the ability to manage, secure, and govern data consistent with the enterprise-scale demands of most large companies.
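
To make the first method more concrete, here is a minimal sketch of what direct, programmatic preparation against the lake might look like in PySpark. The paths, table, and column names are hypothetical, and a real pipeline would add schema management, validation, and error handling.

```python
# Hypothetical sketch: a data scientist working directly against the lake with PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("direct-lake-prep").getOrCreate()

# Read a data set already in the lake (hypothetical path)
orders = spark.read.parquet("/data/lake/raw/orders")

# Explore it briefly, then lightly cleanse it
orders.printSchema()
clean = (
    orders.dropDuplicates(["order_id"])                       # remove exact duplicate orders
          .withColumn("country", F.upper(F.trim("country")))  # standardize a text field
          .filter(F.col("order_total") >= 0)                  # drop obviously bad rows
)

# Publish a new data set for this analytic project
clean.write.mode("overwrite").parquet("/data/lake/projects/orders_clean")
```

These are the same kinds of steps a wrangling tool exposes through its graphical interface; here they simply live in code the analyst writes and maintains.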

Moreover, with wrangling tools, work done by an individual data scientist is only available for that person or that person’s immediate team. Wrangling tools don’t support shared learning across the enterprise as multiple data scientists crowdsource and enhance data in the data lake for easy reuse by others.

The challenge is that both approaches to data preparation force data scientists to spend too much valuable time on low-value-added tasks.

For an individual analyst, especially one with deep technical skills who needs direct access to data without waiting for IT, writing code to load data into the lake and then accessing it directly can be cost effective, quick, and pragmatic. Likewise, when most of the needed data is already in the lake, a team of data scientists, even those with limited programming skills, can easily use a data wrangling tool to search, lightly cleanse, and prepare new data sets.
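
As an illustration of that first path, the hypothetical PySpark sketch below loads a raw file into the lake and applies basic hygiene on the way in; the landing location, schema, and partitioning scheme are assumptions rather than a prescribed layout.

```python
# Hypothetical sketch: writing code to load new data into the lake.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("load-into-lake").getOrCreate()

# Ingest a raw CSV drop from an upstream system (hypothetical landing path)
raw = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/landing/crm/customers.csv")
)

# Basic hygiene before the data lands in the lake
typed = (
    raw.withColumn("signup_date", F.to_date("signup_date"))  # enforce a date type
       .na.drop(subset=["customer_id"])                      # require a key field
)

# Write into the lake in a columnar format, partitioned for downstream readers
typed.write.mode("append").partitionBy("signup_date").parquet("/data/lake/raw/customers")
```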

These approaches break down when large quantities of new data, or complex or dirty data, must be loaded into the lake, or when many analysts, including some with limited technical skills, need to work with data in the lake and share the results of their analyses and their enhancements to the data with one another. Let’s examine some of these challenges in more detail.

Data wrangling tools offer strong support for preparing data that is relatively simple, well organized, and data-type/schema compliant, such as relational tables or flat files. These tools allow users to enhance or standardize data in existing fields, create new fields, define relationships between tables, and create new data sets.
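
For readers who prefer to see these operations as code rather than through a graphical interface, the hypothetical PySpark sketch below standardizes an existing field, derives a new field, relates two tables, and writes out a new data set. The tables and columns are illustrative only.

```python
# Hypothetical sketch: typical wrangling operations expressed as code.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wrangling-operations").getOrCreate()

customers = spark.read.parquet("/data/lake/raw/customers")
orders = spark.read.parquet("/data/lake/raw/orders")

# Standardize an existing field and derive a new one
customers = (
    customers.withColumn("email", F.lower(F.trim("email")))
             .withColumn("signup_year", F.year("signup_date"))
)

# Define a relationship between the two tables and create a new data set
customer_value = (
    orders.join(customers, on="customer_id", how="left")
          .groupBy("customer_id", "signup_year")
          .agg(F.sum("order_total").alias("lifetime_value"))
)

customer_value.write.mode("overwrite").parquet("/data/lake/projects/customer_value")
```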