Six Key Principles to Accelerate Data Preparation in the Data Lake

In the digital economy, most companies are looking to significantly increase their use of data and analytics for competitive advantage. Many are making investments in new technologies like Hadoop that promise the speed and flexibility they need, and early pilots look promising. However, companies are struggling to scale these platforms to broad enterprise adoption because the longest part of the process, getting data ready for business consumption, consistently takes too long. A 2015 O’Reilly Data Scientist Salary Survey, for example, reported that most data scientists spend a third of their time performing basic extraction/transformation/load (ETL), data cleaning, and basic data exploration rather than true analytics or data modeling.

If we think of these new data platforms as a marketplace, preparing data for business use is the “transaction cost” of delivering value. In an economic marketplace, high transaction costs limit the creation and exchange of value. In a data marketplace, time-consuming data preparation dramatically limits the use of data and analytics in business processes, and more significantly, the rate at which companies can innovate.

We have identified several root causes for slow data preparation and six key principles that continuously accelerate it. These principles have proven very effective, with some companies achieving a 30x increase in analytics productivity and a 100x increase in user adoption. These results have convinced us that the most important driver for innovating with data is making data access and preparation frictionless.

There are two common methods to prepare data in the data lake. Data scientists often use programming interfaces such as Spark, Python, or R to work with data in the lake. Data wrangling tools can help lighten their burden. Using a graphical interface rather than requiring programmatic skills, wrangling tools permit data scientists to search data already in the lake and lightly cleanse and prepare new data sets for analytic projects. What wrangling tools don’t deliver is support for loading and cleansing complex data into the lake or the ability to manage, secure, and govern data consistent with the enterprise-scale demands of most large companies.
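To make the programmatic path concrete, the sketch below shows the kind of load-and-cleanse work a data scientist might do directly against the lake with PySpark. The bucket paths, column names, and cleansing rules are illustrative assumptions, not a prescribed pipeline.

```python
# A minimal sketch of hand-coded data preparation against the lake.
# Paths, column names, and cleansing rules are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("manual-data-prep").getOrCreate()

# Read a raw CSV file landed in the lake (hypothetical location).
raw = spark.read.option("header", "true").csv("s3://lake/raw/orders/*.csv")

# Basic cleansing a data scientist typically repeats for each new source:
# trim strings, normalize types, drop bad rows and duplicates.
clean = (
    raw.withColumn("customer_id", F.trim(F.col("customer_id")))
       .withColumn("order_total", F.col("order_total").cast("double"))
       .filter(F.col("order_total").isNotNull())
       .dropDuplicates(["order_id"])
)

# Write the prepared data set back to the lake for this analyst's project.
clean.write.mode("overwrite").parquet("s3://lake/prepared/orders/")
```

Every new source tends to require its own version of this boilerplate, which is precisely the low-value work that eats into analysis time.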

Moreover, with wrangling tools, work done by an individual data scientist is only available for that person or that person’s immediate team. Wrangling tools don’t support shared learning across the enterprise as multiple data scientists crowdsource and enhance data in the data lake for easy reuse by others.

The challenge is that both approaches to data preparation force data scientists to spend too much valuable time on low-value-added tasks.

For an individual analyst, especially one with deep technical skills who needs direct access to data without waiting for IT, writing code to load data into the lake and then accessing it directly can be cost effective, quick, and pragmatic. Likewise, when most of the needed data is already in the lake, a team of data scientists, even those with limited programming skills, can easily use a data wrangling tool to search, lightly cleanse, and prepare new data sets.

Where these approaches break down is when large quantities of new data or complex or dirty data must be loaded into the lake or when many analysts, including some with limited technical skills, need to work with data in the lake and share the results of their analyses and their enhancements to the data with one another. Let’s examine some of these challenges in more detail.

Data wrangling tools offer strong support to prepare data that is relatively simple, well-organized, and data-type/schema compliant, such as relational tables or flat files. These tools allow users to enhance or standardize data in existing fields, create new fields, define relationships between tables, and create new data sets.
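In code, those same wrangling operations look roughly like the PySpark sketch below: standardizing an existing field, deriving a new field, relating two tables, and writing out a new data set. The table and column names are assumed for illustration.

```python
# A rough code equivalent of typical wrangling-tool operations.
# Table and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wrangling-equivalent").getOrCreate()

customers = spark.read.parquet("s3://lake/prepared/customers/")
orders = spark.read.parquet("s3://lake/prepared/orders/")

# Standardize an existing field (e.g., uppercase country codes).
customers = customers.withColumn("country", F.upper(F.col("country")))

# Create a new field derived from an existing one.
orders = orders.withColumn("order_year", F.year(F.col("order_date")))

# Define the relationship between the two tables and create a new data set.
orders_by_customer = (
    orders.join(customers, on="customer_id", how="left")
          .groupBy("customer_id", "country", "order_year")
          .agg(F.sum("order_total").alias("total_spent"))
)

orders_by_customer.write.mode("overwrite").parquet(
    "s3://lake/prepared/orders_by_customer/"
)
```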


