In the digital economy, most companies are looking to significantly increase their use of data and analytics for competitive advantage. Many are making investments in new technologies like Hadoop that promise the speed and flexibility they need, and early pilots look promising. However, companies are struggling to scale these platforms to broad adoption by the enterprise, because the longest part of the process—getting data ready for business consumption—is consistently too high. A 2015 O’Reilly Data Scientist Salary Survey, for example, reported that most data scientists spend a third of their time performing basic extraction/transformation/load (ETL), data cleaning, and basic data exploration rather than true analytics or data modeling.
If we think of these new data platforms as a marketplace, preparing data for business use is the “transaction cost” of delivering value. In an economic marketplace, high transaction costs limit the creation and exchange of value. In a data marketplace, time-consuming data preparation dramatically limits the use of data and analytics in business processes, and more significantly, the rate at which companies can innovate.
We have identified several root causes for slow data preparation and six key principles that continuously accelerate data preparation. These principles have proven very effective, with some companies achieving a 30x increase in analytics productivity and 100x increase in user adoption. These results have convinced us that the most important driver for innovating with data is making data access and preparation frictionless.
There are two common methods to prepare data in the data lake. Data scientists often use programming interfaces such as Spark, Python, or R to work with data in the lake. Data wrangling tools can help lighten their burden. Using a graphical interface rather than requiring programmatic skills, wrangling tools permit data scientists to search data already in the lake and lightly cleanse and prepare new data sets for analytic projects. What wrangling tools don’t deliver is support for loading and cleansing complex data into the lake or the ability to manage, secure, and govern data consistent with the enterprise-scale demands of most large companies.
Moreover, with wrangling tools, work done by an individual data scientist is only available for that person or that person’s immediate team. Wrangling tools don’t support shared learning across the enterprise as multiple data scientists crowdsource and enhance data in the data lake for easy reuse by others.
The challenge is that both approaches to data preparation force data scientists to spend too much valuable time on low-value-added tasks.
For an individual analyst, especially one with deep technical skills who needs direct access to data without waiting for IT, writing code to load data into the lake and then accessing it directly can be cost effective, quick, and pragmatic. Likewise, when most of the needed data is already in the lake, a team of data scientists, even those with limited programming skills, can easily use a data wrangling tool to search, lightly cleanse, and prepare new data sets.
Where these approaches break down is when large quantities of new data or complex or dirty data must be loaded into the lake or when many analysts, including some with limited technical skills, need to work with data in the lake and share the results of their analyses and their enhancements to the data with one another. Let’s examine some of these challenges in more detail.
Data wrangling tools offer strong support to prepare data that is relatively simple, well-organized, and data-type/schema compliant, such as relational tables or flat files. These tools allow users to enhance or standardize data in existing fields, create new fields, define relationships between tables, and create new data sets.