Investigating the Potential of Data Preparation

Data preparation is critical to the effectiveness of both operational and analytic business processes. Operational processes today are fed by streams of constantly generated data. Our data and analytics in the cloud benchmark research shows that more than half (55%) of organizations spend the most time in their analytic processes preparing data for analysis – a situation that reduces their productivity. Data now comes from more sources than ever, at a faster pace and in a dizzying array of formats; it often contains inconsistencies in both structure and content.

In response to these changing information conditions, data preparation technology is evolving. Big data, data science, streaming data and self-service all are impacting the way organizations collect and prepare data. Data sources used in analytic processes now include cloud-based data and external data. Many data sources now include large amounts of unstructured data, in contrast to just a few years ago when most organizations focused primarily on structured data. Our big data analytics benchmark research shows that nearly half (49%) include unstructured content such as documents or Web pages in their analyses.

The ways in which data is stored in organizations are changing as well. Historically, data was extracted, transformed and loaded, and only then made available to end users through data warehouses or data marts. Now data warehouses are being supplemented with, or in some cases replaced by, data lakes, which I have written about. As a result, the data preparation process may involve not just loading raw information into a data lake, but also retrieving and refining information from it.

The advent of big data technologies such as Hadoop and NoSQL databases intensifies the need to apply data science techniques to make sense of these volumes of information. In this case, querying and reporting over such large amounts of information are both inefficient and ineffective analytical techniques. And using data science means addressing additional data preparation requirements such as normalizing, sampling, binning and dealing with missing or outlying values. For example, in our next-generation predictive analytics benchmark research, 83% of organizations reported using sampling in preparing their analyses. Data scientists also frequently use sandboxes – copies of the data that can be manipulated without impacting operational processes or production data sources. Managing sandboxes adds yet another challenge to the data preparation process.
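To make the preparation steps named above more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not a method taken from the benchmark research: the function names, the sample values, the median fill for missing values, the Tukey-fence rule for outliers and the choice of four equal-width bins are all assumptions made for this example.

```python
import random
import statistics


def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(scaled_values, n_bins):
    """Map each value in [0, 1] to a bin index from 0 to n_bins - 1."""
    return [min(int(v * n_bins), n_bins - 1) for v in scaled_values]


def draw_sample(values, k, seed=0):
    """Draw k items without replacement; the fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return rng.sample(values, k)


# Hypothetical raw measurements with two missing values and one outlier.
raw = [12.0, 15.0, None, 14.0, 13.0, 400.0, 16.0, None, 11.0, 15.0]

filled = impute_missing(raw)          # deal with missing values
clipped = clip_outliers(filled)       # deal with outlying values
scaled = min_max_normalize(clipped)   # normalize
binned = equal_width_bins(scaled, 4)  # bin
subset = draw_sample(binned, 5)       # sample
```

The order of the steps matters in this sketch: clipping before normalizing keeps a single extreme value from compressing every other value toward zero. Real pipelines would typically rely on a dedicated data preparation tool or library rather than hand-written helpers like these.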

Data governance is always a challenge; in this new world it has, if anything, grown even more difficult as the volume and variety of data grow. At the moment most big data technologies trail their relational database counterparts in providing data governance capabilities.

 


