Investigating the Potential of Data Preparation

Data preparation is critical to the effectiveness of both operational and analytic business processes. Operational processes today are fed by streams of constantly generated data. Our data and analytics in the cloud benchmark research shows that more than half (55%) of organizations spend more time preparing data for analysis than on any other part of their analytic processes, a situation that reduces their productivity. Data now comes from more sources than ever, at a faster pace and in a dizzying array of formats, and it often contains inconsistencies in both structure and content.

In response to these changing information conditions, data preparation technology is evolving. Big data, data science, streaming data and self-service are all changing the way organizations collect and prepare data. Data sources used in analytic processes now include cloud-based data and external data. Many data sources now include large amounts of unstructured data, in contrast to just a few years ago, when most organizations focused primarily on structured data. Our big data analytics benchmark research shows that nearly half (49%) of organizations include unstructured content such as documents or Web pages in their analyses.

The ways in which data is stored in organizations are changing as well. Historically, data was extracted, transformed and loaded, and only then made available to end users through data warehouses or data marts. Now data warehouses are being supplemented with, or in some cases replaced by, data lakes, which I have written about. As a result, the data preparation process may involve not just loading raw information into a data lake, but also retrieving and refining information from it.
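
As a rough illustration of that pattern, the Python sketch below lands raw records in a lake directory as-is and applies structure only on retrieval (schema-on-read). The paths, record layout and refinements are hypothetical stand-ins for a real lake and a real pipeline, not a reference implementation.

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical lake location; in practice this would be cloud or HDFS storage.
LAKE = Path("datalake/raw/orders")
LAKE.mkdir(parents=True, exist_ok=True)

# Ingest: land the raw records untouched, with no upfront transformation.
raw_records = [
    {"order_id": 1, "amount": "19.99", "ts": "2017-03-01T10:00:00"},
    {"order_id": 2, "amount": None, "ts": "2017-03-01T10:05:00"},
]
(LAKE / "batch_001.json").write_text(json.dumps(raw_records))

# Retrieve and refine: apply structure only when the data is read for analysis.
frames = [pd.read_json(p) for p in LAKE.glob("*.json")]
orders = pd.concat(frames, ignore_index=True)
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")
orders["ts"] = pd.to_datetime(orders["ts"])
print(orders.dtypes)
```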

The advent of big data technologies such as Hadoop and NoSQL databases intensifies the need to apply data science techniques to make sense of these volumes of information. At this scale, conventional querying and reporting are both inefficient and ineffective analytical techniques. And using data science means addressing additional data preparation requirements such as normalizing, sampling, binning and dealing with missing or outlying values, as the sketch below illustrates.
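
Here is a minimal sketch of those four preparation steps in Python with pandas and NumPy. The data is synthetic, and the specific policies (median imputation, percentile clipping, quintile bins, a 1% sample) are illustrative assumptions rather than prescriptions.

```python
import numpy as np
import pandas as pd

# Toy skewed data standing in for a large production table (assumption).
rng = np.random.default_rng(42)
df = pd.DataFrame({"income": rng.lognormal(10, 1, 10_000)})
df.loc[df.sample(frac=0.05, random_state=42).index, "income"] = np.nan

# Missing values: impute with the median (one common, simple policy).
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: clip to the 1st..99th percentile range rather than dropping rows.
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# Normalizing: rescale to zero mean and unit variance.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Binning: discretize into quintiles for coarse-grained analysis.
df["income_bin"] = pd.qcut(df["income"], q=5, labels=False)

# Sampling: work with a 1% random sample instead of the full table.
sample = df.sample(frac=0.01, random_state=42)
print(sample.describe())
```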

For example, in our next-generation predictive analytics benchmark research, 83 percent of organizations reported using sampling in preparing their analyses. Data scientists also frequently use sandboxes, copies of the data that can be manipulated without impacting operational processes or production data sources. Managing sandboxes adds yet another challenge to the data preparation process.
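
A minimal sketch of the sandbox idea, using Python's built-in sqlite3 module as a stand-in for a production database; the table and figures are invented for illustration, and a real shop would snapshot into a separate schema, cluster or file instead.

```python
import sqlite3

# Stand-in for the production database (assumption for this sketch).
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
prod.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 12.0)])
prod.commit()

sandbox = sqlite3.connect(":memory:")  # an isolated copy, safe to mutate
prod.backup(sandbox)                   # snapshot production into the sandbox

# Analysts can transform freely here without touching the production source.
sandbox.execute("UPDATE sales SET amount = amount * 1.1")
print(sandbox.execute("SELECT amount FROM sales").fetchall())  # changed copy
print(prod.execute("SELECT amount FROM sales").fetchall())     # unchanged
```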

Data governance is always a challenge, and in this new world it has, if anything, grown even more difficult as the volume and variety of data increase. At the moment, most big data technologies trail their relational database counterparts in providing data governance capabilities.