Improving the Quality of Data on Hadoop

As the value and volume of data explode, so does the need for mature data management. Big data now receives the same treatment as relational data (integration, transformation, process orchestration, and error recovery), so the quality of big data is becoming critical.

Because of the promise and capacity of Hadoop, data quality was initially overlooked. However, not all Hadoop use cases are for analytics; some are driving critical business processes. Data quality is now a key consideration for process improvement and decision making based on data coming out of Hadoop.

With the size of our data stores in Hadoop, we must consider whether data quality practices can scale to the potential immensity of big data. Hadoop shatters the limits of data storage, not only in terms of data volume and variety but also in terms of structure. One way that data quality is maintained in a conventional data warehouse is by imposing strict limits on the volume, variety, and structure of data. This is in direct opposition to the advantages that Hadoop and NoSQL offer.

We must also consider the cost of poor data quality within a Hadoop cluster. From an analytics perspective, "bad data" may not be as troublesome as it once was, if we consider the statistical insignificance of incorrect, incomplete, or inaccurate records. The effect of a statistical outlier or anomaly is reduced by the massive amounts of data around it; the sheer volume effectively drowns it out.
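
The "drowning out" effect can be illustrated with a little arithmetic. The sketch below is only an illustration under simplifying assumptions, not a measurement from any real cluster: every clean record is assumed to hold the same value, exactly one record is wrong, and the statistic of interest is a simple mean. The function name and the sample values are hypothetical.

```python
def mean_shift_from_one_bad_record(n_records, true_value, bad_value):
    """Return how far one bad record moves the mean of n_records values.

    Assumption for illustration: all other records equal true_value.
    """
    clean_mean = true_value
    dirty_mean = (true_value * (n_records - 1) + bad_value) / n_records
    return dirty_mean - clean_mean


# One wildly wrong record (5000.0 instead of 50.0) in a small data set
small = mean_shift_from_one_bad_record(100, 50.0, 5000.0)

# The same bad record in a data set of one million records
large = mean_shift_from_one_bad_record(1_000_000, 50.0, 5000.0)

print(f"shift in mean with 100 records:       {small:.5f}")
print(f"shift in mean with 1,000,000 records: {large:.5f}")
```

Under these assumptions the shift equals (bad_value - true_value) / n_records, so it shrinks in proportion to the number of records. This holds for aggregate statistics only; a process that acts on individual records still sees the bad record in full.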

In conventional data analysis and data warehousing practice, "bad data" was something to be detected, cleansed, reconciled, and purged.

 


