Must-Know: What are common data quality issues for Big Data and how to handle them?

The most common data quality issues observed when dealing with Big Data are best understood in terms of its key characteristics – Volume, Velocity, Variety, Veracity, and Value.

In a traditional data warehouse environment, comprehensive data quality assessment and reporting was at least feasible, if not ideal. In Big Data projects, the sheer scale of the data makes that impossible. Data quality measurements can therefore at best be approximations: they need to be described with probabilities and confidence intervals, not as absolute values. Most data quality metrics also need to be redefined around the specific characteristics of the Big Data project, so that each metric has a clear meaning, can be measured (or well approximated), and can be used to evaluate alternative strategies for data quality improvement.
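As a minimal sketch of what that looks like in practice (assuming records are held as a list of Python dicts), a completeness metric such as a field's null rate can be reported as a sampled estimate with a confidence interval rather than an absolute value:

```python
import math
import random

def estimate_null_rate(records, field, sample_size=10_000, z=1.96):
    """Estimate the fraction of records missing `field` from a random
    sample, with a ~95% confidence interval (normal approximation)."""
    sample = random.sample(records, min(sample_size, len(records)))
    nulls = sum(1 for r in sample if r.get(field) is None)
    p = nulls / len(sample)
    margin = z * math.sqrt(p * (1 - p) / len(sample))
    return p, max(0.0, p - margin), min(1.0, p + margin)
```

An output of (0.031, 0.028, 0.034) would read: roughly 3.1% of records are missing the field, and with 95% confidence the true rate lies between 2.8% and 3.4%.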

Despite the great volume of underlying data, it is not uncommon to find that some desired data was never captured or is unavailable for other reasons (high cost, delays in obtaining it, etc.). It is ironic but true that data availability remains a prominent data quality concern in the Big Data era.

The tremendous pace of data generation and collection makes it incredibly hard to monitor data quality within a reasonable overhead of time and resources (storage, compute, human effort, etc.). By the time a data quality assessment completes, its output may already be outdated and of little use, particularly if the Big Data project serves real-time or near-real-time business needs. In such scenarios, you need to redefine data quality metrics so that they are both relevant and feasible in a real-time context.
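One way to redefine a metric for real-time relevance (a sketch; the smoothing constant `alpha` is an assumption you would tune) is to weight recent records more heavily, as in this exponentially weighted error rate:

```python
class DecayedErrorRate:
    """Exponentially weighted error rate: recent records dominate,
    so the metric reflects the data flowing now, not the full history."""

    def __init__(self, alpha=0.001):
        self.alpha = alpha  # weight of the newest observation (assumed; tune it)
        self.rate = 0.0

    def update(self, is_bad):
        """Fold one record's pass/fail result into the running rate."""
        self.rate += self.alpha * ((1.0 if is_bad else 0.0) - self.rate)
        return self.rate
```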

Sampling can help you gain speed in data quality efforts, but this comes at the cost of bias (which ultimately makes the end result less useful), because a sample is rarely a perfectly accurate representation of the entire data. Smaller samples give more speed but larger bias; one way to keep the bias in check on streaming data is shown below.
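Reservoir sampling is one standard way to maintain a uniform random sample from a stream in bounded memory, avoiding the worse bias of convenience samples such as "the first k records". A sketch of Algorithm R:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream of
    unknown length, using O(k) memory (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

A larger k narrows the confidence intervals of any metric computed on the sample but costs more memory and time, so the speed/bias trade-off becomes explicit in a single parameter.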

Another impact of velocity is that you might have to run data quality assessments on the fly, i.e. plugged directly into the data collection, transfer, and storage processes; the tight time constraints do not give you the luxury of copying a selected data subset, storing it elsewhere, and running quality assessments on it.
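A minimal sketch of that approach: wrap the ingestion stream in a generator so quality counters update as each record passes through, with no second copy of the data. The check names and the `read_source`/`write_to_store` functions here are hypothetical placeholders:

```python
def with_quality_checks(stream, checks, stats):
    """Run quality predicates on records as they flow through the
    pipeline; `stats` accumulates a failure count per rule."""
    for record in stream:
        stats["total"] = stats.get("total", 0) + 1
        for name, passes in checks.items():
            if not passes(record):
                stats[name] = stats.get(name, 0) + 1
        yield record  # the record continues downstream unchanged

checks = {
    "missing_id": lambda r: r.get("id") is not None,
    "negative_amount": lambda r: r.get("amount", 0) >= 0,
}
stats = {}
# for record in with_quality_checks(read_source(), checks, stats):
#     write_to_store(record)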

One of the biggest data quality issues in Big Data is that the data spans several data types (structured, semi-structured, and unstructured) coming in from different sources. A single data quality metric will therefore often not apply to the entire data set, and you need to define quality metrics separately for each data type (see the sketch below). Moreover, assessing and improving the quality of unstructured or semi-structured data is far trickier and more complex than for structured data. For example, when mining physician notes from medical records across the world (related to a particular medical condition), even if the language and grammar are the same, the meaning can differ widely due to local dialects and slang. This leads to low data interpretability, another data quality measure.
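A sketch of per-type metrics follows; the required keys and the reference vocabulary are placeholders, not real standards. The point is that the same "quality" question maps to different computations for each data type:

```python
def quality_metrics(payload, kind):
    """Return a quality metric suited to the data type; one metric
    rarely fits structured tables and free text alike."""
    if kind == "structured":
        # Completeness: fraction of non-null cells in a list of row dicts.
        cells = [v for row in payload for v in row.values()]
        return {"completeness": sum(v is not None for v in cells) / len(cells)}
    if kind == "semi_structured":
        # Schema conformance: fraction of records with all required keys.
        required = {"id", "timestamp"}  # placeholder schema
        ok = sum(required <= r.keys() for r in payload)
        return {"schema_conformance": ok / len(payload)}
    # Unstructured text: a crude interpretability proxy, the share of
    # tokens found in a domain vocabulary.
    vocab = {"patient", "dosage", "fever"}  # placeholder vocabulary
    tokens = payload.lower().split()
    return {"vocab_coverage": sum(t in vocab for t in tokens) / max(len(tokens), 1)}
```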

Data from different sources often carries serious semantic differences. For example, “profit” can have widely varying definitions across the business units of an organization or across external agencies, so fields with identical names may not mean the same thing. The problem is made worse by a lack of adequate, consistent metadata from each data source. To make sense of data, you need reliable metadata: to interpret sales numbers from a store, for instance, you need the date and time, the items purchased, the coupons used, and so on. Many of these data sources sit outside the organization, which makes it very hard to ensure good metadata for them.
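One common mitigation is a thin mapping layer that renames source fields to canonical, unambiguous definitions before data is merged. A sketch, where the source names and canonical terms are purely illustrative:

```python
# Source-to-canonical field mapping; every entry here is illustrative.
FIELD_MAP = {
    ("erp", "profit"):   "operating_profit",
    ("sales", "profit"): "gross_margin",
}

def canonicalize(source, record):
    """Rename fields to canonical names so identically named fields
    with different meanings are never merged by accident."""
    return {
        FIELD_MAP.get((source, field), f"{source}.{field}"): value
        for field, value in record.items()
    }
```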

Another common issue is syntactic inconsistency: the same value is encoded differently across sources, such as dates written as 2017-06-05, 05/06/2017, or 5 Jun 2017, or amounts recorded in different units and currencies, and the variants must be normalized to a single canonical form before the data can be compared or merged.
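A sketch of handling one such inconsistency, date layouts, by trying a fixed list of known formats and emitting a single canonical form. The format list is an assumption about the sources; note that layouts like day/month versus month/day are inherently ambiguous, and the list order encodes your chosen convention:

```python
from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %b %Y"]  # assumed source layouts

def normalize_date(value):
    """Parse a date written in any known layout and return ISO 8601;
    unparseable values are surfaced rather than silently dropped."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date syntax: {value!r}")
```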