
5 ways real-time will kill data quality

Your business happens in real time, and your data needs to follow suit: if you want real-time insights, you need to access information while it still matters.

But this move to real time is treacherous. If you are not careful, you will destroy the quality of your data in less time than it takes to start Windows. Here is why.

Running data quality controls such as deduplication or lookups can be time-consuming and resource-intensive. It is tempting to skip such “cumbersome” controls in order to accelerate loading the data into the target (data mart, operational BI application, etc.).

Skipping them will inevitably result not only in incomplete records with missing elements, but also in corrupted information. Incomplete records may be repairable later (at a higher cost), but duplicated records, which will by then have been modified or referenced separately, might be beyond repair.

How to avoid it: Don't compromise data integrity for speed. Ask yourself whether the few seconds gained in loading time are worth the damage to your data (they're not).
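For illustration only, here is a minimal sketch in Python (pandas), using hypothetical column names such as order_id and customer_id, of keeping deduplication and a reference lookup in the load path rather than skipping them:

import pandas as pd

def load_with_quality_controls(incoming: pd.DataFrame,
                               customer_ref: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate and validate incoming records before they reach the target."""
    # Deduplicate on the business key instead of loading raw rows as-is.
    deduped = incoming.drop_duplicates(subset=["order_id"], keep="last")

    # Look up against reference data; keep unmatched rows visible
    # instead of silently loading records with broken references.
    checked = deduped.merge(customer_ref[["customer_id"]].drop_duplicates(),
                            on="customer_id", how="left", indicator=True)
    checked["quality_ok"] = checked["_merge"] == "both"
    return checked.drop(columns="_merge")

The point is that both controls stay in the pipeline; records that fail the lookup are flagged, not dropped and not waved through unchecked.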

Waiting for someone to review doubtful records, manually inspect duplicates, and select surviving records is probably the most time-consuming aspect of data quality. Yet there is a reason why data stewardship processes were put in place and data stewards appointed.

How to avoid it: Same as above -- don't compromise data integrity for speed. Remember how much it cost last time you had to do a full quality assessment and repair of your database.
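One way to keep stewards in the loop without stalling the whole load, sketched here with made-up match-score thresholds, is to quarantine only the ambiguous records for human review and let clear-cut ones through:

from dataclasses import dataclass, field

@dataclass
class StewardshipQueue:
    """Holds records that need a human decision instead of auto-merging them."""
    pending: list = field(default_factory=list)

    def submit(self, record: dict, reason: str) -> None:
        self.pending.append({"record": record, "reason": reason})

def route_record(record: dict, match_score: float, queue: StewardshipQueue) -> bool:
    """Load clear records automatically; park ambiguous ones for a steward.

    match_score is a hypothetical similarity score from your matching step.
    """
    if match_score >= 0.95:          # confident duplicate: merge automatically
        return True
    if match_score >= 0.70:          # ambiguous: let a data steward decide
        queue.submit(record, f"possible duplicate (score={match_score:.2f})")
        return False
    return True                      # clearly new record: load it

The thresholds are assumptions; the design choice is that speed is gained by narrowing the manual queue, not by abolishing it.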

Collecting transactional records too soon after a transaction happens means you will pick up unfinished transactions. For example, you may load an order from the CRM, but because it takes several minutes for that order to be propagated to the ERP and processed there, you won't get the matching manifest -- and you create a discrepancy in your reports.

How to avoid it: If you need these frequent refreshes, acknowledge the fact that data integrity will sometimes be broken. Build your reporting and analytics to account for these discrepancies.
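As a sketch of what "accounting for these discrepancies" can look like, assuming a hypothetical 15-minute propagation delay and made-up field names, recent unmatched orders can be reported as "in flight" rather than as errors:

from datetime import datetime, timedelta, timezone

SETTLEMENT_WINDOW = timedelta(minutes=15)   # assumed CRM-to-ERP propagation delay

def classify_order(order: dict, manifests_by_order: dict) -> str:
    """Label an order for reporting instead of counting it as a discrepancy.

    order uses hypothetical keys: order_id, created_at (timezone-aware datetime).
    """
    if order["order_id"] in manifests_by_order:
        return "matched"
    age = datetime.now(timezone.utc) - order["created_at"]
    # Too recent to have reached the ERP yet: not a data quality problem.
    return "in_flight" if age < SETTLEMENT_WINDOW else "discrepancy"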

In a typical real-time scenario, not all sources will be refreshed at the same frequency. This can be for technical reasons, such as data volumes or available bandwidth, or for practical reasons -- for example, customer shipping addresses change less often than package tracking statuses. But these differences in velocity create inconsistencies in your target systems, and they are harder to spot than a missing data point, as in the previous case.

How to avoid it: Treat real-time reports as questionable. If you spot outliers or odd results, keep in mind that differences in data velocity may be playing tricks on you.
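A simple guard, sketched below with hypothetical source names and an assumed 30-minute tolerance, is to record each source's last refresh time and warn whenever a report joins sources whose freshness differs too much:

from datetime import datetime, timedelta

# Hypothetical last-refresh timestamps for two sources feeding one report.
last_refresh = {
    "crm_addresses": datetime(2017, 6, 1, 8, 0),
    "tracking_events": datetime(2017, 6, 1, 8, 59),
}

def velocity_warning(sources: list, tolerance: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when the joined sources were refreshed too far apart."""
    timestamps = [last_refresh[s] for s in sources]
    return max(timestamps) - min(timestamps) > tolerance

if velocity_warning(["crm_addresses", "tracking_events"]):
    print("Warning: report mixes sources with very different data velocities")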

A theory at work in the world of data lakes is to throw every record you can find into the data lake and worry about it later. "Later" then means implementing data quality workflows inside the data lake, cleansing "dirty" records by copying them (after enrichment and deduplication) into a "clean" state.

The concern here is that more and more users are gaining access to the data lake, which typically suffers badly from a lack of metadata and documentation. It is therefore very difficult for an uninitiated user to tell whether a given record is dirty or clean.

How to avoid it: If you absolutely need to create a data swamp full of dirty data, keep it to yourself. Don't throw your dirty data into the data lake. Only share with your unsuspecting colleagues data that is in a reasonable state of cleanliness.
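If dirty data must land in the lake anyway, one mitigation, sketched here with hypothetical zone names and metadata fields, is to tag every record with its quality state so consumers can filter rather than guess:

import json
from datetime import datetime, timezone

def write_to_lake(record: dict, zone: str) -> str:
    """Wrap a record with minimal quality metadata before storing it.

    zone is a hypothetical label such as "raw" (dirty) or "curated" (clean).
    """
    envelope = {
        "zone": zone,
        "quality_checked": zone == "curated",
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": record,
    }
    return json.dumps(envelope)

# Consumers can then filter on the metadata instead of guessing.
stored = write_to_lake({"customer_id": 42, "email": "a@example.com"}, zone="raw")
print(json.loads(stored)["quality_checked"])   # False: not fit for reporting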
