Your business is happening in real time. Your data needs to follow suit. You need to access information while it matters, if you want to gain real-time insights.
But this move to real time is treacherous. If you are not careful, you will destroy the quality of your data in less time than it takes to start Windows. Here is why.
Running data quality controls such as data deduplication or lookups can be time-consuming and resource-intensive. It is tempting to skip such “cumbersome” controls in order to accelerate the loading of the data into the target (data mart, operational BI application, etc.).
This will inevitably result not only in incomplete records, with missing elements, but also in corrupted information. Incomplete records can often be repaired later, at a higher cost. Duplicated records, which will by then have been modified or referenced separately, might be beyond repair.
How to avoid it: Don't compromise data integrity for speed. Ask yourself if the few seconds gained in loading time are worth the damage to your data (they are not).
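To illustrate what keeping the deduplication control in the loading path can look like, here is a minimal sketch. It assumes records are dictionaries with an `email` field that serves as the match key; the field name, the normalization rule, and the `load_batch` helper are all hypothetical, and a real matching rule would usually be more elaborate.

```python
def normalize_key(record):
    """Build a match key from the (assumed) 'email' field: trimmed, lowercased."""
    return record["email"].strip().lower()


def load_batch(batch, target, seen_keys):
    """Append only records whose key has not been loaded yet.

    Returns the records that were held back as suspected duplicates,
    so they can be reviewed instead of being silently dropped.
    """
    skipped = []
    for record in batch:
        key = normalize_key(record)
        if key in seen_keys:
            skipped.append(record)
            continue
        seen_keys.add(key)
        target.append(record)
    return skipped


# Hypothetical micro-batch arriving from a source system.
target, seen = [], set()
batch = [
    {"email": "ann@example.com", "name": "Ann"},
    {"email": " Ann@Example.com ", "name": "Ann B."},
    {"email": "bob@example.com", "name": "Bob"},
]
skipped = load_batch(batch, target, seen)
```

The check is a set lookup per record, so in this sketch the control adds very little to loading time compared with the cost of untangling duplicates afterwards.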
Waiting for someone to review doubtful records, to manually compare duplicates and select the surviving records, is probably the most time-consuming aspect of data quality. Yet there is a reason why data stewardship processes were deployed and data stewards appointed.
How to avoid it: Same as above -- don't compromise data integrity for speed. Remember how much it cost last time you had to do a full quality assessment and repair of your database.
Collecting transactional records too soon after a transaction happens means collecting unfinished transactions. For example, you may load an order from the CRM, but because it takes several minutes for this order to be propagated to the ERP and processed there, you won't get the matching manifest, and your reports will show a discrepancy.
How to avoid it: If you need these frequent refreshes, acknowledge the fact that data integrity will sometimes be broken. Build your reporting and analytics to account for these discrepancies.
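One way a report can account for such discrepancies is to separate orders that are merely still in flight from orders that are genuinely missing their manifest. The sketch below assumes a ten-minute settle window and hypothetical `order_id` and `created_at` fields; both the window and the field names are illustrative, not taken from any particular CRM or ERP.

```python
from datetime import datetime, timedelta

# Assumed time an order needs to reach the ERP; tune to your own systems.
SETTLE_WINDOW = timedelta(minutes=10)


def classify_orders(orders, manifest_order_ids, now):
    """Split orders into complete, pending (still in flight) and discrepancy."""
    result = {"complete": [], "pending": [], "discrepancy": []}
    for order in orders:
        if order["order_id"] in manifest_order_ids:
            result["complete"].append(order["order_id"])
        elif now - order["created_at"] <= SETTLE_WINDOW:
            # Too recent to judge: the manifest may simply not exist yet.
            result["pending"].append(order["order_id"])
        else:
            result["discrepancy"].append(order["order_id"])
    return result


now = datetime(2017, 1, 1, 12, 0)
orders = [
    {"order_id": "A", "created_at": datetime(2017, 1, 1, 11, 0)},
    {"order_id": "B", "created_at": datetime(2017, 1, 1, 11, 57)},
    {"order_id": "C", "created_at": datetime(2017, 1, 1, 11, 0)},
]
status = classify_orders(orders, manifest_order_ids={"A"}, now=now)
```

Here order B is reported as pending rather than as an error, while order C, which has had an hour to settle, is flagged as a real discrepancy.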
In a typical real-time scenario, not all sources will be refreshed at the same frequency. This can be for technical reasons, such as data volumes or available bandwidth, or for practical reasons: for example, customer shipping addresses change less often than package tracking statuses. But these differences in velocity create inconsistencies in your target systems, and those are harder to spot than a data point that is simply missing, as in the previous case.
How to avoid it: Treat real-time reports as questionable. If you spot outliers or odd results, keep in mind that differences in data velocity may be playing tricks on you.
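A report is easier to question when it states how old each of its sources is. The following sketch assumes you record a last-refresh timestamp per source; the source names, the `stale_sources` helper, and the one-hour tolerance are hypothetical.

```python
from datetime import datetime, timedelta


def stale_sources(last_refresh, now, max_lag):
    """Return {source: lag} for every source refreshed longer ago than max_lag."""
    return {
        source: now - refreshed_at
        for source, refreshed_at in last_refresh.items()
        if now - refreshed_at > max_lag
    }


now = datetime(2017, 3, 1, 12, 0)
last_refresh = {
    "package_tracking": datetime(2017, 3, 1, 11, 59),
    "shipping_addresses": datetime(2017, 2, 28, 2, 0),
}
stale = stale_sources(last_refresh, now, max_lag=timedelta(hours=1))
```

Printing the resulting lags next to a real-time report gives readers a concrete reason to doubt an odd result before they act on it.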
A theory at work in the world of data lakes is to throw every record you can find into the data lake and worry about it later. Then, "later", data quality workflows are implemented inside the data lake, cleansing "dirty" records by copying them (after enrichment and deduplication) into a "clean" state.
The concern here is that more and more users are gaining access to the data lake, which suffers badly from a lack of metadata and documentation. Hence it is very difficult for an uninitiated user to recognize the state of a record (dirty or clean).
How to avoid it: If you absolutely need to create a data swamp full of dirty data, keep it to yourself. Don't throw your dirty data into the data lake. Share with your unsuspecting colleagues only data that is in a reasonable state of cleanliness.
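As a minimal sketch of that rule, suppose every record in the lake carries an explicit `quality_state` tag and only records tagged clean are exposed to other users. The tag name and its values are assumptions made for illustration; the point is that an untagged record is treated as dirty by default.

```python
def shareable(records):
    """Return only records explicitly tagged clean; untagged ones count as dirty."""
    return [r for r in records if r.get("quality_state") == "clean"]


lake = [
    {"id": 1, "quality_state": "clean"},
    {"id": 2, "quality_state": "dirty"},
    {"id": 3},  # no tag at all: never shared
]
shared = shareable(lake)
```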