In the age of transformation, all successful companies collect data, but one of the most expensive and difficult problems to solve is the quality of that information. Data analysis is useless if we don’t have reliable information, because the answers we derive from it could deviate greatly from reality. Consequently, we could make bad decisions.
Most organizations believe the data they work with is reasonably good, but they recognize that poor-quality data poses a substantial risk to their bottom line. (The State of Enterprise Quality Data 2016 – 451 Research)
Meanwhile, the idiosyncrasies of Big Data are only making the data quality problem more acute. Information is being generated at increasingly faster rates, while larger data volumes are innately harder to manage.
There are four main drivers of dirty data:
Correcting a data quality problem is not easy. For one thing, it is complicated and expensive; benefits aren’t apparent in the short term, so it can be hard to justify to management. And as I mentioned above, the data gathering and interpretation process has many vulnerable places where error can creep in. Furthermore, both the business processes from which you’re gathering data and the technology you’re using are liable to change at short notice, so quality correction processes need to be flexible.
Therefore, an organization that wants reliable data quality needs to build in multiple quality checkpoints: during collection, delivery, storage, integration, recovery, and during analysis or data mining.
Monitoring so many potential checkpoints, each requiring a different approach, calls for a thorough quality assurance plan.
A classic starting point is analyzing data quality when it first enters the system – often via manual input, or where the organization may not have standardized data input systems. The risk analyzed is that data entry can be erroneous, duplicated, or overly abbreviated (e.g. “NY” instead of “New York City.