Better data can lead to better outcomes. But what do we mean by ‘data quality’? ODI Associate Leigh Dodds, who is working with ODI Partner Experian to explore data quality in UK open datasets, explains.
One way to help improve data quality is to generate quality metrics for a dataset. CC BY 2.0, uploaded by Elizabeth Hahn.
First impressions are everything. The efforts made to publish a dataset will guide a user’s experience in finding, accessing and using it. No matter how good the contents of your dataset, if it is not clearly documented, well-structured and easily accessible, then it won’t get used.
Open data certificates are a mark of quality and trust for open data. They measure the legal, technical, practical and social aspects of publishing data. Creating and publishing a certificate will help a publisher build confidence in their data. Open data certificates complement the five-star scheme, which assesses how well data is integrated with the web.
2. A dataset can contain a variety of problems
Data quality also relates to the contents of a dataset. Data errors are usually introduced when the data is first collected, but the problems may only become apparent once a user begins working with the data.
There are a number of different types of data quality problem. The following list isn’t exhaustive but includes some of the most common:
- The dataset isn’t valid when compared to its schema: for example, columns are missing or in the wrong order
- The dataset contains invalid or incorrect values: for example, numbers outside their expected range, text where there should be numbers, spelling mistakes or invalid phone numbers
- The dataset has missing data from some fields, or doesn’t include all of the available data: some addresses in a dataset might be missing their postcode, for example
- The data may have precision problems, which may be due to limits in the accuracy of the sensors or other devices (such as GPS devices) used to record the data, or simply to rounding errors introduced during analysis
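Several of the problems listed above can be detected automatically. The following is a minimal sketch in Python, not a definitive validator. The three-column schema (`name`, `age`, `postcode`), the 0 to 120 age range, the `check_dataset` function and the loose postcode pattern are all assumptions made up for illustration. The sketch covers schema, value and missing-data checks; precision problems usually need knowledge of how the data was collected and are not covered.

```python
import csv
import io
import re

# Hypothetical schema for illustration: the expected columns, in order.
EXPECTED_COLUMNS = ["name", "age", "postcode"]

# Deliberately loose pattern (an assumption, not a full UK postcode check).
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")


def check_dataset(text):
    """Return a list of (row_number, problem) tuples for a CSV string."""
    problems = []
    reader = csv.reader(io.StringIO(text))

    # Schema check: columns must be present and in the expected order.
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        problems.append((1, "header does not match schema"))
        return problems  # rows cannot be interpreted without a valid header

    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append((row_number, "wrong number of columns"))
            continue
        name, age, postcode = (value.strip() for value in row)

        # Value checks: text where a number is expected, or out of range.
        try:
            age_value = int(age)
        except ValueError:
            problems.append((row_number, "age is not a number"))
        else:
            if not 0 <= age_value <= 120:
                problems.append((row_number, "age out of expected range"))

        # Missing-data and format checks for the postcode.
        if not postcode:
            problems.append((row_number, "postcode is missing"))
        elif not POSTCODE_RE.match(postcode.upper()):
            problems.append((row_number, "postcode is invalid"))
    return problems


# Made-up sample data containing a few of the problems described above.
SAMPLE = """name,age,postcode
Ann,34,EC2A 4JE
Bob,abc,EC2A 4JE
Cat,150,
"""

if __name__ == "__main__":
    for row_number, problem in check_dataset(SAMPLE):
        print(f"row {row_number}: {problem}")
```

Counting the problems reported per row or per column is one simple way to turn checks like these into quality metrics for a dataset.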