The models of Provenance, Lineage, and Chain of Custody are used in fine art to determine when a piece was created, the sequence of locations where it was held, how it was touched along the way, and who has owned it since creation, all with the purpose of authenticating the piece. What does this have to do with boring data?
It turns out that many decisions that affect our daily lives are made using a single final result – or score – derived from many other pieces of data. What if one of those pieces of data is wrong or stale? The result is “Bad Data”, and the consequences can range from the inconvenient to the catastrophic. We must understand the data components used to calculate a final number to ensure the result is valid and current. This is why we need to adopt the models of Data Provenance, Data Lineage, and Data Chain of Custody, and make them an intrinsic part of any data-driven decision.
Let me start with a few examples:
Estimates of the cost of “Bad Data” range from the TDWI (The Data Warehousing Institute) figure of $611 billion each year for U.S. firms to IBM’s figure of $3.1 trillion per year. Either number is staggering, and that is before counting the individual lives affected.
The causes of Bad Data typically fall into these categories:
The right solution needs to address all of these issues under the umbrella of Data Governance, and it must provide a full audit trail that records and verifies every event that could change any piece of data going into a meaningful calculation. It must enable enterprises to track and monitor their data properly via Data Provenance, Data Lineage, and Data Chain of Custody.
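To make the idea of an audit trail that can both record and verify events a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `AuditTrail` class, its field names, and the sample identifiers are hypothetical, and hash chaining is just one common way to make a log tamper-evident, not a method prescribed by this article.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


def _digest(body: Dict[str, str]) -> str:
    """Return a SHA-256 hex digest of an entry body (keys sorted for stability)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class AuditTrail:
    """Hypothetical append-only log; each entry stores the hash of the previous one."""
    entries: List[Dict[str, str]] = field(default_factory=list)

    def record(self, data_id: str, actor: str, action: str, timestamp: str) -> Dict[str, str]:
        """Append one event that touched a piece of data."""
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        body = {
            "data_id": data_id,
            "actor": actor,
            "action": action,
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=_digest(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev_hash = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


# Usage with made-up identifiers: record two events, then check the chain.
trail = AuditTrail()
trail.record("score-input:income", "etl-job", "created", "2016-01-01T00:00:00Z")
trail.record("score-input:income", "analyst", "updated", "2016-02-01T00:00:00Z")
print(trail.verify())
```

The design choice here is that every entry depends on the one before it, so changing a past event after the fact makes `verify()` fail rather than silently altering history.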
Data Provenance refers to the “origin” and “source” of data – where a piece of data came from and the process by which it came to be in its present state.
Data Lineage is the process of tracing and recording the origins of data and its movement between databases or systems; it tracks the data life cycle from its origin to its destination over time, and what happens as it goes through diverse processes.
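The two definitions above can be sketched as a small data structure: provenance is the recorded origin of an item, and lineage is the ordered list of movements it has made since. This is a hedged illustration only; the `DataItem` and `LineageStep` classes, their fields, and the system names are all hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LineageStep:
    """One movement of a data item between systems (hypothetical schema)."""
    source_system: str
    target_system: str
    process: str      # what happened to the data during this hop
    timestamp: str    # ISO 8601 string, kept as text for simplicity


@dataclass
class DataItem:
    name: str
    origin: str  # provenance: the system where the data first came from
    steps: List[LineageStep] = field(default_factory=list)  # lineage over time

    def current_location(self) -> str:
        """The last system the item moved to, or its origin if it never moved."""
        return self.steps[-1].target_system if self.steps else self.origin

    def move(self, target_system: str, process: str, timestamp: str) -> None:
        """Record a hop from the current location to target_system."""
        step = LineageStep(self.current_location(), target_system, process, timestamp)
        self.steps.append(step)

    def path(self) -> List[str]:
        """Full route from origin to destination, in order."""
        return [self.origin] + [s.target_system for s in self.steps]


# Usage with made-up system names.
item = DataItem(name="payment_history", origin="billing_db")
item.move("staging", "nightly extract", "2016-01-01T01:00:00Z")
item.move("warehouse", "dedupe and load", "2016-01-01T02:00:00Z")
print(item.path())
```

Keeping the origin separate from the list of steps mirrors the distinction in the text: provenance answers where the data came from, while lineage answers where it has been and what was done to it along the way.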