Data Readiness Levels: Turning Data from Pallid to Vivid
- by 7wData
Application of models to data is fraught. You are faced with collaborators who sometimes have a very basic understanding of the complications of collating, processing and curating data. Challenges include: poor data collection practices, missing values, inconvenient storage mechanisms, intellectual property, security and privacy. All these aspects obstruct the sharing and interconnection of data.
All these problems arise before modeling even starts. Both questions and data are badly characterised. This is particularly true in the era of Big Data, where one gains the impression that the depth of data-discussion in many decision making forums is of the form “We have a Big Data problem, do you have a Big Data solution?”, “Yes, I have a Big Data solution.” Of course, in practice it also turns out to be a solution that requires Big Money to pay for, because no one bothered to scope the nature of the problem, the data, or the solution.
Data scientists and statisticians are often treated like magicians who wave a model across a disparate and carelessly collated set of data and with a cry of ‘sortitouticus’ a magical conclusion is drawn. In practice the sea of data we are faced with is normally undrinkable. The challenge of data desalination is very resource hungry and many projects fail to achieve their potential as a result.
For any data analyst, when embarking on a project, a particular challenge is assessing the quality of the available data. This difficulty can be compounded when project partners do not themselves have a deep understanding of the process of data analysis. If partners are not data-savvy they may not understand just how much good practice needs to go into the curation of data to ensure that conclusions are robust and representative.
In one such meeting, while scoping a project with potential collaborators in the domain of health monitoring, it occurred to me that in most proposal documents, very scant attention is paid to these obstacles (other than ensuring a data-wizard is named on the project).
One difficulty is that the concept of “data”, for many people, is somehow abstract and disembodied. This seems to make it challenging for us to reason about. Psychologists use the term vivid information for information that is weighted more heavily in reasoning than non-vivid, or pallid, information. In this sense data needs to be rendered vivid to be properly accounted for in planning.
A parallel thought occurred to me: the idea of “technology” is similarly disembodied; it is pallid information. Perhaps to deal with this challenge, in large scale projects, when deploying technology, we are nowadays guided to consider its readiness level. The readiness of the technology is embodied in a set of numbers which describe its characteristics: is it lab tested only? Is it ready for commercialization? Is it merely conceptual? No doubt there are pros and cons of such readiness levels, but one of the pros is that the embodiment of the technological readiness pipeline ensures that some thought is given to that process. Technology is rendered more vivid even while it remains disembodied.
So it occurred to me that it would be very useful to have a scale to embody data readiness. This idea would allow analysts to encourage better consideration of data collection/production and consolidation, with a set of simple questions: “And what will the data readiness level be at that point?”, or “How will that have progressed the data readiness?”, or to make statements: “We’ll be unable to deliver on that integration unless the data readiness level is at least B3.”
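To make the idea concrete, here is a minimal sketch of how such a scale might be tracked in a project plan. The band/grade scheme used here (bands C, B, A, each graded 1–4, with C4 the least ready and A1 the most) is a hypothetical illustration for the sake of the example, not a defined standard:

```python
# Hypothetical data readiness scale: bands C (least ready) -> A (most ready),
# grades 4 (worst in band) -> 1 (best in band). This scheme is illustrative only.
from dataclasses import dataclass

BANDS = "CBA"  # ordered from least ready to most ready


@dataclass(frozen=True)
class DataReadinessLevel:
    band: str   # 'C', 'B', or 'A'
    grade: int  # 1 (best in band) .. 4 (worst in band)

    def __post_init__(self):
        if self.band not in BANDS or not 1 <= self.grade <= 4:
            raise ValueError(f"invalid level {self.band}{self.grade}")

    def __ge__(self, other: "DataReadinessLevel") -> bool:
        # A higher band always wins; within a band, a lower grade is more ready.
        return (BANDS.index(self.band), -self.grade) >= (
            BANDS.index(other.band), -other.grade)


def parse(level: str) -> DataReadinessLevel:
    """Parse a label like 'B3' into a DataReadinessLevel."""
    return DataReadinessLevel(level[0].upper(), int(level[1:]))


# "We'll be unable to deliver on that integration unless the data
# readiness level is at least B3."
required = parse("B3")
current = parse("C1")
print(current >= required)  # False: the data has not yet reached B3
```

The point of such a representation is not the particular labels but that readiness becomes something a project plan can name, query, and gate deliverables on.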
It turns out that, like all (potential) good ideas, I’m not the first there. However, this discussion document from the nanotechnology community in 2013 is not general enough to give us what we need: it is very domain specific, with an obsession with units that would often be inappropriate.