When an organization has mastered automated data ingest and the appropriate application of metadata, several additional concerns arise when using data at scale: data quality, data lineage, and searchability. All three are factors in presenting an effective and useful data catalog, which in turn is the foundation of the self-service capability for a business-facing data presentation and transformation layer.
Data quality rules are best applied in an automated fashion. Simple rules can be defined, like operators in a programming language, corresponding to atomic operations such as greater-than or less-than. These simple operators can be organized hierarchically into collections of operations that establish basic rules, such as social security number validation. Rules can in turn be logically grouped into “rule sets”; for example, a rule set might contain all the data quality operations needed for a specific data feed, such as business transactions that require application validation and pre-processing (loan processing or credit line increase requests, for instance). All the fields in that data feed can then be analyzed automatically, producing clean records for upstream analytics. This type of data quality pre-processing is exactly what we, as consumers of data, should demand of our data ingest process.
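The hierarchy described above, atomic operators composed into rules and rules grouped into rule sets, can be sketched in a few lines. This is a minimal illustration assuming a simple callable-based design; the names (`Rule` helpers, `loan_feed_rules`, and so on) are illustrative and do not reflect any specific product's API.

```python
import re

def not_null(value):
    """Atomic operator: value must be present."""
    return value is not None and value != ""

def matches(pattern):
    """Atomic operator factory: value must match a regular expression."""
    return lambda value: value is not None and re.fullmatch(pattern, value) is not None

def all_of(*checks):
    """Compose atomic operators hierarchically into a higher-level rule."""
    return lambda value: all(check(value) for check in checks)

# A composite rule built from atomic operators: basic SSN validation.
valid_ssn = all_of(not_null, matches(r"\d{3}-\d{2}-\d{4}"))

# A "rule set": all the quality rules needed for one data feed, keyed by field.
loan_feed_rules = {
    "ssn": valid_ssn,
    "amount": lambda v: v is not None and float(v) > 0,  # greater-than operator
}

def validate_record(record, rule_set):
    """Return the names of failing fields; an empty list means a clean record."""
    return [field for field, rule in rule_set.items() if not rule(record.get(field))]

failures = validate_record({"ssn": "123-45-6789", "amount": "2500"}, loan_feed_rules)
# An empty failures list lets the clean record flow on to upstream analytics.
```

In this design each operator is just a predicate, so new rules are built by composition rather than by writing new validation code for every feed.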
Knowing where data came from and how it was transformed, not only by data quality rules but by all of the transformations required by the business rules specific to the use case at hand, is invaluable. Being able to retrace the steps of data ingest and processing is critical for many data users facing regulatory requirements; data in both banking and pharmaceutical research, for example, must be trackable under the regulations of those industries. At a lower level, the data engineering level, data lineage is also a valuable debugging tool, both for business processes (is our business process correct?) and from a data science perspective (are our algorithms functioning correctly?). With Bedrock, all of these concerns can be addressed by properly maintaining data lineage for datasets over time.
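One way to make the steps retraceable is to record an ordered lineage entry for every transformation applied to a dataset. The sketch below assumes a simple append-only log per dataset; the class names, step categories, and the feed path are illustrative, not Bedrock's actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageStep:
    operation: str   # e.g. "ingest", "quality_check", "business_rule"
    detail: str      # human-readable description for auditors and debuggers
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class Dataset:
    name: str
    lineage: list = field(default_factory=list)

    def record(self, operation, detail):
        """Append one transformation step to this dataset's lineage."""
        self.lineage.append(LineageStep(operation, detail))

    def trace(self):
        """Retrace the steps, oldest first, for regulatory or debugging review."""
        return [f"{step.operation}: {step.detail}" for step in self.lineage]

# Hypothetical feed: each processing stage records what it did.
loans = Dataset("loan_applications")
loans.record("ingest", "loaded from daily bank feed")
loans.record("quality_check", "applied SSN and amount rule set")
loans.record("business_rule", "flagged credit-line increase requests")
print(loans.trace())
```

Because every stage appends to the same log, a regulator's question ("how was this record produced?") and an engineer's question ("which step corrupted this field?") are answered by the same trace.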
A functioning data catalog seems like a simple concept in this day and age, yet many organizations still struggle simply to find data. Having data is not the problem; everyone has data now. What many lack is an effective method for finding datasets in what can be a sea of ingested data. Part of the issue is simple organization.