You know all those product catalogs sitting on your coffee table (or in your recycle bin) and the websites you visit to buy gifts? They hold interesting lessons for how information can be consumed in the big data era.
Like an inventory, a catalog should list everything available for consumption (and nothing that isn’t), but that’s not enough. An Amazon product page, for example, includes pictures, specs, reviews, and recommendations. These bits of information, cumulatively, help the user decide what to buy.
Consuming data also requires rich context. Before embarking on a research project, an analyst needs to understand the shape of the data set, its source, whether it is up to date, who else has used it, and how it was used. To address those requirements, a catalog should provide data samples and statistical profiles, lineage, lists of users and stewards, and tips on how the data should be interpreted.
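As a rough illustration of that context, a single catalog entry might be modeled as below. This is a minimal sketch, not any particular product's schema: the `CatalogEntry` class, its field names, and the `profile` helper are all hypothetical, chosen only to mirror the items listed above (sample, profile, lineage, users and stewards, interpretation tips).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CatalogEntry:
    """Hypothetical catalog record describing one data set."""
    name: str
    source: str                                         # lineage: where the data came from
    last_updated: str                                   # freshness, as an ISO date string
    stewards: list[str] = field(default_factory=list)   # people responsible for the data
    users: list[str] = field(default_factory=list)      # people who have used it
    notes: str = ""                                     # tips on interpretation
    sample: list[dict] = field(default_factory=list)    # a few example rows

    def profile(self, column: str) -> dict:
        """Summarize a numeric column of the sample, skipping missing values."""
        values = [row[column] for row in self.sample if row.get(column) is not None]
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }


# Example with made-up names and values.
entry = CatalogEntry(
    name="orders",
    source="erp.orders",
    last_updated="2024-01-01",
    stewards=["dana"],
    sample=[{"amount": 10}, {"amount": 30}, {"amount": None}],
)
print(entry.profile("amount"))
```

A real catalog would compute the profile over the full data set, or a much larger sample, rather than over a handful of rows.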
Yesterday’s data challenge was all about collecting relevant data for analysis and producing relevant reports. These days, many organizations possess the data and computational resources to answer almost any analytical question, but finding the most relevant, trustworthy data sets and metrics can be like finding a limited-edition Darth Vader Pez dispenser for Uncle Jack.
A 21st-century data catalog should do the following:
Some catalogs may try to be a source of truth about the right table to consult for a given purpose, the right categorization of a given value, or the right way to calculate a given metric. If universally consulted and respected, such prescriptive catalogs could, in theory, help everyone within an organization align, reducing disparities and confusion. In practice, however, prescriptivism poses challenges for large enterprises, where different teams have legitimate reasons to disagree (for example, when Hawaii is grouped with the other states by the finance department, but lumped in with Puerto Rico and Guam by the logistics team responsible for shipping).
A better approach is to document what people are doing: Who is querying which tables, viewing which reports, or using a particular calculation for a given metric? A data asset or technique used just one time by an intern probably isn’t trustworthy.
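One way to picture this descriptive approach is to summarize a query log by asset. The sketch below assumes the log is available as simple `(user, asset)` pairs. The function names, the sample log, and the `min_users` threshold are hypothetical choices for illustration, not a recommended trust rule.

```python
from collections import defaultdict


def usage_summary(query_log):
    """Map each asset to its query count and number of distinct users."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for user, asset in query_log:
        counts[asset] += 1
        users[asset].add(user)
    return {
        asset: {"queries": counts[asset], "distinct_users": len(users[asset])}
        for asset in counts
    }


def rank_assets(query_log, min_users=2):
    """Return assets used by at least `min_users` people, most-queried first.

    The threshold is an assumed cutoff: an asset touched by a single person
    is filtered out instead of being presented as a trusted source.
    """
    summary = usage_summary(query_log)
    popular = [a for a, s in summary.items() if s["distinct_users"] >= min_users]
    return sorted(popular, key=lambda a: summary[a]["queries"], reverse=True)


# Made-up log: three queries against a shared table, one against a one-off copy.
log = [
    ("ana", "sales.orders"),
    ("ben", "sales.orders"),
    ("ana", "sales.orders"),
    ("intern", "tmp.orders_copy"),
]
print(rank_assets(log))
```

Here the one-off copy queried by a single user drops out of the ranking, while the widely used table remains. A production catalog would likely weigh recency and the role of each user as well.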