If I have one gripe about technology today, it’s the seemingly universal belief that it manifests social and business issues for the first time. Sure, Twitter is a relatively recent arrival, and only recently have people been fired for tweeting wildly inappropriate jokes. Still, if you think that losing your job for publicly saying something objectionable is new, think again.
And the same holds true with respect to data quality. Yes, contemporary cloud computing arrived fairly recently, although its roots in grid computing go back decades. Make no mistake, though: the notion that duplicate, erroneous, invalid and incomplete information harms a business is hardly new. It predates today’s rampant technology, big data and even the modern-day computer. Bum handwritten general-ledger entries caused problems centuries ago.
This raises the question: What are the data quality risks specific to cloud computing? In a nutshell, I see three.
In the traditional on-premises world, data is typically integrated and stored through a tightly controlled series of periodic ETL batch jobs. For the most part, that data is internal to the enterprise, although exceptions certainly exist, such as interfaces to banks, insurance carriers and the like. In an era of cloud computing, though, data flows much more frequently and quickly, often via APIs, and much of it is external to the enterprise. Examples include feeds from Twitter, LinkedIn, Google Maps, etc.
With much less control over the data, organizations could certainly see “their” data quality take a hit.
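One common mitigation is to validate external records at the point of ingestion rather than trusting the feed. The sketch below is purely illustrative (the field names and rules are hypothetical, not from any particular API): it partitions incoming records into clean rows and rejects, catching exactly the kinds of problems named above — duplicates, incomplete records and invalid values.

```python
# A minimal sketch of a quality gate for records arriving from an external
# API feed. Field names ("id", "email") and validation rules are assumed
# for illustration only.

REQUIRED_FIELDS = {"id", "email"}

def validate(records):
    """Partition incoming records into clean rows and rejects with reasons."""
    seen_ids = set()
    clean, rejects = [], []
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            rejects.append((rec, "missing fields: " + ", ".join(sorted(missing))))
        elif rec["id"] in seen_ids:
            rejects.append((rec, "duplicate id"))
        elif "@" not in str(rec["email"]):
            rejects.append((rec, "invalid email"))
        else:
            seen_ids.add(rec["id"])
            clean.append(rec)
    return clean, rejects

# A tiny simulated feed exercising each failure mode.
feed = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "b@example.com"},   # duplicate id
    {"id": 2, "email": "not-an-email"},    # invalid email
    {"id": 3},                             # missing email field
]
clean, rejects = validate(feed)
```

In a real pipeline the rejects would be logged or quarantined for review rather than silently dropped, so the organization at least knows where its data quality is eroding.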