There has always been an uneasy truce within large organizations between those who control access to data – the IT group, usually – and those who need that data to improve business performance. In a perfect world, the IT group would like to see a single source of truth manifested in master data management (MDM) and the enterprise data warehouse (EDW).
Let’s consider MDM. A paper by Wilbram Hazejager of DataFlux Corporation (acquired by SAS in 2000) notes that MDM’s origins go back to the early 2000s. Its proponents did – and still do – see MDM as the way to solve the problem of disparate, disjointed data spread across different lines of business.
Nevertheless, according to Gartner, the majority of MDM initiatives fail. There can be many reasons for this. But one reason is simple: To succeed, MDM demands strict adherence to data-governance policies by everyone in the enterprise, all the time. That’s not very realistic.
But the effort to implement MDM, even if only partially realized, reinforces IT’s role as the gatekeeper of enterprise data. Rapidly growing supplies of data make it all the more difficult to streamline the data supply chain that delivers raw material for analysis to business users. This gatekeeper role puts IT in the unenviable position of trying to deliver more data sets, faster, while the rest of the enterprise yearns for data democracy.
Along with MDM, the enterprise data warehouse also represents a legacy approach to handling critical business data. Large and expensive to maintain, the typical EDW fulfills a narrow, often application-specific purpose. Moreover, data architects must use extract, transform, and load (ETL) tools to add data to an EDW, which consumes substantial time and money. Simply adding a new column to an EDW schema could take months.
Escaping the confines of IT’s grip on enterprise data has pushed many a business unit into the netherworld of shadow IT, a term often used to describe information-technology systems and solutions built and used inside organizations without explicit organizational approval. These solutions often leverage the cloud. It doesn’t take much to deploy a Hadoop cluster in the cloud and start filling it with data, more or less on the sly.
This is not to say that most corporate deployments of Hadoop are “off the books.” They are not. In fact, getting off the ETL treadmill has been one of Hadoop’s main selling points for large enterprises. Hadoop stack vendors have focused most of their marketing dollars on the notion that organizations can move some of their EDW data into Hadoop. It’s far cheaper and more flexible in terms of hardware and storage.
These vendors talk about EL – extract and load – rather than ETL. Extract the data and load it into Hadoop; transform it when necessary for a particular use case. The popularity of Hadoop as a destination for structured as well as unstructured data has spawned several SQL-on-Hadoop solutions, including Hive (which originally compiled queries down to MapReduce jobs), Impala, Spark SQL, Presto, and Hive on Tez.
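The EL pattern amounts to “schema-on-read”: raw records are landed as-is, and a schema is applied only when a particular use case calls for one. A minimal sketch in plain Python – standing in for a Hadoop or Spark job, with hypothetical field names – illustrates the idea:

```python
import json

# Extract and load: raw events are landed verbatim, with no upfront schema.
# (In practice this would be files in HDFS; the records here are made up.)
raw_store = [
    '{"ts": "2016-03-01T12:00:00", "user": "alice", "bytes": 1024}',
    '{"ts": "2016-03-01T12:00:05", "user": "bob"}',  # missing field is fine
]

def transform_for_use_case(raw_records):
    """Apply a schema only at read time, for this one use case."""
    for line in raw_records:
        event = json.loads(line)
        # Schema-on-read: defaults are supplied when querying, not when loading.
        yield {"user": event.get("user", "unknown"),
               "bytes": event.get("bytes", 0)}

total = sum(rec["bytes"] for rec in transform_for_use_case(raw_store))
print(total)  # → 1024
```

Contrast this with ETL into an EDW, where every record would have to conform to the warehouse schema before it could be loaded at all.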
Yet there’s far too much data in EDWs for any company to consider putting all of its EDW data in Hadoop. Moving a billion rows of data from an EDW takes time. It also puts a load on the primary business systems that depend on the data warehouse, which can impact operations. Likewise, an EDW database can handle only so many requests before performance degrades; plus, these data migrations hog enormous network bandwidth. In other words, it’s not a trivial exercise.
So organizations have a foot in both worlds. If organizations ever move all their EDW data to Hadoop, it will be a multi-year, possibly a multi-decade process. Most knowledge about customers, transactions, and products still lives in EDWs. Right now, most enterprises use Hadoop to hold large data sets like log files or sensor data, which are massive, multi-format, and don’t conform well to the schema of an EDW. They may not be sure what value this data holds, but they want some place to put it until they figure it out.