We hear a lot today about streaming data, fast data, and data in motion. It’s as if until now data has been stagnant, just sitting in some dusty database and never moving. The truth is that we have always needed ways to move data.
Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to today’s real-time streaming ingest systems, we have continued to adapt and create new ways to move data sensibly, even as its appearance and motion patterns have dramatically changed.
In our new data-driven world, the practice of exerting firm control over our data in motion is an increasingly critical competency and is becoming core to successful business operations. Based on more than 20 years in enterprise data, here is my view of where we have been and where we are going as we evolve into a world in which the full force of data volume, velocity, and variability takes hold.
First Generation: Stocking the Warehouse via ETL
Let’s roll back a couple of decades. The first substantial data-movement problems that plagued the mid-1990s emerged with the trend toward data warehousing. The goal was to move transaction data provided by disparate applications or residing in databases into the newly minted data warehouse. Organizations operated a variety of applications, such as offerings from SAP, PeopleSoft, and Siebel, and a variety of database technologies like Oracle and IBM. As a result, there was no simple way to move the data; each data move was a bespoke project requiring an understanding of vendor-specific schemas and languages. The inability to “stock the warehouse” efficiently led to data warehouse projects failing or becoming excessively expensive.
ETL tools addressed this initial data-movement problem by creating connectors for applications and databases to load the warehouse. For each source, one needed only to specify the fields and map them into the warehouse. The engine did the rest of the work. I refer to this first generation as schema-driven ETL. It was developer-centric, focused on preprocessing (aggregating and blending) data at scale from multiple sources to get it uniformly into a warehouse for business intelligence (BI) consumption. Large companies spent millions of dollars on these first-generation tools that allowed developers to move data without dealing with the myriad languages of custom applications.
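The schema-driven idea can be illustrated with a minimal sketch. The source names, field names, and mapping structure below are hypothetical, invented purely for illustration; real first-generation ETL engines were far more elaborate, but the core pattern was the same: declare per-source field mappings once, and let the engine do the transformation.

```python
# Hypothetical per-source field mappings: each connector declares how its
# source fields correspond to uniform warehouse columns.
FIELD_MAPPINGS = {
    "crm": {"cust_name": "customer", "amt": "amount"},
    "erp": {"client": "customer", "total_usd": "amount"},
}

def transform(source, record):
    """Map one source record into the warehouse schema."""
    mapping = FIELD_MAPPINGS[source]
    return {warehouse_col: record[src_field]
            for src_field, warehouse_col in mapping.items()}

def load(records_by_source):
    """Blend records from all sources into one uniform table."""
    warehouse = []
    for source, records in records_by_source.items():
        warehouse.extend(transform(source, r) for r in records)
    return warehouse

rows = load({
    "crm": [{"cust_name": "Acme", "amt": 1200}],
    "erp": [{"client": "Globex", "total_usd": 800}],
})
# Both rows now share the same columns: customer and amount.
```

The developer specifies only the mapping table at the top; the generic engine handles extraction and loading for every source the same way, which is what made these tools worth millions to large companies.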
This first generation became a multi-billion dollar industry.
Second Generation: Less Cloudy Skies via iPaaS
Over time, consolidation in the database and application world created a more homogeneous, standards-based world. Organizations began to wonder if ETL was even necessary, now that the new world order had done away with the fragmentation that had spawned its existence, and a small number of database/application mega-vendors remained.
But a new challenge replaced the old. By the mid-2000s, the emergence of SaaS apps added another layer of complexity to the data-movement challenge. The new questions were: “How do we get cloud-based transaction data into warehouses? How do we synchronize information between different cloud applications? Should we deploy integration middleware in the cloud, on-premises, or both?”
As the SaaS delivery model proliferated, customer, product, and other domain data became fragmented across dozens of different applications with inconsistent, overlapping, or redundant data structures. Because cloud applications are API-driven rather than language-driven, organizations had to rationalize across the different flavors of APIs needed to send data between these various locations.
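To make the rationalization problem concrete, here is a small sketch under invented assumptions: two hypothetical SaaS APIs (a billing app and a support app) return overlapping customer data in different shapes, and an integration layer normalizes both into one record. The payload shapes and function names are illustrative, not any real vendor's API.

```python
def from_billing_api(payload):
    # Hypothetical billing app: nests the name under "contact" and
    # calls the email field "email_address".
    return {"name": payload["contact"]["name"],
            "email": payload["email_address"]}

def from_support_api(payload):
    # Hypothetical support app: flattens the name and uses "email".
    return {"name": payload["full_name"],
            "email": payload["email"]}

def reconcile(records):
    """Deduplicate normalized records by email, the shared key;
    later records overwrite earlier fields."""
    merged = {}
    for rec in records:
        merged.setdefault(rec["email"], {}).update(rec)
    return list(merged.values())

customers = reconcile([
    from_billing_api({"contact": {"name": "Ada"},
                      "email_address": "ada@example.com"}),
    from_support_api({"full_name": "Ada Lovelace",
                      "email": "ada@example.com"}),
])
# A single customer record remains for ada@example.com.
```

Each adapter function absorbs one API's idiosyncrasies, so the reconciliation logic never has to know which application a record came from. Multiply this by dozens of applications and you have the second-generation integration problem in miniature.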
The cloud forced data-movement technologies to evolve from analytic data integration, the sweet spot for data warehouses, to operational integration: moving data between applications, which increased the pressure on systems to deliver trustworthy data quickly.