Big data is no longer a war between batch and streaming data processing. Getting data from disparate data stores and running analytics on it in real time is a huge technological challenge. Cracking this data federation problem has become a Holy Grail of sorts.
And while there are two primary approaches to cracking this problem today, a third has emerged that just might offer the most promise.
One of the ongoing barriers to greater big data adoption is the complexity of the associated software. The industry really needs to tackle this, offering end users the ability to query whatever data they want, wherever it is, no matter what format, and all without going through IT.
Which is mostly impossible today.
There are two general approaches to data federation, each with its own strengths and weaknesses.
The first, database-centric approach is used by relational database (RDBMS) vendors like Teradata (QueryGrid) and IBM (FluidQuery) or by specialty technologies like the former Composite Software.
One of the biggest problems with such database-centric tools is that they're geared for DBA-type users, not business users and analysts. Further, these tools generally do not cover all types of big data. Most were designed for data that fits into tables and columns, but search, streams, and semi-structured or unstructured data (for which NoSQL databases are well-suited) do not necessarily fit as well.
In addition, performance can be an issue when attempting speed-of-thought analytics on a traditionally federated source.
Query Tool-Centric Approach
The query tool-centric approach is used by Tableau, Qlik, and others.
These technologies do allow end users to mash up multiple sources, but they may not scale to big data volumes, since the data is often mashed up on the user's desktop computer or in a web browser rather than in a scalable big data backend like Apache Spark.
And again, they were not really designed for the variety of big data sources, or for anything beyond fairly trivial low-cardinality mashups.
The New Kid On The Block
The new approach to data federation comes from Zoomdata, which just announced its Fusion product, with an early access program to give companies a taste. Zoomdata claims Fusion can make multiple data sources appear as one source without moving or transforming data.
If it works as advertised, this would allow a business user to define a fused data source without waiting for a data architect to set it up ahead of time. Fusion is exposed as a simple drag-and-drop user interface, with no need to resort to a command line, and it hides the underlying Spark-based infrastructure that combines datasets in ways hitherto impossible.
While that is interesting in itself, the real power comes from Zoomdata's ability to push as much of the processing as possible down to each underlying data platform, based on the capabilities and performance profile of those systems, and to use Spark for the rest of the work that can't or shouldn't be pushed down.
Federation never really worked well before for two reasons: sources had to be defined ahead of time, when no one knows exactly the right questions to ask, and federated queries were often slow to run.
The Zoomdata approach is the exact opposite. It allows users to hook up their own data and run queries with fast results. That ability to truly iterate on big data, whether historical or real-time, enterprise or cloud, can be transformative to a company.
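To make the pushdown idea concrete, here is a minimal sketch in Python. It is not Zoomdata's implementation, and none of the names come from Fusion: two in-memory SQLite databases stand in for separate data stores, the table and column names are invented for illustration, and a plain Python loop stands in for the Spark layer that would handle the cross-source work at scale. Each store aggregates or projects its own data, and only the small intermediate results are combined outside of it.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Hypothetical data: orders live in one system, customers in another.
orders_db = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 15.0), (2, 7.5), (3, 20.0), (3, 5.0), (3, 2.5)],
)
customers_db = make_source(
    "CREATE TABLE customers (customer_id INTEGER, region TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "east"), (2, "west"), (3, "east")],
)


def revenue_by_region(orders_conn, customers_conn):
    """Federated query: total order amount per customer region."""
    # Pushed down: the orders store computes per-customer totals itself,
    # so one row per customer leaves it instead of one row per order.
    per_customer = orders_conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall()

    # Pushed down: the customers store returns only the two columns needed.
    region_of = dict(
        customers_conn.execute(
            "SELECT customer_id, region FROM customers"
        ).fetchall()
    )

    # Not pushed down: the cross-source join and final roll-up, which
    # neither store can do alone. This is the part a shared engine such
    # as Spark would take on in a real deployment.
    totals = {}
    for customer_id, amount in per_customer:
        region = region_of.get(customer_id)
        if region is not None:
            totals[region] = totals.get(region, 0.0) + amount
    return totals, len(per_customer)


totals, rows_moved = revenue_by_region(orders_db, customers_db)
print(totals, rows_moved)
```

In this toy case the orders store ships three aggregated rows rather than six raw ones. The gap widens as order volume grows, which is the argument for deciding, source by source, what to push down and what to leave to the shared engine.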