Previously, we introduced streaming, saw some of the benefits it can bring and discussed some of the architectural options and vendors / engines that can support streaming-oriented solutions. We now focus on one of the key players in this space, Apache Flink, and the commercial entity that employs many of the Flink committers and provides Flink-related services, data Artisans (dA).
We talked with dA CEO and Flink PMC member, Kostas Tzoumas. Tzoumas, who has a solid engineering background and was one of the co-creators of Flink, was keen to elaborate on an array of topics: from the streaming paradigm itself and its significance for applications, to Flink's latest release and roadmap and dA's commercial offering and plans.
A few people, including Tzoumas, have made the case for seeing traditional, bounded data and processing as a special case of its unbounded counterparts. While this may seem like a theoretical construct, its implications can be far-reaching.
As Dean Wampler, author ofFast Data Architectures for Streaming Applications argues, "if everything is considered a "stream" -- either finite (as in batch processing) or unbounded -- then the same infrastructure doesn't just unify the batch and speed layers, but batch processing becomes a subset of stream processing."
This as a paradigm shift, argues Tzoumas, as it means that the database is no longer the keeper of the global truth.
The global truth is in the stream: an always-on, immutable flow of data that is processed by an unbounded processing engine. State becomes a view on that unbounded data, specific to each application and kept locally utilizing whatever storage makes sense for the application.
In this architecture, applications consume streams of unbounded data, but they also use streams to publish their own data that may in turn be consumed by other applications. So the streaming engine becomes the hub of the entire data ecosystem. According to Tzoumas:
"What most people think of when it comes to streaming is applications like real-time analytics or IoT. We do believe that these are super-important and we fully support them, however what unbounded processing has the potential to do is offer a new pathway for all applications whose nature fits the streaming model, and those go way beyond the typical examples one would think of. Basically, these are all applications that periodically update their data. They may not necessarily be real time -- they may have latency that goes into the hours range, but that's not the point. as long as there is an inflow of data, we see them as streaming data applications. These are operational, not analytics applications. So for me streaming is not in any way confined to the analytics world, or the Hadoop world for that matter."
Tzoumas further says that:
Tzoumas offers two reasons for this:
1) Time management. You can manage time correctly (by using event time and watermarks), so you can group records correctly, based on when an event occurred, and not just artificially, based on when an event was ingested or processed (which is very often wrong).
2) State management. By modeling your problem as a streaming problem, you can keep state across boundaries. So two events that arrive in different time intervals but still belong to the same logical group can still be correlated with each other.
This may sound compelling, but why go for Flink when there are so many alternatives out there?
Why would anyone choose Flink over Spark in specific, which is enjoying wide popularity and vendor support? After all, if it's latency you're worried about, as Tom Reilly, Cloudera's CEO put it, "we're able to offer sub-second responses and we don't hear any complaints from customers."
Spark is a really good community, but Spark as a platform has some fundamental problems when it comes to streaming, argues Tzoumas.
"It's not about sub-second responses, it's about how to approach a continuous application that needs to keep state.