Last year was a banner year for Spark. Big names like Cloudera and IBM jumped on the bandwagon, companies like Uber and Netflix rolled out major deployments, and Databricks’ aggressive release schedule brought a brace of improvements and new features. Yet real competition for Spark also emerged, led by Apache Flink and Google Cloud Dataflow (aka Apache Beam).
Flink and Dataflow bring new innovations and target some of Spark’s weaker aspects, particularly memory management and streaming support. Spark has not been standing still in the face of this competition, however; big efforts were made last year to improve Spark’s memory management and query optimizer.
Moreover, this year will usher in Spark 2.0 — and with it a new twist for streaming applications, which Databricks calls “Structured Streaming.”
Structured Streaming is a collection of additions to Spark Streaming rather than a huge change to Spark itself. In other words, for all of you Spark jockeys out there: The fundamental concept of microbatching at the core of Spark’s streaming architecture endures.
If you’ve read one of my previous Spark articles or attended any of my talks over the past year or so, you’ll have noticed that I make the same point again and again: Use DataFrames whenever you can, not Spark’s RDD primitive.
DataFrames get the benefit of the Catalyst query optimizer and, as of 1.6, DataFrames typed with DataSets can take advantage of dedicated encoders that allow significantly faster serialization/deserialization times (an order of magnitude faster than the default Java serializer). Furthermore, in Spark 2.0, DataFrames come to Spark Streaming with the simple concept of an infinite DataFrame. Creating such a DataFrame from a stream is simple:
(Note: The Structured Streaming APIs are in constant flux, so while the code snippets in this article provide a general idea of how the code will look, they may change between now and the release of Spark 2.0.)
This results in a streaming DataFrame that can be manipulated in exactly the same way as the more familiar batch DataFrame — using custom user-defined functions (UDFs), for example. Behind the scenes, those results will be updated as new data flows from the stream source. Instead of a disparate set of avenues into data, you’ll have one unified API for both batch and streaming sources. And of course, all your queries on the DataFrames will call on the Catalyst Optimizer to produce efficient operations across the cluster.
That’s all well and good, as far as it goes. It makes developing easier between batch and streaming applications. But the real importance of Structured Streaming is Spark’s abstraction of repeated queries (RQ). In essence, RQ simply states that the majority of streaming applications can be seen as asking the same question over and over again (for example, “How many people visited my website in the past five minutes?”).;