What Spark’s Structured Streaming really means

What Spark’s Structured Streaming really means

Last year was a banner year for Spark. Big names like Cloudera and IBM jumped on the bandwagon, companies like Uber and Netflix rolled out major deployments, and Databricks' aggressive release schedule brought a brace of improvements and new features. Yet real competition for Spark also emerged, led by Apache Flink and Google Cloud Dataflow (aka Apache Beam).

Flink and Dataflow bring new innovations and target some of Spark's weaker aspects, particularly memory management and streaming support. Spark has not been standing still in the face of this competition, however; big efforts were made last year to improve Spark's memory management and query optimizer.

Moreover, this year will usher in Spark 2.0 -- and with it a new twist for streaming applications, which Databricks calls "Structured Streaming."

Structured Streaming is a collection of additions to Spark Streaming rather than a huge change to Spark itself. In other words, for all of you Spark jockeys out there: The fundamental concept of microbatching at the core of Spark's streaming architecture endures.

Read Also:
Data Generation Gap: Younger IT Workers Believe The Hype

If you've read one of my previous Spark articles or attended any of my talks over the past year or so, you'll have noticed that I make the same point again and again: Use DataFrames whenever you can, not Spark's RDD primitive.

DataFrames get the benefit of the Catalyst query optimizer and, as of 1.6, DataFrames typed with DataSets can take advantage of dedicated encoders that allow significantly faster serialization/deserialization times (an order of magnitude faster than the default Java serializer). Furthermore, in Spark 2.0, DataFrames come to Spark Streaming with the simple concept of an infinite DataFrame. Creating such a DataFrame from a stream is simple:

(Note: The Structured Streaming APIs are in constant flux, so while the code snippets in this article provide a general idea of how the code will look, they may change between now and the release of Spark 2.0.)

This results in a streaming DataFrame that can be manipulated in exactly the same way as the more familiar batch DataFrame -- using custom user-defined functions (UDFs), for example. Behind the scenes, those results will be updated as new data flows from the stream source. Instead of a disparate set of avenues into data, you'll have one unified API for both batch and streaming sources. And of course, all your queries on the DataFrames will call on the Catalyst Optimizer to produce efficient operations across the cluster.

Read Also:
These open data pioneers are making cities smart and workforces diverse

That's all well and good, as far as it goes. It makes developing easier between batch and streaming applications. But the real importance of Structured Streaming is Spark's abstraction of repeated queries (RQ). In essence, RQ simply states that the majority of streaming applications can be seen as asking the same question over and over again (for example, "How many people visited my website in the past five minutes?").;



Sentiment Analysis Symposium

27
Jun
2017
Sentiment Analysis Symposium

15% off with code 7WDATA

Read Also:
5 reasons data scientists are attending World of Watson

Data Analytics and Behavioural Science Applied to Retail and Consumer Markets

28
Jun
2017
Data Analytics and Behavioural Science Applied to Retail and Consumer Markets

15% off with code 7WDATA

Read Also:
The Internet of Underwater Things: How researchers are mimicking whales to network the seas

AI, Machine Learning and Sentiment Analysis Applied to Finance

28
Jun
2017
AI, Machine Learning and Sentiment Analysis Applied to Finance

15% off with code 7WDATA

Read Also:
A marketer’s guide to data mining
Read Also:
Data Pipeline goes open, changing business models, and more open source news

Real Business Intelligence

11
Jul
2017
Real Business Intelligence

25% off with code RBIYM01

Read Also:
Mining the Big Data Wheel

Advanced Analytics Forum

20
Sep
2017
Advanced Analytics Forum

15% off with code Discount15

Read Also:
Data Pipeline goes open, changing business models, and more open source news

Leave a Reply

Your email address will not be published. Required fields are marked *