Introducing sparklyr, an R Interface for Apache Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:

- A complete dplyr back end for manipulating Spark datasets directly from R
- The ability to execute SQL queries against Spark tables via a DBI interface
- Access to Spark’s distributed machine learning algorithms (Spark MLlib), plus an integration with H2O Sparkling Water
- Integrated support for working with Spark in the RStudio IDE

We’re also excited to be working with several industry partners. IBM is incorporating sparklyr into its Data Science Experience, Cloudera is working with us to ensure that sparklyr meets the requirements of its enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water.

You can install sparklyr from CRAN as follows:
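The original post included the installation command here; it is the standard CRAN install:

```r
# Install the released version of sparklyr from CRAN
install.packages("sparklyr")
```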

You should also install a local version of Spark for development purposes:
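A sketch of the local install step. sparklyr provides `spark_install()` for downloading a local Spark distribution; the version number shown here is illustrative (pick whichever release you want to develop against):

```r
library(sparklyr)

# Download and install a local Spark distribution for development.
# The version argument is optional; omit it to get a default release.
spark_install(version = "1.6.2")
```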

If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark.

Extensive documentation and examples are available on the sparklyr website.

You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark:
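A minimal connection sketch using `spark_connect()`; for a remote cluster you would pass the cluster’s master URL instead of `"local"`:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance. For a cluster, pass the master URL
# (e.g. "spark://host:port" or a YARN master) instead of "local".
sc <- spark_connect(master = "local")
```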

The returned Spark connection provides a remote dplyr data source to the Spark cluster.

You can copy R data frames into Spark using dplyr’s copy_to() function. (More typically, though, you’ll read data within the Spark cluster using the spark_read family of functions.) For the examples below, we’ll copy some datasets from R into Spark. (Note that you may need to install the nycflights13 and Lahman packages in order to execute this code.)
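A sketch of copying those datasets, assuming an existing Spark connection named `sc`:

```r
library(dplyr)

# Copy example datasets from R into the Spark cluster as Spark DataFrames.
# Requires the nycflights13 and Lahman packages to be installed locally.
iris_tbl    <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
```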

We can now use all of the available dplyr verbs against the tables within the cluster. Here’s a simple filtering example:
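For instance, a filter on the flights table (assuming `flights_tbl` is a dplyr reference to the flights data in Spark) is translated to Spark SQL and executed inside the cluster:

```r
library(dplyr)

# Keep only flights that departed exactly 2 minutes late; the filter
# runs remotely in Spark, not on a local data frame.
flights_tbl %>% filter(dep_delay == 2)
```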

The Introduction to dplyr tutorial provides additional dplyr examples you can try. For example, consider the last example from the tutorial, which plots data on flight delays:
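A sketch of that flight-delay example, assuming `flights_tbl` is a dplyr reference to the flights data in Spark: the aggregation runs in the cluster, and only the small summarised result is collected into R for plotting.

```r
library(dplyr)
library(ggplot2)

# Summarise per-aircraft delays inside Spark, then collect() the
# aggregated result into a local data frame for plotting.
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
```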

Note that while the dplyr functions shown above look identical to the ones you use with R data frames, with sparklyr they use Spark as their back end and execute remotely in the cluster.

dplyr window functions are also supported, for example:
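A sketch of a grouped window-function query, assuming `batting_tbl` is a dplyr reference to the Lahman batting data in Spark; `min_rank()` is computed per player group inside the cluster:

```r
library(dplyr)

# For each player, keep the two seasons with the most hits.
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
```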

For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

It’s also possible to execute SQL queries directly against tables within a Spark cluster. The Spark connection object implements a DBI interface for Spark, so you can use dbGetQuery() to execute SQL and return the result as an R data frame:
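A sketch of a DBI query, assuming a Spark connection `sc` with an `iris` table already registered in the cluster:

```r
library(DBI)

# The Spark connection acts as a DBI connection: dbGetQuery() runs the
# SQL inside Spark and returns the result as an R data frame.
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
```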

You can orchestrate machine learning algorithms in a Spark cluster via either Spark MLlib or the H2O Sparkling Water extension package.
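A minimal MLlib sketch, assuming a Spark connection `sc`: copy a small data frame into Spark and fit a linear regression with sparklyr’s `ml_linear_regression()` (the `response`/`features` arguments shown reflect the interface as of this post).

```r
library(sparklyr)
library(dplyr)

# Copy mtcars into Spark, then fit a linear model with Spark MLlib.
mtcars_tbl <- copy_to(sc, mtcars)
fit <- mtcars_tbl %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
summary(fit)
```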
