Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more.
We’re also excited to be working with several industry partners. IBM is incorporating sparklyr into its Data Science Experience, Cloudera is working with us to ensure that sparklyr meets the requirements of its enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water.
You can install sparklyr from CRAN as follows:
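A minimal sketch of that installation step, assuming an internet connection and a configured CRAN mirror:

```r
# Install the released version of sparklyr from CRAN.
install.packages("sparklyr")
```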
You should also install a local version of Spark for development purposes:
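One way to do this is with sparklyr's spark_install() function. The sketch below lets sparklyr choose its default Spark version, which is an assumption on our part; pass a version argument if you need a specific release.

```r
library(sparklyr)

# Download and install a local copy of Spark for development.
# With no arguments, sparklyr picks a default Spark version.
spark_install()
```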
If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark.
Extensive documentation and examples are available at http://spark.rstudio.com.
You can connect to both local instances of Spark and remote Spark clusters. Here we’ll connect to a local instance of Spark:
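A minimal sketch of a local connection, assuming Spark has already been installed locally. The variable name sc is only a convention chosen for this illustration.

```r
library(sparklyr)

# Connect to a local Spark instance.
# For a remote cluster, pass that cluster's master URL instead of "local".
sc <- spark_connect(master = "local")
```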
The returned Spark connection provides a remote dplyr data source to the Spark cluster.
You can copy R data frames into Spark using the dplyr copy_to function. (More typically, though, you’ll read data within the Spark cluster using the spark_read family of functions.) For the examples below, we’ll copy some datasets from R into Spark. (Note that you may need to install the nycflights13 and Lahman packages in order to execute this code.)
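The copy step might look like the sketch below. The table names ("iris", "flights", "batting") and the variable names are choices made for this illustration, and overwrite = TRUE is added only so the block can be re-run safely. The connection is repeated so the block runs on its own.

```r
library(sparklyr)
library(dplyr)

# Local connection, repeated here so this block is self-contained.
sc <- spark_connect(master = "local")

# Copy R data frames into Spark; the third argument names the Spark table.
iris_tbl    <- copy_to(sc, iris, "iris", overwrite = TRUE)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)
batting_tbl <- copy_to(sc, Lahman::Batting, "batting", overwrite = TRUE)
```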
We can now use all of the available dplyr verbs against the tables within the cluster. Here’s a simple filtering example:
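As a sketch, the filter below keeps flights whose departure delay was exactly two minutes. The threshold is arbitrary and the setup lines are repeated so the block runs on its own.

```r
library(sparklyr)
library(dplyr)

# Setup: local connection plus a copy of the flights data in Spark.
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# filter() is translated to Spark SQL and evaluated in the cluster.
flights_tbl %>% filter(dep_delay == 2)
```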
The “Introduction to dplyr” tutorial provides additional dplyr examples you can try. For example, consider the last example from the tutorial, which plots data on flight delays:
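A sketch of that flight-delay example, adapted to a Spark table. It assumes the ggplot2 and nycflights13 packages are installed; the aggregation runs in Spark and only the summarised rows are brought into R with collect().

```r
library(sparklyr)
library(dplyr)
library(ggplot2)

# Setup: local connection plus a copy of the flights data in Spark.
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# Aggregate per aircraft in Spark, then collect the small result into R.
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(
    count = n(),
    dist  = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# Plot average arrival delay against average flight distance.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
```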
Note that while the dplyr functions shown above look identical to the ones you use with R data frames, with sparklyr they use Spark as their back end and execute remotely in the cluster.
dplyr window functions are also supported. For example:
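As an illustration, the sketch below uses the min_rank() window function to keep each player's two best seasons by hits. It assumes the Lahman package is installed; the choice of columns is ours.

```r
library(sparklyr)
library(dplyr)

# Setup: local connection plus a copy of the batting data in Spark.
sc <- spark_connect(master = "local")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting", overwrite = TRUE)

# Rank seasons by hits (H) within each player and keep the top two,
# ignoring seasons with no hits.
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
```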
For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.
It’s also possible to execute SQL queries directly against tables within a Spark cluster. The Spark connection object implements a DBI interface for Spark, so you can use dbGetQuery to execute SQL and return the result as an R data frame:
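A minimal sketch using the DBI package's dbGetQuery() function. The table name "iris" and the query itself are illustrative, and the setup lines are repeated so the block runs on its own.

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Setup: local connection plus a copy of iris registered as the table "iris".
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Run SQL in Spark and return the result as an ordinary R data frame.
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview
```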
You can orchestrate machine-learning algorithms in a Spark cluster via either Spark MLlib or via the H2O Sparkling Water extension package.
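As a small illustration of the Spark MLlib route, the sketch below fits a linear regression on the built-in mtcars data. The choice of dataset and of predictors (wt and cyl) is ours, and the formula interface is assumed to be available in your installed version of sparklyr.

```r
library(sparklyr)
library(dplyr)

# Setup: local connection plus a copy of mtcars in Spark.
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Fit a linear model in Spark MLlib: fuel economy as a function of
# weight and number of cylinders.
fit <- mtcars_tbl %>%
  ml_linear_regression(mpg ~ wt + cyl)

# Inspect the fitted model from R.
summary(fit)
```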