Introducing sparklyr, an R Interface for Apache Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:

- A complete dplyr back end for filtering and aggregating datasets inside Spark, then bringing them into R for analysis and visualization.
- Interfaces to Spark’s distributed machine learning algorithms, via both Spark MLlib and the H2O Sparkling Water extension.
- Integrated support for working with Spark from the RStudio IDE.

We’re also excited to be working with several industry partners. IBM is incorporating sparklyr into its Data Science Experience, Cloudera is working with us to ensure that sparklyr meets the requirements of its enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water.

You can install sparklyr from CRAN as follows:
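For example:

```r
# Install the released version of sparklyr from CRAN
install.packages("sparklyr")
```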

You should also install a local version of Spark for development purposes:
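sparklyr provides a helper for this. A sketch (spark_install() downloads and installs Spark locally; you can pass a version argument to pin a specific release):

```r
library(sparklyr)

# Download and install a local copy of Spark for development
spark_install()
```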

If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark.

Extensive documentation and examples are available at http://spark.rstudio.com.

You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark:
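A sketch of a local connection (assumes a local Spark installation via spark_install()):

```r
library(sparklyr)

# Connect to a local Spark instance; spark_connect() returns a
# connection object that the examples below refer to as `sc`
sc <- spark_connect(master = "local")
```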


The returned Spark connection provides a remote dplyr data source to the Spark cluster.

You can copy R data frames into Spark using the dplyr copy_to() function. (More typically, though, you’ll read data within the Spark cluster using the spark_read family of functions.) For the examples below, we’ll copy some datasets from R into Spark. (Note that you may need to install the nycflights13 and Lahman packages in order to execute this code.)
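A sketch of the copies used below (assumes an open connection `sc` as above):

```r
library(dplyr)

# Copy R data frames into the Spark cluster; each call returns a
# remote dplyr table backed by Spark
iris_tbl    <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
```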

We can now use all of the available dplyr verbs against the tables within the cluster. Here’s a simple filtering example:
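For instance (assuming the `flights_tbl` table copied above), filtering to flights that departed exactly two minutes late:

```r
# The filter is translated to Spark SQL and executed in the cluster
flights_tbl %>% filter(dep_delay == 2)
```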

Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:
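A sketch of that example (assuming `flights_tbl` as above): the grouping and aggregation run inside Spark, and collect() brings only the summarized result into R for plotting.

```r
library(ggplot2)

# Summarize per-plane delay statistics in the cluster, then collect
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# Plot delay against distance, sized by number of flights
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth()
```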

Note that while the dplyr functions shown above look identical to the ones you use with R data frames, with sparklyr they use Spark as their back end and execute remotely in the cluster.

dplyr window functions are also supported, for example:
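For instance (assuming `batting_tbl` as above), using the window function min_rank() to find each player’s two best hit totals:

```r
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
```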


For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

It’s also possible to execute SQL queries directly against tables within a Spark cluster. The Spark connection object implements a DBI interface for Spark, so you can use DBI functions such as dbGetQuery() to execute SQL and return the result as an R data frame:
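A sketch (assumes the `iris` table copied earlier and an open connection `sc`):

```r
library(DBI)

# Run a SQL query in Spark and return the result as an R data frame
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
```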

You can orchestrate machine-learning algorithms in a Spark cluster via either Spark MLlib or via the H2O Sparkling Water extension package.
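A sketch of an MLlib model fit (a hypothetical example: a linear regression on a copy of mtcars, assuming an open connection `sc`):

```r
library(sparklyr)
library(dplyr)

mtcars_tbl <- copy_to(sc, mtcars)

# Fit a linear model in the cluster with Spark MLlib
fit <- mtcars_tbl %>%
  ml_linear_regression(mpg ~ wt + cyl)

summary(fit)
```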
