Introducing sparklyr, an R Interface for Apache Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! Highlights include a complete dplyr back end for manipulating Spark datasets, the ability to query Spark tables directly with SQL, and interfaces to Spark’s distributed machine learning algorithms.

We’re also excited to be working with several industry partners. IBM is incorporating sparklyr into its Data Science Experience, Cloudera is working with us to ensure that sparklyr meets the requirements of its enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water.

You can install sparklyr from CRAN as follows:
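The install command was lost in republication; it is the usual one-liner:

```r
install.packages("sparklyr")
```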

You should also install a local version of Spark for development purposes:
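sparklyr provides a helper for downloading a local Spark distribution. A minimal sketch (the version string below is illustrative; check which versions are available for your release of sparklyr):

```r
library(sparklyr)
spark_install(version = "2.0.0")  # installs Spark locally for development use
```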

If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark.

Extensive documentation and examples are available at http://spark.rstudio.com.

You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark:
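The connection code was dropped in republication; connecting to a local instance looks like this (the connection object is conventionally named `sc`):

```r
library(sparklyr)
sc <- spark_connect(master = "local")
```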

The returned Spark connection (conventionally assigned to `sc`) provides a remote dplyr data source to the Spark cluster.

You can copy R data frames into Spark using dplyr’s copy_to() function. (More typically, though, you’ll read data within the Spark cluster using the spark_read family of functions.) For the examples below, we’ll copy some datasets from R into Spark. (Note that you may need to install the nycflights13 and Lahman packages in order to execute this code.)
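A sketch of the copies used in the examples that follow, assuming an open connection `sc`:

```r
library(dplyr)
iris_tbl    <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
```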

We can now use all of the available dplyr verbs against the tables within the cluster. Here’s a simple filtering example:
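For instance, filtering the flights table copied above for flights that departed exactly two minutes late:

```r
flights_tbl %>% filter(dep_delay == 2)
```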

Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:
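One way that example translates to sparklyr: summarise per-plane delays in the cluster, collect() the small result back into R, and plot it locally (`flights_tbl` is the Spark table copied above):

```r
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist  = mean(distance),
            delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()  # bring the aggregated result into R for plotting

library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth()
```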

Note that while the dplyr functions shown above look identical to the ones you use with R data frames, with sparklyr they use Spark as their back end and execute remotely in the cluster.

dplyr window functions are also supported, for example:
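A sketch using min_rank(), one of dplyr’s window functions, against the batting table copied earlier to find each player’s two best hit totals:

```r
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)  # ranking is computed per player, in Spark
```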

For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

It’s also possible to execute SQL queries directly against tables within a Spark cluster. The Spark connection object implements a DBI interface for Spark, so you can use DBI functions such as dbGetQuery() to execute SQL and return the result as an R data frame:
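For example, assuming the iris dataset was copied into the cluster as above:

```r
library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
```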

You can orchestrate machine learning algorithms in a Spark cluster via either Spark MLlib or the H2O Sparkling Water extension package.
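A minimal MLlib sketch: fit a linear model to mtcars inside the cluster with sparklyr’s ml_linear_regression(). (The exact argument interface has evolved across sparklyr releases; this reflects the early response/features form.)

```r
mtcars_tbl <- copy_to(sc, mtcars)

# model fitting happens in Spark; only the fitted summary comes back to R
fit <- mtcars_tbl %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

summary(fit)
```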

