Introducing sparklyr, an R Interface for Apache Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more.

We’re also excited to be working with several industry partners. IBM is incorporating sparklyr into its Data Science Experience, Cloudera is working with us to ensure that sparklyr meets the requirements of its enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water.

You can install sparklyr from CRAN as follows:
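The original post's installation chunk appears to have been stripped in republication; installing from CRAN is the standard one-liner:

```r
# Install the released version of sparklyr from CRAN
install.packages("sparklyr")
```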

You should also install a local version of Spark for development purposes:
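sparklyr provides a helper, `spark_install()`, that downloads and installs a local copy of Spark (the version shown here matches the releases current at the time of the announcement):

```r
library(sparklyr)

# Download and install a local copy of Spark for development
spark_install(version = "1.6.2")
```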

If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark.

Extensive documentation and examples are available at http://spark.rstudio.com.

You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark:
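Connecting is done with `spark_connect()`; for a local instance the master is simply `"local"`:

```r
library(sparklyr)

# Connect to a local Spark instance; sc is our handle to the cluster
sc <- spark_connect(master = "local")
```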


The returned Spark connection object provides a remote dplyr data source to the Spark cluster.

You can copy R data frames into Spark using the dplyr copy_to() function. (More typically, though, you’ll read data within the Spark cluster using the spark_read family of functions.) For the examples below, we’ll copy some datasets from R into Spark. (Note that you may need to install the nycflights13 and Lahman packages in order to execute this code.)
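A sketch of copying the example datasets, assuming `sc` is the Spark connection returned by `spark_connect()`:

```r
library(dplyr)

# Copy example datasets from R into the Spark cluster
# (requires the nycflights13 and Lahman packages)
iris_tbl    <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
```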

We can now use all of the available dplyr verbs against the tables within the cluster. Here’s a simple filtering example:
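For instance, assuming the nycflights13 flights data has been copied into the cluster as `flights_tbl`:

```r
# Keep only flights with a departure delay of exactly two minutes;
# the filter is translated to SQL and executed inside Spark
flights_tbl %>% filter(dep_delay == 2)
```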

Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:
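A sketch of that delay example, aggregating per-plane delays in Spark and collecting only the summary into R for plotting:

```r
library(ggplot2)

# Summarise per-plane delays inside Spark, then collect the
# (small) result set into R with collect()
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# Plot delay against distance, sized by number of flights
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
```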

Note that while the dplyr functions shown above look identical to the ones you use with R data frames, with sparklyr they use Spark as their back end and execute remotely in the cluster.

dplyr window functions are also supported, for example:
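A sketch using `min_rank()` as a grouped window function, assuming the Lahman batting data has been copied into the cluster as `batting_tbl`:

```r
# For each player, keep the rows for their two best hit (H) totals;
# min_rank() is evaluated as a window function within each group
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
```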


For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

It’s also possible to execute SQL queries directly against tables within a Spark cluster. The Spark connection object implements a DBI interface for Spark, so you can use DBI functions such as dbGetQuery() to execute SQL and return the result as an R data frame:
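For example, assuming an `iris` table has been copied into the cluster (e.g. via `copy_to()`) and `sc` is the Spark connection:

```r
library(DBI)

# Run a SQL query against a Spark table; the result comes back
# as an ordinary R data frame
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
```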

You can orchestrate machine learning algorithms in a Spark cluster via either Spark MLlib or the H2O Sparkling Water extension package.
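As a minimal MLlib sketch, assuming `sc` is an open Spark connection, one could fit a linear model predicting fuel consumption from the mtcars data:

```r
# Copy mtcars into Spark, then fit a linear regression with MLlib:
# predict fuel consumption (mpg) from weight (wt) and cylinders (cyl)
mtcars_tbl <- copy_to(sc, mtcars)

fit <- mtcars_tbl %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
```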

 


