Around two weeks ago the Elastacloud data science team attended the EARL conference in London. The first day consisted of several workshops: Spark and R with sparklyr, web scraping and text analysis in R, writing R functions for fun and profit, Introduction to Shiny, working with GitHub, and working with the MicrosoftML package. Most of us attended Spark and R with sparklyr in the morning and working with the MicrosoftML package in the afternoon. The next two days consisted of simultaneous streams of talks on a wide variety of R applications. Some of my personal highlights were the keynote talk by Jenny Bryan on workflows, Dr Joy McKenny’s talk on using R to monitor sewer network performance for the water industry, Simon Field’s talk on how to develop data scientist super powers, and last but by no means least the sparklyr workshop. Some highlights of the sparklyr workshop are described below.
R computation is single-threaded and memory-bound, which limits how much data it can handle on one machine. The sparklyr package is an R interface to Apache Spark. It translates readable dplyr code into Spark SQL queries, so one can explore and wrangle large volumes of data, and it gives access to Spark's MLlib to help the user develop machine learning models.
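To make the dplyr-to-SQL translation and the MLlib interface concrete, here is a minimal sketch rather than the workshop's own code. It assumes a local Spark installation (for example one set up with spark_install()), and the choice of the built-in mtcars data set, the table name and the model formula are purely illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via sparklyr::spark_install()
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark (illustrative choice of data)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are recorded lazily; nothing runs in Spark yet
mpg_by_cyl <- mtcars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  arrange(cyl)

# Print the SQL that sparklyr generated from the dplyr pipeline
show_query(mpg_by_cyl)

# Run the query in Spark and bring the (small) result back into R
result <- collect(mpg_by_cyl)

# Fit a linear regression with Spark MLlib through the formula interface
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```

Only collect() moves data into the R session, so the heavy lifting stays in Spark and R only ever holds the aggregated result.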
One of the latest features, introduced in sparklyr 0.6, is the spark_apply() function, which allows the user to run arbitrary R code on all nodes of the cluster.
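The following is a small sketch of running R code on the workers with spark_apply(), again under the assumption of a local Spark installation. The toy data and the squaring function are made up for illustration. On a real cluster, R also needs to be available on every worker node.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via sparklyr::spark_install()
sc <- spark_connect(master = "local")

# Create a Spark DataFrame with a single column `id` holding 1..10,
# split across two partitions
numbers_tbl <- sdf_len(sc, 10, repartition = 2)

# The function receives each partition as an ordinary R data frame and
# must return a data frame; here it squares the `id` column in place
squared_tbl <- spark_apply(numbers_tbl, function(df) {
  df$id <- df$id^2
  df
})

# Bring the result back into R
squared <- squared_tbl %>% arrange(id) %>% collect()

spark_disconnect(sc)
```

Because the function is applied once per partition, any R logic that works on a data frame can be distributed this way, at the cost of serialising data between Spark and the R processes on the workers.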
We had a great time bonding, learning and seeing what great things the R community has produced in various industries.