Running Apache Spark with sparklyr and R in Windows

RStudio recently released the sparklyr package, which lets users connect to Apache Spark instances from R. The package also offers dplyr integration, so you can work with data in Spark using familiar verbs like filter and select, which is very convenient. It will even download and install Apache Spark for you on a fresh install. This post covers a local install of Apache Spark via sparklyr and RStudio on Windows 10.

As per the guide, install the latest preview release of RStudio and run the following commands to install sparklyr:

install.packages("devtools")
devtools::install_github("rstudio/sparklyr")

Once installed, you should see a new tab beside the Environment and History tabs in RStudio – the Spark tab – with a “New Connection” button.

Clicking the button lets you set various options, from the Spark and Hadoop versions (at the time of writing, Spark 1.6.2 and Hadoop 2.6) to whether to connect to a local or remote Spark cluster and whether to use dplyr as the DB interface. Let’s use the defaults here.

If this is a fresh install, RStudio will prompt you with a confirmation dialog box:

Upon clicking “Install”, RStudio will download and install Apache Spark for you and then attempt to connect to the local Spark instance.
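
If you prefer to skip the dialog, you should also be able to trigger the same download from the R console with sparklyr’s spark_install() function; the sketch below assumes the default Spark and Hadoop versions mentioned above:

library(sparklyr)
# download and install a local copy of Spark 1.6.2 built against Hadoop 2.6
spark_install(version = "1.6.2", hadoop_version = "2.6")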

At this stage you might receive an error message similar to the one below:

Error in start_shell(scon, list(), jars, packages) : 
  Failed to launch Spark shell. Ports file does not exist.
    Path: C:\Users\<USERNAME>\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\spark-submit.cmd
    Parameters: --packages "com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34" --jars "<PATH TO R PACKAGES>\R\win-library\3.2\sparklyr\java\rspark_utils.jar"  sparkr-shell D:\Temp\RtmpO0cLos\file23c0703c73bf.out

In addition: Warning message:
running command '"C:\Users\<USERNAME>\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\spark-submit.cmd" --packages "com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34" --jars "<PATH TO R PACKAGES>\R\win-library\3.2\sparklyr\java\rspark_utils.jar"  sparkr-shell <PATH TO TEMP DIRECTORY>\Temp\RtmpO0cLos\file23c0703c73bf.out' had status 127 

If you encounter this error, it may be due to the Windows security permissions on the Apache Spark .CMD files. To resolve the issue, go to the Apache Spark install directory, which should be C:\Users\<USERNAME>\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\, where you should see the files ending with the .cmd extension. At the time of writing, these are:

beeline.cmd
load-spark-env.cmd
pyspark.cmd
pyspark2.cmd
run-example.cmd
run-example2.cmd
spark-class.cmd
spark-class2.cmd
spark-shell.cmd
spark-shell2.cmd
spark-submit.cmd
spark-submit2.cmd
sparkR.cmd
sparkR2.cmd

For each of these .CMD files, edit the security permissions for your username to allow “Read & execute”, as shown below:
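
If you would rather script the change than click through the file-properties dialog for each file, a rough sketch along the following lines should also work from R. It relies on the Windows icacls utility; the path and <USERNAME> are placeholders, and R needs to be running with sufficient privileges:

# rough sketch (not the exact steps in the screenshot above): grant "Read & execute"
# (RX) on every .cmd file in the Spark bin directory via the Windows icacls utility
spark_bin <- "C:/Users/<USERNAME>/AppData/Local/rstudio/spark/Cache/spark-1.6.2-bin-hadoop2.6/bin"
cmd_files <- list.files(spark_bin, pattern = "\\.cmd$", full.names = TRUE)
for (f in cmd_files) {
  system2("icacls", args = c(shQuote(normalizePath(f)), "/grant", "<USERNAME>:RX"))
}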

After editing the permissions, you should no longer encounter the error when you connect to Spark by running the commands below.

> library(sparklyr)
> library(dplyr)
> sc <- spark_connect(master = "local")

To verify the connection, you can work through the examples from the RStudio guide, or try the adapted example below:

> iris_tbl <- copy_to(sc, iris)
> iris_tbl
Source:   query [?? x 5]
Database: spark connection master=local app=sparklyr local=TRUE

   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with more rows
> 
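
Since sparklyr translates dplyr verbs into Spark SQL, you can also query the table directly. Below is a minimal sketch using the column names shown above; the filter threshold is arbitrary:

# filter and select are executed by Spark; only the result is returned to R
iris_tbl %>%
  filter(Petal_Width > 0.3) %>%
  select(Species, Petal_Length, Petal_Width) %>%
  head()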

You should also be able to see the data frame in the Spark tab:

You can now have fun with Apache Spark in RStudio! You might also want to add the Apache Spark install directory C:\Users\<USERNAME>\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin to your PATH so that you can run the Spark shells (sparkR, pyspark and spark-shell) from the command line. For example, with PySpark you should see a welcome screen similar to the one below after the initialization messages.

SG elections – a Twitter snapshot

With election fever ongoing in Singapore, let’s take a snapshot of the popular tweets with the hashtag #ge2015 at this point in time.

library(twitteR)

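# Twitter API credentials are read from local text files rather than hard-coded in the script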
consumerKey <- readLines("twitterkey.txt")
consumerSecret <- readLines("twittersecret.txt")
accessToken <- readLines("twitteraccesstoken.txt")
accessTokenSecret <- readLines("twitteraccesstokensecret.txt")

setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessTokenSecret)
## [1] "Using direct authentication"
tweets <- searchTwitter("#ge2015", resultType="popular", n=100)

tweetsdf <- twListToDF(tweets)

library(dplyr)
tweetsdf <- tbl_df(tweetsdf)
glimpse(tweetsdf)
## Observations: 31
## Variables:
## $ text          (chr) "#GE2015: PAP candidate Koh Poh Koon performs CP...
## $ favorited     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount (dbl) 113, 32, 58, 49, 7, 16, 19, 2, 8, 6, 8, 168, 17,...
## $ replyToSN     (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ created       (time) 2015-09-06 04:43:24, 2015-09-06 04:01:11, 2015-...
## $ truncated     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ replyToSID    (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ id            (chr) "640384783080538112", "640374159881601025", "640...
## $ replyToUID    (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ statusSource  (chr) "<a href=\"https://about.twitter.com/products/tw...
## $ screenName    (chr) "STcom", "TODAYonline", "YamKeng", "YamKeng", "T...
## $ retweetCount  (dbl) 293, 67, 75, 92, 45, 24, 12, 11, 10, 17, 9, 476,...
## $ isRetweet     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude     (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude      (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Let’s first look at the top contributors of these popular tweets.

# There are users with several popular tweets
tweetsdf %>% select(screenName) %>%
  group_by(screenName) %>%
  summarise(count=n()) %>%
  arrange(desc(count))
## Source: local data frame [7 x 2]
## 
##     screenName count
## 1      mrbrown     9
## 2        STcom     6
## 3  TODAYonline     6
## 4         wpsg     6
## 5      YamKeng     2
## 6 LizforLeader     1
## 7 PAPSingapore     1

From the counts, we can see that the largest share of the popular tweets in this snapshot comes from the user ‘mrbrown’, the Twitter handle of blogger Mr Brown.

An outlier here is the user ‘LizforLeader’, the official Twitter account of Liz Kendall’s campaign to become leader of the Labour Party in the UK. Since it is not relevant in the context of Singapore, it will be filtered out in the subsequent analysis.

# retweets of popular tweets per user
retweetsByUsers <- tweetsdf %>% 
  select(screenName, retweetCount) %>%
  filter(screenName != 'LizforLeader') %>%
  group_by(screenName) %>%
  summarise_each(funs(sum)) %>%
  arrange(desc(retweetCount))

retweetsByUsers %>%
    mutate(percentage = round(retweetCount/sum(retweetCount)*100, digits=2))
## Source: local data frame [6 x 3]
## 
##     screenName retweetCount percentage
## 1      mrbrown         2118      61.34
## 2        STcom          449      13.00
## 3  TODAYonline          347      10.05
## 4 PAPSingapore          220       6.37
## 5      YamKeng          167       4.84
## 6         wpsg          152       4.40

Looking at the combined retweet counts per user, we can see that the majority of retweets in this snapshot (61.34%) are of Mr Brown’s tweets.

Next, let’s look at the top 5 tweets by retweet count, in descending order, at this point in time.

# Top 5 tweets by highest number of retweets 
top5Tweets <- tweetsdf %>% select(screenName,id,retweetCount) %>%
  filter(screenName != 'LizforLeader') %>%
  arrange(desc(retweetCount)) %>% 
  top_n(5)
## Selecting by retweetCount
top5Tweets
## Source: local data frame [5 x 3]
## 
##   screenName                 id retweetCount
## 1    mrbrown 639654658827423744          476
## 2    mrbrown 638647979021234176          415
## 3      STcom 640384783080538112          293
## 4    mrbrown 639105775214792706          274
## 5    mrbrown 638705015981379586          224
# Direct Links to the top 5 tweets
paste("http://twitter.com/",top5Tweets$screenName,"/status/", top5Tweets$id, sep="")
## [1] "http://twitter.com/mrbrown/status/639654658827423744"
## [2] "http://twitter.com/mrbrown/status/638647979021234176"
## [3] "http://twitter.com/STcom/status/640384783080538112"  
## [4] "http://twitter.com/mrbrown/status/639105775214792706"
## [5] "http://twitter.com/mrbrown/status/638705015981379586"

Not surprisingly, 4 of the top 5 tweets came from Mr Brown. With a few more days to go before election day, the results will probably change. For now, you can view the top 5 tweets in this current snapshot below.

Getting session information in R

When troubleshooting R bugs or asking for assistance on mailing lists and sites like StackOverflow, it is good practice to include information about your system and the packages you have loaded.

I much prefer the session_info() function from the devtools package over the base sessionInfo() function: its output is not only more readable, it also provides useful information such as the timezone and the additional (non-base) packages loaded at the time.

Assuming you already have the devtools package installed, you can invoke the function in one line:

devtools::session_info()
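
For comparison, the base sessionInfo() call referred to above is simply:

# base R alternative; the output is terser and omits details such as the timezone
sessionInfo()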