#SG50ShadesOfGrey – An #rstats Analysis

The 50 Shades of Grey movie seems to have spawned a wave of humor on Twitter in Singapore, and the jokes have been making the rounds internationally as well. In the spirit of #rstats, let’s look at some trends in #SG50ShadesOfGrey.

We shall use the twitteR and foreach packages to build a data frame of the popular tweets for #sg50shadesofgrey.

library(twitteR)

consumerKey <- readLines("twitterkey.txt")
consumerSecret <- readLines("twittersecret.txt")
accessToken <- readLines("twitteraccesstoken.txt")
accessTokenSecret <- readLines("twitteraccesstokensecret.txt")

setup_twitter_oauth(consumerKey,consumerSecret,accessToken,accessTokenSecret)
## [1] "Using direct authentication"
tweets <- searchTwitter("#sg50shadesofgrey", resultType="popular", n=100)


# Each item in the list can be converted into a data frame with attributes
# as columns and one row of data. We will then convert these data frames
# to rows in a single data frame.
library(foreach)
# seq_along() is safe even when the search returns an empty list
tweetsdf <- foreach(i = seq_along(tweets), .combine = rbind) %do% as.data.frame(tweets[[i]])

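library(dplyr)

# Wrap in dplyr's table class for compact printing
# (tbl_df() is deprecated in newer dplyr releases, where as_tibble() takes its place)
tweetsdf <- tbl_df(tweetsdf)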

nrow(tweetsdf)
## [1] 30
names(tweetsdf)
##  [1] "text"          "favorited"     "favoriteCount" "replyToSN"    
##  [5] "created"       "truncated"     "replyToSID"    "id"           
##  [9] "replyToUID"    "statusSource"  "screenName"    "retweetCount" 
## [13] "isRetweet"     "retweeted"     "longitude"     "latitude"
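
The row-binding step above can also be written without foreach. The following is a minimal base-R sketch, assuming each list element converts to a one-row data frame as described in the comments above. The `mock_tweets` list and its values are made up for illustration and only stand in for the real `tweets` object. (twitteR also ships a `twListToDF()` helper for this kind of conversion.)

```r
# Hypothetical stand-in for the list returned by searchTwitter():
# each element is a one-row data frame with two of the same columns.
mock_tweets <- list(
  data.frame(screenName = "user_a", retweetCount = 10, stringsAsFactors = FALSE),
  data.frame(screenName = "user_b", retweetCount = 25, stringsAsFactors = FALSE),
  data.frame(screenName = "user_a", retweetCount = 7,  stringsAsFactors = FALSE)
)

# Convert every element to a data frame, then stack the rows in one call.
mock_df <- do.call(rbind, lapply(mock_tweets, as.data.frame))

nrow(mock_df)   # one row per list element
names(mock_df)  # columns are carried over unchanged
```

Binding everything in a single `rbind` call avoids growing the data frame one row at a time, which matters little for 30 tweets but helps with larger searches.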

Let’s first look at the top contributors to these popular tweets.

# There are users with several popular tweets
tweetsdf %>% select(screenName) %>%
  group_by(screenName) %>%
  summarise(count=n()) %>%
  arrange(desc(count))
## Source: local data frame [18 x 2]
## 
##        screenName count
## 1         alfpang     8
## 2       adibjalal     4
## 3     BBCtrending     2
## 4    asonofapeach     2
## 5       DanialRon     1
## 6     InsideScoot     1
## 7    MIIKOLICIOUS     1
## 8           STcom     1
## 9   SoSingaporean     1
## 10  SyakirahNasri     1
## 11      ahbengpls     1
## 12     ahbengsiao     1
## 13  benjaminkheng     1
## 14  juicyjuleswei     1
## 15       omgitsjy     1
## 16      sammmydee     1
## 17         smrtsg     1
## 18 spinorbinmusic     1

From the counts, we can see that the user ‘alfpang’ contributed the most popular tweets: 8 of the 30.
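
As a quick cross-check on the dplyr pipeline, the same kind of tally can be produced with base R's `table()`. The sketch below rebuilds only the first four rows of the printed counts; the `screen_names` vector is a reconstruction for illustration, not the downloaded data.

```r
# Reconstructed from the first four rows of the printed counts above
screen_names <- c(rep("alfpang", 8), rep("adibjalal", 4),
                  rep("BBCtrending", 2), rep("asonofapeach", 2))

# Tally occurrences per user and sort from most to fewest tweets
counts <- sort(table(screen_names), decreasing = TRUE)
counts

# Share of all 30 popular tweets that came from the top contributor
top_share <- counts[["alfpang"]] / 30
top_share
```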

Looking at the variables, retweetCount and favoriteCount look interesting. However, they are probably highly correlated. We can check with a scatter plot and confirm with a correlation test.

library(ggplot2)
# Correlation of retweetCount with favoriteCount
tweetsdf %>% select(favoriteCount,retweetCount) %>% 
  ggplot(., aes(x=favoriteCount,y=retweetCount)) + geom_point() + geom_smooth() +
  labs(title = "Favorite Count vs Retweet Count", x = "favorite counts", y = "retweet counts")
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

# Correlation Test
with(tweetsdf, cor.test(retweetCount, favoriteCount))
## 
##  Pearson's product-moment correlation
## 
## data:  retweetCount and favoriteCount
## t = 27.12, df = 28, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9610520 0.9912528
## sample estimates:
##      cor 
## 0.981492
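
To see where the numbers in this output come from, the t statistic and the confidence interval can be recomputed by hand from the printed correlation and the sample size. This is a sketch using the standard textbook formulas for Pearson's correlation (the t statistic with n - 2 degrees of freedom, and the Fisher z-transformation for the interval); the variable names are my own.

```r
r <- 0.981492   # sample correlation printed above
n <- 30         # number of tweets, so df = n - 2 = 28

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 2)
round(ci, 4)
```

Both results should agree with the `cor.test()` output up to rounding of the printed correlation.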

Since retweetCount and favoriteCount are highly correlated (r ≈ 0.98), we shall focus on retweetCount. Let’s now find the top 5 users by the total number of retweets across their popular tweets.

# Top 5 users by total number of retweets across their tweets
tweetsdf %>% select(screenName, retweetCount) %>%
  group_by(screenName) %>%
  summarise_each(funs(sum)) %>%
  arrange(desc(retweetCount)) %>%
  top_n(5)
## Selecting by retweetCount
## Source: local data frame [5 x 2]
## 
##      screenName retweetCount
## 1       alfpang         3625
## 2 SyakirahNasri         3183
## 3  asonofapeach         2724
## 4 juicyjuleswei         2342
## 5     adibjalal         1641
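
Ranking users by summed retweets rewards many moderately popular tweets as well as a single viral one. A small base-R sketch with made-up numbers (the `toy` data frame and its user names are hypothetical) shows how the total and the best single tweet can rank users differently:

```r
# Hypothetical data: user_a has three solid tweets, user_b one viral tweet
toy <- data.frame(
  screenName   = c("user_a", "user_a", "user_a", "user_b", "user_c"),
  retweetCount = c(400, 350, 300, 900, 120),
  stringsAsFactors = FALSE
)

# Per-user total and per-user maximum
totals <- aggregate(retweetCount ~ screenName, data = toy, FUN = sum)
best   <- aggregate(retweetCount ~ screenName, data = toy, FUN = max)

# Order each summary from largest to smallest
totals <- totals[order(-totals$retweetCount), ]
best   <- best[order(-best$retweetCount), ]

totals$screenName[1]  # leader by total retweets
best$screenName[1]    # leader by best single tweet
```

The real data shows the same pattern: alfpang leads on total retweets (3625) with 8 popular tweets, even though other users have individual tweets with far more retweets.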

Let’s now look at the top 5 tweets based on retweet count.

# Top 5 tweets by highest number of retweets 
top5Tweets <- tweetsdf %>% select(screenName,id,retweetCount) %>% 
  arrange(desc(retweetCount)) %>% 
  top_n(5)
## Selecting by retweetCount
top5Tweets
## Source: local data frame [5 x 3]
## 
##      screenName                 id retweetCount
## 1 SyakirahNasri 566537040909434881         3183
## 2 juicyjuleswei 566513157405806592         2342
## 3  asonofapeach 566467904720236545         1795
## 4     ahbengpls 566805714094411776         1613
## 5       alfpang 565752652286267392         1274
# Direct Links to the top 5 tweets
paste("http://twitter.com/",top5Tweets$screenName,"/status/", top5Tweets$id, sep="")
## [1] "http://twitter.com/SyakirahNasri/status/566537040909434881"
## [2] "http://twitter.com/juicyjuleswei/status/566513157405806592"
## [3] "http://twitter.com/asonofapeach/status/566467904720236545" 
## [4] "http://twitter.com/ahbengpls/status/566805714094411776"    
## [5] "http://twitter.com/alfpang/status/565752652286267392"
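
The link-building step can also be written with `sprintf()`, which keeps the URL template in one readable string. The sketch below uses the first two rows of the table above and assumes the `id` column is stored as a character string; the `https` scheme is my substitution for the `http` used above.

```r
# First two rows of the top-5 table, with ids kept as character strings
top5 <- data.frame(
  screenName = c("SyakirahNasri", "juicyjuleswei"),
  id         = c("566537040909434881", "566513157405806592"),
  stringsAsFactors = FALSE
)

# One template, filled in row by row (sprintf is vectorised)
urls <- sprintf("https://twitter.com/%s/status/%s", top5$screenName, top5$id)
urls

# Why ids should stay as strings: they exceed 2^53, so two neighbouring
# ids collapse to the same double-precision number
collides <- as.numeric("566537040909434881") == as.numeric("566537040909434882")
collides
```

Converting ids to numeric would therefore risk producing links that point at the wrong tweet.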

Here are the top 5 tweets: