SG elections – a Twitter snapshot

With election fever under way in Singapore, let’s take a snapshot of the popular tweets with the hashtag #ge2015 at this point in time.

library(twitteR)

consumerKey <- readLines("twitterkey.txt")
consumerSecret <- readLines("twittersecret.txt")
accessToken <- readLines("twitteraccesstoken.txt")
accessTokenSecret <- readLines("twitteraccesstokensecret.txt")

setup_twitter_oauth(consumerKey,consumerSecret,accessToken,accessTokenSecret)
## [1] "Using direct authentication"
tweets <- searchTwitter("#ge2015", resultType="popular", n=100)

tweetsdf <- twListToDF(tweets)

library(dplyr)
tweetsdf <- tbl_df(tweetsdf)
glimpse(tweetsdf)
## Observations: 31
## Variables:
## $ text          (chr) "#GE2015: PAP candidate Koh Poh Koon performs CP...
## $ favorited     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount (dbl) 113, 32, 58, 49, 7, 16, 19, 2, 8, 6, 8, 168, 17,...
## $ replyToSN     (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ created       (time) 2015-09-06 04:43:24, 2015-09-06 04:01:11, 2015-...
## $ truncated     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ replyToSID    (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ id            (chr) "640384783080538112", "640374159881601025", "640...
## $ replyToUID    (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ statusSource  (chr) "<a href=\"https://about.twitter.com/products/tw...
## $ screenName    (chr) "STcom", "TODAYonline", "YamKeng", "YamKeng", "T...
## $ retweetCount  (dbl) 293, 67, 75, 92, 45, 24, 12, 11, 10, 17, 9, 476,...
## $ isRetweet     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude     (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude      (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Let’s first look at the top contributors to these popular tweets.

# There are users with several popular tweets
tweetsdf %>% select(screenName) %>%
  group_by(screenName) %>%
  summarise(count=n()) %>%
  arrange(desc(count))
## Source: local data frame [7 x 2]
## 
##     screenName count
## 1      mrbrown     9
## 2        STcom     6
## 3  TODAYonline     6
## 4         wpsg     6
## 5      YamKeng     2
## 6 LizforLeader     1
## 7 PAPSingapore     1

From the counts, we can see that among the popular tweets in this snapshot, the highest number comes from the user ‘mrbrown’, the Twitter handle of blogger Mr Brown.

An outlier here is the user ‘LizforLeader’, the official Twitter account of Liz Kendall’s campaign to become leader of the Labour Party in the UK. Since it is not relevant to the Singapore election, it is filtered out in the steps that follow.

# retweets of popular tweets per user
retweetsByUsers <- tweetsdf %>% 
  select(screenName, retweetCount) %>%
  filter(screenName != 'LizforLeader') %>%
  group_by(screenName) %>%
  summarise(retweetCount = sum(retweetCount)) %>%
  arrange(desc(retweetCount))

retweetsByUsers %>%
    mutate(percentage = round(retweetCount/sum(retweetCount)*100, digits=2))
## Source: local data frame [6 x 3]
## 
##     screenName retweetCount percentage
## 1      mrbrown         2118      61.34
## 2        STcom          449      13.00
## 3  TODAYonline          347      10.05
## 4 PAPSingapore          220       6.37
## 5      YamKeng          167       4.84
## 6         wpsg          152       4.40

If we look at the combined retweet counts per user, we can see that the majority of retweets in this snapshot (61.34%) are of Mr Brown’s tweets.

Let’s proceed to look at the top 5 tweets by retweet count, in descending order, at this point in time.

# Top 5 tweets by highest number of retweets 
top5Tweets <- tweetsdf %>% select(screenName,id,retweetCount) %>%
  filter(screenName != 'LizforLeader') %>%
  arrange(desc(retweetCount)) %>% 
  top_n(5)
## Selecting by retweetCount
top5Tweets
## Source: local data frame [5 x 3]
## 
##   screenName                 id retweetCount
## 1    mrbrown 639654658827423744          476
## 2    mrbrown 638647979021234176          415
## 3      STcom 640384783080538112          293
## 4    mrbrown 639105775214792706          274
## 5    mrbrown 638705015981379586          224
# Direct Links to the top 5 tweets
paste("http://twitter.com/",top5Tweets$screenName,"/status/", top5Tweets$id, sep="")
## [1] "http://twitter.com/mrbrown/status/639654658827423744"
## [2] "http://twitter.com/mrbrown/status/638647979021234176"
## [3] "http://twitter.com/STcom/status/640384783080538112"  
## [4] "http://twitter.com/mrbrown/status/639105775214792706"
## [5] "http://twitter.com/mrbrown/status/638705015981379586"

Not surprisingly, 4 of the top 5 tweets came from Mr Brown. With a few more days to go before election day, the results will probably change. For now, you can view the top 5 tweets in this snapshot via the direct links above.

Definition of Statistical Significance

Note to self: Statistical Significance as defined by Andrew Gelman in his blog post.

Statistical Significance

Definition: A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.
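
To make the definition concrete, here is a minimal Python sketch of a p-value computed by hand. The scenario (60 heads in 100 tosses, with a fair coin as the null hypothesis) and the function name one_sided_p_value are illustrative assumptions of mine, not something taken from Gelman’s post.

```python
from math import comb

def one_sided_p_value(heads, tosses):
    """P(at least `heads` heads in `tosses` tosses of a fair coin).

    The null hypothesis (an assumption for this example) is that the coin
    is fair, so every sequence of tosses has probability 0.5 ** tosses.
    """
    # Count the sequences that are as extreme as the observation or more so.
    favourable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favourable / 2 ** tosses

# Hypothetical observation: 60 heads out of 100 tosses.
p = one_sided_p_value(60, 100)
print("p-value: %.4f" % p)
print("below the conventional 0.05 threshold:", p < 0.05)
```

The sum covers every outcome “as strong as observed or greater”, which is exactly the tail the definition describes. The smaller the result, the less consistent the observed count is with a fair coin.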

Getting session information in Python

We’ve gone through how to get session information in R previously, so how do we do the same for Python? There seems to be no single convenient function available, so here’s one approach.

To get the system information, you can utilize the commonly used IPython package:

import IPython
print(IPython.sys_info())
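
If IPython is not installed, similar details can be gathered from the standard library. This is only a rough sketch: the helper name basic_system_info and the choice of fields are my own assumptions, and they only approximate what IPython reports.

```python
import platform
import sys

def basic_system_info():
    """Collect a few interpreter and OS details using only the standard library."""
    return {
        'python_version': platform.python_version(),
        'python_implementation': platform.python_implementation(),
        'platform': platform.platform(),
        'executable': sys.executable,
    }

# Print one field per line, sorted by field name.
for key, value in sorted(basic_system_info().items()):
    print('%s: %s' % (key, value))
```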

To find out which packages have been loaded at the time (including modules loaded by Python itself and by any Python IDE), you can use the sys.modules.keys() method. The code below keeps only the top-level package name rather than each submodule.

import sys
packages = set()
for name in sys.modules.keys():
    packages.add(name.split('.')[0])

print(sorted(packages))
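
Package names alone are often not enough when troubleshooting, so a possible extension is to record versions too. The sketch below assumes a package exposes its version through a __version__ attribute, which is a common convention rather than a guarantee; packages without one are skipped. The function name loaded_package_versions is hypothetical.

```python
import sys

def loaded_package_versions():
    """Map each loaded top-level package to its __version__ string, if it has one."""
    versions = {}
    # Copy the keys first, since sys.modules can change while we iterate.
    for name in list(sys.modules):
        top = name.split('.')[0]
        module = sys.modules.get(top)
        version = getattr(module, '__version__', None)
        # Only keep plain string versions; anything else is ignored.
        if isinstance(version, str):
            versions[top] = version
    return versions

for package, version in sorted(loaded_package_versions().items()):
    print('%s %s' % (package, version))
```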

Getting session information in R

When troubleshooting R bugs or asking for assistance in mailing lists and sites like StackOverflow, it is good to review or present information about your system and packages loaded.

I much prefer the session_info() function from the devtools package over the default sessionInfo() function, as its output is not only more readable but also includes useful information such as the timezone and the additional (non-base) packages loaded at the time.

Assuming you have the devtools package already installed, you can invoke the function in one line:

devtools::session_info()