Sentiment Analysis of Bernie Sanders Tweets
Project Goal
The goal of this project is to perform basic text processing on tweets. The data was obtained through the Twitter API, which allowed the Bernie Sanders tweets to be retrieved from the BernieSanders Twitter account.
Twitter API
Once you have a Twitter developer account, you can create a Twitter app to generate the API keys, access tokens and secrets. In order to retrieve the tweets you need the following:
Consumer Key
Consumer Secret
Access Token
Access Token Secret
All of these can be accessed once you have created an app in the Twitter Developer portal. Once you have them, you can assign them to R objects as below and authenticate using setup_twitter_oauth():
library(twitteR)
library(ROAuth)
consumer_key <- 'your consumer key'
consumer_secret <- 'your consumer secret'
access_token <- 'your access token'
access_secret <- 'your access secret'
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
                    access_secret)
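If the handshake succeeds, a quick sanity check is to look up the account before pulling anything. A minimal, optional sketch using getUser() from twitteR (the screen name is the same one we query below):
# Optional sanity check: confirm authentication by looking up the account
bernie <- getUser('BernieSanders')
bernie$getScreenName()   # should return "BernieSanders" if the OAuth setup worked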
The twitteR package has parameters that enable us to retrieve data from a specific user. In order to do so we have to specify the Twitter ID and the number of tweets; in this case we specify ‘BernieSanders’ and 583 tweets, covering 2020-10-19 to 2021-08-17 (Twitter policy changes may affect the number of tweets you can pull):
twitter_user <- 'BernieSanders'
twitter_max <- 583
Next we can use the userTimeline() function to download the timeline (tweets) from the specified twitter_user. The function also lets us choose whether to retrieve retweets and replies by setting the relevant parameters. Consult the help documentation of the twitteR package for more details.
tweets <- userTimeline(twitter_user, n = twitter_max, includeRts = FALSE)
# Get the number of tweets pulled:
length(tweets)
# Convert tweets to a data frame:
tweets.df <- twListToDF(tweets)
# save the downloaded tweets into a csv for future use
file_name = "Bernie_tweets.csv"
write.csv(tweets.df, file = file_name)
Once you have retrieved the tweets to your local environment or saved them as a csv file, you are ready to start exploring and pre-processing the data.
Data Pre-processing
Load the Bernie Sanders tweets you previously saved:
tweets.df <- read.csv('Bernie_tweets.csv', stringsAsFactors = FALSE)
Let’s take a quick look at some of the tweets; for instance, say we are interested in the 150th tweet:
# Tweet number 150
display_n <- 150
tweets.df[display_n, c("text")]
“The way to rebuild the crumbling middle class in this country is by growing the trade union movement.”
Next, in order to work with the text data properly, we have to create a corpus. We can do this with the Corpus() function from the tm package, which allows us to specify the source to be a character vector. We assign the result to a new object, myCorpus:
library(tm)
myCorpus_raw <- Corpus(VectorSource(tweets.df$text))
myCorpus <- myCorpus_raw
myCorpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 583
Using the lapply function we can index the corpus object, returning the first three tweets:
lapply(myCorpus[1:3], as.character)
## [[1]]
## [1] "With our budget reconciliation legislation, we will make the largest investment in America's working families in de… https://t.co/7Y2PyELysJ"
##
## [[2]]
## [1] "A government which responds to the needs of its people and not just the interests of corporations is what we must d… https://t.co/qS1ObEpsM4"
##
## [[3]]
## [1] "Direct payments as a result of the expanded Child Tax Credit are now in month two. In that time:\n- child poverty ha… https://t.co/1Js1tsu6bx"
Now, we would like to remove non-graphical characters, remove non-English words, and remove URLs.
Non-graphical characters:
To replace non-graphical characters we will define an operation, toSpace, that replaces a given pattern with a space. It uses the gsub function, wrapped in content_transformer():
toSpace <- content_transformer(function(x, pattern) {return(gsub(pattern, " ", x))})
In order to apply this to all non-visible characters we will use ‘[^[:graph:]]’, a regular expression that matches all non-visible characters (follow this link for more information on regular expressions):
myCorpus <- tm_map(myCorpus, toSpace, "[^[:graph:]]")
# Convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
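To see what this transformation does on its own, here is the same gsub() pattern applied to a made-up string (illustrative only; the text is not from the data):
# Illustrative only: non-graphical characters (here a newline) become spaces
gsub("[^[:graph:]]", " ", "Healthcare is a human right.\nPeriod.")
## [1] "Healthcare is a human right. Period."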
Removing URLs
To remove the URLs we use ‘[:space:]’, another regular expression, this time for whitespace. The idea is to match ‘http’ followed by ‘[^[:space:]]’ (non-space) zero or more times, as a way to identify a URL. We will refer to this operation as removeURL and apply it to our corpus object using the tm_map function:
<- function(x) gsub("http[^[:space:]]*", "", x)
removeURL <- tm_map(myCorpus, content_transformer(removeURL)) myCorpus
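As a quick check, removeURL() can be run on a plain string (the example string is made up, not from the data):
# Illustrative only: anything starting with "http" up to the next space is dropped
removeURL("Join the picket line https://t.co/abc123 this afternoon")
## [1] "Join the picket line  this afternoon"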
Removing Non-English words
To remove everything except English letters and spaces, we take advantage of the ‘[^[:alpha:][:space:]]*’ regular expression, assigning this operation to removeNumPunct and then applying it to our corpus as in the previous step:
<- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
removeNumPunct <- tm_map(myCorpus, content_transformer(removeNumPunct))
myCorpus # remove stopwords
<- tm_map(myCorpus, removeWords, stopwords())
myCorpus # remove extra whitespace
<- tm_map(myCorpus, stripWhitespace) myCorpus
Text stemming
Text stemming reduces words to their root form; the SnowballC package allows us to do so:
library("SnowballC")
myCorpus <- tm_map(myCorpus, stemDocument)
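To get a feel for what the stemmer does, wordStem() from SnowballC can be applied to a few standalone words (illustrative only; note how the stems match terms such as ‘famili’ and ‘countri’ that appear later in the analysis):
# Illustrative only: a few words and their Porter stems
wordStem(c("families", "working", "countries"))
## [1] "famili"  "work"    "countri"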
Term Document Matrix
Our final data preparation step is to build a term document matrix. In this matrix each word is a row and each column, a tweet:
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
tdm
## <<TermDocumentMatrix (terms: 1602, documents: 583)>>
## Non-/sparse entries: 6005/927961
## Sparsity : 99%
## Maximal term length: 22
## Weighting : term frequency (tf)
nrow(tdm) # number of words
## [1] 1602
ncol(tdm) # number of tweets
## [1] 583
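To see what the matrix itself looks like, inspect() from tm can be applied to a small slice, for example the first five terms and the first five tweets (the indices here are arbitrary):
# Peek at a 5 x 5 corner of the term-document matrix
inspect(tdm[1:5, 1:5])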
We can inspect the frequent words by specifying a frequency threshold; in this case we will set it to 30.
freq_thre = 30
# First few words:
head((freq.terms <- findFreqTerms(tdm, lowfreq = freq_thre)))
## [1] "america" "famili" "will" "work" "just" "must"
Next we want to calculate the word frequency using the rowSums function, and filter the matrix to only include words that appear at least as often as the specified threshold:
# calculate the word frequency
term.freq <- rowSums(as.matrix(tdm))
# only keep the frequencies of words (terms) that appear at least freq_thre times
term.freq <- subset(term.freq, term.freq >= freq_thre)
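Before plotting, it can be handy to glance at the most frequent terms directly by sorting the filtered frequencies (a small optional step):
# Top 10 terms by frequency, most frequent first
head(sort(term.freq, decreasing = TRUE), 10)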
We may wish to plot the word frequencies:
library(ggplot2)
# select the words (terms) that appear at least freq_thre times, according to term.freq
df <- data.frame(term = names(term.freq), freq = term.freq)
p1 <- ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text = element_text(size = 7)) + theme_light()
p1
Associations
We may want to consider which words are associated with a specific word. We can do so by specifying our word of interest and a correlation limit, passing these parameters to the findAssocs() function from the tm package:
word <- 'covid'
cor_limit <- 0.2
(findAssocs(tdm, word, cor_limit))
## $covid
## roll virus back unless u
## 0.50 0.50 0.37 0.35 0.35
## relief hazard usprogress persp framework
## 0.34 0.25 0.25 0.25 0.25
## dysfunct oligarchi woman administratio cor
## 0.25 0.25 0.25 0.25 0.25
## save
## 0.21
Topic Modelling
In order to conduct topic modelling, to try to identify themes in the tweets, we will use the topicmodels package. We must first convert our tdm object into a document term matrix, where each row is a document (i.e. a tweet) and each column a term. Then we can specify the number of topics and terms that we are interested in:
library(topicmodels)
dtm <- as.DocumentTermMatrix(tdm)
topic_num = 6
term_num = 2
rowTotals <- apply(dtm, 1, sum)   # find the sum of words in each document (tweet)
dtm <- dtm[rowTotals > 0, ]       # remove all docs without words
lda <- LDA(dtm, k = topic_num)    # find k topics
term <- terms(lda, term_num)      # first term_num terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
## Topic 1 Topic 2 Topic 3 Topic 4
## "american, work" "work, peopl" "today, worker" "right, now"
## Topic 5 Topic 6
## "will, one" "countri, wage"
Sentiment Analysis
Sentiment analysis can be useful to get an idea of the extent to which the tweets in question can be considered negative, positive, or neutral. To do so we will use the sentiment package. In order to find the sentiment we will use the raw text, which is in our tweets.df object:
library(sentiment)
# use the raw text for sentiment analysis
sentiments <- sentiment(tweets.df$text)
table(sentiments$polarity)
##
## negative neutral positive
## 40 502 41
We may wish to visualise the sentiment over time, where negative values are associated with negative sentiment, positive values with positive sentiment, and neutral is 0:
library(data.table)
library(dplyr)   # provides the %>% pipe used below
library(plotly)  # provides ggplotly()
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(tweets.df$created)
result <- aggregate(score ~ date, data = sentiments, sum)
# plot the scores
p2 <- result %>% ggplot(aes(date, score)) + geom_line() + theme_light()
ggplotly(p2)
As we can see, there seems to be a peak in positive sentiment in Bernie’s tweets between May and July, on 2021-06-09 to be specific. We might want to check these tweets more closely to see why this was the case; filtering on the date with a regular expression allows us to do so:
filter(tweets.df, grepl('2021-06-09',created)) %>% select(text)
## text
## 1 Why should we be surprised that in a corrupt system, the very, very richest people in this country pay, in a given… https://t.co/BEckXNh2M4
## 2 I’m proud to endorse @mike4brooklyn and @disruptionary for City Council who will always put working people before t… https://t.co/CDLd28NVUH
## 3 I’m joining with @AOC and @nycDSA to endorse @Adolfo4Council, @tiffany_caban, @jaslinforqueens, @alexaforcouncil fo… https://t.co/5RRAiB1F7B
## 4 For Brooklyn Borough President, I’m proud to endorse @ReynosoBrooklyn who is a proven progressive champion fighting… https://t.co/nzqfKfTMkf
## 5 For Manhattan District Attorney, I’m proud to endorse @TahanieNYC who will lead the wave of progressive prosecutors… https://t.co/opBy1i7Gvg
## 6 For New York City Comptroller, I'm endorsing @bradlander because he will stand up to special interests and demand t… https://t.co/jA2H5WIRtY
## 7 For New York City Public Advocate, I'm proud to endorse @JumaaneWilliams who is a tireless fighter for working peop… https://t.co/xXzC5HJN7C
## 8 In just two weeks, New Yorkers have the chance to elect candidates who understand that working people are hurting a… https://t.co/APii5WGIml
Emotion Lexicon
Alternatively we may use the emotion lexicon, which is a list of words and their respective associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy and disgust) and two sentiments (negative and positive). For this we will use the syuzhet package. It is important that we use the cleaned data, as the functions we will use from this package cannot handle non-graphic data:
library(syuzhet)
# use the cleaned text for emotion analysis
# since get_nrc_sentiment cannot deal with non-graphic data
tweet_clean <- data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)
In contrast to the previous section, in this case we have a matrix, where each row is a document (i.e. a tweet) and each column an emotion:
# each row: a document (a tweet); each column: an emotion
emotion_matrix <- get_nrc_sentiment(tweet_clean$text)
We need to reshape the matrix so that, instead of each row being a tweet and each column an emotion, each row is an emotion and each column a tweet. To do so we transpose the matrix.
# Matrix Transpose
td <- data.frame(t(emotion_matrix))
# rowSums totals each emotion's count across all of the tweets
td_new <- data.frame(rowSums(td))
td_new
## rowSums.td.
## anger 173
## anticipation 324
## disgust 72
## fear 145
## joy 240
## sadness 186
## surprise 144
## trust 361
## negative 360
## positive 525
Finally some transformation and cleaning is necessary, in particular to rename and arrange the columns:
#Transformation and cleaning
names(td_new)[1] <- "count"
<- cbind("sentiment" = rownames(td_new), td_new)
td_new rownames(td_new) <- NULL
::kable(td_new) knitr
| sentiment    | count |
|--------------|-------|
| anger        | 173   |
| anticipation | 324   |
| disgust      | 72    |
| fear         | 145   |
| joy          | 240   |
| sadness      | 186   |
| surprise     | 144   |
| trust        | 361   |
| negative     | 360   |
| positive     | 525   |
Finally, we may wish to visualise the emotions and sentiments; we add an interactive element here using the plotly package:
library(plotly)
library(ggplot2)
sentiment_plot <- qplot(sentiment, data = td_new, weight = count, geom = "bar", fill = sentiment) +
  ggtitle("Tweets emotion and sentiment") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(sentiment_plot)
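Note that qplot() is deprecated in newer releases of ggplot2; an equivalent plot can be built with ggplot() and geom_col(), for example (a sketch using the same td_new data frame):
# Equivalent plot without qplot(): one bar per emotion/sentiment, height = count
sentiment_plot2 <- ggplot(td_new, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_col() +
  ggtitle("Tweets emotion and sentiment") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(sentiment_plot2)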
Session Info
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gridGraphics_0.5-1 plotly_4.9.4.1 syuzhet_1.0.6 data.table_1.14.0
## [5] sentiment_1.0 plyr_1.8.6 rjson_0.2.20 RCurl_1.98-1.3
## [9] topicmodels_0.2-12 SnowballC_0.7.0 tm_0.7-8 NLP_0.2-1
## [13] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [17] readr_2.0.0 tidyr_1.1.3 tibble_3.1.3 ggplot2_3.3.5
## [21] tidyverse_1.3.1 ROAuth_0.9.6 twitteR_1.1.9
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 fs_1.5.0 lubridate_1.7.10 bit64_4.0.5
## [5] httr_1.4.2 tools_4.1.0 backports_1.2.1 bslib_0.2.5.1
## [9] utf8_1.2.2 R6_2.5.0 DBI_1.1.1 lazyeval_0.2.2
## [13] colorspace_2.0-2 withr_2.4.2 tidyselect_1.1.1 bit_4.0.4
## [17] compiler_4.1.0 cli_3.0.1 rvest_1.0.0 pacman_0.5.1
## [21] xml2_1.3.2 labeling_0.4.2 bookdown_0.22 slam_0.1-48
## [25] sass_0.4.0 scales_1.1.1 digest_0.6.27 rmarkdown_2.9
## [29] pkgconfig_2.0.3 htmltools_0.5.1.1 highr_0.9 dbplyr_2.1.1
## [33] htmlwidgets_1.5.3 rlang_0.4.11 readxl_1.3.1 rstudioapi_0.13
## [37] farver_2.1.0 jquerylib_0.1.4 generics_0.1.0 jsonlite_1.7.2
## [41] crosstalk_1.1.1 magrittr_2.0.1 modeltools_0.2-23 Rcpp_1.0.7
## [45] munsell_0.5.0 fansi_0.5.0 lifecycle_1.0.0 stringi_1.7.3
## [49] yaml_2.2.1 parallel_4.1.0 crayon_1.4.1 haven_2.4.1
## [53] hms_1.1.0 knitr_1.33 pillar_1.6.2 stats4_4.1.0
## [57] reprex_2.0.0 glue_1.4.2 evaluate_0.14 modelr_0.1.8
## [61] vctrs_0.3.8 rmdformats_1.0.2 tzdb_0.1.2 cellranger_1.1.0
## [65] gtable_0.3.0 assertthat_0.2.1 xfun_0.24 broom_0.7.8
## [69] viridisLite_0.4.0 ellipsis_0.3.2