Sentiment Analysis of Bernie Sanders Tweets
Project Goal
The goal of this project is to perform basic text processing on tweets. The data was obtained through the Twitter API, which allowed the Bernie Sanders tweets to be retrieved from the BernieSanders Twitter account.
Twitter API
Once you have a Twitter developer account, you can create a Twitter app to generate the API keys, access tokens and secrets. In order to retrieve the tweets you need the following:
Consumer Key
Consumer Secret
Access Token
Access Token Secret
All of these can be accessed once you have created an app in the Twitter Developer portal. Once you have them, you can assign them to R objects as below and authenticate using setup_twitter_oauth():
library(twitteR)
library(ROAuth)
consumer_key <- 'your consumer key'
consumer_secret <- 'your consumer secret'
access_token <- 'your access token'
access_secret <- 'your access secret'
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
                    access_secret)
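If the handshake succeeds, a quick sanity check is to look up the account before pulling anything. A minimal, optional sketch using getUser() from twitteR (the screen name is the same one we query below):
# Optional sanity check: confirm authentication by looking up the account
bernie <- getUser('BernieSanders')
bernie$getScreenName()   # should return "BernieSanders" if the OAuth setup worked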
The twitteR package has parameters that enable us to retrieve data from a specific user. In order to do so we have to specify the Twitter ID and the number of tweets; in this case we specify ‘BernieSanders’ and 583 tweets, covering 2020-10-19 to 2021-08-17 (Twitter policy changes may affect the number of tweets you can pull):
twitter_user <- 'BernieSanders'
twitter_max <- 583
Next we can use the userTimeline() function to download the timeline (tweets) from the specified twitter_user. The function also lets us choose whether to retrieve retweets and replies by setting the relevant parameters. Consult the help documentation of the twitteR package for more details.
tweets <- userTimeline(twitter_user, n = twitter_max, includeRts = FALSE)
# Get the number of tweets pulled:
length(tweets)
# Convert tweets to a data frame:
tweets.df <- twListToDF(tweets)
# save the downloaded tweets into a csv for future use
file_name = "Bernie_tweets.csv"
write.csv(tweets.df, file = file_name)
Once you have retrieved the tweets to your local environment or saved them as a csv file, you are ready to start exploring and pre-processing the data.
Data Pre-processing
Load the Bernie Sanders tweets you previously saved:
tweets.df <- read.csv('Bernie_tweets.csv', stringsAsFactors = FALSE)
Let’s take a quick look at some of the tweets; for instance, say we are interested in the 150th tweet:
# Tweet number 150
display_n <- 150
tweets.df[display_n, c("text")]
“The way to rebuild the crumbling middle class in this country is by growing the trade union movement.”
Next, in order to work with the text data properly, we have to create a corpus. We can do this with the Corpus() function from the tm package, which allows us to specify the source to be a character vector. We assign the result to a new object, myCorpus:
library(tm)
myCorpus_raw <- Corpus(VectorSource(tweets.df$text))
myCorpus <- myCorpus_raw
myCorpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 583
Using the lapply function we can index the corpus object, returning the first three tweets:
lapply(myCorpus[1:3], as.character)
## [[1]]
## [1] "With our budget reconciliation legislation, we will make the largest investment in America's working families in de… https://t.co/7Y2PyELysJ"
##
## [[2]]
## [1] "A government which responds to the needs of its people and not just the interests of corporations is what we must d… https://t.co/qS1ObEpsM4"
##
## [[3]]
## [1] "Direct payments as a result of the expanded Child Tax Credit are now in month two. In that time:\n- child poverty ha… https://t.co/1Js1tsu6bx"
Now, we would like to remove non-graphical characters, remove non-English words, and remove URLs.
Non-graphical characters:
To replace non-graphical characters we will define an operation, toSpace, that replaces a given pattern with a space. It uses the gsub function, wrapped in content_transformer():
toSpace <- content_transformer(function(x, pattern) {return(gsub(pattern, " ", x))})
In order to apply this to all non-visible characters we will use ‘[^[:graph:]]’, a regular expression that matches all non-visible characters (follow this link for more information on regular expressions):
myCorpus <- tm_map(myCorpus, toSpace, "[^[:graph:]]")
# Convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
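To see what this transformation does on its own, here is the same gsub() pattern applied to a made-up string (illustrative only; the text is not from the data):
# Illustrative only: non-graphical characters (here a newline) become spaces
gsub("[^[:graph:]]", " ", "Healthcare is a human right.\nPeriod.")
## [1] "Healthcare is a human right. Period."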
Removing URLs
To remove the URLs we use ‘[:space:]’, another regular expression, this time for whitespace. The idea is to match ‘http’ followed by ‘[^[:space:]]’ (non-space) zero or more times, as a way to identify a URL. We will refer to this operation as removeURL and apply it to our corpus object using the tm_map function:
<- function(x) gsub("http[^[:space:]]*", "", x)
removeURL <- tm_map(myCorpus, content_transformer(removeURL)) myCorpus
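As a quick check, removeURL() can be run on a plain string (the example string is made up, not from the data):
# Illustrative only: anything starting with "http" up to the next space is dropped
removeURL("Join the picket line https://t.co/abc123 this afternoon")
## [1] "Join the picket line  this afternoon"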
Removing Non-English words
To remove everything except English letters and spaces, we take advantage of the ‘[^[:alpha:][:space:]]*’ regular expression, assigning this operation to removeNumPunct and then applying it to our corpus as in the previous step:
<- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
removeNumPunct <- tm_map(myCorpus, content_transformer(removeNumPunct))
myCorpus # remove stopwords
<- tm_map(myCorpus, removeWords, stopwords())
myCorpus # remove extra whitespace
<- tm_map(myCorpus, stripWhitespace) myCorpus
Text stemming
Text stemming reduces words to their root form; the SnowballC package allows us to do so:
library("SnowballC")
myCorpus <- tm_map(myCorpus, stemDocument)
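To get a feel for what the stemmer does, wordStem() from SnowballC can be applied to a few standalone words (illustrative only; note how the stems match terms such as ‘famili’ and ‘countri’ that appear later in the analysis):
# Illustrative only: a few words and their Porter stems
wordStem(c("families", "working", "countries"))
## [1] "famili"  "work"    "countri"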
Term Document Matrix
Our final data preparation step is to build a term document matrix. In this matrix each word is a row and each column, a tweet:
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
tdm
## <<TermDocumentMatrix (terms: 1602, documents: 583)>>
## Non-/sparse entries: 6005/927961
## Sparsity : 99%
## Maximal term length: 22
## Weighting : term frequency (tf)
nrow(tdm) # number of words
## [1] 1602
ncol(tdm) # number of tweets
## [1] 583
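To see what the matrix itself looks like, inspect() from tm can be applied to a small slice, for example the first five terms and the first five tweets (the indices here are arbitrary):
# Peek at a 5 x 5 corner of the term-document matrix
inspect(tdm[1:5, 1:5])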
We can inspect the frequent words by specifying a frequency threshold; in this case we will set it to 30.
freq_thre = 30
# First few words:
head((freq.terms <- findFreqTerms(tdm, lowfreq = freq_thre)))
## [1] "america" "famili" "will" "work" "just" "must"
Next we want to calculate the word frequency using the rowSums function, and filter the matrix to only include words that appear at least as often as the specified threshold:
# calculate the word frequency
term.freq <- rowSums(as.matrix(tdm))
# only keep the frequencies of words (terms) that appear at least freq_thre times
term.freq <- subset(term.freq, term.freq >= freq_thre)
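Before plotting, it can be handy to glance at the most frequent terms directly by sorting the filtered frequencies (a small optional step):
# Top 10 terms by frequency, most frequent first
head(sort(term.freq, decreasing = TRUE), 10)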
We may wish to plot the word frequencies:
library(ggplot2)
# select the words (terms) that appear at least freq_thre times, according to term.freq
df <- data.frame(term = names(term.freq), freq = term.freq)
p1 <- ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text = element_text(size = 7)) + theme_light()
p1
Associations
We may want to consider which words are associated with a specific word. We can do so by specifying our word of interest and a correlation limit, passing these parameters to the findAssocs() function from the tm package:
word <- 'covid'
cor_limit <- 0.2
(findAssocs(tdm, word, cor_limit))
## $covid
## roll virus back unless u
## 0.50 0.50 0.37 0.35 0.35
## relief hazard usprogress persp framework
## 0.34 0.25 0.25 0.25 0.25
## dysfunct oligarchi woman administratio cor
## 0.25 0.25 0.25 0.25 0.25
## save
## 0.21
Topic Modelling
In order to conduct topic modelling, to try to identify themes in the tweets, we will use the topicmodels package. We must first convert our tdm object into a document term matrix, where each row is a document (i.e. a tweet) and each column a term. Then we can specify the number of topics and terms that we are interested in:
library(topicmodels)
dtm <- as.DocumentTermMatrix(tdm)
topic_num = 6
term_num = 2
rowTotals <- apply(dtm, 1, sum)   # find the sum of words in each document (tweet)
dtm <- dtm[rowTotals > 0, ]       # remove all docs without words
lda <- LDA(dtm, k = topic_num)    # find k topics
term <- terms(lda, term_num)      # first term_num terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
## Topic 1 Topic 2 Topic 3 Topic 4
## "american, work" "work, peopl" "today, worker" "right, now"
## Topic 5 Topic 6
## "will, one" "countri, wage"
Sentiment Analysis
Sentiment analysis can be useful to get an idea of the extent to which the tweets in question can be considered negative, positive, or neutral. To do so we will use the sentiment package. In order to find the sentiment we will use the raw text, which is in our tweets.df object:
library(sentiment)
# use the raw text for sentiment analysis
sentiments <- sentiment(tweets.df$text)
table(sentiments$polarity)
##
## negative neutral positive
## 40 502 41
We may wish to visualise the sentiment over time, where negative values are associated with negative sentiment, positive values with positive sentiment, and neutral is 0:
library(data.table)
library(dplyr)   # provides the %>% pipe used below
library(plotly)  # provides ggplotly()
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(tweets.df$created)
result <- aggregate(score ~ date, data = sentiments, sum)
# plot the scores
p2 <- result %>% ggplot(aes(date, score)) + geom_line() + theme_light()
ggplotly(p2)
As we can see, there seems to be a peak in positive sentiment in Bernie’s tweets between May and July, on 2021-06-09 to be specific. We might want to check these tweets more closely to see why this was the case; filtering on the date with a regular expression allows us to do so:
filter(tweets.df, grepl('2021-06-09',created)) %>% select(text)
## text
## 1 Why should we be surprised that in a corrupt system, the very, very richest people in this country pay, in a given… https://t.co/BEckXNh2M4
## 2 I’m proud to endorse @mike4brooklyn and @disruptionary for City Council who will always put working people before t… https://t.co/CDLd28NVUH
## 3 I’m joining with @AOC and @nycDSA to endorse @Adolfo4Council, @tiffany_caban, @jaslinforqueens, @alexaforcouncil fo… https://t.co/5RRAiB1F7B
## 4 For Brooklyn Borough President, I’m proud to endorse @ReynosoBrooklyn who is a proven progressive champion fighting… https://t.co/nzqfKfTMkf
## 5 For Manhattan District Attorney, I’m proud to endorse @TahanieNYC who will lead the wave of progressive prosecutors… https://t.co/opBy1i7Gvg
## 6 For New York City Comptroller, I'm endorsing @bradlander because he will stand up to special interests and demand t… https://t.co/jA2H5WIRtY
## 7 For New York City Public Advocate, I'm proud to endorse @JumaaneWilliams who is a tireless fighter for working peop… https://t.co/xXzC5HJN7C
## 8 In just two weeks, New Yorkers have the chance to elect candidates who understand that working people are hurting a… https://t.co/APii5WGIml
Emotion Lexicon
Alternatively we may use the emotion lexicon, which is a list of words and their respective associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy and disgust) and two sentiments (negative and positive). For this we will use the syuzhet package. It is important that we use the cleaned data, as the functions we will use from this package cannot handle non-graphic data:
library(syuzhet)
# use the cleaned text for emotion analysis
# since get_nrc_sentiment cannot deal with non-graphic data
tweet_clean <- data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)
In contrast to the previous section, in this case we have a matrix, where each row is a document (i.e. a tweet) and each column an emotion:
# each row: a document (a tweet); each column: an emotion
emotion_matrix <- get_nrc_sentiment(tweet_clean$text)
We need to reshape the matrix so that, instead of each row being a tweet and each column an emotion, each row is an emotion and each column a tweet. To do so we transpose the matrix.
# Matrix Transpose
td <- data.frame(t(emotion_matrix))
# rowSums totals each emotion's count across all of the tweets
td_new <- data.frame(rowSums(td))
td_new
## rowSums.td.
## anger 173
## anticipation 324
## disgust 72
## fear 145
## joy 240
## sadness 186
## surprise 144
## trust 361
## negative 360
## positive 525
Finally some transformation and cleaning is necessary, in particular to rename and arrange the columns:
#Transformation and cleaning
names(td_new)[1] <- "count"
<- cbind("sentiment" = rownames(td_new), td_new)
td_new rownames(td_new) <- NULL
::kable(td_new) knitr
| sentiment    | count |
|--------------|-------|
| anger        | 173   |
| anticipation | 324   |
| disgust      | 72    |
| fear         | 145   |
| joy          | 240   |
| sadness      | 186   |
| surprise     | 144   |
| trust        | 361   |
| negative     | 360   |
| positive     | 525   |
Finally, we may wish to visualise the emotions and sentiments; we add an interactive element here using the plotly package:
library(plotly)
library(ggplot2)
sentiment_plot <- qplot(sentiment, data = td_new, weight = count, geom = "bar", fill = sentiment) +
  ggtitle("Tweets emotion and sentiment") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(sentiment_plot)
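Note that qplot() is deprecated in newer releases of ggplot2; an equivalent plot can be built with ggplot() and geom_col(), for example (a sketch using the same td_new data frame):
# Equivalent plot without qplot(): one bar per emotion/sentiment, height = count
sentiment_plot2 <- ggplot(td_new, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_col() +
  ggtitle("Tweets emotion and sentiment") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(sentiment_plot2)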
Session Info
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gridGraphics_0.5-1 plotly_4.9.4.1 syuzhet_1.0.6 data.table_1.14.0
## [5] sentiment_1.0 plyr_1.8.6 rjson_0.2.20 RCurl_1.98-1.3
## [9] topicmodels_0.2-12 SnowballC_0.7.0 tm_0.7-8 NLP_0.2-1
## [13] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [17] readr_2.0.0 tidyr_1.1.3 tibble_3.1.3 ggplot2_3.3.5
## [21] tidyverse_1.3.1 ROAuth_0.9.6 twitteR_1.1.9
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 fs_1.5.0 lubridate_1.7.10 bit64_4.0.5
## [5] httr_1.4.2 tools_4.1.0 backports_1.2.1 bslib_0.2.5.1
## [9] utf8_1.2.2 R6_2.5.0 DBI_1.1.1 lazyeval_0.2.2
## [13] colorspace_2.0-2 withr_2.4.2 tidyselect_1.1.1 bit_4.0.4
## [17] compiler_4.1.0 cli_3.0.1 rvest_1.0.0 pacman_0.5.1
## [21] xml2_1.3.2 labeling_0.4.2 bookdown_0.22 slam_0.1-48
## [25] sass_0.4.0 scales_1.1.1 digest_0.6.27 rmarkdown_2.9
## [29] pkgconfig_2.0.3 htmltools_0.5.1.1 highr_0.9 dbplyr_2.1.1
## [33] htmlwidgets_1.5.3 rlang_0.4.11 readxl_1.3.1 rstudioapi_0.13
## [37] farver_2.1.0 jquerylib_0.1.4 generics_0.1.0 jsonlite_1.7.2
## [41] crosstalk_1.1.1 magrittr_2.0.1 modeltools_0.2-23 Rcpp_1.0.7
## [45] munsell_0.5.0 fansi_0.5.0 lifecycle_1.0.0 stringi_1.7.3
## [49] yaml_2.2.1 parallel_4.1.0 crayon_1.4.1 haven_2.4.1
## [53] hms_1.1.0 knitr_1.33 pillar_1.6.2 stats4_4.1.0
## [57] reprex_2.0.0 glue_1.4.2 evaluate_0.14 modelr_0.1.8
## [61] vctrs_0.3.8 rmdformats_1.0.2 tzdb_0.1.2 cellranger_1.1.0
## [65] gtable_0.3.0 assertthat_0.2.1 xfun_0.24 broom_0.7.8
## [69] viridisLite_0.4.0 ellipsis_0.3.2