Install required libraries.

install.packages("tm") #for text cleaning
install.packages("topicmodels") #for topic modeling
install.packages("LDAvis") #for visualizing topic models 
install.packages("servr") #for visualizing topic models 
install.packages("stringi") 
install.packages("dplyr") 

Let’s load the necessary libraries.

library(tm)
library(topicmodels)
library(LDAvis)
library(servr)
library(dplyr)
library(stringi) 

Load the text data (using isis_tweets.csv as an example). Make sure the csv file is in your current working directory.

#load the csv file
alltweets <- read.csv("isis_tweets.csv", header = TRUE) 
alltweets <- head(alltweets, 3000) #keep the first 3000 tweets (R counts rows from 1, not 0)
#extract data from the column 'text' so that each tweet is a document
corpus <- iconv(alltweets$text, to = "ASCII", sub = "")
#do a quick check of the loaded data. See the 125th document (tweet) 
corpus[125]
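
The iconv call above drops every character that has no ASCII equivalent (emoji, accented letters, Arabic script), because sub = "" substitutes nothing for the bytes it cannot convert. A minimal sketch on two made-up strings; sample_text and ascii_only are names introduced here for illustration, and from = "UTF-8" is set explicitly only because the sample string is written as UTF-8:

```r
#two made-up strings; the first contains a non-ASCII character (e with an acute accent)
sample_text <- c("caf\u00e9 society", "plain ascii")

#bytes that cannot be converted to ASCII are replaced with "" (that is, deleted)
ascii_only <- iconv(sample_text, from = "UTF-8", to = "ASCII", sub = "")
ascii_only
```

Strings that are already plain ASCII pass through unchanged, so the step only affects tweets that contain other scripts or symbols.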

Now, perform text cleaning. Raw text contains a lot of noise. For example, prepositions, pronouns, articles and similar function words carry little content and need to be removed. There are also words that you think are irrelevant in the context of your investigation. The following code removes these words. Additionally, it converts all words to lower case, deletes punctuation and numbers, and removes extra white space between words.

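corpus <- Corpus(VectorSource(corpus))

#helper that replaces every match of a regular expression with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "http\\S+") #remove URLs before punctuation is stripped
corpus <- tm_map(corpus, toSpace, "@\\w+") #remove @mentions
corpus <- tm_map(corpus, content_transformer(tolower)) #convert to lower case
corpus <- tm_map(corpus, removePunctuation) #remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) #filter stop words
corpus <- tm_map(corpus, removeWords, c("via", "twitter", "retweet", "amp")) #user-defined stop words to be filtered out
corpus <- tm_map(corpus, removeNumbers) #remove numbers
corpus <- tm_map(corpus, stripWhitespace) #collapse repeated white space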
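
To see what this kind of cleaning does to a single document, here is a base-R sketch applied to one made-up tweet. It is only an approximation of the tm pipeline, not a replacement for it: clean_tweet, the sample tweet and the short stop-word list are all invented for illustration.

```r
#hypothetical helper: a rough base-R imitation of the cleaning steps
clean_tweet <- function(x, stop_words) {
  x <- gsub("http\\S+", " ", x, perl = TRUE)        #drop URLs
  x <- gsub("@\\w+", " ", x, perl = TRUE)           #drop @mentions
  x <- tolower(x)                                   #lower case
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)     #drop punctuation
  pattern <- sprintf("\\b(%s)\\b", paste(stop_words, collapse = "|"))
  x <- gsub(pattern, "", x, perl = TRUE)            #drop whole-word stop words
  x <- gsub("[[:digit:]]+", "", x, perl = TRUE)     #drop numbers
  x <- gsub("\\s+", " ", x, perl = TRUE)            #collapse white space
  trimws(x)
}

tweet <- "RT @user: ISIS fighters seized 3 villages near Aleppo via https://t.co/abc123 &amp; more"
cleaned <- clean_tweet(tweet, c("rt", "via", "amp", "near", "more"))
cleaned
```

Note the order of the steps: URLs and @mentions are removed while their punctuation is still intact, since once punctuation is gone they can no longer be told apart from ordinary words.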

Construct a document-term matrix (DTM) (https://en.wikipedia.org/wiki/Document-term_matrix).

dtm <- DocumentTermMatrix(corpus) 

#the line above converts the corpus to a DTM. The next three lines remove empty documents to prevent potential errors.
rowTotals <- slam::row_sums(dtm) #number of terms left in each document (slam is installed together with tm)
corpus <- corpus[rowTotals > 0] #keep only the non-empty documents; safe even when no document is empty
dtm <- DocumentTermMatrix(corpus)
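
As a small illustration of what this step produces, the sketch below builds a toy document-term matrix by hand in base R and then drops the empty document. The four documents and the names docs, vocab and toy_dtm are made up for this example; the real DTM is built by tm and stored in a sparse format.

```r
#four made-up documents that are already cleaned; the third one is empty
docs <- c("syria city killed", "iraq city", "", "syria syria iraq")

tokens <- strsplit(docs, " ", fixed = TRUE)   #split each document into words
vocab <- sort(unique(unlist(tokens)))         #every distinct term, alphabetically

#count how often each vocabulary term occurs in each document
toy_dtm <- t(vapply(tokens,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
dimnames(toy_dtm) <- list(as.character(seq_along(docs)), vocab)

#drop documents whose row contains only zeros
toy_dtm <- toy_dtm[rowSums(toy_dtm) > 0, , drop = FALSE]
toy_dtm
```

Selecting rows with a logical condition keeps every document when none is empty, whereas a negative index built from an empty list of row numbers would select nothing at all.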

A DTM is a matrix with documents in the rows and terms in the columns. In this example, a document is a tweet and a term is a word that appears in the tweets. Let’s see what the DTM looks like (showing only 5 rows and 5 columns).

inspect(dtm[1:5, 1:5]) 
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 0/25
## Sparsity           : 100%
## Maximal term length: 38
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs aamaqegyptianpolicecolonel aamaqisisfighters aamaqnewsagencyiraq
##    1                          0                 0                   0
##    2                          0                 0                   0
##    3                          0                 0                   0
##    4                          0                 0                   0
##    5                          0                 0                   0
##     Terms
## Docs aamaqtwoisraeliarmyhelicopterscrossed
##    1                                     0
##    2                                     0
##    3                                     0
##    4                                     0
##    5                                     0
##     Terms
## Docs aamaqtwoisraelicombatdronesalsocrossed
##    1                                      0
##    2                                      0
##    3                                      0
##    4                                      0
##    5                                      0

You will see many zeros. That’s common – most terms appear in only a handful of tweets. Let’s check which terms appear at least 100 times.

findFreqTerms(dtm, 100) 
## character(0)

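Let’s explore term frequency. We ask R to rank the 24 most frequent terms.

dtm.mx <- as.matrix(dtm)
frequency <- colSums(dtm.mx) #total count of each term across all tweets
frequency <- sort(frequency, decreasing=TRUE)
frequency[1:24] #R indexes from 1, so this returns the top 24 terms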
##   islamicstate          syria  caliphatenews           isis           city 
##             86             78             45             44             43 
##           iraq rtdidyouknowvs         killed  wilayatninawa           know 
##             35             35             32             31             30 
##        muslims         ramadi  rtramiallolah         aleppo          allah 
##             28             26             24             23             23 
##            one   albattarengl         people           will   rtmaghrebiqm 
##             23             20             20             20             19 
##       damascus      deirezzor           time   wilayathalab 
##             18             18             18             18
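
The ranking also explains why findFreqTerms(dtm, 100) returned nothing: the most frequent term occurs only 86 times. Lowering the threshold amounts to a simple filter on the frequency vector. A small sketch using the six largest counts printed above, copied by hand into a vector (top_counts, frequent_100 and frequent_40 are names introduced here):

```r
#the six largest counts from the output above, typed in by hand
top_counts <- c(islamicstate = 86, syria = 78, caliphatenews = 45,
                isis = 44, city = 43, iraq = 35)

frequent_100 <- names(top_counts)[top_counts >= 100] #no term reaches 100
frequent_40 <- names(top_counts)[top_counts >= 40]   #terms that occur at least 40 times
frequent_100
frequent_40
```

On the real data, findFreqTerms(dtm, 40) should return the same set of terms, although the order in which they are listed may differ.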

We can also explore the association between two terms. We can ask R to list terms associated with the word “aleppo” (with a correlation coefficient of 0.1 or higher).

findAssocs(dtm, "aleppo", 0.1) 
## $aleppo
##                                               equipment 
##                                                    0.38 
##                                             fsafighters 
##                                                    0.38 
##                                                fullarms 
##                                                    0.38 
##                                          infantryschool 
##                                                    0.38 
##                             islamicstatedeclarescontrol 
##                                                    0.38 
##                                              jundalaqsa 
##                                                    0.38 
##                                                  manbej 
##                                                    0.38 
##                                                 arrival 
##                                                    0.27 
##                                              withdrawal 
##                                                    0.27 
##                                                  urgent 
##                                                    0.23 
##                                                    aash 
##                                                    0.19 
##                                                ablekill 
##                                                    0.19 
##                                                 alkhayr 
##                                                    0.19 
##                                          besiegedexpect 
##                                                    0.19 
##                                         besiegedwaiting 
##                                                    0.19 
##                                                bombcows 
##                                                    0.19 
##    breakingsyriaamaaqnewsisiskilledsyrianregimesoldiers 
##                                                    0.19 
##                                                cameback 
##                                                    0.19 
##                                             causedtoday 
##                                                    0.19 
##                                      confirmationthough 
##                                                    0.19 
##                                                contrast 
##                                                    0.19 
##                                                 dambush 
##                                                    0.19 
##                                            decidedtoday 
##                                                    0.19 
##                                         dsaadestruction 
##                                                    0.19 
##                                           eastalsafirah 
##                                                    0.19 
##                                  enoughmterziabmohammad 
##                                                    0.19 
##                             fidaeefulaanitabshoexplains 
##                                                    0.19 
##                                    fujiomaikoyeahgazwat 
##                                                    0.19 
##                                             hugemistake 
##                                                    0.19 
##   isiskilledsyriansoldierstodaynearkweiresairbasealeppo 
##                                                    0.19 
##                      isisthreatensgovernmentsupplyroute 
##                                                    0.19 
##                                                   jabha 
##                                                    0.19 
##                                         judicialsystems 
##                                                    0.19 
##                                killedisismilitantstoday 
##                                                    0.19 
##                                                 lataika 
##                                                    0.19 
##                                   leastreportedlykilled 
##                                                    0.19 
##                                                lifeline 
##                                                    0.19 
##                             massivehumanitariandisaster 
##                                                    0.19 
##                                      mediawillnevershow 
##                                                    0.19 
##                                               nublzahra 
##                                                    0.19 
##                                               redcircle 
##                                                    0.19 
##                                                   roles 
##                                                    0.19 
##                                              rtbuzzriet 
##                                                    0.19 
## rtlevantinegroupsyriaupdaterebelsloseterritorialcontrol 
##                                                    0.19 
##                                           rtnidalgazaui 
##                                                    0.19 
##                rtwayfrernusantarwitnessturkeydoesntdeny 
##                                                    0.19 
##                                   russiabombedyesterday 
##                                                    0.19 
##                                            secureroutes 
##                                                    0.19 
##                               shelliscontrolledvillages 
##                                                    0.19 
##                               shiaaxisrebelsmustprevent 
##                                                    0.19 
##                                        suicideoperation 
##                                                    0.19 
##                            svbiedhitassadtroopsmilitias 
##                                                    0.19 
##                                        syrianarmyclaims 
##                                                    0.19 
##                                        syrianarmyconvoy 
##                                                    0.19 
##                                syrianregimesuppliesline 
##                                                    0.19 
##                                           talmaksureast 
##                                                    0.19 
##                              tanksarmoredseizemunitions 
##                                                    0.19 
##                                      terroristsecterian 
##                                                    0.19 
##                                               todayscut 
##                                                    0.19 
##                                                   tried 
##                                                    0.19 
##                                             turkeyneeds 
##                                                    0.19 
##                                   tworesidentsajenglish 
##                                                    0.19 
##                          unprecedentedregimeoffensiveht 
##                                                    0.19 
##                                      villagesheikhahmed 
##                                                    0.19 
##                                                     fsa 
##                                                    0.15 
##                                                    join 
##                                                    0.15 
##                                                 mujahid 
##                                                    0.15 
##                                             assadregime 
##                                                    0.13 
##                                              aswedflags 
##                                                    0.13 
##                                                 cutting 
##                                                    0.13 
##                                              explaining 
##                                                    0.13 
##                                                actually 
##                                                    0.11 
##                                            breakingnews 
##                                                    0.11 
##                                                  defend 
##                                                    0.11 
##                                          infogeographic 
##                                                    0.11 
##                                                 managed 
##                                                    0.11 
##                                                    pics 
##                                                    0.11 
##                                         russianairraids 
##                                                    0.11
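
The association score is a correlation between columns of the DTM: two terms score high when they tend to occur in the same tweets. The sketch below reproduces the idea with base R's cor() on a made-up matrix of four documents and three terms. The matrix toy and the threshold handling are illustrative assumptions, not the internals of findAssocs.

```r
#made-up counts: rows are documents, columns are terms
toy <- matrix(c(1, 1, 0,
                1, 1, 0,
                0, 0, 1,
                1, 0, 1),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("aleppo", "fsa", "iraq")))

#Pearson correlation of every term with "aleppo"
assoc <- cor(toy)["aleppo", ]
assoc <- assoc[names(assoc) != "aleppo"]  #drop the term itself
assoc <- sort(assoc, decreasing = TRUE)

#keep only the terms at or above the chosen threshold
strong <- assoc[assoc >= 0.1]
strong
```

In this toy example "fsa" shares three of its four rows with "aleppo" and is kept, while "iraq" is negatively correlated and falls below the threshold. In the real output above, the many long run-together terms tied at 0.19 are most likely terms that occur in a single tweet alongside "aleppo".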

Let’s produce a topic model, starting with its parameters. Tuning a topic model is an art, much like focusing a telescope. The algorithm cannot tell how many topics your text contains; you need to specify the number of topics to be identified. Let’s start by asking the algorithm to give us 5 topics (k <- 5).

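burnin <- 4000 #discard the first 4000 Gibbs iterations
iter <- 2000 #then run 2000 further iterations
thin <- 500 #keep every 500th of those iterations
seed <- list(2003,5,63,100001,765) #one random seed per run
nstart <- 5 #number of runs with different random starting points
best <- TRUE #return only the run with the highest posterior likelihood
k <- 5 #find 5 topics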
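
These settings determine how much work the sampler does. The arithmetic below is a rough sketch that assumes each of the nstart runs performs burnin + iter iterations and retains one draw every thin iterations after the burn-in. The names sweeps_per_run, draws_per_run and total_sweeps are introduced here for illustration.

```r
#same values as above, repeated so that this block runs on its own
burnin <- 4000
iter <- 2000
thin <- 500
nstart <- 5

sweeps_per_run <- burnin + iter          #iterations performed in one run
draws_per_run <- iter %/% thin           #draws retained after the burn-in
total_sweeps <- nstart * sweeps_per_run  #iterations across all runs

c(sweeps_per_run = sweeps_per_run,
  draws_per_run = draws_per_run,
  total_sweeps = total_sweeps)
```

Under these assumptions the five runs add up to 30000 iterations over the whole corpus, which is why the next step takes a while. Lowering burnin or nstart is the quickest way to shorten a trial run.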

Let’s generate a topic model based on the input parameters. Be patient: topic modeling is computationally intensive.

ldaOut <-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))

Now that the results are ready, let’s take a look at them one by one. Running the next two lines will generate a csv file that lists which topic each document (that is, each tweet) belongs to.

ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("isis",k,"DocsToTopics.csv"))

The following lines will give you keywords associated with each topic. The current output file gives you 6 keywords for each topic. Do you know how to show 15 keywords per topic?

ldaOut.terms <- as.matrix(terms(ldaOut,6))
write.csv(ldaOut.terms,file=paste("isis",k,"TopicsToTerms.csv"))
ldaOut.terms[1:6,]
##      Topic 1     Topic 2          Topic 3        Topic 4        
## [1,] "isis"      "islamicstate"   "syria"        "caliphatenews"
## [2,] "killed"    "rtdidyouknowvs" "will"         "city"         
## [3,] "deirezzor" "know"           "rtmaghrebiqm" "ramadi"       
## [4,] "muslims"   "aleppo"         "area"         "time"         
## [5,] "one"       "allah"          "want"         "photoreport"  
## [6,] "russia"    "wilayatninawa"  "fighting"     "reports"      
##      Topic 5              
## [1,] "iraq"               
## [2,] "islamicstate"       
## [3,] "now"                
## [4,] "childlambakhijibran"
## [5,] "khilafah"           
## [6,] "bajwaonline"

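The next lines extract the probability of each topic for each document (stored in the gamma slot of the fitted model), save the full table to a csv file, and print the first five rows.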

topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
topicProbabilities[1:5,]
##          V1        V2        V3        V4        V5
## 1 0.1818182 0.1818182 0.2000000 0.2181818 0.2181818
## 2 0.2222222 0.1851852 0.1851852 0.2037037 0.2037037
## 3 0.2115385 0.1923077 0.2115385 0.1923077 0.1923077
## 4 0.2407407 0.2037037 0.1851852 0.1851852 0.1851852
## 5 0.1886792 0.2075472 0.1886792 0.2075472 0.2075472
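
Each row of this table is a probability distribution over the five topics, so it sums to 1, and the column with the largest value is the document's most likely topic. The sketch below checks this on the five rows printed above, typed in by hand; gamma_head, likely_topic and spread are names introduced here for illustration.

```r
#the five rows printed above, copied by hand
gamma_head <- matrix(c(
  0.1818182, 0.1818182, 0.2000000, 0.2181818, 0.2181818,
  0.2222222, 0.1851852, 0.1851852, 0.2037037, 0.2037037,
  0.2115385, 0.1923077, 0.2115385, 0.1923077, 0.1923077,
  0.2407407, 0.2037037, 0.1851852, 0.1851852, 0.1851852,
  0.1886792, 0.2075472, 0.1886792, 0.2075472, 0.2075472),
  ncol = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), paste0("V", 1:5)))

row_totals <- rowSums(gamma_head)                   #each row should sum to 1
likely_topic <- apply(gamma_head, 1, which.max)     #first column holding the row maximum
spread <- apply(gamma_head, 1, function(p) max(p) - min(p))

row_totals
likely_topic
round(spread, 3)
```

The spread between the largest and smallest probability is small in every row, so these five tweets are assigned to their topics only weakly. That is plausible for documents as short as tweets, and it is worth keeping in mind when reading the document-to-topic file.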

Let’s visualize the result. We will use the R library called LDAvis for visualization. However, LDAvis does not directly take the output from the topic modeling done through topicmodels (which is the library we used for topic modeling). So we need the following lines to convert the output to be LDAvis-readable.

topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
  # Find required quantities
  phi <- posterior(fitted)$terms %>% as.matrix    # topic-term probabilities
  theta <- posterior(fitted)$topics %>% as.matrix # document-topic probabilities
  vocab <- colnames(phi)
  # Count the whitespace-separated tokens in each document
  doc_length <- vapply(seq_along(corpus), function(i) {
    temp <- paste(corpus[[i]]$content, collapse = ' ')
    stri_count(temp, regex = '\\S+')
  }, integer(1))
  # Total count of each term, in the same column order as the DTM.
  # (inspect() is meant for printing and may return only a sample of the matrix.)
  term_frequency <- slam::col_sums(doc_term)
  
  # Convert to json
  json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                                 vocab = vocab,
                                 doc.length = doc_length,
                                 term.frequency = term_frequency)
  
  return(json_lda)
}

In the above code block, we created a function that takes the topic modeling output and converts it to JSON, the format that LDAvis reads. Now, we will use the function to convert our output.

output <- topicmodels_json_ldavis(ldaOut,corpus, dtm) 
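
If the conversion fails, the usual cause is that the five inputs do not agree in shape: phi needs one row per topic and one column per term, theta one row per document and one column per topic, and the three vectors must match those dimensions. The helper below is a hypothetical sanity check (check_ldavis_inputs is not part of LDAvis), demonstrated on made-up toy matrices.

```r
#hypothetical helper: stops with an error if the inputs are inconsistent
check_ldavis_inputs <- function(phi, theta, vocab, doc_length, term_frequency) {
  stopifnot(
    ncol(phi) == length(vocab),             #one column of phi per term
    ncol(phi) == length(term_frequency),    #one frequency per term
    nrow(theta) == length(doc_length),      #one length per document
    ncol(theta) == nrow(phi),               #same number of topics in both
    all(abs(rowSums(phi) - 1) < 1e-8),      #each topic is a distribution over terms
    all(abs(rowSums(theta) - 1) < 1e-8)     #each document is a distribution over topics
  )
  invisible(TRUE)
}

#toy example: 2 topics, 3 terms, 4 documents
phi <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
theta <- matrix(c(0.9, 0.1,
                  0.5, 0.5,
                  0.2, 0.8,
                  0.4, 0.6), ncol = 2, byrow = TRUE)
ok <- check_ldavis_inputs(phi, theta,
                          vocab = c("syria", "iraq", "city"),
                          doc_length = c(4, 6, 3, 5),
                          term_frequency = c(10, 5, 3))
ok
```

Inside topicmodels_json_ldavis, the same check could be run on phi, theta, vocab, doc_length and the term frequencies just before the call to createJSON.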

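Let’s view the output in LDAvis. The call below writes the visualization files to the folder isis6_vis; set open.browser = TRUE to open the result in your browser right away.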

serVis(output, out.dir = "isis6_vis", open.browser = FALSE)