This post uses R Markdown to explain my version of topic modelling using Latent Dirichlet Allocation (LDA), which finds the best number of topics for a set of documents (this approach has been adapted from here). Although the link shows 4 metrics that can be used, I only focus on 3 - "Griffiths2004", "CaoJuan2009" and "Arun2010". "Deveaud2014" does not provide any insight for the dataset used in this example, as it decreases monotonically throughout the sequence. Specifically, this post
- discusses getting started with the libraries needed for the analysis
- provides the initialization of several variables needed later and loads the dataset
- outlines the pre-processing steps to cleanse the data before analysis
- demonstrates how the optimum number of topics for the dataset is obtained
- presents a visualization of the results of LDA
The libraries mentioned below are needed for the analysis to work. Install them from CRAN if they are not present in your version of R.
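If starting from a fresh installation, the missing packages could be installed in one go; the snippet below is just a convenience sketch and assumes none of them are present yet (parallel ships with base R, so it is not listed).

#one-off installation from CRAN (only needed for packages not already installed)
install.packages(c("tm", "slam", "topicmodels", "ldatuning", "rjson", "snow",
                   "stringr", "stringi", "LDAvis", "dplyr", "knitr"))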
rm(list = ls()) #clear the variables (just in case)

#load text mining library
library(tm)
library(slam)

#load topic models library
library(topicmodels)
library(ldatuning)

library(rjson)
library(snow)
library(parallel)
library(stringr)
library(stringi)

#For topic visualization
library(LDAvis)
library(dplyr)

#this library is for presenting outputs in a nice dignified way
library(knitr)
Here, some of the variables which will be used later in the analysis are initialized. First, the stopwords are defined (using the default lists available from the tm library). Then, the parameters for Gibbs sampling in the LDA are defined. Lastly, the range of topic numbers within which the optimum number for the data set is to be found is defined.
#define all stopwords
genericStopwords <- c(stopwords("english"), stopwords("SMART"))
genericStopwords <- gsub("'", "", genericStopwords) #remove apostrophes
genericStopwords <- unique(genericStopwords)
#genericStopwords <- stemDocument(genericStopwords, language = "porter")

#Set parameters for Gibbs sampling for LDA
nstart <- 5
seed <- list(5, 46225, 500, 6300, 190000)
best <- TRUE
burnin <- 5000
iter <- 10000
thin <- 10000
keep <- 100

#Range of topic numbers to search for optimum number
sequ <- seq(2, 25, 1)
Here we use the data set that is available at this link. After loading the data set, the data is pre-processed by converting all characters to lower case. Documents shorter than 8 words are blanked out. Punctuation, control characters, numbers and extra whitespace are removed. The data is then loaded into a corpus, a format which the library can analyze, and stopwords are removed. If needed, stemming can be done after the removal of stopwords, but in my case I prefer not to use it, as some words become quite indecipherable after truncation. Lemmatization would be a better option here; I have not tried it out yet, but a sketch is shown after the stopword-removal output below.
#load files into corpus
#get listing of .txt files in directory
filenames <- list.files(paste(getwd(), "textmining", sep = "/"), pattern = "*.txt", full.names = TRUE)
#read files into a character vector
data_orig <- lapply(filenames, readLines)

#pre-processing:
data <- tolower(data_orig) #force to lowercase
data[stri_count(data, regex = "\\S+") < 8] <- "" #blank out documents shorter than 8 words
data <- gsub("'", "", data) #remove apostrophes
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data) #replace control characters with space
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)

#create corpus from vector
data_docs <- Corpus(VectorSource(data))

#inspect a particular document in corpus
writeLines(as.character(data_docs[[2]]))
c big data metaphors we live by when big data metaphors erase human sensemaking and the ways in which values are baked into categories algorithms and visualizations we have indeed lost the plot not found it quoted from my essay on metaphors for big data co written with simon buckingham shum
#Removal of stopwords
data_docs <- tm_map(data_docs, removeWords, genericStopwords)

#Good practice to check every now and then
writeLines(as.character(data_docs[[2]]))
big data metaphors live big data metaphors erase human sensemaking ways values baked categories algorithms visualizations lost plot found quoted essay metaphors big data written simon buckingham shum
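As mentioned above, lemmatization would likely be a better option than stemming. A minimal sketch, assuming the textstem package is installed (it is not used anywhere else in this post), could look like this:

#lemmatize a copy of the corpus using textstem (sketch only - the rest of
#this post continues with the unlemmatized data_docs)
library(textstem)
data_docs_lemma <- tm_map(data_docs, content_transformer(lemmatize_strings))
writeLines(as.character(data_docs_lemma[[2]]))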
The document-term matrix is then created, and terms whose total count is less than 1% of the number of documents are removed, as such rare terms are unlikely to have an impact on finding topics. Documents that are left with no terms at all are also dropped.
#Create document-term matrix
dtm <- DocumentTermMatrix(data_docs)

#remove terms whose total count is less than 1% of the number of documents
ind <- col_sums(dtm) < length(data) * 0.01
dtm <- dtm[, !ind]

#remove documents with no terms
ind <- row_sums(dtm) == 0
dtm <- dtm[!ind, ]
data_docs <- data_docs[!ind]

#collapse matrix by summing over columns
freq <- col_sums(dtm)

#create sort order (descending)
freq <- freq[order(freq, decreasing = TRUE)]

#List all terms in decreasing order of freq
term_count_table <- data.frame(Term = names(freq), Count = unname(freq))
kable(term_count_table[1:25, ]) #show first 25 rows
Term | Count |
---|---|
work | 161 |
people | 143 |
data | 141 |
project | 128 |
approach | 121 |
management | 121 |
problem | 117 |
time | 101 |
question | 98 |
system | 98 |
point | 96 |
things | 96 |
enterprise | 93 |
change | 88 |
systems | 87 |
important | 84 |
based | 82 |
business | 82 |
process | 78 |
decision | 78 |
organisations | 75 |
organization | 70 |
make | 68 |
good | 67 |
ibis | 66 |
As there is a range of topic numbers to analyze, and each topic model is independent of the others, the models can be fitted in parallel. So here, I make use of the ldatuning package to analyze the sequence of topic numbers using parallel processing. Once the analysis is done, the computed metrics are used to find the optimum topic number for the data set. The plot of the metrics is shown below. The lowest index at which any of the metrics reaches its optimum (a minimum for "CaoJuan2009" and "Arun2010", a maximum for "Griffiths2004") is found, and the topic number at that index is used to re-compute the topic model for visualization.
#Run LDA using Gibbs sampling
#Calculate the number of cores
no_cores <- detectCores() - 1

result <- FindTopicsNumber(
  dtm,
  topics = sequ,
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
  method = "Gibbs",
  control = list(
    nstart = nstart,
    seed = seed,
    best = best,
    burnin = burnin,
    iter = iter,
    keep = keep,
    thin = thin
  ),
  mc.cores = no_cores,
  verbose = TRUE
)
FindTopicsNumber_plot(result)
topic_num <- result$topics[min(which.min(result$CaoJuan2009),
                               which.min(result$Arun2010),
                               which.max(result$Griffiths2004))]
print(paste("The optimum number of topics for the data set is", topic_num))
[1] "The optimum number of topics for the data set is 16"
ldaOut <- LDA(
  dtm,
  k = topic_num,
  method = "Gibbs",
  control = list(
    nstart = nstart,
    seed = seed,
    best = best,
    burnin = burnin,
    iter = iter,
    keep = keep,
    thin = thin
  )
)
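Before moving on to the visualization, a quick sanity check (not part of the original workflow) is to list the most probable terms per topic and the most likely topic per document using the terms() and topics() functions from topicmodels; the choice of 10 terms below is arbitrary.

#optional sanity check: the 10 most probable terms for each topic
ldaOut_terms <- terms(ldaOut, 10)
kable(ldaOut_terms)

#most likely topic for each document
ldaOut_topics <- topics(ldaOut)
head(ldaOut_topics)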
The visualization of topics is done using the LDAvis library. Once again, parallel processing is employed to speed up determining the visualization parameters. The parameters are encoded as a JSON object which is then passed to serVis() to create the interactive HTML files for visualization. You may view an example of the result at the following link. Note: since LDA is unsupervised, the results may change every time the analysis is run; the results in the link are just an example of one such instance.
##Prepare data for Visualization
#Calculate the number of cores
no_cores <- detectCores()
cl <- makeCluster(no_cores)

#Find required quantities
phi <- posterior(ldaOut)$terms %>% as.matrix
theta <- posterior(ldaOut)$topics %>% as.matrix
vocab <- colnames(phi)

doc_length <- vector()
for (i in 1:length(data_docs)) {
  temp <- paste(data_docs[[i]]$content, collapse = ' ')
  doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
}

json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = col_sums(dtm),
  R = 10,
  cluster = cl,
  plot.opts = list(xlab = "Dimension 1", ylab = "Dimension 2")
)
stopCluster(cl)

#Topics visualization
serVis(
  json_lda,
  out.dir = "Vis"
)
Further improvements to this code will include using POS (Part-of-Speech) tagging to remove stopwords more effectively, as well as lemmatization. The approach also needs to be evaluated for all the metrics, and some judgment is needed to decide which metrics are useful for a given dataset and which aren't.
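As a rough illustration of the POS tagging idea, the sketch below uses the udpipe package (an assumption on my part; it is not used anywhere in this post) to keep only nouns and adjectives before the document-term matrix is built.

#hypothetical POS-based filtering with udpipe: keep only nouns and adjectives
#so that function words never reach the document-term matrix
library(udpipe)

ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

annotated <- as.data.frame(udpipe_annotate(ud_model, x = data))
content_words <- subset(annotated, upos %in% c("NOUN", "ADJ"))

#rebuild one string per document from the retained tokens
data_pos <- tapply(content_words$token, content_words$doc_id, paste, collapse = " ")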