Introduction to R mallet

Installation

To use the mallet R package, we first need rJava, an R package for running Java from within R (mallet uses it to access the Mallet Java code). See details at github.com/s-u/rJava.
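
If rJava is not already installed, it can be installed from CRAN as well (this assumes a Java runtime is available on your system):

install.packages("rJava")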

Next, install the mallet R package from CRAN. To install it, simply use install.packages().

install.packages("mallet")

Usage

Depending on the size of your data, you may need to increase the Java Virtual Machine (JVM) heap memory to handle larger corpora. To do this, specify how much memory to allocate to the JVM with the -Xmx flag. Note that this must be done before rJava is loaded, since JVM options cannot be changed once the JVM is running. Below is an example of allocating 4 GB to the JVM.

options(java.parameters = "-Xmx4g")

To load the package, use library().

library(mallet)

Reading data into R

There are multiple ways to read text data into R. A simple approach is to read individual text files into a character vector, one element per document. Below is an example that reads the stop list txt files shipped with the mallet package into such a character vector (the format the mallet R package takes as input data).

# Note this is the path to the folder where the stoplists are stored in the R package.
# Change this path to another directory to read other txt files into R.
directory <- system.file("stoplists", package = "mallet")

files_in_directory <- list.files(directory, full.names = TRUE)

txt_file_content <- character(length(files_in_directory))
for(i in seq_along(files_in_directory)){
  txt_file_content[i] <- paste(readLines(files_in_directory[i]), collapse = "\n")
}
# We can check the content with str()
str(txt_file_content)
##  chr [1:6] "English stoplist is the standard Mallet stoplist.\n\nGerman, French, Finnish are borrowed from http://www.ranks.nl." ...

We will now use the example data set of the State of the Union addresses from 1946 to 2000 that is included with the mallet R package as a data.frame. This data can be accessed as follows.

data(sotu)
sotu[["text"]][1:2]
## [1] "To the Congress of the United States:  "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [2] "A quarter century ago the Congress decided that it could no longer consider  the financial programs of the various departments on a piecemeal basis.  Instead it has called on the President to present a comprehensive Executive  Budget. The Congress has shown its satisfaction with that method by  extending the budget system and tightening its controls. The bigger and  more complex the Federal Program, the more necessary it is for the Chief  Executive to submit a single budget for action by the Congress.  "

Mallet also comes with five different stop list files (see above). We can list the supported stop lists and get the file path of a given list as follows.

mallet_supported_stoplists()
## [1] "de" "en" "fi" "fr" "jp"
stopwords_en_file_path <- mallet_stoplist_file_path("en")

Training topic models

As a first step, we need to create an LDA trainer object and supply the trainer with documents. We start by creating a mallet instance list object. The import function has a few additional options (such as whether to lowercase the text and the regular expression defining a token). See ?mallet.import for details.

sotu.instances <- 
  mallet.import(id.array = row.names(sotu), 
                text.array = sotu[["text"]], 
                stoplist = stopwords_en_file_path,
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

If the data is already cleaned and we are happy to use the positions in text.array as document ids, it is enough to supply text.array.

sotu.instances.short <- 
  mallet.import(text.array = sotu[["text"]])

It is also possible to supply stop words as a character vector.

stop_vector <- readLines(stopwords_en_file_path)
sotu.instances.short <- 
  mallet.import(text.array = sotu[["text"]], 
                stoplist = stop_vector)

Next, we create a topic trainer object to fit a model, specifying the number of topics and the initial Dirichlet hyperparameters alpha.sum and beta.

topic.model <- MalletLDA(num.topics=10, alpha.sum = 1, beta = 0.1)

We then load our documents. We could also pass in the file name of a saved instance list built with Mallet's command-line tools.

topic.model$loadDocuments(sotu.instances)

We use the method getVocabulary() to get the model’s vocabulary. The vocabulary may be helpful in further curating the stopword list.

vocabulary <- topic.model$getVocabulary()
head(vocabulary)
## [1] "congress" "united"   "states"   "quarter"  "century"  "ago"

Similarly, we can access the word and document frequencies with mallet.word.freqs().

word_freqs <- mallet.word.freqs(topic.model)
head(word_freqs)
##       word word.freq doc.freq
## 1 congress      1025      879
## 2   united       508      426
## 3   states       557      480
## 4  quarter        16       15
## 5  century       166      155
## 6      ago       179      171
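
These frequencies are useful, for example, for spotting very common words that could be added to a custom stop list. A small sketch (the threshold of 500 documents is arbitrary):

head(word_freqs[word_freqs$doc.freq > 500, ])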

To optimize the hyperparameters (α and β) every 20 iterations, after 50 burn-in iterations, we set alpha optimization as follows.

topic.model$setAlphaOptimization(20, 50)

Now train the model. Note that hyperparameter optimization is now turned on. We can specify the number of sampling iterations; here we'll use a large-ish round number.

topic.model$train(200)

We can also run through a few iterations where we pick the best topic for each token rather than sampling from the posterior distribution.

topic.model$maximize(10)

Analysis of a topic model

To analyze our corpus using our model, we usually want to access the probability of topics per document and the probability of words per topic. By default, these functions return raw counts; here we want probabilities, so we normalize and add “smoothing” so that nothing has exactly 0 probability.

doc.topics <- mallet.doc.topics(topic.model, smoothed=TRUE, normalized=TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
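
The two matrices have one row per document and per topic, respectively, and with normalized=TRUE each row sums to one. A quick sanity check:

dim(doc.topics)          # documents x topics
dim(topic.words)         # topics x vocabulary size
rowSums(doc.topics)[1:3] # each row sums to 1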

What are the top words in topic 2? Notice that R indexes from 1 and Java from 0, so this will be the topic that mallet called topic 1.

mallet.top.words(topic.model, word.weights = topic.words[2,], num.top.words = 5)
##         term     weight
## 1 production 0.01504072
## 2    economy 0.01270350
## 3     prices 0.01184652
## 4       farm 0.01153489
## 5      price 0.01106744

Show the largest document with at least 50% of its tokens assigned to topic 2. Note that, since the topics are not identified between runs, you might end up with a different topic if you rerun the same code.

docs <- which(doc.topics[,2] > 0.50)
doc_size <- nchar(sotu[["text"]])[docs]
idx <- docs[order(doc_size, decreasing = TRUE)[1]]
sotu[["text"]][idx]
## [1] "Under my program of phased decontrol, domestic crude oil price controls  will end September 30, 1981. As a result exploratory drilling activities  have reached an all-time high; Prices for new natural gas are being  decontrolled under the Natural Gas Policy Act--and natural gas production  is now at an all time high; the supply shortages of several years ago have  been eliminated; The windfall profits tax on crude oil has been enacted  providing $227 billion over ten years for assistance to low-income  households, increased mass transit funding, and a massive investment in the  production and development of alternative energy sources; The Synthetic  Fuels Corporation has been established to help private companies build the  facilities to produce energy from synthetic fuels; Solar energy funding has  been quadrupled, solar energy tax credits enacted, and a Solar Energy and  Energy Conservation Bank has been established; A route has been chosen to  bring natural gas from the North Slope of Alaska to the lower 48 states;  Coal production and consumption incentives have been increased, and coal  production is now at its highest level in history; A gasoline rationing  plan has been approved by Congress for possible use in the event of a  severe energy supply shortage or interruption; Gasohol production has been  dramatically increased, with a program being put in place to produce 500  million gallons of alcohol fuel by the end of this year--an amount that  could enable gasohol to meet the demand for 10 percent of all unleaded  gasoline; New energy conservation incentives have been provided for  individuals, businesses and communities and conservation has increased  dramatically. The U.S. has reduced oil imports by 25 percent--or 2 million  barrels per day--over the past four years.  "

We can also study the topics and how they differ in different parts of the corpus, for example in different time periods.

post1975_topic_words <- mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975)
mallet.top.words(topic.model, word.weights = post1975_topic_words[2,], num.top.words = 5)
##         term weight
## 1     energy     90
## 2        oil     50
## 3    economy     49
## 4       food     43
## 5 production     40
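
Note that the weights above are raw counts: like the other accessor functions, mallet.subset.topic.words defaults to unsmoothed, unnormalized output. To get probabilities instead, the same arguments should apply (a sketch; see ?mallet.subset.topic.words):

post1975_topic_words_norm <- 
  mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975,
                            smoothed = TRUE, normalized = TRUE)
mallet.top.words(topic.model, word.weights = post1975_topic_words_norm[2,], num.top.words = 5)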

Another functionality included in the mallet R package is (hierarchical) clustering of the topics, to assess which topics are “closer” to each other. Use ?mallet.topic.hclust to see further details on how to cluster topics.

topic_labels <- mallet.topic.labels(topic.model, num.top.words = 2)
topic_clusters <- mallet.topic.hclust(doc.topics, topic.words, balance = 0.5)
plot(topic_clusters, labels = topic_labels, xlab = "")

Save and load topic states

We can also store our current topic model state to use it for postprocessing. We can store the state file either as a text file or a compressed gzip file.

state_file <- file.path(tempdir(), "temp_mallet_state.gz")
save.mallet.state(topic.model = topic.model, state.file = state_file)
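
Since the state file is a (possibly gzip-compressed) plain text file, we can inspect it directly from R; readLines() decompresses .gz files transparently.

head(readLines(state_file))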

We also store the topic counts per document and remove the old model.

doc.topics.counts <- mallet.doc.topics(topic.model, smoothed=FALSE, normalized=FALSE)

rm(topic.model)

To initialize a model with the sampled topic indicators, one needs to create a new model, load the same data, and then load the topic indicators. Unfortunately, setting the alpha parameter vector is currently not possible, so the new model cannot be initialized with the same (optimized) alpha prior.

new.topic.model <- MalletLDA(num.topics=10, alpha.sum = 1, beta = 0.1)
new.topic.model$loadDocuments(sotu.instances)
load.mallet.state(topic.model = new.topic.model, state.file = state_file)

doc.topics.counts[1:3, 1:6]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    2    0
## [2,]    0    0    7    0    0    0
## [3,]    0    0    0    9    0    0
mallet.doc.topics(new.topic.model, smoothed=FALSE, normalized=FALSE)[1:3, 1:6]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    2    0
## [2,]    0    0    7    0    0    0
## [3,]    0    0    0    9    0    0

Save and load topic models

We can also save Mallet topic models and load them back into R.

model_file <- file.path(tempdir(), "temp_mallet.model")
mallet.topic.model.save(new.topic.model, model_file)
read.topic.model <- mallet.topic.model.read(model_file)

doc.topics.counts[1:3, 1:6]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    2    0
## [2,]    0    0    7    0    0    0
## [3,]    0    0    0    9    0    0
mallet.doc.topics(read.topic.model, smoothed=FALSE, normalized=FALSE)[1:3, 1:6]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    2    0
## [2,]    0    0    7    0    0    0
## [3,]    0    0    0    9    0    0
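
Finally, we can remove the temporary files created above.

file.remove(state_file, model_file)

This vignette gives a first example of using the mallet R package for topic modelling.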