Title: An R Wrapper for the Java Mallet Topic Modeling Toolkit
Description: An R interface for the Java Machine Learning for Language Toolkit (mallet) <https://mimno.github.io/Mallet/> to estimate probabilistic topic models, such as Latent Dirichlet Allocation. We can use the R package to read textual data into 'mallet' from R objects, run the Java implementation of 'mallet' directly in R, and extract results as R objects. The 'mallet' toolkit has many functions; this wrapper focuses on the topic modeling sub-package written by David Mimno. The package uses the 'rJava' package to connect to a Java Virtual Machine (JVM).
Authors: Måns Magnusson [cre, aut], David Mimno [aut, cph]
Maintainer: Måns Magnusson <[email protected]>
License: MIT + file LICENSE
Version: 1.3.1
Built: 2024-10-27 03:33:52 UTC
Source: https://github.com/mimno/rmallet
The model, Latent Dirichlet Allocation (LDA): David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003.
The Java toolkit: Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit. 2002.
Details of the fast sparse Gibbs sampling algorithm: Limin Yao, David Mimno, Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD, 2009.
Hyperparameter optimization: Hanna Wallach, David Mimno, Andrew McCallum. Rethinking LDA: Why Priors Matter. NIPS, 2009.
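A condensed sketch of that workflow (it mirrors the fuller examples in the function pages below; the sotu data ships with the package):
## Not run: 
library(mallet)
data(sotu)
instances <- mallet.import(text.array = sotu[["text"]],
                           stoplist = mallet_stoplist_file_path("en"))
topic.model <- MalletLDA(num.topics = 10)
topic.model$loadDocuments(instances)
topic.model$train(200)
doc_topics <- mallet.doc.topics(topic.model, normalized = TRUE, smoothed = TRUE)
## End(Not run)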
This function reads a sampling state of Mallet from file. The state contains hyperparameters together with the topic indicators.
load.mallet.state(topic.model, state.file)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
state.file: File path to read the Mallet state file from.
A java cc.mallet.topics.RTopicModel object.
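A minimal sketch of the round trip, assuming topic.model is a trained cc.mallet.topics.RTopicModel as in the examples below:
## Not run: 
# Write the current sampling state, then read it back into the model
state_file <- file.path(tempdir(), "mallet_state.gz")
save.mallet.state(topic.model, state.file = state_file)
load.mallet.state(topic.model, state.file = state_file)
## End(Not run)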
Return the mallet jar filename(s)
mallet_jar(full.names = FALSE)
mallet.jar(full.names = FALSE)
full.names: A logical value. If TRUE, the directory path is prepended to the file names to give a relative file path. If FALSE, only the file name(s) (rather than paths) are returned.
Mallet is shipped as a jar file inside the mallet R package. This function returns the file name (and, optionally, the file path) of that jar file.
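For example:
## Not run: 
# File name of the bundled jar
mallet_jar()
# Full path to the jar file inside the installed package
mallet_jar(full.names = TRUE)
## End(Not run)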
Return the file path to the mallet stoplists
mallet_stoplist_file_path(language = "en")
mallet.stoplist.file.path(language = "en")
language: The language to return a stoplist for. Defaults to English ("en").
Returns the path to the Mallet stop word list. See mallet_supported_stoplists() for the stoplists that are included.
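For example, to pass the bundled English stoplist on to mallet.import() (assuming sotu as in the other examples):
## Not run: 
stoplist_path <- mallet_stoplist_file_path("en")
sotu.instances <- mallet.import(text.array = sotu[["text"]], stoplist = stoplist_path)
## End(Not run)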
Mallet supported stoplists
mallet_supported_stoplists()
mallet.supported.stoplists()
Returns a character vector with the names of the included stoplists.
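For example:
## Not run: 
mallet_supported_stoplists()
## End(Not run)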
This function returns a matrix with one row for every document and one column for every topic.
mallet.doc.topics(topic.model, normalized = FALSE, smoothed = FALSE)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
normalized: If TRUE, normalize the rows so that each document sums to one. If FALSE, the values are the integer counts of words assigned to each topic in each document.
smoothed: If TRUE, add the smoothing parameter of the model (the Dirichlet prior) to each cell; if FALSE, many cells may be zero.
A number of documents by number of topics matrix.
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Extract results
doc_topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
topic_words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
top_words <- mallet.top.words(topic.model, word.weights = topic_words[2, ], num.top.words = 5)
## End(Not run)
This function takes an array of document IDs and text files (as character strings) and converts them into a Mallet instance list.
mallet.import(
  id.array = NULL,
  text.array,
  stoplist = "",
  preserve.case = FALSE,
  token.regexp = "[\\p{L}]+"
)
id.array: An array of document IDs. Default is NULL, in which case sequential IDs are generated.
text.array: A character vector with each element containing a document.
stoplist: The name of a file containing stopwords (words to ignore), one per line, or a character vector containing stopwords. If the file is not in the current working directory, you may need to include a full path. Default is no stoplist.
preserve.case: By default, the input text is converted to all lowercase; set to TRUE to keep the original case.
token.regexp: A quoted string representing a regular expression that defines a token. The default is one or more Unicode letters: "[\\p{L}]+". Note that special characters must have double backslashes.
A cc.mallet.types.InstanceList object.
mallet.word.freqs() returns term and document frequencies, which may be useful in selecting stopwords.
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
## End(Not run)
This function takes a directory path as its only argument and returns a data.frame with two columns, <id> and <text>, which can be passed to the mallet.import function. This data.frame has as many rows as there are files in Dir.
mallet.read.dir(Dir)
Dir: The path to a directory containing one document per file.
A data.frame with file id and text content.
This function was contributed to RMallet by Dan Bowen.
## Not run: 
directory <- system.file("stoplists", package = "mallet")
stoplists <- mallet.read.dir(directory)
## End(Not run)
This function returns a matrix of word probabilities for each topic, similar to mallet.topic.words, but estimated from a subset of the documents in the corpus. The model assumes that topics are the same no matter where they are used, but we know this is often not the case. This function lets us test whether some words are used more or less than we expect in a particular set of documents.
mallet.subset.topic.words(
  topic.model,
  subset.docs,
  normalized = FALSE,
  smoothed = FALSE
)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
subset.docs: A logical vector of TRUE/FALSE values specifying which documents should be included.
normalized: If TRUE, normalize the rows so that each topic sums to one. If FALSE, the values are the integer counts of words assigned to each topic.
smoothed: If TRUE, add the smoothing parameter of the model (the Dirichlet prior) to each cell; if FALSE, many cells may be zero.
A number of topics by vocabulary size matrix for the included documents.
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Extract subcorpus topic word matrix
post1975_topic_words <- mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975)
mallet.top.words(topic.model, word.weights = post1975_topic_words[2, ], num.top.words = 5)
## End(Not run)
This function returns a data frame with two columns, one containing the most probable words as character values, the second containing the weight assigned to that word in the word weights vector you supplied.
mallet.top.words(topic.model, word.weights, num.top.words = 10)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
word.weights: A vector of word weights for one topic, usually a row from the topic-word matrix returned by mallet.topic.words().
num.top.words: The number of most probable words to return. If not specified, defaults to 10.
A data.frame with the top terms (term) and their weights/probability (weight).
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Extract top words (the topic-word matrix is computed first)
topic_words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
top_words <- mallet.top.words(topic.model, word.weights = topic_words[2, ], num.top.words = 5)
## End(Not run)
Returns a hierarchical clustering of topics that can be plotted as a dendrogram. There are two ways of measuring topic similarity: topics may contain some of the same words, or they may appear in some of the same documents. The balance parameter allows you to interpolate between the similarities determined by these two methods.
mallet.topic.hclust(
  doc.topics,
  topic.words,
  balance = 0.3,
  method = "euclidean",
  ...
)
doc.topics: A documents by topics matrix of topic probabilities (see mallet.doc.topics()).
topic.words: A topics by words matrix of word probabilities (see mallet.topic.words()).
balance: A value between 0.0 (use only document-level similarity) and 1.0 (use only word-level similarity).
method: The distance measure to use in dist().
...: Further arguments passed to hclust().
An object of class hclust which describes the tree produced by the clustering process.
This function combines the data matrices from mallet.doc.topics() and mallet.topic.words() and clusters them using the hclust() function.
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Create hierarchical clusters of topics
doc_topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
topic_words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
topic_labels <- mallet.topic.labels(topic.model)
plot(mallet.topic.hclust(doc_topics, topic_words, balance = 0.3), labels = topic_labels)
## End(Not run)
This function returns a vector of strings, one for each topic, with the most probable words in that topic separated by spaces.
mallet.topic.labels(topic.model, topic.words = NULL, num.top.words = 3, ...)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
topic.words: The matrix of topic-word weights returned by mallet.topic.words(). If NULL (the default), the matrix is computed from topic.model.
num.top.words: The number of words to include for each topic. Defaults to 3.
...: Further arguments supplied to mallet.topic.words().
A character vector with one element per topic.
mallet.topic.words() produces topic-word weights. mallet.top.words() produces a data frame for a single topic.
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Label the topics with their top words
topic_labels <- mallet.topic.labels(topic.model)
## End(Not run)
These functions load a topic model from file, or save a topic model to file.
mallet.topic.model.read(filename)
mallet.topic.model.load(filename)
mallet.topic.model.write(topic.model, filename)
mallet.topic.model.save(topic.model, filename)
filename: The file to read the topic model from or write it to.
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
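A minimal sketch, assuming topic.model is a trained model as in the examples on the other pages; the file name is arbitrary:
## Not run: 
model_file <- file.path(tempdir(), "sotu_topic_model")
mallet.topic.model.write(topic.model, model_file)
topic.model2 <- mallet.topic.model.read(model_file)
## End(Not run)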
This function returns a matrix with one row for every topic and one column for every word in the vocabulary.
mallet.topic.words(topic.model, normalized = FALSE, smoothed = FALSE)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
normalized: If TRUE, normalize the rows so that each topic sums to one. If FALSE, the values are the integer counts of words assigned to each topic.
smoothed: If TRUE, add the smoothing parameter of the model (the Dirichlet prior) to each cell; if FALSE, many cells may be zero.
A number of topics by vocabulary size matrix.
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Extract results
doc_topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
topic_words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
top_words <- mallet.top.words(topic.model, word.weights = topic_words[2, ], num.top.words = 5)
## End(Not run)
This method returns a data frame with one row for each unique vocabulary word and three columns: the word as a character value, the total number of tokens of that word type, and the total number of documents that contain that word at least once. This information can be useful in identifying candidate stopwords.
mallet.word.freqs(topic.model)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
A data.frame with the word type (word), the word frequency (word.freq), and the document frequency (doc.freq).
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Get word frequencies
word_freqs <- mallet.word.freqs(topic.model)
## End(Not run)
This function creates a java cc.mallet.topics.RTopicModel object that wraps a Mallet topic model trainer java object, cc.mallet.topics.ParallelTopicModel. Note that you can call any of the methods of this java object as properties. In the example below, I make a call directly to the topic.model$setAlphaOptimization(20, 50) java method, which passes this update to the model itself.
MalletLDA(num.topics = 10, alpha.sum = 5, beta = 0.01)
num.topics: The number of topics to use. If not specified, this defaults to 10.
alpha.sum: This is the magnitude of the Dirichlet prior over the topic distribution of a document. The default value is 5.0. With 10 topics, this setting leads to a Dirichlet with parameter alpha_k = 0.5 for each topic. If hyperparameter optimization is turned on, this value may change.
beta: This is the per-word weight of the Dirichlet prior over topic-word distributions. The magnitude of the distribution (the sum over all words of this parameter) is determined by the number of words in the vocabulary. Again, this value may change due to hyperparameter optimization.
A cc.mallet.topics.RTopicModel object.
## Not run: 
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Call a method of the underlying java object directly:
# optimize hyperparameters every 20 iterations, after 50 burn-in iterations
topic.model$setAlphaOptimization(20, 50)

# Train topic model
topic.model$train(200)

# Extract results
doc_topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
topic_words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
top_words <- mallet.top.words(topic.model, word.weights = topic_words[2, ], num.top.words = 5)
## End(Not run)
These functions save a Mallet instance list to file, or load it back from file.
save.mallet.instances(instances, filename)
load.mallet.instances(filename)
instances: An instance list object (see mallet.import()) to save.
filename: The filename to save to or load from.
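A minimal sketch, assuming sotu.instances was created with mallet.import() as in the examples above; the file name is arbitrary:
## Not run: 
instance_file <- file.path(tempdir(), "sotu.instances")
save.mallet.instances(sotu.instances, filename = instance_file)
sotu.instances2 <- load.mallet.instances(instance_file)
## End(Not run)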
This function writes the current sampling state of Mallet to file. The state contains hyperparameters together with the topic indicators. The state file can be read back into R using load.mallet.state().
save.mallet.state(topic.model, state.file)
topic.model: A cc.mallet.topics.RTopicModel object created by MalletLDA().
state.file: File path (.gz format) to store the Mallet state file to.
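A minimal sketch, assuming topic.model is a trained model as in the examples above:
## Not run: 
save.mallet.state(topic.model, state.file = file.path(tempdir(), "mallet_state.gz"))
## End(Not run)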
A dataset containing State of the Union Addresses, by paragraph, from 1946 to 2000.
sotu
A tibble data.frame with 6816 rows and 3 variables:
year: Year of the address.
paragraph: The paragraph number of the address.
text: The address content.
https://en.wikipedia.org/wiki/State_of_the_Union
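For example:
## Not run: 
data(sotu)
str(sotu)
# Paragraphs per year
head(table(sotu[["year"]]))
## End(Not run)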