Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics present in a corpus. See the slides for details.
Non-Negative Matrix Factorization (NMF) is a dimensionality reduction technique that factors an input matrix of shape $m \times n$ into a matrix of shape $m \times k$ and another matrix of shape $n \times k$, where $k$ is a much smaller number of latent components.
In text mining, one can use NMF to build topic models: it factors a document-term matrix of shape documents × word types into a matrix of shape documents × topics and another of shape word types × topics. The former describes the distribution of topics in each document, and the latter describes the distribution of words in each topic.
The mathematical basis underpinning NMF is quite different from that of LDA: LDA is based on probabilistic graphical modeling, while NMF relies on linear algebra. Both algorithms take as input a bag-of-words matrix (i.e., each document is represented as a row, and each column contains the count of one word type across the corpus). The aim of each algorithm is then to produce two smaller matrices, a document-to-topic matrix and a word-to-topic matrix, that when multiplied together reproduce the bag-of-words matrix with the lowest error, as the short sketch below illustrates.
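To make these shapes concrete, here is a minimal sketch on a small random non-negative matrix (the dimensions are illustrative, not taken from the corpus used below):
import numpy as np
from sklearn.decomposition import NMF
# 6 "documents" x 40 "word types"; entries must be non-negative
X = np.random.RandomState(0).rand(6, 40)
model = NMF(n_components=3, init='nndsvd', random_state=0)
W = model.fit_transform(X)  # document-to-topic matrix, shape (6, 3)
H = model.components_       # topic-to-word matrix, shape (3, 40)
print(W.shape, H.shape)
print(np.linalg.norm(X - W @ H))  # the reconstruction error the factorization minimises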
NMF sometimes produces more meaningful topics for smaller datasets.
NMF has been included in scikit-learn, whose consistent API makes it almost trivial to perform topic modeling using both LDA and NMF. scikit-learn also offers seeding options for NMF, which greatly help with algorithm convergence, and provides both online and batch variants of LDA.
from sklearn.datasets import fetch_20newsgroups
# Load the 20 Newsgroups corpus; headers, footers and quoted replies are
# stripped so they cannot leak topic information
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data
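A quick sanity check confirms the corpus loaded as expected (exact counts may vary slightly across scikit-learn versions):
print(len(documents))      # roughly 11,000 newsgroup posts
print(documents[0][:200])  # first 200 characters of the first post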
Creating the bag-of-words matrix is very easy in scikit-learn: all the heavy lifting is done by the feature extraction functionality provided for text datasets. For NMF, the raw counts are weighted with a tf-idf transform, which the TfidfVectorizer applies in one step. LDA, on the other hand, being a probabilistic graphical model (i.e., dealing with probabilities), requires raw counts, so a CountVectorizer is used. In both cases stop words are removed and the vocabulary is restricted to the 1,000 most frequent terms.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
no_features = 1000
# NMF works with tf-idf weighted features
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
# LDA uses raw term counts because it is a probabilistic generative model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()
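A short, optional check shows the effect of the vectorizer settings: both are sparse document × term matrices with at most no_features columns:
print(tfidf.shape, tf.shape)     # (n_documents, <= 1000), stored sparsely
print(len(tfidf_feature_names))  # vocabulary retained after filtering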
As mentioned previously, the algorithms cannot automatically determine the number of topics, so this value must be set when running them. Comprehensive documentation on the available parameters exists for both NMF and LDA. Initialising the W and H matrices in NMF with ‘nndsvd’ rather than random initialisation reduces the time NMF takes to converge. LDA can also be set to run in either batch or online mode.
from sklearn.decomposition import NMF, LatentDirichletAllocation
def display_topics(model, feature_names, no_top_words):
    # model.components_ holds the topic-to-word matrix for both NMF and LDA
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort is ascending, so take the last no_top_words indices, reversed
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
no_top_words = 10
no_topics = 20
# Run NMF. 'nndsvd' seeding speeds up convergence; the alpha/l1_ratio
# regularisation parameters from older scikit-learn releases were removed
# in 1.2 (replaced by alpha_W/alpha_H with a different scaling), so
# regularisation is left at its default here.
nmf = NMF(n_components=no_topics, random_state=1, init='nndsvd').fit(tfidf)
# Run LDA in online mode
lda = LatentDirichletAllocation(n_components=no_topics,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0).fit(tf)
display_topics(nmf, tfidf_feature_names, no_top_words)
display_topics(lda, tf_feature_names, no_top_words)
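Since the number of topics cannot be inferred automatically, one rough heuristic (a sketch, not a definitive recipe) is to scan a few candidate values and watch NMF's reconstruction error and LDA's perplexity for diminishing returns:
# Illustrative scan over candidate topic counts; both metrics tend to
# improve as k grows, so look for an elbow rather than a minimum
for k in (10, 20, 40):
    nmf_k = NMF(n_components=k, random_state=1, init='nndsvd').fit(tfidf)
    lda_k = LatentDirichletAllocation(n_components=k, max_iter=5,
                                      learning_method='online',
                                      learning_offset=50.,
                                      random_state=0).fit(tf)
    # perplexity is computed on the training matrix here, which is optimistic
    print(k, nmf_k.reconstruction_err_, lda_k.perplexity(tf))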