{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic modelling\n", "\n", "## Latent Dirichlet Allocation (LDA)\n", "\n", "Latent Dirichlet Allocation (LDA) is a algorithms used to discover the **topics** that are present in a corpus. See the slides for details.\n", "\n", "\n", "## Non-negative Matrix Factorization (NMF) \n", "\n", "Non-Negative Matrix Factorization is a dimension reduction technique that factors an input matrix of shape $m \\times n$ into a matrix of shape $m \\times k$ and another matrix of shape $n \\times k$.\n", "\n", "In text mining, one can use NMF to build topic models. Using NMF, one can factor a Term-Document Matrix of shape documents x word types into a matrix of documents x topics and another matrix of shape word types x topics. The former matrix describes the distribution of each topic in each document, and the latter describes the distribution of each word in each topic. \n", "\n", "## Comparison between LDA and NMF\n", "\n", "Non-negative Matrix Factorization (NMF) can also be used to find topics in text. The mathematical basis underpinning NMF is quite different from LDA. LDA is based on probabilistic graphical modeling while NMF relies on linear algebra. Both algorithms take as input a bag of words matrix (i.e., each document represented as a row, with each columns containing the count of words in the corpus). The aim of each algorithm is then to produce 2 smaller matrices; a document to topic matrix and a word to topic matrix that when multiplied together reproduce the bag of words matrix with the lowest error.\n", "\n", "\n", "- NMF sometimes produces more meaningful topics for smaller datasets. \n", "\n", "- NMF has been included in `scikit-learn`. `scikit-learn` brings API consistency which makes it almost trivial to perform Topic Modeling using both LDA and NMF. \n", "\n", "- Scikit Learn also includes `seeding` options for NMF which greatly helps with algorithm convergence and offers both online and batch variants of LDA.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading 20news dataset. This may take a few minutes.\n", "Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n" ] } ], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "\n", "dataset = fetch_20newsgroups(shuffle=True, random_state=1, \n", " remove=('headers', 'footers', 'quotes'))\n", "documents = dataset.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation of the bag of words matrix\n", "\n", "The creation of the bag of words matrix is very easy in `scikit-learn` . All the heavy lifting is done by the feature extraction functionality provided for text datasets. A `tf-idf` transformer is applied to the bag of words matrix that NMF must process with the `TfidfVectorizer`. \n", "\n", "LDA on the other hand, being a probabilistic graphical model (i.e. dealing with probabilities) **only requires raw counts**, so a `CountVectorizer` is used. Stop words are removed and the number of terms included in the bag of words matrix is restricted to the top 1000." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", "\n", "no_features = 1000\n", "\n", "# NMF is able to use tf-idf\n", "tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')\n", "tfidf = tfidf_vectorizer.fit_transform(documents)\n", "tfidf_feature_names = tfidf_vectorizer.get_feature_names()\n", "\n", "# LDA can only use raw term counts for LDA because it is a probabilistic graphical model\n", "tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')\n", "tf = tf_vectorizer.fit_transform(documents)\n", "tf_feature_names = tf_vectorizer.get_feature_names()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NMF and LDA with Scikit Learn\n", "\n", "As mentioned previously the algorithms are not able to automatically determine the number of topics and this value must be set when running the algorithm. Comprehensive documentation on available parameters is available for both NMF and LDA. Initialising the W and H matrices in NMF with ‘nndsvd’ rather than random initialisation improves the time it takes for NMF to converge. LDA can also be set to run in either batch or online mode." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic 0:\n", "people time right did good said say make way government\n", "Topic 1:\n", "window problem using server application screen display motif manager running\n", "Topic 2:\n", "god jesus bible christ faith believe christian christians sin church\n", "Topic 3:\n", "game team year games season players play hockey win league\n", "Topic 4:\n", "new 00 sale 10 price offer shipping condition 20 15\n", "Topic 5:\n", "thanks mail advance hi looking info help information address appreciated\n", "Topic 6:\n", "windows file files dos program version ftp ms directory running\n", "Topic 7:\n", "edu soon cs university ftp internet article email pub david\n", "Topic 8:\n", "key chip clipper encryption keys escrow government public algorithm nsa\n", "Topic 9:\n", "drive scsi drives hard disk ide floppy controller cd mac\n", "Topic 10:\n", "just ll thought tell oh little fine work wanted mean\n", "Topic 11:\n", "does know anybody mean work say doesn help exist program\n", "Topic 12:\n", "card video monitor cards drivers bus vga driver color memory\n", "Topic 13:\n", "like sounds looks look bike sound lot things really thing\n", "Topic 14:\n", "don know want let need doesn little sure sorry things\n", "Topic 15:\n", "car cars engine speed good bike driver road insurance fast\n", "Topic 16:\n", "ve got seen heard tried good recently times try couple\n", "Topic 17:\n", "use used using work available want software need image data\n", "Topic 18:\n", "think don lot try makes really pretty wasn bit david\n", "Topic 19:\n", "com list dave internet article sun hp email ibm phone\n", "Topic 0:\n", "people gun state control right guns crime states law police\n", "Topic 1:\n", "time question book years did like don space answer just\n", "Topic 2:\n", "mr line rules science stephanopoulos title current define int yes\n", "Topic 3:\n", "key chip keys clipper encryption number des algorithm use bit\n", "Topic 4:\n", "edu com cs vs w7 cx mail uk 17 send\n", "Topic 5:\n", "use does window problem way used point different case value\n", "Topic 6:\n", "windows thanks know help db 
does dos problem like using\n", "Topic 7:\n", "bike water effect road design media dod paper like turn\n", "Topic 8:\n", "don just like think know people good ve going say\n", "Topic 9:\n", "car new price good power used air sale offer ground\n", "Topic 10:\n", "file available program edu ftp information files use image version\n", "Topic 11:\n", "ax max b8f g9v a86 145 pl 1d9 0t 34u\n", "Topic 12:\n", "government law privacy security legal encryption court fbi technology information\n", "Topic 13:\n", "card bit memory output video color data mode monitor 16\n", "Topic 14:\n", "drive scsi disk mac hard apple drives controller software port\n", "Topic 15:\n", "god jesus people believe christian bible say does life church\n", "Topic 16:\n", "year game team games season play hockey players league player\n", "Topic 17:\n", "10 00 15 25 20 11 12 14 16 13\n", "Topic 18:\n", "armenian israel armenians war people jews turkish israeli said women\n", "Topic 19:\n", "president people new said health year university school day work\n" ] } ], "source": [ "from sklearn.decomposition import NMF, LatentDirichletAllocation\n", "\n", "def display_topics(model, feature_names, no_top_words):\n", " for topic_idx, topic in enumerate(model.components_):\n", " print (\"Topic %d:\" % (topic_idx))\n", " print (\" \".join([feature_names[i]\n", " for i in topic.argsort()[:-no_top_words - 1:-1]]))\n", "\n", "no_top_words = 10\n", "no_topics = 20\n", "\n", "# Run NMF\n", "nmf = NMF(n_components=no_topics, random_state=1, \n", " alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)\n", "\n", "# Run LDA\n", "lda = LatentDirichletAllocation(n_components=no_topics, \n", " max_iter=5, \n", " learning_method='online', \n", " learning_offset=50.,\n", " random_state=0).fit(tf)\n", "\n", "display_topics(nmf, tfidf_feature_names, no_top_words)\n", "display_topics(lda, tf_feature_names, no_top_words)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5rc1" } }, "nbformat": 4, "nbformat_minor": 2 }
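, { "cell_type": "markdown", "metadata": {}, "source": [ "As a follow-up, the fitted models can also be used to inspect the document-to-topic matrix described in the comparison above. The sketch below (an illustrative addition, not part of the original analysis) prints the topic weights of one arbitrary document under each model; `transform` returns one row per document and one column per topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Document-to-topic matrices: one row per document, one column per topic\n", "nmf_W = nmf.transform(tfidf)\n", "lda_W = lda.transform(tf)  # each row is a topic distribution summing to 1\n", "\n", "doc_id = 0  # an arbitrary example document\n", "print(documents[doc_id][:200])\n", "print(\"NMF topic weights:\", np.round(nmf_W[doc_id], 3))\n", "print(\"LDA topic distribution:\", np.round(lda_W[doc_id], 3))\n", "print(\"Dominant topic (NMF): %d, (LDA): %d\" % (nmf_W[doc_id].argmax(), lda_W[doc_id].argmax()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5rc1" } }, "nbformat": 4, "nbformat_minor": 2 }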