{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Probabilistic Language Models\n", "\n", "\n", "Feng Li\n", "\n", "School of Statistics and Mathematics\n", "\n", "Central University of Finance and Economics\n", "\n", "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "\n", "[https://feng.li/python](https://feng.li/python)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Probabilistic Language Models\n", "\n", "\n", "How can we assign a probability to a sentence?\n", "\n", " P(high winds tonight) > P(large winds tonight)\n", " \n", " \n", "How do we do a proper spell correction?\n", "\n", "- The office is about fifteen **minuets** from my house\n", "\n", " P(about fifteen minutes from) > P(about fifteen minuets from)\n", " \n", " \n", "- How do we imporve the precision of speech recognition?\n", "\n", "\n", " P(I saw a van) >> P(eyes awe of an)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The Goal of a Language Model: \n", "\n", "We compute the probability of a sentence or sequence of words:\n", "\n", "$$P(W) = P(w_1,w_2,w_3,w_4,w_5,...w_n)$$\n", " \n", "Related task: probability of an upcoming word:\n", "\n", "$$P(w_5|w_1,w_2,w_3,w_4)$$\n", " \n", "A model that computes either of these $P(W)$ or $P(w_n|w_1,w_2,...,w_{n-1})$ is called a **language model**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How to compute this joint probability:\n", "\n", "$$P(its, water, is, so, transparent, that)$$\n", "\n", "- Intuition: let’s rely on the Chain Rule of Probability\n", "\n", "- The Chain Rule in General\n", "\n", "$$P(x_1,x_2,x_3,...,x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1,x_2)...P(x_n|x_1,…,x_{n-1})$$\n", "\n", "- The Joint probability is now factorized as \n", "\n", "$$P(“its~water~is~so~transparent”) =\n", "\tP(its) × P(water|its) × P(is|its~water) \n", " × P(so|its~water~is) × P(transparent|its~water~is~so)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Simplifying assumption:\n", "\n", "- Markov Assumption\n", "\n", "$$\n", "P(w_1,w_2,...,w_{n}) \\approx \\prod_{i=1}^n P(w_i|w_{i-1},...,w_{i-k})\n", "$$\n", "where $k$ is some positive integer.\n", "\n", "- In other words, we approximate each component in the product\n", "\n", "$$\n", " P(w_i |w_{i-1},...,w_{1}) \\approx P(w_i|w_{i-k},...,w_{i-1})\n", "$$" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "from nltk.tokenize import RegexpTokenizer" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "os.chdir(os.path.dirname(os.path.realpath('__file__')))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "merged = pd.read_excel('data/guba.xlsx', sheet_name='Merged')\n", "\n", "merged = merged['Explanation'].tolist()\n", "tokenizer = RegexpTokenizer(r'\\w+')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Tokenize\n", "\n", "merged = [tokenizer.tokenize(merged[i]) for i in range(len(merged))]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# stop words\n", "def word_clean(sentence, stop_words):\n", " sentence = [i.lower() for i in sentence]\n", " sentence = [token for token in sentence if not token.isnumeric()]\n", " sentence = [j for j in sentence if j not in stop_words]\n", " return sentence\n", "\n", "stop_words = pd.read_csv('data/stopwords.txt', header=None)[0].to_list()\n", "merged = [word_clean(sentence, stop_words) for sentence in merged]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import nltk\n", "from nltk.stem.wordnet import WordNetLemmatizer" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Lemmatize the documents.\n", "\n", "lemmatizer = WordNetLemmatizer()\n", "merged = [[lemmatizer.lemmatize(token) for token in doc] for doc in merged]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Compute bigrams.\n", "from gensim.models import Phrases\n", "\n", "# Add bigrams and trigrams to docs\n", "bigram = Phrases(merged, min_count=1)\n", "merged = [bigram[lst] for lst in merged]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Remove rare and common tokens.\n", "from gensim.corpora import Dictionary\n", "\n", "# Create a dictionary representation of the documents.\n", "merged_dictionary = Dictionary(merged)\n", "\n", "# Filter out words that occur less than 2 documents, or more than 60% of the documents.\n", "merged_dictionary.filter_extremes(no_below=1, no_above=0.6)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Bag-of-words representation of the documents.\n", "\n", "merged_corpus = [merged_dictionary.doc2bow(doc) for doc in merged]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Train merged model.\n", "from gensim.models import LdaModel\n", "\n", "# Set training parameters.\n", "num_topics = 5\n", "chunksize = 20\n", "passes = 5\n", "iterations = 200\n", "eval_every = None # Don't evaluate model perplexity, takes too much time.\n", "\n", "# Make an index to word dictionary.\n", "temp = merged_dictionary[0] # This is only to \"load\" the dictionary.\n", "id2word = merged_dictionary.id2token\n", "\n", "merged_model = LdaModel(\n", " corpus=merged_corpus,\n", " id2word=id2word,\n", " chunksize=chunksize,\n", " alpha='auto',\n", " eta='auto', \n", " iterations=iterations,\n", " num_topics=num_topics,\n", " passes=passes,\n", " eval_every=eval_every\n", ")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Looking in indexes: https://mirrors.163.com/pypi/simple/\n", "Collecting pyLDAvis\n", " Using cached pyLDAvis-3.3.1-py2.py3-none-any.whl\n", "Collecting joblib\n", " Using cached https://mirrors.163.com/pypi/packages/55/85/70c6602b078bd9e6f3da4f467047e906525c355a4dacd4f71b97a35d9897/joblib-1.0.1-py3-none-any.whl (303 kB)\n", "Collecting pandas>=1.2.0\n", " Using cached https://mirrors.163.com/pypi/packages/48/b4/1081d66b71c4dfc1bc1e19d6f2abbf93ed42f69df7703eb323742d45423e/pandas-1.3.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)\n", "Collecting future\n", " Using cached future-0.18.2-py3-none-any.whl\n", "Collecting sklearn\n", " Using cached sklearn-0.0-py2.py3-none-any.whl\n", "Collecting setuptools\n", " Using cached https://mirrors.163.com/pypi/packages/41/f4/a7ca4859317232b1efb64a826b8d2d7299bb77fb60bdb08e2bd1d61cf80d/setuptools-58.2.0-py3-none-any.whl (946 kB)\n", "Collecting gensim\n", " Using cached https://mirrors.163.com/pypi/packages/61/e8/ddf62a31b4f97f543a38233047865d02be97c192f7f8d849bbf3353bc094/gensim-4.1.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.0 MB)\n", "Collecting numexpr\n", " Using cached https://mirrors.163.com/pypi/packages/4e/88/ccd8973d0dde04e95f6fbc7818f770a18293de7348fc3f9b66e9bf44a2f9/numexpr-2.7.3-cp39-cp39-manylinux2010_x86_64.whl (471 kB)\n", "Collecting numpy>=1.20.0\n", " Using cached https://mirrors.163.com/pypi/packages/7a/4c/dd00ce768b0f0f7de5c486cbd9f5b922bc3af2f3a5da30121d7f7dc03130/numpy-1.21.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)\n", "Collecting scikit-learn\n", " Using cached https://mirrors.163.com/pypi/packages/53/8b/99d0658d74a2e6277dbe40b6759581badb2790f6422369ae6a3d606b9164/scikit_learn-1.0.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.7 MB)\n", "Collecting jinja2\n", " Using cached https://mirrors.163.com/pypi/packages/94/42/d8bca8e99789bcc35dfa9b03acaa8b518720d6e060163745bc2bf2ead842/Jinja2-3.0.2-py3-none-any.whl (133 kB)\n", "Collecting scipy\n", " Using cached https://mirrors.163.com/pypi/packages/83/4a/13f813a86b7f510954bdae649f0fda718e8210320b4108656ddaf96442c9/scipy-1.7.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.8 MB)\n", "Collecting funcy\n", " Using cached https://mirrors.163.com/pypi/packages/66/89/479de0afbbfb98d1c4b887936808764627300208bb771fcd823403645a36/funcy-1.15-py2.py3-none-any.whl (32 kB)\n", "Collecting python-dateutil>=2.7.3\n", " Using cached https://mirrors.163.com/pypi/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)\n", "Collecting pytz>=2017.3\n", " Using cached https://mirrors.163.com/pypi/packages/70/94/784178ca5dd892a98f113cdd923372024dc04b8d40abe77ca76b5fb90ca6/pytz-2021.1-py2.py3-none-any.whl (510 kB)\n", "Collecting six>=1.5\n", " Using cached https://mirrors.163.com/pypi/packages/ee/ff/48bde5c0f013094d729fe4b0316ba2a24774b3ff1c52d924a8a4cb04078a/six-1.15.0-py2.py3-none-any.whl (10 kB)\n", "Collecting smart-open>=1.8.1\n", " Using cached https://mirrors.163.com/pypi/packages/cd/11/05f68ea934c24ade38e95ac30a38407767787c4e3db1776eae4886ad8c95/smart_open-5.2.1-py3-none-any.whl (58 kB)\n", "Collecting MarkupSafe>=2.0\n", " Using cached https://mirrors.163.com/pypi/packages/c2/db/314df69668f582d5173922bded7b58126044bb77cfce6347c5d992074d2e/MarkupSafe-2.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (30 kB)\n", "Collecting threadpoolctl>=2.0.0\n", " Using cached https://mirrors.163.com/pypi/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl (12 kB)\n", "Installing collected packages: numpy, threadpoolctl, six, scipy, joblib, smart-open, scikit-learn, pytz, python-dateutil, MarkupSafe, sklearn, setuptools, pandas, numexpr, jinja2, gensim, future, funcy, pyLDAvis\n", "Successfully installed MarkupSafe-2.0.1 funcy-1.15 future-0.18.2 gensim-4.1.2 jinja2-3.0.2 joblib-1.0.1 numexpr-2.7.3 numpy-1.21.2 pandas-1.3.4 pyLDAvis-3.3.1 python-dateutil-2.8.1 pytz-2021.1 scikit-learn-1.0.1 scipy-1.7.2 setuptools-58.2.0 six-1.15.0 sklearn-0.0 smart-open-5.2.1 threadpoolctl-2.1.0\n" ] } ], "source": [ "# We need a specific version of pyLDAvis. Let's install it to current notebook directory.\n", "! pip3 install pyLDAvis -I -t modules" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['modules',\n", " 'modules',\n", " '/home/fli/cloud/teaching/python/python-slides/P08-Advanced-Topics',\n", " '/usr/lib/python39.zip',\n", " '/usr/lib/python3.9',\n", " '/usr/lib/python3.9/lib-dynload',\n", " '',\n", " '/home/fli/.local/lib/python3.9/site-packages',\n", " '/usr/local/lib/python3.9/dist-packages',\n", " '/usr/lib/python3/dist-packages',\n", " '/usr/local/lib/python3.9/dist-packages/IPython/extensions',\n", " '/home/fli/.ipython']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# prepend the `modules` folder to Python's search path \n", "import sys\n", "sys.path.insert(0, 'modules')\n", "sys.path" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['modules',\n", " '/home/fli/cloud/teaching/python/python-slides/P08-Advanced-Topics',\n", " '/usr/lib/python39.zip',\n", " '/usr/lib/python3.9',\n", " '/usr/lib/python3.9/lib-dynload',\n", " '',\n", " '/home/fli/.local/lib/python3.9/site-packages',\n", " '/usr/local/lib/python3.9/dist-packages',\n", " '/usr/lib/python3/dist-packages',\n", " '/usr/local/lib/python3.9/dist-packages/IPython/extensions',\n", " '/home/fli/.ipython']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sys.path" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/fli/cloud/teaching/python/python-slides/P08-Advanced-Topics/modules/pyLDAvis/_prepare.py:246: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only\n", " default_term_info = default_term_info.sort_values(\n" ] } ], "source": [ "import pyLDAvis\n", "from pyLDAvis import gensim_models\n", "vis= gensim_models.prepare(merged_model, merged_corpus, dictionary=merged_dictionary)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "pyLDAvis.save_html(vis, 'lda.html')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TermFreqTotalCategorylogproblogliftrelevance
377product23.91352930.062396Topic1-3.29280.9540-3.2928
627real_estate10.39319412.188516Topic1-4.12611.0235-4.1261
246company7.4087499.892665Topic1-4.46460.8937-4.4646
309financial6.1972048.785535Topic1-4.64320.8339-4.6432
800zhangjiang_tech5.3800725.830456Topic1-4.78461.1025-4.7846
558expected_return5.3160085.792948Topic1-4.79661.0970-4.7966
520bank4.5821585.053505Topic1-4.94511.0850-4.9451
251future4.3605514.810861Topic1-4.99471.0846-4.9947
102stock3.90411911.538281Topic1-5.10530.0992-5.1053
540customer3.8478564.313823Topic1-5.11981.0686-5.1198
265term3.8058595.013711Topic1-5.13070.9072-5.1307
619project3.6648465.127102Topic1-5.16850.8471-5.1685
245business3.4342756.798940Topic1-5.23350.4999-5.2335
567fund3.2499446.864512Topic1-5.28870.4352-5.2887
582invest3.1387053.589726Topic1-5.32351.0486-5.3235
636return3.1277703.583477Topic1-5.32701.0469-5.3270
660trust3.1277703.583477Topic1-5.32701.0469-5.3270
562financing3.1129683.573827Topic1-5.33171.0448-5.3317
638risk3.1129683.573827Topic1-5.33171.0448-5.3317
666wealth_management3.1129683.573827Topic1-5.33171.0448-5.3317
241annual2.8782783.321743Topic1-5.41011.0396-5.4101
782board2.8546024.410119Topic1-5.41840.7479-5.4184
24control2.5836363.367196Topic1-5.51810.9180-5.5181
799zhangjiang2.4007522.847076Topic1-5.59151.0124-5.5915
790park2.4007522.847076Topic1-5.59151.0124-5.5915
655threshold2.3961822.844386Topic1-5.59341.0114-5.5934
625rate2.3889132.841308Topic1-5.59651.0095-5.5965
661type2.3889132.841308Topic1-5.59651.0095-5.5965
565focus2.3772802.833401Topic1-5.60131.0074-5.6013
665wealth2.3772802.833401Topic1-5.60131.0074-5.6013
604ningbo_branch2.3772802.833401Topic1-5.60131.0074-5.6013
603ningbo2.3772802.833401Topic1-5.60131.0074-5.6013
523buy_product2.3772802.833401Topic1-5.60131.0074-5.6013
518average_annualized2.3772802.833401Topic1-5.60131.0074-5.6013
766size2.3584102.833941Topic1-5.60930.9992-5.6093
721infrastructure2.3584102.833941Topic1-5.60930.9992-5.6093
716huge2.3584102.833941Topic1-5.60930.9992-5.6093
672anxin2.3584102.833941Topic1-5.60930.9992-5.6093
717improve2.3584102.833941Topic1-5.60930.9992-5.6093
794senior_executive2.2990762.742155Topic1-5.63481.0066-5.6348
\n", "
" ], "text/plain": [ " Term Freq Total Category logprob loglift \\\n", "377 product 23.913529 30.062396 Topic1 -3.2928 0.9540 \n", "627 real_estate 10.393194 12.188516 Topic1 -4.1261 1.0235 \n", "246 company 7.408749 9.892665 Topic1 -4.4646 0.8937 \n", "309 financial 6.197204 8.785535 Topic1 -4.6432 0.8339 \n", "800 zhangjiang_tech 5.380072 5.830456 Topic1 -4.7846 1.1025 \n", "558 expected_return 5.316008 5.792948 Topic1 -4.7966 1.0970 \n", "520 bank 4.582158 5.053505 Topic1 -4.9451 1.0850 \n", "251 future 4.360551 4.810861 Topic1 -4.9947 1.0846 \n", "102 stock 3.904119 11.538281 Topic1 -5.1053 0.0992 \n", "540 customer 3.847856 4.313823 Topic1 -5.1198 1.0686 \n", "265 term 3.805859 5.013711 Topic1 -5.1307 0.9072 \n", "619 project 3.664846 5.127102 Topic1 -5.1685 0.8471 \n", "245 business 3.434275 6.798940 Topic1 -5.2335 0.4999 \n", "567 fund 3.249944 6.864512 Topic1 -5.2887 0.4352 \n", "582 invest 3.138705 3.589726 Topic1 -5.3235 1.0486 \n", "636 return 3.127770 3.583477 Topic1 -5.3270 1.0469 \n", "660 trust 3.127770 3.583477 Topic1 -5.3270 1.0469 \n", "562 financing 3.112968 3.573827 Topic1 -5.3317 1.0448 \n", "638 risk 3.112968 3.573827 Topic1 -5.3317 1.0448 \n", "666 wealth_management 3.112968 3.573827 Topic1 -5.3317 1.0448 \n", "241 annual 2.878278 3.321743 Topic1 -5.4101 1.0396 \n", "782 board 2.854602 4.410119 Topic1 -5.4184 0.7479 \n", "24 control 2.583636 3.367196 Topic1 -5.5181 0.9180 \n", "799 zhangjiang 2.400752 2.847076 Topic1 -5.5915 1.0124 \n", "790 park 2.400752 2.847076 Topic1 -5.5915 1.0124 \n", "655 threshold 2.396182 2.844386 Topic1 -5.5934 1.0114 \n", "625 rate 2.388913 2.841308 Topic1 -5.5965 1.0095 \n", "661 type 2.388913 2.841308 Topic1 -5.5965 1.0095 \n", "565 focus 2.377280 2.833401 Topic1 -5.6013 1.0074 \n", "665 wealth 2.377280 2.833401 Topic1 -5.6013 1.0074 \n", "604 ningbo_branch 2.377280 2.833401 Topic1 -5.6013 1.0074 \n", "603 ningbo 2.377280 2.833401 Topic1 -5.6013 1.0074 \n", "523 buy_product 2.377280 2.833401 Topic1 -5.6013 1.0074 \n", "518 average_annualized 2.377280 2.833401 Topic1 -5.6013 1.0074 \n", "766 size 2.358410 2.833941 Topic1 -5.6093 0.9992 \n", "721 infrastructure 2.358410 2.833941 Topic1 -5.6093 0.9992 \n", "716 huge 2.358410 2.833941 Topic1 -5.6093 0.9992 \n", "672 anxin 2.358410 2.833941 Topic1 -5.6093 0.9992 \n", "717 improve 2.358410 2.833941 Topic1 -5.6093 0.9992 \n", "794 senior_executive 2.299076 2.742155 Topic1 -5.6348 1.0066 \n", "\n", " relevance \n", "377 -3.2928 \n", "627 -4.1261 \n", "246 -4.4646 \n", "309 -4.6432 \n", "800 -4.7846 \n", "558 -4.7966 \n", "520 -4.9451 \n", "251 -4.9947 \n", "102 -5.1053 \n", "540 -5.1198 \n", "265 -5.1307 \n", "619 -5.1685 \n", "245 -5.2335 \n", "567 -5.2887 \n", "582 -5.3235 \n", "636 -5.3270 \n", "660 -5.3270 \n", "562 -5.3317 \n", "638 -5.3317 \n", "666 -5.3317 \n", "241 -5.4101 \n", "782 -5.4184 \n", "24 -5.5181 \n", "799 -5.5915 \n", "790 -5.5915 \n", "655 -5.5934 \n", "625 -5.5965 \n", "661 -5.5965 \n", "565 -5.6013 \n", "665 -5.6013 \n", "604 -5.6013 \n", "603 -5.6013 \n", "523 -5.6013 \n", "518 -5.6013 \n", "766 -5.6093 \n", "721 -5.6093 \n", "716 -5.6093 \n", "672 -5.6093 \n", "717 -5.6093 \n", "794 -5.6348 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vis.sorted_terms()" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }