# Probabilistic Language Models


Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)

[https://feng.li/python](https://feng.li/python)



How can we assign a probability to a sentence?

    P(high winds tonight) > P(large winds tonight)
    
    
How do we perform spelling correction?

- The office is about fifteen **minuets** from my house

      P(about fifteen minutes from) > P(about fifteen minuets from)
      
      
- How do we improve the precision of speech recognition?


      P(I saw a van) >> P(eyes awe of an)

## The Goal of a Language Model: 

We compute the probability of a sentence or sequence of words:

$$P(W) = P(w_1,w_2,w_3,w_4,w_5,\ldots,w_n)$$
     
Related task: probability of an upcoming word:

$$P(w_5|w_1,w_2,w_3,w_4)$$
      
A model that computes either of these, $P(W)$ or $P(w_n|w_1,w_2,\ldots,w_{n-1})$, is called a **language model**.

## How to compute this joint probability:

$$P(\text{its}, \text{water}, \text{is}, \text{so}, \text{transparent})$$

- Intuition: let’s rely on the Chain Rule of Probability

- The Chain Rule in General

$$P(x_1,x_2,x_3,\ldots,x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1,x_2)\cdots P(x_n|x_1,\ldots,x_{n-1})$$

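- The joint probability then factorizes as

$$
\begin{aligned}
P(\text{its water is so transparent}) ={}& P(\text{its}) \times P(\text{water}|\text{its}) \times P(\text{is}|\text{its water}) \\
& \times P(\text{so}|\text{its water is}) \times P(\text{transparent}|\text{its water is so})
\end{aligned}
$$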
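
The factorization above can be enumerated mechanically. The following is a minimal sketch (the helper name `chain_rule_factors` is made up for this illustration) that lists, for each word, the history it is conditioned on under the chain rule.

```python
def chain_rule_factors(words):
    """Return (word, history) pairs, one per factor P(word | history)."""
    return [(w, tuple(words[:i])) for i, w in enumerate(words)]


sentence = "its water is so transparent".split()
factors = chain_rule_factors(sentence)

# Print each factor in the same notation as the formula above.
for word, history in factors:
    if history:
        print(f"P({word}|{' '.join(history)})")
    else:
        print(f"P({word})")
```

Note that the history grows with every word, which is why estimating these factors directly from counts quickly becomes infeasible.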

### Simplifying assumption:

- Markov Assumption

$$
P(w_1,w_2,\ldots,w_n) \approx \prod_{i=1}^n P(w_i|w_{i-k},\ldots,w_{i-1})
$$
where $k$ is some positive integer, the number of preceding words kept as context ($k=1$ gives a bigram model).

- In other words, we approximate each component in the product

$$
P(w_i|w_1,\ldots,w_{i-1}) \approx P(w_i|w_{i-k},\ldots,w_{i-1})
$$
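
As an illustration of the Markov assumption with $k=1$, here is a minimal sketch of a bigram model estimated by relative frequencies. The toy corpus, the boundary markers `<s>` and `</s>`, and the function names are assumptions made for this example; no smoothing is applied, so any unseen bigram gives probability zero.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
    ["<s>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"],
]

context_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    context_counts.update(sent[:-1])           # every token that has a successor
    bigram_counts.update(zip(sent, sent[1:]))  # adjacent (previous, current) pairs


def bigram_prob(prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(sent):
    """Product of bigram probabilities over a boundary-padded sentence."""
    prob = 1.0
    for prev, word in zip(sent, sent[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob(["<s>", "i", "am", "sam", "</s>"]))
```

With the start marker in place, the first factor is $P(w_1|\text{<s>})$ rather than an unconditional $P(w_1)$, which is one common way to handle the beginning of a sentence.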

In [1]:
import os
import pandas as pd
from nltk.tokenize import RegexpTokenizer

In [2]:
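# `__file__` is undefined in a notebook, so the string literal '__file__' is used:
# it resolves against the current working directory, which then stays the working directory.
os.chdir(os.path.dirname(os.path.realpath('__file__')))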

In [3]:
merged = pd.read_excel('data/guba.xlsx', sheet_name='Merged')

merged = merged['Explanation'].tolist()
tokenizer = RegexpTokenizer(r'\w+')

In [4]:
# Tokenize each document into a list of word tokens.

merged = [tokenizer.tokenize(doc) for doc in merged]
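
To see what the pattern `\w+` does to a sentence, the sketch below uses the standard-library `re.findall` as a stand-in for `RegexpTokenizer`, so that it runs without NLTK. For a pattern without capture groups the two should give the same token list, but treat that as an assumption; `simple_tokenize` and the sample sentence are made up for this example.

```python
import re


def simple_tokenize(text):
    """Return maximal runs of word characters, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)


tokens = simple_tokenize("The office is about 15 minutes from my house.")
print(tokens)
```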

In [5]:
# Lowercase the tokens, then drop purely numeric tokens and stop words.
def word_clean(sentence, stop_words):
    sentence = [token.lower() for token in sentence]
    sentence = [token for token in sentence if not token.isnumeric()]
    sentence = [token for token in sentence if token not in stop_words]
    return sentence

# One stop word per line; a set makes the membership test fast.
stop_words = set(pd.read_csv('data/stopwords.txt', header=None)[0])
merged = [word_clean(sentence, stop_words) for sentence in merged]
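
A quick check of `word_clean` on a single sentence. The function is repeated so the snippet runs on its own; the stop-word set and the sentence are hypothetical and are not taken from `data/stopwords.txt` or the dataset.

```python
def word_clean(sentence, stop_words):
    # Same steps as in the notebook cell: lowercase, drop numeric tokens, drop stop words.
    sentence = [i.lower() for i in sentence]
    sentence = [token for token in sentence if not token.isnumeric()]
    sentence = [j for j in sentence if j not in stop_words]
    return sentence


toy_stop_words = {"the", "is", "from", "my"}  # hypothetical stop-word list
toy_sentence = ["The", "office", "is", "about", "15", "minutes", "from", "my", "house"]

cleaned = word_clean(toy_sentence, toy_stop_words)
print(cleaned)
```

Lowercasing happens first, so capitalized stop words such as "The" are also removed.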

In [6]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

In [7]:
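# Lemmatize the documents (tokens are treated as nouns by default).
# The lemmatizer needs the WordNet corpus; if it is missing, run nltk.download('wordnet') once.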

lemmatizer = WordNetLemmatizer()
merged = [[lemmatizer.lemmatize(token) for token in doc] for doc in merged]

In [8]:
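# Detect frequent bigrams and merge each into a single token (e.g. real_estate).
from gensim.models import Phrases

# min_count=1 keeps every word and pair as a candidate; the default score threshold
# still decides which pairs are merged. Only bigrams are added to the docs here.
bigram = Phrases(merged, min_count=1)
merged = [bigram[doc] for doc in merged]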

In [9]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
merged_dictionary = Dictionary(merged)

# Keep tokens that occur in at least 1 document (so no rare tokens are dropped here)
# and in no more than 60% of the documents.
merged_dictionary.filter_extremes(no_below=1, no_above=0.6)
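
The idea behind the two thresholds can be sketched with the standard library alone. This is a simplified illustration, not gensim's implementation: details such as how the upper bound is rounded and the additional `keep_n` limit are left out, and `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["stock", "bank", "product"],
    ["stock", "company"],
    ["stock", "bank", "future"],
    ["stock", "customer"],
]

kept_strict = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.6)
kept_loose = filter_by_document_frequency(toy_docs, no_below=1, no_above=0.6)
print(sorted(kept_strict))
print(sorted(kept_loose))
```

In this toy example "stock" appears in every document and is removed in both cases, while the choice of `no_below` decides whether tokens seen in a single document survive.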

In [10]:
# Bag-of-words representation of the documents.

merged_corpus = [merged_dictionary.doc2bow(doc) for doc in merged]
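
Each document in the bag-of-words corpus becomes a list of `(token_id, count)` pairs. The sketch below reproduces that representation with the standard library; the id assignment (order of first appearance) is an assumption of this example and may differ from the ids gensim assigns, and `build_vocab` and `to_bow` are hypothetical helpers.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are skipped."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


toy_docs = [["stock", "bank", "stock"], ["bank", "product"]]
vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only the counts per token remain.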

In [11]:
# Train merged model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 20
passes = 5
iterations = 200
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = merged_dictionary[0]  # This is only to "load" the dictionary.
id2word = merged_dictionary.id2token

merged_model = LdaModel(
    corpus=merged_corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',  
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [17]:
# We need a specific version of pyLDAvis. Install it into the local `modules` directory (-t),
# ignoring any copy that is already installed (-I).
! pip3 install pyLDAvis -I -t modules

Looking in indexes: https://mirrors.163.com/pypi/simple/
Collecting pyLDAvis
  Using cached pyLDAvis-3.3.1-py2.py3-none-any.whl

In [3]:
# Prepend the `modules` folder to Python's search path (only once, even if the cell is re-run).
import sys
if 'modules' not in sys.path:
    sys.path.insert(0, 'modules')
sys.path

['modules',
 '/home/fli/cloud/teaching/python/python-slides/P08-Advanced-Topics',
 '/usr/lib/python39.zip',
 '/usr/lib/python3.9',
 '/usr/lib/python3.9/lib-dynload',
 '',
 '/home/fli/.local/lib/python3.9/site-packages',
 '/usr/local/lib/python3.9/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.9/dist-packages/IPython/extensions',
 '/home/fli/.ipython']

In [18]:
import pyLDAvis
from pyLDAvis import gensim_models
vis = gensim_models.prepare(merged_model, merged_corpus, dictionary=merged_dictionary)



In [15]:
pyLDAvis.save_html(vis, 'lda.html')

In [16]:
vis.sorted_terms()

| | Term | Freq | Total | Category | logprob | loglift | relevance |
|---|---|---|---|---|---|---|---|
| 377 | product | 23.913529 | 30.062396 | Topic1 | -3.2928 | 0.9540 | -3.2928 |
| 627 | real_estate | 10.393194 | 12.188516 | Topic1 | -4.1261 | 1.0235 | -4.1261 |
| 246 | company | 7.408749 | 9.892665 | Topic1 | -4.4646 | 0.8937 | -4.4646 |
| 309 | financial | 6.197204 | 8.785535 | Topic1 | -4.6432 | 0.8339 | -4.6432 |
| 800 | zhangjiang_tech | 5.380072 | 5.830456 | Topic1 | -4.7846 | 1.1025 | -4.7846 |
| 558 | expected_return | 5.316008 | 5.792948 | Topic1 | -4.7966 | 1.0970 | -4.7966 |
| 520 | bank | 4.582158 | 5.053505 | Topic1 | -4.9451 | 1.0850 | -4.9451 |
| 251 | future | 4.360551 | 4.810861 | Topic1 | -4.9947 | 1.0846 | -4.9947 |
| 102 | stock | 3.904119 | 11.538281 | Topic1 | -5.1053 | 0.0992 | -5.1053 |
| 540 | customer | 3.847856 | 4.313823 | Topic1 | -5.1198 | 1.0686 | -5.1198 |
