Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
A corpus is a large collection of texts: a body of written or spoken material upon which a linguistic analysis is based. A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving all the occurrences of a particular word or structure for inspection, or randomly selected samples. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
A token is the technical name for a sequence of characters that we want to treat as a group. The vocabulary of a text is just the set of tokens that it uses, since in a set all duplicates are collapsed together. In Python we can obtain the vocabulary items with the built-in set().
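For example, a minimal sketch (the token list is made up for illustration):
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
vocabulary = set(tokens)        # duplicates collapse into a single entry
len(tokens), len(vocabulary)    # (6, 5)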
Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. These are words such as "the" and "a". Most search engines will filter out stopwords from search queries and documents in order to save space in their index.
Stemming is a technique for removing affixes from a word, leaving the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while increasing retrieval accuracy.
A frequency count is the number of hits of a particular feature. Obtaining frequency counts requires finding all the occurrences of that feature in the corpus, so frequency counting is implicit in concordancing. Software is used for this purpose, and the resulting counts can be analyzed statistically.
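As a small illustration, NLTK's FreqDist (a frequency distribution over tokens) produces such counts; the sentence is made up, and word_tokenize assumes the punkt data introduced later in these notes has been downloaded:
from nltk import FreqDist
from nltk.tokenize import word_tokenize

fdist = FreqDist(word_tokenize("the cat sat on the mat"))
fdist['the']            # 2
fdist.most_common(2)    # [('the', 2), ('cat', 1)]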
Word segmentation is the problem of dividing a string of written language into its component words.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although the space character alone is not always sufficient; contractions such as can't for cannot are one example.
However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
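A rough illustration in Python (the example strings are made up): whitespace splitting is a workable first approximation for English, but recovers nothing for a script written without word delimiters.
"Time flies like an arrow.".split()
# ['Time', 'flies', 'like', 'an', 'arrow.']  -- punctuation stays attached to the last word
"我爱自然语言处理".split()   # "I love natural language processing"
# ['我爱自然语言处理']  -- no spaces, so the individual words cannot be recovered this way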
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
Named-entity recognition (NER) (also known as entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, and locations, and expressions of time, quantities, monetary values, and percentages.
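NLTK ships a simple pre-trained named-entity chunker. The sketch below assumes the relevant data packages (punkt, the averaged perceptron tagger, the maxent NE chunker, and words) have been downloaded; the sentence is made up:
import nltk

sentence = "Pierre Vinken joined the board of Elsevier in New York on Nov. 29."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
print(tree)   # named entities such as PERSON, ORGANIZATION and GPE appear as labelled subtrees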
Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.
Stanford Word Segmenter
Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.
NLP toolkits for Chinese
Before you use the NLTK library, please use the NLTK Downloader to obtain the required resources. The NLTK data will be downloaded into one of the following directories, depending on your OS:
$HOME/nltk_data
/usr/share/nltk_data
/usr/local/share/nltk_data
/usr/lib/nltk_data
/usr/local/lib/nltk_data
/usr/nltk_data
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('treebank')
[nltk_data] Downloading package punkt to /home/fli/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /home/fli/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package treebank to /home/fli/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
True
The sent_tokenize() function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages, so it knows what punctuation and characters mark the end of a sentence and the beginning of a new sentence.
para = "Python is a widely used general-purpose, high-level programming language. \
Its design philosophy emphasizes code readability, and its syntax allows programmers \
to express concepts in fewer lines of code than would be possible in languages such as \
C++ or Java. The language provides constructs intended to enable clear programs \
on both a small and large scale."
from nltk.tokenize import sent_tokenize
sent_tokenize(para)
['Python is a widely used general-purpose, high-level programming language.', 'Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.', 'The language provides constructs intended to enable clear programs on both a small and large scale.']
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')
['Hello', 'World', '.']
tok = word_tokenize(para)
print(tok)
['Python', 'is', 'a', 'widely', 'used', 'general-purpose', ',', 'high-level', 'programming', 'language', '.', 'Its', 'design', 'philosophy', 'emphasizes', 'code', 'readability', ',', 'and', 'its', 'syntax', 'allows', 'programmers', 'to', 'express', 'concepts', 'in', 'fewer', 'lines', 'of', 'code', 'than', 'would', 'be', 'possible', 'in', 'languages', 'such', 'as', 'C++', 'or', 'Java', '.', 'The', 'language', 'provides', 'constructs', 'intended', 'to', 'enable', 'clear', 'programs', 'on', 'both', 'a', 'small', 'and', 'large', 'scale', '.']
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
print(english_stops)
{'at', 'be', 'yours', 'own', 'if', 'we', 'that', 'll', 'again', 'had', 'you', 'aren', 'their', 'in', 't', "couldn't", 'ain', 'an', 'which', 'from', "mustn't", 'can', "mightn't", 'i', 'more', 'yourselves', 'as', 'should', 'weren', 'the', 'but', 'very', 'until', 'just', 'with', 'wouldn', 'too', 'below', 'she', 'further', 'will', 'now', 'why', "aren't", 'than', 'do', 'have', 'to', "don't", "hasn't", "you're", 'this', 'did', "won't", 'won', 'and', 'him', 'am', 'other', 'it', 'hers', 've', 'wasn', 'off', 'they', 'above', 'them', "haven't", 'before', 'where', 'there', 'being', 'nor', 'our', 'who', 'been', 'by', 'some', 'has', 'only', 'on', 'd', 'through', 'm', 'is', 'didn', 'ourselves', 'theirs', 'does', 'about', 'needn', 'those', 'between', "that'll", 'or', 'under', 'no', 'shan', "wasn't", "you'll", 'not', "hadn't", 'both', 'himself', "shouldn't", 'out', 'mustn', 'hadn', 'during', 'don', 'while', 'same', 'so', 'whom', 'then', 'few', 'shouldn', 'for', 'of', 'hasn', 'such', 'how', 'are', 'doing', 'after', 'its', "needn't", 'most', 're', 'isn', "shan't", "weren't", 'up', "isn't", 'his', 'haven', 'down', 'itself', "you've", "should've", 'her', "didn't", 'my', 'because', 'themselves', 'all', "doesn't", 'having', 'here', 's', 'myself', "wouldn't", "you'd", "it's", 'once', 'herself', 'each', 'mightn', 'ours', 'over', 'into', 'when', 'your', 'was', 'these', 'o', 'were', 'a', 'me', 'he', "she's", 'any', 'doesn', 'what', 'against', 'y', 'couldn', 'ma', 'yourself'}
words = ["Can't", 'is', 'a', 'contraction']
[word for word in words if word not in english_stops]
["Can't", 'contraction']
One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('cooking')
'cook'
In everyday language, people are often not strictly grammatical. They will write things such as "I looooooove it" in order to emphasize the word love. However, computers don't know that "looooooove" is a variation of "love" unless they are told. The code below defines two small helpers: a RegexpReplacer that expands common English contractions, and a RepeatReplacer that removes these annoying repeating characters in order to end up with a proper English word.
import re
from nltk.corpus import wordnet
# (regex, replacement) pairs for expanding common English contractions
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would')
]

class RegexpReplacer(object):
    """Expand contractions by applying each (regex, replacement) pattern in turn."""
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
class RepeatReplacer(object):
    """Remove repeated characters (e.g. 'looooooove') until a WordNet word is found."""
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word    # already a known word, stop here
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)    # keep stripping one repeated character at a time
        else:
            return repl_word
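A quick usage sketch of both classes (the inputs are made up; the WordNet lookup requires the wordnet data package):
RegexpReplacer().replace("can't is a contraction")   # 'cannot is a contraction'
RepeatReplacer().replace('looooooove')               # 'love'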
It is often useful to reduce the vocabulary of a text by replacing words with common synonyms. By compressing the vocabulary without losing meaning, you can save memory in cases such as frequency analysis and text indexing. Vocabulary reduction can also increase the occurrence of significant collocations.
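One simple way to do this is a lookup table that maps words to preferred synonyms; the WordReplacer class and the tiny word map below are illustrative only, not part of NLTK:
class WordReplacer(object):
    """Replace a word with a preferred synonym from a hand-built mapping."""
    def __init__(self, word_map):
        self.word_map = word_map
    def replace(self, word):
        return self.word_map.get(word, word)

replacer = WordReplacer({'bday': 'birthday', 'gr8': 'great'})
replacer.replace('bday')    # 'birthday'
replacer.replace('happy')   # 'happy' (not in the map, returned unchanged)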
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)
treebank.sents()[0]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
tagger.tag(treebank.sents()[0])
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
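To gauge how well the tagger generalizes, hold out the remaining treebank sentences and measure per-token tagging accuracy. The sketch below assumes the standard NLTK tagger interface; older versions expose evaluate(), newer ones accuracy():
test_sents = treebank.tagged_sents()[3000:]   # held-out sentences not used for training
tagger.evaluate(test_sents)                   # per-token accuracy (use tagger.accuracy(test_sents) on newer NLTK)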
One approach to visualizing words and counts is word clouds, which artistically lay out the words with sizes proportional to their counts.
Generally, though, data scientists don’t think much of word clouds, in large part because the placement of the words doesn’t mean anything other than “here’s some space where I was able to fit a word.”
If you ever are forced to create a word cloud, think about whether you can make the axes convey something. For example, imagine that, for each of some collection of data science–related buzzwords, you have two numbers between 0 and 100—the first representing how frequently it appears in job postings, the second how frequently it appears on resumes:
import matplotlib.pyplot as plt
def plot_resumes(plt):
    data = [("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50),
            ("R", 50, 40), ("machine learning", 80, 20), ("statistics", 20, 60),
            ("data science", 60, 70), ("analytics", 90, 3),
            ("team player", 85, 85), ("dynamic", 2, 90), ("synergies", 70, 0),
            ("actionable insights", 40, 30), ("think out of the box", 45, 10),
            ("self-starter", 30, 50), ("customer focus", 65, 15),
            ("thought leadership", 35, 35)]

    def text_size(total):
        """equals 8 if total is 0, 28 if total is 200"""
        return 8 + total / 200 * 20

    for word, job_popularity, resume_popularity in data:
        plt.text(job_popularity, resume_popularity, word,
                 ha='center', va='center',
                 size=text_size(job_popularity + resume_popularity))
    plt.xlabel("Popularity on Job Postings")
    plt.ylabel("Popularity on Resumes")
    plt.axis([0, 100, 0, 100])
    plt.show()

plot_resumes(plt)
All are statistics!