Concepts in text processing

Corpora (语料库)

Corpus is a large collection of texts. It is a body of written or spoken material upon which a linguistic analysis is based. A corpus provides grammarians, lexicographers, and other interested parties with better discriptions of a language. Computer-procesable corpora allow linguists to adopt the principle of total accountability, retrieving all the occurrences of a particular word or structure for inspection or randomly selcted samples. Corpus analysis provide lexical information, morphosyntactic information, semantic information and pragmatic information.


A token is the technical name for a sequence of characters, that we want to treat as a group. The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items with the command: set().


Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. These are words such as the and a. Most search engines will filter out stopwords from search queries and documents in order to save space in their index.


Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook , and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of index while increasing retrieval accuracy.

Frequency Counts (频数统计)

Frequency Counts the number of hits. Frequency counts require finding all the occurences of a particular feature in the corpus. So it is implicit in concordancing. Software is used for this purpose. Frequency counts can be explained statistically.

Word Segmenter (分词)

Word segmentation is the problem of dividing a string of written language into its component words.

In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Some examples where the space character alone may not be sufficient include contractions like can't for can not.)

However the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese, Japanese, where sentences but not words are delimited, Thai and Lao, where phrases and sentences but not words are delimited, and Vietnamese, where syllables but not words are delimited.

Part-Of-Speech Tagger (词性标注工具)

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Named Entity Recognizer(命名实体识别工具)

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages.


Here we will treat text as raw data for the programs we write, programs that manipulate and analyze it in a variety of interesting ways.

Natural Language Processing tools

  • Natural Language Toolkit

    NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

    Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The book is being updated for Python 3 and NLTK 3.

  • Stanford Word Segmenter

    Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.

    The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

  • NLP toolkits for Chinese

Tokenizing Text

Tokenizing text into sentences

The sent_tokenize() function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages. So it knows what punctuation and characters mark the end of a sentence and the beginning of a new sentence.

In [6]:
para = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale."
In [7]:
from nltk.tokenize import sent_tokenize
['Python is a widely used general-purpose, high-level programming language.',
 'Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.',
 'The language provides constructs intended to enable clear programs on both a small and large scale.']

Tokenizing sentences into words

In [10]:
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')
['Hello', 'World', '.']
In [11]:

Tokenizing sentences using regular expressions

In [12]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("[\w']+")
tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']

Filtering stopwords in a tokenized sentence

In [13]:
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
{'all', 'when', 'into', 'now', 'after', 'because', 'of', 'had', 'her', 'ourselves', 'you', 'who', 'hers', 'just', 'being', 'too', 'such', 'very', 'did', 'myself', 'out', 'where', 'most', 'have', 'some', 'an', 'there', 'then', 'himself', 'own', 'if', 'down', 'for', 'were', 'through', 'off', 'few', 'no', 'them', 'he', 'further', 'and', 'nor', 'my', 'itself', 'those', 'does', 'other', 'its', 'not', 'what', 'here', 's', 'i', 'again', 'only', 'with', 'on', 'has', 'having', 'we', 'the', 'doing', 'this', 'why', 'while', 'am', 'whom', 'should', 'themselves', 'can', 'between', 'herself', 'under', 'in', 'any', 'more', 'below', 'a', 'your', 'about', 'yourself', 'was', 'it', 'is', 'or', 'as', 't', 'these', 'both', 'from', 'up', 'once', 'me', 'over', 'same', 'will', 'that', 'during', 'him', 'she', 'so', 'yourselves', 'been', 'to', 'until', 'our', 'do', 'than', 'their', 'before', 'his', 'above', 'by', 'are', 'which', 'be', 'how', 'theirs', 'yours', 'they', 'don', 'each', 'ours', 'at', 'against', 'but'}
In [14]:
words = ["Can't", 'is', 'a', 'contraction']
In [15]:
[word for word in words if word not in english_stops]
["Can't", 'contraction']

Replacing and Correcting Words


One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words

In [19]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

Removing repeating characters

In everyday language, people are often not strictly grammatical. They will write things such as I looooooove it in order to emphasize the word love . However, computers don't know that "looooooove" is a variation of "love" unless they are told. This recipe presents a method to remove these annoying repeating characters in order to end up with a proper English word.

In [34]:
import re
from nltk.corpus import wordnet

replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', '\g<1> will'),
    (r'(\w+)n\'t', '\g<1> not'),
    (r'(\w+)\'ve', '\g<1> have'),
    (r'(\w+)\'s', '\g<1> is'),
    (r'(\w+)\'re', '\g<1> are'),
    (r'(\w+)\'d', '\g<1> would')

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    def replace(self, word):
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
            return repl_word

Spelling correction

Replacing synonyms

It is often useful to reduce the vocabulary of a text by replacing words with common synonyms. By compressing the vocabulary without losing meaning, you can save memory in cases such as frequency analysis and text indexing. Vocabulary reduction can also increase the occurrence of significant collocations

Part-of-speech Tagging

Training a unigram part-of-speech tagger

A unigram generally refers to a single token. Therefore, a unigram tagger only uses a single word as its context for determining the part-of-speech tag.

In [37]:
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)
In [38]:
[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

Classification-based tagging

Text Classification

All are statistics!