{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Natural Language Processing with Python\n", "\n", "Feng Li\n", "\n", "School of Statistics and Mathematics\n", "\n", "Central University of Finance and Economics\n", "\n", "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "\n", "[https://feng.li/python](https://feng.li/python)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Concepts in text processing (文本处理基本概念)\n", "\n", "## Corpora (语料库)\n", "\n", "A corpus is a large collection of texts: a body of written or spoken material on which a linguistic analysis is based. A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving either all the occurrences of a particular word or structure for inspection or a randomly selected sample of them. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## Tokens\n", "\n", "A token is the technical name for a sequence of characters that we want to treat as a group. The vocabulary of a text is the set of distinct tokens it uses, since a set collapses all duplicates. In Python, we can obtain the vocabulary by passing the list of tokens to `set()`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Stopwords (停词)\n", "\n", "\n", "Stopwords are common words that generally do not contribute to the meaning of a sentence,\n", "at least for the purposes of information retrieval and natural language processing. These are\n", "words such as *the* and *a*. 
Most search engines will filter out stopwords from search queries\n", "and documents in order to save space in their index." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## Stemming (词根检索)\n", "\n", "Stemming is a technique to remove affixes from a word, ending up with the stem. For\n", "example, the stem of *cooking* is *cook*, and a good stemming algorithm knows that the *-ing*\n", "suffix can be removed. Stemming is most commonly used by search engines for indexing\n", "words. Instead of storing all forms of a word, a search engine can store only the stems, greatly\n", "reducing the size of the index while typically improving recall." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Frequency Counts (频数统计)\n", "\n", "A frequency count is the number of hits for a particular feature. Frequency counts require finding all the occurrences of that feature in the corpus, so they are implicit in concordancing and are normally produced by software. Frequency counts can then be interpreted statistically.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## Word Segmenter (分词)\n", "\n", "Word segmentation is the problem of dividing a string of written language into its component words.\n", "\n", "In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Cases where the space character alone may not be sufficient include contractions such as *can't* for *cannot*.)\n", "\n", "However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. 
Languages that do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part-Of-Speech Tagger (词性标注工具)\n", "\n", "In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children as the identification of words as nouns, verbs, adjectives, adverbs, etc.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Named Entity Recognizer (命名实体识别工具)\n", "\n", "Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Natural Language Processing tools (自然语言处理工具)\n", "\n", "- Natural Language Toolkit\n", "\n", "    **NLTK** is a leading platform for building Python programs to work with human language data. 
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.\n", "\n", "    *Natural Language Processing with Python* provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Stanford Word Segmenter\n", "\n", "    Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.\n", "\n", "    The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.\n", "\n", "- NLP toolkits for Chinese\n", "\n", "    - [Toolkit for Chinese natural language processing](https://github.com/xpqiu/fnlp/)\n", "\n", "    - [The ICT Natural Language Processing Research Group](http://nlp.ict.ac.cn/english/)\n", "\n", "    - [Jieba Chinese Word Segmenter](https://github.com/fxsjy/jieba)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Tokenizing Text (标记文本)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "- Before you use the NLTK library, use the NLTK Downloader to obtain the required resources. 
The NLTK data will be downloaded into one of the following directories, depending on your OS.\n", "\n", "\n", "    ```\n", "    $HOME/nltk_data\n", "    /usr/share/nltk_data\n", "    /usr/local/share/nltk_data\n", "    /usr/lib/nltk_data\n", "    /usr/local/lib/nltk_data\n", "    /usr/nltk_data\n", "    ```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /home/fli/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n", "[nltk_data] Downloading package stopwords to /home/fli/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n", "[nltk_data] Downloading package treebank to /home/fli/nltk_data...\n", "[nltk_data] Unzipping corpora/treebank.zip.\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.download('punkt')\n", "nltk.download('stopwords')\n", "nltk.download('treebank')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tokenizing text into sentences (断句)\n", "\n", "The `sent_tokenize()` function uses an instance of `PunktSentenceTokenizer` from the\n", "`nltk.tokenize.punkt` module. This instance has already been trained and works well for many European languages, so it knows which punctuation and characters mark the end of one sentence and the beginning of the next." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "para = \"Python is a widely used general-purpose, high-level programming language. \\\n", " Its design philosophy emphasizes code readability, and its syntax allows programmers \\\n", " to express concepts in fewer lines of code than would be possible in languages such as \\\n", " C++ or Java. 
The language provides constructs intended to enable clear programs \\\n", " on both a small and large scale.\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['Python is a widely used general-purpose, high-level programming language.',\n", " 'Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.',\n", " 'The language provides constructs intended to enable clear programs on both a small and large scale.']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import sent_tokenize\n", "sent_tokenize(para)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tokenizing sentences into words (分词)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['Hello', 'World', '.']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import word_tokenize\n", "word_tokenize('Hello World.')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Python', 'is', 'a', 'widely', 'used', 'general-purpose', ',', 'high-level', 'programming', 'language', '.', 'Its', 'design', 'philosophy', 'emphasizes', 'code', 'readability', ',', 'and', 'its', 'syntax', 'allows', 'programmers', 'to', 'express', 'concepts', 'in', 'fewer', 'lines', 'of', 'code', 'than', 'would', 'be', 'possible', 'in', 'languages', 'such', 'as', 'C++', 'or', 'Java', '.', 'The', 'language', 'provides', 'constructs', 'intended', 'to', 'enable', 'clear', 'programs', 'on', 'both', 'a', 'small', 'and', 
'large', 'scale', '.']\n" ] } ], "source": [ "tok = word_tokenize(para)\n", "print(tok)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tokenizing sentences using regular expressions (使用正则表达式分词)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "[\"Can't\", 'is', 'a', 'contraction']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import RegexpTokenizer\n", "tokenizer = RegexpTokenizer(\"[\\w']+\")\n", "tokenizer.tokenize(\"Can't is a contraction.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Filtering stopwords in a tokenized sentence (过滤停词)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'at', 'be', 'yours', 'own', 'if', 'we', 'that', 'll', 'again', 'had', 'you', 'aren', 'their', 'in', 't', \"couldn't\", 'ain', 'an', 'which', 'from', \"mustn't\", 'can', \"mightn't\", 'i', 'more', 'yourselves', 'as', 'should', 'weren', 'the', 'but', 'very', 'until', 'just', 'with', 'wouldn', 'too', 'below', 'she', 'further', 'will', 'now', 'why', \"aren't\", 'than', 'do', 'have', 'to', \"don't\", \"hasn't\", \"you're\", 'this', 'did', \"won't\", 'won', 'and', 'him', 'am', 'other', 'it', 'hers', 've', 'wasn', 'off', 'they', 'above', 'them', \"haven't\", 'before', 'where', 'there', 'being', 'nor', 'our', 'who', 'been', 'by', 'some', 'has', 'only', 'on', 'd', 'through', 'm', 'is', 'didn', 'ourselves', 'theirs', 'does', 'about', 'needn', 'those', 'between', \"that'll\", 'or', 'under', 'no', 'shan', \"wasn't\", \"you'll\", 'not', \"hadn't\", 'both', 'himself', \"shouldn't\", 'out', 'mustn', 'hadn', 'during', 'don', 'while', 'same', 'so', 'whom', 'then', 'few', 'shouldn', 'for', 
'of', 'hasn', 'such', 'how', 'are', 'doing', 'after', 'its', \"needn't\", 'most', 're', 'isn', \"shan't\", \"weren't\", 'up', \"isn't\", 'his', 'haven', 'down', 'itself', \"you've\", \"should've\", 'her', \"didn't\", 'my', 'because', 'themselves', 'all', \"doesn't\", 'having', 'here', 's', 'myself', \"wouldn't\", \"you'd\", \"it's\", 'once', 'herself', 'each', 'mightn', 'ours', 'over', 'into', 'when', 'your', 'was', 'these', 'o', 'were', 'a', 'me', 'he', \"she's\", 'any', 'doesn', 'what', 'against', 'y', 'couldn', 'ma', 'yourself'}\n" ] } ], "source": [ "from nltk.corpus import stopwords\n", "english_stops = set(stopwords.words('english'))\n", "print(english_stops)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "words = [\"Can't\", 'is', 'a', 'contraction']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "[\"Can't\", 'contraction']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[word for word in words if word not in english_stops]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Replacing and Correcting Words (替换和纠正)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Stemming (词干提取)\n", "\n", "One of the most common stemming algorithms is the **Porter stemming algorithm** by Martin\n", "Porter. 
It is designed to remove and replace well-known suffixes of English words." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'cook'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import PorterStemmer\n", "stemmer = PorterStemmer()\n", "stemmer.stem('cooking')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Removing repeating characters (删除重复字符)\n", "\n", "In everyday language, people are often not strictly grammatical. They will write things such as\n", "*I looooooove it* in order to emphasize the word *love*. However, computers don't know\n", "that \"looooooove\" is a variation of \"love\" unless they are told. This recipe presents a method\n", "to remove these annoying repeating characters in order to end up with a proper English word." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import re\n", "from nltk.corpus import wordnet  # requires nltk.download('wordnet')\n", "\n", "# (regex, replacement) pairs that expand common English contractions\n", "replacement_patterns = [\n", "    (r'won\\'t', 'will not'),\n", "    (r'can\\'t', 'cannot'),\n", "    (r'i\\'m', 'i am'),\n", "    (r'ain\\'t', 'is not'),\n", "    (r'(\\w+)\\'ll', r'\\g<1> will'),\n", "    (r'(\\w+)n\\'t', r'\\g<1> not'),\n", "    (r'(\\w+)\\'ve', r'\\g<1> have'),\n", "    (r'(\\w+)\\'s', r'\\g<1> is'),\n", "    (r'(\\w+)\\'re', r'\\g<1> are'),\n", "    (r'(\\w+)\\'d', r'\\g<1> would')\n", "]\n", "\n", "class RegexpReplacer(object):\n", "\n", "    def __init__(self, patterns=replacement_patterns):\n", "        self.patterns = [(re.compile(regex), repl) for (regex, repl) in\n", "                         patterns]\n", "\n", "    def replace(self, text):\n", "        # Apply every pattern in order to the running result\n", "        s = text\n", "        for (pattern, repl) in self.patterns:\n", "            (s, count) = re.subn(pattern, repl, s)\n", "        return s\n", "\n", "class RepeatReplacer(object):\n", "\n", "    def __init__(self):\n", "        # Group 2 is a character immediately followed by itself\n", "        self.repeat_regexp = 
re.compile(r'(\\w*)(\\w)\\2(\\w*)')\n", "        self.repl = r'\\1\\2\\3'\n", "\n", "    def replace(self, word):\n", "        # Stop as soon as WordNet recognizes the word\n", "        if wordnet.synsets(word):\n", "            return word\n", "        # Drop one repeated character, then recurse until nothing changes\n", "        repl_word = self.repeat_regexp.sub(self.repl, word)\n", "        if repl_word != word:\n", "            return self.replace(repl_word)\n", "        else:\n", "            return repl_word" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Spelling correction (拼写纠正)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Replacing synonyms (同义词替换)\n", "\n", "It is often useful to reduce the vocabulary of a text by replacing words with common\n", "synonyms. By compressing the vocabulary without losing meaning, you can save memory\n", "in cases such as frequency analysis and text indexing. Vocabulary reduction\n", "can also increase the occurrence of significant collocations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part-of-speech Tagging (词性标注工具)\n", "\n", "## Training a unigram part-of-speech tagger\n", "\n", "A unigram generally refers to a single token. Therefore, a unigram tagger only uses a single\n", "word as its context for determining the part-of-speech tag." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['Pierre',\n", " 'Vinken',\n", " ',',\n", " '61',\n", " 'years',\n", " 'old',\n", " ',',\n", " 'will',\n", " 'join',\n", " 'the',\n", " 'board',\n", " 'as',\n", " 'a',\n", " 'nonexecutive',\n", " 'director',\n", " 'Nov.',\n", " '29',\n", " '.']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tag import UnigramTagger\n", "from nltk.corpus import treebank\n", "train_sents = treebank.tagged_sents()[:3000]\n", "tagger = UnigramTagger(train_sents)\n", "treebank.sents()[0]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "[('Pierre', 'NNP'),\n", " ('Vinken', 'NNP'),\n", " (',', ','),\n", " ('61', 'CD'),\n", " ('years', 'NNS'),\n", " ('old', 'JJ'),\n", " (',', ','),\n", " ('will', 'MD'),\n", " ('join', 'VB'),\n", " ('the', 'DT'),\n", " ('board', 'NN'),\n", " ('as', 'IN'),\n", " ('a', 'DT'),\n", " ('nonexecutive', 'JJ'),\n", " ('director', 'NN'),\n", " ('Nov.', 'NNP'),\n", " ('29', 'CD'),\n", " ('.', '.')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tagger.tag(treebank.sents()[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Word Clouds\n", "\n", "One approach to visualizing words and counts is word clouds, which artistically lay out the words with sizes proportional to their counts.\n", "\n", "Generally, though, data scientists don’t think much of word clouds, in large part\n", "because the placement of the words doesn’t mean anything other than “here’s some\n", "space where I was able to fit a word.”\n", "\n", "If you ever are forced to create a word cloud, think about whether you can make the\n", "axes convey something. 
For example, imagine that, for each of some collection of data\n", "science–related buzzwords, you have two numbers between 0 and 100: the first\n", "representing how frequently it appears in job postings, the second how frequently it\n", "appears on resumes:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt \n", "\n", "\n", "def plot_resumes(plt):\n", " data = [ (\"big data\", 100, 15), (\"Hadoop\", 95, 25), (\"Python\", 75, 50),\n", " (\"R\", 50, 40), (\"machine learning\", 80, 20), (\"statistics\", 20, 60),\n", " (\"data science\", 60, 70), (\"analytics\", 90, 3),\n", " (\"team player\", 85, 85), (\"dynamic\", 2, 90), (\"synergies\", 70, 0),\n", " (\"actionable insights\", 40, 30), (\"think out of the box\", 45, 10),\n", " (\"self-starter\", 30, 50), (\"customer focus\", 65, 15),\n", " (\"thought leadership\", 35, 35)]\n", "\n", " def text_size(total):\n", " \"\"\"equals 8 if total is 0, 28 if total is 200\"\"\"\n", " return 8 + total / 200 * 20\n", "\n", " for word, job_popularity, resume_popularity in data:\n", " plt.text(job_popularity, resume_popularity, word,\n", " ha='center', va='center',\n", " size=text_size(job_popularity + resume_popularity))\n", " plt.xlabel(\"Popularity on Job Postings\")\n", " plt.ylabel(\"Popularity on Resumes\")\n", " plt.axis([0, 100, 0, 100])\n", " plt.show()\n", " \n", "\n", "plot_resumes(plt) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Text Classification (文本分类)\n", "\n", "All are statistics!" 
] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "rise": { "auto_select": "first", "autolaunch": false, "enable_chalkboard": true, "start_slideshow_at": "selected", "theme": "black" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }