{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Python and Texts\n", "Feng Li\n", "\n", "School of Statistics and Mathematics\n", "\n", "Central University of Finance and Economics\n", "\n", "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "\n", "[https://feng.li/python](https://feng.li/python)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regular Expression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Regular expressions** (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the `re` module. \n", "\n", "The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regular Expression Syntax " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'.\n", "\n", "\n", "Some characters, like '|' or '(', are special. 
**Special characters** either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- `[]`: Used to indicate a set of characters. In a set: \n", " \n", " - Characters can be listed individually, e.g. `[amk]` will match 'a', 'm', or 'k'.\n", " - Ranges of characters can be indicated by giving two characters and separating them by a '-', for example `[a-z]` will match any lowercase ASCII letter, `[0-5][0-9]` will match all two-digit numbers from 00 to 59, and `[0-9A-Fa-f]` will match any hexadecimal digit. If `-` is escaped (e.g. `[a\\-z]`) or if it’s placed as the first or last character (e.g. `[a-]`), it will match a literal '-'.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- `(...)`: Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the `\\number` special sequence, described below. To match the literals '(' or ')', use `\\(` or `\\)`, or enclose them inside a character class: `[(]`, `[)]`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "- `|`: `A|B`, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see above) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. 
To match a literal '|', use `\\|`, or enclose it inside a character class, as in `[|]`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The special sequences consist of `\\` and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, `\\$` matches the character '$'.\n", "\n", "- `\\d`: Matches any decimal digit; this is equivalent to the class `[0-9]`.\n", "\n", "- `\\D`: Matches any non-digit character; this is equivalent to the class `[^0-9]`.\n", "\n", "- `\\s`: Matches any whitespace character; this is equivalent to the class `[ \\t\\n\\r\\f\\v]`.\n", "\n", "- `\\S`: Matches any non-whitespace character; this is equivalent to the class `[^ \\t\\n\\r\\f\\v]`.\n", "\n", "- `\\w`: Matches any alphanumeric character; this is equivalent to the class `[a-zA-Z0-9_]`.\n", "\n", "- `\\W`: Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "These sequences can be included inside a character class. For example, `[\\s,.]` is a character class that will match any whitespace character, or ',' or '.'.\n", "\n", "The final metacharacter in this section is `.`. It matches anything except a newline character, and there’s an alternate mode (`re.DOTALL`) where it will match even a newline. `.` is often used where you want to match “any character”.\n", "\n", "\n", "For a complete list of sequences and expanded class definitions for Unicode string patterns, see the last part of [Regular Expression Syntax in the Standard Library](https://docs.python.org/3.9/library/re.html#re-syntax) reference." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Python `re` module" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This module provides regular expression matching operations similar to those found in Perl." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Compiling Regular Expressions " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full-featured methods for compiled regular expressions. Most non-trivial applications use the compiled form. `re.compile()` compiles a regular expression pattern into a regular expression object, which can be used for matching via its `match()` and `search()` methods:\n", "\n", "```\n", " re.compile(pattern, flags=0)\n", "```\n", "\n", "Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "re.compile(r'oh my god', re.UNICODE)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "p = re.compile('oh my god')\n", "p" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 9), match='oh my god'>\n" ] } ], "source": [ "print(p.search('oh my god, I love Python'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Backslash character (`\\`)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Regular expressions use the backslash character (`\\`) to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write `'\\\\\\\\'` as the pattern string, because the regular expression must be `\\\\`, and each backslash must be expressed as `\\\\` inside a regular Python string literal." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\\\\\n" ] } ], "source": [ "print('\\\\\\\\')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So `r\"\\n\"` is a two-character string containing `\\` and `n`, while `\"\\n\"` is a one-character string containing a newline. 
Usually patterns will be expressed in Python code using this raw string notation." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print('\\n') # print a new line" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\\n\n" ] } ], "source": [ "print(r'\\n') # print '\\n' string" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Matching Characters " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## `re.match()` and `re.search()`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Python offers two different primitive operations based on regular expressions: `re.match()` checks for a match only at the beginning of the string, while `re.search()` checks for a match anywhere in the string (this is what Perl does by default)." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "<re.Match object; span=(7, 8), match='c'>\n" ] } ], "source": [ "import re\n", "out1 = re.match('c', \"hi china, I love coding\")  # 'c' is not at the start, so no match\n", "out2 = re.search('c', \"I love coding\")  # finds the first 'c' anywhere in the string\n", "\n", "print(out1)\n", "print(out2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Regular expressions beginning with '^' can be used with `search()` to **restrict the match to the beginning of the string**:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "None\n", "<re.Match object; span=(0, 1), match='a'>\n" ] } ], "source": [ "print(re.match(\"c\", \"abcdef\")) # No match\n", "print(re.search(\"^c\", \"abcdef\")) # No match\n", "print(re.search(\"^a\", \"abcdef\")) # Match" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Even in `MULTILINE` mode, `match()` only matches at the beginning of the string, whereas `search()` with a regular expression beginning with '^' will match at the beginning of each line." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A\n", "B\n", "X\n", "None\n", "<re.Match object; span=(4, 5), match='X'>\n" ] } ], "source": [ "print('A\\nB\\nX')\n", "print(re.match('X', 'A\\nB\\nX', re.MULTILINE)) # No match\n", "print(re.search('^X', 'A\\nB\\nX', re.MULTILINE)) # Match\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## `match.group([group1, ...])`\n", "\n", "Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. 
Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range `[1..99]`, it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Isaac Newton\n", "Isaac\n", "Newton\n", "('Isaac', 'Newton')\n" ] } ], "source": [ "m = re.match(r\"(\\w+) (\\w+)\", \"Isaac Newton, physicist\")\n", "print(m.group(0)) # The entire match\n", "\n", "print(m.group(1)) # The first parenthesized subgroup.\n", "\n", "print(m.group(2)) # The second parenthesized subgroup.\n", "\n", "print(m.group(1, 2)) # Multiple arguments give us a tuple." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## `match.start([group])` and `match.end([group])`\n", "\n", "Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. 
For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to `m.group(g)`) is `m.string[m.start(g):m.end(g)]`. The example below uses `start()` and `end()` to cut `remove_this` out of an email address:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(7, 18), match='remove_this'>\n" ] }, { "data": { "text/plain": [ "'tony@tiger.net'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email = \"tony@tiremove_thisger.net\"\n", "m = re.search(\"remove_this\", email)\n", "print(m)\n", "email[:m.start()] + email[m.end():]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Splitting Strings" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The `split()` method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the `split()` method of strings but provides much more generality in the delimiters that you can split by; string `split()` only supports splitting by whitespace or by a fixed string." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['Words', 'words', 'words', '']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split(r'\\W+', 'Words, words, words.')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['Words', ', ', 'words', ', ', 'words', '.', '']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split(r'(\\W+)', 'Words, words, words.')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['Words', 'words, words.']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split(r'\\W+', 'Words, words, words.', maxsplit=1)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['0', '3', '9']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Substitution \n", "\n", " re.sub(pattern, repl, string, count=0, flags=0)\n", "\n", "Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, `\\n` is converted to a single newline character, `\\r` is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors; other unknown escapes such as `\\&` are left alone. 
Backreferences, such as `\\6`, are replaced with the substring matched by group 6 in the pattern. For example:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'static PyObject*\\npy_myfunc(void)\\n{'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.sub(r'def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',\n", " r'static PyObject*\\npy_\\1(void)\\n{',\n", " 'def myfunc():')" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "rise": { "auto_select": "first", "autolaunch": false, "enable_chalkboard": true, "start_slideshow_at": "selected", "theme": "black" } }, "nbformat": 4, "nbformat_minor": 1 }