Python 正则表达式|极客教程

Python 正则表达式教程展示了如何在 Python 中使用正则表达式。对于 Python 中的正则表达式，我们使用 re 模块。

正则表达式用于文本搜索和更高级的文本操作。正则表达式是内置工具，如 grep，sed，文本编辑器（如 vi，emacs），编程语言（如 Tcl，Perl 和 Python）。

Python `re`模块

在 Python 中，re模块提供了正则表达式匹配操作。

模式是一个正则表达式，用于定义我们正在搜索或操纵的文本。它由文本文字和元字符组成。用compile()函数编译该模式。由于正则表达式通常包含特殊字符，因此建议使用原始字符串。（原始字符串以 r 字符开头。）这样，在将字符编译为模式之前，不会对这些字符进行解释。

编译模式后，可以使用其中一个函数将模式应用于文本字符串。函数包括match()，search()，find()和finditer()。

下表显示了一些正则表达式：

正则表达式	含义
`.`	匹配任何单个字符。
`?`	一次匹配或根本不匹配前面的元素。
`+`	与前面的元素匹配一次或多次。
`*`	与前面的元素匹配零次或多次。
`^`	匹配字符串中的起始位置。
`$`	匹配字符串中的结束位置。
`\|`	备用运算符。
`[abc]`	匹配 a 或 b 或 c。
`[a-c]`	范围; 匹配 a 或 b 或 c。
`[^abc]`	否定，匹配除 a 或 b 或 c 之外的所有内容。
`\s`	匹配空白字符。
`\w`	匹配单词字符；等同于`[a-zA-Z_0-9]`

匹配函数

以下是一个代码示例，演示了 Python 中简单正则表达式的用法。

match_fun.py

#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们有一个单词元组。编译后的模式将在每个单词中寻找一个“ book”字符串。

pattern = re.compile(r'book')

使用compile()函数，我们可以创建图案。正则表达式是一个原始字符串，由四个普通字符组成。

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

我们遍历元组并调用match()函数。它将模式应用于单词。如果字符串开头有匹配项，则match()函数将返回匹配对象。

$ ./match_fun.py 
The book matches 
The bookworm matches 
The bookish matches 
The bookstore matches

元组中的四个单词与模式匹配。请注意，以“ book”一词开头的单词不匹配。为了包括这些词，我们使用search()函数。

搜索函数

search()函数查找正则表达式模式产生匹配项的第一个位置。

search_fun.py

#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:
    if re.search(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们使用search()函数查找“ book”一词。

$ ./search_fun.py 
The book matches 
The bookworm matches 
The bookish matches 
The cookbook matches 
The bookstore matches 
The pocketbook matches

这次还包括菜谱和袖珍书中的单词。

点元字符

点（。）元字符代表文本中的任何单个字符。

dot_meta.py

#!/usr/bin/python3

import re

words = ('seven', 'even', 'prevent', 'revenge', 'maven', 
    'eleven', 'amen', 'event')

pattern = re.compile(r'.even')

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们有一个带有八个单词的元组。我们在每个单词上应用一个包含点元字符的模式。

pattern = re.compile(r'.even')

点代表文本中的任何单个字符。字符必须存在。

$ ./dot_meta.py 
The seven matches 
The revenge matches

两个字匹配模式：七个和复仇。

问号元字符

问号（？）元字符是与上一个元素零或一次匹配的量词。

question_mark_meta.py

#!/usr/bin/python3

import re

words = ('seven', 'even','prevent', 'revenge', 'maven', 
    'eleven', 'amen', 'event')

pattern = re.compile(r'.?even')

for word in words:
    if re.match(pattern, word):
        print('The {} matches '.format(word))

在示例中，我们在点字符后添加问号。这意味着在模式中我们可以有一个任意字符，也可以在那里没有任何字符。

$ ./question_mark_meta.py 
The seven matches 
The even matches 
The revenge matches 
The event matches

这次，除了七个和复仇外，偶数和事件词也匹配。

锚点

锚点匹配给定文本内字符的位置。当使用^锚时，匹配必须发生在字符串的开头，而当使用$锚时，匹配必须发生在字符串的结尾。

anchors.py

#!/usr/bin/python3

import re

sentences = ('I am looking for Jane.',
    'Jane was walking along the river.',
    'Kate and Jane are close friends.')

pattern = re.compile(r'^Jane')

for sentence in sentences:
    if re.search(pattern, sentence):
        print(sentence)

在示例中，我们有三个句子。搜索模式为^Jane。该模式检查“ Jane”字符串是否位于文本的开头。 Jane\.将在句子结尾处查找“ Jane”。

`fullmatch`

可以使用fullmatch()函数或通过将术语放在锚点之间来进行精确匹配：^和$。

exact_match.py

#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible', 
    'bookish','cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'^book$')

for word in words:
    if re.search(pattern, word):
        print('The {} matches'.format(word))

在示例中，我们寻找与“ book”一词完全匹配的内容。

$ ./exact_match.py 
The book matches

这是输出。

字符类

字符类定义了一组字符，任何字符都可以出现在输入字符串中以使匹配成功。

character_class.py

#!/usr/bin/python3

import re

words = ('a gray bird', 'grey hair', 'great look')

pattern = re.compile(r'gr[ea]y')

for word in words:
    if re.search(pattern, word):
        print('{} matches'.format(word))

在该示例中，我们使用字符类同时包含灰色和灰色单词。

pattern = re.compile(r'gr[ea]y')

[ea]类允许在模式中使用’e’或’a’字符。

命名字符类

有一些预定义的字符类。 \s与空白字符[\t\n\t\f\v]匹配，\d与数字[0-9]匹配，\w与单词字符[a-zA-Z0-9_]匹配。

named_character_class.py

#!/usr/bin/python3

import re

text = 'We met in 2013\. She must be now about 27 years old.'

pattern = re.compile(r'\d+')

found = re.findall(pattern, text)

if found:
    print('There are {} numbers'.format(len(found)))

在示例中，我们计算文本中的数字。

pattern = re.compile(r'\d+')

\d+模式在文本中查找任意数量的数字集。

found = re.findall(pattern, text)

使用findall()方法，我们可以查找文本中的所有数字。

$ ./named_character_classes.py 
There are 2 numbers

这是输出。

不区分大小写的匹配

默认情况下，模式匹配区分大小写。通过将re.IGNORECASE传递给compile()函数，我们可以使其不区分大小写。

case_insensitive.py

#!/usr/bin/python3

import re

words = ('dog', 'Dog', 'DOG', 'Doggy')

pattern = re.compile(r'dog', re.IGNORECASE)

for word in words:
    if re.match(pattern, word):
        print('{} matches'.format(word))

在示例中，无论大小写如何，我们都将模式应用于单词。

$ ./case_insensitive.py 
dog matches
Dog matches
DOG matches
Doggy matches

所有四个单词都与模式匹配。

交替

交替运算符|创建具有多种选择的正则表达式。

alternations.py

#!/usr/bin/python3

import re

words = ("Jane", "Thomas", "Robert",
    "Lucy", "Beky", "John", "Peter", "Andy")

pattern = re.compile(r'Jane|Beky|Robert')

for word in words:
    if re.match(pattern, word):
        print(word)

列表中有八个名称。

pattern = re.compile(r'Jane|Beky|Robert')

此正则表达式查找“ Jane”，“ Beky”或“ Robert”字符串。

查找方法

finditer()方法返回一个迭代器，该迭代器在字符串中的模式的所有不重叠匹配上产生匹配对象。

find_iter.py

#!/usr/bin/python3

import re

text = ('I saw a fox in the wood. The fox had red fur.')

pattern = re.compile(r'fox')

found = re.finditer(pattern, text)

for item in found:

    s = item.start()
    e = item.end()
    print('Found {} at {}:{}'.format(text[s:e], s, e))

在示例中，我们在文本中搜索“ fox”一词。我们遍历找到的匹配项的迭代器，并使用它们的索引进行打印。

s = item.start()
e = item.end()

start()和end()方法分别返回起始索引和结束索引。

$ ./find_iter.py 
Found fox at 8:11
Found fox at 29:32

这是输出。

捕获组

捕获组是一种将多个字符视为一个单元的方法。通过将字符放置在一组圆括号内来创建它们。例如，（book）是包含’b’，’o’，’o’，’k’，字符的单个组。

捕获组技术使我们能够找出字符串中与常规模式匹配的那些部分。

capturing_groups.py

#!/usr/bin/python3

import re

content = '''<p>The <code>Pattern</code> is a compiled
representation of a regular expression.</p>'''

pattern = re.compile(r'(</?[a-z]*>)')

found = re.findall(pattern, content)

for tag in found:
    print(tag)

该代码示例通过捕获一组字符来打印提供的字符串中的所有 HTML 标签。

found = re.findall(pattern, content)

为了找到所有标签，我们使用findall()方法。

$ ./capturing_groups.py 
<p>
<code>
</code>
</p>

我们找到了四个 HTML 标签。

Python 正则表达式电子邮件示例

在以下示例中，我们创建一个用于检查电子邮件地址的正则表达式模式。

emails.py

#!/usr/bin/python3

import re

emails = ("luke@gmail.com", "andy@yahoocom", 
    "34234sdfa#2345", "f344@gmail.com")

pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')

for email in emails:
    if re.match(pattern, email):
        print("{} matches".format(email))
    else:
        print("{} does not match".format(email))

本示例提供了一种可能的解决方案。

pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')

前^和后$个字符提供精确的模式匹配。模式前后不允许有字符。电子邮件分为五个部分。第一部分是本地部分。这通常是公司，个人或昵称的名称。 [a-zA-Z0-9._-]+列出了所有可能的字符，我们可以在本地使用。它们可以使用一次或多次。

第二部分由文字@字符组成。第三部分是领域部分。通常是电子邮件提供商的域名，例如 yahoo 或 gmail。 [a-zA-Z0-9-]+是一个字符类，提供可在域名中使用的所有字符。 +量词允许使用这些字符中的一个或多个。

第四部分是点字符。它前面带有转义字符（\），以获取文字点。

最后一部分是顶级域：[a-zA-Z.]{2,18}。顶级域可以包含 2 到 18 个字符，例如 sk，net，信息，旅行，清洁，旅行保险。最大长度可以为 63 个字符，但是今天大多数域都少于 18 个字符。还有一个点字符。这是因为某些顶级域包含两个部分：例如 co.uk。

$ ./emails.py 
luke@gmail.com matches
andy@yahoocom does not match
34234sdfa#2345 does not match
f344@gmail.com matches

这是输出。

在本章中，我们介绍了 Python 中的正则表达式。

您可能也对以下相关教程感兴趣： Python CSV 教程和 Python 教程，正则表达式。

Python 正则表达式

Python `re`模块

匹配函数

搜索函数

点元字符

问号元字符

锚点

`fullmatch`

字符类

命名字符类

不区分大小写的匹配

交替

查找方法

捕获组

Python 正则表达式电子邮件示例

Python教程

Java教程

Web教程

数据库教程

图形图像教程

大数据教程

开发工具教程

计算机教程

Python 教程

回顶部

Python re模块

匹配函数

搜索函数

点元字符

问号元字符

锚点

fullmatch

字符类

命名字符类

不区分大小写的匹配

交替

查找方法

捕获组

Python 正则表达式电子邮件示例

Python教程

Java教程

Web教程

数据库教程

图形图像教程

大数据教程

开发工具教程

计算机教程

Python 教程

回顶部

Python `re`模块

`fullmatch`