PHP 正则表达式|极客教程

本章我们介绍了 PHP 中的正则表达式，正则表达式用于文本搜索和更高级的文本操作。正则表达式是内置工具，如 grep，sed，文本编辑器（如 vi，emacs），编程语言（如 Tcl，Perl 和 Python）。 PHP 也具有对正则表达式的内置支持。

在 PHP 中，有两个用于正则表达式的模块：POSIX Regex 和 PCRE。 POSIX 正则表达式已贬值。在本章中，我们将使用 PCRE 示例。 PCRE 代表与 Perl 兼容的正则表达式。

使用正则表达式时，需要做两件事：正则表达式函数和模式。

模式是一个正则表达式，用于定义我们正在搜索或操纵的文本。它由文本文字和元字符组成。该模式放置在两个定界符内。这些通常是//，##或@@字符。它们通知正则表达式函数模式的开始和结束位置。

这是 PCRE 中使用的部分元字符列表。


`.`	匹配任何单个字符。
`*`	与前面的元素匹配零次或多次。
`[ ]`	括号表达。与方括号内的字符匹配。
`[^ ]`	匹配方括号中未包含的单个字符。
`^`	匹配字符串中的起始位置。
`$`	匹配字符串中的结束位置。
`\|`	备用运算符。

PRCE 函数

我们定义一些 PCRE regex 函数。它们都有一个 preg 前缀。

preg_split()-按正则表达式模式分割字符串
preg_match()-执行正则表达式匹配
preg_replace()-用正则表达式模式搜索和替换字符串
preg_grep()-返回与正则表达式模式匹配的数组条目

接下来，我们将为每个函数提供一个示例。

php> print_r(preg_split("@\s@", "Jane\tKate\nLucy Marion"));
Array
(
    [0] => Jane
    [1] => Kate
    [2] => Lucy
    [3] => Marion
)

我们有四个名字，用空格分开。 \s是代表空格的字符类。 preg_split()函数返回数组中的拆分字符串。

php> echo preg_match("#[a-z]#", "s");
1

preg_match()函数查看字符类[a-z]中是否包含’s’字符。该类代表从 a 到 z 的所有字符。成功返回 1。

php> echo preg_replace("/Jane/","Beky","I saw Jane. Jane was beautiful.");
I saw Beky. Beky was beautiful.

preg_replace()函数将单词“简”的所有出现替换为单词“贝基”。

php> print_r(preg_grep("#Jane#", ["Jane", "jane", "Joan", "JANE"]));
Array
(
    [0] => Jane
)

preg_grep()函数返回与给定模式匹配的单词数组。在此示例中，数组中仅返回一个单词。这是因为默认情况下，搜索区分大小写。

php> print_r(preg_grep("#Jane#i", ["Jane", "jane", "Joan", "JANE"]));
Array
(
    [0] => Jane
    [1] => jane
    [3] => JANE
)

在此示例中，我们执行不区分大小写的 grep。我们将i修饰符放在右定界符之后。现在，返回的数组包含三个单词。

点元字符

.（点）元字符代表文本中的任何单个字符。

single.php

<?php

 $words = [ "Seven", "even", "Maven", "Amen", "Leven" ];$ pattern = "/.even/";

foreach ( $words as$ word) {

    if (preg_match( $pattern,$ word)) {
        echo " $word matches the pattern\n"; } else { echo "$ word does not match the pattern\n";
    }
}

在$words数组中，我们有五个词。

$pattern = "/.even/";

在这里，我们定义搜索模式。模式是一个字符串。正则表达式位于分隔符内。分隔符是强制性的。在我们的例子中，我们使用正斜杠/ /作为分隔符。请注意，如果需要，我们可以使用不同的定界符。点字符代表任何单个字符。

if (preg_match( $pattern,$ word)) {
    echo " $word matches the pattern\n"; } else { echo "$ word does not match the pattern\n";
}

如果这五个词与模式匹配，我们将对其进行测试。

$ php single.php 
Seven matches the pattern
even does not match the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern

七个和七个词与我们的搜索模式匹配。

锚点

锚点匹配给定文本内字符的位置。

在下一个示例中，我们查看字符串是否位于句子的开头。

anchors.php

我们有两个句子。模式是^Jane。该模式检查’Jane’字符串是否位于文本的开头。

$ php anchors.php 
Jane is not at the beginning of the $sentence1
Jane is at the beginning of the $sentence2

php> echo preg_match("#Jane $#", "I love Jane"); 1 php> echo preg_match("#Jane$ #", "Jane does not love me");
0

Jane$模式与字符串“ Jane”结尾的字符串匹配。

完全匹配

在以下示例中，我们显示了如何查找完全匹配的单词。

php> echo preg_match("/mother/", "mother");
1
php> echo preg_match("/mother/", "motherboard");
1
php> echo preg_match("/mother/", "motherland");
1

mother模式适合单词 Mother，母板和祖国。说，我们只想查找完全匹配的单词。我们将使用前面提到的锚点^和$字符。

php> echo preg_match("/^mother $/", "motherland"); 0 php> echo preg_match("/^mother$ /", "Who is your mother?");
0
php> echo preg_match("/^mother$/", "mother");
1

使用定位字符，我们可以获得与模式完全匹配的单词。

量词

标记或组之后的量词指定允许前面的元素出现的频率。

 ?     - 0 or 1 match
 *     - 0 or more
 +     - 1 or more
 {n}   - exactly n
 {n,}  - n or more
 {,n}  - n or less (??)
 {n,m} - range n to m

上面是常见量词的列表。

问号?表示存在零或前一个元素之一。

zeroorone.php

<?php

 $words = [ "color", "colour", "comic", "colourful", "colored", "cosmos", "coloseum", "coloured", "colourful" ];$ pattern = "/colou?r/";

foreach ( $words as$ word) {
    if (preg_match( $pattern,$ word)) {
        echo " $word matches the pattern\n"; } else { echo "$ word does not match the pattern\n";
    }
}

$words数组中有四个 9。

$pattern = "/colou?r/";

颜色用于美式英语，颜色用于英式英语。此模式匹配两种情况。

$ php zeroorone.php 
color matches the pattern
colour matches the pattern
comic does not match the pattern
colourful matches the pattern
colored matches the pattern
cosmos does not match the pattern
coloseum does not match the pattern
coloured matches the pattern
colourful matches the pattern

这是zeroorone.php脚本的输出。

*元字符与前面的元素匹配零次或更多次。

zeroormore.php

<?php

 $words = [ "Seven", "even", "Maven", "Amen", "Leven" ];$ pattern = "/.*even/";

foreach ( $words as$ word) {

    if (preg_match( $pattern,$ word)) {
        echo " $word matches the pattern\n"; } else { echo "$ word does not match the pattern\n";
    }
}

在上面的脚本中，我们添加了*元字符。 .*组合表示零个，一个或多个单个字符。

$ php zeroormore.php 
Seven matches the pattern
even matches the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern

现在该模式匹配三个单词：七个，偶数和 Leven。

php> print_r(preg_grep("#o{2}#", ["gool", "root", "foot", "dog"]));
Array
(
    [0] => gool
    [1] => root
    [2] => foot
)

o{2}模式匹配恰好包含两个’o’字符的字符串。

php> print_r(preg_grep("#^\d{2,4}$#", ["1", "12", "123", "1234", "12345"]));
Array
(
    [1] => 12
    [2] => 123
    [3] => 1234
)

我们有这个^\d{2,4}$模式。 \d是一个字符集；它代表数字。该模式匹配具有 2、3 或 4 位数字的数字。

交替

下一个示例说明了交替运算符|。该运算符使您可以创建具有多种选择的正则表达式。

alternation.php

<?php

 $names = [ "Jane", "Thomas", "Robert", "Lucy", "Beky", "John", "Peter", "Andy" ];$ pattern = "/Jane|Beky|Robert/";

foreach ( $names as$ name) {

    if (preg_match( $pattern,$ name)) {
        echo " $name is my friend\n"; } else { echo "$ name is not my friend\n";
    }
}

$names数组中有八个名称。

$pattern = "/Jane|Beky|Robert/";

这是搜索模式。该模式查找“简”，“贝基”或“罗伯特”字符串。

$ php alternation.php 
Jane is my friend
Thomas is not my friend
Robert is my friend
Lucy is not my friend
Beky is my friend
John is not my friend
Peter is not my friend
Andy is not my friend

这是脚本的输出。

子模式

我们可以使用方括号()在样式内创建子模式。

php> echo preg_match("/book(worm)? $/", "bookworm"); 1 php> echo preg_match("/book(worm)?$ /", "book");
1
php> echo preg_match("/book(worm)?$/", "worm");
0

我们有以下正则表达式模式：book(worm)?$。 (worm)是子模式。？字符跟随子模式，这意味着该子模式在最终模式中可能出现 0、1 次。这里的$字符用于字符串的精确末尾匹配。没有它，诸如书店，bookmania 之类的词也将匹配。

php> echo preg_match("/book(shelf|worm)? $/", "book"); 1 php> echo preg_match("/book(shelf|worm)?$ /", "bookshelf");
1
php> echo preg_match("/book(shelf|worm)? $/", "bookworm"); 1 php> echo preg_match("/book(shelf|worm)?$ /", "bookstore");
0

子模式经常交替使用。 (shelf|worm)子模式可创建多个单词组合。

字符类

我们可以使用方括号将字符组合成字符类。字符类与括号中指定的任何字符匹配。

characterclass.php

<?php

 $words = [ "sit", "MIT", "fit", "fat", "lot" ];$ pattern = "/[fs]it/";

foreach ( $words as$ word) {

    if (preg_match( $pattern,$ word)) {
        echo " $word matches the pattern\n"; } else { echo "$ word does not match the pattern\n";
    }
}

我们用两个字符定义一个字符集。

$pattern = "/[fs]it/";

这是我们的模式。 [fs]是字符类。请注意，我们一次只能处理一个字符。我们要么考虑 f 要么 s，但不能同时考虑两者。

$ php characterclass.php 
sit matches the pattern
MIT does not match the pattern
fit matches the pattern
fat does not match the pattern
lot does not match the pattern

这是脚本的结果。

我们还可以将速记元字符用于字符类。 \w代表字母数字字符，\d代表数字，\s空格字符。

shorthand.php

<?php

 $words = [ "Prague", "111978", "terry2", "mitt##" ];$ pattern = "/\w{6}/";

foreach ( $words as$ word) {

    if (preg_match( $pattern,$ word)) {
        echo " $word matches the pattern\n"; } else { echo "$ word does not match the pattern\n";
    }
}

在上面的脚本中，我们测试包含字母数字字符的单词。 \w{6}代表六个字母数字字符。仅单词mitt##不匹配，因为它包含非字母数字字符。

php> echo preg_match("#[^a-z]{3}#", "ABC");
1

#[^a-z]{3}#模式代表三个不在 a-z 类中的字符。 “ ABC”字符符合条件。

php> print_r(preg_grep("#\d{2,4}#", [ "32", "234", "2345", "3d3", "2"]));
Array
(
    [0] => 32
    [1] => 234
    [2] => 2345
)

在上面的示例中，我们有一个匹配 2、3 和 4 位数字的模式。

提取匹配

preg_match()具有可选的第三个参数。如果提供，则将其填充为搜索结果。变量是一个数组，其第一个元素包含与完整模式匹配的文本，第二个元素包含第一个捕获的带括号的子模式，依此类推。

extract_matches.php

<?php

 $times = [ "10:10:22", "23:23:11", "09:06:56" ];$ pattern = "/(\d\d):(\d\d):(\d\d)/";

foreach ( $times as$ time) {

     $r = preg_match($ pattern,  $time,$ match);

    if ( $r) { echo "The$ match[0] is split into:\n";

        echo "Hour:  $match[1]\n"; echo "Minute:$ match[2]\n";
        echo "Second: $match[3]\n";
    } 
}

在示例中，我们提取时间字符串的一部分。

$times = [ "10:10:22", "23:23:11", "09:06:56" ];

我们在英语语言环境中有三个时间字符串。

$pattern = "/(\d\d):(\d\d):(\d\d)/";

模式使用方括号分为三个子模式。我们想确切地指每个部分。

$r = preg_match($pattern, $time, $match);

我们将第三个参数传递给preg_match()函数。如果匹配，则包含匹配字符串的文本部分。

if ( $r) { echo "The$ match[0] is split into:\n";

    echo "Hour:  $match[1]\n"; echo "Minute:$ match[2]\n";
    echo "Second: $match[3]\n";
}

$match[0]包含与完整模式匹配的文本，$match[1]包含与第一个子模式匹配的文本，$match[2]与第二个子模式匹配，$match[3]与第三个子模式匹配。

$ php extract_matches.php 
The 10:10:22 is split into:
Hour: 10
Minute: 10
Second: 22
The 23:23:11 is split into:
Hour: 23
Minute: 23
Second: 11
The 09:06:56 is split into:
Hour: 09
Minute: 06
Second: 56

这是示例的输出。

电子邮件示例

接下来有一个实际的例子。我们创建一个正则表达式模式来检查电子邮件地址。

emails.php

<?php

 $emails = [ "luke@gmail.com", "andy@yahoocom", "34234sdfa#2345", "f344@gmail.com"]; # regular expression for emails$ pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18} $/"; foreach ($ emails as  $email) { if (preg_match($ pattern,  $email)) { echo "$ email matches \n";
    } else {
        echo "$email does not match\n";
    }
}

>?

请注意，此示例仅提供一种解决方案。它不一定是最好的。

$pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$/";

这就是模式。这里的第一个^和最后一个$字符可得到精确的模式匹配。模式前后不允许有字符。电子邮件分为五个部分。第一部分是本地部分。这通常是公司，个人或昵称的名称。 [a-zA-Z0-9._-]+列出了所有可能的字符，我们可以在本地使用。它们可以使用一次或多次。第二部分是文字@字符。第三部分是领域部分。通常是电子邮件提供商的域名，例如 yahoo 或 gmail。 [a-zA-Z0-9-]+是一个字符集，提供了可在域名中使用的所有字符。 +量词使用这些字符中的一个或多个。第四部分是点字符。它前面有转义字符（\）。这是因为点字符是一个元字符，并且具有特殊含义。通过转义，我们得到一个文字点。最后一部分是顶级域。模式如下：[a-zA-Z.]{2,18}顶级域可以包含 2 到 18 个字符，例如 sk，net，信息，旅行，清洁，旅行保险。最大长度可以为 63 个字符，但是今天大多数域都少于 18 个字符。还有一个点字符。这是因为某些顶级域包含两个部分：例如 co.uk。

$ php emails.php 
luke@gmail.com matches 
andy@yahoocom does not match
34234sdfa#2345 does not match
f344@gmail.com matches

这是emails.php示例的输出。

总结

最后，我们快速回顾一下正则表达式模式。

Jane    the 'Jane' string
^Jane   'Jane' at the start of a string
Jane $'Jane' at the end of a string ^Jane$   exact match of the string 'Jane'
[abc]   a, b, or c
[a-z]   any lowercase letter
[^A-Z]  any character that is not a uppercase letter
(Jane|Becky)   Matches either 'Jane' or 'Becky'
[a-z]+   one or more lowercase letters
^[98]?$  digits 9, 8 or empty string       
([wx])([yz])  wy, wz, xy, or xz
[0-9]         any digit
[^A-Za-z0-9]  any symbol (not a number or a letter)