c++ 正则表达式

什么是正则表达式

正则表达式是一种用于描述文本模式的表达式。它可以匹配一些特定的文本字符，并可用于搜索、替换、验证等多种应用。

为什么要使用正则表达式

在很多场合下，需要对一些文本进行处理、过滤、匹配等操作，如果只是使用字符串的基本方法，代码会变得十分繁琐和不易维护。而正则表达式能够用非常简明的方式来描述这些操作，让代码更加简洁且易于理解和维护。

比如我们可以使用正则表达式来验证一个字符串是否符合邮箱格式，代码如下：

#include <iostream>
#include <string>
#include <regex>

using namespace std;

int main()
{
    string email = "test@example.com";
    // 正则表达式
    regex pattern("[a-zA-Z0-9]+@[a-zA-Z0-9]+\\.[a-zA-Z]+");
    // 验证
    if (regex_match(email, pattern))
    {
        cout << email << " 符合邮箱格式" << endl;
    }
    else
    {
        cout << email << " 不符合邮箱格式" << endl;
    }
    return 0;
}

在上述代码中，我们使用了c++标准库中的regex类来创建正则表达式，使用regex_match函数来验证字符串是否符合正则表达式的要求。在正则表达式中，“[]”表示匹配其中的任意一个字符，“+”表示匹配一个或多个前面的字符，“\.”表示匹配字符“.”。通过这种方式可以非常方便地编写出匹配邮箱格式的正则表达式。

正则表达式的基本语法

正则表达式的基本语法包括一些特殊的字符和字符组合，这些字符代表的含义如下：

特殊字符	含义
.	匹配任意一种字符
^	匹配字符串开头
$	匹配字符串结尾
*	匹配前一个字符0次或多次
+	匹配前一个字符1次或多次
?	匹配前一个字符0次或1次
{n}	匹配前一个字符n次
{n,m}	匹配前一个字符至少n次，最多m次
[abc]	匹配a、b或c中的任意一个字符
[^abc]	匹配除a、b、c以外的任意一个字符
[a-z]	匹配任意一个小写字母
[A-Z]	匹配任意一个大写字母
[0-9]	匹配任意一个数字字符

正则表达式的高级用法

除了基本语法外，正则表达式还有一些高级用法，包括以下几点：

1. 子表达式

子表达式是正则表达式的一部分，它由一组字符组成，并可以使用括号将其括起来。在正则表达式中，“()”表示一个子表达式。子表达式可以达到分组的效果，我们可以为每个子表达式指定一个编号，在匹配时可以使用编号来引用这些子表达式，达到更加精确的匹配效果。

子表达式的编号从1开始，我们可以使用“\数字”来引用子表达式，如“\1”表示引用第1个子表达式。下面的示例演示了如何使用子表达式进行匹配：

#include <iostream>
#include <string>
#include <regex>

using namespace std;

int main()
{
    string str = "thisis a test string for regex";
    // 正则表达式
    regex pattern("(\\w+)\\s+(\\w+)\\s+(\\w+)\\s+(\\w+)");
    // 匹配
    smatch match_result;
    if (regex_search(str, match_result, pattern))
    {
        // 输出匹配结果
        for (int i = 1; i <= 4; i++)
        {
            cout << "子表达式" << i << "匹配结果：" << match_result[i].str() << endl;
        }
    }
    return 0;
}

在上述代码中，我们定义了一个包含4个单词的字符串，使用子表达式来匹配这个字符串中的单词，并输出每个子表达式的匹配结果。在正则表达式中，“(\w+)”表示匹配一个或多个单词字符，并将其作为一个子表达式。

2. 前后向查找

前后向查找是指在正则表达式中查找一个字符串时，可以查找其前后是否存在某个特定的字符串模式。前后向查找的符号分别为“?=”和“?<=”。

在下面的示例中，我们将演示如何使用前向查找来匹配一个字符串中所有的数字，并将其替换为对应的中文大写数字：

#include <iostream>
#include <string>
#include <regex>

using namespace std;

string to_chinese(int decimal)
{
    // 省略中文数字的转换代码
}

int main()
{
    string str = "today is 2022-01-01, tomorrow is 2022-01-02.";
    // 正则表达式
    regex pattern("\\d+(?=\\-\\d+\\-\\d+)");
    // 查找
    sregex_iterator current_match(str.begin(), str.end(), pattern);
    sregex_iterator last_match;
    // 匹配结果替换为中文数字
    while (current_match != last_match)
    {
        smatch match_result = *current_match;
        int decimal = stoi(match_result.str());
        string chinese = to_chinese(decimal);
        str.replace(match_result.position(), match_result.length(), chinese);
        current_match++;
    }
    cout << str << endl;
    return 0;
}

在上述代码中，我们定义了一个正则表达式，用于查找字符串中所有的数字。在正则表达式中，“(\d+)”表示匹配一个或多个数字字符，“(?=\-\d+\-\d+)”表示查找一个以“-”分隔的日期格式。通过前向查找，我们可以将匹配结果只限制在日期之前的数字部分，避免替换掉日期中的数字。

3. 贪婪匹配和非贪婪匹配

在正则表达式匹配时，默认采用贪婪匹配，即尽可能地匹配更长的子串。而非贪婪匹配则相反，它尽可能地匹配更短的子串。非贪婪匹配的符号是“?”，放在“*”或“+”的后面。

在下面的示例中，我们演示如何使用贪婪匹配和非贪婪匹配来匹配一个字符串中的HTML标签：

#include <iostream>
#include <string>
#include <regex>

using namespace std;

int main()
{
    string str = "<html><body><h1>Hello world!</h1></body></html>";
    // 正则表达式
    regex pattern("<.*>");
    regex non_greedy_pattern("<.*?>");
    // 匹配
    smatch match_result;
    if (regex_search(str, match_result, pattern))
    {
        cout << "贪婪匹配结果：" << match_result[0].str() << endl;
    }
    if (regex_search(str, match_result, non_greedy_pattern))
    {
        cout << "非贪婪匹配结果：" << match_result[0].str() << endl;
    }
    return 0;
}

在上述代码中，我们定义了一个字符串，包含了一个HTML标签。使用正则表达式来匹配这个标签。通过不同的正则表达式，我们可以演示出贪婪匹配和非贪婪匹配的区别。在正则表达式中，“<.>”表示匹配包含在尖括号中的任意字符串，它采用贪婪匹配；而“<.?>”表示匹配最短的以尖括号包含的字符串，它采用非贪婪匹配。

案例分析

下面我们来看一个实际场景下的应用案例。通常我们在使用正则表达式时，需要根据具体的业务需求来编写匹配模式。在以下案例中，我们使用正则表达式来计算一个数学表达式的值。

#include <iostream>
#include <string>
#include <regex>
#include <stack>

using namespace std;

int calc_expression(string expression)
{
    // 运算符优先级
    map<char, int> priority = {
        {'+', 1},
        {'-', 1},
        {'*', 2},
        {'/', 2}
    };
    // 运算符栈和数值栈
    stack<char> operator_stack;
    stack<int> value_stack;
    // 正则表达式
    regex pattern("[\\d]+|[\\+\\-\\*/]");
    // 匹配
    sregex_iterator current_match(expression.begin(), expression.end(), pattern);
    sregex_iterator last_match;
    // 处理匹配结果
    while (current_match != last_match)
    {
        smatch match_result = *current_match;
        string token = match_result.str();
        // 数字
        if (regex_match(token, regex("[\\d]+")))
        {
            value_stack.push(stoi(token));
        }
        // 运算符
        else
        {
            char op = token[0];
            while (!operator_stack.empty() && priority[op] <= priority[operator_stack.top()])
            {
                char top_oper = operator_stack.top();
                operator_stack.pop();
                int v2 = value_stack.top();
                value_stack.pop();
                int v1 = value_stack.top();
                value_stack.pop();
                int result = 0;
                switch (top_oper)
                {
                    case '+':
                        result = v1 + v2;
                        break;
                    case '-':
                        result = v1 - v2;
                        break;
                    case '*':
                        result = v1 * v2;
                        break;
                    case '/':
                        result = v1 / v2;
                        break;
                    default:
                        break;
                }
                value_stack.push(result);
            }
            operator_stack.push(op);
        }
        current_match++;
    }
    while (!operator_stack.empty())
    {
        char top_oper = operator_stack.top();
        operator_stack.pop();
        int v2 = value_stack.top();
        value_stack.pop();
        int v1 = value_stack.top();
        value_stack.pop();
        int result = 0;
        switch (top_oper)
        {
            case '+':
                result = v1 + v2;
                break;
            case '-':
                result = v1 - v2;
                break;
            case '*':
                result = v1 * v2;
                break;
            case '/':
                result = v1 / v2;
                break;
            default:
                break;
        }
        value_stack.push(result);
    }
    return value_stack.top();
}

int main()
{
    string expression = "10+20*3-5/2";
    int result = calc_expression(expression);
    cout << expression << " = " << result << endl;
    return 0;
}

在上述代码中，我们定义了一个数学表达式，“10+20*3-5/2”，使用正则表达式来提取其中的数字和运算符。然后我们使用两个栈来模拟运算符和数值的计算过程。使用正则表达式来提取数字和运算符，可以让代码更加清晰和易于维护。当然，在实际开发中，可能需要根据具体的业务需求来编写匹配模式和处理逻辑。

总结

正则表达式虽然强大，但也较为复杂，需要在实践中不断练习。在使用正则表达式时，应该尽量将需求分解成简单的模式，再用子表达式组合成复杂的匹配模式。同时，建议先将正则表达式写到一个字符串中，在实际使用时再转换为代码，可以避免特殊字符的转义问题。最后，需要注意正则表达式中的贪婪匹配和非贪婪匹配区别，以及正则表达式在具体业务场景中的应用和优化。