Matching URLs with Python Regular Expressions

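1. What Is a Regular Expression

A regular expression is a tool for describing string patterns. It uses a dedicated syntax to match, find, and replace specific patterns in text.

In Python, the built-in re module handles regular expressions.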
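As a quick orientation, the sketch below contrasts three commonly used functions of the re module. The sample strings and variable names are made up for illustration only.

```python
import re

text = "Visit https://www.example.com today"

# re.match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"https?://", text)

# re.search scans the whole string and returns the first match it finds.
anywhere = re.search(r"https?://[\w.-]+", text)

# re.findall returns every non-overlapping match as a list of strings.
letters = re.findall(r"[A-Za-z]+", "a1b22c")

print(at_start)          # None, because the text starts with "Visit"
print(anywhere.group())  # https://www.example.com
print(letters)           # ['a', 'b', 'c']
```

The difference between re.match and re.search matters in the examples that follow: patterns anchored with ^ work with either, while unanchored patterns usually need re.search or re.findall.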

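2. Basic Rules for Matching URLs with Regular Expressions

Before using regular expressions to match URLs, we first need to understand how a URL is structured.

A standard URL consists of the following parts:

Protocol: such as http or https;
Domain name: such as www.example.com;
Path: such as /path/to/file;
Query: such as ?key1=value1&key2=value2;
Fragment: such as #section1.

For example, a typical URL might be: http://www.example.com/path/to/file?key1=value1&key2=value2#section1.

Given this structure, we can use regular expressions to match a URL and extract each of its parts.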
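The five parts listed above can be captured in one pass with named groups. This is a minimal sketch rather than a complete URL grammar: the name URL_PARTS, the helper split_url, and the group names are assumptions chosen for illustration, and the pattern only covers http and https URLs with simple host names.

```python
import re

# Verbose mode lets the pattern be laid out part by part with comments.
URL_PARTS = re.compile(
    r"""
    ^(?P<protocol>https?)://      # protocol: http or https
    (?P<domain>[\w.-]+)           # domain name
    (?P<path>/[^?#]*)?            # optional path, up to '?' or '#'
    (?:\?(?P<query>[^#]*))?       # optional query, without the leading '?'
    (?:\#(?P<fragment>.*))?$      # optional fragment, without the leading '#'
    """,
    re.VERBOSE,
)

def split_url(url):
    """Return a dict of URL parts, or None if the URL does not match."""
    match = URL_PARTS.match(url)
    return match.groupdict() if match else None

parts = split_url(
    "http://www.example.com/path/to/file?key1=value1&key2=value2#section1"
)
print(parts["protocol"])  # http
print(parts["domain"])    # www.example.com
print(parts["path"])      # /path/to/file
print(parts["query"])     # key1=value1&key2=value2
print(parts["fragment"])  # section1
```

Parts that are absent from the URL come back as None, so split_url("https://example.com")["path"] is None.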

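3. Matching URLs with Regular Expressions

The following examples show how to match URLs with regular expressions.

3.1 Matching a Complete URL

First, the following regular expression matches a complete URL: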

import re

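# The path is wrapped in an optional group so that URLs without a path also match.
pattern = r'^https?://[\w\.-]+(?:/.*)?$'

urls = ['http://www.example.com/path/to/file',
        'https://www.example.com/path/to/file',
        'http://example.com',
        'https://example.com',
        'invalid_url']

for url in urls:
    match = re.match(pattern, url)
    if match:
        print(f"{url} is a valid URL")
    else:
        print(f"{url} is not a valid URL")

Output:

http://www.example.com/path/to/file is a valid URL
https://www.example.com/path/to/file is a valid URL
http://example.com is a valid URL
https://example.com is a valid URL
invalid_url is not a valid URL

In this example, ^ anchors the match at the start of the string, https? matches the protocol (http or https), [\w\.-]+ matches the domain name, and (?:/.*)?$ matches an optional path that runs to the end of the string. The path group must be optional: a pattern that requires the slash, such as /.*$, would reject http://example.com and https://example.com.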
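When the same check runs many times, the pattern can be compiled once and wrapped in a helper. The sketch below assumes that the path is optional; the names URL_PATTERN and is_url are hypothetical and not part of the re module.

```python
import re

# Compile once so the pattern object can be reused across calls.
URL_PATTERN = re.compile(r"https?://[\w.-]+(?:/.*)?")

def is_url(text):
    """Return True if the whole string looks like an http(s) URL."""
    # fullmatch requires the entire string to match, so ^ and $ are not needed.
    return URL_PATTERN.fullmatch(text) is not None

print(is_url("https://example.com"))     # True
print(is_url("http://example.com/a/b"))  # True
print(is_url("ftp://example.com"))       # False
print(is_url("invalid_url"))             # False
```

This is a loose shape check, not full validation: it accepts any run of word characters, dots, and hyphens as a domain name.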

3.2 Extracting the Protocol

To extract only the protocol of a URL, use the following regular expression:

import re

pattern = r'^(https?)://'

urls = ['http://www.example.com/path/to/file',
        'https://www.example.com/path/to/file',
        'http://example.com',
        'https://example.com']

for url in urls:
    match = re.match(pattern, url)
    if match:
        protocol = match.group(1)
        print(f"Protocol of {url} is {protocol}")
    else:
        print(f"{url} is not a valid URL")

Output:

Protocol of http://www.example.com/path/to/file is http
Protocol of https://www.example.com/path/to/file is https
Protocol of http://example.com is http
Protocol of https://example.com is https
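Protocol names are case-insensitive in practice, so a URL may arrive as HTTPS://... rather than https://.... The following sketch, under that assumption, adds the re.IGNORECASE flag and normalizes the result to lowercase; SCHEME and extract_protocol are hypothetical names.

```python
import re

# IGNORECASE lets the pattern accept "HTTP", "Https", and so on.
SCHEME = re.compile(r"^(https?)://", re.IGNORECASE)

def extract_protocol(url):
    """Return the protocol in lowercase, or None if the URL does not match."""
    match = SCHEME.match(url)
    return match.group(1).lower() if match else None

print(extract_protocol("HTTPS://Example.com"))  # https
print(extract_protocol("http://example.com"))   # http
print(extract_protocol("ftp://example.com"))    # None
```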

3.3 Extracting the Domain Name

To extract only the domain name of a URL, use the following regular expression:

import re

# No trailing slash is required, so URLs without a path (http://example.com) also match.
pattern = r'^(https?://)([\w\.-]+)'

urls = ['http://www.example.com/path/to/file',
        'https://www.example.com/path/to/file',
        'http://example.com',
        'https://example.com']

for url in urls:
    match = re.match(pattern, url)
    if match:
        domain = match.group(2)
        print(f"Domain of {url} is {domain}")
    else:
        print(f"{url} is not a valid URL")

Output:

Domain of http://www.example.com/path/to/file is www.example.com
Domain of https://www.example.com/path/to/file is www.example.com
Domain of http://example.com is example.com
Domain of https://example.com is example.com
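URLs sometimes carry a port number after the domain, as in http://localhost:8080/api. As a hedged extension of the domain example, the sketch below captures an optional port as well; HOST_PORT and extract_host_port are names invented for this illustration.

```python
import re

# Group 1 is the host; group 2 is the optional port digits after ':'.
HOST_PORT = re.compile(r"^https?://([\w.-]+)(?::(\d+))?")

def extract_host_port(url):
    """Return (host, port) where port is an int or None; None if no match."""
    match = HOST_PORT.match(url)
    if not match:
        return None
    host, port = match.groups()
    return (host, int(port) if port else None)

print(extract_host_port("http://localhost:8080/api"))  # ('localhost', 8080)
print(extract_host_port("https://example.com"))        # ('example.com', None)
```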

3.4 Extracting the File Path

To extract only the file path of a URL, use the following regular expression:

import re

pattern = r'^https?://[\w\.-]+(/.*)$'

urls = ['http://www.example.com/path/to/file',
        'https://www.example.com/path/to/file',
        'http://example.com',
        'https://example.com']

for url in urls:
    match = re.match(pattern, url)
    if match:
        path = match.group(1)
        print(f"File path of {url} is {path}")
    else:
        print(f"{url} has no file path")

Output:

File path of http://www.example.com/path/to/file is /path/to/file
File path of https://www.example.com/path/to/file is /path/to/file
http://example.com has no file path
https://example.com has no file path
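Note that a capture group of the form (/.*) also swallows any query string and fragment that follow the path. If only the path itself is wanted, one possible variant stops at the first ? or #. This is a sketch, and PATH_ONLY and extract_path are hypothetical names.

```python
import re

# [^?#]* stops the path before a query string ('?') or fragment ('#').
PATH_ONLY = re.compile(r"^https?://[\w.-]+(/[^?#]*)")

def extract_path(url):
    """Return the path without query or fragment, or None if there is no path."""
    match = PATH_ONLY.match(url)
    return match.group(1) if match else None

print(extract_path("http://www.example.com/path/to/file?key1=value1#section1"))
# /path/to/file
print(extract_path("https://example.com"))
# None
```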

3.5 Extracting the Query Parameters

To extract only the query parameters of a URL, use the following regular expression:

import re

pattern = r'(?:\?|&)([^\?&=]+=[^\?&=]+)'

urls = ['http://www.example.com/path/to/file?key1=value1&key2=value2',
        'https://www.example.com/path/to/file?key1=value1&key2=value2',
        'http://example.com',
        'https://example.com']

for url in urls:
    matches = re.findall(pattern, url)
    if matches:
        for match in matches:
            print(f"Query parameter of {url} is {match}")
    else:
        print(f"{url} has no query parameter")

Output:

Query parameter of http://www.example.com/path/to/file?key1=value1&key2=value2 is key1=value1
Query parameter of http://www.example.com/path/to/file?key1=value1&key2=value2 is key2=value2
Query parameter of https://www.example.com/path/to/file?key1=value1&key2=value2 is key1=value1
Query parameter of https://www.example.com/path/to/file?key1=value1&key2=value2 is key2=value2
http://example.com has no query parameter
https://example.com has no query parameter
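The key=value strings are often more useful as a dictionary. The sketch below captures keys and values in separate groups and, for comparison, shows the standard library's urllib.parse.parse_qs on the same input. PAIR and query_to_dict are hypothetical names, and the pattern does not decode percent-encoded characters.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Two groups: the key before '=' and the value after it.
# '#' is excluded so a trailing fragment does not leak into the last value.
PAIR = re.compile(r"[?&]([^?&=#]+)=([^?&=#]*)")

def query_to_dict(url):
    """Return the query parameters as a {key: value} dict (last key wins)."""
    return dict(PAIR.findall(url))

url = "http://www.example.com/path/to/file?key1=value1&key2=value2#section1"

print(query_to_dict(url))
# {'key1': 'value1', 'key2': 'value2'}

# parse_qs returns a list per key, because a key may appear more than once.
print(parse_qs(urlsplit(url).query))
# {'key1': ['value1'], 'key2': ['value2']}
```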

3.6 Extracting the Fragment

To extract only the fragment of a URL, use the following regular expression:

import re

pattern = r'#(.*)$'

urls = ['http://www.example.com/path/to/file#section1',
        'https://www.example.com/path/to/file#section2',
        'http://example.com#section3',
        'https://example.com']

for url in urls:
    match = re.search(pattern, url)
    if match:
        fragment = match.group(1)
        print(f"Fragment of {url} is {fragment}")
    else:
        print(f"{url} has no fragment")

Output:

Fragment of http://www.example.com/path/to/file#section1 is section1
Fragment of https://www.example.com/path/to/file#section2 is section2
Fragment of http://example.com#section3 is section3
https://example.com has no fragment
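As a cross-check for hand-written patterns like the ones in this article, the standard library's urllib.parse.urlsplit splits a URL into the same five parts. The sketch below compares its fragment with the result of a fragment regex; the sample URL and variable names are chosen for illustration.

```python
import re
from urllib.parse import urlsplit

url = "http://www.example.com/path/to/file?key1=value1&key2=value2#section1"

# urlsplit returns a named tuple with scheme, netloc, path, query, fragment.
parts = urlsplit(url)
print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /path/to/file
print(parts.query)     # key1=value1&key2=value2
print(parts.fragment)  # section1

# The regex approach yields the same fragment for this URL.
match = re.search(r"#(.*)$", url)
regex_fragment = match.group(1) if match else ""
print(regex_fragment == parts.fragment)  # True
```

Regular expressions are handy for learning and for quick extraction, while urlsplit is generally the safer choice when URLs come from untrusted or varied sources.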