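Introduction to Python urlopen()

1. Introduction

When writing web crawlers or accessing web resources, we often need Python to send HTTP requests and read the responses. The standard library offers several modules for this, and one of the most commonly used entry points is the urlopen() function in the urllib.request module. This article walks through how to use urlopen(), covering basic GET requests, GET requests with parameters, and POST requests, as well as exception handling, proxies, request headers, and SSL verification.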

2. Overview of urlopen()

urlopen() is a function in the standard library module urllib.request. It sends a request to the given URL and returns a response object. Its signature is:

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

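2.1 Parameters

url: the URL to open, either a string or a Request object.
data (optional): the request body to send. It must be a bytes object, a file-like object, or an iterable of bytes; a str is rejected with TypeError, so encode it first. For HTTP URLs, passing data turns the request into a POST.
timeout (optional): how long, in seconds, blocking operations such as the connection attempt may take. If omitted, the global default timeout is used (socket.getdefaulttimeout(), which is None, meaning no timeout, unless it has been changed).
The remaining parameters (cafile, capath, cadefault, context) are used less often and are not covered in detail here. Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; context, which takes an ssl.SSLContext, is the supported way to configure TLS.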
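To make the timeout parameter concrete, here is a minimal sketch that runs entirely on the local machine. SlowHandler and the one-second delay are assumptions made for illustration: the throwaway server accepts the connection but does not answer in time, so the 0.2-second timeout expires.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that stalls and never sends a response."""

    def do_GET(self):
        time.sleep(1.0)  # longer than the client's timeout

    def log_message(self, format, *args):
        pass  # silence the default request logging


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: the OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

timed_out = False
try:
    urllib.request.urlopen(url, timeout=0.2)
except (socket.timeout, urllib.error.URLError):
    # A timeout while waiting for the response surfaces as socket.timeout
    # (an alias of TimeoutError on Python 3.10+); a timeout while connecting
    # is wrapped in URLError.
    timed_out = True
finally:
    server.shutdown()
    server.server_close()

print(timed_out)
```

Catching both exception types is a deliberate choice here, since which one is raised depends on the stage at which the timeout expires.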

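2.2 Response object

For HTTP and HTTPS URLs, urlopen() returns an http.client.HTTPResponse object that carries the response metadata and body. Through it we can read the status code (status), the headers (headers), and the body (read()).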
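The following sketch shows those three pieces of the response object side by side. So that it runs without network access, it starts a throwaway local server; HelloHandler and its fixed "hello" body are assumptions made for illustration, not part of urllib.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that always answers with a short plain-text body."""

    def do_GET(self):
        payload = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # silence the default request logging


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: the OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# The response is a context manager, so the connection is closed on exit.
with urllib.request.urlopen(url, timeout=5) as response:
    status = response.status                             # numeric status code
    content_type = response.headers.get("Content-Type")  # header lookup
    body = response.read().decode("utf-8")               # body bytes, decoded

server.shutdown()
server.server_close()
print(status, content_type, body)
```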

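3. Basic GET request

GET is one of the most common HTTP methods and is used to retrieve a resource from a server. The example below sends a basic GET request with urlopen() and prints the content the server returns.

3.1 Example code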

import urllib.request

response = urllib.request.urlopen("https://www.example.com")
content = response.read().decode('utf-8')
print(content)

3.2 Output

<!DOCTYPE html>
<html>
<head>
    <title>Example Domain</title>
    <style>
        body {
            background-color: #f0f0f2;
            margin: 0;
            padding: 0;
            font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

        }
        div {
            width: 600px;
            margin: 5em auto;
            padding: 5em;
            background-color: #fdfdff;
            border-radius: 1em;
            box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1);
        }
        a:link, a:visited {
            color: #38488f;
            text-decoration: none;
        }
        @media (max-width: 700px) {
            body {
                background-color: #fff;
            }
            div {
                width: auto;
                margin: 0 auto;
                border-radius: 0;
                padding: 1em;
                box-shadow: none;
            }
        }
    </style>
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

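In the example above, we call urlopen() to send a GET request to "https://www.example.com", read the response body, decode it to a string, and print it.

The output shows that the server returned the content of an HTML page.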
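The example hard-codes UTF-8 when decoding. A slightly more defensive sketch reads the charset declared in the Content-Type header and falls back to UTF-8 only when none is given. fetch_text is a hypothetical helper name, and the data: URL is a stand-in so the snippet runs without network access; an http or https URL is passed the same way.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8", timeout=10):
    """Fetch url and decode the body using the charset the response declares."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# A data: URL carries its own content type and body, so no server is needed.
text = fetch_text("data:text/plain;charset=utf-8,caf%C3%A9")
print(text)
```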

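4. GET request with parameters

In practice we often need to send GET requests that carry parameters. With urllib, this is done by appending the URL-encoded key-value pairs to the URL as a query string.

4.1 Example code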

import urllib.request
import urllib.parse

# Build the parameter dictionary
params = {
    'query': 'python',
    'page': 1,
    'per_page': 10
}

# Encode the parameters and append them to the URL as a query string
url = "https://www.example.com/search?" + urllib.parse.urlencode(params)

response = urllib.request.urlopen(url)
content = response.read().decode('utf-8')
print(content)

4.2 Output

<!DOCTYPE html>
<html>
<head>
    <title>Search Results</title>
    ...
</head>
<body>
    <h1>Search Results</h1>
    ...
    <ul>
        <li>Result 1</li>
        <li>Result 2</li>
        ...
    </ul>
</body>
</html>

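In the example above, we build a parameter dictionary params containing the search keyword query, the page number page, and the number of results per page per_page. urlencode() then encodes the dictionary as a query string, which we append to the URL after the "?". Finally, urlopen() sends the GET request and we print the content the server returns. The /search endpoint on www.example.com is a placeholder, so the output above only illustrates what a real search service might return.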
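To see exactly what urlencode() produces, the short sketch below prints the query string before it is attached to the URL. The tag parameter and the doseq=True option are additions made for illustration; they show how a list value is expanded into repeated keys.

```python
from urllib.parse import urlencode

params = {
    "query": "python urllib",  # the space is encoded as "+"
    "page": 1,                 # non-string values are converted with str()
    "tag": ["web", "http"],    # with doseq=True a list becomes repeated keys
}

query_string = urlencode(params, doseq=True)
url = "https://www.example.com/search?" + query_string

print(query_string)  # query=python+urllib&page=1&tag=web&tag=http
print(url)
```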

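5. POST request

Besides GET, urlopen() can also send POST requests. A POST request submits data to the server and is typically used for things like form submission.

5.1 Example code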

import urllib.request
import urllib.parse

# Build the data to submit
data = {
    'username': 'admin',
    'password': '123456'
}

# URL-encode the data and convert it to bytes
data = urllib.parse.urlencode(data).encode('utf-8')

response = urllib.request.urlopen("https://www.example.com/login", data)
content = response.read().decode('utf-8')
print(content)

5.2 Output

<!DOCTYPE html>
<html>
<head>
    <title>Login Result</title>
    ...
</head>
<body>
<h1>Login Result</h1>
<p>Login success!</p>
</body>
</html>

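In the example above, we build a dictionary data containing a username and password, encode it into a URL-encoded string with urlencode(), and then convert that string to bytes. Passing the encoded data as the second argument to urlopen() turns the request into a POST. Finally, we print the content the server returns. The /login endpoint on www.example.com is a placeholder, so the output above is illustrative.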
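The switch from GET to POST can be observed without sending anything. The sketch below builds two Request objects for the same placeholder URL and asks each which method it would use; it assumes nothing beyond urllib.request.Request and its get_method() method.

```python
import urllib.parse
import urllib.request

form = {"username": "admin", "password": "123456"}
body = urllib.parse.urlencode(form).encode("utf-8")  # str -> bytes

get_req = urllib.request.Request("https://www.example.com/login")
post_req = urllib.request.Request("https://www.example.com/login", data=body)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'username=admin&password=123456'
```

Either Request object can be passed to urlopen() in place of the URL string.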

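6. Exception handling

In practice we also need to handle the errors that can occur while sending a request. urlopen() may raise urllib.error.URLError (including its subclass urllib.error.HTTPError for HTTP error status codes such as 404), and lower-level exceptions such as http.client.HTTPException can also propagate. We can catch these with try...except and handle each appropriately.

Here is a simple exception-handling example:

import http.client
import urllib.request
import urllib.error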

try:
    response = urllib.request.urlopen("https://www.example.com")
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.URLError as e:
    print("Error:", e.reason)
except http.client.HTTPException as e:
    print("HTTP Exception:", e)

In the example above, if urlopen() raises URLError, we print the reason for the failure. If it raises HTTPException, we print the exception object itself.
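Because HTTPError is a subclass of URLError, a handler for HTTPError has to come first if the two cases are to be told apart. The sketch below is one way to arrange that; try_fetch and its (kind, detail) return shape are hypothetical, and the made-up nosuchscheme:// URL is used only because it triggers URLError without any network access.

```python
import http.client
import urllib.error
import urllib.request


def try_fetch(url, timeout=10):
    """Return a (kind, detail) tuple describing how the request ended."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return ("ok", response.status)
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return ("http_error", e.code)
    except urllib.error.URLError as e:
        # No usable response: DNS failure, refused connection, unsupported scheme.
        return ("url_error", str(e.reason))
    except http.client.HTTPException as e:
        # Malformed or interrupted HTTP exchange.
        return ("protocol_error", repr(e))


print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
result = try_fetch("nosuchscheme://example")
print(result)
```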

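7. Proxy settings

urlopen() can also send requests through a proxy server. By configuring a ProxyHandler, we can specify the proxy's address and port.

Here is an example of configuring a proxy: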

import urllib.request

# Set the proxy server address and port
proxy_handler = urllib.request.ProxyHandler({'http': 'http://10.10.10.10:8888', 'https': 'https://10.10.10.10:8888'})
opener = urllib.request.build_opener(proxy_handler)

# Install it as the global default opener
urllib.request.install_opener(opener)

# Send the HTTP request
response = urllib.request.urlopen("https://www.example.com")
content = response.read().decode('utf-8')
print(content)

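In the example above, we create a ProxyHandler that specifies the proxy server's address and port and pass it to build_opener(). We then call install_opener() to make the custom opener the global default, so that subsequent urlopen() calls go through it. Finally, we send the request with urlopen().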
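install_opener() changes behavior for every later urlopen() call in the process. When only some requests should use the proxy, one alternative is to keep the opener in a variable and call its open() method directly. The sketch below assumes a hypothetical helper make_proxy_opener and reuses the placeholder proxy address from the example; it only builds and inspects the opener, so it runs without a real proxy.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that routes http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = make_proxy_opener("http://10.10.10.10:8888")

# opener.open() takes url, data, and timeout just like urlopen(), for example:
#     response = opener.open("https://www.example.com", timeout=10)
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)
]
print(proxy_handlers[0].proxies)
```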

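8. Setting request headers

We can also customize the headers sent with a request by creating a urllib.request.Request object and setting headers on it.

Here is an example of setting a request header: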

import urllib.request

url = "https://www.example.com"

# Create a Request object and set a request header
req = urllib.request.Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")

# Send the HTTP request
response = urllib.request.urlopen(req)
content = response.read().decode('utf-8')
print(content)

In the example above, we create a Request object req and call add_header() to set the User-Agent header. We then pass req to urlopen() to send the request and print the content the server returns.
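Several headers can also be supplied at once through the headers argument of Request. The sketch below does that and then inspects what was stored; the header values, including the my-client/1.0 user agent, are placeholders. It also shows a detail that is easy to trip over: Request normalizes each header name with str.capitalize(), so lookups through get_header() need the "User-agent" spelling.

```python
import urllib.request

req = urllib.request.Request(
    "https://www.example.com",
    headers={"User-Agent": "my-client/1.0", "Accept": "text/html"},
)

print(req.header_items())            # names are stored as "User-agent", "Accept"
print(req.get_header("User-agent"))  # my-client/1.0
print(req.get_header("User-Agent"))  # None: the lookup is case-sensitive
```

HTTP header names are case-insensitive on the wire, so the normalization only matters when reading headers back from the Request object.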

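9. SSL verification

By default, urlopen() verifies the server's SSL/TLS certificate for HTTPS URLs. If the certificate is invalid or not trusted, urlopen() raises urllib.error.URLError. In some situations, such as testing against a server with a self-signed certificate, we may choose to skip certificate verification.

Here is an example that skips SSL verification: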

import urllib.request
import ssl

# Replace the default HTTPS context factory with one that skips certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

# Send the HTTP request
response = urllib.request.urlopen("https://www.example.com")
content = response.read().decode('utf-8')
print(content)

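In the example above, we assign ssl._create_unverified_context to ssl._create_default_https_context, so the HTTPS connections that urlopen() makes in this process use an SSL context that does not verify certificates. Both names are private to the ssl module. We then send the request with urlopen() and print the content the server returns.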
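The same effect can be limited to a single call by passing an ssl.SSLContext through the context parameter of urlopen(), which avoids patching private names in the ssl module. The sketch below is one way to do that; make_unverified_context and fetch_without_verification are hypothetical helper names, and the fetch function is defined but not called so the snippet runs without network access.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Return an SSLContext that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before relaxing verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_without_verification(url, timeout=10):
    """Fetch url without verifying its certificate, for this call only."""
    context = make_unverified_context()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()


default_context = ssl.create_default_context()
relaxed_context = make_unverified_context()
print(default_context.verify_mode == ssl.CERT_REQUIRED)  # True
print(relaxed_context.verify_mode == ssl.CERT_NONE)      # True
```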