How to Check a Web Page's HTTP Status Code with urllib in Python

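This article shows how to use Python's urllib library to check the HTTP status code of a web page. A status code is a three-digit number that the server returns with every response to indicate the outcome of the request. Common codes include 200 (OK, the request succeeded) and 404 (Not Found, the page does not exist).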
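
To make the meaning of individual codes concrete, here is a minimal sketch that maps a numeric code to its standard reason phrase using http.HTTPStatus from the standard library. The helper name describe_status is an assumption made for illustration and is not used elsewhere in this article.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found' for a numeric code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return f"{code} (unknown status)"


for code in (200, 301, 404, 500):
    print(describe_status(code))
# prints:
# 200 OK
# 301 Moved Permanently
# 404 Not Found
# 500 Internal Server Error
```

The first digit gives the class of the response: 2xx means success, 3xx redirection, 4xx a client-side error, and 5xx a server-side error.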

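What is urllib?

urllib is a package in the Python standard library that bundles several modules for working with URLs. Its urllib.request module sends HTTP requests and reads the responses. Calling urllib.request.urlopen sends a request, and the status code of the response tells us whether the page is reachable.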

Checking a page's status code with urllib

Here is a simple example that checks the status code of a web page with urllib:

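import urllib.error
import urllib.request

def check_website_status(url):
    try:
        # urlopen follows redirects and returns normally only for non-error responses
        with urllib.request.urlopen(url) as response:
            status_code = response.getcode()
    except urllib.error.HTTPError as e:
        # 4xx and 5xx responses are raised as HTTPError; the code is in e.code
        status_code = e.code
    except urllib.error.URLError as e:
        # No HTTP response at all, for example a DNS failure or a refused connection
        print("An error occurred:", e.reason)
        return

    print("Status code:", status_code)
    if status_code == 200:
        print("The page is accessible!")
    elif status_code == 404:
        print("The page does not exist!")
    else:
        print("The server responded, but with status code:", status_code)

# Check the status code of the Baidu home page
url = "https://www.baidu.com"
check_website_status(url)

# Check the status code of a page that does not exist
url = "https://www.example.com/404"
check_website_status(url)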

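The example defines a check_website_status function that sends an HTTP request with urlopen and reads the status code of the response with getcode. It then prints a different message depending on the code.

Requests can fail. When the server answers with an error status such as 404 or 500, urlopen does not return a response object; it raises urllib.error.HTTPError, and the status code is available as the exception's code attribute. When no response arrives at all, for example because the host name cannot be resolved or the connection is refused, urlopen raises urllib.error.URLError. HTTPError is a subclass of URLError, so it has to be caught first when the two cases are handled separately.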
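
For scripts that need the status code as a value instead of printed messages, the error handling described here can be sketched as a small helper. The name get_status_code and the timeout parameter with its default of 10 seconds are assumptions made for this illustration; they are not part of the example above.

```python
import urllib.error
import urllib.request


def get_status_code(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as e:
        # The server answered with a 4xx or 5xx status
        return e.code
    except OSError as e:
        # URLError is a subclass of OSError and covers DNS and connection
        # failures; a socket timeout while reading is an OSError as well.
        print("Request failed:", getattr(e, "reason", e))
        return None
```

A call such as get_status_code("https://www.example.com/404") should return 404 if the server answers that way, while an unreachable host yields None. Returning None keeps "no response" distinct from any real status code.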

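The example checks the Baidu home page and a page that does not exist. Assuming both servers are reachable and the second one answers with 404, the two calls print:

Status code: 200
The page is accessible!

Status code: 404
The page does not exist!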

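Advanced usage: custom request headers

Besides reading the status code, we can set custom request headers so that the request looks like it comes from a browser. This helps with sites that reject requests carrying urllib's default Python-urllib User-Agent as an anti-crawler measure. Here is an example that sets a custom header with urllib: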

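import urllib.error
import urllib.request

def check_website_status(url):
    # Present a browser-like User-Agent instead of urllib's default
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    req = urllib.request.Request(url, headers=headers)
    try:
        # urlopen accepts a Request object as well as a plain URL string
        with urllib.request.urlopen(req) as response:
            status_code = response.getcode()
    except urllib.error.HTTPError as e:
        # 4xx and 5xx responses are raised as HTTPError; the code is in e.code
        status_code = e.code
    except urllib.error.URLError as e:
        # No HTTP response at all, for example a DNS failure or a refused connection
        print("An error occurred:", e.reason)
        return

    print("Status code:", status_code)
    if status_code == 200:
        print("The page is accessible!")
    elif status_code == 404:
        print("The page does not exist!")
    else:
        print("The server responded, but with status code:", status_code)

# Check the status code of the Baidu home page
url = "https://www.baidu.com"
check_website_status(url)

# Check the status code of a page that does not exist
url = "https://www.example.com/404"
check_website_status(url)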
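
Before sending anything, it can be useful to confirm which headers a Request object actually carries. The following is a minimal sketch, not a definitive recipe: the names BROWSER_UA and build_request are hypothetical and introduced only for this illustration. No network access is needed to run it.

```python
import urllib.request

# Hypothetical constant holding the User-Agent string used in this article
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.example.com")

# Request stores header names via str.capitalize(), so the key is "User-agent"
print(req.get_header("User-agent"))
print(req.header_items())

# For comparison: the User-Agent that urllib sends when none is set
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])
```

Note that req.get_header("User-Agent") returns None here, because the lookup is case-sensitive and the stored name is "User-agent". The header is still sent to the server; HTTP header names are case-insensitive on the wire.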