How to parse an HTML page with Python to extract HTML tables?

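Problem

You need to extract HTML tables from a web page.

Introduction

The Internet and the World Wide Web (WWW) are today's most prominent sources of information. With so much available, picking out the content you need is hard. Most of this information can be retrieved over HTTP.

We can also perform these retrievals programmatically, so that information is fetched and processed automatically.

Python lets us do this with the HTTP client in its standard library, but the requests module makes fetching web pages much easier.

In this article, we will see how to parse an HTML page to extract the HTML tables embedded in it.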

How to do it..

1. We will use the requests, pandas, beautifulsoup4 and tabulate packages. If any of them are missing, install them on your system. If you are unsure, verify with pip freeze.

import requests
import pandas as pd
from tabulate import tabulate
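
As a rough alternative to scanning the pip freeze output by eye, the sketch below checks from within Python whether the modules can be found. The helper name missing_packages is hypothetical and used only for illustration; the list assumes the import names requests, pandas, bs4 (the import name of beautifulsoup4) and tabulate.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the names from import_names that cannot be found for import."""
    return [name for name in import_names if find_spec(name) is None]


# bs4 is the import name of the beautifulsoup4 distribution
required = ["requests", "pandas", "bs4", "tabulate"]
print(f"Missing packages: {missing_packages(required)}")
```

An empty list means every package in the list is available to the current interpreter.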

2. We will parse the page at https://www.tutorialspoint.com/python/python_basic_operators.htm and print all the HTML tables embedded in it.

# Set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

3. We will send a request to the server and look at the response.

# Send the request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")
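
The status code printed above is easier to read with a label attached. The following is a minimal sketch using http.HTTPStatus from the standard library; the helper name describe_status is an assumption made for illustration and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a string such as '200 OK (success)' for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes
        phrase = "Unknown"
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational or non-standard"
    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
```

With the response object from the step above, describe_status(response.status_code) gives the same kind of label. requests also provides response.raise_for_status(), which raises an exception for 4xx and 5xx codes.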

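4. Good, response code 200 means the server returned a successful response. We will now check the request headers, the response headers, and the first 100 characters of the text returned by the server.

# Check the request headers
print(f"*** Printing request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing response headers - \n {response.headers} ")

# Check the content of the result
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")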

Output

*** Printing request headers -
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
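
The response headers above include a Content-Type value of text/html; charset=UTF-8. As a small illustration of how such a header splits into a media type and a character set, here is a sketch using only the standard library's email.message.EmailMessage. The helper name parse_content_type is hypothetical, and this is not how requests itself is implemented.

```python
from email.message import EmailMessage


def parse_content_type(header_value):
    """Split a Content-Type header value into (media_type, charset)."""
    msg = EmailMessage()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(f"media type: {media_type}, charset: {charset}")
```

With a live response, the same helper could be called as parse_content_type(response.headers["Content-Type"]), assuming the server sent that header.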

5. Now we will parse the HTML page with BeautifulSoup.

# Parse the HTML page

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can also extract the page title as a string
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

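6. Most tables have their title defined in an h2, h3, h4, h5 or h6 tag. We first locate the first such heading, then walk through the elements that follow it at the same level and pick out every HTML table among them. For this logic we use find and find_next_siblings, checking each sibling as shown below.

from io import StringIO

# Print all h2 elements
print(f"{tutorialpoints_page.find_all('h2')}")

# Find the first heading tag (h2 to h6), then check every sibling that follows it
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # read_html returns a list of DataFrames; recent pandas versions expect a file-like object
        df = pd.read_html(StringIO(str(my_table)))
        print(tabulate(df[0], headers='keys', tablefmt='psql'))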
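
To make the idea of pairing a heading with the tables that follow it more concrete, here is a dependency-free sketch built on the standard library's html.parser instead of BeautifulSoup and pandas. It is a swapped-in technique, not the approach used in this article: the class name HeadingTableParser and the sample markup are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser

HEADINGS = {"h2", "h3", "h4", "h5", "h6"}


class HeadingTableParser(HTMLParser):
    """Collect each <table> as a list of rows, labeled with the nearest preceding heading."""

    def __init__(self):
        super().__init__()
        self.tables = []           # list of (heading_text, rows)
        self._heading = None       # text of the most recent heading
        self._in_heading = False
        self._rows = None          # rows of the table being read, or None
        self._cell = None          # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._heading = ""
        elif tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._rows.append([])
        elif tag in ("td", "th") and self._rows:
            self._cell = []

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
        elif self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
        elif tag in ("td", "th") and self._cell is not None:
            self._rows[-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(((self._heading or "").strip(), self._rows))
            self._rows = None


# Made-up sample markup, standing in for a downloaded page
sample_html = """
<h2>Arithmetic</h2>
<p>Intro text</p>
<table>
  <tr><th>Operator</th><th>Example</th></tr>
  <tr><td>+</td><td>a + b = 30</td></tr>
</table>
"""

parser = HeadingTableParser()
parser.feed(sample_html)
tables = parser.tables
for heading, rows in tables:
    print(heading, rows)
```

BeautifulSoup and pandas.read_html handle far more edge cases (malformed markup, colspan, type inference), so this sketch is only meant to show the traversal idea.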

Complete code

7. Now let's put everything together.

# Step 1: download the required page
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

# Send the request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")

# Check the request headers
print(f"*** Printing request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing response headers - \n {response.headers} ")

# Check the content of the result
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

# Parse the HTML page
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can also extract the page title as a string
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

# Print all h2 elements
print(f"{tutorialpoints_page.find_all('h2')}")

# Find the first heading tag (h2 to h6), then check every sibling that follows it
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # read_html returns a list of DataFrames; recent pandas versions expect a file-like object
        df = pd.read_html(StringIO(str(my_table)))
        print(df)

Output

*** The response for https://www.tutorialspoint.com/python/python_basic_operators.htm is 200
*** Printing request headers -
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
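*** The title of the page is - <title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - Python - Basic Operators - Tutorialspoint
[<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
[ Operator Description \
0 + Addition Adds the values on either side.
1 - Subtraction Subtracts the right operand from the left operand...
2 * Multiplication Multiplies the values on either side
3 / Division Divides the left operand by the right operand
4 % Modulus Divides the left operand by the right operand...
5 ** Exponent Performs exponential (power) calculation on the operands
6 // Floor Division - The division of operands where...

Example

0 a + b = 30
1 a – b = -10
2 a * b = 200
3 b / a = 2
4 b % a = 0
5 a**b =10 to the power 20
6 9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11.... ]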
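[ Operator Description \
0 == Returns true if the values of the two operands are equal
1 != Returns true if the values of the two operands are not equal
2 <> Returns true if the values of the two operands are not equal
3 > Returns true if the value of the left operand is greater than the value of the right operand
4 < Returns true if the value of the left operand is less than the value of the right operand
5 >= Returns true if the value of the left operand is greater than or equal to the value of the right operand
6 <= Returns true if the value of the left operand is less than or equal to the value of the right operand

Example

0 (a == b) is not true.
1 (a != b) is true.
2 (a <> b) is true. This is similar to the != operator.
3 (a > b) is not true.
4 (a < b) is true.
5 (a >= b) is not true.
6 (a <= b) is true. ]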
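[ Operator Description \
0 = Assigns the value of the right operand to the left operand...
1 += Add and assign It adds the right operand to the left operand and...
2 -= Subtract and assign It subtracts the right operand from the left operand...
3 *= Multiply and assign It multiplies the right operand with the left operand...
4 /= Divide and assign It divides the left operand by the right operand...
5 %= Modulus and assign It takes the modulus using the two operands and...
6 **= Exponent and assign Performs exponential (power) calculation on the operands...
7 //= Floor Division It performs floor division on the operands and...

Example

0 c = a + b assigns the value of a + b to c
1 c += a is equivalent to c = c + a
2 c -= a is equivalent to c = c - a
3 c *= a is equivalent to c = c * a
4 c /= a is equivalent to c = c / a
5 c %= a is equivalent to c = c % a
6 c **= a is equivalent to c = c ** a
7 c //= a is equivalent to c = c // a ]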
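[ Operator \
0 & Binary AND (see the Example section)
1 | Binary OR (see the Example section)
2 ^ Binary XOR (see the Example section)
3 ~ Binary complement (see the Example section)
4 << Shifts the bits of the value on the left of the operator N bits to the left
5 >> Shifts the bits of the value on the left of the operator N bits to the right

Description \
0 If the corresponding bits are both 1, the result is 1, otherwise 0. The nth bit of the returned binary number always comes from the first number (counting from right to left)
1 If the corresponding bits are both 0, the result is 0, otherwise 1
2 If a bit is 1, that bit is 1 in the result. Otherwise it is 0
3 Inverts the data bits, turning 0 into 1 and 1 into 0
4 Shifts the bits of the left value n places to the left, filling the right with 0.
5 Shifts the bits of the left value n places to the right. Vacated bits are filled with 0

Example
0 (a & b) (yields binary 00001100)
1 (a | b) = 61 (yields binary 00111101)
2 (a ^ b) = 49 (yields binary 00110001)
3 (~a ) = -61 (yields binary 11000011 in two's complement form)
4 a << 2 = 240 (yields binary 11110000)
5 a >> 2 = 15 (yields binary 00001111)]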
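[ Operator Description \
0 and (logical AND, see the Example section).
1 or (logical OR, see the Example section).
2 not (logical NOT) (reverses the logical state of its operand)

Example
0 (a and b) is true.
1 (a or b) is true.
2 Not(a and b) is false.]
[ Operator Description \
0 in (returns True if the value is found in the specified sequence, otherwise False)
1 not in (returns True if the value is not found in the specified sequence, otherwise False)

Example
0 x in y, returns 1 if x is a member of y.
1 x not in y, returns 1 if x is not a member of y.]
[ Operator Description \
0 is (compares whether the IDs of two objects are equal)
1 is not (compares whether the IDs of two objects are not equal)

Example
0 x is y, returns 1 if id(x) and id(y) are equal.
1 x is not y, returns 1 if id(x) and id(y) are not equal.]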
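[ Sr.No. Operator & Description
0 1 ** Exponentiation (raise to the power)
1 2 ~ + - Complement, unary plus and minus (method names +@ and -@)
2 3 * / % // Multiply, divide, modulo and floor division
3 4 + - Addition and subtraction
4 5 >> << Right and left bitwise shift
5 6 & Bitwise AND
6 7 ^ | Bitwise exclusive OR and regular OR
7 8 <= < > >= Comparison operators
8 9 <> == != Equality operators
9 10 = %= /= //= -= += *= **= Assignment operators
10 11 is is not Identity operators
11 12 in not in Membership operators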