Parsing HTML with Pandas | 极客教程

Parsing HTML with Pandas

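Data analysis work often requires pulling data out of web pages. Pandas can parse an HTML page and convert the tables it contains into DataFrames. This article walks through how to parse HTML with Pandas, with examples.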

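Using the Pandas read_html function

Pandas provides a read_html function that parses the tables in an HTML page. It returns a list of DataFrame objects, one for each table it finds. Note that read_html relies on an external parser, so lxml (or beautifulsoup4 together with html5lib) must be installed.

The basic usage of read_html looks like this: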

import pandas as pd

url = 'https://www.example.com/table.html'
tables = pd.read_html(url)

# Print every table that was found
for table in tables:
    print(table)

In the code above, we first import Pandas, then call read_html to fetch every table from the given URL, and finally print each resulting DataFrame.
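Because the URL above is only a placeholder, the following minimal sketch runs the same call against an inline HTML string so that it works offline. The sample markup and the name SAMPLE_HTML are assumptions made for illustration, and the sketch assumes lxml is installed as the parser backend.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample page with two tables (not taken from a real site).
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
<table>
  <tr><th>country</th><th>population</th></tr>
  <tr><td>China</td><td>1400</td></tr>
  <tr><td>USA</td><td>330</td></tr>
</table>
</body></html>
"""

# read_html accepts a file-like object; StringIO wraps the in-memory string.
tables = pd.read_html(StringIO(SAMPLE_HTML))

# One DataFrame per <table> element, in document order.
print(len(tables))
for i, table in enumerate(tables):
    print(i, list(table.columns), table.shape)
```

The return value is always a list, even when the page holds a single table, which is why later examples index into it.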

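Parsing a single table

If we know that the HTML page contains only one table, we can take the first element of the returned list to get the DataFrame directly. For example, suppose we have an HTML page named table.html with the following content (the columns 姓名 and 年龄 mean "name" and "age"):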

<table>
  <tr>
    <th>姓名</th>
    <th>年龄</th>
  </tr>
  <tr>
    <td>Alice</td>
    <td>25</td>
  </tr>
  <tr>
    <td>Bob</td>
    <td>30</td>
  </tr>
</table>

We can parse this table with the following code:

import pandas as pd

url = 'table.html'
df = pd.read_html(url)[0]

print(df)

Running the code above produces the following output:

    姓名  年龄
0  Alice  25
1    Bob  30
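Indexing with [0] silently picks the first table even if the page later gains more of them. As a hedged sketch, a small wrapper can make the "exactly one table" assumption explicit. The helper name read_single_table is hypothetical and not part of pandas, and the sample markup uses English column names for illustration.

```python
from io import StringIO

import pandas as pd


def read_single_table(source):
    """Return the only table in `source`, or raise if there is not exactly one.

    Hypothetical helper for illustration; not part of pandas.
    """
    tables = pd.read_html(source)
    if len(tables) != 1:
        raise ValueError(f"expected exactly 1 table, found {len(tables)}")
    return tables[0]


html = """
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
"""

df = read_single_table(StringIO(html))
print(df)
```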

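Custom parsing parameters

read_html also accepts a number of parameters that control how the HTML page is parsed.

header: the row (or rows) to use as the column headers. The default is None, in which case pandas infers the header from <thead> or from leading rows made up entirely of <th> cells.
index_col: the column (or columns) to use as the row index.
flavor: the engine used to parse the HTML, such as lxml or html5lib. With the default of None, pandas tries lxml first and falls back to bs4 with html5lib if that fails.

Here is an example that sets some of these parameters: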

import pandas as pd

url = 'table.html'
df = pd.read_html(url, header=0, index_col=0)[0]

print(df)
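The effect of header and index_col is easiest to see on a table whose first row is not marked up with <th>. The following sketch compares the three calls; the sample markup and variable names are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose first row uses <td> instead of <th>,
# so pandas cannot infer a header row from the markup.
html = """
<table>
  <tr><td>name</td><td>age</td></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
"""

# No header argument: all three rows are data, columns are numbered 0 and 1.
df_default = pd.read_html(StringIO(html))[0]

# header=0 promotes the first row to column names.
df_header = pd.read_html(StringIO(html), header=0)[0]

# index_col=0 additionally uses the first column ("name") as the row index.
df_indexed = pd.read_html(StringIO(html), header=0, index_col=0)[0]

print(df_default.shape)
print(list(df_header.columns))
print(list(df_indexed.index))
```

When the markup already uses <th> cells, as in the table.html example, header=0 is usually redundant and only index_col changes the result.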

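Parsing multiple tables

If the HTML page contains several tables, the match parameter narrows the result to the tables that contain matching text. match accepts a string or a compiled regular expression (it does not accept a function). read_html still returns a list, which holds every table whose text matched.

For example, suppose table.html contains two tables (国家 and 人口 mean "country" and "population"):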

<table>
  <tr>
    <th>姓名</th>
    <th>年龄</th>
  </tr>
  <tr>
    <td>Alice</td>
    <td>25</td>
  </tr>
  <tr>
    <td>Bob</td>
    <td>30</td>
  </tr>
</table>

<table>
  <tr>
    <th>国家</th>
    <th>人口</th>
  </tr>
  <tr>
    <td>China</td>
    <td>1400</td>
  </tr>
  <tr>
    <td>USA</td>
    <td>330</td>
  </tr>
</table>

We can pick out the second table like this:

import pandas as pd

url = 'table.html'
df = pd.read_html(url, match='国家')[0]

print(df)

Running the code above produces the following output:

    国家  人口
0  China  1400
1    USA   330
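Since match filters rather than selects, more than one table can come back. The sketch below illustrates this with a plain string and with a compiled pattern; the three-table sample page is an assumption made for illustration.

```python
import re
from io import StringIO

import pandas as pd

# Hypothetical page with three small tables.
html = """
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
</table>
<table>
  <tr><th>country</th><th>population</th></tr>
  <tr><td>China</td><td>1400</td></tr>
</table>
<table>
  <tr><th>country</th><th>capital</th></tr>
  <tr><td>China</td><td>Beijing</td></tr>
</table>
"""

# A plain string is treated as a regular expression;
# two of the tables contain the text "country".
by_string = pd.read_html(StringIO(html), match="country")

# A compiled pattern works too; only one table contains "population".
by_regex = pd.read_html(StringIO(html), match=re.compile("population"))

print(len(by_string), len(by_regex))
```

If several tables share the matched text, a more specific pattern (or the attrs parameter, which filters on HTML attributes such as id) may be a better fit.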

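Parsing HTML content that is already loaded

Besides a URL or a file path, read_html can also parse HTML content that is already in memory. This is useful when the page has already been downloaded.

For example, suppose we have an HTML file table.html that contains a table (城市 and 温度 mean "city" and "temperature"):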

<table>
  <tr>
    <th>城市</th>
    <th>温度</th>
  </tr>
  <tr>
    <td>北京</td>
    <td>25</td>
  </tr>
  <tr>
    <td>上海</td>
    <td>28</td>
  </tr>
</table>

We can parse the file with the following code:

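from io import StringIO

import pandas as pd

# Specify the encoding explicitly, since the file contains non-ASCII text
with open('table.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

# Passing a raw HTML string is deprecated since pandas 2.1.0, so wrap it in StringIO
df = pd.read_html(StringIO(html_content))[0]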

print(df)
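To check that reading from a path and reading from in-memory text give the same result, the following sketch writes a small page to a temporary directory and parses it both ways. The file contents, the temporary location, and the variable names are assumptions made for illustration.

```python
import tempfile
from io import StringIO
from pathlib import Path

import pandas as pd

# Hypothetical page content, kept ASCII-only for simplicity.
html = """
<table>
  <tr><th>city</th><th>temperature</th></tr>
  <tr><td>Beijing</td><td>25</td></tr>
  <tr><td>Shanghai</td><td>28</td></tr>
</table>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "table.html"  # hypothetical local copy of a page
    path.write_text(html, encoding="utf-8")

    # Option 1: let pandas open the file itself.
    df_from_path = pd.read_html(str(path))[0]

    # Option 2: read the text ourselves and hand pandas a file-like object.
    df_from_text = pd.read_html(StringIO(path.read_text(encoding="utf-8")))[0]

print(df_from_path.equals(df_from_text))
```

Reading the text yourself is mainly worthwhile when the HTML comes from somewhere other than a file, for example an HTTP response body or a cache.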

Summary

This article covered how to parse HTML with Pandas. With the read_html function, we can extract the tables from an HTML page and convert them into DataFrames for further analysis and processing.
