Extracting Text from HTML with BeautifulSoup

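When crawling or scraping the web, we often need to extract text from HTML pages. BeautifulSoup is a Python library that parses HTML documents and gives access to their text content. This article shows how to use BeautifulSoup to extract text from HTML, with example code for each case.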

Installing BeautifulSoup

First, install the BeautifulSoup library with pip:

pip install beautifulsoup4

Once it is installed, we can start using BeautifulSoup to parse HTML documents.
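To confirm that the installation worked, a quick check along these lines can help. It is only a sketch and assumes the package was installed into the same interpreter that runs the script.

```python
import bs4

# bs4 is the import name of the beautifulsoup4 package.
version = bs4.__version__
print(version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package for a different Python environment.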

Parsing an HTML Document

We first need an HTML document to use in the examples that follow. Here is a simple one:

<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p>This is a website for geeks.</p>
    <ul>
        <li>Python</li>
        <li>JavaScript</li>
        <li>HTML</li>
    </ul>
</body>
</html>

We will use BeautifulSoup to parse this HTML document and extract its text.

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p>This is a website for geeks.</p>
    <ul>
        <li>Python</li>
        <li>JavaScript</li>
        <li>HTML</li>
    </ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.get_text())

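Calling get_text() on the whole document returns the text of every element (the title, the heading, the paragraph, and the three list items) together with the whitespace and line breaks that separate the tags in the source, so the printed result contains blank lines and indentation.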
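When the extra whitespace is unwanted, get_text() accepts separator and strip arguments, and stripped_strings yields the trimmed fragments one at a time. The following is a minimal sketch on a shortened version of the sample document.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>Geek Docs</title></head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p>This is a website for geeks.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims each text fragment and drops fragments that are only
# whitespace; separator is placed between the fragments that remain.
clean_text = soup.get_text(separator="\n", strip=True)
print(clean_text)

# stripped_strings yields the same trimmed fragments one at a time.
fragments = list(soup.stripped_strings)
print(fragments)
```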

Extracting the Title

Sometimes we only need the title of an HTML document. The following example shows how to extract it:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p>This is a website for geeks.</p>
    <ul>
        <li>Python</li>
        <li>JavaScript</li>
        <li>HTML</li>
    </ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

title = soup.title.get_text()
print(title)

Output:

Geek Docs

Here soup.title returns the first <title> tag in the document, and get_text() returns the text inside it.
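Real pages do not always have a title, in which case soup.title is None and calling get_text() on it raises AttributeError. One way to guard against that is sketched below; page_title is a hypothetical helper name used only for illustration.

```python
from bs4 import BeautifulSoup


def page_title(html, default=""):
    """Return the text of the first <title> tag, or default if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> tag.
    if soup.title is None:
        return default
    return soup.title.get_text(strip=True)


with_title = page_title("<html><head><title> Geek Docs </title></head></html>")
without_title = page_title("<html><body><p>No title here</p></body></html>",
                           default="(untitled)")
print(with_title)
print(without_title)
```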

Extracting Paragraph Content

Besides the title, we often need the paragraph text of an HTML document. The following example shows how to extract it:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p>This is a website for geeks.</p>
    <ul>
        <li>Python</li>
        <li>JavaScript</li>
        <li>HTML</li>
    </ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

paragraph = soup.find('p').get_text()
print(paragraph)

Output:

This is a website for geeks.

Note that find('p') returns only the first <p> tag in the document.
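To collect every paragraph rather than just the first, find_all() can be used in place of find(). The sketch below also shows a guard for tags that may be absent. The second paragraph and the footer lookup are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
    <p>This is a website for geeks.</p>
    <p>Learn Python, JavaScript, and HTML here.</p>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns only the first match (or None); find_all() returns every match.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs)

# Guard against a missing tag before calling get_text() on it.
footer = soup.find("footer")
footer_text = footer.get_text() if footer is not None else ""
print(repr(footer_text))
```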

Extracting List Items

Lists are a common element in HTML documents. The following example shows how to extract the items of a list:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p>This is a website for geeks.</p>
    <ul>
        <li>Python</li>
        <li>JavaScript</li>
        <li>HTML</li>
    </ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

items = soup.find_all('li')
for item in items:
    print(item.get_text())

Output:

Python
JavaScript
HTML

find_all('li') returns every <li> tag in the document, and the loop prints the text of each one.
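When a page contains several lists, a CSS selector can restrict the match to one of them. This is a sketch only: the id "languages" and the extra <ol> are assumptions added to the sample markup to make the difference visible.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul id="languages">
    <li>Python</li>
    <li>JavaScript</li>
    <li>HTML</li>
</ul>
<ol>
    <li>Unrelated item</li>
</ol>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The selector matches only <li> tags directly inside the list with
# id="languages", so the <ol> item is not included.
languages = [li.get_text(strip=True) for li in soup.select("ul#languages > li")]
print(languages)
```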

Extracting Link Text

Links are another common element in HTML documents. The following example shows how to extract the text of a link:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p>This is a website for geeks.</p>
    <a href="https://www.geek-docs.com">Geek Docs</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

link_text = soup.a.get_text()
print(link_text)

Output:

Geek Docs

soup.a is shorthand for the first <a> tag in the document, so this prints the text of the first link.
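The link target is often needed together with the link text. A possible sketch that collects both for every <a> tag follows. The second anchor, which has no href, is an assumption added to show how missing attributes behave.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
    <a href="https://www.geek-docs.com">Geek Docs</a>
    <a>Anchor without href</a>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

links = []
for a in soup.find_all("a"):
    # Tag.get() returns None instead of raising KeyError when the
    # attribute is missing, unlike a["href"].
    links.append((a.get_text(strip=True), a.get("href")))

print(links)
```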

Extracting Table Content

Tables are also common in HTML documents. The following example shows how to extract the contents of a table:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
        <tr>
            <td>Alice</td>
            <td>25</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>30</td>
        </tr>
    </table>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

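table = soup.find('table')
rows = table.find_all('tr')
for row in rows:
    # Match both header cells (<th>) and data cells (<td>); matching only
    # 'td' would print an empty line for the header row.
    cells = row.find_all(['th', 'td'])
    print('\t'.join(cell.get_text() for cell in cells))

The output has one line per table row, with the cells separated by tabs: first Name and Age, then Alice and 25, then Bob and 30.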
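For further processing it is often more convenient to turn each data row into a dictionary keyed by the column headers. The following sketch assumes the table has a single header row made of <th> cells, as in the sample above.

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>Alice</td><td>25</td></tr>
    <tr><td>Bob</td><td>30</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")

# Column names come from the <th> cells of the header row.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        records.append(dict(zip(headers, cells)))

print(records)
```

All values are strings at this point; converting Age to an integer would be a separate step.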

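Extracting Elements with a Specific Attribute

Sometimes we need the elements that carry a specific attribute. The following example shows how to select elements by their class attribute: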

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to Geek Docs</h1>
    <p class="intro">This is a website for geeks.</p>
    <p class="content">Learn Python, JavaScript, and HTML here.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

intro_paragraph = soup.find('p', class_='intro').get_text()
print(intro_paragraph)

content_paragraph = soup.find('p', class_='content').get_text()
print(content_paragraph)

Output:

This is a website for geeks.
Learn Python, JavaScript, and HTML here.

The keyword argument is spelled class_ because class is a reserved word in Python.
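Attributes other than class can be matched in a similar way. The sketch below is illustrative only: the id and data-lang attributes are assumptions added to the sample markup.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
    <p id="lead" class="intro">This is a website for geeks.</p>
    <p class="content" data-lang="en">Learn Python, JavaScript, and HTML here.</p>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Look up an element by its id attribute.
lead = soup.find(id="lead").get_text()

# Attribute names that are not valid Python keyword arguments (such as
# data-*) can be passed through the attrs dictionary.
english = soup.find("p", attrs={"data-lang": "en"}).get_text()

# The same class lookup as in the example above, written as a CSS selector.
content = soup.select_one("p.content").get_text()

print(lead)
print(english)
print(content)
```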

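Extracting Text from Nested Elements

Elements in an HTML document are often nested, and we may need the text of an element together with the text of its children. The following example shows how to extract it: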

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <h1>Welcome to <span>Geek Docs</span></h1>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

nested_text = soup.h1.get_text()
print(nested_text)

Output:

Welcome to Geek Docs

get_text() includes the text of all descendants, so the content of the nested <span> is part of the result.
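The difference between .string, .strings, and get_text() shows up precisely with nested markup like this. A short sketch on the same <h1> element:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Welcome to <span>Geek Docs</span></h1>", "html.parser")
h1 = soup.h1

# .string is None here because <h1> has more than one child node.
direct_string = h1.string
print(direct_string)

# .strings yields each text fragment separately.
parts = list(h1.strings)
print(parts)

# get_text() joins the fragments into a single string.
full_text = h1.get_text()
print(full_text)

# The nested <span> can also be addressed on its own.
span_text = h1.span.get_text()
print(span_text)
```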

Extracting Comments

HTML documents sometimes contain comments. The following example shows how to extract a comment:

from bs4 import BeautifulSoup, Comment

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Geek Docs</title>
</head>
<body>
    <!-- This is a comment -->
    <h1>Welcome to Geek Docs</h1>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# string= is the current name of this argument; text= is the older, deprecated spelling.
comment = soup.find(string=lambda text: isinstance(text, Comment))
print(comment)
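To collect every comment in a document rather than only the first, find_all() accepts the same string filter. A minimal sketch follows. The second comment is an assumption added to the sample markup.

```python
from bs4 import BeautifulSoup, Comment

html_doc = """
<body>
    <!-- This is a comment -->
    <h1>Welcome to Geek Docs</h1>
    <!-- Another comment -->
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Comment is a subclass of NavigableString, so a string filter can select it.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# The comment text keeps its surrounding spaces, so strip() is applied here.
cleaned = [c.strip() for c in comments]
print(cleaned)
```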