BeautifulSoup: Extracting Only an Element's Own Text, Excluding Its Child Elements

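This article shows how to use the BeautifulSoup library to extract the text content of an HTML element without including the text of its child elements. BeautifulSoup makes it easy to parse an HTML page and pull out the information you need, but sometimes you only want the text that belongs to the element itself, not the text of the elements nested inside it. The examples below walk through how to do this.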

Extracting an element's text content with BeautifulSoup

First, install the BeautifulSoup library with pip:

pip install beautifulsoup4

Once it is installed, we can use BeautifulSoup to parse an HTML page. Suppose we have the following HTML:

<div class="article">
    <h1>This is the title</h1>
    <p>This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
</div>

We want to extract the text of the <div> element itself, excluding the text of its child elements <h1> and <p>.

First, import BeautifulSoup and the requests library (used to fetch the HTML page):

from bs4 import BeautifulSoup
import requests

Then fetch the page with requests and pass its content to the BeautifulSoup constructor:

html = requests.get('http://example.com')  # replace with the actual URL
soup = BeautifulSoup(html.content, 'html.parser')
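For experimenting without a network connection, the same snippet can be parsed straight from a string instead of being downloaded. This is a minimal sketch; the variable name sample_html is chosen here for illustration and is not part of the original example.

```python
from bs4 import BeautifulSoup

# The HTML snippet shown earlier, held in a string (hypothetical variable name).
sample_html = """
<div class="article">
    <h1>This is the title</h1>
    <p>This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
</div>
"""

# Parse with Python's built-in parser; no HTTP request is involved.
soup = BeautifulSoup(sample_html, "html.parser")

# Quick sanity check that the document was parsed as expected.
print(soup.h1.string)
```

Parsing from a string keeps the example reproducible, since the page at a real URL may not contain the elements used here.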

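Next, we locate the element with the BeautifulSoup object's find() or find_all() method and read its text through the .text property. Note that .text returns the text of the element together with the text of all of its descendants.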

div_element = soup.find('div', class_='article')
text_content = div_element.text
print(text_content)

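Running this code prints the title and both paragraphs (This is the title, This is the first paragraph., This is the second paragraph.), separated by the newlines and indentation that sit between the tags. In other words, .text concatenates the text of every descendant, so it does not exclude the child elements. To obtain only the text placed directly inside the <div>, restrict the lookup to its direct text nodes, for example with div_element.find_all(string=True, recursive=False). In this particular snippet the <div> holds nothing but whitespace between its children, so its own text is effectively empty.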
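The difference between the two approaches is easier to see when the <div> actually has text of its own. The sketch below uses a hypothetical variant of the snippet with two lines of text added directly inside the <div>; the helper name own_text is also an assumption made for illustration, not a BeautifulSoup API.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Return the text placed directly inside `tag`, ignoring child elements."""
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside <h1> or <p> is never visited.
    direct_strings = tag.find_all(string=True, recursive=False)
    return " ".join(s.strip() for s in direct_strings if s.strip())


# Hypothetical variant of the snippet: "Intro text" and "Closing text"
# are added here so that the <div> has text of its own.
html = """
<div class="article">
    Intro text
    <h1>This is the title</h1>
    <p>This is the first paragraph.</p>
    Closing text
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div_element = soup.find("div", class_="article")

# Includes the text of <h1> and <p> as well.
all_text = div_element.get_text(" ", strip=True)
# Only the text that belongs to the <div> itself.
direct_text = own_text(div_element)

print(all_text)
print(direct_text)  # Intro text Closing text
```

Joining with a single space and stripping each piece is one possible choice; keeping the raw strings may be preferable when whitespace is significant.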

Example: extracting text without child elements

Let's look at a more complex example of extracting text without the child elements.

Suppose we have the following HTML:

<div class="container">
    <h2>This is the heading</h2>
    <p>This is the first paragraph.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
    <p>This is the second paragraph.</p>
</div>

We want to extract the text of the <div> element itself, excluding the text of its child elements <h2>, <p>, and <ul>.

As before, start by parsing the HTML page with BeautifulSoup:

html = requests.get('http://example.com')  # replace with the actual URL
soup = BeautifulSoup(html.content, 'html.parser')

Then find the target <div> element and read its .text property:

div_element = soup.find('div', class_='container')
text_content = div_element.text
print(text_content)

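Running this code prints the text of every descendant: This is the heading, This is the first paragraph., Item 1, Item 2, Item 3, and This is the second paragraph., separated by the newlines and indentation between the tags. Once again, .text includes the text of the child elements rather than excluding it. Because this <div> contains no text of its own apart from whitespace, collecting only its direct text nodes (for example with div_element.find_all(string=True, recursive=False)) yields whitespace-only strings; the same technique returns real content whenever text is placed directly inside the element.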
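As an alternative sketch, the direct text nodes can also be collected by iterating over the element's children and keeping only the string nodes. The helper name direct_text_nodes and the second HTML variant (with text added directly inside the <div>) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def direct_text_nodes(tag):
    """Collect non-empty text nodes that are direct children of `tag`."""
    pieces = []
    for child in tag.children:
        # Keep plain strings only; skip child tags and HTML comments.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = child.strip()
            if stripped:
                pieces.append(stripped)
    return pieces


# The snippet from the article: the <div> holds only child elements.
original_html = """
<div class="container">
    <h2>This is the heading</h2>
    <p>This is the first paragraph.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</div>
"""

# Hypothetical variant with text placed directly inside the <div>.
variant_html = """
<div class="container">
    <h2>This is the heading</h2>
    Note before the list
    <ul>
        <li>Item 1</li>
    </ul>
    Note after the list
</div>
"""

original_div = BeautifulSoup(original_html, "html.parser").find("div", class_="container")
variant_div = BeautifulSoup(variant_html, "html.parser").find("div", class_="container")

original_result = direct_text_nodes(original_div)
variant_result = direct_text_nodes(variant_div)

print(original_result)  # []
print(variant_result)   # ['Note before the list', 'Note after the list']
```

Returning a list rather than a joined string leaves it to the caller to decide how the pieces should be combined.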