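Beautiful Soup – Kinds of Objects

When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the complex HTML page into a tree of Python objects. Below we discuss the four main kinds of objects: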

Tag
NavigableString
BeautifulSoup
Comment

Tag Objects

An HTML tag is used to define various types of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the actual page or document.

>>> from bs4 import BeautifulSoup
>>> # The 'lxml' parser wraps the fragment in <html> and <body>, so soup.html exists.
>>> soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
>>> tag = soup.html
>>> type(tag)
<class 'bs4.element.Tag'>

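A tag has many attributes and methods. Its two most important features are its name and its attributes.

Name (tag.name)

Every tag has a name, which you can access by appending '.name' to the tag. tag.name tells you what type of tag it is.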

>>> tag.name
'html'

If we change the tag's name, however, the change is reflected in the HTML markup that Beautiful Soup generates.

>>> tag.name = "Strong"
>>> tag
<Strong><body><b class="boldest">TutorialsPoint</b></body></Strong>
>>> tag.name
'Strong'
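
As a small self-contained illustration of the renaming described above, the sketch below uses Python's built-in 'html.parser' (an assumption made here so that no extra parser is needed) and works on the <b> tag directly. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# 'html.parser' ships with Python and does not add <html>/<body> wrappers.
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'html.parser')
tag = soup.b

original_name = tag.name      # name of the tag as parsed
tag.name = 'strong'           # rename the tag in place
renamed_markup = str(soup)    # the generated markup now uses the new name

print(original_name)
print(renamed_markup)
```

Renaming changes only the tag's name; its attributes and contents are left as they were.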

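Attributes (tag.attrs)

A tag object can have any number of attributes. The tag <b class="boldest"> has one attribute, 'class', whose value is "boldest". Inside the opening tag, everything after the tag name is an attribute, and each attribute carries a value. You can read a single attribute by treating the tag like a dictionary (for example, tag['class']), or get all of them at once as a dictionary through '.attrs'.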

>>> tutorialsP = BeautifulSoup("<div class='tutorialsP'></div>", 'lxml')
>>> tag2 = tutorialsP.div
>>> tag2['class']
['tutorialsP']

We can add, remove, and modify a tag's attributes.

>>> tag2['class'] = 'Online-Learning'
>>> tag2['style'] = '2007'
>>> tag2
<div class="Online-Learning" style="2007"></div>
>>> del tag2['style']
>>> tag2
<div class="Online-Learning"></div>
>>> # The same operations on the <b> tag from the first example
>>> b_tag = soup.b
>>> b_tag['SecondAttribute'] = '2'
>>> del b_tag['class']
>>> b_tag
<b SecondAttribute="2">TutorialsPoint</b>
>>> del b_tag['SecondAttribute']
>>> b_tag
<b>TutorialsPoint</b>
>>> tag2['class']
'Online-Learning'
>>> tag2['style']
Traceback (most recent call last):
  ...
KeyError: 'style'
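
Because indexing a missing attribute raises KeyError, it can help to read attributes defensively. The following sketch, which assumes the built-in 'html.parser' and uses made-up markup and variable names, shows '.attrs', '.get()' and '.has_attr()' alongside adding and deleting an attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="tutorialsP" id="main"></div>', 'html.parser')
div = soup.div

all_attrs = dict(div.attrs)            # copy of the attribute dictionary
missing = div.get('style')             # None instead of KeyError
fallback = div.get('style', 'none')    # explicit default value
has_id = div.has_attr('id')

div['style'] = 'color: red'            # add an attribute
del div['id']                          # remove an attribute
final_markup = str(div)

print(all_attrs)
print(final_markup)
```

Using '.get()' mirrors the behaviour of a normal Python dictionary, which is convenient when scraping pages where an attribute may or may not be present.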

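Multi-valued Attributes

Some HTML attributes can have multiple values. The most common is the class attribute, which can hold several CSS class names. Others include "rel", "rev", "headers", "accesskey", and "accept-charset". Beautiful Soup presents the value of a multi-valued attribute as a list.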

>>> from bs4 import BeautifulSoup
>>>
>>> css_soup = BeautifulSoup('<p class="body"></p>')
>>> css_soup.p['class']
['body']
>>>
>>> css_soup = BeautifulSoup('<p class="body bold"></p>')
>>> css_soup.p['class']
['body', 'bold']

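However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone and returns it as a single string.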

>>> id_soup = BeautifulSoup('<p id="body bold"></p>')
>>> id_soup.p['id']
'body bold'
>>> type(id_soup.p['id'])
<class 'str'>

When you turn a tag back into a string, the values of a multi-valued attribute are joined into a single space-separated value.

>>> rel_soup = BeautifulSoup("<p> tutorialspoint Main <a rel='Index'> Page</a></p>")
>>> rel_soup.a['rel']
['Index']
>>> rel_soup.a['rel'] = ['Index', 'Online Library, Its all Free']
>>> print(rel_soup.p)
<p> tutorialspoint Main <a rel="Index Online Library, Its all Free"> Page</a></p>

With 'get_attribute_list' you always get a list back, whether or not the attribute is multi-valued.

>>> id_soup.p.get_attribute_list('id')
['body bold']

However, if you parse the document as 'xml', there are no multi-valued attributes:

>>> xml_soup = BeautifulSoup('<p class="body bold"></p>', 'xml')
>>> xml_soup.p['class']
'body bold'
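
The rules in this section can be seen side by side in one short sketch. It assumes the built-in 'html.parser' and uses an invented <p> tag, so treat the names as illustrative rather than as part of the original examples.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body bold" id="intro text"></p>', 'html.parser')
p = soup.p

class_value = p['class']                        # multi-valued: a list
id_value = p['id']                              # not multi-valued: a str
id_as_list = p.get_attribute_list('id')         # always a list
class_as_list = p.get_attribute_list('class')   # already a list

# Assign a new list; the values are joined with spaces on output.
p['class'] = class_value + ['wide']
markup = str(p)

print(class_value, id_value, id_as_list)
print(markup)
```

Calling 'get_attribute_list' lets downstream code loop over values without first checking whether it received a string or a list.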

NavigableString

A NavigableString object represents the text content of a tag. To access that content, use '.string' on the tag.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>>
>>> soup.string
'Hello, Tutorialspoint!'
>>> type(soup.string)
<class 'bs4.element.NavigableString'>

You cannot edit an existing string in place, but you can replace it with another string using replace_with().

>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'lxml')
>>> soup.string.replace_with("Online Learning!")
'Hello, Tutorialspoint!'
>>> soup.string
'Online Learning!'
>>> soup
<html><body><h2 id="message">Online Learning!</h2></body></html>
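
To make the replacement behaviour more concrete, here is a minimal sketch assuming the built-in 'html.parser'. It also shows that a NavigableString is a subclass of Python's str, and that str() gives a plain copy that no longer belongs to the parse tree. The variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
text = soup.h2.string

is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)      # NavigableString subclasses str
plain = str(text)                   # plain Python string, detached from the tree

# replace_with() swaps the node in the tree and returns the node it removed.
old = text.replace_with('Online Learning!')
new_text = soup.h2.string
markup = str(soup)

print(old, '->', new_text)
print(markup)
```

The value returned by replace_with() is the old string, which is why the interactive session above echoes 'Hello, Tutorialspoint!' after the call.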

BeautifulSoup

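The BeautifulSoup object is created when we parse a web resource for scraping, and it represents the complete document we are working with. For most purposes you can treat it as a Tag object.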

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup.name
'[document]'
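
The statement that the BeautifulSoup object can mostly be treated as a Tag can be checked directly. This is a small sketch assuming the built-in 'html.parser'; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

is_tag = isinstance(soup, Tag)    # BeautifulSoup is a subclass of Tag
doc_name = soup.name              # special name for the document as a whole
heading_ids = [h['id'] for h in soup.find_all('h2')]   # Tag search methods work

print(is_tag, doc_name, heading_ids)
```

Since the object stands for the whole document rather than a real HTML tag, it gets the placeholder name '[document]'.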

Comment

A Comment object represents a comment in the web document. It is just a special type of NavigableString.

>>> soup = BeautifulSoup('<p><!-- Everything inside it is COMMENTS --></p>')
>>> comment = soup.p.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> print(soup.p.prettify())
<p>
 <!-- Everything inside it is COMMENTS -->
</p>
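
Because Comment is a subclass of NavigableString, scraped text can include comments unless they are filtered out. The sketch below, assuming the built-in 'html.parser' and made-up markup, separates comments from visible text with isinstance checks.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = '<p>Visible text<!-- hidden note --></p>'
soup = BeautifulSoup(html, 'html.parser')

# Walk the direct children of <p> and split them by type.
comments = [str(node) for node in soup.p.children if isinstance(node, Comment)]
visible = [str(node) for node in soup.p.children if not isinstance(node, Comment)]

comment_is_string = issubclass(Comment, NavigableString)

print(comments)
print(visible)
```

The comment's text is stored without the surrounding <!-- and --> markers; they are added back only when the tree is converted to markup.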

NavigableString Objects

A NavigableString object represents the text inside a tag, not the tag itself.