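Beautiful Soup - Installation

Since BeautifulSoup is not part of Python's standard library, we need to install it first. We will install the latest BeautifulSoup 4 library (also known as BS4).

To isolate our working environment so that it does not disturb the existing setup, let us first create a virtual environment.

Creating a virtual environment (optional)

A virtual environment lets us create an isolated working copy of Python for a specific project without affecting the outside setup.

The best way to install any Python package is with pip. If pip is not yet installed (you can check by running "pip --version" at the command or shell prompt), you can install it with the following command: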

Linux environment

$sudo apt-get install python3-pip

(On older systems that still use Python 2, the package is named python-pip.)

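Windows environment

To install pip on Windows, do the following:

Download get-pip.py from https://bootstrap.pypa.io/get-pip.py or from GitHub to your computer.
Open the command prompt and navigate to the folder containing the get-pip.py file.
Run the following command:

>python get-pip.py

That's it; pip is now installed on your Windows machine.

You can verify that pip is installed by running the following command: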

>pip --version
pip 19.2.3 from c:\users\yadur\appdata\local\programs\python\python37\lib\site-packages\pip (python 3.7)

Installing virtual environment

Run the following command in your command prompt:

>pip install virtualenv

The following command creates a virtual environment ("myEnv") in your current directory:

>virtualenv myEnv

To activate the virtual environment, run the following command:

>myEnv\Scripts\activate

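Once the environment is activated, the prompt is prefixed with "(myEnv)", which tells us that we are working inside the virtual environment "myEnv".

To leave the virtual environment, run the deactivate command.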

(myEnv) C:\Users\yadur>deactivate
C:\Users\yadur>
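
Besides looking at the prompt prefix, the interpreter itself can report whether it is running inside a virtual environment. The following is a minimal sketch; the function name in_virtual_environment is a hypothetical helper and not part of virtualenv.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Current virtualenv releases and the stdlib venv module leave sys.base_prefix
    # pointing at the base interpreter, while older virtualenv releases set
    # sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


if __name__ == "__main__":
    print("Interpreter prefix:", sys.prefix)
    print("Inside a virtual environment:", in_virtual_environment())
```

Running this script before and after the activate command should show the reported value change from False to True.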

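The virtual environment is ready; now let us install beautifulsoup.

Installing BeautifulSoup

Since BeautifulSoup is not a standard Python library, we need to install it. We will use the BeautifulSoup 4 package (known as bs4).

Linux machine

To install bs4 on Debian or Ubuntu Linux using the system package manager, run the following command: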

$sudo apt-get install python-bs4 (for python 2.x)
$sudo apt-get install python3-bs4 (for python 3.x)

If you have trouble installing with the system package manager, you can install bs4 using easy_install or pip instead.

$easy_install beautifulsoup4
$pip install beautifulsoup4

(If you are using Python 3, you may need to use easy_install3 or pip3 respectively.)

Windows machine

Installing beautifulsoup4 on Windows is very simple, especially if you already have pip installed.

>pip install beautifulsoup4

So now beautifulsoup4 is installed on our machine. Let us talk about some problems that can show up after installation.
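
Before turning to those problems, it can help to confirm which copy of bs4 the current interpreter actually sees. The following is a minimal sketch; describe_bs4_install is a hypothetical helper name, and it relies only on the standard importlib module and the bs4 package's __version__ attribute.

```python
import importlib.util


def describe_bs4_install() -> str:
    """Describe the bs4 installation visible to this interpreter, if any."""
    spec = importlib.util.find_spec("bs4")
    if spec is None:
        return "bs4 is not installed for this interpreter"
    import bs4  # imported lazily so the check also works when bs4 is missing
    return f"bs4 {bs4.__version__} loaded from {spec.origin}"


if __name__ == "__main__":
    print(describe_bs4_install())
```

If the reported path points outside the active virtual environment, the package was probably installed for a different interpreter.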

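Problems after installation

On a Windows machine, you may run into errors caused by the wrong version being installed, mainly the following:

Error: ImportError "No module named HTMLParser" means that you are running the Python 2 version of the code under Python 3.
Error: ImportError "No module named html.parser" means that you are running the Python 3 version of the code under Python 2.

The best way out of both situations is to completely remove the existing installation and reinstall BeautifulSoup.

If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3, either by installing the package: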

$ python3 setup.py install

or by manually running Python's 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4

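Installing a parser

By default, Beautiful Soup supports the HTML parser included in Python's standard library. It also supports several external third-party Python parsers, such as lxml and html5lib.

To install the lxml or html5lib parser, use the following commands:

Linux machine

$sudo apt-get install python-lxml
$sudo apt-get install python-html5lib

(For Python 3, the packages are named python3-lxml and python3-html5lib.)

Windows machine

>pip install lxml
>pip install html5lib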

Generally, users choose lxml for speed. Using lxml or html5lib is also recommended on older versions of Python 2 (before 2.7.3) or Python 3 (before 3.2.2), because Python's built-in HTML parser is not very good in those releases.
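
Because the third-party parsers are optional, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; pick_parser is a hypothetical helper, and the preference order (lxml, then html5lib, then html.parser) is an assumption based on the speed remark above.

```python
import importlib.util


def pick_parser() -> str:
    """Return a parser name that can be passed to BeautifulSoup.

    Prefers lxml, then html5lib, and falls back to the
    standard-library "html.parser" when neither is installed.
    """
    for module_name in ("lxml", "html5lib"):
        # find_spec returns None when the module cannot be found
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return "html.parser"


if __name__ == "__main__":
    print("Selected parser:", pick_parser())
```

The returned string can be used as the second argument of the constructor, for example BeautifulSoup(markup, pick_parser()).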

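Running Beautiful Soup

It is time to test our Beautiful Soup package on an HTML page and extract some information from it.

In the code below, we try to extract the title from the web page: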

from bs4 import BeautifulSoup
import requests

# Download the page and parse it with the built-in HTML parser
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Print the page's <title> element
print(soup.title)

Output

<title>H2O, Colab, Theano, Flutter, KNime, Mean.js, Weka, Solidity, Org.Json, AWS QuickSight, JSON.Simple, Jackson Annotations, Passay, Boon, MuleSoft, Nagios, Matplotlib, Java NIO, PyTorch, SLF4J, Parallax Scrolling, Java Cryptography</title>
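
Note that soup.title returns the whole element, including the surrounding tags. The following sketch shows the difference between the tag and its text; the markup is a hypothetical sample string, used so that the example needs no network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> element
title_text = soup.title.string  # only the text inside the element

print(title_tag)   # <title>Sample page</title>
print(title_text)  # Sample page
```

The same two attributes work on a soup object built from req.text.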

A common task is to extract all the URLs in a web page. For that, we only need to add the following lines of code:

for link in soup.find_all('a'):
    print(link.get('href'))

Output

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.php
https://store.tutorialspoint.com
https://www.tutorialspoint.com/gate_exams_tutorials.htm
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorix.com
https://www.tutorialspoint.com/videotutorials/top-courses.php
https://www.tutorialspoint.com/the_full_stack_web_development/index.asp
….
….
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
https://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
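
The last few entries in the output are relative paths, not complete URLs. One way to turn them into absolute URLs is the standard-library function urljoin, sketched below. The hrefs list copies a few values from the output and adds a None entry as an assumption, since link.get('href') returns None for an anchor without an href attribute.

```python
from urllib.parse import urljoin

# The page the links were extracted from
base_url = "https://www.tutorialspoint.com/index.htm"

# A few href values from the output, plus None for an <a> tag without href
hrefs = [
    "https://www.tutorialspoint.com/about/about_careers.htm",
    "/about/faq.htm",
    None,
    "/about/contact_us.htm",
]

# Skip missing hrefs and resolve the rest against the page URL;
# URLs that are already absolute are returned unchanged.
absolute_urls = [urljoin(base_url, href) for href in hrefs if href]

for url in absolute_urls:
    print(url)
```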

Similarly, we can extract other useful information using beautifulsoup4.

Now let us take a closer look at the "soup" object used in the example above.