Scrapy Shell | 极客教程

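The Scrapy shell is an interactive console that lets you debug your scraping code without running a spider. It is meant for testing data-extraction code, but because it is also a regular Python console, you can test any Python code in it.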

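Available functions:

shelp(): print the available objects and functions
fetch(url[, redirect=True]): fetch a new URL and update the related objects
fetch(request): fetch the given request and update the related objects
view(response): open the fetched page in your local browser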

Launching the shell

Use the shell command to launch the Scrapy shell:

scrapy shell <url>

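Based on the downloaded page, the Scrapy shell automatically creates some convenient objects, such as a Response object and a Selector object.

Once the shell has loaded, a local variable named response holds the response data. Entering response.body prints the response body, and entering response.headers shows the response headers.
Entering response.selector gives you a Selector object initialized from the response, which you can query with response.selector.xpath() or response.selector.css().
Scrapy also provides shortcuts: response.xpath() and response.css() work the same way.

For example: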

scrapy shell "http://hr.tencent.com/position.php?&start=0#a" --nolog

The output is shown in the figure:
[Figure: Scrapy shell]

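Worked examples

For a detailed introduction to Selector, see the Selectors article. The examples below extract the title and content of the page.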

# Returns a list of XPath selectors
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>搜索 | 腾讯招聘</title>'>]

# extract() returns a list of Unicode strings
>>> response.xpath('//title').extract()
['<title>搜索 | 腾讯招聘</title>']

# Get the first element of the list
>>> response.xpath('//title').extract()[0]
'<title>搜索 | 腾讯招聘</title>'

# Returns a list of XPath selector objects
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='搜索 | 腾讯招聘'>]

# Returns the Unicode string of the first element in the list
>>> response.xpath('//title/text()')[0].extract()
'搜索 | 腾讯招聘'

>>> response.text

The output is shown in the figure:
[Figure: Scrapy shell]

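Invoking the shell from a spider

Use the scrapy.shell.inspect_response function to invoke the shell from inside a spider.
Start the crawl; when execution reaches inspect_response, the shell opens. When you are done, press Ctrl-D (Ctrl-Z on Windows) to exit the shell, and the crawl resumes.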

import scrapy

class DmozSpider(scrapy.Spider):
    name = "geek-docs"
    allowed_domains = ["geek-docs.com"]
    start_urls = [
        "https://geek-docs.com/",
    ]

    def parse(self, response):
        from scrapy.shell import inspect_response
        # Pause the crawl here and open a shell with this response loaded.
        inspect_response(response, self)