Python Web Scraping | Scrapy in Depth
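<h2> <span style="font-size: 18pt"><strong>I<span style="font-family: "Microsoft YaHei"">. Introduction to the Scrapy Framework</span></strong></span></h2><p class="p"><span style="font-size: 16px"> A framework is a structure that packages a lot of functionality: it builds the main skeleton for us, and we only fill in the content. Scrapy is a framework for crawling websites and extracting data. A crawler has four main parts (request, response, parsing, and storage), and Scrapy provides the scaffolding for all of them. Scrapy is built on Twisted, a popular event-driven networking framework for Python. Scrapy achieves concurrency with non-blocking (asynchronous) code, and it is Twisted that makes this possible: Twisted runs an event loop and invokes the handler for whichever event becomes ready. Scrapy integrates high-performance asynchronous downloading, request queuing, parsing, and persistence; distributed crawling is available through extensions such as scrapy-redis.</span></p>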
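<p>To make "non-blocking concurrency" concrete, here is a minimal sketch. It uses the standard library's asyncio rather than Twisted (an assumption made only so the example runs without extra packages), and <code>fake_download</code> is a hypothetical stand-in for a network request. The point is that the three waits overlap, so the "downloads" finish in order of their delay, not in the order they were started.</p>

```python
import asyncio


async def fake_download(name, delay, finished):
    # asyncio.sleep stands in for waiting on the network; it hands control
    # back to the event loop instead of blocking the whole program.
    await asyncio.sleep(delay)
    finished.append(name)


async def crawl():
    finished = []
    # Start three "downloads" at once; the slowest one is submitted first.
    await asyncio.gather(
        fake_download("a", 0.15, finished),
        fake_download("b", 0.10, finished),
        fake_download("c", 0.05, finished),
    )
    return finished


finished = asyncio.run(crawl())
print(finished)
```

<p>With blocking code the order would be a, b, c and the total time would be the sum of the delays; here the event loop resumes whichever task becomes ready first.</p>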
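<p class="p"><span style="font-size: 18px"><strong>1. The five core components</strong></span></p>
<p><span style="font-size: 18px"><strong>Engine (Scrapy Engine)</strong></span></p>
<p><span style="font-size: 18px"> The core of the framework. <strong>It controls the data flow through the whole system</strong> and triggers events (it determines what kind of data is flowing and calls the corresponding handler). In other words, it carries the communication, signals, and data between the Spider, Item Pipeline, Downloader, and Scheduler, which is why it is called the core of the framework.</span></p>
<p><span style="font-size: 18px"><strong>Scheduler</strong></span></p>
<p><span style="font-size: 18px"> Receives the requests sent by the engine, orders them, and places them in a queue, handing them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and also removes duplicate URLs.</span></p>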
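<p>The "priority queue that also removes duplicates" idea can be sketched in a few lines. This is a toy model, not Scrapy's scheduler: the class name <code>ToyScheduler</code>, its methods, and the example.com URLs are all hypothetical.</p>

```python
import heapq
import itertools


class ToyScheduler:
    """Priority queue of URLs that silently drops duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate, not scheduled again
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority: larger values run first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_url(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("http://example.com/page/1")
scheduler.enqueue("http://example.com/page/2", priority=10)
scheduler.enqueue("http://example.com/page/1")  # ignored as a duplicate
order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
print(order)
```

<p>The higher-priority URL comes out first, the duplicate is never queued, and an empty queue yields <code>None</code>.</p>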
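<p><span style="font-size: 18px"><strong>Downloader</strong></span></p>
<p><span style="font-size: 18px"> Downloads every Request the engine sends and returns the resulting Response to the Scrapy Engine, which passes it to the Spider for processing. The Scrapy downloader is built on Twisted's efficient asynchronous model.</span></p>
<p><span style="font-size: 18px"><strong>Spiders</strong></span></p>
<p><span style="font-size: 18px"> Code the user writes to extract the needed information, the so-called items, from specific pages. A spider can also extract links so that Scrapy goes on to crawl the next page: the follow-up URLs are submitted to the engine and enter the Scheduler again.</span></p>
<p><span style="font-size: 18px"><strong>Item Pipeline</strong></span></p>
<p><span style="font-size: 18px"> Processes the items the spiders extract. Its main jobs are persisting items, validating them, and removing unwanted information.</span></p>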
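<p>As a rough illustration of those three jobs, the sketch below is a plain Python class shaped like a Scrapy item pipeline. The method names <code>open_spider</code>, <code>close_spider</code>, and <code>process_item</code> follow Scrapy's pipeline interface; the class name, the <code>path</code> argument, and the rule that every item needs a <code>title</code> are assumptions made for the example, and it assumes all item values are strings.</p>

```python
import json


class JsonLinesPipeline:
    """Validates, cleans, and persists items, one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path  # hypothetical output location
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Validation: reject items without a title.
        if not item.get("title"):
            # A real pipeline would raise scrapy.exceptions.DropItem here.
            raise ValueError("item has no title")
        # Cleaning: strip surrounding whitespace from every (string) value.
        cleaned = {key: value.strip() for key, value in item.items()}
        # Persistence: append the item to the output file.
        self.file.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
        return cleaned
```

<p>In a real project such a class would live in pipelines.py and be enabled through the <code>ITEM_PIPELINES</code> setting.</p>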
<p><span style="font-size: 18px"><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190908072915037-1635660241.png"></span></p>
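<p><span style="font-size: 18px"><strong>2. Workflow</strong></span></p>
<p><span style="font-size: 18px">The data flow in Scrapy is controlled by the engine and proceeds as follows:</span></p>
<p><span style="font-size: 18px">(1) The spider written by the user submits the requests for the pages to download to the engine, which forwards them to the scheduler;</span></p>
<p><span style="font-size: 18px">(2) The scheduler applies policies such as prioritization and deduplication, takes a request from the queue, and hands it to the engine, which forwards it to the downloader. Downloader middleware sits between the engine and the downloader; it can modify requests (for example, add a proxy, User-Agent, or cookies) and filter responses;</span></p>
<p><span style="font-size: 18px">(3) The downloader downloads the page and sends the resulting response through the downloader middleware to the engine;</span></p>
<p><span style="font-size: 18px">(4) The engine passes the response to the spider, whose parse function produces two kinds of output: items and follow-up requests (URLs). The requests go to the scheduler as in the steps above; the items go to the item pipeline, which performs the final processing of the data.</span></p>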
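<p>The four steps above can be imitated with a small single-threaded loop. This is a toy model under stated assumptions, not Scrapy's engine: <code>FAKE_SITE</code> is a hypothetical in-memory website, requests are plain URL strings, items are dicts, and everything runs synchronously.</p>

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (listing titles, links to follow).
FAKE_SITE = {
    "/PG1": (["flat A", "flat B"], ["/PG2"]),
    "/PG2": (["flat C"], ["/PG1"]),  # links back to page 1
}


def download(url):
    """Downloader: turn a request into a response."""
    return FAKE_SITE[url]


def parse(response):
    """Spider: yield items (dicts) and follow-up requests (URL strings)."""
    titles, links = response
    for title in titles:
        yield {"title": title}
    for link in links:
        yield link


def run_engine(start_urls):
    queue, seen, stored = deque(), set(), []
    for url in start_urls:  # step 1: start requests reach the scheduler
        seen.add(url)
        queue.append(url)
    while queue:
        url = queue.popleft()  # step 2: scheduler hands over the next request
        response = download(url)  # step 3: downloader returns a response
        for result in parse(response):  # step 4: the spider parses it
            if isinstance(result, dict):
                stored.append(result)  # items go to the pipeline
            elif result not in seen:  # new requests go back to the scheduler
                seen.add(result)
                queue.append(result)
    return stored


items = run_engine(["/PG1"])
print(items)
```

<p>Page 2 links back to page 1, but the deduplication set stops the loop from crawling it twice.</p>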
<p><strong>Official documentation</strong></p>
<p> English: https://docs.scrapy.org/en/latest/</p>
<p> http://doc.scrapy.org/en/master/</p>
<p> Chinese: https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html</p>
<p> https://www.osgeo.cn/scrapy/topics/architecture.html</p>
<p> </p>
<h2><span style="font-size: 18pt; font-family: "Microsoft YaHei""><strong>II. Installation and Common Commands</strong></span></h2>
<p><strong><span style="font-size: 18px; font-family: "Microsoft YaHei"">1. Installation</span></strong></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei"">Linux: pip3 install scrapy</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei"">Windows:</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei""> a. pip3 install wheel</span></p>
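<p><span style="font-size: 18px; font-family: "Microsoft YaHei""> b. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei""> c. Shift + right-click in the download folder to open a terminal there, then run pip3 install followed by the name of the downloaded wheel file, in the form pip3 install typed_ast-1.4.0-cp36-cp36m-win32.whl (substitute the Twisted wheel that matches your Python version and architecture)</span></p>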
<p><span style="font-size: 18px; font-family: "Microsoft YaHei""> d. pip3 install pywin32</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei""> e. pip3 install scrapy</span></p>
<p><strong><span style="font-size: 18px; font-family: "Microsoft YaHei"">2. Basic Scrapy commands</span></strong></p>
<p> </p>
<div class="cnblogs_code">
<pre>(1) Create a new project
scrapy startproject ProjectName
(2) Generate a spider
scrapy genspider SpiderName website
(3) Run a spider (crawl); -o writes the output to a file
scrapy crawl SpiderName
scrapy crawl SpiderName -o file.json
scrapy crawl SpiderName -o file.csv
(4) Run contract checks on the spiders
scrapy check
(5) List the names of all spiders in the project
scrapy list
(6) Run a quick benchmark of the crawl speed on this machine
scrapy bench
(7) Run a spider from a standalone file
scrapy runspider zufang_spider.py
(8) Edit a spider file
scrapy edit &lt;spider&gt;
This opens the spider in a terminal editor such as vim, which is not very convenient; editing in an IDE is usually the better choice.
(9) Download a page and print the returned content to the terminal, much like fetching it with requests or urllib
scrapy fetch &lt;url&gt;
(10) Save a page and open it in the browser, to see the content exactly as the crawler receives it
scrapy view &lt;url&gt;
(11) Open the Scrapy shell, an interactive console similar to IPython that is useful for testing
scrapy shell
(12) Fetch a URL, parse it with the spider, and print the result
scrapy parse &lt;url&gt;
(13) Show setting values
scrapy settings
For example:
$ scrapy settings --get BOT_NAME
scrapybot
(14) Show the Scrapy version
scrapy version [-v]
Adding -v also shows the versions of the libraries Scrapy depends on</pre>
</div>
<p> </p>
<p> </p>
<p> </p>
<h2 class="p"><strong><span style="font-size: 18pt">III. A Simple Example</span></strong></h2>
<p class="p"><span style="font-size: 18px; color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""><strong>The example crawls rental listings from Maitian (麦田); the site is http://bj.maitian.cn/zfall/PG1</strong></span></p>
<p class="p"><span style="font-size: 18px; color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""><strong>1. Create the project</strong></span></p>
<div class="cnblogs_code">
<pre>scrapy startproject houseinfo</pre>
</div>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">This generates the following project structure:</span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190906143146876-2013687662.png"></span></p>
<p class="pre"><span style="font-size: 18px; font-family: "Microsoft YaHei"">scrapy.cfg: the project's top-level configuration (the settings that actually govern crawling are in settings.py)</span></p>
<p class="pre"><span style="font-size: 18px; font-family: "Microsoft YaHei"">items.py: defines the data templates used to structure scraped data, similar to a Django Model</span></p>
<p class="pre"><span style="font-size: 18px; font-family: "Microsoft YaHei"">pipelines: data persistence</span></p>
<p class="pre"><span style="font-size: 18px; font-family: "Microsoft YaHei"">settings.py: the settings file</span></p>
<p class="pre"><span style="font-size: 18px; font-family: "Microsoft YaHei"">spiders: the directory that holds the spiders</span></p>
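<p>To give a feel for what a "data template" in items.py looks like, here is a minimal sketch. Scrapy items are usually declared with <code>scrapy.Item</code> and <code>scrapy.Field</code>; recent Scrapy releases also accept dataclass objects as items (check the documentation for your version), and a dataclass is used here so the example runs with the standard library alone. The class name <code>HouseItem</code>, its field names, and the sample values are assumptions for illustration.</p>

```python
from dataclasses import asdict, dataclass


@dataclass
class HouseItem:
    """One rental listing; the field names are illustrative."""

    title: str
    price: str
    square: str
    area: str
    address: str


# Placeholder values, not real data from the site.
item = HouseItem(
    title="sample listing",
    price="5000",
    square="60",
    area="朝阳",
    address="sample address",
)
print(asdict(item))
```

<p>Like a Django Model, the template fixes the set of fields, so a misspelled field name fails immediately instead of silently creating a new key.</p>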
<p class="p"><strong><span style="font-size: 18px; color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei"">2. Create the spider</span></strong></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">cd houseinfo
scrapy genspider maitian maitian.com</span></pre>
</div>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">The spider file now appears in the spiders directory:</span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190906143221359-1528779641.png"></span></p>
<p class="p"><strong><span style="font-size: 18px; font-family: "Microsoft YaHei"">3. Write the spider</span></strong></p>
<p><span style="font-size: 18px; color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""> After step 2, a .py spider file named after the spider is generated in the project's spiders directory. Its source is:</span></p>
<div class="cnblogs_code">
<pre># -*- coding: utf-8 -*-
import scrapy


class MaitianSpider(scrapy.Spider):
    name = 'maitian'  # spider name
    allowed_domains = ['maitian.com']  # domains the spider may crawl; requests to other domains are dropped (often commented out)
    start_urls = ['http://maitian.com/']  # initial URLs; each one is requested and its response is passed to parse()

    # parsing callback
    def parse(self, response):
        pass</pre>
</div>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei"">We can build on this template as needed:</span></p>
<div class="cnblogs_code">
<pre># -*- coding: utf-8 -*-
import scrapy


class MaitianSpider(scrapy.Spider):
    name = 'maitian'
    start_urls = ['http://bj.maitian.cn/zfall/PG100']

    # parsing callback
    def parse(self, response):
        li_list = response.xpath('//div[@class="list_wrap"]/ul/li')
        results = []
        for li in li_list:
            title = li.xpath('./div/h1/a/text()').extract_first().strip()
            price = li.xpath('./div/div/ol/strong/span/text()').extract_first().strip()
            square = li.xpath('./div/p/span/text()').extract_first().replace('㎡', '')  # remove the area unit
            area = li.xpath('./div/p/span/text()').extract_first().strip().split('\xa0')  # split on non-breaking spaces (gives a list)
            address = li.xpath('./div/p/span/text()').extract_first().strip().split('\xa0')[2]

            # named "item" so the built-in dict is not shadowed
            item = {
                "title": title,
                "monthly_rent": price,
                "floor_area": square,
                "district": area,
                "address": address,
            }
            results.append(item)

            print(title, price, square, area, address)
        return results</pre>
</div>
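<p>The string handling in the spider above is easier to follow on a concrete value. The sample text below is hypothetical (the real page layout may differ); it only shows what <code>strip()</code>, <code>split('\xa0')</code>, and <code>replace('㎡', '')</code> each do to a string whose parts are separated by non-breaking spaces.</p>

```python
# Hypothetical text of one listing's span; the real site's layout may differ.
raw = " 2室1厅\xa060㎡\xa0朝阳 "

# strip() removes the surrounding whitespace, split('\xa0') cuts the text
# at every non-breaking space and returns a list of the parts.
parts = raw.strip().split("\xa0")

# replace() only deletes the unit character; the rest of the text stays.
no_unit = raw.replace("㎡", "")

print(parts)
print(repr(no_unit))
```

<p>Note that <code>replace</code> alone does not isolate the floor area: to get a single field, split first and then pick the part by index.</p>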
<p><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>Notes:</strong></span></p>
<ul>
<li><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>XPath is one of the parsing mechanisms built into Scrapy.</strong></span></li>
<li><strong style="font-family: "Microsoft YaHei"; font-size: 18px">The xpath() function returns a list whose elements are Selector objects. The matched content is wrapped inside those Selector objects, so call extract() to take it out of the Selector.</strong></li>
<li><span style="font-size: 18px; font-family: "Microsoft YaHei"; color: rgba(255, 0, 0, 1)"><strong>If the XPath is guaranteed to match a single element (or only the first match is needed), use extract_first(), which returns one string or None; otherwise use extract(), which returns a list of all matches.</strong></span></li>
</ul>
<p class="p"> </p>
<p class="p"><span style="color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""><strong>Both take the content out of the selector list, but they return different types: extract_first() returns the first match as a string, while extract() returns a list of strings, which must be indexed before .strip() can be called.</strong></span></p>
<p class="p"><span style="color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""><strong>title = li.xpath('./div/h1/a/text()')<span style="color: rgba(255, 0, 0, 1)">.extract_first()</span>.strip()</strong></span></p>
<p class="p"><span style="color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""><strong>title = li.xpath('./div/h1/a/text()')<span style="color: rgba(255, 0, 0, 1)">.extract()[0]</span>.strip()</strong></span></p>
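<p>One practical caveat: extract_first() returns <code>None</code> when the XPath matches nothing, so chaining <code>.strip()</code> directly raises an AttributeError on listings that lack a field. A small guard avoids that; the helper name <code>clean_text</code> is hypothetical, not part of Scrapy.</p>

```python
def clean_text(value, default=""):
    """Strip a string returned by extract_first(), tolerating a missing match."""
    if value is None:
        return default
    return value.strip()


print(clean_text("  Sunny two-bedroom  "))
print(repr(clean_text(None)))
```

<p>In the spider this would read <code>title = clean_text(li.xpath('./div/h1/a/text()').extract_first())</code>. extract_first() also accepts a <code>default</code> argument that serves the same purpose.</p>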
<p class="p"> </p>
<p class="p"><strong><span style="color: rgba(0, 0, 0, 1); font-size: 18px; font-family: "Microsoft YaHei"">4. Edit the relevant settings in settings.py:</span></strong></p>
<div class="cnblogs_code">
<pre># Present a browser User-Agent as the identity of the requests
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Ignore (do not obey) the robots.txt protocol
ROBOTSTXT_OBEY = False</pre>
</div>
<p class="p"><strong><span style="color: rgba(0, 0, 0, 1); font-size: 18px; font-family: "Microsoft YaHei"">5. Run the spider: scrapy crawl maitian</span></strong></p>
<p><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190906173659571-1971014414.png"></p>
<p> </p>
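<p><span style="font-size: 18px; color: rgba(0, 0, 0, 1)"><strong><span style="color: rgba(255, 0, 0, 1)">Crawling the whole site</span>, that is, the data on every page.</strong> This example has 100 pages in total; look at what the page URLs have in common and build a general URL pattern.</span></p>
<p><strong><span style="font-size: 18px; color: rgba(0, 0, 0, 1)">Approach 1: build the general URL with a placeholder</span></strong></p>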
<div class="cnblogs_code">
<pre>import scrapy


class MaitianSpider(scrapy.Spider):
    name = 'maitian'
    # note the syntax: one URL per page number; range(1, 4) covers pages 1 to 3
    start_urls = ['http://bj.maitian.cn/zfall/PG{}'.format(page) for page in range(1, 4)]

    # parsing callback
    def parse(self, response):
        li_list = response.xpath('//div[@class="list_wrap"]/ul/li')
        results = []
        for li in li_list:
            title = li.xpath('./div/h1/a/text()').extract_first().strip()
            price = li.xpath('./div/div/ol/strong/span/text()').extract_first().strip()
            square = li.xpath('./div/p/span/text()').extract_first().replace('㎡', '')
            # the district can also be extracted with a regular expression; .re() returns a list of matches
            area = li.xpath('./div/p/span/text()').re(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城')
            address = li.xpath('./div/p/span/text()').extract_first().strip().split('\xa0')

            item = {
                "title": title,
                "monthly_rent": price,
                "floor_area": square,
                "district": area,
                "address": address,
            }
            results.append(item)

        return results</pre>
</div>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)">If a single expression cannot cover every case, write the expressions separately and then add the lists together to merge all the URLs into one list, for example:</span></p>
<div class="cnblogs_code">
<pre>start_urls = [<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://www.guokr.com/ask/hottest/?page={}</span><span style="color: rgba(128, 0, 0, 1)">'</span>.format(n) <span style="color: rgba(0, 0, 255, 1)">for</span> n <span style="color: rgba(0, 0, 255, 1)">in</span> range(1, 8)]<span style="color: rgba(255, 0, 0, 1)"><strong> +</strong></span><span style="color: rgba(0, 0, 0, 1)"> [
</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://www.guokr.com/ask/highlight/?page={}</span><span style="color: rgba(128, 0, 0, 1)">'</span>.format(m) <span style="color: rgba(0, 0, 255, 1)">for</span> m <span style="color: rgba(0, 0, 255, 1)">in</span> range(1, 101)]</pre>
</div>
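<p>A quick check of what these expressions produce, as a sketch that runs outside Scrapy. Because <code>range()</code> excludes its end point, covering all 100 pages of the listing site needs <code>range(1, 101)</code>. The variable names <code>base</code>, <code>all_pages</code>, <code>hottest</code>, and <code>highlight</code> are introduced here only for illustration.</p>

```python
base = "http://bj.maitian.cn/zfall/PG{}"

# range() excludes its end point, so pages 1..100 need range(1, 101).
all_pages = [base.format(page) for page in range(1, 101)]

# Two patterns that one expression cannot cover are built separately
# and then joined with list concatenation.
hottest = ["http://www.guokr.com/ask/hottest/?page={}".format(n) for n in range(1, 8)]
highlight = ["http://www.guokr.com/ask/highlight/?page={}".format(m) for m in range(1, 101)]
start_urls = hottest + highlight

print(len(all_pages), all_pages[0], all_pages[-1])
print(len(start_urls))
```

<p>The first list holds 100 URLs ending in PG1 through PG100; the concatenated list holds 7 + 100 = 107 URLs.</p>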
<p><strong><span style="font-family: "Microsoft YaHei"; font-size: 18px"> Approach 2: override the start_requests method to produce all the start requests (there is then no need to define <span style="font-family: "Microsoft YaHei""><code>start_urls</code></span>)</span></strong></p>
<div class="cnblogs_code">
<pre>import scrapy


class MaitianSpider(scrapy.Spider):
    name = 'maitian'

    def start_requests(self):
        pages = []
        # range(90, 100) covers pages 90 to 99
        for page in range(90, 100):
            url = 'http://bj.maitian.cn/zfall/PG{}'.format(page)
            request = scrapy.Request(url)
            pages.append(request)
        return pages

    # parsing callback
    def parse(self, response):
        li_list = response.xpath('//div[@class="list_wrap"]/ul/li')

        results = []
        for li in li_list:
            # no trailing commas here: a trailing comma would turn each value into a tuple
            title = li.xpath('./div/h1/a/text()').extract_first().strip()
            price = li.xpath('./div/div/ol/strong/span/text()').extract_first().strip()
            square = li.xpath('./div/p/span/text()').extract_first().replace('㎡', '')
            area = li.xpath('./div/p/span/text()').re(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城')
            address = li.xpath('./div/p/span/text()').extract_first().strip().split('\xa0')

            item = {
                "title": title,
                "monthly_rent": price,
                "floor_area": square,
                "district": area,
                "address": address,
            }
            results.append(item)

        return results</pre>
</div>
<p> </p>
<h2><span style="font-size: 18pt; font-family: 'Microsoft YaHei'"><strong>IV. Persistent Data Storage</strong></span></h2>
<ul>
<li><span style="font-family: 'Microsoft YaHei'">Persistence via terminal commands</span></li>
<li><span style="font-family: 'Microsoft YaHei'">Persistence via item pipelines</span></li>
</ul>
<p><span style="font-size: 18px; color: rgba(255, 0, 0, 1)"><strong><span style="font-family: 'Microsoft YaHei'">For either kind of persistent storage, the parse method must hand its data back to Scrapy, either by returning it (the value after return) or by yielding it.</span></strong></span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px"><strong>1. </strong><strong>Persistence via terminal commands</strong></span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">Pass an output file on the command line and Scrapy writes the scraped data to it in the format implied by the file extension. The built-in feed exporters cover JSON, JSON Lines, CSV, XML, pickle and marshal; plain .txt is not one of them, so a .txt target will not work.</span></p>
<ul>
<li><span style="font-family: 'Microsoft YaHei'"> scrapy crawl spider_name -o xxx.json</span></li>
<li><span style="font-family: 'Microsoft YaHei'"> scrapy crawl spider_name -o xxx.xml</span></li>
<li><span style="font-family: 'Microsoft YaHei'"> scrapy crawl spider_name -o xxx.csv</span></li>
</ul>
<p><span style="font-family: 'Microsoft YaHei'"> <span style="font-size: 18px">Taking the Maitian spider as an example: the spider code stays unchanged, and the values it returns are written to maitian.csv. If the file does not exist it is created; if it already exists, the new rows are appended to it.</span></span></p>
<div class="cnblogs_code">
<pre><span style="font-family: "Microsoft YaHei"; color: rgba(0, 0, 0, 1)">scrapy crawl maitian -o maitian.csv</span></pre>
</div>
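<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">The same export can also be configured in settings.py instead of on the command line. The fragment below is a minimal sketch, assuming the installed Scrapy version is recent enough to support the FEEDS setting. The file name maitian.csv and the utf-8 encoding are carried over from this tutorial, not required by Scrapy.</span></p>
<div class="cnblogs_code">
<pre># settings.py (fragment)
# Assumption: the installed Scrapy version supports the FEEDS setting.
FEEDS = {
    "maitian.csv": {
        "format": "csv",      # one of the built-in feed export formats
        "encoding": "utf-8",  # write the file as UTF-8
    },
}</pre>
</div>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">With this in place, running scrapy crawl maitian should write the feed without the -o option.</span></p>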
<p><span style="font-family: 'Microsoft YaHei'; color: rgba(0, 0, 0, 1); font-size: 18px">The generated file then appears in the project directory:</span></p>
<p><span style="font-family: "Microsoft YaHei"; color: rgba(0, 0, 0, 1); font-size: 18px"><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190906175313364-1383145937.png"></span></p>
<p><span style="font-family: 'Microsoft YaHei'; color: rgba(0, 0, 0, 1); font-size: 18px">Its contents look like this:</span></p>
<p><span style="font-family: "Microsoft YaHei"; color: rgba(0, 0, 0, 1); font-size: 18px"><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190906175417360-1178010237.png"></span></p>
<p> </p>
<p> </p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)"><strong>2. </strong><strong>Persistence via item pipelines</strong></span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)">Scrapy ships with an efficient, convenient persistence mechanism that we can use directly. To use it, start by getting to know these two files:</span></p>
<ul>
<li><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)"><strong>items.py</strong>: <strong>the data-structure template file. It defines the fields of the data.</strong></span></li>
<li><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)"><strong>pipelines.py</strong>: <strong>the pipeline file. It receives the data (items) and persists it.</strong></span></li>
</ul>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> </span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)"><strong>Persistence workflow:</strong></span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)">① <strong>After the spider file has fetched and parsed the data, </strong><strong><span style="text-decoration: underline">wrap the data in an item object</span></strong><strong>.</strong></span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)">② <strong>Use the yield keyword to </strong><strong><span style="text-decoration: underline">submit the item object to the pipeline</span></strong><strong> for persistence.</strong></span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)">③ <strong>In the pipeline file, the process_item method receives each item object submitted by the spider. Write the storage code there so that the data held in the item is persisted (the I/O happens inside process_item).</strong></span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)">④ <strong>Enable the pipeline in the settings.py configuration file.</strong></span></p>
<p><span style="font-family: "Microsoft YaHei"; color: rgba(0, 0, 0, 1)"> </span></p>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px; color: rgba(0, 0, 0, 1)"><strong>2.1 </strong><strong>Saving to a local file</strong></span></p>
<p><span style="color: rgba(0, 0, 0, 1)"><strong><span style="font-family: 微软雅黑">Spider file</span></strong><strong>: maitian.py</strong></span></p>
<div class="cnblogs_code">
<pre>import scrapy
from houseinfo.items import HouseinfoItem  # import the Item class defined in items.py


class MaitianSpider(scrapy.Spider):
    name = 'maitian'
    start_urls = ['http://bj.maitian.cn/zfall/PG100']

    # Parse callback
    def parse(self, response):
        li_list = response.xpath('//div[@class="list_wrap"]/ul/li')

        for li in li_list:
            item = HouseinfoItem(
                title=li.xpath('./div/h1/a/text()').extract_first().strip(),
                price=li.xpath('./div/div/ol/strong/span/text()').extract_first().strip(),
                square=li.xpath('./div/p/span/text()').extract_first().replace('㎡', ''),
                area=li.xpath('./div/p/span/text()').extract_first().strip().split('\xa0')[0],
                adress=li.xpath('./div/p/span/text()').extract_first().strip().split('\xa0')[2],
            )

            yield item  # hand the item to the pipeline, which decides how to store it</pre>
</div>
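<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">In the spider, area and adress are both cut out of the same text by splitting it on non-breaking spaces (\xa0) and picking indexes 0 and 2. The snippet below illustrates that step in isolation. The sample string is made up for the illustration (the real page text may differ), and split_location is a hypothetical helper name.</span></p>
<div class="cnblogs_code">
<pre>def split_location(raw):
    """Split a location string on non-breaking spaces, as the spider does."""
    parts = raw.strip().split('\xa0')
    return parts[0], parts[2]


# Hypothetical sample text; the real page content may differ.
sample = '  Chaoyang\xa0Wangjing\xa0Example Estate  '
area, adress = split_location(sample)
print(area)    # Chaoyang
print(adress)  # Example Estate</pre>
</div>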
<p><strong><span style="font-family: 微软雅黑">Items file: </span><span style="font-family: 'Times New Roman'">items.py</span></strong></p>
<div class="cnblogs_code">
<pre>import scrapy


class HouseinfoItem(scrapy.Item):
    title = scrapy.Field()  # stores the title; a Field can hold data of any type
    price = scrapy.Field()
    square = scrapy.Field()
    area = scrapy.Field()
    adress = scrapy.Field()</pre>
</div>
<p><strong><span style="font-family: 微软雅黑">Pipeline file: </span>pipelines.py</strong></p>
<div class="cnblogs_code">
<pre>class HouseinfoPipeline(object):
    def __init__(self):
        self.file = None

    # Runs once, when the spider starts
    def open_spider(self, spider):
        self.file = open('maitian.csv', 'a', encoding='utf-8')  # append mode
        # Header row: title, monthly rent, size, district, address
        self.file.write(",".join(["标题", "月租金", "面积", "区域", "地址", "\n"]))
        print("Spider started")

    # process_item runs once per item, so the file is opened and closed in the
    # two methods above and below, each of which runs only once.
    def process_item(self, item, spider):
        content = [item["title"], item["price"], item["square"], item["area"], item["adress"], "\n"]
        self.file.write(",".join(content))
        return item

    # Runs once, when the spider finishes
    def close_spider(self, spider):
        self.file.close()
        print("Spider finished")</pre>
</div>
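<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">Joining fields with "," by hand produces a broken row as soon as a value itself contains a comma. As a hedged alternative, the sketch below lets Python's standard csv module do the quoting. The class name CsvHouseinfoPipeline and the path constructor argument are illustrative assumptions, not part of the original project.</span></p>
<div class="cnblogs_code">
<pre>import csv


class CsvHouseinfoPipeline:
    """Sketch of a pipeline that lets the csv module handle quoting."""

    fields = ['title', 'price', 'square', 'area', 'adress']

    def __init__(self, path='maitian.csv'):
        self.path = path  # hypothetical argument; defaults to the tutorial's file
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # newline='' is what the csv module expects for files it writes
        self.file = open(self.path, 'a', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(self.fields)

    def process_item(self, item, spider):
        self.writer.writerow([item[name] for name in self.fields])
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()</pre>
</div>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">To try it in the project, the class would be registered in ITEM_PIPELINES in the same way as HouseinfoPipeline.</span></p>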
<p><strong><span style="font-family: 微软雅黑">Configuration file: </span>settings.py</strong></p>
<div class="cnblogs_code">
<pre># Spoof the identity of the request sender (User-Agent)
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Do not obey the site's robots.txt rules
ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
    'houseinfo.pipelines.HouseinfoPipeline': 300,  # 300 is the priority; lower values run first
}</pre>
</div>
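<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">To make the ordering rule concrete, here is a plain-Python simulation (not Scrapy's internal code) of items passing through the enabled pipelines in ascending order of their numbers. The two pipeline classes and the run_pipelines helper are hypothetical names used only for this illustration.</span></p>
<div class="cnblogs_code">
<pre>class StripTitlePipeline:
    def process_item(self, item, spider):
        item['title'] = item['title'].strip()
        return item


class AddLengthPipeline:
    def process_item(self, item, spider):
        item['title_length'] = len(item['title'])
        return item


# Same shape as ITEM_PIPELINES, but keyed by class instead of dotted path
pipelines = {AddLengthPipeline: 400, StripTitlePipeline: 300}


def run_pipelines(item, pipelines, spider=None):
    # Lower numbers run first; each stage receives the previous stage's return value
    for cls in sorted(pipelines, key=pipelines.get):
        item = cls().process_item(item, spider)
    return item


print(run_pipelines({'title': '  two rooms  '}, pipelines))
# {'title': 'two rooms', 'title_length': 9}</pre>
</div>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">Because the stripping stage has the lower number, the length is computed on the cleaned title. Swapping the two numbers would count the padding spaces as well.</span></p>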
<p><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190906193711619-309442127.png"></p>
<p> </p>
<p> <img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190906193741546-1277509436.png"></p>
<h2 id="blogTitle0"> </h2>
<h2><span style="font-size: 18pt; font-family: 'Microsoft YaHei'"><strong>V. Crawling Multi-Level Pages</strong></span></h2>
<p><span style="font-size: 18px; font-family: 'Microsoft YaHei'; color: rgba(0, 0, 0, 1)">Crawling multi-level pages raises two questions:</span></p>
<p><span style="font-size: 18px; font-family: 'Microsoft YaHei'; color: rgba(0, 0, 0, 1)"><strong>Question 1: how do we send a request for a page on the next level?</strong></span></p>
<p><span style="font-size: 18px; font-family: 'Microsoft YaHei'; color: rgba(0, 0, 0, 1)">Answer: at the end of each parse function, use scrapy.Request to manually issue a request for the next-level page.</span></p>
<div class="cnblogs_code">
<pre># Extract the second-level page URLs first, then request each of them. Deeper levels work the same way.
def parse(self, response):
    next_urls = response.xpath('//div/h2/a/@href').extract()  # second-level page URLs (a list)
    for next_url in next_urls:
        # The Request must be yielded; pass the callback itself, without parentheses
        yield scrapy.Request(url=response.urljoin(next_url), callback=self.next_parse)</pre>
</div>
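<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">Extracted href values are often relative, while a request needs an absolute URL. Scrapy's response.urljoin resolves a link against the page URL. For a page without a base tag, the standard library's urllib.parse.urljoin performs the same kind of resolution, which the sketch below uses so it can run without Scrapy. The page URL and the href values are made-up examples.</span></p>
<div class="cnblogs_code">
<pre>from urllib.parse import urljoin

page_url = 'http://example.com/list/page1'  # hypothetical listing page
hrefs = ['/detail/1', 'detail/2', 'http://example.com/detail/3']

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
# http://example.com/detail/1
# http://example.com/list/detail/2
# http://example.com/detail/3</pre>
</div>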
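<p><span style="font-size: 18px; font-family: 'Microsoft YaHei'; color: rgba(0, 0, 0, 1)"><strong>Question 2: the data to be parsed is spread over more than one page. How is it passed along and combined?</strong></span></p>
<p><span style="font-size: 18px; font-family: 'Microsoft YaHei'; color: rgba(0, 0, 0, 1)">Answer: this calls for passing data along with the request. When sending the request for the next-level page, <strong><span style="color: rgba(255, 0, 0, 1)">pass the data through the meta argument; the meta dict is then available on the response handed to the callback. The next-level parse function gets the item from the response (read response.meta to obtain the meta dict, then take the item out of it).</span></strong></span></p>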
<div class="cnblogs_code">
<pre># Pass data along with the Request via meta; the dict is available in the callback as response.meta
def parse(self, response):
    item = Item()  # instantiate the item object (Item stands for your own Item class)
    item["field1"] = response.xpath('expression1').extract()  # a list with a single element
    item["field2"] = response.xpath('expression2').extract()  # a list
    next_url = response.xpath('expression3').extract_first()  # URL of the second-level page
    # meta carries the item to the callback that handles the second-level page
    yield scrapy.Request(url=response.urljoin(next_url), callback=self.next_parse, meta={'item': item})


def next_parse(self, response):
    # response.meta returns the meta dict that was passed in; read the item from it
    item = response.meta['item']
    item['field'] = response.xpath('expression').extract_first()
    yield item  # hand the finished item to the pipeline</pre>
</div>
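<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">One pitfall with meta is worth noting. If parse loops over many links and puts the same item object into every request, all callbacks share that one object, and later loop iterations overwrite earlier values. The plain-Python simulation below involves no Scrapy at all: ordinary dictionaries stand in for the item and for each request's meta. It shows the effect and one possible fix using copy.deepcopy.</span></p>
<div class="cnblogs_code">
<pre>import copy

item = {}
shared, copied = [], []
for name in ['first', 'second']:
    item['field1'] = name
    shared.append({'item': item})                 # same object every time
    copied.append({'item': copy.deepcopy(item)})  # independent snapshot

print([m['item']['field1'] for m in shared])  # ['second', 'second']
print([m['item']['field1'] for m in copied])  # ['first', 'second']</pre>
</div>
<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">Creating a fresh item inside the loop avoids the problem just as well; copying is mainly useful when one partially filled item fans out to several requests.</span></p>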
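<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px"><strong><span style="color: rgba(0, 0, 0, 1)">Case 1</span>: Maitian, requesting every page number. Putting the URL of every page into the spider's start URL list (start_urls) is not recommended. Here we issue the requests manually with Request instead.</strong></span></p>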
<div class="cnblogs_code">
<pre># -*- coding: utf-8 -*-
import scrapy
from houseinfo.items import HouseinfoItem  # import the Item class


class MaitianSpider(scrapy.Spider):
    name = 'maitian'
    start_urls = ['http://bj.maitian.cn/zfall/PG1']

    # Crawl several pages
    page = 1
    url = 'http://bj.maitian.cn/zfall/PG%d'

    # Parse callback
    def parse(self, response):
        li_list = response.xpath('//div[@class="list_wrap"]/ul/li')
        for li in li_list:
            item = HouseinfoItem(
                title=li.xpath('./div/h1/a/text()').extract_first().strip(),
                price=li.xpath('./div/div/ol/strong/span/text()').extract_first().strip(),
                square=li.xpath('./div/p/span/text()').extract_first().replace('㎡', ''),
                # the district can also be pulled out with a regular expression
                area=li.xpath('./div/p/span/text()').re_first(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城'),
                adress=li.xpath('./div/p/span/text()').extract_first().strip().split('\xa0')[2],
            )
            # Alternative (not used here): build every page URL up front with
            # ['http://bj.maitian.cn/zfall/PG{}'.format(page) for page in range(1, 4)]
            yield item  # hand the item to the pipeline, which decides how to store it

        if self.page &lt; 4:
            self.page += 1
            new_url = self.url % self.page  # % substitutes the page number into the URL template
            yield scrapy.Request(url=new_url, callback=self.parse)  # request the next page manually; the yield is required</pre>
</div>
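<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px">The paging logic can be checked on its own, outside Scrapy. The sketch below isolates it in a hypothetical helper, next_page_url, which returns the next listing URL, or None once the last page has been reached (page 4, matching the limit used in the spider).</span></p>
<div class="cnblogs_code">
<pre>URL_TEMPLATE = 'http://bj.maitian.cn/zfall/PG%d'


def next_page_url(current_page, last_page=4):
    """Return the URL of the page after current_page, or None at the end."""
    # Only pages 1 .. last_page - 1 have a following page
    if current_page not in range(1, last_page):
        return None
    return URL_TEMPLATE % (current_page + 1)


print(next_page_url(1))  # http://bj.maitian.cn/zfall/PG2
print(next_page_url(4))  # None</pre>
</div>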
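<p><span style="font-family: 'Microsoft YaHei'; font-size: 18px"><strong>Case 2: a nice feature of this example is that the parse function issues both kinds of request: one for the next page, which calls back into parse itself, and one for the detail page, which calls back into parse_author.</strong></span></p>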
<div class="cnblogs_code">
<pre>import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes_2_3'
    start_urls = [
        'http://quotes.toscrape.com',
    ]
    allowed_domains = [
        'toscrape.com',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'quote': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
            # Follow this quote's author link to the detail page
            author_page = quote.css('small.author+a::attr(href)').extract_first()
            author_full_url = response.urljoin(author_page)
            yield scrapy.Request(author_full_url, callback=self.parse_author)  # request the detail page; parse_author handles it

        next_page = response.css('li.next a::attr("href")').extract_first()  # locate the next-page link with a CSS selector
        if next_page is not None:
            next_full_url = response.urljoin(next_page)
            yield scrapy.Request(next_full_url, callback=self.parse)  # request the next page; parse handles it again

    def parse_author(self, response):
        yield {
            'author': response.css('.author-title::text').extract_first(),
            'author_born_date': response.css('.author-born-date::text').extract_first(),
            'author_born_location': response.css('.author-born-location::text').extract_first(),
            'author_description': response.css('.author-description::text').extract_first(),
        }</pre>
</div>
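<p><span style="font-size: 18px">The spider above calls response.urljoin on every extracted href before requesting it, because the links on the page are relative. For plain URL resolution this behaves much like urllib.parse.urljoin with the page URL as the base (Scrapy's text responses additionally take an HTML base element into account). The following is a minimal standard-library sketch of that step; the page URL, the href values, and the helper name resolve are assumptions made for illustration, not output from a real crawl.</span></p>
<div class="cnblogs_code">
<pre>
```python
from urllib.parse import urljoin

# Assumed page URL; inside a spider this would be response.url.
page_url = 'http://quotes.toscrape.com/page/1/'


def resolve(base, href):
    """Return an absolute URL, or None when the link is missing."""
    if href is None:
        return None
    return urljoin(base, href)


# Hypothetical href values of the kind the CSS selectors return.
author_url = resolve(page_url, '/author/Albert-Einstein')
next_url = resolve(page_url, '/page/2/')
last_page = resolve(page_url, None)  # the final page has no "next" link

print(author_url)
print(next_url)
print(last_page)
```
</pre>
</div>
<p><span style="font-size: 18px">The None guard plays the same role as the "if next_page is not None" check in the spider: when no next-page link is found, no further request is issued and the crawl ends.</span></p>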
<p><strong><span style="font-family: "Microsoft YaHei"; color: rgba(0, 0, 0, 1); font-size: 18px">Case 3: </span></strong><span style="font-size: 18px"><strong><span style="font-family: 微软雅黑">Crawl the </span>www.id97.com<span style="font-family: 微软雅黑"> movie site, scraping the movie title, genre, and rating from the first-level page and the release date, director, and runtime from the second-level page. (Multi-level pages + passing parameters)</span></strong></span></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)"># -*- coding: utf-8 -*-</span>
<span style="color: rgba(0, 0, 255, 1)">import</span> scrapy
<span style="color: rgba(0, 0, 255, 1)">from</span> moviePro.items <span style="color: rgba(0, 0, 255, 1)">import</span> MovieproItem


<span style="color: rgba(0, 0, 255, 1)">class</span> MovieSpider(scrapy.Spider):
    name = <span style="color: rgba(128, 0, 0, 1)">'movie'</span>
    allowed_domains = [<span style="color: rgba(128, 0, 0, 1)">'www.id97.com'</span>]
    start_urls = [<span style="color: rgba(128, 0, 0, 1)">'http://www.id97.com/'</span>]

    <span style="color: rgba(0, 0, 255, 1)">def</span> parse(self, response):
        div_list = response.xpath(<span style="color: rgba(128, 0, 0, 1)">'//div[@class="col-xs-1-5 movie-item"]'</span>)
        <span style="color: rgba(0, 0, 255, 1)">for</span> div <span style="color: rgba(0, 0, 255, 1)">in</span> div_list:
            <span style="color: rgba(0, 128, 0, 1)"># Create a fresh item for every movie so that requests do not share state</span>
            item = MovieproItem()
            item[<span style="color: rgba(128, 0, 0, 1)">'name'</span>] = div.xpath(<span style="color: rgba(128, 0, 0, 1)">'.//h1/a/text()'</span>).extract_first()
            item[<span style="color: rgba(128, 0, 0, 1)">'score'</span>] = div.xpath(<span style="color: rgba(128, 0, 0, 1)">'.//h1/em/text()'</span>).extract_first()
            item[<span style="color: rgba(128, 0, 0, 1)">'kind'</span>] = div.xpath(<span style="color: rgba(128, 0, 0, 1)">'.//div[@class="otherinfo"]'</span>).xpath(<span style="color: rgba(128, 0, 0, 1)">'string(.)'</span>).extract_first()
            item[<span style="color: rgba(128, 0, 0, 1)">'detail_url'</span>] = div.xpath(<span style="color: rgba(128, 0, 0, 1)">'./div/a/@href'</span>).extract_first()
            <span style="color: rgba(0, 128, 0, 1)"># meta passes data along with the Request; the callback reads the same dict from response.meta.</span>
            <span style="color: rgba(0, 128, 0, 1)"># response.urljoin keeps the request valid even when the href is relative.</span>
            <span style="color: rgba(0, 0, 255, 1)">yield</span> scrapy.Request(url=response.urljoin(item[<span style="color: rgba(128, 0, 0, 1)">'detail_url'</span>]), callback=self.parse_detail, meta={<span style="color: rgba(128, 0, 0, 1)">'item'</span>: item})

    <span style="color: rgba(0, 0, 255, 1)">def</span> parse_detail(self, response):
        <span style="color: rgba(0, 128, 0, 1)"># response.meta returns the meta dict sent with the request; take the item back out of it</span>
        item = response.meta[<span style="color: rgba(128, 0, 0, 1)">'item'</span>]
        item[<span style="color: rgba(128, 0, 0, 1)">'actor'</span>] = response.xpath(<span style="color: rgba(128, 0, 0, 1)">'//div[@class="row"]//table/tr/a/text()'</span>).extract_first()
        item[<span style="color: rgba(128, 0, 0, 1)">'time'</span>] = response.xpath(<span style="color: rgba(128, 0, 0, 1)">'//div[@class="row"]//table/tr/td/text()'</span>).extract_first()
        <span style="color: rgba(0, 128, 0, 1)"># NOTE: same XPath as 'time' in the original, so both fields get the same value; point it at the runtime cell in real use</span>
        item[<span style="color: rgba(128, 0, 0, 1)">'long'</span>] = response.xpath(<span style="color: rgba(128, 0, 0, 1)">'//div[@class="row"]//table/tr/td/text()'</span>).extract_first()
        <span style="color: rgba(0, 0, 255, 1)">yield</span> item  <span style="color: rgba(0, 128, 0, 1)"># Hand the finished item to the pipeline</span></pre>
</div>
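<p><span style="font-size: 18px">The key step in Case 3 is that the half-filled item travels inside the request's meta dict and comes back out in the second callback. The toy model below imitates that hand-off in plain Python so the data flow can be followed without running a crawl. It is only a sketch and not how Scrapy's engine is implemented: the request and response dicts, the queue, and all sample data are hypothetical stand-ins.</span></p>
<div class="cnblogs_code">
<pre>
```python
from collections import deque


def parse(rows):
    """First-level callback: build one item per row and request its detail page."""
    for row in rows:
        item = {'name': row['name'], 'score': row['score']}  # a fresh item per movie
        yield {'url': row['detail_url'], 'callback': parse_detail, 'meta': {'item': item}}


def parse_detail(response):
    """Second-level callback: recover the item from meta and complete it."""
    item = response['meta']['item']
    item['actor'] = response['body']['actor']
    yield item


# Hypothetical data standing in for the list page and the detail pages.
list_page = [
    {'name': 'Movie A', 'score': '8.1', 'detail_url': '/movie/1'},
    {'name': 'Movie B', 'score': '7.4', 'detail_url': '/movie/2'},
]
detail_pages = {'/movie/1': {'actor': 'Actor X'}, '/movie/2': {'actor': 'Actor Y'}}

queue = deque(parse(list_page))  # plays the role of the scheduler queue
items = []
while queue:
    request = queue.popleft()
    # "Download" the page, then pass the request's meta on to the callback.
    response = {'body': detail_pages[request['url']], 'meta': request['meta']}
    items.extend(request['callback'](response))

print(items)
```
</pre>
</div>
<p><span style="font-size: 18px">Because meta holds a reference to the item rather than a copy, creating the item inside the loop is what keeps the movies from overwriting each other. Recent Scrapy releases also accept a cb_kwargs argument on Request, which passes such values to the callback as keyword arguments; meta remains the approach used in this article.</span></p>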
<p><strong><span style="font-family: "Microsoft YaHei"; color: rgba(0, 0, 0, 1); font-size: 18px">Case 4: slightly more complex; the following links may help in understanding it: </span></strong><span style="font-family: "Microsoft YaHei"; color: rgba(0, 0, 0, 1); font-size: 12px">https://github.com/makcyun/web_scraping_with_python/tree/master/, https://www.cnblogs.com/sanduzxcvbnm/p/10277414.html</span></p>
<div class="cnblogs_code">
<div id="cnblogs_code_open_a9c8abae-8d39-429e-a8fa-f766db1120c2" class="cnblogs_code_hide">
<pre><span style="color: rgba(0, 128, 0, 1)">#!/usr/bin/env python</span>

<span style="color: rgba(128, 0, 0, 1)">"""
Crawl every app under all categories of the Wandoujia site.
The crawl covers two parts:
Part 1: data metrics
  1. Crawl the first page
  2. Crawl the ajax pages, starting from page 2
Part 2: icons
  Use a class-based approach to download them from the first page and the ajax pages
There are two ways to loop over the pages:
a for loop over a fixed number of pages, or crawling onward with no page limit until nothing comes back
  1. for loop
"""</span>

<span style="color: rgba(0, 0, 255, 1)">import</span> scrapy
<span style="color: rgba(0, 0, 255, 1)">from</span> wandoujia.items <span style="color: rgba(0, 0, 255, 1)">import</span> WandoujiaItem

<span style="color: rgba(0, 0, 255, 1)">import</span> requests
<span style="color: rgba(0, 0, 255, 1)">from</span> pyquery <span style="color: rgba(0, 0, 255, 1)">import</span> PyQuery as pq
<span style="color: rgba(0, 0, 255, 1)">import</span> re
<span style="color: rgba(0, 0, 255, 1)">import</span> csv
<span style="color: rgba(0, 0, 255, 1)">import</span> pandas as pd
<span style="color: rgba(0, 0, 255, 1)">import</span> numpy as np
<span style="color: rgba(0, 0, 255, 1)">import</span> time
<span style="color: rgba(0, 0, 255, 1)">import</span> pymongo
<span style="color: rgba(0, 0, 255, 1)">import</span> json
<span style="color: rgba(0, 0, 255, 1)">import</span> os
<span style="color: rgba(0, 0, 255, 1)">from</span> urllib.parse <span style="color: rgba(0, 0, 255, 1)">import</span> urlencode
<span style="color: rgba(0, 0, 255, 1)">import</span> random
<span style="color: rgba(0, 0, 255, 1)">import</span> logging

<span style="color: rgba(0, 128, 0, 1)"># Write the log to wandoujia.log, overwriting the file on every run</span>
logging.basicConfig(filename=<span style="color: rgba(128, 0, 0, 1)">'wandoujia.log'</span>, filemode=<span style="color: rgba(128, 0, 0, 1)">'w'</span>, level=logging.DEBUG,
                    format=<span style="color: rgba(128, 0, 0, 1)">'%(asctime)s %(message)s'</span>, datefmt=<span style="color: rgba(128, 0, 0, 1)">'%Y/%m/%d %I:%M:%S %p'</span>)
<span style="color: rgba(0, 128, 0, 1)"># https://juejin.im/post/5aee70105188256712786b7f</span>
logging.warning(<span style="color: rgba(128, 0, 0, 1)">"warn message"</span>)
logging.error(<span style="color: rgba(128, 0, 0, 1)">"error message"</span><span style="color: rgba(0, 0, 0, 1)">)
</span>

<span style="color: rgba(0, 0, 255, 1)">class</span> WandouSpider(scrapy.Spider):
    name = <span style="color: rgba(128, 0, 0, 1)">'wandou'</span>
    allowed_domains = [<span style="color: rgba(128, 0, 0, 1)">'www.wandoujia.com'</span>]
    start_urls = [<span style="color: rgba(128, 0, 0, 1)">'http://www.wandoujia.com/'</span>]

    <span style="color: rgba(0, 0, 255, 1)">def</span> <span style="color: rgba(128, 0, 128, 1)">__init__</span>(self, *args, **kwargs):
        <span style="color: rgba(0, 128, 0, 1)"># Let scrapy.Spider finish its own setup and accept any spider arguments</span>
        super().__init__(*args, **kwargs)
        self.cate_url = <span style="color: rgba(128, 0, 0, 1)">'https://www.wandoujia.com/category/app'</span>
        <span style="color: rgba(0, 128, 0, 1)"># Base URL for the first page of a category</span>
        self.url = <span style="color: rgba(128, 0, 0, 1)">'https://www.wandoujia.com/category/'</span>
        <span style="color: rgba(0, 128, 0, 1)"># URL of the ajax request</span>
        self.ajax_url = <span style="color: rgba(128, 0, 0, 1)">'https://www.wandoujia.com/wdjweb/api/category/more?'</span>
        <span style="color: rgba(0, 128, 0, 1)"># Instantiate the category-label parser</span>
        self.wandou_category =<span style="color: rgba(0, 0, 0, 1)"> Get_category()
</span>
    <span style="color: rgba(0, 0, 255, 1)">def</span> start_requests(self):
        <span style="color: rgba(0, 0, 255, 1)">yield</span> scrapy.Request(self.cate_url, callback=self.get_category)

    <span style="color: rgba(0, 0, 255, 1)">def</span> get_category(self, response):
        <span style="color: rgba(0, 128, 0, 1)"># # num = 0</span>
        cate_content = self.wandou_category.parse_category(response)
        <span style="color: rgba(0, 0, 255, 1)">for</span> item <span style="color: rgba(0, 0, 255, 1)">in</span> cate_content:
            child_cate = item[<span style="color: rgba(128, 0, 0, 1)">'child_cate_codes'</span>]
            <span style="color: rgba(0, 0, 255, 1)">for</span> cate <span style="color: rgba(0, 0, 255, 1)">in</span> child_cate:
                cate_code = item[<span style="color: rgba(128, 0, 0, 1)">'cate_code'</span>]
                cate_name = item[<span style="color: rgba(128, 0, 0, 1)">'cate_name'</span>]
                child_cate_code = cate[<span style="color: rgba(128, 0, 0, 1)">'child_cate_code'</span>]
                child_cate_name = cate[<span style="color: rgba(128, 0, 0, 1)">'child_cate_name'</span>]

                <span style="color: rgba(0, 128, 0, 1)"># # Download a single category only (category names as they appear on the site)</span>
                <span style="color: rgba(0, 128, 0, 1)"># cate_code = 5029</span>
                <span style="color: rgba(0, 128, 0, 1)"># child_cate_code = 837</span>
                <span style="color: rgba(0, 128, 0, 1)"># cate_name = '通讯社交'</span>
                <span style="color: rgba(0, 128, 0, 1)"># child_cate_name = '收音机'</span>

                <span style="color: rgba(0, 128, 0, 1)"># while-loop approach</span>
                page = 1  <span style="color: rgba(0, 128, 0, 1)"># Page number to start crawling from</span>
                <span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'*'</span> * 50)

                <span style="color: rgba(0, 128, 0, 1)"># # for loop over the following pages</span>
                <span style="color: rgba(0, 128, 0, 1)"># pages = []</span>
                <span style="color: rgba(0, 128, 0, 1)"># for page in range(1,3):</span>
                <span style="color: rgba(0, 128, 0, 1)">#     print('Crawling: %s-%s page %s' %</span>
                <span style="color: rgba(0, 128, 0, 1)">#           (cate_name, child_cate_name, page))</span>
                logging.debug(<span style="color: rgba(128, 0, 0, 1)">'Crawling: %s-%s page %s'</span> %
                              (cate_name, child_cate_name, page))

                <span style="color: rgba(0, 0, 255, 1)">if</span> page == 1:
                    <span style="color: rgba(0, 128, 0, 1)"># Build the first-page URL</span>
                    category_url = <span style="color: rgba(128, 0, 0, 1)">'{}{}_{}'</span>.format(self.url, cate_code, child_cate_code)
                <span style="color: rgba(0, 0, 255, 1)">else</span>:
                    params = {
                        <span style="color: rgba(128, 0, 0, 1)">'catId'</span>: cate_code,  <span style="color: rgba(0, 128, 0, 1)"># Top-level category</span>
                        <span style="color: rgba(128, 0, 0, 1)">'subCatId'</span>: child_cate_code,  <span style="color: rgba(0, 128, 0, 1)"># Subcategory</span>
                        <span style="color: rgba(128, 0, 0, 1)">'page'</span>: page,
                    }
                    category_url = self.ajax_url + urlencode(params)

                <span style="color: rgba(0, 128, 0, 1)"># Named meta rather than dict, to avoid shadowing the built-in</span>
                meta = {
                    <span style="color: rgba(128, 0, 0, 1)">'page'</span>: page,
                    <span style="color: rgba(128, 0, 0, 1)">'cate_name'</span>: cate_name,
                    <span style="color: rgba(128, 0, 0, 1)">'cate_code'</span>: cate_code,
                    <span style="color: rgba(128, 0, 0, 1)">'child_cate_name'</span>: child_cate_name,
                    <span style="color: rgba(128, 0, 0, 1)">'child_cate_code'</span>: child_cate_code,
                }

                <span style="color: rgba(0, 0, 255, 1)">yield</span> scrapy.Request(category_url, callback=self.parse, meta=meta)

                <span style="color: rgba(0, 128, 0, 1)"># # for-loop approach</span>
                <span style="color: rgba(0, 128, 0, 1)"># pa = yield scrapy.Request(category_url,callback=self.parse,meta=meta)</span>
                <span style="color: rgba(0, 128, 0, 1)"># pages.append(pa)</span>
                <span style="color: rgba(0, 128, 0, 1)"># return pages</span>

    <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> parse(self, response):
</span>        <span style="color: rgba(0, 128, 0, 1)"># Check whether this page still has content; the threshold is 100 because an empty response is 87 bytes long</span>
        <span style="color: rgba(0, 0, 255, 1)">if</span> len(response.body) >= 100:
            page = response.meta[<span style="color: rgba(128, 0, 0, 1)">'page'</span>]
            cate_name = response.meta[<span style="color: rgba(128, 0, 0, 1)">'cate_name'</span>]
            cate_code = response.meta[<span style="color: rgba(128, 0, 0, 1)">'cate_code'</span>]
            child_cate_name = response.meta[<span style="color: rgba(128, 0, 0, 1)">'child_cate_name'</span>]
            child_cate_code = response.meta[<span style="color: rgba(128, 0, 0, 1)">'child_cate_code'</span>]

            <span style="color: rgba(0, 0, 255, 1)">if</span> page == 1:
                contents = response
            <span style="color: rgba(0, 0, 255, 1)">else</span>:
                <span style="color: rgba(0, 128, 0, 1)"># response.text replaces body_as_unicode(), which newer Scrapy versions no longer provide</span>
                jsonresponse = json.loads(response.text)
                contents = jsonresponse[<span style="color: rgba(128, 0, 0, 1)">'data'</span>][<span style="color: rgba(128, 0, 0, 1)">'content'</span>]
                <span style="color: rgba(0, 128, 0, 1)"># The response is JSON whose content field holds HTML as plain text; wrap it in a Selector before using .css</span>
                contents = scrapy.Selector(text=contents, type=<span style="color: rgba(128, 0, 0, 1)">"html"</span>)

            contents = contents.css(<span style="color: rgba(128, 0, 0, 1)">'.card'</span>)
            <span style="color: rgba(0, 0, 255, 1)">for</span> content <span style="color: rgba(0, 0, 255, 1)">in</span> contents:
                <span style="color: rgba(0, 128, 0, 1)"># num += 1</span>
                item = WandoujiaItem()
                item[<span style="color: rgba(128, 0, 0, 1)">'cate_name'</span>] = cate_name
                item[<span style="color: rgba(128, 0, 0, 1)">'child_cate_name'</span>] = child_cate_name
                item[<span style="color: rgba(128, 0, 0, 1)">'app_name'</span>] = self.clean_name(content.css(<span style="color: rgba(128, 0, 0, 1)">'.name::text'</span>).extract_first())
                item[<span style="color: rgba(128, 0, 0, 1)">'install'</span>] = content.css(<span style="color: rgba(128, 0, 0, 1)">'.install-count::text'</span>).extract_first()
                item[<span style="color: rgba(128, 0, 0, 1)">'volume'</span>] = content.css(<span style="color: rgba(128, 0, 0, 1)">'.meta span:last-child::text'</span>).extract_first()
                item[<span style="color: rgba(128, 0, 0, 1)">'comment'</span>] = content.css(<span style="color: rgba(128, 0, 0, 1)">'.comment::text'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first().strip()
</span><span style="color: rgba(0, 128, 128, 1)">132</span> item[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">icon_url</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = self.get_icon_url(content.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.icon-wrap a img</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">),page)
</span><span style="color: rgba(0, 128, 128, 1)">133</span> <span style="color: rgba(0, 0, 255, 1)">yield</span><span style="color: rgba(0, 0, 0, 1)"> item
</span><span style="color: rgba(0, 128, 128, 1)">134</span>
<span style="color: rgba(0, 128, 128, 1)">135</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> request the next page recursively</span>
<span style="color: rgba(0, 128, 128, 1)">136</span> page += 1
<span style="color: rgba(0, 128, 128, 1)">137</span> params =<span style="color: rgba(0, 0, 0, 1)"> {
</span><span style="color: rgba(0, 128, 128, 1)">138</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">catId</span><span style="color: rgba(128, 0, 0, 1)">'</span>: cate_code,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> main category</span>
<span style="color: rgba(0, 128, 128, 1)">139</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">subCatId</span><span style="color: rgba(128, 0, 0, 1)">'</span>: child_cate_code,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> subcategory</span>
<span style="color: rgba(0, 128, 128, 1)">140</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">page</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">: page,
</span><span style="color: rgba(0, 128, 128, 1)">141</span> <span style="color: rgba(0, 0, 0, 1)"> }
</span><span style="color: rgba(0, 128, 128, 1)">142</span> ajax_url = self.ajax_url +<span style="color: rgba(0, 0, 0, 1)"> urlencode(params)
</span><span style="color: rgba(0, 128, 128, 1)">143</span>
<span style="color: rgba(0, 128, 128, 1)">144</span> meta = {<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">page</span><span style="color: rgba(128, 0, 0, 1)">'</span>:page,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">cate_name</span><span style="color: rgba(128, 0, 0, 1)">'</span>:cate_name,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">cate_code</span><span style="color: rgba(128, 0, 0, 1)">'</span>:cate_code,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">child_cate_name</span><span style="color: rgba(128, 0, 0, 1)">'</span>:child_cate_name,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">child_cate_code</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">:child_cate_code}
</span><span style="color: rgba(0, 128, 128, 1)">145</span> <span style="color: rgba(0, 0, 255, 1)">yield</span> scrapy.Request(ajax_url,callback=self.parse,meta=<span style="color: rgba(0, 0, 0, 1)">meta)
</span><span style="color: rgba(0, 128, 128, 1)">146</span>
<span style="color: rgba(0, 128, 128, 1)">147</span>
<span style="color: rgba(0, 128, 128, 1)">148</span>
<span style="color: rgba(0, 128, 128, 1)">149</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> name cleanup: remove special characters that cannot be used in file names</span>
<span style="color: rgba(0, 128, 128, 1)">150</span> <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> clean_name(self, name):
</span><span style="color: rgba(0, 128, 128, 1)">151</span> rule = re.compile(r<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">[\/\\\:\*\?\"\<\>\|]</span><span style="color: rgba(128, 0, 0, 1)">"</span>)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> '/ \ : * ? " < > |')</span>
<span style="color: rgba(0, 128, 128, 1)">152</span> name = re.sub(rule, <span style="color: rgba(128, 0, 0, 1)">''</span><span style="color: rgba(0, 0, 0, 1)">, name)
</span><span style="color: rgba(0, 128, 128, 1)">153</span> <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> name
</span><span style="color: rgba(0, 128, 128, 1)">154</span>
<span style="color: rgba(0, 128, 128, 1)">155</span> <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_icon_url(self,item,page):
</span><span style="color: rgba(0, 128, 128, 1)">156</span> <span style="color: rgba(0, 0, 255, 1)">if</span> page == 1<span style="color: rgba(0, 0, 0, 1)">:
</span><span style="color: rgba(0, 128, 128, 1)">157</span> <span style="color: rgba(0, 0, 255, 1)">if</span> item.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">::attr("src")</span><span style="color: rgba(128, 0, 0, 1)">'</span>).extract_first().startswith(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">https</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
</span><span style="color: rgba(0, 128, 128, 1)">158</span> url = item.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">::attr("src")</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first()
</span><span style="color: rgba(0, 128, 128, 1)">159</span> <span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
</span><span style="color: rgba(0, 128, 128, 1)">160</span> url = item.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">::attr("data-original")</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first()
</span><span style="color: rgba(0, 128, 128, 1)">161</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> icon URL extraction for AJAX pages</span>
<span style="color: rgba(0, 128, 128, 1)">162</span> <span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
</span><span style="color: rgba(0, 128, 128, 1)">163</span> url = item.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">::attr("data-original")</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first()
</span><span style="color: rgba(0, 128, 128, 1)">164</span>
<span style="color: rgba(0, 128, 128, 1)">165</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> if url:# do not add a URL existence check here; otherwise empty URLs are filtered out and the numbering no longer lines up</span>
<span style="color: rgba(0, 128, 128, 1)">166</span> <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> url
</span><span style="color: rgba(0, 128, 128, 1)">167</span>
<span style="color: rgba(0, 128, 128, 1)">168</span>
<span style="color: rgba(0, 128, 128, 1)">169</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> First get the numeric codes of the main categories and subcategories # # # # # # # # # # # # # # # #</span>
<span style="color: rgba(0, 128, 128, 1)">170</span> <span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> Get_category():
</span><span style="color: rgba(0, 128, 128, 1)">171</span> <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> parse_category(self, response):
</span><span style="color: rgba(0, 128, 128, 1)">172</span> category = response.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.parent-cate</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">173</span> data =<span style="color: rgba(0, 0, 0, 1)"> [{
</span><span style="color: rgba(0, 128, 128, 1)">174</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">cate_name</span><span style="color: rgba(128, 0, 0, 1)">'</span>: item.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.cate-link::text</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first(),
</span><span style="color: rgba(0, 128, 128, 1)">175</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">cate_code</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">: self.get_category_code(item),
</span><span style="color: rgba(0, 128, 128, 1)">176</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">child_cate_codes</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">: self.get_child_category(item),
</span><span style="color: rgba(0, 128, 128, 1)">177</span> } <span style="color: rgba(0, 0, 255, 1)">for</span> item <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> category]
</span><span style="color: rgba(0, 128, 128, 1)">178</span> <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> data
</span><span style="color: rgba(0, 128, 128, 1)">179</span>
<span style="color: rgba(0, 128, 128, 1)">180</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> get the numeric code of a main category tag</span>
<span style="color: rgba(0, 128, 128, 1)">181</span> <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_category_code(self, item):
</span><span style="color: rgba(0, 128, 128, 1)">182</span> cate_url = item.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.cate-link::attr("href")</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first()
</span><span style="color: rgba(0, 128, 128, 1)">183</span>
<span style="color: rgba(0, 128, 128, 1)">184</span> pattern = re.compile(r<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.*/(\d+)</span><span style="color: rgba(128, 0, 0, 1)">'</span>)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> extract the main category code</span>
<span style="color: rgba(0, 128, 128, 1)">185</span> cate_code =<span style="color: rgba(0, 0, 0, 1)"> re.search(pattern, cate_url)
</span><span style="color: rgba(0, 128, 128, 1)">186</span> <span style="color: rgba(0, 0, 255, 1)">return</span> cate_code.group(1<span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">187</span>
<span style="color: rgba(0, 128, 128, 1)">188</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> get the names and numeric codes of all subcategory tags</span>
<span style="color: rgba(0, 128, 128, 1)">189</span> <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_child_category(self, item):
</span><span style="color: rgba(0, 128, 128, 1)">190</span> child_cate = item.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.child-cate a</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">191</span> child_cate_url =<span style="color: rgba(0, 0, 0, 1)"> [{
</span><span style="color: rgba(0, 128, 128, 1)">192</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">child_cate_name</span><span style="color: rgba(128, 0, 0, 1)">'</span>: child.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">::text</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first(),
</span><span style="color: rgba(0, 128, 128, 1)">193</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">child_cate_code</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">: self.get_child_category_code(child)
</span><span style="color: rgba(0, 128, 128, 1)">194</span> } <span style="color: rgba(0, 0, 255, 1)">for</span> child <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> child_cate]
</span><span style="color: rgba(0, 128, 128, 1)">195</span>
<span style="color: rgba(0, 128, 128, 1)">196</span> <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> child_cate_url
</span><span style="color: rgba(0, 128, 128, 1)">197</span>
<span style="color: rgba(0, 128, 128, 1)">198</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> extract the subcategory code with a regular expression</span>
<span style="color: rgba(0, 128, 128, 1)">199</span> <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_child_category_code(self, child):
</span><span style="color: rgba(0, 128, 128, 1)">200</span> child_cate_url = child.css(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">::attr("href")</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">).extract_first()
</span><span style="color: rgba(0, 128, 128, 1)">201</span> pattern = re.compile(r<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.*_(\d+)</span><span style="color: rgba(128, 0, 0, 1)">'</span>)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> extract the subcategory code</span>
<span style="color: rgba(0, 128, 128, 1)">202</span> child_cate_code =<span style="color: rgba(0, 0, 0, 1)"> re.search(pattern, child_cate_url)
</span><span style="color: rgba(0, 128, 128, 1)">203</span> <span style="color: rgba(0, 0, 255, 1)">return</span> child_cate_code.group(1<span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">204</span>
<span style="color: rgba(0, 128, 128, 1)">205</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> # optionally save the categories to a txt file</span>
<span style="color: rgba(0, 128, 128, 1)">206</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> def write_category(self,category):</span>
<span style="color: rgba(0, 128, 128, 1)">207</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> with open('category.txt','a',encoding='utf_8_sig',newline='') as f:</span>
<span style="color: rgba(0, 128, 128, 1)">208</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> w = csv.writer(f)</span>
<span style="color: rgba(0, 128, 128, 1)">209</span> <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> w.writerow(category.values())</span></pre>
</div>
</div>
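<p>The helper logic in the spider above (name cleanup, regex extraction of the category codes, and building the next-page AJAX URL) can be tried without running Scrapy. The following is a minimal standard-library sketch. <code>AJAX_URL</code> and the sample URLs are placeholders invented for illustration, not the real site's endpoints, and the <code>None</code> handling is an addition that the original methods do not have.</p>

```python
import re
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; the real spider defines its own ajax_url.
AJAX_URL = 'https://example.com/api/category/more?'

# Characters that cannot be used in file names: / \ : * ? " < > |
ILLEGAL_CHARS = re.compile(r'[/\\:*?"<>|]')


def clean_name(name):
    """Strip characters that are illegal in file names; return '' for None."""
    if name is None:
        return ''
    return ILLEGAL_CHARS.sub('', name)


def category_code(url):
    """Return the digits that follow the last '/' in url, or None."""
    match = re.search(r'.*/(\d+)', url or '')
    return match.group(1) if match else None


def child_category_code(url):
    """Return the digits that follow the last '_' in url, or None."""
    match = re.search(r'.*_(\d+)', url or '')
    return match.group(1) if match else None


def next_page_url(cate_code, child_cate_code, page):
    """Build the AJAX URL for the page after `page`."""
    params = {
        'catId': cate_code,           # main category
        'subCatId': child_cate_code,  # subcategory
        'page': page + 1,
    }
    return AJAX_URL + urlencode(params)


if __name__ == '__main__':
    print(clean_name('QQ: chat/video?'))
    print(category_code('https://example.com/category/5029'))
    print(child_category_code('https://example.com/category/5029_716'))
    print(next_page_url('5029', '716', 1))
```

<p>Returning <code>None</code> instead of calling <code>.group(1)</code> on a failed match lets the caller decide how to handle a link that does not follow the expected pattern.</p>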
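<p>The four examples above show only the main spider scripts. To save space, the item, pipeline, and settings scripts are omitted; they can be written by following the earlier examples.</p>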
<p> </p>
<h2><span style="font-size: 18pt"><strong>6. Sending POST Requests with Scrapy</strong></span></h2>
<p><span style="font-size: 18px; color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""><strong>Question: </strong>In the earlier code we never manually sent requests for the start URLs stored in the start_urls list, yet requests for those URLs were clearly sent. How does that happen?</span></p>
<p><span style="font-size: 18px; color: rgba(0, 0, 0, 1); font-family: "Microsoft YaHei""><strong>Answer: </strong>The spider class in the spider file <span style="text-decoration: underline">inherits the start_requests(self) method from its parent class Spider</span>. This method issues a request for each URL in the start_urls list:</span></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">def</span> start_requests(self):
    <span style="color: rgba(0, 0, 255, 1)">for</span> u <span style="color: rgba(0, 0, 255, 1)">in</span> self.start_urls:
        <span style="color: rgba(0, 0, 255, 1)">yield</span> scrapy.Request(url=u, callback=self.parse)
</pre>
</div>
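<p><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>Note: </strong><strong>The default implementation sends GET requests for the start URLs. To send POST requests, the subclass has to override this method. That said, </strong><strong>in most cases POST requests are sent with the requests module rather than with Scrapy.</strong></span></p>
<p><strong><span style="font-family: 微软雅黑">Example: scraping Baidu Translate</span></strong></p>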
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)"># -*- coding: utf-8 -*-</span>
<span style="color: rgba(0, 0, 255, 1)">import</span> scrapy


<span style="color: rgba(0, 0, 255, 1)">class</span> PostSpider(scrapy.Spider):
    name = <span style="color: rgba(128, 0, 0, 1)">'post'</span>
    <span style="color: rgba(0, 128, 0, 1)"># allowed_domains = ['www.xxx.com']</span>
    start_urls = [<span style="color: rgba(128, 0, 0, 1)">'https://fanyi.baidu.com/sug'</span>]

    <span style="color: rgba(0, 0, 255, 1)">def</span> start_requests(self):
        data = {  <span style="color: rgba(0, 128, 0, 1)"># POST form parameters</span>
            <span style="color: rgba(128, 0, 0, 1)">'kw'</span>: <span style="color: rgba(128, 0, 0, 1)">'dog'</span>
        }
        <span style="color: rgba(0, 0, 255, 1)">for</span> url <span style="color: rgba(0, 0, 255, 1)">in</span> self.start_urls:
            <span style="color: rgba(0, 128, 0, 1)"># FormRequest sends formdata as the body of a POST request</span>
            <span style="color: rgba(0, 0, 255, 1)">yield</span> <span style="color: rgba(255, 0, 0, 1)"><strong>scrapy.FormRequest</strong></span>(url=url, formdata=data, callback=self.parse)

    <span style="color: rgba(0, 0, 255, 1)">def</span> parse(self, response):
        <span style="color: rgba(0, 0, 255, 1)">print</span>(response.text)</pre>
</div>
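<p>FormRequest sends the formdata dictionary as a URL-encoded form body (content type application/x-www-form-urlencoded). To see roughly what that body looks like for the example above, here is a small standard-library sketch. It does not use Scrapy, and <code>encode_form</code> is a helper name made up for this illustration.</p>

```python
from urllib.parse import urlencode, parse_qs


def encode_form(formdata):
    """Encode form fields the way an HTML form POST body is encoded."""
    return urlencode(formdata)


if __name__ == '__main__':
    body = encode_form({'kw': 'dog'})
    print(body)
    # Decoding the body gives back the original fields (values come back as lists).
    print(parse_qs(body))
```

<p>Spaces and non-ASCII characters in the values are percent-encoded, so the keyword can be any text and not only a plain ASCII word.</p>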
<p> </p>
<h2><span style="font-family: "Microsoft YaHei"; font-size: 18pt">7. Setting the Log Level</span></h2>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> - When a spider is run with scrapy crawl spiderName, the output printed to the terminal is Scrapy's log.</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> - Log levels, from most to least severe:</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> CRITICAL : critical errors</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> ERROR : regular errors</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> WARNING : warnings</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> INFO : general information</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> DEBUG : debugging information</span></p>
<p class="p"> </p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> - Choosing which log messages are output:</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> In the <strong>settings.py</strong> configuration file, add</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> LOG_LEVEL = 'ERROR' (or any other level name). Only messages at that level or above are output.</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)"> LOG_FILE = 'log.txt' writes the log to the given file instead.</span></p>
<p><span style="font-family: "Microsoft YaHei"; font-size: 18px; color: rgba(0, 0, 0, 1)">Other commonly used settings:</span></p>
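<p>Scrapy's logging is built on Python's standard logging module, so LOG_LEVEL acts as a threshold: messages below the chosen level are dropped. The sketch below shows that threshold using only the standard library, without Scrapy. The logger names, the <code>ListHandler</code> class, and the messages are invented for the demonstration.</p>

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in a list so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


def demo(level_name):
    """Log one message per level and return those that pass the threshold."""
    logger = logging.getLogger(f"demo.{level_name}")
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)
    logger.debug("debug detail")
    logger.info("spider opened")
    logger.warning("retrying request")
    logger.error("download failed")
    logger.removeHandler(handler)
    return handler.messages


if __name__ == '__main__':
    print(demo("WARNING"))
    print(demo("DEBUG"))
```

<p>With the threshold at WARNING only the warning and the error survive; at DEBUG all four messages are kept.</p>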
<p> </p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">BOT_NAME
    Default: 'scrapybot'. Set automatically when the project is created with the startproject command.
CONCURRENT_ITEMS
    Default: 100. Maximum number of items (per response) processed concurrently by the item pipelines.
CONCURRENT_REQUESTS
    Default: 16. Maximum number of concurrent requests performed by the Scrapy downloader.
LOG_ENABLED
    Default: True. Whether logging is enabled.
DEFAULT_REQUEST_HEADERS
    Default: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en'}
    Default headers used for Scrapy HTTP requests.
LOG_ENCODING
    Default: 'utf-8'. Encoding used for logging.
LOG_LEVEL
    Default: 'DEBUG'. Minimum level to log. Available levels: CRITICAL, ERROR, WARNING, INFO, DEBUG.
USER_AGENT
    Default: "Scrapy/VERSION(....)". User-Agent used when crawling, unless overridden.
COOKIES_ENABLED
    Default: True. Set COOKIES_ENABLED = False to disable cookies.</span></pre>
</div>
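<p>Put together, a settings.py fragment using several of these settings might look like the sketch below. The values are illustrative rather than recommendations, and the User-Agent string is a made-up placeholder.</p>

```python
# settings.py (fragment) -- illustrative values only

LOG_LEVEL = 'WARNING'      # log WARNING, ERROR and CRITICAL messages only
LOG_FILE = 'log.txt'       # write the log to this file
CONCURRENT_REQUESTS = 16   # same as the default; lower it to be gentler on the site
COOKIES_ENABLED = False    # disable cookies

# Placeholder User-Agent; replace with whatever suits your crawler.
USER_AGENT = 'Mozilla/5.0 (compatible; example-bot)'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```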
<p> </p>
<p> </p>
<p> </p>
<h2><strong><span style="font-family: "Microsoft YaHei"; font-size: 18pt; color: rgba(0, 0, 0, 1)">8. Running Multiple Spiders at Once</span></strong></h2>
<p class="p"><span style="color: rgba(0, 0, 0, 1)"><span style="font-family: "Microsoft YaHei"; font-size: 18px"> In real projects a single Scrapy project usually contains several spiders. How can they all be run together?</span></span></p>
<p class="p"><span style="color: rgba(0, 0, 0, 1)"><strong><span style="font-family: "Microsoft YaHei"; font-size: 18px">Running a single spider</span></strong></span></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> scrapy.cmdline <span style="color: rgba(0, 0, 255, 1)">import</span> execute

<span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 128, 1)">__name__</span> == <span style="color: rgba(128, 0, 0, 1)">'__main__'</span>:
    <span style="color: rgba(0, 128, 0, 1)"># Same as running "scrapy crawl maitian --nolog" in the project directory</span>
    execute([<span style="color: rgba(128, 0, 0, 1)">"scrapy"</span>, <span style="color: rgba(128, 0, 0, 1)">"crawl"</span>, <span style="color: rgba(128, 0, 0, 1)">"maitian"</span>, <span style="color: rgba(128, 0, 0, 1)">"--nolog"</span>])</pre>
</div>
<p><span style="font-size: 16px; font-family: "Microsoft YaHei"">Running this .py file then runs the spider named 'maitian'.</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei"">Running multiple spiders at once</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei"">The steps are:</span></p>
<pre><span style="font-size: 18px; font-family: "Microsoft YaHei"">- Create a directory with any name at the same level as spiders, e.g. commands</span><br><span style="font-size: 18px; font-family: "Microsoft YaHei"">- Create a crawlall.py file in it (the file name becomes the name of the custom command)</span><br><span style="font-size: 18px; font-family: "Microsoft YaHei"">- In settings.py, add the setting COMMANDS_MODULE = 'project_name.directory_name'</span><br><span style="font-size: 18px; font-family: "Microsoft YaHei"">- In the project directory, run the command: scrapy crawlall</span><br><br><span style="font-size: 16px; font-family: "Microsoft YaHei"">Code for crawlall.py:</span></pre>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> scrapy.commands <span style="color: rgba(0, 0, 255, 1)">import</span> ScrapyCommand


<span style="color: rgba(0, 0, 255, 1)">class</span> Command(ScrapyCommand):

    requires_project = True

    <span style="color: rgba(0, 0, 255, 1)">def</span> syntax(self):
        <span style="color: rgba(0, 0, 255, 1)">return</span> <span style="color: rgba(128, 0, 0, 1)">''</span>

    <span style="color: rgba(0, 0, 255, 1)">def</span> short_desc(self):
        <span style="color: rgba(0, 0, 255, 1)">return</span> <span style="color: rgba(128, 0, 0, 1)">'Runs all of the spiders'</span>

    <span style="color: rgba(0, 0, 255, 1)">def</span> run(self, args, opts):
        <span style="color: rgba(0, 128, 0, 1)"># spider_loader replaces the older crawler_process.spiders attribute</span>
        spider_list = self.crawler_process.spider_loader.list()
        <span style="color: rgba(0, 0, 255, 1)">for</span> name <span style="color: rgba(0, 0, 255, 1)">in</span> spider_list:
            <span style="color: rgba(0, 128, 0, 1)"># Schedule each spider, passing the command-line options as keyword arguments</span>
            self.crawler_process.crawl(name, **opts.<span style="color: rgba(128, 0, 128, 1)">__dict__</span>)
        <span style="color: rgba(0, 128, 0, 1)"># Start once, after every spider has been scheduled</span>
        self.crawler_process.start()</pre>
</div><br><br>
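<p>As a sketch of how the steps above fit together, the fragment below shows one possible project layout (in comments) and the matching setting. The names <code>myproject</code> and <code>commands</code> are placeholders, and the <code>__init__.py</code> file is included on the assumption that the directory has to be importable as a Python package.</p>

```python
# Assumed layout (myproject and commands are placeholder names):
#
# myproject/
#     scrapy.cfg
#     myproject/
#         settings.py
#         spiders/
#         commands/
#             __init__.py
#             crawlall.py   # file name = command name, i.e. "scrapy crawlall"
#
# settings.py (fragment)
COMMANDS_MODULE = 'myproject.commands'  # '<project name>.<directory name>'
```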
Source: https://www.cnblogs.com/Summer-skr--blog/p/11477117.html