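Advanced Python Web Scraping | Asynchronous Coroutines
<h2 class="p"><span style="font-size: 18pt; font-family: "Microsoft YaHei""><strong>I. </strong><strong>Background</strong></span></h2><p><span style="font-size: 18px; font-family: "Microsoft YaHei""> My earlier crawlers used requests with multithreading or multiprocessing. After digging deeper over the past few days, I realized that the real bottleneck for a crawler is not CPU speed <span style="color: rgba(255, 0, 0, 1)"><strong>but the round-trip time of each page fetch</strong></span>. requests combined with threads or processes is still blocking code, so most of the time is spent waiting for responses to come back and writing the scraped data out. Non-blocking code avoids this problem. First, the difference between blocking and non-blocking needs to be clear.</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei"">(1) A blocking call <span style="text-decoration: underline">suspends the current thread until the result is returned</span> (the thread becomes non-runnable; in this state the CPU gives it no time slices, so it is paused). The function returns only once it has the result.</span></p>
<p><span style="font-size: 18px; font-family: "Microsoft YaHei"">(2) A non-blocking call does not suspend the thread: the program carries on with the next statements and comes back to handle the return value once the result is ready.</span></p>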
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong> </strong></span></p>
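<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""> At its core, a crawler is a client that sends requests and collects the server's responses in bulk. If there are many URLs to crawl and a single thread processes them serially, each fetch has to finish before the next one starts, which is very slow. Note that running N tasks serially in one thread is not inherently inefficient: if all N tasks are pure computation, the thread still keeps the CPU busy. Serial crawling in one thread is slow because crawling is a clearly <strong>I/O-bound (blocking)</strong> workload. So how can crawl performance be improved?</span></p>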
<p class="p"> </p>
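<h2 class="p"><span style="font-size: 18pt; font-family: "Microsoft YaHei""><strong>II. </strong><strong>Basic Concepts</strong></span></h2>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>2.1 Blocking</strong></span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">Blocking is the state in which a program is suspended because a resource it needs is not yet available. <strong>If a program cannot do anything else while it waits for an operation to finish</strong>, the program is said to block on that operation.</span></p>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">Common forms include network I/O blocking, disk I/O blocking, and blocking on user input. Blocking is everywhere: even while the CPU switches context, no process can do useful work, so all of them are effectively blocked. On a multi-core CPU, the core performing the context switch is unavailable during the switch.</span></p>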
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong> </strong></span></p>
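<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>2.2 Non-blocking</strong></span></p>
<p class="p"><strong><span style="font-family: "Microsoft YaHei"; font-size: 18px">If a program is not held up while waiting for an operation and can keep doing other work, it is said to be non-blocking on that operation.</span></strong></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">Non-blocking behavior is not available at every level of a program or in every situation. </span><span style="font-family: "Microsoft YaHei"; font-size: 18px">It is possible only when the program is structured at a level that contains independent sub-units which can make progress separately.</span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">Non-blocking exists because blocking does: it is precisely because a blocking operation wastes time and hurts efficiency that we want to make it non-blocking.</span></p>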
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""> </span></p>
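<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>2.3 Synchronous</strong></span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">When different program units must coordinate through some form of communication in order to complete a task, those units are said to execute synchronously. </span><span style="font-size: 18px; font-family: "Microsoft YaHei"">For example, updating stock levels in a shopping system uses a row lock as the signal that forces concurrent update requests to queue up and run one after another, so the stock update is synchronous. </span><span style="font-family: "Microsoft YaHei"; font-size: 18px">In short, <strong>synchronous means ordered.</strong></span></p>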
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong> </strong></span></p>
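<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>2.4 Asynchronous</strong></span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px"><span style="text-decoration: underline">When program units can complete a task without communicating or coordinating along the way, they are asynchronous; unrelated program units can be asynchronous with respect to one another</span>. </span><span style="font-family: "Microsoft YaHei"; font-size: 18px">Take a crawler downloading pages: once the scheduler has handed a URL to the downloader, it can schedule other work without staying in contact with that download. Downloading and saving different pages are unrelated operations that need not notify each other, and the moment each one completes is not known in advance. </span><span style="font-family: "Microsoft YaHei"; font-size: 18px">In short, asynchronous means unordered.</span></p>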
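<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">The "unordered" nature of asynchronous work can be seen in a small sketch. It assumes Python 3.7+ and uses asyncio.sleep as a stand-in for network latency; the download() helper and its delays are hypothetical and not part of the original article.</span></p>

```python
import asyncio


async def download(name, delay):
    # asyncio.sleep stands in for a network round trip (assumed latency).
    await asyncio.sleep(delay)
    return name


async def main():
    # Submitted in the order a, b, c, each with a different simulated latency.
    jobs = [download("a", 0.3), download("b", 0.1), download("c", 0.2)]
    finished = []
    # as_completed yields results in the order the jobs finish, not the order submitted.
    for next_done in asyncio.as_completed(jobs):
        finished.append(await next_done)
    return finished


completion_order = asyncio.run(main())
print(completion_order)
```

<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">The job with the shortest delay finishes first, regardless of where it sat in the submission list.</span></p>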
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong> </strong></span></p>
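<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>2.5 Multiprocessing</strong></span></p>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">Multiprocessing takes advantage of multiple CPU cores to run several tasks in parallel at the same moment, which can greatly improve execution efficiency.</span></p>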
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""> </span></p>
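<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>2.6 Coroutines</strong></span></p>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">A coroutine (also called a micro-thread or fiber) is a lightweight thread that is scheduled in user space.</span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">A coroutine has its own register context and stack. When the scheduler switches away from it, the register context and stack are saved elsewhere, <span style="text-decoration: underline">and they are restored when control switches back</span>. A coroutine therefore keeps the state of its previous invocation (a particular combination of all its local state), and each re-entry resumes from where the last call left off.</span></p>
<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">Coroutines run inside a single thread. Compared with multithreading, they avoid the cost of thread context switches and the cost of locks and other synchronization around shared state, and the programming model is simple.</span></p>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">Coroutines can be used to implement asynchronous operations. In a crawler, for example, <strong>after sending a request we have to wait some time for the response, but during that wait the program can do plenty of other work, <span style="color: rgba(255, 0, 0, 1)">switching back to continue processing only once the response has arrived. This makes full use of the CPU and other resources, and it is the advantage of asynchronous coroutines.</span></strong></span></p>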
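<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">To make "keeps the state of its previous invocation" concrete, here is a minimal sketch using a plain Python generator as a coroutine. In Python the interpreter keeps the suspended frame rather than raw registers, so this illustrates the idea rather than the low-level mechanism; running_average() is a hypothetical example function.</span></p>

```python
def running_average():
    # total and count are local state that survives between resumptions.
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # pause here; resume when send() is called
        total += value
        count += 1
        average = total / count


averager = running_average()
next(averager)                  # advance to the first yield
results = [averager.send(v) for v in (10, 20, 30)]
print(results)                  # [10.0, 15.0, 20.0]
```

<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">Each send() resumes the function exactly where it paused, with its local variables intact.</span></p>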
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""> </span></p>
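<h2 class="p"><span style="font-size: 18pt; font-family: "Microsoft YaHei""><strong>III. </strong><strong>Analysis</strong></span> </h2>
<p class="p"><span style="font-family: 微软雅黑; font-size: 18px"> Synchronous call: submit a task and wait in place until it finishes; only once the result is available does execution continue with the next line. This is inefficient.</span></p>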
<div class="cnblogs_code">
<pre>import requests


def get_page(url):
    print('Downloading %s' % url)
    response = requests.get(url)
    if response.status_code == 200:
        return response.text


def parse_page(res):
    print('Parsing %s characters' % len(res))


def main():
    urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.python.org']
    for url in urls:
        # Each call blocks here until the result is back; only then does the loop move on.
        res = get_page(url)
        if res is not None:  # get_page returns None for non-200 responses
            parse_page(res)


if __name__ == "__main__":
    main()</pre>
</div>
<p><strong><img src="https://img2018.cnblogs.com/blog/1518468/201909/1518468-20190908155525884-1971985041.png"></strong></p>
<p><span style="font-size: 18px"><strong>a. <span style="font-family: 微软雅黑">Alternative to synchronous calls: multithreading</span><span style="font-family: "Times New Roman"">/</span><span style="font-family: 微软雅黑">multiprocessing</span></strong></span></p>
<p><span style="font-family: 微软雅黑; font-size: 18px">Advantage: use multiple threads (or processes) on the server side. The aim is to give each connection its own thread (or process), so that a blocked connection does not affect any other.</span></p>
<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">Drawback: processes and threads cannot be created without limit. When hundreds or thousands of connections have to be handled at once, both approaches </span><span style="text-decoration: underline"><span style="font-family: 微软雅黑">consume a large share of system resources and degrade the system's responsiveness</span></span><span style="font-family: 微软雅黑">, and the threads and processes themselves become more likely to hang.</span></span></p>
<p><span style="font-size: 18px"><strong>b. <span style="font-family: 微软雅黑">Alternative to synchronous calls: thread</span><span style="font-family: "Times New Roman"">/</span><span style="font-family: 微软雅黑">process pools</span></strong></span></p>
<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">Advantage: many programmers turn to a “thread pool” or a “connection pool”. A thread pool reduces how often threads are created and destroyed: it keeps a reasonable number of threads alive and hands new tasks to idle ones, which lowers system overhead.</span></span></p>
<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">Drawback: thread pools and connection pools only partly relieve the resource usage caused by frequent I/O calls. Moreover, a pool </span><span style="text-decoration: underline"><span style="font-family: 微软雅黑">always has an upper limit, and when the number of requests far exceeds it, a pooled system responds little better than one without a pool.</span></span><span style="font-family: 微软雅黑"> A pool therefore has to be sized for the load it is expected to face and adjusted as that load changes.</span></span></p>
<p><span style="font-size: 18px"> </span></p>
<p><span style="font-family: 微软雅黑">Example: scraping video information from Pear Video (pearvideo.com) with a </span>multiprocessing.dummy<span style="font-family: 微软雅黑"> thread pool</span></p>
<div class="cnblogs_code">
<pre>import random
import re

import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing API

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}


def get_page(url):
    response = requests.get(url=url, headers=header)
    if response.status_code == 200:
        return response.text
    return None


def parse_page(res):
    tree = etree.HTML(res)
    li_list = tree.xpath('//div[@id="listvideoList"]/ul/li')
    video_url_list = []
    for li in li_list:
        # xpath() returns a list, so take the first match before concatenating
        detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
        detail_page = requests.get(url=detail_url, headers=header).text
        # findall() also returns a list; keep the first captured URL
        video_url = re.findall('srcUrl="(.*?)",vdoUrl', detail_page, re.S)[0]
        video_url_list.append(video_url)
    return video_url_list


# Download the video bytes
def getVideoData(url):
    return requests.get(url=url, headers=header).content


# Persist to disk
def saveVideo(data):
    # pool.map passes only one argument to the function, so no name can be
    # handed in; a random one is generated here (names may collide)
    fileName = str(random.randint(0, 5000)) + '.mp4'
    with open(fileName, 'wb') as fp:
        fp.write(data)


def main():
    url = 'https://www.pearvideo.com/category_1'
    res = get_page(url)
    links = parse_page(res)
    pool = Pool(5)  # create a thread pool with 5 workers
    # pool.map(function, iterable) applies the function to every item in turn
    video_data_list = pool.map(getVideoData, links)
    pool.map(saveVideo, video_data_list)
    pool.close()
    pool.join()


if __name__ == "__main__":
    main()</pre>
</div>
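<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">The comment in saveVideo notes that pool.map hands each call a single argument, which is why the file name is random. One possible way around that, sketched here under assumptions, is Pool.starmap, which unpacks a tuple of arguments per call. The payload bytes, file names, and save_video() helper below are hypothetical placeholders, and nothing is downloaded.</span></span></p>

```python
import os
import tempfile
from multiprocessing.dummy import Pool  # thread pool, same module as the example above


def save_video(file_name, data):
    # Write one payload to the given path and report where it went.
    with open(file_name, "wb") as fp:
        fp.write(data)
    return file_name


# Placeholder payloads standing in for downloaded video bytes (hypothetical data).
payloads = [b"video-one", b"video-two", b"video-three"]
out_dir = tempfile.mkdtemp()
names = [os.path.join(out_dir, "video_%d.mp4" % i) for i in range(len(payloads))]

with Pool(3) as pool:
    # starmap unpacks each (name, data) tuple into two positional arguments.
    saved = pool.starmap(save_video, zip(names, payloads))

print(saved)
```

<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">Because each file gets a deterministic name, two downloads can no longer overwrite each other by drawing the same random number.</span></span></p>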
<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">Summary: faced with the thousands or even tens of thousands of simultaneous requests that a case like the one above may involve, a “thread pool” or “connection pool” can relieve some of the pressure but cannot solve everything. The multithreaded model handles small-scale workloads conveniently and efficiently, but it hits a bottleneck at large scale; non-blocking interfaces are one way to try to address this.</span></span></p>
<p><span style="font-size: 18px"> </span></p>
<p><strong><span style="font-family: 微软雅黑; font-size: 18px">The ultimate approach</span></strong></p>
<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑"> None of the approaches above solves one performance problem: </span><strong><span style="font-family: 微软雅黑">blocking I/O</span></strong><span style="font-family: 微软雅黑">. Whether it is a process or a thread, when it blocks on I/O the operating system takes the CPU away from it, and the program's execution efficiency drops.</span></span></p>
<p class="p"><span style="font-size: 18px"><strong><span style="font-family: 微软雅黑"> The key to solving this is to detect blocking I/O ourselves at the application level and then switch to another task within our own program. This keeps the time our program spends blocked on I/O to a minimum, so it stays in the ready state more often. In effect this fools the operating system into treating our program as one that does little I/O, so it gives the program as much CPU time as possible, which improves execution efficiency.</span></strong></span></p>
<p class="p"><span style="font-size: 18px"><strong><span style="font-family: 微软雅黑; color: rgba(255, 0, 0, 1)"> Implementation: a single thread plus coroutines to perform asynchronous I/O.</span></strong></span></p>
<p class="p"><span style="font-size: 18px"><strong><span style="font-family: 微软雅黑"> Asynchronous I/O: you start a network I/O operation without waiting for it to finish, carry on with other work, and are notified when it completes.</span></strong></span></p>
<p class="p"> </p>
<h2 class="p"><span style="font-size: 18pt"><strong><span style="font-family: 微软雅黑">IV.</span></strong><strong> <span style="font-family: 微软雅黑">Asynchronous Coroutines</span></strong></span></h2>
<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑">The asyncio module was added to the standard library in Python 3.4. It detects I/O for us (chiefly network I/O; an HTTP connection is a network I/O operation) and switches between tasks at the application level (</span><strong><span style="font-family: 微软雅黑">asynchronous I/O</span></strong><span style="font-family: 微软雅黑">). </span><strong><span style="font-family: 微软雅黑">Note: asyncio itself works at the transport level (for example TCP connections) and does not include an HTTP client, so HTTP requests need a library built on top of it, such as aiohttp.</span></strong></span></p>
<p><span style="font-size: 18px"><strong><span style="font-family: 微软雅黑">What is asyncio for?</span></strong></span></p>
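<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑">The following sketch illustrates what "TCP level" means in practice: asyncio provides the byte stream, and the HTTP request has to be written by hand. To stay self-contained it talks to a tiny stand-in server on 127.0.0.1 instead of a real website; the handle() and fetch() helpers are hypothetical, and Python 3.7+ is assumed.</span></span></p>

```python
import asyncio


async def handle(reader, writer):
    # Tiny stand-in server: read the request head, send a fixed HTTP response.
    await reader.readuntil(b"\r\n\r\n")
    body = b"hello"
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Length: %d\r\nConnection: close\r\n\r\n" % len(body)
        + body
    )
    await writer.drain()
    writer.close()


async def fetch(host, port):
    # asyncio gives us a TCP stream; the HTTP request text is our responsibility.
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % host.encode())
    await writer.drain()
    raw = await reader.read()        # read until the server closes the connection
    writer.close()
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.split(b"\r\n")[0], body


async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)  # port 0: let the OS pick
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await fetch("127.0.0.1", port)


status_line, body = asyncio.run(main())
print(status_line, body)
```

<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑">Hand-writing requests and parsing responses like this quickly becomes tedious, which is why an HTTP library layered on asyncio is the usual choice for real crawlers.</span></span></p>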
<ul>
<li><span style="font-size: 18px"><span style="font-family: 微软雅黑">Asynchronous network operations</span></span></li>
<li><span style="font-size: 18px"><span style="font-family: 微软雅黑">Concurrency</span></span></li>
<li><span style="font-size: 18px"><span style="font-family: 微软雅黑">Coroutines</span></span><span style="font-size: 18px"> </span></li>
</ul>
<p class="p"> </p>
<p class="p"><span style="font-family: 微软雅黑; font-size: 18px">Key concepts:</span></p>
<p class="p"><span style="font-size: 18px"><strong>event_loop</strong><span style="font-family: 微软雅黑">: the event loop, essentially an infinite loop. Functions can be registered on it, and when the corresponding condition is met the matching handler is called.</span></span></p>
<p class="p"><span style="font-size: 18px"><strong>coroutine</strong><span style="font-family: 微软雅黑">: in Python this usually refers to the coroutine object type. A coroutine object can be registered with the event loop, which then drives it. </span><strong><span style="font-family: 微软雅黑">A function defined with the async keyword does not run immediately when called; it returns a coroutine object instead.</span></strong></span></p>
<p class="p"><span style="font-size: 18px"><strong>task</strong><span style="font-family: 微软雅黑">: a further wrapper around a coroutine object that also tracks the task's states.</span></span></p>
<p class="p"><span style="font-size: 18px"><strong>future</strong><span style="font-family: 微软雅黑">: represents the result of an operation that has not completed yet. Task is a subclass of Future, so in use the two differ very little.</span></span></p>
<p class="p"><span style="font-size: 18px"><strong>async<span style="font-family: 微软雅黑"> keyword</span></strong><span style="font-family: 微软雅黑">: defines a coroutine.</span></span></p>
<p class="p"><span style="font-size: 18px"><strong><span style="font-family: "Microsoft YaHei"">await</span> keyword</strong><span style="font-family: 微软雅黑">: suspends the current coroutine at a potentially blocking operation until that operation completes, so the event loop can run other tasks in the meantime.</span></span></p>
<p class="p"> </p>
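<p class="p"><strong><span style="font-size: 18px; font-family: "Microsoft YaHei"; color: rgba(255, 0, 0, 1)">Caution: code from modules that do not support async (for example time.sleep or requests) must not appear inside a coroutine function, because it blocks the whole event loop.</span></strong></p>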
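<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">A small timing sketch shows why this caution matters. It assumes Python 3.7+; the job functions and the 0.2 second delay are made up for illustration. Three coroutines that call time.sleep run one after another, while three that await asyncio.sleep overlap.</span></p>

```python
import asyncio
import time


async def blocking_job():
    time.sleep(0.2)            # blocks the whole event loop; nothing else can run


async def cooperative_job():
    await asyncio.sleep(0.2)   # hands control back to the event loop while waiting


async def timed(job_factory, n=3):
    # Run n copies of the job concurrently and measure the wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(job_factory() for _ in range(n)))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed(blocking_job))
cooperative_elapsed = asyncio.run(timed(cooperative_job))
print("blocking: %.2fs, cooperative: %.2fs" % (blocking_elapsed, cooperative_elapsed))
```

<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">The blocking version takes roughly the sum of the delays, and the cooperative version roughly the length of one delay. When a blocking call cannot be avoided, handing it to a worker thread (for example with loop.run_in_executor) is one common workaround.</span></p>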
<p class="p"> </p>
<p class="p"><span style="font-size: 18px"><strong>1. </strong><strong><span style="font-family: 微软雅黑">Defining a coroutine</span></strong></span></p>
<div class="cnblogs_code">
<pre>import asyncio


async def execute(x):
    print('Number:', x)


coroutine = execute(1)  # calling an async function only creates a coroutine object
print('Coroutine:', coroutine)
print('After calling execute')

loop = asyncio.get_event_loop()
<span style="color: rgba(255, 0, 0, 1)"><strong>loop.run_until_complete(coroutine)</strong></span>  # the body of execute() runs here
print('After calling loop')</pre>
</div>
<p><strong><span style="font-family: 微软雅黑; font-size: 12px">Output:</span></strong></p>
<p><span style="font-size: 12px">Coroutine: &lt;coroutine object execute at 0x1034cf830&gt;</span></p>
<p><span style="font-size: 12px">After calling execute</span></p>
<p><span style="font-size: 12px">Number: 1</span></p>
<p><span style="font-size: 12px">After calling loop</span></p>
<p><span style="font-size: 18px; color: rgba(255, 0, 0, 1)"><strong><span style="font-family: 微软雅黑">As shown, calling a function defined with async produces a coroutine object that does not run by itself; it must be registered with an event loop before it executes.</span></strong></span></p>
<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">The task mentioned above is a further wrapper around a coroutine object. Compared with a bare coroutine object it adds run states such as running and finished, which we can use to check how the coroutine is progressing.</span></span></p>
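<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">On Python 3.7 and later, asyncio.run() is a shorter way to do the same registration: it creates an event loop, runs the coroutine to completion, and closes the loop. A minimal sketch (the doubled return value is only there to show that the result is passed back):</span></span></p>

```python
import asyncio


async def execute(x):
    print('Number:', x)
    return x * 2


# asyncio.run creates a loop, runs the coroutine to completion, then closes the loop.
result = asyncio.run(execute(1))
print('Result:', result)
```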
<p><span style="font-size: 18px"> </span></p>
<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑">In the example above, <span style="color: rgba(0, 0, 0, 1)"><strong>when the coroutine object is passed to run_until_complete(), the method actually wraps the coroutine in a task object</strong>.</span> We can also do this explicitly, as follows:</span></span></p>
<div class="cnblogs_code">
<pre>import asyncio


async def execute(x):
    print('Number:', x)
    return x


coroutine = execute(1)
print('Coroutine:', coroutine)
print('After calling execute')

loop = asyncio.get_event_loop()
<span style="color: rgba(255, 0, 0, 1)"><strong>task = loop.create_task(coroutine)</strong></span>  # wrap the coroutine in a task explicitly
print('Task:', task)  # pending: the loop has not run it yet
loop.run_until_complete(<span style="color: rgba(255, 0, 0, 1)"><strong>task</strong></span>)
print('Task:', task)  # finished, carrying the return value
print('After calling loop')</pre>
</div>
<p><strong><span style="font-family: 微软雅黑; font-size: 12px">Output:</span></strong></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Coroutine: &lt;coroutine object execute at 0x10e0f7830&gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">After calling execute</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task: &lt;Task pending coro=&lt;execute() running at demo.py:4&gt;&gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Number: 1</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task: &lt;Task finished coro=&lt;execute() done, defined at demo.py:4&gt; result=1&gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">After calling loop</span></p>
<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">Here, after obtaining the loop object, we call its </span><strong><span style="font-family: 微软雅黑">create_task() method to turn the coroutine object into a task object</span></strong><span style="font-family: 微软雅黑">. Printing it at that point shows the pending state. We then add the task to the event loop so that it runs, and printing it again shows that its state has become finished and that its result is 1, which is the return value of the execute() function we defined.</span></span></p>
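<p><span style="font-size: 18px"><span style="font-family: 微软雅黑">Rather than reading the state from the printed representation, a task can be queried directly with done() and result(). The sketch below assumes Python 3.7+ and creates the task inside a running coroutine with asyncio.create_task(); the short sleep is an arbitrary stand-in for real work.</span></span></p>

```python
import asyncio


async def execute(x):
    await asyncio.sleep(0.01)   # stand-in for real work
    return x


async def main():
    task = asyncio.create_task(execute(1))      # scheduled, but not finished yet
    before = task.done()
    is_future = isinstance(task, asyncio.Future)  # Task is a subclass of Future
    await task                                   # suspend main() until the task completes
    return before, task.done(), task.result(), is_future


states = asyncio.run(main())
print(states)   # (False, True, 1, True)
```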
<p><span style="font-size: 18px"> </span></p>
<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑">Another way to create a task object is to call </span><strong><span style="font-family: 微软雅黑">asyncio.ensure_future(), which also returns a task object</span></strong><span style="font-family: 微软雅黑">. It does not need a loop object in hand (internally it looks up the current event loop itself), so the task can be defined even before we have declared loop in our own code, like this:</span></span></p>
<div class="cnblogs_code">
<pre>import asyncio


async def execute(x):
    print('Number:', x)
    return x


coroutine = execute(1)
print('Coroutine:', coroutine)
print('After calling execute')

<span style="color: rgba(255, 0, 0, 1)"><strong>task = asyncio.ensure_future(coroutine)</strong></span>  # no loop variable needed yet
print('Task:', task)
loop = asyncio.get_event_loop()
loop.run_until_complete(task)
print('Task:', task)
print('After calling loop')</pre>
</div>
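<p class="p"><span style="font-size: 18px">As a minimal sketch, the same pending-to-finished transition can also be observed with the newer asyncio.run() and asyncio.create_task() entry points (Python 3.7+). The main() wrapper and the was_pending variable are illustrative names, not part of the original example.</span></p>

```python
import asyncio

async def execute(x):
    print('Number:', x)
    return x

async def main():
    # create_task() needs a running loop, which asyncio.run() provides
    task = asyncio.create_task(execute(1))
    was_pending = not task.done()   # scheduled, but not yet run
    result = await task             # let the loop run the task to completion
    return was_pending, task.done(), result

if __name__ == '__main__':
    print(asyncio.run(main()))
```

<p class="p"><span style="font-size: 18px">Here the loop is created and closed by asyncio.run(), so no loop variable has to be managed by hand.</span></p>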
<p class="p"><span style="font-size: 18px"><strong>2. Binding a callback: a callback method can also be bound to a task.</strong></span></p>
<div class="cnblogs_code">
<pre>import asyncio
import requests

async def request():
    url = 'https://www.baidu.com'
    status = requests.get(url).status_code
    return status

def callback(task):
    # Invoked once the task is done; the finished task is passed in as the argument
    print('Status:', task.result())

coroutine = request()
task = asyncio.ensure_future(coroutine)
<strong><span style="color: rgba(255, 0, 0, 1)">task.add_done_callback(callback)</span></strong>
print('Task:', task)
loop = asyncio.get_event_loop()
loop.run_until_complete(task)
print('Task:', task)</pre>
</div>
<p class="p"> <span style="font-family: "times new roman", times; font-size: 12px">Output:</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task: &lt;Task pending coro=&lt;request() running at demo.py:5&gt; cb=&gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Status: &lt;Response &gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task: &lt;Task finished coro=&lt;request() done, defined at demo.py:5&gt; result=&lt;Response &gt;&gt;</span></p>
<p><span style="font-size: 18px">Here we define a request() method that requests Baidu and returns the status code, but contains no print() statement. We then define a callback() method that takes one parameter, the task object, and prints the task's result with print(). We now have a coroutine object and a callback method, and what we want is for callback() to run once the coroutine has finished.</span></p>
<p><span style="font-size: 18px">How do we connect the two? Simply call add_done_callback(). We pass callback() to the wrapped task object, so that when the task finishes, callback() is invoked with the task object as its argument, and calling the task's result() method inside the callback returns the result.</span></p>
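<p class="p"><span style="font-size: 18px">The callback mechanism can be tried without any network access. The sketch below assumes asyncio.sleep() as a stand-in for the real request, and uses functools.partial to hand extra arguments to the callback. The names fetch_status, label and seen are hypothetical.</span></p>

```python
import asyncio
import functools

async def fetch_status():
    await asyncio.sleep(0.01)  # stand-in for a network request
    return 200

def callback(label, seen, task):
    # The finished task is always passed as the last positional argument
    seen.append((label, task.result()))

async def main():
    seen = []
    task = asyncio.ensure_future(fetch_status())
    task.add_done_callback(functools.partial(callback, 'Status', seen))
    await task
    await asyncio.sleep(0)  # one more loop turn, so the callback has certainly run
    return seen

if __name__ == '__main__':
    print(asyncio.run(main()))
```

<p class="p"><span style="font-size: 18px">Because add_done_callback() only ever supplies the task itself, binding extra arguments with functools.partial is a common way to give the callback more context.</span></p>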
<p class="p"><span style="font-size: 18px">In fact, a callback is not required: once the task has finished running, we can call result() on it directly and get the same result, as shown below:</span></p>
<div class="cnblogs_code">
<pre>import asyncio
import requests

async def request():
    url = 'https://www.baidu.com'
    status = requests.get(url).status_code
    return status

coroutine = request()
task = asyncio.ensure_future(coroutine)
print('Task:', task)
loop = asyncio.get_event_loop()
loop.run_until_complete(task)
print('Task:', task)
# The task is finished here, so result() returns immediately
print('Task Result:', <span style="color: rgba(255, 0, 0, 1)"><strong>task.result()</strong></span>)</pre>
</div>
<p class="p"> </p>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong>3. Multi-task coroutines</strong></span></p>
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei""><strong> The example above made only one request. What if we want to make several? We can <span style="color: rgba(255, 0, 0, 1)">build a list of tasks and run them with asyncio's wait() method.</span></strong></span></p>
<div class="cnblogs_code">
<pre>import asyncio
import requests

async def request():
    url = 'https://www.baidu.com'
    status = requests.get(url).status_code
    return status

# Create five tasks from the same coroutine function
<strong><span style="color: rgba(255, 0, 0, 1)">tasks = [asyncio.ensure_future(request()) for _ in range(5)]</span></strong>
print('Tasks:', tasks)
loop = asyncio.get_event_loop()
loop.run_until_complete(<strong><span style="color: rgba(255, 0, 0, 1)">asyncio.wait(tasks)</span></strong>)
for task in tasks:
    print('Task Result:', task.result())</pre>
</div>
<p><span style="font-size: 12px; font-family: "times new roman", times">Output:</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Tasks: [&lt;Task pending coro=&lt;request() running at demo.py:5&gt;&gt;, &lt;Task pending coro=&lt;request() running at demo.py:5&gt;&gt;, &lt;Task pending coro=&lt;request() running at demo.py:5&gt;&gt;, &lt;Task pending coro=&lt;request() running at demo.py:5&gt;&gt;, &lt;Task pending coro=&lt;request() running at demo.py:5&gt;&gt;]</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task Result: &lt;Response &gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task Result: &lt;Response &gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task Result: &lt;Response &gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task Result: &lt;Response &gt;</span></p>
<p><span style="font-size: 12px; font-family: "times new roman", times">Task Result: &lt;Response &gt;</span></p>
<p><span style="font-size: 18px">Here a for loop creates five tasks and collects them into a list. We pass that list to asyncio's wait() method and register the result with the event loop, which starts all five tasks.</span></p>
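<p class="p"><span style="font-size: 18px">As an alternative sketch (not the approach used in this article), asyncio.gather() also waits for a list of tasks and hands back their results directly, in the order the tasks were passed in. The request(i) coroutine below is a hypothetical stand-in that sleeps instead of hitting the network.</span></p>

```python
import asyncio

async def request(i):
    # Later tasks sleep less, so they finish earlier than the first ones
    await asyncio.sleep(0.01 * (5 - i))
    return i

async def main():
    tasks = [asyncio.ensure_future(request(i)) for i in range(5)]
    # gather() waits for every task and returns results in input order,
    # regardless of which task finished first
    return await asyncio.gather(*tasks)

if __name__ == '__main__':
    print(asyncio.run(main()))
```

<p class="p"><span style="font-size: 18px">With wait(), results are read from each task via result(). With gather(), the list of results is the return value, which is often more convenient when order matters.</span></p>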
<p> </p>
<p class="p"><span style="font-size: 18px"><strong>4. Coroutines in practice</strong></span></p>
<p class="p"><span style="font-size: 18px"> The examples above only lay the groundwork. Now let's see what advantage coroutines actually offer for IO-bound tasks.</span></p>
<p class="p"><span style="font-size: 18px"> To show that advantage we first need a suitable test environment, and the best choice is a page that takes a fixed amount of time to respond. The code above used Baidu, but Baidu responds too quickly and its response time also depends on the local network speed. So the best option is to simulate a slow server locally, and here we use Flask.</span></p>
<div class="cnblogs_code">
<pre># Server code
from flask import Flask
import time

app = Flask(__name__)

@app.route('/')
def index():
    time.sleep(3)  # simulate a page that takes 3 seconds to respond
    return 'Hello!'

if __name__ == '__main__':
    # threaded=True runs the Flask development server in multithreaded mode;
    # older Flask versions default to a single thread.
    app.run(threaded=True)</pre>
</div>
<p><span style="font-family: 微软雅黑">Next, we request this server using the same approach as above:</span></p>
<div class="cnblogs_code">
<pre>import asyncio
import requests
import time

start = time.time()

<strong>async</strong> def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for', url)
    response = requests.get(url)  # blocking call: nothing else runs while it waits
    print('Get response from', url, 'Result:', response.text)

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print('Cost time:', end - start)</pre>
</div>
<pre><span style="font-family: "times new roman", times">Output:
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
<strong>Cost time: 15.049368143081665</strong></span></pre>
<p><span style="font-size: 18px">Here we again create five tasks, pass the task list to wait(), and register it with the event loop. The total is about 15 seconds, that is, five 3-second requests executed one after another, so nothing ran concurrently.</span></p>
<p class="p"><span style="font-size: 18px"><strong><span style="color: rgba(255, 0, 0, 1)">To get asynchronous processing, we first need an operation that can suspend: when a task has to wait for an IO result, it suspends so that other tasks can run, and only then are resources used well.</span> The code above runs strictly serially with no suspension point anywhere, so it cannot possibly be asynchronous.</strong></span></p>
<p class="p"><span style="font-size: 18px"><strong>To make it asynchronous, we need to look at await. <span style="color: rgba(255, 0, 0, 1)">await suspends a time-consuming wait and yields control. When a running coroutine reaches an await, the event loop suspends that coroutine and runs other coroutines until they in turn suspend or finish.</span></strong></span></p>
<p class="p"><span style="font-size: 18px">So we might be tempted to change the request() method to the following:</span></p>
<div class="cnblogs_code">
<pre>async def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for', url)
    response = <span style="color: rgba(255, 0, 0, 1)"><strong>await</strong></span> requests.get(url)
    print('Get response from', url, 'Result:', response.text)</pre>
</div>
<p><span style="font-size: 18px">All we did was put await in front of requests.get(). However, running the code produces the following error:</span></p>
<pre><span style="font-family: "times new roman", times">Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Cost time: 15.048935890197754
Task exception was never retrieved
future: &lt;Task finished coro=&lt;request() done, defined at demo.py:7&gt; exception=TypeError("object Response can't be used in 'await' expression",)&gt;
Traceback (most recent call last):
  File "demo.py", line 10, in request
    status = await requests.get(url)
TypeError: object Response can't be used in 'await' expression</span></pre>
<p class="p"><span style="font-size: 18px">This time each task still blocked inside requests.get() for the full three seconds, which is why the total is again about 15 seconds, and then failed as soon as await was applied to the return value. The error says that the Response object returned by requests cannot be used with await. Why? According to the official documentation, <strong>the object after await must be one of the following:</strong></span></p>
<ul>
<li class="p"><span style="font-size: 18px">A native coroutine object returned from a native coroutine function, that is, the object produced by calling an async def function.</span></li>
<li class="p"><span style="font-size: 18px">A generator-based coroutine object returned from a function decorated with types.coroutine().</span></li>
<li class="p"><span style="font-size: 18px">An object with an __await__ method returning an iterator.</span></li>
</ul>
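<p class="p"><span style="font-size: 18px">To make the third case concrete, here is a minimal sketch of an object that becomes awaitable by defining __await__. The Delay class is hypothetical and exists only for illustration. It delegates to the iterator of a real coroutine, asyncio.sleep().</span></p>

```python
import asyncio

class Delay:
    """Hypothetical awaitable: waits `seconds`, then produces `value`."""

    def __init__(self, seconds, value):
        self.seconds = seconds
        self.value = value

    def __await__(self):
        # __await__ must return an iterator; delegating to the iterator of
        # an existing coroutine is the simplest way to obtain one
        yield from asyncio.sleep(self.seconds).__await__()
        return self.value

async def main():
    return await Delay(0.01, 'done')

if __name__ == '__main__':
    print(asyncio.run(main()))
```

<p class="p"><span style="font-size: 18px">A requests Response has no such method, which is why it cannot appear after await.</span></p>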
<p class="p"><span style="font-size: 18px">The Response returned by requests meets none of these conditions, hence the error above. <span style="color: rgba(255, 0, 0, 1)"><strong>Since await can be followed by a coroutine object, let's move the page request into a separate method marked with async, which gives us a coroutine object:</strong></span></span></p>
<div class="cnblogs_code">
<pre>import asyncio
import requests
import time

start = time.time()

<strong><span style="color: rgba(255, 0, 0, 1)">async def get</span></strong>(url):
    return requests.get(url)  # still a blocking call inside a coroutine

async def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for', url)
    response = await get(url)
    print('Get response from', url, 'Result:', response.text)

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print('Cost time:', end - start)</pre>
</div>
<p><span style="font-size: 18px">Let's run it and see:</span></p>
<pre><span style="font-family: "times new roman", times">Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Cost time: 15.134317874908447</span></pre>
<p class="p"><span style="font-size: 18px">It still doesn't work: execution is still not asynchronous. In other words, merely wrapping IO code in an async method is not enough, because the requests.get() call inside it still blocks the thread the event loop runs on. To get real asynchrony we need a request library that supports asynchronous operation, and this is where <strong>aiohttp </strong>comes in. <strong>(<span style="color: rgba(255, 0, 0, 1)">The requests module does not support async, so we use the aiohttp module instead.</span>)</strong></span></p>
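<p class="p"><span style="font-size: 18px">For completeness, a different workaround from the one this article adopts is to push the blocking call onto a thread pool with loop.run_in_executor(), so that the coroutine really does suspend while a worker thread waits. The sketch below assumes time.sleep() as a stand-in for requests.get(). The name blocking_get and the 0.2-second delay are illustrative only.</span></p>

```python
import asyncio
import time

def blocking_get(url):
    time.sleep(0.2)  # stand-in for a blocking requests.get(url)
    return 'Hello!'

async def request(url):
    loop = asyncio.get_running_loop()
    # Run the blocking function in the default thread pool; awaiting the
    # returned future suspends this coroutine until the thread finishes
    return await loop.run_in_executor(None, blocking_get, url)

async def main():
    start = time.time()
    url = 'http://127.0.0.1:5000'
    results = await asyncio.gather(*(request(url) for _ in range(5)))
    return results, time.time() - start

if __name__ == '__main__':
    print(asyncio.run(main()))
```

<p class="p"><span style="font-size: 18px">This keeps existing blocking code usable, but each in-flight request occupies a thread, so it scales less well than a natively asynchronous client.</span></p>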
<p class="p"> </p>
<p class="p"><span style="font-size: 18px"><strong>5. Using aiohttp </strong></span></p>
<p class="p"><span style="font-size: 18px">- Installation: pip install aiohttp</span></p>
<p class="p"><span style="font-size: 18px">Let's bring in aiohttp: <span style="color: rgba(0, 0, 0, 1)"><strong>we switch the request library from requests to aiohttp and make the request with the get() method of aiohttp's ClientSession class.</strong></span></span></p>
<div class="cnblogs_code">
<pre>import asyncio
import aiohttp
import time

start = time.time()

<strong>async def get</strong>(url):
    <strong><span style="color: rgba(255, 0, 0, 1)">session = aiohttp.ClientSession()</span></strong>
    response = <strong><span style="color: rgba(255, 0, 0, 1)">await session.get(url)</span></strong>
    result = await response.text()
    await session.close()  # close() is a coroutine in aiohttp 3.x and must be awaited
    return result

async def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for', url)
    result = await get(url)
    print('Get response from', url, 'Result:', result)

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print('Cost time:', end - start)</pre>
</div>
<p><span style="font-size: 18px">The result is shown below. The total time for these requests dropped from 15 seconds to 3 seconds, one fifth of the original:</span></p>
<pre><span style="font-family: "times new roman", times">Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Cost time: <strong><span style="color: rgba(255, 0, 0, 1)">3.0199508666992188</span></strong></span></pre>
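<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">The code uses await in front of the get() call. <span style="color: rgba(255, 0, 0, 1)"><strong>While the five coroutines run, a coroutine that reaches an await on an operation that has not finished yet is suspended, and the event loop switches to another coroutine, which runs until it is suspended or finishes in turn, and so on. This makes full use of CPU time instead of wasting it waiting on I/O.</strong></span></span></p>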
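<p class="p"><span style="font-size: 16px; font-family: "Microsoft YaHei"">When the program starts, the event loop runs the first task. When that task reaches the first await, in front of get(), it is suspended; but the first step of get() is non-blocking, so the task is woken immediately, enters get(), and creates the ClientSession object. It then reaches the second await, in front of session.get(), and is suspended again. Because the request takes a long time, the task stays suspended. With the first task suspended, what happens next? The event loop looks for a coroutine that is not suspended and runs it, so it moves on to the second task, which goes through the same steps. This repeats until the fifth task has called session.get(), at which point every task is suspended and there is nothing to do but wait. After 3 seconds the responses arrive almost simultaneously, the tasks are woken and continue, print their results, and the total time is 3 seconds.</span></p>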
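<p>The scheduling order described above can be reproduced without a server. The following is a minimal sketch, assuming <code>asyncio.sleep()</code> as a stand-in for the slow <code>session.get()</code> call; the <code>log</code> list and the 0.05-second delay are illustrative and not part of the original example.</p>

```python
import asyncio


async def get(name, delay, log):
    # Runs until the first await that cannot complete immediately
    log.append(f"waiting {name}")
    # Stand-in for session.get(): the coroutine is suspended here
    # and the event loop moves on to the next ready task
    await asyncio.sleep(delay)
    log.append(f"response {name}")


async def main():
    log = []
    # gather() wraps each coroutine in a task and runs them concurrently
    await asyncio.gather(*(get(i, 0.05, log) for i in range(5)))
    return log


log = asyncio.run(main())
print(log)
```

<p>All five "waiting" entries are logged before any "response" entry, because each task is suspended at its await before the first simulated response arrives.</p>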
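<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">In the five-request example above, the 3 seconds after the requests are sent are spent purely waiting, and in that time the CPU could handle far more than five tasks. So if we run many tasks together, shouldn't the total time still be about 3 seconds, since all the suspended tasks wait at the same time? In theory, yes, under two assumptions: the server can accept any number of simultaneous requests and still respond normally (that is, it never buckles under load), and I/O transfer latency is ignored. Under those assumptions an arbitrary number of tasks can run together and finish in the expected time.</span></p>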
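<p>Real servers do not have unlimited capacity, so in practice it can help to cap how many requests are in flight at once. The following is a minimal sketch, assuming <code>asyncio.Semaphore</code> as the limiter and <code>asyncio.sleep()</code> in place of a real request; <code>MAX_CONCURRENCY</code> and the <code>state</code> counter are hypothetical names used only for illustration.</p>

```python
import asyncio

MAX_CONCURRENCY = 10  # hypothetical limit; tune it to what the server tolerates


async def fetch(i, sem, state):
    # At most MAX_CONCURRENCY coroutines can be inside this block at once
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the network request
        state["active"] -= 1
    return i


async def main(n=100):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(i, sem, state) for i in range(n)))
    return results, state["peak"]


results, peak = asyncio.run(main())
print("requests:", len(results), "peak concurrency:", peak)
```

<p>The trade-off is that total time grows once the number of tasks exceeds the limit, in exchange for a bounded load on the server.</p>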
<p class="p"><span style="font-size: 18px; font-family: "Microsoft YaHei"">Let's set the number of tasks to 100 and try again:</span></p>
<div class="cnblogs_code">
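<pre>tasks = [asyncio.ensure_future(request()) for _ in range(100)]  # assumes request() is the coroutine function defined earlier</pre>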
</div>
<pre><span style="font-family: "times new roman", times; font-size: 12px">The elapsed time is:
Cost time: <span style="color: rgba(255, 0, 0, 1)"><strong>3.106252670288086</strong></span><br></span></pre>
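<p class="p"><span style="font-family: "Microsoft YaHei"; font-size: 18px">The run time is again about 3 seconds; the extra time is I/O latency. With asynchronous coroutines we can issue hundreds or even thousands of times as many network requests in roughly the same time, so the speedup for a crawler can be substantial.</span></p>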
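<p>The same effect can be checked offline. The following is a minimal sketch, assuming a simulated server delay (<code>DELAY</code>, scaled down from 3 seconds to 0.1 seconds) implemented with <code>asyncio.sleep()</code> rather than a real HTTP request; it is not the original test setup.</p>

```python
import asyncio
import time

DELAY = 0.1  # simulated server delay in seconds (assumption; the article uses 3 s)


async def request():
    await asyncio.sleep(DELAY)  # stand-in for the slow HTTP request
    return "Hello!"


async def main(n):
    # Create n tasks up front so that they all wait at the same time
    tasks = [asyncio.ensure_future(request()) for _ in range(n)]
    return await asyncio.gather(*tasks)


start = time.perf_counter()
results = asyncio.run(main(100))
elapsed = time.perf_counter() - start
print("Cost time:", elapsed)
```

<p>Run one after another, 100 such requests would take about 100 × <code>DELAY</code> seconds; run as concurrent tasks they finish in roughly one <code>DELAY</code> plus scheduling overhead.</p>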
<p class="p"> </p>
<p class="p"><span style="font-size: 18px"><strong>6. <span style="font-family: 微软雅黑">Comparison with single-process and multiprocess approaches</span></strong></span></p>
<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑">Single process</span></span></p>
<div class="cnblogs_code">
<pre>import requests
import time

start = time.time()

def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for', url)
    # Blocking call: returns only after the server has responded
    result = requests.get(url).text
    print('Get response from', url, 'Result:', result)

# Issue 100 requests one after another
for _ in range(100):
    request()

end = time.time()
print('Cost time:', end - start)</pre>
</div>
<pre><span style="font-size: 12px; font-family: "times new roman", times"><span style="font-size: 14px">Final elapsed time:</span>
Cost time: <span style="color: rgba(255, 0, 0, 1)"><strong>305</strong></span>.16639709472656</span></pre>
<p class="p"><span style="font-size: 18px"><span style="font-family: 微软雅黑">Multiprocess</span></span></p>
<div class="cnblogs_code">
<pre>import requests
import time
import multiprocessing

def request(_):
    url = 'http://127.0.0.1:5000'
    print('Waiting for', url)
    result = requests.get(url).text
    print('Get response from', url, 'Result:', result)

# The guard keeps worker processes from re-running this block on import
if __name__ == '__main__':
    start = time.time()
    cpu_count = multiprocessing.cpu_count()
    print('Cpu count:', cpu_count)
    # One worker process per CPU
    with multiprocessing.Pool(cpu_count) as pool:
        pool.map(request, range(100))
    end = time.time()
    print('Cost time:', end - start)</pre>
</div>
<p>Here I use the Pool class from multiprocessing, i.e. a process pool. My machine has 8 CPUs, so the pool size is 8.</p>
<pre><span>Elapsed time:
Cost time: <span style="color: rgba(255, 0, 0, 1)"><strong>48</strong></span>.17306900024414</span></pre>
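<p>The figure of roughly 48 seconds can be accounted for by how Pool.map hands out work. The following is a rough estimate, assuming CPython's default chunk-size heuristic for <code>Pool.map</code> (about one chunk per 4 × workers, rounded up) and assuming every request takes exactly 3 seconds; the helper names <code>default_chunksize</code> and <code>estimate_map_time</code> are illustrative and not part of multiprocessing.</p>

```python
import math


def default_chunksize(n_tasks, n_workers):
    # Mirrors the heuristic Pool.map uses when chunksize is not given
    chunksize, extra = divmod(n_tasks, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize


def estimate_map_time(n_tasks, n_workers, seconds_per_task):
    chunksize = default_chunksize(n_tasks, n_workers)
    n_chunks = math.ceil(n_tasks / chunksize)
    # Each worker processes one whole chunk at a time, so chunks run in rounds
    rounds = math.ceil(n_chunks / n_workers)
    return rounds * chunksize * seconds_per_task


print(estimate_map_time(100, 8, 3))  # 8 workers
print(estimate_map_time(100, 1, 3))  # a single worker
print(math.ceil(100 / 8) * 3)        # ideal case with perfectly even distribution
```

<p>Under these assumptions the 100 requests are split into 25 chunks of 4, which 8 workers process in 4 rounds of 12 seconds each. Passing <code>chunksize=1</code> to <code>pool.map</code> would bring the estimate closer to the ideal case, but either way the result stays far above the 3 seconds of the coroutine version.</p>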
<p class="p"><span style="font-size: 18px"><strong>7. </strong><strong><span style="font-family: 微软雅黑">Combining with multiprocessing</span></strong></span></p>
<p class="p"><span style="font-size: 18px">At PyCon 2018, John Reese of Facebook described the respective characteristics of asyncio and multiprocessing and presented a new library he developed, called aiomultiprocess. It requires Python 3.6 or later.</span></p>
<p class="p"><span style="font-size: 18px">Install: pip install aiomultiprocess</span></p>
<p class="p"><span style="font-family: 微软雅黑; font-size: 18px">With this library, the example above can be rewritten as follows:</span></p>
<div class="cnblogs_code">
<pre>import asyncio
import aiohttp
import time
from aiomultiprocess import Pool

async def get(url):
    # async with closes the session and releases the response when done
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def request():
    url = 'http://127.0.0.1:5000'
    urls = [url for _ in range(100)]  # assumed: 100 copies of the test URL
    <span style="color: rgba(255, 0, 0, 1)"><strong>async with Pool() as pool:</strong></span>
        result = <strong><span style="color: rgba(255, 0, 0, 1)">await pool.map(get, urls)</span></strong>
    return result

# The guard keeps worker processes from re-running this block on import
if __name__ == '__main__':
    start = time.time()
    # asyncio.run creates the event loop and runs request() to completion
    asyncio.run(request())
    end = time.time()
    print('Cost time:', end - start)</pre>
</div>
<p><span style="font-size: 18px">This uses multiprocessing and asynchronous coroutines together. The final result is, of course, about the same as with coroutines alone:</span></p>
<pre><span>Cost time: <span style="color: rgba(255, 0, 0, 1)"><strong>3</strong></span>.1156570434570312</span></pre>
<p class="p"><span style="font-size: 18px">Because of how my test endpoint works, the fastest possible response takes 3 seconds, so the extra time here is mostly I/O transfer latency. Real crawling workloads vary widely, though. Using asynchronous coroutines to avoid blocking, together with multiprocessing to take advantage of multiple cores, can still save a considerable amount of time.</span></p>
<p> </p>
<p>More examples</p>
<div class="cnblogs_code">
<pre>import aiohttp
import asyncio
from lxml import etree

all_titles = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

async def request(url):
    async with aiohttp.ClientSession() as s:
        async with s.get(url, headers=headers) as response:
            page_text = await response.text()
            return page_text

def parse(task):
    # Done callback: receives the finished task and extracts the titles
    page_text = task.result()
    # Round-trip kept from the original; raises UnicodeEncodeError if the
    # page contains characters outside GB2312
    page_text = page_text.encode('gb2312').decode('gbk')
    tree = etree.HTML(page_text)
    tr_list = tree.xpath('//*[@id="morelist"]/div/table//tr/td/table//tr')
    for tr in tr_list:
        title = tr.xpath('./td/a/text()')
        print(title)
        all_titles.append(title)

# Build the list of page URLs; the page parameter advances in steps of 30
urls = []
url = 'http://wz.sun0769.com/index.php/question/questionType?type=4&page=%d'
for page in range(100):
    u_page = page * 30
    new_url = url % u_page
    urls.append(new_url)

async def main():
    tasks = []
    for u in urls:
        task = asyncio.ensure_future(request(u))
        task.add_done_callback(parse)
        tasks.append(task)
    await asyncio.wait(tasks)

asyncio.run(main())</pre>
</div>
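<p>The callback pattern used above can be tried without network access. The following is a minimal sketch, assuming <code>asyncio.sleep()</code> in place of the aiohttp request and made-up <code>example.com</code> URLs; <code>collected</code> is a hypothetical stand-in for <code>all_titles</code>.</p>

```python
import asyncio

collected = []


async def request(url):
    await asyncio.sleep(0.01)  # stand-in for the HTTP request
    return f"<html>{url}</html>"


def parse(task):
    # Called by the event loop once the task has finished;
    # task.result() is the value returned by request()
    collected.append(task.result())


async def main(urls):
    tasks = []
    for url in urls:
        task = asyncio.ensure_future(request(url))
        task.add_done_callback(parse)
        tasks.append(task)
    await asyncio.wait(tasks)


urls = [f"http://example.com/page/{i}" for i in range(5)]  # hypothetical URLs
asyncio.run(main(urls))
print(collected)
```

<p>Each callback runs as soon as its own task completes, so parsing starts on early responses while other requests are still pending.</p>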
<p> </p>
<p> </p>
<p>Reference: https://blog.csdn.net/zhusongziye/article/details/81637088</p>
<p> </p><br><br>
Source: https://www.cnblogs.com/Summer-skr--blog/p/11486634.html