丗汰夵湸 posted on 2016-4-29 17:34:00

Python 3 Study Notes (Using the urllib Module)

<h2><span style="font-family: &quot;Microsoft YaHei&quot;">1. Basic Methods</span></h2>
<h3><span style="font-family: &quot;Microsoft YaHei&quot;"><code class="descclassname">urllib.request.</code><code class="descname">urlopen</code><span class="sig-paren">(<em>url</em>,&nbsp;<em>data=None</em>,&nbsp;<span class="optional">[<em>timeout</em>,&nbsp;<span class="optional">]<em>*</em>,&nbsp;<em>cafile=None</em>,&nbsp;<em>capath=None</em>,&nbsp;<em>cadefault=False</em>,&nbsp;<em>context=None</em><span class="sig-paren">)</span></span></span></span></span></h3>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url: the URL to open</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data: the data to submit in a POST request</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;timeout: the timeout, in seconds, for accessing the site</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">Call urlopen() from the urllib.request module directly to fetch a page. The page data returned by read() is of type bytes, so it must be decoded with decode() to obtain a str.</span></p>
<div class="cnblogs_code">
<pre>from urllib import request

# urlopen() returns an HTTPResponse object,
# e.g. &lt;http.client.HTTPResponse object at 0x00000000048BC908&gt;
response = request.urlopen(r'http://python.org/')
page = response.read()        # raw bytes
page = page.decode('utf-8')   # bytes -&gt; str</pre>
</div>
<p><strong><span style="font-family: &quot;Microsoft YaHei&quot;">Methods provided by the object that urlopen() returns:</span></strong></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;read(), readline(), readlines(), fileno(), close(): operate on the data of the HTTPResponse object</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;info(): returns an HTTPMessage object holding the headers sent back by the remote server</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;geturl(): returns the URL of the resource retrieved (after a redirect this can differ from the URL requested)</span></p>
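<p>The methods listed above can be tried without touching a real site. The following is a minimal sketch, not part of the original notes: it assumes a throwaway local server (the HelloHandler class and the inspect_response() function are hypothetical names made up for this example) and opens it through an opener with an empty ProxyHandler so that no system proxy interferes. opener.open() returns the same kind of response object as urlopen().</p>

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short UTF-8 page."""

    def do_GET(self):
        body = "hello urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def inspect_response():
    # Port 0 lets the OS pick a free port for the throwaway server.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = request.build_opener(request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            result = {
                "requested": url,
                "code": response.getcode(),
                "url": response.geturl(),
                "content_type": response.info().get("Content-Type"),
                "charset": response.info().get_content_charset(),
                "body": response.read().decode("utf-8"),
            }
    finally:
        server.shutdown()
        server.server_close()
    return result


if __name__ == "__main__":
    for key, value in inspect_response().items():
        print(key, "=", value)
```

<p>Reading the charset from info() rather than hard-coding 'utf-8' is one option when the encoding of a page is not known in advance.</p>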
<h2><span style="font-family: &quot;Microsoft YaHei&quot;">2. Using Request</span></h2>
<h3><span style="font-family: &quot;Microsoft YaHei&quot;"><code class="descclassname">urllib.request.</code><code class="descname">Request</code><span class="sig-paren">(<em>url,&nbsp;data=None,&nbsp;headers={}, method=None</em><span class="sig-paren">)</span></span></span></h3>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">Wrap the request in a Request() object, then pass that object to urlopen() to fetch the page.</span></p>
<div class="cnblogs_code">
<pre>from urllib import request

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)  # attach the custom headers
page = request.urlopen(req).read()
page = page.decode('utf-8')</pre>
</div>
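<p><strong><span style="font-family: &quot;Microsoft YaHei&quot;">Header fields used to dress up the request:</span></strong></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;User-Agent: can carry the browser name and version, the operating system name and version, and the default language</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Referer: names the page the request came from. Sites use it for hotlink protection; when an image is only shown to requests coming from http://***.com, the site is checking Referer</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Connection: controls whether the network connection stays open after the current request; keep-alive asks for a persistent connection. It does not record session state.</span></p>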
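<p>How Request stores these headers can be checked without sending anything. The sketch below is an illustration only: the example.com URL and the header values are placeholders, not taken from the original notes.</p>

```python
from urllib import request

# Placeholder URL, used only to build the object; nothing is sent.
url = "http://example.com/jobs"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "http://example.com/",
}
req = request.Request(url, headers=headers)

# Request normalizes header names with str.capitalize(), so "User-Agent"
# is stored (and must be looked up) as "User-agent".
user_agent = req.get_header("User-agent")
has_referer = req.has_header("Referer")

# Headers can also be added after the object has been created.
req.add_header("Accept-Language", "en-US")
stored_names = sorted(name for name, _ in req.header_items())

print(user_agent)
print(has_referer)
print(stored_names)
```

<p>The capitalization only affects lookups on the Request object; HTTP header names are case-insensitive on the wire.</p>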
<h2><span style="font-family: &quot;Microsoft YaHei&quot;">3. POST Data</span></h2>
<h3><span style="font-family: &quot;Microsoft YaHei&quot;"><code class="descclassname">urllib.request.</code><code class="descname">urlopen</code><span class="sig-paren">(<em>url</em>,&nbsp;<em>data=None</em>,&nbsp;<span class="optional">[<em>timeout</em>,&nbsp;<span class="optional">]<em>*</em>,&nbsp;<em>cafile=None</em>,&nbsp;<em>capath=None</em>,&nbsp;<em>cadefault=False</em>,&nbsp;<em>context=None</em><span class="sig-paren">)</span></span></span></span></span></h3>
<p><span class="sig-paren" style="font-family: &quot;Microsoft YaHei&quot;">The data parameter of urlopen() defaults to None. When data is not None, urlopen() submits the request as a POST.</span></p>
<div class="cnblogs_code">
<pre>from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')  # form fields -&gt; bytes
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')</pre>
</div>
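<p>The rule that a non-empty data argument turns the request into a POST can be verified on the Request object itself. This is a small sketch with a placeholder URL that is never contacted; the explicit method argument is the one listed in the Request signature in section 2.</p>

```python
from urllib import parse, request

url = "http://example.com/search"  # placeholder endpoint, never contacted
form = parse.urlencode({"kd": "Python"}).encode("utf-8")

get_req = request.Request(url)                            # no data
post_req = request.Request(url, data=form)                # data present
put_req = request.Request(url, data=form, method="PUT")   # explicit override

print(get_req.get_method())
print(post_req.get_method())
print(put_req.get_method())
```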
<h3><code class="descclassname">urllib.parse.urlencode</code><span style="font-family: &quot;Microsoft YaHei&quot;">(<em>query, doseq=False, safe='', encoding=None, errors=None</em>)</span></h3>
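<p><span style="font-family: &quot;Microsoft YaHei&quot;">urlencode() converts a dict (or a sequence of key/value pairs) into a percent-encoded query string of the form key1=value1&amp;key2=value2, ready to be submitted.</span></p>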
<div class="cnblogs_code">
<pre>from urllib import parse

data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')  # str query string, then bytes</pre>
</div>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">After urlencode(), the data becomes first=true&amp;pn=1&amp;kd=Python. Because the request is a POST, this string is sent in the request body and is not appended to the URL, so the URL requested is still</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;"><strong>http://www.lagou.com/jobs/positionAjax.json?</strong></span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">POST data must be bytes or an iterable of bytes; it cannot be a str, which is why the encode() call is needed.</span></p>
<p><span style="font-family: &quot;Microsoft YaHei&quot;">The data can, of course, also be passed as an argument to urlopen() instead of to Request():</span></p>
<div class="cnblogs_code">
<pre>page = request.urlopen(req, data=data).read()</pre>
</div>
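<p>To see exactly what urlencode() produces, the conversion can be run on its own. The sketch reuses the form fields from the notes; the second dict with "C++ dev" is a made-up value added only to show how reserved characters and spaces are escaped.</p>

```python
from urllib import parse

data = {"first": "true", "pn": 1, "kd": "Python"}

# Pairs are joined with "&"; non-string values such as 1 are converted to str.
query = parse.urlencode(data)
print(query)  # first=true&pn=1&kd=Python

# urlopen() and Request() need bytes for the POST body, not str.
body = query.encode("utf-8")
print(body)

# Reserved characters are percent-encoded and spaces become "+".
escaped = parse.urlencode({"kd": "C++ dev"})
print(escaped)  # kd=C%2B%2B+dev

# parse_qs() reverses the encoding; every value comes back as a list of str.
decoded = parse.parse_qs(query)
print(decoded)
```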
<h2><span style="font-family: &quot;Microsoft YaHei&quot;">4. Exception Handling</span></h2>
<div class="cnblogs_code">
<pre>from urllib import request, parse, error


def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
        'Connection': 'keep-alive'
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    data = parse.urlencode(data).encode('utf-8')
    req = request.Request(url, headers=headers)
    page = None  # returned as-is if the request fails
    try:
        page = request.urlopen(req, data=data).read()
        page = page.decode('utf-8')
    except error.HTTPError as e:
        print(e.code)  # code is an int attribute, not a method
        print(e.read().decode('utf-8'))  # body of the error response
    return page</pre>
</div>
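<p>The function above only catches HTTPError. A slightly broader pattern also catches URLError, which covers failures where no HTTP response arrives at all. The sketch below is one way to arrange this, under stated assumptions: fetch(), demo() and NotFoundHandler are hypothetical names, and the 404 comes from a throwaway local server rather than a real site.</p>

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a 404."""

    def do_GET(self):
        body = b"no such page"
        self.send_response(404)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch(url, timeout=5):
    """Return (status, text); status is None when no HTTP response arrived."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = request.build_opener(request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.getcode(), response.read().decode("utf-8")
    except error.HTTPError as e:
        # HTTPError must be caught first because it is a subclass of URLError.
        return e.code, e.read().decode("utf-8")
    except error.URLError as e:
        return None, str(e.reason)


def demo():
    server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("http://127.0.0.1:%d/missing" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())                        # HTTP error with a status code
    print(fetch("nosuchscheme://host/"))  # URLError, no status code
```

<p>Returning a status together with the text lets the caller distinguish an error page from a successful one, which printing inside the except block does not.</p>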
<h2><span style="font-family: &quot;Microsoft YaHei&quot;">5. Using a Proxy</span></h2>
<h3><span style="font-family: &quot;Microsoft YaHei&quot;"><code class="descclassname"><span class="highlighted">urllib.request.</span></code><code class="descname">ProxyHandler</code><span class="sig-paren">(<em>proxies=None</em><span class="sig-paren">)</span></span></span></h3>
<p><span style="font-family: &quot;Microsoft YaHei&quot;"><span class="sig-paren"><span class="sig-paren">When the site being scraped restricts access, a proxy is needed to fetch the data.</span></span></span></p>
<div class="cnblogs_code">
<pre>from urllib import request, parse


def get_page(url):
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # configure the proxy
    opener = request.build_opener(proxy)  # build an opener that uses the proxy
    request.install_opener(opener)  # make it the global opener used by urlopen()
    data = parse.urlencode(data).encode('utf-8')
    page = opener.open(url, data).read()
    page = page.decode('utf-8')
    return page</pre>
</div>
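<p>install_opener() changes the behaviour of every later urlopen() call in the process. When only some requests should go through the proxy, one option is to keep the opener local and skip the install step. The sketch below assumes a placeholder proxy address (127.0.0.1:8080) and builds the openers without sending any request.</p>

```python
from urllib import request

# Placeholder proxy address; replace it with a proxy you are allowed to use.
proxy_handler = request.ProxyHandler({"http": "http://127.0.0.1:8080"})

# Only requests made through proxy_opener.open() use the proxy;
# request.urlopen() is unaffected because install_opener() is never called.
proxy_opener = request.build_opener(proxy_handler)

# An empty mapping disables proxying for this opener, including any proxy
# picked up from environment variables such as http_proxy.
direct_opener = request.build_opener(request.ProxyHandler({}))

print(type(proxy_opener).__name__)
print(type(direct_opener).__name__)
```

<p>Usage then mirrors the code above: proxy_opener.open(url, data).read() for proxied requests and direct_opener.open(url).read() for direct ones.</p>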
Source: https://www.cnblogs.com/Lands-ljk/p/5447127.html