云深不归处, posted on 2019-5-8 00:00:00

Scraping Taobao Products with Python for Data Mining

<h2>Project Overview</h2>
<p>Taobao product category chosen for this project: snacks (零食)</p>
<p>Volume: 100 result pages, 4,400 snack products in total</p>
<p>Filters: Tmall only, sorted by sales from high to low, price between 0 and 200 yuan</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190506132634989-373670490.png" alt="" width="912" height="725"></p>
<p>&nbsp;</p>
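<p>Project goals:</p>
<ol>
<li>Text analysis and word-cloud visualization of product titles</li>
<li>Analysis of the product price distribution</li>
<li>Analysis of the product sales-volume distribution</li>
<li>Analysis of how price affects sales volume</li>
<li>Analysis of how price affects sales revenue</li>
<li>Distribution of product counts across provinces or cities</li>
</ol>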
<p>&nbsp;</p>
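<p>Project steps:</p>
<ol>
<li>Data collection: scrape Taobao product data with a Python crawler</li>
<li>Data preprocessing: clean and process the product data</li>
<li>Data analysis: jieba word segmentation, wordcloud visualization, data analysis and plotting</li>
</ol>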
<p>&nbsp;</p>
<p>Project environment:</p>
<p>Operating system: Windows 10, 64-bit</p>
<p>Tools: PyCharm, Chrome DevTools, Anaconda</p>
<p>&nbsp;</p>
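<h2>1. Scraping the Data</h2>
<p>Taobao has anti-crawling mechanisms, so I used multithreading, rotated the request headers, and used proxy IPs. My test environment was the campus network, where the whole school shares a single public IP; in my experience, crawlers run from a campus network are rarely blocked. Even so, there is no guarantee that every page is fetched on every run, so I added a loop that re-requests only the pages that failed, until all pages have been fetched.</p>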
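<p>As a rough illustration of this retry-until-complete idea, the sketch below replaces the network call with a stand-in so that it runs offline. The names page_offsets, fake_fetch and crawl_all, the max_rounds cap, and the 30% failure rate are assumptions made for this example and are not part of the original script.</p>

```python
import random
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 44  # items per result page, as in the article


def page_offsets(pages):
    """Return the URL offset for pages 1..pages (0, 44, 88, ...)."""
    return [PAGE_SIZE * (page - 1) for page in range(1, pages + 1)]


def fake_fetch(offset, rng):
    """Hypothetical stand-in for an HTTP request: fails about 30% of the time."""
    if rng.random() < 0.3:
        return offset, None  # simulated blocked or empty response
    return offset, "page-%d" % (offset // PAGE_SIZE + 1)


def crawl_all(offsets, fetch, max_rounds=50):
    """Re-request only the offsets that failed, until none are left."""
    results = {}
    pending = list(offsets)
    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=10) as executor:
            for offset, body in executor.map(fetch, pending):
                if body is not None:
                    results[offset] = body
        # Keep only the offsets that still have no result
        pending = [o for o in pending if o not in results]
    return results, pending


rng = random.Random(0)
results, pending = crawl_all(page_offsets(100), lambda o: fake_fetch(o, rng))
print(len(results), "pages fetched;", len(pending), "still pending")
```

<p>Unlike an unbounded while True loop, this sketch caps the number of rounds, so a page that can never be fetched does not keep the program running forever.</p>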
<p>The product data on a Taobao result page is embedded as JSON; here I extract it with regular expressions.</p>
<p>The code is as follows:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> re
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> time
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> random
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> requests
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pandas as pd
</span><span style="color: rgba(0, 0, 255, 1)">from</span> retrying <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> retry
</span><span style="color: rgba(0, 0, 255, 1)">from</span> concurrent.futures <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> ThreadPoolExecutor

start </span>= time.perf_counter()<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> start timing (time.clock() was removed in Python 3.8)</span>

<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> User-Agent pool</span>
user_agent =<span style="color: rgba(0, 0, 0, 1)"> [
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">.NET CLR 3.0.04506)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2.0.50727)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">.NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">3.0.04506.30)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (</span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Change: 287 c9dfb30)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Safari/535.20</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Safari/536.11</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LBBROWSER</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LBBROWSER</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mobile/8C148 Safari/6533.18.5</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 </span><span style="color: rgba(128, 0, 0, 1)">"</span>
    <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Safari/537.36</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
]

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> proxy IP pool</span>
proxies = [<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://125.71.212.25:9000</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://202.109.157.47:9000</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://47.94.169.110:80</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
         </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://111.40.84.73:9999</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://114.245.221.21:8060</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://117.131.235.198:8060</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> plist holds the URL offset (num) for each of pages 1-100</span>
plist =<span style="color: rgba(0, 0, 0, 1)"> []
</span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> range(1, 101<span style="color: rgba(0, 0, 0, 1)">):
    j </span>= 44 * (i - 1<span style="color: rgba(0, 0, 0, 1)">)
    plist.append(j)

listno </span>=<span style="color: rgba(0, 0, 0, 1)"> plist
datatmsp </span>= pd.DataFrame(columns=<span style="color: rgba(0, 0, 0, 1)">[])

</span><span style="color: rgba(0, 0, 255, 1)">while</span><span style="color: rgba(0, 0, 0, 1)"> True:
    @retry(stop_max_attempt_number</span>=8<span style="color: rgba(0, 0, 0, 1)">)
    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> network_programming(num):
      url </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">https://s.taobao.com/search?q=%E9%9B%B6%E9%A3%9F&amp;imgfile=&amp;js=1&amp;stats_click=search_radio_tmall%3A1</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)"> \
            </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">&amp;initiative_id=staobaoz_20190508&amp;tab=mall&amp;ie=utf8&amp;sort=sale-desc&amp;filter=reserve_price%5B%2C200%5D</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)"> \
            </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">&amp;bcoffset=0&amp;p4ppushleft=%2C44&amp;s=</span><span style="color: rgba(128, 0, 0, 1)">'</span> +<span style="color: rgba(0, 0, 0, 1)"> str(num)
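      random_user_agent </span>= random.choice(user_agent)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> pick a random User-Agent from the pool</span>
      random_proxies = random.choice(proxies)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> pick a random proxy from the proxy pool</span>
      <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> the URL is https, so requests only applies a proxy registered under the 'https' key</span>
      web = requests.get(url, headers={<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">user-agent</span><span style="color: rgba(128, 0, 0, 1)">'</span>: random_user_agent}, proxies={<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http</span><span style="color: rgba(128, 0, 0, 1)">'</span>: random_proxies, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">https</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">: random_proxies}, timeout=10)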
      web.encoding </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span>
      <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> web


    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> multithreading</span>
    <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> multithreading():
      number </span>= listno<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> only request the pages that have not been fetched successfully yet</span>
      event =<span style="color: rgba(0, 0, 0, 1)"> []

      with ThreadPoolExecutor(max_workers</span>=10<span style="color: rgba(0, 0, 0, 1)">) as executor:
            </span><span style="color: rgba(0, 0, 255, 1)">for</span> result <span style="color: rgba(0, 0, 255, 1)">in</span> executor.map(network_programming, number, chunksize=10<span style="color: rgba(0, 0, 0, 1)">):
                event.append(result)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> event


    headers </span>= {<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">User-Agent</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (WindowsNT 10.0; WOW64);Chrome/55.0.2883.87 Safari/537.36</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">}

    listpg </span>=<span style="color: rgba(0, 0, 0, 1)"> []
    event </span>=<span style="color: rgba(0, 0, 0, 1)"> multithreading()
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> event:
      json </span>= re.findall(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">"auctions":(.*?),"recommendAuctions"</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">, i.text)
      </span><span style="color: rgba(0, 0, 255, 1)">if</span><span style="color: rgba(0, 0, 0, 1)"> len(json):
            table </span>=<span style="color: rgba(0, 0, 0, 1)"> pd.read_json(json[0])
            datatmsp </span>= pd.concat([datatmsp, table], axis=0, ignore_index=<span style="color: rgba(0, 0, 0, 1)">True)
            pg </span>= re.findall(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">"pageNum":(.*?),"p4pbottom_up"</span><span style="color: rgba(128, 0, 0, 1)">'</span>, i.text)[0]<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> record the page number of every successfully fetched page</span>
<span style="color: rgba(0, 0, 0, 1)">            listpg.append(pg)

    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> convert the fetched page numbers back to the num offset used in the URL</span>
    lists =<span style="color: rgba(0, 0, 0, 1)"> []
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> a <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> listpg:
      b </span>= 44 * (int(a) - 1<span style="color: rgba(0, 0, 0, 1)">)
      lists.append(b)

    listn </span>=<span style="color: rgba(0, 0, 0, 1)"> listno

    listno </span>=<span style="color: rgba(0, 0, 0, 1)"> []
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> p <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> listn:
      </span><span style="color: rgba(0, 0, 255, 1)">if</span> p <span style="color: rgba(0, 0, 255, 1)">not</span> <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> lists:
            listno.append(p)

    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> stop looping once no unfetched pages remain</span>
    <span style="color: rgba(0, 0, 255, 1)">if</span> len(listno) ==<span style="color: rgba(0, 0, 0, 1)"> 0:
      </span><span style="color: rgba(0, 0, 255, 1)">break</span><span style="color: rgba(0, 0, 0, 1)">

datatmsp.to_excel(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">datatmsp.xls</span><span style="color: rgba(128, 0, 0, 1)">'</span>, index=<span style="color: rgba(0, 0, 0, 1)">False)

end </span>=<span style="color: rgba(0, 0, 0, 1)"> time.perf_counter()
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Crawl finished, elapsed:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, end - start, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">s</span><span style="color: rgba(128, 0, 0, 1)">'</span>)</pre>
</div>
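<p>The regular-expression step in the code above can be tried in isolation. The following is a minimal sketch on a made-up, heavily shortened page string: only the keys auctions, recommendAuctions, pageNum and p4pbottom_up come from the script, while the sample values and the helper names extract_auctions and extract_page_number are hypothetical. It uses json.loads instead of pd.read_json so that it needs only the standard library.</p>

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the JSON embedded in a result page
sample_page = (
    '{"auctions":'
    '[{"raw_title":"snack A","view_price":"9.90"},'
    '{"raw_title":"snack B","view_price":"19.80"}]'
    ',"recommendAuctions":[],"pageNum":3,"p4pbottom_up":0}'
)


def extract_auctions(text):
    """Return the list of product dicts, or [] when the block is missing."""
    match = re.search(r'"auctions":(.*?),"recommendAuctions"', text, re.S)
    return json.loads(match.group(1)) if match else []


def extract_page_number(text):
    """Return the page number as an int, or None when it is missing."""
    match = re.search(r'"pageNum":(\d+)', text)
    return int(match.group(1)) if match else None


items = extract_auctions(sample_page)
page = extract_page_number(sample_page)
print(page, [item["raw_title"] for item in items])
```

<p>Returning an empty list when nothing matches makes it easy to treat a blocked or empty response as a page that still has to be fetched again.</p>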
<p>&nbsp;</p>
<p>I first save the scraped product data locally as an Excel .xls file, which makes debugging easier. Figure 1.1 below shows the scraped data.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190508163232048-1209283946.png" alt="" width="1347" height="550"></p>
<p style="text-align: center">Figure 1.1 Product data</p>
<p style="text-align: center">&nbsp;</p>
<h2 style="text-align: left">2. Data Cleaning and Preprocessing</h2>
<p>Load the product data scraped in the previous step from the local file:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pandas as pd
datatmsp </span>= pd.read_excel(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">datatmsp.xls</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)<br></span></pre>
</div>
<p>Check the shape of the data:</p>
<div class="cnblogs_code">
<pre>datatmsp.shape</pre>
</div>
<p style="text-align: center"><img src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190508173354928-445482286.png" alt=""></p>
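<p style="text-align: center">Figure 2.1</p>
<p>The shape shows that all 100 pages, 4,400 products in total, were scraped successfully, and that each product record has 21 fields.</p>
<p>Next, analyze the missing values in the data. The code is as follows:</p>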
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Missing-value analysis</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)"> Requires the missingno library: pip install missingno</span>
<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> missingno as msno
msno.bar(datatmsp.sample(len(datatmsp)), figsize</span>=(10, 4<span style="color: rgba(0, 0, 0, 1)">))</span></pre>
</div>
<p>Running the code shows that the pid and risk fields are missing entirely, as shown in the figure:</p>
<p style="text-align: center"><img src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190508175753369-1510567797.png" alt=""></p>
<p style="text-align: center">Figure 2.2 Missing-value chart</p>
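<p>The same information can also be read as numbers rather than from a chart. A small sketch, assuming a toy DataFrame in place of datatmsp (the values are made up, and the pid column is left empty on purpose to mimic the situation described above):</p>

```python
import pandas as pd

# Toy data standing in for datatmsp; pid is entirely missing
df = pd.DataFrame({
    "raw_title": ["a", "b", "c", "d"],
    "view_price": [1.0, None, 3.0, 4.0],
    "pid": [None, None, None, None],
})

# Fraction of missing values per column, largest first
missing_ratio = df.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns with no data at all
fully_missing = missing_ratio[missing_ratio == 1.0].index.tolist()
print(fully_missing)
```

<p>A table like this is convenient when the number of fields is too large to read comfortably from a bar chart.</p>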
<p>Columns in which more than half of the values are missing, as well as duplicate rows, are dropped. The code is as follows:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Drop columns in which more than half of the values are missing</span>
half_count = len(datatmsp) // 2<span style="color: rgba(0, 0, 0, 1)">
datatmsp </span>= datatmsp.dropna(thresh=half_count, axis=1<span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Drop duplicate rows</span>
datatmsp = datatmsp.drop_duplicates()
datatmsp.shape</pre>
</div>
<p>After cleaning, the data shrinks to 4,389 products with 19 fields. The resulting shape:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190508180215649-1935923603.png" alt=""></p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190508175146996-1749943432.png" alt=""></p>
<p style="text-align: center">Figure 2.3 Data shape after cleaning</p>
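<p>To see what the thresh argument does, here is a minimal sketch on a toy DataFrame (the values are made up; the column names are borrowed from the article). dropna(thresh=n, axis=1) keeps only the columns that have at least n non-missing values.</p>

```python
import pandas as pd

# Toy data: the risk column has only 1 of 4 values, and rows 0 and 1 are identical
df = pd.DataFrame({
    "raw_title": ["a", "a", "b", "c"],
    "view_price": [1.0, 1.0, 2.0, 3.0],
    "risk": [None, None, None, "x"],
})

half_count = len(df) // 2  # a column needs at least 2 non-missing values to survive
cleaned = df.dropna(thresh=half_count, axis=1).drop_duplicates()
print(cleaned.shape)
print(list(cleaned.columns))
```

<p>Note that a column with exactly half of its values missing still meets the threshold and is kept.</p>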
<p>&nbsp;</p>
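<p>The data is now clean. Next, store the Taobao product data in a MySQL database. Because I use MySQL on a private cloud server, the connection parameters are not shown here; adjust the connection parameters in the code below to match your own setup.</p>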
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pymysql
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sqlalchemy <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> create_engine
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Connect to the MySQL server</span>
engine = create_engine(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">mysql+pymysql://{username}:{password}@{host}:{port}/{database}</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
con </span>=<span style="color: rgba(0, 0, 0, 1)"> engine.connect()
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Store the cleaned product data in MySQL</span>
datatmsp.to_sql(name=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">snacks</span><span style="color: rgba(128, 0, 0, 1)">'</span>, con=con, if_exists=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">append</span><span style="color: rgba(128, 0, 0, 1)">'</span>, index=False)</pre>
</div>
<p>The data was imported into MySQL successfully:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190508215707039-1104913112.png" alt="" width="1111" height="521"></p>
<p style="text-align: center">Figure 2.4 Taobao product table</p>
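<p>One thing to keep in mind with if_exists='append' is that running the import a second time adds the same rows again. The sketch below shows this with an in-memory SQLite database instead of MySQL, purely so that it runs without a server; the toy rows are made up.</p>

```python
import sqlite3

import pandas as pd

# Toy rows standing in for the cleaned product data
df = pd.DataFrame({"raw_title": ["a", "b"], "view_price": [9.9, 19.8]})

# In-memory SQLite database used here in place of the MySQL server
conn = sqlite3.connect(":memory:")

# The first call creates the table; the second call appends the same rows again
df.to_sql(name="snacks", con=conn, if_exists="append", index=False)
df.to_sql(name="snacks", con=conn, if_exists="append", index=False)

count = conn.execute("SELECT COUNT(*) FROM snacks").fetchone()[0]
print(count)
conn.close()
```

<p>If the script is meant to be re-run during debugging, if_exists='replace' overwrites the table instead of growing it.</p>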
<p style="text-align: center">&nbsp;</p>
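<p>For this project only four columns are needed: item_loc, raw_title, view_price, and view_sales. The analysis focuses on four dimensions: product title, location, price, and sales. The code is as follows:</p>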
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Select the four dimensions: location, title, price, and sales (copy so new columns can be added safely)</span>
data = datatmsp[[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">item_loc</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">raw_title</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">view_price</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">view_sales</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]].copy()
data.head()    </span></pre>
</div>
<p>  <img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190508220025396-1756416406.png" alt=""></p>
<p style="text-align: center">Figure 2.5</p>
<p style="text-align: left">  </p>
<p style="text-align: left">Next, preprocess the product data before analysis. The code is as follows:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Split the province and city in item_loc and store the province in the province column</span>
data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">province</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = data.item_loc.apply(<span style="color: rgba(0, 0, 255, 1)">lambda</span> x: x.split()[0])

<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Municipalities have the same province and city name, so decide by string length</span>
data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">city</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = data.item_loc.apply(<span style="color: rgba(0, 0, 255, 1)">lambda</span> x: x.split()[0] <span style="color: rgba(0, 0, 255, 1)">if</span> len(x) &lt; 4 <span style="color: rgba(0, 0, 255, 1)">else</span> x.split()[1])

<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Extract the number from view_sales to build the sales column (for example, 1.5万 becomes 15000)</span>
<span style="color: rgba(0, 0, 255, 1)">def</span> dealSales(x):
    x = x.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">人</span><span style="color: rgba(128, 0, 0, 1)">'</span>)[0]
    <span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">万</span><span style="color: rgba(128, 0, 0, 1)">'</span> <span style="color: rgba(0, 0, 255, 1)">in</span> x:
        <span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span> <span style="color: rgba(0, 0, 255, 1)">in</span> x:
            x = x.replace(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">''</span>).replace(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">万</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">000</span><span style="color: rgba(128, 0, 0, 1)">'</span>)
        <span style="color: rgba(0, 0, 255, 1)">else</span>:
            x = x.replace(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">万</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">0000</span><span style="color: rgba(128, 0, 0, 1)">'</span>)

    <span style="color: rgba(0, 0, 255, 1)">return</span> x.replace(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">+</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">''</span>)

data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sales</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = data.view_sales.apply(<span style="color: rgba(0, 0, 255, 1)">lambda</span> x: dealSales(x))

<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Convert the sales column to int</span>
data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sales</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = data.sales.astype(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">int</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 用province,city替换category,且转换成与category相同的类型</span>
list_col = [<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">province</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">city</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]
</span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> list_col:
    data </span>= data.astype(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">category</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
   
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 删除不用的列</span>
data = data.drop([<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">item_loc</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">view_sales</span><span style="color: rgba(128, 0, 0, 1)">'</span>], axis=1)</pre>
</div>
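<p>The string replacement in dealSales only works when the number before 万 has at most one decimal place; a value such as '1.25万' would come out as 125000 instead of 12500. A minimal alternative sketch follows. It assumes the view_sales strings look like '1.5万+人付款' or '6000+人收货', and the helper name parse_sales is hypothetical, not part of the original code:</p>

```python
import re

def parse_sales(text):
    """Convert a sales string such as '1.5万+人付款' to an integer count."""
    # Leading number with an optional decimal part, followed by an optional '万' unit.
    match = re.match(r'\s*(\d+(?:\.\d+)?)(万?)', text)
    if match is None:
        raise ValueError('unrecognised sales string: %r' % text)
    number, unit = match.groups()
    value = float(number)
    if unit == '万':
        value *= 10000  # 万 means ten thousand
    return int(round(value))

# Sample strings are assumptions about the scraped format
print(parse_sales('1.5万+人付款'))
print(parse_sales('6000+人收货'))
print(parse_sales('1.25万+人付款'))
```

<p>With the DataFrame from the article this could be applied as data.view_sales.apply(parse_sales), which also makes the separate astype('int') step unnecessary.</p>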
<p>The first ten rows of the preprocessed data:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190509114951996-799582081.png" alt=""></p>
<p style="text-align: center">Figure 2.6 Data</p>
<p style="text-align: left">&nbsp;</p>
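<h2>3. Data Mining and Analysis</h2>
<h3>3.1&nbsp; Text analysis of product titles</h3>
<p>Each product title in the raw_title column is tokenized with jieba, and stop words are removed using the StopWords table. Because the next step counts how many titles contain each word, the words inside each list of the filtered data title_clean are de-duplicated, so that every word appears at most once per title. The code is as follows:</p>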
<div class="cnblogs_code">
<pre># Convert all product titles to a list
title = data.raw_title.values.tolist()

# Tokenize each title with jieba
import jieba
title_s = []
for line in title:
    title_cut = jieba.lcut(line)
    title_s.append(title_cut)

# Load the stop-word list
# NOTE: the right-hand side of this assignment is missing from the source.
# The empty set is only a placeholder; fill it with the words from your StopWords table.
stopwords = set()

# Remove stop words
title_clean = []
for line in title_s:
    line_clean = []
    for word in line:
        if word not in stopwords:
            line_clean.append(word)
    title_clean.append(line_clean)

# De-duplicate the words within each title
title_clean_dist = []
for line in title_clean:
    line_dist = []
    for word in line:
        if word not in line_dist:
            line_dist.append(word)
    title_clean_dist.append(line_dist)

# Flatten title_clean_dist into a single list
allwords_clean_dist = []
for line in title_clean_dist:
    for word in line:
        allwords_clean_dist.append(word)

# Convert the list allwords_clean_dist to a DataFrame
df_allwords_clean_dist = pd.DataFrame({
    'allwords': allwords_clean_dist
})

# Count how often each filtered, de-duplicated word occurs
word_count = df_allwords_clean_dist.allwords.value_counts().reset_index()
word_count.columns = ['word', 'count']</pre>
</div>
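<p>The stop-word removal, per-title de-duplication and counting steps can also be sketched more compactly with collections.Counter. The token lists and stop words below are made up for illustration; they stand in for the output of jieba.lcut and for the StopWords table, and the function name count_words is hypothetical:</p>

```python
from collections import Counter

def count_words(tokenized_titles, stopwords):
    """Count in how many titles each word occurs, ignoring stop words."""
    counts = Counter()
    for tokens in tokenized_titles:
        # A set keeps each word at most once per title
        unique_words = {w for w in tokens if w not in stopwords}
        counts.update(unique_words)
    return counts

# Hypothetical sample data standing in for the jieba output
sample_titles = [
    ['零食', '大', '礼包', '零食'],
    ['坚果', '零食', '的', '组合'],
    ['坚果', '礼包'],
]
sample_stopwords = {'的', '大'}

word_counts = count_words(sample_titles, sample_stopwords)
for word, n in word_counts.items():
    print(word, n)
```

<p>Because each title contributes a word at most once, every count is the number of titles containing that word, which is the quantity the word_count table above holds.</p>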
<p>Next, the tokenized data is visualized as a word cloud. The code is as follows:</p>
<div class="cnblogs_code">
<pre>from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image  # scipy.misc.imread was removed in SciPy 1.2, so Pillow loads the mask here
plt.figure(figsize=(20,10))

# Read the mask image
pic = np.array(Image.open("猫.png"))
w_c = WordCloud(font_path="simhei.ttf", background_color="white", mask=pic, max_font_size=100, margin=1)
# Build a {word: count} dict from the 100 most frequent words
wc = w_c.fit_words({
    x[0]: x[1] for x in word_count.head(100).values
})
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()</pre>
</div>
<p>&nbsp;<img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201905/1426803-20190510141640272-302613643.png" alt=""></p>
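<p>Conclusions:</p>
<ol>
<li>Bundled (组合) and whole-box (整装) products account for a large share;</li>
<li>
<p>Products whose titles contain words such as 特产 (local specialty), 零食 (snack), 休闲 (leisure) and 小吃 (small eats) are also common;&nbsp;</p>
</li>
<li>By brand: popular online snack brands such as 三只松鼠, 百草味 and 良品铺子 dominate.</li>
</ol>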
<p>&nbsp;</p>
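<h3>3.2&nbsp; Total sales for each title keyword (word)</h3>
<p>For each keyword, we sum the sales of all scraped products whose titles contain it. For the word “糖果” (candy), for example, this is the total sales of every product with “糖果” in its title. The code is as follows:</p>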
<div class="cnblogs_code">
<pre>import numpy as np

# Reset the index: it was not reset after the earlier de-duplication, so some row labels are missing
data = data.reset_index(drop=True)


# For each keyword, sum the sales of every product whose title contains it
w_s_sum = []
for w in word_count.word:
    i = 0
    s_list = []
    for t in title_clean_dist:
        if w in t:
            s_list.append(data.sales[i])  # sales of the i-th product
        i += 1
    w_s_sum.append(sum(s_list))  # sum of the list

df_w_s_sum = pd.DataFrame({'w_s_sum': w_s_sum})

# Merge word_count and the matching df_w_s_sum into one table
df_word_sum = pd.concat([word_count, df_w_s_sum],
                        axis=1, ignore_index=True)
df_word_sum.columns = ['word', 'count', 'w_s_sum']  # set the column names</pre>
</div>
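<p>The nested loop above scans every title once per keyword. As a rough alternative sketch, the same totals can be accumulated in a single pass over the titles with a dictionary. The function name sales_by_keyword and the sample data are hypothetical and only illustrate the idea:</p>

```python
from collections import defaultdict

def sales_by_keyword(titles_tokens, sales):
    """Sum the sales of every product whose title contains each word."""
    totals = defaultdict(int)
    for tokens, sold in zip(titles_tokens, sales):
        for word in set(tokens):  # set() guards against repeated words in one title
            totals[word] += sold
    return dict(totals)

# Hypothetical tokenized titles and their sales figures
sample_tokens = [['零食', '礼包'], ['坚果', '零食'], ['糖果']]
sample_sales = [1200, 300, 50]

totals = sales_by_keyword(sample_tokens, sample_sales)
for word, total in totals.items():
    print(word, total)
```

<p>With the article's variables this would be called as sales_by_keyword(title_clean_dist, data.sales), relying on both being in the same row order.</p>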
<p>Next, the word and w_s_sum columns of df_word_sum are visualized. The 30 words with the highest total sales are plotted:</p>
<div class="cnblogs_code">
<pre>df_word_sum.sort_values('w_s_sum', inplace=True, ascending=True)  # ascending order
df_w_s = df_word_sum.tail(30)  # take the 30 rows with the largest totals

import matplotlib
from matplotlib import pyplot as plt

font = {'family': 'SimHei'}  # use a font that can render Chinese
matplotlib.rc('font', **font)

index = np.arange(df_w_s.word.size)
plt.figure(figsize=(10,20))
plt.barh(index, df_w_s.w_s_sum, color='blue', align='center', alpha=0.8)
plt.yticks(index, df_w_s.word, fontsize=15)

# Add data labels
for y, x in zip(index, df_w_s.w_s_sum):
    plt.text(x, y, '%.0f' % x, ha='left', va='center', fontsize=15)
plt.show()</pre>
</div>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201906/1426803-20190619151838791-1540775873.png" alt="" width="842" height="1527"></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
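<p>From the chart:</p>
<p>1. Words such as 休闲 (leisure), 零食 (snack) and 小吃 (small eats) have the highest total sales;</p>
<p>2.&nbsp;Bundled and whole-box products account for a large share;</p>
<p>3. The keywords show that the best sellers are mostly popular online brands.</p>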
<p>&nbsp;</p>
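<h3>3.3&nbsp; Price distribution of products</h3>
<p>The scraped snack products are limited to a price range of 0-200 yuan. Within this range, and in light of our own products, we analyze how the products are distributed by price. The code is as follows:</p>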
<div class="cnblogs_code">
<pre>plt.figure(figsize=(7,5))
plt.hist(data['view_price'], bins=15, color='blue')
plt.xlabel('价格', fontsize=25)
plt.ylabel('商品数量', fontsize=25)
plt.title('不同价格对应的商品数量分布', fontsize=17)
plt.show()</pre>
</div>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201906/1426803-20190619155328550-1724968839.png" alt=""></p>
<p>&nbsp;</p>
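<p>From the chart:</p>
<p>1. Most products are priced between 0 and 50 yuan; overall, the number of products first rises and then falls as the price increases;</p>
<p>2. Low-priced products dominate: the 12-25 yuan range has the most products, followed by 0-12 yuan, and the 160-180 yuan range has the fewest.</p>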
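<p>To read exact counts instead of estimating them from the histogram bars, the prices can be grouped into fixed-width bins. A small sketch in plain Python follows; the list of prices and the bin width of 25 yuan are assumptions for illustration and do not come from the scraped data:</p>

```python
from collections import Counter

def price_bins(prices, width=25):
    """Count prices per bin; each key is the lower edge of a [edge, edge + width) bin."""
    counts = Counter()
    for p in prices:
        counts[int(p // width) * width] += 1
    return dict(sorted(counts.items()))

# Hypothetical prices in yuan
sample_prices = [9.9, 12.8, 19.9, 24.5, 35.0, 49.9, 88.0, 168.0]
bins = price_bins(sample_prices)
for edge, n in bins.items():
    print('%d-%d yuan: %d' % (edge, edge + 25, n))
```

<p>On the real data the same grouping could be applied to data['view_price'], with the bin width chosen to match the histogram.</p>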
<p>&nbsp;</p>
<h3>3.4&nbsp; Sales distribution of products</h3>
<p>To make the chart easier to read, only products with sales above 100 are selected. The code is as follows:</p>
<div class="cnblogs_code">
<pre>data_s = data[data['sales'] &gt; 100]  # keep products with sales above 100
print('销量100以上的商品占比:%.3f' % (len(data_s) / len(data)))  # share of products with sales above 100

plt.figure(figsize=(12,8))
plt.hist(data_s['sales'], bins=20, color='blue')  # 20 bins
plt.xlabel('销量', fontsize=25)
plt.ylabel('商品数量', fontsize=25)
plt.title('不同销量对应的商品数量分布', fontsize=25)
plt.show()</pre>
</div>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201906/1426803-20190619155403559-996503203.png" alt=""></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
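<p>From the chart:</p>
<p>1. Nearly 100% of the products have sales above 100, and most of them fall in the 100-18000 range;</p>
<p>2. Above 18000 in sales, the number of products drops sharply; low-sales products are the majority.</p>
<p>3. Very few products have sales above 60000.</p>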
<p>&nbsp;</p>
<h3>3.5&nbsp; Effect of price on sales</h3>
<p>Here, in light of our own products, we analyze how price affects sales for products priced between 0 and 200 yuan:</p>
<div class="cnblogs_code">
<pre>fig, ax = plt.subplots()
ax.scatter(data['view_price'],
           data['sales'], color='blue')
ax.set_xlabel('价格')
ax.set_ylabel('销量')
ax.set_title('商品价格对销量的影响')
plt.show()</pre>
</div>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201906/1426803-20190619160248095-957872364.png" alt="" width="526" height="361"></p>
<p>&nbsp;</p>
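<p>From the chart:</p>
<p>1.&nbsp; Overall trend: sales tend to fall as the price rises, so price does appear to affect sales;</p>
<p>2.&nbsp; Products priced between 0 and 50 yuan are the most concentrated, with sales between 100 and 100000. Most products priced between 150 and 200 yuan have low sales, and only a few sell relatively well.</p>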
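<p>The scatter plot gives a visual impression only. One way to put a number on the relationship is the Pearson correlation coefficient. The sketch below computes it in plain Python on made-up sample values; with the real data, data['view_price'].corr(data['sales']) in pandas should serve the same purpose:</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical prices and sales, for illustration only
prices = [10, 20, 40, 80, 160]
sales = [50000, 30000, 12000, 4000, 500]
r = pearson(prices, sales)
print(round(r, 3))
```

<p>A value close to -1 would indicate that sales fall almost linearly as price rises, while a value near 0 would mean that the downward trend seen in the plot is weak.</p>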
<p>&nbsp;</p>
<h3>3.6 Effect of price on sales revenue (GMV)</h3>
<p>The code is as follows:</p>
<div class="cnblogs_code">
<pre># view_price is the price column used in sections 3.3 and 3.5
data['GMV'] = data['view_price'] * data['sales']

import seaborn as sns
sns.regplot(x='view_price', y='GMV', data=data, color='purple')
plt.show()</pre>
</div>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201906/1426803-20190619162048775-944679461.png" alt="" width="519" height="360"></p>
<p>&nbsp;</p>
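<p>From the chart:</p>
<p>1. Overall trend: the linear regression line shows that sales revenue rises slowly as the price increases;</p>
<p>2. Most products are low-priced, and their sales revenue is also low.</p>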
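<p>By default, sns.regplot draws an ordinary least-squares line through the points. A minimal sketch of that fit, on hypothetical price and sales values, makes the slope and intercept explicit. The helper name fit_line is an assumption and not part of seaborn:</p>

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data; GMV is price times sales, as in the article
prices = [10.0, 20.0, 50.0, 100.0]
sales = [3000, 2000, 900, 500]
gmv = [p * s for p, s in zip(prices, sales)]

slope, intercept = fit_line(prices, gmv)
print('slope: %.2f, intercept: %.2f' % (slope, intercept))
```

<p>A positive slope corresponds to the upward trend of the fitted line in the figure, and its size tells how much revenue changes per additional yuan of price.</p>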
<p>&nbsp;</p>
<h3>3.7 Number of products by province</h3>
<p>The code is as follows:</p>
<div class="cnblogs_code">
<pre>plt.figure(figsize=(12,8))
data.province.value_counts().plot(
                kind='bar', color='purple')
plt.xticks(rotation=0)
plt.xlabel('省份')
plt.ylabel('数量')
plt.title('不同省份的商品数量分布')
plt.show()</pre>
</div>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1426803/201906/1426803-20190619162527399-702297425.png" alt="" width="863" height="587"></p>
<p>&nbsp;</p>
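<p>From the chart:</p>
<p>1. Shanghai has the most products, followed by Zhejiang and then Hunan. Shanghai in particular far exceeds Zhejiang, Hunan, Hubei and other regions, which suggests that most shops in the snack sub-category are located in Shanghai.</p>
<p>2. Overall trend: most shops are located in coastal regions and along the middle and lower reaches of the Yangtze River.</p>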
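<p>The counts behind the bar chart can be reproduced without pandas as well. The sketch below assumes item_loc strings of the form 'province city', or just the name for municipalities, and uses made-up sample values; the helper name province_of is hypothetical:</p>

```python
from collections import Counter

def province_of(item_loc):
    """Return the province part of an item_loc string such as '浙江 杭州' or '上海'."""
    return item_loc.split()[0]

# Hypothetical locations, for illustration only
sample_locs = ['上海', '浙江 杭州', '上海', '湖南 长沙', '浙江 金华', '上海']
province_counts = Counter(province_of(loc) for loc in sample_locs)

for province, n in province_counts.most_common():
    print(province, n)
```

<p>most_common() returns the provinces from most to least frequent, which matches the order in which value_counts() arranges the bars.</p>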

</div>
<div id="MySignature" role="contentinfo">
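    <p>Author: buildings<br>Statement: I have no objection to reposting or sharing. Out of respect for the cnblogs community and the author, please keep the link to the original post.<br>To readers: Keeping a blog going is not easy, and writing a high-quality one is harder still. I am still learning and improving, and I hope to work with everyone on the same path to change life through technology. If you found this useful, please give it a like.</p><br><br>
Source: https://www.cnblogs.com/luengmingbiao/p/10822513.html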