茗道 發表於 2019-7-7 17:38:00

python爬取拉勾网数据并进行数据可视化

<p>爬取拉勾网关于python职位相关的数据信息,并将爬取的数据已csv各式存入文件,然后对csv文件相关字段的数据进行清洗,并对数据可视化展示,包括柱状图展示、直方图展示、词云展示等并根据可视化的数据做进一步的分析,其余分析和展示读者可自行发挥和扩展包括各种分析和不同的存储方式等。。。。。</p>
<h3>一、爬取和分析相关依赖包</h3>
<ol>
<li><strong>Python版本: Python3.6</strong></li>
<li><strong>requests: 下载网页</strong></li>
<li><strong>math: 向上取整</strong></li>
<li><strong>time: 暂停进程</strong></li>
<li><strong>pandas:数据分析并保存为csv文件</strong></li>
<li><strong>matplotlib:绘图</strong></li>
<li><strong>pyecharts:绘图</strong></li>
<li><strong>statsmodels:统计建模</strong></li>
<li><strong>wordcloud、scipy、jieba:生成中文词云</strong></li>
<li><strong>pylab:设置画图能显示中文</strong></li>
</ol>
<p>在以上安装或使用过程中可能读者会遇到安装或导入失败等问题自行百度,选择依赖包的合适版本</p>
<h3>二、分析网页结构</h3>
<p>通过Chrome搜索'python工程师',然后右键点击检查或者F12,,使用检查功能查看网页源代码,<span style="color: rgba(255, 0, 0, 1)">当我们点击下一页观察浏览器的搜索栏的url并没有改变</span>,这是因为拉勾网做了反爬虫机制, 职位信息并不在源代码里,而是保存在JSON的文件里,因此我们直接下载JSON,并使用字典方法直接读取数据.即可拿到我们想要的python职位相关的信息,</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707155957428-237706729.png" alt="" width="850" height="382"></p>
<p>待爬取的python工程师职位信息如下:</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707182939204-1317759390.png" alt="" width="844" height="389"></p>
<p>为了能爬到我们想要的数据,我们要用程序来模拟浏览器来查看网页,所以我们在爬取的过程中会加上头信息,头信息也是我们通过分析网页获取到的,通过网页分析我们知道该请求的头信息,以及请求的信息和请求的方式是POST请求,这样我们就可以该url请求拿到我们想的数据做进一步处理</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707161706043-292472777.png" alt="" width="757" height="428"></p>
<p>爬取网页信息代码如下:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> requests

url </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)"> https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false</span><span style="color: rgba(128, 0, 0, 1)">'</span>


<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_json(url, num):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    从指定的url中通过requests请求携带请求头和请求体获取网页中的信息,
    :return:
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
    url1 </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">https://www.lagou.com/jobs/list_python%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&amp;fromSearch=true&amp;suginput=</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
    headers </span>=<span style="color: rgba(0, 0, 0, 1)"> {
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">User-Agent</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Host</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">www.lagou.com</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Referer</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&amp;fromSearch=true&amp;suginput=</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">X-Anit-Forge-Code</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">0</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">X-Anit-Forge-Token</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">None</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">X-Requested-With</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">XMLHttpRequest</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
    }
    data </span>=<span style="color: rgba(0, 0, 0, 1)"> {
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">first</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">true</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">pn</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">: num,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">kd</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python工程师</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">}
    s </span>=<span style="color: rgba(0, 0, 0, 1)"> requests.Session()
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">建立session:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, s, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    s.get(url</span>=url1, headers=headers, timeout=3<span style="color: rgba(0, 0, 0, 1)">)
    cookie </span>=<span style="color: rgba(0, 0, 0, 1)"> s.cookies
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">获取cookie:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, cookie, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    res </span>= requests.post(url, headers=headers, data=data, cookies=cookie, timeout=3<span style="color: rgba(0, 0, 0, 1)">)
    res.raise_for_status()
    res.encoding </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
    page_data </span>=<span style="color: rgba(0, 0, 0, 1)"> res.json()
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">请求响应结果:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, page_data, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> page_data


</span><span style="color: rgba(0, 0, 255, 1)">print</span>(get_json(url, 1))</pre>
</div>
<p>通过搜索我们知道每页显示15个职位,最多显示30页,通过分析网页源代码知道,可以通过JSON里读取总职位数,通过总的职位数和每页能显示的职位数.我们可以计算出总共有多少页,然后使用循环按页爬取, 最后将职位信息汇总, 写入到CSV格式的文件中.</p>
<p>程序运行结果如图:&nbsp;</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707163029524-1977134893.png" alt="" width="880" height="301"></p>
<p>爬取所有python相关职位信息如下:</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707163310122-1930362742.png" alt="" width="875" height="384"></p>
<h3>三、数据清洗后入库</h3>
<p>数据清洗其实会占用很大一部分工作,我们在这里只做一些简单的数据分析后入库。在拉勾网输入python相关的职位会有18988个。你可以根据工作中需求选择要入库的字段,并对一些字段做进一步的筛选,比如我们可以去除职位名称中为实习生的岗位,过滤指定的字段区域在我们指定区域的职位,取字段薪资的平均值,以最低值和差值的四分之一为平均值等等根据需求自由发挥</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pandas as pd
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> matplotlib.pyplot as plt
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> statsmodels.api as sm
</span><span style="color: rgba(0, 0, 255, 1)">from</span> wordcloud <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> WordCloud
</span><span style="color: rgba(0, 0, 255, 1)">from</span> scipy.misc <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> imread
</span><span style="color: rgba(0, 0, 255, 1)">from</span> imageio <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> imread
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> jieba
</span><span style="color: rgba(0, 0, 255, 1)">from</span> pylab <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> mpl

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 使用matplotlib能够显示中文</span>
mpl.rcParams[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">font.sans-serif</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = [<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">SimHei</span><span style="color: rgba(128, 0, 0, 1)">'</span>]<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 指定默认字体</span>
mpl.rcParams[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">axes.unicode_minus</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = False<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 解决保存图像是负号'-'显示为方块的问题</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">读取数据</span>
df = pd.read_csv(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Python_development_engineer.csv</span><span style="color: rgba(128, 0, 0, 1)">'</span>, encoding=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 进行数据清洗,过滤掉实习岗位</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)"> df.drop(df.str.contains('实习')].index, inplace=True)</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)"> print(df.describe())</span>


<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 由于csv文件中的字符是字符串形式,先用正则表达式将字符串转化为列表,在去区间的均值</span>
pattern = <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\d+</span><span style="color: rgba(128, 0, 0, 1)">'</span>
<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(df['工作经验'], '\n\n\n')</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)"> print(df['工作经验'].str.findall(pattern))</span>
df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作年限</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作经验</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].str.findall(pattern)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(type(df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作年限</span><span style="color: rgba(128, 0, 0, 1)">'</span>]), <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
avg_work_year </span>=<span style="color: rgba(0, 0, 0, 1)"> []
count </span>=<span style="color: rgba(0, 0, 0, 1)"> 0
</span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作年限</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]:
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print('每个职位对应的工作年限',i)</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 如果工作经验为'不限'或'应届毕业生',那么匹配值为空,工作年限为0</span>
    <span style="color: rgba(0, 0, 255, 1)">if</span> len(i) ==<span style="color: rgba(0, 0, 0, 1)"> 0:
      avg_work_year.append(0)
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print('nihao')</span>
      count += 1
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 如果匹配值为一个数值,那么返回该数值</span>
    <span style="color: rgba(0, 0, 255, 1)">elif</span> len(i) == 1<span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print('hello world')</span>
      avg_work_year.append(int(<span style="color: rgba(128, 0, 0, 1)">''</span><span style="color: rgba(0, 0, 0, 1)">.join(i)))
      count </span>+= 1
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 如果匹配为一个区间则取平均值</span>
    <span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
      num_list </span>=
      avg_year </span>= sum(num_list) / 2<span style="color: rgba(0, 0, 0, 1)">
      avg_work_year.append(avg_year)
      count </span>+= 1
<span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(count)
df[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">avg_work_year</span><span style="color: rgba(128, 0, 0, 1)">'</span>] =<span style="color: rgba(0, 0, 0, 1)"> avg_work_year
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将字符串转化为列表,薪资取最低值加上区间值得25%,比较贴近现实</span>
df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">salary</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">薪资</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].str.findall(pattern)
</span><span style="color: rgba(0, 128, 0, 1)">#
</span>avg_salary_list =<span style="color: rgba(0, 0, 0, 1)"> []
</span><span style="color: rgba(0, 0, 255, 1)">for</span> k <span style="color: rgba(0, 0, 255, 1)">in</span> df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">salary</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]:
    int_list </span>=
    avg_salary </span>= int_list + (int_list - int_list) / 4<span style="color: rgba(0, 0, 0, 1)">
    avg_salary_list.append(avg_salary)
df[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">月薪</span><span style="color: rgba(128, 0, 0, 1)">'</span>] =<span style="color: rgba(0, 0, 0, 1)"> avg_salary_list
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> df.to_csv('python.csv', index=False)</span></pre>
</div>
<h3>四、数据可视化展示</h3>
<p>下面是对数据的可视化展示,仅以部分视图进行一些可视化的展示,如果读者想对其他字段做一些展示以及想使用不同的视图类型进行展示,请自行发挥,注:以下代码中引入的模块见最后的完整代码</p>
<h4>1、绘制python薪资的频率直方图并保存</h4>
<p>如果我们想看看关于互联网行业python工程师相关的岗位大家普遍薪资的一个分部区间在哪个范围,占据了多达的比例我们就可以借助matplotlib库,来将我们保存在csv文件中的数据进行可视化的展示,然我们能够更直观的看到数据的一个分部趋势</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 绘制python薪资的频率直方图并保存</span>
plt.hist(df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">月薪</span><span style="color: rgba(128, 0, 0, 1)">'</span>],bins=8,facecolor=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">#ff6700</span><span style="color: rgba(128, 0, 0, 1)">'</span>,edgecolor=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">blue</span><span style="color: rgba(128, 0, 0, 1)">'</span>)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> bins是默认的条形数目</span>
plt.xlabel(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">薪资(单位/千元)</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.ylabel(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">频数/频率</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.title(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python薪资直方图</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.savefig(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python薪资分布.jpg</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.show()</span></pre>
</div>
<p>运行结果如下:</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707165136874-1953364733.jpg" alt=""></p>
<h4>2、绘制python相关职位的地理位置饼状图</h4>
<p>通过地理python职位地理位置的分部我们可以大致了解IT行业主要集中分部在哪些城市,这样也更利于我们选择地域进行选择性就业,可以获得更多的面试机会等,参数可自行调试,或根据需要添加。</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 绘制饼状图并保存</span>
city = df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">城市</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].value_counts()
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(type(city))
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(len(city))</span>
label =<span style="color: rgba(0, 0, 0, 1)"> city.keys()
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(label)
city_list </span>=<span style="color: rgba(0, 0, 0, 1)"> []
count </span>=<span style="color: rgba(0, 0, 0, 1)"> 0
n </span>= 1<span style="color: rgba(0, 0, 0, 1)">
distance </span>=<span style="color: rgba(0, 0, 0, 1)"> []
</span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> city:

    city_list.append(i)
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">列表长度</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">, len(city_list))
    count </span>+= 1
    <span style="color: rgba(0, 0, 255, 1)">if</span> count &gt; 5<span style="color: rgba(0, 0, 0, 1)">:
      n </span>+= 0.1<span style="color: rgba(0, 0, 0, 1)">
      distance.append(n)
    </span><span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
      distance.append(0)
plt.pie(city_list, labels</span>=label, labeldistance=1.2, autopct=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">%2.1f%%</span><span style="color: rgba(128, 0, 0, 1)">'</span>, pctdistance=0.6, shadow=True, explode=<span style="color: rgba(0, 0, 0, 1)">distance)
plt.axis(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">equal</span><span style="color: rgba(128, 0, 0, 1)">'</span>)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 使饼图为正圆形</span>
plt.legend(loc=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">upper left</span><span style="color: rgba(128, 0, 0, 1)">'</span>, bbox_to_anchor=(-0.1, 1<span style="color: rgba(0, 0, 0, 1)">))
plt.savefig(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python地理位置分布图.jpg</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.show()</span></pre>
</div>
<p>运行结果如下:</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707165954331-743290671.jpg" alt=""></p>
<h4>3、绘制基于pyechart的城市分布柱状图</h4>
<p>pycharts是python中调用百度基于js开发的echarts接口,也可以对数据进行各种可视化操作,更多数据可视化图形展示,可参考echarts官网:https://www.echartsjs.com/,echarts官网提供了各种实例供我们参考,如折线图、柱状图、饼图、路径图、树图等等,基于pyecharts的文档可参考以下官网:https://pyecharts.org/#/,更多用法也可自行百度网络资源</p>
<div class="cnblogs_code">
<pre>city = df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">城市</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].value_counts()
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(type(city))
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(city)
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(len(city))</span>
<span style="color: rgba(0, 0, 0, 1)">
keys </span>= city.index<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 等价于keys = city.keys()</span>
values =<span style="color: rgba(0, 0, 0, 1)"> city.values
</span><span style="color: rgba(0, 0, 255, 1)">from</span> pyecharts <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Bar

bar </span>= Bar(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">python职位的城市分布图</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
bar.add(</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">城市</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">, keys, values)
bar.print_echarts_options()</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 该行只为了打印配置项,方便调试时使用</span>
bar.render(path=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a.html</span><span style="color: rgba(128, 0, 0, 1)">'</span>)</pre>
</div>
<p>运行结果如下:</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707171056457-532823023.png" alt=""></p>
<p>&nbsp;</p>
<h4>4、绘制python福利相关的词云</h4>
<p>词云图又叫文字云,是对文本数据中出现频率较高的关键词予以视觉上的突出,形成"关键词的渲染"就类似云一样的彩色图片,从而过滤掉大量的文本信息,,使人一眼就可以领略文本数据的主要表达意思。利用jieba分词和词云生成WorldCloud(可自定义背景),下面就是对python相关职位的福利做了一个词云的展示,可以更直观的看到大多数公司的福利待遇集中在哪些地方</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 绘制福利待遇的词云</span>
text = <span style="color: rgba(128, 0, 0, 1)">''</span>
<span style="color: rgba(0, 0, 255, 1)">for</span> line <span style="color: rgba(0, 0, 255, 1)">in</span> df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公司福利</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]:
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> len(eval(line)) ==<span style="color: rgba(0, 0, 0, 1)"> 0:
      </span><span style="color: rgba(0, 0, 255, 1)">continue</span>
    <span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> word <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> eval(line):
            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(word)</span>
            text +=<span style="color: rgba(0, 0, 0, 1)"> word

cut_word </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">.join(jieba.cut(text))
word_background </span>= imread(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公主.jpg</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
cloud </span>=<span style="color: rgba(0, 0, 0, 1)"> WordCloud(
    font_path</span>=r<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">C:\Windows\Fonts\simfang.ttf</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
    background_color</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">black</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
    mask</span>=<span style="color: rgba(0, 0, 0, 1)">word_background,
    max_words</span>=500<span style="color: rgba(0, 0, 0, 1)">,
    max_font_size</span>=100<span style="color: rgba(0, 0, 0, 1)">,
    width</span>=400<span style="color: rgba(0, 0, 0, 1)">,
    height</span>=800<span style="color: rgba(0, 0, 0, 1)">

)
word_cloud </span>=<span style="color: rgba(0, 0, 0, 1)"> cloud.generate(cut_word)
word_cloud.to_file(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">福利待遇词云.png</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.imshow(word_cloud)
plt.axis(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">off</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.show()</span></pre>
</div>
<p>运行结果如下:</p>
<p><img src="https://img2018.cnblogs.com/blog/1364097/201907/1364097-20190707171705249-1342332957.png" alt=""></p>
<h3>五、爬虫及可视化完整代码</h3>
<p>完整代码在下面,代码均测试可正常运行,感兴趣的小伙伴可去尝试和了解其中的使用方法,如运行或者模块安装等失败可以在评论区进行留言,让我们一同解决吧</p>
<p>如果你觉得对你有帮助可以点个赞哦,原创内容转载需说明出处!!!</p>
<h4>1、爬虫完整代码</h4>
<p>为了防止我们频繁请求一个网站被限制ip,我们在爬取每一页后选择睡一段时间,当然你也可以使用代理等其他方式自行实现</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> requests
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> math
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> time
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pandas as pd


</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_json(url, num):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    从指定的url中通过requests请求携带请求头和请求体获取网页中的信息,
    :return:
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
    url1 </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">https://www.lagou.com/jobs/list_python%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&amp;fromSearch=true&amp;suginput=</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
    headers </span>=<span style="color: rgba(0, 0, 0, 1)"> {
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">User-Agent</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Host</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">www.lagou.com</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Referer</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&amp;fromSearch=true&amp;suginput=</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">X-Anit-Forge-Code</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">0</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">X-Anit-Forge-Token</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">None</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">X-Requested-With</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">XMLHttpRequest</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
    }
    data </span>=<span style="color: rgba(0, 0, 0, 1)"> {
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">first</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">true</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">pn</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">: num,
      </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">kd</span><span style="color: rgba(128, 0, 0, 1)">'</span>: <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python工程师</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">}
    s </span>=<span style="color: rgba(0, 0, 0, 1)"> requests.Session()
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">建立session:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, s, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    s.get(url</span>=url1, headers=headers, timeout=3<span style="color: rgba(0, 0, 0, 1)">)
    cookie </span>=<span style="color: rgba(0, 0, 0, 1)"> s.cookies
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">获取cookie:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, cookie, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    res </span>= requests.post(url, headers=headers, data=data, cookies=cookie, timeout=3<span style="color: rgba(0, 0, 0, 1)">)
    res.raise_for_status()
    res.encoding </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
    page_data </span>=<span style="color: rgba(0, 0, 0, 1)"> res.json()
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">请求响应结果:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, page_data, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> page_data


</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_page_num(count):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    计算要抓取的页数,通过在拉勾网输入关键字信息,可以发现最多显示30页信息,每页最多显示15个职位信息
    :return:
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
    page_num </span>= math.ceil(count / 15<span style="color: rgba(0, 0, 0, 1)">)
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> page_num &gt; 30<span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">return</span> 30
    <span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> page_num


</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> get_page_info(jobs_list):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    获取职位
    :param jobs_list:
    :return:
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
    page_info_list </span>=<span style="color: rgba(0, 0, 0, 1)"> []
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> jobs_list:<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 循环每一页所有职位信息</span>
      job_info =<span style="color: rgba(0, 0, 0, 1)"> []
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">companyFullName</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">companyShortName</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">companySize</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">financeStage</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">district</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">positionName</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">workYear</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">education</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">salary</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">positionAdvantage</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">industryField</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">firstType</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">companyLabelList</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">secondType</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      job_info.append(i[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">city</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      page_info_list.append(job_info)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> page_info_list


</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> main():
    url </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)"> https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
    first_page </span>= get_json(url, 1<span style="color: rgba(0, 0, 0, 1)">)
    total_page_count </span>= first_page[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">'</span>][<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">positionResult</span><span style="color: rgba(128, 0, 0, 1)">'</span>][<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">totalCount</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]
    num </span>=<span style="color: rgba(0, 0, 0, 1)"> get_page_num(total_page_count)
    total_info </span>=<span style="color: rgba(0, 0, 0, 1)"> []
    time.sleep(</span>10<span style="color: rgba(0, 0, 0, 1)">)
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">python开发相关职位总数:{},总页数为:{}</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">.format(total_page_count, num))
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> num <span style="color: rgba(0, 0, 255, 1)">in</span> range(1, num + 1<span style="color: rgba(0, 0, 0, 1)">):
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 获取每一页的职位相关的信息</span>
      page_data = get_json(url, num)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 获取响应json</span>
      jobs_list = page_data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">'</span>][<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">positionResult</span><span style="color: rgba(128, 0, 0, 1)">'</span>][<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">result</span><span style="color: rgba(128, 0, 0, 1)">'</span>]<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 获取每页的所有python相关的职位信息</span>
      page_info =<span style="color: rgba(0, 0, 0, 1)"> get_page_info(jobs_list)
      </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">每一页python相关的职位信息:%s</span><span style="color: rgba(128, 0, 0, 1)">"</span> % page_info, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
      total_info </span>+=<span style="color: rgba(0, 0, 0, 1)"> page_info
      </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">已经爬取到第{}页,职位总数为{}</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">.format(num, len(total_info)))
      time.sleep(</span>20<span style="color: rgba(0, 0, 0, 1)">)
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将总数据转化为data frame再输出,然后在写入到csv各式的文件中</span>
      df = pd.DataFrame(data=<span style="color: rgba(0, 0, 0, 1)">total_info,
                        columns</span>=[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公司全名</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公司简称</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公司规模</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">融资阶段</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">区域</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">职位名称</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作经验</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">学历要求</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">薪资</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">职位福利</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">经营范围</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                                 </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">职位类型</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公司福利</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">第二职位类型</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">城市</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">])
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> df.to_csv('Python_development_engineer.csv', index=False)</span>
      <span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python相关职位信息已保存</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)


</span><span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 128, 1)">__name__</span> == <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">__main__</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">:
    main()</span></pre>
</div>
<h4>2、可视化完整代码</h4>
<p>数据可视化涉及到matplotlib、jieba、wordcloud、pyecharts、pylab、scipy等等模块的使用,读者可以自行了解各个模块的使用方法,和其中涉及的各种参数</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pandas as pd
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> matplotlib.pyplot as plt
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> statsmodels.api as sm
</span><span style="color: rgba(0, 0, 255, 1)">from</span> wordcloud <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> WordCloud
</span><span style="color: rgba(0, 0, 255, 1)">from</span> scipy.misc <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> imread
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> from imageio import imread</span>
<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> jieba
</span><span style="color: rgba(0, 0, 255, 1)">from</span> pylab <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> mpl

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 使用matplotlib能够显示中文</span>
mpl.rcParams[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">font.sans-serif</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = [<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">SimHei</span><span style="color: rgba(128, 0, 0, 1)">'</span>]<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 指定默认字体</span>
mpl.rcParams[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">axes.unicode_minus</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = False<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 解决保存图像是负号'-'显示为方块的问题</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">读取数据</span>
df = pd.read_csv(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Python_development_engineer.csv</span><span style="color: rgba(128, 0, 0, 1)">'</span>, encoding=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 进行数据清洗,过滤掉实习岗位</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)"> df.drop(df.str.contains('实习')].index, inplace=True)</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)"> print(df.describe())</span>


<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 由于csv文件中的字符是字符串形式,先用正则表达式将字符串转化为列表,在去区间的均值</span>
pattern = <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\d+</span><span style="color: rgba(128, 0, 0, 1)">'</span>
<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(df['工作经验'], '\n\n\n')</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)"> print(df['工作经验'].str.findall(pattern))</span>
df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作年限</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作经验</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].str.findall(pattern)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(type(df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作年限</span><span style="color: rgba(128, 0, 0, 1)">'</span>]), <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n\n\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
avg_work_year </span>=<span style="color: rgba(0, 0, 0, 1)"> []
count </span>=<span style="color: rgba(0, 0, 0, 1)"> 0
</span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">工作年限</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]:
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print('每个职位对应的工作年限',i)</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 如果工作经验为'不限'或'应届毕业生',那么匹配值为空,工作年限为0</span>
    <span style="color: rgba(0, 0, 255, 1)">if</span> len(i) ==<span style="color: rgba(0, 0, 0, 1)"> 0:
      avg_work_year.append(0)
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print('nihao')</span>
      count += 1
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 如果匹配值为一个数值,那么返回该数值</span>
    <span style="color: rgba(0, 0, 255, 1)">elif</span> len(i) == 1<span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print('hello world')</span>
      avg_work_year.append(int(<span style="color: rgba(128, 0, 0, 1)">''</span><span style="color: rgba(0, 0, 0, 1)">.join(i)))
      count </span>+= 1
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 如果匹配为一个区间则取平均值</span>
    <span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
      num_list </span>=
      avg_year </span>= sum(num_list) / 2<span style="color: rgba(0, 0, 0, 1)">
      avg_work_year.append(avg_year)
      count </span>+= 1
<span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(count)
df[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">avg_work_year</span><span style="color: rgba(128, 0, 0, 1)">'</span>] =<span style="color: rgba(0, 0, 0, 1)"> avg_work_year
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将字符串转化为列表,薪资取最低值加上区间值得25%,比较贴近现实</span>
df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">salary</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">薪资</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].str.findall(pattern)
</span><span style="color: rgba(0, 128, 0, 1)">#
</span>avg_salary_list =<span style="color: rgba(0, 0, 0, 1)"> []
</span><span style="color: rgba(0, 0, 255, 1)">for</span> k <span style="color: rgba(0, 0, 255, 1)">in</span> df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">salary</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]:
    int_list </span>=
    avg_salary </span>= int_list + (int_list - int_list) / 4<span style="color: rgba(0, 0, 0, 1)">
    avg_salary_list.append(avg_salary)
df[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">月薪</span><span style="color: rgba(128, 0, 0, 1)">'</span>] =<span style="color: rgba(0, 0, 0, 1)"> avg_salary_list
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> df.to_csv('python.csv', index=False)</span>


<span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">1、绘制python薪资的频率直方图并保存</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
plt.hist(df[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">月薪</span><span style="color: rgba(128, 0, 0, 1)">'</span>], bins=8, facecolor=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">#ff6700</span><span style="color: rgba(128, 0, 0, 1)">'</span>, edgecolor=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">blue</span><span style="color: rgba(128, 0, 0, 1)">'</span>)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> bins是默认的条形数目</span>
plt.xlabel(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">薪资(单位/千元)</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.ylabel(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">频数/频率</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.title(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python薪资直方图</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.savefig(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python薪资分布.jpg</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.show()

</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">2、绘制饼状图并保存</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
city </span>= df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">城市</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].value_counts()
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(type(city))
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(len(city))</span>
label =<span style="color: rgba(0, 0, 0, 1)"> city.keys()
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(label)
city_list </span>=<span style="color: rgba(0, 0, 0, 1)"> []
count </span>=<span style="color: rgba(0, 0, 0, 1)"> 0
n </span>= 1<span style="color: rgba(0, 0, 0, 1)">
distance </span>=<span style="color: rgba(0, 0, 0, 1)"> []
</span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> city:

    city_list.append(i)
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">列表长度</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">, len(city_list))
    count </span>+= 1
    <span style="color: rgba(0, 0, 255, 1)">if</span> count &gt; 5<span style="color: rgba(0, 0, 0, 1)">:
      n </span>+= 0.1<span style="color: rgba(0, 0, 0, 1)">
      distance.append(n)
    </span><span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
      distance.append(0)
plt.pie(city_list, labels</span>=label, labeldistance=1.2, autopct=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">%2.1f%%</span><span style="color: rgba(128, 0, 0, 1)">'</span>, pctdistance=0.6, shadow=True, explode=<span style="color: rgba(0, 0, 0, 1)">distance)
plt.axis(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">equal</span><span style="color: rgba(128, 0, 0, 1)">'</span>)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 使饼图为正圆形</span>
plt.legend(loc=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">upper left</span><span style="color: rgba(128, 0, 0, 1)">'</span>, bbox_to_anchor=(-0.1, 1<span style="color: rgba(0, 0, 0, 1)">))
plt.savefig(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">python地理位置分布图.jpg</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.show()

</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">3、绘制福利待遇的词云</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
text </span>= <span style="color: rgba(128, 0, 0, 1)">''</span>
<span style="color: rgba(0, 0, 255, 1)">for</span> line <span style="color: rgba(0, 0, 255, 1)">in</span> df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公司福利</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]:
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> len(eval(line)) ==<span style="color: rgba(0, 0, 0, 1)"> 0:
      </span><span style="color: rgba(0, 0, 255, 1)">continue</span>
    <span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> word <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> eval(line):
            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(word)</span>
            text +=<span style="color: rgba(0, 0, 0, 1)"> word

cut_word </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">.join(jieba.cut(text))
word_background </span>= imread(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">公主.jpg</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
cloud </span>=<span style="color: rgba(0, 0, 0, 1)"> WordCloud(
    font_path</span>=r<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">C:\Windows\Fonts\simfang.ttf</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
    background_color</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">black</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
    mask</span>=<span style="color: rgba(0, 0, 0, 1)">word_background,
    max_words</span>=500<span style="color: rgba(0, 0, 0, 1)">,
    max_font_size</span>=100<span style="color: rgba(0, 0, 0, 1)">,
    width</span>=400<span style="color: rgba(0, 0, 0, 1)">,
    height</span>=800<span style="color: rgba(0, 0, 0, 1)">

)
word_cloud </span>=<span style="color: rgba(0, 0, 0, 1)"> cloud.generate(cut_word)
word_cloud.to_file(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">福利待遇词云.png</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.imshow(word_cloud)
plt.axis(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">off</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
plt.show()

</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">4、基于pyechart的柱状图</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
city </span>= df[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">城市</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">].value_counts()
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(type(city))
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(city)
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> print(len(city))</span>
<span style="color: rgba(0, 0, 0, 1)">
keys </span>= city.index<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 等价于keys = city.keys()</span>
values =<span style="color: rgba(0, 0, 0, 1)"> city.values
</span><span style="color: rgba(0, 0, 255, 1)">from</span> pyecharts <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Bar

bar </span>= Bar(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">python职位的城市分布图</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
bar.add(</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">城市</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">, keys, values)
bar.print_echarts_options()</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 该行只为了打印配置项,方便调试时使用</span>
bar.render(path=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a.html</span><span style="color: rgba(128, 0, 0, 1)">'</span>)</pre>
</div>
<p>&nbsp;</p>

</div>
<div id="MySignature" role="contentinfo">
    python之基础知识大全<br><br>
来源:https://www.cnblogs.com/sui776265233/p/11146969.html
頁: [1]
查看完整版本: python爬取拉勾网数据并进行数据可视化