东方珍珠 posted on 2016-3-11 03:27:00

Scraping Lagou's .NET job listings for Suzhou and Shanghai with Node.js

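<p>I recently started job hunting. I'm based in Suzhou, and after interviewing at several companies with nothing to show for it, I'm pretty discouraged. Searching Lagou for the city Suzhou with the keyword .NET turns up only about 80 positions in total, and once I filter by salary there are hardly any left worth applying to. On top of that, Suzhou housing prices have been shooting up lately and the mortgage pressure is heavy, so I'm thinking about moving to Shanghai. With some idle time on my hands I wrote a small crawler, scraped the .NET job listings for Suzhou and Shanghai, and did a simple comparison.</p>
<p>Yes, .NET is what I'm good at, so why Node.js? A few days ago a company offered me a chance to switch to Node.js, so I used this as practice. That opportunity later fell through too, but I still spent two evenings finishing it. I've only just started learning, so the code is ugly; please go easy on me!</p>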
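<h3>1: How to scrape Lagou's data</h3>
<p>This turned out to be very simple. I had assumed I would need regular expressions to parse the HTML, but Lagou's pagination is served by an AJAX endpoint that can be called directly over HTTP. Open Chrome's DevTools (F12) and it is obvious at a glance.</p>
<p>Here is the Node.js code that simulates the paging request:</p>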
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">var</span> getData = <span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> (kd,city,pn) {
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> mongo = require('./mongo'<span style="color: rgba(0, 0, 0, 1)">);
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> http = require('http'<span style="color: rgba(0, 0, 0, 1)">);
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> queryString = require('querystring'<span style="color: rgba(0, 0, 0, 1)">);

    </span><span style="color: rgba(0, 0, 255, 1)">var</span> postData=<span style="color: rgba(0, 0, 0, 1)">queryString.stringify({
      </span>'pn'<span style="color: rgba(0, 0, 0, 1)">:pn,
      </span>'kd'<span style="color: rgba(0, 0, 0, 1)">:kd,
      </span>'first':<span style="color: rgba(0, 0, 255, 1)">false</span><span style="color: rgba(0, 0, 0, 1)">
    });

    </span><span style="color: rgba(0, 0, 255, 1)">var</span> options =<span style="color: rgba(0, 0, 0, 1)"> {
      hostname:</span>'www.lagou.com'<span style="color: rgba(0, 0, 0, 1)">,
      method:</span>'POST'<span style="color: rgba(0, 0, 0, 1)">,
      path:</span>'/jobs/positionAjax.json?px=default&amp;city='+<span style="color: rgba(0, 0, 0, 1)">encodeURIComponent(city), // URL-encode the Chinese city name
      headers: {
      </span>'Content-Type': 'application/x-www-form-urlencoded'<span style="color: rgba(0, 0, 0, 1)">,
      </span>'Content-Length'<span style="color: rgba(0, 0, 0, 1)">: Buffer.byteLength(postData)
      }
    };
   
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> postResult = ''<span style="color: rgba(0, 0, 0, 1)">;
   
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> req = http.request(options,(res)=&gt;<span style="color: rgba(0, 0, 0, 1)">{
      console.log(`STATUS:${res.statusCode}`);
      res.setEncoding(</span>'utf8'<span style="color: rgba(0, 0, 0, 1)">);
      res.on(</span>'data',(chunk)=&gt;<span style="color: rgba(0, 0, 0, 1)">{
            postResult</span>+=<span style="color: rgba(0, 0, 0, 1)">chunk;
      });
      res.on(</span>'end',()=&gt;<span style="color: rgba(0, 0, 0, 1)">{
            console.log(`RESULT:${postResult}`);
            </span><span style="color: rgba(0, 0, 255, 1)">var</span> jsonObj =<span style="color: rgba(0, 0, 0, 1)">JSON.parse(postResult);
            </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)">insert into db</span>
            jsonObj.content.result.forEach((item)=&gt;<span style="color: rgba(0, 0, 0, 1)">{
                </span><span style="color: rgba(0, 0, 255, 1)">var</span> salary =<span style="color: rgba(0, 0, 0, 1)"> item.salary;
                </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)">split a range such as "3k-6k" into numeric min/max for easier statistics</span>
                <span style="color: rgba(0, 0, 255, 1)">var</span> arr = salary.split('-'<span style="color: rgba(0, 0, 0, 1)">);
                </span><span style="color: rgba(0, 0, 255, 1)">var</span> min = arr[0].substring(0,arr[0].indexOf('k'<span style="color: rgba(0, 0, 0, 1)">));
                </span><span style="color: rgba(0, 0, 255, 1)">var</span> max = arr.length&gt;1? arr[1].substring(0,arr[1].indexOf('k'<span style="color: rgba(0, 0, 0, 1)">)):min;
                item.salaryMin </span>=<span style="color: rgba(0, 0, 0, 1)"> parseInt(min);
                item.salaryMax </span>=<span style="color: rgba(0, 0, 0, 1)"> parseInt(max);
               
                mongo.save(city,item);
            });
            </span><span style="color: rgba(0, 0, 255, 1)">if</span>(jsonObj.content.hasNextPage&amp;&amp;jsonObj.content.totalPageCount&gt;<span style="color: rgba(0, 0, 0, 1)">pn){
                getData(kd,city,pn</span>+1<span style="color: rgba(0, 0, 0, 1)">);
            }
      });
    });

    // register the error handler on the request itself, outside the response callback
    req.on(</span>'error',(e)=&gt;<span style="color: rgba(0, 0, 0, 1)">{
        console.log(`problem with request:${e.message}`);
    });

    req.write(postData);
    req.end();
    console.log(`start to get data. pn:${pn} city:${city} kd:${kd}`);
};

exports.run </span>= getData;</pre>
</div>
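<p>The salary-splitting step inside the request callback can also be pulled out into a small standalone function, which makes it easier to try on sample strings. The sketch below is only an illustration: <code>parseSalary</code> is a hypothetical name that does not appear in the original project, and it assumes salary strings look like "3k-6k" or a single value such as "10k以上".</p>

```javascript
// Hypothetical helper (not part of the original project): turn a Lagou-style
// salary string such as "3k-6k" into numeric bounds, in thousands.
function parseSalary(salary) {
  // Collect every run of digits that is followed by "k" or "K".
  const matches = String(salary).match(/\d+(?=\s*k)/gi) || [];
  const numbers = matches.map((m) => parseInt(m, 10));
  if (numbers.length === 0) {
    return { min: NaN, max: NaN }; // unrecognised format
  }
  const min = numbers[0];
  // A single value has no upper bound; reuse min, as the code above does.
  const max = numbers.length > 1 ? numbers[1] : min;
  return { min, max };
}

console.log(parseSalary('3k-6k'));
```

<p>Matching digits in front of a "k" rather than calling <code>substring</code> on each half also tolerates an upper-case "K", which would otherwise produce <code>NaN</code>.</p>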
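<h3>2: Where to store the data</h3>
<p>Lagou's paging endpoint returns JSON objects, so storing them in MongoDB is naturally the simplest option.</p>
<p>Below is the MongoDB wrapper:</p>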
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">var</span> save=<span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> (city,jsonObj) {
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> Db = require('mongodb'<span style="color: rgba(0, 0, 0, 1)">).Db;
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> Server = require('mongodb'<span style="color: rgba(0, 0, 0, 1)">).Server;

    </span><span style="color: rgba(0, 0, 255, 1)">var</span> db = <span style="color: rgba(0, 0, 255, 1)">new</span> Db('test',<span style="color: rgba(0, 0, 255, 1)">new</span> Server('localhost',27017<span style="color: rgba(0, 0, 0, 1)">))

    db.open((err,db)</span>=&gt;<span style="color: rgba(0, 0, 0, 1)">{
      </span><span style="color: rgba(0, 0, 255, 1)">var</span> coll =<span style="color: rgba(0, 0, 0, 1)"> db.collection(city);
      coll.save(jsonObj,(err,r)</span>=&gt;<span style="color: rgba(0, 0, 0, 1)">{
            </span><span style="color: rgba(0, 0, 255, 1)">if</span>(!<span style="color: rgba(0, 0, 0, 1)">err){
               console.log(</span>'save to '+<span style="color: rgba(0, 0, 0, 1)">city);
            }
            
            db.close();
      });
      
    });
};

</span><span style="color: rgba(0, 0, 255, 1)">var</span> removeAll = <span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> (city,callback) {
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> Db = require('mongodb'<span style="color: rgba(0, 0, 0, 1)">).Db;
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> Server = require('mongodb'<span style="color: rgba(0, 0, 0, 1)">).Server;

    </span><span style="color: rgba(0, 0, 255, 1)">var</span> db = <span style="color: rgba(0, 0, 255, 1)">new</span> Db('test',<span style="color: rgba(0, 0, 255, 1)">new</span> Server('localhost',27017<span style="color: rgba(0, 0, 0, 1)">))

    db.open((err,db)</span>=&gt;<span style="color: rgba(0, 0, 0, 1)">{
      </span><span style="color: rgba(0, 0, 255, 1)">var</span> coll =<span style="color: rgba(0, 0, 0, 1)"> db.collection(city);
      coll.remove({},(err,numOfRows)</span>=&gt;<span style="color: rgba(0, 0, 0, 1)">{
            </span><span style="color: rgba(0, 0, 255, 1)">if</span>(!<span style="color: rgba(0, 0, 0, 1)">err){
                console.log(`${city} collection be removed. ${numOfRows}`);
            }
            db.close();
            callback(err);
      });
      
    });
};

</span><span style="color: rgba(0, 0, 255, 1)">var</span> readAll=<span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> (city,callback) {
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> Db = require('mongodb'<span style="color: rgba(0, 0, 0, 1)">).Db;
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> Server = require('mongodb'<span style="color: rgba(0, 0, 0, 1)">).Server;

    </span><span style="color: rgba(0, 0, 255, 1)">var</span> db = <span style="color: rgba(0, 0, 255, 1)">new</span> Db('test',<span style="color: rgba(0, 0, 255, 1)">new</span> Server('localhost',27017<span style="color: rgba(0, 0, 0, 1)">))

    db.open((err,db)</span>=&gt;<span style="color: rgba(0, 0, 0, 1)">{
      </span><span style="color: rgba(0, 0, 255, 1)">var</span> coll =<span style="color: rgba(0, 0, 0, 1)"> db.collection(city);
      </span><span style="color: rgba(0, 0, 255, 1)">var</span> cursor =<span style="color: rgba(0, 0, 0, 1)"> coll.find();
      cursor.toArray((err,results)</span>=&gt;<span style="color: rgba(0, 0, 0, 1)">{
            </span><span style="color: rgba(0, 0, 255, 1)">if</span>(!<span style="color: rgba(0, 0, 0, 1)">err){
                callback(results);
                </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)">db.close();      </span>
<span style="color: rgba(0, 0, 0, 1)">            }
            db.close();
      });
    });
}

exports.save </span>=<span style="color: rgba(0, 0, 0, 1)"> save;
exports.removeAll </span>=<span style="color: rgba(0, 0, 0, 1)"> removeAll;
exports.readAll </span>=<span style="color: rgba(0, 0, 0, 1)"> readAll;
</span></pre>
</div>
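<h3>3: How to display the data</h3>
<p>I use the HTTP server built into Node.js. When a request comes in, it reads an HTML file, fills the comparison data into the HTML text, and an HTML5 chart displays it.</p>
<p>Below is the server code:</p>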
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">var</span> http = require('http'<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(0, 0, 255, 1)">var</span> fs = require('fs'<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(0, 0, 255, 1)">var</span> stati = require('./statistics'<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(0, 0, 255, 1)">var</span> szStati = {text:'SuZhou'<span style="color: rgba(0, 0, 0, 1)">};
</span><span style="color: rgba(0, 0, 255, 1)">var</span> shStati = {text:'ShangHai'<span style="color: rgba(0, 0, 0, 1)">};

</span><span style="color: rgba(0, 0, 255, 1)">var</span> server=<span style="color: rgba(0, 0, 255, 1)">new</span><span style="color: rgba(0, 0, 0, 1)"> http.Server();
server.on(</span>'request',<span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)">(req,res){
    res.writeHead(</span>200,{'Content-Type':'text/html'<span style="color: rgba(0, 0, 0, 1)">});
   
    fs.readFile(</span>'./index.html','utf8',(err,data)=&gt;<span style="color: rgba(0, 0, 0, 1)">{
      </span><span style="color: rgba(0, 0, 255, 1)">if</span><span style="color: rgba(0, 0, 0, 1)"> (err) {
            </span><span style="color: rgba(0, 0, 255, 1)">throw</span><span style="color: rgba(0, 0, 0, 1)"> err;
      }
      console.log(data);
      </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> res.write(data);</span>
      <span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> res.end();</span>
      stati.statiSalary('苏州',(results)=&gt;<span style="color: rgba(0, 0, 0, 1)">{
            szStati.values </span>=<span style="color: rgba(0, 0, 0, 1)"> results;
            stati.statiSalary(</span>'上海',(results)=&gt;<span style="color: rgba(0, 0, 0, 1)">{
                shStati.values </span>=<span style="color: rgba(0, 0, 0, 1)"> results;
                </span><span style="color: rgba(0, 0, 255, 1)">var</span> series =<span style="color: rgba(0, 0, 0, 1)"> [szStati, shStati];
                </span><span style="color: rgba(0, 0, 255, 1)">var</span> strSeries =<span style="color: rgba(0, 0, 0, 1)"> JSON.stringify(series);
                console.log(strSeries);
               
                data </span>= data.replace('@series'<span style="color: rgba(0, 0, 0, 1)">,strSeries);
                console.log(data);
               
                res.write(data);
                res.end();
            });
      });
    });
});

server.listen(</span>3000<span style="color: rgba(0, 0, 0, 1)">);
console.log(</span>'http server started...port:3000');</pre>
</div>
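<p>The placeholder substitution in the server can be looked at in isolation. The following is a minimal sketch, assuming the template contains the literal marker <code>@series</code> as in the code above; <code>renderTemplate</code> is a hypothetical name and the numbers are made up for illustration, not taken from the real results.</p>

```javascript
// Hypothetical helper: inject chart data into a template that contains the
// literal placeholder "@series".
function renderTemplate(template, series) {
  const json = JSON.stringify(series);
  // split/join replaces every occurrence and treats the JSON as literal text.
  // String.prototype.replace with a string pattern only replaces the first
  // occurrence and interprets sequences such as "$&" in the replacement.
  return template.split('@series').join(json);
}

// Example with made-up values.
const page = renderTemplate('var series = @series;', [
  { text: 'SuZhou', values: [1, 2] },
  { text: 'ShangHai', values: [4, 8] },
]);
console.log(page);
```

<p>Whether this matters in practice depends on the template; with a single placeholder and purely numeric data, the plain <code>replace</code> call behaves the same.</p>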
<h3>4: Results</h3>
<p><img src="https://images2015.cnblogs.com/blog/36200/201603/36200-20160311030738225-470990979.png" alt="" width="908" height="528"></p>
<p>Positions are counted per salary band: 0-5k, 5-10k, 10-15k, 15-20k, 20-25k and &gt;25k.</p>
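<p>The <code>./statistics</code> module itself is not shown in this post, so the following is only a guess at what the counting could look like. It assumes each stored job carries the numeric <code>salaryMin</code> field set by the crawler, that a job is placed by its minimum salary, and that a value sitting exactly on a boundary goes into the higher band; <code>BANDS</code> and <code>countByBand</code> are hypothetical names.</p>

```javascript
// Upper bounds (in thousands) of each band; the last band is open-ended.
const BANDS = [
  { label: '0-5k', upper: 5 },
  { label: '5-10k', upper: 10 },
  { label: '10-15k', upper: 15 },
  { label: '15-20k', upper: 20 },
  { label: '20-25k', upper: 25 },
  { label: '>25k', upper: Infinity },
];

// Hypothetical helper: count jobs per band, in the same order as BANDS.
// Assumption: jobs are bucketed by salaryMin, boundaries go to the higher band.
function countByBand(jobs) {
  const counts = BANDS.map(() => 0);
  for (const job of jobs) {
    const value = job.salaryMin;
    if (!Number.isFinite(value)) continue; // skip salaries that failed to parse
    const index = BANDS.findIndex((band) => value < band.upper);
    counts[index] += 1;
  }
  return counts;
}

// Made-up sample data, not the real crawl.
console.log(countByBand([{ salaryMin: 3 }, { salaryMin: 12 }, { salaryMin: 30 }]));
```

<p>Running the same function over each city's collection gives two arrays of equal length that can be compared band by band.</p>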
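<p>0-5k: Shanghai has 4 times as many as Suzhou</p>
<p>5-10k: Shanghai has 4 times as many as Suzhou</p>
<p>10-15k: Shanghai has 9 times as many as Suzhou</p>
<p>15-20k: Shanghai has 12 times as many as Suzhou</p>
<p>20-25k: Shanghai has 17 times as many as Suzhou</p>
<p>&gt;25k: Shanghai has 26 times as many as Suzhou</p>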
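<p>As you can see, from the 10-15k band upward Shanghai has roughly ten times as many positions as Suzhou or more, and the higher the salary, the larger the multiple. The gap between Suzhou and Shanghai is clearly still very large. The Suzhou government keeps congratulating itself on how impressive it is in the internet industry and has set up a pile of incubators, but how many companies are actually worth showing off? You can count them on one hand. Suzhou is still far behind the first-tier cities of Beijing, Shanghai, Guangzhou and Shenzhen, and has a lot of work to do.</p>
<p>I'm afraid I too will have to leave home and look for my future in Shanghai.</p>
<p>I haven't yet learned how to push to GitHub from VS Code, so for now I'll just upload the code directly: lagouSpider.zip</p>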

</div>
<div id="MySignature" role="contentinfo">
    <div id="AllanboltSignature">      
<p id="PSignature" style="border-top: #e0e0e0 1px dashed; border-right: #e0e0e0 1px dashed; border-bottom: #e0e0e0 1px dashed; border-left: #e0e0e0 1px dashed; padding-top: 10px; padding-right: 10px; padding-bottom: 10px; padding-left: 10px; font-family: 微软雅黑; font-size: 11px">      
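QQ group: 1022985150 · VX (WeChat): kklldog · Let's discuss and learn .NET together
<br>
Author: Agile.Zhou (kklldog)            
<br>
Source: http://www.cnblogs.com/kklldog/
<br>Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise this notice must be retained and a link to the original article placed prominently on the page; otherwise the author reserves the right to pursue legal liability.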
</p>
</div><br><br>
Source: https://www.cnblogs.com/kklldog/p/lagouNodeSpider.html