Python crawlers: running JavaScript through Node.js
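<p><span style="font-size: 18px">A Python script can run JavaScript through the PyExecJS library (see the earlier post on using execjs), but PyExecJS performs poorly and struggles to meet high-concurrency requirements.</span></p><p><span style="font-size: 18px">Node.js is a JavaScript runtime that wraps Google's V8 engine. Its event-driven, non-blocking I/O model keeps it lightweight and efficient, which makes it well suited to fast, scalable network applications, so we can use Node.js to execute the JavaScript instead.</span></p>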
<h2>Approach:</h2>
<ul>
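<li><span style="font-size: 18px">Create a JS file that holds the JavaScript extracted from the target site, and expose what you need either by attaching properties or methods to exports, or by assigning an object (which can carry several properties or methods) to module.exports</span></li>
<li><span style="font-size: 18px">Create a server.js file that loads that JS file with require, builds a web application with the Express framework, and defines routes that call the JavaScript logic</span></li>
<li><span style="font-size: 18px">Start the web application with <span class="cnblogs_code">node server.js</span>, then have the Python script send GET/POST requests to it with requests and read the response data</span></li>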
</ul>
<h2>Example: obtaining the sign parameter from the Baidu Translate JavaScript</h2>
<ul>
<li>
<h3>Preparation</h3>
<ul>
<li><span style="font-size: 18px">Install Node.js</span>
<ul>
<li><span style="font-size: 18px">Node.js installers and source code can be downloaded from https://nodejs.org/en/download/; earlier releases are available at https://nodejs.org/dist/</span></li>
<li><span style="font-size: 18px">After installing, run <span class="cnblogs_code">node --version</span> in a terminal; if a version number is printed, the installation succeeded</span></li>
</ul>
</li>
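<li><span style="font-size: 18px">Use the Taobao mirror</span>
<div class="cnblogs_code">
<pre>npm install -g cnpm --registry=https://registry.npmmirror.com</pre>
</div>
<p><span style="font-size: 18px">Installing directly from the official npm registry is very slow from within mainland China, so the Taobao NPM mirror is recommended: use its cnpm command-line tool in place of the default npm. The mirror's original address, https://registry.npm.taobao.org, has been retired in favour of https://registry.npmmirror.com.</span></p>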
</li>
<li><span style="font-size: 18px">Install the Express framework and save it to the dependency list</span>
<div class="cnblogs_code">
<pre>cnpm install express --save</pre>
</div>
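<p><span style="font-size: 18px">This command installs Express into the <strong>node_modules</strong> directory under the current directory, creating <strong>node_modules</strong> first if it does not exist; an express directory is created inside it automatically</span></p>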
</li>
<li><span style="font-size: 18px">Install the body-parser module</span>
<div class="cnblogs_code">
<pre>cnpm install body-parser --save</pre>
</div>
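<p><span style="font-size: 18px">This module parses JSON, raw, text and URL-encoded request bodies. Express 4.16 and later also ship equivalent built-in parsers, express.json() and express.urlencoded().</span></p>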
</li>
</ul>
</li>
<li>
<h3>Write baiduTranslate.js</h3>
<div class="cnblogs_code">
<pre>// Apply the operations encoded in o to r, three characters per step
function n(r, o) {
    for (var t = 0; t < o.length - 2; t += 3) {
        var a = o.charAt(t + 2);
        a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
        a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
        r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
    }
    return r
}
function getSign(r) {
    // Seed that the page code would otherwise read from window[l], hard-coded here
    var i = '320305.131321201';
    var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
    if (null === o) {
        // No surrogate pairs: keep only the head, middle and tail of long input
        var t = r.length;
        t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
    } else {
        // Rebuild the text as a list of characters, keeping surrogate pairs whole
        for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
            "" !== e[C] && f.push.apply(f, e[C].split("")),
            C !== h - 1 && f.push(o[C]);
        var g = f.length;
        g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
    }
    var u = void 0
      , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107); // "gtk"
    // i is a non-null string, so the window[l] fallback is never evaluated under Node.js
    u = null !== i ? i : (i = window[l] || "") || "";
    // Encode the text as UTF-8 bytes in S
    for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
        var A = r.charCodeAt(v);
        128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
        S[c++] = A >> 18 | 240,
        S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
        S[c++] = A >> 6 & 63 | 128),
        S[c++] = 63 & A | 128)
    }
    // Fold every byte into p, mixing with n() after each one
    for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
        p += S[b],
        p = n(p, F);
    return p = n(p, D),
    p ^= s,
    0 > p && (p = (2147483647 & p) + 2147483648),
    p %= 1e6,
    p.toString() + "." + (p ^ m)
}
// Any one of the following three options exposes getSign
// Option 1: attach getSign to exports
exports.getSign = getSign
// Option 2: assign an object that has a getSign method to module.exports
// module.exports = {
//     getSign: getSign
// }
// Option 3: shorthand for option 2, available whenever the property name matches
// the function name; separate multiple methods with commas
// module.exports = {
//     getSign
// }</pre>
</div>
<p><span style="font-size: 18px">This file holds the extracted JavaScript and exposes the getSign method</span></p>
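<p>The helper <code>n</code> is easier to follow once its second argument is read as a tiny instruction string. Each group of three characters is one step: the third character is a shift amount (a digit, or a letter starting at "a" for 10), the second chooses an unsigned right shift ("+") or a left shift, and the first chooses between adding the shifted value ("+") or XOR-ing it. The sketch below copies <code>n</code> and compares it with <code>applyOps</code>, a hypothetical, more readable rewrite that is not part of the original code.</p>

```javascript
// Helper as it appears in baiduTranslate.js.
function n(r, o) {
    for (var t = 0; t < o.length - 2; t += 3) {
        var a = o.charAt(t + 2);
        a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
        a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
        r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
    }
    return r
}

// Hypothetical readable equivalent, written for illustration.
function applyOps(value, ops) {
    for (let i = 0; i + 2 < ops.length; i += 3) {
        const [combine, direction, amountChar] = ops.slice(i, i + 3);
        // "a" has char code 97, so "a" -> 10, "b" -> 11, and so on.
        const amount = amountChar >= 'a'
            ? amountChar.charCodeAt(0) - 87
            : Number(amountChar);
        const shifted = direction === '+' ? value >>> amount : value << amount;
        value = combine === '+'
            ? (value + shifted) & 0xFFFFFFFF
            : value ^ shifted;
    }
    return value;
}

// "+-a" adds (value << 10); "^+6" then XORs with (value >>> 6).
console.log(n(1, "+-a^+6"));        // 1041
console.log(applyOps(1, "+-a^+6")); // 1041
```

<p>The two instruction strings that getSign builds from character codes decode to "+-a^+6" and "+-3^+b+-f".</p>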
</li>
<li>
<h3>Write server.js</h3>
<div class="cnblogs_code">
<pre>// Load dependencies
var express = require('express');
var bodyParser = require('body-parser');
// Load the custom module
var baiduTranslate = require('./baiduTranslate');
// Create the application instance
var app = express();
// Parse form-encoded or JSON request bodies into req.body; for any given request,
// only the parser that matches its Content-Type header does the parsing
app.use(bodyParser.urlencoded({extended: true}));
app.use(bodyParser.json());
// Route for POST requests
app.post('/get_sign', function (req, res) {
    // Read the parameter passed in the request body
    let content = req.body.content;
    // Call getSign from the baiduTranslate module with that parameter
    let result = baiduTranslate.getSign(content);
    res.send(result.toString());
});
// Route for GET requests
app.get('/get_sign', function (req, res) {
    // Read the parameter passed in the query string
    let content = req.query.content;
    // Call getSign from the baiduTranslate module with that parameter
    let result = baiduTranslate.getSign(content);
    res.send(result.toString());
});
// Start the service
var server = app.listen(8888, function () {
    var host = server.address().address;
    var port = server.address().port;
    console.log("Service started, listening at http://%s:%s", host, port);
});</pre>
</div>
<p>For future server.js files you can reuse this template directly, changing only the custom module that is required and the logic inside the route handlers</p>
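<p>The route handlers pass content straight to getSign, so a request that omits the content parameter makes getSign throw and Express answers with a 500 error. One possible guard is sketched below, assuming a 400 response is the desired behaviour. <code>buildSignResponse</code> is a hypothetical helper, not part of Express or of the original template, and <code>stubSign</code> merely stands in for baiduTranslate.getSign so the sketch runs on its own.</p>

```javascript
// Hypothetical helper: validate the parameters and describe the response,
// without depending on Express so that it can be exercised in isolation.
function buildSignResponse(params, getSign) {
    const content = params && params.content;
    // Rejects a missing value, an empty string, and repeated query
    // parameters (which Express delivers as an array).
    if (typeof content !== 'string' || content.length === 0) {
        return { status: 400, body: 'missing "content" parameter' };
    }
    return { status: 200, body: String(getSign(content)) };
}

// Stand-in for baiduTranslate.getSign, used only for this demonstration.
function stubSign(text) {
    return text.length + '.0';
}

console.log(buildSignResponse({ content: 'hello' }, stubSign));
console.log(buildSignResponse({}, stubSign));

// Inside an Express route the helper could be used like this:
//   app.post('/get_sign', function (req, res) {
//       const r = buildSignResponse(req.body, baiduTranslate.getSign);
//       res.status(r.status).send(r.body);
//   });
```

<p>Keeping the check in a plain function means the same code can serve both the GET route (with req.query) and the POST route (with req.body).</p>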
</li>
<li>
<h3>Python script: test.py</h3>
<div class="cnblogs_code">
<pre># coding: utf-8
import requests

url = 'http://localhost:8888/get_sign'
content = input('Enter the text to translate: ')
# Send the text in a form-encoded POST body
sign1 = requests.post(url=url, data={'content': content}).text
print('sign from POST request: {}'.format(sign1))
# Send the text as a query-string parameter
sign2 = requests.get(url=url, params={'content': content}).text
print('sign from GET request: {}'.format(sign2))</pre>
</div>
</li>
<li>
<h3>Start the service from the command line, in the directory that contains server.js</h3>
<div class="cnblogs_code">
<pre>node server.js</pre>
</div>
</li>
<li>
<h3>Output after running the Python script:</h3>
<div class="cnblogs_code">
<pre>Enter the text to translate: hello
sign from POST request: 54706.276099
sign from GET request: 54706.276099</pre>
</div>
</li>
</ul><br><br>
Source: https://www.cnblogs.com/eliwang/p/15374855.html