耳洞 posted on 2021-10-7 02:51:00

Python web scraping: running JS through Node.js

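<p><span style="font-size: 18px">A Python script can run JS code through the PyExecJS library (see: using execjs), but its performance is poor and it struggles to meet high-concurrency requirements.</span></p>
<p><span style="font-size: 18px">Node.js is a JavaScript runtime. It wraps the Google V8 engine and uses an event-driven, non-blocking I/O model, which keeps it lightweight and efficient and makes it easy to build fast, scalable network applications. We can therefore use Node.js to run the JS code.</span></p>
<h2>Approach:</h2>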
<ul>
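<li><span style="font-size: 18px">Create a JS file to hold the JS code extracted from the target site, and expose properties or methods through exports, or expose an object (holding several properties or methods) through module.exports.</span></li>
<li><span style="font-size: 18px">Create a server.js file that loads the JS file above with require, then use the Express framework to build a web application whose routes implement the JS logic.</span></li>
<li><span style="font-size: 18px">Start the web application with <span class="cnblogs_code">node server.js</span>, then have the Python script send GET/POST requests to it with requests and read the response data.</span></li>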
</ul>
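<p><span style="font-size: 18px">The third step can be sketched from the Python side as follows. This is a minimal illustration that uses only the standard library instead of requests; the port 8888, the /get_sign route, and the content parameter are the ones used in this article, while the helper names build_get, build_post, and send are hypothetical.</span></p>

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Route and port as defined in this article's server.js
BASE_URL = "http://localhost:8888/get_sign"


def build_get(content):
    """GET: the parameter travels in the query string."""
    return Request(BASE_URL + "?" + urlencode({"content": content}), method="GET")


def build_post(content):
    """POST: the parameter travels in a form-encoded request body."""
    body = urlencode({"content": content}).encode("ascii")
    return Request(BASE_URL, data=body, method="POST")


def send(request, timeout=5):
    """Send a prepared request to the running Node service and return the reply text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

<p><span style="font-size: 18px">With the service running, send(build_get("hello")) and send(build_post("hello")) should return the same sign string, since both routes call the same JS function.</span></p>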
<h2>Example: obtaining the sign parameter with the JS from the Baidu Translate case</h2>
<ul>
<li>
<h3>Preparation</h3>
<ul>
<li><span style="font-size: 18px">Install Node.js</span>
<ul>
<li><span style="font-size: 18px">Node.js installers and source code can be downloaded from https://nodejs.org/en/download/; older releases are at https://nodejs.org/dist/</span></li>
<li><span style="font-size: 18px">After installation, run <span class="cnblogs_code">node --version</span> in a terminal; if a version number is printed, the installation succeeded.</span></li>
</ul>
</li>
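<li><span style="font-size: 18px">Use the Taobao mirror</span>
<div class="cnblogs_code">
<pre>npm install -g cnpm --registry=https://registry.npm.taobao.org</pre>
</div>
<p><span style="font-size: 18px">From inside mainland China the official npm registry is very slow, so the Taobao npm mirror is recommended: the command above installs Taobao's cnpm command-line tool, which is then used in place of the default npm. Note that the registry.npm.taobao.org domain has since been retired in favor of https://registry.npmmirror.com, so the --registry URL may need updating.</span></p>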
</li>
<li><span style="font-size: 18px">Install the Express framework and save it to the dependency list</span>
<div class="cnblogs_code">
<pre>cnpm install express --save</pre>
</div>
<p><span style="font-size: 18px">The command above installs Express into the <strong>node_modules</strong> directory under the current directory, creating that directory if it does not exist; an express directory is created inside <strong>node_modules</strong>.</span></p>
</li>
<li><span style="font-size: 18px">Install the body-parser module</span>
<div class="cnblogs_code">
<pre>cnpm install body-parser --save</pre>
</div>
<p><span style="font-size: 18px">This module parses JSON, raw, text, and URL-encoded request bodies.</span></p>
</li>
</ul>
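<p><span style="font-size: 18px">Because the Python script depends on these tools, it can be handy to run the same version check from Python before going further. The sketch below assumes node and npm are expected on PATH; the helper name tool_version is hypothetical.</span></p>

```python
import shutil
import subprocess


def tool_version(name):
    """Return the output of `<name> --version`, or None if the tool is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    for tool in ("node", "npm"):
        print(tool, tool_version(tool) or "not found")
```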
</li>
<li>
<h3>Write the baiduTranslate.js file</h3>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> n(r, o) {
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> (<span style="color: rgba(0, 0, 255, 1)">var</span> t = 0; t &lt; o.length - 2; t += 3<span style="color: rgba(0, 0, 0, 1)">) {
            </span><span style="color: rgba(0, 0, 255, 1)">var</span> a = o.charAt(t + 2<span style="color: rgba(0, 0, 0, 1)">);
            a </span>= a &gt;= "a" ? a.charCodeAt(0) - 87<span style="color: rgba(0, 0, 0, 1)"> : Number(a),
            a </span>= "+" === o.charAt(t + 1) ? r &gt;&gt;&gt; a : r &lt;&lt;<span style="color: rgba(0, 0, 0, 1)"> a,
            r </span>= "+" === o.charAt(t) ? r + a &amp; 4294967295 : r ^<span style="color: rgba(0, 0, 0, 1)"> a
      }
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> r
    }

</span><span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> getSign(r) {
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> i = '320305.131321201'<span style="color: rgba(0, 0, 0, 1)">;
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/<span style="color: rgba(0, 0, 0, 1)">g);
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> (<span style="color: rgba(0, 0, 255, 1)">null</span> ===<span style="color: rgba(0, 0, 0, 1)"> o) {
      </span><span style="color: rgba(0, 0, 255, 1)">var</span> t =<span style="color: rgba(0, 0, 0, 1)"> r.length;
      t </span>&gt; 30 &amp;&amp; (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10<span style="color: rgba(0, 0, 0, 1)">))
    } </span><span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)"> {
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> (<span style="color: rgba(0, 0, 255, 1)">var</span> e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h &gt; C; C++<span style="color: rgba(0, 0, 0, 1)">)
            </span>"" !== e[C] &amp;&amp; f.push.apply(f, e[C].split(""<span style="color: rgba(0, 0, 0, 1)">)),
            C </span>!== h - 1 &amp;&amp;<span style="color: rgba(0, 0, 0, 1)"> f.push(o[C]);
      </span><span style="color: rgba(0, 0, 255, 1)">var</span> g =<span style="color: rgba(0, 0, 0, 1)"> f.length;
      g </span>&gt; 30 &amp;&amp; (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""<span style="color: rgba(0, 0, 0, 1)">))
    }
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> u = <span style="color: rgba(0, 0, 255, 1)">void</span> 0<span style="color: rgba(0, 0, 0, 1)">
      , l </span>= "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107<span style="color: rgba(0, 0, 0, 1)">);
    u </span>= <span style="color: rgba(0, 0, 255, 1)">null</span> !== i ? i : (i = window[l] || "") || ""<span style="color: rgba(0, 0, 0, 1)">;
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> (<span style="color: rgba(0, 0, 255, 1)">var</span> d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v &lt; r.length; v++<span style="color: rgba(0, 0, 0, 1)">) {
      </span><span style="color: rgba(0, 0, 255, 1)">var</span> A =<span style="color: rgba(0, 0, 0, 1)"> r.charCodeAt(v);
      </span>128 &gt; A ? S[c++] = A : (2048 &gt; A ? S[c++] = A &gt;&gt; 6 | 192 : (55296 === (64512 &amp; A) &amp;&amp; v + 1 &lt; r.length &amp;&amp; 56320 === (64512 &amp; r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 &amp; A) &lt;&lt; 10) + (1023 &amp; r.charCodeAt(++<span style="color: rgba(0, 0, 0, 1)">v)),
            </span>S[c++] = A &gt;&gt; 18 | 240<span style="color: rgba(0, 0, 0, 1)">,
            </span>S[c++] = A &gt;&gt; 12 &amp; 63 | 128) : S[c++] = A &gt;&gt; 12 | 224<span style="color: rgba(0, 0, 0, 1)">,
            </span>S[c++] = A &gt;&gt; 6 &amp; 63 | 128<span style="color: rgba(0, 0, 0, 1)">),
            </span>S[c++] = 63 &amp; A | 128<span style="color: rgba(0, 0, 0, 1)">)
    }
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> (<span style="color: rgba(0, 0, 255, 1)">var</span> p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b &lt; S.length; b++<span style="color: rgba(0, 0, 0, 1)">)
      p </span>+=<span style="color: rgba(0, 0, 0, 1)"> S[b],
            p </span>=<span style="color: rgba(0, 0, 0, 1)"> n(p, F);
    </span><span style="color: rgba(0, 0, 255, 1)">return</span> p =<span style="color: rgba(0, 0, 0, 1)"> n(p, D),
      p </span>^=<span style="color: rgba(0, 0, 0, 1)"> s,
    </span>0 &gt; p &amp;&amp; (p = (2147483647 &amp; p) + 2147483648<span style="color: rgba(0, 0, 0, 1)">),
      p </span>%= 1e6<span style="color: rgba(0, 0, 0, 1)">,
    p.toString() </span>+ "." + (p ^<span style="color: rgba(0, 0, 0, 1)"> m)
}

</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> getSign can be exposed in any of the following three ways</span>

<span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Option 1: expose getSign through exports</span>
exports.getSign =<span style="color: rgba(0, 0, 0, 1)"> getSign

</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Option 2: expose an object with a getSign method through module.exports</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)"> module.exports = {</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)">   getSign:getSign</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)"> }</span>

<span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Option 3: shorthand for option 2, usable whenever the property name matches the function name; separate multiple methods with commas</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)"> module.exports = {</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)">   getSign</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)"> }</span></pre>
</div>
<p><span style="font-size: 18px">This JS file holds the extracted JS code and exposes the getSign method.</span></p>
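<p><span style="font-size: 18px">To see what the extracted JS actually computes, and to cross-check the replies of the Node service, the algorithm can be ported to Python. The sketch below is an illustrative port, not part of the original workflow: the names to_int32, mix, and get_sign are hypothetical, the constant is the value of i in baiduTranslate.js, and the port assumes the input contains no unpaired surrogates. The main difficulty is imitating the 32-bit wraparound of JS bitwise operators.</span></p>

```python
GTK = "320305.131321201"  # same constant as `i` in baiduTranslate.js


def to_int32(value):
    """Wrap a Python int to a signed 32-bit integer, as JS bitwise operators do."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def mix(r, ops):
    """Port of n(r, o): apply the shift/add/xor steps encoded three chars at a time."""
    for t in range(0, len(ops) - 2, 3):
        ch = ops[t + 2]
        shift = ord(ch) - 87 if ch >= "a" else int(ch)
        # '+' selects an unsigned right shift (>>>), anything else a left shift (<<)
        a = (r & 0xFFFFFFFF) >> shift if ops[t + 1] == "+" else to_int32(r << shift)
        # '+' selects 32-bit addition, anything else xor
        r = to_int32(r + a) if ops[t] == "+" else to_int32(r ^ a)
    return r


def get_sign(text, gtk=GTK):
    """Port of getSign(r) for text without unpaired surrogates."""
    if len(text) > 30:
        # Keep the first 10, middle 10, and last 10 characters
        mid = len(text) // 2
        text = text[:10] + text[mid - 5:mid + 5] + text[-10:]
    m, s = (int(part) for part in gtk.split("."))
    p = m
    for byte in text.encode("utf-8"):
        p = mix(p + byte, "+-a^+6")
    p = to_int32(mix(p, "+-3^+b+-f") ^ s)
    if p < 0:
        p = (p & 0x7FFFFFFF) + 0x80000000
    p %= 1_000_000
    return "{}.{}".format(p, p ^ m)


print(get_sign("hello"))  # 54706.276099
```

<p><span style="font-size: 18px">The value printed for hello matches the sign shown in the result section of this article, which suggests the port and the JS agree at least on short ASCII input.</span></p>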
</li>
<li>
<h3>server.js</h3>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Import dependencies</span>

<span style="color: rgba(0, 0, 255, 1)">var</span> express = require('express'<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(0, 0, 255, 1)">var</span> bodyParser = require('body-parser'<span style="color: rgba(0, 0, 0, 1)">);

</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Import the custom module</span>
<span style="color: rgba(0, 0, 255, 1)">var</span> baiduTranslate = require('./baiduTranslate'<span style="color: rgba(0, 0, 0, 1)">);

</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Create the application instance</span>
<span style="color: rgba(0, 0, 255, 1)">var</span> app =<span style="color: rgba(0, 0, 0, 1)"> express();

</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Parse form or JSON request bodies into objects; for a given request only the parser matching its Content-Type does any work</span>
app.use(bodyParser.urlencoded({extended:<span style="color: rgba(0, 0, 255, 1)">true</span><span style="color: rgba(0, 0, 0, 1)">}));
app.use(bodyParser.json());


</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Define a route</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)"> POST request</span>
app.post('/get_sign',<span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> (req,res) {
    </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Read the parameters passed in the request body</span>
    let result =<span style="color: rgba(0, 0, 0, 1)"> req.body;
    let content </span>=<span style="color: rgba(0, 0, 0, 1)"> result.content
    </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Call getSign from the baiduTranslate module with the parameter</span>
    result =<span style="color: rgba(0, 0, 0, 1)"> baiduTranslate.getSign(content)
    res.send(result.toString());
});

</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Define a route</span><span style="color: rgba(0, 128, 0, 1)">
//</span><span style="color: rgba(0, 128, 0, 1)"> GET request</span>
app.get('/get_sign',<span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> (req,res) {
    </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Read the query-string parameters</span>
    let result =<span style="color: rgba(0, 0, 0, 1)"> req.query;
    let content </span>=<span style="color: rgba(0, 0, 0, 1)"> result.content;
    </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Call getSign from the baiduTranslate module with the parameter</span>
    result =<span style="color: rgba(0, 0, 0, 1)"> baiduTranslate.getSign(content)
    res.send(result.toString());
})

</span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Start the service</span>
<span style="color: rgba(0, 0, 255, 1)">var</span> server = app.listen(8888,<span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> () {
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> host =<span style="color: rgba(0, 0, 0, 1)"> server.address().address
    </span><span style="color: rgba(0, 0, 255, 1)">var</span> port =<span style="color: rgba(0, 0, 0, 1)"> server.address().port
    console.log(</span>"Service started at http://%s:%s"<span style="color: rgba(0, 0, 0, 1)">, host, port)
})</span></pre>
</div>
<p>Future server.js files can follow this template directly: only the custom module being imported and the logic inside the route handlers need to change.</p>
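<p><span style="font-size: 18px">The two body-parser middlewares in server.js correspond to two ways a Python client can encode the POST body. The sketch below builds both payloads by hand so the difference is visible; the helper names form_payload and json_payload are hypothetical, and only the content field from this article is assumed.</span></p>

```python
import json
from urllib.parse import urlencode


def form_payload(content):
    """Headers and body of the kind parsed by bodyParser.urlencoded."""
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return headers, urlencode({"content": content}).encode("ascii")


def json_payload(content):
    """Headers and body of the kind parsed by bodyParser.json."""
    headers = {"Content-Type": "application/json"}
    return headers, json.dumps({"content": content}).encode("utf-8")
```

<p><span style="font-size: 18px">With requests, passing data={'content': ...} produces a form-encoded body and passing json={'content': ...} produces a JSON body; in both cases the route handler reads the value from req.body.content.</span></p>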
</li>
<li>
<h3>Python script test.py</h3>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">coding:utf-8</span>

<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> requests

url </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">http://localhost:8888/get_sign</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">
content </span>= input(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Enter the text to translate: </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

sign1 </span>= requests.post(url=url,data = {<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">:content}).text
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sign from the POST request: {}</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">.format(sign1))
sign2 </span>= requests.get(url=url,params={<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">:content}).text
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sign from the GET request: {}</span><span style="color: rgba(128, 0, 0, 1)">'</span>.format(sign2))</pre>
</div>
</li>
<li>
<h3>Before running the script, start the service from the command line in the directory containing server.js</h3>
<div class="cnblogs_code">
<pre>node server.js</pre>
</div>
</li>
<li>
<h3>Output after running the Python script:</h3>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">Enter the text to translate: hello
sign from the POST request: </span>54706.276099<span style="color: rgba(0, 0, 0, 1)">
sign from the GET request: </span>54706.276099</pre>
</div>
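<p><span style="font-size: 18px">The script only succeeds if the Node service is already listening. When the whole flow is driven from Python, one option is to launch the service as a child process and wait for its port before sending requests. This is a sketch under the assumption that node is on PATH and server.js sits in the working directory; the names start_service and wait_for_port are hypothetical.</span></p>

```python
import socket
import subprocess
import time


def start_service(script="server.js", cwd="."):
    """Launch `node server.js` in the background (assumes node is on PATH)."""
    return subprocess.Popen(["node", script], cwd=cwd)


def wait_for_port(host="localhost", port=8888, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); pause briefly and retry
            time.sleep(interval)
    return False
```

<p><span style="font-size: 18px">A typical use would be proc = start_service(), then sending requests only if wait_for_port() returns True, and finally calling proc.terminate() when the signs are no longer needed.</span></p>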
</li>
</ul><br><br>
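<p><span style="font-size: 18px">Since the motivation for moving to a Node service was concurrency, the client side can also issue its requests in parallel. The sketch below uses a thread pool and takes the actual request function as a parameter, so it makes no assumption about how the sign is fetched; the name sign_many and the default of 8 workers are hypothetical choices, not measured recommendations.</span></p>

```python
from concurrent.futures import ThreadPoolExecutor


def sign_many(contents, fetch_sign, max_workers=8):
    """Call fetch_sign(content) for every distinct item in parallel.

    Returns a dict mapping each content string to the sign it produced.
    """
    unique = list(dict.fromkeys(contents))  # drop duplicates, keep order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        signs = list(pool.map(fetch_sign, unique))
    return dict(zip(unique, signs))
```

<p><span style="font-size: 18px">With the script from this article, fetch_sign could be lambda c: requests.get(url=url, params={'content': c}).text, so that each worker thread sends one GET request to the /get_sign route.</span></p>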
Source: https://www.cnblogs.com/eliwang/p/15374855.html