Posted by 山水源 on 2019-9-4 13:18:00

A Detailed Introduction to NLTK's Features

<h1>Contents</h1>
<blockquote>
<p><strong><span style="font-size: 15px">1. Preface</span></strong></p>
<p><strong><span style="font-size: 15px">2. NLTK Modules</span></strong></p>
<p><strong><span style="font-size: 15px">3. Tokenizing Words and Sentences with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">4. NLTK and Stop Words</span></strong></p>
<p><strong><span style="font-size: 15px">5. Stemming with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">6. Part-of-Speech Tagging with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">7. Chunking with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">8. Chinking with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">9. Named Entity Recognition with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">10. Lemmatizing with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">11. NLTK Corpora</span></strong></p>
<p><strong><span style="font-size: 15px">12. NLTK and WordNet</span></strong></p>
<p><strong><span style="font-size: 15px">13. Text Classification with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">14. Converting Words to Features with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">15. The NLTK Naive Bayes Classifier</span></strong></p>
<p><strong><span style="font-size: 15px">16. Saving Classifiers with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">17. NLTK and Sklearn</span></strong></p>
<p><strong><span style="font-size: 15px">18. Combining Algorithms with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">19. Investigating Bias with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">20. Improving Training Data for Sentiment Analysis with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">21. Creating a Sentiment Analysis Module with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">22. Twitter Sentiment Analysis with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">23. Graphing Live Twitter Sentiment Analysis with NLTK</span></strong></p>
<p><strong><span style="font-size: 15px">24. The Stanford NER Tagger and Named Entity Recognition</span></strong></p>
<p><strong><span style="font-size: 15px">25. Testing the Accuracy of NLTK and the Stanford NER Tagger</span></strong></p>
<p><strong><span style="font-size: 15px">26. Testing the Speed of NLTK and the Stanford NER Tagger</span></strong></p>
<p><strong><span style="font-size: 15px">27. Creating a Readable List of Named Entities with BIO Tags</span></strong></p>
</blockquote>
<h1>1. Preface</h1>
<p>Several third-party Python libraries are available for natural language processing:</p>
<div>
<ul>
<li style="list-style-type: none">
<ul>
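<li>NLTK (the Natural Language Toolkit for Python) handles tasks such as tokenization, lemmatization, stemming, parsing, and POS tagging. The library has tools for almost every NLP task.</li>
<li>spaCy is NLTK's main competitor. The two libraries can be used for the same tasks.</li>
<li>Scikit-learn provides a large library for machine learning. It also includes tools for text preprocessing.</li>
<li>Gensim is a toolkit for topic modeling, vector space modeling, and document collection similarity.</li>
<li>Pattern is primarily a web mining module, so it supports natural language processing (NLP) only as a secondary task.</li>
<li>Polyglot is another Python toolkit for natural language processing (NLP). It is not very popular, but it can also be used for a variety of NLP tasks.</li>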
</ul>
</li>
</ul>
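<p style="margin-left: 30px">NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources (such as WordNet), along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.</p>
<h1 class="title">2. NLTK Modules</h1>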
<div class="article">
<div class="show-content" data-note-content="">
<div class="show-content-free">
<table style="height: 304px; width: 888px">
<thead>
<tr><th>Language processing task</th><th>NLTK modules</th><th>Functionality</th></tr>
</thead>
<tbody>
<tr>
<td><strong>Accessing and processing corpora</strong></td>
<td>nltk.corpus</td>
<td>Standardized interfaces to corpora and lexicons</td>
</tr>
<tr>
<td><strong>String processing</strong></td>
<td>nltk.tokenize, nltk.stem</td>
<td>Tokenizers, sentence tokenizers, stemmers</td>
</tr>
<tr>
<td><strong>Collocation discovery</strong></td>
<td>nltk.collocations</td>
<td>t-test, chi-squared, pointwise mutual information (PMI)</td>
</tr>
<tr>
<td><strong>Part-of-speech tagging</strong></td>
<td>nltk.tag</td>
<td>n-gram, backoff, Brill, HMM, TnT</td>
</tr>
<tr>
<td><strong>Classification</strong></td>
<td>nltk.classify, nltk.cluster</td>
<td>Decision tree, maximum entropy, naive Bayes, EM, k-means</td>
</tr>
<tr>
<td><strong>Chunking</strong></td>
<td>nltk.chunk</td>
<td>Regular expression, n-gram, named entity</td>
</tr>
<tr>
<td><strong>Parsing</strong></td>
<td>nltk.parse</td>
<td>Chart, feature-based, unification, probabilistic, dependency</td>
</tr>
<tr>
<td><strong>Semantic interpretation</strong></td>
<td>nltk.sem, nltk.inference</td>
<td>Lambda calculus, first-order logic, model checking</td>
</tr>
<tr>
<td><strong>Evaluation metrics</strong></td>
<td>nltk.metrics</td>
<td>Precision, recall, agreement coefficients</td>
</tr>
<tr>
<td><strong>Probability and estimation</strong></td>
<td>nltk.probability</td>
<td>Frequency distributions, smoothed probability distributions</td>
</tr>
<tr>
<td><strong>Applications</strong></td>
<td>nltk.app, nltk.chat</td>
<td>Graphical concordancer, parsers, WordNet browser, chatbots</td>
</tr>
<tr>
<td><strong>Linguistic fieldwork</strong></td>
<td>nltk.toolbox</td>
<td>Manipulating data in SIL Toolbox format</td>
</tr>
</tbody>
</table>
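<p>As a rough illustration of how a few of the modules in the table fit together, the sketch below touches <code>nltk.stem</code>, <code>nltk.probability</code>, and <code>nltk.metrics</code>. The sample tokens and the reference and predicted sets are made up for illustration, and the snippet assumes NLTK is installed; these particular calls should not need any extra data downloads.</p>

```python
from nltk.metrics import precision, recall
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer

# Made-up sample tokens (an assumption, for illustration only).
tokens = ["runs", "running", "run", "dog"]

# nltk.stem: reduce inflected words to a common stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# nltk.probability: count how often each stem occurs.
freq = FreqDist(stems)

# nltk.metrics: compare a predicted set against a reference set.
reference = {"dog", "cat", "bird"}
predicted = {"dog", "cat", "fish", "cow"}
p = precision(reference, predicted)  # share of predicted items that are correct
r = recall(reference, predicted)     # share of reference items that were found

print(stems)
print(freq.most_common(1))
print(p, r)
```

<p>Each module is usable on its own, which is why later sections can introduce them one at a time.</p>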
<strong><span style="font-size: 2em">3. Tokenizing Words and Sentences with NLTK</span></strong></div>
</div>
</div>
<div id="free-reward-panel" class="support-author">
<div>
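<p>The NLTK module is a massive toolkit, aimed at helping you with the entire natural language processing (NLP) methodology. NLTK helps with everything from splitting paragraphs into sentences, splitting sentences into words, recognizing the part of speech of those words, and highlighting the main subjects, to helping your machine understand what the text is about. In this series, we tackle the field of opinion mining, or sentiment analysis.</p>
<p>As we learn how to use NLTK for sentiment analysis, we will cover the following:</p>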
<ul>
<li style="list-style-type: none">
<ul>
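<li>Tokenizing: splitting a body of text into sentences and words.</li>
<li>Part-of-speech tagging</li>
<li>Machine learning with the Naive Bayes classifier</li>
<li>How to use Scikit Learn (sklearn) together with NLTK</li>
<li>Training classifiers with datasets</li>
<li>Performing live, streaming sentiment analysis with Twitter.</li>
<li>...and much more.</li>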
</ul>
</li>
</ul>
<p>To get started, you need the NLTK module as well as Python.</p>
<p>You will need NLTK 3. The easiest way to install the NLTK module is with <code>pip</code>.</p>
<p>On any platform, open <code>cmd.exe</code>, bash, or whatever shell you use, and type:</p>
<div class="cnblogs_code">
<pre>pip install nltk</pre>
</div>
<p>Next, we need to install some components for NLTK. Open Python in whatever way you normally do, and type:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
nltk.download()</span></pre>
</div>
<p>Unless you are running headless, a GUI will pop up, probably showing only red rather than green:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1554973/201909/1554973-20190904132403582-99244719.png" alt=""></p>
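<p style="text-align: left">Choose to download "all" for all packages, then click "Download". This gives you all of the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can instead pick and download only the packages you need. The NLTK module itself takes up about 7MB, and the entire <code>nltk_data</code> directory takes up about 1.8GB, which includes your chunkers, parsers, and corpora.</p>
<p>If you are running headless, for example on a VPS, you can install everything by starting Python and doing the following:</p>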
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
nltk.download()
# At the text downloader prompt, type:
#   d    (for download)
#   all  (to download everything)</span></pre>
</div>
<p>This downloads everything for you.</p>
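<p>If disk space is tight, individual packages can be fetched by identifier instead of downloading everything. The sketch below is one way to do that. The helper name <code>download_packages</code> is hypothetical, and the identifiers <code>punkt</code> and <code>stopwords</code> are assumptions that may differ between NLTK releases (some recent versions reportedly expect <code>punkt_tab</code>), so check the downloader's own package list. The demonstration uses a dry-run stand-in so that it runs without network access.</p>

```python
def download_packages(packages, downloader=None):
    """Download the named NLTK data packages and report which succeeded.

    `downloader` defaults to nltk.download; pass a stand-in to test offline.
    """
    if downloader is None:
        import nltk  # imported lazily so the helper can be exercised without NLTK
        downloader = nltk.download
    return {name: bool(downloader(name)) for name in packages}


# Package identifiers are assumptions; verify them against your NLTK version.
WANTED = ["punkt", "stopwords"]


def dry_run(name):
    """Stand-in downloader that only reports what would be fetched."""
    print("would download", name)
    return True


print(download_packages(WANTED, downloader=dry_run))
```

<p>Calling <code>download_packages(WANTED)</code> without the <code>downloader</code> argument hands each identifier to <code>nltk.download</code>, which needs network access.</p>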
<p>Now that you have everything you need, let's go over some basic vocabulary:</p>
<ul>
<li style="list-style-type: none">
<ul>
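<li>Corpus: a body of text (singular). Corpora is the plural. Example: <code>A collection of medical journals</code>.</li>
<li>Lexicon: words and their meanings. Example: an English dictionary. Keep in mind, however, that different fields have different lexicons. For example, to a financial investor, the first meaning of the word <code>Bull</code> is someone who is confident about the market, whereas in the "common English lexicon" the first meaning of the word is an animal. As such, financial investors, doctors, children, mechanics, and so on each have their own special lexicon.</li>
<li>Token: each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize a paragraph into sentences.</li>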
</ul>
</li>
</ul>
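<p>To make the three terms concrete, here is a minimal sketch in plain Python. The two-document corpus, the <code>lexicons</code> dictionary, and the helper <code>first_meaning</code> are all made up for illustration; they are not NLTK data structures.</p>

```python
# A corpus: a body of text, here a tiny made-up collection of documents.
corpus = [
    "The bull market continued this week.",
    "A bull stood in the field.",
]

# Domain lexicons: the same word can have a different first meaning per field.
lexicons = {
    "finance": {"bull": "an investor who is confident about the market"},
    "general": {"bull": "an adult male bovine animal"},
}


def first_meaning(word, domain):
    """Look up a word in one domain's lexicon; return None if it is absent."""
    return lexicons.get(domain, {}).get(word.lower())


# Tokens: the pieces produced by splitting text according to some rule.
# Whitespace splitting is the crudest possible rule, used only for illustration.
tokens = corpus[0].split()

print(first_meaning("Bull", "finance"))
print(first_meaning("Bull", "general"))
print(tokens)
```

<p>Note that the whitespace rule leaves the final period attached to <code>week.</code>, which is one reason a real tokenizer is worth using.</p>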
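<p>These are the terms you will hear most often when entering the field of natural language processing (NLP), and we will cover more vocabulary in due time. With that, let's look at an example of how to split something into tokens with the NLTK module.</p>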
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> sent_tokenize, word_tokenize<br>
EXAMPLE_TEXT </span>= <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard.</span><span style="color: rgba(128, 0, 0, 1)">"</span>
<span style="color: rgba(0, 0, 255, 1)">print</span>(sent_tokenize(EXAMPLE_TEXT))</pre>
</div>
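<p>At first, you may think that tokenizing by words or sentences is a rather trivial task. For many sentences it can be. A first step would likely be a simple <code>.split('. ')</code>, that is, splitting on a period followed by a space. Then maybe you would bring in some regular expressions to split on a period, a space, and then a capital letter. The problem is that things like <code>Mr. Smith</code>, and many others, would cause you trouble. Splitting by word is also a challenge, especially when considering contractions such as <code>we</code> and <code>we're</code>. NLTK saves you a great deal of time on this seemingly simple, yet actually very complex, operation.</p>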
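<p>The two hand-rolled approaches described in the previous paragraph can be sketched with the standard library alone. The regular expression is one plausible reading of "period, space, then capital letter" (widened to also accept <code>?</code> and <code>!</code>), not a pattern taken from NLTK. Both attempts break <code>Mr. Smith</code> apart, even though the text contains four sentences.</p>

```python
import re

EXAMPLE_TEXT = (
    "Hello Mr. Smith, how are you doing today? The weather is great, "
    "and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
)

# Attempt 1: split on a period followed by a space.
naive = EXAMPLE_TEXT.split(". ")

# Attempt 2: split on whitespace that follows ., ? or ! and precedes a capital.
pattern = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")
with_regex = pattern.split(EXAMPLE_TEXT)

for label, pieces in (("split", naive), ("regex", with_regex)):
    print(label, len(pieces))
    for piece in pieces:
        print("   ", repr(piece))
```

<p>The plain split never separates the question from the sentence after it and cuts after <code>Mr</code>; the regex version fixes the former but still cuts after <code>Mr.</code>, giving five pieces.</p>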
<p>The <code>sent_tokenize</code> call above outputs the text as a list of sentences, which you can iterate over with a <code>for</code> loop.</p>
<div class="cnblogs_code">
<pre>[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Hello Mr. Smith, how are you doing today?</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">The weather is great, and Python is awesome.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">The sky is pinkish-blue.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">You shouldn't eat cardboard.</span><span style="color: rgba(128, 0, 0, 1)">"</span>]</pre>
</div>
<p>So here we have created tokens, and each one is a sentence. This time, let's tokenize by word.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(word_tokenize(EXAMPLE_TEXT))

[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Hello</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Mr.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Smith</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">how</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">are</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">you</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">doing</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">today</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">?</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">The</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">weather</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">is</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span 
style="color: rgba(128, 0, 0, 1)">great</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">and</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Python</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">is</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">awesome</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">The</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sky</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">is</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">pinkish-blue</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">You</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">should</span><span 
style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">n't</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">eat</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">cardboard</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>]</pre>
</div>
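<p>There are a few things to note here. First, notice that punctuation is treated as a separate token. Also, notice that the word <code>shouldn't</code> is split into <code>should</code> and <code>n't</code>. Finally, notice that <code>pinkish-blue</code> is indeed treated as "one word", as it was meant to be. Pretty cool!</p>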
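<p>Because punctuation arrives as separate tokens, it is easy to drop when only the words matter. The sketch below works on a hard-coded slice of the token list printed above, so it runs without NLTK; the helper <code>is_punctuation</code> is a hypothetical name, and it only recognizes ASCII punctuation.</p>

```python
import string

# The last two sentences of the word_tokenize output shown above.
tokens = ["The", "sky", "is", "pinkish-blue", ".",
          "You", "should", "n't", "eat", "cardboard", "."]


def is_punctuation(token):
    """True when the token is non-empty and made up only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


words = [token for token in tokens if not is_punctuation(token)]

for word in words:
    print(word)
```

<p>Tokens such as <code>n't</code> and <code>pinkish-blue</code> survive because they contain letters; only the standalone periods are removed.</p>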
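<p>Now, looking at these tokenized words, we have to start thinking about what our next step might be. We start to ponder how we might derive meaning by looking at these words. We can clearly think of ways to put value on many words, but we also see a few words that are basically worthless. These are a form of "stop words", which we can also handle. That is what we will talk about in the next section.</p>
<h1>4. NLTK and Stop Words</h1>
<p>The idea of natural language processing is to do some form of analysis, or processing, so that the machine can understand, at least to some degree, what the text means, says, or implies.</p>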
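<p>This is obviously a massive challenge, but there are steps that anyone can follow. The main idea, however, is that computers simply do not understand words directly. Shockingly, neither do humans. In humans, memory is broken down into electrical signals in the brain, in the form of neural groups that fire in patterns. There is a lot about the brain that remains unknown, but the more we break the human brain down into its basic elements, the more basic we find those elements really are. Well, it turns out computers store information in a very similar way! If we are going to mimic how humans read and understand text, we need a method that is as close to that as possible. Generally, computers use numbers for everything, but in programming we often see binary signals used directly (<code>True</code> or <code>False</code>, which translate directly to 1 or 0, and which originate from the presence <code>(True, 1)</code> or absence <code>(False, 0)</code> of an electrical signal). To do this, we need a way to convert words into numeric values or signal patterns. The process of converting data into something a computer can understand is referred to as "pre-processing". One of the major forms of pre-processing is filtering out useless data. In natural language processing, useless words (data) are referred to as stop words.</p>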
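<p>One simple way to turn words into numbers or on/off signals is to give each distinct word an integer id and then record, for a new piece of text, whether each known word is present. This is only a sketch of the general idea, not the method used later in the series; the function names and sample tokens are made up for illustration.</p>

```python
def build_vocabulary(tokens):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def presence_vector(tokens, vocab):
    """One binary signal per vocabulary word: 1 if present in tokens, else 0."""
    present = set(tokens)
    # int(True) is 1 and int(False) is 0, matching the signal analogy.
    return [int(word in present) for word in vocab]


known_tokens = ["python", "is", "awesome", "the", "sky", "is", "blue"]
vocab = build_vocabulary(known_tokens)
ids = [vocab[token] for token in known_tokens]
vector = presence_vector(["the", "sky", "is", "awesome"], vocab)

print(vocab)
print(ids)
print(vector)
```

<p>The repeated word <code>is</code> maps to the same id both times, and the vector has one slot per vocabulary word regardless of how long the new text is.</p>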
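<p>Immediately, we can recognize that some words carry more meaning than others. We can also see that some words are just plain useless filler words. We use them in English, for example, to fill out sentences so they don't sound so strange. One of the most common unofficial examples of a useless word is <code>umm</code>. People stuff in <code>umm</code> frequently, some more than others. This word means nothing, unless of course we are searching for someone who is maybe lacking confidence, is confused, or hasn't practiced much speaking. We all do it; you can hear me saying <code>umm</code> or <code>uhh</code> in the videos plenty of ...uh... times. For most analysis, these words are useless.</p>
<p>We would not want these words taking up space in our database or taking up valuable processing time. As such, we call these words "stop words", because they are useless and we wish to do nothing with them. Another reading of the term "stop words" is more literal: words that we stop on.</p>
<p>For example, you may wish to stop immediately if you detect words that are commonly used sarcastically. Sarcastic words or phrases vary by lexicon and corpus. For now, we will treat stop words as words that carry no meaning, and we want to remove them.</p>
<p>You can do this easily by storing a list of words that you consider to be stop words. NLTK starts you off with a bunch of words that it considers to be stop words, which you can access via the NLTK corpus:</p>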
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span> stopwords</pre>
</div>
<p>Here is the list:</p>
<div class="cnblogs_code">
<pre>&gt;&gt;&gt; set(stopwords.words(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">english</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">))
{</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ourselves</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">hers</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">between</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">yourself</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">but</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">again</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">there</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">about</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">once</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">during</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">out</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">very</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">having</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">with</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">they</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">own</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">an</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">be</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">some</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">for</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">do</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">its</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">yours</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">such</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">into</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">of</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">most</span><span 
style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">itself</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">other</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">off</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">is</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">s</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">am</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">or</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">who</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">as</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">from</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">him</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">each</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 
0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">themselves</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">until</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">below</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">are</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">we</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">these</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">your</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">his</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">through</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">don</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">nor</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">me</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">were</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">her</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">more</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">himself</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">this</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">down</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">should</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">our</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">their</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">while</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">above</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">both</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">up</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">to</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ours</span><span style="color: rgba(128, 0, 0, 
1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">had</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">she</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">all</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">no</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">when</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">at</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">any</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">before</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">them</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">same</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">and</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">been</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">have</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span 
style="color: rgba(128, 0, 0, 1)">in</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">will</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">on</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">does</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">yourselves</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">then</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">that</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">because</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">what</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">over</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">why</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">so</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">can</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">did</span><span 
style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">not</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">now</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">under</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">he</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">you</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">herself</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">has</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">just</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">where</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">too</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">only</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">myself</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">which</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span 
style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">those</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">i</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">after</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">few</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">whom</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">t</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">being</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">if</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">theirs</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">my</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">against</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">by</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: 
rgba(128, 0, 0, 1)">doing</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">it</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">how</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">further</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">was</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">here</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">than</span><span style="color: rgba(128, 0, 0, 1)">'</span>}</pre>
</div>
<p>  Here is how you can use the <code>stop_words</code> set to remove stop words from your text:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> stopwords
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> word_tokenize

example_sent </span>= <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">This is a sample sentence, showing off the stop words filtration.</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">

stop_words </span>= set(stopwords.words(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">english</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">))

word_tokens </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(example_sent)

filtered_sentence </span>=<span style="color: rgba(0, 0, 0, 1)"> []

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> word_tokens:
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> w <span style="color: rgba(0, 0, 255, 1)">not</span> <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> stop_words:
        filtered_sentence.append(w)

</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(word_tokens)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(filtered_sentence)</pre>
</div>
<p>  Our output is:</p>
<div class="cnblogs_code">
<pre>[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">This</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">is</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sample</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sentence</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">showing</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">off</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">stop</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">words</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">filtration</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]
[</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">This</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sample</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">sentence</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">showing</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">stop</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">words</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">filtration</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>]</pre>
</div>
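<p>  Notice that <code>This</code> survives the filter above, because the stop word list is lowercase and the comparison is case-sensitive. The following is a minimal sketch of a case-insensitive filter written as a list comprehension. To keep it runnable without downloading NLTK data, <code>STOP_WORDS</code> and <code>simple_tokenize</code> are hypothetical stand-ins for <code>set(stopwords.words('english'))</code> and <code>word_tokenize</code>; they are assumptions for illustration, not part of NLTK.</p>

```python
import re

# Assumption: a tiny hand-picked subset standing in for
# set(stopwords.words('english')).
STOP_WORDS = {"this", "is", "a", "off", "the"}


def simple_tokenize(text):
    """Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


example_sent = "This is a sample sentence, showing off the stop words filtration."
tokens = simple_tokenize(example_sent)
filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

<p>  Lowercasing only for the comparison keeps the original casing of the tokens that remain, which can matter later for steps such as named entity recognition.</p>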
<p>  Our database thanks us: there is now less text to store and process. Another form of data preprocessing is stemming, which is what we discuss next.</p>
<h1>5. Stemming with NLTK</h1>
<p>  Stemming is a form of normalization. Many variations of a word carry the same meaning, differing only in tense or other inflection.</p>
<p>  We stem words to shorten lookups and to normalize sentences.</p>
<p>  Consider:</p>
<div class="cnblogs_code">
<pre>I was taking a ride <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> the car.
I was riding </span><span style="color: rgba(0, 0, 255, 1)">in</span> the car.</pre>
</div>
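<p>  These two sentences mean the same thing. <code>in the car</code> is the same. <code>I</code> is the same. In both cases <code>was</code> plus an <code>-ing</code> form marks an ongoing action in the past, so when we are trying to work out what that past activity means, is it really necessary to distinguish <code>riding</code> from <code>taking a ride</code>?</p>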
<p>  No, not really.</p>
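<p>  This is only one small example, but imagine every word in English, together with every tense and affix that can be attached to it. Keeping a separate dictionary entry for each version would be highly redundant and inefficient, especially since, once we convert words to numbers, the "value" is going to be the same.</p>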
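<p>  To make the redundancy argument concrete, here is a minimal sketch that collapses several variants of a word into one dictionary key. The <code>toy_stem</code> function and its suffix list are hypothetical and deliberately naive; this is not the Porter algorithm, which applies ordered rule phases with conditions on the remaining stem.</p>

```python
from collections import defaultdict

# Assumption: a tiny illustrative suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word):
    """Strip the first matching suffix, as long as at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of surface forms that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[toy_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["python", "pythoner", "pythoning", "pythoned", "pythons"])
print(groups)
```

<p>  Five surface forms end up under a single key, which is the saving stemming is meant to deliver. A rule set this crude will mangle many real words, which is why an established algorithm is normally used instead.</p>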
<p>  One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.</p>
<p>  First, we import the stemmer and create an instance of it:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.stem <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> PorterStemmer
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> sent_tokenize, word_tokenize

ps </span>= PorterStemmer()</pre>
</div>
<p>  Now let's pick some words that share a stem, for example:</p>
<div class="cnblogs_code">
<pre>example_words = [<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">python</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pythoner</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pythoning</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pythoned</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pythonly</span><span style="color: rgba(128, 0, 0, 1)">"</span>]</pre>
</div>
<p>  Next, we can stem them easily like this:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> example_words:
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(ps.stem(w))</pre>
</div>
<p>  Our output:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">python
python
python
python
pythonli</span></pre>
</div>
<p>  Now let's try stemming a typical sentence rather than a few isolated words:</p>
<div class="cnblogs_code">
<pre>new_text = <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once.</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(new_text)

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> words:
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(ps.stem(w))</pre>
</div>
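<p>  Now our result is (recent NLTK releases lowercase tokens while stemming by default, so <code>It</code> and <code>All</code> may print as <code>it</code> and <code>all</code>):</p>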
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">It
</span><span style="color: rgba(0, 0, 255, 1)">is</span>
<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)">
to
by
veri
pythonli
</span><span style="color: rgba(0, 0, 255, 1)">while</span><span style="color: rgba(0, 0, 0, 1)">
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.</span></pre>
</div>
<p>  Next, we look at something more advanced in the NLTK module, part-of-speech tagging, where NLTK identifies the part of speech of each word in a sentence.</p>
<h1>6. Part-of-Speech Tagging with NLTK</h1>
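<p>  One of the more powerful aspects of the NLTK module is that it can do part-of-speech tagging for you. This means labeling the words in a sentence as nouns, adjectives, verbs, and so on. Even more impressive, it also labels by tense and more. Here is a list of the tags, what they mean, and some examples:</p>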
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">POS tag list:

CC    coordinating conjunction
CD    cardinal number
DT    determiner
EX    existential there (like: "there is" ... think of it like "there exists")
FW    foreign word
IN    preposition/subordinating conjunction
JJ    adjective                                 'big'
JJR   adjective, comparative                    'bigger'
JJS   adjective, superlative                    'biggest'
LS    list marker                               1)
MD    modal                                     could, will
NN    noun, singular                            'desk'
NNS   noun, plural                              'desks'
NNP   proper noun, singular                     'Harrison'
NNPS  proper noun, plural                       'Americans'
PDT   predeterminer                             'all the kids'
POS   possessive ending                         parent's
PRP   personal pronoun                          I, he, she
PRP$  possessive pronoun                        my, his, hers
RB    adverb                                    very, silently
RBR   adverb, comparative                       better
RBS   adverb, superlative                       best
RP    particle                                  give up
TO    to                                        go 'to' the store
UH    interjection                              errrrrrrrm
VB    verb, base form                           take
VBD   verb, past tense                          took
VBG   verb, gerund/present participle           taking
VBN   verb, past participle                     taken
VBP   verb, singular present, non-3rd person    take
VBZ   verb, 3rd person singular present         takes
WDT   wh-determiner                             which
WP    wh-pronoun                                who, what
WP$   possessive wh-pronoun                     whose
WRB   wh-adverb                                 where, when</span></pre>
</div>
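<p>  Many of the tags above are finer-grained variants of a few broad classes, and they share a two-letter prefix (<code>NN</code>, <code>VB</code>, <code>JJ</code>, <code>RB</code>). The sketch below, assuming tagged words arrive as <code>(word, tag)</code> pairs, folds the fine-grained tags into coarse classes and counts them. The <code>COARSE</code> mapping and the <code>coarse_pos</code> helper are hypothetical names chosen for illustration, and the sample pairs are hand-written.</p>

```python
from collections import Counter

# Assumption: an illustrative grouping keyed by the first two letters of the tag.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_pos(tag):
    """Map a fine-grained tag such as 'VBD' to a coarse class such as 'verb'."""
    return COARSE.get(tag[:2], "other")


# Hand-written (word, tag) pairs for illustration.
tagged = [
    ("our", "PRP$"),
    ("nation", "NN"),
    ("lost", "VBD"),
    ("a", "DT"),
    ("beloved", "VBN"),
    ("citizens", "NNS"),
]

counts = Counter(coarse_pos(tag) for _, tag in tagged)
print(counts)
```

<p>  Collapsing tags this way is handy when you only care whether a word is, say, a verb, and not which verb form it is.</p>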
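<p>  How do we use this? While we are at it, we will cover a new sentence tokenizer called <code>PunktSentenceTokenizer</code>. This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text you use. First, let's get the imports we plan to use:</p>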
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> state_union
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span> PunktSentenceTokenizer</pre>
</div>
<p>  Now let's create our training and testing data:</p>
<div class="cnblogs_code">
<pre>train_text = state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2005-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
sample_text </span>= state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2006-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>)</pre>
</div>
<p>  One is President George W. Bush's 2005 State of the Union address, and the other is his 2006 address.</p>
<p>  Next, we can train the Punkt tokenizer like this:</p>
<div class="cnblogs_code">
<pre>custom_sent_tokenizer = PunktSentenceTokenizer(train_text)</pre>
</div>
<p>  Then we can split the sample text into sentences with:</p>
<div class="cnblogs_code">
<pre>tokenized = custom_sent_tokenizer.tokenize(sample_text)</pre>
</div>
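<p>  To see why a trained sentence tokenizer is worth the trouble, consider what a plain punctuation split does with an abbreviation such as <code>Mr.</code>. The sketch below is not how Punkt is implemented; it only illustrates the problem, using a hand-written abbreviation set where Punkt would learn likely abbreviations from the training text. The names <code>ABBREVIATIONS</code>, <code>naive_split</code> and <code>split_with_abbreviations</code> are hypothetical.</p>

```python
import re

# Assumption: hand-written for illustration; Punkt learns such a list from text.
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


def split_with_abbreviations(text, abbreviations):
    """Like naive_split, but do not end a sentence on a known abbreviation."""
    sentences, current = [], []
    for chunk in naive_split(text):
        if not chunk:
            continue
        current.append(chunk)
        last_word = chunk.split()[-1].lower()
        if last_word not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Speaker, thank you all. Today our nation lost a friend."
naive = naive_split(text)
better = split_with_abbreviations(text, ABBREVIATIONS)
print(naive)
print(better)
```

<p>  The naive split treats <code>Mr.</code> as a sentence of its own, while the abbreviation-aware version keeps it attached to the words that follow.</p>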
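<p>  Now we can finish this part-of-speech tagging script by writing a function that runs through the sentences and tags each word's part of speech, like so:</p>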
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> process_content():
    </span><span style="color: rgba(0, 0, 255, 1)">try</span><span style="color: rgba(0, 0, 0, 1)">:
        </span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> tokenized[:5<span style="color: rgba(0, 0, 0, 1)">]:
            words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.word_tokenize(i)
            tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(words)
            </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(tagged)

    </span><span style="color: rgba(0, 0, 255, 1)">except</span><span style="color: rgba(0, 0, 0, 1)"> Exception as e:
        </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(str(e))


process_content()</span></pre>
</div>
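<p>  The output should be a list of tuples, where the first element of each tuple is the word and the second is its part-of-speech tag. It should look like:</p>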
<div class="cnblogs_code">
<pre>[(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRESIDENT</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">GEORGE</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">W.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">BUSH</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">'S</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">POS</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ADDRESS</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">BEFORE</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">A</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">JOINT</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">SESSION</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">OF</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">THE</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">CONGRESS</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">ON</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">THE</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">STATE</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">OF</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">THE</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">UNION</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">January</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 
1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">31</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">CD</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">2006</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">CD</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">THE</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRESIDENT</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">:</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Thank</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">you</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">all</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>)] [(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Mr.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Speaker</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">Vice</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">President</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Cheney</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">members</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNS</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">of</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Congress</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 
1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">members</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNS</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">of</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Supreme</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Court</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">and</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">CC</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">diplomatic</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">JJ</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">corps</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNS</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">distinguished</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBD</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">guests</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNS</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">and</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">CC</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">fellow</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">JJ</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">citizens</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNS</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">:</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">:</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Today</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">our</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRP$</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">nation</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 
1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">lost</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBD</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">beloved</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">graceful</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">JJ</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">courageous</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">JJ</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">woman</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">who</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">WP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">called</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">America</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">to</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">TO</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">its</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRP$</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">founding</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ideals</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNS</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">and</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">CC</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">carried</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBD</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">on</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">noble</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">JJ</span><span style="color: rgba(128, 0, 0, 
1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">dream</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>)] [(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Tonight</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">we</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">are</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">comforted</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">by</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">hope</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">of</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">glad</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">reunion</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">with</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">husband</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">who</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">WP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">was</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBD</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">taken</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">so</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">RB</span><span style="color: rgba(128, 0, 0, 
1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">long</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">RB</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ago</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">RB</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">and</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">CC</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">we</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">are</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">grateful</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">JJ</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">for</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">good</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">life</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">of</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Coretta</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">Scott</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">King</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>)] [(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">(</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Applause</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">)</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">:</span><span style="color: rgba(128, 0, 0, 1)">'</span>)] 
[(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">President</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">George</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">W.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Bush</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">reacts</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">VBZ</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">to</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">TO</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">applause</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">VB</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">during</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">his</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PRP$</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">State</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">of</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Union</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">Address</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">at</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">IN</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">DT</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Capitol</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Tuesday</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 
1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Jan</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NNP</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>)]</pre>
</div>
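<p>At this point we can start to extract meaning, but there is still work to do. The next topic is chunking, where we group words into meaningful units based on their part-of-speech tags.</p>
<h1>7. NLTK Chunking</h1>
<p>Now that we know the parts of speech, we can move on to chunking: grouping words into meaningful pieces. One of the main goals of chunking is to group what are called "noun phrases". These are phrases of one or more words that contain a noun, possibly along with some descriptive words, a verb, or an adverb. The idea is to group nouns together with the words related to them.</p>
<p>To chunk, we combine part-of-speech tags with regular expressions. From regular expressions, we mainly use the following:</p>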
<div class="cnblogs_code">
<pre>+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a newline</pre>
</div>
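<p>If you need help with regular expressions, see the tutorial linked above. One last thing to note: part-of-speech tags are written between <code>&lt;</code> and <code>&gt;</code>, and we can put regular expressions inside the tag itself, so that <code>&lt;N.*&gt;</code> means "any noun tag".</p>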
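<p>As a small illustration of the tag wildcard, the sketch below compares the exact tag <code>&lt;NN&gt;</code> with the pattern <code>&lt;N.*&gt;</code>. The sentence is tagged by hand (an assumption made so that no tagger model or corpus download is needed), and the helper name <code>chunk_words</code> and the label <code>N</code> are made up for this example.</p>

```python
import nltk

# Hand-tagged tokens (illustrative only, not taken from the speech corpus).
tagged = [("Dogs", "NNS"), ("chase", "VBP"), ("the", "DT"),
          ("ball", "NN"), ("in", "IN"), ("London", "NNP")]

def chunk_words(grammar, tagged_tokens):
    """Hypothetical helper: return the words of every chunk labeled 'N'."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "N")]

# <NN> matches only the exact tag NN (singular common noun).
exact = chunk_words(r"N: {<NN>+}", tagged)

# <N.*> matches any tag that starts with N: NN, NNS, NNP, NNPS.
wildcard = chunk_words(r"N: {<N.*>+}", tagged)

print(exact)     # ['ball']
print(wildcard)  # ['Dogs', 'ball', 'London']
```

<p>The wildcard never runs across tag boundaries, so each noun here still ends up in its own chunk.</p>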
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span> nltk
<span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span> state_union
<span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span> PunktSentenceTokenizer

<span style="color: rgba(0, 128, 0, 1)"># Train the sentence tokenizer on the 2005 speech, then split the 2006 speech.</span>
train_text = state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"2005-GWBush.txt"</span>)
sample_text = state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"2006-GWBush.txt"</span>)

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

<span style="color: rgba(0, 0, 255, 1)">def</span> process_content():
    <span style="color: rgba(0, 0, 255, 1)">try</span>:
        <span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            <span style="color: rgba(0, 128, 0, 1)"># Chunk grammar: the tag patterns go inside the curly braces.</span>
            chunkGram = r<span style="color: rgba(128, 0, 0, 1)">"""Chunk: {&lt;RB.?&gt;*&lt;VB.?&gt;*&lt;NNP&gt;+&lt;NN&gt;?}"""</span>
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            <span style="color: rgba(0, 128, 0, 1)"># Opens a tree window for each sentence; close it to continue.</span>
            chunked.draw()

    <span style="color: rgba(0, 0, 255, 1)">except</span> Exception as e:
        <span style="color: rgba(0, 0, 255, 1)">print</span>(str(e))

process_content()</pre>
</div>
<p>The result looks like this:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1554973/201909/1554973-20190904133713534-847787440.png" alt=""></p>
<p>The key line here is:</p>
<div class="cnblogs_code">
<pre>chunkGram = r<span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">Chunk: {&lt;RB.?&gt;*&lt;VB.?&gt;*&lt;NNP&gt;+&lt;NN&gt;?}</span><span style="color: rgba(128, 0, 0, 1)">"""</span></pre>
</div>
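<p>Breaking this line down:</p>
<p><code>&lt;RB.?&gt;*</code>: zero or more adverbs of any form (<code>RB</code>, <code>RBR</code>, <code>RBS</code>), followed by:</p>
<p><code>&lt;VB.?&gt;*</code>: zero or more verbs of any tense or form, followed by:</p>
<p><code>&lt;NNP&gt;+</code>: one or more proper nouns, followed by:</p>
<p><code>&lt;NN&gt;?</code>: zero or one singular noun.</p>
<p>Try playing with different combinations to group various cases until you feel comfortable with it.</p>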
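<p>To experiment without opening a tree window for every sentence, one option is to run the same grammar on a short hand-tagged sentence and print the chunks as text. This is only a sketch: the sentence and its tags are invented for illustration, and the variable names are not from the original tutorial.</p>

```python
import nltk

# Hand-tagged example sentence (an assumption for illustration; with real
# text you would get these pairs from nltk.pos_tag).
tagged = [("President", "NNP"), ("George", "NNP"), ("Bush", "NNP"),
          ("quickly", "RB"), ("thanked", "VBD"), ("Congress", "NNP"),
          ("for", "IN"), ("the", "DT"), ("applause", "NN")]

# Same grammar as in the tutorial: optional adverbs, optional verbs,
# one or more proper nouns, then an optional singular noun.
chunk_gram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunk_parser = nltk.RegexpParser(chunk_gram)
chunked = chunk_parser.parse(tagged)

# Print the tree as text instead of calling chunked.draw().
print(chunked)

# Collect the words of each chunk as plain strings.
chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")]
print(chunks)  # ['President George Bush', 'quickly thanked Congress']
```

<p>Note that "applause" is left out: a lone <code>NN</code> cannot start a chunk, because the grammar requires at least one <code>NNP</code>.</p>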
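<p>The video does not cover it, but another reasonable task is to actually access the individual chunks. This is rarely mentioned, but depending on what you are doing, it can be an important step. If you print the chunks, you will see output like this:</p>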
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">(S
(Chunk PRESIDENT</span>/NNP GEORGE/NNP W./NNP BUSH/<span style="color: rgba(0, 0, 0, 1)">NNP)
</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">S/POS</span>
<span style="color: rgba(0, 0, 0, 1)">(Chunk
    ADDRESS</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    BEFORE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    A</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    JOINT</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    SESSION</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    OF</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    THE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    CONGRESS</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    ON</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    THE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    STATE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    OF</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    THE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    UNION</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
    January</span>/<span style="color: rgba(0, 0, 0, 1)">NNP)
</span>31/<span style="color: rgba(0, 0, 0, 1)">CD
,</span>/<span style="color: rgba(0, 0, 0, 1)">,
</span>2006/<span style="color: rgba(0, 0, 0, 1)">CD
THE</span>/<span style="color: rgba(0, 0, 0, 1)">DT
(Chunk PRESIDENT</span>/<span style="color: rgba(0, 0, 0, 1)">NNP)
:</span>/<span style="color: rgba(0, 0, 0, 1)">:
(Chunk Thank</span>/<span style="color: rgba(0, 0, 0, 1)">NNP)
you</span>/<span style="color: rgba(0, 0, 0, 1)">PRP
all</span>/<span style="color: rgba(0, 0, 0, 1)">DT
.</span>/.)</pre>
</div>
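<p>That is cool and helps us visually, but what if we want to access this data from our program? What is happening here is that our <code>chunked</code> variable is an NLTK tree. Each chunk is a subtree of that tree, while tokens outside any chunk are leaves attached directly to the root <code>S</code>. We can reference the subtrees with <code>chunked.subtrees()</code> (the root itself is yielded first) and iterate over them like this:</p>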
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">for</span> subtree <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> chunked.subtrees():
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(subtree)</pre>
</div>
<p>Next, we may only care about the chunks themselves and want to ignore the rest. We can pass the <code>filter</code> parameter to the <code>chunked.subtrees()</code> call.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">for</span> subtree <span style="color: rgba(0, 0, 255, 1)">in</span> chunked.subtrees(filter=<span style="color: rgba(0, 0, 255, 1)">lambda</span> t: t.label() == <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Chunk</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(subtree)</pre>
</div>
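<p>Now we are filtering to show only the subtrees labeled "Chunk". Keep in mind, this is not "chunk" as in some built-in NLTK chunk attribute; it is literally "Chunk" because that is the label we gave it here: <code>chunkGram = r"""Chunk: {&lt;RB.?&gt;*&lt;VB.?&gt;*&lt;NNP&gt;+&lt;NN&gt;?}"""</code>.</p>
<p>Had we written something like <code>chunkGram = r"""Pythons: {&lt;RB.?&gt;*&lt;VB.?&gt;*&lt;NNP&gt;+&lt;NN&gt;?}"""</code>, then we would filter by the label <code>"Pythons"</code> instead. The result should look like this:</p>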
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">
(Chunk PRESIDENT</span>/NNP GEORGE/NNP W./NNP BUSH/<span style="color: rgba(0, 0, 0, 1)">NNP)
(Chunk
ADDRESS</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
BEFORE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
A</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
JOINT</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
SESSION</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
OF</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
THE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
CONGRESS</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
ON</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
THE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
STATE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
OF</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
THE</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
UNION</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
January</span>/<span style="color: rgba(0, 0, 0, 1)">NNP)
(Chunk PRESIDENT</span>/<span style="color: rgba(0, 0, 0, 1)">NNP)
(Chunk Thank</span>/NNP)</pre>
</div>
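<p>Once the chunk subtrees are filtered out, their words can be collected as plain strings. The sketch below assumes a small tree built by hand in the same shape as the parser output above (it is not produced by the tagger), and uses <code>leaves()</code> to get the <code>(word, tag)</code> pairs under each chunk:</p>

```python
from nltk import Tree

# A small tree shaped like the parser output above, built by hand for illustration.
chunked = Tree("S", [
    Tree("Chunk", [("PRESIDENT", "NNP"), ("GEORGE", "NNP"),
                   ("W.", "NNP"), ("BUSH", "NNP")]),
    ("'S", "POS"),
    Tree("Chunk", [("ADDRESS", "NNP")]),
    ("31", "CD"),
])

phrases = []
for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk"):
    # leaves() returns the (word, tag) pairs under this subtree
    phrases.append(" ".join(word for word, tag in subtree.leaves()))

print(phrases)
```

<p>From here the phrases can be stored, counted, or passed on to whatever comes next in your program.</p>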
<p>The complete code is:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> state_union
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> PunktSentenceTokenizer

train_text </span>= state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2005-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
sample_text </span>= state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2006-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)

custom_sent_tokenizer </span>=<span style="color: rgba(0, 0, 0, 1)"> PunktSentenceTokenizer(train_text)

tokenized </span>=<span style="color: rgba(0, 0, 0, 1)"> custom_sent_tokenizer.tokenize(sample_text)

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> process_content():
    </span><span style="color: rgba(0, 0, 255, 1)">try</span><span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> tokenized:
            words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.word_tokenize(i)
            tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(words)
            chunkGram </span>= r<span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">Chunk: {&lt;RB.?&gt;*&lt;VB.?&gt;*&lt;NNP&gt;+&lt;NN&gt;?}</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
            chunkParser </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.RegexpParser(chunkGram)
            chunked </span>=<span style="color: rgba(0, 0, 0, 1)"> chunkParser.parse(tagged)
            
            </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(chunked)
            </span><span style="color: rgba(0, 0, 255, 1)">for</span> subtree <span style="color: rgba(0, 0, 255, 1)">in</span> chunked.subtrees(filter=<span style="color: rgba(0, 0, 255, 1)">lambda</span> t: t.label() == <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Chunk</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
                </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(subtree)

            chunked.draw()

    </span><span style="color: rgba(0, 0, 255, 1)">except</span><span style="color: rgba(0, 0, 0, 1)"> Exception as e:
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(str(e))

process_content()</span></pre>
</div>
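<h1>8. Chinking with NLTK</h1>
<p>You may find that after a lot of chunking, your chunks still contain words you do not want, and you are not sure how to get rid of them by chunking alone. Chinking may be the solution.</p>
<p>Chinking is a lot like chunking; it is basically a way to remove a piece from a chunk. The piece that you remove from the chunk is your chink.</p>
<p>The code is very similar: you denote the chink with <code>}{</code>, placed after the chunk rule, in contrast to the chunk's <code>{}</code>.</p>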
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> state_union
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> PunktSentenceTokenizer

train_text </span>= state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2005-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
sample_text </span>= state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2006-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)

custom_sent_tokenizer </span>=<span style="color: rgba(0, 0, 0, 1)"> PunktSentenceTokenizer(train_text)

tokenized </span>=<span style="color: rgba(0, 0, 0, 1)"> custom_sent_tokenizer.tokenize(sample_text)

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> process_content():
    </span><span style="color: rgba(0, 0, 255, 1)">try</span><span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> tokenized:
            words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.word_tokenize(i)
            tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(words)

            chunkGram </span>= r<span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">Chunk: {&lt;.*&gt;+}
                                    }&lt;VB.?|IN|DT|TO&gt;+{</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">

            chunkParser </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.RegexpParser(chunkGram)
            chunked </span>=<span style="color: rgba(0, 0, 0, 1)"> chunkParser.parse(tagged)

            chunked.draw()

    </span><span style="color: rgba(0, 0, 255, 1)">except</span><span style="color: rgba(0, 0, 0, 1)"> Exception as e:
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(str(e))

process_content()</span></pre>
</div>
<p>With this, you get something like:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1554973/201909/1554973-20190904134049480-1487324337.png" alt=""></p>
<p>The main difference now is:</p>
<div class="cnblogs_code">
<pre>}&lt;VB.?|IN|DT|TO&gt;+{</pre>
</div>
<p>This means we remove from the chunk any run of one or more verbs, prepositions, determiners, or the word <code>to</code>.</p>
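<p>The effect is easier to see on a tiny input. The following sketch applies the same chunk-then-chink grammar to a hand-tagged sentence (the sentence and its tags are made up for illustration), then lists the words that survive inside chunks:</p>

```python
import nltk

# Hand-tagged tokens (hypothetical example), so no tagger model is required.
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# First chunk everything, then chink out verbs, prepositions, determiners and "to".
chunk_gram = r"""Chunk: {<.*>+}
                        }<VB.?|IN|DT|TO>+{"""
chunked = nltk.RegexpParser(chunk_gram).parse(tagged)

kept = [
    [word for word, tag in subtree.leaves()]
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(kept)
```

<p>The single all-inclusive chunk is split wherever a chinked run occurs, so what remains are the pieces between those runs.</p>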
<p>Now that we have learned how to do some custom chunking and chinking, let's discuss a form of chunking that comes built into NLTK: named entity recognition.</p>
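<h1>9. Named Entity Recognition with NLTK</h1>
<p>One of the most important forms of chunking in natural language processing is called named entity recognition. The idea is to have the machine immediately pull out "entities" such as people, places, things, locations, monetary figures, and so on.</p>
<p>This can be a challenge, but NLTK has it built in. NLTK's named entity recognition has two main options: recognize all named entities without distinguishing between them, or recognize named entities as their respective types, such as person, place, location, and so on.</p>
<p>Here is an example:</p>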
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> state_union
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> PunktSentenceTokenizer

train_text </span>= state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2005-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
sample_text </span>= state_union.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">2006-GWBush.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)

custom_sent_tokenizer </span>=<span style="color: rgba(0, 0, 0, 1)"> PunktSentenceTokenizer(train_text)

tokenized </span>=<span style="color: rgba(0, 0, 0, 1)"> custom_sent_tokenizer.tokenize(sample_text)

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> process_content():
    </span><span style="color: rgba(0, 0, 255, 1)">try</span><span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> tokenized:
            words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.word_tokenize(i)
            tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(words)
            namedEnt </span>= nltk.ne_chunk(tagged, binary=<span style="color: rgba(0, 0, 0, 1)">True)
            namedEnt.draw()
    </span><span style="color: rgba(0, 0, 255, 1)">except</span><span style="color: rgba(0, 0, 0, 1)"> Exception as e:
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(str(e))


process_content()</span></pre>
</div>
<p>Here, with <code>binary=True</code>, something is either a named entity or it is not; there is no further detail. The result is:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1554973/201909/1554973-20190904134222323-61980523.png" alt=""></p>
<p>If you set <code>binary=False</code>, the result is:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1554973/201909/1554973-20190904134245407-1048520302.png" alt="" width="636" height="259"></p>
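<p>You can see a few things right away. When <code>binary</code> is false, it picks up the same things, but splits a term like <code>White House</code> into <code>White</code> and <code>House</code> as if they were different entities, whereas with <code>binary=True</code> the recognizer correctly treats <code>White House</code> as part of the same named entity.</p>
<p>Depending on your goal, you can choose either setting for <code>binary</code>. With <code>binary=False</code>, these are the named entity types you can get:</p>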
<div class="cnblogs_code">
<pre>NE Type <span style="color: rgba(0, 0, 255, 1)">and</span><span style="color: rgba(0, 0, 0, 1)"> Examples
ORGANIZATION </span>- Georgia-<span style="color: rgba(0, 0, 0, 1)">Pacific Corp., WHO
PERSON </span>-<span style="color: rgba(0, 0, 0, 1)"> Eddy Bonte, President Obama
LOCATION </span>-<span style="color: rgba(0, 0, 0, 1)"> Murray River, Mount Everest
DATE </span>- June, 2008-06-29<span style="color: rgba(0, 0, 0, 1)">
TIME </span>- two fifty a m, 1:30<span style="color: rgba(0, 0, 0, 1)"> p.m.
MONEY </span>- 175 million Canadian Dollars, GBP 10.40<span style="color: rgba(0, 0, 0, 1)">
PERCENT </span>- twenty pct, 18.75 %<span style="color: rgba(0, 0, 0, 1)">
FACILITY </span>-<span style="color: rgba(0, 0, 0, 1)"> Washington Monument, Stonehenge
GPE </span>- South East Asia, Midlothian</pre>
</div>
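<p>The tree returned by <code>nltk.ne_chunk</code> can be walked like any other NLTK tree. As a rough sketch, the code below uses a hand-built stand-in for that output (the labels and tokens are illustrative, not real recognizer output) and collects each entity together with its type:</p>

```python
from nltk import Tree

# Hand-built stand-in for the output of nltk.ne_chunk(tagged, binary=False).
# The labels and tokens here are illustrative, not real recognizer output.
named_ent = Tree("S", [
    Tree("PERSON", [("George", "NNP"), ("Bush", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Texas", "NNP")]),
    (".", "."),
])

entities = []
for node in named_ent:
    # Named entities are subtrees; ordinary tokens are plain (word, tag) tuples.
    if isinstance(node, Tree):
        text = " ".join(word for word, tag in node.leaves())
        entities.append((node.label(), text))

print(entities)
```

<p>With <code>binary=True</code> the same loop applies; every entity subtree then carries the single label <code>NE</code> instead of a specific type.</p>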
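<p>Either way, you will probably find that you need to do a bit more work to get it just right, but this is pretty powerful right out of the box.</p>
<p>In the next tutorial, we will talk about something similar to stemming, called lemmatizing.</p>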
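<h1>10. Lemmatizing with NLTK</h1>
<p>A very similar operation to stemming is called lemmatizing. The main difference between the two is, as you saw earlier, that stemming can often create words that do not exist, whereas lemmas are actual words.</p>
<p>So your stem, that is, the word you end up with, is not necessarily something you can look up in a dictionary, but you can look up a lemma.</p>
<p>Sometimes you will end up with a very similar word, and sometimes you will get a completely different one. Let's look at some examples.</p>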
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.stem <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> WordNetLemmatizer

lemmatizer </span>=<span style="color: rgba(0, 0, 0, 1)"> WordNetLemmatizer()

</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">cats</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">cacti</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">geese</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rocks</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">python</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">better</span><span style="color: rgba(128, 0, 0, 1)">"</span>, pos=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">best</span><span style="color: rgba(128, 0, 0, 1)">"</span>, pos=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">run</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(lemmatizer.lemmatize(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">run</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">v</span><span style="color: rgba(128, 0, 0, 1)">'</span>))</pre>
</div>
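<p>Here we have a handful of examples of lemmas for the words we used. The one thing to note is that <code>lemmatize</code> takes a part-of-speech argument, <code>pos</code>. If it is not supplied, the default is noun. That means it will try to find the closest noun, which can cause trouble. Keep this in mind if you use lemmatizing!</p>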
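<p>Because <code>pos</code> matters so much, one common approach is to derive it from the Penn Treebank tags that <code>nltk.pos_tag</code> produces. The helper below is a hypothetical sketch (<code>penn_to_wordnet_pos</code> is not part of NLTK) that maps those tags to the single-letter codes <code>lemmatize</code> accepts:</p>

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a single-letter pos code for lemmatize().

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also lemmatize()'s own default


for tag in ["JJR", "VBD", "RB", "NNS", "CD"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

<p>With a helper like this, each <code>(word, tag)</code> pair from the tagger could be lemmatized as <code>lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))</code>, so that verbs and adjectives are not treated as nouns.</p>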
<p>In the next tutorial, we will dive into the NLTK corpus that comes with the module and look at all of the great documents waiting for us there.</p>
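<h1>11. The NLTK Corpora</h1>
<p>Almost all files in the NLTK corpus follow the same rules for access through the NLTK module, but there is nothing magical about them. Most of these files are plain text, some are XML, and some use other formats, but all of them can be accessed either manually or through the module and Python. Let's talk about viewing them manually. Depending on your installation, your <code>nltk_data</code> directory may be in any of several locations. To figure out where it is, go to your Python directory, where the NLTK module is. If you do not know where that is, use the following code:</p>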
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(nltk.<span style="color: rgba(128, 0, 128, 1)">__file__</span>)</pre>
</div>
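<p>Run it, and the output will be the location of the NLTK module's <code>__init__.py</code>. Go into the NLTK directory and look for the <code>data.py</code> file.</p>
<p>The important part of the code is:</p>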
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">if</span> sys.platform.startswith(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">win</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Common locations on Windows:</span>
    path +=<span style="color: rgba(0, 0, 0, 1)"> [
      str(r</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">C:\nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span>), str(r<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">D:\nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span>), str(r<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">E:\nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">),
      os.path.join(sys.prefix, str(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)),
      os.path.join(sys.prefix, str(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">lib</span><span style="color: rgba(128, 0, 0, 1)">'</span>), str(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)),
      os.path.join(os.environ.get(str(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">APPDATA</span><span style="color: rgba(128, 0, 0, 1)">'</span>), str(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">C:\\</span><span style="color: rgba(128, 0, 0, 1)">'</span>)), str(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">))
    ]
</span><span style="color: rgba(0, 0, 255, 1)">else</span><span style="color: rgba(0, 0, 0, 1)">:
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Common locations on UNIX &amp; OS X:</span>
    path +=<span style="color: rgba(0, 0, 0, 1)"> [
      str(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">),
      str(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/local/share/nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">),
      str(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/lib/nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">),
      str(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/local/lib/nltk_data</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    ]</span></pre>
</div>
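<p>There you can see the various possible directories for <code>nltk_data</code>. If you are on Windows, it is most likely under your <code>AppData</code> directory. To get there, open your file browser, click the address bar at the top, and type <code>%appdata%</code>.</p>
<p>That normally opens the <code>Roaming</code> folder (click <code>Roaming</code> if you landed one level up); find the <code>nltk_data</code> directory there. Inside it, you will find your corpus files. The full path looks something like this:</p>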
<div class="cnblogs_code">
<pre>C:\Users\yourname\AppData\Roaming\nltk_data\corpora</pre>
</div>
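<p>Instead of reading <code>data.py</code>, you can also ask NLTK at runtime which directories it searches. A minimal sketch, using <code>nltk.data.path</code> and <code>nltk.data.find</code>; whether the corpus is found depends on what has been downloaded on your machine:</p>

```python
import nltk

# nltk.data.path is the list of directories NLTK searches for nltk_data.
for directory in nltk.data.path:
    print(directory)

# nltk.data.find raises LookupError when a resource is not installed.
try:
    print(nltk.data.find("corpora/gutenberg"))
except LookupError:
    print("gutenberg corpus not found; nltk.download('gutenberg') would fetch it")
```

<p>The first directory in the list that actually contains the resource is the one NLTK uses.</p>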
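<p>Here you have all of the available corpora, including books, chat logs, movie reviews, and more.</p>
<p>Now we will talk about accessing these documents through NLTK. As you can see, these are mostly text documents, so you could just use normal Python code to open and read them. That said, the NLTK module has some nice methods for handling the corpus, so you may find it practical to use them. Here is an example of opening the King James Bible from the Gutenberg corpus and reading the first few sentences:</p>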
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> sent_tokenize, PunktSentenceTokenizer
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> gutenberg

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> sample text</span>
sample = gutenberg.raw(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">bible-kjv.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)

tok </span>=<span style="color: rgba(0, 0, 0, 1)"> sent_tokenize(sample)

</span><span style="color: rgba(0, 0, 255, 1)">for</span> x <span style="color: rgba(0, 0, 255, 1)">in</span> range(5<span style="color: rgba(0, 0, 0, 1)">):
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(tok[x])</pre>
</div>
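<p>One of the more advanced datasets is <code>wordnet</code>. WordNet is a collection of words, definitions, examples of their use, synonyms, antonyms, and more. We will dive into using WordNet next.</p>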
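<h1>12. NLTK and WordNet</h1>
<p>WordNet is a lexical database for English, created at Princeton University, and is part of the NLTK corpus.</p>
<p>You can use WordNet together with the NLTK module to look up the meanings of words, synonyms, antonyms, and more. Let's go through some examples.</p>
<p>First, you need to import <code>wordnet</code>:</p>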
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span> wordnet</pre>
</div>
<p>Then we use the word <code>program</code> to look up its synsets (sets of synonyms):</p>
<div class="cnblogs_code">
<pre>syns = wordnet.synsets(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">program</span><span style="color: rgba(128, 0, 0, 1)">"</span>)</pre>
</div>
<p>An example of one synset:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(syns[0].name())

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> plan.n.01</span></pre>
</div>
<p>Just the word:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(syns[0].lemmas()[0].name())

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> plan</span></pre>
</div>
<p>The definition of that first synset:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(syns[0].definition())

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> a series of steps to be carried out or goals to be accomplished</span></pre>
</div>
<p>Examples of the word in use:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(syns[0].examples())

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> ['they drew up a six-step plan', 'they discussed plans for a new bond issue']</span></pre>
</div>
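<p>Next, how do we discern the synonyms and antonyms of a word? The lemmas in a synset are synonyms of one another, and you can call <code>.antonyms()</code> on a lemma to find its antonyms. So we can populate some lists like this:</p>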
<div class="cnblogs_code">
<pre>synonyms =<span style="color: rgba(0, 0, 0, 1)"> []
antonyms </span>=<span style="color: rgba(0, 0, 0, 1)"> []

</span><span style="color: rgba(0, 0, 255, 1)">for</span> syn <span style="color: rgba(0, 0, 255, 1)">in</span> wordnet.synsets(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">good</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">):
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> l <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> syn.lemmas():
      synonyms.append(l.name())
      </span><span style="color: rgba(0, 0, 255, 1)">if</span><span style="color: rgba(0, 0, 0, 1)"> l.antonyms():
            antonyms.append(l.antonyms()[0].name())

</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(set(synonyms))
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(set(antonyms))

</span><span style="color: rgba(128, 0, 0, 1)">'''</span><span style="color: rgba(128, 0, 0, 1)">
{'beneficial', 'just', 'upright', 'thoroughly', 'in_force', 'well', 'skilful', 'skillful', 'sound', 'unspoiled', 'expert', 'proficient', 'in_effect', 'honorable', 'adept', 'secure', 'commodity', 'estimable', 'soundly', 'right', 'respectable', 'good', 'serious', 'ripe', 'salutary', 'dear', 'practiced', 'goodness', 'safe', 'effective', 'unspoilt', 'dependable', 'undecomposed', 'honest', 'full', 'near', 'trade_good'} {'evil', 'evilness', 'bad', 'badness', 'ill'}
</span><span style="color: rgba(128, 0, 0, 1)">'''</span></pre>
</div>
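<p>As you can see, we got many more synonyms than antonyms, partly because the loop records at most one antonym per lemma, but you could easily balance this by running the exact same process for the word <code>bad</code>.</p>
<p>Next, we can also easily use WordNet to compare the similarity of two words, each in a specific sense and part of speech, using the Wu and Palmer measure of semantic relatedness.</p>
<p>Let's compare the noun <code>ship</code> with <code>boat</code>, <code>car</code>, and <code>cat</code>:</p>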
<div class="cnblogs_code">
<pre>w1 = wordnet.synset(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ship.n.01</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
w2 </span>= wordnet.synset(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">boat.n.01</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(w1.wup_similarity(w2))

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 0.9090909090909091</span>
<span style="color: rgba(0, 0, 0, 1)">
w1 </span>= wordnet.synset(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ship.n.01</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
w2 </span>= wordnet.synset(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">car.n.01</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(w1.wup_similarity(w2))

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 0.6956521739130435</span>
<span style="color: rgba(0, 0, 0, 1)">
w1 </span>= wordnet.synset(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ship.n.01</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
w2 </span>= wordnet.synset(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">cat.n.01</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(w1.wup_similarity(w2))

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 0.38095238095238093</span></pre>
</div>
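<p>To get a feel for where these numbers come from: the Wu and Palmer measure scores two senses by how deep their closest shared ancestor sits in the is-a hierarchy, as 2 × depth(shared ancestor) / (depth(a) + depth(b)). The sketch below applies that formula to a tiny made-up taxonomy. It is not WordNet's real hierarchy and not NLTK's implementation, so the scores differ from the ones above, but the ordering comes out the same:</p>

```python
# Toy is-a taxonomy: child -> parent. Hypothetical, far smaller than WordNet.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "animal": "entity",
    "vessel": "vehicle",
    "car": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "cat": "animal",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    return len(path_to_root(node))  # the root has depth 1


def lowest_common_ancestor(a, b):
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node


def wup_similarity(a, b):
    shared = lowest_common_ancestor(a, b)
    return 2 * depth(shared) / (depth(a) + depth(b))


for other in ["boat", "car", "cat"]:
    print("ship", other, round(wup_similarity("ship", other), 3))
```

<p>The closer two senses sit under a deep shared ancestor, the closer the score gets to 1; senses that only meet at the root score low.</p>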
<p>Next, we will move on and begin covering the topic of text classification.</p>
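<h1>13. Text Classification with NLTK</h1>
<p>Now that we are comfortable with NLTK, let's try to tackle text classification. The goal of text classification can be quite broad. Maybe we are trying to classify text as being about politics or the military. Maybe we are trying to classify it by the gender of the author. A fairly popular text classification task is to identify a body of text as spam or not spam, as email filters do. In our case, we are going to try to create a sentiment analysis algorithm.</p>
<p>To do this, we will start with the movie reviews database that is part of the NLTK corpus. From there, we will try to use words as "features" that are part of either a positive or a negative movie review. The NLTK <code>movie_reviews</code> corpus contains reviews labeled as positive or negative, which means we can train and test with this data. First, let's preprocess our data.</p>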
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> random
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews

documents </span>=<span style="color: rgba(0, 0, 0, 1)"> [(list(movie_reviews.words(fileid)), category)
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> category <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.categories()
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> fileid <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.fileids(category)]

random.shuffle(documents)

</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(documents[1])

all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> []
</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.words():
    all_words.append(w.lower())

all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.FreqDist(all_words)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(all_words.most_common(15<span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(all_words[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">stupid</span><span style="color: rgba(128, 0, 0, 1)">"</span>])</pre>
</div>
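<p>This script may take a while to run, since the movie reviews dataset is fairly large. Let's cover what is happening here.</p>
<p>After importing the dataset we want, you see:</p>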
<div class="cnblogs_code">
<pre>documents =<span style="color: rgba(0, 0, 0, 1)"> [(list(movie_reviews.words(fileid)), category)
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> category <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.categories()
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> fileid <span style="color: rgba(0, 0, 255, 1)">in</span> movie_reviews.fileids(category)]</pre>
</div>
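<p>Basically, in plain English, the above code translates to: for each category (we have <code>pos</code> and <code>neg</code>), take all of the file IDs (each review has its own ID), then store the word-tokenized version (a list of words) for each file ID, followed by the positive or negative label, as one tuple in a big list.</p>
<p>Next, we use <code>random</code> to shuffle our documents. This is because we are going to be training and testing. If we left them in order, chances are we would train on all of the negatives and some of the positives, and then test only against positives. We do not want that, so we shuffle the data.</p>
<p>Then, just so you can see the data you are working with, we print from <code>documents</code>, a big list in which each entry is a tuple: the first element is a list of words and the second is the <code>pos</code> or <code>neg</code> label.</p>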
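<p>The reason for shuffling can be shown with a toy dataset. In the sketch below, the documents, the <code>split</code> helper, and the 75% training fraction are all made up for illustration; the point is only that an ordered list leaves a single class in the test slice:</p>

```python
import random

# Toy labeled documents (hypothetical), ordered the way the corpus is read:
# every "neg" review first, then every "pos" review.
documents = [(["bad", "film"], "neg")] * 6 + [(["great", "film"], "pos")] * 6


def split(docs, train_fraction=0.75):
    """Cut a list into a leading training part and a trailing test part."""
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


# Without shuffling, the test slice holds only one class.
ordered_train, ordered_test = split(documents)
print({label for words, label in ordered_test})

shuffled = documents[:]
random.seed(0)  # fixed seed only so the sketch is repeatable
random.shuffle(shuffled)
train_set, test_set = split(shuffled)
print(len(train_set), len(test_set))
```

<p>After shuffling, both slices are drawn from the whole dataset, so accuracy measured on the test slice is not biased toward one label by construction.</p>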
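<p>Next, we want to collect all the words we find, so we have one massive list of typical words. From there, we can build a frequency distribution and find the most common words. As you will see, the most popular "words" are actually things like punctuation, <code>the</code>, <code>a</code>, and so on, but we quickly get to meaningful words. We intend to store a few thousand of the most popular words, so this should not be a problem.</p>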
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span>(all_words.most_common(15))</pre>
</div>
<p>That gives the 15 most common words. You can also look up how many times a single word occurs:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span>(all_words[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">stupid</span><span style="color: rgba(128, 0, 0, 1)">"</span>])</pre>
</div>
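<p>In NLTK 3, <code>FreqDist</code> is a subclass of Python's <code>collections.Counter</code>, so the two calls above can be tried without downloading the corpus. The sketch below uses a plain <code>Counter</code> and a made-up sentence as a stand-in for <code>movie_reviews.words()</code>.</p>

```python
from collections import Counter

# Made-up token list standing in for movie_reviews.words().
tokens = "The plot is stupid , the acting is stupid , but the score is fine .".split()

# Lower-case every token before counting, as the tutorial does.
freq = Counter(token.lower() for token in tokens)

print(freq.most_common(3))   # the three most frequent tokens with their counts
print(freq["stupid"])        # 2
print(freq["brilliant"])     # 0: unseen words count as zero instead of raising KeyError
```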
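<p>Next, we start storing our words as features of positive or negative movie reviews.</p>
<h1>14. Converting Words to Features with NLTK</h1>
<p>In this part we compile a feature list of words from the positive and negative reviews, to see which kinds of words tend to show up in each.</p>
<p>To start, our code:</p>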
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> random
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews

documents </span>=<span style="color: rgba(0, 0, 0, 1)"> [(list(movie_reviews.words(fileid)), category)
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> category <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.categories()
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> fileid <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.fileids(category)]

random.shuffle(documents)

all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> []

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.words():
    all_words.append(w.lower())

all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.FreqDist(all_words)

word_features </span>= [w for w, _ in all_words.most_common(3000)]</pre>
</div>
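<p>Mostly the same as before, except for a new variable, <code>word_features</code>, which holds the 3,000 most common words. Next, we build a simple function that looks for these 3,000 words in a document and marks each one as present or absent:</p>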
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> find_features(document):
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> set(document)
    features </span>=<span style="color: rgba(0, 0, 0, 1)"> {}
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> word_features:
        features[w] </span>= (w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> words)

    </span><span style="color: rgba(0, 0, 255, 1)">return</span> features</pre>
</div>
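<p>To make the shape of the result concrete, here is the same idea on a tiny scale. The three-word vocabulary and the sample review are made up for illustration; in the tutorial, <code>word_features</code> holds 3,000 words.</p>

```python
# Hypothetical three-word vocabulary standing in for word_features.
word_features = ["great", "boring", "plot"]

def find_features(document):
    """Map each vocabulary word to True/False depending on whether it occurs."""
    words = set(document)  # a set makes each membership test cheap
    return {w: (w in words) for w in word_features}

review = ["the", "plot", "was", "boring"]
features = find_features(review)
print(features)  # {'great': False, 'boring': True, 'plot': True}
```

<p>Every document produces a dictionary with the same keys, which is what lets a classifier compare documents feature by feature.</p>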
<p>Next, we can print one feature set:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span>((find_features(movie_reviews.words(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">neg/cv000_29416.txt</span><span style="color: rgba(128, 0, 0, 1)">'</span>))))</pre>
</div>
<p>Then we can do this for all of our documents, saving the feature-presence booleans along with each document's positive or negative category:</p>
<div class="cnblogs_code">
<pre>featuresets = [(find_features(rev), category) <span style="color: rgba(0, 0, 255, 1)">for</span> (rev, category) <span style="color: rgba(0, 0, 255, 1)">in</span> documents]</pre>
</div>
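<p>Great, now we have features and labels. What next? Typically, the next step is to train an algorithm and then test it. So let's do that, starting with the Naive Bayes classifier in the next part!</p>
<h1>15. Naive Bayes Classifier with NLTK</h1>
<p>Now it is time to pick an algorithm, split our data into training and testing sets, and get going! The first algorithm we'll use is the Naive Bayes classifier. It is a very popular algorithm for text classification, so it is a natural first choice. Before we can train and test it, though, we need to split the data into a training set and a testing set.</p>
<p>You could train and test on the same dataset, but that would introduce serious bias, so you should never train and test on exactly the same data. Since we have already shuffled the dataset, we take the first 1,900 shuffled reviews, a mix of positive and negative, as the training set. We can then test on the last 100 to see how accurate we are.</p>
<p>This is called supervised machine learning, because we show the machine data and tell it "this data is positive" or "this data is negative". Once training is done, we show the machine some new data and ask which category it thinks the data belongs to, based on what we taught it.</p>
<p>We can split the data like this:</p>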
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> set that we'll train our classifier with</span>
training_set = featuresets[:1900<span style="color: rgba(0, 0, 0, 1)">]

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> set that we'll test against.</span>
testing_set = featuresets[1900:]</pre>
</div>
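<p>The two slices must not overlap, otherwise the test would include reviews the classifier has already seen. A small sketch of the same split, wrapped in a hypothetical helper named <code>split_holdout</code> (not part of NLTK) and run on placeholder data:</p>

```python
def split_holdout(items, n_test):
    """Return (train, test), where test is the last n_test items."""
    cut = len(items) - n_test
    return items[:cut], items[cut:]

# Placeholder feature sets: 2000 (features, label) pairs.
featuresets = [({"id": i}, "pos" if i % 2 else "neg") for i in range(2000)]

training_set, testing_set = split_holdout(featuresets, 100)
print(len(training_set), len(testing_set))  # 1900 100
```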
<p>Next, we can define and train our classifier:</p>
<div class="cnblogs_code">
<pre>classifier = nltk.NaiveBayesClassifier.train(training_set)</pre>
</div>
<p>Here we simply call the Naive Bayes classifier and train it with <code>.train()</code>, all in one line.</p>
<p>Easy enough; now it is trained. Next, we can test it:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>,(nltk.classify.accuracy(classifier, testing_set))*100)</pre>
</div>
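<p>Boom, you have your answer. In case you missed it, the reason we can "test" is that we still have the correct answers. In testing, we show the computer the data without giving it the correct answer; if its guess matches the answer we know, the computer is right. Given the shuffling, your accuracy and mine will differ, but you should see roughly 60-75% on average.</p>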
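<p>Accuracy here is simply the share of test examples whose predicted label matches the known one. The sketch below spells that out with a hypothetical <code>accuracy</code> helper and a made-up rule-based predictor; it illustrates the idea and is not NLTK's implementation.</p>

```python
def accuracy(predict, labeled_featuresets):
    """Fraction of examples whose predicted label equals the known label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if predict(features) == label)
    return correct / len(labeled_featuresets)

# Made-up "classifier" used only for illustration.
def toy_predict(features):
    return "neg" if features.get("boring") else "pos"

test_data = [
    ({"boring": True}, "neg"),
    ({"boring": False}, "pos"),
    ({"boring": True}, "pos"),   # the rule gets this one wrong
    ({"boring": False}, "pos"),
]
print("Accuracy percent:", accuracy(toy_predict, test_data) * 100)  # 75.0
```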
<p>Next, we can look at which words are the most informative for positive or negative reviews:</p>
<div class="cnblogs_code">
<pre>classifier.show_most_informative_features(15)</pre>
</div>
<p>This will differ for everyone, but you should see something like this:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">Most Informative Features
insulting </span>= True neg : pos = 10.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
ludicrous </span>= True neg : pos = 10.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
winslet </span>= True pos : neg = 9.0 : 1.0<span style="color: rgba(0, 0, 0, 1)">
detract </span>= True pos : neg = 8.4 : 1.0<span style="color: rgba(0, 0, 0, 1)">
breathtaking </span>= True pos : neg = 8.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
silverstone </span>= True neg : pos = 7.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
excruciatingly </span>= True neg : pos = 7.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
warns </span>= True pos : neg = 7.0 : 1.0<span style="color: rgba(0, 0, 0, 1)">
tracy </span>= True pos : neg = 7.0 : 1.0<span style="color: rgba(0, 0, 0, 1)">
insipid </span>= True neg : pos = 7.0 : 1.0<span style="color: rgba(0, 0, 0, 1)">
freddie </span>= True neg : pos = 7.0 : 1.0<span style="color: rgba(0, 0, 0, 1)">
damon </span>= True pos : neg = 5.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
debate </span>= True pos : neg = 5.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
ordered </span>= True pos : neg = 5.8 : 1.0<span style="color: rgba(0, 0, 0, 1)">
lang </span>= True pos : neg = 5.7 : 1.0</pre>
</div>
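<p>What this tells you is, for each word, the ratio of occurrences in negative versus positive reviews, or the other way round. Here we can see that <code>insulting</code> appears 10.6 times more often in negative reviews than in positive ones, and <code>ludicrous</code> 10.1 times more often.</p>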
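<p>As a rough illustration of where such a ratio comes from, the sketch below divides the share of negative reviews containing a word by the share of positive reviews containing it. The counts are made up, and the helper <code>presence_ratio</code> is hypothetical. NLTK smooths its probability estimates, so its figures will not match raw counts exactly.</p>

```python
def presence_ratio(n_neg_with, n_neg_total, n_pos_with, n_pos_total):
    """Ratio of P(word present | neg) to P(word present | pos), unsmoothed."""
    p_neg = n_neg_with / n_neg_total
    p_pos = n_pos_with / n_pos_total
    return p_neg / p_pos

# Made-up counts: the word occurs in 50 of 1000 negative training reviews
# and in 5 of 1000 positive ones.
ratio = presence_ratio(50, 1000, 5, 1000)
print(f"neg : pos = {ratio:.1f} : 1.0")  # neg : pos = 10.0 : 1.0
```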
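<p>Now, suppose you are perfectly happy with your results and want to move on, perhaps using this classifier to make predictions right away. Retraining the classifier every time you need it is highly impractical. Instead, you can save the classifier with the <code>pickle</code> module, which we do next.</p>
<h1>16. Saving Classifiers with NLTK</h1>
<p>Training classifiers and machine learning algorithms can take a very long time, especially on larger datasets. Ours is actually quite small. Can you imagine having to train the classifier every time you wanted to use it? What a horror! Instead, we can use the <code>pickle</code> module to serialize our classifier object, so that all we need to do is load that file.</p>
<p>So how do we do it? The first step is to save the object. To do that, import <code>pickle</code> at the top of your script, and then, after training the classifier with <code>.train()</code>, call the following lines:</p>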
<div class="cnblogs_code">
<pre>save_classifier = open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">naivebayes.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(classifier, save_classifier)
save_classifier.close()</span></pre>
</div>
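<p>This opens a pickle file, ready to write bytes. We then use <code>pickle.dump()</code> to dump the data. The first argument to <code>pickle.dump()</code> is the object you are writing, and the second is the file you are writing it to.</p>
<p>After that, we close the file as we should, and we now have a pickled, or serialized, object saved in the script's directory!</p>
<p>Next, how do we start using this classifier? The <code>.pickle</code> file is a serialized object; all we need to do now is read it into memory, which is about as simple as reading any other file. To do so:</p>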
<div class="cnblogs_code">
<pre>classifier_f = open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">naivebayes.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(classifier_f)
classifier_f.close()</span></pre>
</div>
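<p>This is a very similar process. We open the file to read bytes, use <code>pickle.load()</code> to load it, and store the result in the <code>classifier</code> variable. Then we close the file, and that is that. We now have the same classifier object as before!</p>
<p>From now on, we can use this object whenever we want to classify something, without having to train the classifier again.</p>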
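<p>The save and load steps can also be written with <code>with</code> blocks, which close the file even if an error occurs. The sketch below round-trips a plain dictionary as a stand-in for the trained classifier and writes to a temporary directory; both choices are assumptions made so the example runs on its own.</p>

```python
import os
import pickle
import tempfile

# A plain dict stands in for the trained classifier object.
model = {"labels": ["neg", "pos"], "n_features": 3000}

path = os.path.join(tempfile.mkdtemp(), "naivebayes.pickle")

# Pickle data is bytes, so the file is opened in binary mode ("wb" / "rb").
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

<p>Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.</p>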
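<p>This is all well and good, but we may not be satisfied with the 60-75% accuracy we are getting. What about other classifiers? There are in fact many, but we need the scikit-learn (sklearn) module for them. Luckily, the NLTK developers recognized the value of incorporating sklearn into NLTK and built a small API for us. That is what we do in the next part.</p>
<h1>17. NLTK and Sklearn</h1>
<p>Now that we have seen how easy it is to use a classifier, we want to try more of them! The go-to Python module for this is scikit-learn (sklearn).</p>
<p>If you'd like to learn more about scikit-learn, I have some tutorials on machine learning with scikit-learn.</p>
<p>Fortunately for us, the people behind NLTK saw the value of bringing sklearn into NLTK's classifier methods, so they created the <code>SklearnClassifier</code> API. To use it, you just import it like this:</p>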
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify.scikitlearn <span style="color: rgba(0, 0, 255, 1)">import</span> SklearnClassifier</pre>
</div>
<p>From here, you can use any <code>sklearn</code> classifier. For example, let's bring in a couple more variants of the Naive Bayes algorithm:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.naive_bayes <span style="color: rgba(0, 0, 255, 1)">import</span> MultinomialNB,BernoulliNB</pre>
</div>
<p>So how do we use them? As it turns out, it's very simple.</p>
<div class="cnblogs_code">
<pre>MNB_classifier =<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">MultinomialNB accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">, nltk.classify.accuracy(MNB_classifier, testing_set)*100)

BNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">BernoulliNB accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, nltk.classify.accuracy(BNB_classifier, testing_set)*100)</pre>
</div>
<p>It's that simple. Let's bring in some more:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.linear_model <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> LogisticRegression,SGDClassifier
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.svm <span style="color: rgba(0, 0, 255, 1)">import</span> SVC, LinearSVC, NuSVC</pre>
</div>
<p>Now, all of our classifiers should look like this:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Original Naive Bayes Algo accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)
classifier.show_most_informative_features(</span>15<span style="color: rgba(0, 0, 0, 1)">)

MNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">MNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(MNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

BernoulliNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">BernoulliNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

LogisticRegression_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LogisticRegression_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

SGDClassifier_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">SGDClassifier_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

SVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(SVC())
SVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">SVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(SVC_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

LinearSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LinearSVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

NuSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">NuSVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)</pre>
</div>
<p>Running it should give something like this:</p>
<div class="cnblogs_code">
<pre>Original Naive Bayes Algo accuracy percent: 63.0<span style="color: rgba(0, 0, 0, 1)">
Most Informative Features
                thematic </span>= True            pos : neg    =      9.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                secondly </span>= True            pos : neg    =      8.5 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                narrates </span>= True            pos : neg    =      7.8 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               rounded </span>= True            pos : neg    =      7.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               supreme </span>= True            pos : neg    =      7.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               layered </span>= True            pos : neg    =      7.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                  crappy </span>= True            neg : pos    =      6.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               uplifting </span>= True            pos : neg    =      6.2 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                     ugh </span>= True            neg : pos    =      5.3 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   mamet </span>= True            pos : neg    =      5.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               gaining </span>= True            pos : neg    =      5.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   wanda </span>= True            neg : pos    =      4.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   onset </span>= True            neg : pos    =      4.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               fantastic </span>= True            pos : neg    =      4.5 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                kentucky </span>= True            pos : neg    =      4.4 : 1.0<span style="color: rgba(0, 0, 0, 1)">
MNB_classifier accuracy percent: </span>66.0<span style="color: rgba(0, 0, 0, 1)">
BernoulliNB_classifier accuracy percent: </span>72.0<span style="color: rgba(0, 0, 0, 1)">
LogisticRegression_classifier accuracy percent: </span>64.0<span style="color: rgba(0, 0, 0, 1)">
SGDClassifier_classifier accuracy percent: </span>61.0<span style="color: rgba(0, 0, 0, 1)">
SVC_classifier accuracy percent: </span>45.0<span style="color: rgba(0, 0, 0, 1)">
LinearSVC_classifier accuracy percent: </span>68.0<span style="color: rgba(0, 0, 0, 1)">
NuSVC_classifier accuracy percent: </span>59.0</pre>
</div>
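<p>So we can see that SVC is wrong more often than it is right, so we should probably drop it. But what next? We could try using all of these algorithms at once. An algorithm of algorithms! To do this, we can create another classifier whose result is based on the results of the other algorithms. It works like a voting system, so we just need an odd number of algorithms. That is what we cover in the next part.</p>
<h1>18. Combining Algorithms with NLTK</h1>
<p>Now that we know how to use a bunch of algorithmic classifiers, we may find it hard to settle on just one, like a kid on Candy Island who is told to pick only one thing. The good news is, you don't have to! Combining classifier algorithms is a common technique; it works by setting up a voting system in which each algorithm gets one vote and the category with the most votes wins.</p>
<p>To do this, we want our new classifier to behave like a typical NLTK classifier, with all of its methods. Simple enough: using object-oriented programming, we can inherit from the NLTK classifier class. To do that, we import it:</p>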
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> ClassifierI
</span><span style="color: rgba(0, 0, 255, 1)">from</span> statistics <span style="color: rgba(0, 0, 255, 1)">import</span> mode</pre>
</div>
<p>We also import <code>mode</code>, since that will be how we pick the label with the most votes.</p>
<p>Now, let's build our classifier class:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(ClassifierI):
    </span><span style="color: rgba(0, 0, 255, 1)">def</span> <span style="color: rgba(128, 0, 128, 1)">__init__</span>(self, *<span style="color: rgba(0, 0, 0, 1)">classifiers):
      self._classifiers </span>= classifiers</pre>
</div>
<p>We call our class <code>VoteClassifier</code> and inherit from NLTK's <code>ClassifierI</code>. Then we assign the classifiers passed to our class to <code>self._classifiers</code>.</p>
<p>Next, we write our own classify method. We name it <code>.classify</code> so that we can call <code>.classify</code> later, just as with a regular NLTK classifier.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> classify(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span> mode(votes)</pre>
</div>
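<p>Simple enough: all we do here is iterate over our list of classifier objects and ask each one to classify based on the features. Each classification counts as a vote. When the loop is done, we return <code>mode(votes)</code>, which is just the most common vote.</p>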
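<p>The voting step can be tried in isolation. In the sketch below, three made-up functions stand in for trained classifiers, and a hypothetical helper <code>majority_vote</code> collects their labels and returns the most common one.</p>

```python
from statistics import mode

# Hypothetical stand-ins for trained classifiers: each maps features to a label.
def by_boring(features):
    return "neg" if features["boring"] else "pos"

def by_great(features):
    return "pos" if features["great"] else "neg"

def always_pos(features):
    return "pos"

def majority_vote(voters, features):
    """Collect one label per voter and return the most common one."""
    votes = [vote(features) for vote in voters]
    return mode(votes)

sample = {"boring": True, "great": False}
print(majority_vote([by_boring, by_great, always_pos], sample))  # neg
```

<p>With two labels and an odd number of voters there is always a clear winner. On a tie, <code>statistics.mode</code> raises <code>StatisticsError</code> before Python 3.8 and returns the first mode encountered from 3.8 onward, which is another reason to keep the number of voters odd.</p>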
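<p>That is all we really need, but I think another measure, confidence, is useful. Since we have a voting algorithm, we can also count the votes for and against and call the result "confidence". For example, 3/5 votes is weaker confidence than 5/5. So we can literally return the share of the vote as a kind of confidence indicator. Here is our confidence method:</p>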
<div class="cnblogs_code">
<pre> <span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> confidence(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)

      choice_votes </span>=<span style="color: rgba(0, 0, 0, 1)"> votes.count(mode(votes))
      conf </span>= choice_votes /<span style="color: rgba(0, 0, 0, 1)"> len(votes)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span> conf</pre>
</div>
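<p>The confidence figure is just the winning label's share of the votes. A minimal sketch on plain lists of votes, using a hypothetical helper <code>vote_confidence</code> built on <code>collections.Counter</code> rather than the class above:</p>

```python
from collections import Counter

def vote_confidence(votes):
    """Return (winning_label, share_of_votes_it_received)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

print(vote_confidence(["pos", "pos", "pos", "neg", "neg"]))  # ('pos', 0.6)
print(vote_confidence(["pos"] * 5))                          # ('pos', 1.0)
```

<p>Returning the label and its share together is one possible design: the votes are collected once instead of once per method call.</p>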
<p>Now, let's put it all together:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> random
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify.scikitlearn <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pickle

</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.naive_bayes <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> MultinomialNB, BernoulliNB
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.linear_model <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> LogisticRegression, SGDClassifier
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.svm <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SVC, LinearSVC, NuSVC

</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> ClassifierI
</span><span style="color: rgba(0, 0, 255, 1)">from</span> statistics <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> mode


</span><span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(ClassifierI):
    </span><span style="color: rgba(0, 0, 255, 1)">def</span> <span style="color: rgba(128, 0, 128, 1)">__init__</span>(self, *<span style="color: rgba(0, 0, 0, 1)">classifiers):
      self._classifiers </span>=<span style="color: rgba(0, 0, 0, 1)"> classifiers

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> classify(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> mode(votes)

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> confidence(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)

      choice_votes </span>=<span style="color: rgba(0, 0, 0, 1)"> votes.count(mode(votes))
      conf </span>= choice_votes /<span style="color: rgba(0, 0, 0, 1)"> len(votes)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> conf

documents </span>=<span style="color: rgba(0, 0, 0, 1)"> [(list(movie_reviews.words(fileid)), category)
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> category <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.categories()
             </span><span style="color: rgba(0, 0, 255, 1)">for</span> fileid <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.fileids(category)]

random.shuffle(documents)

all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> []

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews.words():
    all_words.append(w.lower())

all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.FreqDist(all_words)

word_features </span>= [w for w, _ in all_words.most_common(3000)<span style="color: rgba(0, 0, 0, 1)">]

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> find_features(document):
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> set(document)
    features </span>=<span style="color: rgba(0, 0, 0, 1)"> {}
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> word_features:
        features[w] </span>= (w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> words)

    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> features

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))</span>
<span style="color: rgba(0, 0, 0, 1)">
featuresets </span>= [(find_features(rev), category) <span style="color: rgba(0, 0, 255, 1)">for</span> (rev, category) <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> documents]
      
training_set </span>= featuresets[:1900<span style="color: rgba(0, 0, 0, 1)">]
testing_set </span>= featuresets[1900<span style="color: rgba(0, 0, 0, 1)">:]

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">classifier = nltk.NaiveBayesClassifier.train(training_set)</span>
<span style="color: rgba(0, 0, 0, 1)">
classifier_f </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">naivebayes.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(classifier_f)
classifier_f.close()


</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Original Naive Bayes Algo accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)
classifier.show_most_informative_features(</span>15<span style="color: rgba(0, 0, 0, 1)">)

MNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">MNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(MNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

BernoulliNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">BernoulliNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

LogisticRegression_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LogisticRegression_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

SGDClassifier_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">SGDClassifier_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">#SVC_classifier = SklearnClassifier(SVC())</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#SVC_classifier.train(training_set)</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)</span>
<span style="color: rgba(0, 0, 0, 1)">
LinearSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LinearSVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

NuSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">NuSVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)


voted_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(classifier,
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  SGDClassifier_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">voted_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(voted_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> classify the first six test reviews one at a time; classify() expects a single feature dict</span>
<span style="color: rgba(0, 0, 255, 1)">for</span> features, _ <span style="color: rgba(0, 0, 255, 1)">in</span> testing_set[:6]:
    <span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"Classification:"</span>, voted_classifier.classify(features), <span style="color: rgba(128, 0, 0, 1)">"Confidence %:"</span>, voted_classifier.confidence(features)*100)</pre>
</div>
<p>Finally, we run the voting classifier on a few examples from the test set. Here is all of our output:</p>
<div class="cnblogs_code">
<pre>Original Naive Bayes Algo accuracy percent: 66.0<span style="color: rgba(0, 0, 0, 1)">
Most Informative Features
                thematic </span>= True            pos : neg    =      9.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                secondly </span>= True            pos : neg    =      8.5 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                narrates </span>= True            pos : neg    =      7.8 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               layered </span>= True            pos : neg    =      7.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               rounded </span>= True            pos : neg    =      7.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               supreme </span>= True            pos : neg    =      7.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                  crappy </span>= True            neg : pos    =      6.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               uplifting </span>= True            pos : neg    =      6.2 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                     ugh </span>= True            neg : pos    =      5.3 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               gaining </span>= True            pos : neg    =      5.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   mamet </span>= True            pos : neg    =      5.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   wanda </span>= True            neg : pos    =      4.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   onset </span>= True            neg : pos    =      4.9 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               fantastic </span>= True            pos : neg    =      4.5 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   milos </span>= True            pos : neg    =      4.4 : 1.0<span style="color: rgba(0, 0, 0, 1)">
MNB_classifier accuracy percent: </span>67.0<span style="color: rgba(0, 0, 0, 1)">
BernoulliNB_classifier accuracy percent: </span>67.0<span style="color: rgba(0, 0, 0, 1)">
LogisticRegression_classifier accuracy percent: </span>68.0<span style="color: rgba(0, 0, 0, 1)">
SGDClassifier_classifier accuracy percent: </span>57.99999999999999<span style="color: rgba(0, 0, 0, 1)">
LinearSVC_classifier accuracy percent: </span>67.0<span style="color: rgba(0, 0, 0, 1)">
NuSVC_classifier accuracy percent: </span>65.0<span style="color: rgba(0, 0, 0, 1)">
voted_classifier accuracy percent: </span>65.0<span style="color: rgba(0, 0, 0, 1)">
Classification: neg Confidence </span>%: 100.0<span style="color: rgba(0, 0, 0, 1)">
Classification: pos Confidence </span>%: 57.14285714285714<span style="color: rgba(0, 0, 0, 1)">
Classification: neg Confidence </span>%: 57.14285714285714<span style="color: rgba(0, 0, 0, 1)">
Classification: neg Confidence </span>%: 57.14285714285714<span style="color: rgba(0, 0, 0, 1)">
Classification: pos Confidence </span>%: 57.14285714285714<span style="color: rgba(0, 0, 0, 1)">
Classification: pos Confidence </span>%: 85.71428571428571</pre>
</div>
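<p>The confidence figures in the output above are plain vote fractions. With seven classifiers voting, 57.14% corresponds to 4 of 7 agreeing and 85.71% to 6 of 7. The following is a minimal sketch of that arithmetic, not the tutorial's <code>VoteClassifier</code> itself: the helper name <code>vote_confidence</code> and the list of votes are made up for illustration, and it uses <code>collections.Counter</code> in place of <code>statistics.mode</code>.</p>

```python
from collections import Counter

def vote_confidence(votes):
    """Return the winning label and the share of votes that agree with it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Seven hypothetical votes, one per classifier in the ensemble.
label, conf = vote_confidence(["pos", "neg", "pos", "pos", "neg", "pos", "neg"])
print(label, round(conf * 100, 2))  # pos 57.14
```

<p>Using an odd number of classifiers on a two-label problem avoids tied votes, which is one reason to keep the ensemble size odd.</p>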
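<h1>19. Investigating Bias with NLTK</h1>
<p>We need to discuss a few problems. The main one is that our algorithm is fairly biased. You can check this by commenting out the shuffling of the documents, training on the first 1,900, and leaving the last 100 reviews (all positive) for testing. Run the test and you will find that the accuracy is poor.</p>
<p>Conversely, you can test on the first 100 documents, all of which are negative, and train on the remaining 1,900. Here you will find that the accuracy is very high. That is a bad sign. It could mean a lot of things, and we have many options for addressing it.</p>
<p>That said, the project we have in mind suggests moving on to a different dataset, so that is what we will do. In the end, we will find that the new dataset still carries some bias: it picks negative more often. The reason is that negative reviews tend to be more strongly negative than positive reviews are positive. This can be handled with some simple weighting, but it can also get complicated. Maybe a tutorial for another day. For now, we are going to grab a new dataset, which we will discuss in the next tutorial.</p>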
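<p>One way to make this kind of bias visible without re-splitting the data is to report accuracy separately for each true label. The sketch below is an illustration under assumptions: <code>per_class_accuracy</code>, <code>always_neg</code>, and the toy test set are hypothetical names and data, not part of the tutorial's code.</p>

```python
from collections import defaultdict

def per_class_accuracy(classify, labeled_examples):
    """Accuracy broken down by true label, to expose one-sided errors."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for features, label in labeled_examples:
        total[label] += 1
        if classify(features) == label:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}

# Hypothetical stand-in for a trained classifier: it always answers "neg".
def always_neg(features):
    return "neg"

toy_test_set = [({"great": True}, "pos"), ({"awful": True}, "neg"),
                ({"fine": True}, "pos"), ({"dull": True}, "neg")]

print(per_class_accuracy(always_neg, toy_test_set))  # {'pos': 0.0, 'neg': 1.0}
```

<p>With a trained NLTK classifier, passing its <code>classify</code> method as the first argument and <code>testing_set</code> as the second should give the same kind of breakdown. A large gap between the two numbers points to the bias described above.</p>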
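<h1>20. Improving Training Data for Sentiment Analysis with NLTK</h1>
<p>Now it is time to train on a new dataset. Our goal is to analyze Twitter sentiment, so we want a dataset whose positive and negative statements are each fairly short. It so happens that I have 5,300+ positive and 5,300+ negative movie reviews that are much shorter. The larger training set should give us better accuracy, and the shorter texts should fit tweets better.</p>
<p>I host both files; you can find them under the download for the short reviews. Save them as <code>positive.txt</code> and <code>negative.txt</code>.</p>
<p>Now we can build the new dataset much as before. What needs to change?</p>
<p>We need a new way to create our <code>documents</code> variable, and a new way to create the <code>all_words</code> variable. No real problem; here is how I did it:</p>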
<div class="cnblogs_code">
<pre>short_pos = open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">short_reviews/positive.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()
short_neg </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">short_reviews/negative.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()

documents </span>=<span style="color: rgba(0, 0, 0, 1)"> []

</span><span style="color: rgba(0, 0, 255, 1)">for</span> r <span style="color: rgba(0, 0, 255, 1)">in</span> short_pos.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    documents.append( (r, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pos</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">) )

</span><span style="color: rgba(0, 0, 255, 1)">for</span> r <span style="color: rgba(0, 0, 255, 1)">in</span> short_neg.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    documents.append( (r, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">neg</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">) )


all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> []

short_pos_words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(short_pos)
short_neg_words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(short_neg)

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> short_pos_words:
    all_words.append(w.lower())

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> short_neg_words:
    all_words.append(w.lower())

all_words </span>= nltk.FreqDist(all_words)</pre>
</div>
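<p>One detail worth watching in the loading code above: if a file ends with a newline, <code>split('\n')</code> yields a final empty string, which then becomes an empty review with a label attached. A minimal sketch of one way to skip blank lines follows. The helper <code>build_documents</code> and the inline sample strings are assumptions made for illustration; they stand in for the real contents of <code>positive.txt</code> and <code>negative.txt</code>.</p>

```python
def build_documents(text, label):
    """Turn a newline-separated block of reviews into (review, label) pairs."""
    return [(line.strip(), label) for line in text.splitlines() if line.strip()]

# Inline stand-ins for the contents of positive.txt and negative.txt.
sample_pos = "a refreshing , tender film\nterrific and captivating\n"
sample_neg = "stupid and meandering\n\nbanal , amateurish tv fare\n"

documents = build_documents(sample_pos, "pos") + build_documents(sample_neg, "neg")
print(len(documents))  # 4
```

<p>The trailing newline and the blank line in the middle of <code>sample_neg</code> produce no documents, so every pair carries real text.</p>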
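<p>Next, we also need to adjust our feature-finding function, mainly so that it tokenizes the words in each document, since our new samples do not have the handy <code>.words()</code> feature. I also went ahead and raised the number of words used as features (5,000 here):</p>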
<div class="cnblogs_code">
<pre>word_features = list(all_words.keys())[:5000<span style="color: rgba(0, 0, 0, 1)">]

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> find_features(document):
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(document)
    features </span>=<span style="color: rgba(0, 0, 0, 1)"> {}
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> word_features:
        features[w] </span>= (w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> words)

    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> features
   
featuresets </span>= [(find_features(rev), category) <span style="color: rgba(0, 0, 255, 1)">for</span> (rev, category) <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> documents]
random.shuffle(featuresets)</span></pre>
</div>
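<p>Note that <code>list(all_words.keys())[:5000]</code> takes the first 5,000 keys in whatever order the dictionary yields them, which is not necessarily the 5,000 most frequent words. The sketch below shows one way to pick features by frequency and to speed up the membership test with a set. It is an illustration under assumptions: it uses <code>collections.Counter</code> and <code>str.split()</code> as dependency-free stand-ins for <code>nltk.FreqDist</code> and <code>word_tokenize</code>, the helper <code>top_words</code> is hypothetical, and unlike the original <code>find_features</code> it lowercases the document and takes the feature list as a parameter.</p>

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent lowercased tokens."""
    counts = Counter(t.lower() for t in tokens)
    return [word for word, _ in counts.most_common(n)]

def find_features(document, word_features):
    """Map each feature word to True/False depending on its presence."""
    words = set(document.lower().split())  # stand-in for word_tokenize
    return {w: (w in words) for w in word_features}

tokens = "good good good bad bad dull".split()
word_features = top_words(tokens, 2)
print(word_features)  # ['good', 'bad']
print(find_features("A good but dull film", word_features))
```

<p>In NLTK 3, <code>FreqDist</code> is built on <code>Counter</code>, so <code>all_words.most_common(5000)</code> should work the same way there. Switching to it would change the feature set, so the accuracy figures would no longer match the output shown in this tutorial.</p>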
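<p>Other than that, the rest is the same. Here is the full script, in case you or I missed something:</p>
<p>This process takes a while; you may want to go do something else. It took me about 30-40 minutes to run everything on an i7 3930K. At the time of writing (2015), a typical processor might take several hours. It is a one-time process, though.</p>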
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> random
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.corpus <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> movie_reviews
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify.scikitlearn <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pickle

</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.naive_bayes <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> MultinomialNB, BernoulliNB
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.linear_model <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> LogisticRegression, SGDClassifier
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.svm <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SVC, LinearSVC, NuSVC

</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> ClassifierI
</span><span style="color: rgba(0, 0, 255, 1)">from</span> statistics <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> mode

</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> word_tokenize


</span><span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(ClassifierI):
    </span><span style="color: rgba(0, 0, 255, 1)">def</span> <span style="color: rgba(128, 0, 128, 1)">__init__</span>(self, *<span style="color: rgba(0, 0, 0, 1)">classifiers):
      self._classifiers </span>=<span style="color: rgba(0, 0, 0, 1)"> classifiers

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> classify(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> mode(votes)

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> confidence(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)

      choice_votes </span>=<span style="color: rgba(0, 0, 0, 1)"> votes.count(mode(votes))
      conf </span>= choice_votes /<span style="color: rgba(0, 0, 0, 1)"> len(votes)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> conf
      
short_pos </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">short_reviews/positive.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()
short_neg </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">short_reviews/negative.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()

documents </span>=<span style="color: rgba(0, 0, 0, 1)"> []

</span><span style="color: rgba(0, 0, 255, 1)">for</span> r <span style="color: rgba(0, 0, 255, 1)">in</span> short_pos.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    documents.append( (r, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pos</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">) )

</span><span style="color: rgba(0, 0, 255, 1)">for</span> r <span style="color: rgba(0, 0, 255, 1)">in</span> short_neg.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    documents.append( (r, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">neg</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">) )


all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> []

short_pos_words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(short_pos)
short_neg_words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(short_neg)

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> short_pos_words:
    all_words.append(w.lower())

</span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> short_neg_words:
    all_words.append(w.lower())

all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.FreqDist(all_words)

word_features </span>= list(all_words.keys())[:5000<span style="color: rgba(0, 0, 0, 1)">]

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> find_features(document):
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(document)
    features </span>=<span style="color: rgba(0, 0, 0, 1)"> {}
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> word_features:
        features[w] </span>= (w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> words)

    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> features

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))</span>
<span style="color: rgba(0, 0, 0, 1)">
featuresets </span>= [(find_features(rev), category) <span style="color: rgba(0, 0, 255, 1)">for</span> (rev, category) <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> documents]

random.shuffle(featuresets)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> positive data example:      </span>
training_set = featuresets[:10000<span style="color: rgba(0, 0, 0, 1)">]
testing_set </span>= featuresets[10000:]<span style="color: rgba(0, 0, 0, 1)">

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">## negative data example:      </span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#training_set = featuresets[100:]</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#testing_set =featuresets[:100]</span>
<span style="color: rgba(0, 0, 0, 1)">

classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.NaiveBayesClassifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Original Naive Bayes Algo accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)
classifier.show_most_informative_features(</span>15<span style="color: rgba(0, 0, 0, 1)">)

MNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">MNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(MNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

BernoulliNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">BernoulliNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

LogisticRegression_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LogisticRegression_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

SGDClassifier_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">SGDClassifier_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">#SVC_classifier = SklearnClassifier(SVC())</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#SVC_classifier.train(training_set)</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)</span>
<span style="color: rgba(0, 0, 0, 1)">
LinearSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LinearSVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

NuSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">NuSVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)


voted_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">voted_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(voted_classifier, testing_set))*100)</pre>
</div>
<p>Output:</p>
<div class="cnblogs_code">
<pre>Original Naive Bayes Algo accuracy percent: 66.26506024096386<span style="color: rgba(0, 0, 0, 1)">
Most Informative Features
            refreshing </span>= True            pos : neg    =   13.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                captures </span>= True            pos : neg    =   11.3 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                  stupid </span>= True            neg : pos    =   10.7 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                  tender </span>= True            pos : neg    =      9.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
            meandering </span>= True            neg : pos    =      9.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                      tv </span>= True            neg : pos    =      8.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               low</span>-key = True            pos : neg    =      8.3 : 1.0<span style="color: rgba(0, 0, 0, 1)">
            thoughtful </span>= True            pos : neg    =      8.1 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                   banal </span>= True            neg : pos    =      7.7 : 1.0<span style="color: rgba(0, 0, 0, 1)">
            amateurish </span>= True            neg : pos    =      7.7 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                terrific </span>= True            pos : neg    =      7.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                  record </span>= True            pos : neg    =      7.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
             captivating </span>= True            pos : neg    =      7.6 : 1.0<span style="color: rgba(0, 0, 0, 1)">
                portrait </span>= True            pos : neg    =      7.4 : 1.0<span style="color: rgba(0, 0, 0, 1)">
               culture </span>= True            pos : neg    =      7.3 : 1.0<span style="color: rgba(0, 0, 0, 1)">
MNB_classifier accuracy percent: </span>65.8132530120482<span style="color: rgba(0, 0, 0, 1)">
BernoulliNB_classifier accuracy percent: </span>66.71686746987952<span style="color: rgba(0, 0, 0, 1)">
LogisticRegression_classifier accuracy percent: </span>67.16867469879519<span style="color: rgba(0, 0, 0, 1)">
SGDClassifier_classifier accuracy percent: </span>65.8132530120482<span style="color: rgba(0, 0, 0, 1)">
LinearSVC_classifier accuracy percent: </span>66.71686746987952<span style="color: rgba(0, 0, 0, 1)">
NuSVC_classifier accuracy percent: </span>60.09036144578314<span style="color: rgba(0, 0, 0, 1)">
voted_classifier accuracy percent: </span>65.66265060240963</pre>
</div>
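<p>Yes, I bet that took you a while, so in the next tutorial we will talk about pickling everything!</p>
<h1>21. Creating a Sentiment Analysis Module with NLTK</h1>
<p>With this new dataset and new classifiers, we are ready to move forward. As you may have noticed, the new dataset takes longer to train on, since it is a larger set. I have shown you that by pickling, that is, serializing, the trained classifiers, which are just objects, we can save a lot of time.</p>
<p>I have already shown you how to do this with <code>pickle</code>, so I encourage you to try it on your own. If you need help, I will paste the full code, but be warned: do it yourself first!</p>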
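<p>As a reminder of the pattern, here is a minimal sketch of saving an object with <code>pickle</code> and loading it back. The helpers <code>save_object</code> and <code>load_object</code>, the temporary directory, and the short word list are assumptions made for illustration; a trained classifier should be saved and restored the same way.</p>

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Serialize obj to path in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_object(path):
    """Read back an object previously written by save_object."""
    with open(path, "rb") as f:
        return pickle.load(f)

word_features = ["refreshing", "stupid", "tender"]

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_features.pickle")
    save_object(word_features, path)
    restored = load_object(path)

print(restored == word_features)  # True
```

<p>The <code>with</code> blocks close each file even if an error occurs, which saves the explicit <code>close()</code> calls. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.</p>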
<p>This process takes a while; you may want to go do something else. It takes about 30-40 minutes to run everything.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> random
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">from nltk.corpus import movie_reviews</span>
<span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify.scikitlearn <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pickle
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.naive_bayes <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> MultinomialNB, BernoulliNB
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.linear_model <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> LogisticRegression, SGDClassifier
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.svm <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SVC, LinearSVC, NuSVC
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> ClassifierI
</span><span style="color: rgba(0, 0, 255, 1)">from</span> statistics <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> mode
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> word_tokenize



</span><span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(ClassifierI):
    </span><span style="color: rgba(0, 0, 255, 1)">def</span> <span style="color: rgba(128, 0, 128, 1)">__init__</span>(self, *<span style="color: rgba(0, 0, 0, 1)">classifiers):
      self._classifiers </span>=<span style="color: rgba(0, 0, 0, 1)"> classifiers

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> classify(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> mode(votes)

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> confidence(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)

      choice_votes </span>=<span style="color: rgba(0, 0, 0, 1)"> votes.count(mode(votes))
      conf </span>= choice_votes /<span style="color: rgba(0, 0, 0, 1)"> len(votes)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> conf
   
short_pos </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">short_reviews/positive.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()
short_neg </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">short_reviews/negative.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> move this up here</span>
all_words =<span style="color: rgba(0, 0, 0, 1)"> []
documents </span>=<span style="color: rgba(0, 0, 0, 1)"> []


</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> J is adjective, R is adverb, and V is verb</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">allowed_word_types = ["J","R","V"]</span>
allowed_word_types = [<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">J</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">]

</span><span style="color: rgba(0, 0, 255, 1)">for</span> p <span style="color: rgba(0, 0, 255, 1)">in</span> short_pos.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    documents.append( (p, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pos</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">) )
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(p)
    pos </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(words)
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> pos:
        # each w is a (word, tag) pair; keep words whose tag starts with an allowed letter
        </span><span style="color: rgba(0, 0, 255, 1)">if</span> w[1][0] <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> allowed_word_types:
            all_words.append(w[0].lower())

   
</span><span style="color: rgba(0, 0, 255, 1)">for</span> p <span style="color: rgba(0, 0, 255, 1)">in</span> short_neg.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">):
    documents.append( (p, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">neg</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">) )
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(p)
    pos </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(words)
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> pos:
        # each item is a (word, tag) pair; keep words whose tag starts with an allowed letter
        </span><span style="color: rgba(0, 0, 255, 1)">if</span> w[1][0] <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> allowed_word_types:
            all_words.append(w[0].lower())



save_documents </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/documents.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(documents, save_documents)
save_documents.close()


all_words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.FreqDist(all_words)


word_features </span>= list(all_words.keys())[:5000<span style="color: rgba(0, 0, 0, 1)">]


save_word_features </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/word_features5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(word_features, save_word_features)
save_word_features.close()


</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> find_features(document):
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(document)
    features </span>=<span style="color: rgba(0, 0, 0, 1)"> {}
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> word_features:
        features[w] </span>= (w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> words)

    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> features

featuresets </span>= [(find_features(rev), category) <span style="color: rgba(0, 0, 255, 1)">for</span> (rev, category) <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> documents]

random.shuffle(featuresets)
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(len(featuresets))

testing_set </span>= featuresets[10000<span style="color: rgba(0, 0, 0, 1)">:]
training_set </span>= featuresets[:10000<span style="color: rgba(0, 0, 0, 1)">]


classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.NaiveBayesClassifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Original Naive Bayes Algo accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)
classifier.show_most_informative_features(</span>15<span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">##############</span>
save_classifier = open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/originalnaivebayes5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(classifier, save_classifier)
save_classifier.close()

MNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">MNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(MNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

save_classifier </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/MNB_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(MNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">BernoulliNB_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

save_classifier </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/BernoulliNB_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LogisticRegression_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

save_classifier </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/LogisticRegression_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()


LinearSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">LinearSVC_classifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>, (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100<span style="color: rgba(0, 0, 0, 1)">)

save_classifier </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/LinearSVC_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()


</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">#NuSVC_classifier = SklearnClassifier(NuSVC())</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#NuSVC_classifier.train(training_set)</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">#print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)</span>
<span style="color: rgba(0, 0, 0, 1)">

SGDC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">SGDClassifier accuracy percent:</span><span style="color: rgba(128, 0, 0, 1)">"</span>,nltk.classify.accuracy(SGDC_classifier, testing_set)*100<span style="color: rgba(0, 0, 0, 1)">)

save_classifier </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/SGDC_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">wb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
pickle.dump(SGDC_classifier, save_classifier)
save_classifier.close()
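</span></pre>
</div>
<p>Now, you only need to run this once. You can rerun it whenever you like, but at this point you are ready to create the sentiment analysis module. Here is the file, which we call <code>sentiment_mod.py</code>:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">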

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">File: sentiment_mod.py</span>

<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> random
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">from nltk.corpus import movie_reviews</span>
<span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify.scikitlearn <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SklearnClassifier
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pickle
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.naive_bayes <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> MultinomialNB, BernoulliNB
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.linear_model <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> LogisticRegression, SGDClassifier
</span><span style="color: rgba(0, 0, 255, 1)">from</span> sklearn.svm <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> SVC, LinearSVC, NuSVC
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.classify <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> ClassifierI
</span><span style="color: rgba(0, 0, 255, 1)">from</span> statistics <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> mode
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> word_tokenize



</span><span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(ClassifierI):
    </span><span style="color: rgba(0, 0, 255, 1)">def</span> <span style="color: rgba(128, 0, 128, 1)">__init__</span>(self, *<span style="color: rgba(0, 0, 0, 1)">classifiers):
      self._classifiers </span>=<span style="color: rgba(0, 0, 0, 1)"> classifiers

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> classify(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> mode(votes)

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> confidence(self, features):
      votes </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> c <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> self._classifiers:
            v </span>=<span style="color: rgba(0, 0, 0, 1)"> c.classify(features)
            votes.append(v)

      choice_votes </span>=<span style="color: rgba(0, 0, 0, 1)"> votes.count(mode(votes))
      conf </span>= choice_votes /<span style="color: rgba(0, 0, 0, 1)"> len(votes)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> conf


documents_f </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/documents.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
documents </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(documents_f)
documents_f.close()




word_features5k_f </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/word_features5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
word_features </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(word_features5k_f)
word_features5k_f.close()


</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> find_features(document):
    words </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(document)
    features </span>=<span style="color: rgba(0, 0, 0, 1)"> {}
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> word_features:
        features[w] </span>= (w <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> words)

    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> features



# NOTE: the training script above never writes featuresets.pickle; pickle featuresets there first, or this open() call fails
featuresets_f </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/featuresets.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
featuresets </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(featuresets_f)
featuresets_f.close()

random.shuffle(featuresets)
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(len(featuresets))

testing_set </span>= featuresets[10000<span style="color: rgba(0, 0, 0, 1)">:]
training_set </span>= featuresets[:10000<span style="color: rgba(0, 0, 0, 1)">]



open_file </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/originalnaivebayes5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(open_file)
open_file.close()


open_file </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/MNB_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
MNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(open_file)
open_file.close()



open_file </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/BernoulliNB_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
BernoulliNB_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(open_file)
open_file.close()


open_file </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/LogisticRegression_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
LogisticRegression_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(open_file)
open_file.close()


open_file </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/LinearSVC_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
LinearSVC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(open_file)
open_file.close()


open_file </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pickled_algos/SGDC_classifier5k.pickle</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">rb</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
SGDC_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> pickle.load(open_file)
open_file.close()




voted_classifier </span>=<span style="color: rgba(0, 0, 0, 1)"> VoteClassifier(
                                  classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)




</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> sentiment(text):
    feats </span>=<span style="color: rgba(0, 0, 0, 1)"> find_features(text)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span> voted_classifier.classify(feats),voted_classifier.confidence(feats)</pre>
</div>
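<p>There is nothing really new here apart from the final function, which is simple. That function is the key piece we interact with from here on. The function, which we call <code>sentiment</code>, takes one parameter: the text. It breaks the text down into features using the <code>find_features</code> function we created earlier. All that remains is to use our voting classifier to return the classification along with the confidence of that classification.</p>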
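<p>To make the return value concrete, here is a minimal sketch of the voting arithmetic on its own. The <code>vote</code> helper and the hard-coded vote list are illustrative assumptions, not part of <code>sentiment_mod.py</code>, and the sketch uses <code>collections.Counter</code> where the module uses <code>statistics.mode</code>.</p>

```python
from collections import Counter


def vote(labels):
    """Return (winning_label, confidence) for a list of classifier votes."""
    counts = Counter(labels)
    # most_common(1) yields the label that received the most votes
    winner, winner_votes = counts.most_common(1)[0]
    # confidence is the share of classifiers that agreed with the winner
    return winner, winner_votes / len(labels)


# Hypothetical votes from five classifiers for one piece of text
label, confidence = vote(["pos", "pos", "neg", "pos", "pos"])
print(label, confidence)
```

<p>With five voters and two labels there is never a tie, which is one reason to keep the number of classifiers odd.</p>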
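<p>With this in place, we can now use this file, and the <code>sentiment</code> function in it, as a module. Here is an example script that uses the module:</p>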
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> sentiment_mod as s

</span><span style="color: rgba(0, 0, 255, 1)">print</span>(s.sentiment(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">This movie was awesome! The acting was great, plot was wonderful, and there were pythons...so yea!</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(s.sentiment(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">This movie was utter junk. There were absolutely 0 pythons. I don't see what the point was at all. Horrible movie, 0/10</span><span style="color: rgba(128, 0, 0, 1)">"</span>))</pre>
</div>
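<p>As expected, the review of the movie with <code>python</code>s is clearly positive, and the movie without any <code>python</code>s is junk. Both results come back with 100% confidence.</p>
<p>Importing the module took me about 5 seconds because we saved the classifiers; without saving them it might have taken around 30 minutes. That saving is thanks to <code>pickle</code>. Your times will vary considerably depending on your processor. If you keep going down this road, you may also want to look at <code>joblib</code>.</p>
<p>Now that we have this great module that is easy to use, what can we do with it? I suggest we head to Twitter and run live sentiment analysis!</p>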
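<p>As a small illustration of the save-and-reload pattern behind that speedup, the sketch below round-trips an object through <code>pickle</code> using <code>with</code> blocks instead of explicit <code>close()</code> calls. The dictionary stands in for a trained classifier and the temporary path is hypothetical; the tutorial itself writes into <code>pickled_algos/</code>.</p>

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; any picklable object behaves the same way
model = {"name": "demo_classifier", "labels": ["pos", "neg"]}

# Hypothetical location used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo_classifier.pickle")

# The with-statement closes the file even if dump() raises
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```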
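<h1>22. NLTK Twitter Sentiment Analysis</h1>
<p>Now that we have a sentiment analysis module, we can apply it to any text, though it works best on short text such as tweets. To do this, we will combine this tutorial with the Twitter streaming API tutorial.</p>
<p>The initial code from that tutorial is:</p>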
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> tweepy <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Stream
</span><span style="color: rgba(0, 0, 255, 1)">from</span> tweepy <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> OAuthHandler
</span><span style="color: rgba(0, 0, 255, 1)">from</span> tweepy.streaming <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> StreamListener


</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">consumer key, consumer secret, access token, access secret.</span>
ckey=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">fsdfasdfsafsffa</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
csecret</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">asdfsadfsadfsadf</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
atoken</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">asdf-aassdfs</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
asecret</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">asdfsadfsdafsdafs</span><span style="color: rgba(128, 0, 0, 1)">"</span>

<span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> listener(StreamListener):

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> on_data(self, data):
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(data)
      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)">(True)

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> on_error(self, status):
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(status)

auth </span>=<span style="color: rgba(0, 0, 0, 1)"> OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream </span>=<span style="color: rgba(0, 0, 0, 1)"> Stream(auth, listener())
twitterStream.filter(track</span>=[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">car</span><span style="color: rgba(128, 0, 0, 1)">"</span>])</pre>
</div>
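<p>This is enough to print all the data for live streaming tweets that contain the word <code>car</code>. We can use the <code>json</code> module, calling <code>json.loads(data)</code> to load the <code>data</code> variable into <code>all_data</code>, and then reference the text of a specific <code>tweet</code>:</p>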
<div class="cnblogs_code">
<pre>tweet = all_data[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span>]</pre>
</div>
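<p>The following sketch shows that lookup on its own, outside of a listener. Both payload strings are hypothetical and heavily trimmed, and the <code>extract_text</code> helper is an illustrative name. It uses <code>dict.get</code> on the assumption that some stream messages may not carry a <code>text</code> field, in which case indexing with <code>all_data["text"]</code> would raise a <code>KeyError</code>.</p>

```python
import json

# Hypothetical, heavily trimmed payloads; real stream messages carry many more fields
tweet_message = '{"text": "I love my new car", "lang": "en"}'
other_message = '{"limit": {"track": 5}}'


def extract_text(raw):
    """Return the tweet text, or None when the message has no "text" field."""
    all_data = json.loads(raw)
    return all_data.get("text")


print(extract_text(tweet_message))
print(extract_text(other_message))
```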
<p>Now that we have a tweet, we can easily pass it to our <code>sentiment_mod</code> module.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> tweepy <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Stream
</span><span style="color: rgba(0, 0, 255, 1)">from</span> tweepy <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> OAuthHandler
</span><span style="color: rgba(0, 0, 255, 1)">from</span> tweepy.streaming <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> StreamListener
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> json
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> sentiment_mod as s

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">consumer key, consumer secret, access token, access secret.</span>
ckey=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">asdfsafsafsaf</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
csecret</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">asdfasdfsadfsa</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
atoken</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">asdfsadfsafsaf-asdfsaf</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
asecret</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">asdfsadfsadfsadfsadfsad</span><span style="color: rgba(128, 0, 0, 1)">"</span>

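<span style="color: rgba(0, 128, 0, 1)"># twitterapistuff is not defined in this tutorial; it appears to be a local module holding the real keys, so remove this import if you do not have one</span>
<span style="color: rgba(0, 0, 255, 1)">from</span> twitterapistuff <span style="color: rgba(0, 0, 255, 1)">import</span> *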

<span style="color: rgba(0, 0, 255, 1)">class</span><span style="color: rgba(0, 0, 0, 1)"> listener(StreamListener):

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> on_data(self, data):

      all_data </span>=<span style="color: rgba(0, 0, 0, 1)"> json.loads(data)

      tweet </span>= all_data[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">]
      sentiment_value, confidence </span>=<span style="color: rgba(0, 0, 0, 1)"> s.sentiment(tweet)
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(tweet, sentiment_value, confidence)

      </span><span style="color: rgba(0, 0, 255, 1)">if</span> confidence*100 &gt;= 80<span style="color: rgba(0, 0, 0, 1)">:
            output </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">twitter-out.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
            output.write(sentiment_value)
            output.write(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
            output.close()

      </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> True

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> on_error(self, status):
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(status)

auth </span>=<span style="color: rgba(0, 0, 0, 1)"> OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream </span>=<span style="color: rgba(0, 0, 0, 1)"> Stream(auth, listener())
twitterStream.filter(track</span>=[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">happy</span><span style="color: rgba(128, 0, 0, 1)">"</span>])</pre>
</div>
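<p>In addition, whenever the confidence is at least 80%, we append the predicted label to the output file <code>twitter-out.txt</code>.</p>
<p>Next, what data analysis is complete without a chart? Let's bring in another tutorial and plot a live graph from the sentiment analysis of the Twitter stream.</p>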
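<p>The thresholded write inside <code>on_data</code> can be isolated as a small helper, sketched below. The name <code>record_if_confident</code>, the temporary output path, and the sample <code>(label, confidence)</code> pairs are assumptions made for illustration; the threshold of 0.8 mirrors the 80% check in the listener.</p>

```python
import os
import tempfile


def record_if_confident(label, confidence, path, threshold=0.8):
    """Append label to path when confidence reaches threshold; return True if written."""
    if confidence >= threshold:
        # Append mode keeps earlier results; the with-block closes the file
        with open(path, "a") as output:
            output.write(label + "\n")
        return True
    return False


# Hypothetical output location and (label, confidence) pairs
out_path = os.path.join(tempfile.mkdtemp(), "twitter-out.txt")
for label, conf in [("pos", 1.0), ("neg", 0.6), ("neg", 0.8)]:
    record_if_confident(label, conf, out_path)

with open(out_path) as f:
    print(f.read().split())
```

<p>Only the first and third pairs are written, since 0.6 falls below the threshold.</p>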
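<h1>23. Graphing Live Twitter Sentiment Analysis with NLTK</h1>
<p>Now that we have live data coming from the Twitter streaming API, why not have a live graph that shows the sentiment trend? For this, we will combine this tutorial with the matplotlib graphing tutorial.</p>
<p>If you want to learn more about how the code works, see that tutorial. Otherwise:</p>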
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> matplotlib.pyplot as plt
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> matplotlib.animation as animation
</span><span style="color: rgba(0, 0, 255, 1)">from</span> matplotlib <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> style
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> time

style.use(</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ggplot</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)

fig </span>=<span style="color: rgba(0, 0, 0, 1)"> plt.figure()
ax1 </span>= fig.add_subplot(1,1,1<span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> animate(i):
    pullData </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">twitter-out.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()
    lines </span>= pullData.split(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">\n</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

    xar </span>=<span style="color: rgba(0, 0, 0, 1)"> []
    yar </span>=<span style="color: rgba(0, 0, 0, 1)"> []

    x </span>=<span style="color: rgba(0, 0, 0, 1)"> 0
    y </span>=<span style="color: rgba(0, 0, 0, 1)"> 0

    </span><span style="color: rgba(0, 0, 255, 1)">for</span> l <span style="color: rgba(0, 0, 255, 1)">in</span> lines[-200<span style="color: rgba(0, 0, 0, 1)">:]:
      x </span>+= 1
      <span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pos</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> l:
            y </span>+= 1
      <span style="color: rgba(0, 0, 255, 1)">elif</span> <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">neg</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> l:
            y </span>-= 1<span style="color: rgba(0, 0, 0, 1)">

      xar.append(x)
      yar.append(y)
      
    ax1.clear()
    ax1.plot(xar,yar)
ani </span>= animation.FuncAnimation(fig, animate, interval=1000<span style="color: rgba(0, 0, 0, 1)">)
plt.show()</span></pre>
</div>
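<p>The plotted line is a running score: it rises by one for each <code>pos</code> line and falls by one for each <code>neg</code> line among the last 200 lines of the file. A minimal sketch of just that computation, without matplotlib, is below. The function name <code>running_score</code> and the sample file contents are assumptions made for illustration.</p>

```python
def running_score(lines, window=200):
    """Return x and y series: y rises 1 per "pos" line and falls 1 per "neg" line."""
    xs, ys = [], []
    y = 0
    # Only the most recent `window` lines are considered, as in animate()
    for x, line in enumerate(lines[-window:], start=1):
        if "pos" in line:
            y += 1
        elif "neg" in line:
            y -= 1
        xs.append(x)
        ys.append(y)
    return xs, ys


# Hypothetical file contents, split on newlines the same way animate() does
sample = "pos\npos\nneg\npos\n".split("\n")
print(running_score(sample))
```

<p>Because the file ends with a newline, the split produces a trailing empty string, which adds one flat point at the end of the series.</p>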
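<h1>24. Stanford NER Tagger and Named Entity Recognition</h1>
<p>The Stanford NER tagger provides an alternative to NLTK's named entity recognition (NER) classifier. This tagger is widely regarded as the standard for named entity recognition, but because it uses advanced statistical learning algorithms, it is more computationally expensive than the option NLTK provides.</p>
<p>A big advantage of the Stanford NER tagger is that it gives us several different models for extracting named entities. We can use any of the following:</p>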
<ul>
<li style="list-style-type: none">
<ul>
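<li>A three-class model that recognizes locations, persons, and organizations</li>
<li>A four-class model that recognizes locations, persons, organizations, and miscellaneous entities</li>
<li>A seven-class model that recognizes locations, persons, organizations, times, money, percentages, and dates</li>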
</ul>
</li>
</ul>
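<p>To continue, we need to download the models and a <code>jar</code> file, since the NER classifier is written in Java. These are freely available from the Stanford Natural Language Processing Group. Conveniently, NLTK provides a wrapper for the Stanford tagger, so we can use it from the best language (Python, of course)!</p>
<p>The arguments passed to the <code>StanfordNERTagger</code> class include:</p>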
<ul>
<li style="list-style-type: none">
<ul>
<li>The path to the classification model (the three-class model is used below)</li>
<li>The path to the Stanford tagger <code>jar</code> file</li>
<li>The training data encoding (the NLTK wrapper defaults to <code>utf8</code>; the example below passes it explicitly)</li>
</ul>
</li>
</ul>
<p>Here is how we set it up to tag a sentence using the three-class model:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> -*- coding: utf-8 -*-</span>

<span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tag <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> StanfordNERTagger
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> word_tokenize

st </span>= StanfordNERTagger(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                     </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/stanford-ner.jar</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                     encoding</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

text </span>= <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">

tokenized_text </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(text)
classified_text </span>=<span style="color: rgba(0, 0, 0, 1)"> st.tag(tokenized_text)

</span><span style="color: rgba(0, 0, 255, 1)">print</span>(classified_text)</pre>
</div>
<p>Once we have tokenized the sentence into words and classified it, the tagger produces a list of tuples like this:</p>
<div class="cnblogs_code">
<pre>[(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">While</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">in</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">France</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">LOCATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Christine</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Lagarde</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">discussed</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 
0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">short-term</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">stimulus</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">efforts</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">in</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">recent</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">interview</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">with</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Wall</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Street</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Journal</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span 
style="color: rgba(128, 0, 0, 1)">'</span>)]</pre>
</div>
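<p>Great! Each token is tagged with <code>PERSON</code>, <code>LOCATION</code>, <code>ORGANIZATION</code>, or <code>O</code> (using our three-class model). <code>O</code> simply stands for "other", i.e. a token that is not part of a named entity.</p>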
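<p>A common follow-up is to merge consecutive tokens that share a tag into whole entities. This is a minimal sketch, assuming the <code>(token, tag)</code> list printed by the tagger; <code>extract_entities</code> is a hypothetical helper, not part of NLTK:</p>

```python
from itertools import groupby


def extract_entities(tagged_tokens):
    """Merge runs of adjacent tokens with the same non-"O" tag."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


# The tagger output for the example sentence
classified_text = [
    ('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'),
    ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'),
    ('short-term', 'O'), ('stimulus', 'O'), ('efforts', 'O'), ('in', 'O'),
    ('a', 'O'), ('recent', 'O'), ('interview', 'O'), ('with', 'O'),
    ('the', 'O'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'),
    ('Journal', 'ORGANIZATION'), ('.', 'O'),
]

entities = extract_entities(classified_text)
print(entities)
```

<p>Because these tags carry no begin/inside prefix, two different entities of the same type that happen to be adjacent would be merged into one by this sketch.</p>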
<p>This list is now ready to be tested against annotated data, which we cover in the next section.</p>
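<h1>25. Testing the Accuracy of the NLTK and Stanford NER Taggers</h1>
<p>We now know how to use two different NER classifiers! But which one should we choose, NLTK's or Stanford's? Let's run some tests to find out.</p>
<p>The first thing we need is some annotated reference data to test our NER classifiers against. One way to get this data is to find a large number of articles and label each token either as a type of named entity (e.g., person, organization, location) or as a non-named entity. We can then test each NER classifier separately against the labels we know to be correct.</p>
<p>Unfortunately, this is very time-consuming! The good news is that a manually annotated dataset is freely available, with over 16,000 English sentences. There are also datasets for German, Spanish, French, Italian, Dutch, Polish, Portuguese, and Russian!</p>
<p>Here is an annotated sentence from the dataset:</p>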
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">Founding O
member O
Kojima I</span>-<span style="color: rgba(0, 0, 0, 1)">PER
Minoru I</span>-<span style="color: rgba(0, 0, 0, 1)">PER
played O
guitar O
on O
Good I</span>-<span style="color: rgba(0, 0, 0, 1)">MISC
Day I</span>-<span style="color: rgba(0, 0, 0, 1)">MISC
, O
</span><span style="color: rgba(0, 0, 255, 1)">and</span><span style="color: rgba(0, 0, 0, 1)"> O
Wardanceis I</span>-<span style="color: rgba(0, 0, 0, 1)">MISC
cover O
of O
a O
song O
by O
UK I</span>-<span style="color: rgba(0, 0, 0, 1)">LOC
post O
punk O
industrial O
band O
Killing I</span>-<span style="color: rgba(0, 0, 0, 1)">ORG
Joke I</span>-<span style="color: rgba(0, 0, 0, 1)">ORG
. O</span></pre>
</div>
<p>Let's read in, split, and manipulate the data so that it is in a better format for testing.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tag <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> StanfordNERTagger
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.metrics.scores <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> accuracy

raw_annotations </span>= open(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/wikigold.conll.txt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).read()
split_annotations </span>=<span style="color: rgba(0, 0, 0, 1)"> raw_annotations.split()

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Amend class annotations to reflect Stanford's NERTagger</span>
<span style="color: rgba(0, 0, 255, 1)">for</span> n,i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> enumerate(split_annotations):
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> i == <span style="color: rgba(128, 0, 0, 1)">"I-PER"</span><span style="color: rgba(0, 0, 0, 1)">:
        split_annotations[n] </span>= <span style="color: rgba(128, 0, 0, 1)">"PERSON"</span>
    <span style="color: rgba(0, 0, 255, 1)">if</span> i == <span style="color: rgba(128, 0, 0, 1)">"I-ORG"</span><span style="color: rgba(0, 0, 0, 1)">:
        split_annotations[n] </span>= <span style="color: rgba(128, 0, 0, 1)">"ORGANIZATION"</span>
    <span style="color: rgba(0, 0, 255, 1)">if</span> i == <span style="color: rgba(128, 0, 0, 1)">"I-LOC"</span><span style="color: rgba(0, 0, 0, 1)">:
        split_annotations[n] </span>= <span style="color: rgba(128, 0, 0, 1)">"LOCATION"</span>

<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Group NE data into tuples</span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> group(lst, n):
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Yield consecutive n-item slices of lst as tuples, skipping a short tail</span>
    <span style="color: rgba(0, 0, 255, 1)">for</span> i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> range(0, len(lst), n):
        val </span>= lst[i:i+n]
        <span style="color: rgba(0, 0, 255, 1)">if</span> len(val) ==<span style="color: rgba(0, 0, 0, 1)"> n:
            </span><span style="color: rgba(0, 0, 255, 1)">yield</span><span style="color: rgba(0, 0, 0, 1)"> tuple(val)

reference_annotations </span>= list(group(split_annotations, 2))</pre>
</div>
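<p>The reshaping above can be tried on a few lines without the dataset file. This is a minimal sketch, assuming the two-column token/label layout of the annotated sentence; <code>LABEL_MAP</code> and <code>parse_annotations</code> are hypothetical names, and the sample is an excerpt of that sentence:</p>

```python
# Dataset label -> label used by the Stanford three-class model
LABEL_MAP = {"I-PER": "PERSON", "I-ORG": "ORGANIZATION", "I-LOC": "LOCATION"}


def parse_annotations(raw):
    """Split whitespace-separated token/label text into (token, label) tuples."""
    fields = raw.split()
    pairs = []
    # Step through the fields two at a time; a dangling last field is ignored.
    for i in range(0, len(fields) - 1, 2):
        token, label = fields[i], fields[i + 1]
        pairs.append((token, LABEL_MAP.get(label, label)))
    return pairs


sample = """by O
UK I-LOC
post O
punk O
industrial O
band O
Killing I-ORG
Joke I-ORG
. O"""

reference_sample = parse_annotations(sample)
print(reference_sample)
```

<p>Labels without an entry in the map, such as <code>I-MISC</code>, are left untouched, as in the code above. As far as I can tell, neither tagger emits that label in this setup, so those tokens always count as mismatches.</p>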
<p>OK, looks good! However, we also need to feed a "clean" form of this data (tokens only, without labels) into our NER classifiers. Let's do that.</p>
<div class="cnblogs_code">
<pre>pure_tokens = split_annotations[::2]</pre>
</div>
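<p>This takes the whitespace-split data and selects every second element of <code>split_annotations</code>, starting from element zero, which leaves only the tokens. The result is a dataset like the following (much smaller) example:</p>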
<div class="cnblogs_code">
<pre>[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Founding</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">member</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Kojima</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Minoru</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">played</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">guitar</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">on</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Good</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Day</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">and</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Wardanceis</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">cover</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 
1)">'</span><span style="color: rgba(128, 0, 0, 1)">of</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">a</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">song</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">by</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">UK</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">post</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">punk</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">industrial</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">band</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Killing</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Joke</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>]</pre>
</div>
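<p>The extended slice is easy to verify on a short list. This is a small sketch with a hand-written <code>split_annotations</code>; the <code>labels</code> variable is an addition for illustration and is not used elsewhere in the tutorial:</p>

```python
# Alternating token/label list, as produced by splitting on whitespace
split_annotations = ["Killing", "ORGANIZATION", "Joke", "ORGANIZATION", ".", "O"]

pure_tokens = split_annotations[::2]   # elements 0, 2, 4, ... (the tokens)
labels = split_annotations[1::2]       # elements 1, 3, 5, ... (the labels)

print(pure_tokens)
print(labels)
```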
<p>Let's go ahead and test the NLTK classifier:</p>
<div class="cnblogs_code">
<pre>tagged_words =<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(pure_tokens)
nltk_unformatted_prediction </span>= nltk.ne_chunk(tagged_words)</pre>
</div>
<p>Because the NLTK NER classifier produces a tree (including POS tags), we need some extra data manipulation to get it into a suitable form for testing.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">Convert prediction to multiline string and then to list (includes pos tags)</span>
multiline_string =<span style="color: rgba(0, 0, 0, 1)"> nltk.chunk.tree2conllstr(nltk_unformatted_prediction)
listed_pos_and_ne </span>=<span style="color: rgba(0, 0, 0, 1)"> multiline_string.split()

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Delete pos tags and rename</span>
<span style="color: rgba(0, 0, 255, 1)">del</span><span style="color: rgba(0, 0, 0, 1)"> listed_pos_and_ne[1::3]
listed_ne </span>=<span style="color: rgba(0, 0, 0, 1)"> listed_pos_and_ne

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Amend class annotations for consistency with reference_annotations</span>
<span style="color: rgba(0, 0, 255, 1)">for</span> n, i <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> enumerate(listed_ne):
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> (<span style="color: rgba(128, 0, 0, 1)">"B-PERSON"</span>, <span style="color: rgba(128, 0, 0, 1)">"I-PERSON"</span>):
        listed_ne[n] = <span style="color: rgba(128, 0, 0, 1)">"PERSON"</span>
    <span style="color: rgba(0, 0, 255, 1)">elif</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> (<span style="color: rgba(128, 0, 0, 1)">"B-ORGANIZATION"</span>, <span style="color: rgba(128, 0, 0, 1)">"I-ORGANIZATION"</span>):
        listed_ne[n] = <span style="color: rgba(128, 0, 0, 1)">"ORGANIZATION"</span>
    <span style="color: rgba(0, 0, 255, 1)">elif</span> i <span style="color: rgba(0, 0, 255, 1)">in</span> (<span style="color: rgba(128, 0, 0, 1)">"B-LOCATION"</span>, <span style="color: rgba(128, 0, 0, 1)">"I-LOCATION"</span>, <span style="color: rgba(128, 0, 0, 1)">"B-GPE"</span>, <span style="color: rgba(128, 0, 0, 1)">"I-GPE"</span>):
        listed_ne[n] = <span style="color: rgba(128, 0, 0, 1)">"LOCATION"</span>

<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Group prediction into tuples</span>
nltk_formatted_prediction = list(group(listed_ne, 2))</pre>
</div>
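<p>The column handling above can be followed on a hand-written string, without running the NLTK models. This is a minimal sketch, assuming the three-column "word POS label" layout of the CoNLL string; the sample tags are invented for illustration and are not real tagger output, and <code>NLTK_TO_REFERENCE</code>, <code>normalise</code>, and <code>to_pairs</code> are hypothetical names:</p>

```python
# NLTK entity name -> label used in the reference annotations
NLTK_TO_REFERENCE = {
    "PERSON": "PERSON",
    "ORGANIZATION": "ORGANIZATION",
    "LOCATION": "LOCATION",
    "GPE": "LOCATION",
}


def normalise(label):
    """Drop the B-/I- prefix for the classes we compare; keep anything else."""
    prefix, _, name = label.partition("-")
    if prefix in ("B", "I") and name in NLTK_TO_REFERENCE:
        return NLTK_TO_REFERENCE[name]
    return label


def to_pairs(conll_string):
    """Turn 'word POS label' lines into (word, normalised label) tuples."""
    fields = conll_string.split()
    del fields[1::3]  # remove every POS tag (positions 1, 4, 7, ...)
    return list(zip(fields[::2], (normalise(l) for l in fields[1::2])))


# Hand-written sample in the three-column layout (not real tagger output)
conll_string = (
    "Kojima NNP B-PERSON\n"
    "Minoru NNP I-PERSON\n"
    "played VBD O\n"
    "in IN O\n"
    "UK NNP B-GPE"
)

pairs = to_pairs(conll_string)
print(pairs)
```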
<p>Now we can test NLTK's accuracy.</p>
<div class="cnblogs_code">
<pre>nltk_accuracy =<span style="color: rgba(0, 0, 0, 1)"> accuracy(reference_annotations, nltk_formatted_prediction)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(nltk_accuracy)</pre>
</div>
<p>Wow, an accuracy of <code>.8971</code>!</p>
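<p>For intuition, this score is the fraction of positions at which the reference tuple and the predicted tuple are identical. The sketch below shows that calculation as I understand it; <code>simple_accuracy</code> is a hypothetical stand-in rather than NLTK's own function, and the two short lists are invented for illustration:</p>

```python
def simple_accuracy(reference, test):
    """Fraction of positions where reference and test hold equal items."""
    if len(reference) != len(test):
        raise ValueError("reference and test must have the same length")
    matches = sum(r == t for r, t in zip(reference, test))
    return matches / len(test)


# Invented example: the prediction misses one of four labels
reference = [("UK", "LOCATION"), ("band", "O"),
             ("Killing", "ORGANIZATION"), ("Joke", "ORGANIZATION")]
prediction = [("UK", "LOCATION"), ("band", "O"),
              ("Killing", "ORGANIZATION"), ("Joke", "O")]

score = simple_accuracy(reference, prediction)
print(score)
```

<p>Since whole tuples are compared, both lists must contain the same tokens in the same order; a tokenization difference between the two would show up as errors.</p>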
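<p>Now let's test the Stanford classifier. Because this classifier already produces its output as tuples, no further data manipulation is needed for testing.</p>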
<div class="cnblogs_code">
<pre>st = StanfordNERTagger(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                     </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/stanford-ner.jar</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                     encoding</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)                  
stanford_prediction </span>=<span style="color: rgba(0, 0, 0, 1)"> st.tag(pure_tokens)
stanford_accuracy </span>=<span style="color: rgba(0, 0, 0, 1)"> accuracy(reference_annotations, stanford_prediction)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(stanford_accuracy)</pre>
</div>
<p>An accuracy of <code>.9223</code>! Even better!</p>
<p>If you would like to plot this, here is some additional code. If you want to dig into how it works, check out the matplotlib series:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> numpy as np
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> matplotlib.pyplot as plt
</span><span style="color: rgba(0, 0, 255, 1)">from</span> matplotlib <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> style

style.use(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">fivethirtyeight</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

N </span>= 1<span style="color: rgba(0, 0, 0, 1)">
ind </span>= np.arange(N)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> the x locations for the groups</span>
width = 0.35       <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> the width of the bars</span>
<span style="color: rgba(0, 0, 0, 1)">
fig, ax </span>=<span style="color: rgba(0, 0, 0, 1)"> plt.subplots()

stanford_percentage </span>= stanford_accuracy * 100<span style="color: rgba(0, 0, 0, 1)">
rects1 </span>= ax.bar(ind, stanford_percentage, width, color=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

nltk_percentage </span>= nltk_accuracy * 100<span style="color: rgba(0, 0, 0, 1)">
rects2 </span>= ax.bar(ind+width, nltk_percentage, width, color=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">y</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> add some text for labels, title and axes ticks</span>
ax.set_xlabel(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Classifier</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
ax.set_ylabel(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Accuracy (by percentage)</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
ax.set_title(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Accuracy by NER Classifier</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
ax.set_xticks(ind</span>+<span style="color: rgba(0, 0, 0, 1)">width)
ax.set_xticklabels( (</span><span style="color: rgba(128, 0, 0, 1)">''</span><span style="color: rgba(0, 0, 0, 1)">) )

ax.legend( (rects1, rects2), (</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Stanford</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NLTK</span><span style="color: rgba(128, 0, 0, 1)">'</span>), bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=<span style="color: rgba(0, 0, 0, 1)">0. )

</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> autolabel(rects):
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> attach some text labels</span>
    <span style="color: rgba(0, 0, 255, 1)">for</span> rect <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> rects:
      height </span>=<span style="color: rgba(0, 0, 0, 1)"> rect.get_height()
      ax.text(rect.get_x()</span>+rect.get_width()/2., 1.02*height, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">%10.2f</span><span style="color: rgba(128, 0, 0, 1)">'</span> %<span style="color: rgba(0, 0, 0, 1)"> float(height),
                ha</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">center</span><span style="color: rgba(128, 0, 0, 1)">'</span>, va=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">bottom</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

autolabel(rects1)
autolabel(rects2)

plt.show()</span></pre>
</div>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1554973/201909/1554973-20190904140934422-1107548395.png" alt=""></p>
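<h1>26. Testing the Speed of the NLTK and Stanford NER Taggers</h1>
<p>We have tested the accuracy of our NER classifiers, but there is more to consider when deciding which one to use. Let's test speed next!</p>
<p>So that we know we are comparing like with like, we will test both taggers on the same article. Let's use this snippet from NBC News:</p>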
<div class="cnblogs_code">
<pre>House Speaker John Boehner became animated Tuesday over the proposed Keystone Pipeline, castigating the Obama administration <span style="color: rgba(0, 0, 255, 1)">for</span> <span style="color: rgba(0, 0, 255, 1)">not</span><span style="color: rgba(0, 0, 0, 1)"> having approved the project yet.

Republican House Speaker John Boehner says there</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">s "nothing complex about the Keystone Pipeline," and that it</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">s time to build it.

</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Complex? You think the Keystone Pipeline is complex?!</span><span style="color: rgba(128, 0, 0, 1)">"</span> Boehner responded to a questioner. <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">It's been under study for five years! We build pipelines in America every day. Do you realize there are 200,000 miles of pipelines in the United States?</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">

The speaker went on: </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">And the only reason the president's involved in the Keystone Pipeline is because it crosses an international boundary. Listen, we can build it. There's nothing complex about the Keystone Pipeline -- it's time to build it.</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">

Boehner said the president had no excuse at this point to </span><span style="color: rgba(0, 0, 255, 1)">not</span> give the pipeline the go-<span style="color: rgba(0, 0, 0, 1)">ahead after the State Department released a report on Friday indicating the project would have a minimal impact on the environment.

Republicans have long pushed </span><span style="color: rgba(0, 0, 255, 1)">for</span> construction of the project, which enjoys some measure of Democratic support as well. The GOP <span style="color: rgba(0, 0, 255, 1)">is</span><span style="color: rgba(0, 0, 0, 1)"> considering conditioning an extension of the debt limit on approval of the project by Obama.

The White House, though, has said that it has no timetable </span><span style="color: rgba(0, 0, 255, 1)">for</span> a final decision on the project.</pre>
</div>
<p>First, we do our imports and process the article by reading and tokenizing it.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> -*- coding: utf-8 -*-</span>

<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> os
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> numpy as np
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> matplotlib.pyplot as plt
</span><span style="color: rgba(0, 0, 255, 1)">from</span> matplotlib <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> style
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pos_tag
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tag <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> StanfordNERTagger
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> word_tokenize

style.use(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">fivethirtyeight</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Process text</span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> process_text(txt_file):
    raw_text </span>= open(txt_file).read()  <span style="color: rgba(0, 128, 0, 1)"># use the path passed in, not a hard-coded one</span><span style="color: rgba(0, 0, 0, 1)">
    token_text </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(raw_text)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span> token_text</pre>
</div>
<p>  Great! Now let's write some functions to split up our classification task. Because the NLTK NER classifier needs POS tags, we include POS tagging inside our NLTK function.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Stanford NER tagger    </span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> stanford_tagger(token_text):
    st </span>= StanfordNERTagger(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                            </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/stanford-ner.jar</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                            encoding</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)   
    ne_tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> st.tag(token_text)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)">(ne_tagged)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> NLTK POS and NER taggers   </span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> nltk_tagger(token_text):
    tagged_words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(token_text)
    ne_tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.ne_chunk(tagged_words)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span>(ne_tagged)</pre>
</div>
<p>  Each classifier has to read the article and classify its named entities, so we wrap these functions in a larger function to make timing easy.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)"># Path to the article (the snippets use txt_file, so define it before calling)</span>
txt_file = <span style="color: rgba(128, 0, 0, 1)">"/usr/share/news_article.txt"</span>

<span style="color: rgba(0, 0, 255, 1)">def</span> stanford_main():
    <span style="color: rgba(0, 0, 255, 1)">print</span>(stanford_tagger(process_text(txt_file)))

<span style="color: rgba(0, 0, 255, 1)">def</span> nltk_main():
    <span style="color: rgba(0, 0, 255, 1)">print</span>(nltk_tagger(process_text(txt_file)))</pre>
</div>
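<p>  When we run our program, we call these functions. We bracket the calls to <code>stanford_main()</code> and <code>nltk_main()</code> with <code>os.times()</code>, taking index 4 of its result, which is the elapsed real time. Then we plot the results.</p>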
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 128, 1)">__name__</span> == <span style="color: rgba(128, 0, 0, 1)">'__main__'</span>:
    <span style="color: rgba(0, 128, 0, 1)"># os.times()[4] is the elapsed real time; subtracting the full result tuples would fail</span>
    stanford_t0 = os.times()[4]
    stanford_main()
    stanford_t1 = os.times()[4]
    stanford_total_time = stanford_t1 - stanford_t0

    nltk_t0 = os.times()[4]
    nltk_main()
    nltk_t1 = os.times()[4]
    nltk_total_time = nltk_t1 - nltk_t0

    time_plot(stanford_total_time, nltk_total_time)</pre>
</div>
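<p>  The timing step above can also be sketched with <code>time.perf_counter()</code>, which may be preferable because the Python documentation notes that the elapsed field of <code>os.times()</code> is zero on Windows. This is only an illustrative sketch: <code>time_call</code> and <code>slow_square</code> are hypothetical names that do not appear in the original tutorial, and <code>slow_square</code> merely stands in for <code>stanford_main()</code> or <code>nltk_main()</code>.</p>

```python
import time


def time_call(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def slow_square(x):
    """Hypothetical stand-in for stanford_main() or nltk_main()."""
    time.sleep(0.01)  # simulate some work
    return x * x


if __name__ == "__main__":
    value, seconds = time_call(slow_square, 7)
    print(value, "computed in", round(seconds, 3), "seconds")
```

<p>  With a helper like this, the two elapsed times could be passed straight to <code>time_plot()</code>, since each is already a plain float in seconds.</p>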
<p>  For the plot we use the <code>time_plot()</code> function:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> time_plot(stanford_total_time, nltk_total_time):
    N </span>= 1<span style="color: rgba(0, 0, 0, 1)">
    ind </span>= np.arange(N)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> the x locations for the groups</span>
    width = 0.35       <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> the width of the bars</span>
    stanford_total_time =<span style="color: rgba(0, 0, 0, 1)"> stanford_total_time
    nltk_total_time </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk_total_time   
    fig, ax </span>=<span style="color: rgba(0, 0, 0, 1)"> plt.subplots()   
    rects1 </span>= ax.bar(ind, stanford_total_time, width, color=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">r</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)   
    rects2 </span>= ax.bar(ind+width, nltk_total_time, width, color=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">y</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
   
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Add text for labels, title and axes ticks</span>
    ax.set_xlabel(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Classifier</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    ax.set_ylabel(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Time (in seconds)</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    ax.set_title(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Speed by NER Classifier</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
    ax.set_xticks(ind</span>+<span style="color: rgba(0, 0, 0, 1)">width)
    ax.set_xticklabels([</span><span style="color: rgba(128, 0, 0, 1)">''</span><span style="color: rgba(0, 0, 0, 1)">])  # one empty label for the single tick
    ax.legend( (rects1, rects2), (</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Stanford</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">NLTK</span><span style="color: rgba(128, 0, 0, 1)">'</span>), bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=<span style="color: rgba(0, 0, 0, 1)">0. )

    </span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> autolabel(rects):
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> attach some text labels</span>
      <span style="color: rgba(0, 0, 255, 1)">for</span> rect <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> rects:
            height </span>=<span style="color: rgba(0, 0, 0, 1)"> rect.get_height()
            ax.text(rect.get_x()</span>+rect.get_width()/2., 1.02*height, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">%10.2f</span><span style="color: rgba(128, 0, 0, 1)">'</span> %<span style="color: rgba(0, 0, 0, 1)"> float(height),
                  ha</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">center</span><span style="color: rgba(128, 0, 0, 1)">'</span>, va=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">bottom</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
   
    autolabel(rects1)
    autolabel(rects2)   
    plt.show()</span></pre>
</div>
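<p>  Wow, NLTK is lightning fast! It looks like Stanford is more accurate, but NLTK is faster. That is important to know when balancing the accuracy we want against the computing resources required.</p>
<p>  But wait, there is still a problem: our output is ugly! Here is a small sample from Stanford:</p>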
<div class="cnblogs_code">
<pre>[(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">House</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Speaker</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">John</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">became</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">animated</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Tuesday</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: 
rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">over</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">proposed</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">,</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span 
style="color: rgba(128, 0, 0, 1)">castigating</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Obama</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">administration</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">for</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">not</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">having</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">approved</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">the</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">project</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">yet</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">.</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">'</span>)</pre>
</div>
<p>  And from NLTK:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">(S
(ORGANIZATION House</span>/<span style="color: rgba(0, 0, 0, 1)">NNP)
Speaker</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
(PERSON John</span>/NNP Boehner/<span style="color: rgba(0, 0, 0, 1)">NNP)
became</span>/<span style="color: rgba(0, 0, 0, 1)">VBD
animated</span>/<span style="color: rgba(0, 0, 0, 1)">VBN
Tuesday</span>/<span style="color: rgba(0, 0, 0, 1)">NNP
over</span>/<span style="color: rgba(0, 0, 0, 1)">IN
the</span>/<span style="color: rgba(0, 0, 0, 1)">DT
proposed</span>/<span style="color: rgba(0, 0, 0, 1)">VBN
(PERSON Keystone</span>/NNP Pipeline/<span style="color: rgba(0, 0, 0, 1)">NNP)
,</span>/<span style="color: rgba(0, 0, 0, 1)">,
castigating</span>/<span style="color: rgba(0, 0, 0, 1)">VBG
the</span>/<span style="color: rgba(0, 0, 0, 1)">DT
(ORGANIZATION Obama</span>/<span style="color: rgba(0, 0, 0, 1)">NNP)
administration</span>/<span style="color: rgba(0, 0, 0, 1)">NN
</span><span style="color: rgba(0, 0, 255, 1)">for</span>/<span style="color: rgba(0, 0, 0, 1)">IN
</span><span style="color: rgba(0, 0, 255, 1)">not</span>/<span style="color: rgba(0, 0, 0, 1)">RB
having</span>/<span style="color: rgba(0, 0, 0, 1)">VBG
approved</span>/<span style="color: rgba(0, 0, 0, 1)">VBN
the</span>/<span style="color: rgba(0, 0, 0, 1)">DT
project</span>/<span style="color: rgba(0, 0, 0, 1)">NN
yet</span>/<span style="color: rgba(0, 0, 0, 1)">RB
.</span>/.</pre>
</div>
<p>  In the next tutorial, let's convert both outputs into a readable form.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto" src="https://img2018.cnblogs.com/blog/1554973/201909/1554973-20190904141233405-1134884541.png" alt=""></p>
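<h1>27. Creating a Readable List of Named Entities with BIO Tags</h1>
<p>  Now that testing is done, let's put our named entities into a nice, readable format.</p>
<p>  Once again, we use the same NBC News article:</p>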
<div class="cnblogs_code">
<pre>House Speaker John Boehner became animated Tuesday over the proposed Keystone Pipeline, castigating the Obama administration <span style="color: rgba(0, 0, 255, 1)">for</span> <span style="color: rgba(0, 0, 255, 1)">not</span><span style="color: rgba(0, 0, 0, 1)"> having approved the project yet.

Republican House Speaker John Boehner says there</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">s "nothing complex about the Keystone Pipeline," and that it</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">s time to build it.

</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Complex? You think the Keystone Pipeline is complex?!</span><span style="color: rgba(128, 0, 0, 1)">"</span> Boehner responded to a questioner. <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">It's been under study for five years! We build pipelines in America every day. Do you realize there are 200,000 miles of pipelines in the United States?</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">

The speaker went on: </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">And the only reason the president's involved in the Keystone Pipeline is because it crosses an international boundary. Listen, we can build it. There's nothing complex about the Keystone Pipeline -- it's time to build it.</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">

Boehner said the president had no excuse at this point to </span><span style="color: rgba(0, 0, 255, 1)">not</span> give the pipeline the go-<span style="color: rgba(0, 0, 0, 1)">ahead after the State Department released a report on Friday indicating the project would have a minimal impact on the environment.

Republicans have long pushed </span><span style="color: rgba(0, 0, 255, 1)">for</span> construction of the project, which enjoys some measure of Democratic support as well. The GOP <span style="color: rgba(0, 0, 255, 1)">is</span><span style="color: rgba(0, 0, 0, 1)"> considering conditioning an extension of the debt limit on approval of the project by Obama.

The White House, though, has said that it has no timetable </span><span style="color: rgba(0, 0, 255, 1)">for</span> a final decision on the project.</pre>
</div>
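<p>  Our NLTK output is already a tree (only one final step is needed), so let's look at the Stanford output. We will BIO-tag the tokens: B marks the beginning of a named entity, I marks a token inside one, and O marks everything else. For example, given the sentence <code>Barack Obama went to Greece today</code>, we should tag it as <code>Barack-B Obama-I went-O to-O Greece-B today-O</code>. To do this, we write a series of conditionals that check whether the current and previous tokens carry the <code>O</code> tag.</p>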
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> -*- coding: utf-8 -*-</span>

<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> nltk
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> os
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> numpy as np
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> matplotlib.pyplot as plt
</span><span style="color: rgba(0, 0, 255, 1)">from</span> matplotlib <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> style
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> pos_tag
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tag <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> StanfordNERTagger
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tokenize <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> word_tokenize
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.chunk <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> conlltags2tree
</span><span style="color: rgba(0, 0, 255, 1)">from</span> nltk.tree <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Tree

style.use(</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">fivethirtyeight</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Process text</span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> process_text(txt_file):
    raw_text </span>= open(txt_file).read()  <span style="color: rgba(0, 128, 0, 1)"># use the path passed in, not a hard-coded one</span><span style="color: rgba(0, 0, 0, 1)">
    token_text </span>=<span style="color: rgba(0, 0, 0, 1)"> word_tokenize(raw_text)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> token_text

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Stanford NER tagger    </span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> stanford_tagger(token_text):
    st </span>= StanfordNERTagger(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                            </span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">/usr/share/stanford-ner/stanford-ner.jar</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">,
                            encoding</span>=<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)   
    ne_tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> st.tag(token_text)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)">(ne_tagged)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> NLTK POS and NER taggers   </span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> nltk_tagger(token_text):
    tagged_words </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.pos_tag(token_text)
    ne_tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> nltk.ne_chunk(tagged_words)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)">(ne_tagged)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Tag tokens with standard NLP BIO tags</span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> bio_tagger(ne_tagged):
      bio_tagged </span>=<span style="color: rgba(0, 0, 0, 1)"> []
      prev_tag </span>= <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">"</span>
      <span style="color: rgba(0, 0, 255, 1)">for</span> token, tag <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> ne_tagged:
            </span><span style="color: rgba(0, 0, 255, 1)">if</span> tag == <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">O</span>
<span style="color: rgba(0, 0, 0, 1)">                bio_tagged.append((token, tag))
                prev_tag </span>=<span style="color: rgba(0, 0, 0, 1)"> tag
                </span><span style="color: rgba(0, 0, 255, 1)">continue</span>
            <span style="color: rgba(0, 0, 255, 1)">if</span> tag != <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">and</span> prev_tag == <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Begin NE</span>
                bio_tagged.append((token, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">B-</span><span style="color: rgba(128, 0, 0, 1)">"</span>+<span style="color: rgba(0, 0, 0, 1)">tag))
                prev_tag </span>=<span style="color: rgba(0, 0, 0, 1)"> tag
            </span><span style="color: rgba(0, 0, 255, 1)">elif</span> prev_tag != <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">and</span> prev_tag == tag: <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Inside NE</span>
                bio_tagged.append((token, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">I-</span><span style="color: rgba(128, 0, 0, 1)">"</span>+<span style="color: rgba(0, 0, 0, 1)">tag))
                prev_tag </span>=<span style="color: rgba(0, 0, 0, 1)"> tag
            </span><span style="color: rgba(0, 0, 255, 1)">elif</span> prev_tag != <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">O</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">and</span> prev_tag != tag: <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Adjacent NE</span>
                bio_tagged.append((token, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">B-</span><span style="color: rgba(128, 0, 0, 1)">"</span>+<span style="color: rgba(0, 0, 0, 1)">tag))
                prev_tag </span>=<span style="color: rgba(0, 0, 0, 1)"> tag
      </span><span style="color: rgba(0, 0, 255, 1)">return</span> bio_tagged</pre>
</div>
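<p>  To make the BIO rules concrete, here is a condensed sketch applied to the example sentence from the text. It assumes the input is a list of <code>(token, tag)</code> pairs in the style of the Stanford output; <code>bio_tag</code> and <code>sample</code> are illustrative names, and the sample tags are written by hand, not produced by a tagger.</p>

```python
def bio_tag(ne_tagged):
    """Convert (token, tag) pairs into BIO-prefixed (token, tag) pairs."""
    bio_tagged = []
    prev_tag = "O"
    for token, tag in ne_tagged:
        if tag == "O":
            # Not part of any named entity
            bio_tagged.append((token, tag))
        elif tag == prev_tag:
            # Same entity type as the previous token: inside the entity
            bio_tagged.append((token, "I-" + tag))
        else:
            # Previous token was "O" or a different type: a new entity begins
            bio_tagged.append((token, "B-" + tag))
        prev_tag = tag
    return bio_tagged


# Hand-written tags for illustration only
sample = [
    ("Barack", "PERSON"), ("Obama", "PERSON"), ("went", "O"),
    ("to", "O"), ("Greece", "LOCATION"), ("today", "O"),
]

if __name__ == "__main__":
    for token, tag in bio_tag(sample):
        print(token, tag)
```

<p>  On this sample, <code>Barack</code> becomes <code>B-PERSON</code>, <code>Obama</code> becomes <code>I-PERSON</code>, <code>Greece</code> becomes <code>B-LOCATION</code>, and the remaining tokens stay <code>O</code>.</p>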
<p>  Now we write the BIO-tagged tokens into a tree so that they have the same format as the NLTK output.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Create tree       </span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> stanford_tree(bio_tagged):
    tokens, ne_tags </span>= zip(*<span style="color: rgba(0, 0, 0, 1)">bio_tagged)
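    pos_tags </span>= [pos <span style="color: rgba(0, 0, 255, 1)">for</span> token, pos <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> pos_tag(list(tokens))]  # conlltags2tree needs a POS tag for every token
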
    conlltags </span>= [(token, pos, ne) <span style="color: rgba(0, 0, 255, 1)">for</span> token, pos, ne <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> zip(tokens, pos_tags, ne_tags)]
    ne_tree </span>=<span style="color: rgba(0, 0, 0, 1)"> conlltags2tree(conlltags)
    </span><span style="color: rgba(0, 0, 255, 1)">return</span> ne_tree</pre>
</div>
<p>  Iterate over the tree and extract all named entities:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Parse named entities from tree</span>
<span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> structure_ne(ne_tree):
    ne </span>=<span style="color: rgba(0, 0, 0, 1)"> []
    </span><span style="color: rgba(0, 0, 255, 1)">for</span> subtree <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> ne_tree:
      </span><span style="color: rgba(0, 0, 255, 1)">if</span> type(subtree) == Tree: <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> If subtree is a noun chunk, i.e. NE != "O"</span>
            ne_label =<span style="color: rgba(0, 0, 0, 1)"> subtree.label()
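            ne_string </span>= <span style="color: rgba(128, 0, 0, 1)">" "</span>.join(token <span style="color: rgba(0, 0, 255, 1)">for</span> token, pos <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> subtree.leaves())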
            ne.append((ne_string, ne_label))
    </span><span style="color: rgba(0, 0, 255, 1)">return</span> ne</pre>
</div>
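<p>  The same grouping idea can be sketched without building an NLTK tree at all, by walking the BIO-tagged tokens directly. This is an alternative illustration rather than the tutorial's method: <code>group_entities</code> is a hypothetical helper, and it assumes tags of the form <code>B-LABEL</code>, <code>I-LABEL</code>, or <code>O</code>.</p>

```python
def group_entities(bio_tagged):
    """Merge BIO-tagged tokens into (entity_string, label) pairs."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in bio_tagged:
        # An I- tag with the same label continues the entity in progress
        if tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
            continue
        # Anything else ends the entity in progress, if there is one
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        # A non-O tag starts a new entity; strip the "B-" or "I-" prefix
        if tag != "O":
            current_tokens, current_label = [token], tag[2:]
    # Flush an entity that runs to the end of the input
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


if __name__ == "__main__":
    tagged = [
        ("Barack", "B-PERSON"), ("Obama", "I-PERSON"), ("went", "O"),
        ("to", "O"), ("Greece", "B-LOCATION"), ("today", "O"),
    ]
    print(group_entities(tagged))
```

<p>  The tree-based version in the tutorial keeps POS tags alongside each token, which this shortcut discards, so the tree is still the better fit when the output must match the NLTK format.</p>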
<p>  In our calls, we chain all of the additional functions together.</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)"># Path to the article (the snippets use txt_file, so define it before calling)</span>
txt_file = <span style="color: rgba(128, 0, 0, 1)">"/usr/share/news_article.txt"</span>

<span style="color: rgba(0, 0, 255, 1)">def</span> stanford_main():
    <span style="color: rgba(0, 0, 255, 1)">print</span>(structure_ne(stanford_tree(bio_tagger(stanford_tagger(process_text(txt_file))))))

<span style="color: rgba(0, 0, 255, 1)">def</span> nltk_main():
    <span style="color: rgba(0, 0, 255, 1)">print</span>(structure_ne(nltk_tagger(process_text(txt_file))))</pre>
</div>
<p>  Then we call these functions:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 128, 1)">__name__</span> == <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">__main__</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">:
    stanford_main()
    nltk_main()</span></pre>
</div>
<p>  Here is the much nicer output from Stanford:</p>
<div class="cnblogs_code">
<pre>[(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">House</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">John Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Obama</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Republican House</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">John Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span 
style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">America</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">LOCATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">United States</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">LOCATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: 
rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">State Department</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Republicans</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">MISC</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Democratic</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">MISC</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">GOP</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">MISC</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Obama</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: 
rgba(128, 0, 0, 1)">White House</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">LOCATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>)]</pre>
</div>
<p>And the output from NLTK:</p>
<div class="cnblogs_code">
<pre>[(<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">House</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">John Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Obama</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Republican</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">House</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">John Boehner</span><span style="color: 
rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">America</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">GPE</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">United States</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">GPE</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone Pipeline</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 
1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Listen</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Keystone</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Boehner</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">State Department</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">Democratic</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">GOP</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ORGANIZATION</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span 
style="color: rgba(128, 0, 0, 1)">Obama</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">PERSON</span><span style="color: rgba(128, 0, 0, 1)">'</span>), (<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">White House</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">FACILITY</span><span style="color: rgba(128, 0, 0, 1)">'</span>)]</pre>
</div>
<p>With the tokens chunked together into whole entities, both outputs are much easier to read and compare.</p>
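<p>As a rough illustration of how per-token tags can be chunked into lists like the ones above, here is a minimal sketch using <code>itertools.groupby</code>. The function name <code>group_entities</code>, the sample <code>tagged</code> list, and the <code>"O"</code> marker for non-entity tokens are assumptions made for this example, not output from the original run.</p>

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of adjacent tokens that share a tag into (text, tag) chunks.

    tagged_tokens: list of (token, tag) pairs, one per token.
    outside_tag:   tag marking tokens that are not part of any entity
                   (assumed to be "O" here).
    """
    chunks = []
    # groupby yields one group per run of consecutive pairs with the same tag
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == outside_tag:
            continue  # skip non-entity tokens
        text = " ".join(token for token, _ in run)
        chunks.append((text, tag))
    return chunks


# Hypothetical per-token tags, for illustration only
tagged = [
    ("House", "ORGANIZATION"), ("Speaker", "O"),
    ("John", "PERSON"), ("Boehner", "PERSON"),
    ("opposes", "O"), ("the", "O"),
    ("Keystone", "ORGANIZATION"), ("Pipeline", "ORGANIZATION"),
]

print(group_entities(tagged))
```

<p>One limitation of this simple approach: two different entities with the same tag that sit directly next to each other would be merged into a single chunk.</p>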
</div>
<br>Note: reference: Natural Language Process</div>






</div><br><br>
Source: https://www.cnblogs.com/chen8023miss/p/11458571.html