Posted by 超净科技 on 2019-11-10 22:40:00

Python Web Scraping: How to Use Beautiful Soup

<h2>1. Introduction to Beautiful Soup</h2>
<p>In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:</p>
<blockquote>
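<p>Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.</p>
<p>Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect one; in that case you only need to state the original encoding.</p>
<p>Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.</p>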
</blockquote>
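<p>As a small illustration of the encoding behavior described in the quote, the sketch below parses a byte string whose encoding is stated explicitly and then serializes a tag back to UTF-8. The sample bytes, the choice of ISO-8859-1, and the use of the built-in html.parser are assumptions made for this example.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical input: raw ISO-8859-1 bytes with no encoding declaration.
raw = b"<p>caf\xe9</p>"

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

text = soup.p.string                 # text inside the tree is Unicode
utf8_bytes = soup.p.encode("utf-8")  # serialized output as UTF-8 bytes

print(text)
print(utf8_bytes)
```

<p>Inside the tree everything is a Unicode string; the byte encoding only matters at the input and output boundaries.</p>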
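<p>Enough talk; let's try it out.</p>
<h2>2. Installing Beautiful Soup</h2>
<p>Beautiful Soup 3 is no longer developed, so current projects should use Beautiful Soup 4. The library has been ported to the bs4 package, which means the import is now import bs4. The version used here is Beautiful Soup 4.3.2 (BS4 for short), running on Python 2.7.7. Contrary to a common rumor, BS4 works well on Python 3; it is Beautiful Soup 3 that targets Python 2 only, so Python 3 users should install BS4 as well.</p>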
<p>&nbsp;</p>
<p>You can install it with either pip or easy_install; both of the following work.</p>
<div id="crayon-5dc820a271971586129662" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271971586129662-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271971586129662-1" class="crayon-line"><span class="crayon-e">easy_install <span class="crayon-v">beautifulsoup4</span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a271979017677644" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271979017677644-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271979017677644-1" class="crayon-line"><span class="crayon-e">pip <span class="crayon-e">install <span class="crayon-v">beautifulsoup4</span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
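<p>If you want the latest version, you can also download the source package and install it by hand, which is just as easy. The version installed here is Beautiful Soup 4.3.2.</p>
<p>Beautiful Soup 3.2.1 | Beautiful Soup 4.3.2</p>
<p>Once the download finishes, unpack the archive.</p>
<p>Then run the following command to complete the installation:</p>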
<div id="crayon-5dc820a27197b019986252" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27197b019986252-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27197b019986252-1" class="crayon-line"><span class="crayon-e">sudo <span class="crayon-e">python <span class="crayon-v">setup<span class="crayon-sy">.<span class="crayon-e">py <span class="crayon-v">install</span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Next, install lxml:</p>
<div id="crayon-5dc820a27197d186983172" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27197d186983172-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27197d186983172-1" class="crayon-line"><span class="crayon-e">easy_install <span class="crayon-v">lxml</span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a27197e610858664" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27197e610858664-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27197e610858664-1" class="crayon-line"><span class="crayon-e">pip <span class="crayon-e">install <span class="crayon-v">lxml</span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Another parser option is html5lib, a pure-Python implementation that parses HTML the same way a browser does. Install it with either of the following:</p>
<div id="crayon-5dc820a271980757388150" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271980757388150-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271980757388150-1" class="crayon-line"><span class="crayon-e">easy_install <span class="crayon-v">html5lib</span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a271981450560069" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271981450560069-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271981450560069-1" class="crayon-line"><span class="crayon-e">pip <span class="crayon-e">install <span class="crayon-v">html5lib</span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
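<p>Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended.</p>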
<div>
<p>&nbsp;</p>
<table>
<thead>
<tr><th class="head">Parser</th><th class="head">Typical usage</th><th class="head">Advantages</th><th class="head">Disadvantages</th></tr>
</thead>
<tbody valign="top">
<tr>
<td>Python standard library</td>
<td>BeautifulSoup(markup, "html.parser")</td>
<td>
<ul class="first last simple">
<li>Built into Python's standard library</li>
<li>Decent speed</li>
<li>Lenient with malformed documents</li>
</ul>
</td>
<td>
<ul class="first last simple">
<li>Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2</li>
</ul>
</td>
</tr>
<tr>
<td>lxml HTML parser</td>
<td>BeautifulSoup(markup, "lxml")</td>
<td>
<ul class="first last simple">
<li>Very fast</li>
<li>Lenient with malformed documents</li>
</ul>
</td>
<td>
<ul class="first last simple">
<li>Requires an external C library</li>
</ul>
</td>
</tr>
<tr>
<td>lxml XML parser</td>
<td>BeautifulSoup(markup, ["lxml", "xml"])<br>BeautifulSoup(markup, "xml")</td>
<td>
<ul class="first last simple">
<li>Very fast</li>
<li>The only parser that supports XML</li>
</ul>
</td>
<td>
<ul class="first last simple">
<li>Requires an external C library</li>
</ul>
</td>
</tr>
<tr>
<td>html5lib</td>
<td>BeautifulSoup(markup, "html5lib")</td>
<td>
<ul class="first last simple">
<li>Best tolerance of malformed documents</li>
<li>Parses pages the same way a web browser does</li>
<li>Produces valid HTML5</li>
</ul>
</td>
<td>
<ul class="first last simple">
<li>Slow</li>
<li>Requires an external Python package</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
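<p>One way to act on the comparison above is to try the faster parsers first and fall back to the built-in one. The helper below is a minimal sketch: make_soup and its preference order are hypothetical names chosen for this example, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.</p>

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately sloppy markup: neither tag is closed.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)
print(soup.b.string)
```

<p>Naming the parser explicitly also keeps results reproducible across machines, since different parsers can build different trees from the same broken markup.</p>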
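<h2>3. Getting Started with Beautiful Soup</h2>
<p>Here is the link to the official documentation. It is long and not especially well organized, so this article condenses it for easy reference.</p>
<p>Official documentation</p>
<h2>4.&nbsp;Creating a Beautiful Soup Object</h2>
<p>First, import the bs4 library:</p>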
<div id="crayon-5dc820a271984359830854" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271984359830854-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271984359830854-1" class="crayon-line"><span class="crayon-e">from <span class="crayon-e">bs4 <span class="crayon-e">import <span class="crayon-v">BeautifulSoup</span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Next, define a string that the following examples will use for demonstration:</p>
<div id="crayon-5dc820a271986616445999" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">
</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271986616445999-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271986616445999-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a271986616445999-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271986616445999-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a271986616445999-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271986616445999-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a271986616445999-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271986616445999-8">8</div>
<div class="crayon-num" data-line="crayon-5dc820a271986616445999-9">9</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271986616445999-10">10</div>
<div class="crayon-num" data-line="crayon-5dc820a271986616445999-11">11</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271986616445999-1" class="crayon-line"><span class="crayon-v">html<span class="crayon-h"> <span class="crayon-o">=<span class="crayon-h"> <span class="crayon-s">"""</span></span></span></span></span></div>
<div id="crayon-5dc820a271986616445999-2" class="crayon-line crayon-striped-line"><span class="crayon-s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span></div>
<div id="crayon-5dc820a271986616445999-3" class="crayon-line"><span class="crayon-s">&lt;body&gt;</span></div>
<div id="crayon-5dc820a271986616445999-4" class="crayon-line crayon-striped-line"><span class="crayon-s">&lt;p class="title" name="dromouse"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span></div>
<div id="crayon-5dc820a271986616445999-5" class="crayon-line"><span class="crayon-s">&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were</span></div>
<div id="crayon-5dc820a271986616445999-6" class="crayon-line crayon-striped-line"><span class="crayon-s">&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;,</span></div>
<div id="crayon-5dc820a271986616445999-7" class="crayon-line"><span class="crayon-s">&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and</span></div>
<div id="crayon-5dc820a271986616445999-8" class="crayon-line crayon-striped-line"><span class="crayon-s">&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;</span></div>
<div id="crayon-5dc820a271986616445999-9" class="crayon-line"><span class="crayon-s">and they lived at the bottom of a well.&lt;/p&gt;</span></div>
<div id="crayon-5dc820a271986616445999-10" class="crayon-line crayon-striped-line"><span class="crayon-s">&lt;p class="story"&gt;...&lt;/p&gt;</span></div>
<div id="crayon-5dc820a271986616445999-11" class="crayon-line"><span class="crayon-s">"""</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Create the BeautifulSoup object:</p>
<div id="crayon-5dc820a271988972250292" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">
</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271988972250292-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271988972250292-1" class="crayon-line"><span class="crayon-v">soup</span><span class="crayon-h"> </span><span class="crayon-o">=</span><span class="crayon-h"> </span><span class="crayon-e">BeautifulSoup</span><span class="crayon-sy">(</span><span class="crayon-v">html</span><span class="crayon-sy">,</span><span class="crayon-h"> </span><span class="crayon-s">"lxml"</span><span class="crayon-sy">)</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>You can also create the object from a local HTML file, for example:</p>
<div id="crayon-5dc820a271989862636529" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271989862636529-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271989862636529-1" class="crayon-line"><span class="crayon-v">soup</span><span class="crayon-h"> </span><span class="crayon-o">=</span><span class="crayon-h"> </span><span class="crayon-e">BeautifulSoup</span><span class="crayon-sy">(</span><span class="crayon-e">open</span><span class="crayon-sy">(</span><span class="crayon-s">'index.html'</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span><span class="crayon-h"> </span><span class="crayon-s">"lxml"</span><span class="crayon-sy">)</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>The line above opens the local file index.html and uses its contents to create the soup object.</p>
<p>Now let's print the contents of the soup object with formatted output:</p>
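<p>When reading from a file, it can be tidier to open it in a with-block so the handle is closed once parsing is done. The sketch below first writes a small, hypothetical index.html to a temporary directory so that it runs on its own; the file contents and the html.parser choice are assumptions for this example.</p>

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small hypothetical index.html so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Local page</title></head><body></body></html>")

# BeautifulSoup accepts an open file handle as well as a string.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string)
```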
<div id="crayon-5dc820a27198b337054984" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27198b337054984-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27198b337054984-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-e">prettify</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a27198c004863666" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27198c004863666-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a27198c004863666-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a27198c004863666-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a27198c004863666-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a27198c004863666-5">5</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27198c004863666-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-v">html<span class="crayon-o">&gt;</span></span></span></div>
<div id="crayon-5dc820a27198c004863666-2" class="crayon-line crayon-striped-line"><span class="crayon-h"> <span class="crayon-o">&lt;<span class="crayon-v">head<span class="crayon-o">&gt;</span></span></span></span></div>
<div id="crayon-5dc820a27198c004863666-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;<span class="crayon-o">&lt;<span class="crayon-v">title<span class="crayon-o">&gt;</span></span></span></span></div>
<div id="crayon-5dc820a27198c004863666-4" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp; <span class="crayon-e">The <span class="crayon-i">Dormouse<span class="crayon-s">'s story</span></span></span></span></div>
<div id="crayon-5dc820a27198c004863666-5" class="crayon-line"><span class="crayon-s">&nbsp;&nbsp;&lt;/title&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
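<p>The above is the beginning of the output: the content is printed with indentation. This function comes up all the time, so it is worth remembering.</p>
<h2>5.&nbsp;The Four Kinds of Objects</h2>
<p>Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds:</p>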
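<p>To make the difference concrete, the sketch below compares prettify() with plain string conversion on a shortened version of the sample document. The shortened markup and the html.parser choice are assumptions for this example.</p>

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, "html.parser")

pretty = soup.prettify()  # one node per line, indented by nesting depth
compact = str(soup)       # the same tree without added whitespace

print(pretty)
print(compact)
```

<p>prettify() is meant for reading and debugging; because it adds whitespace, str(soup) is the safer choice when the exact markup matters.</p>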
<ul>
<li><tt class="docutils literal">Tag</tt></li>
<li><tt class="docutils literal">NavigableString</tt></li>
<li><tt class="docutils literal">BeautifulSoup</tt></li>
<li><tt class="docutils literal">Comment</tt></li>
</ul>
<p>Let's go through them one by one.</p>
<h3>(1) Tag</h3>
<p>What is a Tag? Put simply, it is one of the tags in the HTML, for example:</p>
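<p>As a quick preview, the sketch below shows one instance of each of the four kinds. The shortened markup is an assumption for this example; the class names come from bs4 itself.</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
)
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)           # BeautifulSoup
print(type(soup.b).__name__)         # Tag
print(type(soup.b.string).__name__)  # NavigableString
print(type(soup.a.string).__name__)  # Comment
```

<p>Comment is a subclass of NavigableString, so code that only wants visible text should check for Comment explicitly.</p>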
<div id="crayon-5dc820a27198f802188004" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27198f802188004-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27198f802188004-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-v">title<span class="crayon-o">&gt;<span class="crayon-e">The <span class="crayon-i">Dormouse'<span class="crayon-i">s<span class="crayon-h"> <span class="crayon-v">story<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">title<span class="crayon-o">&gt;</span></span></span></span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a271990469247476" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271990469247476-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271990469247476-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-i">a<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"sister"<span class="crayon-h"> <span class="crayon-v">href<span class="crayon-o">=<span class="crayon-s">"http://example.com/elsie"<span class="crayon-h"> <span class="crayon-v">id<span class="crayon-o">=<span class="crayon-s">"link1"<span class="crayon-o">&gt;<span class="crayon-v">Elsie<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">a<span class="crayon-o">&gt;</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
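<p>HTML tags such as title and a above, together with the content they enclose, are Tags. Let's see how easily Beautiful Soup retrieves them.</p>
<p>In each snippet below, the comment shows the result of running the code.</p>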
<div id="crayon-5dc820a271992289991938" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271992289991938-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271992289991938-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271992289991938-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">title</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a271992289991938-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;title&gt;The Dormouse's story&lt;/title&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a271993211538754" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271993211538754-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271993211538754-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271993211538754-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">head</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a271993211538754-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a271994900767405" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271994900767405-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271994900767405-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271994900767405-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">a</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a271994900767405-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a271996046223738" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271996046223738-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271996046223738-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271996046223738-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">p</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a271996046223738-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;p class="title" name="dromouse"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
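<p>Writing soup followed by a tag name gets you that tag and its content with almost no effort, which is far more convenient than regular expressions. Note, however, that it returns only the first matching tag in the document; querying for all matching tags is covered later.</p>
<p>We can check the type of these objects:</p>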
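<p>The "first match only" behavior can be checked with the short sketch below. It uses find and find_all purely as a brief preview, and the trimmed markup and html.parser choice are assumptions for this example.</p>

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a               # attribute access returns the first <a> only
same = soup.find("a")        # equivalent explicit call
all_ids = [a["id"] for a in soup.find_all("a")]
missing = soup.table         # no such tag in the document

print(first["id"])
print(all_ids)
print(missing)
```

<p>Because a missing tag yields None rather than an exception, chained access such as soup.table.tr should be guarded with a None check.</p>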
<div id="crayon-5dc820a271998868533980" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271998868533980-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271998868533980-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271998868533980-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-e">type</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">a</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a271998868533980-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;class 'bs4.element.Tag'&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>A Tag has two important attributes, name and attrs. Let's look at each in turn.</p>
<p>name</p>
<div id="crayon-5dc820a271999013408022" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271999013408022-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271999013408022-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a271999013408022-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271999013408022-4">4</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271999013408022-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">name</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a271999013408022-2" class="crayon-line crayon-striped-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">head</span><span class="crayon-sy">.</span><span class="crayon-v">name</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a271999013408022-3" class="crayon-line"><span class="crayon-p">#[document]</span></div>
<div id="crayon-5dc820a271999013408022-4" class="crayon-line crayon-striped-line"><span class="crayon-p">#head</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>The soup object itself is a special case: its name is [document]. For any other tag inside it, the value is simply the name of the tag.</p>
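<p>The name attribute can be written as well as read. The sketch below is a minimal illustration on a shortened document; the markup, the html.parser choice, and the rename to h1 are assumptions for this example.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)

print(soup.name)       # [document]
print(soup.head.name)  # head

# Assigning to name renames the tag in the generated markup.
soup.title.name = "h1"
renamed = str(soup.head)
print(renamed)         # <head><h1>The Dormouse's story</h1></head>
```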
<p>attrs</p>
<div id="crayon-5dc820a27199b270720976" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27199b270720976-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a27199b270720976-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27199b270720976-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">p</span><span class="crayon-sy">.</span><span class="crayon-v">attrs</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a27199b270720976-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#{'class': ['title'], 'name': 'dromouse'}</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
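<p>Here we printed all the attributes of the p tag; the result is a dictionary.</p>
<p>To fetch a single attribute, index the tag by the attribute name. For example, to find out what its class is:</p>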
<div id="crayon-5dc820a27199c213965718" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27199c213965718-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a27199c213965718-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27199c213965718-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p<span class="crayon-sy">[<span class="crayon-s">'class'<span class="crayon-sy">]</span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a27199c213965718-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#['title']</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
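<p>You can also call the get method with the attribute name. The two forms return the same value when the attribute exists; when it is missing, the index form raises KeyError while get returns None.</p>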
<div id="crayon-5dc820a27199e200135604" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27199e200135604-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a27199e200135604-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27199e200135604-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p<span class="crayon-sy">.<span class="crayon-e">get<span class="crayon-sy">(<span class="crayon-s">'class'<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a27199e200135604-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#['title']</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
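<p>The difference between the two access forms can be sketched as follows. This is a minimal illustration, assuming Python 3 and the built-in html.parser; the markup string is a shortened stand-in for the sample document used in this article.</p>

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the article's sample markup (an assumption).
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

by_index = p["class"]    # index form: raises KeyError if the attribute is absent
by_get = p.get("class")  # get form: returns None if the attribute is absent
missing = p.get("id")    # this p tag has no id attribute

try:
    p["id"]
    raised = False
except KeyError:
    raised = True

print(by_index, by_get, missing, raised)
```

<p>When an attribute may be absent, get is usually the more convenient form because no exception handling is needed.</p>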
<p>These attributes and the tag's content can also be modified, for example:</p>
<div id="crayon-5dc820a27199f759664064" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a27199f759664064-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a27199f759664064-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a27199f759664064-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a27199f759664064-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p<span class="crayon-sy">[<span class="crayon-s">'class'<span class="crayon-sy">]<span class="crayon-o">=<span class="crayon-s">"newClass"</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a27199f759664064-2" class="crayon-line crayon-striped-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p</span></span></span></span></div>
<div id="crayon-5dc820a27199f759664064-3" class="crayon-line"><span class="crayon-p">#&lt;p class="newClass" name="dromouse"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>An attribute can also be deleted, for example:</p>
<div id="crayon-5dc820a2719a1241387575" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719a1241387575-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a1241387575-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719a1241387575-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719a1241387575-1" class="crayon-line"><span class="crayon-e">del <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p<span class="crayon-sy">[<span class="crayon-s">'class'<span class="crayon-sy">]</span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719a1241387575-2" class="crayon-line crayon-striped-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p</span></span></span></span></div>
<div id="crayon-5dc820a2719a1241387575-3" class="crayon-line"><span class="crayon-p">#&lt;p name="dromouse"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
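<p>Modifying and deleting are not our main use case, so they are not covered in detail here. If you need them, see the official documentation linked earlier.</p>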
<h3>(2)NavigableString</h3>
<p>Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:</p>
<div id="crayon-5dc820a2719a2773993131" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719a2773993131-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a2773993131-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719a2773993131-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p<span class="crayon-sy">.<span class="crayon-t">string</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719a2773993131-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#The Dormouse's story</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
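<p>This gets the text inside the tag with very little effort; consider how much more work it would be with regular expressions. Its type is NavigableString, that is, a string that can also be navigated as part of the parse tree.</p>
<p>Let's check its type:</p>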
<div id="crayon-5dc820a2719a4723885541" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719a4723885541-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a4723885541-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719a4723885541-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-e">type<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">p<span class="crayon-sy">.<span class="crayon-t">string<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719a4723885541-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;class 'bs4.element.NavigableString'&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
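<p>As a rough sketch of what "navigable" means here, the snippet below checks that the value returned by .string is both a string and a node in the tree. It assumes Python 3 and the built-in html.parser, and the markup is a shortened stand-in for the article's sample document.</p>

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened stand-in for the article's sample markup (an assumption).
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

text = soup.p.string
is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)   # NavigableString is a subclass of str
parent_name = text.parent.name   # the string still knows its parent tag
plain = str(text)                # copy the text out as a regular string

print(is_navigable, is_str, parent_name, plain)
```

<p>Converting with str() is useful when you want to keep only the text and not a reference into the parse tree.</p>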
<p>&nbsp;</p>
<h3>(3)BeautifulSoup</h3>
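<p>The <tt class="docutils literal">BeautifulSoup</tt> object represents the whole document. Most of the time you can treat it as a <tt class="docutils literal">Tag</tt> object; it is a special Tag. We can look at the type of its name, the name itself, and its attributes to get a feel for it:</p>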
<div id="crayon-5dc820a2719a6836418204" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719a6836418204-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a6836418204-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719a6836418204-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a6836418204-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719a6836418204-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a6836418204-6">6</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719a6836418204-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-e">type<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">name<span class="crayon-sy">)</span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719a6836418204-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;type 'unicode'&gt;</span></div>
<div id="crayon-5dc820a2719a6836418204-3" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">name</span></span></span></span></div>
<div id="crayon-5dc820a2719a6836418204-4" class="crayon-line crayon-striped-line"><span class="crayon-p"># [document]</span></div>
<div id="crayon-5dc820a2719a6836418204-5" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">attrs</span></span></span></span></div>
<div id="crayon-5dc820a2719a6836418204-6" class="crayon-line crayon-striped-line"><span class="crayon-p">#{} (an empty dictionary)</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
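<p>The same checks can be written for Python 3 roughly as below. This is only a sketch, assuming the built-in html.parser and a one-line stand-in for the sample document.</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# One-line stand-in for the article's sample markup (an assumption).
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

is_tag = isinstance(soup, Tag)  # BeautifulSoup is a subclass of Tag
doc_name = soup.name            # special name of the document object
doc_attrs = soup.attrs          # the document object has no attributes
title_name = soup.title.name    # an ordinary tag reports its own name

print(is_tag, doc_name, doc_attrs, title_name)
```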
<p>&nbsp;</p>
<h3>(4)Comment</h3>
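<p>A <tt class="docutils literal">Comment</tt> object is a special type of <tt class="docutils literal">NavigableString</tt>. When printed, its content does not include the comment delimiters, so if it is not handled carefully it can cause unexpected trouble in text processing.</p>
<p>Let's find a tag that contains a comment:</p>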
<div id="crayon-5dc820a2719a7649557562" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719a7649557562-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a7649557562-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719a7649557562-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719a7649557562-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-i">a</span></span></span></span></div>
<div id="crayon-5dc820a2719a7649557562-2" class="crayon-line crayon-striped-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">a<span class="crayon-sy">.<span class="crayon-t">string</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719a7649557562-3" class="crayon-line"><span class="crayon-e">print <span class="crayon-e">type<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">a<span class="crayon-sy">.<span class="crayon-t">string<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<p>The output is as follows:</p>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719a9808884676" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719a9808884676-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719a9808884676-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719a9808884676-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719a9808884676-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-i">a<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"sister"<span class="crayon-h"> <span class="crayon-v">href<span class="crayon-o">=<span class="crayon-s">"http://example.com/elsie"<span class="crayon-h"> <span class="crayon-v">id<span class="crayon-o">=<span class="crayon-s">"link1"<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">!<span class="crayon-o">--<span class="crayon-h"> <span class="crayon-v">Elsie<span class="crayon-h"> <span class="crayon-o">--<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">a<span class="crayon-o">&gt;</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719a9808884676-2" class="crayon-line crayon-striped-line"><span class="crayon-h"> <span class="crayon-v">Elsie</span></span></div>
<div id="crayon-5dc820a2719a9808884676-3" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-t">class<span class="crayon-h"> <span class="crayon-s">'bs4.element.Comment'<span class="crayon-o">&gt;</span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
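<p>The content of the a tag is actually a comment, but when we print it with .string the comment delimiters are stripped, so it looks like ordinary text. This can cause unnecessary trouble.</p>
<p>Printing its type shows that it is a Comment, so it is best to check the type before using the value. The check looks like this:</p>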
<div id="crayon-5dc820a2719aa545087733" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719aa545087733-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719aa545087733-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719aa545087733-1" class="crayon-line"><span class="crayon-st">if<span class="crayon-h"> <span class="crayon-e">type<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">a<span class="crayon-sy">.<span class="crayon-t">string<span class="crayon-sy">)<span class="crayon-o">==<span class="crayon-v">bs4<span class="crayon-sy">.<span class="crayon-v">element<span class="crayon-sy">.<span class="crayon-v">Comment<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719aa545087733-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">a<span class="crayon-sy">.<span class="crayon-t">string</span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>In the code above we first check whether the value is a Comment, and only then do something with it, such as printing it.</p>
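<p>A common variant of this check uses isinstance rather than comparing type() directly. The sketch below keeps only link text that is not a comment. It assumes Python 3 and the built-in html.parser, uses a two-link stand-in for the sample document, and relies on find_all, which this part of the article has not introduced yet.</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Two-link stand-in for the article's sample markup (an assumption).
html = (
    '<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Skip tags without a single string child, and skip comments.
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))

print(visible)
```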
<h2>6.&nbsp;Traversing the document tree</h2>
<h3>(1) Direct children</h3>
<blockquote>
<p>Key points: the .contents and .children attributes</p>
</blockquote>
<p>.contents</p>
<p>A tag's .contents attribute returns the tag's children as a list:</p>
<div id="crayon-5dc820a2719ac079558702" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719ac079558702-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ac079558702-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719ac079558702-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">head<span class="crayon-sy">.<span class="crayon-i">contents&nbsp;</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719ac079558702-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>The result is a list, so we can use a list index to get one of its elements:</p>
<div id="crayon-5dc820a2719ad022812200" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719ad022812200-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ad022812200-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719ad022812200-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">head<span class="crayon-sy">.<span class="crayon-v">contents<span class="crayon-sy">[<span class="crayon-cn">0<span class="crayon-sy">]</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719ad022812200-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;title&gt;The Dormouse's story&lt;/title&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
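<p>For reference, a Python 3 sketch of the same idea, assuming the built-in html.parser and a shortened stand-in for the sample document. It also shows that the children of a tag such as title can be plain text nodes rather than tags.</p>

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the article's sample markup (an assumption).
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

head_children = soup.head.contents   # a regular Python list
first = head_children[0]             # the title tag
title_children = soup.title.contents # a list holding one text node

print(len(head_children), first.name, title_children)
```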
<p>.children</p>
<p>It does not return a list, but we can get all the children by iterating over it.</p>
<p>If we print .children, we can see that it is a list iterator object:</p>
<div id="crayon-5dc820a2719af212771811" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719af212771811-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719af212771811-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719af212771811-1" class="crayon-line"><span class="crayon-e">print <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">head<span class="crayon-sy">.<span class="crayon-v">children</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719af212771811-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;listiterator object at 0x7f71457f5710&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>How do we get at the contents? Simply iterate over it. The code and its output are as follows:</p>
<div id="crayon-5dc820a2719b4772753963" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719b4772753963-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719b4772753963-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719b4772753963-1" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-r">child<span class="crayon-h"> <span class="crayon-st">in<span class="crayon-h">&nbsp;&nbsp;<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">body<span class="crayon-sy">.<span class="crayon-v">children<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b4772753963-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print <span class="crayon-r">child</span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719b5017724436" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719b5017724436-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719b5017724436-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719b5017724436-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719b5017724436-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719b5017724436-5">5</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719b5017724436-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-i">p<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"title"<span class="crayon-h"> <span class="crayon-v">name<span class="crayon-o">=<span class="crayon-s">"dromouse"<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-v">b<span class="crayon-o">&gt;<span class="crayon-e">The <span class="crayon-i">Dormouse'<span class="crayon-i">s<span class="crayon-h"> <span class="crayon-v">story<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">b<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">p<span class="crayon-o">&gt;</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b5017724436-2" class="crayon-line crayon-striped-line">&nbsp;</div>
<div id="crayon-5dc820a2719b5017724436-3" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-i">p<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"story"<span class="crayon-o">&gt;<span class="crayon-e">Once <span class="crayon-i">upon<span class="crayon-h"> <span class="crayon-i">a<span class="crayon-h"> <span class="crayon-e">time <span class="crayon-e">there <span class="crayon-e">were <span class="crayon-e">three <span class="crayon-e">little <span class="crayon-v">sisters<span class="crayon-sy">;<span class="crayon-h"> <span class="crayon-st">and<span class="crayon-h"> <span class="crayon-e">their <span class="crayon-e">names <span class="crayon-v">were</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b5017724436-4" class="crayon-line crayon-striped-line"><span class="crayon-o">&lt;<span class="crayon-i">a<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"sister"<span class="crayon-h"> <span class="crayon-v">href<span class="crayon-o">=<span class="crayon-s">"http://example.com/elsie"<span class="crayon-h"> <span class="crayon-v">id<span class="crayon-o">=<span class="crayon-s">"link1"<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">!<span class="crayon-o">--<span class="crayon-h"> <span class="crayon-v">Elsie<span class="crayon-h"> <span class="crayon-o">--<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">a<span class="crayon-o">&gt;<span class="crayon-sy">,</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b5017724436-5" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-i">a<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"sister"<span class="crayon-h"> <span class="crayon-v">href<span class="crayon-o">=<span class="crayon-s">"http://example.com/lacie"<span class="crayon-h"> <span class="crayon-v">id<span class="crayon-o">=<span class="crayon-s">"link2"<span class="crayon-o">&gt;<span class="crayon-v">Lacie<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">a<span class="crayon-o">&gt;<span class="crayon-h"> <span class="crayon-st">and</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
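<p>The blank line in the output above comes from a text node: the whitespace between two tags is itself a child. A small sketch of this, assuming Python 3, the built-in html.parser, and a reduced two-paragraph stand-in for the sample document:</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Reduced stand-in for the article's sample markup (an assumption).
html = '<body><p class="title">A</p>\n<p class="story">B</p></body>'
soup = BeautifulSoup(html, "html.parser")

# Label each direct child as a tag or as a text node.
kinds = ["tag" if isinstance(child, Tag) else "text"
         for child in soup.body.children]

# Iterating .children visits the same nodes that .contents holds.
same_nodes = list(soup.body.children) == soup.body.contents

print(kinds, same_nodes)
```

<p>If only tags are of interest, filtering with isinstance(child, Tag) inside the loop skips the whitespace nodes.</p>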
<p>&nbsp;</p>
<h3>(2) All descendants</h3>
<blockquote>
<p>Key point: the .descendants attribute</p>
</blockquote>
<p>.descendants</p>
<p><tt class="docutils literal">.contents</tt> and <tt class="docutils literal">.children</tt> only cover a tag's direct children. The <tt class="docutils literal">.descendants</tt> attribute iterates recursively over all of a tag's descendants. As with children, we need to loop over it to get the contents.</p>
<div id="crayon-5dc820a2719b7840671205" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719b7840671205-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719b7840671205-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719b7840671205-1" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-r">child<span class="crayon-h"> <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">descendants<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b7840671205-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print <span class="crayon-r">child</span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>The output is as follows. Every node is printed: first the outermost html tag, then the head tag is peeled apart layer by layer, and so on.</p>
<div id="crayon-5dc820a2719b9526519768" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719b9526519768-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719b9526519768-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719b9526519768-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719b9526519768-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719b9526519768-5">5</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719b9526519768-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-v">html<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-v">head<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-v">title<span class="crayon-o">&gt;<span class="crayon-e">The <span class="crayon-i">Dormouse<span class="crayon-s">'s story&lt;/title&gt;&lt;/head&gt;</span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b9526519768-2" class="crayon-line crayon-striped-line"><span class="crayon-s">&lt;body&gt;</span></div>
<div id="crayon-5dc820a2719b9526519768-3" class="crayon-line"><span class="crayon-s">&lt;p class="title" name="dromouse"&gt;&lt;b&gt;The Dormouse'<span class="crayon-i">s<span class="crayon-h"> <span class="crayon-v">story<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">b<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">p<span class="crayon-o">&gt;</span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b9526519768-4" class="crayon-line crayon-striped-line"><span class="crayon-o">&lt;<span class="crayon-i">p<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"story"<span class="crayon-o">&gt;<span class="crayon-e">Once <span class="crayon-i">upon<span class="crayon-h"> <span class="crayon-i">a<span class="crayon-h"> <span class="crayon-e">time <span class="crayon-e">there <span class="crayon-e">were <span class="crayon-e">three <span class="crayon-e">little <span class="crayon-v">sisters<span class="crayon-sy">;<span class="crayon-h"> <span class="crayon-st">and<span class="crayon-h"> <span class="crayon-e">their <span class="crayon-e">names <span class="crayon-v">were</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719b9526519768-5" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-i">a<span class="crayon-h"> <span class="crayon-t">class<span class="crayon-o">=<span class="crayon-s">"sister"<span class="crayon-h"> <span class="crayon-v">href<span class="crayon-o">=<span class="crayon-s">"http://example.com/elsie"<span class="crayon-h"> <span class="crayon-v">id<span class="crayon-o">=<span class="crayon-s">"link1"<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">!<span class="crayon-o">--<span class="crayon-h"> <span class="crayon-v">Elsie<span class="crayon-h"> <span class="crayon-o">--<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">a<span class="crayon-o">&gt;<span class="crayon-sy">,</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<h3>(3) Node content</h3>
<blockquote>
<p>Key point: the .string attribute</p>
</blockquote>
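<p>If a tag has exactly one child and that child is a&nbsp;<tt class="docutils literal">NavigableString</tt>, the tag's&nbsp;<tt class="docutils literal">.string</tt>&nbsp;attribute returns that child. If the tag's only child is another tag,&nbsp;<tt class="docutils literal">.string</tt>&nbsp;still works and returns the same value as that child's own&nbsp;<tt class="docutils literal">.string</tt>.</p>
<p>Put simply: if a tag contains no other tags, .string returns the text inside it. If it contains exactly one tag and nothing else, .string returns the text of that innermost tag. For example:</p>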
<div id="crayon-5dc820a2719bc020907494" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719bc020907494-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719bc020907494-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719bc020907494-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719bc020907494-4">4</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719bc020907494-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">head</span><span class="crayon-sy">.</span><span class="crayon-t">string</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719bc020907494-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># The Dormouse's story</span></div>
<div id="crayon-5dc820a2719bc020907494-3" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">title</span><span class="crayon-sy">.</span><span class="crayon-t">string</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719bc020907494-4" class="crayon-line crayon-striped-line"><span class="crayon-p"># The Dormouse's story</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>If a tag contains more than one child node, it is ambiguous which child .string should refer to, so .string returns None:</p>
<div id="crayon-5dc820a2719bf920249436" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719bf920249436-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719bf920249436-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719bf920249436-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">html</span><span class="crayon-sy">.</span><span class="crayon-t">string</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719bf920249436-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># None</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
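<p>The three cases above can be tried side by side. This is only a minimal sketch: it assumes Python 3 with the bs4 package installed, uses the built-in "html.parser" backend, and works on a tiny made-up document rather than the sample used in this article. It also shows get_text(), which is handy when .string gives None.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical mini document (not the article's full sample)
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p>one</p><p>two</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string  # the only child is a string
head_text = soup.head.string    # the only child is <title>, so its .string is reused
body_text = soup.body.string    # two children, ambiguous, so None

print(repr(title_text))
print(repr(head_text))
print(repr(body_text))

# get_text() joins every string below the tag instead of giving up
all_text = soup.body.get_text()
print(all_text)
```

<p>When a tag may have several children, get_text() or the .strings attribute described in the next section is usually the safer choice.</p>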
<p>&nbsp;</p>
<h3>(4) Multiple strings</h3>
<blockquote>
<p>Key point: the .strings and .stripped_strings attributes</p>
</blockquote>
<p>.strings</p>
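<p>This returns every string inside a tag. It is a generator, so you have to loop over it to get the values, as in the following example:</p>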
<div id="crayon-5dc820a2719c0977164024" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-8">8</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-9">9</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-10">10</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-11">11</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-12">12</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-13">13</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-14">14</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c0977164024-15">15</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c0977164024-16">16</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719c0977164024-1" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-t">string<span class="crayon-h"> <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">strings<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719c0977164024-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-e">repr<span class="crayon-sy">(<span class="crayon-t">string<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719c0977164024-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u"The Dormouse's story"</span></span></div>
<div id="crayon-5dc820a2719c0977164024-4" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'\n\n'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-5" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u"The Dormouse's story"</span></span></div>
<div id="crayon-5dc820a2719c0977164024-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'\n\n'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Once upon a time there were three little sisters; and their names were\n'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-8" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Elsie'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-9" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u',\n'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-10" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Lacie'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-11" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u' and\n'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-12" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Tillie'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-13" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u';\nand they lived at the bottom of a well.'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-14" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'\n\n'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-15" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'...'</span></span></div>
<div id="crayon-5dc820a2719c0977164024-16" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'\n'</span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>.stripped_strings&nbsp;</p>
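<p>The output may contain a lot of spaces and blank lines. Using&nbsp;<tt class="docutils literal">.stripped_strings</tt>&nbsp;strips leading and trailing whitespace from each string and skips strings that consist of whitespace only:</p>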
<div id="crayon-5dc820a2719c2125425544" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719c2125425544-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c2125425544-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c2125425544-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c2125425544-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c2125425544-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c2125425544-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c2125425544-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c2125425544-8">8</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c2125425544-9">9</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c2125425544-10">10</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c2125425544-11">11</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c2125425544-12">12</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719c2125425544-1" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-t">string<span class="crayon-h"> <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">stripped_strings<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719c2125425544-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-e">repr<span class="crayon-sy">(<span class="crayon-t">string<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719c2125425544-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u"The Dormouse's story"</span></span></div>
<div id="crayon-5dc820a2719c2125425544-4" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u"The Dormouse's story"</span></span></div>
<div id="crayon-5dc820a2719c2125425544-5" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Once upon a time there were three little sisters; and their names were'</span></span></div>
<div id="crayon-5dc820a2719c2125425544-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Elsie'</span></span></div>
<div id="crayon-5dc820a2719c2125425544-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u','</span></span></div>
<div id="crayon-5dc820a2719c2125425544-8" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Lacie'</span></span></div>
<div id="crayon-5dc820a2719c2125425544-9" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'and'</span></span></div>
<div id="crayon-5dc820a2719c2125425544-10" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'Tillie'</span></span></div>
<div id="crayon-5dc820a2719c2125425544-11" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u';\nand they lived at the bottom of a well.'</span></span></div>
<div id="crayon-5dc820a2719c2125425544-12" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'...'</span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
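<p>To see the difference between the two attributes on something small, here is a rough sketch. It assumes Python 3 with bs4 installed and the "html.parser" backend; the document is a made-up fragment, not the article's sample.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with whitespace between and inside the tags
html = "<div>\n<p> first </p>\n<p>second\n</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # every string, whitespace included
clean = list(soup.stripped_strings)  # trimmed, whitespace-only strings dropped

print(raw)
print(clean)

# get_text() with strip=True joins the stripped strings with a separator
joined = soup.get_text(" ", strip=True)
print(joined)
```

<p>Note that only the two ends of each string are trimmed; a line break in the middle of a string is kept.</p>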
<p>&nbsp;</p>
<h3>(5) Parent node</h3>
<blockquote>
<p>Key point: the .parent attribute</p>
</blockquote>
<p>.parent returns the node that directly contains the current one. Both tags and strings have it:</p>
<div id="crayon-5dc820a2719c4603813513" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719c4603813513-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c4603813513-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c4603813513-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719c4603813513-1" class="crayon-line"><span class="crayon-v">p<span class="crayon-h"> <span class="crayon-o">=<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-i">p</span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719c4603813513-2" class="crayon-line crayon-striped-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">p</span><span class="crayon-sy">.</span><span class="crayon-v">parent</span><span class="crayon-sy">.</span><span class="crayon-v">name</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719c4603813513-3" class="crayon-line"><span class="crayon-p">#body</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719c5964903104" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719c5964903104-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c5964903104-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c5964903104-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719c5964903104-1" class="crayon-line"><span class="crayon-v">content<span class="crayon-h"> <span class="crayon-o">=<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">head<span class="crayon-sy">.<span class="crayon-v">title<span class="crayon-sy">.<span class="crayon-t">string</span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719c5964903104-2" class="crayon-line crayon-striped-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">content</span><span class="crayon-sy">.</span><span class="crayon-v">parent</span><span class="crayon-sy">.</span><span class="crayon-v">name</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719c5964903104-3" class="crayon-line"><span class="crayon-p">#title</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
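<p>The same idea can be followed all the way up to the top of the tree. The sketch below assumes Python 3, bs4 installed, the "html.parser" backend, and a made-up mini document; the variable names are only for illustration.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical mini document
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p><b>bold</b></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

b_parent = soup.b.parent.name                # parent of a tag
text_parent = soup.title.string.parent.name  # parent of a string
top_parent = soup.html.parent.name           # the BeautifulSoup object itself
root_parent = soup.parent                    # nothing above the document

print(b_parent)
print(text_parent)
print(top_parent)
print(root_parent)
```

<p>The BeautifulSoup object reports its name as [document], and its own .parent is None, which is a convenient stop condition when walking upwards by hand.</p>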
<p>&nbsp;</p>
<h3>(6) All ancestors</h3>
<blockquote>
<p>Key point: the .parents attribute</p>
</blockquote>
<p>An element's&nbsp;<tt class="docutils literal">.parents</tt>&nbsp;attribute walks up the tree and yields every ancestor of the element, for example:</p>
<div id="crayon-5dc820a2719c7898000615" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719c7898000615-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c7898000615-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c7898000615-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719c7898000615-1" class="crayon-line"><span class="crayon-v">content<span class="crayon-h"> <span class="crayon-o">=<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">head<span class="crayon-sy">.<span class="crayon-v">title<span class="crayon-sy">.<span class="crayon-t">string</span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719c7898000615-2" class="crayon-line crayon-striped-line"><span class="crayon-st">for</span><span class="crayon-h"> </span><span class="crayon-r">parent</span><span class="crayon-h"> </span><span class="crayon-st">in</span><span class="crayon-h"> </span><span class="crayon-v">content</span><span class="crayon-sy">.</span><span class="crayon-v">parents</span><span class="crayon-o">:</span></div>
<div id="crayon-5dc820a2719c7898000615-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-sy">.</span><span class="crayon-v">name</span><span class="crayon-sy">)</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Output:</p>
<div id="crayon-5dc820a2719c8745935317" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719c8745935317-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c8745935317-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719c8745935317-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719c8745935317-4">4</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719c8745935317-1" class="crayon-line"><span class="crayon-e">title</span></div>
<div id="crayon-5dc820a2719c8745935317-2" class="crayon-line crayon-striped-line"><span class="crayon-e">head</span></div>
<div id="crayon-5dc820a2719c8745935317-3" class="crayon-line"><span class="crayon-i">html</span></div>
<div id="crayon-5dc820a2719c8745935317-4" class="crayon-line crayon-striped-line"><span class="crayon-sy">[<span class="crayon-v">document<span class="crayon-sy">]</span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
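<p>One practical use of .parents is building a readable path to an element. The following is just a sketch under a few assumptions: Python 3, bs4 installed, the "html.parser" backend, and a made-up document. The names ancestors and path are hypothetical.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical mini document
html = "<html><body><div><p><a href='#'>link</a></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Nearest ancestor first, the document object last
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)

# Drop the trailing [document] entry and reverse to read top-down
path = " > ".join(reversed(ancestors[:-1]))
print(path)
```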
<p>&nbsp;</p>
<h3>(7) Sibling nodes</h3>
<blockquote>
<p>Key point: the .next_sibling and .previous_sibling attributes</p>
</blockquote>
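<p>Sibling nodes are nodes at the same level as the current node, that is, children of the same parent. .next_sibling returns the next sibling and .previous_sibling returns the previous one; if there is no such node, the value is None.</p>
<p>Note: in real documents, a tag's .next_sibling and .previous_sibling are often strings or whitespace, because the whitespace and line breaks between tags count as nodes too. The result may therefore be blank or just a newline.</p>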
<div id="crayon-5dc820a2719ca542943950" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719ca542943950-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ca542943950-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ca542943950-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ca542943950-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ca542943950-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ca542943950-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ca542943950-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ca542943950-8">8</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ca542943950-9">9</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ca542943950-10">10</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ca542943950-11">11</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719ca542943950-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">p</span><span class="crayon-sy">.</span><span class="crayon-v">next_sibling</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719ca542943950-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># (prints whitespace: the node here is just a line break)</span></div>
<div id="crayon-5dc820a2719ca542943950-3" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">p</span><span class="crayon-sy">.</span><span class="crayon-v">previous_sibling</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719ca542943950-4" class="crayon-line crayon-striped-line"><span class="crayon-p"># None if there is no previous sibling; if a line break precedes the tag, that whitespace string is returned instead</span></div>
<div id="crayon-5dc820a2719ca542943950-5" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">p</span><span class="crayon-sy">.</span><span class="crayon-v">next_sibling</span><span class="crayon-sy">.</span><span class="crayon-v">next_sibling</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719ca542943950-6" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were</span></div>
<div id="crayon-5dc820a2719ca542943950-7" class="crayon-line"><span class="crayon-p">#&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;,</span></div>
<div id="crayon-5dc820a2719ca542943950-8" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt; and</span></div>
<div id="crayon-5dc820a2719ca542943950-9" class="crayon-line"><span class="crayon-p">#&lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;;</span></div>
<div id="crayon-5dc820a2719ca542943950-10" class="crayon-line crayon-striped-line"><span class="crayon-p">#and they lived at the bottom of a well.&lt;/p&gt;</span></div>
<div id="crayon-5dc820a2719ca542943950-11" class="crayon-line"><span class="crayon-p"># The sibling after the whitespace node is the next tag we can actually see</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
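<p>Because of those whitespace nodes, chaining .next_sibling twice is fragile. A possibly more robust approach is find_next_sibling() and find_previous_sibling(), which skip to the next matching tag. The sketch below assumes Python 3, bs4 installed, the "html.parser" backend, and a made-up fragment with hypothetical ids.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the line breaks between tags become string nodes
html = "<body>\n<p id='first'>one</p>\n<p id='second'>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p", id="first")

raw_next = first.next_sibling                 # the line break after </p>
tag_next = first.find_next_sibling("p")       # skips strings, returns the next <p>
raw_prev = first.previous_sibling             # the line break after <body>
tag_prev = first.find_previous_sibling("p")   # no earlier <p>, so None

print(repr(raw_next))
print(tag_next)
print(repr(raw_prev))
print(tag_prev)
```

<p>Here .previous_sibling is a newline rather than None, even though the tag looks like the first child of body.</p>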
<p>&nbsp;</p>
<h3>(8) All siblings</h3>
<blockquote>
<p>Key point: the .next_siblings and .previous_siblings attributes</p>
</blockquote>
<p>The&nbsp;<tt class="docutils literal">.next_siblings</tt>&nbsp;and&nbsp;<tt class="docutils literal">.previous_siblings</tt>&nbsp;attributes let you iterate over the siblings that follow or precede the current node:</p>
<div id="crayon-5dc820a2719cc746530743" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719cc746530743-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719cc746530743-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719cc746530743-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719cc746530743-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719cc746530743-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719cc746530743-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719cc746530743-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719cc746530743-8">8</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719cc746530743-1" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-e">sibling <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">a<span class="crayon-sy">.<span class="crayon-v">next_siblings<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719cc746530743-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-e">repr<span class="crayon-sy">(<span class="crayon-v">sibling<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719cc746530743-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u',\n'</span></span></div>
<div id="crayon-5dc820a2719cc746530743-4" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;</span></span></div>
<div id="crayon-5dc820a2719cc746530743-5" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u' and\n'</span></span></div>
<div id="crayon-5dc820a2719cc746530743-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;</span></span></div>
<div id="crayon-5dc820a2719cc746530743-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># u'; and they lived at the bottom of a well.'</span></span></div>
<div id="crayon-5dc820a2719cc746530743-8" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-p"># (the loop ends here; the iterator never yields None)</span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
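<p>The sibling iterators mix tags and strings, so a common follow-up is keeping only the tags. This is a small sketch, assuming Python 3, bs4 installed, the "html.parser" backend, and a made-up paragraph modelled loosely on the sample document.</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph with three links separated by text
html = ('<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n'
        '<a id="link3">Tillie</a>;\nthe end.</p>')
soup = BeautifulSoup(html, "html.parser")

# Everything after the first <a>: strings and tags alike
following = list(soup.a.next_siblings)
print(len(following))

# Keep only real tags by checking the node type
tag_ids = [s["id"] for s in soup.a.next_siblings if isinstance(s, Tag)]
print(tag_ids)

# previous_siblings walks backwards, nearest sibling first
last_link = soup.find(id="link3")
earlier_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]
print(earlier_ids)
```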
<p>&nbsp;</p>
<h3>(9) Next and previous elements</h3>
<blockquote>
<p>Key point: the .next_element and .previous_element attributes</p>
</blockquote>
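<p>Unlike .next_sibling and .previous_sibling, these attributes are not limited to siblings. They move through all nodes in the order they were parsed, regardless of nesting level.</p>
<p>For example, take this head node:</p>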
<div id="crayon-5dc820a2719ce202855665" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719ce202855665-1">1</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719ce202855665-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-v">head<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-v">title<span class="crayon-o">&gt;<span class="crayon-e">The <span class="crayon-i">Dormouse'<span class="crayon-i">s<span class="crayon-h"> <span class="crayon-v">story<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">title<span class="crayon-o">&gt;<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">head<span class="crayon-o">&gt;</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>The next element after head is the title tag, because .next_element ignores nesting level:</p>
<div id="crayon-5dc820a2719cf366679811" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719cf366679811-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719cf366679811-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719cf366679811-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-v">head</span><span class="crayon-sy">.</span><span class="crayon-v">next_element</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719cf366679811-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#&lt;title&gt;The Dormouse's story&lt;/title&gt;</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
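<p>Putting .next_sibling and .next_element next to each other makes the difference easier to see. The sketch below assumes Python 3, bs4 installed, the "html.parser" backend, and a made-up mini document; the variable names are only illustrative.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical mini document with no whitespace between tags
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p>text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

head = soup.head
sibling = head.next_sibling    # same level: the <body> tag
element = head.next_element    # next thing parsed: the <title> inside head
deeper = element.next_element  # then the string inside <title>

print(sibling.name)
print(element.name)
print(repr(deeper))
```

<p>So .next_sibling stays on the same level, while .next_element steps into children first, following the order in which the parser met each node.</p>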
<p>&nbsp;</p>
<h3>(10) All next and previous elements</h3>
<blockquote>
<p>Key point: the .next_elements and .previous_elements attributes</p>
</blockquote>
<p>The&nbsp;<tt class="docutils literal">.next_elements</tt>&nbsp;and&nbsp;<tt class="docutils literal">.previous_elements</tt>&nbsp;iterators let you move forward or backward through the parsed content of the document, as if the document were being parsed again:</p>
<div id="crayon-5dc820a2719d1481320123" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719d1481320123-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d1481320123-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d1481320123-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d1481320123-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d1481320123-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d1481320123-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d1481320123-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d1481320123-8">8</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d1481320123-9">9</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719d1481320123-1" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-e">element <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">last_a_tag<span class="crayon-sy">.<span class="crayon-v">next_elements<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719d1481320123-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-e">repr<span class="crayon-sy">(<span class="crayon-v">element<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719d1481320123-3" class="crayon-line"><span class="crayon-p"># u'Tillie'</span></div>
<div id="crayon-5dc820a2719d1481320123-4" class="crayon-line crayon-striped-line"><span class="crayon-p"># u';\nand they lived at the bottom of a well.'</span></div>
<div id="crayon-5dc820a2719d1481320123-5" class="crayon-line"><span class="crayon-p"># u'\n\n'</span></div>
<div id="crayon-5dc820a2719d1481320123-6" class="crayon-line crayon-striped-line"><span class="crayon-p"># &lt;p class="story"&gt;...&lt;/p&gt;</span></div>
<div id="crayon-5dc820a2719d1481320123-7" class="crayon-line"><span class="crayon-p"># u'...'</span></div>
<div id="crayon-5dc820a2719d1481320123-8" class="crayon-line crayon-striped-line"><span class="crayon-p"># u'\n'</span></div>
<div id="crayon-5dc820a2719d1481320123-9" class="crayon-line"><span class="crayon-p"># None</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
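<p>The two iterators can be compared side by side. The following is a minimal sketch, assuming <tt class="docutils literal">bs4</tt> is installed and using Python's built-in <tt class="docutils literal">html.parser</tt>; the markup is a shortened stand-in for the sample document, and the variable names are illustrative only.</p>

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened stand-in for the tutorial's sample document (an assumption).
html = ('<p class="story">Names: <a id="link2">Lacie</a> and '
        '<a id="link3">Tillie</a>; the end.</p><p>...</p>')
soup = BeautifulSoup(html, "html.parser")
last_a_tag = soup.find("a", id="link3")

# Everything parsed after the start of the last <a> tag, in document order.
following = list(last_a_tag.next_elements)
following_text = [str(el) for el in following if isinstance(el, NavigableString)]
following_tags = [el.name for el in following if isinstance(el, Tag)]

# Everything parsed before it, walking backward.
preceding = list(last_a_tag.previous_elements)

print(following_text)
print(following_tags)
print(repr(preceding[0]))
```

<p>Both iterators yield tags and strings alike, so filter with <tt class="docutils literal">isinstance</tt> when only one kind is wanted.</p>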
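<p>That covers the basics of navigating the tree.</p>
<h2>7. Searching the tree</h2>
<h3>(1) find_all(name, attrs, recursive, text, **kwargs)</h3>
<p><tt class="docutils literal">find_all()</tt>&nbsp;searches all descendants of the current tag and returns those that match the filter conditions.</p>
<p>1) The name argument</p>
<p>The&nbsp;<tt class="docutils literal">name</tt>&nbsp;argument finds every tag whose name matches&nbsp;<tt class="docutils literal">name</tt>; text strings are ignored automatically.</p>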
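<p>A. Passing a string</p>
<p>The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tags whose name is exactly that string. The example below finds all &lt;b&gt; tags in the document:</p>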
<div id="crayon-5dc820a2719d3360991483" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719d3360991483-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d3360991483-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719d3360991483-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-s">'b'<span class="crayon-sy">)</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719d3360991483-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;b&gt;The Dormouse's story&lt;/b&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719d5033276556" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719d5033276556-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d5033276556-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719d5033276556-1" class="crayon-line"><span class="crayon-e">print</span><span class="crayon-sy">(</span><span class="crayon-v">soup</span><span class="crayon-sy">.</span><span class="crayon-e">find_all</span><span class="crayon-sy">(</span><span class="crayon-s">'a'</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div>
<div id="crayon-5dc820a2719d5033276556-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;, &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;, &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
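<p>B. Passing a regular expression</p>
<p>If you pass a regular expression object, Beautiful Soup filters tag names with the pattern's&nbsp;<tt class="docutils literal">search()</tt>&nbsp;method (older versions of the documentation say&nbsp;<tt class="docutils literal">match()</tt>). The example below finds every tag whose name starts with "b", so both &lt;body&gt; and &lt;b&gt; are found:</p>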
<div id="crayon-5dc820a2719d6448249012" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719d6448249012-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d6448249012-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d6448249012-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d6448249012-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d6448249012-5">5</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719d6448249012-1" class="crayon-line"><span class="crayon-e">import <span class="crayon-e">re</span></span></div>
<div id="crayon-5dc820a2719d6448249012-2" class="crayon-line crayon-striped-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-e">tag <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">re<span class="crayon-sy">.<span class="crayon-e">compile<span class="crayon-sy">(<span class="crayon-s">"^b"<span class="crayon-sy">)<span class="crayon-sy">)<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719d6448249012-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">tag<span class="crayon-sy">.<span class="crayon-v">name<span class="crayon-sy">)</span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719d6448249012-4" class="crayon-line crayon-striped-line"><span class="crayon-p"># body</span></div>
<div id="crayon-5dc820a2719d6448249012-5" class="crayon-line"><span class="crayon-p"># b</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
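<p>Whether the pattern is anchored matters, because an unanchored pattern can match anywhere in the tag name. The sketch below assumes <tt class="docutils literal">bs4</tt> is installed and uses the built-in <tt class="docutils literal">html.parser</tt>; the markup is a reduced, hypothetical version of the sample document.</p>

```python
import re
from bs4 import BeautifulSoup

# Reduced stand-in for the sample document (an assumption).
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p><b>bold</b></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Anchored: only names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any name that contains "t" anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# The same distinction in plain re: match() anchors at the start, search() does not.
pattern = re.compile("t")
anchored_hit = pattern.match("html")
anywhere_hit = pattern.search("html")

print(starts_with_b, contains_t)
```

<p>Add <tt class="docutils literal">^</tt> or <tt class="docutils literal">$</tt> explicitly when you mean "starts with" or "ends with".</p>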
<p>C. Passing a list</p>
<p>If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all &lt;a&gt; and &lt;b&gt; tags in the document:</p>
<div id="crayon-5dc820a2719d8777572648" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719d8777572648-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d8777572648-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d8777572648-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719d8777572648-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719d8777572648-5">5</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719d8777572648-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-sy">[<span class="crayon-s">"a"<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-s">"b"<span class="crayon-sy">]<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719d8777572648-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;b&gt;The Dormouse's story&lt;/b&gt;,</span></div>
<div id="crayon-5dc820a2719d8777572648-3" class="crayon-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span></div>
<div id="crayon-5dc820a2719d8777572648-4" class="crayon-line crayon-striped-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span></div>
<div id="crayon-5dc820a2719d8777572648-5" class="crayon-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
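<p>Results come back in document order, not in the order of the list items. A small sketch, assuming <tt class="docutils literal">bs4</tt> is installed; the markup is a hypothetical fragment, not the full sample document.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <b>, one <i>, two <a> tags.
html = ('<p class="title"><b>The Dormouse\'s story</b></p>'
        '<p><a id="link1">Elsie</a> <i>and</i> <a id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

# "a" is listed first, but <b> appears first in the document.
names = [tag.name for tag in soup.find_all(["a", "b"])]
print(names)
```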
<p>D. Passing True</p>
<p><tt class="docutils literal">True</tt>&nbsp;matches any value. The code below finds every tag in the document but returns none of the text string nodes:</p>
<div id="crayon-5dc820a2719da853587916" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719da853587916-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719da853587916-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719da853587916-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719da853587916-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719da853587916-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719da853587916-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719da853587916-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719da853587916-8">8</div>
<div class="crayon-num" data-line="crayon-5dc820a2719da853587916-9">9</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719da853587916-10">10</div>
<div class="crayon-num" data-line="crayon-5dc820a2719da853587916-11">11</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719da853587916-1" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-e">tag <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-t">True<span class="crayon-sy">)<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719da853587916-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">tag<span class="crayon-sy">.<span class="crayon-v">name<span class="crayon-sy">)</span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719da853587916-3" class="crayon-line"><span class="crayon-p"># html</span></div>
<div id="crayon-5dc820a2719da853587916-4" class="crayon-line crayon-striped-line"><span class="crayon-p"># head</span></div>
<div id="crayon-5dc820a2719da853587916-5" class="crayon-line"><span class="crayon-p"># title</span></div>
<div id="crayon-5dc820a2719da853587916-6" class="crayon-line crayon-striped-line"><span class="crayon-p"># body</span></div>
<div id="crayon-5dc820a2719da853587916-7" class="crayon-line"><span class="crayon-p"># p</span></div>
<div id="crayon-5dc820a2719da853587916-8" class="crayon-line crayon-striped-line"><span class="crayon-p"># b</span></div>
<div id="crayon-5dc820a2719da853587916-9" class="crayon-line"><span class="crayon-p"># p</span></div>
<div id="crayon-5dc820a2719da853587916-10" class="crayon-line crayon-striped-line"><span class="crayon-p"># a</span></div>
<div id="crayon-5dc820a2719da853587916-11" class="crayon-line"><span class="crayon-p"># a</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
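<p>To confirm that only tags come back, the result can be checked against the <tt class="docutils literal">Tag</tt> class. This is a sketch under the assumption that <tt class="docutils literal">bs4</tt> is installed; the markup is a reduced stand-in for the sample document.</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Reduced stand-in for the sample document (an assumption).
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p><b>bold</b> text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

found = soup.find_all(True)
tag_names = [tag.name for tag in found]

# The text nodes ("bold", " text", the title string) are not in the result.
only_tags = all(isinstance(item, Tag) for item in found)
print(tag_names, only_tags)
```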
<p>E. Passing a function</p>
<p>If none of the other filters fits, you can define a function that takes an element as its only argument. It should return&nbsp;<tt class="docutils literal">True</tt>&nbsp;if the element matches, and&nbsp;<tt class="docutils literal">False</tt>&nbsp;otherwise.</p>
<p>The function below checks the current element and returns&nbsp;<tt class="docutils literal">True</tt>&nbsp;if it has a&nbsp;<tt class="docutils literal">class</tt>&nbsp;attribute but no&nbsp;<tt class="docutils literal">id</tt>&nbsp;attribute:</p>
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719dc038760375" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719dc038760375-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719dc038760375-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719dc038760375-1" class="crayon-line"><span class="crayon-e">def <span class="crayon-e">has_class_but_no_id<span class="crayon-sy">(<span class="crayon-v">tag<span class="crayon-sy">)<span class="crayon-o">:</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719dc038760375-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-st">return<span class="crayon-h"> <span class="crayon-v">tag<span class="crayon-sy">.<span class="crayon-e">has_attr<span class="crayon-sy">(<span class="crayon-s">'class'<span class="crayon-sy">)<span class="crayon-h"> <span class="crayon-st">and<span class="crayon-h"> <span class="crayon-st">not<span class="crayon-h"> <span class="crayon-v">tag<span class="crayon-sy">.<span class="crayon-e">has_attr<span class="crayon-sy">(<span class="crayon-s">'id'<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>Passing this function to&nbsp;<tt class="docutils literal">find_all()</tt>&nbsp;returns all the &lt;p&gt; tags (the &lt;a&gt; tags are excluded because they define both class and id):</p>
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719dd892466546" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719dd892466546-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719dd892466546-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719dd892466546-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719dd892466546-4">4</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719dd892466546-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">has_class_but_no_id<span class="crayon-sy">)</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719dd892466546-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;,</span></div>
<div id="crayon-5dc820a2719dd892466546-3" class="crayon-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;p class="story"&gt;Once upon a time there were...&lt;/p&gt;,</span></div>
<div id="crayon-5dc820a2719dd892466546-4" class="crayon-line crayon-striped-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;p class="story"&gt;...&lt;/p&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
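<p>Put together, the function filter can be run end to end as follows. This is a minimal sketch, assuming <tt class="docutils literal">bs4</tt> is installed; the markup is a hypothetical fragment that includes one &lt;p&gt; without a class so the exclusion is visible.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two <p> with class, one <a> with class and id, one bare <p>.
html = ('<p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story"><a class="sister" id="link1">Elsie</a></p>'
        '<p>no class here</p>')

def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(has_class_but_no_id)

matched_names = [tag.name for tag in matches]
# class is a multi-valued attribute, so each value is a list.
matched_classes = [tag["class"] for tag in matches]
print(matched_names, matched_classes)
```

<p>The function is called once per tag, so keep it cheap when the document is large.</p>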
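<p>2) Keyword arguments</p>
<blockquote>
<p>Note: any keyword argument that is not one of the search method's built-in argument names is treated as a filter on the tag attribute of that name. If you pass an argument named&nbsp;<tt class="docutils literal"><span class="pre">id</span></tt>, Beautiful Soup filters on each tag's "id" attribute.</p>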
</blockquote>
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719df652273989" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719df652273989-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719df652273989-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719df652273989-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">id<span class="crayon-o">=<span class="crayon-s">'link2'<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719df652273989-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>If you pass an&nbsp;<tt class="docutils literal">href</tt>&nbsp;argument, Beautiful Soup filters on each tag's "href" attribute:</p>
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719e1859108952" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719e1859108952-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e1859108952-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719e1859108952-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">href<span class="crayon-o">=<span class="crayon-v">re<span class="crayon-sy">.<span class="crayon-e">compile<span class="crayon-sy">(<span class="crayon-s">"elsie"<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e1859108952-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>Passing several keyword arguments filters on several attributes of a tag at once:</p>
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719e2412666082" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719e2412666082-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e2412666082-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719e2412666082-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">href<span class="crayon-o">=<span class="crayon-v">re<span class="crayon-sy">.<span class="crayon-e">compile<span class="crayon-sy">(<span class="crayon-s">"elsie"<span class="crayon-sy">)<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-v">id<span class="crayon-o">=<span class="crayon-s">'link1'<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e2412666082-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
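<p>Multiple keyword filters are combined with AND: a tag must satisfy every one of them. A short sketch, assuming <tt class="docutils literal">bs4</tt> is installed; the two links are taken from the sample document, and the variable names are illustrative.</p>

```python
import re
from bs4 import BeautifulSoup

html = ('<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>')
soup = BeautifulSoup(html, "html.parser")

# Both conditions hold for the first link.
both_match = soup.find_all(href=re.compile("elsie"), id="link1")

# No single tag has an "elsie" href together with id="link2".
no_match = soup.find_all(href=re.compile("elsie"), id="link2")

print(both_match, no_match)
```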
<p>To filter by class we cannot write class directly, because class is a reserved word in Python. Append an underscore and use class_ instead:</p>
<div id="crayon-5dc820a2719e4885664454" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719e4885664454-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e4885664454-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719e4885664454-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e4885664454-4">4</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719e4885664454-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-s">"a"<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-v">class_<span class="crayon-o">=<span class="crayon-s">"sister"<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e4885664454-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span></div>
<div id="crayon-5dc820a2719e4885664454-3" class="crayon-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span></div>
<div id="crayon-5dc820a2719e4885664454-4" class="crayon-line crayon-striped-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
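<p>Because class is a multi-valued attribute, <tt class="docutils literal">class_</tt> matches a tag if any one of its classes equals the value. The sketch below assumes <tt class="docutils literal">bs4</tt> is installed; the markup is hypothetical.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <p> with two classes and an <a> with one.
html = ('<p class="body strikeout">text</p>'
        '<a class="sister" id="link1">Elsie</a>')
soup = BeautifulSoup(html, "html.parser")

sisters = soup.find_all("a", class_="sister")

# Matches even though the tag also carries the "body" class.
struck = soup.find_all(class_="strikeout")

# Equivalent spelling that avoids the trailing underscore.
via_attrs = soup.find_all(attrs={"class": "sister"})

print(sisters, struck, via_attrs)
```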
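<p>Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes, because they are not valid Python identifiers:</p>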
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719e5879792708" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719e5879792708-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e5879792708-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719e5879792708-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719e5879792708-1" class="crayon-line"><span class="crayon-v">data_soup<span class="crayon-h"> <span class="crayon-o">=<span class="crayon-h"> <span class="crayon-e">BeautifulSoup<span class="crayon-sy">(<span class="crayon-s">'&lt;div data-foo="value"&gt;foo!&lt;/div&gt;'<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e5879792708-2" class="crayon-line crayon-striped-line"><span class="crayon-v">data_soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">data<span class="crayon-o">-<span class="crayon-v">foo<span class="crayon-o">=<span class="crayon-s">"value"<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e5879792708-3" class="crayon-line"><span class="crayon-p"># SyntaxError: keyword can't be an expression</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>You can still search on such attributes by passing a dictionary to the&nbsp;<tt class="docutils literal">attrs</tt>&nbsp;argument of&nbsp;<tt class="docutils literal">find_all()</tt>:</p>
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719e7949035940" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719e7949035940-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e7949035940-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719e7949035940-1" class="crayon-line"><span class="crayon-v">data_soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">attrs<span class="crayon-o">=<span class="crayon-sy">{<span class="crayon-s">"data-foo"<span class="crayon-o">:<span class="crayon-h"> <span class="crayon-s">"value"<span class="crayon-sy">}<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e7949035940-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;div data-foo="value"&gt;foo!&lt;/div&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
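<p>The values in the <tt class="docutils literal">attrs</tt> dictionary accept the same kinds of filters as keyword arguments. The following sketch assumes <tt class="docutils literal">bs4</tt> is installed; the second &lt;div&gt; is a hypothetical addition so the two searches give different results.</p>

```python
from bs4 import BeautifulSoup

data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><div data-foo="other">bar</div>',
    "html.parser",
)

# Exact value match on an attribute whose name is not a valid identifier.
exact = data_soup.find_all(attrs={"data-foo": "value"})

# True matches any tag that has the attribute, whatever its value.
any_value = data_soup.find_all(attrs={"data-foo": True})

print(exact, any_value)
```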
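<p>3) The text argument</p>
<p>The&nbsp;<tt class="docutils literal">text</tt>&nbsp;argument searches the string content of the document instead of its tags. Like&nbsp;<tt class="docutils literal">name</tt>, the&nbsp;<tt class="docutils literal">text</tt>&nbsp;argument accepts a string, a regular expression, a list, or True. (Beautiful Soup 4.4.0 and later call this argument <tt class="docutils literal">string</tt>.)</p>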
<div id="crayon-5dc820a2719e9244854404" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719e9244854404-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e9244854404-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719e9244854404-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e9244854404-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719e9244854404-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e9244854404-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719e9244854404-7">7</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719e9244854404-8">8</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719e9244854404-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">text<span class="crayon-o">=<span class="crayon-s">"Elsie"<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e9244854404-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [u'Elsie']</span></div>
<div id="crayon-5dc820a2719e9244854404-3" class="crayon-line">&nbsp;</div>
<div id="crayon-5dc820a2719e9244854404-4" class="crayon-line crayon-striped-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">text<span class="crayon-o">=<span class="crayon-sy">[<span class="crayon-s">"Tillie"<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-s">"Elsie"<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-s">"Lacie"<span class="crayon-sy">]<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e9244854404-5" class="crayon-line"><span class="crayon-p"># [u'Elsie', u'Lacie', u'Tillie']</span></div>
<div id="crayon-5dc820a2719e9244854404-6" class="crayon-line crayon-striped-line">&nbsp;</div>
<div id="crayon-5dc820a2719e9244854404-7" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-v">text<span class="crayon-o">=<span class="crayon-v">re<span class="crayon-sy">.<span class="crayon-e">compile<span class="crayon-sy">(<span class="crayon-s">"Dormouse"<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719e9244854404-8" class="crayon-line crayon-striped-line"><span class="crayon-p"># [u"The Dormouse's story", u"The Dormouse's story"]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
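<p>The string search can also be combined with a tag name, in which case tags are returned rather than strings. This is a sketch that assumes Beautiful Soup 4.4.0 or later, where the argument is spelled <tt class="docutils literal">string</tt>; the markup is a reduced stand-in for the sample document.</p>

```python
import re
from bs4 import BeautifulSoup

# Reduced stand-in for the sample document (an assumption).
html = ("<p><b>The Dormouse's story</b></p>"
        "<a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a>")
soup = BeautifulSoup(html, "html.parser")

# String-only searches return the matching text nodes in document order.
exact = [str(s) for s in soup.find_all(string="Elsie")]
several = [str(s) for s in soup.find_all(string=["Tillie", "Elsie", "Lacie"])]
by_pattern = [str(s) for s in soup.find_all(string=re.compile("Dormouse"))]

# With a tag name, the tags whose .string matches are returned instead.
links = soup.find_all("a", string="Elsie")

print(exact, several, by_pattern, links)
```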
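<p>4) The limit argument</p>
<p><tt class="docutils literal">find_all()</tt>&nbsp;returns every result that matches, which can be slow on a large document tree. If you do not need all of the results, use the&nbsp;<tt class="docutils literal">limit</tt>&nbsp;argument to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches&nbsp;<tt class="docutils literal">limit</tt>, the search stops and the results are returned.</p>
<p>Three tags in the document match the search, but only two are returned because we limited the number of results:</p>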
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719eb837128745" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719eb837128745-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719eb837128745-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719eb837128745-3">3</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719eb837128745-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-s">"a"<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-v">limit<span class="crayon-o">=<span class="crayon-cn">2<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719eb837128745-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span></div>
<div id="crayon-5dc820a2719eb837128745-3" class="crayon-line"><span class="crayon-p">#&nbsp;&nbsp;&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>5) The recursive parameter</p>
<p>When you call&nbsp;<tt class="docutils literal">find_all()</tt>&nbsp;on a tag, Beautiful Soup searches all of that tag's descendants. To search only the tag's direct children, pass&nbsp;<tt class="docutils literal">recursive=False</tt>.</p>
<p>A simple document:</p>
<div class="highlight-python">
<div id="crayon-5dc820a2719ed238662669" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719ed238662669-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ed238662669-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ed238662669-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ed238662669-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ed238662669-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ed238662669-6">6</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ed238662669-7">7</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719ed238662669-1" class="crayon-line"><span class="crayon-o">&lt;<span class="crayon-v">html<span class="crayon-o">&gt;</span></span></span></div>
<div id="crayon-5dc820a2719ed238662669-2" class="crayon-line crayon-striped-line"><span class="crayon-h"> <span class="crayon-o">&lt;<span class="crayon-v">head<span class="crayon-o">&gt;</span></span></span></span></div>
<div id="crayon-5dc820a2719ed238662669-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;<span class="crayon-o">&lt;<span class="crayon-v">title<span class="crayon-o">&gt;</span></span></span></span></div>
<div id="crayon-5dc820a2719ed238662669-4" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp; <span class="crayon-e">The <span class="crayon-i">Dormouse'<span class="crayon-i">s<span class="crayon-h"> <span class="crayon-v">story</span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719ed238662669-5" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;<span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">title<span class="crayon-o">&gt;</span></span></span></span></span></div>
<div id="crayon-5dc820a2719ed238662669-6" class="crayon-line crayon-striped-line"><span class="crayon-h"> <span class="crayon-o">&lt;<span class="crayon-o">/<span class="crayon-v">head<span class="crayon-o">&gt;</span></span></span></span></span></div>
<div id="crayon-5dc820a2719ed238662669-7" class="crayon-line"><span class="crayon-sy">.<span class="crayon-sy">.<span class="crayon-sy">.</span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
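<p>Search results with and without the&nbsp;<tt class="docutils literal">recursive</tt>&nbsp;parameter. The &lt;title&gt; tag sits inside &lt;head&gt;, so it is a descendant of &lt;html&gt; but not a direct child, and the second search finds nothing:</p>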
<div class="highlight-python">
<div class="highlight">
<div id="crayon-5dc820a2719ee141418257" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719ee141418257-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ee141418257-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ee141418257-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719ee141418257-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a2719ee141418257-5">5</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719ee141418257-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">html<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-s">"title"<span class="crayon-sy">)</span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719ee141418257-2" class="crayon-line crayon-striped-line"><span class="crayon-p"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span></div>
<div id="crayon-5dc820a2719ee141418257-3" class="crayon-line">&nbsp;</div>
<div id="crayon-5dc820a2719ee141418257-4" class="crayon-line crayon-striped-line"><span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-v">html<span class="crayon-sy">.<span class="crayon-e">find_all<span class="crayon-sy">(<span class="crayon-s">"title"<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-v">recursive<span class="crayon-o">=<span class="crayon-t">False<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719ee141418257-5" class="crayon-line"><span class="crayon-p"># []</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
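<p>To make the difference easier to try out, here is a minimal self-contained sketch. It assumes Python 3 and the built-in&nbsp;<tt class="docutils literal">html.parser</tt>&nbsp;(the article itself uses Python 2.7 and lxml), and the variable names are only illustrative.</p>

```python
from bs4 import BeautifulSoup

# One-line markup so that no whitespace text nodes get in the way.
html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default (recursive=True): every descendant of <html> is searched.
deep = soup.html.find_all("title")

# recursive=False: only the direct children of <html> are considered.
shallow = soup.html.find_all("title", recursive=False)
direct_heads = soup.html.find_all("head", recursive=False)

print(len(deep), len(shallow), len(direct_heads))
```

<p>The &lt;head&gt; tag is a direct child of &lt;html&gt;, so it is still found with&nbsp;<tt class="docutils literal">recursive=False</tt>, while the nested &lt;title&gt; is not.</p>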
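<h3>(2) find( name , attrs , recursive , text , **kwargs )</h3>
<p>The only difference from find_all() is the return value:&nbsp;<tt class="docutils literal">find_all()</tt>&nbsp;returns a list of matches (possibly containing just one element), whereas&nbsp;<tt class="docutils literal">find()</tt>&nbsp;returns the first match directly. When nothing matches,&nbsp;<tt class="docutils literal">find_all()</tt>&nbsp;returns an empty list and&nbsp;<tt class="docutils literal">find()</tt>&nbsp;returns None.</p>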
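<p>A short sketch of that difference, assuming Python 3 and the built-in&nbsp;<tt class="docutils literal">html.parser</tt>; the sample markup and variable names are made up for illustration.</p>

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="story"><a id="link1">Elsie</a></p></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

as_list = soup.find_all("title", limit=1)  # a list holding one Tag
as_tag = soup.find("title")                # the Tag itself

no_match_list = soup.find_all("table")     # empty list when nothing matches
no_match_tag = soup.find("table")          # None when nothing matches

print(as_list, as_tag, no_match_list, no_match_tag)
```

<p>Because&nbsp;<tt class="docutils literal">find()</tt>&nbsp;can return None, it is worth checking the result before chaining attribute access onto it.</p>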
<h3>(3) find_parents() and find_parent()</h3>
<p><tt class="docutils literal">find_all()</tt>&nbsp;and&nbsp;<tt class="docutils literal">find()</tt>&nbsp;search only the descendants of the current node (children, grandchildren, and so on).&nbsp;<tt class="docutils literal">find_parents()</tt>&nbsp;and&nbsp;<tt class="docutils literal">find_parent()</tt>&nbsp;search the current node's ancestors instead. They accept the same search arguments as the ordinary tag searches.</p>
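<p>As a rough illustration (Python 3 and&nbsp;<tt class="docutils literal">html.parser</tt>&nbsp;assumed; the markup and the&nbsp;<tt class="docutils literal">outer</tt>&nbsp;id are invented for this sketch), searching upward from a link might look like this:</p>

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><body><div id="outer"><p class="story">'
    '<a id="link1" href="http://example.com/elsie">Elsie</a>'
    "</p></div></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find("a", id="link1")

# Nearest ancestor that is a <p> tag.
nearest_p = link.find_parent("p")

# All ancestors that are <p> or <div>, nearest first.
ancestor_names = [tag.name for tag in link.find_parents(["p", "div"])]

# No <table> ancestor exists, so find_parent() gives None.
no_table = link.find_parent("table")

print(nearest_p["class"], ancestor_names, no_table)
```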
<h3>(4) find_next_siblings() and find_next_sibling()</h3>
<p>These two methods use the .next_siblings attribute to iterate over the sibling nodes that come after the current tag.&nbsp;<tt class="docutils literal">find_next_siblings()</tt>&nbsp;returns all following siblings that match, while&nbsp;<tt class="docutils literal">find_next_sibling()</tt>&nbsp;returns only the first following sibling that matches.</p>
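<p>A small sketch under the same assumptions as elsewhere in these additions (Python 3,&nbsp;<tt class="docutils literal">html.parser</tt>, illustrative markup). The text nodes between the links are siblings too, but the&nbsp;<tt class="docutils literal">"a"</tt>&nbsp;filter skips them.</p>

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.find("a", id="link1")

# Every later sibling that is an <a> tag, in document order.
later_ids = [a["id"] for a in first_link.find_next_siblings("a")]

# Only the first later sibling that is an <a> tag.
next_id = first_link.find_next_sibling("a")["id"]

print(later_ids, next_id)
```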
<h3>(5) find_previous_siblings() and find_previous_sibling()</h3>
<p>These two methods use the .previous_siblings attribute to iterate over the sibling nodes that come before the current tag.&nbsp;<tt class="docutils literal">find_previous_siblings()</tt>&nbsp;returns all preceding siblings that match, while&nbsp;<tt class="docutils literal">find_previous_sibling()</tt>&nbsp;returns the first preceding sibling that matches.</p>
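<p>The mirror-image sketch, again assuming Python 3 and&nbsp;<tt class="docutils literal">html.parser</tt>&nbsp;with made-up markup. Note that the results come back nearest first, that is, in reverse document order.</p>

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html_doc, "html.parser")
last_link = soup.find("a", id="link3")

# Every earlier sibling that is an <a> tag, nearest first.
earlier_ids = [a["id"] for a in last_link.find_previous_siblings("a")]

# Only the closest earlier sibling that is an <a> tag.
previous_id = last_link.find_previous_sibling("a")["id"]

print(earlier_ids, previous_id)
```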
<h3>(6) find_all_next() and find_next()</h3>
<p>These two methods use the .next_elements attribute to iterate over the tags and strings that come after the current tag.&nbsp;<tt class="docutils literal">find_all_next()</tt>&nbsp;returns all matching nodes, while&nbsp;<tt class="docutils literal">find_next()</tt>&nbsp;returns the first matching node.</p>
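<p>A hedged sketch of both calls, assuming Python 3,&nbsp;<tt class="docutils literal">html.parser</tt>, and a Beautiful Soup release recent enough to accept&nbsp;<tt class="docutils literal">string=</tt>&nbsp;(the newer spelling of the&nbsp;<tt class="docutils literal">text=</tt>&nbsp;argument used in this article). The markup is illustrative.</p>

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story"><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
    '<p class="end">...</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.find("a", id="link1")

# Every string that is parsed after the opening of the first link,
# starting with the link's own text.
following_strings = [str(s) for s in first_link.find_all_next(string=True)]

# The first <p> tag that starts after the first link.
next_paragraph = first_link.find_next("p")

print(following_strings, next_paragraph["class"])
```

<p>Unlike the sibling methods, these walk the whole remainder of the document, so the result crosses out of the enclosing &lt;p&gt; tag.</p>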
<h3>(7) find_all_previous() and find_previous()</h3>
<p>These two methods use the .previous_elements attribute to iterate over the tags and strings that come before the current node.&nbsp;<tt class="docutils literal">find_all_previous()</tt>&nbsp;returns all matching nodes, while&nbsp;<tt class="docutils literal">find_previous()</tt>&nbsp;returns the first matching node.</p>
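<p>A matching sketch for the backward direction (Python 3 and&nbsp;<tt class="docutils literal">html.parser</tt>&nbsp;assumed, markup invented for the example). The enclosing &lt;p&gt; counts as a previous element because its opening tag was parsed before the link.</p>

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a id="link1">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.find("a", id="link1")

# Every <p> tag opened before the link, nearest first.
previous_classes = [p["class"] for p in first_link.find_all_previous("p")]

# The closest <title> tag that appears before the link.
previous_title = first_link.find_previous("title")

print(previous_classes, previous_title.get_text())
```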
<blockquote>
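<p>Note: methods (2) through (7) above take the same search arguments as find_all() (the single-result variants do not accept limit) and work on similar principles, so the parameters are not described again here.</p>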
</blockquote>
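<h2>8. CSS selectors</h2>
<p>When writing CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements through&nbsp;soup.select(), which returns a&nbsp;list.</p>
<h3>(1) Selecting by tag name</h3>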
<p>&nbsp;</p>
<div id="crayon-5dc820a2719f1705191092" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719f1705191092-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719f1705191092-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719f1705191092-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'title'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719f1705191092-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719f3246257642" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719f3246257642-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719f3246257642-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719f3246257642-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'a'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719f3246257642-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;, &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;, &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719f5557810688" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719f5557810688-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719f5557810688-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719f5557810688-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'b'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719f5557810688-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;b&gt;The Dormouse's story&lt;/b&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<h3>(2) Selecting by class name</h3>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719f7036139703" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719f7036139703-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719f7036139703-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719f7036139703-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'.sister'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719f7036139703-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;, &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;, &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<h3>(3) Selecting by id</h3>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719f8399087061" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719f8399087061-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719f8399087061-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719f8399087061-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'#link1'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719f8399087061-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
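<h3>(4) Combined selectors</h3>
<p>Combined selectors follow the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space:</p>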
<div id="crayon-5dc820a2719fa439880575" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719fa439880575-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719fa439880575-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719fa439880575-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'p #link1'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719fa439880575-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Selecting direct children:</p>
<div id="crayon-5dc820a2719fb847467173" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719fb847467173-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719fb847467173-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719fb847467173-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">"head &gt; title"<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719fb847467173-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
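<h3>(5) Selecting by attribute</h3>
<p>A selector can also include attributes, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing will match.</p>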
<div id="crayon-5dc820a2719fd064620063" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719fd064620063-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719fd064620063-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719fd064620063-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'a[class="sister"]'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719fd064620063-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;, &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;, &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>&nbsp;</p>
<div id="crayon-5dc820a2719fe905591258" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a2719fe905591258-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a2719fe905591258-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a2719fe905591258-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'a[href="http://example.com/elsie"]'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a2719fe905591258-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
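<p>Attribute selectors can likewise be combined with the forms above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.</p>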
<div id="crayon-5dc820a271a00126008145" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271a00126008145-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271a00126008145-2">2</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271a00126008145-1" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'p a[href="http://example.com/elsie"]'<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a271a00126008145-2" class="crayon-line crayon-striped-line"><span class="crayon-p">#[&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;&lt;!-- Elsie --&gt;&lt;/a&gt;]</span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Every select() call above returns a list, so you can iterate over the results and call get_text()&nbsp;on each element to obtain its text.</p>
<div id="crayon-5dc820a271a02876737777" class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-pc print-yes notranslate crayon-wrapped" data-settings=" minimize scroll-mouseover wrap">
<div class="crayon-toolbar" data-settings=" show">&nbsp;</div>
<div class="crayon-plain-wrap">&nbsp;</div>
<div class="crayon-main">
<table class="crayon-table">
<tbody>
<tr class="crayon-row">
<td class="crayon-nums " data-settings="show">
<div class="crayon-nums-content">
<div class="crayon-num" data-line="crayon-5dc820a271a02876737777-1">1</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271a02876737777-2">2</div>
<div class="crayon-num" data-line="crayon-5dc820a271a02876737777-3">3</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271a02876737777-4">4</div>
<div class="crayon-num" data-line="crayon-5dc820a271a02876737777-5">5</div>
<div class="crayon-num crayon-striped-num" data-line="crayon-5dc820a271a02876737777-6">6</div>
</div>
</td>
<td class="crayon-code">
<div class="crayon-pre">
<div id="crayon-5dc820a271a02876737777-1" class="crayon-line"><span class="crayon-v">soup<span class="crayon-h"> <span class="crayon-o">=<span class="crayon-h"> <span class="crayon-e">BeautifulSoup<span class="crayon-sy">(<span class="crayon-v">html<span class="crayon-sy">,<span class="crayon-h"> <span class="crayon-s">'lxml'<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a271a02876737777-2" class="crayon-line crayon-striped-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-e">type<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'title'<span class="crayon-sy">)<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a271a02876737777-3" class="crayon-line"><span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'title'<span class="crayon-sy">)<span class="crayon-sy">[<span class="crayon-cn">0<span class="crayon-sy">]<span class="crayon-sy">.<span class="crayon-e">get_text<span class="crayon-sy">(<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a271a02876737777-4" class="crayon-line crayon-striped-line">&nbsp;</div>
<div id="crayon-5dc820a271a02876737777-5" class="crayon-line"><span class="crayon-st">for<span class="crayon-h"> <span class="crayon-e">title <span class="crayon-st">in<span class="crayon-h"> <span class="crayon-v">soup<span class="crayon-sy">.<span class="crayon-e">select<span class="crayon-sy">(<span class="crayon-s">'title'<span class="crayon-sy">)<span class="crayon-o">:</span></span></span></span></span></span></span></span></span></span></span></span></div>
<div id="crayon-5dc820a271a02876737777-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;<span class="crayon-e">print<span class="crayon-sy">(<span class="crayon-v">title<span class="crayon-sy">.<span class="crayon-e">get_text<span class="crayon-sy">(<span class="crayon-sy">)<span class="crayon-sy">)</span></span></span></span></span></span></span></span></span></div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
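<p>The snippets in this section rely on a&nbsp;<tt class="docutils literal">soup</tt>&nbsp;object built earlier in the article. For readers who want to try the selectors in isolation, here is a self-contained sketch that assumes Python 3, the built-in&nbsp;<tt class="docutils literal">html.parser</tt>, and a Beautiful Soup install with CSS selector support. The sample markup is a trimmed version of the "three sisters" document, and the&nbsp;<tt class="docutils literal">ids()</tt>&nbsp;helper is hypothetical, added only to keep the output short.</p>

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(selector):
    """Return the id attribute of every element matched by a CSS selector."""
    return [tag.get("id") for tag in soup.select(selector)]


by_tag = ids("a")                                    # tag name
by_class = ids(".sister")                            # class name
by_id = ids("#link1")                                # id
descendant = ids("p #link1")                         # space: descendant
child_titles = [t.get_text() for t in soup.select("head > title")]  # direct child
by_attr = ids('a[href="http://example.com/elsie"]')  # attribute, no space
combined = ids('p a[class="sister"]')                # descendant plus attribute

for name, value in [("tag", by_tag), ("class", by_class), ("id", by_id),
                    ("descendant", descendant), ("child", child_titles),
                    ("attribute", by_attr), ("combined", combined)]:
    print(name, value)
```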
<p>That wraps up select(), another way of searching that achieves much the same thing as find_all(). Convenient, isn't it?</p>
Source: https://www.cnblogs.com/mxk123/p/11832247.html