豆豆利剑 posted on 2025-12-6 18:25:00

Analyzing the Markdown document format, then using Python to break .md files into structured pieces

<h1>I. Markdown documents</h1>
<p>A Markdown document is essentially a tree of block-level elements, with inline-level structure inside each block.</p>
<p>Block-level elements (structure):</p>
<ul>
<li data-start="2845" data-end="2886">
<p data-start="2847" data-end="2886">heading_open → inline → heading_close</p>
</li>
<li data-start="2887" data-end="2932">
<p data-start="2889" data-end="2932">paragraph_open → inline → paragraph_close</p>
</li>
<li data-start="2933" data-end="2990">
<p data-start="2935" data-end="2990">bullet_list_open → list_item_open → paragraph_open → inline → paragraph_close → list_item_close → bullet_list_close</p>
</li>
<li data-start="2991" data-end="3016">
<p data-start="2993" data-end="3016">blockquote_open → ...</p>
</li>
<li data-start="3017" data-end="3031">
<p data-start="3019" data-end="3031">fence (fenced code block; a single token with no open/close pair)</p>
</li>
</ul>
<p data-start="3061" data-end="3084">Inline-level elements (they appear inside a block's text):</p>
<ul data-start="3086" data-end="3184">
<li data-start="3086" data-end="3092">
<p data-start="3088" data-end="3092">text</p>
</li>
<li data-start="3093" data-end="3100">
<p data-start="3095" data-end="3100">image</p>
</li>
<li data-start="3101" data-end="3134">
<p data-start="3103" data-end="3134">link_open → text → link_close</p>
</li>
<li data-start="3135" data-end="3163">
<p data-start="3137" data-end="3163">strong_open / strong_close</p>
</li>
<li data-start="3164" data-end="3184">
<p data-start="3166" data-end="3184">em_open / em_close</p>
</li>
</ul>
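<p>As a rough illustration of the inline level, the sketch below parses a single paragraph and lists the child token types of its inline token. The sample sentence and the variable names are made up for this example; it assumes the markdown-it-py package is installed.</p>

```python
from markdown_it import MarkdownIt

md = MarkdownIt()

# One paragraph with bold, italic and a link (sample text invented for this sketch).
source = "plain **bold** and *italic* and [link](https://example.com)"

# A paragraph parses to paragraph_open, inline, paragraph_close.
paragraph_open, inline, paragraph_close = md.parse(source)

# The inline-level tokens are stored on the inline token's children list.
child_types = [child.type for child in inline.children]
print(child_types)
```

<p>Note that the open/close pairs for strong, em and link wrap plain text tokens rather than another inline token.</p>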
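<h2 data-start="3186" data-end="3220">Structural patterns in .md files</h2>
<p data-start="3186" data-end="3220">Container blocks always come as an open/close pair, usually with an inline token between them. Leaf blocks such as fence are a single token.</p>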
<table>
<thead>
<tr><th>Markdown</th><th>Token</th></tr>
</thead>
<tbody>
<tr>
<td><code># Heading</code></td>
<td>heading_open → inline → heading_close</td>
</tr>
<tr>
<td>Paragraph</td>
<td>paragraph_open → inline → paragraph_close</td>
</tr>
<tr>
<td>List item</td>
<td>list_item_open → paragraph_open → inline → paragraph_close → list_item_close</td>
</tr>
</tbody>
</table>
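<p>The pairing pattern can be checked directly. The following is a minimal sketch, assuming markdown-it-py is installed; <code>token_types</code> is a helper name invented here, not part of the library.</p>

```python
from markdown_it import MarkdownIt

md = MarkdownIt()


def token_types(source: str) -> list[str]:
    """Return the type of every top-level token produced for `source`."""
    return [token.type for token in md.parse(source)]


heading_types = token_types("# Title")
paragraph_types = token_types("Some text")
list_types = token_types("- item")
fence_types = token_types("```\ncode\n```")

for name, types in [
    ("heading", heading_types),
    ("paragraph", paragraph_types),
    ("list", list_types),
    ("fence", fence_types),
]:
    print(name, "->", " → ".join(types))
```

<p>The list case shows that a list item wraps its text in a paragraph, and the fence case shows a block that is a single token rather than a pair.</p>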
<p>&nbsp;</p>
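<p>content carries text: inline tokens and their children use it, and so do leaf blocks such as fence (for the code itself). Structural open/close tokens leave it empty.</p>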
<table class="w-fit min-w-(--thread-content-width)" data-start="4021" data-end="4155">
<thead data-start="4021" data-end="4039">
<tr data-start="4021" data-end="4039"><th data-start="4021" data-end="4028" data-col-size="sm">type</th><th data-start="4028" data-end="4039" data-col-size="sm">content</th></tr>
</thead>
<tbody data-start="4060" data-end="4155">
<tr data-start="4060" data-end="4081">
<td data-start="4060" data-end="4075" data-col-size="sm">heading_open</td>
<td data-col-size="sm" data-start="4075" data-end="4081">""</td>
</tr>
<tr data-start="4082" data-end="4109">
<td data-start="4082" data-end="4091" data-col-size="sm">inline</td>
<td data-col-size="sm" data-start="4091" data-end="4109">The block's raw text (Markdown syntax included)</td>
</tr>
<tr data-start="4110" data-end="4125">
<td data-start="4110" data-end="4117" data-col-size="sm">text</td>
<td data-col-size="sm" data-start="4117" data-end="4125">Text content</td>
</tr>
<tr data-start="4126" data-end="4155">
<td data-start="4126" data-end="4134" data-col-size="sm">image</td>
<td data-col-size="sm" data-start="4134" data-end="4155">alt text (the part inside the square brackets of <code>![alt](src)</code>)</td>
</tr>
</tbody>
</table>
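<p>To see these content values side by side, here is a small sketch that parses one heading containing an image. The sample heading and file name are invented for illustration; it assumes markdown-it-py is installed.</p>

```python
from markdown_it import MarkdownIt

md = MarkdownIt()

# A heading that contains plain text and an image (sample invented for this sketch).
tokens = md.parse("### Title with ![logo](img/logo.png)")
heading_open, inline, heading_close = tokens

print(repr(heading_open.content))  # structural token: empty string
print(repr(inline.content))        # raw source text of the heading line

# The text and image tokens are children of the inline token.
for child in inline.children:
    print(child.type, repr(child.content))
```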
<p>&nbsp;</p>
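<h1>II. Python package: markdown-it-py</h1>
<h2>1.&nbsp;markdown-it-py (imported as <code>markdown_it</code>) parses a Markdown document into a <strong data-start="335" data-end="352">flat list of tokens</strong>. Each token has the following attributes:</h2>
<p><span style="font-family: monospace">type</span>: the syntax element type (the key attribute). It determines which Markdown structure the token represents. Common types include:</p>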
<table style="height: 310px; width: 354px" border="0">
<tbody>
<tr>
<td style="text-align: left">Syntax</td>
<td style="text-align: left">type</td>
</tr>
<tr>
<td style="text-align: left"># Heading</td>
<td style="text-align: left"><code data-start="503" data-end="517">heading_open</code>, <code data-start="519" data-end="534">heading_close</code></td>
</tr>
<tr>
<td style="text-align: left">Paragraph</td>
<td style="text-align: left"><code data-start="546" data-end="562">paragraph_open</code>, <code data-start="564" data-end="581">paragraph_close</code></td>
</tr>
<tr>
<td style="text-align: left">Inline content</td>
<td style="text-align: left">inline</td>
</tr>
<tr>
<td style="text-align: left">Image <code data-start="609" data-end="616">![]()</code></td>
<td style="text-align: left">image</td>
</tr>
<tr>
<td style="text-align: left">List <code data-start="676" data-end="684">- item</code></td>
<td style="text-align: left"><code data-start="687" data-end="705">bullet_list_open</code>, <code data-start="707" data-end="723">list_item_open</code>, etc.</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p><code data-start="969" data-end="974">tag</code>: the corresponding HTML tag name (for example h1, p, img)</p>
<table border="0">
<tbody>
<tr>
<td>Type</td>
<td>tag</td>
</tr>
<tr>
<td>heading_open (###)</td>
<td>h3</td>
</tr>
<tr>
<td>paragraph_open</td>
<td>p</td>
</tr>
<tr>
<td>image</td>
<td>img</td>
</tr>
</tbody>
</table>
<p><code data-start="1199" data-end="1208">content</code>: text content (mainly set on inline tokens and their child tokens such as text; fence tokens also use it for the code)</p>
<ul>
<li data-start="1253" data-end="1306">
<p data-start="1255" data-end="1306">For <code data-start="1258" data-end="1266">inline</code> → content is the block's raw source text (for example <code data-start="1288" data-end="1305">"docker images"</code>)</p>
</li>
<li data-start="1307" data-end="1342">
<p data-start="1309" data-end="1342">For <code data-start="1312" data-end="1318">text</code> → content is the plain text (the actual text node)</p>
</li>
<li data-start="1343" data-end="1398">
<p data-start="1345" data-end="1398">For <code data-start="1348" data-end="1355">image</code> → content is the <code data-start="1368" data-end="1373">alt</code> text (for example <code data-start="1380" data-end="1397">"image-2025..."</code>)</p>
</li>
</ul>
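<p><code data-start="1597" data-end="1604">attrs</code>: HTML attributes (an image's src and title live here; the alt entry is filled in only at render time, so after parsing the alt text is found in content)</p>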
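<p>The attributes described above can be inspected on real tokens. This is a hedged sketch with a made-up two-block sample, assuming markdown-it-py is installed; it reads attributes through <code>attrGet</code> so that it does not depend on how attrs is stored internally.</p>

```python
from markdown_it import MarkdownIt

md = MarkdownIt()

# Sample document invented for this sketch: one heading and one image paragraph.
source = '### Title\n\n![Figure 1](images/img1.png "Figure 1 title")'
tokens = md.parse(source)

# Top-level tokens: heading_open, inline, heading_close,
# paragraph_open, inline, paragraph_close.
heading_open = tokens[0]
image = tokens[4].children[0]

print(heading_open.type, heading_open.tag)  # the tag reflects the heading level
print(image.type, image.tag)
print("src:", image.attrGet("src"))
print("title:", image.attrGet("title"))
print("alt text (from content):", image.content)
```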
<h2>2.&nbsp;Markdown → Token mapping</h2>
<p>Suppose we have the following Markdown source:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">## 语法</span>
<span style="color: rgba(0, 0, 0, 1)">
docker images

![image-2025...](docker学习-use-images/image-2025xxxx.png)</span></pre>
</div>
<p>Use the following code to break the document down:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> pathlib <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Path
</span><span style="color: rgba(0, 0, 255, 1)">from</span> markdown_it <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> MarkdownIt

md </span>=<span style="color: rgba(0, 0, 0, 1)"> MarkdownIt()


md_path </span>= Path(r<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">./docker学习.md</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
md_text </span>= md_path.read_text(encoding=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)

tokens </span>=<span style="color: rgba(0, 0, 0, 1)"> md.parse(md_text)

</span><span style="color: rgba(0, 0, 255, 1)">for</span> t <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> tokens:
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(f<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">type: {t.type}, tag: {t.tag}, content: {t.content}, attrs: {t.attrs}</span><span style="color: rgba(128, 0, 0, 1)">"</span>)</pre>
</div>
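<p>Conceptually this gives the following semantic tree. The loop above prints only the top-level tokens; text and image are children of each inline token (<code>t.children</code>):</p>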
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">heading_open(tag h3)
   inline </span>-&gt; text(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">语法</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
heading_close (tag h3)

paragraph_open
   inline </span>-&gt; text(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">docker images</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
paragraph_close

paragraph_open
   inline </span>-&gt; image (alt=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">image-2025...</span><span style="color: rgba(128, 0, 0, 1)">"</span>, src=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">docker学习-use-images/...</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
paragraph_close</span></pre>
</div>
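<p>One way to print such a tree from the flat token list is to track the nesting attribute and descend into children. The sketch below is only an illustration: <code>outline</code> is a hypothetical helper, and the sample text and image path are invented. It assumes markdown-it-py is installed.</p>

```python
from markdown_it import MarkdownIt


def outline(tokens, depth=0):
    """Return one indented line per token, descending into inline children."""
    lines = []
    for token in tokens:
        if token.nesting == -1:  # a *_close token ends the current level
            depth -= 1
        lines.append("  " * depth + token.type)
        if token.nesting == 1:  # a *_open token starts a deeper level
            depth += 1
        if token.children:  # inline (and image) tokens carry child tokens
            lines.extend(outline(token.children, depth + 1))
    return lines


md = MarkdownIt()
sample = "### Syntax\n\ndocker images\n\n![shot](imgs/shot.png)\n"
tree_lines = outline(md.parse(sample))
print("\n".join(tree_lines))
```

<p>The image token has children of its own, because its alt text is parsed as inline content too.</p>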
<p>This structure is well suited to analyzing documents with code.</p>
<h2>3. Some markdown-it-py usage</h2>
<p>A brief example:</p>
<div class="cnblogs_code">
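<pre><span style="color: rgba(0, 0, 255, 1)">from</span> pathlib <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Path
</span><span style="color: rgba(0, 0, 255, 1)">from</span> markdown_it <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> MarkdownIt

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Create the parser (install first with: pip install markdown-it-py)</span>
md =<span style="color: rgba(0, 0, 0, 1)"> MarkdownIt()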

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Render the text to an HTML string</span>
text = <span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
### Heading

This is a paragraph with **bold** and *italic* text.

![Figure 1](images/img1.png "Figure 1 title")
</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">

html </span>=<span style="color: rgba(0, 0, 0, 1)"> md.render(text)
</span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(html)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> Parse the text into a token list; first read the whole .md file into a string</span>
md_path = Path(r<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">./docker学习.md</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
md_text </span>= md_path.read_text(encoding=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">utf-8</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)

tokens </span>=<span style="color: rgba(0, 0, 0, 1)"> md.parse(md_text)

</span><span style="color: rgba(0, 0, 255, 1)">for</span> t <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> tokens:
    </span><span style="color: rgba(0, 0, 255, 1)">print</span>(f<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">type: {t.type}, tag: {t.tag}, content: {t.content}, attrs: {t.attrs}</span><span style="color: rgba(128, 0, 0, 1)">"</span>)</pre>
</div>
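<p>Building on the token list, a structured breakdown could group the document into sections by heading. The sketch below is one possible approach under stated assumptions, not a definitive implementation: <code>split_sections</code> and the dictionary keys are names invented here, the sample text is made up, and only text and image children are collected (code fences and other blocks are ignored).</p>

```python
from markdown_it import MarkdownIt


def split_sections(source: str) -> list[dict]:
    """Group a Markdown document into sections, one per heading.

    sections[0] is a level-0 preamble holding anything before the first heading.
    """
    tokens = MarkdownIt().parse(source)
    sections = [{"level": 0, "title": "", "text": [], "images": []}]
    for i, token in enumerate(tokens):
        if token.type == "heading_open":
            # tag is "h1".."h6"; the heading's text is in the inline token that follows.
            sections.append({
                "level": int(token.tag[1:]),
                "title": tokens[i + 1].content,
                "text": [],
                "images": [],
            })
        elif token.type == "inline" and tokens[i - 1].type != "heading_open":
            current = sections[-1]
            for child in token.children or []:
                if child.type == "text" and child.content:
                    current["text"].append(child.content)
                elif child.type == "image":
                    current["images"].append(child.attrGet("src"))
    return sections


sample = "### Syntax\n\ndocker images\n\n![shot](imgs/shot.png)\n"
sections = split_sections(sample)
for section in sections[1:]:
    print(section["level"], section["title"], section["text"], section["images"])
```

<p>Keeping the result as plain dictionaries makes it easy to dump to JSON or feed into further analysis.</p>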
<p>&nbsp;</p><br><br>
Source: https://www.cnblogs.com/xiaojp65536/p/19316355