哭丧脸的骑士 posted on 2025-5-8 11:50:00

LLM Evaluation Troubleshooting Guide | On LaTeX Formula Parsing

<p>This is the second article in the <strong>LLM Evaluation Troubleshooting Guide</strong> series. The full series:</p>
<ul>
<li>On inference</li>
<li>On <span class="math inline">\(\LaTeX\)</span> formula parsing</li>
<li>On reproducibility</li>
</ul>
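<p>Parsing LaTeX is hard. The problem comes up often when evaluating models whose output is <span class="math inline">\(\LaTeX\)</span>, for example on Hugging Face's math evaluation benchmark.</p>
<p>This benchmark uses <span class="math inline">\(\LaTeX\)</span> to represent mathematical calculations and symbols. The difficulty lies in parsing the model output and the ground truth and then comparing the two.<br>
As it turns out, there is no standard way to parse <span class="math inline">\(\LaTeX\)</span>.</p>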
<p><img alt="" loading="lazy" src="https://img-s2.andfun.cn/devrel/posts/2025/05/7def55ba74118.png" class="lazyload"><br>
<em>From the <code>sympy</code> documentation</em></p>
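<p>The lm-evaluation harness uses <code>sympy</code> (a Python library for symbolic mathematics) to parse and compare LaTeX.<br>
Parsing the ground truths with <code>sympy</code> (that is, scoring each ground truth against itself) yields an accuracy of only about 0.94.<br>
How can that be? It turns out that <code>sympy</code> cannot parse certain (valid <span class="math inline">\(\LaTeX\)</span>) expressions.</p>
<p>For example:</p>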
<pre><code>couldn't parse one of '
[0,1)
~~^
</code></pre>
<pre><code>couldn't parse one of (-\iny,-5]\cup\cup[5,\iny), I expected something else here
(-\iny,-5]\cup[5,\iny)
~~~~~~^
</code></pre>
<pre><code>couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this
-\frac{1}{{}2x}
~~~~~~~~~~~^
</code></pre>
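<p>The self-check described above (scoring every ground truth against itself) can be sketched roughly as follows. This is a minimal illustration, not the harness's code: <code>self_consistency_accuracy</code> and <code>toy_parse</code> are hypothetical names, and <code>toy_parse</code> is a deliberately naive stand-in that rejects mismatched brackets so the example runs without <code>sympy</code>.</p>

```python
from typing import Callable, Iterable


def self_consistency_accuracy(
    ground_truths: Iterable[str],
    parse: Callable[[str], object],
) -> float:
    """Fraction of ground truths that parse and compare equal to themselves."""
    total = 0
    correct = 0
    for answer in ground_truths:
        total += 1
        try:
            # Parse the same string twice, as a scorer would do for
            # the gold answer and a (perfect) prediction.
            if parse(answer) == parse(answer):
                correct += 1
        except Exception:
            # A parse failure is scored as a wrong answer, which is what
            # pulls the score below 1.0 even for perfect predictions.
            pass
    return correct / total if total else 0.0


def toy_parse(expr: str) -> str:
    """Hypothetical stand-in parser: rejects mismatched ( ) [ ] pairs."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in expr:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                raise ValueError(f"couldn't parse {expr!r}")
    if stack:
        raise ValueError(f"couldn't parse {expr!r}")
    return expr.replace(" ", "")


answers = ["[0,1)", "\\frac{1}{2}", "x+1", "(2,3)"]
accuracy = self_consistency_accuracy(answers, toy_parse)
print(accuracy)
```

<p>In a real check, <code>parse</code> would be the parser used by the scorer (for instance <code>parse_latex</code> from <code>sympy.parsing.latex</code>, assuming its optional dependencies are installed). Any result below 1.0 points at answers the parser itself cannot handle.</p>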
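<h3 id="如何缓解这个问题">How can this be mitigated?</h3>
<p>You can either rewrite the <span class="math inline">\(\LaTeX\)</span> grammar and add the missing features to the code, or add manual checks to the code to improve model scores.<br>
After nearly falling down a rabbit hole, we decided that adding string comparison checks to the code would be enough to largely mitigate the problem.</p>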
<p><img alt="Fix to the LM evaluation harness" loading="lazy" src="https://man-archives.oss-cn-hangzhou.aliyuncs.com/goofan/202505062211727.png" class="lazyload"><br>
<em>Fix to the LM evaluation harness</em></p>
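<p>One way to picture a string comparison check is as a cheap first pass that runs before any symbolic comparison. The sketch below is an assumption about the general shape of such a check, not the actual patch shown in the screenshot: <code>normalize</code>, <code>is_equiv</code>, and the <code>symbolic_equal</code> callback are hypothetical names, and the normalization rules are illustrative only.</p>

```python
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Light normalization; the exact rules here are an assumption."""
    answer = answer.strip()
    # Drop a single pair of surrounding math delimiters.
    if len(answer) >= 2 and answer.startswith("$") and answer.endswith("$"):
        answer = answer[1:-1]
    # Whitespace does not change the meaning of these short answers.
    return answer.replace(" ", "")


def is_equiv(
    pred: str,
    gold: str,
    symbolic_equal: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """String check first, symbolic comparison second."""
    # 1. Identical strings after normalization count as a match,
    #    even when the parser cannot handle the expression.
    if normalize(pred) == normalize(gold):
        return True
    # 2. Otherwise defer to the symbolic comparison, if one is given.
    if symbolic_equal is None:
        return False
    try:
        return bool(symbolic_equal(pred, gold))
    except Exception:
        # A parser error means "could not verify", scored as not equal.
        return False


def always_fails(pred: str, gold: str) -> bool:
    """Simulates a parser that cannot handle the input."""
    raise ValueError("couldn't parse")


print(is_equiv("[0,1)", "$[0, 1)$", always_fails))
print(is_equiv("x+1", "1+x", always_fails))
```

<p>The string check rescues answers such as <code>[0,1)</code> that match the ground truth exactly but cannot be parsed. Answers that are equivalent but written differently still depend on the symbolic comparison.</p>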
<h3 id="结果">Results</h3>
<p>The table below compares the top 25 models before and after the fix:</p>
<div id="xdihwljbql" style="padding: 10px 0; overflow-x: auto; overflow-y: auto; width: auto; height: auto">
<table class="gt_table" data-quarto-disable-processing="false" data-quarto-bootstrap="false">
<thead>
<tr class="gt_heading">
    <td colspan="5" class="gt_heading gt_title gt_font_normal">Model results on the MATH benchmark before and after the parser fix</td>
</tr>
<tr class="gt_col_headings gt_spanner_row">
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="2" colspan="1" scope="col" id="Model">Model</th>
<th class="gt_center gt_columns_top_border gt_column_spanner_outer" rowspan="1" colspan="2" scope="colgroup" id="Score">
    <span class="gt_column_spanner">Score</span>
</th>
<th class="gt_center gt_columns_top_border gt_column_spanner_outer" rowspan="1" colspan="2" scope="colgroup" id="Rank">
    <span class="gt_column_spanner">Rank</span>
</th>
</tr>
<tr class="gt_col_headings">
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Original">Original</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Fixed parser">Fixed parser</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Original">Original</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Fixed parser">Fixed parser</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr>
    <td class="gt_row gt_left">rombodawg/Rombos-LLM-V2.5-Qwen-72b</td>
    <td class="gt_row gt_right">47.58</td>
    <td class="gt_row gt_right">50.68</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(0, 0, 0, 1)" class="gt_row gt_right">1</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(0, 0, 0, 1)" class="gt_row gt_right">1</td>
</tr>
<tr>
    <td class="gt_row gt_left">MaziyarPanahi/calme-2.2-qwen2-72b</td>
    <td class="gt_row gt_right">41.16</td>
    <td class="gt_row gt_right">43.43</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(65, 24, 31, 1)" class="gt_row gt_right">2</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(65, 24, 31, 1)" class="gt_row gt_right">2</td>
</tr>
<tr>
    <td class="gt_row gt_left">arcee-ai/Arcee-Nova</td>
    <td class="gt_row gt_right">40.48</td>
    <td class="gt_row gt_right">42.90</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(130, 48, 62, 1)" class="gt_row gt_right">3</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(130, 48, 62, 1)" class="gt_row gt_right">3</td>
</tr>
<tr>
    <td class="gt_row gt_left">fblgit/TheBeagle-v2beta-32B-MGS</td>
    <td class="gt_row gt_right">39.43</td>
    <td class="gt_row gt_right">42.52</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(195, 73, 94, 1)" class="gt_row gt_right">4</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(195, 73, 94, 1)" class="gt_row gt_right">4</td>
</tr>
<tr>
    <td class="gt_row gt_left">rombodawg/Rombos-LLM-V2.5-Qwen-32b</td>
    <td class="gt_row gt_right">39.12</td>
    <td class="gt_row gt_right">41.99</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(202, 104, 102, 1)" class="gt_row gt_right">5</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(202, 104, 102, 1)" class="gt_row gt_right">5</td>
</tr>
<tr>
    <td class="gt_row gt_left">dnhkng/RYS-XLarge</td>
    <td class="gt_row gt_right">38.97</td>
    <td class="gt_row gt_right">41.24</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(165, 140, 94, 1)" class="gt_row gt_right">6</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(165, 140, 94, 1)" class="gt_row gt_right">6</td>
</tr>
<tr>
    <td class="gt_row gt_left">dfurman/CalmeRys-78B-Orpo-v0.1</td>
    <td class="gt_row gt_right">37.92</td>
    <td class="gt_row gt_right">40.71</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(110, 195, 82, 1)" class="gt_row gt_right">8</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(128, 177, 86, 1)" class="gt_row gt_right">7</td>
</tr>
<tr>
    <td class="gt_row gt_left">MaziyarPanahi/calme-2.2-rys-78b</td>
    <td class="gt_row gt_right">37.92</td>
    <td class="gt_row gt_right">39.95</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(110, 195, 82, 1)" class="gt_row gt_right">8</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(76, 189, 129, 1)" class="gt_row gt_right">9</td>
</tr>
<tr>
    <td class="gt_row gt_left">MaziyarPanahi/calme-2.4-rys-78b</td>
    <td class="gt_row gt_right">37.69</td>
    <td class="gt_row gt_right">40.41</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(76, 189, 129, 1)" class="gt_row gt_right">9</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(94, 206, 85, 1)" class="gt_row gt_right">8</td>
</tr>
<tr>
    <td class="gt_row gt_left">MaziyarPanahi/calme-2.3-rys-78b</td>
    <td class="gt_row gt_right">36.56</td>
    <td class="gt_row gt_right">38.97</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(58, 172, 173, 1)" class="gt_row gt_right">10</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(58, 172, 173, 1)" class="gt_row gt_right">10</td>
</tr>
<tr>
    <td class="gt_row gt_left">MaziyarPanahi/calme-2.1-rys-78b</td>
    <td class="gt_row gt_right">36.40</td>
    <td class="gt_row gt_right">38.90</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(39, 156, 217, 1)" class="gt_row gt_right">11</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(39, 156, 217, 1)" class="gt_row gt_right">11</td>
</tr>
<tr>
    <td class="gt_row gt_left">Qwen/Qwen2.5-72B</td>
    <td class="gt_row gt_right">36.10</td>
    <td class="gt_row gt_right">38.67</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(35, 167, 230, 1)" class="gt_row gt_right">12</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(35, 167, 230, 1)" class="gt_row gt_right">12</td>
</tr>
<tr>
    <td class="gt_row gt_left">MaziyarPanahi/calme-2.1-qwen2-72b</td>
    <td class="gt_row gt_right">36.03</td>
    <td class="gt_row gt_right">38.07</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(37, 188, 230, 1)" class="gt_row gt_right">13</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(54, 208, 226, 1)" class="gt_row gt_right">15</td>
</tr>
<tr>
    <td class="gt_row gt_left">Qwen/Qwen2-Math-72B-Instruct</td>
    <td class="gt_row gt_right">35.95</td>
    <td class="gt_row gt_right">38.14</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(39, 210, 229, 1)" class="gt_row gt_right">14</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(39, 210, 229, 1)" class="gt_row gt_right">14</td>
</tr>
<tr>
    <td class="gt_row gt_left">dfurman/Qwen2-72B-Orpo-v0.1</td>
    <td class="gt_row gt_right">35.42</td>
    <td class="gt_row gt_right">38.14</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(54, 208, 226, 1)" class="gt_row gt_right">15</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(37, 188, 230, 1)" class="gt_row gt_right">13</td>
</tr>
<tr>
    <td class="gt_row gt_left">abacusai/Smaug-Qwen2-72B-Instruct</td>
    <td class="gt_row gt_right">35.35</td>
    <td class="gt_row gt_right">37.46</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(102, 145, 214, 1)" class="gt_row gt_right">16</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(215, 58, 145, 1)" class="gt_row gt_right">19</td>
</tr>
<tr>
    <td class="gt_row gt_left">anthracite-org/magnum-v1-72b</td>
    <td class="gt_row gt_right">35.27</td>
    <td class="gt_row gt_right">37.69</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(174, 51, 196, 1)" class="gt_row gt_right">18</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(126, 114, 208, 1)" class="gt_row gt_right">16</td>
</tr>
<tr>
    <td class="gt_row gt_left">alpindale/magnum-72b-v1</td>
    <td class="gt_row gt_right">35.27</td>
    <td class="gt_row gt_right">37.69</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(174, 51, 196, 1)" class="gt_row gt_right">18</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(126, 114, 208, 1)" class="gt_row gt_right">16</td>
</tr>
<tr>
    <td class="gt_row gt_left">Qwen/Qwen2-72B-Instruct</td>
    <td class="gt_row gt_right">35.12</td>
    <td class="gt_row gt_right">37.69</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(215, 58, 145, 1)" class="gt_row gt_right">19</td>
    <td style="color: rgba(255, 255, 255, 1); background-color: rgba(198, 20, 190, 1)" class="gt_row gt_right">18</td>
</tr>
<tr>
    <td class="gt_row gt_left">dnhkng/RYS-XLarge-base</td>
    <td class="gt_row gt_right">34.67</td>
    <td class="gt_row gt_right">37.16</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(227, 113, 95, 1)" class="gt_row gt_right">20</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(227, 113, 95, 1)" class="gt_row gt_right">20</td>
</tr>
<tr>
    <td class="gt_row gt_left">Undi95/MG-FinalMix-72B</td>
    <td class="gt_row gt_right">33.61</td>
    <td class="gt_row gt_right">36.10</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(244, 195, 20, 1)" class="gt_row gt_right">22</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(238, 168, 45, 1)" class="gt_row gt_right">21</td>
</tr>
<tr>
    <td class="gt_row gt_left">abacusai/Dracarys-72B-Instruct</td>
    <td class="gt_row gt_right">33.61</td>
    <td class="gt_row gt_right">35.65</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(244, 195, 20, 1)" class="gt_row gt_right">22</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(234, 194, 34, 1)" class="gt_row gt_right">22</td>
</tr>
<tr>
    <td class="gt_row gt_left">Qwen/Qwen2.5-32B</td>
    <td class="gt_row gt_right">32.85</td>
    <td class="gt_row gt_right">35.50</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(209, 182, 75, 1)" class="gt_row gt_right">23</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(209, 182, 75, 1)" class="gt_row gt_right">23</td>
</tr>
<tr>
    <td class="gt_row gt_left">anthracite-org/magnum-v2-72b</td>
    <td class="gt_row gt_right">31.65</td>
    <td class="gt_row gt_right">34.06</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(183, 170, 117, 1)" class="gt_row gt_right">24</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(183, 170, 117, 1)" class="gt_row gt_right">24</td>
</tr>
<tr>
    <td class="gt_row gt_left">dnhkng/RYS-Huge-bnb-4bit</td>
    <td class="gt_row gt_right">31.57</td>
    <td class="gt_row gt_right">33.84</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(158, 158, 158, 1)" class="gt_row gt_right">25</td>
    <td style="color: rgba(0, 0, 0, 1); background-color: rgba(158, 158, 158, 1)" class="gt_row gt_right">25</td>
</tr>
</tbody>
</table>
</div>
<hr>
<blockquote>
<p>Original English article: https://raw.githubusercontent.com/huggingface/evaluation-guidebook/refs/heads/main/contents/troubleshooting/troubleshooting-math-parsing.md</p>
<p>Original author: Nathan Habib</p>
<p>Translator: SuSung-boy</p>
<p>Reviewer: Adeena</p>
</blockquote><br><br>
Source: https://www.cnblogs.com/huggingface/p/18866004