庚茗溪澈 發表於 2025-6-25 09:47:00

终止分区表变更操作时误删数据字典缓存导致MySQL崩溃分析

<h1 id="终止分区表变更操作时误删数据字典缓存导致mysql崩溃分析">终止分区表变更操作时误删数据字典缓存导致MySQL崩溃分析</h1>
<h2 id="1-问题简述">1. 问题简述</h2>
<p>在 MySQL 中,当终止一个处于 <code>committing alter table to storage engine</code> 阶段的分区表操作时,InnoDB 会尝试进行回滚并清理数据字典缓存。不幸的是,过程中发生了<strong>误删表缓存对象</strong>的情况 —— InnoDB 错误地移除了另一张非目标表的缓存条目,导致引用计数不为 0,触发断言失败并导致 MySQL 崩溃。</p>
<h2 id="2-复现步骤">2. 复现步骤</h2>
<p>环境说明:</p>
<p>系统:CentOS 7</p>
<p>数据库:MySQL 8.0.32</p>
<h3 id="21-建表准备">2.1 建表准备</h3>
<pre><code class="language-SQL">CREATE DATABASE TEST;
CREATE TABLE TEST.A ( X INT)
PARTITION BY RANGE (X) (
    PARTITION P0 VALUES LESS THAN (10000),
    PARTITION PMAX VALUES LESS THAN MAXVALUE
);
CREATE TABLE TEST.A_1 LIKE TEST.A;
SELECT COUNT(*) FROM TEST.A_1;
</code></pre>
<h3 id="22-反复尝试终止-alter-table-操作">2.2 反复尝试终止 ALTER TABLE 操作</h3>
<p>在一个 Shell 中持续执行终止 <code>"committing alter table to storage engine"</code> 状态(KILL相应SQL请求)的线程:</p>
<pre><code class="language-Bash">while true; do{   mysql --login-path=mylogin -BNe'SELECT CONCAT("KILL ",ID ,";") FROM INFORMATION_SCHEMA.PROCESSLIST WHERE STATE = "COMMITTING ALTER TABLE TO STORAGE ENGINE";' | mysql --login-path=mylogin-vvv ; } ; done
</code></pre>
<h3 id="23-循环执行-alter-操作">2.3 循环执行 ALTER 操作</h3>
<p>在另一个 Shell 中循环执行添加与删除分区的操作:</p>
<pre><code class="language-Bash">while true; do{   mysql --login-path=mylogin -BNe"ALTER TABLE test.a ADD PARTITION (PARTITION pmax VALUES LESS THAN MAXVALUE);" ;   mysql --login-path=mylogin -BNe " ALTER TABLE test.a DROP PARTITION pmax;" ; }; done
</code></pre>
<h2 id="3-崩溃日志与原因分析">3. 崩溃日志与原因分析</h2>
<p>当 <code>DROP PARTITION</code> 操作在关键阶段被 KILL 时,崩溃发生:</p>
<pre><code class="language-c++">2025-06-05T17:03:19.270698+08:00 2975 Assertion failure: dict0dict.cc:1885:table-&gt;get_ref_count() == 0 thread 140327459395328
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
2025-06-05T09:03:19Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID=7afc1fad28c808c287fa7599451d3355e1b3be73
Thread pointer: 0x7fa054000f40
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fa0885e8a30 thread_stack 0x100000
/usr/local/mysql-debug/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x43)
/usr/local/mysql-debug/bin/mysqld(print_fatal_signal(int)+0x3a2)
/usr/local/mysql-debug/bin/mysqld(my_server_abort()+0x6b)
/usr/local/mysql-debug/bin/mysqld(my_abort()+0xd)
/usr/local/mysql-debug/bin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x1d1)
/usr/local/mysql-debug/bin/mysqld()
/usr/local/mysql-debug/bin/mysqld(dict_table_remove_from_cache(dict_table_t*)+0x1d)
/usr/local/mysql-debug/bin/mysqld(dict_partitioned_table_remove_from_cache(char const*)+0x159)
/usr/local/mysql-debug/bin/mysqld()
/usr/local/mysql-debug/bin/mysqld()
/usr/local/mysql-debug/bin/mysqld(mysql_alter_table(THD*, char const*, char const*, HA_CREATE_INFO*, Table_ref*, Alter_info*)+0x3bf6)
/usr/local/mysql-debug/bin/mysqld(Sql_cmd_alter_table::execute(THD*)+0x5d4)
/usr/local/mysql-debug/bin/mysqld(mysql_execute_command(THD*, bool)+0x54c8)
/usr/local/mysql-debug/bin/mysqld(dispatch_sql_command(THD*, Parser_state*)+0x756)
/usr/local/mysql-debug/bin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x15a8)
/usr/local/mysql-debug/bin/mysqld(do_command(THD*)+0x5bd)
/usr/local/mysql-debug/bin/mysqld()
/usr/local/mysql-debug/bin/mysqld()
/lib64/libpthread.so.0(+0x7ea5)
/lib64/libc.so.6(clone+0x6d)

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fa054010390): ALTER TABLE test.a DROP PARTITION pmax
Connection ID (thread ID): 2975
Status: KILL_CONNECTION
</code></pre>
<h4 id="堆栈信息">堆栈信息</h4>
<pre><code class="language-c++">(gdb) bt
#00x00007fa13cf46aa1 in pthread_kill () from /lib64/libpthread.so.0
#10x00000000048da549 in my_write_core (sig=6) at /software/db/mysql-8.0.32/mysys/stacktrace.cc:295
#20x00000000034cf4ec in handle_fatal_signal (sig=6) at /software/db/mysql-8.0.32/sql/signal_handler.cc:230
#3&lt;signal handler called&gt;
#40x00007fa13b493387 in raise () from /lib64/libc.so.6
#50x00007fa13b494a78 in abort () from /lib64/libc.so.6
#60x00000000034cf671 in my_server_abort () at /software/db/mysql-8.0.32/sql/signal_handler.cc:286
#70x00000000048d0eaf in my_abort () at /software/db/mysql-8.0.32/mysys/my_init.cc:258
#80x0000000004d9b051 in ut_dbg_assertion_failed (expr=0x6d821d3 "table-&gt;get_ref_count() == 0",
    file=0x6d81ab8 "/software/db/mysql-8.0.32/storage/innobase/dict/dict0dict.cc", line=1885) at /software/db/mysql-8.0.32/storage/innobase/ut/ut0dbg.cc:99
#90x0000000004f15f0b in dict_table_remove_from_cache_low (table=0x7fa0445c6e08, lru_evict=false)
    at /software/db/mysql-8.0.32/storage/innobase/dict/dict0dict.cc:1885
#10 0x0000000004f166cd in dict_table_remove_from_cache (table=0x7fa0445c6e08) at /software/db/mysql-8.0.32/storage/innobase/dict/dict0dict.cc:1969
#11 0x0000000004f16829 in dict_partitioned_table_remove_from_cache (name=0x7fa0885e2bd0 "test/a")
    at /software/db/mysql-8.0.32/storage/innobase/dict/dict0dict.cc:1999
#12 0x0000000004a07dc2 in innobase_dict_cache_reset (schema_name=0x7fa054010868 "test", table_name=0x7fa054010880 "a")
    at /software/db/mysql-8.0.32/storage/innobase/handler/ha_innodb.cc:3991
#13 0x000000000337ee49 in mysql_inplace_alter_table (thd=0x7fa054000f40, schema=..., new_schema=..., table_def=0x7fa0445af410,
    altered_table_def=0x7fa054012d60, table_list=0x7fa054011020, table=0x0, altered_table=0x7fa05403e230, ha_alter_info=0x7fa0885e3710,
    inplace_supported=HA_ALTER_INPLACE_NO_LOCK_AFTER_PREPARE, alter_ctx=0x7fa0885e4630, columns=std::set with 0 elements, fk_key_info=0x7fa05403b720,
    fk_key_count=0, fk_invalidator=0x7fa0885e4560) at /software/db/mysql-8.0.32/sql/sql_table.cc:13744
#14 0x000000000338a12a in mysql_alter_table (thd=0x7fa054000f40, new_db=0x7fa054010868 "test", new_name=0x0, create_info=0x7fa0885e6170,
    table_list=0x7fa054011020, alter_info=0x7fa0885e6000) at /software/db/mysql-8.0.32/sql/sql_table.cc:17405
#15 0x00000000039af486 in Sql_cmd_alter_table::execute (this=0x7fa054011670, thd=0x7fa054000f40) at /software/db/mysql-8.0.32/sql/sql_alter.cc:349
#16 0x00000000032a5677 in mysql_execute_command (thd=0x7fa054000f40, first_level=true) at /software/db/mysql-8.0.32/sql/sql_parse.cc:4688
#17 0x00000000032a785f in dispatch_sql_command (thd=0x7fa054000f40, parser_state=0x7fa0885e7910) at /software/db/mysql-8.0.32/sql/sql_parse.cc:5322
#18 0x000000000329d7c3 in dispatch_command (thd=0x7fa054000f40, com_data=0x7fa0885e8a00, command=COM_QUERY)
    at /software/db/mysql-8.0.32/sql/sql_parse.cc:2036
#19 0x000000000329b853 in do_command (thd=0x7fa054000f40) at /software/db/mysql-8.0.32/sql/sql_parse.cc:1439
#20 0x00000000034ba2df in handle_connection (arg=0xbe191d0) at /software/db/mysql-8.0.32/sql/conn_handler/connection_handler_per_thread.cc:302
#21 0x00000000051ed236 in pfs_spawn_thread (arg=0xbe23890) at /software/db/mysql-8.0.32/storage/perfschema/pfs.cc:2986
#22 0x00007fa13cf41ea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fa13b55bb0d in clone () from /lib64/libc.so.6
</code></pre>
<h4 id="崩溃核心栈"><strong>崩溃核心栈:</strong></h4>
<pre><code class="language-c++">dict_partitioned_table_remove_from_cache
-&gt; strncmp("test/a", "test/a_1#p#p0", 6) == 0// 误判为匹配
-&gt; dict_table_remove_from_cache() 导致 ref_count != 0 触发断言
</code></pre>
<h4 id="gdb-调试关键值">GDB 调试关键值:</h4>
<pre><code class="language-Plaintext">(gdb) p name
$1 = "test/a"
(gdb) p prev_table-&gt;name.m_name
$2 = "test/a_1#p#p0"
(gdb) p strncmp("test/a", "test/a_1#p#p0", 6)
$3 = 0   // 返回0,误认为匹配
</code></pre>
<p>由于 <code>"test/a"</code> 是 <code>"test/a_1#p#p0"</code> 的前缀,使用 <code>strncmp(name, m_name, name_len)</code> 判断时发生<strong>误判</strong>,错误地将 <code>test.a_1</code> 的分区子表从缓存中移除。</p>
<h2 id="4-关键源码解读">4. 关键源码解读</h2>
<p>崩溃函数在 InnoDB 源码中的定义如下(文件 <code>dict0dict.cc</code>):</p>
<pre><code class="language-C++">void dict_partitioned_table_remove_from_cache(const char *name) {
ut_ad(dict_sys_mutex_own());

size_t name_len = strlen(name);

for (uint32_t i = 0; i &lt; hash_get_n_cells(dict_sys-&gt;table_id_hash); ++i) {
    dict_table_t *table = static_cast&lt;dict_table_t *&gt;(hash_get_first(dict_sys-&gt;table_hash, i));

    while (table != nullptr) {
      dict_table_t *prev_table = table;
      table = static_cast&lt;dict_table_t *&gt;(HASH_GET_NEXT(name_hash, prev_table));

      if (prev_table-&gt;is_dd_table) {
      continue;
      }

      if ((strncmp(name, prev_table-&gt;name.m_name, name_len) == 0) &amp;&amp;
          dict_table_is_partition(prev_table)) {
      btr_drop_ahi_for_table(prev_table);
      dict_table_remove_from_cache(prev_table);// 错误删除
      }
    }
}
}
</code></pre>
<h4 id="核心问题">核心问题:</h4>
<ul>
<li>使用 <code>strncmp(name, m_name, name_len)</code> 导致<strong>前缀匹配误判为全匹配</strong>;</li>
<li>应该采用 <strong>精确表名匹配</strong>,如:</li>
</ul>
<pre><code class="language-C++">if (strcmp(name, prev_table-&gt;name.m_name) == 0)
</code></pre>
<ul>
<li>或使用更严格的逻辑,确保仅匹配当前表或其合法分区。</li>
</ul>
<h2 id="5-结论与建议">5. 结论与建议</h2>
<h4 id="问题根因">问题根因:</h4>
<ul>
<li>缓存移除逻辑基于前缀匹配,未验证完整表名,误删其他表结构对象;</li>
<li>被删表正在被引用,引用计数 ref_count ≠ 0,触发 InnoDB 保护性断言;</li>
<li>最终导致 MySQL 进程崩溃。</li>
</ul>
<h4 id="潜在影响">潜在影响:</h4>
<ul>
<li>表结构类似(同名前缀)的表,在 DDL 并发或异常终止场景下可能被误删;</li>
<li>可能存在更广泛的 InnoDB 缓存一致性漏洞,特别在并发和 kill 介入场景中。</li>
</ul>
<h4 id="建议修复方向">建议修复方向:</h4>
<ul>
<li>将 <code>strncmp(name, m_name, name_len)</code> 替换为更安全的 <code>strcmp</code> 或引入完整名判断;</li>
<li>限制 <code>dict_partitioned_table_remove_from_cache()</code> 中删除的表名范围,仅处理严格匹配的表;</li>
<li>增强 DDL 异常中断处理的健壮性,确保引用关系、缓存清理逻辑的一致性。</li>
</ul>
<hr>
<p>Enjoy GreatSQL 😃</p>
<h2 id="关于-greatsql">关于 GreatSQL</h2>
<p>GreatSQL是适用于金融级应用的国内自主开源数据库,具备高性能、高可靠、高易用性、高安全等多个核心特性,可以作为MySQL或Percona Server的可选替换,用于线上生产环境,且完全免费并兼容MySQL或Percona Server。</p>
<p>相关链接:   GreatSQL社区        Gitee        GitHub        Bilibili</p>
<h2 id="greatsql社区">GreatSQL社区:</h2>
<blockquote>
<p>社区博客有奖征稿详情:https://greatsql.cn/thread-100-1-1.html</p>
</blockquote>
<p><img src="https://img2024.cnblogs.com/other/2630741/202506/2630741-20250625094749352-962258639.png"></p>
<h2 id="技术交流群">技术交流群:</h2>
<blockquote>
<p>微信:扫码添加<code>GreatSQL社区助手</code>微信好友,发送验证信息<code>加群</code>。</p>
</blockquote>
<p><img src="https://img2024.cnblogs.com/other/2630741/202506/2630741-20250625094749627-2040455347.png"></p><br><br>
来源:https://www.cnblogs.com/greatsql/p/18947465
頁: [1]
查看完整版本: 终止分区表变更操作时误删数据字典缓存导致MySQL崩溃分析