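<h1 id="mysql-派生表查询导致-crash-的根源分析与解决方案">Root-Cause Analysis and Fix for a MySQL Crash Triggered by a Derived-Table Query</h1><h2 id="一问题发现">1. Discovering the Problem</h2>
<p>While using MySQL 8.0.32, we found that the following SQL statement, which contains a derived table, crashes the server. The <code>sequence_table(2)</code> in it can be replaced with any non-constant table:</p>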
<blockquote>
<p>Only MySQL 8.0.32 is affected.</p>
</blockquote>
<pre><code class="language-SQL">EXPLAIN FORMAT=TREE
select
trim(ref_15.c_ogj),
0<>0 as c_lrcm63eani
from
(select
0<>0 as c_ogj
from
sequence_table(2) t1
where 0<>0
order by c_ogj asc) as ref_15;
</code></pre>
<p>The crash produces the following stack trace:</p>
<pre><code class="language-SQL">Thread 55 "mysqld" received signal SIGSEGV, Segmentation fault.
Item_view_ref::used_tables (this=0x7fff2418f410)
at sql/item.h:6670
6670 table_map inner_map = ref_item()->used_tables(); ==> ref_item() is a null pointer, hence the crash
(gdb) bt
#0  Item_view_ref::used_tables (this=0x7fff2418f410)
at sql/item.h:6670
#1  0x0000555558e978d1 in Item::const_item (this=0x7fff2418f410)
at sql/item.h:2342
#2  0x0000555558ecc765 in Item_ref::print (this=0x7fff2418f410, thd=0x7fff24001050,
str=0x7fffc83ee7e0, query_type=(QT_TO_SYSTEM_CHARSET | QT_SHOW_SELECT_NUMBER))
at sql/item.cc:9993
#3  0x000055555903b839 in Item_func_trim::print (this=0x7fff24120d20, thd=0x7fff24001050,
str=0x7fffc83ee7e0, query_type=(QT_TO_SYSTEM_CHARSET | QT_SHOW_SELECT_NUMBER))
at sql/item_strfunc.cc:3244
#4  0x0000555558ea7fc5 in Item::print_item_w_name (this=0x7fff24120d20, thd=0x7fff24001050,
str=0x7fffc83ee7e0, query_type=(QT_TO_SYSTEM_CHARSET | QT_SHOW_SELECT_NUMBER))
at sql/item.cc:727
#5  0x00005555593f18c0 in Query_block::print_item_list (this=0x7fff24120768, thd=0x7fff24001050,
str=0x7fffc83ee7e0, query_type=(QT_TO_SYSTEM_CHARSET | QT_SHOW_SELECT_NUMBER))
at sql/sql_lex.cc:4041
#6  0x00005555593efb50 in Query_block::print_query_block (this=0x7fff24120768,
thd=0x7fff24001050, str=0x7fffc83ee7e0,
query_type=(QT_TO_SYSTEM_CHARSET | QT_SHOW_SELECT_NUMBER))
at sql/sql_lex.cc:3614
#7  0x00005555593efa3d in Query_block::print (this=0x7fff24120768, thd=0x7fff24001050,
str=0x7fffc83ee7e0, query_type=(QT_TO_SYSTEM_CHARSET | QT_SHOW_SELECT_NUMBER))
at sql/sql_lex.cc:3598
#8  0x00005555593ee556 in Query_expression::print (this=0x7fff24120670, thd=0x7fff24001050,
str=0x7fffc83ee7e0, query_type=(QT_TO_SYSTEM_CHARSET | QT_SHOW_SELECT_NUMBER))
at sql/sql_lex.cc:3232
#9  0x0000555559a89c2c in print_query_for_explain (query_thd=0x7fff24001050,
unit=0x7fff24120670, str=0x7fffc83ee7e0)
at sql/opt_explain.cc:2288
#10 0x0000555559a10b11 in PrintQueryPlan(THD*, THD const*, Query_expression*) (
ethd=0x7fff24001050, query_thd=0x7fff24001050, unit=0x7fff24120670)
at sql/join_optimizer/explain_access_path.cc:1894
#11 0x0000555559a8985a in ExplainIterator (ethd=0x7fff24001050, query_thd=0x7fff24001050,
unit=0x7fff24120670) at sql/opt_explain.cc:2205
#12 0x0000555559a89e91 in explain_query (explain_thd=0x7fff24001050, query_thd=0x7fff24001050,
unit=0x7fff24120670) at sql/opt_explain.cc:2359
#13 0x000055555955cd46 in Sql_cmd_dml::execute_inner (this=0x7fff24165630, thd=0x7fff24001050)
</code></pre>
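<p>Frame #0 shows the reference dereferencing a null target. As a rough illustration of how that can happen, the sketch below assumes the reference reaches its target through a pointer-to-pointer slot, so that clearing the slot elsewhere leaves the reference with nothing to dereference. <code>ToyItem</code>, <code>ToyViewRef</code>, and <code>ref_survives</code> are hypothetical names invented for this example and are not MySQL classes. The checked accessor exists only to make the failure observable; it is not how the server fixed the bug.</p>
<pre><code class="language-C++">#include <cassert>
#include <cstdint>

// Toy stand-in for an expression node; not a MySQL class.
struct ToyItem {
  std::uint64_t tables;  // bitmap of tables the expression depends on
  std::uint64_t used_tables() const { return tables; }
};

// Toy reference that reaches its target through a slot (pointer to pointer).
// Assumption: clearing the slot is visible to every reference that uses it.
struct ToyViewRef {
  ToyItem **slot;

  ToyItem *ref_item() const {
    assert(slot != nullptr);  // the slot itself always exists in this model
    return *slot;             // may be nullptr if the slot was cleared
  }

  // Unchecked variant, analogous to the crashing frame: dereferences
  // whatever the slot currently holds.
  std::uint64_t used_tables_unchecked() const {
    return ref_item()->used_tables();
  }

  // Checked variant for the demo only: reports whether the target exists.
  bool try_used_tables(std::uint64_t *out) const {
    ToyItem *target = ref_item();
    if (target == nullptr) return false;  // slot was cleared elsewhere
    *out = target->used_tables();
    return true;
  }
};

// Builds a reference, optionally clears the slot behind its back, and
// reports whether the reference can still be evaluated.
bool ref_survives(bool clear_slot) {
  ToyItem expr{0x1};
  ToyItem *slot = &expr;
  ToyViewRef ref{&slot};
  if (clear_slot) slot = nullptr;  // some other cleanup step nulls the slot
  std::uint64_t map = 0;
  return ref.try_used_tables(&map);
}
</code></pre>
<p>Under these assumptions, <code>ref_survives(false)</code> returns <code>true</code> and <code>ref_survives(true)</code> returns <code>false</code>; calling <code>used_tables_unchecked()</code> in the second case would dereference a null pointer, which is the shape of the failure in frame #0.</p>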
<h2 id="二问题调查过程">2. Investigating the Problem</h2>
<p>We traced the optimize phase of the statement and found that it is transformed as follows.</p>
<p>For <code>trim(ref_15.c_ogj)</code> in the statement below, once <code> find_order_in_list</code> has run, the <code>args->m_ref_item</code> of the <code>Item_func_trim</code> points to <code>0<>0 as c_lrcm63eani</code> rather than <code>0<>0 as c_ogj</code>. This happens because <code>c_lrcm63eani</code> and c_ogj resolve to the same name, 0<>0, so inside <code>find_order_in_list</code> the inner column is replaced by the outer one.</p>
<p>Later, when <code>Item::clean_up_after_removal</code> runs, the Item_func_ne (that is, c_lrcm63eani) is visited twice, so <code>decrement_ref_count()</code> is called twice. <code>Query_block::delete_unused_merged_columns</code> then sets the Item for <code>0<>0 as c_lrcm63eani</code> to null: at that point <code>item->decrement_ref_count()</code> leaves ref_count() at 0, so the function goes on to run <code>Item::clean_up_after_removal</code> and clears the slot, even though the trim argument still refers to it.</p>
<pre><code class="language-SQL">EXPLAIN FORMAT=TREE
select
trim(ref_15.c_ogj),
0<>0 as c_lrcm63eani
from
(select
0<>0 as c_ogj
from
sequence_table(2) t1
where 0<>0
order by c_ogj asc) as ref_15;
</code></pre>
<p>Tracing the call chain shows that Query_block runs delete_unused_merged_columns during prepare:</p>
<pre><code class="language-SQL">// Call chain: Query_block::prepare -> Query_block::apply_local_transforms -> Query_block::delete_unused_merged_columns
bool find_order_in_list() {
if (select_item != not_found_item) {
if ((*order->item)->real_item() != (*select_item)->real_item()) {
Item::Cleanup_after_removal_context ctx(
thd->lex->current_query_block());
(*order->item)
->walk(&Item::clean_up_after_removal, walk_options,  // Item_func_ne is visited twice, so decrement_ref_count() also runs twice
pointer_cast<uchar *>(&ctx));
}
}
}
bool Query_block::apply_local_transforms(THD *thd, bool prune) {
DBUG_TRACE;
assert(first_execution);
// This call removes ((Item_func *)&fields)->args->m_ref_item
if (derived_table_count) delete_unused_merged_columns(&m_table_nest);
}
void Query_block::delete_unused_merged_columns(
mem_root_deque<Table_ref *> *tables) {
DBUG_TRACE;
for (Table_ref *tl : *tables) {
if (tl->nested_join == nullptr) continue;
if (tl->is_merged()) {
for (Field_translator *transl = tl->field_translation;
transl < tl->field_translation_end; transl++) {
Item *const item = transl->item;
// Decrement the ref count as its no more used in
// select list.
if (item->decrement_ref_count()) continue;  // returns m_ref_count = 0 here, so `continue` is not taken and the cleanup below runs
// Cleanup the item since its not referenced from
// anywhere.
assert(item->fixed);
Item::Cleanup_after_removal_context ctx(this);
item->walk(&Item::clean_up_after_removal, walk_options,
pointer_cast<uchar *>(&ctx));
transl->item = nullptr;  // nulls the Item_func_ne that the Item_view_ref refers to, i.e. deletes the c_lrcm63eani column used as the trim() argument
}
}
delete_unused_merged_columns(&tl->nested_join->m_tables);
}
}
</code></pre>
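<p>The double decrement described above can be reproduced in miniature. The following is a minimal sketch, not server code: <code>ToyItem</code>, <code>cleanup_visit</code>, and <code>slot_cleared_after_walk</code> are hypothetical names, and the starting reference count of 3 is an illustrative value chosen so that two decrements during the walk plus one in the delete step reach zero. It only assumes, as in the listing above, that <code>decrement_ref_count()</code> returns the count after decrementing.</p>
<pre><code class="language-C++">#include <cassert>

// Hypothetical miniature of a reference-counted expression node.
struct ToyItem {
  int ref_count = 1;

  // Returns the count after decrementing, so a non-zero result means
  // "still referenced somewhere".
  int decrement_ref_count() {
    assert(ref_count > 0);
    return --ref_count;
  }
};

// Unfixed cleanup step: every visit of a shared node drops one reference.
void cleanup_visit(ToyItem *item) {
  if (item->ref_count > 1) item->decrement_ref_count();
}

// Simulates a walk that reaches the same node `visits` times, followed by
// the "delete unused column" step. Returns true if the slot ends up cleared.
bool slot_cleared_after_walk(int initial_refs, int visits) {
  ToyItem node;
  node.ref_count = initial_refs;
  ToyItem *slot = &node;

  for (int i = 0; i < visits; ++i) cleanup_visit(slot);

  // The delete step releases its own reference and clears the slot only
  // when nothing else appears to refer to the node.
  if (slot->decrement_ref_count() == 0) slot = nullptr;
  return slot == nullptr;
}
</code></pre>
<p>With an initial count of 3, a single visit leaves the slot intact (<code>slot_cleared_after_walk(3, 1)</code> is <code>false</code>), while visiting the same node twice drives the count to zero in the delete step and clears the slot (<code>slot_cleared_after_walk(3, 2)</code> is <code>true</code>), even though one holder of the node still exists.</p>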
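<h2 id="三解决方案">3. The Fix</h2>
<p>The analysis above shows that the problem is one extra execution of <code>Item::clean_up_after_removal</code>. Running the same SQL against the latest MySQL code shows that the bug has already been fixed there, and the relevant change is shown below.</p>
<p>The relevant commit ID is: 2171a1260e2cdbbd379646be8ff6413a92fd48f4</p>
<pre><code class="language-SQL">// The relevant fix: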
@@ -7575,7 +7865,6 @@ bool Item::clean_up_after_removal(uchar *arg) {
if (reference_count() > 1) {
(void)decrement_ref_count();
+ ctx->stop_at(this);
}
return false;
}
</code></pre>
<p>With the fix applied, the call stack leading into this function looks like this:</p>
<pre><code class="language-C++">#0  Item::clean_up_after_removal (this=0x2,
arg=0x41 <error: Cannot access memory at address 0x41>)
at sql/item.cc:9236
#1  0x0000555558fea5a8 in Item::walk (this=0x7fff2c338db8, processor=&virtual table offset 864,
walk=7, arg=0x7fffc83ee4b0 "") at sql/item.h:2543
#2  0x00005555596cc6f2 in find_order_in_list (thd=0x7fff2c001070, ref_item_array=...,
tables=0x7fff2c330b90, order=0x7fff2c32eae8, fields=0x7fff2c32fb20, is_group_field=false,
is_window_order=false) at sql/sql_resolver.cc:4625
#3  0x00005555596cd0ae in setup_order (thd=0x7fff2c001070, ref_item_array=...,
tables=0x7fff2c330b90, fields=0x7fff2c32fb20, order=0x7fff2c32eae8)
at sql/sql_resolver.cc:4811
#4  0x00005555596bf528 in Query_block::prepare (this=0x7fff2c32fae0, thd=0x7fff2c001070,
insert_field_list=0x0) at sql/sql_resolver.cc:400
#5  0x00005555597d035d in Query_expression::prepare (this=0x7fff2c32f9e8, thd=0x7fff2c001070,
sel_result=0x7fff2c33b2a8, insert_field_list=0x0, added_options=0, removed_options=0)
at sql/sql_union.cc:758
#6  0x0000555559590772 in Table_ref::resolve_derived (this=0x7fff2c339790, thd=0x7fff2c001070,
apply_semijoin=true) at sql/sql_derived.cc:451
#7  0x00005555596c2a80 in Query_block::resolve_placeholder_tables (this=0x7fff2c333f08,
thd=0x7fff2c001070, apply_semijoin=true)
at sql/sql_resolver.cc:1408
#8  0x00005555596bea62 in Query_block::prepare (this=0x7fff2c333f08, thd=0x7fff2c001070,
insert_field_list=0x0) at sql/sql_resolver.cc:265
</code></pre>
<p>For the <code>Item_func_ne</code> object <code>0<>0 as c_lrcm63eani</code>, <code>reference_count() > 1</code> holds when <code>Item::clean_up_after_removal</code> first reaches it, so the newly added <code>ctx->stop_at(this)</code> is executed. The next time <code>clean_up_after_removal()</code> reaches this <code>Item_func_ne</code>, it returns immediately because <code>ctx->is_stopped(this)</code> is true, without calling <code>decrement_ref_count()</code> a second time. This prevents the later <code>transl->item = nullptr</code> from being executed.</p>
<pre><code class="language-C++">bool find_order_in_list() {
if (select_item != not_found_item) {
if ((*order->item)->real_item() != (*select_item)->real_item()) {
Item::Cleanup_after_removal_context ctx(
thd->lex->current_query_block());
(*order->item)
->walk(&Item::clean_up_after_removal, walk_options,  // Item_func_ne is visited twice, but decrement_ref_count() runs only once
pointer_cast<uchar *>(&ctx));
}
}
}
void Query_block::delete_unused_merged_columns(
mem_root_deque<Table_ref *> *tables) {
DBUG_TRACE;
for (Table_ref *tl : *tables) {
if (tl->nested_join == nullptr) continue;
if (tl->is_merged()) {
for (Field_translator *transl = tl->field_translation;
transl < tl->field_translation_end; transl++) {
Item *const item = transl->item;
// Decrement the ref count as its no more used in
// select list.
if (item->decrement_ref_count()) continue;  // now returns m_ref_count = 1, so `continue` is taken and the nulling below is skipped
// Cleanup the item since its not referenced from
// anywhere.
assert(item->fixed);
Item::Cleanup_after_removal_context ctx(this);
item->walk(&Item::clean_up_after_removal, walk_options,
pointer_cast<uchar *>(&ctx));
transl->item = nullptr;  // not reached for this item
}
}
delete_unused_merged_columns(&tl->nested_join->m_tables);
}
}
</code></pre>
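<p>The effect of recording a stop point can also be sketched in a few lines. This is a simplified model under stated assumptions, not the server's implementation: <code>ToyItem</code>, <code>ToyCleanupContext</code>, and <code>slot_cleared_after_fixed_walk</code> are hypothetical names, the context here simply remembers one node, and the real <code>Cleanup_after_removal_context</code> may behave differently. The starting count of 3 is again only an illustrative value.</p>
<pre><code class="language-C++">#include <cassert>

// Hypothetical miniature of a reference-counted expression node.
struct ToyItem {
  int ref_count = 1;

  // Returns the count after decrementing.
  int decrement_ref_count() {
    assert(ref_count > 0);
    return --ref_count;
  }
};

// Simplified context that remembers the node at which cleanup stopped.
class ToyCleanupContext {
 public:
  void stop_at(const ToyItem *item) { stopped_at_ = item; }
  bool is_stopped(const ToyItem *item) const { return stopped_at_ == item; }

 private:
  const ToyItem *stopped_at_ = nullptr;
};

// Cleanup step with the guard: a shared node gives up one reference on the
// first visit and is skipped on every later visit within the same context.
void cleanup_visit(ToyItem *item, ToyCleanupContext *ctx) {
  if (ctx->is_stopped(item)) return;
  if (item->ref_count > 1) {
    item->decrement_ref_count();
    ctx->stop_at(item);
  }
}

// Same scenario as the unfixed case: walk the node `visits` times, then run
// the "delete unused column" step. Returns true if the slot ends up cleared.
bool slot_cleared_after_fixed_walk(int initial_refs, int visits) {
  ToyItem node;
  node.ref_count = initial_refs;
  ToyItem *slot = &node;
  ToyCleanupContext ctx;

  for (int i = 0; i < visits; ++i) cleanup_visit(slot, &ctx);

  if (slot->decrement_ref_count() == 0) slot = nullptr;
  return slot == nullptr;
}
</code></pre>
<p>In this model, <code>slot_cleared_after_fixed_walk(3, 2)</code> is <code>false</code>: the walk removes only one reference no matter how many times it reaches the node, so the delete step sees a remaining count of 1 and leaves the slot alone. A node that really has only one other holder (<code>slot_cleared_after_fixed_walk(2, 3)</code>) is still cleared, which is the intended behavior for unused columns.</p>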
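<h2 id="四问题总结">4. Summary</h2>
<p>The analysis above shows that a complex SQL statement triggers complex Item transformations and the removal of Items that are no longer needed, and this is precisely what makes crashes more likely. When analyzing crashes like this one, the large amount of code involved and its complex logic often make the relevant fix hard to locate. Handling such problems well therefore requires familiarity with the code's execution flow as well as experience in solving complex issues of this kind.</p>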
<hr>
<p>Enjoy GreatSQL 😃</p>
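<h2 id="关于-greatsql">About GreatSQL</h2>
<p>GreatSQL is an open-source database developed in China and aimed at financial-grade applications. Its core features include high performance, high reliability, ease of use, and strong security. It can be used in production as an alternative to MySQL or Percona Server, is completely free, and is compatible with MySQL and Percona Server.</p>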
Source: https://www.cnblogs.com/greatsql/p/18892710