千万级用户系统的SQL调优实战
<p data-id="pd157317-neFJWEjj"><img title="千万级用户系统的SQL调优实战" alt="千万级用户系统的SQL调优实战" src="https://zhuji.jb51.net/uploads/img/202305/e81dd97b527f56ed94a77c8473eef35c.jpg"></p>
<p data-id="pd157317-IRXWbTc4">
某系统专门通过各种条件筛选大量用户,接着对那些用户去推送一些消息:</p>
<ul data-id="ucd67dc5-INWU1FoD">
<li data-id="l20de63f-U5sQiljX">
一些促销活动消息</li>
<li data-id="l20de63f-vqMlIXrS">
让你办会员卡的消息</li>
<li data-id="l20de63f-2HCVUhu6">
告诉你有一个特价商品的消息</li>
</ul>
<p data-id="pd157317-9NOVOep7">
通过一些条件筛选出大量用户,针对这些用户做推送,该过程较耗时-筛选用户过程。</p>
<p data-id="pd157317-g3R5k8Lr">
用户日活百万级,注册用户千万级,而且若还没有进行分库分表,则该DB里的用户表可能就一张,单表上千万的用户数据。</p>
<p data-id="pd157317-2ZUBWiQc">
对运营系统筛选用户的SQL:</p>
<pre>
<span class="cm-variable">SELECT</span> <span class="cm-variable">id</span>, <span class="cm-variable">name</span> <span class="cm-variable">FROM</span> <span class="cm-variable">users</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">id</span> <span class="cm-variable">IN</span> ( <span class="cm-variable">SELECT</span> <span class="cm-variable">user_id</span> <span class="cm-variable">FROM</span> <span class="cm-variable">users_extent_info</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">latest_login_time</span> <span class="cm-operator"><</span> <span class="cm-variable">xxxxx</span> ) </pre>
<p data-id="pd157317-5NfPunns">
一般存储用户数据的表会分为两张表:</p>
<ul data-id="ucd67dc5-gbGAZGS2">
<li data-id="l20de63f-rHjglBBV">
存储用户的核心数据,如id、name、昵称、手机号之类的信息,也就是上面SQL语句里的users表</li>
<li data-id="l20de63f-rOIRFksG">
存储用户的一些拓展信息,比如说家庭住址、兴趣爱好、最近一次登录时间之类的,即users_extent_info表</li>
</ul>
<p data-id="pd157317-T1bDfqwJ">
有个子查询,里面针对用户的拓展信息表,即users_extent_info查下最近一次登录时间<某个时间点的用户,可以查询最近才登录过的用户,也可查询很长时间未登录的用户,然后给他们发push,无论哪种场景, 该SQL都适用。</p>
<p data-id="pd157317-ExvMRJ2M">
然后在外层查询,用id IN子句查询 id 在子查询结果范围里的users表的所有数据,此时该SQL突然会查出很多数据,可能几千、几万、几十万,所以执行此类SQL前,都会先执行count:</p>
<pre>
<span class="cm-variable">SELECT</span> <span class="cm-variable">COUNT</span>(<span class="cm-variable">id</span>) <span class="cm-variable">FROM</span> <span class="cm-variable">users</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">id</span> <span class="cm-variable">IN</span> ( <span class="cm-variable">SELECT</span> <span class="cm-variable">user_id</span> <span class="cm-variable">FROM</span> <span class="cm-variable">users_extent_info</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">latest_login_time</span> <span class="cm-operator"><</span> <span class="cm-variable">xxxxx</span> )</pre>
<p data-id="pd157317-LNIHqH8E">
然后内存里做个小批量,多批次读取数据的操作,比如判断如果在1000条以内,那么就一下子读取出来,若超过1000条,可通过LIMIT语句,每次就从该结果集里查1000条数据,查1000条就做次批量PUSH,再查下一波1000条。</p>
<p data-id="pd157317-5OfHF00D">
就是在千万级数据量大表场景下,上面SQL直接轻松跑出来耗时几十s,不优化不行!</p>
<p data-id="pd157317-aBJO6Ds0">
今天咱们继续来看这个千万级用户场景下的运营系统SQL调优案例,上次已经给大家说了一下业务背景 以及SQL,这个SQL就是如下的一个:</p>
<pre>
<span class="cm-variable">SELECT</span> <span class="cm-variable">COUNT</span>(<span class="cm-variable">id</span>) <span class="cm-variable">FROM</span> <span class="cm-variable">users</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">id</span> <span class="cm-variable">IN</span> (<span class="cm-variable">SELECT</span> <span class="cm-variable">user_id</span> <span class="cm-variable">FROM</span> <span class="cm-variable">users_extent_info</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">latest_login_time</span> <span class="cm-operator"><</span> <span class="cm-variable">xxxxx</span>)</pre>
<p data-id="pd157317-C49o5pDJ">
系统运行时,先COUNT查该结果集有多少数据,再分批查询。然而COUNT在千万级大表场景下,都要花几十s。实际上每个不同的MySQL版本都可能会调整生成执行计划的方式。</p>
<p data-id="pd157317-EP5SKXmE">
通过:</p>
<pre>
<span class="cm-variable">EXPLAIN</span> <span class="cm-variable">SELECT</span> <span class="cm-variable">COUNT</span>(<span class="cm-variable">id</span>) <span class="cm-variable">FROM</span> <span class="cm-variable">users</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">id</span> <span class="cm-variable">IN</span> ( <span class="cm-variable">SELECT</span> <span class="cm-variable">user_id</span> <span class="cm-variable">FROM</span> <span class="cm-variable">users_extent_info</span> <span class="cm-variable">WHERE</span> <span class="cm-variable">latest_login_time</span> <span class="cm-operator"><</span> <span class="cm-variable">xxxxx</span> )</pre>
<p data-id="pd157317-EqjCjBLg">
如下执行计划是为了调优,在测试环境的单表2万条数据场景,即使是5万条数据,当时这个SQL都跑了十多s,注意执行计划里的数据量</p>
<p data-id="pd157317-25TzF6cr">
<strong>执行计划里的第三行</strong></p>
<p data-id="pd157317-XahSbd07">
先子查询,针对users_extent_info,使用idx_login_time索引,做了range类型的索引范围扫描,查出4561条数据,没有做额外筛选,所以filtered=100%。</p>
<p data-id="pd157317-lEqhbfy0">
MATERIALIZED:这里把子查询的4561条数据代表的结果集进行了物化,物化成了一个临时表,这个临时表物化,一定是会把4561条数据临时落到磁盘文件里去的,这过程很慢。</p>
<p data-id="pd157317-xP1CYDLj">
<strong>第二条执行计划</strong></p>
<p data-id="pd157317-R1Ehsc9H">
针对users表做了一个全表扫描,在全表扫描的时候扫出来49651条数据,Extra=Using join buffer,此处居然在执行join。</p>
<p data-id="pd157317-F7kO2fKO">
<strong>执行计划里的第一条</strong></p>
<p data-id="pd157317-cEets0DJ">
针对子查询产出的一个物化临时表,即做了个全表查询,把里面的数据都扫描了一遍。</p>
<p data-id="pd157317-IvI7liTv">
为何对这临时表进行全表扫描?让users表的每条数据都和物化临时表里的数据进行join,所以针对users表里的每条数据,只能是去全表扫描一遍物化临时表,从物化临时表里确认哪条数据和他匹配,才能筛选出一条结果。</p>
<p data-id="pd157317-xsrzxtUo">
第二条执行计划的全表扫描结果表明一共扫到49651条,但全表扫描过程中,因为和物化临时表执行join,而物化临时表里就4561条数据,所以最终第二条执行计划的filtered=10%,即最终从users表里也筛选出4000多条数据。</p>
<h4 data-id="h6e90be6-TAooXYkm" id="h6e90be6-TAooXYkm">
到底为什么慢</h4>
<pre>
<span class="cm-operator">|</span> <span class="cm-variable">id</span> <span class="cm-operator">|</span> <span class="cm-variable">select_type</span> <span class="cm-operator">|</span> <span class="cm-variable">table</span> <span class="cm-operator">|</span> <span class="cm-variable">type</span> <span class="cm-operator">|</span> <span class="cm-variable">key</span> <span class="cm-operator">|</span> <span class="cm-variable">rows</span> <span class="cm-operator">|</span> <span class="cm-variable">filtered</span> <span class="cm-operator">|</span> <span class="cm-variable">Extra</span> <span class="cm-operator">|</span> <span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">-</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">-</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">-</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">-</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">--</span><span class="cm-operator">-</span><span class="cm-operator">+</span><span class="cm-operator">--</span><span class="cm-operator">-</span> <span class="cm-operator">|</span> <span class="cm-number">1</span> <span class="cm-operator">|</span> <span class="cm-variable">SIMPLE</span> <span class="cm-operator">|</span> <span class="cm-operator">|</span> <span class="cm-variable">ALL</span> <span class="cm-operator">|</span> <span class="cm-variable">NULL</span> <span class="cm-operator">|</span> <span class="cm-variable">NULL</span> <span class="cm-operator">|</span> <span class="cm-number">100.00</span> <span class="cm-operator">|</span> <span class="cm-variable">NULL</span> <span class="cm-operator">|</span> <span class="cm-operator">|</span> <span class="cm-number">1</span> <span class="cm-operator">|</span> <span class="cm-variable">SIMPLE</span> <span class="cm-operator">|</span> <span class="cm-variable">users</span> <span class="cm-operator">|</span> <span class="cm-variable">ALL</span> <span class="cm-operator">|</span> <span class="cm-variable">NULL</span> <span class="cm-operator">|</span> <span class="cm-number">49651</span> <span class="cm-operator">|</span> <span class="cm-number">10.00</span> <span class="cm-operator">|</span> <span class="cm-variable">Using</span> <span class="cm-variable">where</span>; <span class="cm-variable">Using</span> <span class="cm-variable">join</span> <span class="cm-variable">buffer</span>(<span class="cm-variable">Block</span> <span class="cm-variable">Nested</span> <span class="cm-variable">Loop</span>) <span class="cm-operator">|</span> <span class="cm-operator">|</span> <span class="cm-number">2</span> <span class="cm-operator">|</span> <span class="cm-variable">MATERIALIZED</span> <span class="cm-operator">|</span> <span class="cm-variable">users_extent_info</span> <span class="cm-operator">|</span> <span class="cm-variable">range</span> <span class="cm-operator">|</span> <span class="cm-variable">idx_login_time</span> <span class="cm-operator">|</span> <span class="cm-number">4561</span> <span class="cm-operator">|</span> <span class="cm-number">100.00</span> <span class="cm-operator">|</span> <span class="cm-variable">NULL</span> <span class="cm-operator">|</span></pre>
<p data-id="pd157317-LfSIfXA7">
先执行了子查询查出4561条数据,物化成临时表,接着对users主表全表扫描,扫描过程把每条数据都放到物化临时表里做全表扫描,本质在做join。</p>
<p data-id="pd157317-xyAhEnJ1">
对子查询的结果做了一次物化临时表,落地磁盘,接着还全表扫描users表,每条数据居然跑到一个没有索引的物化临时表里,又做了一次全表扫描找匹配的数据。</p>
<p data-id="pd157317-eJtjIsOK">
对users表的全表扫描耗时吗?</p>
<p data-id="pd157317-UVJLFxms">
对users表的每一条数据跑到物化临时表里做全表扫描耗时吗?</p>
<p data-id="pd157317-ELmQcWXW">
所以必然非常慢,几乎用不到索引。为什么MySQL会这样呢?</p>
<p data-id="pd157317-ZyMquvdW">
执行完上述SQL的EXPLAIN命令,看到执行计划之后,再执行:</p>
<h4 data-id="h6e90be6-gOTAedaB" id="h6e90be6-gOTAedaB">
show warnings</h4>
<p data-id="pd157317-zidRFIrQ">
显示出:</p>
<pre>
<span class="cm-comment">/* select#1 */</span> <span class="cm-variable">select</span> <span class="cm-variable">count</span>( <span class="cm-variable">d2</span>. <span class="cm-property">users</span> . <span class="cm-property">user_id</span> <span class="cm-string-2">`) AS </span> <span class="cm-string-2">COUNT(users.user_id)`</span> <span class="cm-variable">from</span> <span class="cm-variable">d2</span> . <span class="cm-variable">users</span> <span class="cm-variable">users</span> <span class="cm-variable">semi</span> <span class="cm-variable">join</span> <span class="cm-variable">xxxxxx</span></pre>
<p data-id="pd157317-vQBiq347">
注意 semi join ,MySQL在这里,生成执行计划的时候,自动就把一个普通IN子句,“优化”成基于semi join来进行IN+子查询的操作。那对users表不是全表扫描了吗?对users表里每条数据,去对物化临时表全表扫描做semi join,无需将users表里的数据真的跟物化临时表里的数据join。只要users表里的一条数据,在物化临时表能找到匹配数据,则users表里的数据就会返回,这就是semi join,用来做筛选。</p>
<p data-id="pd157317-F3Iv63o2">
所以就是semi join和物化临时表导致的慢题,那怎么优化?</p>
<h4 data-id="h6e90be6-Wf9Vf9Cd" id="h6e90be6-Wf9Vf9Cd">
做个实验</h4>
<p data-id="pd157317-fMXBuLVj">
执行</p>
<pre>
SET optimizer_switch='semijoin=off'</pre>
<p data-id="pd157317-IGR3THUC">
关闭半连接优化,再执行EXPLAIN发现恢复为正常状态:</p>
<ul data-id="ucd67dc5-jlFFJAUI">
<li data-id="l20de63f-NUV4OH7N">
有个SUBQUERY子查询,基于range方式去扫描索引,搜索出4561条数据</li>
<li data-id="l20de63f-QRFRKMUQ">
接着有个PRIMARY类型主查询,直接基于id这个PRIMARY主键聚簇索引去执行的搜索</li>
<li data-id="l20de63f-ei0mPO86">
然后再把这个SQL语句真实跑一下看看,性能竟然提升了几十倍,仅100多ms。</li>
</ul>
<p data-id="pd157317-H0HcRkj0">
所以,其实反而是MySQL自动执行的semi join半连接优化,导致了极差性能,关闭即可。</p>
<p data-id="pd157317-zjzMAobW">
生产环境当然不能随意更改这些设置,于是想了多种办法尝试去修改SQL语句的写法,在不影响其语义情况下,尽可能改变SQL语句的结构和格式,最终尝试出如下写法:</p>
<pre>
SELECT COUNT(id)
FROM users
WHERE (
id IN (
SELECT user_id
FROM users_extent_info
WHERE latest_login_time <span class="cm-tag cm-bracket"><</span> <span class="cm-tag">xxxxx)</span> <span class="cm-attribute">OR</span> <span class="cm-attribute">id</span> <span class="cm-attribute">IN</span> <span class="cm-attribute">(</span> <span class="cm-attribute">SELECT</span> <span class="cm-attribute">user_id</span> <span class="cm-attribute">FROM</span> <span class="cm-attribute">users_extent_info</span> <span class="cm-attribute">WHERE</span> <span class="cm-attribute">latest_login_time</span> <span class="cm-tag cm-error">< -1)</span> )</pre>
<p data-id="pd157317-Ec8nFLcS">
上述写法下,WHERE语句的OR后面的第二个条件,根本不可能成立,因为没有数据的latest_login_time<-1,所以那不会影响SQL业务语义,但改变SQL后,执行计划也会变,就没有再semi join优化了,而是常规地用了子查询,主查询也是基于索引,同样达到几百ms 性能优化。</p>
<p data-id="pd157317-tNcrhY4T">
所以最核心的,还是看懂SQL执行计划,分析慢的原因,尽量避免全表扫描,务必用上索引。</p>
<p>
原文地址:https://www.toutiao.com/a7068935781741855262/</p>
頁:
[1]