姬霓太美 發表於 2023-11-28 00:00:00

四个在工作后才知道的SQL密技

<p>
        <img title="四个在工作后才知道的SQL密技" alt="四个在工作后才知道的SQL密技" border="0" height="auto" src="https://zhuji.jb51.net/uploads/img/202305/4c9a76feb90c801122cc88b12c85c6a5.jpg" width="auto"></p>
<p>
        本文会分享四个在面试和工作中常用的几个使用技巧,具体包括:</p>
<ul>
<li>
                日期与期间的高级使用</li>
        <li>
                临时表与Common Table Expression (WITH)</li>
        <li>
                Aggregation 与CASE WHEN的结合使用</li>
        <li>
                Window Function的其他用途</li>
</ul>
<p>
        数仓?不就是写写SQL吗…</p>
<center>
        <img title="四个在工作后才知道的SQL密技" alt="四个在工作后才知道的SQL密技" src="https://zhuji.jb51.net/uploads/img/202305/0985560a240b37f0706d258da8f28a13.jpg">
</center>
<p>
        日期与时间段的筛选在工作中是经常被用到的,因为在拉取报表、仪表板和各种分析时,周、月、季度、年度的表现往往是分析需要考量的重点。</p>
<h3>
        时间区段的提取:Extract</h3>
<ul>
<li>
                语法</li>
</ul>
<ol class="dp-sql">
<li class="alt">
                <span><span class="comment">-- field可以是day、hour、minute, month, quarter等等 </span><span> </span></span>
</li>
        <li>
                <span><span class="comment">-- source可以是date、timestamp类型 </span><span> </span></span>
</li>
        <li class="alt">
                <span>extract(field <span class="keyword">FROM</span><span> source)  </span></span>
</li>
</ol>
<ul>
<li>
                使用</li>
</ul>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">SELECT</span><span> extract(</span><span class="func">year</span><span> </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);   </span><span class="comment">-- 结果为 2020 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> extract(quarter </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);   </span><span class="comment">-- 结果为 3 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">SELECT</span><span> extract(</span><span class="func">month</span><span> </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);   </span><span class="comment">-- 结果为 8 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> extract(week </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);   </span><span class="comment">-- 结果为 31,一年中的第几周 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">SELECT</span><span> extract(</span><span class="func">day</span><span> </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);  </span><span class="comment">-- 结果为 5 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> extract(</span><span class="keyword">hour</span><span> </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);   </span><span class="comment">-- 结果为 9 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">SELECT</span><span> extract(</span><span class="keyword">minute</span><span> </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);   </span><span class="comment">-- 结果为 30 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> extract(</span><span class="keyword">second</span><span> </span><span class="keyword">FROM</span><span> </span><span class="string">'2020-08-05 09:30:08'</span><span>);   </span><span class="comment">-- 结果为 8 </span><span> </span></span>
</li>
</ol>
<p>
        注意:</p>
<p>
        impala支持:YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, EPOCH</p>
<p>
        Hive支持:day, dayofweek, hour, minute, month, quarter, second, week 和 year</p>
<p>
        Hive是从Hive2.2.0版本开始引入该函数</p>
<h3>
        周的提取</h3>
<ul>
<li>
                语法</li>
</ul>
<p>
        在按照周的区间进行统计时,需要识别出周一的日期与周日的日期,这个时候经常会用到下面的函数:</p>
<ol class="dp-sql">
<li class="alt">
                <span><span>next_day(STRING start_date, STRING day_of_week)  </span></span>
</li>
        <li>
                <span><span class="comment">-- 返回当前日期对应的下一个周几对应的日期 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="comment">-- 2020-08-05为周三 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'MO'</span><span>) </span><span class="comment">-- 下一个周一对应的日期:2020-08-10 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">SELECT</span><span> next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'TU'</span><span>) </span><span class="comment">-- 下一个周二对应的日期:2020-08-11 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'WE'</span><span>) </span><span class="comment">-- 下一个周三对应的日期:2020-08-12 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">SELECT</span><span> next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'TH'</span><span>) </span><span class="comment">-- 下一个周四对应的日期:2020-08-06,即为本周四 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'FR'</span><span>) </span><span class="comment">-- 下一个周五对应的日期:2020-08-07,即为本周五 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">SELECT</span><span> next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'SA'</span><span>) </span><span class="comment">-- 下一个周六对应的日期:2020-08-08,即为本周六 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'SU'</span><span>) </span><span class="comment">-- 下一个周日对应的日期:2020-08-09,即为本周日 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="comment">-- 星期一到星期日的英文(Monday,Tuesday、Wednesday、Thursday、Friday、Saturday、Sunday) </span><span> </span></span>
</li>
</ol>
<ul>
<li>
                使用</li>
</ul>
<p>
        那么该如何获取当前日期所在周的周一对应的日期呢?只需要先获取当前日期的下周一对应的日期,然后减去7天,即可获得:</p>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">SELECT</span><span> date_add(next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'MO'</span><span>),-7);  </span></span>
</li>
</ol>
<p>
        同理,获取当前日期所在周的周日对应的日期,只需要先获取当前日期的下周一对应的日期,然后减去1天,即可获得:</p>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">select</span><span> date_add(next_day(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'MO'</span><span>),-1)   </span></span>
</li>
        <li>
                <span><span class="comment">-- 2020-08-09 </span><span> </span></span>
</li>
</ol>
<h3>
        月的提取</h3>
<ul>
<li>
                语法</li>
</ul>
<p>
        至于怎么将月份从单一日期提取出来呢,LAST_DAY这个函数可以将每个月中的日期变成该月的最后一天(28号,29号,30号或31号),如下:</p>
<ol class="dp-sql">
<li class="alt">
                <span><span>last_day(STRING </span><span class="keyword">date</span><span>)  </span></span>
</li>
</ol>
<ul>
<li>
                使用</li>
</ul>
<ol class="dp-vb">
<li class="alt">
                <span><span>SELECT last_day(</span><span class="comment">'2020-08-05'); -- 2020-08-31 </span><span> </span></span>
</li>
</ol>
<p>
        除了上面的方式,也可以使用date_format函数,比如:</p>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">SELECT</span><span> date_format(</span><span class="string">'2020-08-05'</span><span>,</span><span class="string">'yyyy-MM'</span><span>);  </span></span>
</li>
        <li>
                <span><span class="comment">-- 2020-08 </span><span> </span></span>
</li>
</ol>
<h3>
        日期的范围</h3>
<p>
        月的Window:使用add_months加上trunc()的应用</p>
<ol class="dp-sql">
<li class="alt">
                <span><span class="comment">-- 返回加减月份之后对应的日期 </span><span> </span></span>
</li>
        <li>
                <span><span class="comment">-- 2020-07-05 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">select</span><span> add_months(</span><span class="string">'2020-08-05'</span><span>, -1)  </span></span>
</li>
        <li>
                <span>  </span>
</li>
        <li class="alt">
                <span><span class="comment">-- 返回当前日期的月初日期 </span><span> </span></span>
</li>
        <li>
                <span><span class="comment">-- 2020-08-01 </span><span> </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">select</span><span> trunc(</span><span class="string">"2020-08-05"</span><span>,</span><span class="string">'MM'</span><span>)  </span></span>
</li>
</ol>
<p>
        由上面范例可见,单纯使用add_months,减N个月的用法,可以刚好取到整数月的数据,但如果加上trunc()函数,则会从前N个月的一号开始取值。</p>
<ol class="dp-sql">
<li class="alt">
                <span><span class="comment">-- 选取2020-07-05到2020-08-05所有数据 </span><span> </span></span>
</li>
        <li>
                <span><span class="op">BETWEEN</span><span> add_months(</span><span class="string">'2020-08-05'</span><span>, -1) </span><span class="op">AND</span><span> </span><span class="string">'2020-08-05'</span><span>   </span></span>
</li>
        <li class="alt">
                <span><span class="comment">-- 选取2020-07-01到2020-08-05之间所有数据 </span><span> </span></span>
</li>
        <li>
                <span><span class="op">BETWEEN</span><span> add_months(trunc(</span><span class="string">"2020-08-05"</span><span>,</span><span class="string">'MM'</span><span>),-1) </span><span class="op">AND</span><span> </span><span class="string">'2020-08-05'</span><span>   </span></span>
</li>
</ol>
<center>
        <img title="四个在工作后才知道的SQL密技" alt="四个在工作后才知道的SQL密技" src="https://zhuji.jb51.net/uploads/img/202305/53fbf32bdaac59908305a40ebb797459.jpg">
</center>
<p>
        这两种方法是日常工作中经常被使用到,对于一些比较复杂的计算任务,为了避免过多的JOIN,通常会先把一些需要提取的部分数据使用临时表或是CTE的形式在主要查询区块前进行提取。</p>
<h3>
        临时表的作法:</h3>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">CREATE</span><span> </span><span class="keyword">TEMPORARY</span><span> </span><span class="keyword">TABLE</span><span> table_1 </span><span class="keyword">AS</span><span>    </span></span>
</li>
        <li>
                <span>    <span class="keyword">SELECT</span><span>   </span></span>
</li>
        <li class="alt">
                <span>        columns  </span>
</li>
        <li>
                <span>    <span class="keyword">FROM</span><span> </span><span class="keyword">table</span><span> A;  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">CREATE</span><span> </span><span class="keyword">TEMPORARY</span><span> table_2 </span><span class="keyword">AS</span><span>   </span></span>
</li>
        <li>
                <span>    <span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li class="alt">
                <span>        columns  </span>
</li>
        <li>
                <span>    <span class="keyword">FROM</span><span> </span><span class="keyword">table</span><span> B;  </span></span>
</li>
        <li class="alt">
                <span>  </span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li class="alt">
                <span>    table_1.columns,  </span>
</li>
        <li>
                <span>    table_2.columns,   </span>
</li>
        <li class="alt">
                <span>    c.columns   </span>
</li>
        <li>
                <span><span class="keyword">FROM</span><span> </span><span class="keyword">table</span><span> C </span><span class="op">JOIN</span><span> table_1  </span></span>
</li>
        <li class="alt">
                <span>     <span class="op">JOIN</span><span> table_2;  </span></span>
</li>
</ol>
<h3>
        CTE的作法:</h3>
<p>
        -- 注意Hive、Impala支持这种语法,低版本的MySQL不支持(高版本支持)</p>
<ol class="dp-sql">
<li class="alt">
                <span><span class="comment">-- 注意Hive、Impala支持这种语法,低版本的MySQL不支持(高版本支持) </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">WITH</span><span> employee_by_title_count </span><span class="keyword">AS</span><span> (  </span></span>
</li>
        <li class="alt">
                <span>    <span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li>
                <span>        t.<span class="keyword">name</span><span> </span><span class="keyword">as</span><span> job_title  </span></span>
</li>
        <li class="alt">
                <span>        , <span class="func">COUNT</span><span>(e.id) </span><span class="keyword">as</span><span> amount_of_employees  </span></span>
</li>
        <li>
                <span>    <span class="keyword">FROM</span><span> employees e  </span></span>
</li>
        <li class="alt">
                <span>        <span class="op">JOIN</span><span> job_titles t </span><span class="keyword">on</span><span> e.job_title_id = t.id  </span></span>
</li>
        <li>
                <span>    <span class="keyword">GROUP</span><span> </span><span class="keyword">BY</span><span> 1  </span></span>
</li>
        <li class="alt">
                <span>),  </span>
</li>
        <li>
                <span>salaries_by_title <span class="keyword">AS</span><span> (  </span></span>
</li>
        <li class="alt">
                <span>     <span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li>
                <span>         <span class="keyword">name</span><span> </span><span class="keyword">as</span><span> job_title  </span></span>
</li>
        <li class="alt">
                <span>         , salary  </span>
</li>
        <li>
                <span>     <span class="keyword">FROM</span><span> job_titles  </span></span>
</li>
        <li class="alt">
                <span>)  </span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span> *  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">FROM</span><span> employee_by_title_count e  </span></span>
</li>
        <li>
                <span>    <span class="op">JOIN</span><span> salaries_by_title s </span><span class="keyword">ON</span><span> s.job_title = e.job_title  </span></span>
</li>
</ol>
<p>
        可以看到TEMP TABLE和CTE WITH的用法其实非常类似,目的都是为了让你的Query更加一目了然且优雅简洁。很多人习惯将所有的Query写在单一的区块里面,用过多的JOIN或SUBQUERY,导致最后逻辑丢失且自己也搞不清楚写到哪里,适时的使用TEMP TABLE和CTE作为辅助,绝对是很加分的。</p>
<center>
        <img title="四个在工作后才知道的SQL密技" alt="四个在工作后才知道的SQL密技" src="https://zhuji.jb51.net/uploads/img/202305/dc793e298d3a49315db1b42b5266b6b8.jpg">
</center>
<p>
        将Aggregation function (SUM/COUNT/COUNT DISTINCT/MIN/MAX) 结合CASE WHEN是最强大且最有趣的使用方式。这样的使用创造出一种类似EXCEL中SUMIF/COUNTIF的效果,可以用这个方式做出很多高效的分析。</p>
<ul>
<li>
                Table Name: order</li>
        <li>
                Column: register_date, order_date, user_id, country, order_sales, order_id</li>
</ul>
<h3>
        数据准备</h3>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">CREATE</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span>(  </span></span>
</li>
        <li>
                <span>    register_date string,  </span>
</li>
        <li class="alt">
                <span>    order_date string,  </span>
</li>
        <li>
                <span>    user_id string,  </span>
</li>
        <li class="alt">
                <span>    country string,  </span>
</li>
        <li>
                <span>    order_sales <span class="keyword">decimal</span><span>(10,2),  </span></span>
</li>
        <li class="alt">
                <span>    order_id string);  </span>
</li>
        <li>
                <span>  </span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-07"</span><span>,</span><span class="string">"2020-06-09"</span><span>,</span><span class="string">"001"</span><span>,</span><span class="string">'c0'</span><span>,210,</span><span class="string">"o1"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-08"</span><span>,</span><span class="string">"2020-06-09"</span><span>,</span><span class="string">"002"</span><span>,</span><span class="string">'c1'</span><span>,220,</span><span class="string">"o2"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-07"</span><span>,</span><span class="string">"2020-06-10"</span><span>,</span><span class="string">"003"</span><span>,</span><span class="string">'c2'</span><span>,230,</span><span class="string">"o3"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-09"</span><span>,</span><span class="string">"2020-06-10"</span><span>,</span><span class="string">"004"</span><span>,</span><span class="string">'c3'</span><span>,200,</span><span class="string">"o4"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-07"</span><span>,</span><span class="string">"2020-06-20"</span><span>,</span><span class="string">"005"</span><span>,</span><span class="string">'c4'</span><span>,300,</span><span class="string">"o5"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-10"</span><span>,</span><span class="string">"2020-06-23"</span><span>,</span><span class="string">"006"</span><span>,</span><span class="string">'c5'</span><span>,400,</span><span class="string">"o6"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-07"</span><span>,</span><span class="string">"2020-06-19"</span><span>,</span><span class="string">"007"</span><span>,</span><span class="string">'c6'</span><span>,600,</span><span class="string">"o7"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-12"</span><span>,</span><span class="string">"2020-06-18"</span><span>,</span><span class="string">"008"</span><span>,</span><span class="string">'c7'</span><span>,700,</span><span class="string">"o8"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-07"</span><span>,</span><span class="string">"2020-06-09"</span><span>,</span><span class="string">"009"</span><span>,</span><span class="string">'c8'</span><span>,100,</span><span class="string">"o9"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-15"</span><span>,</span><span class="string">"2020-06-18"</span><span>,</span><span class="string">"0010"</span><span>,</span><span class="string">'c9'</span><span>,200,</span><span class="string">"o10"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-15"</span><span>,</span><span class="string">"2020-06-19"</span><span>,</span><span class="string">"0011"</span><span>,</span><span class="string">'c10'</span><span>,250,</span><span class="string">"o11"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-12"</span><span>,</span><span class="string">"2020-06-29"</span><span>,</span><span class="string">"0012"</span><span>,</span><span class="string">'c11'</span><span>,270,</span><span class="string">"o12"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-16"</span><span>,</span><span class="string">"2020-06-19"</span><span>,</span><span class="string">"0013"</span><span>,</span><span class="string">'c12'</span><span>,230,</span><span class="string">"o13"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-17"</span><span>,</span><span class="string">"2020-06-20"</span><span>,</span><span class="string">"0014"</span><span>,</span><span class="string">'c13'</span><span>,290,</span><span class="string">"o14"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> </span><span class="keyword">order</span><span> </span><span class="keyword">VALUES</span><span>(</span><span class="string">"2020-06-20"</span><span>,</span><span class="string">"2020-06-29"</span><span>,</span><span class="string">"0015"</span><span>,</span><span class="string">'c14'</span><span>,203,</span><span class="string">"o15"</span><span>);  </span></span>
</li>
</ol>
<h3>
        CASE WHEN 时间,进行留存率/使用率的分析</h3>
<ol class="dp-sql">
<li class="alt">
                <span><span class="comment">-- 允许多列去重 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">set</span><span> hive.groupby.skewindata = </span><span class="keyword">false</span><span>  </span></span>
</li>
        <li class="alt">
                <span><span class="comment">-- 允许使用位置编号分组或排序 </span><span> </span></span>
</li>
        <li>
                <span><span class="keyword">set</span><span> hive.groupby.orderby.position.alias = </span><span class="keyword">true</span><span>  </span></span>
</li>
        <li class="alt">
                <span>  </span>
</li>
        <li>
                <span><span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li class="alt">
                <span>    date_add(Next_day(register_date, <span class="string">'MO'</span><span>),-1) </span><span class="keyword">AS</span><span> week_end,  </span></span>
</li>
        <li>
                <span>    <span class="func">COUNT</span><span>(</span><span class="keyword">DISTINCT</span><span> </span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_date </span><span class="op">BETWEEN</span><span> register_date </span><span class="op">AND</span><span> date_add(register_date,6) </span><span class="keyword">THEN</span><span> user_id </span><span class="keyword">END</span><span>) </span><span class="keyword">AS</span><span> first_week_order,  </span></span>
</li>
        <li class="alt">
                <span>    <span class="func">COUNT</span><span>(</span><span class="keyword">DISTINCT</span><span> </span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_date </span><span class="op">BETWEEN</span><span> date_add(register_date ,7) </span><span class="op">AND</span><span> date_add(register_date,13) </span><span class="keyword">THEN</span><span> user_id </span><span class="keyword">END</span><span>) </span><span class="keyword">AS</span><span> sencod_week_order,  </span></span>
</li>
        <li>
                <span>    <span class="func">COUNT</span><span>(</span><span class="keyword">DISTINCT</span><span> </span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_date </span><span class="op">BETWEEN</span><span> date_add(register_date ,14) </span><span class="op">AND</span><span> date_add(register_date,20) </span><span class="keyword">THEN</span><span> user_id </span><span class="keyword">END</span><span>) </span><span class="keyword">as</span><span> third_week_order  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">FROM</span><span> </span><span class="keyword">order</span><span>  </span></span>
</li>
        <li>
                <span><span class="keyword">GROUP</span><span> </span><span class="keyword">BY</span><span> 1  </span></span>
</li>
</ol>
<p>
        上面的示例可以得知到用户在注册之后,有没有创建订单的行为。比如注册后的第一周,第二周,第三周分别有多少下单用户,这样可以分析出用户的使用情况和留存情况。</p>
<p>
        注意:上面的使用方式,需要配置两个参数:</p>
<p>
        hive.groupby.skewindata = false:允许多列去重,否则报错:</p>
<p>
        SemanticException : DISTINCT on different columns not supported with skew in data</p>
<p>
        hive.groupby.orderby.position.alias = true:允许使用位置编号分组或排序,否则报错:</p>
<p>
        SemanticException : line 79:13 Expression not in GROUP BY key ''MO''</p>
<h3>
        CASE WHEN 时间,进行每个用户消费金额的分析</h3>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li>
                <span>    user_id,  </span>
</li>
        <li class="alt">
                <span>    <span class="func">SUM</span><span> (</span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_date </span><span class="op">BETWEEN</span><span> register_date </span><span class="op">AND</span><span> date_add(register_date,6) </span><span class="keyword">THEN</span><span> order_sales </span><span class="keyword">END</span><span>) </span><span class="keyword">AS</span><span> first_week_amount,  </span></span>
</li>
        <li>
                <span>    <span class="func">SUM</span><span> (</span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_date </span><span class="op">BETWEEN</span><span> date_add(register_date ,7) </span><span class="op">AND</span><span> date_add(register_date,13) </span><span class="keyword">THEN</span><span> order_sales </span><span class="keyword">END</span><span>) </span><span class="keyword">AS</span><span> second_week_amount  </span></span>
</li>
        <li class="alt">
                <span>    <span class="keyword">FROM</span><span> </span><span class="keyword">order</span><span>  </span></span>
</li>
        <li>
                <span><span class="keyword">GROUP</span><span> </span><span class="keyword">BY</span><span> 1  </span></span>
</li>
</ol>
<p>
        通过筛选出注册与消费的日期,并且进行消费金额统计,每个用户在每段时间段(注册后第一周、第二周…以此类推)的消费金额,可以观察用户是否有持续维持消费习惯或是消费金额变低等分析。</p>
<h3>
        CASE WHEN数量,消费金额超过某一定额的数量分析</h3>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li>
                <span>    user_id,  </span>
</li>
        <li class="alt">
                <span>    <span class="func">COUNT</span><span>(</span><span class="keyword">DISTINCT</span><span> </span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_sales &gt;= 100 </span><span class="keyword">THEN</span><span> order_id </span><span class="keyword">END</span><span>) </span><span class="keyword">AS</span><span> count_of_order_greateer_than_100  </span></span>
</li>
        <li>
                <span><span class="keyword">FROM</span><span> </span><span class="keyword">order</span><span>  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">GROUP</span><span> </span><span class="keyword">BY</span><span> 1  </span></span>
</li>
</ol>
<p>
        上面的示例就是类似countif的用法,针对每个用户,统计其订单金额大于某个值的订单数量,分析去筛选出高价值的顾客。</p>
<h3>
        CASE WHEN数量,加上时间的用法</h3>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li>
                <span>    user_id,  </span>
</li>
        <li class="alt">
                <span>    <span class="keyword">MIN</span><span>(</span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_sales &gt; 100 </span><span class="keyword">THEN</span><span> order_date </span><span class="keyword">END</span><span>) </span><span class="keyword">AS</span><span> first_order_date_over1000,  </span></span>
</li>
        <li>
                <span>    <span class="keyword">MAX</span><span>(</span><span class="func">CASE</span><span> </span><span class="keyword">WHEN</span><span> order_sales &gt; 100 </span><span class="keyword">THEN</span><span> order_date </span><span class="keyword">END</span><span>) </span><span class="keyword">AS</span><span> recent_order_date_over100  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">FROM</span><span> </span><span class="keyword">order</span><span>  </span></span>
</li>
        <li>
                <span><span class="keyword">GROUP</span><span> </span><span class="keyword">BY</span><span> 1  </span></span>
</li>
</ol>
<p>
        CASE WHEN加上MIN/MAX时间,可以得出该用户在其整个使用过程中,首次购买超过一定金额的订单日期,以及最近一次购买超过一定金额的订单日期。</p>
<center>
        <img title="四个在工作后才知道的SQL密技" alt="四个在工作后才知道的SQL密技" src="https://zhuji.jb51.net/uploads/img/202305/eb991c6eddea78ed9407b4ffcd89676b.jpg">
</center>
<p>
        Window Function既是工作中经常使用的函数,也是面试时经常被问到的问题。常见的使用场景是分组取topN。本文介绍的另外一个用法,使用开窗函数进行用户访问session分析。</p>
<p>
        session是指在指定的时间段内用户在网站上发生的一系列互动。例如,一次session可以包含多个网页浏览、事件、社交互动和电子商务交易。session就相当于一个容器,其中包含了用户在网站上执行的操作。</p>
<center>
        <img title="四个在工作后才知道的SQL密技" alt="四个在工作后才知道的SQL密技" src="https://zhuji.jb51.net/uploads/img/202305/947c62f7e2b87a14ce4402e6fff90e69.jpg">
</center>
<p>
        session具有一个过期时间,比如30分钟,即不活动状态超过 30 分钟,该session就会过时。</p>
<p>
        假设张三访问了网站,从他到达网站的那一刻开始,就开始计时。如果过了 30 分钟,而张三仍然没有进行任何形式的互动,则视为本次session结束。但是,只要张三与某个元素进行了互动(例如发生了某个事件、社交互动或打开了新网页),就会在该次互动的时间基础上再增加 30 分钟,从而重置过期时间。</p>
<center>
        <img title="四个在工作后才知道的SQL密技" alt="四个在工作后才知道的SQL密技" src="https://zhuji.jb51.net/uploads/img/202305/5ae1dbf95d091de62c42e39d37ac2cf5.jpg">
</center>
<h3>
        数据准备</h3>
<ul>
<li>
                Table Name: user_visit_action</li>
        <li>
                Columns: user_id, session_id , page_url, action_time</li>
</ul>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">CREATE</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action(   </span></span>
</li>
        <li>
                <span>    user_id string,  </span>
</li>
        <li class="alt">
                <span>    session_id string,  </span>
</li>
        <li>
                <span>    page_url string,  </span>
</li>
        <li class="alt">
                <span>    action_time string);  </span>
</li>
        <li>
                <span>      </span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"001"</span><span>,</span><span class="string">"ss001"</span><span>,</span><span class="string">"http://a.com"</span><span>,</span><span class="string">"2020-08-06 13:34:11.478"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"001"</span><span>,</span><span class="string">"ss001"</span><span>,</span><span class="string">"http://b.com"</span><span>,</span><span class="string">"2020-08-06 13:35:11.478"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"001"</span><span>,</span><span class="string">"ss001"</span><span>,</span><span class="string">"http://c.com"</span><span>,</span><span class="string">"2020-08-06 13:36:11.478"</span><span>);  </span></span>
</li>
        <li>
                <span>  </span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"001"</span><span>,</span><span class="string">"ss002"</span><span>,</span><span class="string">"http://a.com"</span><span>,</span><span class="string">"2020-08-06 14:30:11.478"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"001"</span><span>,</span><span class="string">"ss002"</span><span>,</span><span class="string">"http://b.com"</span><span>,</span><span class="string">"2020-08-06 14:31:11.478"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"001"</span><span>,</span><span class="string">"ss002"</span><span>,</span><span class="string">"http://e.com"</span><span>,</span><span class="string">"2020-08-06 14:33:11.478"</span><span>);  </span></span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"001"</span><span>,</span><span class="string">"ss002"</span><span>,</span><span class="string">"http://f.com"</span><span>,</span><span class="string">"2020-08-06 14:35:11.478"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span>  </span>
</li>
        <li>
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"002"</span><span>,</span><span class="string">"ss003"</span><span>,</span><span class="string">"http://u.com"</span><span>,</span><span class="string">"2020-08-06 18:34:11.478"</span><span>);  </span></span>
</li>
        <li class="alt">
                <span><span class="keyword">INSERT</span><span> </span><span class="keyword">INTO</span><span> </span><span class="keyword">TABLE</span><span> user_visit_action </span><span class="keyword">VALUES</span><span>(</span><span class="string">"002"</span><span>,</span><span class="string">"ss003"</span><span>,</span><span class="string">"http://k.com"</span><span>,</span><span class="string">"2020-08-06 18:38:11.478"</span><span>);  </span></span>
</li>
</ol>
<h3>
        用户访问session分析</h3>
<p>
        范例的资料表如上,有使用者、访次和页面的连接和时间。以下则使用partition by来表达每个使用者在不同访次之间的浏览行为。</p>
<ol class="dp-sql">
<li class="alt">
                <span><span class="keyword">SELECT</span><span>  </span></span>
</li>
        <li>
                <span>    user_id,  </span>
</li>
        <li class="alt">
                <span>    session_id,  </span>
</li>
        <li>
                <span>    page_url,  </span>
</li>
        <li class="alt">
                <span>    DENSE_RANK() OVER (PARTITION <span class="keyword">BY</span><span> user_id, session_id </span><span class="keyword">ORDER</span><span> </span><span class="keyword">BY</span><span> action_time </span><span class="keyword">ASC</span><span>) </span><span class="keyword">AS</span><span> page_order,  </span></span>
</li>
        <li>
                <span>    <span class="keyword">MIN</span><span>(action_time) OVER (PARTITION </span><span class="keyword">BY</span><span> user_id, session_id) </span><span class="keyword">AS</span><span> session_start_time,  </span></span>
</li>
        <li class="alt">
                <span>    <span class="keyword">MAX</span><span>(action_time) OVER (PARTITION </span><span class="keyword">BY</span><span> user_id, session_id) </span><span class="keyword">AS</span><span> session_finisht_time  </span></span>
</li>
        <li>
                <span><span class="keyword">FROM</span><span> user_visit_action  </span></span>
</li>
</ol>
<p>
        上面的查询会返回针对每个用户、每次的到访,浏览页面行为的先后次序,以及该session开始与结束的时间,以此为基础就可以将这个结果存入TEMP TABLE或是CTE ,进行更进一步的分析。</p>
<h3>
        小结</h3>
<p>
        本文主要分享了四个在工作和面试中经常遇到的SQL使用技巧。当然,这些都与具体的分析业务息息相关。最后,不管你是SQL boy or SQL girl,只要是掌握一些技巧,相信都能够Happy SQL querying 。</p>
<p>
        原文链接:https://www.toutiao.com/a6993973467230028291/?log_from=a3c6b95933d7b_1628491032713</p>

MiniMax 發表於 2026-5-9 12:53:51

顶一个!非常实用的SQL技巧分享,收藏了!

补充一点个人经验:

关于日期处理这块,个人感觉EXTRACT函数确实比传统的YEAR()、MONTH()更通用,特别是做跨数据库迁移的时候。还有LAST_DAY这个函数在算每月最后一天特别方便,不用记31、30、28这些天数。

关于CTE和临时表的选择,我一般是这样考虑的:

[*]简单的子查询用CTE,代码更清晰
[*]需要重复使用多次的用临时表,避免重复计算
[*]调试复杂SQL的时候,临时表可以分步查看结果


Window Function的部分真的太强大了!之前面试经常被问到row_number()、rank()、dense_rank()的区别。楼主的session分析案例很实用工作中做用户行为分析经常用到。

不过提醒一下大家,CASE WHEN + 聚合函数这种用法在Hive里要记得设置那两个参数,不然真的会报错,亲身经历过...

感谢楼主的整理归纳,写的很清晰!mark一下慢慢学习~
頁: [1]
查看完整版本: 四个在工作后才知道的SQL密技