Hive自定义函数(UDF)开发和应用流程

校团委组织部 發表於 2025-7-19 08:10:00

Hive自定义函数(UDF)开发和应用流程

<p></p><div class="toc"><div class="toc-container-header">目录</div><ul><li>引言</li><li>一、Hive自定义函数的类型</li><li>二、准备环境和工具</li><li>三、实际案例开发编译</li><li>四、前方有坑请注意</li><li>五、总结</li></ul></div><p></p>
<h1 id="引言">引言</h1>
<p> Hive作为大数据领域的核心计算引擎，凭借其强大的SQL支持和丰富的内置函数，早已成为数据开发者的效率利器。然而在实际业务场景中，面对复杂的数据处理需求时，仅仅依赖内置函数往往力不从心，当需要实现多步骤逻辑组合（如日期换算+字符串清洗+条件判断）时，开发者常需反复调用add_months、replace、substr等多个函数，甚至嵌套多层case when。偶尔使用一两次还可接受，但在同一段HQL脚本中需要多次重复组合使用时，不仅容易因疏忽导致参数顺序错误或函数遗漏，还会让代码变得冗余繁杂，可读性与维护性大幅下降。<br>
笔者近期在参与ODS→DW→DM数据链路开发时便深有体会，表数据按日期分区存储（分区格式为yyyyMM，如202507），数据随时间滚动更新，在汇总计算近3个月、6个月、12个月等周期指标时，需频繁对分区字段进行“日期换算+格式清洗”操作。这时候代码中充斥着 add_months(concat(substr(dt,1,4),'-',substr(dt,5,2)), -3) 这样的复杂表达式，不仅容易出错，更让SQL脚本变得惨不忍睹。<br>
Hive自定义函数（UDF）的出现，正好解决这一痛点。通过将高频复用的业务逻辑封装为UDF，开发者不仅能扩展Hive 的计算边界，更能将原本需要多行代码实现的功能，浓缩为一行简洁的 SQL 调用。这不仅大幅减少了重复代码，更让业务逻辑在SQL中清晰可读，显著提升了开发效率与代码可维护性。下面是笔者针对日期换算需求实现UDF的过程。</p>
<h1 id="一hive自定义函数的类型">一、Hive自定义函数的类型</h1>
<p> Hive自定义函数可以通过Java/Scala语言实现，主要有下面几种自定义函数类型：</p>
<table>
<thead>
<tr>
<th>类型</th>
<th>特点</th>
<th>使用场景</th>
</tr>
</thead>
<tbody>
<tr>
<td>UDF</td>
<td>单行输入 -> 单行输出（跟普通内置函数相似）</td>
<td>简单的计算，例如字符串截取、字符替换等</td>
</tr>
<tr>
<td>UDAF</td>
<td>多行输入 -> 单行输出（类似聚合函数）</td>
<td>自定义聚合功能数据逻辑，例如按条件统计个数或者做加权取平均值</td>
</tr>
<tr>
<td>UDTF</td>
<td>单行输入 -> 多行输出（跟lateral view explode功能相似）</td>
<td>进行行列转换、数据拆分或者JSON之类的文本解析</td>
</tr>
</tbody>
</table>
<p> 在日常开发中大多数场景使用的都是UDF，这是实现复杂业务场景的首选，开发过程也简单。</p>
<h1 id="二准备环境和工具">二、准备环境和工具</h1>
<p> 1.准备开发环境和工具</p>
<blockquote>
<p>OS: Windows 10<br>
Java: 8<br>
Hive: 2.7.4<br>
IDEA：社区版<br>
maven 3.9.11</p>
</blockquote>
<p>软件安装步骤这里就省略了，网上基本都能搜索到相关安装教程。</p>
<p> 2.MAVEN依赖配置POM.xml，添加Hive核心以来，确保与集群版本一致</p>
<pre><code class="language-xml"><?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>org.mycode</groupId>
<artifactId>SuperAddMonths</artifactId>
<version>1.0-SNAPSHOT</version>

<properties>
   <maven.compiler.source>8</maven.compiler.source>
   <maven.compiler.target>8</maven.compiler.target>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   <hive.version>2.3.10</hive.version>
</properties>

<dependencies>
   <dependency>
         <groupId>org.apache.hive</groupId>
         <artifactId>hive-exec</artifactId>
         <version>${hive.version}</version>
   </dependency>
</dependencies>

<build>
   <plugins>
         <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.10.1</version>
            <configuration>
               <source>1.8</source>
               <target>1.8</target>
               <encoding>UTF-8</encoding>
            </configuration>
         </plugin>
   </plugins>
</build>
</project>
</code></pre>
<h1 id="三实际案例开发编译">三、实际案例开发编译</h1>
<p> UDF示例代码SuperAddMonths.java</p>
<pre><code class="language-java">package org.mycode;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.IntObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

@Description(name = "super_add_months",
   value = "计算日期调整月份数结果，输入yyyyMM格式日期及调整月份数，返回相同格式yyyyMM日期",
   extended = "功能：将输入的 yyyyMM 格式日期（如 '202507'）按指定月份数（如 1）调整，返回调整后的 yyyyMM 格式日期。\n" +
            "参数说明：\n" +
            "- 第1个参数：输入日期字符串（格式 yyyyMM，非空）\n" +
            "- 第2个参数：调整月份数（整数，正数表示向后调整，负数表示向前调整，允许为 NULL）\n" +
            "返回值：调整后的日期字符串（格式 yyyyMM）\n" +
            "示例：\n" +
            "SELECT super_add_months('202507', 1); → '202508'（2025年7月 +1个月 → 2025年8月）\n" +
            "SELECT super_add_months('202507', -11); → '202408'（2025年7月 -11个月 → 2024年8月）\n" +
            "SELECT super_add_months('202507', NULL); → '202507'（偏移量为 NULL 时返回原日期）")
public class SuperAddMonths extends GenericUDF {
private static final DateTimeFormatter INIT_DATE_FORMAT = DateTimeFormatter.ofPattern("yyyyMMdd");
private static final DateTimeFormatter TARGET_DATE_FORMAT = DateTimeFormatter.ofPattern("yyyyMM");

@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
   if (arguments.length != 2) {
         throw new UDFArgumentException("调用函数需要传入2个参数，实际传入"+arguments.length+"个参数");
   }
   ObjectInspector firstArg = arguments;
   ObjectInspector secondArg = arguments;
   if (!(firstArg instanceof StringObjectInspector)) {
         throw new UDFArgumentException("第一个传入参数必须是字符串(日期格式为yyyyMM)");
   }
   if (!(secondArg instanceof LongObjectInspector || secondArg instanceof IntObjectInspector || secondArg == null)) {
         throw new UDFArgumentException("第二个传入参数必须是整型数字");
   }
   return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
}

@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
   // 获取第一个参数（字符串类型，Hive存储为Text）
   Text monthText = (Text) arguments.get();
   if (monthText == null) {
         return null; // 输入为NULL时返回NULL
   }
   String inputMonth = monthText.toString();

   // 获取第二个参数（Long或Int类型）
   Object offsetObj = arguments.get();
   if (offsetObj == null) {
         return monthText; // 偏移量为NULL时返回原月份
   }

   long offset;
   if (offsetObj instanceof LongWritable) {
         offset = ((LongWritable) offsetObj).get();
   } else if (offsetObj instanceof IntWritable) {
         offset = ((IntWritable) offsetObj).get();
   } else {
         throw new UDFArgumentException("第二个参数必须是Long或Int类型");
   }

   // 计算新月份
   try {
         // 补全为当月第一天（如"202301" → "20230101"）
         LocalDate firstDayOfMonth = LocalDate.parse(inputMonth + "01", INIT_DATE_FORMAT);
         LocalDate newMonth = firstDayOfMonth.plusMonths(offset);
         return new Text(newMonth.format(TARGET_DATE_FORMAT)); // 返回Text类型
   } catch (DateTimeParseException e) {
         throw new HiveException("日期格式错误，期望yyyyMM，实际输入：" + inputMonth, e);
   }
}

@Override
public String getDisplayString(String[] children) {
   return String.format("super_add_months(%s, %s)", children, children);
}
}

</code></pre>
<p> 编译打包上面代码并上传到HDFS，以笔者的需求为例，在使用UDF前判断日期范围的sql如下，假设日期字段是period，传参变量为p_period。对比使用UDF和内置函数，显然用自定义函数可以简洁高效的完成相同功能的逻辑，而且UDF还可以实现更复杂的业务需求。</p>
<p>上传jar包到HDFS</p>
<pre><code class="language-shell">hdfs dfs -put SuperAddMonths-1.0.jar /user/hive/function/
# 确认文件是否上传成功
hdfs dfs -ls /user/hive/function/
</code></pre>
<p>使用UDF方式</p>
<pre><code class="language-sql">add jar hdfs:///user/hive/function/SuperAddMonths-1.0.jar;
create temporary function super_add_months as 'org.mycode.SuperAddMonths';
-- 测试，查看返回结果是否正确
select super_add_months('202507', -12);
</code></pre>
<p>测试没问题就可以改写原有的SQL语句</p>
<pre><code class="language-sql">-- 使用Hive内置函数
select *
from table_xx....
where period <= '${p_period}'
and period > replace(substr(add_months(concat_ws('-', substr('${p_period}', 1, 4), substr('${p_period}', 5, 2), '01'), -12), 1, 7), '-', '')

-- 改写后
select *
from table_xx....
where period <= '${p_period}'
and period > super_add_months('${p_period}', -12)
</code></pre>
<p>下面语句可以用来查看函数相关信息，本文就不再赘述。</p>
<pre><code class="language-sql">show functions like '%super%'

describe function super_add_months;

describe function extended super_add_months;
</code></pre>
<h1 id="四前方有坑请注意">四、前方有坑请注意</h1>
<p>1、出现代码运行报错：ClassCastException java.lang.String cannot be cast to org.apache.hadoop.io.Text<br>
解：evaluate应该返回Text对象（与initialize声明的返回类型一致），而不是String。因为String是Java原生类型，而Hive内部使用Writable类型，所以需要将结果包装为Text对象</p>
<h1 id="五总结">五、总结</h1>
<p> Hive自定义函数是扩展SQL能力的一把利器，掌握这门技巧可以让达到事半功倍的效果。动手实践是掌握UDF开发的关键，不妨从一个小需求开始逐步积累经验！<br>
如果读者遇到其他问题欢迎评论区留言。</p>
<p><strong>参考资料</strong>：</p>
<ul>
<li>Hive 官方文档：Hive UDF Development</li>
<li>Java 时间 API：LocalDate 官方文档</li>
</ul>

</div>
<div id="MySignature" role="contentinfo">
<p>本文来自博客园，作者：计艺回忆路，转载请注明原文链接：https://www.cnblogs.com/ckmemory/p/18988754</p><br><br>
来源：https://www.cnblogs.com/ckmemory/p/18988754

頁: [1]

圆梦公社's Archiver

Hive自定义函数(UDF)开发和应用流程