Using a UDF (User-Defined Function) in Hive: An Example

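A quick note of the steps for future reference. The goal is to define a custom function and use it in Hive, for example one that prepends "Hello" to a string:
[cc lang="text"]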
hive> select hello(firstname) from people limit 10;
OK
Hello hehe
[/cc]

Environment

AWS EMR 5.20.0

Building the JAR for the UDF

The source code from this git repository [1] serves as the example.
[cc lang="text"]
$ git clone https://github.com/rathboma/hive-extension-examples.git
[/cc]
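The example has one small problem: the class declaration is missing the public modifier, so add it first. Edit hive-extension-examples/src/main/java/com/matthewrathbone/example/SimpleUDFExample.java as follows:
[cc lang="java"]
package com.matthewrathbone.example;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Declared public so that Hive can load and instantiate the class.
public class SimpleUDFExample extends UDF {

  // Hive calls evaluate() for each row; a NULL column value arrives as null.
  public Text evaluate(Text input) {
    if (input == null) return null;
    return new Text("Hello " + input.toString());
  }
}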
[/cc]
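This defines SimpleUDFExample, a class that extends UDF; it is the class Hive will later use as the function's implementation. It simply returns the input string with "Hello " prepended.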
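To try out the logic without Hive on the classpath, here is a minimal plain-Java sketch of what evaluate does. HelloSketch is a hypothetical name, not part of the repository, and it uses String in place of org.apache.hadoop.io.Text, so it only illustrates the behavior (including the null pass-through) rather than the actual UDF.

```java
// Plain-Java sketch of the UDF's logic; HelloSketch is a hypothetical name.
// Assumption: String stands in for org.apache.hadoop.io.Text so this runs without Hive.
class HelloSketch {

    // Mirrors SimpleUDFExample.evaluate: null in, null out; otherwise add the prefix.
    static String evaluate(String input) {
        if (input == null) return null;
        return "Hello " + input;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("hehe")); // prints "Hello hehe"
        System.out.println(evaluate(null));   // prints "null"
    }
}
```

Returning null for null input matters because Hive passes SQL NULL values to the function as null; without the check, input.toString() would throw a NullPointerException.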

Modify hive-extension-examples/pom.xml as follows so that the compiled JAR is compatible with the EMR environment [2].
[cc lang="xml"]
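<project xmlns="http://maven.apache.org/POM/4.0.0">
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.8</version>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.matthewrathbone.example.RawMapreduce</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.matthewrathbone.example</groupId>
  <artifactId>hive-extensions</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>hive-extensions</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.8.5-amzn-1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.3.4-amzn-0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.3.2</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
      <version>3.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-test</artifactId>
      <version>2.8.5-amzn-1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.2</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>emr-5.20.0-artifacts</id>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <url>https://s3.us-west-2.amazonaws.com/us-west-2-emr-artifacts/emr-5.20.0/repos/maven/</url>
    </repository>
  </repositories>
</project>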
[/cc]

Compile and package:
[cc lang="text"]
$ cd hive-extension-examples
$ mvn compile
$ mvn assembly:single
[/cc]

Copy the generated JAR to a location Hive can access, for example:
[cc lang="text"]
$ cp target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar /tmp/
[/cc]

Loading into Hive

[cc lang="text"]
hive> create table people(firstname String);
OK
Time taken: 0.816 seconds

hive> INSERT INTO TABLE people VALUES ('hehe');

hive> ADD JAR /tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
Added [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar]

hive> create temporary function hello as 'com.matthewrathbone.example.SimpleUDFExample';
OK
Time taken: 0.017 seconds

hive> select hello(firstname) from people limit 10;
OK
Hello hehe
Time taken: 2.513 seconds, Fetched: 1 row(s)
[/cc]

Links

[1] https://github.com/rathboma/hive-extension-examples
[2] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-artifact-repository.html