Example: Using a UDF (User-Defined Function) in Hive

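These are a few simple steps noted down for later reference. The goal is to define a custom function and use it in Hive, for example one that prepends "Hello " to a string.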

hive> select hello(firstname) from people limit 10;
OK
Hello hehe

Environment

AWS EMR 5.20.0

Building the JAR for the UDF

The source code from this git repository [1] serves as the example.

$ git clone https://github.com/rathboma/hive-extension-examples.git

The example has one small problem: the class is declared without the public modifier, so public has to be added first. Edit hive-extension-examples/src/main/java/com/matthewrathbone/example/SimpleUDFExample.java as follows:

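package com.matthewrathbone.example;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// The only change is the added "public" modifier on the class.
public class SimpleUDFExample extends UDF {

  // Hive calls evaluate() once per row; a NULL input yields NULL.
  public Text evaluate(Text input) {
    if (input == null) return null;
    return new Text("Hello " + input.toString());
  }
}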

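This defines a class SimpleUDFExample that extends UDF; it is the class that Hive will later use as the function. It simply returns the input string with "Hello " prepended, and returns null when the input is null.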
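
To try the logic without a Hive or Hadoop classpath, the same behavior can be sketched in plain Java. This is only a stand-in, not the UDF itself: the class name HelloSketch and the method hello are made up for illustration, and String replaces org.apache.hadoop.io.Text.

```java
// Stand-in for SimpleUDFExample.evaluate. HelloSketch and hello() are
// hypothetical names; String is used instead of org.apache.hadoop.io.Text
// so that the sketch runs without Hive or Hadoop on the classpath.
public class HelloSketch {

  // Mirrors the UDF: null in, null out; otherwise prepend "Hello ".
  public static String hello(String input) {
    if (input == null) return null;
    return "Hello " + input;
  }

  public static void main(String[] args) {
    System.out.println(hello("hehe"));  // prints "Hello hehe"
    System.out.println(hello(null));    // prints "null"
  }
}
```

The null check matters because a Hive column can contain NULL; without it, the real UDF would throw a NullPointerException on input.toString().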

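Edit hive-extension-examples/pom.xml as follows so that the compiled JAR is compatible with the EMR environment [2]. The Hadoop and Hive dependency versions and the Maven repository URL match EMR 5.20.0.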

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.8</version>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.matthewrathbone.example.RawMapreduce</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>

  <modelVersion>4.0.0</modelVersion>
  <groupId>com.matthewrathbone.example</groupId>
  <artifactId>hive-extensions</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>hive-extensions</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.8.5-amzn-1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.3.4-amzn-0</version>
      <scope>provided</scope>
    </dependency>
    <!-- TEST DEPENDENCIES -->
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.3.2</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
      <version>3.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-test</artifactId>
      <version>2.8.5-amzn-1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.2</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>emr-5.20.0-artifacts</id>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <url>https://s3.us-west-2.amazonaws.com/us-west-2-emr-artifacts/emr-5.20.0/repos/maven/</url>
    </repository>
  </repositories>
</project>

Compile and package:

$ cd hive-extension-examples
$ mvn compile
$ mvn assembly:single

Copy the generated JAR to a location that Hive can access, for example:

$ cp target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar /tmp/

Loading into Hive

hive> create table people(firstname String);
OK
Time taken: 0.816 seconds

hive> INSERT INTO TABLE people VALUES ('hehe');

hive> ADD JAR /tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
Added [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar]

hive> create temporary function hello as 'com.matthewrathbone.example.SimpleUDFExample';
OK
Time taken: 0.017 seconds

hive> select hello(firstname) from people limit 10;
OK
Hello hehe
Time taken: 2.513 seconds, Fetched: 1 row(s)

Links

[1] https://github.com/rathboma/hive-extension-examples
[2] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-artifact-repository.html