记个简单的步骤方便后面使用。想实现的效果是自定义一个函数,用在 Hive 中。例如,在字符串前加个 Hello.
hive> select hello(firstname) from people limit 10;
OK
Hello hehe
OK
Hello hehe
环境
AWS EMR 5.20.0
编译 UDF 对应的 JAR
使用这个 git[1] 提供的源码作为示例。
$ git clone https://github.com/rathboma/hive-extension-examples.git
这个示例有个小问题,定义 class 的时候忘了指定 public。所以我们要把 public 先加上。修改 hive-extension-examples/src/main/java/com/matthewrathbone/example/SimpleUDFExample.java 如下:
public class SimpleUDFExample extends UDF {
public Text evaluate(Text input) {
if(input == null) return null;
return new Text("Hello " + input.toString());
}
}
public Text evaluate(Text input) {
if(input == null) return null;
return new Text("Hello " + input.toString());
}
}
这里定义了一个继承 UDF 的类 SimpleUDFExample,后面 Hive 用作函数的类就在这里实现。它就是简单地返回一个加上 "Hello" 的字符串。
修改 hive-extension-examples/pom.xml 如下,使编译出来的 JAR 与 EMR 环境兼容[2]。
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.8</version>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.matthewrathbone.example.RawMapreduce</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
<modelVersion>4.0.0</modelVersion>
<groupId>com.matthewrathbone.example</groupId>
<artifactId>hive-extensions</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>hive-extensions</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.8.5-amzn-1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.4-amzn-0</version>
<scope>provided</scope>
</dependency>
<!-- TEST DEPENDENCIES -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
<version>3.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-test</artifactId>
<version>2.8.5-amzn-1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.2</version>
<scope>test</scope>
</dependency>
</dependencies>
<repositories>
<repository>
<id>emr-5.20.0-artifacts</id>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
<url>https://s3.us-west-2.amazonaws.com/us-west-2-emr-artifacts/emr-5.20.0/repos/maven/</url>
</repository>
</repositories>
</project>
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.8</version>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.matthewrathbone.example.RawMapreduce</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
<modelVersion>4.0.0</modelVersion>
<groupId>com.matthewrathbone.example</groupId>
<artifactId>hive-extensions</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>hive-extensions</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.8.5-amzn-1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.4-amzn-0</version>
<scope>provided</scope>
</dependency>
<!-- TEST DEPENDENCIES -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
<version>3.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-test</artifactId>
<version>2.8.5-amzn-1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.2</version>
<scope>test</scope>
</dependency>
</dependencies>
<repositories>
<repository>
<id>emr-5.20.0-artifacts</id>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
<url>https://s3.us-west-2.amazonaws.com/us-west-2-emr-artifacts/emr-5.20.0/repos/maven/</url>
</repository>
</repositories>
</project>
编译打包:
$ cd hive-extension-examples
$ mvn compile
$ mvn assembly:single
$ mvn compile
$ mvn assembly:single
将生成的 JAR 包复制到 Hive 能访问的位置,比如,
$ cp target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar /tmp/
导入 Hive
hive> create table people(firstname String);
OK
Time taken: 0.816 seconds
hive> INSERT INTO TABLE people VALUES ('hehe');
hive> ADD JAR /tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
Added [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar]
hive> create temporary function hello as 'com.matthewrathbone.example.SimpleUDFExample';
OK
Time taken: 0.017 seconds
hive> select hello(firstname) from people limit 10;
OK
Hello hehe
Time taken: 2.513 seconds, Fetched: 1 row(s)
OK
Time taken: 0.816 seconds
hive> INSERT INTO TABLE people VALUES ('hehe');
hive> ADD JAR /tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
Added [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/tmp/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar]
hive> create temporary function hello as 'com.matthewrathbone.example.SimpleUDFExample';
OK
Time taken: 0.017 seconds
hive> select hello(firstname) from people limit 10;
OK
Hello hehe
Time taken: 2.513 seconds, Fetched: 1 row(s)
链接
[1] https://github.com/rathboma/hive-extension-examples
[2] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-artifact-repository.html