Converting Java BigDecimal to Parquet Decimal

A quick note on an issue. A user wanted to generate Parquet files directly from Java, without going through Hive, Spark, or similar software.

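The user converted the BigDecimal straight to a byte array with toByteArray(), and neither Hive nor Athena could read back the correct value. Looking at how Hive does it: it first calls unscaledValue() to get a BigInteger, then stores the result of that BigInteger's toByteArray() in Parquet. In other words, the bytes hold only the unscaled integer; the scale comes from the column's decimal type.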
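
To make the encoding concrete, here is a minimal sketch that uses only the JDK. The class and method names (`UnscaledBytesDemo`, `encode`, `decode`) are made up for illustration and do not come from Hive or Parquet. It assumes the value can be represented exactly at the column's scale.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class UnscaledBytesDemo {

    // Writer side: fix the scale to the column's declared scale, then take the
    // big-endian two's-complement bytes of the unscaled integer.
    static byte[] encode(BigDecimal value, int scale) {
        // RoundingMode.UNNECESSARY throws ArithmeticException if digits would be lost.
        return value.setScale(scale, RoundingMode.UNNECESSARY)
                    .unscaledValue()
                    .toByteArray();
    }

    // Reader side: rebuild the value from the bytes plus the scale in the schema.
    static BigDecimal decode(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        byte[] bytes = encode(new BigDecimal("123.45"), 2); // unscaled value is 12345
        System.out.println(Arrays.toString(bytes));         // prints [48, 57]
        System.out.println(decode(bytes, 2));               // prints 123.45
    }
}
```

The point of the round trip is that the scale never appears in the bytes, so writer and schema have to agree on it.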

hive/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
[cc lang="java"]
  private Binary decimalToBinary(final HiveDecimal hiveDecimal, final DecimalTypeInfo decimalTypeInfo) {
    int prec = decimalTypeInfo.precision();
    int scale = decimalTypeInfo.scale();
    byte[] decimalBytes = hiveDecimal.setScale(scale).unscaledValue().toByteArray(); // <-----

    // Estimated number of bytes needed.
    int precToBytes = ParquetHiveSerDe.PRECISION_TO_BYTE_COUNT[prec - 1];
    if (precToBytes == decimalBytes.length) {
      // No padding needed.
      return Binary.fromByteArray(decimalBytes);
    }

    byte[] tgt = new byte[precToBytes];
    if (hiveDecimal.signum() == -1) {
      // For negative number, initializing bits to 1
      for (int i = 0; i < precToBytes; i++) {
        tgt[i] |= 0xFF;
      }
    }

    System.arraycopy(decimalBytes, 0, tgt, precToBytes - decimalBytes.length, decimalBytes.length); // Padding leading zeroes/ones.
    return Binary.fromByteArray(tgt);
  }
[/cc]

The complete source code for generating the Parquet file cannot be provided here for now, but the following classes may be helpful:

[cc lang="java"]
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;
import org.apache.parquet.schema.Types.MessageTypeBuilder;
[/cc]
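
For a fixed-length decimal column, the padding step in Hive's `decimalToBinary` can be approximated with the JDK alone. The sketch below is not the user's code and not a copy of Hive's: `FixedLenDecimal`, `byteCountFor`, and `toFixedLenBytes` are hypothetical names, and `byteCountFor` is my own stand-in for Hive's `PRECISION_TO_BYTE_COUNT` table, computed on the assumption that the width should be the smallest signed byte count that holds every value of the given precision.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class FixedLenDecimal {

    // Largest unscaled magnitude with `precision` decimal digits: 10^precision - 1.
    static BigInteger maxUnscaled(int precision) {
        return BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
    }

    // Assumed stand-in for Hive's lookup table: smallest signed width, in bytes,
    // that can hold every unscaled value of this precision.
    static int byteCountFor(int precision) {
        return maxUnscaled(precision).bitLength() / 8 + 1; // +1 leaves room for the sign bit
    }

    static byte[] toFixedLenBytes(BigDecimal value, int precision, int scale) {
        BigInteger unscaled = value.setScale(scale, RoundingMode.UNNECESSARY).unscaledValue();
        if (unscaled.abs().compareTo(maxUnscaled(precision)) > 0) {
            throw new ArithmeticException(value + " does not fit in precision " + precision);
        }
        byte[] raw = unscaled.toByteArray();
        byte[] out = new byte[byteCountFor(precision)];
        if (unscaled.signum() < 0) {
            Arrays.fill(out, (byte) 0xFF); // sign-extend negative values with 1 bits
        }
        // Right-align the minimal representation; the leading bytes stay as padding.
        System.arraycopy(raw, 0, out, out.length - raw.length, raw.length);
        return out;
    }
}
```

The returned array could then be wrapped with `Binary.fromByteArray(...)` from the imports above before being handed to the Parquet writer. Rejecting values that exceed the precision is a choice made in this sketch, not something shown in the Hive snippet.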