Converting a Java BigDecimal to a Parquet Decimal

A note on a problem I ran into. A user wanted to generate Parquet files directly from Java, without going through Hive, Spark, or similar software.

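The user turned the BigDecimal directly into a byte array, and neither Hive nor Athena could read back the correct values. Looking at how Hive does it: it first calls unscaledValue() to get a BigInteger, then calls toByteArray() on that and stores the result in Parquet. In other words, only the unscaled integer goes into the bytes; the scale lives in the column's type definition.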
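
As a minimal sketch of that idea, using only the JDK: the class and method names below (UnscaledDecimalBytes, encode, decode) are made up for illustration and are not part of Hive or Parquet. It assumes the column scale is known up front and that the value fits that scale without rounding.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical helper, for illustration only.
final class UnscaledDecimalBytes {

    // Align the value to the column's scale, then take the unscaled integer
    // as big-endian two's-complement bytes.
    static byte[] encode(BigDecimal value, int scale) {
        // setScale without a rounding mode throws ArithmeticException
        // if digits would have to be dropped.
        return value.setScale(scale).unscaledValue().toByteArray();
    }

    // Reverse direction: rebuild the BigDecimal from the bytes plus the scale.
    static BigDecimal decode(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        byte[] bytes = encode(new BigDecimal("123.45"), 2);
        System.out.println(decode(bytes, 2));
    }
}
```

For 123.45 at scale 2 the unscaled integer is 12345, so the encoded bytes are 0x30 0x39. The reader can only recover 123.45 because it also knows the scale from the schema.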

hive/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java

    private Binary decimalToBinary(final HiveDecimal hiveDecimal, final DecimalTypeInfo decimalTypeInfo) {
      int prec = decimalTypeInfo.precision();
      int scale = decimalTypeInfo.scale();
      byte[] decimalBytes = hiveDecimal.setScale(scale).unscaledValue().toByteArray(); //<-----

      // Estimated number of bytes needed.
      int precToBytes = ParquetHiveSerDe.PRECISION_TO_BYTE_COUNT[prec - 1];
      if (precToBytes == decimalBytes.length) {
        // No padding needed.
        return Binary.fromByteArray(decimalBytes);
      }

      byte[] tgt = new byte[precToBytes];
      if (hiveDecimal.signum() == -1) {
        // For negative number, initializing bits to 1
        for (int i = 0; i < precToBytes; i++) {
          tgt[i] |= 0xFF;
        }
      }

      System.arraycopy(decimalBytes, 0, tgt, precToBytes - decimalBytes.length, decimalBytes.length); // Padding leading zeroes/ones.
      return Binary.fromByteArray(tgt);
    }
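
The padding step in the Hive method above can be reproduced with plain JDK classes. The following is a sketch rather than Hive's code: FixedLengthDecimal and toFixedLength are hypothetical names, and the target byte length is passed in as a parameter instead of being looked up from a precision table as Hive does.

```java
import java.math.BigDecimal;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class FixedLengthDecimal {

    // Left-pad the unscaled two's-complement bytes to byteLength,
    // using 0x00 for non-negative values and 0xFF for negative ones.
    static byte[] toFixedLength(BigDecimal value, int scale, int byteLength) {
        byte[] unscaled = value.setScale(scale).unscaledValue().toByteArray();
        if (unscaled.length > byteLength) {
            throw new IllegalArgumentException(
                "value needs " + unscaled.length + " bytes, only " + byteLength + " available");
        }
        byte[] out = new byte[byteLength];
        if (value.signum() < 0) {
            Arrays.fill(out, (byte) 0xFF); // sign extension for negative numbers
        }
        // Copy the significant bytes to the right-hand end of the buffer.
        System.arraycopy(unscaled, 0, out, byteLength - unscaled.length, unscaled.length);
        return out;
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : toFixedLength(new BigDecimal("-1"), 2, 4)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

With scale 2, the value -1 has the unscaled integer -100, which is the single byte 0x9C; padded to 4 bytes it becomes FF FF FF 9C. Filling with 0xFF keeps the padded array equal to the same negative number when it is read back as a two's-complement integer.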

The complete source for generating the Parquet file cannot be provided here for now; the following imports may be helpful:

    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.column.ParquetProperties.WriterVersion;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.GroupFactory;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetFileWriter;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.ParquetReader.Builder;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.hadoop.example.GroupReadSupport;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;
    import org.apache.parquet.io.api.Binary;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;
    import org.apache.parquet.schema.Types.MessageTypeBuilder;