Converting a Java BigDecimal to a Parquet Decimal

A quick note on a problem. A user wanted to generate Parquet files directly from Java, without going through Hive, Spark, or similar software.

The user serialized each BigDecimal to bytes directly, and neither Hive nor Athena could read back the correct values. Looking at how Hive does it: it first converts the value to a BigInteger with unscaledValue(), then calls toByteArray() on that result before writing it into Parquet.
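Hive's approach can be sketched in plain Java. Note that only the unscaledValue().toByteArray() step comes from the post; the helper name and the explicit scale alignment are my own illustrative additions:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class DecimalToParquetBytes {
    // Produce the big-endian two's-complement bytes that Parquet's DECIMAL
    // logical type expects. `scale` must match the scale declared in the
    // Parquet schema. (Helper name and scale handling are illustrative.)
    static byte[] toParquetDecimalBytes(BigDecimal value, int scale) {
        // Align to the schema's scale (failing rather than silently
        // rounding), then take the unscaled integer -- the step Hive does.
        BigInteger unscaled = value.setScale(scale, RoundingMode.UNNECESSARY)
                                   .unscaledValue();
        return unscaled.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = toParquetDecimalBytes(new BigDecimal("123.45"), 2);
        // 123.45 at scale 2 has unscaled value 12345 (bytes 0x30 0x39)
        System.out.println(new BigInteger(bytes)); // prints 12345
    }
}
```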
Continue reading “Converting a Java BigDecimal to a Parquet Decimal”

Hive fails renaming an S3 table with error "New location for this table already exists"

Issue

- In hive-cli, rename the table with the command:

hive> alter table large_table_bk rename to large_table;

- About 10 minutes later, it fails with an error.

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. New location for this table default.large_table already exists : s3://feichashao-hadoop/warehouse/large_table

- However, before the "rename" command was executed, the directory did not exist in S3, so this error was unexpected.
Continue reading “Hive fails renaming an S3 table with error "New location for this table already exists"”

hive-server2 retains 2k+ s3n-worker threads after jobs finish


TL;DR: This is a bug in EMR 5.6. Upgrading to EMR 5.8 or later solves the issue.

Issue

A user reported seeing 2000+ s3n-worker threads after jobs finished, and had to restart the hive-server2 service every day to mitigate the issue.

# sudo -u hive jstack 11089 | grep s3n-worker | wc -l
2000

The thread names repeat from s3n-worker-0 through s3n-worker-19. In other words, there are 100 × 20 s3n-worker threads.

"s3n-worker-19" #70 daemon prio=5 os_prio=0 tid=0x00007f5ac4cf0800 nid=0x10ad waiting on condition [0x00007f5ac1dee000]
......
"s3n-worker-1" #52 daemon prio=5 os_prio=0 tid=0x00007f5ac5462000 nid=0x109b waiting on condition [0x00007f5aca23f000]
"s3n-worker-0" #51 daemon prio=5 os_prio=0 tid=0x00007f5ac5480000 nid=0x109a waiting on condition [0x00007f5aca641000]
......

Environment

AWS EMR 5.6
Continue reading “hive-server2 retains 2k+ s3n-worker threads after jobs finish”

Spark RDD checkpoint on S3 exits with exception intermittently

Issue

- Run a Spark job that saves RDD checkpoints to S3.
- The job fails intermittently with the following error:

org.apache.spark.SparkException: Checkpoint RDD has a different number of partitions from original RDD. Original RDD [ID: xxx, num of partitions: 6]; Checkpoint RDD [ID: xxx, num of partitions: 5].

Continue reading “Spark RDD checkpoint on S3 exits with exception intermittently”

Customizing the AWS SNS email format

AWS SES (Simple Email Service) provides email sending and receiving. Delivery feedback, such as bounce messages, can be forwarded to SNS (Simple Notification Service) for further processing, and SNS can deliver those messages to a subscribed email address. However, the emails SNS sends out are raw JSON, which is hard to read. For example:

{"notificationType":"Delivery","mail":{"timestamp":"2019-02-18T06:03:02.669Z","source":"kfc@feichashao.com","sourceArn":"arn:aws:ses:us-west-2:xxxxxx:identity/feichashao.com","sourceIp":"205.251.234.36","sendingAccountId":"xxxxxx","messageId":"01010168ff335c8d-f00ce1c1-e103-49cd-912f-9f397c7a463c-000000","destination":["feichashao@gmail.com"],"headersTruncated":false,"headers":[{"name":"From","value":"kfc@feichashao.com"},{"name":"To","value":"feichashao@gmail.com"},{"name":"Subject","value":"free kfc"},{"name":"MIME-Version","value":"1.0"},{"name":"Content-Type","value":"text/plain; charset=UTF-8"},{"name":"Content-Transfer-Encoding","value":"7bit"}],"commonHeaders":{"from":["kfc@feichashao.com"],"to":["feichashao@gmail.com"],"subject":"free kfc"}},"delivery":{"timestamp":"2019-02-18T06:03:03.917Z","processingTimeMillis":1248,"recipients":["feichashao@gmail.com"],"smtpResponse":"250 2.0.0 OK  1550469783 q2si13329671plh.79 - gsmtp","remoteMtaIp":"74.125.20.27","reportingMTA":"a27-30.smtp-out.us-west-2.amazonses.com"}}

How can we make this notification email friendlier? SNS does not currently support customizing the email format. One approach is to send the SNS message to a Lambda function, have Lambda format it, and then send the result to the target mailbox via SES. That is: SES -> SNS -> Lambda -> SES.
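As a rough illustration of the Lambda formatting step (not from the original post), here is a minimal Java sketch that pulls a few fields out of the flat notification JSON with regexes and builds a readable body. A real Lambda handler would parse the JSON with a proper library (e.g. Jackson) and send the result through the SES SendEmail API; the class name, method names, and field choices below are all assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnsMailFormatter {
    // Grab the first occurrence of "name":"value" in the payload.
    // Regex extraction is only for illustration; real code should use a
    // JSON parser.
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\":\"([^\"]*)\"")
                           .matcher(json);
        return m.find() ? m.group(1) : "";
    }

    // Turn the raw SNS notification JSON into a human-readable email body.
    static String friendlyBody(String json) {
        return "Type:    " + field(json, "notificationType") + "\n"
             + "From:    " + field(json, "source") + "\n"
             + "Subject: " + field(json, "subject") + "\n"
             + "Time:    " + field(json, "timestamp");
    }

    public static void main(String[] args) {
        String json = "{\"notificationType\":\"Delivery\","
            + "\"mail\":{\"timestamp\":\"2019-02-18T06:03:02.669Z\","
            + "\"source\":\"kfc@feichashao.com\","
            + "\"commonHeaders\":{\"subject\":\"free kfc\"}}}";
        System.out.println(friendlyBody(json));
    }
}
```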
Continue reading “Customizing the AWS SNS email format”