Issue
– In hive-cli, rename a table with the following command:
[cc lang="text"]
hive> alter table large_table_bk rename to large_table;
[/cc]
– About 10 minutes later, it reports an error:
[cc]
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. New location for this table default.large_table already exists : s3://feichashao-hadoop/warehouse/large_table
[/cc]
– However, before executing the "rename" command, the target directory did not exist in S3, so such an error was not expected.
Environment
– AWS EMR
– AWS S3
– Large table (~ 600GiB) resides in S3.
Resolution
– Ignore the error and wait a few minutes. The table will be renamed eventually, once all files in S3 have been renamed by the Hive metastore.
– Or, extend the metastore client socket timeout so the error does not occur:
[cc]
$ hive --hiveconf hive.metastore.client.socket.timeout=1h
hive> alter table large_table rename to large_table_bk;
OK
Time taken: 798.823 seconds
[/cc]
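To make the longer timeout persistent rather than per-session, the same property can be set in hive-site.xml. A sketch (this property accepts time-unit suffixes, so 1h matches the value used above):

```xml
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1h</value>
  <description>Metastore client socket timeout, raised so long-running
  S3 rename (copy + delete) operations do not drop the connection.</description>
</property>
```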
Root Cause
– In Hive, renaming a managed table in the default warehouse (hive.metastore.warehouse.dir) also renames the underlying directory. For example, in HDFS, from /user/hive/warehouse/table_before to /user/hive/warehouse/table_after.
– The rename operation is performed by the Hive metastore.
– In S3, there's no built-in rename operation, so a rename is actually a copy followed by a delete.
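The effect can be sketched locally: treating a temporary directory as the bucket, a prefix "rename" degenerates into one copy plus one delete per object, which is why the time grows with the number and size of objects. Paths and names here are illustrative, not the actual metastore code:

```shell
# Emulate an object-store "rename": no atomic move, only copy + delete per key.
BUCKET=$(mktemp -d)                       # stands in for s3://bucket
mkdir -p "$BUCKET/warehouse/table_before"
echo "row1" > "$BUCKET/warehouse/table_before/part-00000"
echo "row2" > "$BUCKET/warehouse/table_before/part-00001"

SRC="$BUCKET/warehouse/table_before"
DST="$BUCKET/warehouse/table_after"
mkdir -p "$DST"

for obj in "$SRC"/*; do
  cp "$obj" "$DST/$(basename "$obj")"     # one CopyObject per key
  rm "$obj"                               # one DeleteObject per key
done
rmdir "$SRC"

ls "$DST"                                 # lists part-00000 and part-00001
```

With thousands of objects totaling hundreds of GiB, this per-key copy-and-delete loop is what keeps the metastore busy well past the client's socket timeout.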
– If the dataset is large (600+ GiB), the metastore may need more than 10 minutes to finish the rename operation. The socket between hive-cli and the metastore then times out, with a log like:
[cc]
metastore.RetryingMetaStoreClient (RetryingMetaStoreClient.java:invoke(218)) - MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. alter_table_with_environmentContext
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
[/cc]
– Monitoring S3, we can see the rename operation is still in progress:
[cc]
$ aws s3 ls s3://feichashao-hadoop/warehouse/large_table/ --recursive --human-readable --summarize | tail -n2
Total Objects: 2101
Total Size: 237.8 GiB
$ aws s3 ls s3://feichashao-hadoop/warehouse/large_table/ --recursive --human-readable --summarize | tail -n2
Total Objects: 2348
Total Size: 265.8 GiB
[/cc]
– s3n-worker threads are still running inside the metastore:
[cc]
$ sudo -u hive jstack 10834 | grep s3n-worker
"s3n-worker-19" #86 daemon prio=5 os_prio=0 tid=0x00007f8928455000 nid=0x1623 runnable [0x00007f89171d5000]
"s3n-worker-18" #85 daemon prio=5 os_prio=0 tid=0x00007f8928454800 nid=0x1622 runnable [0x00007f89174d6000]
"s3n-worker-17" #84 daemon prio=5 os_prio=0 tid=0x00007f8928453800 nid=0x1621 runnable [0x00007f8917ada000]
[/cc]
– After waiting for some time, we can see that the table was renamed successfully despite the error:
[cc]
hive> show tables;
large_table
[/cc]
Reference
[1] Preserve the location of table created with the location clause in table rename
https://issues.apache.org/jira/browse/HIVE-14909