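Logging in OpenShift v4 ran into a problem that left Elasticsearch in a red state, so logs could no longer be written.
1. Exec into the ES pod and check the cluster health. The status is red, and a number of shards are unassigned.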
$ oc exec -it elasticsearch-cdm-xxxx-1-yyyy-zzzz -n openshift-logging -- bash
bash-4.2$ health
Tue Nov 10 06:19:00 UTC 2020
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
xxxxx 06:19:00 elasticsearch red 3 3 578 289 0 0 202 0 - 74.1%
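2. List the unassigned shards and the reason each one is unassigned. Right after a cluster recovers, shards report CLUSTER_RECOVERED, but that should not last long. Here they stay at CLUSTER_RECOVERED indefinitely, which suggests something has gone wrong.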
bash-4.2$ $curl_get "$ES_BASE/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
.xxxxxxxxxx.2020.10.14 1 p UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 1 r UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 2 p UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 1 p UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 1 r UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 2 p UNASSIGNED CLUSTER_RECOVERED
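With roughly 200 unassigned shards, it can help to summarize the listing per index before looking at individual shards. The following is a minimal sketch, assuming the column order requested above (index, shard, prirep, state, unassigned.reason); the function name `summarize_unassigned` is made up for illustration.

```shell
# Count unassigned shards per index from `_cat/shards` output on stdin.
# Assumed columns: index shard prirep state unassigned.reason
summarize_unassigned() {
  awk '$4 == "UNASSIGNED" { count[$1]++ }
       END { for (idx in count) print count[idx], idx }' | sort -rn
}
```

Inside the pod it could be fed from the same query, for example `$curl_get "$ES_BASE/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | summarize_unassigned`, which prints one line per index with the largest counts first.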
3. Ask the cluster why a shard is unassigned. The explanation shows that the shard's data is gone.
bash-4.2$ $curl_get "$ES_BASE/_cluster/allocation/explain?pretty"
"can_allocate" : "no_valid_shard_copy",
"allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
4. A quick and crude way to clear the red state is to delete the red indices. Elasticsearch can then keep accepting logs, so newly arriving logs are not affected.
$ curl -XDELETE 'localhost:9200/index_name/'
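When several indices are red, the delete commands can be generated rather than typed one by one. The sketch below only prints the commands as a dry run, so they can be reviewed first; `print_delete_commands` is a hypothetical helper, and the `localhost:9200` default is an assumption taken from the command above.

```shell
# Print (do not run) one DELETE command per index name read from stdin.
# Optional argument: base URL of the cluster (assumed default: localhost:9200).
print_delete_commands() {
  base_url="${1:-localhost:9200}"
  while read -r index rest; do
    # Skip blank lines in the input.
    [ -n "$index" ] || continue
    printf "curl -XDELETE '%s/%s'\n" "$base_url" "$index"
  done
}
```

The index names could come from `_cat/indices` filtered by health, for example `curl -s 'localhost:9200/_cat/indices?health=red&h=index' | print_delete_commands`. Deleting an index is irreversible, so the printed list is worth checking before running anything.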
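From what I understand, a more reliable approach would be to relocate the affected shard, which avoids losing the data of the entire index. However, I did not try it.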
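One option along these lines is the cluster reroute API with the `allocate_empty_primary` command. It assigns a new, empty primary for a single shard: that shard's data is lost, but the other shards of the index keep theirs. This is an untested sketch, not something verified on this cluster; `reroute_body` is a hypothetical helper, and the index, shard, and node values are placeholders.

```shell
# Build the JSON body for POST _cluster/reroute using allocate_empty_primary.
# Arguments: index name, shard number, target node name (all placeholders).
reroute_body() {
  printf '{"commands":[{"allocate_empty_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' "$1" "$2" "$3"
}

# Example request (not executed here; endpoint assumed as in the DELETE above):
# curl -XPOST 'localhost:9200/_cluster/reroute' \
#   -H 'Content-Type: application/json' \
#   -d "$(reroute_body index_name 1 node_name)"
```

The `accept_data_loss` flag is required by Elasticsearch for this command, as an explicit acknowledgement that the shard starts out empty.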