Openshift – 肥叉烧 feichashao.com

Deleting a RED-state index in Elasticsearch

Problem

Logging in OpenShift v4 ran into a problem that left Elasticsearch in the red state, so logs could no longer be written.

Solution

1. Open a shell in an ES pod and check the cluster health. The status is red and some shards are unassigned (202 in the output below).

$ oc exec -it elasticsearch-cdm-xxxx-1-yyyy-zzzz -n openshift-logging -- bash
bash-4.2$ health
Tue Nov 10 06:19:00 UTC 2020
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
xxxxx 06:19:00 elasticsearch red 3 3 578 289 0 0 202 0 - 74.1%

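2. List the unassigned shards and the reason each one is unassigned. Right after a cluster recovers, shards report CLUSTER_RECOVERED as the unassigned reason, but that should not last long. Here the shards stay UNASSIGNED with CLUSTER_RECOVERED, which suggests something has gone wrong.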

bash-4.2$ $curl_get "$ES_BASE/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
.xxxxxxxxxx.2020.10.14 1 p UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 1 r UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 2 p UNASSIGNED CLUSTER_RECOVERED
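With 202 unassigned shards, it helps to reduce the listing to the indices that are actually red, meaning those with at least one unassigned primary. The following is a minimal sketch, assuming the column order requested above (index, shard, prirep, state, unassigned.reason); the function name `red_indices` is made up for this example.

```shell
#!/bin/sh
# Print each index that has at least one UNASSIGNED primary shard, followed by
# the number of such shards. Reads "_cat/shards" rows on stdin in the column
# order used above: index shard prirep state unassigned.reason
red_indices() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { count[$1]++ }
       END { for (i in count) print i, count[i] }' | sort
}

# Sample rows copied from the output above (index names are masked there).
sample='.xxxxxxxxxx.2020.10.14 1 p UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 1 r UNASSIGNED CLUSTER_RECOVERED
.xxxxxxxxxx.2020.10.14 2 p UNASSIGNED CLUSTER_RECOVERED'

printf '%s\n' "$sample" | red_indices
```

Inside the ES pod, the same function could be fed from the command above by piping its output into `red_indices` instead of `grep UNASSIGNED`. Replica-only rows are ignored on purpose, since a missing replica makes an index yellow rather than red.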

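3. Ask Elasticsearch to explain why the shards are unassigned. The data is gone: no node in the cluster holds a valid copy of the primary.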

bash-4.2$ $curl_get "$ES_BASE/_cluster/allocation/explain?pretty"
"can_allocate" : "no_valid_shard_copy",
"allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",

4. A quick and crude way to clear the red state is to delete the red index. Elasticsearch can then accept logs again, so newly arriving logs are not affected.

$ curl -XDELETE 'localhost:9200/index_name/'

As I understand it, a more reliable approach is to relocate the shard, which avoids losing the data of the whole index. I have not tried it, though.
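The post does not try the gentler alternative, so what follows is only a sketch and not the author's method. When the explain output reports `no_valid_shard_copy`, there is no surviving copy to move; one option Elasticsearch offers in that case is the `allocate_empty_primary` command of the cluster reroute API, which assigns a new empty primary for that one shard. The data of that shard is still lost, but the other shards of the index are kept. The function name `empty_primary_body` and the values `index_name` and `node_name` are placeholders made up for this example.

```shell
#!/bin/sh
# Build the JSON body for a POST to _cluster/reroute that allocates an EMPTY
# primary for one shard. This discards whatever that shard used to hold, which
# is why Elasticsearch requires "accept_data_loss": true.
# $1 = index name, $2 = shard number, $3 = name of the node to allocate on
empty_primary_body() {
  printf '{"commands":[{"allocate_empty_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3"
}

# Placeholder values for illustration only.
empty_primary_body "index_name" 1 "node_name"
```

The printed body would be sent with a POST to `_cluster/reroute`, once per unassigned primary. How to authenticate that request inside the OpenShift logging pod is not covered by the post and is left out here.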


"kubectl apply -f" fails with "The xxxx is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update"

Issue

Running "kubectl apply" against a resource in Kubernetes fails with the error below.

The <resource> is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update


Kubernetes Operators: Introductory Notes

Why do we need Kubernetes Operators?

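While looking into the origins of Kubernetes Operators, I came across this Alibaba Cloud article [8]. One of its authors, Deng Hongchao (邓洪超), is also one of the authors of the etcd operator studied later in this post. The article tells the history of Operators with the feel of an epic historical film, and it is worth a read.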

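As I understand it, an Operator is an operations engineer, except that this engineer is software rather than a human. For stateless applications such as a web server, a plain k8s Deployment is enough: when a pod fails, a new one is deployed automatically, which gives both elasticity and high availability. More complex applications, especially stateful ones with a particular topology, need application-specific logic to manage. Take etcd: if half or more of the members go down, quorum is lost and the cluster stops working. etcd's data also needs regular backups, so that the cluster can be restored promptly if it fails entirely. These tasks can be carried out by a human or by software. An Operator is that software: it automatically manages one specific complex application on k8s.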
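The etcd quorum rule is the kind of application-specific knowledge an Operator has to encode. As a small worked example: a cluster of N members needs floor(N/2) + 1 healthy members to keep quorum. The sketch below only does that arithmetic; the function names `quorum` and `has_quorum` are made up for this example and are not taken from any Operator.

```shell
#!/bin/sh
# quorum N: smallest number of members that forms a majority of N members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL HEALTHY: succeeds if HEALTHY members still form a majority
# of a TOTAL-member cluster.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

# Walk a 3-member cluster through losing members one at a time.
for healthy in 3 2 1; do
  if has_quorum 3 "$healthy"; then
    echo "3 members, $healthy healthy: quorum ok"
  else
    echo "3 members, $healthy healthy: quorum lost"
  fi
done
```

A 3-member cluster survives one failure but not two. A 4-member cluster also survives only one, because its quorum is 3, which is why odd cluster sizes are the usual choice.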

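As a result, Operators have also become a standardized way to deliver application software on Kubernetes. The application's developers define how the application should be operated. A user who wants the application only has to install the corresponding Operator, which then manages the whole application life cycle automatically, including installation, upgrade, backup, and restore.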

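To paint a bigger picture: within k8s, with Operators managing the k8s nodes underneath and Operators managing the applications on top, the entire platform manages itself and needs no manual operations. That is very much in the spirit of cloud computing.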

OpenShift 4 Study Notes

Notes from the DO280 online course.

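OpenShift is a PaaS product built on k8s, and its basic concepts are much the same as in k8s. OpenShift uses Projects, which are built on k8s namespaces, to isolate resources from one another.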

Operator

OpenShift 4 introduces the concept of Operators, which help manage resources.
Operators usually define custom resources (CR) that store their settings and configurations. An OpenShift administrator manages an operator by editing its custom resources. The syntax of a custom resource is defined by a custom resource definition (CRD).

An Operator does its work by calling the Kubernetes API. Calling the low-level API directly is cumbersome, so the Operator Software Development Kit (Operator SDK) exists to help with developing Operators.

The Operator Lifecycle Manager (OLM) manages the life cycle of Operators; it is itself an Operator.