Building a Cluster Service with Pacemaker

Pacemaker in Brief

Pacemaker is a cluster resource manager: it manages the entire life cycle of the resources (services) deployed in a cluster environment. The official documentation puts it this way:

Pacemaker is a cluster resource manager, that is, a logic responsible for a life-cycle of deployed software — indirectly perhaps even whole systems or their interconnections — under its control within a set of computers (a.k.a. nodes) and driven by prescribed rules.

Pacemaker Internals

Pacemaker consists mainly of the following components:
CIB - Cluster Information Base
CRMd - Cluster Resource Management daemon
LRMd - Local Resource Management daemon
PEngine - Policy Engine
STONITHd - Fencing daemon

[Figure: Pacemaker internal components (pcmk-internals)]

1. The CIB records, in XML, the cluster configuration and the current state of every resource in the cluster. Its contents are synchronized automatically across the nodes (they can be inspected with the commands shown after this list).
2. The PEngine on the DC (Designated Controller) node uses the contents of the CIB to compute the best possible state of the cluster, i.e. the target state the cluster should move towards.
3. When the cluster state changes, the PEngine works out the next actions according to the configured rules and, once done, hands the resulting action plan to the CRMd on the DC node.
4. If an action is local, such as starting or stopping a resource on the DC itself, the CRMd passes it to the local LRMd, which performs the actual operation on the resource. If the action targets another node, for example starting or stopping a resource elsewhere, the DC's CRMd sends the plan to the CRMd on the target node, which then tells its own LRMd to carry it out.
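
A quick way to see what these components operate on is to dump the CIB and look at the status Pacemaker has computed from it. As a sketch (assuming a cluster such as the one built in the sections below is already running), `pcs cluster cib` prints the raw CIB XML and `crm_mon -1` prints a one-shot status summary:

# pcs cluster cib
# crm_mon -1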

Building a Cluster

Preparation

On RHEL 7 / CentOS 7, the `pcs` tool is normally used to configure a Pacemaker cluster. Installing `pcs` with `yum` pulls in the required dependencies automatically, including Pacemaker itself. Before installing, make sure the High Availability repository is configured.

Run the following on every node:

# yum install pcs fence-agents-all
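
If firewalld is enabled, the cluster ports also need to be opened on every node. As a sketch, firewalld ships a predefined high-availability service that covers the ports used by pcsd, Corosync and Pacemaker:

# firewall-cmd --permanent --add-service=high-availability
# firewall-cmd --reload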

In addition, the hostname of every node must be resolvable. Add entries for all nodes to /etc/hosts, for example:

192.168.122.11 node1.example.com
192.168.122.12 node2.example.com

Authenticating pcs on Each Node

To build and manage a Pacemaker cluster with pcs, the first step is to authenticate the nodes, i.e. to authorize pcs on every node to talk to pcsd on all of the nodes.

# pcs cluster auth <hostnames>
for example,
# pcs cluster auth node1.example.com node2.example.com
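
`pcs cluster auth` talks to the pcsd daemon and authenticates as the hacluster user, so pcsd must be running and hacluster must have a password before the command is run. A minimal sketch, to be executed on every node:

# passwd hacluster
# systemctl start pcsd.service
# systemctl enable pcsd.service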

Membership Management - Corosync 2.x

A cluster needs a mechanism to know the state of its members, for example which nodes are online and which are unresponsive. In RHEL 7 / CentOS 7 clusters, Pacemaker relies on Corosync 2 for membership management.

Corosync uses heartbeats to decide whether a node is alive or dead. Each node keeps track of the votes it can see. For example, in a cluster with nodes 1, 2, 3 and 4, node 1 can talk to itself (1 vote) and can also reach nodes 2 and 3 (2 more votes), so from node 1's point of view its partition holds 3 votes.

If the votes a node can see are more than half of the total votes in the cluster, the node is considered quorate, i.e. it belongs to the "majority". In the example above, node 1 sees 3 out of 4 votes and is therefore quorate. Pacemaker normally only runs resources on nodes that are quorate.

quorum = floor(total_votes / 2) + 1
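
Taking the four-node example above: total_votes = 4, so quorum = floor(4/2) + 1 = 3. Node 1's partition holds 3 votes, which reaches the quorum of 3, so it is quorate; a single node cut off from the others would only see 1 vote and would not be.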

The current vote and quorum status can be checked with:

# corosync-quorumtool -siH

When creating the cluster, Corosync communication between the nodes has to be set up:

# pcs cluster setup
for example,
# pcs cluster setup --start --name mycluster node1.example.com node2.example.com
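
The `--start` option above already starts Corosync and Pacemaker on the listed nodes. A short sketch of the usual follow-up: enable the cluster to start on boot and check the resulting membership (later on, `pcs cluster start --all` brings a stopped cluster back up):

# pcs cluster enable --all
# pcs status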

Default Token

How long a node may stay unresponsive before it is declared dead is determined by the token timeout. By default, a two-node cluster uses a token of 1000 ms, and every additional node adds another 650 ms.

/etc/corosync/corosync.conf
totem {
version: 2
secauth: off
cluster_name: rhel7-cluster
transport: udpu
rrp_mode: passive
token: 1000  <---
}
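
To change the token value after the cluster has been created, one common workflow (a sketch, assuming RHEL 7's pcs) is to edit /etc/corosync/corosync.conf on one node, push the file to the other nodes and reload Corosync; the effective value can then be checked in the cmap:

# pcs cluster sync
# pcs cluster reload corosync
# corosync-cmapctl | grep totem.token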

Special Modes in Corosync 2

Corosync 2 offers several modes that can be enabled to suit different situations. For details, see the votequorum(5) man page for the corosync_votequorum provider.

two_node
a. This is not a new feature; it has been in cman since the start, and it is still quite useful.

b. `two_node` is, shockingly, designed for clusters with only two nodes where you, not unreasonably, want one node to continue working if the other fails.

This mode requires that expected_votes is set to 2 and that you have hardware fencing configured and connected over the same network interface as the cluster heartbeat.

c. Use `wait_for_all` to avoid a fence loop.

wait_for_all
`two_node` enables `wait_for_all` by default.

When starting up a cluster from scratch (i.e. all nodes down or, at least, not part of the cluster), it prevents the cluster from becoming quorate until all of the nodes have joined in.

It does this by comparing the number of active votes in the cluster with the value of expected_votes.

Without `wait_for_all`, the normal behavior of a cluster is for quorum to be enabled as soon as the required number of votes is achieved.

`wait_for_all` is a useful way of booting up a cluster and making sure that it is not partitioned at startup. In the two_node case this is very important.
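
The quorum section of a two-node cluster then looks roughly like this (pcs normally adds `two_node: 1` automatically when a cluster is created with exactly two nodes, and `wait_for_all` is implied by it):

/etc/corosync/corosync.conf
quorum {
    provider: corosync_votequorum
    two_node: 1
}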

auto_tie_breaker
Auto Tie Breaker allows the cluster to continue working in the event that a cluster containing an **even number of nodes** is split in half.

auto_tie_breaker is not compatible with two_node as both are systems for determining what happens should there be an even split of nodes.

This new option tells votequorum that, in the event of a 50/50 split of the cluster, the half containing the **lowest node ID (by default)** should be deemed the quorate half, and the other half not.

If fencing is in operation, then the other half will be fenced.
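
A corresponding quorum section might look like this (a sketch for a four-node cluster):

/etc/corosync/corosync.conf
quorum {
    provider: corosync_votequorum
    expected_votes: 4
    auto_tie_breaker: 1
}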

last_man_standing & last_man_standing_window
Allows votequorum to reduce the number of expected_votes automatically when nodes leave the cluster.

It does this last_man_standing_window milliseconds after the nodes leave the cluster.

It's important to note that the remaining cluster must be quorate for this calculation to happen.

This allows a cluster to be partitioned and the quorate side can be reduced and still stay active.

quorum { 
    provider: corosync_votequorum 
    expected_votes: 8 
    wait_for_all: 1 
    last_man_standing: 1
    last_man_standing_window: 10000 
} 

Example chain of events:
a. The cluster is fully operational with 8 nodes. (expected_votes: 8 quorum: 5)
b. 3 nodes die, cluster is quorate with 5 nodes.
c. After last_man_standing_window timer expires, expected_votes and quorum are recalculated. (expected_votes: 5 quorum: 3)
d. At this point, 2 more nodes can die and cluster will still be quorate with 3.
e. Once again, after last_man_standing_window timer expires expected_votes and quorum are recalculated. (expected_votes: 3 quorum: 2)
f. At this point, 1 more node can die and cluster will still be quorate with 2.
g. After one more last_man_standing_window timer (expected_votes: 2 quorum: 2)

It's important to note that the normal operation of last_man_standing only allows the cluster to go down to 2 nodes.

If you want to go down to running with only 1 node then you also need to set auto_tie_breaker.

Fence - Stonith

When a node fails, Pacemaker fails resources over: the resources that were running on the failed node are moved to a healthy node so that the service stays highly available.

If a node stops responding, the nodes that are quorate (the "majority") will kill the unresponsive node, in order to:
1. make the cluster membership unambiguous (no node should be left in an "alive or dead, nobody knows" state);
2. make sure the unresponsive node can no longer access shared storage.

Killing a node in this way is called fencing. In Pacemaker, fencing is handled by Stonith.

A matching Stonith device has to be configured for every node. For example, in a KVM virtualization environment it can be set up like this:

# pcs stonith create <stonith id> <stonith device type> [stonith device options]
for example,
# pcs stonith create fence_node1 fence_xvm port=feichashao_RHEL72-pcmk-node1 pcmk_host_list=node1.example.com
# pcs stonith create fence_node2 fence_xvm port=feichashao_RHEL72-pcmk-node2 pcmk_host_list=node2.example.com

Note: `fence_virtd` must be configured on the KVM host first.
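
Once the Stonith devices are defined, it is worth verifying them. As a sketch (note that fencing a node really will power-cycle it), `pcs stonith show` lists the configured devices and a fence action can be triggered manually:

# pcs stonith show
# pcs stonith fence node2.example.com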

Adding Resources

A resource is a service running in the cluster, httpd for example. Pacemaker does not start or stop resources directly; it does not, say, call apachectl itself to start or stop httpd. Instead, Pacemaker manages resources indirectly through Resource Agents: Pacemaker only tells the Resource Agent which operation to perform on a resource (start, stop, monitor, and so on), and the Resource Agent carries out the service-specific work.

This allows the cluster to be agnostic about the resources it manages. The cluster doesn’t need to understand how the resource works because it relies on the resource agent to do the right thing when given a start, stop or monitor command.

Typically, resource agents come in the form of shell scripts. However, they can be written using any technology (such as C, Python or Perl) that the author is comfortable with.

A resource can be added with the following command:

# pcs resource create <resource id> <standard:provider:type|type> [resource options]
for example,
# pcs resource create vip7 IPaddr2 ip=192.168.122.37 cidr_netmask=24
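
Continuing the web-server example, an httpd service could be added alongside the floating IP. This is a sketch: the resource name website is made up here, and statusurl requires the server-status handler to be enabled in the httpd configuration:

# pcs resource create website apache configfile="/etc/httpd/conf/httpd.conf" statusurl="http://127.0.0.1/server-status" op monitor interval=30s
# pcs resource show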

Resource agents

OCF
The OCF class is the most preferred as it is an industry standard, highly flexible (allowing parameters to be passed to agents in a non-positional manner) and self-describing.

The cluster follows these specifications exactly, and giving the wrong exit code will cause the cluster to behave in ways you will likely find puzzling and annoying.

Parameters are passed to the resource agent as environment variables, with the special prefix OCF_RESKEY_. So, a parameter which the user thinks of as ip will be passed to the resource agent as OCF_RESKEY_ip. The number and purpose of the parameters is left to the resource agent;

Actions
Normal OCF Resource Agents are required to have these actions:

*start* - start the resource. Exit 0 when the resource is correctly running (i.e. providing the service) and anything else except 7 if it failed.

*stop* - stop the resource. Exit 0 when the resource is correctly stopped and anything else except 7 if it failed.

*monitor* - monitor the health of a resource. Exit 0 if the resource is running, 7 if it is stopped and anything else if it is failed. Note that the monitor script should test the state of the resource on the localhost.

*meta-data* - provide information about this resource as an XML snippet. Exit with 0
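
To make the contract above concrete, here is a minimal sketch of an OCF-style agent written as a shell script. The agent name myapp and its ip parameter are invented for illustration; it only keeps a state file instead of managing a real service, and its metadata is trimmed down compared with what the OCF specification asks for:

#!/bin/sh
# Minimal sketch of an OCF-style resource agent (hypothetical "myapp" agent).
# A real agent would use a proper runtime directory and manage an actual service.

STATEFILE="/tmp/myapp-${OCF_RESOURCE_INSTANCE:-default}.state"

case "$1" in
  start)
    # Parameters arrive as environment variables prefixed with OCF_RESKEY_,
    # e.g. the parameter "ip" becomes OCF_RESKEY_ip.
    echo "running with ip=${OCF_RESKEY_ip}" > "$STATEFILE" || exit 1   # 1 = OCF_ERR_GENERIC
    exit 0                                                             # 0 = OCF_SUCCESS
    ;;
  stop)
    rm -f "$STATEFILE"
    exit 0                       # stopping an already-stopped resource is still success
    ;;
  monitor)
    if [ -f "$STATEFILE" ]; then
        exit 0                   # running
    else
        exit 7                   # 7 = OCF_NOT_RUNNING
    fi
    ;;
  meta-data)
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="myapp" version="0.1">
  <version>1.0</version>
  <shortdesc lang="en">Minimal example agent</shortdesc>
  <parameters>
    <parameter name="ip" unique="0" required="0">
      <shortdesc lang="en">Example parameter</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start"     timeout="20s"/>
    <action name="stop"      timeout="20s"/>
    <action name="monitor"   timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
    exit 0
    ;;
  *)
    exit 3                       # 3 = OCF_ERR_UNIMPLEMENTED
    ;;
esac

A real agent would be installed under /usr/lib/ocf/resource.d/<provider>/ and referenced as ocf:<provider>:myapp when creating the resource.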

Constraints

A multi-node cluster runs several services, and these services have dependencies between them. For example, a web server typically consists of a floating IP and an httpd service; we want the floating IP and httpd to run on the same node, and we want httpd to be started only after the floating IP has been brought up successfully. This is achieved by placing constraints on the resources.

Constraints on resources can be added with the following commands:

# pcs constraint ...
for example,
# pcs constraint location vip7 prefers node1
for detail,
# pcs constraint -h

Score

Pacemaker uses a scoring system to decide which node a resource should run on. Every <resource, node> combination is given a score; if the score is negative, the resource will not run on that node. The resource ends up running on the node with the highest score for it.

Scores of all kinds are integral to how the cluster works. Practically everything from moving a resource to deciding which resource to stop in a degraded cluster is achieved by manipulating scores in some way.

Scores are calculated per resource and node. Any node with a negative score for a resource can’t run that resource. The cluster places a resource on the node with the highest score for it.

(INFINITY = 1,000,000)
• Any value + INFINITY = INFINITY
• Any value - INFINITY = -INFINITY
• INFINITY - INFINITY = -INFINITY
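
The scores the cluster has actually computed can be inspected, and finite scores can be assigned explicitly. A sketch, using the vip7 resource created earlier:

# crm_simulate -sL
# pcs constraint location vip7 prefers node1.example.com=50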

Location Constraints

Location constraints tell the cluster which nodes a resource can run on.

Ordering Constraints

Ordering constraints tell the cluster the order in which resources should start.

Colocation Constraints

Colocation constraints tell the cluster that the location of one resource depends on the location of another one.
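
Returning to the web-server example, the three kinds of constraints could be combined as follows. This is a sketch that assumes the vip7 and website resources from the earlier examples:

# pcs constraint location website prefers node1.example.com
# pcs constraint order start vip7 then start website
# pcs constraint colocation add website with vip7 INFINITY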

References

[1] Pacemaker Explained (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/pdf/Pacemaker_Explained/Pacemaker-1.1-Pacemaker_Explained-en-US.pdf)
[2] Clusters from Scratch (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/pdf/Clusters_from_Scratch/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf)
[3] New quorum features in Corosync 2 (http://people.redhat.com/ccaulfie/docs/Votequorum_Intro.pdf)
[4] The OCF Resource Agent Developer's Guide (https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.txt)