背景
办公网DNS使用Nginx作为反向代理服务器,服务器配有corosync+pacemaker,用来实现高可用。
问题
今天在维护这几台服务器时,突然发现2台Nginx的corosync发生了脑裂的现象,各自都分配了vip:
- NodeA
# crm status
Stack: classic openais (with plugin)
Current DC: pr-v17-122.hjidc.com (version 1.1.15-5.el6-e174ec8) - partition WITHOUT quorum
Last updated: Wed Jul 3 16:06:32 2019 Last change: Sat Mar 30 02:36:27 2019 by root via cibadmin on pr-v17-128.hjidc.com
, 2 expected votes
2 nodes and 2 resources configured
Online: [ pr-v17-122.hjidc.com ]
OFFLINE: [ pr-v17-128.hjidc.com ]
Full list of resources:
vip_192.168.18.244 (ocf::heartbeat:IPaddr): Started pr-v17-122.hjidc.com
vip_192.168.18.245 (ocf::heartbeat:IPaddr): Started pr-v17-122.hjidc.com
# ip a
...
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:55:00:00:67:e9 brd ff:ff:ff:ff:ff:ff
inet 192.168.18.177/24 brd 192.168.18.255 scope global eth1
inet 192.168.18.244/24 brd 192.168.18.255 scope global secondary eth1:vip83
inet 192.168.18.245/24 brd 192.168.18.255 scope global secondary eth1:vip84
inet6 fe80::5055:ff:fe00:67e9/64 scope link
valid_lft forever preferred_lft forever
- NodeB
# crm status
Stack: classic openais (with plugin)
Current DC: pr-v17-128.hjidc.com (version 1.1.15-5.el6-e174ec8) - partition WITHOUT quorum
Last updated: Wed Jul 3 16:07:03 2019 Last change: Sat Mar 30 02:36:18 2019 by root via cibadmin on pr-v17-128.hjidc.com
, 2 expected votes
2 nodes and 2 resources configured
Online: [ pr-v17-128.hjidc.com ]
OFFLINE: [ pr-v17-122.hjidc.com ]
Full list of resources:
vip_192.168.18.244 (ocf::heartbeat:IPaddr): Started pr-v17-128.hjidc.com
vip_192.168.18.245 (ocf::heartbeat:IPaddr): Started pr-v17-128.hjidc.com
# ip a
...
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:55:00:00:69:c8 brd ff:ff:ff:ff:ff:ff
inet 192.168.18.179/24 brd 192.168.18.255 scope global eth1
inet 192.168.18.244/24 brd 192.168.18.255 scope global secondary eth1:vip83
inet 192.168.18.245/24 brd 192.168.18.255 scope global secondary eth1:vip84
inet6 fe80::5055:ff:fe00:69c8/64 scope link
valid_lft forever preferred_lft forever
分别重启两台服务器都corosync和pacemaker服务,并没有恢复,又查了日志,没查到明显报错。
缓解
查看crm配置,发现多处配置通过域名方式调用服务器,但ping域名后解析出来的ip地址与实际服务器ip地址不符,修改hosts
文件,指定域名到服务器ip地址,并重启corosync+pacemaker后问题得到缓解。
# ping pr-v17-122.hjidc.com
PING pk.22.cn (47.52.135.116) 56(84) bytes of data.
64 bytes from 47.52.135.116: icmp_seq=1 ttl=47 time=32.2 ms
# vi /etc/hosts
...
192.168.17.122 pr-v17-122 pr-v17-122.hjidc.com
192.168.17.128 pr-v17-128 pr-v17-128.hjidc.com
# /etc/init.d/pacemaker stop
# /etc/init.d/corosync stop
# /etc/init.d/corosync start
# /etc/init.d/pacemaker start
恢复后2各节点的crm及网卡ip状态信息:
- NodeA
# crm status
Stack: classic openais (with plugin)
Current DC: pr-v17-122.hjidc.com (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Jul 3 17:16:51 2019 Last change: Wed Jul 3 16:18:37 2019 by root via cibadmin on pr-v17-128.hjidc.com
, 2 expected votes
2 nodes and 2 resources configured
Online: [ pr-v17-122.hjidc.com pr-v17-128.hjidc.com ]
Full list of resources:
vip_192.168.18.244 (ocf::heartbeat:IPaddr): Started pr-v17-122.hjidc.com
vip_192.168.18.245 (ocf::heartbeat:IPaddr): Started pr-v17-128.hjidc.com
# ip a
...
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:55:00:00:67:e9 brd ff:ff:ff:ff:ff:ff
inet 192.168.18.177/24 brd 192.168.18.255 scope global eth1
inet 192.168.18.244/24 brd 192.168.18.255 scope global secondary eth1:vip83
inet6 fe80::5055:ff:fe00:67e9/64 scope link
valid_lft forever preferred_lft forever
- NodeB
# crm status
Stack: classic openais (with plugin)
Current DC: pr-v17-122.hjidc.com (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Jul 3 17:17:00 2019 Last change: Wed Jul 3 16:18:37 2019 by root via cibadmin on pr-v17-128.hjidc.com
, 2 expected votes
2 nodes and 2 resources configured
Online: [ pr-v17-122.hjidc.com pr-v17-128.hjidc.com ]
Full list of resources:
vip_192.168.18.244 (ocf::heartbeat:IPaddr): Started pr-v17-122.hjidc.com
vip_192.168.18.245 (ocf::heartbeat:IPaddr): Started pr-v17-128.hjidc.com
# ip a
...
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:55:00:00:69:c8 brd ff:ff:ff:ff:ff:ff
inet 192.168.18.179/24 brd 192.168.18.255 scope global eth1
inet 192.168.18.245/24 brd 192.168.18.255 scope global secondary eth1:vip84
inet6 fe80::5055:ff:fe00:69c8/64 scope link
valid_lft forever preferred_lft forever
原因
原先机房linux服务器都使用了.hjidc.com
作为服务器后缀,并且解析都没有问题,而现在,所有主机名解析都会向hjidc.com
这个权威域进行解析,解析出来的结果自然与实际服务器ip地址不一致。猜想应该是hjidc.com
这个权威域的管理员配置了默认解析,才导致了脑裂的问题。