GBASE (南大通用) GBase 8a: cluster services corosync/gcware disrupted by packets from a non-cluster IP

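A GBase 8a database cluster relies on the corosync/gcware service to maintain cluster consistency. If a host whose IP is not part of the cluster sends packets to the cluster nodes (such packets can be captured with tcpdump), normal cluster communication is disrupted. This typically happens when an old environment has been decommissioned without its services being removed.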

Scenario

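A production environment had been running normally for more than three months. One day gcadmin suddenly began to hang and report errors, and the cluster became unusable.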

The cluster has four nodes, with IPs ending in 107, 108, 151, and 152 (all in 192.168.129.x).

[root@gbase2 ~]# gcadmin

  CLUSTER STATE:  LOCKED
  CLUSTER MODE:   NORMAL

 +==========================================================================================================================+
 |                                                   GCLUSTER INFORMATION                                                   |
 +==========================================================================================================================+
 +-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
 | rowid | nodename |    IpAddress    | sgname | dpname | gcware  |  gnode  | gcluster | syncserver | datastate | nodestate |
 +-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
 |   1   |   sg1_1  | 192.168.129.107 |  sg01  |   n1   | Offline |         |          |            |  [1]      |  [0]      |
 +-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
 |   2   |   sg1_2  | 192.168.129.108 |  sg01  |   n2   | Online  |  OPEN   |   OPEN   |    OPEN    |  [0]      |  [0]      |
 +-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
 |   3   |   sg2_1  | 192.168.129.151 |  sg02  |   n3   | Offline |         |          |            |  [0]      |  [0]      |
 +-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
 |   4   |   sg2_2  | 192.168.129.152 |  sg02  |   n4   | Offline |         |          |            |  [0]      |  [0]      |
 +-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+

Troubleshooting

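The cluster itself had been running normally until then, no hardware faults were found, and all NICs were healthy, so the next step was to inspect the network traffic.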

tcpdump -n -i bond0 port 5493
Here bond0 is the name of the network interface that carries the cluster IP.

[root@gbase2 ~]# tcpdump -n  -i bond0 port 5493
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:54:59.657760 IP 192.168.129.108.19822 > 192.168.129.151.5493: UDP, length 244
16:54:59.657777 IP 192.168.129.108.28083 > 192.168.129.152.5493: UDP, length 244
16:54:59.675538 IP 192.168.129.151.17949 > 192.168.129.108.5493: UDP, length 244
16:55:00.001626 IP 192.168.129.113.37835 > 192.168.129.108.5493: UDP, length 112
16:55:00.206573 IP 192.168.129.151.17949 > 192.168.129.108.5493: UDP, length 244
16:55:00.253182 IP 192.168.129.108.19822 > 192.168.129.151.5493: UDP, length 244
16:55:00.253201 IP 192.168.129.108.28083 > 192.168.129.152.5493: UDP, length 244
16:55:00.600413 IP 192.168.129.113.37835 > 192.168.129.108.5493: UDP, length 112
16:55:00.775431 IP 192.168.129.108.19822 > 192.168.129.151.5493: UDP, length 244
16:55:00.775446 IP 192.168.129.108.28083 > 192.168.129.152.5493: UDP, length 244
16:55:00.798784 IP 192.168.129.151.17949 > 192.168.129.108.5493: UDP, length 244

^C
28 packets captured
209 packets received by filter
123 packets dropped by kernel

The capture shows UDP packets arriving from node 113 (192.168.129.113), but that IP does not belong to the cluster.
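Spotting a foreign sender by eye is easy in a short capture but tedious in a long one. The following is a minimal sketch of one way to automate the check. The function name `find_foreign_senders` and the variable `CLUSTER_IPS` are hypothetical names introduced for illustration; the member list is copied from the gcadmin output above and must be adjusted for any other cluster.

```shell
#!/bin/sh
# Assumed cluster members, copied from the gcadmin output in this article.
CLUSTER_IPS="192.168.129.107 192.168.129.108 192.168.129.151 192.168.129.152"

# find_foreign_senders (hypothetical helper): reads `tcpdump -n` text lines
# on stdin and prints, once each, every source IP that is not in CLUSTER_IPS.
find_foreign_senders() {
    awk -v members="$CLUSTER_IPS" '
        BEGIN {
            n = split(members, list, " ")
            for (i = 1; i <= n; i++) known[list[i]] = 1
        }
        # Match lines of the form: <time> IP <src.port> > <dst.port>: ...
        $2 == "IP" && $4 == ">" {
            src = $3
            sub(/\.[0-9]+$/, "", src)   # drop the trailing ".port"
            if (!(src in known) && !(src in seen)) {
                seen[src] = 1
                print src
            }
        }
    '
}
```

It could be fed from a live capture, for example `tcpdump -n -l -i bond0 udp port 5493 | find_foreign_senders`, where `-l` makes tcpdump line-buffer its output so that results appear promptly. The sketch only handles IPv4 lines in the format shown in the capture above.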

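Logging in to node 113 confirmed that a corosync service was indeed running there, and that its configuration file listed the IPs of 107 and 108.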

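It was eventually confirmed that this server was a repurposed machine that had been powered on only the day before in preparation for reuse. It had previously hosted a database cluster; when new hardware arrived, a fresh cluster was installed on the new machines and the old one was shut down, but the services on it were never uninstalled. Three months later, when the machine was started again, its corosync service began sending packets to nodes 107 and 108. The current cluster did not recognize 113 as a cluster member and kept trying to handle the anomaly, which left the cluster in the LOCKED state.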

Solution

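Stop the corosync service on 113 and delete its configuration files. After the services of the current cluster were restarted, the problem was resolved.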

Summary

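When a cluster node is taken out of service, remove the database services from it so that it cannot interfere with a new cluster if it is ever powered on again.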
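As a precaution, a reused machine can be checked for leftover cluster daemons before it is connected to the production network. The sketch below is one possible check, not an official GBase procedure. The helper name `leftover_daemons` and the variable `CLUSTER_DAEMONS` are hypothetical, and the process names `corosync` and `gcware` are assumed from the service names mentioned in this article; the actual process names may differ between GBase 8a versions.

```shell
#!/bin/sh
# Process names assumed from this article; verify them for your version.
CLUSTER_DAEMONS="corosync gcware"

# leftover_daemons (hypothetical helper): reads one process name per line
# on stdin and prints, once each, those listed in CLUSTER_DAEMONS.
leftover_daemons() {
    awk -v names="$CLUSTER_DAEMONS" '
        BEGIN {
            n = split(names, list, " ")
            for (i = 1; i <= n; i++) watch[list[i]] = 1
        }
        ($1 in watch) && !($1 in seen) { seen[$1] = 1; print $1 }
    '
}
```

On a Linux host it could be run as `ps -eo comm= | leftover_daemons`. Empty output means none of the listed daemons is currently running; it does not prove that the software has been uninstalled or that it will not start at the next boot.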