A GBase 8a database cluster relies on the corosync/gcware service to maintain cluster consistency. When a host whose IP is not a cluster member sends packets to the cluster (they can be captured with tcpdump), it can disrupt normal cluster communication. This typically happens when an old environment is torn down without cleaning up the services running on it.
Scenario
A production site had been running normally for more than three months when, one day, gcadmin suddenly started hanging and reporting errors, leaving the cluster unusable.
The cluster consists of four IPs: 192.168.129.107, 108, 151, and 152.
[root@gbase2 ~]# gcadmin
CLUSTER STATE: LOCKED
CLUSTER MODE: NORMAL
+==========================================================================================================================+
|                                                   GCLUSTER INFORMATION                                                   |
+==========================================================================================================================+
+-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
| rowid | nodename | IpAddress       | sgname | dpname | gcware  | gnode   | gcluster | syncserver | datastate | nodestate |
+-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
| 1     | sg1_1    | 192.168.129.107 | sg01   | n1     | Offline |         |          |            | [1]       | [0]       |
+-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
| 2     | sg1_2    | 192.168.129.108 | sg01   | n2     | Online  | OPEN    | OPEN     | OPEN       | [0]       | [0]       |
+-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
| 3     | sg2_1    | 192.168.129.151 | sg02   | n3     | Offline |         |          |            | [0]       | [0]       |
+-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
| 4     | sg2_2    | 192.168.129.152 | sg02   | n4     | Offline |         |          |            | [0]       | [0]       |
+-------+----------+-----------------+--------+--------+---------+---------+----------+------------+-----------+-----------+
Investigation
Since the cluster processes themselves were running, no hardware problems were found, and the network cards were all healthy, the next step was to look at the packet traffic:
tcpdump -n -i bond0 port 5493
Here bond0 is the name of the NIC that carries the cluster IPs.
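If it is not obvious which interface holds the cluster IPs, a quick check such as the following can confirm it first (a sketch; the subnet shown is this environment's):
ip -4 addr show | grep -B 2 '192\.168\.129\.'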
[root@gbase2 ~]# tcpdump -n -i bond0 port 5493
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:54:59.657760 IP 192.168.129.108.19822 > 192.168.129.151.5493: UDP, length 244
16:54:59.657777 IP 192.168.129.108.28083 > 192.168.129.152.5493: UDP, length 244
16:54:59.675538 IP 192.168.129.151.17949 > 192.168.129.108.5493: UDP, length 244
16:55:00.001626 IP 192.168.129.113.37835 > 192.168.129.108.5493: UDP, length 112
16:55:00.206573 IP 192.168.129.151.17949 > 192.168.129.108.5493: UDP, length 244
16:55:00.253182 IP 192.168.129.108.19822 > 192.168.129.151.5493: UDP, length 244
16:55:00.253201 IP 192.168.129.108.28083 > 192.168.129.152.5493: UDP, length 244
16:55:00.600413 IP 192.168.129.113.37835 > 192.168.129.108.5493: UDP, length 112
16:55:00.775431 IP 192.168.129.108.19822 > 192.168.129.151.5493: UDP, length 244
16:55:00.775446 IP 192.168.129.108.28083 > 192.168.129.152.5493: UDP, length 244
16:55:00.798784 IP 192.168.129.151.17949 > 192.168.129.108.5493: UDP, length 244
^C
28 packets captured
209 packets received by filter
123 packets dropped by kernel
The capture shows UDP packets arriving from node 192.168.129.113, an IP that is not part of the cluster.
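To make such foreign senders stand out, the same capture can exclude the known cluster IPs, along the lines of (a hedged variant; adjust the host list to your cluster):
tcpdump -n -i bond0 udp port 5493 and not host 192.168.129.107 and not host 192.168.129.108 and not host 192.168.129.151 and not host 192.168.129.152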
Logging in to the 113 node, we found that a corosync service was indeed running there, and its configuration file listed the 107 and 108 IPs.
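The check on 113 amounted to something like the following (a sketch; in a GBase 8a deployment the corosync/gcware configuration may live under the cluster's install directory rather than /etc/corosync, so the path here is an assumption):
ps -ef | grep -i corosync | grep -v grep
grep -rn '192\.168\.129\.10[78]' /etc/corosync/corosync.conf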
It was finally confirmed that this was a repurposed server that had been powered back on only the day before. A database cluster had previously been installed on it; when new hardware arrived, a brand-new cluster was installed on the new machines and the old server was shut down, but the services on it were never uninstalled. Three months later, once the machine was powered on again, its corosync service began sending packets to the 107 and 108 nodes. The running cluster did not recognize 113 as a cluster IP and kept handling these anomalous packets, which drove the cluster into the LOCKED state.
Solution
Stop the corosync service on 113 and delete its related configuration files, then restart the services of the existing cluster. After that the problem was resolved.
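A minimal sketch of those steps, assuming corosync on 113 runs as a SysV/systemd service and that the GBase 8a gcluster_services script is available on the cluster nodes (service names and config paths may differ per installation):
# On 192.168.129.113: stop corosync and keep it from coming back at boot
service corosync stop        # or: systemctl stop corosync
chkconfig corosync off       # or: systemctl disable corosync
mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.bak   # path is an assumption
# On the cluster nodes (107/108/151/152): restart the cluster services
gcluster_services all stop
gcluster_services all start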
Summary
When a cluster node is taken out of service, clean up the database services on it, so that if the machine is ever brought back online it cannot interfere with the new cluster.