GBase 8a maintains cluster state through the gcware cluster, covering each node's services, data consistency, and so on. The active check mechanism (supported in all versions) has gcware periodically scan the service status of each node; the passive check mechanism (supported from version 9.5.3) has each node's service register with gcware and report its own status. This article covers gcware's active check mechanism.
Reference
Service status registration and detection mechanism of the GBase 8a 9.5.3 multi-instance version
Active detection methods
There are four detection methods, described below. SSH and socket detection have been supported since the earliest versions; GBase Ping and the SQL detection method that this article focuses on require 8.6.2 Build43R33 or a later release.
All parameters live in gcware's configuration file: corosync.conf in V86 and gcware.conf in V95.
SSH service
Host-level detection. Checks whether the node's SSH service is reachable; if the connection fails or times out, the node is judged offline.
The SSH port is specified by the node_ssh_port parameter; the default is 22.
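As a rough manual equivalent of this check, you can probe the SSH port yourself with nc (the node IP is taken from the examples later in this article; nc being installed is an assumption):
nc -z -w 5 10.0.2.115 22 && echo "ssh port reachable" || echo "ssh port unreachable"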
Socket probe of service ports
Service-level detection. Connects to the service port over TCP; if the connection cannot be established, the service is judged CLOSE. Below are the port settings of the three monitored services.
gcluster_port: 5258
gnode_port: 5050
syncserver_port: 5288
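A minimal sketch of what this socket probe amounts to, looping nc over the three default ports (the node IP is illustrative):
for p in 5258 5050 5288; do
  nc -z -w 3 10.0.2.101 "$p" && echo "port $p open" || echo "port $p closed"
done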
Executing the internal Ping command
Pre-execution detection. Before each SQL statement is dispatched, a Ping (an internal command) is sent first to check the service status; if the status is abnormal, the task is not dispatched.
This command is internal and not exposed for external use.
Officially supported from V862Build43R33 and later releases.
The audit log shows the following sequence: the ping succeeds first, then the business statement count(*) is executed.
# Threadid=66;
# Taskid=0;
# Time: 220602 14:49:13
# End_time: 220602 14:49:13
# User@Host: root[root] @ [10.0.2.101]
# UID: 1
# Query_time: 0.000032 Rows: 0
# SET timestamp=1654152553;
# administrator command: Ping;
# Sql_type: OTHERS;
# Sql_command: Ping;
# Status: SUCCESS;
# Connect Type: CAPI;
# Threadid=66;
# Taskid=0;
# Time: 220602 14:49:14
# End_time: 220602 14:49:14
# User@Host: root[root] @ [10.0.2.101]
# UID: 1
# Query_time: 0.004065 Rows: 1
# Tables: WRITE: ; READ: `testdb`.`tt_n2`; OTHER: ; ;
# SET timestamp=1654152554;
# Sql_text: SELECT COUNT(1) FROM `testdb`.`tt_n2` `vcname000001.testdb.tt`;
# Sql_type: DQL;
# Sql_command: SELECT;
# Status: SUCCESS;
# Connect Type: CAPI;
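To count how many such Ping checks your own audit log has recorded, a simple grep is enough; the log path below is a placeholder, substitute your node's actual audit log file:
grep -c "Sql_command: Ping" /path/to/audit.log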
SQL detection
A SQL statement is dispatched periodically to check the service status. If it does not return within the timeout, the node's service is judged abnormal and set to CLOSE, which blocks subsequent SQL dispatch to that node.
In current versions the detection SQL is select 1 and cannot be changed.
Officially supported from V862Build43R33 and later releases. If you are unsure about your version, contact technical support.
This requires changing the check_tcp_only parameter. Its default value is 1, which performs only TCP-level checks, i.e. the first three methods (SSH, socket, and Ping).
check_tcp_only: 1
Change it to the following, where 0 is the switch that enables SQL detection. inner_connect_read_write_timeout is the time limit: if a SQL statement takes longer than this, the node's service is judged abnormal and set to CLOSE. The default value is 15.
Warning: be sure to adjust this parameter to your actual environment. On a system that is already under extreme load, where SQL already returns slowly, this check can otherwise trigger frequent service CLOSE events and make the load even heavier.
This parameter suits systems whose overall load is not high (including peak hours), as a guard against performance problems caused by individual misbehaving nodes, e.g. a service that accepts connections but is stuck or performing extremely poorly.
check_tcp_only: 0
inner_connect_read_write_timeout: 5
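As a rough manual approximation of this check, you can run the same statement against a gnode under a wall-clock limit; this sketch assumes the gccli client with its mysql-style flags and illustrative credentials, and uses the coreutils timeout command in place of inner_connect_read_write_timeout:
timeout 5 gccli -h 10.0.2.115 -P 5050 -uroot -e "select 1" || echo "node would be judged abnormal"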
All periodic checks in current versions (including the SQL check) are controlled by the following two parameters. check_interval (default 30 seconds) is the check interval, i.e. how long to wait between checks of an abnormal node to see whether it has recovered; whole_check_interval_num is the number of such intervals after which a full check of all nodes and all services is performed. When the whole cluster is healthy, that means a full check every 30*20 = 600 seconds, i.e. every 10 minutes.
check_interval: 30
whole_check_interval_num: 20
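For example, to halve the full-check period to 5 minutes while keeping the per-round interval, the following illustrative values give 30*10 = 300 seconds:
check_interval: 30
whole_check_interval_num: 10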
SQL detection example
Normal operation
A check arrives every 10 minutes: a complete connect, select, quit sequence.
# Threadid=55;
# Taskid=0;
# Time: 700101 8:00:00
# End_time: 700101 8:00:00
# User@Host: gbase[gbase] @ [10.0.2.101]
# UID: 2
# Query_time: 0.000000 Rows: 0
# SET timestamp=0;
# administrator command: Connect;
# Sql_type: OTHERS;
# Sql_command: Connect;
# Status: SUCCESS;
# Connect Type: CAPI;
# Threadid=55;
# Taskid=0;
# End_time: 220602 14:25:24
# User@Host: gbase[gbase] @ [10.0.2.101]
# UID: 2
# Query_time: 0.000107 Rows: 1
# use gbase;
# Tables: WRITE: ; READ: ; OTHER: ; ;
# SET timestamp=1654151124;
# Sql_text: select 1;
# Sql_type: DQL;
# Sql_command: SELECT;
# Status: SUCCESS;
# Connect Type: CAPI;
# Threadid=55;
# Taskid=0;
# End_time: 220602 14:25:24
# User@Host: gbase[gbase] @ [10.0.2.101]
# UID: 2
# Query_time: 0.000005 Rows: 0
# SET timestamp=1654151124;
# administrator command: Quit;
# Sql_type: OTHERS;
# Sql_command: Quit;
# Status: SUCCESS;
# Connect Type: CAPI;
.......................
# Threadid=57;
# Taskid=0;
# Time: 700101 8:00:00
# End_time: 700101 8:00:00
# User@Host: gbase[gbase] @ [10.0.2.101]
# UID: 2
# Query_time: 0.000000 Rows: 0
# SET timestamp=0;
# administrator command: Connect;
# Sql_type: OTHERS;
# Sql_command: Connect;
# Status: SUCCESS;
# Connect Type: CAPI;
# Threadid=57;
# Taskid=0;
# End_time: 220602 14:35:24
# User@Host: gbase[gbase] @ [10.0.2.101]
# UID: 2
# Query_time: 0.000129 Rows: 1
# use gbase;
# Tables: WRITE: ; READ: ; OTHER: ; ;
# SET timestamp=1654151724;
# Sql_text: select 1;
# Sql_type: DQL;
# Sql_command: SELECT;
# Status: SUCCESS;
# Connect Type: CAPI;
# Threadid=57;
# Taskid=0;
# End_time: 220602 14:35:24
# User@Host: gbase[gbase] @ [10.0.2.101]
# UID: 2
# Query_time: 0.000005 Rows: 0
# SET timestamp=1654151724;
# administrator command: Quit;
# Sql_type: OTHERS;
# Sql_command: Quit;
# Status: SUCCESS;
# Connect Type: CAPI;
Simulating a failure
To make testing easier, we reduce the timeout parameter to 1 second.
check_tcp_only: 0
inner_connect_read_write_timeout: 1
Use the tc tool to simulate a network fault by changing the NIC's latency to 1000 ms:
tc qdisc add dev enp0s3 root netem delay 1000ms
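Before checking gcware's reaction, you can confirm the rule took effect by inspecting the qdisc and watching the round-trip time jump by roughly 1000 ms (the target IP is the faulty node from this example):
tc qdisc show dev enp0s3
ping -c 3 10.0.2.115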
Checking the gcware fault detection log
The following output appears: roughly one round of checks every 30 seconds, with 3 attempts per round (controlled by the cfg_check_times_judge_failure parameter).
Jun 02 14:53:05.713318 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115
Jun 02 14:53:06.715107 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Can't connect to GBase server on '10.0.2.115' (4)
Jun 02 14:53:08.717442 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115
Jun 02 14:53:22.695797 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Can't connect to GBase server on '10.0.2.115' (4)
Jun 02 14:53:25.700085 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'reading authorization packet', system error: 11
Jun 02 14:53:27.703212 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115
Jun 02 14:54:01.714689 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Can't connect to GBase server on '10.0.2.115' (4)
Jun 02 14:54:04.718350 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'reading authorization packet', system error: 11
Jun 02 14:54:06.721413 ERROR [CLM ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115
Checking the cluster fault status
The gnode service on the faulty node is CLOSE.
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 | OPEN | OPEN | 0 |
-------------------------------------------------------------
=========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
=========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.101 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.102 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.115 | 5 | CLOSE | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
[gbase@gbase_rh7_001 gcluster]$
Simulating fault recovery
Remove the tc rule to restore the network:
tc qdisc del dev enp0s3 root
Checking the gcware log after fault recovery
Jun 02 14:54:37.751090 NOTIC [CLM ] EXEC request: invalid node del 1929510922
Jun 02 14:54:37.751146 NOTIC [CLM ] EXEC request: invalid node del delete node: 1929510922
Jun 02 14:54:37.751165 NOTIC [CLM ] EXEC request: notification_clusterstate_changed clusterstatechange = 1, trackflag = 8, num = 1
Jun 02 14:54:37.751181 NOTIC [CLM ] nodeid = 1929510922
Checking the cluster status after recovery
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 | OPEN | OPEN | 0 |
-------------------------------------------------------------
=========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
=========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.101 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.102 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.115 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
[gbase@gbase_rh7_001 gcluster]$
Summary
For systems whose day-to-day load is light, the SQL detection described in this article can reduce the cluster-wide slowdowns or hangs caused by performance problems on individual nodes: nodes that time out are taken out of service, and once their performance recovers they resume serving.
This parameter must be set according to the actual site conditions. In my view, the 5 seconds used in the example above is a bit low; if you are unsure about the impact, set it higher.
If possible, collect the actual time it takes each gnode to execute select 1, covering the busiest periods such as weekends, the start of the week, and the end and start of the month. Take the busy-period latency as a baseline, add a safety factor on top, and use that to set this timeout; a collection sketch follows.
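A minimal collection sketch, assuming the gccli client with mysql-style flags, the node IPs used in this article, and passwordless login as root; adjust all of these, plus the sampling interval, to your environment:
#!/bin/bash
# Sample the wall-clock time of "select 1" on each gnode once per minute.
NODES="10.0.2.101 10.0.2.102 10.0.2.115"
while true; do
  for ip in $NODES; do
    t0=$(date +%s.%N)                                 # start timestamp
    gccli -h "$ip" -P 5050 -uroot -e "select 1" >/dev/null 2>&1
    t1=$(date +%s.%N)                                 # end timestamp
    echo "$(date '+%F %T') $ip $(echo "$t1 - $t0" | bc)s"
  done
  sleep 60
done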