Some builds of GBase 8a 953 add an eventMode option for node replacement. The option applies only to replacement of the data service; replacement of the management (coordinator) and scheduling (gcware) services is unchanged, and the original redistribution-mode approach to data-service replacement remains available.
Environment
Supported versions
As of 2025-03-21 the feature is implemented in 9.5.3.28.18R1_Patch.8 and may later be merged into other versions.
Test cluster
A 3-node cluster: 1 management (coordinator) service, 1 scheduling (gcware) service and 3 data services. One node running only the data service has failed and was set to unavailable.
If the failed node is a composite node hosting several services, replace the other services with the original procedure first and replace the data service last.
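In this test the failed data node was set to unavailable by hand. A minimal sketch of doing that from a coordinator node, assuming the standard gcadmin setnodestate subcommand:

gcadmin setnodestate 10.0.2.152 unavailable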
[gbase@rh151 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
====================================
| GBASE GCWARE CLUSTER INFORMATION |
====================================
| NodeName | IpAddress | gcware |
------------------------------------
| gcware1 | 10.0.2.151 | OPEN |
------------------------------------
====================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
====================================================
| NodeName | IpAddress | gcluster | DataState |
----------------------------------------------------
| coordinator1 | 10.0.2.151 | OPEN | 0 |
----------------------------------------------------
===============================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===============================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.152 | 1,2 | UNAVAILABLE | | |
---------------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.151 | 1,2 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.153 | 1,2 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
The eventMode replacement command
./replace.py --feventMode --host=10.0.2.152 --dbaUser=gbase --dbaUserPwd=XXXX --generalDBUser=gbase --generalDBPwd=XXXXXX --overwrite --type=data
--feventMode is the newly added parameter; it makes the data-service replacement use event mode. Without it, the original redistribution mode is used.
eventMode replacement requirements
With this version of the eventMode replacement, any redistribution that is in progress (for example, a scale-out rebalance) must be stopped first:
- gcluster_rebalancing_concurrent_count must be set to 0 (set global gcluster_rebalancing_concurrent_count=0)
- the gclusterdb.rebalancing_status table must be empty (truncate table gclusterdb.rebalancing_status)
Once all events on the replaced node have been recovered, the redistribution can be resumed: running rebalance instance repopulates the gclusterdb.rebalancing_status table.
These requirements are checked at replacement time; if they are not met, the tool reports an error and exits.
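For reference, a minimal sketch of stopping an in-progress rebalance and verifying the two conditions before the replacement, assuming the statements are issued through the gccli client on a coordinator node (user and password are placeholders):

gccli -ugbase -pXXXXXX -e "set global gcluster_rebalancing_concurrent_count=0"
gccli -ugbase -pXXXXXX -e "truncate table gclusterdb.rebalancing_status"
# verify the two conditions that replace.py will check
gccli -ugbase -pXXXXXX -e "show variables like 'gcluster_rebalancing_concurrent_count'"
gccli -ugbase -pXXXXXX -e "select count(*) from gclusterdb.rebalancing_status"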
eventMode replacement execution process
In the output, the following lines correspond to the checks of the requirements above:
gcluster_rebalancing_concurrent_count value was 0 at coordinator 10.0.2.151
table gclusterdb.rebalancing_status was empty
After a successful replacement, the output includes the following reminder not to start a rebalance until all events have been recovered:
please wait all dmlstorageevent recover success, do not start rebalance before it
The complete output of the replacement run:
install prefix: /opt/gbase/10.0.2.151
execute replace node os user: gbase
replaced nodes: ['10.0.2.152']
replace node type: data
dbaHome: /home/gbase
check DBA_HOME environment variable
IsAutoGcware: True
feventMode: True
coordinator hosts: ['10.0.2.151']
data hosts: ['10.0.2.152', '10.0.2.151', '10.0.2.153']
freenode hosts: []
node address type: IPV4
localHost is: 10.0.2.151
coorIdDict: {'10.0.2.151': '2533490698'}
gcware mode: single vc mode
host 10.0.2.152 node state: UNAVAILABLE
host 10.0.2.152 node state: UNAVAILABLE
check login all host
10.0.2.152
Are you sure to replace install these nodes ([Y,y]/[N,n])? y
check database user and password ...
check database user and password successful
check rebalance status ...
distribution id list:['2', '1']
gcluster_rebalancing_concurrent_count value was 0 at coordinator 10.0.2.151
table gclusterdb.rebalancing_status was empty
check rebalance status successful
check crontab privilege ...
get server and data host dict
create flag file every data host
read host list every data host
delete flag file every data host
get server and data host dict end
check crontab privilege successfully
10.0.2.152 os type is CentOS
checking rpms ...
uninstall host ['10.0.2.152'] begin
check and stop gcmonit ...
check and stop gcmonit successful
uninstall host ['10.0.2.152'] end
install host ['10.0.2.152'] begin
mkdir /opt/gbase/10.0.2.152/cluster_prepare on host 10.0.2.152.
Copying /home/gbase/gcinstall/10.0.2.152.options to host 10.0.2.152:/opt/gbase/10.0.2.152/cluster_prepare
Copying data files to host 10.0.2.152 successfully
send install command: /usr/bin/python /opt/gbase/10.0.2.152/cluster_prepare/InstallTar.py --silent=/opt/gbase/10.0.2.152/cluster_prepare/10.0.2.152.options --IsData --type=data --IsAutoGcware --uuid=ddb84e64-0551-11f0-adb0-08002773b683
install host ['10.0.2.152'] end
sync cluster config file begin
sync cluster config file end
sync kerberos file begin
sync kerberos file end
Starting all gcluster nodes ...
get multi instance on replaced host
multi instance dictionary:{'10.0.2.152': ['10.0.2.152']}
Begin to exec gcadmin replacenodes ...
get table id and set dmlstorageevent on node [10.0.2.151], please wait a moment
check ip start ......
check ip end ......
switch cluster mode into READONLY start ......
wait all ddl statement stop ......
all ddl statement stoped
switch cluster mode into READONLY end ......
gcadmin check rebalance status start ......
distribution number:2
gcadmin check rebalance status end ......
delete all fevent log on replace nodes start ......
delete ddl event log on node 10.0.2.152 start
delete ddl event log on node 10.0.2.152 end
delete dml event log on node 10.0.2.152 start
delete dml event log on node 10.0.2.152 end
delete dml storage event log on node 10.0.2.152 start
delete dml storage event log on node 10.0.2.152 end
delete all fevent log on replace nodes end ......
sync dataserver metedata begin ......
copy script to data node begin
copy script to data node end
build data packet begin
build data packet end
copy data packet to target node begin
copy data packet to target node end
extract data packet begin
extract data packet end
sync dataserver metedata end, spend time 46898 ms ......
create database start ......
create database end ......
restore node state start ......
restore node state end ......
replace nodes spend time: 101831 ms
set dmlstorageevent success
please wait all dmlstorageevent recover success, do not start rebalance before it
Replace gcluster nodes successfully.
[gbase@rh151 gcinstall]$
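Before resuming the scale-out rebalance, confirm that every event on the replaced node has been recovered. A minimal sketch of one way to do this and then resume, assuming the standard gcadmin event-display subcommands and the gccli client; the concurrency value shown is only an example, restore whatever value was in use before:

# the three event lists should show nothing left for the replaced node
gcadmin showddlevent
gcadmin showdmlevent
gcadmin showdmlstorageevent
# then restore the rebalance concurrency and repopulate gclusterdb.rebalancing_status
gccli -ugbase -pXXXXXX -e "set global gcluster_rebalancing_concurrent_count=5"
gccli -ugbase -pXXXXXX -e "rebalance instance"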
Summary
This method is only an additional implementation; the original redistribution-mode data-service replacement is unaffected.
Its drawback is that it generates a large number of events, so, unlike redistribution mode, there is no relatively precise way to limit the impact on existing workloads by tuning the degree of parallelism.
Its advantage is that a node replacement can be completed first even when a scale-out redistribution is only halfway done.