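南大通用 GBase 8a V95 Node Replacement Procedure

GBase 8a V95 changes how node replacement works: it now uses a redistribution (rebalance) approach so that users can control system resource usage themselves. This article simulates a node failure in a 3-node cluster and walks through the entire recovery.

For node replacement on V86, see the article "GBase 8a forced node offline and node replacement (replace)".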

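In this article, the primary/duplicate segment layout (the distribution) appears in three forms:
Old policy: the current primary/duplicate layout, which still contains the failed node.
Intermediate policy: the primary/duplicate layout with the failed node removed.
Final policy: the primary/duplicate layout after a successful recovery. It is identical to the old policy.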

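1. Environment preparation

The following sections prepare a test environment. If you are working on a real site, start from section 1.3 (New environment preparation).

1.1 Test environment

This is a 3-node cluster in which every node is both a data node and a management (coordinator) node. The initial distribution ID is 7. On a site that has never been scaled out, the distribution ID starts from 1.

The installation package has been unpacked into /home/gbase/gcinstall on the first node (10.0.2.102).

The version under test is 9.5.2.26.

1.1.1 Cluster node information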
[gbase@localhost ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.102 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.202 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator3 | 10.0.2.203 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.102                |       7        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.202                |       7        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.203                |       7        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

1.1.2 Cluster primary/duplicate information

[gbase@localhost ~]$ gcadmin showdistribution

                                 Distribution ID: 7 | State: new | Total segment num: 3

             Primary Segment Node IP                   Segment ID                 Duplicate Segment node IP
========================================================================================================================
|                   10.0.2.102                   |         1          |                   10.0.2.202                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.202                   |         2          |                   10.0.2.203                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         3          |                   10.0.2.102                   |
========================================================================================================================

1.1.3 Cluster version information

Existing cluster

[gbase@localhost gcinstall]$ gclusterd -V
gclusterd ver 9.5.2.26.121440 for unknown-linux-gnu on x86_64
[gbase@localhost gcinstall]$

Installation package

[gbase@localhost gcinstall]$ cat BUILDINFO
release_version =  9.5.2.26
os_ventor =  redhat
build_version = release
license = none
install_svn:121428
autobuild_svn:121428
monit_svn: 76343
gcrcman_svn: 114799
corosync_svn: 121227
gcware_svn: 121226
gcluster_svn: 121440
logCollector_svn: 94134
GCMonit_svn: 113145
gnode_svn: 121440
[gbase@localhost gcinstall]$ ll
total 139992
-rw-r--r-- 1 gbase gbase       288 Aug 15 01:04 BUILDINFO
-rw-r--r-- 1 gbase gbase   2245528 Aug 15 01:04 bundle_data.tar.bz2
-rw-r--r-- 1 gbase gbase 135498012 Aug 15 01:04 bundle.tar.bz2
-rw-r--r-- 1 gbase gbase      1551 Aug  8 09:10 CGConfigChecker.py
-rwxr-xr-x 1 gbase gbase      3851 Aug  8 09:10 chkLicense
-rw-rw-r-- 1 gbase gbase       301 Oct 11  2019 cluster.conf
-rwxrwxr-x 1 gbase gbase      4854 Oct 11  2019 CorosyncConf.py
-rw-r--r-- 1 gbase gbase       305 Aug  8 09:10 demo.options
-rw-r--r-- 1 gbase gbase       170 Aug 15 01:03 dependRpms
-rw-rw-r-- 1 gbase gbase       684 Oct 11  2019 example.xml
-rwxrwxr-x 1 gbase gbase       358 Oct 11  2019 extendCfg.xml
drwxrwxr-x 3 gbase gbase        49 Oct 11  2019 extra_rpms
-rw-rw-r-- 1 gbase gbase       781 Oct 11  2019 FileCheck.py
-rw-rw-r-- 1 gbase gbase      2700 Oct 11  2019 fulltext.py
-rw-rw-r-- 1 gbase gbase   4818440 Oct 11  2019 gbase_data_timezone.sql
-rwxrwxr-x 1 gbase gbase      4264 Oct 11  2019 gccopy.py
-rwxrwxr-x 1 gbase gbase      4462 Oct 11  2019 gcexec.py
-rwxr-xr-x 1 gbase gbase     98991 Aug  8 09:10 gcinstall.py
-rw-rw-r-- 1 gbase gbase       294 Oct 11  2019 gcwareGroup.json
-rw-r--r-- 1 gbase gbase    180956 Aug  8 09:10 InstallFuns.py
-rw-r--r-- 1 gbase gbase    180691 Aug  8 09:10 InstallTar.py
-rw-rw-r-- 1 gbase gbase      5167 Oct 11  2019 license.txt
-rwxrwxr-x 1 gbase gbase     75990 Oct 11  2019 pexpect.py
-rwxr-xr-x 1 gbase gbase     32361 Aug  8 09:10 replace.py
-rwxr-xr-x 1 gbase gbase     23930 Aug  8 09:10 replaceStop.py
-rw-rw-r-- 1 gbase gbase      2981 Oct 11  2019 RestoreLocal.py
-rwxr-xr-x 1 gbase gbase      9965 May 19  2020 Restore.py
-rw-r--r-- 1 gbase gbase      8666 Aug  8 09:10 rmt.py
-rw-r--r-- 1 gbase gbase       299 Aug  8 09:10 rootPwd.json
-rwxr-xr-x 1 gbase gbase     27855 May 17  2020 SetSysEnv.py
-rw-r--r-- 1 gbase gbase      2512 Aug  8 09:10 SSHThread.py
-rwxr-xr-x 1 gbase gbase      6458 Aug  8 09:10 unInstall_fulltext.py
-rwxr-xr-x 1 gbase gbase     21662 May 17  2020 unInstall.py

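1.2 Fault simulation

We stop the services on node 2 (10.0.2.202) and then delete all of its directories to simulate a destroyed machine. The IP address and the operating system are kept.

1.2.1 Stop the services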

[gbase@localhost ~]$ ifconfig
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.202  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::3486:f571:1c39:3ce5  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:49:8b:0c  txqueuelen 1000  (Ethernet)
        RX packets 216177  bytes 39836607 (37.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 140374  bytes 32140351 (30.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 20119  bytes 4386155 (4.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 20119  bytes 4386155 (4.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:ce:92:64  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[gbase@localhost ~]$ gcluster_services all stop
Stopping GCMonit success!
Stopping gcrecover :                                       [  OK  ]
Stopping gcluster :                                        [  OK  ]
Stopping gcware :                                          [  OK  ]
Stopping gbase :                                           [  OK  ]
Stopping syncserver :                                      [  OK  ]

1.2.2 Delete all files in the service directory

[gbase@localhost ~]$ cd /opt/gbase
[gbase@localhost gbase]$ ll
total 0
drwxr-xr-x  8 gbase gbase  92 Sep  8 02:38 gcluster
drwxr-xr-x 13 gbase gbase 148 Sep  8 02:51 gcware
drwxrwxr-x  8 gbase gbase  92 Sep  8 02:38 gnode
[gbase@localhost gbase]$ rm -fr *
[gbase@localhost gbase]$ ll
total 0
[gbase@localhost gbase]$ gcluseter_service all start
bash: gcluseter_service: command not found...
[gbase@localhost gbase]$

1.2.3 Delete the OS dbaUser account gbase to complete the simulation

Note: run this as the OS root user.

[root@localhost ~]# userdel gbase -r
[root@localhost ~]# ll /home
total 0
drwx------. 3 ubuntu ubuntu 78 Oct 15  2019 ubuntu
[root@localhost ~]#

1.2.4 Simulate inconsistency events

Create a table to generate a ddlevent, then insert some data to generate a dmlevent.

[gbase@localhost gcinstall]$ gccli testdb

GBase client 9.5.2.26.121440. Copyright (c) 2004-2020, GBase.  All Rights Reserved.

gbase> create table test_replace(id int);
Query OK, 0 rows affected (Elapsed: 00:00:00.46)

gbase> insert into test_replace values(1);
Query OK, 1 row affected (Elapsed: 00:00:00.21)

gbase> ^CAborted
[gbase@localhost gcinstall]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.102 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.202 | CLOSE  |  CLOSE   |     1     |
-------------------------------------------------------------
| coordinator3 | 10.0.2.203 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.102                |       7        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.202                |       7        | CLOSE |   CLOSE    |     1     |
---------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.203                |       7        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@localhost gcinstall]$ gcadmin showddlevent
Vc event count:1
Event ID:    20
ObjectName: testdb.test_replace
Fail Node Copy:
------------------------------------------------------
NodeID: 3389128714      NodeIP:10.0.2.202       FAILURE

Fail Data Copy:
------------------------------------------------------
SegName: n2     NodeIP: 10.0.2.202      FAILURE
SegName: n1     NodeIP: 10.0.2.202      FAILURE

[gbase@localhost gcinstall]$ gcadmin showdmlevent
Vc event count:1
Event ID:    9
ObjectName: testdb.test_replace

Fail Data Copy:
------------------------------------------------------
SegName: n1     SCN: 15370      NodeIP: 10.0.2.202      FAILURE

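1.3 New environment preparation

On a real site, start here. Installing the operating system, disabling the firewall, and similar steps are not covered here; please prepare them yourself.

1.3.1 Create the user

Note: run this as the OS root user. The gbase password must match the one used on the other nodes.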

[root@localhost ~]# useradd gbase -m
[root@localhost ~]# passwd gbase
Changing password for user gbase.
New password:
BAD PASSWORD: The password contains the user name in some form
Retype new password:
passwd: all authentication tokens updated successfully.
[root@localhost ~]# ll /home/gbase
total 0
[root@localhost ~]#

1.3.2 Set up the operating system environment

Copy SetSysEnv.py from the gcinstall directory of the installation package and run it.

[root@localhost ~]# scp 10.0.2.102:/home/gbase/gcinstall/SetSysEnv.py /root/
root@10.0.2.102's password:
SetSysEnv.py                                                                                                                                                                100%   26KB  25.8KB/s   00:01
[root@localhost ~]# python /root/SetSysEnv.py
"--dbaUser" must be assigned.
[root@localhost ~]# python /root/SetSysEnv.py  --dbaUser=gbase
[root@localhost ~]#

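2. Set the failed node's state to unavailable

Tip: this step and the next one are best done in parallel with preparing the new node's environment. As soon as you have confirmed that the node's disk is damaged and the node must be replaced, you can perform this step. Cleaning up events can take anywhere from a few minutes to several hours depending on how many there are, so doing it early is strongly recommended.

Run the command and check that the state was set correctly.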

[gbase@localhost gcinstall]$ gcadmin setnodestate 10.0.2.202 unavailable
after set node state into unavailable,can not set the state into normal,
must run gcadmin replacenodes to replace this node ,after that command node state can return into normal.
you realy want to set node state into unavailable(yes or no)?
yes
get node data state by ddl fevent log start ......
get node data state by ddl fevent log end ......
get node data state by dml fevent log start ......
get node data state by dml fevent log end ......
get node data state by dml storage fevent log start ......
get node data state by dml storage fevent log end ......
check coordinator node data state by fevent log start ......
check coordinator node data state by fevent log end ......

check data server node data state by fevent log start ......
check data server node data state by fevent log end ......
set node [10.0.2.202] state to unavailable successful

[gbase@localhost gcinstall]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

==================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION              |
==================================================================
|   NodeName   | IpAddress  |   gcware    | gcluster | DataState |
------------------------------------------------------------------
| coordinator1 | 10.0.2.102 |    OPEN     |   OPEN   |     0     |
------------------------------------------------------------------
| coordinator2 | 10.0.2.202 | UNAVAILABLE |          |           |
------------------------------------------------------------------
| coordinator3 | 10.0.2.203 |    OPEN     |   OPEN   |     0     |
------------------------------------------------------------------
===============================================================================================================
|                                       GBASE DATA CLUSTER INFORMATION                                        |
===============================================================================================================
| NodeName |                IpAddress                 | DistributionId |    gnode    | syncserver | DataState |
---------------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.102                |       7        |    OPEN     |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.202                |       7        | UNAVAILABLE |            |           |
---------------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.203                |       7        |    OPEN     |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------------

[gbase@localhost gcinstall]$

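3. Delete the feventlog of the node being replaced

Tip: this step and the previous one are best done in parallel with preparing the new node's environment. As soon as you have confirmed that the node's disk is damaged and the node must be replaced, you can perform this step. Cleaning up events can take anywhere from a few minutes to several hours depending on how many there are, so doing it early is strongly recommended.

Run the command and check that the events have been deleted.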

[gbase@localhost gcinstall]$ gcadmin rmfeventlog 10.0.2.202
after rmfeventlog 10.0.2.202, fevent log will be removed, must run gcadmin replacenodes to replace this node.
you realy want to remove node 10.0.2.202 fevent log(yes or no)?
yes
delete ddl event log on node 10.0.2.202 start
delete ddl event log on node 10.0.2.202 end
delete dml event log on node 10.0.2.202 start
delete dml event log on node 10.0.2.202 end
delete dml storage event log on node 10.0.2.202 start
delete dml storage event log on node 10.0.2.202 end

[gbase@localhost gcinstall]$ gcadmin showddlevent
Vc event count:0
[gbase@localhost gcinstall]$ gcadmin showdmlevent
Vc event count:0
[gbase@localhost gcinstall]$ gcadmin showdmlstorageevent
Vc event count:0
[gbase@localhost gcinstall]$
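
The three checks above all come down to reading the "Vc event count" line. The following is a minimal Python sketch of that check, assuming the output format shown in this article. The function names are illustrative and are not part of GBase; the sample strings are copied from the transcripts above.

```python
import re


def vc_event_count(output: str) -> int:
    """Return N from the 'Vc event count:N' line of gcadmin show*event output."""
    match = re.search(r"Vc event count:\s*(\d+)", output)
    if match is None:
        raise ValueError("no 'Vc event count' line found")
    return int(match.group(1))


def all_events_cleared(outputs: dict) -> bool:
    """True when every captured output reports zero events."""
    return all(vc_event_count(text) == 0 for text in outputs.values())


# Output captured before rmfeventlog (section 1.2.4) and after it (this section).
BEFORE = "Vc event count:1\nEvent ID:    20\nObjectName: testdb.test_replace\n"
AFTER = "Vc event count:0\n"

if __name__ == "__main__":
    print(vc_event_count(BEFORE))
    print(all_events_cleared({"ddl": AFTER, "dml": AFTER, "dmlstorage": AFTER}))
```

In practice the strings would come from running gcadmin showddlevent, showdmlevent and showdmlstorageevent and capturing their output.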

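4. Replace the management node

If you are replacing only a data node, skip to chapter 5, Replace the data node.

Note: while a management node is being replaced, the cluster is in readonly state. Reads work, but writes and changes do not, so schedule the operation in advance.

4.1 Parameters of the replace command

If the parameters shown below differ from what your version reports, follow the syntax of your current version.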

[gbase@localhost gcinstall]$ ./replace.py --help
Usage: replace.py [options]

Options:
  -h, --help            show this help message and exit
  -a                    do not prompt the user for confirmation
  --host=HOSTLIST       replaced nodes' ip splitting by comma
  --type=NODETYPE       replaced nodes' type,value:coor,data
  --freenode=FREENODE   cluster freenodes' ip splitting by comma
  --dbaUser=DBAUSER     dba user
  --dbaUserPwd=DBAPWD   dba user password
  --generalDBUser=GENDBUSER
                        cluster database user
  --generalDBPwd=GENDBPWD
                        cluster database user password
  --overwrite           new and complete overwrite
  --sync_coordi_metadata_timeout=SYNC_COORDI_METADATA_TIMEOUT
                        sync coordinators' metadata timeout,default 15mins
  --parallel_pack=PARALLEL_PACK
                        whether to parallel packaging,value<0|1>,default 0
  --retry_times=RETRY_TIMES
                        replace node retry times,default 3
  --use_shm=USE_SHM     whether to set path of package,value<0|1>,default 0
  --license_file=LICENSE_FILE
                        import license file
  --vcname=VC_NAME      vc name,only support one vc
  -p, --addr_protocol   domain map address,default False(IPv4)
  --passwordInputMode=PASSWORDINPUTMODE
                        get password method[file,pwdsame],
                        file:    get from command line paramters,default
                        pwdsame: nodes have same user passwd

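--host: one or more IPs, separated by commas.
--type: coor replaces management nodes, data replaces data nodes. The two types must be run separately, not together.
--dbaUser: the OS user, usually gbase.
--dbaUserPwd: the OS user's password.
--generalDBUser: the database DBA user name. Defaults to root.
--generalDBPwd: the database DBA password. Fill it in according to your site. Defaults to empty.
--overwrite: force-overwrite leftover files.
--sync_coordi_metadata_timeout: timeout for metadata synchronization, in minutes. The default is 15; a larger value such as 3000 is recommended.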
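
To avoid typing mistakes in the long option list, the command line can be assembled programmatically. This is a sketch only: build_replace_cmd is a hypothetical helper, it uses only options that appear in the --help output above, and the passwords are the example values from this article.

```python
import shlex


def build_replace_cmd(hosts, node_type, dba_user, dba_pwd, db_user, db_pwd,
                      overwrite=True, sync_timeout_min=3000):
    """Build the replace.py argument list from the options listed in --help."""
    if node_type not in ("coor", "data"):
        raise ValueError("--type accepts only 'coor' or 'data'")
    # --host takes one or more IPs separated by commas.
    host_list = hosts if isinstance(hosts, str) else ",".join(hosts)
    cmd = [
        "./replace.py",
        f"--host={host_list}",
        f"--type={node_type}",
        f"--dbaUser={dba_user}",
        f"--dbaUserPwd={dba_pwd}",
        f"--generalDBUser={db_user}",
        f"--generalDBPwd={db_pwd}",
    ]
    if overwrite:
        cmd.append("--overwrite")
    cmd.append(f"--sync_coordi_metadata_timeout={sync_timeout_min}")
    return cmd


if __name__ == "__main__":
    args = build_replace_cmd(["10.0.2.202"], "coor", "gbase", "gbase1234",
                             "root", "root1234")
    # Print a shell-safe command line for review before running it by hand.
    print(" ".join(shlex.quote(a) for a in args))
```

Returning a list rather than a string keeps the option spelling in one place; the same helper covers the data node run by passing "data" as the type.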

4.2 Replace operation

[gbase@localhost gcinstall]$ ./replace.py --host=10.0.2.202 --type=coor --dbaUser=gbase --dbaUserPwd=gbase1234 --generalDBUser=root --generalDBPwd=root1234 --overwrite --sync_coordi_metadata_timeout=3000
install prefix: /opt/gbase
execute replace node os user: gbase
replaced nodes: ['10.0.2.202']
node address type: IPV4
gcware mode: single vc mode
201129 09:04:05 [GCWARE] connect to 10.0.2.202 error:connect 10.0.2.202:7959 error, Connection refused

host 10.0.2.202 node state: UNAVAILABLE
10.0.2.202
Are you sure to replace install these nodes ([Y,y]/[N,n])? y
check database user and password ...
check database user and password successful
Starting all gcluster nodes...
Begin to exec gcadmin replacenodes ...
check ip start ......
check ip end ......

switch cluster mode into READONLY start ......
wait all ddl statement stop ......

all ddl statement stoped
switch cluster mode into READONLY end ......

delete all fevent log on replace nodes start ......
delete ddl event log on node 10.0.2.202 start
delete ddl event log on node 10.0.2.202 end
delete dml event log on node 10.0.2.202 start
delete dml event log on node 10.0.2.202 end
delete dml storage event log on node 10.0.2.202 start
delete dml storage event log on node 10.0.2.202 end
delete all fevent log on replace nodes end ......

sync coordinator metedata start ......
build data packet start ......
build data packet end ......

copy data packet start ......
copy data packet end ......

copy plugin start ......
copy plugin end ......
uncompress data packet start ......
uncompress data packet end ......

clear temporary file start ......
clear temporary file end ......
sync coordinator metedata end ......
sync coordinator metedata end,spend time 38450 ms......

restore node state start ......
restore node state end ......

replace nodes spend time: 68626 ms

all nodes replace success end
Replace gcluster nodes successfully.
[gbase@localhost gcinstall]$

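4.3 Replace log

The log is written to /home/gbase/gcinstall/replace.log. Check it if an error occurs.

If the error occurs inside gcadmin's internal logic, for example in gadm_cp_codi_tbl.py, the log can be found under gcware/log in the installation directory.

4.4 Check

Confirm that the management services on node 10.0.2.202 are back in the normal OPEN state.
Note that in this example the data service state is still UNAVAILABLE.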

[gbase@localhost gcinstall]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.102 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.202 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator3 | 10.0.2.203 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
===============================================================================================================
|                                       GBASE DATA CLUSTER INFORMATION                                        |
===============================================================================================================
| NodeName |                IpAddress                 | DistributionId |    gnode    | syncserver | DataState |
---------------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.102                |       7        |    OPEN     |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.202                |       7        | UNAVAILABLE |            |           |
---------------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.203                |       7        |    OPEN     |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------------

[gbase@localhost gcinstall]$
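
The state check above can also be done by parsing the gcadmin tables. The sketch below assumes the table layout shown in this article; parse_gcadmin is an illustrative name, and the sample is an abbreviated copy of the output above with narrower columns.

```python
def parse_gcadmin(output: str) -> list:
    """Parse the '|'-delimited gcadmin tables into one dict per node row."""
    rows = []
    header = None
    for raw in output.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # separator lines and cluster state lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if len(cells) == 1:
            header = None  # table title row, a new table follows
        elif cells[0] == "NodeName":
            header = cells
        elif header is not None:
            rows.append(dict(zip(header, cells)))
    return rows


SAMPLE = """
=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator2 | 10.0.2.202 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=============================================================
|              GBASE DATA CLUSTER INFORMATION               |
=============================================================
| NodeName | IpAddress  | DistributionId |    gnode    | syncserver | DataState |
-------------------------------------------------------------
|  node2   | 10.0.2.202 |       7        | UNAVAILABLE |            |           |
-------------------------------------------------------------
"""

if __name__ == "__main__":
    for row in parse_gcadmin(SAMPLE):
        print(row)
```

After the management node replacement, the coordinator row should show OPEN for gcware and gcluster, while the data node row still shows UNAVAILABLE for gnode.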

4.5 Remarks

If the installation package version differs from the running database version, an error is reported:

[gbase@localhost gcinstall]$ ./replace.py --host=10.0.2.202 --type=coor --dbaUser=gbase --dbaUserPwd=gbase1234 --generalDBUser=root --generalDBPwd=root1234 --overwrite --sync_coordi_metadata_timeout=3000
Error: replace.py(line 828) -- current gcware version (121227) and package gcware version (115518) are not same.
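
A quick pre-flight comparison can catch a mismatched package before replace.py is run. The sketch below compares the BUILDINFO file with the output of gclusterd -V, using the values from section 1.1.3. It is only a sanity check under the assumption that the last component of the gclusterd version string corresponds to gcluster_svn, as it does in this environment; it does not reproduce the check that replace.py itself performs.

```python
import re


def parse_buildinfo(text: str) -> dict:
    """Parse BUILDINFO lines written as 'key = value' or 'key: value'."""
    info = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+)\s*[=:]\s*(.*?)\s*$", line)
        if match:
            info[match.group(1)] = match.group(2)
    return info


def parse_running_version(version_line: str) -> tuple:
    """Split 'gclusterd ver 9.5.2.26.121440 ...' into (release, build)."""
    match = re.search(r"\bver\s+([\d.]+)", version_line)
    if match is None:
        raise ValueError("no version number found")
    release, _, build = match.group(1).rpartition(".")
    return release, build


def package_matches(buildinfo_text: str, version_line: str) -> bool:
    """Assumption: release_version and gcluster_svn must both match."""
    info = parse_buildinfo(buildinfo_text)
    release, build = parse_running_version(version_line)
    return info.get("release_version") == release and info.get("gcluster_svn") == build


BUILDINFO = """release_version =  9.5.2.26
os_ventor =  redhat
gcware_svn: 121226
gcluster_svn: 121440
gnode_svn: 121440
"""
VERSION_LINE = "gclusterd ver 9.5.2.26.121440 for unknown-linux-gnu on x86_64"

if __name__ == "__main__":
    print(package_matches(BUILDINFO, VERSION_LINE))
```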

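While the management node is being synchronized, the cluster is in readonly state. We did not capture that state in this example, but it is visible in the log: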

switch cluster mode into READONLY start ......
wait all ddl statement stop ......

all ddl statement stoped
switch cluster mode into READONLY end ......

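5. Replace the data node

Data node replacement has changed substantially since V8. V8 used events and recovered through automatic internal synchronization. The problem was that the recovery concurrency could not be controlled well, so recovery consumed a lot of resources and affected running workloads.

V9 instead uses a redistribution approach similar to scale-out. You can set priorities and the degree of parallelism, and you can pause and resume the redistribution. The whole replacement is controllable, and parameters can be adjusted at any time according to system load.

This procedure uses an intermediate distribution that no longer contains the failed node. The replacement then works like a scale-out in which the failed node is treated as the newly added node. The one thing to watch: the existing distribution is deleted [automatically] by the node replacement command. Do [not] delete it yourself.

5.1 Create the temporary intermediate-policy distribution

Take the information from the existing distribution and replace everything related to the failed node. The result is referred to below as the [intermediate policy]. The primary/duplicate policy after the final recovery is referred to as the [final policy].

5.1.1 Get the old-policy distribution

Our current distribution ID is 7, so the parameter is 7.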

[gbase@localhost gcinstall]$ gcadmin getdistribution 7 distribution_info_7.xml
gcadmin getdistribution 7 distribution_info_7.xml ...

get segments information
write segments information to file [distribution_info_7.xml]

gcadmin getdistribution information successful
[gbase@localhost gcinstall]$ cat distribution_info_7.xml
<?xml version='1.0' encoding="utf-8"?>
<distributions>
    <distribution>
        <segments>
            <segment>
                <primarynode ip="10.0.2.102"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.202"/>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.202"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.203"/>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.203"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.102"/>
                </duplicatenodes>
            </segment>
        </segments>
    </distribution>
</distributions>
[gbase@localhost gcinstall]$

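5.1.2 Create the intermediate-policy distribution_info.xml file

The task is to remove the failed IP from the configuration. There are two cases:
A. The failed IP appears as a duplicatenode: delete that line.
B. The failed IP appears as a primarynode: promote one of that segment's duplicatenode IPs (if there are several) to primarynode, and remember to delete that duplicatenode entry. In other words, the surviving duplicate becomes the primary.

In this example we copied the configuration file and made two changes:

1. Removed 202, the duplicate of 102.
2. [Replaced] the primary segment on 202 with its duplicate 203, so that 203 is now the primary.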

[gbase@localhost gcinstall]$ cp distribution_info_7.xml distribution_info_8.xml
[gbase@localhost gcinstall]$ vi distribution_info_8.xml
[gbase@localhost gcinstall]$ cat distribution_info_8.xml
<?xml version='1.0' encoding="utf-8"?>
<distributions>
    <distribution>
        <segments>
            <segment>
                <primarynode ip="10.0.2.102"/>

                <duplicatenodes>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.203"/>

                <duplicatenodes>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.203"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.102"/>
                </duplicatenodes>
            </segment>
        </segments>
    </distribution>
</distributions>
[gbase@localhost gcinstall]$
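
For clusters with many segments, editing the XML by hand is error-prone. Rules A and B from this section can be sketched in Python with the standard xml.etree.ElementTree module. The function name remove_failed_node is hypothetical, and choosing the first surviving duplicate in rule B is an assumption; the article only says to pick one of them. The sample is the segment list of distribution_info_7.xml without the XML declaration.

```python
import xml.etree.ElementTree as ET


def remove_failed_node(xml_text: str, failed_ip: str) -> str:
    """Return the distribution XML with failed_ip removed (rules A and B)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for segment in root.iter("segment"):
        primary = segment.find("primarynode")
        dup_parent = segment.find("duplicatenodes")
        dups = [] if dup_parent is None else list(dup_parent.findall("duplicatenode"))
        # Rule A: drop the failed IP wherever it is listed as a duplicate.
        for dup in dups:
            if dup.get("ip") == failed_ip:
                dup_parent.remove(dup)
        # Rule B: if the primary failed, promote a surviving duplicate.
        if primary.get("ip") == failed_ip:
            survivors = [d for d in dups if d.get("ip") != failed_ip]
            if not survivors:
                raise ValueError("segment has no surviving copy")
            primary.set("ip", survivors[0].get("ip"))  # assumption: take the first
            dup_parent.remove(survivors[0])
    return ET.tostring(root, encoding="unicode")


def summarize(xml_text: str) -> list:
    """List (primary ip, [duplicate ips]) for each segment, in order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [(seg.find("primarynode").get("ip"),
             [d.get("ip") for d in seg.iter("duplicatenode")])
            for seg in root.iter("segment")]


OLD = """<distributions><distribution><segments>
<segment><primarynode ip="10.0.2.102"/><duplicatenodes><duplicatenode ip="10.0.2.202"/></duplicatenodes></segment>
<segment><primarynode ip="10.0.2.202"/><duplicatenodes><duplicatenode ip="10.0.2.203"/></duplicatenodes></segment>
<segment><primarynode ip="10.0.2.203"/><duplicatenodes><duplicatenode ip="10.0.2.102"/></duplicatenodes></segment>
</segments></distribution></distributions>"""

if __name__ == "__main__":
    print(summarize(remove_failed_node(OLD, "10.0.2.202")))
```

The result has the same primary and duplicate assignments as the hand-edited distribution_info_8.xml above. ET.tostring does not emit the XML declaration, so add it back when writing the file.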

5.1.3 Prepare the intermediate-policy configuration file gcChangeInfo.xml

[gbase@localhost gcinstall]$ vi gcChangeInfo_8.xml
[gbase@localhost gcinstall]$ cat gcChangeInfo_8.xml
<?xml version="1.0" encoding="utf-8"?>
<servers>
<cfgFile file="distribution_info_8.xml"/>
</servers>
[gbase@localhost gcinstall]$ 

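5.1.4 Create the intermediate-policy distribution

Note: do not specify the p and d parameters, because the primary/duplicate policy is fully described in the configuration file. Since no VC is used, the vc parameter is not needed either.

[gbase@localhost gcinstall]$ gcadmin distribution gcChangeInfo_8.xml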
gcadmin generate distribution ...

gcadmin generate distribution successful

[gbase@localhost gcinstall]$ 

5.1.5 Check the intermediate policy

It no longer contains the failed node's IP, 10.0.2.202.

[gbase@localhost gcinstall]$ gcadmin showdistribution

                                 Distribution ID: 8 | State: new | Total segment num: 3

             Primary Segment Node IP                   Segment ID                 Duplicate Segment node IP
========================================================================================================================
|                   10.0.2.102                   |         1          |                                                |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         2          |                                                |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         3          |                   10.0.2.102                   |
========================================================================================================================

                                 Distribution ID: 7 | State: old | Total segment num: 3

             Primary Segment Node IP                   Segment ID                 Duplicate Segment node IP
========================================================================================================================
|                   10.0.2.102                   |         1          |                   10.0.2.202                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.202                   |         2          |                   10.0.2.203                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         3          |                   10.0.2.102                   |
========================================================================================================================
[gbase@localhost gcinstall]$
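
The same check can be scripted by parsing the showdistribution output. This is a sketch that assumes the layout shown above; parse_showdistribution and references_ip are illustrative names, and the sample is an abbreviated copy of the output with narrower columns.

```python
import re


def parse_showdistribution(output: str) -> dict:
    """Map distribution ID -> {'state': ..., 'segments': [(id, primary, duplicate)]}."""
    dists = {}
    current = None
    for raw in output.splitlines():
        head = re.search(r"Distribution ID:\s*(\d+)\s*\|\s*State:\s*(\w+)", raw)
        if head:
            current = {"state": head.group(2), "segments": []}
            dists[int(head.group(1))] = current
            continue
        line = raw.strip()
        if current is not None and line.startswith("|") and line.endswith("|"):
            cells = [cell.strip() for cell in line[1:-1].split("|")]
            if len(cells) == 3 and cells[1].isdigit():
                current["segments"].append((int(cells[1]), cells[0], cells[2]))
    return dists


def references_ip(dist: dict, ip: str) -> bool:
    """True if any segment uses ip as primary or duplicate."""
    return any(ip in (primary, dup) for _, primary, dup in dist["segments"])


SAMPLE = """
          Distribution ID: 8 | State: new | Total segment num: 3

   Primary Segment Node IP      Segment ID      Duplicate Segment node IP
=========================================================
|   10.0.2.102   |   1   |                |
---------------------------------------------------------
|   10.0.2.203   |   2   |                |
---------------------------------------------------------
|   10.0.2.203   |   3   |   10.0.2.102   |
=========================================================

          Distribution ID: 7 | State: old | Total segment num: 3

=========================================================
|   10.0.2.102   |   1   |   10.0.2.202   |
---------------------------------------------------------
|   10.0.2.202   |   2   |   10.0.2.203   |
---------------------------------------------------------
|   10.0.2.203   |   3   |   10.0.2.102   |
=========================================================
"""

if __name__ == "__main__":
    parsed = parse_showdistribution(SAMPLE)
    print(references_ip(parsed[8], "10.0.2.202"))
```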

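5.2 Initialize and rebalance the intermediate policy

Run the following two statements:
initnodedatamap;
rebalance instance;
No priority tuning is needed here. No table has to move any data, so the rebalance is fast and the default values are fine. If the cluster has been scaled out before, check and adjust the parameters.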

[gbase@localhost gcinstall]$ gccli testdb -proot1234

GBase client 9.5.2.26.121440. Copyright (c) 2004-2020, GBase.  All Rights Reserved.

gbase> initnodedatamap;
Query OK, 0 rows affected, 6 warnings (Elapsed: 00:00:02.63)

gbase> rebalance instance;
Query OK, 25 rows affected (Elapsed: 00:00:00.77)

Check the progress until every entry is in COMPLETED state.

gbase> select status,count(*) from gclusterdb.rebalancing_status group by status;
+-----------+----------+
| status    | count(*) |
+-----------+----------+
| STARTING  |        4 |
| COMPLETED |       17 |
| RUNNING   |        4 |
+-----------+----------+
3 rows in set (Elapsed: 00:00:00.69)

......

gbase> select status,count(*) from gclusterdb.rebalancing_status group by status;
+-----------+----------+
| status    | count(*) |
+-----------+----------+
| COMPLETED |       25 |
+-----------+----------+
1 row in set (Elapsed: 00:00:00.50)

gbase>
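
Polling by hand gets tedious on large clusters. The loop can be sketched as follows, assuming a hypothetical fetch_rows callable that runs the query above through whatever client you use and returns (status, count) pairs. The two snapshots are the results shown above.

```python
import time


def pending_tables(rows) -> int:
    """Number of rebalance entries that are not yet COMPLETED."""
    return sum(count for status, count in rows if status != "COMPLETED")


def wait_for_rebalance(fetch_rows, interval_s: float = 10.0, sleep=time.sleep) -> int:
    """Poll fetch_rows until nothing is pending; return the number of polls."""
    polls = 0
    while True:
        polls += 1
        if pending_tables(fetch_rows()) == 0:
            return polls
        sleep(interval_s)


# The two snapshots from the transcript above.
FIRST = [("STARTING", 4), ("COMPLETED", 17), ("RUNNING", 4)]
SECOND = [("COMPLETED", 25)]

if __name__ == "__main__":
    snapshots = iter([FIRST, SECOND])
    # A stand-in fetcher replays the snapshots instead of querying the cluster.
    print(wait_for_rebalance(lambda: next(snapshots), sleep=lambda s: None))
```

Injecting the sleep function keeps the loop testable without waiting. Note that this sketch treats an empty result as complete, which is an assumption.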

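5.2.1 Warning: do not delete the old policy

Strong warning: after this step, the procedure has no step that deletes the nodedatamap and no rmdistribution step. Do [not] run them, under any circumstances.

If you run them by accident, the next step cannot succeed. The only remedy I can think of is [scale-in] followed by [scale-out], that is, doing by hand everything that replace.py would have done for us. This includes:

  • Scale in, as if the node were being discarded.
    • Delete the distribution.
    • Remove the node from the cluster with gcadmin rmnodes.
    • Uninstall the node.
  • Follow the scale-out procedure to add the node back, and restore the new distribution so that it matches the original one.

It does not take much longer; it is just annoying.

5.3 Run the node replacement command replace

Note: during this step the gclusterd and gcware services on the failed node are stopped, presumably because they are deployed on the same node. Any SQL running through that node will fail.

Internally, the command uses the old distribution as the basis for creating a new one, and then deletes the original distribution 7.

5.3.1 Run replace

[gbase@localhost gcinstall]$ ./replace.py --host=10.0.2.202 --type=data --dbaUser=gbase --dbaUserPwd=gbase1234 --generalDBUser=root --generalDBPwd=root1234 --overwrite --sync_coordi_metadata_timeout=3000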
install prefix: /opt/gbase
execute replace node os user: gbase
replaced nodes: ['10.0.2.202']
node address type: IPV4
gcware mode: single vc mode
host 10.0.2.202 node state: UNAVAILABLE
10.0.2.202
Are you sure to replace install these nodes ([Y,y]/[N,n])? y
check database user and password ...
check database user and password successful
uninstall host ['10.0.2.202'] begin
uninstall host ['10.0.2.202'] end
Starting all gcluster nodes...
Begin to exec gcadmin replacenodes ...
check ip start ......
check ip end ......

switch cluster mode into READONLY start ......
wait all ddl statement stop ......

all ddl statement stoped
switch cluster mode into READONLY end ......

delete all fevent log on replace nodes start ......
delete ddl event log on node 10.0.2.202 start
delete ddl event log on node 10.0.2.202 end
delete dml event log on node 10.0.2.202 start
delete dml event log on node 10.0.2.202 end
delete dml storage event log on node 10.0.2.202 start
delete dml storage event log on node 10.0.2.202 end
delete all fevent log on replace nodes end ......

sync dataserver metedata begin ......
copy script to data node begin
copy script to data node end
build data packet begin
build data packet end
copy data packet to target node begin
copy data packet to target node end
extract data packet begin
extract data packet end
sync dataserver metedata end, spend time 41804 ms ......

create distribution begin ......
remove old distribution begin
remove old distribution end
create new distribution begin
restore node state start ......
restore node state end ......
create new distribution end
replace node initnodedatamap
create distribution end

replace nodes spend time: 74638 ms

synchronize data node metadata success
please rebalance instance then remove old distribution after rebalance complete success
Replace gcluster nodes successfully.
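
If you capture the replace.py console output to a file, a small script can confirm success and pull out the timings instead of relying on reading the log by eye. The following is a minimal Python sketch, not part of the GBase tooling: the function name summarize_replace_log is hypothetical, and it assumes the log contains the same "spend time ... ms" lines and the same final success message as the output above.

```python
import re

# Matches lines such as
#   "sync dataserver metedata end, spend time 41804 ms ......"
#   "replace nodes spend time: 74638 ms"
SPEND_TIME = re.compile(r"^(.*?)[,\s]*spend time:?\s*(\d+)\s*ms")

# Final line printed by replace.py in the run shown above.
SUCCESS_MARKER = "Replace gcluster nodes successfully."


def summarize_replace_log(log_text):
    """Return (succeeded, timings_ms) for a captured replace.py log.

    timings_ms maps the text before "spend time" to the elapsed milliseconds.
    """
    succeeded = SUCCESS_MARKER in log_text
    timings_ms = {}
    for line in log_text.splitlines():
        match = SPEND_TIME.search(line.strip())
        if match:
            timings_ms[match.group(1).strip()] = int(match.group(2))
    return succeeded, timings_ms


if __name__ == "__main__":
    sample = (
        "sync dataserver metedata end, spend time 41804 ms ......\n"
        "replace nodes spend time: 74638 ms\n"
        "Replace gcluster nodes successfully.\n"
    )
    ok, timings = summarize_replace_log(sample)
    print(ok, timings)
```

In the run above, metadata synchronization accounts for roughly 42 of the 75 seconds, so on a cluster with much more metadata that is the phase to watch.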

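5.3.2 Check the final distribution

The replace operation above automatically creates the final distribution, ID 9, which already includes the recovered node. The oldest distribution, ID 7, has been deleted automatically.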

[gbase@localhost gcinstall]$ gcadmin showdistribution

                                 Distribution ID: 9 | State: new | Total segment num: 3

             Primary Segment Node IP                   Segment ID                 Duplicate Segment node IP
========================================================================================================================
|                   10.0.2.102                   |         1          |                   10.0.2.202                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.202                   |         2          |                   10.0.2.203                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         3          |                   10.0.2.102                   |
========================================================================================================================

                                 Distribution ID: 8 | State: old | Total segment num: 3

             Primary Segment Node IP                   Segment ID                 Duplicate Segment node IP
========================================================================================================================
|                   10.0.2.102                   |         1          |                                                |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         2          |                                                |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         3          |                   10.0.2.102                   |
========================================================================================================================
[gbase@localhost gcinstall]$
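
The check above can also be done mechanically. Here is a minimal Python sketch that parses the text printed by gcadmin showdistribution and reports which segments have no duplicate (backup) node. It assumes the table layout shown above; the function names are hypothetical and nothing here calls gcadmin itself, so you would feed it the captured output.

```python
import re

HEADER = re.compile(r"Distribution ID:\s*(\d+)\s*\|\s*State:\s*(\w+)")
# A data row: | primary IP | segment id | duplicate IP (may be blank) |
ROW = re.compile(r"^\|([^|]*)\|\s*(\d+)\s*\|([^|]*)\|\s*$")


def parse_showdistribution(text):
    """Map distribution ID -> {"state": str, "segments": [(primary, seg_id, duplicate)]}."""
    result = {}
    current = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            current = int(header.group(1))
            result[current] = {"state": header.group(2), "segments": []}
            continue
        row = ROW.match(line.strip())
        if row and current is not None:
            primary = row.group(1).strip()
            duplicate = row.group(3).strip() or None  # blank cell -> no backup
            result[current]["segments"].append((primary, int(row.group(2)), duplicate))
    return result


def segments_without_duplicate(distribution):
    """Return the segment IDs that have no duplicate node."""
    return [seg_id for _, seg_id, dup in distribution["segments"] if dup is None]
```

Applied to the output above, distribution 9 should report no unprotected segments, while the intermediate distribution 8 reports segments 1 and 2, which is exactly why it has to be replaced by the final distribution.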

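5.4 Initialize and rebalance to the final distribution

Apply patches now! If the cluster has been patched, this is the best time to bring the new node to the same patch level: it is not yet serving the cluster, so its services can be stopped and started more safely.

This part follows the scale-out procedure exactly, so the parallelism and priority settings are not repeated here. See the article "GBase 8a 扩容操作详细实例" (a detailed GBase 8a scale-out example).

5.4.1 Initialize the final distribution

This initialization step is no longer needed, because replace already ran it (see "replace node initnodedatamap" in the output above). Running it again only returns a harmless error.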

[gbase@localhost gcinstall]$ gccli -proot1234

GBase client 9.5.2.26.121440. Copyright (c) 2004-2020, GBase.  All Rights Reserved.

gbase> initnodedatamap;
ERROR 1707 (HY000): gcluster command error: (GBA-02CO-0004) nodedatamap is already initialized.

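5.4.2 Set rebalance parameters

Set the rebalance parameters, priorities and so on. Take into account the impact on the system and how soon each table will age out: rebalance the important, frequently used and permanently retained tables first, and leave tables that are likely to be aged out and dropped soon until last. See the scale-out article for the detailed steps; since this is only a test, I did not set anything here.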
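
The ordering rule described above can be written down as a sort key. The following Python sketch is only an illustration of that reasoning: the TableInfo fields and the table names are hypothetical, and how you then apply the resulting order in GBase is covered by the scale-out article, not by this code.

```python
from dataclasses import dataclass


@dataclass
class TableInfo:
    name: str
    important: bool        # hypothetical flag: business-critical table
    frequently_used: bool  # hypothetical flag: queried often
    expires_soon: bool     # hypothetical flag: likely to be aged out and dropped soon


def rebalance_order(tables):
    """Order tables for rebalancing.

    Tables about to age out go last; among the rest, important tables come
    first, then frequently used ones. Names break ties for a stable result.
    """
    return sorted(
        tables,
        key=lambda t: (t.expires_soon, not t.important, not t.frequently_used, t.name),
    )


if __name__ == "__main__":
    tables = [
        TableInfo("tmp_log", False, False, True),
        TableInfo("orders", True, True, False),
        TableInfo("archive", False, False, False),
        TableInfo("dim_user", True, False, False),
    ]
    print([t.name for t in rebalance_order(tables)])
```

Putting expires_soon first in the key means a table that will be dropped shortly is never rebalanced ahead of a permanent one, which avoids spending I/O on data that is about to disappear.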

5.4.3 Start the rebalance and monitor progress

gbase> rebalance instance;
Query OK, 25 rows affected (Elapsed: 00:00:00.94)
gbase> select status,count(*) from gclusterdb.rebalancing_status group by status;
+----------+----------+
| status   | count(*) |
+----------+----------+
| STARTING |       20 |
| RUNNING  |        5 |
+----------+----------+
2 rows in set (Elapsed: 00:00:01.74)

... n hours or n days later. In my case it took only 2 minutes, since there are just a few small tables ...
gbase> select status,count(*) from gclusterdb.rebalancing_status group by status;
+-----------+----------+
| status    | count(*) |
+-----------+----------+
| COMPLETED |       25 |
+-----------+----------+
1 row in set (Elapsed: 00:00:00.18)
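
On a real cluster the rebalance can run for hours or days, so you will probably want to poll rather than rerun the query by hand. Below is a minimal Python sketch of such a polling loop. It makes no assumption about which database driver you use: fetch_status_counts is a hypothetical callable that you supply, which runs the same group-by query as above and returns the result as a dict. Any status other than COMPLETED is simply treated as "not finished yet".

```python
import time


def wait_for_rebalance(fetch_status_counts, poll_seconds=60, max_polls=None,
                       sleep=time.sleep):
    """Poll until every row in gclusterdb.rebalancing_status is COMPLETED.

    fetch_status_counts: caller-supplied function (hypothetical) returning
        e.g. {"STARTING": 20, "RUNNING": 5} from
        select status,count(*) from gclusterdb.rebalancing_status group by status;
    max_polls: give up with TimeoutError after this many polls (None = no limit).
    """
    polls = 0
    while True:
        counts = fetch_status_counts()
        polls += 1
        pending = {s: n for s, n in counts.items() if s != "COMPLETED" and n > 0}
        # An empty result means no rebalance rows at all, so keep waiting.
        if counts and not pending:
            return counts
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"rebalance not finished after {polls} polls: {counts}")
        sleep(poll_seconds)
```

Only move on to the cleanup step once this returns, that is, once all 25 rows in this example are COMPLETED.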

5.5 Clean up

Drop the old nodedatamap:
refreshnodedatamap drop 8

Remove the old distribution:
gcadmin rmdistribution 8

gbase> refreshnodedatamap drop 8;
Query OK, 0 rows affected, 6 warnings (Elapsed: 00:00:01.34)

gbase> ^CAborted
[gbase@localhost gcinstall]$ gcadmin rmdistribution 8
cluster distribution ID [8]
it will be removed now
please ensure this is ok, input [Y,y] or [N,n]: y
gcadmin remove distribution [8] success
[gbase@localhost gcinstall]$ gcadmin showdistribution

                                 Distribution ID: 9 | State: new | Total segment num: 3

             Primary Segment Node IP                   Segment ID                 Duplicate Segment node IP
========================================================================================================================
|                   10.0.2.102                   |         1          |                   10.0.2.202                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.202                   |         2          |                   10.0.2.203                   |
------------------------------------------------------------------------------------------------------------------------
|                   10.0.2.203                   |         3          |                   10.0.2.102                   |
========================================================================================================================
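
The cleanup order used above (drop the nodedatamap first, then remove the distribution) can be captured in a tiny helper, which is handy if more than one old distribution is left over. This is a Python sketch with a hypothetical function name; it only builds the command strings shown in this section and does not execute anything.

```python
def cleanup_commands(distribution_ids, keep_id):
    """Return (gccli SQL, shell command) pairs for every distribution except keep_id.

    The SQL statement is meant to be run before the matching gcadmin command,
    in the same order as in the transcript above.
    """
    commands = []
    for dist_id in sorted(distribution_ids):
        if dist_id == keep_id:
            continue
        commands.append(
            (f"refreshnodedatamap drop {dist_id};", f"gcadmin rmdistribution {dist_id}")
        )
    return commands


if __name__ == "__main__":
    for sql, shell in cleanup_commands([8, 9], keep_id=9):
        print(sql)
        print(shell)
```

For this walkthrough the helper yields exactly the two commands run above for distribution 8, leaving distribution 9 as the only one in the cluster.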