GBase 8a (南大通用): Removing an Unrecoverable Failed Node via Scale-In, an Operation Record

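If a node fails permanently and cannot be repaired, and the remaining nodes are sufficient for the current workload, GBase 8a can remove the failed node by scaling in, which rebuilds the cluster's primary/backup relationships. This article walks through the procedure with a real example.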

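In this article, "failed node" means a data (compute) node. It is strongly recommended to deploy management, scheduling, and compute nodes separately and to avoid co-locating them, unless the cluster is small and cost is the priority.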

Environment

A 3-node cluster in which node 115 has failed. This operation removes not only the failed node 115 but also node 102.

Database version: 9.5.2

[gbase@gbase_rh7_001 ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       7        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.102                |       7        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.115                |       7        | CLOSE |   CLOSE    |     0     |
---------------------------------------------------------------------------------------------------------
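
On a larger cluster, the failed data nodes can be picked out of the gcadmin output with a script rather than by eye. The following is a minimal Python sketch, assuming the table layout shown above (six pipe-separated columns under the GBASE DATA CLUSTER INFORMATION banner). The function names parse_data_nodes and failed_nodes are illustrative and not part of GBase.

```python
def parse_data_nodes(gcadmin_output):
    """Return one dict per data-node row found in the text printed by gcadmin.

    Assumes the six-column layout shown in this article:
    NodeName | IpAddress | DistributionId | gnode | syncserver | DataState
    """
    nodes = []
    in_data_section = False
    for raw in gcadmin_output.splitlines():
        line = raw.strip()
        # Rows before this banner belong to the coordinator table.
        if "GBASE DATA CLUSTER INFORMATION" in line:
            in_data_section = True
            continue
        if not in_data_section or not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the column-header row and anything with an unexpected shape.
        if len(cells) != 6 or cells[0] == "NodeName":
            continue
        name, ip, dist_id, gnode, syncserver, data_state = cells
        nodes.append({
            "name": name,
            "ip": ip,
            "distribution_id": dist_id,  # empty once the node has no policy
            "gnode": gnode,
            "syncserver": syncserver,
            "data_state": data_state,
        })
    return nodes


def failed_nodes(nodes):
    """IPs of data nodes whose gnode or syncserver service is not OPEN."""
    return [n["ip"] for n in nodes
            if n["gnode"] != "OPEN" or n["syncserver"] != "OPEN"]
```

Given the output above, failed_nodes(parse_data_nodes(text)) returns only 10.0.2.115, the node whose gnode and syncserver are both CLOSE.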

Scale-in procedure

The procedure is exactly the same as an ordinary scale-in; the point here is to show that scale-in also works while a node is down.

Create a distribution policy that excludes the failed node and the nodes to be scaled in

Create a brand-new policy

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ cat gcChangeInfo_one.xml
<?xml version="1.0" encoding="utf-8"?>
<servers>
 <rack>
  <node ip="10.0.2.101"/>
 </rack>
</servers>
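
The node-list file is small enough to write by hand, but generating it avoids typos when several IPs are involved. The sketch below is a hypothetical helper (build_node_list_xml is not a GBase tool) that emits the same servers/rack/node layout as the file above.

```python
from xml.sax.saxutils import quoteattr


def build_node_list_xml(ips):
    """Build a node-list XML document in the layout used by the file above.

    `ips` is the list of node IPs to include. The helper name and the
    single-rack layout are assumptions made for illustration.
    """
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<servers>", " <rack>"]
    for ip in ips:
        # quoteattr escapes the value and wraps it in double quotes.
        lines.append("  <node ip=%s/>" % quoteattr(ip))
    lines.extend([" </rack>", "</servers>"])
    return "\n".join(lines) + "\n"
```

Calling build_node_list_xml(["10.0.2.101"]) reproduces the content of gcChangeInfo_one.xml, and the same helper can produce the list of nodes to remove later in the procedure.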

Keep the existing policy

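The IP addresses in this subsection come from a different example (see the separate V95 example), so do not confuse them with the cluster shown above.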

If you want to keep the previous distribution policy, so that the node can later be replaced by scaling out, first export the old distribution.

[gbase@localhost gcinstall]$ gcadmin getdistribution 7 distribution_info_7.xml
gcadmin getdistribution 7 distribution_info_7.xml ...

get segments information
write segments information to file [distribution_info_7.xml]

gcadmin getdistribution information successful
[gbase@localhost gcinstall]$ cat distribution_info_7.xml
<?xml version='1.0' encoding="utf-8"?>
<distributions>
    <distribution>
        <segments>
            <segment>
                <primarynode ip="10.0.2.102"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.202"/>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.202"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.203"/>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.203"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.102"/>
                </duplicatenodes>
            </segment>
        </segments>
    </distribution>
</distributions>
[gbase@localhost gcinstall]$

Create a policy that excludes the failed node

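The task is to remove the failed IP from the configuration. There are two cases.
A. The IP appears in a duplicatenode entry: simply delete that line.
B. The IP appears in a primarynode entry: rework the segment by promoting one of its duplicatenode IPs (pick one if there are several) to primarynode, and remember to delete the promoted duplicatenode entry. In other words, the surviving backup becomes the primary.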

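In this example the failed node is 10.0.2.202. We copied the configuration file and made two changes in the copy:

1. Deleted 202 as the backup of 102.
2. Replaced the primary 202 with its backup 203, so that 203 becomes the primary of that segment.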

[gbase@localhost gcinstall]$ cp distribution_info_7.xml distribution_info_8.xml
[gbase@localhost gcinstall]$ vi distribution_info_8.xml
[gbase@localhost gcinstall]$ cat distribution_info_8.xml
<?xml version='1.0' encoding="utf-8"?>
<distributions>
    <distribution>
        <segments>
            <segment>
                <primarynode ip="10.0.2.102"/>

                <duplicatenodes>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.203"/>

                <duplicatenodes>
                </duplicatenodes>
            </segment>

            <segment>
                <primarynode ip="10.0.2.203"/>

                <duplicatenodes>
                    <duplicatenode ip="10.0.2.102"/>
                </duplicatenodes>
            </segment>
        </segments>
    </distribution>
</distributions>
[gbase@localhost gcinstall]$
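
The hand edit above can also be scripted. The following is a minimal Python sketch of cases A and B, assuming the XML layout produced by gcadmin getdistribution as shown in this article. The function name remove_failed_node is hypothetical, and promoting the first surviving backup is an arbitrary choice made for illustration.

```python
import xml.etree.ElementTree as ET


def remove_failed_node(distribution_xml, failed_ip):
    """Return the distribution XML with `failed_ip` removed from every segment.

    Case A: the IP is a duplicatenode -> delete that entry.
    Case B: the IP is the primarynode -> promote a surviving duplicatenode
            to primarynode and delete the promoted duplicatenode entry.
    """
    # Encode first so the parser honours the file's own encoding declaration.
    root = ET.fromstring(distribution_xml.encode("utf-8"))
    for segment in root.iter("segment"):
        primary = segment.find("primarynode")
        dup_parent = segment.find("duplicatenodes")
        duplicates = list(dup_parent) if dup_parent is not None else []

        # Case A: drop the failed node wherever it is a backup.
        for dup in duplicates:
            if dup.get("ip") == failed_ip:
                dup_parent.remove(dup)

        # Case B: the failed node was the primary of this segment.
        if primary.get("ip") == failed_ip:
            survivors = list(dup_parent) if dup_parent is not None else []
            if not survivors:
                # No backup left: the data of this segment cannot be recovered
                # by editing the policy alone.
                raise ValueError("segment with primary %s has no surviving backup"
                                 % failed_ip)
            promoted = survivors[0]  # assumption: first survivor is promoted
            primary.set("ip", promoted.get("ip"))
            dup_parent.remove(promoted)

    body = ET.tostring(root, encoding="unicode")
    return "<?xml version='1.0' encoding=\"utf-8\"?>\n" + body + "\n"
```

Applied to distribution_info_7.xml with failed_ip set to 10.0.2.202, this yields the same primary and backup assignments as the hand-edited distribution_info_8.xml, although whitespace may differ. Review the result before using it, just as with a manual edit.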

Generate the new policy

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin distribution gcChangeInfo_one.xml p 1 d 0
gcadmin generate distribution ...

[warning]: parameter [d num] is 0, the new distribution will has no segment backup
please ensure this is ok, input [Y,y] or [N,n]: y
gcadmin generate distribution successful

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ 

Initialize and rebalance

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gccli

GBase client 9.5.2.44.1045e3118. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> initnodedatamap;
Query OK, 0 rows affected, 3 warnings (Elapsed: 00:00:00.53)

gbase> rebalance instance;
Query OK, 11 rows affected (Elapsed: 00:00:00.74)

Wait for the rebalance to finish.

Clean up the environment

Drop the old nodedatamap

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gccli

GBase client 9.5.2.44.1045e3118. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> refreshnodedatamap drop 7;
Query OK, 0 rows affected, 3 warnings (Elapsed: 00:00:00.62)

gbase> ^CAborted
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$

Clear the events, because the rebalance leaves DDL/DML events for the failed node.

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmdmlevent 2 10.0.2.115
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmddlevent 2 10.0.2.115

Delete the old distribution policy

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmdistribution 7
cluster distribution ID [7]
it will be removed now
please ensure this is ok, input [Y,y] or [N,n]: y
gcadmin remove distribution [7] success
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       8        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.102                |                | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.115                |                | CLOSE |   CLOSE    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$

Remove the scaled-in nodes

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ cat rmnodes.xml
<?xml version="1.0" encoding="utf-8"?>
<servers>
 <rack>
  <node ip="10.0.2.102"/>
  <node ip="10.0.2.115"/>
 </rack>
</servers>
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmnodes rmnodes.xml
gcadmin remove nodes ...


gcadmin rmnodes from cluster success

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       8        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$

Delete the database files on the scaled-in nodes

rm -fr /opt/gbase/gcluster
rm -fr /opt/gbase/gnode
rm -fr /opt/gbase/gcware

Summary

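When a node is completely unavailable, a GBase 8a cluster supports forcibly scaling it in and removing it from the cluster. The only difference from a normal scale-in is that the rebalance generates events for the failed node, and these must be cleared before the old distribution policy is deleted.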