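GBase 8a supports node replacement. When a server suffers an unrecoverable failure such as disk damage, it can be replaced in place after repair, or a new node can take its place. In V8 releases, node replacement by default must reuse the old IP. V95 releases in multi-VC mode add a replacement mode that uses an idle standby node (freenode) in the cluster. This article describes a scheme that replaces a data (compute) node with a new node by way of cluster expansion.
- This scenario typically arises when it is unclear when the failed node will be repaired, and the new server is not allowed to change its IP (so it cannot take over the old address).
- This scheme covers data (compute) nodes only, not management or coordinator (scheduling) nodes. To replace either of those two kinds, follow the standard [node replacement] procedure.
- This article applies only to GBase 8a cluster versions that do not support node replacement directly with a new IP, mainly the V8 series.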
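Overall approach
- Set the failed node's state and clear its events
- Install the new data node following the expansion procedure
- Export the old distribution
- Generate a new distribution in which the failed node's IP is replaced by the newly added node's IP
- Initialize
- Rebalance
- Clean up the environment
- Uninstall or wipe residual data on the failed node
Environment
Four nodes (10.0.2.103-106): the cluster uses three of them (10.0.2.103-105), and one is a spare (10.0.2.106).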
[gbase@gbase_rh7_003 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
====================================
| GBASE GCWARE CLUSTER INFORMATION |
====================================
| NodeName | IpAddress | gcware |
------------------------------------
| gcware1 | 10.0.2.103 | OPEN |
------------------------------------
====================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
====================================================
| NodeName | IpAddress | gcluster | DataState |
----------------------------------------------------
| coordinator1 | 10.0.2.103 | OPEN | 0 |
----------------------------------------------------
=========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
=========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.103 | 1 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.104 | 1 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.105 | 1 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
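Set the failed node's state and clear its events
Assume below that node 104 has failed. This means an unrecoverable failure, not a recoverable one such as a power loss, network outage, or system hang. The most common case is disk damage on the data partition that exceeds the number of failed disks the RAID level tolerates.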
gcadmin setnodestate 10.0.2.104 unavailable
Clear the events for node 104:
gcadmin rmdmlstorageevent 2 10.0.2.104
gcadmin rmddlstorageevent 2 10.0.2.104
gcadmin rmdmlevent 2 10.0.2.104
Confirm that node 104 has no remaining events.
[gbase@gbase_rh7_003 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
====================================
| GBASE GCWARE CLUSTER INFORMATION |
====================================
| NodeName | IpAddress | gcware |
------------------------------------
| gcware1 | 10.0.2.103 | OPEN |
------------------------------------
====================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
====================================================
| NodeName | IpAddress | gcluster | DataState |
----------------------------------------------------
| coordinator1 | 10.0.2.103 | OPEN | 0 |
----------------------------------------------------
===============================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===============================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.103 | 1 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.104 | 1 | UNAVAILABLE | | |
---------------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.105 | 1 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
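When this check is scripted rather than read by eye, the gcadmin text can be parsed. The following is a minimal sketch, assuming the output keeps the pipe-delimited, six-column layout shown above; the function name `parse_data_nodes` and the trimmed `sample` text are illustrative and not part of GBase.

```python
def parse_data_nodes(gcadmin_output):
    """Return {ip: gnode_state} for rows of the data cluster table.

    Assumes the pipe-delimited layout shown in this article:
    NodeName | IpAddress | DistributionId | gnode | syncserver | DataState
    """
    states = {}
    in_data_table = False
    for raw in gcadmin_output.splitlines():
        line = raw.strip()
        if "GBASE DATA CLUSTER INFORMATION" in line:
            in_data_table = True
            continue
        if not in_data_table or not line.startswith("|"):
            continue
        # Drop the empty pieces before the first and after the last pipe.
        cells = [c.strip() for c in line.split("|")[1:-1]]
        if len(cells) != 6 or cells[0] == "NodeName":
            continue
        states[cells[1]] = cells[3]
    return states


# Trimmed copy of the data cluster table from the output above.
sample = """\
| GBASE DATA CLUSTER INFORMATION |
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
| node1 | 10.0.2.103 | 1 | OPEN | OPEN | 0 |
| node2 | 10.0.2.104 | 1 | UNAVAILABLE | | |
| node3 | 10.0.2.105 | 1 | OPEN | OPEN | 0 |
"""
states = parse_data_nodes(sample)
print(states["10.0.2.104"])
```

The parser only reads the gnode column. Event state would still need to be confirmed from the real gcadmin output.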
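Install a data node via expansion
The expansion install steps are not repeated here; see the article "GBase 8a 扩容操作详细实例" (a detailed GBase 8a expansion walkthrough). Note that only the installation part (gcinstall.py) needs to be completed.
Node 106 is now in the cluster but has not yet joined any distribution: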
[gbase@gbase_rh7_003 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
====================================
| GBASE GCWARE CLUSTER INFORMATION |
====================================
| NodeName | IpAddress | gcware |
------------------------------------
| gcware1 | 10.0.2.103 | OPEN |
------------------------------------
====================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
====================================================
| NodeName | IpAddress | gcluster | DataState |
----------------------------------------------------
| coordinator1 | 10.0.2.103 | OPEN | 0 |
----------------------------------------------------
===============================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===============================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.103 | 1 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.104 | 1 | UNAVAILABLE | | |
---------------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.105 | 1 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node4 | 10.0.2.106 | | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
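Export the old distribution
The argument 1 is the ID of the distribution that contains the failed node, as shown in the gcadmin output above. Substitute the value for your own cluster.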
gcadmin getdistribution 1 distribution_1.xml
Note the entries for the failed node:
[gbase@gbase_rh7_003 gcinstall]$ cat distribution_1.xml
<?xml version='1.0' encoding="utf-8"?>
<distributions>
<distribution>
<segments>
<segment>
<primarynode ip="10.0.2.103"/>
<duplicatenodes>
<duplicatenode ip="10.0.2.104"/>
</duplicatenodes>
</segment>
<segment>
<primarynode ip="10.0.2.104"/>
<duplicatenodes>
<duplicatenode ip="10.0.2.105"/>
</duplicatenodes>
</segment>
<segment>
<primarynode ip="10.0.2.105"/>
<duplicatenodes>
<duplicatenode ip="10.0.2.103"/>
</duplicatenodes>
</segment>
</segments>
</distribution>
</distributions>
[gbase@gbase_rh7_003 gcinstall]$
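On a larger cluster, spotting every entry for the failed node by eye is error-prone. The sketch below lists the segments in which a given IP appears, assuming the file uses the `segment` / `primarynode` / `duplicatenode` layout shown above. The function name `segments_using` is illustrative only.

```python
import xml.etree.ElementTree as ET


def segments_using(xml_bytes, ip):
    """Return (segment_number, role) pairs where `ip` appears.

    Segments are numbered from 1 in document order; role is
    "primary" or "duplicate".
    """
    root = ET.fromstring(xml_bytes)
    hits = []
    for number, segment in enumerate(root.iter("segment"), start=1):
        primary = segment.find("primarynode")
        if primary is not None and primary.get("ip") == ip:
            hits.append((number, "primary"))
        for duplicate in segment.iter("duplicatenode"):
            if duplicate.get("ip") == ip:
                hits.append((number, "duplicate"))
    return hits


# Same content as distribution_1.xml above.
old_xml = b"""<?xml version='1.0' encoding="utf-8"?>
<distributions>
<distribution>
<segments>
<segment>
<primarynode ip="10.0.2.103"/>
<duplicatenodes><duplicatenode ip="10.0.2.104"/></duplicatenodes>
</segment>
<segment>
<primarynode ip="10.0.2.104"/>
<duplicatenodes><duplicatenode ip="10.0.2.105"/></duplicatenodes>
</segment>
<segment>
<primarynode ip="10.0.2.105"/>
<duplicatenodes><duplicatenode ip="10.0.2.103"/></duplicatenodes>
</segment>
</segments>
</distribution>
</distributions>
"""
failed_roles = segments_using(old_xml, "10.0.2.104")
print(failed_roles)
```

For this three-node layout the failed node is the duplicate of the first segment and the primary of the second, which matches the two lines that have to change in the new file.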
Generate the new distribution
Replace the failed node's IP with the IP of the newly installed node:
[gbase@gbase_rh7_003 gcinstall]$ cat distribution_2.xml
<?xml version='1.0' encoding="utf-8"?>
<distributions>
<distribution>
<segments>
<segment>
<primarynode ip="10.0.2.103"/>
<duplicatenodes>
<duplicatenode ip="10.0.2.106"/>
</duplicatenodes>
</segment>
<segment>
<primarynode ip="10.0.2.106"/>
<duplicatenodes>
<duplicatenode ip="10.0.2.105"/>
</duplicatenodes>
</segment>
<segment>
<primarynode ip="10.0.2.105"/>
<duplicatenodes>
<duplicatenode ip="10.0.2.103"/>
</duplicatenodes>
</segment>
</segments>
</distribution>
</distributions>
[gbase@gbase_rh7_003 gcinstall]$
[gbase@gbase_rh7_003 gcinstall]$ cat gcChangeInfo_2.xml
<?xml version="1.0" encoding="utf-8"?>
<servers>
<cfgFile file="distribution_2.xml"/>
</servers>
[gbase@gbase_rh7_003 gcinstall]$
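The article edits the file by hand. As an alternative, the IP substitution can be scripted. This is a minimal sketch using Python's standard `xml.etree.ElementTree`, assuming Python 3.8 or later and assuming gcadmin accepts the re-serialized XML. That acceptance is not verified here, so compare the result with the original file before using it. The function name `replace_node_ip` is hypothetical.

```python
import xml.etree.ElementTree as ET


def replace_node_ip(xml_bytes, old_ip, new_ip):
    """Return distribution XML with old_ip swapped for new_ip.

    Only the ip attribute of primarynode and duplicatenode
    elements is changed; the segment structure is left as-is.
    """
    root = ET.fromstring(xml_bytes)
    for element in root.iter():
        if element.tag in ("primarynode", "duplicatenode"):
            if element.get("ip") == old_ip:
                element.set("ip", new_ip)
    # xml_declaration requires Python 3.8+.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Shortened stand-in for distribution_1.xml (two of the three segments).
old_distribution = b"""<?xml version='1.0' encoding="utf-8"?>
<distributions>
<distribution>
<segments>
<segment>
<primarynode ip="10.0.2.103"/>
<duplicatenodes><duplicatenode ip="10.0.2.104"/></duplicatenodes>
</segment>
<segment>
<primarynode ip="10.0.2.104"/>
<duplicatenodes><duplicatenode ip="10.0.2.105"/></duplicatenodes>
</segment>
</segments>
</distribution>
</distributions>
"""
new_distribution = replace_node_ip(old_distribution, "10.0.2.104", "10.0.2.106")
print(new_distribution.decode("utf-8"))
```

Because only attribute values are touched, the number of segments and the primary/duplicate pairing stay identical to the old distribution, which is the property this scheme relies on.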
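Generate the distribution and check it. Note that the new distribution (ID 3 in the output below) no longer contains the failed node.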
[gbase@gbase_rh7_003 gcinstall]$ gcadmin distribution gcChangeInfo_2.xml
gcadmin generate distribution ...
copy system table to 10.0.2.106
gcadmin generate distribution successful
[gbase@gbase_rh7_003 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
====================================
| GBASE GCWARE CLUSTER INFORMATION |
====================================
| NodeName | IpAddress | gcware |
------------------------------------
| gcware1 | 10.0.2.103 | OPEN |
------------------------------------
====================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
====================================================
| NodeName | IpAddress | gcluster | DataState |
----------------------------------------------------
| coordinator1 | 10.0.2.103 | OPEN | 0 |
----------------------------------------------------
===============================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===============================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.103 | 1,3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.104 | 1 | UNAVAILABLE | | |
---------------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.105 | 1,3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node4 | 10.0.2.106 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
[gbase@gbase_rh7_003 gcinstall]$
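Initialize and rebalance
Rebalancing takes time. For its parameters and strategies, see the rebalancing section of the standard expansion guide, "GBase 8a 扩容操作详细实例".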
[gbase@gbase_rh7_003 gcinstall]$ gccli
GBase client 9.5.3.22.126635. Copyright (c) 2004-2021, GBase. All Rights Reserved.
gbase> initnodedatamap;
Query OK, 0 rows affected, 4 warnings (Elapsed: 00:00:00.50)
gbase> rebalance instance;
Query OK, 1 row affected (Elapsed: 00:00:00.59)
gbase> select status,count(*) from gclusterdb.rebalancing_status group by status;
+-----------+----------+
| status | count(*) |
+-----------+----------+
| COMPLETED | 1 |
+-----------+----------+
1 row in set (Elapsed: 00:00:00.06)
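On a cluster with many tables the status query returns several rows, and cleanup should wait until every one of them is COMPLETED. The sketch below makes that decision from the rows of the GROUP BY query above, however they were fetched. The function name is illustrative, `SOME_OTHER_STATUS` is a placeholder rather than a real GBase status name, and treating an empty result as "not finished" is an assumption.

```python
def rebalance_finished(status_counts):
    """Decide whether rebalancing is done.

    status_counts: list of (status, count) rows, as returned by
    select status,count(*) from gclusterdb.rebalancing_status group by status;
    Returns (finished, pending), where pending maps every status
    other than COMPLETED to its row count.
    """
    pending = {
        status: count
        for status, count in status_counts
        if status != "COMPLETED" and count > 0
    }
    # Assumption: an empty result is treated as not finished.
    finished = bool(status_counts) and not pending
    return finished, pending


# The result shown above: a single COMPLETED row.
done, waiting = rebalance_finished([("COMPLETED", 1)])

# Hypothetical mixed result; the second status name is a placeholder.
done_mixed, waiting_mixed = rebalance_finished(
    [("COMPLETED", 3), ("SOME_OTHER_STATUS", 2)]
)
print(done, waiting, done_mixed, waiting_mixed)
```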
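Clean up the environment
Once rebalancing is confirmed complete, remove what is no longer needed: the old nodedatamap, the old distribution, and the failed node.
Remove the nodedatamap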
gbase> refreshnodedatamap drop 1;
Query OK, 0 rows affected, 4 warnings (Elapsed: 00:00:00.66)
gbase> exit
Bye
Remove the old distribution
[gbase@gbase_rh7_003 gcinstall]$ gcadmin rmdistribution 1
cluster distribution ID [1]
it will be removed now
please ensure this is ok, input [Y,y] or [N,n]: y
gcadmin remove distribution [1] success
[gbase@gbase_rh7_003 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
====================================
| GBASE GCWARE CLUSTER INFORMATION |
====================================
| NodeName | IpAddress | gcware |
------------------------------------
| gcware1 | 10.0.2.103 | OPEN |
------------------------------------
====================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
====================================================
| NodeName | IpAddress | gcluster | DataState |
----------------------------------------------------
| coordinator1 | 10.0.2.103 | OPEN | 0 |
----------------------------------------------------
===============================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===============================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.103 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.104 | | UNAVAILABLE | | |
---------------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.105 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
| node4 | 10.0.2.106 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------------
[gbase@gbase_rh7_003 gcinstall]$
Remove the failed node from the cluster
[gbase@gbase_rh7_003 gcinstall]$ cp gcChangeInfo.xml gcChangeInfo_delete.xml
[gbase@gbase_rh7_003 gcinstall]$ vi gcChangeInfo_delete.xml
[gbase@gbase_rh7_003 gcinstall]$ cat gcChangeInfo_delete.xml
<?xml version="1.0" encoding="utf-8"?>
<servers>
<rack>
<node ip="10.0.2.104"/>
</rack>
</servers>
[gbase@gbase_rh7_003 gcinstall]$ gcadmin rmnodes gcChangeInfo_delete.xml
gcadmin remove nodes ...
gcadmin rmnodes from cluster success
[gbase@gbase_rh7_003 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
====================================
| GBASE GCWARE CLUSTER INFORMATION |
====================================
| NodeName | IpAddress | gcware |
------------------------------------
| gcware1 | 10.0.2.103 | OPEN |
------------------------------------
====================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
====================================================
| NodeName | IpAddress | gcluster | DataState |
----------------------------------------------------
| coordinator1 | 10.0.2.103 | OPEN | 0 |
----------------------------------------------------
=========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
=========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.103 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.105 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node4 | 10.0.2.106 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
[gbase@gbase_rh7_003 gcinstall]$
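Uninstall or wipe residual data on the failed node
If the failed node is completely destroyed and even its operating system must be reinstalled, it will [not] interfere with the existing cluster when it comes back online.
If you are unsure, for example when the operating system disk was not damaged, be sure to remove the cluster data from that node. Options include:
- Format the data disk
- Manually delete the files on the data disk
In short, since the node failed and has been replaced by another, retire it completely. Treat it as a new server: rebuild the RAID and reinstall the operating system.
Performance impact
Because the number of shards is unchanged, and the primary/replica relationships are unchanged apart from the failed node's IP, the timestamps of the underlying data files show that data was migrated only for the shards of the failed node. The other nodes were not touched.
The two shards on node 103 are unchanged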
[gbase@gbase_rh7_003 gcinstall]$ stat /opt/gbase/10.0.2.103/gnode/userdata/gbase/testdb/metadata/t1_n1.GED
File: ‘/opt/gbase/10.0.2.103/gnode/userdata/gbase/testdb/metadata/t1_n1.GED’
Size: 65 Blocks: 0 IO Block: 4096 directory
Device: 801h/2049d Inode: 52203643 Links: 2
Access: (0700/drwx------) Uid: ( 1000/ gbase) Gid: ( 1000/ gbase)
Access: 2021-06-23 16:41:47.164743753 +0800
Modify: 2021-06-23 16:41:47.154743698 +0800
Change: 2021-06-23 16:41:47.154743698 +0800
Birth: -
[gbase@gbase_rh7_003 gcinstall]$
[gbase@gbase_rh7_003 gcinstall]$ stat /opt/gbase/10.0.2.103/gnode/userdata/gbase/testdb/metadata/t1_n3.GED
File: ‘/opt/gbase/10.0.2.103/gnode/userdata/gbase/testdb/metadata/t1_n3.GED’
Size: 65 Blocks: 0 IO Block: 4096 directory
Device: 801h/2049d Inode: 1193943 Links: 2
Access: (0700/drwx------) Uid: ( 1000/ gbase) Gid: ( 1000/ gbase)
Access: 2021-06-23 16:41:47.170743786 +0800
Modify: 2021-06-23 16:41:47.162743742 +0800
Change: 2021-06-23 16:41:47.162743742 +0800
Birth: -
The two shards on node 105 are unchanged
[gbase@gbase_rh7_003 gcinstall]$ stat /opt/gbase/10.0.2.105/gnode/userdata/gbase/testdb/metadata/t1_n2.GED
File: ‘/opt/gbase/10.0.2.105/gnode/userdata/gbase/testdb/metadata/t1_n2.GED’
Size: 65 Blocks: 0 IO Block: 4096 directory
Device: 801h/2049d Inode: 52203645 Links: 2
Access: (0700/drwx------) Uid: ( 1000/ gbase) Gid: ( 1000/ gbase)
Access: 2021-06-23 16:41:47.166743764 +0800
Modify: 2021-06-23 16:41:47.157743715 +0800
Change: 2021-06-23 16:41:47.157743715 +0800
Birth: -
[gbase@gbase_rh7_003 gcinstall]$ stat /opt/gbase/10.0.2.105/gnode/userdata/gbase/testdb/metadata/t1_n3.GED
File: ‘/opt/gbase/10.0.2.105/gnode/userdata/gbase/testdb/metadata/t1_n3.GED’
Size: 65 Blocks: 0 IO Block: 4096 directory
Device: 801h/2049d Inode: 1734395 Links: 2
Access: (0700/drwx------) Uid: ( 1000/ gbase) Gid: ( 1000/ gbase)
Access: 2021-06-23 16:41:47.175743813 +0800
Modify: 2021-06-23 16:41:47.166743764 +0800
Change: 2021-06-23 16:41:47.166743764 +0800
Birth: -
[gbase@gbase_rh7_003 gcinstall]$
The two shards on node 106 are brand new
[gbase@gbase_rh7_003 gcinstall]$ stat /opt/gbase/10.0.2.106/gnode/userdata/gbase/testdb/metadata/t1_n1.GED
File: ‘/opt/gbase/10.0.2.106/gnode/userdata/gbase/testdb/metadata/t1_n1.GED’
Size: 65 Blocks: 0 IO Block: 4096 directory
Device: 801h/2049d Inode: 16875349 Links: 2
Access: (0700/drwx------) Uid: ( 1000/ gbase) Gid: ( 1000/ gbase)
Access: 2021-06-24 09:38:55.890441928 +0800
Modify: 2021-06-24 09:38:56.070442917 +0800
Change: 2021-06-24 09:38:56.070442917 +0800
Birth: -
[gbase@gbase_rh7_003 gcinstall]$ stat /opt/gbase/10.0.2.106/gnode/userdata/gbase/testdb/metadata/t1_n2.GED
File: ‘/opt/gbase/10.0.2.106/gnode/userdata/gbase/testdb/metadata/t1_n2.GED’
Size: 65 Blocks: 0 IO Block: 4096 directory
Device: 801h/2049d Inode: 702206 Links: 2
Access: (0700/drwx------) Uid: ( 1000/ gbase) Gid: ( 1000/ gbase)
Access: 2021-06-24 09:38:55.885441901 +0800
Modify: 2021-06-24 09:38:56.065442889 +0800
Change: 2021-06-24 09:38:56.065442889 +0800
Birth: -
[gbase@gbase_rh7_003 gcinstall]$
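The same timestamp comparison can be scripted instead of running stat shard by shard. The sketch below reports which paths were modified at or after a cutoff time, using only `os.stat`. The function name `modified_since` and the midnight cutoff are illustrative. The demo uses temporary directories with timestamps copied from the stat output above rather than real shard directories.

```python
import os
import tempfile
from datetime import datetime


def modified_since(paths, cutoff):
    """Return the paths whose mtime is at or after `cutoff`.

    cutoff is a naive datetime interpreted in local time,
    for example the moment the rebalance was started.
    """
    cutoff_ts = cutoff.timestamp()
    return [p for p in paths if os.stat(p).st_mtime >= cutoff_ts]


# Demo on temporary directories standing in for shard directories.
with tempfile.TemporaryDirectory() as base:
    untouched = os.path.join(base, "t1_n3_on_103")
    migrated = os.path.join(base, "t1_n1_on_106")
    os.mkdir(untouched)
    os.mkdir(migrated)
    # Timestamps taken from the Modify lines shown above.
    old_ts = datetime(2021, 6, 23, 16, 41, 47).timestamp()
    new_ts = datetime(2021, 6, 24, 9, 38, 56).timestamp()
    os.utime(untouched, (old_ts, old_ts))
    os.utime(migrated, (new_ts, new_ts))
    # Hypothetical cutoff: midnight before the rebalance ran.
    touched = modified_since([untouched, migrated], datetime(2021, 6, 24))
    touched_names = [os.path.basename(p) for p in touched]

print(touched_names)
```

On a real cluster the paths would be the shard directories on each node, such as the `.GED` directories under `gnode/userdata/gbase/testdb/metadata/` listed above.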