GBase 8a relies on its replica mechanism for high availability, but what happens when both the primary and the replica copies of the data are damaged and cannot be repaired? By the standard logic, the integrity of the affected tables can no longer be guaranteed, and every query against them will fail with an error. This article describes a way to keep querying the remaining data of existing tables, on the premise that data loss and incomplete query results are acceptable, plus a scale-in procedure that ensures newly created tables work normally, together with the test process.
Standard scale-in steps
- Create a new distribution (data distribution policy)
- Redistribute the data to the new distribution
- Remove the old distribution
- Remove the nodes from the cluster
- Uninstall the data services on the removed nodes
For details, see https://www.gbase8.cn/1113
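For orientation, here is a condensed, hedged sketch of that standard command sequence. The file names, the p 1 d 1 parameters, and the old distribution id 5 are assumptions taken from this article's test cluster; substitute your own values, and treat the link above as the authoritative reference.

# create the new distribution from the remaining nodes (file name is an assumption)
gcadmin distribution gcChangeInfo_new.xml p 1 d 1
# in gccli: build the node data map, then rebalance all tables onto the new distribution
gbase> initnodedatamap;
gbase> rebalance instance;
# once every table is on the new distribution, drop the old one (id 5 is just this cluster's value)
gbase> refreshnodedatamap drop 5;
gcadmin rmdistribution 5
# remove the nodes from the cluster, then uninstall their data services (see the link above)
gcadmin rmnodes gcChangeInfo_uninstall.xml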
How the forced scale-in differs
Redistributing data to the new distribution
Because the nodes are already offline, the usual redistribution method does not quite apply. The current version (9.5.2.44, as of 2022-06-20) does not support the rebalance approach in this situation; the only option is to rebuild the tables manually.
Uninstalling the data services on the removed nodes
The nodes cannot be repaired and are simply gone, so there are no services left to uninstall.
Forced scale-in test
This test covers pure data nodes only; it is recommended to deploy coordinator and data nodes on separate servers rather than mixing them.
Environment
A 3-node cluster: 1 coordinator and 3 data nodes.
Note that the current DistributionId of this test cluster is 5; yours depends on the current state of your own cluster.
[gbase@gbase_rh7_001 liblog]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 | OPEN | OPEN | 0 |
-------------------------------------------------------------
=========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
=========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.101 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.102 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.115 | 5 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
Simulating the failure
Shut down the servers of nodes 102 and 115.
[gbase@gbase_rh7_001 liblog]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 | OPEN | OPEN | 0 |
-------------------------------------------------------------
===========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
-----------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.101 | 5 | OPEN | OPEN | 0 |
-----------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.102 | 5 | OFFLINE | | |
-----------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.115 | 5 | OFFLINE | | |
-----------------------------------------------------------------------------------------------------------
Querying the incomplete data
Because the nodes holding both the primary and the replica segments have failed, queries return an error.
gbase> select * from t1;
ERROR 1708 (HY000): (GBA-02EX-0004) Failed to get metadata:
DETAIL: check nodes, no valid node for suffix: n2,
please execute 'show datacopymap database.table_name;' to see the detail.
gbase>
Changing the parameter
gbase> show variables like '%sg%';
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| gcluster_allow_sg_lost | 0 |
+------------------------+-------+
1 row in set (Elapsed: 00:00:00.00)
gbase> set global gcluster_allow_sg_lost=1;
Query OK, 0 rows affected (Elapsed: 00:00:00.01)
Querying again now returns the remaining, incomplete data.
gbase> select * from t1;
+------+------+----------+
| id | id2 | name |
+------+------+----------+
| 10 | NULL | Name_100 |
| 10 | NULL | Name_100 |
| 9 | NULL | Name_90 |
| 9 | NULL | Name_90 |
| 14 | NULL | Name_140 |
| 18 | NULL | Name_180 |
| 14 | NULL | Name_140 |
| 3 | NULL | Name_30 |
| 5 | NULL | Name_50 |
| 5 | NULL | Name_50 |
+------+------+----------+
10 rows in set, 1 warning (Elapsed: 00:00:00.00)
gbase> show warnings;
+-------+------+------------------------------------+
| Level | Code | Message |
+-------+------+------------------------------------+
| Note | 1702 | No valid nodes for table part 'n2' |
+-------+------+------------------------------------+
1 row in set (Elapsed: 00:00:00.00)
Creating a new distribution
The new distribution excludes the failed nodes. Since only one data node is left, the command below uses d 0, meaning the new distribution has no replica segments, which is exactly what the warning prompt points out.
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ cp gcChangeInfo.xml gcChangeInfo_one.xml
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ vi gcChangeInfo_one.xml
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ cat gcChangeInfo_one.xml
<?xml version="1.0" encoding="utf-8"?>
<servers>
<rack>
<node ip="10.0.2.101"/>
</rack>
</servers>
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin distribution gcChangeInfo_one.xml p 1 d 0
gcadmin generate distribution ...
[warning]: parameter [d num] is 0, the new distribution will has no segment backup
please ensure this is ok, input [Y,y] or [N,n]: y
gcadmin generate distribution successful
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 | OPEN | OPEN | 0 |
-------------------------------------------------------------
===========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
-----------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.101 | 5,6 | OPEN | OPEN | 0 |
-----------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.102 | 5 | OFFLINE | | |
-----------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.115 | 5 | OFFLINE | | |
-----------------------------------------------------------------------------------------------------------
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$
Initializing the node data map
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gccli testdb
GBase client 9.5.2.44.1045e3118. Copyright (c) 2004-2022, GBase. All Rights Reserved.
gbase> initnodedatamap;
Query OK, 0 rows affected, 2 warnings (Elapsed: 00:00:00.45)
Redistributing data to the new distribution
The current version (9.5.2.44) does not support the rebalance approach here: the command runs, but the background tasks keep failing because the segment data is missing. If a later version handles this, I will update this article.
Instead, rebuild the tables: do it now for the important tables that will keep receiving updates; historical tables that are only ever queried can be handled later, or simply dropped once they age out.
The following uses table t1 as an example of the rebuild.
gbase> create table t1_new like t1;
Query OK, 0 rows affected (Elapsed: 00:00:00.11)
gbase> insert into t1_new select * from t1;
Query OK, 10 rows affected, 1 warning (Elapsed: 00:00:00.80)
Records: 10 Duplicates: 0 Warnings: 0
gbase> select count(*) from t1;
+----------+
| count(*) |
+----------+
| 10 |
+----------+
1 row in set, 1 warning (Elapsed: 00:00:00.00)
gbase> select count(*) from t1_new;
+----------+
| count(*) |
+----------+
| 10 |
+----------+
1 row in set (Elapsed: 00:00:00.01)
gbase> drop table t1;
Query OK, 0 rows affected (Elapsed: 00:00:00.28)
gbase> rename table t1_new to t1;
Query OK, 0 rows affected (Elapsed: 00:00:00.32)
gbase> select count(*) from t1;
+----------+
| count(*) |
+----------+
| 10 |
+----------+
1 row in set (Elapsed: 00:00:00.01)
To find the tables still using the old distribution, query by DistributionId, which is 5 in this cluster.
gbase> select tbname from table_distribution where data_distribution_id=5;
+--------+
| tbname |
+--------+
| tt |
| tt2 |
| t2 |
| ta |
+--------+
4 rows in set (Elapsed: 00:00:00.00)
Rebuild all the other important tables that receive DML operations in the same way; historical tables can wait.
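If there are many such tables, the rebuild statements can be generated with SQL rather than typed by hand. This is only a hedged sketch built on the table_distribution query shown above: the _new suffix is an arbitrary choice, the generated statements should be reviewed before running them, and tables spread across several databases may additionally need the database name prepended.

gbase> select concat('create table ', tbname, '_new like ', tbname, '; ',
                     'insert into ', tbname, '_new select * from ', tbname, '; ',
                     'drop table ', tbname, '; ',
                     'rename table ', tbname, '_new to ', tbname, ';') as rebuild_sql
       from table_distribution where data_distribution_id=5;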
Removing the old distribution
This step requires that every table in the cluster has already been rebuilt and none is left on DistributionId 5. Use the query from the previous section to check.
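As a quick sanity check (a sketch reusing the same system table as before), the following count should return 0 before you continue:

gbase> select count(*) from table_distribution where data_distribution_id=5;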
Drop the nodedatamap:
gbase> refreshnodedatamap drop 5;
Query OK, 0 rows affected, 2 warnings (Elapsed: 00:00:00.59)
Drop the distribution. This requires that the cluster has no pending events for the failed nodes; clean them up first if necessary.
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmddlevent 2 10.0.2.102
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmddlevent 2 10.0.2.115
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmdistribution 5
cluster distribution ID [5]
it will be removed now
please ensure this is ok, input [Y,y] or [N,n]: y
gcadmin remove distribution [5] success
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$
Checking the cluster again, the DistributionId is now 6; 5 is gone.
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 | OPEN | OPEN | 0 |
-------------------------------------------------------------
===========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
-----------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.101 | 6 | OPEN | OPEN | 0 |
-----------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.102 | | OFFLINE | | |
-----------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.115 | | OFFLINE | | |
-----------------------------------------------------------------------------------------------------------
Removing the nodes from the cluster
Write a configuration file listing the nodes to remove, one IP per line, in the same format as gcChangeInfo.xml.
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ cat gcChangeInfo_uninstall.xml
<?xml version="1.0" encoding="utf-8"?>
<servers>
<rack>
<node ip="10.0.2.102"/>
<node ip="10.0.2.115"/>
</rack>
</servers>
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$
Remove the failed nodes:
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin rmnodes gcChangeInfo_uninstall.xml
gcadmin remove nodes ...
gcadmin rmnodes from cluster success
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 | OPEN | OPEN | 0 |
-------------------------------------------------------------
=========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
=========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.101 | 6 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$ gcadmin showdistribution
Distribution ID: 6 | State: new | Total segment num: 1
Primary Segment Node IP Segment ID Duplicate Segment node IP
========================================================================================================================
| 10.0.2.101 | 1 | |
========================================================================================================================
[gbase@gbase_rh7_001 gcinstall_9.5.2.44.10]$
Uninstalling the data services on the removed nodes
Since the servers cannot be recovered, this step is no longer needed. If the chance ever arises, though, it is still advisable to delete the database files on those servers' disks, so that if one of them is repaired and put back into service some day it does not interfere with the existing cluster.
See also: GBase 8a cluster services corosync/gcware malfunctioning due to interference from other IPs.
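If one of those servers is ever repaired, the rough sketch below shows what to do before reconnecting it to the production network. The install path is an assumption; replace it with the actual directories of your deployment.

# on the repaired server (10.0.2.102 / 10.0.2.115), as the gbase user
gcluster_services all stop        # stop any leftover GBase services that may still start
rm -rf /opt/gbase/*               # assumed install path -- wipe the old data, config and services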
Summary
When the servers holding both the primary and the replica copies cannot be repaired, the gcluster_allow_sg_lost parameter makes it possible to query the remaining, incomplete data. A forced manual scale-in, rebuilding the tables and then removing the failed nodes from the cluster entirely, lets the remaining nodes continue serving.
Once both the primary and the replicas have failed, data loss is inevitable, no matter how many copies you had. So after a failure, repair and recover as quickly as possible; avoiding this situation in the first place is always the best policy.