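During a GBase 8a scale-out, once every table has been redistributed to the new distribution, the old distribution can be removed with refreshnodedatamap drop. If some tables still have events that use the old distribution, however, the command fails with this error: Can not drop nodedatamap 1. FEventLog is using distribution. The outstanding events must be resolved before the operation can continue.
From another angle, when scaling out it is better to bring the cluster fully back to a normal state, with no events, before starting. This reduces the time spent on operations.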
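The pre-expansion check suggested above could be scripted. The following is a minimal sketch, assuming that gcadmin is on the PATH of the current user and that each of the three event subcommands used later in this article prints a line of the form `Event count:N`. The function names and the injectable `run` parameter are hypothetical and are not part of GBase.

```python
import re
import subprocess

# Subcommands taken from this article; each is assumed to print "Event count:N".
EVENT_COMMANDS = ("showdmlevent", "showddlevent", "showdmlstorageevent")


def parse_event_count(output):
    """Return N from an 'Event count:N' line, or None if no such line exists."""
    match = re.search(r"Event count:\s*(\d+)", output)
    return int(match.group(1)) if match else None


def run_gcadmin(subcommand):
    """Run 'gcadmin <subcommand>' and return its stdout (assumes gcadmin is on PATH)."""
    result = subprocess.run(
        ["gcadmin", subcommand], capture_output=True, text=True, check=True
    )
    return result.stdout


def pending_events(run=run_gcadmin):
    """Map each subcommand to its event count; 'run' is injectable for testing."""
    return {cmd: parse_event_count(run(cmd)) for cmd in EVENT_COMMANDS}


def safe_to_expand(counts):
    """True only when every count was parsed successfully and equals zero."""
    return all(count == 0 for count in counts.values())
```

Treating an unparsable output as "not safe" is a deliberate choice here: if the output format differs from the one shown in this article, the check fails closed rather than letting the expansion proceed.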
Error example
gbase> refreshnodedatamap drop 1;
ERROR 1707 (HY000): gcluster command error: Can not drop nodedatamap 1. FEventLog is using distribution.
Cause
gcadmin shows that there is indeed an event: the DataState of coordinator1 is 1.
[gbase@rh6-1 gcinstall_43R33]$ gcadmin
CLUSTER STATE: ACTIVE
CLUSTER MODE: NORMAL
=================================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=================================================================
| NodeName | IpAddress |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 | 10.0.2.201 | OPEN | OPEN | 1 |
-----------------------------------------------------------------
=============================================================
| GBASE DATA CLUSTER INFORMATION |
=============================================================
|NodeName | IpAddress |gnode |syncserver |DataState |
-------------------------------------------------------------
| node1 | 10.0.2.201 | OPEN | OPEN | 0 |
-------------------------------------------------------------
| node2 | 10.0.2.202 | OPEN | OPEN | 0 |
-------------------------------------------------------------
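Reading the DataState column by eye does not scale to larger clusters. The following is a rough sketch of a parser for it, assuming the pipe-delimited, five-column layout shown above and assuming that a non-zero DataState marks a node with outstanding events (in this article it is 1 while the event exists and 0 after it is cleared). The function name `nodes_with_events` is hypothetical.

```python
def nodes_with_events(gcadmin_output):
    """Return (node_name, ip, data_state) for rows whose DataState is not 0."""
    flagged = []
    for line in gcadmin_output.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # separators and the CLUSTER STATE/MODE lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Data rows have five cells with a numeric DataState in the last one;
        # this skips the one-cell title rows and the column header rows.
        if len(cells) != 5 or not cells[-1].isdigit():
            continue
        if cells[-1] != "0":
            flagged.append((cells[0], cells[1], int(cells[-1])))
    return flagged
```

For the output above, this returns a single entry for coordinator1 at 10.0.2.201 with DataState 1.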
Checking the individual events shows that this case is unusual: the problem is in audit_log, the audit log table, which is a gssys table.
[gbase@rh6-1 gcinstall_43R33]$ gcadmin showdmlevent
Event count:0
[gbase@rh6-1 gcinstall_43R33]$ gcadmin showddlevent
Event count:0
[gbase@rh6-1 gcinstall_43R33]$ gcadmin showdmlstorageevent
Event count:1
Event ID: 2
ObjectName: gbase.audit_log
TableID: 0
Fail Data Copy:
------------------------------------------------------
NodeIP: 10.0.2.201 FAILURE
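To act on the event details programmatically, the text above could be turned into records. The following is a sketch only, assuming the exact field labels shown above (`Event ID:`, `ObjectName:`, `TableID:`, `NodeIP: ... FAILURE`) and assuming that each additional event repeats the same block starting with `Event ID:`, which this article does not show. The function name and dictionary keys are hypothetical.

```python
import re


def parse_storage_events(output):
    """Parse 'gcadmin showdmlstorageevent' text into a list of event dicts."""
    events = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Event ID:"):
            # Assumption: every event block starts with an 'Event ID:' line.
            current = {
                "event_id": int(line.split(":", 1)[1]),
                "object_name": None,
                "table_id": None,
                "failed_nodes": [],
            }
            events.append(current)
        elif current is None:
            continue  # lines before the first event, e.g. 'Event count:1'
        elif line.startswith("ObjectName:"):
            current["object_name"] = line.split(":", 1)[1].strip()
        elif line.startswith("TableID:"):
            current["table_id"] = int(line.split(":", 1)[1])
        elif line.startswith("NodeIP:"):
            match = re.match(r"NodeIP:\s*(\S+)\s+FAILURE", line)
            if match:
                current["failed_nodes"].append(match.group(1))
    return events
```

The resulting records make it easy to list which objects and nodes are affected before deciding how to repair them.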
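Solution
Fix the event. If the system cannot finish synchronizing it automatically, investigate why.
The gc_recovery.log file in the gcluster log directory shows that this event cannot be recovered automatically, because a gssys table is a local table and has no replica.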
2022-04-14 08:57:40.898 [ERROR] <STORAGE-Recover-0>: GetSyncDmlStorgeInfo error, eventid=2, tablename=gbase.audit_log, content=gbase.audit_log,,true
2022-04-14 08:57:40.898 [INFO ] <RECOVER-INFO-0>: Finishing Recovering gbase.audit_log,tid 0
2022-04-14 08:57:41.119 [INFO ] <RECOVER-INFO>: MasterAssignTask dmlstoragetid num 1.
2022-04-14 08:57:41.119 [INFO ] <RECOVER-INFO-0>: Start Recovering gbase.audit_log tid 0
2022-04-14 08:57:41.119 [INFO ] <STORAGE-Recover-0>: Start DMLStorge recover gbase.audit_log,tid 0 eventnum 1
2022-04-14 08:57:41.119 [INFO ] <STORAGE-Recover-0>: Start to DMLStorge recover of eventid(2)
2022-04-14 08:57:41.119 [ERROR] <GCWare>: sys gbase.audit_log nodeid: 3372351498, have dmlstorageevent,eventid: 2
2022-04-14 08:57:41.119 [ERROR] <STORAGE-Recover>: GetDataCopyMap error, can't get a source node, because of no normal
2022-04-14 08:57:41.119 [ERROR] <STORAGE-Recover-0>: GetSyncDmlStorgeInfo error, eventid=2, tablename=gbase.audit_log, content=gbase.audit_log,,true
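Because the recovery thread retries continuously, the same error repeats many times in gc_recovery.log. The following is a small sketch that summarizes which events keep failing, assuming the line layout shown above (timestamp, bracketed level, source in angle brackets, message) and the `eventid=N, tablename=...` message form. The names `LOG_LINE`, `EVENT_REF`, and `failing_events` are hypothetical.

```python
import re
from collections import Counter

# Layout assumed from the log excerpt in this article.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>[A-Z]+)\s*\] "
    r"<(?P<source>[^>]+)>: (?P<message>.*)$"
)
EVENT_REF = re.compile(r"eventid=(\d+), tablename=([\w.]+)")


def failing_events(log_lines):
    """Count ERROR lines per (eventid, tablename) pair."""
    counts = Counter()
    for line in log_lines:
        parsed = LOG_LINE.match(line.strip())
        if not parsed or parsed.group("level") != "ERROR":
            continue
        ref = EVENT_REF.search(parsed.group("message"))
        if ref:
            counts[(int(ref.group(1)), ref.group(2))] += 1
    return counts
```

ERROR lines that do not carry an `eventid=N, tablename=...` pair, such as the GetDataCopyMap message, are skipped by this sketch, so it complements reading the log rather than replacing it.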
Log in to the node and repair the table. The repair reports an error.
gbase> repair table gbase.audit_log;
+-----------------+--------+----------+-----------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------+--------+----------+-----------------------------------+
| gbase.audit_log | repair | Error | Incorrect file format 'audit_log' |
| gbase.audit_log | repair | error | Corrupt |
+-----------------+--------+----------+-----------------------------------+
2 rows in set (Elapsed: 00:00:00.01)
This confirms that the table's data file is completely corrupted, so the only option is to clear the data.
gbase> repair table gbase.audit_log use_frm;
+-----------------+--------+----------+-----------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------+--------+----------+-----------------------------------+
| gbase.audit_log | repair | Error | Incorrect file format 'audit_log' |
| gbase.audit_log | repair | status | OK |
+-----------------+--------+----------+-----------------------------------+
2 rows in set (Elapsed: 00:00:00.00)
Then remove the event.
[gbase@rh6-1 gcinstall_43R33]$ gcadmin rmdmlstorageevent 0 2
[gbase@rh6-1 gcinstall_43R33]$ gcadmin
CLUSTER STATE: ACTIVE
CLUSTER MODE: NORMAL
=================================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=================================================================
| NodeName | IpAddress |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 | 10.0.2.201 | OPEN | OPEN | 0 |
-----------------------------------------------------------------
=============================================================
| GBASE DATA CLUSTER INFORMATION |
=============================================================
|NodeName | IpAddress |gnode |syncserver |DataState |
-------------------------------------------------------------
| node1 | 10.0.2.201 | OPEN | OPEN | 0 |
-------------------------------------------------------------
| node2 | 10.0.2.202 | OPEN | OPEN | 0 |
-------------------------------------------------------------
Dropping the old distribution again now succeeds.
gbase> refreshnodedatamap drop 1;
Query OK, 0 rows affected (Elapsed: 00:00:04.64)
gbase> ^CAborted
[gbase@rh6-1 gcluster]$ gcadmin rmdistribution 1
cluster distribution ID [1]
it will be removed now
please ensure this is ok, input y or n: y
gcadmin remove distribution [1] success
[gbase@rh6-1 gcluster]$
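Summary
Pay attention to events as soon as they appear in the database. If the database cannot recover them automatically, investigate the cause, first ruling out problems in the environment itself, such as disk damage, a full disk, or an unstable network. Carry out the expansion only after the events have been recovered.
If an event cannot be recovered automatically for logical reasons, for example when both the primary and the replica have been flagged as inconsistent, or when the table is a local gssys table as in this case, handle it manually according to the actual situation.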