南大通用GBase 8a数据库数据不一致ddlevent故障模拟和恢复过程分析

GBase 8a数据库集群,通过副本来保证数据高可用,当某些服务或节点故障时,就会产生不一致,比如ddlevent。本文在测试环境模拟故障,并分析其恢复过程。

环境

3个节点V95集群,关闭其中一个节点,只保留2个。

[gbase@localhost log]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

==============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION            |
==============================================================
|   NodeName   | IpAddress  | gcware  | gcluster | DataState |
--------------------------------------------------------------
| coordinator1 | 10.0.2.102 |  OPEN   |   OPEN   |     0     |
--------------------------------------------------------------
| coordinator2 | 10.0.2.202 |  OPEN   |   OPEN   |     0     |
--------------------------------------------------------------
| coordinator3 | 10.0.2.203 | OFFLINE |          |           |
--------------------------------------------------------------
===========================================================================================================
|                                     GBASE DATA CLUSTER INFORMATION                                      |
===========================================================================================================
| NodeName |                IpAddress                 | DistributionId |  gnode  | syncserver | DataState |
-----------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.102                |       3        |  OPEN   |    OPEN    |     0     |
-----------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.202                |       3        |  OPEN   |    OPEN    |     0     |
-----------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.203                |       3        | OFFLINE |            |           |
-----------------------------------------------------------------------------------------------------------

测试DDL操作

[gbase@localhost log]$ gccli testdb\
GBase client 9.5.2.17.115980. Copyright (c) 2004-2020, GBase.  All Rights Reserved.

gbase> create table t11 (id int);
Query OK, 0 rows affected (Elapsed: 00:00:00.62)

查看event

[gbase@localhost gcluster]$ gcadmin showddlevent
Vc event count:1
Event ID:    8
ObjectName: testdb.t11
Fail Node Copy:
------------------------------------------------------
NodeID: 3405905930      NodeIP:10.0.2.203       FAILURE

Fail Data Copy:
------------------------------------------------------
SegName: n3     NodeIP: 10.0.2.203      FAILURE
SegName: n2     NodeIP: 10.0.2.203      FAILURE

查看当前节点gcluster/log/gcluster/express.log,发现了设置不一致标识的内容,对应上面的event id=8。

2020-08-15 18:41:51.834 [DDL][WARN ][S:9][Q:4]:node(140100944134154) is not valid.
2020-08-15 18:41:51.834 [DDL][WARN ][S:9][Q:4]:node(140100944134154) is offline.
2020-08-15 18:41:52.224 [DDL][DEF ][S:9][Q:4]:[create table] Node(10.0.2.203:5050) testdb.t11 with Suffix: n3 fails
2020-08-15 18:41:52.224 [DDL][DEF ][S:9][Q:4]:[create table] Node(10.0.2.203:5258) testdb.t11 with Suffix: N/A fails
2020-08-15 18:41:52.225 [DDL][DEF ][S:9][Q:4]:[create table] Node(10.0.2.203:5050) testdb.t11 with Suffix: n2 fails
2020-08-15 18:41:52.225 [DDL][ERROR][S:9][Q:4]:[create table] Failing SQL: create table t11 (id int)
2020-08-15 18:41:52.286 [DEF ] [S:9][Q:4]Set metadata(testdb.t11:normal) to LOCKED, event id is 8

启动故障节点,有不一致标记

[gbase@localhost ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.102 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.202 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator3 | 10.0.2.203 |  OPEN  |   OPEN   |     1     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.102                |       3        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.202                |       3        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.203                |       3        | OPEN  |    OPEN    |     1     |
---------------------------------------------------------------------------------------------------------

查看故障恢复日志。 gc_recover.log.

该日志可能在任意一个管理节点上。找到eventid=8的对应的session值,然后grep日志。比如

tail gc_recover.log -n 2000|grep session:7

2020-08-15 19:09:59.386 [INFO ] <session:7>: Start DDL recover,. tid 353 num 1
2020-08-15 19:09:59.386 [INFO ] <session:7>: Start to DDL recover of eventid(8)
2020-08-15 19:09:59.388 [INFO ] <session:7>: ==================================================
2020-08-15 19:09:59.389 [INFO ] <session:7>: recoverinfo gngcddl: eventid = 8, tableId = 353, tablename = testdb.t11, is dummy log=0
2020-08-15 19:09:59.389 [INFO ] <session:7>:
2020-08-15 19:09:59.453 [INFO ] <session:7>: DealDDLRecoverLock vc00001.testdb.t11 lockinfo [{"key":"vc00000.hashmap_lock","vc":"vc00000","db":"","tb":"","lktype":0,"stat":1},{"key":"vc00001.hashmap_lock","vc":"vc00001","db":"","tb":"","lktype":0,"stat":1},{"key":"vc00000.gbase.db_lock","vc":"vc00000","db":"GBASE","tb":"","lktype":1,"stat":1},{"key":"vc00001.testdb.db_lock","vc":"vc00001","db":"testdb","tb":"","lktype":1,"stat":1},{"key":"vc00001.testdb.table_space_lock","vc":"vc00001","db":"testdb","tb":"","lktype":2,"stat":1},{"key":"vc00000.gbase.table_distribution.meta_lock","vc":"vc00000","db":"GBASE","tb":"TABLE_DISTRIBUTION","lktype":4,"stat":1},{"key":"vc00001.testdb.t11.meta_lock","vc":"vc00001","db":"testdb","tb":"t11","lktype":4,"stat":2}] success
2020-08-15 19:09:59.500 [INFO ] <session:7>: The source GCluster(10.0.2.102)
2020-08-15 19:09:59.507 [INFO ] <session:7>: recoverinfo gc(ddl): eventid(8),Db(testdb), Tb(t11), STb() dst:0x3405905930(10.0.2.203) .
2020-08-15 19:09:59.507 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "use vc "vcname000001""
2020-08-15 19:09:59.507 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "use self "vc00001"."testdb";"
2020-08-15 19:09:59.544 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "SET SELF SCN = 9217"
2020-08-15 19:09:59.544 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log start.............
2020-08-15 19:09:59.544 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */ "vc00001"."testdb"."t11" (
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: Succeeded recovering Log: CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */ "vc00001"."testdb"."t11" (
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log: end.............
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: do sql success: CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */ "vc00001"."testdb"."t11" (
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: delete ddl event, cluster eventId=8.............
2020-08-15 19:10:00.391 [INFO ] <session:7>: The source node(10.0.2.102)
2020-08-15 19:10:00.394 [INFO ] <session:7>: The source GCluster(10.0.2.102)
2020-08-15 19:10:00.402 [INFO ] <session:7>: recoverinfo gn(ddl): eventid(8),Db(testdb), Tb(t11), STb() dst:0xcb02000a(10.0.2.203) slice(n3).
2020-08-15 19:10:00.402 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "use self "testdb";"
2020-08-15 19:10:00.402 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "SET SELF SCN = 9217"
2020-08-15 19:10:00.403 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log start.............
2020-08-15 19:10:00.403 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */  "testdb"."t11_n3" (
2020-08-15 19:10:00.431 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: Succeeded recovering Log: CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */  "testdb"."t11_n3" (
2020-08-15 19:10:00.431 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log: end.............
2020-08-15 19:10:00.432 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: do sql success: CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */  "testdb"."t11_n3" (
2020-08-15 19:10:00.432 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: delete ddl event, node eventId=8, segId=3.............
2020-08-15 19:10:00.614 [INFO ] <session:7>: The source node(10.0.2.202)
2020-08-15 19:10:00.616 [INFO ] <session:7>: The source GCluster(10.0.2.102)
2020-08-15 19:10:00.624 [INFO ] <session:7>: recoverinfo gn(ddl): eventid(8),Db(testdb), Tb(t11), STb() dst:0xcb02000a(10.0.2.203) slice(n2).
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "use self "testdb";"
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "SET SELF SCN = 9217"
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log start.............
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */  "testdb"."t11_n2" (
2020-08-15 19:10:00.952 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: Succeeded recovering Log: CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */  "testdb"."t11_n2" (
2020-08-15 19:10:00.953 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log: end.............
2020-08-15 19:10:00.953 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: do sql success: CREATE  GCLUSTER_LOCAL  TABLE /*+ sub_step,vcid,distribution(3) */  "testdb"."t11_n2" (
2020-08-15 19:10:00.953 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: delete ddl event, node eventId=8, segId=2.............
2020-08-15 19:10:00.997 [INFO ] <session:7>: DealDDLRecoverUnLock vc00001.testdb.t11
2020-08-15 19:10:01.197 [INFO ] <session:7>: End DDL recover .,tid 353

注意如上日志中,分别对管理节点和数据节点进行了各自恢复。