GBase 8a clusters rely on replicas to keep data highly available. When some service or node fails, inconsistencies are produced, for example a ddlevent. This article simulates such a failure in a test environment and analyzes the recovery process.
Environment
A 3-node V95 cluster; one node is shut down, leaving only 2 nodes running.
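In this test the node was taken offline by stopping its cluster services locally. On GBase 8a this is normally done with gcluster_services; treat the exact invocation below as an assumption and check it against your version:
gcluster_services all stop
With only two nodes left, gcadmin reports the stopped node as OFFLINE: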
[gbase@localhost log]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
==============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
==============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
--------------------------------------------------------------
| coordinator1 | 10.0.2.102 | OPEN | OPEN | 0 |
--------------------------------------------------------------
| coordinator2 | 10.0.2.202 | OPEN | OPEN | 0 |
--------------------------------------------------------------
| coordinator3 | 10.0.2.203 | OFFLINE | | |
--------------------------------------------------------------
===========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
===========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
-----------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.102 | 3 | OPEN | OPEN | 0 |
-----------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.202 | 3 | OPEN | OPEN | 0 |
-----------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.203 | 3 | OFFLINE | | |
-----------------------------------------------------------------------------------------------------------
Testing a DDL operation
[gbase@localhost log]$ gccli testdb
GBase client 9.5.2.17.115980. Copyright (c) 2004-2020, GBase. All Rights Reserved.
gbase> create table t11 (id int);
Query OK, 0 rows affected (Elapsed: 00:00:00.62)
Check the DDL event
[gbase@localhost gcluster]$ gcadmin showddlevent
Vc event count:1
Event ID: 8
ObjectName: testdb.t11
Fail Node Copy:
------------------------------------------------------
NodeID: 3405905930 NodeIP:10.0.2.203 FAILURE
Fail Data Copy:
------------------------------------------------------
SegName: n3 NodeIP: 10.0.2.203 FAILURE
SegName: n2 NodeIP: 10.0.2.203 FAILURE
In express.log under gcluster/log/gcluster/ on the current coordinator node, we find the entries that set the inconsistency flag, corresponding to event id 8 above.
2020-08-15 18:41:51.834 [DDL][WARN ][S:9][Q:4]:node(140100944134154) is not valid.
2020-08-15 18:41:51.834 [DDL][WARN ][S:9][Q:4]:node(140100944134154) is offline.
2020-08-15 18:41:52.224 [DDL][DEF ][S:9][Q:4]:[create table] Node(10.0.2.203:5050) testdb.t11 with Suffix: n3 fails
2020-08-15 18:41:52.224 [DDL][DEF ][S:9][Q:4]:[create table] Node(10.0.2.203:5258) testdb.t11 with Suffix: N/A fails
2020-08-15 18:41:52.225 [DDL][DEF ][S:9][Q:4]:[create table] Node(10.0.2.203:5050) testdb.t11 with Suffix: n2 fails
2020-08-15 18:41:52.225 [DDL][ERROR][S:9][Q:4]:[create table] Failing SQL: create table t11 (id int)
2020-08-15 18:41:52.286 [DEF ] [S:9][Q:4]Set metadata(testdb.t11:normal) to LOCKED, event id is 8
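These entries can be located quickly by grepping the log for the table name or for the event id; the patterns below are only examples, and the absolute path of express.log depends on where the cluster is installed:
grep -n "testdb.t11" express.log
grep -n "event id is 8" express.log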
Start the failed node. It comes back with the inconsistency flag set (DataState = 1).
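Bringing the node back means starting its services again on that node; as above, the exact command is an assumption to be verified against your installation:
gcluster_services all start
Once the services are up, gcadmin shows the node as OPEN but with DataState 1: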
[gbase@localhost ~]$ gcadmin
CLUSTER STATE: ACTIVE
VIRTUAL CLUSTER MODE: NORMAL
=============================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=============================================================
| NodeName | IpAddress | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.102 | OPEN | OPEN | 0 |
-------------------------------------------------------------
| coordinator2 | 10.0.2.202 | OPEN | OPEN | 0 |
-------------------------------------------------------------
| coordinator3 | 10.0.2.203 | OPEN | OPEN | 1 |
-------------------------------------------------------------
=========================================================================================================
| GBASE DATA CLUSTER INFORMATION |
=========================================================================================================
| NodeName | IpAddress | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
| node1 | 10.0.2.102 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node2 | 10.0.2.202 | 3 | OPEN | OPEN | 0 |
---------------------------------------------------------------------------------------------------------
| node3 | 10.0.2.203 | 3 | OPEN | OPEN | 1 |
---------------------------------------------------------------------------------------------------------
Check the failure recovery log, gc_recover.log.
This log may be on any one of the coordinator nodes. Find the session value that corresponds to eventid=8, then grep the log for that session. For example:
tail -n 2000 gc_recover.log | grep session:7
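If the session number is not known beforehand, it can be found by first grepping for the event id; the pattern below simply matches the recoverinfo lines shown in the output:
grep "eventid(8)" gc_recover.log
The matching lines are tagged with <session:7>, which gives the session value used in the filter above.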
2020-08-15 19:09:59.386 [INFO ] <session:7>: Start DDL recover,. tid 353 num 1
2020-08-15 19:09:59.386 [INFO ] <session:7>: Start to DDL recover of eventid(8)
2020-08-15 19:09:59.388 [INFO ] <session:7>: ==================================================
2020-08-15 19:09:59.389 [INFO ] <session:7>: recoverinfo gngcddl: eventid = 8, tableId = 353, tablename = testdb.t11, is dummy log=0
2020-08-15 19:09:59.389 [INFO ] <session:7>:
2020-08-15 19:09:59.453 [INFO ] <session:7>: DealDDLRecoverLock vc00001.testdb.t11 lockinfo [{"key":"vc00000.hashmap_lock","vc":"vc00000","db":"","tb":"","lktype":0,"stat":1},{"key":"vc00001.hashmap_lock","vc":"vc00001","db":"","tb":"","lktype":0,"stat":1},{"key":"vc00000.gbase.db_lock","vc":"vc00000","db":"GBASE","tb":"","lktype":1,"stat":1},{"key":"vc00001.testdb.db_lock","vc":"vc00001","db":"testdb","tb":"","lktype":1,"stat":1},{"key":"vc00001.testdb.table_space_lock","vc":"vc00001","db":"testdb","tb":"","lktype":2,"stat":1},{"key":"vc00000.gbase.table_distribution.meta_lock","vc":"vc00000","db":"GBASE","tb":"TABLE_DISTRIBUTION","lktype":4,"stat":1},{"key":"vc00001.testdb.t11.meta_lock","vc":"vc00001","db":"testdb","tb":"t11","lktype":4,"stat":2}] success
2020-08-15 19:09:59.500 [INFO ] <session:7>: The source GCluster(10.0.2.102)
2020-08-15 19:09:59.507 [INFO ] <session:7>: recoverinfo gc(ddl): eventid(8),Db(testdb), Tb(t11), STb() dst:0x3405905930(10.0.2.203) .
2020-08-15 19:09:59.507 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "use vc "vcname000001""
2020-08-15 19:09:59.507 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "use self "vc00001"."testdb";"
2020-08-15 19:09:59.544 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "SET SELF SCN = 9217"
2020-08-15 19:09:59.544 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log start.............
2020-08-15 19:09:59.544 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5258, do "CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "vc00001"."testdb"."t11" (
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: Succeeded recovering Log: CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "vc00001"."testdb"."t11" (
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log: end.............
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: do sql success: CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "vc00001"."testdb"."t11" (
2020-08-15 19:10:00.241 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:,et:DDL,eid:8,en:coor,ip:10.0.2.203>: delete ddl event, cluster eventId=8.............
2020-08-15 19:10:00.391 [INFO ] <session:7>: The source node(10.0.2.102)
2020-08-15 19:10:00.394 [INFO ] <session:7>: The source GCluster(10.0.2.102)
2020-08-15 19:10:00.402 [INFO ] <session:7>: recoverinfo gn(ddl): eventid(8),Db(testdb), Tb(t11), STb() dst:0xcb02000a(10.0.2.203) slice(n3).
2020-08-15 19:10:00.402 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "use self "testdb";"
2020-08-15 19:10:00.402 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "SET SELF SCN = 9217"
2020-08-15 19:10:00.403 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log start.............
2020-08-15 19:10:00.403 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "testdb"."t11_n3" (
2020-08-15 19:10:00.431 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: Succeeded recovering Log: CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "testdb"."t11_n3" (
2020-08-15 19:10:00.431 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log: end.............
2020-08-15 19:10:00.432 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: do sql success: CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "testdb"."t11_n3" (
2020-08-15 19:10:00.432 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n3,et:DDL,eid:8,en:node,ip:10.0.2.203>: delete ddl event, node eventId=8, segId=3.............
2020-08-15 19:10:00.614 [INFO ] <session:7>: The source node(10.0.2.202)
2020-08-15 19:10:00.616 [INFO ] <session:7>: The source GCluster(10.0.2.102)
2020-08-15 19:10:00.624 [INFO ] <session:7>: recoverinfo gn(ddl): eventid(8),Db(testdb), Tb(t11), STb() dst:0xcb02000a(10.0.2.203) slice(n2).
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "use self "testdb";"
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "SET SELF SCN = 9217"
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log start.............
2020-08-15 19:10:00.624 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: QueryExecute ip 10.0.2.203,port 5050, do "CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "testdb"."t11_n2" (
2020-08-15 19:10:00.952 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: Succeeded recovering Log: CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "testdb"."t11_n2" (
2020-08-15 19:10:00.953 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: DB_DoDDLQuery DDL Redo Log: end.............
2020-08-15 19:10:00.953 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: do sql success: CREATE GCLUSTER_LOCAL TABLE /*+ sub_step,vcid,distribution(3) */ "testdb"."t11_n2" (
2020-08-15 19:10:00.953 [INFO ] <session:7, tid:353, db:testdb,tb:t11,nx:n2,et:DDL,eid:8,en:node,ip:10.0.2.203>: delete ddl event, node eventId=8, segId=2.............
2020-08-15 19:10:00.997 [INFO ] <session:7>: DealDDLRecoverUnLock vc00001.testdb.t11
2020-08-15 19:10:01.197 [INFO ] <session:7>: End DDL recover .,tid 353
Note that in the log above, the coordinator (management) node and the data node are each recovered separately: the table is first recreated on the coordinator (port 5258), and then the n3 and n2 slices are recreated on the data node (port 5050).
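After both recoveries finish, the event is removed and the node returns to a consistent state. This can be verified with the same commands used earlier: gcadmin showddlevent should now report a zero event count, and gcadmin should show DataState 0 for node3.
gcadmin showddlevent
gcadmin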