GBase 8a (南大通用): right after the service starts, gcadmin reports exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN] gcadmin show cluster failed

This article describes the error reported by the GBase 8a cluster management command gcadmin: exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN], gcadmin show cluster failed. It usually appears while the cluster's corosync/gcware service is synchronizing REDOLOG information, and it clears on its own shortly afterwards.

Sample error

The following errors appeared right after the cluster service was started. The first, bare GC_AIS_ERR_TRY_AGAIN was reported because the service had not been started yet; after starting it, the error keeps appearing for a while before the cluster becomes available.

[gcadmin] Could not initialize CRM instance error: [6]->[GC_AIS_ERR_TRY_AGAIN]
[root@localhost ~]# service gcware start
Starting GCWARE (gcwexec):                                 [  OK  ]
Starting GBASED    :                                       [  OK  ]
Starting GCSYNC :                                          [  OK  ]
Starting GCLUSTERD :                                       [  OK  ]
Starting GCRECOVER :                                       [  OK  ]
Starting GCMonit success!

[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR:  [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR:  [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR:  [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR:  [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR:  [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed

Troubleshooting

Check the process

The corosync process is found to be running normally.

[root@localhost ~]# ps -ef|grep corosync
root      6445 24649  0 06:42 pts/0    00:00:00 grep --color=auto corosync
root     24799     1  9 06:41 ?        00:00:03 corosync

Check the logs

The log shows that the REDOLOG is still being processed (a watch one-liner follows the excerpt below).

[root@localhost ~]# tail /var/log/corosync.log
Sep 05 06:42:26 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:26 corosync [CLM   ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:27 corosync [CLM   ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:27 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:27 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:27 corosync [CLM   ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:28 corosync [CLM   ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:28 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:28 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:28 corosync [CLM   ] [DEBUG]: clm_sync_process [LEAVE]
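
If you do not want to keep re-running tail by hand, a minimal sketch is to follow the log and stop once the completion marker described in the next section shows up. The log path and the marker text are the ones seen in this article; assume they may differ in other versions or installations.

# Follow corosync.log and exit as soon as the sync-completion marker appears.
# Path and marker text are taken from this article and may vary by version.
tail -f /var/log/corosync.log | grep -m 1 "sync process end"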

Solution

Wait a while and let the service finish the synchronization. How long that takes depends on the size of the REDOLOG, usually somewhere between seconds and minutes. Once the line all node finish new cluster persist, sync process end appears in the log, synchronization is complete and the node has recovered. A small polling sketch follows the log excerpt below.

Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:49 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:49 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:49 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:49 corosync [corosy] all node finish new cluster  persist, sync process end
Sep 05 06:42:49 corosync [corosy] persist_sync_finish leave 0
Sep 05 06:42:49 corosync [CRM   ] [INFO]: key_change_notify: cluster_state 2
Sep 05 06:42:49 corosync [LCK   ] [INFO]: key_change_notify: cluster_state 2
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_lock begin
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_lock end
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_commit_takeover begin
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_commit_takeover end
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_commit begin
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_commit for offline node find out takeovernodeid = 1778515978
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_commit for conn disconnect find out takeovernodeid = 1778515978
Sep 05 06:42:49 corosync [LCK   ] lck_remove_invalid_commit end
Sep 05 06:42:49 corosync [LCK   ] qlm sync active remove lock 1 0
Sep 05 06:42:49 corosync [LCK   ] qlm_sync_active_compatible_mode
Sep 05 06:42:49 corosync [LCK   ] qlm_sync_active_compatible_mode name:, type:1 nodeid:1795293194
Sep 05 06:42:49 corosync [LCK   ] lck_common_regist_sync_active regist_type:2,nodeid:1795293194,conn:(nil),trans:0
Sep 05 06:42:49 corosync [LCK   ] lck_unregist_common_regist regist type not find,2
Sep 05 06:42:49 corosync [LCK   ] lck_common_regist_sync_active regist_type:4,nodeid:1795293194,conn:(nil),trans:0
Sep 05 06:42:49 corosync [LCK   ] lck_unregist_common_regist node is not exist!
Sep 05 06:42:49 corosync [LCK   ] [INFO]: notify_cluster_state_change_to_consumer:state:2,local_conn:(nil),event_conn:(nil),node_id:1795293194
Sep 05 06:42:49 corosync [LCK   ] [INFO]: lck_print_commit_takeover_list [ENTER]
Sep 05 06:42:49 corosync [LCK   ] [INFO]: NodeId = 1778515978 connection:0xc41440
Sep 05 06:42:49 corosync [LCK   ] [INFO]: lck_print_commit_takeover_list [END]
Sep 05 06:42:49 corosync [LCK   ] [INFO]: lck_print_commit_list [ENTER]
Sep 05 06:42:49 corosync [LCK   ] [INFO]: lck_print_commit_list [END]
Sep 05 06:42:49 corosync [LCK   ] lck regist consumer:type:4,master_nodeid:1778515978,num:1
Sep 05 06:42:49 corosync [LCK   ] lck regist consumer:nodeid:1778515978,info:(null)
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_activate [ENTER]
Sep 05 06:42:49 corosync [LCK   ] lck_sync_reset enter
Sep 05 06:42:49 corosync [LCK   ] lck_sync_reset leave
Sep 05 06:42:49 corosync [CRM   ] [DEBUG]: crm_sync_reset [LEAVE]
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_reset [ENTER]
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_activate [LEAVE]
Sep 05 06:42:49 corosync [CLM   ] [DEBUG]: clm_sync_activate [LEAVE]
Sep 05 06:42:49 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Sep 05 06:42:49 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:49 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:49 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:49 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:54 corosync [CLM   ] port scan read error ip:::ffff:10.0.2.107 port:5258 time:5000(ms) cfg_connect_timeout:5000(ms) error:11 Resource temporarily unavailable
Sep 05 06:42:54 corosync [CLM   ] port scan read error ip:::ffff:10.0.2.107 port:5258 time:5000(ms) cfg_connect_timeout:5000(ms) error:11 Resource temporarily unavailable
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckRecoverRegist [ENTER]
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckRecoverRegist [LEAVE]
Sep 05 06:42:54 corosync [LCK   ] EXEC request: saLckRecoverRegist 1795293194
Sep 05 06:42:54 corosync [LCK   ] EXEC notification: recover_regist_changed
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckCommitRegistTakeover [ENTER]
Sep 05 06:42:54 corosync [LCK   ] LIB request: gcLckCommitRegistTakeover [LEAVE]
Sep 05 06:42:54 corosync [LCK   ] EXEC request: saLckCommitRegistTakeover 1795293194
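
Instead of re-running gcadmin by hand until it answers, a small retry loop can poll it and report when the TRY_AGAIN error goes away. This is only a convenience sketch built on the commands shown above: it checks the command output for GC_AIS_ERR_TRY_AGAIN rather than relying on the exit status (which is not documented here), and the roughly two-minute timeout is an arbitrary assumption, not a GBase recommendation.

#!/bin/bash
# Poll gcadmin every 2 seconds until it no longer reports GC_AIS_ERR_TRY_AGAIN,
# or give up after about 120 seconds.
for i in $(seq 1 60); do
    out=$(gcadmin 2>&1)
    if ! echo "$out" | grep -q "GC_AIS_ERR_TRY_AGAIN"; then
        echo "$out"
        exit 0
    fi
    sleep 2
done
echo "gcadmin still failing after ~120s, check /var/log/corosync.log" >&2
exit 1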

Recovery

[root@localhost ~]# gcadmin
CLUSTER STATE:  ACTIVE
CLUSTER MODE:   NORMAL

=================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION             |
=================================================================
|   NodeName   |     IpAddress     |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 |    10.0.2.107     | OPEN  |  OPEN   |    0     |
-----------------------------------------------------------------
| coordinator2 |    10.0.2.106     | OPEN  |  OPEN   |    0     |
-----------------------------------------------------------------
=============================================================
|              GBASE DATA CLUSTER INFORMATION               |
=============================================================
|NodeName |     IpAddress     |gnode |syncserver |DataState |
-------------------------------------------------------------
|  node1  |    10.0.2.107     | OPEN |   OPEN    |    0     |
-------------------------------------------------------------
|  node2  |    10.0.2.106     | OPEN |   OPEN    |    0     |
-------------------------------------------------------------

Note

We have also seen a case in the field where the disk on a server running corosync filled up, causing corosync to misbehave; the logs on the other nodes showed that node repeatedly joining and leaving the cluster, which disrupted service across the whole cluster.
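
When that happens, two quick checks on each gcware/corosync node usually narrow it down: free space on the filesystems holding the gcware data and the corosync log, and how often membership changes appear in corosync.log. The sketch below is only a rough starting point: the paths are the ones used in this article, and the grep pattern is an assumption, since the exact wording of the join/leave messages depends on the corosync/gcware version.

# Check free space on the root filesystem and the log filesystem (adjust paths to your layout).
df -h / /var/log

# Roughly count membership-change messages; a large, growing number suggests node flapping.
# The pattern is an assumption -- check the wording in your own corosync.log first.
grep -c -i -E "join|left|leave" /var/log/corosync.log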