This article covers an error reported by the cluster management command gcadmin on a GBase 8a cluster: exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN], gcadmin show cluster failed. It generally appears while the cluster's corosync/gcware service is synchronizing REDOLOG data, and clears on its own shortly afterwards.
Error sample
The following errors appeared right after the cluster services were started. The first, standalone GC_AIS_ERR_TRY_AGAIN occurred because the service was not yet running; after startup, the error continues for a while longer.
[gcadmin] Could not initialize CRM instance error: [6]->[GC_AIS_ERR_TRY_AGAIN]
[root@localhost ~]# service gcware start
Starting GCWARE (gcwexec): [ OK ]
Starting GBASED : [ OK ]
Starting GCSYNC : [ OK ]
Starting GCLUSTERD : [ OK ]
Starting GCRECOVER : [ OK ]
Starting GCMonit success!
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
[root@localhost ~]# gcadmin
exec_cmd_show_cluster crm get cluster mode ERROR: [6]-->[GC_AIS_ERR_TRY_AGAIN]
gcadmin show cluster failed
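Rather than re-running gcadmin by hand as above, the retries can be wrapped in a small script. A minimal sketch; `retry_until_ready`, its arguments, and the 5-second interval are illustrative choices, not part of gcadmin itself:

```shell
#!/bin/sh
# Retry a command until its output no longer contains the TRY_AGAIN
# error, or until the retry budget is exhausted.
retry_until_ready() {
    max="$1"; shift          # maximum number of attempts
    n=0
    while [ "$n" -lt "$max" ]; do
        out=$("$@" 2>&1)
        case "$out" in
            *GC_AIS_ERR_TRY_AGAIN*)
                sleep 5      # still syncing; wait and retry
                n=$((n + 1))
                ;;
            *)
                printf '%s\n' "$out"
                return 0     # command produced normal output
                ;;
        esac
    done
    return 1                 # gave up
}

# Typical use:
#   retry_until_ready 60 gcadmin
```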
Troubleshooting
Check the processes
The corosync process is running normally.
[root@localhost ~]# ps -ef|grep corosync
root 6445 24649 0 06:42 pts/0 00:00:00 grep --color=auto corosync
root 24799 1 9 06:41 ? 00:00:03 corosync
Check the logs
The log shows that the REDOLOG is being processed.
[root@localhost ~]# tail /var/log/corosync.log
Sep 05 06:42:26 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:26 corosync [CLM ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:27 corosync [CLM ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:27 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:27 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:27 corosync [CLM ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:28 corosync [CLM ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:28 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:28 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:28 corosync [CLM ] [DEBUG]: clm_sync_process [LEAVE]
Solution
Wait a while for the service to finish processing. How long depends on the size of the REDOLOG, usually seconds to minutes. Once the line all node finish new cluster persist, sync process end appears, synchronization is complete and the node has recovered.
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:49 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:49 corosync [corosy] persist_sync_finish leave 1
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_process [ENTER]
Sep 05 06:42:49 corosync [corosy] persist_sync_finish enter
Sep 05 06:42:49 corosync [corosy] all node finish new cluster persist, sync process end
Sep 05 06:42:49 corosync [corosy] persist_sync_finish leave 0
Sep 05 06:42:49 corosync [CRM ] [INFO]: key_change_notify: cluster_state 2
Sep 05 06:42:49 corosync [LCK ] [INFO]: key_change_notify: cluster_state 2
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_lock begin
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_lock end
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_commit_takeover begin
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_commit_takeover end
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_commit begin
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_commit for offline node find out takeovernodeid = 1778515978
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_commit for conn disconnect find out takeovernodeid = 1778515978
Sep 05 06:42:49 corosync [LCK ] lck_remove_invalid_commit end
Sep 05 06:42:49 corosync [LCK ] qlm sync active remove lock 1 0
Sep 05 06:42:49 corosync [LCK ] qlm_sync_active_compatible_mode
Sep 05 06:42:49 corosync [LCK ] qlm_sync_active_compatible_mode name:, type:1 nodeid:1795293194
Sep 05 06:42:49 corosync [LCK ] lck_common_regist_sync_active regist_type:2,nodeid:1795293194,conn:(nil),trans:0
Sep 05 06:42:49 corosync [LCK ] lck_unregist_common_regist regist type not find,2
Sep 05 06:42:49 corosync [LCK ] lck_common_regist_sync_active regist_type:4,nodeid:1795293194,conn:(nil),trans:0
Sep 05 06:42:49 corosync [LCK ] lck_unregist_common_regist node is not exist!
Sep 05 06:42:49 corosync [LCK ] [INFO]: notify_cluster_state_change_to_consumer:state:2,local_conn:(nil),event_conn:(nil),node_id:1795293194
Sep 05 06:42:49 corosync [LCK ] [INFO]: lck_print_commit_takeover_list [ENTER]
Sep 05 06:42:49 corosync [LCK ] [INFO]: NodeId = 1778515978 connection:0xc41440
Sep 05 06:42:49 corosync [LCK ] [INFO]: lck_print_commit_takeover_list [END]
Sep 05 06:42:49 corosync [LCK ] [INFO]: lck_print_commit_list [ENTER]
Sep 05 06:42:49 corosync [LCK ] [INFO]: lck_print_commit_list [END]
Sep 05 06:42:49 corosync [LCK ] lck regist consumer:type:4,master_nodeid:1778515978,num:1
Sep 05 06:42:49 corosync [LCK ] lck regist consumer:nodeid:1778515978,info:(null)
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_process [LEAVE]
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_activate [ENTER]
Sep 05 06:42:49 corosync [LCK ] lck_sync_reset enter
Sep 05 06:42:49 corosync [LCK ] lck_sync_reset leave
Sep 05 06:42:49 corosync [CRM ] [DEBUG]: crm_sync_reset [LEAVE]
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_reset [ENTER]
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_activate [LEAVE]
Sep 05 06:42:49 corosync [CLM ] [DEBUG]: clm_sync_activate [LEAVE]
Sep 05 06:42:49 corosync [MAIN ] Completed service synchronization, ready to provide service.
Sep 05 06:42:49 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:49 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:49 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:49 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:54 corosync [CLM ] port scan read error ip:::ffff:10.0.2.107 port:5258 time:5000(ms) cfg_connect_timeout:5000(ms) error:11 Resource temporarily unavailable
Sep 05 06:42:54 corosync [CLM ] port scan read error ip:::ffff:10.0.2.107 port:5258 time:5000(ms) cfg_connect_timeout:5000(ms) error:11 Resource temporarily unavailable
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckRecoverRegist [ENTER]
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckRecoverRegist [LEAVE]
Sep 05 06:42:54 corosync [LCK ] EXEC request: saLckRecoverRegist 1795293194
Sep 05 06:42:54 corosync [LCK ] EXEC notification: recover_regist_changed
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet [ENTER]
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckRecoverMasterNodeGet masternodeid = 1778515978[LEAVE]
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckCommitRegistTakeover [ENTER]
Sep 05 06:42:54 corosync [LCK ] LIB request: gcLckCommitRegistTakeover [LEAVE]
Sep 05 06:42:54 corosync [LCK ] EXEC request: saLckCommitRegistTakeover 1795293194
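The end-of-sync marker in the log above can also be polled for from a script instead of re-reading tail output by hand. A minimal sketch, assuming the /var/log/corosync.log path shown earlier; the helper names and the 10-minute budget are illustrative:

```shell
#!/bin/sh
# Wait for corosync to log the end-of-sync marker, after which gcadmin
# should succeed again. LOGFILE can be overridden from the environment.
LOGFILE="${LOGFILE:-/var/log/corosync.log}"
DONE_MARK='all node finish new cluster persist'

# True once corosync has written the end-of-sync line to the log.
sync_finished() {
    grep -q "$DONE_MARK" "$LOGFILE" 2>/dev/null
}

# Poll every 5 s; give up after ~10 minutes (REDOLOG replay is usually
# seconds to minutes).
wait_for_sync() {
    n=0
    while [ "$n" -lt 120 ]; do
        sync_finished && return 0
        sleep 5
        n=$((n + 1))
    done
    return 1
}

# Typical use:
#   wait_for_sync && gcadmin
```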
Recovery
[root@localhost ~]# gcadmin
CLUSTER STATE: ACTIVE
CLUSTER MODE: NORMAL
=================================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=================================================================
| NodeName | IpAddress |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 | 10.0.2.107 | OPEN | OPEN | 0 |
-----------------------------------------------------------------
| coordinator2 | 10.0.2.106 | OPEN | OPEN | 0 |
-----------------------------------------------------------------
=============================================================
| GBASE DATA CLUSTER INFORMATION |
=============================================================
|NodeName | IpAddress |gnode |syncserver |DataState |
-------------------------------------------------------------
| node1 | 10.0.2.107 | OPEN | OPEN | 0 |
-------------------------------------------------------------
| node2 | 10.0.2.106 | OPEN | OPEN | 0 |
-------------------------------------------------------------
Note
In the field we have also seen a server hosting corosync run out of disk space, which made corosync misbehave: the logs on the other nodes showed that node joining and leaving frequently, and the whole cluster's service became abnormal.
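A quick disk-usage check on each corosync node can catch this condition early. A sketch; `list_full_mounts` and the 90% threshold are assumptions, not GBase tooling:

```shell
#!/bin/sh
# List mount points at or above a usage threshold. Reads `df -P`-style
# output on stdin; the threshold defaults to 90 (percent).
list_full_mounts() {
    thresh="${1:-90}"
    awk -v t="$thresh" 'NR > 1 {
        use = $5               # e.g. "95%"
        sub(/%/, "", use)      # strip the percent sign for comparison
        if (use + 0 >= t) print $6, $5
    }'
}

# Typical use on a live node:
#   df -P | list_full_mounts 90
```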