南大通用GBase 8a不释放锁导致的SQL一直checking permissions卡住

GBase 8a的客户一个业务执行完成后没有释放锁,且没有关闭连接,导致后续该库的其它DDL业务卡住,本文介绍长时间处于checking permissions阶段的SQL的排查和处理方法。

现象

如下只是一个示例,比如一个增加列的DDL,一直无法完成,处于checking permissions阶段

gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| Id  | User            | Host             | db     | Command | Time  | State                  | Info                                        |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
|   1 | event_scheduler | localhost        | NULL   | Daemon  | 89609 | Waiting for event lock | NULL                                        |
|   7 | gbase           | 10.0.2.115:49156 | NULL   | Sleep   |  3595 |                        | NULL                                        |
|  11 | gbase           | 10.0.2.115:50038 | NULL   | Sleep   | 50308 |                        | NULL                                        |
| 269 | root            | localhost        | testdb | Query   |    16 | checking permissions   | alter table t1 add column name varchar(100) |
| 270 | root            | 10.0.2.101:44906 | testdb | Sleep   | 18556 |                        | NULL                                        |
| 342 | root            | localhost        | testdb | Sleep   |    41 |                        | NULL                                        |
| 343 | root            | localhost        | NULL   | Query   |     0 | NULL                   | show processlist                            |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+

排查方法

记录下该SQL的节点IP和对应的SQL ID

IP就是执行show processlist的当前节点对应服务的IP地址(例子中是10.0.2.101)。 SQL ID就是show processlist的第一列,如上例子中是269

gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| Id  | User            | Host             | db     | Command | Time  | State                  | Info                                        |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
|   1 | event_scheduler | localhost        | NULL   | Daemon  | 89609 | Waiting for event lock | NULL                                        |
|   7 | gbase           | 10.0.2.115:49156 | NULL   | Sleep   |  3595 |                        | NULL                                        |
|  11 | gbase           | 10.0.2.115:50038 | NULL   | Sleep   | 50308 |                        | NULL                                        |
| 269 | root            | localhost        | testdb | Query   |    16 | checking permissions   | alter table t1 add column name varchar(100) |
| 270 | root            | 10.0.2.101:44906 | testdb | Sleep   | 18556 |                        | NULL                                        |
| 342 | root            | localhost        | testdb | Sleep   |    41 |                        | NULL                                        |
| 343 | root            | localhost        | NULL   | Query   |     0 | NULL                   | show processlist                            |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+

查看该SQL锁集群锁情况

集群锁是用过gcadmin showlock 拿到,详情参考 GBase8a MPP Cluster查看集群锁 showlock 通过grep过滤SQL ID号。

[gbase@gbase_rh7_001 ~]$ gcadmin showlock | grep 269
 |     vc00001.hashmap_lock      |10.0.2.101|269(LWP:21694)|20211222145941| TRUE | S  |
 |    vc00001.testdb.db_lock     |10.0.2.101|269(LWP:21694)|20211222145941| TRUE | S  |
 |  vc00001.testdb.t1.meta_lock  |10.0.2.101|269(LWP:21694)|20211222145941|FALSE | S  |
 |vc00001.testdb.table_space_lock|10.0.2.101|269(LWP:21694)|20211222145941| TRUE | S  |
[gbase@gbase_rh7_001 ~]$

如上可以看到IP是10.0.2.101,ID是269的一共有4个锁,其中有1个所锁是FALSE没有拿到的状态。其余的是TRUE. 锁的名字是 vc00001.testdb.t1.meta_lock

查看锁的持有者

根据所得名字,从集群锁中过滤得到如下信息

[gbase@gbase_rh7_001 ~]$ gcadmin showlock | grep vc00001.testdb.t1.meta_lock
 |  vc00001.testdb.t1.meta_lock  |10.0.2.101|342(LWP:15618)|20211222145916| TRUE | E  |
 |  vc00001.testdb.t1.meta_lock  |10.0.2.101|269(LWP:21694)|20211222145941|FALSE | S  |
[gbase@gbase_rh7_001 ~]$

可以看到10.0.2.101的ID=342的连接,拿到了锁。

查看锁持有者的任务

gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| Id  | User            | Host             | db     | Command | Time  | State                  | Info                                        |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
|   1 | event_scheduler | localhost        | NULL   | Daemon  | 90181 | Waiting for event lock | NULL                                        |
|   7 | gbase           | 10.0.2.115:49156 | NULL   | Sleep   |   567 |                        | NULL                                        |
|  11 | gbase           | 10.0.2.115:50038 | NULL   | Sleep   | 50880 |                        | NULL                                        |
| 269 | root            | localhost        | testdb | Query   |   588 | checking permissions   | alter table t1 add column name varchar(100) |
| 270 | root            | 10.0.2.101:44906 | testdb | Sleep   | 19128 |                        | NULL                                        |
| 342 | root            | localhost        | testdb | Sleep   |   613 |                        | NULL                                        |
| 347 | root            | localhost        | NULL   | Query   |     0 | NULL                   | show processlist                            |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
7 rows in set (Elapsed: 00:00:00.00)

可以看到342的状态为Sleep, 也就是空闲,其发起放为localhost, 如果不是本机,则会有IP信息,比如270的就是10.0.2.101端口号44906的。发起数据库用户为root,默认的数据库是testdb.

此时,根据发起者信息,可以继续排查该业务连接是否正常。 如果该业务就是需要长时间持有锁,那就只能等待。 如果业务异常,比如机器意外宕机, jvm卡住等, 可以考虑kill掉这个不正常Sleep状态的SQL 进程来释放锁。

强行kill掉来释放锁

该操作,只有在确认持有锁的进程确实故障状态,或者时间明显超过允许范围了,才建议kill掉。

gbase> kill 342;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)

gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+------------------+
| Id  | User            | Host             | db     | Command | Time  | State                  | Info             |
+-----+-----------------+------------------+--------+---------+-------+------------------------+------------------+
|   1 | event_scheduler | localhost        | NULL   | Daemon  | 90475 | Waiting for event lock | NULL             |
|   7 | gbase           | 10.0.2.115:49156 | NULL   | Sleep   |   861 |                        | NULL             |
|  11 | gbase           | 10.0.2.115:50038 | NULL   | Sleep   | 51174 |                        | NULL             |
| 269 | root            | localhost        | testdb | Sleep   |     4 |                        | NULL             |
| 270 | root            | 10.0.2.101:44906 | testdb | Sleep   |     4 |                        | NULL             |
| 347 | root            | localhost        | NULL   | Query   |     0 | NULL                   | show processlist |
+-----+-----------------+------------------+--------+---------+-------+------------------------+------------------+
6 rows in set (Elapsed: 00:00:00.00)

gbase> desc testdb.t1;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(11)      | YES  |     | NULL    |       |
| name  | varchar(100) | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
2 rows in set (Elapsed: 00:00:00.00)

杀掉锁的进程后,可以看到加列的SQL正常运行完成了,表也确实多了一个name列。

总结

锁持久不释放,已知的一些原因包括:

  • jvm异常宕机,数据库没有收到信号而是处于等待timeout状态。
  • 客户业务运行缓慢,确实要占用很久时间
  • 客户开启了事务,但忘记了commit或rollback, 且没有主动关闭连接。