GBase 8a的客户一个业务执行完成后没有释放锁,且没有关闭连接,导致后续该库的其它DDL业务卡住,本文介绍长时间处于checking permissions阶段的SQL的排查和处理方法。
目录导航
现象
如下只是一个示例,比如一个增加列的DDL,一直无法完成,处于checking permissions阶段
gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| 1 | event_scheduler | localhost | NULL | Daemon | 89609 | Waiting for event lock | NULL |
| 7 | gbase | 10.0.2.115:49156 | NULL | Sleep | 3595 | | NULL |
| 11 | gbase | 10.0.2.115:50038 | NULL | Sleep | 50308 | | NULL |
| 269 | root | localhost | testdb | Query | 16 | checking permissions | alter table t1 add column name varchar(100) |
| 270 | root | 10.0.2.101:44906 | testdb | Sleep | 18556 | | NULL |
| 342 | root | localhost | testdb | Sleep | 41 | | NULL |
| 343 | root | localhost | NULL | Query | 0 | NULL | show processlist |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
排查方法
记录下该SQL的节点IP和对应的SQL ID
IP就是执行show processlist的当前节点对应服务的IP地址(例子中是10.0.2.101)。 SQL ID就是show processlist的第一列,如上例子中是269
gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| 1 | event_scheduler | localhost | NULL | Daemon | 89609 | Waiting for event lock | NULL |
| 7 | gbase | 10.0.2.115:49156 | NULL | Sleep | 3595 | | NULL |
| 11 | gbase | 10.0.2.115:50038 | NULL | Sleep | 50308 | | NULL |
| 269 | root | localhost | testdb | Query | 16 | checking permissions | alter table t1 add column name varchar(100) |
| 270 | root | 10.0.2.101:44906 | testdb | Sleep | 18556 | | NULL |
| 342 | root | localhost | testdb | Sleep | 41 | | NULL |
| 343 | root | localhost | NULL | Query | 0 | NULL | show processlist |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
查看该SQL锁集群锁情况
集群锁是用过gcadmin showlock 拿到,详情参考 GBase8a MPP Cluster查看集群锁 showlock 通过grep过滤SQL ID号。
[gbase@gbase_rh7_001 ~]$ gcadmin showlock | grep 269
| vc00001.hashmap_lock |10.0.2.101|269(LWP:21694)|20211222145941| TRUE | S |
| vc00001.testdb.db_lock |10.0.2.101|269(LWP:21694)|20211222145941| TRUE | S |
| vc00001.testdb.t1.meta_lock |10.0.2.101|269(LWP:21694)|20211222145941|FALSE | S |
|vc00001.testdb.table_space_lock|10.0.2.101|269(LWP:21694)|20211222145941| TRUE | S |
[gbase@gbase_rh7_001 ~]$
如上可以看到IP是10.0.2.101,ID是269的一共有4个锁,其中有1个所锁是FALSE没有拿到的状态。其余的是TRUE. 锁的名字是 vc00001.testdb.t1.meta_lock
查看锁的持有者
根据所得名字,从集群锁中过滤得到如下信息
[gbase@gbase_rh7_001 ~]$ gcadmin showlock | grep vc00001.testdb.t1.meta_lock
| vc00001.testdb.t1.meta_lock |10.0.2.101|342(LWP:15618)|20211222145916| TRUE | E |
| vc00001.testdb.t1.meta_lock |10.0.2.101|269(LWP:21694)|20211222145941|FALSE | S |
[gbase@gbase_rh7_001 ~]$
可以看到10.0.2.101的ID=342的连接,拿到了锁。
查看锁持有者的任务
gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
| 1 | event_scheduler | localhost | NULL | Daemon | 90181 | Waiting for event lock | NULL |
| 7 | gbase | 10.0.2.115:49156 | NULL | Sleep | 567 | | NULL |
| 11 | gbase | 10.0.2.115:50038 | NULL | Sleep | 50880 | | NULL |
| 269 | root | localhost | testdb | Query | 588 | checking permissions | alter table t1 add column name varchar(100) |
| 270 | root | 10.0.2.101:44906 | testdb | Sleep | 19128 | | NULL |
| 342 | root | localhost | testdb | Sleep | 613 | | NULL |
| 347 | root | localhost | NULL | Query | 0 | NULL | show processlist |
+-----+-----------------+------------------+--------+---------+-------+------------------------+---------------------------------------------+
7 rows in set (Elapsed: 00:00:00.00)
可以看到342的状态为Sleep, 也就是空闲,其发起放为localhost, 如果不是本机,则会有IP信息,比如270的就是10.0.2.101端口号44906的。发起数据库用户为root,默认的数据库是testdb.
此时,根据发起者信息,可以继续排查该业务连接是否正常。 如果该业务就是需要长时间持有锁,那就只能等待。 如果业务异常,比如机器意外宕机, jvm卡住等, 可以考虑kill掉这个不正常Sleep状态的SQL 进程来释放锁。
强行kill掉来释放锁
该操作,只有在确认持有锁的进程确实故障状态,或者时间明显超过允许范围了,才建议kill掉。
gbase> kill 342;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)
gbase> show processlist;
+-----+-----------------+------------------+--------+---------+-------+------------------------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+-----------------+------------------+--------+---------+-------+------------------------+------------------+
| 1 | event_scheduler | localhost | NULL | Daemon | 90475 | Waiting for event lock | NULL |
| 7 | gbase | 10.0.2.115:49156 | NULL | Sleep | 861 | | NULL |
| 11 | gbase | 10.0.2.115:50038 | NULL | Sleep | 51174 | | NULL |
| 269 | root | localhost | testdb | Sleep | 4 | | NULL |
| 270 | root | 10.0.2.101:44906 | testdb | Sleep | 4 | | NULL |
| 347 | root | localhost | NULL | Query | 0 | NULL | show processlist |
+-----+-----------------+------------------+--------+---------+-------+------------------------+------------------+
6 rows in set (Elapsed: 00:00:00.00)
gbase> desc testdb.t1;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| name | varchar(100) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
2 rows in set (Elapsed: 00:00:00.00)
杀掉锁的进程后,可以看到加列的SQL正常运行完成了,表也确实多了一个name列。
总结
锁持久不释放,已知的一些原因包括:
- jvm异常宕机,数据库没有收到信号而是处于等待timeout状态。
- 客户业务运行缓慢,确实要占用很久时间
- 客户开启了事务,但忘记了commit或rollback, 且没有主动关闭连接。