南大通用 GBase 8a: Several Methods for Removing Duplicate Data

Data can become duplicated for various reasons: an insert was accidentally run twice, or a batch load was interrupted and, on resume, the same day's batch was loaded again. This article introduces several methods for removing duplicate data in the GBase 8a database.

The general-purpose methods are given first, followed by an additional approach specific to replicated tables.

When a unique ID exists

If the table has a column whose values are unique and not themselves duplicated (for example an auto-increment ID), the extra rows can be deleted directly. Suppose table t has a unique column id while columns a, b, c are duplicated:

delete from t where id not in (select max(id) from t group by a, b, c);

All of the methods below handle the case where no unique ID exists.
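As a concrete sketch of the unique-ID case, consider a hypothetical table orders with an auto-increment id and business columns a, b, c that were loaded twice (the table and column names here are illustrative only):

```sql
-- Keep the row with the largest id in each (a, b, c) group
-- and delete every other copy.
delete from orders
where id not in (
    select max(id) from orders group by a, b, c
);
```

Note that some MySQL-family databases reject a delete whose subquery reads the target table itself; if GBase 8a does, materialize the subquery result into a temporary table first and reference that instead.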

Test data

The following constructs a few rows of test data containing duplicates. Here we assume id should uniquely identify a row; it has simply been inserted twice.

gbase> create table tc1 (id int, name varchar(100));
Query OK, 0 rows affected (Elapsed: 00:00:00.60)

gbase> insert into tc1 values (1,'111');
Query OK, 1 row affected (Elapsed: 00:00:00.40)

gbase> insert into tc1 values (1,'111');
Query OK, 1 row affected (Elapsed: 00:00:00.25)

gbase> insert into tc1 values (2,'222');
Query OK, 1 row affected (Elapsed: 00:00:00.40)

gbase> insert into tc1 values (3,'333');
Query OK, 1 row affected (Elapsed: 00:00:00.22)

gbase>
gbase> select * from tc1;
+------+------+
| id   | name |
+------+------+
|    1 | 111  |
|    1 | 111  |
|    2 | 222  |
|    3 | 333  |
+------+------+
4 rows in set (Elapsed: 00:00:00.00)

Data dump method

This method has the simplest logic:

1. Create a new table.
2. Deduplicate into it with insert into new_table select distinct * from old_table (or merge steps 1 and 2 into a single create table ... as select distinct * from old_table).
3. Drop the old table.
4. Rename the new table to the old table's name.

This method is suitable for:
1. Very small tables, e.g. configuration or dimension tables of a few hundred to a few tens of thousands of rows;
2. Heavily duplicated data, e.g. more than 30% duplicate rows.

With little data, the whole operation finishes within seconds. The table's write activity must also be under control so that no data is lost: if new rows are inserted into the original table after the distinct copy is taken, swapping the tables will drop them.

If writes to the table cannot be controlled, do not use this method.

gbase> create table tc2 as select distinct * from tc1;
Query OK, 3 rows affected (Elapsed: 00:00:00.80)

gbase> select * from tc2;
+------+------+
| id   | name |
+------+------+
|    1 | 111  |
|    2 | 222  |
|    3 | 333  |
+------+------+
3 rows in set (Elapsed: 00:00:00.00)
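Steps 3 and 4 then swap the deduplicated copy into place. A minimal sketch, assuming GBase 8a accepts the MySQL-style rename table statement (verify on your version):

```sql
-- Step 3: drop the old table (confirm no new rows arrived after the copy)
drop table tc1;
-- Step 4: give the deduplicated copy the original name
rename table tc2 to tc1;
```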

Temporary table method

This method handles every scenario, but when the vast majority of rows are duplicates it performs worse than the rebuild approach of method 1.

The procedure is:

1. Insert one copy of each duplicated row into a temporary table.
2. Delete the duplicated rows from the original table.
3. Insert the rows from the temporary table back into the original table.

gbase> create table tc3 as select * from tc1 group by id,name having count(*) >1;
Query OK, 1 row affected (Elapsed: 00:00:00.65)

gbase> select * from tc3;
+------+------+
| id   | name |
+------+------+
|    1 | 111  |
+------+------+
1 row in set (Elapsed: 00:00:00.00)

gbase> delete tc1 where id in (select id from tc3) ;
Query OK, 2 rows affected (Elapsed: 00:00:00.57)

gbase> select * from tc1;
+------+------+
| id   | name |
+------+------+
|    2 | 222  |
|    3 | 333  |
+------+------+
2 rows in set (Elapsed: 00:00:00.00)

gbase> insert into tc1 select * from tc3;
Query OK, 1 row affected (Elapsed: 00:00:00.49)
Records: 1  Duplicates: 0  Warnings: 0

gbase> select * from tc1;
+------+------+
| id   | name |
+------+------+
|    1 | 111  |
|    2 | 222  |
|    3 | 333  |
+------+------+
3 rows in set (Elapsed: 00:00:00.01)
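The three steps above can be written as a reusable sketch for a generic table t with columns c1, c2 (table and column names are placeholders; whether GBase 8a accepts the row-constructor form (c1, c2) in (...) should be verified — the demo above keys the delete on the single column id instead):

```sql
-- 1. Save one copy of each duplicated row in a temporary table
create table t_dup as
    select c1, c2 from t group by c1, c2 having count(*) > 1;
-- 2. Remove all copies of those rows from the original table
delete from t where (c1, c2) in (select c1, c2 from t_dup);
-- 3. Put the single saved copy back, then clean up
insert into t select * from t_dup;
drop table t_dup;
```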

Special method for replicated tables

In a replicated table the data on every node is exactly identical, including its order, so rowid can be used to delete the extra copies.

Replicated table test data

gbase> create table tcc1 (id int, name varchar(100))replicated;
Query OK, 0 rows affected (Elapsed: 00:00:00.48)

gbase> insert into tcc1 values(1,'1111');
Query OK, 1 row affected (Elapsed: 00:00:00.68)

gbase> insert into tcc1 values(1,'1111');
Query OK, 1 row affected (Elapsed: 00:00:00.36)

gbase> insert into tcc1 values(2,'2222');
Query OK, 1 row affected (Elapsed: 00:00:00.33)

gbase> insert into tcc1 values(3,'3333');
Query OK, 1 row affected (Elapsed: 00:00:00.48)

gbase> select * from tcc1;
+------+------+
| id   | name |
+------+------+
|    1 | 1111 |
|    1 | 1111 |
|    2 | 2222 |
|    3 | 3333 |
+------+------+
4 rows in set (Elapsed: 00:00:00.00)

Deduplicating the replicated table

The following query finds, for each duplicated id, the smallest rowid — the copy to keep.

gbase> select min(rowid),id from tcc1 group by id having count(*)>1;
+------------+------+
| min(rowid) | id   |
+------------+------+
|          0 |    1 |
+------------+------+
1 row in set (Elapsed: 00:00:00.02)

Deduplicate with a join (multi-table) delete; see the reference section at the end for details of this syntax.

gbase> delete tcc1 from tcc1,(select min(rowid)minrowid,id from tcc1 group by id having count(*)>1)tmp where tmp.id=tcc1.id and tcc1.rowid>minrowid;
Query OK, 1 row affected (Elapsed: 00:00:00.49)
gbase> select * from tcc1;
+------+------+
| id   | name |
+------+------+
|    1 | 1111 |
|    2 | 2222 |
|    3 | 3333 |
+------+------+
3 rows in set (Elapsed: 00:00:00.01)

Deduplicating distributed tables

A hash-distributed table can be deduplicated with the same approach as a replicated table. The key point is that all rows with the same value in the hash column land in the same shard.

Randomly distributed table

As shown below, rows with the same id can carry the same rowid (they sit on different shards), so the rowid-based delete cannot be used.

gbase> create table tccc1(id int, name varchar(100));
Query OK, 0 rows affected (Elapsed: 00:00:00.48)

gbase>
gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.201/home/gbase/tc.txt' into table tccc1 fields terminated by ',';
Query OK, 4 rows affected (Elapsed: 00:00:01.03)
Task 3080235 finished, Loaded 4 records, Skipped 0 records

gbase> select rowid, t.* from tccc1 t;
+-------+------+------+
| rowid | id   | name |
+-------+------+------+
|     0 |    1 | 1111 |
|     1 |    2 | 2222 |
|     0 |    1 | 1111 |
|     1 |    3 | 3333 |
+-------+------+------+
4 rows in set (Elapsed: 00:00:00.02)

gbase> select rowid, t.* from tccc1 t order by id;
+-------+------+------+
| rowid | id   | name |
+-------+------+------+
|     0 |    1 | 1111 |
|     0 |    1 | 1111 |
|     1 |    2 | 2222 |
|     1 |    3 | 3333 |
+-------+------+------+
4 rows in set (Elapsed: 00:00:00.08)

Hash-distributed table

Here rows with the same id have different rowid values; there are no repeats, so the rowid-based delete works.

gbase> create table tccc2(id int, name varchar(100))distributed by ('id');
Query OK, 0 rows affected (Elapsed: 00:00:00.40)

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.201/home/gbase/tc.txt' into table tccc2 fields terminated by ',';
Query OK, 4 rows affected (Elapsed: 00:00:00.90)
Task 3080237 finished, Loaded 4 records, Skipped 0 records

gbase> select rowid, t.* from tccc2 t order by id;
+-------+------+------+
| rowid | id   | name |
+-------+------+------+
|     1 |    1 | 1111 |
|     0 |    1 | 1111 |
|     2 |    2 | 2222 |
|     3 |    3 | 3333 |
+-------+------+------+
4 rows in set (Elapsed: 00:00:00.05)

Deduplicating

gbase> delete tccc2 from tccc2,(select min(rowid)minrowid,id from tccc2 group by id having count(*)>1)tmp where tmp.id=tccc2.id and tccc2.rowid>minrowid;
Query OK, 1 row affected (Elapsed: 00:00:00.30)

gbase> select rowid, t.* from tccc2 t order by id;
+-------+------+------+
| rowid | id   | name |
+-------+------+------+
|     0 |    1 | 1111 |
|     2 |    2 | 2222 |
|     3 |    3 | 3333 |
+-------+------+------+
3 rows in set (Elapsed: 00:00:00.07)

References