Data can end up duplicated for various reasons: a statement was accidentally executed twice and inserted two identical rows, or a batch load was interrupted and, on resumption, that day's batch was loaded again. This article introduces several methods for removing duplicate data in the GBase 8a database.
The general-purpose methods are given first, followed by an additional approach that only works for replicated tables.
When the data has a unique ID
If the data has a column whose values are unique and never duplicated (for example, an auto-increment ID), the duplicates can be deleted directly. For example, with a unique id column and rows duplicated on columns a, b, c:
delete from t where id not in (select max(id) from t group by a, b, c);
All the methods below deal with the case where no unique ID exists.
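As a self-contained sketch of the unique-ID approach (the table and column names here are illustrative, not from the article; GBase 8a accepts a subquery on the delete target, though some MySQL-family engines require wrapping it in a derived table):

```sql
-- Hypothetical table: id is unique, rows are duplicates when (a, b, c) repeats.
create table t (id int, a int, b int, c int);
insert into t values (1, 10, 20, 30);
insert into t values (2, 10, 20, 30);  -- duplicate of (a, b, c) = (10, 20, 30)
insert into t values (3, 11, 21, 31);

-- Keep only the row with the largest id within each (a, b, c) group.
delete from t where id not in (select max(id) from t group by a, b, c);
```

After the delete, one row per distinct (a, b, c) combination remains, each with the largest id of its group.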
Test data
The following builds a few rows that include a duplicate. Here we assume id is the field that should uniquely identify a row; it has simply been entered twice.
gbase> create table tc1 (id int, name varchar(100));
Query OK, 0 rows affected (Elapsed: 00:00:00.60)
gbase> insert into tc1 values (1,'111');
Query OK, 1 row affected (Elapsed: 00:00:00.40)
gbase> insert into tc1 values (1,'111');
Query OK, 1 row affected (Elapsed: 00:00:00.25)
gbase> insert into tc1 values (2,'222');
Query OK, 1 row affected (Elapsed: 00:00:00.40)
gbase> insert into tc1 values (3,'333');
Query OK, 1 row affected (Elapsed: 00:00:00.22)
gbase>
gbase> select * from tc1;
+------+------+
| id | name |
+------+------+
| 1 | 111 |
| 1 | 111 |
| 2 | 222 |
| 3 | 333 |
+------+------+
4 rows in set (Elapsed: 00:00:00.00)
Method 1: dump and rebuild
This method has the simplest logic. The steps are:
1. Create a new table.
2. Deduplicate with create table new_table as select distinct * from old_table (or insert into new_table select distinct * from old_table).
3. Drop the old table.
4. Rename the new table to the old table's name.
This method suits:
1. Very small tables, e.g., configuration or dimension tables with a few hundred to a few tens of thousands of rows;
2. Heavily duplicated data, e.g., more than 30% duplicate rows.
With little data, the whole process finishes within seconds. It also requires that writes to the table are under control, so that no data is lost (rows inserted into the original table after the select would disappear when the tables are swapped).
If writes to the table cannot be controlled, do not use this method.
gbase> create table tc2 as select distinct * from tc1;
Query OK, 3 rows affected (Elapsed: 00:00:00.80)
gbase> select * from tc2;
+------+------+
| id | name |
+------+------+
| 1 | 111 |
| 2 | 222 |
| 3 | 333 |
+------+------+
3 rows in set (Elapsed: 00:00:00.00)
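The transcript above covers step 2. Steps 3 and 4 would then be (a sketch, assuming no new writes hit tc1 in the meantime, and using the MySQL-style rename syntax):

```sql
-- Drop the old table, then rename the deduplicated copy into its place.
-- Run only when you are certain no new rows are arriving in tc1.
drop table tc1;
rename table tc2 to tc1;
```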
Method 2: staging table
This method handles every scenario, but when the vast majority of rows are duplicates, it performs worse than the rebuild approach of method 1.
The steps are:
1. Insert the duplicated rows into a staging table.
2. Delete the duplicated rows from the original table.
3. Insert the staged rows back into the original table.
gbase> create table tc3 as select * from tc1 group by id,name having count(*) >1;
Query OK, 1 row affected (Elapsed: 00:00:00.65)
gbase> select * from tc3;
+------+------+
| id | name |
+------+------+
| 1 | 111 |
+------+------+
1 row in set (Elapsed: 00:00:00.00)
gbase> delete tc1 where id in (select id from tc3) ;
Query OK, 2 rows affected (Elapsed: 00:00:00.57)
gbase> select * from tc1;
+------+------+
| id | name |
+------+------+
| 2 | 222 |
| 3 | 333 |
+------+------+
2 rows in set (Elapsed: 00:00:00.00)
gbase> insert into tc1 select * from tc3;
Query OK, 1 row affected (Elapsed: 00:00:00.49)
Records: 1 Duplicates: 0 Warnings: 0
gbase> select * from tc1;
+------+------+
| id | name |
+------+------+
| 1 | 111 |
| 2 | 222 |
| 3 | 333 |
+------+------+
3 rows in set (Elapsed: 00:00:00.01)
A special approach for replicated tables
In a replicated table, every node holds exactly the same data in exactly the same order, so rowid can be used to delete duplicates.
Replicated-table test data
gbase> create table tcc1 (id int, name varchar(100))replicated;
Query OK, 0 rows affected (Elapsed: 00:00:00.48)
gbase> insert into tcc1 values(1,'1111');
Query OK, 1 row affected (Elapsed: 00:00:00.68)
gbase> insert into tcc1 values(1,'1111');
Query OK, 1 row affected (Elapsed: 00:00:00.36)
gbase> insert into tcc1 values(2,'2222');
Query OK, 1 row affected (Elapsed: 00:00:00.33)
gbase> insert into tcc1 values(3,'3333');
Query OK, 1 row affected (Elapsed: 00:00:00.48)
gbase> select * from tcc1;
+------+------+
| id | name |
+------+------+
| 1 | 1111 |
| 1 | 1111 |
| 2 | 2222 |
| 3 | 3333 |
+------+------+
4 rows in set (Elapsed: 00:00:00.00)
Deduplicating the replicated table
The following query finds the rowid of the duplicated rows.
gbase> select min(rowid),id from tcc1 group by id having count(*)>1;
+------------+------+
| min(rowid) | id |
+------------+------+
| 0 | 1 |
+------------+------+
1 row in set (Elapsed: 00:00:00.02)
Deduplicate with a multi-table (join) delete; see the reference section at the end for the details of this feature.
gbase> delete tcc1 from tcc1,(select min(rowid)minrowid,id from tcc1 group by id having count(*)>1)tmp where tmp.id=tcc1.id and tcc1.rowid>minrowid;
Query OK, 1 row affected (Elapsed: 00:00:00.49)
gbase> select * from tcc1;
+------+------+
| id | name |
+------+------+
| 1 | 1111 |
| 2 | 2222 |
| 3 | 3333 |
+------+------+
3 rows in set (Elapsed: 00:00:00.01)
Deduplicating distributed tables
For a hash-distributed table, a scheme similar to the replicated-table one can be used. The key point is that all rows sharing the same value in the hash column land in the same shard.
Randomly distributed table
Notice below that identical id values can carry the same rowid (the rows sit on different nodes), so rowid cannot be used for deletion here.
gbase> create table tccc1(id int, name varchar(100));
Query OK, 0 rows affected (Elapsed: 00:00:00.48)
gbase>
gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.201/home/gbase/tc.txt' into table tccc1 fields terminated by ',';
Query OK, 4 rows affected (Elapsed: 00:00:01.03)
Task 3080235 finished, Loaded 4 records, Skipped 0 records
gbase> select rowid, t.* from tccc1 t;
+-------+------+------+
| rowid | id | name |
+-------+------+------+
| 0 | 1 | 1111 |
| 1 | 2 | 2222 |
| 0 | 1 | 1111 |
| 1 | 3 | 3333 |
+-------+------+------+
4 rows in set (Elapsed: 00:00:00.02)
gbase> select rowid, t.* from tccc1 t order by id;
+-------+------+------+
| rowid | id | name |
+-------+------+------+
| 0 | 1 | 1111 |
| 0 | 1 | 1111 |
| 1 | 2 | 2222 |
| 1 | 3 | 3333 |
+-------+------+------+
4 rows in set (Elapsed: 00:00:00.08)
Hash-distributed table
Notice that rows with the same id now have different rowid values; within a shard, rowids never repeat.
gbase> create table tccc2(id int, name varchar(100))distributed by ('id');
Query OK, 0 rows affected (Elapsed: 00:00:00.40)
gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.201/home/gbase/tc.txt' into table tccc2 fields terminated by ',';
Query OK, 4 rows affected (Elapsed: 00:00:00.90)
Task 3080237 finished, Loaded 4 records, Skipped 0 records
gbase> select rowid, t.* from tccc2 t order by id;
+-------+------+------+
| rowid | id | name |
+-------+------+------+
| 1 | 1 | 1111 |
| 0 | 1 | 1111 |
| 2 | 2 | 2222 |
| 3 | 3 | 3333 |
+-------+------+------+
4 rows in set (Elapsed: 00:00:00.05)
Deduplication
gbase> delete tccc2 from tccc2,(select min(rowid)minrowid,id from tccc2 group by id having count(*)>1)tmp where tmp.id=tccc2.id and tccc2.rowid>minrowid;
Query OK, 1 row affected (Elapsed: 00:00:00.30)
gbase> select rowid, t.* from tccc2 t order by id;
+-------+------+------+
| rowid | id | name |
+-------+------+------+
| 0 | 1 | 1111 |
| 2 | 2 | 2222 |
| 3 | 3 | 3333 |
+-------+------+------+
3 rows in set (Elapsed: 00:00:00.07)