南大通用GBase 8a加载时数据字符集和数据库不匹配时的处理方法

当原始数据的字符集和GBase 8a数据库表的字符集不同时,可以在加载LOAD时指定原始数据的字符集类型,由加载程序自动进行编码转换。

当然,转码必然有些许的CPU消耗,如果数据源能匹配,性能更好。

参考

GBase 8a 集群加载数据LOAD的方法

环境

表格

gbase> desc t2;
+----------+--------------+------+-----+-------------------+-------+
| Field    | Type         | Null | Key | Default           | Extra |
+----------+--------------+------+-----+-------------------+-------+
| id       | int(11)      | YES  |     | NULL              |       |
| name     | varchar(100) | YES  |     | noname            |       |
| birthday | date         | YES  |     | NULL              |       |
| commons  | varchar(100) | YES  |     | NULL              |       |
| loadtime | datetime     | YES  |     | CURRENT_TIMESTAMP |       |
+----------+--------------+------+-----+-------------------+-------+
5 rows in set (Elapsed: 00:00:00.01)

数据

[root@gbase_rh7_001 gbase]# cat t2_3.txt
502,张三李四王五等,2021-07-09,大中国,2021-07-08 13:14:15
502,张三,2021-07-09,大,2021-07-08 13:14:15
[root@gbase_rh7_001 gbase]# iconv -f utf8 -t gbk t2_3.txt -o t2_4.txt
[root@gbase_rh7_001 gbase]# cat t2_4.txt
502,▒▒▒▒▒▒▒▒▒▒▒▒▒,2021-07-09,▒▒▒й▒,2021-07-08 13:14:15
502,▒▒▒▒,2021-07-09,▒▒,2021-07-08 13:14:15

[root@gbase_rh7_001 gbase]# file t2_3.txt
t2_3.txt: UTF-8 Unicode text
[root@gbase_rh7_001 gbase]# file t2_4.txt
t2_4.txt: ISO-8859 text
[root@gbase_rh7_001 gbase]#

样例

普通加载出现乱码

由于字符集不匹配,入库的数据是乱码。

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.101/home/gbase/t2_4.txt' into table t2 fields terminated by ',';
Query OK, 2 rows affected (Elapsed: 00:00:01.32)
Task 9280 finished, Loaded 2 records, Skipped 0 records

gbase> select * from t2;
+------+----------------+------------+---------+---------------------+
| id   | name           | birthday   | commons | loadtime            |
+------+----------------+------------+---------+---------------------+
|  502 | ▒▒▒▒               | 2021-07-09 | ▒▒        | 2021-07-08 13:14:15 |
|  502 | ▒▒▒▒▒▒▒▒▒▒▒▒▒               | 2021-07-09 | ▒▒▒й▒       | 2021-07-08 13:14:15 |
+------+----------------+------------+---------+---------------------+
2 rows in set (Elapsed: 00:00:00.01)

指定原始数据字符集加载

通过CHARACTER SET GBK参数指定原始数据的字符集是GBK, 加载结果正确。

当前GBase 8a支持的字符集为GB系列(GB2312,GBK,GB18030)和UTF8系列(utf8,utf8mb4)。

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.101/home/gbase/t2_4.txt' into table t2 CHARACTER SET GBK fields terminated by ',';
Query OK, 2 rows affected (Elapsed: 00:00:01.23)
Task 9281 finished, Loaded 2 records, Skipped 0 records

gbase> select * from t2;
+------+-----------------------+------------+-----------+---------------------+
| id   | name                  | birthday   | commons   | loadtime            |
+------+-----------------------+------------+-----------+---------------------+
|  502 | ▒▒▒▒                      | 2021-07-09 | ▒▒          | 2021-07-08 13:14:15 |
|  502 | 张三                  | 2021-07-09 | 大        | 2021-07-08 13:14:15 |
|  502 | ▒▒▒▒▒▒▒▒▒▒▒▒▒                      | 2021-07-09 | ▒▒▒й▒         | 2021-07-08 13:14:15 |
|  502 | 张三李四王五等        | 2021-07-09 | 大中国    | 2021-07-08 13:14:15 |
+------+-----------------------+------------+-----------+---------------------+
4 rows in set (Elapsed: 00:00:00.01)