南大通用GBase 8a功能介绍,加载json格式的数据文件

本文介绍GBase 8a里加载json格式的数据文件的样例和注意事项。

该功能需要新的版本,本文试用的是2023年4月份的9.5.3.28。老版本可能不支持。

json加载参数

使用data_format 6来指示加载,数据格式为json。

单行,单个数据加载

数据就一行,只包含单个数据。

gbase> system cat /home/gbase/json1.txt
{"id":"1","name":"First"}
gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json1.txt' into table json data_format 6;
Query OK, 1 row affected (Elapsed: 00:00:00.77)
Task 11 finished, Loaded 1 records, Skipped 0 records

gbase> select * from json;
+------+-------+
| id   | name  |
+------+-------+
|    1 | First |
+------+-------+
1 row in set (Elapsed: 00:00:00.01)

每行单个数据加载

每行一个数据,有多行。

gbase> system cat /home/gbase/json2.txt
{"id":"1","name":"First"}
{"id":"2","name":"second no comer between line"}
gbase> truncate table json;
Query OK, 5 rows affected (Elapsed: 00:00:00.13)

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json2.txt' into table json data_format 6;
Query OK, 2 rows affected (Elapsed: 00:00:00.73)
Task 28 finished, Loaded 2 records, Skipped 0 records

gbase> select * from json;
+------+------------------------------+
| id   | name                         |
+------+------------------------------+
|    2 | second no comer between line |
|    1 | First                        |
+------+------------------------------+
2 rows in set (Elapsed: 00:00:00.00)

多行之间,如果末尾有逗号,不影响结果,有和没有都可以。解析的时候,是以行为基本单位。

{"id":"1","name":"First"},
{"id":"2","name":"second no comer between line"}
gbase> truncate table json;
Query OK, 2 rows affected (Elapsed: 00:00:00.12)

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json2_2.txt' into table json data_format 6;
Query OK, 2 rows affected (Elapsed: 00:00:00.68)
Task 30 finished, Loaded 2 records, Skipped 1 records

gbase> select * from json;
+------+------------------------------+
| id   | name                         |
+------+------------------------------+
|    1 | First                        |
|    2 | second no comer between line |
+------+------------------------------+
2 rows in set (Elapsed: 00:00:00.01)

一行里有多个数据

多个数据紧凑在一起,不要有分隔符

gbase> system cat /home/gbase/json2_3.txt
{"id":"1","name":"First"}{"id":"2","name":"second no comer between line"}
gbase> truncate table json;
Query OK, 4 rows affected (Elapsed: 00:00:00.12)

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json2_3.txt' into table json data_format 6;
Query OK, 2 rows affected (Elapsed: 00:00:00.83)
Task 33 finished, Loaded 2 records, Skipped 0 records

gbase> select * from json;
+------+------------------------------+
| id   | name                         |
+------+------------------------------+
|    1 | First                        |
|    2 | second no comer between line |
+------+------------------------------+
2 rows in set (Elapsed: 00:00:00.00)

如果中间包含了分隔符,比如逗号,只有第一个数据入库了,其余的报错。

gbase> system cat /home/gbase/json2_4.txt
{"id":"1","name":"First"},{"id":"2","name":"second no comer between line"}
gbase> truncate table json;
Query OK, 2 rows affected (Elapsed: 00:00:00.13)

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json2_4.txt' into table json data_format 6;
Query OK, 1 row affected (Elapsed: 00:00:00.59)
Task 35 finished, Loaded 1 records, Skipped 1 records

gbase> select * from json;
+------+-------+
| id   | name  |
+------+-------+
|    1 | First |
+------+-------+
1 row in set (Elapsed: 00:00:00.00)

用json数组,用逗号分割

前后用中括号,数据之间用逗号分割。

[{"id":"1","name":"First"},{"id":"2","name":"second no comer between line"}]
gbase> truncate table json;
Query OK, 1 row affected (Elapsed: 00:00:00.12)

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json2_5.txt' into table json data_format 6;
Query OK, 2 rows affected (Elapsed: 00:00:02.53)
Task 37 finished, Loaded 2 records, Skipped 0 records

gbase> select * from json;
+------+------------------------------+
| id   | name                         |
+------+------------------------------+
|    1 | First                        |
|    2 | second no comer between line |
+------+------------------------------+
2 rows in set (Elapsed: 00:00:00.01)

如果用了数组,但中间不包含逗号,结果错误。

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json2_5.txt' into table json data_format 6;
Query OK, 0 rows affected (Elapsed: 00:00:00.67)
Task 38 finished, Loaded 0 records, Skipped 1 records

gbase> system cat /home/gbase/json2_5.txt
[{"id":"1","name":"First"}{"id":"2","name":"second no comer between line"}]
gbase>

多行,每行多个数据

每一行,依然按照前面的单行处理逻辑。单行里如果包含多个数据,可以紧凑的,也可用数组格式。每一行的格式可以不同。

gbase> system cat /home/gbase/json2_6.txt
[{"id":"1","name":"First"},{"id":"2","name":"second no comer between line"}]
{"id":"3","name":"Third"}
gbase> truncate table json;
Query OK, 2 rows affected (Elapsed: 00:00:00.11)

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.103/home/gbase/json2_6.txt' into table json data_format 6;
Query OK, 3 rows affected (Elapsed: 00:00:00.71)
Task 40 finished, Loaded 3 records, Skipped 0 records

gbase> select * from json;
+------+------------------------------+
| id   | name                         |
+------+------------------------------+
|    1 | First                        |
|    3 | Third                        |
|    2 | second no comer between line |
+------+------------------------------+
3 rows in set (Elapsed: 00:00:00.01)

注意事项

嵌套

不支持json的值又嵌套了另一个json格式。比如 {"id":{"id1":"1","id2":"2"}}

数据切块

一个大文件,可以切块,由不同的节点处理,来提高处理速度。但如果有如下情况,不支持切块。此时整个数据文件视同一个json对象整体处理,不可拆分。同时,可以突破单行数据 gbase_loader_max_line_length的参数上限。 但要求所有的数据格式完全正确,不允许出现格式错误。

数据连续存放

也就是不包含分隔符的,且指定line terminated by '' 为空。

数据值中包含了行分隔符的

指定了 having lines separator参数。

常量固定值true,false,null的要求

按照json规定,需要写成false,true和null, 必须是小写,且不能用双引号包围。