南大通用GBase 8a全文索引提高模糊查询性能的使用样例

发表于2021年7月1日2021年7月1日作者 laozizhu

本文介绍GBase 8a MPP集群全文索引的详细样例，用来提高模糊查询的性能。

参考

GBase 8a全文索引功能安装部署方法
 GBase 8a全文索引创建、更新和删除方法
 GBase 8a全文索引常用配置文件和配置参数
 GBase 8a全文索引提高模糊查询性能使用样例
 GBase 8a全文索引多分词器的功能介绍和使用

语法

  CONTAINS (column_name, Query Content[, score_flag])

说明

column_name 列名字
Query Content 查询表达式，由1到多个条件组成。详情看后面样例介绍
score_flag SCORE 标号，可选项，如不填写，则表示不输出评估值, 应与 SCORE 函数一起使用

样例

本文样例，调整了参数，从默认的”自然分词“参数0，改成了字母和数组三元分词的参数3。详细的全文配置，请参考GBase 8a全文索引配置参数详解(编写中)。

区分大小写。

数据样本

gbase> show create table t1;
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                                                                                         |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE "t1" (
  "id" int(11) DEFAULT NULL,
  "name" varchar(100) DEFAULT NULL,
  "dept" int(11) DEFAULT NULL,
  "birth" date DEFAULT NULL,
  "memo" longtext,
  FULLTEXT "idx_memo" ("memo")
) ENGINE=EXPRESS DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (Elapsed: 00:00:00.00)

gbase> select * from t1;
+------+--------+------+------------+-----------------------------------+
| id   | name   | dept | birth      | memo                              |
+------+--------+------+------------+-----------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注              |
|    2 | 李四   |    1 | 1980-08-08 | 李四的个人建立和备注              |
|    3 | GBase  |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU    |    3 | 1895-10-02 | TianJin University                |
|    5 | Test   |    3 | 1895-10-02 | Test12345678abcdefg               |
|    6 | Test2  |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3  |    4 | 2021-07-01 | China TianJin                     |
+------+--------+------+------------+-----------------------------------+
7 rows in set (Elapsed: 00:00:00.01)

gbase> update index idx_memo on t1;
Query OK, 1 row affected (Elapsed: 00:00:01.29)

双引号精确匹配

双引号内的部分，必须精确完全匹配。否则将按照分词结果，做OR匹配了。比如 '天津’匹配的是'天'和'津'两个汉字，如果你要匹配‘天津’则必须用双引号包围起来‘“天津"'。

gbase> select * from t1 where contains(memo,'"TianJin ltd"');
gbase> select * from t1 where contains(memo,'"TianJin China"');
+------+-------+------+------------+----------------------+
| id   | name  | dept | birth      | memo                 |
+------+-------+------+------------+----------------------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |
+------+-------+------+------------+----------------------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select * from t1 where contains(memo,'"TianJin ltd"');
Empty set (Elapsed: 00:00:00.00)

等同于左右百分号的like的查询

汉字，完整单词查询

gbase> select * from t1 where contains(memo,'人');
+------+--------+------+------------+--------------------------------+
| id   | name   | dept | birth      | memo                           |
+------+--------+------+------------+--------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注           |
|    2 | 李四   |    1 | 1980-08-08 | 李四的个人建立和备注           |
+------+--------+------+------------+--------------------------------+
2 rows in set (Elapsed: 00:00:00.02)

gbase> select * from t1 where contains(memo,'个人');
+------+--------+------+------------+--------------------------------+
| id   | name   | dept | birth      | memo                           |
+------+--------+------+------------+--------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注           |
|    2 | 李四   |    1 | 1980-08-08 | 李四的个人建立和备注           |
+------+--------+------+------------+--------------------------------+
2 rows in set (Elapsed: 00:00:00.00)

gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.11)

长度大于等于3的连续数字和字母

gbase> select * from t1 where contains(memo,'gen');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.00)

gbase> select * from t1 where contains(memo,'123');
+------+------+------+------------+---------------------+
| id   | name | dept | birth      | memo                |
+------+------+------+------------+---------------------+
|    5 | Test |    3 | 1895-10-02 | Test12345678abcdefg |
+------+------+------+------------+---------------------+
1 row in set (Elapsed: 00:00:00.00)

长度小于3的连续数字和字母

没有查到，参数默认只匹配3个。从实际业务看，查询2个连续数字或字母，是一个罕见需求。查询1个数字或字母，是一个不正常需求。GBase 8a支持配置分词长度，但会带来更多的数据膨胀。

gbase> select * from t1 where contains(memo,'ge');
Empty set (Elapsed: 00:00:00.01)


gbase> select * from t1 where contains(memo,'12');
Empty set (Elapsed: 00:00:00.01)

区分大小写

gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.11)

gbase> select * from t1 where contains(memo,'tianJin');
Empty set (Elapsed: 00:00:00.01)

开头标记^，等同右百分号的like查询

用^表示以此开头，等同于like 'TianJin%'的右百分号形式。如下只有TianJin开头的才返回。

gbase> select * from t1 where contains(memo,'^TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
+------+-------+------+------------+-----------------------------------+
2 rows in set (Elapsed: 00:00:00.01)

结尾标记$，等同左百分号的like查询

只有以TianJin结尾的才返回。等同于memo like '%TianJin'

gbase> select * from t1 where contains(memo,'TianJin$');
+------+-------+------+------------+---------------+
| id   | name  | dept | birth      | memo          |
+------+-------+------+------------+---------------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |
+------+-------+------+------------+---------------+
1 row in set (Elapsed: 00:00:00.01)

无双引号的空格属于隐式AND

如下查询TianJin ltd, 如果没有双引号包围符，则代表AND, 也就是包含TianJin 和 ltd. 如果由包围符，则数据必须是连续的"TianJin ltd"。

gbase> select * from t1 where contains(memo,'TianJin ltd');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select * from t1 where contains(memo,'"TianJin ltd"');
Empty set (Elapsed: 00:00:00.01)

"与"AND标记&, 多个like and

AND标记用&表示，比如包含Tianjin 和 ltd的。多个条件，每个条件用双引号包围。

gbase> select * from t1 where contains(memo,'"TianJin"&"ltd"');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.01)

“或”OR标记|，多个like OR

用竖线|表示或。比如包含张三或者TianJin的，等同于 memo like '%张三%' or memo like '%TianJin%'

gbase> select * from t1 where contains(memo,'"张三"|"TianJin"');
+------+--------+------+------------+-----------------------------------+
| id   | name   | dept | birth      | memo                              |
+------+--------+------+------------+-----------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注              |
|    3 | GBase  |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU    |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2  |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3  |    4 | 2021-07-01 | China TianJin                     |
+------+--------+------+------------+-----------------------------------+
5 rows in set (Elapsed: 00:00:00.01)

”非“NOT标记-, not like

用减号-，表示排除后面的内容。比如包含”张三“或”TianJin“, 但不包含”人“的。注意默认优先级或 > 非 > 与

gbase> select * from t1 where contains(memo,'"张三"|"TianJin"-"人"');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.01)

使用小括号分组来改变匹配顺序

对照前例，本例查询 like '%张三%' or （like '%Tianjin%' and not like '%人')。

gbase> select * from t1 where contains(memo,'"张三"|("TianJin"-"人")');
+------+--------+------+------------+-----------------------------------+
| id   | name   | dept | birth      | memo                              |
+------+--------+------+------------+-----------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注              |
|    3 | GBase  |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU    |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2  |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3  |    4 | 2021-07-01 | China TianJin                     |
+------+--------+------+------------+-----------------------------------+
5 rows in set (Elapsed: 00:00:00.00)

阀值匹配符‘/’

值一组条件内，最少满足匹配几个。比如条件”张 Tia 人“ 最少包含其中的2个即可，则写成"张 Tia 人"/2。

请一定要注意分词问题，特别是英文和字母。每个汉字，每个默认的三元分词都会命中得分。所以本例采用的是单个汉字和3个字母的样例。



gbase> select t1.*,score(1) from t1 where contains(memo,'"张 Tia 人"/2',1);
+------+--------+------+------------+--------------------------------+----------+
| id   | name   | dept | birth      | memo                           | score(1) |
+------+--------+------+------------+--------------------------------+----------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注           |     1610 |
+------+--------+------+------------+--------------------------------+----------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'"张 Tia 人"/3',1);
Empty set (Elapsed: 00:00:00.01)

如下是多个字母的样例，粗看会觉得结果不正确的。

之所以返回，是因为TianJin按照默认的三元分词规则， Tia ian anJ 。。。。都能匹配的。这就和我们期望的TianJin整体匹配不符和，但既然设置了三元分词，而不是”自然分词“，这个结果也得接受。

gbase> select t1.*,score(1) from t1 where contains(memo,'"张 TianJin 人"/3',1);
+------+-------+------+------------+-----------------------------------+----------+
| id   | name  | dept | birth      | memo                              | score(1) |
+------+-------+------+------------+-----------------------------------+----------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |     5500 |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |     5500 |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |     5500 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |     5500 |
+------+-------+------+------------+-----------------------------------+----------+
4 rows in set (Elapsed: 00:00:00.00)

near函数

本身属于contains的一部分，但内容较为独立，所以单开一个标题。

语法

NEAR ((term1, term2), num[，Order])

请注意near的term部分，外面有一层小括号。

说明

term ：搜索词。多个之间用逗号分割。词之间是AND关系。且每个搜索词必须精确满足。
num : 表示词距数，非 0 整数，词距（包括匹配词），实际词距小于 num 值即为符号要求。
order : 表示词序。Order 为可选项，默认值为 0，表示无词序。
- 为 0 代表无词序
- 为 1 代表有词序。

词距

指二个词之间间隔单词数，也包含被查询的首尾词。例如：“the black cat catch white mouse”这段文字中要查询“cat mouse”，则词距就是 4。对于中文来说，则词距为两个汉字之间间隔的字符数（字符包含汉字、标点、空格、符号等等，回车符略去不算）。

例如“上周，我去海洋局开会…”这段文字要查询“上海”，则词距为 6。分析如下：上(1)周(2)，(3)我(4)去(5)海(6)洋(7)。后面是位置顺序。其中“海”是6， “上”是1，距离是（6-1）+1 = 6；

例如查询“售卖制作假文*#凭、发票”的垃圾短信中“文凭”，则词距为 4。分析如下：售(1)卖(2)制(3)作(4)假(5)文(6)*(7)#(8)凭(9). 用9-6+1=4；

样例

如下以三个字母为例，因为多个字母又涉及到分词问题，看着混乱。对于“China TianJin”，Chi和Tia之间的距离是4(na和空格三个字符+1)。而对“TianJin China” 则是6（nJin和空格+1)。

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),3)',1);
Empty set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),4)',1);
+------+-------+------+------------+---------------+----------+
| id   | name  | dept | birth      | memo          | score(1) |
+------+-------+------+------------+---------------+----------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |     1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),5)',1);
+------+-------+------+------------+---------------+----------+
| id   | name  | dept | birth      | memo          | score(1) |
+------+-------+------+------------+---------------+----------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |     1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6)',1);
+------+-------+------+------------+----------------------+----------+
| id   | name  | dept | birth      | memo                 | score(1) |
+------+-------+------+------------+----------------------+----------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |     1560 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin        |     1560 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.02)

顺序order参数效果

但有顺序时，China TianJin就满足，而TianJin China就不满足了。

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6,0)',1);
+------+-------+------+------------+----------------------+----------+
| id   | name  | dept | birth      | memo                 | score(1) |
+------+-------+------+------------+----------------------+----------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |     1560 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin        |     1560 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6,1)',1);
+------+-------+------+------------+---------------+----------+
| id   | name  | dept | birth      | memo          | score(1) |
+------+-------+------+------------+---------------+----------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |     1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)

SCORE函数

语法

INT SCORE(N)

说明

N 指在Contains里的score_flag参数。

该函数功能为全文索引对关键字的匹配程度所计算的分数，返回的值是全文查询函数 CONTAINS 查询结果的权重值，值的大小跟全文索引评估算法有关，权重值越高则匹配度越高。GBase 8a MPP Cluster 全文索引的评估模式默认是 BM25算法，可以通过扩展选项"rank=tf"设置采用词频评估算法。

样例

如下同样的China,TianJin, 但那个score高的比低的更符合要求些。

gbase> select t1.*,score(1) from t1 where contains(memo,'near(("China,TianJin"),15)',1);
+------+-------+------+------------+----------------------+----------+
| id   | name  | dept | birth      | memo                 | score(1) |
+------+-------+------+------------+----------------------+----------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |     1545 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin        |     2545 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.01)

2026 年 4 月
一	二	三	四	五	六	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

参考

语法

说明

样例