南大通用GBase 8a全文索引提高模糊查询性能的使用样例

本文介绍GBase 8a MPP集群全文索引的详细样例,用来提高模糊查询的性能。

参考

GBase 8a全文索引功能安装部署方法
GBase 8a全文索引创建、更新和删除方法
GBase 8a全文索引常用配置文件和配置参数
GBase 8a全文索引提高模糊查询性能使用样例
GBase 8a全文索引多分词器的功能介绍和使用

语法

  CONTAINS (column_name, Query Content[, score_flag])

说明

  • column_name 列名字
  • Query Content 查询表达式,由1到多个条件组成。详情看后面样例介绍
  • score_flag SCORE 标号,可选项, 如不填写,则表示不输出评估值, 应与 SCORE 函数一起使用

样例

本文样例,调整了参数,从默认的”自然分词“参数0,改成了字母和数组三元分词的参数3。详细的全文配置,请参考GBase 8a全文索引配置参数详解(编写中)。

区分大小写。

数据样本

gbase> show create table t1;
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                                                                                         |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE "t1" (
  "id" int(11) DEFAULT NULL,
  "name" varchar(100) DEFAULT NULL,
  "dept" int(11) DEFAULT NULL,
  "birth" date DEFAULT NULL,
  "memo" longtext,
  FULLTEXT "idx_memo" ("memo")
) ENGINE=EXPRESS DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (Elapsed: 00:00:00.00)

gbase> select * from t1;
+------+--------+------+------------+-----------------------------------+
| id   | name   | dept | birth      | memo                              |
+------+--------+------+------------+-----------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注              |
|    2 | 李四   |    1 | 1980-08-08 | 李四的个人建立和备注              |
|    3 | GBase  |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU    |    3 | 1895-10-02 | TianJin University                |
|    5 | Test   |    3 | 1895-10-02 | Test12345678abcdefg               |
|    6 | Test2  |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3  |    4 | 2021-07-01 | China TianJin                     |
+------+--------+------+------------+-----------------------------------+
7 rows in set (Elapsed: 00:00:00.01)

gbase> update index idx_memo on t1;
Query OK, 1 row affected (Elapsed: 00:00:01.29)

双引号精确匹配

双引号内的部分,必须精确完全匹配。 否则将按照分词结果,做OR匹配了。比如 '天津’匹配的是'天'和'津'两个汉字,如果你要匹配‘天津’则必须用双引号包围起来‘“天津"'。

gbase> select * from t1 where contains(memo,'"TianJin ltd"');
gbase> select * from t1 where contains(memo,'"TianJin China"');
+------+-------+------+------------+----------------------+
| id   | name  | dept | birth      | memo                 |
+------+-------+------+------------+----------------------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |
+------+-------+------+------------+----------------------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select * from t1 where contains(memo,'"TianJin ltd"');
Empty set (Elapsed: 00:00:00.00)

等同于左右百分号的like的查询

汉字,完整单词查询

gbase> select * from t1 where contains(memo,'人');
+------+--------+------+------------+--------------------------------+
| id   | name   | dept | birth      | memo                           |
+------+--------+------+------------+--------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注           |
|    2 | 李四   |    1 | 1980-08-08 | 李四的个人建立和备注           |
+------+--------+------+------------+--------------------------------+
2 rows in set (Elapsed: 00:00:00.02)

gbase> select * from t1 where contains(memo,'个人');
+------+--------+------+------------+--------------------------------+
| id   | name   | dept | birth      | memo                           |
+------+--------+------+------------+--------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注           |
|    2 | 李四   |    1 | 1980-08-08 | 李四的个人建立和备注           |
+------+--------+------+------------+--------------------------------+
2 rows in set (Elapsed: 00:00:00.00)

gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.11)

长度大于等于3的连续数字和字母

gbase> select * from t1 where contains(memo,'gen');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.00)

gbase> select * from t1 where contains(memo,'123');
+------+------+------+------------+---------------------+
| id   | name | dept | birth      | memo                |
+------+------+------+------------+---------------------+
|    5 | Test |    3 | 1895-10-02 | Test12345678abcdefg |
+------+------+------+------------+---------------------+
1 row in set (Elapsed: 00:00:00.00)

长度小于3的连续数字和字母

没有查到,参数默认只匹配3个。从实际业务看,查询2个连续数字或字母,是一个罕见需求。查询1个数字或字母,是一个不正常需求。GBase 8a支持配置分词长度,但会带来更多的数据膨胀。

gbase> select * from t1 where contains(memo,'ge');
Empty set (Elapsed: 00:00:00.01)


gbase> select * from t1 where contains(memo,'12');
Empty set (Elapsed: 00:00:00.01)

区分大小写

gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.11)

gbase> select * from t1 where contains(memo,'tianJin');
Empty set (Elapsed: 00:00:00.01)

开头标记^,等同右百分号的like查询

用^表示以此开头,等同于like 'TianJin%'的右百分号形式。如下只有TianJin开头的才返回。

gbase> select * from t1 where contains(memo,'^TianJin');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
+------+-------+------+------------+-----------------------------------+
2 rows in set (Elapsed: 00:00:00.01)

结尾标记$,等同左百分号的like查询

只有以TianJin结尾的才返回。等同于memo like '%TianJin'

gbase> select * from t1 where contains(memo,'TianJin$');
+------+-------+------+------------+---------------+
| id   | name  | dept | birth      | memo          |
+------+-------+------+------------+---------------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |
+------+-------+------+------------+---------------+
1 row in set (Elapsed: 00:00:00.01)

无双引号的空格属于隐式AND

如下查询TianJin ltd, 如果没有双引号包围符,则代表AND, 也就是包含TianJin 和 ltd. 如果由包围符,则数据必须是连续的"TianJin ltd"。

gbase> select * from t1 where contains(memo,'TianJin ltd');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select * from t1 where contains(memo,'"TianJin ltd"');
Empty set (Elapsed: 00:00:00.01)

"与"AND标记&, 多个like and

AND标记用&表示,比如包含Tianjin 和 ltd的。多个条件,每个条件用双引号包围。

gbase> select * from t1 where contains(memo,'"TianJin"&"ltd"');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.01)

“或”OR标记|,多个like OR

用竖线|表示或。比如包含张三或者TianJin的,等同于 memo like '%张三%' or memo like '%TianJin%'

gbase> select * from t1 where contains(memo,'"张三"|"TianJin"');
+------+--------+------+------------+-----------------------------------+
| id   | name   | dept | birth      | memo                              |
+------+--------+------+------------+-----------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注              |
|    3 | GBase  |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU    |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2  |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3  |    4 | 2021-07-01 | China TianJin                     |
+------+--------+------+------------+-----------------------------------+
5 rows in set (Elapsed: 00:00:00.01)

”非“NOT标记-, not like

用减号-,表示排除后面的内容。比如包含”张三“或”TianJin“, 但不包含”人“的。 注意默认优先级 或 > 非 > 与

gbase> select * from t1 where contains(memo,'"张三"|"TianJin"-"人"');
+------+-------+------+------------+-----------------------------------+
| id   | name  | dept | birth      | memo                              |
+------+-------+------+------------+-----------------------------------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.01)

使用小括号分组来改变匹配顺序

对照前例,本例查询 like '%张三%' or (like '%Tianjin%' and not like '%人')。

gbase> select * from t1 where contains(memo,'"张三"|("TianJin"-"人")');
+------+--------+------+------------+-----------------------------------+
| id   | name   | dept | birth      | memo                              |
+------+--------+------+------------+-----------------------------------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注              |
|    3 | GBase  |    2 | 2004-05-11 | TianJin general date compamy ltd. |
|    4 | TJU    |    3 | 1895-10-02 | TianJin University                |
|    6 | Test2  |    4 | 2021-07-01 | GBase TianJin China.              |
|    7 | Test3  |    4 | 2021-07-01 | China TianJin                     |
+------+--------+------+------------+-----------------------------------+
5 rows in set (Elapsed: 00:00:00.00)

阀值匹配符‘/’

值一组条件内,最少满足匹配几个。比如条件”张 Tia 人“ 最少包含其中的2个即可,则写成"张 Tia 人"/2。

请一定要注意分词问题,特别是英文和字母。每个汉字,每个默认的三元分词都会命中得分。所以本例采用的是单个汉字和3个字母的样例。



gbase> select t1.*,score(1) from t1 where contains(memo,'"张 Tia 人"/2',1);
+------+--------+------+------------+--------------------------------+----------+
| id   | name   | dept | birth      | memo                           | score(1) |
+------+--------+------+------------+--------------------------------+----------+
|    1 | 张三   |    1 | 1990-09-09 | 张三的个人建立和备注           |     1610 |
+------+--------+------+------------+--------------------------------+----------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'"张 Tia 人"/3',1);
Empty set (Elapsed: 00:00:00.01)

如下是多个字母的样例,粗看会觉得结果不正确的。

之所以返回,是因为TianJin按照默认的三元分词规则, Tia ian anJ 。。。。 都能匹配的。这就和我们期望的TianJin整体匹配不符和,但既然设置了三元分词,而不是”自然分词“, 这个结果也得接受。

gbase> select t1.*,score(1) from t1 where contains(memo,'"张 TianJin 人"/3',1);
+------+-------+------+------------+-----------------------------------+----------+
| id   | name  | dept | birth      | memo                              | score(1) |
+------+-------+------+------------+-----------------------------------+----------+
|    3 | GBase |    2 | 2004-05-11 | TianJin general date compamy ltd. |     5500 |
|    4 | TJU   |    3 | 1895-10-02 | TianJin University                |     5500 |
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China.              |     5500 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin                     |     5500 |
+------+-------+------+------------+-----------------------------------+----------+
4 rows in set (Elapsed: 00:00:00.00)

near函数

本身属于contains的一部分,但内容较为独立,所以单开一个标题。

语法

NEAR ((term1, term2), num[,Order])

请注意near的term部分,外面有一层小括号。

说明

  • term : 搜索词。多个之间用逗号分割。词之间是AND关系。且每个搜索词必须精确满足。
  • num : 表示词距数,非 0 整数,词距(包括匹配词),实际词距小于 num 值即为符号要求。
  • order : 表示词序。Order 为可选项,默认值为 0,表示无词序。
    • 为 0 代表无词序
    • 为 1 代表有词序。

词距

指二个词之间间隔单词数,也包含被查询的首尾词。例如:“the black cat catch white mouse”这段文字中要查询“cat mouse”,则词距就是 4。对于中文来说,则词距为两个汉字之间间隔的字符数(字符包含汉字、标点、空格、符号等等,回车符略去不算)。

例如“上周,我去海洋局开会…”这段文字要查询“上 海”,则词距为 6。分析如下:上(1)周(2),(3)我(4)去(5)海(6)洋(7)。后面是位置顺序。 其中“海”是6, “上”是1, 距离是(6-1)+1 = 6;

例如查询“售卖制作假文*#凭、发票”的垃圾短信中“文凭”,则词距为 4。分析如下:售(1)卖(2)制(3)作(4)假(5)文(6)*(7)#(8)凭(9). 用9-6+1=4;

样例

如下以三个字母为例,因为多个字母又涉及到分词问题,看着混乱。对于“China TianJin”,Chi和Tia之间的距离是4(na和空格三个字符+1)。 而对“TianJin China” 则是6(nJin和空格+1)。

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),3)',1);
Empty set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),4)',1);
+------+-------+------+------------+---------------+----------+
| id   | name  | dept | birth      | memo          | score(1) |
+------+-------+------+------------+---------------+----------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |     1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),5)',1);
+------+-------+------+------------+---------------+----------+
| id   | name  | dept | birth      | memo          | score(1) |
+------+-------+------+------------+---------------+----------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |     1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6)',1);
+------+-------+------+------------+----------------------+----------+
| id   | name  | dept | birth      | memo                 | score(1) |
+------+-------+------+------------+----------------------+----------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |     1560 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin        |     1560 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.02)

顺序order参数效果

但有顺序时,China TianJin就满足,而TianJin China就不满足了。

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6,0)',1);
+------+-------+------+------------+----------------------+----------+
| id   | name  | dept | birth      | memo                 | score(1) |
+------+-------+------+------------+----------------------+----------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |     1560 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin        |     1560 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.01)

gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6,1)',1);
+------+-------+------+------------+---------------+----------+
| id   | name  | dept | birth      | memo          | score(1) |
+------+-------+------+------------+---------------+----------+
|    7 | Test3 |    4 | 2021-07-01 | China TianJin |     1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)

SCORE函数

语法

INT SCORE(N)

说明

N 指在Contains里的score_flag参数。

该函数功能为全文索引对关键字的匹配程度所计算的分数,返回的值是全文查询函数 CONTAINS 查询结果的权重值,值的大小跟全文索引评估算法有关,权重值越高则匹配度越高。GBase 8a MPP Cluster 全文索引的评估模式默认是 BM25算法,可以通过扩展选项"rank=tf"设置采用词频评估算法。

样例

如下同样的China,TianJin, 但那个score高的比低的更符合要求些。

gbase> select t1.*,score(1) from t1 where contains(memo,'near(("China,TianJin"),15)',1);
+------+-------+------+------------+----------------------+----------+
| id   | name  | dept | birth      | memo                 | score(1) |
+------+-------+------+------------+----------------------+----------+
|    6 | Test2 |    4 | 2021-07-01 | GBase TianJin China. |     1545 |
|    7 | Test3 |    4 | 2021-07-01 | China TianJin        |     2545 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.01)