本文介绍GBase 8a MPP集群全文索引的详细样例,用来提高模糊查询的性能。
目录导航
参考
GBase 8a全文索引功能安装部署方法
GBase 8a全文索引创建、更新和删除方法
GBase 8a全文索引常用配置文件和配置参数
GBase 8a全文索引提高模糊查询性能使用样例
GBase 8a全文索引多分词器的功能介绍和使用
语法
CONTAINS (column_name, Query Content[, score_flag])
说明
- column_name 列名字
- Query Content 查询表达式,由1到多个条件组成。详情看后面样例介绍
- score_flag SCORE 标号,可选项, 如不填写,则表示不输出评估值, 应与 SCORE 函数一起使用
样例
本文样例,调整了参数,从默认的”自然分词“参数0,改成了字母和数组三元分词的参数3。详细的全文配置,请参考GBase 8a全文索引配置参数详解(编写中)。
区分大小写。
数据样本
gbase> show create table t1;
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1 | CREATE TABLE "t1" (
"id" int(11) DEFAULT NULL,
"name" varchar(100) DEFAULT NULL,
"dept" int(11) DEFAULT NULL,
"birth" date DEFAULT NULL,
"memo" longtext,
FULLTEXT "idx_memo" ("memo")
) ENGINE=EXPRESS DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (Elapsed: 00:00:00.00)
gbase> select * from t1;
+------+--------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+--------+------+------------+-----------------------------------+
| 1 | 张三 | 1 | 1990-09-09 | 张三的个人建立和备注 |
| 2 | 李四 | 1 | 1980-08-08 | 李四的个人建立和备注 |
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
| 4 | TJU | 3 | 1895-10-02 | TianJin University |
| 5 | Test | 3 | 1895-10-02 | Test12345678abcdefg |
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin |
+------+--------+------+------------+-----------------------------------+
7 rows in set (Elapsed: 00:00:00.01)
gbase> update index idx_memo on t1;
Query OK, 1 row affected (Elapsed: 00:00:01.29)
双引号精确匹配
双引号内的部分,必须精确完全匹配。 否则将按照分词结果,做OR匹配了。比如 '天津’匹配的是'天'和'津'两个汉字,如果你要匹配‘天津’则必须用双引号包围起来‘“天津"'。
gbase> select * from t1 where contains(memo,'"TianJin ltd"');
gbase> select * from t1 where contains(memo,'"TianJin China"');
+------+-------+------+------------+----------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+----------------------+
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. |
+------+-------+------+------------+----------------------+
1 row in set (Elapsed: 00:00:00.01)
gbase> select * from t1 where contains(memo,'"TianJin ltd"');
Empty set (Elapsed: 00:00:00.00)
等同于左右百分号的like的查询
汉字,完整单词查询
gbase> select * from t1 where contains(memo,'人');
+------+--------+------+------------+--------------------------------+
| id | name | dept | birth | memo |
+------+--------+------+------------+--------------------------------+
| 1 | 张三 | 1 | 1990-09-09 | 张三的个人建立和备注 |
| 2 | 李四 | 1 | 1980-08-08 | 李四的个人建立和备注 |
+------+--------+------+------------+--------------------------------+
2 rows in set (Elapsed: 00:00:00.02)
gbase> select * from t1 where contains(memo,'个人');
+------+--------+------+------------+--------------------------------+
| id | name | dept | birth | memo |
+------+--------+------+------------+--------------------------------+
| 1 | 张三 | 1 | 1990-09-09 | 张三的个人建立和备注 |
| 2 | 李四 | 1 | 1980-08-08 | 李四的个人建立和备注 |
+------+--------+------+------------+--------------------------------+
2 rows in set (Elapsed: 00:00:00.00)
gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
|gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
| 4 | TJU | 3 | 1895-10-02 | TianJin University |
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.11)
长度大于等于3的连续数字和字母
gbase> select * from t1 where contains(memo,'gen');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.00)
gbase> select * from t1 where contains(memo,'123');
+------+------+------+------------+---------------------+
| id | name | dept | birth | memo |
+------+------+------+------------+---------------------+
| 5 | Test | 3 | 1895-10-02 | Test12345678abcdefg |
+------+------+------+------------+---------------------+
1 row in set (Elapsed: 00:00:00.00)
长度小于3的连续数字和字母
没有查到,参数默认只匹配3个。从实际业务看,查询2个连续数字或字母,是一个罕见需求。查询1个数字或字母,是一个不正常需求。GBase 8a支持配置分词长度,但会带来更多的数据膨胀。
gbase> select * from t1 where contains(memo,'ge');
Empty set (Elapsed: 00:00:00.01)
gbase> select * from t1 where contains(memo,'12');
Empty set (Elapsed: 00:00:00.01)
区分大小写
gbase> select * from t1 where contains(memo,'TianJin');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
| 4 | TJU | 3 | 1895-10-02 | TianJin University |
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.11)
gbase> select * from t1 where contains(memo,'tianJin');
Empty set (Elapsed: 00:00:00.01)
开头标记^,等同右百分号的like查询
用^表示以此开头,等同于like 'TianJin%'的右百分号形式。如下只有TianJin开头的才返回。
gbase> select * from t1 where contains(memo,'^TianJin');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
| 4 | TJU | 3 | 1895-10-02 | TianJin University |
+------+-------+------+------------+-----------------------------------+
2 rows in set (Elapsed: 00:00:00.01)
结尾标记$,等同左百分号的like查询
只有以TianJin结尾的才返回。等同于memo like '%TianJin'
gbase> select * from t1 where contains(memo,'TianJin$');
+------+-------+------+------------+---------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+---------------+
| 7 | Test3 | 4 | 2021-07-01 | China TianJin |
+------+-------+------+------------+---------------+
1 row in set (Elapsed: 00:00:00.01)
无双引号的空格属于隐式AND
如下查询TianJin ltd, 如果没有双引号包围符,则代表AND, 也就是包含TianJin 和 ltd. 如果由包围符,则数据必须是连续的"TianJin ltd"。
gbase> select * from t1 where contains(memo,'TianJin ltd');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.01)
gbase> select * from t1 where contains(memo,'"TianJin ltd"');
Empty set (Elapsed: 00:00:00.01)
"与"AND标记&, 多个like and
AND标记用&表示,比如包含Tianjin 和 ltd的。多个条件,每个条件用双引号包围。
gbase> select * from t1 where contains(memo,'"TianJin"&"ltd"');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
+------+-------+------+------------+-----------------------------------+
1 row in set (Elapsed: 00:00:00.01)
“或”OR标记|,多个like OR
用竖线|表示或。比如包含张三或者TianJin的,等同于 memo like '%张三%' or memo like '%TianJin%'
gbase> select * from t1 where contains(memo,'"张三"|"TianJin"');
+------+--------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+--------+------+------------+-----------------------------------+
| 1 | 张三 | 1 | 1990-09-09 | 张三的个人建立和备注 |
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
| 4 | TJU | 3 | 1895-10-02 | TianJin University |
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin |
+------+--------+------+------------+-----------------------------------+
5 rows in set (Elapsed: 00:00:00.01)
”非“NOT标记-, not like
用减号-,表示排除后面的内容。比如包含”张三“或”TianJin“, 但不包含”人“的。 注意默认优先级 或 > 非 > 与
gbase> select * from t1 where contains(memo,'"张三"|"TianJin"-"人"');
+------+-------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+-------+------+------------+-----------------------------------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
| 4 | TJU | 3 | 1895-10-02 | TianJin University |
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin |
+------+-------+------+------------+-----------------------------------+
4 rows in set (Elapsed: 00:00:00.01)
使用小括号分组来改变匹配顺序
对照前例,本例查询 like '%张三%' or (like '%Tianjin%' and not like '%人')。
gbase> select * from t1 where contains(memo,'"张三"|("TianJin"-"人")');
+------+--------+------+------------+-----------------------------------+
| id | name | dept | birth | memo |
+------+--------+------+------------+-----------------------------------+
| 1 | 张三 | 1 | 1990-09-09 | 张三的个人建立和备注 |
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. |
| 4 | TJU | 3 | 1895-10-02 | TianJin University |
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin |
+------+--------+------+------------+-----------------------------------+
5 rows in set (Elapsed: 00:00:00.00)
阀值匹配符‘/’
值一组条件内,最少满足匹配几个。比如条件”张 Tia 人“ 最少包含其中的2个即可,则写成"张 Tia 人"/2。
请一定要注意分词问题,特别是英文和字母。每个汉字,每个默认的三元分词都会命中得分。所以本例采用的是单个汉字和3个字母的样例。
gbase> select t1.*,score(1) from t1 where contains(memo,'"张 Tia 人"/2',1);
+------+--------+------+------------+--------------------------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+--------+------+------------+--------------------------------+----------+
| 1 | 张三 | 1 | 1990-09-09 | 张三的个人建立和备注 | 1610 |
+------+--------+------+------------+--------------------------------+----------+
1 row in set (Elapsed: 00:00:00.01)
gbase> select t1.*,score(1) from t1 where contains(memo,'"张 Tia 人"/3',1);
Empty set (Elapsed: 00:00:00.01)
如下是多个字母的样例,粗看会觉得结果不正确的。
之所以返回,是因为TianJin按照默认的三元分词规则, Tia ian anJ 。。。。 都能匹配的。这就和我们期望的TianJin整体匹配不符和,但既然设置了三元分词,而不是”自然分词“, 这个结果也得接受。
gbase> select t1.*,score(1) from t1 where contains(memo,'"张 TianJin 人"/3',1);
+------+-------+------+------------+-----------------------------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+-------+------+------------+-----------------------------------+----------+
| 3 | GBase | 2 | 2004-05-11 | TianJin general date compamy ltd. | 5500 |
| 4 | TJU | 3 | 1895-10-02 | TianJin University | 5500 |
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. | 5500 |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin | 5500 |
+------+-------+------+------------+-----------------------------------+----------+
4 rows in set (Elapsed: 00:00:00.00)
near函数
本身属于contains的一部分,但内容较为独立,所以单开一个标题。
语法
NEAR ((term1, term2), num[,Order])
请注意near的term部分,外面有一层小括号。
说明
- term : 搜索词。多个之间用逗号分割。词之间是AND关系。且每个搜索词必须精确满足。
- num : 表示词距数,非 0 整数,词距(包括匹配词),实际词距小于 num 值即为符号要求。
- order : 表示词序。Order 为可选项,默认值为 0,表示无词序。
- 为 0 代表无词序
- 为 1 代表有词序。
词距
指二个词之间间隔单词数,也包含被查询的首尾词。例如:“the black cat catch white mouse”这段文字中要查询“cat mouse”,则词距就是 4。对于中文来说,则词距为两个汉字之间间隔的字符数(字符包含汉字、标点、空格、符号等等,回车符略去不算)。
例如“上周,我去海洋局开会…”这段文字要查询“上 海”,则词距为 6。分析如下:上(1)周(2),(3)我(4)去(5)海(6)洋(7)。后面是位置顺序。 其中“海”是6, “上”是1, 距离是(6-1)+1 = 6;
例如查询“售卖制作假文*#凭、发票”的垃圾短信中“文凭”,则词距为 4。分析如下:售(1)卖(2)制(3)作(4)假(5)文(6)*(7)#(8)凭(9). 用9-6+1=4;
样例
如下以三个字母为例,因为多个字母又涉及到分词问题,看着混乱。对于“China TianJin”,Chi和Tia之间的距离是4(na和空格三个字符+1)。 而对“TianJin China” 则是6(nJin和空格+1)。
gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),3)',1);
Empty set (Elapsed: 00:00:00.01)
gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),4)',1);
+------+-------+------+------------+---------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+-------+------+------------+---------------+----------+
| 7 | Test3 | 4 | 2021-07-01 | China TianJin | 1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)
gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),5)',1);
+------+-------+------+------------+---------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+-------+------+------------+---------------+----------+
| 7 | Test3 | 4 | 2021-07-01 | China TianJin | 1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)
gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6)',1);
+------+-------+------+------------+----------------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+-------+------+------------+----------------------+----------+
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. | 1560 |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin | 1560 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.02)
顺序order参数效果
但有顺序时,China TianJin就满足,而TianJin China就不满足了。
gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6,0)',1);
+------+-------+------+------------+----------------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+-------+------+------------+----------------------+----------+
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. | 1560 |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin | 1560 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.01)
gbase> select t1.*,score(1) from t1 where contains(memo,'near((“Chi,Tia”),6,1)',1);
+------+-------+------+------------+---------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+-------+------+------------+---------------+----------+
| 7 | Test3 | 4 | 2021-07-01 | China TianJin | 1560 |
+------+-------+------+------------+---------------+----------+
1 row in set (Elapsed: 00:00:00.01)
SCORE函数
语法
INT SCORE(N)
说明
N 指在Contains里的score_flag参数。
该函数功能为全文索引对关键字的匹配程度所计算的分数,返回的值是全文查询函数 CONTAINS 查询结果的权重值,值的大小跟全文索引评估算法有关,权重值越高则匹配度越高。GBase 8a MPP Cluster 全文索引的评估模式默认是 BM25算法,可以通过扩展选项"rank=tf"设置采用词频评估算法。
样例
如下同样的China,TianJin, 但那个score高的比低的更符合要求些。
gbase> select t1.*,score(1) from t1 where contains(memo,'near(("China,TianJin"),15)',1);
+------+-------+------+------------+----------------------+----------+
| id | name | dept | birth | memo | score(1) |
+------+-------+------+------------+----------------------+----------+
| 6 | Test2 | 4 | 2021-07-01 | GBase TianJin China. | 1545 |
| 7 | Test3 | 4 | 2021-07-01 | China TianJin | 2545 |
+------+-------+------+------------+----------------------+----------+
2 rows in set (Elapsed: 00:00:00.01)