南大通用GBase 8a 做distinct 操作时的2个参数,_t_gcluster_agg_distinct_redist_optimize和_t_gcluster_agg_distinct_redist_optimize_with_groupby

本文介绍2个和distinct 有关的参数。_t_gcluster_agg_distinct_redist_optimize 和 _t_gcluster_agg_distinct_redist_optimize_with_groupby。

如下参数,仅在count distinct列或者count distinct group列不是hash分布列的情况下。如果distinct或group有hash列,全部本地计算即可。

_t_gcluster_agg_distinct_redist_optimize

aggr(distinct)优化。注意带group by时不参与
1:打开,默认。本地
0:关闭,数据会先拉到一个节点。

执行计划样例

默认情况下,数据先在本地根据distinct列做group, 然后将结果按照distinct的列做二次分发。每个节点本地做完agg后,将结果汇总到一个节点,最后得到结果。

如果参数为0,则数据本地做group后,直接全部汇总到一个节点,得到结果。

gbase> set _t_gcluster_agg_distinct_redist_optimize=1;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)

gbase> explain select count(distinct name) from t1;
+----+----------------+-----------+--------+---------------+
| ID | MOTION         | OPERATION | TABLE  | CONDITION     |
+----+----------------+-----------+--------+---------------+
| 02 | [RESULT]       |  Step     | <01>   |               |
|    |                |  AGG      |        |               |
| 01 | [GATHER]       |  Step     | <00>   |               |
|    |                |  AGG      |        |               |
| 00 | [REDIST(name)] |  Table    | t1[id] |               |
|    |                |  GROUP    |        | GROUP BY name |
+----+----------------+-----------+--------+---------------+
6 rows in set (Elapsed: 00:00:00.00)

gbase> set _t_gcluster_agg_distinct_redist_optimize=0;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)

gbase> explain select count(distinct name) from t1;
+----+----------+-----------+--------+---------------+
| ID | MOTION   | OPERATION | TABLE  | CONDITION     |
+----+----------+-----------+--------+---------------+
| 01 | [RESULT] |  Step     | <00>   |               |
|    |          |  AGG      |        |               |
| 00 | [GATHER] |  Table    | t1[id] |               |
|    |          |  GROUP    |        | GROUP BY name |
+----+----------+-----------+--------+---------------+
4 rows in set (Elapsed: 00:00:00.01)

_t_gcluster_agg_distinct_redist_optimize_with_groupby

agg(distinct A) group by B时,强行指定按A列做重分布,必须在_t_gcluster_agg_distinct_redist_optimize打开时才生效。
0:默认按照group by 的B列重分布
1:按照distinct的A列重分布

执行计划样例

如下例子,是按照id列,也就是group by 的列做动态重分布。

设置参数为1,则多了一步,先按照distinct做重分布,然后再做一次根据group 列的二次重分布。

gbase> explain select count(distinct name) from t2 group by id;
+----+--------------+-----------+---------+-------------------+
| ID | MOTION       | OPERATION | TABLE   | CONDITION         |
+----+--------------+-----------+---------+-------------------+
| 01 | [RESULT]     |  Step     | <00>    |                   |
|    |              |  GROUP    |         | GROUP BY id       |
| 00 | [REDIST(id)] |  Table    | t2[DIS] |                   |
|    |              |  GROUP    |         | GROUP BY id, name |
+----+--------------+-----------+---------+-------------------+
4 rows in set (Elapsed: 00:00:00.01)

gbase> set _t_gcluster_agg_distinct_redist_optimize_with_groupby=1;
Query OK, 0 rows affected (Elapsed: 00:00:00.01)

gbase> explain select count(distinct name) from t2 group by id;
+----+----------------+-----------+---------+-------------------+
| ID | MOTION         | OPERATION | TABLE   | CONDITION         |
+----+----------------+-----------+---------+-------------------+
| 02 | [RESULT]       |  Step     | <01>    |                   |
|    |                |  GROUP    |         | GROUP BY id       |
| 01 | [REDIST(id)]   |  Step     | <00>    |                   |
|    |                |  GROUP    |         | GROUP BY id       |
| 00 | [REDIST(name)] |  Table    | t2[DIS] |                   |
|    |                |  GROUP    |         | GROUP BY id, name |
+----+----------------+-----------+---------+-------------------+
6 rows in set (Elapsed: 00:00:00.02)