Clickhoue 的预聚合物化视图样例和性能提升测试SummingMergeTree

本文介绍在Clickhouse里,对MergeTree表,制作物化视图,采用SummingMergeTree引擎来预聚合数据的样例,6000万行性能提升约10倍。

原始表结构

简单的4列表,包含了顺序号,名字,采集日期和一个数值。

CREATE TABLE testdb.t3
(
    `seqid` Int64,
    `name` String,
    `collectdate` Date,
    `value` Decimal(16, 3)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(collectdate)
ORDER BY collectdate

数据

最终结果如下

localhost :) select count(*) from t3;

SELECT count(*)
FROM t3

Query id: 5ad7dc70-0ef0-4ff7-8cae-311481eb8169

┌──count()─┐
│ 67108864 │
└──────────┘

1 rows in set. Elapsed: 0.014 sec.

数据生成过程如下

 insert into t3 values(1,11111,'2020-01-01',111);
 insert into t3 values(2,22222,'2020-01-01',111);
 insert into t3 values(3,33333,'2020-01-01',222);
 insert into t3 values(4,44444,'2020-01-01',222);

 insert into t3 select seqid+power(2,2) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,3) seqid,name,collectdate+1 collectdate ,value+1 value from t3; 
 insert into t3 select seqid+power(2,4) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,5) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,6) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,7) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,8) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,9) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,10) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,11) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,12) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,13) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 
 insert into t3 select seqid+power(2,14) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,15) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,16) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,17) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,18) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,19) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,20) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 
 
 insert into t3 select seqid+power(2,21) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,22) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,23) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,24) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
 insert into t3 select seqid+power(2,25) seqid,name,collectdate+1 collectdate ,value+1 value from t3;

聚合运算

直接做collectdate列的count集合运算,耗时0.302秒,处理了6700万行,134M数据。

localhost :) select collectdate,count(*) from t3 group by collectdate order by collectdate;

SELECT
    collectdate,
    count(*)
FROM t3
GROUP BY collectdate
ORDER BY collectdate ASC

Query id: 5c2cabe8-8389-4a0d-a07a-c84ed664bdff

┌─collectdate─┬──count()─┐
│  2020-01-01 │        4 │
│  2020-01-02 │       96 │
│  2020-01-03 │     1104 │
│  2020-01-04 │     8096 │
│  2020-01-05 │    42504 │
│  2020-01-06 │   170016 │
│  2020-01-07 │   538384 │
│  2020-01-08 │  1384416 │
│  2020-01-09 │  2941884 │
│  2020-01-10 │  5230016 │
│  2020-01-11 │  7845024 │
│  2020-01-12 │  9984576 │
│  2020-01-13 │ 10816624 │
│  2020-01-14 │  9984576 │
│  2020-01-15 │  7845024 │
│  2020-01-16 │  5230016 │
│  2020-01-17 │  2941884 │
│  2020-01-18 │  1384416 │
│  2020-01-19 │   538384 │
│  2020-01-20 │   170016 │
│  2020-01-21 │    42504 │
│  2020-01-22 │     8096 │
│  2020-01-23 │     1104 │
│  2020-01-24 │       96 │
│  2020-01-25 │        4 │
└─────────────┴──────────┘

25 rows in set. Elapsed: 0.302 sec. Processed 67.11 million rows, 134.22 MB (222.45 million rows/s., 444.91 MB/s.)

localhost :) 

物化视图

在建表后,立即创建物化视图,然后再insert数据

CREATE MATERIALIZED VIEW IF NOT EXISTS testdb.mv_t2_collectdate
ENGINE = SummingMergeTree
PARTITION BY collectdate
ORDER BY collectdate AS
SELECT
    collectdate,
    count(*) AS visit
FROM testdb.t2
GROUP BY collectdate

再次聚合运算

耗时0.027秒

localhost :) select collectdate,sum(visit) from mv_t3_collectdate group by collectdate order by collectdate;

SELECT
    collectdate,
    sum(visit)
FROM mv_t3_collectdate
GROUP BY collectdate
ORDER BY collectdate ASC

Query id: 6b7aed99-75b2-4553-81fa-6d209bda5e8e

┌─collectdate─┬─sum(visit)─┐
│  2020-01-01 │          4 │
│  2020-01-02 │         96 │
│  2020-01-03 │       1104 │
│  2020-01-04 │       8096 │
│  2020-01-05 │      42504 │
│  2020-01-06 │     170016 │
│  2020-01-07 │     538384 │
│  2020-01-08 │    1384416 │
│  2020-01-09 │    2941884 │
│  2020-01-10 │    5230016 │
│  2020-01-11 │    7845024 │
│  2020-01-12 │    9984576 │
│  2020-01-13 │   10816624 │
│  2020-01-14 │    9984576 │
│  2020-01-15 │    7845024 │
│  2020-01-16 │    5230016 │
│  2020-01-17 │    2941884 │
│  2020-01-18 │    1384416 │
│  2020-01-19 │     538384 │
│  2020-01-20 │     170016 │
│  2020-01-21 │      42504 │
│  2020-01-22 │       8096 │
│  2020-01-23 │       1104 │
│  2020-01-24 │         96 │
│  2020-01-25 │          4 │
└─────────────┴────────────┘

25 rows in set. Elapsed: 0.027 sec.

查看物化视图原始数据

可以看到,每个part已经按照预先设置的日期列collectdate做好了count的结果,日期有多个part,所以最终的count要把每个part的count值,sum(visit)起来就可以了。

localhost :) select * from mv_t3_collectdate;

SELECT *
FROM mv_t3_collectdate

Query id: 450de5b9-7518-49ba-9a64-f2208cbf2c12

┌─collectdate─┬─visit─┐
│  2020-01-01 │     1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-01 │     1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-01 │     1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-01 │     1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-02 │    84 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-02 │     4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-02 │     4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-02 │     4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-03 │   924 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-03 │    88 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-03 │    92 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-04 │  7084 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-04 │  1012 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-05 │ 42504 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-06 │ 81396 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-06 │ 23940 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-06 │ 29260 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-06 │ 35420 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│  2020-01-07 │ 298452 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│  2020-01-07 │ 105336 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│  2020-01-07 │ 134596 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│  2020-01-08 │ 980628 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│  2020-01-08 │ 403788 │
└─────────────┴────────┘
┌─collectdate─┬───visit─┐
│  2020-01-09 │ 2941884 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-10 │ 1175720 │
└─────────────┴─────────┘
┌─collectdate─┬──visit─┐
│  2020-01-10 │ 813960 │
└─────────────┴────────┘
┌─collectdate─┬───visit─┐
│  2020-01-10 │ 1279080 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-10 │ 1961256 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-11 │ 2586584 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-11 │ 1989680 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-11 │ 3268760 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-12 │ 5408312 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-12 │ 4576264 │
└─────────────┴─────────┘
┌─collectdate─┬────visit─┐
│  2020-01-13 │ 10816624 │
└─────────────┴──────────┘
┌─collectdate─┬──visit─┐
│  2020-01-14 │ 813960 │
└─────────────┴────────┘
┌─collectdate─┬───visit─┐
│  2020-01-14 │ 1175720 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-14 │ 2586584 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-14 │ 5408312 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-15 │ 1279080 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-15 │ 1989680 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-15 │ 4576264 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-16 │ 1961256 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-16 │ 3268760 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│  2020-01-17 │ 2941884 │
└─────────────┴─────────┘
┌─collectdate─┬─visit─┐
│  2020-01-18 │ 23940 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-18 │ 81396 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│  2020-01-18 │ 298452 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│  2020-01-18 │ 980628 │
└─────────────┴────────┘
┌─collectdate─┬─visit─┐
│  2020-01-19 │ 29260 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│  2020-01-19 │ 105336 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│  2020-01-19 │ 403788 │
└─────────────┴────────┘
┌─collectdate─┬─visit─┐
│  2020-01-20 │ 35420 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│  2020-01-20 │ 134596 │
└─────────────┴────────┘
┌─collectdate─┬─visit─┐
│  2020-01-21 │ 42504 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-22 │     4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-22 │    84 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-22 │   924 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-22 │  7084 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-23 │     4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-23 │    88 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-23 │  1012 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-24 │     4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-24 │    92 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│  2020-01-25 │     4 │
└─────────────┴───────┘

64 rows in set. Elapsed: 0.044 sec.

带过滤条件的性能对比

只查询2020-01-15以后的数据,t3使用了0.272秒,物化视图用了0.008秒。差了34倍。

localhost :) select collectdate,count(*) from t3 where collectdate > '2020-01-15' group by collectdate order by collectdate;

SELECT
    collectdate,
    count(*)
FROM t3
WHERE collectdate > '2020-01-15'
GROUP BY collectdate
ORDER BY collectdate ASC

Query id: 4e3562a9-cbbd-4d48-ab01-c408e2c6e427

┌─collectdate─┬─count()─┐
│  2020-01-16 │ 5230016 │
│  2020-01-17 │ 2941884 │
│  2020-01-18 │ 1384416 │
│  2020-01-19 │  538384 │
│  2020-01-20 │  170016 │
│  2020-01-21 │   42504 │
│  2020-01-22 │    8096 │
│  2020-01-23 │    1104 │
│  2020-01-24 │      96 │
│  2020-01-25 │       4 │
└─────────────┴─────────┘

10 rows in set. Elapsed: 0.272 sec. Processed 10.32 million rows, 20.63 MB (37.98 million rows/s., 75.96 MB/s.)

localhost :) select collectdate,sum(visit) from mv_t3_collectdate where collectdate > '2020-01-15'  group by collectdate order by collectdate;

SELECT
    collectdate,
    sum(visit)
FROM mv_t3_collectdate
WHERE collectdate > '2020-01-15'
GROUP BY collectdate
ORDER BY collectdate ASC

Query id: 76042319-3d5e-42cb-99a0-2973119bd3a8

┌─collectdate─┬─sum(visit)─┐
│  2020-01-16 │    5230016 │
│  2020-01-17 │    2941884 │
│  2020-01-18 │    1384416 │
│  2020-01-19 │     538384 │
│  2020-01-20 │     170016 │
│  2020-01-21 │      42504 │
│  2020-01-22 │       8096 │
│  2020-01-23 │       1104 │
│  2020-01-24 │         96 │
│  2020-01-25 │          4 │
└─────────────┴────────────┘

10 rows in set. Elapsed: 0.008 sec.

总结

物化视图,在单列6000万行场景,可以提升10倍以上的性能(原始数据和聚合数据比例 200百万分之一)。

限制

物化视图必须先建立

物化视图只对建立后的数据变动做统计,也就是典型的【流式】计算,历史的不管。除非你历史数据整体折腾一遍。

当然如果你的查询,本身就有时间过滤,那么最新的数据肯定都是正确的。

物化视图只对inseft的数据有效

物化视图本质就像insert语句的触发器;如果有什么集合的运算,他会应用于最新插入的数据当中;对于其他原表的变化,比如说,更新,删除,删除分区,都不会影响到物化视图的变化