本文介绍在Clickhouse里,对MergeTree表,制作物化视图,采用SummingMergeTree引擎来预聚合数据的样例,6000万行性能提升约10倍。
目录导航
原始表结构
简单的4列表,包含了顺序号,名字,采集日期和一个数值。
CREATE TABLE testdb.t3
(
`seqid` Int64,
`name` String,
`collectdate` Date,
`value` Decimal(16, 3)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(collectdate)
ORDER BY collectdate
数据
最终结果如下
localhost :) select count(*) from t3;
SELECT count(*)
FROM t3
Query id: 5ad7dc70-0ef0-4ff7-8cae-311481eb8169
┌──count()─┐
│ 67108864 │
└──────────┘
1 rows in set. Elapsed: 0.014 sec.
数据生成过程如下
insert into t3 values(1,11111,'2020-01-01',111);
insert into t3 values(2,22222,'2020-01-01',111);
insert into t3 values(3,33333,'2020-01-01',222);
insert into t3 values(4,44444,'2020-01-01',222);
insert into t3 select seqid+power(2,2) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,3) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,4) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,5) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,6) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,7) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,8) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,9) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,10) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,11) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,12) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,13) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,14) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,15) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,16) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,17) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,18) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,19) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,20) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,21) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,22) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,23) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,24) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
insert into t3 select seqid+power(2,25) seqid,name,collectdate+1 collectdate ,value+1 value from t3;
聚合运算
直接做collectdate列的count集合运算,耗时0.302秒,处理了6700万行,134M数据。
localhost :) select collectdate,count(*) from t3 group by collectdate order by collectdate;
SELECT
collectdate,
count(*)
FROM t3
GROUP BY collectdate
ORDER BY collectdate ASC
Query id: 5c2cabe8-8389-4a0d-a07a-c84ed664bdff
┌─collectdate─┬──count()─┐
│ 2020-01-01 │ 4 │
│ 2020-01-02 │ 96 │
│ 2020-01-03 │ 1104 │
│ 2020-01-04 │ 8096 │
│ 2020-01-05 │ 42504 │
│ 2020-01-06 │ 170016 │
│ 2020-01-07 │ 538384 │
│ 2020-01-08 │ 1384416 │
│ 2020-01-09 │ 2941884 │
│ 2020-01-10 │ 5230016 │
│ 2020-01-11 │ 7845024 │
│ 2020-01-12 │ 9984576 │
│ 2020-01-13 │ 10816624 │
│ 2020-01-14 │ 9984576 │
│ 2020-01-15 │ 7845024 │
│ 2020-01-16 │ 5230016 │
│ 2020-01-17 │ 2941884 │
│ 2020-01-18 │ 1384416 │
│ 2020-01-19 │ 538384 │
│ 2020-01-20 │ 170016 │
│ 2020-01-21 │ 42504 │
│ 2020-01-22 │ 8096 │
│ 2020-01-23 │ 1104 │
│ 2020-01-24 │ 96 │
│ 2020-01-25 │ 4 │
└─────────────┴──────────┘
25 rows in set. Elapsed: 0.302 sec. Processed 67.11 million rows, 134.22 MB (222.45 million rows/s., 444.91 MB/s.)
localhost :)
物化视图
在建表后,立即创建物化视图,然后再insert数据
CREATE MATERIALIZED VIEW IF NOT EXISTS testdb.mv_t2_collectdate
ENGINE = SummingMergeTree
PARTITION BY collectdate
ORDER BY collectdate AS
SELECT
collectdate,
count(*) AS visit
FROM testdb.t2
GROUP BY collectdate
再次聚合运算
耗时0.027秒
localhost :) select collectdate,sum(visit) from mv_t3_collectdate group by collectdate order by collectdate;
SELECT
collectdate,
sum(visit)
FROM mv_t3_collectdate
GROUP BY collectdate
ORDER BY collectdate ASC
Query id: 6b7aed99-75b2-4553-81fa-6d209bda5e8e
┌─collectdate─┬─sum(visit)─┐
│ 2020-01-01 │ 4 │
│ 2020-01-02 │ 96 │
│ 2020-01-03 │ 1104 │
│ 2020-01-04 │ 8096 │
│ 2020-01-05 │ 42504 │
│ 2020-01-06 │ 170016 │
│ 2020-01-07 │ 538384 │
│ 2020-01-08 │ 1384416 │
│ 2020-01-09 │ 2941884 │
│ 2020-01-10 │ 5230016 │
│ 2020-01-11 │ 7845024 │
│ 2020-01-12 │ 9984576 │
│ 2020-01-13 │ 10816624 │
│ 2020-01-14 │ 9984576 │
│ 2020-01-15 │ 7845024 │
│ 2020-01-16 │ 5230016 │
│ 2020-01-17 │ 2941884 │
│ 2020-01-18 │ 1384416 │
│ 2020-01-19 │ 538384 │
│ 2020-01-20 │ 170016 │
│ 2020-01-21 │ 42504 │
│ 2020-01-22 │ 8096 │
│ 2020-01-23 │ 1104 │
│ 2020-01-24 │ 96 │
│ 2020-01-25 │ 4 │
└─────────────┴────────────┘
25 rows in set. Elapsed: 0.027 sec.
查看物化视图原始数据
可以看到,每个part已经按照预先设置的日期列collectdate做好了count的结果,日期有多个part,所以最终的count要把每个part的count值,sum(visit)起来就可以了。
localhost :) select * from mv_t3_collectdate;
SELECT *
FROM mv_t3_collectdate
Query id: 450de5b9-7518-49ba-9a64-f2208cbf2c12
┌─collectdate─┬─visit─┐
│ 2020-01-01 │ 1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-01 │ 1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-01 │ 1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-01 │ 1 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-02 │ 84 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-02 │ 4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-02 │ 4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-02 │ 4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-03 │ 924 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-03 │ 88 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-03 │ 92 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-04 │ 7084 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-04 │ 1012 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-05 │ 42504 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-06 │ 81396 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-06 │ 23940 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-06 │ 29260 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-06 │ 35420 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│ 2020-01-07 │ 298452 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-07 │ 105336 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-07 │ 134596 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-08 │ 980628 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-08 │ 403788 │
└─────────────┴────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-09 │ 2941884 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-10 │ 1175720 │
└─────────────┴─────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-10 │ 813960 │
└─────────────┴────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-10 │ 1279080 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-10 │ 1961256 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-11 │ 2586584 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-11 │ 1989680 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-11 │ 3268760 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-12 │ 5408312 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-12 │ 4576264 │
└─────────────┴─────────┘
┌─collectdate─┬────visit─┐
│ 2020-01-13 │ 10816624 │
└─────────────┴──────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-14 │ 813960 │
└─────────────┴────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-14 │ 1175720 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-14 │ 2586584 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-14 │ 5408312 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-15 │ 1279080 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-15 │ 1989680 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-15 │ 4576264 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-16 │ 1961256 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-16 │ 3268760 │
└─────────────┴─────────┘
┌─collectdate─┬───visit─┐
│ 2020-01-17 │ 2941884 │
└─────────────┴─────────┘
┌─collectdate─┬─visit─┐
│ 2020-01-18 │ 23940 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-18 │ 81396 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│ 2020-01-18 │ 298452 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-18 │ 980628 │
└─────────────┴────────┘
┌─collectdate─┬─visit─┐
│ 2020-01-19 │ 29260 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│ 2020-01-19 │ 105336 │
└─────────────┴────────┘
┌─collectdate─┬──visit─┐
│ 2020-01-19 │ 403788 │
└─────────────┴────────┘
┌─collectdate─┬─visit─┐
│ 2020-01-20 │ 35420 │
└─────────────┴───────┘
┌─collectdate─┬──visit─┐
│ 2020-01-20 │ 134596 │
└─────────────┴────────┘
┌─collectdate─┬─visit─┐
│ 2020-01-21 │ 42504 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-22 │ 4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-22 │ 84 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-22 │ 924 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-22 │ 7084 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-23 │ 4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-23 │ 88 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-23 │ 1012 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-24 │ 4 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-24 │ 92 │
└─────────────┴───────┘
┌─collectdate─┬─visit─┐
│ 2020-01-25 │ 4 │
└─────────────┴───────┘
64 rows in set. Elapsed: 0.044 sec.
带过滤条件的性能对比
只查询2020-01-15以后的数据,t3使用了0.272秒,物化视图用了0.008秒。差了34倍。
localhost :) select collectdate,count(*) from t3 where collectdate > '2020-01-15' group by collectdate order by collectdate;
SELECT
collectdate,
count(*)
FROM t3
WHERE collectdate > '2020-01-15'
GROUP BY collectdate
ORDER BY collectdate ASC
Query id: 4e3562a9-cbbd-4d48-ab01-c408e2c6e427
┌─collectdate─┬─count()─┐
│ 2020-01-16 │ 5230016 │
│ 2020-01-17 │ 2941884 │
│ 2020-01-18 │ 1384416 │
│ 2020-01-19 │ 538384 │
│ 2020-01-20 │ 170016 │
│ 2020-01-21 │ 42504 │
│ 2020-01-22 │ 8096 │
│ 2020-01-23 │ 1104 │
│ 2020-01-24 │ 96 │
│ 2020-01-25 │ 4 │
└─────────────┴─────────┘
10 rows in set. Elapsed: 0.272 sec. Processed 10.32 million rows, 20.63 MB (37.98 million rows/s., 75.96 MB/s.)
localhost :) select collectdate,sum(visit) from mv_t3_collectdate where collectdate > '2020-01-15' group by collectdate order by collectdate;
SELECT
collectdate,
sum(visit)
FROM mv_t3_collectdate
WHERE collectdate > '2020-01-15'
GROUP BY collectdate
ORDER BY collectdate ASC
Query id: 76042319-3d5e-42cb-99a0-2973119bd3a8
┌─collectdate─┬─sum(visit)─┐
│ 2020-01-16 │ 5230016 │
│ 2020-01-17 │ 2941884 │
│ 2020-01-18 │ 1384416 │
│ 2020-01-19 │ 538384 │
│ 2020-01-20 │ 170016 │
│ 2020-01-21 │ 42504 │
│ 2020-01-22 │ 8096 │
│ 2020-01-23 │ 1104 │
│ 2020-01-24 │ 96 │
│ 2020-01-25 │ 4 │
└─────────────┴────────────┘
10 rows in set. Elapsed: 0.008 sec.
总结
物化视图,在单列6000万行场景,可以提升10倍以上的性能(原始数据和聚合数据比例 200百万分之一)。
限制
物化视图必须先建立
物化视图只对建立后的数据变动做统计,也就是典型的【流式】计算,历史的不管。除非你历史数据整体折腾一遍。
当然如果你的查询,本身就有时间过滤,那么最新的数据肯定都是正确的。
物化视图只对inseft的数据有效
物化视图本质就像insert语句的触发器;如果有什么集合的运算,他会应用于最新插入的数据当中;对于其他原表的变化,比如说,更新,删除,删除分区,都不会影响到物化视图的变化