Clickhouse 合并存储提高小数据量的处理性能表参数min_bytes_for_wide_part和min_rows_for_wide_part

在Clickhouse里,每次数据装在都会生成一个part, 内部数据是按照列存储的,每个列一组文件。当数据量很少时,会浪费很多的磁盘IO次数。新版本Clickhouse里,在建表时可以通过min_bytes_for_wide_part和min_rows_for_wide_part两个参数,来设置一个阈值。当参数低于阈值时,会将part数据合并存储来减少IO次数。

参数

min_bytes_for_wide_part

数据从合并存储(Compact)转成按列存储(Wide)的最小文件大小。

min_rows_for_wide_part

数据从合并存储(Compact)转成按列存储(Wide)的最小行数。

样例

建表

设置参数大小为10K。

localhost :) create table t3_2(id int, name String, birth date)engine=MergeTree order by id settings min_bytes_for_wide_part=10240;

CREATE TABLE t3_2
(
    `id` int,
    `name` String,
    `birth` date
)
ENGINE = MergeTree
ORDER BY id
SETTINGS min_bytes_for_wide_part = 10240

Query id: 58aeb53d-c8e5-44f3-9ceb-9137216c366d

Ok.

0 rows in set. Elapsed: 0.048 sec.

插入一行数据

localhost :) insert into t3_2 values(1,'111','2011-01-02');

INSERT INTO t3_2 VALUES

Query id: 36ec247b-0033-4c07-881f-319b88132681

Ok.

1 rows in set. Elapsed: 0.005 sec.

查看part数据

可以看到part数据是合并在一起的。

[root@localhost t3_2]#  ll *
-rw-r-----. 1 clickhouse clickhouse    1 May 17 16:28 format_version.txt

all_1_1_0:
total 28
-rw-r-----. 1 clickhouse clickhouse 185 May 17 16:29 checksums.txt
-rw-r-----. 1 clickhouse clickhouse  75 May 17 16:29 columns.txt
-rw-r-----. 1 clickhouse clickhouse   1 May 17 16:29 count.txt
-rw-r-----. 1 clickhouse clickhouse  88 May 17 16:29 data.bin
-rw-r-----. 1 clickhouse clickhouse 112 May 17 16:29 data.mrk3
-rw-r-----. 1 clickhouse clickhouse  10 May 17 16:29 default_compression_codec.txt
-rw-r-----. 1 clickhouse clickhouse   8 May 17 16:29 primary.idx

detached:
total 0

再次插入100行数据

localhost :) insert into t3_2 select * from t1 limit 100;

INSERT INTO t3_2 SELECT *
FROM t1
LIMIT 100

Query id: a22e737b-0ff2-4463-9682-aec04d675ce8

Ok.

0 rows in set. Elapsed: 0.007 sec. Processed 8.19 thousand rows, 193.98 KB (1.21 million rows/s., 28.59 MB/s.)

查看100行数据的part

可以看到还是没有超过。

[root@localhost t3_2]# ll *
-rw-r-----. 1 clickhouse clickhouse    1 May 17 16:28 format_version.txt

all_1_1_0:
total 28
-rw-r-----. 1 clickhouse clickhouse 185 May 17 16:29 checksums.txt
-rw-r-----. 1 clickhouse clickhouse  75 May 17 16:29 columns.txt
-rw-r-----. 1 clickhouse clickhouse   1 May 17 16:29 count.txt
-rw-r-----. 1 clickhouse clickhouse  88 May 17 16:29 data.bin
-rw-r-----. 1 clickhouse clickhouse 112 May 17 16:29 data.mrk3
-rw-r-----. 1 clickhouse clickhouse  10 May 17 16:29 default_compression_codec.txt
-rw-r-----. 1 clickhouse clickhouse   8 May 17 16:29 primary.idx

all_2_2_0:
total 28
-rw-r-----. 1 clickhouse clickhouse 187 May 17 16:30 checksums.txt
-rw-r-----. 1 clickhouse clickhouse  75 May 17 16:30 columns.txt
-rw-r-----. 1 clickhouse clickhouse   3 May 17 16:30 count.txt
-rw-r-----. 1 clickhouse clickhouse 303 May 17 16:30 data.bin
-rw-r-----. 1 clickhouse clickhouse 112 May 17 16:30 data.mrk3
-rw-r-----. 1 clickhouse clickhouse  10 May 17 16:30 default_compression_codec.txt
-rw-r-----. 1 clickhouse clickhouse   8 May 17 16:30 primary.idx

detached:
total 0

插入1000行数据

localhost :) insert into t3_2 select * from t1 limit 1000;

INSERT INTO t3_2 SELECT *
FROM t1
LIMIT 1000

Query id: bf969c32-3b51-4686-a1e4-ba53c3412179

Ok.

0 rows in set. Elapsed: 0.006 sec. Processed 8.19 thousand rows, 193.98 KB (1.31 million rows/s., 31.02 MB/s.)

localhost :)

查看1000行数据part

可以看到第三个part:all_3_3_0已经是按列存储了。每个列一组文件。

[root@localhost t3_2]# ll *
-rw-r-----. 1 clickhouse clickhouse    1 May 17 16:28 format_version.txt

all_1_1_0:
total 28
-rw-r-----. 1 clickhouse clickhouse 185 May 17 16:29 checksums.txt
-rw-r-----. 1 clickhouse clickhouse  75 May 17 16:29 columns.txt
-rw-r-----. 1 clickhouse clickhouse   1 May 17 16:29 count.txt
-rw-r-----. 1 clickhouse clickhouse  88 May 17 16:29 data.bin
-rw-r-----. 1 clickhouse clickhouse 112 May 17 16:29 data.mrk3
-rw-r-----. 1 clickhouse clickhouse  10 May 17 16:29 default_compression_codec.txt
-rw-r-----. 1 clickhouse clickhouse   8 May 17 16:29 primary.idx

all_2_2_0:
total 28
-rw-r-----. 1 clickhouse clickhouse 187 May 17 16:30 checksums.txt
-rw-r-----. 1 clickhouse clickhouse  75 May 17 16:30 columns.txt
-rw-r-----. 1 clickhouse clickhouse   3 May 17 16:30 count.txt
-rw-r-----. 1 clickhouse clickhouse 303 May 17 16:30 data.bin
-rw-r-----. 1 clickhouse clickhouse 112 May 17 16:30 data.mrk3
-rw-r-----. 1 clickhouse clickhouse  10 May 17 16:30 default_compression_codec.txt
-rw-r-----. 1 clickhouse clickhouse   8 May 17 16:30 primary.idx

all_3_3_0:
total 44
-rw-r-----. 1 clickhouse clickhouse  76 May 17 16:30 birth.bin
-rw-r-----. 1 clickhouse clickhouse  48 May 17 16:30 birth.mrk2
-rw-r-----. 1 clickhouse clickhouse 326 May 17 16:30 checksums.txt
-rw-r-----. 1 clickhouse clickhouse  75 May 17 16:30 columns.txt
-rw-r-----. 1 clickhouse clickhouse   4 May 17 16:30 count.txt
-rw-r-----. 1 clickhouse clickhouse  10 May 17 16:30 default_compression_codec.txt
-rw-r-----. 1 clickhouse clickhouse 483 May 17 16:30 id.bin
-rw-r-----. 1 clickhouse clickhouse  48 May 17 16:30 id.mrk2
-rw-r-----. 1 clickhouse clickhouse 921 May 17 16:30 name.bin
-rw-r-----. 1 clickhouse clickhouse  48 May 17 16:30 name.mrk2
-rw-r-----. 1 clickhouse clickhouse   8 May 17 16:30 primary.idx

detached:
total 0