Data is a collection of information that can be used for many different purposes, including big-volume data analysis, external data integration, and many more in ClickHouse. Broadly, engines play a key role in these purposes and provide various solutions for different use cases and patterns. Knowing how these engines affect your data and which engine fits your architecture is important. All of this engine variety is unique to ClickHouse and requires experienced DBAs to make decisions.
What are the engine types?
Using a log family engine is recommended for less than 1 million rows, but what is the MergeTree family good for? Along with them, what about external data integration or special engines? All of them are mainly used for special purposes, and the following schema will be helpful to see the whole picture in ClickHouse engine architecture. This article gives a great opportunity to introduce the MergeTree technology of the MergeTree family under the Table Engines.
The MergeTree family is the most functional table engine for high-load environments and should be preferred for big-volume data analysis. Their main features are supporting data partitioning, data sampling, and TTL. On the other hand, since the above schema does not contain any information about replication, you may be wondering where the replication solution is located. I’d like to point out that all 7 tree engines in the MergeTree family can be prefixed with “Replicated”. Thus, we can clearly say that the MergeTree family supports replication solutions besides the above features.
What is MergeTree?
MergeTree engine is the main engine of the MergeTree family and supports almost all ClickHouse core functions. This engine is designed to insert a large amount of data into a table and write to the disk after sorting based on the primary key. Also, it drives the rules that are applied for the merging process in the background. The MergeTree table engine is more suitable for single-node server environments. Let’s see how it works:
As shown in the following DDL, let’s create a table using the MergeTree engine and examine them:
CREATE TABLE testTable ( id UInt16, created_time Date, comment String ) ENGINE = MergeTree() ORDER BY (id, created_time) PARTITION BY toYYYYMM(created_time) TTL created_time + INTERVAL 3 MONTH;
Specifying the mandatory ENGINE and ORDER BY parameters was sufficient to create the table, but additional PARTITION BY and TTL parameters were determined according to our business requirements.
It is important to know that primary keys are not used for deduplication in the MergeTree engine, inserting multiple rows with the same primary key is possible.
At the DDL above, although a primary key value was not specifically mentioned, the primary key value of the table was determined as (id, created_time). Because these values were set as a sort key in the ORDER BY value. The storage is sorted by the primary key, and the data is partitioned by month due to the PARTITION BY’s value. Also, partition names will start with as 202208_*
It’s not always a good idea to store data forever. According to the last additional parameter, the data of the last three months is retained in the storage.
Let’s insert some data records in the testTable with the same primary key:
INSERT INTO testTable VALUES (1,'2022-08-11', 'hello'); INSERT INTO testTable VALUES (1,'2022-08-11', 'world'); INSERT INTO testTable VALUES (2,'2022-08-15', 'this'); INSERT INTO testTable VALUES (2,'2022-08-15', 'is'); INSERT INTO testTable VALUES (2,'2022-08-17', 'ChistaDATA');
Since MergeTree is similar to the LSM-tree structure, ClickHouse supports high write throughput. Because all disk writes are sequential append. Also, when checking the default data path shows that each data partition is independent of the other and is not technically related before background compressions are applied. This structure minimizes the volume of data to be retrieved during reading operation.
root@MT01:/var/lib/clickhouse/data/default/testTable# ls -lrt total 32 -rw-r----- 1 clickhouse clickhouse 1 Aug 29 16:26 format_version.txt drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 detached drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_1_1_0 drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_2_2_0 drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_3_3_0 drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_4_4_0 drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_5_5_0
After a while, MergeTree background processes were applied by the merged pool, and partitions were merged month by month. This is also where the name of MergeTree comes from.
root@MT01:/var/lib/clickhouse/data/default/testTable# ls -lrt total 12 -rw-r----- 1 clickhouse clickhouse 1 Aug 29 16:26 format_version.txt drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 detached drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_1_5_1
What about SELECT?
We have always been interested in INSERT and its performance until there, but what about SELECT queries? Are you really taking advantage of primary keys?
SHOW CREATE TABLE testTable;
It is sufficient to run the above command to learn all of the DDL statements of our table, including primary keys in the ClickHouse session. This way is useful for determining indexes, before executing complex queries especially. Of course, according to the type of primary key, you need to make sure whether the queries can use the index, depending on “force_index_by_date” and “force_primary_key” parameters.
Also, attempting to read and write data simultaneously, the query operation will be performed in the index block that has been inserted, and the index block which is being written but has not been written at that time, is excluded. Therefore, this process does not use any lock mechanism, and the insert operation does not block the read operation.
ClickHouse architecture is mainly split up into two categories as, Table Engine and Database Engine, in order to provide different solutions. This article gives an overview of the engines and a sample demonstration of the MergeTree table engines; however, it must be born in mind that the type of solution is as effective as the database parameters in ClickHouse performance.