ClickHouse LowCardinality Data Type
ClickHouse LowCardinality feature is not documented in detail and people still mistake the functional relevance of ClickHouse LowCardinality. ClickHouse has several data types and SQL implementation which is entirely different from other data warehousing systems, Few of them even doesn’t exist. In this post we have explained how ClickHouse LowCardinality works and its common use case.
ClickHouse LowCardinality Introduction
In simple terms, LowCardinality is a ClickHouse data type that can be used to modify any other ClickHosue data type (though more often used for strings), The most interesting fact here is you can apply this to existing data in a ClickHouse Server. We are explaining more details here with an Ad. Tech Database (which has approx. 600M records) example:
:) select campaign_name, count() from ad_campaign_tab group by campaign_name order by count() desc limit 10; ┌─campaign_name-────────┬──count()─┐ │ Nike │ 34545845 │ │ Goal-dude │ 6546836 │ │ Fast_pace │ 4493533 │ │ Pento_clck │ 8565984 │ │ Ball_trick │ 8537456 │ │ py_to_win │ 8556589 │ │ bty_boy │ 8465575 │ │ Pet_Poppy │ 7564646 │ │ car_vieny │ 7575368 │ │ Poll_var_one │ 7354799 │ └───────────────────────┴──────────┘ 10 rows in set. Elapsed: 9.791 sec. Processed 7819.226 million rows, 31.28 GB (91.28 million rows/s., 8.51 GB/s.)`
10 rows in set. Elapsed: 2.089 sec. Processed 172.34 million rows, 3.82 GB (82.51 million rows/s., 1.83 GB/s.)`
Now we will drill down further for more detail on the campaign from Nike:
:) select campaign_name, count(), campaign_code, sum(clicks) from ad_campaign_tab where campaign_name = 'Nike' group by campaign_name order by count() desc; ┌─campaign_name-─┬──count()─┬─campaign_code--─┬─sum(clicks)--─┐ │ Nike │ 34545845 │ 4792 │ 5476466474 │ └────────────────┴──────────┴─────────────────┴───────────────┘ 1 rows in set. Elapsed: 5.464 sec. Processed 445.67 million rows, 7.33 GB (45.72 million rows/s., 4.55 GB/s.)
Queries are relatively fast on Google Compute Engine C2 and C2D instances, but this can be optimized further. Let’s look at how campaign_name columns are stored:
:) SELECT column, any(type), sum(column_data_compressed_bytes) compressed, sum(column_data_uncompressed_bytes) uncompressed, sum(rows) FROM system.parts_columns WHERE (table = 'ad_campaign_tab') AND active AND (column LIKE '%campaign_name') GROUP BY column ORDER BY column ASC ┌─column─────────┬─any(type)─┬─compressed─┬─uncompressed─┬─sum(rows)─┐ │ campaign_name │ String │ 865321765 │ 1113512121 │ 628992695 │ └────────────────┴───────────┴────────────┴──────────────┴───────────┘
Here is the recipe for magic, We are changing the type of campaign_name to LowCardinality. This can be done directly with ALTER TABLE statement:
:) ALTER TABLE ad_campaign_tab MODIFY COLUMN campaign_name LowCardinality(String); 0 rows in set. Elapsed: 11.612 sec.
We did now an ALTER on the ad_campaign_tab table and it took 12 seconds approx. So what changed? check the storage details again:
┌─column─────────┬─any(type)──────────────┬─compressed─┬─uncompressed─┬─sum(rows)─┐ │ campaign_name │ LowCardinality(String) │ 259596529 │ 111351212 │ 628992695 │ └────────────────┴────────────────────────┴────────────┴──────────────┴───────────┘
So now on the storage, The compressed size has been reduced to almost 70% and uncompressed data by 90%, This will improve ClickHouse query performance definitely
Before implementing ClickHouse LowCardinality – SQL 1
10 rows in set. Elapsed: 9.791 sec. Processed 7819.226 million rows, 31.28 GB (91.28 million rows/s., 8.51 GB/s.)`
After implementing ClickHouse LowCardinality – SQL 1
10 rows in set. Elapsed: 2.436 sec. Processed 7819.226 million rows, 11.73 GB (253.71 million rows/s., 59.86 GB/s.)`
The query performance improved by 4x, The average use of available system resources also was lower compared to before implementation of the ClickHouse LowCardinality feature.
Before implementing ClickHouse LowCardinality – SQL 2
1 rows in set. Elapsed: 5.464 sec. Processed 445.67 million rows, 7.33 GB (45.72 million rows/s., 4.55 GB/s.)
After implementing ClickHouse LowCardinality – SQL 2
1 rows in set. Elapsed: 2.719 sec. Processed 445.67 million rows, 11.97 GB (689.29 million rows/s., 8.19 GB/s.)
In SQL 2, The query performance improved by 2x, This also happened like magic. We haven’t changed the structure of the application for performance. The actual use case of ClickHouse LowCardinality is on long strings. We have witnessed much better performance (20-25x) on time-series databases where you can expect long path-like data structures.
ClickHouse is built for high-performance analytics using best of dictionary encoding features in which strings are encoded as ‘positions’, referencing dictionary with position-to-string mapping. ClickHouse LowCardinality is a performance accelerator where the source strings are longer with a minimal number of distinct values. Technically, ClickHouse creates separate file or several files to store LowCardinality dictionaries. When all the LowCardinality columns can fit on 8192 distinct values ClickHouse creates a single common file per table. But, When number of distinct values are higher ClickHouse will create one file per LowCardinality column. The common dictionary is more resource utilization efficient since ClickHouse server can cache and maintain pre-calculated hashes for group by and sort / search intensive operations. Also, If you are using distributed queries, ClickHouse also tries to operate on dictionary positions for most of query processing, replacing positions with actual strings as late as possible.