ClickHouse’s MergeTree table engine uses sparse indexing for its primary index and data-skipping indices as a secondary index. These indices are used to speed up the data retrieval from the disk. More recently, ClickHouse has introduced inverted indices as an experimental feature. This is to speed up the text searching on String columns and provide a full-text searching superpower to ClickHouse. Once, this feature becomes Generally Available as a stable feature, it could be a game-changer for the ClickHouse database. This post is intended to be a gentle introduction to inverted indices in ClickHouse.
Inverted indices store a mapping of words to their locations in the database table or document. Let us look at an example. Consider the following table.
[table id=26 /]
When we create an inverted index for the above table, the index contents would be something similar to this.
[table id=27 /]
Under the hood, the words are split into uni-grams (one-word sequences) based on white spaces, and store the occurrence of each of these words in another column based on the row position. The splitting of words can be bi-grams (two-word sequences) or n-grams.
Let us look at an example of creating and using inverted indices. I am going to use recipes dataset which is available for download here. After downloading the data, extract the archive and you will get a file called full_dataset.csv. Connect to a ClickHouse server and create the below table.
CREATE TABLE recipes ( title String, ingredients String, directions String, link String, source LowCardinality(String), NER String ) ENGINE = MergeTree ORDER BY title;
After creating the table, let us insert the data from the csv file via clickhouse-client tool.
clickhouse-client --password --query " INSERT INTO recipes SELECT title,ingredients, directions, link, source, NER FROM input('num UInt32, title String, ingredients String, directions String, link String, source LowCardinality(String), NER String') FORMAT CSVWithNames " --input_format_with_names_use_header 0 --format_csv_allow_single_quote 0 --input_format_allow_errors_num 10 < full_dataset.csv
Let us enable the experimental feature for now after logging in to the ClickHouse server via clickhouse-client. Once the inverted indices feature reaches GA, this step would not be needed anymore.
SET allow_experimental_inverted_index = true;
Let us verify the speed of a query on the newly populated table before creating an index.
chistadata :) SELECT COUNT( DISTINCT title) FROM recipes WHERE like(directions, 'olive') AND like(directions, 'cheese'); SELECT COUNTDistinct(title) FROM recipes WHERE (directions LIKE 'olive') AND (directions LIKE 'cheese') Query id: dddcde80-926e-4446-9e34-e39eac3d330c ┌─uniqExact(title)─┐ │ 0 │ └──────────────────┘ 1 row in set. Elapsed: 0.200 sec. Processed 2.23 million rows, 1.15 GB (11.17 million rows/s., 5.74 GB/s.)
Let us create the index using the ALTER TABLE statement.
ALTER TABLE recipes ADD INDEX inverted_index(directions) TYPE inverted;
This will add an index named inverted_index to the table based on
directions column. Let us materialize the index.
ALTER TABLE inverted_index_example MATERIALIZE INDEX inverted_index;
After this, let us run a query on the table.
chistadata :) SELECT COUNT( DISTINCT title) FROM recipes WHERE like(directions, 'olive') AND like(directions, 'cheese'); SELECT COUNTDistinct(title) FROM recipes WHERE (directions LIKE 'olive') AND (directions LIKE 'cheese') Query id: e8659fee-1fb2-4b06-a5e9-504bd7805feb ┌─uniqExact(title)─┐ │ 0 │ └──────────────────┘ 1 row in set. Elapsed: 0.191 sec. Processed 2.23 million rows, 1.15 GB (11.67 million rows/s., 6.00 GB/s.)
We can see that the speed has increased for the same query using inverted index. As this feature is still in the experimental stage, we could expect a few more changes, additional enhancements, and features on this and I will update this post regularly.
In this blog article, we have seen how to create inverted index in ClickHouse, and compare the difference. Inverted index can lend search capabilities within ClickHouse akin to ElasticSearch.