Transactional database systems run thousands of transactions per second, but they are not designed for large-scale analytics. ClickHouse, on the other hand, is an analytical database built for exactly that workload: features such as columnar storage and data compression make it the right choice for these operations, with significant benefits in both performance and cost.
Many organizations need to analyze and visualize their metrics in near real time on a dashboard. A traditional ETL process runs overnight, after the end of the business day, and transfers the data from transactional databases to analytical databases. The data is then aggregated so that the metrics can be visualized in BI tools like Superset, Tableau, and others.
MySQL or PostgreSQL alone will not be enough for businesses that need to process important, critical data quickly, or that need to work with their data in a real-time environment. Instead, an ETL solution that continuously pushes changes from transactional databases to an OLAP store such as ClickHouse would be ideal.
This is the problem that the ClickHouse Sink Connector, provided by Altinity, is built to solve.
The work is completely open source and licensed under Apache 2.0. Altinity encourages you to try the project and participate in its development. Participation could include testing (issues), ideas (discussions), and code contributions (pull requests). It's the same pattern Altinity follows for all the open-source projects it runs or maintains.
Sink Connector Objectives
Even though the main goal was to replicate data from MySQL to ClickHouse, we also wanted to develop a solution that makes it easy for engineers to configure, deploy, maintain, and operate efficiently. The following is the list of objectives for this new solution.
- Replication: Our core objective is to replicate/mirror data from MySQL to ClickHouse. Users can leverage the benefits of ClickHouse to perform analytics at scale.
- Performance: We have users who need to load millions of rows per second. High performance was, therefore, a key requirement. We have added metrics to monitor throughput, CPU, and memory usage.
- No Data Loss or Duplication: This is another key requirement for any customer running financially critical data workloads. It is not fun when the dashboard shows incorrect numbers because of data duplication or data loss. Whenever this happens, it forces data engineers to do a back-fill or re-run an entire pipeline, which could take hours or days.
- High Availability/Fault Tolerance: Our clients maintain production environments that are expected to run 24/7. They cannot afford to have any downtime. Our core objective is to design this solution so that it is easily recoverable without any side effects in case of a failure.
- Schema Evolution: Data from source systems are constantly changing/evolving because of new data fields captured from the user (Example: Sentiment analysis of survey responses). We need to be able to support this new data field without affecting the SLA.
- Transactions Support: MySQL's strength is its support for transactions. Even though ClickHouse does not have full transaction support, we wanted to design a solution that allows that transactional information to be persisted in ClickHouse, which makes troubleshooting and data lineage easier.
- On-Prem/Cloud/Kubernetes Support: Depending on the industry, clients might be running their workloads on-premises or on the cloud. Our primary objective is to design a solution that works everywhere and is easily maintainable with monitoring capabilities.
- Extensibility: Even though our initial goal is to support MySQL, we would like to leverage the effort to support other transactional databases like PostgreSQL, MariaDB, and MongoDB.
(Image source: Altinity)
Ingestor: The Ingestor is responsible for ingesting messages from Kafka. In addition to the ingestion functionality, this component also parses the user configuration. The full list of configuration parameters supported by the connector is documented in the Sink Configuration reference.
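As a minimal illustration of the configuration-parsing side of this component (a Python sketch, not the connector's actual Java code; the keys shown are hypothetical examples in the usual key=value style):

```python
# Illustrative sketch: parsing connector-style key=value configuration
# into a dict, the kind of work the Ingestor does before consuming
# from Kafka. The keys below are hypothetical examples.
def parse_sink_config(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

example = """
# Sink configuration (illustrative keys)
topics=SERVER5432.test.employees
clickhouse.server.url=clickhouse
clickhouse.server.port=8123
"""
print(parse_sink_config(example)["clickhouse.server.port"])  # 8123
```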
DeDuplicator: We explained above the drawbacks of duplicated data in the pipeline. This component creates a session window over incoming messages and removes duplicates within that window based on the primary key. The window duration is configurable by the user.
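A minimal sketch of window-based de-duplication on the primary key (illustrative Python, not the connector's actual implementation; the field names are hypothetical):

```python
from collections import OrderedDict

def deduplicate_window(messages, key_field="PersonID"):
    """Keep only the latest message per primary key within one window.

    Illustrative only: the real DeDuplicator works on a configurable
    session window of Kafka messages.
    """
    latest = OrderedDict()
    for msg in messages:
        latest[msg[key_field]] = msg  # a later message for the same key wins
    return list(latest.values())

window = [
    {"PersonID": 47, "City": "LOS ANGELES"},
    {"PersonID": 48, "City": "PHOENIX"},
    {"PersonID": 47, "City": "SAN DIEGO"},  # duplicate key within the window
]
# Two rows survive; key 47 keeps its latest value (SAN DIEGO)
print(deduplicate_window(window))
```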
Converter: This component, very similar to the converter functionality in Kafka Connect, is responsible for mapping data types from Kafka Connect/Debezium to ClickHouse data types.
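A sketch of what such a type mapping looks like (an illustrative Python subset; the real converter covers many more types such as dates, decimals, and nullable columns):

```python
# Illustrative subset of a Kafka Connect schema type -> ClickHouse type
# mapping. This is a sketch, not the connector's actual table.
CONNECT_TO_CLICKHOUSE = {
    "INT8":    "Int8",
    "INT16":   "Int16",
    "INT32":   "Int32",
    "INT64":   "Int64",
    "FLOAT32": "Float32",
    "FLOAT64": "Float64",
    "BOOLEAN": "Bool",
    "STRING":  "String",
}

def to_clickhouse_type(connect_type: str) -> str:
    # Falling back to String for unknown types is an assumption of this
    # sketch, not documented connector behavior.
    return CONNECT_TO_CLICKHOUSE.get(connect_type, "String")

print(to_clickhouse_type("INT32"))  # Int32
```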
Block Flush Handler: This component assembles blocks of data that are then passed to a background thread, which flushes them to ClickHouse using the JDBC driver. The block can be configured either by size in bytes or by time duration.
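The flush-by-size-or-time idea can be sketched as follows (illustrative Python; the real handler flushes to ClickHouse over JDBC from a background thread):

```python
import time

class BlockBuffer:
    """Illustrative sketch: accumulate rows and flush when either the
    byte threshold or the time threshold is reached. Not the
    connector's actual implementation."""

    def __init__(self, flush_fn, max_bytes=1_000_000, max_seconds=5.0):
        self.flush_fn = flush_fn          # callback standing in for the JDBC flush
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.rows, self.size = [], 0
        self.started = time.monotonic()

    def add(self, row: bytes):
        self.rows.append(row)
        self.size += len(row)
        if (self.size >= self.max_bytes
                or time.monotonic() - self.started >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)      # hand the block off for writing
        self.rows, self.size = [], 0
        self.started = time.monotonic()

flushed = []
buf = BlockBuffer(flushed.append, max_bytes=10)
buf.add(b"row-1")   # 5 bytes, below the threshold
buf.add(b"row-2")   # total reaches 10 bytes, triggers a flush
print(len(flushed))  # 1
```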
Setup – Docker compose
All the components of the Altinity Sink Connector for ClickHouse can be set up locally using Docker Compose. Below are the instructions for setting up the Docker Compose environment.
Prerequisites: Docker and Docker Compose need to be installed.
The following services are included in the Docker Compose file:
- MySQL Primary (Bitnami Image)
- MySQL Secondary (Bitnami Image)
- Apicurio Schema Registry
- Altinity Sink Connector for ClickHouse
Clone the repo with the following command.
```bash
git clone https://github.com/Altinity/clickhouse-sink-connector.git
```
Run docker-compose to start all the components:

```bash
cd deploy/docker
docker-compose up
```
Add MySQL connector: If you wish to connect to a different MySQL instance, you can update the connection information in the following file before executing it.
```bash
MYSQL_HOST="mysql-secondary"
MYSQL_PORT="3306"
MYSQL_USER="root"
MYSQL_PASSWORD="root"
# Comma-separated list of regular expressions that match the databases for which to capture changes
MYSQL_DBS="test"
# Comma-separated list of regular expressions that match fully-qualified table identifiers of tables
MYSQL_TABLES="employees"
#KAFKA_BOOTSTRAP_SERVERS="one-node-cluster-0.one-node-cluster.redpanda.svc.cluster.local:9092"
```
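The database and table filters are comma-separated regular expressions. A minimal sketch of how such filters decide whether a table is captured (illustrative Python, not Debezium's actual matching code):

```python
import re

# Illustrative sketch: applying comma-separated regex filters such as
# MYSQL_DBS and MYSQL_TABLES to decide which tables are captured.
def is_captured(db, table, db_filters="test", table_filters="employees"):
    def matches(name, filters):
        return any(re.fullmatch(p.strip(), name) for p in filters.split(","))
    return matches(db, db_filters) and matches(table, table_filters)

print(is_captured("test", "employees"))  # True
print(is_captured("test", "salaries"))   # False
```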
After the Docker components are up and running, the Kafka connectors (source and sink) need to be created. Execute the following script to add the MySQL connector to Kafka Connect:

```bash
cd deploy
./debezium-connector-setup-schema-registry.sh
```
If you plan to connect to a different ClickHouse server, then update the connection information in the following file.
```bash
CLICKHOUSE_HOST="clickhouse"
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER="root"
CLICKHOUSE_PASSWORD="root"
CLICKHOUSE_TABLE="employees"
CLICKHOUSE_DATABASE="test"
```
Edit the Docker Compose YAML file. Find the sink container section and change the image tag as shown below.
```yaml
sink:
  container_name: sink
  image: altinity/clickhouse-sink-connector:latest
  restart: "no"
```
Next, find the ClickHouse container and change the "soft" and "hard" ulimit parameters as shown below.
```yaml
clickhouse:
  # clickhouse-client --host=127.0.0.1 --port=9005 --user=root --password=root --database=test
  container_name: clickhouse
  image: clickhouse/clickhouse-server:latest
  restart: "no"
  ports:
    - "8123:8123"
    - "9005:9005"
  environment:
    - CLICKHOUSE_USER=root
    - CLICKHOUSE_PASSWORD=root
    - CLICKHOUSE_DB=test
  ulimits:
    nofile:
      soft: 262144
      hard: 262144
```
Add the Altinity Sink Connector for ClickHouse by executing its setup script from the deploy directory.
Now, check the tables on both the MySQL and ClickHouse servers.
```
Your MySQL connection id is 38
Server version: 8.0.31 Source distribution

Copyright (c) 2000, 2022, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
| test               |
+--------------------+
5 rows in set (0.00 sec)

mysql> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| Persons        |
+----------------+
1 row in set (0.00 sec)

mysql> select * from Persons limit 1;
+----------+----------+-----------+-------------+------------+
| PersonID | LastName | FirstName | Address     | City       |
+----------+----------+-----------+-------------+------------+
|       47 | SAYIN    | CAN       | LOS ANGELES | CALIFORNIA |
+----------+----------+-----------+-------------+------------+
1 row in set (0.01 sec)
```
Now check ClickHouse:
```
Connected to ClickHouse server version 22.9.3 revision 54460.

Warnings:
 * Linux is not using a fast clock source. Performance can be degraded.
   Check /sys/devices/system/clocksource/clocksource0/current_clocksource

66a9fa08f9b8 :) show databases;

SHOW DATABASES

Query id: ce264979-aeb8-46d8-a9eb-e9ffc8ecff5f

┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default            │
│ information_schema │
│ system             │
│ test               │
└────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

66a9fa08f9b8 :) use test;

USE test

Query id: ac75c954-7a60-4367-aed9-b92787046848

Ok.

0 rows in set. Elapsed: 0.002 sec.

66a9fa08f9b8 :) show tables;

SHOW TABLES

Query id: f51a71b6-a458-409c-b8bf-32895dc8c51a

┌─name────┐
│ Persons │
└─────────┘

1 row in set. Elapsed: 0.002 sec.

66a9fa08f9b8 :) select * from Persons limit 1;

SELECT * FROM Persons LIMIT 1

Query id: 4a14c6fa-6f3d-4eb0-96e3-9ac91f9281af

┌─PersonID─┬─LastName─┬─FirstName─┬─Address─────┬─City───────┐
│       47 │ SAYIN    │ CAN       │ LOS ANGELES │ CALIFORNIA │
└──────────┴──────────┴───────────┴─────────────┴────────────┘

1 row in set. Elapsed: 0.003 sec.
```
Today, open-source databases are everywhere, and we hold and process our data in them. Sooner or later, we want to replicate, protect, or further process that data. The ClickHouse Sink Connector prepared by Altinity is one solution to this problem: it helps you send your MySQL data to ClickHouse via Kafka. You can try this completely open-source tool and contribute to its development.
In part 2 of this blog, we will demonstrate some of the features of the Sink Connector with a larger data set.
Starting with the ClickHouse 22.10 release, a Kafka connector ships with ClickHouse. You can read more in our ClickHouse 22.10 release notes.