In this blog post, I’m going to take on the question of why we need ColumnStores for analytics, what options we have, and what makes ClickHouse compelling.
Background and History
Historically, popular datastores have been row-based, even from the early days of relational databases. Row orientation offered predictability and reliability for retrieving the data a query needs. The obvious disadvantage was the time required to read all the data, at least until modern indexing algorithms were in place. Indexing solved the majority of data-access problems while data sets were small. For ACID-compliant databases, access to large data sets has always carried overhead, specifically the need for high I/O throughput, which caused decades of hardship until more memory (buffers and caches) and SSDs became cheaply available in commodity hardware. Now the challenge has moved to the cloud, where IOPS are costly and bandwidth is still limited. A new era of computing focuses on breaking workloads into pieces across smaller units of shared hardware, with the help of the Kubernetes ecosystem.
Over the past few decades, databases have broadly handled two types of workloads: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). OLTP applications generally cover write-heavy workloads of small, short-running requests, so they need to be fast. Multithreaded processing architectures helped OLTP applications scale to more requests per second, and the focus was on helping them handle large-scale workloads through improvements ranging from storage engines to faster cache implementations. OLAP, in contrast, was long seen as batch-oriented: long-running processing over large data sets, mostly ETL jobs and literally overnight processing of data from CDC pipelines. OLAP requirements have since evolved to cover not only historical or archival data but also on-demand analysis. Say you have a gaming application and, with the help of AI/ML algorithms, you want to offer some coins to players based on their past activities. This is where it only begins…
Targeting read-only use cases, column-store databases have a huge advantage over row-based implementations, which makes them a great fit for analytical processing of large data sets. The use case is also largely non-transactional: atomicity, consistency, isolation, and durability aren’t the application’s number one priority; what matters is a robust system that delivers efficient queries over large databases. Another huge advantage ColumnStores have over traditional OLTP databases is data aggregation. Since tables are stored in denormalized form with a very large number of columns, aggregating while reading only the columns involved is a huge win. This may sound as awkward to me as it does to you, given how far the technology for consistent data stores at scale has advanced, but the audience here is different.
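To make the aggregation point concrete, here is a toy Python sketch (illustrative only, not ClickHouse internals) of the same table stored row-wise and column-wise:

```python
# Toy illustration: the same table stored row-wise and column-wise,
# then aggregating a single column.

rows = [  # row store: every row carries all columns
    {"user_id": 1, "country": "US", "coins": 120},
    {"user_id": 2, "country": "DE", "coins": 75},
    {"user_id": 3, "country": "US", "coins": 200},
]

columns = {  # column store: each column is a contiguous array
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "coins": [120, 75, 200],
}

# Row store: the scan must touch every column of every row.
row_total = sum(r["coins"] for r in rows)

# Column store: the scan reads exactly one array and nothing else.
col_total = sum(columns["coins"])

assert row_total == col_total == 395
```

With hundreds of columns in a denormalized table, reading one contiguous array instead of hundreds of wide rows is where the win comes from.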
Unlike some OLTP solutions, you don’t have to be a hyperscaler to utilize a column-store DBMS. Data gets bigger every day, even if your daily intake is relatively small, and analytical data reaches a level that cripples your OLTP systems sooner than you think. At that point, the obvious solution is to offload that data to a system that remains highly available and performant.
Some of the well-known column-oriented commercial DBMSs are Vertica, Redshift, Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB, LucidDB, SAP HANA, Google Bigtable, and more. Many have been battle-tested on different occasions, and each has its pros and cons. Here, though, we’re going to look at ClickHouse rather than these established options.
How is vectorized computing influencing ColumnStores?
The vectorization technique helps analytical calculations run faster with the help of mathematical libraries in code. ColumnStores take advantage of this: vectorized data processing makes efficient use of computing power and helps build faster analytical query engines. The vectorized model fits the columnar representation of data very well, enabling highly optimized query-processing algorithms. It also makes efficient use of the CPU cache by touching only the columns a query needs. The key difference from row/tuple processing is that an aggregation operator can process a block of n values at a time instead of one row at a time, which speeds up processing considerably.
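The block-at-a-time idea can be sketched in plain Python; the function names below are illustrative, not ClickHouse’s actual operator interfaces:

```python
# A minimal sketch of block-at-a-time ("vectorized") execution versus
# tuple-at-a-time execution. Function names are illustrative only.

def sum_tuple_at_a_time(rows):
    # Classic iterator model: one step (and its overhead) per row,
    # and each row drags along every column of the tuple.
    total = 0
    for row in rows:
        total += row["coins"]
    return total

def sum_block_at_a_time(coins_column, block_size=4):
    # Vectorized model: the operator receives a block of n values
    # from a single column and aggregates them in one tight loop,
    # which keeps the data hot in the CPU cache.
    total = 0
    for start in range(0, len(coins_column), block_size):
        block = coins_column[start:start + block_size]
        total += sum(block)  # one operator call per block, not per value
    return total

rows = [{"user_id": i, "coins": i * 10} for i in range(1, 9)]
coins = [r["coins"] for r in rows]

assert sum_tuple_at_a_time(rows) == sum_block_at_a_time(coins) == 360
```

In a real engine the inner per-block loop compiles down to tight SIMD code over a contiguous array, which is where the speedup comes from.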
What are the success stories we commonly hear about?
As with many other open-source projects, adoption numbers, data, and details are hard to determine due to the nature of the licensing: literally anyone can adopt the software and doesn’t need to publish their use case. And with security and compliance requirements hardening every day, some will opt not to disclose that they even use these technologies, to avoid being targeted when a security vulnerability surfaces, or for other legal reasons.
The list of use cases is long across various industries. What’s interesting to see is that more examples now involve live data analysis rather than the original OLAP data warehouses. What I mean by that is that AI/ML requirements can be satisfied within the ClickHouse ecosystem, and the predictive functions of a strong datastore are going to be in huge demand. This is a game-changer: a fast analytical system was a dream a decade ago. Why? An in-memory (fast) OLAP system was costly to operate, and it didn’t scale either. Costly and not scaling are two things the IT industry does not want to hear in the same sentence.
Here’s a quick list of ClickHouse’s advantages over traditional systems:
- Excellent compression
- Store data on disk for cost efficiency
- Vector computing
- Real-time data updates based on sort keys
- Rich choice of aggregates
- Data Skipping Indexes with Bloom Filters
- Approximate calculations
- Built-in Replication and Sharding ecosystem
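As a rough illustration of how a Bloom-filter-based skipping index works (a toy sketch, not ClickHouse’s actual implementation), each block of rows gets a small filter, and a lookup consults the filters to skip blocks that cannot contain the value:

```python
# Toy Bloom-filter "data skipping" sketch. Each granule (block of rows)
# of a column gets a tiny bitmap; a lookup consults the bitmaps to skip
# granules that definitely cannot contain the value.

import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, value):
        # Derive deterministic bit positions from salted hashes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False -> definitely absent; True -> possibly present.
        return all(self.bits >> pos & 1 for pos in self._positions(value))

# Two granules of a 'country' column, each with its own filter.
granules = [["US", "DE", "FR"], ["JP", "BR", "IN"]]
filters = []
for granule in granules:
    bf = BloomFilter()
    for value in granule:
        bf.add(value)
    filters.append(bf)

# Query WHERE country = 'JP': only granules whose filter says
# "possibly present" need to be read from disk.
to_read = [i for i, bf in enumerate(filters) if bf.might_contain("JP")]
assert 1 in to_read  # the granule that holds 'JP' is never skipped
```

Bloom filters can produce false positives (a granule read unnecessarily) but never false negatives, which is exactly the safe direction for an index whose job is to skip I/O.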
ClickHouse provides various methods for connecting to the server. Some of them are:
- Command-line client (clickhouse-client)
- Native TCP
- HTTP Interface
- MySQL Wire Protocol
- JDBC/ODBC connectors
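Of these, the HTTP interface (on port 8123 by default) is the simplest to try. Here is a minimal Python sketch of how a query is encoded into a request URL; the host name is a placeholder and no request is actually sent:

```python
# Sketch of ClickHouse's HTTP interface: queries are passed in the
# URL-encoded 'query' parameter on port 8123 by default. No request
# is sent here; urllib.request.urlopen(url) would run it against a
# live server.

from urllib.parse import urlencode

def clickhouse_http_url(host, query, port=8123):
    # Build the GET-style URL that the HTTP interface accepts.
    return f"http://{host}:{port}/?{urlencode({'query': query})}"

url = clickhouse_http_url("localhost", "SELECT 1")
assert url == "http://localhost:8123/?query=SELECT+1"
```

This is why ClickHouse is so easy to script against: anything that can issue an HTTP request, from curl to a dashboard backend, can run a query.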
ClickHouse also does one thing differently from others: instead of pulling everything into its own engines, it integrates popular external data sources as engines. This makes the adoption and orchestration of data much more robust than in other systems.
The list of supported integrations is long, including engines and functions for sources such as Kafka, MySQL, PostgreSQL, S3, and HDFS.
In conclusion, ClickHouse has eyebrow-raising features for anyone looking for a fast, efficient, open-source solution for analytics workloads in a data management platform.