Monitoring Query Latency Happening to ClickHouse with details about Waits and Latches causing Wait Errors

Shiv Iyer
Posted on February 28, 2023

In ClickHouse, you can monitor query latency and waits/latches using the system.query_log and system.metrics tables. Here is an example query that can be used to monitor query latency and wait/latch events:

    toDateTime(query_start_time) AS query_start_time,
    round((query_duration_ms / 1000), 2) AS query_duration_sec,
    round((read_duration_ms / 1000), 2) AS read_duration_sec,
    round((execution_time_ms / 1000), 2) AS execution_time_sec,
    round((result_rows / execution_time_ms) * 1000, 2) AS rows_per_sec,
    round((result_bytes / execution_time_ms) * 1000, 2) AS bytes_per_sec,
    arrayStringConcat(waits, ', ') AS waits,
    arrayStringConcat(latches, ', ') AS latches
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_start_time DESC
LIMIT 100;

This query selects query-related columns from the system.query_log table, including the query ID, start time, duration, read duration, execution time, rows per second, bytes per second, and waits/latches. The waits and latches are concatenated into comma-separated lists using the arrayStringConcat function.

To get more detailed information about the waits and latches, you can use the system.metrics table. Here is an example query that joins the system.metrics table with the system.query_log table to show detailed information about the waits and latches for a particular query:

    value AS count,
    round((duration_ms / 1000), 2) AS duration_sec,
    round((value / duration_ms) * 1000, 2) AS rate_per_sec
FROM system.metrics
ANY LEFT JOIN system.query_log ON query_id = metric_query_id
    AND event_name LIKE 'Wait%' OR event_name LIKE 'Latch%'
ORDER BY duration_sec DESC;

This query selects wait and latch-related columns from the system.metrics table and joins it with the system.query_log table using the query ID. The query filters by a particular query ID and event names starting with ‘Wait’ or ‘Latch’. The results include the event name, count, duration, and rate per second for each wait/latch event.

These queries can help identify slow queries and the waits/latches causing wait errors in ClickHouse. Note that monitoring query-related information can be resource-intensive, so it is important to use these queries judiciously and with care on busy systems.