Apr 16, 2025

Lessons learned from 5 years operating huge ClickHouse® clusters: Part II

This is the second part of the series. Here's more of what I've learned from operating petabyte-scale ClickHouse clusters for the last 5+ years.
Javier Santana
Co-founder

You can read the first part of this series here.

Handling load

This section is mostly about reads. I covered ingestion in the previous post, and while reads and writes share the same resources, I'm going to focus on reads in isolation, as if you only had reads.

Well, I lied, because I'm going to start by telling you this: you can't decouple reads and writes. If you see a benchmark that only reports read performance, it may look nice, but it won't hold up in real life.

Reads depend a lot on how many parts a table has, and the number of parts depends on the ingestion. If you are inserting data often, you'll get penalized while reading no matter what schema your table has (more about this in performance). You can reduce parts by running merges more often, but you'll need more CPU.
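
If you want to see where you stand, the part count per table is easy to check from system.parts. A quick example (what counts as "too many" depends entirely on your workload):

    -- active parts per table; sanity-check this after changing ingestion patterns
    SELECT database, table, count() AS active_parts
    FROM system.parts
    WHERE active
    GROUP BY database, table
    ORDER BY active_parts DESC
    LIMIT 20;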

When it comes to reads, there are many topics I could cover. I'll start here: 

  1. Handling different kinds of traffic 
  2. Hardware resources
  3. Query design

About handling different kinds of traffic and hardware

In a cluster, you usually have:

  • Real-time traffic, aka queries that need to be answered in less than X seconds, usually serving a dashboard or some other real-time workload. 
  • Long-running queries, which could come from an ad hoc analysis (like someone who decides they need to know the average of something over 7 years of data).
  • Backfills (these deserve their own section below).
  • Anything else that's not real time, that is, queries where it doesn't matter how long they take to run.

Solving this is a never-ending design topic: how do I keep long-running queries from hurting real-time query latency, especially at p95 and above? There are many ways to approach it, for example, by having different replicas for different workloads. But that's expensive because you need to allocate hardware for both, and while real-time traffic is usually predictable, one-off queries just "happen", and you can't plan ahead. You could be smarter and handle that in the application layer (before sending the query), but you probably have 3-4 different sources of queries: the app, people running clickhouse-client, people running a BI, and people running a BI. Yes, I counted it twice because BI queries are usually really bad.

You could create spot replicas to run those long-running queries, but that needs extra handling and adds complexity. At Tinybird, we handle it with a combination of app logic and a load balancer. We know the status of the replicas and, based on that, pick the right one to send the query to. Sometimes we reject the query altogether to avoid crashing the server.

About query design

Queries have infinite configuration options, but one of the most important ones is max_threads. It controls how many threads a query can use to read data. In general, real-time queries should only use 1; if you need more than 1-2 threads and you have many QPS, you'll need a lot of hardware. You need to understand that, most of the time, your bottleneck here is scan speed. So you can keep adding CPUs, but you are still limited by how fast you can scan. A rule of thumb: you can scan about 500 MB/s on a single machine on average. This is, of course, not true for every kind of machine or workload, but it's a good rule of thumb. The other important settings are those that control memory:

  • max_memory_usage
  • max_bytes_before_external_group_by 

Both control how memory is used, and you'll need to decide how much memory you give to each query. If you want to run massive GROUP BYs, JOINs, window functions, or anything like that, you want max_memory_usage to be high. If you want those queries to be fast, max_bytes_before_external_group_by should also be high, to avoid ClickHouse dumping partial aggregation state to disk and killing performance.
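
As a rough sketch of what this looks like at the query level (the table, columns, and values here are illustrative, not recommendations):

    -- cap memory for this query and let the GROUP BY spill to disk past ~10 GB
    SELECT user_id, count() AS hits
    FROM events
    GROUP BY user_id
    SETTINGS
        max_threads = 1,                                    -- real-time queries: keep this low
        max_memory_usage = 20000000000,                     -- ~20 GB hard limit for this query
        max_bytes_before_external_group_by = 10000000000;   -- spill aggregation state to disk past ~10 GB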

ClickHouse does not have a cost-based optimizer like traditional databases; you are the optimizer. And your users don't care about optimizing (that feature or report needs to be done ASAP). Query optimization is an art, but there are some simple rules you have to follow (there's a sketch applying them right after the list):

  1. Filter on columns in the sorting key first. Sorting key design is super important as well, so pay attention to it. Believe me, you are losing 10-100x on your queries with a bad sorting key design. Spend 2-3 hours understanding the data layout.
  2. Run other filters second, and try to use PREWHERE if you have large columns. Move high-selectivity filters on small columns (not Strings, not Arrays) to PREWHERE.
  3. Then run everything else that filters data out, mostly IN operations.
  4. Save JOINs, GROUP BY, and other complex operations for last.
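
Here's that sketch: a minimal, made-up example over a hypothetical events table sorted by (tenant_id, timestamp); every name is illustrative:

    SELECT device, count() AS hits
    FROM events
    PREWHERE status_code = 500                    -- rule 2: selective filter on a small column
    WHERE tenant_id = 42                          -- rule 1: sorting key columns first
      AND timestamp >= now() - INTERVAL 1 DAY     -- rule 1: still within the sorting key
      AND country IN ('ES', 'US')                 -- rule 3: other filters that discard data
    GROUP BY device                               -- rule 4: aggregation last
    ORDER BY hits DESC
    LIMIT 10;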

If you master these rules, I can guarantee you will be in the top 5% of ClickHouse users. Actually, give any LLM those rules plus your table schema, and it will do a great job. By the way, we wrote more about these things here.

As you may have expected, it's not quite that simple; otherwise, query planners would be a few lines of code (and they are not). Depending on cardinality and other factors, you may need to bend the rules.

So that's Query Design 101, and yes, it is really more about the user than the operator, but you still need to know these things because you are going to be fighting users running bad queries all the time. Even though ClickHouse has a total memory limit, it can still OOM. OOM is bad; depending on your database, the server can take minutes to come back up and load everything (so you also need to design your HA setup properly).

The other setting you want to pay attention to is max_concurrent_queries. It can save you when the server is overloaded, so do not remove it. You may need to tune it depending on your workload, but keep it; it's a lifesaver.
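
A quick way to see how close you are to that limit is the Query metric in system.metrics, which counts currently executing queries:

    SELECT value AS running_queries
    FROM system.metrics
    WHERE metric = 'Query';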

Backfills

I'm adding this section because backfills are so painful that if I manage to save even one person one hour of pain, this article will have been worth it.

Let me explain the scenario: you are inserting into a table, that table has a materialized view, you add another materialized view, and you want to backfill it. This sounds simple, but it's not.

You might be tempted to use POPULATE... Don't. It's unreliable: data inserted while the view is being populated can be missed or end up duplicated. From the official documentation:

We do not recommend using POPULATE, since data inserted in the table during the view creation will not be inserted in it.

If you instead run an INSERT INTO ... SELECT * FROM table over the whole history, you risk losing or duplicating data (the source table keeps receiving live inserts while you copy), and you'll need a lot of memory to do it.

So, when backfilling materialized views, here's what to do: put a Null table in front of the actual source, attach a copy of the materialized view to it, and increase the rows you push per block to avoid generating thousands of parts.
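
For illustration, here's a hedged sketch of that pattern; every name is made up, the block-size values are just examples, and the date ranges and cutoffs need to match your live ingestion:

    -- 1. A Null table with the same schema as the source: inserts into it store nothing,
    --    but materialized views attached to it still see every block.
    CREATE TABLE events_null AS events ENGINE = Null;

    -- 2. A temporary copy of the new MV, reading from the Null table and writing
    --    to the same target table as the real MV.
    CREATE MATERIALIZED VIEW mv_backfill TO events_daily_agg AS
    SELECT toDate(timestamp) AS day, country, count() AS hits
    FROM events_null
    GROUP BY day, country;

    -- 3. Backfill historical ranges in chunks, with bigger insert blocks to avoid
    --    generating thousands of tiny parts.
    INSERT INTO events_null
    SELECT * FROM events
    WHERE timestamp >= '2024-01-01' AND timestamp < '2024-02-01'
    SETTINGS min_insert_block_size_rows = 10000000, min_insert_block_size_bytes = 1000000000;

    -- 4. Clean up once the backfill is done.
    DROP TABLE mv_backfill;
    DROP TABLE events_null;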

I can’t miss the opportunity to explain how to do this in Tinybird: just change the MV SQL and run tb deploy; we’ll take care of all of this.

Operating the cluster and monitoring

Aside from the regular stuff you monitor (CPU, mem, IO), you also want to track things like:

  • Number of queries running
  • ZooKeeper latencies
  • S3 errors (if you use it)
  • Replication lag
  • DDL queue length (the queue of DDL statements that every replica needs to execute)
  • Merge queue
  • Memory used by merges
  • Mutations

Things get stuck sometimes, and killing the queries does not always work, so keep an eye on everything.
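
Most of those signals live in system tables. A few example checks (column names are from recent ClickHouse versions; adjust to yours):

    -- replication lag and queue size per table
    SELECT database, table, absolute_delay, queue_size
    FROM system.replicas
    ORDER BY absolute_delay DESC;

    -- merges currently running and the memory they're using
    SELECT database, table, elapsed, progress, memory_usage
    FROM system.merges;

    -- mutations that haven't finished (or keep failing)
    SELECT database, table, mutation_id, latest_fail_reason
    FROM system.mutations
    WHERE NOT is_done;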

You should track errors as well. These are the important ones:

  • Read only tables (when you hit this one, you are very likely fucked)
  • ZooKeeper errors
  • Max simultaneous queries.

You should learn to work with the ClickHouse system tables. You can add all the metrics and dashboards you want, but when things go south, the system tables are what will save you.

  • system.query_log is your bible. This tells you everything about queries. The ProfileEvents column is where you go to understand what a query did. Learn how to use it, master it, and learn the column names.
  • system.processes tells you what's currently running. Useful in fires.
  • system.part_log tells you how parts move, what was merged, and so on.
  • You need to learn about system.tables and system.columns to understand the table shapes.

You may want to enable some others, but those are the critical ones.
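
As an example of the kind of digging you'll end up doing in system.query_log, here's one way to spot the heaviest recent queries (columns are from recent versions; adjust to yours):

    SELECT
        query_duration_ms,
        read_rows,
        formatReadableSize(read_bytes) AS read,
        formatReadableSize(memory_usage) AS mem,
        ProfileEvents['SelectedParts'] AS parts_read,
        normalizedQueryHash(query) AS query_hash
    FROM system.query_log
    WHERE event_time > now() - INTERVAL 1 HOUR
      AND type = 'QueryFinish'
    ORDER BY memory_usage DESC
    LIMIT 10;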

Track segfaults as well. They might happen, and you'll need to identify which query caused them. It doesn't happen often, and if your workload is stable, you may never see one. But if they do happen and you don't identify the query causing them, your cluster will go down all the time. And you don't want to be on call all the time.

I'm not going to talk too much about tooling (some people use K8s with the Altinity operator, others handle it with scripts), but everybody needs to:

  • Know how to add and remove replica nodes from the cluster
  • Be careful when dropping table replicas
  • Keep an eye on replication lag when adding a new replica
  • Watch for long-running queries when turning off a replica
  • Watch in-flight inserts

And so on. At Tinybird, we shut a replica down with a multi-step process that takes care of all of this (first remove traffic, then wait for in-flight inserts and queries, killing them if needed).

Be careful handling ON CLUSTER operations. As you may know, changing a table across the cluster requires some coordination. For that, ClickHouse uses ZooKeeper or ClickHouse Keeper to coordinate replicas. If a replica is down, ON CLUSTER operations will take a long time (depending on the config) and may generate issues in your app if timeouts are not properly set (and trust me, they're not).
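
As a sketch, the main knob on the client side is distributed_ddl_task_timeout; the cluster name and column here are made up:

    -- default is 180s; lower it so clients get an answer instead of hanging
    -- (the DDL keeps executing in the background on the replicas that are up)
    SET distributed_ddl_task_timeout = 60;

    ALTER TABLE events ON CLUSTER 'my_cluster'
        ADD COLUMN IF NOT EXISTS country LowCardinality(String);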

For super-low-latency use cases (under 20ms per query), you need to warm caches (take a look at the settings), because if you start sending traffic to a cold replica, latencies will go up like crazy. Watch the memory available for the mark cache. There is also an uncompressed cache but, to be honest, we never used it successfully, so let me know if you did.
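
A rough way to tell whether a replica is actually warm is to compare mark cache hits vs. misses in system.events before and after sending it some traffic:

    SELECT event, value
    FROM system.events
    WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');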

In general, alert on max_simultaneous_queries, connectivity issues, and queues growing without control. You will add more alerts on other things, but that depends heavily on what your cluster is designed for.

Last note: Altinity's knowledge base (and their videos) is probably one of the best resources for understanding how to run ClickHouse on your own. In general, all the content Altinity publishes is really good; a must-read if you operate ClickHouse.

Other stuff

Other random stuff you may find useful:

Table deletion 

By default, ClickHouse refuses to drop any table bigger than 50 GB (the max_table_size_to_drop server setting). This makes total sense, and you may even be tempted to lower the limit to a few MB. There are two kinds of developers/data engineers: 1) the ones who have removed a table by mistake, and 2) the ones who will. If you drop a table by mistake, you'll need to go to the backups, and that's a pain in the ass (unless you use the engine we developed, of course).
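
If you want to check what your server is running with, recent versions expose these limits in system.server_settings (older ones only have them in config.xml):

    SELECT name, value
    FROM system.server_settings
    WHERE name IN ('max_table_size_to_drop', 'max_partition_size_to_drop');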

Materialized views

MVs are a killer feature, but they are hard to manage, they lead to memory issues, and they generate a lot of parts. In general, you need to be careful with MVs because people tend to pile a lot of logic into them. MVs are executed on inserts, so every new MV will make your ingestion slower.

Don't kill your server

Some column types can make your server go south. For example, a uniqExactState over a column with high cardinality (200M+ unique values) can kill your cluster on merges. That usually happens with AggregateFunction columns, so be careful with them.

index_granularity can also kill your server if it's too low, because the primary index and marks grow huge and eat memory. My recommendation: do not go under 128. Low values are useful for point-like queries.
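
For reference, index_granularity is set per table; a purely illustrative example of a point-lookup table with a lower-than-default value:

    CREATE TABLE user_profile
    (
        user_id    UInt64,
        updated_at DateTime,
        payload    String
    )
    ENGINE = MergeTree
    ORDER BY user_id
    SETTINGS index_granularity = 256;  -- default is 8192; much lower and the primary index eats your memory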

Final thoughts

I focused a lot on things that can go wrong with ClickHouse, but that doesn't mean ClickHouse is a bad database—quite the opposite. Handling a database at scale is hard; if you have enough experience, you know it.

Anyways, I hope this saves you some headaches. (If you didn't read the first post, here it is again.)

