5 Critical Lessons from a ClickHouse Bottleneck That Slowed Cloudflare's Billing
At Cloudflare, we handle petabytes of data and millions of billing-related queries daily using ClickHouse, the popular open-source OLAP database. One day, our daily aggregation jobs—responsible for generating accurate invoices—became agonizingly slow after a routine migration. All the usual metrics (I/O, memory, rows scanned) looked normal. Yet the pipeline stalled. This is the story of that hidden bottleneck and the five key takeaways that saved our billing pipeline.
1. The Scale of the Billing Pipeline
Cloudflare’s billing system depends on a massive ClickHouse platform that stores over a hundred petabytes of data across multiple clusters. We built a system called “Ready-Analytics” to simplify onboarding: teams stream data into a single, wide table where every record follows a standard schema (20 floats, 20 strings, a timestamp, and an indexID). The primary key—(namespace, indexID, timestamp)—ensures each namespace’s data is sorted optimally for its queries. By December 2024, this table held over 2 PiB of data and ingested millions of rows per second. This scale made even small inefficiencies catastrophic when the billing pipeline suddenly slowed down.
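To make the setup concrete, here is a minimal sketch of what such a wide table could look like in ClickHouse DDL. The column names, the daily partitioning, and the engine choice are illustrative assumptions; only the column counts and the (namespace, indexID, timestamp) ordering come from the description above.

```sql
-- Illustrative sketch of the shared wide table (names and partitioning assumed).
CREATE TABLE ready_analytics
(
    namespace LowCardinality(String),
    indexID   UInt64,
    ts        DateTime,
    -- 20 generic float columns and 20 generic string columns, abbreviated here
    double01 Float64, double02 Float64, /* ... */ double20 Float64,
    string01 String,  string02 String,  /* ... */ string20 String
)
ENGINE = MergeTree
PARTITION BY toDate(ts)            -- daily partitions (relevant to section 3)
ORDER BY (namespace, indexID, ts); -- keeps each namespace's rows sorted together
```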

2. The Hidden Slowdown After a Migration
Following a routine migration, our daily aggregation jobs—powering hundreds of millions of dollars in usage revenue—stalled. We checked every standard culprit: I/O, memory, CPU, rows scanned, parts read. All were within normal ranges. Yet queries took far longer than expected. This was the first clue that the bottleneck wasn’t in the usual places. The system appeared healthy, but something deep inside ClickHouse was adding latency. It took forensic analysis of internal metrics to reveal the real issue.
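For context, that first-pass triage is the kind of check you can run against system.query_log. Below is a rough example; the filter on the query text is a placeholder, and on older releases the ProfileEvents column is exposed as paired arrays rather than a Map.

```sql
-- Per-query resource usage for recent runs of the aggregation jobs (filter is a placeholder).
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage,
    ProfileEvents['SelectedParts'] AS selected_parts
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= today() - 1
  AND query ILIKE '%billing_aggregation%'
ORDER BY query_duration_ms DESC
LIMIT 20;
```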
3. The One-Size-Fits-All Retention Policy
Before ClickHouse had native TTLs, we built retention by partitioning tables by day and dropping partitions older than 31 days. The Ready-Analytics table inherited this “one retention for all” approach. But different teams had vastly different needs: some required years of data for legal reasons, others only a few days. This forced many teams to avoid Ready-Analytics and endure a complex onboarding process. The restrictive 31-day policy was a known limitation, but it also masked a deeper inefficiency: the fixed-partition scheme was causing unnecessary I/O on older partitions during scans.
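For illustration, the legacy retention job amounted to something like the statements below; the table name and partition value are placeholders. The second statement shows how the same 31-day policy would be expressed with a native table-level TTL, which still cannot vary per namespace.

```sql
-- Legacy scheme: daily partitions, with a scheduled job dropping anything older than 31 days.
ALTER TABLE ready_analytics DROP PARTITION '2024-11-01';

-- The equivalent policy with a native TTL (still one retention for the whole table).
ALTER TABLE ready_analytics MODIFY TTL ts + INTERVAL 31 DAY;
```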

4. Diagnosing the Bottleneck Inside ClickHouse
After exhausting standard diagnostics, we dove into ClickHouse internals. The bottleneck wasn't in disk or CPU but in how the wide schema interacted with ClickHouse's merge process and index handling: the mark cache and primary key index were being repeatedly invalidated by the large number of columns, so every scan triggered excessive index reads, even for small result sets. From this analysis, three targeted patches emerged, covering mark cache eviction, index reads on wide tables, and prefetching of frequently accessed columns; the next section walks through each one.
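One way to surface this kind of behaviour, without patching anything, is to watch the server-wide mark cache counters. The queries below use stock ClickHouse system tables; reading a high miss ratio as the smoking gun is, of course, specific to a workload like ours.

```sql
-- Cumulative mark cache hits and misses since server start.
SELECT event, value
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');

-- Current mark cache footprint.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN ('MarkCacheBytes', 'MarkCacheFiles');
```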
5. The Three Patches That Fixed Everything
Our engineering team wrote three targeted patches to ClickHouse. The first improved the mark cache eviction algorithm so that frequently needed marks weren’t evicted prematurely. The second reduced the number of index reads required for wide tables by pre-filtering columns based on query patterns. The third added an internal hint for the query engine to prefetch column data for common fields like namespace and timestamp. After deploying these patches, the daily aggregation jobs returned to normal speed. The billing pipeline became reliable again, and the incident taught us the importance of looking beyond surface metrics when diagnosing performance issues in complex OLAP systems.
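As a sanity check on a rollout like this, one can compare aggregation query runtimes before and after the patched build using system.query_log. The sketch below assumes a placeholder query filter and date range.

```sql
-- Daily p50/p99 runtimes of the aggregation queries around the deployment window (placeholders).
SELECT
    event_date,
    quantile(0.5)(query_duration_ms)  AS p50_ms,
    quantile(0.99)(query_duration_ms) AS p99_ms,
    count()                           AS queries
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%billing_aggregation%'
  AND event_date BETWEEN '2024-12-01' AND '2024-12-14'
GROUP BY event_date
ORDER BY event_date;
```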
In conclusion, Cloudflare’s billing pipeline bottleneck was a stealthy one—hiding not in I/O or memory but in ClickHouse’s index management. By analyzing the problem step by step and writing targeted patches, we not only fixed the immediate slowdown but also gained insights that will help optimize our petabyte-scale analytics platform for years to come. The lesson: always question “normal” metrics when performance degrades.