The Hidden ClickHouse Bottleneck: 7 Key Insights from Cloudflare's Billing Pipeline Crisis
In the world of cloud infrastructure, few things are as alarming as a sudden slowdown in a mission-critical pipeline. For Cloudflare, that pipeline powers billions of dollars in usage revenue, fraud detection, and billing across millions of users. When daily aggregation jobs in ClickHouse – the open source OLAP database at the heart of their operations – ground to a halt after a routine migration, the engineering team faced a high-stakes mystery. All standard diagnostics showed normal metrics: I/O, memory, rows scanned, parts read. Yet jobs were crawling. This article unpacks seven crucial lessons from that crisis, revealing a hidden bottleneck deep within ClickHouse's internals and the three patches that finally restored order. Whether you're a database administrator or a developer building large-scale analytics, these insights are invaluable.
1. The Billing Pipeline That Must Never Fail
Cloudflare relies on ClickHouse for more than just analytics – it's the engine behind daily billing calculations that determine how much customers owe for services. Every single day, millions of queries are executed to aggregate usage data from hundreds of products. If these jobs aren't completed on time, invoices become impossible to reconcile, leading to revenue loss and customer frustration. The pipeline also feeds fraud systems and other critical workflows. When it slowed down after a migration, the ripple effects were immediate and severe. Teams scrambled to understand why a system that had previously performed flawlessly was now struggling to keep pace.

2. The Mystery: Normal Metrics, Abnormal Performance
When the slowdown hit, the engineering team followed standard playbooks. They checked I/O utilization – normal. They examined memory pressure – fine. They monitored rows scanned and parts read – all within expected ranges. Even CPU usage looked ordinary. "Everything we would normally check when a ClickHouse query is slow appeared to be normal," they noted. This was deeply puzzling. A bottleneck that leaves no trace in conventional metrics is the kind of problem that keeps database administrators up at night. It forces you to look beyond the obvious and dive into the internal mechanics of the database engine itself.
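To make the symptom concrete, here is a minimal Python sketch of what such a playbook amounts to: every metric of a slow run is compared against a healthy baseline. The metric names, the numbers, and the 1.5x tolerance are illustrative assumptions, not Cloudflare's figures or tooling.

```python
def regressed_metrics(baseline, current, tolerance=1.5):
    """Return the names of metrics whose current value exceeds baseline * tolerance."""
    return sorted(
        name
        for name, base in baseline.items()
        if name in current and current[name] > base * tolerance
    )


# Hypothetical measurements for one aggregation job (illustrative only).
baseline = {
    "duration_s": 40.0,
    "read_rows": 1_000_000,
    "read_bytes": 8_000_000_000,
    "memory_bytes": 2_000_000_000,
    "parts_read": 120,
}
slow_run = {
    "duration_s": 400.0,            # the job is ten times slower...
    "read_rows": 1_020_000,         # ...yet every resource metric is close to baseline
    "read_bytes": 8_100_000_000,
    "memory_bytes": 2_050_000_000,
    "parts_read": 123,
}

flagged = regressed_metrics(baseline, slow_run)
print(flagged)
```

In this made-up run only the duration is flagged. None of the resource metrics points to a cause, which is the situation the team described.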
3. Discovering the Hidden Bottleneck: A Deep Dive into Internals
The real culprit turned out to be a subtle inefficiency in how ClickHouse handled data sorting and partitioning in their massive Ready-Analytics setup. With terabytes of data and millions of rows ingested per second, even tiny inefficiencies in query execution can snowball. The team had to trace the problem down to the internals of ClickHouse – specifically how the primary key (namespace, indexID, timestamp) was being processed during aggregation. They found that a specific code path was causing excessive memory copying and unnecessary data movement, especially under high concurrency. This was not a simple configuration tweak; it required patching the underlying ClickHouse source code.
4. The Three Patches That Fixed It
Once the root cause was identified, the team developed and applied three targeted patches. The first patch optimized the way primary key columns were accessed during aggregation, reducing redundant scans. The second patch improved memory allocation for temporary data structures used in sorting. The third and most critical patch addressed a race condition in the parts merging logic that caused contention under heavy load. Each patch was thoroughly tested in a staging environment before deployment. The result was a dramatic recovery: query times dropped back to normal, and the billing pipeline resumed its daily rhythm without further issues.
5. Understanding the Ready-Analytics Platform
To appreciate the bottleneck, you need to understand the scale of Cloudflare's ClickHouse deployment. They store over 100 petabytes of data across dozens of clusters. In early 2022, they built a system called Ready-Analytics to simplify onboarding for internal teams. Instead of designing individual tables, teams could stream data into a massive single table. Each record follows a standard schema with 20 float fields, 20 string fields, a timestamp, and an indexID. The primary key is (namespace, indexID, timestamp), which allows each namespace's data to be sorted optimally for its queries. By December 2024, the system had grown to over 2 PiB of data and processed millions of rows per second. Yet it had one critical flaw: its retention policy.
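The shared schema is easier to picture as a table definition. The Python sketch below generates one from the description above. The table name, the column names (float1, string1, and so on), the column types, and the plain MergeTree engine are all assumptions made for illustration; only the field counts and the sort key come from the article.

```python
def ready_analytics_ddl(table="ready_analytics", n_floats=20, n_strings=20):
    """Build a CREATE TABLE statement for a generic one-table-for-everyone schema."""
    # Key columns: the types are guesses, the names follow the primary key in the text.
    columns = ["namespace String", "indexID String", "timestamp DateTime"]
    # Generic payload slots that each team maps its own fields onto.
    columns += [f"float{i} Float64" for i in range(1, n_floats + 1)]
    columns += [f"string{i} String" for i in range(1, n_strings + 1)]
    column_sql = ",\n    ".join(columns)
    return (
        f"CREATE TABLE {table}\n(\n    {column_sql}\n)\n"
        "ENGINE = MergeTree\n"
        "ORDER BY (namespace, indexID, timestamp)"
    )


ddl = ready_analytics_ddl()
print(ddl)
```

Because namespace leads the sort key, each namespace's rows are stored together and ordered by indexID and then timestamp, which is what lets a single table serve many teams with different query patterns.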

6. The One-Size-Fits-All Retention Policy
Cloudflare has used ClickHouse for years, long before native TTL features were available. As a result, they built a custom retention system based on partitioning: the Ready-Analytics table was partitioned by day, and a retention job simply dropped partitions older than 31 days. This one-size-fits-all approach became a major obstacle. Some teams needed to store data for years due to legal or contractual obligations, others for only a few days. Teams with different retention requirements couldn't use Ready-Analytics and were forced into a more complex conventional setup. The need for a per-namespace retention system was clear, but implementing it required changes to both the ingestion pipeline and the data model. This limitation was a driving factor behind the eventual redesign.
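A partition-based retention job of this kind can be sketched in a few lines of Python. This is not Cloudflare's job: the function names and the table name are hypothetical, and the generated statements assume the daily partition value is a date that can be written as 'YYYY-MM-DD'.

```python
from datetime import date, timedelta


def partitions_to_drop(partition_days, today, retention_days=31):
    """Return the daily partitions that are strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(day for day in partition_days if day < cutoff)


def drop_statements(table, days):
    """Render one DROP PARTITION statement per expired day (assumes a date partition key)."""
    return [f"ALTER TABLE {table} DROP PARTITION '{day.isoformat()}'" for day in days]


today = date(2025, 1, 1)
existing = [date(2024, 11, 29), date(2024, 11, 30), date(2024, 12, 1), date(2024, 12, 31)]

expired = partitions_to_drop(existing, today)
for statement in drop_statements("ready_analytics", expired):
    print(statement)
```

The limitation is visible in the signature: retention_days is a single number for the whole table. A daily partition holds rows from every namespace, so dropping it removes all of them at once, and no namespace can keep its data longer or shorter than the others.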
7. Lessons Learned and Best Practices for Large-Scale ClickHouse
The crisis taught Cloudflare invaluable lessons about monitoring and debugging at scale. First, never assume your metrics cover all bottlenecks – sometimes the most dangerous slowdowns are invisible until you dig into source code. Second, partitioning and primary key design are not just performance tuning; they can become architectural constraints that ripple across the entire system. Third, investing in deep expertise in your core infrastructure – even if it's open source – pays off when you need to write custom patches. Finally, consider building flexibility into your data retention from the start, even if it adds complexity. The three patches may have fixed the immediate problem, but the long-term solution was a fundamental rethinking of how they manage data lifecycle across hundreds of namespaces.
Conclusion: The hidden bottleneck in ClickHouse that slowed Cloudflare's billing pipeline was a stark reminder that even the most battle-tested systems can harbor subtle performance bugs. By combining deep database knowledge with systematic debugging, the team not only resolved the crisis but also contributed three important patches back to the open source community. For anyone running large-scale analytics workloads, the key takeaway is clear: always be prepared to look beyond the dashboard and into the engine room.