Cloudflare recently completed its ambitious 'Code Orange: Fail Small' initiative, a comprehensive engineering effort designed to bolster network resilience and prevent the types of outages experienced in late 2025. This Q&A breaks down the project's key achievements, including the development of the Snapstone system for safer configuration changes and improved incident response procedures.
1. What was the 'Code Orange: Fail Small' project at Cloudflare?
'Code Orange: Fail Small' was an internal code name for a large-scale engineering project that spanned two and a half quarters of work across Cloudflare's engineering teams. Its primary goal was to make the company's infrastructure more resilient, secure, and reliable for every customer. The project concluded earlier this month, marking a significant milestone in Cloudflare's ongoing commitment to network stability. While resiliency improvement is a perpetual priority, the specific work completed would have prevented the global outages that occurred on November 18 and December 5, 2025. The project addressed several core areas: safer configuration deployments, reducing failure impact, updating 'break glass' emergency procedures, enhancing incident management, preventing drift over time, and improving customer communication during incidents.

2. Why did Cloudflare undertake this intensive engineering effort?
Cloudflare undertook 'Code Orange: Fail Small' directly in response to two major global outages that disrupted services in November and December 2025. These incidents highlighted vulnerabilities in how internal configuration changes were deployed and managed across the network. The engineering team recognized that the existing processes for rolling out configuration updates—whether data files or control flags—were insufficiently guarded against human error or unintended consequences. By focusing on safer, health-mediated deployments and better automated rollback capabilities, Cloudflare aimed to drastically reduce the risk of similar widespread failures. The project was a proactive, systematic overhaul to ensure that any single point of failure during a configuration change would 'fail small'—affecting only a limited portion of the network—rather than cascading into a global outage.

3. What key areas did the project focus on to improve network reliability?
The project concentrated on five crucial areas: safer configuration changes, reducing the impact of failure, revising 'break glass' procedures (emergency access protocols), improving incident management, and preventing drift and regressions over time. Additionally, Cloudflare strengthened how it communicates with customers during outages, ensuring more transparent and timely updates. For safer configuration changes, the team built a new internal system called Snapstone to enable progressive rollouts with continuous health monitoring. The 'reduce impact of failure' work involved segmenting the network so that any localized issue wouldn't spread. Updated 'break glass' procedures allow engineers to respond to critical incidents without bypassing safety checks, and incident management now includes clearer roles and automated escalation paths. Finally, measures like automated tests and regression detection prevent previous fixes from eroding as new features are added.

4. How did Cloudflare make configuration changes safer after the outages?
Previously, many internal configuration changes reached Cloudflare's network instantly, leaving no chance to detect problems before they affected customer traffic. After the outages, Cloudflare adopted a health-mediated deployment methodology for all configuration changes—the same approach used for software releases. This means changes are no longer deployed across the entire network at once. Instead, they are rolled out progressively to small subsets of servers. As each increment deploys, real-time observability tools monitor health metrics. If any anomaly is detected, the deployment automatically halts and rolls back the change. This sharply reduces the risk of a single bad configuration propagating globally. The methodology now applies to all product teams that were directly affected by the incidents, as well as many others. A central enabler of this is a new component called Snapstone, which provides a unified platform for gradual rollout and health mediation.
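The rollout loop described above can be sketched roughly as follows. This is an illustrative model only, not Cloudflare's actual implementation: the stage sizes, the error budget, and all function names are assumptions made for the sake of the example.

```python
STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet per step (assumed)
ERROR_BUDGET = 0.02                  # halt if error rate exceeds 2% (assumed)

def progressive_deploy(change, observe_error_rate):
    """Roll `change` out stage by stage; halt and roll back on bad health."""
    for fraction in STAGES:
        # (in a real system: push the change to this share of servers,
        # then read health metrics from observability tooling)
        if observe_error_rate(change, fraction) > ERROR_BUDGET:
            # Automatic rollback: the change never reaches the rest
            # of the network.
            return ("rolled_back", fraction)
    return ("deployed", 1.0)

# A change that starts failing once it reaches 5% of the fleet is
# caught there and never spreads further:
status, reach = progressive_deploy(
    {"flag": "new_routing"},
    lambda change, f: 0.10 if f >= 0.05 else 0.0,
)
# status == "rolled_back", reach == 0.05
```

The key design point is that the health check gates every increment, so a bad change is stopped at the smallest stage where its symptoms become visible rather than after a full global push.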

5. What is Snapstone and how does it enhance configuration management?
Snapstone is an internal system built by Cloudflare to bring health-mediated deployment to configuration changes. It bundles a configuration change (such as a data file or a control flag) into a deployable package. Snapstone then orchestrates a gradual release of that package across the network, constantly monitoring health signals in real time. If the health metrics degrade, Snapstone automatically triggers a rollback, preventing the change from reaching the entire network. What sets Snapstone apart is its flexibility: teams can define any unit of configuration that needs health mediation, whether it's a static data file (like the one that caused the November outage) or a dynamic control flag (like the one involved in the December outage). Before Snapstone, applying health-mediated deployment to configurations was possible but required significant manual effort per team, leading to inconsistent application. Now, Snapstone offers a standardized, automated process, making safer configuration changes the default across all affected product areas.

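Because Snapstone is internal to Cloudflare and its API is not public, the following is only a hypothetical sketch of the idea it describes: any unit of configuration, whether a data file or a control flag, wrapped in a package that a generic health-mediated release loop can roll out and roll back. Every name and signature here is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConfigPackage:
    """Hypothetical stand-in for a Snapstone-style deployable unit."""
    name: str
    payload: bytes                       # a data file or a serialized flag
    is_healthy: Callable[[float], bool]  # health signal at a given stage

def release(pkg: ConfigPackage, stages=(0.01, 0.10, 1.00)):
    """Gradually release the package; roll back on the first bad signal."""
    for fraction in stages:
        # (in a real system: distribute pkg.payload to `fraction` of servers)
        if not pkg.is_healthy(fraction):
            return f"{pkg.name}: rolled back at {fraction:.0%}"
    return f"{pkg.name}: fully released"

# A control flag whose health signal trips at the very first stage is
# contained to 1% of the network:
flag = ConfigPackage("routing-flag", b"enabled", lambda f: False)
```

The point of the abstraction is that the release loop is identical for every configuration type; only the packaged payload and its health signal differ, which is what makes a single standardized process possible.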
6. How does Snapstone help prevent the types of issues that caused past outages?
The two global outages in late 2025 were triggered by configuration changes: one by a bad data file, the other by a problematic control flag. Snapstone directly addresses these categories by providing a generic framework that can health-mediate any configuration unit. For the data file scenario, Snapstone would package the file, release it to a small set of servers, monitor for errors, and quickly roll back if issues appeared—preventing the file from corrupting the entire network. For the control flag example, Snapstone would similarly deploy the flag gradually, watching for performance degradation or unexpected behavior. Additionally, Snapstone's progressive rollout means that the blast radius of any failure is small, limiting impact to only a fraction of traffic. By making health-mediated deployment mandatory for high-risk configuration changes, Snapstone ensures that the same root cause—an unchecked configuration push—cannot cause a global outage again.
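The blast-radius claim above can be made concrete with a small worked example. The stage sizes below are illustrative assumptions, not Cloudflare's actual rollout fractions.

```python
# Illustrative rollout stages (assumed values, not Cloudflare's).
STAGES = [0.01, 0.05, 0.25, 1.00]

def exposure(fail_at_stage: int) -> float:
    """Fraction of traffic exposed if a bad change is caught at a stage."""
    return STAGES[fail_at_stage]

# An instant global push exposes 100% of traffic to a bad change at once.
# Under a staged rollout, a change caught at the first stage touches only 1%:
worst_case_instant = 1.00
worst_case_staged = exposure(0)   # 0.01
```

In other words, the earlier the health check trips, the smaller the fraction of traffic ever exposed, which is precisely the 'fail small' property the project is named for.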