Cloudflare's Code Orange: Fail Small Project Complete – A More Resilient Network


Cloudflare recently wrapped up a major engineering initiative called Code Orange: Fail Small, which aimed to make its global network more resilient, secure, and reliable. Over the past two quarters, teams focused on preventing major outages like those on November 18 and December 5, 2025. The project introduced several key improvements: safer configuration changes, reduced failure impact, updated break-glass procedures, better incident management, and stronger customer communication. Here we answer the most common questions about what changed and what it means for you.

What was Code Orange: Fail Small and why did Cloudflare undertake it?

Code Orange: Fail Small was an intensive engineering effort spanning over two quarters, designed to dramatically improve Cloudflare's network resilience after two significant global outages in late 2025. The core idea was to ensure that any future failure would be contained to a small, isolated part of the network rather than cascading to affect all customers. By focusing on safer configuration changes, reducing blast radius, and overhauling incident response procedures, Cloudflare aimed to prevent incidents like the November 18 and December 5 outages from recurring. The project wasn't just about fixing those specific root causes; it created a systemic framework to catch problems early, roll back changes automatically, and communicate transparently during incidents. While resilience is an ongoing priority, the completion of this work means the network is now fundamentally stronger and better prepared to handle unexpected issues.

Source: blog.cloudflare.com

How did Cloudflare make configuration changes safer?

Configurations—such as data files and control flags—can cause outages if deployed incorrectly. Before Code Orange, many configuration changes went live instantly across the entire network with no health checks. Now, Cloudflare uses a health-mediated deployment methodology for all high-risk configuration pipelines. This means changes are rolled out gradually, with real-time health monitoring watching for anomalies. If a problem is detected, the system automatically reverts the change before it affects customer traffic. These changes apply to all product teams that were involved in past incidents, and the approach is now standard for both software releases and configuration deployments. A critical enabler of this process is a new internal component called Snapstone, which provides a unified way to bundle, release, and monitor configuration changes with automatic rollback—making health-mediated deployment easy and consistent across the network.
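The mechanics of a health-mediated rollout can be sketched in a few lines. This is a hypothetical illustration, not Cloudflare's actual implementation: the stage fractions, the error budget, and names like `Rollout` and `check_health` are assumptions chosen for clarity.

```python
from dataclasses import dataclass

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the fleet per stage (assumed values)
ERROR_BUDGET = 0.02                        # max tolerated error rate (assumed value)

@dataclass
class Rollout:
    config_version: str
    previous_version: str
    deployed_fraction: float = 0.0
    rolled_back: bool = False

def check_health(error_rate: float) -> bool:
    """A real system would aggregate live telemetry from the network;
    here we simply compare an observed error rate against the budget."""
    return error_rate <= ERROR_BUDGET

def deploy(rollout: Rollout, observe_error_rate) -> Rollout:
    """Advance through stages, reverting automatically on bad health."""
    for fraction in ROLLOUT_STAGES:
        rollout.deployed_fraction = fraction
        if not check_health(observe_error_rate(fraction)):
            # Automatic rollback: traffic returns to the previous version
            rollout.deployed_fraction = 0.0
            rollout.rolled_back = True
            return rollout
    return rollout

# A healthy change reaches 100%; an unhealthy one reverts at the first stage.
ok = deploy(Rollout("v2", "v1"), observe_error_rate=lambda f: 0.001)
bad = deploy(Rollout("v3", "v1"), observe_error_rate=lambda f: 0.10)
```

The key property is that a bad change is caught while it touches only the smallest stage, so most customer traffic never sees it.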

What is Snapstone and how does it work?

Snapstone is a custom-built internal system that brings health-mediated deployment to configuration changes. Before Snapstone, applying progressive rollout and health monitoring to configs was possible but required significant manual effort by each team, leading to inconsistent adoption. Snapstone provides a single, flexible platform where teams can define any unit of configuration—like a data file or a control flag—and then package it for gradual release. Once shipped, Snapstone monitors real-time health signals across the network; if something goes wrong, it automatically rolls back the change. This system is designed to prevent not just past failures but future ones too. For example, the data file that caused the November 18 outage and the control flag involved in the December 5 outage could both be managed via Snapstone. By centralizing this capability, Cloudflare ensures that every configuration change follows the same rigorous health-mediated process by default.
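Snapstone's API is internal and not public, but the shape of such a system can be sketched: any unit of configuration (a data file, a control flag) is wrapped in one abstraction that keeps version history and rolls back automatically on a failed health check. Every name below is illustrative.

```python
from typing import Any, Callable

class ConfigUnit:
    """Hypothetical Snapstone-like wrapper around one unit of configuration."""

    def __init__(self, name: str, payload: Any):
        self.name = name
        self.versions = [payload]   # version history is what enables rollback

    @property
    def current(self) -> Any:
        return self.versions[-1]

    def release(self, payload: Any, healthy: Callable[[Any], bool]) -> bool:
        """Ship a new version; revert automatically if the health check fails."""
        self.versions.append(payload)
        if not healthy(payload):
            self.versions.pop()     # automatic rollback to the last good version
            return False
        return True

# A data file and a control flag are both "units" under the same abstraction.
data_file = ConfigUnit("feature-data-file", {"features": 60})
control_flag = ConfigUnit("emergency-flag", False)

# A malformed release (here, an oversized payload) is rejected and rolled back.
ok = data_file.release({"features": 61}, healthy=lambda p: p["features"] <= 200)
bad = data_file.release({"features": 999}, healthy=lambda p: p["features"] <= 200)
```

Because both the data file and the flag go through the same `release` path, the rigorous process is the default rather than something each team wires up by hand.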

How does Cloudflare reduce the impact of failure now?

Beyond safer deployments, Code Orange introduced measures to reduce failure blast radius. The network architecture was reviewed to ensure that if one component fails, its impact is contained to a small subset of customers or regions, rather than cascading globally. This involved revising design patterns, improving isolation between services, and adding circuit breakers at critical points. For instance, changes that previously affected the entire network are now tested on a small initial group—similar to a canary release—and only rolled out broadly if health checks pass. Additionally, the team improved observability tools to detect anomalies faster and pinpoint the source of failure more precisely. The goal is to always keep a failure 'small' in scope, so the vast majority of traffic remains unaffected. This philosophy now guides all new feature development and infrastructure changes at Cloudflare.
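The circuit-breaker pattern mentioned above is a standard isolation technique; here is a minimal sketch of the idea. The threshold and class names are assumptions for illustration, not Cloudflare's internals.

```python
class CircuitBreaker:
    """After repeated downstream failures, stop calling the failing service
    and serve a fallback instead, containing the blast radius."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False   # open circuit = traffic no longer sent downstream

    def call(self, operation, fallback):
        if self.open:
            return fallback()            # fail small: degraded but alive
        try:
            result = operation()
            self.failures = 0            # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True         # trip the breaker, contain the failure
            return fallback()

def flaky():
    raise RuntimeError("downstream service unavailable")

breaker = CircuitBreaker(failure_threshold=2)
results = [breaker.call(flaky, fallback=lambda: "cached response") for _ in range(4)]
```

After the threshold is hit, the breaker stops hammering the failing dependency entirely, which is what keeps one component's failure from cascading.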


What changes were made to break-glass procedures and incident management?

Cloudflare completely revised its break-glass procedures—the emergency protocols used to bypass normal safeguards during a crisis. The old procedures sometimes allowed for overly broad access or actions that could worsen an outage. The new procedures enforce stricter controls, requiring multiple approvals and logging of each emergency action. Lessons from the November and December incidents were incorporated to ensure that break-glass steps are safer and better documented. Incident management was also overhauled: roles and responsibilities are now clearer, with dedicated incident commanders and communication leads. Post-incident reviews automatically trigger follow-up tasks to prevent regressions. These improvements mean that even during the most stressful situations, the team can act quickly without accidentally making things worse. The changes also include better training and regular drills to keep everyone prepared.

How does Cloudflare prevent drift and regressions over time?

To ensure that the improvements remain effective, Cloudflare introduced new measures to prevent drift and regressions. This includes automated checks that continuously verify that configuration changes follow health-mediated deployment rules. Any deviation is flagged and either automatically corrected or escalated for human approval. The company also implemented mandatory periodic reviews for all critical infrastructure components, ensuring that best practices are updated and reapplied as the network evolves. New tooling monitors the health of health-mediation itself: if a team stops using Snapstone for a particular config, an alert is triggered. This layered approach means that the resilience gains from Code Orange are not eroded over time by quick fixes or shortcuts. Additionally, every incident post-mortem now includes action items to prevent similar regressions, and those action items are tracked to completion with automated reminders.
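A drift check of the kind described (alert when a pipeline stops releasing through the health-mediated system) is conceptually simple. The pipeline names and the `last_release_tool` field below are illustrative assumptions, not Cloudflare's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Pipeline:
    name: str
    last_release_tool: str   # which tool performed the most recent release

def audit_pipelines(pipelines: list[Pipeline],
                    required_tool: str = "snapstone") -> list[str]:
    """Return the names of pipelines that drifted away from the required
    release tool; in a real system each hit would raise an alert."""
    return [p.name for p in pipelines if p.last_release_tool != required_tool]

fleet = [
    Pipeline("feature-data-pipeline", "snapstone"),
    Pipeline("dashboard-flags", "manual-script"),   # drifted: would alert
]
drifted = audit_pipelines(fleet)
```

Run periodically, a check like this turns "teams should use the safe path" from a convention into an enforced invariant.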

How has customer communication during outages improved?

Code Orange also focused on strengthening customer communication during incidents. Cloudflare now publishes more frequent, more detailed status updates on its dashboard, with clear timestamps and explanations of the root cause and impact. The team developed templates for different outage scenarios, so information can be shared within minutes rather than hours. Internally, a dedicated communications role during incidents ensures that updates are accurate and accessible. Customers also receive proactive notifications via email and webhook for any service degradation. The goal is to reduce uncertainty and provide actionable information, such as estimated resolution times and workarounds. These changes are based on feedback from the November and December outages, where customers expressed frustration about the lack of timely information. Now, even if the network faces a challenge, you'll know exactly what's happening and what's being done about it.
