Cloudflare explains how it managed to break the internet

'Network engineers walked over each other's changes'


A large chunk of the web (including your own Vulture Central) fell off the internet this morning as content delivery network Cloudflare suffered a self-inflicted outage.

The incident began at 0627 UTC (2327 Pacific Time) and it took until 0742 UTC (0042 Pacific) before the company managed to bring all its datacenters back online and verify they were working correctly. During this time a variety of sites and services relying on Cloudflare went dark while engineers frantically worked to undo the damage they had wrought just hours previously.

"The outage," explained Cloudflare, "was caused by a change that was part of a long-running project to increase resilience in our busiest locations."

Oh, the irony.

What had happened was a change to the company's prefix advertisement policies, resulting in the withdrawal of a critical subset of prefixes. Cloudflare makes use of BGP (Border Gateway Protocol); as part of this protocol, operators define policies that determine which prefixes (collections of adjacent IP addresses) are advertised to, or accepted from, the other networks they connect to (peers).

Changing a policy can result in IP addresses no longer being reachable on the internet. One would therefore hope that extreme caution would be taken before doing such a thing...
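For the uninitiated, the mechanics look something like the Python toy below: a policy is essentially a filter over the prefixes a location announces, and a filter that matches too little silently withdraws the rest. A minimal sketch with invented addresses, not Cloudflare's actual configuration:

    # Toy model of BGP prefix advertisement. Addresses and policy names
    # are illustrative only - this is not Cloudflare's configuration.
    from ipaddress import ip_network

    # Prefixes a location wants to announce to its peers.
    LOCAL_PREFIXES = [
        ip_network("203.0.113.0/24"),   # hypothetical production range
        ip_network("198.51.100.0/24"),  # hypothetical internal range
    ]

    def advertised(prefixes, policy):
        """Return only the prefixes the policy permits us to announce."""
        return [p for p in prefixes if policy(p)]

    # Old policy: announce everything.
    old_policy = lambda p: True

    # New policy that accidentally matches too little: only the
    # internal range survives the filter.
    new_policy = lambda p: p.subnet_of(ip_network("198.51.100.0/24"))

    before = advertised(LOCAL_PREFIXES, old_policy)
    after = advertised(LOCAL_PREFIXES, new_policy)

    # Anything advertised before but not after is withdrawn from BGP -
    # the rest of the internet can no longer reach those addresses.
    withdrawn = set(before) - set(after)
    print("withdrawn:", withdrawn)  # {IPv4Network('203.0.113.0/24')}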

Cloudflare's mistakes actually began at 0356 UTC (2056 Pacific), when the change was made at the first location. There was no problem - the location used an older architecture rather than Cloudflare's new "more flexible and resilient" version, known internally as MCP (Multi-Colo PoP). MCP differed from what had gone before by adding a layer of routing to create a mesh of connections. The theory went that bits and pieces of the internal network could be disabled for maintenance. Cloudflare had already rolled out MCP to 19 of its datacenters.
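The mesh idea itself is sound. As a rough illustration - the topology below is invented, since Cloudflare hasn't published this level of detail - a meshed routing layer means any single internal device can be pulled for maintenance without partitioning the site:

    # Sketch of the resilience property an MCP-style mesh is after:
    # removing any one router leaves the rest reachable. Hypothetical
    # four-router topology, checked with a simple BFS.
    from collections import deque

    MESH = {
        "r1": {"r2", "r3", "r4"},
        "r2": {"r1", "r3", "r4"},
        "r3": {"r1", "r2", "r4"},
        "r4": {"r1", "r2", "r3"},
    }

    def still_connected(graph, down):
        """True if the graph minus the 'down' node stays in one piece."""
        nodes = set(graph) - {down}
        seen, queue = set(), deque([next(iter(nodes))])
        while queue:
            n = queue.popleft()
            if n in seen:
                continue
            seen.add(n)
            queue.extend(graph[n] - {down} - seen)
        return seen == nodes

    # Any single router can be disabled for maintenance.
    assert all(still_connected(MESH, r) for r in MESH)

The flip side, as the incident showed, is that the same connectivity will happily carry a bad policy to every MCP site at once.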

Fast forward to 0617 UTC (2317 Pacific), when the change was deployed to one of the company's busiest locations, though not an MCP-enabled one. Things still seemed OK... However, by 0627 UTC (2327 Pacific), the change hit the MCP-enabled locations, rattled through the mesh layer and... took out all 19 locations.

Five minutes later the company declared a major incident. Within half an hour the root cause had been found and engineers began to revert the change. Slightly worryingly, it took until 0742 UTC (0042 Pacific) before everything was complete. "This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically."
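That failure mode is easy to reproduce in miniature. The hypothetical sketch below shows how blind last-write-wins pushes let one engineer's revert trample another's, and how a compare-and-set discipline rejects a push made from a stale view of the config. None of this is Cloudflare's tooling; it only illustrates the tangle described:

    # Hypothetical illustration of reverts fighting each other, and of
    # compare-and-set as one way to serialize them. Not Cloudflare's
    # tooling.
    class ConfigStore:
        def __init__(self, value):
            self.value, self.version = value, 0

        def push_blind(self, value):
            # Last write wins - how changes "walk over" each other.
            self.value, self.version = value, self.version + 1

        def push_cas(self, expected_version, value):
            # Reject pushes based on a stale view of the config.
            if expected_version != self.version:
                return False
            self.value, self.version = value, self.version + 1
            return True

    store = ConfigStore("bad-policy")
    store.push_blind("good-policy")  # engineer A reverts the change
    store.push_blind("bad-policy")   # B reverts A's revert - outage returns
    assert store.value == "bad-policy"

    store = ConfigStore("bad-policy")           # version 0
    assert store.push_cas(0, "good-policy")     # A's revert lands (version 1)
    assert not store.push_cas(0, "bad-policy")  # B's stale push is rejected
    assert store.value == "good-policy"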

One can imagine the panic at Cloudflare towers, although we cannot imagine a controlled process that resulted in a scenario where "network engineers walked over each other's changes."

We've asked the company to clarify how this happened, and what testing was done before the configuration change was made, and will update should we receive a response.

Mark Boost, CEO of cloud native outfit Civo (formerly of LCN.com), was scathing regarding the outage: "This morning was a wake-up call for the price we pay for over-reliance on big cloud providers. It is completely unsustainable for an outage with one provider being able to bring vast swathes of the internet offline.

"Users today rely on constant connectivity to access the online services that are part of the fabric of all our lives, making outages hugely damaging...

"We should remember that scale is no guarantee of uptime. Large cloud providers have to manage a vast degree of complexity and moving parts, significantly increasing the risk of an outage." ®
