Cloudflare explains how it managed to break the internet

'Network engineers walked over each other's changes'


A large chunk of the web (including your own Vulture Central) fell off the internet this morning as content delivery network Cloudflare suffered a self-inflicted outage.

The incident began at 0627 UTC (2327 Pacific Time) and it took until 0742 UTC (0042 Pacific) before the company managed to bring all its datacenters back online and verify they were working correctly. During this time a variety of sites and services relying on Cloudflare went dark while engineers frantically worked to undo the damage they had wrought short hours previously.

"The outage," explained Cloudflare, "was caused by a change that was part of a long-running project to increase resilience in our busiest locations."

Oh, the irony.

What had happened was a change to the company's prefix advertisement policies, resulting in the withdrawal of a critical subset of prefixes. Cloudflare makes use of BGP (Border Gateway Protocol). As part of this protocol, operators define policies that decide which prefixes (groups of adjacent IP addresses) are advertised to or accepted from other networks (or peers).

Changing a policy can result in IP addresses no longer being reachable on the Internet. One would therefore hope that extreme caution would be taken before doing such a thing...
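To illustrate how a policy change can silently withdraw prefixes, here is a minimal Python sketch. It is a toy model, not BGP itself and not Cloudflare's configuration: the prefixes are documentation address ranges, and the `advertised` function and the first-match policy format are assumptions made up for this example.

```python
import ipaddress

# Hypothetical prefixes (documentation ranges), not real Cloudflare address space.
ORIGINATED = [
    ipaddress.ip_network(p)
    for p in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]


def advertised(prefixes, policy):
    """Return the prefixes a policy allows to be advertised to peers.

    `policy` is an ordered list of (network, action) terms. The first term
    whose network covers the prefix decides; no match means the prefix is
    not advertised.
    """
    result = []
    for prefix in prefixes:
        for network, action in policy:
            if prefix.subnet_of(network):
                if action == "accept":
                    result.append(prefix)
                break  # first matching term wins
    return result


# Assumed "before" policy: advertise everything we originate.
old_policy = [(ipaddress.ip_network("0.0.0.0/0"), "accept")]

# Assumed "after" policy: a new reject term is evaluated first.
new_policy = [
    (ipaddress.ip_network("198.51.100.0/24"), "reject"),
    (ipaddress.ip_network("0.0.0.0/0"), "accept"),
]

before = advertised(ORIGINATED, old_policy)
after = advertised(ORIGINATED, new_policy)

# Anything advertised before but not after is effectively withdrawn.
withdrawn = [p for p in before if p not in after]
print("withdrawn:", [str(p) for p in withdrawn])
```

Comparing the advertised set before and after a proposed change, as the last few lines do, is one simple pre-deployment check that would flag an unexpected withdrawal.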

Cloudflare's mistakes actually began at 0356 UTC (2056 Pacific), when the change was made at the first location. There was no problem - the location used an older architecture rather than Cloudflare's new "more flexible and resilient" version, known internally as MCP (Multi-Colo PoP). MCP differed from what had gone before by adding a layer of routing to create a mesh of connections. The theory went that bits and pieces of the internal network could be disabled for maintenance. Cloudflare has already rolled out MCP to 19 of its datacenters.

At 0617 UTC (2317 Pacific), the change was deployed to one of the company's busiest locations, but not an MCP-enabled one. Things still seemed OK... However, by 0627 UTC (2327 Pacific), the change hit the MCP-enabled locations, rattled through the mesh layer and... took out all 19 locations.

Five minutes later the company declared a major incident. Within half an hour the root cause had been found and engineers began to revert the change. Slightly worryingly, it took until 0742 UTC (0042 Pacific) before everything was complete. "This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically."

One can imagine the panic at Cloudflare towers, although we cannot imagine a controlled process that resulted in a scenario where "network engineers walked over each other's changes."
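For readers wondering what a guard against that failure mode might look like, below is a minimal Python sketch of optimistic concurrency control, a common technique in which a write is rejected if it was based on a stale view of the configuration. Nothing here reflects Cloudflare's actual tooling, which the company has not described; `ConfigStore`, `StaleWriteError`, and the policy names are hypothetical.

```python
class StaleWriteError(Exception):
    """Raised when a write is based on an out-of-date version."""


class ConfigStore:
    """Toy configuration store that versions every change."""

    def __init__(self, value):
        self.value = value
        self.version = 1

    def read(self):
        # Callers keep the version so they can prove what they last saw.
        return self.value, self.version

    def write(self, new_value, based_on_version):
        # Refuse the write if someone else changed the config in the meantime.
        if based_on_version != self.version:
            raise StaleWriteError(
                f"based on v{based_on_version}, current is v{self.version}"
            )
        self.value = new_value
        self.version += 1
        return self.version


store = ConfigStore("bad-policy")

# Two engineers read the same broken state at the same time.
_, seen_by_a = store.read()
_, seen_by_b = store.read()

# Engineer A reverts the change.
store.write("good-policy", seen_by_a)

# Engineer B, still working from the old view, tries to push a change
# that would undo A's revert. The version check stops it.
try:
    store.write("bad-policy", seen_by_b)
    outcome = "overwritten"
except StaleWriteError:
    outcome = "rejected"

print(outcome, store.value)
```

With the version check in place, the second engineer has to re-read the current state before writing, which is the point at which they would notice the revert had already happened.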

We've asked the company to clarify how this happened, and what testing was done before the configuration change was made, and will update should we receive a response.

Mark Boost, CEO of cloud native outfit Civo (formerly of LCN.com), was scathing regarding the outage: "This morning was a wake-up call for the price we pay for over-reliance on big cloud providers. It is completely unsustainable for an outage with one provider being able to bring vast swathes of the internet offline.

"Users today rely on constant connectivity to access the online services that are part of the fabric of all our lives, making outages hugely damaging...

"We should remember that scale is no guarantee of uptime. Large cloud providers have to manage a vast degree of complexity and moving parts, significantly increasing the risk of an outage." ®

Biting the hand that feeds IT © 1998–2022