Cloudflare gave everyone a 30-minute break from a chunk of the internet yesterday: Here's how they did it
DevOps-tating automation cockup... or machines trying to take over the web? El Reg talks to the CTO
Interview Internet services outfit Cloudflare took careful aim and unloaded both barrels at its feet yesterday, taking out a large chunk of the internet as it did so.
In an impressive act of openness, the company posted a distressingly detailed post-mortem on the cockwomblery that led to the outage. The Register also spoke to a weary John Graham-Cumming, CTO of the embattled company, to understand how it all went down.
This time it wasn't Verizon wot dunnit; Cloudflare engineered this outage all by itself.
In a nutshell, what happened was that Cloudflare deployed some rules to its Web Application Firewall (WAF). The gang deploys these rules to servers in a test mode – the rule gets fired but doesn't take any action – in order to measure what happens when real customer traffic runs through it.
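For readers who haven't met a WAF "test mode" before, here is a minimal Python sketch of the idea. It is not Cloudflare's implementation: the `Rule` class, the `simulate` flag, the rule names and the patterns are all hypothetical. It shows only that a simulated rule still runs against live traffic but logs instead of blocking.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: str          # hypothetical identifier
    pattern: re.Pattern   # compiled regular expression to run against the request
    simulate: bool        # True: log matches only; False: block on match


def evaluate(rules, request_body):
    """Run every rule against one request; return (blocked, log)."""
    log = []
    blocked = False
    for rule in rules:
        # The pattern is executed whether or not the rule is in simulate mode.
        if rule.pattern.search(request_body):
            if rule.simulate:
                log.append(f"{rule.rule_id}: would block (simulate)")
            else:
                log.append(f"{rule.rule_id}: blocked")
                blocked = True
    return blocked, log


rules = [
    Rule("new-xss-rule", re.compile(r"<script\b", re.IGNORECASE), simulate=True),
    Rule("old-sqli-rule", re.compile(r"union\s+select", re.IGNORECASE), simulate=False),
]

blocked, log = evaluate(rules, "<script>alert(1)</script>")
print(blocked, log)
```

Note that in this sketch the simulated rule's pattern is still evaluated on every request, so simulate mode limits what a rule can do to traffic, not how much CPU it can burn.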
We'd contend that an isolated test environment into which one could direct traffic would make sense, but Graham-Cumming told us: "We do this stuff all the time. We have a sequence of ways in which we deploy stuff. In this case, it didn't happen."
It all sounds a bit like the start of a Who, Me?
In a frank admission that should send all DevOps enthusiasts scurrying to look at their pipelines, Graham-Cumming told us: "We're really working on understanding how the automated test suite which runs internally didn't pick up the fact that this was going to blow up our service."
The CTO elaborated: "We push something out, it gets approved by a human, and then it goes through a testing procedure, and then it gets pushed out to the world. And somehow in that testing procedure, we didn't spot that this was going to blow things up."
He went on to explain how things should happen. After some internal dog-fooding, the updates are pushed out to a small group of customers "who tend to be a little bit cheeky with us" and "do naughty things" before it is progressively rolled out to the wider world.
"And that didn't happen in this instance. This should have been caught easily."
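The progressive rollout Graham-Cumming describes can be sketched roughly as follows. The stage names, the `staged_rollout` function and the stand-in health check are assumptions for illustration, not Cloudflare's tooling. The point is simply that a change which fails an early stage never reaches the last one.

```python
from typing import Callable, List, Tuple

# Hypothetical stage names, loosely following the order described above.
STAGES = ["internal", "early-customers", "everyone"]


def staged_rollout(
    change: str, is_healthy: Callable[[str, str], bool]
) -> Tuple[List[str], bool]:
    """Deploy `change` one stage at a time, halting at the first unhealthy stage.

    Returns (stages reached, whether the rollout completed).
    """
    reached: List[str] = []
    for stage in STAGES:
        reached.append(stage)
        if not is_healthy(change, stage):
            return reached, False   # stop here; later stages never see the change
    return reached, True


def cpu_check(change: str, stage: str) -> bool:
    # Stand-in health check: pretend one particular change pegs the CPU.
    return change != "bad-waf-rule"


print(staged_rollout("good-waf-rule", cpu_check))
print(staged_rollout("bad-waf-rule", cpu_check))
```

In this toy version the bad rule is stopped at the internal stage, which is the outcome the real process was meant to produce.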
The result? "One of these rules caused the CPU spike to 100 per cent, on all of our machines." And because Cloudflare's products are distributed over all its servers, every service was starved of CPU while the regular expression in the offending rule did its thing.
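How can a single regular expression eat a whole CPU? One common way is excessive backtracking, sketched below in Python. The patterns are textbook illustrations and are not the rule from the Cloudflare incident; the input sizes are chosen arbitrarily. On a backtracking engine such as CPython's `re`, the nested version typically slows down sharply as the input grows, while the flat version stays quick.

```python
import re
import time

# Illustrative patterns only; NOT the rule from the Cloudflare incident.
NESTED = re.compile(r"(a+)+$")  # nested quantifiers: many ways to split a run of a's
FLAT = re.compile(r"a+$")       # matches the same strings, with far less backtracking


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


for n in (12, 16, 20):
    text = "a" * n + "b"        # the trailing "b" forces the overall match to fail
    _, slow = time_match(NESTED, text)
    _, fast = time_match(FLAT, text)
    print(f"n={n:2d}  nested={slow:.5f}s  flat={fast:.5f}s")
```

Both patterns reject the input; the difference is how much work the engine does before giving up. Exact timings depend on the machine and the Python version.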
The 100 per cent CPU spike kicked off at 1342 UTC, taking down pretty much everything Cloudflare does, including DNS over HTTPS (DoH), the Content Delivery Network (CDN) and so on. At its worst, traffic flowing through the company dropped by 82 per cent. Customers saw "502 Bad Gateway" errors as they attempted to browse sites such as our very own Vulture Central.
The internet shrieked in pain.
It took the company 20 long minutes to work out just what the heck had happened. At 1409 UTC – nearly half an hour after the borkage had occurred – the company fired off a "global kill" on the WAF Managed Rulesets, which sent things back to normal.
"We had to determine which system was causing the problem. We knew there was a *massive* problem, but which massive problem needed determining. During those 20 minutes we identified that it was the WAF and identified the specific change causing the problem."
It wasn't until 1452 UTC that those rulesets got re-enabled, after the company had tracked down the offending pull request and rolled back the specific rules.
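The kill-then-re-enable sequence can be pictured with a toy model like the one below. The `WafEngine` class, its method names and the rule ids are all hypothetical and say nothing about how Cloudflare's WAF is built. It shows only the shape of the recovery: switch every managed rule off at once, then switch them back on without the offender.

```python
class WafEngine:
    """Toy WAF with a global kill switch for its managed rules (hypothetical)."""

    def __init__(self, rules):
        self.rules = list(rules)          # list of (rule_id, predicate) pairs
        self.managed_rules_enabled = True

    def global_kill(self):
        """Stop evaluating all managed rules immediately."""
        self.managed_rules_enabled = False

    def re_enable(self, without=()):
        """Drop the named rules, then resume evaluation."""
        self.rules = [(rid, pred) for rid, pred in self.rules if rid not in without]
        self.managed_rules_enabled = True

    def inspect(self, body):
        """Return ids of matching rules; nothing runs while the kill switch is on."""
        if not self.managed_rules_enabled:
            return []
        return [rid for rid, pred in self.rules if pred(body)]


waf = WafEngine([
    ("rule-a", lambda body: "<script" in body),
    ("rule-b", lambda body: "select" in body.lower()),
])

waf.global_kill()
print(waf.inspect("<script>"))           # [] because no rules run while killed

waf.re_enable(without={"rule-b"})
print(waf.inspect("<script> SELECT"))    # ['rule-a'] because rule-b was rolled back
```

The trade-off in a design like this is that traffic goes uninspected while the switch is off, which is presumably why the rules were re-enabled once the faulty change had been rolled back.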
Cloudflare itself uses some pretty standard tools to actually build and test, including TeamCity and the Atlassian suite. Graham-Cumming described a process familiar to many DevOps practitioners:
"If you were to look at our internal systems to see this problem going to production, you can see an individual engineer making a pull request, that pull request getting approved by other humans, getting pushed into the automated build system in TeamCity, the build running, the test suite running, and that system making the decision that it's okay to deploy.
"And that's been sort of the critical thing: that automated system was allowed to make that decision… and probably shouldn't."
Customers hit by the outage do, of course, have a Service Level Agreement that reimburses them for outages, although that is highly unlikely to cover lost revenues for half an hour of downtime. Graham-Cumming would only say he would be "discussing with the sales leadership and with the rest of the team, how we best address this" before repeating that Cloudflare "felt this pain very, very severely."
We'll leave it to Scott Hanselman to sum up how the machines will eventually rise up and wipe out humanity. ®
"The universe won’t end in fire or fade away - it’ll be a poorly written regular expression that destroys us all." https://t.co/zhORvHcx0t – Scott Hanselman (@shanselman), July 3, 2019
Disclosure: The Register is a Cloudflare customer.