Cloudflare has published a detailed and refreshingly honest report into precisely what went wrong earlier this month when its systems fell over and took a big wedge of the internet with it.
We already knew from a quick summary published the next day, and our interview with its CTO John Graham-Cumming, that the 30-minute global outage had been caused by an error in a single line of code in a system the company uses to push rapid software changes.
Even though that change had been run through a test beforehand, the blunder maxed out Cloudflare's servers' CPUs and caused customers worldwide to get 502 errors from Cloudflare-backed websites. The full postmortem digs into precisely what went wrong, and what the biz has done – and is doing – to fix it and stop any repetition.
The headline is that it was a cascade of small mistakes that caused one almighty cock-up. We're tempted to use the phrase-du-jour "perfect storm," but it wasn't. It was a small mistake and lots of gaps in Cloudflare's otherwise robust processes that let the mistake escalate.
First up, the error itself – it was in this bit of code: .*(?:.*=.*). We won't go into the full workings of why, because the post does so extensively (a Friday treat for coding nerds), but very broadly the code caused a lot of what's called "backtracking" – basically repetitive looping. This backtracking got worse – exponentially worse – the more complex the request, and very, very quickly maxed out the company's CPUs.
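You can feel the effect yourself by timing that pattern against inputs that can never match. This is our own toy illustration, not Cloudflare's actual WAF code (which doesn't run Python), but Python's backtracking re engine misbehaves in the same way: the nested .* wildcards before the = force the engine to try every possible split of the input between them.

```python
import re
import time

# A toy reconstruction of the problem pattern -- not Cloudflare's real
# WAF rule, just the regex quoted in the postmortem.
PATTERN = re.compile(r'.*(?:.*=.*)')

# Worst case: a string with no '=' at all. Every split of the input
# between the two wildcards is tried and fails, so the work grows
# superlinearly with input length.
for n in (2000, 4000, 8000):
    payload = 'x' * n
    start = time.perf_counter()
    assert PATTERN.match(payload) is None   # no '=' anywhere, so no match
    print(f'{n:5d} chars: {time.perf_counter() - start:.4f}s')
```

Run it and the timings climb sharply as the input merely doubles – now imagine that cost paid on every request hitting every Cloudflare server.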
So the three big questions: why wasn't this noticed before it went live? How did it have such a huge impact so quickly? And why did it take Cloudflare so long to fix it?
The post answers each question clearly in a detailed rundown and even includes a lot of information that most organizations would be hesitant to share about internal processes and software, so kudos to Cloudflare for that. But to those questions…
I see you CPU
The impact wasn't noticed for the simple reason that the test suite didn’t measure CPU usage. It soon will – Cloudflare has an internal deadline of a week from now.
The second problem was that a software protection system that would have prevented excessive CPU consumption had been removed "by mistake" just weeks earlier. That protection is now back in, although it clearly needs to be locked down.
The software used to run the code – the expression engine – also doesn't have the ability to check for the sort of backtracking that occurred. Cloudflare says it will shift to one that does.
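Engines in the RE2 family guarantee linear-time matching by construction, which is the kind of engine Cloudflare says it will move to. But even without switching engines, a rule like this one can often be rewritten to remove the nested wildcards. As a hedged sketch (our rewrite, not Cloudflare's): the original pattern succeeds exactly when the input contains an =, so a character class that cannot overlap the literal does the same job with no blow-up:

```python
import re

# The original .*(?:.*=.*) matches precisely when the input contains
# an '=' somewhere. This equivalent has no nested wildcards: [^=]* can
# never swallow the '=', so there is only one way to split the input
# and backtracking stays linear.
SAFE_PATTERN = re.compile(r'[^=]*=')

def contains_equals(payload: str) -> bool:
    return SAFE_PATTERN.match(payload) is not None

assert contains_equals('a=b')
assert not contains_equals('x' * 10_000)   # fails fast, no CPU spike
```

Real WAF rules are far more involved than this, of course – which is exactly why an engine that rejects or bounds pathological patterns is the safer long-term answer.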
So that's how it got through the checking process: what about the speed with which it impacted everyone?
Here was another significant mistake: Cloudflare seems to have got too comfortable with making changes to its Web Application Firewall (WAF). The WAF is designed to be able to quickly provide protection to Cloudflare customers – it can literally make changes globally in seconds.
And Cloudflare has in the past put this to good use. In the post, it points to the fast rollout of protections against a SharePoint security hole in May. Very soon after the hole was made public, the biz saw a lot of hacking efforts on its customers' systems and was able to cut them off almost instantly with an update pushed through the WAF. This kind of service is precisely what has given Cloudflare its reputation – and paying clients. It deals with the constant stream of security issues so you don't have to.
But it uses the system a lot: 476 change requests in the past 60 days, or the equivalent of one every three hours.
The code that caused the problem was designed to deal with new cross-site scripting (XSS) attacks the company had identified but – and here's the crucial thing – it wasn't urgent that that change be made. So Cloudflare could have introduced it in a slower way and noticed the problem before it became a global issue. But it didn't; it has various testing processes that have always worked, and so it put the expression into the global system – as it has with many other expressions.
Cloudflare justifies this by pointing to the growing number of CVEs – Common Vulnerabilities and Exposures – that are published annually.
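A slower introduction would look something like a staged rollout: push the rule to a tiny slice of the fleet, watch a health signal such as CPU load, and only widen the blast radius if nothing spikes. Here's a minimal sketch of that idea – the names, stages, and thresholds are ours, not Cloudflare's actual deployment tooling:

```python
# Hypothetical staged-rollout sketch: deploy a new WAF rule to
# progressively larger fractions of the fleet, gating each stage on a
# CPU health check. All names and numbers here are illustrative.

STAGES = [0.001, 0.01, 0.10, 0.50, 1.0]   # fraction of the fleet per stage
CPU_LIMIT = 0.80                           # abort if any server exceeds this

class Rule:
    def __init__(self, cpu_cost):
        self.cpu_cost = cpu_cost           # crude model of the rule's load

class Server:
    def __init__(self):
        self.rules = set()
        self.cpu = 0.10                    # baseline load

    def apply(self, rule):
        self.rules.add(rule)
        self.cpu += rule.cpu_cost

    def rollback(self, rule):
        self.rules.discard(rule)
        self.cpu -= rule.cpu_cost

def staged_rollout(rule, fleet):
    deployed = []
    for fraction in STAGES:
        target = max(1, int(len(fleet) * fraction))
        for server in fleet[len(deployed):target]:
            server.apply(rule)
            deployed.append(server)
        if any(s.cpu > CPU_LIMIT for s in deployed):
            for s in deployed:
                s.rollback(rule)           # revert everything at the first bad signal
            return False                   # never reaches the rest of the fleet
    return True
```

With a scheme like this, a rule that pegs the CPU gets caught on a handful of machines instead of all of them – the trade-off being that genuinely urgent protections, like the SharePoint one, arrive more slowly.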
War Games redux
The impact, however, was that it created an instant global headache. What's more, the code itself was being run in a simulation mode – not in full live mode – but because of the massive CPU consumption it provoked, even within that mode it was able to knock everything offline as servers were unable to cope with the processing load.
That's where it all went wrong. Now, why did it take Cloudflare so long to fix it? Why didn't it just do a rollback within minutes and solve the issue while it figured out what was going on?
The post gives some interesting details that will be familiar to anyone who has ever had to deal with a crisis: the problem was noticed through alerts and then everyone scrambled. The issue had to be escalated to pull in more engineers, and especially more senior engineers who are allowed to make big decisions about what to do.
The mistakes here are all human: first, you have to physically get other human beings in front of screens, on phones, and in chatrooms. Then you have to coordinate quickly but effectively. What is the problem? What is causing it? How can we be sure that's right?
People get panicky under pressure and can easily misread or misunderstand the situation or decide the wrong thing. It takes a cool head to figure out what the truth is and figure out the best way to resolve it as quickly as possible.
It appears from Cloudflare's post that the web biz actually did really well in this respect – and we can have some degree of confidence in its version of events thanks to the timeline. Despite the obvious initial thought that the company was under some kind of external attack, it pinpointed the issue as being the WAF within 15 minutes of receiving the first alert. Which is actually a pretty good response time considering that no one was watching this rule change. It was a routine update that went wrong.
But there were several crucial delays. First, the automated emergency alerts took three minutes to arrive. Cloudflare admits this should have been faster. Second, even though a senior engineer made the decision to do a global kill on the WAF two minutes after it was pinpointed as the cause of the problem, it took another five minutes to actually process it.