Cloudflare comes clean on crashing a chunk of the web: How small errors and one tiny bit of code led to a huge mess

The culprit? .*(?:.*=.*)


Slow death

Why? Because the people authorized to issue the kill hadn't logged into the system for a while, and its security controls had logged them out as a result. They had to re-authenticate to get back in. Once they did and authorized the kill, it took two minutes to kick in globally and traffic levels went back to normal – making it clear that the WAF was in fact the problem.

This is the timeline:

  • 13.42: Bad code posted
  • 13.45: First alert arrives (followed by lots of others)
  • 14.00: WAF identified as the problem
  • 14.02: Global kill on WAF approved
  • 14.07: Kill finally implemented (after the delay in logging in)
  • 14.09: Traffic back to normal

Cloudflare has changed its systems and approach in response, so in future this response time should drop from 27 minutes to around 20 minutes (assuming it will always take some time to work out where the problem lies in a previously unseen issue).
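For anyone who wants to check the 27-minute figure, here is a minimal Python sketch that works it out from the timeline above. The timestamps and labels are copied from that list; the names `TIMELINE`, `minutes_between` and `step_durations` are our own illustrative choices, not anything from Cloudflare's post.

```python
from datetime import datetime

# Timeline entries as listed in the article (24-hour clock, same day).
TIMELINE = [
    ("13:42", "Bad code posted"),
    ("13:45", "First alert arrives"),
    ("14:00", "WAF identified as the problem"),
    ("14:02", "Global kill on WAF approved"),
    ("14:07", "Kill finally implemented"),
    ("14:09", "Traffic back to normal"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def step_durations(timeline):
    """Pair each event (after the first) with the minutes since the previous event."""
    return [
        (later_label, minutes_between(earlier_time, later_time))
        for (earlier_time, _), (later_time, later_label) in zip(timeline, timeline[1:])
    ]


# End-to-end response time: bad code posted until traffic was back to normal.
total = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
steps = step_durations(TIMELINE)

if __name__ == "__main__":
    for label, minutes in steps:
        print(f"{minutes:>3} min  {label}")
    print(f"{total:>3} min  total")
```

Broken down this way, the longest single gap is the 15 minutes between the first alert and the WAF being identified as the cause, with another five minutes lost between approving the kill and implementing it.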

At this point, the problem had been identified, but the WAF had been taken down, so customers were still affected. The Cloudflare team then had to figure out what in the WAF had gone wrong, fix it, check it, and then restart it. That took 53 minutes.

This is where the impressive openness and honesty from Cloudflare up until this point gets a little more opaque. One paragraph covers this entire process:

"Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location. At 14:52 we were 100 per cent satisfied that we understood the cause and had a fix in place and the WAF was re-enabled globally."

There's no more information than that, although it does mention later on that "the rollback plan required running the complete WAF build twice, taking too long."

Timing off

It also mentions that the Cloudflare team "had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on" – although it's not clear whether that led to delays in fixing the WAF.

It's hard to know without more detail whether Cloudflare did a great job here or whether its systems were found lacking – given its global reach and that its entire function as a company revolves around this kind of work.

For example: how long after the WAF was taken down did the engineers manage to pinpoint the specific code that caused the problem? Did they figure it out in five minutes and then run 47 minutes of tests? Or did it take them 47 minutes to find it and then run five minutes of tests?

The fact that Cloudflare doesn't say in an otherwise very detailed and expansive post suggests that this was not its finest hour. You would imagine that it would simply bring up a log of all the changes made just prior to the problems, cut those changes out, rebuild, and test. Maybe it did.

Is 53 minutes a good timeframe to rebuild something that had just caused worldwide outages and put it live again? What do Reg readers think?

Anyway, that's how it went down. To its credit, Cloudflare also acknowledges that its communication during the crisis could have been better. For obvious reasons, all of its customers were clamoring for information but all the people with the answers were busy fixing it.

Worse, customers lost access to their Cloudflare Dashboard and API – because both pass through the Cloudflare edge, which was impacted – and so they were really in the dark. The business plans to fix both these issues by adding automatic updates to its status page and by having a way to bypass the normal Dashboard and API approach in an emergency, so people can get access to information.

So there you have it. It's not clear how much of an impact this cock-up has had on people's confidence in Cloudflare. The post is keen to point out the company hasn't had a global outage in six years – not including Verizon-induced problems of course.

Its honesty, clear breakdown and list of logical improvements – including not posting non-urgent updates to its super-fast global update system – will go some way to reassuring customers that Cloudflare is not going all-Evernote and building more and more services on top of sub-optimal code.

With luck it will be another six years until the Cloudflare-reliant internet goes down. ®

Disclosure: The Register is a Cloudflare customer.



Biting the hand that feeds IT © 1998–2020