Fastly 'fesses up to breaking the internet with 'an undiscovered software bug' triggered by a customer
Promises it won't happen again, expresses remorse … all the usual stuff that clouds (and Zuck) say after they stumble around making messes
Fastly has explained how it managed to black-hole big chunks of the internet yesterday: a customer triggered a bug.
The customer, Fastly points out in a post titled Summary of June 8 outage, was blameless. "We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change," wrote Nick Rockwell, the company's senior veep of engineering and infrastructure.
The bug was introduced in a 12 May software deployment and lay dormant until, on 8 June, "a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85 per cent of our network to return errors."
Cue global chaos.
Rockwell's post states that Fastly "detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95 per cent of our network was operating as normal."
The veep also admitted that Fastly should have done better.
"Even though there were specific conditions that triggered this outage, we should have anticipated it," he wrote.
The company has therefore resolved to do four things:
- We're deploying the bug fix across our network as quickly and safely as possible.
- We are conducting a complete post mortem of the processes and practices we followed during this incident.
- We'll figure out why we didn't detect the bug during our software quality assurance and testing processes.
- We'll evaluate ways to improve our remediation time.
And, of course, it has apologised and promised it will do its very best not to make mistakes like this again. Which is just what all clouds, and social networks, say when they make avoidable but very damaging errors. ®