Finally, made it to the weekend, time to breathe, relax, and... Cloudflare's taken down a chunk of the web
DNS provider goes dark amid bad routing, world+dog goes through nine minutes of terror
Updated Global internet glue Cloudflare experienced a brief network outage on Friday that broke multiple apps and websites, including your humble Register.
On its status page, as of Jul 17, 21:37 UTC, the DNS-and-everything-else provider said it was "investigating issues with Cloudflare Resolver and our edge network in certain locations," and warned that customers in certain regions may experience failures or errors.
Affected services included the Cloudflare API and Cloudflare Recursive DNS, both of which were listed as having degraded performance. And in various regions around the world where Cloudflare handles network traffic, the status page said data was being rerouted.
Nine minutes later, at Jul 17, 21:46 UTC, the biz announced a fix had been implemented without immediately saying what happened. Within the past few minutes, a Cloudflare spokesperson told The Register the blip was due to a blunder involving one of its routers:
"This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and monitoring systems for stability now. We will share more shortly – we have a team writing an update as we speak."
Cloudflare CEO Matthew Prince then pointed to a single piece of equipment in Atlanta, USA, as the culprit:
We isolated the Atlanta router and shut down our backbone, routing traffic across transit providers instead. There was some congestion that caused slow performance on some links as the logging caught up. Everything is restored now and we're looking into the root cause. 2/2— Matthew Prince 🌥 (@eastdakota) July 17, 2020
He added the glitch "appears to have impacted about 50 per cent of our traffic for a bit over 20 minutes."
"The issue was caused by a mistaken configuration we were applying to a router during a routine update," said Prince. "There was no attack. It was not a failure of the router software. Blog post with details coming soon as are protections to our backbone to prevent going forward."
Because Cloudflare handles DNS services and edge computing services for many, many non-commercial and commercial websites, including The Register and our backend systems, the brief service interruption drew immediate notice.
#FLASH Cloudflare - which handles half the internets DNS is suffering an 'outage'. Many sites and services affected.— JΞSŦΞR ✪ ΔCŦUΔL³³°¹ (@th3j35t3r) July 17, 2020
Breaking: A large outage took down Cloudflare, a website hosting, network and internet security provider. The outage is mainly resolved. More than 80+ websites and apps were down. pic.twitter.com/cg2vDSVNdu— Porter Medium (@PorterMedium) July 17, 2020
Good on @Cloudflare to encourage social distancing by distancing everyone from the internet— The Register (@TheRegister) July 17, 2020
Aside from El Reg, services said to have been affected include Authy, DigitalOcean, Discord, Downdetector, GitLab, Medium, Patreon, and Riot, among others.
Cloudflare seems to have resolved their DNS issues, and all services are operational now. We are monitoring for now. An incident issue was created and reviewed in https://t.co/nYPsQt7kD8 https://t.co/FRkUs3EQOU— GitLab.com Status (@gitlabstatus) July 17, 2020
The incident, while short-lived, is yet another reminder of the fragility of critical online services. We'll let you know when we know more. ®
Updated to add
Cloudflare has published a fairly detailed postmortem of the downtime.
"We are sorry for this outage and have already made a global change to the backbone configuration that will prevent it from being able to occur again," said CTO John Graham-Cumming.