We have some sad news about Facebook. It has returned to the internet after six-hour mega outage

It’s not DNS. There is no way it’s just DNS. It was BGP


Updated: Facebook has struggled back online today, though at the time of writing glitches are still very much a part of The Social Network™ experience.

WhatsApp and Facebook became available to users at around 2210 UTC on October 4 after falling off the internet some six or so hours prior. Instagram and Facebook Messenger should not be far behind.

In the past hour, Facebook tweeted: "To the huge community of people and businesses around the world who depend on us: we're sorry. We’ve been working hard to restore access to our apps and services and are happy to report they are coming back online now. Thank you for bearing with us."

CTO Mike Schroepfer earlier said: "Sincere apologies to everyone impacted by outages of Facebook powered services right now. We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible."

Founder Mark Zuckerberg chimed in: "Facebook, Instagram, WhatsApp and Messenger are coming back online now. Sorry for the disruption today – I know how much you rely on our services to stay connected with the people you care about." His otherwise most recent missive was a video of him on a yacht.

The Register staff in the United States and Australia have experienced different levels of service since the resumption.

One vulture in the USA was able to post to Facebook without issue. Antipodean staff were unable to post and saw errors such as the following...

[Screenshot: Facebook error]

Attempts to view notifications produced a “query error” dialog. WhatsApp was flaky – linking devices took over a minute. Instagram was down, at least in Australia, where its favicon loaded in a browser tab, but the site produced only the message, “Oops, an error occurred.”

Theories about the cause of the outage have focused on Facebook’s sudden withdrawal of BGP routes to its own DNS servers, causing look-ups for Facebook domain names, such as facebook.com and instagram.com, to fail when those servers vanished from the internet.
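To see why pulling routes to a set of name servers is enough to make a whole domain vanish, here is a toy model in Python. It is only a sketch under loud assumptions: the `ToyInternet` class, the `example.com` zone, and the documentation-range addresses are all hypothetical, and nothing here reflects Facebook's actual network or any real BGP or DNS software.

```python
import ipaddress


class ToyInternet:
    """Hypothetical toy model: announced prefixes plus authoritative name servers."""

    def __init__(self):
        self.announced = set()   # prefixes currently advertised (stand-in for BGP)
        self.nameservers = {}    # zone -> address of its authoritative name server
        self.records = {}        # hostname -> address served by that name server

    def announce(self, prefix):
        self.announced.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self.announced.discard(ipaddress.ip_network(prefix))

    def reachable(self, address):
        # An address is reachable only if some announced prefix covers it.
        ip = ipaddress.ip_address(address)
        return any(ip in net for net in self.announced)

    def resolve(self, hostname):
        # Look-ups must first reach the zone's authoritative name server.
        zone = ".".join(hostname.split(".")[-2:])
        ns = self.nameservers[zone]
        if not self.reachable(ns):
            raise LookupError(f"no route to name server {ns} for {zone}")
        return self.records[hostname]


net = ToyInternet()
net.announce("192.0.2.0/24")       # assumed prefix holding the name server
net.announce("198.51.100.0/24")    # assumed prefix holding the web servers
net.nameservers["example.com"] = "192.0.2.53"
net.records["www.example.com"] = "198.51.100.10"

before = net.resolve("www.example.com")

# Withdraw only the name server's prefix; the web servers stay routed.
net.withdraw("192.0.2.0/24")
try:
    net.resolve("www.example.com")
    after = "resolved"
except LookupError as exc:
    after = f"failed: {exc}"

print(before)
print(after)
```

In this model the web servers' prefix is still announced after the withdrawal, yet the name no longer resolves, which is the failure pattern the theories describe.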


The failure not only brought down Facebook's empire of apps but also apparently even caused door keycards to stop working on Facebook's campus. Staff fell back to Outlook, Zoom, and Discord to organize themselves and work on correcting the problem as they were unable to use internal and external Facebook-based systems. This hampered the recovery effort, leading to the hours of downtime.

In May this year, Facebook announced it had built an automated peering configuration system. This software may or may not have been at the heart of today's outage. Someone claiming to work at Facebook wrote in a since-deleted Reddit post that Facebook's BGP peering went down, likely due to a configuration blunder somewhere.

The IT breakdown was such that engineers needed to get physical access to the routers to fix and restart them, and a crack team was sent into Facebook's Santa Clara, California, data center to do that, according to the New York Times. It's believed that, due to pandemic restrictions, the staff with the skills to repair the network were not on-site and obviously couldn't log in remotely.

The outage comes at a terrible time for Facebook, which in recent days has been the subject of damning leaks that suggest the company is indifferent to the harms its platforms can cause, including increasing the likelihood of self-harm by users and facilitating human trafficking, and that its efforts to suppress hate speech and misinformation are ineffectual.

Documents shared by whistleblower Frances Haugen, a former Facebook employee, have also suggested The Social Network™ ignored rules about content for high-profile users, and employed woefully insufficient numbers of staff who speak users’ native languages, thereby allowing vile content to circulate without checks.

Haugen has filed a complaint with the United States’ Securities and Exchange Commission, suggesting Facebook withheld information investors need to make informed decisions. That’s the kind of indirect but effective tactic that sees authorities chase mobsters for unpaid taxes rather than trying to secure evidence of murders.

The Register does not suggest Facebook has murdered anyone.

But today’s outages may have been extremely serious for those who rely on its services for day-to-day communications, both in their personal lives and for businesses that have gone all-in on Facebook as a customer communication and sales channel. ®

Updated to add

Facebook has shared an official statement on what happened. It confirmed the outage took down its internal tools and systems, "complicating our attempts to quickly diagnose and resolve the problem." That's why it took so long to fix: it's hard to do so when your infrastructure has self-imploded.

It also confirmed an accidental configuration change ultimately caused the loss in connectivity, though you'll note the details have been massaged for the general public:

"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.

"We want to make clear at this time we believe the root cause of this outage was a faulty configuration change."

It also stressed there is "no evidence that user data was compromised," well, any more than it usually is on Facebook.

You can find more technical info on the outage here.


Biting the hand that feeds IT © 1998–2021