Facebook rendered spineless by buggy audit code that missed catastrophic network config error

Explains mega-outage with boilerplate response: We try hard, we're sorry we failed, we'll try to do better

Facebook has admitted buggy auditing code was at the core of yesterday's six-hour outage – and revealed a little more about its infrastructure to explain how it vanished from the internet.

In a write-up by infrastructure veep Santosh Janardhan, titled "More details about the October 4 outage," the outrage-monetization giant confirmed early analyses that Facebook yesterday withdrew the Border Gateway Protocol (BGP) routes to its own DNS servers, causing its domain names to fail to resolve.

That led to its websites disappearing, apps stopping, and internal tools and services needed by staff to remedy the situation breaking down as well.

But this DNS and BGP borkage turns out to have been the consequence of other errors. Janardhan explained that Facebook operates two classes of data center.

One type was described as "massive buildings that house millions of machines," performing core computation and storage tasks. The other bit barns are "smaller facilities that connect our backbone network to the broader internet and the people using our platforms."

Users of Facebook's services first touch one of those smaller facilities, which then send traffic over Facebook's backbone to a larger data center. Like any complex system, that backbone is not set-and-forget – it requires maintenance. Facebook stuffed that up.

"During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network," Janardhan revealed.

That should not have happened. As the post explains:

"Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command."

Once the command was executed, it "caused a complete disconnection of our server connections between our data centers and the internet," Janardhan added.
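Facebook has not published how its audit tool works, so the following is only a minimal sketch of the general idea: a pre-flight check that refuses any maintenance command that would leave no backbone links standing. The names `Command`, `audit`, `run_with_audit`, and the `min_surviving` threshold are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """A hypothetical maintenance command and the backbone links it would disable."""
    name: str
    links_disabled: frozenset


def audit(command: Command, all_links: frozenset, min_surviving: int = 1) -> bool:
    """Return True only if enough backbone links stay up after the command runs."""
    surviving = all_links - command.links_disabled
    return len(surviving) >= min_surviving


def run_with_audit(command, all_links, execute):
    """Execute the command only if the audit passes; otherwise refuse."""
    if not audit(command, all_links):
        raise PermissionError(
            f"audit rejected {command.name!r}: too few backbone links would survive"
        )
    return execute(command)
```

A guard like this is only as good as its own correctness, which is the point of the story: if the check is buggy, the dangerous command sails straight through.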

Which was problematic. Facebook's smaller bit barns handle DNS queries for facebook.com, fb.com, instagram.com, etc. "Those translation queries are answered by our authoritative name servers that occupy well-known IP addresses themselves, which in turn are advertised to the rest of the internet via … BGP," as Janardhan put it.

Crucially, Facebook's DNS servers disable their BGP advertisements when those machines can't reach their own back-end data centers. That's fair enough, as this unavailability could be a sign of duff connectivity, and you'd only want to advertise routes to DNS servers that have robust links to their major centers.

So when the bad change hit Facebook's backbone, and all the data centers disconnected, all of Facebook's small bit barns declared themselves crocked and withdrew their BGP advertisements. So even though Facebook's DNS servers were up, they couldn't be reached by the outside world. Plus, the back-end systems were inaccessible due to the dead backbone, anyway. Failure upon failure.
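To see why a sensible per-site rule turned into a global failure, here is a toy simulation of the behavior described above. It is a sketch under simplifying assumptions, not Facebook's implementation: `EdgeSite`, `health_check`, and `reachable_dns_sites` are invented names, and real BGP withdrawal involves routers and peers rather than a boolean flag.

```python
from dataclasses import dataclass


@dataclass
class EdgeSite:
    """A hypothetical edge facility hosting an authoritative DNS server."""
    name: str
    backbone_up: bool = True
    advertising: bool = True

    def health_check(self) -> None:
        # Advertise the DNS route only while the back-end is reachable.
        self.advertising = self.backbone_up


def reachable_dns_sites(sites):
    """Names of sites whose DNS route is still advertised to the internet."""
    return [s.name for s in sites if s.advertising]


sites = [EdgeSite("edge-1"), EdgeSite("edge-2"), EdgeSite("edge-3")]

# One site loses its backbone link: the rule works as intended.
sites[0].backbone_up = False
for s in sites:
    s.health_check()
print(reachable_dns_sites(sites))  # ['edge-2', 'edge-3']

# The whole backbone goes down: every site withdraws at once.
for s in sites:
    s.backbone_up = False
    s.health_check()
print(reachable_dns_sites(sites))  # []
```

Each site makes a locally reasonable decision, yet because they all depend on the same backbone, the combined result is that no DNS server is advertised at all.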

Couldn't happen to a nicer bunch of blokes

While Facebook's post says it runs "storm" drills to ready itself to cope with outages, it had never simulated its backbone going down. Fixing the outage therefore proved … challenging.

"It was not possible to access our data centers through our normal means because their networks were down, and … the total loss of DNS broke many of the internal tools we'd normally use to investigate and resolve outages like this," Janardhan stated.

Engineers were dispatched to Facebook facilities, but those buildings are "designed with high levels of physical and system security," which makes them "hard to get into, and once you're inside, the hardware and routers are designed to be difficult to modify even when you have physical access.

"So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online."

This follows reports that employees' door keycards did not work on Facebook's campuses during the downtime, let alone internal diagnosis and collaboration tools, which hampered recovery. Once admins figured out the networking problem, they had to confront the impact of resuming service:

"We knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk."

The post doesn't explain how Facebook addressed those issues.
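One generic way to avoid a thundering-herd restart is to bring load back in stages and stop climbing if a health check fails. The sketch below illustrates that idea only; it is not a description of what Facebook did, and `ramp_schedule`, `restore`, and the `is_healthy` callback are hypothetical names.

```python
def ramp_schedule(target_load: float, steps: int):
    """Return evenly spaced load levels from target/steps up to target_load."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [target_load * i / steps for i in range(1, steps + 1)]


def restore(target_load, steps, is_healthy):
    """Raise load step by step; hold at the last healthy level if a check fails."""
    current = 0.0
    for level in ramp_schedule(target_load, steps):
        if not is_healthy(level):
            break
        current = level
    return current


# Four equal steps towards full load.
print(ramp_schedule(100, 4))  # [25.0, 50.0, 75.0, 100.0]
```

In practice the health check would look at signals such as power draw and cache hit rates, the very things the post says were at risk.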

Janardhan said he found it "interesting" to see how Facebook's security measures "slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making."

He owns those delays. "I believe a tradeoff like this is worth it – greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this," he wrote.

The post concludes with Facebook's usual admission of error despite earnest effort, apology, and pledge to improve. We're assuming the social network is telling the truth in its write-up.

Facebook is not alone in breaking itself or having unhealthy reliance on its own resources: a massive AWS outage in 2017 was caused by a single error, and IBM Cloud's June 2020 planet-wide outage was exacerbated by its status page being hosted on its own infrastructure, which left customers completely in the dark about the situation.

Site reliability engineers should know better. Especially in Facebook's case, as it was unable to serve ads for hours, its federated identity services are used by countless third-party websites, and the company has positioned itself as the ideal source of everyday personal and/or commercial communications for literally billions of people.

But as whistleblower Frances Haugen told US Congress, Facebook puts profit before people, many of its efforts to do otherwise are shallow and performative, and its sins of omission are many and constant. ®

Biting the hand that feeds IT © 1998–2021