Facebook rendered spineless by buggy audit code that missed catastrophic network config error

Explains mega-outage with boilerplate response: We try hard, we're sorry we failed, we'll try to do better

Facebook has admitted buggy auditing code was at the core of yesterday's six-hour outage – and revealed a little more about its infrastructure to explain how it vanished from the internet.

In a write-up by infrastructure veep Santosh Janardhan, titled "More details about the October 4 outage," the outrage-monetization giant confirmed early analyses that Facebook yesterday withdrew the border gateway protocol (BGP) routing to its own DNS servers, causing its domain names to fail to resolve.

That led to its websites disappearing, apps stopping, and internal tools and services needed by staff to remedy the situation breaking down as well.

But this DNS and BGP borkage turns out to have been the consequence of other errors. Janardhan explained that Facebook operates two classes of data center.

One type was described as "massive buildings that house millions of machines," performing core computation and storage tasks. The other class comprises smaller bit barns: "smaller facilities that connect our backbone network to the broader internet and the people using our platforms."

Users of Facebook's services first touch one of those smaller facilities, which then send traffic over Facebook's backbone to a larger data center. Like any complex system, that backbone is not set-and-forget – it requires maintenance. Facebook stuffed that up.

"During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network," Janardhan revealed.

That should not have happened. As the post explains:

Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command

Once the command was executed, it "caused a complete disconnection of our server connections between our data centers and the internet," Janardhan added.
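Facebook hasn't published its audit tooling, but the reported failure mode – a validator that waves through a command it should have blocked – can be sketched. Every name, threshold, and rule below is invented for illustration, not taken from Facebook's systems:

```python
# Hypothetical sketch of a pre-execution command audit, illustrating the
# reported failure mode: a bug lets a dangerous command through unchecked.
# All names and thresholds here are invented for illustration.

def audit_command(command: str, affected_links: int, total_links: int) -> bool:
    """Return True if the command is deemed safe to run on the backbone."""
    # Intended rule: refuse any command touching more than 10% of backbone links.
    blast_radius = affected_links / total_links
    return blast_radius <= 0.10

def buggy_audit_command(command: str, affected_links: int, total_links: int) -> bool:
    """Same check, with a bug of the kind Facebook describes."""
    # Bug: an "assessment" command is assumed to be read-only and skips the
    # blast-radius check entirely -- so a command that actually takes links
    # down is approved anyway.
    if "assess" in command:
        return True
    return audit_command(command, affected_links, total_links)

cmd = "assess global backbone capacity"
print(audit_command(cmd, affected_links=100, total_links=100))        # False: blocked
print(buggy_audit_command(cmd, affected_links=100, total_links=100))  # True: waved through
```

The intended check refuses the command outright; the buggy variant approves it because it misclassifies an "assessment" as harmless, which matches the shape of the failure Janardhan describes.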

Which was problematic. Facebook's smaller bit barns handle DNS queries for facebook.com, fb.com, instagram.com, etc. "Those translation queries are answered by our authoritative name servers that occupy well-known IP addresses themselves, which in turn are advertised to the rest of the internet via … BGP," as Janardhan put it.

Crucially, Facebook's DNS servers withdraw their BGP advertisements when those machines can't reach their own back-end data centers. That's fair enough: unreachability could be a sign of duff connectivity, and you'd only want to advertise routes to DNS servers that have robust links to the major data centers.

So when the bad change hit Facebook's backbone, and all the data centers disconnected, all of Facebook's small bit barns declared themselves crocked and withdrew their BGP advertisements. So even though Facebook's DNS servers were up, they couldn't be reached by the outside world. Plus, the back-end systems were inaccessible due to the dead backbone, anyway. Failure upon failure.
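That fail-safe – withdraw your own route announcement when you can't see the back end – can be sketched in miniature. The function names and probe logic are invented; a real deployment would announce and withdraw prefixes via a routing daemon such as BIRD or ExaBGP:

```python
# Minimal sketch (invented names) of the DNS servers' fail-safe: an edge
# site that cannot reach its back-end data centers withdraws the BGP
# advertisement for its own well-known DNS IP, steering clients toward
# healthier sites. When every site fails the check at once -- a dead
# backbone -- every route vanishes and the domains stop resolving.

def backend_reachable(site: str, backbone_up: bool) -> bool:
    # Stand-in for a real health probe over the backbone.
    return backbone_up

def update_bgp_advertisements(sites: list[str], backbone_up: bool) -> dict[str, bool]:
    """Map each edge site to whether it is advertising its DNS prefix."""
    return {site: backend_reachable(site, backbone_up) for site in sites}

edge_sites = ["pop-1", "pop-2", "pop-3"]
print(update_bgp_advertisements(edge_sites, backbone_up=True))   # all advertising
print(update_bgp_advertisements(edge_sites, backbone_up=False))  # all withdrawn
```

The design is sound against a single flaky site; its blind spot, as the outage showed, is a correlated failure that trips the same check everywhere simultaneously.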

Couldn't happen to a nicer bunch of blokes

While Facebook's post says it runs "storm" drills to ready itself to cope with outages, it had never simulated its backbone going down. Fixing the outage therefore proved … challenging.

"It was not possible to access our data centers through our normal means because their networks were down, and … the total loss of DNS broke many of the internal tools we'd normally use to investigate and resolve outages like this," Janardhan stated.

Engineers were dispatched to Facebook facilities, but those buildings are "designed with high levels of physical and system security," which makes them "hard to get into, and once you're inside, the hardware and routers are designed to be difficult to modify even when you have physical access."

It took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers

"So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online."

This follows reports that during the downtime employees' door keycards stopped working on Facebook's campuses – never mind internal diagnostic and collaboration tools – hampering recovery. Once admins figured out the networking problem, they had to confront the impact of resuming service:

"We knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk."

The post doesn't explain how Facebook addressed those issues.

Janardhan said he found it "interesting" to see how Facebook's security measures "slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making."

He owns those delays. "I believe a tradeoff like this is worth it – greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this," he wrote.

The post concludes with Facebook's usual admission of error despite earnest effort, apology, and pledge to improve. We're assuming the social network is telling the truth in its write-up.

Facebook is not alone in breaking itself or having unhealthy reliance on its own resources: a massive AWS outage in 2017 was caused by a single error, and IBM Cloud's June 2020 planet-wide outage was exacerbated by its status page being hosted on its own infrastructure, which left customers completely in the dark about the situation.

Site reliability engineers should know better. Especially in Facebook's case, as it was unable to serve ads for hours, its federated identity services are used by countless third-party web sites, and the company has positioned itself as the ideal source of everyday personal and/or commercial communications for literally billions of people.

But as whistleblower Frances Haugen told US Congress, Facebook puts profit before people, many of its efforts to do otherwise are shallow and performative, and its sins of omission are many and constant. ®
