That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix

'We feel bad about what happened'


The sound of rumbling rubber could be heard today as Salesforce threw an engineer responsible for a change that knocked it offline under a passing bus.

"We're not blaming one employee," said Chief Availability Officer Darryn Dieken after spending the first half hour of a Wednesday briefing on the outage doing pretty much that very thing.

To recap, on May 11 around 2100 UTC, a configuration change was applied to Salesforce's Domain Name System (DNS) servers that resulted in folks unable to access the software-as-a-service titan's products. For about five hours, clients could not reliably log in, and things got so bad that even the status page was unavailable.

Salesforce has been updating its public root cause analysis ever since, and Dieken said during his briefing to customers that a few more tweaks would be needed before the fix was completed.

It was during that call that the full extent of the screw-up was revealed and the engineer concerned launched buswards.

While Dieken boasted of the automation in place within Salesforce towers, some processes remain manual. One of these is related to DNS (yes, it is always DNS.) A lone engineer was tasked with making a configuration change to connect up a new Salesforce Hyperforce environment in Australia.

A DNS change is not an uncommon occurrence, and the engineer also had a four-year-old script to do the job. However, while Salesforce usually "staggers" changes to reduce the blast radius of blunders, the manual nature of this change meant it was up to the engineer to roll it out slowly.

This, alas, did not happen. The engineer instead decided erroneously, according to Dieken, to shortcut the normal procedures by using a so-called Emergency Break-Fix (EBF) process. The EBF is normally used when something really bad is happening, or an emergency patch is quickly and widely needed.

Going down the EBF route meant fewer approvals and a shortened process that wasn't gradual. Hey, this was a well-used script, the engineer had worked for Salesforce for years and these changes were pretty common. What could possibly go wrong?

In classic Who, Me? fashion, rather a lot.

We don't understand

"For whatever reason that we don't understand, the employee decided to do a global deployment," Dieken went on. The usual staggered approach was therefore bypassed. And a DNS change meant those servers would need restarting.

That in itself would not be a total catastrophe. Maybe a short outage, perhaps. But not the disaster that unfolded.

However, it transpired that lurking within that tried-and-trusted script was a bug. Under load, a timeout could happen that would stop other things from running. And sure enough, as the update was being rolled out across all of Salesforce's data centers, a timeout occurred. This in turn meant that certain tasks were not carried out when the servers were restarted. And that, in turn, meant that those servers did not return to operation correctly. That left customers unable to access Salesforce's products.

And then things got even worse. The Salesforce team has tools to deal with sad servers, and use what Dieken called "our emergency break glass process" to perform rollbacks and restarts.

"In this case," he went on, "we found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active."

It is always DNS.

We found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active

Of course, staff did eventually get into the servers to fix them up but, as customers felt all too keenly, it took quite some time to undo the mess.

As for forthcoming actions, safeguards are to be put in place to stop manual global deployments like this in future, and the process will be automated. Dieken also acknowledged that the experience had shown up a gap in Salesforce's test coverage – the script needed to be better tested, essentially. Finally, that dependency of the recovery tools on DNS needed to be dealt with.

Customers bemused that they had to get official word of the outage from social media or this very organ, rather than the status page, were doubtless more bemused at the revelation that the reason for the Salesforce status site falling over was due to auto-scale not being turned on for that web property. (During the downtime, Salesforce had to use its documentation site to explain to clients what was going wrong.)

"We over-provisioned enough capacity to make sure that we could handle large spikes," explained Dieken, "but we never foresaw that we'd have this type of load."

Not to worry, though, auto-scale is now on, so should things go south again at least the status site is unlikely to be embarrassingly absent.

And the engineer who sidestepped Salesforce's carefully crafted policies and took down the platform? "We have taken action with that particular employee," said Dieken. ®

Similar topics


Other stories you might like

  • Prisons transcribe private phone calls with inmates using speech-to-text AI

    Plus: A drug designed by machine learning algorithms to treat liver disease reaches human clinical trials and more

    In brief Prisons around the US are installing AI speech-to-text models to automatically transcribe conversations with inmates during their phone calls.

    A series of contracts and emails from eight different states revealed how Verus, an AI application developed by LEO Technologies and based on a speech-to-text system offered by Amazon, was used to eavesdrop on prisoners’ phone calls.

    In a sales pitch, LEO’s CEO James Sexton told officials working for a jail in Cook County, Illinois, that one of its customers in Calhoun County, Alabama, uses the software to protect prisons from getting sued, according to an investigation by the Thomson Reuters Foundation.

    Continue reading
  • Battlefield 2042: Please don't be the death knell of the franchise, please don't be the death knell of the franchise

    Another terrible launch, but DICE is already working on improvements

    The RPG Greetings, traveller, and welcome back to The Register Plays Games, our monthly gaming column. Since the last edition on New World, we hit level cap and the "endgame". Around this time, item duping exploits became rife and every attempt Amazon Games made to fix it just broke something else. The post-level 60 "watermark" system for gear drops is also infuriating and tedious, but not something we were able to address in the column. So bear these things in mind if you were ever tempted. On that note, it's time to look at another newly released shit show – Battlefield 2042.

    I wanted to love Battlefield 2042, I really did. After the bum note of the first-person shooter (FPS) franchise's return to Second World War theatres with Battlefield V (2018), I stupidly assumed the next entry from EA-owned Swedish developer DICE would be a return to form. I was wrong.

    The multiplayer military FPS market is dominated by two forces: Activision's Call of Duty (COD) series and EA's Battlefield. Fans of each franchise are loyal to the point of zealotry with little crossover between player bases. Here's where I stand: COD jumped the shark with Modern Warfare 2 in 2009. It's flip-flopped from WW2 to present-day combat and back again, tried sci-fi, and even the Battle Royale trend with the free-to-play Call of Duty: Warzone (2020), which has been thoroughly ruined by hackers and developer inaction.

    Continue reading
  • American diplomats' iPhones reportedly compromised by NSO Group intrusion software

    Reuters claims nine State Department employees outside the US had their devices hacked

    The Apple iPhones of at least nine US State Department officials were compromised by an unidentified entity using NSO Group's Pegasus spyware, according to a report published Friday by Reuters.

    NSO Group in an email to The Register said it has blocked an unnamed customers' access to its system upon receiving an inquiry about the incident but has yet to confirm whether its software was involved.

    "Once the inquiry was received, and before any investigation under our compliance policy, we have decided to immediately terminate relevant customers’ access to the system, due to the severity of the allegations," an NSO spokesperson told The Register in an email. "To this point, we haven’t received any information nor the phone numbers, nor any indication that NSO’s tools were used in this case."

    Continue reading

Biting the hand that feeds IT © 1998–2021