That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix

'We feel bad about what happened'


The sound of rumbling rubber could be heard today as Salesforce threw an engineer responsible for a change that knocked it offline under a passing bus.

"We're not blaming one employee," said Chief Availability Officer Darryn Dieken after spending the first half hour of a Wednesday briefing on the outage doing pretty much that very thing.

To recap, on May 11 around 2100 UTC, a configuration change was applied to Salesforce's Domain Name System (DNS) servers that resulted in folks unable to access the software-as-a-service titan's products. For about five hours, clients could not reliably log in, and things got so bad that even the status page was unavailable.

Salesforce has been updating its public root cause analysis ever since, and Dieken said during his briefing to customers that a few more tweaks would be needed before the fix was completed.

It was during that call that the full extent of the screw-up was revealed and the engineer concerned launched buswards.

While Dieken boasted of the automation in place within Salesforce towers, some processes remain manual. One of these is related to DNS (yes, it is always DNS.) A lone engineer was tasked with making a configuration change to connect up a new Salesforce Hyperforce environment in Australia.

A DNS change is not an uncommon occurrence, and the engineer also had a four-year-old script to do the job. However, while Salesforce usually "staggers" changes to reduce the blast radius of blunders, the manual nature of this change meant it was up to the engineer to roll it out slowly.

This, alas, did not happen. The engineer instead decided erroneously, according to Dieken, to shortcut the normal procedures by using a so-called Emergency Break-Fix (EBF) process. The EBF is normally used when something really bad is happening, or an emergency patch is quickly and widely needed.

Going down the EBF route meant fewer approvals and a shortened process that wasn't gradual. Hey, this was a well-used script, the engineer had worked for Salesforce for years and these changes were pretty common. What could possibly go wrong?

In classic Who, Me? fashion, rather a lot.

We don't understand

"For whatever reason that we don't understand, the employee decided to do a global deployment," Dieken went on. The usual staggered approach was therefore bypassed. And a DNS change meant those servers would need restarting.

That in itself would not be a total catastrophe. Maybe a short outage, perhaps. But not the disaster that unfolded.

However, it transpired that lurking within that tried-and-trusted script was a bug. Under load, a timeout could happen that would stop other things from running. And sure enough, as the update was being rolled out across all of Salesforce's data centers, a timeout occurred. This in turn meant that certain tasks were not carried out when the servers were restarted. And that, in turn, meant that those servers did not return to operation correctly. That left customers unable to access Salesforce's products.

And then things got even worse. The Salesforce team has tools to deal with sad servers, and use what Dieken called "our emergency break glass process" to perform rollbacks and restarts.

"In this case," he went on, "we found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active."

It is always DNS.

We found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active

Of course, staff did eventually get into the servers to fix them up but, as customers felt all too keenly, it took quite some time to undo the mess.

As for forthcoming actions, safeguards are to be put in place to stop manual global deployments like this in future, and the process will be automated. Dieken also acknowledged that the experience had shown up a gap in Salesforce's test coverage – the script needed to be better tested, essentially. Finally, that dependency of the recovery tools on DNS needed to be dealt with.

Customers bemused that they had to get official word of the outage from social media or this very organ, rather than the status page, were doubtless more bemused at the revelation that the reason for the Salesforce status site falling over was due to auto-scale not being turned on for that web property. (During the downtime, Salesforce had to use its documentation site to explain to clients what was going wrong.)

"We over-provisioned enough capacity to make sure that we could handle large spikes," explained Dieken, "but we never foresaw that we'd have this type of load."

Not to worry, though, auto-scale is now on, so should things go south again at least the status site is unlikely to be embarrassingly absent.

And the engineer who sidestepped Salesforce's carefully crafted policies and took down the platform? "We have taken action with that particular employee," said Dieken. ®

Similar topics


Other stories you might like

  • IT staffing, recruitment biz settles claims it discriminated against Americans
    Foreign workers favored over US residents because that's what clients wanted, allegedly

    Amtex Systems Incorporated, an IT staffing and recruiting firm based in New York City, has agreed to settle claims it discriminated against American workers because company clients wanted workers with temporary visas.

    The US Department of Justice on Wednesday announced the agreement, which followed from a US citizen filing a discrimination complaint with the DoJ's Civil Rights Division’s Immigrant and Employee Rights Section (IER).

    "IT staffing agencies cannot unlawfully exclude applicants or impose additional burdens because of someone’s citizenship or immigration status," said Assistant Attorney General Kristen Clarke of the Justice Department’s Civil Rights Division, in a statement. "The Civil Rights Division is committed to enforcing the law to ensure that job applicants, including US workers, are protected from unlawful discrimination."

    Continue reading
  • Will this be one of the world's first RISC-V laptops?
    A sneak peek at a notebook that could be revealed this year

    Pic As Apple and Qualcomm push for more Arm adoption in the notebook space, we have come across a photo of what could become one of the world's first laptops to use the open-source RISC-V instruction set architecture.

    In an interview with The Register, Calista Redmond, CEO of RISC-V International, signaled we will see a RISC-V laptop revealed sometime this year as the ISA's governing body works to garner more financial and development support from large companies.

    It turns out Philipp Tomsich, chair of RISC-V International's software committee, dangled a photo of what could likely be the laptop in question earlier this month in front of RISC-V Week attendees in Paris.

    Continue reading
  • Did ID.me hoodwink Americans with IRS facial-recognition tech, senators ask
    Biz tells us: Won't someone please think of the ... fraud we've stopped

    Democrat senators want the FTC to investigate "evidence of deceptive statements" made by ID.me regarding the facial-recognition technology it controversially built for Uncle Sam.

    ID.me made headlines this year when the IRS said US taxpayers would have to enroll in the startup's facial-recognition system to access their tax records in the future. After a public backlash, the IRS reconsidered its plans, and said taxpayers could choose non-biometric methods to verify their identity with the agency online.

    Just before the IRS controversy, ID.me said it uses one-to-one face comparisons. "Our one-to-one face match is comparable to taking a selfie to unlock a smartphone. ID.me does not use one-to-many facial recognition, which is more complex and problematic. Further, privacy is core to our mission and we do not sell the personal information of our users," it said in January.

    Continue reading
  • Meet Wizard Spider, the multimillion-dollar gang behind Conti, Ryuk malware
    Russia-linked crime-as-a-service crew is rich, professional – and investing in R&D

    Analysis Wizard Spider, the Russia-linked crew behind high-profile malware Conti, Ryuk and Trickbot, has grown over the past five years into a multimillion-dollar organization that has built a corporate-like operating model, a year-long study has found.

    In a technical report this week, the folks at Prodaft, which has been tracking the cybercrime gang since 2021, outlined its own findings on Wizard Spider, supplemented by info that leaked about the Conti operation in February after the crooks publicly sided with Russia during the illegal invasion of Ukraine.

    What Prodaft found was a gang sitting on assets worth hundreds of millions of dollars funneled from multiple sophisticated malware variants. Wizard Spider, we're told, runs as a business with a complex network of subgroups and teams that target specific types of software, and has associations with other well-known miscreants, including those behind REvil and Qbot (also known as Qakbot or Pinkslipbot).

    Continue reading
  • Supreme Court urged to halt 'unconstitutional' Texas content-no-moderation law
    Everyone's entitled to a viewpoint but what's your viewpoint on what exactly is and isn't a viewpoint?

    A coalition of advocacy groups on Tuesday asked the US Supreme Court to block Texas' social media law HB 20 after the US Fifth Circuit Court of Appeals last week lifted a preliminary injunction that had kept it from taking effect.

    The Lone Star State law, which forbids large social media platforms from moderating content that's "lawful-but-awful," as advocacy group the Center for Democracy and Technology puts it, was approved last September by Governor Greg Abbott (R). It was immediately challenged in court and the judge hearing the case imposed a preliminary injunction, preventing the legislation from being enforced, on the basis that the trade groups opposing it – NetChoice and CCIA – were likely to prevail.

    But that injunction was lifted on appeal. That case continues to be litigated, but thanks to the Fifth Circuit, HB 20 can be enforced even as its constitutionality remains in dispute, hence the coalition's application [PDF] this month to the Supreme Court.

    Continue reading

Biting the hand that feeds IT © 1998–2022