That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix

'We feel bad about what happened'


The sound of rumbling rubber could be heard today as Salesforce threw an engineer responsible for a change that knocked it offline under a passing bus.

"We're not blaming one employee," said Chief Availability Officer Darryn Dieken after spending the first half hour of a Wednesday briefing on the outage doing pretty much that very thing.

To recap, on May 11 around 2100 UTC, a configuration change was applied to Salesforce's Domain Name System (DNS) servers that left folks unable to access the software-as-a-service titan's products. For about five hours, clients could not reliably log in, and things got so bad that even the status page was unavailable.

Salesforce has been updating its public root cause analysis ever since, and Dieken said during his briefing to customers that a few more tweaks would be needed before the fix was completed.

It was during that call that the full extent of the screw-up was revealed and the engineer concerned launched buswards.

While Dieken boasted of the automation in place within Salesforce towers, some processes remain manual. One of these is related to DNS (yes, it is always DNS). A lone engineer was tasked with making a configuration change to connect up a new Salesforce Hyperforce environment in Australia.

A DNS change is not an uncommon occurrence, and the engineer also had a four-year-old script to do the job. However, while Salesforce usually "staggers" changes to reduce the blast radius of blunders, the manual nature of this change meant it was up to the engineer to roll it out slowly.
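For readers wondering what "staggering" looks like in practice, here is a minimal Python sketch. It is not Salesforce's tooling: the stage names, the `apply_change` and `is_healthy` callables, and the `RolloutHalted` exception are all hypothetical, invented purely to illustrate the idea of changing a small group, checking it, and only then moving on.

```python
from typing import Callable, Iterable, List


class RolloutHalted(Exception):
    """Raised when a stage fails its health check (hypothetical helper)."""

    def __init__(self, changed: List[str], unhealthy: List[str]) -> None:
        super().__init__(
            f"halted after {len(changed)} targets; unhealthy: {unhealthy}"
        )
        self.changed = changed
        self.unhealthy = unhealthy


def staggered_rollout(
    stages: Iterable[List[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change one stage at a time, halting at the first unhealthy stage.

    Returns every target that was changed if all stages pass.
    """
    changed: List[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            changed.append(target)
        # Check the whole stage before widening the blast radius.
        unhealthy = [t for t in stage if not is_healthy(t)]
        if unhealthy:
            raise RolloutHalted(changed, unhealthy)
    return changed
```

A global deployment is, in effect, the same loop with a single stage containing everything, which leaves no healthy remainder to fall back on if the check fails.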

This, alas, did not happen. The engineer instead decided, erroneously according to Dieken, to shortcut the normal procedures by using a so-called Emergency Break-Fix (EBF) process. The EBF is normally used when something really bad is happening, or an emergency patch is quickly and widely needed.

Going down the EBF route meant fewer approvals and a shortened process that wasn't gradual. Hey, this was a well-used script, the engineer had worked for Salesforce for years and these changes were pretty common. What could possibly go wrong?

In classic Who, Me? fashion, rather a lot.

We don't understand

"For whatever reason that we don't understand, the employee decided to do a global deployment," Dieken went on. The usual staggered approach was therefore bypassed. And a DNS change meant those servers would need restarting.

That in itself would not be a total catastrophe. Maybe a short outage, perhaps. But not the disaster that unfolded.

However, it transpired that lurking within that tried-and-trusted script was a bug. Under load, a timeout could happen that would stop other things from running. And sure enough, as the update was being rolled out across all of Salesforce's data centers, a timeout occurred. This in turn meant that certain tasks were not carried out when the servers were restarted. And that, in turn, meant that those servers did not return to operation correctly. That left customers unable to access Salesforce's products.
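The script itself has not been published, so the following Python sketch is only a guess at the general shape of that failure mode. The step names and the `stop_on_timeout` flag are assumptions for illustration: it shows how one timed-out step can silently leave later steps undone, and why a server should not be treated as back in service unless every step reports success.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair; the names used below are hypothetical.
Step = Tuple[str, Callable[[], None]]


def run_restart_steps(steps: List[Step], stop_on_timeout: bool) -> Dict[str, str]:
    """Run named steps in order and report each as ok, timeout, or skipped."""
    results: Dict[str, str] = {}
    aborted = False
    for name, action in steps:
        if aborted:
            # An earlier timeout stopped the run, so this step never happens.
            results[name] = "skipped"
            continue
        try:
            action()
            results[name] = "ok"
        except TimeoutError:
            results[name] = "timeout"
            aborted = stop_on_timeout
    return results


def safe_to_serve(results: Dict[str, str]) -> bool:
    """Treat a server as back in service only if every step completed."""
    return bool(results) and all(status == "ok" for status in results.values())


def times_out_under_load() -> None:
    """Stand-in for a step that times out when the system is busy."""
    raise TimeoutError("simulated timeout under load")


steps: List[Step] = [
    ("push_config", times_out_under_load),
    ("reload_zones", lambda: None),
    ("rejoin_pool", lambda: None),
]
report = run_restart_steps(steps, stop_on_timeout=True)
```

The point of `safe_to_serve` is that a skipped step is as bad as a failed one: the restart may look finished while the server is in no state to answer queries.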

And then things got even worse. The Salesforce team has tools to deal with sad servers, and uses what Dieken called "our emergency break glass process" to perform rollbacks and restarts.

"In this case," he went on, "we found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active."

It is always DNS.
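Circular dependencies of this kind can be hunted down mechanically if the dependencies are written down as a graph. The Python sketch below assumes exactly that; the node names are illustrative stand-ins rather than Salesforce's real components, and the search is an ordinary depth-first walk that reports the first loop it finds.

```python
from typing import Dict, List, Optional


def find_cycle(needs: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in needs.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have looped back.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in needs:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical graph: each key needs the listed items to be working first.
recovery_graph = {
    "dns_recovery": ["production_access_tool"],
    "production_access_tool": ["dns"],
    "dns": ["dns_recovery"],  # a broken DNS needs recovery before it works
}
cycle = find_cycle(recovery_graph)
```

Run against the hypothetical graph above, the search comes back with a loop that starts and ends at `dns_recovery`, which is the situation Dieken described: the way in to fix DNS needed DNS.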

Of course, staff did eventually get into the servers to fix them up but, as customers felt all too keenly, it took quite some time to undo the mess.

As for forthcoming actions, safeguards are to be put in place to stop manual global deployments like this in future, and the process will be automated. Dieken also acknowledged that the experience had shown up a gap in Salesforce's test coverage – the script needed to be better tested, essentially. Finally, that dependency of the recovery tools on DNS needed to be dealt with.

Customers bemused that they had to get official word of the outage from social media or this very organ, rather than the status page, were doubtless more bemused at the revelation that the Salesforce status site fell over because auto-scale was not turned on for that web property. (During the downtime, Salesforce had to use its documentation site to explain to clients what was going wrong.)

"We over-provisioned enough capacity to make sure that we could handle large spikes," explained Dieken, "but we never foresaw that we'd have this type of load."

Not to worry, though, auto-scale is now on, so should things go south again at least the status site is unlikely to be embarrassingly absent.

And the engineer who sidestepped Salesforce's carefully crafted policies and took down the platform? "We have taken action with that particular employee," said Dieken. ®

