That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix

'We feel bad about what happened'


The sound of rumbling rubber could be heard today as Salesforce threw an engineer responsible for a change that knocked it offline under a passing bus.

"We're not blaming one employee," said Chief Availability Officer Darryn Dieken after spending the first half hour of a Wednesday briefing on the outage doing pretty much that very thing.

To recap, on May 11 around 2100 UTC, a configuration change was applied to Salesforce's Domain Name System (DNS) servers that left folks unable to access the software-as-a-service titan's products. For about five hours, clients could not reliably log in, and things got so bad that even the status page was unavailable.

Salesforce has been updating its public root cause analysis ever since, and Dieken said during his briefing to customers that a few more tweaks would be needed before the fix was completed.

It was during that call that the full extent of the screw-up was revealed and the engineer concerned launched buswards.

While Dieken boasted of the automation in place within Salesforce towers, some processes remain manual. One of these is related to DNS (yes, it is always DNS). A lone engineer was tasked with making a configuration change to connect up a new Salesforce Hyperforce environment in Australia.

A DNS change is not an uncommon occurrence, and the engineer also had a four-year-old script to do the job. However, while Salesforce usually "staggers" changes to reduce the blast radius of blunders, the manual nature of this change meant it was up to the engineer to roll it out slowly.
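For readers wondering what "staggering" a change looks like in practice, here is a minimal sketch in Python. It is not Salesforce's tooling: the function name, the `apply_change` and `is_healthy` callbacks, the wave sizes, and the site names are all hypothetical, chosen only to illustrate the idea of rolling out in small waves and halting at the first sign of trouble.

```python
from typing import Callable, Iterable, List


def staggered_rollout(
    targets: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    wave_sizes: Iterable[int] = (1, 2, 4),
) -> List[str]:
    """Apply a change in growing waves and stop after the first unhealthy wave.

    Returns the targets that were changed before the rollout finished or halted.
    All names here are illustrative, not any vendor's real API.
    """
    sizes = list(wave_sizes)
    if not sizes or any(size < 1 for size in sizes):
        raise ValueError("wave_sizes must contain positive integers")

    changed: List[str] = []
    remaining = list(targets)
    wave = 0
    while remaining:
        # Reuse the last wave size once the list of sizes is exhausted.
        size = sizes[min(wave, len(sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for target in batch:
            apply_change(target)
            changed.append(target)
        # Check the whole wave before touching anything else.
        if not all(is_healthy(target) for target in batch):
            break
        wave += 1
    return changed


if __name__ == "__main__":
    sites = ["site-a", "site-b", "site-c", "site-d", "site-e"]
    # Pretend the change breaks site-b: the rollout halts after the second wave.
    print(staggered_rollout(sites, lambda s: None, lambda s: s != "site-b"))
```

In the demo, the first wave touches one site, the second touches two, and the failure in the second wave means the last two sites are never changed. A global deployment, by contrast, is the equivalent of one wave containing everything.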

This, alas, did not happen. The engineer instead decided erroneously, according to Dieken, to shortcut the normal procedures by using a so-called Emergency Break-Fix (EBF) process. The EBF is normally used when something really bad is happening, or an emergency patch is quickly and widely needed.

Going down the EBF route meant fewer approvals and a shortened process that wasn't gradual. Hey, this was a well-used script, the engineer had worked for Salesforce for years and these changes were pretty common. What could possibly go wrong?

In classic Who, Me? fashion, rather a lot.

We don't understand

"For whatever reason that we don't understand, the employee decided to do a global deployment," Dieken went on. The usual staggered approach was therefore bypassed. And a DNS change meant those servers would need restarting.

That in itself would not be a total catastrophe. Maybe a short outage, perhaps. But not the disaster that unfolded.

However, it transpired that lurking within that tried-and-trusted script was a bug. Under load, a timeout could happen that would stop other things from running. And sure enough, as the update was being rolled out across all of Salesforce's data centers, a timeout occurred. This in turn meant that certain tasks were not carried out when the servers were restarted. And that, in turn, meant that those servers did not return to operation correctly. That left customers unable to access Salesforce's products.

And then things got even worse. The Salesforce team has tools to deal with sad servers, and uses what Dieken called "our emergency break glass process" to perform rollbacks and restarts.

"In this case," he went on, "we found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active."

It is always DNS.

Of course, staff did eventually get into the servers to fix them up but, as customers felt all too keenly, it took quite some time to undo the mess.
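The circular dependency Dieken described can be pictured as a loop in a dependency graph: repairing the DNS servers needs the production access tool, and the tool needs working DNS. The following Python sketch shows one common way to spot such loops ahead of time, using a depth-first search. The graph contents and node names are invented for illustration and are not taken from Salesforce's systems.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none.

    deps maps each component to the components it needs in order to work.
    """
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have looped back.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Hypothetical graph: fixing DNS needs the access tool, which needs DNS.
    graph = {
        "dns_servers": ["production_access_tool"],
        "production_access_tool": ["dns_servers"],
        "status_page": [],
    }
    print(find_cycle(graph))
```

A check along these lines only helps if the graph is kept honest, of course. The harder part is usually discovering that the break-glass tooling quietly relies on the very thing it is meant to rescue.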

As for forthcoming actions, safeguards are to be put in place to stop manual global deployments like this in future, and the process will be automated. Dieken also acknowledged that the experience had shown up a gap in Salesforce's test coverage – the script needed to be better tested, essentially. Finally, that dependency of the recovery tools on DNS needed to be dealt with.

Customers bemused that they had to get official word of the outage from social media or this very organ, rather than the status page, were doubtless more bemused still at the revelation that the Salesforce status site fell over because auto-scale had not been turned on for that web property. (During the downtime, Salesforce had to use its documentation site to explain to clients what was going wrong.)

"We over-provisioned enough capacity to make sure that we could handle large spikes," explained Dieken, "but we never foresaw that we'd have this type of load."

Not to worry, though, auto-scale is now on, so should things go south again at least the status site is unlikely to be embarrassingly absent.

And the engineer who sidestepped Salesforce's carefully crafted policies and took down the platform? "We have taken action with that particular employee," said Dieken. ®
