That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix

'We feel bad about what happened'


The sound of rumbling rubber could be heard today as Salesforce threw an engineer responsible for a change that knocked it offline under a passing bus.

"We're not blaming one employee," said Chief Availability Officer Darryn Dieken after spending the first half hour of a Wednesday briefing on the outage doing pretty much that very thing.

To recap, on May 11 around 2100 UTC, a configuration change was applied to Salesforce's Domain Name System (DNS) servers that left folks unable to access the software-as-a-service titan's products. For about five hours, clients could not reliably log in, and things got so bad that even the status page was unavailable.

Salesforce has been updating its public root cause analysis ever since, and Dieken said during his briefing to customers that a few more tweaks would be needed before the fix was completed.

It was during that call that the full extent of the screw-up was revealed and the engineer concerned launched buswards.

While Dieken boasted of the automation in place within Salesforce towers, some processes remain manual. One of these is related to DNS (yes, it is always DNS). A lone engineer was tasked with making a configuration change to connect up a new Salesforce Hyperforce environment in Australia.

A DNS change is not an uncommon occurrence, and the engineer also had a four-year-old script to do the job. However, while Salesforce usually "staggers" changes to reduce the blast radius of blunders, the manual nature of this change meant it was up to the engineer to roll it out slowly.

This, alas, did not happen. The engineer instead decided, erroneously according to Dieken, to shortcut the normal procedures by using a so-called Emergency Break-Fix (EBF) process. The EBF is normally used when something really bad is happening, or an emergency patch is quickly and widely needed.

Going down the EBF route meant fewer approvals and a shortened process that wasn't gradual. Hey, this was a well-used script, the engineer had worked for Salesforce for years and these changes were pretty common. What could possibly go wrong?

In classic Who, Me? fashion, rather a lot.
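For anyone who has not met a staggered rollout, the idea the engineer skipped can be sketched in a few lines of Python. This is a minimal illustration, not Salesforce's tooling: the function name, the batch sizes, and the `apply_change` and `is_healthy` callbacks are all hypothetical.

```python
from typing import Callable, Iterable, List


def staggered_rollout(
    servers: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_sizes: Iterable[int] = (1, 2, 4),
) -> List[str]:
    """Apply a change in growing batches, halting after the first bad batch.

    Returns the servers that were actually changed. Every name here is
    hypothetical and exists only for illustration.
    """
    changed: List[str] = []
    remaining = list(servers)
    sizes = list(batch_sizes)
    step = 0
    while remaining:
        # Reuse the last batch size once the schedule runs out;
        # never allow a batch smaller than one server.
        size = max(1, sizes[min(step, len(sizes) - 1)])
        batch, remaining = remaining[:size], remaining[size:]
        for server in batch:
            apply_change(server)
            changed.append(server)
        # Check the whole batch before touching anything else.
        if not all(is_healthy(server) for server in batch):
            break
        step += 1
    return changed
```

The point of the design is the blast radius: if the second batch comes back unhealthy, only three servers have been touched rather than the whole estate.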

We don't understand

"For whatever reason that we don't understand, the employee decided to do a global deployment," Dieken went on. The usual staggered approach was therefore bypassed. And a DNS change meant those servers would need restarting.

That in itself would not be a total catastrophe. A short outage, perhaps. But not the disaster that unfolded.

However, it transpired that lurking within that tried-and-trusted script was a bug. Under load, a timeout could happen that would stop other things from running. And sure enough, as the update was being rolled out across all of Salesforce's data centers, a timeout occurred. This in turn meant that certain tasks were not carried out when the servers were restarted. And that, in turn, meant that those servers did not return to operation correctly. That left customers unable to access Salesforce's products.
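Salesforce has not published the script, so the following Python is only a sketch of the general failure mode: one step times out, and the steps after it never run. The step names and both functions are invented for illustration. The guard at the end shows one way to avoid declaring a half-restarted server fit for duty.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action takes no arguments.
Step = Tuple[str, Callable[[], None]]


def run_restart_steps(steps: List[Step]) -> Tuple[List[str], List[str]]:
    """Run steps in order and stop at the first timeout.

    Returns (completed, skipped) so the caller can see what never ran.
    """
    completed: List[str] = []
    for index, (name, action) in enumerate(steps):
        try:
            action()
        except TimeoutError:
            # The timed-out step and everything after it did not finish.
            skipped = [step_name for step_name, _ in steps[index:]]
            return completed, skipped
        completed.append(name)
    return completed, []


def safe_to_serve(skipped: List[str]) -> bool:
    """Only return a server to rotation if no step was skipped."""
    return not skipped
```

A script that swallows the timeout and carries on as if nothing happened gives the opposite result: a server that looks restarted but is missing part of its setup.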

And then things got even worse. The Salesforce team has tools to deal with sad servers, and uses what Dieken called "our emergency break glass process" to perform rollbacks and restarts.

"In this case," he went on, "we found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active."

It is always DNS.


Of course, staff did eventually get into the servers to fix them up but, as customers felt all too keenly, it took quite some time to undo the mess.
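The kind of loop Dieken described, where the recovery tool needs the very thing it is meant to recover, can be caught ahead of time by walking a dependency map. The Python below is a minimal sketch using a depth-first search. The service names in the example map are hypothetical, loosely modelled on the quote, and say nothing about how Salesforce's systems are actually wired.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    path: List[str] = []  # nodes on the current depth-first path
    done = set()          # nodes fully explored with no cycle found

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in path:
            # Back at a node already on the path: that slice is the loop.
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in deps.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for name in deps:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


# Hypothetical map: the access tool needs DNS, a broken DNS needs a
# rollback, and the rollback needs the access tool.
example_deps = {
    "break_glass_tool": ["dns"],
    "dns": ["dns_rollback"],
    "dns_rollback": ["break_glass_tool"],
}
```

Calling `find_cycle(example_deps)` walks from the break-glass tool to DNS to the rollback and back again, which is exactly the trap: nothing in the loop can start until something else in the loop is already up.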

As for forthcoming actions, safeguards are to be put in place to stop manual global deployments like this in future, and the process will be automated. Dieken also acknowledged that the experience had shown up a gap in Salesforce's test coverage – the script needed to be better tested, essentially. Finally, that dependency of the recovery tools on DNS needed to be dealt with.

Customers bemused that they had to get official word of the outage from social media or this very organ, rather than the status page, were doubtless more bemused at the revelation that the Salesforce status site fell over because auto-scale had not been turned on for that web property. (During the downtime, Salesforce had to use its documentation site to explain to clients what was going wrong.)

"We over-provisioned enough capacity to make sure that we could handle large spikes," explained Dieken, "but we never foresaw that we'd have this type of load."

Not to worry, though: auto-scale is now on, so should things go south again at least the status site is unlikely to be embarrassingly absent.

And the engineer who sidestepped Salesforce's carefully crafted policies and took down the platform? "We have taken action with that particular employee," said Dieken. ®

