Salesfarce to Failsforce: Salesforce database blunder outage enters day three as fix falters
El Reg tunes into customer conference calls to hear SVP of engineering apologize
Three days on, Salesforce.com has yet to fully recover from an outage that began on Friday.
Fifteen hours and eight minutes after an errant database deployment script granted past and current users of the company's Pardot B2B marketing automation system full read and write access to all data, prompting the cloud CRM giant to disable all affected server instances, Salesforce declared victory on Saturday morning. And then immediately declared another emergency.
"Service disruption ended, 0104 PDT, May 18," the San Francisco tech titan said on its status website, only to immediately restart the clock with another notification, "Service disruption began, 0104 PDT, May 18."
Here's a summary of what happened: on Friday, the biz accidentally gave all users within current and former Pardot customers sysadmin-level access to all data, then pulled offline all instances running Pardot to prevent any information theft or tampering. Pulling the plug on these shared instances booted Pardot and non-Pardot customers off the Salesforce cloud: any customers sharing a Pardot-hosting instance lost access.
Then Salesforce wiped all access permissions for all affected users, and restored sysadmin-level access to customers' administrator accounts. Instances were gradually brought back online so admins could log in and manually repair user permissions, allowing folks to get back to work as normal.
Over the weekend, Salesforce staff developed, tested, and ran a script that attempted to restore user permissions from backups, though this was not always successful. In some cases, it even went backwards, and regranted full read-write permissions to users.
Salesforce also held a series of conference calls over the weekend to update customers on the status of repairs on the 105 affected instances. Come Monday morning, Salesforce functionality appears to have been restored for most organizations, though the tech goliath acknowledged its automated fixes and repair work haven't reached everyone.
In a conference call on Monday, May 20, at 0030 PDT, Anmol Bhasin, SVP of engineering, said Salesforce was still dealing with a few thousand trouble reports after said automated script failed to fully undo the permission snafu.
"On the last customer update, I had communicated that the initial fix that we had put in place for restoring functionality for the pre-incident state – restoring permission sets in particular – which we believe should have restored functionality for all the affected organizations was not successful in doing so," he said.
Bhasin apologized for the disruption, and offered his assurance that Salesforce is focused on fixing things at the highest level of the company and has devoted all available engineering resources toward resolving the problems.
Since then, there have been reports of instances going offline and then coming back online. The latest update from the cloud giant insists the automated permissions repair operation has been run on all production instances, but after that "a subset of users in affected orgs on the NA53, NA57, and NA59 instances had their permission levels reset again, which gave them broader data access than intended."
Customers on those instances were still experiencing problems as of Monday morning.
In an email earlier today, Alex Brausewetter, CTO of Blue Canvas, told El Reg, "It's completely bonkers! From what we gather there are still hundreds if not thousands of customers affected. In one earlier call, they said they received thousands of complaints/support tickets after they ran they scripts that they thought would fix this issue. Salesforce has gone radio silent since yesterday night Pacific time and they just cancelled a bridge call that was scheduled for 0900 and moved it to 1030. Aside from that there's been no public communication."
The Register asked Salesforce to provide an update on the outage. A spokesperson merely pointed back to the published incident response webpage, which says that the issue is "ongoing."
Brausewetter has collected details from the calls into a Google Docs file, and shared the results. "It's totally unacceptable from this kind of service to leave customers in the dark like this," he told us. "For the customers affected by this permission problem, none of their users can log into the org or use Salesforce right now. And now the weekend is over…"
At time of writing, the 1030 PDT briefing had been moved back to 1100. ®