Storage array firmware bug caused Salesforce data loss

Circuit breakers broke bad, workload moved but array flipped out under heavy load


Salesforce.com has revealed that a bug in the firmware of its storage arrays was behind last week's data loss incident.

The mess started in the company's Washington data centre on May 9th, when admins noticed “a circuit breaker responsible for controlling power into the data center had failed.”

“The team engaged the circuit vendor who began the process of replacing the failed breaker. Multiple redundant power systems had not engaged, which led to power failures at the compute system level.”

That mess took the company's NA14 instance offline, so it took steps to move it into a Chicago data centre. The move worked, but not long afterwards database performance dived.

Salesforce's account of the mess explains that when it moves an instance it also moves the data to the bit barn the instance now occupies. That move means local infrastructure gets rather busy.

So busy that the “... increase in volume on the instance exposed the firmware bug on the storage array, which significantly increased the time for the database to write to the array. Because the time to write to the storage array increased, the database began to experience timeout conditions when writing to the storage tier. Once these timeout conditions began, a single database write was unable to successfully complete, which caused the file discrepancy condition to become present in the database. Once this discrepancy occurred, the database cluster failed and could not be restarted.”

The data loss came about because, in Salesforce's words, “Our internal backup processes are designed to be near real-time, however the local copy of the database had not yet completed.”

“Our remote replication process copied the blocks that contained the file discrepancies to the standby copy of the database in the WAS data center before the database crashed, resulting in these copies of the NA14 database being unusable for purposes of restoring service once the database cluster failed.”

Salesforce says the circuit breakers that started the mess passed March 2016 tests, but have been replaced anyway. The offending firmware has been banished and the company is “in the process of deploying a new technology for data replication to standby copies of instances. This updated approach will utilize the wide-area network (WAN) to perform logical, database level replication, eliminating block-level replication.”
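For readers wondering why the switch from block-level to logical replication matters, here is a minimal Python sketch of the difference. It is a toy model built on our own assumptions, not a description of Salesforce's systems: the `Block` class, the `torn_write` helper and the account names are all hypothetical, and a SHA-256 digest stands in for whatever consistency check a real database performs.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


def digest(payload: bytes) -> str:
    """Checksum recorded alongside a block (SHA-256 is an assumption)."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Block:
    payload: bytes
    recorded_digest: str  # digest of what the database intended to write

    def is_consistent(self) -> bool:
        return digest(self.payload) == self.recorded_digest


def write_block(payload: bytes) -> Block:
    """A write that completes: payload and digest agree."""
    return Block(payload, digest(payload))


def torn_write(payload: bytes) -> Block:
    """Hypothetical failed write: only half the payload reaches the array."""
    return Block(payload[: len(payload) // 2], digest(payload))


def replicate_blocks(primary: dict[int, Block]) -> dict[int, Block]:
    """Block-level replication: copy raw blocks without inspecting them."""
    return {n: Block(b.payload, b.recorded_digest) for n, b in primary.items()}


def replicate_logically(committed: list[tuple[int, bytes]]) -> dict[int, Block]:
    """Logical replication: replay committed changes through the
    standby's own write path, so the standby builds its own blocks."""
    return {n: write_block(payload) for n, payload in committed}


# Primary storage after one good write and one write that timed out.
primary = {0: write_block(b"account-a"), 1: torn_write(b"account-b")}
# Only the first change ever committed at the database level.
committed = [(0, b"account-a")]

block_copy = replicate_blocks(primary)
logical_copy = replicate_logically(committed)

print("block-level standby consistent:",
      all(b.is_consistent() for b in block_copy.values()))
print("logical standby consistent:",
      all(b.is_consistent() for b in logical_copy.values()))
```

In this toy, the block-level standby faithfully inherits the damaged block, while the logical standby never sees it because the failed write never committed. The trade-off, again in the toy, is that the logical copy is missing the change that failed, so it is consistent but not necessarily complete.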

Which leaves us with the task of figuring out who Salesforce uses for database and storage.

Oracle's known to be under the hood, but The Register can't say with any certainty if its database was the villain in this case. Nor can we say just what storage array had the hiccup. We do know, thanks to a 2013 post by site reliability engineer Claude Johnson, that Salesforce has in the past used ZFS and Solaris-powered servers for storage. If that's still the case, blaming Oracle for a firmware mess may not be sound thinking, as it's entirely conceivable that Salesforce's server-provider supplies firmware.

Whatever the source of the mess, we now know that electrical equipment can bedevil even one of the world's largest and most careful cloud operators, and that this whole business of teleporting big workloads between bit barns is not always simple. ®

Biting the hand that feeds IT © 1998–2021