Storage array firmware bug caused Salesforce data loss

Circuit breakers broke bad, workload moved but array flipped out under heavy load


Salesforce.com has revealed that a bug in the firmware of its storage arrays was behind last week's data loss incident.

The mess started in the company's Washington data centre on May 9th, when admins noticed “a circuit breaker responsible for controlling power into the data center had failed.”

“The team engaged the circuit vendor who began the process of replacing the failed breaker. Multiple redundant power systems had not engaged, which led to power failures at the compute system level.”

That mess took the company's NA14 instance offline, so it took steps to move it into a Chicago data centre. The move worked, but not long afterwards database performance dived.

Salesforce's account of the mess explains that when it moves an instance it also moves the data to the bit barn the instance now occupies. That move means local infrastructure gets rather busy.

So busy that the “... increase in volume on the instance exposed the firmware bug on the storage array, which significantly increased the time for the database to write to the array. Because the time to write to the storage array increased, the database began to experience timeout conditions when writing to the storage tier. Once these timeout conditions began, a single database write was unable to successfully complete, which caused the file discrepancy condition to become present in the database. Once this discrepancy occurred, the database cluster failed and could not be restarted.”

The data loss came about because, in Salesforce's words, “Our internal backup processes are designed to be near real-time, however the local copy of the database had not yet completed.”

“Our remote replication process copied the blocks that contained the file discrepancies to the standby copy of the database in the WAS data center before the database crashed, resulting in these copies of the NA14 database being unusable for purposes of restoring service once the database cluster failed.”

Salesforce says the circuit breakers that started the mess passed March 2016 tests, but have been replaced anyway. The offending firmware has been banished and the company is “in the process of deploying a new technology for data replication to standby copies of instances. This updated approach will utilize the wide-area network (WAN) to perform logical, database level replication, eliminating block-level replication.”
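For readers wondering why the switch from block-level to logical replication matters, here is a toy Python sketch of the difference. It is an illustration under loud assumptions, not a description of Salesforce's systems: the record framing, the function names (make_record, block_replicate, logical_replicate) and the way the failed write is simulated are all invented for the example. The idea is that a block-level copier ships whatever bytes are on disk, damaged or not, while a logical replicator replays only the changes the database actually committed.

```python
import zlib


def make_record(payload: bytes) -> bytes:
    """Frame a payload with a CRC32 prefix so damage can be detected (hypothetical format)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def is_valid(record: bytes) -> bool:
    """Return True if the stored CRC32 matches the payload that follows it."""
    if len(record) < 4:
        return False
    return zlib.crc32(record[4:]) == int.from_bytes(record[:4], "big")


def block_replicate(primary_blocks):
    """Block-level replication: copy the raw bytes verbatim, valid or not."""
    return list(primary_blocks)


def logical_replicate(committed_payloads):
    """Logical replication: replay only committed changes, framing them afresh on the standby."""
    return [make_record(p) for p in committed_payloads]


def simulate():
    # Two writes complete normally and are committed.
    committed = [b"row-1", b"row-2"]
    primary = [make_record(p) for p in committed]

    # Assumption: a third write times out part-way, so its last two bytes
    # never reach the disk and stay zeroed. It is never committed.
    full = make_record(b"row-3")
    torn = full[:-2] + b"\x00\x00"
    primary.append(torn)

    block_standby = block_replicate(primary)
    logical_standby = logical_replicate(committed)

    return (
        all(is_valid(r) for r in block_standby),
        all(is_valid(r) for r in logical_standby),
    )


if __name__ == "__main__":
    block_ok, logical_ok = simulate()
    print("block-level standby usable:", block_ok)
    print("logical standby usable:", logical_ok)
```

In this toy model the block-level standby inherits the damaged record, while the logical standby holds only the two committed rows. Real databases are considerably more involved, and logical replication brings trade-offs of its own, such as replication lag.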

Which leaves us with the task of figuring out who Salesforce uses for database and storage.

Oracle's known to be under the hood, but The Register can't say with any certainty if its database was the villain in this case. Nor can we say just what storage array had the hiccup. We do know, thanks to a 2013 post by site reliability engineer Claude Johnson, that Salesforce has in the past used ZFS and Solaris-powered servers for storage. If that's still the case, blaming Oracle for a firmware mess may not be sound thinking, as it's entirely conceivable that Salesforce's server provider supplies the firmware.

Whatever the source of the mess, we now know that electrical equipment can bedevil even one of the world's largest and most careful cloud operators, and that this whole business of teleporting big workloads between bit barns is not always simple. ®


Biting the hand that feeds IT © 1998–2021