Storage array firmware bug caused Salesforce data loss

Circuit breakers broke bad, workload moved but array flipped out under heavy load

Salesforce has revealed that a bug in the firmware of its storage arrays was behind last week's data loss incident.

The mess started in the company's Washington data centre on May 9th, when admins noticed “a circuit breaker responsible for controlling power into the data center had failed.”

“The team engaged the circuit vendor who began the process of replacing the failed breaker. Multiple redundant power systems had not engaged, which led to power failures at the compute system level.”

That mess took the company's NA14 instance offline, so it took steps to move it into a Chicago data centre. The move worked, but not long afterwards database performance dived.

Salesforce's account of the mess explains that when it moves an instance, it also moves the data to the bit barn the instance now occupies. That move means local infrastructure gets rather busy.

So busy that the “... increase in volume on the instance exposed the firmware bug on the storage array, which significantly increased the time for the database to write to the array. Because the time to write to the storage array increased, the database began to experience timeout conditions when writing to the storage tier. Once these timeout conditions began, a single database write was unable to successfully complete, which caused the file discrepancy condition to become present in the database. Once this discrepancy occurred, the database cluster failed and could not be restarted.”

The data loss came about because, in Salesforce's words, “Our internal backup processes are designed to be near real-time, however the local copy of the database had not yet completed.”

“Our remote replication process copied the blocks that contained the file discrepancies to the standby copy of the database in the WAS data center before the database crashed, resulting in these copies of the NA14 database being unusable for purposes of restoring service once the database cluster failed.”

Salesforce says the circuit breakers that started the mess passed March 2016 tests, but have been replaced anyway. The offending firmware has been banished and the company is “in the process of deploying a new technology for data replication to standby copies of instances. This updated approach will utilize the wide-area network (WAN) to perform logical, database level replication, eliminating block-level replication.”
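
For readers wondering why the switch from block-level to logical replication matters, here is a minimal Python sketch of the difference. It is an illustration under stated assumptions, not Salesforce's implementation: the names `make_block`, `damage`, `replicate_blocks` and `replicate_logical` are hypothetical, a dict stands in for a disk, and a single flipped bit stands in for the file discrepancy described above.

```python
import zlib
from typing import Dict, List, Tuple


def make_block(payload: bytes) -> bytes:
    """Store a payload as a block: 4-byte CRC32 followed by the data."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def block_is_valid(block: bytes) -> bool:
    """Check that the stored CRC32 still matches the payload."""
    return int.from_bytes(block[:4], "big") == zlib.crc32(block[4:])


def damage(block: bytes) -> bytes:
    """Flip one payload bit (hypothetical stand-in for a bad write)."""
    return block[:4] + bytes([block[4] ^ 0x01]) + block[5:]


def replicate_blocks(primary_disk: Dict[str, bytes]) -> Dict[str, bytes]:
    """Block-level replication: copy the bytes verbatim, damage included."""
    return dict(primary_disk)


def replicate_logical(change_log: List[Tuple[str, bytes]]) -> Dict[str, bytes]:
    """Logical replication: replay each change so the standby writes its own blocks."""
    standby: Dict[str, bytes] = {}
    for key, value in change_log:
        standby[key] = make_block(value)
    return standby


def count_bad(disk: Dict[str, bytes]) -> int:
    """Count blocks whose checksum no longer matches."""
    return sum(1 for block in disk.values() if not block_is_valid(block))


# The database-level changes, as the application issued them.
change_log = [("row-1", b"alpha"), ("row-2", b"bravo"), ("row-3", b"charlie")]

# The primary writes them to disk, then one block goes bad on the array.
primary_disk = {key: make_block(value) for key, value in change_log}
primary_disk["row-2"] = damage(primary_disk["row-2"])

block_standby = replicate_blocks(primary_disk)
logical_standby = replicate_logical(change_log)

print("bad blocks on primary:        ", count_bad(primary_disk))
print("bad blocks on block standby:  ", count_bad(block_standby))
print("bad blocks on logical standby:", count_bad(logical_standby))
```

The block-level standby inherits the damaged block because it copies bytes without knowing what they mean, while the logical standby never sees the primary's disk at all. The sketch ignores everything that makes real replication hard, such as ordering, lag and network failures.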

Which leaves us with the task of figuring out who Salesforce uses for database and storage.

Oracle's known to be under the hood, but The Register can't say with any certainty if its database was the villain in this case. Nor can we say just what storage array had the hiccup. We do know, thanks to a 2013 post by site reliability engineer Claude Johnson, that Salesforce has in the past used ZFS and Solaris-powered servers for storage. If that's still the case, blaming Oracle for a firmware mess may not be sound thinking, as it's entirely conceivable that Salesforce's server-provider supplies firmware.

Whatever the source of the mess, we now know that electrical equipment can bedevil even one of the world's largest and most careful cloud operators, and that this whole business of teleporting big workloads between bit barns is not always simple. ®


Biting the hand that feeds IT © 1998–2022