Salesforce's data-center design: 'Go for web scale, and build it out of s**t!'

Oracle sits at the core, but after that things get ugly - and cheap!


RICON West 2013: If you're building systems to run at a large scale, then rather than waste time and money trying to avoid any failure, you need to suck it up and accept that faults will happen – and make sure you've got enough cheap gear to recover.

So says Salesforce, which argued that this strategy can save you a bunch of cash you'd otherwise spend on expensive hardware, and makes it easier for your applications to survive catastrophes.

In a candid keynote speech at the Ricon West distributed systems conference on Tuesday, Salesforce architect and former Amazon infrastructure brain Pat Helland talked up Salesforce's internal "Keystone" system: this technology lets the company provide greater backup and replication capabilities for data stored in Oracle without having to spend Oracle prices on supporting infrastructure.

"You need to have enterprise-trust with web-class resilience," Helland said. "Failures are normal, mean time between failure makes failures common. You have to expect things to break so you have to layer it with immutable data*.

"The ideal design approach is 'web scale and I want to build it out of shit'."

Salesforce's Keystone system takes data from Oracle and layers it on top of cheap infrastructure running on commodity servers, Helland explained. The Oracle technology gives Salesforce confidence and consistency, he said, while the secondary layer of commodity systems and open-source software gives the company greater flexibility and a cheaper way of providing storage.

What's inside Keystone

Keystone consists of several clusters of storage servers, each running ten four-terabyte drives and two 750GB SSDs. Salesforce has a preference for buying "the shittiest SSDs money can buy," Helland said, then designing systems to route around failures. Keystone's storage underbelly consists of clusters of tens to hundreds of nodes, possibly scaling up to thousands.

The design methodology behind Keystone comes from two techniques: ROC (recovery-oriented computing) and SOFT (storage over flaky technologies). ROC comes from research done at the University of California, Berkeley in the mid-2000s and is about designing systems to rapidly recover from failures, while SOFT is Helland's own term for building storage systems that offer guarantees despite cheap-as-chips hardware.

'You get better system behavior if you assume everything is a hunk of crap'

Salesforce is designing systems using these techniques so that it can better deal with the flakiness of commodity infrastructure, without having to upgrade to more expensive systems, he said. "Storage servers may crash and they can lie with their frickin' teeth. You get better system behavior if you assume everything is a hunk of crap."

Keystone has four main elements: a Catalog for keeping track of data, a Store for holding it, a Vault for long-term storage, and a Pump for shuffling data between systems over the WAN.

The Catalog provides an intermediary between primary storage systems, such as Oracle, and secondary storage systems built on commodity hardware. The primary store points to Keystone, which in turn points to the secondary systems, making it easy to shift the location of secondary data without having to fiddle with the primary system.
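As a rough illustration of that indirection – a minimal sketch, not Salesforce's actual code, with all names invented for the purpose – a catalog of this kind boils down to a mapping from an opaque blob ID to wherever the replicas currently live:

```python
# Hypothetical sketch of a Keystone-style catalog: the primary database
# keeps only an opaque blob ID, and the catalog maps that ID to the
# commodity nodes that currently hold the replicas.

class Catalog:
    def __init__(self):
        self._locations = {}  # blob_id -> list of replica locations

    def register(self, blob_id, replicas):
        """Record where a blob's replicas live."""
        self._locations[blob_id] = list(replicas)

    def locate(self, blob_id):
        """Return the current replica locations for a blob."""
        return self._locations[blob_id]

    def migrate(self, blob_id, new_replicas):
        """Move the blob to new secondary storage; the primary system is
        untouched because it only ever stored the blob_id."""
        self._locations[blob_id] = list(new_replicas)


catalog = Catalog()
catalog.register("blob-42", ["node-a:/vol1", "node-b:/vol3", "node-c:/vol2"])
catalog.migrate("blob-42", ["node-d:/vol1", "node-e:/vol2", "node-f:/vol4"])
print(catalog.locate("blob-42"))  # the Oracle row still just says "blob-42"
```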

The Store holds the data tracked by the Catalog, and is where Salesforce adopts the design approach pioneered by major web companies – such as Google, Facebook, and Amazon – of using large quantities of low-cost hardware for backend storage while achieving good guarantees and reliability through a software layer.

This design approach has one glaring problem: failures, which happen a lot, Helland explained. About 4 per cent or more of SATA hard drives fail each year, he said, so a data center with 1,200 of Salesforce's storage-stuffed servers – 12,000 drives in all – will lose around 480 drives every 12 months. Storage therefore needs to be triple-replicated to ride out these failures. Whole servers fail at a rate of a shade under one per cent per year, he explained, so data must also be replicated at a distance from any single rack.
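The arithmetic is straightforward; here is a quick back-of-the-envelope check, a sketch that simply plugs in the rates quoted in the talk:

```python
# Back-of-the-envelope failure arithmetic using the rates quoted in the talk.
servers = 1200
drives_per_server = 10
annual_drive_failure_rate = 0.04   # roughly 4 per cent of SATA drives per year
annual_server_failure_rate = 0.01  # a shade under 1 per cent of servers per year

total_drives = servers * drives_per_server
drive_failures = total_drives * annual_drive_failure_rate
server_failures = servers * annual_server_failure_rate

print(f"{total_drives} drives -> ~{drive_failures:.0f} drive failures per year")
print(f"{servers} servers -> ~{server_failures:.0f} server failures per year")
# 12000 drives -> ~480 drive failures per year
# 1200 servers -> ~12 server failures per year
```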

Replication brings problems of its own: maintaining consistency when reading from a replica requires a caching layer that stores the locations of the data and keeps everything up to date. This layer is built on cheap consumer-grade SSDs, he said, which have awful reliability.

"Consumer-grade SSDs will just die - the device stops less than 1 percent per year," he said. They also wear out after about 3,000 write cycles, so with a 750GB SSD you can probably write 2.25PBs to it before it fails, he said.

They also sometimes suffer from bit rot: in Helland's experience, the cheapest possible devices produce one uncorrected bit error for every 10^14 bits (about 11.3TB) written. By comparison, top-of-the-line enterprise-class SSDs have a rot rate of one bit in every 10^19 (about 1.08EB), but they are much more expensive.
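For those keeping score, the endurance and bit-rot numbers work out as follows – a sketch assuming the whole 750GB device can be rewritten about 3,000 times, plus the quoted uncorrectable bit error rates:

```python
# SSD endurance and bit-rot arithmetic, using the figures quoted in the talk.
ssd_capacity_bytes = 750e9  # 750GB consumer SSD
write_cycles = 3000         # wears out after ~3,000 full rewrites

total_writable = ssd_capacity_bytes * write_cycles
print(f"Endurance: ~{total_writable / 1e15:.2f}PB written before wear-out")
# Endurance: ~2.25PB written before wear-out

def bytes_per_uncorrected_error(exponent):
    """Bytes written per uncorrected bit error, for a 1-in-10^exponent rate."""
    return (10 ** exponent) / 8

cheap = bytes_per_uncorrected_error(14)   # consumer-grade: 1 bad bit per 10^14 bits
pricey = bytes_per_uncorrected_error(19)  # enterprise-grade: 1 bad bit per 10^19 bits
print(f"Consumer SSD: one uncorrected error per ~{cheap / 2**40:.2f}TiB written")
print(f"Enterprise SSD: one uncorrected error per ~{pricey / 2**60:.2f}EiB written")
# Consumer SSD: one uncorrected error per ~11.37TiB written
# Enterprise SSD: one uncorrected error per ~1.08EiB written
```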

Modern software needs to be built to fail, he said, because if you design it in a monolithic, interlinked manner, then a simple hardware brownout can ripple through the entire system and take you offline.

"If everything in the system can break it's more robust if it does break. If you run around and nobody knows what happens when it breaks then you don't have a robust system," he says. ®

Bootnote

* Salesforce is able to use its cheap'n'cheerful infrastructure and still offer strong guarantees to users because Keystone exclusively relies on immutable data structures, Helland said in a chat with El Reg today.

This means once a block of data is stored in the system, it is given a unique ID number and then split into fragments smaller than a megabyte, which each have their own IDs. These fragments are then run through a cyclic-redundancy check (CRC) algorithm that calculates a simple fingerprint from the contents of each fragment.

This metadata is combined by Keystone into a naming scheme that can detect changes to the data, whether caused by bit rot or other failures: if the contents of a fragment are altered, a subsequently calculated fingerprint will differ from the original value, alerting the system.
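To illustrate the idea – a minimal sketch using Python's built-in CRC32, with a made-up fragment size and naming scheme rather than Keystone's actual format:

```python
import zlib

FRAGMENT_SIZE = 512 * 1024  # hypothetical sub-megabyte fragment size

def fragment_and_fingerprint(blob_id, data):
    """Split an immutable blob into fragments and record a CRC32 per fragment."""
    manifest = []
    for i in range(0, len(data), FRAGMENT_SIZE):
        fragment = data[i:i + FRAGMENT_SIZE]
        manifest.append({
            "fragment_id": f"{blob_id}/{i // FRAGMENT_SIZE}",
            "crc32": zlib.crc32(fragment),
        })
    return manifest

def verify(manifest, data):
    """Recompute every fragment's CRC; bit rot shows up as a mismatch."""
    for entry in manifest:
        index = int(entry["fragment_id"].rsplit("/", 1)[1])
        fragment = data[index * FRAGMENT_SIZE:(index + 1) * FRAGMENT_SIZE]
        if zlib.crc32(fragment) != entry["crc32"]:
            return False, entry["fragment_id"]
    return True, None

blob = b"immutable payload " * 100_000
manifest = fragment_and_fingerprint("blob-42", blob)

rotted = bytearray(blob)
rotted[12345] ^= 0x01  # flip a single bit to simulate bit rot
print(verify(manifest, blob))          # (True, None)
print(verify(manifest, bytes(rotted))) # (False, 'blob-42/0')
```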

By using such techniques, Salesforce is able to tolerate inexpensive but troublesome hardware in its backend, increase overall reliability for its users, and give its programmers more freedom when writing distributed apps.

"What we do that is special is we make it look just like the classic enterprise IT while we're doing the cloud," Helland said.

