Dropbox unplugged its own datacenter – and things went better than expected

Two years of disaster planning massively reduced recovery time objective, company says

If you're unsure how resilient your organization is to a disaster, there's a simple way to find out: unplug one of your datacenters from the internet and see what happens.

That's what Dropbox did in November, though with a bit more forethought. It had long planned to take its San Jose datacenter (its largest) offline, and ran extensive tests before the actual event. In the end it took all three of its datacenters in the city offline by physically pulling each site's main fiber connection from its port.

Dubbed the "SJC blackhole," the experiment was deemed a success after 30 minutes passed with what Dropbox described as no impact on its global availability. "In the unlikely event of a disaster, our revamped failover procedures showed that we now had the people and processes in place to offer a significantly reduced RTO [recovery time objective]," Dropbox said in a postmortem of the event.

According to the company, RTOs were reduced from eight to nine minutes down to four or five.

What was Dropbox thinking?

After parting ways with previous hosting service AWS and building its own datacenters, Dropbox said it realized there was a problem: its metadata was highly replicated, but block data wasn't. "Given San Jose's proximity to the San Andreas Fault, it was critical we ensured an earthquake wouldn't take Dropbox offline," the company said.

Dropbox's first attempt to remove that single point of failure was Magic Pocket, a system that distributes block data across multiple datacenters, each of which can serve portions of files at the same time, so that a single datacenter outage cannot take the service down. This is known as an active-active system because multiple nodes are serving files to users simultaneously.

Dropbox ultimately settled on an active-passive failover model, which still replicates blocks across multiple datacenters, but only serves files from a single location. It said this was necessary because of limitations imposed by how Dropbox itself had chosen to manage metadata.

"These choices severely limited our architectural choices when designing an active-active system, and made the resulting system much more complex," Dropbox said.
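
The difference between the two models can be shown in a minimal sketch. This is an illustration only, not Dropbox's implementation: the Datacenter class, both routing functions, and the site names are hypothetical, and real failover also involves data replication and traffic draining that are left out here.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    healthy: bool = True


def active_active_targets(datacenters):
    """Active-active: every healthy site serves traffic at the same time."""
    return [dc.name for dc in datacenters if dc.healthy]


def active_passive_target(datacenters):
    """Active-passive: only the first healthy site in priority order serves.

    The remaining sites hold replicas and wait to be promoted.
    """
    for dc in datacenters:
        if dc.healthy:
            return dc.name
    raise RuntimeError("no healthy datacenter available")


# Hypothetical site names, listed in failover priority order.
sites = [Datacenter("sjc"), Datacenter("dfw")]

print(active_active_targets(sites))  # both sites serve
print(active_passive_target(sites))  # only the primary serves

sites[0].healthy = False             # simulate losing the primary
print(active_passive_target(sites))  # the standby is promoted
```

In the active-passive case the standby does nothing for users until the primary is lost, which is why the speed of the promotion step is what determines the recovery time objective.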

Failing over and over

A May 2020 failover tooling failure caused a 47-minute service outage, which pushed Dropbox into high gear on improving its disaster recovery systems. It started by forming a dedicated disaster recovery team, which rebuilt Dropbox's failover-handling software before running a series of tests, of which the November 2021 shutdown was part.

Testing began at Dropbox's two Dallas-Fort Worth datacenters, and initially things were less than smooth, because the team had not realized that all of its S3 proxies were running from the datacenter it took offline. A second test proved more successful, which led to the San Jose experiment.
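
The kind of oversight that tripped up the first test can in principle be caught with a placement check before a site is unplugged. The sketch below is an assumption about how such a check might look, not a description of Dropbox's tooling: the stranded_services function, the inventory format, and the service and site names are all hypothetical.

```python
def stranded_services(placement, offline_site):
    """Return services whose every instance lives in the site going offline."""
    return sorted(
        service
        for service, sites in placement.items()
        if sites and set(sites) <= {offline_site}
    )


# Hypothetical inventory: service name -> sites where its instances run.
placement = {
    "s3-proxy": ["dfw"],
    "metadata": ["sjc", "dfw"],
    "block-storage": ["sjc", "dfw"],
}

print(stranded_services(placement, "dfw"))  # ['s3-proxy']
```

Any service the check returns would disappear entirely when the chosen site goes dark, so it needs instances elsewhere before the test begins.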

"Much like our second DFW test, we saw no impact to global availability—and ultimately reached our goal of a 30-minute SJC blackhole," Dropbox said. 

Dropbox's postmortem is worth paying attention to: not only does it describe how the company distributed its services and made its entire system more resilient, it also shows how much work it takes for a large enterprise to commit to a project of that kind.

The entire effort to improve resiliency was described by Dropbox as a multi-year, multi-team project. Its nature as a cloud service may mean Dropbox is more complex than other enterprises, but that should serve as a motivator: disaster recovery planning in other companies may be a lot easier.

Dropbox also recommends that other companies perform regular disaster recovery practice exercises. "Like a muscle, it takes training and practice to get stronger." ®

Biting the hand that feeds IT © 1998–2022