Keep on trucking: Dropbox's Magic Pocket and the curse of the loading bay

Want to build a storage cloud? How many trucks can you unload in a day?

Design and build

At the top level, Magic Pocket was broken into four phases: designing the system so it's simple and scalable; system proving (making sure it's going to work the way you expect); the scale-up; and optimisation.

Cowling said proving the system was the longest part of Magic Pocket – “very extensive” work went into this.

By comparison, the scale-up was allocated six months and a countdown clock: any catastrophic failure during that period meant someone had to go and tell the boss the clock needed resetting.

That's not a trivial decision to make, because it would commit Dropbox to another six months of AWS bills.

And it happened, Cowling said, once.

But let's go with “proving the system” first. Cowling told Vulture South the requirement was to demonstrate that the company could “trust that we can tie the business to the storage system”.

To do that, his group started with small-scale builds. That was followed by production validation: “six months at production scale, ready to go, storing a mirror of all the data on S3”, he explained.

Covering one-third of all customer data, mirrored to Amazon, the production validation was a torture-test for Magic Pocket in which all manner of medieval nastiness was practised on the system: injecting software faults, deliberately corrupting disks to make sure the system catches the problem, pulling circuit breakers, overheating individual boxes until they fail.
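
Dropbox hasn't published its test harness, but the software-fault side of that torture-testing can be sketched in a few lines of Go: wrap the disk's reader in something that occasionally lies. The FaultyReader type, its probabilities and its stalls are all invented for illustration – the point is only that the layers above must prove they notice and route around misbehaving I/O.

```go
// faulty.go – a toy sketch of software fault injection, one of the torture
// techniques described above. The wrapper and its probabilities are invented
// for illustration; this is not Magic Pocket's real test harness.
package faultinject

import (
	"errors"
	"io"
	"math/rand"
	"time"
)

// FaultyReader wraps any reader and, with small probabilities, either returns
// a fake I/O error or stalls – mimicking flaky disks and frozen machines so
// the layers above can prove they detect and route around such failures.
type FaultyReader struct {
	R         io.Reader
	ErrRate   float64       // chance per Read of returning an injected error
	StallRate float64       // chance per Read of freezing for StallFor
	StallFor  time.Duration
}

func (f *FaultyReader) Read(p []byte) (int, error) {
	switch r := rand.Float64(); {
	case r < f.ErrRate:
		return 0, errors.New("injected I/O error")
	case r < f.ErrRate+f.StallRate:
		time.Sleep(f.StallFor) // simulate a request that hangs mid-flight
	}
	return f.R.Read(p)
}
```

Wrap every disk read in something like this during validation and the retry, timeout and failover paths get exercised constantly, rather than once in a blue moon.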

The criterion for success was that “we had to run the system without any data loss or failures, for six months”.

“And we almost made it into production,” Cowling added, except that a staging cluster failed.

“That's the small-scale cluster that stores a copy of the production data, so we can test a new node. There was a software bug that made it into the staging cluster, and that meant we had to reset the countdown clock.”

At the cost of another six months – but, he said, nobody expected the first pass to make it to “live”.

Murphy lives and his law gets addenda

What the system proofs also taught Cowling – and everyone else at Dropbox – is that “at exabyte scale, any possible hardware failure will happen … what people underestimate is how many ways things can fail.”

To cope with undetected corruption, data has to be “very heavily replicated and has to exist in two geographic locations at all times.”

And even then, “replication only makes sense if you know that the copies are correct.”

There were unexpected failure modes like a machine that would freeze for an hour in the middle of a request, and then resume.

Naturally enough, disks failed, and even “bad registers in CPUs” turned up once there were enough machines in service, meaning “you think you have the data on the disk, but you don't … you need to create checksums a long way down the stack to catch this”.
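
Cowling didn't spell out the on-disk format, but “checksums a long way down the stack” comes down to something like this minimal Go sketch – a payload stored with its own CRC32C trailer and verified before any byte is handed back up the stack. The layout and function names are assumptions for illustration, not Magic Pocket's actual code.

```go
// blockio.go – a minimal sketch of checksumming at the bottom of the stack.
// The trailing CRC32C layout is an assumption for illustration, not Magic
// Pocket's real on-disk format.
package blockio

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
	"os"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// WriteBlock stores the payload followed by its CRC32C, so later reads can
// prove the bytes on disk are still the bytes that were written.
func WriteBlock(path string, payload []byte) error {
	sum := make([]byte, 4)
	binary.LittleEndian.PutUint32(sum, crc32.Checksum(payload, castagnoli))
	out := append(append(make([]byte, 0, len(payload)+4), payload...), sum...)
	return os.WriteFile(path, out, 0o644)
}

// ReadBlock refuses to hand back data whose checksum no longer matches,
// turning silent corruption into a loud, handleable error.
func ReadBlock(path string) ([]byte, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(raw) < 4 {
		return nil, errors.New("block too short to contain a checksum")
	}
	payload := raw[:len(raw)-4]
	want := binary.LittleEndian.Uint32(raw[len(raw)-4:])
	if crc32.Checksum(payload, castagnoli) != want {
		return nil, errors.New("checksum mismatch: disk returned corrupted data")
	}
	return payload, nil
}
```

Do that at every layer that touches the bytes and “you think you have the data but you don't” becomes an error the software can act on, rather than a surprise.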

Getting all of that into a “lego block” takes us not to the first stage of the Magic Pocket project, but the last – optimisation.

The “Diskotech” team had the brief of building “the most efficient chassis possible”.

That ended up being a four-rack-unit server with 100-plus disks, putting more than a petabyte into the chassis for stacking into the rack.

For storage density, Cowling likes the SMR (shingled magnetic recording) architecture, but it has a penalty.

“The disk tracks overlap like roof tiles. You get a smaller track and 30 per cent better storage density, but it's got very bad random-write performance.

“So we had to rewrite the architecture to get extremely dense hardware in terms of petabytes per box.”
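
That rewrite boils down to never asking a shingled drive to do a random write: buffer incoming blocks and lay them down as large, append-only sequential writes. Here's a minimal sketch of that pattern in Go – the ExtentWriter type and the 64 MiB threshold are invented for illustration, not Dropbox's real extent format.

```go
// extent.go – a toy append-only writer: incoming blocks are buffered and
// flushed as one large sequential write, sidestepping SMR's poor
// random-write performance. The type and sizes are illustrative only.
package extent

import "os"

const flushThreshold = 64 << 20 // buffer ~64 MiB before each sequential append

type ExtentWriter struct {
	f   *os.File
	buf []byte
}

func Open(path string) (*ExtentWriter, error) {
	// O_APPEND: every flush lands at the end of the extent, never in the middle.
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &ExtentWriter{f: f}, nil
}

// Put queues a block; the disk only ever sees big sequential appends.
func (w *ExtentWriter) Put(block []byte) error {
	w.buf = append(w.buf, block...)
	if len(w.buf) >= flushThreshold {
		return w.Flush()
	}
	return nil
}

// Flush writes the accumulated blocks in one go and resets the buffer.
func (w *ExtentWriter) Flush() error {
	if len(w.buf) == 0 {
		return nil
	}
	_, err := w.f.Write(w.buf)
	w.buf = w.buf[:0]
	return err
}

func (w *ExtentWriter) Close() error {
	if err := w.Flush(); err != nil {
		return err
	}
	return w.f.Close()
}
```

Overwrites then become appends plus later garbage collection – the trade-off shingled media pushes you towards.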

The aim is to shift costs: “We want the dominant cost to be the disks themselves, so that the software is optimised, the CPU is optimised, the RAM and storage – so that the biggest investment is just in big farms of disks.” ®
