AWS has started upgrading the software behind S3 storage cloud
Sharding system coded in 40,000+ lines of Rust is changing the way cloud colossus ensures data durability
Amazon Web Services has released a paper detailing the operations of its Simple Storage Service (S3), and in doing so revealed that new software to power the service is "being gradually deployed within our current service".
Titled Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 [PDF], the paper states that AWS is implementing "ShardStore" – tech described as "a new key-value storage node implementation for the Amazon S3 cloud object storage service".
The document also reveals that S3 currently holds "hundreds of petabytes of customer data".
"At the core of S3 are storage node servers that persist object data on hard disks," the paper explains. "These storage nodes are key-value stores that hold shards of object data, replicated by the control plane across multiple nodes for durability.
"Each storage node stores shards of customer objects, which are replicated across multiple nodes for durability, and so storage nodes need not replicate their stored data internally," it adds.
AWS also describes a concept called "crash consistency" that it employs to prevent data loss and achieve eleven nines of data durability – meaning the service is designed to preserve 99.999999999 per cent of data.
Replicating data across nodes helps AWS to achieve that reliability and means that losing one node won't destroy data.
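As a rough illustration of why replicating across nodes means one lost node need not lose data, here is a minimal Rust sketch. The `Node` type, the placement policy, and every function name are hypothetical and do not come from the paper, which is quoted here only as saying that the control plane replicates shards across multiple nodes.

```rust
use std::collections::HashMap;

/// Toy storage node: maps a shard identifier to shard bytes.
/// `None` models a node whose data has been lost entirely.
type Node = Option<HashMap<u64, Vec<u8>>>;

/// Build a fleet of `n` healthy, empty nodes.
fn new_fleet(n: usize) -> Vec<Node> {
    (0..n).map(|_| Some(HashMap::new())).collect()
}

/// Write one shard to up to `replicas` consecutive nodes, starting at a
/// position derived from the shard id (an assumed placement policy).
/// Returns how many copies were actually written.
fn put_replicated(nodes: &mut [Node], shard_id: u64, data: &[u8], replicas: usize) -> usize {
    let n = nodes.len();
    let mut written = 0;
    for i in 0..replicas.min(n) {
        let idx = (shard_id as usize + i) % n;
        if let Some(store) = nodes[idx].as_mut() {
            store.insert(shard_id, data.to_vec());
            written += 1;
        }
    }
    written
}

/// Read a shard from any surviving node that still holds a copy.
fn get_any(nodes: &[Node], shard_id: u64) -> Option<Vec<u8>> {
    nodes
        .iter()
        .flatten() // skip nodes that have lost their data
        .find_map(|store| store.get(&shard_id).cloned())
}
```

With three replicas in a four-node fleet, setting one of the holding nodes to `None` still leaves `get_any` able to return the shard. Only losing every replica makes the read fail.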
"Recovering from a crash that loses an entire storage node's data creates large amounts of repair network traffic and IO load across the storage node fleet," the paper explains. "Crash consistency also ensures that the storage node recovers to a safe state after a crash, and so does not exhibit unexpected behavior that may require manual operator intervention."
ShardStore keeps track of all those objects. Its keys are shard identifiers and values are shards of customer object data. The importance of ShardStore data means it, too, is distributed across different nodes and disks.
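To make the key-value shape concrete, the following is a toy in-memory sketch in Rust. It assumes a `ShardId` newtype and `put`, `get` and `delete` operations, none of which are taken from the paper. The real ShardStore persists to disk and is far more involved.

```rust
use std::collections::BTreeMap;

/// Hypothetical shard identifier. The real key format is not described here.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ShardId(u128);

/// Toy stand-in for a storage node: keys are shard identifiers,
/// values are the bytes of one shard of a customer object.
#[derive(Default)]
struct ToyShardStore {
    shards: BTreeMap<ShardId, Vec<u8>>,
}

impl ToyShardStore {
    /// Store a shard, returning any previous value held under the same key.
    fn put(&mut self, id: ShardId, data: Vec<u8>) -> Option<Vec<u8>> {
        self.shards.insert(id, data)
    }

    /// Look up a shard by identifier.
    fn get(&self, id: ShardId) -> Option<&[u8]> {
        self.shards.get(&id).map(Vec::as_slice)
    }

    /// Remove a shard, reporting whether it existed.
    fn delete(&mut self, id: ShardId) -> bool {
        self.shards.remove(&id).is_some()
    }
}
```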
The paper gives no indication of whether users will perceive any change as ShardStore is implemented, but does mention that it is "API-compatible with our existing storage node software, and so requests can be served by either ShardStore or our existing key-value stores". The Reg can't imagine the change to ShardStore would be disruptive to users – a downtime requirement would see AWS laughed out of the cloud.
The paper describes how AWS used lightweight formal methods – a technique for using automation to verify that software meets its spec – to ensure ShardStore is doing its job. Most of the word count is therefore dedicated to explaining how AWS tested the 40,000-plus lines of Rust that make up ShardStore, and the many acts of deep storage wonkery the software performs to keep S3 alive.
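For a flavour of what automated checking against a spec can look like, here is a minimal Rust sketch of one lightweight technique: run the same randomly generated operations against an implementation and a much simpler reference model, and flag the first point where they disagree. The `LogStore` implementation, the `Model`, the operation set and the hand-rolled random number generator are all hypothetical and are not drawn from ShardStore's code.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug)]
enum Op {
    Put(u8, u8),
    Get(u8),
    Delete(u8),
}

/// Reference model: the "spec", simple enough to trust by inspection.
#[derive(Default)]
struct Model {
    map: HashMap<u8, u8>,
}

/// Implementation under test: an append-only log where the newest
/// record for a key wins and `None` marks a deletion.
#[derive(Default)]
struct LogStore {
    log: Vec<(u8, Option<u8>)>,
}

impl LogStore {
    fn put(&mut self, k: u8, v: u8) {
        self.log.push((k, Some(v)));
    }
    fn delete(&mut self, k: u8) {
        self.log.push((k, None));
    }
    fn get(&self, k: u8) -> Option<u8> {
        self.log.iter().rev().find(|(key, _)| *key == k).and_then(|(_, v)| *v)
    }
}

/// Tiny deterministic generator (an LCG) so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

fn random_op(rng: &mut Lcg) -> Op {
    let k = (rng.next_u64() % 8) as u8; // few keys, so operations collide often
    match rng.next_u64() % 3 {
        0 => Op::Put(k, (rng.next_u64() % 256) as u8),
        1 => Op::Get(k),
        _ => Op::Delete(k),
    }
}

/// Apply the same operations to both stores. On the first read where
/// they disagree, return the index of the offending operation.
fn check_conformance(ops: &[Op]) -> Result<(), usize> {
    let mut model = Model::default();
    let mut store = LogStore::default();
    for (i, op) in ops.iter().enumerate() {
        match *op {
            Op::Put(k, v) => {
                model.map.insert(k, v);
                store.put(k, v);
            }
            Op::Delete(k) => {
                model.map.remove(&k);
                store.delete(k);
            }
            Op::Get(k) => {
                if model.map.get(&k).copied() != store.get(k) {
                    return Err(i);
                }
            }
        }
    }
    Ok(())
}

/// Run `trials` random sequences of `len` operations each.
fn run_trials(seed: u64, trials: usize, len: usize) -> Result<(), usize> {
    let mut rng = Lcg(seed);
    for _ in 0..trials {
        let ops: Vec<Op> = (0..len).map(|_| random_op(&mut rng)).collect();
        check_conformance(&ops)?;
    }
    Ok(())
}
```

A call such as `run_trials(42, 200, 50)` exercises 200 random sequences. A bug in `LogStore::get`, such as scanning the log oldest-first, would surface as an `Err` carrying the index of the first mismatching read.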
In conclusion, the authors report that AWS’s experience with lightweight formal methods has been "positive, with a number of issues prevented from reaching production and substantial adoption by the ShardStore engineering team".
The authors included several AWS staffers as well as folks from the University of Texas at Austin, the University of Washington, and Swiss public research university ETH Zurich. ®