Boffins beat Amazon Web Services at its own storage game
Bell Labs and uni propose SEARS protocol to improve on AWS's own storage schemes
Boffins from Bell Labs and Stony Brook University have put together a cloud storage system they hope can serve as a reference design for future cloud implementations.
Called SEARS – Space Efficient And Reliable Storage – the research has been published on arXiv, and appears to rival Amazon's S3 cloud storage.
The researchers argue that what's expected from cloud storage is easy to articulate – reliability, interactive user access, global coverage and good response times – but not so easy to achieve.
Added to that is the inevitable trade-off between storage space and computational cost. RAID schemes, for example, are space efficient, but computationally demanding; GFS has lower computational requirements, but its file duplication needs more space.
Hence the SEARS proposal, a combination of deduplication techniques and erasure coding that can be configured for either fast file access, or high storage efficiency.
At a high level, the SEARS architecture works as follows:
On the upload side, the client chunks the file and generates metadata which it sends to the “switching node” (a server node designated to that user). The switching node checks the file metadata to identify unique chunks, and only those not already in storage are uploaded.
On the retrieval side, the user device receives unique chunks from multiple storage nodes for high performance.
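To make the upload and retrieval flow concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the researchers' code: the `SwitchingNode` class, its method names, and the in-memory dictionary standing in for the storage nodes are all hypothetical.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """SHA-1 digest of a chunk, used as its ID."""
    return hashlib.sha1(data).hexdigest()


class SwitchingNode:
    """Hypothetical stand-in for a SEARS switching node."""

    def __init__(self):
        self.store = {}  # chunk ID -> bytes (stands in for the storage nodes)
        self.files = {}  # file name -> ordered list of chunk IDs

    def missing(self, chunk_ids):
        """Return the distinct chunk IDs that are not already in storage."""
        return [cid for cid in dict.fromkeys(chunk_ids) if cid not in self.store]

    def put_chunks(self, chunks):
        """Store the chunks the client was asked to upload."""
        self.store.update(chunks)

    def commit(self, name, chunk_ids):
        """Record which chunks, in order, make up the file."""
        self.files[name] = list(chunk_ids)

    def retrieve(self, name):
        """Reassemble a file from its stored chunks."""
        return b"".join(self.store[cid] for cid in self.files[name])


def upload(node, name, chunks):
    """Client side: send metadata first, then only the chunks the node lacks."""
    ids = [chunk_id(c) for c in chunks]
    by_id = dict(zip(ids, chunks))
    needed = node.missing(ids)
    node.put_chunks({cid: by_id[cid] for cid in needed})
    node.commit(name, ids)
    return len(needed)  # number of chunks actually transferred
```

In this sketch, uploading a second file that shares chunks with the first transfers only the chunks that are new to the store. A real deployment would fetch chunks from several storage nodes in parallel on retrieval; the dictionary lookup here only shows the bookkeeping.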
The researchers acknowledge that in a content de-duplication scheme, there's a trade-off to be made in choosing the size of the chunk. If chunks are too big, there's less chance of a “hit” during the de-dupe process, “while smaller chunks lead to less efficient random access pattern”. In SEARS, chunks are between 1 KB and 8 KB (with an average 4 KB), and 160-bit SHA-1 hashing gives a fixed-size value as the chunk ID.
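The article doesn't say how SEARS picks its chunk boundaries, so the following Python sketch assumes a simple content-defined chunker driven by a rolling hash. The 1 KB and 8 KB bounds and the SHA-1 chunk ID come from the description above; the `GEAR` table, the `MASK` value and the function names are illustrative assumptions, and the resulting mean chunk size is only roughly 4 KB.

```python
import hashlib
import random

MIN_CHUNK, MAX_CHUNK = 1024, 8192  # size bounds quoted in the article
MASK = 0x0FFF  # assumed: cut when the low 12 hash bits are all zero

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data: bytes):
    """Split data at content-defined boundaries within the size bounds."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: old bytes shift out of the low bits as new ones arrive.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if length >= MAX_CHUNK or (length >= MIN_CHUNK and (h & MASK) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # the final chunk may be under 1 KB
    return chunks


def chunk_id(chunk: bytes) -> str:
    """160-bit SHA-1 digest as a fixed-size chunk ID (40 hex characters)."""
    return hashlib.sha1(chunk).hexdigest()
```

Because boundaries depend on the content rather than on byte offsets, identical runs of data in different files tend to yield identical chunks, which is what gives de-duplication its “hits”.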
During file storage, file metadata is created containing the chunk IDs in the file, and an ID for the storage cluster holding the chunk.
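As a rough picture of what such a metadata record might hold, here is a Python sketch. The class and field names (`ChunkRef`, `FileMetadata`, `cluster_id`) are hypothetical; the article says only that the metadata lists chunk IDs and an ID for the cluster holding each chunk.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRef:
    chunk_id: str    # SHA-1 digest of the chunk
    cluster_id: int  # storage cluster that holds the chunk


@dataclass
class FileMetadata:
    name: str
    chunks: list = field(default_factory=list)  # ChunkRef entries, in file order

    def clusters(self):
        """Clusters that must be contacted to rebuild the file."""
        return sorted({ref.cluster_id for ref in self.chunks})
```

Keeping the references in file order means a repeated chunk is listed twice in the metadata but stored only once.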
To let sysadmins choose between performance and efficiency, SEARS offers two binding schemes:
- Chunk-level binding – for archival storage running in the background. Chunk-level binding is designed to maximise system-wide de-duplication, “such that storage space of all clusters are evenly consumed as time passes”.
- User-level binding – for applications focussing on performance, this scheme binds users to particular clusters. This sacrifices system-wide de-duplication efficiency for fast file retrieval.
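The difference between the two schemes can be sketched in a few lines of Python. This is an assumption-laden illustration: the article doesn't describe how SEARS maps chunks or users to clusters, so the hash-based placement, the `USER_CLUSTER` table and the cluster count are all hypothetical.

```python
import hashlib

NUM_CLUSTERS = 4  # hypothetical cluster count

# Hypothetical static assignment of users to clusters.
USER_CLUSTER = {"alice": 0, "bob": 1}


def chunk_level_cluster(user: str, chunk_id: str) -> int:
    """Chunk-level binding: placement depends only on the chunk ID,
    so an identical chunk lands on the same cluster for every user."""
    digest = hashlib.sha1(chunk_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CLUSTERS


def user_level_cluster(user: str, chunk_id: str) -> int:
    """User-level binding: every chunk of a user goes to that user's cluster."""
    return USER_CLUSTER[user]


def stored_copies(uploads, bind) -> int:
    """Count distinct (cluster, chunk) pairs, i.e. copies actually stored."""
    return len({(bind(user, cid), cid) for user, cid in uploads})
```

If Alice and Bob both upload the same chunk, chunk-level binding stores it once, while user-level binding stores one copy on each user's cluster. That extra copy is the de-duplication cost paid for keeping each user's files together.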
The system was tested across 10 Amazon EC2 instances, with the existing R-ADMAD storage scheme (there's a description of R-ADMAD here).
On ten machines in the eastern USA, the researchers claimed to achieve 2.5-second retrieval of a 3 MB file, compared to 7 seconds from an ordinary Amazon EC2 instance. ®