Facebook has ditched RAID and replication for its nearline storage, using distributed erasure coding to isolate what it calls "warm BLOBs" instead.
- BLOB — Binary Large OBject — Facebook user’s photos, videos, etc.
- Warm — data that has to be kept and is accessed at a lower rate than hot data, but more than archived or cold data. Typically, it’s more than a week old. Hot BLOBs, of course, are accessed more frequently.
- Erasure coding — the adding of calculated parity values (Reed-Solomon codes) to a string of bytes, such that the string can be recovered if an error deletes or distorts some of the complete string. Typically more efficient than RAID at protecting data as it uses less space.
Facebook’s special problem is that it has three main types of user data, with associated metadata, and these three types need huge amounts of storage. Its main and most-accessed datasets are the recent, less than one-week-old postings on a user’s timeline. These get accessed a lot by the user’s "Friends".
It uses its Haystack storage system for this data, which uses triple replication to protect the data and make sure it can always be accessed and accessed quickly, with as near to a single disk access as possible (once the metadata calculations have been run).
As this data ages, it is accessed less often, cooling from hot to warm, and yet still requires fast access when it is actually called upon. Trouble is, the damn stuff just keeps on growing. For example, at the end of January this year, Facebook was storing more than 400 billion photos.
Relative request rates by age. Each line is relative to only itself, absolute values have been denormalized to increase readability, and points mark an order-of-magnitude decrease in request rate.
Computing the count of IOs per terabyte shows that its IO density is much less than hot data and that means it can be stored without using a triple rep scheme, and yet still have acceptably fast access, while being protected against disk, host and rack failures.
Facebook engineers have set up a new storage system, f4, to store this set of warm BLOBs. A paper by the engineers explains: “f4 is a new system that lowers the effective-replication-factor of warm BLOBs while remaining fault-tolerant and able to support the lower throughput demands.”
Facebook’s engineers say:
[f4] uses Reed-Solomon coding and lays blocks out on different racks to ensure resilience to disk, machine, and rack failures within a single data center. Is uses XOR coding in the wide-area to ensure resilience to data center failures. f4 has been running in production at Facebook for over 19 months. f4 currently stores over 65PB of logical data and saves over 53PB of storage.
BLOBs are aggregated into logical volumes of c100GB with aggregated filesystem metadata. They consist of a data file, index file and journal file. The index file is a snapshot of the in-memory lookup structure of the storage machines. When full volumes are locked and creates are not allowed.
Volumes are stored in one data centre and in cells, where a cell is 14 racks of 15 hosts with 30 x 4TB drives per host. Each volume/stripe/block is paired with a buddy volume/stripe/block in a different geographic region. Facebook stores an XOR of the buddies in a third region. This scheme protects against failure of one of the three regions.
Will enterprises in general need to move to such a storage scheme for their nearline data? It's unlikely as they won’t necessarily have the same amount of data as Facebook, nor its growth rate speed and or its immutability.
Read more about Facebook’s Mr BLOBby f4 scheme here (17-page PDF). ®