Facebook promises to open up its log storage system
LogDevice: how to make sense of 10 hyperscale data centres
Sysadmins struggling to manage lots of logs may want to Like a new "friend", after Facebook last week decided to share its distributed log management system.
If you're just running one site, Zuck's "LogDevice" code might not be for you: it's how Facebook makes sense of its 10 data centres, including how The Social Network™ brings those logs back into sync when something goes wrong.
Perhaps the most impressive number is in that operation: Facebook claims that after a failure, LogDevice can rebuild logs to “fully restore the replication factor of all records affected” at between 5 Gbps and 10 Gbps.
As the post explains, logging at scale presents two particularly wicked problems: making the record storage highly available and durable, while maintaining a “repeatable total order on those records”.
The specs needed to achieve this are:
- LogDevice is record-oriented: the smallest indivisible unit written to the log is a full record rather than a byte, which the company says provides “better write availability in the presence of failures”;
- Logs are append-only – log records can't be modified;
- To keep size in check, logs can be trimmed under either time-based or space-based retention policies.
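Those three properties are easy to picture in miniature. What follows is a toy in-memory sketch, not LogDevice code: the `Record` and `TrimmableLog` names, the method names and the retention parameters are all invented for illustration.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a record cannot be modified once written
class Record:
    timestamp: float
    payload: bytes


class TrimmableLog:
    """Hypothetical append-only log with time- and space-based retention."""

    def __init__(self, max_age_seconds=None, max_bytes=None):
        self.max_age_seconds = max_age_seconds
        self.max_bytes = max_bytes
        self._records = deque()
        self._size = 0

    def append(self, payload, now=None):
        """Write one whole record; there is no call for partial writes or updates."""
        now = time.time() if now is None else now
        record = Record(now, bytes(payload))
        self._records.append(record)
        self._size += len(record.payload)
        self.trim(now)

    def trim(self, now=None):
        """Drop the oldest records until both retention limits are satisfied."""
        now = time.time() if now is None else now
        while self._records:
            oldest = self._records[0]
            too_old = (self.max_age_seconds is not None
                       and now - oldest.timestamp > self.max_age_seconds)
            too_big = (self.max_bytes is not None
                       and self._size > self.max_bytes)
            if not (too_old or too_big):
                break
            self._records.popleft()
            self._size -= len(oldest.payload)

    def read(self):
        """Return the surviving payloads, oldest first."""
        return [record.payload for record in self._records]
```

Trimming only ever removes records from the old end of the log, so what remains is always a contiguous, ordered tail.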
One key to getting the scale Facebook needs is decoupling log sequencing from the records themselves: the sequencer runs as a separate process, either on a storage node or on its own node.
The sequence numbers themselves aren't a single datum, but a tuple containing an epoch and an offset within that epoch. “The epoch store acts as a repository of durable counters, one per log, that are seldom incremented and are guaranteed to never regress. Today we use Apache Zookeeper as the epoch store for LogDevice.”
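To see why a two-part sequence number helps, here is a minimal sketch. It assumes that a replacement sequencer takes a fresh epoch from the epoch store when it starts, which the quoted description implies but does not spell out. `LSN`, `EpochStore` and `Sequencer` are hypothetical names, and the dictionary stands in for the durable counters Facebook keeps in Zookeeper.

```python
from typing import NamedTuple


class LSN(NamedTuple):
    """Sequence number as a tuple; tuples compare epoch first, then offset."""
    epoch: int
    offset: int


class EpochStore:
    """Stand-in for the durable per-log counters (an in-memory dict here)."""

    def __init__(self):
        self._epochs = {}

    def next_epoch(self, log_id):
        # Counters only ever go up, so an epoch is never reused or regressed.
        self._epochs[log_id] = self._epochs.get(log_id, 0) + 1
        return self._epochs[log_id]


class Sequencer:
    """Issues sequence numbers for one log within a single epoch."""

    def __init__(self, log_id, epoch_store):
        # Assumption: each new sequencer instance claims a new epoch.
        self.epoch = epoch_store.next_epoch(log_id)
        self._offset = 0

    def next_lsn(self):
        self._offset += 1
        return LSN(self.epoch, self._offset)


store = EpochStore()
first = Sequencer("log-a", store)
a, b = first.next_lsn(), first.next_lsn()
replacement = Sequencer("log-a", store)  # e.g. after the first one fails
c = replacement.next_lsn()
```

Because the epoch is compared before the offset, everything the replacement issues sorts after everything its predecessor issued, even though its offset restarts from 1. The expensive durable counter is only touched when a sequencer starts, not on every write.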
LogDevice separates sequencing from object storage
As for log object storage, LogDevice randomly assigns a record to a storage node – hence, for example, you don't have all of the logs from a particular server landing on the same disk, and you don't lose the whole thing if the disk fails.
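A rough sketch of the idea, under the assumption that each record is copied to a fixed number of distinct, randomly chosen nodes: the function names, node names and replication factor below are made up for the example, and LogDevice's real placement logic is not described in this article.

```python
import random


def place_record(nodes, replication_factor, rng=random):
    """Pick distinct storage nodes at random to hold copies of one record."""
    return rng.sample(nodes, replication_factor)


def under_replicated(placements, failed_node):
    """List the record IDs that lost a copy when failed_node went down."""
    return [record_id for record_id, copies in placements.items()
            if failed_node in copies]


rng = random.Random(0)  # seeded only so the example is repeatable
nodes = [f"node-{i}" for i in range(10)]
placements = {record_id: place_record(nodes, 3, rng)
              for record_id in range(1000)}

affected = under_replicated(placements, "node-0")
# Nodes that still hold a copy of at least one affected record.
helpers = {node for record_id in affected
           for node in placements[record_id] if node != "node-0"}
```

With random placement, the surviving copies of the affected records are scattered over many nodes (`helpers` above), so re-replication can draw on all of them at once rather than on a single mirror disk.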
That's where the fast rebuilding is important: what if a second failure takes place while a record is still waiting to be restored? The faster the rebuild, the shorter that window, which is what the 5 Gbps to 10 Gbps rebuild rate is designed for.
All of that centralised logging starts out, naturally enough, as local logs, and for this LogDevice introduces a write-optimised store called LogsDB. It's “designed to keep the number of disk seeks small and controlled, and the write and read IO patterns on the storage device mostly sequential”, the post says.
Facebook says its ultimate goal is to open source LogDevice, hopefully this year. ®