Facebook: Why our 'next-gen' comms ditched MySQL
Cassandra founder plays Ace of HBase
About a year ago, when Facebook set out to build its email-meets-chat-meets-everything-else messaging system, the company knew its infrastructure couldn't run the thing. "[The Facebook infrastructure] wasn't really ready to handle a bunch of different forms of messaging and have it happen in real time," says Joel Seligstein, a Facebook engineer who worked on the project. So Seligstein and crew mocked up a multifaceted messaging prototype, tossed it onto various distributed storage platforms, and ran a Big Data bake off.
The winner was HBase, the open source distributed database modeled after Google's proprietary BigTable platform. Facebook was already using MySQL for message storage, the open source Cassandra platform for inbox search, and the proprietary Haystack platform for storing photos. But in the company's mind, HBase was better equipped to handle a new-age messaging system that would seek to seamlessly juggle email, chat, and SMS as well as traditional on-site Facebook messages.
Originally built by Powerset – a semantic search outfit now owned by Microsoft – HBase is part of the Apache Hadoop project, a sweeping effort to mimic Google's back-end infrastructure. Like so many other big-name web outfits – including Microsoft, Yahoo!, and Twitter – Facebook uses the core Hadoop MapReduce platform to handle its back-end number-crunching tasks. It now spans a petabyte-scale cluster. "MapReduce is so useful," says Facebook infrastructure guru Karthik Ranganathan. "It's like the air you breathe."
Like Hadoop MapReduce, HBase uses the distributed Hadoop File System (HDFS), and Facebook chose it in part because it has experience scaling and debugging HDFS. But Hadoop isn't the only distributed platform the company uses on the back end. HBase was chosen, the company says, because it suited the task at hand.
The new Facebook messaging system uses Hbase to store the text and metadata for the messages as well as the indices needed to search these messages – an epic amount of data. Even before the new messaging system was rolled out, Facebook was juggling about 15 billion on-site messages a month (about 14 TB) and 120 billion chat messages (11 TB). It needed about three times more space to store all that data, with the average message going to multiple people, and as the new system adds good, old-fashioned email (from @facebook.com addresses), the storage requirements will only expand. HBase doesn't store everything – Haystack handles attachments and overly large messages – but it handles most of the data that's frequently updated.
"The email workload is a write dominated workload. We need to take a lot of writes very quickly," Ranganathan says. "We used HBase for the data the grows very fast, which is essentially the metadata."
Speaking alongside Seligstein at a recent "tech talk" that examined the infrastructure behind the company's new messaging system, unveiled last month, Ranganathan explained that the company chose HBase in part because its "strong consistency model." As it replicates data across myriad machines, he says, HBase is particularly adept at keeping all those replicas synchronized. "If something fails, we want to be able to read the data from somewhere else," Ranganathan says. "Once it's written, it doesn't matter what replica you go to. It's always in sync."
Facebook also picked HBase because it's designed for automatic failover. When a server goes down, it automatically reverts to another – at least in theory. "When you're talking about a lot of data, you're obviously talking about a lot of machines, and when you're talking about a lot of machines, failure is the norm," Ranganathan says. "When failure is the norm, we want to be able to automatically serve that data without too much intervention or it taking too much time."
At the same time, Ranganathan and crew like the way HBase works to prevent failure. As he puts it, when HBase breaks data into pieces and distributes them across a cluster, it puts multiple and completely separate data "shards" on each machine. This means that if one server dies, its work can be picked up several different machines and not just one. In other words: better load balancing.
If you put your data shards on just a few machines and a machine goes down, Ranganathan says, the burden is picked up by a single server, and this may end-up toppling other servers like dominos. "[The second server] dies, and then two machines' slack has to be taken up by a third, and this thing cascades all the way down," he says. "HBase shards the data into a lot of virtual shards and puts multiple shards on a single machine. So a machine dying spreads [its data] across multiple machines and the utilization of your entire cluster goes up."
Ranganathan also likes HBase's LZO (Lempel-Ziv-Oberhumer) data compression – "it's easier on the CPU," he says – and he likes that when you read, modify, and write in HBase, you can expect all readers to get the same values. "This really helps with counting – number of unread messages and things like that," he says.
But HBase did require modification. Facebook engineer Nicolas Spiegelberg and another company developer spent a year adding commits to the platform, mainly in an effort to minimize data loss. "The goal was zero data loss," says Spiegelberg. Committers updated HDFS Sync support, added some ACID properties, and even redesigned the HBase master.
Ranganathan and company used Haystack for attachment and very large messages because they didn't need the same write speed with these beefy files and Haystack could use far fewer servers in replicating the files across data centers. But they ditched MySQL entirely. They felt it couldn't provide quick access to such a large amount of data. "We have a lot of data that's quick and temporal," Seligstein says. "You need access to your first page of messages quite often."
For many, it's surprising that Facebook didn't put the messaging system on Cassandra, which was originally developed at the company and has since become hugely popular across the interwebs. But Ranganathan and others felt that it was too difficult to reconcile Cassandra's "eventual consistency model" with the messaging system setup.
"If you're going to front the data with some sort of a cache...then it becomes very difficult for use to program a cache where the database heals from underneath," Ranganathan says. "For a product like messaging, we needed a strong consistency model [as HBase offers]." If a user sends an email, he says, you have to be able to tell the user – immediately the email was sent. "If it doesn't show that it's been sent, I think 'Oh, I didn't send it,' and then I send it again. Then I come in the next day and I see it's been sent twice, and I get pissed at Facebook."
He also felt that the system needed HBase physical replication as opposed to Cassandra's logical replication. With Cassandra, he says, if something gets corrupted and you lose a disk and you have to restore the data, you have to restore the entire replica. Because Facebook uses very dense machines, this would mean a very long recovery time.
Some Cassandra backers don't see it that way. But Ranganathan is at least worth listening to. He's one of Cassandra's original authors. ®