Facebook rolls out new web and database server designs
Old photos, like revenge, are best served cold
Open Compute 2013 Facebook, the company behind the founding of the Open Compute Project that is opening server, storage, rack, and data center designs, gave a sneak peek this week at some new models that it is working on and could donate to the OCP cause.
At the Open Compute Summit in Santa Clara this week, Frank Frankovsky, vice president of hardware design and supply chain at Facebook and also chairman of the Open Compute effort, showed off a few server designs and talked up some microserver standards that it has established to make it possible to mix and match different processor architectures on the same backplane and in the same chassis.
The first new server coming out of Facebook is code-named "Dragonstone", and unlike some of the other designs that were shown off this week, its specs are already available on the Open Compute site.
According to Frankovsky, for certain database functions at Facebook, it was more important to have redundant power supplies for a database node than it was to have multiple compute nodes in an Open Compute V2 chassis sharing a single power supply. (This chassis and its related Xeon and Opteron server nodes were divulged in May 2012 and contributed to the Open Compute Project, and they are the servers that are in use in Facebook's data center in Forest City, North Carolina.)
The Open Compute V2 chassis uses a server node called "Windmill" that is based on two-socket Xeon E5-2600 processors from Intel and another two-socketeer called "Watermark" that is based on the Opteron 6200, and now the Opteron 6300, processors from Advanced Micro Devices.
The mechanical drawing of the Dragonstone server
As you can see from the mechanical drawing above, the Dragonstone server has a two-socket server node on the left, a redundant power supply in the middle, and then space for 3.5-inch disk drives or flash storage from Fusion-io in a storage sled on the right.
This particular machine is based on the Intel Windmill board, and redundant power supplies from two different, er, suppliers – Power One and Delta – have been certified to fit on the middle tray and feed the server node and the storage. Fusion-io has come up with a 3.2TB flash storage card that hooks into a PCI-Express 2.0 x8 slot (it has ten flash modules) that is being used on Dragonstone. This card will be commercialized as the ioScale enterprise flash by Fusion-io, which has contributed the mechanical design of this card to Open Compute so others can implement it in OCP systems.
Frankovsky said that by doubling up the power supplies and making an Open Compute-style database server, Facebook was able to cut costs by 40 per cent compared with its current database servers. (He did not say what that prior database server was or whether it used flash or disk storage.) This Dragonstone server is being installed in Facebook's third data center, which is located in Lulea, Sweden.
The Winterfell server designed by Facebook will eventually be contributed to the Open Compute cause, but its specs are not yet available.
The three-node Winterfell server chassis from Facebook
The Winterfell machine is Facebook's latest Web server design and slides three x86 servers into the three bays of the Open Compute chassis. Not much more is known about it at this point, but clearly three servers in a 1.5U chassis is better than two.
When you have more than 1 billion users and make billions of dollars peddling ads to them, you not only have some unique needs but you can indulge in engineering your systems and data centers to specifically meet those needs. By doing so – as Facebook fully understands in ways that most companies do not and as Google and Amazon and a few others do – you control the experience that users have and the costs that you incur providing that experience.
We used to live in a world where there were those who could afford the high availability and high throughput of mainframes and the rest of us had to cobble together networks of systems based on RISC/Unix or then x86 servers that mimicked as best they could some of the aspects of a mainframe.
We are now entering a world where some companies not only can indulge in custom engineering for their systems, data centers and software, but their very business demands it, while other companies will do the best they can with a mix of third party systems and application software and "engineered systems" with converged servers, storage, and networking.
Facebook's Dragonstone database server design
The rest of us get what we can afford, or we get whatever Facebook and its friends provide through the Open Compute Project if we can afford to indulge in custom servers.
Google has been generous with the software ideas, proving to the rest of the world that certain things could be done to manage big data and providing insights that have driven others to mimic its advances without actually releasing its code out into the world as open source. Google opens up the idea, but not the technology, and ditto for its own custom servers and data center designs.
Facebook has been generous with its Cassandra NoSQL data store as well as with the system and data center designs from Open Compute, and now third parties are starting to work with ODMs to make their own custom iron.
Jay Parikh, vice president of infrastructure at Facebook, talked quite a bit yesterday about the challenges that the social network is having storing the more than 240 billion photos on the site, a number that is growing by 350 million pictures per day.
That works out to an additional 7PB in the Facebook Photo data store every month. Obviously, you can't do that on an expensive storage area network, and it would even be an economic challenge on the bare-bones Open Compute servers and Open Vault storage arrays that Facebook has already designed and put into production.
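As a rough sanity check on those two figures, the short Python sketch below divides the monthly growth by the monthly upload count. The 30-day month and decimal petabytes are assumptions, and the result is only an implied average of bytes stored per uploaded photo (which may include resized copies or redundancy), not a number Facebook has stated.

```python
# Figures quoted in the article.
PHOTOS_PER_DAY = 350_000_000
GROWTH_PER_MONTH_BYTES = 7 * 10**15  # 7PB, assuming decimal petabytes

# Assumption: a 30-day month.
DAYS_PER_MONTH = 30

photos_per_month = PHOTOS_PER_DAY * DAYS_PER_MONTH
bytes_per_photo = GROWTH_PER_MONTH_BYTES / photos_per_month

print(f"{photos_per_month:,} photos per month")
print(f"about {bytes_per_photo / 1000:.0f} KB stored per photo")
```

On these assumptions, the quoted growth rate corresponds to roughly two-thirds of a megabyte stored for every picture uploaded.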
The rack and server design for Facebook's cold storage
If Facebook wants you to store all your photos on its site – and therefore have other people coming and looking at them, thus generating traffic and therefore ad money – then it can't ever lose a photo and it has to preserve the quick response time of its web farms. It cannot, as Parikh explained, just throw old photos out there on tape and tell you to come back in a day to see them.
What Facebook can do is use hierarchical storage management, albeit a homegrown variant that is, as you would expect from uber-nerds, kinda clever. The important thing to do first was to analyze its own data, and as you might expect, as photos age, they are accessed less.
In the Facebook pool, 82 per cent of the traffic of retrieving photos is actually only across 8 per cent of the stored capacity. And that means you don’t have to keep the other 92 per cent of the photos in the cache and storage vaults that are close to the web servers. You can put them in a different part of the data center on a different kind of storage server.
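The kind of analysis that yields a split like this can be sketched in a few lines of Python. This is not Facebook's method, which the article does not describe; it is a simple greedy ranking by reads per byte, and the photo names, sizes, and read counts are made up for illustration.

```python
def hot_set(photos, traffic_share=0.82):
    """Greedily pick the most-requested photos until they cover
    `traffic_share` of all reads.

    `photos` maps a photo id to a (size_bytes, reads) pair.
    Returns (hot_ids, fraction_of_total_capacity_they_occupy).
    """
    total_reads = sum(reads for _, reads in photos.values())
    total_bytes = sum(size for size, _ in photos.values())
    # Rank by reads per byte so that small, popular photos come first.
    ranked = sorted(photos.items(),
                    key=lambda kv: kv[1][1] / kv[1][0],
                    reverse=True)
    hot_ids, hot_reads, hot_bytes = [], 0, 0
    for photo_id, (size, reads) in ranked:
        if hot_reads >= traffic_share * total_reads:
            break
        hot_ids.append(photo_id)
        hot_reads += reads
        hot_bytes += size
    return hot_ids, hot_bytes / total_bytes


# Hypothetical sample: two recent photos draw most of the reads.
photos = {
    "new_1": (1_000_000, 500),
    "new_2": (1_000_000, 400),
    "old_1": (1_000_000, 50),
    "old_2": (1_000_000, 30),
    "old_3": (1_000_000, 20),
}
ids, capacity_fraction = hot_set(photos)
print(ids, capacity_fraction)
```

Everything outside the returned hot set is a candidate for the slower, cheaper tier.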
The data center that Facebook has built to test out its ideas has 1EB (that's Exabyte) of capacity and 1.5 megawatts per room, and there is no redundant electrical system at all because Facebook is trying to cut back on power consumption for photo storage.
The cold storage service uses Reed-Solomon encoding and checksums and spreads the bits that comprise a photo over multiple server nodes and – here's the tricky bit – only one drive per server. While the server node has many disk drives, only one drive in the machine is powered up at any time in an array of servers, and that is only when the node is being accessed to get a specific photo.
The demand for old photos is so small and powering drives up and down is so fast that this causes only a slight delay over the network. And if a drive in the array of servers fails, the data can be reconstructed by running the Reed-Solomon encoding in reverse.
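To show how spreading encoded pieces across servers lets lost data be rebuilt, here is a toy Reed-Solomon-style erasure code in Python. It is a sketch under stated assumptions, not Facebook's implementation: it works over the prime field GF(257) rather than the GF(2^8) arithmetic production codes normally use, and the shard counts (six data shards, three parity shards) are invented for the example, since the article gives no such parameters. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Extend k data symbols to n shards; the first k shards are the data.

    Parity symbols can take the value 256, so they need more than one byte.
    """
    points = list(enumerate(data))
    return list(data) + [_interpolate(points, x) for x in range(len(data), n)]


def recover(shards, k):
    """Rebuild the k data symbols from any k surviving shards.

    `shards` is the full list with lost shards replaced by None.
    """
    alive = [(x, y) for x, y in enumerate(shards) if y is not None][:k]
    if len(alive) < k:
        raise ValueError("too many shards lost")
    return [_interpolate(alive, x) for x in range(k)]


# Hypothetical layout: one shard per server, three servers lost.
original = b"photo!"
shards = encode(list(original), n=9)
for lost in (1, 4, 7):
    shards[lost] = None
print(bytes(recover(shards, k=len(original))))
```

Any six of the nine shards are enough to rebuild the data, which is why a single failed drive, or a drive that happens to be powered down, does not make a photo unreadable.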
The resulting server, which has not yet been contributed to Open Compute but could be, has 2PB of storage per rack, which is eight times the storage density of Facebook's current storage servers, and burns only 2 kilowatts per rack because at any given time, most of the disk drives are turned off. The server nodes have 10 Gigabit Ethernet links coming into them and the rack has a 40GE pipe going back out to the main storage of the Facebook site so once a photo is found it is piped out right quick to a web page.
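The rack figures above imply two further numbers that the article does not state directly. The small Python sketch below derives them, assuming decimal units (1PB = 1,000TB); the variable names are illustrative.

```python
# Figures quoted in the article, with 1PB taken as 1,000TB.
RACK_CAPACITY_TB = 2_000   # 2PB per cold storage rack
RACK_POWER_W = 2_000       # 2 kilowatts per rack
DENSITY_GAIN = 8           # "eight times the storage density"

watts_per_tb = RACK_POWER_W / RACK_CAPACITY_TB
previous_rack_capacity_tb = RACK_CAPACITY_TB / DENSITY_GAIN

print(f"{watts_per_tb:.1f} W per TB stored")
print(f"{previous_rack_capacity_tb:.0f} TB per rack implied for the current storage servers")
```

That comes to about one watt per terabyte stored, and an implied 250TB per rack for the storage servers the cold storage design is being compared against.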
The resulting setup provides storage at one-third the cost of the prior generation of Open Vault storage arrays, and the data center housing these cold storage racks is one-fifth the cost of the conventional data centers built by Facebook.
It looks like Facebook is willing to take the chance that one of these storage rooms could fail in its data center and that you won't gripe too much if it does as long as it comes back up and your photos are still there. That's probably a safe bet, considering what you pay to use Facebook. ®