Setting up a 10,000-core physical server cluster to run supercomputing workloads is a tough task that can take weeks or months, and cost millions of dollars including the servers, storage, switching, and personnel overhead. And the thing about a cluster is that no matter how hard you try to share it, it is very tough to get anywhere near peak utilization over the course of the year.
Enter the cloud, which HPC customers have been skeptical about (remember the Sun Cloud?) but may start taking a shine to given the compelling economics and the management and security services that companies are layering on top of public clouds.
Cycle Computing has made a tidy little business for itself firing up HPC grids on various public clouds using its homegrown cloud management and security tools. Last month, Genentech, arguably the pioneer of the modern biotech industry and now a part of the Roche big pharma conglomerate, came to Cycle Computing to have it fire up 80,000 hours of computing time on a 10,000 virtual core HPC cluster on Amazon's EC2 compute cloud to run one of its protein analysis jobs.
Genentech didn't want to use the Amazon EC2 Virtual Private Cloud launched in July 2009, which takes virtualized servers and corrals them into their own virtual private network, or the specialized dedicated HPC VPC instances with 10 Gigabit Ethernet networking, which debuted last August. The VPC options are pricier than raw EC2 capacity, and Genentech wanted to have Cycle Computing deal with the setup, monitoring, and breakdown of the virtual cluster.
This is something that Cycle Computing now does for a living. The HPC startup was founded in 2005 by Jason Stowe, the company's chief executive officer, to provide services to HPC shops deploying the open source "Condor" grid management system developed at the University of Wisconsin. (Condor is now the key grid software used by Red Hat in its Enterprise Linux 6 distribution.) Before setting up his company, Stowe worked for Walt Disney Studios, where he helped manage movie production, so he was well aware of the need for computing capacity in film production.
Cycle Computing has created two tools to manage virtual HPC clusters. The first is CycleServer, a management and encryption layer that rides atop Condor, which Cycle Computing delivered in 2007 to help simplify its own life. The second is CycleCloud, a domain layer that rides atop the EC2 compute cloud at Amazon as well as the Rackspace Cloud from Rackspace Hosting and any public cloud that runs VMware's vSphere and vCloud combo. CycleServer monitors the jobs and tells Condor how to move workloads around the grid of virtual servers, while CycleCloud provisions the images onto EC2 and other public clouds.
Stowe tells El Reg that during December last year, Cycle Computing set up increasingly large clusters on behalf of customers to start testing the limits. First, it did a 2,000-core cluster in early December, and then a 4,096-core cluster in late December. The 10,000-core cluster that Cycle Computing set up and ran for eight hours on behalf of Genentech would have ranked at 114 on the Top 500 computing list from last November (the most current ranking), so it was not exactly a toy even if the cluster was ephemeral.
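For those keeping score, the 80,000 hours of computing time cited for the Genentech job squares with the cluster size and run time: 10,000 virtual cores running for eight hours. A minimal sketch of that arithmetic follows; the function name `core_hours` is purely illustrative and is not part of any Cycle Computing or Amazon tool.

```python
def core_hours(cores: int, wall_clock_hours: int) -> int:
    """Return total core-hours consumed by a cluster run.

    Illustrative helper (hypothetical name): core-hours are simply the
    number of cores multiplied by the wall-clock duration of the run.
    """
    return cores * wall_clock_hours


# Figures taken from the article: 10,000 virtual cores for eight hours.
genentech_run = core_hours(10_000, 8)
print(genentech_run)
```

The same multiplication applies to the smaller test clusters, but the article does not say how long those ran, so no totals are given for them here.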