ScaleMP (finally) glues together 128 Opteron servers
The 8,192 core, 64TB AMD behemoth
Servers are going virtual these days, so maybe it is time for server chipsets and interconnects to do the same.
With Advanced Micro Devices not building any chipsets that go beyond four Opteron processor sockets in a single system image – and no one else interested in doing chipsets, either – there is an opportunity, it would seem, for someone to make big wonking Opteron boxes to compete against RISC and Itanium machines.
Many have tried. Newisys, Liquid Computing, Fabric7, 3Leaf Systems, and NUMAscale all took very serious runs at it, and thus far, four out of five of them have gone the way of all flesh. It is not a coincidence that these companies failed because they required customers to invest in expensive software that turned many Opteron nodes into a big, often virtualized, single system image.
Since its founding in 2003, ScaleMP has tried a different approach. Instead of using special ASICs and interconnection protocols to lash multiple server nodes together into a shared memory system, ScaleMP cooked up a special hypervisor layer, called vSMP, that rides atop the x64 processors, memory controllers, and I/O controllers in multiple server nodes. Rather than carve up a single system image into multiple virtual machines, vSMP takes multiple physical servers and – using InfiniBand as a backplane interconnect – makes them look like a giant virtual SMP server with a shared memory space. vSMP has its limits. It only runs on Linux and doesn't do Windows. And up until today, it was only supported on Intel's Xeon processors, not Opterons.
Better late than never
The DDR3 memory controller etched into the Opteron 6100 processor has a 768GB upper address limit, and for many four-way machines, the way the memory slots work out, 512GB is the practical upper limit. With the impending "Interlagos" Opteron 6200s, due for launch before year's end, AMD will hopefully goose the addressable main memory to at least 2TB, if not more.
But even if it does, those four-socket Opteron 6200 boxes might be a bit pricey, and that's as far as they are going to scale. If you need more memory and more I/O and CPU oomph behind it, you are outta luck. You have to either parallelize your workloads or move to an eight-socket or, if you are lucky and can find one, a sixteen-socket Xeon box. Or you can get vSMP from ScaleMP and use a bunch of smaller and cheaper two-socket boxes (or maybe even single-socket boxes) to create a virtual fat memory system.
The earlier releases of vSMP could scale across 16 nodes and up to 4TB of aggregate main memory (InfiniBand is still preferred to Ethernet as the backplane interconnect), but with vSMP Foundation 3.0, launched in May 2010, the company expanded the underlying hypervisor to support up to 128 nodes and 64TB of memory in a single image.
This version of vSMP is now supported on Opteron-based servers, not just those based on Intel Xeons. ScaleMP is supporting nodes based on either the current 8-core and 12-core "Magny-Cours" Opteron 6100s or the 12-core and 16-core Opteron 6200s. The virtual machine manager at the heart of vSMP can currently scale to 128 nodes. Depending on the cores per chip and the generation you use, you can have from 2,048 to 8,192 cores in a single image.
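For the curious, here is a back-of-the-envelope sketch of how those core counts fall out. The article does not spell out the socket counts behind the range, so the pairing below (two-socket nodes with 8-core chips at the low end, four-socket nodes with 16-core chips at the high end) is an assumption that happens to reproduce the quoted figures. The function names are made up for illustration and are not part of any ScaleMP tooling.

```python
# Hypothetical helpers for sizing a vSMP image; not ScaleMP software.

MAX_NODES = 128          # node limit quoted for vSMP Foundation 3.0
MAX_MEMORY_TB = 64       # aggregate memory limit quoted for the same release


def total_cores(nodes: int, sockets_per_node: int, cores_per_socket: int) -> int:
    """Return the core count of a single system image built from identical nodes."""
    if not 1 <= nodes <= MAX_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_NODES}")
    return nodes * sockets_per_node * cores_per_socket


def memory_per_node_gb(total_tb: int, nodes: int) -> int:
    """Return the average memory per node in GB, assuming 1TB = 1,024GB."""
    return total_tb * 1024 // nodes


# Assumed low end: 128 two-socket nodes with 8-core Opteron 6100s.
low_end = total_cores(MAX_NODES, sockets_per_node=2, cores_per_socket=8)

# Assumed high end: 128 four-socket nodes with 16-core Opteron 6200s.
high_end = total_cores(MAX_NODES, sockets_per_node=4, cores_per_socket=16)

# Average memory each node would need to carry to reach the 64TB ceiling.
per_node_gb = memory_per_node_gb(MAX_MEMORY_TB, MAX_NODES)

print(low_end, high_end, per_node_gb)
```

Under those assumptions, hitting the 64TB ceiling means an average of 512GB per node, which lines up with the practical limit cited above for four-way Opteron 6100 machines.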
For machines that large, you would no doubt need a very fast InfiniBand fabric to make it work well. The limiting factor in the Opteron support is AMD's homegrown chipsets, which launched two years ago ahead of the Opteron 6100s. The SR5690, SR5670, and SR5650 I/O hubs and their companion SP5100 southbridge are all supported by the vSMP hypervisor.
The vSMP hypervisor that glues systems together is not for every workload, but it is aimed at workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit," Fultheim tells El Reg, "but these days we are doing business analytics and virtualization consolidation."
That latter one might crack you up a bit. You put vSMP on a bunch of servers to glue them together, and then you use a hypervisor like VMware's ESXi or Red Hat's KVM to cut it up into virtual slices. The benefit of this way of doing it is that you can build fat VM instances using skinny servers, and some people think the economics makes sense and are giving this idea a whirl.
ScaleMP needs to get Windows supported on vSMP in addition to adding Opteron support, which quite frankly would have been more useful two years ago when the Opteron 6100s came out. You could also make the case that vSMP would be useful on skinnier (and cheaper) Opteron 4100 nodes for certain kinds of workloads, like those that are sensitive to clock speed as well as memory capacity.
Triad memory test scales linearly on vSMP on Opteron servers
ScaleMP will ship vSMP for Opteron servers on October 1. It will be available in the same three flavors that the Xeon version of the hypervisor comes in. vSMP Foundation for Cluster is used to take multiple server images and plunk them on a single server image running one copy of a Linux operating system; you use vSMP and that operating system instead of a cluster manager to run workloads.
You don't aggregate memory in this case. vSMP Foundation for SMP is a slightly different tweak on the vSMP hypervisor that is designed to create a big shared memory space (often asymmetrically, mixing server nodes with skinny and fat physical main memories to get the desired balance of CPU core count, memory capacity, and cost) for applications to run in. And vSMP Foundation for Cloud has user-based pricing and is aimed at public and private clouds that want to aggregate VMs atop a virtual shared memory system for more configuration options than you can get with two-socket or four-socket server nodes by themselves. Most cloud providers using vSMP, says Fultheim, deploy it on only about 20 per cent of their nodes.
Pricing for vSMP depends on the scenario and is based on a percentage of the infrastructure costs customers have as they build clusters and virtual SMPs. For the cluster configuration, ScaleMP is charging about 20 per cent of the cost of the underlying iron, and on the shared memory SMP setups, it works out to about 30 per cent of the cost of the iron. For clouds, where nodes are not always ganged up, ScaleMP is charging 5 to 10 per cent of the infrastructure costs.
"We believe that this reflects the value customers get from the software," says Fultheim. While this may be true, it is probably better to figure out what those percentages work out to on average and just put a price tag on it. ®
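For anyone who wants to put a rough price tag on it themselves, the percentages quoted above can be turned into a quick estimate. This is only a sketch: the rates are the approximate figures from the article, the flavor keys and function name are invented for illustration, and the $1m hardware budget is a hypothetical input, not a ScaleMP list price.

```python
# Hypothetical estimator built on the approximate percentages quoted in the article.

# (low, high) share of infrastructure cost, in per cent, for each vSMP flavor.
RATES_PERCENT = {
    "cluster": (20, 20),  # vSMP Foundation for Cluster: about 20 per cent
    "smp": (30, 30),      # vSMP Foundation for SMP: about 30 per cent
    "cloud": (5, 10),     # vSMP Foundation for Cloud: 5 to 10 per cent
}


def license_cost_range(flavor: str, infrastructure_cost: float) -> tuple[float, float]:
    """Return the (low, high) estimated software cost for a given hardware spend."""
    if flavor not in RATES_PERCENT:
        raise ValueError(f"unknown flavor: {flavor!r}")
    if infrastructure_cost < 0:
        raise ValueError("infrastructure_cost must not be negative")
    low, high = RATES_PERCENT[flavor]
    return (infrastructure_cost * low / 100, infrastructure_cost * high / 100)


# Hypothetical example: $1m worth of underlying iron.
hardware_budget = 1_000_000
for flavor in RATES_PERCENT:
    low, high = license_cost_range(flavor, hardware_budget)
    print(f"{flavor}: ${low:,.0f} to ${high:,.0f}")
```

Because the fee scales with the hardware bill, the estimate moves whenever node prices do, which is exactly why a fixed price tag would be easier to plan around.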