Supercomputer clusters are getting larger and larger, and that is why Platform Computing has revamped its Load Sharing Facility to version 8 and doubled the capacity of the workload scheduling software for grids and clusters. The updated LSF also supports GPU co-processors as full citizens of the cluster.
With LSF 7, Platform Computing could manage a cluster that had 24,000 cores and on the order of 100,000 pending jobs, according to Ken Hertzler, vice president of product management at the grid computing pioneer. With LSF 8, which will start shipping in January 2011, a single instance of the cluster management tool will be able to span a cluster of 48,000 cores and 200,000 pending jobs. And if you need to span larger clusters, you can gang up multiple LSF 8 instances to control grids that have 100,000 cores and up to 1.5 million pending jobs.
This may seem like plenty of scalability, but Hertzler says that Platform Computing already has a couple of accounts that have clusters that range from 50,000 to 70,000 cores, so the doubling up of cluster scalability for LSF is not just a matter of providing lots of headroom to most customers. With core counts on the rise in x64 processors from Intel and Advanced Micro Devices to the tune of 30 per cent or so in the coming year and companies simultaneously adding more nodes to clusters, Platform Computing has to broaden its core and pending job counts. In fact, it won't be long before Platform Computing has to jack up the core counts some more.
LSF 8 is more than a tweaked version of the code with twice the cluster scalability, and Hertzler says it is the first major release of the product since LSF 7 shipped four years ago. And now it speaks GPU as well as CPU.
Platform Computing's entry and midrange cluster management tool, Platform HPC 2.1, was announced last week ahead of the SC10 supercomputing conference, and it was the first program from the company that could directly schedule jobs on GPU co-processors. Now the full-on LSF scheduler, which is the flagship product from Platform Computing, has this capability. With the GPU support in LSF 8, jobs can be dispatched directly to GPUs, and the scheduler has the smarts to see utilization and thermals for the GPUs so it can distribute workloads to avoid creating hot spots in the cluster.
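To make the idea of utilization- and thermal-aware placement concrete, here is a minimal Python sketch. It is not LSF code and does not use any LSF interface: the `Gpu` record, the `pick_gpu` function, and the 85-degree and 90 per cent thresholds are all hypothetical, chosen only to illustrate how a scheduler might steer work away from hot or busy GPUs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gpu:
    """Hypothetical snapshot of one GPU as a scheduler might see it."""
    node: str
    index: int
    utilization: float    # fraction busy, 0.0 to 1.0
    temperature_c: float  # degrees Celsius


def pick_gpu(gpus: List[Gpu],
             max_temp_c: float = 85.0,   # assumed thermal ceiling
             max_util: float = 0.9       # assumed utilization ceiling
             ) -> Optional[Gpu]:
    """Return the least-loaded eligible GPU, breaking ties by temperature.

    GPUs at or above either ceiling are skipped entirely, so new work
    is never piled onto a device that is already hot or saturated.
    """
    eligible = [g for g in gpus
                if g.temperature_c < max_temp_c and g.utilization < max_util]
    if not eligible:
        return None  # nothing suitable; the job would stay pending
    return min(eligible, key=lambda g: (g.utilization, g.temperature_c))


gpus = [
    Gpu("node01", 0, utilization=0.95, temperature_c=88.0),  # hot and busy
    Gpu("node01", 1, utilization=0.20, temperature_c=71.0),
    Gpu("node02", 0, utilization=0.20, temperature_c=64.0),  # equally idle, cooler
]
chosen = pick_gpu(gpus)
print(chosen.node, chosen.index)
```

In this toy example the two idle GPUs are equally loaded, so the cooler one wins. A real scheduler would weigh many more factors, such as memory, job affinity, and queue policy.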
Whether you use only CPUs or a mix of CPUs and GPUs in your workloads (you can't actually run an operating system and applications directly on a GPU - yet), LSF 8 has a number of performance and scalability enhancements that can help boost the utilization on your clusters. An important new feature is called guaranteed resources, which is designed to make sure jobs get the resources they need to meet the service level agreements that people require when they submit jobs. Because resources could not be guaranteed in prior releases, cluster administrators often had to carve their clusters up into silos. Higher priority jobs locked up resources that often just sat there, waiting for a job to start, while lower priority jobs did not finish as quickly as they might have with short-term access to those siloed resources.
With guaranteed resources, which are driven by SLAs set by cluster administrators, the scheduler finds the best way to meet the SLAs without partitioning up the cluster. The scheduler also now has pre-emptive and fair-share scheduling policies, which allow LSF to pre-empt jobs and steal resources temporarily from one job to help another meet its SLA, while still allowing the first job to meet its own SLA. Basically, the software lets a bunch of small jobs say: "Hold on a minute until I finish and then you can have a lot more CPUs, big job."
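The difference between a silo and a guarantee can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Platform Computing's algorithm: the `Pool` class, its method names, and the core counts are hypothetical. The point it illustrates is that guaranteed cores can be lent to best-effort work while idle, then reclaimed by pre-emption when an SLA job shows up.

```python
class Pool:
    """Toy cluster pool with a core guarantee for SLA jobs (hypothetical)."""

    def __init__(self, total_cores: int, guaranteed: int):
        assert 0 <= guaranteed <= total_cores
        self.total = total_cores
        self.guaranteed = guaranteed  # cores promised to SLA jobs
        self.sla_used = 0             # cores currently running SLA jobs
        self.other_used = 0           # cores currently running best-effort jobs

    def free(self) -> int:
        return self.total - self.sla_used - self.other_used

    def submit_other(self, cores: int) -> int:
        """Best-effort job: takes any free cores, including idle guaranteed ones."""
        granted = min(cores, self.free())
        self.other_used += granted
        return granted

    def submit_sla(self, cores: int):
        """SLA job: entitled to whatever remains of the guarantee.

        If too few cores are free, best-effort work is pre-empted to
        cover the shortfall. Returns (cores_granted, cores_preempted).
        """
        entitled = min(cores, self.guaranteed - self.sla_used)
        shortfall = max(0, entitled - self.free())
        self.other_used -= shortfall  # pre-empt borrowers
        self.sla_used += entitled
        return entitled, shortfall


pool = Pool(total_cores=100, guaranteed=40)
borrowed = pool.submit_other(90)         # best-effort work spills into idle guaranteed cores
granted, preempted = pool.submit_sla(40)  # SLA job arrives and reclaims its share
print(borrowed, granted, preempted, pool.free())
```

With a hard silo, the best-effort job in this example could never have used more than 60 cores. Here it runs on 90 until the SLA job arrives, at which point 30 cores are pre-empted and handed back.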
The performance improvements moving from LSF 7 to LSF 8 on a given cluster will vary by jobs and system configuration, and there won't be much of an improvement if customers are already up near 100 per cent utilization. But Hertzler says that customers who currently get 60 to 70 per cent utilization on clusters running a large number of mixed workloads might be able to squeeze another 10 to 20 per cent utilization out of them (and therefore get the same work done in a shorter period of time), and that is a significant improvement.
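As a back-of-the-envelope check on what that means for turnaround time, the short Python sketch below assumes a fixed amount of work, reads "10 to 20 per cent" as percentage points of utilization, and assumes runtime scales inversely with utilization. None of those assumptions comes from Platform Computing, and the `relative_runtime` helper is made up for illustration.

```python
def relative_runtime(util_before: float, util_after: float) -> float:
    """Fraction of the original wall-clock time needed for the same work,
    assuming runtime is inversely proportional to cluster utilization."""
    return util_before / util_after


# Hypothetical starting point of 60 per cent utilization.
for util_after in (0.70, 0.80):
    ratio = relative_runtime(0.60, util_after)
    print(f"60% -> {util_after:.0%}: {ratio:.2f}x the original time")
```

Under those assumptions, moving from 60 to 70 per cent utilization cuts turnaround to roughly 86 per cent of what it was, and moving to 80 per cent cuts it to 75 per cent.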
LSF 8 also has a new administrative rights delegation feature, which gets the cluster administrator out of the politics of who gets to use what cluster when. Now, supercomputer center or business line managers who have access to the cluster can add and remove users from the list of people who have access to the cluster to submit jobs and determine the service level they want for specific jobs. The LSF administrator then gets back to the job of managing the cluster, not answering cranky phone calls from people who all think they deserve special treatment.
LSF 8 can dispatch work to clusters running various Linuxes, Unixes, and Windows operating systems as well as Mac OS X; you can see a full list of the supported platforms here. LSF 8 has the same price as LSF 7, and customers on a support contract with Platform Computing can upgrade at no charge. While Platform Computing provides pricing for its HPC 2.1 stack, it does not reveal its prices for the LSF tool, except to say it charges on a per-core basis with site-wide (and presumably volume discounted) licenses available. ®