NOAA picks IBM for supercomputer storm chasing
Power out, Xeon in
Updated The server executives at Big Blue are probably breathing a bit easier now that the company managed to survive a competitive bidding process on a monster supercomputer contract at the National Oceanic and Atmospheric Administration, the climate modeling arm of the US Department of Commerce. As a result, IBM has the potential to reap $502m in rewards building and supporting supercomputers for NOAA through 2021, if all the contract provisions are extended and activated.
Among its many tasks, NOAA does the weather modeling for the National Weather Service and also, through its National Centers for Environmental Prediction, does longer-term climate modeling that could affect the US economy. Hence, the affiliation with the Department of Commerce.
NOAA has been shopping around an upgrade for its computer systems for NWS and NCEP since it kicked out a draft RFP back in November 2010. The bidding process went through many amendments last year. IBM said in a statement that the bidding process was competitive, but neither NOAA nor IBM divulged who was in on the deal.
IBM won a contract to install mirrored computers at NCEP's primary facility in Gaithersburg, Maryland and a backup data center in Fairmont, West Virginia, in 2002. The current machines are AIX clusters based on IBM's water-cooled, Power6-based Power 575 servers; they were last upgraded in late 2008 and went live in August 2009.
Each cluster is rated at a mere 73.1 teraflops peak performance on the Linpack Fortran benchmark test, which got the machine a number 37 ranking on the November 2008 Top 500 supercomputer list, but which don't mean squat on the current list, where top-end machines have tens of petaflops of oomph. This pair now ranks 224 and 223 on the November 2011 list.
NOAA's current Stratus supercomputer
Since these Power 575 machines were installed, NOAA has been arguing correctly that given the increasingly extreme weather swings in the States, it needed petaflops-class iron to make better predictions. It also needs to make them more quickly because sometimes, as the tornadoes last week in the Ohio Valley demonstrated, the speed and accuracy of a weather forecast mean the difference between life and death.
The NCEP was a Cray shop through the 1990s, with various Cray Y-MP systems. After IBM was awarded the contract for NCEP supers in 2002, the weather forecasting operation ran on a succession of Power-based servers, starting with a pair of mirrored clusters of pSeries 690 machines using 704 of IBM's dual-core, 1.3GHz Power4 processors. These machines peaked at 3.6 teraflops and ranked 25 and 26 on the list at the time.
The current clusters installed in Maryland and West Virginia each have 156 Power 575 nodes linked by double data rate (DDR) InfiniBand networks. The Power6 processors run at 4.7GHz, and each cluster has a total of 4,992 cores, 18.7TB of main memory, 170TB of disk capacity, and 13PB of tape archiving capacity.
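For a sense of scale per node, the quoted totals can be divided out. This is a minimal Python sketch, assuming decimal units (1TB = 1,000GB) and an even spread of cores and memory across the nodes, neither of which the specs actually state:

```python
# Figures quoted for each Power 575 cluster.
NODES = 156
CORES = 4_992
MEMORY_TB = 18.7

# Assumption: decimal units (1TB = 1,000GB) and an even spread across nodes.
cores_per_node = CORES // NODES
memory_gb_per_core = MEMORY_TB * 1000 / CORES
memory_gb_per_node = MEMORY_TB * 1000 / NODES

print(f"{cores_per_node} cores per node")
print(f"{memory_gb_per_core:.2f}GB of memory per core")
print(f"{memory_gb_per_node:.0f}GB of memory per node")
```

Under those assumptions, that works out to 32 cores per node and roughly 3.75GB of main memory per core.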
The primary machine in Maryland is called "Stratus," and the backup in West Virginia is called "Cirrus," if you like to know the names of supercomputers, as is the convention. They are four times faster than the machines they replaced, but that was many years ago and it is clear that these boxes are indeed long in the tooth.
Two graphs from a presentation by William Lapenta, acting director of environmental modeling for NOAA's NWS and NCEP operations, show the daily workload the machines churn through and make it clear why. Here's the mixed workload of simulations, including production and development jobs, from an average day in September 2011 on the primary cluster:
Now here is Lapenta's projection of the workload mix for May 2012:
As you can see, NOAA is running out of headroom on the systems to run its models and work on new ones at the same time. And hence its desire to get the bid done back in October 2011.
The new Weather and Climate Operational Supercomputing System, or WCOSS in gov-speak, will be a pair of mirrored machines, just as before. The contract called for the new facilities to be built to house the machines and for the vendor (or collection of vendors bidding on the deal) to provide facilities services, refreshes to the systems, and project management.
The contract called for a five-year base period with a three-year option period and a two-year transition period, with a total contract term of no more than ten years and with a total contract value not to exceed $502m over those ten years.
The feeds and speeds of the WCOSS system were not specified last year when the bids started, but NOAA said it wanted a machine that ran either Unix or Linux and that had at least 2GB of main memory per core. The deal also required 99 per cent uptime, given the continuous and timely nature of the data coming into NOAA and the simulations and reports it must do for weather forecasters around the country.
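To put that uptime figure in concrete terms, here is a rough Python sketch of the downtime it allows. It assumes availability is measured over a non-leap calendar year; the measurement window is our assumption, not something stated in the contract details, and the function name is purely illustrative.

```python
# Assumption: uptime is measured over a non-leap calendar year.
HOURS_PER_YEAR = 365 * 24


def downtime_budget_hours(uptime_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime allowed at a given uptime percentage."""
    return (1 - uptime_pct / 100) * period_hours


budget = downtime_budget_hours(99.0)
print(f"99 per cent uptime allows about {budget:.1f} hours of downtime a year")
print(f"That is roughly {budget / 24:.2f} days")
```

On that basis, 99 per cent uptime leaves about 87.6 hours, or a little over three and a half days, of downtime a year.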
IBM is not tipping its hand too much on the systems side as far as what the WCOSS system will look like, and in fact its US Federal division handled the announcement. Perhaps because IBM does not want to talk about yet another Power Systems cluster user that switched to x86 iron.
Back in January 2011, the Partnership for Advanced Computing in Europe (PRACE), which pays for supercomputing capacity around the EU, said that it was building a companion for the 1-petaflops "Jugene" BlueGene/P massively parallel machine built by IBM and using its PowerPC chips and installed at Forschungszentrum Juelich (FZJ). But the "SuperMUC" system, which will weigh in at 3 petaflops and which will be installed at the Leibniz-Rechenzentrum (LRZ), will be built from iDataPlex nodes and the new Intel Xeon E5-2600 processors, just announced this week.
The iDataPlex machines are a hybrid somewhere between a blade server and a rack server and have been chosen by a number of supercomputer centers around the world. SuperMUC is supposed to be up and running by the middle of this year.
Similarly, the US National Center for Atmospheric Research (NCAR), which does longer-range weather modeling, is a big former user of Cray iron that switched to IBM Power iron a decade ago. It has a system called "Bluefire" that is very similar to the Stratus system at NOAA. NCAR is building a 1.6 petaflops behemoth called "Yellowstone" that is also built on iDataPlex nodes and that also uses Xeon E5 processors from Intel. That Yellowstone system will have 74,592 cores in 4,662 nodes and cost between $25m and $35m, depending on how it is configured.
Perhaps even more significantly, IBM's Power7-based "Blue Waters" system, a multi-petaflopper that was due to be installed at the National Center for Supercomputing Applications at the University of Illinois, had its plug pulled by IBM last August. The Blue Waters machine will now be built by Cray for $188m using a mix of Opteron 6200 CPUs and Nvidia Tesla GPUs.
While IBM didn't say much about the NOAA WCOSS machine, it did say that it would be based on its latest iDataPlex x86-based servers, its disk storage, and its General Parallel File System. The system will be configured with "hot spares" to meet uptime requirements, but the IBM statement did not say hot spares of what.
Update: IBM was working to get us more details on the WCOSS system as we went to press earlier, and came back with some specs. The machine is based on the new iDataPlex dx360 M4 servers using the Xeon E5-2600 processors. It has 448 nodes and uses a 56Gb/sec FDR InfiniBand network to lash together the nodes to create a machine with 149 teraflops of oomph. The nodes are running Red Hat Enterprise Linux. ®
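IBM did not give core counts or clock speeds, but the 149 teraflops figure can be sanity-checked with a back-of-the-envelope peak calculation. In this Python sketch only the node count comes from IBM; the socket count, cores per socket, clock speed, and flops per cycle are our assumptions, picked as a plausible Xeon E5-2600 configuration rather than a confirmed one.

```python
NODES = 448            # from IBM's spec
SOCKETS_PER_NODE = 2   # assumption: two-socket nodes
CORES_PER_SOCKET = 8   # assumption: eight-core Xeon E5-2600 parts
CLOCK_GHZ = 2.6        # assumption: clock speed was not stated
FLOPS_PER_CYCLE = 8    # assumption: double-precision AVX throughput per core

cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
# Peak = cores x clock (GHz) x flops per cycle gives gigaflops; /1000 for teraflops.
peak_tflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000

print(f"{cores:,} cores, about {peak_tflops:.1f} teraflops peak")
```

With those guesses the sum comes to 7,168 cores and about 149.1 teraflops, which lines up with the number IBM quoted, although other combinations of core count and clock speed could land in the same place.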