IBM uncloaks 20 petaflops BlueGene/Q super

Lilliputian cores give Brobdingnagian oomph

SC10 Although everybody seems to be excited about GPU-goosed supercomputing these days, Big Blue is sticking to its Power-based, many-cored BlueGene and Blue Waters massively parallel supers, and revving them up to bust into the 20-petaflops zone.

The Blue Waters massively parallel Power7-based supercomputer, with its funky switching and interconnect and its very dense packaging, was the big iron of last year's SC09 event in Portland, Oregon, which El Reg told you all about here. And we've covered the GPU additions to the iDataPlex bladish-rackish custom servers IBM builds, as well as the forthcoming GPU expansion blade for Big Blue's BladeCenter blade servers, which is due in December and which is also a special-bid product.

But the BlueGene/Q super — made of fleets of embedded PowerPC processor cores — is still, in terms of aggregate number-crunching power, the biggest and baddest HPC box on the horizon from IBM for the next two years.

IBM lip-smackingly announced the sale of the "Sequoia" BlueGene/Q supercomputer to the US Department of Energy back in February 2009, just as the current BlueGene/P machines were ramping up production. But the company did not provide many details about the architecture, except that it would pack 1.6 million cores into a single system, would have 1.6PB of main memory, a peak performance of 20 petaflops, and burn 6.6 megawatts of juice. The machine will be installed at Lawrence Livermore National Laboratory, which bought the first experimental BlueGene/L super.

This week IBM yanked a compute node and an I/O node out of the prototype portion of the future BlueGene/Q super that's installed at its Watson Research Center in New York and showcased them at the SC10 supercomputing show, the first outing of the BlueGene/Q system components.

To understand BlueGene/Q, you have to compare it to the prior BlueGene machines and their predecessors to see how far the design has come and why IBM still believes that the BlueGene approach — small cores, and lots of them — provides the best bang for the watt.

The original BlueGene/L machine was based on some early parallel-computing design work done in the early 1990s by IBM in conjunction with Columbia University, Brookhaven National Laboratory, and RIKEN (the big Japanese government-sponsored super lab) to make a massively parallel machine called QCDSP to do quantum chromodynamics calculations using digital signal processors.

A follow-on machine called QCDOC replaced the DSPs with embedded PowerPC processors, putting 64 compute nodes on a single board that interconnected with a proprietary backplane.

In December 1999, IBM ponied up $100m of its own dough to create the original BlueGene/L machine, aiming the box at massive protein-folding problems. Two years later, LLNL saw that such a machine could be used for nuclear weapons simulations and placed the first order for the prototype.

By the fall of 2004, a prototype of the BlueGene/L machine became the fastest supercomputer in the world, using eight BlueGene/L cabinets, each with 1,024 compute nodes, for a sustained performance of 36 teraflops. That machine has been upgraded many times and has since reached its full system configuration, which includes 65,536 compute nodes and 1,024 I/O nodes (both based on 32-bit PowerPC processors).

BlueGene/L held the top spot on the Top 500 ranking of supercomputers, which is based on the Linpack Fortran benchmark test, for four years. The machine is based on 32-bit PowerPC 440 cores that spin at 700MHz and are packed two to a die with a shared L2 and L3 cache. Each core has two floating-point units, and the chip also carries memory controllers, Gigabit Ethernet interfaces, and the proprietary 3D torus interconnect (derived from the Columbia University machines), over which the Message Passing Interface (MPI) clustering protocol lashes the nodes together like oxen pulling a cart.

The BlueGene/L machine at LLNL, which was first installed in 2005 and which has been upgraded a number of times, has 131,072 cores, 32TB of aggregate main memory, a peak performance of 367 teraflops, a sustained performance of 280.6 teraflops on the Linpack test, and burns around 1.2 megawatts. The machine is air-cooled.
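Those LLNL figures are easier to weigh as ratios. Here is a back-of-the-envelope sketch in Python that uses only the numbers quoted above. The variable names are ours, and so is the assumption that a terabyte here means 1,024 x 1,024 megabytes.

```python
# Figures quoted above for the LLNL BlueGene/L (taken at face value).
cores = 131_072
peak_tflops = 367.0
linpack_tflops = 280.6
power_mw = 1.2
memory_tb = 32

# Fraction of peak performance actually delivered on Linpack.
linpack_efficiency = linpack_tflops / peak_tflops

# Sustained megaflops per watt: teraflops -> megaflops, megawatts -> watts.
mflops_per_watt = (linpack_tflops * 1e6) / (power_mw * 1e6)

# Memory per core. ASSUMPTION: binary units, 1TB = 1,048,576MB.
mb_per_core = memory_tb * 1024 * 1024 / cores

print(f"Linpack efficiency: {linpack_efficiency:.1%}")
print(f"Sustained megaflops per watt: {mflops_per_watt:.0f}")
print(f"Memory per core: {mb_per_core:.0f}MB")
```

On these numbers the machine delivers a bit more than three quarters of its peak on Linpack, at roughly 234 sustained megaflops per watt, with 256MB of memory behind each core.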

IBM's current massively parallel box is the BlueGene/P, which puts four 850MHz PowerPC 450 cores on a chip along with the memory controllers, floating-point unit, and BlueGene interconnect, as well as a beefed-up 10 Gigabit Ethernet controller and the old Gigabit Ethernet port. Those PowerPC 450 cores are still 32-bit units, by the way.

Each BlueGene/P node can support 2GB of main memory (512MB for each core), and the 3D torus has 5.1GB/sec of bandwidth and somewhere between 160 nanoseconds and 1.3 microseconds of MPI point-to-point latency between its nearest peers in a single node. That's a factor of 2.4 more bandwidth and about 20 per cent lower latency than BlueGene/L.

The BlueGene/P collective network that brings the nodes together has 1.7GB/sec of bandwidth per port (2.4 times that of the BlueGene/L machine) and there are three ports per node that have a 2.5 microsecond latency talking to other nodes. In a worst-case scenario, where a node has to make 68 hops across 72 racks in the 3D torus to reach another node to get data, the latency is 5 microseconds, a big improvement over BlueGene/L, which took 7 microseconds to make the same hops.

An optical 10 Gigabit Ethernet network links the BlueGene/P nodes to the outside world and there is a Gigabit Ethernet network for controlling the system. The BlueGene/P system puts 1,024 compute nodes in a rack and from 8 to 64 I/O nodes (which plug into the same physical boards as the compute nodes) per rack. The machine delivers 13.9 teraflops per rack and can scale up to 256 racks, for 3.56 petaflops of peak (not sustained) number-crunching performance across more than 1 million cores.
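The per-rack and full-system figures can be rebuilt from the node count and clock speed. A minimal sketch, with one loud assumption: the article does not say how many floating-point operations a core retires per clock, so the value of 4 below is our guess, chosen because it reproduces the quoted 13.9 teraflops per rack.

```python
# BlueGene/P figures quoted in this article.
nodes_per_rack = 1_024
cores_per_node = 4
clock_hz = 850e6
max_racks = 256

# ASSUMPTION (not stated in the article): 4 flops per core per clock.
flops_per_cycle = 4

# Peak teraflops for one rack.
rack_tflops = nodes_per_rack * cores_per_node * clock_hz * flops_per_cycle / 1e12

# Peak petaflops and core count for a maxed-out 256-rack system.
system_pflops = rack_tflops * max_racks / 1e3
total_cores = nodes_per_rack * cores_per_node * max_racks

print(f"Peak per rack: {rack_tflops:.1f} teraflops")
print(f"Peak at {max_racks} racks: {system_pflops:.2f} petaflops")
print(f"Total cores: {total_cores:,}")
```

Under that assumption the arithmetic lands on 13.9 teraflops per rack, 3.56 petaflops at 256 racks, and 1,048,576 cores, which squares with the "more than 1 million" claim.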

The BlueGene/P nodes, like their BlueGene/L predecessors, were air-cooled and put compute and I/O nodes on the same node boards. The BlueGene/P machines crammed twice as many cores onto a chip module (four cores instead of two) and twice as many compute nodes (32 instead of 16) onto a single compute drawer, basically quadrupling the cores and nearly quintupling floating-point performance.
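The "quadrupling" and "nearly quintupling" claims check out on a napkin. The sketch below uses the per-drawer node counts above and the clock speeds quoted earlier (700MHz for BlueGene/L, 850MHz for BlueGene/P). It assumes, and this is our assumption rather than anything IBM said, that each core does the same amount of floating-point work per clock in both generations.

```python
# Per-drawer figures and clock speeds quoted in this article.
bgl_nodes, bgl_cores_per_node, bgl_mhz = 16, 2, 700
bgp_nodes, bgp_cores_per_node, bgp_mhz = 32, 4, 850

# Cores per compute drawer in each generation.
bgl_cores = bgl_nodes * bgl_cores_per_node
bgp_cores = bgp_nodes * bgp_cores_per_node
core_ratio = bgp_cores / bgl_cores

# ASSUMPTION: flops per clock per core is unchanged between generations,
# so peak flops scale with (core count) x (clock speed).
flops_ratio = core_ratio * bgp_mhz / bgl_mhz

print(f"Cores per drawer: {bgl_cores} -> {bgp_cores} ({core_ratio:.0f}x)")
print(f"Peak flops per drawer: {flops_ratio:.2f}x")
```

That works out to four times the cores and about 4.86 times the peak floating-point performance per drawer.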

The power drain on BlueGene/P also went up by a factor of 1.5, with a petaflops of peak oomph burning about 2.9 megawatts. But the performance per watt increased by 9 per cent, so it was a net gain on all fronts: performance and energy efficiency.

With the BlueGene/Q designs, IBM is doing a number of different things to boost the performance and energy efficiency of the massively parallel supers. First, the BlueGene/Q processors (called BGQ for short at IBM) bear some resemblance to IBM's Power7 chip used in its commercial servers, and an even stronger resemblance to the Power A2 "wire-speed" processors, which El Reg discussed in detail earlier this year when they were announced.

Like these two commercial chips, the BlueGene/Q processor is a 64-bit chip with four threads per core. The BlueGene/Q processor module is a bit funky in that it has 17 cores on it, according to Brian Smith, a software engineer for the product who was demonstrating the compute and I/O modules at the SC10 expo. On that BGQ processor, one of the cores will run a Linux kernel and the other 16 are used for calculations, according to Smith.

The cores used in the BlueGene/Q prototype run at 1.6GHz, compared to the 2.3GHz speed on the sixteen-core Power A2 wire-speed processor. (The cores could be the same or very similar on both chips.) With the BlueGene/Q super, not only is the BGQ chip moving to 64 bits, but it also has four threads per core to increase its efficiency.
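IBM has not said how much floating-point work each BGQ core does per clock, but the Sequoia headline numbers quoted earlier let you back into a rough figure. Treat this strictly as a sketch: it assumes the production machine runs at the prototype's 1.6GHz, that the rounded 1.6 million cores are all doing calculations, and that 20 petaflops and 6.6 megawatts are exact. None of that is confirmed.

```python
# Sequoia headline figures quoted earlier in this article (rounded).
peak_flops = 20e15       # 20 petaflops peak
total_cores = 1.6e6      # 1.6 million cores
power_watts = 6.6e6      # 6.6 megawatts

# ASSUMPTION: production parts run at the prototype's clock speed.
clock_hz = 1.6e9

# Implied peak per core, and flops retired per core per clock.
gflops_per_core = peak_flops / total_cores / 1e9
implied_flops_per_cycle = peak_flops / total_cores / clock_hz

# Implied peak energy efficiency.
gflops_per_watt = peak_flops / power_watts / 1e9

print(f"Peak per core: {gflops_per_core:.1f} gigaflops")
print(f"Implied flops per core per clock: {implied_flops_per_cycle:.2f}")
print(f"Peak gigaflops per watt: {gflops_per_watt:.2f}")
```

The rounded inputs imply about 12.5 gigaflops per core, a little under eight flops per core per clock, and around 3 peak gigaflops per watt. Those are inferences from rounded figures, not specifications.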
