ISC 2013 The Linpack Fortran benchmark, which has been used to gauge the relative performance of workstations and supercomputers for many decades, is looking a little long in the tooth. So, some of the people who love Linpack and know it best are proposing a new benchmark - with the mind-numbing name of High Performance Conjugate Gradient, or HPCG.
All system benchmark tests run their courses, and the most successful ones usually overstay their welcome and stop doing what they were designed to do: allow for the comparative ranking of systems in terms of performance across different architectures and software stacks in a meaningful way, and - if we are lucky - provide some insight into bang for the buck so that companies can weigh the relative economic merits of systems that can run a particular application.
Tests run on supercomputers mostly do the former, and rarely do the latter. That is a big problem that Michael Heroux, of Sandia National Laboratories, and Jack Dongarra, of both the University of Tennessee and Oak Ridge National Laboratory, tackled in their HPCG benchmark proposal paper, which you can read here (PDF).
Earlier this week, your friendly neighborhood El Reg server hack defended the Linpack test as the Top500 supercomputer rankings for June 2013 were announced at the International Supercomputing Conference in Leipzig, Germany. Linpack is important because it is fun to see the rankings twice a year, and it is an easy enough test to run that people actually do it with some enthusiasm on their machines.
Erich Strohmaier, one of the administrators of the Top500 who hails from Lawrence Berkeley National Laboratory, tells El Reg that there were around 800 submissions of Linpack reports for the June Top500 list, and a bunch of them were tossed out because there was something funky in them. Making it to the Top500 list is a big deal for a lot of supercomputer labs, and every national and district politician around the world loves to see a new machine come online in their sphere of influence for the photo-op to prove they are doing their jobs to protect our future. And moving down in the rankings is also a big deal, because if you are not moving ahead you are falling behind. Jysoo Lee, director of the KISTI supercomputing center in Korea, found this out when Korea's relative rankings slipped in the June list. You get phone calls from people in power who are not pleased.
So Linpack matters and it doesn't matter at the same time. It is a bit like miles per gallon fuel efficiency ratings for cars, or if you live in New York, letter grade ratings for restaurants that the city mandates.
But when it comes to using Linpack as a relative performance metric, there are some issues. If you really want to hurt your head, delve into the HPCG paper put out by Heroux and Dongarra. The basic gist, as this server hack understands it, is that the kind of codes that were initially deployed on parallel clusters fifteen years ago bore a closer relationship to High Performance Linpack, or HPL, than many codes do today. (HPL is the parallel implementation of Linpack; earlier versions could only run on a single machine like a PC at the low end or a federated SMP server with shared memory or a vector processor at the high end.) This is one of the reasons why the University of Illinois has refused to run Linpack on the "Blue Waters" ceepie-geepie built by Cray. (Remember, Linpack is voluntary. It is not a ranking of the Top500 supercomputers as much as it is a ranking of the Top500 supercomputers from organizations that want to brag.)
Here's the central problem as Heroux and Dongarra see it:
"At the same time HPL rankings of computer systems are no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by differential equations, which tend to have much stronger needs for high bandwidth and low latency, and tend to access data using irregular patterns. In fact, we have reached a point where designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system."
There is a lot of deep technical description in the paper, but here is the really simplified version. Linpack solves a dense system of linear equations, work that is dominated by matrix multiplication, and it scales up the size of the arrays to try to choke the machine and force it to reach its peak capacity across all of the computing elements in the cluster. The original Linpack from the dawn of time solved a dense system of 100 linear equations (a 100 x 100 matrix), then it moved to 1000 x 1000 as machines got more powerful. Then the Linpackers took the hood off, let the linear equation count scale, and tweaked the HPL code to run across clusters and take advantage of modern networks and interconnects as well as coprocessors like Intel Xeon Phi and Nvidia Tesla cards.
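For readers who want to see the shape of the work, here is a minimal pure-Python sketch of a dense linear solve timed the way a Linpack-style test would time it. It is not the Linpack or HPL code. The function names (`solve_dense`, `time_solve` and friends) are made up for this illustration, the random test matrix is an assumption, and the operation count keeps only the leading 2/3 n^3 term and ignores the lower-order terms.

```python
import random
import time


def solve_dense(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [a[i][:] + [b[i]] for i in range(n)]
    for k in range(n):
        # Partial pivoting: bring the largest entry in column k to the diagonal.
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[pivot] = m[pivot], m[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def max_residual(a, x, b):
    """Largest absolute entry of a*x - b, as a sanity check on the answer."""
    return max(
        abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
        for row, bi in zip(a, b)
    )


def leading_flop_count(n):
    """Leading-order operation count for a dense solve (assumption: 2/3 n^3 only)."""
    return 2.0 * n ** 3 / 3.0


def time_solve(n, seed=0):
    """Build a random n x n system (illustrative input), solve it, report a rate."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = max(time.perf_counter() - start, 1e-12)
    return {
        "n": n,
        "seconds": elapsed,
        "flops_per_second": leading_flop_count(n) / elapsed,
        "residual": max_residual(a, x, b),
    }


if __name__ == "__main__":
    for size in (100, 200):
        print(time_solve(size))
```

Doubling `n` multiplies the work by roughly eight, which is why letting the equation count scale is such an effective way to saturate a big machine. The inner loop also walks each row in order, which is the regular access pattern the rest of this story keeps coming back to.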
The kind of calculations and data access patterns in the Linpack test are what Heroux and Dongarra refer to as Type 1. The machines are loaded up with lots of floating point processing and the data can be organized in such a way as to be particularly efficient at getting the right data to the right FP unit at the right time most of the time. With Type 2 patterns - which are more reflective of the kinds of differential equations increasingly used in simulations - the data access patterns are less regular and the calculations are finer-grained and recursive. Those shiny new Xeon Phi and Tesla accelerators are really good at Type 1 calculations and not so hot (yet) at Type 2 calculations.
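To make the contrast a little more concrete, here is a small sketch of a conjugate gradient solve on a sparse matrix, the style of computation the proposed benchmark is named after. This is not the HPCG reference code and it makes no attempt to match the problem the paper defines. The 1D Laplacian test matrix, the compressed sparse row layout, the lack of a preconditioner and every function name are assumptions chosen to keep the example short.

```python
import math


def laplacian_1d_csr(n):
    """Tridiagonal (-1, 2, -1) matrix in compressed sparse row (CSR) form."""
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n):
        if i > 0:
            values.append(-1.0)
            col_idx.append(i - 1)
        values.append(2.0)
        col_idx.append(i)
        if i < n - 1:
            values.append(-1.0)
            col_idx.append(i + 1)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product; note the indirect load x[col_idx[k]]."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        total = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Which element of x gets read depends on data, not on the loop index.
            total += values[k] * x[col_idx[k]]
        y[i] = total
    return y


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite operator."""
    x = [0.0] * len(b)
    r = b[:]          # residual b - A*x, with x starting at zero
    p = r[:]          # first search direction
    rs_old = dot(r, r)
    if rs_old == 0.0:
        return x, 0
    b_norm = math.sqrt(rs_old)
    for iteration in range(1, max_iter + 1):
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        # Stop once the residual is small relative to the right-hand side.
        if math.sqrt(rs_new) <= tol * b_norm:
            return x, iteration
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


if __name__ == "__main__":
    vals, cols, ptr = laplacian_1d_csr(50)
    rhs = csr_matvec(vals, cols, ptr, [1.0] * 50)
    solution, iterations = conjugate_gradient(
        lambda v: csr_matvec(vals, cols, ptr, v), rhs
    )
    print(iterations, max(abs(s - 1.0) for s in solution))
```

Each pass through the loop does only a few floating point operations per value fetched from memory, and the fetches go through an index array. That is a long way from the dense, predictable inner loop of a Linpack-style solve, and it is a reasonable mental model for why memory bandwidth and latency matter so much more for this kind of code.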
The upshot, say Heroux and Dongarra, is that designing a system to reach one exaflops on Linpack might result in a machine that is absolutely terrible at running real world applications, thus defeating one of the primary purposes of a benchmark test. Benchmarks are a feedback loop for system designers, and are absolutely necessary. Here is an example of the divergence cited by Heroux and Dongarra on the "Titan" Opteron-Tesla ceepie-geepie down at Oak Ridge National Laboratory, with which they are intimately familiar.
Read this carefully:
"The Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a 16-core, 32 GB AMD Opteron processor and a 6GB Nvidia K20 GPU. Titan was the top-ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan, the Opteron processors played only a supporting role in the result. All floating-point computation and all data were resident on the GPUs. In contrast, real applications, when initially ported to Titan, will typically run solely on the CPUs and selectively off-load computations to the GPU for acceleration."