What should replace Linpack for ranking supercomputers?
Our Big Iron man has a few ideas following Top500
This strikes me as a software problem, not a hardware problem; and not only that, it is one that has no particular bearing on the relevance of Linpack. Even Heroux and Dongarra basically admit as much in the next paragraph of the HPCG proposal:
"Of course, one of the important research efforts in HPC today is to design applications such that more computations are Type 1 patterns, and we will see progress in the coming years. At the same time, most applications will always have some Type 2 patterns and our benchmarks must reflect this reality. In fact, a system's ability to effectively address Type 2 patterns is an important indicator of system balance."
What this server hack really wants to know is how the initial Oak Ridge application set is performing on Titan, and how much work needs to be done to optimize the code and actually use those GPU accelerators that US taxpayers (like me) have shelled out for.
I want something even more devious than that, too. I want to see the Top500 list completely reorganized, using its historical data in a more useful fashion. I want the list to show a machine and its immediate predecessor (there is always an immediate predecessor, so don't try to tell me there isn't), with both running a key real-world application. So, for these two machines, I want three numbers: peak theoretical flops, sustained Linpack flops, and the relative performance of that key workload (you won't be able to get actual performance data for this, of course).
I want the Top500 table to show the relative performance gains for all three across the two machines. And predecessor machines don't necessarily have to be in the same physical location, either. The "Kraken" super at the University of Tennessee runs code for NOAA, which also has its own machines.
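A before-and-after entry of that sort could be sketched as a simple record. To be clear, the field names and the structure below are my own invention, not anything the Top500 publishes, and the workload figure is a relative score precisely because actual performance data for key applications often cannot be disclosed:

```python
from dataclasses import dataclass

@dataclass
class SystemEntry:
    name: str
    peak_pflops: float       # peak theoretical flops
    linpack_pflops: float    # sustained Linpack flops
    workload_score: float    # key real-world workload, relative units

def deltas(new: SystemEntry, old: SystemEntry) -> dict:
    """Relative gains of a machine over its immediate predecessor."""
    return {
        "peak": new.peak_pflops / old.peak_pflops,
        "linpack": new.linpack_pflops / old.linpack_pflops,
        "workload": new.workload_score / old.workload_score,
    }

# Titan vs the final Jaguar, with a made-up 2x workload score for show:
titan = SystemEntry("Titan", 27.12, 17.59, 2.0)
jaguar = SystemEntry("Jaguar", 2.63, 1.94, 1.0)
print(deltas(titan, jaguar))  # peak ~10.3x, Linpack just over 9x
```

Three columns per machine pair, and the deltas fall out for free.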
So, for instance, it might show that Titan has a peak theoretical performance of 27.12 petaflops across its CPUs and GPUs, and running the Linpack test and loading up both computing elements it was able to deliver 17.59 petaflops of oomph in an 8.21 megawatt power envelope. Importantly, the Titan machine has 560,640 Opteron cores running at 2.2GHz. The "Jaguar" super that predates Titan at Oak Ridge (technically, Jaguar was upgraded to Titan in two steps) was an all-CPU machine with 298,592 cores running at 2.6GHz that had a peak theoretical performance of 2.63 petaflops and a sustained Linpack performance of 1.94 petaflops, all in a 5.14 megawatt power envelope.
So the delta on peak performance moving from the last iteration of Jaguar (it had a processor and interconnect upgrade a little more than a year ago before adding the GPUs to make it a Titan late last year) was a factor of 10.31, but on sustained performance for the Linpack test, the delta between the two machines was only a factor of 9.06. And, by the way, that was only because a lot of the calculations were done on the GPUs, and even then, Titan still only had a computational efficiency across those CPUs and GPUs of 64.9 per cent. It may be that it just cannot be pushed much higher than that. The best machines on the list might have an 85 to 90 per cent computational efficiency.
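Replaying that arithmetic with the figures quoted above is a one-minute sanity check:

```python
# Peak and sustained Linpack numbers as quoted above, in petaflops.
titan_peak, titan_linpack = 27.12, 17.59
jaguar_peak, jaguar_linpack = 2.63, 1.94

peak_delta = titan_peak / jaguar_peak              # factor of ~10.31
sustained_delta = titan_linpack / jaguar_linpack   # factor of just over 9
efficiency = titan_linpack / titan_peak            # ~64.9 per cent

print(f"{peak_delta:.2f}x peak, {sustained_delta:.2f}x sustained, "
      f"{efficiency:.1%} efficient")
```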
But here's the fun bit. If you just look at the CPU portions of the Jaguar and Titan machines, then the aggregate floating point delta moving from Jaguar to Titan was a mere 58.9 per cent. In other words, if a key Oak Ridge application could, in theory, be deployed across nearly twice as many cores and a faster interconnect and make use of them all perfectly efficiently and did not offload any calculations to the GPU, then you get a speedup of about a factor of 1.6. If you offload some calculations to the Tesla GPUs - but only some, as Heroux and Dongarra said was the case - then maybe, and El Reg is guessing wildly here - maybe you could get that delta up to a factor of 2X, 3X, or even 4X.
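The CPU-only delta falls straight out of the core counts and clock speeds quoted above, assuming the same flops per core per clock on both machines (a simplification, since the Opteron generations differ):

```python
# Core counts and clock speeds as quoted above.
titan_cores, titan_ghz = 560_640, 2.2
jaguar_cores, jaguar_ghz = 298_592, 2.6

cpu_delta = (titan_cores * titan_ghz) / (jaguar_cores * jaguar_ghz)
print(f"CPU-only gain: {cpu_delta - 1:.1%}")  # ~58.9 per cent
```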
The point is not to guess, but for Oak Ridge to tell us as part of a Top500 submission. Why are we all guessing? Moreover, why not keep the Top500 ranked by the familiar Linpack, add an HPCG column and one key real-world relative performance metric, and show the before and after machines? More data is better.
There are some other things that a revised supercomputer benchmark test needs as well, since we are mulling it over. Every machine has a list price, and every benchmark run should have one. Every machine has a measured wall power when it is running, and every benchmark run should have one. The Top500 list is fun, but it needs to be more useful. The reality is that supercomputing is constrained by the power envelope and the budget more than by any other factors, and this data has to be brought into the equation. Even estimated street prices are better than nothing, and I am just annoyed enough to gin them up myself and get into a whole lot of trouble.
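Here is a sketch of what those two extra columns buy you: once wall power and even a rough street price are on the table, every entry yields a flops-per-watt and a flops-per-dollar figure. The power and Linpack numbers below are Titan's from above; the $100m price tag is pure invention on my part:

```python
# Hypothetical figures of merit once power and price join the table.
def merit(linpack_pflops: float, megawatts: float, price_musd: float) -> dict:
    gflops = linpack_pflops * 1e6   # petaflops -> gigaflops
    return {
        "gflops_per_watt": gflops / (megawatts * 1e6),
        "gflops_per_dollar": gflops / (price_musd * 1e6),
    }

# Titan's published power and Linpack numbers, with an invented $100m price:
print(merit(17.59, 8.21, 100.0))
```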
Another thing to consider is that the needs of the very high end of the supercomputer space are far different from the needs of the rest of the HPC community.
At the high end, research labs and HPC centers are writing their own code, but smaller HPC installations are running other people's code. The addition of a real-world workload is particularly useful for such customers if it is an off-the-shelf application. What would be truly useful is a set of clusters of varying sizes and technologies running the top ten HPC applications (perhaps one for each industry) with all the same before-and-after configurations and deltas as machines are upgraded in the field. I think what you will find is that a lot of commercial applications can't scale as far as you might hope across ever-increasing cluster sizes, and software developers in the HPC racket would be extremely averse to such odious comparisons. It would be nice to be surprised and see third-party apps scale well, so go ahead, I dare you: surprise me.
The HPC industry needs a complete set of benchmark results that shows a wide variety of components being mixed and matched in different configurations, and the effect each combination has on application performance. We don't need 50 Linpack results on clusters with slightly different configurations; we need results on as many diverse systems as we can find and test. This is how Dongarra originally compiled his Linpack benchmarks, and it was the right way to do it. ®