Aurora exascale system gets 'mini-me' testbed for researchers
We node you want to test it out. (We're here all week.)
Researchers waiting to get their hands on the much-delayed Aurora supercomputer at the US Argonne National Laboratory now have a new toy at their disposal: a mini-Aurora codenamed Sunspot.
Sunspot is a two-rack test and development system equipped with 128 nodes of the same technologies that will power Argonne's Aurora exascale supercomputer. Image by Argonne National Laboratory
Sunspot is a new test and development system that has been built to the exact same architecture as Aurora, the exascale supercomputer currently under construction at the Argonne Leadership Computing Facility (ALCF) in Illinois.
But while Aurora is planned to assimilate more than 10,000 nodes once fully completed, Sunspot can slip into just two datacenter racks with its 128 nodes.
Like Aurora, each node is configured with two Intel Xeon CPU Max (Sapphire Rapids) processors and six Intel Data Center Max (Ponte Vecchio) GPU accelerators, with HPE's Slingshot interconnect (Cray technology) linking everything together.
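For a rough sense of scale, the per-node figures above can be multiplied out. This is a back-of-the-envelope sketch only: the function name is ours, and the Aurora numbers are a floor derived from the "more than 10,000 nodes" figure, not an official component count.

```python
# Per-node configuration as described in the article.
CPUS_PER_NODE = 2   # Intel Xeon CPU Max (Sapphire Rapids)
GPUS_PER_NODE = 6   # Intel Data Center Max (Ponte Vecchio)


def component_totals(nodes: int) -> dict:
    """Return CPU and GPU counts for a system with the given node count."""
    return {
        "nodes": nodes,
        "cpus": nodes * CPUS_PER_NODE,
        "gpus": nodes * GPUS_PER_NODE,
    }


sunspot = component_totals(128)
# Assumption: 10,000 is used as a lower bound, since the article only
# says Aurora will have "more than 10,000 nodes".
aurora_floor = component_totals(10_000)

print("Sunspot:", sunspot)
print("Aurora (at least):", aurora_floor)
```

On those numbers, Sunspot carries 256 CPUs and 768 GPUs, against at least 20,000 CPUs and 60,000 GPUs for the full machine.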
"Sunspot is basically a miniature version of Aurora," said ALCF's project director for Aurora, Susan Coghlan.
The idea is that it gives the research teams a facility that they can use to optimize code performance using the actual Aurora hardware while they are still waiting for the real thing.
Aurora was originally scheduled for delivery in 2018 as a system based on the now-discontinued Intel Xeon Phi chips. Then came a new architecture intended to make it the first exascale supercomputer, meaning one capable of performing a billion billion (10^18) floating point calculations per second.
However, this incarnation slipped behind schedule due to delays in Intel getting its Sapphire Rapids Xeon Scalable processors out the door, and the AMD-based Frontier supercomputer at Oak Ridge National Laboratory in Tennessee eventually took the exascale prize.
Sunspot has apparently been on-site at Argonne since December, but prior to it being ready the development teams made use of a series of other testbed systems. These included Iris, Arcticus, and Florentia at Argonne itself and Borealis at Intel's high performance compute (HPC) lab in Oregon.
These systems continue to be useful for Aurora preparations, but it is apparently Sunspot's identical architecture that gives researchers the ideal environment for optimizing application performance for the exascale supercomputer.
"Sunspot is the first time we're seeing how everything is working together," Coghlan said. "We learn a lot from these runs. It gives us a chance to iron out some of the kinks before Aurora is ready for users."
ALCF's Aurora Early Science Program co-manager Tim Williams said that this was important for getting ready to start doing science with a new system from day one of deployment.
"Testbeds like Sunspot allow researchers to carry out performance studies and scale up their workloads to run on much larger supercomputers while those systems are still being built," he explained.
According to Argonne, over 180 researchers from over 20 application development teams from the Early Science Program (ESP) and the US Department of Energy Exascale Computing Project (ECP) have now begun accessing the testbed for scaling and performance optimization research.
The ALCF team said it expects performance improvements in the software code as the teams continue to do multi-node scaling and optimization work on Sunspot and other available computing resources.
As an example, the team is said to be using Sunspot's Intel DAOS (Distributed Asynchronous Object Storage) system to test and enhance I/O performance.
Sunspot is expected to continue to serve a role even after Aurora is declared fully operational, a milestone now scheduled for sometime later this year, Argonne said. Like the ALCF's other test and development systems, Sunspot should remain a useful platform for users to optimize code performance before moving across to Aurora. ®