DoE watchdog warns of poor maintenance at home of Frontier exascale system
Report says new QA plan currently being worked up
The US Department of Energy's watchdog claims that operations and maintenance are being poorly managed at Oak Ridge National Laboratory’s datacenter, home to advanced computers such as the world’s first exascale system, Frontier.
The DoE’s Office of Inspector General (OIG) received an allegation in September 2022 regarding maintenance and calibration in datacenters at the Oak Ridge site in Tennessee, which undertakes science projects relating to nuclear power and national security.
According to the report [PDF], filed yesterday, the allegation claimed that the calibration program at the site was inadequate, and that pressure relief valves (PRVs) within the datacenters received poor maintenance or none at all. The OIG said it conducted an inspection from January 2023 through September 2023, and was able to "substantiate" the allegation.
Specifically, the watchdog said it found the calibration program was inadequate to meet quality assurance requirements, and that standards-based management system procedures were not always followed when maintaining PRVs.
Failure to test or inspect PRVs properly could cause the system to exceed allowable pressure limits, potentially resulting in "events that may harm personnel and equipment," the OIG stated. If the infrastructure is not properly maintained, it could also affect the availability of the computational resources and thus the site’s mission goals.
Oak Ridge National Laboratory is managed and operated by UT-Battelle, LLC, a not-for-profit organization established in 2000 for the sole purpose of managing the Oak Ridge site for the DoE. It is a limited liability company formed as a partnership between the University of Tennessee and Battelle Memorial Institute, itself a non-profit science and technology outfit.
We asked UT-Battelle for a response to this report, but the organization was not immediately available to give an answer.
The report refers to datacenters relating to buildings 5300, 5600, and 5800 at the Oak Ridge site. These are home to the Multiprogram Research Facility, the Computational Sciences Building and the Engineering Technology Facility.
The Computational Sciences Building houses the Oak Ridge Leadership Computing Facility (OLCF), which operates the Frontier supercomputer.
The OIG report said it found UT-Battelle's calibration program to be inadequate because the organization was "unable to provide sufficient documentation that demonstrated calibration had been performed in accordance with applicable criteria."
A UT-Battelle manager informed the watchdog that routine calibration is not necessary, the report added. This is because each piece of equipment is calibrated at installation, and the datacenter systems are then continuously monitored by a subcontractor using a software system that notifies them of any subsequent issues.
However, the OIG said that while this is allowed, all software, regardless of safety significance, must be controlled by a quality assurance program, and the quality assurance program must describe how the requirements are met.
Because ORNL was unable to provide documentation describing how these requirements are met, the OIG report said that UT-Battelle does not know whether the software is providing accurate information.
In the case of the PRVs, the report stated that UT-Battelle did not always maintain and/or test the three types of datacenter PRVs in accordance with applicable guidance.
The OIG found that all three listed air-type PRVs had not always been tested within the required timeframes, while 22 of the 54 refrigerant-type PRVs had not been tested and 12 of 27 water-type PRVs on the list had not been tested and/or inspected, as required.
UT-Battelle said in the report that some PRV testing did not occur because it was overlooked, while for refrigerant PRVs, testing was performed based on the manufacturer's recommendations rather than ORNL's procedure.
UT-Battelle said that it is currently revising its procedure to reflect the manufacturer's recommendation for refrigerant PRVs, and that it has begun taking action to ensure full compliance with its procedure.
However, the OIG report noted that UT-Battelle carried out an assessment in 2020 that identified similar issues. The recommendations resulting from that assessment show signs of progress, it said, but illustrate the need for further improvement in this area.
The report said that UT-Battelle management fully concurred with the recommendations, and it has agreed to develop a quality assurance plan for the monitoring software and ensure that datacenter PRVs are properly identified and comply with current procedures and requirements. ®