NASA missions are being delayed by oversubscribed, overburdened, and out-of-date supercomputers

Flagship facility has just 48 GPUs

NASA's supercomputing capabilities are not keeping pace with the latest technology developments, and are "oversubscribed and overburdened," causing delays to missions that are sometimes addressed by teams acquiring their own infrastructure.

The above are some of the findings of an assessment [PDF] of the aerospace agency's high-end compute capabilities, conducted by NASA's internal auditor, the Office of Inspector General.

Published on Thursday, the audit opens by declaring "NASA needs a renewed commitment and sustained leadership attention to reinvigorate its [high-end computing] HEC efforts. Without key changes, the Agency's HEC is likely to constrain future mission priorities and goals."

Those changes are needed because NASA's HEC ops – a term the audit uses interchangeably with supercomputing – are managed by its Earth Science Research Program within the Science Mission Directorate, rather than as a central function.

NASA's CIO has some oversight of HEC, but is not directly engaged in HEC activities or governance.

Because the agency's supers are oversubscribed, missions buy their own kit. The audit suggests almost every NASA location – other than Goddard Space Flight Center and Stennis Space Center – has its own independent infrastructure. The Space Launch System team alone spends $250,000 a year rather than waiting for access to existing HEC resources.

Confusion around NASA's cloud capacity and policy is another reason for the purchase of on-prem kit.

"NASA also lacks a comprehensive strategy for when to use HEC assets on the premises versus when to utilize cloud computing options – or a widespread understanding of the cost implications for each choice," the audit states. "Stakeholders told us that while they know NASA has HEC cloud computing options, they were hesitant to use them due to unknown scheduling practices or assumed higher costs."

The disparate fleet of HEC systems deployed across NASA also lacks strong security, the audit found. Some systems aren't regularly monitored – a big problem, because some are accessible by foreign nationals with whom NASA collaborates.

"Security controls are often bypassed or not implemented, increasing the risk of cyber attacks," the report warns.

Another issue the audit points out is that NASA is not keeping pace with modern supercomputing tech.

NASA's Advanced Supercomputing facility, for example, has just 48 GPUs alongside its 18,000 CPUs, "with an even larger disparity observed" at the NASA Center for Climate Simulation (see page 17 of NASA's report – PDF).

"HEC officials raised multiple concerns regarding this observation, stating that the inability to modernize NASA's systems can be attributed to various factors such as supply chain concerns, modern computing language (coding) requirements, and the scarcity of qualified personnel needed to implement the new technologies," according to the report.

The audit therefore makes ten recommendations, the first of which is for senior leadership to reform how supercomputing is administered and implemented at NASA.

The other nine recommendations are actions the auditor thinks should be performed by a "tiger team" dedicated to fixing known problems across NASA's HEC estate. Among the jobs that team needs to tackle are:

  • Identify technology gaps, such as GPU transition and code modernization, essential for meeting current and future needs and strategic technological and scientific requirements;
  • Develop a strategy to improve HEC asset allocations and prioritization for usage, including the appropriate use of on-premises versus cloud resources;
  • Evaluate cyber risks associated with HEC assets to determine oversight and monitoring requirements, establish risk appetite, and address control deficiencies. Consider using NASA's Splunk enterprise platform as a shared resource;
  • Develop an inventory of enterprise-wide HEC assets and formalize procedures for hardware and software life-cycle management.

Sorting out security is another item on the tiger team's to-do list.

NASA management agreed to implement the tiger team and concurred with the recommendation to reform its entire supercomputing management apparatus.

Which is welcome, because the audit document repeatedly observes that the current state of NASA's supercomputing estate hampers its efforts to do science and plan new missions, increases its costs, and increasingly threatens its ability to do all the stuff that Register readers find inspiring. ®
