HPE's Eng Lim Goh on spaceborne computers, NASA medals – and AI at the final frontier
Never mind the edge, try running a super 'puter up there
Interview Though HPE's Spaceborne Computer is still fresh from its jaunt to the International Space Station, veep and CTO for HPC and AI Dr Eng Lim Goh is pondering a return visit and outfitting missions to Mars with the company's kit.
The Register met Dr Goh after he'd been reassuring attendees at the Sibos 2019 event that AIs were worse than children at spotting giraffes without considerable training. UK prime minister Boris Johnson's vision of "pink-eyed terminators" is perhaps a while away.
More imminent, however, is NASA's use of off-the-shelf supercomputers in space. Goh showed off medals received from the US space agency after the successful conclusion of the first Spaceborne mission.
The Exceptional Technology Achievement Medal was awarded "for successfully demonstrating the first commercial supercomputing platforms on the ISS, capable of executing over one trillion calculations per second for a year without requiring a reset".
The goal of the mission was, according to Goh, to show NASA that off-the-shelf hardware could be reliable in space, rather than custom chippery replete with a lengthy gestation period.
To be fair, the ISS is festooned with laptops (and, of course, some ageing Raspberry Pi hardware), but getting a supercomputer into orbit required a change of thinking at the agency, and some challenges for HPE.
"Just before launch we just picked the latest 1U server and plugged it into a locker. The only issue was that a 1U server is quite deep and the Express racking [in the Destiny Lab] is quite shallow... so we turned it around and used two slots."
The desire to keep the hardware as stock as possible meant that the kit needed AC power. "However," said Goh, "the space station uses solar panel DC power. So NASA supplied us with inverters to convert DC to AC so that we can plug ourselves right in.
"So of the four power supplies in the two servers, one of them did fail during the 1.6 years. However, they are all redundant anyway – so it didn't interrupt operations or applications."
Certainly, anyone who has spent time in the company of servers will recognise the foibles of power supplies. Goh told us: "First and foremost, lesson learned, maybe we need a triple-redundant power supply..."
Of course, NASA occasionally had to shut down the power to the racks, which allowed a swift replacement. As it transpired, the system ended up being rebooted four times during its 1.6 years of running on the ISS due to "various reasons on the station".
Running shrinkwrapped Red Hat Linux and software to harden the system against environmental factors such as cosmic radiation (rather than the hefty and expensive physical hardening usually used), the Apollo Spaceborne Computer still suffered its fair share of problems. "Nine of the 20 SSDs failed," remarked Goh, but redundancy ensured the thing kept ticking over. And the lengthy period running on orbit means that lessons can be learned.
Now, back at the factory following a SpaceX splashdown, "it booted up fine, even after the harsh landing". And those SSDs? "We are suspecting that it could be more the controllers because during the four reboots in space, some of the SSDs came back."
Good to know that the old BOFH standby of turning it off and on again can work just as well in orbit.
Of course, the goal was to stop already busy astronauts going anywhere near the device to fix problems. "What we did was develop three circles of software, the outermost circle supervising the second level, and the second level supervising the core and for it to also sense correctable errors, and in the future be able to sense inputs from the station saying there's a storm coming then respond appropriately."
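For the curious, the layered supervision Goh describes can be sketched roughly as follows. This is a toy illustration rather than HPE's software: the `Layer` class, its method names, and the restart-on-failure policy are all assumptions, since the interview only says that each circle supervises the one inside it.

```python
class Layer:
    """One 'circle' of software; all names here are hypothetical."""

    def __init__(self, name, supervised=None):
        self.name = name
        self.supervised = supervised  # the next circle inward, if any
        self.healthy = True
        self.restarts = 0

    def restart(self):
        # Assumed recovery action: the article does not say what a
        # supervising circle actually does when it spots trouble.
        self.healthy = True
        self.restarts += 1

    def check(self):
        """Restart the circle directly inside this one if it is unhealthy,
        then let that circle check the next one in.
        Returns the names of the circles that were restarted."""
        restarted = []
        if self.supervised is not None:
            if not self.supervised.healthy:
                self.supervised.restart()
                restarted.append(self.supervised.name)
            restarted.extend(self.supervised.check())
        return restarted


# Outermost circle watches the second level, which watches the core.
core = Layer("core")
middle = Layer("middle", supervised=core)
outer = Layer("outer", supervised=middle)
```

Calling `outer.check()` walks inward one circle at a time, so nobody in a spacesuit has to.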
The version that ran for 1.6 years on the ISS "had the ability to sense correctable errors". Goh explained the system dealt with these problems, but there was a danger that "correctable errors might accelerate to a point that it hits a threshold and it becomes uncorrectable".
"Which," he understated, "would be a problem for applications." A bad day in space indeed.
"We decided that after a certain threshold of correctable errors, let's be on the conservative side, after the next correction, retire that page. We can't retire a bit, but we can retire the page around that bit. So these are some of the mitigating things trying to keep the system going."
As for what the computer actually did, Goh told us the gang thrashed it with benchmarks including HPCG and Linpack, as well as some of NASA's own. The poor thing was tortured in the CPU, memory and storage departments (aside from those reboots). And the performance decline? "Minimal," according to Goh.
Back to orbit, the Moon and beyond
Goh plans to send another computer to the ISS in the coming years, again pulled from what HPE is selling at the time, but this time the machine won't just be running benchmarking software. "Now we know its limits, we can run typical applications in space."
Unsurprisingly, because it is a heck of a lot more efficient to process data at the source rather than transmit it back to the ground for crunching by earthbound hardware, "NASA has strong interest in running the applications."
And, of course, HPE would also like its computers on NASA's upcoming Moon missions "because that's the penultimate step before Mars," explained Goh. "The station is still in a somewhat protected orbit [from radiation]." As such, seeing how the stock supercomputing hardware performs in deep space is a precursor to more ambitious applications.
And Goh has high hopes that supercomputing hardware could be used on space telescopes and probes, as well as to reduce crew workload. While spacecraft sensors are becoming ever more sensitive, getting the data back to Earth for processing is forever constrained by bandwidth and latency. "It's getting difficult," said Goh, "to keep up with the amount of data as it is coming through the sensors."
While NASA has always shovelled reprogrammable computing power into its probes (the software running on its Voyager probes is quite different from what left the launch pad more than 40 years ago), Goh reckons that the boost from supercomputing hardware could see tools like machine learning pushed out to actual spacecraft. The probes, he said, could then "learn locally... without having to send all the data back to Earth, which would be impractical anyway."
The ultimate edge application indeed. ®