Can you compose memory across an HPC cluster? Yes. Yes you can

GigaIO CTO talks up 'solution that has a lot of what CXL offers'

GigaIO and MemVerge are developing a joint solution to enable memory to be composable across a cluster of servers, addressing one of the thorny issues in high performance computing (HPC) where some nodes may not have enough memory for the tasks in hand, while others may have spare capacity.

The combined GigaIO-MemVerge stack that supports composable memory

The joint solution, which combines GigaIO's FabreX interconnect fabric with MemVerge's memory virtualization platform, was discussed at a Dell Technologies HPC Community online event this week, and anticipates the availability of hardware supporting version 2.0 of the CXL interconnect.

GigaIO's FabreX is based on PCIe and described by the firm as a rack-scale composable infrastructure solution. It already allows resources such as CPUs, GPUs, FPGAs, ASICs and storage to be orchestrated across a cluster of nodes. The fabric supports direct memory access, enabling direct CPU to GPU data transfers between nodes, for example.

"We can create a system that goes well beyond what one server can handle," explained Matt Demas, CTO of Global Sales at GigaIO. However, he added that: "Today, everything in the rack – except memory – is composable."

This is where MemVerge comes in with its Memory Machine software, which acts as an abstraction layer between memory and applications. It virtualizes both DRAM and persistent memory (PMEM) such as Intel's Optane to create a pool of software-defined memory with transparent tiering that just looks like plain old memory to applications.

The key part to bear in mind is that Intel's Optane DIMMs are available in higher capacities than standard memory DIMMs, but at half the cost per GB of DDR4. Optane has higher latency than DDR4, but can be useful in providing a larger memory space for applications that require it.

"What is the biggest issue in hardware utilisation?" asked Demas. "It's often down to not having the right amount of memory, either too much or running out."

The combination of FabreX and MemVerge "will allow us to compose memory to some applications, and the application won't know anything about that," he added, meaning that applications will simply see a continuous memory space, even if some of that memory is physically located in another node.
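Neither vendor has published the mechanics, but the idea of a flat address space stitched together from several machines can be sketched in a few lines. Everything here – class names, the `attach`/`resolve` calls, the byte counts – is invented for illustration, not GigaIO's actual API:

```python
class ComposedAddressSpace:
    """Toy model: one flat address space backed by memory from several nodes.

    The application sees addresses 0..size-1; a lookup table records which
    node physically holds each range. Purely illustrative -- not a real API.
    """
    def __init__(self):
        self.segments = []  # (start, end, node), in address order
        self.size = 0

    def attach(self, node, nbytes):
        """Contribute nbytes of a node's memory to the shared pool."""
        self.segments.append((self.size, self.size + nbytes, node))
        self.size += nbytes

    def resolve(self, addr):
        """Find which node physically backs a given flat address."""
        for start, end, node in self.segments:
            if start <= addr < end:
                return node, addr - start
        raise IndexError("address outside composed space")

pool = ComposedAddressSpace()
pool.attach("local", 4096)
pool.attach("node-2", 8192)  # spare DRAM borrowed from another server
print(pool.resolve(6000))    # address 6000 falls in node-2's contribution
```

The application-facing point is the `resolve` step: the program just uses a flat address, and the fabric – not the application – works out that the backing page lives on another server.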

"You can use DRAM or PMEM, but also potentially re-use older servers that might otherwise have been disposed of, but you can utilise their memory," Demas said.

Bernie Wu, veep of business development at MemVerge, said that the company's Memory Machine software "runs above the kernel, slides right in underneath the HPC apps," to provide a transparent memory service, intercepting memory calls and providing virtualized memory.

Memory Machine works its magic by profiling memory calls from applications, so that it can move data around between DRAM and PMEM to deliver the optimal performance, while using PMEM to extend the available memory to an application.
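MemVerge hasn't disclosed Memory Machine's internals, but the tiering behaviour it describes can be sketched as a simple policy: count accesses per page and keep the hottest pages in the fast tier. The class, capacities, and tie-breaking below are all assumptions made for this sketch:

```python
from collections import Counter

class TieredMemory:
    """Toy DRAM/PMEM tiering policy: hot pages live in the fast tier.

    Illustrative only -- the design here is invented for this example
    and is not MemVerge's actual implementation.
    """
    def __init__(self, dram_pages):
        self.dram_pages = dram_pages    # capacity of the fast tier, in pages
        self.access_counts = Counter()  # per-page access frequency profile

    def touch(self, page):
        """Record an application access to a page."""
        self.access_counts[page] += 1

    def tier_of(self, page):
        """The most frequently accessed pages are placed in DRAM."""
        hottest = {p for p, _ in self.access_counts.most_common(self.dram_pages)}
        return "DRAM" if page in hottest else "PMEM"

mem = TieredMemory(dram_pages=2)
for page, hits in [("A", 10), ("B", 8), ("C", 1)]:
    for _ in range(hits):
        mem.touch(page)
# The two hottest pages sit in DRAM; the cold page is demoted to PMEM.
print(mem.tier_of("A"), mem.tier_of("B"), mem.tier_of("C"))
```

The payoff of this kind of profiling is that the slower, cheaper PMEM absorbs the cold pages, so the application gets a larger effective memory space while its hot working set stays at DRAM speed.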

"A lot of HPC apps already use a lot of memory, and we found can greatly improve performance by keeping stuff in persistent memory," Wu said. "Sometimes with PMEM, customers want to cut the costs of DRAM, and in some instances we have found we can get comparable performance to DRAM," he added.

Another feature that Memory Machine already supports is memory checkpoints for distributed HPC applications.

"A lot of apps were not intended to be long running or fault tolerant, but we can provide transparent checkpointing of the entire state of the app for restarting. We can also move [an application] to another node if rebalancing is needed," Wu said.

Demas said that the combination of GigaIO and MemVerge is "creating a solution that has a lot of what CXL offers," and that with CXL 2.0 it will "just get better and give a solution that everybody is asking for."

CXL 2.0 is based on PCIe 5.0, like the existing CXL 1.0, but adds support for a switched fabric that allows a host server to connect to resources on multiple other devices. ®
