Handling inference at the edge

One way to minimise AI latency is to run ML models closer to end users and to the data being ingested

Sponsored Post Any organisation accessing AI models hosted in the cloud knows how challenging it can be to ensure that the large volumes of data needed to build and train those workloads are accessed and ingested quickly enough to avoid performance lags.

Chatbots and virtual assistants, map generation, AI tools for software engineers, analytics, defect detection and generative AI applications – these are just some of the use cases that benefit from the real-time performance which eliminates those delays. And the Gcore Inference at the Edge service is designed to give businesses across diverse industries, including IT, retail, gaming and manufacturing, just that.

Latency tends to be exacerbated when datasets distributed across multiple geographical sources have to be collected and processed over the network. It can be particularly problematic when deploying and scaling real-time AI applications in smart cities, TV translation, and autonomous vehicles. Taking those workloads out of a centralised data centre and hosting them at the network edge, closer to where the data actually resides, is one way around the problem.

That's what the Gcore Inference at the Edge solution is specifically designed to do. It distributes customers' pre-trained or custom machine learning models (open source models such as Mistral 7B, Stable Diffusion XL, and LLaMA Pro 8B, for example) to 'edge inference nodes' at over 180 locations on the company's content delivery network (CDN).

These nodes are built on servers running NVIDIA L40S GPUs designed for AI inference workloads, interconnected by Gcore's low-latency smart-routing mechanism to minimise packet delay and better support real-time applications. Edge node servers built on Ampere® Altra® Max CPUs are planned for a later date.

The ML endpoints also feature built-in distributed denial of service (DDoS) protection to help thwart cyberattacks and keep applications up and running in the event of an incident. That's a crucial layer of cyber defence which aids compliance with various data protection rules and regulations, including the GDPR, PCI DSS and ISO/IEC 27001, says the company.

The service works by providing customers with an endpoint they can integrate into their applications; subsequent access requests and queries are directed to the nearest edge node using anycast balancing, whereby clients trying to reach a single IP address are routed to the nearest host advertising it.
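To illustrate the pattern, integrating such an endpoint typically amounts to an HTTPS POST to one hostname, with anycast routing transparently delivering the request to the nearest edge node – the client needs no region awareness. The URL, payload shape and field names below are hypothetical placeholders for the sake of the sketch, not Gcore's actual API.

```python
import json
import urllib.request

# Hypothetical inference endpoint: with anycast, the same hostname resolves
# to the nearest edge node, so the client code stays region-agnostic.
ENDPOINT = "https://inference.example-edge.net/v1/predict"

def build_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Assemble an HTTPS POST carrying an inference payload (illustrative schema)."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_response(raw: bytes) -> str:
    """Extract the generated text from a JSON response (illustrative schema)."""
    return json.loads(raw)["output"]

# Sending the request once built:
#   with urllib.request.urlopen(build_request("Hello")) as resp:
#       print(parse_response(resp.read()))
```

Because routing happens at the network layer, falling back to another node during an outage or demand spike requires no change on the client side.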

That helps keep latencies down to as little as 30 milliseconds on average, says Gcore, with application performance boosted further by the NVIDIA GPU server infrastructure. Customers pay only for the resources their AI models need, saving money on building their own AI-ready infrastructure, while additional compute resources can be quickly scaled up to handle any spikes in demand.

You can find out more about the Gcore Inference at the Edge solution by clicking here.

Sponsored by GCore.
