Cloudflare looses AI on the network edge with GPU-accelerated Workers

Isn't that how Skynet took over?

Generative AI models might be trained in massive clusters of GPUs, but Cloudflare argues the obvious place to run them isn't just at the edge but in the network itself.

On Wednesday the delivery giant announced a suite of AI services aimed at abstracting away the complexity of deploying and running large language models (LLMs) and other machine learning (ML) algorithms, while also achieving the lowest possible latency.

Well, actually, the lowest possible latency would be achieved by running the inference workload on the user's device. Intel made a big deal about this at Intel Innovation last week, touting the rise of the AI PC. But while that might make sense in some cases, Cloudflare argues that local devices aren't powerful enough yet.

"This makes the network the goldilocks of inference. Not too far, with sufficient compute power — just right," the biz writes.

Serverless for GPUs

The AI suite comprises three core services. The first is an extension of its serverless Workers platform to support GPU-accelerated workloads. Dubbed Workers AI, the service is designed to streamline the process of deploying pre-trained models.

"No machine learning expertise, no rummaging for GPUs. Just pick one of the provided models and go," Cloudflare claims.

We're told the platform runs atop Nvidia GPUs, though Cloudflare wouldn't tell us which ones. "The technology Cloudflare has built can split an inference task across multiple different GPUs, because we're taking care of the scheduling and the system, and we'll decide what chip or chips make the most sense to deliver that," it told The Register in a statement.

In the interest of simplicity, the platform doesn't — at least not initially — support customer-supplied models. We're told it plans to roll out this functionality in the future but, for now, it's limited to six pre-trained models, which include:

  • Meta's Llama 2 7B Int8 for text generation
  • Meta's M2M100-1.2B for translation
  • OpenAI's Whisper for speech recognition
  • Hugging Face's DistilBERT-sst-2-int8 for text classification
  • Microsoft's ResNet-50 for image classification
  • BAAI's bge-base-en-v1.5 for embeddings

However, Cloudflare says it's working to expand this list in the near future. Like many AI hopefuls, it has solicited the help of Hugging Face to optimize additional models for the service.

It's not clear if there's a limit to the size of models the platform can support, but the initial list does offer some clues. Cloudflare is making Meta's seven-billion parameter Llama 2 LLM available running at Int8, which would require about 7GB of GPU memory. The company also notes that "if you're looking to run hundred-billion parameter versions of models, the centralized cloud is going to be better suited for your workload."
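
For a sense of what "pick one of the provided models and go" looks like in practice, here's a minimal sketch of a Worker calling the Llama 2 model, based on the pattern Cloudflare showed at launch. It assumes the @cloudflare/ai npm package and a Workers AI binding named AI configured in wrangler.toml; treat the names as illustrative rather than gospel.

    import { Ai } from '@cloudflare/ai';

    export default {
      async fetch(request: Request, env: { AI: any }): Promise<Response> {
        const ai = new Ai(env.AI);

        // Run the int8-quantized Llama 2 7B chat model against a prompt
        const output = await ai.run('@cf/meta/llama-2-7b-chat-int8', {
          prompt: 'Explain, briefly, what a CDN does.',
        });

        // Hand the model's JSON output straight back to the caller
        return new Response(JSON.stringify(output), {
          headers: { 'content-type': 'application/json' },
        });
      },
    };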

Once up and running, Cloudflare says customers can integrate the service into their applications using REST APIs or by tying it into their Pages website frontend.
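
The REST route is much the same idea from outside the Workers runtime. A hedged sketch, using the built-in fetch and placeholder credentials; the endpoint path follows the pattern in Cloudflare's docs, but substitute your own account ID, API token, and model name:

    // Placeholder credentials; swap in a real account ID and a token with Workers AI access
    const ACCOUNT_ID = 'your-account-id';
    const API_TOKEN = 'your-api-token';

    const resp = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/meta/llama-2-7b-chat-int8`,
      {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${API_TOKEN}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ prompt: 'Summarize this support ticket in one sentence.' }),
      },
    );

    // The API wraps the model output in Cloudflare's usual result envelope
    const { result } = await resp.json();
    console.log(result);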

Putting it all together

Because Workers AI only supports inferencing on pre-trained models, Cloudflare says it has developed a vector database service, called Vectorize, designed to make it easier for those models to surface a customer's own data to users.

For example, a customer building a chatbot might feed their product catalog to an embedding model and store the resulting vectors in the database, where they can be searched later.

The idea appears to be that, while the Llama 2 model offered by Cloudflare may not have specific knowledge of a customer's data, the chatbot can still surface relevant information by tying into the database service. According to Cloudflare, this approach makes inferencing more accessible, faster, and less resource intensive because it decouples customer data from the model itself.
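
Stitched together, the flow looks something like the sketch below: embed the user's question with one of the hosted models, query a Vectorize index for the nearest entries from the previously embedded catalog, and hand the winners to the LLM as context. The VECTOR_INDEX binding name is an assumption, and the query options follow Cloudflare's launch-era documentation.

    import { Ai } from '@cloudflare/ai';

    export default {
      async fetch(request: Request, env: { AI: any; VECTOR_INDEX: any }): Promise<Response> {
        const ai = new Ai(env.AI);

        // 1. Embed the incoming question with the hosted bge model
        const question = 'Which of your routers support Wi-Fi 7?';
        const embedding = await ai.run('@cf/baai/bge-base-en-v1.5', { text: [question] });

        // 2. Find the closest catalog vectors previously inserted into the Vectorize index
        const nearest = await env.VECTOR_INDEX.query(embedding.data[0], { topK: 3 });

        // 3. A real app would fetch the catalog text behind each match ID (from KV, D1, etc.)
        //    and feed it to the Llama 2 model as context; here we just return the matches.
        return new Response(JSON.stringify(nearest), {
          headers: { 'content-type': 'application/json' },
        });
      },
    };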

Alongside Workers AI and Vectorize, Cloudflare's AI suite also includes a platform for monitoring, optimizing, and managing inference workloads at scale.

Dubbed AI Gateway, the service applies several features typically associated with content delivery networks and web proxies, like caching and rate limiting, to AI inferencing in order to help customers control costs.

"By caching frequently used AI responses, it reduces latency and bolsters system reliability, while rate limiting ensures efficient resource allocation, mitigating the challenges of spiraling AI costs," the company explains in the blog post.

Pricing and availability

Cloudflare notes that the service is still in the early stages of deployment, with seven sites online today. However, the company is deploying GPUs to bring the service to 100 points of presence by the end of the year and "nearly everywhere" by the end of 2024.

As a result, Cloudflare doesn't recommend deploying production apps on Workers AI just yet, describing it as an "early beta."

"What we released today is just a small preview to give you a taste of what's coming," the blog post reads.

As usual, Cloudflare says it won't be billing for the service on day one. With that said, it expects to charge about a cent for every thousand "regular twitch neurons" and $0.125 for every thousand "fast twitch neurons." The difference between the two is that the latter prioritizes proximity to the end user, while the less expensive of the two runs anywhere Cloudflare has excess capacity.

Neurons are a way to measure AI output, the company explained, adding that a thousand neurons is good for about 130 LLM responses, 830 image classifications, or 1,250 embeddings. By that arithmetic, a million LLM responses would run to roughly $77 at the cheaper rate, or about $960 at the faster one. ®
