Feed the AI fire with centralization

An efficient GPU compute management platform is critical to furthering AI experimentation, says Run:ai

Sponsored Feature A steady stream of revolutionary technologies and discoveries – fire, agriculture, the wheel, the printing press and the internet, to name but a few – has profoundly shaped human development and civilization. And that cycle of innovation is continuing with Artificial Intelligence (AI).

Research firm IDC has gone so far as to conclude that AI is really the answer to just about "everything". Rasmus Andsbjerg, associate vice president of data and analytics at IDC, says: "The reality is, AI offers solutions to everything we are facing at the moment. AI can be a source for fast-tracking digital transformation journeys, enable cost savings in times of staggering inflation rates and support automation efforts in times of labor shortages."

Certainly, across all industries and functions, end-user organizations are starting to discover the benefits of AI, as increasingly powerful algorithms and underlying infrastructure emerge to enable better decision-making and higher productivity.

Worldwide revenues for the artificial intelligence (AI) market, including associated software, hardware, and services for both AI-centric and non-AI-centric applications, totaled $383.3 billion in 2021. That was up 20.7 percent over the prior year, according to the most recent International Data Corporation (IDC) Worldwide Semiannual Artificial Intelligence Tracker.

Similarly, the deployment of AI software to the cloud continues to show steady growth. IDC expects cloud versions of newly purchased AI software to surpass on-premises deployments in 2022.

The sky's the limit for AI

Dr Ronen Dar, chief technology officer of AI specialist Run:ai, which has created a compute-management platform for AI, believes that the sky's the limit for the nascent enterprise AI sector. 

"AI is a market that we see is growing very rapidly. And in terms of enterprises, we see demand and adoption for machine learning and AI. And I think right now there is a new technology here that is bringing new capabilities that are going to change the world; that are going to revolutionize businesses," Dar notes. 

There is also an increasingly clear understanding of the need to start exploring and experimenting with AI, and to understand how to integrate it into business models.

Dar believes that AI can bring "amazing benefits" to improve existing enterprise business processes: "In terms of optimizing and proving the current business, we see a lot of use cases around AI and machine learning which is improving operations and how decisions are being made around supply and demand."

He points out that new deep learning models based on neural networks can improve processes, decision making and the accuracy of critical business processes such as fraud detection in the financial services industry. Healthcare is another sector where the potential of AI is "huge", particularly in terms of helping doctors make better clinical decisions and helping to discover and develop new drugs. 

And, looking further ahead, Dar predicts that AI technology will help deliver brand new commercial opportunities that do not currently exist in sectors such as self-driving vehicles and immersive gaming. 

Infrastructure hurdles to overcome

Despite the obvious potential for AI and machine learning in the enterprise, Dar acknowledges that commercial deployment of AI is being held back by issues around infrastructure provision. He advises that firms need to look at the way in which AI gets into an organization in the first place.

Typically, this involves an uncoordinated, department-by-department process that sees different teams provisioning technology and resources independently, leading to siloed deployments. IT cannot effectively control these ad hoc projects and does not have visibility into what's going on. And this makes it difficult, if not impossible, to calculate ROI on the AI infrastructure spend.

"It's the classical problem: back in the day it was shadow IT and now it's shadow AI," Dar says. 

In addition, the state-of-the-art infrastructure needed for AI/ML represents a significant investment, as enterprises need powerful GPU-accelerated computing hardware to process very complex data and to train models.

"AI teams need a lot of computing power to train models, typically using GPUs, which are premium data center resources  that can be siloed and not used efficiently," says Dar. "It can result in  a lot of money being wasted for sure." 

That siloed infrastructure can result in utilization levels of less than 10 percent, for example.

According to Run:ai's poll, The 2021 State of AI Infrastructure Survey, published in October 2021, 87 percent of respondents said they experience some level of GPU/compute resource allocation issues, with 12 percent saying this happens often. As a result, 83 percent of the surveyed companies reported they were not fully utilizing their GPU and AI hardware. In fact, 61 percent indicated that their GPU and AI hardware was mostly at "moderate" levels of utilization.

The centralization of AI

To solve these problems, Dar advocates centralizing the provision of AI resources. Run:ai has developed a compute-management platform for AI that does just this, centralizing and virtualizing the GPU compute resource. By pooling GPUs into a single virtual layer and automating workload scheduling for 100 percent utilization, this approach offers clear advantages over siloed systems at the departmental level.
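
To make the pooling idea concrete, here is a minimal Python sketch of a shared GPU pool with a simple first-come, first-served scheduler. The class and job names are hypothetical and this is not Run:ai's implementation; it only illustrates why jobs queued against one shared pool can reuse GPUs that a siloed team would otherwise have left idle.

```python
# Illustrative sketch only: a single shared pool that any team can submit to,
# instead of per-department GPU silos.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Job:
    name: str
    team: str
    gpus_needed: int

@dataclass
class GpuPool:
    total_gpus: int
    free_gpus: int = field(init=False)
    queue: deque = field(default_factory=deque)

    def __post_init__(self):
        self.free_gpus = self.total_gpus

    def submit(self, job: Job) -> None:
        """Queue a job from any team against the shared pool."""
        self.queue.append(job)
        self.schedule()

    def schedule(self) -> None:
        """Greedily start queued jobs while capacity remains."""
        while self.queue and self.queue[0].gpus_needed <= self.free_gpus:
            job = self.queue.popleft()
            self.free_gpus -= job.gpus_needed
            print(f"started {job.name} ({job.team}) on {job.gpus_needed} GPUs")

    def release(self, job: Job) -> None:
        """Return a finished job's GPUs to the pool and re-run scheduling."""
        self.free_gpus += job.gpus_needed
        self.schedule()

pool = GpuPool(total_gpus=16)
pool.submit(Job("train-fraud-model", "finance", 8))
pool.submit(Job("nlp-experiment", "research", 8))   # runs immediately; no silo boundary
pool.submit(Job("vision-sweep", "imaging", 4))      # queues until capacity frees up
```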

Centralizing the infrastructure gives back control and visibility, while freeing data scientists from the overhead of managing infrastructure. AI teams share a universal AI compute resource that can be dynamically dialed up and down as demand rises or falls, eliminating both demand bottlenecks and periods of underutilization.

This approach, Dar argues, can help organizations get the most out of their hardware and free data scientists from the constraints of underlying resource limitations. All of which means they can run more jobs and bring more AI models into production.

One example is the London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare, led by King's College London and based at St. Thomas' Hospital. It uses medical images and electronic healthcare data to train sophisticated deep learning algorithms for computer vision and natural-language processing. These algorithms are used to create new tools for effective screening, faster diagnosis and personalized therapies.

The Centre realized that its legacy AI infrastructure was suffering from efficiency issues: total GPU utilization was below 30 percent, with "significant" idle periods for some components. After moving to address these issues by adopting a centralized AI compute provisioning model based on Run:ai's platform, its GPU utilization rose by 110 percent (more than doubling), with parallel improvements in experiment speed and overall research efficiency.

"Our experiments can take days or minutes, using a trickle of computing power or a whole cluster," says Dr M. Jorge Cardoso, associate professor and senior lecturer in AI at King's College London and CTO of the AI Centre. "Reducing time to results ensures we can ask and answer more critical questions about people's health and lives," 

Centralizing AI GPU resources also delivered valuable commercial benefits to Wayve, a London-based firm which develops AI software for self-driving cars. Its technology is designed not to depend on sensing, focusing instead on greater intelligence for better autonomous driving in dense urban areas.

Wayve's Fleet Learning Loop involves a continuous cycle of data collection, curation, training of models, re-simulation, and licensing of models before deployment into the fleet. The company's primary GPU compute consumption comes from production training in the Fleet Learning Loop: it trains the product baseline on the full dataset and continually re-trains as new data is collected through iterations of the loop.
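
For readers unfamiliar with the pattern, that cycle can be sketched roughly as follows. Every function below is a trivial placeholder rather than Wayve's code; the sketch only shows the shape of the loop the company describes.

```python
# Purely illustrative sketch of a fleet-learning-style loop; all functions
# are stand-ins, not Wayve's implementation.

def collect(fleet):
    """Stand-in for gathering new driving data from the fleet."""
    return [f"frame-from-{car}" for car in fleet]

def curate(raw_data):
    """Stand-in for filtering and labelling the useful samples."""
    return [sample for sample in raw_data if "frame" in sample]

def retrain(model, curated):
    """Stand-in for the GPU-heavy production training step."""
    return {"version": model["version"] + 1}

def resimulate(model, curated):
    """Stand-in for replaying scenarios offline against the new model."""
    return {"passed": True}

def deploy(model, fleet):
    print(f"licensed model v{model['version']} deployed to {len(fleet)} cars")

def fleet_learning_iteration(fleet, model):
    """One pass of the loop: collect, curate, re-train, re-simulate, deploy."""
    curated = curate(collect(fleet))
    model = retrain(model, curated)
    if resimulate(model, curated)["passed"]:   # only vetted models reach the fleet
        deploy(model, fleet)
    return model

model = {"version": 0}                         # the trained product baseline
for _ in range(2):                             # iterations of the Fleet Learning Loop
    model = fleet_learning_iteration(["car-a", "car-b"], model)
```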

The company began to realize it was suffering from GPU scheduling "horror": although nearly 100 percent of its available GPU resources were allocated to researchers, less than 45 percent were utilized when the testing was initially done. 

"Because GPUs were statically assigned to researchers, when researchers were not using their assigned GPUs others could not access them, creating the illusion that GPUs for model training were at capacity even as many GPUs sat idle" Wayve notes. 

Working with Run:ai, Wayve tackled this problem by removing silos and eliminating the static allocation of resources. Pools of shared GPUs were created, allowing teams to access more GPUs and run more workloads, which led to a 35 percent improvement in utilization.

Mirroring CPU efficiency improvements

Mirroring the way in which VMware has brought substantial efficiency improvements to server CPU utilization over recent years, new innovations are now coming on stream to optimize GPU utilization for AI compute workloads.

"If you think about the software stack that runs on top of CPUs, it was built with a lot of VMware and virtualization," explains Dar. "GPUs are relatively new in the data center, and software for AI and virtualization – such as NVIDIA AI Enterprise – is also a recent development." 

"We bring advanced technology in that area with capabilities like fractional GPU, job swapping and. allowing workloads to efficiently share GPUs," says Dar, adding that further enhancements are being planned.

Run:ai works closely with NVIDIA to improve and simplify the usage of GPUs in the enterprise. The most recent collaboration includes enabling multi-cloud GPU flexibility for companies using GPUs in the cloud, and integration with NVIDIA Triton Inference Server software to simplify the process of deploying models in production.
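
For context, the client side of serving a model through Triton typically looks something like the hedged sketch below, which uses NVIDIA's standard tritonclient Python package rather than anything Run:ai-specific. The server address, model name and tensor names are placeholders that depend on how the model is actually deployed.

```python
import numpy as np
import tritonclient.http as httpclient   # pip install tritonclient[http]

# Connect to a running Triton server; the URL, model name and tensor names
# below are placeholders tied to the deployed model's configuration.
client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)   # dummy image batch

inputs = [httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]

# Send the inference request and read back the result tensor.
result = client.infer(model_name="resnet50", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT__0").shape)
```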

In the way that major innovations over the course of history have had profound impacts on the human race and the world, Dar notes that the power of AI will need to be harnessed with care to maximize its potential benefits, while managing potential disadvantages. He compares AI with the most primeval innovation of all: fire. 

"It's like fire which brought a lot of great things and changed human lives. Fire also brought danger. So human beings understood how to live with fire," says Dar. "I think this is also here in AI these days." 

Sponsored by Run:ai.
