AI caramba, those neural networks are power-hungry: Counting the environmental cost of artificial intelligence
(And, also worryingly, its increasing financial cost)
Feature The next time you ask Alexa to turn off your bedroom lights or make a computer write dodgy code, spare a thought for the planet. The back-end mechanics that make it all possible take up a lot of power, and these systems are getting hungrier.
Artificial intelligence began to gain traction in mainstream computing just over a decade ago when we worked out how to make GPUs handle the underlying calculations at scale. Now there's a machine learning algorithm for everything, but while the world marvels at the applications, some researchers are worried about the environmental expense.
One of the most frequently quoted papers on this topic, from the University of Massachusetts, analysed the training costs of AI models including Google's BERT natural language processing model. It found that training BERT on a GPU produced roughly the same carbon emissions as a trans-American jet flight.
Kate Saenko, associate professor of computer science at Boston University, worries that we're not doing enough to make AI more energy efficient. "The general trend in AI is going in the wrong direction for power consumption," she warns. "It's getting more expensive in terms of power to train the newer models."
The trend is exponential. Researchers associated with OpenAI wrote that the computing used to train the average model increases by a factor of 10 each year.
Why is AI so power hungry?
Most AI these days is based on machine learning (ML). This typically uses a neural network, which is a collection of nodes arranged in layers. Each node has connections to nodes in the next layer. Each of these connections has a score known as a parameter or weight.
The neural network takes an input (such as a picture of a hotdog) and runs it through its layers, each of which uses its parameters to produce an output for the next. The final output is a judgement about the data (for example, was the original input a picture of a hotdog or not?).
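To make that concrete, here is a minimal sketch in plain Python of an input passing through layers of weighted connections. The layer sizes, weights, input features and the 0.5 cut-off are all invented for illustration; a real image classifier would have millions of learned weights and take raw pixels as input.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One layer: each node sums its weighted inputs, adds a bias, applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Run the input through every layer in turn; each output feeds the next layer."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 3 input features -> 2 hidden nodes -> 1 output node.
# These weights are made up; in practice they come from training.
LAYERS = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.2, -0.7]], [0.05]),
]

score = forward([0.9, 0.1, 0.4], LAYERS)[0]
is_hotdog = score > 0.5  # the final judgement
print(f"score={score:.3f} hotdog={is_hotdog}")
```

Every connection costs one multiplication and one addition per input, which is why the bill grows with the number of weights.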
Those weights don't come preconfigured. You have to calculate them. You do that by showing the network lots of labelled pictures of hot dogs and not hot dogs. You keep training it until the parameters are optimised, which means that they spit out the correct judgement for each piece of data as often as possible. The more accurate the model, the better it will be when making judgements about new data.
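As a rough illustration of what "calculating the weights" involves, the sketch below trains a single-node classifier by gradient descent in plain Python. The two features, the six labelled examples, the learning rate and the number of passes are all assumptions made up for this example; real models repeat the same kind of update across millions of weights and far more data.

```python
import math

def predict(weights, bias, features):
    """Return the model's confidence (0..1) that the example is a hot dog."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, learning_rate=0.5):
    """Nudge the weights after every labelled example to shrink the error."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):                 # each epoch is one pass over the data
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Made-up features per picture: (elongation, bun-coloured fraction).
# Label 1 means hot dog, 0 means not hot dog.
EXAMPLES = [
    ([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.7, 0.7], 1),
    ([0.2, 0.1], 0), ([0.1, 0.3], 0), ([0.3, 0.2], 0),
]

weights, bias = train(EXAMPLES)
correct = sum((predict(weights, bias, f) > 0.5) == bool(label)
              for f, label in EXAMPLES)
accuracy = correct / len(EXAMPLES)
print(f"weights={weights} bias={bias:.2f} accuracy={accuracy:.0%}")
```

Even this toy performs 1,200 weight updates (200 passes over six examples), which hints at how the arithmetic piles up at scale.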
You don't just train an AI model once. You keep doing it, adjusting various aspects of the neural network each time to maximise the right answers. These aspects are called hyperparameters, and they include variables such as the number of neurons in each layer and the number of layers in each network. A lot of that tuning is trial and error, which can mean many training passes. Chewing through all that data is already expensive enough, but doing it repeatedly uses even more electrons.
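A quick back-of-the-envelope sketch shows how fast trial and error multiplies the work. The candidate values below and the energy figure per run are hypothetical, and learning rate is included simply as another commonly tuned hyperparameter; the point is only that an exhaustive search means one full training run per combination.

```python
from itertools import product

# Hypothetical search space: every combination needs its own training run.
SEARCH_SPACE = {
    "num_layers": [2, 4, 8],
    "neurons_per_layer": [64, 128, 256, 512],
    "learning_rate": [0.1, 0.01, 0.001],
}
ENERGY_PER_RUN_KWH = 12.0  # made-up figure for one full training run

def grid(search_space):
    """List every combination of hyperparameter values as a dict."""
    names = list(search_space)
    return [dict(zip(names, values))
            for values in product(*search_space.values())]

configs = grid(SEARCH_SPACE)
total_runs = len(configs)                             # 3 * 4 * 3 = 36
total_energy_kwh = total_runs * ENERGY_PER_RUN_KWH    # 36 * 12.0 = 432.0
print(f"{total_runs} training runs, about {total_energy_kwh} kWh")
```

Add a fourth hyperparameter with five candidate values and the 36 runs become 180.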
The reason that the models are taking more power to train is that researchers are throwing more data at them to produce more accurate results, explains Lukas Biewald. He's the CEO of Weights and Biases, a company that helps AI researchers organise the training data for all these models while monitoring their compute usage.
"What's alarming about it is that it seems like for every factor of 10 that you increase the scale of your model training, you get a better model," he says.
Yes, but the model's accuracy doesn't increase by a factor of 10. Jesse Dodge, postdoctoral researcher at the Allen Institute for AI and co-author of a paper called Green AI, notes studies pointing to the diminishing returns of throwing more data at a neural network.
So why do it?
"There's a long tail of things to learn," he explains. ML algorithms can train on the most commonly-seen data, but the edge cases – the confusing examples that rarely come up – are harder to optimise for.
Our hotdog recognition system might be fine until some clown comes along in a hotdog costume, or it sees a picture of a hotdog-shaped van. A language processing model might be able to understand 95 per cent of what people say, but wouldn't it be great if it could handle exotic words that hardly anyone uses? More importantly, your autonomous vehicle must be able to stop in dangerous conditions that rarely ever arise.
"A common thing that we see in machine learning is that it takes exponentially more and more data to get out into that long tail," Dodge says.
Piling on all this data doesn't just slurp power on the compute side, points out Saenko; it also burdens other parts of the computing infrastructure. "The larger the data, the more overhead," she says. "Even transferring the data from the hard drive to the GPU memory is power intensive."
Sharing is caring
There are various attempts to mitigate this problem. It starts at the data centre level, where hyperscalers are doing their best to switch to renewables so that they can at least hammer their servers responsibly.
Another tactic is to be more calculated when tweaking your hyperparameters. Weights and Biases offers a "hyperparameter sweep" service that uses Bayesian algorithms to narrow the field of potential changes with each training pass. It also offers an "early stopping" algorithm which halts a training pass early on if the optimisation isn't panning out.
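The general idea behind early stopping can be sketched in a few lines of Python. This is a generic "patience" rule, not Weights and Biases' own algorithm, and the function name, the `patience` and `min_delta` parameters and the loss figures are all invented for the example: training halts once the validation loss has failed to improve for a set number of passes in a row.

```python
def train_with_early_stopping(validation_losses, patience=3, min_delta=0.0):
    """Return (epochs actually run, best loss seen).

    validation_losses stands in for the loss measured after each training
    epoch; a real loop would compute each value by training and evaluating.
    """
    best_loss = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss        # still improving, so keep going
            stale_epochs = 0
        else:
            stale_epochs += 1       # no improvement this epoch
            if stale_epochs >= patience:
                return epoch, best_loss   # give up early
    return len(validation_losses), best_loss

# Made-up loss curve: improves for four epochs, then drifts upwards.
LOSSES = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63, 0.64]

epochs_run, best = train_with_early_stopping(LOSSES)
print(f"stopped after {epochs_run} of {len(LOSSES)} epochs, best loss {best}")
```

With this loss curve the run stops after seven of the ten planned epochs, so the last three are never paid for.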
Not all approaches involve fancy hardware and software footwork. Some are just about sharing. Dodge points out that researchers could amortise the carbon cost of their model training by sharing the end result. Trained models released in the public domain can be used without retraining, but people don't take enough advantage of that.
"In the AI community, we often train models and then don't release them," he says. "Or the next people that want to build on our work just rerun the experiments that we did."
Those trained models can also be fine-tuned with additional data, enabling people to tweak existing optimisations for new applications without retraining the entire model from scratch.
Training isn't the whole story
Making training more efficient only tackles one part of the problem, and it isn't the most important part. The other side of the AI story is inference. This is when a computer runs new data through a trained model to evaluate it, recognising hotdogs it has never seen before. It still takes power, and the rapid adoption of AI is making it more of a problem. Every time you ask Siri how to cook rice properly, it uses inference power in the cloud.
One way to cut the cost of inference is to shrink the model by cutting down the number of parameters. AI models often use vast numbers of weights in a neural network because data scientists aren't sure which ones will be most useful. Saenko and her colleagues have researched reducing the number of parameters using a concept that they call shape shifter networks that share some of the parameters in the final model.
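To get a feel for the numbers, the sketch below counts the parameters in two fully connected networks. The layer sizes are hypothetical and this is not the shape shifter technique itself, just the basic arithmetic: every connection between adjacent layers is one weight, plus one bias per node, and each of those has to be processed on every inference.

```python
def count_parameters(layer_sizes):
    """Weights plus biases in a fully connected network.

    layer_sizes lists the number of nodes in each layer, input first.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical layer sizes for a large model and a slimmed-down one.
LARGE = [784, 1024, 1024, 10]
SMALL = [784, 128, 10]

large = count_parameters(LARGE)   # 1,863,690
small = count_parameters(SMALL)   # 101,770
print(f"large={large:,} small={small:,} ratio={large / small:.1f}x")
```

In this made-up case the smaller network carries roughly 18 times fewer parameters, which is the kind of saving that matters when a model is queried millions of times a day.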
"You might train a much bigger network and then distil it into a smaller one so that you can deploy a smaller network and save computation and deployment at inference time," she says.
Companies are also working on hardware innovations to cope with this increased inference load. Google's Tensor Processing Units (TPUs) are tailored to handle both training and inference more efficiently, for example.
Solving the inference problem is especially tricky because we don't know where a lot of it will happen in the long term. The move to edge computing could see more inference jobs happening in lower-footprint devices rather than in the cloud. The trick there is to make the models small enough and to introduce hardware advances that will help to make local AI computation more cost-effective.
"How much do companies care about running their inference on smaller devices rather than in the cloud on GPUs?" Saenko muses. "There is not yet that much AI running standalone on edge devices to really give us some clear impetus to figure out a good strategy for that."
Still, there is movement. Apple and Qualcomm have already produced tailored silicon for inference on smartphones, and startups are becoming increasingly innovative in anticipation of edge-based inference. For example, semiconductor startup Mythic launched an AI processor focused on edge-based AI that uses analogue circuitry and in-memory computing to save power. It's targeting applications including object detection and depth estimation, which could see the chips turn up in everything from factories to surveillance cameras.
As power consumption rises, so do the stakes
As companies grapple with whether to infer at the edge, the problem of making AI more energy efficient in the cloud remains. The key lies in resolving two opposing forces: on the one hand, everyone wants more energy efficient computing. On the other, researchers constantly strive for more accuracy.
Dodge notes that most academic AI papers today focus on the latter. Accuracy is winning out as companies strive to beat each other with better models, agrees Saenko. "It might take a lot of compute but it's worthwhile for people to claim that one or two percent improvement," she says.
She would like to see more researchers publish data on the power consumption of their models. This might inspire competition to drive efficiencies up and costs down.
The stakes may be more than just environmental, warns Biewald; they could be political too. What happens if computing consumption continues to go up by a factor of 10 each year?
"You have to buy the energy to train these models, and the only people that can realistically afford that will be Google and Microsoft and the 100 biggest corporations," he posits.
If we start seeing a growing inequality gap in AI research, with corporate interests out in front, carbon emissions could be the least of our worries. ®