Microsoft, OpenAI method could make training large neural networks cheaper
Cost of tuning hyperparameters using μTransfer was 7% of what it would be to pre-train GPT-3
Companies scaling up their neural network models could cut expensive training costs by employing a technique developed by researchers at Microsoft and OpenAI.
Machine-learning systems are often compared to black boxes. Data is fed into an algorithm and out pops some more data. This output can be a label classifying an object in an image, a string of text based on a prompt, or even a snippet of code. The computation that happens in the middle involves manipulating countless matrices, and is a mystifying, hand-wavy process experts don't quite fully understand.
There are several properties developers tinker with to boost a model's performance during the training stage. These so-called hyperparameters are separate from the data, and are often manually tweaked based on intuition alone. Finding the optimum hyperparameters requires repeated training runs and numerous adjustments; all that computation is costly and time-consuming. As systems grow larger – with billions and trillions of parameters – it becomes too expensive to search extensively for the best hyperparameters.
"In practice, people rely on many rules of thumb to come up with 'educated guesses' of hyperparameters to use for a large model run without much confidence of their optimality," Greg Yang, a senior researcher at Microsoft, and Edward Hu, a PhD Student at Mila, a research institute based in Montreal, told The Register.
With μTransfer, the researchers first find the optimal hyperparameters by tinkering with a smaller model, and then transfer them to a larger, scaled-up system. The team tried the technique on the text-generating GPT-3 architecture, transferring hyperparameters from a 40-million-parameter model to a 6.7-billion-parameter one.
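The tune-small-then-reuse workflow can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: `proxy_loss` is a made-up stand-in for "train the small model and report its validation loss", and the grid values and hyperparameter names are arbitrary assumptions.

```python
import math
from itertools import product


def proxy_loss(learning_rate, init_scale):
    """Hypothetical stand-in for training the small model and returning its loss."""
    # Toy bowl-shaped surface with its minimum at learning_rate=1e-3, init_scale=1.0.
    return (math.log10(learning_rate) + 3.0) ** 2 + (init_scale - 1.0) ** 2


def search_on_proxy(loss_fn, grid):
    """Try every combination in the grid on the cheap model and keep the best one."""
    names = list(grid)
    best_config, best_loss = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        loss = loss_fn(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config


grid = {"learning_rate": [1e-4, 1e-3, 1e-2], "init_scale": [0.5, 1.0, 2.0]}
best = search_on_proxy(proxy_loss, grid)
print(best)
```

The point of the sketch is the cost structure: the nine trial runs all happen on the small model, and the large model is trained only once, with the winning settings.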
By getting rid of the need to repeatedly adjust the larger GPT-3's hyperparameters, the team estimated that their hyperparameter-tuning cost using μTransfer was only seven percent of what it would cost to pre-train the model. Without μTransfer, hyperparameter-tuning the bigger system would have cost 167.5 times more, we're told. Models containing billions of parameters can rack up millions of dollars in compute costs.
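To put those figures side by side, here is a back-of-the-envelope calculation that takes the numbers quoted above at face value. The variable names are ours, and expressing the result in "pre-training runs" is an assumption about how the two figures combine, not a number supplied by the researchers.

```python
proxy_params = 40_000_000        # the 40-million-parameter model
target_params = 6_700_000_000    # the 6.7-billion-parameter model

# How many times larger the target model is than the proxy.
size_ratio = target_params / proxy_params

tuning_share_with_transfer = 0.07   # tuning cost as a fraction of one pre-training run
direct_multiple = 167.5             # quoted multiple for tuning the big model directly

# Implied cost of direct tuning, measured in full pre-training runs.
tuning_share_direct = tuning_share_with_transfer * direct_multiple

print(f"size ratio: {size_ratio}")
print(f"direct tuning: about {tuning_share_direct:.1f} pre-training runs")
```

The size ratio between the two models works out to 167.5, the same as the quoted multiple, and direct tuning would come to roughly 11.7 times the cost of pre-training the model once.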
"We are able to keep the optimal hyperparameters stable across model size thanks to a new parametrization suggested by the theory of neural network infinite-width limits," Yang and Hu told us. Neural networks, loosely modeled on the structure of brains, are made up of layers of neurons. The width of a network is described by the number of neurons contained in each layer; a wider network has more neurons. The depth of a network is described by the number of layers, a deeper network has more layers.
The pair explained the theory is an abstract concept that allows researchers to study the limits of a model as it increases in size. They found that some hyperparameters, like the learning rate, should be adjusted depending on the width of each layer. Other hyperparameters, however, aren't so easily transferable. For these, developers will still need to tune their models directly, if they can.
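One way to picture a width-dependent learning rate: the sketch below shrinks the rate for a hidden layer in proportion to how much wider it is than the layer in the small model where the rate was tuned. The inverse-width rule, the function name, and the numbers are illustrative assumptions, not the exact recipe from the researchers' paper or their open-source code.

```python
def scaled_learning_rate(base_lr, base_width, width):
    """Assumed rule: the learning rate shrinks in proportion to layer width."""
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


base_lr = 1e-3      # made-up value, imagined as tuned on the small model
base_width = 256    # made-up hidden-layer width of the small model

for width in (256, 1024, 4096):
    print(width, scaled_learning_rate(base_lr, base_width, width))
```

Under this assumed rule, a layer four times wider than the small model's gets a quarter of the tuned learning rate, so the value found on the small model never has to be searched for again at the larger size.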
μTransfer is most effective for scaling existing architectures to larger sizes, where some hyperparameters can be reused. "Rather than being applied to fine-tuning, we are more likely to see our technique being used to find better hyperparameters and 'supercharge' model pretraining in the near future. We believe that the biggest payoff will come from pretraining enormous models with billions or even trillions of parameters," they said.
If you want to use μTransfer in scaling up your own models, the open-source code can be found here. ®