Faster is always better in AI, although it comes at a price. As researchers strive to train their neural networks at breakneck speeds, the accuracy of the resulting software tends to fall.
Thus, training machine-learning code at high speed, while still producing a network that makes precise and correct decisions, is a vital goal.
To that end, a group of researchers at TenCent, one of the biggest tech conglomerates in China, and Hong Kong Baptist University have trained AlexNet – a popular neural network trained on ImageNet, a dataset often used for image-classification systems – in just four minutes. The previous record was 11 minutes, held by UC Berkeley et al boffins, as far as we're aware.
The resulting system is able to recognize and label an object in a given photo, getting the description right just under three out of five times, which more or less matches the UC Berkeley effort in terms of accuracy.
“When training AlexNet with 95 epochs, our system can achieve 58.7 per cent top-1 test accuracy within four minutes, which also outperforms all other existing systems,” according to a paper dropped onto arXiv by the TenCent and Hong Kong Baptist Uni team this week.
Neural networks are fed samples from datasets in batches during the training process. To reduce training times, you increase the batch size. For this study, the researchers rinsed 65,536 ImageNet images per batch through the neural network during training – twice as many snaps as the UC Berkeley team used – while employing 1,024 Nvidia Tesla P40 GPUs to crunch the numbers. The UC Berkeley effort used 1,024 Intel Knights Landing Xeon Phi 7250 chips.
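To get a feel for the scale, here's the back-of-the-envelope arithmetic, assuming the batch is split evenly across the accelerators, which is the usual data-parallel arrangement (the paper's exact sharding isn't detailed here):

```python
# Data-parallel training splits each giant batch across the GPUs.
# Figures below are from the article; the per-GPU share assumes an
# even split, the standard data-parallel setup.
global_batch = 65_536   # ImageNet images per training step
num_gpus = 1_024        # Nvidia Tesla P40s

per_gpu_batch = global_batch // num_gpus
print(per_gpu_batch)    # → 64 images per GPU per step
```

Each GPU chews through a comfortable 64 images per step – the hard part, as the researchers found, is keeping 1,024 of them fed and synchronized.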
Tricks of the trade
The trouble is, if you increase the batch size, the neural network's ability to generalize decreases. In other words, it becomes worse at identifying things it hasn’t seen before. Therefore, you have to balance your software so that the batch size is optimal for both processing speed and classification ability, while still scaling over many nodes. You need multiple nodes running in parallel to get the training time down to the order of minutes rather than hours or days.
When you have a hundred machines running the numbers in parallel during training, they need to be coordinated and the results collected and distilled: this requires communications between the computers to be as efficient as possible to eliminate bottlenecks and other stalling factors.
Thus, the TenCent and Hong Kong uni crew came up with a communications technique called “tensor fusion.” When nodes are sharing information over their cluster's network, multiple small tensors are packaged together before transmission, so the per-message overhead is paid once rather than once per tensor, reducing latency and increasing throughput. And increasing throughput reduces the amount of time taken to train the model.
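The paper's actual implementation isn't reproduced here, but the packing idea can be sketched in a few lines – `fuse` and `unfuse` are illustrative names, and the collective-communication call is mocked out:

```python
import numpy as np

def fuse(tensors):
    """Pack several small tensors into one flat buffer, so the cluster
    pays the per-message latency cost once instead of once per tensor."""
    shapes = [t.shape for t in tensors]
    flat = np.concatenate([t.ravel() for t in tensors])
    return flat, shapes

def unfuse(flat, shapes):
    """Split the fused buffer back into tensors of the original shapes."""
    out, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(flat[offset:offset + size].reshape(shape))
        offset += size
    return out

# Two small gradient tensors from different layers of a network...
grads = [np.ones((2, 3)), np.arange(4.0)]
flat, shapes = fuse(grads)      # one buffer of 6 + 4 = 10 elements
# ...a single all-reduce over `flat` would happen here, instead of
# one network round-trip per tensor...
restored = unfuse(flat, shapes)
print(flat.size)                # → 10
```

One big transfer amortizes the fixed cost of each network operation, which is exactly where small-tensor traffic bleeds time.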
The team also used a mix of 32-bit full and 16-bit half-precision floating-point math (FP32 and FP16) during training, rather than purely FP32, which further reduced the amount of data shunted through a node's memory, also improving the throughput and cutting into the training time.
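A common way to structure such mixed-precision training – sketched below as a generic illustration, not necessarily TenCent's exact scheme – is to keep a master copy of the weights in FP32, do the bulk arithmetic in FP16 to halve memory traffic, and apply updates back to the FP32 copy:

```python
import numpy as np

# Mixed-precision training step, minimal sketch. All names and the
# toy "gradient" are illustrative; real training computes gradients
# via backpropagation.
master_w = np.ones(4, dtype=np.float32)      # FP32 master weights
x = np.full(4, 0.5, dtype=np.float16)        # activations held in FP16

w16 = master_w.astype(np.float16)            # FP16 working copy for compute
grad16 = w16 * x                             # stand-in for an FP16 gradient

lr = 0.1
master_w -= lr * grad16.astype(np.float32)   # update accumulates in FP32

print(master_w.dtype)                        # → float32
```

The FP16 copies move half as many bytes through memory and over the wire, while the FP32 master weights keep tiny repeated updates from being rounded away.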
“Training at half precision increases the throughput of a single GPU, while dealing with a large batch size increases the scalability of the distributed system,” Xianyan Jia, an AI engineer at TenCent, explained to The Register. "The combination of the two techniques improves the overall throughput of the system, so that we can train the neural networks in a much shorter time than the previous work with similar hardware.
“Directly applying half precision with large batch size in training CNNs could result in an accuracy loss. So we propose some optimization techniques such that we can enjoy the high throughput of half-precision and batch size without losing accuracy.”
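The paper's quote doesn't spell out which optimizations those are, but a widely used guard against FP16 gradient underflow in the wider field – offered here as a generic illustration, not as TenCent's method – is loss scaling: multiply the loss by a large constant so tiny gradients survive the trip through FP16, then divide it back out before the FP32 weight update.

```python
import numpy as np

# Loss scaling, generic sketch. A gradient of 1e-8 is below the
# smallest FP16 subnormal (~6e-8), so it flushes to zero unless it
# is scaled up first.
scale = 1024.0
tiny_grad = np.float32(1e-8)

unscaled_fp16 = np.float16(tiny_grad)        # underflows to 0.0
scaled_fp16 = np.float16(tiny_grad * scale)  # survives in FP16

recovered = np.float32(scaled_fp16) / scale  # unscale before FP32 update
print(float(unscaled_fp16), float(recovered))
```

Without the scale factor the gradient signal vanishes entirely; with it, the value round-trips through FP16 nearly intact.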
Speeding up research
Combining larger batch sizes, mixed precision, and optimized parallel-processing communications, the researchers managed to train not only AlexNet super-quick but also ResNet-50, another neural network trained on ImageNet – the latter in 8.7 minutes with a top-1 accuracy of 76.2 per cent using 1,024 Tesla P40 GPUs. Doubling the number of GPUs to 2,048 reduced the training time to 6.6 minutes with a slight decrease in accuracy to 75.8 per cent. Interestingly, we note that the Nvidia-based TenCent AlexNet was faster than the Xeon-Phi-based Berkeley AlexNet.
“Deep learning has been widely applied in many areas with large-scale data,” Jia said.
"However, deep learning researchers and engineers could spend several days to wait for the results of a large model to converge. Long training time will impede the research and development cycle. So, speeding up the training process of a large-scale deep neural model is critical in both machine learning and HPC communities."
Now, these sorts of projects aren’t cheap, considering the math accelerators – the Nvidia Teslas and the Xeon Phis – cost thousands of dollars apiece. There are other hardware options, such as Google’s rentable TPUs; however, Jia said Nvidia’s software stack of libraries and matrix-crunching compilers makes it relatively easy to optimize neural networks.
“For as far as we know, it is still difficult for a user or a group to use a TPU cluster to do such experiments," she told us. "That said, all of our optimizations are well-known practice in HPC community and can be generalized to a system which uses TPU or other ASIC too, given that the researchers know the specific machine or chip topology of the system and the chip's peak memory bandwidth and GFLOPS.”
The goal is to eventually apply these techniques to scale up TenCent’s AI infrastructure service. “Our vision is to apply optimizations we designed for ImageNet training to more fields such as sequence models and reinforcement learning models, and continue to provide infrastructure support for our internal researcher and data scientists who want to, and should, focus only on the model and algorithm design,” she concluded. ®