The shortest time for training a neural network on the popular ImageNet dataset has been slashed again, it is claimed, from the previous record of four minutes to just one and a half.
Training is arguably the most important and tedious part of deep learning. A small mountain of data is fed into the software and filtered through multiple layers of intense matrix-math calculations to teach a neural network to identify things, or otherwise make decisions, from future inputs. Developers long for snappy turnarounds in the order of minutes, rather than hours or days of waiting, so they can tweak their models for optimum performance, and test them, before the systems are deployed.
Shorter reeducation sessions also mean facial-recognition, voice-recognition, and similar systems can be rapidly updated, tweaked, or improved on the fly.
There are all sorts of tricks to shave off training times. A common tactic is to run through the dataset quickly by increasing the batch size so that the model processes more samples per iteration. Larger batches tend to decrease the overall accuracy, however, so it’s a bit of a balancing act.
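As a back-of-envelope illustration of why bigger batches speed through an epoch faster: ImageNet-1K has roughly 1.28 million training images, so the number of iterations per epoch drops as the batch grows. The batch sizes below are arbitrary examples.

```python
# Bigger batches mean fewer optimizer iterations per pass over the dataset.
DATASET_SIZE = 1_281_167  # approx number of ImageNet-1K training images

for batch in (256, 8_192, 65_536):
    iters = -(-DATASET_SIZE // batch)  # ceiling division
    print(f"batch {batch:>6}: {iters:>5} iterations per epoch")
```

Fewer iterations means fewer rounds of gradient synchronization between chips, which is where much of the wall-clock time goes in distributed training.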
Another tactic is to use a mix of half-precision floating point, aka FP16, as well as single-precision, FP32. This, for one thing, alleviates the memory bandwidth pressure on the GPUs or whatever chips you're using to accelerate the machine-learning math in hardware, though you may face some loss of accuracy.
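A tiny NumPy sketch of the precision tradeoff (illustrative, not any paper's exact recipe): an FP16 weight can silently drop an update smaller than its precision near 1.0, while an FP32 master copy of the same weight keeps it.

```python
import numpy as np

lr = 0.1
grad = np.float16(0.002)   # gradient computed and shipped around in FP16

w16 = np.float16(1.0)      # FP16 copy of a weight
master = np.float32(1.0)   # FP32 "master" copy of the same weight

# The FP16 update (0.1 * 0.002 = 0.0002) is below half an FP16 ulp at 1.0,
# so it rounds away entirely; the FP32 update survives.
w16 = np.float16(w16 - np.float16(lr) * grad)
master = master - np.float32(lr) * np.float32(grad)

print(w16, master)  # FP16 copy is unchanged at 1.0; FP32 master has moved
```

This is why mixed-precision schemes typically keep low-precision values where bandwidth matters and a high-precision copy where small accumulated updates matter.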
Researchers at SenseTime, a Hong Kong-based computer-vision startup valued at over $1bn, and Nanyang Technological University in Singapore, say they used these techniques to train AlexNet, an image-recognition convolutional neural network, on ImageNet in just 1.5 minutes, albeit with a 58.2 per cent accuracy.
It required 512 of Nvidia’s 2017-era Tesla Volta V100 accelerators, in two physical clusters connected using a 56Gbps network, to crank through more than 1.2 million images in the ImageNet-1K dataset in that time. Each chip can set you back about $10,000, so you may prefer to rent them instead from a cloud provider, if possible.
They also used 16-bit FP16 parameters and gradients during the forward-backward computation phases, and 32-bit FP32 values during the model-update phase, balancing bandwidth against accuracy. The training run completed 95 epochs in those 90 seconds, using a per-GPU batch size of 128 for AlexNet, or 65,536 across the full 512-GPU setup.
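Taking the article's own round figures, the implied throughput works out like this:

```python
# Rough throughput implied by the reported figures (round numbers from above).
images, epochs, seconds, gpus = 1_200_000, 95, 90, 512

total_images = images * epochs        # images processed over the whole run
per_second = total_images / seconds   # cluster-wide rate
per_gpu = per_second / gpus           # per-accelerator rate

print(f"~{per_second:,.0f} images/s overall, ~{per_gpu:,.0f} images/s per GPU")
```

That is north of a million images per second across the cluster, or a couple of thousand per V100.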
The team devised a software toolkit, dubbed GradientFlow, to slash its training times on GPUs, as described in their arXiv-hosted paper, which was emitted earlier this month. Each GPU stores batches of data from ImageNet and crunches through their pixels using gradient descent. The gradient values are then passed to server nodes to update the parameters of the overall model using a type of parallel-processing algorithm known as allreduce.
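The allreduce step itself is simple to sketch: every worker contributes its gradients, and every worker ends up holding the same summed result. Here is a toy pure-Python stand-in; real clusters would use an optimized library such as NCCL or MPI over the network.

```python
# Toy allreduce (sum): combine every worker's gradients, then hand the same
# aggregated result back to every worker.
def allreduce_sum(worker_grads):
    n = len(worker_grads[0])
    total = [0.0] * n
    for grads in worker_grads:           # reduce: sum each parameter's gradient
        for i, g in enumerate(grads):
            total[i] += g
    return [total[:] for _ in worker_grads]  # broadcast: one copy per worker

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three workers, two parameters
print(allreduce_sum(grads))  # every worker gets [9.0, 12.0]
```

The expensive part in practice is not the arithmetic but moving all those gradient tensors between machines, which is what GradientFlow targets.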
Trying to ingest these values, or tensors, from hundreds of GPUs at a time will run into bottlenecks. GradientFlow, it is claimed, increases the efficiency of the code by allowing the GPUs to communicate and exchange gradients locally before final values are sent to the model.
“Instead of immediately transmitting generated gradients with allreduce, GradientFlow tries to fuse multiple sequential communication operations into a single one, avoiding sending a huge number of small tensors via network,” the researchers wrote.
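The fusion idea can be sketched as packing many small per-layer gradient tensors into one flat buffer, so the network sees one large transfer instead of many tiny ones. The function names here are illustrative, not GradientFlow's actual API.

```python
# Fuse many small gradient tensors into one buffer for a single allreduce call.
def fuse(tensors):
    sizes = [len(t) for t in tensors]
    flat = [x for t in tensors for x in t]   # one big contiguous buffer
    return flat, sizes

def unfuse(flat, sizes):
    out, i = [], 0
    for n in sizes:                          # slice the buffer back apart
        out.append(flat[i:i + n])
        i += n
    return out

layers = [[0.5], [1.0, 2.0], [3.0, 4.0, 5.0]]  # many small per-layer gradients
flat, sizes = fuse(layers)
# ... perform one allreduce over `flat` instead of three separate calls ...
print(unfuse(flat, sizes) == layers)  # round-trips cleanly: True
```

One large message amortizes per-call latency and protocol overhead that would otherwise be paid once per tiny tensor.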
"To reduce network traffic, we design coarse-grained sparse communication. Instead of transmitting all gradients in every iteration, GradientFlow only sends important gradients for allreduce at the level of chunk (for example, a chunk may consist of 32K gradients)."
It’s about 2.6 times faster than the previous fastest-known effort, developed by researchers at Tencent, a Chinese tech giant, and Hong Kong Baptist University, which took four minutes. ®