This paper is from the same team that published Deep Compression. This time they aim to reduce the network traffic generated during distributed training of deep learning models by compressing the gradients.
The key insight of the paper is that 99.9% of the gradients exchanged during parameter updates are redundant. The authors take advantage of this observation and compress the traffic generated by different models by 270x to 600x without losing accuracy. Deep Gradient Compression (DGC) uses four techniques to preserve accuracy: momentum correction, local gradient clipping, momentum factor masking and warm-up training.
Gradient exchange is one of the major bottlenecks in distributed training of deep learning models, especially for RNN models, where the computation-to-communication ratio is lower. It also prevents us from training models on mobile devices, which would otherwise offer better privacy and more personalization.
Over the years, many methods have been proposed for reducing this traffic, for example gradient quantization methods such as 1-bit SGD, TernGrad and QSGD. Another line of work is “gradient dropping”, in which workers only send gradients larger than a predefined constant threshold. However, choosing the threshold is hard: a poorly chosen absolute value can hurt convergence and accuracy.
DGC follows the same gradient dropping approach and sends only the gradients whose magnitude is above a particular threshold (choosing the threshold is explained later). To avoid losing information, the remaining gradients are accumulated locally until they become large enough to be transmitted. The intuition is that local gradient accumulation is equivalent to increasing the batch size over time (skipping the math).
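The send-or-accumulate step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `sparsify_step`, the `residual` buffer and the exact top-k selection of the threshold are assumptions made for the example.

```python
import numpy as np

def sparsify_step(grad, residual, sparsity=0.999):
    """One worker-side step of gradient dropping with local accumulation.

    Hypothetical helper: keeps roughly the top (1 - sparsity) fraction of
    entries by magnitude and stores everything else in `residual`.
    """
    acc = residual + grad                      # add new gradient to what was held back
    k = max(1, int(round(acc.size * (1.0 - sparsity))))
    flat = np.abs(acc).ravel()
    # The k-th largest magnitude serves as the threshold (assumed selection rule).
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    mask = np.abs(acc) >= threshold
    sparse_update = np.where(mask, acc, 0.0)   # entries that get transmitted
    new_residual = np.where(mask, 0.0, acc)    # entries that keep accumulating locally
    return sparse_update, new_residual
```

Nothing is lost in a step: the transmitted part and the new residual always sum to the accumulated gradient, so small entries are only delayed, never discarded.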
Sparse updates like this greatly harm convergence when sparsity is extremely high. To solve this, the authors introduce momentum correction and local gradient clipping. Momentum and gradient clipping are both widely used in machine learning to preserve good convergence. Momentum adds a fraction of the previous update to the current gradient descent step, which helps reach a minimum faster and skip over shallow local minima. Gradient clipping puts an upper bound on the gradient, which limits the variance of the parameter update.
The only tweak here is to compute the momentum locally on each worker, instead of following the standard practice of computing it at the parameter server after all the gradients are combined. Gradient clipping is likewise applied locally. Nothing really exciting happening here.
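A rough sketch of what "local" means here, under stated assumptions: each worker clips its own gradient by norm, folds it into a local momentum buffer `u`, and accumulates that momentum into a buffer `v` that is later sparsified and sent. The names `clip_by_norm`, `local_momentum_step`, `u`, `v` and the default values are hypothetical choices for illustration.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Scale `grad` down so that its L2 norm does not exceed `max_norm`."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def local_momentum_step(grad, u, v, momentum=0.9, max_norm=1.0):
    """Worker-side update: clip locally, then apply momentum locally.

    u: local momentum buffer; v: locally accumulated update awaiting transmission.
    """
    g = clip_by_norm(grad, max_norm)   # clipping happens before accumulation
    u = momentum * u + g               # momentum computed on the worker, not the server
    v = v + u                          # accumulate the momentum-corrected update
    return u, v
```

The point of accumulating `u` rather than the raw gradient is that entries which are held back still receive the momentum they would have received under dense updates.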
The major problem with delaying small gradient updates is that the corresponding parameters become stale, which affects accuracy. This is corrected using momentum factor masking and warm-up training. Momentum correction in DGC introduces additional staleness of its own, because momentum keeps building up for parameters that are not updated for a long time (600 to 1000 iterations). Momentum factor masking alleviates this by applying the same mask to the momentum as to the accumulated gradients: when a parameter's update is finally transmitted, its local momentum is reset along with its accumulated gradient, while the momentum and gradients of the remaining parameters stay accumulated locally.
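Momentum factor masking can be illustrated as follows. This sketch assumes the mask is a simple magnitude test on the accumulated buffer and that the same mask clears both buffers for transmitted entries; the function name `mask_and_send` and the buffer names `u` and `v` are hypothetical.

```python
import numpy as np

def mask_and_send(u, v, threshold):
    """Split the accumulated update `v` into a sent part and a kept part.

    For every entry that is sent, both the accumulated value and the local
    momentum `u` are cleared, so old momentum is not reused after the update.
    """
    mask = np.abs(v) > threshold
    to_send = np.where(mask, v, 0.0)   # sparse update that goes over the network
    v_next = np.where(mask, 0.0, v)    # unsent entries keep accumulating
    u_next = np.where(mask, 0.0, u)    # momentum is masked with the same mask
    return to_send, u_next, v_next
```

Entries below the threshold are untouched, so their momentum and accumulated gradient carry over to the next iteration.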
During the early stages of training the parameters change rapidly, and delaying updates by sparsifying the gradients can slow convergence. The solution is to start with a less aggressive sparsity (e.g. 75%) and then ramp it up exponentially from epoch to epoch until it reaches 99.9%.
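One way to picture such a schedule is to shrink the fraction of gradients that are kept by a constant factor each epoch. The decay factor below (the kept fraction is multiplied by 0.25 per epoch) and the function name `warmup_sparsity` are assumptions for illustration; the text above only fixes the 75% starting point and the 99.9% target.

```python
def warmup_sparsity(epoch, start=0.75, final=0.999):
    """Sparsity to use at a given (0-indexed) epoch during warm-up.

    Assumed rule: the kept fraction shrinks geometrically,
    (1 - start) ** (epoch + 1), and sparsity is capped at `final`.
    """
    keep = (1.0 - start) ** (epoch + 1)
    return min(1.0 - keep, final)

# Example: print the schedule for the first few epochs.
if __name__ == "__main__":
    for e in range(6):
        print(e, warmup_sparsity(e))
```

Under these assumptions the schedule starts at 75%, passes through 93.75% at the second epoch, and is capped at 99.9% from the fifth epoch onward.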
The authors evaluate the method on ResNet, an LSTM language model and Deep Speech. DGC provides better compression ratios than the other contemporary methods while maintaining the baseline accuracy, and its convergence curve matches the baseline. They also evaluate the method on a large cluster with 64 machines and find that DGC on a 1 Gbps Ethernet network provides better speedup than conventional training on a 10 Gbps Ethernet interconnect.