https://arxiv.org/pdf/1510.00149.pdf

This is a really interesting and very well written paper. I recommend reading it if you have time.

Embedded systems are constrained by limited resources and power availability. One of the major factors affecting both performance and energy consumption in embedded systems is memory usage. Neural networks are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. The goal of this paper is to reduce the storage and energy required to run **inference** on large DNNs so they can be deployed on mobile devices. Towards this goal, the authors propose Deep Compression: an amalgamation of three different compression techniques that work beautifully together to give superior compression rates without any degradation in prediction accuracy.

Neural networks like AlexNet and VGG-16 use a large number of parameters, which comes to around 200MB–800MB of storage. Deep Compression uses three different methods – pruning, weight sharing, and Huffman coding – to compress this huge weight matrix. At a high level, the model is first fully trained. Then the network is pruned by removing weights whose magnitude is below a threshold. Next, the remaining weights are quantized by computing centroids with a clustering algorithm; each weight is approximated by the nearest centroid of its cluster, thus employing weight sharing. After quantization, we keep only the centroids and a cluster index matrix that maps weights to centroids. Finally, all the indices are further compressed using Huffman encoding.

#### Network pruning

Network pruning is based on the intuition that edges with smaller weights in the network do not contribute much to the final prediction. After pruning, the network is retrained to learn the final weights for the remaining sparse connections. The sparse matrices can be efficiently stored using formats like CSR or CSC, which keep only the non-zero values and their indices. An additional compression technique is applied on top of this: instead of storing the full 32-bit index, they store only the difference between consecutive indices, which is usually small and can be encoded in 5–8 bits.
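The pruning and relative-index steps can be sketched in a few lines of numpy. This is my own illustration, not the authors' code: the threshold-based `prune` helper and the 5-bit gap bound are assumptions, and the filler-zero trick for gaps that exceed the bound follows the scheme described in the paper.

```python
import numpy as np

def prune(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold."""
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def relative_indices(flat_weights, max_gap=31):
    """Store gaps between non-zero positions instead of absolute indices.
    With 5-bit gaps, any gap larger than 31 is handled by inserting a
    filler zero (as in the paper)."""
    values, gaps = [], []
    last = -1
    for i, w in enumerate(flat_weights):
        if w == 0:
            continue
        gap = i - last
        while gap > max_gap:        # insert filler zeros for long gaps
            gaps.append(max_gap)
            values.append(0.0)
            gap -= max_gap
        gaps.append(gap)
        values.append(w)
        last = i
    return values, gaps
```

Decoding is just a cumulative sum over the gaps, so only the small gap values (and the non-zero weights) need to be stored.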

#### Trained quantization

After pruning, a clustering algorithm like k-means is run on the weight matrix of each layer separately. For example, the following 4×4 matrix is quantized to 4 shared weights (top image). Each weight in the original matrix is then approximated by its closest centroid. After this step, we are left with a cluster index matrix and a list of centroids. The indices for k clusters can be represented using log2(k) bits.
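A minimal 1-D k-means quantizer might look like the following sketch. The linear initialization over [min, max] is the scheme the paper recommends; the iteration count is an arbitrary choice of mine.

```python
import numpy as np

def quantize(weights, k, iters=20):
    """Cluster the weights into k centroids (1-D k-means) and return
    the centroids plus an index matrix mapping each weight to a cluster."""
    flat = weights.ravel()
    # linear initialization over [min, max], as suggested in the paper
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                centroids[c] = members.mean()
    return centroids, idx.reshape(weights.shape)
```

The quantized layer is then reconstructed as `centroids[idx]`, and only the centroids plus the log2(k)-bit indices need to be stored.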

After weight quantization, the centroids are fine-tuned by retraining. The gradient of the k-th centroid is calculated by summing the gradients of all the weights in the corresponding cluster (bottom image). Once the gradient is computed, a gradient descent step is performed on the centroid vector, as shown in the above figure.

#### Huffman encoding

The two compression steps above leave us with a small set of shared weights and a large number of indices. The indices can be further compressed with Huffman encoding, which assigns shorter bit strings to more frequent symbols.
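A standard heap-based Huffman code builder, applied to the cluster indices, could look like this sketch (this is the textbook algorithm, not the authors' implementation):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table from a list of symbols (e.g. cluster
    indices): frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # heap entry: [total_freq, tiebreak, [symbol, code], [symbol, code], ...]
    heap = [[f, i, [s, ""]] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:                 # left subtree gets a leading 0
            pair[1] = "0" + pair[1]
        for pair in hi[2:]:                 # right subtree gets a leading 1
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], counter] + lo[2:] + hi[2:])
        counter += 1
    return {s: code for s, code in heap[0][2:]}
```

Because the cluster indices (and the sparse index gaps) have a very skewed distribution after pruning and quantization, the variable-length codes end up noticeably shorter on average than a fixed-width encoding.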

#### Experiments

They performed a number of experiments and ablation studies on these methods; I'm only covering a few of them. First, they measured the prediction accuracy of AlexNet and VGG-16 before and after compression. They were able to compress the networks 35× and 49× respectively with absolutely no reduction in accuracy, which is pretty impressive (the accuracy of these models is not really great to begin with, however: 58% and 69% respectively). Pruning and quantization seem to work well together and give better compression. This is because, given a fixed number of centroids, the unpruned network is more susceptible to quantization error than the pruned one.

After analyzing the compression rates, they ran experiments on speedup and energy consumption. Deep Compression reduces the memory footprint at the cost of increased computation. Neural network computations show different speedups for batched and non-batched inference: batched inference is dominated by matrix-matrix multiplication (MM), whereas non-batched inference is mostly matrix-vector multiplication (MV). The ratio of memory access to computation is higher for MV, which means reducing memory accesses leads to a better speedup (3–4×). MM, on the other hand, has a lower memory-access cost per computation and can exploit cache locality more efficiently. As a result, for batched inference the compressed model performs worse than the uncompressed version. They see a similar trend in energy consumption as well.
