Keywords: Deep Learning, Neural Network Compression, Energy Efficient Neural Networks, Computer Vision, Language Models.
1. Introduction
Deep Neural Networks (DNNs) have substantially pushed the limits of Artificial Intelligence (AI) in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky et al., 2012; Szegedy et al., 2014), speech recognition (Hinton et al., 2012;
Sainath et al., 2013), statistical machine translation (Devlin et al., 2014; Sutskever et al.,
2014; Bahdanau et al., 2015), Atari and Go games (Mnih et al., 2015; Silver et al., 2016),
and even computer generation of abstract art (Mordvintsev et al., 2015).
Training or even just using neural network (NN) algorithms on conventional general-purpose digital hardware (the von Neumann architecture) has been found to be highly inefficient due to the massive number of multiply-accumulate operations (MACs) required to compute the weighted sums of the neurons' inputs. Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphics Processing Units (GPUs) (Coates et al.,
2013). As a result, it is often a challenge to run DNNs on target low-power devices, and
substantial research efforts are invested in speeding up DNNs at run-time on both general-
purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015b)
and specialized computer hardware (Farabet et al., 2011a,b; Pham et al., 2012; Chen et al.,
2014a,b; Esser et al., 2015).
The most common approach is to compress a trained (full precision) network. Hashed-
Nets (Chen et al., 2015) reduce model sizes by using a hash function to randomly group
connection weights and force them to share a single parameter value. Gong et al. (2014)
compressed deep convnets using vector quantization, which resulted in only a 1% accuracy
loss. However, both methods focused only on the fully connected layers. A recent work by
Han and Dally (2015) successfully pruned several state-of-the-art large scale networks and
showed that the number of parameters could be reduced by an order of magnitude.
Recent works have shown that more computationally efficient DNNs can be constructed
by quantizing some of the parameters during the training phase. In most cases, DNNs are
trained by minimizing some error function using Back-Propagation (BP) or related gradient
descent methods. However, such an approach cannot be directly applied if the weights are
restricted to binary values. Soudry et al. (2014) used a variational Bayesian approach
with Mean-Field and Central Limit approximations to calculate the posterior distribution of the weights (the probability of each weight being +1 or -1). During the inference stage (test phase), their method samples one binary network from this distribution and uses it to predict the targets of the test set (more than one binary network can also be used).
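To make the sampling step concrete, here is a minimal NumPy sketch (an illustration, not the authors' implementation) that draws one binary weight matrix from per-weight probabilities; the function name, shapes, and probability values are assumptions made for the example.

```python
import numpy as np

def sample_binary_weights(p_plus, rng):
    """Draw a +1/-1 weight matrix, where p_plus[i, j] = P(w[i, j] = +1).
    (Illustrative helper; the name and interface are assumptions.)"""
    return np.where(rng.random(p_plus.shape) < p_plus, 1.0, -1.0)

rng = np.random.default_rng(0)
# Hypothetical posterior probabilities for a small 3x2 weight matrix.
p_plus = np.array([[0.9, 0.2],
                   [0.5, 0.7],
                   [0.1, 0.8]])
w_binary = sample_binary_weights(p_plus, rng)  # each entry is +1 or -1
```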
Courbariaux et al. (2015b) similarly used two sets of weights, real-valued and binary. However, they updated the real-valued weights using gradients computed by applying forward and backward propagation with the binary weights (which were obtained by quantizing the real-valued weights to +1 and -1).
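The following NumPy sketch illustrates this two-copy scheme for a single linear layer trained with a squared loss. The layer sizes, data, learning rate, and the clipping of the real-valued weights are assumptions chosen for the example, not the exact procedure of Courbariaux et al. (2015b).

```python
import numpy as np

def binarize(w):
    # Deterministic binarization to +1/-1 (zeros map to +1).
    return np.where(w >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
w_real = rng.normal(scale=0.1, size=(4, 3))   # real-valued "master" weights
x = rng.normal(size=(8, 4))                   # toy input batch
y = rng.normal(size=(8, 3))                   # toy targets
lr = 0.01

for step in range(100):
    w_bin = binarize(w_real)                  # quantize for propagation
    y_hat = x @ w_bin                         # forward pass uses the binary weights
    grad_out = 2.0 * (y_hat - y) / len(x)     # gradient of the mean squared error
    grad_w = x.T @ grad_out                   # backward pass w.r.t. the binary weights
    w_real -= lr * grad_w                     # ... but the update goes to the real weights
    w_real = np.clip(w_real, -1.0, 1.0)       # keep the real weights bounded (assumption)
```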
This study proposes a more advanced technique, referred to as Quantized Neural Net-
work (QNN), for quantizing the neurons and weights during inference and training. In such
networks, all MAC operations can be replaced with XNOR and population count (i.e.,
counting the number of ones in the binary number) operations. This is especially useful in
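As an illustration of the XNOR/population-count replacement mentioned above, the sketch below computes the dot product of two {-1, +1} vectors packed into integer bit masks; the encoding (bit 1 for +1, bit 0 for -1) and the helper name are choices made for this example.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors with entries in {-1, +1},
    each packed into an n-bit integer (1 encodes +1, 0 encodes -1).
    XNOR marks matching positions; each match contributes +1 and each
    mismatch -1, so the dot product equals 2 * popcount(xnor) - n."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    return 2 * bin(xnor).count("1") - n         # population count, then rescale

# a = [+1, -1, +1, +1] and b = [+1, +1, -1, +1], read left to right.
a_bits = 0b1011
b_bits = 0b1101
assert binary_dot(a_bits, b_bits, 4) == 0       # same result as the MAC version
```

In practice each packed machine word would hold many weights at once (e.g., 32 or 64), so a single XNOR and population count replaces that many MACs.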