Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
Itay Hubara* itayh@campus.technion.ac.il
Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa, Israel
Matthieu Courbariaux* matthieu.courbariaux@gmail.com
Department of Computer Science and Department of Statistics
Université de Montréal
Montréal, Canada
Daniel Soudry daniel.soudry@gmail.com
Department of Statistics
Columbia University
New York, USA
Ran El-Yaniv rani@cs.technion.ac.il
Department of Computer Science
Technion - Israel Institute of Technology
Haifa, Israel
Yoshua Bengio yoshua.umontreal@gmail.com
Department of Computer Science and Department of Statistics
Université de Montréal
Montréal, Canada
*Indicates first authors.
Abstract
We introduce a method to train Quantized Neural Networks (QNNs), neural networks with extremely low precision (e.g., 1-bit) weights and activations at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6 bits as well, which enables gradient computation using only bit-wise operations. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved accuracy comparable to their 32-bit counterparts using only 4 bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
Keywords: Deep Learning, Neural Networks Compression, Energy Efficient Neural Networks, Computer Vision, Language Models.
1. Introduction
Deep Neural Networks (DNNs) have substantially pushed Artificial Intelligence (AI) limits in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky et al., 2012; Szegedy et al., 2014), speech recognition (Hinton et al., 2012; Sainath et al., 2013), statistical machine translation (Devlin et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015), Atari and Go games (Mnih et al., 2015; Silver et al., 2016), and even computer generation of abstract art (Mordvintsev et al., 2015).
Training or even just using neural network (NN) algorithms on conventional general-purpose digital hardware (Von Neumann architecture) has been found highly inefficient due to the massive amount of multiply-accumulate operations (MACs) required to compute the weighted sums of the neurons' inputs. Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphic Processing Units (GPUs) (Coates et al., 2013). As a result, it is often a challenge to run DNNs on target low-power devices, and substantial research efforts are invested in speeding up DNNs at run-time on both general-purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015b) and specialized computer hardware (Farabet et al., 2011a,b; Pham et al., 2012; Chen et al., 2014a,b; Esser et al., 2015).
The most common approach is to compress a trained (full precision) network. HashedNets (Chen et al., 2015) reduce model sizes by using a hash function to randomly group connection weights and force them to share a single parameter value. Gong et al. (2014) compressed deep convnets using vector quantization, which resulted in only a 1% accuracy loss. However, both methods focused only on the fully connected layers. A recent work by Han and Dally (2015) successfully pruned several state-of-the-art large-scale networks and showed that the number of parameters could be reduced by an order of magnitude.
Recent works have shown that more computationally efficient DNNs can be constructed by quantizing some of the parameters during the training phase. In most cases, DNNs are trained by minimizing some error function using Back-Propagation (BP) or related gradient descent methods. However, such an approach cannot be directly applied if the weights are restricted to binary values. Soudry et al. (2014) used a variational Bayesian approach with Mean-Field and Central Limit approximations to calculate the posterior distribution of the weights (the probability of each weight being +1 or -1). During the inference stage (test phase), their method samples one binary network from this distribution and uses it to predict the targets of the test set (more than one binary network can also be used). Courbariaux et al. (2015b) similarly used two sets of weights, real-valued and binary. They, however, updated the real-valued version of the weights by using gradients computed by applying forward and backward propagation with the set of binary weights (which was obtained by quantizing the real-valued weights to +1 and -1).
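To make this two-copy scheme concrete, the following is a minimal NumPy sketch of one training loop for a single linear layer. It assumes deterministic sign binarization, a squared-error loss, and plain SGD; the variable names and the clipping of the real-valued accumulator to [-1, 1] are illustrative choices rather than a reproduction of the authors' code.

```python
import numpy as np

def binarize(w):
    # Deterministic binarization to {-1, +1} (ties at zero sent to +1).
    return np.where(w >= 0, 1.0, -1.0)

# Hypothetical single linear layer: the real-valued weights act only as an
# accumulator, while the binary copy is what forward and backward passes see.
rng = np.random.default_rng(0)
w_real = rng.uniform(-1, 1, size=(4, 8))    # (out_features, in_features)
x = rng.normal(size=(16, 8))                # a batch of 16 inputs
target = rng.normal(size=(16, 4))
lr = 0.01

for step in range(100):
    w_bin = binarize(w_real)                # re-quantize at every step
    y = x @ w_bin.T                         # forward pass uses binary weights
    grad_y = 2.0 * (y - target) / len(x)    # gradient of a squared-error loss
    grad_w = grad_y.T @ x                   # gradient w.r.t. the binary weights...
    w_real -= lr * grad_w                   # ...applied to the real-valued copy
    np.clip(w_real, -1.0, 1.0, out=w_real)  # keep the accumulator bounded
```

The essential point is that the binary copy is regenerated from the real-valued weights at every step, so small gradient updates can accumulate over time even though each individual forward pass only sees ±1 values.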
This study proposes a more advanced technique, referred to as Quantized Neural Network (QNN), for quantizing the neurons and weights during inference and training. In such networks, all MAC operations can be replaced with XNOR and population count (i.e., counting the number of ones in the binary number) operations. This is especially useful in
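For weights and activations constrained to {-1, +1} and packed into bit masks (encoding +1 as 1 and -1 as 0), the dot product at the heart of each MAC reduces to an XNOR followed by a population count, since dot = 2 * popcount(XNOR(a, b)) - n for length-n vectors. The short Python sketch below only demonstrates this identity on packed integers; it is not the GPU kernel mentioned in the abstract, and the helper names are illustrative.

```python
def pack_bits(v):
    # Encode a +/-1 vector as an integer bit mask: +1 -> bit 1, -1 -> bit 0.
    bits = 0
    for i, value in enumerate(v):
        if value > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    # Matching bits contribute +1, differing bits contribute -1, so
    # dot = popcount(XNOR) - (n - popcount(XNOR)) = 2 * popcount(XNOR) - n.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # keep only the n valid bits
    return 2 * bin(xnor).count("1") - n

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == sum(x * y for x, y in zip(a, b))
```

On real hardware the same identity is applied word by word to 32- or 64-bit chunks, which is what turns each weighted sum into a handful of bit-wise instructions instead of many multiply-accumulates.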