Keywords: Deep Learning, Neural Network Compression, Energy Efficient Neural Networks, Computer Vision, Language Models.
1. Introduction
Deep Neural Networks (DNNs) have substantially pushed the limits of Artificial Intelligence (AI) in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky et al., 2012; Szegedy et al., 2014), speech recognition (Hinton et al., 2012;
Sainath et al., 2013), statistical machine translation (Devlin et al., 2014; Sutskever et al.,
2014; Bahdanau et al., 2015), Atari and Go games (Mnih et al., 2015; Silver et al., 2016),
and even computer generation of abstract art (Mordvintsev et al., 2015).
Training or even just using neural network (NN) algorithms on conventional general-purpose digital hardware (the von Neumann architecture) has been found to be highly inefficient due to the massive number of multiply-accumulate operations (MACs) required to compute the weighted sums of the neurons' inputs. Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphics Processing Units (GPUs) (Coates et al.,
2013). As a result, it is often a challenge to run DNNs on target low-power devices, and
substantial research efforts are invested in speeding up DNNs at run-time on both general-
purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015b)
and specialized computer hardware (Farabet et al., 2011a,b; Pham et al., 2012; Chen et al.,
2014a,b; Esser et al., 2015).
The most common approach is to compress a trained (full precision) network. Hashed-
Nets (Chen et al., 2015) reduce model sizes by using a hash function to randomly group
connection weights and force them to share a single parameter value. Gong et al. (2014)
compressed deep convnets using vector quantization, which resulted in only a 1% accuracy
loss. However, both methods focused only on the fully connected layers. A recent work by
Han and Dally (2015) successfully pruned several state-of-the-art large scale networks and
showed that the number of parameters could be reduced by an order of magnitude.
Recent works have shown that more computationally efficient DNNs can be constructed
by quantizing some of the parameters during the training phase. In most cases, DNNs are
trained by minimizing some error function using Back-Propagation (BP) or related gradient
descent methods. However, such an approach cannot be directly applied if the weights are
restricted to binary values. Soudry et al. (2014) used a variational Bayesian approach
with Mean-Field and Central Limit approximations to calculate the posterior distribution of the weights (the probability of each weight being +1 or -1). During the inference stage (test phase), their method samples one binary network from this distribution and uses it to predict the targets of the test set (more than one binary network can also be used).
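To make the sampling step concrete, here is a minimal NumPy sketch (an illustration, not the authors' implementation) that draws one binary weight matrix from per-weight probabilities; the function name, shapes, and probability values are assumptions made for the example.

```python
import numpy as np

def sample_binary_weights(p_plus, rng):
    """Draw a +1/-1 weight matrix, where p_plus[i, j] = P(w[i, j] = +1).
    (Illustrative helper; the name and interface are assumptions.)"""
    return np.where(rng.random(p_plus.shape) < p_plus, 1.0, -1.0)

rng = np.random.default_rng(0)
# Hypothetical posterior probabilities for a small 3x2 weight matrix.
p_plus = np.array([[0.9, 0.2],
                   [0.5, 0.7],
                   [0.1, 0.8]])
w_binary = sample_binary_weights(p_plus, rng)  # each entry is +1 or -1
```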
Courbariaux et al. (2015b) similarly used two sets of weights, real-valued and binary. However, they updated the real-valued weights using gradients computed by applying forward and backward propagation with the binary weights (which were obtained by quantizing the real-valued weights to +1 and -1).
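The following NumPy sketch illustrates this two-copy scheme for a single linear layer trained with a squared loss. The layer sizes, data, learning rate, and the clipping of the real-valued weights are assumptions chosen for the example, not the exact procedure of Courbariaux et al. (2015b).

```python
import numpy as np

def binarize(w):
    # Deterministic binarization to +1/-1 (zeros map to +1).
    return np.where(w >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
w_real = rng.normal(scale=0.1, size=(4, 3))   # real-valued "master" weights
x = rng.normal(size=(8, 4))                   # toy input batch
y = rng.normal(size=(8, 3))                   # toy targets
lr = 0.01

for step in range(100):
    w_bin = binarize(w_real)                  # quantize for propagation
    y_hat = x @ w_bin                         # forward pass uses the binary weights
    grad_out = 2.0 * (y_hat - y) / len(x)     # gradient of the mean squared error
    grad_w = x.T @ grad_out                   # backward pass w.r.t. the binary weights
    w_real -= lr * grad_w                     # ... but the update goes to the real weights
    w_real = np.clip(w_real, -1.0, 1.0)       # keep the real weights bounded (assumption)
```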
This study proposes a more advanced technique, referred to as Quantized Neural Net-
work (QNN), for quantizing the neurons and weights during inference and training. In such
networks, all MAC operations can be replaced with XNOR and population count (i.e.,
counting the number of ones in the binary number) operations. This is especially useful in
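As an illustration of the XNOR/population-count replacement mentioned above, the sketch below computes the dot product of two {-1, +1} vectors packed into integer bit masks; the encoding (bit 1 for +1, bit 0 for -1) and the helper name are choices made for this example.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors with entries in {-1, +1},
    each packed into an n-bit integer (1 encodes +1, 0 encodes -1).
    XNOR marks matching positions; each match contributes +1 and each
    mismatch -1, so the dot product equals 2 * popcount(xnor) - n."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    return 2 * bin(xnor).count("1") - n         # population count, then rescale

# a = [+1, -1, +1, +1] and b = [+1, +1, -1, +1], read left to right.
a_bits = 0b1011
b_bits = 0b1101
assert binary_dot(a_bits, b_bits, 4) == 0       # same result as the MAC version
```

In practice each packed machine word would hold many weights at once (e.g., 32 or 64), so a single XNOR and population count replaces that many MACs.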