XNOR-Net: ImageNet Classification Using Binary
Convolutional Neural Networks
Mohammad Rastegari†, Vicente Ordonez†, Joseph Redmon∗, Ali Farhadi†∗
Allen Institute for AI†, University of Washington∗
{mohammadr,vicenteor}@allenai.org
{pjreddie,ali}@cs.washington.edu
Abstract. We propose two efficient approximations to standard convolutional
neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-
Networks, the filters are approximated with binary values resulting in 32× mem-
ory saving. In XNOR-Networks, both the filters and the input to convolutional
layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of the number of high-precision operations) and 32× memory savings. XNOR-Nets
offer the possibility of running state-of-the-art networks on CPUs (rather than
GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work
on challenging visual tasks. We evaluate our approach on the ImageNet classifi-
cation task. The classification accuracy of a Binary-Weight-Network version of AlexNet matches that of the full-precision AlexNet. We compare our method with
recent network binarization methods, BinaryConnect and BinaryNets, and out-
perform these methods by large margins on ImageNet, more than 16% in top-1
accuracy. Our code is available at: http://allenai.org/plato/xnornet.
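The binary-operation speedup described above rests on a simple identity: the dot product of two {−1, +1} vectors can be computed with an XNOR followed by a popcount on their bit-packed encodings (bit 1 for +1, bit 0 for −1). A minimal sketch of this primitive, with names of our own choosing rather than from the paper:

```python
def pack_bits(signs):
    """Pack a sequence of {-1,+1} values into an integer bitmask,
    one bit per value (bit i set means signs[i] == +1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def xnor_dot(a, b, n):
    """Dot product of two n-element {-1,+1} vectors given as bitmasks.

    XNOR yields 1 exactly where the signs agree; if k bits agree,
    the dot product is k - (n - k) = 2k - n.
    """
    mask = (1 << n) - 1
    agree = ~(a ^ b) & mask       # XNOR, restricted to the n valid bits
    k = bin(agree).count("1")     # popcount
    return 2 * k - n
```

On real hardware the XNOR and popcount operate on 32 or 64 weights per machine instruction, which is the source of the quoted speedup over one-at-a-time floating-point multiply-accumulates.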
1 Introduction
Deep neural networks (DNNs) have shown significant improvements in several application domains, including computer vision and speech recognition. In computer vision, a particular class of DNNs, known as convolutional neural networks (CNNs), has demonstrated state-of-the-art results in object recognition [1,2,3,4] and detection [5,6,7].
Convolutional neural networks show reliable results on object recognition and detection that are useful in real-world applications. Concurrently with this progress in recognition, notable advances have been made in virtual reality (VR by Oculus) [8], augmented reality (AR by HoloLens) [9], and smart wearable devices.
Putting these two pieces together, we argue that it is the right time to equip smart
portable devices with the power of state-of-the-art recognition systems. However, CNN-
based recognition systems need large amounts of memory and computational power.
While they perform well on expensive, GPU-based machines, they are often unsuitable
for smaller devices like cell phones and embedded electronics.
For example, AlexNet [1] has 61M parameters (249MB of memory) and performs 1.5B high-precision operations to classify one image. These numbers are even higher for deeper CNNs, e.g., VGG [2] (see Section 4.1). These models quickly overtax the limited
storage, battery power, and compute capabilities of smaller devices like cell phones.
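A quick back-of-the-envelope check of the figures above (a rough sketch only: the 249MB in the text presumably reflects the stored model file with some overhead beyond raw weights, while the 32× ratio of 32-bit floats to 1-bit weights is exact):

```python
# AlexNet parameter count from the text
PARAMS = 61_000_000

# Raw weight storage in MiB: 32-bit floats vs. 1 bit per weight
full_precision_mb = PARAMS * 32 / 8 / (1024 ** 2)
binary_mb = PARAMS * 1 / 8 / (1024 ** 2)

print(f"full precision: ~{full_precision_mb:.0f} MiB")
print(f"binary weights: ~{binary_mb:.1f} MiB")
print(f"memory saving: {full_precision_mb / binary_mb:.0f}x")
```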