ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky

University of Toronto

kriz@cs.utoronto.ca

Ilya Sutskever

University of Toronto

ilya@cs.utoronto.ca

Geoffrey E. Hinton

University of Toronto

hinton@cs.utoronto.ca

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

1 Introduction

Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small, on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current-best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.

To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have. Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have far fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
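
As a rough illustration of the parameter savings described above, the sketch below counts the learnable parameters of a fully-connected layer and of a convolutional layer that map the same input to an output of the same spatial size. The layer sizes (a 32 × 32 RGB input, 16 output feature maps, 5 × 5 kernels) and the helper names dense_params and conv_params are assumptions chosen for illustration; they are not taken from the network described in this paper.

```python
def dense_params(in_units: int, out_units: int) -> int:
    """Weights plus biases of a fully-connected layer."""
    return in_units * out_units + out_units


def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolutional layer with square kernels.

    The same kernel is reused at every spatial position, so the count does
    not depend on the image height or width.
    """
    return kernel * kernel * in_channels * out_channels + out_channels


# Hypothetical sizes, chosen only for illustration (not from the paper):
# a 32x32 RGB input mapped to a 32x32 output with 16 feature maps.
height, width, in_ch, out_ch, kernel = 32, 32, 3, 16, 5

dense = dense_params(height * width * in_ch, height * width * out_ch)
conv = conv_params(in_ch, out_ch, kernel)

print(f"fully-connected: {dense:,} parameters")
print(f"convolutional:   {conv:,} parameters")
print(f"ratio:           {dense / conv:,.0f}x")
```

Under these assumed sizes the fully-connected layer needs roughly fifty million parameters while the convolutional layer needs about twelve hundred, because each kernel is shared across all image positions and only connects to a local neighborhood.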

Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.

The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly¹. Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.

In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

2 The Dataset

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
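
The top-k error defined above can be sketched in a few lines. The function name top_k_error and the toy scores (4 images, 6 classes) are made up for illustration; a real evaluation would use the model's 1000-way softmax outputs over the full test set.

```python
from typing import Sequence


def top_k_error(scores: Sequence[Sequence[float]],
                labels: Sequence[int],
                k: int) -> float:
    """Fraction of examples whose true label is not among the k
    highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Made-up scores for 4 images over 6 classes (illustrative values only).
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true label 1: ranked first
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true label 2: ranked second
    [0.02, 0.08, 0.10, 0.20, 0.25, 0.35],  # true label 0: ranked last
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],  # true label 0: ranked first
]
labels = [1, 2, 0, 0]

print("top-1 error:", top_k_error(scores, labels, k=1))
print("top-5 error:", top_k_error(scores, labels, k=5))
```

With k = 1 the second and third images count as errors; with k = 5 only the third does, since its true label is ranked sixth of six.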

ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.
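
The geometry of this pre-processing can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the helper names, the rounding rule for the rescaled side, and the representation of an image as a flat list of pixel values are all assumptions, and the actual resampling of pixels is left to an image library.

```python
from typing import List, Tuple


def resized_shape(width: int, height: int, target: int = 256) -> Tuple[int, int]:
    """Scale (width, height) so the shorter side equals `target`,
    keeping the aspect ratio."""
    scale = target / min(width, height)
    # The rounding rule is an assumption; max() keeps both sides >= target.
    return max(target, round(width * scale)), max(target, round(height * scale))


def center_crop_box(width: int, height: int,
                    target: int = 256) -> Tuple[int, int, int, int]:
    """(left, top, right, bottom) of the central target x target patch."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


def mean_image(images: List[List[float]]) -> List[float]:
    """Per-position mean over a set of equally sized, flattened images."""
    return [sum(values) / len(images) for values in zip(*images)]


def subtract_mean(image: List[float], mean: List[float]) -> List[float]:
    """Center one flattened image by subtracting the training-set mean."""
    return [pixel - m for pixel, m in zip(image, mean)]


# A hypothetical 500 x 375 image: the shorter side (375) becomes 256.
new_w, new_h = resized_shape(500, 375)
print("rescaled size:", (new_w, new_h))
print("crop box:", center_crop_box(new_w, new_h))

# Two tiny made-up "images" of three values each stand in for a training set.
train = [[10.0, 20.0, 30.0], [30.0, 40.0, 50.0]]
mean = mean_image(train)
print("centered:", subtract_mean(train[0], mean))
```

Here the mean is computed per position over the training images, which is one reading of "the mean activity over the training set"; the same mean would then be subtracted from validation and test images.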

3 The Architecture

The architecture of our network is summarized in Figure 2. It contains eight learned layers: five convolutional and three fully-connected. Below, we describe some of the novel or unusual features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.

¹ http://code.google.com/p/cuda-convnet/