
Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture,
they have still been prohibitively expensive to apply at large scale to high-resolution images. Luckily,
current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful
enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet
contain enough labeled examples to train such models without severe overfitting.
The specific contributions of this paper are as follows: we trained one of the largest convolutional
neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012
competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a
highly-optimized GPU implementation of 2D convolution and all the other operations inherent in
training convolutional neural networks, which we make available publicly¹. Our network contains
a number of new and unusual features which improve its performance and reduce its training time,
which are detailed in Section 3. The size of our network made overfitting a significant problem, even
with 1.2 million labeled training examples, so we used several effective techniques for preventing
overfitting, which are described in Section 4. Our final network contains five convolutional and
three fully-connected layers, and this depth seems to be important: we found that removing any
convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in
inferior performance.
In the end, the network’s size is limited mainly by the amount of memory available on current GPUs
and by the amount of training time that we are willing to tolerate. Our network takes between five
and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results
can be improved simply by waiting for faster GPUs and bigger datasets to become available.
2 The Dataset
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000
categories. The images were collected from the web and labeled by human labelers using Amazon’s
Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object
Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge
(ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of
1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and
150,000 testing images.
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is
the version on which we performed most of our experiments. Since we also entered our model in
the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as
well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates:
top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label
is not among the five labels considered most probable by the model.
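The top-k error rate defined above can be illustrated with a minimal sketch. This is not the evaluation code used for ILSVRC; the function name `top_k_error` and the input layout (one list of per-class scores per image, with integer class labels) are assumptions made for illustration. Setting k=1 gives the top-1 error rate, and k=5 gives the top-5 error rate.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k
    highest-scoring classes.

    scores: list of per-example lists of class scores (hypothetical layout).
    labels: list of integer class indices, one per example.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)
```

Ties between equal scores are broken in favor of the lower class index here, which is an arbitrary choice of this sketch.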
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality.
Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a
rectangular image, we first rescaled the image such that the shorter side was of length 256, and then
cropped out the central 256×256 patch from the resulting image. We did not pre-process the images
in any other way, except for subtracting the mean activity over the training set from each pixel. So
we trained our network on the (centered) raw RGB values of the pixels.
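The geometry of this preprocessing can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `resize_and_crop_box` and `subtract_mean` are hypothetical, the rounding of the rescaled longer side is an assumption (the text does not specify it, nor the interpolation method), and the mean is assumed to be stored as one value per pixel position and channel.

```python
def resize_and_crop_box(width, height, target=256):
    """Return the rescaled size and the central crop box for one image.

    The shorter side is rescaled to `target`; the longer side keeps the
    aspect ratio (rounded to an integer, an assumption of this sketch).
    The crop box is (left, top, right, bottom) in the rescaled image.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    short = min(width, height)
    new_w = max(target, round(width * target / short))
    new_h = max(target, round(height * target / short))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def subtract_mean(pixels, mean_pixels):
    """Center raw pixel values by subtracting the training-set mean.

    Both arguments are flat sequences of equal length (hypothetical layout).
    """
    if len(pixels) != len(mean_pixels):
        raise ValueError("pixels and mean_pixels must have the same length")
    return [p - m for p, m in zip(pixels, mean_pixels)]
```

For example, a 500 × 375 image would be rescaled to 341 × 256 under these assumptions, and the central 256 × 256 patch would then be taken from it.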
3 The Architecture
The architecture of our network is summarized in Figure 2. It contains eight learned layers —
five convolutional and three fully-connected. Below, we describe some of the novel or unusual
features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of
their importance, with the most important first.
¹ http://code.google.com/p/cuda-convnet/