暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
02.ImageNet Classification with Deep Convolutional Neural Networks.pdf
183
9页
0次
2021-02-22
50墨值下载
ImageNet Classification with Deep Convolutional
Neural Networks
Alex Krizhevsky
University of Toronto
kriz@cs.utoronto.ca
Ilya Sutskever
University of Toronto
ilya@cs.utoronto.ca
Geoffrey E. Hinton
University of Toronto
hinton@cs.utoronto.ca
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-
ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of five convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a final 1000-way softmax. To make train-
ing faster, we used non-saturating neurons and a very efficient GPU implemen-
tation of the convolution operation. To reduce overfitting in the fully-connected
layers we employed a recently-developed regularization method called “dropout”
that proved to be very effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%,
compared to 26.2% achieved by the second-best entry.
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To im-
prove their performance, we can collect larger datasets, learn more powerful models, and use bet-
ter techniques for preventing overfitting. Until recently, datasets of labeled images were relatively
small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and
CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size,
especially if they are augmented with label-preserving transformations. For example, the current-
best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is
necessary to use much larger training sets. And indeed, the shortcomings of small image datasets
have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to col-
lect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which
consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of
over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning
capacity. However, the immense complexity of the object recognition task means that this prob-
lem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots
of prior knowledge to compensate for all the data we don’t have. Convolutional neural networks
(CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be con-
trolled by varying their depth and breadth, and they also make strong and mostly correct assumptions
about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have
much fewer connections and parameters and so they are easier to train, while their theoretically-best
performance is likely to be only slightly worse.
1
Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture,
they have still been prohibitively expensive to apply in large scale to high-resolution images. Luck-
ily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful
enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet
contain enough labeled examples to train such models without severe overfitting.
The specific contributions of this paper are as follows: we trained one of the largest convolutional
neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012
competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a
highly-optimized GPU implementation of 2D convolution and all the other operations inherent in
training convolutional neural networks, which we make available publicly
1
. Our network contains
a number of new and unusual features which improve its performance and reduce its training time,
which are detailed in Section 3. The size of our network made overfitting a significant problem, even
with 1.2 million labeled training examples, so we used several effective techniques for preventing
overfitting, which are described in Section 4. Our final network contains five convolutional and
three fully-connected layers, and this depth seems to be important: we found that removing any
convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in
inferior performance.
In the end, the network’s size is limited mainly by the amount of memory available on current GPUs
and by the amount of training time that we are willing to tolerate. Our network takes between five
and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results
can be improved simply by waiting for faster GPUs and bigger datasets to become available.
2 The Dataset
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000
categories. The images were collected from the web and labeled by human labelers using Ama-
zon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object
Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge
(ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of
1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and
150,000 testing images.
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is
the version on which we performed most of our experiments. Since we also entered our model in
the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as
well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates:
top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label
is not among the five labels considered most probable by the model.
ImageNet consists of variable-resolution images, while our system requires a constant input dimen-
sionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a
rectangular image, we first rescaled the image such that the shorter side was of length 256, and then
cropped out the central 256×256 patch from the resulting image. We did not pre-process the images
in any other way, except for subtracting the mean activity over the training set from each pixel. So
we trained our network on the (centered) raw RGB values of the pixels.
3 The Architecture
The architecture of our network is summarized in Figure 2. It contains eight learned layers
five convolutional and three fully-connected. Below, we describe some of the novel or unusual
features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of
their importance, with the most important first.
1
http://code.google.com/p/cuda-convnet/
2
of 9
50墨值下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜