debated during the ILSVRC 2012 workshop. The central
issue can be distilled to the following: To what extent do
the CNN classification results on ImageNet generalize to
object detection results on the PASCAL VOC Challenge?
We answer this question by bridging the gap between
image classification and object detection. This paper is the
first to show that a CNN can lead to dramatically higher ob-
ject detection performance on PASCAL VOC as compared
to systems based on simpler HOG-like features. To achieve
this result, we focused on two problems: localizing objects
with a deep network and training a high-capacity model
with only a small quantity of annotated detection data.
Unlike image classification, detection requires localiz-
ing (likely many) objects within an image. One approach
frames localization as a regression problem. However, work
from Szegedy et al. [38], concurrent with our own, indi-
cates that this strategy may not fare well in practice (they
report a mAP of 30.5% on VOC 2007 compared to the
58.5% achieved by our method). An alternative is to build a
sliding-window detector. CNNs have been used in this way
for at least two decades, typically on constrained object cat-
egories, such as faces [32, 40] and pedestrians [35]. In order
to maintain high spatial resolution, these CNNs typically
only have two convolutional and pooling layers. We also
considered adopting a sliding-window approach. However,
units high up in our network, which has five convolutional
layers, have very large receptive fields (195 × 195 pixels)
and strides (32×32 pixels) in the input image, which makes
precise localization within the sliding-window paradigm an
open technical challenge.
Instead, we solve the CNN localization problem by oper-
ating within the “recognition using regions” paradigm [21],
which has been successful for both object detection [39] and
semantic segmentation [5]. At test time, our method gener-
ates around 2000 category-independent region proposals for
the input image, extracts a fixed-length feature vector from
each proposal using a CNN, and then classifies each region
with category-specific linear SVMs. We use a simple tech-
nique (affine image warping) to compute a fixed-size CNN
input from each region proposal, regardless of the region’s
shape. Figure 1 presents an overview of our method and
highlights some of our results. Since our system combines
region proposals with CNNs, we dub the method R-CNN:
Regions with CNN features.
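The test-time pipeline described above can be sketched in a few lines. This is an illustrative skeleton, not the authors' implementation: `warp_to_square` stands in for the paper's affine warp (nearest-neighbor resampling here), and `cnn_features`, `svm_weights`, and `svm_biases` are hypothetical placeholders for the trained CNN and per-class linear SVMs.

```python
import numpy as np

def warp_to_square(region, out_size=227):
    """Anisotropically resize a crop (H, W, 3) to (out_size, out_size, 3).
    A nearest-neighbor stand-in for the paper's affine warping, which
    maps any proposal shape to the CNN's fixed input size."""
    h, w = region.shape[:2]
    rows = (np.arange(out_size) * h / out_size).astype(int)
    cols = (np.arange(out_size) * w / out_size).astype(int)
    return region[rows][:, cols]

def rcnn_detect(image, proposals, cnn_features, svm_weights, svm_biases):
    """R-CNN test-time sketch: warp each region proposal, extract a
    fixed-length feature vector, then score every class with linear SVMs.
    Returns a (num_proposals, num_classes) score matrix."""
    feats = np.stack([cnn_features(warp_to_square(image[y0:y1, x0:x1]))
                      for (x0, y0, x1, y1) in proposals])
    # Class-specific work is just one matrix product per image.
    return feats @ svm_weights.T + svm_biases
```

In practice the feature extractor is run once per proposal (roughly 2000 per image), and the SVM scoring amortizes across all classes because the features are class-agnostic.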
In this updated version of this paper, we provide a head-
to-head comparison of R-CNN and the recently proposed
OverFeat [34] detection system by running R-CNN on the
200-class ILSVRC2013 detection dataset. OverFeat uses a
sliding-window CNN for detection and until now was the
best performing method on ILSVRC2013 detection. We
show that R-CNN significantly outperforms OverFeat, with
a mAP of 31.4% versus 24.3%.
A second challenge faced in detection is that labeled data
is scarce and the amount currently available is insufficient
for training a large CNN. The conventional solution to this
problem is to use unsupervised pre-training, followed by su-
pervised fine-tuning (e.g., [35]). The second principal con-
tribution of this paper is to show that supervised pre-training
on a large auxiliary dataset (ILSVRC), followed by domain-
specific fine-tuning on a small dataset (PASCAL), is an
effective paradigm for learning high-capacity CNNs when
data is scarce. In our experiments, fine-tuning for detection
improves mAP performance by 8 percentage points. After
fine-tuning, our system achieves a mAP of 54% on VOC
2010 compared to 33% for the highly-tuned, HOG-based
deformable part model (DPM) [17, 20]. We also point read-
ers to contemporaneous work by Donahue et al. [12], who
show that Krizhevsky’s CNN can be used (without fine-
tuning) as a blackbox feature extractor, yielding excellent
performance on several recognition tasks including scene
classification, fine-grained sub-categorization, and domain
adaptation.
Our system is also quite efficient. The only class-specific
computations are a reasonably small matrix-vector product
and greedy non-maximum suppression. This computational
property follows from features that are shared across all cat-
egories and that are also two orders of magnitude lower-
dimensional than previously used region features (cf. [39]).
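Greedy non-maximum suppression, the second of the two class-specific computations mentioned above, is simple enough to state concretely. The sketch below is a generic NMS implementation, not code from the paper; the 0.3 IoU threshold is illustrative.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression over (x0, y0, x1, y1) boxes.
    Repeatedly keep the highest-scoring remaining box and suppress
    every other box whose IoU with it exceeds the threshold.
    Returns the indices of the kept boxes."""
    order = np.argsort(scores)[::-1]          # high score first
    boxes = np.asarray(boxes, dtype=float)[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    for i in range(len(boxes)):
        if suppressed[i]:
            continue
        keep.append(int(order[i]))
        # Intersection-over-union of box i with all lower-scoring boxes.
        x0 = np.maximum(boxes[i, 0], boxes[i + 1:, 0])
        y0 = np.maximum(boxes[i, 1], boxes[i + 1:, 1])
        x1 = np.minimum(boxes[i, 2], boxes[i + 1:, 2])
        y1 = np.minimum(boxes[i, 3], boxes[i + 1:, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        iou = inter / (areas[i] + areas[i + 1:] - inter)
        suppressed[i + 1:] |= iou > iou_threshold
    return keep
```

Because suppression compares each kept box only against lower-scoring candidates, the whole pass is a handful of vectorized operations per detection.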
Understanding the failure modes of our approach is also
critical for improving it, and so we report results from the
detection analysis tool of Hoiem et al. [23]. As an im-
mediate consequence of this analysis, we demonstrate that
a simple bounding-box regression method significantly re-
duces mislocalizations, which are the dominant error mode.
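The bounding-box regression step maps a proposal to a corrected box via a scale-invariant parameterization: center offsets are scaled by the proposal's size, and width/height corrections act in log-space. The helper below is a minimal sketch of applying such targets; the function name and box convention (center x, center y, width, height) are our own.

```python
import numpy as np

def apply_bbox_regression(proposal, targets):
    """Apply regression targets (t_x, t_y, t_w, t_h) to a proposal
    (cx, cy, w, h). Offsets are scale-invariant: centers shift by a
    fraction of the proposal size, and width/height scale by exp(t)."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = targets
    return (pw * tx + px,          # corrected center x
            ph * ty + py,          # corrected center y
            pw * np.exp(tw),       # corrected width
            ph * np.exp(th))       # corrected height
```

Zero targets leave the proposal unchanged, so the regressor only needs to learn small residual corrections, which is what makes it effective against the mislocalization error mode.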
Before developing technical details, we note that because
R-CNN operates on regions it is natural to extend it to the
task of semantic segmentation. With minor modifications,
we also achieve competitive results on the PASCAL VOC
segmentation task, with an average segmentation accuracy
of 47.9% on the VOC 2011 test set.
2. Object detection with R-CNN
Our object detection system consists of three modules.
The first generates category-independent region proposals.
These proposals define the set of candidate detections avail-
able to our detector. The second module is a large convo-
lutional neural network that extracts a fixed-length feature
vector from each region. The third module is a set of class-
specific linear SVMs. In this section, we present our design
decisions for each module, describe their test-time usage,
detail how their parameters are learned, and show detection
results on PASCAL VOC 2010-12 and on ILSVRC2013.
2.1. Module design
Region proposals. A variety of recent papers offer meth-
ods for generating category-independent region proposals.