debated during the ILSVRC 2012 workshop. The central
issue can be distilled to the following: To what extent do
the CNN classification results on ImageNet generalize to
object detection results on the PASCAL VOC Challenge?
We answer this question by bridging the gap between
image classification and object detection. This paper is the
first to show that a CNN can lead to dramatically higher ob-
ject detection performance on PASCAL VOC as compared
to systems based on simpler HOG-like features. To achieve
this result, we focused on two problems: localizing objects
with a deep network and training a high-capacity model
with only a small quantity of annotated detection data.
Unlike image classification, detection requires localiz-
ing (likely many) objects within an image. One approach
frames localization as a regression problem. However, work
from Szegedy et al. [38], concurrent with our own, indi-
cates that this strategy may not fare well in practice (they
report a mAP of 30.5% on VOC 2007 compared to the
58.5% achieved by our method). An alternative is to build a
sliding-window detector. CNNs have been used in this way
for at least two decades, typically on constrained object cat-
egories, such as faces [32, 40] and pedestrians [35]. In order
to maintain high spatial resolution, these CNNs typically
only have two convolutional and pooling layers. We also
considered adopting a sliding-window approach. However,
units high up in our network, which has five convolutional
layers, have very large receptive fields (195 × 195 pixels)
and strides (32×32 pixels) in the input image, which makes
precise localization within the sliding-window paradigm an
open technical challenge.
Instead, we solve the CNN localization problem by oper-
ating within the “recognition using regions” paradigm [21],
which has been successful for both object detection [39] and
semantic segmentation [5]. At test time, our method gener-
ates around 2000 category-independent region proposals for
the input image, extracts a fixed-length feature vector from
each proposal using a CNN, and then classifies each region
with category-specific linear SVMs. We use a simple tech-
nique (affine image warping) to compute a fixed-size CNN
input from each region proposal, regardless of the region’s
shape. Figure 1 presents an overview of our method and
highlights some of our results. Since our system combines
region proposals with CNNs, we dub the method R-CNN:
Regions with CNN features.
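The test-time pipeline described above can be sketched in a few lines. This is an illustrative skeleton, not the authors' implementation: `warp_to_square` stands in for the paper's affine warp (nearest-neighbor resampling here), and `cnn_features`, `svm_weights`, and `svm_biases` are hypothetical placeholders for the trained CNN and per-class linear SVMs.

```python
import numpy as np

def warp_to_square(region, out_size=227):
    """Anisotropically resize a crop (H, W, 3) to (out_size, out_size, 3).
    A nearest-neighbor stand-in for the paper's affine warping, which
    maps any proposal shape to the CNN's fixed input size."""
    h, w = region.shape[:2]
    rows = (np.arange(out_size) * h / out_size).astype(int)
    cols = (np.arange(out_size) * w / out_size).astype(int)
    return region[rows][:, cols]

def rcnn_detect(image, proposals, cnn_features, svm_weights, svm_biases):
    """R-CNN test-time sketch: warp each region proposal, extract a
    fixed-length feature vector, then score every class with linear SVMs.
    Returns a (num_proposals, num_classes) score matrix."""
    feats = np.stack([cnn_features(warp_to_square(image[y0:y1, x0:x1]))
                      for (x0, y0, x1, y1) in proposals])
    # Class-specific work is just one matrix product per image.
    return feats @ svm_weights.T + svm_biases
```

In practice the feature extractor is run once per proposal (roughly 2000 per image), and the SVM scoring amortizes across all classes because the features are class-agnostic.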
In this updated version of this paper, we provide a head-
to-head comparison of R-CNN and the recently proposed
OverFeat [34] detection system by running R-CNN on the
200-class ILSVRC2013 detection dataset. OverFeat uses a
sliding-window CNN for detection and until now was the
best performing method on ILSVRC2013 detection. We
show that R-CNN significantly outperforms OverFeat, with
a mAP of 31.4% versus 24.3%.
A second challenge faced in detection is that labeled data
is scarce and the amount currently available is insufficient
for training a large CNN. The conventional solution to this
problem is to use unsupervised pre-training, followed by su-
pervised fine-tuning (e.g., [35]). The second principal con-
tribution of this paper is to show that supervised pre-training
on a large auxiliary dataset (ILSVRC), followed by domain-
specific fine-tuning on a small dataset (PASCAL), is an
effective paradigm for learning high-capacity CNNs when
data is scarce. In our experiments, fine-tuning for detection
improves mAP performance by 8 percentage points. After
fine-tuning, our system achieves a mAP of 54% on VOC
2010 compared to 33% for the highly-tuned, HOG-based
deformable part model (DPM) [17, 20]. We also point read-
ers to contemporaneous work by Donahue et al. [12], who
show that Krizhevsky’s CNN can be used (without fine-
tuning) as a blackbox feature extractor, yielding excellent
performance on several recognition tasks including scene
classification, fine-grained sub-categorization, and domain
adaptation.
Our system is also quite efficient. The only class-specific
computations are a reasonably small matrix-vector product
and greedy non-maximum suppression. This computational
property follows from features that are shared across all cat-
egories and that are also two orders of magnitude lower-
dimensional than previously used region features (cf. [39]).
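Greedy non-maximum suppression, the second of the two class-specific computations mentioned above, is simple enough to state concretely. The sketch below is a generic NMS implementation, not code from the paper; the 0.3 IoU threshold is illustrative.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression over (x0, y0, x1, y1) boxes.
    Repeatedly keep the highest-scoring remaining box and suppress
    every other box whose IoU with it exceeds the threshold.
    Returns the indices of the kept boxes."""
    order = np.argsort(scores)[::-1]          # high score first
    boxes = np.asarray(boxes, dtype=float)[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    for i in range(len(boxes)):
        if suppressed[i]:
            continue
        keep.append(int(order[i]))
        # Intersection-over-union of box i with all lower-scoring boxes.
        x0 = np.maximum(boxes[i, 0], boxes[i + 1:, 0])
        y0 = np.maximum(boxes[i, 1], boxes[i + 1:, 1])
        x1 = np.minimum(boxes[i, 2], boxes[i + 1:, 2])
        y1 = np.minimum(boxes[i, 3], boxes[i + 1:, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        iou = inter / (areas[i] + areas[i + 1:] - inter)
        suppressed[i + 1:] |= iou > iou_threshold
    return keep
```

Because suppression compares each kept box only against lower-scoring candidates, the whole pass is a handful of vectorized operations per detection.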
Understanding the failure modes of our approach is also
critical for improving it, and so we report results from the
detection analysis tool of Hoiem et al. [23]. As an im-
mediate consequence of this analysis, we demonstrate that
a simple bounding-box regression method significantly re-
duces mislocalizations, which are the dominant error mode.
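The bounding-box regression step maps a proposal to a corrected box via a scale-invariant parameterization: center offsets are scaled by the proposal's size, and width/height corrections act in log-space. The helper below is a minimal sketch of applying such targets; the function name and box convention (center x, center y, width, height) are our own.

```python
import numpy as np

def apply_bbox_regression(proposal, targets):
    """Apply regression targets (t_x, t_y, t_w, t_h) to a proposal
    (cx, cy, w, h). Offsets are scale-invariant: centers shift by a
    fraction of the proposal size, and width/height scale by exp(t)."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = targets
    return (pw * tx + px,          # corrected center x
            ph * ty + py,          # corrected center y
            pw * np.exp(tw),       # corrected width
            ph * np.exp(th))       # corrected height
```

Zero targets leave the proposal unchanged, so the regressor only needs to learn small residual corrections, which is what makes it effective against the mislocalization error mode.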
Before developing technical details, we note that because
R-CNN operates on regions it is natural to extend it to the
task of semantic segmentation. With minor modifications,
we also achieve competitive results on the PASCAL VOC
segmentation task, with an average segmentation accuracy
of 47.9% on the VOC 2011 test set.
2. Object detection with R-CNN
Our object detection system consists of three modules.
The first generates category-independent region proposals.
These proposals define the set of candidate detections avail-
able to our detector. The second module is a large convo-
lutional neural network that extracts a fixed-length feature
vector from each region. The third module is a set of class-
specific linear SVMs. In this section, we present our design
decisions for each module, describe their test-time usage,
detail how their parameters are learned, and show detection
results on PASCAL VOC 2010-12 and on ILSVRC2013.
2.1. Module design
Region proposals. A variety of recent papers offer meth-
ods for generating category-independent region proposals.