
Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps
are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on
the feature map. (c) We use pyramids of reference boxes in the regression functions.
Unlike prevalent methods that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel "anchor" boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images, and thus benefits running speed.
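The pyramid of regression references can be pictured concretely: at each sliding-window position, k reference boxes are enumerated by crossing a few scales with a few aspect ratios. The sketch below is an illustrative reconstruction, not the released code; the particular base size, scales, and ratios are assumptions chosen for illustration.

```python
import math

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Enumerate reference ("anchor") boxes at multiple scales and aspect
    ratios, all centered on one sliding-window position.

    Returns a list of (x1, y1, x2, y2) boxes centered at the origin,
    one per (scale, ratio) pair -- i.e., k = len(scales) * len(ratios).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Fix the anchor area at (base_size * scale)^2 and vary only
            # the height/width ratio.
            area = float(base_size * scale) ** 2
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2.0, -h / 2.0, w / 2.0, h / 2.0))
    return anchors

anchors = generate_anchors()
print(len(anchors))  # 9 anchors (3 scales x 3 aspect ratios) per position
```

Because the anchors, rather than the input images or the filters, carry the multi-scale structure, the network itself can operate on single-scale images and single-size filters.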
To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.¹
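The control flow of this alternating scheme can be summarized as a short training loop. The sketch below is only a schematic of the 4-step alternating optimization: `train_rpn`, `generate_proposals`, and `train_detector` are hypothetical placeholders standing in for full training runs, and the code shows ordering and feature sharing, not an actual implementation.

```python
def alternating_training(train_rpn, generate_proposals, train_detector):
    """Schematic of 4-step alternating training between an RPN and a
    Fast R-CNN detector. The three callables are hypothetical stubs."""
    # Step 1: train the RPN, starting from an ImageNet-pretrained backbone.
    rpn = train_rpn(shared_backbone=None)
    # Step 2: train the detector on the RPN's proposals, also from an
    # ImageNet-pretrained backbone (no shared features yet).
    detector = train_detector(generate_proposals(rpn), shared_backbone=None)
    # Step 3: re-train the RPN on top of the detector's backbone, keeping
    # the shared convolutional layers fixed.
    rpn = train_rpn(shared_backbone=detector)
    # Step 4: fine-tune the detector-only layers, with the shared layers
    # fixed and the step-3 proposals held constant.
    detector = train_detector(generate_proposals(rpn), shared_backbone=detector)
    return rpn, detector

# Minimal dry run that just records the sequence of stages.
log = []
rpn, det = alternating_training(
    train_rpn=lambda shared_backbone: log.append("rpn") or "rpn_model",
    generate_proposals=lambda rpn: log.append("proposals") or "boxes",
    train_detector=lambda props, shared_backbone: log.append("det") or "det_model",
)
print(log)  # ['rpn', 'proposals', 'det', 'rpn', 'proposals', 'det']
```

The key property is that after steps 3 and 4 the two tasks read from the same fixed convolutional features, so proposal computation adds almost no cost at test time.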
We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11], where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all of the computational burden of Selective Search at test time: the effective running time for proposals is just 10 milliseconds. Even with the expensive very deep models of [3], our detection method still runs at a frame rate of 5 fps (including all steps) on a GPU, and is thus a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).
A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built into commercial systems such as at Pinterest [17], with reported improvements in user engagement.

¹ Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks, leading to shorter training time.
In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN were also used by several other leading entries in these competitions.² These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.
2 RELATED WORK
Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).
Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify proposal regions into object categories or background. R-CNN mainly plays the role of a classifier, and it does not predict object bounds (except for refining them by bounding-box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task, which assumes a single object. The fully-connected layer is then turned
² http://image-net.org/challenges/LSVRC/2015/results