Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

S. Ren is with the University of Science and Technology of China, Hefei, Anhui 230026, China (e-mail: sqren@mail.ustc.edu.cn). K. He and J. Sun are with the Visual Computing Group, Microsoft Research, Beijing 100080, China (e-mail: {kahe, jiansun}@microsoft.com). R. Girshick is with Facebook AI Research, Seattle, WA 98109 (e-mail: rbg@fb.com).
Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances
like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation
as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the
detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts
object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which
are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional
features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified
network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a
GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300
proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning
entries in several tracks. Code has been made publicly available.
Index Terms—Object detection, region proposal, convolutional neural network
1 INTRODUCTION
Recent advances in object detection are driven by the
success of region proposal methods (e.g., [4]) and
region-based convolutional neural networks (R-CNNs) [5].
Although region-based CNNs were computationally expen-
sive as originally developed in [5], their cost has been drasti-
cally reduced thanks to sharing convolutions across
proposals [1], [2]. The latest incarnation, Fast R-CNN [2],
achieves near real-time rates using very deep networks [3],
when ignoring the time spent on region proposals. Now, pro-
posals are the test-time computational bottleneck in state-
of-the-art detection systems.
Region proposal methods typically rely on inexpensive
features and economical inference schemes. Selective Search
[4], one of the most popular methods, greedily merges super-
pixels based on engineered low-level features. Yet when com-
pared to efficient detection networks [2], Selective Search is an
order of magnitude slower, at 2 seconds per image in a CPU
implementation. EdgeBoxes [6] currently provides the best
tradeoff between proposal quality and speed, at 0.2 seconds
per image. Nevertheless, the region proposal step still con-
sumes as much running time as the detection network.
One may note that fast region-based CNNs take advantage
of GPUs, while the region proposal methods used in research
are implemented on the CPU, making such runtime
comparisons inequitable. An obvious way to accelerate pro-
posal computation is to re-implement it for the GPU. This
may be an effective engineering solution, but re-implementa-
tion ignores the down-stream detection network and there-
fore misses important opportunities for sharing computation.
In this paper, we show that an algorithmic change—com-
puting proposals with a deep convolutional neural net-
work—leads to an elegant and effective solution where
proposal computation is nearly cost-free given the detection
network’s computation. To this end, we introduce novel
Region Proposal Networks (RPNs) that share convolutional
layers with state-of-the-art object detection networks [1], [2].
By sharing convolutions at test-time, the marginal cost for
computing proposals is small (e.g., 10 ms per image).
Our observation is that the convolutional feature maps
used by region-based detectors, like Fast R-CNN, can also
be used for generating region proposals. On top of these
convolutional features, we construct an RPN by adding a
few additional convolutional layers that simultaneously
regress region bounds and objectness scores at each location
on a regular grid. The RPN is thus a kind of fully convolu-
tional network (FCN) [7] and can be trained end-to-end spe-
cifically for the task of generating detection proposals.
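To make this construction concrete, here is a minimal sketch of such an RPN head, assuming PyTorch (a framework choice of ours; the released implementations use MATLAB/Caffe and Python/Caffe). The 3×3 convolution followed by two sibling 1×1 convolutions, with 2 objectness scores and 4 box offsets per anchor, follows the design described in this paper; the class name is illustrative, and the nine anchors per location match the default configuration reported later in the paper.

```python
# Minimal sketch of an RPN head (PyTorch assumed; names illustrative).
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # A 3x3 conv slides over the shared full-image feature map.
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Two sibling 1x1 convs: 2 objectness scores and 4 box offsets
        # per anchor at every grid position.
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)

# Example: a VGG-16 conv5_3-like feature map for a ~600x800 input.
feat = torch.randn(1, 512, 38, 50)
scores, deltas = RPNHead()(feat)  # shapes (1, 18, 38, 50) and (1, 36, 38, 50)
```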
RPNs are designed to efficiently predict region proposals
with a wide range of scales and aspect ratios. In contrast to
prevalent methods [1], [2], [8], [9] that use pyramids of
images (Fig. 1a) or pyramids of filters (Fig. 1b), we introduce
novel “anchor” boxes that serve as references at multiple
scales and aspect ratios. Our scheme can be thought of as a
pyramid of regression references (Fig. 1c), which avoids
enumerating images or filters of multiple scales or aspect
ratios. This model performs well when trained and tested
using single-scale images and thus benefits running speed.
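As an illustration of the anchor scheme, the sketch below enumerates reference boxes on a regular grid at three scales and three aspect ratios, matching the defaults reported later in the paper; the stride of 16 (VGG-16's conv5 stride), the function name, and the (x1, y1, x2, y2) coordinate convention are assumptions made for this example.

```python
# Sketch of anchor enumeration over the feature-map grid (NumPy).
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * stride, y * stride      # center in image coords
            for s in scales:                     # anchor area = s * s
                for r in ratios:                 # r = width / height
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = make_anchors(38, 50)  # 38 * 50 * 9 = 17,100 reference boxes
```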
To unify RPNs with Fast R-CNN [2] object detection
networks, we propose a training scheme that alternates
between fine-tuning for the region proposal task and then
fine-tuning for object detection, while keeping the proposals
fixed. This scheme converges quickly and produces a uni-
fied network with convolutional features that are shared
between both tasks.¹
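The sketch below compresses this alternation into two toy phases, assuming PyTorch: the shared convolutions are first fine-tuned together with the proposal head, then frozen while the detection head is fine-tuned on fixed proposals, so both tasks end up computing from the same features. The modules and squared-activation "losses" are stand-ins, not the actual RPN and Fast R-CNN objectives or the paper's full training recipe.

```python
# Compressed two-phase toy of alternating training (PyTorch assumed).
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # stand-in shared convs
rpn_head = nn.Conv2d(64, 18, 1)  # toy proposal head (9 anchors x 2 scores)
det_head = nn.Conv2d(64, 21, 1)  # toy detection head (20 classes + background)
img = torch.randn(1, 3, 224, 224)

# Phase 1: fine-tune the shared convs together with the proposal head.
opt = torch.optim.SGD(list(shared.parameters()) +
                      list(rpn_head.parameters()), lr=1e-3)
loss = rpn_head(shared(img)).pow(2).mean()  # placeholder proposal loss
opt.zero_grad()
loss.backward()
opt.step()

# Phase 2: freeze the shared convs and fine-tune the detection head on
# fixed proposals; both tasks now compute from the same features.
for p in shared.parameters():
    p.requires_grad = False
opt = torch.optim.SGD(det_head.parameters(), lr=1e-3)
loss = det_head(shared(img)).pow(2).mean()  # placeholder detection loss
opt.zero_grad()
loss.backward()
opt.step()
```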
We comprehensively evaluate our method on the PAS-
CAL VOC detection benchmarks [11] where RPNs with Fast
R-CNNs produce detection accuracy better than the strong
baseline of Selective Search with Fast R-CNNs. Meanwhile,
our method waives nearly all computational burdens of
Selective Search at test-time—the effective running time for
proposals is just 10 milliseconds. Using the expensive very
deep models of [3], our detection method still has a frame
rate of 5 fps (including all steps) on a GPU, and thus is a practi-
cal object detection system in terms of both speed and accu-
racy. We also report results on the MS COCO dataset [12]
and investigate the improvements on PASCAL VOC using
the COCO data. Code has been made publicly available at
https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).
A preliminary version of this manuscript was published
previously [10]. Since then, the frameworks of RPN and
Faster R-CNN have been adopted and generalized to other
methods, such as 3D object detection [13], part-based detec-
tion [14], instance segmentation [15], and image captioning
[16]. Our fast and effective object detection system has also
been deployed in commercial systems such as Pinterest [17],
with user engagement improvements reported.
In ILSVRC and COCO 2015 competitions, Faster R-CNN
and RPN are the basis of several 1st-place entries [18] in the
tracks of ImageNet detection, ImageNet localization, COCO
detection, and COCO segmentation. RPNs completely learn
to propose regions from data, and thus can easily benefit
from deeper and more expressive features (such as the 101-
layer residual nets adopted in [18]). Faster R-CNN and RPN
are also used by several other leading entries in these com-
petitions.²
These results suggest that our method is not only
a cost-efficient solution for practical usage, but also an effec-
tive way of improving object detection accuracy.
2 RELATED WORK
Object Proposals. There is a large literature on object proposal
methods. Comprehensive surveys and comparisons of
object proposal methods can be found in [19], [20], [21].
Widely used object proposal methods include those based
on grouping super-pixels (e.g., Selective Search [4], CPMC
[22], MCG [23]) and those based on sliding windows (e.g.,
objectness in windows [24], EdgeBoxes [6]). Object proposal
methods were adopted as external modules independent of
the detectors (e.g., Selective Search [4] object detectors, R-
CNN [5], and Fast R-CNN [2]).
Deep Networks for Object Detection. The R-CNN method [5]
trains CNNs end-to-end to classify the proposal regions into
object categories or background. R-CNN mainly plays as a
classifier, and it does not predict object bounds (except for
refining by bounding box regression). Its accuracy depends
on the performance of the region proposal module (see com-
parisons in [20]). Several papers have proposed ways of using
deep networks for predicting object bounding boxes [9], [25],
[26], [27]. In the OverFeat method [9], a fully-connected layer
is trained to predict the box coordinates for the localization
task that assumes a single object. The fully-connected layer is
then turned into a convolutional layer for detecting multiple
class-specific objects. The MultiBox methods [26], [27] gener-
ate region proposals from a network whose last fully-con-
nected layer simultaneously predicts multiple class-agnostic
boxes, generalizing the “single-box” fashion of OverFeat.
These class-agnostic boxes are used as proposals for R-CNN
[5]. The MultiBox proposal network is applied on a single
image crop or multiple large image crops (e.g., 224×224), in
contrast to our fully convolutional scheme. MultiBox does
not share features between the proposal and detection net-
works. We discuss OverFeat and MultiBox in more depth
later in context with our method. Concurrent with our work,
the DeepMask method [28] is developed for learning segmen-
tation proposals.
Shared computation of convolutions [1], [2], [7], [9], [29]
has been attracting increasing attention for efficient, yet
accurate, visual recognition. The OverFeat paper [9] com-
putes convolutional features from an image pyramid for
classification, localization, and detection. Adaptively-sized
pooling (SPP) [1] on shared convolutional feature maps is
developed for efficient region-based object detection [1],
[30] and semantic segmentation [29]. Fast R-CNN [2] ena-
bles end-to-end detector training on shared convolutional
features and shows compelling accuracy and speed.
3 FASTER R-CNN
Our object detection system, called Faster R-CNN, is
composed of two modules. The first module is a deep fully
convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions.
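The following self-contained sketch traces this two-module flow, assuming PyTorch and torchvision's roi_pool for the region-wise feature extraction that Fast R-CNN [2] performs. The backbone, heads, and fixed example boxes are toy stand-ins, and real proposal decoding and non-maximum suppression are elided.

```python
# Toy sketch of the two-module Faster R-CNN flow (PyTorch assumed).
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=16, padding=1),
                         nn.ReLU())          # stand-in for shared VGG convs
rpn_cls = nn.Conv2d(64, 9, 1)                # toy objectness head (9 anchors)
det_head = nn.Linear(64 * 7 * 7, 21)         # toy Fast R-CNN head (20 cls + bg)

img = torch.randn(1, 3, 224, 224)
feat = backbone(img)                         # shared full-image features, stride 16
objectness = rpn_cls(feat)                   # module 1: score candidate regions
# Toy "proposals": fixed [batch_idx, x1, y1, x2, y2] boxes in image
# coordinates; a real RPN decodes anchors + regressions and applies NMS.
boxes = torch.tensor([[0., 0., 0., 64., 64.],
                      [0., 32., 32., 160., 96.]])
pooled = roi_pool(feat, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
logits = det_head(pooled.flatten(1))         # module 2: classify each proposal
```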
Fig. 1. Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales.
(b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.
¹ Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks, leading to less training time.
² http://image-net.org/challenges/LSVRC/2015/results