Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

S. Ren is with the University of Science and Technology of China, Hefei, Anhui 230026, China (e-mail: sqren@mail.ustc.edu.cn). K. He and J. Sun are with the Visual Computing Group, Microsoft Research, Beijing 100080, China (e-mail: {kahe, jiansun}@microsoft.com). R. Girshick is with Facebook AI Research, Seattle, WA 98109 (e-mail: rbg@fb.com).
Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances
like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation
as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the
detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts
object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which
are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional
features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified
network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a
GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300
proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning
entries in several tracks. Code has been made publicly available.
Index Terms—Object detection, region proposal, convolutional neural network
1 INTRODUCTION
Recent advances in object detection are driven by the
success of region proposal methods (e.g., [4]) and
region-based convolutional neural networks (R-CNNs) [5].
Although region-based CNNs were computationally expen-
sive as originally developed in [5], their cost has been drasti-
cally reduced thanks to sharing convolutions across
proposals [1], [2]. The latest incarnation, Fast R-CNN [2],
achieves near real-time rates using very deep networks [3],
when ignoring the time spent on region proposals. Now, pro-
posals are the test-time computational bottleneck in state-
of-the-art detection systems.
Region proposal methods typically rely on inexpensive
features and economical inference schemes. Selective Search
[4], one of the most popular methods, greedily merges super-
pixels based on engineered low-level features. Yet when com-
pared to efficient detection networks [2], Selective Search is an
order of magnitude slower, at 2 seconds per image in a CPU
implementation. EdgeBoxes [6] currently provides the best
tradeoff between proposal quality and speed, at 0.2 seconds
per image. Nevertheless, the region proposal step still con-
sumes as much running time as the detection network.
One may note that fast region-based CNNs take advantage
of GPUs, while the region proposal methods used in research
are implemented on the CPU, making such runtime
comparisons inequitable. An obvious way to accelerate pro-
posal computation is to re-implement it for the GPU. This
may be an effective engineering solution, but re-implementa-
tion ignores the down-stream detection network and there-
fore misses important opportunities for sharing computation.
In this paper, we show that an algorithmic change—com-
puting proposals with a deep convolutional neural net-
work—leads to an elegant and effective solution where
proposal computation is nearly cost-free given the detection
network’s computation. To this end, we introduce novel
Region Proposal Networks (RPNs) that share convolutional
layers with state-of-the-art object detection networks [1], [2].
By sharing convolutions at test-time, the marginal cost for
computing proposals is small (e.g., 10 ms per image).
Our observation is that the convolutional feature maps
used by region-based detectors, like Fast R-CNN, can also
be used for generating region proposals. On top of these
convolutional features, we construct an RPN by adding a
few additional convolutional layers that simultaneously
regress region bounds and objectness scores at each location
on a regular grid. The RPN is thus a kind of fully convolu-
tional network (FCN) [7] and can be trained end-to-end spe-
cifically for the task of generating detection proposals.
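To make this construction concrete, here is a minimal sketch of such an RPN head, assuming PyTorch (a framework choice of ours; the released implementations use MATLAB/Caffe and Python/Caffe). The 3×3 convolution followed by two sibling 1×1 convolutions, with 2 objectness scores and 4 box offsets per anchor, follows the design described in this paper; the class name is illustrative, and the nine anchors per location match the default configuration reported later in the paper.

```python
# Minimal sketch of an RPN head (PyTorch assumed; names illustrative).
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # A 3x3 conv slides over the shared full-image feature map.
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Two sibling 1x1 convs: 2 objectness scores and 4 box offsets
        # per anchor at every grid position.
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)

# Example: a VGG-16 conv5_3-like feature map for a ~600x800 input.
feat = torch.randn(1, 512, 38, 50)
scores, deltas = RPNHead()(feat)  # shapes (1, 18, 38, 50) and (1, 36, 38, 50)
```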
RPNs are designed to efficiently predict region proposals
with a wide range of scales and aspect ratios. In contrast to
prevalent methods [1], [2], [8], [9] that use pyramids of
images (Fig. 1a) or pyramids of filters (Fig. 1b), we introduce
novel “anchor” boxes that serve as references at multiple
scales and aspect ratios. Our scheme can be thought of as a
pyramid of regression references (Fig. 1c), which avoids
enumerating images or filters of multiple scales or aspect
ratios. This model performs well when trained and tested
using single-scale images and thus benefits running speed.
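As an illustration of the anchor scheme, the sketch below enumerates reference boxes on a regular grid at three scales and three aspect ratios, matching the defaults reported later in the paper; the stride of 16 (VGG-16's conv5 stride), the function name, and the (x1, y1, x2, y2) coordinate convention are assumptions made for this example.

```python
# Sketch of anchor enumeration over the feature-map grid (NumPy).
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * stride, y * stride      # center in image coords
            for s in scales:                     # anchor area = s * s
                for r in ratios:                 # r = width / height
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = make_anchors(38, 50)  # 38 * 50 * 9 = 17,100 reference boxes
```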
To unify RPNs with Fast R-CNN [2] object detection
networks, we propose a training scheme that alternates
between fine-tuning for the region proposal task and then
fine-tuning for object detection, while keeping the proposals
fixed. This scheme converges quickly and produces a uni-
fied network with convolutional features that are shared
between both tasks.¹
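The sketch below compresses this alternation into two toy phases, assuming PyTorch: the shared convolutions are first fine-tuned together with the proposal head, then frozen while the detection head is fine-tuned on fixed proposals, so both tasks end up computing from the same features. The modules and squared-activation "losses" are stand-ins, not the actual RPN and Fast R-CNN objectives or the paper's full training recipe.

```python
# Compressed two-phase toy of alternating training (PyTorch assumed).
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # stand-in shared convs
rpn_head = nn.Conv2d(64, 18, 1)  # toy proposal head (9 anchors x 2 scores)
det_head = nn.Conv2d(64, 21, 1)  # toy detection head (20 classes + background)
img = torch.randn(1, 3, 224, 224)

# Phase 1: fine-tune the shared convs together with the proposal head.
opt = torch.optim.SGD(list(shared.parameters()) +
                      list(rpn_head.parameters()), lr=1e-3)
loss = rpn_head(shared(img)).pow(2).mean()  # placeholder proposal loss
opt.zero_grad()
loss.backward()
opt.step()

# Phase 2: freeze the shared convs and fine-tune the detection head on
# fixed proposals; both tasks now compute from the same features.
for p in shared.parameters():
    p.requires_grad = False
opt = torch.optim.SGD(det_head.parameters(), lr=1e-3)
loss = det_head(shared(img)).pow(2).mean()  # placeholder detection loss
opt.zero_grad()
loss.backward()
opt.step()
```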
We comprehensively evaluate our method on the PAS-
CAL VOC detection benchmarks [11] where RPNs with Fast
R-CNNs produce detection accuracy better than the strong
baseline of Selective Search with Fast R-CNNs. Meanwhile,
our method waives nearly all computational burdens of
Selective Search at test-time—the effective running time for
proposals is just 10 milliseconds. Using the expensive very
deep models of [3], our detection method still has a frame
rate of 5 fps (including all steps) on a GPU, and thus is a practi-
cal object detection system in terms of both speed and accu-
racy. We also report results on the MS COCO dataset [12]
and investigate the improvements on PASCAL VOC using
the COCO data. Code has been made publicly available at
https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).
A preliminary version of this manuscript was published
previously [10]. Since then, the frameworks of RPN and
Faster R-CNN have been adopted and generalized to other
methods, such as 3D object detection [13], part-based detec-
tion [14], instance segmentation [15], and image captioning
[16]. Our fast and effective object detection system has also
been deployed in commercial systems such as Pinterest [17],
with user engagement improvements reported.
In ILSVRC and COCO 2015 competitions, Faster R-CNN
and RPN are the basis of several 1st-place entries [18] in the
tracks of ImageNet detection, ImageNet localization, COCO
detection, and COCO segmentation. RPNs completely learn
to propose regions from data, and thus can easily benefit
from deeper and more expressive features (such as the 101-
layer residual nets adopted in [18]). Faster R-CNN and RPN
are also used by several other leading entries in these com-
petitions.²
These results suggest that our method is not only
a cost-efficient solution for practical usage, but also an effec-
tive way of improving object detection accuracy.
2 RELATED WORK
Object Proposals. There is a large literature on object proposal
methods. Comprehensive surveys and comparisons of
object proposal methods can be found in [19], [20], [21].
Widely used object proposal methods include those based
on grouping super-pixels (e.g., Selective Search [4], CPMC
[22], MCG [23]) and those based on sliding windows (e.g.,
objectness in windows [24], EdgeBoxes [6]). Object proposal
methods were adopted as external modules independent of
the detectors (e.g., Selective Search [4] object detectors, R-
CNN [5], and Fast R-CNN [2]).
Deep Networks for Object Detection. The R-CNN method [5]
trains CNNs end-to-end to classify the proposal regions into
object categories or background. R-CNN mainly plays as a
classifier, and it does not predict object bounds (except for
refining by bounding box regression). Its accuracy depends
on the performance of the region proposal module (see com-
parisons in [20]). Several papers have proposed ways of using
deep networks for predicting object bounding boxes [9], [25],
[26], [27]. In the OverFeat method [9], a fully-connected layer
is trained to predict the box coordinates for the localization
task that assumes a single object. The fully-connected layer is
then turned into a convolutional layer for detecting multiple
class-specific objects. The MultiBox methods [26], [27] gener-
ate region proposals from a network whose last fully-con-
nected layer simultaneously predicts multiple class-agnostic
boxes, generalizing the “single-box” fashion of OverFeat.
These class-agnostic boxes are used as proposals for R-CNN
[5]. The MultiBox proposal network is applied on a single
image crop or multiple large image crops (e.g., 224×224), in
contrast to our fully convolutional scheme. MultiBox does
not share features between the proposal and detection net-
works. We discuss OverFeat and MultiBox in more depth
later in context with our method. Concurrent with our work,
the DeepMask method [28] is developed for learning segmen-
tation proposals.
Shared computation of convolutions [1], [2], [7], [9], [29]
has been attracting increasing attention for efficient, yet
accurate, visual recognition. The OverFeat paper [9] com-
putes convolutional features from an image pyramid for
classification, localization, and detection. Adaptively-sized
pooling (SPP) [1] on shared convolutional feature maps is
developed for efficient region-based object detection [1],
[30] and semantic segmentation [29]. Fast R-CNN [2] ena-
bles end-to-end detector training on shared convolutional
features and shows compelling accuracy and speed.
3 FASTER R-CNN
Our object detection system, called Faster R-CNN, is
composed of two modules. The first module is a deep fully
convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions.
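The following self-contained sketch traces this two-module flow, assuming PyTorch and torchvision's roi_pool for the region-wise feature extraction that Fast R-CNN [2] performs. The backbone, heads, and fixed example boxes are toy stand-ins, and real proposal decoding and non-maximum suppression are elided.

```python
# Toy sketch of the two-module Faster R-CNN flow (PyTorch assumed).
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=16, padding=1),
                         nn.ReLU())          # stand-in for shared VGG convs
rpn_cls = nn.Conv2d(64, 9, 1)                # toy objectness head (9 anchors)
det_head = nn.Linear(64 * 7 * 7, 21)         # toy Fast R-CNN head (20 cls + bg)

img = torch.randn(1, 3, 224, 224)
feat = backbone(img)                         # shared full-image features, stride 16
objectness = rpn_cls(feat)                   # module 1: score candidate regions
# Toy "proposals": fixed [batch_idx, x1, y1, x2, y2] boxes in image
# coordinates; a real RPN decodes anchors + regressions and applies NMS.
boxes = torch.tensor([[0., 0., 0., 64., 64.],
                      [0., 32., 32., 160., 96.]])
pooled = roi_pool(feat, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
logits = det_head(pooled.flatten(1))         # module 2: classify each proposal
```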
Fig. 1. Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales.
(b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.
¹ Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks, leading to less training time.
² http://image-net.org/challenges/LSVRC/2015/results