Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.
Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region
proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image
convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional
network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to
generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN
into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with
“attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3],
our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection
accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO
2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been
made publicly available.
Index Terms—Object Detection, Region Proposal, Convolutional Neural Network.
1 INTRODUCTION
Recent advances in object detection are driven by
the success of region proposal methods (e.g., [4])
and region-based convolutional neural networks (R-
CNNs) [5]. Although region-based CNNs were com-
putationally expensive as originally developed in [5],
their cost has been drastically reduced thanks to shar-
ing convolutions across proposals [1], [2]. The latest
incarnation, Fast R-CNN [2], achieves near real-time
rates using very deep networks [3], when ignoring the
time spent on region proposals. Now, proposals are the
test-time computational bottleneck in state-of-the-art
detection systems.
Region proposal methods typically rely on inex-
pensive features and economical inference schemes.
Selective Search [4], one of the most popular meth-
ods, greedily merges superpixels based on engineered
low-level features. Yet when compared to efficient
detection networks [2], Selective Search is an order of
magnitude slower, at 2 seconds per image in a CPU
implementation. EdgeBoxes [6] currently provides the
best tradeoff between proposal quality and speed,
at 0.2 seconds per image. Nevertheless, the region
proposal step still consumes as much running time
as the detection network.
S. Ren is with the University of Science and Technology of China, Hefei, China. This work was done when S. Ren was an intern at Microsoft Research. E-mail: sqren@mail.ustc.edu.cn
K. He and J. Sun are with the Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com
R. Girshick is with Facebook AI Research. The majority of this work was done when R. Girshick was with Microsoft Research. E-mail: rbg@fb.com
One may note that fast region-based CNNs take
advantage of GPUs, while the region proposal meth-
ods used in research are implemented on the CPU,
making such runtime comparisons inequitable. An ob-
vious way to accelerate proposal computation is to re-
implement it for the GPU. This may be an effective en-
gineering solution, but re-implementation ignores the
down-stream detection network and therefore misses
important opportunities for sharing computation.
In this paper, we show that an algorithmic change—
computing proposals with a deep convolutional neu-
ral network—leads to an elegant and effective solution
where proposal computation is nearly cost-free given
the detection network’s computation. To this end, we
introduce novel Region Proposal Networks (RPNs) that
share convolutional layers with state-of-the-art object
detection networks [1], [2]. By sharing convolutions at
test-time, the marginal cost for computing proposals
is small (e.g., 10ms per image).
Our observation is that the convolutional feature
maps used by region-based detectors, like Fast R-
CNN, can also be used for generating region pro-
posals. On top of these convolutional features, we
construct an RPN by adding a few additional con-
volutional layers that simultaneously regress region
bounds and objectness scores at each location on a
regular grid. The RPN is thus a kind of fully convo-
lutional network (FCN) [7] and can be trained end-to-end
specifically for the task of generating detection proposals.
RPNs are designed to efficiently predict region pro-
posals with a wide range of scales and aspect ratios. In
contrast to prevalent methods [8], [9], [1], [2] that use
arXiv:1506.01497v3 [cs.CV] 6 Jan 2016

Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.
pyramids of images (Figure 1, a) or pyramids of filters
(Figure 1, b), we introduce novel “anchor” boxes
that serve as references at multiple scales and aspect
ratios. Our scheme can be thought of as a pyramid
of regression references (Figure 1, c), which avoids
enumerating images or filters of multiple scales or
aspect ratios. This model performs well when trained
and tested using single-scale images and thus benefits
running speed.
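A minimal sketch of such a pyramid of regression references, assuming the paper's default of 3 scales x 3 aspect ratios (k = 9); the specific scale and stride values and the function names here are illustrative assumptions, not the released code:

```python
# Generate k reference ("anchor") boxes per feature-map location and tile
# them over the grid. Boxes are (x1, y1, x2, y2).
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors centered at (0, 0); area is preserved across aspect ratios
    (w / h = ratio, w * h = scale ** 2)."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)                      # (k, 4)

def shift_anchors(base, feat_h, feat_w, stride=16):
    """Tile the k base anchors over every feature-map position."""
    xs = (np.arange(feat_w) + 0.5) * stride       # center x of each column
    ys = (np.arange(feat_h) + 0.5) * stride       # center y of each row
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (base[None] + shifts).reshape(-1, 4)   # (H * W * k, 4)

base = make_anchors()
all_anchors = shift_anchors(base, feat_h=38, feat_w=50)
print(base.shape, all_anchors.shape)  # (9, 4) (17100, 4)
```

Only the k base boxes encode scale and aspect ratio; the single-scale input image and feature map are untouched, which is the sense in which the pyramid lives in the regression references rather than in images or filters.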
To unify RPNs with Fast R-CNN [2] object detec-
tion networks, we propose a training scheme that
alternates between fine-tuning for the region proposal
task and then fine-tuning for object detection, while
keeping the proposals fixed. This scheme converges
quickly and produces a unified network with convo-
lutional features that are shared between both tasks.^1
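The alternation can be sketched as a control-flow skeleton; the train_* functions below are stand-ins (the real steps fine-tune deep networks), and the four-step ordering follows the scheme as the paper later elaborates it:

```python
# Schematic of the alternating training scheme; all names are placeholders.
def alternating_training(log):
    def train_rpn(init):
        # Fine-tune the RPN starting from `init` weights.
        log.append(f"rpn:{init}")
        return "rpn_weights"

    def train_fast_rcnn(init, proposals):
        # Fine-tune Fast R-CNN on fixed `proposals` from the RPN.
        log.append(f"frcnn:{init}")
        return "frcnn_weights"

    rpn_w = train_rpn("imagenet")               # 1: RPN from ImageNet init
    det_w = train_fast_rcnn("imagenet", rpn_w)  # 2: detector on RPN proposals
    rpn_w = train_rpn(det_w)                    # 3: re-tune RPN, sharing convs
    det_w = train_fast_rcnn(det_w, rpn_w)       # 4: tune detector-only layers
    return det_w

steps = []
final = alternating_training(steps)
print(steps)
# ['rpn:imagenet', 'frcnn:imagenet', 'rpn:frcnn_weights', 'frcnn:frcnn_weights']
```

The key property is that each detection step sees proposals that are held fixed, so the two tasks can take turns refining the shared convolutional weights.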
We comprehensively evaluate our method on the
PASCAL VOC detection benchmarks [11] where RPNs
with Fast R-CNNs produce detection accuracy bet-
ter than the strong baseline of Selective Search with
Fast R-CNNs. Meanwhile, our method waives nearly
all computational burdens of Selective Search at
test-time—the effective running time for proposals
is just 10 milliseconds. Using the expensive very
deep models of [3], our detection method still has
a frame rate of 5fps (including all steps) on a GPU,
and thus is a practical object detection system in
terms of both speed and accuracy. We also report
results on the MS COCO dataset [12] and investi-
gate the improvements on PASCAL VOC using the
COCO data. Code has been made publicly available
at https://github.com/shaoqingren/faster_rcnn (in MATLAB)
and https://github.com/rbgirshick/py-faster-rcnn (in Python).
A preliminary version of this manuscript was pub-
lished previously [10]. Since then, the frameworks of
RPN and Faster R-CNN have been adopted and gen-
eralized to other methods, such as 3D object detection
[13], part-based detection [14], instance segmentation
[15], and image captioning [16]. Our fast and effective
object detection system has also been built into commercial
systems such as at Pinterest [17], with user
engagement improvements reported.

1. Since the publication of the conference version of this paper [10],
we have also found that RPNs can be trained jointly with Fast R-CNN
networks, leading to less training time.
In ILSVRC and COCO 2015 competitions, Faster
R-CNN and RPN are the basis of several 1st-place
entries [18] in the tracks of ImageNet detection, Ima-
geNet localization, COCO detection, and COCO seg-
mentation. RPNs completely learn to propose regions
from data, and thus can easily benefit from deeper
and more expressive features (such as the 101-layer
residual nets adopted in [18]). Faster R-CNN and RPN
are also used by several other leading entries in these
competitions^2. These results suggest that our method
is not only a cost-efficient solution for practical usage,
but also an effective way of improving object detec-
tion accuracy.
2 RELATED WORK
Object Proposals. There is a large literature on object
proposal methods. Comprehensive surveys and com-
parisons of object proposal methods can be found in
[19], [20], [21]. Widely used object proposal methods
include those based on grouping super-pixels (e.g.,
Selective Search [4], CPMC [22], MCG [23]) and those
based on sliding windows (e.g., objectness in windows
[24], EdgeBoxes [6]). Object proposal methods were
adopted as external modules independent of the de-
tectors (e.g., Selective Search [4] object detectors, R-
CNN [5], and Fast R-CNN [2]).
Deep Networks for Object Detection. The R-CNN
method [5] trains CNNs end-to-end to classify the
proposal regions into object categories or background.
R-CNN mainly plays as a classifier, and it does not
predict object bounds (except for refining by bounding
box regression). Its accuracy depends on the perfor-
mance of the region proposal module (see compar-
isons in [20]). Several papers have proposed ways of
using deep networks for predicting object bounding
boxes [25], [9], [26], [27]. In the OverFeat method [9],
a fully-connected layer is trained to predict the box
coordinates for the localization task that assumes a
single object. The fully-connected layer is then turned
2. http://image-net.org/challenges/LSVRC/2015/results