
Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps
are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on
the feature map. (c) We use pyramids of reference boxes in the regression functions.
Unlike prevalent methods that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel "anchor" boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images, and thus benefits running speed.
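The pyramid of regression references can be pictured concretely: at each sliding-window position, k reference boxes are enumerated by crossing a few scales with a few aspect ratios. The sketch below is an illustrative reconstruction, not the released code; the particular base size, scales, and ratios are assumptions chosen for illustration.

```python
import math

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Enumerate reference ("anchor") boxes at multiple scales and aspect
    ratios, all centered on one sliding-window position.

    Returns a list of (x1, y1, x2, y2) boxes centered at the origin,
    one per (scale, ratio) pair -- i.e., k = len(scales) * len(ratios).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Fix the anchor area at (base_size * scale)^2 and vary only
            # the height/width ratio.
            area = float(base_size * scale) ** 2
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2.0, -h / 2.0, w / 2.0, h / 2.0))
    return anchors

anchors = generate_anchors()
print(len(anchors))  # 9 anchors (3 scales x 3 aspect ratios) per position
```

Because the anchors, rather than the input images or the filters, carry the multi-scale structure, the network itself can operate on single-scale images and single-size filters.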
To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.¹
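The control flow of this alternating scheme can be summarized as a short training loop. The sketch below is only a schematic of the 4-step alternating optimization: `train_rpn`, `generate_proposals`, and `train_detector` are hypothetical placeholders standing in for full training runs, and the code shows ordering and feature sharing, not an actual implementation.

```python
def alternating_training(train_rpn, generate_proposals, train_detector):
    """Schematic of 4-step alternating training between an RPN and a
    Fast R-CNN detector. The three callables are hypothetical stubs."""
    # Step 1: train the RPN, starting from an ImageNet-pretrained backbone.
    rpn = train_rpn(shared_backbone=None)
    # Step 2: train the detector on the RPN's proposals, also from an
    # ImageNet-pretrained backbone (no shared features yet).
    detector = train_detector(generate_proposals(rpn), shared_backbone=None)
    # Step 3: re-train the RPN on top of the detector's backbone, keeping
    # the shared convolutional layers fixed.
    rpn = train_rpn(shared_backbone=detector)
    # Step 4: fine-tune the detector-only layers, with the shared layers
    # fixed and the step-3 proposals held constant.
    detector = train_detector(generate_proposals(rpn), shared_backbone=detector)
    return rpn, detector

# Minimal dry run that just records the sequence of stages.
log = []
rpn, det = alternating_training(
    train_rpn=lambda shared_backbone: log.append("rpn") or "rpn_model",
    generate_proposals=lambda rpn: log.append("proposals") or "boxes",
    train_detector=lambda props, shared_backbone: log.append("det") or "det_model",
)
print(log)  # ['rpn', 'proposals', 'det', 'rpn', 'proposals', 'det']
```

The key property is that after steps 3 and 4 the two tasks read from the same fixed convolutional features, so proposal computation adds almost no cost at test time.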
We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11], where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all of the computational burden of Selective Search at test time: the effective running time for proposals is just 10 milliseconds. Even with the expensive very deep models of [3], our detection method still runs at a frame rate of 5 fps (including all steps) on a GPU, and is thus a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).
A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built into commercial systems such as at Pinterest [17], with reported improvements in user engagement.

¹ Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks, leading to shorter training time.
In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN were also used by several other leading entries in these competitions.² These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.
2 RELATED WORK
Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).
Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify proposal regions into object categories or background. R-CNN mainly plays the role of a classifier, and it does not predict object bounds (except for refining them by bounding-box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task, which assumes a single object. The fully-connected layer is then turned
² http://image-net.org/challenges/LSVRC/2015/results