
R-FCN: Object Detection via Region-based Fully Convolutional Networks

Jifeng Dai (Microsoft Research)
Yi Li∗ (Tsinghua University)
Kaiming He (Microsoft Research)
Jian Sun (Microsoft Research)
Abstract

We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN [6, 18] that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets) [9], for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20× faster than the Faster R-CNN counterpart. Code is made publicly available at: https://github.com/daijifeng001/r-fcn.
1 Introduction
A prevalent family [8, 6, 18] of deep networks for object detection can be divided into two subnetworks by the Region-of-Interest (RoI) pooling layer [6]: (i) a shared, “fully convolutional” subnetwork independent of RoIs, and (ii) an RoI-wise subnetwork that does not share computation. This decomposition [8] historically resulted from the pioneering classification architectures, such as AlexNet [10] and VGG Nets [23], which consist of two subnetworks by design — a convolutional subnetwork ending with a spatial pooling layer, followed by several fully-connected (fc) layers. Thus the (last) spatial pooling layer in image classification networks is naturally turned into the RoI pooling layer in object detection networks [8, 6, 18].
But recent state-of-the-art image classification networks such as Residual Nets (ResNets) [9] and GoogLeNets [24, 26] are by design fully convolutional². By analogy, it appears natural to use all convolutional layers to construct the shared, convolutional subnetwork in the object detection architecture, leaving the RoI-wise subnetwork no hidden layer. However, as empirically investigated in this work, this naïve solution turns out to have considerably inferior detection accuracy that does not match the network’s superior classification accuracy. To remedy this issue, in the ResNet paper [9] the RoI pooling layer of the Faster R-CNN detector [18] is unnaturally inserted between two sets of convolutional layers — this creates a deeper RoI-wise subnetwork that improves accuracy, at the cost of lower speed due to the unshared per-RoI computation.
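For concreteness, that ResNet-based Faster R-CNN arrangement can be sketched as follows (an illustrative sketch, not the authors' implementation): the stages through conv4 are shared across the image, while the conv5 stage (layer4) is replayed for every RoI after pooling.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

resnet = torchvision.models.resnet101(weights=None)

# Shared subnetwork: conv1 through the conv4 stage (stride 16), run once.
shared = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)

# Unshared subnetwork: the conv5 stage (layer4), re-run for EVERY RoI.
per_roi = torch.nn.Sequential(resnet.layer4, torch.nn.AdaptiveAvgPool2d(1))

image = torch.randn(1, 3, 600, 800)
features = shared(image)  # (1, 1024, 38, 50), stride 16

rois = torch.tensor([[0, 10.0, 20.0, 300.0, 400.0]])
crops = roi_pool(features, rois, output_size=(14, 14),
                 spatial_scale=1.0 / 16)  # (num_rois, 1024, 14, 14)

# conv5's stride-2 blocks reduce 14x14 to 7x7; this deeper per-RoI head
# improves accuracy but repeats the conv5 computation for each proposal.
roi_features = per_roi(crops).flatten(1)  # (num_rois, 2048)
```

Since layer4 contains three bottleneck blocks (nine convolutional layers) in ResNet-101, each of the often hundreds of proposals repeats that computation, which is precisely the unshared per-RoI cost referred to above.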
We argue that the aforementioned unnatural design is caused by a dilemma between increasing translation invariance for image classification and respecting translation variance for object detection. On one hand, the image-level classification task favors translation invariance — a shift of an object inside an image should be indiscriminative. Thus, deep (fully) convolutional architectures that are as translation-invariant as possible are preferable, as evidenced by the leading results on ImageNet classification.
∗ This work was done when Yi Li was an intern at Microsoft Research.
² Only the last layer is fully-connected, which is removed and replaced when fine-tuning for object detection.