detection systems, and describe how the leading ones
follow very similar designs.
• We describe our flexible and unified implementation
of three meta-architectures (Faster R-CNN, R-FCN
and SSD) in TensorFlow, which we use to perform exten-
sive experiments that trace the accuracy/speed trade-
off curve for different detection systems, varying meta-
architecture, feature extractor, image resolution, etc.
• Our findings show that using fewer proposals for
Faster R-CNN can speed it up significantly without
a large loss in accuracy, making it competitive with its
faster cousins, SSD and R-FCN. We show that SSD's
performance is less sensitive to the quality of the fea-
ture extractor than that of Faster R-CNN and R-FCN. We also
identify sweet spots on the accuracy/speed trade-off
curve where gains in accuracy are only possible by sac-
rificing speed (within the family of detectors presented
here).
• Several of the meta-architecture and feature-extractor
combinations that we report have never appeared be-
fore in the literature. We discuss how we used some of
these novel combinations to train the winning entry of
the 2016 COCO object detection challenge.
2. Meta-architectures
Neural nets have become the leading method for high-
quality object detection in recent years. In this section we
survey some of the highlights of this literature. The R-CNN
paper by Girshick et al. [11] was among the first modern
incarnations of convolutional network based detection. In-
spired by recent successes on image classification [20], the
R-CNN method took the straightforward approach of crop-
ping externally computed box proposals out of an input im-
age and running a neural net classifier on these crops. This
approach can be expensive, however, because many crops
are necessary, leading to significant duplicated computation
from overlapping crops. Fast R-CNN [10] alleviated this
problem by pushing the entire image through a feature
extractor once and then cropping from an intermediate layer so that
crops share the computation load of feature extraction.
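To make this shared-computation idea concrete, the sketch below (a simplification, not the implementation of [10]) crops the portion of a single precomputed feature map that corresponds to an image-space proposal; the stride, shapes and the omitted RoI pooling step are illustrative assumptions.

```python
import numpy as np

def crop_features(feature_map, box, stride=16):
    """Crop the region of a shared feature map corresponding to an
    image-space box (xmin, ymin, xmax, ymax).

    The feature map has shape (height, width, channels) and is assumed
    to be `stride` times smaller than the input image in each spatial
    dimension. Real detectors follow this crop with RoI pooling /
    resizing to a fixed size, which is omitted here.
    """
    xmin, ymin, xmax, ymax = (int(round(v / stride)) for v in box)
    return feature_map[ymin:ymax, xmin:xmax, :]

# Features are computed once for the whole image...
features = np.random.rand(38, 50, 256)   # e.g., roughly a 600x800 image at stride 16
# ...and every proposal reuses them instead of re-running the network.
crop = crop_features(features, (160, 96, 320, 256))
print(crop.shape)                        # (10, 10, 256)
```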
While both R-CNN and Fast R-CNN relied on an exter-
nal proposal generator, recent works have shown that it is
possible to generate box proposals using neural networks
as well [41, 40, 8, 31]. In these works, it is typical to have a
collection of boxes overlaid on the image at different spatial
locations, scales and aspect ratios that act as “anchors”
(sometimes called “priors” or “default boxes”). A model
is then trained to make two predictions for each anchor:
(1) a discrete class prediction for each anchor, and (2) a
continuous prediction of an offset by which the anchor
needs to be shifted to fit the groundtruth bounding box.
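As a concrete example of such an offset prediction target, one common choice (used, e.g., by Faster R-CNN [31]) encodes the groundtruth box by its center shift and log-scale size change relative to the anchor; the sketch below assumes boxes are given as (x_center, y_center, width, height).

```python
import numpy as np

def encode_box(groundtruth, anchor):
    """Encode a groundtruth box relative to an anchor.

    Both boxes are given as (x_center, y_center, width, height). This is
    the center/size parameterization used by Faster R-CNN; other
    encodings are possible.
    """
    xg, yg, wg, hg = groundtruth
    xa, ya, wa, ha = anchor
    return np.array([
        (xg - xa) / wa,    # horizontal center shift, relative to anchor width
        (yg - ya) / ha,    # vertical center shift, relative to anchor height
        np.log(wg / wa),   # log-scale width correction
        np.log(hg / ha),   # log-scale height correction
    ])

# Example: a groundtruth box slightly offset from a 100x100 anchor.
print(encode_box((110, 95, 120, 80), (100, 100, 100, 100)))
```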
Papers that follow this anchor-based methodology then
minimize a combined classification and regression loss that
we now describe. For each anchor a, we first find the best
matching groundtruth box b (if one exists). If such a match
can be found, we call a a “positive anchor”, and assign it
(1) a class label y_a ∈ {1, . . . , K} and (2) a vector encoding
of box b with respect to anchor a (called the box encoding
φ(b_a; a)). If no match is found, we call a a “negative
anchor” and set the class label to be y_a = 0. If for
the anchor a we predict box encoding f_loc(I; a, θ) and
corresponding class f_cls(I; a, θ), where I is the image and
θ the model parameters, then the loss for a is measured as
a weighted sum of a location-based loss and a classification
loss:

\[
\mathcal{L}(a, I; \theta) = \alpha \cdot \mathbb{1}[a \text{ is positive}] \cdot \ell_{\mathrm{loc}}\big(\phi(b_a; a) - f_{\mathrm{loc}}(I; a, \theta)\big)
+ \beta \cdot \ell_{\mathrm{cls}}\big(y_a, f_{\mathrm{cls}}(I; a, \theta)\big), \tag{1}
\]
where α, β are weights balancing localization and classi-
fication losses. To train the model, Equation 1 is averaged
over anchors and minimized with respect to parameters θ.
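As a hedged illustration of Equation 1 (not the authors' implementation), the sketch below evaluates the per-anchor loss with squared error standing in for ℓ_loc and softmax cross-entropy for ℓ_cls; the actual loss functions and weights α, β differ between the papers discussed above.

```python
import numpy as np

def anchor_loss(is_positive, y_a, box_target, box_pred, class_logits,
                alpha=1.0, beta=1.0):
    """Per-anchor loss in the spirit of Equation 1.

    is_positive:  whether anchor a matched a groundtruth box.
    y_a:          class label (0 = background for negative anchors).
    box_target:   box encoding phi(b_a; a) of the matched groundtruth box.
    box_pred:     predicted box encoding f_loc(I; a, theta).
    class_logits: predicted class scores f_cls(I; a, theta), length K + 1.
    Squared error and softmax cross-entropy stand in for l_loc and l_cls.
    """
    loc = np.sum((box_target - box_pred) ** 2) if is_positive else 0.0
    shifted = class_logits - np.max(class_logits)      # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cls = -log_probs[y_a]
    return alpha * loc + beta * cls

# Toy example: a positive anchor with class label 2 out of K = 3 classes.
print(anchor_loss(True, 2,
                  box_target=np.array([0.1, -0.05, 0.2, -0.2]),
                  box_pred=np.zeros(4),
                  class_logits=np.array([0.5, 0.1, 2.0, -1.0])))
```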
The choice of anchors has significant implications both
for accuracy and computation. In the (first) Multibox
paper [8], these anchors (called “box priors” by the au-
thors) were generated by clustering groundtruth boxes in
the dataset. In more recent works, anchors are generated
by tiling a collection of boxes at different scales and aspect
ratios regularly across the image. The advantage of hav-
ing a regular grid of anchors is that predictions for these
boxes can be written as tiled predictors on the image with
shared parameters (i.e., convolutions) and are reminiscent
of traditional sliding window methods, e.g. [44]. The Faster
R-CNN [31] paper and the (second) Multibox paper [40]
(which called these tiled anchors “convolutional priors”)
were the first papers to take this new approach.
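As an illustration of this regular tiling (the scales, aspect ratios and stride below are placeholders, not the settings of any particular paper), the following sketch places a fixed set of boxes at every location of a feature-map grid.

```python
import numpy as np

def tile_anchors(grid_height, grid_width, stride,
                 scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0)):
    """Tile anchor boxes regularly over an image.

    Returns an array of shape (grid_height * grid_width * len(scales) *
    len(aspect_ratios), 4) whose rows are (x_center, y_center, width,
    height). Scales, aspect ratios and stride are illustrative only.
    """
    anchors = []
    for row in range(grid_height):
        for col in range(grid_width):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in aspect_ratios:
                    w = scale * np.sqrt(ratio)   # wider boxes for ratio > 1
                    h = scale / np.sqrt(ratio)   # taller boxes for ratio < 1
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

# A 38x50 grid at stride 16 yields 38 * 50 * 9 = 17100 anchors.
print(tile_anchors(38, 50, stride=16).shape)
```

Because the same set of boxes is repeated at every grid location, the per-anchor predictions can be produced by convolutional layers whose parameters are shared across locations.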
2.1. Meta-architectures
In our paper we focus primarily on three recent (meta)-
architectures: SSD (Single Shot Multibox Detector [26]),
Faster R-CNN [31] and R-FCN (Region-based Fully Con-
volutional Networks [6]). While these papers were orig-
inally presented with a particular feature extractor (e.g.,
VGG, ResNet, etc.), we now review these three methods, de-
coupling the choice of meta-architecture from feature ex-
tractor so that conceptually, any feature extractor can be
used with SSD, Faster R-CNN or R-FCN.
2.1.1 Single Shot Detector (SSD).
Though the SSD paper was published only recently (Liu et
al., [26]), we use the term SSD to refer broadly to archi-
tectures that use a single feed-forward convolutional net-
work to directly predict classes and anchor offsets without
requiring a second-stage, per-proposal classification oper-
ation (Figure 1a). Under this definition, the SSD meta-
architecture has been explored in a number of precursors
to [26]. Both Multibox and the Region Proposal Network