Speed/accuracy trade-offs for modern convolutional object detectors
Jonathan Huang Vivek Rathod Chen Sun Menglong Zhu Anoop Korattikara
Alireza Fathi Ian Fischer Zbigniew Wojna Yang Song Sergio Guadarrama
Kevin Murphy
Google Research
Abstract
The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [31], R-FCN [6] and SSD [26] systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real-time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
1. Introduction
A lot of progress has been made in recent years on object detection due to the use of convolutional neural networks (CNNs). Modern object detectors based on these networks, such as Faster R-CNN [31], R-FCN [6], Multibox [40], SSD [26] and YOLO [29], are now good enough to be deployed in consumer products (e.g., Google Photos, Pinterest Visual Search) and some have been shown to be fast enough to be run on mobile devices.
However, it can be difficult for practitioners to decide what architecture is best suited to their application. Standard accuracy metrics, such as mean average precision (mAP), do not tell the entire story, since for real deployments of computer vision systems, running time and memory usage are also critical. For example, mobile devices often require a small memory footprint, and self-driving cars require real-time performance. Server-side production systems, like those used in Google, Facebook or Snapchat, have more leeway to optimize for accuracy, but are still subject to throughput constraints. While the methods that win competitions, such as the COCO challenge [25], are optimized for accuracy, they often rely on model ensembling and multicrop methods which are too slow for practical usage.
Unfortunately, only a small subset of papers (e.g., R-FCN [6], SSD [26], YOLO [29]) discuss running time in any detail. Furthermore, these papers typically only state that they achieve some frame-rate, but do not give a full picture of the speed/accuracy trade-off, which depends on many other factors, such as which feature extractor is used, input image sizes, etc.
In this paper, we seek to explore the speed/accuracy trade-off of modern detection systems in an exhaustive and fair way. While this has been studied for full image classification (e.g., [3]), detection models tend to be significantly more complex. We primarily investigate single-model/single-pass detectors, by which we mean models that do not use ensembling, multi-crop methods, or other "tricks" such as horizontal flipping. In other words, we only pass a single image through a single network. For simplicity (and because it is more important for users of this technology), we focus only on test-time performance and not on how long these models take to train.
Though it is impractical to compare every recently proposed detection system, we are fortunate that many of the leading state-of-the-art approaches have converged on a common methodology (at least at a high level). This has allowed us to implement and compare a large number of detection systems in a unified manner. In particular, we have created implementations of the Faster R-CNN, R-FCN and SSD meta-architectures, which at a high level consist of a single convolutional network, trained with a mixed regression and classification objective, and use sliding window style predictions.
To summarize, our main contributions are as follows:

• We provide a concise survey of modern convolutional detection systems, and describe how the leading ones follow very similar designs.

• We describe our flexible and unified implementation of three meta-architectures (Faster R-CNN, R-FCN and SSD) in Tensorflow which we use to do extensive experiments that trace the accuracy/speed trade-off curve for different detection systems, varying meta-architecture, feature extractor, image resolution, etc.

• Our findings show that using fewer proposals for Faster R-CNN can speed it up significantly without a big loss in accuracy, making it competitive with its faster cousins, SSD and R-FCN. We show that SSD's performance is less sensitive to the quality of the feature extractor than Faster R-CNN and R-FCN. And we identify sweet spots on the accuracy/speed trade-off curve where gains in accuracy are only possible by sacrificing speed (within the family of detectors presented here).

• Several of the meta-architecture and feature-extractor combinations that we report have never appeared before in the literature. We discuss how we used some of these novel combinations to train the winning entry of the 2016 COCO object detection challenge.
2. Meta-architectures
Neural nets have become the leading method for high quality object detection in recent years. In this section we survey some of the highlights of this literature. The R-CNN paper by Girshick et al. [11] was among the first modern incarnations of convolutional network based detection. Inspired by recent successes on image classification [20], the R-CNN method took the straightforward approach of cropping externally computed box proposals out of an input image and running a neural net classifier on these crops. This approach can be expensive, however, because many crops are necessary, leading to significant duplicated computation from overlapping crops. Fast R-CNN [10] alleviated this problem by pushing the entire image once through a feature extractor, then cropping from an intermediate layer so that crops share the computation load of feature extraction.
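To make the sharing concrete, the following TensorFlow sketch (our illustration, not the paper's code; the VGG backbone and 7×7 crop size are arbitrary choices) crops proposals from a shared feature map with a single crop-and-resize op:

```python
import tensorflow as tf

# Illustrative sketch of Fast R-CNN style feature sharing: run the expensive
# feature extractor once on the full image, then take cheap fixed-size crops
# from the shared feature map, one per box proposal.
image = tf.random.normal([1, 600, 600, 3])               # one input image
backbone = tf.keras.applications.VGG16(include_top=False, weights=None)
feature_map = backbone(image)                            # [1, 18, 18, 512]

# Box proposals in normalized [y_min, x_min, y_max, x_max] coordinates.
proposals = tf.constant([[0.1, 0.1, 0.5, 0.5],
                         [0.3, 0.2, 0.9, 0.8]])
box_indices = tf.zeros([2], dtype=tf.int32)              # both from image 0

# Crop from the *feature map*, not the pixels, so each additional proposal
# costs only a small bilinear resize instead of a full network pass.
crops = tf.image.crop_and_resize(feature_map, proposals, box_indices,
                                 crop_size=[7, 7])       # [2, 7, 7, 512]
```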
While both R-CNN and Fast R-CNN relied on an exter-
nal proposal generator, recent works have shown that it is
possible to generate box proposals using neural networks
as well [41, 40, 8, 31]. In these works, it is typical to have a
collection of boxes overlaid on the image at different spatial
locations, scales and aspect ratios that act as “anchors”
(sometimes called “priors” or “default boxes”). A model
is then trained to make two predictions for each anchor:
(1) a discrete class prediction for each anchor, and (2) a
continuous prediction of an offset by which the anchor
needs to be shifted to fit the groundtruth bounding box.
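One common parameterization of this offset, popularized by R-CNN and used by Faster R-CNN [31], expresses the matched box as center shifts and log-scale factors relative to the anchor. A minimal NumPy sketch, for illustration only:

```python
import numpy as np

def _center_size(box):
    """Convert [y_min, x_min, y_max, x_max] to (y_center, x_center, h, w)."""
    y_min, x_min, y_max, x_max = box
    return ((y_min + y_max) / 2, (x_min + x_max) / 2,
            y_max - y_min, x_max - x_min)

def encode_box(box, anchor):
    """One common choice of box encoding phi(b; a): groundtruth box b
    expressed as center shifts and log-scale factors relative to anchor a.
    Shown only as an example of what such an encoding can look like."""
    ya, xa, ha, wa = _center_size(anchor)
    yb, xb, hb, wb = _center_size(box)
    return np.array([(yb - ya) / ha,    # vertical center shift
                     (xb - xa) / wa,    # horizontal center shift
                     np.log(hb / ha),   # log height ratio
                     np.log(wb / wa)])  # log width ratio
```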
Papers that follow this anchors methodology then minimize a combined classification and regression loss that we now describe. For each anchor $a$, we first find the best matching groundtruth box $b$ (if one exists). If such a match can be found, we call $a$ a "positive anchor", and assign it (1) a class label $y_a \in \{1, \ldots, K\}$ and (2) a vector encoding of box $b$ with respect to anchor $a$ (called the box encoding $\phi(b_a; a)$). If no match is found, we call $a$ a "negative anchor" and we set the class label to be $y_a = 0$. If for the anchor $a$ we predict box encoding $f_{loc}(I; a, \theta)$ and corresponding class $f_{cls}(I; a, \theta)$, where $I$ is the image and $\theta$ the model parameters, then the loss for $a$ is measured as a weighted sum of a location-based loss and a classification loss:

$$L(a, I; \theta) = \alpha \cdot \mathbb{1}[a \text{ is positive}] \cdot \ell_{loc}\big(\phi(b_a; a) - f_{loc}(I; a, \theta)\big) + \beta \cdot \ell_{cls}\big(y_a, f_{cls}(I; a, \theta)\big), \qquad (1)$$

where $\alpha, \beta$ are weights balancing localization and classification losses. To train the model, Equation 1 is averaged over anchors and minimized with respect to parameters $\theta$.
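To make Equation 1 concrete, the sketch below evaluates it for a single anchor; the smooth-L1 localization loss and softmax cross-entropy are typical instantiations of $\ell_{loc}$ and $\ell_{cls}$, chosen here for illustration rather than taken from the paper:

```python
import numpy as np

def smooth_l1(residual):
    """A typical l_loc: smooth-L1 applied elementwise to the residual
    phi(b_a; a) - f_loc(I; a, theta), summed over box coordinates."""
    abs_r = np.abs(residual)
    return np.sum(np.where(abs_r < 1.0, 0.5 * residual ** 2, abs_r - 0.5))

def anchor_loss(is_positive, encoding_target, loc_pred,
                class_target, class_logits, alpha=1.0, beta=1.0):
    """Equation 1 for a single anchor: localization loss (positive anchors
    only) plus classification loss; class_target = 0 means background."""
    loc_loss = smooth_l1(encoding_target - loc_pred)
    # A typical l_cls: softmax cross-entropy against the class label y_a.
    log_probs = class_logits - np.log(np.sum(np.exp(class_logits)))
    cls_loss = -log_probs[class_target]
    return alpha * float(is_positive) * loc_loss + beta * cls_loss
```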
The choice of anchors has significant implications both for accuracy and computation. In the (first) Multibox paper [8], these anchors (called "box priors" by the authors) were generated by clustering groundtruth boxes in the dataset. In more recent works, anchors are generated by tiling a collection of boxes at different scales and aspect ratios regularly across the image. The advantage of having a regular grid of anchors is that predictions for these boxes can be written as tiled predictors on the image with shared parameters (i.e., convolutions) and are reminiscent of traditional sliding window methods, e.g., [44]. The Faster R-CNN [31] paper and the (second) Multibox paper [40] (which called these tiled anchors "convolutional priors") were the first papers to take this new approach.
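Such a regular grid is simple to generate; the sketch below tiles boxes over feature-map cells, with scales and aspect ratios chosen purely for illustration:

```python
import numpy as np

def tile_anchors(grid_h, grid_w, scales=(0.1, 0.25, 0.5),
                 aspect_ratios=(0.5, 1.0, 2.0)):
    """Tile anchors regularly over a grid_h x grid_w grid of feature-map
    cells. Returns normalized [y_center, x_center, height, width] rows,
    one per anchor; the scales/aspect ratios are illustrative defaults."""
    anchors = []
    for i in range(grid_h):
        for j in range(grid_w):
            # Anchor centers sit at the (normalized) center of each cell.
            cy, cx = (i + 0.5) / grid_h, (j + 0.5) / grid_w
            for s in scales:
                for r in aspect_ratios:
                    h, w = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append([cy, cx, h, w])
    return np.array(anchors)

# e.g. a 3x3 grid with 9 boxes per location -> 81 anchors
print(tile_anchors(3, 3).shape)  # (81, 4)
```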
2.1. Meta-architectures
In our paper we focus primarily on three recent (meta)-architectures: SSD (Single Shot Multibox Detector [26]), Faster R-CNN [31] and R-FCN (Region-based Fully Convolutional Networks [6]). While these papers were originally presented with a particular feature extractor (e.g., VGG, Resnet, etc.), we now review these three methods, decoupling the choice of meta-architecture from feature extractor so that, conceptually, any feature extractor can be used with SSD, Faster R-CNN or R-FCN.
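This decoupling can be pictured as a small configuration layer; the sketch below is hypothetical (the field names and values are ours, not drawn from the actual Tensorflow implementation):

```python
from dataclasses import dataclass

# Hypothetical configuration sketch showing the decoupling: any
# meta-architecture can be paired with any feature extractor.
@dataclass
class DetectorConfig:
    meta_architecture: str   # 'ssd', 'faster_rcnn' or 'rfcn'
    feature_extractor: str   # e.g. 'vgg16', 'resnet101', 'inception_v2'
    image_size: int          # input resolution, another key speed knob

config = DetectorConfig('faster_rcnn', 'resnet101', image_size=600)
```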
2.1.1 Single Shot Detector (SSD).
Though the SSD paper was published only recently (Liu et al., [26]), we use the term SSD to refer broadly to architectures that use a single feed-forward convolutional network to directly predict classes and anchor offsets without requiring a second stage per-proposal classification operation (Figure 1a). Under this definition, the SSD meta-architecture has been explored in a number of precursors to [26]. Both Multibox and the Region Proposal Network (RPN) stage of Faster R-CNN [31] use this approach for predicting class-agnostic box proposals.
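To make the single-stage definition concrete, a minimal SSD-style prediction head is sketched below (all sizes are illustrative): shared convolutions emit class scores and box-encoding offsets for every anchor at every feature-map location in one forward pass, with no per-proposal second stage.

```python
import tensorflow as tf

# Illustrative SSD-style head: one feed-forward network predicts, for every
# anchor at every feature-map location, K+1 class scores and 4 box-encoding
# offsets in a single pass. Sizes below are arbitrary example values.
num_classes, anchors_per_loc = 20, 6

features = tf.keras.Input(shape=(38, 38, 512))         # shared feature map
cls_logits = tf.keras.layers.Conv2D(                   # class predictions
    anchors_per_loc * (num_classes + 1), 3, padding='same')(features)
box_offsets = tf.keras.layers.Conv2D(                  # box-encoding offsets
    anchors_per_loc * 4, 3, padding='same')(features)
head = tf.keras.Model(features, [cls_logits, box_offsets])

# One forward pass yields predictions for all 38*38*6 anchors at once.
logits, offsets = head(tf.random.normal([1, 38, 38, 512]))
print(logits.shape, offsets.shape)  # (1, 38, 38, 126) (1, 38, 38, 24)
```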