detection systems, and describe how the leading ones
follow very similar designs.
• We describe our flexible and unified implementation
of three meta-architectures (Faster R-CNN, R-FCN
and SSD) in TensorFlow, which we use to perform exten-
sive experiments that trace the accuracy/speed trade-
off curve for different detection systems, varying meta-
architecture, feature extractor, image resolution, etc.
• Our findings show that using fewer proposals for
Faster R-CNN can speed it up significantly without
a large loss in accuracy, making it competitive with its
faster cousins, SSD and R-FCN. We show that SSD's
performance is less sensitive to the quality of the fea-
ture extractor than that of Faster R-CNN and R-FCN. We also
identify sweet spots on the accuracy/speed trade-off
curve where gains in accuracy are only possible by sac-
rificing speed (within the family of detectors presented
here).
• Several of the meta-architecture and feature-extractor
combinations that we report have never appeared be-
fore in the literature. We discuss how we used some of
these novel combinations to train the winning entry of
the 2016 COCO object detection challenge.
2. Meta-architectures
Neural nets have become the leading method for high-
quality object detection in recent years. In this section we
survey some of the highlights of this literature. The R-CNN
paper by Girshick et al. [11] was among the first modern
incarnations of convolutional network based detection. In-
spired by recent successes on image classification [20], the
R-CNN method took the straightforward approach of crop-
ping externally computed box proposals out of an input im-
age and running a neural net classifier on these crops. This
approach can be expensive, however, because many crops
are necessary, leading to significant duplicated computation
from overlapping crops. Fast R-CNN [10] alleviated this
problem by pushing the entire image through a feature
extractor once and then cropping from an intermediate layer so that
crops share the computation load of feature extraction.
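To make this shared-computation idea concrete, the sketch below (a simplification, not the implementation of [10]) crops the portion of a single precomputed feature map that corresponds to an image-space proposal; the stride, shapes and the omitted RoI pooling step are illustrative assumptions.

```python
import numpy as np

def crop_features(feature_map, box, stride=16):
    """Crop the region of a shared feature map corresponding to an
    image-space box (xmin, ymin, xmax, ymax).

    The feature map has shape (height, width, channels) and is assumed
    to be `stride` times smaller than the input image in each spatial
    dimension. Real detectors follow this crop with RoI pooling /
    resizing to a fixed size, which is omitted here.
    """
    xmin, ymin, xmax, ymax = (int(round(v / stride)) for v in box)
    return feature_map[ymin:ymax, xmin:xmax, :]

# Features are computed once for the whole image...
features = np.random.rand(38, 50, 256)   # e.g., roughly a 600x800 image at stride 16
# ...and every proposal reuses them instead of re-running the network.
crop = crop_features(features, (160, 96, 320, 256))
print(crop.shape)                        # (10, 10, 256)
```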
While both R-CNN and Fast R-CNN relied on an exter-
nal proposal generator, recent works have shown that it is
possible to generate box proposals using neural networks
as well [41, 40, 8, 31]. In these works, it is typical to have a
collection of boxes overlaid on the image at different spatial
locations, scales and aspect ratios that act as “anchors”
(sometimes called “priors” or “default boxes”). A model
is then trained to make two predictions for each anchor:
(1) a discrete class prediction for each anchor, and (2) a
continuous prediction of an offset by which the anchor
needs to be shifted to fit the groundtruth bounding box.
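As a concrete example of such an offset prediction target, one common choice (used, e.g., by Faster R-CNN [31]) encodes the groundtruth box by its center shift and log-scale size change relative to the anchor; the sketch below assumes boxes are given as (x_center, y_center, width, height).

```python
import numpy as np

def encode_box(groundtruth, anchor):
    """Encode a groundtruth box relative to an anchor.

    Both boxes are given as (x_center, y_center, width, height). This is
    the center/size parameterization used by Faster R-CNN; other
    encodings are possible.
    """
    xg, yg, wg, hg = groundtruth
    xa, ya, wa, ha = anchor
    return np.array([
        (xg - xa) / wa,    # horizontal center shift, relative to anchor width
        (yg - ya) / ha,    # vertical center shift, relative to anchor height
        np.log(wg / wa),   # log-scale width correction
        np.log(hg / ha),   # log-scale height correction
    ])

# Example: a groundtruth box slightly offset from a 100x100 anchor.
print(encode_box((110, 95, 120, 80), (100, 100, 100, 100)))
```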
Papers that follow this anchor-based methodology then
minimize a combined classification and regression loss that
we now describe. For each anchor a, we first find the best
matching groundtruth box b (if one exists). If such a match
can be found, we call a a “positive anchor”, and assign it
(1) a class label y_a ∈ {1, . . . , K} and (2) a vector encoding
of box b with respect to anchor a (called the box encoding
φ(b_a; a)). If no match is found, we call a a “negative
anchor” and set the class label to be y_a = 0. If for
the anchor a we predict box encoding f_loc(I; a, θ) and
corresponding class f_cls(I; a, θ), where I is the image and
θ the model parameters, then the loss for a is measured as
a weighted sum of a location-based loss and a classification
loss:

\[
\mathcal{L}(a, I; \theta) = \alpha \cdot \mathbb{1}[a \text{ is positive}] \cdot \ell_{\mathrm{loc}}\big(\phi(b_a; a) - f_{\mathrm{loc}}(I; a, \theta)\big)
+ \beta \cdot \ell_{\mathrm{cls}}\big(y_a, f_{\mathrm{cls}}(I; a, \theta)\big), \tag{1}
\]
where α, β are weights balancing localization and classi-
fication losses. To train the model, Equation 1 is averaged
over anchors and minimized with respect to parameters θ.
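As a hedged illustration of Equation 1 (not the authors' implementation), the sketch below evaluates the per-anchor loss with squared error standing in for ℓ_loc and softmax cross-entropy for ℓ_cls; the actual loss functions and weights α, β differ between the papers discussed above.

```python
import numpy as np

def anchor_loss(is_positive, y_a, box_target, box_pred, class_logits,
                alpha=1.0, beta=1.0):
    """Per-anchor loss in the spirit of Equation 1.

    is_positive:  whether anchor a matched a groundtruth box.
    y_a:          class label (0 = background for negative anchors).
    box_target:   box encoding phi(b_a; a) of the matched groundtruth box.
    box_pred:     predicted box encoding f_loc(I; a, theta).
    class_logits: predicted class scores f_cls(I; a, theta), length K + 1.
    Squared error and softmax cross-entropy stand in for l_loc and l_cls.
    """
    loc = np.sum((box_target - box_pred) ** 2) if is_positive else 0.0
    shifted = class_logits - np.max(class_logits)      # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cls = -log_probs[y_a]
    return alpha * loc + beta * cls

# Toy example: a positive anchor with class label 2 out of K = 3 classes.
print(anchor_loss(True, 2,
                  box_target=np.array([0.1, -0.05, 0.2, -0.2]),
                  box_pred=np.zeros(4),
                  class_logits=np.array([0.5, 0.1, 2.0, -1.0])))
```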
The choice of anchors has significant implications both
for accuracy and computation. In the (first) Multibox
paper [8], these anchors (called “box priors” by the au-
thors) were generated by clustering groundtruth boxes in
the dataset. In more recent works, anchors are generated
by tiling a collection of boxes at different scales and aspect
ratios regularly across the image. The advantage of hav-
ing a regular grid of anchors is that predictions for these
boxes can be written as tiled predictors on the image with
shared parameters (i.e., convolutions) and are reminiscent
of traditional sliding window methods, e.g. [44]. The Faster
R-CNN [31] paper and the (second) Multibox paper [40]
(which called these tiled anchors “convolutional priors”)
were the first papers to take this new approach.
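As an illustration of this regular tiling (the scales, aspect ratios and stride below are placeholders, not the settings of any particular paper), the following sketch places a fixed set of boxes at every location of a feature-map grid.

```python
import numpy as np

def tile_anchors(grid_height, grid_width, stride,
                 scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0)):
    """Tile anchor boxes regularly over an image.

    Returns an array of shape (grid_height * grid_width * len(scales) *
    len(aspect_ratios), 4) whose rows are (x_center, y_center, width,
    height). Scales, aspect ratios and stride are illustrative only.
    """
    anchors = []
    for row in range(grid_height):
        for col in range(grid_width):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in aspect_ratios:
                    w = scale * np.sqrt(ratio)   # wider boxes for ratio > 1
                    h = scale / np.sqrt(ratio)   # taller boxes for ratio < 1
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

# A 38x50 grid at stride 16 yields 38 * 50 * 9 = 17100 anchors.
print(tile_anchors(38, 50, stride=16).shape)
```

Because the same set of boxes is repeated at every grid location, the per-anchor predictions can be produced by convolutional layers whose parameters are shared across locations.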
2.1. Meta-architectures
In our paper we focus primarily on three recent (meta)-
architectures: SSD (Single Shot Multibox Detector [26]),
Faster R-CNN [31] and R-FCN (Region-based Fully Con-
volutional Networks [6]). While these papers were orig-
inally presented with a particular feature extractor (e.g.,
VGG, ResNet, etc.), we now review these three methods, de-
coupling the choice of meta-architecture from feature ex-
tractor so that conceptually, any feature extractor can be
used with SSD, Faster R-CNN or R-FCN.
2.1.1 Single Shot Detector (SSD).
Though the SSD paper was published only recently (Liu et
al., [26]), we use the term SSD to refer broadly to archi-
tectures that use a single feed-forward convolutional net-
work to directly predict classes and anchor offsets without
requiring a second-stage, per-proposal classification oper-
ation (Figure 1a). Under this definition, the SSD meta-
architecture has been explored in a number of precursors
to [26]. Both Multibox and the Region Proposal Network