2 Liu et al.
Often detection speed for these approaches is measured in seconds per frame (SPF),
and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames
per second (FPS). There have been many attempts to build faster detectors by attacking
each stage of the detection pipeline (see related work in Sec. 4), but so far, significantly
increased speed comes only at the cost of significantly decreased detection accuracy.
This paper presents the first deep-network-based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy
detection (59 FPS with mAP 74.3% on VOC2007 test, vs. Faster R-CNN 7 FPS with
mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The fundamental improvement in
speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this (cf. [4,5]), but by adding a series
of improvements, we manage to increase the accuracy significantly over previous at-
tempts. Our improvements include using a small convolutional filter to predict object
categories and offsets in bounding box locations, using separate predictors (filters) for
different aspect ratio detections, and applying these filters to multiple feature maps from
the later stages of a network in order to perform detection at multiple scales. With these
modifications—especially using multiple layers for prediction at different scales—we
can achieve high accuracy using relatively low-resolution input, further increasing detection speed. While these contributions may seem small independently, we note that
the resulting system improves accuracy on real-time detection for PASCAL VOC from
63.4% mAP for YOLO to 74.3% mAP for our SSD. This is a larger relative improve-
ment in detection accuracy than that from the recent, very high-profile work on residual
networks [3]. Furthermore, significantly improving the speed of high-quality detection
can broaden the range of settings where computer vision is useful.
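The prediction scheme sketched in the paragraph above, small convolutional filters emitting category scores and box offsets for every location of a feature map, can be illustrated with a shape calculation. This is a minimal sketch; the feature-map size, the number of default boxes k per location, and the class count c below are illustrative assumptions, not values stated in the text:

```python
def predictor_output_shape(m, n, k, c):
    """Output shape of one convolutional predictor head.

    A small (e.g. 3x3) filter bank slides over an m x n feature map and,
    at every location, emits c category scores plus 4 box offsets for
    each of the k default boxes of differing aspect ratios tied to that
    location, giving k * (c + 4) output channels.
    """
    return (m, n, k * (c + 4))

# Illustrative values (not from the text): a 38x38 feature map,
# k = 4 default boxes per location, c = 21 classes (20 + background).
print(predictor_output_shape(38, 38, 4, 21))  # (38, 38, 100)
```

Because the same filters are applied to feature maps of several resolutions, the detector covers multiple object scales without resampling pixels or features.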
We summarize our contributions as follows:
– We introduce SSD, a single-shot detector for multiple categories that is faster than
the previous state of the art for single-shot detectors (YOLO), and significantly
more accurate, in fact as accurate as slower techniques that perform explicit region
proposals and pooling (including Faster R-CNN).
– The core of SSD is predicting category scores and box offsets for a fixed set of
default bounding boxes using small convolutional filters applied to feature maps.
– To achieve high detection accuracy we produce predictions of different scales from
feature maps of different scales, and explicitly separate predictions by aspect ratio.
– These design features lead to simple end-to-end training and high accuracy, even
on low-resolution input images, further improving the speed vs. accuracy trade-off.
– Experiments include timing and accuracy analysis on models with varying input
size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a
range of recent state-of-the-art approaches.
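To make the multi-scale contribution above concrete, the sketch below counts the total number of default-box predictions produced when predictors are attached to several feature maps of decreasing resolution. The specific map sizes and per-location box counts are hypothetical choices for illustration, not figures given in this section:

```python
# (feature map side, default boxes per location): a hypothetical
# configuration. Each square m x m map contributes m * m * boxes
# (score, offset) predictions, so coarse maps handle large objects
# with few predictions while fine maps densely cover small ones.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(side * side * boxes for side, boxes in feature_maps)
print(total)  # total default-box predictions per image across all scales
```

All of these predictions are produced in a single forward pass, which is what makes the detector "single shot".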
2 The Single Shot Detector (SSD)
This section describes our proposed SSD framework for detection (Sec. 2.1) and the
associated training methodology (Sec. 2.2). Afterwards, Sec. 3 presents dataset-specific
model details and experimental results.