SSD: Single Shot MultiBox Detector
Wei Liu¹, Dragomir Anguelov², Dumitru Erhan³, Christian Szegedy³, Scott Reed⁴, Cheng-Yang Fu¹, Alexander C. Berg¹
¹UNC Chapel Hill  ²Zoox Inc.  ³Google Inc.  ⁴University of Michigan, Ann-Arbor
¹wliu@cs.unc.edu, ²drago@zoox.com, ³{dumitru,szegedy}@google.com, ⁴reedscot@umich.edu, ¹{cyfu,aberg}@cs.unc.edu
Abstract. We present a method for detecting objects in images using a single
deep neural network. Our approach, named SSD, discretizes the output space of
bounding boxes into a set of default boxes over different aspect ratios and scales
per feature map location. At prediction time, the network generates scores for the
presence of each object category in each default box and produces adjustments to
the box to better match the object shape. Additionally, the network combines pre-
dictions from multiple feature maps with different resolutions to naturally handle
objects of various sizes. SSD is simple relative to methods that require object
proposals because it completely eliminates proposal generation and subsequent
pixel or feature resampling stages and encapsulates all computation in a single
network. This makes SSD easy to train and straightforward to integrate into sys-
tems that require a detection component. Experimental results on the PASCAL
VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy
to methods that utilize an additional object proposal step and is much faster, while
providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP¹ on VOC2007 test at 59 FPS on a Nvidia Titan
X and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a compa-
rable state-of-the-art Faster R-CNN model. Compared to other single stage meth-
ods, SSD has much better accuracy even with a smaller input image size. Code is
available at: https://github.com/weiliu89/caffe/tree/ssd .
Keywords: Real-time Object Detection; Convolutional Neural Network
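As a concrete illustration of the default boxes described in the abstract, the sketch below generates (cx, cy, w, h) boxes, normalized to [0, 1], for every cell of one square feature map. The function name is ours, and the extra square box at scale sqrt(s_k · s_{k+1}) for aspect ratio 1 follows the full paper's default-box parameterization rather than anything stated in the abstract, so treat this as an assumption-laden sketch, not the reference implementation:

```python
import itertools
import math

def default_boxes(fmap_size, scale, next_scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate default boxes (cx, cy, w, h) for one square feature map.

    One box per aspect ratio at scale `scale`, plus one extra box at
    scale sqrt(scale * next_scale) for aspect ratio 1 (per-location
    layout assumed from the paper's later description)."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        # box centers sit at the center of each feature-map cell
        cx = (j + 0.5) / fmap_size
        cy = (i + 0.5) / fmap_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
        # extra square box between this scale and the next feature map's scale
        s_extra = math.sqrt(scale * next_scale)
        boxes.append((cx, cy, s_extra, s_extra))
    return boxes
```

For a 4 × 4 feature map with three aspect ratios, this yields 4 · 4 · (3 + 1) = 64 default boxes, matching the "set of default boxes per feature map location" that the network scores at prediction time.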
1 Introduction
Current state-of-the-art object detection systems are variants of the following approach:
hypothesize bounding boxes, resample pixels or features for each box, and apply a high-
quality classifier. This pipeline has prevailed on detection benchmarks since the Selec-
tive Search work [1] through the current leading results on PASCAL VOC, COCO, and
ILSVRC detection, all based on Faster R-CNN [2], albeit with deeper features such as
[3]. While accurate, these approaches have been too computationally intensive for em-
bedded systems and, even with high-end hardware, too slow for real-time applications.
¹ We achieved even better results using an improved data augmentation scheme in follow-on
experiments: 77.2% mAP for 300×300 input and 79.8% mAP for 512×512 input on VOC2007.
Please see Sec. 3.6 for details.
arXiv:1512.02325v5 [cs.CV] 29 Dec 2016
Often detection speed for these approaches is measured in seconds per frame (SPF),
and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames
per second (FPS). There have been many attempts to build faster detectors by attacking
each stage of the detection pipeline (see related work in Sec. 4), but so far, significantly
increased speed comes only at the cost of significantly decreased detection accuracy.
This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy
detection (59 FPS with mAP 74.3% on VOC2007 test, vs. Faster R-CNN 7 FPS with
mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The fundamental improvement in
speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this (cf. [4,5]), but by adding a series
of improvements, we manage to increase the accuracy significantly over previous at-
tempts. Our improvements include using a small convolutional filter to predict object
categories and offsets in bounding box locations, using separate predictors (filters) for
different aspect ratio detections, and applying these filters to multiple feature maps from
the later stages of a network in order to perform detection at multiple scales. With these
modifications—especially using multiple layers for prediction at different scales—we
can achieve high accuracy using relatively low resolution input, further increasing detection speed. While these contributions may seem small independently, we note that
the resulting system improves accuracy on real-time detection for PASCAL VOC from
63.4% mAP for YOLO to 74.3% mAP for our SSD. This is a larger relative improve-
ment in detection accuracy than that from the recent, very high-profile work on residual
networks [3]. Furthermore, significantly improving the speed of high-quality detection
can broaden the range of settings where computer vision is useful.
We summarize our contributions as follows:
– We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
– The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
– To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.
– These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs accuracy trade-off.
– Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.
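The second contribution above — small convolutional filters predicting category scores and box offsets for k default boxes per location — fixes the output shape of each predictor. A minimal sketch of that arithmetic, assuming the SSD300 layout reported later in the paper (feature maps of sizes 38, 19, 10, 5, 3, 1 with 4 or 6 boxes per location) and function names of our own choosing:

```python
def head_output_channels(num_classes, boxes_per_location):
    """Output channels of one 3x3 conv predictor: for each of the k
    default boxes at a cell, num_classes category scores plus 4 box
    offsets (adjustments to cx, cy, w, h)."""
    return (num_classes + 4) * boxes_per_location

def total_default_boxes(fmap_sizes, boxes_per_location):
    """Total default boxes across all square feature maps used for
    prediction (each map contributes size^2 * k boxes)."""
    return sum(s * s * k for s, k in zip(fmap_sizes, boxes_per_location))

# Assumed SSD300 layout: six feature maps, 4 or 6 boxes per location.
fmap_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_loc = [4, 6, 6, 6, 4, 4]
```

With 21 classes (PASCAL VOC's 20 plus background) and k = 4, each predictor emits (21 + 4) · 4 = 100 channels, and the six maps above together contribute 8732 default boxes — the fixed set the network scores in one forward pass.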
2 The Single Shot Detector (SSD)
This section describes our proposed SSD framework for detection (Sec. 2.1) and the
associated training methodology (Sec. 2.2). Afterwards, Sec. 3 presents dataset-specific
model details and experimental results.