
Fully Convolutional Networks for Semantic Segmentation

Jonathan Long∗   Evan Shelhamer∗   Trevor Darrell

UC Berkeley

{jonlong,shelhamer,trevor}@cs.berkeley.edu

Abstract

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

1. Introduction

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and keypoint prediction [39, 24], and local correspondence [24, 9].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

∗Authors contributed equally

[Figure 1 diagram labels: forward/inference, backward/learning, pixelwise prediction, segmentation g.t.]

Figure 1. Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation.

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].
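To make the idea of reinterpreting a classification net as fully convolutional concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes a single-channel image, stride 1, and no padding, and the function names `dense` and `dense_as_conv` are hypothetical. A fully connected layer that scores one flattened k × k patch is slid over every k × k window of a larger image, so the same weights yield a map of scores whose size follows the input size.

```python
def dense(weights, bias, patch):
    """Fully connected layer: one score per row of `weights` for a flattened patch."""
    return [sum(w * x for w, x in zip(row, patch)) + b
            for row, b in zip(weights, bias)]


def dense_as_conv(weights, bias, image, k):
    """Apply the same weights to every k x k window of a 2-D image.

    Assumes stride 1 and no padding (an illustrative simplification).
    Returns one score map per output unit, each of size
    (H - k + 1) x (W - k + 1).
    """
    height, width = len(image), len(image[0])
    maps = []
    for row, b in zip(weights, bias):
        score_map = []
        for i in range(height - k + 1):
            out_row = []
            for j in range(width - k + 1):
                # Flatten the window in the same row-major order as `patch`.
                window = [image[i + di][j + dj]
                          for di in range(k) for dj in range(k)]
                out_row.append(sum(w * x for w, x in zip(row, window)) + b)
            score_map.append(out_row)
        maps.append(score_map)
    return maps
```

On an input exactly k × k in size, `dense_as_conv` returns 1 × 1 maps that match `dense` on the flattened patch. On a larger input it returns a grid of scores, one per window, without any change to the weights.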

arXiv:1411.4038v2 [cs.CV] 8 Mar 2015

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).
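As a rough illustration of combining a coarse map with a fine one, the sketch below upsamples a coarse score map and adds it to a finer one. The details of the skip architecture are given in Section 4.2 and are not reproduced here, so both choices in this sketch are assumptions made for illustration: nearest-neighbor upsampling stands in for the in-network upsampling layers, and elementwise addition is used as the fusion rule. The names `upsample_nearest` and `fuse_skip` are hypothetical.

```python
def upsample_nearest(score_map, factor):
    """Repeat each value `factor` times along both axes.

    Nearest-neighbor upsampling is an assumed stand-in for the
    in-network upsampling layers.
    """
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse_skip(coarse, fine, factor=2):
    """Upsample `coarse` to the resolution of `fine` and add them elementwise.

    Elementwise addition is an assumed fusion rule for this sketch.
    """
    up = upsample_nearest(coarse, factor)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly `factor` times the coarse map size")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]
```

The coarse map contributes the same value to every pixel in a `factor` × `factor` block, while the fine map varies within the block; this is the sense in which the deep layer supplies the "what" and the shallow layer the "where".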

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multi-layer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

2. Related work

Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Fully convolutional networks. To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Dense prediction with convnets. Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include

• small models restricting capacity and receptive fields;

• patchwise training [27, 2, 8, 28, 11];

• post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];

• input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29];

• multi-scale pyramid processing [8, 28, 11];

• saturating tanh nonlinearities [8, 5, 28]; and

• ensembles [2, 11],

whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and “shift-and-stitch” dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [6] is a special case.

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.

They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.

3. Fully convolutional networks

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.
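The size of a receptive field can be worked out from the kernel sizes and strides of the layers beneath a location. The sketch below uses the standard recurrence for a chain of convolution and pooling layers along one spatial axis; it is an illustration under simplifying assumptions (no dilation, a single path through the net), and the function name `receptive_field` and the example layer stack are hypothetical rather than taken from any network in this work.

```python
def receptive_field(layers):
    """Return the receptive field size in input pixels along one axis.

    `layers` is a sequence of (kernel_size, stride) pairs, ordered from
    the layer nearest the image to the deepest layer.
    """
    size = 1  # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between adjacent units of the current layer
    for kernel, stride in layers:
        # Each extra kernel tap reaches `jump` input pixels further.
        size += (kernel - 1) * jump
        # Striding spreads the next layer's units further apart.
        jump *= stride
    return size
```

For example, two 3-wide stride-1 convolutions, a 2-wide stride-2 pooling layer, and one more 3-wide convolution give `receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]) == 10`: the final convolution's taps are two input pixels apart because of the pooling stride.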

Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location (i, j) in a particular layer, and $y_{ij}$ for the follow-
