
Minimizing Supervision for Free-space Segmentation
Satoshi Tsutsui∗,†
Indiana University
stsutsui@indiana.edu

Tommi Kerola†    Shunta Saito†
Preferred Networks, Inc.
{tommi,shunta}@preferred.jp

David J. Crandall
Indiana University
djcran@indiana.edu

∗ Part of this work was done as an intern at Preferred Networks, Inc.
† The first three authors contributed equally. The order was decided by using the paper ID as a random seed and then calling np.random.permutation.
Abstract
Identifying “free-space,” or safely driveable regions in the scene ahead, is a fundamental task for autonomous navigation. While this task can be addressed using semantic segmentation, the manual labor involved in creating pixel-wise annotations to train the segmentation model is very costly. Although weakly supervised segmentation addresses this issue, most methods are not designed for free-space. In this paper, we observe that homogeneous texture and location are two key characteristics of free-space, and develop a novel, practical framework for free-space segmentation with minimal human supervision. Our experiments show that our framework performs better than other weakly supervised methods while using less supervision. Our work demonstrates the potential for performing free-space segmentation without tedious and costly manual annotation, which will be important for adapting autonomous driving systems to different types of vehicles and environments.
1. Introduction
A critical perceptual problem in autonomous vehicle navigation is deciding whether the path ahead is safe and free of potential collisions. While some problems (like traffic sign detection) may just require detecting and recognizing objects, avoiding collisions requires fine-grained, pixel-level understanding of the scene in front of the vehicle, to separate “free-space” [24] (road surfaces that are free of obstacles, in the case of autonomous cars, for example) from other scene content in view.
Free-space segmentation can be addressed by existing fully-supervised semantic segmentation algorithms [33]. But a major challenge is the cost of obtaining pixel-wise ground truth annotations to train these algorithms: human labeling of a single object in a single image can take approximately 80 seconds [7], while annotating all road-related objects in a street scene may take over an hour [12]. The high cost of collecting training data may be a substantial barrier for developing autonomous driving systems for new environments that have not yet received commercial attention (e.g., in resource-poor countries, for off-road contexts, for autonomous water vehicles, etc.), and especially for small companies and research groups with limited resources.
In this paper, we develop a framework for free-space segmentation that minimizes human supervision. Our approach is based on two straightforward observations. First, free-space has a strong location prior: pixels corresponding to free space are likely to be located at the bottom and center of the image taken by a front-facing camera, since in training data there is always free-space under the vehicle (by definition). Second, a free-space region generally has homogeneous texture, since road surfaces are typically level and smooth (e.g., concrete or asphalt in an urban street).
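As a toy illustration of the location prior (our own sketch, not the prior actually used in this paper), one could weight each pixel by how close it is to the bottom-center of the image:

import numpy as np

def location_prior(height, width):
    # Toy bottom-center prior: weights are high near the bottom-center of a
    # front-facing camera image and decay toward the top and the sides.
    ys = np.linspace(0.0, 1.0, height)[:, None]  # 0 = top row, 1 = bottom row
    xs = np.linspace(-1.0, 1.0, width)[None, :]  # 0 = horizontal center
    return ys * np.exp(-(xs ** 2) / 0.5)         # in [0, 1], peaks bottom-center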
To take advantage of these observations, we first group together pixels with low-level homogeneous texture into superpixels. We then select candidate free-space superpixels through a simple clustering algorithm that incorporates both the spatial prior and appearance features (§3.3).
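A minimal sketch of this step (assuming SLIC superpixels and k-means over mean color plus normalized centroid features; the actual features and clustering are given in §3.3) could look like:

import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import KMeans

def cluster_superpixels(image, n_segments=100, n_clusters=5):
    # Group pixels with homogeneous low-level texture into superpixels.
    sp_labels = slic(image, n_segments=n_segments, compactness=10)
    feats = []
    for s in np.unique(sp_labels):
        mask = sp_labels == s
        ys, xs = np.nonzero(mask)
        color = image[mask].mean(axis=0)          # appearance: mean color
        centroid = [ys.mean() / image.shape[0],   # spatial prior: normalized
                    xs.mean() / image.shape[1]]   # superpixel centroid
        feats.append(np.concatenate([color, centroid]))
    clusters = KMeans(n_clusters=n_clusters).fit_predict(np.array(feats))
    return sp_labels, clusters

Candidate free-space superpixels would then be those in the cluster whose average location best matches the bottom-center prior.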
The remaining challenge is to create higher-level features for each superpixel that semantically distinguish free-space. We show that features from a CNN pre-trained on ImageNet (§3.1) perform well for free-space when combined with superpixel alignment, a novel method that aligns superpixels with CNN feature maps (§3.2). Finally, these results are used as labels to train a supervised segmentation method (§3.4) for performing segmentation on new images.
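One simple way to approximate such an alignment (a sketch under the assumption that per-superpixel features are obtained by averaging bilinearly upsampled feature maps; §3.2 describes the actual method) is:

import numpy as np
from skimage.transform import resize

def superpixel_cnn_features(feature_map, sp_labels):
    # feature_map: (h, w, c) activations from an ImageNet-pretrained CNN.
    # sp_labels:   (H, W) superpixel labels at full image resolution.
    H, W = sp_labels.shape
    fmap = resize(feature_map, (H, W), order=1, preserve_range=True)
    # Average the upsampled activations over each superpixel.
    return np.stack([fmap[sp_labels == s].mean(axis=0)
                     for s in np.unique(sp_labels)])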
We note that our framework does not need any image annotations, so collecting training data is a simple matter of recording vehicle-centric images while navigating the environment where free-space segmentation is needed, and then running our algorithm. The human effort required is reduced to specifying the location prior and adjusting hyperparameters such as superpixel granularity and the number of clusters. This form of supervision requires little effort because the technique is not very sensitive to the exact values of these parameters, as we empirically demonstrate.