motion (SfM) and multi-view stereo (MVS) methods can
automatically produce dense depth. Such images have been
widely used in research on large-scale 3D reconstruction [35, 14, 2, 8]. We propose to use the outputs of these systems
as the inputs to machine learning methods for single-view
depth prediction. By using large amounts of diverse training
data from photos taken around the world, we seek to learn
to predict depth with high accuracy and generalizability.
Based on this idea, we introduce MegaDepth (MD), a large-
scale depth dataset generated from Internet photo collections,
which we make fully available to the community.
To our knowledge, ours is the first use of Internet
SfM+MVS data for single-view depth prediction. Our main
contribution is the MD dataset itself. In addition, in creating
MD, we found that care must be taken in preparing a dataset
from noisy MVS data, and so we also propose new methods
for processing raw MVS output, and a corresponding new
loss function for training models with this data. Notably,
because MVS tends to not reconstruct dynamic objects (peo-
ple, cars, etc), we augment our dataset with ordinal depth
relationships automatically derived from semantic segmen-
tation, and train with a joint loss that includes an ordinal
term. In our experiments, we show that by training on MD,
we can learn a model that works well not only on images
of new scenes, but also generalizes remarkably well to
completely different datasets, including Make3D, KITTI,
and DIW, achieving much better generalization than models
trained on prior datasets. Figure 1 shows example results spanning different
test sets from a network trained solely on our MD dataset.
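To make the joint objective concrete, the following is a minimal sketch, assuming a scale-invariant data term on log-depth over pixels with valid MVS depth plus a hinge-style ordinal term over point pairs; the tensor shapes, margin, and weighting are illustrative assumptions, not the exact formulation used to train our models.

import torch

def scale_invariant_loss(pred_log_depth, gt_log_depth, mask):
    # Scale-invariant MSE over pixels with valid MVS depth (mask == 1).
    d = (pred_log_depth - gt_log_depth) * mask
    n = mask.sum().clamp(min=1)
    return (d ** 2).sum() / n - (d.sum() ** 2) / (n ** 2)

def ordinal_loss(pred_log_depth, pairs, margin=0.1):
    # Hinge loss encouraging pixel i to have smaller (closer) log-depth
    # than pixel j for each ordinal pair (i, j) meaning "i is in front of j".
    zi = pred_log_depth.view(-1)[pairs[:, 0]]
    zj = pred_log_depth.view(-1)[pairs[:, 1]]
    return torch.clamp(margin - (zj - zi), min=0).mean()

def joint_loss(pred_log_depth, gt_log_depth, mask, pairs, w_ord=0.5):
    # Data term on MVS-supervised pixels plus an ordinal term on pairs
    # derived from semantic segmentation (e.g., a person in front of a
    # reconstructed facade). w_ord is an illustrative weight.
    return (scale_invariant_loss(pred_log_depth, gt_log_depth, mask)
            + w_ord * ordinal_loss(pred_log_depth, pairs))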
2. Related work
Single-view depth prediction.
A variety of methods have
been proposed for single-view depth prediction, most re-
cently by utilizing machine learning [15, 28]. A standard
approach is to collect RGB images with ground truth depth,
and then train a model (e.g., a CNN) to predict depth from
RGB [7, 22, 23, 27, 3, 19]. Most such methods are trained on
a few standard datasets, such as NYU [33, 34], Make3D [29],
and KITTI [11], which are captured using RGB-D sensors
(such as Kinect) or laser scanning. Such scanning methods
have important limitations, as discussed in the introduction.
Recently, Novotny et al. [26] trained a network on 3D mod-
els derived from SfM+MVS on videos to learn 3D shapes of
single objects. However, their method is limited to images
of objects, rather than scenes.
Multiple views of a scene can also be used as an im-
plicit source of training data for single-view depth pre-
diction, by utilizing view synthesis as a supervisory sig-
nal [38, 10, 13, 43]. However, view synthesis is only a proxy
for depth, and may not always yield high-quality learned
depth. Ummenhofer et al. [36] trained from overlapping
image pairs taken with a single camera, and learned to pre-
dict image matches, camera poses, and depth. However, their
method requires two input images at test time.
Ordinal depth prediction.
Another way to collect depth
data for training is to ask people to manually annotate depth
in images. While labeling absolute depth is challenging,
people are good at specifying relative (ordinal) depth rela-
tionships (e.g., closer-than, further-than) [12]. Zoran et al.
[44] used such relative depth judgments to predict ordinal
relationships between points using CNNs. Chen et al. lever-
aged crowdsourcing of ordinal depth labels to create a large
dataset called “Depth in the Wild” [4]. While useful for pre-
dicting depth ordering (and so we incorporate ordinal data
automatically generated from our imagery), the Euclidean
accuracy of depth learned solely from ordinal data is limited.
Depth estimation from Internet photos.
Estimating ge-
ometry from Internet photo collections has been an active
research area for a decade, with advances in both structure
from motion [35, 2, 37, 30] and multi-view stereo [14, 9, 32].
These techniques generally operate on 10s to 1000s of im-
ages. Using such methods, past work has used retrieval and
SfM to build a 3D model seeded from a single image [31],
or registered a photo to an existing 3D model to transfer
depth [40]. However, this work requires either having a de-
tailed 3D model of each location in advance, or building one
at run-time. Instead, we use SfM+MVS to train a network
that generalizes to novel locations and scenarios.
3. The MegaDepth Dataset
In this section, we describe how we construct our dataset.
We first download Internet photos from Flickr for a set
of well-photographed landmarks from the Landmarks10K
dataset [21]. We then reconstruct each landmark in 3D using
state-of-the-art SfM and MVS methods. This yields an SfM
model as well as a dense depth map for each reconstructed
image. However, these depth maps have significant noise
and outliers, and training a deep network on this raw depth
data will not yield a useful predictor. Therefore, we propose
a series of processing steps that prepare these depth maps for
use in learning, and additionally use semantic segmentation
to automatically generate ordinal depth data.
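As an illustration of the last step, the sketch below derives ordinal training pairs from a semantic segmentation mask and an MVS depth map. The pairing rule used here, treating segmented dynamic-object pixels as closer than reconstructed background pixels, and the inputs seg_mask and mvs_depth are assumptions for illustration, not necessarily the exact selection criteria used to build the dataset.

import numpy as np

def sample_ordinal_pairs(seg_mask, mvs_depth, num_pairs=1000, seed=0):
    # seg_mask: boolean mask of pixels labeled as dynamic objects
    # (people, cars) by semantic segmentation.
    # mvs_depth: MVS depth map, assumed 0 where no depth was reconstructed.
    rng = np.random.default_rng(seed)
    fg = np.flatnonzero(seg_mask.ravel())
    bg = np.flatnonzero(~seg_mask.ravel() & (mvs_depth.ravel() > 0))
    if len(fg) == 0 or len(bg) == 0:
        return np.empty((0, 2), dtype=np.int64)
    # Each pair (i, j) encodes the relation "pixel i is closer than
    # pixel j" and can feed the ordinal term of a joint loss.
    i = rng.choice(fg, size=num_pairs)
    j = rng.choice(bg, size=num_pairs)
    return np.stack([i, j], axis=1)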
3.1. Photo calibration and reconstruction
We build a 3D model from each photo collection using
COLMAP, a state-of-the-art SfM system [30] (for reconstructing
camera poses and sparse point clouds) and MVS system [32]
(for generating dense depth maps). We use COLMAP because
we found that it produces high-quality 3D models via its
careful incremental SfM procedure, but other such systems
could be used. COLMAP produces a depth map D for every
reconstructed photo I (where some pixels of D can be empty
if COLMAP was unable to recover a depth), as well as other
outputs, such as camera parameters and sparse SfM points
plus camera visibility.
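For reference, the snippet below loads such a depth map and a valid-pixel mask. It assumes the binary layout used by COLMAP's bundled Python scripts (an ASCII "width&height&channels&" header followed by float32 values in column-major order) and that empty pixels are stored as zeros; the file path is illustrative, and the layout should be verified against the COLMAP version in use.

import numpy as np

def read_colmap_depth_map(path):
    # Parse the ASCII header "width&height&channels&", then read the raw
    # float32 payload, mirroring COLMAP's read_write_dense.py.
    with open(path, "rb") as f:
        header = b""
        while header.count(b"&") < 3:
            header += f.read(1)
        width, height, channels = (int(x) for x in header.split(b"&")[:3])
        data = np.fromfile(f, np.float32)
    data = data.reshape((width, height, channels), order="F")
    return np.transpose(data, (1, 0, 2)).squeeze()

# Example (hypothetical path inside a COLMAP dense workspace):
# depth = read_colmap_depth_map(
#     "dense/stereo/depth_maps/IMG_0001.jpg.geometric.bin")
# valid = depth > 0  # pixels where COLMAP recovered a depth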