
On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach

Nikolai Smolyanskiy, Alexey Kamenev, Stan Birchfield

NVIDIA

{nsmolyanskiy, akamenev, sbirchfield}@nvidia.com

Abstract

We revisit the problem of visual depth estimation in the context of autonomous vehicles. Despite the progress on monocular depth estimation in recent years, we show that the gap between monocular and stereo depth accuracy remains large, a particularly relevant result due to the prevalent reliance upon monocular cameras by vehicles that are expected to be self-driving. We argue that the challenges of removing this gap are significant, owing to fundamental limitations of monocular vision. As a result, we focus our efforts on depth estimation by stereo. We propose a novel semi-supervised learning approach to training a deep stereo neural network, along with a novel architecture containing a machine-learned argmax layer and a custom runtime that enables a smaller version of our stereo DNN to run on an embedded GPU. Competitive results are shown on the KITTI 2015 stereo dataset. We also evaluate the recent progress of stereo algorithms by measuring the impact upon accuracy of various design criteria.

1. Introduction

Estimating depth from images is a long-standing problem in computer vision. Depth perception is useful for scene understanding, scene reconstruction, virtual and augmented reality, obstacle avoidance, self-driving cars, robotics, and other applications.

Traditionally, multiple images have been used to estimate depth. Techniques that fall within this category include stereo, photometric stereo, depth from focus, depth from defocus, time-of-flight, and structure from motion. The reasons for using multiple images are twofold: 1) absolute depth estimates require at least one known distance in the world, which can often be provided by some knowledge regarding the multi-camera rig (e.g., the baseline between stereo cameras); and 2) multiple images provide geometric constraints that can be leveraged to overcome the many ambiguities of photometric data.

Footnote: Video of the system is at https://youtu.be/0FPQdVOYoAU.

Footnote: Although time-of-flight does not, in theory, require multiple images, in practice multiple images are collected with different bandwidths in order to achieve high accuracy over long ranges.

The alternative is to use a single image to estimate depth. We argue that this alternative, due to its fundamental limitations, is not likely to be able to achieve high-accuracy depth estimation at large distances in unfamiliar environments. As a result, in the context of self-driving cars we believe monocular depth estimation is not likely to yield results with sufficient accuracy. In contrast, we offer a novel, efficient deep-learning stereo approach that achieves compelling results on the KITTI 2015 dataset by leveraging a semi-supervised loss function (using lidar and photometric consistency), a concatenating cost volume, 3D convolutions, and a machine-learned argmax function.

The contributions of the paper are as follows:

• Quantitative and qualitative demonstration of the gap in depth accuracy between monocular and stereoscopic depth.

• A novel semi-supervised approach (combining lidar and photometric losses) to training a deep stereo neural network. To our knowledge, ours is the first deep stereo network to do so.

• A smaller version of our network, and a custom runtime, that runs at near real-time (~20 fps) on a standard GPU, and runs efficiently on an embedded GPU. To our knowledge, ours is the first stereo DNN to run on an embedded GPU.

• Quantitative analysis of various network design choices, along with a novel machine-learned argmax layer that yields smoother disparity maps.

2. Motivation

The undeniable success of deep neural networks in computer vision has encouraged researchers to pursue the problem of estimating depth from a single image [5, 20, 6, 9, 17]. This is, no doubt, a noble endeavor: if it were possible to accurately estimate depth from a single image, then the complexity (and hence cost) of the hardware needed would be dramatically reduced, which would broaden the applicability substantially. An excellent overview of existing work on monocular depth estimation can be found in [9].

Footnote: Similarly, Kuznietsov et al. [17] use a semi-supervised approach for training a monocular network.

arXiv:1803.09719v3 [cs.CV] 20 Apr 2018

Nevertheless, there are reasons to be cautious about the reported success of monocular depth. To date, monocular depth solutions, while yielding encouraging preliminary results, are not at the point where reliable information (from a robotics point of view) can be expected from them. And although such solutions will continue to improve, monocular depth will never overcome well-known fundamental limitations, such as the need for a world measurement to infer absolute depth, and the ambiguity that arises when a photograph is taken of a photograph (an important observation for biometric and security systems).

One of the motivations for monocular depth is a long-standing belief that stereo is only useful at close range. It has been widely reported, for example in [10], that beyond about 6 meters, the human visual system is essentially monocular. But there is mounting evidence that the human stereo system is actually much more capable than that. Multiple studies have shown metric depth estimation up to 20 meters [18, 1]; and, although error increases as disparity increases [13], controlled experiments have confirmed that scaled disparity can be estimated up to 300 m, even without any depth cues from monocular vision [22]. Moreover, since the human visual system is capable of estimating disparity as small as a few seconds of arc [22], there is reason to believe that the distance could be 1 km or greater, with some evidence supporting such a claim provided by the experiments of [4]. Note that an artificial stereo system whose baseline is wider than the average 65 mm interpupillary distance of the human visual system has the potential to provide even greater accuracy.
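As a rough plausibility check on the 1 km figure, the sketch below applies the small-angle approximation Z ≈ b/θ, where b is the baseline and θ is the smallest detectable angular disparity (measured relative to a point at infinity). This is only a back-of-the-envelope model, not a method from the paper. The function name, the sample thresholds of 2, 5, and 10 arcseconds (standing in for "a few seconds of arc"), and the 0.5 m wide baseline are illustrative assumptions.

```python
import math

ARCSEC_TO_RAD = math.pi / (180.0 * 3600.0)  # one second of arc, in radians


def max_stereo_range_m(baseline_m: float, min_disparity_arcsec: float) -> float:
    """Distance at which a point's angular disparity (relative to infinity)
    falls to the smallest detectable value, using the small-angle
    approximation Z ~ b / theta. Hypothetical helper for illustration."""
    theta_rad = min_disparity_arcsec * ARCSEC_TO_RAD
    return baseline_m / theta_rad


if __name__ == "__main__":
    human_baseline_m = 0.065  # 65 mm interpupillary distance, from the text
    wide_baseline_m = 0.5     # assumed baseline for an artificial rig
    for arcsec in (2.0, 5.0, 10.0):  # assumed values for "a few seconds of arc"
        human = max_stereo_range_m(human_baseline_m, arcsec)
        wide = max_stereo_range_m(wide_baseline_m, arcsec)
        print(f"{arcsec:4.1f} arcsec: human ~{human:7.0f} m, wide rig ~{wide:7.0f} m")
```

Under this simplified model, even the coarsest assumed threshold of 10 arcseconds gives a range above 1 km for a 65 mm baseline, and the range scales linearly with the baseline, which is consistent with the remark about wider artificial rigs.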

This question takes on renewed significance in the context of self-driving cars, since most automobile manufacturers and developers of experimental autonomous vehicles do not install stereo cameras in their vehicles. Rather, these systems rely on various combinations of monocular cameras, lidar, radar, and sonar sensors. For detecting static obstacles such as trees, poles, railings, and concrete barriers, most systems rely on cameras and/or lidar. Although the questions of whether monocular cameras are sufficient for self-driving behavior (certainly people with monocular vision can drive safely in most situations), or whether stereo is better than lidar, are beyond the scope of this paper, we argue that the proper engineering approach to such a safety-critical system is to leverage all available sensors rather than assume they are not needed; thus, we believe that it is important to accurately assess the increased error in depth estimation when relying upon monocular cameras.

Footnote: To the authors' knowledge, at the time of this writing stereo cameras can be found only on certain models of Mercedes and Subaru vehicles; no major autonomous platform uses them.

Footnote: Tesla vehicles, for example, are equipped with monocular cameras, sonar, and radar, but no lidar. Despite having multiple foveated cameras for wider field of view, such vehicles do not rely upon depth from stereopsis.

At typical highway speeds, the braking distance required to completely stop before impact necessitates observing an unforeseen stopped object approximately 100 m away. Intrigued by the reported success of monocular depth, we tried some recent algorithms, only to discover that monocular depth is not able to achieve accuracies anywhere close to that requirement. We then turned our attention to stereo, where significant progress has been made in recent years in applying deep learning to the problem [25, 24, 11, 27, 26, 29, 8, 15, 23]. An excellent overview of recent stereo algorithms can be found in [15]. In this flurry of activity, a variety of architectures have been proposed, but there has been no systematic study as to how these design choices impact quality. One purpose of this paper is thus to investigate several of these options in order to quantify their impact, which we do in Sec. 5. In the context of this study, we develop a novel semi-supervised stereo approach, which we present in Sec. 4. First, however, we illustrate the limitations of monocular depth estimation in the next section.
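The roughly 100 m sensing requirement quoted earlier in this section can be sanity-checked with the textbook stopping-distance model: distance travelled during the reaction delay plus distance travelled under constant deceleration. The paper does not state how its figure was obtained, so the sketch below is only one plausible reading. The function name, the 1.5 s reaction delay, the 7 m/s² deceleration, and the sample speeds are all assumed values.

```python
def stopping_distance_m(speed_kmh: float, reaction_s: float, decel_ms2: float) -> float:
    """Total stopping distance: reaction distance (v * t) plus braking
    distance under constant deceleration (v^2 / (2 * a)).
    Hypothetical helper; not taken from the paper."""
    v_ms = speed_kmh / 3.6                       # km/h -> m/s
    reaction_distance = v_ms * reaction_s        # travelled before braking starts
    braking_distance = v_ms ** 2 / (2.0 * decel_ms2)
    return reaction_distance + braking_distance


if __name__ == "__main__":
    REACTION_S = 1.5  # assumed perception/actuation delay, in seconds
    DECEL_MS2 = 7.0   # assumed deceleration on dry pavement, in m/s^2
    for speed_kmh in (100.0, 110.0, 130.0):  # assumed "typical highway speeds"
        d = stopping_distance_m(speed_kmh, REACTION_S, DECEL_MS2)
        print(f"{speed_kmh:5.0f} km/h -> stop within ~{d:5.1f} m")
```

With these assumed parameters the result at 100 km/h is a little under 100 m and grows quickly with speed, so depth estimates must remain reliable at roughly that range for the obstacle to be detected in time.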

3. Difficulties of Monocular Depth Estimation

To appreciate the gap between mono and stereo vision, consider the image of Fig. 1, with several points of interest highlighted. Without knowing the scene, if you were to ask yourself whether the width of the near road (on which the car (A) sits) is greater than the width of the far tracks (distance between the near and far poles (E and F)), you might be tempted to answer in the affirmative. After all, the road not only occupies more pixels in the image (which is to be expected, since it is closer to the camera), but it occupies orders of magnitude more pixels. We showed this image to several people in our lab, and they all reached the same conclusion: the road indeed appears to be significantly wider. As it turns out, if this image is any indication, people are not very good at estimating metric depth from a single image.

The output of a leading monocular depth algorithm, called MonoDepth [9], is shown in Fig. 2, along with the output of our stereo depth algorithm. At first glance, both results appear plausible. Although the stereo algorithm preserves crisper object boundaries and appears at least slightly more accurate, it is difficult to tell from the grayscale im-

Footnote: Specifically, we asked 8 people to estimate the distance to the fence (ground truth 14 m) and the distance to the building (ground truth 30 m). Their estimates on average were 9.3 m and 12.4 m, respectively. The distances were therefore underestimated by 34% and 59%, respectively, and the distance from the fence to the building was underestimated by 81%.

Footnote: Other monocular algorithms produce similar results.
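The underestimation percentages in the survey footnote follow directly from the reported averages, taking relative underestimation as (truth − estimate) / truth. The short sketch below reproduces that arithmetic. The function and variable names are illustrative; the four distances are the ones given in the footnote.

```python
def underestimation_pct(true_m: float, estimated_m: float) -> float:
    """Relative underestimation, in percent of the ground-truth distance."""
    return 100.0 * (true_m - estimated_m) / true_m


# Ground truth and average human estimates, as reported in the footnote.
FENCE_TRUE_M, BUILDING_TRUE_M = 14.0, 30.0
FENCE_EST_M, BUILDING_EST_M = 9.3, 12.4

fence_pct = underestimation_pct(FENCE_TRUE_M, FENCE_EST_M)
building_pct = underestimation_pct(BUILDING_TRUE_M, BUILDING_EST_M)

# The fence-to-building gap is the difference of the two distances.
gap_pct = underestimation_pct(BUILDING_TRUE_M - FENCE_TRUE_M,
                              BUILDING_EST_M - FENCE_EST_M)

print(f"fence: {fence_pct:.0f}%, building: {building_pct:.0f}%, gap: {gap_pct:.0f}%")
```

The gap error is much larger than either individual error because the true 16 m separation was perceived as only about 3 m: the viewers compressed the far distance far more than the near one.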
