On the Importance of Stereo for Accurate Depth Estimation:
An Efficient Semi-Supervised Deep Neural Network Approach
Nikolai Smolyanskiy Alexey Kamenev Stan Birchfield
NVIDIA
{nsmolyanskiy, akamenev, sbirchfield}@nvidia.com
Abstract
We revisit the problem of visual depth estimation in the
context of autonomous vehicles. Despite the progress on
monocular depth estimation in recent years, we show that
the gap between monocular and stereo depth accuracy
remains large, a particularly relevant result due to the
prevalent reliance upon monocular cameras by vehicles that
are expected to be self-driving. We argue that the challenges
of removing this gap are significant, owing to fundamental
limitations of monocular vision. As a result, we focus
our efforts on depth estimation by stereo. We propose a
novel semi-supervised learning approach to training a deep
stereo neural network, along with a novel architecture
containing a machine-learned argmax layer and a custom
runtime that enables a smaller version of our stereo DNN to
run on an embedded GPU. Competitive results are shown
on the KITTI 2015 stereo dataset. We also evaluate the
recent progress of stereo algorithms by measuring the impact
upon accuracy of various design criteria.¹
1. Introduction
Estimating depth from images is a long-standing problem
in computer vision. Depth perception is useful
for scene understanding, scene reconstruction, virtual and
augmented reality, obstacle avoidance, self-driving cars,
robotics, and other applications.
Traditionally, multiple images have been used to estimate
depth. Techniques that fall within this category include
stereo, photometric stereo, depth from focus, depth
from defocus, time-of-flight,² and structure from motion.
The reasons for using multiple images are twofold: 1) absolute
depth estimates require at least one known distance in
the world, which can often be provided by some knowledge
regarding the multi-camera rig (e.g., the baseline between
stereo cameras); and 2) multiple images provide geometric
constraints that can be leveraged to overcome the many
ambiguities of photometric data.
¹ Video of the system is at https://youtu.be/0FPQdVOYoAU.
² Although time-of-flight does not, in theory, require multiple images, in
practice multiple images are collected with different bandwidths in order
to achieve high accuracy over long ranges.
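The role of the baseline as the "known distance in the world" can be illustrated with the standard rectified-stereo relation Z = f·B/d. The following is a minimal sketch, not taken from the paper: the function name and the rig parameters (focal length in pixels, baseline in meters) are assumptions chosen only for illustration.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Metric depth of a point seen by a rectified stereo pair: Z = f * B / d.

    The baseline B is the known world distance that fixes the absolute scale.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical rig: 720 px focal length, 0.54 m baseline (assumed values).
    for d in (40.0, 10.0, 4.0):
        z = depth_from_disparity(d, focal_px=720.0, baseline_m=0.54)
        print(f"disparity {d:5.1f} px -> depth {z:6.2f} m")
```

Because depth is inversely proportional to disparity, small disparities correspond to large distances, which is why sub-pixel disparity accuracy matters at long range.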
The alternative is to use a single image to estimate
depth. We argue that this alternative, due to its fundamental
limitations, is not likely to be able to achieve high-accuracy
depth estimation at large distances in unfamiliar
environments. As a result, in the context of self-driving
cars we believe monocular depth estimation is not likely
to yield results with sufficient accuracy. In contrast, we
offer a novel, efficient deep-learning stereo approach that
achieves compelling results on the KITTI 2015 dataset by
leveraging a semi-supervised loss function (using lidar
and photometric consistency), a concatenating cost volume,
3D convolutions, and a machine-learned argmax function.
The contributions of the paper are as follows:
• Quantitative and qualitative demonstration of the gap
  in depth accuracy between monocular and stereoscopic
  depth.
• A novel semi-supervised approach (combining lidar
  and photometric losses) to training a deep stereo neural
  network. To our knowledge, ours is the first deep
  stereo network to do so.³
• A smaller version of our network, and a custom runtime,
  that runs at near real-time (20 fps) on a standard
  GPU, and runs efficiently on an embedded GPU.
  To our knowledge, ours is the first stereo DNN to run
  on an embedded GPU.
• Quantitative analysis of various network design
  choices, along with a novel machine-learned argmax
  layer that yields smoother disparity maps.
2. Motivation
The undeniable success of deep neural networks in computer
vision has encouraged researchers to pursue the problem
of estimating depth from a single image [5, 20, 6, 9, 17].
This is, no doubt, a noble endeavor: if it were possible to
accurately estimate depth from a single image, then the
complexity (and hence cost) of the hardware needed would be
dramatically reduced, which would broaden the applicability
substantially. An excellent overview of existing work on
monocular depth estimation can be found in [9].
³ Similarly, Kuznietsov et al. [17] use a semi-supervised approach for
training a monocular network.
arXiv:1803.09719v3 [cs.CV] 20 Apr 2018
Nevertheless, there are reasons to be cautious about the
reported success of monocular depth. To date, monocular
depth solutions, while yielding encouraging preliminary
results, are not at the point where reliable information (from a
robotics point of view) can be expected from them. And
although such solutions will continue to improve, monocular
depth will never overcome well-known fundamental
limitations, such as the need for a world measurement to infer
absolute depth, and the ambiguity that arises when a
photograph is taken of a photograph (an important observation
for biometric and security systems).
One of the motivations for monocular depth is a long-standing
belief that stereo is only useful at close range.
It has been widely reported, for example in [10], that
beyond about 6 meters, the human visual system is essentially
monocular. But there is mounting evidence that the human
stereo system is actually much more capable than that.
Multiple studies have shown metric depth estimation up to
20 meters [18, 1]; and, although error increases as disparity
increases [13], controlled experiments have confirmed that
scaled disparity can be estimated up to 300 m, even without
any depth cues from monocular vision [22]. Moreover,
since the human visual system is capable of estimating
disparity as small as a few seconds of arc [22], there is
reason to believe that the distance could be 1 km or greater,
with some evidence supporting such a claim provided by
the experiments of [4]. Note that an artificial stereo system
whose baseline is wider than the average 65 mm interpupillary
distance of the human visual system has the potential
to provide even greater accuracy.
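The step from "a few seconds of arc" to "1 km or greater" can be checked with a small-angle calculation. The sketch below is an illustration under stated assumptions, not the authors' derivation: it treats the smallest detectable angular disparity θ as the angle the baseline b subtends at distance Z, so that Z ≈ b/θ. The function name and the 10 arcsecond threshold are hypothetical choices; only the 65 mm baseline comes from the text.

```python
import math

# One second of arc expressed in radians.
ARCSEC_TO_RAD = math.pi / (180.0 * 3600.0)


def max_range_m(baseline_m: float, threshold_arcsec: float) -> float:
    """Distance at which the baseline subtends the threshold angle (small-angle approx.).

    Beyond this distance a point's disparity relative to infinity falls below
    the assumed detection threshold.
    """
    if baseline_m <= 0 or threshold_arcsec <= 0:
        raise ValueError("baseline and threshold must be positive")
    return baseline_m / (threshold_arcsec * ARCSEC_TO_RAD)


if __name__ == "__main__":
    human = max_range_m(0.065, 10.0)   # 65 mm interpupillary distance
    wide = max_range_m(0.30, 10.0)     # hypothetical 30 cm camera baseline
    print(f"65 mm baseline: {human:.0f} m, 30 cm baseline: {wide:.0f} m")
```

Under these assumptions the range scales linearly with the baseline, which is consistent with the remark that a wider artificial baseline has the potential for greater accuracy.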
This question takes on renewed significance in the context
of self-driving cars, since most automobile manufacturers
and experimental autonomous vehicles do not install
stereo cameras in their vehicles.⁴ Rather, these systems
rely on various combinations of monocular cameras, lidar,
radar, and sonar sensors.⁵ For detecting static obstacles
such as trees, poles, railings, and concrete barriers, most
systems rely on cameras and/or lidar. Although it is beyond
the scope of this paper whether monocular cameras are
sufficient for self-driving behavior (certainly people with
monocular vision can drive safely in most situations), or
whether stereo is better than lidar, we argue that the proper
engineering approach to such a safety-critical system is to
leverage all available sensors rather than assume they are
not needed; thus, we believe that it is important to accurately
assess the increased error in depth estimation when
relying upon monocular cameras.
⁴ To the authors’ knowledge, at the time of this writing stereo cameras
can be found only on certain models of Mercedes and Subaru vehicles; no
major autonomous platform uses them.
⁵ Tesla vehicles, for example, are equipped with monocular cameras,
sonar, and radar, but no lidar. Despite having multiple foveated cameras for
wider field of view, such vehicles do not rely upon depth from stereopsis.
At typical highway speeds, the braking distance required
to completely stop before impact necessitates observing
an unforeseen stopped object approximately 100 m
away. Intrigued by the reported success of monocular
depth, we tried some recent algorithms, only to discover
that monocular depth is not able to achieve accuracies
anywhere close to that requirement. We then turned our
attention to stereo, where significant progress has been made
in recent years in applying deep learning to the problem
[25, 24, 11, 27, 26, 29, 8, 15, 23]. An excellent overview
of recent stereo algorithms can be found in [15]. In this
flurry of activity, a variety of architectures have been
proposed, but there has been no systematic study as to how
these design choices impact quality. One purpose of this
paper is thus to investigate several of these options in order
to quantify their impact, which we do in Sec. 5. In the
context of this study, we develop a novel semi-supervised stereo
approach, which we present in Sec. 4. First, however, we
illustrate the limitations of monocular depth estimation in the
next section.
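The figure of roughly 100 m quoted at the start of the paragraph above can be made plausible with the textbook stopping-distance model: distance travelled during the reaction delay plus distance travelled while braking at constant deceleration. This is a rough sketch only; the speed, reaction time, and deceleration below are assumed values chosen for illustration and do not appear in the paper.

```python
def stopping_distance_m(speed_mps: float, reaction_s: float, decel_mps2: float) -> float:
    """Total stopping distance: v * t_reaction + v^2 / (2 * a)."""
    if decel_mps2 <= 0:
        raise ValueError("deceleration must be positive")
    reaction_distance = speed_mps * reaction_s
    braking_distance = speed_mps ** 2 / (2.0 * decel_mps2)
    return reaction_distance + braking_distance


if __name__ == "__main__":
    # Assumed values: 30 m/s (108 km/h), 1.5 s system/driver latency, 7 m/s^2 braking.
    d = stopping_distance_m(speed_mps=30.0, reaction_s=1.5, decel_mps2=7.0)
    print(f"stopping distance: {d:.1f} m")
```

With these assumed inputs the result is on the order of 100 m, and it grows quickly with speed because the braking term is quadratic in velocity.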
3. Difficulties of Monocular Depth Estimation
To appreciate the gap between mono and stereo vision,
consider the image of Fig. 1, with several points of interest
highlighted. Without knowing the scene, if you were to ask
yourself whether the width of the near road (on which the
car (A) sits) is greater than the width of the far tracks
(distance between the near and far poles (E and F)), you might
be tempted to answer in the affirmative. After all, the road
not only occupies more pixels in the image (which is to be
expected, since it is closer to the camera), but it occupies
orders of magnitude more pixels. We showed this image to
several people in our lab, and they all reached the same
conclusion: the road indeed appears to be significantly wider.
As it turns out, if this image is any indication, people are not
very good at estimating metric depth from a single image.⁶
The output of a leading monocular depth algorithm,
called MonoDepth [9], is shown in Fig. 2,⁷ along with the
output of our stereo depth algorithm. At first glance, both
results appear plausible. Although the stereo algorithm
preserves crisper object boundaries and appears at least slightly
more accurate, it is difficult to tell from the grayscale im-
⁶ Specifically, we asked 8 people to estimate the distance to the fence
(ground truth 14 m) and the distance to the building (ground truth 30 m).
Their estimates on average were 9.3 m and 12.4 m, respectively. The
distances were therefore underestimated by 34% and 59%, respectively, and
the distance from the fence to the building was underestimated by 81%.
⁷ Other monocular algorithms produce similar results.
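The percentages in the footnote on the human distance estimates follow directly from the reported ground-truth and mean estimated distances. A short check (the helper name is ours, introduced only for this illustration):

```python
def underestimate_pct(truth_m: float, estimate_m: float) -> float:
    """Relative underestimation in percent: 100 * (truth - estimate) / truth."""
    return 100.0 * (truth_m - estimate_m) / truth_m


# Values reported in the footnote.
fence_truth, building_truth = 14.0, 30.0
fence_est, building_est = 9.3, 12.4

print(round(underestimate_pct(fence_truth, fence_est)))        # 34
print(round(underestimate_pct(building_truth, building_est)))  # 59
# Fence-to-building gap: 16 m true versus 3.1 m estimated.
print(round(underestimate_pct(building_truth - fence_truth,
                              building_est - fence_est)))      # 81
```

The last figure is the most telling: the separation between the two objects is perceived at less than a fifth of its true value.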