
This is, no doubt, a noble endeavor: if it were possible to ac-
curately estimate depth from a single image, then the com-
plexity (and hence cost) of the hardware needed would be
dramatically reduced, which would broaden the applicabil-
ity substantially. An excellent overview of existing work on
monocular depth estimation can be found in [9].
Nevertheless, there are reasons to be cautious about the
reported success of monocular depth. To date, monocular
depth solutions, while yielding encouraging preliminary re-
sults, are not at the point where reliable information (from a
robotics point of view) can be expected from them. And al-
though such solutions will continue to improve, monocular
depth will never overcome well-known fundamental limi-
tations, such as the need for a world measurement to infer
absolute depth, and the ambiguity that arises when a pho-
tograph is taken of a photograph (an important observation
for biometric and security systems).
One of the motivations for monocular depth is a long-
standing belief that stereo is only useful at close range.
It has been widely reported, for example in [10], that be-
yond about 6 meters, the human visual system is essentially
monocular. But there is mounting evidence that the hu-
man stereo system is actually much more capable than that.
Multiple studies have shown metric depth estimation up to
20 meters [18, 1]; and, although error increases as disparity
increases [13], controlled experiments have confirmed that
scaled disparity can be estimated up to 300 m, even with-
out any depth cues from monocular vision [22]. Moreover,
since the human visual system is capable of estimating dis-
parity as small as a few seconds of arc [22], there is rea-
son to believe that the distance could be 1 km or greater,
with some evidence supporting such a claim provided by
the experiments of [4]. Note that an artificial stereo system
whose baseline is wider than the average 65 mm interpupil-
lary distance of the human visual system has the potential
to provide even greater accuracy.
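The range argument above can be checked with a rough small-angle model, in which the angular disparity of a point relative to infinity is approximately the baseline divided by the distance. The sketch below is only an idealized illustration: the function names and the specific threshold values (2, 5, and 10 seconds of arc, standing in for "a few seconds of arc") are assumptions, as is the 0.3 m baseline for the artificial system, and the model ignores every perceptual and optical factor other than disparity resolution.

```python
import math

# Conversion factor from seconds of arc to radians.
ARCSEC_TO_RAD = math.pi / (180.0 * 3600.0)


def max_stereo_range_m(baseline_m: float, threshold_arcsec: float) -> float:
    """Distance at which disparity relative to infinity equals the threshold.

    Small-angle model: disparity (rad) ~= baseline / distance.
    """
    return baseline_m / (threshold_arcsec * ARCSEC_TO_RAD)


def depth_error_m(distance_m: float, baseline_m: float,
                  disparity_error_arcsec: float) -> float:
    """First-order depth uncertainty: dZ ~= Z^2 * d(disparity) / baseline."""
    return distance_m ** 2 * disparity_error_arcsec * ARCSEC_TO_RAD / baseline_m


HUMAN_BASELINE_M = 0.065  # average interpupillary distance quoted in the text
WIDE_BASELINE_M = 0.30    # hypothetical artificial stereo rig (assumption)

# Illustrative thresholds only; the text says "a few seconds of arc".
for threshold in (2.0, 5.0, 10.0):
    limit = max_stereo_range_m(HUMAN_BASELINE_M, threshold)
    print(f"threshold {threshold:4.1f} arcsec -> range limit {limit:7.0f} m")

# A wider baseline reduces the depth error at a given distance proportionally.
for baseline in (HUMAN_BASELINE_M, WIDE_BASELINE_M):
    err = depth_error_m(100.0, baseline, 5.0)
    print(f"baseline {baseline:.3f} m -> depth error at 100 m: {err:.2f} m")
```

Under these assumptions every threshold in the loop gives a range limit above 1 km, and the depth error at a fixed distance scales inversely with the baseline.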
This question takes on renewed significance in the context of
self-driving cars, since most automobile manufacturers and
developers of experimental autonomous vehicles do not install
stereo cameras in their vehicles.⁴ Rather, these systems
rely on various combinations of monocular cameras, lidar,
radar, and sonar sensors.⁵ For detecting static obstacles
such as trees, poles, railings, and concrete barriers, most
systems rely on cameras and/or lidar. Although it is beyond
the scope of this paper to determine whether monocular cameras
are sufficient for self-driving behavior (certainly people with
monocular vision can drive safely in most situations), or
whether stereo is better than lidar, we argue that the proper
engineering approach to such a safety-critical system is to
leverage all available sensors rather than assume they are
not needed; thus, we believe that it is important to accurately
assess the increased error in depth estimation when
relying upon monocular cameras.
⁴ To the authors’ knowledge, at the time of this writing stereo cameras
can be found only on certain models of Mercedes and Subaru vehicles; no
major autonomous platform uses them.
⁵ Tesla vehicles, for example, are equipped with monocular cameras,
sonar, and radar, but no lidar. Despite having multiple foveated cameras for
a wider field of view, such vehicles do not rely upon depth from stereopsis.
At typical highway speeds, the braking distance re-
quired to completely stop before impact necessitates ob-
serving an unforeseen stopped object approximately 100 m
away. Intrigued by the reported success of monocular
depth, we tried some recent algorithms, only to discover
that monocular depth is not able to achieve accuracies any-
where close to that requirement. We then turned our at-
tention to stereo, where significant progress has been made
in recent years in applying deep learning to the problem
[25, 24, 11, 27, 26, 29, 8, 15, 23]. An excellent overview
of recent stereo algorithms can be found in [15]. In this
flurry of activity, a variety of architectures have been pro-
posed, but there has been no systematic study as to how
these design choices impact quality. One purpose of this
paper is thus to investigate several of these options in order
to quantify their impact, which we do in Sec. 5. In the con-
text of this study, we develop a novel semi-supervised stereo
approach, which we present in Sec. 4. First, however, we il-
lustrate the limitations of monocular depth estimation in the
next section.
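As a rough check on the 100 m figure quoted above, the distance needed to stop can be estimated as the distance travelled during the system's response latency plus the kinematic braking distance v²/(2a). This is a minimal sketch, not the authors' calculation: the speed (30 m/s, about 108 km/h), the latency (1.0 s), and the deceleration (7 m/s²) are all assumed values chosen for illustration.

```python
def stopping_distance_m(speed_mps: float, latency_s: float,
                        deceleration_mps2: float) -> float:
    """Distance covered before a full stop under constant deceleration."""
    # Distance travelled before the brakes are applied.
    latency_distance = speed_mps * latency_s
    # Kinematic braking distance from v^2 = 2 * a * d.
    braking_distance = speed_mps ** 2 / (2.0 * deceleration_mps2)
    return latency_distance + braking_distance


# All three inputs are illustrative assumptions, not values from the paper.
HIGHWAY_SPEED_MPS = 30.0
LATENCY_S = 1.0
DECELERATION_MPS2 = 7.0

distance = stopping_distance_m(HIGHWAY_SPEED_MPS, LATENCY_S, DECELERATION_MPS2)
print(f"estimated stopping distance: {distance:.1f} m")
```

With these assumed inputs the estimate lands near 100 m; a longer latency or weaker braking pushes it higher.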
3. Difficulties of Monocular Depth Estimation
To appreciate the gap between mono and stereo vision,
consider the image of Fig. 1, with several points of interest
highlighted. Without knowing the scene, if you were to ask
yourself whether the width of the near road (on which the
car (A) sits) is greater than the width of the far tracks (dis-
tance between the near and far poles (E and F)), you might
be tempted to answer in the affirmative. After all, the road
not only occupies more pixels in the image (which is to be
expected, since it is closer to the camera), but it occupies
orders of magnitude more pixels. We showed this image to
several people in our lab, and they all reached the same con-
clusion: the road indeed appears to be significantly wider.
As it turns out, if this image is any indication, people are not
very good at estimating metric depth from a single image.⁶
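The underestimation percentages reported in footnote 6 follow directly from the ground-truth distances and the average estimates given there. The short check below reproduces them; the function and variable names are illustrative, while the numbers are taken from the footnote.

```python
def underestimate_pct(truth_m: float, estimate_m: float) -> float:
    """Percentage by which an estimate falls short of the ground truth."""
    return 100.0 * (truth_m - estimate_m) / truth_m


# Ground truth and mean estimates from footnote 6.
FENCE_TRUTH_M, BUILDING_TRUTH_M = 14.0, 30.0
FENCE_EST_M, BUILDING_EST_M = 9.3, 12.4

fence_pct = underestimate_pct(FENCE_TRUTH_M, FENCE_EST_M)
building_pct = underestimate_pct(BUILDING_TRUTH_M, BUILDING_EST_M)

# The fence-to-building gap: 16 m in truth, about 3.1 m as estimated.
gap_pct = underestimate_pct(BUILDING_TRUTH_M - FENCE_TRUTH_M,
                            BUILDING_EST_M - FENCE_EST_M)

print(f"fence: {fence_pct:.1f}%  building: {building_pct:.1f}%  gap: {gap_pct:.1f}%")
```

The relative error on the gap between the two objects is much larger than on either absolute distance, because the two estimates are compressed toward each other.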
The output of a leading monocular depth algorithm,
called MonoDepth [9], is shown in Fig. 2,⁷ along with the
output of our stereo depth algorithm. At first glance, both
results appear plausible. Although the stereo algorithm pre-
serves crisper object boundaries and appears at least slightly
more accurate, it is difficult to tell from the grayscale im-
⁶ Specifically, we asked 8 people to estimate the distance to the fence
(ground truth 14 m) and the distance to the building (ground truth 30 m).
Their estimates on average were 9.3 m and 12.4 m, respectively. The
distances were therefore underestimated by 34% and 59%, respectively, and
the distance from the fence to the building was underestimated by 81%.
⁷ Other monocular algorithms produce similar results.