
[Figure 2 panels: (a) input image; (b) P(person) = uniform; (c) surface orientation estimate; (d) P(person | geometry); (e) P(viewpoint | objects); (f) P(person | viewpoint); (g) P(person | viewpoint, geometry)]
Figure 2. Watch for pedestrians! In (b,d,f,g), we show 100 boxes sampled according to the available information. Given an input image (a),
a local object detector will expect to find a pedestrian at any location/scale (b). However, given an estimate of rough surface orientations
(c), we can better predict where a pedestrian is likely to be (d). We can estimate the camera viewpoint (e) from a few known objects in
the image. Conversely, knowing the camera viewpoint helps predict the likely scale of a pedestrian (f). The combined evidence from
surface geometry and camera viewpoint provides a powerful predictor of where a pedestrian might be (g), before we even run a pedestrian
detector! Red, green, and blue channels of (c) indicate confidence in vertical, ground, and sky, respectively. Best viewed in color.
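To make the combination of cues in Figure 2 concrete, below is a minimal sketch of how a per-box score in the spirit of P(person | viewpoint, geometry) could be computed. Everything here is an illustrative assumption rather than this paper's implementation: the box parameterization, the ground-confidence map ground_conf, the horizon row horizon_y, the camera height cam_height_m, and the Gaussian penalty on scale mismatch.

```python
import numpy as np

# Sketch (assumptions, not this paper's system): score a candidate
# pedestrian box (x, y_bottom, h), in pixels, by combining geometry
# and viewpoint cues as in Figure 2(g).

PERSON_HEIGHT_M = 1.7  # assumed mean pedestrian height in meters

def geometry_prior(ground_conf, box):
    """Geometry cue: a pedestrian's feet should rest on likely ground.
    ground_conf is an HxW map of P(ground) from a surface-orientation
    estimate like Figure 2(c)."""
    x, y_bottom, h = box
    return float(ground_conf[int(y_bottom), int(x)])

def viewpoint_prior(horizon_y, cam_height_m, box, sigma=0.3):
    """Viewpoint cue: for a roughly level pinhole camera at height
    cam_height_m, a person whose feet project to image row y_bottom
    has expected image height proportional to (y_bottom - horizon_y)."""
    x, y_bottom, h = box
    expected_h = PERSON_HEIGHT_M * (y_bottom - horizon_y) / cam_height_m
    if expected_h <= 0:  # feet above the horizon: geometrically implausible
        return 0.0
    # Gaussian penalty on the log-ratio of observed to expected height
    return float(np.exp(-0.5 * (np.log(h / expected_h) / sigma) ** 2))

def combined_prior(ground_conf, horizon_y, cam_height_m, box):
    """Product of the two cues, treating them as conditionally
    independent given the box (a simplifying assumption)."""
    return (geometry_prior(ground_conf, box) *
            viewpoint_prior(horizon_y, cam_height_m, box))
```

Sampling boxes with probability proportional to combined_prior would qualitatively reproduce panel (g): candidates concentrate on the ground plane, at scales consistent with the horizon.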
pioneering work as Brooks’ ACRONYM [4], Hanson and
Riseman’s VISIONS [9], Ohta and Kanade’s outdoor scene
understanding system [19], Barrow and Tenenbaum’s in-
trinsic images [2], etc. For example, VISIONS was an ex-
tremely ambitious system that analyzed a scene on many
interrelated levels including segments, 3D surfaces and vol-
umes, objects, and scene categories. However, because of
the heavy use of heuristics, none of these early systems were
particularly successful, which led people to doubt the very
goal of complete image understanding.
We believe that the vision pioneers were simply ahead
of their time. They had no choice but to rely on heuris-
tics because they lacked the computational resources to
learn the relationships governing the structure of our visual
world. The advancement of learning methods in the last
decade brings renewed hope for a complete image under-
standing solution. However, the currently popular learning
approaches are based on looking at small image windows at
all locations and scales to find specific objects. This works
wonderfully for face detection [23, 29] (since the inside of
a face is much more important than the boundary) but is
quite unreliable for other types of objects, such as cars and
pedestrians, especially at the smaller scales.
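For concreteness, a schematic of the window-scanning paradigm described above is given below; score_window stands in for any local classifier (e.g., a face or pedestrian detector) and, like the window size and stride, is a hypothetical placeholder.

```python
import numpy as np

def sliding_window_detect(image, score_window, win=24, stride=4, threshold=0.5):
    """Evaluate a local classifier on every window at all locations and
    scales. image is a 2-D numpy array; score_window is a placeholder
    returning a confidence in [0, 1] for a win-by-win patch."""
    detections, scale, img = [], 1.0, image
    while min(img.shape[:2]) >= win:
        for y in range(0, img.shape[0] - win + 1, stride):
            for x in range(0, img.shape[1] - win + 1, stride):
                if score_window(img[y:y + win, x:x + win]) > threshold:
                    # record the window in original-image coordinates
                    detections.append((x * scale, y * scale, win * scale))
        img = img[::2, ::2]  # crude 2x downsample: next scale octave
        scale *= 2.0
    return detections
```

Note that nothing in this loop looks outside the window: the scan is purely local, which is exactly the limitation the contextual approaches below try to address.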
As a result, several researchers have recently begun to
consider the use of contextual information for object de-
tection. The main focus has been on modeling direct re-
lationships between objects and other objects [15, 18], re-
gions [10, 16, 28] or scene categories [18, 24], all within
the 2D image. Going beyond the 2D image plane, Hoiem et
al. [11] propose a mechanism for estimating rough 3D scene
geometry from a single image and use this information as
additional features to improve object detection. From low-
level image cues, Torralba and Oliva [26] estimate the camera viewpoint and mean scene depth, which provides a useful prior for object detection [27]. Forsyth et al. [7] describe a method for enforcing geometric consistency of object hypotheses in simple scenes using hard algebraic constraints. Others have
also modeled the relationship between the camera parame-
ters and objects, requiring either a well-calibrated camera
(e.g. [12]), a stationary surveillance camera (e.g. [14]), or
both [8].
In this work, we draw on several of the previous tech-
niques: local object detection (based on Murphy et al. [18]),
3D scene geometry estimation [11], and camera viewpoint
estimation. Our contribution is a statistical framework that
allows simultaneous inference of object identities, surface
orientations, and camera viewpoint using a single image
taken from an uncalibrated camera.
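As a rough illustration of what such a framework can look like (the notation here is our own sketch, not necessarily the exact factorization developed later in the paper), let θ denote the camera viewpoint, g the surface geometry, o_i the candidate object hypotheses, and e_i the corresponding local detector evidence. Assuming the hypotheses are conditionally independent given viewpoint and geometry, a plausible joint model is

$$ P(\theta, g, o_{1..n} \mid I) \;\propto\; P(\theta)\, P(g \mid I) \prod_{i=1}^{n} P(o_i \mid \theta, g)\, P(e_i \mid o_i), $$

where I is the image. Inference in such a model couples the three quantities, so that confident detections sharpen the viewpoint estimate while the viewpoint and geometry, in turn, rescore the detections.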
1.2. Overview
To evaluate our approach, we have chosen a very chal-
lenging dataset of outdoor images [22] that contain cars and
people, often partly occluded, over an extremely wide range
of scales and in accidental poses (unlike, for example, the
framed photographs in the Corel or Caltech datasets). Our goal
is to demonstrate that substantial improvement over stan-
dard low-level detectors can be obtained by reasoning about
the underlying 3D scene structure.
One way to think about what we are trying to achieve
is to consider the likely places in an image where an ob-