
many images of real-world objects, but we do not have access
to their shapes; 3D CAD repositories offer object geometry,
but do not come with real images. Further, for each image-shape pair, we need a precise pose annotation that aligns the shape with its projection in the image.
We overcome these challenges by constructing Pix3D in
three steps. First, we collect a large number of image-shape
pairs by crawling the web and performing 3D scans ourselves.
Second, we collect 2D keypoint annotations of objects in the
images on Amazon Mechanical Turk, with which we optimize
for 3D poses that align shapes with image silhouettes. Third,
we filter out image-shape pairs with a poor alignment and, at
the same time, collect attributes (i.e., truncation, occlusion)
for each instance, again by crowdsourcing.
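As an illustration of the pose-optimization step, aligning a shape to an image from keypoint annotations can be posed as a small nonlinear least-squares problem over camera parameters. The sketch below is a minimal, hypothetical version assuming known 2D-3D keypoint correspondences and a seven-parameter perspective camera (axis-angle rotation, translation, focal length); the actual objective, regularization, and silhouette-based refinement behind Pix3D's alignments may differ.

```python
import numpy as np
from scipy.optimize import least_squares

def rodrigues(rvec):
    """Axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def reprojection_residuals(params, pts3d, pts2d):
    """Residuals between projected model keypoints and 2D annotations.

    params = [rx, ry, rz, tx, ty, tz, f]: rotation (axis-angle),
    translation, and focal length; principal point assumed at the origin.
    """
    R, t, f = rodrigues(params[:3]), params[3:6], params[6]
    cam = (R @ pts3d.T).T + t            # 3D keypoints in camera frame
    proj = f * cam[:, :2] / cam[:, 2:]   # perspective projection
    return (proj - pts2d).ravel()

def fit_pose(pts3d, pts2d, init):
    """Least-squares pose fit from 2D-3D keypoint correspondences."""
    return least_squares(reprojection_residuals, init,
                         args=(pts3d, pts2d)).x
```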
In addition to high-quality data, we need a proper metric to objectively evaluate the reconstruction results. A well-designed metric should reflect how humans perceive the visual quality of the reconstructions. In this paper, we calibrate commonly used metrics, including intersection over union, Chamfer distance, and earth mover’s distance, on how well they capture human perception of shape similarity. Based on this, we benchmark state-of-the-art algorithms for 3D object modeling on Pix3D to demonstrate their strengths and weaknesses.
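For concreteness, the three metrics can be computed as follows on boolean voxel grids and on point clouds sampled from shape surfaces. This is a minimal sketch built on standard SciPy routines; the sampling density, normalization, and solvers used in our calibration are not implied by it.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def voxel_iou(a, b):
    """Intersection over union of two boolean voxel grids of equal shape."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (N,3) and q (M,3)."""
    d = cdist(p, q)                          # pairwise Euclidean distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def earth_movers_distance(p, q):
    """Exact EMD for equal-size point clouds via an optimal assignment."""
    d = cdist(p, q)
    rows, cols = linear_sum_assignment(d)    # minimum-cost bijection
    return d[rows, cols].mean()
```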
With its high-quality alignment, Pix3D is also suitable for object pose estimation and shape retrieval. To demonstrate this, we propose a novel model that performs shape and pose estimation simultaneously. Given a single RGB image, our model first predicts the object’s 2.5D sketches, and then regresses the 3D shape and the camera parameters from the estimated sketches. Experiments show that multi-task learning boosts the model’s performance.
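To make the two-stage design concrete, the sketch below shows one way such a model could be wired together in PyTorch. Every module, layer size, and the six-dimensional pose output are placeholder assumptions for illustration; this is not the architecture, resolution, or loss setup of our actual model.

```python
import torch
import torch.nn as nn

class ShapePoseNet(nn.Module):
    """Schematic two-stage model: RGB -> 2.5D sketches -> shape + pose."""

    def __init__(self):
        super().__init__()
        # Stage 1: predict 2.5D sketches (depth, surface normals, silhouette).
        self.sketch_net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 5, 3, padding=1),   # 1 depth + 3 normals + 1 mask
        )
        # Shared encoder over the estimated sketches.
        self.encoder = nn.Sequential(
            nn.Conv2d(5, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Stage 2a: decode a voxel occupancy grid from sketch features.
        self.shape_head = nn.Linear(64, 32 * 32 * 32)
        # Stage 2b: regress camera parameters from the same features.
        self.pose_head = nn.Linear(64, 6)

    def forward(self, image):
        sketches = self.sketch_net(image)
        feat = self.encoder(sketches)
        voxels = torch.sigmoid(self.shape_head(feat)).view(-1, 32, 32, 32)
        pose = self.pose_head(feat)
        return sketches, voxels, pose
```

In a setup like this, the sketch, shape, and pose losses would be trained jointly, which is where the multi-task signal enters.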
Our contributions are three-fold. First, we build a new dataset for single-image 3D object modeling; Pix3D has a diverse collection of image-shape pairs with precise 2D-3D alignment. Second, we calibrate metrics for 3D shape reconstruction based on their correlations with human perception, and benchmark state-of-the-art algorithms on 3D reconstruction, pose estimation, and shape retrieval. Third, we present a novel model that simultaneously estimates object shape and pose, achieving state-of-the-art performance on both tasks.
2. Related Work
Datasets of 3D shapes and scenes.
For decades, researchers have been building datasets of 3D objects, either as repositories of 3D CAD models [4, 5, 50] or as images of 3D shapes with pose annotations [35, 48]. Both directions have witnessed the rapid development of web-scale databases: ShapeNet [7] was proposed as a large repository of more than 50K models covering 55 categories, and Xiang et al. built Pascal 3D+ [65] and ObjectNet3D [64], two large-scale datasets with alignment between 2D images and the 3D shapes they contain. While these datasets have helped to advance the field of 3D shape modeling, they have their respective limitations: datasets like ShapeNet or Elastic2D3D [33] do not have real images, so recent 3D reconstruction challenges using ShapeNet have had to rely exclusively on synthetic images [68]; Pascal 3D+ and ObjectNet3D have only rough alignment between images and shapes, because objects in the images are matched to a pre-defined set of CAD models, not their actual shapes. This has limited their use as benchmarks for 3D shape reconstruction [60].
With depth sensors like Kinect [24, 27], the community has built various RGB-D or depth-only datasets of objects and scenes. We refer readers to the review article by Firman [14] for a comprehensive list. Among those, many object datasets are designed for benchmarking robot manipulation [6, 23, 34, 52]; these often contain a relatively small set of hand-held objects against clean backgrounds. Tanks and Temples [31] is an exciting new benchmark with 14 scenes, designed for high-quality, large-scale, multi-view 3D reconstruction. In comparison, our dataset, Pix3D, focuses on reconstructing a 3D object from a single image, and contains many more real-world objects and images.
Probably the dataset closest to Pix3D is the large collection of object scans from Choi et al. [8], which contains a rich and diverse set of shapes, each with an RGB-D video. Their dataset, however, is not ideal for single-image 3D shape modeling for two reasons. First, the object of interest may be truncated throughout the video; this is especially the case for large objects like sofas. Second, their dataset does not explore the various contexts in which an object may appear, as each shape is associated with only a single scan. In Pix3D, we address both problems by leveraging powerful web search engines and crowdsourcing.
Another closely related benchmark is IKEA [39], which provides accurate alignment between images of IKEA objects and 3D CAD models. This dataset is therefore particularly suitable for fine pose estimation. However, it contains only 759 images and 90 shapes, relatively small for shape modeling∗. In contrast, Pix3D contains 10,069 images (13.3x) and 395 shapes (4.4x) of much greater variation.
∗ Only 90 of the 219 shapes in the IKEA dataset have associated images.
Researchers have also explored constructing scene datasets with 3D annotations. Notable attempts include LabelMe-3D [47], NYU-D [51], SUN RGB-D [54], KITTI [16], and modern large-scale RGB-D scene datasets [10, 41, 55]. These datasets are either synthetic or contain only 3D surfaces of real scenes. Pix3D, in contrast, offers accurate alignment between 3D object shapes and 2D images in the wild.
Single-image 3D reconstruction.
The problem of recovering object shape from a single image is challenging, as it requires both powerful recognition systems and prior shape knowledge. Using deep convolutional networks, researchers have made significant progress in recent years [9, 17, 21, 29, 42, 44, 53, 57, 60, 61, 62, 63, 67]. While most of these approaches represent objects in voxels, there have also been