
many images of real-world objects, but we do not have access
to their shapes; 3D CAD repositories offer object geometry,
but do not come with real images. Further, for each image-shape pair, we need a precise pose annotation that aligns the shape with its projection in the image.
We overcome these challenges by constructing Pix3D in
three steps. First, we collect a large number of image-shape
pairs by crawling the web and performing 3D scans ourselves.
Second, we collect 2D keypoint annotations of objects in the
images on Amazon Mechanical Turk, with which we optimize
for 3D poses that align shapes with image silhouettes. Third,
we filter out image-shape pairs with a poor alignment and, at
the same time, collect attributes (i.e., truncation, occlusion)
for each instance, again by crowdsourcing.
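As an illustration of the pose-optimization step, aligning a shape to an image from keypoint annotations can be posed as a small nonlinear least-squares problem over camera parameters. The sketch below is a minimal, hypothetical version assuming known 2D-3D keypoint correspondences and a seven-parameter perspective camera (axis-angle rotation, translation, focal length); the actual objective, regularization, and silhouette-based refinement behind Pix3D's alignments may differ.

```python
import numpy as np
from scipy.optimize import least_squares

def rodrigues(rvec):
    """Axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def reprojection_residuals(params, pts3d, pts2d):
    """Residuals between projected model keypoints and 2D annotations.

    params = [rx, ry, rz, tx, ty, tz, f]: rotation (axis-angle),
    translation, and focal length; principal point assumed at the origin.
    """
    R, t, f = rodrigues(params[:3]), params[3:6], params[6]
    cam = (R @ pts3d.T).T + t            # 3D keypoints in camera frame
    proj = f * cam[:, :2] / cam[:, 2:]   # perspective projection
    return (proj - pts2d).ravel()

def fit_pose(pts3d, pts2d, init):
    """Least-squares pose fit from 2D-3D keypoint correspondences."""
    return least_squares(reprojection_residuals, init,
                         args=(pts3d, pts2d)).x
```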
In addition to high-quality data, we need a proper metric to objectively evaluate the reconstruction results. A well-designed metric should reflect how humans perceive the visual quality of the reconstructions. In this paper, we calibrate commonly used metrics, including intersection over union, Chamfer distance, and earth mover’s distance, on how well they capture human perception of shape similarity. Based on this, we benchmark state-of-the-art algorithms for 3D object modeling on Pix3D to demonstrate their strengths and weaknesses.
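For concreteness, the three metrics can be computed as follows on boolean voxel grids and on point clouds sampled from shape surfaces. This is a minimal sketch built on standard SciPy routines; the sampling density, normalization, and solvers used in our calibration are not implied by it.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def voxel_iou(a, b):
    """Intersection over union of two boolean voxel grids of equal shape."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (N,3) and q (M,3)."""
    d = cdist(p, q)                          # pairwise Euclidean distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def earth_movers_distance(p, q):
    """Exact EMD for equal-size point clouds via an optimal assignment."""
    d = cdist(p, q)
    rows, cols = linear_sum_assignment(d)    # minimum-cost bijection
    return d[rows, cols].mean()
```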
With its high-quality alignment, Pix3D is also suitable for object pose estimation and shape retrieval. To demonstrate this, we propose a novel model that performs shape and pose estimation simultaneously. Given a single RGB image, our model first predicts the object’s 2.5D sketches, and then regresses the 3D shape and the camera parameters from the estimated sketches. Experiments show that multi-task learning boosts the model’s performance.
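To make the two-stage design concrete, the sketch below shows one way such a model could be wired together in PyTorch. Every module, layer size, and the six-dimensional pose output are placeholder assumptions for illustration; this is not the architecture, resolution, or loss setup of our actual model.

```python
import torch
import torch.nn as nn

class ShapePoseNet(nn.Module):
    """Schematic two-stage model: RGB -> 2.5D sketches -> shape + pose."""

    def __init__(self):
        super().__init__()
        # Stage 1: predict 2.5D sketches (depth, surface normals, silhouette).
        self.sketch_net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 5, 3, padding=1),   # 1 depth + 3 normals + 1 mask
        )
        # Shared encoder over the estimated sketches.
        self.encoder = nn.Sequential(
            nn.Conv2d(5, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Stage 2a: decode a voxel occupancy grid from sketch features.
        self.shape_head = nn.Linear(64, 32 * 32 * 32)
        # Stage 2b: regress camera parameters from the same features.
        self.pose_head = nn.Linear(64, 6)

    def forward(self, image):
        sketches = self.sketch_net(image)
        feat = self.encoder(sketches)
        voxels = torch.sigmoid(self.shape_head(feat)).view(-1, 32, 32, 32)
        pose = self.pose_head(feat)
        return sketches, voxels, pose
```

In a setup like this, the sketch, shape, and pose losses would be trained jointly, which is where the multi-task signal enters.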
Our contributions are three-fold. First, we build a new dataset for single-image 3D object modeling; Pix3D has a diverse collection of image-shape pairs with precise 2D-3D alignment. Second, we calibrate metrics for 3D shape reconstruction based on their correlations with human perception, and benchmark state-of-the-art algorithms on 3D reconstruction, pose estimation, and shape retrieval. Third, we present a novel model that simultaneously estimates object shape and pose, achieving state-of-the-art performance on both tasks.
2. Related Work
Datasets of 3D shapes and scenes.
For decades, researchers have been building datasets of 3D objects, either as repositories of 3D CAD models [4, 5, 50] or as images of 3D shapes with pose annotations [35, 48]. Both directions have witnessed the rapid development of web-scale databases: ShapeNet [7] was proposed as a large repository of more than 50K models covering 55 categories, and Xiang et al. built Pascal 3D+ [65] and ObjectNet3D [64], two large-scale datasets with alignment between 2D images and the 3D shapes they contain. While these datasets have helped to advance the field of 3D shape modeling, they have their respective limitations: datasets like ShapeNet or Elastic2D3D [33] do not have real images, so recent 3D reconstruction challenges using ShapeNet have had to rely exclusively on synthetic images [68]; Pascal 3D+ and ObjectNet3D have only rough alignment between images and shapes, because objects in the images are matched to a pre-defined set of CAD models, not their actual shapes. This has limited their use as benchmarks for 3D shape reconstruction [60].
With depth sensors like Kinect [24, 27], the community has built various RGB-D or depth-only datasets of objects and scenes. We refer readers to the review article by Firman [14] for a comprehensive list. Among those, many object datasets are designed for benchmarking robot manipulation [6, 23, 34, 52]; these often contain a relatively small set of hand-held objects against clean backgrounds. Tanks and Temples [31] is an exciting new benchmark with 14 scenes, designed for high-quality, large-scale, multi-view 3D reconstruction. In comparison, our dataset, Pix3D, focuses on reconstructing a 3D object from a single image, and contains many more real-world objects and images.
Probably the dataset closest to Pix3D is the large collection of object scans from Choi et al. [8], which contains a rich and diverse set of shapes, each with an RGB-D video. Their dataset, however, is not ideal for single-image 3D shape modeling for two reasons. First, the object of interest may be truncated throughout the video; this is especially the case for large objects like sofas. Second, their dataset does not explore the various contexts in which an object may appear, as each shape is associated with only a single scan. In Pix3D, we address both problems by leveraging powerful web search engines and crowdsourcing.
Another closely related benchmark is IKEA [39], which provides accurate alignment between images of IKEA objects and 3D CAD models. This dataset is therefore particularly suitable for fine pose estimation. However, it contains only 759 images and 90 shapes, relatively small for shape modeling∗. In contrast, Pix3D contains 10,069 images (13.3x) and 395 shapes (4.4x) of much greater variation.
∗ Only 90 of the 219 shapes in the IKEA dataset have associated images.
Researchers have also explored constructing scene datasets with 3D annotations. Notable attempts include LabelMe-3D [47], NYU-D [51], SUN RGB-D [54], KITTI [16], and modern large-scale RGB-D scene datasets [10, 41, 55]. These datasets are either synthetic or contain only 3D surfaces of real scenes. Pix3D, in contrast, offers accurate alignment between 3D object shapes and 2D images in the wild.
Single-image 3D reconstruction.
The problem of recovering object shape from a single image is challenging, as it requires both powerful recognition systems and prior shape knowledge. Using deep convolutional networks, researchers have made significant progress in recent years [9, 17, 21, 29, 42, 44, 53, 57, 60, 61, 62, 63, 67]. While most of these approaches represent objects in voxels, there have also been