new perspectives to tackle the aforementioned obstacles, as it enables the production of additional realistic but previously
unseen examples. A sub-class of these approaches, gener-
ative 3D-aware image synthesis [51, 52], holds significant
promise since it enables 3D modeling from partial observa-
tions (e.g. image projections of the 3D object). Moreover,
many real-world robotic applications already capture, an-
notate and update multi-sensor observations at scale. Such
data thus offer an accurate, diverse, task-relevant, and up-
to-date representation of the real-world distribution, which
the generative model can potentially capture. However, ex-
isting works use either human-curated image datasets with
clean observations [53–58] or renderings from synthetic 3D
environments [33, 36]. Scaling generative 3D-aware image
synthesis models to the real world faces several challenges,
as many factors are entangled in the partial observations.
First, bridging in-the-wild images from a simple prior without 3D structure makes learning difficult. Second, unconstrained occlusions entangle the object of interest with its surroundings in pixel space, and the two are hard to disentangle
in a purely unsupervised manner. Lastly, the above chal-
lenges are compounded by a lack of effort in constructing an
asset-centric benchmark for sensor data captured in the wild.
In this work, we introduce a 3D-aware generative trans-
former for implicit neural asset generation, named GINA-3D
(Generative Implicit Neural Assets). To tackle the real-world
challenges, we propose a novel 3D-aware Encoder-Decoder
framework with a learned structured prior. Specifically, we
embed a tri-plane structure into the latent prior of our generative model (referred to as tri-plane latents), where each entry is parameterized by a discrete representation from a learned codebook [59, 60]. The Encoder-Decoder framework is composed
of a transformation encoder and a decoder with neural ren-
dering components. To handle unconstrained occlusions, we
explicitly disentangle object pixels from their surroundings with an occlusion-aware composition, using pseudo labels from an off-the-shelf segmentation model [61]. Finally, the learned
prior of tri-plane latents from a discrete codebook can be
used to train conditional latent sampling models [62]. The same codebook can be readily applied to various conditional synthesis tasks, conditioned on object scale, class, semantics, and time of day.
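As a rough illustration of the two components above, the sketch below shows (i) tri-plane latents whose entries are quantized against a learned codebook with a straight-through estimator, and (ii) an occlusion-aware composition in which the rendered object is alpha-blended over its surroundings and supervised with pseudo segmentation labels. This is a minimal sketch assuming a PyTorch-style implementation; the class and function names, resolutions, codebook size, feature dimension, and loss are illustrative assumptions rather than the exact GINA-3D architecture.

```python
# Minimal PyTorch sketch of the ideas described above; names, sizes and the
# exact losses are illustrative assumptions, not the GINA-3D implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneLatents(nn.Module):
    """Tri-plane latent prior where every spatial entry is vector-quantized
    against a learned codebook (cf. the discrete codebooks of [59, 60])."""

    def __init__(self, res=32, codebook_size=1024, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        # Continuous (pre-quantization) features for three axis-aligned planes.
        self.planes = nn.Parameter(torch.randn(3, res, res, dim))

    def forward(self):
        z = self.planes                                     # (3, R, R, D)
        flat = z.reshape(-1, z.shape[-1])                   # (3*R*R, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        z_q = self.codebook(idx).reshape(z.shape)
        # Straight-through estimator: discrete forward pass, identity backward.
        return z + (z_q - z).detach(), idx.reshape(z.shape[:-1])

def composite(rgb_obj, alpha_obj, rgb_bg, pseudo_mask):
    """Occlusion-aware composition: the rendered object is alpha-blended over
    its surroundings, and the rendered alpha is supervised with pseudo labels
    from an off-the-shelf segmentation model [61]."""
    image = alpha_obj * rgb_obj + (1.0 - alpha_obj) * rgb_bg
    mask_loss = F.binary_cross_entropy(alpha_obj.clamp(1e-5, 1 - 1e-5), pseudo_mask)
    return image, mask_loss
```

Under this view, the discrete indices returned by the quantization step form the vocabulary that a conditional latent sampling model [62] can learn to generate.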
To evaluate our model, we construct a large-scale object-
centric benchmark from multi-sensor driving data captured
in the wild. We first extract over 520K images of vehicles and pedestrians, covering diverse variations, from the Waymo Open
Dataset [14]. We then augment the benchmark with long-tail
instances from real-world driving scenes, including rare ob-
jects like construction equipment, cable cars, school buses
and garbage trucks. We demonstrate through extensive ex-
periments that GINA-3D outperforms state-of-the-art 3D-aware generative models, as measured by image quality,
geometry consistency, and geometry diversity. Moreover,
we showcase example applications of various conditional
synthesis tasks and shape editing results by leveraging the
learned 3D-aware codebook. To support future research
in this direction, we are looking to make the benchmark publicly available, such as through waymo.com/open, subject to updates.
2. Related Work
We discuss the relevant work on generative 3D-aware
image synthesis, 3D shape modeling, and applications in
autonomous driving.
Generative 3D-aware image synthesis.
Learning gener-
ative 3D-aware representations from image collections has
been increasingly popular over the past decade [63–69]. Early
work explored image synthesis from disentangled factors
such as learned pose embedding [64,66,69] or compact scene
representations [65,67]. Representing the 3D structure as a
compressed embedding, this line of work approached image
synthesis by upsampling from the embedding space with a
stack of 2D deconvolutional layers. Driven by progress in differentiable rendering, there have been efforts [70–73]
in baking explicit 3D structures into the generative architec-
tures. These efforts, however, are often confined to a coarse
3D discretization due to high memory consumption. Moving beyond explicit representations, more recent work leverages neural radiance fields to learn implicit 3D-aware structures [51, 52, 74–82]
for image synthesis. Schwarz et al. [74] introduced the Gen-
erative Radiance Fields (GRAF) that disentangles the 3D
shape, appearance and camera pose of a single object with-
out occlusions. Built on top of GRAF, Niemeyer et al. [51]
proposed the GIRAFFE model, which handles scenes involving multiple objects by using a compositional 3D scene structure. Notably, the query operation in volumetric rendering becomes computationally heavy at higher resolutions.
To tackle this, Chan et al. [52] introduced hybrid explicit-
implicit 3D representations with tri-plane features (EG3D),
which showcases image synthesis at higher resolutions. Con-
currently, [83] and [84] pioneer high-resolution unbounded
3D scene generation on ImageNet using tri-plane represen-
tations, where [84] uses a vector-quantized framework and
[83] uses a GAN framework. Our work is designed for ap-
plications in autonomous driving sensor simulation with an
emphasis on object-centric modeling.
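For reference, the sketch below illustrates the tri-plane querying idea behind such hybrid explicit-implicit representations: a 3D point is projected onto three axis-aligned feature planes, features are bilinearly sampled from each plane and aggregated before being decoded into density and color. The function name, tensor shapes, and summation-based aggregation are assumptions chosen for exposition rather than the exact EG3D implementation.

```python
# Illustrative sketch of tri-plane feature querying in the spirit of EG3D [52];
# shapes and the summation-based aggregation are assumptions for exposition.
import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """planes: (3, C, H, W) feature planes; pts: (N, 3) points in [-1, 1]^3.
    Returns (N, C) aggregated features, later decoded into density and color."""
    # Project each 3D point onto the xy, xz and yz planes.
    coords = torch.stack([pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]])  # (3, N, 2)
    grid = coords.unsqueeze(2)                               # (3, N, 1, 2)
    feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)      # (N, C)
```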
Generative 3D shape modeling.
Generative modeling of
complete 3D shapes has also been extensively studied, in-
cluding efforts on synthesizing 3D voxel grids [85–93], point clouds [94–96], surface meshes [97–103], shape primitives [104, 105], and implicit functions or hybrid representations [103, 106–112] using various deep generative models.
Shen et al. [111] introduced a differentiable explicit sur-
face extraction method called Deep Marching Tetrahedra
(DMTet), which learns to directly reconstruct 3D surface meshes with arbitrary topology. Built on top of EG3D [52]