new perspectives to tackle the aforementioned obstacles, as it enables the production of additional realistic but previously
unseen examples. A sub-class of these approaches, gener-
ative 3D-aware image synthesis [51, 52], holds significant
promise since it enables 3D modeling from partial observa-
tions (e.g. image projections of the 3D object). Moreover,
many real-world robotic applications already capture, an-
notate and update multi-sensor observations at scale. Such
data thus offer an accurate, diverse, task-relevant, and up-
to-date representation of the real-world distribution, which
the generative model can potentially capture. However, ex-
isting works use either human-curated image datasets with
clean observations [53–58] or renderings from synthetic 3D
environments [33, 36]. Scaling generative 3D-aware image
synthesis models to the real world faces several challenges,
as many factors are entangled in the partial observations.
First, bridging in-the-wild images from a simple prior without 3D structure makes learning difficult. Second, unconstrained occlusions entangle the object of interest with its surroundings in pixel space, and the two are hard to disentangle
in a purely unsupervised manner. Lastly, the above chal-
lenges are compounded by a lack of effort in constructing an
asset-centric benchmark for sensor data captured in the wild.
In this work, we introduce a 3D-aware generative trans-
former for implicit neural asset generation, named GINA-3D
(Generative Implicit Neural Assets). To tackle the real-world
challenges, we propose a novel 3D-aware Encoder-Decoder
framework with a learned structured prior. Specifically, we
embed a tri-plane structure into the latent prior of our generative model (referred to as tri-plane latents), where each entry is parameterized by a discrete representation from a learned codebook [59, 60]. The Encoder-Decoder framework is composed
of a transformation encoder and a decoder with neural ren-
dering components. To handle unconstrained occlusions, we
explicitly disentangle object pixels from their surroundings with an occlusion-aware composition, using pseudo labels from an off-the-shelf segmentation model [61]. Finally, the learned
prior of tri-plane latents from a discrete codebook can be
used to train conditional latent sampling models [62]. The same codebook can be readily applied to various conditional synthesis tasks, conditioned on object scale, class, semantics, and time of day.
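As a rough illustration of the two components above, the sketch below shows (i) tri-plane latents whose entries are quantized against a learned codebook with a straight-through estimator, and (ii) an occlusion-aware composition in which the rendered object is alpha-blended over its surroundings and supervised with pseudo segmentation labels. This is a minimal sketch assuming a PyTorch-style implementation; the class and function names, resolutions, codebook size, feature dimension, and loss are illustrative assumptions rather than the exact GINA-3D architecture.

```python
# Minimal PyTorch sketch of the ideas described above; names, sizes and the
# exact losses are illustrative assumptions, not the GINA-3D implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneLatents(nn.Module):
    """Tri-plane latent prior where every spatial entry is vector-quantized
    against a learned codebook (cf. the discrete codebooks of [59, 60])."""

    def __init__(self, res=32, codebook_size=1024, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        # Continuous (pre-quantization) features for three axis-aligned planes.
        self.planes = nn.Parameter(torch.randn(3, res, res, dim))

    def forward(self):
        z = self.planes                                     # (3, R, R, D)
        flat = z.reshape(-1, z.shape[-1])                   # (3*R*R, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        z_q = self.codebook(idx).reshape(z.shape)
        # Straight-through estimator: discrete forward pass, identity backward.
        return z + (z_q - z).detach(), idx.reshape(z.shape[:-1])

def composite(rgb_obj, alpha_obj, rgb_bg, pseudo_mask):
    """Occlusion-aware composition: the rendered object is alpha-blended over
    its surroundings, and the rendered alpha is supervised with pseudo labels
    from an off-the-shelf segmentation model [61]."""
    image = alpha_obj * rgb_obj + (1.0 - alpha_obj) * rgb_bg
    mask_loss = F.binary_cross_entropy(alpha_obj.clamp(1e-5, 1 - 1e-5), pseudo_mask)
    return image, mask_loss
```

Under this view, the discrete indices returned by the quantization step form the vocabulary that a conditional latent sampling model [62] can learn to generate.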
To evaluate our model, we construct a large-scale object-
centric benchmark from multi-sensor driving data captured
in the wild. We first extract over 520K images of vehicles and pedestrians, covering diverse variations, from the Waymo Open
Dataset [14]. We then augment the benchmark with long-tail
instances from real-world driving scenes, including rare ob-
jects like construction equipment, cable cars, school buses
and garbage trucks. We demonstrate through extensive ex-
periments that GINA-3D outperforms state-of-the-art 3D-aware generative models, as measured by image quality,
geometry consistency, and geometry diversity. Moreover,
we showcase example applications of various conditional
synthesis tasks and shape editing results by leveraging the
learned 3D-aware codebook. To support future research
in this direction, we are looking to make the benchmark publicly available, such as through waymo.com/open, subject to updates.
2. Related Work
We discuss the relevant work on generative 3D-aware
image synthesis, 3D shape modeling, and applications in
autonomous driving.
Generative 3D-aware image synthesis.
Learning gener-
ative 3D-aware representations from image collections has
been increasingly popular over the past decade [63–69]. Early
work explored image synthesis from disentangled factors
such as learned pose embedding [64,66,69] or compact scene
representations [65,67]. Representing the 3D structure as a
compressed embedding, this line of work approached image
synthesis by upsampling from the embedding space with a
stack of 2D deconvolutional layers. Driven by progress in differentiable rendering, there have been efforts [70–73]
in baking explicit 3D structures into the generative architec-
tures. These efforts, however, are often confined to a coarse
3D discretization due to high memory consumption. Moving beyond explicit representations, more recent work leverages neural radiance fields to learn implicit 3D-aware structures [51, 52, 74–82]
for image synthesis. Schwarz et al. [74] introduced the Gen-
erative Radiance Fields (GRAF) that disentangles the 3D
shape, appearance and camera pose of a single object with-
out occlusions. Built on top of GRAF, Niemeyer et al. [51]
proposed the GIRAFFE model, which handles scenes involving multiple objects by using a compositional 3D scene structure. Notably, the query operation in volumetric rendering becomes computationally heavy at higher resolutions.
To tackle this, Chan et al. [52] introduced hybrid explicit-
implicit 3D representations with tri-plane features (EG3D),
which showcases image synthesis at higher resolutions. Con-
currently, [83] and [84] pioneer high-resolution unbounded
3D scene generation on ImageNet using tri-plane represen-
tations, where [84] uses a vector-quantized framework and
[83] uses a GAN framework. Our work is designed for ap-
plications in autonomous driving sensor simulation with an
emphasis on object-centric modeling.
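For reference, the sketch below illustrates the tri-plane querying idea behind such hybrid explicit-implicit representations: a 3D point is projected onto three axis-aligned feature planes, features are bilinearly sampled from each plane and aggregated before being decoded into density and color. The function name, tensor shapes, and summation-based aggregation are assumptions chosen for exposition rather than the exact EG3D implementation.

```python
# Illustrative sketch of tri-plane feature querying in the spirit of EG3D [52];
# shapes and the summation-based aggregation are assumptions for exposition.
import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """planes: (3, C, H, W) feature planes; pts: (N, 3) points in [-1, 1]^3.
    Returns (N, C) aggregated features, later decoded into density and color."""
    # Project each 3D point onto the xy, xz and yz planes.
    coords = torch.stack([pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]])  # (3, N, 2)
    grid = coords.unsqueeze(2)                               # (3, N, 1, 2)
    feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)      # (N, C)
```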
Generative 3D shape modeling.
Generative modeling of
complete 3D shapes has also been extensively studied, in-
cluding efforts on synthesizing 3D voxel grids [85–93], point clouds [94–96], surface meshes [97–103], shape primitives [104, 105], and implicit functions or hybrid representations [103, 106–112] using various deep generative models.
Shen et al. [111] introduced a differentiable explicit sur-
face extraction method called Deep Marching Tetrahedra
(DMTet), which learns to directly reconstruct 3D surface meshes with arbitrary topology. Built on top of EG3D [52]