GINA-3D: Learning to Generate Implicit Neural Assets in the Wild
Bokui Shen (1), Xinchen Yan (2), Charles R. Qi (2), Mahyar Najibi (2), Boyang Deng (1,2), Leonidas Guibas (3), Yin Zhou (2), Dragomir Anguelov (2)
(1) Stanford University, (2) Waymo LLC, (3) Google
Figure 1. Leveraging in-the-wild data for generative assets modeling embodies a scalable approach for simulation. GINA-3D uses real-world driving data to perform various synthesis tasks for realistic 3D implicit neural assets. Left: Multi-sensor observations in the wild. Middle: Asset reconstruction and conditional synthesis. Right: Scene composition with background neural fields [1].
Abstract
Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative model techniques have shown promising progress to address such challenges by learning 3D assets using only plentiful 2D images, but they still suffer limitations as they leverage either human-curated image datasets or renderings from manually-created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to the existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting variations and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 520K images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
Work done during an internship at Waymo.
Work done at Waymo.
1. Introduction
Learning to perceive, reason, and interact with the 3D world has been a longstanding challenge in the computer vision and robotics community for decades [2–9]. Modern robotic systems [10–16] deployed in the wild are often equipped with multiple sensors (e.g. cameras, LiDARs, and Radars) that perceive the 3D environments, followed by an intelligent unit for reasoning and interacting with the complex scene dynamics. End-to-end testing and validating these intelligent agents in real-world environments are difficult and expensive, especially in safety critical and resource constrained domains like autonomous driving.
On the other hand, the use of simulated data has proliferated over the last few years to train and evaluate the intelligent agents under controlled settings [17–27] in a safe, scalable and verifiable manner. Such developments were fueled by rapid advances in computer graphics, including rendering frameworks [28–30], physical simulation [31, 32] and large-scale open-sourced asset repositories [33–39]. A key concern is to create realistic virtual worlds that align in asset content, composition, and behavior with real distributions, so as to give the practitioner confidence that using such simulations for development and verification can transfer to performance in the real world [40–48]. However, manual asset creation faces two major obstacles. First, manual creation of 3D assets requires dedicated efforts from engineers and artists with 3D domain expertise, which is expensive and difficult to scale [26]. Second, the real-world distribution contains diverse examples (including interesting rare cases) and is also constantly evolving [49, 50].
Recent developments in generative 3D modeling offer new perspectives to tackle these aforementioned obstacles, as they allow producing additional realistic but previously unseen examples. A sub-class of these approaches, generative 3D-aware image synthesis [51, 52], holds significant promise since it enables 3D modeling from partial observations (e.g. image projections of the 3D object). Moreover, many real-world robotic applications already capture, annotate and update multi-sensor observations at scale. Such data thus offer an accurate, diverse, task-relevant, and up-to-date representation of the real-world distribution, which the generative model can potentially capture. However, existing works use either human-curated image datasets with clean observations [53–58] or renderings from synthetic 3D environments [33, 36]. Scaling generative 3D-aware image synthesis models to the real world faces several challenges, as many factors are entangled in the partial observations. First, bridging the in-the-wild images from a simple prior without 3D structures makes learning difficult. Second, unconstrained occlusions entangle the object-of-interest and its surroundings in pixel space, which is hard to disentangle in a purely unsupervised manner. Lastly, the above challenges are compounded by a lack of effort in constructing an asset-centric benchmark for sensor data captured in the wild.
In this work, we introduce a 3D-aware generative transformer for implicit neural asset generation, named GINA-3D (Generative Implicit Neural Assets). To tackle the real-world challenges, we propose a novel 3D-aware Encoder-Decoder framework with a learned structured prior. Specifically, we embed a tri-plane structure into the latent prior (or tri-plane latents) of our generative model, where each entry is parameterized by a discrete representation from a learned codebook [59, 60]. The Encoder-Decoder framework is composed of a transformation encoder and a decoder with neural rendering components. To handle unconstrained occlusions, we explicitly disentangle object pixels from their surroundings with an occlusion-aware composition, using pseudo labels from an off-the-shelf segmentation model [61]. Finally, the learned prior of tri-plane latents from a discrete codebook can be used to train conditional latent sampling models [62]. The same codebook can be readily applied to various conditional synthesis tasks, including object scale, class, semantics, and time-of-day.
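To make this design concrete, here is a minimal sketch of what a quantized tri-plane latent could look like: three axis-aligned feature planes whose entries are snapped to a shared learned codebook and then bilinearly sampled at projected 3D query points, producing features for a neural-rendering decoder. All names, shapes, and the nearest-neighbor quantizer below are our own simplifying assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneLatents(nn.Module):
    """Toy tri-plane latent structure with a shared discrete codebook.

    Three axis-aligned feature planes (XY, XZ, YZ) are quantized entry-wise
    against a learned codebook, then queried at 3D points by bilinear sampling.
    Names and shapes are illustrative assumptions, not the paper's code.
    """

    def __init__(self, codebook_size=512, dim=32, res=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)            # learned discrete codes
        self.planes = nn.Parameter(torch.randn(3, dim, res, res))   # XY, XZ, YZ feature planes

    def quantize(self, planes):
        # Snap every spatial entry of every plane to its nearest codebook vector.
        # (A real VQ model would add a straight-through gradient and commitment loss.)
        dim = planes.shape[1]
        flat = planes.permute(0, 2, 3, 1).reshape(-1, dim)           # (3*res*res, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)  # nearest-neighbor token ids
        quantized = self.codebook(idx).reshape(3, planes.shape[2], planes.shape[3], dim)
        return quantized.permute(0, 3, 1, 2), idx                    # quantized planes, token ids

    def query(self, pts):
        # Project 3D points onto the three planes, bilinearly sample each quantized
        # plane, and sum the features (the usual tri-plane aggregation).
        quantized, _ = self.quantize(self.planes)
        projections = (pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]])
        feats = []
        for plane, uv in zip(quantized, projections):
            grid = uv.view(1, -1, 1, 2)                              # grid_sample expects coords in [-1, 1]
            sampled = F.grid_sample(plane[None], grid, align_corners=True)
            feats.append(sampled.squeeze(-1).squeeze(0).t())         # (num_points, dim)
        return sum(feats)                                             # features for a density/color decoder

latents = TriPlaneLatents()
points = torch.rand(1024, 3) * 2 - 1                                  # query points in [-1, 1]^3
features = latents.query(points)                                      # (1024, 32), fed to a decoder head
```

In a two-stage pipeline of this kind, the discrete token indices returned by the quantizer are what a second-stage conditional sampling model (e.g. [62]) would learn to generate; the sketch only covers the lookup path used at rendering time.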
To evaluate our model, we construct a large-scale object-centric benchmark from multi-sensor driving data captured in the wild. We first extract over 520K images of diverse variations for vehicles and pedestrians from the Waymo Open Dataset [14]. We then augment the benchmark with long-tail instances from real-world driving scenes, including rare objects like construction equipment, cable cars, school buses and garbage trucks. We demonstrate through extensive experiments that GINA-3D outperforms the state-of-the-art 3D-aware generative models, measured by image quality, geometry consistency, and geometry diversity. Moreover, we showcase example applications of various conditional synthesis tasks and shape editing results by leveraging the learned 3D-aware codebook. To support future research along this direction, we are looking to make the benchmark available publicly, such as through waymo.com/open, subject to updates.
2. Related Work
We discuss the relevant work on generative 3D-aware image synthesis, 3D shape modeling, and applications in autonomous driving.
Generative 3D-aware image synthesis.
Learning generative 3D-aware representations from image collections has been increasingly popular for the past decade [63–69]. Early work explored image synthesis from disentangled factors such as learned pose embeddings [64, 66, 69] or compact scene representations [65, 67]. Representing the 3D structure as a compressed embedding, this line of work approached image synthesis by upsampling from the embedding space with a stack of 2D deconvolutional layers. Driven by progress in differentiable rendering, there have been efforts [70–73] in baking explicit 3D structures into the generative architectures. These efforts, however, are often confined to a coarse 3D discretization due to memory consumption. Moving beyond explicit structures, more recent work leverages neural radiance fields to learn implicit 3D-aware structures [51, 52, 74–82] for image synthesis. Schwarz et al. [74] introduced Generative Radiance Fields (GRAF), which disentangles the 3D shape, appearance and camera pose of a single object without occlusions. Built on top of GRAF, Niemeyer et al. [51] proposed the GIRAFFE model, which handles scenes involving multiple objects by using a compositional 3D scene structure. Notably, the query operation in volumetric rendering becomes computationally heavy at higher resolutions. To tackle this, Chan et al. [52] introduced hybrid explicit-implicit 3D representations with tri-plane features (EG3D), which showcases image synthesis at higher resolutions. Concurrently, [83] and [84] pioneer high-resolution unbounded 3D scene generation on ImageNet using tri-plane representations, where [84] uses a vector-quantized framework and [83] uses a GAN framework. Our work is designed for applications in autonomous driving sensor simulation, with an emphasis on object-centric modeling.
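For context, the volumetric rendering query referred to above follows the standard NeRF-style quadrature, which composites per-sample densities and colors along each camera ray:

\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \bigl(1 - e^{-\sigma_i \delta_i}\bigr)\, \mathbf{c}_i,
\qquad
T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),

where \sigma_i and \mathbf{c}_i are the density and color predicted at the i-th sample along ray \mathbf{r}, T_i is the accumulated transmittance, and \delta_i is the spacing between adjacent samples. Rendering an H x W image with N samples per ray thus takes on the order of H \cdot W \cdot N field evaluations, which is the cost that hybrid explicit-implicit representations such as tri-planes reduce by replacing most of the per-query network with cheap feature-plane lookups (our paraphrase of the motivation, not a claim about any specific implementation).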
Generative 3D shape modeling.
Generative modeling of complete 3D shapes has also been extensively studied, including efforts on synthesizing 3D voxel grids [85–93], point clouds [94–96], surface meshes [97–103], shape primitives [104, 105], and implicit functions or hybrid representations [103, 106–112] using various deep generative models. Shen et al. [111] introduced a differentiable explicit surface extraction method called Deep Marching Tetrahedra (DMTet) that learns to reconstruct 3D surface meshes with arbitrary topology directly. Built on top of the EG3D [52]