quality images, point clouds and videos [20, 27, 45, 55, 60, 85, 91]. Yet, due to the nature of our task, where image data must be mapped to a shared 3D scene without an explicit ground-truth 3D representation, straightforward approaches that fit a diffusion model directly to the data are infeasible.
In NeuralField-LDM, we learn to model scenes using a
three-stage pipeline. First, we learn an auto-encoder that en-
codes scenes into a neural field, represented as density and
feature voxel grids. Inspired by the success of latent diffu-
sion models for images [60], we learn to model the distribu-
tion of our scene voxels in latent space to focus the genera-
tive capacity on core parts of the scene and not the extrane-
ous details captured by our voxel auto-encoders. Specif-
ically, a latent auto-encoder decomposes the scene voxels into a 3D coarse, a 2D fine, and a 1D global latent. Hierarchical diffusion models are then trained on the tri-latent
representation to generate novel 3D scenes. We show how
NF-LDM enables applications such as scene editing, bird's-eye view conditional generation and style adaptation. Fi-
nally, we demonstrate how score distillation [53] can be
used to optimize the quality of generated neural fields, al-
lowing us to leverage the representations learned from state-
of-the-art image diffusion models that have been exposed to
orders of magnitude more data.
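To make the three-stage design concrete, the following sketch illustrates how sampling could be cascaded over the tri-latent representation. It is a minimal, PyTorch-style illustration on our part: the denoiser objects (ddm_global, ddm_coarse, ddm_fine), their .sample interface, the latent decoder latent_dec and the latent shapes are all hypothetical, not taken from the released implementation.

```python
# Minimal sketch (our naming, not the authors' code) of cascaded sampling over
# the tri-latent representation: global -> coarse -> fine, then decode back to
# density/feature voxel grids that can be volume-rendered from any camera.
import torch


@torch.no_grad()
def sample_scene(ddm_global, ddm_coarse, ddm_fine, latent_dec, batch=1):
    # Illustrative latent shapes; the real dimensions depend on the trained
    # latent auto-encoder.
    g = ddm_global.sample((batch, 512))                     # 1D global latent
    c = ddm_coarse.sample((batch, 8, 16, 64, 64), cond=g)   # 3D coarse latent
    f = ddm_fine.sample((batch, 8, 128, 128), cond=(g, c))  # 2D fine latent
    # Decode the tri-latent into density and feature voxel grids.
    density, features = latent_dec(g, c, f)
    return density, features
```

Conditioning each finer latent on the coarser ones reflects the hierarchy described above: the global latent captures scene-level properties, while the coarse and fine latents progressively add spatial detail.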
Our contributions are: 1) We introduce NF-LDM, a hi-
erarchical diffusion model capable of generating complex
open-world 3D scenes and achieving state-of-the-art scene
generation results on four challenging datasets. 2) We ex-
tend NF-LDM to semantic bird's-eye view conditional scene
generation, style modification and 3D scene editing.
2. Related Work
2D Generative Models Over the past years, generative adver-
sarial networks (GANs) [4, 19, 31, 48, 65] and likelihood-
based approaches [38, 56, 58, 78] enabled high-resolution
photorealistic image synthesis. Due to their quality, GANs
are used in a multitude of downstream applications rang-
ing from steerable content creation [34, 39, 41, 42, 68, 89] to data-driven simulation [30, 35, 36, 39]. Recently, autoregressive models and score-based models, e.g. diffusion models, have demonstrated better distribution coverage while preserving high sample quality [11, 12, 15, 23, 25, 50, 55, 60, 61, 79].
Since evaluation and optimization of these approaches in
pixel space is computationally expensive, [60, 79] apply
them to latent space, achieving state-of-the-art image syn-
thesis at megapixel resolution. As our approach operates on
3D scenes, computational efficiency is crucial. Hence, we
build upon [60] and train our model in latent space.
Novel View Synthesis In their seminal work [49],
Mildenhall et al. introduce Neural Radiance Fields (NeRF)
as a powerful 3D representation. PixelNeRF [84] and IBR-
Net [82] propose to condition NeRF on aggregated features
from multiple views to enable novel view synthesis from
a sparse set of views. Another line of work scales NeRF to large-scale indoor and outdoor scenes [46, 57, 86, 88].
Recently, Nerfusion [88] predicts local radiance fields and
fuses them into a scene representation using a recurrent neu-
ral network. Similarly, we construct a latent scene represen-
tation by aggregating features across multiple views. Dif-
ferent from the aforementioned methods, our approach is a
generative model capable of synthesizing novel scenes.
3D Diffusion Models A few recent works propose to apply denoising diffusion models (DDMs) [23, 25, 72] to point clouds for 3D shape generation [45, 85, 91]. While PVD [91]
trains on point clouds directly, DPM [45] and LION [85]
use a shape latent variable. Similar to LION, we design a
hierarchical model by training separate conditional DDMs.
However, our approach generates both texture and geometry
of a scene without needing 3D ground truth as supervision.
3D-Aware Generative Models 3D-aware generative
models synthesize images while providing explicit control
over the camera pose and potentially other scene proper-
ties, like object shape and appearance. SGAM [69] gener-
ates a 3D scene by autoregressively generating sensor data
and building a 3D map. Several previous approaches gen-
erate NeRFs of single objects with conditional coordinate-
based MLPs [8, 51, 66]. GSN [9] conditions a coordinate-
based MLP on a “floor plan”, i.e. a 2D feature map, to
model more complex indoor scenes. EG3D [7] and Vox-
GRAF [67] use convolutional backbones to generate 3D
representations. All of these approaches rely on adversarial
training. Instead, we train a DDM on voxels in latent space.
The work closest to ours is GAUDI [3], which first trains an
auto-decoder and subsequently trains a DDM on the learned
latent codes. Instead of using a global latent code, we en-
code scenes onto voxel grids and train a hierarchical DDM
to optimally combine global and local features.
3. NeuralField-LDM
Our objective is to train a generative model to synthe-
size 3D scenes that can be rendered to any viewpoint. We
assume access to a dataset {(i, κ, ρ)}_{1..N} which consists of
N RGB images i and their camera poses κ, along with a
depth measurement ρ that can be either sparse (e.g. Lidar
points) or dense. The generative model must learn to model
both the texture and geometry distributions of the dataset in
3D by learning solely from the sensor observations, which
is a highly non-trivial problem.
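For concreteness, a single training record under this assumption could be organized as in the sketch below; the field names and the particular pose parameterization (a 4×4 camera-to-world matrix plus pinhole intrinsics) are illustrative choices of ours rather than a prescribed format.

```python
# Illustrative container for one observation (i, κ, ρ). Field names and the
# pose parameterization are assumptions made for this sketch.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class PosedObservation:
    image: np.ndarray                    # i: (H, W, 3) RGB image
    cam_to_world: np.ndarray             # κ: (4, 4) camera-to-world pose
    intrinsics: np.ndarray               # κ: (3, 3) pinhole intrinsics
    depth: Optional[np.ndarray] = None   # ρ: dense (H, W) depth map, or
                                         #    sparse (M, 3) lidar points
```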
Past work typically tackles this problem with a generative adversarial network (GAN) framework [7, 9, 66, 67]. These methods produce an intermediate 3D representation and ren-
der images for a given viewpoint with volume render-
ing [29, 49]. Discriminator losses then ensure that the 3D
representation produces a valid image from any viewpoint.
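For reference, the volume rendering step composites a pixel value from per-sample densities σ_i and colours c_i along a camera ray r (with sample spacings δ_i) using the standard quadrature of [29, 49]:

\[
\hat{C}(\mathbf{r}) = \sum_{i=1}^{M} T_i \bigl(1 - \exp(-\sigma_i \delta_i)\bigr)\, \mathbf{c}_i,
\qquad
T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),
\]

which is differentiable, so the discriminator's gradients can flow back to the 3D representation.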
However, GANs come with notorious training instability