quality images, point clouds and videos [20, 27, 45, 55, 60, 85, 91]. Yet, due to the nature of our task, where image data must be mapped to a shared 3D scene without an explicit ground-truth 3D representation, straightforward approaches that fit a diffusion model directly to the data are infeasible.
In NeuralField-LDM, we learn to model scenes using a
three-stage pipeline. First, we learn an auto-encoder that en-
codes scenes into a neural field, represented as density and
feature voxel grids. Inspired by the success of latent diffu-
sion models for images [60], we learn to model the distribu-
tion of our scene voxels in latent space to focus the genera-
tive capacity on core parts of the scene and not the extrane-
ous details captured by our voxel auto-encoders. Specif-
ically, a latent auto-encoder decomposes the scene voxels into a 3D coarse, a 2D fine, and a 1D global latent. Hierarchical diffusion models are then trained on the tri-latent
representation to generate novel 3D scenes. We show how
NF-LDM enables applications such as scene editing, bird's-eye view conditional generation and style adaptation. Fi-
nally, we demonstrate how score distillation [53] can be
used to optimize the quality of generated neural fields, al-
lowing us to leverage the representations learned from state-
of-the-art image diffusion models that have been exposed to
orders of magnitude more data.
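To make the three-stage design concrete, the following sketch illustrates how sampling could be cascaded over the tri-latent representation. It is a minimal, PyTorch-style illustration on our part: the denoiser objects (ddm_global, ddm_coarse, ddm_fine), their .sample interface, the latent decoder latent_dec and the latent shapes are all hypothetical, not taken from the released implementation.

```python
# Minimal sketch (our naming, not the authors' code) of cascaded sampling over
# the tri-latent representation: global -> coarse -> fine, then decode back to
# density/feature voxel grids that can be volume-rendered from any camera.
import torch


@torch.no_grad()
def sample_scene(ddm_global, ddm_coarse, ddm_fine, latent_dec, batch=1):
    # Illustrative latent shapes; the real dimensions depend on the trained
    # latent auto-encoder.
    g = ddm_global.sample((batch, 512))                     # 1D global latent
    c = ddm_coarse.sample((batch, 8, 16, 64, 64), cond=g)   # 3D coarse latent
    f = ddm_fine.sample((batch, 8, 128, 128), cond=(g, c))  # 2D fine latent
    # Decode the tri-latent into density and feature voxel grids.
    density, features = latent_dec(g, c, f)
    return density, features
```

Conditioning each finer latent on the coarser ones reflects the hierarchy described above: the global latent captures scene-level properties, while the coarse and fine latents progressively add spatial detail.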
Our contributions are: 1) We introduce NF-LDM, a hi-
erarchical diffusion model capable of generating complex
open-world 3D scenes and achieving state-of-the-art scene
generation results on four challenging datasets. 2) We ex-
tend NF-LDM to semantic bird's-eye view conditional scene
generation, style modification and 3D scene editing.
2. Related Work
2D Generative Models Over the past years, generative adver-
sarial networks (GANs) [4, 19, 31, 48, 65] and likelihood-
based approaches [38, 56, 58, 78] enabled high-resolution
photorealistic image synthesis. Due to their quality, GANs
are used in a multitude of downstream applications rang-
ing from steerable content creation [34, 39, 41, 42, 68, 89] to data-driven simulation [30, 35, 36, 39]. Recently, autoregressive models and score-based models, e.g. diffusion models, have demonstrated better distribution coverage while preserving high sample quality [11, 12, 15, 23, 25, 50, 55, 60, 61, 79].
Since evaluation and optimization of these approaches in
pixel space is computationally expensive, [60, 79] apply
them to latent space, achieving state-of-the-art image syn-
thesis at megapixel resolution. As our approach operates on
3D scenes, computational efficiency is crucial. Hence, we
build upon [60] and train our model in latent space.
Novel View Synthesis In their seminal work [49],
Mildenhall et al. introduce Neural Radiance Fields (NeRF)
as a powerful 3D representation. PixelNeRF [84] and IBR-
Net [82] propose to condition NeRF on aggregated features
from multiple views to enable novel view synthesis from
a sparse set of views. Another line of work scales NeRF to large-scale indoor and outdoor scenes [46, 57, 86, 88].
Recently, Nerfusion [88] predicts local radiance fields and
fuses them into a scene representation using a recurrent neu-
ral network. Similarly, we construct a latent scene represen-
tation by aggregating features across multiple views. Dif-
ferent from the aforementioned methods, our approach is a
generative model capable of synthesizing novel scenes.
3D Diffusion Models A few recent works propose to apply denoising diffusion models (DDMs) [23, 25, 72] to point clouds for 3D shape generation [45, 85, 91]. While PVD [91]
trains on point clouds directly, DPM [45] and LION [85]
use a shape latent variable. Similar to LION, we design a
hierarchical model by training separate conditional DDMs.
However, our approach generates both texture and geometry
of a scene without needing 3D ground truth as supervision.
3D-Aware Generative Models 3D-aware generative
models synthesize images while providing explicit control
over the camera pose and potentially other scene proper-
ties, like object shape and appearance. SGAM [69] gener-
ates a 3D scene by autoregressively generating sensor data
and building a 3D map. Several previous approaches gen-
erate NeRFs of single objects with conditional coordinate-
based MLPs [8, 51, 66]. GSN [9] conditions a coordinate-
based MLP on a “floor plan”, i.e. a 2D feature map, to
model more complex indoor scenes. EG3D [7] and Vox-
GRAF [67] use convolutional backbones to generate 3D
representations. All of these approaches rely on adversarial
training. Instead, we train a DDM on voxels in latent space.
The work closest to ours is GAUDI [3], which first trains an
auto-decoder and subsequently trains a DDM on the learned
latent codes. Instead of using a global latent code, we en-
code scenes onto voxel grids and train a hierarchical DDM
to optimally combine global and local features.
3. NeuralField-LDM
Our objective is to train a generative model to synthe-
size 3D scenes that can be rendered to any viewpoint. We
assume access to a dataset {(i, κ, ρ)}_{1..N} which consists of
N RGB images i and their camera poses κ, along with a
depth measurement ρ that can be either sparse (e.g. Lidar
points) or dense. The generative model must learn to model
both the texture and geometry distributions of the dataset in
3D by learning solely from the sensor observations, which
is a highly non-trivial problem.
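For concreteness, a single training record under this assumption could be organized as in the sketch below; the field names and the particular pose parameterization (a 4×4 camera-to-world matrix plus pinhole intrinsics) are illustrative choices of ours rather than a prescribed format.

```python
# Illustrative container for one observation (i, κ, ρ). Field names and the
# pose parameterization are assumptions made for this sketch.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class PosedObservation:
    image: np.ndarray                    # i: (H, W, 3) RGB image
    cam_to_world: np.ndarray             # κ: (4, 4) camera-to-world pose
    intrinsics: np.ndarray               # κ: (3, 3) pinhole intrinsics
    depth: Optional[np.ndarray] = None   # ρ: dense (H, W) depth map, or
                                         #    sparse (M, 3) lidar points
```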
Past work typically tackles this problem with a generative adversarial network (GAN) framework [7, 9, 66, 67]. These methods produce an intermediate 3D representation and ren-
der images for a given viewpoint with volume render-
ing [29, 49]. Discriminator losses then ensure that the 3D
representation produces a valid image from any viewpoint.
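For reference, the volume rendering step composites a pixel value from per-sample densities σ_i and colours c_i along a camera ray r (with sample spacings δ_i) using the standard quadrature of [29, 49]:

\[
\hat{C}(\mathbf{r}) = \sum_{i=1}^{M} T_i \bigl(1 - \exp(-\sigma_i \delta_i)\bigr)\, \mathbf{c}_i,
\qquad
T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),
\]

which is differentiable, so the discriminator's gradients can flow back to the 3D representation.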
However, GANs come with notorious training instability