1. Introduction
Generating 3D assets is a cornerstone of immersive aug-
mented/virtual reality experiences. Without realistic and di-
verse objects, virtual worlds will look empty and engagement
will remain low. Despite this need, manually creating and
editing 3D assets is a notoriously difficult task, requiring
creativity, 3D design skills, and access to sophisticated soft-
ware with a very steep learning curve. This makes 3D asset
creation inaccessible for inexperienced users. Yet, in many
cases, such as interior design, users more often than not
have a reasonably good understanding of what they want to
create. In those cases, an image or a rough sketch is some-
times accompanied by text indicating details of the asset,
which are hard for an amateur to express graphically.
Due to this need, it is not surprising that democratizing
the 3D content creation process has become an active re-
search area. Conventional 3D generative models require
direct 3D supervision in the form of point clouds [2, 21],
signed distance functions (SDFs) [9, 25], voxels [42, 47],
etc. Recently, first efforts have explored learning 3D
geometry from multi-view supervision with known camera
poses by incorporating inductive biases via neural rendering
techniques [5, 6, 14, 37, 52]. While com-
pelling results have been demonstrated, training is often
very time-consuming and ignores available 3D data that can
be used to obtain good shape priors. We foresee an ideal
collaborative paradigm for generative methods where mod-
els trained on 3D data provide detailed and accurate geom-
etry, while models trained on 2D data provide diverse ap-
pearances. A first proof of concept is shown in Figure 1.
In our pursuit of flexible and high-quality 3D shape gen-
eration, we introduce SDFusion, a diffusion-based genera-
tive model that uses a signed distance function (SDF) as
its underlying 3D representation. Compared to other 3D
representations, SDFs are well suited to representing high-
resolution shapes with arbitrary topology [9, 18, 23, 30].
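As a minimal illustration of why this matters in practice, the sketch below samples an analytic torus SDF on a voxel grid and extracts a watertight, non-trivially connected mesh from its zero level set via marching cubes; a generated SDF grid can be meshed the same way for downstream use. The torus, the grid resolution, and the use of scikit-image are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: mesh the zero level set of an SDF sampled on a voxel grid.
# The analytic torus stands in for a generated SDF; all names are illustrative.
import numpy as np
from skimage import measure  # requires scikit-image

res = 128                                   # grid resolution (res^3 samples)
xs = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(xs, xs, xs, indexing="ij")

# Signed distance to a torus: negative inside the surface, positive outside.
ring = np.sqrt(x**2 + y**2) - 0.6           # distance to the ring's center circle
sdf = np.sqrt(ring**2 + z**2) - 0.25        # subtract the tube radius

# Extract the zero level set as a triangle mesh (vertices, faces).
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)
```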
However, 3D representations are infamous for demanding
high computational resources, limiting most existing 3D
generative models to voxel grids of 32³ resolution and point
clouds of 2K points. To side-step this issue, we first uti-
lize an auto-encoder to compress 3D shapes into a more
compact low-dimensional representation. Because of this,
SDFusion can easily scale up to a 128³ resolution. To
learn the probability distribution over the introduced la-
tent space, we leverage diffusion models, which have re-
cently been used with great success in various 2D genera-
tion tasks [4, 19, 22, 26, 35, 40]. Furthermore, we adopt task-
specific encoders and a cross-attention [34] mechanism to
support multiple conditioning inputs, and apply classifier-
free guidance [17] to enable flexible conditioning usage.
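To make the conditioning mechanism concrete, the sketch below shows one common way classifier-free guidance can combine two conditioning modalities (say, an image and a text prompt) when predicting the denoising direction for a latent z_t; the denoiser interface, the use of None for the learned null condition, and the guidance weights w_img and w_txt are placeholder assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of multi-modal classifier-free guidance (placeholder API).
def guided_eps(denoiser, z_t, t, c_img, c_txt, w_img=2.0, w_txt=2.0):
    # Unconditional prediction: both conditions replaced by their null tokens
    # (encoded here as None), as learned via condition dropout during training.
    eps_uncond = denoiser(z_t, t, cond_img=None, cond_txt=None)
    # Single-modality predictions, obtained by dropping the other condition.
    eps_img = denoiser(z_t, t, cond_img=c_img, cond_txt=None)
    eps_txt = denoiser(z_t, t, cond_img=None, cond_txt=c_txt)
    # Each modality contributes its own guidance direction; raising w_img or
    # w_txt pushes the sample to follow that condition more strongly.
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_txt * (eps_txt - eps_uncond))
```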
Because of these strategies, SDFusion can not only use a va-
riety of conditions from multiple modalities, but also adjust
their importance weight, as shown in Figure 1. Compared
to a recently proposed autoregressive model [25] that also
adopts an encoded latent space, SDFusion achieves supe-
rior sample quality, offers more flexibility in handling
multiple conditions, and requires less memory. With
SDFusion, we study the interplay be-
tween models trained on 2D and 3D data. Given 3D shapes
generated by SDFusion, we take advantage of an off-the-
shelf 2D diffusion model [34], neural rendering [24], and
score distillation sampling [31] to texture the shapes given
text descriptions as conditional variables.
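At a high level, this texturing stage amounts to a score-distillation update: a differentiable renderer produces an image of the generated shape from learnable appearance parameters, a frozen text-conditioned 2D diffusion model denoises a noised version of that render, and the denoising error is backpropagated into the appearance parameters. The following sketch illustrates one such update step under assumed interfaces (render_fn, diffusion.add_noise, diffusion.predict_noise); it is not the paper's exact pipeline, and the usual timestep weighting w(t) is omitted for brevity.

```python
# Minimal score-distillation sketch with placeholder interfaces.
import torch

def sds_update(render_fn, tex_params, diffusion, text_emb, optimizer,
               num_timesteps=1000):
    image = render_fn(tex_params)                  # differentiable render of the shape
    t = torch.randint(0, num_timesteps, (1,), device=image.device)
    noise = torch.randn_like(image)
    noisy = diffusion.add_noise(image, noise, t)   # forward diffusion q(x_t | x_0)
    with torch.no_grad():                          # the 2D prior stays frozen
        pred_noise = diffusion.predict_noise(noisy, t, text_emb)
    # Score distillation: treat (pred_noise - noise) as the gradient w.r.t. the
    # rendered image, skipping the diffusion model's own Jacobian.
    grad = pred_noise - noise
    loss = (grad.detach() * image).sum()           # d loss / d image == grad
    optimizer.zero_grad()
    loss.backward()                                # flows into tex_params via render_fn
    optimizer.step()
```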
We conduct extensive experiments on the ShapeNet [7],
BuildingNet [38], and Pix3D [43] datasets. We show that
SDFusion quantitatively and qualitatively outperforms prior
work in shape completion, 3D reconstruction from images,
and text-to-shape tasks. We further demonstrate the capa-
bility of jointly controlling the generative model via multi-
ple conditioning modalities, the flexibility of adjusting the
relative weights among modalities, and the ability to texture 3D
shapes given textual descriptions, as shown in Figure 1.
We summarize the main contributions as follows:
• We propose SDFusion, a diffusion-based 3D genera-
tive model which uses a signed distance function as its
3D representation and a latent space for diffusion.
• SDFusion enables conditional generation with multi-
ple modalities, and provides flexible usage by adjusting
the relative weights among modalities.
• We demonstrate a pipeline to synthesize textured 3D
objects benefiting from an interplay between 2D and
3D generative models.
2. Related Work
3D Generative Models. Unlike for 2D images, it is
less clear how to effectively represent 3D data. Indeed,
various representations with different pros and cons have
been explored, particularly when considering 3D genera-
tive models. For instance, 3D generative models have been
explored for point clouds [2, 21], voxel grids [20, 42, 47],
meshes [51], signed distance functions (SDFs) [9, 11, 25],
etc. In this work, we aim to generate an SDF. Compared
to other representations, SDFs offer a reasonable trade-off
among expressivity, memory efficiency, and direct
applicability to downstream tasks. Moreover, conditioning
3D generation of SDFs on different modalities further en-
ables many applications, including shape completion, 3D
reconstruction from images, 3D generation from text, etc.
The proposed framework handles all of these tasks with a
single model, which sets it apart from prior work.
Recently, thanks to advances in neural render-
ing [24], a new stream of research has emerged to learn
3D generation and manipulation from only 2D supervi-
sion [1, 5, 6, 28, 37, 39, 41, 49]. We believe the interplay
between these two streams of work is promising in the foresee-
able future.