SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation
Yen-Chi Cheng^1   Hsin-Ying Lee^2   Sergey Tulyakov^2   Alexander Schwing^1   Liangyan Gui^1
^1 University of Illinois Urbana-Champaign   ^2 Snap Research
{yenchic3,aschwing,lgui}@illinois.edu   {hlee5,stulyakov}@snap.com
https://yccyenchicheng.github.io/SDFusion/
[Figure 1: qualitative results for shape completion, single-image 3D reconstruction, multi-condition generation with adjustable condition strength, text-guided generation and completion (e.g., "a rocking chair", "a chair with arms", "square table, curved legs"), and text-guided colorization.]
Figure 1. Applications of SDFusion. The proposed diffusion-based model enables various applications. (left) SDFusion can generate shapes conditioned on different input modalities, including partial shapes, images, and text. SDFusion can even jointly handle multiple conditioning modalities while controlling the strength for each of them. (right) We leverage pretrained 2D models to texture 3D shapes generated by SDFusion.
Abstract
In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes, and combinations of these, further allowing the strength of each input to be adjusted. At the core of our approach is an encoder-decoder that compresses 3D shapes into a compact latent representation, upon which a diffusion model is learned. To enable a variety of multi-modal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism. Due to its flexibility, our model naturally supports a variety of tasks, outperforming prior works on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one swiss-army-knife tool, enabling the user to perform shape generation using incomplete shapes, images, and textual descriptions at the same time, providing the relative weights for each input and facilitating interactivity. Although our approach is shape-only, we further show an efficient method to texture the generated shape using large-scale text-to-image models.
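The conditioning mechanism summarized above (task-specific encoders with dropout followed by cross-attention) can be illustrated with a minimal PyTorch sketch. This is not the authors' released code; all tensor shapes, module names, and the dropout probability are illustrative assumptions.

```python
# Minimal sketch (not the SDFusion implementation) of condition dropout plus
# cross-attention between latent shape tokens and condition tokens.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Latent tokens (queries) attend to condition tokens from a task-specific encoder."""

    def __init__(self, dim: int, cond_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, z_tokens, cond_tokens):
        out, _ = self.attn(z_tokens, cond_tokens, cond_tokens)
        return z_tokens + out  # residual connection


def drop_condition(cond_tokens, null_token, p=0.1, training=True):
    """With probability p, replace the condition by a 'null' token so the model
    also learns the unconditional case (needed for classifier-free guidance)."""
    if training and torch.rand(()).item() < p:
        return null_token.expand_as(cond_tokens)
    return cond_tokens


# Illustrative shapes: flattened latent features of the compressed SDF attend
# to tokens produced by a (frozen) text encoder.
B, N, D = 2, 512, 64        # batch, latent tokens, latent channels (assumed)
M, C = 77, 768              # condition tokens, condition channels (assumed)
z_tokens = torch.randn(B, N, D)
text_tokens = torch.randn(B, M, C)
null_text = torch.zeros(1, 1, C)   # would be a learned embedding in practice

text_tokens = drop_condition(text_tokens, null_text, p=0.1, training=True)
fused = CrossAttention(D, C)(z_tokens, text_tokens)
print(fused.shape)          # torch.Size([2, 512, 64])
```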
1. Introduction
Generating 3D assets is a cornerstone of immersive augmented/virtual reality experiences. Without realistic and diverse objects, virtual worlds will look void and engagement will remain low. Despite this need, manually creating and editing 3D assets is a notoriously difficult task, requiring creativity, 3D design skills, and access to sophisticated software with a very steep learning curve. This makes 3D asset creation inaccessible to inexperienced users. Yet, in many cases, such as interior design, users more often than not have a reasonably good understanding of what they want to create. In those cases, an image or a rough sketch is sometimes accompanied by text indicating details of the asset that are hard for an amateur to express graphically.
Given this need, it is not surprising that democratizing the 3D content creation process has become an active research area. Conventional 3D generative models require direct 3D supervision in the form of point clouds [2, 21], signed distance functions (SDFs) [9, 25], voxels [42, 47], etc. Recently, first efforts have been made to learn 3D geometry from multi-view supervision with known camera poses by incorporating inductive biases via neural rendering techniques [5, 6, 14, 37, 52]. While compelling results have been demonstrated, training is often very time-consuming and ignores available 3D data that can be used to obtain good shape priors. We foresee an ideal collaborative paradigm for generative methods in which models trained on 3D data provide detailed and accurate geometry, while models trained on 2D data provide diverse appearances. A first proof of concept is shown in Figure 1.
In our pursuit of flexible and high-quality 3D shape generation, we introduce SDFusion, a diffusion-based generative model with a signed distance function (SDF) under the hood as its 3D representation. Compared to other 3D representations, SDFs are known to represent high-resolution shapes with arbitrary topology well [9, 18, 23, 30]. However, 3D representations are infamous for demanding high computational resources, which limits most existing 3D generative models to voxel grids of 32³ resolution and point clouds of 2K points. To side-step this issue, we first utilize an auto-encoder to compress 3D shapes into a more compact, low-dimensional representation. Because of this, SDFusion easily scales up to a 128³ resolution. To learn the probability distribution over the introduced latent space, we leverage diffusion models, which have recently been used with great success in various 2D generation tasks [4, 19, 22, 26, 35, 40]. Furthermore, we adopt task-specific encoders and a cross-attention [34] mechanism to support multiple conditioning inputs, and apply classifier-free guidance [17] to enable flexible use of the conditions. Because of these strategies, SDFusion can not only use a variety of conditions from multiple modalities, but can also adjust their importance weights, as shown in Figure 1. Compared to a recently proposed autoregressive model [25] that also adopts an encoded latent space, SDFusion achieves superior sample quality while offering more flexibility in handling multiple conditions and, at the same time, reduced memory usage. With SDFusion, we study the interplay between models trained on 2D and 3D data: given 3D shapes generated by SDFusion, we take advantage of an off-the-shelf 2D diffusion model [34], neural rendering [24], and score distillation sampling [31] to texture the shapes according to text descriptions provided as conditional variables.
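To make the per-condition weighting concrete, the sketch below shows one common way classifier-free guidance combines an unconditional prediction with one guided term per modality at sampling time. It is a hedged illustration, not the SDFusion code or its exact guidance formula; the denoiser interface, condition names, and weight values are hypothetical.

```python
# Hedged sketch of multi-condition classifier-free guidance at sampling time:
# eps = eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond).
import torch


@torch.no_grad()
def guided_noise(denoiser, z_t, t, conditions, weights):
    """`denoiser(z_t, t, conds)` is a hypothetical noise-prediction network;
    passing None for a condition stands for its learned 'null' embedding.
    `conditions` maps modality name -> tokens, `weights` maps name -> strength."""
    null = {name: None for name in conditions}    # every condition dropped
    eps_uncond = denoiser(z_t, t, null)
    eps = eps_uncond.clone()
    for name, tokens in conditions.items():
        single = dict(null, **{name: tokens})     # only this modality active
        eps = eps + weights[name] * (denoiser(z_t, t, single) - eps_uncond)
    return eps


# Dummy denoiser and tensors just to show the call signature.
def dummy_denoiser(z_t, t, conds):
    return torch.zeros_like(z_t)


z_t = torch.randn(1, 4, 16, 16, 16)               # latent feature grid of the SDF (assumed shape)
eps = guided_noise(
    dummy_denoiser, z_t, torch.tensor([500]),
    conditions={"text": torch.randn(1, 77, 768),
                "partial_shape": torch.randn(1, 256, 64)},
    weights={"text": 3.0, "partial_shape": 1.5},
)
print(eps.shape)                                   # torch.Size([1, 4, 16, 16, 16])
```

In this formulation, setting a modality's weight to zero ignores it entirely, while a larger weight pulls the sample toward that condition, mirroring the per-condition strength control described above.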
We conduct extensive experiments on the ShapeNet [7], BuildingNet [38], and Pix3D [43] datasets. We show that SDFusion quantitatively and qualitatively outperforms prior work in shape completion, 3D reconstruction from images, and text-to-shape tasks. We further demonstrate the capability of jointly controlling the generative model via multiple conditioning modalities, the flexibility of adjusting the relative weights among modalities, and the ability to texture 3D shapes given textual descriptions, as shown in Figure 1.
We summarize the main contributions as follows:
- We propose SDFusion, a diffusion-based 3D generative model which uses a signed distance function as its 3D representation and a latent space for diffusion.
- SDFusion enables conditional generation with multiple modalities and provides flexible usage by adjusting the weights among modalities.
- We demonstrate a pipeline to synthesize textured 3D objects that benefits from an interplay between 2D and 3D generative models.
2. Related Work
3D Generative Models. In contrast to 2D images, it is less clear how to effectively represent 3D data. Indeed, various representations with different pros and cons have been explored, particularly when considering 3D generative models. For instance, 3D generative models have been developed for point clouds [2, 21], voxel grids [20, 42, 47], meshes [51], signed distance functions (SDFs) [9, 11, 25], etc. In this work, we aim to generate an SDF. Compared to other representations, SDFs exhibit a reasonable trade-off between expressivity, memory efficiency, and direct applicability to downstream tasks. Moreover, conditioning the 3D generation of SDFs on different modalities further enables many applications, including shape completion, 3D reconstruction from images, and 3D generation from text. The proposed framework handles all of these tasks in a single model, which distinguishes it from prior work.
Recently, thanks to advances in neural rendering [24], a new stream of research has emerged that learns 3D generation and manipulation from 2D supervision only [1, 5, 6, 28, 37, 39, 41, 49]. We believe the interplay between these two streams of work holds promise for the foreseeable future.