to generate pseudo captions for 3D shape data based on their rendered 2D images and construct a large number of ⟨3D shape, pseudo caption⟩ pairs as training data, so that a text-guided 3D generation model can be trained on them. To this end, we propose a novel framework for Text-guided 3D textured shApe generation from Pseudo Supervision (TAPS3D), which can generate high-quality 3D shapes without requiring annotated text training data or test-time optimization.
Specifically, our proposed framework is composed of two modules: the first generates pseudo captions for 3D shapes, and the second feeds them into a 3D generator to conduct text-guided training. In the pseudo caption generation module, we follow the language-free text-to-image learning scheme [20, 54]. We first adopt the CLIP model to retrieve relevant words from the given rendered images. Then we construct multiple candidate sentences based on the retrieved words and pick the sentences with the highest CLIP similarity scores to the given images. The selected sentences serve as the pseudo captions for each 3D shape sample.
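A minimal sketch of this retrieval-and-ranking procedure is shown below, assuming the OpenAI CLIP package and a hypothetical word vocabulary and sentence template; the actual vocabulary, templates, and sampling strategy used in TAPS3D may differ.

```python
import random
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def generate_pseudo_caption(image, vocabulary, top_k=5, num_candidates=20):
    """Retrieve CLIP-relevant words for one rendered image, compose candidate
    sentences, and return the candidate with the highest image-text similarity.
    `vocabulary` is a hypothetical list of candidate words (nouns/adjectives)."""
    image_feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    # 1) Retrieve the top-k words most similar to the rendered image.
    word_feats = model.encode_text(clip.tokenize(vocabulary).to(device))
    word_feats = word_feats / word_feats.norm(dim=-1, keepdim=True)
    word_scores = (image_feat @ word_feats.T).squeeze(0)
    top_words = [vocabulary[i] for i in word_scores.topk(top_k).indices.tolist()]

    # 2) Compose candidate sentences from random subsets of the retrieved words
    #    (the template "a 3D rendering of ..." is an illustrative choice).
    candidates = []
    for _ in range(num_candidates):
        words = random.sample(top_words, k=random.randint(2, top_k))
        candidates.append("a 3D rendering of a " + " ".join(words))

    # 3) Keep the candidate sentence that best matches the image under CLIP.
    cand_feats = model.encode_text(clip.tokenize(candidates).to(device))
    cand_feats = cand_feats / cand_feats.norm(dim=-1, keepdim=True)
    best = (image_feat @ cand_feats.T).argmax().item()
    return candidates[best]
```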
Following the notable progress of text-to-image generation models [29, 38, 39, 42, 52], we use a text-conditioned GAN architecture for text-guided 3D generator training. We adopt the pretrained GET3D [9] model as our backbone network, since it has been demonstrated to generate high-fidelity 3D textured shapes across various object classes. We feed the pseudo captions to the generator as conditions and supervise the training process with high-level CLIP supervision to control the generated 3D shapes. Moreover, we introduce a low-level image regularization loss to produce fine-grained textures and increase geometry diversity. We empirically train only the mapping networks of the pretrained GET3D model, so that training is stable and fast and the generation quality of the pretrained model is preserved.
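The following sketch illustrates, under stated assumptions, how the mapping-network-only fine-tuning and the two supervision signals could be combined; the submodule names (`mapping_geo`, `mapping_tex`), the reference image used by the regularizer, and the loss weight are hypothetical, and the exact form of the regularization loss in the paper may differ.

```python
import torch
import torch.nn.functional as F

def freeze_all_but_mapping(generator):
    """Fine-tune only the mapping networks of a pretrained GET3D-style
    generator, keeping the synthesis branches frozen (attribute names
    `mapping_geo` / `mapping_tex` are hypothetical)."""
    for p in generator.parameters():
        p.requires_grad = False
    for p in generator.mapping_geo.parameters():  # geometry mapping network
        p.requires_grad = True
    for p in generator.mapping_tex.parameters():  # texture mapping network
        p.requires_grad = True

def taps3d_style_loss(clip_model, rendered, reference, text_tokens, lambda_reg=1.0):
    """High-level CLIP alignment loss plus a low-level image regularizer.
    `rendered` is an image rendered from the generated shape and `reference`
    is a same-viewpoint reference rendering (an assumption for illustration)."""
    img_feat = F.normalize(clip_model.encode_image(rendered), dim=-1)
    txt_feat = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    # High-level semantic alignment between the rendering and its pseudo caption.
    clip_loss = 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()
    # Low-level pixel-space regularization encouraging fine-grained textures.
    reg_loss = F.l1_loss(rendered, reference)
    return clip_loss + lambda_reg * reg_loss
```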
Our proposed model, TAPS3D, can produce high-quality 3D textured shapes with strong text control, as shown in Fig. 1, without any per-prompt test-time optimization. Our contributions can be summarized as follows:
• We introduce a new 3D textured shape generative framework, which can generate high-quality, high-fidelity 3D shapes without requiring paired text and 3D shape training data.
• We propose a simple pseudo caption generation method that enables text-conditioned 3D generator training, such that the model can generate text-controlled 3D textured shapes without test-time optimization, significantly reducing the time cost.
• We introduce a low-level image regularization loss on
top of the high-level CLIP loss in an attempt to produce
fine-grained textures and increase geometry diversity.
2. Related Work
2.1. Text-Guided 3D Shape Generation
Text-guided 3D shape generation aims to generate 3D shapes from textual descriptions, so that the generation process can be controlled. There are mainly two categories of methods, i.e., fully-supervised and optimization-based methods. Fully-supervised methods [6, 8, 25] use ground-truth text paired with 3D objects in explicit 3D representations as training data. Specifically, CLIP-Forge [43] uses a two-stage training scheme, which consists of shape autoencoder training and conditional normalizing flow training. VQ-VAE [44] performs zero-shot training on 3D voxel data by utilizing the pretrained CLIP model [37].
Regarding the optimization-based methods [13, 18, 26, 34], Neural Radiance Fields (NeRF) are usually adopted as the 3D generator. To generate a 3D shape for each input text prompt, they use the CLIP model to supervise the semantic alignment between rendered images and text prompts. Since NeRF suffers from extensive generation time, 3D-aware image synthesis [3, 4, 11, 30, 31] has become popular, which generates multi-view consistent images by integrating neural rendering into Generative Adversarial Networks (GANs). Specifically, no explicit 3D shapes are generated during the process; instead, 3D shapes can be extracted from the implicit representations, such as an occupancy field or a signed distance function (SDF), using the marching cubes algorithm. These optimization-based methods provide a solution for generating 3D shapes, but their generation speed is compromised. Although [50] is free from test-time optimization, it requires text data for training, which limits its applicable 3D object classes. Our proposed method attempts to alleviate both the paired-text-data shortage and the long optimization time of previous work, ultimately producing high-quality 3D shapes that bring the method closer to practical applications.
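As an illustration of the mesh-extraction step mentioned above, the sketch below samples an implicit SDF on a regular grid and runs scikit-image's marching cubes at the zero level set; the grid resolution and bounding box are illustrative choices.

```python
import numpy as np
from skimage import measure

def sdf_to_mesh(sdf_fn, resolution=128, bound=1.0):
    """Extract an explicit triangle mesh from an implicit SDF `sdf_fn`
    (a function mapping (N, 3) points to signed distances) by sampling it
    on a regular grid and running marching cubes at the zero level set."""
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    sdf_values = sdf_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(sdf_values, level=0.0)
    # Rescale vertices from voxel-index coordinates back to world coordinates.
    verts = verts / (resolution - 1) * 2.0 * bound - bound
    return verts, faces, normals
```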
2.2. Text-to-Image Synthesis
We draw inspiration from text-to-image generation methods. Typically, many research works [36, 40, 53] adopt a conditional GAN architecture, in which the text features are directly concatenated with the random noise as the generator input. Recently, autoregressive models [7, 39] and diffusion models [29, 38, 41, 42] have made great improvements in text-to-image synthesis, while demanding huge computational resources and massive training data.
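A toy illustration of the concatenation-based conditioning described above is given below; the layer sizes, output resolution, and text-embedding dimension are arbitrary and not taken from any particular paper.

```python
import math
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Minimal conditional generator: the text embedding is concatenated with
    the noise vector and fed to the synthesis network (sizes are illustrative)."""

    def __init__(self, noise_dim=128, text_dim=512, out_shape=(3, 64, 64)):
        super().__init__()
        self.out_shape = out_shape
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, math.prod(out_shape)),
            nn.Tanh(),
        )

    def forward(self, noise, text_embedding):
        # Condition the generator by concatenating text features with noise.
        z = torch.cat([noise, text_embedding], dim=-1)
        return self.net(z).view(-1, *self.out_shape)

# Example: images = TextConditionedGenerator()(torch.randn(4, 128), torch.randn(4, 512))
```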
With the introduction of the StyleGAN [14–16] mapping network, the input random noise can first be mapped to another latent space with disentangled semantics, allowing the model to generate images with better quality. Furthermore, exploring the latent space of StyleGAN has been proven useful by several works [1, 17, 33] in text-driven image synthesis