sequences with more challenging textual descriptions. However, neither approach is straightforward: both involve three stages for text-to-motion generation and sometimes fail to generate high-quality motion consistent with the text (see Figure 4 and more visual results on the project page). Recently, diffusion-based models [31] have shown impressive results on image generation [59]; they were subsequently introduced to motion generation by MDM [65] and MotionDiffuse [72] and now dominate the text-to-motion generation task. However, we find that, compared to classic approaches such as VQ-VAE [67], the performance gain of the diffusion-based approaches [65, 72] might not be that significant. In this work, we are inspired by recent advances in learning discrete representations for generation [5, 15, 16, 18, 44, 57, 67, 69] and investigate a simple and classic framework based on Vector Quantized Variational Autoencoders (VQ-VAE) [67] and Generative Pre-trained Transformer (GPT) [55, 68] for text-to-motion generation.
Precisely, we propose a two-stage method for motion
generation from textual descriptions. In stage 1, we use a
standard 1D convolutional network to map motion sequences
to discrete code indices. In stage 2, a standard GPT-like model [55, 68] is trained to generate sequences of code indices conditioned on a pre-trained text embedding. We find that naive training of the VQ-VAE [67] suffers from codebook collapse. One effective remedy is to leverage two standard recipes during training: exponential moving average (EMA) codebook updates and Code Reset. We provide a full analysis of different quantization strategies. For GPT, next-token prediction introduces a discrepancy between training and inference. We observe that simply corrupting sequences during training alleviates this discrepancy. Moreover,
throughout the evolution of image generation, the size of
the dataset has played an important role. We further explore
the impact of dataset size on the performance of our model.
The empirical analysis suggests that the performance of our
model can potentially be improved with larger datasets.
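To make the two-stage recipe concrete, the sketch below illustrates in PyTorch how motion frames could be mapped to code indices by a 1D-convolutional encoder, and how ground-truth index sequences could be corrupted before GPT training. Module names, layer sizes, the feature dimension, and the corruption ratio are illustrative assumptions, not the exact architecture or hyper-parameters of this work.

```python
# Minimal sketch of the two-stage idea (hypothetical names and sizes).
import torch
import torch.nn as nn


class MotionEncoder(nn.Module):
    """Stage 1: 1D convolutions temporally downsample a motion sequence."""

    def __init__(self, motion_dim=263, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(motion_dim, latent_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, motion):                    # motion: (B, T, motion_dim)
        z = self.net(motion.transpose(1, 2))      # (B, latent_dim, T / 4)
        return z.transpose(1, 2)                  # (B, T / 4, latent_dim)


def quantize(z, codebook):
    """Map each latent frame to the index of its nearest codebook entry."""
    # z: (B, N, D), codebook: (K, D) -> squared Euclidean distances (B, N, K)
    dist = (z.pow(2).sum(-1, keepdim=True)
            - 2 * z @ codebook.t()
            + codebook.pow(2).sum(-1))
    return dist.argmin(dim=-1)                    # (B, N) integer code indices


def corrupt(indices, num_codes, ratio=0.3):
    """Stage 2 trick: randomly replace part of the ground-truth indices so the
    GPT also sees imperfect prefixes, reducing the train/inference gap."""
    mask = torch.rand(indices.shape, device=indices.device) < ratio
    random_codes = torch.randint_like(indices, num_codes)
    return torch.where(mask, random_codes, indices)


# Usage sketch (stage 2): the GPT is trained to predict the next, un-corrupted
# code index from the corrupted prefix, conditioned on a pre-trained text
# embedding (e.g. a sentence feature from a frozen text encoder).
# codes  = quantize(MotionEncoder()(motion), codebook)
# gpt_in = corrupt(codes, num_codes=codebook.size(0))
```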
Despite its simplicity, our approach can generate high-
quality motion sequences that are consistent with challeng-
ing text descriptions (Figure 1 and more on the project
page). Empirically, we achieve comparable or even better
performances than concurrent diffusion-based approaches
MDM [65] and HumanDiffuse [72] on two widely used
datasets: HumanML3D [21] and KIT-ML [53]. For example,
on HumanML3D, which is currently the largest dataset, we
achieve comparable performance on the consistency between text and generated motion (R-Precision), while our FID of 0.116 largely outperforms the 0.630 of MotionDiffuse. We conduct
comprehensive experiments to explore this area, and hope
that these experiments and conclusions will contribute to
future developments.
In summary, our contributions include:
• We present a simple yet effective approach for motion generation from textual descriptions. Our approach achieves state-of-the-art performance on the HumanML3D [21] and KIT-ML [53] datasets.
• We show that GPT-like models incorporating discrete representations still remain a very competitive approach for motion generation.
• We provide a detailed analysis of the impact of quanti-
zation strategies and dataset size. We show that a larger dataset may still offer promising prospects for the community.
Our implementation is available on the project page.
2. Related work
VQ-VAE.
The Vector Quantized Variational Autoencoder (VQ-VAE), a variant of the VAE [35], was initially proposed in [67]. VQ-VAE is an autoencoder architecture that learns to reconstruct its input from discrete representations. Recently, VQ-VAE has achieved promising performance on generative tasks across different modalities, including image synthesis [18, 69], text-to-image generation [57], speech gesture generation [5], and music generation [15, 16]. The success of VQ-VAE for generation might be attributed to its decoupling of learning the discrete representation from learning the prior. Naive training of VQ-VAE suffers from codebook collapse, i.e., only a small number of codes are activated, which severely limits the quality of both reconstruction and generation. To alleviate this problem, a number of techniques can be used during training, including a stop-gradient operator with dedicated losses [67] to optimize the codebook, exponential moving average (EMA) codebook updates [69], and resetting inactive codes during training (Code Reset [69]).
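For reference, a common form of these two recipes is sketched below in PyTorch; the decay value, the inactivity threshold, and the variable names are illustrative assumptions and not tied to any particular implementation.

```python
# Minimal sketch of EMA codebook updates with Code Reset (hypothetical names,
# threshold, and decay; a standard recipe, not a specific implementation).
import torch


@torch.no_grad()
def update_codebook(codebook, ema_count, ema_sum, z, indices,
                    decay=0.99, eps=1e-5):
    """codebook: (K, D), z: (N, D) encoder outputs, indices: (N,) assignments."""
    K = codebook.size(0)
    onehot = torch.zeros(z.size(0), K, device=z.device)
    onehot.scatter_(1, indices.unsqueeze(1), 1.0)

    count = onehot.sum(dim=0)                 # usage count of each code
    summed = onehot.t() @ z                   # (K, D) sum of assigned vectors

    # EMA: smooth usage counts and assigned-vector sums over mini-batches, then
    # move each code towards the mean of the encoder outputs assigned to it.
    ema_count.mul_(decay).add_(count, alpha=1 - decay)
    ema_sum.mul_(decay).add_(summed, alpha=1 - decay)
    codebook.copy_(ema_sum / (ema_count.unsqueeze(1) + eps))

    # Code Reset: re-initialize codes that are (almost) never used with randomly
    # chosen encoder outputs, so the whole codebook stays active.
    dead = ema_count < 1.0
    if dead.any():
        picks = z[torch.randint(0, z.size(0), (int(dead.sum()),), device=z.device)]
        codebook[dead] = picks
        ema_sum[dead] = picks
        ema_count[dead] = 1.0
```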
Human motion synthesis.
Research on human motion
synthesis has a long history [8]. One of the most active
research fields is human motion prediction, which aims at
predicting the future motion sequence based on past ob-
served motion. Approaches mainly focus on efficiently and
effectively fusing spatial and temporal information to gen-
erate deterministic future motion through different models:
RNN [12, 19, 49, 50], GAN [9, 29], GCN [48], Attention [47], or even simple MLP [11, 24]. Some approaches aim at generating diverse motion through VAEs [4, 25, 71]. In addition to synthesizing motion conditioned on past motion,
another related topic is generating motion “in-betweening”
that takes both past and future poses and fills motion between
them [17, 26, 27, 34, 63]. [50] considers the generation of locomotion sequences from a given trajectory for simple actions such as walking and running. Motion can also be generated from music to produce 3D dance motion [6, 13, 36, 37, 39, 40].
For unconstrained generation, [70] generates a long sequence all at once by transforming a sequence of latent