T2M-GPT: Generating Human Motion from Textual Descriptions with
Discrete Representations
Jianrong Zhang¹,³*, Yangsong Zhang²,³*, Xiaodong Cun³, Shaoli Huang³, Yong Zhang³,
Hongwei Zhao¹, Hongtao Lu², Xi Shen³†
*Equal contribution   †Corresponding author
¹Jilin University   ²Shanghai Jiao Tong University   ³Tencent AI Lab
Abstract
In this work, we investigate a simple and classic conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, largely outperforming MotionDiffuse at 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation. Our implementation is available on the project page: https://mael-zys.github.io/T2M-GPT/.
1. Introduction
Generating motion from textual descriptions can be used
in numerous applications in the game industry, film-making,
and animating robots. For example, a typical way to access
new motion in the game industry is to perform motion cap-
ture, which is expensive. Therefore, automatically generating
motion from textual descriptions, which allows producing
meaningful motion data, could save time and be more eco-
nomical.
Motion generation conditioned on natural language is
challenging, as motion and text are from different modali-
ties. The model is expected to learn a precise mapping from the language space to the motion space.
Figure 1. Visual results on HumanML3D [21]. Our approach is able to generate precise and high-quality human motion consistent with challenging text descriptions. More visual results are on the project page.
To this end, many
works propose to learn a joint embedding for language and
motion using auto-encoders [3, 20, 64] and VAEs [51, 52].
MotionCLIP [64] aligns the motion space with the CLIP [54] space. ACTOR [51] and TEMOS [52] propose transformer-based VAEs for action-to-motion and text-to-motion generation, respectively. These works show promising performance with simple descriptions but struggle to produce high-quality motion when textual descriptions become long and complicated.
Guo et al. [21] and TM2T [22] aim to generate motion
sequences with more challenging textual descriptions. However, neither approach is straightforward: both involve three stages for text-to-motion generation and sometimes fail to generate high-quality motion consistent with the text (see Figure 4 and more visual results on the project page). Recently, diffusion-based models [31] have shown impressive results on image generation [59]; they were then introduced to motion generation by MDM [65] and MotionDiffuse [72] and now dominate the text-to-motion generation task. However,
we find that compared to classic approaches, such as VQ-
VAE [67], the performance gain of the diffusion-based ap-
proaches [65, 72] might not be that significant. In this work,
we are inspired by recent advances in learning discrete representations for generation [5, 15, 16, 18, 44, 57, 67, 69] and
investigate a simple and classic framework based on Vector
Quantized Variational Autoencoders (VQ-VAE) [67] and
Generative Pre-trained Transformer (GPT) [55,68] for text-
to-motion generation.
Precisely, we propose a two-stage method for motion
generation from textual descriptions. In stage 1, we use a
standard 1D convolutional network to map motion sequences
to discrete code indices. In stage 2, a standard GPT-like
model [55, 68] is learned to generate sequences of code in-
dices from a pre-trained text embedding. We find that naive training of the VQ-VAE [67] suffers from codebook collapse. One effective solution is to leverage two standard recipes during training: EMA and Code Reset. We provide a full analysis of different quantization strategies. For GPT, next-token prediction introduces an inconsistency between training and inference. We observe that simply corrupting input sequences during training alleviates this discrepancy. Moreover,
throughout the evolution of image generation, the size of
the dataset has played an important role. We further explore
the impact of dataset size on the performance of our model.
The empirical analysis suggests that the performance of our
model can potentially be improved with larger datasets.
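To make this corruption strategy concrete, the following PyTorch-style sketch shows one way the ground-truth code sequence could be randomly corrupted before being fed to the GPT during training. It is an illustrative example only: the function name, the replacement ratio tau, and the uniform-replacement scheme are assumptions, not the exact implementation used in this paper.

import torch

def corrupt_code_sequence(codes: torch.Tensor, codebook_size: int, tau: float = 0.5) -> torch.Tensor:
    # codes: (batch, seq_len) tensor of ground-truth VQ-VAE code indices.
    # Each position is independently replaced by a random code with probability tau.
    mask = torch.rand(codes.shape, device=codes.device) < tau
    random_codes = torch.randint_like(codes, low=0, high=codebook_size)
    return torch.where(mask, random_codes, codes)

# During teacher-forced GPT training, the corrupted sequence would serve as input,
# while the loss is still computed against the original (uncorrupted) targets:
#   inputs  = corrupt_code_sequence(gt_codes, codebook_size)[:, :-1]
#   targets = gt_codes[:, 1:]

Feeding partially corrupted inputs exposes the model to imperfect contexts at training time, mimicking the error accumulation it faces during autoregressive inference.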
Despite its simplicity, our approach can generate high-
quality motion sequences that are consistent with challeng-
ing text descriptions (Figure 1 and more on the project
page). Empirically, we achieve comparable or even better performance than the concurrent diffusion-based approaches MDM [65] and MotionDiffuse [72] on two widely used datasets: HumanML3D [21] and KIT-ML [53]. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, largely outperforming MotionDiffuse at 0.630. We conduct
comprehensive experiments to explore this area, and hope
that these experiments and conclusions will contribute to
future developments.
In summary, our contributions include:
• We present a simple yet effective approach for motion generation from textual descriptions. Our approach achieves state-of-the-art performance on the HumanML3D [21] and KIT-ML [53] datasets.
• We show that GPT-like models incorporating discrete representations still remain a very competitive approach for motion generation.
• We provide a detailed analysis of the impact of quantization strategies and dataset size. We show that a larger dataset might still offer a promising prospect to the community.
Our implementation is available on the project page.
2. Related work
VQ-VAE.
The Vector Quantized Variational Autoencoder (VQ-VAE), a variant of the VAE [35], was initially proposed in [67]. VQ-VAE consists of an autoencoder architecture that learns to reconstruct its input from discrete representations. Recently, VQ-VAE has achieved promising performance on generative tasks across different modalities, including image synthesis [18, 69], text-to-image generation [57], speech gesture generation [5], music generation [15, 16], etc. The success of VQ-VAE for generation might be attributed to its decoupling of learning the discrete representation from learning the prior. Naive training of a VQ-VAE suffers from codebook collapse, i.e., only a small number of codes are activated, which severely limits the performance of both reconstruction and generation. To alleviate this problem, a number of techniques can be used during training, including a stop-gradient operator with auxiliary losses [67] to optimize the codebook, an exponential moving average (EMA) codebook update [69], and resetting inactive codes during training (Code Reset [69]).
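As a concrete illustration of these two recipes, the PyTorch-style sketch below combines a minimal EMA codebook update with a reset of inactive codes. It is a sketch under assumed hyperparameters (codebook size, embedding dimension, decay, reset threshold), not the exact implementation of [69] or of this paper.

import torch
import torch.nn.functional as F

class EMACodebook:
    # Minimal sketch: VQ codebook maintained with exponential moving averages (EMA)
    # and re-initialisation of rarely used codes (Code Reset). Values are illustrative.
    def __init__(self, num_codes=512, dim=512, decay=0.99, reset_threshold=1.0):
        self.decay = decay
        self.reset_threshold = reset_threshold
        self.codebook = torch.randn(num_codes, dim)   # code vectors e_k
        self.ema_count = torch.zeros(num_codes)       # EMA of per-code usage counts
        self.ema_sum = self.codebook.clone()          # EMA of summed encoder outputs per code

    @torch.no_grad()
    def update(self, z_e):
        # z_e: (n, dim) flattened encoder outputs from the current batch.
        # 1) Assign each encoder output to its nearest code.
        idx = torch.cdist(z_e, self.codebook).argmin(dim=1)
        onehot = F.one_hot(idx, self.codebook.shape[0]).float()

        # 2) EMA update of usage counts and accumulated encoder outputs, then
        #    recompute each code as the smoothed mean of its assigned vectors.
        self.ema_count = self.decay * self.ema_count + (1 - self.decay) * onehot.sum(0)
        self.ema_sum = self.decay * self.ema_sum + (1 - self.decay) * (onehot.t() @ z_e)
        self.codebook = self.ema_sum / (self.ema_count.unsqueeze(1) + 1e-5)

        # 3) Code Reset: re-initialise inactive codes with random encoder outputs
        #    and reset their EMA statistics accordingly.
        dead = self.ema_count < self.reset_threshold
        if dead.any():
            rand_rows = torch.randint(0, z_e.shape[0], (int(dead.sum()),))
            self.codebook[dead] = z_e[rand_rows]
            self.ema_sum[dead] = self.codebook[dead]
            self.ema_count[dead] = 1.0
        return idx

The EMA update keeps the codebook close to the running mean of the encoder outputs assigned to each code, while the reset step keeps all codes in use and counteracts codebook collapse.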
Human motion synthesis.
Research on human motion
synthesis has a long history [8]. One of the most active
research fields is human motion prediction, which aims at
predicting the future motion sequence based on past ob-
served motion. Approaches mainly focus on efficiently and
effectively fusing spatial and temporal information to gen-
erate deterministic future motion through different models:
RNN [12,19,49, 50], GAN [9,29], GCN [48], Attention [47]
or even simply MLPs [11, 24]. Some approaches aim at generating diverse motion through VAEs [4, 25, 71]. In addition to synthesizing motion conditioned on past motion, another related topic is motion "in-betweening", which takes both past and future poses and fills in the motion between them [17, 26, 27, 34, 63]. [50] considers the generation of locomotion sequences from a given trajectory for simple actions such as walking and running. Motion can also be generated from music to produce 3D dance motion [6, 13, 36, 37, 39, 40].
For unconstrained generations, [70] generates a long se-
quence altogether by transforming from a sequence of latent