sequences with more challenging textual descriptions. However, neither approach is straightforward: both involve three stages for text-to-motion generation and sometimes fail to generate high-quality motion consistent with the text (see Figure 4 and more visual results on the project page). Recently, diffusion-based models [31] have shown impressive results on image generation [59]; they were subsequently introduced to motion generation by MDM [65] and MotionDiffuse [72] and now dominate the text-to-motion generation task. However, we find that, compared to classic approaches such as VQ-VAE [67], the performance gain of the diffusion-based approaches [65, 72] might not be that significant. In this work, we are inspired by recent advances in learning discrete representations for generation [5, 15, 16, 18, 44, 57, 67, 69] and investigate a simple and classic framework based on Vector Quantized Variational Autoencoders (VQ-VAE) [67] and Generative Pre-trained Transformer (GPT) [55, 68] for text-to-motion generation.
Precisely, we propose a two-stage method for motion
generation from textual descriptions. In stage 1, we use a
standard 1D convolutional network to map motion sequences
to discrete code indices. In stage 2, a standard GPT-like model [55, 68] is trained to generate sequences of code indices conditioned on a pre-trained text embedding. We find that naive training of the VQ-VAE [67] suffers from codebook collapse. One effective remedy is to leverage two standard recipes during training: exponential moving average (EMA) codebook updates and Code Reset. We provide a full analysis of different quantization strategies. For GPT, next-token prediction introduces a discrepancy between training and inference. We observe that simply corrupting sequences during training alleviates this discrepancy. Moreover,
throughout the evolution of image generation, the size of
the dataset has played an important role. We further explore
the impact of dataset size on the performance of our model.
The empirical analysis suggests that the performance of our
model can potentially be improved with larger datasets.
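To make the two-stage recipe concrete, the sketch below illustrates in PyTorch how motion frames could be mapped to code indices by a 1D-convolutional encoder, and how ground-truth index sequences could be corrupted before GPT training. Module names, layer sizes, the feature dimension, and the corruption ratio are illustrative assumptions, not the exact architecture or hyper-parameters of this work.

```python
# Minimal sketch of the two-stage idea (hypothetical names and sizes).
import torch
import torch.nn as nn


class MotionEncoder(nn.Module):
    """Stage 1: 1D convolutions temporally downsample a motion sequence."""

    def __init__(self, motion_dim=263, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(motion_dim, latent_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, motion):                    # motion: (B, T, motion_dim)
        z = self.net(motion.transpose(1, 2))      # (B, latent_dim, T / 4)
        return z.transpose(1, 2)                  # (B, T / 4, latent_dim)


def quantize(z, codebook):
    """Map each latent frame to the index of its nearest codebook entry."""
    # z: (B, N, D), codebook: (K, D) -> squared Euclidean distances (B, N, K)
    dist = (z.pow(2).sum(-1, keepdim=True)
            - 2 * z @ codebook.t()
            + codebook.pow(2).sum(-1))
    return dist.argmin(dim=-1)                    # (B, N) integer code indices


def corrupt(indices, num_codes, ratio=0.3):
    """Stage 2 trick: randomly replace part of the ground-truth indices so the
    GPT also sees imperfect prefixes, reducing the train/inference gap."""
    mask = torch.rand(indices.shape, device=indices.device) < ratio
    random_codes = torch.randint_like(indices, num_codes)
    return torch.where(mask, random_codes, indices)


# Usage sketch (stage 2): the GPT is trained to predict the next, un-corrupted
# code index from the corrupted prefix, conditioned on a pre-trained text
# embedding (e.g. a sentence feature from a frozen text encoder).
# codes  = quantize(MotionEncoder()(motion), codebook)
# gpt_in = corrupt(codes, num_codes=codebook.size(0))
```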
Despite its simplicity, our approach can generate high-
quality motion sequences that are consistent with challeng-
ing text descriptions (Figure 1 and more on the project
page). Empirically, we achieve comparable or even better
performances than concurrent diffusion-based approaches
MDM [65] and HumanDiffuse [72] on two widely used
datasets: HumanML3D [21] and KIT-ML [53]. For example,
on HumanML3D, which is currently the largest dataset, we
achieve comparable performance on the consistency between text and generated motion (R-Precision), while our FID of 0.116 largely outperforms the 0.630 of MotionDiffuse. We conduct
comprehensive experiments to explore this area, and hope
that these experiments and conclusions will contribute to
future developments.
In summary, our contributions include:
• We present a simple yet effective approach for motion generation from textual descriptions. Our approach achieves state-of-the-art performance on the HumanML3D [21] and KIT-ML [53] datasets.
• We show that GPT-like models incorporating discrete representations still remain a very competitive approach for motion generation.
• We provide a detailed analysis of the impact of quanti-
zation strategies and dataset size. We show that a larger dataset may still offer promising prospects for the community.
Our implementation is available on the project page.
2. Related work
VQ-VAE.
The Vector Quantized Variational Autoencoder (VQ-VAE), a variant of the VAE [35], was initially proposed in [67]. VQ-VAE is an autoencoder architecture that learns to reconstruct its input from discrete representations. Recently, VQ-VAE has achieved promising performance on generative tasks across different modalities, including image synthesis [18, 69], text-to-image generation [57], speech gesture generation [5], and music generation [15, 16]. The success of VQ-VAE for generation might be attributed to its decoupling of learning the discrete representation from learning the prior. Naive training of VQ-VAE suffers from codebook collapse, i.e., only a small number of codes are activated, which severely limits the quality of both reconstruction and generation. To alleviate this problem, a number of techniques can be used during training, including a stop-gradient operator with dedicated losses [67] to optimize the codebook, exponential moving average (EMA) codebook updates [69], and resetting inactive codes during training (Code Reset [69]).
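For reference, a common form of these two recipes is sketched below in PyTorch; the decay value, the inactivity threshold, and the variable names are illustrative assumptions and not tied to any particular implementation.

```python
# Minimal sketch of EMA codebook updates with Code Reset (hypothetical names,
# threshold, and decay; a standard recipe, not a specific implementation).
import torch


@torch.no_grad()
def update_codebook(codebook, ema_count, ema_sum, z, indices,
                    decay=0.99, eps=1e-5):
    """codebook: (K, D), z: (N, D) encoder outputs, indices: (N,) assignments."""
    K = codebook.size(0)
    onehot = torch.zeros(z.size(0), K, device=z.device)
    onehot.scatter_(1, indices.unsqueeze(1), 1.0)

    count = onehot.sum(dim=0)                 # usage count of each code
    summed = onehot.t() @ z                   # (K, D) sum of assigned vectors

    # EMA: smooth usage counts and assigned-vector sums over mini-batches, then
    # move each code towards the mean of the encoder outputs assigned to it.
    ema_count.mul_(decay).add_(count, alpha=1 - decay)
    ema_sum.mul_(decay).add_(summed, alpha=1 - decay)
    codebook.copy_(ema_sum / (ema_count.unsqueeze(1) + eps))

    # Code Reset: re-initialize codes that are (almost) never used with randomly
    # chosen encoder outputs, so the whole codebook stays active.
    dead = ema_count < 1.0
    if dead.any():
        picks = z[torch.randint(0, z.size(0), (int(dead.sum()),), device=z.device)]
        codebook[dead] = picks
        ema_sum[dead] = picks
        ema_count[dead] = 1.0
```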
Human motion synthesis.
Research on human motion
synthesis has a long history [8]. One of the most active
research fields is human motion prediction, which aims at
predicting the future motion sequence based on past ob-
served motion. Approaches mainly focus on efficiently and
effectively fusing spatial and temporal information to gen-
erate deterministic future motion through different models:
RNN [12, 19, 49, 50], GAN [9, 29], GCN [48], Attention [47], or even simple MLP [11, 24]. Some approaches aim at generating diverse motion through VAEs [4, 25, 71]. In addition to synthesizing motion conditioned on past motion,
another related topic is generating motion “in-betweening”
that takes both past and future poses and fills motion between
them [17, 26, 27, 34, 63]. [50] considers the generation of locomotion sequences from a given trajectory for simple actions such as walking and running. Motion can also be generated from music to produce 3D dance motion [6, 13, 36, 37, 39, 40].
For unconstrained generation, [70] generates a long sequence all at once by transforming a sequence of latent