
Based on the above insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO)
for video prediction. We distinguish objects from scenes
and utilize motion signals to guide their integration. In the
first stage, MOSO-VQVAE is developed to learn motion,
scene and object decomposition encoding and video decod-
ing in a self-supervised manner. Each decomposed com-
ponent is equipped with an independent encoder to learn
its features and to produce a distinct group of discrete to-
kens. To deal with different motion patterns, we integrate
the object and scene features under the guidance of the cor-
responding motion feature. Then the video details can be
decoded and rebuilt from the merged features. In particular,
the decoding process is devised to be time-independent, so
that a decomposed component or a single video frame can
be decoded for flexible visualization.
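For concreteness, the encode-merge-decode flow of this first stage can be sketched in a few lines of PyTorch. The sketch is illustrative only: the linear "encoders", the tensor shapes, the sigmoid gating rule used for motion-guided merging and all module names are our assumptions rather than the paper's exact architecture, and training objectives (e.g., the reconstruction loss and straight-through estimator) are omitted.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    # Maps continuous features to their nearest codebook entries (discrete tokens).
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):  # z: (B, T, dim)
        # Squared Euclidean distance from every feature to every code.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        tokens = d.argmin(dim=-1)              # (B, T) discrete token ids
        return self.codebook(tokens), tokens

class MOSOVQVAESketch(nn.Module):
    # Three independent encoders/codebooks + motion-guided merging + per-frame decoder.
    def __init__(self, frame_dim=3 * 64 * 64, dim=256):
        super().__init__()
        self.enc_mo = nn.Linear(frame_dim, dim)   # stand-in motion encoder
        self.enc_sc = nn.Linear(frame_dim, dim)   # stand-in scene encoder
        self.enc_ob = nn.Linear(frame_dim, dim)   # stand-in object encoder
        self.vq_mo, self.vq_sc, self.vq_ob = (VectorQuantizer(dim=dim) for _ in range(3))
        # Motion features decide how object and scene features are combined.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # The decoder sees one time step at a time, i.e. decoding is time-independent.
        self.dec = nn.Linear(dim, frame_dim)

    def forward(self, video):  # video: (B, T, frame_dim), flattened frames
        z_mo, tok_mo = self.vq_mo(self.enc_mo(video))
        z_sc, tok_sc = self.vq_sc(self.enc_sc(video))
        z_ob, tok_ob = self.vq_ob(self.enc_ob(video))
        g = self.gate(z_mo)                   # motion-guided weights in [0, 1]
        merged = g * z_ob + (1.0 - g) * z_sc  # integrate object and scene features
        return self.dec(merged), (tok_mo, tok_sc, tok_ob)

# Example: reconstruct a dummy 16-frame clip and obtain its three token groups.
# recon, (tok_mo, tok_sc, tok_ob) = MOSOVQVAESketch()(torch.randn(2, 16, 3 * 64 * 64))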
In the second stage, MOSO-Transformer is proposed to
generate a subsequent video clip based on a previous video
clip. Motivated by the production of animation, which first
determines character identities and then portrays a series of
actions, MOSO-Transformer first predicts the object and
scene tokens of the subsequent video clip from those of the
previous video clip. Then the motion tokens of the subse-
quent video clip are generated based on the predicted scene
and object tokens and the motion tokens of the previous
video clip. The predicted object, scene, and motion tokens
can be decoded to the subsequent video clip using MOSO-
VQVAE. By modeling video prediction at the token level,
MOSO-Transformer is relieved of the burden of model-
ing millions of pixels and can instead focus on capturing
global context relationships. In addition, our framework can
be easily extended to other video generation tasks, includ-
ing unconditional video generation and video frame inter-
polation tasks, by simply revising the training or generation
pipelines of MOSO-Transformer.
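The generation order of this second stage (identities first, then motions, then decoding) is summarized by the sketch below. The function and the three callables it receives (scene_object_model, motion_model, moso_vqvae) are hypothetical placeholders for the trained MOSO-Transformer heads and MOSO-VQVAE, not the paper's interface; only the ordering of the steps follows the description above.

def predict_next_clip(prev_tokens, scene_object_model, motion_model, moso_vqvae):
    # prev_tokens: dict with the 'scene', 'object' and 'motion' tokens of the observed clip.
    # Step 1: determine identities first -- predict the scene and object tokens
    # of the subsequent clip from those of the previous clip.
    next_scene, next_obj = scene_object_model(
        prev_scene=prev_tokens["scene"],
        prev_obj=prev_tokens["object"],
    )
    # Step 2: portray the actions -- generate the subsequent motion tokens
    # conditioned on the predicted identities and the previous motion tokens.
    next_motion = motion_model(
        scene=next_scene,
        obj=next_obj,
        prev_motion=prev_tokens["motion"],
    )
    # Step 3: decode the three predicted token groups to frames with MOSO-VQVAE.
    return moso_vqvae.decode(scene=next_scene, obj=next_obj, motion=next_motion)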
Our contributions are summarized as follows:
• We propose a novel two-stage framework MOSO for
video prediction, which decomposes videos into motion, scene and object components and conducts video prediction at the token level.
• MOSO-VQVAE is proposed to learn motion, scene
and object decomposition encoding and time-independent
video decoding in a self-supervised manner, which allows
video manipulation and flexible video decoding.
• MOSO-Transformer is proposed to first determine the
scene and object identities of subsequent video clips and
then predict subsequent motions at the token level.
• Qualitative and quantitative experiments on five chal-
lenging benchmarks of video prediction and unconditional
video generation demonstrate that our proposed method
achieves new state-of-the-art performance.
2. Related Work
Video Prediction The video prediction task has received
increasing interest in the computer vision field. ConvLSTM
[40] combines CNN and LSTM architectures and adopts
an adversarial loss. MCnet [47] is the first to decompose motion and content for pixel-level future video prediction. GVSD [49] proposes a spatio-temporal CNN
combined with adversarial training to untangle foreground
objects from background scenes, although severe distortions of object appearances remain in its predicted video frames.
MCVD [48] adopts a denoising diffusion model to conduct
several video-related tasks conditioned on past and/or future
frames. Although previous models can predict consistent
subsequent videos, they still suffer from indistinct or dis-
torted visual appearances since they lack a stable generator
or fail to decouple different motion patterns. SLAMP [1]
and vid2vid [50] decompose video appearance and motion
for video prediction with the help of optical flow. SADM
[4] proposes a semantic-aware dynamic model that pre-
dicts and fuses the semantic maps (content) and optical flow
maps (motion) of future video frames. In addition to opti-
cal flow and semantic maps, Wu et al. [57] further utilize
instance maps to help separate objects from backgrounds.
Although these works also decompose video components,
they are more complicated than MOSO since they require additional supervision such as optical flow, semantic maps, or instance maps. Furthermore, these pre-
vious works are primarily based on generative adversarial
networks or recurrent neural networks, while MOSO fol-
lows a recently developed two-stage autoregressive gener-
ation framework, which demonstrates greater potential on
open domain visual generation tasks.
Two-stage Visual Generation The two-stage frame-
work was first proposed for image generation [11, 14, 36] and
demonstrates excellent generation ability. Motivated by this success, several attempts have been made to extend the two-stage framework to video generation tasks [14, 55, 58].
For video prediction, MaskViT [20] encodes videos frame by frame through VQ-GAN [14] and models video tokens with
a bidirectional Transformer through window attention. For
unconditional video generation, VideoGPT [58] encodes
videos by employing 3D convolutions and axial attention,
and then models video tokens in an auto-regressive manner.
However, existing two-stage works for video tasks do not
consider video component decomposition and thus suffer from flicker artifacts and high computation costs.
3. MOSO
In this section, we present our proposed framework
MOSO in detail. MOSO is a novel two-stage frame-
work for video prediction and consists of MOSO-VQVAE
and MOSO-Transformer, where MOSO-VQVAE encodes
decomposed video components into tokens and MOSO-Transformer conducts video prediction at the token level.