MOSO: Decomposing MOtion, Scene and Object for Video Prediction

Mingzhen Sun 1,2, Weining Wang 1, Xinxin Zhu 1, Jing Liu 1,2,*

1 The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences (CASIA)
2 School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)

sunmingzhen2020@ia.ac.cn  {weining.wang, xinxin.zhu, jliu}@nlpr.ia.ac.cn
Abstract

Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO)¹ for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101. In addition, MOSO can produce realistic videos by combining objects and scenes from different videos.
1. Introduction

Video prediction aims to generate future video frames based on a past video without any additional annotations [6, 18], which is important for video perception systems such as autonomous driving [25], robotic navigation [16] and decision making in daily life [5]. Considering that video is a spatio-temporal record of moving objects, an ideal solution for video prediction should depict visual content in the spatial domain accurately and predict motions in the temporal domain reasonably. However, easily distorted object identities and the infinite possibilities of motion trajectories make video prediction a challenging task.
* Corresponding Author
¹ Code has been released at https://github.com/iva-mzsun/MOSO
Figure 1. Rebuilding video signals based on (a) traditional decomposed content and motion signals or (b) our decomposed scene, object and motion signals. Decomposing content and motion signals causes blurred and distorted appearance of the wrestling man, while further separating objects from scenes resolves this issue.
Recently, several works [15, 47] have proposed to decompose video signals into content and motion, with content encoding the static parts, i.e., scene and object identities, and motion encoding the dynamic parts, i.e., visual changes. This decomposition allows two specific encoders to be developed, one for storing static content signals and the other for simulating dynamic motion signals. However, these methods do not distinguish between foreground objects and background scenes, which usually have distinct motion patterns. Motions of scenes can be caused by camera movements or environment changes, e.g., a breeze, whereas motions of objects, such as jogging, are usually more local and routine. When scenes and objects are treated as a unity, their motion patterns cannot be handled in a distinct manner, resulting in blurry and distorted visual appearances. As depicted in Fig. 1, the moving subject (i.e., the wrestling man) is clearer in the video obtained by separating objects from scenes than in the one obtained by treating them as a single entity.
Based on the above insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction. We distinguish objects from scenes and utilize motion signals to guide their integration. In the first stage, MOSO-VQVAE is developed to learn motion, scene and object decomposition encoding and video decoding in a self-supervised manner. Each decomposed component is equipped with an independent encoder to learn its features and to produce a distinct group of discrete tokens. To deal with different motion patterns, we integrate the object and scene features under the guidance of the corresponding motion feature. Then the video details can be decoded and rebuilt from the merged features. In particular, the decoding process is devised to be time-independent, so that a decomposed component or a single video frame can be decoded for flexible visualization.
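To make the first stage concrete, below is a minimal PyTorch-style sketch of the decomposition encoding and time-independent decoding described above. The class MOSOVQVAESketch, the stand-in linear encoders, the sigmoid-gated fusion and all tensor shapes are illustrative assumptions made for exposition, not the paper's released architecture.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantizer: maps continuous features to codebook tokens."""

    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                    # z: (B, T, N, D)
        flat = z.reshape(-1, z.size(-1))                     # (B*T*N, D)
        tokens = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(tokens).view_as(z)               # quantized features
        tokens = tokens.view(z.shape[:-1])                   # (B, T, N) token ids
        return z + (z_q - z).detach(), tokens                # straight-through trick


class MOSOVQVAESketch(nn.Module):
    """Illustrative first stage: one independent encoder per component,
    motion-guided fusion of scene/object features, and a decoder applied
    to each frame independently (time-independent decoding)."""

    def __init__(self, dim=64, num_patches=64, frame_hw=64):
        super().__init__()

        def make_encoder():  # stand-in encoder: frames -> (B, T, N, D) features
            return nn.Sequential(nn.Flatten(2),
                                 nn.LazyLinear(num_patches * dim),
                                 nn.Unflatten(2, (num_patches, dim)))

        self.enc_motion = make_encoder()   # dynamics
        self.enc_scene = make_encoder()    # background
        self.enc_object = make_encoder()   # foreground
        self.quantize = VectorQuantizer(dim=dim)
        self.decoder = nn.Sequential(nn.Flatten(1),
                                     nn.LazyLinear(3 * frame_hw * frame_hw),
                                     nn.Unflatten(1, (3, frame_hw, frame_hw)))

    def forward(self, video):                                # video: (B, T, 3, H, W)
        z_mo, tok_mo = self.quantize(self.enc_motion(video))
        z_sc, tok_sc = self.quantize(self.enc_scene(video))
        z_ob, tok_ob = self.quantize(self.enc_object(video))
        # Merge object (foreground) and scene (background) features under the
        # guidance of the motion feature; a sigmoid gate stands in for the
        # paper's fusion mechanism.
        gate = torch.sigmoid(z_mo)
        merged = gate * z_ob + (1.0 - gate) * z_sc
        # Time-independent decoding: every frame is decoded on its own, so a
        # single frame or component can be visualized in isolation.
        frames = [self.decoder(merged[:, t]) for t in range(merged.size(1))]
        return torch.stack(frames, dim=1), (tok_mo, tok_sc, tok_ob)
```

A real implementation would additionally optimize reconstruction, codebook and commitment losses; the sketch only traces the data flow described in the text.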
In the second stage, MOSO-Transformer is proposed to generate a subsequent video clip based on a previous video clip. Motivated by the production of animation, which first determines character identities and then portrays a series of actions, MOSO-Transformer first predicts the object and scene tokens of the subsequent video clip from those of the previous video clip. Then the motion tokens of the subsequent video clip are generated based on the predicted scene and object tokens and the motion tokens of the previous video clip. The predicted object, scene and motion tokens can be decoded into the subsequent video clip using MOSO-VQVAE. By modeling video prediction at the token level, MOSO-Transformer is relieved from the burden of modeling millions of pixels and can instead focus on capturing global context relationships. In addition, our framework can be easily extended to other video generation tasks, including unconditional video generation and video frame interpolation, by simply revising the training or generation pipelines of MOSO-Transformer.
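The generation order described above can be summarized in a short, hedged sketch. The function below assumes hypothetical vqvae, id_transformer and motion_transformer objects with the interfaces shown; it mirrors the described order of operations rather than the released API.

```python
import torch


@torch.no_grad()
def predict_next_clip(vqvae, id_transformer, motion_transformer, prev_clip):
    """Sketch of the second stage: identities first, then motion, then decoding.
    All module interfaces here are assumptions made for illustration."""
    # 1. Encode the observed clip into motion, scene and object token groups.
    mo_prev, sc_prev, ob_prev = vqvae.encode(prev_clip)

    # 2. Determine identities first: predict the scene and object tokens of
    #    the subsequent clip from those of the previous clip.
    sc_next, ob_next = id_transformer(sc_prev, ob_prev)

    # 3. Add dynamics: generate motion tokens conditioned on the predicted
    #    identities and the previous motion tokens.
    mo_next = motion_transformer(sc_next, ob_next, mo_prev)

    # 4. Decode the predicted token groups into the subsequent video clip.
    return vqvae.decode(mo_next, sc_next, ob_next)
```

Unconditional generation and frame interpolation would reuse the same token-level loop with a different conditioning pattern, as noted above.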
Our contributions are summarized as follows:

• We propose a novel two-stage framework, MOSO, for video prediction, which decomposes videos into motion, scene and object components and conducts video prediction at the token level.

• MOSO-VQVAE is proposed to learn motion, scene and object decomposition encoding and time-independent video decoding in a self-supervised manner, which allows video manipulation (see the sketch after this list) and flexible video decoding.

• MOSO-Transformer is proposed to first determine the scene and object identities of subsequent video clips and then predict subsequent motions at the token level.

• Qualitative and quantitative experiments on five challenging benchmarks of video prediction and unconditional video generation demonstrate that our proposed method achieves new state-of-the-art performance.
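As a concrete illustration of the video manipulation enabled by the decomposed tokens, the hypothetical snippet below recombines the object and motion tokens of one video with the scene tokens of another before decoding; the vqvae interface is assumed, matching the sketch given earlier.

```python
import torch


@torch.no_grad()
def recombine_scene_and_object(vqvae, video_a, video_b):
    """Hypothetical token-level manipulation: keep the foreground object and
    motion of video_a, but place them in the background scene of video_b."""
    mo_a, _sc_a, ob_a = vqvae.encode(video_a)   # take object + motion from A
    _mo_b, sc_b, _ob_b = vqvae.encode(video_b)  # take scene from B
    return vqvae.decode(mo_a, sc_b, ob_a)       # decode the mixed token groups
```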
2. Related Work

Video Prediction. The video prediction task has received increasing interest in the computer vision field. ConvLSTM [40] combines CNN and LSTM architectures and adopts an adversarial loss. MCnet [47] models pixel-level future video prediction with motion and content decomposition for the first time. GVSD [49] proposes a spatio-temporal CNN combined with adversarial training to untangle foreground objects from background scenes, but severe distortion of object appearances remains in its predicted video frames. MCVD [48] adopts a denoising diffusion model to conduct several video-related tasks conditioned on past and/or future frames. Although previous models can predict consistent subsequent videos, they still suffer from indistinct or distorted visual appearances, since they lack a stable generator or fail to decouple different motion patterns. SLAMP [1] and vid2vid [50] decompose video appearance and motion for video prediction with the help of optical flow. SADM [4] proposes a semantic-aware dynamic model that predicts and fuses the semantic maps (content) and optical flow maps (motion) of future video frames. In addition to optical flow and semantic maps, Wu et al. [57] further utilize instance maps to help separate objects from backgrounds. Although these works also decompose video components, they are more complicated than MOSO since they require much more additional information. Furthermore, these previous works are primarily based on generative adversarial networks or recurrent neural networks, while MOSO follows a recently developed two-stage autoregressive generation framework, which demonstrates greater potential on open-domain visual generation tasks.
Two-stage Visual Generation. The two-stage framework was first proposed for image generation [11, 14, 36] and demonstrates excellent generation ability. Motivated by this success, several attempts have been made to extend the two-stage framework to video generation tasks [14, 55, 58]. For video prediction, MaskViT [20] encodes videos frame by frame through VQ-GAN [14] and models video tokens with a bidirectional Transformer using window attention. For unconditional video generation, VideoGPT [58] encodes videos by employing 3D convolutions and axial attention, and then models video tokens in an auto-regressive manner. However, existing two-stage works for video tasks do not consider video component decomposition and are affected by flicker artifacts and expensive computation costs.
3. MOSO

In this section, we present our proposed framework MOSO in detail. MOSO is a novel two-stage framework for video prediction and consists of MOSO-VQVAE and MOSO-Transformer, where MOSO-VQVAE encodes decomposed video components into tokens and MOSO-Transformer conducts video prediction at the token level.