
prediction/interpolation [13, 44, 48]. However, the study of DPMs for video generation is still at an early stage [16] and faces challenges, since video data are of higher dimensionality and involve complex spatial-temporal correlations.
Previous DPM-based video-generation methods usually adopt a standard diffusion process, in which the frames of a video are perturbed with independent noises, so the temporal correlations are gradually destroyed in the noised latent variables. Consequently, the video-generation DPM has to reconstruct coherent frames from independent noise samples during the denoising process. However, it is quite challenging for the denoising network to model spatial and temporal correlations simultaneously.
Inspired by the observation that consecutive frames share most of their content, we ask: would it be easier to generate video frames from noises that also have some parts in common? To this end, we modify the standard diffusion process and propose a decomposed diffusion probabilistic model, termed VideoFusion, for video generation. During the diffusion process, we resolve the per-frame noise into two parts, namely a base noise and a residual noise, where the base noise is shared by consecutive frames. In this way, the noised latent variables of different frames always share a common part, which makes it easier for the denoising network to reconstruct a coherent video. For an intuitive illustration, we use the decoder of DALL-E 2 [25] to generate images conditioned on the same latent embedding. As shown in Fig. 2a, if the images are generated from independent noises, their content varies considerably even though they share the same condition. But if the noised latent variables share the same base noise, even an image generator can synthesize roughly correlated sequences (Fig. 2b). Therefore, the burden on the denoising network of a video-generation DPM can be largely alleviated.
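A minimal sketch of this decomposition is given below; the mixing weight `lam` and the tensor shapes are illustrative assumptions, not the paper's exact parameterization:

```python
import torch

def decomposed_noise(num_frames, frame_shape, lam=0.5):
    # Base noise: a single sample shared by all frames of the video.
    base = torch.randn(1, *frame_shape).expand(num_frames, *frame_shape)
    # Residual noise: an independent sample per frame.
    residual = torch.randn(num_frames, *frame_shape)
    # Mix so each frame's noise still has unit variance: lam + (1 - lam) = 1.
    return lam ** 0.5 * base + (1.0 - lam) ** 0.5 * residual
```

Because the `base` term is identical across frames, the noised latents of different frames stay correlated at every diffusion step.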
Furthermore, this decomposed formulation brings additional benefits. First, since the base noise is shared by all frames, we can predict it by feeding a single frame to a large pretrained image-generation DPM with only one forward pass. In this way, the image priors of the pretrained model can be efficiently shared by all frames, which facilitates the learning of video data. Second, because the base noise is shared by all video frames, it is likely to be related to the video content. This property makes it possible to better control the content or the motions of generated videos. Experiments in Sec. 4.7 show that, with adequate training, VideoFusion tends to relate the base noise to the video content and the residual noise to the motions (Fig. 1). Extensive experiments show that VideoFusion achieves state-of-the-art results on different datasets and also supports text-conditioned video creation well.
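As a hedged illustration of this sharing, one forward pass of a pretrained image DPM on a single frame could estimate the base noise for the whole clip; here `pretrained_image_dpm` and the choice of key frame are placeholders, not the paper's exact design:

```python
def predict_base_noise(pretrained_image_dpm, z_t, t):
    # z_t: noised latents of shape (num_frames, C, H, W).
    key_frame = z_t[z_t.shape[0] // 2]               # one representative frame
    base = pretrained_image_dpm(key_frame[None], t)  # single forward pass
    return base.expand_as(z_t)                       # reused by every frame
```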
Figure 2. Comparison between images generated from (a) independent noises and (b) noises with a shared base noise. Images in the same row are generated by the decoder of DALL-E 2 [25] with the same condition.
2. Related Works
2.1. Diffusion Probabilistic Models
DPMs were first introduced in [35]; a DPM consists of a diffusion (encoding) process and a denoising (decoding) process. In the diffusion process, random noise is gradually added to the data x via a T-step Markov chain [18]. The noised latent variable at step t can be expressed as:
$z_t = \sqrt{\hat{\alpha}_t}\, x + \sqrt{1 - \hat{\alpha}_t}\, \epsilon_t$, (1)

with $\hat{\alpha}_t = \prod_{k=1}^{t} \alpha_k$, $\epsilon_t \sim \mathcal{N}(0, I)$, (2)

where $\alpha_t \in (0, 1)$ is the corresponding diffusion coefficient.
For a sufficiently large T, e.g. T = 1000, we have $\sqrt{\hat{\alpha}_T} \approx 0$ and $\sqrt{1 - \hat{\alpha}_T} \approx 1$, so $z_T$ approximates random Gaussian noise. The generation of x can then be modeled as iterative denoising.
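For concreteness, a minimal sketch of this forward process follows; the linear beta schedule is an assumed example, as DPMs use various schedules in practice:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alphas = 1.0 - betas                      # alpha_t in (0, 1)
alpha_hat = torch.cumprod(alphas, dim=0)  # \hat{alpha}_t = prod_{k<=t} alpha_k

def diffuse(x, t):
    """Sample z_t directly from x in one step via Eq. (1)."""
    eps = torch.randn_like(x)             # eps_t ~ N(0, I)
    z_t = alpha_hat[t].sqrt() * x + (1.0 - alpha_hat[t]).sqrt() * eps
    return z_t, eps
```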
In [14], Ho et al. connect DPMs with denoising score matching [37] and propose an $\epsilon$-prediction form for the denoising process:
$L_t = \| \epsilon_t - z_\theta(z_t, t) \|^2$, (3)
where $z_\theta$ is a denoising neural network parameterized by $\theta$, and $L_t$ is the loss function. Based on this formulation, DPMs have been applied to various generative tasks, such as image generation [15, 25], super-resolution [19, 28], image translation [31], etc., and have become an important class of deep generative models. Compared with generative adversarial networks (GANs) [10], DPMs are easier to train and can generate more diverse samples [5, 26].
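A minimal training step for the $\epsilon$-prediction objective in Eq. (3), reusing the `diffuse` sketch above; `denoiser` stands in for $z_\theta$, and its signature is an assumption:

```python
def training_loss(denoiser, x):
    """One Monte-Carlo estimate of L_t from Eq. (3)."""
    t = torch.randint(0, T, (1,))          # uniformly sampled timestep
    z_t, eps = diffuse(x, t)               # forward process, Eq. (1)
    eps_pred = denoiser(z_t, t)            # z_theta(z_t, t)
    return ((eps - eps_pred) ** 2).mean()  # squared error on the noise
```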
2.2. Video Generation
Video generation is one of the most challenging tasks in
the generative research field. It not only needs to generate
high-quality frames but also the generated frames need to be
temporally correlated. Previous video-generation methods
are mainly GAN-based. In VGAN [45] and TGAN [29],