
multi-scale motion cues in different receptive fields by em-
ploying the encoder-decoder architecture [35], but in prac-
tice it is not flexible enough to deal with complex motions.
In this paper, we propose a Dynamic Multi-scale Voxel
Flow Network (DMVFN) to explicitly model the com-
plex motion cues of diverse scales between adjacent video
frames by dynamic optical flow estimation. Our DMVFN
consists of several Multi-scale Voxel Flow Blocks
(MVFBs), which are stacked in a sequential manner. On
top of MVFBs, a light-weight Routing Module is pro-
posed to adaptively generate a routing vector according
to the input frames, and to dynamically select a sub-
network for efficient future frame prediction. We con-
duct experiments on four benchmark datasets, including
Cityscapes [9], KITTI [12], DAVIS17 [43], and Vimeo-
Test [69], to demonstrate the comprehensive advantages
of our DMVFN over representative video prediction meth-
ods in terms of visual quality, parameter amount, and
computational efficiency measured by floating point oper-
ations (FLOPs). A glimpse of comparison results by differ-
ent methods is provided in Figure 1. One can see that our
DMVFN achieves much better performance in terms of ac-
curacy and efficiency on the Cityscapes [9] dataset. Exten-
sive ablation studies validate the effectiveness of the com-
ponents in our DMVFN for video prediction.
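As a rough illustration of this sample-wise routing idea, the PyTorch-style sketch below gates a stack of placeholder blocks with a learned routing vector. The module names (TinyRouter, DynamicStack), the sigmoid gating, and the soft/hard selection strategy are our own simplifications for exposition, not the exact DMVFN implementation.

```python
import torch
import torch.nn as nn


class TinyRouter(nn.Module):
    """Illustrative routing module: maps the stacked input frames to a
    soft routing vector with one entry per block (hypothetical design)."""

    def __init__(self, num_blocks: int, in_ch: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_blocks),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps each routing weight in (0, 1); at inference a
        # threshold could turn it into a hard skip decision.
        return torch.sigmoid(self.net(frames))


class DynamicStack(nn.Module):
    """Sequentially stacked blocks, gated per sample by the router."""

    def __init__(self, num_blocks: int, ch: int = 6):
        super().__init__()
        # Stand-ins for the multi-scale voxel flow blocks (MVFBs).
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
            for _ in range(num_blocks)
        ])
        self.router = TinyRouter(num_blocks, in_ch=ch)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        route = self.router(frames)              # (B, num_blocks)
        x = frames
        for i, block in enumerate(self.blocks):
            gate = route[:, i].view(-1, 1, 1, 1)
            # Soft gating during training; skipping blocks with small
            # gates at inference would give the claimed speed-up.
            x = gate * block(x) + (1.0 - gate) * x
        return x


# Usage: two stacked RGB frames (6 channels) as input.
frames = torch.randn(2, 6, 64, 64)
pred_features = DynamicStack(num_blocks=4)(frames)
```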
In summary, our contributions are mainly three-fold:
• We design a light-weight DMVFN to accurately pre-
dict future frames with only RGB frames as inputs.
Our DMVFN consists of new MVFBs that
can model different motion scales in real-world videos.
• We propose an effective Routing Module to dynam-
ically select a suitable sub-network according to the
input frames. The proposed Routing Module is end-
to-end trained along with our main network DMVFN.
• Experiments on four benchmarks show that our
DMVFN achieves state-of-the-art results while being
an order of magnitude faster than previous methods.
2. Related Work
2.1. Video Prediction
Early video prediction methods [35, 37, 58] only utilize
RGB frames as inputs. For example, PredNet [37] learns an
unsupervised neural network, with each layer making lo-
cal predictions and forwarding deviations from those pre-
dictions to subsequent network layers. MCNet [58] decom-
poses the input frames into motion and content components,
which are processed by two separate encoders. DVF [35]
is a fully convolutional encoder-decoder network that synthe-
sizes intermediate and future frames by approximating voxel
flow for motion estimation. Later, video prediction methods
exploit extra information in pursuit of better performance.
For example, the methods of Vid2vid [59],
Seg2vid [41], HVP [32], and SADM [2] require additional
semantic maps or human pose information for better video
prediction results. Additionally, Qi et al. [44] used extra
depth maps and semantic maps to explicitly infer scene
dynamics in 3D space. FVS [62] separates the inputs into
foreground objects and background areas by semantic and
instance maps, and uses a spatial transformer to predict the
motion of foreground objects. In this paper, we develop a
light-weight and efficient video prediction network that re-
quires only RGB frames as inputs.
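The flow-based methods above (e.g., DVF) synthesize a frame by sampling pixels from an input frame according to a predicted flow field. Below is a minimal sketch of such backward warping using PyTorch's grid_sample; the flow channel convention and the helper's name are illustrative assumptions, not the exact operation of any particular method.

```python
import torch
import torch.nn.functional as F


def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `frame` (B, C, H, W) by a per-pixel `flow` (B, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement
    in pixels (this channel convention is an assumption)."""
    _, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype, device=frame.device),
        torch.arange(w, dtype=frame.dtype, device=frame.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]         # (B, H, W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1], as grid_sample expects.
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


# Usage: predict frame t+1 by warping frame t with an estimated flow.
frame_t = torch.rand(1, 3, 128, 256)
flow_t_to_t1 = torch.zeros(1, 2, 128, 256)        # zero flow copies the frame
pred_t1 = backward_warp(frame_t, flow_t_to_t1)
```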
2.2. Optical Flow
Optical flow estimation aims to predict the per-pixel mo-
tion between adjacent frames. Deep learning-based optical
flow methods [17, 29, 38, 53, 54] have advanced considerably
since FlowNet [11], a pioneering work that learns an optical
flow network from synthetic data. FlowNet 2.0 [25]
improves the accuracy of optical flow estimation by stack-
ing sub-networks for iterative refinement. A coarse-to-fine
spatial pyramid network is employed in SPynet [46] to es-
timate optical flow at multiple scales. PWC-Net [53] em-
ploys a feature warping operation at different resolutions and
uses a cost volume layer to refine the estimated flow at each
resolution. RAFT [54] is a lightweight recurrent network
sharing weights during the iterative learning process. Flow-
Former [21] utilizes an encoder to output latent tokens and
a recurrent decoder to decode features, while refining the
estimated flow iteratively. Optical flow is also widely ex-
ploited in downstream video synthesis tasks [22, 35, 68, 69, 72].
Based on these approaches, we aim to design
a flow estimation network that can adaptively operate based
on each sample for the video prediction task.
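To make the iterative-refinement idea concrete, the sketch below applies one weight-shared update operator several times to refine an initial zero flow from two feature maps. It is only a schematic of recurrent refinement; real methods such as RAFT additionally build correlation volumes and use GRU-based updates, which are omitted here, and all names and channel sizes are illustrative.

```python
import torch
import torch.nn as nn


class SharedUpdate(nn.Module):
    """A single update operator whose weights are reused at every
    refinement iteration (illustrative, not a specific published model)."""

    def __init__(self, feat_ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch * 2 + 2, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),
        )

    def forward(self, feat1, feat2, flow):
        # Predict a residual correction and add it to the current flow.
        delta = self.net(torch.cat([feat1, feat2, flow], dim=1))
        return flow + delta


def iterative_flow(feat1, feat2, update_op, iters: int = 6):
    """Start from zero flow and apply the same update operator `iters` times."""
    b, _, h, w = feat1.shape
    flow = torch.zeros(b, 2, h, w, device=feat1.device, dtype=feat1.dtype)
    for _ in range(iters):
        flow = update_op(feat1, feat2, flow)
    return flow


# Usage with dummy feature maps; real methods would also warp or correlate
# the features with the current flow before each update.
f1 = torch.randn(1, 32, 48, 64)
f2 = torch.randn(1, 32, 48, 64)
flow = iterative_flow(f1, f2, SharedUpdate(feat_ch=32), iters=6)
```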
2.3. Dynamic Network
The design of dynamic networks is mainly divided into
three categories: spatial-wise, temporal-wise, and sample-
wise [16]. Spatial-wise dynamic networks perform adap-
tive operations in different spatial regions to reduce com-
putational redundancy with comparable performance [20,
47, 57]. In addition to the spatial dimension, dynamic
processing can also be applied in the temporal dimension.
Temporal-wise dynamic networks [52, 64, 70] improve the
inference efficiency by performing less or no computation
on unimportant sequence frames. To handle the input in a
data-driven manner, sample-wise dynamic networks adap-
tively adjust network structures to skip redundant compu-
tation [56, 60], or adaptively change the network parameters
to improve the performance [10, 18, 51, 76]. Designing and
training a dynamic network is not trivial, since it is difficult
to directly train a model with complex topological connec-
tions. We need to design a well-structured and robust model
before considering its dynamic mechanism. In this paper,