TL;DR: We present ExtDM, a new diffusion model that extrapolates video content from current frames by accurately modeling distribution shifts towards future frames.
Video prediction is a challenging task due to its inherent uncertainty, especially when forecasting over a long horizon. To model temporal dynamics, advanced methods benefit from the recent success of diffusion models and repeatedly refine the predicted future frames with a 3D spatiotemporal U-Net. However, a gap exists between the present and the future, and the repeated use of the U-Net brings a heavy computational burden. To address this, we propose a diffusion-based video prediction method, named ExtDM, that predicts future frames by extrapolating the distribution of present features. Specifically, our method consists of three components: (i) a motion autoencoder performs a bijective transformation between video frames and motion cues; (ii) a layered distribution adaptor module extrapolates the present features under the guidance of a Gaussian distribution; (iii) a 3D U-Net architecture specialized for jointly fusing guidance and features along the temporal dimension via spatiotemporal-window attention. Extensive experiments on five popular benchmarks covering short- and long-term video prediction verify the effectiveness of ExtDM.
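To make the spatiotemporal-window attention in component (iii) concrete, below is a minimal sketch of windowed self-attention over a video feature map, assuming a Swin-style non-overlapping window partition extended with a temporal axis. The module name, window size, and tensor layout are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: self-attention restricted to (Wt, Wh, Ww) spatiotemporal windows.
# Window size and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class SpatioTemporalWindowAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window_size=(2, 4, 4)):
        super().__init__()
        self.window_size = window_size  # (Wt, Wh, Ww)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C); T, H, W are assumed divisible by the window size.
        B, T, H, W, C = x.shape
        Wt, Wh, Ww = self.window_size
        # Partition the feature map into non-overlapping spatiotemporal windows.
        x = x.view(B, T // Wt, Wt, H // Wh, Wh, W // Ww, Ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, Wt * Wh * Ww, C)
        # Joint attention over all tokens inside each window (temporal + spatial).
        out, _ = self.attn(x, x, x)
        # Reverse the partitioning back to (B, T, H, W, C).
        out = out.view(B, T // Wt, H // Wh, W // Ww, Wt, Wh, Ww, C)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return out
```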
ExtDM relies on two mapping functions. The first is a video compression and reconstruction process, which converts video frames into a more compact representation that emphasizes the basic motion cues of scene dynamics. This yields a more manageable representation of the video data, which is critical for the subsequent extrapolation. The second is a diffusion model over the motion distribution: it predicts future frames by extrapolating the current feature distribution.
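The inference flow implied by these two mapping functions can be summarized as follows. This is a hedged sketch with hypothetical module names (MotionAutoencoder-style encode/decode and a MotionDiffusion sampler); the released code's actual interfaces differ.

```python
# Hedged sketch of the compress -> extrapolate -> reconstruct flow.
# `motion_autoencoder` and `motion_diffusion` are assumed interfaces, not the official API.
import torch


@torch.no_grad()
def predict_future_frames(past_frames, motion_autoencoder, motion_diffusion, num_future):
    """past_frames: (B, T_past, C, H, W) observed clip."""
    # 1) Compress the observed frames into compact motion cues.
    past_motion = motion_autoencoder.encode(past_frames)
    # 2) Extrapolate the motion distribution: denoise from Gaussian noise
    #    conditioned on the observed motion cues.
    future_motion = motion_diffusion.sample(cond=past_motion, length=num_future)
    # 3) Reconstruct pixel-space frames from the extrapolated motion cues,
    #    using the last observed frame as the appearance reference.
    future_frames = motion_autoencoder.decode(future_motion, ref_frame=past_frames[:, -1])
    return future_frames
```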
Qualitative comparison among SOTA methods. The trajectory of each target is indicated by the green curve.
The ExtDM-based video generation framework can benefit various downstream directions. Here, we envision two potential uses: (a) stochastic event prediction on SMMNIST and (b) tailored prediction on BAIR.
A comparison of the quality and speed of SOTA diffusion models for short-term and long-term video prediction on SMMNIST and KTH, respectively. We report FVD as well as FPS. Note that the FPS axis is on a log scale.
Frame-wise comparison on long-term video datasets. We calculate the performance degradation between the first frame and the last one. ExtDM shows low degradation for long-term video prediction, yielding 29.60% less degradation than MCVD (-7.87 vs. -10.20).
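The 29.60% figure follows directly from the two reported degradation values; the snippet below is just a sanity check of that arithmetic.

```python
# Relative reduction in frame-wise degradation, from the numbers reported above.
ours, mcvd = -7.87, -10.20
relative_gain = (abs(mcvd) - abs(ours)) / abs(ours)  # (10.20 - 7.87) / 7.87 ≈ 0.296
print(f"{relative_gain:.2%}")  # ~29.6%
```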
@inproceedings{zhang2024ExtDM,
title={ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction},
author={Zhang, Zhicheng and Hu, Junyao and Cheng, Wentao and Paudel, Danda and Yang, Jufeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}