Video Diffusion Models (Ho et al.)

May 24, 2026

video diffusion - learn to model a fixed number of frames at a fixed resolution

-instead of denoising $x_t \in \mathbb{R}^{H \times W \times C}$, denoise $x_t \in \mathbb{R}^{T \times H \times W \times C}$
-add a temporal dimension $T$ for the number of frames, along with temporal position embeddings (so the network can tell frame order) and temporal attention
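-replace each 2D convolution in the U-Net with a space-only 3D convolution (e.g. 3×3 becomes 1×3×3) and keep attention over space only, treating the frame axis as a batch axis, to model per-frame contents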
-then apply temporal attention separately to model relationships across frames
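
The factorized space/time attention described above can be sketched with plain NumPy reshapes. This is only an illustration of which axis acts as "tokens" and which acts as "batch", not the paper's implementation: the helper names (`self_attention`, `spatial_attention`, `temporal_attention`) are made up for these notes, and the sketch assumes single-head attention with no learned query/key/value projections and no position embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Simplified single-head self-attention (assumption: no learned projections).
    # tokens: (batch, n_tokens, channels)
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def spatial_attention(x):
    # x: (T, H, W, C). Attend over the H*W positions inside each frame;
    # the frame axis T is treated as a batch axis.
    T, H, W, C = x.shape
    out = self_attention(x.reshape(T, H * W, C))
    return out.reshape(T, H, W, C)

def temporal_attention(x):
    # x: (T, H, W, C). Attend over the T frames at each spatial location;
    # the H*W spatial positions are treated as a batch axis.
    T, H, W, C = x.shape
    tokens = x.reshape(T, H * W, C).transpose(1, 0, 2)  # (H*W, T, C)
    out = self_attention(tokens)
    return out.transpose(1, 0, 2).reshape(T, H, W, C)

rng = np.random.default_rng(0)
video = rng.standard_normal((4, 8, 8, 3))  # T=4 frames of 8x8 with 3 channels
y = temporal_attention(spatial_attention(video))
print(y.shape)
```

Spatial attention alone never mixes information between frames, so frames only interact through the temporal attention step. Keeping the two separate also means each attention call sees either H*W tokens or T tokens, rather than T*H*W tokens at once.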