Video Diffusion Models (Ho et al.)

May 24, 2026

video diffusion - learn to model a fixed number of frames at a fixed resolution

-instead of denoising

x_t \in \mathbb{R}^{H \times W \times C}

, denoise

x_t \in \mathbb{R}^{T \times H \times W \times C}

-add a temporal dimension for the number of frames, along with temporal embeddings and attention

-perform space-only 3D convolutions and attention in the U-Net to model per-frame contents

-then apply temporal attention separately to model relationships across frames