Video Diffusion Models (Ho et al.)
May 24, 2026
video diffusion - learn to model a fixed number of frames at a fixed resolution
-instead of denoising a single image, denoise a block of frames jointly
-add a temporal dimension for the number of frames, along with temporal embeddings and attention
-perform space-only 3D convolutions (e.g. 1x3x3 kernels in place of 3x3) and spatial attention in the U-Net to model per-frame contents
-then apply temporal attention separately to model relationships across frames
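
A rough sketch of the factorized space-then-time attention described above, to make the axis handling concrete. This is not the paper's implementation: the function names are made up for these notes, attention is single-head with no learned projections, and the temporal embeddings, residual connections, and normalization are left out. The video is assumed to be laid out as (frames, height, width, channels).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # tokens: (batch, N, C). Plain dot-product self-attention over the N axis.
    # Simplification: queries, keys, and values are the tokens themselves.
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens

def spatial_attention(video):
    # video: (T, H, W, C). Frames act as the batch axis, so each frame
    # attends only over its own H*W positions.
    T, H, W, C = video.shape
    out = self_attention(video.reshape(T, H * W, C))
    return out.reshape(T, H, W, C)

def temporal_attention(video):
    # Spatial positions act as the batch axis, so each pixel location
    # attends only over its own T frames.
    T, H, W, C = video.shape
    tokens = video.transpose(1, 2, 0, 3).reshape(H * W, T, C)
    out = self_attention(tokens)
    return out.reshape(H, W, T, C).transpose(2, 0, 1, 3)

def factorized_block(video):
    # Space first (per-frame contents), then time (relationships across frames).
    return temporal_attention(spatial_attention(video))

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 3, 3, 8))   # 4 frames of 3x3 with 8 channels
print(factorized_block(video).shape)
```

The point of the split: perturbing one frame leaves the spatial-attention output of every other frame unchanged, and information only moves between frames in the temporal step.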