Denoising Diffusion Probabilistic Models (Ho et al.)

May 5, 2026

diffusion (probabilistic model) - parameterized Markov chain trained using variational inference to produce samples matching the data after finite time

-transitions are learned to reverse a diffusion process
-the diffusion process is a Markov chain that gradually adds noise to the data, and the model learns to denoise it

forward process (diffusion) - add Gaussian noise to shrink signal at each step

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$$
-$\beta_t$ - noise schedule parameter for time step $t$, $\alpha_t = 1 - \beta_t$ - amount of signal
-this gives the closed form
$$x_t = \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1 - \overline{\alpha}_t}\, \epsilon \quad \text{for} \quad \overline{\alpha}_t = \prod_{s=1}^t \alpha_s, \quad \epsilon \sim \mathcal{N}(0, I)$$
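A minimal sketch of the forward process, assuming a linear schedule from 1e-4 to 0.02 and data stored as a plain list of floats (both are illustrative choices, and every helper name here is hypothetical). It shows the one-step transition next to the closed form, so the two can be compared: iterating `forward_step` for $t$ steps should give samples with the same distribution as one call to `forward_closed_form`.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (the endpoints are assumed values)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def forward_step(x_prev, beta, rng):
    """One draw from q(x_t | x_{t-1}): shrink the signal, add Gaussian noise."""
    return [
        math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
        for x in x_prev
    ]


def forward_closed_form(x0, alpha_bar_t, rng):
    """Draw x_t directly from x_0; also return the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
        for x, e in zip(x0, eps)
    ]
    return x_t, eps


rng = random.Random(0)
betas = linear_betas(50)
x_direct, eps_used = forward_closed_form([2.0, -1.0], alpha_bars(betas)[-1], rng)
```

The closed form is what makes training cheap: any $x_t$ can be drawn in one shot without simulating the $t - 1$ steps before it.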
-a good schedule:
-make signal-to-noise ratio decrease smoothly: $\frac{\overline{\alpha}_t}{1 - \overline{\alpha}_t}$
-ensures each time step is equally informative to learn
-evaluated using loss per time step
-gradient norms across $t$ to verify
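To see what "decrease smoothly" looks like for a concrete schedule, here is a rough sketch that tabulates the signal-to-noise ratio per step. It assumes the ratio is measured on the closed-form $x_t$, i.e. cumulative signal $\overline{\alpha}_t$ over cumulative noise $1 - \overline{\alpha}_t$, and it assumes a linear schedule from 1e-4 to 0.02; the function names are made up for illustration.

```python
import math


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (the endpoints are assumed values)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def snr_per_step(betas):
    """SNR_t = alpha_bar_t / (1 - alpha_bar_t) for t = 1..T."""
    snrs, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


def log_snr_steps(snrs):
    """log SNR_t - log SNR_{t-1}; similar magnitudes suggest a smooth decay."""
    return [math.log(b) - math.log(a) for a, b in zip(snrs, snrs[1:])]


snrs = snr_per_step(linear_betas(1000))
steps = log_snr_steps(snrs)
# Very uneven step sizes would hint that some time steps carry little signal.
largest_drop, smallest_drop = min(steps), max(steps)
```

Comparing `largest_drop` and `smallest_drop` is only a quick diagnostic; the per-step loss and gradient norms mentioned above are still the real check.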

reverse process (generative) - produce a cleaner output from a noisy input (invert forward process)

-this is approximated, not analytical
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$
-$\hat{\epsilon}_\theta(x_t, t)$ - given the noisy image $x_t$ after $t$ steps, what noise was added to the original image to produce it
-plug this into the forward noising equation at step tt to get
$$x_t = \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1 - \overline{\alpha}_t}\, \hat{\epsilon}_\theta(x_t, t)$$
-then, the original image is
$$x_0 = \frac{1}{\sqrt{\overline{\alpha}_t}} \left(x_t - \sqrt{1 - \overline{\alpha}_t}\, \hat{\epsilon}_\theta(x_t, t)\right)$$
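A small sketch of this inversion, assuming images are flat lists of floats and that a noise estimate is already available (`predict_x0` is a hypothetical helper name). When the estimate equals the noise that was really added, the original is recovered exactly; with a learned estimate it is only approximate.

```python
import math


def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Invert the closed-form forward equation for x_0 given a noise estimate."""
    scale = 1.0 / math.sqrt(alpha_bar_t)
    noise_weight = math.sqrt(1.0 - alpha_bar_t)
    return [scale * (x - noise_weight * e) for x, e in zip(x_t, eps_hat)]


# Sanity check with the true noise standing in for the network's prediction.
alpha_bar_t = 0.6
x0 = [1.0, -2.0]
eps = [0.3, -0.7]
x_t = [
    math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
    for x, e in zip(x0, eps)
]
x0_recovered = predict_x0(x_t, eps, alpha_bar_t)
```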
-since the $x_0$ estimate will be noisy for larger $t$, we do not jump straight to it; instead we use it to compute a single step $t \rightarrow t-1$
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \overline{\alpha}_t}}\, \hat{\epsilon}_\theta(x_t, t)\right)$$
-$\mu_\theta(x_t, t)$ is the most likely (average) image at the previous time step
-the previous time step $x_{t-1}$ is sampled from a Gaussian centered at $\mu_\theta(x_t, t)$
$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$$
-$\sigma_t^2 = \beta_t$ - noise schedule controls variance
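One reverse step can be sketched as below, assuming the noise estimate is passed in as a list (a stand-in for the network output), time steps are 1-indexed, and $\sigma_t^2 = \beta_t$. The function names are hypothetical. Skipping the added noise at $t = 1$ follows the paper's sampling algorithm, which sets $z = 0$ on the last step.

```python
import math
import random


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def posterior_mean(x_t, eps_hat, t, betas, abars):
    """mu_theta(x_t, t) for a 1-indexed step t."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - abars[t - 1])  # (1 - alpha_t) / sqrt(1 - alpha_bar_t)
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]


def reverse_step(x_t, eps_hat, t, betas, abars, rng):
    """Sample x_{t-1} = mu + sigma_t * z, with no noise added when t == 1."""
    mean = posterior_mean(x_t, eps_hat, t, betas, abars)
    if t == 1:
        return mean
    sigma_t = math.sqrt(betas[t - 1])  # sigma_t^2 = beta_t
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]


betas = [0.1, 0.2]
abars = alpha_bars(betas)
rng = random.Random(0)
x_prev = reverse_step([1.0, 0.5], [0.5, -0.2], 2, betas, abars, rng)
```

A full sampler would start from pure Gaussian noise at $t = T$ and call `reverse_step` down to $t = 1$, querying the network for a fresh noise estimate at each step.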
-the noise prediction neural network is trained on the squared error of its predicted noise instead of the negative log likelihood between image distributions
$$\mathcal{L}(\theta) = \|\epsilon - \hat{\epsilon}_\theta(x_t, t)\|^2$$
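A rough sketch of one training-loss evaluation, assuming the network is any callable `model(x_t, t)` that returns a list the same length as `x_t` (a hypothetical interface), that $t$ is drawn uniformly from $1..T$, and that images are flat lists of floats. The `zero_predictor` below is only a placeholder so the snippet runs without a real network.

```python
import math
import random


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noise_prediction_loss(model, x0, abars, rng):
    """Squared error between the true noise and the model's noise estimate."""
    t = rng.randint(1, len(abars))  # uniform over 1..T, both ends inclusive
    a = abars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    # Build the noisy input with the closed-form forward equation.
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    eps_hat = model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


def zero_predictor(x_t, t):
    """Placeholder model that always predicts zero noise."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
abars = alpha_bars([0.1, 0.2, 0.3])
loss = noise_prediction_loss(zero_predictor, [1.0, -1.0], abars, rng)
```

In practice this value would be averaged over a batch of images and minimized with a gradient-based optimizer; that part is omitted here.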