Denoising Diffusion Probabilistic Models (Ho et al.)

May 5, 2026

diffusion (probabilistic model) - parameterized Markov chain trained using variational inference to produce samples matching the data after finite time

-transitions are learned to reverse a diffusion process

-the forward Markov chain gradually adds noise to the data, and the model learns to denoise it

forward process (diffusion) - add Gaussian noise to shrink signal at each step

q(x_t | x_{t-1}) = \mathcal{N}(x_t;\sqrt{1 - \beta_t} x_{t-1}, \beta_t I)

\beta_t - noise schedule parameter for time step t

\alpha_t = 1 - \beta_t (amount of signal kept at time step t)

-this gives the closed form

x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1-\overline{\alpha}_t}\epsilon \quad \text{for} \quad \overline{\alpha}_t = \prod_{s=1}^t \alpha_s, \quad \epsilon \sim \mathcal{N}(0, I)
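
A minimal sketch of the closed form in plain Python, to make the one-shot noising concrete. The linear schedule with endpoints 1e-4 and 0.02 over T = 1000 steps is an assumption (these notes do not fix a schedule), the function names are hypothetical, and a short list of floats stands in for an image.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (the endpoints are an assumed choice)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 using the closed form (t is 1-indexed)."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, I)
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # toy "image"
x_t, eps = q_sample(x0, 500, alpha_bars, rng)
```

Because x_t depends only on x_0, t, and one noise draw, any time step can be sampled without simulating the t - 1 steps before it.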

-a good schedule:

-makes the signal-to-noise ratio decrease smoothly:

\text{SNR}(t) = \frac{\overline{\alpha}_t}{1 - \overline{\alpha}_t}

-ensures each time step is equally informative to learn

-evaluated using loss per time step

-check gradient norms across t to verify
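
One way to inspect a schedule is to print its log signal-to-noise ratio at a few time steps. The sketch below takes the SNR of x_t to be \overline{\alpha}_t / (1 - \overline{\alpha}_t), the ratio of signal variance to noise variance in the closed form, and again assumes a linear schedule from 1e-4 to 0.02 with T = 1000 (an illustrative choice, not something these notes specify).

```python
import math

T = 1000
# Assumed linear schedule: beta_1 = 1e-4, beta_T = 0.02.
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

# alpha_bar_t = prod_{s<=t} (1 - beta_s)
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

snr = [a / (1.0 - a) for a in alpha_bars]
log_snr = [math.log(s) for s in snr]

# Per-step drop in log SNR; large jumps would indicate an uneven schedule.
drops = [log_snr[i] - log_snr[i + 1] for i in range(T - 1)]

for t in (1, 250, 500, 750, 1000):
    print(f"t={t:4d}  log SNR = {log_snr[t - 1]: .3f}")
```

Comparing `drops` across t shows where the schedule destroys signal fastest; the per-time-step loss and gradient norms mentioned above would need a trained model and are not sketched here.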

reverse process (generative) - produce a cleaner output from a noisy input (invert forward process)

-the true reverse transition is intractable, so it is approximated by a learned Gaussian rather than computed analytically

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

\hat{\epsilon}_\theta(x_t, t) - given the noisy image x_t after t steps, the network's prediction of the noise that was added to the original image to produce it

-plug this into the forward noising equation at step t to get

x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1 - \overline{\alpha}_t} \hat{\epsilon}_\theta(x_t, t)

-then, solving for x_0, the estimate of the original image is

x_0 = \frac{1}{\sqrt{\overline{\alpha}_t}} \left(x_t - \sqrt{1 - \overline{\alpha}_t} \hat{\epsilon}_\theta(x_t, t)\right)
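
The inversion can be checked numerically. In this sketch (hypothetical function name, an assumed value \overline{\alpha}_t = 0.3, and a list of floats standing in for an image), feeding the true noise back in recovers x_0 exactly, while a small error in the predicted noise is scaled by \sqrt{(1 - \overline{\alpha}_t) / \overline{\alpha}_t}.

```python
import math
import random

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Solve the forward closed form for x_0 given a noise estimate."""
    return [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
            for x, e in zip(x_t, eps_hat)]

rng = random.Random(0)
alpha_bar_t = 0.3  # assumed value for illustration
x0 = [0.8, -0.2, 0.1]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
       for x, e in zip(x0, eps)]

# Perfect noise prediction: x_0 is recovered up to floating-point error.
x0_exact = predict_x0(x_t, eps, alpha_bar_t)

# Noise prediction off by +0.1 per coordinate: the x_0 error is
# -0.1 * sqrt((1 - alpha_bar_t) / alpha_bar_t) per coordinate.
eps_off = [e + 0.1 for e in eps]
x0_off = predict_x0(x_t, eps_off, alpha_bar_t)
amplification = math.sqrt((1.0 - alpha_bar_t) / alpha_bar_t)
```

Since \overline{\alpha}_t shrinks as t grows, the amplification factor grows with t, so the same prediction error hurts the x_0 estimate more at late time steps.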

-since the x_0 estimate is noisy for larger t, we do not jump straight to it; instead we compute the step t \rightarrow t-1 using the x_0 estimate

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \overline{\alpha}_t}} \hat{\epsilon}_\theta(x_t, t)\right)

\mu_\theta(x_t, t) is the mean (for a Gaussian, also the most likely) image at the previous time step

-the previous time step x_{t-1} is sampled from a Gaussian centered at \mu_\theta(x_t, t)

x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, z\sim\mathcal{N}(0, I)

\sigma_t^2 = \beta_t (the noise schedule controls the variance)
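
The sampling loop can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `toy_eps_model` is a hypothetical stand-in for the trained network (it always predicts zero noise), the schedule is an assumed linear one with T = 50, and no noise is added at the final step t = 1, following the sampling algorithm in the DDPM paper rather than anything stated in these notes.

```python
import math
import random

def reverse_step(x_t, t, eps_hat, betas, alpha_bars, rng):
    """One sampling step x_t -> x_{t-1} (t is 1-indexed)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar = alpha_bars[t - 1]
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - a_bar)
    # mu_theta(x_t, t) from the predicted noise
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]
    if t == 1:
        return mean  # final step: return the mean without added noise
    sigma_t = math.sqrt(beta_t)  # sigma_t^2 = beta_t
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]

def toy_eps_model(x_t, t):
    """Hypothetical placeholder for the trained noise predictor."""
    return [0.0 for _ in x_t]

T = 50
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]  # start from x_T ~ N(0, I)
for t in range(T, 0, -1):
    x = reverse_step(x, t, toy_eps_model(x, t), betas, alpha_bars, rng)
```

With a real network in place of `toy_eps_model`, the final `x` would be the generated sample; with the zero predictor it only exercises the update rule.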

-the noise-prediction network is trained on the squared error between the true and predicted noise, rather than directly on the negative log likelihood between image distributions

\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \hat{\epsilon}_\theta(x_t, t) \right\|^2 \right]
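
A rough Monte Carlo estimate of this loss, without any actual training. The assumptions here are that t is drawn uniformly from 1..T (as in the paper's training algorithm), that the schedule is linear from 1e-4 to 0.02, and that two hypothetical predictors stand in for a network: one that always outputs zero, and an "oracle" that cheats by knowing x_0.

```python
import math
import random

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One-sample estimate of ||eps - eps_hat(x_t, t)||^2 at a random t."""
    t = rng.randint(1, len(alpha_bars))  # uniform over 1..T, inclusive
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

def zero_model(x_t, t):
    """Hypothetical baseline: always predicts zero noise."""
    return [0.0 for _ in x_t]

def make_oracle(x0, alpha_bars):
    """Hypothetical cheat: inverts the closed form using the true x_0."""
    def oracle(x_t, t):
        a_bar = alpha_bars[t - 1]
        return [(xt - math.sqrt(a_bar) * x) / math.sqrt(1.0 - a_bar)
                for xt, x in zip(x_t, x0)]
    return oracle

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

x0 = [0.8, -0.2, 0.1]
rng = random.Random(0)
n = 200
loss_zero = sum(simple_loss(zero_model, x0, alpha_bars, rng) for _ in range(n)) / n
loss_oracle = sum(simple_loss(make_oracle(x0, alpha_bars), x0, alpha_bars, rng)
                  for _ in range(n)) / n
```

The oracle drives the loss to essentially zero, while the zero predictor's loss sits near the data dimension; a trained network would fall somewhere between the two.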