Diffusion (Hugging Face)

December 23, 2025

-diffusion models are generative

-random noise -> denoised output over several iterations

-at each step, estimate how to go from the current step to a completely denoised version

-since each step has small changes, error in early steps can be corrected by later steps

Training Procedure

1.Load in images from the training data

2.Add noise in varying amounts (from heavily noised to nearly clean)

3.Feed noisy versions of the inputs into the model

4.Evaluate how well the model denoises these inputs

5.Update weights
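The five steps above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the course's actual code: the "images" are single scalars, the model is a hypothetical linear denoiser `pred = w * noisy + b`, and `corrupt` uses a simple linear mix of signal and noise rather than the forward process defined in the next section.

```python
import random

random.seed(0)

def corrupt(x0, amount):
    """Mix a clean value with Gaussian noise; amount=0 is clean, amount=1 is pure noise."""
    noise = random.gauss(0.0, 1.0)
    return (1.0 - amount) * x0 + amount * noise

# Hypothetical toy model: a linear denoiser pred = w * noisy + b.
w, b = 0.0, 0.0
lr = 0.01
batch_size = 16
losses = []

for step in range(2000):
    # 1. Load a batch of "images" (here: scalars clustered around 2.0).
    batch = [random.gauss(2.0, 0.1) for _ in range(batch_size)]
    # 2. Add noise in different amounts, from nearly clean to nearly pure noise.
    amounts = [random.random() for _ in batch]
    noisy = [corrupt(x0, a) for x0, a in zip(batch, amounts)]
    # 3. Feed the noisy versions into the model.
    preds = [w * xn + b for xn in noisy]
    # 4. Evaluate the denoising with mean squared error against the clean data.
    errors = [p - x0 for p, x0 in zip(preds, batch)]
    loss = sum(e * e for e in errors) / batch_size
    losses.append(loss)
    # 5. Update weights with one gradient-descent step on the MSE.
    grad_w = sum(2.0 * e * xn for e, xn in zip(errors, noisy)) / batch_size
    grad_b = sum(2.0 * e for e in errors) / batch_size
    w -= lr * grad_w
    b -= lr * grad_b

early_loss = sum(losses[:50]) / 50
late_loss = sum(losses[-50:]) / 50
print(f"mean loss, first 50 steps: {early_loss:.3f}; last 50 steps: {late_loss:.3f}")
```

A real setup would swap the scalar data for image tensors and the linear model for a neural network, but the loop keeps the same shape.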

Forward Diffusion Process (Noising)

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}x_{t-1}, \beta_t I)

noise schedule ( $\beta_t$ ) - controls how noise is introduced into the process

0 < \beta_t < 1

-typically, $\beta_t$ starts small (don't destroy too much info early)

-$\beta_t$ values are usually small overall, in the range 0.0001 to 0.02
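As an illustration, one simple schedule that fits this description rises linearly from 0.0001 to 0.02. The sketch below assumes a linear shape and 1000 steps; neither is stated in these notes, and `linear_beta_schedule` is a name made up for this example.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return num_steps values rising linearly from beta_start to beta_end."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

# Assumed step count; the notes do not fix a value.
betas = linear_beta_schedule(1000)
print(betas[0], betas[-1])
```

Starting at the small end of the range means early steps remove little information, and later steps add noise faster.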

For normal distributions, we have

x \sim \mathcal{N}(\mu, \sigma^2) = \mathcal{N}(x;\mu, \sigma^2)

To sample from this, we take

x = \mu + \sigma \cdot \epsilon, \epsilon \sim \mathcal{N}(0, I)

Therefore, our noise updates become

x_t = \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t} \cdot \epsilon, \epsilon \sim \mathcal{N}(0, I)

-as we can see, each noising step scales down the current vector/image $x_{t-1}$ before adding random noise

-we use $\sqrt{1 - \beta_t}$ and $\sqrt{\beta_t}$ so that the variance of the image doesn't explode: if $x_{t-1}$ has unit variance, then $x_t$ has variance $(1 - \beta_t) + \beta_t = 1$
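The variance claim can be checked numerically. The sketch below applies the update from the equation above 50 times to unit-variance samples and compares it with a hypothetical `naive_step` that adds the same noise without scaling down the signal. The sample count, `beta = 0.02`, and all function names are assumptions chosen for illustration.

```python
import math
import random

def forward_step(x_prev, beta, rng):
    """One noising step: x_t = sqrt(1 - beta) * x_prev + sqrt(beta) * eps."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x_prev]

def naive_step(x_prev, beta, rng):
    """Hypothetical variant for contrast: adds noise without scaling the signal."""
    spread = math.sqrt(beta)
    return [v + spread * rng.gauss(0.0, 1.0) for v in x_prev]

def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # unit-variance "pixels"
scaled, naive = list(start), list(start)
beta = 0.02
for _ in range(50):
    scaled = forward_step(scaled, beta, rng)
    naive = naive_step(naive, beta, rng)

var_scaled = variance(scaled)
var_naive = variance(naive)
print(f"scaled update variance: {var_scaled:.2f}, naive update variance: {var_naive:.2f}")
```

The scaled update should stay near variance 1, while the naive one grows by roughly `beta` per step, which is the "explosion" the square-root coefficients prevent.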

Mathematical derivation for the mean/variance of the noising process at step $t$:

\begin{align*} x_t &= \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t} \cdot \epsilon, \epsilon \sim \mathcal{N}(0, I)\\\\ \Rightarrow x_1 &= \sqrt{1-\beta_1}x_{0} + \sqrt{\beta_1} \cdot \epsilon_1,\\\\ x_2 &= \sqrt{1-\beta_2}x_{1} + \sqrt{\beta_2} \cdot \epsilon_2, \\\\ \Rightarrow x_2 &= \sqrt{1-\beta_2}(\sqrt{1-\beta_1}x_{0} + \sqrt{\beta_1} \cdot \epsilon_1) + \sqrt{\beta_2} \cdot \epsilon_2 \\\\ \Rightarrow x_2 &= (\sqrt{1-\beta_2})(\sqrt{1-\beta_1})x_{0} + (\sqrt{1-\beta_2})\sqrt{\beta_1} \cdot \epsilon_1 + \sqrt{\beta_2} \cdot \epsilon_2 \end{align*}

Notice the pattern: the coefficient of $x_0$ (signal term, from original image) becomes $\prod_{i=1}^{t} \sqrt{1 - \beta_i} = \sqrt{\prod_{i=1}^t (1 - \beta_i)}$ .

Furthermore, we notice that

\begin{align*} (\sqrt{1-\beta_2})\sqrt{\beta_1} \cdot \epsilon_1 + \sqrt{\beta_2} \cdot \epsilon_2 &\sim \mathcal{N}(0, (1-\beta_2)\beta_1 I) + \mathcal{N}(0, \beta_2 I),\\\\ &\sim \mathcal{N}(0, ((1-\beta_2)\beta_1 + \beta_2) I),\\\\ &\sim \mathcal{N}(0, (1 - (1 - \beta_2)(1 - \beta_1)) I),\\\\ &\sim \sqrt{1 - \prod_{i=1}^2 (1 - \beta_i)} \cdot \mathcal{N}(0, I). \end{align*}

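This follows because the sum of independent zero-mean Gaussians is again Gaussian, with variance equal to the sum of the variances; the same argument extends to any number of steps $t$. Now, letting $\alpha_t = \prod_{i=1}^t (1 - \beta_i)$ , we have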

x_t = \sqrt{\alpha_t} x_0 + \sqrt{1 - \alpha_t} \cdot \epsilon, \epsilon \sim \mathcal{N}(0, I)
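The general-$t$ form can be checked by induction. The following is a sketch using only the definitions above, where $\epsilon'$ and $\epsilon_t$ are independent standard normal draws (names introduced here for illustration). Assume $x_{t-1} = \sqrt{\alpha_{t-1}} x_0 + \sqrt{1 - \alpha_{t-1}} \cdot \epsilon'$ . Then

\begin{align*} x_t &= \sqrt{1-\beta_t}\left(\sqrt{\alpha_{t-1}} x_0 + \sqrt{1 - \alpha_{t-1}} \cdot \epsilon'\right) + \sqrt{\beta_t} \cdot \epsilon_t \\\\ &= \sqrt{(1-\beta_t)\alpha_{t-1}} x_0 + \sqrt{(1-\beta_t)(1 - \alpha_{t-1})} \cdot \epsilon' + \sqrt{\beta_t} \cdot \epsilon_t \end{align*}

The signal coefficient is $\sqrt{(1-\beta_t)\alpha_{t-1}} = \sqrt{\alpha_t}$ , and the two noise terms combine into a single Gaussian with variance $(1-\beta_t)(1 - \alpha_{t-1}) + \beta_t = 1 - (1-\beta_t)\alpha_{t-1} = 1 - \alpha_t$ , which matches the closed form.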

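Since $0 < \beta_i < 1$ , every factor $(1 - \beta_i)$ is less than 1, so $\alpha_t$ shrinks as $t$ grows. As long as the $\beta_i$ do not decay to zero too quickly (for example, when they are bounded below by a positive constant, as in the 0.0001 to 0.02 range above), we have $\lim_{t \rightarrow \infty} \sqrt{\alpha_t} = 0$ while $\lim_{t \rightarrow \infty} \sqrt{1 - \alpha_t} = 1$ .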

Thus, as the number of iterations (noising steps) increases, the original signal $x_0$ dies out while the noise takes over the entire image. Furthermore, $x_t$ given $x_0$ is normally distributed at every step, and for large $t$ it approaches $\mathcal{N}(0, I)$ .
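To see the decay concretely, the sketch below computes $\alpha_t$ for an assumed schedule (1000 steps rising linearly from 0.0001 to 0.02, not specified in these notes) and uses the closed form to jump straight to a chosen step. `cumulative_alphas` and `sample_x_t` are hypothetical helper names for this example.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_t = product of (1 - beta_i) for i = 1..t, returned for every t."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas

def sample_x_t(x0, alpha_t, rng):
    """Jump straight to step t: x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    signal = math.sqrt(alpha_t)
    noise = math.sqrt(1.0 - alpha_t)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

# Assumed schedule: 1000 steps rising linearly from 0.0001 to 0.02.
num_steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
alphas = cumulative_alphas(betas)

# Signal and noise coefficients at a few steps (t is 1-indexed, as in the notes).
for t in (1, 250, 500, 1000):
    a = alphas[t - 1]
    print(f"t={t:4d}  sqrt(alpha_t)={math.sqrt(a):.4f}  sqrt(1 - alpha_t)={math.sqrt(1 - a):.4f}")

rng = random.Random(0)
x0 = [1.0] * 8  # a hypothetical constant "image"
x_final = sample_x_t(x0, alphas[-1], rng)  # almost pure noise at the last step
```

Because the closed form needs only $x_0$ and $\alpha_t$, a noisy sample at any step can be drawn in one shot without looping through the earlier steps.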