Diffusion (Hugging Face)

December 23, 2025

-diffusion models are generative
-random noise -> denoised output over several iterations
-at each step, estimate how to go from the current step to a completely denoised version
-since each step has small changes, error in early steps can be corrected by later steps
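
The bullets above can be sketched in a few lines of Python. This is only a toy: `toy_denoiser`, `TARGET`, and the `mix` rule are hypothetical stand-ins (not from the course), chosen so the denoiser is deliberately imperfect. The point is that taking several partial steps toward the model's estimate ends closer to the clean result than one big step.

```python
import random

TARGET = [1.0, -1.0, 0.5]  # hypothetical "clean image" the toy denoiser aims for


def toy_denoiser(x):
    """Hypothetical stand-in for a trained model: it estimates the clean image,
    but 30% of the input's deviation from TARGET leaks into the estimate."""
    return [t + 0.3 * (v - t) for v, t in zip(x, TARGET)]


def sample(x, num_steps):
    """Start from noise x and move part of the way toward the estimate each step."""
    for i in range(num_steps):
        pred = toy_denoiser(x)
        mix = 1.0 / (num_steps - i)  # small moves early, a full move on the last step
        x = [(1.0 - mix) * v + mix * p for v, p in zip(x, pred)]
    return x


def error(x):
    """Largest per-component distance from the clean target."""
    return max(abs(v - t) for v, t in zip(x, TARGET))


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]  # pure random starting point

one_shot_error = error(sample(noise, 1))   # a single denoising step
iterative_error = error(sample(noise, 5))  # five smaller steps
print(f"one step: {one_shot_error:.4f}, five steps: {iterative_error:.4f}")
```

With this toy denoiser, every later step re-estimates from a cleaner input, so the leftover error shrinks as the step count grows.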

Training Procedure

1.Load in images from the training data
2.Add noise in different amounts (both noisy and close-to-perfect)
3.Feed noisy versions of the inputs into the model
4.Evaluate how well the model denoises these inputs
5.Update weights
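
A toy version of these five steps, as a rough sketch only. Assumptions that are not from the notes: each "image" is a single number, the "model" is a two-parameter linear map `w * noisy + b`, the loss is mean squared error against the clean value, and the update is plain SGD. A real model would be a neural network that also receives the noise level.

```python
import math
import random

rng = random.Random(0)


def load_batch(batch_size):
    # Step 1: stand-in for loading training images (each "image" is one pixel).
    return [rng.gauss(2.0, 0.5) for _ in range(batch_size)]


def add_noise(clean):
    # Step 2: a different noise amount per example, from nearly clean to nearly pure noise.
    noisy = []
    for x in clean:
        amount = rng.random()
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(1.0 - amount) * x + math.sqrt(amount) * eps)
    return noisy


def mean_squared_error(w, b, noisy, clean):
    return sum((w * n + b - c) ** 2 for n, c in zip(noisy, clean)) / len(clean)


w, b, lr = 0.0, 0.0, 0.05  # toy model parameters and learning rate
eval_clean = load_batch(2000)
eval_noisy = add_noise(eval_clean)
initial_loss = mean_squared_error(w, b, eval_noisy, eval_clean)

for _ in range(2000):
    clean = load_batch(32)
    noisy = add_noise(clean)
    # Step 3: feed the noisy inputs through the model.
    errors = [w * n + b - c for n, c in zip(noisy, clean)]
    # Step 4: the loss is the mean squared error; these are its gradients.
    grad_w = 2.0 * sum(e * n for e, n in zip(errors, noisy)) / len(errors)
    grad_b = 2.0 * sum(errors) / len(errors)
    # Step 5: update the weights.
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = mean_squared_error(w, b, eval_noisy, eval_clean)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```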

Forward Diffusion Process (Noising)

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$$

noise schedule ($\beta_t$) - controls how noise is introduced into the process

-$0 < \beta_t < 1$
-typically, $\beta_t$ starts small (don't destroy too much info early)
-$\beta_t$ values are usually small overall, in the range $0.0001$ to $0.02$
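
One simple schedule that fits these bullets is a linear ramp. The sketch below assumes evenly spaced values from `beta_start` to `beta_end`; the function name, the default endpoints (taken from the range above), and the choice of 1000 steps are illustrative assumptions, not something the notes specify.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return num_steps values of beta_t, evenly spaced from beta_start to beta_end."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


betas = linear_beta_schedule(1000)  # 1000 steps is an assumed, illustrative count
print(betas[0], betas[-1])
```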

For normal distributions, we have

$$x \sim \mathcal{N}(\mu, \sigma^2) \quad \Longleftrightarrow \quad p(x) = \mathcal{N}(x; \mu, \sigma^2)$$

To sample from this, we take

$$x = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
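
A quick numerical check of this sampling rule, as a sketch in plain Python. The values of `mu`, `sigma`, and the sample count are arbitrary choices for illustration; the scalar case is used instead of a full vector.

```python
import math
import random


def sample_gaussian(mu, sigma, rng):
    """Draw x = mu + sigma * eps, where eps is standard normal."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


rng = random.Random(0)
mu, sigma = 3.0, 2.0  # arbitrary illustrative values
samples = [sample_gaussian(mu, sigma, rng) for _ in range(50_000)]

sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"mean ~ {sample_mean:.3f}, std ~ {sample_std:.3f}")
```

The sample mean and standard deviation should land close to `mu` and `sigma`.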

Therefore, our noise updates become

$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
-as we can see, each noising step scales down the current vector/image $x_{t-1}$ before adding random noise
-we use $\sqrt{1 - \beta_t}$ and $\sqrt{\beta_t}$ so that the variance of the image doesn't explode: if $x_{t-1}$ has unit variance, then $x_t$ has variance $(1 - \beta_t) + \beta_t = 1$
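
To see why the scaling matters, here is a small sketch comparing the update above with a naive version that adds noise without scaling. Assumptions for illustration only: a constant $\beta_t = 0.02$, 50 steps, and a unit-variance list of numbers standing in for an image.

```python
import math
import random


def forward_step(x_prev, beta_t, rng):
    """One noising step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    keep = math.sqrt(1.0 - beta_t)
    add = math.sqrt(beta_t)
    return [keep * v + add * rng.gauss(0.0, 1.0) for v in x_prev]


def naive_step(x_prev, beta_t, rng):
    """Same noise, but without scaling down x_{t-1} first."""
    add = math.sqrt(beta_t)
    return [v + add * rng.gauss(0.0, 1.0) for v in x_prev]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
beta, num_steps = 0.02, 50  # assumed constant schedule, for illustration
x = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # unit-variance stand-in "image"
y = list(x)

for _ in range(num_steps):
    x = forward_step(x, beta, rng)
    y = naive_step(y, beta, rng)

var_scaled = variance(x)  # stays near 1
var_naive = variance(y)   # grows by about beta per step
print(f"scaled: {var_scaled:.3f}, naive: {var_naive:.3f}")
```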

Mathematical derivation for mean/variance of the noising process at step $t$:

$$
\begin{align*}
x_t &= \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \\
\Rightarrow x_1 &= \sqrt{1-\beta_1}\, x_0 + \sqrt{\beta_1} \cdot \epsilon_1, \\
x_2 &= \sqrt{1-\beta_2}\, x_1 + \sqrt{\beta_2} \cdot \epsilon_2, \\
\Rightarrow x_2 &= \sqrt{1-\beta_2}\,(\sqrt{1-\beta_1}\, x_0 + \sqrt{\beta_1} \cdot \epsilon_1) + \sqrt{\beta_2} \cdot \epsilon_2 \\
\Rightarrow x_2 &= \sqrt{1-\beta_2}\sqrt{1-\beta_1}\, x_0 + \sqrt{1-\beta_2}\sqrt{\beta_1} \cdot \epsilon_1 + \sqrt{\beta_2} \cdot \epsilon_2
\end{align*}
$$

Notice the pattern: the coefficient of $x_0$ (signal term, from original image) becomes $\prod_{i=1}^{t} \sqrt{1 - \beta_i} = \sqrt{\prod_{i=1}^t (1 - \beta_i)}$.

Furthermore, since $\epsilon_1$ and $\epsilon_2$ are independent, we notice that

$$
\begin{align*}
\sqrt{1-\beta_2}\sqrt{\beta_1} \cdot \epsilon_1 + \sqrt{\beta_2} \cdot \epsilon_2 &\sim \mathcal{N}(0, (1-\beta_2)\beta_1 I) + \mathcal{N}(0, \beta_2 I) \\
&= \mathcal{N}(0, ((1-\beta_2)\beta_1 + \beta_2) I) \\
&= \mathcal{N}(0, (1 - (1 - \beta_2)(1 - \beta_1)) I)
\end{align*}
$$

by the sum-of-Gaussians rule (the sum of independent zero-mean Gaussians is a Gaussian whose variance is the sum of the variances). Repeating this for $t$ steps, the combined noise term is distributed as $\sqrt{1 - \prod_{i=1}^t (1 - \beta_i)} \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Now, letting $\alpha_t = \prod_{i=1}^t (1 - \beta_i)$ (often written $\bar{\alpha}_t$ elsewhere), we have

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
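
The closed form can be checked against the step-by-step recursion without any sampling: track the coefficient of $x_0$ and the noise variance through the per-step update and compare them with $\sqrt{\alpha_t}$ and $1 - \alpha_t$. This is a sketch; the linear schedule, the step count, and the helper names `cumulative_alphas` and `q_sample` are assumptions made for illustration.

```python
import math
import random


def cumulative_alphas(betas):
    """alpha_t = product of (1 - beta_i) for i <= t; index 0 holds alpha_1."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas


def q_sample(x0, t, alphas, rng):
    """Draw x_t directly from x_0 in one shot (t is 1-indexed)."""
    a = alphas[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


# Assumed linear schedule from 0.0001 to 0.02 over 1000 steps.
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
alphas = cumulative_alphas(betas)

# Follow the per-step update for 500 steps, tracking only the coefficients.
t = 500
signal_coef, noise_var = 1.0, 0.0
for beta in betas[:t]:
    signal_coef *= math.sqrt(1.0 - beta)        # x_0 is scaled down again
    noise_var = (1.0 - beta) * noise_var + beta  # old noise shrinks, new noise adds

print(signal_coef, math.sqrt(alphas[t - 1]))  # these two agree
print(noise_var, 1.0 - alphas[t - 1])         # and so do these

x_t = q_sample([0.5, -1.0, 0.25], t, alphas, random.Random(0))
```

The practical benefit is that training can jump straight to any step $t$ instead of looping through all earlier steps.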

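Since $0 < \beta_i < 1$, every factor $(1 - \beta_i)$ is less than 1, so $\alpha_t$ decreases as $t$ grows. If the $\beta_i$ are also bounded below by some $\beta_{\min} > 0$ (as with the $0.0001$ to $0.02$ range), then $\alpha_t \le (1 - \beta_{\min})^t$, so $\lim_{t \rightarrow \infty} \sqrt{\alpha_t} = 0$ while $\lim_{t \rightarrow \infty} \sqrt{1 - \alpha_t} = 1$.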

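Thus, as the number of iterations (noising steps) increases, the original signal $x_0$ dies out while the noise takes over the entire image. Furthermore, $x_t$ given $x_0$ is normally distributed at every step, $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$, and it approaches $\mathcal{N}(0, I)$ for large $t$.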
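
To put numbers on this, the sketch below prints the signal and noise coefficients at a few steps. It assumes a linear schedule from $0.0001$ to $0.02$ over 1000 steps; both the schedule shape and the step count are illustrative choices, not something the notes fix.

```python
import math

num_steps = 1000  # assumed step count
betas = [1e-4 + i * (0.02 - 1e-4) / (num_steps - 1) for i in range(num_steps)]

# alpha_t as the running product of (1 - beta_i); index 0 holds alpha_1.
alphas, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas.append(running)

for t in (1, 100, 500, 1000):
    signal = math.sqrt(alphas[t - 1])       # coefficient of x_0
    noise = math.sqrt(1.0 - alphas[t - 1])  # coefficient of epsilon
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")

final_signal = math.sqrt(alphas[-1])
final_noise = math.sqrt(1.0 - alphas[-1])
```

Under these assumptions the signal coefficient at the last step is below 0.01, so $x_t$ is almost pure noise by then.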