Diffusion models are a class of generative models that, in most cases, generate data by learning to reverse a gradual noising process. They have recently gained significant attention in machine learning, particularly for their ability to generate high-quality images, audio, and other types of data.
In the world of Variational Inference, diffusion models can be understood as a type of latent variable model where the latent variables represent progressively noisier versions of the data. The model is trained to predict the original data from these noisy versions, effectively learning to reverse the diffusion process.
However, these models are NOT simply told to denoise data directly. Instead, they are trained to predict the noise that was added at each step of the diffusion process; then, at sampling time, each denoising step is followed by injecting fresh noise into the partially denoised output. This iterative process continues until the data is sufficiently denoised.
While the reasons are a bit abstract, you can think of this as preventing a “too fast convergence” to a poor local optimum. By gradually denoising the data, the model can explore a wider range of possible outputs and avoid getting stuck in suboptimal solutions.
Rough Framework
From the paper that popularized diffusion models, Denoising Diffusion Probabilistic Models (Ho et al., 2020), we can see a rough framework for how they are set up.
In this paper, the authors describe diffusion models as a type of latent variable model that learns to generate data by reversing a diffusion process that gradually adds noise to the data.
They take the form:

$$
p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T}) \, d\mathbf{x}_{1:T}
$$

where $\mathbf{x}_1, \dots, \mathbf{x}_T$ are latent variables of the same dimensionality as the data $\mathbf{x}_0$.
The joint distribution $p_\theta(\mathbf{x}_{0:T})$ is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$:

$$
p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\big(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)
$$
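To make the reverse process concrete, here is a minimal NumPy sketch of ancestral sampling from it. The `eps_model` argument is a hypothetical stand-in for a trained noise-prediction network (any callable with that signature works), and the mean formula uses the standard epsilon-parameterization from the DDPM paper; this is a sketch under those assumptions, not a full implementation.

```python
import numpy as np

def sample(eps_model, shape, betas, rng):
    """Ancestral sampling from the learned reverse process.

    `eps_model(x, t)` is a hypothetical stand-in for a trained
    noise-prediction network.
    """
    T = len(betas)
    alphas = 1.0 - betas
    alphas_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(x, t)
        # Mean of p_theta(x_{t-1} | x_t) in the epsilon-parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Fresh Gaussian noise is injected at every step except the last.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x

# Toy check with a dummy "network" that always predicts zero noise.
betas = np.linspace(1e-4, 0.02, 50)
rng = np.random.default_rng(0)
x = sample(lambda x, t: np.zeros_like(x), (2, 3), betas, rng)
```

Note how the loop mirrors the prose above: predict the noise, take a denoising step, then add fresh noise before the next step.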
The forward process, on the other hand, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \dots, \beta_T$:

$$
q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \qquad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\big(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}\big)
$$
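A nice property of this forward process (derived in the DDPM paper) is that $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ in closed form, without looping through all intermediate steps. A minimal NumPy sketch, assuming a linear beta schedule:

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    Uses the DDPM identity x_t = sqrt(alpha_bar_t) * x_0
    + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the
    cumulative product of (1 - beta) up to step t.
    """
    alphas_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# Toy usage: a linear beta schedule over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))  # pretend "data"
xt, eps = forward_noise(x0, T - 1, betas, rng)
# At t = T - 1, alpha_bar is tiny, so x_t is close to pure noise.
```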
And for our optimization/training we take the variational inference approach of minimizing the usual variational bound on the negative log likelihood (the negative ELBO):

$$
\mathbb{E}\left[-\log p_\theta(\mathbf{x}_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] = \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \sum_{t \ge 1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\right] =: L
$$
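In practice, Ho et al. show that this bound simplifies (up to per-step weighting) to a plain mean-squared error between the injected noise and the network's noise prediction. A minimal NumPy sketch of that simplified objective for one randomly chosen timestep, again with `eps_model` as a hypothetical stand-in for the trained network:

```python
import numpy as np

def ddpm_loss(eps_model, x0, betas, rng):
    """Simplified DDPM training objective (noise-prediction MSE).

    Samples a random timestep t, noises x0 to x_t in closed form,
    and scores how well `eps_model` recovers the injected noise.
    """
    T = len(betas)
    alphas_bar = np.cumprod(1.0 - betas)
    t = rng.integers(T)                    # t ~ Uniform{0, ..., T-1}
    eps = rng.standard_normal(x0.shape)    # the noise to be predicted
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

# Toy usage with a dummy "network" that always predicts zero noise.
betas = np.linspace(1e-4, 0.02, 100)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))
loss = ddpm_loss(lambda x, t: np.zeros_like(x), x0, betas, rng)
```

A real training loop would average this loss over a batch and backpropagate through `eps_model`; the sketch only shows the objective itself.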
This is all quite abstract (classic academic speak), but in layman's terms: the forward process just adds noise to the data, and the reverse process learns to denoise it step by step.
Bibliography
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. arXiv. https://doi.org/10.48550/arXiv.2006.11239