The Evidence Lower Bound (ELBO) is the core quantity in variational inference and the key optimization objective for training Variational Autoencoders. It provides a tractable lower bound on the log marginal likelihood (the “evidence”) of observed data.


The Problem: Intractable Posteriors

In many probabilistic models, we want to compute the posterior distribution $p_\theta(z \mid x)$ of latent variables $z$ given observed data $x$. Using Bayes’ rule:

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$$

The denominator is the marginal likelihood or evidence:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
This integral is typically intractable for complex models because it requires integrating over all possible latent configurations. We need an alternative approach.
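
As a rough illustration of why this is hard, the sketch below (a hypothetical linear-Gaussian toy model, not taken from the text) estimates the evidence by brute-force Monte Carlo over the prior; the estimator is easy to write down but becomes extremely noisy as the latent dimensionality grows, which is why a different approach is needed for complex models.

```python
# Minimal sketch (hypothetical linear-Gaussian toy model, assumed here):
# naive Monte Carlo estimate of the evidence p(x) = ∫ p(x|z) p(z) dz.
import numpy as np

rng = np.random.default_rng(0)
latent_dim, obs_dim, noise_var = 2, 3, 0.1
W = rng.normal(size=(obs_dim, latent_dim))    # decoder weights (assumed)
x = rng.normal(size=obs_dim)                  # one observation

n_samples = 100_000
z = rng.normal(size=(n_samples, latent_dim))  # z ~ p(z) = N(0, I)
means = z @ W.T                               # decoder mean Wz, one per sample
# log p(x|z) under an assumed Gaussian likelihood N(Wz, noise_var * I)
log_lik = -0.5 * np.sum((x - means) ** 2 / noise_var
                        + np.log(2 * np.pi * noise_var), axis=1)
# p(x) ≈ (1/N) Σ_i p(x|z_i); logsumexp keeps the average numerically stable
log_evidence = np.logaddexp.reduce(log_lik) - np.log(n_samples)
print("naive Monte Carlo estimate of log p(x):", log_evidence)
```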


Variational Inference Solution

Instead of computing $p_\theta(z \mid x)$ directly, variational inference introduces an approximate posterior $q_\phi(z \mid x)$ (often called the “recognition model” or “inference network”), parameterized by $\phi$.

The goal is to make $q_\phi(z \mid x)$ as close as possible to the true posterior $p_\theta(z \mid x)$. We measure closeness using the KL divergence:

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]$$

However, computing this directly still requires knowing $p_\theta(z \mid x)$, and hence the evidence $p_\theta(x)$, which brings us back to the original problem!
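
For concreteness, here is a small self-contained check (with toy one-dimensional Gaussians chosen purely for illustration) that the KL divergence behaves as a measure of closeness: it is zero when the two distributions coincide and positive otherwise, and a Monte Carlo estimate of $\mathbb{E}_q[\log q - \log p]$ matches the closed form.

```python
# Sketch (toy 1-D Gaussians assumed here): KL divergence via the closed form
# and via a Monte Carlo estimate of E_q[log q(z) - log p(z)].
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    # Closed form for KL( N(m_q, s_q^2) || N(m_p, s_p^2) )
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2 * s_p**2) - 0.5

rng = np.random.default_rng(0)
m_q, s_q, m_p, s_p = 0.5, 0.8, 0.0, 1.0
z = rng.normal(m_q, s_q, size=200_000)                  # z ~ q
log_q = -0.5 * ((z - m_q) / s_q) ** 2 - np.log(s_q * np.sqrt(2 * np.pi))
log_p = -0.5 * ((z - m_p) / s_p) ** 2 - np.log(s_p * np.sqrt(2 * np.pi))
print("closed form:", kl_gauss(m_q, s_q, m_p, s_p))     # ≈ 0.17
print("Monte Carlo:", np.mean(log_q - log_p))           # close to the above
print("identical q and p:", kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
```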


Deriving the ELBO

Starting from the KL divergence and using Bayes’ rule ($p_\theta(z \mid x) = p_\theta(x, z)/p_\theta(x)$), we can derive a useful relationship:

$$\begin{aligned} D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\big] \\ &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x) \end{aligned}$$

Note that $\log p_\theta(x)$ doesn’t depend on $z$, so it comes out of the expectation. Rearranging:

$$\log p_\theta(x) = D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]$$

The second term is the ELBO:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]$$

Since $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) \ge 0$, we have:

$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$$
The ELBO is a lower bound on the log evidence!
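
The decomposition behind this bound can be verified numerically. The sketch below uses a made-up three-state discrete latent model (purely illustrative) to check that $\log p(x)$ equals the ELBO plus the KL divergence to the true posterior, and that the ELBO never exceeds the log evidence.

```python
# Sketch (hypothetical 3-state discrete latent model) verifying
# log p(x) = ELBO + KL(q(z|x) || p(z|x)) for an arbitrary q.
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])           # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])   # likelihood p(x|z) for one fixed x

p_xz = p_x_given_z * p_z                  # joint p(x, z)
p_x = p_xz.sum()                          # evidence p(x)
p_z_given_x = p_xz / p_x                  # true posterior p(z|x)

q = np.array([0.6, 0.3, 0.1])             # some approximate posterior q(z|x)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))        # E_q[log p(x,z) - log q]
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))   # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)   # identical up to floating-point error
print(elbo <= np.log(p_x))      # True: the ELBO is a lower bound
```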


Alternative ELBO Forms

Reconstruction + Regularization

The ELBO can be rewritten using the joint distribution $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

This form has a clear interpretation (a short numeric check follows the list):

  • Reconstruction term: $\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$ - how well we can reconstruct $x$ from samples of $z \sim q_\phi(z \mid x)$
  • Regularization term: $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ - how close our approximate posterior is to the prior $p(z)$
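
Reusing the same kind of made-up discrete toy model as above (again purely illustrative), the following sketch checks that the joint form $\mathbb{E}_q[\log p_\theta(x, z) - \log q_\phi(z \mid x)]$ and the reconstruction-minus-regularization form give the same number.

```python
# Sketch (same hypothetical discrete setup): the joint-form ELBO equals
# the reconstruction-minus-regularization form.
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])           # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])   # likelihood p(x|z)
q = np.array([0.6, 0.3, 0.1])             # approximate posterior q(z|x)

joint_form = np.sum(q * (np.log(p_x_given_z * p_z) - np.log(q)))
reconstruction = np.sum(q * np.log(p_x_given_z))        # E_q[log p(x|z)]
regularization = np.sum(q * (np.log(q) - np.log(p_z)))  # KL(q || p(z))
print(joint_form, reconstruction - regularization)      # equal
```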

Negative Free Energy

The ELBO is also known as the negative variational free energy in statistical physics:

$$\mathcal{L}(\theta, \phi; x) = -\,\mathcal{F}(\theta, \phi; x), \qquad \mathcal{F}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x, z)\big] - \mathbb{H}\big[q_\phi(z \mid x)\big]$$

where the first term of $\mathcal{F}$ plays the role of an expected energy and $\mathbb{H}$ denotes the entropy of the approximate posterior.

Jensen’s Inequality Derivation

An alternative way to derive the ELBO uses Jensen’s inequality. For any concave function $f$ (like $\log$):

$$f\big(\mathbb{E}[X]\big) \ge \mathbb{E}\big[f(X)\big]$$

Starting with the marginal likelihood, multiply and divide by $q_\phi(z \mid x)$ inside the integral, then apply the inequality:

$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathcal{L}(\theta, \phi; x)$$

This directly gives us the ELBO! The inequality is tight when $q_\phi(z \mid x) = p_\theta(z \mid x)$, meaning the ELBO equals the log evidence when our approximate posterior is perfect.
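
A quick numeric sanity check of Jensen’s inequality for the concave logarithm, with arbitrary illustrative values:

```python
# Sketch: log(E[X]) >= E[log X] for a positive random variable
# (the values and weights below are arbitrary illustrative choices).
import numpy as np

x = np.array([0.2, 1.0, 3.5, 7.0])   # values of X
w = np.array([0.1, 0.4, 0.3, 0.2])   # probabilities summing to 1
print(np.log(np.sum(w * x)))         # log E[X]   ≈ 1.05
print(np.sum(w * np.log(x)))         # E[log X]   ≈ 0.60 (smaller, as expected)
```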


Geometric Intuition

The relationship between the evidence, the ELBO, and the KL divergence can be visualized through the identity derived above:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

  • Maximizing the ELBO has two effects:
    1. Increases the log evidence (a better model of the data)
    2. Decreases the KL divergence (a better approximate posterior)
  • The gap between the ELBO and the true log evidence is exactly the KL divergence
  • When we maximize the ELBO w.r.t. $\phi$ only, we’re doing variational inference (improving the approximate posterior)
  • When we maximize the ELBO w.r.t. $\theta$ only, we’re doing maximum likelihood learning (improving the generative model)
  • Maximizing w.r.t. both simultaneously (as in VAEs) combines both objectives

The Reparameterization Trick

To optimize the ELBO via gradient descent, we need its gradient with respect to the variational parameters, $\nabla_\phi\, \mathcal{L}(\theta, \phi; x)$. The challenge is that the expectation is over $z \sim q_\phi(z \mid x)$, which itself depends on $\phi$.

The reparameterization trick transforms the random variable $z$ into a deterministic function of $\phi$ (and $x$) plus independent noise:

$$z = g_\phi(\epsilon, x), \qquad \epsilon \sim p(\epsilon)$$

For a Gaussian approximate posterior $q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu_\phi(x),\, \operatorname{diag}(\sigma_\phi^2(x))\big)$:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\odot$ is element-wise multiplication. Now the expectation is over $\epsilon$, which doesn’t depend on $\phi$:

$$\mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[f\big(g_\phi(\epsilon, x)\big)\big]$$

We can now move the gradient inside the expectation:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi\, f\big(g_\phi(\epsilon, x)\big)\big]$$

where $g_\phi(\epsilon, x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.

This makes backpropagation through the sampling process possible!
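
A minimal PyTorch sketch of the trick (the tensor shapes and the toy objective are assumptions made here for illustration): because $z$ is computed as $\mu + \sigma \odot \epsilon$, gradients flow from the objective back into $\mu$ and $\log \sigma^2$.

```python
# Minimal sketch of the reparameterization trick in PyTorch
# (shapes and the toy objective are illustrative assumptions).
import torch

torch.manual_seed(0)
batch, latent_dim = 4, 8

# Pretend these came from an encoder network; mark them as requiring gradients
mu = torch.zeros(batch, latent_dim, requires_grad=True)
log_var = torch.zeros(batch, latent_dim, requires_grad=True)

eps = torch.randn(batch, latent_dim)        # eps ~ N(0, I), independent of phi
z = mu + torch.exp(0.5 * log_var) * eps     # z = mu + sigma ⊙ eps

loss = (z ** 2).sum()                       # toy stand-in for an ELBO term
loss.backward()

print(mu.grad.shape, log_var.grad.shape)    # gradients flow through the sample
```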


Practical Computation in VAEs

For a VAE with Gaussian approximate posterior $q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu_\phi(x),\, \operatorname{diag}(\sigma_\phi^2(x))\big)$ and standard normal prior $p(z) = \mathcal{N}(0, I)$, the ELBO becomes:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)$$

The KL divergence has a closed form for Gaussians:

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{J} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$

where $J$ is the dimensionality of $z$ and $\mu_j$, $\sigma_j^2$ are the components of $\mu_\phi(x)$, $\sigma_\phi^2(x)$.
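
This closed form translates directly into a few lines of code. The sketch below assumes a diagonal-Gaussian encoder output parameterized by a mean and a log-variance (a common convention, but an assumption here).

```python
# Sketch of the closed-form KL term D_KL(q(z|x) || N(0, I)) for a
# diagonal-Gaussian approximate posterior (shapes are illustrative).
import torch

def kl_to_standard_normal(mu, log_var):
    # 0.5 * Σ_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1), summed over latent dims
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1, dim=-1)

mu = torch.randn(4, 8)        # pretend encoder means for a batch of 4
log_var = torch.randn(4, 8)   # pretend encoder log-variances
print(kl_to_standard_normal(mu, log_var))   # one KL value per example
```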

The reconstruction term is approximated via Monte Carlo sampling:

$$\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \mid z^{(l)}\big), \qquad z^{(l)} = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon^{(l)}, \quad \epsilon^{(l)} \sim \mathcal{N}(0, I)$$

In practice, $L = 1$ (a single sample per data point) often works well during training.
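
Putting the pieces together, here is a hedged sketch of a single-sample ($L = 1$) ELBO estimate for one mini-batch. The `encoder` and `decoder` callables and the Bernoulli decoder likelihood are hypothetical choices made for illustration, not something specified in this section.

```python
# Sketch of a single-sample (L = 1) ELBO estimate for one mini-batch,
# assuming hypothetical `encoder` and `decoder` modules and a Bernoulli
# decoder likelihood; illustrative only, not a specific library API.
import torch
import torch.nn.functional as F

def elbo_single_sample(x, encoder, decoder):
    mu, log_var = encoder(x)                         # q(z|x) parameters
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps          # reparameterized sample
    logits = decoder(z)                              # p(x|z) logits
    # Reconstruction: log p(x|z) for a Bernoulli decoder, summed over dimensions
    recon = -F.binary_cross_entropy_with_logits(
        logits, x, reduction="none").sum(dim=-1)
    # Regularization: closed-form KL(q(z|x) || N(0, I))
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1, dim=-1)
    return (recon - kl).mean()                       # average ELBO over the batch

# Training maximizes the ELBO, i.e. minimizes loss = -elbo_single_sample(...)
```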


Why ELBO Works

The ELBO framework is powerful because:

  1. Tractability: We can compute and optimize it without knowing the intractable posterior
  2. Flexibility: Works with any choice of approximate posterior family
  3. Principled: Directly optimizes a bound on the quantity we care about (log evidence)
  4. Interpretable: Decomposes into reconstruction and regularization terms
  5. Scalable: Can be optimized via stochastic gradient descent

The key insight is that by maximizing a lower bound, we’re still pushing the true quantity upward, even though we can’t compute it directly.


Connections