The Evidence Lower Bound (ELBO) is the core quantity in variational inference and the key optimization objective for training Variational Autoencoders. It provides a tractable lower bound on the log marginal likelihood (the “evidence”) of observed data.


The Problem: Intractable Posteriors

In many probabilistic models, we want to compute the posterior distribution $p_\theta(z \mid x)$ of latent variables $z$ given observed data $x$. Using Bayes’ rule:

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$$

The denominator is the marginal likelihood or evidence:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
This integral is typically intractable for complex models because it requires integrating over all possible latent configurations. We need an alternative approach.
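
As a rough illustration of why this is hard, the sketch below (a hypothetical linear-Gaussian toy model, not taken from the text) estimates the evidence by brute-force Monte Carlo over the prior; the estimator is easy to write down but becomes extremely noisy as the latent dimensionality grows, which is why a different approach is needed for complex models.

```python
# Minimal sketch (hypothetical linear-Gaussian toy model, assumed here):
# naive Monte Carlo estimate of the evidence p(x) = ∫ p(x|z) p(z) dz.
import numpy as np

rng = np.random.default_rng(0)
latent_dim, obs_dim, noise_var = 2, 3, 0.1
W = rng.normal(size=(obs_dim, latent_dim))    # decoder weights (assumed)
x = rng.normal(size=obs_dim)                  # one observation

n_samples = 100_000
z = rng.normal(size=(n_samples, latent_dim))  # z ~ p(z) = N(0, I)
means = z @ W.T                               # decoder mean Wz, one per sample
# log p(x|z) under an assumed Gaussian likelihood N(Wz, noise_var * I)
log_lik = -0.5 * np.sum((x - means) ** 2 / noise_var
                        + np.log(2 * np.pi * noise_var), axis=1)
# p(x) ≈ (1/N) Σ_i p(x|z_i); logsumexp keeps the average numerically stable
log_evidence = np.logaddexp.reduce(log_lik) - np.log(n_samples)
print("naive Monte Carlo estimate of log p(x):", log_evidence)
```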


Variational Inference Solution

Instead of computing $p_\theta(z \mid x)$ directly, variational inference introduces an approximate posterior $q_\phi(z \mid x)$ (often called the “recognition model” or “inference network”), parameterized by $\phi$.

The goal is to make $q_\phi(z \mid x)$ as close as possible to the true posterior $p_\theta(z \mid x)$. We measure closeness using the KL divergence:

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]$$

However, computing this directly still requires knowing $p_\theta(z \mid x)$, and hence the evidence $p_\theta(x)$, which brings us back to the original problem!
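
For concreteness, here is a small self-contained check (with toy one-dimensional Gaussians chosen purely for illustration) that the KL divergence behaves as a measure of closeness: it is zero when the two distributions coincide and positive otherwise, and a Monte Carlo estimate of $\mathbb{E}_q[\log q - \log p]$ matches the closed form.

```python
# Sketch (toy 1-D Gaussians assumed here): KL divergence via the closed form
# and via a Monte Carlo estimate of E_q[log q(z) - log p(z)].
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    # Closed form for KL( N(m_q, s_q^2) || N(m_p, s_p^2) )
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2 * s_p**2) - 0.5

rng = np.random.default_rng(0)
m_q, s_q, m_p, s_p = 0.5, 0.8, 0.0, 1.0
z = rng.normal(m_q, s_q, size=200_000)                  # z ~ q
log_q = -0.5 * ((z - m_q) / s_q) ** 2 - np.log(s_q * np.sqrt(2 * np.pi))
log_p = -0.5 * ((z - m_p) / s_p) ** 2 - np.log(s_p * np.sqrt(2 * np.pi))
print("closed form:", kl_gauss(m_q, s_q, m_p, s_p))     # ≈ 0.17
print("Monte Carlo:", np.mean(log_q - log_p))           # close to the above
print("identical q and p:", kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
```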


Deriving the ELBO

Starting from the KL divergence and using Bayes’ rule ($p_\theta(z \mid x) = p_\theta(x, z)/p_\theta(x)$), we can derive a useful relationship:

$$\begin{aligned} D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\big] \\ &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x) \end{aligned}$$

Note that $\log p_\theta(x)$ doesn’t depend on $z$, so it comes out of the expectation. Rearranging:

$$\log p_\theta(x) = D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]$$

The second term is the ELBO:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]$$

Since $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) \ge 0$, we have:

$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$$
The ELBO is a lower bound on the log evidence!
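
The decomposition behind this bound can be verified numerically. The sketch below uses a made-up three-state discrete latent model (purely illustrative) to check that $\log p(x)$ equals the ELBO plus the KL divergence to the true posterior, and that the ELBO never exceeds the log evidence.

```python
# Sketch (hypothetical 3-state discrete latent model) verifying
# log p(x) = ELBO + KL(q(z|x) || p(z|x)) for an arbitrary q.
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])           # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])   # likelihood p(x|z) for one fixed x

p_xz = p_x_given_z * p_z                  # joint p(x, z)
p_x = p_xz.sum()                          # evidence p(x)
p_z_given_x = p_xz / p_x                  # true posterior p(z|x)

q = np.array([0.6, 0.3, 0.1])             # some approximate posterior q(z|x)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))        # E_q[log p(x,z) - log q]
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))   # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)   # identical up to floating-point error
print(elbo <= np.log(p_x))      # True: the ELBO is a lower bound
```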


Alternative ELBO Forms

Reconstruction + Regularization

The ELBO can be rewritten using the joint distribution $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

This form has a clear interpretation (a short numeric check follows the list):

  • Reconstruction term: $\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$ - how well we can reconstruct $x$ from samples of $z \sim q_\phi(z \mid x)$
  • Regularization term: $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ - how close our approximate posterior is to the prior $p(z)$
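
Reusing the same kind of made-up discrete toy model as above (again purely illustrative), the following sketch checks that the joint form $\mathbb{E}_q[\log p_\theta(x, z) - \log q_\phi(z \mid x)]$ and the reconstruction-minus-regularization form give the same number.

```python
# Sketch (same hypothetical discrete setup): the joint-form ELBO equals
# the reconstruction-minus-regularization form.
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])           # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])   # likelihood p(x|z)
q = np.array([0.6, 0.3, 0.1])             # approximate posterior q(z|x)

joint_form = np.sum(q * (np.log(p_x_given_z * p_z) - np.log(q)))
reconstruction = np.sum(q * np.log(p_x_given_z))        # E_q[log p(x|z)]
regularization = np.sum(q * (np.log(q) - np.log(p_z)))  # KL(q || p(z))
print(joint_form, reconstruction - regularization)      # equal
```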

Negative Free Energy

The ELBO is also known as the negative variational free energy in statistical physics:

$$\mathcal{L}(\theta, \phi; x) = -\,\mathcal{F}(\theta, \phi; x), \qquad \mathcal{F}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x, z)\big] - \mathbb{H}\big[q_\phi(z \mid x)\big]$$

where the first term of $\mathcal{F}$ plays the role of an expected energy and $\mathbb{H}$ denotes the entropy of the approximate posterior.

Jensen’s Inequality Derivation

An alternative way to derive the ELBO uses Jensen’s inequality. For any concave function $f$ (like $\log$):

$$f\big(\mathbb{E}[X]\big) \ge \mathbb{E}\big[f(X)\big]$$

Starting with the marginal likelihood, multiply and divide by $q_\phi(z \mid x)$ inside the integral, then apply the inequality:

$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathcal{L}(\theta, \phi; x)$$

This directly gives us the ELBO! The inequality is tight when $q_\phi(z \mid x) = p_\theta(z \mid x)$, meaning the ELBO equals the log evidence when our approximate posterior is perfect.
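
A quick numeric sanity check of Jensen’s inequality for the concave logarithm, with arbitrary illustrative values:

```python
# Sketch: log(E[X]) >= E[log X] for a positive random variable
# (the values and weights below are arbitrary illustrative choices).
import numpy as np

x = np.array([0.2, 1.0, 3.5, 7.0])   # values of X
w = np.array([0.1, 0.4, 0.3, 0.2])   # probabilities summing to 1
print(np.log(np.sum(w * x)))         # log E[X]   ≈ 1.05
print(np.sum(w * np.log(x)))         # E[log X]   ≈ 0.60 (smaller, as expected)
```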


Geometric Intuition

The relationship between the evidence, the ELBO, and the KL divergence can be visualized through the identity derived above:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

  • Maximizing the ELBO has two effects:
    1. Increases the log evidence (a better model of the data)
    2. Decreases the KL divergence (a better approximate posterior)
  • The gap between the ELBO and the true log evidence is exactly the KL divergence
  • When we maximize the ELBO w.r.t. $\phi$ only, we’re doing variational inference (improving the approximate posterior)
  • When we maximize the ELBO w.r.t. $\theta$ only, we’re doing maximum likelihood learning (improving the generative model)
  • Maximizing w.r.t. both simultaneously (as in VAEs) combines both objectives

The Reparameterization Trick

To optimize the ELBO via gradient descent, we need its gradient with respect to the variational parameters, $\nabla_\phi\, \mathcal{L}(\theta, \phi; x)$. The challenge is that the expectation is over $z \sim q_\phi(z \mid x)$, which itself depends on $\phi$.

The reparameterization trick transforms the random variable $z$ into a deterministic function of $\phi$ (and $x$) plus independent noise:

$$z = g_\phi(\epsilon, x), \qquad \epsilon \sim p(\epsilon)$$

For a Gaussian approximate posterior $q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu_\phi(x),\, \operatorname{diag}(\sigma_\phi^2(x))\big)$:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\odot$ is element-wise multiplication. Now the expectation is over $\epsilon$, which doesn’t depend on $\phi$:

$$\mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[f\big(g_\phi(\epsilon, x)\big)\big]$$

We can now move the gradient inside the expectation:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi\, f\big(g_\phi(\epsilon, x)\big)\big]$$

where $g_\phi(\epsilon, x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.

This makes backpropagation through the sampling process possible!
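
A minimal PyTorch sketch of the trick (the tensor shapes and the toy objective are assumptions made here for illustration): because $z$ is computed as $\mu + \sigma \odot \epsilon$, gradients flow from the objective back into $\mu$ and $\log \sigma^2$.

```python
# Minimal sketch of the reparameterization trick in PyTorch
# (shapes and the toy objective are illustrative assumptions).
import torch

torch.manual_seed(0)
batch, latent_dim = 4, 8

# Pretend these came from an encoder network; mark them as requiring gradients
mu = torch.zeros(batch, latent_dim, requires_grad=True)
log_var = torch.zeros(batch, latent_dim, requires_grad=True)

eps = torch.randn(batch, latent_dim)        # eps ~ N(0, I), independent of phi
z = mu + torch.exp(0.5 * log_var) * eps     # z = mu + sigma ⊙ eps

loss = (z ** 2).sum()                       # toy stand-in for an ELBO term
loss.backward()

print(mu.grad.shape, log_var.grad.shape)    # gradients flow through the sample
```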


Practical Computation in VAEs

For a VAE with Gaussian approximate posterior $q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu_\phi(x),\, \operatorname{diag}(\sigma_\phi^2(x))\big)$ and standard normal prior $p(z) = \mathcal{N}(0, I)$, the ELBO becomes:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)$$

The KL divergence has a closed form for Gaussians:

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{J} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$

where $J$ is the dimensionality of $z$ and $\mu_j$, $\sigma_j^2$ are the components of $\mu_\phi(x)$, $\sigma_\phi^2(x)$.
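
This closed form translates directly into a few lines of code. The sketch below assumes a diagonal-Gaussian encoder output parameterized by a mean and a log-variance (a common convention, but an assumption here).

```python
# Sketch of the closed-form KL term D_KL(q(z|x) || N(0, I)) for a
# diagonal-Gaussian approximate posterior (shapes are illustrative).
import torch

def kl_to_standard_normal(mu, log_var):
    # 0.5 * Σ_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1), summed over latent dims
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1, dim=-1)

mu = torch.randn(4, 8)        # pretend encoder means for a batch of 4
log_var = torch.randn(4, 8)   # pretend encoder log-variances
print(kl_to_standard_normal(mu, log_var))   # one KL value per example
```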

The reconstruction term is approximated via Monte Carlo sampling:

$$\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \mid z^{(l)}\big), \qquad z^{(l)} = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon^{(l)}, \quad \epsilon^{(l)} \sim \mathcal{N}(0, I)$$

In practice, $L = 1$ (a single sample per data point) often works well during training.
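
Putting the pieces together, here is a hedged sketch of a single-sample ($L = 1$) ELBO estimate for one mini-batch. The `encoder` and `decoder` callables and the Bernoulli decoder likelihood are hypothetical choices made for illustration, not something specified in this section.

```python
# Sketch of a single-sample (L = 1) ELBO estimate for one mini-batch,
# assuming hypothetical `encoder` and `decoder` modules and a Bernoulli
# decoder likelihood; illustrative only, not a specific library API.
import torch
import torch.nn.functional as F

def elbo_single_sample(x, encoder, decoder):
    mu, log_var = encoder(x)                         # q(z|x) parameters
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps          # reparameterized sample
    logits = decoder(z)                              # p(x|z) logits
    # Reconstruction: log p(x|z) for a Bernoulli decoder, summed over dimensions
    recon = -F.binary_cross_entropy_with_logits(
        logits, x, reduction="none").sum(dim=-1)
    # Regularization: closed-form KL(q(z|x) || N(0, I))
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1, dim=-1)
    return (recon - kl).mean()                       # average ELBO over the batch

# Training maximizes the ELBO, i.e. minimizes loss = -elbo_single_sample(...)
```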


Why ELBO Works

The ELBO framework is powerful because:

  1. Tractability: We can compute and optimize it without knowing the intractable posterior
  2. Flexibility: Works with any choice of approximate posterior family
  3. Principled: Directly optimizes a bound on the quantity we care about (log evidence)
  4. Interpretable: Decomposes into reconstruction and regularization terms
  5. Scalable: Can be optimized via stochastic gradient descent

The key insight is that by maximizing a lower bound, we’re still pushing the true quantity upward, even though we can’t compute it directly.


Connections