The Kullback-Leibler Divergence
The Basics
Definition
KL Divergence.
Let $p$, $q$ be two probability densities on $\mathbb{R}^d$ with the property that $p(x) = 0$ whenever $q(x) = 0$. Then the Kullback-Leibler (KL) divergence of $p$ with respect to $q$ is defined as
$$
\text{KL}(p \parallel q) := \int \log\left(\frac{p(x)}{q(x)}\right) p(x)\, dx = \mathbb{E}_{x \sim p}\left[\log\left(\frac{p(x)}{q(x)}\right)\right]. \tag{1}
$$
The convention is that the first entry of $\text{KL}(\cdot \parallel \cdot)$ is the density used as the weight in the integral; i.e., the expectation in (1) is performed with respect to the probability distribution $p$. As almost any counterexample will show, $\text{KL}(p \parallel q) \neq \text{KL}(q \parallel p)$ in general, even when the integral is well-defined in both cases. Flipping the order of the arguments means integrating with respect to the distribution $q$ in place of $p$. We will discuss practical implications of these two alternatives later in this post.
Basic Interpretation
Note that the basic form of (1) consists of a pointwise error $\log(p(x)/q(x))$ between the two densities, averaged over all $x$ with respect to $p$. The integrand can take negative values (wherever $q(x) > p(x)$), even though the divergence as a whole is always nonnegative. When the densities agree at a point $x$, then $\log(p(x)/q(x)) = \log(1) = 0$, so no contribution is made to the integral. Similarly, sets with zero probability with respect to $p$ make no contribution to the integral, regardless of how much the two densities differ on them. Large contributions are made to the integral in regions where the ratio $p(x)/q(x)$ is large and $p(x)$ itself is large. Thus, the KL divergence will tend to be smaller for distributions such that (i.) $q(x) \approx p(x)$ wherever $p(x)$ is large; or (ii.) $p(x)$ is small wherever the two densities disagree.
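To make definition (1) concrete, the minimal sketch below (using SciPy, with two arbitrarily chosen 1D Gaussian densities) evaluates the integral in (1) by numerical quadrature, and confirms the asymmetry noted above by swapping the arguments.

```python
# Numerical illustration of definition (1) for two arbitrary 1D Gaussian
# densities: estimate KL(p || q) by quadrature, then swap the arguments to see
# that the value changes (KL is not symmetric).
import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.norm(loc=0.0, scale=1.0)  # p = N(0, 1), the weight density
q = stats.norm(loc=1.0, scale=2.0)  # q = N(1, 2^2)

def kl(p, q):
    """Estimate KL(p || q) = E_{x~p}[log(p(x)/q(x))] by numerical quadrature."""
    integrand = lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x))
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

print(kl(p, q))  # KL(p || q) ≈ 0.44
print(kl(q, p))  # KL(q || p) ≈ 1.31; flipping the arguments changes the value
```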
Example: Gaussian Distributions
The KL divergence between two Gaussian distributions can be computed in closed form, as shown below. Note that the required condition on the densities is satisfied, since Gaussian densities are strictly positive on $\mathbb{R}^d$.
KL Divergence between Gaussians.
Let $p = \mathcal{N}(m_p, C_p)$ and $q = \mathcal{N}(m_q, C_q)$ be Gaussian densities on $\mathbb{R}^d$. Then,
$$
\text{KL}(p \parallel q) = \frac{1}{2}\left[\log\left(\frac{\text{det}(C_q)}{\text{det}(C_p)}\right) - d + \text{tr}\left(C_q^{-1} C_p\right) + (m_q - m_p)^\top C_q^{-1} (m_q - m_p)\right].
$$
Proof. \begin{align} \text{KL}(p \parallel q) &= \mathbb{E}_{x \sim p}\left[\log\left(\frac{p(x)}{q(x)}\right) \right] \newline &= \mathbb{E}_{x \sim p}\left[\log\left(\frac{\text{det}(2\pi C_p)^{-1/2} \exp\left(-\frac{1}{2}(x-m_p)^\top C_p^{-1}(x-m_p)\right)}{\text{det}(2\pi C_q)^{-1/2} \exp\left(-\frac{1}{2}(x-m_q)^\top C_q^{-1}(x-m_q)\right)} \right)\right] \newline &= \frac{1}{2}\log\left(\frac{\text{det}(C_q)}{\text{det}(C_p)}\right) + \frac{1}{2}\mathbb{E}_{x \sim p}\left[(x-m_q)^\top C_q^{-1}(x-m_q) - (x-m_p)^\top C_p^{-1}(x-m_p)\right] \end{align} The first expectation in the last line equals $\text{tr}(C_q^{-1}C_p) + (m_q - m_p)^\top C_q^{-1}(m_q - m_p)$ (write $x - m_q = (x - m_p) + (m_p - m_q)$ and expand), while the second equals $\text{tr}(C_p^{-1}C_p) = d$. Substituting these into the display above yields the claimed formula. $\blacksquare$
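The closed-form expression can be sanity-checked numerically. The sketch below uses arbitrary illustrative means and covariances, and compares the formula against a simple Monte Carlo estimate of $\mathbb{E}_{x \sim p}[\log p(x) - \log q(x)]$ based on samples from $p$.

```python
# Compare the closed-form Gaussian KL formula against a Monte Carlo estimate
# E_{x~p}[log p(x) - log q(x)] for arbitrary illustrative means and covariances.
import numpy as np
from scipy import stats

m_p, C_p = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
m_q, C_q = np.array([1.0, -1.0]), np.array([[2.0, 0.0], [0.0, 0.5]])

def kl_gauss(m_p, C_p, m_q, C_q):
    """Closed-form KL(N(m_p, C_p) || N(m_q, C_q))."""
    C_q_inv = np.linalg.inv(C_q)
    diff = m_q - m_p
    return 0.5 * (np.log(np.linalg.det(C_q) / np.linalg.det(C_p))
                  - len(m_p)
                  + np.trace(C_q_inv @ C_p)
                  + diff @ C_q_inv @ diff)

p = stats.multivariate_normal(m_p, C_p)
q = stats.multivariate_normal(m_q, C_q)
x = p.rvs(size=200_000, random_state=0)
mc_estimate = np.mean(p.logpdf(x) - q.logpdf(x))

print(kl_gauss(m_p, C_p, m_q, C_q))  # closed form
print(mc_estimate)                   # Monte Carlo estimate; should be close
```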
Interpretations
Information Theoretic Perspective
F-Divergence
Measure-Theoretic Technicalities
In this section we generalize definition (1) by slightly loosening the assumed condition on $p$ and $q$. We view $p$, $q$ as Lebesgue densities of two probability measures $P$, $Q$ on $\mathbb{R}^d$. In (1) we assumed that $p(x) = 0$ whenever $q(x) = 0$ in order to avoid division by zero in the integrand. Since the integral is unaffected by sets of measure zero, we see that we actually only require this implication to hold on sets that have positive probability with respect to $P$. The general condition we need is that $P$ be dominated by $Q$, in the sense that
$$
Q(A) = 0 \implies P(A) = 0 \qquad \text{for all measurable sets } A \subseteq \mathbb{R}^d.
$$
A distribution $P$ satisfying this property is said to be absolutely continuous with respect to $Q$, denoted by $P \ll Q$. Under this weaker condition, definition (1) still holds.
We can make things even more general by not requiring the existence of the Lebesgue densities $p$, $q$ at all.
Properties
KL Divergence is a Divergence
Unnormalized Densities
Chain Rule
Connection to Maximum Likelihood
Connection to Bayesian Inference
The KL divergence also plays an important role in the Bayesian setting. In fact, the posterior distribution can be interpreted as the solution of an optimization problem with the KL divergence as the objective function. The Bayesian setup consists of a joint distribution over $(x,y)$, with $x$ the parameter of interest and $y$ the data. We assume this joint distribution takes the form \(p(x,y) = \pi_0(x)L(x;y),\) where $\pi_0$ is the prior density over $x$, and $L(x;y) = p(y|x)$ is the likelihood function. The posterior density is then given by \(\pi(x) := p(x|y) = \frac{1}{Z} \pi_0(x)L(x;y),\) where the normalizing constant $Z$ is independent of $x$. With this notation established, we now consider the KL divergence between an arbitrary distribution $q$ and the posterior $\pi$: \begin{align} \text{KL}(q||\pi) &= \int \log\left(\frac{q(x)}{\pi(x)}\right) q(x) dx \newline &= \int \log\left(\frac{q(x)Z}{\pi_0(x)L(x;y)}\right) q(x) dx \newline &= \log Z + \int \log\left(\frac{q(x)}{\pi_0(x)}\right) q(x) dx - \int \log\left(L(x;y)\right) q(x) dx \newline &= \log Z + \text{KL}(q||\pi_0) - \mathbb{E}_{x \sim q}[\log L(x;y)] \end{align}
If we view the negative log-likelihood \(\Phi(x) := -\log L(x;y)\) as a loss function, then we see that \(\text{KL}(q||\pi) = \log Z + \text{KL}(q||\pi_0) + \mathbb{E}_{x \sim q}[\Phi(x)]\) is, up to the additive constant $\log Z$, the sum of two terms: one penalizes discrepancy with respect to the prior, and the other penalizes poor model-data agreement. Since $\log Z$ does not depend on $q$, minimizing $\text{KL}(q||\pi)$ over $q$ is equivalent to minimizing $\text{KL}(q||\pi_0) + \mathbb{E}_{x \sim q}[\Phi(x)]$, and the unconstrained minimizer is the posterior $\pi$ itself.
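To make this variational reading concrete, here is a minimal sketch on a toy conjugate Gaussian model; the prior, likelihood, observed value, and mean-field Gaussian family below are illustrative assumptions, not anything fixed above. Since $\log Z$ is constant in $q$, the code minimizes $\text{KL}(q||\pi_0) + \mathbb{E}_{x \sim q}[\Phi(x)]$ over the variational parameters and checks the result against the exact Gaussian posterior.

```python
# Toy variational sketch (illustrative assumptions): prior x ~ N(0, 1),
# likelihood y | x ~ N(x, sigma2), a single observed y, variational family
# q = N(m, s^2). Minimizing KL(q || pi) is equivalent (up to log Z) to
# minimizing KL(q || pi_0) + E_{x~q}[Phi(x)], both in closed form here.
import numpy as np
from scipy.optimize import minimize

y, sigma2 = 1.5, 0.25  # observed datum and likelihood variance (illustrative)

def objective(params):
    m, log_s = params
    s2 = np.exp(2.0 * log_s)
    # KL(N(m, s^2) || N(0, 1)) in closed form.
    kl_prior = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
    # E_{x~q}[Phi(x)] with Phi(x) = (y - x)^2 / (2 sigma2), constants dropped.
    expected_loss = ((y - m)**2 + s2) / (2.0 * sigma2)
    return kl_prior + expected_loss

result = minimize(objective, x0=np.array([0.0, 0.0]))
m_opt, s2_opt = result.x[0], np.exp(2.0 * result.x[1])

# Exact posterior for this conjugate model: N(y / (1 + sigma2), sigma2 / (1 + sigma2)).
print(m_opt, s2_opt)
print(y / (1.0 + sigma2), sigma2 / (1.0 + sigma2))
```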
Plan:
- KL as special case of F-Divergence (See DA and Inverse Problems ML Perspective)
- Information-theoretic perspective
- Basic properties
- Connection to MLE: see e.g. https://jaketae.github.io/study/kl-mle/ MLE is with respect to fixed data realization; KL averages over data distribution.
- Forward vs. Backward KL
- Use as objective functions: why one KL direction is more suitable when you have unnormalized density and one is more suitable when you have samples.
- Results: minimizing reverse KL results in covariance matching; forward instead matches precisions.