The Kullback-Leibler Divergence

The Basics

Note that the integrand $\log(p(x)/q(x))$ appearing in the definition below takes negative values wherever $q(x) > p(x)$. The nonnegativity of the KL divergence is a property of the integral as a whole, not of the integrand.

Definition

KL Divergence.
Let $p, q$ be two probability densities on $\mathbb{R}^d$ with the property that $p(x) = 0$ whenever $q(x) = 0$. Then the Kullback-Leibler (KL) divergence of $q$ with respect to $p$ is defined as
$$\text{KL}(p \parallel q) := \int \log\left(\frac{p(x)}{q(x)}\right) p(x) \, dx = \mathbb{E}_{x \sim p}\left[\log\left(\frac{p(x)}{q(x)}\right)\right] \tag{1}$$

The convention is that the first entry of $\text{KL}(\cdot \parallel \cdot)$ is the density used as the weight in the integral; i.e., the expectation is taken with respect to the probability distribution $p$. As almost any choice of densities will show,
$$\text{KL}(p \parallel q) \neq \text{KL}(q \parallel p) \tag{2}$$
even when the integral is well-defined in both cases. Flipping the order of the arguments means integrating with respect to the distribution $q$ in place of $p$. We will discuss practical implications of these two alternatives later in this post.
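The asymmetry is easy to check numerically using the standard closed form for univariate Gaussians, $\text{KL}(\mathcal{N}(m_1, s_1^2) \parallel \mathcal{N}(m_2, s_2^2)) = \log(s_2/s_1) + \frac{s_1^2 + (m_1 - m_2)^2}{2 s_2^2} - \frac{1}{2}$. A minimal sketch (the helper name and the particular means and standard deviations are my own illustrative choices):

```python
import math

def kl_gauss_1d(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) || N(m2, s2^2)); helper name is my own."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Swapping the arguments changes the value: the KL divergence is not symmetric.
kl_pq = kl_gauss_1d(0.0, 1.0, 1.0, 2.0)  # KL(p || q) with p = N(0,1), q = N(1,4)
kl_qp = kl_gauss_1d(1.0, 2.0, 0.0, 1.0)  # KL(q || p)
```

Here $\text{KL}(p \parallel q) \approx 0.443$ while $\text{KL}(q \parallel p) \approx 1.307$.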

Basic Interpretation

Note that the basic form of (1) consists of a pointwise error $\log(p(x)/q(x))$ between the two densities, averaged over all $x$ with respect to $x \sim p$. When the densities agree at a point $x$, then $\log(p(x)/q(x)) = 0$, so no contribution is made to the integral. Similarly, sets with zero probability under $p$ make no contribution to the integral, regardless of how much the two densities differ there. Large contributions are made in regions where $q(x) \ll p(x)$ and $p(x)$ is large. Thus, the KL divergence will tend to be smaller for distributions $q$ such that (i) $q(x) \approx p(x)$ when $p(x)$ is large; or (ii) $p(x) \ll q(x)$.
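The expectation form of (1) also suggests a simple Monte Carlo approximation: draw samples from $p$ and average the pointwise log-ratio. A minimal sketch (the particular densities $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(1, 2^2)$ are illustrative choices, not from the text):

```python
import math
import random

random.seed(0)

# Illustrative densities: p = N(0, 1), q = N(1, 2^2), evaluated in log space.
def log_p(x):
    return -0.5 * x**2 - 0.5 * math.log(2 * math.pi)

def log_q(x):
    return -0.5 * ((x - 1.0) / 2.0)**2 - math.log(2.0) - 0.5 * math.log(2 * math.pi)

# KL(p || q): average the pointwise log-ratio over samples drawn from p.
n = 200_000
kl_mc = sum(log_p(x) - log_q(x) for x in (random.gauss(0.0, 1.0) for _ in range(n))) / n
```

With this many samples the estimate lands close to the exact value $\log 2 + 2/8 - 1/2 \approx 0.443$.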

Example: Gaussian Distributions

The KL divergence between two Gaussian distributions can be computed in closed form, as shown below. Note that the required condition on the densities is satisfied, since Gaussian densities are positive on $\mathbb{R}^d$.

KL Divergence between Gaussians.
Let $p(x) = \mathcal{N}(x \mid m_p, C_p)$ and $q(x) = \mathcal{N}(x \mid m_q, C_q)$ be densities on $\mathbb{R}^d$. Then,
$$\text{KL}(p \parallel q) = \frac{1}{2}\left[\log\frac{\text{det}(C_q)}{\text{det}(C_p)} - d + \text{tr}\left(C_q^{-1} C_p\right) + (m_p - m_q)^\top C_q^{-1}(m_p - m_q)\right].$$

Proof. \begin{align} \text{KL}(p \parallel q) &= \mathbb{E}_{x \sim p}\left[\log\left(\frac{p(x)}{q(x)}\right) \right] \newline &= \mathbb{E}_{x \sim p}\left[\log\left(\frac{\text{det}(C_p)^{-1/2} \exp\left[-\frac{1}{2}(x-m_p)^\top C_p^{-1}(x-m_p)\right]}{\text{det}(C_q)^{-1/2} \exp\left[-\frac{1}{2}(x-m_q)^\top C_q^{-1}(x-m_q)\right]} \right)\right] \newline &= \frac{1}{2}\log\frac{\text{det}(C_q)}{\text{det}(C_p)} - \frac{1}{2}\mathbb{E}_{x \sim p}\left[(x-m_p)^\top C_p^{-1}(x-m_p)\right] + \frac{1}{2}\mathbb{E}_{x \sim p}\left[(x-m_q)^\top C_q^{-1}(x-m_q)\right] \end{align} The first expectation equals $d$, since $\mathbb{E}_{x \sim p}\left[(x-m_p)^\top C_p^{-1}(x-m_p)\right] = \text{tr}\left(C_p^{-1}\,\mathbb{E}_{x \sim p}\left[(x-m_p)(x-m_p)^\top\right]\right) = \text{tr}(I_d) = d$. By the same trace identity, the second expectation equals $\text{tr}(C_q^{-1}C_p) + (m_p - m_q)^\top C_q^{-1}(m_p - m_q)$. Combining the three terms yields the claimed closed form.
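The closed-form KL between Gaussians, $\frac{1}{2}[\log\frac{\text{det}(C_q)}{\text{det}(C_p)} - d + \text{tr}(C_q^{-1}C_p) + (m_p - m_q)^\top C_q^{-1}(m_p - m_q)]$, lends itself to direct implementation. Below is a sketch using NumPy (the function name is my own):

```python
import numpy as np

def kl_gaussians(m_p, C_p, m_q, C_q):
    """Closed-form KL(N(m_p, C_p) || N(m_q, C_q)) between d-dimensional Gaussians.

    Illustrative implementation; the name is my own, not from the text.
    """
    d = m_p.shape[0]
    Cq_inv = np.linalg.inv(C_q)
    diff = m_p - m_q
    # slogdet is preferred over det for numerically stable log-determinants.
    _, logdet_p = np.linalg.slogdet(C_p)
    _, logdet_q = np.linalg.slogdet(C_q)
    return 0.5 * (logdet_q - logdet_p - d
                  + np.trace(Cq_inv @ C_p)
                  + diff @ Cq_inv @ diff)

# Sanity check: the divergence between identical Gaussians vanishes.
m = np.array([0.0, 1.0])
C = np.array([[2.0, 0.3], [0.3, 1.0]])
kl_same = kl_gaussians(m, C, m, C)
```

Another easy check: for diagonal covariances the result decomposes into a sum of one-dimensional divergences, matching the univariate formula coordinate by coordinate.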

Interpretations

Information Theoretic Perspective

F-Divergence

Measure-Theoretic Technicalities

In this section we generalize definition (1) by slightly loosening the assumed condition on $p$ and $q$. We view $p, q$ as Lebesgue densities of two probability measures $P, Q$ on $\mathbb{R}^d$. In (1) we assumed that $q(x) = 0 \implies p(x) = 0$ to avoid division by zero in the integrand. Since the integral is unaffected by sets of measure zero, we actually only require this implication to hold outside of sets with zero probability under $P$. The general condition we need is that $P$ be dominated by $Q$, in the sense that $$Q(B) = 0 \implies P(B) = 0 \text{ for all measurable sets } B.$$ A distribution $P$ satisfying this property is said to be absolutely continuous with respect to $Q$, denoted by $P \ll Q$. Under this condition, definition (1) still holds.
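To see why domination matters, consider an example (my own choice, not from the text): let $P = \mathcal{N}(0, 1)$ and let $Q$ be the uniform distribution on $[0, 1]$. Then

```latex
Q([2, 3]) = 0 \quad \text{while} \quad P([2, 3]) > 0,
\qquad \text{so} \qquad P \not\ll Q,
```

and $\text{KL}(P \parallel Q)$ is not covered by definition (1). In the other direction, $Q \ll P$ does hold, since the Gaussian density is positive wherever the uniform density is, so $\text{KL}(Q \parallel P)$ is well-defined.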

We can make things even more general by not requiring the existence of Lebesgue densities $p, q$.

Properties

KL Divergence is a Divergence

Unnormalized Densities

Chain Rule

Connection to Maximum Likelihood

Connection to Bayesian Inference

The KL divergence also plays an important role in the Bayesian setting. In fact, the posterior distribution can be interpreted as the solution of an optimization problem with the KL divergence as the objective function. The Bayesian setup consists of a joint distribution over $(x,y)$, with $x$ the parameter of interest and $y$ the data. We assume this joint distribution takes the form \(p(x,y) = \pi_0(x)L(x;y),\) where $\pi_0$ is the prior density over $x$, and $L(x;y) = p(y|x)$ is the likelihood function. The posterior density is then given by \(\pi(x) := p(x|y) = \frac{1}{Z} \pi_0(x)L(x;y),\) where the normalizing constant $Z$ is independent of $x$. With notation established, we now consider the KL divergence between some distribution $q$ and the posterior $\pi$. Note that, per definition (1), the expectation is taken with respect to $q$: \begin{align} \text{KL}(q \parallel \pi) &= \int \log\left(\frac{q(x)}{\pi(x)}\right) q(x) dx \newline &= \int \log\left(\frac{q(x)Z}{\pi_0(x)L(x;y)}\right) q(x) dx \newline &= \log Z + \int \log\left(\frac{q(x)}{\pi_0(x)}\right) q(x) dx - \int \log\left(L(x;y)\right) q(x) dx \newline &= \log Z + \text{KL}(q \parallel \pi_0) - \mathbb{E}_{x \sim q}[\log L(x;y)] \end{align}

If we view the negative log-likelihood \(\Phi(x) := -\log L(x;y)\) as a loss function, then we see that \(\text{KL}(q \parallel \pi) = \log Z + \text{KL}(q \parallel \pi_0) + \mathbb{E}_{x \sim q}[\Phi(x)]\) decomposes, up to the constant $\log Z$, into two terms: one penalizing discrepancy with respect to the prior, and one penalizing discrepancy in the model-data agreement. Since $\log Z$ does not depend on $q$, minimizing $\text{KL}(q \parallel \pi)$ over $q$ amounts to trading off these two penalties.
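The decomposition can be verified numerically in a conjugate setting where everything is Gaussian and available in closed form. A sketch (the prior $\mathcal{N}(0, 1)$, likelihood $y \mid x \sim \mathcal{N}(x, 1)$, observed value $y$, and approximating density $q$ are all illustrative choices of mine):

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL(N(m1, v1) || N(m2, v2)) with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) - 1 + v1 / v2 + (m1 - m2)**2 / v2)

# Conjugate model: prior pi_0 = N(0, 1), likelihood y | x ~ N(x, 1).
y = 1.5
post_m, post_v = y / 2, 0.5                            # posterior pi = N(y/2, 1/2)
log_Z = -0.5 * (math.log(2 * math.pi * 2) + y**2 / 2)  # Z = N(y | 0, 2)

# An arbitrary approximating density q = N(0.3, 0.8^2).
m_q, v_q = 0.3, 0.64

lhs = kl_gauss(m_q, v_q, post_m, post_v)               # KL(q || pi) directly
# E_q[Phi] with Phi(x) = -log L(x; y) = (y - x)^2 / 2 + log(2*pi) / 2
e_phi = 0.5 * ((y - m_q)**2 + v_q) + 0.5 * math.log(2 * math.pi)
rhs = log_Z + kl_gauss(m_q, v_q, 0.0, 1.0) + e_phi     # the decomposition
```

The two sides agree to floating-point precision.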
