Annealed Importance Sampling

MCMC
Sampling
Computational Statistics
Published

July 14, 2025

In this post we walk through the annealed importance sampling (AIS) scheme introduced in the classic paper Neal (1998). The algorithm seeks to improve upon standard IS by incorporating Markov chain Monte Carlo (MCMC) updates, and is widely used for estimating intractable expectations and normalizing constants. While I highly recommend reading Neal’s original paper, my presentation here leans towards more recent perspectives. My notation is most similar to Doucet et al. (2022).

1 Setup and Derivation

Consider a target distribution \[ \begin{align} &\pi(x) = \frac{f(x)}{Z}, &&Z = \int f(x) dx \end{align} \tag{1}\] such that \(f(x)\) can be evaluated at any input \(x \in \mathbb{R}^d\). Two common goals are to estimate expectations with respect to \(\pi\) and to estimate the intractable normalizing constant \(Z\). As with standard IS, AIS addresses these problems by returning weighted samples \((x^{(n)}, w^{(n)})\), with the weights \(w^{(n)}\) constructed to produce unbiased estimates of the quantities of interest.
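To make the setup concrete, here is a toy example of my own (not from Neal's paper): an unnormalized target on \(\mathbb{R}\) given by a two-component Gaussian mixture, whose normalizing constant we pretend is unknown. The name `log_f` and the mixture parameters are arbitrary choices that later sketches will reuse.

```python
import numpy as np

def log_f(x):
    """Unnormalized log-density of a toy target: an equal-weight mixture
    of N(3, 1) and N(-3, 1), with the normalizing constant dropped."""
    x = np.asarray(x, dtype=float)
    log_comp1 = -0.5 * (x - 3.0) ** 2   # N(3, 1) component, up to constants
    log_comp2 = -0.5 * (x + 3.0) ** 2   # N(-3, 1) component, up to constants
    return np.logaddexp(log_comp1, log_comp2) + np.log(0.5)  # stable log-sum-exp
```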

1.1 Bridging Distributions

The idea underlying annealed IS is to improve an initial proposal via a sequence of updates that move the proposal closer to \(\pi\). An initial “simple” distribution is slowly annealed towards the intractable target distribution. Let \((\pi_k)_{k=0}^{K}\) denote the sequence of intermediate distributions \(\pi_k(x) = f_k(x)/Z_k\). In this post, we will consider \(\pi_0\) to be a simple distribution from which independent samples can be drawn. The final distribution \(\pi_K = \pi\) is the target, meaning the intermediate distributions bridge from the simple distribution to \(\pi\). The canonical choice of bridging distributions is the geometric path, defined by the geometric averages \[ \begin{align} &f_k(x) := f_0(x)^{1-\beta_k} f(x)^{\beta_k}, &&0 = \beta_0 < \beta_1 < \cdots < \beta_K = 1. \end{align} \tag{2}\]
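A minimal sketch of the geometric path, working with unnormalized log-densities; the names `log_f0`, `log_f`, and `betas` are assumptions of my own rather than any standard interface.

```python
import numpy as np

def make_log_f_k(log_f0, log_f, betas):
    """Return a function (x, k) -> log f_k(x) for the geometric path
    log f_k = (1 - beta_k) * log f_0 + beta_k * log f."""
    betas = np.asarray(betas, dtype=float)
    assert betas[0] == 0.0 and betas[-1] == 1.0, "need beta_0 = 0 and beta_K = 1"

    def log_f_k(x, k):
        return (1.0 - betas[k]) * log_f0(x) + betas[k] * log_f(x)

    return log_f_k

# Example schedule with K = 50 equally spaced inverse temperatures.
betas = np.linspace(0.0, 1.0, 51)
```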

Remark

If \(\pi_0\) is a (potentially improper) uniform distribution, then the geometric path simplifies to \(f_k(x) := f(x)^{\beta_k}\). This is a common path used in simulated annealing algorithms for optimizing \(f\), motivating the word “annealed” in annealed importance sampling.

As we will see below, we require the bridging distributions to satisfy \[ f_k(x) = 0 \implies f_{k+1}(x) = 0. \tag{3}\] In other words, the supports of the distributions cannot be growing as \(\pi_0\) evolves to \(\pi_K\). This assumption is often denoted by \(\pi_K \ll \pi_{K-1} \ll \cdots \ll \pi_0\), where \(\pi_{k+1} \ll \pi_{k}\) is defined by Equation 3 and read “\(\pi_{k+1}\) is absolutely continuous with respect to \(\pi_{k}\)”.

Remark

In some treatments (including Neal’s original paper), the sequence of distributions is defined in the reverse order, such that \(\pi_K\) is the simple distribution and \(\pi_0 = \pi\). This is mostly a matter of taste. As we will see, a reversed ordering will nonetheless play a role here through the backward Markov kernels.

1.2 Markov Kernels

If we could sample directly from the \(\pi_k\) then the problem would already be solved, since \(\pi_K\) is precisely the distribution from which we hope to sample. Given that this is not possible, we will instead try to approximately track the intermediate distributions via Markov chain iterations. The importance weights will ultimately correct for the fact that we are not exactly tracking the distributions. Let \((F_k)_{k=1}^{K}\) be a collection of Markov kernels such that \(F_k\) is \(\pi_k\)-invariant. In particular, the target \(\pi\) is the stationary distribution of \(F_K\).
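For concreteness, here is a sketch of one possible choice of \(F_k\): a random-walk Metropolis kernel targeting \(\pi_k\), which is \(\pi_k\)-invariant by construction. The step size and function names are illustrative assumptions, not something AIS prescribes.

```python
import numpy as np

def rw_metropolis_step(x, log_f_k, step_size, rng):
    """One random-walk Metropolis update leaving pi_k (proportional to
    exp(log_f_k)) invariant; a single call plays the role of a draw from
    F_k(x, .). Note that invariance only requires the unnormalized log f_k."""
    x_prop = x + step_size * rng.standard_normal(np.shape(x))
    log_alpha = log_f_k(x_prop) - log_f_k(x)   # symmetric proposal, so a simple ratio
    if np.log(rng.uniform()) < log_alpha:
        return x_prop
    return x
```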

Remark

Strictly speaking, the final Markov kernel \(F_K\) is not required at all; as will emerge when we derive the algorithm, the importance weight does not depend on the final state \(x_K\). Indeed, there is no \(F_K\) in Neal’s original formulation. However, in most applications I am aware of it is included.

An IS proposal can then be generated via \[ \begin{align} &x_0 \sim \pi_0, &&x_k \sim F_k(x_{k-1}, \cdot), \qquad k = 1, 2, \dots, K. \end{align} \tag{4}\] The resulting vector \(x_{0:K} := (x_0, \dots, x_K)\) has joint distribution 1 \[ \begin{align} &q(x_{0:K}) := \frac{f^q(x_{0:K})}{Z_0}, &&f^q(x_{0:K}) := f_0(x_0)\prod_{k=1}^{K} F_k(x_{k-1}, x_k). \end{align} \tag{5}\] Letting \(q_k\) denote the marginal distribution of \(x_k\) under \(q\), we see that the final marginal is \[ q_K(x_K) := \int \pi_0(x_0)\prod_{k=1}^{K} F_k(x_{k-1}, x_k) \ dx_{0:K-1}. \tag{6}\] This final marginal can be viewed as the proposal's approximation to \(\pi\). If we were to treat \(x_K\) as a standard IS proposal for \(\pi\), we would have to compute the importance weight \(f(x_K)/q_K(x_K)\) (up to a normalizing constant). In light of Equation 6, the denominator is typically intractable, and we appear out of luck.
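The forward proposal mechanism in Equation 4 is straightforward to implement. In the sketch below, `sample_f0` draws from \(\pi_0\) and each element of `kernels` applies one draw from the corresponding \(F_k\); both names are placeholders of my own.

```python
def sample_proposal(sample_f0, kernels, rng):
    """Draw x_{0:K} ~ q by running the forward chain of Equation 4."""
    x = sample_f0(rng)            # x_0 ~ pi_0
    path = [x]
    for kernel in kernels:        # k-th kernel applies F_k(x_{k-1}, .)
        x = kernel(x, rng)
        path.append(x)
    return path
```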

1.3 Extended State Space

A clever way around this issue is to view Equation 4 as an IS proposal in an extended state space over \((x_0, \dots, x_K)\). The joint density \(q(x_{0:K})\) in Equation 5 is thus the associated proposal density. The question now becomes how to define the target distribution \(p(x_{0:K}) = f^p(x_{0:K})/Z^p\) over the extended state space. Most importantly, we will define the target such that it admits \(\pi\) as a marginal. This implies that if we formulate a valid IS algorithm in the extended state space, then we will have solved the original problem by simply extracting the relevant component from samples of the extended state. We will also choose \(p\) to have a specific Markov structure that leads \(f^p(x_{0:K})/f^q(x_{0:K})\) to have a convenient, computable form. To this end, consider \[ p(x_{0:K}) := \frac{1}{Z^p} f(x_K) \prod_{k=1}^{K} R_k(x_k, x_{k-1}), \tag{7}\] where the \(R_k\) are “backwards” Markov kernels: \(R_k\) evolves the chain backwards in time, from \(x_k\) to \(x_{k-1}\). This joint distribution admits \(\pi\) as the \(K^{\text{th}}\) marginal, since \[\begin{align} \int p(x_{0:K}) dx_{0:K-1} &= \frac{1}{Z^p} f(x_K) \int \prod_{k=1}^{K} R_k(x_k, x_{k-1}) \ dx_{0:K-1} = \frac{1}{Z^p} f(x_K), \end{align}\] which follows by integrating out \(x_0, x_1, \dots, x_{K-1}\) in turn, using the fact that each \(R_k(x_k, \cdot)\) is a probability measure by definition (integrating once more over \(x_K\) also shows that \(Z^p = Z\)). In AIS, the backward kernel \(R_k\) is chosen to be the reversal of \(F_k\). The reversed Markov kernel (with respect to the invariant distribution \(\pi_k\)) is defined by the identity \[ \pi_k(x)F_k(x,x^\prime) = \pi_k(x^\prime) R_k(x^\prime, x). \tag{8}\] Rearranging Equation 8, we obtain the explicit expression \[ R_k(x^\prime, x) = \frac{\pi_k(x)}{\pi_k(x^\prime)} F_k(x,x^\prime) = \frac{f_k(x)}{f_k(x^\prime)} F_k(x,x^\prime). \tag{9}\] As verified below, \(R_k\) is a valid Markov kernel, and it preserves the stationary distribution \(\pi_k\) when the chain is run backwards in time.

Putting aside measure-theoretic technicalities, we simply verify that \(R_k(x^\prime, \cdot)\) defines a valid probability distribution for each \(x^\prime\). This follows from \[ \begin{align} \int R_k(x^\prime, x) dx &= \int \frac{\pi_k(x)}{\pi_k(x^\prime)} F_k(x, x^\prime) dx \\ &= \frac{1}{\pi_k(x^\prime)} \int \pi_k(x) F_k(x, x^\prime) dx \\ &= \frac{\pi_k(x^\prime)}{\pi_k(x^\prime)} \\ &= 1, \end{align} \] where the penultimate equality uses the fact that \(F_k\) is \(\pi_k\)-invariant. To show that \(R_k\) is also \(\pi_k\)-invariant, see that \[ \begin{align} \int \pi_k(x^\prime) R_k(x^\prime, x) dx^\prime &= \int \pi_k(x^\prime) \frac{\pi_k(x)}{\pi_k(x^\prime)} F_k(x, x^\prime) dx^\prime \\ &= \int \pi_k(x) F_k(x, x^\prime) dx^\prime \\ &= \pi_k(x) \int F_k(x, x^\prime) dx^\prime \\ &= \pi_k(x), \end{align} \] where the final equality uses the fact that \(F_k\) is a Markov kernel.
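For a simple worked example of the reversal (my own illustration), consider the independence kernel \(F_k(x, x^\prime) = \pi_k(x^\prime)\), which proposes a fresh draw from \(\pi_k\) regardless of the current state and is clearly \(\pi_k\)-invariant. Equation 9 gives \[ R_k(x^\prime, x) = \frac{\pi_k(x)}{\pi_k(x^\prime)} F_k(x, x^\prime) = \frac{\pi_k(x)}{\pi_k(x^\prime)} \, \pi_k(x^\prime) = \pi_k(x), \] so \(R_k = F_k\) in this case, consistent with the fact that this kernel is in detailed balance with \(\pi_k\) (see the remark below).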

Remark

Certain Markov kernels (e.g., Metropolis-Hastings kernels) are explicitly constructed to satisfy detailed balance with respect to \(\pi_k\), which is precisely the condition \(F_k = R_k\). Kernels that are not in detailed balance with \(\pi_k\) require the re-weighting \(\pi_k(x)/\pi_k(x^\prime)\) in Equation 9 to construct a kernel that preserves the stationary distribution when run in reverse.

1.4 Importance Weights

The preceding sections define the target \(p\) and proposal \(q\) on the extended state space. The IS algorithm in the extended space proceeds via the steps:

  1. Sample \(x_{0:K} \sim q\).
  2. Return \((x_{0:K},w(x_{0:K}))\), where \(w(x_{0:K}) := f^p(x_{0:K})/f^q(x_{0:K})\) is the importance weight.

This algorithm will only be useful if we can easily compute the importance weight. The Markov structure in \(p\) and \(q\) yields a very convenient form for \(w\). To see this, we start by applying Equation 9 to write \(f^p\) as \[ f^p(x_{0:K}) = f^q(x_{0:K}) \frac{f(x_K)}{f_0(x_0)} \prod_{k=1}^{K} \frac{f_k(x_{k-1})}{f_k(x_k)}. \tag{10}\] This identity follows from the chain of equalities

\[ \begin{align} f^p(x_{0:K}) &= f(x_K) \prod_{k=1}^{K} R_k(x_k, x_{k-1}) \\ &= f(x_K) \prod_{k=1}^{K} \frac{f_k(x_{k-1})}{f_k(x_k)} F_k(x_{k-1},x_k) \\ &= \left[f_0(x_0) \prod_{k=1}^{K} F_k(x_{k-1},x_k) \right] \cdot \frac{f(x_K)}{f_0(x_0)} \prod_{k=1}^{K} \frac{f_k(x_{k-1})}{f_k(x_k)} \\ &= f^q(x_{0:K}) \frac{f(x_K)}{f_0(x_0)} \prod_{k=1}^{K} \frac{f_k(x_{k-1})}{f_k(x_k)} \end{align} \]

Using Equation 10, we have \[ \begin{align} \frac{f^p(x_{0:K})}{f^q(x_{0:K})} &= \frac{f(x_K)}{f_0(x_0)} \prod_{k=1}^{K} \frac{f_k(x_{k-1})}{f_k(x_k)}. \end{align} \] Since \(f(x_K) = f_K(x_K)\), the product telescopes and the importance weight simplifies to \[ w(x_{0:K}) = \frac{f^p(x_{0:K})}{f^q(x_{0:K})} = \prod_{k=0}^{K-1} \frac{f_{k+1}(x_k)}{f_k(x_k)} . \tag{11}\]

We see in Equation 11 the precise reason why the absolute continuity assumption in Equation 3 is required: it guarantees that the denominator \(f_k(x_k)\) can vanish only where the numerator \(f_{k+1}(x_k)\) also vanishes, so the density ratios never blow up and the importance weight is well-defined.

Remark

Notice that the final value \(x_K\) from the proposal \(x_{0:K} \sim q\) does not appear in the importance weight in Equation 11. This was hinted at by an earlier remark, which claimed that the algorithm would still be valid in the absence of the last kernel \(F_K\). Implicitly, this kernel contributes the term \(f_K(x_K)/f_K(x_K)\) in Equation 11.

1.5 Algorithm

The core of the AIS algorithm is the generation of independent weighted samples \((x^{(n)}_{0:K}, w(x_{0:K}^{(n)}))\). We discuss the use of these samples for estimating expectations and normalizing constants in the next section. Here, we summarize the above derivations in an algorithm. The procedure simply requires sampling \(x_{0:K} \sim q\) and then computing the weight in Equation 11. As seen below, this can be done in an online fashion, which shows that AIS can be viewed as alternating between MCMC steps and IS updates. Note that the algorithm actually returns the logarithm of the importance weight for numerical stability.

Algorithm 1: Calculate AIS weighted sample \[ \begin{array}{ll} &\textbf{Input:} && \pi_0, \ (f_k)_{k=0}^K, \ (F_k)_{k=1}^K \\ & \textbf{Output:} && \text{Weighted sample } \{x_{0:K}, \log w(x_{0:K})\} \\ \hline \\[0.1em] & x_0 \sim \pi_0 \\ & \ell_w \gets 0 \\ & \textbf{for } k = 1 \textbf{ to } K \textbf{ do} \\ & \quad \ell_w \gets \ell_w + \log f_k(x_{k-1}) - \log f_{k-1}(x_{k-1}) \\ & \quad x_k \sim F_{k}(x_{k-1}, \cdot) \\ & \textbf{end for} \\[0.1em] & x \gets (x_0, \dots, x_K) \\ & \textbf{Return:} \ (x, \ell_w) \end{array} \]
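The following is a minimal NumPy sketch of Algorithm 1, under the assumptions that \(\pi_0\) can be sampled directly, that the bridging distributions follow the geometric path, and that each \(F_k\) consists of a few random-walk Metropolis steps targeting \(\pi_k\). All names and tuning constants are illustrative choices rather than part of the algorithm's specification.

```python
import numpy as np

def ais_weighted_sample(log_f0, sample_f0, log_f, betas, n_mcmc=5, step_size=0.5, rng=None):
    """Return one AIS weighted sample (x_{0:K}, log w) following Algorithm 1.

    Uses the geometric path f_k = f_0^(1 - beta_k) * f^(beta_k) and random-walk
    Metropolis updates as the pi_k-invariant kernels F_k."""
    rng = np.random.default_rng() if rng is None else rng

    def log_f_k(x, beta):
        return (1.0 - beta) * log_f0(x) + beta * log_f(x)

    x = sample_f0(rng)          # x_0 ~ pi_0
    log_w = 0.0                 # running log importance weight
    path = [x]
    for k in range(1, len(betas)):
        # IS update: log w += log f_k(x_{k-1}) - log f_{k-1}(x_{k-1}).
        log_w += log_f_k(x, betas[k]) - log_f_k(x, betas[k - 1])
        # MCMC update: x_k ~ F_k(x_{k-1}, .), with F_k leaving pi_k invariant.
        for _ in range(n_mcmc):
            x_prop = x + step_size * rng.standard_normal(np.shape(x))
            if np.log(rng.uniform()) < log_f_k(x_prop, betas[k]) - log_f_k(x, betas[k]):
                x = x_prop
        path.append(x)
    return path, log_w
```

Averaging the weights \(w^{(n)} = \exp(\ell_w^{(n)})\) across independent runs gives an unbiased estimate of \(Z/Z_0\); with a normalized \(\pi_0\) as assumed here, this is an unbiased estimate of \(Z\) itself.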

2 Perspectives and Applications

2.1 Bayesian Inference

Throughout this post, we have provided a generic discussion of AIS with respect to the goal of characterizing some target distribution \(\pi\). A very common application is the setting where \(\pi\) is a Bayesian posterior distribution. A Bayesian model consists of a joint probability distribution over parameters \(x\) and data \(y\), typically specified as \[ \begin{equation} \rho(x,y) := \pi_0(x)\mathsf{L}(x; y) \end{equation} \] where \(\pi_0\) and \(\mathsf{L}\) are the prior distribution and likelihood function, respectively. The posterior distribution is then given by \[ \begin{equation} \pi(x) := \rho(x \mid y) = \frac{1}{Z} \pi_0(x)\mathsf{L}(x; y), \end{equation} \] which we now consider to be our target distribution in AIS. Using the notation from Equation 1, we write \(f(x) = \pi_0(x)\mathsf{L}(x; y)\) to denote the unnormalized target. 2 As the notation suggests, we will take the prior \(\pi_0\) as the initial distribution in the sequence of bridging distributions. If we consider the geometric path in Equation 2, we obtain \[ \pi_k(x) \propto \pi_0(x)^{1-\beta_k} f(x)^{\beta_k} = \pi_0(x)^{1-\beta_k} [\pi_0(x)\mathsf{L}(x; y)]^{\beta_k} = \pi_0(x) \mathsf{L}(x; y)^{\beta_k}, \] which shows that the intermediate distributions arise from a likelihood tempering schedule in this setting.
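As a sketch of likelihood tempering with a made-up conjugate Gaussian model (the data `y`, the prior scale, and the function name are all hypothetical), the tempered unnormalized log-density is simply the log-prior plus \(\beta_k\) times the log-likelihood.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical model: x ~ N(0, 3^2) prior, y_i | x ~ N(x, 1) likelihood.
y = np.array([0.8, 1.3, 0.2, 1.9, 0.7])

def log_tempered_posterior(x, beta_k):
    """Unnormalized log f_k(x) = log pi_0(x) + beta_k * log L(x; y)."""
    log_prior = norm(0.0, 3.0).logpdf(x)
    log_lik = np.sum(norm(x, 1.0).logpdf(y))
    return log_prior + beta_k * log_lik
```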

2.2 Intermediate Targets on Extended Space

To achieve the goal of producing weighted samples from \(\pi(x)\), where \(x \in \mathbb{R}^d\), AIS considers a sequence of intermediate distributions \((\pi_k)_{k=0}^{K}\) that bridge from a simple distribution \(\pi_0\) to the target \(\pi_K = \pi\). However, as we discovered above, the correctness of the method relies on the fact that the IS is actually conducted on an extended state space. We further clarify this viewpoint here, and note connections to sequential Monte Carlo (SMC).

Recall that the extended state is denoted by \(x_{0:K}\), where each entry \(x_k\) lives in \(\mathbb{R}^d\), the same space as the original variable \(x\). The extended target distribution \(p(x_{0:K})\) defined in Equation 7 can be viewed as the final distribution \(p_K\) within a sequence of extended intermediate distributions \((p_k)_{k=0}^{K}\) given by \[ \begin{equation} p_k(x_{0:k}) := \frac{1}{Z_k^p} f_k(x_k) \prod_{j=1}^{k} R_j(x_j, x_{j-1}). \end{equation} \tag{12}\]

Notice that \(p_K = p\) indeed holds. While the intermediate distributions \((\pi_k)_{k=0}^{K}\) are all defined on \(\mathbb{R}^d\), the extended intermediate distributions \((p_k)_{k=0}^{K}\) live on spaces whose dimension grows with \(k\); in particular, \(p_k\) is defined over \(\mathbb{R}^{d(k+1)}\). As the AIS weighted sample is incrementally constructed, the partial weighted sample \(\{x_{0:k}, w(x_{0:k})\}\) targets the intermediate distribution \(p_k\). These intermediate distributions are of interest only in that they help bridge to the final distribution, which is the one we ultimately care about.

The extended intermediate distributions in Equation 12 are utilized in algorithms other than AIS. For example, these same distributions are used in SMC, but they are approximated using particle filtering techniques rather than IS (e.g., see equation 1.12 in Naesseth, Lindsten, and Schön (2024)). In such settings, the \(R_k\) may be defined as other “backwards” Markov kernels, rather than restricting them to be the reversals of the \(F_k\).

References

Doucet, Arnaud, Will Grathwohl, Alexander G. D. G. Matthews, and Heiko Strathmann. 2022. “Score-Based Diffusion Meets Annealed Importance Sampling.” https://arxiv.org/abs/2208.07698.
Naesseth, Christian A., Fredrik Lindsten, and Thomas B. Schön. 2024. “Elements of Sequential Monte Carlo.” https://arxiv.org/abs/1903.04797.
Neal, Radford M. 1998. “Annealed Importance Sampling.” https://arxiv.org/abs/physics/9803008.

Footnotes

  1. Throughout this post we will assume all distributions and Markov kernels admit densities with respect to some common dominating measure (typically the Lebesgue measure).↩︎

  2. We’ll assume we can evaluate \(\pi_0(x)\) here, but as before we need only be able to evaluate a function that is proportional to it.↩︎