A Few Different Adaptive Metropolis Schemes

Random Walk Metropolis-Hastings (RWMH) is a workhorse Markov Chain Monte Carlo (MCMC) algorithm for sampling from arbitrary probability distributions. The key to getting it to work well is in choosing an adequate proposal distribution. In this post, We will consider the popular choice of a multivariate Gaussian proposal, which means that the design choice entirely rests on the covariance matrix of this Gaussian. Without prior knowledge on the geometry of the target distribution, a good choice for this covariance is often not clear. A common solution is to “learn as you go”; that is, develop algorithms that adapt the proposal covariance as the MCMC algorithm proceeds. I’m writing this post as a place to record different adaptation schemes that I find useful or interesting. I plan to iteratively contribute to this post over time as new methods pique my interest.

1 Setup and Notation

We consider sampling from a probability distribution with density \(\pi(x)\), where \(x \in \mathcal{X} \subseteq \mathbb{R}^d\). The RWMH proceeds iteratively by simulating a Markov chain that is specifically designed to converge to \(\pi\) in the limit. Suppose the chain is at state \(x\) at present. The algorithm proceeds by first proposing a new state by sampling from some proposal distribution with density \(q(x, \cdot)\). We will consider Gaussian proposals of the form \(q(x, \cdot) = \mathcal{N}(\cdot \mid x, \Sigma)\), with \(\Sigma \in \mathbb{R}^{d \times d}\) a positive definite covariance matrix. The proposal \(y \sim \mathcal{N}(y \mid x, \Sigma)\) is accepted with probability \[ \alpha(x, y) := \min\left\{1, \frac{\pi(y)}{\pi(x)} \right\}, \tag{1} \] in which case \(y\) is chosen as the subsequent state. If rejected, the next state is set to \(x\). Note that typically the ratio \(q(y,x) / q(x,y)\) would also appear in the above expression, but in the present setting \(q\) is symmetric so this ratio reduces to \(1\). Notice that the form of the acceptance probability is prescribed by the algorithm. Thus, the primary flexibility we have in designing new RWMH algorithms is in choosing the proposal \(q\), which in this case means choosing the covariance \(C\).

2 Adaptive Proposal (AP)

Coming soon

3 Adaptive Metropolis (AM)

Coming soon

4 The NIMBLE method

This next method I found by looking at the source code of NIMBLE, a probabilistic programming framework in R. I’m not sure where the NIMBLE developers got this from, so please reach out if you know. For anyone interested in checking my reading of the source code, the first place to look is the sampler_RW_block class in the file nimble/packages/nimble/R/MCMC_samplers.R.

The default RWMH sampler in NIMBLE adapts every adaptInterval number of iterations. The method uses the covariance parameterization \(\Sigma = s^2 C\), and adapts both \(s\) and \(C\) each time such an adaptation iteration is reached. I will let \(\tilde{\Sigma} = \tilde{s}^2 \tilde{C}\) denote the new proposal covariance after the update is completed. The NIMBLE updates are given by \[\begin{align} \tilde{s} &:= s \exp\left(\frac{\eta}{(t+3)^{\tau}} (\overline{a} - a^*) \right) \tag{2} \newline \tilde{C} &:= C + \frac{1}{(t+3)^\tau} (\hat{C} - C), \tag{3} \end{align}\] where

\(\hat{C}\) is the sample covariance computed over the recent history (i.e., all iterations since the previous adaptation).
\(a^*\) is the target acceptance ratio.
\(\overline{a}\) is the average acceptance ratio over the recent history.
\(t\) is the number of times that adaptation has occurred up to this point (note that this is not the number of iterations).
\(\eta\) and \(\tau\) are tuning parameters.

By default, NIMBLE sets \(a^* = .234\) if the dimension \(d\) is at least 5, and uses some other specialized defaults for the lower dimensional cases. They also fix \(\eta = 10\), but I’m writing this as another tuning parameter as it is certainly something that one could consider changing.

These updates imply that the current proposal covariance \(\Sigma = s^2 C\) is updated as \[ \tilde{\Sigma} := \exp\left(\frac{2\eta}{(t+3)^{\tau}} (\overline{a} - a^*) \right)s^2 \left[C + \frac{1}{(t+3)^\tau} (\hat{C} - C) \right]. \tag{4} \] It is not immediately obvious to me that this is a good idea. In isolation, the updates to \(s\) and \(C\) make intuitive sense. If the average acceptance ratio is larger than the target, then the exponential term will be greater than \(1\) and thus inflate the scale \(s\), encouraging farther-reaching proposals. The update to \(C\) is just a weighted average between the current \(C\) and the empirical covariance estimate over the recent history. The key question here is whether these two updates work well together in producing the composite update to \(\Sigma\). In considering this question, I find it helpful to rewrite (4) in a more convenient form. To this end, let’s denote \[ \beta^2 := \exp\left(\frac{2\eta}{(t+3)^{\tau}} (\overline{a} - a^*) \right), \] which is the scalar that’s multiplied with \(s^2\) to produce \(\tilde{s}^2\). We thus have, \[\begin{align} \tilde{\Sigma} &= \beta^2 s^2 \left[C + \frac{1}{(k+3)^\tau} (\hat{C} - C) \right] \newline &= \beta^2 s^2 \left[\frac{1}{(t+3)^\tau}\hat{C} + \left(1 - \frac{1}{(t+3)^{\tau}} \right)C \right] \newline &= \beta^2 \left[\frac{1}{(t+3)^\tau}s^2 \hat{C} + \left(1 - \frac{1}{(t+3)^{\tau}} \right)s^2C \right]. \tag{5} \end{align}\] The expression (5) makes it much easier to see what’s going on here. The update is composed of two steps:

Fixing the current scale \(s\), the algorithm takes a convex combination of \(s^2 \hat{C}\) and \(s^2 C\). This portion of the update only considers the new information provided by the empirical covariance \(\hat{C}\), and treats the scale \(s\) as fixed.
A scale adjustment is made to the convex combination produced by part 1. If the average acceptance ratio \(\overline{a}\) from the recent history agrees with the target \(a^*\), then no further update is made. This makes sense since in this case the scale of the previous proposal was already where we wanted it, and thus the only thing to do is to improve our current estimate of the posterior covariance. On the other hand, suppose \(\overline{a} > a^*\). This implies that the current proposal \(s^2 C\) may be too small, in which case \(s^2 \hat{C}\) will likely also be too small. The multiplication by \(\beta^2\) (which is greater than \(1\) in this case) thus inflates the scale of the proposal to correct for this.

The concern with these composite scale and covariance updates is that the two updates will be out of sync; we see that the NIMBLE update cleverly avoids this issue.