A Few Different Adaptive Metropolis Schemes

Adaptively updating a Gaussian proposal covariance for Random Walk Metropolis-Hastings samplers.

Random Walk Metropolis-Hastings (RWMH) is a workhorse Markov chain Monte Carlo (MCMC) algorithm for sampling from arbitrary probability distributions. The key to getting it to work well lies in choosing an adequate proposal distribution. In this post, we will consider the popular choice of a multivariate Gaussian proposal, which means that the design choice rests entirely on the covariance matrix of this Gaussian. Without prior knowledge of the geometry of the target distribution, a good choice for this covariance is often not clear. A common solution is to “learn as you go”; that is, to use algorithms that adapt the proposal covariance as the MCMC algorithm proceeds. I’m writing this post as a place to record different adaptation schemes that I find useful or interesting, and I plan to add to it over time as new methods pique my interest.

Setup and Notation

We consider sampling from a probability distribution with density $\pi(x)$, where $x \in \mathcal{X} \subseteq \mathbb{R}^d$. RWMH proceeds iteratively by simulating a Markov chain that is specifically designed to converge to $\pi$ in the limit. Suppose the chain is currently at state $x$. The algorithm first proposes a new state by sampling from some proposal distribution with density $q(x, \cdot)$. We will consider Gaussian proposals of the form $q(x, \cdot) = \mathcal{N}(\cdot \mid x, \Sigma)$, with $\Sigma \in \mathbb{R}^{d \times d}$ a positive definite covariance matrix. The proposal $y \sim \mathcal{N}(y \mid x, \Sigma)$ is accepted with probability
$$
\alpha(x, y) := \min\left\{1, \frac{\pi(y)}{\pi(x)} \right\}, \tag{1}
$$
in which case $y$ becomes the next state. If rejected, the next state is set to $x$. Note that typically the ratio $q(y, x) / q(x, y)$ would also appear in the above expression, but in the present setting $q$ is symmetric, so this ratio reduces to $1$. Notice that the form of the acceptance probability is prescribed by the algorithm. Thus, the primary flexibility we have in designing new RWMH algorithms is in choosing the proposal $q$, which in this case means choosing the covariance $\Sigma$.
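
To make this concrete, here is a minimal sketch of RWMH with a fixed Gaussian proposal covariance, written in Python/NumPy. The function name and the choice to work with the log-density (to avoid numerical underflow in the ratio in (1)) are my own; nothing here is specific to any particular library.

```python
import numpy as np

def rwmh(log_target, x0, Sigma, n_iter, rng=None):
    """Random Walk Metropolis-Hastings with a fixed Gaussian proposal N(x, Sigma).

    log_target: function returning log pi(x), possibly up to an additive constant.
    x0:         initial state, array of shape (d,).
    Sigma:      proposal covariance, array of shape (d, d).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    L = np.linalg.cholesky(Sigma)            # factor once, since Sigma is fixed
    x, log_px = np.array(x0, dtype=float), log_target(x0)
    samples = np.empty((n_iter, d))
    n_accept = 0

    for t in range(n_iter):
        y = x + L @ rng.standard_normal(d)   # propose y ~ N(x, Sigma)
        log_py = log_target(y)
        # Symmetric proposal, so the acceptance probability is min{1, pi(y)/pi(x)};
        # compare on the log scale for numerical stability.
        if np.log(rng.uniform()) < log_py - log_px:
            x, log_px = y, log_py
            n_accept += 1
        samples[t] = x

    return samples, n_accept / n_iter
```

For example, `rwmh(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 0.5 * np.eye(2), 10_000)` targets a standard bivariate Gaussian; the returned acceptance rate is a useful diagnostic when tuning $\Sigma$ by hand.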

Adaptive Proposal (AP)

Coming soon

Adaptive Metropolis (AM)

Coming soon

The NIMBLE method

I found this next method by looking at the source code of NIMBLE, a probabilistic programming framework in R. I’m not sure where the NIMBLE developers got it from, so please reach out if you know. For anyone interested in checking my reading of the source code, the first place to look is the sampler_RW_block class in the file nimble/packages/nimble/R/MCMC_samplers.R.

The default RWMH sampler in NIMBLE adapts every adaptInterval iterations. The method uses the covariance parameterization $\Sigma = s^2 C$, and adapts both $s$ and $C$ each time such an adaptation iteration is reached. I will let $\tilde{\Sigma} = \tilde{s}^2 \tilde{C}$ denote the new proposal covariance after the update is completed. The NIMBLE updates are given by
$$
\begin{align}
\tilde{s} &:= s \exp\left(\frac{\eta}{(t+3)^{\tau}} (\overline{a} - a^*) \right) \tag{2} \newline
\tilde{C} &:= C + \frac{1}{(t+3)^{\tau}} (\hat{C} - C), \tag{3}
\end{align}
$$
where $t$ is the number of adaptation steps completed so far, $\overline{a}$ is the average acceptance rate over the most recent adaptation window, $a^*$ is the target acceptance rate, $\hat{C}$ is the empirical covariance estimate computed from the recent history of the chain, and $\eta$ and $\tau$ are tuning parameters controlling the magnitude and decay of the adaptation.

By default, NIMBLE sets $a^* = 0.234$ if the dimension $d$ is at least 5, and uses some other specialized defaults in the lower-dimensional cases. They also fix $\eta = 10$, but I’m writing it as another tuning parameter since it is certainly something one could consider changing.
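
To fix ideas, here is a minimal sketch (not the actual NIMBLE R implementation) of a single adaptation step implementing updates (2) and (3). The function signature, the way the recent history and acceptance indicators are passed in, and the placeholder value for $\tau$ are my own assumptions.

```python
import numpy as np

def adapt_proposal(s, C, history, accepted, t, eta=10.0, tau=0.8, a_star=0.234):
    """One adaptation step in the style of updates (2) and (3).

    s, C     : current scale and covariance, so the proposal covariance is s^2 * C.
    history  : array of shape (adaptInterval, d) holding the recent chain states.
    accepted : 0/1 array of length adaptInterval with the recent accept/reject outcomes.
    t        : number of adaptation steps completed so far.
    eta      : step-size tuning parameter (NIMBLE fixes eta = 10).
    tau      : decay exponent; 0.8 here is just a placeholder, not a documented default.
    a_star   : target acceptance rate (NIMBLE's default is 0.234 for d >= 5).
    """
    gamma = 1.0 / (t + 3.0) ** tau
    a_bar = np.mean(accepted)                 # average acceptance rate over the window
    C_hat = np.cov(history, rowvar=False)     # empirical covariance of recent states

    s_new = s * np.exp(eta * gamma * (a_bar - a_star))   # update (2)
    C_new = C + gamma * (C_hat - C)                       # update (3)
    return s_new, C_new
```

In a full sampler, one would call something like this once every adaptInterval iterations, incrementing $t$ after each call.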

These updates imply that the current proposal covariance $\Sigma = s^2 C$ is updated as
$$
\tilde{\Sigma} := \exp\left(\frac{2\eta}{(t+3)^{\tau}} (\overline{a} - a^*) \right) s^2 \left[C + \frac{1}{(t+3)^{\tau}} (\hat{C} - C) \right]. \tag{4}
$$
It is not immediately obvious to me that this is a good idea. In isolation, the updates to $s$ and $C$ make intuitive sense. If the average acceptance ratio is larger than the target, then the exponential term will be greater than $1$ and thus inflate the scale $s$, encouraging farther-reaching proposals. The update to $C$ is just a weighted average between the current $C$ and the empirical covariance estimate over the recent history. The key question here is whether these two updates work well together in producing the composite update to $\Sigma$. In considering this question, I find it helpful to rewrite (4) in a more convenient form. To this end, let’s denote
$$
\beta^2 := \exp\left(\frac{2\eta}{(t+3)^{\tau}} (\overline{a} - a^*) \right),
$$
which is the scalar that multiplies $s^2$ to produce $\tilde{s}^2$. We thus have
$$
\begin{align}
\tilde{\Sigma} &= \beta^2 s^2 \left[C + \frac{1}{(t+3)^{\tau}} (\hat{C} - C) \right] \newline
&= \beta^2 s^2 \left[\frac{1}{(t+3)^{\tau}} \hat{C} + \left(1 - \frac{1}{(t+3)^{\tau}} \right) C \right] \newline
&= \beta^2 \left[\frac{1}{(t+3)^{\tau}} s^2 \hat{C} + \left(1 - \frac{1}{(t+3)^{\tau}} \right) s^2 C \right]. \tag{5}
\end{align}
$$
The expression (5) makes it much easier to see what’s going on here. The update is composed of two steps:

  1. Fixing the current scale $s$, the algorithm takes a convex combination of $s^2 \hat{C}$ and $s^2 C$. This portion of the update only considers the new information provided by the empirical covariance $\hat{C}$, and treats the scale $s$ as fixed.
  2. A scale adjustment is made to the convex combination produced by part 1. If the average acceptance ratio $\overline{a}$ from the recent history agrees with the target $a^*$, then no further update is made. This makes sense since in this case the scale of the previous proposal was already where we wanted it, and thus the only thing to do is to improve our current estimate of the posterior covariance. On the other hand, suppose $\overline{a} > a^*$. This implies that the current proposal $s^2 C$ may be too small, in which case $s^2 \hat{C}$ will likely also be too small. The multiplication by $\beta^2$ (which is greater than $1$ in this case) thus inflates the scale of the proposal to correct for this.

The general concern with such composite scale and covariance updates is that the two could fall out of sync; the decomposition in (5) shows that the NIMBLE update cleverly avoids this issue.
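
As a quick numerical sanity check of the algebra above, the following snippet (all values are hypothetical placeholders) verifies that the composite update (4) coincides with the two-step form (5).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for the quantities appearing in (4) and (5).
d, t, eta, tau, a_star = 3, 10, 10.0, 0.8, 0.234
s = 1.5
a_bar = 0.31                                                       # average acceptance rate
A = rng.standard_normal((d, d)); C = A @ A.T + d * np.eye(d)       # current covariance C
B = rng.standard_normal((d, d)); C_hat = B @ B.T + d * np.eye(d)   # empirical covariance C-hat

gamma = 1.0 / (t + 3.0) ** tau
beta2 = np.exp(2.0 * eta * gamma * (a_bar - a_star))

# Composite update (4): scale and covariance updates applied jointly.
Sigma_4 = beta2 * s**2 * (C + gamma * (C_hat - C))

# Two-step form (5): convex combination at the old scale, then the beta^2 rescaling.
Sigma_5 = beta2 * (gamma * s**2 * C_hat + (1.0 - gamma) * s**2 * C)

assert np.allclose(Sigma_4, Sigma_5)
```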