A Few Different Adaptive Metropolis Schemes
Adaptively updating a Gaussian proposal covariance for Random Walk Metropolis-Hastings samplers.
Random Walk Metropolis-Hastings (RWMH) is a workhorse Markov chain Monte Carlo (MCMC) algorithm for sampling from arbitrary probability distributions. The key to getting it to work well lies in choosing an adequate proposal distribution. In this post, we will consider the popular choice of a multivariate Gaussian proposal, which means that the design choice rests entirely on the covariance matrix of this Gaussian. Without prior knowledge of the geometry of the target distribution, a good choice for this covariance is often not clear. A common solution is to “learn as you go”; that is, to develop algorithms that adapt the proposal covariance as the MCMC algorithm proceeds. I’m writing this post as a place to record different adaptation schemes that I find useful or interesting. I plan to contribute to this post over time as new methods pique my interest.
Setup and Notation
We consider sampling from a probability distribution with density $\pi(x)$, where $x \in \mathbb{R}^d$. The RWMH algorithm proceeds by simulating a Markov chain that is specifically designed to converge to $\pi$ in the limit. Suppose the chain is currently at state $x$. The algorithm proceeds by first proposing a new state $\tilde{x}$ by sampling from some proposal distribution with density $q(\tilde{x} \mid x)$. We will consider Gaussian proposals of the form $q(\cdot \mid x) = \mathcal{N}(\cdot \mid x, \Sigma)$, with $\Sigma$ a positive definite covariance matrix. The proposal is accepted with probability
\begin{align}
\alpha(x, \tilde{x}) := \min\left\{1, \frac{\pi(\tilde{x})}{\pi(x)}\right\}, \tag{1}
\end{align}
in which case $\tilde{x}$ is chosen as the subsequent state. If rejected, the next state is set to $x$. Note that typically the ratio $q(x \mid \tilde{x}) / q(\tilde{x} \mid x)$ would also appear in the above expression, but in the present setting $q$ is symmetric, so this ratio reduces to $1$. Notice that the form of the acceptance probability is prescribed by the algorithm. Thus, the primary flexibility we have in designing new RWMH algorithms is in choosing the proposal $q$, which in this case means choosing the covariance $\Sigma$.
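To make the setup concrete, here is a minimal sketch of RWMH with a fixed Gaussian proposal. The function name and interface are my own inventions for illustration; this is not code from any particular library.

```python
import numpy as np

def rwmh(log_target, x0, cov, n_steps, rng=None):
    """Random walk Metropolis-Hastings with a fixed Gaussian proposal.

    log_target: function returning the log density (up to an additive constant).
    x0: initial state (1d array of length d).
    cov: proposal covariance Sigma (d x d, positive definite).
    Returns the chain of states and the empirical acceptance ratio.
    """
    rng = np.random.default_rng(rng)
    d = len(x0)
    L = np.linalg.cholesky(cov)               # factor Sigma once, up front
    x, logp = np.asarray(x0, float), log_target(x0)
    samples, n_accept = np.empty((n_steps, d)), 0
    for t in range(n_steps):
        prop = x + L @ rng.standard_normal(d)  # draw proposal from N(x, Sigma)
        logp_prop = log_target(prop)
        # Symmetric proposal: accept with probability min(1, pi(prop)/pi(x)).
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = prop, logp_prop
            n_accept += 1
        samples[t] = x
    return samples, n_accept / n_steps
```

For example, `rwmh(lambda x: -0.5 * x @ x, np.zeros(2), np.eye(2), 5000)` samples from a standard bivariate Gaussian. The acceptance ratio returned here is exactly the quantity the adaptive schemes below monitor when tuning $\Sigma$.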
Adaptive Proposal (AP)
Coming soon
Adaptive Metropolis (AM)
Coming soon
The NIMBLE method
I found this next method by looking at the
source code of
NIMBLE, a probabilistic
programming framework in R. I’m not sure where the NIMBLE developers got this scheme from,
so please reach out if you know. For anyone interested in checking my reading
of the source code, the first place to look is the sampler_RW_block
class
in the file nimble/packages/nimble/R/MCMC_samplers.R.
The default RWMH sampler in NIMBLE adapts every adaptInterval
number of iterations. The method uses the covariance parameterization
$\Sigma = s^2 C$, where $s > 0$ is a scale and $C$ is a positive definite matrix,
and adapts both $s$ and $C$ each time such an adaptation
iteration is reached. I will let $\tilde{\Sigma} := \tilde{s}^2 \tilde{C}$
denote the new proposal covariance after the update is completed.
The NIMBLE updates are given by
\begin{align}
\tilde{s} &:= s \exp\left(\frac{\eta}{(t+3)^{\tau}} (\overline{a} - a^*) \right) \tag{2} \newline
\tilde{C} &:= C + \frac{1}{(t+3)^\tau} (\hat{C} - C), \tag{3}
\end{align}
where
- $\hat{C}$ is the sample covariance computed over the recent history (i.e., all iterations since the previous adaptation).
- $a^*$ is the target acceptance ratio.
- $\overline{a}$ is the average acceptance ratio over the recent history.
- $t$ is the number of times that adaptation has occurred up to this point (note that this is not the number of iterations).
- $\eta$ and $\tau$ are tuning parameters.
By default, NIMBLE sets $a^* = 0.234$ if the dimension is at least 5, and uses some other specialized defaults for the lower-dimensional cases. They also fix $\eta = 10$, but I’m writing this as another tuning parameter as it is certainly something that one could consider changing.
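In code, one adaptation step implementing (2) and (3) might look like the following sketch. This is my own Python paraphrase, not NIMBLE's actual R implementation; the function name, argument layout, and the defaults $\eta = 10$, $\tau = 0.8$, $a^* = 0.234$ are assumptions stated for illustration.

```python
import numpy as np

def nimble_adapt(s, C, history, accepts, t, eta=10.0, tau=0.8, a_star=0.234):
    """One NIMBLE-style adaptation of the proposal covariance s^2 * C.

    history: (n, d) array of chain states since the last adaptation.
    accepts: array of 0/1 acceptance indicators over the same window.
    t: number of adaptations performed so far (not the iteration count).
    Defaults for eta, tau, a_star are assumed, mirroring the d >= 5 case.
    """
    gamma = 1.0 / (t + 3.0) ** tau         # decaying adaptation rate
    a_bar = np.mean(accepts)               # recent average acceptance ratio
    C_hat = np.cov(history, rowvar=False)  # empirical covariance over window
    s_new = s * np.exp(eta * gamma * (a_bar - a_star))  # scale update, eq. (2)
    C_new = C + gamma * (C_hat - C)                     # shape update, eq. (3)
    return s_new, C_new
```

Note that the scale update depends only on the acceptance ratio, while the shape update depends only on the visited states; the question examined below is how these two interact when composed.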
These updates imply that the current proposal covariance $\Sigma = s^2 C$ is updated as
\begin{align}
\tilde{\Sigma} := \tilde{s}^2 \tilde{C} = s^2 \exp\left(\frac{2\eta}{(t+3)^{\tau}} (\overline{a} - a^*)\right) \left[C + \frac{1}{(t+3)^\tau} (\hat{C} - C)\right]. \tag{4}
\end{align}
It is not immediately obvious to me that this is a good idea. In isolation, the updates to $s$ and $C$ make intuitive sense. If the average acceptance ratio $\overline{a}$ is larger than the target $a^*$, then the exponential term will be greater than $1$ and thus inflate the scale $s$, encouraging farther-reaching proposals. The update to $C$ is just a weighted average between the current $C$ and the empirical covariance estimate $\hat{C}$ over the recent history. The key question here is whether these two updates work well together in producing the composite update to $\Sigma$. In considering this question, I find it helpful to rewrite (4) in a more convenient form. To this end, let’s denote
\begin{align}
\beta := \exp\left(\frac{\eta}{(t+3)^{\tau}} (\overline{a} - a^*)\right),
\end{align}
which is the scalar that’s multiplied with $s$ to produce $\tilde{s}$. We thus have,
\begin{align}
\tilde{\Sigma} &= \beta^2 s^2 \left[C + \frac{1}{(t+3)^\tau} (\hat{C} - C) \right] \newline
&= \beta^2 s^2 \left[\frac{1}{(t+3)^\tau}\hat{C} + \left(1 - \frac{1}{(t+3)^{\tau}} \right)C \right] \newline
&= \beta^2 \left[\frac{1}{(t+3)^\tau}s^2 \hat{C} + \left(1 - \frac{1}{(t+3)^{\tau}} \right)s^2C \right]. \tag{5}
\end{align}
The expression (5) makes it much easier to see what’s going on here. The update is composed of two steps:
- Fixing the current scale $s$, the algorithm takes a convex combination of $s^2 C$ and $s^2 \hat{C}$. This portion of the update only considers the new information provided by the empirical covariance $\hat{C}$, and treats the scale as fixed.
- A scale adjustment $\beta^2$ is made to the convex combination produced by part 1. If the average acceptance ratio $\overline{a}$ from the recent history agrees with the target $a^*$, then $\beta = 1$ and no further update is made. This makes sense, since in this case the scale of the previous proposal was already where we wanted it, and thus the only thing to do is to improve our current estimate of the posterior covariance. On the other hand, suppose $\overline{a} > a^*$. This implies that the current proposal scale $s^2$ may be too small, in which case $s^2 \hat{C}$ will likely also be too small. The multiplication by $\beta^2$ (which is greater than $1$ in this case) thus inflates the scale of the proposal to correct for this.
A natural concern with such composite scale and covariance updates is that the two will fall out of sync; the rearrangement (5) shows that the NIMBLE update cleverly avoids this issue.
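As a quick sanity check on the algebra, we can verify numerically that the composite update (4), obtained by applying (2) and (3) and forming $\tilde{s}^2 \tilde{C}$, agrees with the rearranged form (5). All the numbers below are arbitrary, made up purely for illustration.

```python
import numpy as np

# Arbitrary illustrative values for the quantities in the update.
s, eta, tau, a_star, a_bar, t = 2.0, 10.0, 0.8, 0.234, 0.4, 7
C = np.array([[1.0, 0.3], [0.3, 2.0]])        # current covariance C
C_hat = np.array([[1.5, -0.2], [-0.2, 0.8]])  # empirical covariance C-hat

gamma = 1.0 / (t + 3.0) ** tau                # the factor 1/(t+3)^tau
beta = np.exp(eta * gamma * (a_bar - a_star)) # scalar mapping s to s-tilde

# Composite update (4): apply (2) and (3), then form s-tilde^2 * C-tilde.
s_tilde = s * beta
C_tilde = C + gamma * (C_hat - C)
lhs = s_tilde**2 * C_tilde

# Rearranged form (5): convex combination of s^2*C_hat and s^2*C, scaled by beta^2.
rhs = beta**2 * (gamma * s**2 * C_hat + (1.0 - gamma) * s**2 * C)

assert np.allclose(lhs, rhs)  # the two forms agree
```

The agreement holds for any choice of the inputs, since (5) is just an algebraic rearrangement of (4).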