Combining Samples From Non-mixing MCMC Chains
Consider a typical Bayesian model consisting of a joint data-parameter distribution $p(y, \theta)$. The goal is to characterize the posterior distribution $p(\theta \mid y)$, where $y$ is the fixed data realization. In this post, we consider the setting where $p(\theta \mid y)$ may contain many local modes. There are specialty MCMC algorithms (e.g., using tempering approaches) designed to hop between modes, but in general sampling from such distributions is very challenging. An alternative perspective on this problem is to abandon the hope of designing an algorithm that will generate exact posterior samples and instead be content with “exploring” the most important regions of the parameter space. Multiple chains of a standard MCMC algorithm can be run from different initial conditions, with the hope that collectively the chains will identify the dominant modes of the posterior, even though individually each may be stuck in a single mode. This is the perspective taken in (Yao et al., 2022). The authors observe that it is often not difficult to achieve good within-chain mixing adapted to a local region of the posterior, even when it is nearly impossible to achieve such mixing with respect to the whole distribution. In this post, I summarize a few different approaches used in the literature that adopt this perspective. This is certainly not a comprehensive review, and I will likely add to this post if I come across new ideas.
Setup
Consider the typical problem of estimating the expectation of $f(\theta)$, where $\theta \sim p(\theta \mid y)$, for some function $f$ (e.g., estimating a posterior mean or variance). This expectation is given by
$$
\mathbb{E}\left[f(\theta) \mid y\right] = \int f(\theta)\, p(\theta \mid y)\, d\theta. \tag{1}
$$
Suppose we run $K$ MCMC chains and obtain samples $\theta_k^{(1)}, \dots, \theta_k^{(N_k)}$ for each of the chains $k = 1, \dots, K$. Here, $N_k$ is the number of samples from the $k$-th chain and $\theta_k^{(n)}$ is the $n$-th draw from the $k$-th chain. We assume that each set of samples is well-mixed with respect to a local region of the posterior, and that any burn-in has already been dropped. We can think of each set $\{\theta_k^{(n)}\}_{n=1}^{N_k}$ as a summary of one of the dominant modes of the distribution, but of course in reality it may be that we have completely missed some modes or that multiple chains end up in the same mode. In any case, the question is now how to use the $K$ sets of samples to estimate (1). The problem is that even if the chains are well-mixed with respect to each mode, it is not obvious how to estimate the relative weights of the modes.
Some Basic Ideas
Perhaps the simplest approach is to weight all of the chains equally; i.e., compute the sample mean for each chain separately and then take an average of all the chain means. This gives the estimator
$$
\hat{\mu} := \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_k} \sum_{n=1}^{N_k} f\left(\theta_k^{(n)}\right). \tag{2}
$$
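To make this concrete, here is a minimal sketch of the equal-weight estimator (2), assuming the per-chain samples are stored as a list of NumPy arrays; the names `chain_samples` and `f` are illustrative, not from any particular library.

```python
import numpy as np

def equal_weight_estimate(chain_samples, f):
    """Equal-weight estimator (2): average the per-chain sample means of f.

    chain_samples: list of length K; element k is an array of shape (N_k, dim).
    f: function mapping an array of draws to an array of scalar values.
    """
    chain_means = [np.mean(f(samples)) for samples in chain_samples]
    return np.mean(chain_means)
```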
A natural generalization is to introduce weights $w_1, \dots, w_K$, summing to one, that may differ for each chain, yielding
$$
\hat{\mu}_w := \sum_{k=1}^{K} w_k \left[\frac{1}{N_k} \sum_{n=1}^{N_k} f\left(\theta_k^{(n)}\right)\right]. \tag{3}
$$
Note that (2) is a special case of (3) with $w_k = 1/K$. The question is how to choose these weights. In running the MCMC algorithm, we obtain information about the relative importance of different regions via the evaluation of the posterior density at each iteration. Let’s denote by $\ell_k^{(n)} := \log p(y \mid \theta_k^{(n)})$ and $\pi_k^{(n)} := \log\left[p(y \mid \theta_k^{(n)})\, p(\theta_k^{(n)})\right]$ the log-likelihood and log (unnormalized) posterior density evaluations from chain $k$. A natural choice for the weights might then be $w_k \propto \exp(\bar{\ell}_k)$ or $w_k \propto \exp(\bar{\pi}_k)$, where $\bar{\ell}_k$ and $\bar{\pi}_k$ denote the sample means of the $\ell_k^{(n)}$ and $\pi_k^{(n)}$, respectively; i.e.,
$$
\bar{\ell}_k := \frac{1}{N_k} \sum_{n=1}^{N_k} \ell_k^{(n)}, \qquad \bar{\pi}_k := \frac{1}{N_k} \sum_{n=1}^{N_k} \pi_k^{(n)}. \tag{4}
$$
If one of the chains got stuck in a local mode in a negligible region of parameter space, then the log-likelihood (or log posterior density) evaluations in that region will be very low, so $w_k \approx 0$.
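Below is a minimal sketch of the weighted estimator (3) with the log-likelihood-based weights from (4), normalizing the weights in log space (via log-sum-exp) to avoid overflow; the function and argument names are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def weighted_estimate(chain_samples, chain_loglik, f):
    """Weighted estimator (3) with w_k proportional to exp(mean log-likelihood); see (4).

    chain_samples: list of length K; element k is an array of shape (N_k, dim).
    chain_loglik:  list of length K; element k is the array of log-likelihood
                   evaluations along chain k, shape (N_k,).
    f: function mapping an array of draws to an array of scalar values.
    """
    chain_means = np.array([np.mean(f(samples)) for samples in chain_samples])
    log_w = np.array([np.mean(ll) for ll in chain_loglik])  # per-chain mean log-likelihood
    weights = np.exp(log_w - logsumexp(log_w))              # normalize so weights sum to 1
    return np.sum(weights * chain_means)
```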
A Simple Heuristic
We now explore a slight generalization of the above idea that was introduced in (Pritchard et al., 2000). The context in the paper is actually quite different; the heuristic is used to select the number of clusters in a clustering model. However, the idea applies nicely to the problem of combining non-mixing MCMC chains. The idea boils down to treating each chain as corresponding to a different “model” and setting each weight to an estimate of the corresponding marginal likelihood under that “model”. In reality, the only model here is $p(y, \theta) = p(y \mid \theta)\, p(\theta)$ and the marginal likelihood is
$$
p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta. \tag{5}
$$
However, we will think of there being a marginal likelihood $p_k(y)$ corresponding to each of the $K$ chains. The idea is now to produce an estimate of $p_k(y)$ using the log-likelihood evaluations $\ell_k^{(1)}, \dots, \ell_k^{(N_k)}$ and then set the weight $w_k$ to this estimate. Define the log of the likelihood appearing in (5) as
$$
\mathcal{L} := \mathcal{L}(\theta) := \log p(y \mid \theta), \tag{6}
$$
which we emphasize is random as a function of $\theta$. Assuming the individual chains are well-mixed, we can view $\ell_k^{(1)}, \dots, \ell_k^{(N_k)}$ as a set of (correlated) samples of $\mathcal{L}$. Computing a Monte Carlo estimate of $\mathbb{E}[\mathcal{L}]$ and then exponentiating the result would produce $\exp(\bar{\ell}_k)$, the same weight mentioned in the previous section. The paper (Pritchard et al., 2000) considers an alternative heuristic: assume that $\mathcal{L}$ is Gaussian distributed. We estimate the mean as in (4), but now also estimate the variance
$$
\hat{\sigma}_k^2 := \frac{1}{N_k - 1} \sum_{n=1}^{N_k} \left(\ell_k^{(n)} - \bar{\ell}_k\right)^2, \tag{7}
$$
and approximate the distribution of $\mathcal{L}$ as
$$
\mathcal{L} \sim \mathcal{N}\left(\bar{\ell}_k, \hat{\sigma}_k^2\right). \tag{8}
$$
The chain weights are then set to
$$
w_k := \exp\left(\bar{\ell}_k + \tfrac{1}{2}\hat{\sigma}_k^2\right), \tag{9}
$$
where we have used the expression for the expectation of a log-normal. As opposed to just using $\exp(\bar{\ell}_k)$, the weight (9) also takes into account the variability of the samples in each chain. In a sense, $\bar{\ell}_k$ captures the average height of the mode, while $\hat{\sigma}_k^2$ captures some notion of its width. This is completely heuristic, but at the very least it seems reasonable to take into account both the height of the modes, as well as the volume they occupy in parameter space, in estimating their relative importance.
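To illustrate, the following is a minimal sketch of the heuristic weights (9) computed from the per-chain log-likelihood traces, again normalizing in log space for numerical stability; the function name and data layout are assumptions for illustration only. The resulting weights can be plugged directly into the weighted estimator (3).

```python
import numpy as np
from scipy.special import logsumexp

def lognormal_heuristic_weights(chain_loglik):
    """Heuristic chain weights from (9): w_k proportional to
    exp(mean + 0.5 * variance) of the log-likelihood evaluations in chain k.

    chain_loglik: list of length K; element k is the array of log-likelihood
                  evaluations along chain k, shape (N_k,).
    """
    log_w = np.array([np.mean(ll) + 0.5 * np.var(ll, ddof=1) for ll in chain_loglik])
    return np.exp(log_w - logsumexp(log_w))  # normalized weights summing to 1
```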
Other resources
- A blog post by one of the authors of (Yao et al., 2022).
- Yao, Y., Vehtari, A., & Gelman, A. (2022). Stacking for Non-mixing Bayesian Computations: The Curse and Blessing of Multimodal Posteriors. Journal of Machine Learning Research, 23(79), 1–45. http://jmlr.org/papers/v23/20-1426.html
- Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959. https://pubmed.ncbi.nlm.nih.gov/10835412/