Pseudo-Marginal MCMC
Motivated by Bayesian inference, I introduce the pseudo-marginal approach to MCMC and then discuss why is works from a more generic perspective.
Pseudo-marginal Markov chain Monte Carlo (MCMC) is a variant of the Metropolis-Hastings algorithm that works without the ability to evaluate the unnormalized target density, so long as an unbiased sample of this density can be obtained for any input. In this post, we motivate the algorithm by considering a problem of Bayesian inference where the likelihood function is intractable. We then take a step back to understand why the algorithm works, and discuss the method from a more generic and rigorous viewpoint.
Pseudo-Marginal MCMC for Bayesian Inference
We start by considering a standard problem of Bayesian inference for a parameter of interest . Given a prior density and likelihood function , the unnormalized posterior density is then obtained as the product of these two quantities: With the ability to evaluate this unnormalized density, MCMC algorithms can be applied to obtain samples from the posterior distribution. However, suppose we face a situation where is intractable in the sense that it does not admit an analytic expression that can be computed for any . Suppose, though, that we can draw an unbiased sample of the quantity for any input ; that is, \begin{align} &\ell \sim P(u, \cdot), &&\mathbb{E}[\ell] = L(u), \tag{2} \end{align} where is a probability measure on the sample space for each (formally, we can think of as a Markov kernel). It turns out that this is sufficient to define an MCMC algorithm with target distribution equal ’s posterior. The algorithm that accomplishes this is referred as pseudo-marginal MCMC. A single step of this algorithm is detailed below.
Pseudo-Marginal MCMC. Let be the current state of the algorithm, with the associated unbiased likelihood sample. Let denote the proposal kernel. The next state is then determined as follows.
1. Propose a new state: 2. Draw an unbiased likelihood sample at the proposed state: 3. With probability set the new state to . Else set it to the current state .
Notice that the acceptance probability (5) is the typical Metropolis-Hastings acceptance probability but with the unbiased likelihood samples and inserted in place of and , respectively. The claim is that this algorithm defines a Markov chain with invariant distribution . To see why this is true, the trick is to view the above algorithm as a Metropolis-Hastings scheme operating on the extended state vector . In showing this, I will assume and admit densities and with respect to the same base measure for which is a density (typically, the Lebesgue or counting measure.) Now, to view the above algorithm with respect to the extended state space, start by noticing that (3) and (4) can be interpreted as a joint proposal with a Markov kernel on the product space with density Notice that is independent of . It now remains to write the acceptance probability (5) in a form that can be interpreted with respect to the extended state space. To this end, consider \begin{align} \frac{\pi_0(\tilde{u})\tilde{\ell}q(\tilde{u},u)}{\pi_0(u)\ell q(u,\tilde{u})} &= \frac{\pi_0(\tilde{u})\tilde{\ell}}{\pi_0(u)\ell} \cdot \frac{q(\tilde{u},u)p(u,\ell)}{q(u,\tilde{u})p(\tilde{u},\tilde{\ell})} \cdot \frac{p(\tilde{u},\tilde{\ell})}{p(u,\ell)} \newline &= \frac{\pi_0(\tilde{u})\tilde{\ell}p(\tilde{u},\tilde{\ell})}{\pi_0(u)\ell p(\tilde{u},\tilde{\ell})} \cdot \frac{\overline{q}(\tilde{u},\tilde{\ell};u,\ell)}{\overline{q}(u,\ell;\tilde{u},\tilde{\ell})}. \tag{8} \end{align} The second term is the proposal density ratio with respect to extended proposal . Thus, the function appearing in the numerator and denominator of the first term must be the (unnormalized) density targeted by this Metropolis-Hastings scheme. In other words, the invariant distribution implied by the above algorithm has unnormalized density Notice that is the unnormalized density (1) with the sample inserted in place of . This is multiplied by the weight , which encodes the probability of sampling at the input . Our proof of the algorithm’s correctness is concluded by noting that admits as a marginal distribution; indeed, \begin{align} \int \overline{\pi}(u,\ell)d\ell &= \int \pi_0(u)p(u,\ell)\ell d\ell = \pi_0(u) \int \ell \cdot p(u,\ell) d\ell = \pi_0(u) \mathbb{E}[\ell|u] = \pi_0(u) L(u), \tag{10} \end{align} following from the unbiasedness of the likelihood sample. This means that, in theory, we can run the above algorithm to obtain joint samples , and then simply extract the portion of these pairs to obtain the desired draws . One last thing to note is that we don’t actually need to be able to evaluate the density appearing in (8); we see in the acceptance probability (5) that we need only be able to sample from . As usual, we need to be able to evaluate the density .
A More Generic and Formal Perspective
The above idea of course extends beyond the Bayesian example. In this section, we discuss the pseudo-marginal algorithm from a more generic perspective, and fill in some of the measure-theoretic details. Let’s assume is some generic target distribution on a measurable space . We write to denote the Borel -algebra; that is, the -algebra generated by the open sets of . We assume admits a density (i.e., Radon-Nikodym derivative) with respect to some reference measure . The density need not be normalized. All densities considered throughout this section will be with respect to the same reference measure . As before, we consider intractable, but assume we can draw samples from an unbiased estimator. We could define as before such that samples drawn from are unbiased with respect to . However, note that this is equivalent to considering samples with expectation , such that is unbiased for . This seems to be a roundabout way to go around this, but for the purposes of analysis it turns out to be convenient. This is the definition used in some of the “noisy MCMC” literature (see, e.g., Medina-Aguayo et al, 2018). Thus, let’s go with this definition and define the Markov kernel such that (1) is a probability measure on for each , where ; and (2) produces weights with unit expectation: \begin{align} &w \sim P(u,\cdot), &&\mathbb{E}_{P_u}[w] = 1. \tag{11} \end{align} We use as shorthand for in the subscript. We again emphasize that the sample from (11) implies that is an unbiased estimate of . The pseudo-marginal algorithm proceeds exactly as before. We state it again below to emphasize the new notation.
Pseudo-Marginal MCMC.
1. Propose a new state: 2. Draw an unbiased weight sample at the proposed state: 3. With probability set the new state to . Else set it to the current state .
Of course, we can’t evaluate in (14), but we have just defined things this was for its theoretical benefits. In practice, we can think of drawing a sample to directly approximate . Similar to before, we can think about this algorithm as targeting an invariant distribution on the product space . The steps (12) and (13) represent a draw from the proposal kernel defined by for and .
References
- The pseudo-marginal approach for efficient Monte Carlo computations (Andrieu and Roberts, 2009)
- Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms (Andrieu and Vihola, 2015)
- Stability of Noisy Metropolis-Hastings (Medina-Aguayo et al, 2018)