Pseudo-Marginal MCMC

Motivated by Bayesian inference, I introduce the pseudo-marginal approach to MCMC and then discuss why it works from a more generic perspective.

Pseudo-marginal Markov chain Monte Carlo (MCMC) is a variant of the Metropolis-Hastings algorithm that works without the ability to evaluate the unnormalized target density, so long as an unbiased sample of this density can be obtained for any input. In this post, we motivate the algorithm by considering a problem of Bayesian inference where the likelihood function is intractable. We then take a step back to understand why the algorithm works, and discuss the method from a more generic and rigorous viewpoint.

Pseudo-Marginal MCMC for Bayesian Inference

We start by considering a standard problem of Bayesian inference for a parameter of interest $u \in \mathcal{U}$. Given a prior density $\pi_0(u)$ and likelihood function $L(u)$, the unnormalized posterior density is then obtained as the product of these two quantities:
$$\pi(u) := \pi_0(u) L(u). \tag{1}$$
With the ability to evaluate this unnormalized density, MCMC algorithms can be applied to obtain samples from the posterior distribution. However, suppose we face a situation where $L(u)$ is intractable, in the sense that it does not admit an analytic expression that can be computed for any $u$. Suppose, though, that we can draw an unbiased sample of the quantity $L(u)$ for any input $u$; that is,
$$\ell \sim P(u, \cdot), \qquad \mathbb{E}[\ell] = L(u), \tag{2}$$
where $P(u,\cdot)$ is a probability measure on the sample space $[0, \infty)$ for each $u \in \mathcal{U}$ (formally, we can think of $P$ as a Markov kernel). It turns out that this is sufficient to define an MCMC algorithm whose target distribution equals $u$'s posterior. The algorithm that accomplishes this is referred to as pseudo-marginal MCMC. A single step of this algorithm is detailed below.

Pseudo-Marginal MCMC. Let $u$ be the current state of the algorithm, with $\ell \sim P(u,\cdot)$ the associated unbiased likelihood sample. Let $Q$ denote the proposal kernel. The next state is then determined as follows.
1. Propose a new state:
$$\tilde{u} \sim Q(u, \cdot) \tag{3}$$
2. Draw an unbiased likelihood sample at the proposed state:
$$\tilde{\ell} \sim P(\tilde{u}, \cdot) \tag{4}$$
3. With probability
$$\alpha(u,\ell; \tilde{u},\tilde{\ell}) := \min\left(1, \frac{\pi_0(\tilde{u})\,\tilde{\ell}\,q(\tilde{u},u)}{\pi_0(u)\,\ell\,q(u,\tilde{u})}\right), \tag{5}$$
set the new state to $\tilde{u}$. Otherwise, set it to the current state $u$.
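To make the step above concrete, here is a minimal Python sketch of a single pseudo-marginal update. The function names (`pm_step`, `sample_lik`, and so on) are my own; the only requirement, per (2), is that `sample_lik(u)` returns an unbiased draw of $L(u)$. One detail worth highlighting: on rejection, the old likelihood sample $\ell$ is retained along with $u$ as part of the chain's state, rather than being redrawn.

```python
import numpy as np

def pm_step(u, ell, log_prior, sample_lik, propose, log_q, rng):
    """One pseudo-marginal Metropolis-Hastings step.

    u, ell     : current state and its unbiased likelihood sample
    log_prior  : u -> log pi_0(u)
    sample_lik : u -> one unbiased draw of L(u)   (a sample from P(u, .))
    propose    : u -> one draw from Q(u, .)
    log_q      : (u_from, u_to) -> log q(u_from, u_to)
    """
    u_new = propose(u)                 # step 1: propose u~ ~ Q(u, .)
    ell_new = sample_lik(u_new)        # step 2: unbiased likelihood sample at u~
    # step 3: acceptance ratio (5), with samples in place of L(u), L(u~)
    log_ratio = (log_prior(u_new) + np.log(ell_new) + log_q(u_new, u)
                 - log_prior(u) - np.log(ell) - log_q(u, u_new))
    if np.log(rng.uniform()) < log_ratio:
        return u_new, ell_new          # accept: carry the new sample forward
    return u, ell                      # reject: keep u AND its old sample
```

Keeping the rejected state's old sample $\ell$ is exactly what makes the chain a Metropolis-Hastings scheme on the extended state $(u, \ell)$, as discussed next.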

Notice that the acceptance probability (5) is the typical Metropolis-Hastings acceptance probability, but with the unbiased likelihood samples $\ell$ and $\tilde{\ell}$ inserted in place of $L(u)$ and $L(\tilde{u})$, respectively. The claim is that this algorithm defines a Markov chain with invariant distribution $\pi$. To see why this is true, the trick is to view the above algorithm as a Metropolis-Hastings scheme operating on the extended state vector $(u, \ell)$. In showing this, I will assume $P(u,\cdot)$ and $Q(u,\cdot)$ admit densities $p(u,\cdot)$ and $q(u,\cdot)$ with respect to the same base measure for which $\pi$ is a density (typically, the Lebesgue or counting measure). Now, to view the above algorithm with respect to the extended state space, start by noticing that (3) and (4) can be interpreted as a joint proposal
$$(\tilde{u},\tilde{\ell}) \sim \overline{Q}(u,\ell; \cdot, \cdot), \tag{6}$$
with $\overline{Q}$ a Markov kernel on the product space $\mathcal{U} \times [0,\infty)$ with density
$$\overline{q}(u,\ell; \tilde{u},\tilde{\ell}) := q(u,\tilde{u})\,p(\tilde{u},\tilde{\ell}). \tag{7}$$
Notice that $\overline{Q}(u,\ell; \cdot, \cdot)$ is independent of $\ell$. It now remains to write the acceptance probability (5) in a form that can be interpreted with respect to the extended state space. To this end, consider
\begin{align}
\frac{\pi_0(\tilde{u})\tilde{\ell}q(\tilde{u},u)}{\pi_0(u)\ell q(u,\tilde{u})}
&= \frac{\pi_0(\tilde{u})\tilde{\ell}}{\pi_0(u)\ell} \cdot \frac{q(\tilde{u},u)p(u,\ell)}{q(u,\tilde{u})p(\tilde{u},\tilde{\ell})} \cdot \frac{p(\tilde{u},\tilde{\ell})}{p(u,\ell)} \\
&= \frac{\pi_0(\tilde{u})\tilde{\ell}p(\tilde{u},\tilde{\ell})}{\pi_0(u)\ell p(u,\ell)} \cdot \frac{\overline{q}(\tilde{u},\tilde{\ell};u,\ell)}{\overline{q}(u,\ell;\tilde{u},\tilde{\ell})}. \tag{8}
\end{align}
The second term is the proposal density ratio with respect to the extended proposal $\overline{q}$. Thus, the function appearing in the numerator and denominator of the first term must be the (unnormalized) density targeted by this Metropolis-Hastings scheme. In other words, the invariant distribution implied by the above algorithm has unnormalized density
$$\overline{\pi}(u,\ell) := \pi_0(u)\,p(u,\ell)\,\ell. \tag{9}$$
Notice that $\pi_0(u)\ell$ is the unnormalized density (1) with the sample $\ell$ inserted in place of $L(u)$. This is multiplied by the weight $p(u,\ell)$, which encodes the probability of sampling $\ell$ at the input $u$. Our proof of the algorithm's correctness is concluded by noting that $\overline{\pi}$ admits $\pi$ as a marginal distribution; indeed,
$$\int \overline{\pi}(u,\ell)\,d\ell = \int \pi_0(u)\,p(u,\ell)\,\ell\, d\ell = \pi_0(u) \int \ell\, p(u,\ell)\, d\ell = \pi_0(u)\,\mathbb{E}[\ell \mid u] = \pi_0(u) L(u), \tag{10}$$
which follows from the unbiasedness of the likelihood sample. This means that, in theory, we can run the above algorithm to obtain joint samples $(u,\ell) \sim \overline{\pi}$, and then simply extract the $u$ portion of these pairs to obtain the desired draws $u \sim \pi$. One last thing to note is that we don't actually need to be able to evaluate the density $p(u,\ell)$ appearing in (8); we see in the acceptance probability (5) that we need only be able to sample from $P(u,\cdot)$. As usual, we do need to be able to evaluate the proposal density $q(u,\tilde{u})$.
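The marginalization identity (10) is easy to check numerically. In the sketch below, $P(u,\cdot)$ is an illustrative choice of mine: a lognormal distribution tuned so that $\mathbb{E}[\ell] = L(u)$ for a toy likelihood $L$. The integral $\int \ell\,p(u,\ell)\,d\ell$ is then computed by trapezoidal quadrature and compared against $L(u)$.

```python
import numpy as np

def L(u):
    # toy "intractable" likelihood (illustrative choice)
    return np.exp(-0.5 * (u - 1.0) ** 2)

def p(u, ell, sigma=0.5):
    # lognormal density for P(u, .), with log-mean shifted so E[ell] = L(u)
    mu = np.log(L(u)) - 0.5 * sigma**2
    return (np.exp(-0.5 * ((np.log(ell) - mu) / sigma) ** 2)
            / (ell * sigma * np.sqrt(2.0 * np.pi)))

u = 0.3
ell = np.linspace(1e-6, 20.0, 200001)        # quadrature grid over [0, inf), truncated
f = p(u, ell) * ell                          # integrand of (10), up to the prior factor
marginal = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(ell))   # trapezoidal rule
print(marginal, L(u))                        # the two agree up to quadrature error
```

Multiplying both sides by $\pi_0(u)$ recovers (10) exactly as stated.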

A More Generic and Formal Perspective

The above idea of course extends beyond the Bayesian example. In this section, we discuss the pseudo-marginal algorithm from a more generic perspective, and fill in some of the measure-theoretic details. Let's assume $\Pi$ is some generic target distribution on a measurable space $(\mathcal{U}, \mathcal{B}(\mathcal{U}))$. We write $\mathcal{B}(\mathcal{U})$ to denote the Borel $\sigma$-algebra; that is, the $\sigma$-algebra generated by the open sets of $\mathcal{U}$. We assume $\Pi$ admits a density (i.e., Radon-Nikodym derivative) $\pi$ with respect to some reference measure $\nu$. The density $\pi$ need not be normalized. All densities considered throughout this section will be with respect to the same reference measure $\nu$. As before, we consider $\pi(u)$ intractable, but assume we can draw samples from an unbiased estimator. We could define $P(u,\cdot)$ as before, such that samples drawn from $P(u,\cdot)$ are unbiased with respect to $\pi(u)$. However, note that this is equivalent to considering samples $w \sim P(u,\cdot)$ with expectation $1$, such that $w \cdot \pi(u)$ is unbiased for $\pi(u)$. This may seem like a roundabout formulation, but it turns out to be convenient for the purposes of analysis, and it is the definition used in some of the "noisy MCMC" literature (see, e.g., Medina-Aguayo et al., 2018). Thus, let's go with this definition and define the Markov kernel $P: \mathcal{U} \times \mathcal{B}(\mathcal{W}) \to [0,1]$ such that (1) $P(u,\cdot)$ is a probability measure on $(\mathcal{W},\mathcal{B}(\mathcal{W}))$ for each $u \in \mathcal{U}$, where $\mathcal{W} \subseteq [0,\infty)$; and (2) $P$ produces weights with unit expectation:
$$w \sim P(u,\cdot), \qquad \mathbb{E}_{P_u}[w] = 1. \tag{11}$$
We use $P_u$ as shorthand for $P(u,\cdot)$ in the subscript. We again emphasize that (11) implies that $w\,\pi(u)$ is an unbiased estimate of $\pi(u)$.
The pseudo-marginal algorithm proceeds exactly as before. We state it again below to emphasize the new notation.

Pseudo-Marginal MCMC.
1. Propose a new state:
$$\tilde{u} \sim Q(u, \cdot) \tag{12}$$
2. Draw an unbiased weight sample at the proposed state:
$$\tilde{w} \sim P(\tilde{u}, \cdot) \tag{13}$$
3. With probability
$$\alpha(u,w; \tilde{u},\tilde{w}) := \min\left(1, \frac{\pi(\tilde{u})\,\tilde{w}\,q(\tilde{u},u)}{\pi(u)\,w\,q(u,\tilde{u})}\right), \tag{14}$$
set the new state to $\tilde{u}$. Otherwise, set it to the current state $u$.
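A full chain in this weight formulation can be sketched as follows. The Gaussian target and the mean-one lognormal weights are illustrative choices of mine, and of course $\pi$ is only evaluable here because this is a synthetic check; in a real application the product $w\,\pi(u)$ would be drawn as a single estimate. The acceptance step implements (14), with the $q$ ratio cancelling for a symmetric proposal.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(u):
    # unnormalized target density: N(2, 1) (illustrative choice)
    return -0.5 * (u - 2.0) ** 2

def draw_weight(u, sigma=0.3):
    # P(u, .): mean-one lognormal weights, so E[w] = 1 as in (11)
    return rng.lognormal(-0.5 * sigma**2, sigma)

u, w = 0.0, draw_weight(0.0)
samples = []
for _ in range(30000):
    u_new = u + rng.normal(0.0, 1.0)     # (12): symmetric random-walk proposal
    w_new = draw_weight(u_new)           # (13): fresh unit-mean weight
    # (14): accept with probability min(1, pi(u~) w~ / (pi(u) w))
    if np.log(rng.uniform()) < log_pi(u_new) + np.log(w_new) - log_pi(u) - np.log(w):
        u, w = u_new, w_new              # reject branch keeps (u, w) unchanged
    samples.append(u)

print(np.mean(samples))                  # close to the target mean 2.0
```

Despite the noise injected by the weights, the $u$-marginal of the chain still targets $\Pi$, in line with the extended-state argument above.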

Of course, we can't evaluate $\pi(u)$ in (14); we have simply defined things this way for its theoretical benefits. In practice, we can think of drawing a sample that directly approximates $\pi(u)$, i.e., the product $w\,\pi(u)$ is sampled as a single quantity. Similar to before, we can think about this algorithm as targeting an invariant distribution on the product space $(\mathcal{U} \times \mathcal{W}, \mathcal{B}(\mathcal{U}) \otimes \mathcal{B}(\mathcal{W}))$. The steps (12) and (13) represent a draw from the proposal kernel $\overline{Q}: (\mathcal{U} \times \mathcal{W}) \times (\mathcal{B}(\mathcal{U}) \otimes \mathcal{B}(\mathcal{W})) \to [0,1]$ defined by
$$\overline{Q}(u,w; U \times W) := \int_{U} P(\tilde{u},W)\,Q(u,d\tilde{u}), \tag{15}$$
for $U \in \mathcal{B}(\mathcal{U})$ and $W \in \mathcal{B}(\mathcal{W})$.

References

  1. The pseudo-marginal approach for efficient Monte Carlo computations (Andrieu and Roberts, 2009)
  2. Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms (Andrieu and Vihola, 2015)
  3. Stability of Noisy Metropolis-Hastings (Medina-Aguayo et al., 2018)