Pseudo-marginal Markov chain Monte Carlo (MCMC) is a variant of the Metropolis-Hastings algorithm that works without the ability to evaluate the unnormalized target density, so long as an unbiased sample of this density can be obtained for any input. In this post, we motivate the algorithm by considering a problem of Bayesian inference where the likelihood function is intractable. We then take a step back to understand why the algorithm works, and discuss the method from a more generic and rigorous viewpoint.
0.1 Pseudo-Marginal MCMC for Bayesian Inference
We start by considering a standard problem of Bayesian inference for a parameter of interest \(u \in \mathcal{U}\). Given a prior density \(\pi_0(u)\) and likelihood function \(L(u)\), the unnormalized posterior density is then obtained as the product of these two quantities: \[ \pi(u) := \pi_0(u) L(u). \tag{1} \] With the ability to evaluate this unnormalized density, MCMC algorithms can be applied to obtain samples from the posterior distribution. However, suppose we face a situation where \(L(u)\) is intractable in the sense that it does not admit an analytic expression that can be computed for any \(u\). Suppose, though, that we can draw an unbiased sample of the quantity \(L(u)\) for any input \(u\); that is, \[ \begin{align} &\ell \sim P(u, \cdot), &&\mathbb{E}[\ell] = L(u), \tag{2} \end{align} \] where \(P(u,\cdot)\) is a probability measure on the sample space \([0, \infty)\) for each \(u \in \mathcal{U}\) (formally, we can think of \(P\) as a Markov kernel). It turns out that this is sufficient to define an MCMC algorithm with target distribution equal to \(u\)’s posterior. The algorithm that accomplishes this is referred as pseudo-marginal MCMC. A single step of this algorithm is detailed below.
Pseudo-Marginal MCMC. Let \(u\) be the current state of the algorithm, with \(\ell \sim P(u,\cdot)\) the associated unbiased likelihood sample. Let \(Q\) denote the proposal kernel. The next state is then determined as follows.
- Propose a new state: \[ \tilde{u} \sim Q(u, \cdot) \tag{3} \]
- Draw an unbiased likelihood sample at the proposed state: \[ \tilde{\ell} \sim P(\tilde{u}, \cdot) \tag{4} \]
- With probability \[ \alpha(u,\ell; \tilde{u},\tilde{\ell}) := \min\left(1, \frac{\pi_0(\tilde{u})\tilde{\ell}q(\tilde{u},u)}{\pi_0(u)\ell q(u,\tilde{u})} \right), \tag{5} \] set the new state to \(\tilde{u}\). Else set it to the current state \(u\).
Notice that the acceptance probability (5) is the typical Metropolis-Hastings acceptance probability but with the unbiased likelihood samples \(\ell\) and \(\tilde{\ell}\) inserted in place of \(L(u)\) and \(L(\tilde{u})\), respectively. The claim is that this algorithm defines a Markov chain with invariant distribution \(\pi\). To see why this is true, the trick is to view the above algorithm as a Metropolis-Hastings scheme operating on the extended state vector \((u, \ell)\). In showing this, I will assume \(P(u,\cdot)\) and \(Q(u,\cdot)\) admit densities \(p(u,\cdot)\) and \(q(u,\cdot)\) with respect to the same base measure for which \(\pi\) is a density (typically, the Lebesgue or counting measure). Now, to view the above algorithm with respect to the extended state space, start by noticing that (3) and (4) can be interpreted as a joint proposal \[ (\tilde{u},\tilde{\ell}) \sim \overline{Q}(u,\ell; \cdot, \cdot), \tag{6} \] with \(\overline{Q}\) a Markov kernel on the product space \(\mathcal{U} \times [0,\infty)\) with density \[ \overline{q}(u,\ell; \tilde{u},\tilde{\ell}) := q(u,\tilde{u})p(\tilde{u},\tilde{\ell}). \tag{7} \] Notice that \(\overline{Q}(u,\ell; \cdot, \cdot)\) is independent of \(\ell\). It now remains to write the acceptance probability (5) in a form that can be interpreted with respect to the extended state space. To this end, consider \[ \begin{align} \frac{\pi_0(\tilde{u})\tilde{\ell}q(\tilde{u},u)}{\pi_0(u)\ell q(u,\tilde{u})} &= \frac{\pi_0(\tilde{u})\tilde{\ell}}{\pi_0(u)\ell} \cdot \frac{q(\tilde{u},u)p(u,\ell)}{q(u,\tilde{u})p(\tilde{u},\tilde{\ell})} \cdot \frac{p(\tilde{u},\tilde{\ell})}{p(u,\ell)} \newline &= \frac{\pi_0(\tilde{u})\tilde{\ell}p(\tilde{u},\tilde{\ell})}{\pi_0(u)\ell p(u,\ell)} \cdot \frac{\overline{q}(\tilde{u},\tilde{\ell};u,\ell)}{\overline{q}(u,\ell;\tilde{u},\tilde{\ell})}. \tag{8} \end{align} \] The second term is the proposal density ratio with respect to extended proposal \(\overline{q}\). Thus, the function appearing in the numerator and denominator of the first term must be the (unnormalized) density targeted by this Metropolis-Hastings scheme. In other words, the invariant distribution implied by the above algorithm has unnormalized density \[ \overline{\pi}(u,\ell) := \pi_0(u)p(u,\ell)\ell. \tag{9} \] Notice that \(\pi_0(u)\ell\) is the unnormalized density (1) with the sample \(\ell\) inserted in place of \(L(u)\). This is multiplied by the weight \(p(u,\ell)\), which encodes the probability of sampling \(\ell\) at the input \(u\). Our proof of the algorithm’s correctness is concluded by noting that \(\overline{\pi}\) admits \(\pi\) as a marginal distribution; indeed, \[ \begin{align} \int \overline{\pi}(u,\ell)d\ell &= \int \pi_0(u)p(u,\ell)\ell d\ell = \pi_0(u) \int \ell \cdot p(u,\ell) d\ell = \pi_0(u) \mathbb{E}[\ell|u] = \pi_0(u) L(u), \tag{10} \end{align} \] following from the unbiasedness of the likelihood sample. This means that, in theory, we can run the above algorithm to obtain joint samples \((u,\ell) \sim \overline{\pi}\), and then simply extract the \(u\) portion of these pairs to obtain the desired draws \(u \sim \pi\). One last thing to note is that we don’t actually need to be able to evaluate the density \(p(u,\ell)\) appearing in (8); we see in the acceptance probability (5) that we need only be able to sample from \(P(u,\cdot)\). As usual, we need to be able to evaluate the density \(q(u,\tilde{u})\).
0.2 A More Generic Formulation
We introduced the pseudo-marginal algorithm above in the context of Bayesian inference, where the random variable \(\ell \sim P(u,\cdot)\) provides an unbiased estimate of the likelihood \(L(u)\). This idea of course extends beyond the Bayesian example, for sampling from a generic target distribution with unnormalized density \(\pi(u)\). The pseudomarginal method can be applied provided access to a non-negative, unbiased estimator of \(\pi(u)\). Let’s denote this estimator by \(\hat{\pi}(u; x)\), where \(x \sim h\). In constrast to the presentation above, we now suppose the randomness in the estimator stems from an underlying random variable \(x\) that is independent of \(u\). The dependence on \(u\) still shows up in \(\hat{\pi}(u; x)\), so this is essentially a different way of thinking of the same algorithm. As before, we target the extended distribution \[ \begin{equation} \overline{\pi}(u, x) := \hat{\pi}(u;x)h(x) \tag{11} \end{equation} \] and consider a proposal of the form \[ \begin{equation} \overline{q}(u,x; \tilde{u}, \tilde{x}) := q(u,\tilde{u})h(\tilde{x}). \tag{12} \end{equation} \] The Metropolis-Hastings acceptance ratio then simplifies to \[ \begin{equation} \frac{\hat{\pi}(\tilde{u};\tilde{x})h(\tilde{x}) q(\tilde{u},u) h(x)} {\hat{\pi}(u;x)h(x) q(u, \tilde{u}) h(\tilde{x})} = \frac{\hat{\pi}(\tilde{u};\tilde{x}) q(\tilde{u},u)} {\hat{\pi}(u;x) q(u, \tilde{u})}. \tag{13} \end{equation} \]
We have thus again constructed an implementable MCMC algorithm with the correct invariant distribution.
0.2.1 Efficiency
Despite being correct, pseudo-marginal schemes are known to suffer from poor efficiency when the variance of the estimator \(\hat{\pi}(u;x)\) is high. This is due to the fact that the value of \(x\) in the algorithm is only updated when a new value of \(u\) is accepted. If \(\hat{\pi}(u;x)\) is quite variable, then a value of \(x\) may be sampled such that \(\hat{\pi}(u;x)\) significantly overestimates \(\pi(u)\). It therefore might take many iterations to sample a new value \(\tilde{x}\) that allows \(\hat{\pi}(u;\tilde{x})\) to compete with \(\hat{\pi}(u;x)\) in the acceptance ratio. Deligiannidis, Doucet, and Pitt (2017) reports that this behavior can typically be avoided by ensuring that the variance of \(\log \hat{\pi}(u;x)\) stays between \(1\) and \(3\) in regions of high probability.
In practice, one commonly has control over the variance of the estimator but reducing the variance comes at the cost of increased computational expense per iteration. A typical example is where \(x = x_{1:n}\) represents \(n\) samples \(x_i \overset{\text{ind}}{\sim} h_i\), \(h(x) = \prod_{i=1}^{n} h_i(x)\), and \(\hat{\pi}(u;x) = \frac{1}{n} \sum_{i=1}^{n} \hat{\pi}(u;x_i)\) is a sample mean. One can therefore reduce the variance of the estimator by increasing the number of samples used to compute the mean.
0.4 Estimators based on Gaussian Random Variables
The key to implementing this method is to define a valid \(h\)-reversible Markov kernel \(P\). In general, this will be somewhat problem-specific, but many estimators \(\hat{\pi}(u; x)\) can be written as a suitable transformation of a Gaussian random variable \(x \sim \mathcal{N}(m, C)\). Deligiannidis, Doucet, and Pitt (2017) focus exclusively on this setting. A well-known method for constructing a \(\mathcal{N}(m, C)\)-reversible kernel is via a preconditioned Crank-Nicholson (pCN) update \[ \begin{equation} \tilde{x} := m + \rho(x-m) + \sqrt{1 - \rho^2} \xi, \qquad \xi \sim \mathcal{N}(0, C), \tag{16} \end{equation} \] implying \[ \begin{equation} p(x, \tilde{x}) = \mathcal{N}(\rho x + (1-\rho)m, (1-\rho^2)C). \tag{17} \end{equation} \] To see that \(P\) is reversible (and thus invariant) with respect to \(\mathcal{N}(m, C)\), notice that (16) implies the joint distribution \[ \begin{equation} \begin{bmatrix} x \\ \tilde{x} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m \\ m \end{bmatrix}, \begin{bmatrix} C & \rho C \\ \rho C & C \end{bmatrix} \right), \tag{18} \end{equation} \] which satisfies \((x,\tilde{x}) \overset{d}{=} (\tilde{x},x)\).
Since \((x,\tilde{x})\) is a linear combination of the independent Gaussian variables \(x\) and \(\xi\), it is itself Gaussian distributed. The moments of the \(x\) marginal are already known, and the remaining moments can be filled in by noting \[\begin{align} \mathbb{E}[\tilde{x}] &= m + \rho(\mathbb{E}[x]-m) + \sqrt{1 - \rho^2} \mathbb{E}[\xi] = m \\ \mathrm{Cov}[\tilde{x}] &= \rho^2 \mathrm{Cov}[x-m] + (1-\rho^2)\mathrm{Cov}(\xi) = C \\ \mathrm{Cov}[x,\tilde{x}] &= \mathrm{Cov}[x,\rho x + \sqrt{1 - \rho^2}\xi] = \rho \mathrm{Cov}[x] + \sqrt{1-\rho^2}\mathrm{Cov}[x,\xi] = \rho C. \end{align}\]
Notice that \((x,\tilde{x}) \sim h(dx)P(x,\tilde{dx})\). Therefore, the fact that \((x,\tilde{x})\) and \((\tilde{x},x)\) are equal in distribution implies the reversibility condition \(h(dx)P(x,\tilde{dx}) = h(d\tilde{x})P(\tilde{x},dx)\). Reversibility is a sufficient condition for invariance, and thus \(h(dx)P(x,d\tilde{x}) = h(d\tilde{x})\) also holds. \(\qquad \blacksquare\)
Notice from (18) that the tuning parameter \(\rho\) controls the correlation between \(x\) and \(\tilde{x}\). Since we want to induce positive correlation between these variables, \(\rho\) will typically be set to a value close to one.
As before, note that this encompasses the case where the estimator is a sample mean of the form \[ \begin{equation} \hat{\pi}(u;x) = \frac{1}{n} \sum_{i=1}^{n} \hat{\pi}(u; x_i), \qquad x_i \overset{\mathrm{iid}}{\sim} \mathcal{N}(m_0, C_0) \end{equation} \] where \(\mathbb{E}[\hat{\pi}(u; x_i)] = \pi(u)\) for each \(x_i\). The vector \(x := [x_1, \dots, x_n]\) is still Gaussian with mean \(m := [m_0, \dots, m_0]\) and block-diagonal covariance \(C := \mathrm{diag}(C_0, \dots, C_0)\). In this case, the pCN update in (16) reduces to the \(n\) updates \[ \begin{equation} \tilde{x}_i := m_0 + \rho(x_i - m_0) + \sqrt{1-\rho^2} \xi_i, \qquad \xi \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, C_0) \end{equation} \] for \(i = 1, \dots, n\).
0.5 Being a bit more rigorous
In this section, we discuss the pseudo-marginal algorithm from a more generic perspective, and fill in some of the measure-theoretic details. Let’s assume \(\Pi\) is some generic target distribution on a measurable space \((\mathcal{U}, \mathcal{B}(\mathcal{U}))\). We write \(\mathcal{B}(\mathcal{U})\) to denote the Borel \(\sigma\)-algebra; that is, the \(\sigma\)-algebra generated by the open sets of \(\mathcal{U}\). We assume \(\Pi\) admits a density (i.e., Radon-Nikodym derivative) \(\pi\) with respect to some reference measure \(\nu\). The density \(\pi\) need not be normalized. All densities considered throughout this section will be with respect to the same reference measure \(\nu\). As before, we consider \(\pi(u)\) intractable, but assume we can draw samples from an unbiased estimator. We could define \(P(u,\cdot)\) as before such that samples drawn from \(P(u,\cdot)\) are unbiased with respect to \(\pi(u)\). However, note that this is equivalent to considering samples \(w \sim P(u,\cdot)\) with expectation one, such that \(w \cdot \pi(u)\) is unbiased for \(\pi(u)\). This seems to be a roundabout way to go around this, but for the purposes of analysis it turns out to be convenient. This is the definition used in some of the “noisy MCMC” literature (see, e.g., Medina-Aguayo et al, 2018). Thus, let’s go with this definition and define the Markov kernel \(P: \mathcal{U} \to [0,1]\) such that (1) \(P(u,\cdot)\) is a probability measure on \((\mathcal{W},\mathcal{B}(\mathcal{W}))\) for each \(u \in \mathcal{U}\), where \(\mathcal{W} \subseteq [0,\infty)\); and (2) \(P\) produces weights with unit expectation: \[ \begin{align} &w \sim P(u,\cdot), &&\mathbb{E}_{P_u}[w] = 1. \tag{11} \end{align} \] We use \(P_u\) as shorthand for \(P(u,\cdot)\) in the subscript. We again emphasize that the sample \(w\) from (11) implies that \(w\pi(u)\) is an unbiased estimate of \(\pi(u)\). The pseudo-marginal algorithm proceeds exactly as before. We state it again below to emphasize the new notation.
Pseudo-Marginal MCMC.
1. Propose a new state: \[ \tilde{u} \sim Q(u, \cdot) \tag{12} \] 2. Draw an unbiased weight sample at the proposed state: \[ \tilde{w} \sim P(\tilde{u}, \cdot) \tag{13} \] 3. With probability \[ \alpha(u,w; \tilde{u},\tilde{w}) := \min\left(1, \frac{\pi(\tilde{u})\tilde{w}q(\tilde{u},u)}{\pi(u)w q(u,\tilde{u})} \right), \tag{14} \] set the new state to \(\tilde{u}\). Else set it to the current state \(u\).
Of course, we can’t evaluate \(\pi(u)\) in (14), but stating the algorithm this way is useful to study its properties. In practice, we can think of drawing a sample to directly approximate \(\pi(u)\). Similar to before, we can think about this algorithm as targeting an invariant distribution on the product space \((\mathcal{U} \times \mathcal{W}, \mathcal{B}(\mathcal{U}) \times \mathcal{B}(\mathcal{W}))\). The steps (12) and (13) represent a draw from the proposal kernel \(\overline{Q}: \mathcal{U} \times \mathcal{W} \to [0,1]\) defined by \[ \overline{Q}(u,w; U,W) := \int_{U} P(\tilde{u},W)Q(u,d\tilde{u}), \tag{15} \] for \(U \in \mathcal{B}(\mathcal{U})\) and \(W \in \mathcal{B}(\mathcal{W})\).
1 References
- The pseudo-marginal approach for efficient Monte Carlo computations (Andrieu and Roberts, 2009)
- Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms (Andrieu and Vihola, 2015)
- Stability of Noisy Metropolis-Hastings (Medina-Aguayo et al, 2018)