A typical Bayesian model consists of a joint probability distribution over a parameter \(u\) and data \(y\) of the form \[ p(u,y) = \pi_0(u)L(u;y) \tag{1}\] where \(\pi_0(u)\) is the prior density on \(u\) and \(L(u;y) = p(y \mid u)\) the likelihood. The posterior distribution is then given by \[ \pi(u) := p(u \mid y) = \frac{1}{Z}\pi_0(u)L(u;y) \tag{2}\] where \(Z\) is a normalizing constant (independent of \(u\)) that we are not typically able to compute. Fortunately, common algorithms for posterior inference such as Markov chain Monte Carlo (MCMC) only require pointwise evaluations of the unnormalized posterior density \(\pi_0(u)L(u;y)\).
In this post, we consider a class of Bayesian models that introduces an additional difficulty, rendering these standard inference algorithms infeasible. In particular, we assume a likelihood of the form \[ L(u;y) = \frac{f(y; u)}{C(u)}, \tag{3}\] such that we can evaluate \(f(y;u)\) but not the normalizing function \(C(u)\). A canonical example is the Ising model, where evaluating \(C(u)\) requires summing over all \(2^n\) configurations of \(n\) binary spins. The posterior density in this setting becomes \[ \pi(u) = \frac{1}{ZC(u)}\pi_0(u)f(y;u). \tag{4}\] Distributions of the form Equation 4 are known as doubly intractable owing to the two quantities we are unable to compute: \(Z\) and \(C(u)\). While the former does not pose a problem for typical inference algorithms, the presence of the latter is problematic.
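To make the setting concrete in what follows, here is a minimal sketch of a toy doubly intractable model. The Gaussian precision example and all function names are my own illustration, not taken from the papers discussed below: we can evaluate \(f(y;u) = \exp(-u \sum_i y_i^2 / 2)\) and simulate data exactly, while pretending the normalizing function \(C(u) = (2\pi/u)^{n/2}\) is unavailable.

```python
import numpy as np

def log_f(y, u):
    """Log of the unnormalized likelihood f(y; u) = exp(-u * sum(y^2) / 2)
    for i.i.d. zero-mean Gaussian data with precision u. The normalizing
    function C(u) = (2*pi/u)^(n/2) is treated as unknown in what follows."""
    return -0.5 * u * np.sum(y**2)

def sample_data(u, n, rng):
    """Exact draw y ~ p(. | u) = f(y; u) / C(u) -- the key requirement
    of the auxiliary variable method discussed below."""
    return rng.standard_normal(n) / np.sqrt(u)
```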
1 The Problem
Recall the basic structure of a Metropolis-Hastings algorithm. If \(u\) is the current state of the Markov chain, then a new proposed state is sampled as \(\tilde{u} \sim q(\cdot \mid u)\) from some proposal distribution \(q\). The proposed state is then accepted with probability \[ \begin{align} &\alpha(\tilde{u} \mid u) = \min\{1, r(\tilde{u} \mid u)\}, &&r(\tilde{u} \mid u) = \frac{\pi(\tilde{u}) q(u \mid \tilde{u})}{\pi(u) q(\tilde{u} \mid u)}. \end{align} \tag{5}\] If accepted, the chain moves to \(\tilde{u}\); otherwise, it remains at \(u\). A key requirement of the algorithm is the ability to compute \(r(\tilde{u} \mid u)\). Plugging the density in Equation 4 into this expression, we see the ratio simplifies to \[ r(\tilde{u} \mid u) = \frac{\pi_0(\tilde{u})f(y; \tilde{u}) q(u \mid \tilde{u})}{\pi_0(u) f(y; u)q(\tilde{u} \mid u)} \cdot \frac{C(u)}{C(\tilde{u})}. \tag{6}\] The ratio depends on the intractable quantities \(C(u)\) and \(C(\tilde{u})\), so we cannot apply the standard Metropolis-Hastings scheme in this setting.
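The following sketch makes the obstruction explicit. It implements one random-walk Metropolis-Hastings update targeting Equation 4 (the function names and signatures are my own; the symmetric Gaussian proposal makes the \(q\) ratio cancel). The line computing the difference of \(\log C\) values is exactly what we cannot evaluate in the doubly intractable setting.

```python
import numpy as np

def mh_step(u, y, log_prior, log_f, log_C, rng, step=0.5):
    """One random-walk Metropolis-Hastings update for the posterior in
    Equation 4. The symmetric proposal makes q(u | u_prop)/q(u_prop | u) = 1."""
    u_prop = u + step * rng.standard_normal()
    log_r = (log_prior(u_prop) + log_f(y, u_prop)
             - log_prior(u) - log_f(y, u)
             + log_C(u) - log_C(u_prop))   # intractable: requires C(u)
    if np.log(rng.uniform()) < log_r:
        return u_prop                       # accept
    return u                                # reject
```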
2 An Auxiliary Variable Method
In this section we summarize an auxiliary variable MCMC algorithm proposed by Møller et al. (2006) to address the doubly intractable problem. The authors show that, surprisingly, it is possible for a Markov chain to correctly target the exact posterior distribution \(\pi(u)\), despite the presence of the intractable normalizing function. The requirement of their method is the ability to draw independent realizations of data given any parameter value; i.e., to sample from the conditional \(p(y \mid u)\).
2.1 Extending the State Space
The main idea is to extend the joint probability space over \((u,y)\) in Equation 1 to a joint model over \((u,x,y)\) for some auxiliary variable \(x\). The auxiliary variable will be defined on the same space as \(y\), so we might think of it as some sort of “pseudo data”. Once we define the conditional \(p(x \mid u, y)\), then we obtain the extended model \[ p(u, x, y) := p(x \mid u, y)p(y \mid u)p(u) = p(x \mid u, y)f(y; u)\pi_0(u) / C(u). \tag{7}\] Notice that \(\pi(u) = p(u \mid y)\) is a marginal distribution of \(p(u,x \mid y)\). Therefore, if we can draw samples \((u,x) \sim p(u,x \mid y)\) then the \(u\)-component of these samples will have the desired distribution \(\pi\).
We now consider a Metropolis-Hastings algorithm targeting the extended posterior \(p(u,x \mid y)\). Letting \(q(\tilde{u},\tilde{x} \mid u,x)\) denote a proposal distribution on the extended state space, the acceptance ratio assumes the form \[ r(\tilde{u},\tilde{x} \mid u,x) = \frac{\pi_0(\tilde{u})f(y; \tilde{u}) q(u,x \mid \tilde{u},\tilde{x})}{\pi_0(u) f(y; u)q(\tilde{u},\tilde{x} \mid u,x)} \cdot \frac{C(u)}{C(\tilde{u})} \cdot \frac{p(\tilde{x} \mid \tilde{u},y)}{p(x \mid u,y)}. \tag{8}\] At present, the ratio still depends on \(C(u)/C(\tilde{u})\) and thus remains intractable.
2.2 A Clever Choice of Proposal
It would be nice to be able to choose \(p(x \mid u,y)\) such that the dependence of Equation 7 on \(C(u)\) is eliminated. However, as pointed out by Murray, Ghahramani, and MacKay (2006), no such choice of \(p(x \mid u,y)\) is known. Instead, Møller et al. (2006) show that the proposal \(q(\tilde{u}, \tilde{x} \mid u, x)\) can be chosen to eliminate the normalizing function from the acceptance ratio \(r(\tilde{u},\tilde{x} \mid u,x)\). We consider a proposal of the form \[ q(\tilde{u}, \tilde{x} \mid u, x) := q(\tilde{u} \mid u) q(\tilde{x} \mid \tilde{u}), \tag{9}\] implying a standard proposal for \(u\), followed by a proposal of the auxiliary variable that depends on \(\tilde{u}\) but not \(x\). Given this setup, the necessary choice of \(q(\tilde{x} \mid \tilde{u})\) to eliminate dependence on the normalizing function is \[ q(\tilde{x} \mid \tilde{u}) := f(\tilde{x};\tilde{u}) / C(\tilde{u}). \tag{10}\] Indeed, plugging Equation 9 and Equation 10 into Equation 8 yields \[ r(\tilde{u},\tilde{x} \mid u,x) = \frac{\pi_0(\tilde{u})f(y; \tilde{u}) q(u \mid \tilde{u})}{\pi_0(u) f(y; u)q(\tilde{u} \mid u)} \cdot \frac{p(\tilde{x} \mid \tilde{u},y)/f(\tilde{x};\tilde{u})}{p(x \mid u,y)/f(x;u)}. \tag{11}\]
To verify this, substitute the factorized proposal into Equation 8 and observe that the normalizing functions cancel: \[ \begin{align} r(\tilde{u},\tilde{x} \mid u,x) &= \frac{\pi_0(\tilde{u})f(y; \tilde{u}) q(u \mid \tilde{u}) q(x \mid u)}{\pi_0(u) f(y; u)q(\tilde{u} \mid u) q(\tilde{x} \mid \tilde{u})} \cdot \frac{C(u)}{C(\tilde{u})} \cdot \frac{p(\tilde{x} \mid \tilde{u},y)}{p(x \mid u,y)} \\ &= \frac{\pi_0(\tilde{u})f(y; \tilde{u}) q(u \mid \tilde{u}) f(x;u)/ C(u)}{\pi_0(u) f(y; u)q(\tilde{u} \mid u) f(\tilde{x};\tilde{u})/ C(\tilde{u})} \cdot \frac{C(u)}{C(\tilde{u})} \cdot \frac{p(\tilde{x} \mid \tilde{u},y)}{p(x \mid u,y)} \\ &= \frac{\pi_0(\tilde{u})f(y; \tilde{u}) q(u \mid \tilde{u}) f(x;u)}{\pi_0(u) f(y; u)q(\tilde{u} \mid u) f(\tilde{x};\tilde{u})} \cdot \frac{p(\tilde{x} \mid \tilde{u},y)}{p(x \mid u,y)} \end{align} \] Rearranging the final line gives the form stated in Equation 11.
The ratio in Equation 11 no longer involves the intractable terms! This Metropolis-Hastings scheme therefore admits \(p(u,x \mid y)\) as a stationary distribution without requiring the ability to evaluate \(C(u)\). The algorithm is thus “correct”, but its efficiency will depend heavily on the choice of the auxiliary distribution \(p(x \mid u,y)\), which is a free parameter of this method.
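Here is a sketch of the resulting update, implementing the acceptance ratio in Equation 11 with my own hypothetical function names: `log_p_aux(x, u, y)` evaluates the free auxiliary density \(\log p(x \mid u, y)\), and `sample_data` performs the exact simulation \(\tilde{x} \sim f(\cdot;\tilde{u})/C(\tilde{u})\) that the method requires.

```python
import numpy as np

def aux_mh_step(u, x, y, log_prior, log_f, sample_data, log_p_aux, rng, step=0.5):
    """One Metropolis-Hastings update of the extended state (u, x) using
    the acceptance ratio in Equation 11. A symmetric random-walk proposal
    is used for u, so its q ratio cancels."""
    u_prop = u + step * rng.standard_normal()
    if not np.isfinite(log_prior(u_prop)):
        return u, x                          # outside prior support: reject
    # Propose x ~ f(.; u_prop)/C(u_prop) by exact simulation from the model.
    x_prop = sample_data(u_prop, len(y), rng)
    log_r = (log_prior(u_prop) + log_f(y, u_prop)
             - log_prior(u) - log_f(y, u)
             + log_p_aux(x_prop, u_prop, y) - log_f(x_prop, u_prop)
             - log_p_aux(x, u, y) + log_f(x, u))
    if np.log(rng.uniform()) < log_r:
        return u_prop, x_prop
    return u, x
```

Note that \(C(u)\) is never evaluated: the only model-specific requirements are pointwise evaluation of \(f\) and the ability to simulate from \(f(\cdot;u)/C(u)\) exactly.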
2.3 Choice of Auxiliary Distribution
We now aim to build some intuition as to what the algorithm is doing, which will help inform the choice of \(p(x \mid u,y)\). The situation we find ourselves in is somewhat backwards when compared to the typical design of Metropolis-Hastings algorithms. In particular, the proposal \(q(x \mid u)\) (typically a free parameter) has been prescribed, and we instead need to choose the distribution \(p(x \mid u,y)\) (typically prescribed). Ideally, the proposal will look something like the target distribution. This intuition would lead us to set \(p(x \mid u,y) := f(x;u) / C(u)\). This is of course infeasible as it would reintroduce the normalizing function, but it does give a baseline goal to shoot for.
To further understand the workings of this algorithm, notice that the first term in Equation 11 is equal to the intractable ratio in Equation 6 except that it is missing \(C(u)/C(\tilde{u})\). The second term in Equation 11 might therefore be viewed as providing an estimate of \(C(u)/C(\tilde{u})\). Indeed, consider the random ratio \[ \begin{align} &\frac{p(x \mid u,y)}{f(x;u)}, &&x \sim f(x;u)/C(u) \end{align} \tag{12}\] which has expectation \[ \mathbb{E}\left[\frac{p(x \mid u,y)}{f(x;u)}\right] = \int \frac{p(x \mid u,y)}{f(x;u)} \frac{f(x;u)}{C(u)} dx = C(u)^{-1} \int p(x \mid u,y) dx = C(u)^{-1}. \] Therefore, the ratio in Equation 12 is a single-sample importance sampling estimate of \(C(u)^{-1}\). The second term in Equation 11 can thus be viewed as \[ \frac{p(\tilde{x} \mid \tilde{u},y)/f(\tilde{x};\tilde{u})}{p(x \mid u,y)/f(x;u)} \approx \frac{C(\tilde{u})^{-1}}{C(u)^{-1}} = \frac{C(u)}{C(\tilde{u})}, \] a biased estimate of the ratio \(C(u)/C(\tilde{u})\) derived from the two importance sampling estimates. It is interesting that the algorithm is correct despite the use of this plug-in biased estimate. This importance sampling viewpoint further strengthens our intuition that \(p(x \mid u,y)\) should be chosen to approximate \(f(x;u)/C(u)\).
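We can check this numerically with the toy Gaussian precision model sketched earlier, where \(C(u)\) is actually available in closed form for verification (all parameter values below are my own illustrative choices): averaging many single-sample ratios from Equation 12 recovers \(C(u)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, u, u_hat = 5, 1.0, 1.2          # dimension and parameter values (illustrative)

def log_f(x, v):                   # unnormalized density of the toy model
    return -0.5 * v * np.sum(x**2, axis=-1)

def log_C(v):                      # known here only to verify the estimator
    return 0.5 * n * np.log(2.0 * np.pi / v)

# Draw many x ~ f(.; u)/C(u) and form the single-sample estimates in
# Equation 12, using the auxiliary density p(x | u, y) = f(x; u_hat)/C(u_hat).
x = rng.standard_normal((100_000, n)) / np.sqrt(u)
estimates = np.exp(log_f(x, u_hat) - log_C(u_hat) - log_f(x, u))

print(estimates.mean())            # Monte Carlo average, approx 1/C(u)
print(np.exp(-log_C(u)))           # exact 1/C(u) (about 0.0101) for comparison
```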
Møller et al. (2006) give two options for choosing \(p(x \mid u,y)\). The simpler of the two is to choose \[ \begin{align} &p(x \mid u,y) := f(x;\hat{u})/C(\hat{u}), &&\hat{u} = \hat{u}(y) \end{align} \tag{13}\] where \(\hat{u}\) is a fixed estimate of \(u\) derived from the data \(y\). Recall that \(f(x;u)/C(u)\) describes the data-generating distribution as a function of the parameter \(u\). Fixing a single \(u\) will therefore be a reasonable approximation if this distribution is not strongly dependent on \(u\). Alternatively, this may also work well if the posterior support is concentrated around \(\hat{u}\), so that a reasonable approximation is only required in this neighborhood. The second approach is to construct a more sophisticated \(u\)-dependent approximation of \(f(x;u)/C(u)\). We will not consider this option here.
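One practical observation about Equation 13 (my own note, reusing the hypothetical `log_f` and `aux_mh_step` sketches above): the constant \(C(\hat{u})\) appears in both the numerator and denominator of the second factor in Equation 11 and therefore cancels, so implementing this choice requires evaluating \(f\) only.

```python
u_hat = 1.2    # a fixed point estimate derived from y (e.g., by maximum
               # pseudolikelihood); the value here is purely illustrative

def log_p_aux(x, u, y):
    """Auxiliary density of Equation 13: log p(x | u, y) = log f(x; u_hat),
    up to the additive constant -log C(u_hat), which appears in both the
    numerator and denominator of Equation 11 and therefore cancels."""
    return log_f(x, u_hat)
```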