Deriving the Metropolis-Hastings Update from the Transition Kernel
Given only the Metropolis-Hastings transition kernel, I show how to recover the Metropolis-Hastings update rule.
The Metropolis-Hastings (MH) Markov Chain Monte Carlo (MCMC) method is typically introduced in the form of a practical algorithm. In a more theoretically-oriented context, one might prove that the algorithm defines a Markov chain and derive the associated transition (i.e. probability) kernel. I have found it insightful to also work through the derivations in the reverse order; given only the transition kernel, how could one derive the well-known MH update? In other words, how can you simulate the Markov chain implied by the transition kernel? In this post, I work through the required derivations.
Setup and notation
We consider drawing samples from a target probability distribution $\Pi$ supported on a state space $\mathcal{X} \subseteq \mathbb{R}^d$ with Borel sigma algebra $\mathcal{B}(\mathcal{X})$. Let $\pi$ denote the Lebesgue density of $\Pi$; i.e., $\Pi(A) = \int_A \pi(\mathbf{x})\, d\mathbf{x}$. Let $Q$ denote the proposal kernel for the MH algorithm, with $q(\mathbf{x},\cdot)$ the Lebesgue density of the measure $Q(\mathbf{x},\cdot)$. For current state $\mathbf{x}$ and proposal $\mathbf{y}$ we recall the MH acceptance probability
\begin{align}
\alpha(\mathbf{x},\mathbf{y}) = \min\left\{1, \frac{\pi(\mathbf{y})q(\mathbf{y},\mathbf{x})}{\pi(\mathbf{x})q(\mathbf{x},\mathbf{y})}\right\}.
\end{align}
Throughout this post I will let $A \in \mathcal{B}(\mathcal{X})$ denote an arbitrary Borel set. The transition kernel implied by the MH algorithm is then given by
\begin{align}
P(\mathbf{x},A) = \int_A q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})d\mathbf{y} + \delta_{\mathbf{x}}(A) \int_{\mathcal{X}} q(\mathbf{x},\mathbf{y})[1-\alpha(\mathbf{x},\mathbf{y})] d\mathbf{y}, \tag{1}
\end{align}
where $\delta_{\mathbf{x}}$ denotes the Dirac measure centered at $\mathbf{x}$. The first term in the kernel is the probability of accepting a proposal in the set $A$, while the second accounts for the probability of rejection in the case that the current state $\mathbf{x}$ is already in $A$. I will denote the overall probability of acceptance by
\begin{align}
\overline{a}(\mathbf{x}) := \int_{\mathcal{X}} q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y}) d\mathbf{y}. \tag{2}
\end{align}
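As a concrete reference point, here is a minimal sketch of how $\alpha(\mathbf{x},\mathbf{y})$ might be evaluated in practice, working on the log scale for numerical stability. The helper names `log_pi` and `log_q` (log target and log proposal densities) are hypothetical placeholders, not part of the derivation above.

```python
import numpy as np

def mh_acceptance_prob(x, y, log_pi, log_q):
    """alpha(x, y) = min{1, pi(y) q(y, x) / (pi(x) q(x, y))}, computed on the log scale."""
    log_ratio = (log_pi(y) + log_q(y, x)) - (log_pi(x) + log_q(x, y))
    return min(1.0, np.exp(log_ratio))
```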
Mixture of Kernels
Suppose we don’t already know the MH algorithm, and are given the task of writing an algorithm to draw a sample from the distribution $P(\mathbf{x},\cdot)$ defined in (1). A reasonable place to start is to try writing $P$ as a mixture of kernels from which we already know how to sample. Let’s first try this idea, attempting to write
\begin{align}
P(\mathbf{x},A) = w P_1(\mathbf{x},A) + (1-w) P_2(\mathbf{x},A), \tag{3}
\end{align}
with $w \in [0,1]$ a mixture weight (possibly depending on $\mathbf{x}$) and $P_1$, $P_2$ transition kernels we already know how to sample from. If we are able to do this, then we could easily sample from $P(\mathbf{x},\cdot)$ via the following simple algorithm (sketched in code just after the list):
- Select $P_1$ with probability $w$, else select $P_2$.
- Sample from the selected kernel.
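A minimal sketch of this two-step recipe, assuming we are handed samplers for the two component kernels (the callables `sample_P1` and `sample_P2` below are hypothetical stand-ins):

```python
import numpy as np

def sample_kernel_mixture(x, w, sample_P1, sample_P2, rng=None):
    """Draw one sample from P(x, .) = w * P1(x, .) + (1 - w) * P2(x, .)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.uniform() < w:          # select P1 with probability w ...
        return sample_P1(x, rng)
    return sample_P2(x, rng)       # ... otherwise select P2
```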
To this end, let’s manipulate (1) to write it in the form of a kernel mixture. We have
\begin{align}
P(\mathbf{x},A)
&= \int_A q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})d\mathbf{y} + \delta_{\mathbf{x}}(A) \int_{\mathcal{X}} q(\mathbf{x},\mathbf{y})[1-\alpha(\mathbf{x},\mathbf{y})] d\mathbf{y} \newline
&= \overline{a}(\mathbf{x}) \int_A \frac{q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})}{\overline{a}(\mathbf{x})}d\mathbf{y} + \left[1-\overline{a}(\mathbf{x})\right] \delta_{\mathbf{x}}(A) \tag{4}
\end{align}
which is a kernel mixture of the form (3) with
\begin{align}
w = \overline{a}(\mathbf{x}), \qquad P_1(\mathbf{x},A)=\int_A \frac{q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})}{\overline{a}(\mathbf{x})}d\mathbf{y},
\qquad P_2(\mathbf{x},A)=\delta_{\mathbf{x}}(A).
\end{align}
All we did here was to multiply and divide by the acceptance probability $\overline{a}(\mathbf{x})$ in the first term (under typical assumptions on $\pi$ and $q$, this will be non-zero when $\mathbf{x}$ is in the support of $\pi$) and to rearrange the second term using the fact that $q(\mathbf{x},\cdot)$ is a probability density; hence, $\int_{\mathcal{X}} q(\mathbf{x},\mathbf{y})[1-\alpha(\mathbf{x},\mathbf{y})]\, d\mathbf{y} = 1 - \overline{a}(\mathbf{x})$.
Note that the mixture weight $w = \overline{a}(\mathbf{x})$ is the overall acceptance probability, and is thus in $[0,1]$ as required. Moreover, $P_2(\mathbf{x},\cdot)$ is the Dirac measure centered at $\mathbf{x}$ and is thus a valid kernel. To check that $P_1(\mathbf{x},\cdot)$ is a valid probability measure, we recall the form of $\overline{a}(\mathbf{x})$ from (2) to verify that
\begin{align}
P_1(\mathbf{x},\mathcal{X})
&= \int_{\mathcal{X}} \frac{q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})}{\overline{a}(\mathbf{x})}d\mathbf{y} \newline
&= \frac{\int_{\mathcal{X}} q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})d\mathbf{y}}{\int_{\mathcal{X}} q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})d\mathbf{y}} \newline
&= 1.
\end{align}
The other required properties (countable additivity and non-negativity) are similarly verified. Thus, $P_1(\mathbf{x},\cdot)$ is a probability measure with Lebesgue density proportional to $q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})$; the overall acceptance probability $\overline{a}(\mathbf{x})$ is the normalizing constant for this density.
This representation of the MH kernel as a mixture distribution is conceptually useful, but it does not directly help us determine a sampling algorithm. Indeed, we cannot implement the simple mixture sampling algorithm described above since (i) computing the mixture weight $\overline{a}(\mathbf{x})$ requires evaluating an intractable integral, and (ii) we don’t know how to directly sample from the probability distribution $P_1(\mathbf{x},\cdot)$ with density proportional to $q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})$. While this approach seems to be a dead end from a practical point of view, we should keep in mind that the MH algorithm derived below does sample from the mixture (4), but does so in a way that avoids having to compute the mixture weight $\overline{a}(\mathbf{x})$ or to directly sample from $P_1(\mathbf{x},\cdot)$.
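To see point (i) concretely: by (2), $\overline{a}(\mathbf{x})$ is the expectation of $\alpha(\mathbf{x},\mathbf{y})$ over proposals $\mathbf{y} \sim Q(\mathbf{x},\cdot)$, so in general it can only be approximated, e.g. by Monte Carlo. A rough sketch, assuming a Gaussian random-walk proposal (symmetric, so the proposal densities cancel in $\alpha$) and a hypothetical `log_pi` for the log target density:

```python
import numpy as np

def estimate_abar(x, log_pi, n_samples=10_000, step=1.0, rng=None):
    """Monte Carlo estimate of abar(x) = E[alpha(x, Y)] with Y ~ N(x, step^2 I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x, dtype=float))
    ys = x + step * rng.standard_normal((n_samples, x.size))
    log_ratio = np.array([log_pi(y) for y in ys]) - log_pi(x)
    return np.minimum(1.0, np.exp(log_ratio)).mean()
```

This yields a numerical estimate but no tractable closed form, which is exactly where the naive mixture algorithm stalls.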
Marginalized Mixture of Kernels
In the previous section, we feigned ignorance of the MH algorithm in order to approach the problem of simulating (1) as a generic sampling problem. We found that $P(\mathbf{x},\cdot)$ can indeed be written as a mixture of kernels, but the problem of sampling from the resulting mixture was also intractable. To take a step in the right direction, it is useful to cheat a little bit and recall some of the mechanics of the MH algorithm. The proposal $\mathbf{y}$ is accepted with probability $\alpha(\mathbf{x},\mathbf{y})$; if rejected, the next state is set to the current state $\mathbf{x}$. Thus, it seems that we should be looking for a mixture of kernels of the form
\begin{align}
\alpha(\mathbf{x},\mathbf{y})\delta_{\mathbf{y}}(A) + [1-\alpha(\mathbf{x},\mathbf{y})]\delta_{\mathbf{x}}(A). \tag{5}
\end{align}
Of course, this can’t represent the whole picture since the mixture weight $\alpha(\mathbf{x},\mathbf{y})$ in (5) depends on $\mathbf{y}$ and the proposal kernel $q$ is completely missing from the expression. The key insight is that the MH kernel can be viewed as the expectation of (5) averaged with respect to $q(\mathbf{x},\cdot)$; i.e., $P(\mathbf{x},A)$ is derived by marginalizing the mixture (5) with respect to $\mathbf{y}$. To show this, we return to the original expression (1) for the MH kernel. We have
\begin{align}
P(\mathbf{x},A)
&= \int_A q(\mathbf{x},\mathbf{y})\alpha(\mathbf{x},\mathbf{y})d\mathbf{y} + \delta_{\mathbf{x}}(A) \int_{\mathcal{X}} q(\mathbf{x},\mathbf{y})[1-\alpha(\mathbf{x},\mathbf{y})] d\mathbf{y} \newline
&= \int_{\mathcal{X}} \alpha(\mathbf{x},\mathbf{y}) \mathbf{1}(\mathbf{y} \in A) q(\mathbf{x},\mathbf{y}) d\mathbf{y} + \int_{\mathcal{X}} [1-\alpha(\mathbf{x},\mathbf{y})] \delta_{\mathbf{x}}(A) q(\mathbf{x},\mathbf{y}) d\mathbf{y} \newline
&= \int_{\mathcal{X}} \alpha(\mathbf{x},\mathbf{y}) \delta_{\mathbf{y}}(A) q(\mathbf{x},\mathbf{y}) d\mathbf{y} + \int_{\mathcal{X}} [1-\alpha(\mathbf{x},\mathbf{y})] \delta_{\mathbf{x}}(A) q(\mathbf{x},\mathbf{y}) d\mathbf{y} \newline
&= \int_{\mathcal{X}} \left[\alpha(\mathbf{x},\mathbf{y})\delta_{\mathbf{y}}(A) + [1-\alpha(\mathbf{x},\mathbf{y})] \delta_{\mathbf{x}}(A) \right] q(\mathbf{x},\mathbf{y}) d\mathbf{y}, \tag{6}
\end{align}
which is precisely the mixture (5) averaged (in $\mathbf{y}$) with respect to the weights $q(\mathbf{x},\mathbf{y})d\mathbf{y}$. We now have three different representations of the MH transition kernel $P$: (1) is the most natural to derive when starting from the MH algorithm, (4) represents $P$ as a mixture of two distributions, and (6) represents $P$ as a marginalized mixture of two distributions. It is this final representation which proves useful for developing a practical simulation algorithm.
All that remains is to recall how to sample from a marginalized distribution. First note that $\mathbf{x}$ is essentially a fixed parameter in the integral (6); the averaging is done with respect to $\mathbf{y}$. Now, if we condition on a fixed $\mathbf{y}$ as well, then the expression in brackets is simply a mixture of two distributions, which we know how to sample from. Thus, a sample can be drawn from $P(\mathbf{x},\cdot)$ via the following algorithm:
- Sample $\mathbf{y} \sim Q(\mathbf{x},\cdot)$.
- Conditional on $\mathbf{y}$, sample from the mixture $\alpha(\mathbf{x},\mathbf{y})\delta_{\mathbf{y}} + [1-\alpha(\mathbf{x},\mathbf{y})]\delta_{\mathbf{x}}$.
For the second step, the mixture can be sampled from using the simple algorithm discussed in the previous section: randomly select one of the two kernels with probabilities equal to their mixture weights, then return a sample from the selected kernel. Since sampling from the Dirac measure $\delta_{\mathbf{y}}$ simply means returning the value $\mathbf{y}$ (and similarly for $\delta_{\mathbf{x}}$), this step will simply return $\mathbf{y}$ or $\mathbf{x}$ according to their respective probabilities $\alpha(\mathbf{x},\mathbf{y})$ and $1-\alpha(\mathbf{x},\mathbf{y})$. This is precisely the MH accept-reject mechanism!
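Putting the two steps together gives one draw from $P(\mathbf{x},\cdot)$, i.e. one MH update. A minimal sketch, again assuming a Gaussian random-walk proposal (so the $q$ terms cancel in $\alpha$) and a hypothetical `log_pi`:

```python
import numpy as np

def mh_step(x, log_pi, step=1.0, rng=None):
    """One draw from P(x, .) via the marginalized-mixture representation (6)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Step 1: sample the proposal y ~ Q(x, .).
    y = x + step * rng.standard_normal(x.size)
    # Step 2: conditional on y, sample the two-point mixture
    #         alpha(x, y) * delta_y + [1 - alpha(x, y)] * delta_x.
    alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))
    return y if rng.uniform() < alpha else x   # accept (delta_y) or reject (delta_x)
```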
It might be helpful to make this more concrete by letting $X_n$ denote the value of the MCMC algorithm at iteration $n$ and letting $Y_{n+1}$ be the random variable representing the proposal. Then the above mixture corresponds to the probability
\begin{align}
\mathbb{P}(X_{n+1} \in A \mid X_n = \mathbf{x}, Y_{n+1} = \mathbf{y}) = \alpha(\mathbf{x},\mathbf{y})\delta_{\mathbf{y}}(A) + [1-\alpha(\mathbf{x},\mathbf{y})] \delta_{\mathbf{x}}(A),
\end{align}
so we can re-write (6) as
\begin{align}
P(\mathbf{x},A) = \int_{\mathcal{X}} \mathbb{P}(X_{n+1} \in A \mid X_n = \mathbf{x}, Y_{n+1} = \mathbf{y})\, q(\mathbf{x},\mathbf{y}) d\mathbf{y}.
\end{align}
Once we condition on $X_n = \mathbf{x}$ and $Y_{n+1} = \mathbf{y}$, the only remaining randomness in the probability above is coming from the selection of one of the two kernels.
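In code, this notation just says that simulating the chain amounts to iterating the one-step sampler, setting $X_{n+1}$ to a draw from $P(X_n,\cdot)$. A short sketch reusing the hypothetical `mh_step` and `log_pi` from above:

```python
import numpy as np

def run_chain(x0, log_pi, n_iter=5_000, step=1.0, rng=None):
    """Generate X_0, X_1, ..., X_{n_iter} by repeatedly sampling X_{n+1} ~ P(X_n, .)."""
    rng = np.random.default_rng() if rng is None else rng
    chain = [np.atleast_1d(np.asarray(x0, dtype=float))]
    for _ in range(n_iter):
        chain.append(mh_step(chain[-1], log_pi, step=step, rng=rng))
    return np.stack(chain)
```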
Conclusion
While all of this might appear to be overcomplicating the very simple MH algorithm, I have found it quite a worthwhile exercise to contemplate some different perspectives on the method, as well as to get some practice manipulating expressions involving probability kernels and thinking through sampling schemes. The MH transition kernel (1) can easily be derived by thinking through the mechanics of the MH algorithm. In this post, I showed in (4) how the kernel can be re-written as a mixture of two distributions and in (6) as a marginalized mixture of two distributions. It is this final expression which provides the basis for a tractable simulation algorithm.