The Measure-Theoretic Context of Bayes' Rule
I describe Bayes' rule in a measure-theoretic context, explain how it can be viewed as a non-linear operator on probability measures, and detail applications to Bayesian inverse problems.
Bayes’ rule is typically written as something like
\begin{align} p(u|y) &= \frac{p(y|u)p(u)}{p(y)} = \frac{p(y|u)p(u)}{\int p(y|u)p(u)du}, \tag{1} \end{align} describing the connection between the two conditional probability density functions $p(u|y)$ and $p(y|u)$. It is not very common to see this result cast in a more rigorous measure-theoretic setting. This post explores how this can be accomplished.
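As a quick sanity check of (1) in the discrete case, here is a minimal Python sketch; the numbers (a 1% prevalence and a hypothetical diagnostic test) are made up purely for illustration, and the variables `p_u` and `p_y_given_u` are just illustrative names.

```python
# Informal Bayes' rule (1) for a binary u (disease status) and y (positive test result).
# Hypothetical numbers, chosen only to illustrate the formula.
p_u = {True: 0.01, False: 0.99}           # prior p(u)
p_y_given_u = {True: 0.95, False: 0.10}   # likelihood p(y = positive | u)

# Numerator p(y|u)p(u) for each u, and the evidence p(y) as the sum over u.
numer = {u: p_y_given_u[u] * p_u[u] for u in p_u}
p_y = sum(numer.values())

# Posterior p(u|y): re-weight the prior by the likelihood, then normalize.
p_u_given_y = {u: numer[u] / p_y for u in p_u}
print(p_u_given_y)  # {True: ~0.0876, False: ~0.9124}
```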
Defining the Required Ingredients
Most of the work in rigorously stating Bayes' theorem goes into the setup. We therefore start by building up the necessary measure-theoretic foundation. Since Bayes' rule considers probabilistic dependencies between two random variables $U$ and $Y$, we start by rigorously defining these random variables.
Marginal Distributions
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be the underlying probability space which will be used as the foundation for all of the subsequent development. To define $U$ and $Y$ as random variables mapping from this space, we introduce the measurable Borel spaces $(\mathcal{U}, \mathcal{B}(\mathcal{U}))$ and $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$ such that the random variables are defined as measurable maps $U: \Omega \to \mathcal{U}$ and $Y: \Omega \to \mathcal{Y}$. The probability distribution (i.e., law) of $U$ is the measure $\mu_U$ on $(\mathcal{U}, \mathcal{B}(\mathcal{U}))$ given by $\mu_U(A) = \mathbb{P}(U \in A)$ for all $A \in \mathcal{B}(\mathcal{U})$. The distribution $\mu_Y$ of $Y$ is defined analogously.
Conditional Distributions
The next ingredient we need is a notion of conditional distributions in order to formally define what we mean by $p(u|y)$ and $p(y|u)$ in (1). To side-step all of the technical issues associated with rigorously defining a notion of conditional probability, it is common to simply assume that for each $u \in \mathcal{U}$, there is a valid probability distribution on $\mathcal{Y}$ which represents the conditional distribution of $Y$ given $U = u$ (and vice versa for the conditional of $U$ given $Y = y$). This assumption is nicely encoded by assuming the existence of a probability kernel (i.e., transition kernel) $P: \mathcal{U} \times \mathcal{B}(\mathcal{Y}) \to [0, 1]$. By definition, the probability kernel satisfies:
- For each $u \in \mathcal{U}$, $P(u, \cdot)$ is a probability measure on $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$, which we will denote by $P_u$.
- For each $B \in \mathcal{B}(\mathcal{Y})$, $u \mapsto P(u, B)$ is a measurable function.
While this probability kernel is generic, we will think of $P_u$ as representing the conditional distribution of $Y$ given $U = u$. To define what we mean by the other conditional, we similarly assume there is a probability kernel $\mu^Y: \mathcal{Y} \times \mathcal{B}(\mathcal{U}) \to [0, 1]$ satisfying the same two properties. We will use the notation $\mu^y$ for the probability measure that results from fixing the first argument of the kernel at $y$.
Note that since we are assuming both of these kernels exist, the statement of Bayes’ theorem presented below will take the form “If both conditional distributions exist (and are given by the probability kernels described above), then they must be related in the following way…”. This might feel a bit unsatisfying at first, but it is not difficult to show that in the common settings of interest, it is straightforward to define the required probability kernels. This is explored in the appendix.
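To make the kernel definition concrete, here is a minimal sketch in Python. The model $P_u = N(u, 1)$ is a hypothetical choice (not something assumed elsewhere in the post), and the kernel is only evaluated on intervals for simplicity; it illustrates the two defining properties above.

```python
import math

def P(u: float, B: tuple[float, float]) -> float:
    """Probability kernel for the hypothetical model Y | U = u ~ N(u, 1).

    For fixed u, B -> P(u, B) is a probability measure (restricted here to
    intervals B = (a, b) for simplicity); for fixed B, u -> P(u, B) is a
    measurable (in fact continuous) function of u.
    """
    a, b = B
    def Phi(t: float) -> float:  # standard normal CDF
        return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return Phi(b - u) - Phi(a - u)

print(P(0.0, (-1.0, 1.0)))            # ~0.6827: a probability assigned to a set
print(P(2.0, (-math.inf, math.inf)))  # 1.0: P(u, .) is a probability measure for each u
```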
Densities
The last missing ingredient is to consider probability density functions (or, more generally, Radon-Nikodym derivatives) for the probability measures introduced above. While the typical informal statement of Bayes' theorem implicitly assumes the existence of Lebesgue densities (or probability mass functions), we can generalize this by assuming that there is some $\sigma$-finite measure $\nu$ on $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$ such that $P_u$ is absolutely continuous with respect to $\nu$ for each $u \in \mathcal{U}$; i.e., $P_u \ll \nu$. We recall that absolute continuity means that $\nu(B) = 0$ implies $P_u(B) = 0$ for all $B \in \mathcal{B}(\mathcal{Y})$. Intuitively, in order to re-weight $\nu$ to obtain $P_u$, $\nu$ had better not be zero where $P_u$ is positive. In applications, the most common choices for $\nu$ are the Lebesgue measure and the counting measure, which lead to the standard presentations of Bayes' rule for continuous and discrete random variables, respectively.
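For intuition about the role of $\nu$, here is a small sketch of the discrete case, using a hypothetical Poisson model chosen only as an example: with $\nu$ the counting measure on $\{0, 1, 2, \dots\}$, the Radon-Nikodym derivative $\frac{dP_u}{d\nu}$ is just the probability mass function, and integrals against $\nu$ become sums.

```python
import math

def dP_dnu(u: float, y: int) -> float:
    """Radon-Nikodym derivative of the hypothetical model P_u = Poisson(u)
    with respect to the counting measure nu: this is simply the pmf."""
    return math.exp(-u) * u**y / math.factorial(y)

# P_u(B) = integral over B of dP_u/dnu with respect to nu = a sum of pmf values.
u, B = 3.0, range(0, 5)
print(sum(dP_dnu(u, y) for y in B))  # P(Y in {0,...,4} | U = 3) ~ 0.815
```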
The Statistical Interpretation
I was careful to keep things quite generic above; at its core, Bayes' theorem concerns the joint and conditional distributions of two random variables $U$ and $Y$. However, the result is most commonly seen applied in the field of Bayesian statistics, so I take a moment to map the above definitions onto their common Bayesian interpretations. In a Bayesian context, the random variable $Y$ is the data, while $U$ is the parameter in the statistical model being considered. The measure $\mu_U$ thus represents the prior distribution on the parameter. The probability kernel $P$ formalizes the notion of a parametric statistical model. Indeed, in a standard parametric statistical setting, each fixed value $u$ of the parameter yields a different data-generating process; this data-generating process is encoded by $P_u$. The Radon-Nikodym derivative $\frac{dP_u}{d\nu}(y)$ (when it exists) is typically referred to as the likelihood function when viewed as a function of $u$ with the data $y$ held fixed. From this point forward I'll fall back on the Bayesian terminology since it is convenient and often enlightening, but of course Bayes' theorem is agnostic to how the probability distributions are interpreted.
Finally, Bayes’ theorem
With all of this setup out of the way, we proceed to the main result. As noted, Bayes' theorem provides a link between the two conditional distributions which are encoded by the kernels $P$ and $\mu^Y$. To be more specific, the result provides a connection between Radon-Nikodym derivatives (i.e., densities) of these distributions. For example, the informal statement (1) links the Lebesgue densities $p(u|y)$ and $p(y|u)$. We will discuss the rigorous treatment of this case later on, but it turns out that Bayes' theorem is more naturally stated in terms of $\frac{d\mu^y}{d\mu_U}$, the Radon-Nikodym derivative of the posterior with respect to the prior. If you are only familiar with statements like (1), this formulation might feel a bit weird at first, but it is a very nice way of looking at things. Recall that the Radon-Nikodym derivative is the function that provides the correct weights to transform one measure into another. In this case, $\frac{d\mu^y}{d\mu_U}$ describes how to re-weight the prior in order to obtain the posterior. Since the prior-to-posterior transformation is due to conditioning on the data, we would expect $\frac{d\mu^y}{d\mu_U}$ to be closely related to the likelihood. As we'll see below, it is actually exactly proportional to the likelihood! Once you wade through all of the notation, I hope that this intuition makes the theorem itself feel quite natural.
Bayes’ Theorem. Under the assumptions outlined in the preceding sections,
- The posterior is absolutely continuous with respect to the prior: \begin{align} \mu^y \ll \mu_U \text{ for all } y \in \mathcal{Y}. \end{align}
- The Radon-Nikodym derivative of the posterior with respect to the prior is proportional to the likelihood: \begin{align} \frac{d\mu^y}{d\mu_U}(u) \propto \frac{dP_u}{d\nu}(y). \end{align} More precisely, and including the normalization constant, the posterior measure is given by
\begin{align} \mu^y(A) &= \mathbb{P}(U \in A|Y=y) = \frac{\int_A \frac{dP_u}{d\nu}(y) \mu_U(du)}{\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du)}, \end{align} for $A \in \mathcal{B}(\mathcal{U})$, and for all $y \in \mathcal{Y}$ such that the denominator is not $0$ or infinity.
As noted above, we see that Bayes' theorem describes the change-of-measure that maps the prior $\mu_U$ into the posterior $\mu^y$ through conditioning on the data $y$. The theorem shows that this map simply involves re-weighting the prior by the likelihood, and then normalizing the result.
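The re-weight-and-normalize structure of the theorem can be illustrated numerically. The sketch below uses a hypothetical conjugate Gaussian model (chosen only so the exact posterior is available for comparison): it approximates $\mu^y(A)$ by drawing samples from the prior $\mu_U$, weighting each draw by the likelihood $\frac{dP_u}{d\nu}(y)$, and normalizing, which is just a Monte Carlo version of the displayed formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: prior U ~ N(0, 1), likelihood Y | U = u ~ N(u, 1), observed y.
y = 1.5
u_samples = rng.normal(loc=0.0, scale=1.0, size=200_000)              # draws from the prior mu_U
likelihood = np.exp(-0.5 * (y - u_samples) ** 2) / np.sqrt(2 * np.pi)  # dP_u/dnu evaluated at y
weights = likelihood / likelihood.sum()                                # re-weight, then normalize

# Posterior probability of the set A = (0, inf) and posterior mean, via the weighted prior draws.
A = u_samples > 0.0
print("mu^y(A) approx:", weights[A].sum())
print("posterior mean approx:", np.sum(weights * u_samples), "(exact:", y / 2, ")")
```

For this conjugate model the exact posterior is $N(y/2, 1/2)$, so the printed estimates can be checked directly against closed-form values.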
Recovering the Informal Statement
One last lingering question might be how to provide a more formal justification for (1), your everyday statement of Bayes' rule. We have seen that the rigorous formulation concerns $\frac{d\mu^y}{d\mu_U}$, the Radon-Nikodym derivative of the posterior with respect to the prior, which is expressed in terms of some base dominating measure $\nu$. In the informal presentation of Bayes' rule for continuous random variables (e.g., (1)), $\nu$ is taken to be the Lebesgue measure $\lambda$. We can then use the standard likelihood notation \begin{align} p(y|u) := \frac{dP_u}{d\lambda}(y) \end{align} to refer to the Radon-Nikodym derivative of $P_u$ with respect to the Lebesgue measure $\lambda$; i.e., $p(\cdot|u)$ is the Lebesgue density of $P_u$.
Next, note that in our rigorous statement of Bayes' theorem, all integrals over $\mathcal{U}$ are currently with respect to the prior measure $\mu_U$. To recover the typical informal formula, we require these integrals to be with respect to the Lebesgue measure, which requires the additional assumption $\mu_U \ll \lambda$. Under this assumption, the Lebesgue density of $\mu_U$ exists and we will write it as \begin{align} p(u) := \frac{d\mu_U}{d\lambda}(u). \end{align} Note that there are technically two different Lebesgue measures being considered here: the first one is defined on the measurable space $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$ and corresponds to the particular choice of $\nu$; we have introduced a second Lebesgue measure on $(\mathcal{U}, \mathcal{B}(\mathcal{U}))$ in order to obtain Lebesgue densities for the prior measure.
With all of this established, we now have \begin{align} \mu^y(A) &= \frac{\int_A \frac{dP_u}{d\lambda}(y) \mu_U(du)}{\int_{\mathcal{U}} \frac{dP_u}{d\lambda}(y) \mu_U(du)} = \int_A \frac{p(y|u)p(u)}{\int_{\mathcal{U}} p(y|u)p(u)du}du. \end{align} Here, we use the typical Riemann integral notation for integration with respect to the Lebesgue measure. The last expression implies that $\mu^y \ll \lambda$ and, moreover, that the Lebesgue density of the posterior is given by \begin{align} p(u|y) := \frac{d\mu^y}{d\lambda}(u) &= \frac{p(y|u)p(u)}{\int_{\mathcal{U}} p(y|u)p(u)du}, \end{align} which is precisely (1). To obtain the form of Bayes' rule for discrete random variables, we simply substitute the counting measure for the Lebesgue measure, which has the effect of replacing densities with mass functions and integrals with sums.
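A short sketch of the continuous-case formula just derived, again with a hypothetical Gaussian prior and likelihood and a crude grid approximation of the Lebesgue integrals, evaluates the posterior density $p(u|y)$ directly from $p(y|u)$ and $p(u)$:

```python
import numpy as np

# Hypothetical densities: prior p(u) = N(0, 1), likelihood p(y|u) = N(u, 1).
def p_u(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def p_y_given_u(y, u):
    return np.exp(-0.5 * (y - u) ** 2) / np.sqrt(2 * np.pi)

y = 1.5
u_grid = np.linspace(-8.0, 8.0, 4001)             # grid approximating integration du
du = u_grid[1] - u_grid[0]
numer = p_y_given_u(y, u_grid) * p_u(u_grid)      # p(y|u) p(u)
evidence = numer.sum() * du                       # crude Riemann sum for the integral p(y|u) p(u) du
posterior_density = numer / evidence              # p(u|y) as in (1)

# For this conjugate example the exact posterior is N(y/2, 1/2); compare densities at u = 0.75.
print(np.interp(0.75, u_grid, posterior_density))  # ~ 1/sqrt(pi) ~ 0.564
```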
The Operator-Theoretic Viewpoint
Application: Bayesian Inverse Problems
Some more technical details: The Disintegration Theorem
Appendix
Bayes’ Theorem Proof
Before diving into the manipulations, let's set the stage. Bayes' theorem provides the link between the two conditional distributions encoded by the kernels $P$ and $\mu^Y$. When thinking about how to establish this link, it is helpful to notice that each conditional provides a different, but equivalent, representation of the joint distribution of $(U, Y)$. Indeed, letting $A \in \mathcal{B}(\mathcal{U})$ and $B \in \mathcal{B}(\mathcal{Y})$, we have \begin{align} \mathbb{P}(U \in A, Y \in B) &= \int_A \int_B \mu_U(du)P(u, dy) = \int_B \int_A \mu_Y(dy) \mu^Y(y, du). \end{align} Moreover, since $P(u, \cdot) \ll \nu$ by assumption, the middle term can be re-written as \begin{align} \int_A \int_B \mu_U(du)P(u, dy) &= \int_A \int_B \mu_U(du) \frac{dP_u}{d\nu}(y) \nu(dy). \end{align} Given that Bayes' theorem ultimately provides the connection between Radon-Nikodym derivatives of the two conditionals, we see from \begin{align} \int_A \int_B \mu_U(du) \frac{dP_u}{d\nu}(y) \nu(dy) &= \int_B \int_A \mu_Y(dy) \mu^Y(y, du), \tag{2} \end{align} that we're actually already quite close to establishing the required link; the rest of the proof fills in the missing pieces.
Starting from (2), and using the notation introduced in the setup, we have
\begin{align} \mathbb{P}(U \in A, Y \in B) &= \int_A \int_B \frac{dP_u}{d\nu}(y) \nu(dy) \mu_U(du) = \int_B \int_A \mu^y(du) \mu_Y(dy), \end{align} where $A \in \mathcal{B}(\mathcal{U})$ and $B \in \mathcal{B}(\mathcal{Y})$ are arbitrary measurable sets. The general structure of this proof involves manipulating these two terms so that the integrands can be combined; this requires:
- Changing the integration order in one of the expressions (we will do this for the first of the two).
- Writing the integrals with respect to a common measure (we will change the measure in the second term so that the outer integration is also with respect to $\nu$).
There is also the technical concern that the normalizing constant for the posterior distribution could be zero or infinite; it turns out this is not an issue, as it occurs with zero $\mu_Y$-probability. We verify this at the end of the proof to avoid cluttering the main points.
Changing the order of integration
We begin by flipping the order of integration in the first term. Note that the function $(u, y) \mapsto \frac{dP_u}{d\nu}(y)$ can be shown to be measurable. Moreover, it is non-negative, and thus an application of Tonelli's theorem gives \begin{align} \int_A \left[\int_B \frac{dP_u}{d\nu}(y) \nu(dy)\right] \mu_U(du) &= \int_B \left[\int_A \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy). \end{align}
Changing the measure
To write the second term with respect to $\nu$, we need only confirm that $\mu_Y \ll \nu$ so that $\frac{d\mu_Y}{d\nu}$ exists. Indeed, if this is true then \begin{align} \int_B \int_A \mu^y(du) \mu_Y(dy) &= \int_B \mu^y(A) \mu_Y(dy) = \int_B \mu^y(A) \frac{d\mu_Y}{d\nu}(y) \nu(dy). \end{align} To justify this, we can use the expression derived above to obtain \begin{align} \mu_Y(B) &= \mathbb{P}(U \in \mathcal{U}, Y \in B) = \int_B \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy), \end{align} which tells us both that $\mu_Y \ll \nu$ and gives the specific form of the Radon-Nikodym derivative, \begin{align} \frac{d\mu_Y}{d\nu}(y) &= \int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du). \end{align} We can thus plug in this explicit expression to conclude \begin{align} \int_B \int_A \mu^y(du) \mu_Y(dy) &= \int_B \mu^y(A) \frac{d\mu_Y}{d\nu}(y) \nu(dy) = \int_B \mu^y(A) \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy). \end{align}
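The expression just derived for $\frac{d\mu_Y}{d\nu}(y)$ is the familiar "evidence" or marginal likelihood. A minimal sketch (using the same hypothetical Gaussian model as in the earlier examples) estimates it by averaging the likelihood over prior draws and compares against the closed-form marginal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: prior U ~ N(0, 1), Y | U = u ~ N(u, 1), so marginally Y ~ N(0, 2).
y = 1.5
u_samples = rng.normal(size=500_000)                            # draws from mu_U
lik = np.exp(-0.5 * (y - u_samples) ** 2) / np.sqrt(2 * np.pi)  # dP_u/dnu evaluated at y

# d(mu_Y)/d(nu)(y) = integral of dP_u/dnu(y) against mu_U(du), estimated as a prior average.
print("Monte Carlo estimate:", lik.mean())
print("closed-form N(0, 2) density:", np.exp(-0.25 * y**2) / np.sqrt(4 * np.pi))
```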
Finishing the proof
We’re just about there. All we’ve done is re-write the joint probability in two ways, one using the conditional $P_u$ and the other using $\mu^y$. We manipulated the integrals a bit to make them comparable, and have ended up with the equality \begin{align} \int_B \left[\int_A \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy) &= \int_B \mu^y(A) \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy). \end{align} Since this equality holds for every $B \in \mathcal{B}(\mathcal{Y})$, the two integrands must agree $\nu$-almost everywhere; that is, \begin{align} \mu^y(A) \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] &= \int_A \frac{dP_u}{d\nu}(y) \mu_U(du), && \nu\text{-a.s.} \end{align} We now divide both sides by the integral in brackets. Recall that this integral is finite and non-zero $\mu_Y$-a.s., so we conclude that \begin{align} \mu^y(A) &= \int_A \left[\frac{\frac{dP_u}{d\nu}(y)}{\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du)}\right] \mu_U(du) \end{align} holds $\mu_Y$-a.s. in $y$, for each $A \in \mathcal{B}(\mathcal{U})$ (the division is well-defined on a set of $\mu_Y$-probability one, as confirmed below). This expression verifies that $\mu^y \ll \mu_U$ and that the Radon-Nikodym derivative $\frac{d\mu^y}{d\mu_U}$ is given by the term in brackets.
Verifying that the normalizing constant is well-defined
It remains to check that the normalizing constant $Z(y) := \int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du)$ is finite and non-zero for $\mu_Y$-almost every $y$. We showed above that $Z = \frac{d\mu_Y}{d\nu}$, so \begin{align} \int_{\mathcal{Y}} Z(y) \nu(dy) &= \mu_Y(\mathcal{Y}) = 1 < \infty, \end{align} which implies that $Z < \infty$ holds $\nu$-a.s., and hence $\mu_Y$-a.s. (since $\mu_Y \ll \nu$). Moreover, \begin{align} \mu_Y(\{y : Z(y) = 0\}) &= \int_{\{Z = 0\}} Z(y) \nu(dy) = 0, \end{align} so $Z > 0$ with $\mu_Y$-probability one. Thus, the division performed in the final step of the proof is valid for $\mu_Y$-almost every $y$.
References
- Ambrosio, Gigli, and Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures (results on disintegration of measures).
- Kallenberg. Random Measures: Theory and Applications.