The Measure-Theoretic Context of Bayes' Rule

I describe Bayes' rule in a measure-theoretic context, explain how it can be viewed as a non-linear operator on probability measures, and detail applications to Bayesian inverse problems.

Bayes’ rule is typically written as something like

\begin{align} p(u|y) &= \frac{p(y|u)p(u)}{p(y)} = \frac{p(y|u)p(u)}{\int p(y|u)p(u)du}, \tag{1} \end{align} describing the connection between the two conditional probability density functions $p(u|y)$ and $p(y|u)$. It is not very common to see this result cast in a more rigorous measure-theoretic setting. This post explores how this can be accomplished.

Defining the Required Ingredients

Most of the work in rigorously stating Bayes’ theorem goes into the setup. We therefore start by building up the necessary measure-theoretic foundation. Since Bayes’ rule concerns the probabilistic dependence between two random variables $U$ and $Y$, the first step is to rigorously define these random variables.

Marginal Distributions

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be the underlying probability space which will be used as the foundation for all of the subsequent development. To define $U$ and $Y$ as random variables mapping from this space, we introduce the measurable Borel spaces $(\mathcal{U}, \mathcal{B}(\mathcal{U}))$ and $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$, so that the random variables are defined as measurable maps $U: \Omega \to \mathcal{U}$ and $Y: \Omega \to \mathcal{Y}$. The probability distribution (i.e., law) of $U$ is the function $\mu_U: \mathcal{B}(\mathcal{U}) \to [0,1]$ given by $\mu_U(A) = \mathbb{P}(U \in A) = \mathbb{P}(U^{-1}(A))$ for all $A \in \mathcal{B}(\mathcal{U})$. The distribution $\mu_Y$ of $Y$ is defined analogously.
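
As a simple (hypothetical) illustration, take $\mathcal{U} = \mathbb{R}$ equipped with its Borel $\sigma$-algebra and let $U$ be a standard Gaussian random variable. Its law is then the pushforward measure \begin{align} \mu_U(A) = \mathbb{P}(U \in A) = \int_A \frac{1}{\sqrt{2\pi}} e^{-u^2/2} du, \qquad A \in \mathcal{B}(\mathbb{R}), \end{align} and all subsequent probabilistic statements about $U$ can be phrased in terms of $\mu_U$, without further reference to the underlying space $(\Omega, \mathcal{A}, \mathbb{P})$.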

Conditional Distributions

The next ingredient we need is a notion of conditional distributions, in order to formally define what we mean by $p(y|u)$ and $p(u|y)$ in (1). To side-step the technical issues associated with rigorously defining conditional probability, it is common to simply assume that for each $u \in \mathcal{U}$ there is a valid probability distribution $P(u, \cdot)$ which represents the conditional distribution $Y|U=u$ (and vice versa for the conditional $U|Y=y$). This assumption is nicely encoded by assuming the existence of a probability kernel (i.e., transition kernel) $P: \mathcal{U} \times \mathcal{B}(\mathcal{Y}) \to [0,1]$. By definition, the probability kernel satisfies:

  1. For each $u \in \mathcal{U}$, $P(u, \cdot): \mathcal{B}(\mathcal{Y}) \to [0, 1]$ is a probability measure on $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$, which we will denote by $P_u(\cdot) := P(u, \cdot)$.
  2. For each $B \in \mathcal{B}(\mathcal{Y})$, $P(\cdot, B): \mathcal{U} \to [0,1]$ is a measurable function.

While this probability kernel is generic, we will think of $P_u$ as representing the conditional distribution of $Y|U=u$. To define what we mean by the other conditional, $U|Y=y$, we similarly assume there is a probability kernel $\mu^Y: \mathcal{Y} \times \mathcal{B}(\mathcal{U}) \to [0,1]$ satisfying the same two properties. We will use the notation $\mu^y(\cdot) := \mu^Y(y, \cdot)$ for the probability measure that results from fixing the first argument of the kernel at $y \in \mathcal{Y}$.
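
As a concrete (and purely hypothetical) running example, consider the additive-noise model $Y = U + \epsilon$ with $\mathcal{U} = \mathcal{Y} = \mathbb{R}$ and $\epsilon \sim N(0, \sigma^2)$ independent of $U$. The conditional distribution $Y|U=u$ is then encoded by the kernel \begin{align} P(u, B) = \int_B \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-u)^2}{2\sigma^2}\right) dy, \qquad u \in \mathbb{R}, \ B \in \mathcal{B}(\mathbb{R}), \end{align} so that $P_u = N(u, \sigma^2)$. For each fixed $B$, the map $u \mapsto P(u, B)$ can be checked to be measurable, so both defining properties of a probability kernel are satisfied.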

Note that since we are assuming both of these kernels exist, the statement of Bayes’ theorem presented below will take the form “If both conditional distributions exist (and are given by the probability kernels described above), then they must be related in the following way…”. This might feel a bit unsatisfying at first, but it is not difficult to show that in the common settings of interest, it is straightforward to define the required probability kernels. This is explored in the appendix.

Densities

The last missing ingredient is to consider probability density functions (or, more generally, Radon-Nikodym derivatives) for the probability measures introduced above. While the typical informal statement of Bayes’ theorem implicitly assumes the existence of Lebesgue densities (or probability mass functions), we can generalize this by assuming that there is some $\sigma$-finite measure $\nu$ such that $P_u$ is absolutely continuous with respect to $\nu$ for each $u \in \mathcal{U}$; i.e., $P_u \ll \nu$ for all $u \in \mathcal{U}$. We recall that absolute continuity means that $\nu(B) = 0 \implies P_u(B) = 0$ for all $B \in \mathcal{B}(\mathcal{Y})$. Intuitively, in order to re-weight $\nu$ to obtain $P_u$, $\nu$ had better not be zero where $P_u$ is positive. In applications, the most common choices for $\nu$ are the Lebesgue measure and the counting measure, which lead to the standard presentations of Bayes’ rule for continuous and discrete random variables, respectively.
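
Continuing the hypothetical Gaussian example, we may take $\nu = \lambda$, the Lebesgue measure on $\mathbb{R}$, in which case \begin{align} \frac{dP_u}{d\nu}(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-u)^2}{2\sigma^2}\right). \end{align} If the data were instead count-valued, say $P_u = \text{Poisson}(u)$ on $\mathcal{Y} = \{0, 1, 2, \dots\}$, we could take $\nu$ to be the counting measure, giving $\frac{dP_u}{d\nu}(y) = e^{-u}u^y/y!$, the familiar probability mass function.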

The Statistical Interpretation

I was careful to keep things quite generic above; at its core, Bayes’ theorem concerns the joint and conditional distributions of two random variables $U$ and $Y$. However, the result is most commonly applied in the field of Bayesian statistics, so I take a moment to map the above definitions onto their common Bayesian interpretations. In a Bayesian context, the random variable $Y$ is the data, while $U$ is the parameter in the statistical model being considered. The measure $\mu_U$ thus represents the prior distribution on the parameter. The probability kernel $P$ formalizes the notion of a parametric statistical model. Indeed, in a standard parametric setting, each fixed value $u$ of the parameter $U$ yields a different data-generating process; this data-generating process is encoded by $P_u$. The Radon-Nikodym derivative of $P_u$ (when it exists) with respect to the Lebesgue measure is typically referred to as the likelihood function (viewed as a function of $u$ with the data $y \in \mathcal{Y}$ fixed). From this point forward I’ll fall back on the Bayesian terminology, since it is convenient and often enlightening, but of course Bayes’ theorem is agnostic to how the probability distributions are interpreted.
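
To pin down this dictionary with a concrete (hypothetical) instance: in a Gaussian mean-estimation model we might take the prior $\mu_U = N(m, \tau^2)$, while the statistical model is the kernel $P_u = N(u, \sigma^2)$; for an observed data point $y$, the likelihood function is then the map \begin{align} u \mapsto \frac{dP_u}{d\lambda}(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-u)^2}{2\sigma^2}\right). \end{align}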

Finally, Bayes’ theorem

With all of this setup out of the way, we proceed to the main result. As noted, Bayes’ theorem provides a link between the two conditional distributions, which are encoded by the kernels $P$ and $\mu^Y$. To be more specific, the result provides a connection between Radon-Nikodym derivatives (i.e., densities) of these distributions. For example, the informal statement (1) links the Lebesgue densities $p(u|y)$ and $p(y|u)$. We will discuss the rigorous treatment of this case later on, but it turns out that Bayes’ theorem is more naturally stated in terms of $\frac{d\mu^y}{d\mu_U}$, the Radon-Nikodym derivative of the posterior with respect to the prior. If you are only familiar with statements like (1), this formulation might feel a bit weird at first, but it is a very nice way of looking at things. Recall that the Radon-Nikodym derivative is the function that provides the correct weights to transform one measure into another. In this case, $\frac{d\mu^y}{d\mu_U}$ describes how to re-weight the prior in order to obtain the posterior. Since we know the prior-to-posterior transformation is due to the effect of conditioning on data, we would expect $\frac{d\mu^y}{d\mu_U}$ to be closely related to the likelihood. As we’ll see below, it is in fact exactly proportional to the likelihood! Once you wade through all of the notation, I hope that this intuition makes the theorem itself feel quite natural.

Bayes’ Theorem. Under the assumptions outlined in the preceding sections,

  1. The posterior is absolutely continuous with respect to the prior: \begin{align} \mu^y \ll \mu_U \text{ for all } y \in \mathcal{Y} \end{align}

  2. The Radon-Nikodym derivative of the posterior with respect to the prior is proportional to the likelihood: \begin{align} \frac{d\mu^y}{d\mu_U}(u) \propto \frac{dP_u}{d\nu}(y). \end{align} More precisely, and including the normalization constant, the posterior measure is given by
    \begin{align} \mu^y(A) &= \mathbb{P}(U \in A|Y=y) = \frac{\int_A \frac{dP_u}{d\nu}(y) \mu_U(du)}{\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du)}, \end{align} for $A \in \mathcal{B}(\mathcal{U})$, and for all $y \in \mathcal{Y}$ such that the denominator is not $0$ or infinity.

As noted above, we see that Bayes’ theorem describes the change-of-measure that maps the prior $\mu_U$ into the posterior $\mu^y$ through conditioning on the data $y$. The theorem shows that this map simply involves re-weighting the prior by the likelihood, and then normalizing the result.
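
To see this re-weight-and-normalize map in action numerically, here is a minimal sketch (not part of the original development; it assumes the hypothetical conjugate Gaussian model $\mu_U = N(0, 1)$, $P_u = N(u, \sigma^2)$): draw samples from the prior, weight each sample by the likelihood $\frac{dP_u}{d\nu}(y)$ evaluated at the observed data, and normalize. The self-normalized weights play the role of $\frac{d\mu^y}{d\mu_U}$ evaluated at the prior samples, and weighted averages approximate posterior expectations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model (illustration only): prior U ~ N(0, 1), likelihood Y | U = u ~ N(u, sigma^2).
sigma = 0.5
y_obs = 1.3  # a single observed data point

# 1. Draw samples from the prior mu_U.
u = rng.normal(loc=0.0, scale=1.0, size=100_000)

# 2. Weight each prior sample by the likelihood dP_u/dnu(y_obs).
log_lik = -0.5 * ((y_obs - u) / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma**2)
weights = np.exp(log_lik - log_lik.max())  # subtract the max for numerical stability

# 3. Normalize: the "re-weight the prior by the likelihood, then normalize" step.
weights /= weights.sum()

# The weighted prior samples now approximate the posterior; e.g., the posterior mean:
post_mean_estimate = np.sum(weights * u)

# Sanity check: for this conjugate model the exact posterior mean is y * tau^2 / (tau^2 + sigma^2), with tau = 1.
post_mean_exact = y_obs / (1.0 + sigma**2)
print(post_mean_estimate, post_mean_exact)
```

The Monte Carlo estimate fluctuates with the seed, but it should land close to the exact conjugate answer, which is one way to convince yourself that the re-weighting description of the theorem is doing what it claims.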

Recovering the Informal Statement

One last lingering question might be how to provide a more formal justification for (1), your everyday statement of Bayes’ rule. We have seen that the rigorous formulation concerns $\frac{d\mu^y}{d\mu_U}(u)$, the Radon-Nikodym derivative of the posterior with respect to the prior, which is expressed in terms of some base dominating measure $\nu$. In the informal presentation of Bayes’ rule for continuous random variables (e.g., (1)), $\nu$ is taken to be the Lebesgue measure $\lambda$. We can then use the standard likelihood notation \begin{align} p(y|u) := \frac{dP_u}{d\lambda}(y) \end{align} to refer to the Radon-Nikodym derivative of $P_u$ with respect to the Lebesgue measure $\lambda$; i.e., $p(y|u)$ is the Lebesgue density of $P_u$.

Next, note that in our rigorous statement of Bayes’ theorem, all integrals are currently with respect to the prior measure $\mu_U$. To recover the typical informal formula, we require all integrals to be with respect to the Lebesgue measure, which requires the additional assumption $\mu_U \ll \lambda$. Under this assumption, the Lebesgue density of $\mu_U$ exists and we will write it as \begin{align} p(u) := \frac{d\mu_U}{d\lambda}(u). \end{align} Note that there are technically two different Lebesgue measures being considered here: the first is defined on the measurable space $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$ and corresponds to the particular choice of $\nu$; we have introduced a second Lebesgue measure on $(\mathcal{U}, \mathcal{B}(\mathcal{U}))$ in order to obtain a Lebesgue density for the prior measure.

With all of this established, we now have \begin{align} \mu^y(A) &= \frac{\int_A \frac{dP_u}{d\lambda}(y) \mu_U(du)}{\int_{\mathcal{U}} \frac{dP_u}{d\lambda}(y) \mu_U(du)} = \int_A \frac{p(y|u)p(u)}{\int_{\mathcal{U}} p(y|u)p(u)du} du. \end{align} Here, we use the usual notation $\lambda(du) = du$ for integration with respect to the Lebesgue measure. The last expression implies that $\mu^y \ll \lambda$ and, moreover, that the Lebesgue density of the posterior is given by \begin{align} p(u|y) := \frac{d\mu^y}{d\lambda}(u) = \frac{p(y|u)p(u)}{\int_{\mathcal{U}} p(y|u)p(u)du}, \end{align} which is precisely (1). To obtain the form of Bayes’ rule for discrete random variables, we simply replace the Lebesgue measure with the counting measure, which has the effect of replacing densities with mass functions and integrals with sums.
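
To make the counting-measure version equally tangible, below is a small sketch with hypothetical numbers that applies Bayes’ rule on a finite parameter set, where the integrals above literally become sums over mass functions.

```python
from math import comb

# Hypothetical discrete example: the parameter u is a coin bias from a finite set,
# and the data y is the number of heads in n independent tosses.
u_values = [0.3, 0.5, 0.7]
prior = {u: 1 / 3 for u in u_values}  # prior mass function p(u)

n, y = 10, 7  # observed: 7 heads out of 10 tosses

def likelihood(y, u, n=n):
    # dP_u/dnu(y) with nu the counting measure: the Binomial(n, u) mass function.
    return comb(n, y) * u**y * (1 - u) ** (n - y)

# Bayes' rule with the counting measure: p(u|y) = p(y|u) p(u) / sum_{u'} p(y|u') p(u').
unnormalized = {u: likelihood(y, u) * prior[u] for u in u_values}
evidence = sum(unnormalized.values())
posterior = {u: value / evidence for u, value in unnormalized.items()}

print(posterior)  # the posterior puts the most mass on u = 0.7
```

Every integral from the continuous presentation has become a finite sum, exactly as described above.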

The Operator-Theoretic Viewpoint

Application: Bayesian Inverse Problems

Some more technical details: The Disintegration Theorem

Appendix

Bayes’ Theorem Proof

Before diving into the proof, let’s provide one last bit of motivation. We know that Bayes’ theorem provides the link between the two conditional distributions encoded by the kernels $P$ and $\mu^Y$. When thinking about how to go about establishing this link, it is helpful to notice that each conditional provides a different, but equivalent, representation of the joint distribution of $(U, Y)$. Indeed, letting $A \in \mathcal{B}(\mathcal{U})$ and $B \in \mathcal{B}(\mathcal{Y})$, we have \begin{align} \mathbb{P}(U \in A, Y \in B) &= \int_A \int_B \mu_U(du)P(u, dy) = \int_B \int_A \mu_Y(dy) \mu^Y(y, du). \end{align} Moreover, since $P(u, \cdot) \ll \nu$ by assumption, the middle term can be re-written as \begin{align} \int_A \int_B \mu_U(du)P(u, dy) &= \int_A \int_B \mu_U(du) \frac{dP_u}{d\nu}(y) \nu(dy). \end{align} Given that Bayes’ theorem ultimately provides the connection between Radon-Nikodym derivatives of the two conditionals, we see from \begin{align} \int_A \int_B \mu_U(du) \frac{dP_u}{d\nu}(y) \nu(dy) &= \int_B \int_A \mu_Y(dy) \mu^Y(y, du), \tag{2} \end{align} that we’re actually already quite close to establishing the required link. Currently, (2) has a few missing pieces; the rest of this appendix fills them in.

We recall from the setup above that

\begin{align} \mathbb{P}(U \in A, Y \in B) &= \int_A \int_B \frac{dP_u}{d\nu}(y) \nu(dy) \mu_U(du) = \int_B \int_A \mu^y(du) \mu_Y(dy), \end{align} where $A \in \mathcal{B}(\mathcal{U})$ and $B \in \mathcal{B}(\mathcal{Y})$. The general structure of the proof involves manipulating these two terms so that the integrands can be equated; this requires:

  1. Changing the order of integration in one of the expressions (we will do this for the first of the two).
  2. Writing the integrals with respect to a common measure (we will change the measure in the second term so that the outer integration is also with respect to $\nu$).

There is also the technical concern that the normalizing constant for the posterior distribution could be zero or infinite; it turns out this is not an issue, as this occurs only with zero $\mu_Y$-probability. We verify this at the end of the proof to avoid cluttering the main points.

Changing the order of integration

We begin by flipping the order of integration in the first term. Note that the function $(u, y) \mapsto \frac{dP_u}{d\nu}(y)$ can be shown to be jointly measurable. Moreover, it is non-negative, and thus an application of Tonelli’s theorem gives \begin{align} \int_A \left[\int_B \frac{dP_u}{d\nu}(y) \nu(dy)\right] \mu_U(du) &= \int_B \left[\int_A \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy). \end{align}

Changing the measure

To write the second term with respect to $\nu$, we need only confirm that $\mu_Y \ll \nu$, so that $\frac{d\mu_Y}{d\nu}$ exists. Indeed, if this is true then \begin{align} \int_B \int_A \mu^y(du) \mu_Y(dy) &= \int_B \mu^y(A) \mu_Y(dy) = \int_B \mu^y(A) \frac{d\mu_Y}{d\nu}(y) \nu(dy). \end{align} To justify this, we can use the expression derived above (with $A = \mathcal{U}$) to obtain \begin{align} \mu_Y(B) &= \mathbb{P}(U \in \mathcal{U}, Y \in B) = \int_B \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy), \end{align} which tells us both that $\mu_Y \ll \nu$ and gives the specific form of the Radon-Nikodym derivative, \begin{align} \frac{d\mu_Y}{d\nu}(y) &= \int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du). \end{align} We can thus plug in this explicit expression to conclude \begin{align} \int_B \int_A \mu^y(du) \mu_Y(dy) &= \int_B \mu^y(A) \frac{d\mu_Y}{d\nu}(y) \nu(dy) = \int_B \mu^y(A) \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy). \end{align}
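
As an aside, the bracketed integral is precisely the normalizing constant appearing in the theorem; it is the density of the data’s marginal distribution (often called the evidence or marginal likelihood). In the hypothetical conjugate Gaussian model used earlier, with $\mu_U = N(m, \tau^2)$ and $P_u = N(u, \sigma^2)$, it is available in closed form: \begin{align} \frac{d\mu_Y}{d\nu}(y) = \int_{\mathbb{R}} N(y; u, \sigma^2) N(u; m, \tau^2) du = N(y; m, \sigma^2 + \tau^2). \end{align}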

Finishing the proof

We’re just about there. All we’ve done is re-write the joint probability $\mathbb{P}(U \in A, Y \in B)$ in two ways, one using the conditional $U|Y$ and the other using $Y|U$. We manipulated the integrals a bit to make them comparable, and have ended up with the equality \begin{align} \int_B \left[\int_A \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy) &= \int_B \mu^y(A) \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] \nu(dy). \end{align} Since this equality holds for every $B \in \mathcal{B}(\mathcal{Y})$, the integrands must agree $\nu$-almost everywhere; that is, \begin{align} \mu^y(A) \left[\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du) \right] &= \int_A \frac{dP_u}{d\nu}(y) \mu_U(du), && \nu\text{-a.s.} \end{align} We now divide both sides by the integral in brackets. Recall that this integral is finite and non-zero for $\mu_Y$-almost every $y$ (as will be confirmed below), so we conclude that \begin{align} \mu^y(A) &= \int_A \left[\frac{\frac{dP_u}{d\nu}(y)}{\int_{\mathcal{U}} \frac{dP_u}{d\nu}(y) \mu_U(du)}\right] \mu_U(du) \end{align} holds for $\mu_Y$-almost every $y \in \mathcal{Y}$. This expression verifies that $\mu^y \ll \mu_U$, with Radon-Nikodym derivative $\frac{d\mu^y}{d\mu_U}(u)$ given by the term in brackets.

Verifying that the normalizing constant is well-defined

References

  1. Gradient Flows: In Metric Spaces and in the Space of Probability Measures (Ambrosio, Gigli, and Savaré); results on disintegration of measures.
  2. Random Measures: Theory and Applications (Kallenberg).