Linear Gaussian Inverse Problems
Derivations and discussion of linear Gaussian inverse problems.
This post focuses on Bayesian inverse problems with the following features:
- Linear forward model.
- Additive Gaussian observation noise.
- Gaussian prior distribution.
- A priori independence of the parameter and the observation noise.
We refer to such inverse problems as linear Gaussian. The standard Bayesian linear regression model with a Gaussian prior on the coefficients is a common example. The assumptions of linearity and Gaussianity play quite nicely together, resulting in a closed-form Gaussian posterior distribution. Moreover, many extensions to nonlinear and/or non-Gaussian settings rely on methods rooted in our understanding of the linear Gaussian regime.
Setup
We consider the following linear Gaussian regression model \begin{align} y &= Gu + \epsilon \tag{1} \newline \epsilon &\sim \mathcal{N}(0, \Sigma) \newline u &\sim \mathcal{N}(m, C), && u \perp \epsilon \end{align} consisting of the observation (or data) $y \in \mathbb{R}^n$, parameter $u \in \mathbb{R}^d$, noise $\epsilon \in \mathbb{R}^n$, and linear forward model represented by the matrix $G \in \mathbb{R}^{n \times d}$. The observation covariance $\Sigma \in \mathbb{R}^{n \times n}$ and prior covariance $C \in \mathbb{R}^{d \times d}$ are both fixed positive definite matrices. The vector $m \in \mathbb{R}^d$ is the prior mean. We write $u \perp \epsilon$ to indicate the key assumption that $u$ and $\epsilon$ are a priori statistically independent. The model (1) can equivalently be written as \begin{align} y|u &\sim \mathcal{N}(Gu, \Sigma) \tag{2} \newline u &\sim \mathcal{N}(m, C), \end{align} which gives the explicit expression for the Gaussian likelihood $p(y|u)$. The solution of the Bayesian inverse problem is the posterior distribution $p(u|y)$. We provide two approaches to calculating this distribution below, which yield different (but equivalent) expressions.
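To make the setup concrete, here is a minimal simulation sketch of model (1). The dimensions and the specific choices of `G`, `m`, `C`, and `Sigma` are arbitrary illustrations, not part of the model specification; later code sketches in this post reuse these names.

```python
# Minimal sketch: simulate one draw from the linear Gaussian model (1).
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                                        # parameter and observation dimensions
G = rng.normal(size=(n, d))                        # linear forward model
m = np.zeros(d)                                    # prior mean
C = np.eye(d)                                      # prior covariance (positive definite)
Sigma = 0.1 * np.eye(n)                            # noise covariance (positive definite)

u = rng.multivariate_normal(m, C)                  # u ~ N(m, C)
eps = rng.multivariate_normal(np.zeros(n), Sigma)  # eps ~ N(0, Sigma), independent of u
y = G @ u + eps                                    # observation y = Gu + eps
```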
Computing the Posterior
Method 1: Completing the Square
We first tackle the problem directly, using Bayes’ theorem and
the matrix analog of completing the square from elementary algebra. Applying
Bayes’ theorem to (2) yields
\begin{align}
p(u|y)
&\propto p(y|u)p(u) \newline
&\propto \exp\left(-\frac{1}{2}\left[(y-Gu)^\top \Sigma^{-1} (y-Gu) + (u-m)^\top C^{-1} (u-m) \right] \right) \newline
&\propto \exp\left(-\frac{1}{2}\left[u^\top(G^\top \Sigma^{-1}G + C^{-1})u -
2u^\top(G^\top \Sigma^{-1}y + C^{-1}m)\right] \right). \tag{3}
\end{align}
All we have done above is combine the Gaussian likelihood and prior, drop any multiplicative constants that do not depend on $u$, and group like terms in $u$. Since (3) is the exponential of a quadratic in $u$, we immediately know that the posterior must be Gaussian. It therefore remains to find the mean $\overline{m}$ and covariance $\overline{C}$. Knowing that (3) is proportional to a Gaussian density, let's set the term in square brackets equal to
\begin{align}
(u - \overline{m})^\top \overline{C}^{-1} (u - \overline{m})
= u^\top \overline{C}^{-1}u - 2u^\top \overline{C}^{-1} \overline{m} +
\overline{m}^\top \overline{C}^{-1} \overline{m}
\end{align}
and equate like terms to solve for the unknowns $\overline{m}$ and $\overline{C}$.
Doing so, we find that
\begin{align}
\overline{C}^{-1} &= G^\top \Sigma^{-1} G + C^{-1} \newline
\overline{C}^{-1}\overline{m} &= G^\top \Sigma^{-1}y + C^{-1}m.
\end{align}
The $\overline{m}^\top \overline{C}^{-1} \overline{m}$ term poses no problem, as it does not
depend on $u$ and is simply absorbed into the proportionality constant.
Rearranging the above expressions gives the desired mean and covariance equations,
which are summarized in the following result.
Proposition. The posterior distribution under the linear Gaussian model (1) is Gaussian $u|y \sim \mathcal{N}(\overline{m}, \overline{C})$, with \begin{align} \overline{m} &= \overline{C} \left[G^\top \Sigma^{-1}y + C^{-1}m \right] \tag{4} \newline \overline{C} &= \left[G^\top \Sigma^{-1} G + C^{-1} \right]^{-1}. \end{align}
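As a quick illustration, the sketch below evaluates the formulas in (4) with NumPy, assuming the variables `G`, `m`, `C`, `Sigma`, and `y` from the simulation sketch above. Explicit matrix inverses are used only for readability; see the section on numerically implementing the posterior equations below.

```python
import numpy as np

def posterior_precision_form(G, m, C, Sigma, y):
    """Posterior mean and covariance via the precision form (4).

    Explicit inverses are used for clarity only; they are not the most
    numerically stable choice.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    C_inv = np.linalg.inv(C)
    C_post = np.linalg.inv(G.T @ Sigma_inv @ G + C_inv)  # [G^T Sigma^{-1} G + C^{-1}]^{-1}
    m_post = C_post @ (G.T @ Sigma_inv @ y + C_inv @ m)  # C_bar [G^T Sigma^{-1} y + C^{-1} m]
    return m_post, C_post
```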
Method 2: Joint Gaussian Conditioning
We now present a second method for computing $p(u|y)$. This approach relies on the observation that the vector $(u, y)^\top \in \mathbb{R}^{d+n}$ has a joint Gaussian distribution. This follows from the prior independence of $u$ and $\epsilon$, and is formally proved in the appendix. Writing out this joint Gaussian explicitly gives \begin{align} \begin{bmatrix} u \newline y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} m \newline Gm \end{bmatrix}, \begin{bmatrix} C & CG^\top \newline GC & GCG^\top + \Sigma \end{bmatrix} \right). \tag{5} \end{align} The mean and covariance of $u$ are immediate from (1), and the remaining quantities are computed as: \begin{align} \mathbb{E}[y] &= \mathbb{E}[Gu + \epsilon] = G\mathbb{E}[u] + \mathbb{E}[\epsilon] = Gm \tag{6} \newline \text{Cov}[y] &= \text{Cov}[Gu + \epsilon] = \text{Cov}[Gu] + \text{Cov}[\epsilon] = GCG^\top + \Sigma \tag{7} \newline \text{Cov}[y, u] &= \text{Cov}[Gu + \epsilon, u] = \text{Cov}[Gu, u] + \text{Cov}[\epsilon, u] = GC. \tag{8} \end{align} In (6) we use the linearity of expectation and the fact that the noise is zero-mean. The covariance splits into the sum in (7) due to the independence of $u$ and $\epsilon$. This independence assumption is similarly leveraged in (8).
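As a sanity check on (5), here is a brief Monte Carlo sketch, reusing the assumed variables `G`, `m`, `C`, `Sigma` from the earlier simulation: the empirical moments of simulated $(u, y)$ pairs should be close to the block formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
n, d = G.shape

U = rng.multivariate_normal(m, C, size=N)                # rows are draws of u ~ N(m, C)
E = rng.multivariate_normal(np.zeros(n), Sigma, size=N)  # rows are draws of eps ~ N(0, Sigma)
Y = U @ G.T + E                                          # corresponding draws of y = Gu + eps

joint = np.hstack([U, Y])
mean_hat = joint.mean(axis=0)           # should approximate (m, Gm)
cov_hat = np.cov(joint, rowvar=False)   # should approximate the block matrix in (5)

mean_exact = np.concatenate([m, G @ m])
cov_exact = np.block([[C, C @ G.T], [G @ C, G @ C @ G.T + Sigma]])
print(np.abs(mean_hat - mean_exact).max(), np.abs(cov_hat - cov_exact).max())
```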
The conditional distributions of a joint Gaussian are well-known to also be Gaussian, and can be computed in closed form. Applying these Gaussian conditioning identities to (5) yields the posterior distribution $u|y$, which is summarized in the following result.
Proposition. The posterior distribution under the linear Gaussian model (1) is Gaussian $u|y \sim \mathcal{N}(\overline{m}, \overline{C})$, with \begin{align} \overline{m} &= m + CG^\top [GCG^\top + \Sigma]^{-1}(y - Gm) \tag{9} \newline \overline{C} &= C - CG^\top [GCG^\top + \Sigma]^{-1} GC. \end{align}
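The two propositions give different-looking formulas for the same posterior. As a quick numerical sanity check (a sketch assuming the `posterior_precision_form` helper and the simulated `G`, `m`, `C`, `Sigma`, `y` from the earlier sketches), the following evaluates both (4) and (9) and confirms that they agree up to floating-point error.

```python
import numpy as np

def posterior_covariance_form(G, m, C, Sigma, y):
    """Posterior mean and covariance via the joint-conditioning form (9)."""
    S = G @ C @ G.T + Sigma         # GCG^T + Sigma
    K = C @ G.T @ np.linalg.inv(S)  # CG^T [GCG^T + Sigma]^{-1}
    m_post = m + K @ (y - G @ m)
    C_post = C - K @ G @ C
    return m_post, C_post

m1, C1 = posterior_precision_form(G, m, C, Sigma, y)
m2, C2 = posterior_covariance_form(G, m, C, Sigma, y)
assert np.allclose(m1, m2) and np.allclose(C1, C2)
```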
Equivalence of the Two Approaches
TODO
Investigating the Posterior Equations
The Posterior Covariance
A first important observation is that the posterior covariance $\overline{C}$ is independent of the data $y$. In this sense, the specific data realization observed does not affect the uncertainty in the estimation of $u$. The expression coming from the first derivation (4) tells us that the posterior precision (inverse covariance) $\overline{C}^{-1}$ is the sum of the prior precision $C^{-1}$ and $G^\top \Sigma^{-1}G$, which is the observation precision $\Sigma^{-1}$ modified by the forward model. Since the posterior covariance is the inverse of $G^\top \Sigma^{-1}G + C^{-1}$, we should verify that this matrix is indeed invertible. First, note that $\Sigma^{-1}$ and $C^{-1}$ are both positive definite, since the inverse of a positive definite matrix is also positive definite. Thus, the factorization $\Sigma^{-1} = SS^\top$ exists, which implies \begin{align} x^\top [G^\top \Sigma^{-1}G]x &= x^\top [G^\top SS^\top G]x = \lVert S^\top Gx \rVert^2_2 \geq 0. \tag{10} \end{align} That is, $G^\top \Sigma^{-1}G$ is positive semidefinite. Since the sum of a positive semidefinite matrix and a positive definite matrix is positive definite, $G^\top \Sigma^{-1}G + C^{-1}$ is positive definite, and thus invertible.
The covariance expression in (9) provides an alternative perspective. First, the expression tells us that conditioning on the data $y$ never increases the variance. This can be seen by noting that the matrix $CG^\top [GCG^\top + \Sigma]^{-1} GC$ (which is subtracted from the prior covariance) is positive semidefinite, and thus in particular has nonnegative values on its diagonal. To show this, note that $GCG^\top + \Sigma$ is positive definite by the same argument used in (10) (it is the sum of the positive semidefinite matrix $GCG^\top$ and the positive definite matrix $\Sigma$), and hence so is its inverse, which thus admits a decomposition of the form $SS^\top$. Thus, \begin{align} x^\top \left(CG^\top [GCG^\top + \Sigma]^{-1} GC\right) x = x^\top \left(CG^\top SS^\top GC\right) x = \lVert S^\top GCx \rVert_2^2 \geq 0, \tag{11} \end{align} so $CG^\top [GCG^\top + \Sigma]^{-1} GC$ is indeed positive semidefinite. Note that the covariance expression in (9) can also be written as \begin{align} \text{Cov}[u|y] &= \text{Cov}[u] - \text{Cov}[u,y] \text{Cov}[y]^{-1} \text{Cov}[y, u]. \tag{12} \end{align}
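Both claims above (the posterior precision is positive definite, and the posterior marginal variances never exceed the prior ones) can be spot-checked numerically. The sketch below assumes the `G`, `C`, `Sigma` names from the earlier examples.

```python
import numpy as np

# Posterior precision G^T Sigma^{-1} G + C^{-1}: Cholesky succeeds iff positive definite.
precision_post = G.T @ np.linalg.inv(Sigma) @ G + np.linalg.inv(C)
np.linalg.cholesky(precision_post)  # raises LinAlgError if not positive definite

# Variance reduction: the diagonal of the posterior covariance (9) is bounded by the prior diagonal.
S = G @ C @ G.T + Sigma
C_post = C - C @ G.T @ np.linalg.solve(S, G @ C)
assert np.all(np.diag(C_post) <= np.diag(C) + 1e-10)
```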
- TODO: show that $G^\top \Sigma^{-1} G$ is the Hessian of the negative log-likelihood.
- Positive definiteness/invertibility.
Posterior Predictive Distribution
Suppose that we now introduce a new forward model $\tilde{G} \in \mathbb{R}^{\tilde{n} \times d}$ and are interested in estimating the unobserved quantity $\tilde{y} \in \mathbb{R}^{\tilde{n}}$ under the model \begin{align} \tilde{y} = \tilde{G}u + \tilde{\epsilon}, \tag{13} \end{align} with the same prior distribution as in (1). (We write $\tilde{n}$ for the dimension of the new observation to avoid a clash with the prior mean $m$.) For the noise distribution, we will assume \begin{align} \begin{bmatrix} \tilde{\epsilon} \newline \epsilon \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \newline 0 \end{bmatrix}, \begin{bmatrix} \tilde{\Sigma} & \Sigma^\prime \newline [\Sigma^\prime]^\top & \Sigma \end{bmatrix} \right), \tag{14} \end{align} again with the assumption that $\tilde{\epsilon}$ and $u$ are a priori independent. The classic example of this setup is prediction at a new set of inputs in a regression problem. In this scenario, $G$ is the design matrix containing the $n$ inputs at which the responses $y$ were observed, while $\tilde{G}$ stacks a new set of $\tilde{n}$ inputs whose associated responses we would like to predict. The basic regression setting with homoscedastic noise corresponds to $\Sigma = \sigma^2 I_n$, $\tilde{\Sigma} = \sigma^2 I_{\tilde{n}}$, and $\Sigma^\prime = 0$.
In the generic inverse problem formulation, we see that this is a question of propagating the posterior uncertainty in $u$ through a new forward model $\tilde{G}$. Concretely, we are interested in characterizing the posterior predictive distribution \begin{align} p(\tilde{y} | y) &= \int p(\tilde{y},u|y)du = \int p(\tilde{y}|u,y)p(u|y) du. \end{align} Notice that the term $p(u|y)$ is the posterior distribution of the model parameters $u$. We begin by deriving the closed-form posterior predictive distribution in the common setting $\Sigma^\prime = 0$. We then turn to the general setting where $\epsilon$ and $\tilde{\epsilon}$ may be correlated. In either case, the posterior predictive $p(\tilde{y} | y)$ is Gaussian. This is verified in the appendix by showing that $(\tilde{y}, y)$ is joint Gaussian, and hence has Gaussian conditionals.
Uncorrelated Errors
We begin with the assumption of zero cross-covariance: $\text{Cov}[\tilde{\epsilon}, \epsilon] = \Sigma^\prime = 0$. This assumption makes the required computations quite straightforward. As noted above, we know that $p(\tilde{y}|y)$ is Gaussian, and thus it only remains to characterize the mean and covariance of this distribution. For the mean, we apply the law of iterated expectations to obtain \begin{align} \mathbb{E}[\tilde{y}|y] &= \mathbb{E}\left[\mathbb{E}[\tilde{y}|u,y]|y\right] \newline &= \mathbb{E}\left[\mathbb{E}[\tilde{G}u + \tilde{\epsilon}|u,y]|y\right] \newline &= \mathbb{E}\left[\tilde{G}u|y\right] \newline &= \tilde{G} \mathbb{E}\left[u|y\right] \newline &= \tilde{G}\overline{m}. \end{align} Note that everything is conditional on $y$ here; the law of iterated expectations is applied with respect to $u$, not $y$. The third equality uses the fact that $\mathbb{E}[\tilde{\epsilon}|u,y] = \mathbb{E}[\tilde{\epsilon}] = 0$, owing to the assumptions that $\tilde{\epsilon}$ is independent of both $\epsilon$ and $u$. Now, for the covariance we apply the law of total covariance \begin{align} \text{Cov}[a] = \mathbb{E}\,\text{Cov}[a|b] + \text{Cov}\left[\mathbb{E}(a|b) \right], \end{align} which holds for arbitrary (square integrable) random vectors $a,b$. Doing so (again, with everything conditional on $y$) we obtain: \begin{align} \text{Cov}[\tilde{y}|y] &= \mathbb{E}\left[\text{Cov}(\tilde{y}|u,y)|y\right] + \text{Cov}\left[\mathbb{E}(\tilde{y}|u,y)|y \right] \newline &= \mathbb{E}\left[\text{Cov}(\tilde{G}u + \tilde{\epsilon}|u,y)|y\right] + \text{Cov}\left[\tilde{G}u|y \right] \newline &= \mathbb{E}\left[\tilde{\Sigma}|y\right] + \text{Cov}\left[\tilde{G}u|y \right] \newline &= \tilde{\Sigma} + \tilde{G}\overline{C}\tilde{G}^\top. \end{align} We have used the fact that $\text{Cov}[\tilde{\epsilon}|u,y] = \text{Cov}[\tilde{\epsilon}] = \tilde{\Sigma}$, again due to the independence assumptions. Putting everything together, we have found that \begin{align} \tilde{y}|y &\sim \mathcal{N}\left(\tilde{G}\overline{m}, \tilde{\Sigma} + \tilde{G}\overline{C}\tilde{G}^\top \right). \end{align} This result is quite intuitive. The predictive mean simply results from propagating the posterior mean $\overline{m}$ through the new forward model $\tilde{G}$. The predictive covariance is the sum of the noise covariance $\tilde{\Sigma}$ and the covariance resulting from propagating the random variable $u|y$ through the new forward model $\tilde{G}$.
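This predictive formula translates directly into a short sketch. The names `G_tilde` and `Sigma_tilde` below are hypothetical stand-ins for the new forward model and noise covariance, and `m_post`, `C_post` are the posterior moments computed in the earlier sketches.

```python
import numpy as np

def posterior_predictive_uncorrelated(G_tilde, Sigma_tilde, m_post, C_post):
    """Mean and covariance of y_tilde | y when Cov[eps_tilde, eps] = 0."""
    mean = G_tilde @ m_post                           # G_tilde m_bar
    cov = Sigma_tilde + G_tilde @ C_post @ G_tilde.T  # Sigma_tilde + G_tilde C_bar G_tilde^T
    return mean, cov
```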
Correlated Errors
We now deal with the general case where $\Sigma^\prime$ may be non-zero. The calculations here are trickier given that $\tilde{\epsilon}|y$ is no longer equal in distribution to $\tilde{\epsilon}$, as was the case above. The reason is that $y$ contains information about $\epsilon$, and $\epsilon$ contains information about $\tilde{\epsilon}$ since they are correlated. Therefore, in this general setting we find it easiest to proceed by considering the joint distribution of $(\tilde{y},y)$ and then applying the Gaussian conditioning identities to obtain the distribution of $\tilde{y}|y$. The joint distribution is given by \begin{align} \begin{bmatrix} \tilde{y} \newline y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} \tilde{G}m \newline Gm \end{bmatrix}, \begin{bmatrix} \tilde{G}C\tilde{G}^\top + \tilde{\Sigma} & \tilde{G}CG^\top + \Sigma^\prime \newline GC\tilde{G}^\top + [\Sigma^\prime]^\top & GCG^\top + \Sigma \end{bmatrix} \right). \end{align} The cross covariance follows from \begin{align} \text{Cov}[\tilde{G}u + \tilde{\epsilon}, Gu + \epsilon] &= \text{Cov}[\tilde{G}u, Gu] + \text{Cov}[\tilde{\epsilon}, \epsilon] = \tilde{G}CG^\top + \Sigma^\prime, \end{align} since $\text{Cov}[u, \tilde{\epsilon}] = \text{Cov}[u, \epsilon] = 0$. The marginal calculations follow similarly. We can now apply the Gaussian conditioning identities to obtain \begin{align} \mathbb{E}[\tilde{y}|y] &= \tilde{G}m + \left[\tilde{G}CG^\top + \Sigma^\prime \right] \left(GCG^\top + \Sigma \right)^{-1} \left[y - Gm \right] \newline \text{Cov}[\tilde{y}|y] &= \tilde{\Sigma} + \tilde{G}C\tilde{G}^\top - \left[\tilde{G}CG^\top + \Sigma^\prime \right] \left(GCG^\top + \Sigma \right)^{-1} \left[GC\tilde{G}^\top + [\Sigma^\prime]^\top \right]. \end{align}
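The general formulas also translate directly to code. In the sketch below, `Sigma_cross` plays the role of $\Sigma^\prime = \text{Cov}[\tilde{\epsilon}, \epsilon]$ and the remaining names mirror the earlier sketches; all of them are illustrative choices, not a fixed API.

```python
import numpy as np

def posterior_predictive_correlated(G, G_tilde, m, C, Sigma, Sigma_tilde, Sigma_cross, y):
    """Mean and covariance of y_tilde | y, allowing Cov[eps_tilde, eps] = Sigma_cross != 0."""
    S = G @ C @ G.T + Sigma                                    # Cov[y]
    K = (G_tilde @ C @ G.T + Sigma_cross) @ np.linalg.inv(S)   # Cov[y_tilde, y] Cov[y]^{-1}
    mean = G_tilde @ m + K @ (y - G @ m)
    cov = Sigma_tilde + G_tilde @ C @ G_tilde.T - K @ (G @ C @ G_tilde.T + Sigma_cross.T)
    return mean, cov
```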
Example: Linear Regression
Perhaps the most common example of a linear Gaussian model takes the form of a Bayesian linear regression model. In this setting, we suppose that we have access to observed input-output pairs $(x_i, y_i)_{i=1}^{n}$ and assume that each $y_i$ arises as a linear function of $x_i$, perturbed by additive noise. While different formulations are possible, a popular specification assumes \begin{align} y_i|x_i, \beta &\sim \mathcal{N}(x_i^\top \beta, \sigma^2), \end{align} independently across observations, meaning that the observation noise is iid with common variance $\sigma^2$. If we stack the inputs row-wise into a matrix $X \in \mathbb{R}^{n \times d}$ and the outputs into a vector $y \in \mathbb{R}^n$, then this model can be written as \begin{align} y &= X\beta + \epsilon \newline \epsilon &\sim \mathcal{N}(0, \sigma^2 I_n) \newline \beta &\sim \mathcal{N}(m, \sigma^2 C), \end{align} where we have also assumed a Gaussian prior on $\beta$. Connecting to our generic inverse problem setup, we see that the forward model $G = X$ is given by the data matrix, while the parameter $u = \beta$ is the coefficient vector. The parameterization of the prior covariance as $\sigma^2 C$ is common in this setting, as it leads to some convenient cancellations in the posterior formulas. Indeed, applying (4) gives the posterior $\beta|y \sim \mathcal{N}(\overline{m}, \overline{C})$, with covariance \begin{align} \overline{C} &= \left[\frac{1}{\sigma^2} X^\top X + \frac{1}{\sigma^2} C^{-1} \right]^{-1} = \sigma^2 \left[X^\top X + C^{-1} \right]^{-1}. \end{align} The posterior mean is thus \begin{align} \overline{m} &= \overline{C} \left[\frac{1}{\sigma^2} X^\top y + \frac{1}{\sigma^2} C^{-1}m \right] = \left[X^\top X + C^{-1} \right]^{-1} \left[X^\top y + C^{-1}m \right], \end{align} since the $\sigma^2$ factor in $\overline{C}$ cancels with the $1/\sigma^2$ factors inside the brackets.
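The regression posterior above can be reproduced in a few lines of NumPy. The data-generating choices below (`n`, `d`, `beta_true`, `sigma2`, the prior scale `C0`, which plays the role of $C$ in the $\sigma^2 C$ parameterization) are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))                         # design matrix
beta_true = np.array([1.0, -2.0, 0.5])              # "true" coefficients, for simulation only
sigma2 = 0.25                                       # noise variance sigma^2
y = X @ beta_true + np.sqrt(sigma2) * rng.normal(size=n)

m0 = np.zeros(d)                                    # prior mean m
C0 = 10.0 * np.eye(d)                               # prior covariance is sigma2 * C0

# Posterior beta | y ~ N(m_post, C_post), using the formulas derived above.
A = X.T @ X + np.linalg.inv(C0)                     # X^T X + C^{-1}
C_post = sigma2 * np.linalg.inv(A)                  # sigma^2 [X^T X + C^{-1}]^{-1}
m_post = np.linalg.solve(A, X.T @ y + np.linalg.inv(C0) @ m0)  # note: no sigma^2 dependence
```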
Marginal Likelihood
TODO
Numerically Implementing the Posterior Equations
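One reasonable option, sketched below under the same assumed variables `G`, `m`, `C`, `Sigma`, `y` as in the earlier examples, is to avoid forming explicit matrix inverses and instead evaluate the covariance form (9) via a Cholesky factorization of $GCG^\top + \Sigma$ with triangular solves; this is a sketch of one approach, not a prescription.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def posterior_cholesky(G, m, C, Sigma, y):
    """Evaluate the covariance form (9) via a Cholesky factorization of GCG^T + Sigma."""
    S = G @ C @ G.T + Sigma           # symmetric positive definite
    S_factor = cho_factor(S)          # Cholesky factorization of S
    K = cho_solve(S_factor, G @ C).T  # K = C G^T S^{-1}, without forming S^{-1} explicitly
    m_post = m + K @ (y - G @ m)
    C_post = C - K @ G @ C
    return m_post, C_post
```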
Appendix
Joint Gaussian Distribution
Throughout this post we rely on the claim that various quantities are jointly Gaussian distributed. We verify these claims here. In the proofs, we use the fact that a random vector is Gaussian if and only if it is equal in distribution to $a + Bz$, for $z \sim \mathcal{N}(0, I)$ and some constant vector $a$ and matrix $B$. Any random variables labelled "$z$" in this section should be interpreted as standard Gaussians, with subscripts indicating the random variables that they generate; e.g., $z_u$ is the standard Gaussian vector used to generate $u$.
Joint distribution: $(u, y)$.
We first verify that the vector $(u, y)^\top$ has a joint Gaussian distribution
under model (1). Taking the square roots of the covariance matrices allows us
to write the correlated Gaussian variables as transformations of iid Gaussian noise.
We have,
\begin{align}
\begin{bmatrix} u \newline y \end{bmatrix}
&\overset{d}{=}
\begin{bmatrix} u \newline Gu + \epsilon \end{bmatrix} \newline
&\overset{d}{=}
\begin{bmatrix} m + C^{1/2}z_u \newline G\left(m + C^{1/2}z_u\right) + \Sigma^{1/2}z_{\epsilon} \end{bmatrix} \newline
&=
\begin{bmatrix} m \newline Gm \end{bmatrix} +
\begin{bmatrix} C^{1/2}z_u \newline
GC^{1/2}z_u + \Sigma^{1/2}z_{\epsilon} \end{bmatrix} \newline
&=
\begin{bmatrix} m \newline Gm \end{bmatrix} +
\begin{bmatrix} C^{1/2} & 0 \newline GC^{1/2} & \Sigma^{1/2} \end{bmatrix}
\begin{bmatrix} z_u \newline z_{\epsilon} \end{bmatrix}.
\end{align}
Under the assumption $u \perp \epsilon$ it follows that $z_u$ and $z_{\epsilon}$
are independent, so $(z_u, z_{\epsilon})^\top \sim \mathcal{N}(0, I_{d+n})$.
The vector $(u, y)^\top$ is thus an affine transformation of a standard Gaussian
vector, verifying the claim.
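The construction above can also be checked numerically: with $B$ the block matrix in the last display, $BB^\top$ should reproduce the joint covariance in (5). The sketch below uses Cholesky factors as the matrix square roots and assumes the `G`, `C`, `Sigma` names from the earlier examples.

```python
import numpy as np

C_half = np.linalg.cholesky(C)          # C^{1/2} in the sense C_half @ C_half.T == C
Sigma_half = np.linalg.cholesky(Sigma)  # Sigma^{1/2}
n, d = G.shape

B = np.block([[C_half, np.zeros((d, n))],
              [G @ C_half, Sigma_half]])
cov_joint = np.block([[C, C @ G.T],
                      [G @ C, G @ C @ G.T + Sigma]])
assert np.allclose(B @ B.T, cov_joint)  # B B^T matches the covariance block matrix in (5)
```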
Joint distribution: $(\tilde{y}, y)$.
We similarly show that $(\tilde{y}, y)^\top$ is joint Gaussian under model (13). This fact is used in the derivation of the posterior predictive distribution. We recall that we are allowing the noise terms $\tilde{\epsilon}$ and $\epsilon$ to be correlated; let's partition the square root of their joint covariance by \begin{align} \text{Cov}[(\tilde{\epsilon}, \epsilon)]^{1/2} &= \begin{bmatrix} B_{11} & B_{12} \newline B_{21} & B_{22} \end{bmatrix}. \end{align} We then have \begin{align} \begin{bmatrix} \tilde{y} \newline y \end{bmatrix} &\overset{d}{=} \begin{bmatrix} \tilde{G}u \newline Gu\end{bmatrix} + \begin{bmatrix} \tilde{\epsilon} \newline \epsilon \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} \tilde{G}\left(m + C^{1/2}z_u \right)\newline G\left(m + C^{1/2}z_u \right) \end{bmatrix} + \begin{bmatrix} B_{11} & B_{12} \newline B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_1 \newline z_2 \end{bmatrix} \newline &= \begin{bmatrix} \tilde{G}m \newline Gm \end{bmatrix} + \begin{bmatrix} \tilde{G}C^{1/2} & B_{11} & B_{12} \newline GC^{1/2} & B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_u \newline z_1 \newline z_2 \end{bmatrix}. \end{align} Again, the independence of $z_u$ with respect to $z_1$ and $z_2$ is assured by the assumptions $u \perp \tilde{\epsilon}$ and $u \perp \epsilon$. Thus, $(z_u, z_1, z_2)^\top \sim \mathcal{N}(0, I)$, so $(\tilde{y}, y)^\top$ is an affine transformation of a standard Gaussian vector and hence jointly Gaussian. The matrices $B_{12}$ and $B_{21}$ serve to "mix up" the two independent noise sources $z_1$ and $z_2$ in order to produce the correlation between $\tilde{\epsilon}$ and $\epsilon$. If $\tilde{\epsilon}$ and $\epsilon$ are assumed uncorrelated then $B_{12} = B_{21} = 0$. Note also that essentially the same derivation shows that $(u, \tilde{y}, y)^\top$ is also jointly Gaussian; the "$B$" matrix is simply augmented with the addition of the row $\begin{bmatrix} C^{1/2} & 0 & 0 \end{bmatrix}$.