Linear Gaussian Inverse Problems
Derivations and discussion of linear Gaussian inverse problems.
This post focuses on Bayesian inverse problems with the following features:
- Linear forward model.
- Additive Gaussian observation noise.
- Gaussian prior distribution.
- Prior independence of the parameter and the observation noise.
We refer to such inverse problems as linear Gaussian. The typical Bayesian linear regression model with a Gaussian prior on the coefficients constitutes a common example of a linear Gaussian inverse problem. The assumptions of linearity and Gaussianity play quite nicely together, resulting in a closed-form Gaussian posterior distribution. Moreover, many extensions to nonlinear and/or non-Gaussian settings rely on methods rooted in our understanding of the linear Gaussian regime.
Setup
We consider the following linear Gaussian regression model \begin{align} y &= Gu + \epsilon \tag{1} \newline \epsilon &\sim \mathcal{N}(0, \Sigma) \newline u &\sim \mathcal{N}(m, C), && u \perp \epsilon \end{align} consisting of the observation (i.e., data) $y \in \mathbb{R}^n$, parameter $u \in \mathbb{R}^d$, noise $\epsilon \in \mathbb{R}^n$, and linear forward model represented by the matrix $G \in \mathbb{R}^{n \times d}$. The observation covariance $\Sigma \in \mathbb{R}^{n \times n}$ and prior covariance $C \in \mathbb{R}^{d \times d}$ are both fixed positive definite matrices. The vector $m \in \mathbb{R}^d$ is the prior mean. We write $u \perp \epsilon$ to indicate the key assumption that $u$ and $\epsilon$ are a priori probabilistically independent. The model (1) can equivalently be written as \begin{align} y|u &\sim \mathcal{N}(Gu, \Sigma) \tag{2} \newline u &\sim \mathcal{N}(m, C), \end{align} which gives the explicit expression for the Gaussian likelihood $p(y|u)$. The solution of the Bayesian inverse problem is the posterior distribution $p(u|y)$. We provide two approaches to calculating this distribution below, which yield different (but equivalent) expressions.
The Posterior Distribution
The ubiquity of the linear Gaussian model stems from its analytic tractability. The assumptions of linearity and Gaussianity play nicely together, resulting in a Gaussian posterior distribution that we can characterize in closed form. In Bayesian statistics, this is called a conjugate model, as the prior and posterior belong to the same distributional family (Gaussian). Once it is established that the posterior is Gaussian, it remains only to specify its mean and covariance matrix. We give two equivalent expressions for these quantities below, with complete derivations given in the appendix. The first formulation is the one typically seen in a standard Bayesian linear regression context, where $d \leq n$; it is therefore advantageous to work with quantities in $\mathbb{R}^d$ rather than $\mathbb{R}^n$. In this first set of equations, the main computation (the inversion of a matrix) takes place in $\mathbb{R}^d$, so we refer to this formulation as the parameter space update. The second set of equations instead requires the inversion of an $n \times n$ matrix. This data space update is commonly seen in data assimilation applications, where $d \geq n$.
Posterior moments: parameter space update.
The posterior distribution under the linear Gaussian model (1) is Gaussian $u|y \sim \mathcal{N}(m_{\star}, C_{\star})$, with moments \begin{align} m_{\star} &= C_{\star} \left[G^\top \Sigma^{-1}y + C^{-1}m \right] \tag{3} \newline C_{\star} &= \left[G^\top \Sigma^{-1} G + C^{-1} \right]^{-1}. \end{align}
Posterior moments: data space update.
The posterior moments can equivalently be computed as \begin{align} m_{\star} &= m + CG^\top [GCG^\top + \Sigma]^{-1}(y - Gm) \tag{4} \newline C_{\star} &= C - CG^\top [GCG^\top + \Sigma]^{-1} GC. \end{align}
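As a quick numerical sanity check, the following NumPy sketch (with made-up dimensions and matrices, not tied to any particular application) computes the posterior moments both ways and confirms that the parameter space update (3) and the data space update (4) agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3                                    # data and parameter dimensions

G = rng.normal(size=(n, d))                    # linear forward model
Sigma = 0.1 * np.eye(n)                        # observation noise covariance
m = np.zeros(d)                                # prior mean
C = np.eye(d)                                  # prior covariance

u_true = rng.normal(size=d)
y = G @ u_true + rng.multivariate_normal(np.zeros(n), Sigma)

# Parameter space update (3): the matrix inversion happens in R^d.
C_post = np.linalg.inv(G.T @ np.linalg.solve(Sigma, G) + np.linalg.inv(C))
m_post = C_post @ (G.T @ np.linalg.solve(Sigma, y) + np.linalg.solve(C, m))

# Data space update (4): the matrix inversion happens in R^n.
K = C @ G.T @ np.linalg.inv(G @ C @ G.T + Sigma)
m_post_ds = m + K @ (y - G @ m)
C_post_ds = C - K @ G @ C

assert np.allclose(m_post, m_post_ds) and np.allclose(C_post, C_post_ds)
```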
The Optimization Perspective
In this section we consider the maximum a posteriori (MAP) estimate
\begin{equation} u_{\star} := \text{argmax}_{u \in \mathbb{R}^d} p(u|y). \tag{5} \end{equation}
Since $p(u|y)$ is Gaussian, its mean corresponds with its mode. Thus, the MAP estimate is equal to the posterior mean $m_{\star}$, given in (3) or (4). Note that maximizing $p(u|y)$ is equivalent to minimizing $-\log p(u|y)$, since the logarithm is monotonic. Therefore, we can view the MAP estimate as the minimizer of the loss function \begin{equation} J(u) := -\log p(u|y) = -\log \mathcal{N}(y|Gu, \Sigma) - \log \mathcal{N}(u|m,C) + \text{const}, \tag{6} \end{equation} where the constant $\log p(y)$ does not depend on $u$ and does not affect the minimizer. For a positive definite matrix $A$, let's introduce the following notation for the inner product and norm weighted by $A^{-1}$:
\begin{align} \langle v, w\rangle_{A} &:= \langle A^{-1}v, w\rangle, &&\lVert v\rVert^2_{A} := \langle v, v\rangle_{A}, \tag{7} \end{align}
where $\langle \cdot, \cdot \rangle$ and $\lVert \cdot \rVert$ denote the standard Euclidean inner product and norm, respectively. By substituting this notation, and dropping additive constants that do not depend on $u$, we find the following.
MAP Estimate.
The MAP estimate of the linear Gaussian inverse problem (1) solves \begin{equation} u_{\star} := \text{argmin}_{u \in \mathbb{R}^d} \ J(u), \end{equation} with the loss function given by \begin{equation} J(u) = \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2} \lVert u-m\rVert^2_{C}. \tag{8} \end{equation}
Observe that the MAP estimate solves a regularized least squares (i.e., ridge regression) problem. The gradient and Hessian of this loss function prove to be useful quantities, both in practice and for theoretical understanding. The following result gives expressions for these quantities, with derivations provided in the appendix.
Gradient and Hessian of Loss.
The loss function above has gradient and Hessian given by \begin{align} \nabla J(u) &= -G^\top \Sigma^{-1}(y-Gu) + C^{-1}(u-m) \tag{9} \newline \nabla^2 J(u) &\equiv G^\top \Sigma^{-1}G + C^{-1}. \end{align}
Notice that the Hessian does not vary with $u$, as expected since the loss is a quadratic function of $u$. In addition, observe the similarity between the Hessian and the posterior covariance in (3). The following result summarizes this connection, providing a link between the optimization perspective and the posterior moments.
Connection with posterior moments.
The moments of the posterior distribution satisfy \begin{align} &\nabla J(m_{\star}) = 0, &&C^{-1}_{\star} = \nabla^2 J. \end{align} In words, this says that
1. The posterior mean $m_{\star}$ minimizes the loss function $J(u)$.
2. The posterior precision (inverse covariance) $C_{\star}^{-1}$ is given by the Hessian $\nabla^2 J$.
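Both identities are easy to check numerically. The following sketch (not part of the derivation; it reuses `G`, `Sigma`, `C`, `m`, `y`, `m_post`, and `C_post` from the earlier snippet) evaluates the gradient at the posterior mean and compares the Hessian with the posterior precision.

```python
import numpy as np

def grad_J(u):
    # Gradient of the regularized least squares loss J(u).
    return -G.T @ np.linalg.solve(Sigma, y - G @ u) + np.linalg.solve(C, u - m)

hess_J = G.T @ np.linalg.solve(Sigma, G) + np.linalg.inv(C)   # constant in u

assert np.allclose(grad_J(m_post), 0.0, atol=1e-8)            # gradient vanishes at m_post
assert np.allclose(hess_J, np.linalg.inv(C_post))             # Hessian equals posterior precision
```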
Investigating the Posterior Equations
The Posterior Covariance
A first important observation is that the posterior covariance $C_{\star}$ is independent of the data $y$. In this sense, the specific data realization observed does not affect the uncertainty in the estimation of $u$. The expression coming from the first derivation (3) tells us that the posterior precision (inverse covariance) $C_{\star}^{-1}$ is the sum of the prior precision $C^{-1}$ and $G^\top \Sigma^{-1}G$, which is the observation precision $\Sigma^{-1}$ modified by the forward model. Since the posterior covariance is the inverse of $G^\top \Sigma^{-1}G + C^{-1}$, we should verify that this matrix is indeed invertible. First, note that $\Sigma^{-1}$ and $C^{-1}$ are both positive definite, since the inverse of a positive definite matrix is itself positive definite. Thus, the factorization $\Sigma^{-1} = SS^\top$ exists, which implies \begin{align} x^\top [G^\top \Sigma^{-1}G]x &= x^\top [G^\top SS^\top G]x = \lVert S^\top Gx \rVert^2_2 \geq 0. \tag{10} \end{align} That is, $G^\top \Sigma^{-1}G$ is positive semidefinite. Since the sum of a positive semidefinite and a positive definite matrix is positive definite, $G^\top \Sigma^{-1}G + C^{-1}$ is positive definite, and thus invertible.
The covariance expression in (4) provides an alternative perspective. First, the expression tells us that conditioning on the data $y$ never increases the variance. This can be seen by noting that the matrix $CG^\top [GCG^\top + \Sigma]^{-1} GC$ (which is subtracted from the prior covariance) is positive semidefinite, and thus in particular has nonnegative values on its diagonal. To show this, note that by an argument analogous to (10), $GCG^\top + \Sigma$ is the sum of a positive semidefinite and a positive definite matrix, and is therefore positive definite; hence so is its inverse, which admits a decomposition of the form $SS^\top$. Thus, \begin{align} x^\top \left(CG^\top [GCG^\top + \Sigma]^{-1} GC\right) x = x^\top \left(CG^\top SS^\top GC\right) x = \lVert S^\top GCx \rVert_2^2 \geq 0, \tag{11} \end{align} so $CG^\top [GCG^\top + \Sigma]^{-1} GC$ is indeed positive semidefinite. Note that the covariance expression in (4) can also be written as \begin{align} \text{Cov}[u|y] &= \text{Cov}[u] - \text{Cov}[u,y] \text{Cov}[y]^{-1} \text{Cov}[y, u]. \tag{12} \end{align}
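To make the variance reduction concrete, here is a small check (again reusing `C` and `C_post` from the first snippet, under the same synthetic setup): the difference between the prior and posterior covariance is positive semidefinite, so every marginal variance shrinks or stays the same.

```python
import numpy as np

reduction = C - C_post                                     # equals C G^T [G C G^T + Sigma]^{-1} G C
assert np.all(np.linalg.eigvalsh(reduction) >= -1e-10)     # PSD up to round-off
assert np.all(np.diag(C_post) <= np.diag(C) + 1e-10)       # marginal variances do not increase
```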
Posterior Predictive Distribution
Suppose that we are now interested in the outputs $\tilde{y}$ resulting from the application of a new forward map $\tilde{G} \in \mathbb{R}^{\tilde{n} \times d}$, assuming the same prior distribution on $u$ as in (1). That is, we extend the model as \begin{align} y &= Gu + \epsilon \tag{13} \newline \tilde{y} &= \tilde{G}u + \tilde{\epsilon} \newline u &\sim \mathcal{N}(m, C), && u \perp \epsilon. \end{align} It remains to specify the joint distribution of the noise terms. We will assume \begin{align} \begin{bmatrix} \tilde{\epsilon} \newline \epsilon \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \newline 0 \end{bmatrix}, \begin{bmatrix} \tilde{\Sigma} & \Sigma^\prime \newline [\Sigma^\prime]^\top & \Sigma \end{bmatrix} \right), \end{align} along with the independence assumption $u \perp \tilde{\epsilon}$. The classic example of this setup is prediction at a new set of inputs using a linear regression model. In this context, $G$ represents the design matrix and $\tilde{G}$ stacks a new set of inputs at which we would like to predict the corresponding responses.
In the generic inverse problem formulation, we see that this is a question of propagating the posterior uncertainty in $u$ through a new forward model $\tilde{G}$. Concretely, we are interested in characterizing the posterior predictive distribution \begin{align} p(\tilde{y} | y) &= \int p(\tilde{y},u|y)du = \int p(\tilde{y}|u,y)p(u|y) du. \end{align} Notice that the term $p(u|y)$ is the posterior distribution of the model parameters $u$. We begin by deriving the closed-form posterior predictive distribution in the common setting $\Sigma^\prime = 0$. We then turn to the general setting where $\tilde{\epsilon}$ and $\epsilon$ may be correlated. In either case, the posterior predictive is Gaussian. This is verified in the appendix by showing that $(\tilde{y}, y)$ is jointly Gaussian, and hence has Gaussian conditionals.
Uncorrelated Errors
We begin with the assumption of zero cross-covariance: $\Sigma^\prime = 0$. This assumption makes the required computations quite straightforward. As noted above, we know that $\tilde{y}|y$ is Gaussian, and thus it only remains to characterize the mean and covariance of this distribution. For the mean, we apply the law of iterated expectations to obtain \begin{align} \mathbb{E}[\tilde{y}|y] &= \mathbb{E}\left[\mathbb{E}[\tilde{y}|u,y]|y\right] \newline &= \mathbb{E}\left[\mathbb{E}[\tilde{G}u + \tilde{\epsilon}|u,y]|y\right] \newline &= \mathbb{E}\left[\tilde{G}u|y\right] \newline &= \tilde{G} \mathbb{E}\left[u|y\right] \newline &= \tilde{G} m_{\star}. \end{align} Note that everything is conditional on $y$ here; the law of iterated expectations is applied with respect to $u$, not $y$. The third equality uses the fact that $\mathbb{E}[\tilde{\epsilon}|u,y] = 0$, owing to the assumptions that $\tilde{\epsilon}$ is uncorrelated with both $u$ and $\epsilon$. Now, for the covariance we apply the law of total covariance \begin{align} \text{Cov}[x] = \mathbb{E}\text{Cov}[x|y] + \text{Cov}\left[\mathbb{E}(x|y) \right], \end{align} which holds for arbitrary (square integrable) random vectors $x$ and $y$. Doing so (again, with everything conditional on $y$) we obtain: \begin{align} \text{Cov}[\tilde{y}|y] &= \mathbb{E}\left[\text{Cov}(\tilde{y}|u,y)|y\right] + \text{Cov}\left[\mathbb{E}(\tilde{y}|u,y)|y \right] \newline &= \mathbb{E}\left[\text{Cov}(\tilde{G}u + \tilde{\epsilon}|u,y)|y\right] + \text{Cov}\left[\tilde{G}u|y \right] \newline &= \mathbb{E}\left[\tilde{\Sigma}|y\right] + \text{Cov}\left[\tilde{G}u|y \right] \newline &= \tilde{\Sigma} + \tilde{G}C_{\star}\tilde{G}^\top. \end{align} We have used the fact that $\text{Cov}(\tilde{G}u + \tilde{\epsilon}|u,y) = \text{Cov}(\tilde{\epsilon}) = \tilde{\Sigma}$, again due to the independence assumptions. Putting everything together, we have found that \begin{align} \tilde{y}|y &\sim \mathcal{N}\left(\tilde{G}m_{\star}, \tilde{\Sigma} + \tilde{G}C_{\star}\tilde{G}^\top \right). \end{align} This result is quite intuitive. The predictive mean simply results from propagating the posterior mean $m_{\star}$ through the new forward model $\tilde{G}$. The predictive covariance is the sum of the noise covariance $\tilde{\Sigma}$ and the covariance resulting from propagating the random variable $u|y$ through the new forward model $\tilde{G}$.
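A minimal sketch of this result, assuming uncorrelated errors and reusing `m_post`, `C_post`, and `d` from the snippets above (the names `G_new` and `Sigma_new` stand in for $\tilde{G}$ and $\tilde{\Sigma}$ and are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_new = 4
G_new = rng.normal(size=(n_new, d))        # new forward map (tilde G)
Sigma_new = 0.05 * np.eye(n_new)           # new noise covariance (tilde Sigma)

# Posterior predictive moments: tilde_y | y ~ N(G_new m_post, Sigma_new + G_new C_post G_new^T).
pred_mean = G_new @ m_post
pred_cov = Sigma_new + G_new @ C_post @ G_new.T
```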
Correlated Errors
We now deal with the general case where $\Sigma^\prime$ may be non-zero. The calculations here are trickier given that $\tilde{y}|u,y$ is no longer equal in distribution to $\tilde{y}|u$, as was the case above. The reason is that $y$ contains information about $\epsilon$, and $\epsilon$ contains information about $\tilde{\epsilon}$ since they are correlated. Therefore, in this general setting I find it easiest to proceed by considering the joint distribution of $(\tilde{y}, y)$ and then applying the Gaussian conditioning identities to obtain the distribution of $\tilde{y}|y$. The joint distribution is given by \begin{align} \begin{bmatrix} \tilde{y} \newline y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} \tilde{G}m \newline Gm \end{bmatrix}, \begin{bmatrix} \tilde{G}C\tilde{G}^\top + \tilde{\Sigma} & \tilde{G}CG^\top + \Sigma^\prime \newline GC\tilde{G}^\top + [\Sigma^\prime]^\top & GCG^\top + \Sigma \end{bmatrix} \right). \end{align} The cross covariance follows from \begin{align} \text{Cov}[\tilde{G}u + \tilde{\epsilon}, Gu + \epsilon] &= \text{Cov}[\tilde{G}u, Gu] + \text{Cov}[\tilde{\epsilon}, \epsilon] = \tilde{G}CG^\top + \Sigma^\prime, \end{align} since $u \perp (\epsilon, \tilde{\epsilon})$. The marginal calculations follow similarly. We can now apply the Gaussian conditioning identities to obtain \begin{align} \mathbb{E}[\tilde{y}|y] &= \tilde{G}m + \left[\tilde{G}CG^\top + \Sigma^\prime \right] \left(GCG^\top + \Sigma \right)^{-1} \left[y - Gm \right] \newline \text{Cov}[\tilde{y}|y] &= \tilde{\Sigma} + \tilde{G}C\tilde{G}^\top - \left[\tilde{G}CG^\top + \Sigma^\prime \right] \left(GCG^\top + \Sigma \right)^{-1} \left[GC\tilde{G}^\top + [\Sigma^\prime]^\top \right]. \end{align}
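As a sketch of the general case, the snippet below (reusing variables from the previous snippets, with the made-up name `Sigma_cross` playing the role of $\Sigma^\prime$; the value chosen is arbitrary but yields a valid joint covariance) evaluates these conditioning formulas directly, and checks that setting $\Sigma^\prime = 0$ recovers the uncorrelated-error result above.

```python
import numpy as np

Sigma_cross = 0.01 * np.ones((n_new, n))            # Cov[tilde eps, eps], i.e. Sigma'

cov_y = G @ C @ G.T + Sigma                         # Cov[y]
cov_cross = G_new @ C @ G.T + Sigma_cross           # Cov[tilde y, y]
gain = cov_cross @ np.linalg.inv(cov_y)

pred_mean_corr = G_new @ m + gain @ (y - G @ m)
pred_cov_corr = Sigma_new + G_new @ C @ G_new.T - gain @ cov_cross.T

# Sanity check: with Sigma_cross = 0 this reduces to the uncorrelated-error predictive moments.
gain0 = (G_new @ C @ G.T) @ np.linalg.inv(cov_y)
assert np.allclose(G_new @ m + gain0 @ (y - G @ m), pred_mean)
assert np.allclose(Sigma_new + G_new @ C @ G_new.T - gain0 @ (G_new @ C @ G.T).T, pred_cov)
```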
Example: Linear Regression
Perhaps the most common example of a linear Gaussian model is Bayesian linear regression. In this setting, we suppose that we have access to observed input-output pairs $\{(x_i, y_i)\}_{i=1}^{n}$ and assume that the $y_i$ arise as a linear function of the $x_i$, perturbed by noise. While different formulations are possible, a popular specification assumes \begin{equation} y_i|x_i, \beta \sim \mathcal{N}(x_i^\top \beta, \sigma^2), \end{equation} meaning that the observation noise is iid across observations. If we stack the inputs row-wise into a matrix $X \in \mathbb{R}^{n \times d}$ and the outputs into a vector $y \in \mathbb{R}^n$, then this model can be written as \begin{align} y &= X\beta + \epsilon \newline \epsilon &\sim \mathcal{N}(0, \sigma^2 I_n) \newline \beta &\sim \mathcal{N}(m, \sigma^2 C), \end{align} where we have also assumed a Gaussian prior on $\beta$. Connecting to our generic inverse problem setup, we see that the forward model $G = X$ is given by the data matrix, while the parameter $u = \beta$ is the coefficient vector. The parameterization of the prior covariance as $\sigma^2 C$ is common in this setting, as it leads to some convenient cancellations in the posterior formulas. Indeed, applying (3) gives the posterior $\beta|y \sim \mathcal{N}(m_{\star}, C_{\star})$, with covariance \begin{align} C_{\star} &= \left[\frac{1}{\sigma^2} X^\top X + \frac{1}{\sigma^2} C^{-1} \right]^{-1} = \sigma^2 \left[X^\top X + C^{-1} \right]^{-1}. \end{align} The posterior mean is thus \begin{equation} m_{\star} = C_{\star} \left[\frac{1}{\sigma^2} X^\top y + \frac{1}{\sigma^2} C^{-1}m \right] = \left[X^\top X + C^{-1} \right]^{-1} \left[X^\top y + C^{-1}m \right], \end{equation} since the $\sigma^2$ factor from the covariance cancels with the $1/\sigma^2$ terms.
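Here is a minimal sketch of this special case on synthetic data (all names and values are made up for illustration); it computes the posterior moments using the simplified formulas above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, p = 50, 4
X = rng.normal(size=(n_obs, p))                     # design matrix (plays the role of G)
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
sigma = 0.3
y_obs = X @ beta_true + sigma * rng.normal(size=n_obs)

m_prior = np.zeros(p)
C_prior = np.eye(p)                                 # prior: beta ~ N(m_prior, sigma^2 C_prior)

# Posterior moments; note the sigma^2 factors cancel in the mean, as shown above.
A = X.T @ X + np.linalg.inv(C_prior)
beta_mean = np.linalg.solve(A, X.T @ y_obs + np.linalg.solve(C_prior, m_prior))
beta_cov = sigma**2 * np.linalg.inv(A)
```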
Appendix
Joint Gaussian Distribution
Throughout this post we rely on the claim that various quantities are jointly Gaussian distributed. We verify these claims here. In the proofs, we use the fact that a random vector $x$ is Gaussian if and only if it is equal in distribution to $a + Bz$, for $z \sim \mathcal{N}(0, I)$ and some constant vector $a$ and matrix $B$. Any random variables labelled "$z$" in this section should be interpreted as standard Gaussians, with subscripts potentially indicating the random variables that they generate; e.g., $u \overset{d}{=} m + C^{1/2}z_u$ with $z_u \sim \mathcal{N}(0, I_d)$.
Joint distribution: $(u, y)$.
We first verify that the vector $(u, y)$ has a joint Gaussian distribution
under model (1). Taking the square roots of the covariance matrices allows us
to write the correlated Gaussian variables as transformations of iid Gaussian noise.
We have,
\begin{align}
\begin{bmatrix} u \newline y \end{bmatrix}
&\overset{d}{=}
\begin{bmatrix} u \newline Gu + \epsilon \end{bmatrix} \newline
&\overset{d}{=}
\begin{bmatrix} m + C^{1/2}z_u \newline G\left(m + C^{1/2}z_u\right) + \Sigma^{1/2}z_{\epsilon} \end{bmatrix} \newline
&\overset{d}{=}
\begin{bmatrix} m \newline Gm \end{bmatrix} +
\begin{bmatrix} C^{1/2}z_u \newline
GC^{1/2}z_u + \Sigma^{1/2}z_{\epsilon} \end{bmatrix} \newline
&\overset{d}{=}
\begin{bmatrix} m \newline Gm \end{bmatrix} +
\begin{bmatrix} C^{1/2} & 0 \newline GC^{1/2} & \Sigma^{1/2} \end{bmatrix}
\begin{bmatrix} z_u \newline z_{\epsilon} \end{bmatrix}.
\end{align}
Under the assumption $u \perp \epsilon$ it follows that $z_u$ and $z_{\epsilon}$ are independent,
so $(z_u, z_{\epsilon}) \sim \mathcal{N}(0, I_{d+n})$, thus verifying the claim.
Joint distribution: $(\tilde{y}, y)$.
We similarly show that $(\tilde{y}, y)$ is jointly Gaussian under model (13). This fact is used in the derivation of the posterior predictive distribution. We recall that we are allowing the noise terms $\tilde{\epsilon}$ and $\epsilon$ to be correlated; let's partition the square root of their joint covariance as \begin{align} \text{Cov}[(\tilde{\epsilon}, \epsilon)]^{1/2} &= \begin{bmatrix} B_{11} & B_{12} \newline B_{21} & B_{22} \end{bmatrix}. \end{align} We then have \begin{align} \begin{bmatrix} \tilde{y} \newline y \end{bmatrix} &\overset{d}{=} \begin{bmatrix} \tilde{G}u \newline Gu\end{bmatrix} + \begin{bmatrix} \tilde{\epsilon} \newline \epsilon \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} \tilde{G}\left(m + C^{1/2}z_u \right)\newline G\left(m + C^{1/2}z_u \right) \end{bmatrix} + \begin{bmatrix} B_{11} & B_{12} \newline B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_1 \newline z_2 \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} \tilde{G}m \newline Gm \end{bmatrix} + \begin{bmatrix} \tilde{G}C^{1/2} & B_{11} & B_{12} \newline GC^{1/2} & B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_u \newline z_1 \newline z_2 \end{bmatrix}. \end{align} Again, the independence of $z_u$ with respect to $z_1$ and $z_2$ is assured by the assumptions $u \perp \epsilon$ and $u \perp \tilde{\epsilon}$. Thus, $(z_u, z_1, z_2) \sim \mathcal{N}(0, I)$, which verifies the claim. The matrices $B_{12}$ and $B_{21}$ serve to "mix up" the two independent noise sources $z_1$ and $z_2$ in order to produce the correlations between $\tilde{\epsilon}$ and $\epsilon$. If $\tilde{\epsilon}$ and $\epsilon$ are assumed uncorrelated, then $B_{12} = B_{21} = 0$. Note also that essentially the same derivation shows that $(\tilde{y}, y, u)$ is also jointly Gaussian; the "$B$" matrix is simply augmented with the addition of the row $\begin{bmatrix} C^{1/2} & 0 & 0 \end{bmatrix}$.
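As a quick Monte Carlo sanity check of this construction (a sketch with made-up matrices; `Gt_mc` and the cross block of `noise_cov` stand in for $\tilde{G}$ and $\Sigma^\prime$), we can sample from the extended model and compare the empirical cross-covariance of $(\tilde{y}, y)$ against the block formula used in the posterior predictive derivation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_y, n_t, d_u = 3, 2, 2
G_mc = rng.normal(size=(n_y, d_u))                 # forward model G
Gt_mc = rng.normal(size=(n_t, d_u))                # new forward model tilde G
m_mc, C_mc = np.zeros(d_u), np.eye(d_u)

# Joint noise covariance for (tilde eps, eps), with a nonzero cross block Sigma'.
noise_cov = np.block([
    [0.2 * np.eye(n_t),          0.05 * np.ones((n_t, n_y))],
    [0.05 * np.ones((n_y, n_t)), 0.3 * np.eye(n_y)],
])

N = 200_000
u_s = rng.multivariate_normal(m_mc, C_mc, size=N)
eps_s = rng.multivariate_normal(np.zeros(n_t + n_y), noise_cov, size=N)
yt_s = u_s @ Gt_mc.T + eps_s[:, :n_t]              # samples of tilde y
y_s = u_s @ G_mc.T + eps_s[:, n_t:]                # samples of y

emp_cross = np.cov(yt_s, y_s, rowvar=False)[:n_t, n_t:]            # empirical Cov[tilde y, y]
theory_cross = Gt_mc @ C_mc @ G_mc.T + 0.05 * np.ones((n_t, n_y))  # tilde G C G^T + Sigma'
assert np.allclose(emp_cross, theory_cross, atol=0.05)             # loose Monte Carlo tolerance
```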
Posterior Derivations
Parameter Space Update
We now derive the posterior moments appearing in (3). This derivation also
provides another way to show that the posterior is Gaussian. The strategy here
is to apply Bayes’ theorem and then use the matrix analog of completing the
square from elementary algebra. Any quantities that do not depend on $u$
will be absorbed into the proportionality symbol, and hence are dropped
from the expression. Applying Bayes’ theorem to (2) yields
\begin{align}
p(u|y)
&\propto p(y|u)p(u) \newline
&\propto \exp\left(-\frac{1}{2}\left[(y-Gu)^\top \Sigma^{-1} (y-Gu) + (u-m)^\top C^{-1} (u-m) \right] \right) \newline
&\propto \exp\left(-\frac{1}{2}\left[u^\top(G^\top \Sigma^{-1}G + C^{-1})u -
2u^\top(G^\top \Sigma^{-1}y + C^{-1}m)\right] \right). \tag{A1}
\end{align}
All we have done above is to combine the Gaussian likelihood and prior, dropping
any multiplicative constants that don't depend on $u$, and grouping like terms
in $u$. Note that since (A1) is an exponential of a quadratic in $u$,
we immediately know that the posterior must be Gaussian. It therefore remains
to find the mean $m_{\star}$ and covariance $C_{\star}$. Knowing that (A1) is
proportional to a Gaussian density, let's set the term in square brackets equal to
\begin{align}
(u - m_{\star})^\top C_{\star}^{-1} (u - m_{\star})
= u^\top C_{\star}^{-1}u - 2u^\top C_{\star}^{-1} m_{\star} +
m_{\star}^\top C_{\star}^{-1} m_{\star} \tag{A2}
\end{align}
and equate like terms to solve for the unknowns $m_{\star}$ and $C_{\star}$.
Doing so, we find that
\begin{align}
C_{\star}^{-1} &= G^\top \Sigma^{-1} G + C^{-1} \newline
C_{\star}^{-1}m_{\star} &= G^\top \Sigma^{-1}y + C^{-1}m. \tag{A3}
\end{align}
The constant term $m_{\star}^\top C_{\star}^{-1} m_{\star}$ is not a problem, as it
does not depend on $u$ and is simply absorbed into the proportionality sign.
Rearranging the above expressions gives the desired mean and covariance equations
in (3).
Data Space Update
We now present a second method for computing the posterior moments, resulting in the data space update (4). The general idea here is to observe that $(u, y)$ follows a joint Gaussian distribution, and thus the posterior distribution $p(u|y)$ is simply a conditional of this joint distribution. Since Gaussian conditionals can be characterized in closed form, we can simply apply the well-known Gaussian conditioning formulas to obtain the equations in (4). Note that we have already verified earlier in the appendix that $(u, y)$ is a Gaussian random vector, which relied on the assumed prior independence of $u$ and $\epsilon$. Writing out this joint Gaussian explicitly gives \begin{align} \begin{bmatrix} u \newline y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} m \newline Gm \end{bmatrix}, \begin{bmatrix} C & CG^\top \newline GC & GCG^\top + \Sigma \end{bmatrix} \right). \tag{A4} \end{align} The mean and covariance of $u$ are immediate from (1), and the remaining quantities are computed as: \begin{align} \mathbb{E}[y] &= \mathbb{E}[Gu + \epsilon] = G\mathbb{E}[u] + \mathbb{E}[\epsilon] = Gm \tag{A5} \newline \text{Cov}[y] &= \text{Cov}[Gu + \epsilon] = \text{Cov}[Gu] + \text{Cov}[\epsilon] = GCG^\top + \Sigma \tag{A6} \newline \text{Cov}[y, u] &= \text{Cov}[Gu + \epsilon, u] = \text{Cov}[Gu, u] + \text{Cov}[\epsilon, u] = GC. \tag{A7} \end{align} In (A5) we use the linearity of expectation and the fact that the noise $\epsilon$ is zero-mean. The covariance splits into the sum in (A6) due to the independence of $u$ and $\epsilon$. This independence assumption is similarly leveraged in (A7). The Gaussian conditioning identities give \begin{align} m_{\star} &= \mathbb{E}[u] + \text{Cov}[u,y]\text{Cov}[y]^{-1}(y-\mathbb{E}[y]) \tag{A8} \newline C_{\star} &= \text{Cov}[u] - \text{Cov}[u,y]\text{Cov}[y]^{-1}\text{Cov}[y,u]. \end{align} Inserting the above expressions for the means and covariances recovers the data space update (4).
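A generic conditioning helper along these lines (a sketch with hypothetical names, not a library routine) makes the connection explicit: feeding it the blocks of the joint distribution (A4) reproduces the data space update (4). It reuses `G`, `Sigma`, `C`, `m`, `y`, `m_post`, and `C_post` from the earlier snippets.

```python
import numpy as np

def condition_gaussian(mean_a, mean_b, cov_aa, cov_ab, cov_bb, b_obs):
    """Moments of a | b = b_obs for a jointly Gaussian vector (a, b), as in (A8)."""
    gain = np.linalg.solve(cov_bb, cov_ab.T).T      # Cov[a, b] Cov[b]^{-1}
    cond_mean = mean_a + gain @ (b_obs - mean_b)
    cond_cov = cov_aa - gain @ cov_ab.T
    return cond_mean, cond_cov

# Condition u on y using the blocks of (A4); this reproduces the posterior moments.
m_ds, C_ds = condition_gaussian(m, G @ m, C, C @ G.T, G @ C @ G.T + Sigma, y)
assert np.allclose(m_ds, m_post) and np.allclose(C_ds, C_post)
```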