Linear Gaussian Inverse Problems

Derivations and discussion of linear Gaussian inverse problems.

This post focuses on Bayesian inverse problems with the following features:

1. The forward model mapping the unknown parameter to the observations is linear.
2. The observation noise is additive and Gaussian.
3. The prior distribution on the parameter is Gaussian.

We refer to such inverse problems as linear Gaussian. The typical Bayesian linear regression model with a Gaussian prior on the coefficients constitutes a common example of a linear Gaussian inverse problem. The assumptions of linearity and Gaussianity play quite nicely together, resulting in a closed-form Gaussian posterior distribution. Moreover, many extensions to nonlinear and/or non-Gaussian settings rely on methods rooted in our understanding of the linear Gaussian regime.

Setup

We consider the following linear Gaussian regression model \begin{align} y &= Gu + \epsilon \tag{1} \newline \epsilon &\sim \mathcal{N}(0, \Sigma) \newline u &\sim \mathcal{N}(m, C), && u \perp \epsilon \end{align} consisting of the observation (i.e., data) $y \in \mathbb{R}^n$, parameter $u \in \mathbb{R}^d$, noise $\epsilon \in \mathbb{R}^n$, and linear forward model represented by the matrix $G \in \mathbb{R}^{n \times d}$. The observation covariance $\Sigma \in \mathbb{R}^{n \times n}$ and prior covariance $C \in \mathbb{R}^{d \times d}$ are both fixed positive definite matrices. The vector $m \in \mathbb{R}^d$ is the prior mean. We write $u \perp \epsilon$ to indicate the key assumption that $u$ and $\epsilon$ are a priori probabilistically independent. The model (1) can equivalently be written as \begin{align} y|u &\sim \mathcal{N}(Gu, \Sigma) \tag{2} \newline u &\sim \mathcal{N}(m, C), \end{align} which gives the explicit expression for the Gaussian likelihood $p(y|u)$. The solution of the Bayesian inverse problem is the posterior distribution $p(u|y)$. We provide two approaches to calculating this distribution below, which yield different (but equivalent) expressions.
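To make the setup concrete, here is a minimal NumPy sketch that simulates a single draw from model (1). The dimensions and matrices below are arbitrary illustrative choices (not part of the model specification); any positive definite $\Sigma$ and $C$ would do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem dimensions (illustrative choices).
n, d = 5, 3

# Forward model, prior moments, and noise covariance (arbitrary but valid).
G = rng.normal(size=(n, d))
m = np.zeros(d)
A = rng.normal(size=(d, d))
C = A @ A.T + np.eye(d)          # prior covariance (positive definite)
B = rng.normal(size=(n, n))
Sigma = B @ B.T + np.eye(n)      # noise covariance (positive definite)

# Draw u from the prior, epsilon from the noise model, and form y = G u + epsilon.
u = rng.multivariate_normal(m, C)
eps = rng.multivariate_normal(np.zeros(n), Sigma)
y = G @ u + eps
```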

The Posterior Distribution

The ubiquity of the linear Gaussian model stems from its analytic tractability. The assumptions of linearity and Gaussianity play nicely together, resulting in a Gaussian posterior distribution that we can characterize in closed form. In Bayesian statistics, this is called a conjugate model, as the prior and posterior belong to the same distributional family (Gaussian). Once it is established that the posterior is Gaussian, it remains only to specify its mean and covariance matrix. We give two equivalent expressions for these quantities below, with complete derivations given in the appendix. The first formulation is the one typically seen in a standard Bayesian linear regression context, where $d \leq n$ and it is therefore advantageous to work with quantities in $\mathbb{R}^d$ rather than $\mathbb{R}^n$. In this first set of equations, the main computation (a matrix inversion) takes place in $\mathbb{R}^d$, so we refer to this formulation as the parameter space update. The second set of equations instead requires the inversion of an $n \times n$ matrix. This data space update is commonly seen in data assimilation applications, where $d \geq n$.

Posterior moments: parameter space update.
The posterior distribution under the linear Gaussian model (1) is Gaussian $u|y \sim \mathcal{N}(m_{\star}, C_{\star})$, with moments \begin{align} m_{\star} &= C_{\star} \left[G^\top \Sigma^{-1}y + C^{-1}m \right] \tag{3} \newline C_{\star} &= \left[G^\top \Sigma^{-1} G + C^{-1} \right]^{-1}. \end{align}

Posterior moments: data space update.
The posterior moments can equivalently be computed as \begin{align} m_{\star} &= m + CG^\top [GCG^\top + \Sigma]^{-1}(y - Gm) \tag{4} \newline C_{\star} &= C - CG^\top [GCG^\top + \Sigma]^{-1} GC. \end{align}
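As a sanity check, the two updates can be compared numerically. The following sketch (with arbitrary synthetic choices for $G$, $\Sigma$, $C$, $m$, and $y$) computes the posterior moments both ways and confirms that they agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
G = rng.normal(size=(n, d))
m = rng.normal(size=d)
A = rng.normal(size=(d, d)); C = A @ A.T + np.eye(d)
B = rng.normal(size=(n, n)); Sigma = B @ B.T + np.eye(n)
y = G @ rng.multivariate_normal(m, C) + rng.multivariate_normal(np.zeros(n), Sigma)

# Parameter space update (3): invert a d x d matrix.
C_post = np.linalg.inv(G.T @ np.linalg.solve(Sigma, G) + np.linalg.inv(C))
m_post = C_post @ (G.T @ np.linalg.solve(Sigma, y) + np.linalg.solve(C, m))

# Data space update (4): invert an n x n matrix.
S = G @ C @ G.T + Sigma
K = C @ G.T @ np.linalg.inv(S)          # "gain"-style matrix C G^T S^{-1}
m_post2 = m + K @ (y - G @ m)
C_post2 = C - K @ G @ C

print(np.allclose(m_post, m_post2))     # True
print(np.allclose(C_post, C_post2))     # True
```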

The Optimization Perspective

In this section we consider the maximum a posteriori (MAP) estimate

\begin{equation} u_{\star} := \text{argmax}_{u \in \mathbb{R}^d} p(u|y). \end{equation}

Since $p(u|y)$ is Gaussian, its mean coincides with its mode. Thus, the MAP estimate $u_{\star}$ is equal to the posterior mean $m_{\star}$, given in (3) or (4). Note that maximizing $p(u|y)$ is equivalent to minimizing $-\log p(u|y)$, since the logarithm is monotonic. Therefore, we can view the MAP estimate as the minimizer of the loss function \begin{equation} J(u) := -\log p(u|y) = -\log \mathcal{N}(y|Gu, \Sigma) - \log \mathcal{N}(u|m,C) + \text{const}, \end{equation} where the constant does not depend on $u$. For a positive definite matrix $A$, let's introduce the following notation for the inner product and norm weighted by $A$:

\begin{align} \langle u, v \rangle_A &:= \langle A^{-1}u, v\rangle = u^\top A^{-1}v \newline \lVert u \rVert^2_{A} &:= \langle u, u\rangle_{A} = u^\top A^{-1}u, \end{align}

where $\langle \cdot, \cdot \rangle$ and $\lVert \cdot \rVert$ denote the standard Euclidean inner product and norm, respectively. Substituting this notation and dropping additive constants that do not depend on $u$, we find the following.

MAP Estimate.
The MAP estimate of the linear Gaussian inverse problem (1) solves \begin{equation} u_{\star} := \text{argmin}_{u \in \mathbb{R}^d} \ J(u), \end{equation} with the loss function given by \begin{equation} J(u) = \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2} \lVert u-m\rVert^2_{C}. \end{equation}
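Since $J$ is a smooth convex quadratic, a generic numerical optimizer recovers the MAP estimate. The sketch below (with arbitrary synthetic inputs, and using `scipy.optimize.minimize` purely for illustration) minimizes $J$ directly and compares the result to the posterior mean from (3).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n, d = 6, 3
G = rng.normal(size=(n, d))
m = rng.normal(size=d)
A = rng.normal(size=(d, d)); C = A @ A.T + np.eye(d)
B = rng.normal(size=(n, n)); Sigma = B @ B.T + np.eye(n)
y = rng.normal(size=n)

Sigma_inv = np.linalg.inv(Sigma)
C_inv = np.linalg.inv(C)

def J(u):
    """Regularized least squares loss: data misfit plus prior penalty."""
    r = y - G @ u
    return 0.5 * r @ Sigma_inv @ r + 0.5 * (u - m) @ C_inv @ (u - m)

# Numerical MAP estimate via a generic optimizer.
u_map = minimize(J, x0=np.zeros(d)).x

# Posterior mean via the parameter space update (3).
C_post = np.linalg.inv(G.T @ Sigma_inv @ G + C_inv)
m_post = C_post @ (G.T @ Sigma_inv @ y + C_inv @ m)

print(np.abs(u_map - m_post).max())    # small, up to optimizer tolerance
```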

Observe that the MAP estimate solves a regularized least squares (i.e., ridge regression) problem. The gradient and Hessian of this loss function prove to be useful quantities, both in practice and for theoretical understanding. The following result gives expressions for these quantities, with derivations provided in the appendix.

Gradient and Hessian of Loss.
The loss function $J(u)$ above has gradient $\nabla J(u) \in \mathbb{R}^d$ and Hessian $\nabla^2 J(u) \in \mathbb{R}^{d \times d}$ given by \begin{align} \nabla J(u) &= -G^\top \Sigma^{-1}(y-Gu) + C^{-1}(u-m) \newline \nabla^2 J(u) &\equiv G^\top \Sigma^{-1}G + C^{-1}. \end{align}

Notice that the Hessian does not vary with $u$, as expected since the loss $J(u)$ is a quadratic function. In addition, observe the similarity between the Hessian and the posterior covariance in (3). The following result summarizes this connection, providing a link between the optimization perspective and the posterior moments.

Connection with posterior moments.
The moments of the posterior distribution $\mathcal{N}(m_{\star}, C_{\star})$ satisfy \begin{align} &\nabla J(m_{\star}) = 0, &&C^{-1}_{\star} = \nabla^2 J. \end{align} In words, this says that
1. The posterior mean minimizes the loss function $J(u)$.
2. The posterior precision (inverse covariance) is given by the Hessian $\nabla^2 J$.
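These identities are easy to check numerically. The sketch below (again with arbitrary synthetic inputs) evaluates the gradient at the posterior mean and compares the Hessian to the posterior precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
G = rng.normal(size=(n, d))
m = rng.normal(size=d)
A = rng.normal(size=(d, d)); C = A @ A.T + np.eye(d)
B = rng.normal(size=(n, n)); Sigma = B @ B.T + np.eye(n)
y = rng.normal(size=n)

Sigma_inv = np.linalg.inv(Sigma)
C_inv = np.linalg.inv(C)

# Posterior moments via the parameter space update (3).
C_post = np.linalg.inv(G.T @ Sigma_inv @ G + C_inv)
m_post = C_post @ (G.T @ Sigma_inv @ y + C_inv @ m)

# Gradient and Hessian of the loss J(u).
def grad_J(u):
    return -G.T @ Sigma_inv @ (y - G @ u) + C_inv @ (u - m)

hess_J = G.T @ Sigma_inv @ G + C_inv   # constant in u

print(np.allclose(grad_J(m_post), 0.0))             # gradient vanishes at the posterior mean
print(np.allclose(np.linalg.inv(C_post), hess_J))   # Hessian equals the posterior precision
```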

Investigating the Posterior Equations

The Posterior Covariance

A first important observation is that the posterior covariance $C_{\star}$ is independent of the data $y$. In this sense, the specific data realization observed does not affect the uncertainty in the estimation of $u$. The parameter space expression (3) tells us that the posterior precision (inverse covariance) $C_{\star}^{-1}$ is the sum of the prior precision $C^{-1}$ and $G^\top \Sigma^{-1}G$, which is the observation precision $\Sigma^{-1}$ modified by the forward model. Since the posterior covariance is the inverse of $G^\top \Sigma^{-1}G + C^{-1}$, we should verify that this matrix is indeed invertible. First, note that $\Sigma^{-1}$ and $C^{-1}$ are both positive definite, since the inverse of a positive definite matrix is also positive definite. Thus, the factorization $\Sigma^{-1} = SS^\top$ exists, which implies \begin{align} x^\top [G^\top \Sigma^{-1}G]x &= x^\top [G^\top SS^\top G]x = \lVert S^\top Gx \rVert^2_2 \geq 0. \tag{10} \end{align} That is, $G^\top \Sigma^{-1}G$ is positive semidefinite. Since the sum of a positive semidefinite matrix and a positive definite matrix is positive definite, $G^\top \Sigma^{-1}G + C^{-1}$ is positive definite, and thus invertible.

The covariance expression in the data space update (4) provides an alternative perspective. First, it tells us that conditioning on the data $y$ never increases variance. This can be seen by noting that the matrix $CG^\top [GCG^\top + \Sigma]^{-1} GC$ (which is subtracted from the prior covariance) is positive semidefinite, and thus in particular has nonnegative values on its diagonal. To show this, note that the same argument as above shows that $GCG^\top + \Sigma$ is positive definite; hence its inverse is also positive definite and admits a decomposition of the form $SS^\top$. Thus, \begin{align} x^\top \left(CG^\top [GCG^\top + \Sigma]^{-1} GC\right) x = x^\top \left(CG^\top SS^\top GC\right) x = \lVert S^\top GCx \rVert_2^2 \geq 0, \tag{11} \end{align} so $CG^\top [GCG^\top + \Sigma]^{-1} GC$ is indeed positive semidefinite. Note that the covariance expression in (4) can also be written as \begin{align} \text{Cov}[u|y] &= \text{Cov}[u] - \text{Cov}[u,y] \text{Cov}[y]^{-1} \text{Cov}[y, u]. \tag{12} \end{align}
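The positive semidefiniteness claim in (11), and the resulting variance reduction, can also be verified numerically. The following sketch uses arbitrary synthetic matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 3
G = rng.normal(size=(n, d))
A = rng.normal(size=(d, d)); C = A @ A.T + np.eye(d)
B = rng.normal(size=(n, n)); Sigma = B @ B.T + np.eye(n)

# The matrix subtracted from the prior covariance in (4).
S = G @ C @ G.T + Sigma
reduction = C @ G.T @ np.linalg.solve(S, G @ C)

# It is symmetric positive semidefinite, so the posterior variances
# (the diagonal of C - reduction) never exceed the prior variances.
eigvals = np.linalg.eigvalsh((reduction + reduction.T) / 2)
print(np.all(eigvals >= -1e-10))                             # PSD up to round-off
print(np.all(np.diag(C - reduction) <= np.diag(C) + 1e-10))  # variance never increases
```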

Posterior Predictive Distribution

Suppose that we are now interested in the outputs $\tilde{y} \in \mathbb{R}^m$ resulting from the application of a new forward map $\tilde{G} \in \mathbb{R}^{m \times d}$, assuming the same prior distribution on $u$ as in (1). That is, we extend the model as \begin{align} y &= Gu + \epsilon \tag{13} \newline \tilde{y} &= \tilde{G}u + \tilde{\epsilon} \newline u &\sim \mathcal{N}(m, C). \end{align} It remains to specify the joint distribution of the noise terms $(\epsilon, \tilde{\epsilon})$. We will assume \begin{align} \begin{bmatrix} \tilde{\epsilon} \newline \epsilon \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \newline 0 \end{bmatrix}, \begin{bmatrix} \tilde{\Sigma} & \Sigma^\prime \newline [\Sigma^\prime]^\top & \Sigma \end{bmatrix} \right), \end{align} along with the independence assumption $u \perp (\epsilon, \tilde{\epsilon})$. The classic example of this setup is prediction at a new set of inputs using a linear regression model. In this context, $G$ represents the design matrix and $\tilde{G}$ stacks a new set of $m$ inputs at which we would like to predict the corresponding responses.

In the generic inverse problem formulation, we see that this is a question of propagating the posterior uncertainty in $u$ through a new forward model $\tilde{G}$. Concretely, we are interested in characterizing the posterior predictive distribution \begin{align} p(\tilde{y} | y) &= \int p(\tilde{y},u|y)du = \int p(\tilde{y}|u,y)p(u|y) du. \end{align} Notice that the term $p(u|y)$ is the posterior distribution of the model parameters $u$. We begin by deriving the closed-form posterior predictive distribution in the common setting $\Sigma^\prime = 0$. We then turn to the general setting where $\epsilon$ and $\tilde{\epsilon}$ may be correlated. In either case, the posterior predictive $p(\tilde{y} | y)$ is Gaussian. This is verified in the appendix by showing that $(\tilde{y}, y)$ is jointly Gaussian, and hence has Gaussian conditionals.

Uncorrelated Errors

We begin with the assumption of zero cross-covariance: $\text{Cov}[\tilde{\epsilon}, \epsilon] = \Sigma^\prime = 0$. This assumption makes the required computations quite straightforward. As noted above, we know that $p(\tilde{y}|y)$ is Gaussian, and thus it only remains to characterize the mean and covariance of this distribution. For the mean, we apply the law of iterated expectations to obtain \begin{align} \mathbb{E}[\tilde{y}|y] &= \mathbb{E}\left[\mathbb{E}[\tilde{y}|u,y]|y\right] \newline &= \mathbb{E}\left[\mathbb{E}[\tilde{G}u + \tilde{\epsilon}|u,y]|y\right] \newline &= \mathbb{E}\left[\tilde{G}u|y\right] \newline &= \tilde{G} \mathbb{E}\left[u|y\right] \newline &= \tilde{G} m_{\star}. \end{align} Note that everything is conditional on $y$ here; the law of iterated expectations is applied with respect to $u$, not $y$. The third equality uses the fact that $\mathbb{E}[\tilde{\epsilon}|u,y] = \mathbb{E}[\tilde{\epsilon}] = 0$, owing to the assumptions that $\tilde{\epsilon}$ is uncorrelated with both $\epsilon$ and $u$. Now, for the covariance we apply the law of total covariance \begin{align} \text{Cov}[a] = \mathbb{E}\,\text{Cov}[a|b] + \text{Cov}\left[\mathbb{E}(a|b) \right], \end{align} which holds for arbitrary (square integrable) random vectors $a$ and $b$. Doing so (again, with everything conditional on $y$) we obtain \begin{align} \text{Cov}[\tilde{y}|y] &= \mathbb{E}\left[\text{Cov}(\tilde{y}|u,y)|y\right] + \text{Cov}\left[\mathbb{E}(\tilde{y}|u,y)|y \right] \newline &= \mathbb{E}\left[\text{Cov}(\tilde{G}u + \tilde{\epsilon}|u,y)|y\right] + \text{Cov}\left[\tilde{G}u|y \right] \newline &= \mathbb{E}\left[\tilde{\Sigma}|y\right] + \text{Cov}\left[\tilde{G}u|y \right] \newline &= \tilde{\Sigma} + \tilde{G}C_{\star}\tilde{G}^\top. \end{align} We have used the fact that $\text{Cov}[\tilde{\epsilon}|u,y] = \text{Cov}[\tilde{\epsilon}] = \tilde{\Sigma}$, again due to the independence assumptions. Putting everything together, we have found that \begin{align} \tilde{y}|y &\sim \mathcal{N}\left(\tilde{G}m_{\star}, \tilde{\Sigma} + \tilde{G}C_{\star}\tilde{G}^\top \right). \end{align} This result is quite intuitive. The predictive mean simply results from propagating the posterior mean $m_{\star}$ through the new forward model $\tilde{G}$. The predictive covariance is the sum of the noise covariance $\tilde{\Sigma}$ and the covariance resulting from propagating the random variable $u|y$ through the new forward model $\tilde{G}$.
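This propagation interpretation suggests a simple Monte Carlo check: draw samples from the posterior $u|y$, push them through $\tilde{G}$, add independent noise, and compare the empirical moments to the closed-form expressions. The sketch below does this with arbitrary synthetic inputs (here `p` denotes the dimension of $\tilde{y}$, to avoid clashing with the prior mean $m$); the reported discrepancies should be small, up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, p = 5, 3, 2          # p = dimension of the new output
G = rng.normal(size=(n, d))
G_new = rng.normal(size=(p, d))
m = rng.normal(size=d)
A = rng.normal(size=(d, d)); C = A @ A.T + np.eye(d)
Sigma = np.eye(n)
Sigma_new = 0.5 * np.eye(p)
y = rng.normal(size=n)

# Posterior moments of u | y via the parameter space update (3).
C_post = np.linalg.inv(G.T @ np.linalg.solve(Sigma, G) + np.linalg.inv(C))
m_post = C_post @ (G.T @ np.linalg.solve(Sigma, y) + np.linalg.solve(C, m))

# Closed-form predictive moments for uncorrelated errors.
pred_mean = G_new @ m_post
pred_cov = Sigma_new + G_new @ C_post @ G_new.T

# Monte Carlo check: push posterior samples of u through G_new and add noise.
N = 200_000
u_samp = rng.multivariate_normal(m_post, C_post, size=N)
y_new = u_samp @ G_new.T + rng.multivariate_normal(np.zeros(p), Sigma_new, size=N)
print(np.abs(y_new.mean(axis=0) - pred_mean).max())            # small (Monte Carlo error)
print(np.abs(np.cov(y_new, rowvar=False) - pred_cov).max())    # small (Monte Carlo error)
```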

Correlated Errors

We now deal with the general case where $\Sigma^\prime$ may be non-zero. The calculations here are trickier given that $\tilde{\epsilon}|y$ is not equal in distribution to $\tilde{\epsilon}$, as was the case above. The reason is that $y$ contains information about $\epsilon$, and $\epsilon$ contains information about $\tilde{\epsilon}$ since they are correlated. Therefore, in this general setting it is easiest to proceed by considering the joint distribution of $(\tilde{y},y)$ and then applying the Gaussian conditioning identities to obtain the distribution of $\tilde{y}|y$. The joint distribution is given by \begin{align} \begin{bmatrix} \tilde{y} \newline y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} \tilde{G}m \newline Gm \end{bmatrix}, \begin{bmatrix} \tilde{G}C\tilde{G}^\top + \tilde{\Sigma} & \tilde{G}CG^\top + \Sigma^\prime \newline GC\tilde{G}^\top + [\Sigma^\prime]^\top & GCG^\top + \Sigma \end{bmatrix} \right). \end{align} The cross-covariance follows from \begin{align} \text{Cov}[\tilde{G}u + \tilde{\epsilon}, Gu + \epsilon] &= \text{Cov}[\tilde{G}u, Gu] + \text{Cov}[\tilde{\epsilon}, \epsilon] = \tilde{G}CG^\top + \Sigma^\prime, \end{align} since $\text{Cov}[u, \tilde{\epsilon}] = \text{Cov}[u, \epsilon] = 0$. The marginal calculations follow similarly. We can now apply the Gaussian conditioning identities to obtain \begin{align} \mathbb{E}[\tilde{y}|y] &= \tilde{G}m + \left[\tilde{G}CG^\top + \Sigma^\prime \right] \left(GCG^\top + \Sigma \right)^{-1} \left[y - Gm \right] \newline \text{Cov}[\tilde{y}|y] &= \tilde{\Sigma} + \tilde{G}C\tilde{G}^\top - \left[\tilde{G}CG^\top + \Sigma^\prime \right] \left(GCG^\top + \Sigma \right)^{-1} \left[GC\tilde{G}^\top + [\Sigma^\prime]^\top \right]. \end{align}
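The following sketch implements these conditioning formulas for an arbitrary synthetic joint noise covariance, and checks that setting $\Sigma^\prime = 0$ recovers the uncorrelated-error expressions from the previous subsection.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, p = 4, 3, 2          # p = dimension of the new output
G = rng.normal(size=(n, d))
G_new = rng.normal(size=(p, d))
m = rng.normal(size=d)
A = rng.normal(size=(d, d)); C = A @ A.T + np.eye(d)

# Build a valid joint noise covariance for (eps_new, eps) by construction.
L = rng.normal(size=(p + n, p + n))
noise_cov = L @ L.T + np.eye(p + n)
Sigma_new = noise_cov[:p, :p]
Sigma_cross = noise_cov[:p, p:]          # Cov[eps_new, eps]
Sigma = noise_cov[p:, p:]

y = rng.normal(size=n)

# Posterior predictive moments via the joint-Gaussian conditioning formulas.
S = G @ C @ G.T + Sigma                  # Cov[y]
cross = G_new @ C @ G.T + Sigma_cross    # Cov[y_new, y]
pred_mean = G_new @ m + cross @ np.linalg.solve(S, y - G @ m)
pred_cov = Sigma_new + G_new @ C @ G_new.T - cross @ np.linalg.solve(S, cross.T)

# Sanity check: with Sigma_cross = 0 these reduce to the uncorrelated-error formulas.
C_post = np.linalg.inv(G.T @ np.linalg.solve(Sigma, G) + np.linalg.inv(C))
m_post = C_post @ (G.T @ np.linalg.solve(Sigma, y) + np.linalg.solve(C, m))
cross0 = G_new @ C @ G.T
mean0 = G_new @ m + cross0 @ np.linalg.solve(S, y - G @ m)
cov0 = Sigma_new + G_new @ C @ G_new.T - cross0 @ np.linalg.solve(S, cross0.T)
print(np.allclose(mean0, G_new @ m_post))
print(np.allclose(cov0, Sigma_new + G_new @ C_post @ G_new.T))
```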

Example: Linear Regression

Perhaps the most common example of a linear Gaussian model is Bayesian linear regression. In this setting, we suppose that we have access to observed input-output pairs $(x_i, y_i)_{i=1}^{n}$ and assume that each $y_i$ arises as a linear function of $x_i$, perturbed by additive noise. While different formulations are possible, a popular specification assumes \begin{equation} y_i|x_i, \beta \sim \mathcal{N}(x_i^\top \beta, \sigma^2), \end{equation} meaning that the observation noise is independent and identically distributed across observations. If we stack the inputs row-wise into a matrix $X \in \mathbb{R}^{n \times d}$ and the outputs into a vector $y \in \mathbb{R}^n$, then this model can be written as \begin{align} y &= X\beta + \epsilon \newline \epsilon &\sim \mathcal{N}(0, \sigma^2 I_n) \newline \beta &\sim \mathcal{N}(m, \sigma^2 C), \end{align} where we have also assumed a Gaussian prior on $\beta$. Connecting to our generic inverse problem setup, we see that the forward model $G = X$ is given by the data matrix, while the parameter $u = \beta$ is the coefficient vector. The parameterization of the prior covariance as $\sigma^2 C$ is common in this setting, as it leads to some convenient cancellations in the posterior formulas. Indeed, applying (3) gives the posterior $\beta|y \sim \mathcal{N}(m_{\star}, C_{\star})$, with covariance \begin{align} C_{\star} &= \left[\frac{1}{\sigma^2} X^\top X + \frac{1}{\sigma^2} C^{-1} \right]^{-1} = \sigma^2 \left[X^\top X + C^{-1} \right]^{-1}. \end{align} The posterior mean is thus \begin{equation} m_{\star} = C_{\star} \left[\frac{1}{\sigma^2} X^\top y + \frac{1}{\sigma^2} C^{-1}m \right] = \left[X^\top X + C^{-1} \right]^{-1} \left[X^\top y + C^{-1}m \right], \end{equation} since the $\sigma^2$ term from the covariance cancels with its reciprocal.
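The cancellation is easy to see in code. The sketch below fits this Bayesian linear regression on synthetic data with a zero prior mean and isotropic prior covariance $C = \tau^2 I$ (illustrative choices), and confirms that the posterior mean then coincides with the corresponding ridge regression estimate.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 3
sigma2 = 0.25

X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + np.sqrt(sigma2) * rng.normal(size=n)

# Prior beta ~ N(m, sigma2 * C); here a zero mean and isotropic C = tau2 * I.
m = np.zeros(d)
tau2 = 10.0
C = tau2 * np.eye(d)
C_inv = np.linalg.inv(C)

# Posterior moments; note that sigma2 cancels in the mean, as in the text.
C_post = sigma2 * np.linalg.inv(X.T @ X + C_inv)
m_post = np.linalg.solve(X.T @ X + C_inv, X.T @ y + C_inv @ m)

# With zero prior mean, the posterior mean is the ridge regression estimate
# with penalty 1/tau2 on the squared norm of beta.
ridge = np.linalg.solve(X.T @ X + (1.0 / tau2) * np.eye(d), X.T @ y)
print(np.allclose(m_post, ridge))
```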

Appendix

Joint Gaussian Distribution

Throughout this post we rely on the claim that various quantities are jointly Gaussian distributed. We verify these claims here. In the proofs, we use the fact that a random vector $v$ is Gaussian if and only if it is equal in distribution to $a + Bz$, for $z \sim \mathcal{N}(0, I)$ and some constant vector $a$ and matrix $B$. Any random variables labelled "$z$" in this section should be interpreted as standard Gaussians, with subscripts potentially indicating the random variables that they generate; e.g., $z_{\epsilon}$.

Joint distribution: $(u, y)$.

We first verify that the vector $(u, y)$ has a joint Gaussian distribution under model (1). Taking square roots of the covariance matrices allows us to write the correlated Gaussian variables as transformations of iid Gaussian noise. We have \begin{align} \begin{bmatrix} u \newline y \end{bmatrix} &\overset{d}{=} \begin{bmatrix} u \newline Gu + \epsilon \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} m + C^{1/2}z_u \newline G\left(m + C^{1/2}z_u\right) + \Sigma^{1/2}z_{\epsilon} \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} m \newline Gm \end{bmatrix} + \begin{bmatrix} C^{1/2}z_u \newline GC^{1/2}z_u + \Sigma^{1/2}z_{\epsilon} \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} m \newline Gm \end{bmatrix} + \begin{bmatrix} C^{1/2} & 0 \newline GC^{1/2} & \Sigma^{1/2} \end{bmatrix} \begin{bmatrix} z_u \newline z_{\epsilon} \end{bmatrix}.
\end{align}
Under the assumption $u \perp \epsilon$ it follows that $z_u$ and $z_{\epsilon}$ are independent, so $(z_u, z_{\epsilon})^\top \sim \mathcal{N}(0, I_{d+n})$, thus verifying the claim.
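As a quick numerical check of this construction, multiplying the block lower-triangular matrix in the last display by its transpose should reproduce the joint covariance of $(u, y)$ given in (A4) below. Here is a small sketch using Cholesky factors in place of symmetric square roots (any factor $L$ with $LL^\top$ equal to the covariance works equally well).

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 4, 3
G = rng.normal(size=(n, d))
A = rng.normal(size=(d, d)); C = A @ A.T + np.eye(d)
B = rng.normal(size=(n, n)); Sigma = B @ B.T + np.eye(n)

# Cholesky factors play the role of the "square roots" in the derivation.
C_half = np.linalg.cholesky(C)
Sigma_half = np.linalg.cholesky(Sigma)

# Block matrix mapping (z_u, z_eps) to (u - m, y - Gm).
M = np.block([
    [C_half, np.zeros((d, n))],
    [G @ C_half, Sigma_half],
])

# M @ M.T should equal the joint covariance of (u, y) in (A4).
joint_cov = np.block([
    [C, C @ G.T],
    [G @ C, G @ C @ G.T + Sigma],
])
print(np.allclose(M @ M.T, joint_cov))
```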

Joint distribution: $(\tilde{y}, y)$.

We similarly show that $(\tilde{y}, y)$ is jointly Gaussian under model (13). This fact is used in the derivation of the posterior predictive distribution. We recall that we are allowing the noise terms $\epsilon$ and $\tilde{\epsilon}$ to be correlated; let's partition the square root of their joint covariance as \begin{align} \text{Cov}[(\tilde{\epsilon}, \epsilon)]^{1/2} &= \begin{bmatrix} B_{11} & B_{12} \newline B_{21} & B_{22} \end{bmatrix}. \end{align} We then have \begin{align} \begin{bmatrix} \tilde{y} \newline y \end{bmatrix} &\overset{d}{=} \begin{bmatrix} \tilde{G}u \newline Gu\end{bmatrix} + \begin{bmatrix} \tilde{\epsilon} \newline \epsilon \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} \tilde{G}\left(m + C^{1/2}z_u \right)\newline G\left(m + C^{1/2}z_u \right) \end{bmatrix} + \begin{bmatrix} B_{11} & B_{12} \newline B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_1 \newline z_2 \end{bmatrix} \newline &\overset{d}{=} \begin{bmatrix} \tilde{G}m \newline Gm \end{bmatrix} + \begin{bmatrix} \tilde{G}C^{1/2} & B_{11} & B_{12} \newline GC^{1/2} & B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_u \newline z_1 \newline z_2 \end{bmatrix}. \end{align} Again, the independence of $z_u$ with respect to $z_1$ and $z_2$ is ensured by the assumptions $u \perp \tilde{\epsilon}$ and $u \perp \epsilon$. Thus, $(z_u, z_1, z_2)^\top \sim \mathcal{N}(0, I_{d+m+n})$. The matrices $B_{12}$ and $B_{21}$ serve to "mix up" the two independent noise sources $z_1$ and $z_2$ in order to produce the correlations between $\epsilon$ and $\tilde{\epsilon}$. If $\epsilon$ and $\tilde{\epsilon}$ are assumed uncorrelated, then $B_{12} = B_{21} = 0$. Note also that essentially the same derivation shows that $(\tilde{y}, y, u)$ is also jointly Gaussian; the "B" matrix is simply augmented with the additional row $\begin{bmatrix} C^{1/2} & 0 & 0 \end{bmatrix}$.

Posterior Derivations

Parameter Space Update

We now derive the posterior moments appearing in (3). This derivation also provides another way to show that the posterior is Gaussian. The strategy here is to apply Bayes' theorem and then use the matrix analog of completing the square from elementary algebra. Any quantities that do not depend on $u$ will be absorbed into the proportionality symbol, and hence are dropped from the expression. Applying Bayes' theorem to (2) yields \begin{align} p(u|y) &\propto p(y|u)p(u) \newline &\propto \exp\left(-\frac{1}{2}\left[(y-Gu)^\top \Sigma^{-1} (y-Gu) + (u-m)^\top C^{-1} (u-m) \right] \right) \newline &\propto \exp\left(-\frac{1}{2}\left[u^\top(G^\top \Sigma^{-1}G + C^{-1})u - 2u^\top(G^\top \Sigma^{-1}y + C^{-1}m)\right] \right). \tag{A1} \end{align} All we have done above is to combine the Gaussian likelihood and prior, dropping any multiplicative constants that don't depend on $u$, and grouping like terms in $u$. Note that since (A1) is an exponential of a quadratic in $u$, we immediately know that the posterior must be Gaussian. It therefore remains to find the mean $m_{\star}$ and covariance $C_{\star}$. Knowing that (A1) is proportional to a Gaussian density, let's set the term in square brackets equal to
\begin{align} (u - m_{\star})^\top C_{\star}^{-1} (u - m_{\star}) = u^\top C_{\star}^{-1}u - 2u^\top C_{\star}^{-1} m_{\star} + m_{\star}^\top C_{\star}^{-1} m_{\star} \tag{A2} \end{align} and equate like terms to solve for the unknowns $m_{\star}$ and $C_{\star}$. Doing so, we find that \begin{align} C_{\star}^{-1} &= G^\top \Sigma^{-1} G + C^{-1} \newline C_{\star}^{-1}m_{\star} &= G^\top \Sigma^{-1}y + C^{-1}m. \tag{A3} \end{align} The extra $m_{\star}^\top C_{\star}^{-1} m_{\star}$ term is not a problem, as it will simply be absorbed into the proportionality sign. Rearranging the above expressions gives the desired mean and covariance equations in (3).

Data Space Update

We now present a second method for computing $p(u|y)$, resulting in the data space update (4). The general idea here is to observe that $(u, y)^\top \in \mathbb{R}^{d+n}$ follows a joint Gaussian distribution, and thus the posterior distribution $p(u|y)$ is simply a conditional of this joint distribution. Since Gaussian conditionals can be characterized in closed form, we can simply apply the well-known Gaussian conditioning formulas to obtain the equations in (4). Note that we have already verified earlier in the appendix that $(u,y)^\top$ is a Gaussian random vector, which relied on the assumed prior independence of $u$ and $\epsilon$. Writing out this joint Gaussian explicitly gives \begin{align} \begin{bmatrix} u \newline y \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} m \newline Gm \end{bmatrix}, \begin{bmatrix} C & CG^\top \newline GC & GCG^\top + \Sigma \end{bmatrix} \right). \tag{A4} \end{align} The mean and covariance of $u$ are immediate from (1), and the remaining quantities are computed as \begin{align} \mathbb{E}[y] &= \mathbb{E}[Gu + \epsilon] = G\mathbb{E}[u] + \mathbb{E}[\epsilon] = Gm \tag{A5} \newline \text{Cov}[y] &= \text{Cov}[Gu + \epsilon] = \text{Cov}[Gu] + \text{Cov}[\epsilon] = GCG^\top + \Sigma \tag{A6} \newline \text{Cov}[y, u] &= \text{Cov}[Gu + \epsilon, u] = \text{Cov}[Gu, u] + \text{Cov}[\epsilon, u] = GC. \tag{A7} \end{align} In (A5) we use the linearity of expectation and the fact that the noise is zero-mean. The covariance splits into the sum in (A6) due to the independence of $u$ and $\epsilon$. This independence assumption is similarly leveraged in (A7). The Gaussian conditioning identities give \begin{align} m_{\star} &= \mathbb{E}[u] + \text{Cov}[u,y]\text{Cov}[y]^{-1}(y-\mathbb{E}[y]) \tag{A8} \newline C_{\star} &= \text{Cov}[u] - \text{Cov}[u,y]\text{Cov}[y]^{-1}\text{Cov}[y,u]. \end{align} Inserting the above expressions for the means and covariances recovers the data space update (4).