Linearly Transforming Gaussian Process Priors
I derive how a linear transformation of a Gaussian process prior influences the Gaussian process posterior, and consider some special cases.
The main goal of this post is to apply different types of linear transformations to Gaussian vectors, and investigate how these transformations impact the resulting Gaussian conditional distributions. Since many Gaussian process (GP) derivations reduce to calculations with Gaussian vectors, our investigations of multivariate Gaussians will immediately lead to results on GPs. From the GP perspective, I emphasize that we will not be considering linear functionals of GPs (i.e., maps that take a function as input and return a scalar). This is an interesting topic that is worthy of its own blog post. We will instead be considering linear transformations applied in a pointwise fashion. We start by considering a generic multivariate Gaussian setup, then translate the results into GP language. We conclude by discussing applications to GP regression, multi-output GPs, and linear inverse problems.
Multivariate Gaussians
Notation and Review
We start by considering (finite-dimensional) multivariate Gaussians, as many of the derivations in the Gaussian process setting reduce to computations with Gaussian vectors. Consider a Gaussian vector partitioned as \begin{align} x = \begin{bmatrix} x_M \newline x_N \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} \mu_M \newline \mu_N \end{bmatrix}, \begin{bmatrix} C_M & C_{MN} \newline C_{NM} & C_N \end{bmatrix} \right) \tag{1} \end{align} where $x_M \in \mathbb{R}^M$, $x_N \in \mathbb{R}^N$, and $x \in \mathbb{R}^{M+N}$. Throughout this post, subscripts with capital letters serve as indicators of vector and matrix dimensions. It is well-known that the conditional distribution $x_M | x_N$ is again Gaussian. In particular, we have $x_M | x_N \sim \mathcal{N}(\hat{\mu}_M, \hat{C}_M)$, where \begin{align} \hat{\mu}_M &:= \mu_M + C_{MN} C_{N}^{-1}(x_N - \mu_N) \tag{2} \newline \hat{C}_M &:= C_M - C_{MN} C_{N}^{-1} C_{NM}. \end{align}
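For concreteness, here is a minimal numpy sketch of the conditioning identity (2). The dimensions and the randomly generated mean and covariance are arbitrary illustrative choices, not anything prescribed by the text.

```python
# A minimal numpy sketch of the conditioning identity (2).  The dimensions and
# the randomly generated mean/covariance below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

M, N = 2, 3
mu_M, mu_N = rng.normal(size=M), rng.normal(size=N)

# Build a random positive definite joint covariance and partition it as in (1).
L = rng.normal(size=(M + N, M + N))
C = L @ L.T + (M + N) * np.eye(M + N)
C_M, C_MN = C[:M, :M], C[:M, M:]
C_NM, C_N = C[M:, :M], C[M:, M:]

x_N = rng.normal(size=N)  # observed value of x_N

# Conditional mean and covariance of x_M | x_N, following (2).
mu_hat_M = mu_M + C_MN @ np.linalg.solve(C_N, x_N - mu_N)
C_hat_M = C_M - C_MN @ np.linalg.solve(C_N, C_NM)
```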
Linear Transformation of Gaussian Vector
We recall that linear transformations of Gaussians preserve Gaussianity. Therefore, writing $\mu$ and $C$ for the joint mean vector and covariance matrix in (1), for a matrix $A$ with $M + N$ columns the random vector $y := Ax$ has distribution
\begin{align}
y \sim \mathcal{N}(A\mu, ACA^\top). \tag{3}
\end{align}
In this post we will be concerned with matrices $A$ with a certain block structure; the
motivation will become clearer when we start considering GPs.
For now, suppose that $A$ is of the form
\begin{align}
A :=
\begin{bmatrix} A_M & 0 \newline 0 & A_N \end{bmatrix}, \tag{4}
\end{align}
where $A_M \in \mathbb{R}^{P \times M}$ and $A_N \in \mathbb{R}^{Q \times N}$.
In this case, the transformed distribution (3) assumes the form
\begin{align}
y \sim \mathcal{N}\left(
\begin{bmatrix} A_M\mu_M \newline A_N \mu_N \end{bmatrix},
\begin{bmatrix}
A_M C_M A_M^\top & A_M C_{MN} A_N^\top \newline
A_N C_{NM}A_M^\top & A_N C_N A_N^\top
\end{bmatrix}
\right). \tag{5}
\end{align}
Having characterized the joint distribution of the transformed vector $y$, we
now consider the effect on the conditional distribution. Let
$y_M := A_M x_M$ and $y_N := A_N x_N$. Applying the Gaussian conditioning
identity (2), we obtain
\begin{align}
y_M | y_N \sim \mathcal{N}\left(\hat{\mu}_M^{y}, \hat{C}_M^{y} \right),
\end{align}
where
\begin{align} \hat{\mu}_{M}^{y} &:= A_M \mu_M + A_M C_{MN} A_N^\top (A_N C_N A_N^\top)^{-1}[A_N x_N - A_N \mu_N] \tag{6} \newline \hat{C}_{M}^{y} &:= A_M C_M A_M^\top - A_M C_{MN} A_N^\top (A_N C_N A_N^\top)^{-1} A_N C_{NM} A_M^\top. \end{align}
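To see (5) and (6) in action, the following sketch continues the numpy example above: it builds a block transformation of the form (4) with arbitrarily chosen output dimensions, forms the transformed joint (5), and evaluates the conditional moments (6).

```python
# Continuing the numpy sketch above: a block transformation as in (4),
# with arbitrarily chosen output dimensions P and Q for A_M and A_N.
P, Q = 4, 2
A_M = rng.normal(size=(P, M))
A_N = rng.normal(size=(Q, N))

# Joint distribution (5) of y = (A_M x_M, A_N x_N).
mean_y = np.concatenate([A_M @ mu_M, A_N @ mu_N])
cov_y = np.block([
    [A_M @ C_M @ A_M.T,  A_M @ C_MN @ A_N.T],
    [A_N @ C_NM @ A_M.T, A_N @ C_N @ A_N.T],
])

# Conditional moments of y_M | y_N, formula (6).  Applying the generic
# identity (2) directly to (mean_y, cov_y) gives the same result.
S = A_N @ C_N @ A_N.T          # covariance of y_N
K = A_M @ C_MN @ A_N.T         # cross-covariance of y_M and y_N
mu_hat_M_y = A_M @ mu_M + K @ np.linalg.solve(S, A_N @ (x_N - mu_N))
C_hat_M_y = A_M @ C_M @ A_M.T - K @ np.linalg.solve(S, K.T)
```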
Generalization to Affine Maps
The generalization from linear to affine maps is almost immediate. Consider an
affine map of the form
\begin{align}
y := Ax + b
&= \begin{bmatrix} A_M & 0 \newline 0 & A_N \end{bmatrix}
\begin{bmatrix} x_M \newline x_N \end{bmatrix} +
\begin{bmatrix} b_M \newline b_N \end{bmatrix}. \tag{7}
\end{align}
The joint distribution of $y$ is then given by
\begin{align}
y \sim \mathcal{N}\left(
\begin{bmatrix} A_M\mu_M + b_M \newline A_N \mu_N + b_N \end{bmatrix},
\begin{bmatrix}
A_M C_M A_M^\top & A_M C_{MN} A_N^\top \newline
A_N C_{NM}A_M^\top & A_N C_N A_N^\top
\end{bmatrix} \right). \tag{8}
\end{align}
Note that the constant terms only affect the mean. Applying the Gaussian
conditioning formulas then gives the conditional distribution
\begin{align}
y_M | y_N \sim \mathcal{N}\left(\hat{\mu}_M^{y}, \hat{C}_M^{y} \right),
\end{align}
where
\begin{align} \hat{\mu}_{M}^{y} &:= b_M + A_M \mu_M + A_M C_{MN} A_N^\top (A_N C_N A_N^\top)^{-1}[A_N x_N - A_N \mu_N] \tag{9} \newline \hat{C}_{M}^{y} &:= A_M C_M A_M^\top - A_M C_{MN} A_N^\top (A_N C_N A_N^\top)^{-1} A_N C_{NM} A_M^\top. \end{align} The only difference with respect to (6) is the addition of $b_M$ in the conditional mean. Note that the term $A_N x_N - A_N \mu_N$ in the conditional mean is unchanged, owing to the cancellation $(A_N x_N + b_N) - (A_N \mu_N + b_N) = A_N x_N - A_N \mu_N$.
Note that $A_M$ and $A_N$ map $\mathbb{R}^M$ to $\mathbb{R}^P$ and $\mathbb{R}^N$ to $\mathbb{R}^Q$, respectively. Therefore, the dimension of $y_M$ may differ from that of $x_M$. Now, it would be nice to be able to write $\hat{\mu}_{M}^{y}$ and $\hat{C}_{M}^{y}$ as functions of $\hat{\mu}_{M}$ and $\hat{C}_{M}$, respectively. In the general setting, there is not much we can do, given that $A_N$ may not be invertible; thus, we can’t necessarily simplify the term $(A_N C_N A_N^\top)^{-1}$.
Special case: Invertibility
Let’s now consider the special case that $A_N$ is invertible; in particular, this means $Q = N$. We’ll work with the affine map (7), since the linear result follows as the special case $b = 0$. With the invertibility assumption we can simplify (9) as \begin{align} \hat{\mu}_{M}^{y} &= b_M + A_M \mu_M + A_M C_{MN} A_N^\top (A_N^\top)^{-1} C_N^{-1} A_N^{-1} A_N[x_N - \mu_N] \tag{10} \newline &= b_M + A_M \left(\mu_M + C_{MN} C_N^{-1} [x_N - \mu_N] \right) \newline &= b_M + A_M \hat{\mu}_{M} \newline \hat{C}_{M}^{y} &= A_M C_M A_M^\top - A_M C_{MN} A_N^\top (A_N C_N A_N^\top)^{-1} A_N C_{NM} A_M^\top \newline &= A_M C_M A_M^\top - A_M C_{MN} A_N^\top (A_N^\top)^{-1} C_N^{-1} A_N^{-1} A_N C_{NM} A_M^\top \newline &= A_M \left(C_M - C_{MN} C_N^{-1} C_{NM}\right) A_M^\top \newline &= A_M \hat{C}_{M} A_M^\top. \end{align} In words, what we have just shown is that, if $A_N$ is invertible, then conditioning $A_M x_M + b_M$ on $A_N x_N + b_N$ is equivalent to conditioning $x_M$ on $x_N$ and then applying the transformation after the fact. We might write this symbolically as \begin{align} (A_M x_M | A_N x_N) \overset{d}{=} A_M(x_M | x_N). \tag{11} \end{align} This result makes intuitive sense; since $A_N$ is a bijection, conditioning on $A_N x_N$ is equivalent to conditioning on $x_N$; they both contain the same information. Note that no invertibility assumption is required for $A_M$; what matters here is the variable that is being conditioned on. This equivalence does not necessarily hold when $A_N$ is non-invertible. For example, consider the case that $A_N$ has a single row. Then conditioning on $y_N = A_N x_N$ means that the observed quantity is a single linear combination of the entries of $x_N$. This quantity contains less information than observing each entry of $x_N$ directly, since many different realizations of $x_N$ might yield the same observed linear combination.
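A quick numerical check of this equivalence, continuing the numpy sketches above: $A_N$ is resampled as a square matrix (invertible with probability one) and the two routes are compared. The offset $b_M$ is an arbitrary choice; $b_N$ is omitted since, as noted above, it cancels in the conditional.

```python
# Continuing the sketches above: verify (11) (affine version) when A_N is
# square and hence invertible with probability one.
A_N = rng.normal(size=(N, N))
b_M = rng.normal(size=P)

# Route 1: condition A_M x_M + b_M on A_N x_N + b_N directly, formulas (9)-(10).
S = A_N @ C_N @ A_N.T
K = A_M @ C_MN @ A_N.T
mu_1 = b_M + A_M @ mu_M + K @ np.linalg.solve(S, A_N @ (x_N - mu_N))
cov_1 = A_M @ C_M @ A_M.T - K @ np.linalg.solve(S, K.T)

# Route 2: condition x_M on x_N first (formula (2)), then transform.
mu_hat_M = mu_M + C_MN @ np.linalg.solve(C_N, x_N - mu_N)
C_hat_M = C_M - C_MN @ np.linalg.solve(C_N, C_NM)
mu_2 = b_M + A_M @ mu_hat_M
cov_2 = A_M @ C_hat_M @ A_M.T

assert np.allclose(mu_1, mu_2) and np.allclose(cov_1, cov_2)
```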
Gaussian Process Review
We now seek to apply the above results in the context of Gaussian processes (GPs). This section provides a brief review of GPs, mainly intended to introduce notation.
Gaussian Process Prior
We consider a GP distribution over functions $f: \mathcal{U} \to \mathbb{R}$ with $\mathcal{U} \subseteq \mathbb{R}^{d}$. In particular, consider a GP $f \sim \mathcal{GP}(\mu, k)$ with mean function $\mu: \mathcal{U} \to \mathbb{R}$ and positive definite kernel (i.e., covariance function) $k: \mathcal{U} \times \mathcal{U} \to \mathbb{R}$. Throughout this post we suppose that we have observed the function evaluations $f(u_1), \dots, f(u_N)$ at the inputs $U_N := \{u_1, \dots, u_N\}$ and seek to perform inference at a set of new inputs $U_M := \{\tilde{u}_1, \dots, \tilde{u}_M\}$. Echoing the notation from the previous sections, we will let $f_N, \mu_N \in \mathbb{R}^N$ denote the vectors defined by $(f_N)_i := f(u_i)$ and $(\mu_N)_i := \mu(u_i)$. We analogously define $f_M, \mu_M \in \mathbb{R}^M$ to be the vectors containing the evaluations of $f$ and $\mu$ at the unobserved inputs $U_M$. Finally, we let $C_N$, $C_M$, and $C_{MN}$ denote the kernel matrices given by $(C_N)_{ij} := k(u_i, u_j)$, $(C_M)_{ij} := k(\tilde{u}_i, \tilde{u}_j)$, and $(C_{MN})_{ij} := k(\tilde{u}_i, u_j)$. We also define $C_{NM} := C_{MN}^\top$.
Gaussian Process Posterior
The vector $f_M$ is unobserved and the goal is to characterize the conditional distribution $f_M | f_N$ (note that we will always be implicitly conditioning on the inputs). This distribution can be derived by considering the joint distribution implied by the GP prior: \begin{align} \begin{bmatrix} f_M \newline f_N \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_M \newline \mu_N \end{bmatrix}, \begin{bmatrix} C_M & C_{MN} \newline C_{NM} & C_N \end{bmatrix} \right). \end{align} The Gaussian conditioning identities (2) imply that the conditional $f_M | f_N$ is also Gaussian with mean and covariance given by
\begin{align} \hat{\mu}_{M} &:= \mu_M + C_{MN} C_N^{-1} (f_N - \mu_N) \tag{12} \newline \hat{C}_{M} &:= C_M - C_{MN} C_N^{-1} C_{NM}. \end{align}
Since GPs are characterized by their finite-dimensional distributions, (12) implies that the conditional distribution over functions is also a GP. The formulas in (12) give the mean and covariance function of this posterior GP, evaluated at the arbitrary inputs $U_M$.
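To make (12) concrete, here is a minimal GP regression sketch. The squared exponential kernel, the one-dimensional inputs, and the synthetic data are my own illustrative choices, not anything fixed by the discussion above.

```python
import numpy as np

def sq_exp_kernel(U1, U2, lengthscale=0.3, variance=1.0):
    """Squared exponential kernel k(u, u') = variance * exp(-(u - u')^2 / (2 * lengthscale^2))."""
    diff = U1[:, None] - U2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior(U_M, U_N, f_N, mean_fn, kernel):
    """Mean and covariance of f_M | f_N, following (12).
    In practice one would add a small jitter/nugget to C_N; omitted here for clarity."""
    mu_M, mu_N = mean_fn(U_M), mean_fn(U_N)
    C_M = kernel(U_M, U_M)
    C_MN = kernel(U_M, U_N)
    C_N = kernel(U_N, U_N)
    mu_hat_M = mu_M + C_MN @ np.linalg.solve(C_N, f_N - mu_N)
    C_hat_M = C_M - C_MN @ np.linalg.solve(C_N, C_MN.T)
    return mu_hat_M, C_hat_M

# Example usage on made-up data.
U_N = np.linspace(0.0, 1.0, 5)            # observed inputs
f_N = np.sin(2 * np.pi * U_N)             # observed function values
U_M = np.linspace(0.0, 1.0, 50)           # new inputs at which to predict
zero_mean = lambda U: np.zeros_like(U)
mu_hat_M, C_hat_M = gp_posterior(U_M, U_N, f_N, zero_mean, sq_exp_kernel)
```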
Pointwise Transformations of Gaussian Processes
We begin our GP applications with what I’m calling a “pointwise” affine transformation of the GP. Concretely, given a GP $f$ over the input space $\mathcal{U}$, we define the process \begin{align} &g: \mathcal{U} \to \mathbb{R}, &&g(u) := \alpha f(u) + \beta, \tag{13} \end{align} where $\alpha \neq 0$ and $\beta \in \mathbb{R}$. In words, we are simply applying an invertible affine transformation to the GP output on a point-by-point basis. Letting $g_M$, $g_N$ be the analogs of $f_M$, $f_N$ for the $g$ evaluations, we see that \begin{align} \begin{bmatrix} g_M \newline g_N \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} A_M \mu_M + b_M \newline A_N \mu_N + b_N \end{bmatrix}, \begin{bmatrix} A_M C_M A_M^\top & A_M C_{MN} A_N^\top \newline A_N C_{NM} A_M^\top & A_N C_N A_N^\top \end{bmatrix} \right), \end{align} where \begin{align} &b_N := \beta 1_N, &&A_N := \text{diag}(\alpha, \dots, \alpha), \end{align} with $1_N$ denoting a vector of $N$ ones. The quantities $b_M$ and $A_M$ are defined identically, with only the dimensions of the matrix and vector changing.
Under the assumption $\alpha \neq 0$, $A_N$ is invertible, so we are in the regime of (10). Applying this result, we see that the conditional distribution is given by \begin{align} g_M | g_N &\sim \mathcal{N}(\hat{\mu}^g_M, \hat{C}^g_M), \end{align} where
\begin{align} &\hat{\mu}^g_M := A_M \hat{\mu}_M + b_M &&\hat{C}^g_M := A_M \hat{C}_M A_M^\top. \end{align}
In other words, writing $\hat{f}$ and $\hat{g}$ for the respective posterior (conditional) processes, we have shown \begin{align} \hat{g}(u) = \alpha \hat{f}(u) + \beta, \end{align} meaning that transforming the prior as in (13) is equivalent to applying the same transformation to the posterior of $f$.
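A quick numerical check of this equivalence, reusing the `gp_posterior` sketch above: conditioning the transformed prior (mean $\alpha \mu + \beta$, kernel $\alpha^2 k$) on $g_N$ matches transforming the posterior of $f$. The values of $\alpha$ and $\beta$ are arbitrary illustrative choices.

```python
# Numerical check of the pointwise result: conditioning the transformed prior
# g = alpha * f + beta on g_N agrees with transforming the posterior of f.
# Reuses gp_posterior, sq_exp_kernel, and the synthetic data from the sketch above.
alpha, beta = 2.5, -1.0

g_N = alpha * f_N + beta
g_mean = lambda U: alpha * zero_mean(U) + beta            # transformed mean function
g_kernel = lambda U1, U2: alpha ** 2 * sq_exp_kernel(U1, U2)  # transformed kernel

mu_hat_g, C_hat_g = gp_posterior(U_M, U_N, g_N, g_mean, g_kernel)

assert np.allclose(mu_hat_g, alpha * mu_hat_M + beta)
assert np.allclose(C_hat_g, alpha ** 2 * C_hat_M)
```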
Applications
Normalizing the Response
Multi-Output GPs
Inverse Problem with Linear Forward Model
Input Dimension Reduction
Discuss alternate view as defining a new kernel.