Gaussian Process Priors, Specification and Parameter Estimation
A deep dive into hyperparameter specifications for GP mean and covariance functions, including both frequentist and Bayesian methods for hyperparameter estimation.
Gaussian processes (GPs) are widely utilized across various fields, each of which has its own preferences, terminology, and conventions. Some notable domains that make significant use of GPs include
- Spatial statistics (kriging)
- Design and analysis of computer experiments (emulator/surrogate modeling)
- Bayesian optimization
- Machine learning
Even if you’re a GP expert in one of these domains, these differences can make navigating the GP literature in other domains a bit tricky. The goal of this post is to summarize common approaches for specifying GP distributions, and emphasize conventions and assumptions that tend to differ across fields. By “specifying GP distributions”, what I am really talking about here is parameterizing the mean and covariance functions that define the GP. While GPs are non-parametric models in a certain sense, specifying and learning the hyperparameters making up the mean and covariance functions is a crucial step to successful GP applications. I will discuss popular parameterizations for these functions, and different algorithms for learning these parameter values from data. In the spirit of drawing connections across different domains, I will try my best to borrow terminology from different fields, and will draw attention to synonymous terms by using boldface.
Background
Gaussian Processes
Gaussian processes (GPs) define a probability distribution over a space of functions in such a way that they can be viewed as a generalization of Gaussian random vectors. Just as Gaussian vectors are defined by their mean vector and covariance matrix, GPs are defined by a mean and covariance function. We will interchangeably refer to the latter as either the covariance function or kernel.
We will consider GPs defined over a space of functions of the form $f: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subseteq \mathbb{R}^d$. We will refer to elements $x \in \mathcal{X}$ as inputs or locations and the images $f(x)$ as outputs or responses. If the use of the word “locations” seems odd, note that in spatial statistical settings, the inputs are often geographic coordinates. We will denote the mean and covariance function defining the GP by $\mu: \mathcal{X} \to \mathbb{R}$ and $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, respectively. The mean function is essentially unrestricted, but the covariance function must be a valid positive definite kernel. If $f$ is a GP with mean function $\mu$ and kernel $k$, we will denote this by \begin{align} f \sim \mathcal{GP}(\mu, k). \tag{1} \end{align}
The defining property of GPs is that their finite-dimensional distributions are Gaussian; that is, for an arbitrary finite set of inputs $X := \{x_1, \dots, x_n\} \subset \mathcal{X}$, the vector $f(X) := (f(x_1), \dots, f(x_n))^\top$ is distributed as \begin{align} f(X) \sim \mathcal{N}(\mu(X), k(X, X)). \tag{2} \end{align} We are vectorizing notation here so that $\mu(X) := (\mu(x_1), \dots, \mu(x_n))^\top \in \mathbb{R}^n$, and $k(X, X) \in \mathbb{R}^{n \times n}$ is the matrix with entries $[k(X, X)]_{ij} := k(x_i, x_j)$. When the two input sets to the kernel are equal, we lighten notation by writing $k(X) := k(X, X)$. Now suppose we have two sets of inputs $X$ and $\tilde{X}$, containing $n$ and $m$ inputs, respectively. The defining property (2) then implies
\begin{align} \begin{bmatrix} f(\tilde{X}) \newline f(X) \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} \mu(\tilde{X}) \newline \mu(X) \end{bmatrix}, \begin{bmatrix} k(\tilde{X}) & k(\tilde{X}, X) \newline k(X, \tilde{X}) & k(X) \end{bmatrix} \right). \tag{3} \end{align}
The Gaussian joint distribution (3) implies that the conditional distributions are also Gaussian. In particular, the distribution of $f(\tilde{X})|f(X)$ can be obtained by applying the well-known Gaussian conditioning identities:
\begin{align} f(\tilde{X})|f(X) &\sim \mathcal{N}(\hat{\mu}(\tilde{X}), \hat{k}(\tilde{X})), \tag{4} \newline \hat{\mu}(\tilde{X}) &:= \mu(\tilde{X}) + k(\tilde{X}, X)k(X)^{-1} [f(X) - \mu(X)] \newline \hat{k}(\tilde{X}) &:= k(\tilde{X}) - k(\tilde{X}, X)k(X)^{-1} k(X, \tilde{X}). \end{align} The fact that the result (4) holds for arbitrary finite sets of inputs implies that the conditional $f|f(X)$ is also a GP, with mean and covariance functions $\hat{\mu}$ and $\hat{k}$ defined by (4). On a terminology note, the matrix $k(X)$ is often called the kernel matrix. This is the $n \times n$ matrix containing the kernel evaluations at the set of observed locations.
Regression with GPs
One common application of GPs is their use as a flexible nonlinear regression model. Let’s consider the basic regression setup with observed data pairs $\{(x_i, y_i)\}_{i=1}^{n}$. We assume that the $y_i$ are noisy observations of some underlying latent function output $f(x_i)$. The GP regression model arises by placing a GP prior distribution on the latent function $f$. We thus consider the regression model \begin{align} y(x) &= f(x) + \epsilon(x) \tag{5} \newline f &\sim \mathcal{GP}(\mu, k) \newline \epsilon(x) &\overset{iid}{\sim} \mathcal{N}(0, \sigma^2), \end{align} where we have assumed a simple additive Gaussian noise model. This assumption is quite common in the GP regression setting due to the fact that it results in closed-form conditional distributions, similar to (4). We will assume the error model (5) throughout this post, but note that there are many other possibilities if one is willing to abandon closed-form posterior inference.
The solution of the regression problem is given by the distribution of $f(\tilde{X})|y(X)$ or $y(\tilde{X})|y(X)$, where $y(X) := (y(x_1), \dots, y(x_n))^\top$ is the $n$-dimensional vector of observed responses. The first distribution is the posterior on the latent function $f$, while the second incorporates the observation noise as well. Both distributions can be derived in the same way, so we focus on the second. Letting $\tilde{X}$ denote a set of $m$ inputs at which we would like to predict the response, consider the joint distribution \begin{align} \begin{bmatrix} y(\tilde{X}) \newline y(X) \end{bmatrix} &\sim \mathcal{N}\left( \begin{bmatrix} \mu(\tilde{X}) \newline \mu(X) \end{bmatrix}, \begin{bmatrix} k(\tilde{X}) + \sigma^2 I_m & k(\tilde{X}, X) \newline k(X, \tilde{X}) & k(X) + \sigma^2 I_n \end{bmatrix} \right). \tag{6} \end{align} This is quite similar to (3), but now takes into account the noise term $\epsilon$. This does not affect the mean vector since $\epsilon$ is mean-zero; nor does it affect the off-diagonal blocks of the covariance matrix since $f$ and $\epsilon$ were assumed independent. Applying the Gaussian conditioning identities (4) yields the posterior distribution \begin{align} y(\tilde{X})|y(X) &\sim \mathcal{N}(\hat{\mu}(\tilde{X}), \hat{k}(\tilde{X})), \tag{7} \newline \hat{\mu}(\tilde{X}) &:= \mu(\tilde{X}) + k(\tilde{X}, X)[k(X) + \sigma^2 I_n]^{-1} [y(X) - \mu(X)] \newline \hat{k}(\tilde{X}) &:= \sigma^2 I_m + k(\tilde{X}) - k(\tilde{X}, X)[k(X) + \sigma^2 I_n]^{-1} k(X, \tilde{X}). \end{align} We will refer to (7) as the GP posterior, predictive, or generically conditional, distribution. We observe that these equations are identical to (4), modulo the appearance of $\sigma^2$ in the predictive mean and covariance equations. The distribution $f(\tilde{X})|y(X)$ is identical to (7), except that the $\sigma^2 I_m$ term is removed from the predictive covariance. Again, this reflects the subtle distinction between doing inference on the latent function $f$ versus on the observation process $y$.
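To make the predictive equations (7) concrete, here is a minimal numpy sketch. The names gp_predict, mu, and kernel are my own placeholders (any mean function and valid kernel could be plugged in), and the implementation uses a Cholesky factorization rather than an explicit matrix inverse.

```python
import numpy as np

def gp_predict(X_new, X, y, mu, kernel, noise_var):
    """Posterior predictive mean and covariance of y(X_new) | y(X), as in (7).

    mu: callable returning the mean vector at a set of inputs.
    kernel: callable returning the matrix of pairwise kernel evaluations.
    """
    n, m = X.shape[0], X_new.shape[0]
    K = kernel(X, X) + noise_var * np.eye(n)       # k(X) + sigma^2 I_n
    K_cross = kernel(X_new, X)                     # k(X_new, X)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu(X)))
    pred_mean = mu(X_new) + K_cross @ alpha
    V = np.linalg.solve(L, K_cross.T)              # L^{-1} k(X, X_new)
    pred_cov = kernel(X_new, X_new) + noise_var * np.eye(m) - V.T @ V
    return pred_mean, pred_cov
```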
Noise, Nuggets, and Jitters
Observe that this whole regression procedure is only slightly different from the noiseless GP setting explored in the previous section (thanks to the Gaussian likelihood assumption). Indeed, the conditional distribution of $f(\tilde{X})|y(X)$ is derived from (4) by simply replacing $k(X)$ with $k(X) + \sigma^2 I_n$ (obtaining the distribution $y(\tilde{X})|y(X)$ requires the one additional step of adding $\sigma^2 I_m$ to the predictive covariance). In other words, we have simply applied standard GP conditioning using the modified kernel matrix \begin{align} C(X) := k(X) + \sigma^2 I_n. \tag{8} \end{align} We thus might reasonably wonder if the model (5) admits an alternative equivalent representation by defining a GP directly on the observation process $y$. Defining such a model would require defining a kernel $c$ that is consistent with (8). This route is fraught with difficulties and subtleties, which I will do my best to describe clearly here. At first glance, it seems like the right choice is \begin{align} c(x, x^\prime) := k(x, x^\prime) + \sigma^2 \delta(x, x^\prime), \tag{9} \end{align} where $\sigma^2 \delta(x, x^\prime)$, with $\delta(x, x^\prime) := 1[x = x^\prime]$, is sometimes called the stationary white noise kernel. Why isn’t this quite right? Notice in (9) that $\sigma^2$ is added whenever the inputs are equal. However, suppose we observe multiple independent realizations of the process at the same inputs $X$. In the regression model (5) the errors are independent across these realizations, even at the same locations. However, this will not hold true in the model under (9), since $c$ only sees the values of the inputs, and has no sense of distinction across realizations. We might try to fix this by writing something like \begin{align} c(x_i, x_j) := k(x_i, x_j) + \sigma^2 \delta_{ij}, \tag{10} \end{align} where the delta function $\delta_{ij} := 1[i = j]$ now depends on the labels $i, j$ instead of the values of the inputs. In the spatial statistics literature, it is not uncommon to see a covariance function defined like (10), but this is basically a notational hack. A kernel is a function of two inputs from $\mathcal{X}$ - we can’t have it also depending on some side information like the labels $i, j$. At the end of the day, (9) and (10) are attempts to incorporate some concept of white noise inside the kernel itself, rather than via a hierarchical model like (5). I would just stick with the hierarchical model, which is easily rigorously defined and much more intuitive.
Nonetheless, one should not be surprised if expressions like (10) pop up, especially in the spatial statistics literature. Spatial statisticians refer to the noise term $\epsilon(x)$ as the nugget, and $\sigma^2$ as the nugget variance (sometimes these terms are conflated). In this context, instead of representing observation noise, $\epsilon(x)$ is often thought of as representing some unresolved small-scale randomness in the spatial field itself. If you imagine sampling a field to determine the concentration of some mineral across space, then you would hope that repeated measurements (taken around the same time) would yield the same values. Naturally, they may not, and the introduction of the nugget is one way to account for this.
While this discussion may seem needlessly abstract, recall that the effect of incorporating the noise term (however you want to interpret it) is to simply replace the kernel matrix $k(X)$ with the new matrix $C(X) = k(X) + \sigma^2 I_n$. Confusingly, there is one more reason (having nothing to do with observation error or nuggets) that people use a matrix of this form in place of $k(X)$: numerical stability. Indeed, even though $k(X)$ is theoretically positive definite, in practice its numerical instantiation may fail to have this property. A simple approach to deal with this is to add a small, fixed constant to the diagonal of the kernel matrix. In this context, this constant is often called the jitter. While computationally its effect is the same as the nugget, note that its introduction is motivated very differently. The jitter is not stemming from some sort of random white noise; it is purely a computational hack to improve the conditioning of the kernel matrix. Check out this thread for some entertaining debates on the use of the nugget and jitter concepts.
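As a small illustration of the jitter idea, the sketch below retries a Cholesky factorization with a progressively larger constant added to the diagonal of the kernel matrix; the starting value and growth factor are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

def cholesky_with_jitter(K, initial_jitter=1e-8, max_tries=6):
    """Try to factor K; if it fails numerically, add jitter to the diagonal and retry."""
    jitter = 0.0
    for attempt in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * np.eye(K.shape[0])), jitter
        except np.linalg.LinAlgError:
            jitter = initial_jitter * (10.0 ** attempt)  # grow the jitter and retry
    raise np.linalg.LinAlgError("Cholesky failed even after adding jitter.")
```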
Parameterized Means and Kernels
Everything we have discussed thus far assumes fixed mean and covariance functions. In practice, suitable choices for these quantities are not typically known. Thus, the usual approach is to specify some parametric families $\mu_{\psi}$ and $k_{\phi}$ and learn their parameters from data. The parameters $\psi$ and $\phi$ are often referred to as hyperparameters, since they are not the primary parameters of interest in the GP regression model. Recalling from (5) that the GP acts as a prior distribution on the latent function, we see that $\psi$ and $\phi$ control the specification of this prior distribution. In addition to $\psi$ and $\phi$, the noise variance $\sigma^2$ is also typically not known. I will not wade back into the previous section’s debate in arguing whether this should be classified as a “hyperparameter” or not. In any case, let $\theta := (\psi, \phi, \sigma^2)$ denote the full set of (hyper)parameters that must be learned from data.
Mean Functions
The machine learning community commonly uses the simplest possible form for the mean function: $\mu(x) := 0$. This zero-mean assumption is less restrictive than it seems, since GPs mainly derive their expressivity from the kernel. A slight generalization is to allow a constant, non-zero mean $\mu(x) := \beta_0$, where $\beta_0 \in \mathbb{R}$. However, constant (including zero-mean) GP priors can have some undesirable properties; e.g., in the context of extrapolation. Sometimes one wants more flexibility, and in these cases it is quite common to consider some sort of linear regression model \begin{align} \mu(x) = h(x)^\top \beta, \tag{11} \end{align} where $h: \mathcal{X} \to \mathbb{R}^p$ is some feature map and $\beta \in \mathbb{R}^p$ the associated coefficient vector. For example, $h(x) = (1, x^\top)^\top$ would yield a standard linear model, and appending the squared terms $(x^1)^2, \dots, (x^d)^2$ would allow for a quadratic trend.
Kernels
The positive definite restriction makes defining valid covariance functions much more difficult than defining mean functions. Thus, one typically falls back on one of a few popular choices of known parametric kernel families (though note that kernels can be combined in various ways to give a large variety of options). While the goal of this post is not to explore specific kernels, in order to have a concrete example in mind consider the following parameterization: \begin{align} k(x, \tilde{x}) = \alpha^2 \exp\left(-\sum_{j=1}^{d} \left(\frac{\lvert x^{j} - \tilde{x}^j \rvert}{\ell^j}\right)^2\right). \tag{12} \end{align} Note that I’m using superscripts to index vector entries here. This kernel goes by many names, including exponentiated quadratic, squared exponential, Gaussian, radial basis function, and automatic relevance determination. The parameter $\alpha^2$ is sometimes called the marginal variance, or just the scale parameter. The parameters $\ell^1, \dots, \ell^d$ are often called lengthscale, smoothness, or range parameters, since they control the smoothness of the GP realizations along each coordinate direction. Other popular kernels (e.g., Matérn) have analogous parameters controlling similar features. Note that in this example we have $\phi = (\alpha^2, \ell^1, \dots, \ell^d)$. Also note that people choose to parameterize the Gaussian kernel in many different ways; for example, it’s not uncommon to see a factor of $\tfrac{1}{2}$ included inside the exponential to make the kernel align with the typical parameterization of the Gaussian probability density function. Knowing which parameterization you’re working with is important for interpreting the hyperparameters, specifying bounds, defining priors, etc.
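For concreteness, here is one way to code up the parameterization (12) in numpy. The function name and arguments are my own; recall that other sources may include an extra factor of 1/2 (or square the lengthscales differently) in the exponent.

```python
import numpy as np

def exp_quad_kernel(X1, X2, marginal_var, lengthscales):
    """Exponentiated quadratic (ARD) kernel, parameterized as in (12).

    X1: (n, d) array, X2: (m, d) array,
    marginal_var: alpha^2, lengthscales: length-d array of ell^j values.
    """
    Z1 = X1 / lengthscales            # scale each coordinate by its lengthscale
    Z2 = X2 / lengthscales
    sq_dists = (np.sum(Z1**2, axis=1)[:, None]
                + np.sum(Z2**2, axis=1)[None, :]
                - 2.0 * Z1 @ Z2.T)    # pairwise squared scaled distances
    sq_dists = np.maximum(sq_dists, 0.0)  # clamp tiny negatives from round-off
    return marginal_var * np.exp(-sq_dists)
```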
It is quite common in the spatial statistics (and sometimes the computer experiments) literature to see kernels written like $k(x, \tilde{x}) = \alpha^2 \rho(x, \tilde{x})$; in these cases $\rho$ typically represents a correlation function, which becomes the covariance function after multiplying by the marginal variance $\alpha^2$. There is an advantage in decomposing the kernel this way when it comes to estimating the hyperparameters, which we will discuss shortly.
The GP (Marginal) Likelihood Function
Let’s first recall the GP regression model (5)
\begin{align}
y(x) &= f(x) + \epsilon(x) \newline
f &\sim \mathcal{GP}(\mu_{\psi}, k_{\phi}) \newline
\epsilon &\overset{iid}{\sim} \mathcal{N}(0, \sigma^2),
\end{align}
where we have now explicitly added the dependence on the mean and kernel hyperparameters $\psi$ and $\phi$.
This model is defined for any $x \in \mathcal{X}$. However, when estimating
hyperparameters, we will naturally be restricting the model to $X$, the finite
set of locations at which we actually have observations. Restricting to $X$
reduces the above model to the standard (finite-dimensional)
Bayesian regression model
\begin{align}
y(X)|f(X), \theta &\sim \mathcal{N}(f(X), \sigma^2 I_n) \tag{13} \newline
f(X)|\theta &\sim \mathcal{N}(\mu_{\psi}(X), k_{\phi}(X)).
\end{align}
We could consider completing the Bayesian specification by defining a prior
on $\theta$, but we’ll hold off on this for now.
Notice that the model (13) defines a joint distribution over
$(y(X), f(X))$, with $y(X)|f(X), \theta$ representing the
likelihood of the observations at the observed input locations $X$. At present
everything is conditional on a fixed $\theta$.
Now, if we marginalize the likelihood with
respect to $f(X)$ then we obtain the distribution $y(X)|\theta$. This is often
called the marginal likelihood, due to the fact that $f(X)$ was marginalized
out. In particular, the distribution $y(X)|\theta$ has implicitly marginalized
the function values at all locations other than $X$ as well.
This same logic and terminology applies in the noiseless setting with $y(X) = f(X)$,
in which case the marginal likelihood is given by the distribution $f(X)|\theta$. In the noisy
setting we are marginalizing both over the unobserved function values and
the noise $\epsilon(X)$.
Thanks to all the Gaussian assumptions here, the marginal likelihood
$y(X)|\theta$ is available in closed form. One could approach the derivation using (13) as
the starting point, but it’s much easier to consider the model written out using
random variables,
\begin{align}
y(X) &= f(X) + \epsilon(X).
\end{align}
Since $f(X)$ and $\epsilon(X)$ are independent Gaussians, their sum is also
Gaussian, with mean and covariance given by
\begin{align}
\mathbb{E}[y(X)|\theta]
&= \mathbb{E}[f(X)|\theta] + \mathbb{E}[\epsilon(X)|\theta] = \mu_{\psi}(X) \newline
\text{Cov}[y(X)|\theta]
&= \text{Cov}[f(X)|\theta] + \text{Cov}[\epsilon(X)|\theta]
= k_{\phi}(X) + \sigma^2 I_n.
\end{align}
We have thus found that
\begin{align}
y(X)|\theta \sim \mathcal{N}\left(\mu_{\psi}(X), C_{\phi, \sigma^2}(X)\right), \tag{14}
\end{align}
recalling the definition $C_{\phi, \sigma^2}(X) := k_{\phi}(X) + \sigma^2 I_n$.
We will let $\mathcal{L}(\theta)$ denote the log density of this Gaussian
distribution; i.e., the log marginal likelihood:
\begin{align}
\mathcal{L}(\theta)
&:= -\frac{1}{2} \log \text{det}\left(2\pi C_{\phi, \sigma^2}(X) \right) -
\frac{1}{2} (y(X) - \mu_{\psi}(X))^\top C_{\phi, \sigma^2}(X)^{-1} (y(X) - \mu_{\psi}(X)) \tag{15}
\end{align}
The function $\mathcal{L}(\theta)$ plays a central role in the typical approach
to hyperparameter optimization, as we will explore below. Also note that
the above derivations apply to the noiseless setting as well
(i.e., $y(X) = f(X)$) by setting $\sigma^2 = 0$. In this case, the marginal
likelihood is simply the GP prior distribution restricted to the inputs $X$.
I have so far been a bit verbose with the notation in (15) to make very explicit the dependence on the inputs $X$ and the hyperparameters. To lighten notation a bit, we define $y_n := y(X)$, $\mu_{\psi} := \mu_{\psi}(X)$, and $C_{\phi, \sigma^2} := C_{\phi, \sigma^2}(X)$, allowing us to rewrite (15) as \begin{align} \mathcal{L}(\theta) &:= -\frac{1}{2} \log \text{det}\left(2\pi C_{\phi, \sigma^2} \right) - \frac{1}{2} (y_n - \mu_{\psi})^\top C_{\phi, \sigma^2}^{-1} (y_n - \mu_{\psi}). \tag{16} \end{align} We have simply suppressed the explicit dependence on $X$ in the notation.
Hyperparameter Optimization
We now begin to turn our attention to methods for learning the values of the hyperparameters $\theta$ from data. This section starts with the most popular approach: optimizing the marginal likelihood.
Maximum Marginal Likelihood, or Empirical Bayes
Recall that (16) gives the expression for the log marginal likelihood $\mathcal{L}(\theta)$, which is just the log density of $y_n$ viewed as a function of $\theta$. A natural approach is to set the hyperparameters to the values that maximize $\mathcal{L}(\theta)$: \begin{align} \hat{\theta} := \text{argmax}_{\theta} \mathcal{L}(\theta). \tag{17} \end{align} At first glance, the Gaussian form of $\mathcal{L}(\theta)$ might look quite friendly to closed-form optimization. After all, maximum likelihood estimates of the mean and covariance of Gaussian vectors are indeed available in closed form. However, upon closer inspection notice that the covariance is not being directly optimized; we are optimizing $\theta$, and the covariance $C_{\phi, \sigma^2}$ is in general a nonlinear function of these parameters. Thus, in general some sort of iterative numerical scheme is used for the optimization. Typically, gradient-based approaches are preferred, meaning we must be able to calculate quantities like $\nabla_{\theta} \mathcal{L}(\theta)$. The exact gradient calculations will thus depend on the choice of kernel; specifics on kernels and optimization schemes are not the focus of this post. We will instead focus on the high-level ideas here. The general approach to GP regression that we have outlined so far can be summarized as:
- Solve the optimization problem (17) and fix the hyperparameters at their optimized values $\hat{\theta}$. The hyperparameters will be fixed from this point onward.
- Use the GP predictive equations (7) to perform inference at a set of locations of interest $\tilde{X}$.
One might object to the fact that we are estimating the hyperparameters from data, and then neglecting the uncertainty in $\hat{\theta}$ during the prediction step. It is true that this uncertainty is being ignored, but it is also very computationally convenient to do so. We will discuss alternatives later on, but this simple approach is probably the most commonly used in practice today. One way to think about this strategy is in an empirical Bayes context; that is, we can view this approach as an approximation to a fully Bayesian hierarchical model, which would involve equipping the hyperparameters with their own priors. Instead of marginalizing the hyperparameters, we fix them at their most likely values with respect to the observed data. We are using the data to “fine tune” the GP prior distribution. In the literature you will see this general hyperparameter optimization strategy referred to as either empirical Bayes, maximum marginal likelihood, or even just maximum likelihood.
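The sketch below shows the overall empirical Bayes workflow implied by (17): wrap the negative log marginal likelihood in an objective and hand it to a generic optimizer. The objective function here is a stand-in (it could be built from the Cholesky-based expression derived later in this post), and the use of log-transformed parameters, L-BFGS-B, and these particular starting values are illustrative assumptions rather than recommendations.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(neg_log_marginal_lik, log_theta0):
    """Empirical Bayes: minimize the negative log marginal likelihood over theta.

    neg_log_marginal_lik: callable taking hyperparameters on the log scale.
    log_theta0: initial guess for the log-scale hyperparameters.
    """
    result = minimize(neg_log_marginal_lik, log_theta0, method="L-BFGS-B")
    return np.exp(result.x), result  # return hyperparameters on the original scale

# Hypothetical usage, with theta = (marginal variance, lengthscale, noise variance)
# and my_objective a user-supplied closure over the observed data (X, y):
# theta_hat, info = fit_hyperparameters(my_objective, np.log([1.0, 0.5, 0.1]))
```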
Special Case Closed-Form Solutions: Mean Function
As mentioned above, in general the maximization of $\mathcal{L}(\theta)$ requires numerical methods. However, in certain cases elements of $\theta$ can be optimized in closed form, meaning that numerical optimization may only be required for a subset of the hyperparameters. We start by considering closed-form optimizers for the parameters defining the mean function.
Constant Mean: Plug-In MLE
With the choice of constant mean $\mu_{\psi}(x) = \beta_0$, the log marginal likelihood becomes
\begin{align} \mathcal{L}(\theta) &:= -\frac{1}{2} \log \text{det}\left(2\pi C_{\phi, \sigma^2} \right) - \frac{1}{2} (y_n - \beta_0 1_n)^\top C_{\phi, \sigma^2}^{-1} (y_n - \beta_0 1_n), \end{align}
with $1_n \in \mathbb{R}^n$ denoting a vector of ones. We now consider optimizing $\mathcal{L}(\theta)$ as a function of $\beta_0$ only. The partial derivative with respect to the constant mean equals \begin{align} \frac{\partial \mathcal{L}(\theta)}{\partial \beta_0} &= y_n^\top C_{\phi, \sigma^2}^{-1}1_n - \beta_0 1_n^\top C_{\phi, \sigma^2}^{-1} 1_n. \tag{18} \end{align} Setting (18) equal to zero and solving for $\beta_0$ gives the optimum \begin{align} \hat{\beta}_0(\phi, \sigma^2) = \frac{y_n^\top C_{\phi, \sigma^2}^{-1} 1_n}{1_n^\top C_{\phi, \sigma^2}^{-1} 1_n}. \tag{19} \end{align} Notice that $\hat{\beta}_0(\phi, \sigma^2)$ depends on the values of the other hyperparameters $\phi$ and $\sigma^2$. Therefore, while this does not give us the outright value for the mean, we can plug $\hat{\beta}_0(\phi, \sigma^2)$ in place of $\beta_0$ in the marginal likelihood. This yields the profile likelihood (aka the concentrated likelihood), which is no longer a function of $\beta_0$, and hence the dimensionality of the subsequent numerical optimization problem has been reduced.
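A minimal sketch of the plug-in estimate (19), assuming the covariance matrix C = k(X) + sigma^2 I_n has already been formed at the current values of the remaining hyperparameters:

```python
import numpy as np

def constant_mean_mle(y, C):
    """Plug-in MLE (19) of the constant mean beta_0, given the covariance matrix C."""
    ones = np.ones_like(y)
    C_inv_ones = np.linalg.solve(C, ones)         # C^{-1} 1_n
    return (y @ C_inv_ones) / (ones @ C_inv_ones)
```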
Linear Model Coefficients: Plug-In MLE
Let’s try to do the same thing with the mean function $\mu_{\psi}(x) = h(x)^\top \beta$. The constant mean function is actually just a special case of this more general setting, but it’s common enough that it warranted its own section. If we denote by $H \in \mathbb{R}^{n \times p}$ the feature matrix with rows equal to $h(x_1)^\top, \dots, h(x_n)^\top$, then the marginal likelihood becomes \begin{align} \mathcal{L}(\theta) &:= -\frac{1}{2} \log \text{det}\left(2\pi C_{\phi, \sigma^2} \right) - \frac{1}{2} (y_n - H\beta)^\top C_{\phi, \sigma^2}^{-1} (y_n - H\beta), \tag{20} \end{align} with gradient \begin{align} \nabla_{\beta} \mathcal{L}(\theta) &= H^\top C_{\phi, \sigma^2}^{-1}y_n - (H^\top C_{\phi, \sigma^2}^{-1} H)\beta. \tag{21} \end{align} Setting the gradient equal to zero and solving for $\beta$ yields the optimality condition \begin{align} \left(H^\top C_{\phi, \sigma^2}^{-1} H\right)\hat{\beta} &= H^\top C_{\phi, \sigma^2}^{-1}y_n. \tag{22} \end{align} A unique solution for $\hat{\beta}$ thus exists when $H^\top C_{\phi, \sigma^2}^{-1} H$ is invertible. When does this happen? First note that this matrix is positive semidefinite, since \begin{align} \beta^\top \left(H^\top C_{\phi, \sigma^2}^{-1} H\right) \beta &= \beta^\top (H^\top [LL^\top]^{-1} H) \beta = \lVert L^{-1} H\beta \rVert_2^2 \geq 0, \end{align} where we have used the fact that $C_{\phi, \sigma^2}$ is positive definite and hence admits a Cholesky decomposition $C_{\phi, \sigma^2} = LL^\top$. The matrix $H^\top C_{\phi, \sigma^2}^{-1} H$ is thus positive definite when $L^{-1}H$ has linearly independent columns; i.e., when it is full rank. We already know that $L^{-1}$ is full rank. If we assume that $H$ is also full rank and $n \geq p$, then we can conclude that $L^{-1}H$ is full rank; see this post for a quick proof. Thus, under these assumptions we conclude that $H^\top C_{\phi, \sigma^2}^{-1} H$ is invertible and so \begin{align} \hat{\beta}(\phi, \sigma^2) &= \left(H^\top C_{\phi, \sigma^2}^{-1} H\right)^{-1} H^\top C_{\phi, \sigma^2}^{-1}y_n. \tag{23} \end{align} Notice that (23) is simply a generalized least squares estimator. As with the constant mean, we can plug $\hat{\beta}(\phi, \sigma^2)$ into the marginal likelihood to concentrate out the parameter $\beta$. The resulting concentrated likelihood can then be numerically optimized as a function of the remaining hyperparameters.
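The generalized least squares estimator (23) is conveniently computed by whitening with a Cholesky factor of C, which turns it into an ordinary least squares problem; this is a sketch under the assumption that H has full column rank.

```python
import numpy as np

def gls_coefficients(H, y, C):
    """Generalized least squares estimate (23) of the mean coefficients beta."""
    L = np.linalg.cholesky(C)                    # C = L L^T
    H_w = np.linalg.solve(L, H)                  # whitened design matrix L^{-1} H
    y_w = np.linalg.solve(L, y)                  # whitened response L^{-1} y
    beta_hat, *_ = np.linalg.lstsq(H_w, y_w, rcond=None)  # OLS on the whitened problem
    return beta_hat
```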
Linear Model Coefficients: Closed-Form Marginalization
The above section showed that, conditional on fixed kernel hyperparameters,
the coefficients of a linear mean function can be optimized in closed form.
We now show a similar result: if the mean coefficients are assigned a Gaussian
prior then, conditional on fixed kernel hyperparameters, the coefficients can
be marginalized in closed form. To this end, we consider the same linear mean
function as above, but now equip the coefficients with a Gaussian prior:
\begin{align}
\mu_{\psi}(x) &= h(x)^\top \beta, &&\beta \sim \mathcal{N}(b, B).
\end{align}
Restricted to the observed inputs $X$, the model is thus
\begin{align}
y_n|\beta &\sim \mathcal{N}\left(H\beta, C_{\phi, \sigma^2} \right) \newline
\beta &\sim \mathcal{N}(b, B).
\end{align}
Our goal here is to derive the marginal distribution of $y_n$. We could resort to
computing the required integral by hand, but an easier approach is to notice
that under the above model $(y_n, \beta)$ is jointly Gaussian.
Therefore, the marginal distribution of $y_n$ must also be Gaussian. It thus
remains to identify the mean and covariance of this distribution. We obtain
\begin{align}
\mathbb{E}[y_n]
&= \mathbb{E}\mathbb{E}[y_n|\beta] = \mathbb{E}[H\beta] = Hb \newline
\text{Cov}[y_n]
&= \mathbb{E}[y_n y_n^\top] - \mathbb{E}[y_n]\mathbb{E}[y_n]^\top \newline
&= \mathbb{E} \mathbb{E}\left[y_n y_n^\top | \beta\right] - (Hb)(Hb)^\top \newline
&= \mathbb{E}\left[\text{Cov}[y_n|\beta] + \mathbb{E}[y_n|\beta] \mathbb{E}[y_n|\beta]^\top \right] - Hbb^\top H^\top \newline
&= \mathbb{E}\left[C_{\phi, \sigma^2} + (H\beta)(H\beta)^\top \right] - Hbb^\top H^\top \newline
&= C_{\phi, \sigma^2} + H\mathbb{E}\left[\beta \beta^\top \right]H^\top - Hbb^\top H^\top \newline
&= C_{\phi, \sigma^2} + H\left[B + bb^\top \right]H^\top - Hbb^\top H^\top \newline
&= C_{\phi, \sigma^2} + HBH^\top,
\end{align}
where we have used the law of total expectation and the various equivalent
definitions for the covariance matrix. To summarize, we have found that the
above hierarchical model implies the marginal distribution
\begin{align}
y_n &\sim \mathcal{N}\left(Hb, C_{\phi, \sigma^2} + HBH^\top \right).
\end{align}
Since this holds for any set of inputs, we obtain the analogous result for the
GP prior:
\begin{align}
y(x) &= f(x) + \epsilon(x) \newline
f &\sim \mathcal{GP}\left(\mu^\prime, k^\prime \right) \newline
\epsilon(x) &\overset{iid}{\sim} \mathcal{N}(0, \sigma^2),
\end{align}
where
\begin{align}
\mu^\prime(x) &= h(x)^\top b \newline
k^\prime(x_1, x_2) &= k(x_1, x_2) + h(x_1)^\top B h(x_2).
\end{align}
After marginalizing, we again end up with a mean function that is linear in the
basis functions $h(x)$. The basis function coefficients are now given by
the prior mean $b$. The mean $b$ is something that we can prescribe, or we could
again entertain an empirical Bayes approach to set its value. Note that we have
descended another step in the hierarchical ladder. The kernel that appears from
the marginalization is now a sum of two kernels: the original kernel $k$ and
the kernel $(x_1, x_2) \mapsto h(x_1)^\top B h(x_2)$. The latter can be viewed as a linear kernel
in the transformed inputs $h(x)$, weighted by the positive
definite matrix $B$. It serves to account for the uncertainty in the coefficients
of the mean function.
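In code, this marginalization result just means swapping in the modified mean and kernel. The sketch below wraps a user-supplied base kernel and feature map h (returning an (n, p) matrix); the names are mine and the wrapper is only meant to illustrate the formulas above.

```python
import numpy as np

def marginalize_linear_mean(base_kernel, h, b, B):
    """Return the pair (mu', k') obtained by marginalizing beta ~ N(b, B)."""
    def mu_prime(X):
        return h(X) @ b                                      # h(x)^T b
    def k_prime(X1, X2):
        return base_kernel(X1, X2) + h(X1) @ B @ h(X2).T     # k + h(x1)^T B h(x2)
    return mu_prime, k_prime
```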
Special Case Closed-Form Solutions: Marginal Variance
We now consider a closed-form plug-in estimate for the marginal variance $\alpha^2$ appearing in (12). The takeaway from this section will be that a closed-form estimate is only available when the covariance matrix appearing in the marginal likelihood (16) is of the form \begin{align} C_{\phi} &= \alpha^2 C, \tag{24} \end{align} where $C$ does not depend on $\alpha^2$. This holds for any kernel of the form $k_{\phi}(x, \tilde{x}) = \alpha^2 \rho(x, \tilde{x})$, provided that $\sigma^2 = 0$; in this case $C = \rho(X)$ is the correlation matrix. For example, the exponentiated quadratic kernel in (12) satisfies this requirement. With this assumption, the marginal likelihood is given by \begin{align} \mathcal{L}(\theta) &= -\frac{n}{2} \log\left(2\pi \alpha^2 \right) - \frac{1}{2}\log\text{det}(C) - \frac{1}{2\alpha^2} (y_n - \mu_{\psi})^\top C^{-1} (y_n - \mu_{\psi}). \tag{25} \end{align} The analytical derivations given below go through for any log marginal likelihood of this form. However, this doesn’t work immediately for the common setting with an observation variance $\sigma^2 > 0$, since in this case the covariance assumes the form \begin{align} C_{\phi, \sigma^2} &= \alpha^2 \rho(X) + \sigma^2 I_n. \end{align} This can be addressed via the simple reparameterization \begin{align} \tilde{\alpha}^2 C &:= \tilde{\alpha}^2\left(\rho(X) + \tilde{\sigma}^2 I_n \right). \end{align} This gives the required form of the covariance, and maintains the same number of parameters as before. The one downside is that we lose the straightforward interpretation of the noise variance; the observation noise variance is now given by the product $\tilde{\alpha}^2 \tilde{\sigma}^2$ instead of being encoded in the single parameter $\sigma^2$. This reparameterization is utilized in the R package hetGP.
Plug-In MLE
Let’s consider optimizing the log marginal likelihood (25) with respect to $\alpha^2$. The partial derivative of (25) with respect to $\alpha^2$ is given by \begin{align} \frac{\partial \mathcal{L}(\theta)}{\partial \alpha^2} &= -\frac{n}{2}\frac{2\pi}{2\pi \alpha^2} + \frac{(y_n - \mu_{\psi})^\top C^{-1} (y_n - \mu_{\psi})}{2\alpha^4} \newline &= -\frac{n}{2\alpha^2} + \frac{(y_n - \mu_{\psi})^\top C^{-1} (y_n - \mu_{\psi})}{2\alpha^4}. \end{align} Setting this expression equal to zero and solving for $\alpha^2$ yields \begin{align} \hat{\alpha}^2 &= \frac{(y_n - \mu_{\psi})^\top C^{-1} (y_n - \mu_{\psi})}{n}. \end{align} Following the same procedure as before, the estimate $\hat{\alpha}^2$ can be subbed in for $\alpha^2$ in (25) to obtain the concentrated log marginal likelihood.
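A sketch of the resulting plug-in estimate, assuming the matrix C from (25) (the correlation matrix, possibly with the rescaled nugget folded in) and the mean vector have already been formed:

```python
import numpy as np

def marginal_variance_mle(y, mu, C):
    """Plug-in MLE of alpha^2: (y - mu)^T C^{-1} (y - mu) / n, with C as in (25)."""
    resid = y - mu
    return (resid @ np.linalg.solve(C, resid)) / y.shape[0]
```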
Closed-Form Marginalization
Bias Corrections
Bayesian Approaches
Computational Considerations
Log Marginal Likelihood
We start by considering the computation of the log marginal likelihood (16), \begin{align} \mathcal{L}(\theta) &= -\frac{n}{2} \log(2\pi) -\frac{1}{2} \log \text{det}\left(C \right) - \frac{1}{2} (y - \mu)^\top C^{-1} (y - \mu), \end{align} where we now suppress all dependence on hyperparameters in the notation for succinctness. Since $C = k(X) + \sigma^2 I_n$ is positive definite, we may Cholesky decompose it as $C = L L^\top$. Plugging this decomposition into the log marginal likelihood yields \begin{align} \mathcal{L}(\theta) &= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log\text{det}\left(C\right) - \frac{1}{2} (y - \mu)^\top \left(LL^\top \right)^{-1} (y - \mu). \end{align} The log determinant and the quadratic term can both be conveniently written in terms of the Cholesky factor. These terms are given respectively by \begin{align} \log\text{det}\left(LL^\top\right) &= \log\text{det}\left(L\right)^2 = 2 \log \prod_{i=1}^{n} L_{ii} = 2 \sum_{i=1}^{n} \log\left(L_{ii} \right), \end{align} and \begin{align} (y - \mu)^\top \left(LL^\top \right)^{-1} (y - \mu) &= (y - \mu)^\top \left(L^{-1}\right)^\top L^{-1} (y - \mu) = \lVert L^{-1}(y - \mu)\rVert_2^2. \end{align} The linear solve $L^{-1}(y - \mu)$ can be computed in $\mathcal{O}(n^2)$ by exploiting the lower triangular structure of the linear system. Plugging these terms back into the log marginal likelihood gives \begin{align} \mathcal{L}(\theta) &= -\frac{n}{2} \log(2\pi) - \sum_{i=1}^{n} \log\left(L_{ii}\right) - \frac{1}{2} \lVert L^{-1}(y - \mu)\rVert_2^2. \end{align} Note that the Cholesky factor $L$ is a function of $\phi$ and $\sigma^2$, and hence must be re-computed whenever the kernel hyperparameters or noise variance change.
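The expression above translates directly into code; the following sketch uses scipy's triangular solver for the O(n^2) lower-triangular solve.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_marginal_likelihood(y, mu, C):
    """Cholesky-based evaluation of the log marginal likelihood (16)."""
    n = y.shape[0]
    L = cholesky(C, lower=True)                   # C = L L^T
    z = solve_triangular(L, y - mu, lower=True)   # z = L^{-1} (y - mu)
    return (-0.5 * n * np.log(2.0 * np.pi)
            - np.sum(np.log(np.diag(L)))          # -(1/2) log det(C)
            - 0.5 * z @ z)                        # -(1/2) ||L^{-1}(y - mu)||^2
```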
Profile Log Marginal Likelihood with Linear Mean Function
We now consider computation of the concentrated marginal log-likelihood under
a mean function of the form (11), $\mu(x) = h(x)^\top \beta$, where the generalized
least squares (GLS) estimator $\hat{\beta} = \left(H^\top C^{-1} H\right)^{-1} H^\top C^{-1}y$
(see (23)) is inserted in place of $\beta$. We are thus considering the profile log
marginal likelihood
\begin{align}
\mathcal{L}(\theta)
&= -\frac{n}{2} \log(2\pi) -\frac{1}{2} \log \text{det}\left(C \right) -
\frac{1}{2} (y - H\hat{\beta})^\top C^{-1} (y - H\hat{\beta}).
\end{align}
We will derive a numerically stable implementation of this expression in two steps,
first applying a Cholesky decomposition (as in the previous section), and then
leveraging a QR decomposition as in a typical ordinary least squares (OLS)
computation. We first write $\hat{\beta}$ in terms of the Cholesky factor $L$,
where $C = LL^\top$:
\begin{align}
\hat{\beta}
&= \left(H^\top C^{-1} H\right)^{-1} H^\top C^{-1}y \newline
&= \left(H^\top \left[LL^\top\right]^{-1} H\right)^{-1} H^\top \left[LL^\top\right]^{-1}y \newline
&= \left(\left[L^{-1}H \right]^\top \left[L^{-1}H \right] \right)^{-1} \left[L^{-1}H \right]^\top
\left[L^{-1}y\right].
\end{align}
Notice that the GLS computation boils down to two lower-triangular linear solves:
$L^{-1}H$ and $L^{-1}y$. However, the above expression still requires one non-triangular
linear solve that we will now address via the QR decomposition. The above expression
for $\hat{\beta}$ can be viewed as a standard OLS estimator with design matrix
$L^{-1}H$ and response vector $L^{-1}y$. At this point, we could adopt a standard
OLS technique of taking the QR decomposition of the design matrix $L^{-1}H$.
This was my original thought, but I found a nice alternative looking through the
code in the R kergp
package (see the function .logLikFun0
in the file kergp/R/logLikFuns.R
). The approach
is to compute the QR decomposition
\begin{align}
\begin{bmatrix} L^{-1}H & L^{-1}y \end{bmatrix} &= QR = Q \begin{bmatrix} \tilde{R} & r \end{bmatrix}.
\end{align}
That is, we compute QR on the matrix formed by concatenating $L^{-1}y$ as an additional
column on the design matrix $L^{-1}H$. We have written the upper triangular matrix
$R \in \mathbb{R}^{(p+1) \times (p+1)}$ as the concatenation of
$\tilde{R} \in \mathbb{R}^{(p+1) \times p}$ and the vector
$r \in \mathbb{R}^{p+1}$ so that $L^{-1}H = Q\tilde{R}$ and $L^{-1}y = Qr$.
We recall the basic properties of the QR decomposition: $R$ is upper triangular
and invertible, and $Q$ has orthonormal columns with span equal to the column space
of $\begin{bmatrix} L^{-1}H & L^{-1}y \end{bmatrix}$. Taking the QR decomposition
of this concatenated matrix leads to a very nice expression for the quadratic
form term of the profile log marginal likelihood. But first let’s rewrite $\hat{\beta}$
in terms of these QR factors:
\begin{align}
\hat{\beta}
&= \left(\left[L^{-1}H \right]^\top \left[L^{-1}H \right] \right)^{-1} \left[L^{-1}H \right]^\top
\left[L^{-1}y\right] \newline
&= \left(\left[Q\tilde{R} \right]^\top \left[Q\tilde{R} \right] \right)^{-1} \left[Q\tilde{R} \right]^\top \left[Qr\right] \newline
&= \left(\tilde{R}^\top Q^\top Q\tilde{R} \right)^{-1} \tilde{R}^\top Q^\top Qr \newline
&= \left(\tilde{R}^\top \tilde{R} \right)^{-1} \tilde{R}^\top r,
\end{align}
where we have used the fact that $Q^\top Q$ is the identity since $Q$ has orthonormal columns.
Plugging this into the quadratic form term of the log likelihood gives
\begin{align}
(y - H\hat{\beta})^\top C^{-1} (y - H\hat{\beta})
&= (y - H\hat{\beta})^\top \left[LL^\top \right]^{-1} (y - H\hat{\beta}) \newline
&= \lVert L^{-1}(y - H\hat{\beta}) \rVert_2^2 \newline
&= \lVert L^{-1}y - L^{-1}H\hat{\beta} \rVert_2^2 \newline
&= \lVert Qr - Q\tilde{R} \hat{\beta} \rVert_2^2 \newline
&= \left\lVert Qr - Q\tilde{R} \left(\tilde{R}^\top \tilde{R} \right)^{-1} \tilde{R}^\top r \right\rVert_2^2 \newline
&= \left\lVert Q\left[r - \tilde{R} \left(\tilde{R}^\top \tilde{R} \right)^{-1} \tilde{R}^\top r \right] \right\rVert_2^2 \newline
&= \left\lVert r - \tilde{R} \left(\tilde{R}^\top \tilde{R} \right)^{-1} \tilde{R}^\top r \right\rVert_2^2 \newline
&= \left\lVert \left[I - \tilde{R} \left(\tilde{R}^\top \tilde{R} \right)^{-1} \tilde{R}^\top \right]r \right\rVert_2^2,
\end{align}
where the penultimate line follows from the fact that $Q$ has orthonormal columns, and hence
preserves the Euclidean norm. At this point, notice that the matrix
$P := \tilde{R} \left(\tilde{R}^\top \tilde{R} \right)^{-1} \tilde{R}^\top$ is the standard OLS projection matrix (i.e., hat matrix)
constructed with the design matrix $\tilde{R}$. Also, take care to notice that
$\tilde{R}$ is not invertible (it is not even square). Using standard properties
of the projection matrix, we know that $P$ has rank $p$, since $\tilde{R}$ has rank $p$.
Also, since $R$ is upper triangular, then the last row of $\tilde{R}$ contains all zeros.
Letting, $e_j$ denote the $j^{\text{th}}$ standard basis vector of $\mathbb{R}^{p+1}$,
this means that
\begin{align}
\mathcal{R}(P) \perp \text{span}(e_{p+1}),
\end{align}
where $\mathcal{R}(P)$ denotes the range (i.e., column space) of $P$.
The only $p$-dimensional subspace of $\mathbb{R}^{p+1}$ satisfying this property
is $\text{span}(e_1, \dots, e_p)$. The conclusion is that $P$ projects onto $\text{span}(e_1, \dots, e_p)$, and thus the annihilator $I - P$ projects onto
the orthogonal complement $\text{span}(e_{p+1})$. We thus conclude,
\begin{align}
\left\lVert \left[I - P\right]r \right\rVert_2^2
&= \lVert \langle r, e_{p+1} \rangle e_{p+1} \rVert_2^2 \newline
&= \lVert r_{p+1} e_{p+1} \rVert_2^2 \newline
&= r_{p+1}^2,
\end{align}
where $r_{p+1}$ is the last entry of $r$; i.e., the bottom right entry of $R$.
We finally arrive at the expression for the concentrated log marginal likelihood
\begin{align}
\mathcal{L}(\theta)
&= -\frac{n}{2} \log(2\pi) - \sum_{i=1}^{n} \log\left(L_{ii}\right) -
\frac{1}{2} r_{p+1}^2.
\end{align}
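Putting the pieces together, here is a sketch of the full computation combining the Cholesky and QR steps. It assumes H has full column rank and returns both the profile log marginal likelihood and the GLS estimate of beta as a by-product; it is a direct transcription of the derivation above, not the kergp implementation itself.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, qr

def profile_log_marginal_likelihood(y, H, C):
    """Profile log marginal likelihood with beta concentrated out, via Cholesky + QR."""
    n, p = H.shape
    L = cholesky(C, lower=True)                              # C = L L^T
    H_w = solve_triangular(L, H, lower=True)                 # L^{-1} H
    y_w = solve_triangular(L, y, lower=True)                 # L^{-1} y
    Q, R = qr(np.column_stack([H_w, y_w]), mode="economic")  # [L^{-1}H, L^{-1}y] = Q R
    R_tilde, r = R[:, :p], R[:, p]                           # R = [R_tilde, r]
    beta_hat = solve_triangular(R_tilde[:p, :], r[:p])       # beta_hat = R_1^{-1} r_1
    quad = r[p] ** 2                                         # quadratic form = r_{p+1}^2
    log_lik = (-0.5 * n * np.log(2.0 * np.pi)
               - np.sum(np.log(np.diag(L)))                  # -(1/2) log det(C)
               - 0.5 * quad)
    return log_lik, beta_hat
```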
References
- Surrogates (Gramacy)
- Statistics or geostatistics? Sampling error or nugget effect? (Clark)
- Michael Betancourt’s very nice post on setting priors on GP hyperparameters.
- Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R (nice discussion of setting ranges for hyperparameters in the appendix; see the laGP function darg).