Gaussian Kernel Hyperparameters
Interpreting and optimizing the lengthscale and marginal variance parameters of a Gaussian covariance function.
In this post we discuss the interpretation of the hyperparameters defining
a Gaussian covariance function. We then consider some practical considerations
to keep in mind when learning the values of these parameters. In particular, we
focus on setting bounds and initialization values to avoid
common pathologies in hyperparameter optimization. The typical application
is hyperparameter estimation for Gaussian process models. It is important
to emphasize that ideally one would define informative priors on hyperparameters
and cast hyperparameter optimization within a Bayesian inferential framework.
Even if one is only optimizing hyperparameters, prior distributions offer a
principled means of regularizing the optimization problem. In this post, we
focus only on bound constraints for hyperparameters, which can be thought
of equivalently as uniform priors. In addition, we consider the setting where
little prior knowledge is available, so that the main purpose of the bounds
is to avoid pathological behavior in the optimization. We discuss heuristic
methods for empirically deriving reasonable default bounds based on the
training data. Deriving the bounds from the training data also implies that the
process can be automated, which may be useful in certain contexts.
In general, I would typically recommend considering priors
other than uniform distributions. However, I think this is still a useful exercise
to think through. Moreover, many of the more classically-oriented Gaussian process
packages (e.g., kergp and hetGP in R) only offer hyperparameter regularization
via bound constraints. For a discussion on non-uniform priors and full Bayesian
inference for Gaussian process hyperparameters, see Michael Betancourt’s
excellent posts here
and here.
The Gaussian covariance function
One-Dimensional Input Space
In one dimension, the Gaussian covariance function is a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined by
\begin{align}
k(x, x^\prime) := \sigma^2 \exp\left[-\left(\frac{x - x^\prime}{\ell}\right)^2\right]. \tag{1}
\end{align}
We will refer to the hyperparameters $\sigma^2 > 0$ and $\ell > 0$ as the marginal variance and lengthscale, respectively. Notice that the range of $k$ is the interval $(0, \sigma^2]$. In interpreting (1), we will find it useful to think of a function $f$ satisfying
\begin{align}
k(x, x^\prime) = \mathrm{Cov}\left[f(x), f(x^\prime)\right]. \tag{2}
\end{align}
That is, $k$ describes the covariance between the outputs $f(x)$ and $f(x^\prime)$ as a function of the inputs $x$ and $x^\prime$. The relation (2) commonly arises when assuming that $f$ is a Gaussian process with covariance function (i.e., kernel) $k$. Under this interpretation, note that
\begin{align}
\mathrm{Var}\left[f(x)\right] = \mathrm{Cov}\left[f(x), f(x)\right] = k(x, x) = \sigma^2, \tag{3}
\end{align}
which justifies the terminology “marginal variance”. We also note that (1) is isotropic, meaning that $k(x, x^\prime)$ depends only on the distance $|x - x^\prime|$. With respect to (2), this means that the covariance between the function values $f(x)$ and $f(x^\prime)$ decays as a function of $|x - x^\prime|$, regardless of the particular values of $x$ and $x^\prime$. We will therefore find it useful to abuse notation by writing $k(d)$, where $d := |x - x^\prime|$. To emphasize, when $k$ is written with only a single argument, the argument should be interpreted as a distance.
It is often useful to work with correlations instead of covariances, as they are scale-free. The Gaussian correlation function is given by
\begin{align}
\rho(x, x^\prime) := \exp\left[-\left(\frac{x - x^\prime}{\ell}\right)^2\right], \tag{4}
\end{align}
so that $k(x, x^\prime) = \sigma^2 \rho(x, x^\prime)$. Note also that
\begin{align}
\rho(x, x^\prime) = \frac{k(x, x^\prime)}{\sqrt{k(x, x)\, k(x^\prime, x^\prime)}} = \mathrm{Cor}\left[f(x), f(x^\prime)\right]. \tag{5}
\end{align}
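To make these definitions concrete, here is a minimal sketch of (1) and (4) in Python (using NumPy; the function names are my own, not taken from any particular GP package):

```python
import numpy as np

def gaussian_cov(x, x_prime, marginal_variance=1.0, lengthscale=1.0):
    """Gaussian covariance (1): sigma^2 * exp(-((x - x')/l)^2)."""
    d = np.abs(x - x_prime)
    return marginal_variance * np.exp(-(d / lengthscale) ** 2)

def gaussian_cor(x, x_prime, lengthscale=1.0):
    """Gaussian correlation (4): the covariance with sigma^2 = 1."""
    return gaussian_cov(x, x_prime, marginal_variance=1.0, lengthscale=lengthscale)

# Covariance decays with distance; at d = l the correlation is exp(-1) ~ 0.37.
print(gaussian_cor(0.0, 1.0, lengthscale=1.0))  # ~0.3679
```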
Gaussian Density Interpretation
Another way to think about this is that $k(x, x^\prime)$ is an unnormalized Gaussian density centered at $x^\prime$ and evaluated at $x$; i.e.,
\begin{align}
k(x, x^\prime) = \sigma^2 \sqrt{\pi \ell^2} \cdot \mathcal{N}\left(x \mid x^\prime, \tfrac{\ell^2}{2}\right), \tag{6}
\end{align}
where $\mathcal{N}(\cdot \mid m, s^2)$ denotes the Gaussian density with mean $m$ and variance $s^2$. The scaling $\sigma^2 \sqrt{\pi \ell^2}$ accounts for the fact that the density is normalized while (1) is not, and the variance is $\ell^2/2$ rather than $\ell^2$ because (1) is missing the typical factor of $1/2$ in the exponent of the Gaussian probability density (the $1/2$ is sometimes included in the Gaussian covariance parameterization to align it with the density). Under this interpretation, we can think of $x \mapsto k(x, x^\prime)$ as a Gaussian curve centered at $x^\prime$, describing the dependence between $f(x^\prime)$ and all other function values. The fact that (1) only depends on its inputs through $d = |x - x^\prime|$ means that the shape of this Gaussian is the same for all $x^\prime$; considering different $x^\prime$ simply shifts the location of the curve. We can make this more explicit by considering the correlation function to be a function only of distance:
\begin{align}
\rho(d) = \exp\left[-\left(\frac{d}{\ell}\right)^2\right]. \tag{7}
\end{align}
This is another way of viewing the fact that the rate of correlation decay is the same across the whole input space, and depends only on the distance between points.
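As a quick sanity check of the identity (6), the following sketch (assuming NumPy and SciPy are available) verifies numerically that the covariance in (1) matches the scaled Gaussian density:

```python
import numpy as np
from scipy.stats import norm

sigma2, ell = 2.5, 0.8
x, x_prime = 0.3, 1.1

# Covariance as defined in (1).
k = sigma2 * np.exp(-((x - x_prime) / ell) ** 2)

# The same value written as a scaled Gaussian density, as in (6):
# sigma^2 * sqrt(pi * l^2) * N(x | x', l^2 / 2).
k_density = sigma2 * np.sqrt(np.pi * ell**2) * norm.pdf(x, loc=x_prime, scale=ell / np.sqrt(2))

print(np.isclose(k, k_density))  # True
```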
Generalizing to Multiple Dimensions
To generalize (1) to a $p$-dimensional input space $\mathcal{X} \subseteq \mathbb{R}^p$, we define $k$ using a product of one-dimensional correlation functions:
\begin{align}
k(x, x^\prime) := \sigma^2 \prod_{j=1}^{p} \exp\left[-\left(\frac{x_j - x^\prime_j}{\ell_j}\right)^2\right] = \sigma^2 \exp\left[-\sum_{j=1}^{p} \left(\frac{x_j - x^\prime_j}{\ell_j}\right)^2\right]. \tag{8}
\end{align}
The choice of a product form for the covariance implies that the covariance decays separately in each dimension, with the rate of decay in dimension $j$ controlled by the lengthscale $\ell_j$. There is no interaction across dimensions. In this $p$-dimensional setting, we see that $k$ is a function of the dimension-wise distances $d_j := |x_j - x^\prime_j|$. We can thus think of $k$ as a function $k(d_1, \dots, d_p)$ of the distances in each dimension.
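A short sketch of the product-form kernel (8), again with hypothetical helper names, shows how the dimension-wise lengthscales enter when assembling a covariance matrix:

```python
import numpy as np

def gaussian_cov_matrix(X, lengthscales, marginal_variance=1.0):
    """Covariance matrix for the product-form Gaussian kernel (8).

    X: (n, p) array of inputs; lengthscales: length-p array."""
    X = np.asarray(X, dtype=float)
    ell = np.asarray(lengthscales, dtype=float)
    # Squared dimension-wise distances, scaled by the per-dimension lengthscales.
    diff = (X[:, None, :] - X[None, :, :]) / ell  # shape (n, n, p)
    return marginal_variance * np.exp(-np.sum(diff**2, axis=-1))

# Example: 5 random points in 3 dimensions with different lengthscales per dimension.
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 3))
K = gaussian_cov_matrix(X, lengthscales=[0.5, 1.0, 2.0], marginal_variance=1.3)
print(np.allclose(np.diag(K), 1.3))  # marginal variance on the diagonal
```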
Interpreting the Hyperparameters
We now discuss the interpretation of the marginal variance and lengthscale hyperparameters. Although we focus on the Gaussian covariance function here, the marginal variance interpretation applies to any covariance function which implies a constant variance across the input space. The interpretation of the lengthscales is also relevant to other covariance functions that have similar hyperparameters (e.g., the Matérn kernel).
Lengthscales
In the introduction, we already described how (1) describes the correlation between $f(x)$ and $f(x^\prime)$ as a function of the distance $d = |x - x^\prime|$, but we have not yet explicitly described the role of $\ell$ in controlling the rate of correlation decay. Notice in (1) that increasing $\ell$ decreases the rate of decay of $k(d)$ as $d$ increases. In other words, when $\ell$ is larger, the function values $f(x)$ and $f(x^\prime)$ will retain higher correlation even when the inputs $x$ and $x^\prime$ are farther apart. On the other hand, as $\ell \to 0$, $f(x)$ and $f(x^\prime)$ will be approximately uncorrelated even when $x$ and $x^\prime$ are close. We will make the notion of “closeness” more precise shortly. The connection to the Gaussian density in (6) and (7) provides another perspective. Since the variance of the Gaussian in (6) is $\ell^2/2$, increasing $\ell$ means the Gaussian curve becomes more “spread out”, indicating more dependence across longer distances.
It is important to keep in mind that there are two different distances we’re
working with here: (1) the “physical” distance in the input space; and (2)
the correlation distance describing dependence in the output space. The
correlation/covariance functions relate these two notions of distance. The
lengthscale parameter $\ell$ has the same units as the inputs; i.e., it is
defined in the physical space. However, it controls correlation distance as
well. Since this is what we really care about, it is useful to interpret a
specific value of $\ell$ based on its effect on the rate of correlation decay.
To interpret a model with a specific value of $\ell$, one can
simply plot the curve $d \mapsto \rho(d)$ as $d$ is varied; i.e.,
consider a sequence of physical distances $0 < d_1 < d_2 < \dots < d_m$ and report
the values
\begin{align}
\rho(d_1), \rho(d_2), \dots, \rho(d_m). \tag{9}
\end{align}
In the common setting that the input space is a bounded interval $[a, b]$, then
we can choose $d_m := b - a$ (the largest possible distance), and space the rest of the points uniformly
throughout $(0, d_m]$. Another value that might be worth reporting is the
distance at which the correlation is essentially zero. The correlation (7)
is never exactly zero, but we might choose a small threshold $\epsilon > 0$ that
we deem to be negligible. Setting $\rho(d) = \epsilon$
and solving for $d$, we obtain
\begin{align}
d(\epsilon) = \ell \sqrt{\log(1/\epsilon)}, \tag{10}
\end{align}
a value that scales linearly in $\ell$. For example, if we set
$\epsilon := 10^{-6}$, then (10) becomes $d(\epsilon) \approx 3.7 \ell$. In words, this says
that the correlation becomes negligible when the physical distance is about
four times the lengthscale.
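The following sketch (helper names are my own) generates the correlation-decay report in (9) and the negligible-correlation distance in (10):

```python
import numpy as np

def correlation_decay(lengthscale, d_max, num=6):
    """Report the correlations rho(d_k) in (9) on a uniform grid of distances."""
    distances = np.linspace(d_max / num, d_max, num)
    correlations = np.exp(-(distances / lengthscale) ** 2)
    return distances, correlations

def negligible_distance(lengthscale, eps=1e-6):
    """Distance at which the correlation drops to eps, as in (10)."""
    return lengthscale * np.sqrt(np.log(1.0 / eps))

# Example: input space [0, 10] with lengthscale 2.
d, rho = correlation_decay(lengthscale=2.0, d_max=10.0)
print(np.round(rho, 3))          # correlation at each reported distance
print(negligible_distance(2.0))  # ~7.4, i.e. about 3.7 lengthscales
```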
To generalize these notions to the product covariance in (8), we can consider producing a correlation decay plot for each dimension separately; i.e., we can generate the values in (9) for each dimension $j = 1, \dots, p$, using the lengthscale $\ell_j$ and the dimension-wise distances. While this gives us information about the correlation decay in each dimension, we must keep in mind that the decay in $\mathbb{R}^p$ is controlled by the product of the dimension-wise correlations $\rho_1(d_1) \cdots \rho_p(d_p)$.
Marginal Variance
The marginal variance $\sigma^2$ is more straightforward to interpret. As we established in (3), $\sigma^2$ is the variance of $f(x)$, which is constant across all values of $x$. Thus, $\sigma^2$ provides information on the scale of the function. Suppose that $f$ is a Gaussian process with covariance function $k$ and a constant mean $\mu$. This implies that $f(x) \sim \mathcal{N}(\mu, \sigma^2)$ for any $x$. Thus,
\begin{align}
\mathbb{P}\left[\mu - c\sigma \leq f(x) \leq \mu + c\sigma\right] = \Phi(c) - \Phi(-c) = 2\Phi(c) - 1, \tag{11}
\end{align}
where $\Phi$ is the standard normal distribution function. In particular,
\begin{align}
\mathbb{P}\left[\mu - 2\sigma \leq f(x) \leq \mu + 2\sigma\right] = 2\Phi(2) - 1 \approx 0.95, \tag{12}
\end{align}
meaning that values of $f(x)$ falling outside of the interval $[\mu - 2\sigma, \mu + 2\sigma]$ are unlikely.
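A quick numerical check of (11) and (12), assuming SciPy is available:

```python
from scipy.stats import norm

def interval_probability(c):
    """Probability that f(x) falls within c standard deviations of its mean, per (11)."""
    return 2 * norm.cdf(c) - 1

print(interval_probability(2.0))  # ~0.954, the familiar two-sigma rule in (12)
```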
Bounds and Defaults
We now consider setting reasonable bounds and default values for the lengthscale and marginal variance. The motivation here is primarily estimating the covariance hyperparameters of Gaussian process models via maximum marginal likelihood. In general, it is often better to regularize the optimization by specifying prior distributions on the hyperparameters, or to go fully Bayesian and sample from the posterior distribution over the hyperparameters. We won’t go into such topics here. Throughout this section, suppose that we have observed $n$ input-output pairs $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $y^{(i)} := f(x^{(i)})$ for some function of interest $f$. For now we assume that these observations are noiseless, and $f$ is modeled as a Gaussian process with covariance function $k$. Our intention here is not to go into depth on methods for estimating Gaussian process hyperparameters; see this post for more details on that topic. Instead, we will describe some practical considerations for avoiding pathological behavior when estimating the hyperparameters of the covariance function.
Lengthscales
One Dimensional
We start by considering bounds for the lengthscale parameter $\ell$. I should note at the outset that these bounds should be informed by domain knowledge, if it is available. We will consider the case where there is no prior knowledge, and instead consider empirical approaches based on the training data. In general, $\ell$ is only constrained to lie in the interval $(0, \infty)$. However, note that learning the value of $\ell$ from the training data relies on the consideration of how $f$ varies as a function of the pairwise distances between the training inputs $x^{(i)}$. Thus, there is no information in the data to inform lengthscale values that are below the minimum, or above the maximum, observed pairwise distances. Let $\mathcal{D} := \{|x^{(i)} - x^{(k)}| : 1 \leq k < i \leq n\}$ denote the set of pairwise distances constructed from the training inputs $x^{(1)}, \dots, x^{(n)}$. There are $n(n-1)/2$ such distances, corresponding to the number of entries in the lower triangle (excluding the diagonal) of an $n \times n$ matrix. Thus, a reasonable first step is to enforce the constraint
\begin{align}
d_{\text{min}} \leq \ell \leq d_{\text{max}}, \tag{13}
\end{align}
where $d_{\text{min}}$ and $d_{\text{max}}$ denote the minimum and maximum of $\mathcal{D}$, respectively. Without this constraint, an optimizer can sometimes get stuck in local minima at very small or large lengthscales, which can lead to pathological Gaussian process fits. As mentioned previously, it is typically best to think in terms of correlations, so let’s consider the implications of the choice (13) from this viewpoint. If $\ell = d_{\text{min}}$, the correlation at the minimum observed pairwise distance is given by
\begin{align}
\rho(d_{\text{min}}) = \exp\left[-\left(\frac{d_{\text{min}}}{d_{\text{min}}}\right)^2\right] = e^{-1} \approx 0.37. \tag{14}
\end{align}
Thus, the lower bound in (13) implies that we are constraining the lengthscale to satisfy
\begin{align}
\rho(d_{\text{min}}) \geq e^{-1} \approx 0.37. \tag{15}
\end{align}
We can generalize this idea and instead set the lower bound $\ell_{\text{min}}$ to be the value that achieves the constraint
\begin{align}
\rho(d_{\text{min}}) = \exp\left[-\left(\frac{d_{\text{min}}}{\ell_{\text{min}}}\right)^2\right] = \rho_{\text{min}} \tag{16}
\end{align}
for some value $\rho_{\text{min}} \in (0, 1)$ we are free to specify. Solving (16) for $\ell_{\text{min}}$, we obtain
\begin{align}
\ell_{\text{min}} = \frac{d_{\text{min}}}{\sqrt{\log(1/\rho_{\text{min}})}}. \tag{17}
\end{align}
Note that setting $\rho_{\text{min}} < e^{-1}$ loosens the bound, relative to (13), and vice versa; that is, \begin{align} &\rho_{\text{min}} < 0.37 \implies \ell_{\text{min}} < d_{\text{min}} \newline &\rho_{\text{min}} > 0.37 \implies \ell_{\text{min}} > d_{\text{min}} \end{align} We can generalize this even further by replacing $d_{\text{min}}$ and $d_{\text{max}}$ by empirical quantiles of $\mathcal{D}$. For example, we might consider enforcing (16) with the minimum replaced by a low percentile of the observed pairwise distances. Making this replacement without changing $\rho_{\text{min}}$ will result in a more restrictive bound; i.e., a larger value for $\ell_{\text{min}}$.
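Putting the one-dimensional recipe together, a small sketch (function name mine) computes the empirical bounds in (13) and (17) from a set of training inputs:

```python
import numpy as np

def lengthscale_bounds_1d(x, rho_min=np.exp(-1)):
    """Empirical lengthscale bounds for 1D inputs.

    Lower bound chosen so the correlation at the minimum observed pairwise
    distance equals rho_min, as in (16)-(17); upper bound is the maximum
    observed pairwise distance, as in (13)."""
    x = np.asarray(x, dtype=float)
    dists = np.abs(x[:, None] - x[None, :])[np.triu_indices(len(x), k=1)]
    d_min, d_max = dists.min(), dists.max()
    ell_min = d_min / np.sqrt(np.log(1.0 / rho_min))
    return ell_min, d_max

# Example: with the default rho_min = exp(-1), the lower bound is just d_min.
x_train = np.array([0.1, 0.4, 0.55, 0.9, 1.7])
print(lengthscale_bounds_1d(x_train))
```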
Multiple Input Dimensions
We consider two different approaches to generalize the above procedure to $p$-dimensional input spaces.
Dimension-by-Dimension
In considering bounds for the lengthscales $\ell_1, \dots, \ell_p$ in the product correlation (8), one approach is to simply repeat the above procedure independently for each input dimension. For this portion of the post, we will use capital $D$ for distances in the multivariate space $\mathbb{R}^p$, and lower case $d_j$ for univariate distances with respect to the $j$-th dimension. In other words, we consider the pairwise distances in the $j$-th dimension, where each distance is of the form $d_j = |x^{(i)}_j - x^{(k)}_j|$ for some pair $i \neq k$. The bounds for the parameter $\ell_j$ can then be constructed as described above with respect to these pairwise distances. Let $d_j^{\text{min}}$ be the minimum (or some other low quantile) of the $j$-th set of pairwise distances. This procedure then implies the dimension-by-dimension constraint
\begin{align}
\rho_j(d_j^{\text{min}}) \geq \rho_{\text{min}}, \qquad j = 1, \dots, p, \tag{18}
\end{align}
where $\rho_j(d) := \exp[-(d/\ell_j)^2]$ is the one-dimensional Gaussian correlation in dimension $j$, which in turn implies
\begin{align}
\prod_{j=1}^{p} \rho_j(d_j^{\text{min}}) \geq \rho_{\text{min}}^p. \tag{19}
\end{align}
For example, if we choose $\rho_{\text{min}} = 0.5$ in $p = 10$ dimensions, then $\rho_{\text{min}}^p \approx 0.001$. It is important to note that (19) does not provide a direct constraint on the minimum correlation at the distance $D_{\text{min}}$, the smallest pairwise distance in $\mathbb{R}^p$. There is a distinction between the pairs of points that are closest in each input dimension separately, and the pair that is closest in $\mathbb{R}^p$. The distances $d_j^{\text{min}}$ in (18) may be coming from different pairs of inputs, not the particular pair of inputs that is closest in $\mathbb{R}^p$.
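A sketch of the dimension-by-dimension approach (helper name mine), applying (17) separately to the pairwise distances in each input dimension:

```python
import numpy as np

def lengthscale_lower_bounds_by_dim(X, rho_min=np.exp(-1)):
    """Dimension-by-dimension lower bounds: apply (17) separately to the
    pairwise distances in each input dimension, as in (18)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    iu = np.triu_indices(n, k=1)
    bounds = np.empty(p)
    for j in range(p):
        d_j = np.abs(X[:, j, None] - X[None, :, j])[iu]  # pairwise distances in dim j
        bounds[j] = d_j.min() / np.sqrt(np.log(1.0 / rho_min))
    return bounds

rng = np.random.default_rng(1)
X = rng.uniform(size=(20, 3))
print(lengthscale_lower_bounds_by_dim(X))
```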
Multivariate Constraint
We might instead choose to constrain the correlation at the distance $D_{\text{min}}$, which means we are now considering Euclidean distance in $\mathbb{R}^p$, rather than approaching the problem one dimension at a time. For the lower bound, this means enforcing the constraint
\begin{align}
\rho(x^{(i)}, x^{(k)}) \geq \rho_{\text{min}}, \tag{20}
\end{align}
where $x^{(i)}$ and $x^{(k)}$ are a pair of training inputs satisfying $\lVert x^{(i)} - x^{(k)} \rVert = D_{\text{min}}$; i.e., the two closest points. In order to choose the lower bounds $\ell^{\text{min}}_1, \dots, \ell^{\text{min}}_p$, a reasonable approach is to assume they scale linearly with the scale of the respective input dimension. Equivalently, we can consider scaling the inputs so that $\tilde{x} \in [0, 1]^p$. In particular, for each dimension we re-scale as
\begin{align}
\tilde{x}_j := \frac{x_j - m_j}{M_j - m_j}, \tag{21}
\end{align}
where we have defined \begin{align} &m_j := \min_{i} x^{(i)}_j, &&M_j := \max_{i} x^{(i)}_j. \tag{22} \end{align} We’ll use tildes to denote quantities pertaining to the scaled space, and also stop writing the “min” subscript on the lengthscale bounds for succinctness. Note that this scaling implies that univariate distances scale as
\begin{align}
\tilde{d}_j = \frac{d_j}{M_j - m_j}. \tag{23}
\end{align}
Since lengthscales correspond to distances, we have
\begin{align}
\tilde{\ell}_j = \frac{\ell_j}{M_j - m_j}. \tag{24}
\end{align}
Note that these scalings have no effect on the correlations, since
\begin{align}
\exp\left[-\left(\frac{\tilde{d}_j}{\tilde{\ell}_j}\right)^2\right] = \exp\left[-\left(\frac{d_j}{\ell_j}\right)^2\right], \tag{25}
\end{align}
given that the factors $M_j - m_j$ cancel in the ratio. If we consider an isotropic lower bound in the scaled space (i.e., $\tilde{\ell}_1 = \dots = \tilde{\ell}_p =: \tilde{\ell}$), then this is equivalent to assuming that the lengthscale bounds scale linearly with the input ranges,
\begin{align}
\ell_j = \tilde{\ell} \cdot (M_j - m_j). \tag{26}
\end{align}
If we enforce the constraint
\begin{align}
\exp\left[-\left(\frac{\tilde{D}_{\text{min}}}{\tilde{\ell}}\right)^2\right] \geq \rho_{\text{min}} \tag{27}
\end{align}
in the scaled space, where $\tilde{D}_{\text{min}}$ denotes the minimum pairwise Euclidean distance among the scaled training inputs, then we see from (25) and (26) that this encodes the constraint (20), as desired. Solving (27) with equality gives $\tilde{\ell} = \tilde{D}_{\text{min}} / \sqrt{\log(1/\rho_{\text{min}})}$, which is then mapped back to per-dimension bounds via (26).
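A sketch of the multivariate approach (helper name mine, assuming SciPy’s pdist for the pairwise Euclidean distances): scale the inputs to the unit cube per (21)-(22), bound the common scaled lengthscale via (27), and map back through (26):

```python
import numpy as np
from scipy.spatial.distance import pdist

def lengthscale_lower_bounds_multivariate(X, rho_min=np.exp(-1)):
    """Lower bounds via the multivariate constraint: scale each input
    dimension to [0, 1] as in (21)-(22), bound the isotropic scaled
    lengthscale using the minimum scaled Euclidean distance as in (27),
    then map back to per-dimension bounds via (26)."""
    X = np.asarray(X, dtype=float)
    m, M = X.min(axis=0), X.max(axis=0)
    X_scaled = (X - m) / (M - m)
    D_min = pdist(X_scaled).min()                # closest pair in scaled space
    ell_scaled = D_min / np.sqrt(np.log(1.0 / rho_min))
    return ell_scaled * (M - m)                  # per-dimension lower bounds

rng = np.random.default_rng(2)
X = rng.uniform(low=[0.0, -5.0], high=[1.0, 5.0], size=(25, 2))
print(lengthscale_lower_bounds_multivariate(X))
```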
Marginal Variance
We now consider the definition of reasonable default bounds for the marginal variance parameter $\sigma^2$. Throughout this section, we assume that $f$ is a zero-mean Gaussian process with a Gaussian covariance function. A reasonable approach here is to cap $\sigma^2$ so that most of the prior probability mass falls within the observed range of the data. As always, such a heuristic could present challenges in generalizing beyond the training data, and domain knowledge should take precedence in constraining the value of $\sigma^2$.
To make precise the notion of placing most prior mass over the observed data range, let $y_{\text{max}} := \max_{i} |y^{(i)}|$. Then we can define the upper bound $\sigma_{\text{max}}$ as the value of $\sigma$ implying
\begin{align}
\mathbb{P}\left[|f(x)| \leq y_{\text{max}}\right] = \alpha \tag{28}
\end{align}
for some (large) probability $\alpha$ (e.g., $\alpha = 0.99$). This constraint excludes values $\sigma > \sigma_{\text{max}}$, which would lead to priors that place more probability on values of $f(x)$ exceeding $y_{\text{max}}$ in magnitude. The use of the extreme value $y_{\text{max}}$ might lead to some concerns around robustness, since a single extreme observation could exert significant influence on the bound $\sigma_{\text{max}}$. On the flip side, we also don’t want to simply ignore the larger values and set $\sigma_{\text{max}}$ restrictively low. Similar to the lengthscale bound approach, one option is to replace $y_{\text{max}}$ with a different empirical quantile of the $|y^{(i)}|$, and pair this with a slightly lower value of $\alpha$.
In any case, for a chosen quantile $y_{\text{max}}$ and probability $\alpha$, we can solve the expression (28) for $\sigma$ to obtain an expression for the bound $\sigma_{\text{max}}$. Noting that $f(x)/\sigma \sim \mathcal{N}(0, 1)$, we see that (28) can equivalently be written as
\begin{align}
\mathbb{P}\left[|Z| \leq \frac{y_{\text{max}}}{\sigma}\right] = \alpha, \tag{29}
\end{align}
where $Z \sim \mathcal{N}(0, 1)$. Since Gaussians are symmetric, we can write (29) as
\begin{align}
2\Phi\left(\frac{y_{\text{max}}}{\sigma}\right) - 1 = \alpha. \tag{30}
\end{align}
Solving (30) for $\sigma$ then gives
\begin{align}
\sigma_{\text{max}} = \frac{y_{\text{max}}}{\Phi^{-1}\left(\frac{1 + \alpha}{2}\right)}, \tag{31}
\end{align}
recalling that $\Phi$ denotes the standard normal distribution function.
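Finally, a sketch (function name mine) of the marginal variance bound (31):

```python
import numpy as np
from scipy.stats import norm

def marginal_variance_upper_bound(y, alpha=0.99):
    """Upper bound on the marginal standard deviation sigma, per (31):
    chosen so that a zero-mean Gaussian prior puts probability alpha on
    |f(x)| <= max_i |y_i|."""
    y_max = np.max(np.abs(y))
    sigma_max = y_max / norm.ppf((1.0 + alpha) / 2.0)
    return sigma_max, sigma_max**2  # bound on sigma and on sigma^2

y_train = np.array([-0.4, 1.2, 0.7, -2.1, 0.3])
print(marginal_variance_upper_bound(y_train, alpha=0.99))
```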