Gaussian Kernel Hyperparameters

Interpreting and optimizing the lengthscale and marginal variance parameters of a Gaussian covariance function.

In this post we discuss the interpretation of the hyperparameters defining a Gaussian covariance function. We then turn to practical considerations to keep in mind when learning the values of these parameters, focusing in particular on setting bounds and initialization values to avoid common pathologies in hyperparameter optimization. The typical application is hyperparameter estimation for Gaussian process models. It is important to emphasize that, ideally, one would define informative priors on the hyperparameters and cast hyperparameter optimization within a Bayesian inferential framework. Even if one is only optimizing hyperparameters, prior distributions offer a principled means of regularizing the optimization problem. In this post, we focus only on bound constraints for hyperparameters, which can be thought of equivalently as uniform priors. In addition, we consider the setting where little prior knowledge is available, so that the main purpose of the bounds is to avoid pathological behavior in the optimization. We discuss heuristic methods for empirically deriving reasonable default bounds from the training data. Deriving the bounds from the training data also implies that the process can be automated, which may be useful in certain contexts. In general, I would typically recommend considering priors other than uniform distributions. However, I think this is still a useful exercise to think through. Moreover, many of the more classically-oriented Gaussian process packages (e.g., kergp and hetGP in R) only offer hyperparameter regularization via bound constraints. For a discussion of non-uniform priors and full Bayesian inference for Gaussian process hyperparameters, see Michael Betancourt's excellent posts here and here.

The Gaussian covariance function

One-Dimensional Input Space

In one dimension, the Gaussian covariance function is a function $k: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ defined by
$$
k(x,z) := \alpha^2 \exp\left[-\left(\frac{x-z}{\ell}\right)^2 \right]. \tag{1}
$$
We will refer to the hyperparameters $\alpha^2$ and $\ell$ as the marginal variance and lengthscale, respectively. Notice that the range of $k(\cdot, \cdot)$ is the interval $(0, \alpha^2]$. In interpreting (1), we will find it useful to think of a function $f: \mathbb{R} \to \mathbb{R}$ satisfying
$$
\text{Cov}[f(x), f(z)] = k(x,z). \tag{2}
$$
That is, $k(x,z)$ describes the covariance between the outputs $f(x)$ and $f(z)$ as a function of the inputs $x$ and $z$. The relation (2) commonly arises when assuming that $f$ is a Gaussian process with covariance function (i.e., kernel) $k(\cdot, \cdot)$. Under this interpretation, note that
$$
\text{Var}[f(x)] = k(x,x) = \alpha^2, \tag{3}
$$
which justifies the terminology "marginal variance". We also note that (1) is isotropic, meaning that $k(x,z)$ depends only on the distance $\lvert x-z \rvert$. With respect to (2), this means that the covariance between the function values $f(x)$ and $f(z)$ decays as a function of $\lvert x-z \rvert$, regardless of the particular values of $x$ and $z$. We will therefore find it useful to abuse notation by writing $k(d) = k(x,z)$, where $d = \lvert x-z \rvert$. To emphasize, when $k(\cdot)$ is written with only a single argument, the argument should be interpreted as a distance.
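
As a concrete reference, here is a minimal sketch of (1) in Python; the function name and the use of NumPy are my own choices for illustration, not something from a particular package.

```python
import numpy as np

def gaussian_cov(x, z, lengthscale, marginal_var):
    """Gaussian covariance (1): k(x, z) = alpha^2 * exp(-((x - z) / ell)^2)."""
    d = np.abs(np.asarray(x) - np.asarray(z))
    return marginal_var * np.exp(-(d / lengthscale) ** 2)

# Covariance decays with distance, and k(x, x) equals the marginal variance alpha^2.
print(gaussian_cov(0.0, 0.0, lengthscale=1.0, marginal_var=2.0))  # 2.0
print(gaussian_cov(0.0, 1.5, lengthscale=1.0, marginal_var=2.0))  # 2 * exp(-2.25) ~ 0.21
```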

It is often useful to work with correlations instead of covariances, as they are scale-free. The Gaussian correlation function is given by
$$
\rho(x,z) := \exp\left[-\left(\frac{x-z}{\ell}\right)^2 \right], \tag{4}
$$
so that $k(x,z) = \alpha^2 \rho(x,z)$. Note also that
$$
\text{Cor}[f(x),f(z)] = \frac{\text{Cov}[f(x),f(z)]}{\sqrt{\text{Var}[f(x)]\text{Var}[f(z)]}} = \frac{k(x,z)}{k(0)} = \rho(x,z). \tag{5}
$$

Gaussian Density Interpretation

Another way to think about this is that $\rho(x,z)$ is an unnormalized Gaussian density centered at $z$ and evaluated at $x$; i.e.,
$$
\rho(x,z) \propto \mathcal{N}(x|z, \ell^2/2). \tag{6}
$$
The $\ell^2/2$ scaling arises because (1) omits the factor of $1/2$ that appears in the exponent of the Gaussian probability density (the $1/2$ is sometimes included in the Gaussian covariance parameterization to align it with the density). Under this interpretation, we can think of $\rho(\cdot,z)$ as a Gaussian curve centered at $z$, describing the dependence between $f(z)$ and all other function values. The fact that (1) only depends on its inputs through $\lvert x-z\rvert$ means that the shape of this Gaussian is the same for all $z$; considering different $z$ simply shifts the location of the curve. We can make this more explicit by considering the correlation function to be a function only of distance:
$$
\rho(d) \propto \mathcal{N}(d|0, \ell^2/2). \tag{7}
$$
This is another way of viewing the fact that the rate of correlation decay is the same across the whole input space, and depends only on the distance between points.
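
To sanity-check (6) numerically, the snippet below (my own illustration, using SciPy's `norm`) confirms that $\rho(x,z)$ is proportional to a Gaussian density in $x$ with mean $z$ and variance $\ell^2/2$:

```python
import numpy as np
from scipy.stats import norm

ell, z = 0.7, 2.0
x = np.linspace(-1.0, 5.0, 7)

rho = np.exp(-((x - z) / ell) ** 2)                   # Gaussian correlation (4)
density = norm.pdf(x, loc=z, scale=ell / np.sqrt(2))  # N(x | z, ell^2 / 2)

# The ratio rho / density is constant in x (equal to ell * sqrt(pi)),
# confirming the proportionality claimed in (6).
print(rho / density)
```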

Generalizing to Multiple Dimensions

To generalize (1) to a $p$-dimensional input space, we define $k: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ using a product of one-dimensional correlation functions:
$$
k(x,z) := \alpha^2 \prod_{j=1}^{p} \rho(x_j,z_j;\ell_j) := \alpha^2 \prod_{j=1}^{p} \exp\left[-\left(\frac{x_j-z_j}{\ell_j}\right)^2 \right] = \alpha^2 \exp\left[-\sum_{j=1}^{p} \left(\frac{x_j - z_j}{\ell_j}\right)^2 \right]. \tag{8}
$$
The choice of a product form for the covariance implies that the covariance decays isotropically in each dimension, with the rate of decay in dimension $j$ controlled by $\ell_j$. There is no interaction across dimensions. In this $p$-dimensional setting, we see that $k(x,z)$ is a function of $\lvert x_1 - z_1\rvert, \dots, \lvert x_p - z_p\rvert$. We can thus think of $k$ as a function $k(d_1, \dots, d_p)$ of the distances $d_j = \lvert x_j - z_j \rvert$ in each dimension.
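
A sketch of the product form (8) for vector inputs, with one lengthscale per dimension (again purely illustrative; the names are my own):

```python
import numpy as np

def gaussian_cov_ard(x, z, lengthscales, marginal_var):
    """Product-form Gaussian covariance (8) with per-dimension lengthscales."""
    x, z, lengthscales = map(np.asarray, (x, z, lengthscales))
    sq_scaled_dists = ((x - z) / lengthscales) ** 2
    return marginal_var * np.exp(-np.sum(sq_scaled_dists))

# Example in p = 3 dimensions.
k_val = gaussian_cov_ard([0.0, 1.0, 2.0], [0.5, 1.0, 1.0],
                         lengthscales=[1.0, 2.0, 0.5], marginal_var=1.5)
print(k_val)  # 1.5 * exp(-(0.25 + 0.0 + 4.0))
```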

Interpreting the Hyperparameters

We now discuss the interpretation of the marginal variance and lengthscale hyperparameters. Although we focus on the Gaussian covariance function here, the marginal variance interpretation applies to any covariance function which implies a constant variance across the input space. The interpretation of the lengthscales is also relevant to other covariance functions that have similar hyperparameters (e.g., the Matérn kernel).

Lengthscales

In the introduction, we already described how (1) determines the correlation between $f(x)$ and $f(z)$ as a function of $\lvert x-z\rvert$, but we have not yet explicitly described the role of $\ell$ in controlling the rate of correlation decay. Notice in (1) that increasing $\ell$ decreases the rate of decay of $k(x,z)$ as $\lvert x-z \rvert$ increases. In other words, when $\ell$ is larger, the function values $f(x)$ and $f(z)$ retain higher correlation even when the inputs $x$ and $z$ are farther apart. On the other hand, as $\ell \to 0$, $f(x)$ and $f(z)$ become approximately uncorrelated even when $x$ and $z$ are close. We will make the notion of "closeness" more precise shortly. The connection to the Gaussian density in (6) and (7) provides another perspective. Since $\ell^2/2$ is the variance of the Gaussian, increasing $\ell$ means the Gaussian curve becomes more "spread out", indicating more dependence across longer distances.

It is important to keep in mind that there are two different distances we're working with here: (1) the "physical" distance in the input space; and (2) the correlation distance describing dependence in the output space. The correlation/covariance functions relate these two notions of distance. The lengthscale parameter $\ell$ has the same units as the inputs; i.e., it is defined in the physical space. However, it controls correlation distance as well. Since this is what we really care about, it is useful to interpret a specific value of $\ell$ based on its effect on the rate of correlation decay. To interpret a model with a specific value of $\ell$, one can simply plot the curve $\rho(d)$ as $d$ is varied; i.e., consider a sequence of physical distances $d_1, d_2, \dots, d_k$ and report the values
$$
\rho(d_1), \rho(d_2), \dots, \rho(d_k). \tag{9}
$$
In the common setting that the input space is a bounded interval $[a,b]$, we can choose $d_1 = 0$, $d_k = b - a$ (the largest possible distance) and space the remaining points uniformly in between. Another value that might be worth reporting is the distance $d$ at which the correlation is essentially zero. The correlation is never exactly zero, but we might choose a small threshold $\epsilon$ that we deem to be negligible. Setting
$$
\rho(d) = \exp\left[-\left(\frac{d}{\ell}\right)^2 \right] = \epsilon \tag{10}
$$
and solving for $d$, we obtain
$$
d = d(\ell) = \ell \sqrt{\log(1/\epsilon)}, \tag{11}
$$
a value that scales linearly in $\ell$. For example, if we set $\epsilon = 0.001$, then (11) gives $d \approx 2.6\ell$. In words, this says that the correlation becomes negligible when the physical distance is roughly two and a half times the lengthscale.
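
The decay report (9) and the negligible-correlation distance (11) are straightforward to compute. The following sketch (with hypothetical helper names) tabulates $\rho(d)$ over a grid of distances and solves for $d(\ell)$ at a threshold $\epsilon$:

```python
import numpy as np

def correlation_decay(ell, d_grid):
    """Evaluate the Gaussian correlation rho(d) = exp(-(d / ell)^2) over a grid of distances."""
    return np.exp(-(np.asarray(d_grid) / ell) ** 2)

def negligible_distance(ell, eps=0.001):
    """Distance at which the correlation drops to eps; see (11)."""
    return ell * np.sqrt(np.log(1.0 / eps))

ell = 0.4
d_grid = np.linspace(0.0, 2.0, 6)
print(np.round(correlation_decay(ell, d_grid), 3))  # the values in (9)
print(negligible_distance(ell))                     # ~ 2.63 * ell ~ 1.05
```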

To generalize these notions to the product covariance in (8), we can consider producing a correlation decay plot for each dimension separately; i.e., we can generate the values $\rho(d_i; \ell_j)$ in (9) for each $i=1,\dots,k$ and $j=1,\dots,p$. While this gives us information about the correlation decay in each dimension, we must keep in mind that the decay in $\rho(d_1, \dots, d_p)$ is controlled by the product of $\rho(d_1; \ell_1), \dots, \rho(d_p; \ell_p)$.
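
A small numeric illustration of this point (my own example): the per-dimension correlations may each look moderate, while their product, which is what actually governs the joint decay, is considerably smaller.

```python
import numpy as np

lengthscales = np.array([0.5, 1.0, 2.0])  # one lengthscale per dimension
dists = np.array([0.4, 0.9, 1.5])         # per-dimension distances d_1, ..., d_p

rho_per_dim = np.exp(-(dists / lengthscales) ** 2)
print(np.round(rho_per_dim, 3))           # per-dimension correlation decay
print(np.prod(rho_per_dim).round(3))      # overall correlation is the product (smaller)
```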

Marginal Variance

The marginal variance is more straightforward to interpret. As we established in (3), $\alpha^2$ is the variance of $f(x)$, which is constant across all values of $x$. Thus, $\alpha^2$ provides information on the scale of the function. Suppose that $f$ is a Gaussian process with covariance function $k(\cdot,\cdot)$ and a constant mean $m$. This implies that $f(x) \sim \mathcal{N}(m,\alpha^2)$ for any $x$. Thus,
$$
\mathbb{P}\left[\lvert f(x)-m\rvert \leq n\alpha \right] = \Phi(n) - \Phi(-n), \tag{12}
$$
where $\Phi$ is the standard normal distribution function. In particular,
$$
\mathbb{P}\left[\lvert f(x)-m\rvert \leq 2\alpha \right] \approx 0.95, \tag{13}
$$
meaning that values of $f(x)$ falling outside of the interval $m \pm 2\alpha$ are unlikely.
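
Equations (12) and (13) can be checked directly with SciPy's standard normal CDF (an illustrative check of the probabilities, not code from any particular package):

```python
from scipy.stats import norm

def coverage(n):
    """P[|f(x) - m| <= n * alpha] for a GP with constant mean and marginal variance alpha^2; see (12)."""
    return norm.cdf(n) - norm.cdf(-n)

print(coverage(1))  # ~ 0.683
print(coverage(2))  # ~ 0.954, matching (13)
print(coverage(3))  # ~ 0.997
```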

Bounds and Defaults

We now consider setting reasonable bounds and default values for the lengthscale and marginal variance. The primary motivation is estimating the covariance hyperparameters of a Gaussian process model via maximum marginal likelihood. In general, it is often better to regularize the optimization by specifying prior distributions on the hyperparameters, or to go fully Bayesian and sample from the posterior distribution over the hyperparameters. We won't go into such topics here. Throughout this section, suppose that we have observed $n$ input-output pairs
$$
y^{(i)} = f(x^{(i)}), \qquad i=1, \dots, n \tag{14}
$$
for some function of interest $f$. For now we assume that these observations are noiseless, and $f$ is modeled as a Gaussian process with covariance function $k(\cdot, \cdot)$. Our intention here is not to go into depth on methods for estimating Gaussian process hyperparameters; see this post for more details on that topic. Instead, we describe some practical considerations for avoiding pathological behavior when estimating the hyperparameters of the covariance function.

Lengthscales

One Dimensional

We start by considering bounds
$$
\ell \in [\ell_{\text{min}}, \ell_{\text{max}}] \tag{15}
$$
for the lengthscale parameter. I should start by noting that these bounds should be informed by domain knowledge, if it is available. We will consider the case where there is no prior knowledge, and instead rely on empirical approaches based on the training data. In general, $\ell$ is constrained to the interval $(0, \infty)$. However, learning the value of $\ell$ from the training data relies on considering how $y^{(i)}$ varies as a function of the pairwise distances between the $x^{(i)}$. Thus, there is no information in the data to inform lengthscale values that are below the minimum, or above the maximum, observed pairwise distance. Let $d^{(1)}, \dots, d^{(m)}$ denote the set of pairwise distances constructed from the training inputs $x^{(1)}, \dots, x^{(n)}$. There are $m = \frac{n(n-1)}{2}$ such distances, corresponding to the number of entries in the lower triangle (excluding the diagonal) of an $n \times n$ matrix. Thus, a reasonable first step is to enforce the constraint
$$
\ell_{\text{min}} := d_{\text{min}}, \qquad \ell_{\text{max}} := d_{\text{max}}, \tag{16}
$$
where $d_{\text{min}}$ and $d_{\text{max}}$ denote the minimum and maximum of $\{d^{(1)}, \dots, d^{(m)}\}$, respectively. Without this constraint, an optimizer can sometimes get stuck in local minima at very small or large lengthscales, which can lead to pathological Gaussian process fits. As mentioned previously, it is typically best to think in terms of correlations, so let's consider the implications of the choice (16) from this viewpoint. If $\ell = \ell_{\text{min}}$, the correlation at the minimum observed pairwise distance is given by
$$
\rho(d_{\text{min}}; \ell_{\text{min}}) = \rho(d_{\text{min}}; d_{\text{min}}) = \exp\left[-\left(\frac{d_{\text{min}}}{d_{\text{min}}}\right)^2 \right] = \exp(-1) \approx 0.37.
$$
Thus, the lower bound in (16) implies that we are constraining the lengthscale to satisfy
$$
\rho(d_{\text{min}}; \ell) \geq 0.37. \tag{17}
$$
We can generalize this idea and instead set $\ell_{\text{min}}$ to be the value that achieves the constraint
$$
\rho(d_{\text{min}};\ell_{\text{min}}) = \rho_{\text{min}}, \tag{18}
$$
for some value $\rho_{\text{min}}$ we are free to specify. Solving (18) for $\ell_{\text{min}}$, we obtain
$$
\ell_{\text{min}} = \frac{d_{\text{min}}}{\sqrt{\log\left(1/\rho_{\text{min}}\right)}}. \tag{19}
$$
Note that setting $\rho_{\text{min}} < 0.37$ loosens the bound, relative to (16), and vice versa; that is,
$$
\begin{align}
&\rho_{\text{min}} < 0.37 \implies \ell_{\text{min}} < d_{\text{min}} \newline
&\rho_{\text{min}} > 0.37 \implies \ell_{\text{min}} > d_{\text{min}}.
\end{align}
$$
We can generalize this even further by replacing $d_{\text{min}}$ and $d_{\text{max}}$ with empirical quantiles of $\{d^{(1)}, \dots, d^{(m)}\}$. For example, we might consider enforcing (18) with the minimum $d_{\text{min}}$ replaced by the $5^{\text{th}}$ percentile of the observed pairwise distances. Making this replacement without changing $\rho_{\text{min}}$ results in a more restrictive bound; i.e., a larger value for $\ell_{\text{min}}$.
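
Putting (16) through (19) together, here is a sketch (my own helper, assuming NumPy) that derives lengthscale bounds from the observed pairwise distances, optionally using quantiles in place of the minimum and maximum:

```python
import numpy as np

def lengthscale_bounds_1d(x, rho_min=np.exp(-1), lower_q=0.0, upper_q=1.0):
    """Empirical lengthscale bounds from pairwise distances of 1d training inputs.

    rho_min: target correlation at the (lower-quantile) pairwise distance; see (18)-(19).
    lower_q, upper_q: quantiles of the pairwise distances (0 and 1 recover min/max).
    """
    x = np.asarray(x, dtype=float)
    dists = np.abs(x[:, None] - x[None, :])              # n x n distance matrix
    pairwise = dists[np.triu_indices_from(dists, k=1)]   # the m = n(n-1)/2 unique distances
    d_lo = np.quantile(pairwise, lower_q)
    d_hi = np.quantile(pairwise, upper_q)
    ell_min = d_lo / np.sqrt(np.log(1.0 / rho_min))      # equation (19)
    ell_max = d_hi                                       # upper bound as in (16)
    return ell_min, ell_max

x_train = np.array([0.1, 0.35, 0.5, 0.8, 1.7])
print(lengthscale_bounds_1d(x_train))                    # defaults recover (16): (d_min, d_max)
print(lengthscale_bounds_1d(x_train, rho_min=0.2, lower_q=0.05))
```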

Multiple Input Dimensions

We consider two different approaches to generalize the above procedure to a $p$-dimensional input space.

Dimension-by-Dimension

In considering bounds for the lengthscales $\ell_1, \dots, \ell_p$ in the product correlation (8), one approach is to simply repeat the above procedure independently for each input dimension. In other words, we consider the pairwise distances $d^{(1)}_{j}, \dots, d^{(m)}_{j}$ in the $j^{\text{th}}$ dimension, where each distance is of the form $\lvert x_j^{(i)} - x_j^{(l)} \rvert$ for some $i \neq l$. The bounds for the parameter $\ell_j$ can then be constructed as described above with respect to these pairwise distances. Let $d_{\text{min},j}$ be the minimum (or some other quantile) of $d^{(1)}_{j}, \dots, d^{(m)}_{j}$. This procedure then implies the dimension-by-dimension constraint
$$
\rho(d_{\text{min},j}; \ell_j) \geq \rho_{\text{min}}, \qquad j=1, \dots, p, \tag{20}
$$
which in turn implies
$$
\rho(d_{\text{min},1}, \dots, d_{\text{min},p}) \geq \rho_{\text{min}}^{p}. \tag{21}
$$
For example, if we choose $\rho_{\text{min}} = 0.37$ in $p=5$ dimensions, then $\rho_{\text{min}}^5 \approx 0.007$. It is important to note that (20) does not provide a direct constraint on the minimum correlation at the distance $D_\text{min} := \min_{i \neq l} \lVert x^{(i)} - x^{(l)} \rVert_2$. There is a distinction between the pairs of points that are closest in each input dimension separately, and the pair that is closest in $\mathbb{R}^p$. The distances $d_{\text{min},j}$ in (20) may come from different pairs of inputs, not the particular pair of inputs that is closest in $\mathbb{R}^p$. For the remainder of this post, we will use capital $D$ for distances in the multivariate space, and lowercase $d_j$ for univariate distances with respect to the $j^{\text{th}}$ dimension.
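
A sketch of the dimension-by-dimension recipe (20), assuming the training inputs are stored as an $n \times p$ NumPy array (the helper name is my own):

```python
import numpy as np

def lengthscale_lower_bounds_by_dim(X, rho_min=np.exp(-1), quantile=0.0):
    """Per-dimension lower bounds implementing (20): rho(d_min_j; ell_j) >= rho_min.

    X: (n, p) array of training inputs.
    quantile: quantile of per-dimension pairwise distances used in place of the minimum.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    iu = np.triu_indices(n, k=1)
    ell_min = np.empty(p)
    for j in range(p):
        d_j = np.abs(X[:, j, None] - X[None, :, j])[iu]  # m pairwise distances in dimension j
        d_ref = np.quantile(d_j, quantile)
        ell_min[j] = d_ref / np.sqrt(np.log(1.0 / rho_min))
    return ell_min

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 3))
print(lengthscale_lower_bounds_by_dim(X, rho_min=0.3, quantile=0.05))
```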

Multivariate Constraint

We might instead choose to constrain the correlation at the distance $D_\text{min} = \min_{i \neq l} \lVert x^{(i)} - x^{(l)} \rVert_2$, which means we are now considering Euclidean distance in $\mathbb{R}^p$ rather than approaching the problem one dimension at a time. For the lower bound, this means enforcing the constraint
$$
\rho(x_{\star}, x_{\star}^\prime; \ell_{\text{min},1}, \dots, \ell_{\text{min},p}) = \rho_{\text{min}}, \tag{22}
$$
where $(x_{\star}, x_{\star}^\prime)$ is a pair of training inputs satisfying $\lVert x_{\star} - x_{\star}^\prime \rVert_2 = D_{\text{min}}$; i.e., the two closest points. In order to choose the lower bounds $\ell_{\text{min},j}$, a reasonable approach is to assume that the $\ell_{\text{min},j}$ scale linearly with the scale of the respective input dimension. Equivalently, we can consider scaling the inputs so that $x \in [0,1]^p$. In particular, for each dimension $j$ we re-scale as
$$
\tilde{x}_j := \frac{x_j - m_j}{M_j - m_j}, \tag{23}
$$
where we have defined
$$
\begin{align}
&m_j := \min_{i} x^{(i)}_j, &&M_j := \max_{i} x^{(i)}_j. \tag{24}
\end{align}
$$
We'll use tildes to denote quantities pertaining to the scaled space, and also stop writing the "min" subscript in $\ell_{\text{min},j}$ for succinctness. Note that this scaling implies that univariate distances scale as
$$
\tilde{d}_j := \tilde{x}_j - \tilde{x}_j^\prime = \frac{x_j - m_j}{M_j - m_j} - \frac{x^\prime_j - m_j}{M_j - m_j} = \frac{x_j - x_j^\prime}{M_j - m_j} = \frac{d_j}{M_j - m_j}. \tag{25}
$$
Since lengthscales correspond to distances, we likewise have
$$
\tilde{\ell}_j := \frac{\ell_j}{M_j - m_j}. \tag{26}
$$
Note that these scalings have no effect on the correlations, since
$$
\rho(\tilde{x},\tilde{x}^\prime; \tilde{\ell}_1, \dots, \tilde{\ell}_p) = \exp\left[-\sum_{j=1}^{p} \left(\frac{\tilde{d}_j}{\tilde{\ell}_j}\right)^2 \right] = \exp\left[-\sum_{j=1}^{p} \left(\frac{d_j}{\ell_j}\right)^2 \right] = \rho(x,x^\prime; \ell_1, \dots, \ell_p), \tag{27}
$$
given that the $M_j - m_j$ factors cancel in the ratio. If we consider an isotropic model in the scaled space (i.e., $\tilde{\ell} \equiv \tilde{\ell}_j \ \forall j$), then this is equivalent to assuming that the lengthscale bounds scale linearly as
$$
\ell_j = (M_j - m_j)\tilde{\ell}. \tag{28}
$$
If we enforce the constraint
$$
\rho(\tilde{x}_{\star}, \tilde{x}^\prime_{\star}; \tilde{\ell}, \dots, \tilde{\ell}) = \rho_{\text{min}} \tag{29}
$$
in the scaled space, then we see from (27) that this encodes the constraint (22), as desired.
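
The scaled-space construction (22)-(29) can be sketched as follows (illustrative only, with names of my own choosing): rescale the inputs to $[0,1]^p$, find the closest pair in Euclidean distance, solve (29) for the isotropic $\tilde{\ell}$, and map back to per-dimension lower bounds via (28).

```python
import numpy as np
from scipy.spatial.distance import pdist

def lengthscale_lower_bounds_multivariate(X, rho_min=np.exp(-1)):
    """Lower bounds enforcing (29) at the closest pair of scaled inputs, mapped back via (28).

    X: (n, p) array of training inputs.
    """
    X = np.asarray(X, dtype=float)
    m_j, M_j = X.min(axis=0), X.max(axis=0)
    X_scaled = (X - m_j) / (M_j - m_j)                  # rescale each dimension to [0, 1]; see (23)
    D_min = pdist(X_scaled).min()                       # closest pair in scaled Euclidean distance
    ell_tilde = D_min / np.sqrt(np.log(1.0 / rho_min))  # isotropic lengthscale solving (29)
    return (M_j - m_j) * ell_tilde                      # per-dimension bounds via (28)

rng = np.random.default_rng(1)
X = rng.uniform(low=[0.0, -5.0], high=[1.0, 5.0], size=(30, 2))
print(lengthscale_lower_bounds_multivariate(X, rho_min=0.3))
```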

Marginal Variance

We now consider the definition of reasonable default bounds for the marginal variance parameter $\alpha^2$. Throughout this section, we assume that $f(x)$ is a zero-mean Gaussian process with a Gaussian covariance function. A reasonable approach here is to cap $\alpha^2$ so that most of the prior probability falls within the observed range of the data. As always, such a heuristic could present challenges in generalizing beyond the training data, and domain knowledge should take precedence in constraining the value of $\alpha^2$.

To make precise the notion of placing most prior mass over the observed data range, let $y_{\text{max}} := \max_i \lvert y^{(i)} \rvert$. Then we can define the upper bound $\alpha_{\text{max}}$ (and hence $\alpha^2_{\text{max}}$) as the value of $\alpha$ satisfying
$$
\mathbb{P}\left[\lvert f(x)\rvert \leq y_{\text{max}} \right] = p, \tag{30}
$$
for some (large) probability $p$ (e.g., $p=0.99$). This constraint excludes values $\alpha > \alpha_{\text{max}}$, which would yield priors that place more probability on values of $\lvert f(x) \rvert$ exceeding $y_{\text{max}}$. The use of the value $y_{\text{max}}$ might raise some concerns about robustness, since a single extreme value $y^{(i)}$ could exert significant influence on the bound $\alpha_{\text{max}}$. On the flip side, we also don't want to simply ignore the larger values and set $\alpha_{\text{max}}$ restrictively low. Similar to the lengthscale bound approach, one option is to replace $y_{\text{max}}$ with a different empirical quantile, and pair this with a slightly lower value of $p$.

In any case, for a chosen quantile $y_{\text{max}}$ and probability $p$, we can solve the expression (30) for $\alpha$ to obtain an expression for the bound $\alpha_{\text{max}}$. Noting that $f(x) \sim \mathcal{N}(0, \alpha^2)$, we see that (30) can equivalently be written as
$$
\mathbb{P}\left[-y_{\text{max}} \leq \alpha Z \leq y_{\text{max}} \right] = p, \tag{31}
$$
where $Z \sim \mathcal{N}(0,1)$. Since Gaussians are symmetric, we can write (31) as
$$
1 - 2 \mathbb{P}\left[Z \geq y_{\text{max}}/\alpha \right] = p. \tag{32}
$$
Solving (32) for $\alpha$ then gives
$$
\alpha_{\text{max}} = \frac{y_{\text{max}}}{\Phi^{-1}\left(\frac{1+p}{2}\right)}, \tag{33}
$$
recalling that $\Phi(t) := \mathbb{P}[Z \leq t]$ denotes the standard normal distribution function.
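
Formula (33) is one line with SciPy's inverse normal CDF (`norm.ppf`); the snippet below (my own wrapper) also shows the quantile-based variant mentioned above:

```python
import numpy as np
from scipy.stats import norm

def marginal_std_upper_bound(y, p=0.99, quantile=1.0):
    """Upper bound on alpha via (33): alpha_max = y_ref / Phi^{-1}((1 + p) / 2).

    quantile: quantile of |y| used as y_ref (1.0 recovers the maximum y_max).
    """
    y_ref = np.quantile(np.abs(np.asarray(y, dtype=float)), quantile)
    return y_ref / norm.ppf((1.0 + p) / 2.0)

y = np.array([-0.8, 0.3, 1.9, -2.4, 0.1, 0.7])
alpha_max = marginal_std_upper_bound(y, p=0.99)
print(alpha_max, alpha_max**2)                             # bound on alpha and on alpha^2
print(marginal_std_upper_bound(y, p=0.95, quantile=0.9))   # more robust variant
```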