The Polynomial Kernel
Introduction and basic properties, kernel ridge regression, Gaussian processes.
In this post we discuss the polynomial kernel, with particular emphasis on the quadratic special case. We motivate the kernel within a regularized regression context, yielding the kernel ridge regression estimator. We then shift to the Bayesian perspective, which results in Gaussian process regression. Much of the exposition here is actually agnostic to the choice of kernel, but focusing on the polynomial kernel provides a concrete and intuitive introduction to kernel methods. We conclude the post by discussing various properties of the polynomial kernel.
From Basis Functions to Kernel
Polynomial Basis
We consider working with the multivariate polynomial space \begin{align} \mathcal{P}_p(\mathbb{R}^d) &:= \operatorname{span}\left\{x \mapsto x_1^{\alpha_1} \cdots x_d^{\alpha_d} : \alpha_1 + \dots + \alpha_d \leq p \right\}, \tag{1} \end{align} with subscripts indexing the input dimensions. The dimension of $\mathcal{P}_p(\mathbb{R}^d)$ can be shown to be $q := \binom{d+p}{p}$ (for a proof, check out chapter 2 of Wendland’s Scattered Data Approximation). Let $\varphi: \mathbb{R}^d \to \mathbb{R}^q$ denote the feature map corresponding to this polynomial basis, such that \begin{align} f(x) &:= \varphi(x)^\top \beta = \sum_{j=1}^{q} \beta_j \varphi_j(x) \in \mathcal{P}_p(\mathbb{R}^d). \tag{2} \end{align} Every polynomial in the space can be represented via a particular choice of the coefficient vector $\beta \in \mathbb{R}^q$. As a concrete example, in the one-dimensional setting $d = 1$ with polynomial degree $p = 2$ (so $q = 3$) we have $\varphi(x) = \begin{bmatrix} 1, x, x^2 \end{bmatrix}^\top$ (when we draw connections with the kernel approach, the scaling of these basis elements will be different, but the same ideas apply).
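To make the dimension count concrete, here is a minimal NumPy sketch (the helper names are just for illustration, not from any library) that enumerates the degree-2 monomial basis and checks that its length agrees with $q = \binom{d+p}{p}$.

```python
import math
import numpy as np

def poly_dim(d: int, p: int) -> int:
    """Dimension of P_p(R^d): polynomials of degree <= p in d variables."""
    return math.comb(d + p, p)

def monomial_features_deg2(x: np.ndarray) -> np.ndarray:
    """Degree-2 monomial basis at x in R^d: constant, x_j, and x_j * x_l for j <= l."""
    d = x.shape[0]
    feats = [1.0]                                                   # constant term
    feats += [x[j] for j in range(d)]                               # linear terms
    feats += [x[j] * x[l] for j in range(d) for l in range(j, d)]   # quadratic terms
    return np.array(feats)

d, p = 3, 2
x = np.random.default_rng(0).standard_normal(d)
print(poly_dim(d, p))                                        # 10
print(monomial_features_deg2(x).shape[0] == poly_dim(d, p))  # True
```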
Polynomial Kernel
The goal is to find a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that induces a feature map of polynomial basis functions. I claim that the
function
\begin{align}
&k(x, z) = \left(\langle x, z \rangle + c \right)^p, &&c \geq 0 \tag{3}
\end{align}
does the trick. We’ll show this for the special case $p = 2$ below, but check out
these excellent notes
for a derivation in the general case (which follows from an application of the
multinomial theorem). Proceeding with the $p = 2$ case, our claim is that (3) satisfies
\begin{align}
k(x, z) &= \langle \varphi(x), \varphi(z) \rangle, \tag{4}
\end{align}
with $\varphi$ a feature map outputting a vector of the polynomial basis functions.
To show this, we simply multiply out the square to obtain
\begin{align}
k(x, z)
&= \left(\sum_{j=1}^{d} x_j z_j + c \right)^2 \tag{5} \newline
&= \left(\sum_{j=1}^{d} x_j z_j\right)^2 + 2c \sum_{j=1}^{d} x_j z_j + c^2 \newline
&= \sum_{j=1}^{d} x_j^2 z_j^2 + 2\sum_{j=2}^{d}\sum_{\ell=1}^{j-1} x_jz_j x_{\ell} z_{\ell} +
2c \sum_{j=1}^{d} x_j z_j + c^2 \newline
&= \sum_{j=1}^{d} x_j^2 z_j^2 +
\sum_{j=2}^{d}\sum_{\ell=1}^{j-1} (\sqrt{2}x_j x_{\ell}) (\sqrt{2}z_j z_{\ell}) +
\sum_{j=1}^{d} (\sqrt{2c}x_j) (\sqrt{2c}z_j) + c^2 \newline
&= \langle \varphi(x), \varphi(z) \rangle,
\end{align}
where
\begin{align}
\varphi(x) &:= \begin{bmatrix} x_1^2, \dots, x_d^2, \sqrt{2}x_2 x_1, \dots,
\sqrt{2}x_d x_{d-1}, \sqrt{2c}x_1, \dots, \sqrt{2c}x_d, c \end{bmatrix}^\top. \tag{6}
\end{align}
We have used the well-known formula for expanding the square of a summation.
The feature map (6) that popped out of these derivations does indeed correspond
to a basis for $\mathcal{P}_2(\mathbb{R}^d)$ (provided $c > 0$); the scaling of the basis functions
might look a bit weird with the $\sqrt{2}$ and $\sqrt{2c}$ terms, but this is just how the
calculation works out. We have established (4) in the special case $p = 2$, though
the result holds for any positive integer $p$. This is enough to verify that
(3) is a valid positive semidefinite (PSD) kernel, which avoids working with the
PSD definition directly.
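As a quick numerical sanity check of the $p = 2$ identity (4), the following sketch evaluates the kernel (3) directly and via the feature map (6); the helper names are just for illustration.

```python
import numpy as np

def quad_kernel(x, z, c=1.0):
    """Polynomial kernel (3) with p = 2."""
    return (x @ z + c) ** 2

def quad_features(x, c=1.0):
    """Feature map (6): squares, scaled cross terms, scaled linear terms, constant."""
    d = x.shape[0]
    squares = x ** 2
    cross = np.array([np.sqrt(2.0) * x[j] * x[l] for j in range(1, d) for l in range(j)])
    linear = np.sqrt(2.0 * c) * x
    return np.concatenate([squares, cross, linear, [c]])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(4), rng.standard_normal(4)
print(np.isclose(quad_kernel(x, z), quad_features(x) @ quad_features(z)))  # True
```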
To further appreciate the connection between the kernel and feature map, consider the expression derived in (5) for $k(x, z)$, viewed as a function of $x$ only, with $z$ a fixed parameter. We see that $k(\cdot, z)$ is a polynomial in the space $\mathcal{P}_2(\mathbb{R}^d)$, evaluated at $x$. All of the constants and terms involving $z$ simply become the coefficients on the basis functions that determine the particular polynomial. Thus, by fixing the second argument of the kernel at different inputs $z$ and taking linear combinations, we are able to produce any polynomial in the space. Since the polynomial space is finite-dimensional, a finite set of such kernel functions can span the space. In other settings where the feature space is infinite-dimensional, this will not be the case.
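A small check of this spanning claim in the simplest case $d = 1$, $p = 2$, $c = 1$: each kernel slice $k(\cdot, z)$ has coefficient vector $[c^2, 2cz, z^2]$ in the monomial basis $[1, x, x^2]$, and three distinct choices of $z$ already give a full-rank coefficient matrix (note this relies on $c > 0$). The values below are illustrative only.

```python
import numpy as np

# For d = 1, p = 2: k(x, z) = (xz + c)^2 = c^2 + (2cz) x + (z^2) x^2,
# so each fixed z gives the coefficient vector [c^2, 2cz, z^2] on the basis [1, x, x^2].
c = 1.0
z_points = np.array([0.0, 1.0, 2.0])  # three distinct choices of the second argument
coeffs = np.array([[c**2, 2 * c * z, z**2] for z in z_points])

# Rank 3 means the three kernel slices k(., z_i) span P_2(R).
# With c = 0 this would fail: the slices would contain no constant or linear terms.
print(np.linalg.matrix_rank(coeffs))  # 3
```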
Example 1: Ridge Regression
We now compare the explicit basis function and kernel approaches in their application to a standard regression problem. Much of the discussion here is not actually specific to polynomials, applying more generally to arbitrary feature maps and kernels.
Polynomial Basis
Let’s start by considering a standard linear regression setting with training data $(X, y)$, with the inputs $x^1, \dots, x^n \in \mathbb{R}^d$ stacked in the rows of the matrix $X \in \mathbb{R}^{n \times d}$ and the responses in the vector $y \in \mathbb{R}^n$. Note that we are using superscripts to index distinct vectors, while subscripts index vector dimensions. We’ll assume a regression model that is linear with respect to polynomial basis functions up to degree $p$; in other words, the space of all possible regression functions is $\mathcal{P}_p(\mathbb{R}^d)$. We let $\varphi: \mathbb{R}^d \to \mathbb{R}^q$ be the associated feature map, scaled to align with (6). Let $\Phi \in \mathbb{R}^{n \times q}$ denote the feature matrix evaluated at the training inputs; precisely, $\Phi_{ij} := \varphi_j(x^i)$. The linear model in the polynomial basis is thus given by \begin{align} y &= \Phi \beta + \epsilon, &&\epsilon \sim \mathcal{N}(0, \sigma^2 I_n). \tag{7} \end{align} We consider the ridge regression setting, whereby the coefficients are estimated by minimizing the residual sum of squares, regularized by an $\ell_2$ penalty. This yields \begin{align} \hat{\beta} &:= \text{argmin}_{\beta} \left[\lVert y-\Phi \beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right] = \left(\lambda I_q + \Phi^\top \Phi \right)^{-1} \Phi^\top y, \tag{8} \end{align} which is derived by setting the gradient of the loss equal to zero and solving for $\beta$. To predict at a new set of inputs $\tilde{x}^1, \dots, \tilde{x}^m$, stacked in the rows of $\tilde{X} \in \mathbb{R}^{m \times d}$, we first map to the feature matrix $\tilde{\Phi} \in \mathbb{R}^{m \times q}$, then compute \begin{align} \hat{y} &:= \tilde{\Phi} \hat{\beta} = \tilde{\Phi}\left(\lambda I_q + \Phi^\top \Phi \right)^{-1} \Phi^\top y. \tag{9} \end{align} This is just run-of-the-mill ridge regression using polynomial basis functions, but we emphasize some points that will become relevant when comparing to kernel methods later on (a small numerical sketch of this procedure follows the list):
- Computing the predictions is $\mathcal{O}(q^3 + mq)$. The $\mathcal{O}(q^3)$ stems from the linear solve required to compute $\hat{\beta}$. After model fitting, the training inputs are forgotten, the relevant information having been encoded in the parameters $\hat{\beta}$. Therefore, the prediction calculations scale as $\mathcal{O}(mq)$, independently of $n$.
- This procedure requires estimating $q = \binom{d+p}{p}$ parameters, a number which grows quite quickly in both $d$ and $p$. This can be problematic as the number of parameters approaches or exceeds the number of observations $n$. Regression with multivariate polynomials is thus difficult to scale to higher-dimensional problems. One may deal with this by reducing the number of parameters (e.g., by excluding some of the interaction terms in the basis) or employing methods that seek to discover sparsity in $\beta$ (e.g., $\ell_1$ regularization). We will shortly discuss how the kernel approach can help alleviate this difficulty.
- In general, the matrix $\Phi^\top \Phi$ is PSD since $v^\top \Phi^\top \Phi v = \lVert \Phi v \rVert_2^2 \geq 0$; one can guarantee it is positive definite (PD) under certain conditions on the input points (see Wendland chapter 2 for details). In any case, the addition of $\lambda I_q$ with $\lambda > 0$ ensures that $\lambda I_q + \Phi^\top \Phi$ is PD, and hence invertible. From a numerical perspective, this can be useful even when $\Phi^\top \Phi$ is theoretically positive definite, since its numerical instantiation may fail to be so.
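Below is a minimal sketch of this primal procedure, assuming the degree-2 feature map scaled as in (6); the data and helper names are illustrative only. It computes (8) and (9) and checks numerically that $\lambda I_q + \Phi^\top \Phi$ has smallest eigenvalue at least $\lambda$.

```python
import numpy as np

def quad_features(x, c=1.0):
    """Feature map (6) for a single input x in R^d."""
    d = x.shape[0]
    cross = np.array([np.sqrt(2.0) * x[j] * x[l] for j in range(1, d) for l in range(j)])
    return np.concatenate([x**2, cross, np.sqrt(2.0 * c) * x, [c]])

def feature_matrix(X, c=1.0):
    """Stack feature vectors row-wise: Phi[i, j] = phi_j(x^i)."""
    return np.stack([quad_features(x, c) for x in X])

rng = np.random.default_rng(1)
n, m, d, lam, c = 50, 5, 3, 0.1, 1.0
X, X_new = rng.standard_normal((n, d)), rng.standard_normal((m, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

Phi, Phi_new = feature_matrix(X, c), feature_matrix(X_new, c)
q = Phi.shape[1]

A = lam * np.eye(q) + Phi.T @ Phi          # q x q system matrix from (8)
beta_hat = np.linalg.solve(A, Phi.T @ y)   # primal (ridge) weights, eq. (8)
y_hat = Phi_new @ beta_hat                 # predictions, eq. (9)

# A is PD: its eigenvalues are lambda plus the (nonnegative) eigenvalues of Phi.T @ Phi.
print(np.linalg.eigvalsh(A).min() >= lam)  # True
```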
Polynomial Kernel
We now discuss an alternative route to compute the prediction $\hat{y}$ that eliminates the need to estimate the parameters $\beta$ (there will, of course, be a new cost to pay). The first required observation is that \begin{align} &&k(X, X) = \Phi \Phi^\top, &&k(\tilde{X}, X) = \tilde{\Phi} \Phi^\top, \tag{10} \end{align} where we have vectorized notation by defining the matrices $k(X, X) \in \mathbb{R}^{n \times n}$ and $k(\tilde{X}, X) \in \mathbb{R}^{m \times n}$ by $k(X, X)_{ij} := k(x^i, x^j)$ and $k(\tilde{X}, X)_{ij} := k(\tilde{x}^i, x^j)$. The matrix $\Phi \Phi^\top$ looks a lot like the term $\Phi^\top \Phi$ in (9), but the order of the factors is reversed. However, we can make these terms align by rewriting (9) via an application of the Woodbury identity. We actually only need the simpler fact \begin{align} (\lambda I_q + UV)^{-1}U = U(\lambda I_n + VU)^{-1}, \tag{11} \end{align} which the linked Woodbury Wikipedia page refers to as the push-through identity (I’ve modified it slightly by incorporating the $\lambda$). We therefore obtain \begin{align} \left(\lambda I_q + \Phi^\top \Phi \right)^{-1} \Phi^\top &= \Phi^\top \left(\lambda I_n + \Phi \Phi^\top \right)^{-1}, \tag{12} \end{align} having applied (11) with $U := \Phi^\top$ and $V := \Phi$. Therefore, we can write (9) as \begin{align} \hat{y} &= \tilde{\Phi}\left(\lambda I_q + \Phi^\top \Phi \right)^{-1} \Phi^\top y \newline &= \tilde{\Phi} \Phi^\top \left(\lambda I_n + \Phi \Phi^\top \right)^{-1} y \newline &= k(\tilde{X}, X) \left[\lambda I_n + k(X, X) \right]^{-1} y. \tag{13} \end{align} Sometimes the term \begin{align} \alpha := \left[\lambda I_n + k(X, X) \right]^{-1} y \tag{14} \end{align} is referred to as the dual weights (whereas $\hat{\beta}$ might be called the primal weights). To motivate why, let’s consider the prediction $\hat{y}_s$ at a single new input $\tilde{x}^s$, which can be written as
\begin{align} \hat{y}_{s} &= k(\tilde{x}^s, X) \alpha = \sum_{i=1}^{n} \alpha_i k(\tilde{x}^s, x^i). \tag{15} \end{align} The sum (15) can be viewed as the analog of \begin{align} \hat{y}_{s} &= \varphi(\tilde{x}^s)^\top \hat{\beta} = \sum_{j=1}^{q} \hat{\beta}_j \varphi_j(\tilde{x}^s). \tag{16} \end{align}
We see that both methods generate the prediction via a linear combination of basis functions: in (16) the basis functions come from the feature map $\varphi$, and in (15) the basis functions are $k(\cdot, x^1), \dots, k(\cdot, x^n)$. In the former case, there are $q$ fixed, global basis functions. In the latter, the basis functions are data-dependent, one attached to each training input $x^i$; therefore, the number of basis functions grows linearly in the number of observations $n$.
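The following self-contained sketch (illustrative helper names again) carries out both routes on the same synthetic data: the primal solve (8)-(9) and the dual solve (14) with predictions (13)/(15), confirming that they produce identical predictions.

```python
import numpy as np

def quad_features(x, c=1.0):
    """Feature map (6) for a single input x in R^d."""
    d = x.shape[0]
    cross = np.array([np.sqrt(2.0) * x[j] * x[l] for j in range(1, d) for l in range(j)])
    return np.concatenate([x**2, cross, np.sqrt(2.0 * c) * x, [c]])

def quad_kernel_matrix(A, B, c=1.0):
    """k(A, B)[i, j] = (<a^i, b^j> + c)^2, the quadratic kernel (3)."""
    return (A @ B.T + c) ** 2

rng = np.random.default_rng(1)
n, m, d, lam, c = 50, 5, 3, 0.1, 1.0
X, X_new = rng.standard_normal((n, d)), rng.standard_normal((m, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Primal route: solve the q x q system (8), then predict via (9).
Phi = np.stack([quad_features(x, c) for x in X])
Phi_new = np.stack([quad_features(x, c) for x in X_new])
beta_hat = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ y)
y_hat_primal = Phi_new @ beta_hat

# Dual route: solve the n x n system (14), then predict via (13)/(15).
K, K_new = quad_kernel_matrix(X, X, c), quad_kernel_matrix(X_new, X, c)
alpha = np.linalg.solve(lam * np.eye(n) + K, y)   # dual weights
y_hat_dual = K_new @ alpha

print(np.allclose(y_hat_primal, y_hat_dual))  # True, by the push-through identity (12)
```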
We close this section by enumerating some properties of the kernel approach, serving as a comparison to the points listed in the previous section for the explicit basis function approach.
- Computing the predictions is $\mathcal{O}(n^3 + mn)$. The $\mathcal{O}(n^3)$ comes from the linear solve $\alpha = \left[\lambda I_n + k(X, X)\right]^{-1} y$. Once this linear solve has been computed, prediction calculations scale like $\mathcal{O}(mn)$, which now, in contrast to the first approach, depends on $n$. The poor scaling in $n$ is the cost we pay with the kernel method.
- Notice that no parameters $\beta$ are explicitly estimated in this kernel approach. Instead of compressing the information into a vector of coefficients $\hat{\beta}$, this method stores the training inputs $X$ and kernel matrix $k(X, X)$ to be used for prediction. The kernel approach does typically require estimating a few hyperparameters (in this case, $c$ and $\lambda$), which we discuss in detail later on.
- While the matrix $\Phi^\top \Phi$ can be guaranteed to be PD under some fairly relaxed conditions, the kernel matrix $k(X, X) = \Phi \Phi^\top$ is almost certainly not PD: it is an $n \times n$ matrix of rank at most $q$, and hence singular whenever $n > q$. However, it still is PSD, as we noted above. This can also be seen by considering $v^\top k(X, X) v = v^\top \Phi \Phi^\top v = \lVert \Phi^\top v \rVert_2^2 \geq 0$. Therefore, the addition of $\lambda I_n$ with $\lambda > 0$ is crucial in this setting to ensure that the matrix $\lambda I_n + k(X, X)$ is invertible. The quick numerical check below illustrates the rank deficiency.
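Here is the quick check: with $n = 50$ random inputs in $d = 3$ dimensions, the quadratic kernel matrix has rank at most $q = 10$ and is therefore singular, while adding $\lambda I_n$ restores invertibility. The numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, c, lam = 50, 3, 1.0, 0.1
X = rng.standard_normal((n, d))
K = (X @ X.T + c) ** 2             # quadratic kernel matrix k(X, X)

q = (d + 2) * (d + 1) // 2         # dim of P_2(R^d) = C(d+2, 2) = 10
print(np.linalg.matrix_rank(K))    # 10: rank at most q, so K is singular since n = 50 > q
print(np.linalg.eigvalsh(K).min())                    # ~0: PSD (up to floating point) but not PD
print(np.linalg.eigvalsh(lam * np.eye(n) + K).min())  # ~lam = 0.1: safely invertible
```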
Example 2: The Bayesian Perspective
In the previous section, we motivated the polynomial kernel within a ridge regression setting. In this section we walk through similar derivations, but from a Bayesian point of view. In particular, we compare Bayesian linear regression using polynomial basis functions with its kernelized analog, the latter giving rise to Gaussian process (GP) regression.
Polynomial Basis
Consider the following Bayesian polynomial regression model: \begin{align} y|\beta &\sim \mathcal{N}(\Phi \beta, \sigma^2 I_n) \newline \beta &\sim \mathcal{N}\left(0, \frac{\sigma^2}{\lambda} I_q \right). \tag{17} \end{align} This is a linear Gaussian model, with posterior given by \begin{align} \beta | y &\sim \mathcal{N}\left(\hat{m}, \hat{C} \right), \tag{18} \end{align} where \begin{align} \hat{m} &= \hat{C} \left(\frac{1}{\sigma^2}\Phi^\top y \right) = \left(\Phi^\top \Phi + \lambda I_q \right)^{-1} \Phi^\top y \newline \hat{C} &= \sigma^2 \left(\Phi^\top \Phi + \lambda I_q \right)^{-1}. \tag{19} \end{align} See my post on linear Gaussian models for the derivations of these equations. Notice that the posterior mean in (19) is precisely the ridge regression estimator in (8).
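A short numerical check of the claim that the posterior mean (19) coincides with the ridge estimator (8); the feature matrix here is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, sigma2, lam = 40, 10, 0.5**2, 0.1
Phi = rng.standard_normal((n, q))   # any feature matrix; polynomial or otherwise
y = rng.standard_normal(n)

# Posterior (18)-(19) under the prior beta ~ N(0, (sigma^2 / lambda) I_q).
C_hat = sigma2 * np.linalg.inv(Phi.T @ Phi + lam * np.eye(q))
m_hat = C_hat @ (Phi.T @ y) / sigma2

# Ridge estimator (8).
beta_ridge = np.linalg.solve(lam * np.eye(q) + Phi.T @ Phi, Phi.T @ y)

print(np.allclose(m_hat, beta_ridge))  # True: posterior mean = ridge estimate
```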
We can again consider prediction at the new set of inputs $\tilde{X}$. Letting $\tilde{y} := \tilde{\Phi}\beta + \tilde{\epsilon}$ with $\tilde{\epsilon} \sim \mathcal{N}(0, \sigma^2 I_m)$, the posterior predictive distribution is given by \begin{align} \tilde{y}|y &\sim \mathcal{N}\left(\tilde{m}, \tilde{C} \right), \tag{20} \end{align} where \begin{align} \tilde{m} &= \tilde{\Phi}\hat{m} = \tilde{\Phi}\Phi^\top \left(\Phi \Phi^\top + \lambda I_n \right)^{-1}y \newline \tilde{C} &= \tilde{\Phi}\hat{C}\tilde{\Phi}^\top + \sigma^2 I_m = \sigma^2 \tilde{\Phi}\left(\Phi^\top \Phi + \lambda I_q \right)^{-1}\tilde{\Phi}^\top + \sigma^2 I_m. \tag{21} \end{align}
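Finally, a sketch verifying that the predictive mean in (21) can be computed either in feature space or, via the push-through identity (12), using only $n \times n$ solves; the matrices are again arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, q, sigma2, lam = 40, 5, 10, 0.5**2, 0.1
Phi, Phi_new = rng.standard_normal((n, q)), rng.standard_normal((m, q))
y = rng.standard_normal(n)

# Predictive mean (21), feature-space form: m_tilde = Phi_new @ m_hat.
m_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(q), Phi.T @ y)
m_tilde_primal = Phi_new @ m_hat

# Equivalent kernelized form via the push-through identity (12): only n x n solves,
# with Phi @ Phi.T and Phi_new @ Phi.T playing the roles of k(X, X) and k(X_new, X).
m_tilde_dual = Phi_new @ Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)
print(np.allclose(m_tilde_primal, m_tilde_dual))  # True

# Predictive covariance (21) in feature space.
C_hat = sigma2 * np.linalg.inv(Phi.T @ Phi + lam * np.eye(q))
C_tilde = Phi_new @ C_hat @ Phi_new.T + sigma2 * np.eye(m)
print(C_tilde.shape)  # (5, 5)
```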