An Introduction to Scattered Data Approximation
I summarize the first chapter of Holger Wendland's book "Scattered Data Approximation", which I augment with some background on polynomial interpolation and splines.
This post provides a summary of the first chapter of Holger Wendland’s excellent book on scattered data approximation. The introductory chapter nominally provides motivation and applications for the subsequent theory, but is somewhat of a whirlwind in itself if you lack significant background in approximation theory. I therefore try to augment the material in the chapter with some additional content, which is especially suited for readers coming from more of a statistical background. The notation I use more or less follows that of Wendland.
The General Problem
The theory of scattered data interpolation addresses a very general problem, encompassing both problems of interpolation and smoothing (i.e., regression). The generic setup considers a dataset consisting of sites $x_1, \dots, x_N$ (also called locations, knots, or design points in different settings) and corresponding values $f_1, \dots, f_N$ at each respective site. In typical applications, these values are assumed to be generated by some underlying function $f$, perhaps subject to noise, but this need not be the case. The noiseless and noisy settings naturally motivate the two primary problems in this domain: interpolation and approximation. I will also refer to the latter as a problem of smoothing or regression, depending on the context. In the two cases, we seek a function $s$ that either interpolates \begin{align} f_j &= s(x_j), && j = 1, \dots, N \end{align} or approximates \begin{align} f_j &\approx s(x_j), && j = 1, \dots, N \end{align} the data. The name “scattered” comes from the lack of assumptions on the sites $x_j$; we will generally treat them generically, without assuming any sort of special structure. On the other hand, we will require assumptions on the generating process for the $f_j$, which typically means assuming that the underlying function $f$ satisfies some notion of smoothness. We will thus consider the approximation properties of different spaces of smooth functions, with the goal being to identify the function $s$ which is optimal in some sense.
In order to fully specify a concrete problem, we need to define in what space the sites $x_j$ live. Statisticians would probably jump right to $\mathbb{R}^d$, but in approximation theory there are some fundamental differences between one and multiple dimensions. Therefore, in the next section we start by discussing the concrete case of a one-dimensional input space, which will be used to motivate higher-dimensional generalizations later on.
The One-Dimensional Case
In this section we will assume that the input space is one-dimensional; in particular, that $x_j \in [a, b]$ for all $j = 1, \dots, N$, for some real numbers $a < b$, with the sites distinct and ordered so that $x_1 < x_2 < \cdots < x_N$. It will sometimes be useful to extend the notation as $x_0 := a$ and $x_{N+1} := b$, in which case we will sometimes refer to $x_1, \dots, x_N$ as the interior sites to distinguish them from the end points. Note that there are different conventions for this in the literature, but it is important to remember that in our case $N$ denotes the number of sites $x_1, \dots, x_N$ with corresponding output values $f_1, \dots, f_N$. The goal in this setting is to find a continuous function $s: [a, b] \to \mathbb{R}$ which either interpolates or approximates the data.
Polynomials
Following Wendland, let $\pi_M(\mathbb{R})$ denote the space of polynomials of degree at most $M$; i.e., the functions which admit a representation \begin{align} s(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \cdots + \alpha_M x^M \end{align} for some coefficients $\alpha_0, \dots, \alpha_M \in \mathbb{R}$. The set $\pi_M(\mathbb{R})$ is indeed a space in the linear algebra sense; it is a vector space of dimension $M + 1$. The most obvious basis for this space is the monomial basis $\{1, x, x^2, \dots, x^M\}$, but many other bases are commonly used which have nicer theoretical and computational properties.
Interpolation with Polynomials
Let’s now consider solving the interpolation problem using polynomials; that is,
we will try to find a function $s \in \pi_{M-1}(\mathbb{R})$ which interpolates the
data. I’m considering $\pi_{M-1}(\mathbb{R})$ and not $\pi_M(\mathbb{R})$ since the former space
has dimension $M$, which will be more convenient notationally than having to write
$M + 1$ all the time. With this choice of function space, the interpolation requirement is that
\begin{align}
f_j &= \alpha_0 + \alpha_1 x_j + \alpha_2 x_j^2 + \cdots + \alpha_{M-1} x_j^{M-1}, &&\forall j = 1, \dots, N.
\end{align}
It is convenient to define
\begin{align}
\varphi(x) &:= (1, x, x^2, \dots, x^{M-1})^\top \in \mathbb{R}^{M} \newline
\alpha &:= (\alpha_0, \alpha_1, \alpha_2, \dots, \alpha_{M-1})^\top \in \mathbb{R}^{M}
\end{align}
so that we can re-write the above as
\begin{align}
f_j &= \alpha_0 + \alpha_1 x_j + \alpha_2 x_j^2 + \cdots + \alpha_{M-1} x_j^{M-1} = \varphi(x_j)^\top \alpha,
&&\forall j = 1, \dots, N. \tag{1}
\end{align}
To succinctly write the equalities for all $j = 1, \dots, N$, let $\Phi \in \mathbb{R}^{N \times M}$
be the matrix with $j^{\text{th}}$ row equal to $\varphi(x_j)^\top$. The requirement
(1) is then encoded in the linear system
\begin{align}
y = \Phi \alpha, \tag{2}
\end{align}
where we have additionally defined $y := (f_1, \dots, f_N)^\top \in \mathbb{R}^N$.
The questions of existence and uniqueness can now be answered by appealing to known
properties of linear systems, though it is not in general obvious when (2) will have a solution.
A unique solution exists if and only if $\Phi$ is non-singular, which in particular
requires that $\Phi$ is square; i.e., that $M = N$.
In the square case, it can be
shown that $\Phi$ is non-singular
if and only if all of the $x_j$ are distinct, which is the typical case for interpolation.
Under the assumption that the $x_j$ are distinct, we thus conclude that the
space $\pi_{N-1}(\mathbb{R})$ provides a unique interpolating polynomial, whose
coefficients can be computed by $\hat{\alpha}_{\text{interp}} := \Phi^{-1} y$.
The interpolated value at a new input $\tilde{x}$ is then given by
\begin{align}
s(\tilde{x}) &= \varphi(\tilde{x})^\top \hat{\alpha}_{\text{interp}}
= \varphi(\tilde{x})^\top \Phi^{-1} y
\end{align}
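To make this concrete, here is a minimal sketch (my own illustration, not from Wendland) that uses NumPy to assemble the monomial feature matrix $\Phi$, solve the linear system (2), and evaluate the interpolant at a new input; the example data are arbitrary.

```python
import numpy as np

def monomial_features(x, M):
    """Feature map varphi(x) = (1, x, ..., x^(M-1)), stacked row-wise for each input."""
    return np.vander(np.atleast_1d(x), N=M, increasing=True)

# Example data: N sites with values from an arbitrary smooth function.
x = np.array([0.0, 0.3, 0.5, 0.8, 1.0])
y = np.sin(2 * np.pi * x)
N = len(x)

Phi = monomial_features(x, M=N)            # square (M = N) feature/basis matrix
alpha_interp = np.linalg.solve(Phi, y)     # alpha = Phi^{-1} y

# Interpolated value at a new input: s(x_tilde) = varphi(x_tilde)^T alpha.
x_tilde = 0.4
s_tilde = monomial_features(x_tilde, M=N) @ alpha_interp

assert np.allclose(Phi @ alpha_interp, y)  # the interpolation conditions (1) hold
```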
Regression with Polynomials
We consider the same setting as above, but now suppose that the $f_j$ correspond to noisy evaluations of some underlying function $f$. We now consider finding a polynomial $s \in \pi_{M-1}(\mathbb{R})$ that satisfies $s(x_j) \approx f_j$ for $j = 1, \dots, N$. To make this precise, we attempt to minimize the sum of squared errors, which yields the ordinary least squares (OLS) regression problem \begin{align} \min_{s \in \pi_{M-1}(\mathbb{R})} \lVert y - s(X) \rVert_2^2 = \min_{\alpha \in \mathbb{R}^M} \lVert y - \Phi \alpha \rVert_2^2, \tag{3} \end{align} where $s(X) := (s(x_1), \dots, s(x_N))^\top \in \mathbb{R}^N$.
The optimality condition for this problem is well-known to be given by the normal equations, \begin{align} \Phi^\top \Phi \alpha = \Phi^\top y. \end{align} If $\Phi$ is of full column rank then $\Phi^\top \Phi$ is invertible, which leads to the unique least squares solution \begin{align} \hat{\alpha}_{\text{OLS}} = (\Phi^\top \Phi)^{-1} \Phi^\top y. \end{align} Note that the full column rank condition in particular requires $N \geq M$, corresponding to the usual regression setting in which the number of observations exceeds the number of variables. If we consider the special case where $\Phi$ itself is invertible (so $N = M$), the unique OLS solution reduces to the interpolation solution: \begin{align} \hat{\alpha}_{\text{OLS}} = \Phi^{-1} (\Phi^\top)^{-1} \Phi^\top y = \Phi^{-1}y = \hat{\alpha}_{\text{interp}}. \end{align}
A regression prediction (assuming the case where
$\hat{\alpha}_{\text{OLS}}$ is unique) at a new location $\tilde{x}$ is given by
\begin{align}
s(\tilde{x}) &= \varphi(\tilde{x})^\top \hat{\alpha}_{\text{OLS}} = \varphi(\tilde{x})^\top (\Phi^\top \Phi)^{-1} \Phi^\top y,
\end{align}
which again reduces to the interpolation analog when $\Phi$ is invertible.
Finally, note that in this context the function $\varphi$ is commonly referred to as a feature map, which generates the corresponding feature (i.e., basis) matrix $\Phi$.
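As a companion sketch for the regression case (again my own illustration), the following fits the OLS coefficients for problem (3) with NumPy; `np.linalg.lstsq` solves the least squares problem directly, which is numerically preferable to forming the normal equations explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

def monomial_features(x, M):
    """Feature map varphi(x) = (1, x, ..., x^(M-1)), stacked row-wise."""
    return np.vander(np.atleast_1d(x), N=M, increasing=True)

# N noisy observations of an (assumed) underlying function, fit in pi_{M-1}(R) with M = 4.
N, M = 50, 4
x = np.sort(rng.uniform(0.0, 1.0, size=N))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

Phi = monomial_features(x, M)                        # N x M feature matrix
alpha_ols, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # minimizes ||y - Phi alpha||_2^2

# Prediction at new locations: s(x_tilde) = varphi(x_tilde)^T alpha_OLS.
x_tilde = np.linspace(0.0, 1.0, 200)
s_tilde = monomial_features(x_tilde, M) @ alpha_ols
```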
Splines
In the previous sections, we uncovered a remarkable property possessed by (one-dimensional) polynomials: for any set of $N$ distinct sites, the space $\pi_{N-1}(\mathbb{R})$ contains a unique interpolating polynomial. However, even in one dimension this does not imply that the problem is solved in a practical sense. As the number of sites grows large, working with very high degree polynomials becomes numerically unstable and can produce interpolants that oscillate wildly between the sites (Runge’s phenomenon).
To prevent the polynomial degree from getting out of hand, an alternative approach is to work in a space consisting of piecewise polynomials of lower degree. A simple example is just linearly interpolating the data, but typically we want to work with smoother functions. Thus, we might impose constraints such as requiring the existence of a certain number of derivatives, which requires ensuring that the polynomial pieces “line up” correctly at the interior knots. We briefly discuss two popular examples of such spaces below.
Cubic Splines
We define the space of cubic splines, denoted $S_3(X)$, to be the set of all piecewise cubic polynomials that are also twice continuously differentiable, \begin{align} S_3(X) := \left\{s \in C^2([a, b]) : s|_{[x_j, x_{j+1}]} \in \pi_3(\mathbb{R}), \ j = 0, 1, \dots, N \right\}. \end{align} We write $s|_{[x_j, x_{j+1}]}$ for the restriction of the function $s$ to the interval $[x_j, x_{j+1}]$. Also recall that by definition $x_0 = a$ and $x_{N+1} = b$. Note that, unlike the polynomial spaces considered above, the space of cubic splines is defined with respect to the specific set of locations $X := \{x_1, \dots, x_N\}$.
The set of cubic splines forms a vector space, which follows from the fact that $C^2([a, b])$ and $\pi_3(\mathbb{R})$ are themselves vector spaces. The dimension of $S_3(X)$ can be established by furnishing a basis for the space, but a back-of-the-envelope calculation can give you a good guess. Indeed, note that the interior sites define $N + 1$ intervals, on each of which an element of $S_3(X)$ is a cubic polynomial. Since $\pi_3(\mathbb{R})$ has dimension $4$, this yields $4(N + 1)$ degrees of freedom. But in neglecting the $C^2$ constraint we have over-counted. This constraint requires that the function values, first derivatives, and second derivatives match at each of the $N$ interior nodes, thus removing $3N$ degrees of freedom, for a total count of $4(N + 1) - 3N = N + 4$. This is indeed the correct dimension of $S_3(X)$, and a basis for this space is given by \begin{align} \left\{1, x, x^2, x^3, (x - x_1)_+^3, \dots, (x - x_N)_+^3 \right\}, \tag{6} \end{align} using the notation $(x)_+ := \max(x, 0)$. In fact, replacing $\{1, x, x^2, x^3\}$ with any basis of $\pi_3(\mathbb{R})$ will do the trick. Cubic splines are sometimes described as the lowest order splines which look smooth to the human eye.
As far as interpolation goes, we can try to follow the same procedure as before, now defining the map $\varphi$ according to the basis (6): \begin{align} \varphi(x) := \left(1, x, x^2, x^3, (x - x_1)_+^3, \dots, (x - x_N)_+^3\right)^\top \in \mathbb{R}^{N + 4}. \end{align} This yields a basis matrix $\Phi \in \mathbb{R}^{N \times (N + 4)}$, so we see that we are a bit overparameterized here; i.e., the system $y = \Phi \alpha$ is not guaranteed to have a unique solution (see the short sketch below). To address this, we must reduce the number of free parameters by four by adding some constraints. This is done in various ways, often by imposing some boundary conditions; this blog post nicely summarizes some options. The primary method of interest here will be constraining $s$ to be linear on the boundary regions $[a, x_1]$ and $[x_N, b]$. This leads to the concept of the natural cubic spline, which is discussed in the next section.
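The snippet below (my own illustration, taking basis (6) to be the truncated power basis written above) simply exhibits the overparameterization: the basis matrix evaluated at the sites has more columns than rows, so $y = \Phi \alpha$ has infinitely many solutions.

```python
import numpy as np

def cubic_spline_features(x, knots):
    """Rows are (1, x, x^2, x^3, (x - x_1)_+^3, ..., (x - x_N)_+^3) for each input x."""
    x = np.atleast_1d(x)[:, None]
    poly = np.hstack([np.ones_like(x), x, x**2, x**3])
    truncated = np.clip(x - knots[None, :], 0.0, None) ** 3
    return np.hstack([poly, truncated])

knots = np.array([0.2, 0.4, 0.6, 0.8])       # N = 4 interior sites
Phi = cubic_spline_features(knots, knots)    # evaluate the basis at the sites themselves

print(Phi.shape)                             # (4, 8): N rows but N + 4 columns
print(np.linalg.matrix_rank(Phi))            # 4: the system y = Phi alpha is underdetermined
```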
Having mainly discussed cubic spline interpolation, we briefly note that regression proceeds similarly. However, in this setting it is common to choose the spline knots to be points other than the $x_j$, which can help rein in the number of parameters when $N$ is large. For example, we might compute empirical percentiles of the data at evenly spaced levels and use these as the knots. Finally, we should note that the basis (6), while conceptually simple, tends to be numerically unstable. For computational purposes, alternative bases (e.g., the B-spline basis) are preferred.
Natural Cubic Splines
As mentioned above, a popular method for adding additional constraints to cubic splines is to enforce linearity on the boundary regions. We thus define the set of natural cubic splines as \begin{align} S_3^{\text{nat}}(X) := \left\{s \in S_3(X) : s|_{[a, x_1]} \in \pi_1(\mathbb{R}), \ s|_{[x_N, b]} \in \pi_1(\mathbb{R}) \right\}. \end{align} Since $s^{\prime\prime}$ is linear on each boundary interval, linearity of $s$ there is the same as requiring $s^{\prime\prime}$ to vanish at both ends of the interval; this space can therefore equivalently be characterized by taking the subset of $S_3(X)$ that satisfies the additional boundary constraints \begin{align} &s^{\prime\prime}(a) = s^{\prime\prime}(x_1) = 0, && s^{\prime\prime}(x_N) = s^{\prime\prime}(b) = 0. \end{align} Note that imposing linearity in the boundary regions is quite a reasonable idea. The curve fit to these regions is essentially extrapolating, since these regions are only constrained by observed data at one of their two endpoints. Thus, enforcing linearity here acts as a sort of regularization, attempting to prevent overfitting in data-sparse regions.
Once again, $S_3^{\text{nat}}(X)$ is a vector space. Trying to count the dimension as before, we now have $2 \cdot 2 = 4$ parameters from the two linear pieces on the boundary regions and $4(N - 1)$ parameters from the cubic pieces on the interior intervals. Given that we are still imposing $3$ constraints at each of the $N$ interior nodes, this yields $4 + 4(N - 1) - 3N = N$ free parameters, as required to guarantee a unique interpolant of the $N$ data points. This is indeed correct, and it can be shown that the uniqueness property is indeed satisfied for $S_3^{\text{nat}}(X)$. A basis for $S_3^{\text{nat}}(X)$ can be obtained by starting with a basis for $S_3(X)$ and then imposing the new linearity constraints. If we start with the cubic spline basis (6), such that any $s \in S_3(X)$ may be represented as \begin{align} s(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{j=1}^{N} \alpha_j (x - x_j)_+^3, \end{align} and impose these constraints, then we find they translate into the coefficient conditions $\beta_2 = \beta_3 = 0$ as well as \begin{align} &\sum_{j=1}^{N} \alpha_j = 0, && \sum_{j=1}^{N} \alpha_j x_j = 0. \tag{7} \end{align}
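For a quick sanity check (not part of Wendland's discussion), SciPy's `CubicSpline` with `bc_type="natural"` constructs a natural cubic spline interpolant with knots at the data sites; the sketch below verifies that it interpolates the data and that the second derivative vanishes at the end knots.

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 0.2, 0.45, 0.7, 1.0])   # N = 5 pairwise distinct, ordered sites
y = np.sin(2 * np.pi * x)

s = CubicSpline(x, y, bc_type="natural")   # natural cubic spline interpolant

assert np.allclose(s(x), y)                # interpolates the data
print(s(x, 2)[[0, -1]])                    # second derivatives ~ 0 at the end knots
```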
Generalizing to Higher Dimensions: Radial Basis Functions
The previous section provided a very brief overview of using polynomials and splines for fitting smooth, nonlinear curves in one-dimensional interpolation and regression problems. It turns out that things get difficult when trying to generalize to higher dimensions. Chapter 2 of Wendland’s book covers multivariate polynomials, so I will not dwell on this here. There are various ways one might approach generalizing these ideas to higher dimensions, and Wendland mentions a few approaches that have some success in two dimensions but become intractable beyond that. The real goal here is a method that works the same in all dimensions, and the key to finding one is to characterize the one-dimensional methods in a way that is amenable to generalization.
With this in mind, we consider the following characterization of natural cubic splines.
Proposition. (Wendland Proposition 1.1) Every natural cubic spline $s$ has a characterization of the form \begin{align} s(x) &= \sum_{j=1}^{N} \alpha_j \phi(\lvert x - x_j \rvert) + p(x), \tag{8} \end{align} where
- $\phi(r) := r^3$, $r \geq 0$.
- $p \in \pi_1(\mathbb{R})$.
- The coefficients $\alpha_j$ satisfy the conditions (7).
Conversely, for every set of pairwise distinct points $x_1, \dots, x_N \in \mathbb{R}$, and every $f_1, \dots, f_N \in \mathbb{R}$, there exists a function $s$ of the form (8) satisfying
- $s(x_j) = f_j$, for $j = 1, \dots, N$.
- The coefficients $\alpha_j$ satisfy the conditions (7).
It is this viewpoint which provides an ideal jumping off point for generalization to higher dimensions. Indeed, consider that the function $x \mapsto \phi(\lvert x - x_j \rvert)$ only depends on the distance $\lvert x - x_j \rvert$. Thus, we might consider the natural generalization $\phi(\lVert x - x_j \rVert_2)$ when $x \in \mathbb{R}^d$. We might similarly replace the univariate polynomial $p \in \pi_1(\mathbb{R})$ in (8) with a low-degree multivariate polynomial $p \in \pi_{M-1}(\mathbb{R}^d)$. We thus consider the generalization of (8) to interpolants of the form \begin{align} s(x) = \sum_{j=1}^{N} \alpha_j \phi(\lVert x - x_j \rVert_2) + p(x), \qquad x \in \mathbb{R}^d, p \in \pi_{M-1}(\mathbb{R}^d) \tag{9} \end{align} subject to the constraint \begin{align} &\sum_{j=1}^{N} \alpha_j q(x_j) = 0, &&\forall q \in \pi_{M-1}(\mathbb{R}^d), \tag{10} \end{align} where $\phi: [0, \infty) \to \mathbb{R}$ is some fixed function.
Notice that the coefficient constraints (10) echo (7), with arbitrary polynomials $q \in \pi_{M-1}(\mathbb{R}^d)$ replacing the monomials $1$ and $x$. Note that while splines provided the motivation for this generalization, the spaces of functions characterized by (9) will no longer consist of piecewise polynomials. They have thus deservedly been given a new name: radial basis functions, which we will refer to by the acronym RBF.
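To see how an interpolant of the form (9) can actually be computed subject to (10), one common approach is to stack the $N$ interpolation conditions and the constraints (10) (imposed for a basis of the polynomial space) into a single augmented linear system. The sketch below is my own illustration under specific choices, namely $\phi(r) = r^3$ and an affine polynomial part ($M = 2$), with made-up data.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 2, 30
X = rng.uniform(size=(N, d))                       # pairwise distinct sites in R^d
y = np.sin(X[:, 0]) + np.cos(3 * X[:, 1])          # arbitrary smooth target values

phi = lambda r: r**3                               # radial function (an assumption for this sketch)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = phi(dists)                                   # N x N matrix of phi(||x_j - x_k||_2)
P = np.hstack([np.ones((N, 1)), X])                # basis (1, x^(1), ..., x^(d)) of pi_1(R^d)
Q = P.shape[1]

# Interpolation conditions plus the constraints (10) in one block system.
A = np.block([[Phi, P], [P.T, np.zeros((Q, Q))]])
sol = np.linalg.solve(A, np.concatenate([y, np.zeros(Q)]))
alpha, beta = sol[:N], sol[N:]

def s(x_new):
    """Evaluate s(x) = sum_j alpha_j phi(||x - x_j||_2) + p(x), with p affine."""
    r = np.linalg.norm(x_new[None, :] - X, axis=-1)
    return phi(r) @ alpha + beta[0] + x_new @ beta[1:]

assert np.allclose([s(x) for x in X], y)           # s interpolates the data
assert np.allclose(P.T @ alpha, 0.0)               # the constraints (10) hold
```

The first $N$ rows of the block system enforce $f_j = s(x_j)$ and the last rows enforce (10); this is just one way of pinning down the $N + Q$ unknown coefficients.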
A First Look at Radial Basis Functions
To start wrapping our heads around RBFs, let’s start by considering the case where the polynomial part is omitted in (9); i.e., $p = 0$. Wendland notes that this case actually is sufficient in many settings. We thus consider the space of functions of the form \begin{align} s(x) = \sum_{j=1}^{N} \alpha_j \phi(\lVert x - x_j \rVert_2), \qquad x \in \mathbb{R}^d, \end{align} noting that the coefficient constraints (10) are not necessary in this case. Things are much simpler here, as we note that the functions under consideration are those in the span of the basis functions $\phi_j := \phi(\lVert \cdot - x_j \rVert_2)$, where $j = 1, \dots, N$. The idea of defining basis functions which depend on the data sites is very much in the spirit of splines. Note that I am using the term basis a bit loosely at this point, given that without additional conditions on $\phi$ there are certainly no guarantees that the functions $\phi_j$ are actually linearly independent (think about the trivial case $\phi \equiv 0$). In any case, we can consider proceeding as we did in the polynomial interpolation example, by first defining the map \begin{align} \varphi(x) := \left(\phi(\lVert x - x_1 \rVert_2), \dots, \phi(\lVert x - x_N \rVert_2)\right)^\top \in \mathbb{R}^N \end{align} and then constructing the interpolation matrix $\Phi \in \mathbb{R}^{N \times N}$ with $j^{\text{th}}$ row equal to $\varphi(x_j)^\top$. Wendland uses a different symbol for this matrix, but I’ll stick with $\Phi$ for now at least.
As before, the interpolation problem reduces to solving the linear system $y = \Phi \alpha$, and hence a unique interpolant is guaranteed precisely when $\Phi$ is non-singular. Note that non-singularity requires, in particular, that the $\phi_j$ are linearly independent. So the first question we must answer is whether we can actually find a function $\phi$ that ensures $\Phi$ is non-singular for any $N$ and any set of pairwise distinct locations $x_1, \dots, x_N \in \mathbb{R}^d$. The answer is yes, with three such examples given by \begin{align} &\phi(r) = e^{-\alpha r^2}, &&\phi(r) = (c^2 + r^2)^{-1/2}, &&&\phi(r) = \sqrt{c^2 + r^2}, \end{align} where $\alpha > 0$ and $c > 0$. These are referred to as the Gaussian, inverse multiquadric, and multiquadric functions, respectively. A particularly important special case with links to kernel methods is when $\Phi$ is also guaranteed to be (symmetric) positive definite. This additional requirement is satisfied by the Gaussian and inverse multiquadric radial functions.
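The following short sketch (again illustrative, with made-up data) builds the interpolation matrix for the Gaussian choice $\phi(r) = e^{-\alpha r^2}$, checks numerically that it is positive definite, and solves for the unique interpolant.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 40
X = rng.uniform(size=(N, d))                          # pairwise distinct sites
y = np.exp(-np.sum(X**2, axis=1))                     # arbitrary target values

phi = lambda r: np.exp(-5.0 * r**2)                   # Gaussian radial function (alpha = 5)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = phi(dists)                                      # interpolation matrix

print(np.linalg.eigvalsh(Phi).min() > 0)              # True: Phi is symmetric positive definite
coef = np.linalg.solve(Phi, y)                        # unique coefficients

s = lambda x_new: phi(np.linalg.norm(x_new[None, :] - X, axis=-1)) @ coef
assert np.allclose([s(x) for x in X], y)              # the unique RBF interpolant
```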
Example: Gaussian Processes
In this section we show that under certain assumptions
the predictive (conditional) mean of a Gaussian
process (GP) can be viewed as an RBF interpolant. We consider a zero mean
GP prior on the function $f$, with $k$
the kernel (i.e., covariance function), a positive definite function. The conditional
distribution of $f$ given the data $\{(x_j, f_j)\}_{j=1}^{N}$ (observed without noise) can be shown to also be Gaussian, with mean function
\begin{align}
\tilde{x} \mapsto k(\tilde{x}, X)\, k(X, X)^{-1} y, \tag{12}
\end{align}
using the notation $k(X, X^\prime)$ to denote the matrix with entries
$k(x_i, x^\prime_j)$.
If we consider the basis functions defined
by $\phi_j(x) := k(x, x_j)$ and the associated
feature map $\varphi(x) := (k(x, x_1), \dots, k(x, x_N))^\top$ and matrix $\Phi := k(X, X)$, then (12) can be re-written as
\begin{align}
s(\tilde{x}) = \varphi(\tilde{x})^\top \Phi^{-1} y,
\end{align}
which is precisely the generic interpolation formula we have seen several times
at this point. Moreover, if we denote $\alpha := \Phi^{-1} y = k(X, X)^{-1} y$, then we can
re-write (12) as
\begin{align}
s(\tilde{x}) = \sum_{j=1}^{N} \alpha_j k(\tilde{x}, x_j).
\end{align}
If we further assume that the kernel
depends only on the distance between locations, so that $k(x, x^\prime) = \phi(\lVert x - x^\prime \rVert_2)$ for some
radial function $\phi$ (this
property of kernels is known as isotropy in the geostatistics literature),
then the GP predictive mean assumes the form
\begin{align}
s(\tilde{x}) = \sum_{j=1}^{N} \alpha_j \phi(\lVert \tilde{x} - x_j \rVert_2),
\end{align}
an RBF interpolant!
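As a quick numerical illustration of this equivalence (my own example, using an isotropic Gaussian-type kernel and made-up data), the snippet below computes the GP predictive mean (12) and the corresponding RBF-form sum and confirms that they agree, and that the mean interpolates the data.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 2, 25
X = rng.uniform(size=(N, d))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])             # noiseless "observations" f_j

phi = lambda r: np.exp(-10.0 * r**2)                      # radial function
kernel = lambda A, B: phi(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))

K = kernel(X, X)                                          # k(X, X)
alpha = np.linalg.solve(K, y)                             # alpha = k(X, X)^{-1} y

X_new = rng.uniform(size=(5, d))
gp_mean = kernel(X_new, X) @ np.linalg.solve(K, y)        # predictive mean (12)
rbf_form = np.array([                                     # sum_j alpha_j phi(||x~ - x_j||_2)
    sum(a * phi(np.linalg.norm(x - xj)) for a, xj in zip(alpha, X))
    for x in X_new
])

assert np.allclose(gp_mean, rbf_form)                     # the two expressions coincide
assert np.allclose(K @ alpha, y)                          # and the mean interpolates the data
```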
Questions
- Am I reading proposition 1.1 correctly? Does the converse imply the same conditions on r and p?
- Clarification on the Sobolev space minimum norm characterization of natural cubic splines.