An Introduction to Scattered Data Approximation

I summarize the first chapter of Holger Wendland's book "Scattered Data Approximation", which I augment with some background on polynomial interpolation and splines.

This post provides a summary of the first chapter of Holger Wendland’s excellent book on scattered data approximation. The introductory chapter nominally provides motivation and applications for the subsequent theory, but it is somewhat of a whirlwind in itself if you lack significant background in approximation theory. I therefore try to augment the material in the chapter with some additional content, which is especially suited for readers coming from more of a statistical background. The notation I use more or less follows that of Wendland.

The General Problem

The theory of scattered data interpolation addresses a very general problem, encompassing both problems of interpolation and smoothing (i.e., regression). The generic setup considers a dataset $\{(x_j, f_j)\}_{j=1}^{N}$ consisting of $N$ sites (also called locations, knots, or design points in different settings) $X := \{x_j\}_{j=1}^{N}$ and corresponding values $f_j$ at each respective site $x_j$. In typical applications, these values are assumed to be generated by some underlying function $f_j = f(x_j)$, perhaps subject to noise $f_j \approx f(x_j)$, but this need not be the case. The noiseless and noisy settings naturally motivate the two primary problems in this domain: interpolation and approximation. I will also refer to the latter as a problem of smoothing or regression, depending on the context. In the two cases, we seek a function $s(\cdot)$ that either interpolates \begin{align} f_j &= s(x_j), && j = 1, \dots, N \end{align} or approximates \begin{align} f_j &\approx s(x_j), && j = 1, \dots, N \end{align} the data. The name “scattered” comes from the lack of assumptions on the sites $X$; we will generally treat them generically, without assuming any sort of special structure. On the other hand, we will require assumptions on the generating process for the $f_j$, which typically means assuming that the underlying function $f(x)$ satisfies some notion of smoothness. We will thus consider the approximation properties of different spaces of smooth functions $S$, with the goal being to identify the function $s \in S$ which is optimal in some sense.

In order to fully specify a concrete problem, we need to define in what space the $x_j$ live. Statisticians would probably jump right to $\mathbb{R}^d$, but in approximation theory there are some fundamental differences between one and multiple dimensions. Therefore, in the next section we start by discussing the concrete case of a one-dimensional input space, which will be used to motivate higher-dimensional generalizations later on.

The One-Dimensional Case

In this section we will assume that $x_j \in \mathbb{R}$; in particular, that $x_j \in (a, b)$ for all $j = 1, \dots, N$ for some $a < b$. It will sometimes be useful to extend the notation as $x_0 := a$ and $x_{N+1} := b$, in which case we will sometimes refer to $X = \{x_j\}_{j=1}^{N}$ as the interior sites to distinguish them from the end points. Note that there are different conventions for this in the literature, but it is important to remember that in our case $N$ denotes the number of sites with corresponding output values $f_j$. The goal in this setting is to find a continuous function $s: [a,b] \to \mathbb{R}$ which either interpolates or approximates the data.

Polynomials

Following Wendland, let $\pi_M(\mathbb{R})$ denote the space of polynomials of degree at most $M$; i.e., the functions $s: \mathbb{R} \to \mathbb{R}$ which admit a representation \begin{align} s(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \cdots + \alpha_M x^M \end{align} for some coefficients $\alpha_0, \dots, \alpha_M \in \mathbb{R}$. $\pi_M(\mathbb{R})$ is indeed a space in the linear algebra sense; it is a vector space of dimension $M+1$. The most obvious basis for this space is the monomial basis $1, x, x^2, \dots, x^M$, but many other bases are commonly used which have nicer theoretical and computational properties.

Interpolation with Polynomials

Let’s now consider solving the interpolation problem using polynomials; that is, we will try to find a function $s \in \pi_{M-1}(\mathbb{R})$ which interpolates the data. I’m considering $M-1$ and not $M$ since the space $\pi_{M-1}(\mathbb{R})$ has dimension $M$, which will be more convenient notationally than having to write $M+1$ all the time. With this choice of function space, the interpolation requirement is that \begin{align} f_j &= \alpha_0 + \alpha_1 x_j + \alpha_2 x_j^2 + \cdots + \alpha_{M-1} x_j^{M-1}, &&\forall j = 1, \dots, N. \end{align}
It is convenient to define \begin{align} \varphi(x) &:= (1, x, x^2, \dots, x^{M-1})^\top \in \mathbb{R}^{M} \newline \alpha &:= (\alpha_0, \alpha_1, \alpha_2, \dots, \alpha_{M-1})^\top \in \mathbb{R}^{M} \end{align} so that we can re-write the above as \begin{align} f_j &= \alpha_0 + \alpha_1 x_j + \alpha_2 x_j^2 + \cdots + \alpha_{M-1} x_j^{M-1} = \varphi(x_j)^\top \alpha, &&\forall j = 1, \dots, N. \tag{1} \end{align}
To succinctly write the equalities for all $j$, let $\Phi \in \mathbb{R}^{N \times M}$ be the matrix with $j^{\text{th}}$ row equal to $\varphi(x_j)^\top$. The requirement (1) is then encoded in the linear system \begin{align} y = \Phi \alpha, \tag{2} \end{align} where we have additionally defined $y := (f_1, \dots, f_N)^\top \in \mathbb{R}^N$. The questions of existence and uniqueness can now be answered by appealing to known properties of linear systems, though it is not in general obvious when (2) will have a solution. A unique solution exists for every right-hand side $y$ if and only if $\Phi$ is non-singular, which in particular requires that $\Phi$ is square; i.e., $M = N$. In the square case, it can be shown that $\Phi$ is non-singular if and only if all of the $x_j$ are distinct, which is the typical case for interpolation. Under the assumption that the $x_j$ are distinct, we thus conclude that the space $\pi_{N-1}(\mathbb{R})$ provides a unique interpolating polynomial, whose coefficients can be computed by $\hat{\alpha}_{\text{interp}} = \Phi^{-1} y$. The interpolated value at a new input $\tilde{x} \in (a, b)$ is then given by \begin{align} s(\tilde{x}) &= \varphi(\tilde{x})^\top \hat{\alpha}_{\text{interp}} = \varphi(\tilde{x})^\top \Phi^{-1} y. \end{align}
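As a quick concrete check, here is a minimal numpy sketch of this procedure on a small hypothetical dataset; in the monomial basis, the matrix $\Phi$ is just a Vandermonde matrix.

```python
import numpy as np

# Hypothetical noiseless data: N = 5 distinct sites and their values.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.sin(2 * np.pi * x)
N = len(x)

# Feature matrix with j-th row phi(x_j) = (1, x_j, ..., x_j^{N-1});
# this is the (square) Vandermonde matrix of the monomial basis.
Phi = np.vander(x, N, increasing=True)

# Interpolation coefficients alpha_interp = Phi^{-1} y (solve rather than invert).
alpha_interp = np.linalg.solve(Phi, y)

# Interpolated values at new inputs: s(x_tilde) = phi(x_tilde)^T alpha_interp.
x_new = np.linspace(0.0, 1.0, 9)
s_new = np.vander(x_new, N, increasing=True) @ alpha_interp

# The interpolation conditions (1) hold at the sites.
print(np.allclose(Phi @ alpha_interp, y))
```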

Regression with Polynomials

We consider the same setting as above, but now suppose that the $f_j$ correspond to noisy evaluations of some underlying function $f$, $f_j = f(x_j) + \epsilon_j$. We now consider finding a polynomial $s \in \pi_{M-1}(\mathbb{R})$ that satisfies $s(x_j) \approx f_j$ for $j = 1, \dots, N$. To make this precise, we minimize the sum of squared errors, which yields the ordinary least squares (OLS) regression problem \begin{align} \text{min}_{s \in \pi_{M-1}(\mathbb{R})} \lVert y - s(X) \rVert_2^2 = \text{min}_{\alpha \in \mathbb{R}^M} \lVert y - \Phi \alpha \rVert_2^2, \tag{3} \end{align} where $s(X) = (s(x_1), \dots, s(x_N))^\top$.

The optimality condition for this problem is well-known to be given by the normal equations, \begin{align} \Phi^\top \Phi \alpha = \Phi^\top y. \end{align} If $\Phi$ has full column rank then $\Phi^\top \Phi$ is invertible, which leads to the unique least squares solution \begin{align} \hat{\alpha}_{\text{OLS}} = (\Phi^\top \Phi)^{-1} \Phi^\top y. \end{align} Note that the full column rank condition in particular requires $M \leq N$, corresponding to the usual regression setting in which the number of observations is at least the number of variables. If we consider the special case where $\Phi$ is invertible, the unique OLS solution reduces to the interpolation solution: \begin{align} \hat{\alpha}_{\text{OLS}} = \Phi^{-1} (\Phi^\top)^{-1} \Phi^\top y = \Phi^{-1}y = \hat{\alpha}_{\text{interp}}. \end{align}

A regression prediction (assuming the case where $\hat{\alpha}_{\text{OLS}}$ is unique) at a new location $\tilde{x} \in (a, b)$ is given by \begin{align} s(\tilde{x}) &= \varphi(\tilde{x})^\top \hat{\alpha}_{\text{OLS}} = \varphi(\tilde{x})^\top (\Phi^\top \Phi)^{-1} \Phi^\top y, \end{align}
which again reduces to the interpolation analog when $\Phi$ is invertible.

Finally, note that in this context the function $\varphi$ is commonly referred to as a feature map, which generates the corresponding feature (i.e., basis) matrix $\Phi$.
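The regression case looks nearly identical in code. Below is a small numpy sketch of (3) on hypothetical noisy data, using a least squares solver rather than forming $(\Phi^\top \Phi)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy data: N = 50 observations, polynomial space of dimension M = 4.
N, M = 50, 4
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Feature matrix Phi in R^{N x M} with j-th row phi(x_j) = (1, x_j, ..., x_j^{M-1}).
Phi = np.vander(x, M, increasing=True)

# OLS coefficients solving (3); equals (Phi^T Phi)^{-1} Phi^T y when Phi has full column rank.
alpha_ols, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Regression predictions at new locations: s(x_tilde) = phi(x_tilde)^T alpha_ols.
x_new = np.linspace(0.0, 1.0, 5)
s_new = np.vander(x_new, M, increasing=True) @ alpha_ols
print(s_new)
```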

Splines

In the previous sections, we uncovered a remarkable property possessed by (one-dimensional) polynomials: for any set of $N$ distinct sites $X$, the space $\pi_{N-1}(\mathbb{R})$ contains a unique interpolating polynomial. However, even in one dimension this does not imply that the problem is solved in a practical sense. As the number of sites grows large, working with very high degree polynomials becomes numerically difficult and can produce undesirable behavior, such as wild oscillations between the sites (Runge's phenomenon).

To prevent the dimensionality from getting out of hand, an alternative approach is to work in a space $S$ consisting of piecewise polynomials of lower degree. A simple example is just linearly interpolating the data, but typically we want to work with smoother functions. Thus, we might impose constraints such as requiring the existence of a certain number of derivatives, which requires ensuring that the polynomial pieces “line up” correctly at the interior knots. We briefly discuss two popular examples of such spaces below.

Cubic Splines

We define the space of cubic splines, denoted $S_3(X)$, to be the set of all piecewise cubic polynomials that are also twice continuously differentiable, \begin{align} S_3(X) := \left\{s \in C^2[a,b] : s|[x_j, x_{j+1}) \in \pi_3(\mathbb{R}), \ 0 \leq j \leq N \right\}. \tag{5} \end{align} We write $s|[c, d)$ for the restriction of the function $s$ to the interval $[c, d)$. Also recall that by definition $x_0 = a$ and $x_{N+1} = b$. Note that, unlike the polynomial spaces considered above, the space of cubic splines is defined with respect to the specific set of locations $X$.

The set of cubic splines forms a vector space, which follows from the fact that $C^2[a,b]$ and $\pi_3(\mathbb{R})$ are themselves vector spaces. The dimension of $S_3(X)$ can be established by furnishing a basis for the space, but a back-of-the-envelope calculation can give you a good guess. Indeed, note that the interior sites define $N+1$ intervals, on each of which $s$ is a cubic polynomial. Since $\pi_{3}(\mathbb{R})$ has dimension $4$, this yields $4(N+1)$ degrees of freedom. But in neglecting the $C^2[a,b]$ constraint we have over-counted. This constraint requires that the function values, first derivatives, and second derivatives match at each of the $N$ interior nodes, thus removing $3N$ degrees of freedom, for a total count of $4(N+1) - 3N = N + 4$. This is indeed the correct dimension of $S_3(X)$, and a basis for this space is given by \begin{align} \{(x - x_j)^3_+\}_{j=1}^{N} \cup \{1, x, x^2, x^3\}, \tag{6} \end{align} using the notation $x_+ := \max(x, 0)$. In fact, replacing $\{1, x, x^2, x^3\}$ with any basis of $\pi_3(\mathbb{R})$ will do the trick. Cubic splines are sometimes described as the lowest order splines which look smooth to the human eye.

As far as interpolation, we can try to follow the same procedure as before, now defining the map $\varphi$ according to the basis (6): \begin{align} \varphi(x) = \left[(x-x_1)^3_+, \dots, (x-x_N)^3_+, 1, x, x^2, x^3 \right]^\top \in \mathbb{R}^{N+4}. \end{align} This yields a basis matrix $\Phi \in \mathbb{R}^{N \times (N+4)}$, so we see that we are a bit overparameterized here; i.e., the system $\Phi \alpha = y$ is not guaranteed to have a unique solution. To address this, we must reduce the number of free parameters by $4$ by adding some constraints. This is done in various ways, often by imposing some boundary conditions; this blog post nicely summarizes some options. The primary method of interest here will be constraining $s(\cdot)$ to be linear on the boundary regions $[a, x_1)$ and $[x_N, b]$. This leads to the concept of the natural cubic spline, which is discussed in the next section.
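To see the overparameterization concretely, here is a small sketch (with hypothetical sites) that assembles the feature matrix for the truncated power basis (6) and checks its rank.

```python
import numpy as np

# Hypothetical interior sites in (a, b) = (0, 1).
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
N = len(x)

def spline_features(t, knots):
    """Feature map for basis (6): the N truncated cubics (t - x_j)_+^3 plus 1, t, t^2, t^3."""
    t = np.atleast_1d(t)
    trunc = np.maximum(t[:, None] - knots[None, :], 0.0) ** 3
    poly = np.stack([np.ones_like(t), t, t**2, t**3], axis=1)
    return np.hstack([trunc, poly])

Phi = spline_features(x, x)             # shape (N, N + 4)
print(Phi.shape)                        # (5, 9)
print(np.linalg.matrix_rank(Phi))       # at most N, so Phi alpha = y is underdetermined
```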

Having mainly discussed cubic spline interpolation, we briefly note that regression proceeds similarly. However, in this setting it is common to choose the spline knots to be points other than $X$, which can help rein in the number of parameters when $N$ is large. For example, we might compute some empirical percentiles of our data and use these (evenly spaced in quantile) points as the knots. Finally, we should note that the basis (6), while conceptually simple, tends to be numerically unstable. For computational purposes, alternative bases (e.g., B-splines) are preferred.

Natural Cubic Splines

As mentioned above, a popular method for adding additional constraints to cubic splines is to enforce linearity on the boundary regions. We thus define the set of natural cubic splines as \begin{align} NS_3(X) := \left\{s \in S_3(X): s|[a,x_1), \ s|[x_N, b] \in \pi_1(\mathbb{R}) \right\}. \end{align} Equivalently, these are the elements of $S_3(X)$ whose second and third derivatives vanish on the boundary regions; in particular, every natural cubic spline satisfies the boundary constraints \begin{align} &s^{\prime\prime}(x_1) = 0, && s^{\prime\prime}(x_N) = 0. \end{align} Note that imposing linearity in the boundary regions is quite a reasonable idea. The curve fit to these regions is essentially extrapolating, since these regions are only constrained by observed data at one of their two endpoints. Thus, enforcing linearity here acts as a sort of regularization, attempting to prevent overfitting in data-sparse regions.

Once again, $NS_3(X)$ is a vector space. Trying to count the dimensions as before, we now have $2 \times 2$ parameters from the two linear bases on the boundary regions and $4(N-1)$ parameters from the cubic bases in the interior regions. Given that we are still imposing $3N$ constraints at the interior nodes, this yields $4 + 4(N-1) - 3N = N$ free parameters, as required to guarantee a unique interpolant. This is indeed correct, and it can be shown that the uniqueness property is satisfied for $NS_3(X)$. A basis for $NS_3(X)$ can be obtained by starting with a basis for $S_3(X)$ and then imposing the new linearity constraints. If we start with the cubic spline basis (6), such that any $s \in S_3(X)$ may be represented as \begin{align} s(x) = \sum_{j=1}^{N} \alpha_j (x - x_j)_+^3 + \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3, \end{align} and impose these constraints, then we find they translate into the coefficient conditions $\beta_2 = \beta_3 = 0$ as well as \begin{align} \sum_{j=1}^{N} \alpha_j = \sum_{j=1}^{N} \alpha_j x_j = 0. \tag{7} \end{align}
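In practice one rarely assembles this constrained basis by hand. As a sketch, scipy's `CubicSpline` with `bc_type='natural'` fits a natural cubic spline interpolant directly; its natural boundary conditions set the second derivative to zero at the end sites.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical data at N = 5 distinct sites.
x = np.array([0.0, 0.2, 0.45, 0.7, 1.0])
y = np.sin(2 * np.pi * x)

# Natural boundary conditions: s''(x_1) = s''(x_N) = 0.
s = CubicSpline(x, y, bc_type='natural')

print(np.allclose(s(x), y))        # interpolates the data
print(s(x[0], 2), s(x[-1], 2))     # second derivatives vanish at the boundary sites
```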

Generalizing to Higher Dimensions: Radial Basis Functions

The previous section provided a very brief overview of using polynomials and splines for fitting smooth, nonlinear curves in one-dimensional interpolation and regression problems. It turns out that things get difficult when trying to generalize to higher dimensions. Chapter 2 of Wendland’s book covers multivariate polynomials, so I will not dwell on this here. There are various ways one might approach generalizing these ideas to higher dimensions, and Wendland mentions a few approaches that have had some success in two dimensions but become intractable beyond that. The real goal here is a method that works the same in all dimensions, and the key to finding one is to characterize the one-dimensional methods in a way that is amenable to generalization.

With this in mind, we consider the following characterization of natural cubic splines.

Proposition. (Wendland Proposition 1.1) Every natural cubic spline has a characterization of the form \begin{align} s(x) = \sum_{j=1}^{N} \alpha_j \phi(\lvert x - x_j \rvert) + p(x), \qquad x \in \mathbb{R}, \tag{8} \end{align} where

  1. $\phi(r) = r^3$, $r \geq 0$.
  2. $p \in \pi_1(\mathbb{R})$.
  3. The $\{\alpha_j\}$ satisfy the conditions (7).

Conversely, for every set of pairwise distinct points $X = \{x_j\}_{j=1}^{N}$, and every $f \in \mathbb{R}^N$, there exists a function $s(\cdot)$ of the form (8) satisfying

  1. $f_j = s(x_j)$, for $j = 1, \dots, N$.
  2. The $\{\alpha_j\}$ satisfy the conditions (7).
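To make the proposition concrete, here is a small numerical sketch (on hypothetical one-dimensional data) that builds an interpolant of the form (8) directly: the interpolation conditions and the side conditions (7) are stacked into a single block linear system for the coefficients $\alpha$ and the linear polynomial $p(x) = \beta_0 + \beta_1 x$.

```python
import numpy as np

# Hypothetical data at N distinct sites.
x = np.array([0.0, 0.2, 0.45, 0.7, 1.0])
f = np.sin(2 * np.pi * x)
N = len(x)

# Blocks: A_{jk} = phi(|x_j - x_k|) with phi(r) = r^3, and P has rows (1, x_j).
A = np.abs(x[:, None] - x[None, :]) ** 3
P = np.column_stack([np.ones(N), x])

# Interpolation conditions f = A alpha + P beta, plus the constraints (7): P^T alpha = 0.
lhs = np.block([[A, P], [P.T, np.zeros((2, 2))]])
rhs = np.concatenate([f, np.zeros(2)])
sol = np.linalg.solve(lhs, rhs)
alpha, beta = sol[:N], sol[N:]

def s(t):
    """Evaluate s(t) = sum_j alpha_j |t - x_j|^3 + beta_0 + beta_1 t."""
    t = np.atleast_1d(t)
    return (np.abs(t[:, None] - x[None, :]) ** 3) @ alpha + beta[0] + beta[1] * t

print(np.allclose(s(x), f))   # the interpolation conditions hold
```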

It is this viewpoint which provides an ideal jumping off point for generalization to higher dimensions. Indeed, consider that the function $\phi(\lvert x - x_j \rvert)$ only depends on the distance $\lvert x - x_j \rvert$. Thus, we might consider the natural generalization $\phi(\lVert x - x_j \rVert_2)$ when $x, x_j \in \mathbb{R}^d$. We might similarly replace the low-degree polynomial $p \in \pi_{1}(\mathbb{R})$ in (8) with a low-degree multivariate polynomial $p \in \pi_{M-1}(\mathbb{R}^d)$. We thus consider the generalization of (8) to interpolants of the form \begin{align} s(x) = \sum_{j=1}^{N} \alpha_j \phi(\lVert x - x_j \rVert_2) + p(x), \qquad x \in \mathbb{R}^d, \ p \in \pi_{M-1}(\mathbb{R}^d), \tag{9} \end{align} subject to the constraint \begin{align} &\sum_{j=1}^{N} \alpha_j q(x_j) = 0, &&\forall q \in \pi_{M-1}(\mathbb{R}^d), \tag{10} \end{align} where $\phi: [0, \infty) \to \mathbb{R}$ is some fixed function.

Notice that the coefficient constraints (10) echo (7), with $\pi_{M-1}(\mathbb{R}^d)$ replacing $\pi_{1}(\mathbb{R})$. While splines provided the motivation for this generalization, the spaces of functions characterized by (9) no longer consist of piecewise polynomials. They have thus deservedly been given a new name: radial basis functions, which we will refer to by the acronym RBF.

A First Look at Radial Basis Functions

To start wrapping our heads around RBFs, let’s consider the case where $p \equiv 0$ in (9). Wendland notes that this case is actually sufficient in many settings. We thus consider the space of functions of the form \begin{align} s(x) = \sum_{j=1}^{N} \alpha_j \phi(\lVert x - x_j \rVert_2), \tag{11} \end{align} noting that the coefficient constraints (10) are not necessary in this case. Things are much simpler here, as the functions under consideration are those in the span of the basis functions $\{\phi_j\}_{j=1}^{N}$, where $\phi_j(x) := \phi(\lVert x - x_j \rVert_2)$. The idea of defining basis functions which depend on the data sites $X$ is very much in the spirit of splines. Note that I am using the term basis a bit loosely at this point, given that without additional conditions on $\phi$ there are certainly no guarantees that the $\phi_j$ are actually linearly independent (think about the trivial case $\phi \equiv 0$). In any case, we can proceed as we did in the polynomial interpolation example, by first defining the map \begin{align} \varphi(x) := \left[\phi_1(x), \phi_2(x), \dots, \phi_N(x) \right]^\top \in \mathbb{R}^N \end{align} and then constructing the interpolation matrix $\Phi \in \mathbb{R}^{N \times N}$ with $j^{\text{th}}$ row equal to $\varphi(x_j)^\top$. Wendland denotes this matrix by $A_{\phi, X} = \{\phi(\lVert x_k - x_j \rVert_2)\}_{1 \leq k,j \leq N}$, but I’ll stick with $\Phi$ for now at least.

As before, the interpolation problem reduces to solving the linear system $\Phi \alpha = y$, and hence a unique interpolant is guaranteed precisely when $\Phi$ is non-singular. Note that non-singularity requires, in particular, that the $\phi_j$ are linearly independent. So the first question we must answer is whether we can actually find a function $\phi: [0, \infty) \to \mathbb{R}$ that ensures $\Phi = A_{\phi, X}$ is non-singular for any $N$ and any set of pairwise distinct locations $X$. The answer is yes, with three such examples given by \begin{align} &\phi(r) = e^{-\alpha r^2}, &&\phi(r) = (c^2 + r^2)^{-1/2}, &&&\phi(r) = \sqrt{c^2 + r^2}, \end{align} where $\alpha > 0$ and $c > 0$. These are referred to as the Gaussian, inverse multiquadric, and multiquadric functions, respectively. A particularly important special case with links to kernel methods is when $\Phi$ is also guaranteed to be positive definite. This additional requirement is satisfied by the Gaussian and inverse multiquadric radial functions.
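Here is a minimal sketch of RBF interpolation with the Gaussian radial function on hypothetical scattered data in $\mathbb{R}^2$; note that the construction is identical in any dimension, since only pairwise distances enter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scattered sites in d = 2 dimensions with noiseless values.
N, d = 25, 2
X = rng.uniform(size=(N, d))
y = np.sin(2 * np.pi * X[:, 0]) * np.cos(2 * np.pi * X[:, 1])

def phi(r, a=50.0):
    """Gaussian radial function phi(r) = exp(-a r^2)."""
    return np.exp(-a * r**2)

# Interpolation matrix (Wendland's A_{phi, X}): entries phi(||x_k - x_j||_2).
Phi = phi(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))

# Coefficients of the interpolant (11).
coef = np.linalg.solve(Phi, y)

def s(X_new):
    """Evaluate s(x) = sum_j coef_j phi(||x - x_j||_2) at the rows of X_new."""
    r = np.linalg.norm(X_new[:, None, :] - X[None, :, :], axis=-1)
    return phi(r) @ coef

# Residual at the sites; small up to the conditioning of Phi.
print(np.abs(s(X) - y).max())
```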

Example: Gaussian Processes

In this section we show that under certain assumptions the predictive (conditional) mean of a Gaussian process (GP) can be viewed as an RBF interpolant. We consider a zero mean GP prior $f \sim \mathcal{GP}(0, k)$ on the function $f$, with $k(\cdot, \cdot)$ the kernel (i.e., covariance function), a positive definite function. The conditional distribution $f|X,y$ can be shown to also be Gaussian, with mean function \begin{align} \mathbb{E}[f(\tilde{x})|X,y] = k(\tilde{x}, X) k(X, X)^{-1}y, \tag{12} \end{align} using the notation $k(X, X^\prime)$ to denote the matrix with entries $k(X, X^\prime)_{jl} = k(x_j, x^\prime_l)$.
If we consider the basis functions $\{\phi_j\}_{j=1}^{N}$ defined by $\phi_j(x) = k(x, x_j)$ and the associated feature map $\varphi$ and matrix $\Phi$, then (12) can be re-written as \begin{align} \mathbb{E}[f(\tilde{x})|X,y] = \varphi(\tilde{x})^\top \Phi^{-1}y, \end{align} which is precisely the generic interpolation formula we have seen several times at this point. Moreover, if we denote $\alpha := k(X, X)^{-1}y$ then we can re-write (12) as \begin{align} k(\tilde{x}, X) k(X, X)^{-1}y = k(\tilde{x}, X)\alpha = \sum_{j=1}^{N} \alpha_j k(\tilde{x}, x_j). \end{align} If we further assume that the kernel depends only on the distance between locations, then it can be viewed as a radial function $k(x, x^\prime) = \phi(\lVert x - x^\prime \rVert)$ (this property of kernels is known as isotropy in the geostatistics literature) and the GP predictive mean assumes the form \begin{align} \mathbb{E}[f(\tilde{x})|X,y] = \sum_{j=1}^{N} \alpha_j \phi(\lVert \tilde{x} - x_j\rVert), \end{align} an RBF interpolant!
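The following sketch (with a hypothetical squared exponential kernel and toy data) computes the GP predictive mean (12) and checks that it coincides with the RBF interpolant written as an explicit sum over the basis functions $\phi_j(\cdot) = k(\cdot, x_j)$.

```python
import numpy as np

# Hypothetical 1D data and an isotropic (squared exponential) kernel.
N, ell = 8, 0.05
X = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * X)

def k(a, b):
    """k(x, x') = phi(|x - x'|) with phi(r) = exp(-r^2 / (2 ell^2))."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * ell**2))

# GP predictive mean (12): k(x_tilde, X) k(X, X)^{-1} y.
x_new = np.linspace(0.0, 1.0, 50)
alpha = np.linalg.solve(k(X, X), y)
gp_mean = k(x_new, X) @ alpha

# The same quantity written as an RBF interpolant: sum_j alpha_j phi(|x_tilde - x_j|).
rbf = sum(alpha[j] * np.exp(-(x_new - X[j]) ** 2 / (2 * ell**2)) for j in range(N))

print(np.allclose(gp_mean, rbf))          # identical predictions
print(np.allclose(k(X, X) @ alpha, y))    # and the mean interpolates the data
```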

Questions

  1. Am I reading proposition 1.1 correctly? Does the converse imply the same conditions on r and p?
  2. Clarification on the Sobolev space minimum norm characterization of natural cubic splines.