Basis Expansions for Black-Box Function Emulation

A discussion of a popular output dimension reduction strategy for emulating multi-output functions.

Setup

Suppose we are working with a function \begin{align} \mathcal{G}: \mathcal{U} \subset \mathbb{R}^d \to \mathbb{R}^p, \tag{1} \end{align} where the output dimension $p$ is presumed to be “large”, potentially in the hundreds or thousands. We treat $\mathcal{G}$ generically as a black-box function, but in common applications of interest it takes the form of an expensive computer simulation model. We will therefore use the terms black-box function, computer model, and simulator interchangeably throughout this post. Our primary goal is to construct a model that approximates the map $u \mapsto \mathcal{G}(u)$. We will refer to such a model as an emulator or surrogate model. The implicit assumption here is that computing $\mathcal{G}(u)$ is quite expensive, so the idea is to replace $\mathcal{G}$ with a computationally cheaper approximation. If you don’t care about emulating expensive computer models, you can also view this generically as a regression (or interpolation) problem with a very high-dimensional output space. To further elucidate the connection, suppose that we evaluate the function at a set of points $u_1, \dots, u_n \in \mathcal{U}$, resulting in the input-output pairs $\{u_i, \mathcal{G}(u_i)\}_{i=1}^{n}$. We can now treat these pairs as training examples that can be used to fit a predictive model. From this point of view, the primary distinction from more traditional regression is that in this case we get to choose the input points $u_i$ at which we evaluate the function.
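To make the setup concrete, here is a minimal sketch in Python (NumPy) of what such training data might look like. The simulator `G` below is a hypothetical toy stand-in for an expensive model, returning a curve evaluated at $p$ grid points; all names and sizes are purely illustrative.

```python
import numpy as np

# Toy stand-in for an expensive simulator G: R^d -> R^p, whose output is a
# curve evaluated on a grid of p points. Purely illustrative.
d, p = 3, 500
t = np.linspace(0.0, 1.0, p)

def G(u):
    """Map a d-dimensional input to a length-p output vector."""
    return u[0] * np.sin(2 * np.pi * t) + u[1] * np.cos(4 * np.pi * t) + u[2] * t**2

# Evaluate the simulator at n chosen inputs to form training pairs {u_i, G(u_i)}.
n = 50
rng = np.random.default_rng(0)
U = rng.uniform(size=(n, d))             # design inputs, one per row
G_out = np.stack([G(u) for u in U])      # corresponding outputs, shape (n, p)
```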

The methods we discuss below attempt to address two primary challenges:

  1. Function evaluations $\mathcal{G}(u)$ are expensive, and thus we seek to minimize the number of points at which $\mathcal{G}$ is evaluated.
  2. The output space of $\mathcal{G}$ is very high-dimensional.

The second challenge can be problematic for standard regression methods, which are often tailored to scalar-valued outputs. The obvious solution here might be to fit a separate model to predict each individual output; i.e., a model to emulate the map $u \mapsto \mathcal{G}_j(u)$ for each $j=1, \dots, p$.
With parallel computing resources, such an approach might even be feasible for very large values of $p$. However, larger numbers of outputs typically come with more structure; e.g., the outputs may consist of a time series or spatial fields. The independent output-by-output regression approach completely fails to leverage this structure. The method discussed in this post seeks to take advantage of such structure by finding a set of basis vectors that explain the majority of the variation in the outputs. We proceed by first generically discussing basis representations of the output space, and how such structure can be leveraged to address the two challenges noted above. With the general method defined, we conclude by discussing details for a specific choice of basis approximation (principal components analysis) and a specific choice of emulator model (Gaussian process regression).

Basis Representation

From a very generic perspective, the method we discuss here can be thought of as dimension reduction applied to the output space of a regression problem. It is perhaps more typical to see dimensionality reduction applied to the input space in such settings, with principal component regression being the obvious example. In this section, we start by discussing a low-dimensional basis representation of the output space, and then explore how such a representation can be leveraged in solving the regression problem. Throughout this introduction, I will assume that the output of $\mathcal{G}$ is centered; this is made explicit later on when considering a concrete application of PCA.

A Basis Representation of the Output Space

Let’s start by considering approximately representing vectors in the range of $\mathcal{G}$, denoted $\mathcal{R}(\mathcal{G})$, with respect to a set of $r \ll p$ orthonormal basis vectors $\{b_1, \dots, b_r\} \subset \mathbb{R}^p$. Given such a set of vectors, we can approximate $g \in \mathcal{R}(\mathcal{G})$ by its projection onto the subspace $\text{span}(b_1, \dots, b_r)$: $$\hat{g} := \sum_{j=1}^{r} \langle g, b_j\rangle b_j. \tag{2}$$ If we stack the basis vectors as columns in a matrix $B \in \mathbb{R}^{p \times r}$ then we can write this projection compactly as $$\hat{g} = \sum_{j=1}^{r} \langle g, b_j\rangle b_j = \sum_{j=1}^{r} (b_j b_j^\top)g = BB^\top g. \tag{3}$$ We see that $BB^\top$ is the projection matrix that projects onto the span of the basis vectors. With regards to dimensionality reduction, the benefit here is that the simulator output can now be (approximately) represented using the $r \ll p$ numbers $B^\top g$. We can now ask the question: how do we find the basis vectors $B$? If we are given a set of vectors $g_1, \dots, g_n \in \mathcal{R}(\mathcal{G})$, we can take an empirical approach and try to use these examples to determine a $B$ that is optimal in some well-defined sense. Assuming that $\mathcal{R}(\mathcal{G}) \subseteq \mathbb{R}^p$ is indeed a subspace and we define “optimal” in an average squared error sense, the problem we have laid out here is exactly that of principal components analysis (PCA), a topic I discuss in depth in this post. The only difference is that we are applying PCA on the subspace $\mathcal{R}(\mathcal{G})$. At this point, we should emphasize that in practice $\mathcal{R}(\mathcal{G})$ will often not be a subspace. Computer simulations may produce outputs that are subject to certain constraints, and thus $\mathcal{R}(\mathcal{G})$ may represent a more complicated subset of $\mathbb{R}^p$. In these cases, one can still typically apply the PCA algorithm to obtain $B$, but the result may be sub-optimal. Alternate methods of basis construction may be warranted depending on the problem at hand.
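As a quick illustration of (2)–(3), the following sketch projects an output vector onto a set of orthonormal basis vectors; the basis here is a random placeholder (obtained from a QR factorization), purely to show the mechanics.

```python
import numpy as np

p, r = 500, 5
rng = np.random.default_rng(1)

# Placeholder orthonormal basis: columns of B satisfy B.T @ B = I_r.
B, _ = np.linalg.qr(rng.standard_normal((p, r)))

g = rng.standard_normal(p)   # stand-in for a simulator output vector
w = B.T @ g                  # r-dimensional summary of g
g_hat = B @ w                # projection B B^T g, as in (3)

print(w.shape)               # (5,): g is summarized by r << p numbers
```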

Linking the Basis Representation with the Input Parameters

In the previous subsection, we considered approximating vectors in the range of the simulator with respect to a set of basis vectors $B$. However, recall that our underlying goal here is to approximate the map $u \mapsto \mathcal{G}(u)$. We thus need to consider how to leverage the basis representation of the output space in achieving this goal. Assuming we have already constructed the basis vectors $B$, the above map can be approximated as $$u \mapsto \sum_{j=1}^{r} \langle \mathcal{G}(u), b_j\rangle b_j = B [B^\top \mathcal{G}(u)]. \tag{4}$$ In words: feed $u$ through the simulator and project the resulting output onto the low-dimensional subspace spanned by the basis vectors. Note that $B^\top \mathcal{G}(u)$ stores the $r$ weights defining the projection of $\mathcal{G}(u)$ onto the subspace generated by $B$, thus providing a low-dimensional summary of the simulator output. Let’s introduce the notation \begin{align} w(u) &:= B^\top \mathcal{G}(u) = \left[w_1(u), \dots, w_r(u) \right]^\top \in \mathbb{R}^r, && w_j(u) := \langle \mathcal{G}(u), b_j \rangle \end{align} to denote these weights. The basis function approximation to the simulator can thus be written as $$\hat{\mathcal{G}}_r(u) := \sum_{j=1}^{r} w_j(u)b_j = Bw(u) \in \mathbb{R}^p. \tag{5}$$ At this point, this isn’t helpful, since the expensive simulator still needs to be run every time $\hat{\mathcal{G}}_r(u)$ is evaluated. To address this, we now turn back to the idea of using emulators. Recall that such an approach was originally hindered by the high-dimensional output space of the simulator. However, under the approximation $\hat{\mathcal{G}}_r(u)$, the dependence on $u$ has been reduced to $w(u)$, which effectively reduces the output dimension to $r$. The idea is thus to use some sort of statistical model to emulate the map $u \mapsto w(u)$. Suppressing all details for now, let’s suppose we have fit such a model $w^*(u)$. We can now plug $w^*(u)$ in place of $w(u)$ in (5) to obtain the approximation $$\hat{\mathcal{G}}^*_r(u) := \sum_{j=1}^{r} w^*_j(u)b_j = Bw^*(u) \in \mathbb{R}^p. \tag{6}$$ This approximation no longer requires running the full simulator, since evaluating $\hat{\mathcal{G}}^*_r(u)$ just requires (1) computing the emulator prediction at $u$; and (2) applying $B$ to the emulator prediction. It is worth emphasizing how this approach compares to the direct emulation method. In place of directly trying to approximate the map from $u$ to the model outputs $\mathcal{G}(u)$, we are now considering approximating the map from $u$ to inner products of $\mathcal{G}(u)$ with a small number of basis vectors. The hope is that these inner products are sufficient to capture the majority of information in the model response.
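The plug-in approximation (6) then looks as follows in code. The function `w_star` is a hypothetical fitted emulator for the weight map $u \mapsto w(u)$; how it is actually fit is deferred to the GP section below, and the placeholder objects here are illustrative only.

```python
import numpy as np

def emulate(u, B, w_star):
    """Cheap approximation to G(u): predicted weights mapped back to outputs, eq. (6)."""
    return B @ w_star(u)                               # B w*(u), an element of R^p

# Usage with placeholder objects (both hypothetical):
p, r, d = 500, 5, 3
rng = np.random.default_rng(2)
B, _ = np.linalg.qr(rng.standard_normal((p, r)))       # stand-in basis
w_star = lambda u: np.ones(r)                          # stand-in fitted weight emulator
prediction = emulate(np.zeros(d), B, w_star)           # length-p emulator prediction
```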

The General Emulation Model

In the preceding subsections, we introduced the idea of representing the output space (range) of $\mathcal{G}$ with respect to a low-dimensional basis in order to facilitate emulation of the map $u \mapsto \mathcal{G}(u)$. For concreteness, we considered an orthogonal basis, whereby approximations of $\mathcal{G}(u)$ take the form of orthogonal projections onto the basis vectors. In this section, we take a step back and define the general model, which encompasses basis methods beyond the orthogonal projection framework.

The Basis Function Emulation Model. Given a set of vectors $\{b_1, \dots, b_r\} \subset \mathbb{R}^{p}$, we refer to a decomposition of the form $$\mathcal{G}(u) = \sum_{j=1}^{r} w_j(u) b_j + \epsilon(u) \tag{7}$$ as the basis function emulation model. We write $w^*_j(u)$ to denote an emulator model that approximates the map $u \mapsto w_j(u)$.

The basis function emulation model decomposes the computer model output $\mathcal{G}(u)$ into

  1. a piece that can be explained by a linear combination of $r$ basis functions that are independent of $u$.
  2. the residual $\epsilon(u)$, representing all variation unaccounted for by the basis functions.

We emphasize that the basis functions are independent of the input $u$; the effect of the inputs is restricted to the coefficients $w_j(u)$, with unaccounted-for $u$-dependence absorbed by the residual term $\epsilon(u)$. As noted in the previous section, if we opt for an orthogonal projection approach, then the true weights assume the form $$w_j(u) = \langle \mathcal{G}(u), b_j \rangle, \tag{8}$$ but the general model (7) allows for other approaches as well. Under different decomposition strategies, the true weights may not be given by the inner products (8). Nonetheless, we can still consider applying statistical models to approximate the underlying weight maps $u \mapsto w_j(u)$, regardless of what form these maps may take.
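For instance, one alternative (not developed further in this post) is to define the weights by least squares with respect to a basis whose vectors need not be orthonormal; the sketch below contrasts this with the inner-product weights (8). All quantities are illustrative placeholders.

```python
import numpy as np

p, r = 200, 4
rng = np.random.default_rng(3)

B = rng.standard_normal((p, r))       # a generic (non-orthonormal) basis
g = rng.standard_normal(p)            # stand-in simulator output

# The orthogonal-projection weights (8) would be B.T @ g, which is only the
# optimal linear-combination fit when B has orthonormal columns. For a general
# B, one choice is the least-squares weights minimizing ||g - B w||^2:
w_ls, *_ = np.linalg.lstsq(B, g, rcond=None)
residual = g - B @ w_ls               # epsilon in the decomposition (7)
```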

Concrete Details

Having laid out the general model, we now provide concrete details for specific choices of the basis construction and the emulator model. For the former we consider the popular PCA/SVD approach, which we have already hinted at above. For the latter, we consider the use of Gaussian processes (GPs). The combination of these two choices was first presented in the seminal paper Higdon et al (2008). I have a whole in-depth post on PCA, so I will assume general background knowledge on this topic.

Constructing the PCA Basis

Let us consider the construction of the basis vectors $b_j$ using a principal components analysis (PCA) approach. Depending on the field, this strategy might also be termed singular value decomposition (SVD), proper orthogonal decomposition (POD), or empirical orthogonal functions (EOF).

Initial Design

In the absence of prior knowledge about the $b_j$, PCA takes a purely empirical approach; the simulator $\mathcal{G}(u)$ is evaluated at a variety of different inputs $u_i$ in order to understand the patterns of variability induced in the model outputs. We refer to the resulting input-output pairs $$\{u_i, \mathcal{G}(u_i)\}, \qquad i = 1, \dots, n \tag{9}$$ as the design or design points. In practice, the availability of parallel computing resources typically means that model simulations can be run in parallel, thus reducing the computational cost of generating the design. The input design points $u_i$ are often chosen to vary “uniformly” over the input space $\mathcal{U}$ in some sense. More generally, we might assume that the inputs are governed by some prior distribution $$u \sim \rho, \tag{10}$$ which might encode prior knowledge about model parameters or serve to place higher weight on regions of the input space that are deemed more important. In this case, the design inputs might be sampled independently according to $\rho$. In any case, the spread of the inputs ought to be chosen so that the set of corresponding outputs $\mathcal{G}(u_i)$ is representative of variation in the simulator output under typical use cases.
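As a sketch of design generation, one common space-filling choice is a Latin hypercube sample, shown below with `scipy.stats.qmc` and the toy simulator from the first snippet. The input bounds, design size, and simulator are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import qmc

d, n, p = 3, 50, 500

# Space-filling design on the (assumed) input domain U = [-1, 1]^d.
sampler = qmc.LatinHypercube(d=d, seed=0)
U_unit = sampler.random(n=n)                        # n points in [0, 1)^d
U = qmc.scale(U_unit, l_bounds=[-1.0] * d, u_bounds=[1.0] * d)

# Run the (toy) simulator at each design input; in practice this step is
# typically parallelized across inputs.
t = np.linspace(0.0, 1.0, p)
def G(u):
    return u[0] * np.sin(2 * np.pi * t) + u[1] * np.cos(4 * np.pi * t) + u[2] * t**2

G_out = np.stack([G(u) for u in U])                 # design outputs, shape (n, p)
```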

PCA

We denote the design outputs by $g_i := \mathcal{G}(u_i)$, $i = 1, \dots, n$, and consider stacking these vectors into the rows of a matrix $G \in \mathbb{R}^{n \times p}$. Define the empirical mean $$\overline{g} := \frac{1}{n} \sum_{i=1}^{n} g_i. \tag{11}$$ Since PCA is typically defined with respect to a centered data matrix, let $G^c \in \mathbb{R}^{n \times p}$ denote the analog of $G$ constructed using the centered outputs $g_i^c := g_i - \overline{g}$. In other words, $G^c$ results from subtracting $\overline{g}$ from each row of $G$. The rows of $G^c$ provide a summary of the variation in the simulator outputs due to variation in the inputs. Each column of $G^c$ represents a single dimension in the output space. We seek to identify a small number of vectors $b_j \in \mathbb{R}^p$ such that the observed outputs $g_i$ can be explained as a linear combination of these vectors. PCA produces these vectors by computing an eigendecomposition of the (unnormalized) empirical covariance matrix $$\hat{C}_g := \sum_{i=1}^{n} (g_i - \overline{g})(g_i - \overline{g})^\top = (G^c)^\top G^c, \tag{12}$$ and defining $B$ to consist of the $r$ dominant eigenvectors of $\hat{C}_g$. We denote the eigendecomposition by $$\hat{C}_g = V \Lambda V^\top, \tag{13}$$ where $\Lambda := \text{diag}\{\lambda_1, \dots, \lambda_p\}$ contains the eigenvalues sorted in decreasing order of their magnitude, with $V$ storing the respective normalized eigenvectors $v_1, \dots, v_p$ as columns. If we truncate to the dominant $r$ eigenvectors, $$\hat{C}_g \approx V_r \Lambda_r V_r^\top, \tag{14}$$ then we define our basis vectors by \begin{align} &B := V_r, &&b_j := v_j, \qquad j = 1, \dots, r. \tag{15} \end{align} Note that $\hat{C}_g$ is positive semidefinite, so $\lambda_j \geq 0$ for all $j$. We summarize these ideas below.
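In code, the PCA basis can be obtained from an SVD of the centered output matrix (equivalent to the eigendecomposition of $(G^c)^\top G^c$). The sketch below uses a synthetic stand-in for the design-output matrix; in practice it would be the `G_out` array built from the design above.

```python
import numpy as np

# Synthetic stand-in for the (n, p) design-output matrix G.
n, p = 50, 500
rng = np.random.default_rng(0)
G_out = rng.uniform(-1, 1, size=(n, 3)) @ rng.standard_normal((3, p))

g_bar = G_out.mean(axis=0)              # empirical mean, eq. (11)
G_c = G_out - g_bar                     # centered output matrix G^c

# SVD of G^c: right singular vectors are the eigenvectors of (G^c)^T G^c, and
# squared singular values are the eigenvalues lambda_j.
_, s, Vt = np.linalg.svd(G_c, full_matrices=False)
lam = s**2                              # eigenvalues in decreasing order
r = 5                                   # truncation level (illustrative)
B = Vt[:r].T                            # basis vectors b_j as columns, shape (p, r)
```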

PCA Approximation. Given the design $\{u_i, \mathcal{G}(u_i)\}_{i=1}^{n}$, the PCA-induced approximation to $\mathcal{G}(u)$ is given by $$\hat{\mathcal{G}}_r(u) := \overline{g} + \sum_{j=1}^{r} \langle \mathcal{G}(u)-\overline{g}, v_j\rangle v_j, \tag{16}$$ where $\overline{g}$ is defined in (11) and $v_1, \dots, v_p$ denote the normalized eigenvectors of the matrix defined in (12). The residual term is then $$\epsilon(u) = \mathcal{G}(u) - \hat{\mathcal{G}}_r(u) = \sum_{j=r+1}^{p} \langle \mathcal{G}(u)-\overline{g}, v_j\rangle v_j. \tag{17}$$
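Continuing the PCA snippet above (with `G_out`, `g_bar`, `B`, `lam`, and `r` as defined there), here is what the truncated reconstruction (16) and its residual look like for one of the design outputs; the fraction of variance retained by the leading $r$ eigenvalues is a common diagnostic for choosing $r$.

```python
# Continuation of the PCA snippet: G_out, g_bar, G_c, B, lam, r defined there.
g = G_out[0]                              # one simulator output
w = B.T @ (g - g_bar)                     # r weights <g - g_bar, v_j>
g_hat = g_bar + B @ w                     # truncated reconstruction, eq. (16)
eps = g - g_hat                           # residual, eq. (17)

# Proportion of output variance captured by the leading r components.
var_explained = lam[:r].sum() / lam.sum()
```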

Gaussian Process Emulators

The above section shows that the PCA approach yields the basis vectors $b_j = v_j$ with associated weights $w_j(u) = \langle \mathcal{G}(u)-\overline{g}, v_j\rangle$. In this section we consider approximating the maps $u \mapsto w_j(u)$ with Gaussian processes (GPs). The specific features of the PCA basis construction provide useful information in designing a reasonable GP model. In particular, we will end up with a set of independent GPs, each with prior mean zero and prior marginal variance (scale) one. We justify each of these choices in turn, and then summarize the GP prior below. We then conclude by deriving how the GP predictive distributions induce a stochastic emulator for $\mathcal{G}(u)$.

Zero Prior Mean

Start by noting that the sample mean of the weights $\{w_j(u_i)\}_{i=1}^{n}$ is zero for each $j = 1, \dots, r$. Indeed, $$\frac{1}{n} \sum_{i=1}^{n} w_j(u_i) = \frac{1}{n} \sum_{i=1}^{n} \langle g^c_i, v_j \rangle = \left\langle \frac{1}{n} \sum_{i=1}^{n} g^c_i, v_j \right\rangle = \langle 0, v_j \rangle = 0. \tag{18}$$ The weight functions are thus centered about zero with respect to the design points. If the $g_i$ are indicative of typical variation in the outputs, then the $w_j(u)$ should also be approximately centered with respect to $u \sim \rho$, not just over the design points $u_i$. Therefore, it appears reasonable to define zero-mean prior distributions on the $w_j(u)$.
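This is easy to confirm numerically with the design weights from the PCA snippet (a continuation; `G_c` and `B` as defined there).

```python
# Continuation of the PCA snippet: rows of W are the weight vectors w(u_i).
W = G_c @ B                               # shape (n, r), W[i, j] = <g_i^c, v_j>
print(W.mean(axis=0))                     # ~0 for each j, as in eq. (18)
```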

Independent GPs

Next, we justify the choice to model each $w_j(u)$ separately as an independent GP. For this, we point to a result proved in the PCA post linked above; namely, if we define $w^i := [w_1(u_i), \dots, w_r(u_i)]^\top \in \mathbb{R}^r$, then the vectors $w^1, \dots, w^n$ have empirical covariance $$\hat{C}_w := \frac{1}{n-1} \sum_{i=1}^{n} w^i(w^i)^\top = \frac{1}{n-1}\Lambda_r, \tag{19}$$ with $\Lambda_r$ given in (14). In words, the weight vectors have zero sample covariance across dimensions. Similar in spirit to the zero-mean case, we hope that this fact approximately holds true beyond the sample of design points. Thus, modeling the processes $w_j(u)$ independently for each $j$ seems to be a reasonable choice.
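Again, this can be checked numerically (continuing the snippet above): the sample covariance of the rows of `W` is diagonal, with entries $\lambda_j/(n-1)$.

```python
import numpy as np

# Continuation: W of shape (n, r), lam, and r as defined in the previous snippets.
C_w = W.T @ W / (len(W) - 1)                                # empirical covariance, eq. (19)
print(np.allclose(C_w, np.diag(lam[:r] / (len(W) - 1))))    # True: diagonal Lambda_r/(n-1)
```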

Unit Scale

Finally, we justify the decision to define a unit-scale prior distribution for each $w_j(u)$. From (19), we see that the empirical variance of $w_j(u_1), \dots, w_j(u_n)$ is $\lambda_j/(n-1)$. To make things simpler, let’s define the rescaled weights $\tilde{w}_j(u) := \sqrt{(n-1)/\lambda_j}\, w_j(u)$ so that $\tilde{w}_j(u_1), \dots, \tilde{w}_j(u_n)$ have unit sample variance. A zero-mean, unit-variance GP prior is then a natural default for each rescaled weight process $\tilde{w}_j(u)$, with predictions mapped back to the original scale by multiplying by $\sqrt{\lambda_j/(n-1)}$.
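Putting the pieces together, here is a hedged end-to-end sketch of the resulting emulator: independent GPs are fit to the rescaled weights (using scikit-learn's `GaussianProcessRegressor` as one possible modeling choice), and predictions are mapped back to the output space via (6). The toy simulator, kernel, design, and truncation level are all illustrative assumptions rather than prescriptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# --- Toy simulator and design (illustrative stand-ins) ---------------------
d, p, n, r = 3, 500, 50, 5
t = np.linspace(0.0, 1.0, p)

def G(u):
    return u[0] * np.sin(2 * np.pi * t) + u[1] * np.cos(4 * np.pi * t) + u[2] * t**2

U = rng.uniform(-1.0, 1.0, size=(n, d))           # design inputs
G_out = np.stack([G(u) for u in U])               # design outputs, (n, p)

# --- PCA basis from the centered design outputs ----------------------------
g_bar = G_out.mean(axis=0)
G_c = G_out - g_bar
_, s, Vt = np.linalg.svd(G_c, full_matrices=False)
lam = s**2
B = Vt[:r].T                                      # (p, r) basis vectors

# --- Rescaled weights and independent GP emulators -------------------------
W = G_c @ B                                       # (n, r) design weights
scale = np.sqrt(lam[:r] / (n - 1))                # sample std. dev. of each weight
W_tilde = W / scale                               # unit-variance weights

gps = []
for j in range(r):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-8)
    gp.fit(U, W_tilde[:, j])                      # one GP per weight process
    gps.append(gp)

# --- Emulator prediction for a new input, mapped back to output space ------
def emulate(u_new):
    w_tilde_pred = np.array([gp.predict(u_new.reshape(1, -1))[0] for gp in gps])
    return g_bar + B @ (scale * w_tilde_pred)     # eq. (6), plus the output mean

u_new = rng.uniform(-1.0, 1.0, size=d)
print(np.max(np.abs(emulate(u_new) - G(u_new))))  # emulation error on the toy model
```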

References

Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008). Computer Model Calibration Using High-Dimensional Output. Journal of the American Statistical Association, 103(482), 570–583.