Basis Expansions for Black-Box Function Emulation

A discussion of the popular output dimension reduction strategy for emulating multi-output functions.

Setup

Suppose we are working with a function
\begin{align}
\mathcal{G}: \mathcal{U} \subset \mathbb{R}^d \to \mathbb{R}^p, \tag{1}
\end{align}
where the output dimension $p$ is presumed to be “large”, potentially in the hundreds or thousands. We treat $\mathcal{G}$ generically as a black-box function, but in common applications of interest it takes the form of an expensive computer simulation model. We will therefore use the terms black-box function, computer model, and simulator interchangeably throughout this post. Our primary goal is to construct a model that approximates the map $u \mapsto \mathcal{G}(u)$. We will refer to such a model as an emulator or surrogate model. The implicit assumption here is that computing $\mathcal{G}(u)$ is quite expensive, so the idea is to replace $\mathcal{G}$ with a computationally cheaper approximation. If you don’t care about emulating expensive computer models, you can also view this generically as a regression (or interpolation) problem with a very high-dimensional output space. To further elucidate the connection, suppose that we evaluate the function at a set of points $u_1, \dots, u_n \in \mathcal{U}$, resulting in the input-output pairs $\{u_i, \mathcal{G}(u_i)\}_{i=1}^{n}$. We can now treat these pairs as training examples that can be used to fit a predictive model. From this point of view, the primary distinction from more traditional regression is that in this case we get to choose the input points $u_i$ at which we evaluate the function.
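
To make the setup concrete, here is a minimal Python/NumPy sketch of assembling such a set of input-output pairs. The `simulator` function here is a cheap toy stand-in rather than an expensive computer model, and the dimensions are arbitrary placeholders.

```python
import numpy as np

# Hypothetical simulator: maps a d-dimensional input to a p-dimensional output.
# Here we use a cheap stand-in (a parameterized curve observed at p grid points)
# purely to make the shapes concrete.
d, p = 3, 500
t = np.linspace(0.0, 1.0, p)

def simulator(u):
    """Toy stand-in for G: R^d -> R^p."""
    return u[0] * np.sin(2 * np.pi * t) + u[1] * t**2 + u[2]

# Evaluate at n design points and stack the outputs row-wise.
n = 50
rng = np.random.default_rng(0)
U = rng.uniform(size=(n, d))              # design inputs, one per row
G = np.stack([simulator(u) for u in U])   # design outputs, shape (n, p)
```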

The methods we discuss below attempt to address two primary challenges:

  1. Function evaluations $\mathcal{G}(u)$ are expensive, and thus we seek to minimize the number of points at which $\mathcal{G}$ is evaluated.
  2. The output space of $\mathcal{G}$ is very high-dimensional.

The second challenge can be problematic for standard regression methods, which are often tailored to scalar-valued outputs. The obvious solution here might be to fit a separate model to predict each individual output; i.e., a model to emulate the map $u \mapsto \mathcal{G}_j(u)$ for each $j=1, \dots, p$.
With parallel computing resources, such an approach might even be feasible for very large values of $p$. However, larger numbers of outputs typically come with more structure; e.g., the outputs may consist of a time series or spatial fields. The independent output-by-output regression approach completely fails to leverage this structure. The method discussed in this post seeks to take advantage of such structure by finding a set of basis vectors that explain the majority of the variation in the outputs. We proceed by first generically discussing basis representations of the output space, and how such structure can be leveraged to address the two challenges noted above. With the general method defined, we conclude by discussing details for a specific choice of basis approximation (principal components analysis) and a specific choice of emulator model (Gaussian process regression).

Basis Representation

From a very generic perspective, the method we discuss here can be thought of as dimension reduction applied to the output space of a regression problem. It is perhaps more typical to see dimensionality reduction applied to the input space in such settings, with principal component regression being the obvious example. In this section, we start by discussing a low-dimensional basis representation of the output space, and then explore how such a representation can be leveraged in solving the regression problem. Throughout this introduction, I will assume that the output of $\mathcal{G}$ is centered; this assumption is made explicit later on when considering a concrete application of PCA.

A Basis Representation of the Output Space

Let’s start by considering approximately representing vectors in the range of $\mathcal{G}$, denoted $\mathcal{R}(\mathcal{G})$, with respect to a set of $r \ll p$ orthonormal basis vectors $\{b_1, \dots, b_r\} \subset \mathbb{R}^p$. Given such a set of vectors, we can approximate $g \in \mathcal{R}(\mathcal{G})$ by its projection onto the subspace $\text{span}(b_1, \dots, b_r)$:
\begin{align}
\hat{g} := \sum_{j=1}^{r} \langle g, b_j\rangle b_j. \tag{2}
\end{align}
If we stack the basis vectors as columns in a matrix $B \in \mathbb{R}^{p \times r}$ then we can write this projection compactly as
\begin{align}
\hat{g} = \sum_{j=1}^{r} \langle g, b_j\rangle b_j = \sum_{j=1}^{r} (b_j b_j^\top)g = BB^\top g. \tag{3}
\end{align}
We see that $BB^\top$ is the projection matrix that projects onto the span of the basis vectors. With regards to dimensionality reduction, the benefit here is that the simulator output can now be (approximately) represented using the $r \ll p$ numbers $B^\top g$. The question is now: how do we find the basis vectors $B$? If we are given a set of vectors $g_1, \dots, g_n \in \mathcal{R}(\mathcal{G})$, we can take an empirical approach and try to use these examples to determine a $B$ that is optimal in some well-defined sense. Assuming that $\mathcal{R}(\mathcal{G}) \subseteq \mathbb{R}^p$ is indeed a subspace and we define “optimal” in an average squared error sense, the problem we have laid out here is exactly that of principal components analysis (PCA), a topic I discuss in depth in this post. The only difference is that we are applying PCA on the subspace $\mathcal{R}(\mathcal{G})$. At this point, we should emphasize that in practice $\mathcal{R}(\mathcal{G})$ will often not be a subspace. Computer simulations may produce outputs that are subject to certain constraints, and thus $\mathcal{R}(\mathcal{G})$ may represent a more complicated subset of $\mathbb{R}^p$. In these cases, one can still typically apply the PCA algorithm to obtain $B$, but the result may be sub-optimal. Alternate methods of basis construction may be warranted depending on the problem at hand.
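
As a quick numerical illustration of (2)–(3), the following sketch projects a vector onto the span of $r$ orthonormal basis vectors; here the basis is a random orthonormal matrix rather than one constructed from simulator outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 500, 5

# Any matrix with orthonormal columns works as a basis B; here we
# orthonormalize a random matrix via QR purely for illustration.
B, _ = np.linalg.qr(rng.standard_normal((p, r)))

g = rng.standard_normal(p)          # a stand-in output vector
w = B.T @ g                         # r-dimensional summary of g
g_hat = B @ w                       # projection of g onto span(b_1, ..., b_r)

# BB^T is idempotent: projecting twice changes nothing.
assert np.allclose(B @ (B.T @ g_hat), g_hat)
```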

Linking the Basis Representation with the Input Parameters

In the previous subsection, we considered approximating vectors in the range of the simulator with respect to a set of basis vectors $B$. However, recall that our underlying goal here is to approximate the map $u \mapsto \mathcal{G}(u)$. We thus need to consider how to leverage the basis representation of the output space in achieving this goal. Assuming we have already constructed the basis vectors $B$, the map $u \mapsto \mathcal{G}(u)$ can be approximated as
\begin{align}
u \mapsto \sum_{j=1}^{r} \langle \mathcal{G}(u), b_j\rangle b_j = B [B^\top \mathcal{G}(u)]. \tag{4}
\end{align}
In words: feed $u$ through the simulator and project the resulting output onto the low-dimensional subspace spanned by the basis vectors. Note that $B^\top \mathcal{G}(u)$ stores the $r$ weights defining the projection of $\mathcal{G}(u)$ onto the subspace generated by $B$, thus providing a low-dimensional summary of the simulator output. Let’s introduce the notation
\begin{align}
w(u) &:= B^\top \mathcal{G}(u) = \left[w_1(u), \dots, w_r(u) \right]^\top \in \mathbb{R}^r, && w_j(u) := \langle \mathcal{G}(u), b_j \rangle
\end{align}
to denote these weights. The basis function approximation to the simulator can thus be written as
\begin{align}
\hat{\mathcal{G}}_r(u) := \sum_{j=1}^{r} w_j(u)b_j = Bw(u) \in \mathbb{R}^p. \tag{5}
\end{align}
At this point, this isn’t helpful since the expensive simulator still needs to be run every time $\hat{\mathcal{G}}_r(u)$ is evaluated. To address this, we now turn back to the idea of using emulators. Recall that such an approach was originally hindered by the high-dimensional output space of the simulator. However, under the approximation $\hat{\mathcal{G}}_r(u)$, the dependence on $u$ has been reduced to $w(u)$, which effectively reduces the output dimension to $r$. The idea is thus to use some sort of statistical model to emulate the map $u \mapsto w(u)$. Suppressing all details for now, let’s suppose we have fit such a model $w^*(u)$. We can now plug $w^*(u)$ in place of $w(u)$ in (5) to obtain the approximation
\begin{align}
\hat{\mathcal{G}}^*_r(u) := \sum_{j=1}^{r} w^*_j(u)b_j = Bw^*(u) \in \mathbb{R}^p. \tag{6}
\end{align}
This approximation no longer requires running the full simulator, since evaluating $\hat{\mathcal{G}}^*_r(u)$ just requires (1) computing the emulator prediction at $u$; and (2) applying $B$ to the emulator prediction. It is worth emphasizing how this approach compares to the direct emulation method. In place of directly trying to approximate the map from $u$ to the model outputs $\mathcal{G}(u)$, we are now considering approximating the map from $u$ to inner products of $\mathcal{G}(u)$ with a small number of basis vectors. The hope is that these inner products are sufficient to capture the majority of information in the model response.
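
Schematically, evaluating (6) amounts to “predict the $r$ weights, then multiply by $B$”. Here is a minimal sketch, where `weight_models` is a placeholder for any list of fitted scalar-output regression models; concrete GP emulators are discussed later in the post.

```python
import numpy as np

def emulator_predict(u, B, weight_models):
    """Evaluate the basis emulator (6): stack the r weight predictions and map
    back to the p-dimensional output space via B. `weight_models` is any list
    of r fitted scalar-output models exposing a callable predict interface."""
    w_star = np.array([m(u) for m in weight_models])   # shape (r,)
    return B @ w_star                                   # shape (p,)
```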

The General Emulation Model

In the preceding subsections, we introduced the idea of representing the output space (range) of $\mathcal{G}$ with respect to a low-dimensional basis in order to facilitate emulation of the map $u \mapsto \mathcal{G}(u)$. For concreteness, we considered an orthonormal basis, whereby approximations of $\mathcal{G}(u)$ take the form of orthogonal projections onto the basis vectors. In this section, we take a step back and define the general model, which encompasses basis methods beyond the orthogonal projection framework.

The Basis Function Emulation Model. Given a set of vectors $\{b_1, \dots, b_r\} \subset \mathbb{R}^{p}$, we refer to a decomposition of the form
\begin{align}
\mathcal{G}(u) = \sum_{j=1}^{r} w_j(u) b_j + \epsilon(u) \tag{7}
\end{align}
as the basis function emulation model. We write $w^*_j(u)$ to denote an emulator model that approximates the map $u \mapsto w_j(u)$.

The basis function emulation model decomposes the computer model output $\mathcal{G}(u)$ into

  1. a piece that can be explained by a linear combination of $r$ basis functions that are independent of $u$.
  2. the residual $\epsilon(u)$, representing all variation unaccounted for by the basis functions.

We emphasize that the basis functions are independent of the input $u$; the effect of the inputs is restricted to the coefficients $w_j(u)$, with unaccounted-for $u$-dependence absorbed by the residual term $\epsilon(u)$. As noted in the previous section, if we opt for an orthogonal projection approach, then the true weights assume the form
\begin{align}
w_j(u) = \langle \mathcal{G}(u), b_j \rangle, \tag{8}
\end{align}
but the general model (7) allows for other approaches as well. Under different decomposition strategies, the true weights may not be given by the inner products (8). Nonetheless, we can still consider applying statistical models to approximate the underlying weight maps $u \mapsto w_j(u)$, regardless of what form these maps may take.

Concrete Details

Having laid out the general model, we now provide concrete details for specific choices of the basis construction and the emulator model. For the former we consider the popular PCA/SVD approach, which we have already hinted at above. For the latter, we consider the use of Gaussian processes (GPs). The combination of these two choices was first presented in the seminal paper (Higdon et al., 2008). I have a whole in-depth post on PCA, so I will assume general background knowledge on this topic.

Constructing the PCA Basis

Let us consider the construction of the basis vectors $b_j$ using a principal components analysis (PCA) approach. Depending on the field, this strategy might also be termed singular value decomposition (SVD), proper orthogonal decomposition (POD), or empirical orthogonal functions (EOF).

Initial Design

In the absence of prior knowledge about the $b_j$, PCA takes a purely empirical approach; the simulator $\mathcal{G}(u)$ is evaluated at a variety of different inputs $u_i$ in order to understand the patterns of variability induced in the model outputs. We refer to the resulting input-output pairs
\begin{align}
\{u_i, \mathcal{G}(u_i)\}, \qquad i = 1, \dots, n \tag{9}
\end{align}
as the design or design points. In practice, the availability of parallel computing resources typically means that model simulations can be run in parallel, thus reducing the computational cost of generating the design. The input design points $u_i$ are often chosen to vary “uniformly” over the input space $\mathcal{U}$ in some sense. More generally, we might assume that the inputs are governed by some prior distribution
\begin{align}
u \sim \rho, \tag{10}
\end{align}
which might encode prior knowledge about model parameters or serve to place higher weight on regions of the input space that are deemed more important. In this case, the design inputs might be sampled independently according to $\rho$. In any case, the spread of the inputs ought to be chosen so that the set of corresponding outputs $\mathcal{G}(u_i)$ is representative of variation in the simulator output under typical use cases.
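
For illustration, here is a minimal NumPy sketch of the two design strategies just described. The box bounds and the Gaussian prior are arbitrary placeholders, and space-filling designs such as Latin hypercube sampling are a common refinement of the uniform option.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3

# Option 1: vary the inputs "uniformly" over a box-shaped input space U.
lower, upper = np.zeros(d), np.ones(d)
U_uniform = rng.uniform(lower, upper, size=(n, d))

# Option 2: sample i.i.d. from a prior rho over the inputs
# (here an arbitrary independent Gaussian, purely as an example).
U_prior = rng.normal(loc=0.0, scale=1.0, size=(n, d))

# The design outputs would then be generated by running the simulator,
# typically in parallel, at each row of the chosen design.
```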

PCA

We denote the design outputs by $g_i := \mathcal{G}(u_i)$, $i = 1, \dots, n$ and consider stacking these vectors into the rows of a matrix $G \in \mathbb{R}^{n \times p}$. Define the empirical mean
\begin{align}
\overline{g} := \frac{1}{n} \sum_{i=1}^{n} g_i. \tag{11}
\end{align}
Since PCA is typically defined with respect to a centered data matrix, let $G^c \in \mathbb{R}^{n \times p}$ denote the analog of $G$ constructed using the centered outputs $g_i^c := g_i - \overline{g}$. In other words, $G^c$ results from subtracting $\overline{g}$ from each row of $G$. The rows of $G^c$ provide a summary of the variation in the simulator outputs due to variation in the inputs. Each column of $G^c$ represents a single dimension in the output space. We seek to identify a small number of vectors $b_j \in \mathbb{R}^p$ such that the observed outputs $g_i$ can be explained as a linear combination of these vectors. PCA produces these vectors by computing an eigendecomposition of the (unnormalized) empirical covariance matrix
\begin{align}
\hat{C}_g := \sum_{i=1}^{n} (g_i - \overline{g})(g_i - \overline{g})^\top = (G^c)^\top G^c, \tag{12}
\end{align}
and defining $B$ to consist of the $r$ dominant eigenvectors of $\hat{C}_g$. We denote the eigendecomposition by
\begin{align}
\hat{C}_g = V \Lambda V^\top, \tag{13}
\end{align}
where $\Lambda := \text{diag}\{\lambda_1, \dots, \lambda_p\}$ contains the eigenvalues sorted in decreasing order of their magnitude, with $V$ storing the respective normalized eigenvectors $v_1, \dots, v_p$ as columns. If we truncate to the dominant $r$ eigenvectors,
\begin{align}
\hat{C}_g \approx V_r \Lambda_r V_r^\top, \tag{14}
\end{align}
then we define our basis vectors by
\begin{align}
&B := V_r, &&b_j := v_j, \qquad j = 1, \dots, r. \tag{15}
\end{align}
Note that $\hat{C}_g$ is positive semidefinite so $\lambda_j \geq 0$ for all $j$. We summarize these ideas below.
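
In code, the basis construction amounts to centering the output matrix and taking an SVD; the right singular vectors of $G^c$ are the eigenvectors of $(G^c)^\top G^c$, so we never need to form the $p \times p$ matrix explicitly. A minimal sketch, reusing the design outputs `G` from the earlier toy example and picking `r` arbitrarily:

```python
import numpy as np

# G: (n, p) matrix of design outputs, one simulator run per row.
g_bar = G.mean(axis=0)            # empirical mean, equation (11)
G_c = G - g_bar                   # centered output matrix

# SVD of the centered matrix; its right singular vectors are the
# eigenvectors of (G_c)^T G_c, with eigenvalues equal to squared singular values.
_, svals, Vt = np.linalg.svd(G_c, full_matrices=False)

r = 5                             # number of retained components (problem-specific)
B = Vt[:r].T                      # basis vectors b_1, ..., b_r as columns, shape (p, r)
eigvals = svals**2                # eigenvalues of (G_c)^T G_c

# Fraction of output variance captured by the first r components.
explained = eigvals[:r].sum() / eigvals.sum()

# Projection weights for the training runs, w_j(u_i) = <g_i - g_bar, v_j>.
W = G_c @ B                       # shape (n, r)
```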

PCA Approximation. Given the design $\{u_i, \mathcal{G}(u_i)\}_{i=1}^{n}$, the PCA-induced approximation to $\mathcal{G}(u)$ is given by
\begin{align}
\hat{\mathcal{G}}_r(u) := \overline{g} + \sum_{j=1}^{r} \langle \mathcal{G}(u)-\overline{g}, v_j\rangle v_j, \tag{16}
\end{align}
where $\overline{g}$ is defined in (11) and $v_1, \dots, v_p$ denote the normalized eigenvectors of the matrix defined in (12). The residual term is given by
\begin{align}
\epsilon(u) := \mathcal{G}(u) - \hat{\mathcal{G}}_r(u) = \sum_{j=r+1}^{p} \langle \mathcal{G}(u)-\overline{g}, v_j\rangle v_j. \tag{17}
\end{align}

Gaussian Process Emulators

The above section shows that the PCA approach yields the basis vectors $b_j = v_j$ with associated weights $w_j(u) = \langle \mathcal{G}(u)-\overline{g}, v_j\rangle$. In this section we consider approximating the maps $u \mapsto w_j(u)$ with Gaussian processes (GPs). The specific features of the PCA basis construction provide useful information in designing a reasonable GP model. In particular, we consider why it is typically reasonable to emulate the maps $u \mapsto w_j(u)$ separately using a set of independent GPs. We then discuss how the independent GP approach induces a multi-output GP emulator for $\mathcal{G}(u)$ with a particular covariance structure.

Independent GPs

We first consider the typical choice to model each $w_j(u)$ separately as an independent GP. For this, we point to a result proved in the PCA post; namely, if we define $w^i := [w_1(u_i), \dots, w_r(u_i)]^\top \in \mathbb{R}^r$, then the vectors $w^1, \dots, w^n$ have empirical covariance
\begin{align}
\hat{C}_w := \frac{1}{n-1} \sum_{i=1}^{n} w^i(w^i)^\top = \frac{1}{n-1}\Lambda_r, \tag{18}
\end{align}
with $\Lambda_r$ given in (14). In words, the weight vectors have zero sample covariance across dimensions. This seems to provide some justification for the independent GP approach, though we should be careful not to overstate the claim. We can view $\hat{C}_w$ as an estimate of
\begin{align}
C_w := \mathbb{E}_{u \sim \rho}\left[(w(u) - \mathbb{E}[w(u)])(w(u) - \mathbb{E}[w(u)])^\top\right], \tag{19}
\end{align}
where $w(u) = (w_1(u), \dots, w_r(u))$. Even if we knew that $C_w$ was actually diagonal, we should take care to note that this only means the $w_j(u)$ are uncorrelated on average with respect to $u \sim \rho$. Fitting independent GPs entails assuming a probabilistic model for $w(u)$ in which the $w_j(u)$ are pairwise independent at each input $u$. Thus, diagonal structure in $C_w$ does not necessarily justify this assumption. In practice, the independent GP approach is often reasonable, and makes things much easier.
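
To make this concrete, here is one possible implementation of the independent GP approach using scikit-learn, reusing `U`, `W`, `B`, and `g_bar` from the earlier sketches. Note that scikit-learn fits the kernel hyperparameters by maximizing the marginal likelihood, i.e., the empirical Bayes route mentioned below rather than the fully Bayesian treatment of (Higdon et al., 2008).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# U: (n, d) design inputs; W: (n, r) projection weights from the PCA step.
r = W.shape[1]
gps = []
for j in range(r):
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(U.shape[1]))
    gp = GaussianProcessRegressor(kernel=kernel)
    gp.fit(U, W[:, j])            # independent GP for the j-th weight map
    gps.append(gp)

def emulate(u_new):
    """Basis emulator (6): predict the r weights, then map back via B and g_bar."""
    u_new = np.atleast_2d(u_new)
    w_star = np.column_stack([gp.predict(u_new) for gp in gps])   # (m, r)
    return g_bar + w_star @ B.T                                   # (m, p)
```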

Prior Mean Specification

We next consider the specification of the GP prior for each independent GP. Start by noting that the sample mean of the values $\{w_j(u_i)\}_{i=1}^{n}$ is zero for each $j = 1, \dots, r$. Indeed,
\begin{align}
\frac{1}{n} \sum_{i=1}^{n} w_j(u_i) = \frac{1}{n} \sum_{i=1}^{n} \langle g^c_i,v_j \rangle = \left\langle \frac{1}{n} \sum_{i=1}^{n} g^c_i,v_j \right\rangle = \langle 0,v_j \rangle = 0. \tag{20}
\end{align}
Similar to above, we can view this as an estimate of $\mathbb{E}_{u \sim \rho}[w_j(u)]$. It is tempting to conclude that a constant zero prior mean assumption is thus justified for each GP. Again, this isn’t quite right. If $\mathbb{E}_{u \sim \rho}[w_j(u)] \approx 0$, then this tells us that on average, over the whole input space, the weight $w_j(u)$ is approximately zero. This does not guarantee the absence of a trend in the map $u \mapsto w_j(u)$. For example, this map might be linear in $u$ but centered around zero, so that the positive and negative values average out. The behavior of the weight maps will depend on the simulator and the basis vectors. If the simulator output is somewhat stationary, then the $w_j(u)$ will typically also look stationary, and so the zero constant mean assumption is probably reasonable. In other cases, one might want to consider using some sort of trend for the prior mean (e.g., polynomial basis functions). Standard GP model checking procedures should be used to determine what is best for a particular application. In the application presented in (Higdon et al., 2008), the zero mean assumption is regarded as reasonable.
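
Using the training weight matrix `W` from the PCA sketch above, the identity (20) is easy to check numerically; the column means vanish up to floating point error.

```python
# Sample means of the training weights, one per retained component.
# Up to floating point error these are all zero, per equation (20).
print(W.mean(axis=0))
```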

Prior Covariance Specification

The covariance function used for each GP is also a modeling choice that must be specified on a case-by-case basis. If any trends in the functions $w_j(u)$ have been adequately addressed in the prior mean, then it is often reasonable to utilize a stationary covariance function; e.g., one of the form
\begin{align}
\text{Cov}[w_j(u), w_j(u^\prime)] = \alpha^2 c(u-u^\prime). \tag{21}
\end{align}
Typical choices are covariances of the Gaussian or Matérn classes. In (Higdon et al., 2008), the authors consider “fully Bayesian” GPs, whereby priors are placed on the marginal variance $\alpha^2$ and other hyperparameters characterizing the correlation function $c(\cdot)$. A common alternative is to opt for an empirical Bayes approach and fix these parameters at their optimized values.
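
For instance, one common parameterization of the Gaussian (squared exponential) correlation function uses a separate lengthscale $\ell_k$ for each input dimension,
\begin{align}
c(u - u^\prime) = \exp\left\{-\frac{1}{2} \sum_{k=1}^{d} \frac{(u_k - u^\prime_k)^2}{\ell_k^2}\right\},
\end{align}
so that the hyperparameters to be learned (or assigned priors) for each weight GP are the marginal variance $\alpha^2$ and the lengthscales $\ell_1, \dots, \ell_d$.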

Error Term

Next, we consider a statistical model for the error term $\epsilon(u)$ defined in (17). In (Higdon et al., 2008), it is assumed that
\begin{align}
\epsilon(u) \sim \mathcal{N}(0, \sigma^2 I), \tag{22}
\end{align}
with $\sigma^2$ treated as an unknown parameter that is assigned an inverse Gamma prior.

Implied Multi-Output GP for the Simulator
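
Combining the pieces above gives the statistical model for the simulator output that is implied by the basis decomposition (7), the independent zero-mean GP priors on the weights, and the error model (22). A brief sketch of its structure, writing $k_j(u, u^\prime) = \alpha_j^2 c_j(u - u^\prime)$ for the covariance function of the $j$-th weight GP and taking $\epsilon$ to be independent across inputs: the implied model for $\mathcal{G}(u)$ is a multi-output ($\mathbb{R}^p$-valued) GP with mean $\overline{g}$ and covariance
\begin{align}
\text{Cov}[\mathcal{G}(u), \mathcal{G}(u^\prime)] = \sum_{j=1}^{r} k_j(u, u^\prime) \, b_j b_j^\top + \sigma^2 \delta(u, u^\prime) I,
\end{align}
where $\delta(u, u^\prime)$ equals one when $u = u^\prime$ and zero otherwise. The output covariance is thus a sum of rank-one matrices $b_j b_j^\top$, each modulated by its own scalar kernel over the input space, plus an isotropic noise term contributed by the residual model.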

References

The main reference for this post is the paper (Higdon et al., 2008), listed in full below.

  1. Higdon, D., Gattiker, J., Williams, B., & Rightley, M. (2008). Computer Model Calibration Using High-Dimensional Output. Journal of the American Statistical Association, 103(482), 570–583. https://doi.org/10.1198/016214507000000888