Active Subspaces

An introduction to dimensionality reduction via active subspaces.
Categories: PCA, Statistics

Published: May 30, 2024

1 Motivation

Consider some (deterministic) function of interest \[ f: \mathcal{X} \subseteq \mathbb{R}^d \to \mathbb{R}. \] While this is a generic setup, the common motivation is that \(f\) conceptualizes a computationally costly black-box simulation. There are many things one might wish to do with such a simulation: minimize or maximize \(f\) (optimization); integrate \(f\) over some domain of interest (numerical integration); find input values \(x\) to make \(f(x)\) agree with some observed data (calibration); assess how the function outputs \(f(x)\) vary as the inputs \(x\) change (sensitivity analysis); fit a cheaper-to-evaluate surrogate model that approximates \(f\) (emulation/surrogate modeling); identify subsets of \(\mathcal{X}\) that yield desirable or undesirable outputs (contour/excursion set estimation).

All of these tasks are made more difficult when the number of inputs \(d\) is large. However, it has been routinely noted that such high-dimensional settings often exhibit low-dimensional structure; i.e., low “intrinsic dimensionality.” This is to say that \(f\) varies primarily on some low-dimensional manifold embedded in the higher-dimensional ambient space \(\mathcal{X}\). The active subspace method assumes that the manifold is a linear subspace, and seeks to identify this subspace using gradient evaluations of \(f\).

2 Ridge Functions

Before introducing the method of active subspaces, we start to build some intuition regarding the notion of low-dimensional subspace structure in the map \(f\). The extreme case of this idea is a function of the form \[ \begin{align} f(x) &= g(Ax), &&A \in \mathbb{R}^{r \times d}, \end{align} \tag{1}\] where \(r < d\). A function of this type is sometimes called a ridge function. The ridge assumption is a rather stringent one, as it implies that \(f\) truly only varies on a subspace of \(\mathcal{X}\), and is constant otherwise. Indeed, let \(x, \tilde{x} \in \mathcal{X}\) such that \(\tilde{x} \in \text{null}(A)\) (the null space of \(A\)). Then \[ \begin{align} f(\tilde{x}) &= g(A\tilde{x}) = g(0) \newline f(x + \tilde{x}) &= g(Ax + A\tilde{x}) = g(Ax). \end{align} \] We can consider the input space as partitioned via the direct sum \(\mathcal{X} = \text{null}(A) \oplus \text{row}(A)\) (using the fact that the orthogonal complement of the null space is the row space). The ridge function \(f\) is responsive only when \(x\) changes in \(\text{row}(A)\), and not at all when \(x\) only changes in \(\text{null}(A)\). Although we have not defined the concept yet, it seems intuitive that \(\text{row}(A)\) would represent the “active subspace” in this case.
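To make the ridge property concrete, here is a minimal numpy sketch (the particular \(A\) and \(g\) are arbitrary toy choices) checking that \(f\) is unchanged by perturbations in \(\text{null}(A)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ridge function f(x) = g(Ax) with d = 5 inputs but only r = 2
# effective dimensions; A and g are illustrative choices.
d, r = 5, 2
A = rng.standard_normal((r, d))

def f(x):
    y = A @ x
    return np.sin(y[0]) + y[1] ** 2  # g acting on the r-dimensional projection

# The last d - r right singular vectors of A span null(A).
_, _, Vt = np.linalg.svd(A)
x_null = Vt[r:].T @ rng.standard_normal(d - r)
assert np.allclose(A @ x_null, 0.0)

# f responds to changes in row(A) but not at all in null(A).
x = rng.standard_normal(d)
print(np.isclose(f(x + x_null), f(x)))  # True
```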

3 Setup and Assumptions

As noted above, the assumption that a function has ridge structure is quite strong. The notion of active subspaces invokes looser assumptions, assuming that, on average, the function primarily varies along a subspace and varies significantly less on its orthogonal complement. To make precise what we mean by “on average”, we introduce a probability measure \(\mu\) on \(\mathcal{X}\). The choice of \(\mu\) will be dictated by the problem at hand.

Given that \(f\) may be highly nonlinear, the question of identifying a subspace that on average captures the majority of the variation is not a simple one. The active subspace method addresses this challenge using gradient information, and thus we assume going forward that \(f\) is differentiable. We denote the gradient of \(f\) at an input \(x \in \mathcal{X}\) by \[ \nabla f(x) := \left[D_1 f(x), \dots, D_d f(x) \right]^\top \in \mathbb{R}^d. \] On a notational note, for a function \(\phi: \mathbb{R}^n \to \mathbb{R}^m\) we use \(D\phi(x)\) to denote the \(m \times n\) Jacobian matrix. Therefore, when applied to the scalar-valued function \(f\), we have the relation \(\nabla f(x) = Df(x)^\top\).

By observing gradient evaluations \(\nabla f(x_i)\) at a set of inputs \(x_1, \dots, x_n\) sampled from \(\mu\), we can get a sense of how the function varies along each coordinate direction, on average. However, it may be the case that the function varies most significantly on a subspace not aligned with the standard coordinate directions. The idea of active subspaces is thus to use the information in these gradient observations to find such a subspace; that is, to find linear combinations of the coordinate directions along which \(f\) exhibits the largest variation.

At this point, the idea should sound quite reminiscent of principal components analysis (PCA). Indeed, there are close ties between the two methods. We therefore briefly review PCA in order to better motivate the active subspace algorithm.

4 PCA Review

In order to better motivate the active subspace method, we take a brief detour and review the relevant ideas from PCA. I have a whole post on this topic that derives all the facts stated below.

4.1 The Empirical Perspective

Consider a set of points \(x_1, \dots, x_n\) in \(\mathbb{R}^d\). Let us assume that these points are centered, in the sense that the empirical mean has been subtracted from each input. We can stack the transpose of these vectors row-wise to obtain a matrix \(X \in \mathbb{R}^{n \times d}\). The data has \(d\) features, and the goal of PCA is to construct a new set of \(r \leq d\) linear combinations that explain the majority of variation in the data. This is accomplished via an eigendecomposition of the matrix \[ \hat{C} = X^\top X = \sum_{i=1}^{n} x_i x_i^\top, \] which is proportional to the empirical covariance of the data. Denoting the eigendecomposition by \(\hat{C} = VDV^\top\), the first \(r\) columns of \(V\) form a basis for the optimal \(r\)-dimensional subspace. The low-rank approximations to the data points are thus given by the projection \[ \hat{x}_i := \sum_{j=1}^{r} \langle x_i, v_j \rangle v_j, \] with \(v_1, \dots, v_r\) denoting the dominant \(r\) eigenvectors of \(\hat{C}\). This approximation is optimal in the sense that it minimizes the average squared error \[ \frac{1}{n} \sum_{i=1}^{n} \left\lVert x_i - \sum_{j=1}^{r} w_{ij} b_j \right\rVert^2_2 \] over all sets of orthonormal vectors \(b_1, \dots, b_r\) in \(\mathbb{R}^d\) and weights \(\{w_{ij}\}\), where \(i = 1, \dots, n\) and \(j = 1, \dots, r\). Notice that the basis vectors \(b_j\) are shared across all data points, while the weights \(w_{ij}\) are specific to each \(x_i\).
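As a quick illustration, the following numpy sketch (with synthetic data concentrated near a 2-dimensional subspace, an assumption made purely for the example) carries out these steps: form \(\hat{C} = X^\top X\), eigendecompose, and reconstruct with the dominant \(r\) eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic centered data: n points in d = 3 dimensions lying near
# an r = 2 dimensional subspace (plus small noise).
n, d, r = 500, 3, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
X += 0.01 * rng.standard_normal((n, d))
X -= X.mean(axis=0)                      # center the data

# Eigendecomposition of C_hat (eigh returns ascending eigenvalues).
C_hat = X.T @ X
eigvals, V = np.linalg.eigh(C_hat)
V_r = V[:, ::-1][:, :r]                  # dominant r eigenvectors

# Rank-r approximation: project each point onto span(v_1, ..., v_r).
X_hat = X @ V_r @ V_r.T
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse)  # small: the discarded direction carries little variance
```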

4.2 The Distribution Perspective

The basic PCA algorithm centers on the empirical covariance matrix \(\hat{C}\). Instead of working with empirical data, we can alternatively work with the distribution of \(x\) directly. Suppose \(x = x(\omega)\) is a random vector defined with respect to the probability space \((\Omega, \mathcal{A}, \mathbb{P})\). Essentially all of the above results still hold, with the covariance \(C := \text{Cov}[x]\) replacing the empirical covariance \(\hat{C}\). In this setting, the objective function being minimized is the expected squared error \[ \mathbb{E} \left\lVert x(\omega) - \sum_{j=1}^{r} w_j(\omega) b_j \right\rVert^2_2 \] over all sets of orthonormal vectors \(b_1, \dots, b_r\) in \(\mathbb{R}^d\) and all random weight vectors \(w(\omega) = (w_1(\omega), \dots, w_r(\omega))^\top \in \mathbb{R}^r\). Just as the \(w_{ij}\) encoded information about the variation in \(x\) in the empirical setting, the coefficient vector \(w(\omega)\) is the source of randomness in the low-rank approximation that seeks to capture the variation in \(x(\omega)\).

5 Motivating Active Subspaces via PCA

5.1 PCA on the Gradient

In seeking active subspaces, we might try to perform PCA directly on \(x\) in order to capture its dominant directions of variation. However, this approach fails to take the function \(f\) into account at all. The input \(x\) may vary significantly (with respect to the measure \(\mu\)) in a certain direction, but this variation might induce essentially no change in the function output. We seek a method that accounts both for the variation in \(x\) induced by \(\mu\) and for the action of the map \(f\). In fact, it is common to assume that \(\mu\) has been centered and normalized such that \[\begin{align} &\mathbb{E}[x] = 0, &&\text{Cov}[x] = I. \end{align}\] This means that the dimensions of \(x\) are already uncorrelated; there is nothing to be gained by performing PCA directly on \(x\). However, there may still be low-dimensional structure to exploit in the map \(f\).

As noted above, the gradient \(\nabla f(x)\) provides information on the local variation in the function along the standard coordinate axes. Alluding to the PCA setting, we can think of these directions of variation as \(d\) “features”, with the goal of finding linear combinations of the features that explain the majority of the variation. Let’s go with this, taking \(\nabla f(x)\) to be the target of a PCA analysis. Since we start from an assumed distribution \(x \sim \mu\), it makes sense to apply the distributional form of PCA, as summarized above. To this end, we consider the matrix \[ C := \mathbb{E}[(\nabla f(x)) (\nabla f(x))^\top], \] which is an unscaled version of \(\text{Cov}[\nabla f(x)]\). Note that the expectations here are all with respect to \(\mu\). The analogous quantities in the above PCA review are \[ x \iff \omega, \qquad \nabla f(x) \iff x(\omega). \] We can think of \(\nabla f\) as a random variable defined on the probability space \((\mathcal{X}, \mathcal{B}(\mathcal{X}), \mu)\). Proceeding as usual, we consider the eigendecomposition \(C = V \Lambda V^\top\). After choosing a truncation level \(r\), the first \(r\) columns of \(V\) become the basis for our active subspace. The \(r\)-dimensional approximation of the gradient is then given by \[ \hat{\nabla} f(x) := \sum_{j=1}^{r} \langle \nabla f(x), v_j \rangle v_j. \]
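In practice \(C\) is estimated by Monte Carlo from sampled gradients. A minimal sketch, assuming a toy \(f\) with an analytic gradient and taking \(\mu = \mathcal{N}(0, I)\) (both choices purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy function f(x) = sin(a^T x) + 0.01 * b^T x: it varies strongly
# along a and only weakly elsewhere. a and b are arbitrary choices.
d = 4
a = rng.standard_normal(d)
b = rng.standard_normal(d)

def grad_f(x):
    return np.cos(a @ x) * a + 0.01 * b

# Monte Carlo estimate of C = E[grad f(x) grad f(x)^T] with x ~ N(0, I).
n = 2000
xs = rng.standard_normal((n, d))
G = np.stack([grad_f(x) for x in xs])    # n x d matrix of sampled gradients
C_hat = G.T @ G / n

eigvals, V = np.linalg.eigh(C_hat)
eigvals, V = eigvals[::-1], V[:, ::-1]   # sort in descending order
print(eigvals)  # one dominant eigenvalue: a one-dimensional active subspace
```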

5.2 Defining the Active Subspace

While I find the PCA comparison quite useful for motivating the method of active subspaces, there is a risk of pushing the analogy too far. In particular, we emphasize that the goal here is not to construct an accurate low-rank approximation of the gradient, but rather to approximate \(f(x)\) by a function that looks like \(g(Ax)\). To provide some initial justification for this latter goal, we show that, on average, \(f\) varies more along the directions defined by the dominant eigenvectors. We let \(D_v f(x) := \nabla f(x)^\top v\) denote the directional derivative of \(f\) in the direction \(v\).

Proposition

Let \((\lambda_j, v_j)\) denote the \(j^{\text{th}}\) eigenvalue-eigenvector pair of \(C\), as defined above. Then \[ \begin{align} \mathbb{E}\left[(D_{v_j} f(x))^2 \right] &= \lambda_j. \end{align} \]

Using the fact that the \(v_j\) have unit norm, we obtain \[ \begin{align} \lambda_j &= \langle C v_j, v_j \rangle \\ &= v_j^\top \mathbb{E}\left[\nabla f(x) \nabla f(x)^\top\right] v_j \\ &= \mathbb{E}\left[ v_j^\top \nabla f(x) \nabla f(x)^\top v_j \right] \\ &= \mathbb{E}\left[ (\nabla f(x)^\top v_j)^2 \right] \\ &= \mathbb{E}\left[(D_{v_j} f(x))^2 \right] \end{align} \]

In words, this result says that, on average (with respect to \(\mu\)), the squared derivative in the direction of \(v_j\) is given by the eigenvalue \(\lambda_j\). That is, the eigenvalues provide information about the smoothness of the function in the directions defined by their respective eigenvectors. Given this, it seems reasonable to discard directions with eigenvalues of negligible size.
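The identity also holds exactly for the empirical analogue: if \(\hat{C}\) is built from sampled gradients, then \(v_j^\top \hat{C} v_j\) equals the sample mean of the squared directional derivatives along \(v_j\). A small numerical check (the toy \(f\) and the choice \(\mu = \mathcal{N}(0, I)\) are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Check: with C estimated from sampled gradients, v_j^T C v_j equals
# the average squared directional derivative along v_j.
d, n = 4, 1000
a = rng.standard_normal(d)

def grad_f(x):
    # gradient of the toy function f(x) = sin(a^T x) + 0.05 * ||x||^2
    return np.cos(a @ x) * a + 0.1 * x

xs = rng.standard_normal((n, d))
G = np.stack([grad_f(x) for x in xs])
C_hat = G.T @ G / n

eigvals, V = np.linalg.eigh(C_hat)
for j in range(d):
    mean_sq_dir_deriv = np.mean((G @ V[:, j]) ** 2)  # avg of (D_{v_j} f)^2
    assert np.isclose(mean_sq_dir_deriv, eigvals[j])
print("sample version of E[(D_{v_j} f)^2] = lambda_j verified")
```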

Definition

Consider partitioning the eigenvalues and eigenvectors as \[ \begin{align} C = V \Lambda V^\top = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \newline 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} V_1^\top \newline V_2^\top \end{bmatrix}, \tag{3} \end{align} \] where \(V_1 \in \mathbb{R}^{d \times r}\) and \(\Lambda_1 \in \mathbb{R}^{r \times r}\). We define the active subspace (of dimension \(r\)) to be the span of the columns of \(V_1\), \[ \text{span}(v_1, \dots, v_r). \tag{4} \] Similarly, we refer to the subspace generated by the columns of \(V_2\) as the inactive subspace.

This is the same idea as in PCA: define a cutoff after the first \(r\) directions, after which the variation in the quantity of interest becomes negligible. The subspace generated by the first \(r\) directions is the active subspace, and the columns of \(V_1\) form an orthonormal basis for this subspace. The inactive subspace is the orthogonal complement of the active subspace, since \(V_1^\top V_2 = 0\).

6 Understanding the Low-Dimensional Subspace

Having defined the active subspace, we now begin to investigate its basic properties.

6.1 Active and Inactive Variables

Since the active subspace is \(r\)-dimensional, we can represent vectors living in this subspace with \(r \leq d\) coordinates. It is this fact that allows for useful dimensionality reduction in downstream applications. Given an input \(x \in \mathbb{R}^d\) we want to identify the vector in the active subspace that provides the best approximation to \(x\). The optimal vector (in a squared error sense) is given by the projection \[ \text{proj}_{V_1}(x) := \sum_{j=1}^{r} \langle x, v_j \rangle v_j = V_1 V_1^\top x. \tag{5} \] Thus, \(\langle x, v_1 \rangle, \dots, \langle x, v_r \rangle\) constitute the \(r\) coordinates defining the projection of \(x\) onto the active subspace. The quantity \(V_1^\top x \in \mathbb{R}^r\) stacks these coordinates into a vector. Note that \(V_1 V_1^\top x\) describes the same vector (i.e., the projection), but represents it in the original \(d\)-dimensional coordinate system. We can similarly project onto the inactive subspace via \(V_2 V_2^\top x\).

Definition

Let \(V_1\) and \(V_2\) be defined as in (3). We introduce the notation \[ \begin{align} &y := V_1^\top x, &&z := V_2^\top x, \tag{6} \end{align} \] and refer to \(y \in \mathbb{R}^r\) and \(z \in \mathbb{R}^{d-r}\) respectively as the active variable and inactive variable.

Note that \(y\) and \(z\) are simply the coordinates of the projection of \(x\) onto the active and inactive subspace, respectively. Since \(x \sim \mu\) is a random variable, then so are \(y\) and \(z\). Noting that the active and inactive subspaces are orthogonal complements, we have that

\[ x = \text{proj}_{V_1}(x) + \text{proj}_{V_2}(x) = V_1 V_1^\top x + V_2 V_2^\top x = V_1 y + V_2 z. \tag{7} \]
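A short numpy sketch of the decomposition (7), using a random orthogonal matrix as a stand-in for the eigenvector matrix \(V\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in orthogonal matrix V (in practice V comes from the
# eigendecomposition of C); split into active and inactive blocks.
d, r = 5, 2
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
V1, V2 = V[:, :r], V[:, r:]

x = rng.standard_normal(d)
y = V1.T @ x          # active variable: r coordinates
z = V2.T @ x          # inactive variable: d - r coordinates

# x is exactly recovered from its active and inactive coordinates.
print(np.allclose(V1 @ y + V2 @ z, x))  # True
```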

Recall that \(f\) is a function of \(d\) variables that we wish to approximate with a function of only \(r\) variables. Since \[ f(x) = f(V_1 V_1^\top x + V_2 V_2^\top x) = f(V_1 y + V_2 z), \] the key will be to eliminate the dependence on \(z\), a question we will return to shortly. For now, we want to start providing some justification for this idea by demonstrating that \(f\) varies more on average as we vary \(y\) as opposed to when we vary \(z\).

Proposition

Let \(y\) and \(z\) be the active and inactive variables defined in (6). Let \[ \begin{align} &T_{z}(y) := V_1 y + V_2 z, &&T_{y}(z) := V_1 y + V_2 z \end{align} \] denote the coordinate transformation from \((y,z)\) to \(x\), viewed respectively as functions of \(y\) or \(z\) only. Then, \[ \begin{align} \mathbb{E} \lVert D (f \circ T_{z})(y) \rVert_2^2 &= \lambda_1 + \cdots + \lambda_r \\ \mathbb{E} \lVert D (f \circ T_{y})(z) \rVert_2^2 &= \lambda_{r+1} + \cdots + \lambda_d. \end{align} \]

We only prove the result for \(T_z(y)\), as the proof for \(T_y(z)\) is nearly identical. By the chain rule we have \[ \begin{align} D (f \circ T_{z})(y) = Df(x) DT_{z}(y) = Df(x)V_1. \end{align} \] For succinctness, we will use the notation \[ \nabla_{y} f := [D (f \circ T_{z})(y)]^\top = V_1^\top \nabla f(x) \] in the below derivations. Thus, \[ \begin{align} \mathbb{E} \lVert \nabla_{y} f \rVert_2^2 &= \mathbb{E}\left[\text{tr}\left([\nabla_y f] [\nabla_y f]^\top \right) \right] \\ &= \text{tr}\left(\mathbb{E}\left[[\nabla_y f] [\nabla_y f]^\top \right] \right) \\ &= \text{tr}\left(V_1^\top \mathbb{E}\left[\nabla f(x) \nabla f(x)^\top \right] V_1 \right) \\ &= \text{tr}\left(V_1^\top C V_1 \right) \\ &= \text{tr}\left(V_1^\top V \Lambda V^\top V_1 \right) \\ &= \text{tr}(\Lambda_1) &&(\dagger) \\ &= \lambda_1 + \dots + \lambda_r. \end{align} \]

We obtained \((\dagger)\) by applying the decomposition (3).

Note that for \(T_{z}(y)\), the inactive variable is viewed as fixed with respect to the derivative operation. However, \(z\) is still random, and thus the expectation \(\mathbb{E} \lVert D (f \circ T_{z})(y) \rVert_2^2\) averages over the randomness in both \(y\) and \(z\).

6.2 The Distribution of the (In)Active Variables

As mentioned in the previous section, the active and inactive variables are functions of the random variable \(x \sim \mu\) and hence are themselves random variables. In this section we investigate the joint distribution of the random vector \(u := (y, z)^\top\). We recall from (6) that this vector is defined as \[ \begin{align} u := \begin{bmatrix} y \newline z \end{bmatrix} = \begin{bmatrix} V_1^\top x \newline V_2^\top x \end{bmatrix} = V^\top x, \end{align} \] where \(V\) is orthogonal. Thus, the transformation \(u = T(x) := V^\top x\) can be thought of as a rotation (possibly composed with a reflection) of the original coordinate system. In particular, note that \(V\) is invertible with \(V^{-1} = V^\top\). Let’s suppose that the measure \(\mu\) admits a density function \(\rho\). Since the transformation \(T\) is invertible and differentiable, the change-of-variables formula gives us the density of the random vector \(u\). Denoting this density by \(\tilde{\rho}\), we have \[ \tilde{\rho}(u^\prime) = \rho(T^{-1}(u^\prime)) \lvert \text{det}(DT^{-1}(u^\prime)) \rvert = \rho(Vu^\prime), \] following from the fact that \(\lvert \text{det}(DT^{-1}(u^\prime)) \rvert = \lvert \text{det}(V) \rvert = 1\). In words, to compute the density of \(u\) at a point \(u^\prime\), we can simply evaluate the density of \(x\) at the point \(Vu^\prime\). There is no distortion of space here since the transformation is orthogonal. A more measure-theoretic argument is presented below.

We can also find the measure \(\tilde{\mu}\) associated with the density \(\tilde{\rho}\) by considering, for a Borel set \(A \subset \mathbb{R}^d\), \[ \begin{align} \tilde{\mu}(A) &= \int_{\mathbb{R}^d} 1_A[u^\prime] \tilde{\mu}(du^\prime) \newline &= \int_{\mathbb{R}^d} 1_A[u^\prime] (\mu \circ T^{-1})(du^\prime) \newline &= \int_{\mathbb{R}^d} 1_A[T(x^\prime)] \mu(dx^\prime) \newline &= \int_{\mathbb{R}^d} 1[V^\top x^\prime \in A] \mu(dx^\prime) \newline &=\int_{\mathbb{R}^d} 1[x^\prime \in VA] \mu(dx^\prime) \newline &= \mu(VA), \end{align} \] where I’m defining the set \[ VA := \{V u^\prime: u^\prime \in A \}. \] This result is analogous to the density one, and simply says that the distributions of \(x\) and \(u\) differ only by a change of variables.

It is worth emphasizing that \(\tilde{\rho}(u^\prime) = \tilde{\rho}(y^\prime, z^\prime)\) is a joint density over the active and inactive variables. We will shortly be interested in the marginals and conditionals of this joint distribution.

6.2.1 The Gaussian Case

As usual, things work out very nicely if we work with Gaussians. Let’s consider the case where \(\mu\) is multivariate Gaussian, of the form \(x \sim \mathcal{N}(m, \Sigma)\). Since \(u = V^\top x\) is a linear transformation of a Gaussian, then we immediately have that \[ u \sim \mathcal{N}(V^\top m, V^\top \Sigma V). \tag{8} \] Therefore, \(\tilde{\rho}(u^\prime) = \mathcal{N}(u^\prime \mid V^\top m, V^\top \Sigma V)\). In this case, the marginals and conditionals of \(\tilde{\rho}\) are all available in closed-form using standard facts about Gaussians. If we consider the case where \(\mu\) is standard Gaussian (i.e., \(x \sim \mathcal{N}(0,I)\)), then \(u \sim \mathcal{N}(0, V^\top V) = \mathcal{N}(0, I)\). In this special case, the marginal distributions of \(y, z\) and the conditional distributions of \(y \mid z\) and \(z \mid y\) are all also standard Gaussian.
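A quick simulation check of (8); the particular \(V\), \(m\), and \(\Sigma\) below are arbitrary stand-ins chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(5)

# Check that u = V^T x has the predicted Gaussian moments when
# x ~ N(m, Sigma); V, m, Sigma here are arbitrary stand-ins.
d = 3
V, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal V
m = rng.standard_normal(d)
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)                    # positive definite

n = 200_000
xs = rng.multivariate_normal(m, Sigma, size=n)
us = xs @ V                                        # rows are u_i = V^T x_i

# Empirical moments of u match N(V^T m, V^T Sigma V) up to Monte Carlo error.
print(np.allclose(us.mean(axis=0), V.T @ m, atol=0.1))
print(np.allclose(np.cov(us.T), V.T @ Sigma @ V, atol=0.5))
```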

7 Approximating \(f\)

Finally we return to the original question: approximating \(f\) with a function of fewer variables. In particular, we seek to approximate the original function of \(d\) variables with a function of \(y\), which is \(r < d\) dimensional. It is worth emphasizing that the function \(f\) (e.g., a computer model) accepts \(d\)-dimensional inputs, so at the end of the day we need to feed it \(d\) variables. The practical benefit we get from dimensionality reduction is that we get to work with the lower-dimensional variable \(y\) in whatever algorithm we’re working with (e.g., an optimization or sampling algorithm). Suppose this algorithm gives us a value \(y\). This quantity is only \(r\)-dimensional, so we must consider how to construct a \(d\)-dimensional quantity to feed to \(f\). One’s gut reaction might be to invoke the approximation \[ f(x) = f(V_1 y + V_2 z) \approx f(V_1 y), \tag{2}\] thus projecting the input onto the active subspace and ignoring the orthogonal complement. This corresponds to a ridge approximation, akin to Equation 1. In other words, we are treating \(f\) as if its input varies only along the column space of \(V_1\) and ignoring its variability in the column space of \(V_2\). We now consider an alternative approximation that acknowledges the uncertainty induced by the latter.

Let’s start by adopting a purely linear algebraic perspective. Suppose we are given a value \(y\) (e.g., the current iterate of an optimization or sampling algorithm). This gives us only partial information about \(x\), in the sense that many different values of \(x\) satisfy \(y = V_1^\top x\). For any \(x\) satisfying this equation, \(x + x_\perp\) also does the trick, where \(x_\perp \in \mathrm{null}(V_1^\top)\). Recall from before that the columns of \(V_2\) form a basis for this space, implying that \(x_\perp\) can be written as \(V_2 z\) for some \(z\). Then, \[ V_1^\top (x + x_\perp) = V_1^\top (x + V_2 z) = V_1^\top x + 0 = y. \] This is to say: a given \(y\) is consistent with any \(x\) lying on a subspace that is spanned by the columns of \(V_2\). More precisely, the set of possible \(x\) values is given by the affine subspace \[ V_1y + \mathrm{span}(V_2) = \{V_1y + V_2 z \mid z \in \mathbb{R}^{d - r}\}. \tag{3}\] While the affine subspace fully characterizes the set of possibilities, we must produce a single vector to input to \(f\). The approximation in Equation 2 corresponds to choosing \(0 \in \mathrm{span}(V_2)\), which ignores the remainder of the subspace. To produce a more complete summary, we return to a probabilistic perspective.

Recall that \(x \sim \mu\), with density \(\rho\), implies the joint density \(\widetilde{\rho}\) over \((y, z)\). As we are focused on a particular point \(y\), we can consider the conditional density \(\widetilde{\rho}(z \mid y)\). We can now average with respect to this distribution over the set of \(x\) values that are consistent with \(y\) (Equation 3): \[ f(x) \approx \int f(V_1y + V_2 z) \, \widetilde{\rho}(z \mid y) \, dz. \tag{4}\] Since \(z\) is integrated out, the approximation is a function of \(y\) only, as desired. Note that we could also have arrived at this by standard statistical reasoning, without thinking through the linear algebra. If we were to maintain the full dimensionality of the input, then \(f(x)\) would be a known, deterministic quantity. But we seek to represent \(f(x) = f(V_1 y + V_2 z)\) as a function only of \(y = V_1^\top x\). Conditional on \(y\), \(f(x)\) is still random as a function of \(z\). We can thus approximate this random quantity by its conditional mean \[ f(x) \approx \mathbb{E}[f(V_1 y + V_2 z) \mid y], \tag{5}\] where the expectation is with respect to \(\widetilde{\rho}(z \mid y)\). This is simply another way of writing Equation 4.

While we cannot compute this conditional expectation exactly, we can unbiasedly estimate it using simple Monte Carlo, so long as we can directly sample the conditional \(\widetilde{\rho}(z \mid y)\). This yields \[ \begin{align} f(x) &\approx \mathbb{E}[f(V_1 y + V_2 z) \mid y] \\ &\approx \frac{1}{n} \sum_{i=1}^{n} f(V_1 y + V_2 z_i), &&z_i \overset{\mathrm{iid}}{\sim} \widetilde{\rho}(z \mid y). \end{align} \]
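A sketch of this estimator in the special case \(\mu = \mathcal{N}(0, I)\), where the conditional simplifies to \(z \mid y \sim \mathcal{N}(0, I_{d-r})\); the function \(f\) and matrix \(V\) below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in active-subspace basis and toy function.
d, r = 5, 2
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
V1, V2 = V[:, :r], V[:, r:]

def f(x):
    return np.sin(x @ np.arange(1, d + 1)) + 0.5 * x[0] ** 2

def f_active(y, n=1000):
    """Monte Carlo estimate of E[f(V1 y + V2 z) | y] with z ~ N(0, I)."""
    zs = rng.standard_normal((n, d - r))
    xs = V1 @ y + zs @ V2.T          # each row is V1 y + V2 z_i
    return np.mean([f(x) for x in xs])

# f_active depends only on the r-dimensional active variable y.
y = V1.T @ rng.standard_normal(d)
print(f_active(y))
```

Note that a fresh batch of \(z_i\) is drawn on each call; in an optimization loop one might instead fix the samples (common random numbers) so that repeated evaluations at the same \(y\) agree.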

8 References

  • Constantine, P. G., Dow, E., and Wang, Q. (2014). Active subspace methods in theory and practice: Applications to kriging surfaces. SIAM J. Sci. Comput., 36(4), A1500–A1524. doi:10.1137/130916138