Active Subspaces
An introduction to dimensionality reduction via active subspaces.
Motivation
Consider some (deterministic) function of interest $f: \mathbb{R}^d \to \mathbb{R}$. While this is a generic setup, the common motivation is that $f$ conceptualizes a computationally costly black-box simulation. There are many things one might wish to do with such a simulation:
- Minimize or maximize $f$ [optimization]
- Integrate $f$ over some domain of interest [numerical integration]
- Find input values that make the outputs of $f$ agree with some observed data [calibration]
- Assess how the function outputs vary as the inputs change [sensitivity analysis]
- Fit a cheaper-to-evaluate surrogate model that approximates $f$ [emulation/surrogate modeling]
- Identify subsets of the input space that result in desirable or undesirable outputs [contour/excursion set estimation]
All of these tasks are made more difficult when the number of inputs $d$ is large. However, it has been routinely noted that such high-dimensional settings often exhibit low-dimensional structure; i.e., low “intrinsic dimensionality.” This is to say that $f$ varies primarily on some low-dimensional manifold embedded in the higher-dimensional ambient space $\mathbb{R}^d$. The active subspace method assumes that the manifold is a linear subspace, and seeks to identify this subspace using gradient evaluations of $f$.
Ridge Functions
Before introducing the method of active subspaces, we start to build some intuition regarding the notion of low-dimensional subspace structure in the map $x \mapsto f(x)$. The extreme case of this idea is a function of the form \begin{align} f(x) &= g(Ax), &&A \in \mathbb{R}^{r \times d}, \end{align} where $r < d$ and $g: \mathbb{R}^r \to \mathbb{R}$. A function of this type is sometimes called a ridge function. The ridge assumption is a rather stringent one, as it implies that $f$ truly only varies on a subspace of $\mathbb{R}^d$, and is constant otherwise. Indeed, let $\tilde{x} \in \mathbb{R}^d$ such that $\tilde{x} \in \mathcal{N}(A)$ (the null space of $A$). Then \begin{align} f(\tilde{x}) &= g(A\tilde{x}) = g(0) \newline f(x + \tilde{x}) &= g(Ax + A\tilde{x}) = g(Ax). \end{align} We can consider the input space as partitioned via the direct sum $\mathbb{R}^d = \mathcal{N}(A) \oplus \mathcal{R}(A^\top)$ (using the fact that the orthogonal complement of the null space is the row space). The ridge function is responsive only when $x$ changes in $\mathcal{R}(A^\top)$, and not at all when $x$ changes only in $\mathcal{N}(A)$. Although we have not defined the concept yet, it seems intuitive that $\mathcal{R}(A^\top)$ would represent the “active subspace” in this case.
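To make this concrete, here is a minimal numerical sketch (the particular $g$, $A$, and dimensions below are arbitrary choices for illustration): perturbing the input along a null-space direction of $A$ leaves the ridge function unchanged, while perturbing it along a row-space direction generally does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary ridge function f(x) = g(Ax) with r = 2 and d = 10.
d, r = 10, 2
A = rng.standard_normal((r, d))
g = lambda t: np.sin(t[0]) + t[1] ** 2
f = lambda x: g(A @ x)

x = rng.standard_normal(d)

# A direction in the null space of A (orthogonal to every row of A).
null_dir = np.linalg.svd(A)[2][-1]
assert np.allclose(A @ null_dir, 0.0, atol=1e-10)

# A direction in the row space of A.
row_dir = A[0] / np.linalg.norm(A[0])

print(f(x + 0.5 * null_dir) - f(x))  # ~0: f is constant along the null space
print(f(x + 0.5 * row_dir) - f(x))   # generally nonzero
```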
Setup and Assumptions
As noted above, the assumption that a function has ridge structure is quite strong. The notion of active subspaces invokes looser assumptions, assuming that, on average, the function primarily varies along a subspace and varies significantly less on its orthogonal complement. To make precise what we mean by “on average”, we introduce a probability measure $\mu$ on $\mathbb{R}^d$. The choice of $\mu$ will be dictated by the problem at hand. Given that $f$ may be highly nonlinear, the question of identifying a subspace that on average captures the majority of the variation is not a simple one. The active subspace method addresses this challenge using gradient information, and thus we assume going forward that $f$ is differentiable. We denote the gradient of $f$ at an input $x$ by \begin{align} \nabla f(x) := \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} & \cdots & \frac{\partial f(x)}{\partial x_d} \end{bmatrix}^\top \in \mathbb{R}^d. \end{align} On a notational note, for a function $h: \mathbb{R}^n \to \mathbb{R}^m$ we use $Dh(x) \in \mathbb{R}^{m \times n}$ to denote the Jacobian matrix. Therefore, when applied to the scalar-valued function $f$, we have the relation $Df(x) = \nabla f(x)^\top$. By observing gradient evaluations at a set of inputs sampled from $\mu$, we can get a sense of how the function varies along each coordinate direction, on average. However, it may be the case that the function varies most significantly on a subspace not aligned with the standard coordinate directions. Thus, the idea of active subspaces is to utilize the information in these gradient observations to try to find such a subspace; that is, to find linear combinations of the coordinate directions along which $f$ exhibits the largest variation. At this point, the idea here should be sounding quite reminiscent of principal components analysis (PCA). Indeed, there are close ties between the two methods. We therefore briefly review PCA in order to better motivate the active subspace algorithm.
PCA Review
We take a brief detour to review the relevant ideas from PCA; I have a whole post on this topic that derives all of the facts stated below.
The Empirical Perspective
Consider a set of points $x_1, \dots, x_n \in \mathbb{R}^d$. Let us assume that these points are centered, in the sense that the empirical mean has been subtracted from each input. We can stack the transpose of these vectors row-wise to obtain a matrix $X \in \mathbb{R}^{n \times d}$. The data has $d$ features, and the goal of PCA is to construct a new set of linear combinations of these features that explain the majority of variation in the data. This is accomplished via an eigendecomposition of the matrix $X^\top X = \hat{V} \hat{\Lambda} \hat{V}^\top$, which is proportional to the empirical covariance $\hat{C} := \frac{1}{n} X^\top X$ of the data. Considering some $r < d$, the first $r$ columns of $\hat{V}$ form a basis for the optimal $r$-dimensional subspace. The low-rank approximations to the data points are thus given by the projection $\hat{x}_i := \hat{V}_r \hat{V}_r^\top x_i$, with $\hat{V}_r \in \mathbb{R}^{d \times r}$ denoting the dominant $r$ eigenvectors of $X^\top X$. This approximation is optimal in the sense that it minimizes the average squared error $\frac{1}{n} \sum_{i=1}^{n} \lVert x_i - B w_i \rVert_2^2$ over all possible orthonormal bases $B$ of $r$-dimensional subspaces and weights $w_i$, where $B \in \mathbb{R}^{d \times r}$ and $w_i \in \mathbb{R}^r$. Notice that the basis vectors are independent of the particular data point, while the weights are $x_i$-dependent.
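As a quick sketch of this empirical construction (with synthetic data and an arbitrary choice of $r$), one can form the matrix proportional to the empirical covariance, take its leading eigenvectors, and project:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: n points in d dimensions with correlated features.
n, d, r = 500, 5, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)                   # center the data

# Eigendecomposition of X^T X (proportional to the empirical covariance).
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]          # sort eigenvalues in decreasing order
V_r = evecs[:, order[:r]]                # dominant r eigenvectors

# Rank-r approximation of each (centered) data point via projection.
X_hat = X @ V_r @ V_r.T
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # average squared error
```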
The Distribution Perspective
The basic PCA algorithm centers on the empirical covariance matrix $\hat{C}$. Instead of working with empirical data, we can alternatively work with the distribution of $x$ directly. Suppose $x$ is a random vector defined with respect to a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Essentially all of the above results still hold, with the covariance $\Sigma := \text{Cov}[x]$ replacing the empirical covariance $\hat{C}$. In this setting, the objective function being minimized is the expected squared error $\mathbb{E} \lVert x - Bw \rVert_2^2$ over all possible orthonormal bases $B \in \mathbb{R}^{d \times r}$ and all random weight vectors $w \in \mathbb{R}^r$. Just as the $w_i$ encoded information about the variation in $x_i$ in the empirical setting, the coefficient vector $w$ is the source of randomness in the low-rank approximation $Bw$ that seeks to capture the variation in $x$.
Motivating Active Subspaces via PCA
PCA on the Gradient
In seeking active subspaces, we might try to perform PCA directly on $x$ in order to capture its dominant directions of variation. However, this approach fails to take into account the function $f$ at all. The input $x$ may vary significantly (with respect to the measure $\mu$) in a certain direction, but this variation might induce essentially no change in the function output. We seek a method that accounts for both the variation in $x$ induced by $\mu$, as well as the action of the map $x \mapsto f(x)$. In fact, it is common to assume that $x$ has been centered and normalized such that \begin{align} &\mathbb{E}[x] = 0, &&\text{Cov}[x] = I. \end{align} This means that the dimensions of $x$ are already uncorrelated; there is nothing to be gained by performing PCA directly on $x$. However, there still may be low-dimensional structure to exploit in the map $x \mapsto f(x)$.
As noted above, the gradient $\nabla f(x)$ provides information on the local variation in the function along the standard coordinate axes. Alluding to the PCA setting, we can think of these directions of variation as “features”, with the goal of finding linear combinations of the features that explain the majority of the variation. Let’s go with this, taking $\nabla f(x)$ to be the target for a PCA analysis. Since we’re starting with an assumed distribution $\mu$ on the inputs, it makes sense to apply the distributional form of PCA, as summarized above. To this end, we start by considering the matrix \begin{align} C := \mathbb{E}\left[\nabla f(x) \nabla f(x)^\top\right] \in \mathbb{R}^{d \times d}, \tag{1} \end{align} which is an uncentered version of $\text{Cov}[\nabla f(x)]$. Note that the expectations here are all with respect to $\mu$. The analogous quantities in the above PCA review are the random vector $x$ and its covariance $\Sigma$, whose roles are now played by $\nabla f(x)$ and $C$. We can think of $\nabla f$ as a random variable mapping from the probability space $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), \mu)$ to $\mathbb{R}^d$. Proceeding as usual, we consider the eigendecomposition $C = V \Lambda V^\top$, with the eigenvalues $\lambda_1 \geq \dots \geq \lambda_d \geq 0$ sorted in decreasing order. After choosing a threshold $r < d$, the first $r$ columns of $V$ become the basis for our active subspace. The $r$-dimensional approximation of the gradient is then given by the projection $V_1 V_1^\top \nabla f(x)$, where $V_1 \in \mathbb{R}^{d \times r}$ contains the first $r$ columns of $V$.
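In practice the matrix $C$ is unknown and is typically approximated by Monte Carlo, averaging gradient outer products at inputs drawn from $\mu$. Below is a minimal sketch of this idea, assuming (purely for illustration) an analytic toy function and a standard Gaussian $\mu$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative toy function f(x) = exp(w @ x), which varies only along w.
d = 5
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
grad_f = lambda x: np.exp(w @ x) * w      # analytic gradient of f

# Monte Carlo estimate of C = E[grad f(x) grad f(x)^T] with x ~ mu = N(0, I).
n = 5000
X = rng.standard_normal((n, d))
G = np.array([grad_f(x) for x in X])      # n x d matrix of gradient samples
C_hat = G.T @ G / n

# Eigendecomposition, with eigenvalues sorted in decreasing order.
evals, evecs = np.linalg.eigh(C_hat)
evals, evecs = evals[::-1], evecs[:, ::-1]

print(np.round(evals, 4))                 # one dominant eigenvalue, rest ~0
print(np.abs(evecs[:, 0] @ w))            # leading eigenvector aligns with w (~1)
```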
Defining the Active Subspace
While I find the PCA comparison quite useful for motivating the method of active subspaces, there is a risk of pushing the analogy too far. In particular, we emphasize that the goal here is not to construct an accurate low-rank approximation of the gradient, but rather to approximate $f(x)$ by a function of the form $g(V_1^\top x)$. To provide some initial justification for this latter goal, we show that, on average, $f$ varies more along the directions defined by the dominant eigenvectors. We let $D_v f(x) := \nabla f(x)^\top v$ denote the directional derivative of $f$ in the direction $v$.
Proposition. Let $(\lambda_j, v_j)$ denote the $j^{\text{th}}$ eigenvalue-eigenvector pair of $C$, as defined above. Then \begin{align} \mathbb{E}\left[(D_{v_j} f(x))^2 \right] &= \lambda_j. \tag{2} \end{align}
Proof.
Using the fact that the $v_j$ have unit norm, we obtain
\begin{align}
\lambda_j
&= \langle C v_j, v_j \rangle \newline
&= v_j^\top \mathbb{E}\left[\nabla f(x) \nabla f(x)^\top\right] v_j \newline
&= \mathbb{E}\left[ v_j^\top \nabla f(x) \nabla f(x)^\top v_j \right] \newline
&= \mathbb{E}\left[ (\nabla f(x)^\top v_j)^2 \right] \newline
&= \mathbb{E}\left[(D_{v_j} f(x))^2 \right] \qquad \blacksquare
\end{align}
In words, this result says that, on average (with respect to $\mu$),
the squared derivative of $f$ in the direction of $v_j$ is given by the
eigenvalue $\lambda_j$. That is, the eigenvalues provide information about the
smoothness of the function in the directions defined by their respective
eigenvectors. Given this, it seems reasonable to discard directions with
eigenvalues of negligible size.
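As a quick sanity check of (2), one can estimate both sides by Monte Carlo. The sketch below assumes an arbitrary toy function and a standard Gaussian $\mu$; the eigenpairs come from a sample estimate of $C$, and the squared directional derivatives are averaged over an independent batch of samples:

```python
import numpy as np

rng = np.random.default_rng(3)

d, n = 4, 20000
B = rng.standard_normal((d, d))

# Analytic gradient of the arbitrary toy function f(x) = sin(b1 @ x) + (b2 @ x)^2.
grad_f = lambda x: np.cos(B[0] @ x) * B[0] + 2.0 * (B[1] @ x) * B[1]

# Estimate C from one batch of samples x ~ mu = N(0, I).
G = np.array([grad_f(x) for x in rng.standard_normal((n, d))])
evals, evecs = np.linalg.eigh(G.T @ G / n)

# Estimate E[(D_{v_j} f)^2] on an independent batch and compare to lambda_j.
G2 = np.array([grad_f(x) for x in rng.standard_normal((n, d))])
for lam, v in zip(evals, evecs.T):
    print(round(lam, 3), round(np.mean((G2 @ v) ** 2), 3))  # columns roughly agree
```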
Definition. Consider partitioning the eigenvalues and eigenvectors as \begin{align} C = V \Lambda V^\top = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \newline 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} V_1^\top \newline V_2^\top \end{bmatrix}, \tag{3} \end{align} where $V_1 \in \mathbb{R}^{d \times r}$, $V_2 \in \mathbb{R}^{d \times (d-r)}$, $\Lambda_1 = \text{diag}(\lambda_1, \dots, \lambda_r)$, and $\Lambda_2 = \text{diag}(\lambda_{r+1}, \dots, \lambda_d)$. We define the active subspace (of dimension $r$) to be the span of the columns of $V_1$; that is, $\text{span}(v_1, \dots, v_r)$. Similarly, we refer to the subspace generated by the columns of $V_2$ as the inactive subspace.
This is the same idea as in PCA: define a cutoff after the first $r$ directions, after which the variation in the thing you care about becomes negligible. The subspace generated by the first $r$ directions is the active subspace, and the columns of $V_1$ form an orthonormal basis for this subspace. The inactive subspace is the orthogonal complement of the active subspace, since $V^\top V = I$.
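A minimal sketch of this partitioning step (the eigenvalues and eigenvectors below are placeholders standing in for those of an estimated $C$; picking $r$ at the largest spectral gap is just one common heuristic):

```python
import numpy as np

# Placeholder spectrum and orthonormal eigenvectors (stand-ins for those of C).
evals = np.array([5.2, 3.1, 0.04, 0.01, 0.005])
V = np.linalg.qr(np.random.default_rng(4).standard_normal((5, 5)))[0]

# One common heuristic: choose r at the largest gap between consecutive eigenvalues.
r = int(np.argmax(evals[:-1] - evals[1:])) + 1

V1, V2 = V[:, :r], V[:, r:]                        # active / inactive bases
Lambda1, Lambda2 = np.diag(evals[:r]), np.diag(evals[r:])
print(r, V1.shape, V2.shape)                       # here r = 2
```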
Understanding the Low-Dimensional Subspace
Having defined the active subspace, we now begin to investigate its basic properties.
Active and Inactive Variables
Since the active subspace is $r$-dimensional, we can represent vectors living in this subspace with $r$ coordinates. It is this fact that allows for useful dimensionality reduction in downstream applications. Given an input $x \in \mathbb{R}^d$, we want to identify the vector in the active subspace that provides the best approximation to $x$. The optimal vector (in a squared error sense) is given by the projection \begin{align} V_1 V_1^\top x &= \sum_{j=1}^{r} \langle x, v_j \rangle v_j. \end{align} Thus, the inner products $\langle x, v_1 \rangle, \dots, \langle x, v_r \rangle$ constitute the coordinates defining the projection of $x$ onto the active subspace. The quantity $V_1^\top x \in \mathbb{R}^r$ stacks these coordinates into a vector. Note that $V_1 V_1^\top x$ describes the same vector (i.e., the projection), but represents the vector in the original $d$-dimensional coordinate system. We can similarly project $x$ onto the inactive subspace via $V_2 V_2^\top x$.
Definition. Let $V_1$ and $V_2$ be defined as in (3). We introduce the notation \begin{align} &y := V_1^\top x, &&z := V_2^\top x, \tag{6} \end{align} and refer to $y$ and $z$ respectively as the active variable and inactive variable.
Note that $y$ and $z$ are simply the coordinates of the projection of $x$ onto the active and inactive subspace, respectively. Since $x$ is a random variable, then so are $y$ and $z$. Noting that the active and inactive subspaces are orthogonal complements, we have that \begin{align} x = V_1 V_1^\top x + V_2 V_2^\top x = V_1 y + V_2 z. \end{align}
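In code, forming the active and inactive variables and reassembling $x$ amounts to a few matrix products. The sketch below uses a random orthogonal matrix as a stand-in for the eigenvector matrix $V$:

```python
import numpy as np

rng = np.random.default_rng(5)

d, r = 6, 2
V = np.linalg.qr(rng.standard_normal((d, d)))[0]   # stand-in for the eigenvector matrix
V1, V2 = V[:, :r], V[:, r:]

x = rng.standard_normal(d)
y = V1.T @ x                  # active variable: r coordinates
z = V2.T @ x                  # inactive variable: d - r coordinates

# x decomposes exactly into its active and inactive components.
assert np.allclose(x, V1 @ y + V2 @ z)
print(V1 @ y)                 # projection of x onto the active subspace
```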
Recall that $f$ is a function of $d$ variables that we wish to approximate with a function of only $r$ variables. Since $f(x) = f(V_1 y + V_2 z)$, the key will be to eliminate the dependence on $z$, a question we will return to shortly. For now, we want to start providing some justification for this idea by demonstrating that $f$ varies more on average as we vary $y$ as opposed to when we vary $z$.
Proposition. Let $y$ and $z$ be the active and inactive variables defined in (6). Let \begin{align} &T_{z}(y) := V_1 y + V_2 z, &&T_{y}(z) := V_1 y + V_2 z \end{align} denote the coordinate transformation from $(y, z)$ to $x$, viewed respectively as a function of $y$ only or of $z$ only. Then, \begin{align} \mathbb{E} \lVert D (f \circ T_{z})(y) \rVert_2^2 &= \lambda_1 + \cdots + \lambda_r \newline \mathbb{E} \lVert D (f \circ T_{y})(z) \rVert_2^2 &= \lambda_{r+1} + \cdots + \lambda_d. \end{align}
Proof. We only prove the result for $y$, as the proof for $z$ is nearly identical. By the chain rule we have \begin{align} D (f \circ T_{z})(y) = Df(x) DT_{z}(y) = Df(x)V_1. \end{align} For succinctness, we will use the notation $\nabla_y f := [D(f \circ T_{z})(y)]^\top = V_1^\top \nabla f(x)$ in the below derivations. Thus, \begin{align} \mathbb{E} \lVert \nabla_{y} f \rVert_2^2 &= \mathbb{E}\left[\text{tr}\left([\nabla_y f] [\nabla_y f]^\top \right) \right] \newline &= \text{tr}\left(\mathbb{E}\left[[\nabla_y f] [\nabla_y f]^\top \right] \right) \newline &= \text{tr}\left(V_1^\top \mathbb{E}\left[\nabla f(x) \nabla f(x)^\top \right] V_1 \right) \newline &= \text{tr}\left(V_1^\top C V_1 \right) \newline &= \text{tr}\left(V_1^\top V \Lambda V^\top V_1 \right) \newline &= \text{tr}(\Lambda_1) &&(\dagger) \newline &= \lambda_1 + \dots + \lambda_r. \end{align}
We obtained $(\dagger)$ by applying the decomposition (3) and using the fact that $V^\top V_1 = \begin{bmatrix} I_r & 0 \end{bmatrix}^\top$. $\blacksquare$
Note that for $D(f \circ T_{z})(y)$, the inactive variable $z$ is viewed as fixed with respect to the derivative operation. However, $z$ is still random, and thus the expectation averages over the randomness in both $y$ and $z$.
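The proposition can also be checked numerically: with a Monte Carlo estimate of $C$ from gradient samples, the averaged squared gradient norms with respect to $y$ and $z$ should roughly match the partial eigenvalue sums. As before, the toy function and standard Gaussian $\mu$ below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

d, r, n = 4, 2, 20000
B = rng.standard_normal((d, d))
grad_f = lambda x: np.cos(B[0] @ x) * B[0] + 2.0 * (B[1] @ x) * B[1]

# Estimate C and split its eigenvectors into active / inactive blocks.
G = np.array([grad_f(x) for x in rng.standard_normal((n, d))])
evals, evecs = np.linalg.eigh(G.T @ G / n)
evals, evecs = evals[::-1], evecs[:, ::-1]          # decreasing order
V1, V2 = evecs[:, :r], evecs[:, r:]

# grad_y (f o T_z) = V1^T grad f(x) and grad_z (f o T_y) = V2^T grad f(x);
# compare averaged squared norms (independent samples) to the eigenvalue sums.
G2 = np.array([grad_f(x) for x in rng.standard_normal((n, d))])
print(np.mean(np.sum((G2 @ V1) ** 2, axis=1)), evals[:r].sum())
print(np.mean(np.sum((G2 @ V2) ** 2, axis=1)), evals[r:].sum())
```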
The Distribution of the (In)Active Variables
As mentioned in the previous section, the active and inactive variables
are functions of the random variable $x \sim \mu$ and hence are themselves
random variables. In this section we investigate the joint distribution
of the random vector $u := (y, z)$. We recall from (6) that this
vector is defined as
\begin{align}
u := \begin{bmatrix} y \newline z \end{bmatrix}
= \begin{bmatrix} V_1^\top x \newline V_2^\top x \end{bmatrix}
= V^\top x,
\end{align}
where $V$ is orthonormal. Thus, the transformation $T(x) := V^\top x$ can be
thought of as rotating the original coordinate system. In particular, note
that $T$ is invertible with $T^{-1}(u) = Vu$. Let's suppose that the measure $\mu$
admits a density function $p(x)$. Since the transformation $T$
is invertible and differentiable, the change-of-variables
formula gives us the density of the random vector $u = T(x)$. Denoting this density
by $\tilde{p}(u)$, we have
\begin{align}
\tilde{p}(u) = p(T^{-1}(u)) \lvert \det DT^{-1}(u) \rvert = p(Vu),
\end{align}
following from the fact that $\lvert \det V \rvert = 1$.
In words, to compute the density of $u$ at a point $u^\prime$, we can
simply evaluate the density of $x$ at the point $Vu^\prime$. There is
no distortion of space here since the transformation is a rotation.
We can also find the measure $\tilde{\mu} := \mu \circ T^{-1}$ associated with the density $\tilde{p}$
by considering, for some Borel set $A \subseteq \mathbb{R}^d$,
\begin{align}
\tilde{\mu}(A)
&= \int_{\mathbb{R}^d} 1_A[u^\prime] \tilde{\mu}(du^\prime) \newline
&= \int_{\mathbb{R}^d} 1_A[u^\prime] (\mu \circ T^{-1})(du^\prime) \newline
&= \int_{\mathbb{R}^d} 1_A[T(x^\prime)] \mu(dx^\prime) \newline
&= \int_{\mathbb{R}^d} 1[V^\top x^\prime \in A] \mu(dx^\prime) \newline
&=\int_{\mathbb{R}^d} 1[x^\prime \in VA] \mu(dx^\prime) \newline
&= \mu(VA),
\end{align}
where I’m defining the set $VA := \{Va : a \in A\}$.
This result is analogous to the density one, and simply says that the distributions
of $x$ and $u$ differ only by a change-of-variables. It is worth emphasizing that
$\tilde{p}(u) = \tilde{p}(y, z)$ is a joint density
over the active and inactive variables. We will shortly be interested in the
marginals and conditionals of this joint distribution.
The Gaussian Case
As usual, things work out very nicely if we work with Gaussians. Let's consider the case where $\mu$ is multivariate Gaussian, of the form $\mathcal{N}(m, \Sigma)$. Since $u = V^\top x$ is a linear transformation of a Gaussian, we immediately have that \begin{align} u \sim \mathcal{N}\left(V^\top m, V^\top \Sigma V\right). \end{align} Therefore, $\tilde{\mu} = \mathcal{N}(V^\top m, V^\top \Sigma V)$. In this case, the marginals and conditionals of $u = (y, z)$ are all available in closed-form using standard facts about Gaussians. If we consider the case where $x$ is standard Gaussian (i.e., $\mu = \mathcal{N}(0, I)$), then $u \sim \mathcal{N}(0, V^\top V) = \mathcal{N}(0, I)$. In this special case, the marginal distributions of $y$ and $z$, and the conditional distributions of $y|z$ and $z|y$, are all also standard Gaussian.
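A quick numerical illustration of the standard Gaussian case (purely a sanity check, using a random orthogonal matrix as a stand-in for $V$): samples of $u = V^\top x$ with $x \sim \mathcal{N}(0, I)$ have mean approximately $0$ and covariance approximately $I$.

```python
import numpy as np

rng = np.random.default_rng(7)

d, n = 5, 100_000
V = np.linalg.qr(rng.standard_normal((d, d)))[0]   # stand-in orthogonal matrix V

X = rng.standard_normal((n, d))                    # rows are samples x ~ N(0, I)
U = X @ V                                          # rows are u = V^T x

print(np.round(U.mean(axis=0), 2))                 # ~ zero vector
print(np.round(np.cov(U, rowvar=False), 2))        # ~ identity matrix
```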