Active Subspaces

An introduction to dimensionality reduction via active subspaces.

Motivation

Consider some (deterministic) function of interest $f: \mathcal{X} \subseteq \mathbb{R}^d \to \mathbb{R}$. While this is a generic setup, the common motivation is that $f$ represents a computationally costly black-box simulation. There are many things one might wish to do with such a simulation (e.g., propagate input uncertainty to the output, perform sensitivity analysis, optimize over the inputs, or calibrate the model to data).

All of these tasks are made more difficult when the number of inputs $d$ is large. However, it has been routinely noted that such high-dimensional settings often exhibit low-dimensional structure; i.e., low “intrinsic dimensionality.” This is to say that $f$ varies primarily on some low-dimensional manifold embedded in the higher-dimensional ambient space $\mathcal{X}$. The active subspace method assumes that this manifold is a linear subspace, and seeks to identify the subspace using gradient evaluations of $f$.

Ridge Functions

Before introducing the method of active subspaces, we start to build some intuition regarding the notion of low-dimensional subspace structure in the map $f$. The extreme case of this idea is a function of the form \begin{align} f(x) &= g(Ax), &&A \in \mathbb{R}^{r \times d}, \end{align} where $r < d$. A function of this type is sometimes called a ridge function. The ridge assumption is a rather stringent one, as it implies that $f$ truly only varies on a subspace of $\mathcal{X}$, and is constant otherwise. Indeed, let $x, \tilde{x} \in \mathcal{X}$ such that $\tilde{x} \in \text{null}(A)$ (the null space of $A$). Then \begin{align} f(\tilde{x}) &= g(A\tilde{x}) = g(0) \newline f(x + \tilde{x}) &= g(Ax + A\tilde{x}) = g(Ax). \end{align} We can consider the input space as partitioned via the direct sum $\mathcal{X} = \text{null}(A) \oplus \text{row}(A)$ (using the fact that the orthogonal complement of the null space is the row space). The ridge function $f$ is responsive only when $x$ changes in $\text{row}(A)$, and not at all when $x$ only changes in $\text{null}(A)$. Although we have not defined the concept yet, it seems intuitive that $\text{row}(A)$ would represent the “active subspace” in this case.
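
As a quick numerical sanity check of this idea, here is a minimal sketch (the particular `g`, `A`, and dimensions are arbitrary illustrative choices) verifying that a ridge function does not respond to perturbations in $\text{null}(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 5, 2

# A hypothetical ridge function f(x) = g(Ax) with A in R^{r x d}.
A = rng.standard_normal((r, d))
g = lambda t: np.sin(t[0]) + t[1] ** 2
f = lambda x: g(A @ x)

# Build a vector in null(A) by removing the component of a random vector
# that lies in row(A).
x = rng.standard_normal(d)
x_tilde = rng.standard_normal(d)
x_tilde -= A.T @ np.linalg.solve(A @ A.T, A @ x_tilde)

print(np.allclose(A @ x_tilde, 0.0))     # x_tilde lies in null(A)
print(np.isclose(f(x + x_tilde), f(x)))  # f is unchanged along null(A)
```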

Setup and Assumptions

As noted above, the assumption that a function has ridge structure is quite strong. The notion of active subspaces rests on a looser assumption: that, on average, the function varies primarily along a subspace and varies significantly less on its orthogonal complement. To make precise what we mean by “on average”, we introduce a probability measure $\mu$ on $\mathcal{X}$. The choice of $\mu$ will be dictated by the problem at hand. Given that $f$ may be highly nonlinear, identifying a subspace that captures the majority of the variation on average is not a simple task. The active subspace method addresses this challenge using gradient information, and thus we assume going forward that $f$ is differentiable. We denote the gradient of $f$ at an input $x \in \mathcal{X}$ by $$\nabla f(x) := \left[D_1 f(x), \dots, D_d f(x) \right]^\top \in \mathbb{R}^d.$$ On a notational note, for a function $\phi: \mathbb{R}^n \to \mathbb{R}^m$ we use $D\phi(x)$ to denote the $m \times n$ Jacobian matrix. Therefore, when applied to the scalar-valued function $f$, we have the relation $\nabla f(x) = Df(x)^\top$. By observing gradient evaluations $\nabla f(x_i)$ at a set of inputs $x_1, \dots, x_n$ sampled from $\mu$, we can get a sense of how the function varies along each coordinate direction, on average. However, it may be the case that the function varies most significantly on a subspace not aligned with the standard coordinate directions. The idea of active subspaces is thus to use the information in these gradient observations to find such a subspace; that is, to find linear combinations of the coordinate directions along which $f$ exhibits the largest variation. At this point, the idea should be sounding quite reminiscent of principal components analysis (PCA). Indeed, there are close ties between the two methods. We therefore briefly review PCA in order to better motivate the active subspace algorithm.
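
To make the sampling of gradient evaluations concrete, here is a minimal sketch; the test function is an arbitrary stand-in, and the finite-difference approximation is just one way to obtain gradients when they are not available analytically.

```python
import numpy as np

def finite_diff_grad(f, x, eps=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    d = x.size
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
d, n = 5, 200

# Illustrative test function and input measure mu = N(0, I).
f = lambda x: np.exp(0.7 * x[0] + 0.3 * x[1])
X = rng.standard_normal((n, d))                        # x_1, ..., x_n ~ mu
grads = np.array([finite_diff_grad(f, x) for x in X])  # row i holds grad f(x_i)
```

Each row of `grads` records how $f$ responds to perturbations along the standard coordinate axes at one sampled input.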

PCA Review

To better motivate the active subspace method, we take a brief detour to review the relevant ideas from PCA. I have a whole post on this topic that derives all of the facts stated below.

The Empirical Perspective

Consider a set of points $x_1, \dots, x_n$ in $\mathbb{R}^d$. Let us assume that these points are centered, in the sense that the empirical mean has been subtracted from each input. We can stack the transposes of these vectors row-wise to obtain a matrix $X \in \mathbb{R}^{n \times d}$. The data has $d$ features, and the goal of PCA is to construct a new set of $r \leq d$ linear combinations of those features that explain the majority of the variation in the data. This is accomplished via an eigendecomposition of the matrix $$\hat{C} = X^\top X = \sum_{i=1}^{n} x_i x_i^\top,$$ which is proportional to the empirical covariance of the data. Writing $\hat{C} = V \Lambda V^\top$ with eigenvalues sorted in decreasing order, the first $r$ columns of $V$ form a basis for the optimal $r$-dimensional subspace. The low-rank approximations to the data points are thus given by the projection $$\hat{x}_i := \sum_{j=1}^{r} \langle x_i, v_j \rangle v_j,$$ with $v_1, \dots, v_r$ denoting the dominant $r$ eigenvectors of $\hat{C}$. This approximation is optimal in the sense that it minimizes the average squared error $$\frac{1}{n} \sum_{i=1}^{n} \left\lVert x_i - \sum_{j=1}^{r} w_{ij} b_j \right\rVert^2_2$$ over all sets of orthonormal vectors $b_1, \dots, b_r$ in $\mathbb{R}^d$ and weights $\{w_{ij}\}$, where $i = 1, \dots, n$ and $j = 1, \dots, r$. Notice that the basis vectors $b_j$ are shared across all data points, while the weights $w_{ij}$ vary with $x_i$.
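
A minimal numpy sketch of the empirical recipe above; the synthetic data set is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 500, 4, 2

# Synthetic data whose variation is concentrated on a 2-dimensional subspace.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
X += 0.05 * rng.standard_normal((n, d))
X -= X.mean(axis=0)                   # center the data

C_hat = X.T @ X                       # proportional to the empirical covariance
evals, V = np.linalg.eigh(C_hat)      # eigh returns eigenvalues in ascending order
evals, V = evals[::-1], V[:, ::-1]    # re-sort in decreasing order

V_r = V[:, :r]                        # dominant r eigenvectors
X_hat = (X @ V_r) @ V_r.T             # rank-r approximation of each data point
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # small average squared error
```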

The Distribution Perspective

The basic PCA algorithm centers on the empirical covariance matrix $\hat{C}$. Instead of working with empirical data, we can alternatively work with the distribution of $x$ directly. Suppose $x = x(\omega)$ is a random vector defined with respect to the probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Essentially all of the above results still hold, with the covariance $C := \text{Cov}[x]$ replacing the empirical covariance $\hat{C}$. In this setting, the objective function being minimized is the expected squared error $$\mathbb{E} \left\lVert x(\omega) - \sum_{j=1}^{r} w_j(\omega) b_j \right\rVert^2_2$$ over all sets of orthonormal vectors $b_1, \dots, b_r$ in $\mathbb{R}^d$ and all $\mathbb{R}^r$-valued random vectors $w = w(\omega)$. Just as the $w_{ij}$ encoded information about the variation in $x$ in the empirical setting, the coefficient vector $w(\omega)$ is the source of randomness in the low-rank approximation that seeks to capture the variation in $x(\omega)$.

Motivating Active Subspaces via PCA

PCA on the Gradient

In seeking active subspaces, we might try to perform PCA directly on the input $x$ in order to capture its dominant directions of variation. However, this approach fails to take into account the function $f$ at all. The input $x$ may vary significantly (with respect to the measure $\mu$) in a certain direction, but this variation might induce essentially no change in the function output. We seek a method that accounts for both the variation in $x$ induced by $\mu$, as well as the action of the map $f$. In fact, it is common to assume that $\mu$ has been centered and normalized such that \begin{align} &\mathbb{E}[x] = 0, &&\text{Cov}[x] = I. \end{align} This means that the dimensions of $x$ are already uncorrelated; there is nothing to be gained by performing PCA directly on $x$. However, there may still be low-dimensional structure to exploit in the map $f$.

As noted above, the gradient $\nabla f(x)$ provides information on the local variation in the function along the standard coordinate axes. Alluding to the PCA setting, we can think of these directions of variation as $d$ “features”, with the goal of finding linear combinations of the features that explain the majority of the variation. Let’s go with this, taking $\nabla f(x)$ to be the target for a PCA analysis. Since we’re starting with an assumed distribution $x \sim \mu$, it makes sense to apply the distributional form of PCA, as summarized above. To this end, we start by considering the matrix $$C := \mathbb{E}\left[\nabla f(x) \, \nabla f(x)^\top\right], \tag{1}$$ which is an uncentered version of $\text{Cov}[\nabla f(x)]$. Note that the expectations here are all with respect to $\mu$. The analogous quantities in the above PCA review are $$x \iff \omega, \qquad \nabla f(x) \iff x(\omega).$$ That is, we can think of $\nabla f$ as a random variable mapping from the probability space $(\mathcal{X}, \mathcal{B}(\mathcal{X}), \mu)$. Proceeding as usual, we consider the eigendecomposition $C = V \Lambda V^\top$, with the eigenvalues sorted in decreasing order. After choosing a threshold $r$, the first $r$ columns of $V$ become the basis for our active subspace. The $r$-dimensional approximation of the gradient is then given by $$\hat{\nabla} f(x) := \sum_{j=1}^{r} \langle \nabla f(x), v_j \rangle v_j.$$
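
In practice the expectation in (1) is rarely available in closed form, so $C$ is typically estimated by a Monte Carlo average of gradient outer products. The following sketch assumes exact gradients of an illustrative test function with built-in ridge-like structure:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, r_true = 10, 2000, 2

# Illustrative function f(x) = sin(a1 . x) + (a2 . x)^2, which only varies
# along the 2-dimensional subspace spanned by a1 and a2.
A = rng.standard_normal((r_true, d))
grad_f = lambda x: np.cos(A[0] @ x) * A[0] + 2.0 * (A[1] @ x) * A[1]

# Monte Carlo estimate of C = E[grad f(x) grad f(x)^T] with x ~ mu = N(0, I).
X = rng.standard_normal((n, d))
G = np.array([grad_f(x) for x in X])          # n x d matrix of sampled gradients
C_hat = G.T @ G / n

# Eigendecomposition C_hat = V Lambda V^T, eigenvalues in decreasing order.
evals, V = np.linalg.eigh(C_hat)
evals, V = evals[::-1], V[:, ::-1]
print(np.round(evals, 4))   # expect a sharp drop after the first two eigenvalues
```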

Defining the Active Subspace

While I find the PCA comparison quite useful for motivating the method of active subspaces, there is a risk of pushing the analogy too far. In particular, we emphasize that the goal here is not to construct an accurate low-rank approximation of the gradient, but rather to approximate $f(x)$ by a function that looks like $g(Ax)$. To provide some initial justification for this latter goal, we show that, on average, $f$ varies more along the directions defined by the dominant eigenvectors. We let $D_v f(x) := \nabla f(x)^\top v$ denote the directional derivative of $f$ in the direction $v$.

Proposition. Let $(\lambda_j, v_j)$ denote the $j^{\text{th}}$ eigenvalue-eigenvector pair of $C$, as defined above. Then \begin{align} \mathbb{E}\left[(D_{v_j} f(x))^2 \right] &= \lambda_j. \tag{2} \end{align}

Proof. Using the eigenvector relation $Cv_j = \lambda_j v_j$ and the fact that the $v_j$ have unit norm, we obtain
\begin{align} \lambda_j &= \langle C v_j, v_j \rangle \newline &= v_j^\top \mathbb{E}\left[\nabla f(x) \nabla f(x)^\top\right] v_j \newline &= \mathbb{E}\left[ v_j^\top \nabla f(x) \nabla f(x)^\top v_j \right] \newline &= \mathbb{E}\left[ (\nabla f(x)^\top v_j)^2 \right] \newline &= \mathbb{E}\left[(D_{v_j} f(x))^2 \right] \qquad \blacksquare \end{align} In words, this result says that, on average (with respect to $\mu$), the squared derivative in the direction of $v_j$ is given by the eigenvalue $\lambda_j$. That is, the eigenvalues provide information about the smoothness of the function in the directions defined by their respective eigenvectors. Given this, it seems reasonable to discard directions with eigenvalues of negligible size.
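
As a numerical spot check of (2), the sketch below uses a hypothetical quadratic test function; the same algebra as in the proof applies verbatim to the Monte Carlo estimate of $C$, so the sample averages of squared directional derivatives match its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 6, 50_000

# Quadratic test function f(x) = 0.5 * x^T B x, so grad f(x) = B x.
B = rng.standard_normal((d, d))
B = 0.5 * (B + B.T)

X = rng.standard_normal((n, d))   # x ~ mu = N(0, I)
G = X @ B                         # row i holds grad f(x_i)
C_hat = G.T @ G / n

evals, V = np.linalg.eigh(C_hat)
evals, V = evals[::-1], V[:, ::-1]

# Average of (D_{v_j} f(x))^2 = (grad f(x)^T v_j)^2 over the samples.
avg_sq_dir_derivs = np.mean((G @ V) ** 2, axis=0)
print(np.allclose(avg_sq_dir_derivs, evals))   # True, mirroring (2)
```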

Definition. Consider partitioning the eigenvalues and eigenvectors as \begin{align} C = V \Lambda V^\top = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \newline 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} V_1^\top \newline V_2^\top \end{bmatrix}, \tag{3} \end{align} where $V_1 \in \mathbb{R}^{d \times r}$ and $\Lambda_1 \in \mathbb{R}^{r \times r}$. We define the active subspace (of dimension $r$) to be the span of the columns of $V_1$, $$\text{span}(v_1, \dots, v_r). \tag{4}$$ Similarly, we refer to the subspace generated by the columns of $V_2$ as the inactive subspace.

This is the same idea as in PCA: define a cutoff after the first $r$ directions, after which the variation in the thing you care about becomes negligible. The subspace generated by the first $r$ directions is the active subspace, and the columns of $V_1$ form an orthonormal basis for this subspace. The inactive subspace is the orthogonal complement of the active subspace, since $V_1^\top V_2 = 0$.
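
The definition leaves the choice of $r$ open. One common heuristic (an assumption on my part, not something dictated by the definition itself) is to place the cutoff at the largest gap in the eigenvalue spectrum; a sketch:

```python
import numpy as np

def partition_active_subspace(C_hat):
    """Split the eigenvectors of C_hat into active (V1) and inactive (V2) blocks.

    The dimension r is chosen at the largest drop in the log-eigenvalue
    spectrum, one common heuristic; in practice the spectrum is often
    inspected by eye instead.
    """
    evals, V = np.linalg.eigh(C_hat)
    evals, V = evals[::-1], V[:, ::-1]                 # decreasing order
    gaps = np.diff(np.log(np.maximum(evals, 1e-16)))   # negative values
    r = int(np.argmin(gaps)) + 1                       # largest drop
    V1, V2 = V[:, :r], V[:, r:]
    Lambda1, Lambda2 = np.diag(evals[:r]), np.diag(evals[r:])
    return V1, V2, Lambda1, Lambda2, r
```

Applied to the `C_hat` estimated earlier, this returns the blocks appearing in (3) along with the selected dimension $r$.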

Understanding the Low-Dimensional Subspace

Having defined the active subspace, we now begin to investigate its basic properties.

Active and Inactive Variables

Since the active subspace is $r$-dimensional, we can represent vectors living in this subspace with $r \leq d$ coordinates. It is this fact that allows for useful dimensionality reduction in downstream applications. Given an input $x \in \mathbb{R}^d$, we want to identify the vector in the active subspace that provides the best approximation to $x$. The optimal vector (in a squared error sense) is given by the projection $$\text{proj}_{V_1}(x) := \sum_{j=1}^{r} \langle x, v_j \rangle v_j = V_1 V_1^\top x. \tag{5}$$ Thus, $\langle x, v_1 \rangle, \dots, \langle x, v_r \rangle$ constitute the $r$ coordinates defining the projection of $x$ onto the active subspace. The quantity $V_1^\top x \in \mathbb{R}^r$ stacks these coordinates into a vector. Note that $V_1 V_1^\top x$ describes the same vector (i.e., the projection), but represents it in the original $d$-dimensional coordinate system. We can similarly project onto the inactive subspace via $V_2 V_2^\top x$.

Definition. Let $V_1$ and $V_2$ be defined as in (3). We introduce the notation \begin{align} &y := V_1^\top x, &&z := V_2^\top x, \tag{6} \end{align} and refer to $y \in \mathbb{R}^r$ and $z \in \mathbb{R}^{d-r}$ respectively as the active variable and inactive variable.

Note that $y$ and $z$ are simply the coordinates of the projection of $x$ onto the active and inactive subspace, respectively. Since $x \sim \mu$ is a random variable, so are $y$ and $z$. Noting that the active and inactive subspaces are orthogonal complements, we have that

$$x = \text{proj}_{V_1}(x) + \text{proj}_{V_2}(x) = V_1 V_1^\top x + V_2 V_2^\top x = V_1 y + V_2 z. \tag{7}$$

Recall that $f$ is a function of $d$ variables that we wish to approximate with a function of only $r$ variables. Since $$f(x) = f(V_1 V_1^\top x + V_2 V_2^\top x) = f(V_1 y + V_2 z),$$ the key will be to eliminate the dependence on $z$, a question we will return to shortly. For now, we want to start providing some justification for this idea by demonstrating that, on average, $f$ varies more when we vary $y$ than when we vary $z$.
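
The coordinate bookkeeping in (5)-(7) amounts to a few matrix products. The sketch below uses an arbitrary orthogonal matrix in place of the eigenvector matrix of $C$, since the round-trip identity holds for any such $V$:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 8, 3

# Arbitrary orthogonal matrix standing in for the eigenvectors of C.
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
V1, V2 = V[:, :r], V[:, r:]

x = rng.standard_normal(d)
y, z = V1.T @ x, V2.T @ x            # active and inactive coordinates, as in (6)
x_rebuilt = V1 @ y + V2 @ z          # reconstruction, as in (7)
print(np.allclose(x, x_rebuilt))     # True: nothing is lost by changing coordinates
```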

Proposition. Let $y$ and $z$ be the active and inactive variables defined in (6). Let \begin{align} &T_{z}(y) := V_1 y + V_2 z, &&T_{y}(z) := V_1 y + V_2 z \end{align} denote the coordinate transformation from $(y,z)$ to $x$, viewed respectively as a function of $y$ only or of $z$ only. Then, \begin{align} \mathbb{E} \lVert D (f \circ T_{z})(y) \rVert_2^2 &= \lambda_1 + \cdots + \lambda_r \newline \mathbb{E} \lVert D (f \circ T_{y})(z) \rVert_2^2 &= \lambda_{r+1} + \cdots + \lambda_d. \end{align}

Proof. We only prove the result for $T_z(y)$, as the proof for $T_y(z)$ is nearly identical. By the chain rule we have \begin{align} D (f \circ T_{z})(y) = Df(x) DT_{z}(y) = Df(x)V_1. \end{align} For succinctness, we will use the notation $$\nabla_{y} f := [D (f \circ T_{z})(y)]^\top = V_1^\top \nabla f(x)$$ in the derivations below. Thus, \begin{align} \mathbb{E} \lVert \nabla_{y} f \rVert_2^2 &= \mathbb{E}\left[\text{tr}\left([\nabla_y f] [\nabla_y f]^\top \right) \right] \newline &= \text{tr}\left(\mathbb{E}\left[[\nabla_y f] [\nabla_y f]^\top \right] \right) \newline &= \text{tr}\left(V_1^\top \mathbb{E}\left[\nabla f(x) \nabla f(x)^\top \right] V_1 \right) \newline &= \text{tr}\left(V_1^\top C V_1 \right) \newline &= \text{tr}\left(V_1^\top V \Lambda V^\top V_1 \right) \newline &= \text{tr}(\Lambda_1) &&(\dagger) \newline &= \lambda_1 + \dots + \lambda_r. \end{align}

We obtained $(\dagger)$ by applying the decomposition (3). $\qquad \blacksquare$

Note that for $T_{z}(y)$, the inactive variable is viewed as fixed with respect to the derivative operation. However, $z$ is still random, and thus the expectation $\mathbb{E} \lVert D (f \circ T_{z})(y) \rVert_2^2$ averages over the randomness in both $y$ and $z$.
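
The quadratic example from before can also be used to spot-check this proposition: averaging $\lVert V_1^\top \nabla f(x) \rVert_2^2$ and $\lVert V_2^\top \nabla f(x) \rVert_2^2$ over the samples reproduces the two partial eigenvalue sums (again with the Monte Carlo estimate of $C$ standing in for the exact matrix).

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, r = 6, 50_000, 2

# Quadratic test function f(x) = 0.5 * x^T B x with grad f(x) = B x.
B = rng.standard_normal((d, d))
B = 0.5 * (B + B.T)

X = rng.standard_normal((n, d))   # x ~ N(0, I)
G = X @ B                         # rows are grad f(x_i)

C_hat = G.T @ G / n
evals, V = np.linalg.eigh(C_hat)
evals, V = evals[::-1], V[:, ::-1]
V1, V2 = V[:, :r], V[:, r:]

# E||grad_y f||^2 and E||grad_z f||^2, estimated by sample averages.
print(np.mean(np.sum((G @ V1) ** 2, axis=1)), evals[:r].sum())   # equal up to float
print(np.mean(np.sum((G @ V2) ** 2, axis=1)), evals[r:].sum())   # equal up to float
```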

The Distribution of the (In)Active Variables

As mentioned in the previous section, the active and inactive variables are functions of the random variable $x \sim \mu$ and hence are themselves random variables. In this section we investigate the joint distribution of the random vector $u := (y, z)^\top$. We recall from (6) that this vector is defined as \begin{align} u := \begin{bmatrix} y \newline z \end{bmatrix} = \begin{bmatrix} V_1^\top x \newline V_2^\top x \end{bmatrix} = V^\top x, \end{align} where $V$ is orthogonal. Thus, the transformation $u = T(x) := V^\top x$ can be thought of as a rotation (possibly composed with a reflection) of the original coordinate system. In particular, note that $V$ is invertible with $V^{-1} = V^\top$. Let’s suppose that the measure $\mu$ admits a density function $\rho$. Since the transformation $T$ is invertible and differentiable, the change-of-variables formula gives us the density of the random vector $u$. Denoting this density by $\tilde{\rho}$, we have $$\tilde{\rho}(u^\prime) = \rho(T^{-1}(u^\prime)) \lvert \text{det}(DT^{-1}(u^\prime)) \rvert = \rho(Vu^\prime) = \rho(x^\prime),$$ where $x^\prime := Vu^\prime$, following from the fact that $\lvert \text{det}(DT^{-1}(u^\prime)) \rvert = \lvert \text{det}(V) \rvert = 1$. In words, to compute the density of $u$ at a point $u^\prime$, we can simply evaluate the density of $x$ at the point $Vu^\prime$. There is no distortion of space here since the transformation is orthogonal. We can also find the measure $\tilde{\mu}$ associated with the density $\tilde{\rho}$ by considering, for a Borel set $A \subset \mathbb{R}^d$,
\begin{align} \tilde{\mu}(A) &= \int_{\mathbb{R}^d} 1_A[u^\prime] \tilde{\mu}(du^\prime) \newline &= \int_{\mathbb{R}^d} 1_A[u^\prime] (\mu \circ T^{-1})(du^\prime) \newline &= \int_{\mathbb{R}^d} 1_A[T(x^\prime)] \mu(dx^\prime) \newline &= \int_{\mathbb{R}^d} 1[V^\top x^\prime \in A] \mu(dx^\prime) \newline &=\int_{\mathbb{R}^d} 1[x^\prime \in VA] \mu(dx^\prime) \newline &= \mu(VA), \end{align} where I’m defining the set $$VA := \{V u^\prime: u^\prime \in A \}.$$ This result is analogous to the density one, and simply says that the distributions of $x$ and $u$ differ only by a change of variables. It is worth emphasizing that $\tilde{\rho}(u^\prime) = \tilde{\rho}(y^\prime, z^\prime)$ is a joint density over the active and inactive variables. We will shortly be interested in the marginals and conditionals of this joint distribution.

The Gaussian Case

As usual, things work out very nicely if we work with Gaussians. Let’s consider the case where $\mu$ is multivariate Gaussian, of the form $x \sim \mathcal{N}(m, \Sigma)$. Since $u = V^\top x$ is a linear transformation of a Gaussian, we immediately have that $$u \sim \mathcal{N}(V^\top m, V^\top \Sigma V). \tag{8}$$ Therefore, $\tilde{\rho}(u^\prime) = \mathcal{N}(u^\prime \mid V^\top m, V^\top \Sigma V)$. In this case, the marginals and conditionals of $\tilde{\rho}$ are all available in closed form using standard facts about Gaussians. If we consider the case where $\mu$ is standard Gaussian (i.e., $x \sim \mathcal{N}(0,I)$), then $u \sim \mathcal{N}(0, V^\top V) = \mathcal{N}(0, I)$. In this special case, the marginal distributions of $y, z$ and the conditional distributions of $y \mid z$, $z \mid y$ are all also standard Gaussian.
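
A quick empirical check of the standard Gaussian case; the orthogonal matrix $V$ here is arbitrary, standing in for the eigenvector matrix of $C$:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 5, 200_000

V, _ = np.linalg.qr(rng.standard_normal((d, d)))   # arbitrary orthogonal matrix
X = rng.standard_normal((n, d))                    # samples of x ~ N(0, I)
U = X @ V                                          # row i holds u_i = V^T x_i

# Sample mean near 0 and sample covariance near the identity, consistent
# with u ~ N(0, V^T V) = N(0, I).
print(np.round(U.mean(axis=0), 2))
print(np.round(np.cov(U, rowvar=False), 2))
```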

References