Principal Components Analysis

I derive the PCA decomposition from both a minimum reconstruction error and maximum variance perspective. I also discuss a statistical interpretation of PCA.

Part 1: Formulating and Solving the PCA Optimization Problem

Setup and Notation

Suppose that we have data \mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^D, stacked into the rows of a matrix X \in \mathbb{R}^{N \times D}. Our task is to find a subspace of smaller dimension R < D such that the projection of the data points onto the subspace retains as much information as possible. By restricting our attention to orthonormal bases for the low-dimensional subspace, we reduce the problem to finding a set of orthonormal basis vectors \begin{align} &\mathbf{b}_1, \dots, \mathbf{b}_R \in \mathbb{R}^D, &&\langle \mathbf{b}_r, \mathbf{b}_s \rangle = \delta_{r,s}. \end{align} Define B \in \mathbb{R}^{D \times R} to be the matrix with r^{\text{th}} column equal to \mathbf{b}_r. The subspace generated by the basis B is given by \text{span}(B) := \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R). Throughout this post I will abuse notation by referring to the matrix B when actually talking about the set of vectors \{\mathbf{b}_1, \dots, \mathbf{b}_R\}. Since there is no a priori reason to assume that the data is centered, we should also allow for the subspace to be shifted by some intercept \mathbf{w}_0 \in \mathbb{R}^D, resulting in the affine space \mathbf{w}_0 + \text{span}(B) = \left\{\mathbf{w}_0 + \sum_{r=1}^{R} w_r \mathbf{b}_r : w_1, \dots, w_R \in \mathbb{R} \right\}. Loosely speaking, the task is to find the basis B, intercept \mathbf{w}_0, and pointwise weights \mathbf{w}_1, \dots, \mathbf{w}_N \in \mathbb{R}^R such that \begin{align} \mathbf{x}_n &\approx \mathbf{w}_0 + \sum_{r=1}^{R} (\mathbf{w}_n)_r \mathbf{b}_r &&\forall n=1,\dots,N \newline &= \mathbf{w}_0 + B\mathbf{w}_n. \end{align} To formalize this notion, PCA measures the error in the above approximation using squared Euclidean distance, summed over the N data points. To further simplify notation, we stack the row vectors \mathbf{w}_n^\top into the rows of a matrix W \in \mathbb{R}^{N \times R}. With all of this notation established, we can state that PCA solves the optimization problem \text{argmin}_{B, W, \mathbf{w}_0} \sum_{n=1}^{N} \lVert \mathbf{x}_n - (\mathbf{w}_0 + B\mathbf{w}_n) \rVert_2^2, \tag{1} where the basis B is constrained to be orthonormal. As we will see, this optimization naturally breaks down into two distinct problems which can be solved sequentially:

  1. Given the basis BB and intercept w0\mathbf{w}_0, find the optimal basis coefficients wn\mathbf{w}_n corresponding to each data point xn\mathbf{x}_n.
  2. Find the optimal basis and intercept.

Part of the popularity of PCA stems from the fact that both problems can be solved in closed-form. Let us consider both problems in turn.

Optimizing the Basis Coefficients

Let us first consider w0\mathbf{w}_0 and BB to be fixed, meaning that we are fixing an affine subspace of dimension RR. We seek to find the optimal way to represent the data XX in this lower-dimensional space. As we will show, the Euclidean objective used by PCA implies that this problem reduces to straightforward orthogonal projection. For now, let xnc:=xnw0\mathbf{x}^c_n := \mathbf{x}_n - \mathbf{w}_0 denote the centered data points (we will deal with the intercept shortly). We are thus considering the problem argminWn=1NxncBwn22(2) \text{argmin}_{W} \sum_{n=1}^{N} \lVert \mathbf{x}^c_n - B\mathbf{w}_n \rVert_2^2 \tag{2} Observe that wn\mathbf{w}_n only appears in the nthn^{\text{th}} term of the sum, meaning that we can consider each summand independently,
argminwnxncBwn22. \text{argmin}_{\mathbf{w}_n} \lVert \mathbf{x}^c_n - B\mathbf{w}_n \rVert_2^2. In words, we seek the linear combination of the basis vectors BB that results in minimal Euclidean distance from xnc\mathbf{x}^c_n; this is a standard orthogonal projection problem from linear algebra. Since the basis vectors are orthonormal, the optimal projection coefficients are given by \begin{align} &(\mathbf{w}_n)_r = \langle \mathbf{x}_n^c, \mathbf{b}_r \rangle, &&\mathbf{w}_n = B^\top \mathbf{x}_n^c \end{align} which can be written succinctly for all data points by stacking the wn\mathbf{w}_n^\top as rows in a matrix WW; i.e., W:=XcB, W := X^c B, with XcX^c denoting the centered data matrix with rows set to the (xnc)(\mathbf{x}^c_n)^\top.
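Below is a minimal numerical sketch of this projection step (NumPy; the array names X, Xc, B, W are my own and simply follow the notation above, and I center with the sample mean purely for illustration, though any fixed intercept would do). For a fixed intercept and orthonormal basis, the optimal coefficients are obtained by a single matrix product, W = X^c B.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 100, 5, 2

X = rng.normal(size=(N, D))          # data, rows are observations
w0 = X.mean(axis=0)                  # a fixed intercept (here: the sample mean)
Xc = X - w0                          # centered data

# Any matrix with orthonormal columns serves as a basis B (D x R).
B, _ = np.linalg.qr(rng.normal(size=(D, R)))

# Optimal coefficients: W = Xc B, i.e. w_n = B^T x_n^c for each n.
W = Xc @ B

# Sanity check: each w_n minimizes ||x_n^c - B w||_2, so the residual
# is orthogonal to the columns of B.
residual = Xc - W @ B.T
print(np.allclose(residual @ B, 0.0))  # True
```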

Optimizing the Basis

In the previous section, we saw that for a fixed basis and intercept, optimizing the basis weights reduced to an orthogonal projection problem. In this section we show that with the weights fixed at their optimal values, optimizing the basis reduces to solving a sequence of eigenvalue problems. To be clear, we are now considering the problem

argminBn=1NxncBwn22,(3) \text{argmin}_{B} \sum_{n=1}^{N} \lVert \mathbf{x}^c_n - B\mathbf{w}^*_n \rVert_2^2, \tag{3} where the wn\mathbf{w}^*_n are now fixed at the optimal values satisfying (2); i.e., wn=Bxnc\mathbf{w}^*_n = B^\top \mathbf{x}^c_n. However, in the derivations below we will just write wn=wn\mathbf{w}_n = \mathbf{w}^*_n to keep the notation lighter. Note that we are still treating w0\mathbf{w}_0 as fixed for the time being. We will make another notational simplification in this section by writing xn=xnc\mathbf{x}_n = \mathbf{x}_n^c. Just keep in mind that throughout this section, xn\mathbf{x}_n should be interpreted as xnw0\mathbf{x}_n - \mathbf{w}_0.

This problem is also referred to as minimizing the reconstruction error, since \lVert \mathbf{x}_n - \mathbf{\hat{x}}_n \rVert_2 := \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2 is the error between the original data point \mathbf{x}_n and the D-dimensional vector \mathbf{\hat{x}}_n, which can be thought of as an approximation to \mathbf{x}_n reconstructed from its lower-dimensional representation \mathbf{w}_n. The key here is to re-write this objective function so that the optimization problem takes the form of an eigenvalue problem, which is something that we already know how to solve (see the appendix, results (A3) and (A4)).

To start, we extend the orthonormal set b1,,bR\mathbf{b}_1, \dots, \mathbf{b}_R to an orthonormal basis b1,,bD\mathbf{b}_1, \dots, \mathbf{b}_D for RD\mathbb{R}^D. Now we can write the original data point xn\mathbf{x}_n and its approximation x^n\mathbf{\hat{x}}_n with respect to this basis as \begin{align} &\mathbf{x}_n = \sum_{r=1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r, &&\mathbf{\hat{x}}_n = \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r \end{align}

and hence the residual xnx^n\mathbf{x}_n - \mathbf{\hat{x}}_n is given by \begin{align} \mathbf{x}_n - \mathbf{\hat{x}}_n &= \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r. \end{align}

Thus, the objective function in (3) can be written as \begin{align} \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{\hat{x}}_n \rVert_2^2 &= \sum_{n=1}^{N} \bigg\lVert \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r \bigg\rVert_2^2 = \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 \lVert \mathbf{b}_r \rVert_2^2 = \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2, \end{align}

where the second and third equalities use the facts that the br\mathbf{b}_r are orthogonal and of unit norm, respectively.

We could continue working with this formulation, but at this point it is convenient to re-write the minimization problem as an equivalent maximization problem. The residual \mathbf{x}_n - \mathbf{\hat{x}}_n is orthogonal to the projection \mathbf{\hat{x}}_n, so by the Pythagorean theorem \lVert \mathbf{x}_n \rVert_2^2 = \lVert \mathbf{\hat{x}}_n \rVert_2^2 + \lVert \mathbf{x}_n - \mathbf{\hat{x}}_n \rVert_2^2. Since \lVert \mathbf{x}_n \rVert_2^2 is fixed, minimizing the squared reconstruction error is equivalent to maximizing \lVert \mathbf{\hat{x}}_n \rVert_2^2. More rigorously, we have

\begin{align} \text{argmin}_B \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 &= \text{argmin}_B \sum_{n=1}^{N} \left(\sum_{r=1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 - \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2\right) \newline &= \text{argmin}_B \sum_{n=1}^{N} \left(\lVert \mathbf{x}_n \rVert_2^2 - \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2\right) \newline &= \text{argmax}_B \sum_{n=1}^{N} \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2. \tag{4} \end{align}

We can now re-write the squared inner product to obtain \begin{align} \sum_{n=1}^{N} \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 &= \sum_{n=1}^{N} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbf{x}_n \mathbf{x}_n^\top \mathbf{b}_r = \sum_{r=1}^{R} \mathbf{b}_r^\top \left(\sum_{n=1}^{N}\mathbf{x}_n \mathbf{x}_n^\top\right) \mathbf{b}_r = \sum_{r=1}^{R} \mathbf{b}_r^\top (X^\top X) \mathbf{b}_r, \end{align}

where the final step uses the identity \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top = X^\top X. We have managed to re-write (3) as \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top (X^\top X) \mathbf{b}_r, \tag{5} where we recall that this is also subject to the constraint that the columns of B are orthonormal.
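As a quick numerical check of this algebra (a NumPy sketch with illustrative variable names; the data is treated as already centered), the double sum of squared inner products agrees with the quadratic-form expression in (5):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, R = 50, 4, 2
X = rng.normal(size=(N, D))                    # pretend this is already centered
B, _ = np.linalg.qr(rng.normal(size=(D, R)))   # orthonormal basis

lhs = sum((X[n] @ B[:, r]) ** 2 for n in range(N) for r in range(R))
rhs = sum(B[:, r] @ (X.T @ X) @ B[:, r] for r in range(R))

print(np.isclose(lhs, rhs))                                 # True
print(np.isclose(lhs, np.linalg.norm(X @ B, "fro") ** 2))   # also equals ||XB||_F^2
```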

Before proceeding, we note that X^\top X is a positive semi-definite matrix, whose eigenvalues we denote \lambda_1, \dots, \lambda_D, sorted in decreasing order. Note that the eigenvalues are all non-negative due to the positive semi-definiteness. Let \mathbf{e}_1, \dots, \mathbf{e}_D denote the respective eigenvectors, normalized to have unit norm. These vectors can be chosen to be orthogonal, as guaranteed by the Spectral Theorem.

We now notice in (5) that the objective function has been decomposed into R different terms, each of which only depends on a single \mathbf{b}_r. However, these do not constitute R independent optimization problems, as they are all coupled through the orthogonality constraint. We will thus consider solving them in a recursive fashion, beginning with the first term, \text{argmax}_{\lVert \mathbf{b}_1 \rVert_2=1} \mathbf{b}_1^\top (X^\top X) \mathbf{b}_1 = \text{argmax}_{\lVert \mathbf{b}_1 \rVert_2=1} \lVert X \mathbf{b}_1 \rVert_2^2. This is an eigenvalue problem! It is precisely of the form (A4) (see appendix) and so we apply that result to conclude that the optimal argument is \mathbf{b}_1 = \mathbf{e}_1 with associated optimal value \lambda_1 (note the objective here is the squared norm, in contrast to the statement in the appendix). Taking this as the base case, we now proceed inductively. Assume that at the r^{\text{th}} problem in the sequence, the solution is given by (\mathbf{b}_1, \dots, \mathbf{b}_r) = (\mathbf{e}_1, \dots, \mathbf{e}_r). We must show the solution to the (r+1)^{\text{st}} problem is \mathbf{e}_{r+1}. Under the inductive hypothesis, this problem is constrained so that \mathbf{b}_{r+1} is orthogonal to each of \mathbf{e}_1, \dots, \mathbf{e}_r; i.e., we require \mathbf{b}_{r+1} \perp \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_r). If we denote \mathcal{E}_{r} := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_r) and \mathcal{E}^{\perp}_{r} the orthogonal complement of \mathcal{E}_{r}, then a succinct way to write the orthogonality constraint is \mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_r. The problem can thus be written as \begin{align} \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \lVert X \mathbf{b}_{r+1} \rVert_2^2, \tag{6} \end{align} which is another eigenvalue problem, precisely of the form (A3). Using this result from the appendix, we conclude that this is solved by \mathbf{b}_{r+1} = \mathbf{e}_{r+1}, with the maximal objective value \lambda_{r+1}.

That was a lot, so before moving on let’s briefly summarize. First of all, recall that I have been abusing notation by writing \mathbf{x}_n where I should be writing \mathbf{x}_n^c = \mathbf{x}_n - \mathbf{w}_0. In summarizing the result here I will make this correction. We have considered the problem of finding the optimal orthonormal basis B, for any fixed \mathbf{w}_0 \in \mathbb{R}^D, but with the \mathbf{w}_n set to their optimal values satisfying (2); i.e., \mathbf{w}_n = B^\top \mathbf{x}^c_n. Given this, we showed that the reconstruction error (3) is minimized by setting B equal to the matrix with columns given by the dominant R (normalized) eigenvectors of (X^c)^\top X^c. We arrived at this solution by showing that the error minimization problem (3) could be re-written as the maximization problem (5), which in turn could be viewed as a sequence of R eigenvalue problems.
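The following sketch (NumPy; variable names are my own, and the mean is used as the intercept, which is justified in the next section) assembles this result: it computes the top R eigenvectors of (X^c)^\top X^c and uses them as the PCA basis.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, R = 200, 6, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated data

w0 = X.mean(axis=0)            # intercept (shown below to be an optimal choice)
Xc = X - w0

# Eigendecomposition of the symmetric matrix (X^c)^T X^c.
evals, evecs = np.linalg.eigh(Xc.T @ Xc)      # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]               # re-sort in decreasing order
B = evecs[:, order[:R]]                       # dominant R eigenvectors as columns

W = Xc @ B                                    # optimal coefficients
reconstruction = w0 + W @ B.T
error = np.sum((X - reconstruction) ** 2)

# The minimal reconstruction error equals the sum of the discarded eigenvalues.
print(np.isclose(error, np.sum(evals[order[R:]])))  # True
```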

Optimizing the Intercept

The last ingredient we are missing to solve (1) is the optimal value of \mathbf{w}_0, which has up to this point been viewed as fixed in the above derivations. At first glance, this problem might seem like somewhat of an afterthought, but there are some subtleties worth exploring here.

The problem we are now considering is \text{argmin}_{\mathbf{w}_0} \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{w}_0 - B\mathbf{w}^*_n \rVert_2^2, \tag{7} with \mathbf{w}^*_n denoting the optimal weights \mathbf{w}^*_n = B^\top (\mathbf{x}_n - \mathbf{w}_0) derived above (these derivations go through for any orthonormal basis B). Plugging in this expression for \mathbf{w}^*_n gives \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{w}_0 - B\mathbf{w}^*_n \rVert_2^2 = \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{w}_0 - BB^\top (\mathbf{x}_n - \mathbf{w}_0) \rVert_2^2 = \sum_{n=1}^{N} \lVert (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) \rVert_2^2. Computing the gradient of this expression with respect to \mathbf{w}_0 and setting it equal to zero yields the optimality condition \sum_{n=1}^{N} (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) = 0, where we have used the fact that (I - BB^\top)^2 = (I - BB^\top) (since I - BB^\top is a projection matrix; see appendix). By linearity we then have \sum_{n=1}^{N} (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) = (I - BB^\top) \sum_{n=1}^{N} (\mathbf{x}_n - \mathbf{w}_0) = (I - BB^\top)(N \bar{\mathbf{x}} - N\mathbf{w}_0), \tag{8} where we have defined \bar{\mathbf{x}} := \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, the empirical mean of the data. Since \mathbf{w}_0 is optimal when (8) is equal to zero, this leads to the condition (I - BB^\top)(\bar{\mathbf{x}} - \mathbf{w}_0) = 0, or equivalently, \bar{\mathbf{x}} - \mathbf{w}_0 \in \text{Null}(I - BB^\top). Noting that the null space is non-trivial, since \text{Null}(I - BB^\top) = \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R) (again, see the appendix section on projection matrices), we conclude that there are infinitely many optimal solutions! Using the basis \mathbf{b}_1, \dots, \mathbf{b}_R for the null space, we can characterize the set of optimal \mathbf{w}_0 as those satisfying \bar{\mathbf{x}} - \mathbf{w}_0 \in \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R), or, more explicitly, all vectors within the affine space \mathbf{w}_0 \in \bar{\mathbf{x}} + \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R). \tag{9} While we have an infinity of optimal solutions to choose from, the obvious choice \mathbf{w}_0^* := \bar{\mathbf{x}} stands out. Indeed, this is the choice that is essentially always made in practice, so much so that many PCA tutorials begin by assuming that the data points have been centered by subtracting off their empirical mean. I find it more insightful to include the intercept as a variable in the PCA optimization problem, and then show that the choice to set it equal to \bar{\mathbf{x}} is actually justified.

Moreover, it is quite interesting that the mean is not the unique optimal choice here. Why is this? The characterization (9) says that we can add any vector lying in the span of the orthonormal basis to \bar{\mathbf{x}} and still maintain optimality. The key requirement is that, after shifting the data by subtracting off \mathbf{w}_0, the resulting shifted points “lie along” the lower-dimensional subspace \text{span}(B). From the perspective of the optimization problem, it doesn’t matter where along this subspace the shifted data sits, whether around the origin or somewhere very far away; all that matters is how far the points are from the subspace. A picture is worth a thousand words here, and I will try to add one once I have time.
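A small numerical illustration of this non-uniqueness (a NumPy sketch with hypothetical names, in the spirit of the earlier snippets): shifting the intercept by any vector in \text{span}(B) leaves the reconstruction error unchanged, while a generic shift does not improve on \bar{\mathbf{x}}.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, R = 100, 5, 2
X = rng.normal(size=(N, D))
B, _ = np.linalg.qr(rng.normal(size=(D, R)))   # any orthonormal basis

def reconstruction_error(w0):
    Xc = X - w0
    W = Xc @ B                     # optimal coefficients for this intercept
    return np.sum((Xc - W @ B.T) ** 2)

x_bar = X.mean(axis=0)
shift = B @ rng.normal(size=R)     # arbitrary vector in span(B)

print(np.isclose(reconstruction_error(x_bar),
                 reconstruction_error(x_bar + shift)))        # True: same error
print(reconstruction_error(x_bar)
      <= reconstruction_error(x_bar + rng.normal(size=D)))    # mean is never worse
```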

Finally, note that the specific choice of xˉ\bar{\mathbf{x}} has various other practical benefits. It leads to projections that are clustered around the origin, thus keeping numbers relatively small. It also leads to a nice statistical interpretation of the eigenvalue problems discussed in the previous subsection; e.g. the basis vector b1\mathbf{b}_1 can be viewed as the direction along which the empirical variance of the projected data is maximized. This maximum variance perspective is discussed in more detail below.

Part 2: Interpreting PCA

Minimum Error or Maximum Variance?

While the derivations in the preceding section are somewhat lengthy, recall that this was all in the pursuit of solving the optimization problem (1). In words, we derived the best RR-dimensional affine subspace to represent the data XX, where “best” is defined as minimizing the average Euclidean error between the data points and their projections onto the subspace. We showed that this error minimization problem could be re-written as a sequence of RR maximization problems of the form (6). We now show that these maximization problems have a very nice statistical interpretation.

Sample covariance of the xn\mathbf{x}_n

We first recall that the empirical covariance matrix of the data points \mathbf{x}_1, \dots, \mathbf{x}_N is defined to be \hat{C} := \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \bar{\mathbf{x}}) (\mathbf{x}_n - \bar{\mathbf{x}})^\top, which can be re-written as \hat{C} = \frac{1}{N-1} \sum_{n=1}^{N} \mathbf{x}^c_n (\mathbf{x}^c_n)^\top = \frac{1}{N-1} (X^c)^\top X^c, \tag{10} where the superscript c indicates that the observations have been centered by subtracting off their empirical mean, and X^c is the matrix with rows (\mathbf{x}^c_n)^\top.
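As a sanity check (a short NumPy sketch; nothing here is specific to PCA), this formula matches the library covariance routine when observations are stored in rows:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))

Xc = X - X.mean(axis=0)
C_hat = Xc.T @ Xc / (X.shape[0] - 1)

# np.cov with rowvar=False treats rows as observations and uses the N-1 scaling.
print(np.allclose(C_hat, np.cov(X, rowvar=False)))  # True
```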

Recall that solving the maximization problems (6) revealed that the optimal basis vectors are given by the dominant eigenvectors of the matrix (X^c)^\top X^c, which is the (unscaled) covariance (10)! The scaling factor does not affect the optimal basis vectors; it simply scales the objective function. Specifically, \begin{align} \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top (X^c)^\top X^c \mathbf{b}_{r+1}\right) = \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top \hat{C} \mathbf{b}_{r+1}\right). \tag{11} \end{align} We haven’t changed anything from (6) here, other than noting that a re-scaling of the objective function allows us to involve \hat{C} in the expression.

Sample covariance of the wn\mathbf{w}_n

Given that the sample covariance matrix of the data \mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^D is given by \hat{C}, it is natural to also consider the empirical covariance of \mathbf{w}_1, \dots, \mathbf{w}_N \in \mathbb{R}^R. We begin by computing the sample mean \frac{1}{N} \sum_{n=1}^{N} \mathbf{w}_n = \frac{1}{N} \sum_{n=1}^{N} B^\top \mathbf{x}^c_n = B^\top \left[\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}^c_n \right] = 0, using the fact that the empirical mean of the centered data points is 0. Recalling that the row vectors \mathbf{w}_n^\top are stored in the rows of the matrix W, it follows that the empirical covariance matrix of \mathbf{w}_1, \dots, \mathbf{w}_N is given by \hat{C}_w := \frac{1}{N-1} W^\top W, \tag{12} which follows from the calculation (10) with W in place of X^c. Since W = X^c B, we have \hat{C}_w = \frac{1}{N-1} (X^c B)^\top (X^c B) = B^\top \hat{C} B, which allows us to write the covariance of the \mathbf{w}_n as a function of the covariance of the \mathbf{x}_n. We can say even more, since we know that B is given by V_R, the truncated set of eigenvectors obtained from the eigendecomposition (X^c)^\top X^c = V \Lambda V^\top. We thus have \hat{C}_w = \frac{1}{N-1} B^\top (X^c)^\top X^c B = \frac{1}{N-1} V_R^\top (X^c)^\top X^c V_R = \frac{1}{N-1} V_R^\top V_R \Lambda_R = \frac{1}{N-1} \Lambda_R, \tag{13} where (X^c)^\top X^c V_R = V_R \Lambda_R follows from the fact that the columns of V_R store the first R eigenvectors of (X^c)^\top X^c. The conclusion here is that \hat{C}_w is diagonal, with variances equal to the R largest eigenvalues of \hat{C}. In other words, PCA computes a change-of-basis that diagonalizes the empirical covariance of the data.
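The sketch below (NumPy, with hypothetical variable names) checks this diagonalization numerically: the covariance of the projected coefficients is diagonal, with the leading eigenvalues of \hat{C} on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, R = 500, 6, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

Xc = X - X.mean(axis=0)
C_hat = Xc.T @ Xc / (N - 1)

evals, V = np.linalg.eigh(C_hat)              # ascending eigenvalues
order = np.argsort(evals)[::-1]
B = V[:, order[:R]]                           # top-R eigenvectors of C_hat

W = Xc @ B                                    # projected coefficients
C_w = W.T @ W / (N - 1)                       # their empirical covariance

# C_w is diagonal, with the R largest eigenvalues of C_hat on the diagonal.
print(np.allclose(C_w, np.diag(evals[order[:R]])))   # True
```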

PCA as Variance Maximization

We now return to the goal of providing a statistical interpretation of the objective function \mathbf{b}_{r}^\top \hat{C} \mathbf{b}_r in (11). Given the derivation of \hat{C}_w in (13), we see the empirical variance of (\mathbf{w}_1)_r, \dots, (\mathbf{w}_N)_r (i.e., the values in the r^{\text{th}} column of W) is equal to [\hat{C}_w]_{rr} = \mathbf{b}_r^\top \hat{C} \mathbf{b}_r, which is precisely the objective being maximized. To interpret this quantity more clearly, we consider the projection of \mathbf{x}_n^c onto the span of \mathbf{b}_r, \text{proj}_{\mathbf{b}_r} \mathbf{x}^c_n := \langle \mathbf{x}^c_n, \mathbf{b}_r \rangle \mathbf{b}_r = (\mathbf{w}_n)_r \mathbf{b}_r, which shows that (\mathbf{w}_n)_r is the (signed) coefficient of this projection. Therefore, we conclude that \mathbf{b}_r^\top \hat{C} \mathbf{b}_r is the sample variance of the projection coefficients onto the subspace \text{span}(\mathbf{b}_r); loosely speaking, the variance of the data projected along the r^{\text{th}} basis vector. Combining all of these equivalent expressions yields the chain of equalities \text{Tr}(\hat{C}_w) = \frac{1}{N-1} \sum_{r=1}^{R} \lambda_r = \frac{1}{N-1} \sum_{n=1}^{N} \sum_{r=1}^{R} W_{nr}^2 = \sum_{r=1}^{R} \mathbf{b}_r^\top \hat{C} \mathbf{b}_r. \tag{14} The trace \text{Tr}(\hat{C}_w) thus represents the total variance of the projected data, summed over all of the basis vectors. The total variance is equivalently given by the sum of the R largest eigenvalues of \hat{C} (i.e., the \lambda_r / (N-1)) or by the scaled squared Frobenius norm \lVert W \rVert_F^2 / (N-1).

The final term in (14) provides an alternative interpretation of the objective function in (5); namely, that PCA seeks the basis B which results in the maximal total projected variance. The resulting sequence of constrained problems, as in (11), is interpreted similarly. In particular, \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top \hat{C} \mathbf{b}_{r+1}\right) can be viewed as seeking the direction along which the variance of the projections is maximized, subject to the constraint that the direction be orthogonal to the previous r directions. The optimal solution is the direction corresponding to the (r+1)^{\text{st}} eigenvector of the empirical covariance \hat{C}, and the resulting maximal variance in this direction is given by the associated eigenvalue.
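This variance interpretation is what underlies the common “proportion of variance explained” diagnostic. Below is a short sketch (NumPy; the 95% threshold and variable names are my own illustrative choices, not prescribed by anything above) of choosing R by cumulative explained variance:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

Xc = X - X.mean(axis=0)
evals = np.linalg.eigvalsh(Xc.T @ Xc / (X.shape[0] - 1))[::-1]  # descending

explained = np.cumsum(evals) / np.sum(evals)    # cumulative proportion of total variance
R = int(np.searchsorted(explained, 0.95)) + 1   # smallest R explaining >= 95%
print(np.round(explained, 3), R)
```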

The Singular Value Decomposition

A Matrix Approximation Problem: The Eckart-Young Theorem

Ignoring the intercept (or assuming that the \mathbf{x}_n have already been centered), we can re-write the reconstruction error as \sum_{n=1}^{N} \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2 = \lVert X - WB^\top \rVert_F^2, where \lVert \cdot \rVert_F denotes the Frobenius norm. The PCA optimization problem can then be written as the matrix approximation problem \text{argmin}_{B, W} \lVert X - WB^\top \rVert_F^2, \tag{16} where B is constrained to have orthonormal columns. We can equivalently write this as \begin{align} &\text{argmin}_{\hat{X} \in \mathcal{M}} \lVert X - \hat{X} \rVert_F^2, &&\mathcal{M} = \{\hat{X} \in \mathbb{R}^{N \times D} : \hat{X}=WB^\top, \ B \in \mathbb{R}^{D \times R} \text{ has orthonormal columns}\}, \tag{17} \end{align} which makes it even more clear that PCA can generically be viewed as the problem of approximating the data matrix X with another matrix \hat{X} that is constrained to lie in a subset \mathcal{M} of all N \times D matrices. We can phrase this even more succinctly by noticing that \mathcal{M} is precisely the set of all N \times D matrices with rank at most R. Indeed, if \hat{X} \in \mathcal{M} then \text{rank}(\hat{X}) = \text{rank}(W B^\top) \leq R, since W has R columns and thus the rank of the matrix product cannot exceed R. Conversely, if we let \hat{X} be an arbitrary N \times D matrix of rank r \leq R, then \hat{X} can be expanded using the compact SVD as \hat{X} = U\tilde{\Sigma}V^\top = (U\tilde{\Sigma})V^\top, where U \in \mathbb{R}^{N \times r} and V \in \mathbb{R}^{D \times r} have orthonormal columns and \tilde{\Sigma} \in \mathbb{R}^{r \times r} is diagonal. By setting W := U\tilde{\Sigma} and B := V we have almost written things in the form required by (17), but (17) restricts B to have exactly R orthonormal columns. Thus, we can simply extend V to an orthonormal set of R columns, B = [V|\mathbf{v}_{r+1}, \dots, \mathbf{v}_{R}] (this is justified by Gram-Schmidt), and set W = [U\tilde{\Sigma}|\boldsymbol{0}]. Then \hat{X} = (U\tilde{\Sigma})V^\top = WB^\top with B of the required form.

We have thus verified that (17) can equivalently be written as the low-rank matrix approximation problem \begin{align} &\text{argmin}_{\hat{X} \in \mathcal{M}} \lVert X - \hat{X} \rVert_F^2, &&\mathcal{M} = \{\hat{X} \in \mathbb{R}^{N \times D} : \text{rank}(\hat{X}) \leq R \}. \tag{18} \end{align}

This is precisely the problem considered by the celebrated Eckart-Young theorem. The theorem states that an optimal solution to (18) is given by the truncated SVD \hat{X} = U_R \Sigma_R V_R^\top, which is precisely the PCA solution; the relationship between the truncated SVD of X and the eigendecomposition of X^\top X used above is discussed further in the appendix.
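The following sketch (NumPy; illustrative only, assuming no ties among the leading eigenvalues) checks this equivalence numerically: the rank-R truncated SVD of the centered data matrix yields the same reconstruction as the eigenvector-based solution derived earlier.

```python
import numpy as np

rng = np.random.default_rng(7)
N, D, R = 200, 8, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)

# PCA via eigendecomposition of (X^c)^T (X^c).
evals, V = np.linalg.eigh(Xc.T @ Xc)
B = V[:, np.argsort(evals)[::-1][:R]]
X_hat_eig = (Xc @ B) @ B.T

# PCA via truncated SVD of X^c.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_hat_svd = U[:, :R] * s[:R] @ Vt[:R]

print(np.allclose(X_hat_eig, X_hat_svd))                      # same reconstruction
print(np.allclose(s[:R] ** 2, np.sort(evals)[::-1][:R]))      # sigma_r^2 = lambda_r
```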

Regression with an optimal basis

Statistical/Probabilistic Perspectives

There are various ways we might attach a statistical or probabilistic perspective to PCA; we briefly discuss a few of them here. Throughout this section we will take \text{argmin}_{W,B} \sum_{n=1}^{N} \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2 \tag{18} as the jumping-off point for a probabilistic generalization; that is, we are implicitly assuming the data is centered so that we can ignore the intercept. Note that, as always, B is constrained to have orthonormal columns.

A Special Case of the Karhunen-Loeve Expansion

We start by noting that the objective function in (18) looks like a sample average (up to scaling) of the quantities \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2. It is therefore natural to consider some underlying true expectation that this sample average is approximating. To this end, let us view \mathbf{x} = \mathbf{x}(\omega) and \mathbf{w} = \mathbf{w}(\omega) as random vectors, defined over some probability space (\Omega, \mathcal{A}, \mathbb{P}). We can then consider \text{argmin}_{\mathbf{w},B} \mathbb{E} \lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 = \text{argmin}_{\mathbf{w},B} \int_{\Omega} \left\lVert \mathbf{x}(\omega) - \sum_{r=1}^{R} \mathbf{w}_r(\omega) \mathbf{b}_r \right\rVert_2^2 \mathbb{P}(d\omega), \tag{19} which can be viewed as the population analog of the sample approximation in (18). We note that the second expression above shows that the low-rank approximation of the random vector \mathbf{x} takes the form of a linear combination of (non-random) basis vectors with random coefficients.

With BB fixed, the optimization in w\mathbf{w} is just as easy as before. Indeed, for any fixed ωΩ\omega \in \Omega, argminw(ω)x(ω)Bw(ω)22=Bx(ω), \text{argmin}_{\mathbf{w}(\omega)} \lVert \mathbf{x}(\omega) - B\mathbf{w}(\omega) \rVert_2^2 = B^\top \mathbf{x}(\omega), using the same exact orthogonal projection reasoning as before. We have thus optimized for w(ω)\mathbf{w}(\omega) on an ω\omega-by-ω\omega basis. The same result thus holds in expectation: argminwExBw22=Bx. \text{argmin}_{\mathbf{w}} \mathbb{E} \lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 = B^\top \mathbf{x}. It is important to note that we have found the optimal random vector w\mathbf{w}, which was shown to be a linear transformation of the random vector x\mathbf{x}.

Optimizing for BB with the optimal w\mathbf{w} fixed also follows similarly from previous derivations. Using some of the results already derived above (see the section on optimizing the basis), we have

\begin{align} \text{argmin}_{B} \mathbb{E}\lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 &= \text{argmin}_{B} \mathbb{E} \sum_{r=R+1}^{D} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \newline &= \text{argmin}_{B} \mathbb{E} \left[\lVert \mathbf{x} \rVert_2^2 - \sum_{r=1}^{R} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \right] \newline &= \text{argmax}_{B} \mathbb{E} \sum_{r=1}^{R} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \newline &= \text{argmax}_{B} \mathbb{E} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbf{x}\mathbf{x}^\top \mathbf{b}_r \newline &= \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbb{E}\left[\mathbf{x}\mathbf{x}^\top\right] \mathbf{b}_r \newline &= \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top C \mathbf{b}_r, \end{align}

where C:=Cov[x]C := \text{Cov}[\mathbf{x}]. Here we have used the centering assumption E[x]=0\mathbb{E}[\mathbf{x}] = 0, as well as the implicit assumption that the random vector x\mathbf{x} has a well-defined covariance matrix. At this point, the derivations go through exactly as before, with CC replacing the empirical covariance C^\hat{C}; that is, the optimal basis BB is obtained by extracting the first RR columns of VV, where C=VΛVC = V\Lambda V^\top. Unsurprisingly, the results that held for the sample covariance hold analogously in this setting. For example, since x\mathbf{x} has zero mean then E[wr]=E[vrx]=0. \mathbb{E}[\mathbf{w}_r] = \mathbb{E}[\mathbf{v}_r^\top \mathbf{x}] = 0. Moreover, Cov[wr,ws]=Cov[vrx,vsx]=vrCvs=vrVΛVvs=λr1[r=s], \text{Cov}[\mathbf{w}_r, \mathbf{w}_s] = \text{Cov}[\mathbf{v}_r^\top \mathbf{x}, \mathbf{v}_s^\top \mathbf{x}] = \mathbf{v}_r^\top C \mathbf{v}_s = \mathbf{v}_r^\top V \Lambda V^\top \mathbf{v}_s = \lambda_r 1[r = s], which says that the weights wr\mathbf{w}_r are uncorrelated, with variance equal to their respective eigenvalues. It is common, therefore, to normalize these random variables to have unit variance w~r:=1λrwr, \mathbf{\tilde{w}}_r := \frac{1}{\sqrt{\lambda_r}} \mathbf{w}_r, which means the optimal rank RR approximation to the random vector x\mathbf{x} may be expressed as x^(ω):=r=1Rλrw~r(ω)vr, \mathbf{\hat{x}}(\omega) := \sum_{r=1}^{R} \sqrt{\lambda_r} \mathbf{\tilde{w}}_r(\omega) \mathbf{v}_r, where we have included the ω\omega argument to emphasize which quantities are random here. The following result summarizes our findings, extending to the case with non-zero mean.

Proposition. Let \mathbf{x} be a square-integrable D-dimensional random vector defined on (\Omega, \mathcal{A}, \mathbb{P}). Among all such random vectors constrained to take values in an R-dimensional affine subspace of \mathbb{R}^D, the random vector \mathbf{\hat{x}}(\omega) = \mathbb{E}[\mathbf{x}] + \sum_{r=1}^{R} \sqrt{\lambda_r} \mathbf{\tilde{w}}_r(\omega) \mathbf{v}_r, \tag{20} provides an optimal approximation to \mathbf{x}, in the expected Euclidean distance sense (19). Moreover, the random weights \mathbf{\tilde{w}}_r are uncorrelated, with zero mean and unit variance.

Note that random vectors can conceptually be thought of as stochastic processes with finite-dimensional index sets. A similar decomposition to (20) can be constructed for more general stochastic processes with potentially uncountable index sets, with the modification that the sum in (20) will require a countably infinite number of terms. This result is generally known as the Karhunen-Loeve expansion. Thus, from this perspective PCA can be thought of as the special finite-dimensional case of the Karhunen-Loeve expansion.

Gaussian Random Vectors

The above derivations did not make any assumptions regarding the distribution of \mathbf{x}, just that its covariance exists. Consequently, the main result (20) does not give the distribution of the weights \mathbf{\tilde{w}}_r, only that they are uncorrelated with zero mean and unit variance. If we add the distributional assumption \mathbf{x} \sim \mathcal{N}(\mathbf{m}, C) then we are able to say more. Indeed, recalling that the (unscaled) weights are given by the projections \mathbf{w}_r = \langle \mathbf{x} - \mathbf{m}, \mathbf{v}_r \rangle, we find that \mathbf{w} = V^\top (\mathbf{x} - \mathbf{m}) \sim \mathcal{N}(0, V^\top C V) = \mathcal{N}(0, \Lambda). The key point here is that the \mathbf{w}_r are jointly Gaussian, and hence their uncorrelatedness implies that they are in fact independent. The weights inherit Gaussianity from \mathbf{x}. Thus, under this stronger assumption we are able to characterize the distribution of the weights exactly.
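A quick simulation (a NumPy sketch; the covariance matrix below is an arbitrary choice of mine) illustrating that the weights of a Gaussian random vector are independent Gaussians with variances given by the eigenvalues of C:

```python
import numpy as np

rng = np.random.default_rng(8)
D = 4
A = rng.normal(size=(D, D))
C = A @ A.T                          # an arbitrary covariance matrix
m = rng.normal(size=D)

lam, V = np.linalg.eigh(C)           # C = V diag(lam) V^T

x = rng.multivariate_normal(m, C, size=200_000)   # samples of x ~ N(m, C)
w = (x - m) @ V                      # rows are samples of w = V^T (x - m)

print(np.round(np.cov(w, rowvar=False), 2))   # approximately diag(lam)
print(np.round(lam, 2))
```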

Maximum Likelihood Estimation with Gaussian Noise

Probabilistic PCA

https://stats.stackexchange.com/questions/190308/why-does-probabilistic-pca-use-gaussian-prior-over-latent-variables

Part 3: Using PCA

  1. Decorrelating
  2. Dimensionality reduction

Part 4: Computing PCA

  1. Eigendecomposition
  2. SVD

Part 5: Application and Code

Appendix

Eigenvalue Problems

In this section, I briefly discuss the spectral norm and eigenvalue problems in finite-dimensional vector spaces, which I utilize above when optimizing the basis B in the PCA derivation. Consider a matrix A \in \mathbb{R}^{N \times D}, which represents a linear transformation from \mathbb{R}^{D} to \mathbb{R}^N. We define the spectral norm of A as the largest factor by which the map A can “stretch” a vector \mathbf{u} \in \mathbb{R}^{D}, \lVert A \rVert_2 := \max_{\mathbf{u} \neq \boldsymbol{0}} \frac{\lVert A\mathbf{u} \rVert_2}{\lVert \mathbf{u} \rVert_2}. Using the linearity of A, one can show that we need only consider vectors of unit length; that is, \lVert A \rVert_2 = \max_{\lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2. \tag{A1} Optimization problems of this type are called eigenvalue problems for reasons that will shortly become clear. For the purposes of the PCA derivation, we will require consideration of a slightly more general eigenvalue problem. To define this problem, first note that the matrix A^\top A is symmetric and positive semi-definite, since \begin{align} &(A^\top A)^\top = A^\top (A^\top)^\top = A^\top A, &\mathbf{u}^\top (A^\top A)\mathbf{u} = \lVert A \mathbf{u} \rVert_2^2 \geq 0. \end{align} Thus, by the spectral theorem A^\top A has D orthogonal eigenvectors, which we will denote by \mathbf{e}_1, \dots, \mathbf{e}_D and assume have been normalized to have unit norm. By positive semi-definiteness the respective eigenvalues \lambda_1, \dots, \lambda_D (sorted in decreasing order) are all non-negative. For d = 1, \dots, D, let \mathcal{E}_d := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_d), with \mathcal{E}^{\perp}_d := \text{span}(\mathbf{e}_{d+1}, \dots, \mathbf{e}_D) its orthogonal complement. We now consider the eigenvalue problem

\begin{align} \max_{\mathbf{u} \in \mathcal{E}^{\perp}_d, \lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2. \tag{A2} \end{align}

We have generalized (A1) by adding an orthogonality constraint. Taking as a convention E0:={0}\mathcal{E}_0 := \{\mathbf{0}\} we then have E0=RD\mathcal{E}^{\perp}_0 = \mathbb{R}^D, which means that setting d=0d = 0 in (A2) recovers the original problem (A1) as a special case. We will prove the following result.

Proposition. Let A \in \mathbb{R}^{N \times D} be a matrix. Let \mathbf{e}_1, \dots, \mathbf{e}_D denote the orthonormal eigenbasis of A^\top A, with respective eigenvalues \lambda_1, \dots, \lambda_D sorted in descending order. For d = 1, \dots, D define \mathcal{E}_d := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_d), with \mathcal{E}^{\perp}_d := \text{span}(\mathbf{e}_{d+1}, \dots, \mathbf{e}_D) its orthogonal complement. Then \mathbf{e}_{d+1} = \text{argmax}_{\mathbf{u} \in \mathcal{E}^{\perp}_d, \lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2, \tag{A3} with the maximal value equal to \sqrt{\lambda_{d+1}}. In particular, we have \mathbf{e}_1 = \text{argmax}_{\lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2, \tag{A4} with maximal value \sqrt{\lambda_1}.

Proof. Let uEd\mathbf{u} \in \mathcal{E}^{\perp}_d be an arbitrary vector of unit length. This vector may be represented with respect to the eigenbasis as u=r=d+1Durer,ur:=u,er \mathbf{u} = \sum_{r=d+1}^{D} u_r \mathbf{e}_r, \qquad u_r := \langle \mathbf{u}, \mathbf{e}_r \rangle We will use this representation to show that

  1. Au2\lVert A\mathbf{u} \rVert_2 is upper bounded by λd+1\sqrt{\lambda_{d+1}}.
  2. The upper bound is achieved by some uEd\mathbf{u} \in \mathcal{E}^{\perp}_d,

which together imply the claimed result. We will actually work with the squared norm instead, which allows us to leverage the inner product. We have \begin{align} \lVert A\mathbf{u} \rVert^2_2 = \langle A\mathbf{u}, A\mathbf{u} \rangle = \langle A^\top A\mathbf{u}, \mathbf{u} \rangle &= \left\langle A^\top A \sum_{r=d+1}^{D} u_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle \newline &= \left\langle \sum_{r=d+1}^{D} u_r (A^\top A \mathbf{e}_r), \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle \newline &= \left\langle \sum_{r=d+1}^{D} u_r \lambda_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle, \end{align} having used the fact the er\mathbf{e}_r are eigenvectors of AAA^\top A. Now we can take advantage of the fact that the er\mathbf{e}_r are orthonormal to obtain \begin{align} \left\langle \sum_{r=d+1}^{D} u_r \lambda_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle = \sum_{r=d+1}^{D} u_r^2 \lambda_r \lVert \mathbf{e}_r \rVert^2_2 = \sum_{r=d+1}^{D} u_r^2 \lambda_r \leq \sum_{r=d+1}^{D} u_r^2 \lambda_{d+1} = \lambda_{d+1} \lVert \mathbf{u} \rVert_2^2 = \lambda_{d+1} \end{align} where the inequality follows from the fact that the eigenvalues are sorted in descending order. This verifies the upper bound Au2λd+1\lVert A\mathbf{u} \rVert_2 \leq \sqrt{\lambda_{d+1}}. To show that the bound is achieved, we consider setting u=ed+1\mathbf{u} = \mathbf{e}_{d+1}. Then,

\begin{align} \lVert A\mathbf{e}_{d+1} \rVert^2_2 = \langle A\mathbf{e}_{d+1}, A\mathbf{e}_{d+1} \rangle = \langle A^\top A\mathbf{e}_{d+1}, \mathbf{e}_{d+1} \rangle = \langle \lambda_{d+1} \mathbf{e}_{d+1}, \mathbf{e}_{d+1} \rangle = \lambda_{d+1} \lVert \mathbf{e}_{d+1} \rVert^2_2 = \lambda_{d+1}, \end{align}

so we have indeed verified that the equality Au2=λd+1\lVert A\mathbf{u} \rVert_2 = \sqrt{\lambda_{d+1}} is achieved for some unit-norm vector uEd\mathbf{u} \in \mathcal{E}^{\perp}_{d}. The claim is thus proved. \qquad \blacksquare
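A small numerical check of the proposition (a NumPy sketch; the random matrix and the sampling of unit vectors are my own illustration): over unit vectors constrained to \mathcal{E}_d^{\perp}, the value \lVert A\mathbf{u} \rVert_2 never exceeds \sqrt{\lambda_{d+1}}, and \mathbf{e}_{d+1} attains it.

```python
import numpy as np

rng = np.random.default_rng(9)
N, D, d = 30, 6, 2
A = rng.normal(size=(N, D))

lam, E = np.linalg.eigh(A.T @ A)      # ascending eigenvalues
lam, E = lam[::-1], E[:, ::-1]        # re-sort in descending order

# Random unit vectors in the orthogonal complement of span(e_1, ..., e_d).
P = E[:, d:] @ E[:, d:].T             # projector onto E_d^perp
u = P @ rng.normal(size=(D, 1000))
u /= np.linalg.norm(u, axis=0)

print(np.max(np.linalg.norm(A @ u, axis=0)) <= np.sqrt(lam[d]) + 1e-9)  # upper bound holds
print(np.isclose(np.linalg.norm(A @ E[:, d]), np.sqrt(lam[d])))         # attained at e_{d+1}
```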

Projections

Truncated SVD and Eigendecomposition