Part 1: Formulating and Solving the PCA Optimization Problem
Setup and Notation
Suppose that we have data \(\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^D\), stacked into the rows of a matrix \(X \in \mathbb{R}^{N \times D}\). Our task is to find a subspace of smaller dimension \(R < D\) such that the projection of the data points onto the subspace retains as much information as possible. By restricting our attention to orthonormal bases for the low-dimensional subspace, we reduce the problem to finding a set of orthonormal basis vectors \[ \begin{align} &\mathbf{b}_1, \dots, \mathbf{b}_R \in \mathbb{R}^D, &&\langle \mathbf{b}_r, \mathbf{b}_s \rangle = \delta_{r,s}. \end{align} \] Define \(B \in \mathbb{R}^{D \times R}\) to be the matrix with \(r^{\text{th}}\) column equal to \(\mathbf{b}_r\). The subspace generated by the basis \(B\) is given by \[ \text{span}(B) := \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R). \] Throughout this post I will abuse notation by referring to the matrix \(B\) when actually talking about the set of vectors \(\{\mathbf{b}_1, \dots, \mathbf{b}_R\}\). Since there is no a priori reason to assume that the data is centered, we should also allow for the subspace to be shifted by some intercept \(\mathbf{w}_0 \in \mathbb{R}^D\), resulting in the affine space \[ \mathbf{w}_0 + \text{span}(B) = \left\{\mathbf{w}_0 + \sum_{r=1}^{R} w_r \mathbf{b}_r : w_1, \dots, w_R \in \mathbb{R} \right\}. \] Loosely speaking, the task is to find the basis \(B\), intercept \(\mathbf{w}_0\), and pointwise weights \(\mathbf{w}_1, \dots, \mathbf{w}_N \in \mathbb{R}^R\) such that \[ \begin{align} \mathbf{x}_n &\approx \mathbf{w}_0 + \sum_{r=1}^{R} (\mathbf{w}_n)_r \mathbf{b}_r &&\forall n=1,\dots,N \\ &= \mathbf{w}_0 + B\mathbf{w}_n. \end{align} \] To formalize this notion, PCA measures the error in the above approximation using squared Euclidean distance, summed over the \(N\) data points. To further simplify notation, we stack the \(\mathbf{w}_n^\top\) in the rows of a matrix \(W \in \mathbb{R}^{N \times R}\). With all of this notation established, we can state that PCA solves the optimization problem \[ \text{argmin}_{B, W, \mathbf{w}_0} \sum_{n=1}^{N} \lVert \mathbf{x}_n - (\mathbf{w}_0 + B\mathbf{w}_n) \rVert_2^2, \tag{1} \] where the basis \(B\) is constrained to be orthonormal. As we will see, this optimization naturally breaks down into two distinct problems which can be solved sequentially:

1. Given the basis \(B\) and intercept \(\mathbf{w}_0\), find the optimal basis coefficients \(\mathbf{w}_n\) corresponding to each data point \(\mathbf{x}_n\).
2. Find the optimal basis and intercept.
Part of the popularity of PCA stems from the fact that both problems can be solved in closed-form. Let us consider both problems in turn.
Optimizing the Basis Coefficients
Let us first consider \(\mathbf{w}_0\) and \(B\) to be fixed, meaning that we are fixing an affine subspace of dimension \(R\). We seek to find the optimal way to represent the data \(X\) in this lower-dimensional space. As we will show, the Euclidean objective used by PCA implies that this problem reduces to straightforward orthogonal projection. For now, let \(\mathbf{x}^c_n := \mathbf{x}_n - \mathbf{w}_0\) denote the centered data points (we will deal with the intercept shortly). We are thus considering the problem \[ \text{argmin}_{W} \sum_{n=1}^{N} \lVert \mathbf{x}^c_n - B\mathbf{w}_n \rVert_2^2 \tag{2} \] Observe that \(\mathbf{w}_n\) only appears in the \(n^{\text{th}}\) term of the sum, meaning that we can consider each summand independently, \[ \text{argmin}_{\mathbf{w}_n} \lVert \mathbf{x}^c_n - B\mathbf{w}_n \rVert_2^2. \] In words, we seek the linear combination of the basis vectors \(B\) that results in minimal Euclidean distance from \(\mathbf{x}^c_n\); this is a standard orthogonal projection problem from linear algebra. Since the basis vectors are orthonormal, the optimal projection coefficients are given by \[ \begin{align} &(\mathbf{w}_n)_r = \langle \mathbf{x}_n^c, \mathbf{b}_r \rangle, &&\mathbf{w}_n = B^\top \mathbf{x}_n^c \end{align} \] which can be written succinctly for all data points by stacking the \(\mathbf{w}_n^\top\) as rows in a matrix \(W\); i.e., \[ W := X^c B, \] with \(X^c\) denoting the centered data matrix with rows set to the \((\mathbf{x}^c_n)^\top\).
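Before moving on, here is a minimal NumPy sketch of this step (the synthetic data, the random basis, and all variable names are purely illustrative): with \(B\) and \(\mathbf{w}_0\) fixed, the optimal coefficients are the projection coefficients \(W = X^c B\), and perturbing them can only increase the error.

```python
# Minimal sketch: optimal coefficients for a fixed orthonormal basis B and
# intercept w0 are given by orthogonal projection, W = Xc @ B.
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 200, 5, 2

X = rng.normal(size=(N, D))                    # synthetic data, rows x_n^T
w0 = rng.normal(size=D)                        # any fixed intercept
B, _ = np.linalg.qr(rng.normal(size=(D, R)))   # random orthonormal basis (D x R)

Xc = X - w0                                    # rows are (x_n - w0)^T
W = Xc @ B                                     # optimal coefficients, rows w_n^T

# Sanity check: perturbing the coefficients can only increase the error.
err = np.sum((Xc - W @ B.T) ** 2)
err_perturbed = np.sum((Xc - (W + 0.01 * rng.normal(size=W.shape)) @ B.T) ** 2)
assert err <= err_perturbed
```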
Optimizing the Basis
In the previous section, we saw that for a fixed basis and intercept, optimizing the basis weights reduced to an orthogonal projection problem. In this section we show that with the weights fixed at their optimal values, optimizing the basis reduces to solving a sequence of eigenvalue problems. To be clear, we are now considering the problem \[ \text{argmin}_{B} \sum_{n=1}^{N} \lVert \mathbf{x}^c_n - B\mathbf{w}^*_n \rVert_2^2, \tag{3} \] where the \(\mathbf{w}^*_n\) are now fixed at the optimal values satisfying (2); i.e., \(\mathbf{w}^*_n = B^\top \mathbf{x}^c_n\). However, in the derivations below we will just write \(\mathbf{w}_n = \mathbf{w}^*_n\) to keep the notation lighter. Note that we are still treating \(\mathbf{w}_0\) as fixed for the time being. We will make another notational simplification in this section by writing \(\mathbf{x}_n = \mathbf{x}_n^c\). Just keep in mind that throughout this section, \(\mathbf{x}_n\) should be interpreted as \(\mathbf{x}_n - \mathbf{w}_0\).
This problem is also referred to as minimizing the reconstruction error, since \(\lVert \mathbf{x}_n - \mathbf{\hat{x}}_n \rVert_2 := \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2\) is the error between the original data point \(\mathbf{x}_n\) and the \(D\)-dimensional vector \(\mathbf{\hat{x}}_n\) which can be thought of as an approximation to \(\mathbf{x}_n\) that has been reconstructed from its lower-dimensional representation \(\mathbf{w}_n\). The key here is to re-write this objective function so that this optimization problem takes the form of an eigenvalue problem, which is something that we already know how to solve (see the appendix, A1).
To start, we extend the orthonormal set \(\mathbf{b}_1, \dots, \mathbf{b}_R\) to an orthonormal basis \(\mathbf{b}_1, \dots, \mathbf{b}_D\) for \(\mathbb{R}^D\). Now we can write the original data point \(\mathbf{x}_n\) and its approximation \(\mathbf{\hat{x}}_n\) with respect to this basis as \[ \begin{align} &\mathbf{x}_n = \sum_{r=1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r, &&\mathbf{\hat{x}}_n = \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r \end{align} \]
and hence the residual \(\mathbf{x}_n - \mathbf{\hat{x}}_n\) is given by \[ \begin{align} \mathbf{x}_n - \mathbf{\hat{x}}_n &= \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r. \end{align} \]
Thus, the objective function in (3) can be written as \[ \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{\hat{x}}_n \rVert_2^2 = \sum_{n=1}^{N} \bigg\lVert \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r \bigg\rVert_2^2 = \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 \lVert \mathbf{b}_r \rVert_2^2 = \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2, \]
where the second and third equalities use the facts that the \(\mathbf{b}_r\) are orthogonal and of unit norm, respectively.
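This identity is easy to check numerically. The following is a small sketch (synthetic data and an arbitrary orthonormal basis; all names are illustrative) verifying that the reconstruction error equals the sum of squared coefficients on the discarded directions.

```python
# Minimal sketch: reconstruction error = sum of squared coefficients on the
# discarded basis directions b_{R+1}, ..., b_D.
import numpy as np

rng = np.random.default_rng(1)
N, D, R = 50, 5, 2
X = rng.normal(size=(N, D))                    # stands in for the centered data

Q, _ = np.linalg.qr(rng.normal(size=(D, D)))   # full orthonormal basis b_1, ..., b_D
B = Q[:, :R]                                   # the retained directions

X_hat = (X @ B) @ B.T                          # reconstructions x_hat_n
lhs = np.sum((X - X_hat) ** 2)
rhs = np.sum((X @ Q[:, R:]) ** 2)              # sum_n sum_{r>R} <x_n, b_r>^2
assert np.isclose(lhs, rhs)
```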
We could continue working with this formulation, but at this point it is convenient to re-write the minimization problem we have been working with as an equivalent maximization problem. The quantity above is the squared norm of the residual \(\mathbf{e}_n := \mathbf{x}_n - \mathbf{\hat{x}}_n\), summed over \(n\). Since \(\mathbf{x}_n\) is fixed, minimizing the norm of the residual (i.e., the reconstruction error) is equivalent to maximizing the norm of the projection \(\mathbf{\hat{x}}_n\). More rigorously, we have
\[ \begin{align} \text{argmin}_B \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 &= \text{argmin}_B \sum_{n=1}^{N} \left(\sum_{r=1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 - \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2\right) \\ &= \text{argmin}_B \sum_{n=1}^{N} \left(\lVert \mathbf{x}_n \rVert_2^2 - \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2\right) \\ &= \text{argmax}_B \sum_{n=1}^{N} \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2. \tag{4} \end{align} \]
We can now re-write the squared inner product to obtain \[ \begin{align} \sum_{n=1}^{N} \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 &= \sum_{n=1}^{N} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbf{x}_n \mathbf{x}_n^\top \mathbf{b}_r = \sum_{r=1}^{R} \mathbf{b}_r^\top \left(\sum_{n=1}^{N}\mathbf{x}_n \mathbf{x}_n^\top\right) \mathbf{b}_r = \sum_{r=1}^{R} \mathbf{b}_r^\top (X^\top X) \mathbf{b}_r, \end{align} \]
where the final step uses the identity \(\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top = X^\top X\), which holds since the rows of \(X\) are the \(\mathbf{x}_n^\top\). We have managed to re-write (3) as \[ \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top (X^\top X) \mathbf{b}_r, \tag{5} \] where we recall that this is also subject to the constraint that the columns of \(B\) are orthonormal.
Before proceeding, we note that \(X^\top X\) is a positive semi-definite matrix, whose eigenvalues we denote \(\lambda_1, \dots, \lambda_D\), sorted in decreasing order. Note that the eigenvalues are all non-negative due to the positive semi-definiteness. Let \(\mathbf{e}_1, \dots, \mathbf{e}_D\) denote the respective eigenvectors, normalized to have unit norm. The Spectral Theorem guarantees that these eigenvectors can be chosen to be orthogonal.
We now notice in (5) that the objective function has been decomposed into \(R\) different terms, each of which only depends on a single \(\mathbf{b}_r\). However, these do not constitute \(R\) independent optimization problems, as they are all coupled through the orthogonality constraint. We will thus consider solving them in a recursive fashion, beginning with the first term, \[ \text{argmax}_{\lVert \mathbf{b}_1 \rVert_2=1} \mathbf{b}_1^\top (X^\top X) \mathbf{b}_1 = \text{argmax}_{\lVert \mathbf{b}_1 \rVert_2=1} \lVert X \mathbf{b}_1 \rVert_2^2. \] This is an eigenvalue problem! It is precisely of the form (A4) (see appendix) and so we apply that result to conclude that the optimal argument is \(\mathbf{b}_1 = \mathbf{e}_1\) with associated optimal value \(\lambda_1\) (note the objective here is the squared norm, in contrast to the statement in the appendix). Taking this as the base case, we now proceed inductively. Assume that at the \(r^{\text{th}}\) problem in the sequence, the solution is given by \((\mathbf{b}_1, \dots, \mathbf{b}_r) = (\mathbf{e}_1, \dots, \mathbf{e}_r)\). We must show the solution to the \((r+1)^{\text{st}}\) problem is \(\mathbf{e}_{r+1}\). Under the inductive hypothesis, this problem is constrained so that \(\mathbf{b}_{r+1}\) is orthogonal to each of \(\mathbf{e}_1, \dots, \mathbf{e}_r\); i.e., we require \(\mathbf{b}_{r+1} \perp \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_r)\). If we denote \(\mathcal{E}_{r} := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_r)\) and \(\mathcal{E}^{\perp}_{r}\) the orthogonal complement of \(\mathcal{E}_{r}\), then a succinct way to write the orthogonality constraint is that \(\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_r\). The problem can thus be written as \[ \begin{align} \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \lVert X \mathbf{b}_{r+1} \rVert_2^2, \tag{6} \end{align} \] which is another eigenvalue problem, precisely of the form (A3). Using this result from the appendix, we conclude that this is solved by \(\mathbf{b}_{r+1} = \mathbf{e}_{r+1}\), with the maximal objective value \(\lambda_{r+1}\).
That was a lot, so before moving on let’s briefly summarize. First of all, recall that I have been abusing notation by writing \(\mathbf{x}_n\) where I should be writing \(\mathbf{x}_n^c = \mathbf{x}_n - \mathbf{w}_0\). In summarizing the result here I will make this correction. Here we have considered the problem of finding the optimal orthonormal basis \(B\), for any fixed \(\mathbf{w}_0 \in \mathbb{R}^D\), but with the \(\mathbf{w}_n\) set to their optimal values satisfying (2); i.e., \(\mathbf{w}_n = B^\top \mathbf{x}^c_n\). Given this, we showed that the reconstruction error (3) is minimized by setting \(B\) equal to the matrix with columns given by the dominant \(R\) (normalized) eigenvectors of \((X^c)^\top X^c\). We arrived at this solution by showing that the error minimization problem (3) could be re-written as the maximization (5), and hence viewed as a sequence of \(R\) eigenvalue problems.
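To make this concrete, here is a minimal NumPy sketch of the basis computation (synthetic data; all names are illustrative choices): form \((X^c)^\top X^c\), take its top \(R\) eigenvectors, and check that the resulting reconstruction error equals the sum of the discarded eigenvalues.

```python
# Minimal sketch: the optimal basis consists of the top-R eigenvectors of
# (X^c)^T X^c; the minimal error is the sum of the remaining eigenvalues.
import numpy as np

rng = np.random.default_rng(2)
N, D, R = 300, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated synthetic data

w0 = X.mean(axis=0)                         # intercept fixed at the sample mean
Xc = X - w0

S = Xc.T @ Xc                               # (X^c)^T X^c, a D x D PSD matrix
eigvals, eigvecs = np.linalg.eigh(S)        # returned in ascending order
order = np.argsort(eigvals)[::-1]           # re-sort in descending order
lam, E = eigvals[order], eigvecs[:, order]

B = E[:, :R]                                # optimal orthonormal basis
W = Xc @ B                                  # optimal coefficients
recon_error = np.sum((Xc - W @ B.T) ** 2)

# The minimal reconstruction error equals the sum of the discarded eigenvalues.
assert np.isclose(recon_error, lam[R:].sum())
```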
Optimizing the Intercept
The last ingredient we are missing to solve (1) is the optimal value of \(\mathbf{w}_0\), which has thus far been viewed as fixed in the above derivations. At first glance, this problem might seem like somewhat of an afterthought, but there are some subtleties worth exploring here.
The problem we are now considering is \[ \text{argmin}_{\mathbf{w}_0} \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{w}_0 - B\mathbf{w}^*_n \rVert_2^2, \tag{7} \] with \(\mathbf{w}^*_n\) denoting the optimal weights \(\mathbf{w}^*_n = B^\top (\mathbf{x}_n - \mathbf{w}_0)\) derived above (these derivations go through for any orthonormal basis \(B\)). Plugging in this expression for \(\mathbf{w}^*_n\) gives \[ \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{w}_0 - B\mathbf{w}^*_n \rVert_2^2 = \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{w}_0 - BB^\top (\mathbf{x}_n - \mathbf{w}_0) \rVert_2^2 = \sum_{n=1}^{N} \lVert (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) \rVert_2^2. \] Computing the gradient of this expression with respect to \(\mathbf{w}_0\) and setting it equal to zero yields the optimality condition \[ \sum_{n=1}^{N} (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) = 0, \] where we have used the fact that \(I - BB^\top\) is symmetric and satisfies \((I - BB^\top)^2 = I - BB^\top\) (it is a projection matrix; see appendix). By linearity we then have \[ \sum_{n=1}^{N} (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) = (I - BB^\top) \sum_{n=1}^{N} (\mathbf{x}_n - \mathbf{w}_0) = (I - BB^\top)(N \bar{\mathbf{x}} - N\mathbf{w}_0), \tag{8} \] where we have defined \[ \bar{\mathbf{x}} := \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, \] the empirical mean of the data. Since \(\mathbf{w}_0\) is optimal when (8) is equal to zero, this leads to the condition \[ (I - BB^\top)(\bar{\mathbf{x}} - \mathbf{w}_0) = 0, \] or equivalently, \[ \bar{\mathbf{x}} - \mathbf{w}_0 \in \text{Null}(I - BB^\top). \] Noting that the null space is non-trivial, since \(\text{Null}(I - BB^\top) = \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R)\) (again, see the appendix section on projection matrices), we conclude that there are infinitely many optimal solutions! Using the basis \(\mathbf{b}_1, \dots, \mathbf{b}_R\) for the null space, we can characterize the set of optimal \(\mathbf{w}_0\) as those satisfying \[ \bar{\mathbf{x}} - \mathbf{w}_0 \in \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R), \] or, more explicitly, all vectors within the affine space \[ \mathbf{w}_0 \in \bar{\mathbf{x}} + \text{span}(\mathbf{b}_1, \dots, \mathbf{b}_R). \tag{9} \] While we have an infinity of optimal solutions to choose from, the obvious choice \(\mathbf{w}_0^* := \bar{\mathbf{x}}\) stands out. Indeed, this is the choice that is essentially always made in practice, so much so that many PCA tutorials begin by assuming that the data points have been centered by subtracting off their empirical mean. I find it more insightful to include the intercept as a variable in the PCA optimization problem, and then show that the choice to set it equal to \(\bar{\mathbf{x}}\) is actually justified.
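Here is a minimal numerical check of the characterization (9) (synthetic, deliberately non-centered data; the variable names are illustrative): shifting \(\bar{\mathbf{x}}\) by any vector in \(\text{span}(B)\) leaves the objective unchanged, while a generic shift cannot do better.

```python
# Minimal sketch: any intercept of the form w0 = x_bar + B c is optimal,
# and a generic intercept is no better than x_bar.
import numpy as np

rng = np.random.default_rng(3)
N, D, R = 100, 4, 2
X = rng.normal(size=(N, D)) + 5.0             # non-centered synthetic data
B, _ = np.linalg.qr(rng.normal(size=(D, R)))  # any orthonormal basis

def pca_error(X, B, w0):
    """Objective (1) with the coefficients set to their optimal values."""
    Xc = X - w0
    W = Xc @ B
    return np.sum((Xc - W @ B.T) ** 2)

x_bar = X.mean(axis=0)
shift_in_span = B @ rng.normal(size=R)        # a vector lying in span(B)

err_mean = pca_error(X, B, x_bar)
err_shifted = pca_error(X, B, x_bar + shift_in_span)
err_generic = pca_error(X, B, x_bar + rng.normal(size=D))

assert np.isclose(err_mean, err_shifted)      # still optimal
assert err_generic >= err_mean                # generic shifts are no better
```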
Moreover, it is quite interesting that the mean is not actually the unique optimal choice here. Why is this? The characterization (9) says that we can add any vector lying in the span of the orthonormal basis to \(\bar{\mathbf{x}}\) and still maintain optimality. The key requirement is that, after shifting the data by subtracting off \(\mathbf{w}_0\), the resulting shifted points must “lie along” the lower-dimensional subspace \(\text{span}(B)\). Since \(\text{span}(B)\) is an \(R\)-dimensional plane through the origin, the shifted data must lie somewhere along this plane; from the perspective of the optimization problem, it doesn’t matter whether the data sits near the origin or somewhere very far away, so long as it is clustered around this plane. A picture is worth a thousand words here, and I will try to add one once I have time.
Finally, note that the specific choice of \(\bar{\mathbf{x}}\) has various other practical benefits. It leads to projections that are clustered around the origin, thus keeping numbers relatively small. It also leads to a nice statistical interpretation of the eigenvalue problems discussed in the previous subsection; e.g. the basis vector \(\mathbf{b}_1\) can be viewed as the direction along which the empirical variance of the projected data is maximized. This maximum variance perspective is discussed in more detail below.
Part 2: Interpreting PCA
Minimum Error or Maximum Variance?
While the derivations in the preceding section are somewhat lengthy, recall that this was all in the pursuit of solving the optimization problem (1). In words, we derived the best \(R\)-dimensional affine subspace to represent the data \(X\), where “best” is defined as minimizing the average Euclidean error between the data points and their projections onto the subspace. We showed that this error minimization problem could be re-written as a sequence of \(R\) maximization problems of the form (6). We now show that these maximization problems have a very nice statistical interpretation.
Sample covariance of the \(\mathbf{x}_n\)
We first recall that the empirical covariance matrix of the data points \(\mathbf{x}_1, \dots, \mathbf{x}_N\) is defined to be \[ \hat{C} := \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \bar{\mathbf{x}}) (\mathbf{x}_n - \bar{\mathbf{x}})^\top, \] which can be re-written as \[ \hat{C} = \frac{1}{N-1} \sum_{n=1}^{N} \mathbf{x}^c_n (\mathbf{x}^c_n)^\top = \frac{1}{N-1} (X^c)^\top X^c, \tag{10} \] where the superscript \(c\) indicates that the observations have been centered by subtracting off their empirical mean, and \(X^c\) is the matrix with rows \((\mathbf{x}^c_n)^\top\).
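As a quick sanity check of the matrix form in (10) (synthetic data; illustrative names only), it agrees with NumPy’s built-in sample covariance:

```python
# Minimal sketch: (X^c)^T X^c / (N - 1) matches np.cov on the same data.
import numpy as np

rng = np.random.default_rng(4)
N, D = 500, 4
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

Xc = X - X.mean(axis=0)
C_hat = Xc.T @ Xc / (N - 1)

assert np.allclose(C_hat, np.cov(X, rowvar=False))
```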
Recall that solving the maximization problems (6) revealed that the optimal basis vectors are given by the dominant eigenvectors of the matrix \((X^c)^\top X^c\), which is the (unscaled) covariance (10)! The scaling factor does not affect the optimal basis vectors; it simply scales the objective function. Specifically, \[ \begin{align} \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top (X^c)^\top X^c \mathbf{b}_{r+1}\right) = \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top \hat{C} \mathbf{b}_{r+1}\right). \tag{11} \end{align} \] We haven’t changed anything from (6) here, other than noting that a re-scaling of the objective function allows us to involve \(\hat{C}\) in the expression.
Sample covariance of the \(\mathbf{w}_n\)
Given that the sample covariance matrix of the data \(\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^D\) is given by \(\hat{C}\), it is natural to also consider the empirical covariance of \(\mathbf{w}_1, \dots, \mathbf{w}_N \in \mathbb{R}^R\). We begin by computing the sample mean \[ \frac{1}{N} \sum_{n=1}^{N} \mathbf{w}_n = \frac{1}{N} \sum_{n=1}^{N} B^\top \mathbf{x}^c_n = B^\top \left[\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}^c_n \right] = 0, \] using the fact that the empirical mean of the centered data points is \(0\). Recalling that the row vectors \(\mathbf{w}_n^\top\) are stored in the rows of the matrix \(W\), it follows that the empirical covariance matrix of \(\mathbf{w}_1, \dots, \mathbf{w}_N\) is given by \[ \hat{C}_w := \frac{1}{N-1} W^\top W, \tag{12} \] which follows from the calculation (10) with \(W\) in place of \(X^c\). Since \(W = X^c B\), we have \[ \hat{C}_w = \frac{1}{N-1} (X^c B)^\top (X^c B) = B^\top \hat{C} B, \] which allows us to write the covariance of the \(\mathbf{w}_n\) as a function of the covariance of the \(\mathbf{x}_n\). We can say even more since we know that \(B\) is given by \(V_R\), the truncated set of eigenvectors obtained from the eigendecomposition \((X^c)^\top X^c = V \Lambda V^\top\). We thus have \[ \hat{C}_w = \frac{1}{N-1} B^\top \left[(X^c)^\top X^c\right] B = \frac{1}{N-1} V_R^\top \left[(X^c)^\top X^c\right] V_R = \frac{1}{N-1} V_R^\top V_R \Lambda_R = \frac{1}{N-1} \Lambda_R, \tag{13} \] where \((X^c)^\top X^c V_R = V_R \Lambda_R\) follows from the fact that the columns of \(V_R\) store the first \(R\) eigenvectors of \((X^c)^\top X^c\). The conclusion here is that \(\hat{C}_w\) is diagonal, with variances \(\lambda_r/(N-1)\); i.e., the \(R\) largest eigenvalues of \(\hat{C}\). In other words, PCA computes a change-of-basis that diagonalizes the empirical covariance of the data.
PCA as Variance Maximization
We now return to the goal of providing a statistical interpretation of the objective function \(\mathbf{b}_{r}^\top \hat{C} \mathbf{b}_r\) in (11). Given the derivation of \(\hat{C}_w\) in (13), we see the empirical variance of \((\mathbf{w}_1)_r, \dots, (\mathbf{w}_N)_r\) (i.e., the values in the \(r^{\text{th}}\) column of \(W\)) is equal to \[ [\hat{C}_w]_{rr} = \mathbf{b}_r^\top \hat{C} \mathbf{b}_r, \] which is precisely the objective being maximized. To interpret this quantity more clearly, we consider the projection of \(\mathbf{x}_n^c\) onto the span of \(\mathbf{b}_r\), \[ \text{proj}_{\mathbf{b}_r} \mathbf{x}^c_n := \langle \mathbf{x}^c_n, \mathbf{b}_r \rangle \mathbf{b}_r = (\mathbf{w}_n)_r \mathbf{b}_r, \] which implies that \((\mathbf{w}_n)_r\) is the (signed) magnitude of the projection. Therefore, we conclude that \(\mathbf{b}_r^\top \hat{C} \mathbf{b}_r\) is the sample variance of the magnitude of the projections onto the subspace \(\text{span}(\mathbf{b}_r)\); loosely speaking, the variance of the projection along the \(r^{\text{th}}\) basis vector. Combining all of these equivalent expressions yields the chain of equalities \[ \text{Tr}(\hat{C}_w) = \frac{1}{N-1} \sum_{r=1}^{R} \lambda_r = \frac{1}{N-1} \sum_{n=1}^{N} \sum_{r=1}^{R} W_{nr}^2 = \sum_{r=1}^{R} \mathbf{b}_r^\top \hat{C} \mathbf{b}_r. \tag{15} \] The trace \(\text{Tr}(\hat{C}_w)\) thus represents the total variance of the projections, summed over all of the basis vectors. The total variance is equivalently given by the sum of the \(R\) largest eigenvalues of \(\hat{C}\), or by the scaled squared Frobenius norm \(\frac{1}{N-1}\lVert W \rVert_F^2\).
The final term in (15) provides an alternative interpretation of the objective function in (5); namely, that PCA seeks the basis \(B\) which results in the maximal total projected variance. The constrained problems in the resulting sequence, as in (11), are interpreted similarly. In particular, \[ \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top \hat{C} \mathbf{b}_{r+1}\right) \] can be viewed as seeking the direction along which the variance of the projections is maximized, subject to the constraint that the direction be orthogonal to the previous \(r\) directions. The optimal solution is the direction corresponding to the \((r+1)^{\text{st}}\) eigenvector of the empirical covariance \(\hat{C}\), and the resulting maximal variance in this direction is given by the associated eigenvalue.
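The identities above are straightforward to verify numerically. The following minimal sketch (synthetic data; every name is an illustrative choice) checks that \(\hat{C}_w\) is diagonal as in (13) and that the expressions in (15) agree.

```python
# Minimal sketch: with B = V_R, the weight covariance is diagonal with entries
# lambda_r / (N - 1), and its trace matches the other expressions in (15).
import numpy as np

rng = np.random.default_rng(5)
N, D, R = 400, 5, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)
C_hat = Xc.T @ Xc / (N - 1)

lam, V = np.linalg.eigh(Xc.T @ Xc)            # eigendecomposition of (X^c)^T X^c
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

B = V[:, :R]
W = Xc @ B
C_w = W.T @ W / (N - 1)

assert np.allclose(C_w, np.diag(lam[:R]) / (N - 1))                 # (13)
total_var = np.trace(C_w)
assert np.isclose(total_var, lam[:R].sum() / (N - 1))               # eigenvalue sum
assert np.isclose(total_var, np.sum(W ** 2) / (N - 1))              # Frobenius norm
assert np.isclose(total_var, sum(b @ C_hat @ b for b in B.T))       # b_r^T C_hat b_r
```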
The Singular Value Decomposition
A Matrix Approximation Problem: The Eckart-Young Theorem
Ignoring the intercept (or assuming that the \(\mathbf{x}_n\) have already been centered), we can re-write the reconstruction error as \[ \sum_{n=1}^{N} \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2 = \lVert X - WB^\top \rVert_F^2, \] where \(\lVert \cdot \rVert_F\) denotes the Frobenius norm. The PCA optimization problem can then be written as the matrix approximation problem \[ \text{argmin}_{B, W} \lVert X - WB^\top \rVert_F^2, \tag{16} \] where the columns of \(B\) are constrained to be orthonormal. We can equivalently write this as \[ \begin{align} &\text{argmin}_{\hat{X} \in \mathcal{M}} \lVert X - \hat{X} \rVert_F^2, &&\mathcal{M} = \{\hat{X} \in \mathbb{R}^{N \times D} : \hat{X}=WB^\top, \ W \in \mathbb{R}^{N \times R}, \ B \in \mathbb{R}^{D \times R} \text{ has orthonormal columns}\}, \tag{17} \end{align} \] which makes it even more clear that PCA can generically be viewed as the problem of approximating the data matrix \(X\) with another matrix \(\hat{X}\) that is constrained to lie in a subset \(\mathcal{M}\) of all \(N \times D\) matrices. We can phrase this even more succinctly by noticing that \(\mathcal{M}\) is precisely the set of all \(N \times D\) matrices with rank at most \(R\). Indeed, if \(\hat{X} \in \mathcal{M}\) then \(\text{rank}(\hat{X}) = \text{rank}(W B^\top) \leq R\), since \(W\) has \(R\) columns and thus the rank of this matrix product cannot exceed \(R\). Conversely, if we let \(\hat{X}\) be an arbitrary \(N \times D\) matrix of rank \(r \leq R\) then \(\hat{X}\) can be expanded using the compact SVD as \[ \hat{X} = U\tilde{\Sigma}V^\top = (U\tilde{\Sigma})V^\top, \] where \(U \in \mathbb{R}^{N \times r}\) and \(V \in \mathbb{R}^{D \times r}\) have orthonormal columns and \(\tilde{\Sigma} \in \mathbb{R}^{r \times r}\) is diagonal. By setting \(W := U\tilde{\Sigma}\) and \(B := V\) we have almost written things in the form required by (17), but (17) requires \(B\) to have exactly \(R\) orthonormal columns. Thus, we can simply extend \(V\) by appending orthonormal columns, \(B = [V|\mathbf{v}_{r+1}, \dots, \mathbf{v}_{R}]\) (this is justified by Gram-Schmidt), and then set \(W = [U\tilde{\Sigma}|\boldsymbol{0}]\). With these choices, \(\hat{X} = (U\tilde{\Sigma})V^\top = WB^\top\) with \(B\) now of the required form.
We have thus verified that (17) can equivalently be written as the low-rank matrix approximation problem \[ \begin{align} &\text{argmin}_{\hat{X} \in \mathcal{M}} \lVert X - \hat{X} \rVert_F^2, &&\mathcal{M} = \{\hat{X} \in \mathbb{R}^{N \times D} : \text{rank}(\hat{X}) \leq R \}. \tag{18} \end{align} \]
This is precisely the problem considered by the celebrated Eckart-Young theorem. The theorem concludes that the optimal solution to (18) is given by the rank-\(R\) truncated SVD of the data matrix, \(\hat{X} = U_R \Sigma_R V_R^\top\). This coincides with the PCA solution derived above: since the right singular vectors of \(X\) are the eigenvectors of \(X^\top X\), we have \(B = V_R\) and \(W = XV_R = U_R \Sigma_R\), so that \(WB^\top = U_R \Sigma_R V_R^\top\).
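Here is a minimal numerical sketch of this equivalence (synthetic, centered data; all names are illustrative): the rank-\(R\) truncated SVD reconstruction coincides with the eigenvector-based PCA reconstruction \(X V_R V_R^\top\).

```python
# Minimal sketch: truncated SVD and eigenvector-based PCA give the same
# rank-R reconstruction of the (centered) data matrix.
import numpy as np

rng = np.random.default_rng(6)
N, D, R = 200, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)                      # assume the data has been centered

# PCA via the eigendecomposition of X^T X.
lam, V = np.linalg.eigh(X.T @ X)
V_R = V[:, np.argsort(lam)[::-1][:R]]
X_hat_pca = (X @ V_R) @ V_R.T               # W B^T with B = V_R, W = X V_R

# Rank-R truncated SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_hat_svd = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]

assert np.allclose(X_hat_pca, X_hat_svd)
```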
Regression with an optimal basis
Statistical/Probabilistic Perspectives
There are various ways we might attach a statistical or probabilistic perspective to PCA; we briefly discuss a few of them here. Throughout this section we will take \[ \text{argmin}_{W,B} \sum_{n=1}^{N} \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2 \tag{19} \] as the jumping-off point for a probabilistic generalization; that is, we are implicitly assuming the data is centered so that we can ignore the intercept. Note that, as always, \(B\) is constrained to be orthogonal.
A Special Case of the Karhunen-Loeve Expansion
We start by noting that the objective function in (19) is (up to the factor \(\frac{1}{N}\)) a sample average of the quantities \(\lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2\). It is therefore natural to consider some underlying true expectation that this sample average is approximating. To this end, let us view \(\mathbf{x} = \mathbf{x}(\omega)\) and \(\mathbf{w} = \mathbf{w}(\omega)\) as random vectors, defined over some probability space \((\Omega, \mathcal{A}, \mathbb{P})\). We can then consider \[ \text{argmin}_{\mathbf{w},B} \mathbb{E} \lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 = \text{argmin}_{\mathbf{w},B} \int_{\Omega} \left\lVert \mathbf{x}(\omega) - \sum_{r=1}^{R} \mathbf{w}_r(\omega) \mathbf{b}_r \right\rVert_2^2 \mathbb{P}(d\omega), \tag{20} \] which can be viewed as the population analog of the sample approximation in (19). We note that the second expression above shows that the low-rank approximation of the random vector \(\mathbf{x}\) takes the form of a linear combination of (non-random) basis vectors, with random coefficients.
With \(B\) fixed, the optimization in \(\mathbf{w}\) is just as easy as before. Indeed, for any fixed \(\omega \in \Omega\), \[ \text{argmin}_{\mathbf{w}(\omega)} \lVert \mathbf{x}(\omega) - B\mathbf{w}(\omega) \rVert_2^2 = B^\top \mathbf{x}(\omega), \] using the same exact orthogonal projection reasoning as before. We have thus optimized for \(\mathbf{w}(\omega)\) on an \(\omega\)-by-\(\omega\) basis. The same result thus holds in expectation: \[ \text{argmin}_{\mathbf{w}} \mathbb{E} \lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 = B^\top \mathbf{x}. \] It is important to note that we have found the optimal random vector \(\mathbf{w}\), which was shown to be a linear transformation of the random vector \(\mathbf{x}\).
Optimizing for \(B\) with the optimal \(\mathbf{w}\) fixed also follows similarly from previous derivations. Using some of the results already derived above (see the section on optimizing the basis), we have
\[ \begin{align} \text{argmin}_{B} \mathbb{E}\lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 &= \text{argmin}_{B} \mathbb{E} \sum_{r=R+1}^{D} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \newline &= \text{argmin}_{B} \mathbb{E} \left[\lVert \mathbf{x} \rVert_2^2 - \sum_{r=1}^{R} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \right] \newline &= \text{argmax}_{B} \mathbb{E} \sum_{r=1}^{R} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \newline &= \text{argmax}_{B} \mathbb{E} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbf{x}\mathbf{x}^\top \mathbf{b}_r \newline &= \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbb{E}\left[\mathbf{x}\mathbf{x}^\top\right] \mathbf{b}_r \newline &= \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top C \mathbf{b}_r, \end{align} \]
where \(C := \text{Cov}[\mathbf{x}]\). Here we have used the centering assumption \(\mathbb{E}[\mathbf{x}] = 0\), as well as the implicit assumption that the random vector \(\mathbf{x}\) has a well-defined covariance matrix. At this point, the derivations go through exactly as before, with \(C\) replacing the empirical covariance \(\hat{C}\); that is, the optimal basis \(B\) is obtained by extracting the first \(R\) columns of \(V\), where \(C = V\Lambda V^\top\). Unsurprisingly, the results that held for the sample covariance hold analogously in this setting. For example, since \(\mathbf{x}\) has zero mean then \[ \mathbb{E}[\mathbf{w}_r] = \mathbb{E}[\mathbf{v}_r^\top \mathbf{x}] = 0. \] Moreover, \[ \text{Cov}[\mathbf{w}_r, \mathbf{w}_s] = \text{Cov}[\mathbf{v}_r^\top \mathbf{x}, \mathbf{v}_s^\top \mathbf{x}] = \mathbf{v}_r^\top C \mathbf{v}_s = \mathbf{v}_r^\top V \Lambda V^\top \mathbf{v}_s = \lambda_r 1[r = s], \] which says that the weights \(\mathbf{w}_r\) are uncorrelated, with variance equal to their respective eigenvalues. It is common, therefore, to normalize these random variables to have unit variance \[ \mathbf{\tilde{w}}_r := \frac{1}{\sqrt{\lambda_r}} \mathbf{w}_r, \] which means the optimal rank \(R\) approximation to the random vector \(\mathbf{x}\) may be expressed as \[ \mathbf{\hat{x}}(\omega) := \sum_{r=1}^{R} \sqrt{\lambda_r} \mathbf{\tilde{w}}_r(\omega) \mathbf{v}_r, \] where we have included the \(\omega\) argument to emphasize which quantities are random here. The following result summarizes our findings, extending to the case with non-zero mean.

Proposition. Let \(\mathbf{x}\) be a square-integrable \(D\)-dimensional random vector defined on \((\Omega, \mathcal{A}, \mathbb{P})\). Among all random vectors constrained to take values in an \(R\)-dimensional affine subspace of \(\mathbb{R}^D\), the random vector \[ \mathbf{\hat{x}}(\omega) = \mathbb{E}[\mathbf{x}] + \sum_{r=1}^{R} \sqrt{\lambda_r} \mathbf{\tilde{w}}_r(\omega) \mathbf{v}_r, \tag{21} \] provides an optimal approximation to \(\mathbf{x}\), in the expected Euclidean distance sense (20). Moreover, the random weights \(\mathbf{\tilde{w}}_r\) are uncorrelated, have zero mean, and have unit variance.
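As a quick Monte Carlo illustration of the proposition (a sketch only: the covariance, sample size, and variable names are all arbitrary choices, and a Gaussian is used purely for convenience), we can sample \(\mathbf{x}\) with a known covariance and check empirically that the normalized weights have zero mean, unit variance, and are uncorrelated.

```python
# Minimal sketch: the normalized weights w~_r = v_r^T x / sqrt(lambda_r) are
# (empirically) zero mean, unit variance, and uncorrelated.
import numpy as np

rng = np.random.default_rng(7)
D, R, M = 5, 2, 200_000                      # dimension, rank, Monte Carlo samples

A = rng.normal(size=(D, D))
C = A @ A.T                                  # a valid (PSD) covariance matrix
lam, V = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

x = rng.multivariate_normal(np.zeros(D), C, size=M)   # M samples of x, shape (M, D)
w_tilde = (x @ V[:, :R]) / np.sqrt(lam[:R])           # normalized weights

assert np.allclose(w_tilde.mean(axis=0), 0.0, atol=0.05)                 # zero mean
assert np.allclose(np.cov(w_tilde, rowvar=False), np.eye(R), atol=0.05)  # identity cov
```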
Note that random vectors can conceptually be thought of as stochastic processes with finite index sets. A similar decomposition to (21) can be constructed for more general stochastic processes with potentially uncountable index sets, with the modification that the sum in (21) will require a countably infinite number of terms. This result is generally known as the Karhunen-Loeve expansion. Thus, from this perspective PCA can be thought of as the special finite-dimensional case of the Karhunen-Loeve expansion.
Gaussian Random Vectors
The above derivations did not make any assumptions regarding the distribution of \(\mathbf{x}\), just that its covariance exists. Consequently, the main result (21) does not give the distribution of the weights \(\mathbf{\tilde{w}}_r\), only that they are uncorrelated, have zero mean, and unit variance. If we add the distributional assumption \(\mathbf{x} \sim \mathcal{N}(\mathbf{m}, C)\) then we are able to say more. Indeed, recalling that the (unscaled) weights are given by the projections \(\mathbf{w}_r = \langle \mathbf{x} - \mathbf{m}, \mathbf{v}_r \rangle\), we find that \[ \mathbf{w} = V^\top (\mathbf{x} - \mathbf{m}) \sim \mathcal{N}(0, V^\top C V) = \mathcal{N}(0, \Lambda). \] The key point here is that the \(\mathbf{w}_r\) are jointly Gaussian, and hence their uncorrelatedness implies that they are in fact independent. The weights inherit Gaussianity from \(\mathbf{x}\). Thus, under this stronger assumption we are able to characterize the distribution of the weights exactly.
Maximum Likelihood Estimation with Gaussian Noise
Probabilistic PCA
https://stats.stackexchange.com/questions/190308/why-does-probabilistic-pca-use-gaussian-prior-over-latent-variables
Part 3: Using PCA
- Decorrelating
- Dimensionality reduction
Part 4: Computing PCA
- Eigendecomposition
- SVD
Part 5: Application and Code
Appendix
Eigenvalue Problems
In this section, I briefly discuss the spectral norm and eigenvalue problems in finite-dimensional vector spaces, which I utilize above when optimizing the basis \(B\) in the PCA derivation. Consider a matrix \(A \in \mathbb{R}^{N \times D}\), which represents a linear transformation from \(\mathbb{R}^{D}\) to \(\mathbb{R}^N\). We define the spectral norm of \(A\) as the largest factor by which the map \(A\) can “stretch” a vector \(\mathbf{u} \in \mathbb{R}^{D}\), \[ \lVert A \rVert_2 := \max_{\mathbf{u} \neq \boldsymbol{0}} \frac{\lVert A\mathbf{u} \rVert_2}{\lVert \mathbf{u} \rVert_2}. \] Using the linearity of \(A\), one can show that we need only consider vectors of unit length; that is, \[ \lVert A \rVert_2 = \max_{\lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2. \tag{A1} \] Optimization problems of this type are called eigenvalue problems for reasons that will shortly become clear. For the purposes of the PCA derivation, we will require consideration of a slightly more general eigenvalue problem. To define this problem, first note that the matrix \(A^\top A\) is symmetric, positive semi-definite since \[ \begin{align} &(A^\top A)^\top = A^\top (A^\top)^\top = A^\top A, &\mathbf{u}^\top (A^\top A)\mathbf{u} = \lVert A \mathbf{u} \rVert_2^2 \geq 0. \end{align} \] Thus, by the spectral theorem \(A^\top A\) has \(D\) orthogonal eigenvectors, which we will denote by \(\mathbf{e}_1, \dots, \mathbf{e}_D\) and assume that they have been normalized to have unit norm. By positive semi-definiteness the respective eigenvalues \(\lambda_1, \dots, \lambda_D\) (sorted in decreasing order) are all non-negative. For \(d = 1, \dots, D\), let \(\mathcal{E}_d := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_d)\), with \(\mathcal{E}^{\perp}_d := \text{span}(\mathbf{e}_{d+1}, \dots, \mathbf{e}_D)\) its orthogonal complement. We now consider the eigenvalue problem
\[ \begin{align} \max_{\mathbf{u} \in \mathcal{E}^{\perp}_d, \lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2. \tag{A2} \end{align} \]
We have generalized (A1) by adding an orthogonality constraint. Taking as a convention \(\mathcal{E}_0 := \{\mathbf{0}\}\) we then have \(\mathcal{E}^{\perp}_0 = \mathbb{R}^D\), which means that setting \(d = 0\) in (A2) recovers the original problem (A1) as a special case. We will prove the following result.
Proposition. Let \(A \in \mathbb{R}^{N \times D}\) be a matrix. Let \(\mathbf{e}_1, \dots, \mathbf{e}_D\) denote the orthonormal eigenbasis of \(A^\top A\), with respective eigenvalues \(\lambda_1, \dots, \lambda_D\) sorted in descending order. For \(d = 1, \dots, D\) define \(\mathcal{E}_d := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_d)\), with \(\mathcal{E}^{\perp}_d := \text{span}(\mathbf{e}_{d+1}, \dots, \mathbf{e}_D)\) its orthogonal complement. Then \[ \mathbf{e}_{d+1} = \text{argmax}_{\mathbf{u} \in \mathcal{E}^{\perp}_d, \lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2, \tag{A3} \] with the maximal value equal to \(\sqrt{\lambda_{d+1}}\). In particular, we have \[ \mathbf{e}_1 = \text{argmax}_{\lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2, \tag{A4} \] with maximal value \(\sqrt{\lambda_1}\).
Proof. Let \(\mathbf{u} \in \mathcal{E}^{\perp}_d\) be an arbitrary vector of unit length. This vector may be represented with respect to the eigenbasis as \[ \mathbf{u} = \sum_{r=d+1}^{D} u_r \mathbf{e}_r, \qquad u_r := \langle \mathbf{u}, \mathbf{e}_r \rangle. \] We will use this representation to show that

1. \(\lVert A\mathbf{u} \rVert_2\) is upper bounded by \(\sqrt{\lambda_{d+1}}\), and
2. the upper bound is achieved by some \(\mathbf{u} \in \mathcal{E}^{\perp}_d\),
which together imply the claimed result. We will actually work with the squared norm instead, which allows us to leverage the inner product. We have \[ \begin{align} \lVert A\mathbf{u} \rVert^2_2 = \langle A\mathbf{u}, A\mathbf{u} \rangle = \langle A^\top A\mathbf{u}, \mathbf{u} \rangle &= \left\langle A^\top A \sum_{r=d+1}^{D} u_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle \newline &= \left\langle \sum_{r=d+1}^{D} u_r (A^\top A \mathbf{e}_r), \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle \newline &= \left\langle \sum_{r=d+1}^{D} u_r \lambda_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle, \end{align} \] having used the fact the \(\mathbf{e}_r\) are eigenvectors of \(A^\top A\). Now we can take advantage of the fact that the \(\mathbf{e}_r\) are orthonormal to obtain \[ \begin{align} \left\langle \sum_{r=d+1}^{D} u_r \lambda_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle = \sum_{r=d+1}^{D} u_r^2 \lambda_r \lVert \mathbf{e}_r \rVert^2_2 = \sum_{r=d+1}^{D} u_r^2 \lambda_r \leq \sum_{r=d+1}^{D} u_r^2 \lambda_{d+1} = \lambda_{d+1} \lVert \mathbf{u} \rVert_2^2 = \lambda_{d+1} \end{align} \] where the inequality follows from the fact that the eigenvalues are sorted in descending order. This verifies the upper bound \(\lVert A\mathbf{u} \rVert_2 \leq \sqrt{\lambda_{d+1}}\). To show that the bound is achieved, we consider setting \(\mathbf{u} = \mathbf{e}_{d+1}\). Then,
\[ \begin{align} \lVert A\mathbf{e}_{d+1} \rVert^2_2 = \langle A\mathbf{e}_{d+1}, A\mathbf{e}_{d+1} \rangle = \langle A^\top A\mathbf{e}_{d+1}, \mathbf{e}_{d+1} \rangle = \langle \lambda_{d+1} \mathbf{e}_{d+1}, \mathbf{e}_{d+1} \rangle = \lambda_{d+1} \lVert \mathbf{e}_{d+1} \rVert^2_2 = \lambda_{d+1}, \end{align} \]
so we have indeed verified that the equality \(\lVert A\mathbf{u} \rVert_2 = \sqrt{\lambda_{d+1}}\) is achieved for some unit-norm vector \(\mathbf{u} \in \mathcal{E}^{\perp}_{d}\). The claim is thus proved. \(\qquad \blacksquare\)
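For readers who want to see the proposition in action, here is a minimal numerical sketch (random matrix and random search; purely illustrative) comparing random feasible directions against \(\mathbf{e}_{d+1}\), and also checking the \(d = 0\) special case (A1)/(A4).

```python
# Minimal sketch: no unit vector in E_d^perp beats e_{d+1}, which attains
# the bound sqrt(lambda_{d+1}); d = 0 recovers the spectral norm (A1).
import numpy as np

rng = np.random.default_rng(8)
N, D, d = 7, 5, 2
A = rng.normal(size=(N, D))

lam, E = np.linalg.eigh(A.T @ A)             # eigenpairs of A^T A
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]

P_perp = np.eye(D) - E[:, :d] @ E[:, :d].T   # projector onto E_d^perp
best = 0.0
for _ in range(10_000):
    u = P_perp @ rng.normal(size=D)          # random direction in E_d^perp
    u /= np.linalg.norm(u)
    best = max(best, np.linalg.norm(A @ u))

assert best <= np.sqrt(lam[d]) + 1e-9                            # bound sqrt(lambda_{d+1})
assert np.isclose(np.linalg.norm(A @ E[:, d]), np.sqrt(lam[d]))  # attained by e_{d+1}
assert np.isclose(np.linalg.norm(A, 2), np.sqrt(lam[0]))         # d = 0: spectral norm
```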