Principal Components Analysis
I derive the PCA decomposition from both a minimum reconstruction error and maximum variance perspective. I also discuss a statistical interpretation of PCA.
Part 1: Formulating and Solving the PCA Optimization Problem
Setup and Notation
Suppose that we have data $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^D$, stacked into the rows of a matrix $X \in \mathbb{R}^{N \times D}$. Our task is to find a subspace of smaller dimension $R < D$ such that the projection of the data points onto the subspace retains as much information as possible. By restricting our attention to orthonormal bases for the low-dimensional subspace, we reduce the problem to finding a set of orthonormal basis vectors \begin{align} &\mathbf{b}_1, \dots, \mathbf{b}_R \in \mathbb{R}^D, &&\langle \mathbf{b}_r, \mathbf{b}_s \rangle = \delta_{r,s}. \end{align} Define $B \in \mathbb{R}^{D \times R}$ to be the matrix with $r^{\text{th}}$ column equal to $\mathbf{b}_r$. The subspace generated by the basis is given by \begin{align} \text{span}(B) := \left\{\sum_{r=1}^{R} w_r \mathbf{b}_r : w_1, \dots, w_R \in \mathbb{R} \right\}. \end{align} Throughout this post I will abuse notation by referring to the matrix $B$ when actually talking about the set of vectors $\{\mathbf{b}_1, \dots, \mathbf{b}_R\}$. Since there is no a priori reason to assume that the data is centered, we should also allow for the subspace to be shifted by some intercept $\mathbf{w}_0 \in \mathbb{R}^D$, resulting in the affine space \begin{align} \mathbf{w}_0 + \text{span}(B) := \{\mathbf{w}_0 + \mathbf{b} : \mathbf{b} \in \text{span}(B)\}. \end{align} Loosely speaking, the task is to find the basis $B$, intercept $\mathbf{w}_0$, and pointwise weights $\mathbf{w}_n \in \mathbb{R}^R$ such that \begin{align} \mathbf{x}_n &\approx \mathbf{w}_0 + \sum_{r=1}^{R} (\mathbf{w}_n)_r \mathbf{b}_r &&\forall n=1,\dots,N \newline &= \mathbf{w}_0 + B\mathbf{w}_n. \end{align} To formalize this notion, PCA measures the error in the above approximation using Euclidean distance, averaged over the data points. To further simplify notation, we stack the $\mathbf{w}_n^\top$ in the rows of a matrix $W \in \mathbb{R}^{N \times R}$. With all of this notation established, we can state that PCA solves the optimization problem \begin{align} \text{argmin}_{B, W, \mathbf{w}_0} \frac{1}{N} \sum_{n=1}^{N} \lVert \mathbf{x}_n - (\mathbf{w}_0 + B\mathbf{w}_n) \rVert_2^2, \tag{1} \end{align} where the basis $B$ is constrained to be orthonormal. As we will see, this optimization naturally breaks down into two distinct problems which can be solved sequentially:
- Given the basis $B$ and intercept $\mathbf{w}_0$, find the optimal basis coefficients $\mathbf{w}_n$ corresponding to each data point $\mathbf{x}_n$.
- Find the optimal basis and intercept.
Part of the popularity of PCA stems from the fact that both problems can be solved in closed-form. Let us consider both problems in turn.
Optimizing the Basis Coefficients
Let us first consider $B$ and $\mathbf{w}_0$ to be fixed, meaning that we are fixing
an affine subspace of dimension $R$. We seek to find the optimal way to represent
the data in this lower-dimensional space. As we will show, the Euclidean objective
used by PCA implies that this problem reduces to straightforward orthogonal projection.
For now, let $\mathbf{x}_n^c := \mathbf{x}_n - \mathbf{w}_0$ denote the centered (i.e., shifted)
data points (we will deal with the intercept shortly). We are thus considering
the problem
\begin{align} \text{argmin}_{\mathbf{w}_1, \dots, \mathbf{w}_N} \sum_{n=1}^{N} \lVert \mathbf{x}_n^c - B\mathbf{w}_n \rVert_2^2. \end{align}
Observe that $\mathbf{w}_n$ only appears in the $n^{\text{th}}$ term of the sum,
meaning that we can consider each summand independently,
\begin{align} \mathbf{w}_n = \text{argmin}_{\mathbf{w} \in \mathbb{R}^R} \lVert \mathbf{x}_n^c - B\mathbf{w} \rVert_2^2. \end{align}
In words, we seek the linear combination of the basis vectors that results
in minimal Euclidean distance from $\mathbf{x}_n^c$; this is a standard orthogonal
projection problem from linear algebra. Since the basis vectors are orthonormal,
the optimal projection coefficients are given by
\begin{align}
&(\mathbf{w}_n)_r = \langle \mathbf{x}_n^c, \mathbf{b}_r \rangle,
&&\mathbf{w}_n = B^\top \mathbf{x}_n^c, \tag{2}
\end{align}
which can be written succinctly for all data points by stacking the $\mathbf{w}_n^\top$
as rows in a matrix $W \in \mathbb{R}^{N \times R}$; i.e., $W = X^c B$,
with $X^c$ denoting the centered data matrix with rows set to the
$(\mathbf{x}_n^c)^\top$.
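To make the projection step concrete, here is a minimal NumPy sketch of computing the optimal weights for a fixed orthonormal basis; the names (`optimal_weights`, `X`, `B`, `w0`) are mine, chosen for illustration.

```python
import numpy as np

def optimal_weights(X, B, w0):
    """Optimal PCA coefficients W = (X - w0) B for a fixed orthonormal basis B.

    X  : (N, D) data matrix, one observation per row.
    B  : (D, R) matrix with orthonormal columns.
    w0 : (D,) intercept vector.
    """
    Xc = X - w0       # shift the data by the intercept
    return Xc @ B     # rows are w_n^T = (x_n - w0)^T B

# small sanity check with an arbitrary orthonormal basis
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
B, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # orthonormal columns via QR
W = optimal_weights(X, B, X.mean(axis=0))      # shape (100, 2)
```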
Optimizing the Basis
In the previous section, we saw that for a fixed basis and intercept, optimizing the basis weights reduced to an orthogonal projection problem. In this section we show that with the weights fixed at their optimal values, optimizing the basis reduces to solving a sequence of eigenvalue problems. To be clear, we are now considering the problem \begin{align} \text{argmin}_{B} \sum_{n=1}^{N} \lVert \mathbf{x}_n^c - B\mathbf{w}_n \rVert_2^2, \tag{3} \end{align}
where the $\mathbf{w}_n$ are now fixed at the optimal values satisfying (2); i.e., $\mathbf{w}_n = B^\top \mathbf{x}_n^c$. However, in the derivations below we will just write $\mathbf{w}_n$ to keep the notation lighter. Note that we are still treating $\mathbf{w}_0$ as fixed for the time being. We will make another notational simplification in this section by writing $\mathbf{x}_n$ in place of $\mathbf{x}_n^c$. Just keep in mind that throughout this section, $\mathbf{x}_n$ should be interpreted as $\mathbf{x}_n^c = \mathbf{x}_n - \mathbf{w}_0$.
This problem is also referred to as minimizing the reconstruction error, since $\lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2$ is the error between the original data point $\mathbf{x}_n$ and the $D$-dimensional vector $\hat{\mathbf{x}}_n := B\mathbf{w}_n$, which can be thought of as an approximation to $\mathbf{x}_n$ that has been reconstructed from its lower-dimensional representation $\mathbf{w}_n$. The key here is to re-write this objective function so that this optimization problem takes the form of an eigenvalue problem, which is something that we already know how to solve (see the appendix, A1).
To start, we extend the orthonormal set $\{\mathbf{b}_1, \dots, \mathbf{b}_R\}$ to an orthonormal basis $\{\mathbf{b}_1, \dots, \mathbf{b}_D\}$ for $\mathbb{R}^D$. Now we can write the original data point and its approximation with respect to this basis as \begin{align} &\mathbf{x}_n = \sum_{r=1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r, &&\mathbf{\hat{x}}_n = \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r \end{align}
and hence the residual is given by \begin{align} \mathbf{x}_n - \mathbf{\hat{x}}_n &= \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r. \end{align}
Thus, the objective function in (3) can be written as \begin{align} \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{\hat{x}}_n \rVert_2^2 &= \sum_{n=1}^{N} \bigg\lVert \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r \bigg\rVert_2^2 = \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 \lVert \mathbf{b}_r \rVert_2^2 = \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2, \end{align}
where the second and third equalities use the facts that the $\mathbf{b}_r$ are orthogonal and of unit norm, respectively.
We could continue working with this formulation, but at this point it is convenient to re-write the minimization problem we have been working with as an equivalent maximization problem. Note that the above residual calculation is of the form $\sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 = \lVert \mathbf{x}_n \rVert_2^2 - \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2$ (and summed over $n$). Since $\lVert \mathbf{x}_n \rVert_2^2$ is fixed, minimizing the residual (i.e., the reconstruction error) is equivalent to maximizing $\sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2$. More rigorously, we have
\begin{align} \text{argmin}_B \sum_{n=1}^{N} \sum_{r=R+1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 &= \text{argmin}_B \sum_{n=1}^{N} \left(\sum_{r=1}^{D} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 - \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2\right) \newline &= \text{argmin}_B \sum_{n=1}^{N} \left(\lVert \mathbf{x}_n \rVert_2^2 - \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2\right) \newline &= \text{argmax}_B \sum_{n=1}^{N} \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2. \tag{4} \end{align}
We can now re-write the squared inner product to obtain \begin{align} \sum_{n=1}^{N} \sum_{r=1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle^2 &= \sum_{n=1}^{N} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbf{x}_n \mathbf{x}_n^\top \mathbf{b}_r = \sum_{r=1}^{R} \mathbf{b}_r^\top \left(\sum_{n=1}^{N}\mathbf{x}_n \mathbf{x}_n^\top\right) \mathbf{b}_r = \sum_{r=1}^{R} \mathbf{b}_r^\top (X^\top X) \mathbf{b}_r, \end{align}
where the final step uses the fact that $\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top = X^\top X$. We have managed to re-write (3) as \begin{align} \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top (X^\top X) \mathbf{b}_r, \tag{5} \end{align} where we recall that this is also subject to the constraint that $B$ is orthogonal.
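As a quick numerical sanity check on this rewriting (a sketch with arbitrary synthetic data, not part of the derivation itself), the reconstruction error for any orthonormal $B$ should equal $\lVert X \rVert_F^2 - \sum_{r=1}^{R} \lVert X\mathbf{b}_r \rVert_2^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, R = 50, 6, 3
X = rng.normal(size=(N, D))
B, _ = np.linalg.qr(rng.normal(size=(D, R)))   # any orthonormal basis, not necessarily optimal

recon_error = np.sum((X - X @ B @ B.T) ** 2)          # sum_n ||x_n - B B^T x_n||^2
identity_rhs = np.sum(X ** 2) - np.sum((X @ B) ** 2)  # ||X||_F^2 - sum_r ||X b_r||^2
assert np.isclose(recon_error, identity_rhs)
```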
Before proceeding, we note that $X^\top X$ is a positive semi-definite matrix, whose eigenvalues we denote $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_D$, sorted in decreasing order. Note that the eigenvalues are all non-negative due to the positive semi-definiteness. Let $\mathbf{e}_1, \dots, \mathbf{e}_D$ denote the respective eigenvectors, normalized to have unit norm. These vectors are guaranteed to be orthogonal by the Spectral Theorem.
We now notice in (5) that the objective function has been decomposed into $R$ different terms, each of which only depends on a single $\mathbf{b}_r$. However, these do not constitute $R$ independent optimization problems, as they are all coupled through the orthogonality constraint. We will thus consider solving them in a recursive fashion, beginning with the first term, \begin{align} \text{argmax}_{\lVert \mathbf{b}_1 \rVert_2 = 1} \mathbf{b}_1^\top (X^\top X) \mathbf{b}_1 = \text{argmax}_{\lVert \mathbf{b}_1 \rVert_2 = 1} \lVert X\mathbf{b}_1 \rVert_2^2. \end{align} This is an eigenvalue problem! It is precisely of the form (A4) (see appendix) and so we apply that result to conclude that the optimal argument is $\mathbf{b}_1 = \mathbf{e}_1$, with associated optimal value $\lambda_1$ (note the objective here is the squared norm, in contrast to the statement in the appendix). Taking this as the base case, we now proceed inductively. Assume that at the $r^{\text{th}}$ problem in the sequence, the solution is given by $\mathbf{b}_r = \mathbf{e}_r$. We must show the solution to the $(r+1)^{\text{st}}$ problem is $\mathbf{b}_{r+1} = \mathbf{e}_{r+1}$. Under the inductive hypothesis, this problem is constrained so that $\mathbf{b}_{r+1}$ is orthogonal to each of $\mathbf{e}_1, \dots, \mathbf{e}_r$; i.e., we require $\langle \mathbf{b}_{r+1}, \mathbf{e}_s \rangle = 0$ for $s = 1, \dots, r$. If we denote $\mathcal{E}_r := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_r)$ and $\mathcal{E}^{\perp}_r$ the orthogonal complement of $\mathcal{E}_r$, then a succinct way to write the orthogonality constraint is that $\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_r$. The problem can thus be written as \begin{align} \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \lVert X \mathbf{b}_{r+1} \rVert_2^2, \tag{6} \end{align} which is another eigenvalue problem, precisely of the form (A3). Using this result from the appendix, we conclude that this is solved by $\mathbf{b}_{r+1} = \mathbf{e}_{r+1}$, with the maximal objective value $\lambda_{r+1}$.
That was a lot, so before moving on let’s briefly summarize. First of all, recall that I have been abusing notation by writing $X$ where I should be writing $X^c$, the matrix with rows $(\mathbf{x}_n - \mathbf{w}_0)^\top$. In summarizing the result here I will make this correction. Here we have considered the problem of finding the optimal orthonormal basis $\mathbf{b}_1, \dots, \mathbf{b}_R$, for any fixed $\mathbf{w}_0$, but with the $\mathbf{w}_n$ set to their optimal values satisfying (2); i.e., $\mathbf{w}_n = B^\top \mathbf{x}_n^c$. Given this, we showed that the reconstruction error (3) is minimized by setting $B$ equal to the matrix with columns given by the $R$ dominant (normalized) eigenvectors of $(X^c)^\top X^c$. We arrived at this solution by showing that the error minimization problem could be re-cast as the maximization (5), which can be viewed as a sequence of eigenvalue problems.
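The result of this subsection translates directly into a few lines of code. A minimal sketch (the helper name `pca_basis` is mine): eigendecompose $(X^c)^\top X^c$ and keep the eigenvectors associated with the $R$ largest eigenvalues.

```python
import numpy as np

def pca_basis(Xc, R):
    """Orthonormal basis (D x R) minimizing the reconstruction error of the centered data Xc."""
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)   # eigh: ascending eigenvalues of a symmetric matrix
    order = np.argsort(evals)[::-1]            # re-sort in decreasing order
    return evecs[:, order[:R]], evals[order[:R]]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated synthetic data
Xc = X - X.mean(axis=0)
B, top_evals = pca_basis(Xc, R=2)
```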
Optimizing the Intercept
The last ingredient we are missing to solve (1) is the optimal value of the intercept $\mathbf{w}_0$, which has so far been viewed as fixed in the above derivations. At first glance, this problem might seem like somewhat of an afterthought, but there are some subtleties that are worth exploring here.
The problem we are now considering is \begin{align} \text{argmin}_{\mathbf{w}_0} \sum_{n=1}^{N} \lVert \mathbf{x}_n - (\mathbf{w}_0 + B\mathbf{w}_n) \rVert_2^2, \tag{7} \end{align} with $\mathbf{w}_n := B^\top(\mathbf{x}_n - \mathbf{w}_0)$ denoting the optimal weights derived above (these derivations will go through with any orthonormal basis $B$). Plugging in this expression for $\mathbf{w}_n$ gives \begin{align} \text{argmin}_{\mathbf{w}_0} \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{w}_0 - BB^\top(\mathbf{x}_n - \mathbf{w}_0) \rVert_2^2 = \text{argmin}_{\mathbf{w}_0} \sum_{n=1}^{N} \lVert (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) \rVert_2^2. \end{align} Computing the gradient of this expression with respect to $\mathbf{w}_0$ yields the optimality condition \begin{align} \nabla_{\mathbf{w}_0} \sum_{n=1}^{N} \lVert (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) \rVert_2^2 = -2\sum_{n=1}^{N} (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0), \tag{8} \end{align} where we have used the fact that $(I - BB^\top)^\top(I - BB^\top) = I - BB^\top$ (since $I - BB^\top$ is a projection matrix; see appendix). By linearity we then have \begin{align} \sum_{n=1}^{N} (I - BB^\top)(\mathbf{x}_n - \mathbf{w}_0) = N(I - BB^\top)(\bar{\mathbf{x}} - \mathbf{w}_0), \end{align} where we have defined the empirical mean of the data $\bar{\mathbf{x}} := \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$. Since $\mathbf{w}_0$ is optimal when (8) is equal to zero, this leads to the condition \begin{align} (I - BB^\top)(\bar{\mathbf{x}} - \mathbf{w}_0) = 0, \end{align} or equivalently, $\bar{\mathbf{x}} - \mathbf{w}_0 \in \text{null}(I - BB^\top)$. Noting that the null space is non-trivial, since $\text{null}(I - BB^\top) = \text{span}(B)$ (again, see the appendix section on projection matrices), we then conclude that there are infinitely many optimal solutions! Using the basis $\mathbf{b}_1, \dots, \mathbf{b}_R$ for the null space, we can characterize the set of optimal $\mathbf{w}_0$ as those satisfying \begin{align} \bar{\mathbf{x}} - \mathbf{w}_0 \in \text{span}(B), \tag{9} \end{align} or, more explicitly, all vectors within the affine space \begin{align} \bar{\mathbf{x}} + \text{span}(B) = \left\{\bar{\mathbf{x}} + \sum_{r=1}^{R} \alpha_r \mathbf{b}_r : \alpha_1, \dots, \alpha_R \in \mathbb{R} \right\}. \end{align} While we have an infinity of optimal solutions we could choose from, the obvious choice $\mathbf{w}_0 := \bar{\mathbf{x}}$ stands out. Indeed, this is the choice that is essentially always made in practice, so much so that many PCA tutorials will begin by assuming that the data points have all been centered by subtracting off their empirical mean. I find it more insightful to include the intercept as a variable in the PCA optimization problem, and then show that the choice to set it equal to $\bar{\mathbf{x}}$ is actually justified.
Moreover, it is quite interesting that the mean is not actually the unique optimal choice here. Why is this? The characterization (9) says that we can add any vector lying in the span of the orthonormal basis $B$ to $\bar{\mathbf{x}}$ and still maintain optimality. So the key requirement is that, after shifting the data by subtracting off $\mathbf{w}_0$, the resulting shifted points must “lie along” the lower-dimensional subspace $\text{span}(B)$. Since $\mathbf{w}_0 + \text{span}(B)$ defines an $R$-dimensional plane in $\mathbb{R}^D$, the data must lie somewhere along this plane; from the perspective of the optimization problem, it doesn’t matter whether the data lies around the origin or somewhere very far away, so long as it is clustered around this plane. A picture is worth a thousand words here, and I will try to add one once I have time.
Finally, note that the specific choice $\mathbf{w}_0 := \bar{\mathbf{x}}$ has various other practical benefits. It leads to projections that are clustered around the origin, thus keeping the numbers involved relatively small. It also leads to a nice statistical interpretation of the eigenvalue problems discussed in the previous subsection; e.g., the first basis vector $\mathbf{b}_1$ can be viewed as the direction along which the empirical variance of the projected data is maximized. This maximum variance perspective is discussed in more detail below.
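Putting the three pieces together (intercept, basis, weights), a complete PCA fit might look like the following sketch; `fit_pca` and `reconstruct` are hypothetical helper names, not a standard API.

```python
import numpy as np

def fit_pca(X, R):
    """Return the intercept w0, orthonormal basis B (D x R), and weights W (N x R)."""
    w0 = X.mean(axis=0)                       # optimal intercept: the empirical mean
    Xc = X - w0
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)  # eigendecomposition of (X^c)^T X^c
    order = np.argsort(evals)[::-1]
    B = evecs[:, order[:R]]                   # dominant R eigenvectors
    W = Xc @ B                                # optimal weights (projections)
    return w0, B, W

def reconstruct(w0, B, W):
    """Map low-dimensional weights back to D-dimensional approximations."""
    return w0 + W @ B.T

X = np.random.default_rng(3).normal(size=(100, 5))
w0, B, W = fit_pca(X, R=2)
X_hat = reconstruct(w0, B, W)   # best rank-2 affine approximation of the rows of X
```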
Part 2: Interpreting PCA
Minimum Error or Maximum Variance?
While the derivations in the preceding section are somewhat lengthy, recall that this was all in the pursuit of solving the optimization problem (1). In words, we derived the best $R$-dimensional affine subspace to represent the data $\mathbf{x}_1, \dots, \mathbf{x}_N$, where “best” is defined as minimizing the average Euclidean error between the data points and their projections onto the subspace. We showed that this error minimization problem could be re-written as a sequence of maximization problems of the form (6). We now show that these maximization problems have a very nice statistical interpretation.
Sample covariance of the $\mathbf{x}_n$
We first recall that the empirical covariance matrix of the data points $\mathbf{x}_1, \dots, \mathbf{x}_N$ is defined to be \begin{align} \hat{C} := \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top, \end{align} which can be re-written as \begin{align} \hat{C} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n^c (\mathbf{x}_n^c)^\top = \frac{1}{N} (X^c)^\top X^c, \tag{10} \end{align} where the superscript $c$ indicates that the observations have been centered by subtracting off their empirical mean.
Recall that solving the maximization problems (6) revealed that the optimal basis vectors are given by the dominant eigenvectors of the matrix $(X^c)^\top X^c$, which is the (unscaled) covariance (10)! The scaling factor $\frac{1}{N}$ does not affect the optimal basis vectors; it simply scales the objective function. Specifically, \begin{align} \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top (X^c)^\top X^c \mathbf{b}_{r+1}\right) = \text{argmax}_{\mathbf{b}_{r+1} \in \mathcal{E}^{\perp}_{r}, \lVert \mathbf{b}_{r+1} \rVert_2=1} \left(\mathbf{b}_{r+1}^\top \hat{C} \mathbf{b}_{r+1}\right). \tag{11} \end{align} We haven’t changed anything from (6) here, other than noting that a re-scaling of the objective function allows us to involve $\hat{C}$ in the expression.
Sample covariance of the $\mathbf{w}_n$
Given that the sample covariance matrix of the data is given by $\hat{C}$, it is natural to also consider the empirical covariance of the weights $\mathbf{w}_1, \dots, \mathbf{w}_N$. We begin by computing the sample mean \begin{align} \bar{\mathbf{w}} := \frac{1}{N} \sum_{n=1}^{N} \mathbf{w}_n = \frac{1}{N} \sum_{n=1}^{N} B^\top \mathbf{x}_n^c = B^\top \left(\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n^c \right) = \mathbf{0}, \end{align} using the fact that the empirical mean of the centered data points is $\mathbf{0}$. Recalling that the row vectors $\mathbf{w}_n^\top$ are stored in the rows of the matrix $W = X^c B$, it follows that the empirical covariance matrix of the weights is given by \begin{align} \hat{C}_W := \frac{1}{N} W^\top W, \tag{12} \end{align} which follows from the calculation (10) with $W$ in place of $X^c$. Since $W = X^c B$, we have \begin{align} \hat{C}_W = \frac{1}{N} B^\top (X^c)^\top X^c B = B^\top \hat{C} B, \end{align} which allows us to write the covariance of the $\mathbf{w}_n$ as a function of the covariance of the $\mathbf{x}_n$. We can say even more, since we know that the optimal $B$ is given by $E_R := [\mathbf{e}_1, \dots, \mathbf{e}_R]$, the truncated set of eigenvectors obtained from the eigendecomposition $\hat{C} = E\Lambda E^\top$ (note that $\hat{C}$ and $(X^c)^\top X^c$ share the same eigenvectors, with eigenvalues differing only by the factor $\frac{1}{N}$; from here on, $\lambda_1 \geq \dots \geq \lambda_D$ denote the eigenvalues of $\hat{C}$). We thus have \begin{align} \hat{C}_W = E_R^\top \hat{C} E_R = E_R^\top E \Lambda E^\top E_R = \Lambda_R := \text{diag}(\lambda_1, \dots, \lambda_R), \tag{13} \end{align} where the final equality follows from the fact that the columns of $E_R$ store the first $R$ eigenvectors of $\hat{C}$. The conclusion here is that $\hat{C}_W$ is diagonal, with variances equal to the $R$ largest eigenvalues of $\hat{C}$. In other words, PCA computes a change-of-basis that diagonalizes the empirical covariance of the data.
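This diagonalization is easy to check numerically; in the following sketch (with arbitrary synthetic data), the empirical covariance of the rows of $W$ comes out (numerically) diagonal, with diagonal entries equal to the leading eigenvalues of $\hat{C}$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)

C_hat = Xc.T @ Xc / len(Xc)             # empirical covariance (10)
evals, evecs = np.linalg.eigh(C_hat)
order = np.argsort(evals)[::-1]
B = evecs[:, order[:3]]                 # top R = 3 eigenvectors

W = Xc @ B
C_W = W.T @ W / len(W)                  # covariance of the weights
assert np.allclose(C_W, np.diag(evals[order[:3]]))
```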
PCA as Variance Maximization
We now return to the goal of providing a statistical interpretation of the objective function in (11). Given the derivation of $\hat{C}_W$ in (13), we see the empirical variance of $(\mathbf{w}_n)_r$ (i.e., the values in the $r^{\text{th}}$ column of $W$) is equal to \begin{align} (\hat{C}_W)_{rr} = \mathbf{b}_r^\top \hat{C} \mathbf{b}_r, \end{align} which is precisely the objective being maximized. To interpret this quantity more clearly, we consider the projection of $\mathbf{x}_n^c$ onto the span of $\mathbf{b}_r$, \begin{align} \text{proj}_{\mathbf{b}_r}(\mathbf{x}_n^c) = \langle \mathbf{x}_n^c, \mathbf{b}_r \rangle \mathbf{b}_r = (\mathbf{w}_n)_r \mathbf{b}_r, \end{align} which implies that $(\mathbf{w}_n)_r$ is the magnitude of the projection. Therefore, we conclude that $\mathbf{b}_r^\top \hat{C} \mathbf{b}_r$ is the sample variance of the magnitude of the projections onto the subspace $\text{span}(\mathbf{b}_r)$; loosely speaking, the variance of the projection along the $r^{\text{th}}$ basis vector. Combining all of these equivalent expressions yields the chain of equalities \begin{align} \text{tr}(\hat{C}_W) = \sum_{r=1}^{R} \mathbf{b}_r^\top \hat{C} \mathbf{b}_r = \sum_{r=1}^{R} \lambda_r = \frac{1}{N} \lVert X^c B \rVert_F^2. \tag{14} \end{align} The trace thus represents the total variance of the projection, summed over all $R$ of the basis vectors. The total variance is equivalently given by the sum of the first $R$ eigenvalues of $\hat{C}$ or by the scaled squared Frobenius norm $\frac{1}{N}\lVert X^c B \rVert_F^2 = \frac{1}{N}\lVert W \rVert_F^2$.
The final term in (14) provides an alternative interpretation of the objective function in (5); namely, that PCA seeks the basis $B$ which results in the maximal projected total variance. The resulting sequence of constrained problems, as in (11), are interpreted similarly. In particular, the $(r+1)^{\text{st}}$ problem can be viewed as seeking the direction along which the variance of the projections is maximized, subject to the constraint that the direction be orthogonal to the previous $r$ directions. The optimal solution is the direction corresponding to the $(r+1)^{\text{st}}$ eigenvector of the empirical covariance $\hat{C}$, and the resulting maximal variance in this direction is given by the associated eigenvalue $\lambda_{r+1}$.
The Singular Value Decomposition
A Matrix Approximation Problem: The Eckart-Young Theorem
Ignoring the intercept (or assuming that the $\mathbf{x}_n$ have already been centered), we can re-write the reconstruction error as \begin{align} \sum_{n=1}^{N} \lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2 = \lVert X - WB^\top \rVert_F^2, \end{align} where $\lVert \cdot \rVert_F$ denotes the Frobenius norm. The PCA optimization problem can then be written as the matrix approximation problem \begin{align} \text{argmin}_{B, W} \lVert X - WB^\top \rVert_F^2, \end{align} where $B$ is constrained to be an orthogonal matrix. We can equivalently write this as \begin{align} &\text{argmin}_{\hat{X} \in \mathcal{M}} \lVert X - \hat{X} \rVert_F^2, &&\mathcal{M} = \{\hat{X} \in \mathbb{R}^{N \times D} : \hat{X}=WB^\top, \ W \in \mathbb{R}^{N \times R}, \ B \in \mathbb{R}^{D \times R} \text{ is orthogonal}\}, \tag{15} \end{align} which makes it even more clear that PCA can generically be viewed as the problem of approximating the data matrix $X$ with another matrix $\hat{X}$ that is constrained to lie in a subset of all $N \times D$ matrices. We can phrase this even more succinctly by noticing that $\mathcal{M}$ is precisely the set of all $N \times D$ matrices with rank at most $R$. Indeed, if $\hat{X} = WB^\top \in \mathcal{M}$ then $\text{rank}(\hat{X}) \leq R$, since $B^\top$ has $R$ rows and thus the rank of this matrix product cannot exceed $R$. Conversely, if we let $\hat{X}$ be an arbitrary matrix of rank $\tilde{R} \leq R$, then $\hat{X}$ can be expanded using the compact SVD as $\hat{X} = USV^\top$, where $U \in \mathbb{R}^{N \times \tilde{R}}$, $V \in \mathbb{R}^{D \times \tilde{R}}$ are orthogonal and $S \in \mathbb{R}^{\tilde{R} \times \tilde{R}}$ is diagonal. By setting $W := US$ and $B := V$ we have almost written things in the form required by (15), but (15) restricts $B$ to be orthogonal with exactly $R$ columns. Thus, we can simply extend $V$ to have $R$ orthonormal columns by appending $R - \tilde{R}$ columns (this is justified by Gram-Schmidt) and then set $W := [US \;\; 0] \in \mathbb{R}^{N \times R}$ by appending columns of zeros. Thus, $\hat{X} = WB^\top$ with $W$ and $B$ now of the required form.
We have thus verified that (15) can equivalently be written as the low-rank matrix approximation problem \begin{align} &\text{argmin}_{\hat{X} \in \mathcal{M}} \lVert X - \hat{X} \rVert_F^2, &&\mathcal{M} = \{\hat{X} \in \mathbb{R}^{N \times D} : \text{rank}(\hat{X}) \leq R \}. \tag{16} \end{align}
This is precisely the problem considered by the celebrated Eckart-Young theorem. The theorem concludes that an optimal solution to (16) is given by the truncated SVD $\hat{X} = U_R S_R V_R^\top$, obtained by retaining only the $R$ largest singular values of $X = USV^\top$. This is precisely the PCA solution: the right singular vectors of $X$ are the eigenvectors of $X^\top X$, so $U_R S_R V_R^\top = X V_R V_R^\top = WB^\top$ with $B = V_R$ and $W = XV_R$.
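The following sketch (again with arbitrary synthetic data) checks this equivalence numerically: the rank-$R$ truncated SVD of the centered data matrix yields the same reconstruction as the eigenvector-based PCA solution.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
R = 2

# PCA route: project onto the top-R eigenvectors of Xc^T Xc
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
B = evecs[:, np.argsort(evals)[::-1][:R]]
Xhat_pca = Xc @ B @ B.T

# SVD route: keep the R largest singular values/vectors of Xc
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Xhat_svd = (U[:, :R] * s[:R]) @ Vt[:R]

assert np.allclose(Xhat_pca, Xhat_svd)
```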
Regression with an optimal basis
Statistical/Probabilistic Perspectives
There are various ways we might attach a statistical or probabilistic perspective to PCA; we briefly discuss a few of them here. Throughout this section we will take the reconstruction error objective (3) as the jumping-off point for a probabilistic generalization; that is, we are implicitly assuming the data is centered so that we can ignore the intercept. Note that, as always, $B$ is constrained to be orthogonal.
A Special Case of the Karhunen-Loeve Expansion
We start by noting that the objective function in (3) looks like a sample average of the quantities $\lVert \mathbf{x}_n - B\mathbf{w}_n \rVert_2^2$. It is therefore natural to consider some underlying true expectation that this sample average is approximating. To this end, let us view $\mathbf{x}$ and $\mathbf{w}$ as random vectors, defined over some probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We can then consider \begin{align} \mathbb{E}\lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 = \mathbb{E}\bigg\lVert \mathbf{x} - \sum_{r=1}^{R} w_r \mathbf{b}_r \bigg\rVert_2^2, \tag{17} \end{align} which can be viewed as the population analog of the sample approximation in (3). We note that the second expression above shows that the low-rank approximation $B\mathbf{w}$ of the random vector $\mathbf{x}$ takes the form of a linear combination of (non-random) basis vectors, with random coefficients.
With $B$ fixed, the optimization over $\mathbf{w}$ is just as easy as before. Indeed, for any fixed $\omega \in \Omega$, \begin{align} \text{argmin}_{\mathbf{w}(\omega) \in \mathbb{R}^R} \lVert \mathbf{x}(\omega) - B\mathbf{w}(\omega) \rVert_2^2 = B^\top \mathbf{x}(\omega), \end{align} using the same exact orthogonal projection reasoning as before. We have thus optimized for $\mathbf{w}$ on an $\omega$-by-$\omega$ basis. The same result thus holds in expectation: \begin{align} \text{argmin}_{\mathbf{w}} \mathbb{E}\lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 = B^\top \mathbf{x}. \end{align} It is important to note that we have found the optimal random vector $\mathbf{w} = B^\top \mathbf{x}$, which was shown to be a linear transformation of the random vector $\mathbf{x}$.
Optimizing for $B$ with the optimal $\mathbf{w} = B^\top \mathbf{x}$ fixed also follows similarly from previous derivations. Using some of the results already derived above (see the section on optimizing the basis), we have
\begin{align} \text{argmin}_{B} \mathbb{E}\lVert \mathbf{x} - B\mathbf{w} \rVert_2^2 &= \text{argmin}_{B} \mathbb{E} \sum_{r=R+1}^{D} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \newline &= \text{argmin}_{B} \mathbb{E} \left[\lVert \mathbf{x} \rVert_2^2 - \sum_{r=1}^{R} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \right] \newline &= \text{argmax}_{B} \mathbb{E} \sum_{r=1}^{R} \langle \mathbf{x}, \mathbf{b}_r \rangle^2 \newline &= \text{argmax}_{B} \mathbb{E} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbf{x}\mathbf{x}^\top \mathbf{b}_r \newline &= \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top \mathbb{E}\left[\mathbf{x}\mathbf{x}^\top\right] \mathbf{b}_r \newline &= \text{argmax}_{B} \sum_{r=1}^{R} \mathbf{b}_r^\top C \mathbf{b}_r, \end{align}
where $C := \mathbb{E}[\mathbf{x}\mathbf{x}^\top] = \text{Cov}[\mathbf{x}]$. Here we have used the centering assumption $\mathbb{E}[\mathbf{x}] = \mathbf{0}$, as well as the implicit assumption that the random vector $\mathbf{x}$ has a well-defined covariance matrix. At this point, the derivations go through exactly as before, with $C$ replacing the empirical covariance $\hat{C}$; that is, the optimal basis is obtained by extracting the first $R$ columns $E_R := [\mathbf{e}_1, \dots, \mathbf{e}_R]$ of $E$, where $C = E\Lambda E^\top$. Unsurprisingly, the results that held for the sample covariance hold analogously in this setting. For example, since $\mathbf{x}$ has zero mean then \begin{align} \mathbb{E}[\mathbf{w}] = \mathbb{E}[E_R^\top \mathbf{x}] = E_R^\top \mathbb{E}[\mathbf{x}] = \mathbf{0}. \end{align} Moreover, \begin{align} \text{Cov}[\mathbf{w}] = \mathbb{E}[E_R^\top \mathbf{x}\mathbf{x}^\top E_R] = E_R^\top C E_R = E_R^\top E \Lambda E^\top E_R = \Lambda_R, \end{align} which says that the weights are uncorrelated, with variances equal to their respective eigenvalues. It is common, therefore, to normalize these random variables to have unit variance, \begin{align} \tilde{w}_r := \frac{w_r}{\sqrt{\lambda_r}} = \frac{\langle \mathbf{x}, \mathbf{e}_r \rangle}{\sqrt{\lambda_r}}, \end{align} which means the optimal rank-$R$ approximation to the random vector $\mathbf{x}$ may be expressed as \begin{align} \hat{\mathbf{x}}(\omega) = \sum_{r=1}^{R} \sqrt{\lambda_r} \tilde{w}_r(\omega) \mathbf{e}_r, \tag{18} \end{align} where we have included the argument $\omega$ to emphasize which quantities are random here. The following result summarizes our findings, extending to the case where $\mathbf{x}$ has non-zero mean.
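These population-level claims can be illustrated with a quick Monte Carlo sketch; the distribution, dimensions, and sample size below are arbitrary choices of mine. We draw many samples of a zero-mean random vector, form the weights $\mathbf{w} = E_R^\top \mathbf{x}$, and check that their sample covariance is approximately $\Lambda_R$.

```python
import numpy as np

rng = np.random.default_rng(6)
D, R, M = 4, 2, 200_000

A = rng.normal(size=(D, D))
C = A @ A.T                                           # a valid covariance matrix
x = rng.multivariate_normal(np.zeros(D), C, size=M)   # zero-mean samples (Gaussian for convenience)

lam, E = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
E_R, lam_R = E[:, order[:R]], lam[order[:R]]

w = x @ E_R                        # rows are w_n^T = x_n^T E_R
print(np.cov(w, rowvar=False))     # approximately diag(lambda_1, ..., lambda_R)
print(np.diag(lam_R))
```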
Proposition. Let $\mathbf{x}$ be a square-integrable $D$-dimensional random vector defined on $(\Omega, \mathcal{A}, \mathbb{P})$, with mean $\mathbf{m} := \mathbb{E}[\mathbf{x}]$ and covariance $C = E\Lambda E^\top$. Among all such random vectors constrained to take values in an $R$-dimensional (affine) subspace of $\mathbb{R}^D$, the random vector \begin{align} &\hat{\mathbf{x}}(\omega) = \mathbf{m} + \sum_{r=1}^{R} \sqrt{\lambda_r} \tilde{w}_r(\omega) \mathbf{e}_r, &&\tilde{w}_r := \frac{\langle \mathbf{x} - \mathbf{m}, \mathbf{e}_r \rangle}{\sqrt{\lambda_r}}, \end{align} provides an optimal approximation to $\mathbf{x}$, in the expected Euclidean distance sense (17). Moreover, the random weights $\tilde{w}_1, \dots, \tilde{w}_R$ are uncorrelated, have zero mean and unit variance.
Note that random vectors can conceptually be thought of as stochastic processes with finite index sets. A similar decomposition to (18) can be constructed for more general stochastic processes with potentially uncountable index sets, with the modification that the sum in (18) will require a countably infinite number of terms. This result is generally known as the Karhunen-Loeve expansion. Thus, from this perspective PCA can be thought of as the special finite-dimensional case of the Karhunen-Loeve expansion.
Gaussian Random Vectors
The above derivations did not make any assumptions regarding the distribution of $\mathbf{x}$, just that its covariance exists. Consequently, the main result (18) does not give the distribution of the weights $\tilde{w}_r$, only that they are uncorrelated, have zero mean, and unit variance. If we add the distributional assumption $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, C)$ then we are able to say more. Indeed, recalling that the (unscaled) weights are given by the projections $w_r = \langle \mathbf{x}, \mathbf{e}_r \rangle$, then we find that \begin{align} \mathbf{w} = E_R^\top \mathbf{x} \sim \mathcal{N}(\mathbf{0}, E_R^\top C E_R) = \mathcal{N}(\mathbf{0}, \Lambda_R). \end{align} The key points here are that the $w_r$ are jointly Gaussian, and hence their uncorrelatedness implies that they are in fact independent. The weights inherit Gaussianity from $\mathbf{x}$, being linear functions of a Gaussian vector. Thus, under this stronger assumption we are able to characterize the distribution of the weights exactly.
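Continuing the same kind of Monte Carlo sketch under the Gaussian assumption (all names and sizes are again arbitrary), the normalized weights $\tilde{w}_r = w_r / \sqrt{\lambda_r}$ should behave like uncorrelated standard normals, i.e., their sample covariance should be approximately the identity; for jointly Gaussian variables this uncorrelatedness is equivalent to independence.

```python
import numpy as np

rng = np.random.default_rng(7)
D, R, M = 4, 2, 200_000

A = rng.normal(size=(D, D))
C = A @ A.T
x = rng.multivariate_normal(np.zeros(D), C, size=M)   # x ~ N(0, C)

lam, E = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
E_R, lam_R = E[:, order[:R]], lam[order[:R]]

w_tilde = (x @ E_R) / np.sqrt(lam_R)   # normalized weights, one row per sample
print(np.cov(w_tilde, rowvar=False))   # approximately the R x R identity matrix
```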
Maximum Likelihood Estimation with Gaussian Noise
Probabilistic PCA
https://stats.stackexchange.com/questions/190308/why-does-probabilistic-pca-use-gaussian-prior-over-latent-variables
Part 3: Using PCA
- Decorrelating
- Dimensionality reduction
Part 4: Computing PCA
- Eigendecomposition
- SVD
Part 5: Application and Code
Appendix
Eigenvalue Problems
In this section, I briefly discuss the spectral norm and eigenvalue problems in finite-dimensional vector spaces, which I utilize above when optimizing the basis in the PCA derivation. Consider a matrix $A \in \mathbb{R}^{N \times D}$, which represents a linear transformation from $\mathbb{R}^D$ to $\mathbb{R}^N$. We define the spectral norm of $A$ as the largest factor by which the map can “stretch” a vector $\mathbf{u} \in \mathbb{R}^D$, \begin{align} \lVert A \rVert_2 := \max_{\mathbf{u} \neq \mathbf{0}} \frac{\lVert A\mathbf{u} \rVert_2}{\lVert \mathbf{u} \rVert_2}. \end{align} Using the linearity of $A$, one can show that we need only consider vectors of unit length; that is, \begin{align} \lVert A \rVert_2 = \max_{\lVert \mathbf{u} \rVert_2 = 1} \lVert A\mathbf{u} \rVert_2. \tag{A1} \end{align} Optimization problems of this type are called eigenvalue problems for reasons that will shortly become clear. For the purposes of the PCA derivation, we will require consideration of a slightly more general eigenvalue problem. To define this problem, first note that the matrix $A^\top A \in \mathbb{R}^{D \times D}$ is symmetric, positive semi-definite since \begin{align} &(A^\top A)^\top = A^\top (A^\top)^\top = A^\top A, &\mathbf{u}^\top (A^\top A)\mathbf{u} = \lVert A \mathbf{u} \rVert_2^2 \geq 0. \end{align} Thus, by the spectral theorem $A^\top A$ has $D$ orthogonal eigenvectors, which we will denote by $\mathbf{e}_1, \dots, \mathbf{e}_D$ and assume that they have been normalized to have unit norm. By positive semi-definiteness the respective eigenvalues $\lambda_1 \geq \dots \geq \lambda_D$ (sorted in decreasing order) are all non-negative. For $d \in \{1, \dots, D-1\}$, let $\mathcal{E}_d := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_d)$, with $\mathcal{E}^{\perp}_d$ its orthogonal complement. We now consider the eigenvalue problem
\begin{align} \max_{\mathbf{u} \in \mathcal{E}^{\perp}_d, \lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2. \tag{A2} \end{align}
We have generalized (A1) by adding an orthogonality constraint. Taking $\mathcal{E}_0 := \{\mathbf{0}\}$ as a convention, we then have $\mathcal{E}^{\perp}_0 = \mathbb{R}^D$, which means that setting $d = 0$ in (A2) recovers the original problem (A1) as a special case. We will prove the following result.
Proposition. Let $A \in \mathbb{R}^{N \times D}$ be a matrix. Let $\mathbf{e}_1, \dots, \mathbf{e}_D$ denote the orthonormal eigenbasis of $A^\top A$, with respective eigenvalues $\lambda_1 \geq \dots \geq \lambda_D \geq 0$ sorted in descending order by magnitude. For $d \in \{0, 1, \dots, D-1\}$ define $\mathcal{E}_d := \text{span}(\mathbf{e}_1, \dots, \mathbf{e}_d)$ (with $\mathcal{E}_0 := \{\mathbf{0}\}$), with $\mathcal{E}^{\perp}_d$ its orthogonal complement. Then \begin{align} \mathbf{e}_{d+1} \in \text{argmax}_{\mathbf{u} \in \mathcal{E}^{\perp}_{d}, \lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2, \tag{A3} \end{align} with the maximal value equal to $\sqrt{\lambda_{d+1}}$. In particular, we have \begin{align} \mathbf{e}_{1} \in \text{argmax}_{\lVert \mathbf{u} \rVert_2=1} \lVert A\mathbf{u} \rVert_2, \tag{A4} \end{align} with maximal value $\lVert A \rVert_2 = \sqrt{\lambda_1}$.
Proof. Let $\mathbf{u} \in \mathcal{E}^{\perp}_d$ be an arbitrary vector of unit length. This vector may be represented with respect to the eigenbasis as \begin{align} &\mathbf{u} = \sum_{r=d+1}^{D} u_r \mathbf{e}_r, &&u_r := \langle \mathbf{u}, \mathbf{e}_r \rangle, \end{align} where the first $d$ coefficients vanish since $\mathbf{u}$ is orthogonal to $\mathcal{E}_d$. We will use this representation to show that
- $\lVert A\mathbf{u} \rVert_2$ is upper bounded by $\sqrt{\lambda_{d+1}}$.
- The upper bound is achieved by some unit-norm $\mathbf{u} \in \mathcal{E}^{\perp}_d$,
which together imply the claimed result. We will actually work with the squared norm $\lVert A\mathbf{u} \rVert_2^2$ instead, which allows us to leverage the inner product. We have \begin{align} \lVert A\mathbf{u} \rVert^2_2 = \langle A\mathbf{u}, A\mathbf{u} \rangle = \langle A^\top A\mathbf{u}, \mathbf{u} \rangle &= \left\langle A^\top A \sum_{r=d+1}^{D} u_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle \newline &= \left\langle \sum_{r=d+1}^{D} u_r (A^\top A \mathbf{e}_r), \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle \newline &= \left\langle \sum_{r=d+1}^{D} u_r \lambda_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle, \end{align} having used the fact that the $\mathbf{e}_r$ are eigenvectors of $A^\top A$. Now we can take advantage of the fact that the $\mathbf{e}_r$ are orthonormal to obtain \begin{align} \left\langle \sum_{r=d+1}^{D} u_r \lambda_r \mathbf{e}_r, \sum_{r=d+1}^{D} u_r \mathbf{e}_r \right\rangle = \sum_{r=d+1}^{D} u_r^2 \lambda_r \lVert \mathbf{e}_r \rVert^2_2 = \sum_{r=d+1}^{D} u_r^2 \lambda_r \leq \sum_{r=d+1}^{D} u_r^2 \lambda_{d+1} = \lambda_{d+1} \lVert \mathbf{u} \rVert_2^2 = \lambda_{d+1} \end{align} where the inequality follows from the fact that the eigenvalues are sorted in descending order. This verifies the upper bound $\lVert A\mathbf{u} \rVert_2^2 \leq \lambda_{d+1}$. To show that the bound is achieved, we consider setting $\mathbf{u} := \mathbf{e}_{d+1} \in \mathcal{E}^{\perp}_d$. Then,
\begin{align} \lVert A\mathbf{e}_{d+1} \rVert^2_2 = \langle A\mathbf{e}_{d+1}, A\mathbf{e}_{d+1} \rangle = \langle A^\top A\mathbf{e}_{d+1}, \mathbf{e}_{d+1} \rangle = \langle \lambda_{d+1} \mathbf{e}_{d+1}, \mathbf{e}_{d+1} \rangle = \lambda_{d+1} \lVert \mathbf{e}_{d+1} \rVert^2_2 = \lambda_{d+1}, \end{align}
so we have indeed verified that the upper bound $\lambda_{d+1}$ is achieved by the unit-norm vector $\mathbf{e}_{d+1} \in \mathcal{E}^{\perp}_d$. The claim is thus proved.
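A quick numerical illustration of the proposition (a sketch with arbitrary random data): the constrained maximum is attained at $\mathbf{e}_{d+1}$ with value $\sqrt{\lambda_{d+1}}$, and random unit vectors drawn from $\mathcal{E}^{\perp}_d$ never exceed it.

```python
import numpy as np

rng = np.random.default_rng(8)
N, D, d = 30, 6, 2
A = rng.normal(size=(N, D))

lam, E = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]

# value attained by e_{d+1}
best = np.linalg.norm(A @ E[:, d])
assert np.isclose(best, np.sqrt(lam[d]))

# random unit vectors in the orthogonal complement of span(e_1, ..., e_d) never do better
P = np.eye(D) - E[:, :d] @ E[:, :d].T        # projector onto the complement
for _ in range(1000):
    u = P @ rng.normal(size=D)
    u /= np.linalg.norm(u)
    assert np.linalg.norm(A @ u) <= best + 1e-10
```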