Cholesky Decomposition of a Covariance Matrix
Interpreting the Cholesky factor of a Gaussian covariance matrix.
In this post we consider the Cholesky decomposition of the covariance matrix of a Gaussian distribution. The eigendecomposition of covariance matrices gives rise to the well-known method of principal components analysis. The Cholesky decomposition is not as widely discussed in this context, but also has a variety of useful statistical applications.
Setup and Background
The Cholesky Decomposition
The Cholesky decomposition of a positive definite matrix $C \in \mathbb{R}^{p \times p}$ is the unique factorization of the form
\begin{align} C = LL^\top, \tag{1} \end{align}
where $L$ is a lower-triangular matrix with positive diagonal elements (note that constraining the diagonal to be positive is required for uniqueness). A positive definite matrix can also be uniquely decomposed as
\begin{align} C = \tilde{L} D \tilde{L}^\top, \tag{2} \end{align}
where $\tilde{L}$ is lower-triangular with ones on the diagonal, and $D$ is a diagonal matrix with positive entries on the diagonal. We will refer to this as the modified Cholesky decomposition, but it is also often called the LDL decomposition. Given the modified Cholesky decomposition (2), we can form (1) by setting
\begin{align} L = \tilde{L} D^{1/2}. \tag{3} \end{align}
We refer to both $L$ and $\tilde{L}$ as the lower Cholesky factor of $C$; which we are referring to will be clear from context. The lower Cholesky factor is guaranteed to be invertible, and its inverse is itself a lower-triangular matrix. Finally, note that we could also consider decompositions of the form
\begin{align} C = UU^\top, \tag{4} \end{align}
where $U$ is upper triangular. This “reversed” Cholesky decomposition is not as common, but will show up at one point in this post.
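To make the relationship between the two factorizations concrete, here is a minimal numerical sketch (assuming numpy is available; the covariance matrix $C$ below is just an arbitrary example).

```python
# A minimal sketch (assuming numpy) of the standard and modified Cholesky
# decompositions of an arbitrary example covariance matrix C, and of the
# relation L = L~ D^{1/2} in (3).
import numpy as np

C = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])      # positive definite covariance (example)

L = np.linalg.cholesky(C)            # standard Cholesky: C = L L^T
d = np.diag(L) ** 2                  # diagonal of D in C = L~ D L~^T
L_tilde = L / np.diag(L)             # unit lower-triangular factor L~ = L D^{-1/2}

assert np.allclose(L @ L.T, C)
assert np.allclose(L_tilde @ np.diag(d) @ L_tilde.T, C)   # modified Cholesky
assert np.allclose(L_tilde @ np.diag(np.sqrt(d)), L)      # L = L~ D^{1/2}
```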
Statistical Setup
Throughout this post we consider a random vector $x := (x^{(1)}, \dots, x^{(p)})^\top$ with positive definite covariance $C := \text{Cov}[x]$. We will often assume that $x$ is Gaussian, but this assumption is not required for some of the results discussed below. We focus on the Cholesky decomposition $C = LL^\top$ and its modified form $C = LDL^\top$, letting $\ell_{jk}$ denote the entries of the lower Cholesky factor $L$ (which factorization is meant will be clear from context). For the modified decomposition, we write $D := \text{diag}(d_1, \dots, d_p)$, where each $d_j > 0$.
Let $C = LDL^\top$ be the modified Cholesky decomposition and define the random variable
\begin{align} \epsilon := L^{-1}x, \tag{5} \end{align}
which satisfies
\begin{align} \text{Cov}[\epsilon] = L^{-1} C L^{-\top} = L^{-1}(LDL^\top)L^{-\top} = D. \tag{6} \end{align}
Thus, the map $x \mapsto L^{-1}x$ outputs a “decorrelated” random vector. The inverse map $\epsilon \mapsto L\epsilon$ “re-correlates” $\epsilon$, producing a random vector with covariance $C$. If we add on the assumption that $x$ is Gaussian, then $\epsilon$ is a Gaussian vector with independent entries. The transformation $\epsilon \mapsto L\epsilon$ is the typical method used in simulating draws from a correlated Gaussian vector. Note that if we instead considered the standard Cholesky factorization $C = LL^\top$, with $\epsilon$ still defined as in (5), then $\text{Cov}[\epsilon] = I$.
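The sketch below (assuming numpy; $C$ is again an arbitrary example) illustrates both maps on simulated data: re-correlating iid standard normals to produce draws with covariance $C$, and decorrelating those draws by solving the triangular system.

```python
# Sketch (assuming numpy): "re-correlating" iid standard normals into draws from
# N(0, C) via x = L eps, and "decorrelating" them via eps = L^{-1} x. The matrix
# C is an arbitrary example.
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])
L = np.linalg.cholesky(C)

eps = rng.standard_normal((3, 100_000))   # eps ~ N(0, I), one draw per column
x = L @ eps                               # x ~ N(0, C)
print(np.cov(x))                          # approximately C

eps_back = np.linalg.solve(L, x)          # eps = L^{-1} x, via the triangular system
print(np.cov(eps_back))                   # approximately the identity
```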
Conditional Variances and Covariances
We start by demonstrating how the (modified) Cholesky decomposition encodes information related to conditional variances and covariances between the $x^{(j)}$. The below result considers conditional variances, and provides an interpretation of the diagonal entries of $D$ in the Gaussian setting.
Proposition (conditional variances).
Let $x \sim \mathcal{N}(m, C)$, with $C \in \mathbb{R}^{p \times p}$ positive definite. Set $\epsilon := L^{-1}x$, where $C = LDL^\top$. Then
\begin{align} \text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = d_j, \qquad j = 1, \dots, p, \tag{7} \end{align}
where the $j=1$ case is interpreted as the unconditional variance $\text{Var}[x^{(1)}] = d_1$. If we instead define $L$ by $C = LL^\top$, then
\begin{align} \text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \ell_{jj}^2. \tag{8} \end{align}
Proof. From the definition $x = L\epsilon$ and the fact that $L$ is lower triangular, we have
\begin{align} x^{(j)} = \sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)}. \tag{9} \end{align}
Thus,
\begin{align} \text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \text{Var}\left[\sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|x^{(1)}, \dots, x^{(j-1)}\right] &= \text{Var}\left[\sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \text{Var}\left[\ell_{jj} \epsilon^{(j)}|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \text{Var}\left[\ell_{jj} \epsilon^{(j)}\right] \newline &= \ell^2_{jj} \text{Var}[\epsilon^{(j)}]. \tag{10} \end{align}
The second equality in (10) follows from the fact that $(\epsilon^{(1)}, \dots, \epsilon^{(j-1)})$ is an invertible transformation of $(x^{(1)}, \dots, x^{(j-1)})$; the third holds because the terms with $k < j$ are constants given the conditioning variables; and the fourth uses the fact that the $\epsilon^{(k)}$ are independent (owing to the Gaussian assumption). Since $\text{Var}[\epsilon^{(j)}] = d_j$ and $\ell_{jj} = 1$ in the modified Cholesky case, (10) simplifies to $d_j$. For the standard Cholesky factorization, $\text{Var}[\epsilon^{(j)}] = 1$ and (10) becomes $\ell_{jj}^2$.
Thus, the diagonal entries of $D$ (or the squared diagonal entries of $L$, in the standard case) give the variances of the $x^{(j)}$, conditional on all preceding entries in the vector. Clearly, the interpretation depends on the ordering of the entries of $x$, a fact that will be true for many results that rely on the Cholesky decomposition.
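As a sanity check, the conditional variances in (7) and (8) can be computed directly from the standard Gaussian conditioning (Schur complement) formula and compared against the Cholesky factor. A short sketch, assuming numpy and using an arbitrary example covariance:

```python
# Sanity check (assuming numpy) of (7)-(8): the Gaussian conditional variance
# Var[x^(j) | x^(1), ..., x^(j-1)], computed via the Schur complement, equals
# l_jj^2 (standard Cholesky), i.e. d_j (modified). C is an arbitrary example.
import numpy as np

C = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])
L = np.linalg.cholesky(C)

for j in range(C.shape[0]):               # 0-based index; j = 0 is unconditional
    if j == 0:
        cond_var = C[0, 0]
    else:
        A = C[:j, :j]                     # Cov of the conditioning variables
        b = C[:j, j]                      # their covariance with x^(j)
        cond_var = C[j, j] - b @ np.linalg.solve(A, b)   # Schur complement
    print(j, cond_var, L[j, j] ** 2)      # the two values agree
```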
We can generalize the above result to consider conditional covariances instead of variances, which yields an interpretation of the off-diagonal elements of $L$.
Proposition (conditional covariances).
Let $x \sim \mathcal{N}(m, C)$, with $C \in \mathbb{R}^{p \times p}$ positive definite. Set $\epsilon := L^{-1}x$, where $C = LDL^\top$. Then for $i > j$,
\begin{align} \text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \ell_{ij} d_j, \tag{11} \end{align}
where the $j=1$ case is interpreted as the unconditional covariance $\text{Cov}[x^{(i)}, x^{(1)}] = \ell_{i1} d_1$. If we instead define $L$ by $C = LL^\top$, then
\begin{align} \text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \ell_{ij} \ell_{jj}. \tag{12} \end{align}
In particular, in either case it holds that
\begin{align} \ell_{ij} = 0 \iff \text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = 0 \iff x^{(i)} \perp x^{(j)} \mid x^{(1)}, \dots, x^{(j-1)}. \tag{13} \end{align}
Proof. The proof proceeds similarly to the conditional variance case. For $i > j$ we have \begin{align} \text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] &= \text{Cov}\left[\sum_{k=1}^{i} \ell_{ik} \epsilon^{(k)}, \sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|x^{(1)}, \dots, x^{(j-1)}\right] \newline &= \text{Cov}\left[\sum_{k=1}^{i} \ell_{ik} \epsilon^{(k)}, \sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \text{Cov}\left[\sum_{k=j}^{i} \ell_{ik} \epsilon^{(k)}, \ell_{jj} \epsilon^{(j)} \bigg|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \sum_{k=j}^{i} \ell_{ik}\ell_{jj} \text{Cov}\left[\epsilon^{(k)}, \epsilon^{(j)}|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \sum_{k=j}^{i} \ell_{ik}\ell_{jj} \text{Cov}\left[\epsilon^{(k)}, \epsilon^{(j)}\right] \newline &= \ell_{ij}\ell_{jj} \text{Var}\left[\epsilon^{(j)}\right]. \end{align} In the third equality, the terms with $k < j$ are dropped since they are constants given the conditioning variables. The penultimate step uses the fact that the conditional covariance $\text{Cov}[\epsilon^{(k)}, \epsilon^{(j)}|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}]$ equals the unconditional one, owing to the fact that the $\epsilon^{(k)}$ are jointly Gaussian and independent. The final step uses the fact that the $\epsilon^{(k)}$ are uncorrelated, and hence all terms with $k \neq j$ vanish. For $C = LDL^\top$ the final expression simplifies to $\ell_{ij} d_j$. For $C = LL^\top$ it becomes $\ell_{ij}\ell_{jj}$. The first equivalence in (13) follows immediately from (11) and (12), since $d_j > 0$ and $\ell_{jj} > 0$. The second follows from the fact that $x$ is Gaussian, and hence conditional uncorrelatedness implies conditional independence.
We thus find that the Cholesky decomposition of a Gaussian covariance matrix is closely linked to the ordered conditional dependence structure of $x$. The factorization encodes conditional covariances, where the conditioning is with respect to all preceding variables; reordering the entries of $x$ may yield drastically different insights. The connection between sparsity in the Cholesky factor and conditional independence can be leveraged in the design of statistical models and algorithms. For an example, see the paper (Jurek & Katzfuss, 2021).
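A quick numerical check of (12), assuming numpy and an arbitrary example covariance: the conditional covariances computed via the Gaussian conditioning (Schur complement) formula match $\ell_{ij}\ell_{jj}$ from the standard Cholesky factor.

```python
# Numerical check (assuming numpy) of (12): for i > j, the Gaussian conditional
# covariance Cov[x^(i), x^(j) | x^(1), ..., x^(j-1)] equals l_ij * l_jj from the
# standard Cholesky factor. C is an arbitrary example covariance.
import numpy as np

C = np.array([[4.0, 2.0, 1.0, 0.3],
              [2.0, 3.0, 0.5, 0.2],
              [1.0, 0.5, 2.0, 0.4],
              [0.3, 0.2, 0.4, 1.5]])
L = np.linalg.cholesky(C)
p = C.shape[0]

for j in range(1, p):                     # 0-based; conditioning set = first j variables
    A = C[:j, :j]
    # Conditional covariance of the remaining variables given the first j:
    S = C[j:, j:] - C[j:, :j] @ np.linalg.solve(A, C[:j, j:])
    for i in range(j + 1, p):
        print(i, j, S[i - j, 0], L[i, j] * L[j, j])   # the two values agree
```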
A Regression Interpretation
In this section, we summarize a least squares regression interpretation of the modified Cholesky decomposition $C = LDL^\top$. The result is similar in spirit to (7), as we will consider a sequence of regressions that condition on previous entries of $x$. The ideas discussed here come primarily from (Rothman et al., 2010).
Sequence of Least Squares Problems
We start by recursively defining a sequence of least squares problems, which we then link to the factorization $C = LDL^\top$.
Sequential Least Squares.
Let $x$ be a random vector with positive definite covariance $C \in \mathbb{R}^{p \times p}$. We recursively define the entries of $\epsilon := (\epsilon^{(1)}, \dots, \epsilon^{(p)})^\top$ as follows:
1. Set $\epsilon^{(1)} := x^{(1)}$.
2. For $j = 2, \dots, p$, define the regression coefficient $\beta^{(j)} := (\beta^{(j)}_1, \dots, \beta^{(j)}_{j-1})^\top$ by
\begin{align} \beta^{(j)} := \underset{\beta \in \mathbb{R}^{j-1}}{\text{argmin}} \ \mathbb{E}\left[\left(x^{(j)} - \sum_{k=1}^{j-1} \beta_k \epsilon^{(k)}\right)^2\right], \tag{14} \end{align}
and set
\begin{align} \epsilon^{(j)} := x^{(j)} - \sum_{k=1}^{j-1} \beta^{(j)}_k \epsilon^{(k)}. \tag{15} \end{align}
In words, $\epsilon^{(j)}$ is the residual of the least squares regression of the response $x^{(j)}$ on the explanatory variables $\epsilon^{(1)}, \dots, \epsilon^{(j-1)}$, and $\beta^{(j)}$ is the coefficient vector. Take note that we are regressing on the residuals from the previous regressions, rather than on the $x^{(k)}$ themselves. We assume for simplicity that $x$ is mean zero to avoid having to deal with an intercept term; for variables that are not mean zero, we can start by subtracting off their mean and then apply the same procedure. Note also that the zero mean assumption implies that $\mathbb{E}[\epsilon^{(j)}] = 0$ for all $j$; this follows from $\mathbb{E}[x^{(j)}] = 0$ along with the recursion (15).
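Since each $\epsilon^{(k)}$ is a linear combination of $x^{(1)}, \dots, x^{(k)}$, the recursion can be carried out entirely at the population level, with every required second moment computed from $C$. The sketch below does this for an arbitrary example covariance (numpy assumed), tracking a matrix $A$ with $\epsilon = Ax$ along with the regression coefficients.

```python
# Population-level sketch (assuming numpy) of the recursion (14)-(15). Each
# residual eps^(j) is a linear combination of x^(1), ..., x^(j), so we track a
# matrix A with eps = A x and compute every needed moment from C (an arbitrary
# example covariance).
import numpy as np

C = np.array([[4.0, 2.0, 1.0, 0.3],
              [2.0, 3.0, 0.5, 0.2],
              [1.0, 0.5, 2.0, 0.4],
              [0.3, 0.2, 0.4, 1.5]])
p = C.shape[0]

A = np.zeros((p, p))
A[0, 0] = 1.0                         # eps^(1) := x^(1)
B = np.eye(p)                         # rows will become (beta^(j), 1, 0, ...) as in (18)
for j in range(1, p):
    Aj = A[:j]                        # eps^(1), ..., eps^(j-1), expressed in terms of x
    G = Aj @ C @ Aj.T                 # Cov of the regressors
    c = Aj @ C[:, j]                  # Cov[eps^(1:j-1), x^(j)]
    beta = np.linalg.solve(G, c)      # normal equations for the least squares problem (14)
    A[j] = np.eye(p)[j] - beta @ Aj   # eps^(j) = x^(j) - sum_k beta_k eps^(k), i.e. (15)
    B[j, :j] = beta

D = A @ C @ A.T                       # Cov[eps]; diagonal, per the proposition below
print(np.round(D, 10))
print(np.allclose(B @ D @ B.T, C))    # True: recovers C = L D L^T
```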
Our goal is now to connect this algorithm to the modified Cholesky decomposition of $C$. In particular, we will show that the $\epsilon$ defined by the regression residuals is precisely the $\epsilon$ defined in (5), which arises from the modified Cholesky decomposition. To start, note that if we rearrange (15) as
\begin{align} x^{(j)} = \epsilon^{(j)} + \sum_{k=1}^{j-1} \beta^{(j)}_k \epsilon^{(k)}, \tag{16} \end{align}
then we see the vectors $x$ and $\epsilon$ are related as
\begin{align} x = L\epsilon, \tag{17} \end{align}
where
\begin{align} L &:= \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \newline \beta^{(2)}_1 & 1 & 0 & \cdots & 0 \newline \vdots & \vdots & \ddots & \ddots & \vdots \newline \beta^{(p)}_1 & \beta^{(p)}_2 & \cdots & \cdots & 1\end{bmatrix}. \tag{18} \end{align}
That is, we have defined $L$ to be the lower triangular matrix whose $j$th row is the coefficient vector $\beta^{(j)}$, followed by a one in the $j$th position and zeros thereafter. We immediately have that $L$ is invertible, as it is a triangular matrix with non-zero entries on the diagonal. We also have
\begin{align} C = \text{Cov}[x] = \text{Cov}[L\epsilon] = L\,\text{Cov}[\epsilon]\,L^\top. \tag{19} \end{align}
In order to show that (19) actually yields the modified Cholesky factorization, we must establish that $\text{Cov}[\epsilon]$, the residual covariance matrix, is diagonal with positive diagonal entries.
Proposition.
The random vector $\epsilon$ defined by (14) and (15) satisfies
\begin{align} \text{Cov}[\epsilon] = D, \tag{20} \end{align}
where $D$ is a diagonal matrix with positive entries on the diagonal.
Proof. The result follows immediately upon viewing (14) as a projection in a suitable inner product space, and then applying the Hilbert projection theorem. In particular, note that all of the $x^{(j)}$ and $\epsilon^{(j)}$ are zero mean, square integrable random variables. We can thus consider the Hilbert space of all such random variables, with inner product defined by $\langle y, z \rangle := \mathbb{E}[yz] = \text{Cov}[y, z]$. Under this interpretation, we see that (15) can be rewritten as
\begin{align} \epsilon^{(j)} = x^{(j)} - \underset{s \in \mathcal{S}_{j-1}}{\text{argmin}} \lVert x^{(j)} - s \rVert, \tag{21} \end{align}
where $\mathcal{S}_{j-1} := \text{span}\{\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\}$ is the subspace spanned by $\epsilon^{(1)}, \dots, \epsilon^{(j-1)}$, and $\lVert \cdot \rVert$ is the norm induced by $\langle \cdot, \cdot \rangle$. Since $\epsilon^{(j)}$ is the residual associated with the projection (21), the Hilbert projection theorem gives the optimality condition $\epsilon^{(j)} \perp \mathcal{S}_{j-1}$; that is,
\begin{align} \text{Cov}[\epsilon^{(j)}, \epsilon^{(k)}] = \langle \epsilon^{(j)}, \epsilon^{(k)} \rangle = 0, \qquad k = 1, \dots, j-1. \tag{22} \end{align}
This implies that all of the residuals are pairwise uncorrelated, and hence $\text{Cov}[\epsilon]$ is diagonal. We know from (17) that $\epsilon = L^{-1}x$; since $C$ is positive definite, $\text{Cov}[\epsilon] = L^{-1} C L^{-\top}$ must also be positive definite. Thus, the diagonal entries of $\text{Cov}[\epsilon]$ must be strictly positive.
Using the recursive regression procedure in (14) and (15), we have constructed $L$ and $D := \text{Cov}[\epsilon]$ satisfying $C = LDL^\top$, where $D$ is a diagonal matrix with positive diagonal entries, and $L$ is lower triangular with ones on the diagonal. By the uniqueness of the modified Cholesky decomposition (noted in the introduction), it follows that we have precisely formed the unique matrices $L$ and $D$ defining the modified Cholesky decomposition of $C$.
Connection to the Conditional Covariance Perspective
At this point we have two different interpretations of the (modified) Cholesky decomposition of $C$: (i.) the conditional covariance perspective provided in (7) and (11); and (ii.) the regression formulation given in (14), (15), and (18). In particular, these results yield interpretations of the entries of $L$. Assuming we use the factorization $C = LDL^\top$, and letting $\epsilon := L^{-1}x$, the above results give, for $i > j$,
\begin{align} \ell_{ij} = \frac{\text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}]}{\text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}]} \tag{23} \end{align}
and
\begin{align} \ell_{ij} = \beta^{(i)}_j = \frac{\text{Cov}[x^{(i)}, \epsilon^{(j)}]}{\text{Var}[\epsilon^{(j)}]}. \tag{24} \end{align}
In (24) we are using the notation $\beta^{(i)} = (\beta^{(i)}_1, \dots, \beta^{(i)}_{i-1})^\top$, and inserting the closed-form solution of the optimization problem (14); since the regressors $\epsilon^{(1)}, \dots, \epsilon^{(i-1)}$ are uncorrelated, the normal equations decouple and each coefficient reduces to a univariate regression coefficient. As a side note, by combining (23) and (24), we see that
\begin{align} \frac{\text{Cov}[x^{(i)}, \epsilon^{(j)}]}{\text{Var}[\epsilon^{(j)}]} = \frac{\text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}]}{\text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}]}, \tag{25} \end{align}
which shows that the regression coefficient (24), which is a function of variances and covariances involving the residuals $\epsilon^{(k)}$, can alternatively be written using conditional variances and covariances of the original variables.
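As a final numerical check (assuming numpy, with an arbitrary example covariance), the sketch below verifies for one particular pair $i > j$ that the entry $\ell_{ij}$ of the unit lower-triangular factor matches both the ratio of conditional moments in (23) and the regression coefficient in (24).

```python
# Check (assuming numpy) that, for a particular pair i > j, the entry l_ij of the
# unit lower-triangular factor equals both the ratio of conditional moments in (23)
# and the regression coefficient in (24). C is an arbitrary example covariance.
import numpy as np

C = np.array([[4.0, 2.0, 1.0, 0.3],
              [2.0, 3.0, 0.5, 0.2],
              [1.0, 0.5, 2.0, 0.4],
              [0.3, 0.2, 0.4, 1.5]])
Lc = np.linalg.cholesky(C)
L_tilde = Lc / np.diag(Lc)                    # unit lower-triangular factor in C = L D L^T
Linv = np.linalg.inv(L_tilde)
Cov_eps = Linv @ C @ Linv.T                   # Cov[eps] = D (diagonal)

i, j = 3, 1                                   # an arbitrary pair with i > j (0-based)
A = C[:j, :j]
S = C[j:, j:] - C[j:, :j] @ np.linalg.solve(A, C[:j, j:])   # conditional covariance block
ratio_23 = S[i - j, 0] / S[0, 0]              # right-hand side of (23)
ratio_24 = (Linv @ C)[j, i] / Cov_eps[j, j]   # Cov[x^(i), eps^(j)] / Var[eps^(j)], as in (24)
print(L_tilde[i, j], ratio_23, ratio_24)      # all three values agree
```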
- Jurek, M., & Katzfuss, M. (2021). Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering. https://arxiv.org/abs/2006.16901
- Rothman, A. J., Levina, E., & Zhu, J. (2010). A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika, 97(3), 539–550. http://www.jstor.org/stable/25734107