Cholesky Decomposition of a Covariance Matrix

Interpreting the Cholesky factor of a Gaussian covariance matrix.

In this post we consider the Cholesky decomposition of the covariance matrix of a Gaussian distribution. The eigendecomposition of covariance matrices gives rise to the well-known method of principal components analysis. The Cholesky decomposition is not as widely discussed in this context, but also has a variety of useful statistical applications.

Setup and Background

The Cholesky Decomposition

The Cholesky decomposition of a positive definite matrix $C$ is the unique factorization of the form $$C = LL^\top, \tag{1}$$ where $L$ is a lower-triangular matrix with positive diagonal elements (note that constraining the diagonal to be positive is required for uniqueness). A positive definite matrix can also be uniquely decomposed as $$C = LDL^\top, \tag{2}$$ where $L$ is lower-triangular with ones on the diagonal, and $D$ is a diagonal matrix with positive entries on the diagonal. We will refer to this as the modified Cholesky decomposition, but it is also often called the LDL decomposition. Given the modified Cholesky decomposition $C = \tilde{L}D\tilde{L}^\top$, we can form (1) by setting $L := \tilde{L}D^{1/2}$. We refer to both $\tilde{L}$ and $L$ as the lower Cholesky factor of $C$; which one we mean will be clear from context. $L$ is guaranteed to be invertible, and $L^{-1}$ is itself a lower-triangular matrix. Finally, note that we could also consider decompositions of the form $$C = UDU^\top, \tag{3}$$ where $U$ is upper triangular. This "reversed" Cholesky decomposition is not as common, but will show up at one point in this post.
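
To make this concrete, here is a minimal numpy sketch (the covariance $C$ below is just an arbitrary positive definite example) showing that the modified factors can be read off from the standard Cholesky factor by rescaling its columns, and that the two factorizations agree via $L = \tilde{L}D^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
C = A @ A.T + 4 * np.eye(4)        # an arbitrary positive definite matrix

# Standard Cholesky factor: C = L L^T with positive diagonal entries.
L = np.linalg.cholesky(C)          # lower triangular

# Modified (LDL) factors recovered from L: L_tilde = L D^{-1/2}, D = diag(L)^2.
D = np.diag(np.diag(L) ** 2)
L_tilde = L / np.diag(L)           # divides column j by L[j, j], giving a unit diagonal

assert np.allclose(L_tilde @ D @ L_tilde.T, C)      # C = L_tilde D L_tilde^T
assert np.allclose(L_tilde @ np.sqrt(D), L)         # L = L_tilde D^{1/2}
```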

Statistical Setup

Throughout this post we consider a random vector $$x := \left(x^{(1)}, \dots, x^{(p)} \right)^\top \in \mathbb{R}^p, \tag{4}$$ with positive definite covariance $C := \text{Cov}[x]$. We will often assume that $x$ is Gaussian, but this assumption is not required for some of the results discussed below. We focus on the (modified) Cholesky decomposition $C = LDL^\top$, letting $L = \{\ell_{ij}\}$ denote the entries of the lower Cholesky factor. For the modified decomposition, we write $D := \text{diag}(d_1, \dots, d_p)$, where each $d_j > 0$.

Let $C = LDL^\top$ and define the random variable $$\epsilon := L^{-1}x, \tag{5}$$ which satisfies $$\text{Cov}[\epsilon] = L^{-1}CL^{-\top} = L^{-1}LDL^\top L^{-\top} = D. \tag{6}$$ Thus, the map $x \mapsto L^{-1}x$ outputs a "decorrelated" random vector. The inverse map $\epsilon \mapsto L\epsilon$ "re-correlates" $\epsilon$, producing a random vector with covariance $C$. If we add the assumption that $x$ is Gaussian, then $\epsilon$ is a Gaussian vector with independent entries. The transformation $L\epsilon$ is the typical method used to simulate draws from a correlated Gaussian vector. Note that if we instead considered the standard Cholesky factorization $C = LL^\top$, with $\epsilon$ still defined as in (5), then $\text{Cov}[\epsilon] = I$.
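
As a quick illustration (using an arbitrary example covariance), the sketch below draws correlated Gaussian samples via $x = L\epsilon$ and then decorrelates them with a triangular solve; the sample covariances approximately recover $C$ and the identity.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
p, n = 3, 200_000
A = rng.standard_normal((p, p))
C = A @ A.T + p * np.eye(p)                    # an arbitrary example covariance
L = np.linalg.cholesky(C)                      # C = L L^T

# "Re-correlate": eps ~ N(0, I) mapped through L gives draws with covariance C.
eps = rng.standard_normal((p, n))
x = L @ eps                                    # each column is a draw of x

# "Decorrelate": apply L^{-1} via a triangular solve (no explicit inverse needed).
eps_back = solve_triangular(L, x, lower=True)

print(np.round(np.cov(x), 2))                  # approximately C
print(np.round(np.cov(eps_back), 2))           # approximately the identity
```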

Conditional Variances and Covariances

We start by demonstrating how the (modified) Cholesky decomposition encodes information about conditional variances and covariances among the $x^{(j)}$. The result below considers conditional variances, and provides an interpretation of the diagonal entries of $D$ in the Gaussian setting.

Proposition (conditional variances).
Let $x \sim \mathcal{N}(m,C)$, with $C = \text{Cov}[x]$ positive definite. Set $\epsilon := L^{-1}x$, where $C = LDL^\top$. Then $$\text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = d_j, \qquad j = 1, \dots, p, \tag{7}$$ where the $j=1$ case is interpreted as the unconditional variance $\text{Var}[x^{(1)}]$. If we instead define $L$ by $C = LL^\top$, then $$\text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \ell^2_{jj}, \qquad j = 1, \dots, p. \tag{8}$$

Proof. From the definition $x = L\epsilon$ and the fact that $L$ is lower triangular, we have $$x^{(j)} = \sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)}. \tag{9}$$ Thus,

\begin{align} \text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \text{Var}\left[\sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|x^{(1)}, \dots, x^{(j-1)}\right] &= \text{Var}\left[\sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \text{Var}\left[\ell_{jj} \epsilon^{(j)}|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \text{Var}\left[\ell_{jj} \epsilon^{(j)}\right] \newline &= \ell^2_{jj} \text{Var}[\epsilon^{(j)}]. \tag{10} \end{align}

The second equality (the change of conditioning variables) follows from the fact that $(x^{(1)}, \dots, x^{(j-1)})$ is an invertible linear transformation of $(\epsilon^{(1)}, \dots, \epsilon^{(j-1)})$, while the fourth uses the fact that the entries of $\epsilon$ are independent (owing to the Gaussian assumption). In the modified Cholesky case, (10) simplifies to $\ell^2_{jj} \text{Var}[\epsilon^{(j)}] = 1 \times d_j = d_j$. For the standard Cholesky factorization, it becomes $\ell^2_{jj} \text{Var}[\epsilon^{(j)}] = \ell^2_{jj} \cdot 1 = \ell^2_{jj}$. $\qquad \blacksquare$

Thus, the diagonal entries of $D$ give the variances of the $x^{(j)}$, conditional on all preceding entries of the vector. Clearly, this interpretation depends on the ordering of the entries, a fact that will be true for many results that rely on the Cholesky decomposition.
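
The identity (7) is easy to check numerically. The sketch below (again with an arbitrary example covariance) computes each conditional variance from the standard Gaussian conditioning (Schur complement) formula and compares it to $d_j = \ell_{jj}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
C = A @ A.T + p * np.eye(p)          # an arbitrary positive definite covariance

L = np.linalg.cholesky(C)            # C = L L^T
d = np.diag(L) ** 2                  # diagonal of D in the modified decomposition

for j in range(p):
    if j == 0:
        cond_var = C[0, 0]           # unconditional variance of x^(1)
    else:
        # Var[x^(j+1) | x^(1), ..., x^(j)] via the Schur complement (0-based j)
        cond_var = C[j, j] - C[j, :j] @ np.linalg.solve(C[:j, :j], C[:j, j])
    assert np.isclose(cond_var, d[j])    # matches eqs. (7) and (8)
```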

We can generalize the above result to consider conditional covariances instead of variances, which yields an interpretation of the off-diagonal elements of $L$.

Proposition (conditional covariances).
Let $x \sim \mathcal{N}(m,C)$, with $C = \text{Cov}[x]$ positive definite. Set $\epsilon := L^{-1}x$, where $C = LDL^\top$. Then for $i > j$, $$\text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \ell_{ij}d_j, \qquad j = 1, \dots, p-1, \tag{11}$$ where the $j=1$ case is interpreted as the unconditional covariance $\text{Cov}[x^{(i)},x^{(1)}]$. If we instead define $L$ by $C = LL^\top$, then $$\text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = \ell_{ij}\ell_{jj}, \qquad j = 1, \dots, p-1. \tag{12}$$ In particular, in either case it holds that $$\ell_{ij}=0 \iff \text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] = 0 \iff x^{(i)} \perp x^{(j)} \mid x^{(1)}, \dots, x^{(j-1)}. \tag{13}$$

Proof. The proof proceeds similarly to the conditional variance case. We have \begin{align} \text{Cov}[x^{(i)}, x^{(j)}|x^{(1)}, \dots, x^{(j-1)}] &= \text{Cov}\left[\sum_{k=1}^{i} \ell_{ik} \epsilon^{(k)}, \sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|x^{(1)}, \dots, x^{(j-1)}\right] \newline &= \text{Cov}\left[\sum_{k=1}^{i} \ell_{ik} \epsilon^{(k)}, \sum_{k=1}^{j} \ell_{jk} \epsilon^{(k)} \bigg|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \text{Cov}\left[\sum_{k=j}^{i} \ell_{ik} \epsilon^{(k)}, \ell_{jj} \epsilon^{(j)} \bigg|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \sum_{k=j}^{i} \ell_{ik}\ell_{jj} \text{Cov}\left[\epsilon^{(k)}, \epsilon^{(j)}|\epsilon^{(1)}, \dots, \epsilon^{(j-1)}\right] \newline &= \sum_{k=j}^{i} \ell_{ik}\ell_{jj} \text{Cov}\left[\epsilon^{(k)}, \epsilon^{(j)}\right] \newline &= \ell_{ij}\ell_{jj} \text{Var}\left[\epsilon^{(j)}\right]. \end{align} The penultimate step drops the conditioning, using the fact that $(\epsilon^{(j)}, \dots, \epsilon^{(i)})$ is independent of $(\epsilon^{(1)}, \dots, \epsilon^{(j-1)})$ (owing to the Gaussian assumption). The final step uses the fact that the $\epsilon^{(j)}$ are uncorrelated, and hence all terms with $k \neq j$ vanish. For $C = LDL^\top$ the final expression simplifies to $\ell_{ij}\ell_{jj} \text{Var}\left[\epsilon^{(j)}\right] = \ell_{ij} \cdot 1 \cdot d_j = \ell_{ij}d_j$. For $C = LL^\top$ it becomes $\ell_{ij}\ell_{jj} \cdot 1 = \ell_{ij}\ell_{jj}$. The first equivalence in (13) follows immediately from (11) and (12). The second follows from the fact that $x$ is Gaussian, and hence conditional uncorrelatedness implies conditional independence. $\blacksquare$

We thus find that the Cholesky decomposition of a Gaussian covariance is closely linked to the ordered conditional dependence structure of $x$. The factorization encodes conditional covariances, where the conditioning is with respect to all preceding variables; reordering the entries of $x$ may yield drastically different insights. The connection between sparsity in the Cholesky factor and conditional independence can be leveraged in the design of statistical models and algorithms. For an example, see (Jurek & Katzfuss, 2021).
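
The off-diagonal identities (11) and (12) can be verified in the same way. The sketch below (with an arbitrary example covariance) compares the Schur complement conditional covariance for a single pair $(i, j)$ against $\ell_{ij}d_j$ and $\ell_{ij}\ell_{jj}$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
A = rng.standard_normal((p, p))
C = A @ A.T + p * np.eye(p)          # an arbitrary positive definite covariance

L = np.linalg.cholesky(C)            # C = L L^T
L_tilde = L / np.diag(L)             # unit-diagonal factor, C = L_tilde D L_tilde^T
d = np.diag(L) ** 2

i, j = 4, 2                          # 0-based indices with i > j
# Cov[x^(i+1), x^(j+1) | x^(1), ..., x^(j)] via the Schur complement
cond_cov = C[i, j] - C[i, :j] @ np.linalg.solve(C[:j, :j], C[:j, j])

assert np.isclose(cond_cov, L_tilde[i, j] * d[j])   # eq. (11)
assert np.isclose(cond_cov, L[i, j] * L[j, j])      # eq. (12)
```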

A Regression Interpretation

In this section, we summarize a least squares regression interpretation of the modified Cholesky decomposition $C = LDL^\top$. The result is similar in spirit to (7), as we will consider a sequence of regressions that condition on previous entries of $x$. The ideas discussed here come primarily from (Rothman et al., 2010).

Sequence of Least Squares Problems

We start by recursively defining a sequence of least squares problems, which we then link to the factorization $C = LDL^\top$.

Sequential Least Squares.
Let $x \sim \mathcal{N}(0,C)$, with $C = \text{Cov}[x]$ positive definite. We recursively define the entries of $\epsilon := (\epsilon^{(1)}, \dots, \epsilon^{(p)})$ as follows:

1. Set $\epsilon^{(1)} := x^{(1)}$.
2. For $j = 2, \dots, p$, define the regression coefficient $\beta^{(j)} \in \mathbb{R}^{j-1}$ by $$\beta^{(j)} := \text{argmin}_{\beta}\, \mathbb{E}\left\lvert x^{(j)} - \sum_{k=1}^{j-1} \beta_k \epsilon^{(k)} \right\rvert^2 \tag{14}$$ and set $$\epsilon^{(j)} := x^{(j)} - \sum_{k=1}^{j-1} \beta_k^{(j)} \epsilon^{(k)}. \tag{15}$$

In words, $\epsilon^{(j)}$ is the residual of the least squares regression of the response $x^{(j)}$ on the explanatory variables $\epsilon^{(1)}, \dots, \epsilon^{(j-1)}$, and $\beta^{(j)}$ is the coefficient vector. Take note that we are regressing on the residuals from the previous regressions, rather than on the $x^{(j)}$ themselves. We assume for simplicity that $x$ is mean zero to avoid dealing with an intercept term; for variables with nonzero mean, we can first subtract off the mean and then apply the same procedure. Note also that the zero mean assumption implies $\mathbb{E}[\epsilon] = 0$; this follows from $\epsilon^{(1)} = x^{(1)}$ along with the recursion (15).
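
To illustrate the mechanics, here is a sample-based sketch of the recursion (the covariance and sample size are arbitrary): it runs the sequence of ordinary least squares fits on the previous residuals and stores each new residual. Consistent with the proposition proved below, the sample covariance of $\epsilon$ comes out approximately diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 200_000
A = rng.standard_normal((p, p))
C = A @ A.T + p * np.eye(p)                 # an arbitrary example covariance
x = rng.multivariate_normal(np.zeros(p), C, size=n)   # rows are draws of x

eps = np.zeros((n, p))
betas = {}
eps[:, 0] = x[:, 0]                         # step 1: eps^(1) = x^(1)
for j in range(1, p):                       # step 2: regress x^(j+1) on previous residuals
    Z = eps[:, :j]                          # explanatory variables eps^(1), ..., eps^(j)
    beta, *_ = np.linalg.lstsq(Z, x[:, j], rcond=None)
    betas[j] = beta                         # estimated coefficient vector
    eps[:, j] = x[:, j] - Z @ beta          # eq. (15): the new residual

print(np.round(np.cov(eps, rowvar=False), 2))   # approximately diagonal
```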

Our goal is now to connect this algorithm to the modified Cholesky decomposition of $C$. In particular, we will show that the $\epsilon$ defined by the regression residuals is precisely the $\epsilon$ defined in (5), which arises from the modified Cholesky decomposition. To start, note that if we rearrange (15) as $$x^{(j)} = \epsilon^{(j)} + \sum_{k=1}^{j-1} \beta^{(j)}_k \epsilon^{(k)}, \tag{16}$$ then we see that the vectors $\epsilon$ and $x$ are related as $$x = L\epsilon, \tag{17}$$ where \begin{align} L &:= \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \newline \beta^{(2)}_1 & 1 & 0 & \cdots & 0 \newline \vdots & \vdots & \vdots & \ddots & \vdots \newline \beta^{(p)}_1 & \beta^{(p)}_2 & \cdots & \cdots & 1\end{bmatrix}. \tag{18} \end{align} That is, we have defined $L$ to be the lower triangular matrix whose $j^{\text{th}}$ row is $(\beta^{(j)}, 1)$, the $j^{\text{th}}$ coefficient vector with a $1$ appended, followed by zeros. We immediately have that $L$ is invertible, as it is a triangular matrix with nonzero diagonal entries. We also have $$C = \text{Cov}[x] = \text{Cov}[L\epsilon] = L \text{Cov}[\epsilon] L^{\top}. \tag{19}$$ In order to show that (19) actually yields the modified Cholesky factorization, we must establish that $\text{Cov}[\epsilon]$, the residual covariance matrix, is diagonal with positive diagonal entries.

Proposition.
The random vector $\epsilon$ defined by (14) and (15) satisfies $$\epsilon \sim \mathcal{N}(0, D), \tag{20}$$ where $D$ is a diagonal matrix with positive entries on the diagonal.

Proof. The result follows immediately upon viewing (14) as a projection in a suitable inner product space, and then applying the Hilbert projection theorem. In particular, note that all of the $x^{(j)}$ and $\epsilon^{(j)}$ are zero mean, square integrable random variables. We can thus consider the Hilbert space of all such random variables with inner product defined by $\langle \psi, \eta \rangle := \mathbb{E}[\psi \eta]$. Under this interpretation, we see that (14) can be rewritten as $$\sum_{k=1}^{j-1} \beta_k^{(j)} \epsilon^{(k)} = \text{argmin}_{x^\prime \in \mathcal{E}^{(j)}} \lVert x^{(j)} - x^\prime \rVert, \tag{21}$$ where $\mathcal{E}^{(j)}$ is the subspace spanned by $\epsilon^{(1)}, \dots, \epsilon^{(j-1)}$ and $\lVert \cdot \rVert$ is the norm induced by $\langle \cdot, \cdot \rangle$. Since $\epsilon^{(j)}$ is the residual associated with the projection (21), the Hilbert projection theorem gives the orthogonality condition $\epsilon^{(j)} \perp \mathcal{E}^{(j)}$; that is, $$\langle \epsilon^{(j)}, \epsilon^{(k)} \rangle = \mathbb{E}[\epsilon^{(j)} \epsilon^{(k)}] = \text{Cov}[\epsilon^{(j)}, \epsilon^{(k)}] = 0, \qquad k = 1, \dots, j-1. \tag{22}$$ This implies that the residuals are pairwise uncorrelated, and hence $D = \text{Cov}[\epsilon]$ is diagonal. We know from (17) that $x = L\epsilon$ with $L$ invertible; since $C = \text{Cov}[x]$ is positive definite, $\text{Cov}[\epsilon] = L^{-1}CL^{-\top}$ must also be positive definite, so the diagonal entries of $D$ are strictly positive. Finally, $\epsilon = L^{-1}x$ is a linear transformation of the Gaussian vector $x$, and is therefore itself Gaussian, which gives (20). $\qquad \blacksquare$

Using the recursive regression procedure in (14) and (15), we have constructed $\epsilon$ and $L$ satisfying $C = LDL^\top$, where $D = \text{Cov}[\epsilon]$ is a diagonal matrix with positive diagonal entries, and $L$ is lower triangular with ones on the diagonal. By the uniqueness of the modified Cholesky decomposition (noted in the introduction), it follows that we have recovered precisely the matrices $L$ and $D$ defining the modified Cholesky decomposition of $C$.
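
As a concrete check of this claim, the following sketch (a population-level implementation under an arbitrary example covariance; the function name is just for illustration) carries out the recursion using only $C$, by tracking each residual as a linear function of $x$ and solving the normal equations for each $\beta^{(j)}$. The resulting factors match those obtained from numpy's Cholesky routine.

```python
import numpy as np

def ldl_via_regression(C):
    """Recover L (unit lower triangular) and d (diagonal of D) in C = L diag(d) L^T
    by running the population-level regressions (14)-(15), using only C."""
    p = C.shape[0]
    W = np.zeros((p, p))                        # row j holds w_j with eps^(j+1) = w_j @ x
    L = np.eye(p)
    d = np.zeros(p)
    for j in range(p):
        w = np.zeros(p)
        w[j] = 1.0                              # start from x^(j+1)
        if j > 0:
            Wprev = W[:j]                       # previous residuals as functions of x
            G = Wprev @ C @ Wprev.T             # Cov of previous residuals
            c = Wprev @ C[:, j]                 # Cov[eps^(1:j), x^(j+1)]
            beta = np.linalg.solve(G, c)        # normal equations for eq. (14)
            w = w - beta @ Wprev                # eq. (15): subtract the fitted part
            L[j, :j] = beta                     # row j of L is (beta^(j+1), 1, 0, ...)
        W[j] = w
        d[j] = w @ C @ w                        # Var[eps^(j+1)] = d_j
    return L, d

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C = A @ A.T + 5 * np.eye(5)                     # an arbitrary example covariance

L_reg, d_reg = ldl_via_regression(C)
L_chol = np.linalg.cholesky(C)                  # C = L_chol L_chol^T
assert np.allclose(L_reg, L_chol / np.diag(L_chol))   # same unit-diagonal factor
assert np.allclose(d_reg, np.diag(L_chol) ** 2)       # same D
assert np.allclose(L_reg @ np.diag(d_reg) @ L_reg.T, C)
```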


Connection to the Conditional Covariance Perspective

At this point we have two different interpretations of the (modified) Cholesky decomposition of $C$: (i) the conditional covariance perspective provided in (11) and (12); and (ii) the regression formulation given in (14), (15), and (18). In particular, these results yield interpretations of the entries of $L$. Assuming we use the factorization $C = LDL^\top$, and letting $i > j$, the above results give $$\ell_{ij} = \frac{\text{Cov}[x^{(i)}, x^{(j)}| x^{(1)}, \dots, x^{(j-1)} ]}{\text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}]} \tag{23}$$ and $$\ell_{ij} = \beta^{(i)}_j = \left(\text{Cov}[\epsilon^{(1:i-1)}]^{-1} \text{Cov}[\epsilon^{(1:i-1)}, x^{(i)}]\right)_j = \frac{\text{Cov}[\epsilon^{(j)}, x^{(i)}]}{\text{Var}[\epsilon^{(j)}]}. \tag{24}$$ In (24) we are using the notation $\epsilon^{(1:i-1)} := (\epsilon^{(1)}, \dots, \epsilon^{(i-1)})$ and inserting the closed-form solution of the optimization problem (14); the final equality uses the fact that $\text{Cov}[\epsilon^{(1:i-1)}]$ is diagonal. As a side note, by combining (23) and (24), we see that $$\beta^{(i)}_j = \frac{\text{Cov}[x^{(i)}, x^{(j)}| x^{(1)}, \dots, x^{(j-1)} ]}{\text{Var}[x^{(j)}|x^{(1)}, \dots, x^{(j-1)}]}, \tag{25}$$ which shows that the regression coefficient (24), a function of variances and covariances of the residuals $\epsilon$, can alternatively be written using conditional variances and covariances of the original $x$ variables.
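
As a final numeric sanity check (again with an arbitrary example covariance), the sketch below verifies (23) for one pair $(i, j)$, computing the conditional moments from the Schur complement formula.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.standard_normal((p, p))
C = A @ A.T + p * np.eye(p)              # an arbitrary positive definite covariance

L = np.linalg.cholesky(C)
L_tilde = L / np.diag(L)                 # unit-diagonal factor of C = L_tilde D L_tilde^T

i, j = 5, 3                              # 0-based indices with i > j
idx = [i, j]
# Conditional covariance matrix of (x^(i+1), x^(j+1)) given x^(1), ..., x^(j)
cond = C[np.ix_(idx, idx)] - C[np.ix_(idx, range(j))] @ np.linalg.solve(C[:j, :j], C[np.ix_(range(j), idx)])
cond_cov, cond_var = cond[0, 1], cond[1, 1]

assert np.isclose(L_tilde[i, j], cond_cov / cond_var)   # eq. (23)
```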

  1. Jurek, M., & Katzfuss, M. (2021). Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering. https://arxiv.org/abs/2006.16901
  2. Rothman, A. J., Levina, E., & Zhu, J. (2010). A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika, 97(3), 539–550. http://www.jstor.org/stable/25734107