Regularized Least Squares with Singular Prior

Solving the regularized least squares optimization problem when the prior covariance matrix is not positive definite.

Setup

Recall the $L^2$ regularized least squares problem (i.e., ridge regression),

\begin{align} u_{\star} &:= \text{argmin}_{u \in \mathbb{R}^d} J(u) \tag{1} \newline J(u) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\lVert u-m \rVert^2_{C} \end{align}

where $y \in \mathbb{R}^n$ and $m \in \mathbb{R}^d$ are fixed vectors, $G \in \mathbb{R}^{n \times d}$ is the linear forward model, and $\Sigma \in \mathbb{R}^{n \times n}$ and $C \in \mathbb{R}^{d \times d}$ are positive definite matrices. In (1) we have used the following notation for inner products and norms weighted by a positive definite matrix $C$: \begin{align} \langle v, v^\prime \rangle_C &:= \langle C^{-1}v, v^\prime\rangle = v^\top C^{-1}v^\prime \tag{2} \newline \lVert v \rVert_C^2 &:= \langle v, v\rangle_C. \end{align}

A solution to (1) can be viewed as a maximum a posteriori (MAP) estimate under the Bayesian model \begin{align} y|u &\sim \mathcal{N}(Gu, \Sigma) \tag{3} \newline u &\sim \mathcal{N}(m, C). \tag{4} \end{align}

I discuss this problem in depth in my post on the linear Gaussian model, where I show that the solution can be written as the following equivalent expressions, \begin{align} u_{\star} &= \left(G^\top \Sigma^{-1}G + C^{-1}\right)^{-1}\left(G^\top \Sigma^{-1}y + C^{-1}m \right) \tag{5} \newline u_{\star} &= m + CG^\top \left(GCG^\top + \Sigma \right)^{-1}(y-Gm). \tag{6} \end{align}
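
As a quick numerical sanity check (a sketch, not part of the original derivation), the equivalence of (5) and (6) is easy to verify with random problem data; the dimensions and matrices below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
G = rng.standard_normal((n, d))          # forward model
y = rng.standard_normal(n)               # data
m = rng.standard_normal(d)               # prior mean
Sigma = np.eye(n)                        # noise covariance (positive definite)
B = rng.standard_normal((d, d))
C = B @ B.T + 0.1 * np.eye(d)            # prior covariance (positive definite)

# Expression (5): (G^T Sigma^{-1} G + C^{-1})^{-1} (G^T Sigma^{-1} y + C^{-1} m)
Sig_inv = np.linalg.inv(Sigma)
C_inv = np.linalg.inv(C)
u5 = np.linalg.solve(G.T @ Sig_inv @ G + C_inv, G.T @ Sig_inv @ y + C_inv @ m)

# Expression (6): m + C G^T (G C G^T + Sigma)^{-1} (y - G m)
u6 = m + C @ G.T @ np.linalg.solve(G @ C @ G.T + Sigma, y - G @ m)

assert np.allclose(u5, u6)
```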

In this post we will consider a generalization of problem (1), where $C$ is only required to be positive semidefinite (not strictly positive definite). In particular, this means that the inverse $C^{-1}$ may not exist. It is interesting to note that, although (5) depends on $C^{-1}$, the inverse does not appear anywhere in the equivalent expression (6). This leads to the natural question of whether (6) still provides a valid solution to the optimization problem even when $C$ is singular. The first step in investigating this question will be providing a suitable generalization of the objective function in (1), since the expression $\lVert u - m\rVert^2_{C} = \lVert C^{-1/2}(u - m)\rVert^2$ is not well-defined when $C$ is singular.

Generalizing the Optimization Problem

Our goal is now to extend (1) to the case where $C$ need not be strictly positive definite. We start by generalizing (1) through the definition of a constrained optimization problem on an extended parameter space. We then provide two approaches to formulate this problem in an unconstrained fashion, each of which leads to equivalent closed-form solutions. We will find it convenient to define notation for the “centered” quantities \begin{align} &u^\prime := u - m, &&y^\prime := y - Gm \tag{7} \end{align} so that y - Gu = (y - Gm) - G(u-m) = y^\prime - Gu^\prime. \tag{8}

Constrained Formulation

We start by considering the constrained formulation. This section follows the approach in section 8.1.2 of (Sanz-Alonso et al., 2023). To start, note that when $C$ is positive definite, we have \lVert u - m\rVert^2_C = \lVert u^\prime \rVert^2_C = \langle C^{-1}u^\prime, u^\prime\rangle = \langle b, u^\prime \rangle, \tag{9} where $b \in \mathbb{R}^d$ solves Cb = u^\prime. \tag{10} When $C$ is invertible, the unique $b$ solving (10) is simply found by multiplying both sides by $C^{-1}$. When $C$ is not invertible, then there may be zero or infinitely many such solutions. As long as there is at least one solution to (10) we can give meaning to expression (9) by picking a particular $b$ that solves (10). Now that we have introduced a new variable $b$, we might consider generalizing (1) by jointly optimizing over $(u,b)$ subject to the linear constraint $Cb = u^\prime$.

Constrained Joint Optimization.
\begin{align} (u_{\star}, b_{\star}) &\in \text{argmin}_{(u,b) \in \mathcal{S}} J(u,b) \tag{11} \newline J(u,b) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\langle b, u-m \rangle \newline \mathcal{S} &:= \left\{(u,b) : Cb = u-m \right\}. \end{align}

Note that if $C$ is positive definite, then (11) reduces to (1). If the problem is solved for $(u_{\star}, b_{\star})$, then the desired solution can be obtained by extracting $u_{\star}$ and discarding the nuisance parameter $b_{\star}$.

Lagrange Multipliers

Observe that (11) can be viewed as optimizing $J(u,b)$ subject to the constraint $g(u,b) = 0$, where $g(u,b) = Cb - (u-m)$. This is a typical setup for Lagrange multipliers. We therefore introduce the Lagrange multiplier $\lambda \in \mathbb{R}^d$, which allows us to cast the constrained optimization over $(u,b)$ as an unconstrained optimization over $(u,b,\lambda)$.

Lagrange Multiplier Formulation.
\begin{align} (u_{\star}, b_{\star}, \lambda_{\star}) &\in \text{argmin}_{u,b,\lambda} J(u,b,\lambda) \tag{12} \newline J(u,b,\lambda) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\langle b, u-m \rangle + \langle \lambda, Cb - u + m \rangle. \end{align}

We have succeeded in converting the problem to one that can be solved by analytical means; namely, by solving the system of equations \nabla_{u} J = \nabla_{b} J = \nabla_{\lambda} J = 0. \tag{13}

This derivation is provided in the appendix, and the result is summarized below.

Solution of Lagrange Multiplier Formulation.
A solution $(u_{\star}, b_{\star}, \lambda_{\star})$ of (12), projected onto the $u$-component, is given by u_{\star} = m + Cb_{\star}, \tag{14} where b_{\star} = \left[(G^\top \Sigma^{-1}G)C + I \right]^{-1}G^\top \Sigma^{-1}(y-Gm). \tag{15} It also holds that C\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1}G^\top \Sigma^{-1} = CG^\top \left(GCG^\top + \Sigma \right)^{-1}, \tag{16} which implies that (14) agrees with expression (6).
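
Since $C^{-1}$ appears nowhere in (14)-(16), the result can be checked numerically even with a singular $C$. The following sketch, with made-up dimensions and random matrices, confirms that (14)-(15) reproduce (6).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 6, 4, 2
G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
m = rng.standard_normal(d)
Sigma = 0.5 * np.eye(n)
A = rng.standard_normal((d, r))
C = A @ A.T                              # rank r < d, so C^{-1} does not exist

Sig_inv = np.linalg.inv(Sigma)
yp = y - G @ m

# (15): b_star = [(G^T Sigma^{-1} G) C + I]^{-1} G^T Sigma^{-1} (y - G m)
b_star = np.linalg.solve(G.T @ Sig_inv @ G @ C + np.eye(d), G.T @ Sig_inv @ yp)
u14 = m + C @ b_star                     # (14)

# (6): m + C G^T (G C G^T + Sigma)^{-1} (y - G m)
u6 = m + C @ G.T @ np.linalg.solve(G @ C @ G.T + Sigma, yp)

assert np.allclose(u14, u6)
```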

Basis Approach

We now consider an alternative approach that avoids the joint optimization over $(u,b)$. Similar exposition can be found in section 8.3 of (Evensen et al., 2022). Looking back to (9), instead of optimizing over all $b$ satisfying (10), we will simply choose a particular $b$ satisfying this constraint. This only makes sense if the particular choice of $b$ does not matter; i.e., if the value of $\langle b, u^\prime \rangle$ is the same for any $b$ solving (10). The proposition below shows that this is indeed the case.

Proposition.
Let $u \in \mathbb{R}^d$ be a vector such that there is at least one solution $b \in \mathbb{R}^d$ to $Cb = u^\prime$. Then $\langle b, u^\prime \rangle$ and $J(u,b)$ are constant for any choice of $b$ solving this linear system.

Proof. Let $b, b^\prime \in \mathbb{R}^d$ satisfy $Cb = Cb^\prime = u^\prime$. It thus follows that \langle b, u^\prime\rangle - \langle b^\prime, u^\prime\rangle = \langle b - b^\prime, u^\prime\rangle = \langle b - b^\prime, Cb\rangle = \langle C(b - b^\prime), b\rangle = \langle 0, b\rangle = 0, where we have used the linearity of the inner product and the fact that $C$ is symmetric. Since the inner product term in (11) is the only portion with dependence on $b$, it follows that $J(u,b) = J(u,b^\prime)$. $\qquad \blacksquare$

Thus, for each $u$ that yields a consistent system $Cb = u^\prime$, we can simply pick any solution $b$ to insert into $\langle b, u^\prime\rangle$. The objective will be well-defined since the above result verifies that the specific choice of $b$ is inconsequential. A natural choice is to select the $b$ of minimal norm; that is, b^{\dagger} := \text{argmin}_{Cb=u^\prime} \lVert b \rVert. \tag{17} This unique minimal norm solution is guaranteed to exist and is conveniently given by the Moore-Penrose pseudoinverse b^{\dagger} = C^{\dagger}u^\prime. \tag{18} Note that when $C$ is positive definite, $C^{\dagger} = C^{-1}$. We can now eliminate the requirement to optimize over $b$ by considering the following optimization problem.
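
The following sketch illustrates (17)-(18) numerically: for a consistent system $Cb = u^\prime$ with singular $C$, the pseudoinverse returns the minimal-norm solution, and any other solution differs by a null-space component that changes neither the constraint nor the value of $\langle b, u^\prime\rangle$. The matrices below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 4, 2
M = rng.standard_normal((d, r))
C = M @ M.T                              # singular, rank r < d
u_prime = C @ rng.standard_normal(d)     # guarantees u' lies in the range of C

b_dagger = np.linalg.pinv(C) @ u_prime   # minimal-norm solution of C b = u', cf. (18)
assert np.allclose(C @ b_dagger, u_prime)

# Any other solution b = b_dagger + z with z in the null space of C is at
# least as long, and gives the same value of <b, u'>.
v = rng.standard_normal(d)
z = v - np.linalg.pinv(C) @ (C @ v)      # component of v in the null space of C
b_other = b_dagger + z
assert np.allclose(C @ b_other, u_prime)
assert np.linalg.norm(b_other) >= np.linalg.norm(b_dagger)
assert np.isclose(b_other @ u_prime, b_dagger @ u_prime)
```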

Pseudoinverse Formulation.
\begin{align} u_{\star} &\in \text{argmin}_{u \in \mathcal{S}} J(u) \tag{19} \newline J(u) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\langle C^{\dagger}(u-m), u-m \rangle \newline \mathcal{S} &:= \left\{u \in \mathbb{R}^d: u-m \in \mathcal{R}(C) \right\} \end{align}

Note that we are now required to reintroduce a constraint set $\mathcal{S}$. This is due to the fact that the pseudoinverse produces a solution $b$ to $Cb = u^\prime$ when a solution exists, but we must still explicitly restrict the search space to $u$ that admit a consistent system $Cb = u^\prime$; i.e., $u^\prime \in \mathcal{R}(C)$. The previous approach dealt with this issue by introducing a Lagrange multiplier $\lambda$. We now show that (19) can be written as an unconstrained problem without extending the parameter space. This relies on the following important fact.

Subspace Property.
Let $A \in \mathbb{R}^{d \times r}$ be a matrix of rank $r \leq d$ satisfying $C = AA^\top$. Then a solution to the optimization problem (19) is constrained to the $r$-dimensional affine subspace m + \mathcal{R}(C) = m + \mathcal{R}(A). \tag{20}

This is simply a rewriting of the requirement $u^\prime = u-m \in \mathcal{R}(C)$. The solution is therefore confined to an affine subspace whose dimension equals the rank of $C$. If $r = d$ (i.e., $C$ is full-rank) then we recover the typical unconstrained least squares problem. When $r < d$, the columns of $C$ are linearly dependent. We therefore introduce the matrix $A$ whose columns $a_1, \dots, a_r$ form a basis for the range of $C$; i.e., the columns of $C$ and $A$ have the same span.
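
One concrete way to construct such an $A$ (an assumption of this sketch; any factor satisfying $C = AA^\top$ whose columns span $\mathcal{R}(C)$ works) is to take the eigendecomposition of the symmetric positive semidefinite $C$ and keep the eigenvectors with nonzero eigenvalues, scaled by the square roots of those eigenvalues.

```python
import numpy as np

def basis_factor(C, tol=1e-10):
    """Return A with C = A A^T, whose columns span the range of C."""
    eigvals, eigvecs = np.linalg.eigh(C)              # C is symmetric PSD
    keep = eigvals > tol * eigvals.max()              # drop (numerically) zero eigenvalues
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])  # scale kept eigenvectors

rng = np.random.default_rng(3)
d, r = 5, 2
M = rng.standard_normal((d, r))
C = M @ M.T                                           # rank-r PSD matrix

A = basis_factor(C)
assert A.shape == (d, r)
assert np.allclose(A @ A.T, C)
```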

To implicitly encode the constraint $u^\prime \in \mathcal{R}(C)$, we can look for solutions in $m + \mathcal{R}(A)$. Any vector in this space can be written as u = m + \sum_{j=1}^{r} w_j a_j = m + Aw, \tag{21} for some weight vector $w := (w_1, \dots, w_r)^\top \in \mathbb{R}^r$. We can now substitute $m + Aw$ for $u$ and $AA^\top$ for $C$ in (19) and thus formulate the optimization over the weights $w$: J(w) := \frac{1}{2}\lVert y - G(m+Aw)\rVert^2_{\Sigma} + \frac{1}{2}\langle (AA^\top)^{\dagger}Aw, Aw \rangle. \tag{22} The optimization is now over $w \in \mathbb{R}^r$, so we have succeeded in writing (19) as an unconstrained problem. We can simplify this even further using the following properties of the pseudoinverse (each is checked numerically in the sketch after the list):

  1. By definition, the pseudoinverse satisfies $A = A(A^{\dagger}A)$.
  2. The matrix $A^{\dagger}A$ is a projection onto the row space of $A$; in particular it is idempotent (i.e., $(A^{\dagger}A)^2 = A^{\dagger}A$) and symmetric.
  3. The following identity holds: $A^{\dagger} = A^\top (AA^\top)^{\dagger}$.
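
A brief numerical check of these three properties (a sketch using an arbitrary rank-deficient matrix; for the full-column-rank $A$ of the Subspace Property, $A^{\dagger}A$ is simply the identity on $\mathbb{R}^r$):

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, rank = 6, 4, 2
A = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, r))  # rank-deficient on purpose
A_dag = np.linalg.pinv(A)

# 1. A = A (A^dagger A)
assert np.allclose(A, A @ (A_dag @ A))

# 2. A^dagger A is a symmetric, idempotent projection onto the row space of A
P = A_dag @ A
assert np.allclose(P, P.T)
assert np.allclose(P @ P, P)

# 3. A^dagger = A^T (A A^T)^dagger
assert np.allclose(A_dag, A.T @ np.linalg.pinv(A @ A.T))
```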

We can thus simplify the second term in (22) as \begin{align} \langle (AA^\top)^{\dagger}Aw, Aw\rangle = \langle A^\top(AA^\top)^{\dagger}Aw, w\rangle = \langle (A^{\dagger}A)w, w\rangle &= \langle (A^{\dagger}A)^2w, w\rangle \newline &= \langle (A^{\dagger}A)w, (A^{\dagger}A)w\rangle \newline &= \lVert (A^{\dagger}A)w \rVert^2, \tag{23} \end{align} where the second equality uses the third pseudoinverse property above. The third and fourth equalities follow from the fact that $A^{\dagger}A$ is idempotent and symmetric. We can thus re-parameterize as \tilde{w} := (A^{\dagger}A)w, \tag{24} and note that by the first pseudoinverse property we have Aw = A(A^{\dagger}A)w = A\tilde{w}, \tag{25} which allows us to replace $Aw$ by $A\tilde{w}$ in the first term in (22). We obtain J(\tilde{w}) := \frac{1}{2} \lVert y - G(m + A\tilde{w})\rVert_{\Sigma}^2 + \frac{1}{2} \lVert \tilde{w} \rVert^2 . \tag{26} We see that (26) is the sum of a model-data fit term plus a simple $L^2$ penalty on the weights. This final formulation is summarized below, where we have re-labelled $\tilde{w}$ as $w$ to lighten notation.

Unconstrained Pseudoinverse Formulation.
The optimization problem in (19) can equivalently be formulated as the unconstrained problem \begin{align} w_{\star} &\in \text{argmin}_{w \in \mathbb{R}^r} J(w) \tag{27} \newline J(w) &:= \frac{1}{2} \lVert y^\prime - GAw\rVert_{\Sigma}^2 + \frac{1}{2}\lVert w \rVert^2 \end{align}

One might initially take issue with the claim that (27) is actually unconstrained given that it relies on the re-parameterization (24), which constrains the weights to lie in the range of $A^{\dagger}A$. However, recalling that $A^{\dagger}A$ is an orthogonal projection, we have the following projection property: \lVert \tilde{w} \rVert^2 = \lVert (A^{\dagger}A)w \rVert^2 \leq \lVert w \rVert^2. \tag{28} In other words, allowing the weights to be unconstrained can only increase the objective function (since switching $w$ and $\tilde{w}$ has no effect on the first term in (26)), so we are justified in considering the weights to be unconstrained in (27). In fact, we could have jumped right to this conclusion from (23) and avoided needing to define $\tilde{w}$ at all.

The following result provides the solution to (27), which immediately follows from the observation that (27) is a standard least squares problem.

Solution of Unconstrained Pseudoinverse Formulation.
The minimizing weight of the optimization problem in (27) is given by w_{\star} = A^\top G^\top \left(GCG^\top + \Sigma \right)^{-1}y^\prime, \tag{29} which implies that the optimal parameter \begin{align} u_{\star} &= m + Aw_{\star} \newline &= m + CG^\top \left(GCG^\top + \Sigma \right)^{-1}(y - Gm) \end{align} agrees with the typical least squares solution in (6).

Proof. Observe that (27) is of the form (1) with $u := w$, $y := y^\prime$, $G := GA$, $m := 0$, and $C := I$. Thus, we apply (6) to obtain \begin{align} w_{\star} &:= (GA)^\top \left[(GA)(GA)^\top + \Sigma \right]^{-1}y^\prime \newline &= A^\top G^\top \left[GCG^\top + \Sigma \right]^{-1}(y-Gm), \end{align} where the second equality uses $C = AA^\top$. We thus have \begin{align} u_{\star} &= m + Aw_{\star} \newline &= m + (AA^\top) G^\top \left[GCG^\top + \Sigma \right]^{-1}(y-Gm) \newline &= m + C G^\top \left[GCG^\top + \Sigma \right]^{-1}(y-Gm), \end{align} which exactly agrees with (6). $\qquad \blacksquare$
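
The chain of identities above can also be sketched numerically with random, illustrative problem data: solving the weight-space ridge problem (27) and mapping back through $u_{\star} = m + Aw_{\star}$ reproduces (6), with $C$ singular throughout.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, r = 7, 5, 3
G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
m = rng.standard_normal(d)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))
A = rng.standard_normal((d, r))
C = A @ A.T                               # singular prior covariance

yp = y - G @ m

# (29): w_star = A^T G^T (G C G^T + Sigma)^{-1} y'
w_star = A.T @ G.T @ np.linalg.solve(G @ C @ G.T + Sigma, yp)
u_star = m + A @ w_star

# (6), which never references C^{-1}
u6 = m + C @ G.T @ np.linalg.solve(G @ C @ G.T + Sigma, yp)
assert np.allclose(u_star, u6)

# w_star also solves the ridge problem (27) directly (normal equations after
# whitening by the Cholesky factor of Sigma)
L = np.linalg.cholesky(Sigma)
GA_w = np.linalg.solve(L, G @ A)          # whitened G A
yp_w = np.linalg.solve(L, yp)             # whitened y'
w_ridge = np.linalg.solve(GA_w.T @ GA_w + np.eye(r), GA_w.T @ yp_w)
assert np.allclose(w_ridge, w_star)
```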

Concluding Notes

We have successfully shown that the optimum (6) continues to hold for the generalization (11) of the least squares formulation, even when $C$ is singular. We provided two different derivations of the solution, resulting in the expressions (14) and (29), both of which agree with $u_{\star}$ as given in (6). We conclude with a few brief follow-up comments.

Uniqueness

Our derivations show that the optimum $u_{\star}$ is unique, even when $C$ is singular. Note that this does not imply that a solution $(u_{\star}, b_{\star})$ to (11) is unique, since many choices of $b$ may lead to the same value of the objective. Indeed, $u_{\star} = m + Cb_{\star}$, but it may be that $u_{\star} = m + Cb^\prime$ for some other $b^\prime$ with $(u_{\star}, b^\prime) \in \mathcal{S}$. In particular, if $u_{\star}$ is optimal, then $(u_{\star}, b^\prime)$ minimizes $J(u,b)$ for any $b^\prime$ satisfying $Cb^\prime = u_{\star} - m$. This follows immediately from the result that the particular $b$ solving this linear system does not change the value of the objective. Since the particular choice of $b$ is inconsequential, a particular rule for choosing $b$ will not affect the optimal $u_{\star}$. The specific approach of choosing the minimal norm $b$ led to the least squares problem (27), which we know has a unique solution. Therefore, if (11) has a solution, then the optimal $u_{\star}$ is unique. We summarize this below.

Uniqueness.
If the constraint set $\mathcal{S}$ is non-empty, then there is a unique optimal value $u_{\star}$ solving (11).

Representations of the Solution

Continuing the above discussion, it should also be emphasized that, while the solution is unique, there may be many different ways to represent it. To see this, consider that (14) gives the solution in the form $u_{\star} = m + Cb_{\star}$, for a weight vector $b_{\star}$. But in general $C$ has linearly dependent columns, and thus there may be multiple sets of weights that give rise to the same vector. In (29) the solution is instead represented in the form $u_{\star} = m + Aw_{\star}$. If we assume that the columns of $A$ are linearly independent, thus providing a basis for $\mathcal{R}(C)$, then this provides the unique representation of the solution with respect to the particular basis $a_1, \dots, a_r$. Note that the two representations are connected by u_{\star} = m + Cb_{\star} = m + AA^\top b_{\star} = m + A(A^\top b_{\star}). If the columns of $A$ are linearly independent, this implies $w_{\star} = A^\top b_{\star}$.

Applications

Why should we even care about the case when $C$ is singular? The main application that motivated this post comes from the field of data assimilation. In this context, the parameter and data dimensions $d$ and $n$ can be massive, potentially precluding storing the matrices $\Sigma$ or $C$. A common solution to this problem is to replace the true $C$ with a low-rank approximation of the form C = \frac{1}{r-1} \sum_{j=1}^{r} (u^{(j)} - m)(u^{(j)} - m)^\top, \tag{30} where $r < d$. This Monte Carlo approximation results in a sample covariance matrix $C$ that has rank at most $r$. Here, $\{u^{(j)}\}$ is a set of $r$ samples used to compute the sample covariance. If we define $A$ to be the $d \times r$ matrix with $j^{\text{th}}$ column equal to $\frac{1}{\sqrt{r-1}}\left(u^{(j)} - m\right)$, then we see that $C = AA^\top$. The matrix $A$ is full rank if the centered vectors $\{u^{(j)} - m\}$ are linearly independent.
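
The payoff of the weight-space formulation is that $C$ never needs to be formed or inverted. Below is a sketch of this workflow with a made-up ensemble, forward model, and data; only the low-rank structure matters.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, r = 1000, 20, 8                        # large parameter dimension, small ensemble
m = rng.standard_normal(d)                   # prior mean
ensemble = m[:, None] + rng.standard_normal((d, r))   # columns u^(1), ..., u^(r)
A = (ensemble - m[:, None]) / np.sqrt(r - 1) # d x r factor, so C = A A^T matches (30)

G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Sigma = np.eye(n)

# Only the n x r matrix G A is needed; the d x d matrix C is never formed.
GA = G @ A
yp = y - G @ m
w_star = GA.T @ np.linalg.solve(GA @ GA.T + Sigma, yp)   # (29), using G C G^T = (GA)(GA)^T
u_star = m + A @ w_star                                  # solution in the ensemble subspace
```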

Appendix

Proof: Solution for Lagrange Multiplier Formulation

We derive expressions (14), (15) and (16), the latter showing that the solution agrees with (6).

Deriving (14) and (15)

We start by computing the gradients with respect to $u$, $b$, and $\lambda$: \begin{align} \nabla_{u}J &= -G^\top \Sigma^{-1}(y-Gu) + \frac{1}{2}b - \lambda \tag{31} \newline \nabla_{b}J &= \frac{1}{2}u^\prime + C\lambda \newline \nabla_{\lambda}J &= Cb - u^\prime, \end{align} where we recall the definition $u^\prime := u - m$. We now solve the system of equations $\nabla_{u}J = \nabla_{b}J = \nabla_{\lambda}J = 0$ for the three unknowns. Focusing on the last two equations, if we compute $2\nabla_{b}J + \nabla_{\lambda}J = 0$ we obtain the optimality criterion C(b + 2\lambda) = 0, \tag{32} or equivalently, b + 2\lambda \in \mathcal{N}(C), \tag{33} where $\mathcal{N}(C)$ denotes the null space of $C$. The null space of $C$ may be nontrivial, which aligns with the above discussion that many values of $b$ may yield the optimal solution. We consider taking the minimal norm solution b_{\star} + 2\lambda_{\star} = 0, \tag{34} which implies $\lambda_{\star} = -\frac{1}{2}b_{\star}$. Note also that, using $\nabla_{\lambda}J = 0$ (i.e., $u^\prime_{\star} = Cb_{\star}$), y - Gu_{\star} = (y-Gm) - G(u_{\star}-m) =: y^\prime - Gu_{\star}^\prime = y^\prime - GCb_{\star}. \tag{35} We plug the expression for $\lambda_{\star}$ into $\nabla_{u}J = 0$, which yields -G^\top \Sigma^{-1}(y^\prime - GCb_{\star}) + b_{\star} = 0. \tag{36} Solving for $b_{\star}$, we obtain b_{\star} = \left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1}(y-Gm), which is equation (15). Recalling the definition $u^\prime_{\star} := u_{\star} - m$, we obtain equation (14): u_{\star} = m + u^\prime_{\star} = m + Cb_{\star}.

Deriving (16)

We now show that (14) agrees with the standard least squares solution (6). Distinguishing the latter with a tilde, we want to show equality between the two expressions \begin{align} u_{\star} &= m + C\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1}(y-Gm) \newline \tilde{u}_{\star} &= m + CG^\top \left(GCG^\top + \Sigma \right)^{-1}(y-Gm). \end{align} We therefore see that it suffices to show \left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1} = G^\top \left[GCG^\top + \Sigma \right]^{-1}, \tag{37} which implies (16). To show (37), we multiply both sides on the left by $(G^\top \Sigma^{-1}G)C + I$ and on the right by $GCG^\top + \Sigma$, which gives G^\top \Sigma^{-1}\left[GCG^\top + \Sigma \right] = \left[G^\top \Sigma^{-1}GC + I \right] G^\top. \tag{38} The right-hand side of (38) can be expanded and refactored as \left[G^\top \Sigma^{-1}GC + I \right] G^\top = G^\top \Sigma^{-1}GCG^\top + G^\top = G^\top \Sigma^{-1} \left[GCG^\top + \Sigma \right], \tag{39} completing the proof. $\qquad \blacksquare$

  1. Sanz-Alonso, D., Stuart, A. M., & Taeb, A. (2023). Inverse Problems and Data Assimilation. https://arxiv.org/abs/1810.06191
  2. Evensen, G., Vossepoel, F. C., & van Leeuwen, P. J. (2022). Data Assimilation Fundamentals: A Unified Formulation of the State and Parameter Estimation Problem. https://library.oapen.org/handle/20.500.12657/54434