Regularized Least Squares with Singular Prior
Solving the regularized least squares optimization problem when the prior covariance matrix is not positive definite.
Setup
Recall the regularized least squares problem (i.e., ridge regression),
\begin{align} u_{\star} &:= \text{argmin}_{u \in \mathbb{R}^d} J(u) \tag{1} \newline J(u) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\lVert u-m \rVert^2_{C} \end{align}
where $y \in \mathbb{R}^n$ and $m \in \mathbb{R}^d$ are fixed vectors, $G \in \mathbb{R}^{n \times d}$ is the linear forward model, and $\Sigma \in \mathbb{R}^{n \times n}$ and $C \in \mathbb{R}^{d \times d}$ are positive definite matrices. In (1) we have used the following notation for inner products and norms weighted by a positive definite matrix $C$: \begin{align} \langle v, v^\prime \rangle_C &:= \langle C^{-1}v, v^\prime\rangle = v^\top C^{-1}v^\prime \newline \lVert v \rVert_C^2 &:= \langle v, v\rangle_C. \end{align}
A solution to (1) can be viewed as a maximum a posteriori (MAP) estimate under the Bayesian model \begin{align} y|u &\sim \mathcal{N}(Gu, \Sigma) \tag{3} \newline u &\sim \mathcal{N}(m, C). \tag{4} \end{align}
I discuss this problem in depth in my post on the linear Gaussian model, where I show that the solution can be written as the following equivalent expressions, \begin{align} u_{\star} &= \left(G^\top \Sigma^{-1}G + C^{-1}\right)^{-1}\left(G^\top \Sigma^{-1}y + C^{-1}m \right) \tag{5} \newline u_{\star} &= m + CG^\top \left(GCG^\top + \Sigma \right)^{-1}(y-Gm). \tag{6} \end{align}
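As a quick sanity check of the equivalence between (5) and (6), here is a minimal numerical sketch (NumPy; the dimensions and random test problem are arbitrary choices of mine, not tied to any application):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3
G = rng.standard_normal((n, d))                  # forward model
y = rng.standard_normal(n)                       # data
m = rng.standard_normal(d)                       # prior mean
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))   # noise covariance (positive definite)
B = rng.standard_normal((d, d))
C = B @ B.T + np.eye(d)                          # prior covariance (positive definite)

Si, Ci = np.linalg.inv(Sigma), np.linalg.inv(C)

# Expression (5): "precision" form, requires C^{-1}.
u5 = np.linalg.solve(G.T @ Si @ G + Ci, G.T @ Si @ y + Ci @ m)
# Expression (6): "covariance" form, never inverts C.
u6 = m + C @ G.T @ np.linalg.solve(G @ C @ G.T + Sigma, y - G @ m)

print(np.allclose(u5, u6))  # True
```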
In this post we will consider a generalization of problem (1), where $C$ is only required to be positive semidefinite (not strictly positive definite). In particular, this means that the inverse $C^{-1}$ may not exist. It is interesting to note that, although (5) depends on $C^{-1}$, the inverse does not appear anywhere in the equivalent expression (6). This leads to the natural question of whether (6) still provides a valid solution to the optimization problem even when $C$ is singular. The first step in investigating this question will be providing a suitable generalization of the objective function in (1), since the expression $\lVert u - m \rVert^2_{C}$ is not well-defined when $C$ is singular.
Generalizing the Optimization Problem
Our goal is now to extend (1) to the case where $C$ need not be strictly positive definite. We start by generalizing (1) through the definition of a constrained optimization problem on an extended parameter space. We then provide two approaches to formulate this problem in an unconstrained fashion, each of which leads to equivalent closed-form solutions. We will find it convenient to define notation for the “centered” quantities \begin{align} &u^\prime := u - m, &&y^\prime := y - Gm \tag{7} \end{align} so that \begin{align} J(u) = \frac{1}{2} \lVert y^\prime - Gu^\prime\rVert^2_{\Sigma} + \frac{1}{2}\lVert u^\prime \rVert^2_{C}. \tag{8} \end{align}
Constrained Formulation
We start by considering the constrained formulation. This section follows the approach in section 8.1.2 of (Sanz-Alonso et al., 2023). To start, note that when $C$ is positive definite, we have \begin{align} \lVert u - m \rVert^2_{C} = \langle C^{-1}(u-m), u-m\rangle = \langle b, u-m\rangle, \tag{9} \end{align} where $b$ solves \begin{align} Cb = u - m. \tag{10} \end{align} When $C$ is invertible, the unique $b$ solving (10) is simply found by multiplying both sides by $C^{-1}$. When $C$ is not invertible, then there may be zero or infinitely many such solutions. As long as there is at least one solution to (10) we can give meaning to expression (9) by picking a particular $b$ that solves (10). Now that we have introduced a new variable $b$, we might consider generalizing (1) by jointly optimizing over $(u, b)$ subject to the linear constraint $Cb = u - m$.
Constrained Joint Optimization.
\begin{align} (u_{\star}, b_{\star}) &\in \text{argmin}_{(u,b) \in \mathcal{S}} J(u,b) \tag{11} \newline J(u,b) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\langle b, u-m \rangle \newline \mathcal{S} &:= \left\{(u,b) : Cb = u-m \right\}. \end{align}
Note that if $C$ is positive definite, then (11) reduces to (1). If the problem is solved for $(u_{\star}, b_{\star})$, then the desired solution can be obtained by extracting $u_{\star}$ and discarding the nuisance parameter $b_{\star}$.
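As a toy illustration of the dichotomy noted above: when $C$ is singular, the linear system $Cb = u - m$ may have no solution or infinitely many. The sketch below uses a rank-one $C$ chosen purely for illustration, and shows one example of each case.

```python
import numpy as np

# A rank-one (singular) prior covariance on R^2 and prior mean m = 0.
C = np.array([[1.0, 0.0],
              [0.0, 0.0]])
m = np.zeros(2)

# Consistent case: u - m lies in the range of C, so infinitely many b solve Cb = u - m
# (any b of the form [1, t] works, since the second coordinate of b is unconstrained).
u_ok = np.array([1.0, 0.0])
print(np.allclose(C @ np.array([1.0, 5.0]), u_ok - m))   # True

# Inconsistent case: u - m has a component outside the range of C, so no b exists;
# the least squares "solution" fails to satisfy the constraint.
u_bad = np.array([0.0, 1.0])
b_ls, *_ = np.linalg.lstsq(C, u_bad - m, rcond=None)
print(np.allclose(C @ b_ls, u_bad - m))                  # False
```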
Lagrange Multipliers
Observe that (11) can be viewed as optimizing $J(u,b)$ subject to the constraint $g(u,b) = 0$, where $g(u,b) := Cb - u + m$. This is a typical setup for Lagrange multipliers. We therefore introduce the Lagrange multiplier $\lambda \in \mathbb{R}^d$, which allows us to cast the constrained optimization over $(u,b)$ as an unconstrained optimization over $(u,b,\lambda)$.
Lagrange Multiplier Formulation.
\begin{align} (u_{\star}, b_{\star}, \lambda_{\star}) &\in \text{argmin}_{u,b,\lambda} J(u,b,\lambda) \tag{12} \newline J(u,b,\lambda) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\langle b, u-m \rangle + \langle \lambda, Cb - u + m \rangle. \end{align}
We have succeeded in converting the problem to one that can be solved by analytical means; namely, by solving the system of equations \begin{align} \nabla_{u}J(u,b,\lambda) = 0, && \nabla_{b}J(u,b,\lambda) = 0, && \nabla_{\lambda}J(u,b,\lambda) = 0. \tag{13} \end{align}
This derivation is provided in the appendix, and the result is summarized below.
Solution of Lagrange Multiplier Formulation.
A solution of (12), projected onto the $u$-component, is given by \begin{align} u_{\star} = m - 2C\lambda_{\star} = m + C\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1}(y - Gm), \tag{14} \end{align} where \begin{align} \lambda_{\star} = -\frac{1}{2}\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1}(y - Gm). \tag{15} \end{align} It also holds that \begin{align} C\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1} = CG^\top \left(GCG^\top + \Sigma \right)^{-1}, \tag{16} \end{align} which implies that (14) agrees with expression (6).
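The following sketch (NumPy, with a randomly generated rank-deficient $C = AA^\top$; all variable names are mine) checks numerically that (14) and (6) return the same minimizer even though $C$ has no inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, r = 5, 3, 2
G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
m = rng.standard_normal(d)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))

# Rank-deficient prior covariance: C = A A^T with A of rank r < d.
A = rng.standard_normal((d, r))
C = A @ A.T

Si = np.linalg.inv(Sigma)
yp = y - G @ m

# Expression (14): no inverse of C required.
u14 = m + C @ np.linalg.solve(G.T @ Si @ G @ C + np.eye(d), G.T @ Si @ yp)
# Expression (6).
u6 = m + C @ G.T @ np.linalg.solve(G @ C @ G.T + Sigma, yp)

print(np.allclose(u14, u6))  # True
```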
Basis Approach
We now consider an alternative approach that avoids the joint optimization over $(u,b)$. Similar exposition can be found in section 8.3 of (Evensen et al., 2022). Looking back to (9), instead of choosing to optimize over all $b$ satisfying (10), we will instead simply choose a particular $b$ satisfying this constraint. This only makes sense if the particular choice of $b$ does not matter; i.e., if the value of $J(u,b)$ is the same for any $b$ solving (10). The proposition below shows that this is indeed the case.
Proposition.
Let $u \in \mathbb{R}^d$ be a vector such that there is at least one solution $b$ to $Cb = u - m$. Then $\langle b, u - m\rangle$ and $J(u,b)$ are constant for any choice of $b$ solving this linear system.
Proof. Let $b$ and $\tilde{b}$ both satisfy $Cb = C\tilde{b} = u - m$. It thus follows that \begin{align} \langle b, u - m\rangle = \langle b, C\tilde{b}\rangle = \langle Cb, \tilde{b}\rangle = \langle u - m, \tilde{b}\rangle = \langle \tilde{b}, u - m\rangle, \end{align} where we have used the linearity of the inner product and the fact that $C$ is symmetric. Since the inner product term in (11) is the only portion with dependence on $b$, it follows that $J(u,b) = J(u,\tilde{b})$.
Thus, for each $u$ that yields a consistent system $Cb = u - m$, we can simply pick any solution $b$ to insert into $\langle b, u - m\rangle$. The objective will be well-defined since the above result verifies that the specific choice of $b$ is inconsequential. A natural choice is to select the $b$ of minimal norm; that is, \begin{align} b_{\star} := \text{argmin} \left\{ \lVert b \rVert : Cb = u - m \right\}. \tag{17} \end{align} This unique minimal norm solution is guaranteed to exist and is conveniently given by the Moore-Penrose pseudoinverse: \begin{align} b_{\star} = C^{\dagger}(u - m). \tag{18} \end{align} Note that when $C$ is positive definite, $C^{\dagger} = C^{-1}$. We can now eliminate the requirement to optimize over $b$ by considering the following optimization problem.
Pseudoinverse Formulation.
\begin{align} u_{\star} &\in \text{argmin}_{u \in \mathcal{S}} J(u) \tag{19} \newline J(u) &:= \frac{1}{2} \lVert y - Gu\rVert^2_{\Sigma} + \frac{1}{2}\langle C^{\dagger}(u-m), u-m \rangle \newline \mathcal{S} &:= \left\{u \in \mathbb{R}^d: u-m \in \mathcal{R}(C) \right\} \end{align}
Note that we are now required to reintroduce a constraint set $\mathcal{S}$. This is due to the fact that the pseudoinverse produces a solution to $Cb = u - m$ when a solution exists, but we must still explicitly restrict the search space to those $u$ that admit a consistent system $Cb = u - m$; i.e., to $u$ satisfying $u - m \in \mathcal{R}(C)$. The previous approach dealt with this issue by introducing a Lagrange multiplier $\lambda$. We now show that (19) can be written as an unconstrained problem without extending the parameter space. This relies on the following important fact.
Subspace Property.
Let $A \in \mathbb{R}^{d \times r}$ be a matrix of rank $r := \text{rank}(C)$ satisfying $C = AA^\top$. Then a solution to the optimization problem (19) is constrained to the $r$-dimensional affine subspace \begin{align} \mathcal{A} := m + \mathcal{R}(A) = \left\{m + Aw : w \in \mathbb{R}^r \right\}. \tag{20} \end{align}
This is simply a rewriting of the requirement $u - m \in \mathcal{R}(C)$, since $\mathcal{R}(A) = \mathcal{R}(AA^\top) = \mathcal{R}(C)$. The solution is thus constrained to a subspace whose dimension equals the rank of $C$. If $r = d$ (i.e., $C$ is full-rank), then we recover the typical unconstrained least squares problem. In general $r \leq d$, so the columns of $C$ may be linearly dependent. The matrix $A$ introduced above therefore serves as a basis for the range of $C$; i.e., the columns of $A$ and $C$ have the same span.
To implicitly encode the constraint $u - m \in \mathcal{R}(C)$, we can look for solutions in $\mathcal{A} = m + \mathcal{R}(A)$. Any vector in this space can be written as $u = m + Aw$ for some weight vector $w \in \mathbb{R}^r$. We can now substitute $m + Aw$ for $u$ and $AA^\top$ for $C$ in (19) and thus formulate the optimization over the weights $w$: \begin{align} w_{\star} &\in \text{argmin}_{w \in \mathbb{R}^r} J(w) \tag{21} \newline J(w) &:= \frac{1}{2} \lVert y^\prime - GAw\rVert^2_{\Sigma} + \frac{1}{2}\langle (AA^\top)^{\dagger}Aw, Aw \rangle. \tag{22} \end{align} The optimization is now over $\mathbb{R}^r$, so we have succeeded in writing (19) as an unconstrained problem. We can simplify this even further using the following properties of the pseudoinverse:
- By definition, the pseudoinverse satisfies $AA^{\dagger}A = A$.
- The matrix $A^{\dagger}A$ is a projection onto the rowspace of $A$; in particular, it is idempotent (i.e., $(A^{\dagger}A)^2 = A^{\dagger}A$) and symmetric.
- The following identity holds: $A^\top (AA^\top)^{\dagger} = A^{\dagger}$. (A numerical check of these three properties is sketched below.)
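A minimal sketch of that check, using `numpy.linalg.pinv` with a deliberately rank-deficient matrix so that $A^{\dagger}A$ is a nontrivial projection (the setup is mine, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# Rank 3, but 4 columns, so A^+ A is a genuine projection rather than the identity.
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))
A_pinv = np.linalg.pinv(A)

# Property 1: A A^+ A = A.
print(np.allclose(A @ A_pinv @ A, A))                      # True

# Property 2: A^+ A is symmetric and idempotent (a projection onto the rowspace of A).
P = A_pinv @ A
print(np.allclose(P, P.T), np.allclose(P @ P, P))          # True True
print(np.allclose(P, np.eye(4)))                           # False: not the identity here

# Property 3: A^T (A A^T)^+ = A^+.
print(np.allclose(A.T @ np.linalg.pinv(A @ A.T), A_pinv))  # True
```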
We can thus simplify the second term in (22) as \begin{align} \langle (AA^\top)^{\dagger}Aw, Aw\rangle = \langle A^\top(AA^\top)^{\dagger}Aw, w\rangle = \langle (A^{\dagger}A)w, w\rangle &= \langle (A^{\dagger}A)^2w, w\rangle \newline &= \langle (A^{\dagger}A)w, (A^{\dagger}A)w\rangle \newline &= \lVert (A^{\dagger}A)w \rVert^2, \tag{23} \end{align} where the second equality uses the third pseudoinverse property above. The third and fourth equalities follow from the fact that $A^{\dagger}A$ is idempotent and symmetric. We can thus re-parameterize as \begin{align} \nu := (A^{\dagger}A)w \tag{24} \end{align} and note that by the first pseudoinverse property we have \begin{align} A\nu = A(A^{\dagger}A)w = Aw, \tag{25} \end{align} which allows us to replace $Aw$ by $A\nu$ in the first term in (22). We obtain \begin{align} J(\nu) := \frac{1}{2} \lVert y^\prime - GA\nu\rVert^2_{\Sigma} + \frac{1}{2}\lVert \nu \rVert^2. \tag{26} \end{align} We see that (26) is the sum of a model-data fit term plus a simple penalty on the weights. This final formulation is summarized below, where we have re-labelled $\nu$ as $w$ to lighten notation.
Unconstrained Pseudoinverse Formulation.
The optimization problem in (19) can equivalently be formulated as the unconstrained problem \begin{align} w_{\star} &\in \text{argmin}_{w \in \mathbb{R}^r} J(w) \tag{27} \newline J(w) &:= \frac{1}{2} \lVert y^\prime - GAw\rVert_{\Sigma}^2 + \frac{1}{2}\lVert w \rVert^2 \end{align}
One might initially take issue with the claim that (27) is actually unconstrained, given that it relies on the re-parameterization (24), which constrains the weights to lie in the range of the projection $A^{\dagger}A$. However, recalling that $A^{\dagger}A$ is an orthogonal projection, we have the following projection property: \begin{align} \lVert (A^{\dagger}A)w \rVert \leq \lVert w \rVert. \tag{28} \end{align} In other words, allowing the weights to be unconstrained can only increase the objective function (since switching $w$ and $(A^{\dagger}A)w$ has no effect on the first term in (26)), so we are justified in considering the weights to be unconstrained in (27). In fact, we could have jumped right to this conclusion from (23) and avoided needing to define $\nu$ at all.
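A two-line numerical check of this argument (again a sketch with a deliberately rank-deficient $A$, so that the projection is nontrivial): projecting the weights leaves the data-fit term untouched and can only shrink the penalty.

```python
import numpy as np

rng = np.random.default_rng(9)
# Rank-deficient A so that the projection A^+ A is not the identity.
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))
P = np.linalg.pinv(A) @ A
w = rng.standard_normal(5)

print(np.allclose(A @ (P @ w), A @ w))               # True: the data-fit term is unchanged
print(np.linalg.norm(P @ w) <= np.linalg.norm(w))    # True: the penalty can only shrink
```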
The following result provides the solution to (27), which immediately follows from the observation that (27) is a standard least squares problem.
Solution of Unconstrained Pseudoinverse Formulation.
The minimizing weight of the optimization problem in (27) is given by \begin{align} w_{\star} = (GA)^\top \left[(GA)(GA)^\top + \Sigma \right]^{-1}(y - Gm), \tag{29} \end{align} which implies the optimal parameter \begin{align} u_{\star} &= m + Aw_{\star} \newline &= m + CG^\top \left(GCG^\top + \Sigma \right)^{-1}(y-Gm) \end{align} agrees with the typical least squares solution in (6).
Proof. Observe that (27) is of the form (1) with $y^\prime$ in place of $y$, $GA$ in place of $G$, $m = 0$, and $C = I$. Thus, we apply (6) to obtain \begin{align} w_{\star} &:= (GA)^\top \left[(GA)(GA)^\top + \Sigma \right]^{-1}y^\prime \newline &= A^\top G^\top \left[GCG^\top + \Sigma \right]^{-1}(y-Gm), \end{align} where the second equality uses $AA^\top = C$ and $y^\prime = y - Gm$. We thus have \begin{align} u_{\star} &= m + Aw_{\star} \newline &= m + (AA^\top) G^\top \left[GCG^\top + \Sigma \right]^{-1}(y-Gm) \newline &= m + C G^\top \left[GCG^\top + \Sigma \right]^{-1}(y-Gm), \end{align} which exactly agrees with (6).
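To close the loop, here is a numerical sketch (random test problem of my own choosing) that computes $w_{\star}$ from (29), maps it back to $u_{\star} = m + Aw_{\star}$, and compares against (6):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, r = 6, 4, 3
G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
m = rng.standard_normal(d)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))

A = rng.standard_normal((d, r))     # rank-r factor
C = A @ A.T                         # singular prior covariance

yp = y - G @ m
GA = G @ A

# Optimal weights (29) and the corresponding parameter u_star = m + A w_star.
w_star = GA.T @ np.linalg.solve(GA @ GA.T + Sigma, yp)
u_star = m + A @ w_star

# Reference solution (6), which never inverts C.
u_ref = m + C @ G.T @ np.linalg.solve(G @ C @ G.T + Sigma, yp)

print(np.allclose(u_star, u_ref))   # True
```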
Concluding Notes
We have successfully shown that the optimum (6) continues to hold for the generalization (11) of the least squares formulation, even when $C$ is singular. We provided two different derivations of the solution, resulting in the expressions (14) and (29), both of which agree with $u_{\star}$ as given in (6). We conclude with a few brief follow-up comments.
Uniqueness
Our derivations show that the optimum $u_{\star}$ is unique, even when $C$ is singular. Note that this does not imply that a solution to (11) is unique, since many choices of $b$ may lead to the same value of the objective. Indeed, $J(u_{\star}, b_{\star}) = J(u_{\star}, \tilde{b})$, but it may be that $\tilde{b} \neq b_{\star}$ for some other $\tilde{b}$ satisfying the constraint. In particular, if $(u_{\star}, b_{\star})$ is optimal, then $(u_{\star}, b)$ minimizes $J$ for any $b$ satisfying $Cb = u_{\star} - m$. This follows immediately from the result that the particular $b$ solving this linear system does not change the value of the objective. Given that the particular choice of $b$ is inconsequential, a particular rule for choosing $b$ will not affect the optimal $u_{\star}$. The specific approach of choosing the minimal norm $b$ led to the least squares problem (27), which we know has a unique solution. Therefore, if (11) has a solution, then the optimal $u_{\star}$ is unique. We summarize this below.
Uniqueness.
If the solution set of (11) is non-empty, then there is a unique optimal value $u_{\star}$ solving (11).
Representations of the solution
Continuing the above discussion, it should also be emphasized that, while the solution $u_{\star}$ is unique, there may be many different ways to represent it. To see this, consider that (14) gives the solution in the form $u_{\star} = m + Cv_{\star}$, for the weight vector $v_{\star} := \left[(G^\top \Sigma^{-1}G)C + I \right]^{-1}G^\top \Sigma^{-1}(y - Gm)$. But in general $C$ has linearly dependent columns, and thus there may be multiple sets of weights that give rise to the same vector $Cv_{\star}$. In (29) the solution is instead represented in the form $u_{\star} = m + Aw_{\star}$. If we assume that the columns of $A$ are linearly independent, thus providing a basis for $\mathcal{R}(C)$, then this provides the unique representation of the solution with respect to the particular basis $A$. Note that the two representations are connected by $Aw_{\star} = Cv_{\star} = AA^\top v_{\star}$. If the columns of $A$ are independent, then (29) implies $w_{\star} = A^\top v_{\star}$.
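A small numerical check of this connection (a sketch under the assumption that $A$ has full column rank, which holds with probability one for the random $A$ below):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, r = 6, 4, 3
G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
m = rng.standard_normal(d)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))
A = rng.standard_normal((d, r))          # full column rank (with probability one)
C = A @ A.T

Si = np.linalg.inv(Sigma)
yp = y - G @ m

# Weight vector appearing in (14): u_star = m + C v_star.
v_star = np.linalg.solve(G.T @ Si @ G @ C + np.eye(d), G.T @ Si @ yp)
# Weight vector appearing in (29): u_star = m + A w_star.
GA = G @ A
w_star = GA.T @ np.linalg.solve(GA @ GA.T + Sigma, yp)

print(np.allclose(m + C @ v_star, m + A @ w_star))  # True: same solution u_star
print(np.allclose(w_star, A.T @ v_star))            # True: w_star = A^T v_star
```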
Applications
Why should we even care about the case when $C$ is singular? The main application that motivated this post comes from the field of data assimilation. In this context, the parameter and data dimensions $d$ and $n$ can be massive, potentially precluding storing the matrices $C$ or $\Sigma$ in memory. A common solution to this problem is to replace the true $C$ with a low-rank approximation of the form \begin{align} \hat{C} := \frac{1}{J-1}\sum_{j=1}^{J}\left(u^{(j)} - \bar{u}\right)\left(u^{(j)} - \bar{u}\right)^\top, && \bar{u} := \frac{1}{J}\sum_{j=1}^{J}u^{(j)}, \end{align} where $J \ll d$. This Monte Carlo approximation results in a sample covariance matrix $\hat{C}$ that has rank at most $J - 1$. Here, $u^{(1)}, \dots, u^{(J)}$ is a set of samples used to compute the sample covariance. If we define $A \in \mathbb{R}^{J \times d}$ to be the matrix with $j^{\text{th}}$ row equal to $\left(u^{(j)} - \bar{u}\right)^\top/\sqrt{J-1}$, then we see that $\hat{C} = A^\top A$. The matrix $A$ achieves its maximal rank of $J - 1$ when the centered samples $u^{(j)} - \bar{u}$ span a $(J-1)$-dimensional subspace.
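The sketch below mimics this setting on a small scale (the ensemble is just standard Gaussian noise, and all names are mine): the sample covariance is singular because the ensemble size is far smaller than the parameter dimension, yet the solution formula (6) goes through unchanged.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, J = 50, 10, 8                     # parameter dim, data dim, ensemble size (J << d)
G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Sigma = np.eye(n)

# Ensemble of J samples; the sample mean plays the role of m.
U = rng.standard_normal((J, d))         # row j holds the sample u^(j)
m = U.mean(axis=0)
A = (U - m) / np.sqrt(J - 1)            # J x d matrix of scaled, centered samples
C_hat = A.T @ A                         # sample covariance, rank at most J - 1

print(np.linalg.matrix_rank(C_hat) <= J - 1)   # True: C_hat is singular since J << d

# The solution (6) remains well defined; the inverse of C_hat is never needed.
u_star = m + C_hat @ G.T @ np.linalg.solve(G @ C_hat @ G.T + Sigma, y - G @ m)
print(u_star.shape)                     # (50,)
```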
Appendix
Proof: Solution for Lagrange Multiplier Formulation
We derive expressions (14), (15) and (16), the latter showing that the solution agrees with (6).
Deriving (14) and (15)
We start by computing the gradients with respect to $u$, $b$, and $\lambda$: \begin{align} \nabla_{u}J &= -G^\top \Sigma^{-1}(y-Gu) + \frac{1}{2}b - \lambda \tag{31} \newline \nabla_{b}J &= \frac{1}{2}u^\prime + C\lambda \newline \nabla_{\lambda}J &= Cb - u^\prime, \end{align} where we recall the definition $u^\prime := u - m$. We now solve the system of equations $\nabla_{u}J = \nabla_{b}J = \nabla_{\lambda}J = 0$ for the three unknowns. Focusing on the last two equations, if we compute $2\nabla_{b}J + \nabla_{\lambda}J = 0$ we obtain the optimality criterion \begin{align} C(b + 2\lambda) = 0, \tag{32} \end{align} or equivalently, \begin{align} b + 2\lambda \in \mathcal{N}(C), \tag{33} \end{align} where $\mathcal{N}(C)$ denotes the null space of $C$. The null space of $C$ may be nontrivial, which aligns with the above discussion that many values of $b$ may yield the optimal solution. We consider taking the minimal norm choice $b = -2\lambda$, which implies $\frac{1}{2}b - \lambda = -2\lambda$. Note also that $\nabla_{b}J = 0$ gives $u^\prime = -2C\lambda$; that is, $u = m - 2C\lambda$. We plug the expressions for $b$ and $u$ into $\nabla_{u}J = 0$, which yields \begin{align} -G^\top \Sigma^{-1}\left(y^\prime + 2GC\lambda \right) - 2\lambda = 0. \tag{34} \end{align} Solving for $\lambda$, we obtain \begin{align} \lambda_{\star} = -\frac{1}{2}\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1}G^\top \Sigma^{-1}(y - Gm), \tag{35} \end{align} which is equation (15). Recalling that $u = m - 2C\lambda$, we obtain \begin{align} u_{\star} = m - 2C\lambda_{\star} = m + C\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1}G^\top \Sigma^{-1}(y - Gm), \tag{36} \end{align} which is equation (14).
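As a check on this derivation, the following sketch (a random test problem of my own choosing) builds $\lambda_{\star}$, $u_{\star}$, and the minimal norm choice $b_{\star} = -2\lambda_{\star}$ from (35) and (36), and confirms that all three gradients in (31) vanish:

```python
import numpy as np

rng = np.random.default_rng(10)
d, n, r = 5, 3, 2
G = rng.standard_normal((n, d))
y = rng.standard_normal(n)
m = rng.standard_normal(d)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))
A = rng.standard_normal((d, r))
C = A @ A.T                             # singular prior covariance

Si = np.linalg.inv(Sigma)
yp = y - G @ m

lam = -0.5 * np.linalg.solve(G.T @ Si @ G @ C + np.eye(d), G.T @ Si @ yp)  # (35)
u = m - 2.0 * C @ lam                                                      # (36)
b = -2.0 * lam                                                             # minimal norm choice

# All three gradients in (31) vanish at (u, b, lam).
grad_u = -G.T @ Si @ (y - G @ u) + 0.5 * b - lam
grad_b = 0.5 * (u - m) + C @ lam
grad_lam = C @ b - (u - m)
print(np.allclose(grad_u, 0), np.allclose(grad_b, 0), np.allclose(grad_lam, 0))  # True True True
```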
Deriving (16)
We now show that (36) agrees with the standard least squares solution (6). Distinguishing the latter with a tilde, we want to show equality between the two expressions \begin{align} u_{\star} &= m + C\left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1}(y-Gm) \newline \tilde{u}_{\star} &= m + CG^\top \left(GCG^\top + \Sigma \right)^{-1}(y-Gm). \end{align} We therefore see that it suffices to show \begin{align} \left[(G^\top \Sigma^{-1}G)C + I \right]^{-1} G^\top \Sigma^{-1} = G^\top \left(GCG^\top + \Sigma \right)^{-1}, \tag{37} \end{align} which implies (16). To show (37), we multiply both sides by each of the matrices in brackets, which gives \begin{align} G^\top \Sigma^{-1}\left(GCG^\top + \Sigma \right) = \left[(G^\top \Sigma^{-1}G)C + I \right] G^\top. \tag{38} \end{align} The righthand side of (38) can be factored as \begin{align} \left[(G^\top \Sigma^{-1}G)C + I \right] G^\top = G^\top \Sigma^{-1}GCG^\top + G^\top = G^\top \Sigma^{-1}\left(GCG^\top + \Sigma \right), \end{align} which equals the lefthand side, completing the proof.
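Finally, identity (37) itself is easy to check numerically, even when $C$ is singular (a sketch; the matrices are random test data):

```python
import numpy as np

rng = np.random.default_rng(8)
d, n, r = 5, 3, 2
G = rng.standard_normal((n, d))
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))
A = rng.standard_normal((d, r))
C = A @ A.T                              # singular

Si = np.linalg.inv(Sigma)
lhs = np.linalg.solve(G.T @ Si @ G @ C + np.eye(d), G.T @ Si)   # [(G^T Si G)C + I]^{-1} G^T Si
rhs = G.T @ np.linalg.solve(G @ C @ G.T + Sigma, np.eye(n))     # G^T (G C G^T + Sigma)^{-1}

print(np.allclose(lhs, rhs))             # True, even though C has no inverse
```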
- Sanz-Alonso, D., Stuart, A. M., & Taeb, A. (2023). Inverse Problems and Data Assimilation. https://arxiv.org/abs/1810.06191
- Evensen, G., Vossepoel, F. C., & van Leeuwen, P. J. (2022). Data Assimilation Fundamentals: A Unified Formulation of the State and Parameter Estimation Problem. https://library.oapen.org/handle/20.500.12657/54434