Optimization Viewpoints on Kalman Methodology

Viewing the Kalman Filter and its variants from an optimization perspective.

Throughout this post, we utilize the following notation for inner products and norms weighted by a positive definite matrix $C$: \begin{align} \langle v, v^\prime \rangle_C &:= \langle C^{-1}v, v^\prime\rangle = v^\top C^{-1}v^\prime \newline \lVert v \rVert_C^2 &:= \langle v, v\rangle_C. \end{align} We will also need to generalize these expressions to the case where $C$ is only positive semidefinite. The notation $\lVert \cdot \rVert$ and $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean norm and inner product, respectively. For a matrix $C$, we will denote by $\mathcal{R}(C)$ and $\mathcal{N}(C)$ its range (column space) and kernel (null space), respectively.
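
To make the notation concrete, here is a minimal NumPy sketch (my own illustration, not from any particular library) of the weighted inner product and norm; it uses a linear solve in place of forming $C^{-1}$ explicitly.

```python
import numpy as np

def weighted_inner(v, v_prime, C):
    """<v, v'>_C = v^T C^{-1} v', computed with a linear solve rather than an explicit inverse."""
    return v @ np.linalg.solve(C, v_prime)

def weighted_norm_sq(v, C):
    """||v||_C^2 = <v, v>_C."""
    return weighted_inner(v, v, C)

# Quick check on a random positive definite C.
rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
C = M @ M.T + d * np.eye(d)      # positive definite by construction
v = rng.standard_normal(d)
print(np.isclose(weighted_norm_sq(v, C), v @ np.linalg.inv(C) @ v))   # True
```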

Setup

We start by considering the following optimization problem \begin{align} v_{\star} &:= \text{argmin}_{v \in \mathbb{R}^d} J(v) \tag{1} \newline J(v) &= \frac{1}{2} \lVert y - h(v)\rVert_R^2 + \frac{1}{2} \lVert v - \hat{v}\rVert_C^2, \tag{2} \end{align} where the objective $J(v)$ is defined with respect to a map $h: \mathbb{R}^d \to \mathbb{R}^p$, fixed vectors $\hat{v} \in \mathbb{R}^d$ and $y \in \mathbb{R}^p$, and positive definite matrices $R \in \mathbb{R}^{p \times p}$ and $C \in \mathbb{R}^{d \times d}$. A solution to (1) can be viewed as a maximum a posteriori (MAP) estimate under the Bayesian model \begin{align} y|v &\sim \mathcal{N}(h(v), R) \tag{3} \newline v &\sim \mathcal{N}(\hat{v}, C). \tag{4} \end{align} In the case that $h(v)$ is linear (i.e., $h(v) = Hv$), (1) assumes the form of a standard $L^2$-regularized least squares problem, admitting a closed-form solution. In this special case, (1) also defines the optimization problem that is solved to derive the analysis update of the standard Kalman filter. Under this interpretation, $\hat{v}$ is the mean of the current forecast distribution, $H$ is the linear observation operator, $R$ is the observation covariance, $C$ is the covariance of the current forecast distribution, and the solution $v_{\star}$ gives the mean of the Gaussian filtering distribution. Formulation (1) also provides the basis for deriving variants of the Kalman filter, including the ensemble Kalman filter (EnKF). For now, we maintain a generic view of (1) in order to investigate its mathematical properties. Later we will make explicit links with Kalman filter methods.
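
To illustrate the linear Gaussian special case, the following sketch (an illustrative check of my own, using randomly generated problem data) minimizes (2) with $h(v) = Hv$ via the normal equations and confirms that the minimizer agrees with the familiar Kalman analysis mean $\hat{v} + CH^\top(HCH^\top + R)^{-1}(y - H\hat{v})$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 5, 3
H = rng.standard_normal((p, d))
M = rng.standard_normal((d, d)); C = M @ M.T + np.eye(d)   # positive definite forecast covariance
N = rng.standard_normal((p, p)); R = N @ N.T + np.eye(p)   # positive definite observation covariance
v_hat = rng.standard_normal(d)
y = rng.standard_normal(p)

# Minimizer of (2) with h(v) = Hv: set the gradient to zero and solve
#   (H^T R^{-1} H + C^{-1}) v = H^T R^{-1} y + C^{-1} v_hat.
Rinv, Cinv = np.linalg.inv(R), np.linalg.inv(C)
v_opt = np.linalg.solve(H.T @ Rinv @ H + Cinv, H.T @ Rinv @ y + Cinv @ v_hat)

# Kalman analysis mean: v_hat + K (y - H v_hat), with gain K = C H^T (H C H^T + R)^{-1}.
K = C @ H.T @ np.linalg.inv(H @ C @ H.T + R)
v_kalman = v_hat + K @ (y - H @ v_hat)

print(np.allclose(v_opt, v_kalman))   # True
```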

Generalizing the Optimization Problem

Motivation and Setup

While the formulation (1) is perfectly adequate when $C$ is positive definite, it is not well-defined when $C$ is only positive semidefinite. In particular, the inverse of $C$ may not exist and hence the term $\lVert v - \hat{v}\rVert_C$ is no longer well-defined. Our first goal is to consider a suitable modification of the optimization problem (1) that is well-defined in this more general setting. The reason for considering this is that various filtering algorithms consider setups similar to (1) where the matrix $C$ is replaced by a low-rank approximation. For example, the EnKF defines $C$ using a sample covariance estimator, whose rank cannot exceed the number of samples used in the estimator. Thus, if the state dimension $d$ exceeds the number of samples, then this matrix will fail to be positive definite. We will discuss specifics regarding the EnKF later on, but for the time being keep the discussion generic. Let $A \in \mathbb{R}^{d \times J}$ be a matrix such that C = AA^\top. \tag{5} This implies that v^\top C v = v^\top AA^\top v = \lVert A^\top v \rVert^2 \geq 0, \tag{6} so $C$ is indeed positive semidefinite (as we have assumed). However, when $J < d$ the rank of $A$ is at most $J < d$, so there is some nonzero $v$ such that $A^\top v = 0$, and hence $Cv = 0$. This implies that $C$ is not positive definite when $J < d$. Note that the range of $C$ is equal to that of $A$; i.e., \mathcal{R}(C) = \mathcal{R}(A). \tag{7}
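
The rank deficiency described above is easy to see numerically. The sketch below (illustrative only; the dimensions are arbitrary) builds $C = AA^\top$ from a matrix $A$ with $J < d$ columns and confirms that $C$ is positive semidefinite but singular, with the same rank as $A$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, J = 10, 4                      # state dimension d exceeds the number of columns J
A = rng.standard_normal((d, J))
C = A @ A.T

eigvals = np.linalg.eigvalsh(C)
print(np.all(eigvals >= -1e-10))                               # True: C is positive semidefinite
print(np.linalg.matrix_rank(C), "<", d)                        # rank at most J = 4, so C is singular
print(np.linalg.matrix_rank(C) == np.linalg.matrix_rank(A))    # True: consistent with R(C) = R(A)
```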

Redefining the Objective Function

Our goal is now to extend (1) to the case where $C$ need not be strictly positive definite. To start, note that when $C$ is positive definite, we have \lVert v - \hat{v}\rVert^2_C = \langle C^{-1}(v-\hat{v}), v-\hat{v}\rangle = \langle b, v-\hat{v} \rangle, \tag{8} where $b \in \mathbb{R}^d$ solves Cb = v - \hat{v}. \tag{9} When $C$ is invertible, the unique $b$ solving (9) is simply found by multiplying both sides by $C^{-1}$. When $C$ is not invertible, there may be zero or infinitely many such solutions. As long as there is at least one solution to (9), we will see that we can give meaning to the expression $\lVert v - \hat{v}\rVert^2_C$ even when $C$ is only positive semidefinite.
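
The solvability condition is simply that $v - \hat{v}$ lies in $\mathcal{R}(C)$. As a quick illustration (a toy example of my own, continuing the low-rank setup above), the sketch below finds a solution of (9) with a least-squares solve when one exists, and shows that a generic $v$ fails the condition when $C$ is rank deficient.

```python
import numpy as np

rng = np.random.default_rng(3)
d, J = 10, 4
A = rng.standard_normal((d, J))
C = A @ A.T
v_hat = rng.standard_normal(d)

# Case 1: v - v_hat constructed to lie in R(C), so (9) is solvable.
v_in = v_hat + C @ rng.standard_normal(d)
b, *_ = np.linalg.lstsq(C, v_in - v_hat, rcond=None)
print(np.allclose(C @ b, v_in - v_hat))    # True: an exact solution exists

# Case 2: a generic v; with J < d it almost surely lies outside R(C).
v_out = v_hat + rng.standard_normal(d)
b, *_ = np.linalg.lstsq(C, v_out - v_hat, rcond=None)
print(np.allclose(C @ b, v_out - v_hat))   # False: (9) has no solution
```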

Joint optimization over $(v,b)$

Let's consider the case where there is at least one solution to (9); i.e., $v - \hat{v} \in \mathcal{R}(C)$. We define \mathcal{B}(v) := \left\{b \in \mathbb{R}^d : Cb = v - \hat{v} \right\}, \tag{10} the set of all solutions with respect to the input vector $v$. We therefore might consider generalizing (1) to \text{argmin}_{v \in \mathbb{R}^d} \min_{b \in \mathcal{B}(v)} \left(\frac{1}{2} \lVert y - h(v)\rVert_R^2 + \frac{1}{2} \langle b, v - \hat{v}\rangle\right), \tag{11} where we are implicitly restricting the solution space to those $v$ for which $\mathcal{B}(v)$ is non-empty. To encode this constraint more explicitly, we can modify (11) as \text{argmin}_{(v,b) \in \mathcal{V}} \tilde{J}(v,b), \tag{12} where \tilde{J}(v,b) := \frac{1}{2} \lVert y - h(v)\rVert_R^2 + \frac{1}{2} \langle b, v-\hat{v}\rangle \tag{13} and \mathcal{V} := \left\{(v,b) : Cb = v - \hat{v} \right\}. \tag{14} Note that if $C$ is positive definite, then (12) reduces to (1).

Removing dependence on $b$

In the previous section, we extended the optimization problem to consider optimizing over pairs $(v,b)$ in order to deal with the fact that $C$ may not be invertible. The result below shows that, for a fixed $v$, the objective $\tilde{J}(v,b)$ in (13) is actually independent of the particular choice of $b \in \mathcal{B}(v)$. This fact will allow us to remove the need to jointly optimize over $(v,b)$.

Proposition.
Let $v \in \mathbb{R}^d$ be a vector such that there is at least one solution $b \in \mathbb{R}^d$ to $Cb = v - \hat{v}$. Then $\tilde{J}(v,b)$, defined in (13), is constant over all choices of $b$ solving this linear system.

Proof. Let $b, b^\prime \in \mathcal{B}(v)$, so that $Cb = Cb^\prime = v - \hat{v}$. It thus follows that \langle b, v - \hat{v}\rangle - \langle b^\prime, v - \hat{v}\rangle = \langle b - b^\prime, v - \hat{v}\rangle = \langle b - b^\prime, Cb\rangle = \langle C(b - b^\prime), b\rangle = \langle 0, b\rangle = 0, where we have used the linearity of the inner product and the fact that $C$ is symmetric. Since the inner product term in (13) is the only portion with dependence on $b$, it follows that $\tilde{J}(v,b) = \tilde{J}(v,b^\prime)$. $\qquad \blacksquare$

Thus, for each $v$ with $\mathcal{B}(v)$ non-empty, we can simply pick any element $b \in \mathcal{B}(v)$ to insert into $\langle b, v-\hat{v}\rangle$. The objective will be well-defined since the above result verifies that the specific choice of $b$ is inconsequential. A natural choice is the $b$ of minimal norm. That is, for $\mathcal{B}(v)$ non-empty, set b^{\dagger} := \text{argmin}_{b \in \mathcal{B}(v)} \lVert b \rVert. \tag{15} This unique minimal-norm solution is guaranteed to exist and is conveniently given by the Moore-Penrose pseudoinverse: b^{\dagger} = C^{\dagger}(v - \hat{v}) = (AA^\top)^{\dagger}(v - \hat{v}). \tag{16} Note that when $C$ is positive definite, $C^{\dagger} = C^{-1}$. We can now eliminate the requirement to optimize over $b$ in (12), (13), and (14). The optimization problem now assumes the form \begin{align} v_{\star} &:= \text{argmin}_{v \in \mathcal{V}} \tilde{J}(v) \tag{17} \newline \tilde{J}(v) &:= \frac{1}{2} \lVert y - h(v)\rVert_R^2 + \frac{1}{2} \langle (AA^\top)^{\dagger}(v - \hat{v}), v-\hat{v}\rangle \tag{18} \newline \mathcal{V} &:= \{v \in \mathbb{R}^d : v-\hat{v} \in \mathcal{R}(A)\}. \tag{19} \end{align}
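
The following sketch (a toy numerical check under the same low-rank setup as before) verifies both the proposition and the minimal-norm property of (16): two different solutions of $Cb = v - \hat{v}$, the pseudoinverse solution and one perturbed by a null-space component, give the same value of $\langle b, v - \hat{v}\rangle$, and the pseudoinverse solution has the smaller norm.

```python
import numpy as np

rng = np.random.default_rng(4)
d, J = 10, 4
A = rng.standard_normal((d, J))
C = A @ A.T
v_hat = rng.standard_normal(d)
v = v_hat + C @ rng.standard_normal(d)        # ensures v - v_hat lies in R(C)

# Minimal-norm solution (16) via the Moore-Penrose pseudoinverse.
b_min = np.linalg.pinv(C) @ (v - v_hat)

# A second solution: add an element of N(C), the orthogonal complement of R(A).
P = A @ np.linalg.pinv(A)                     # orthogonal projector onto R(A)
n = (np.eye(d) - P) @ rng.standard_normal(d)
b_other = b_min + n
print(np.allclose(C @ b_min, v - v_hat), np.allclose(C @ b_other, v - v_hat))   # True True

# The inner product term is identical for both solutions, as the proposition asserts.
print(np.isclose(b_min @ (v - v_hat), b_other @ (v - v_hat)))    # True
print(np.linalg.norm(b_min) < np.linalg.norm(b_other))           # True: b_min has minimal norm
```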

An important consequence of this formulation is that any solution $v_{\star}$ must lie in the affine space $\hat{v} + \mathcal{R}(A)$; this is immediate from the definition of $\mathcal{V}$. In particular, if $C$ is positive definite then $\mathcal{R}(A) = \mathbb{R}^d$, so the solution space is unconstrained. Otherwise, the rank of $C$ imposes a constraint on the solution space.

Proposition.
A solution to the optimization problem (17) is constrained to the affine space $\hat{v} + \mathcal{R}(A) = \hat{v} + \mathcal{R}(C)$, which has dimension at most $J$.

Unconstrained Optimization

We go one step further in the problem formulation before turning to solutions in the next section. Note that, in contrast to (1), (17) is a constrained optimization problem. We now re-formulate (17) as an unconstrained problem, which will be more convenient when considering solutions. As noted above, the solution space of (17) is given by the affine space $\hat{v} + \mathcal{R}(A)$. Any vector $v$ in this space can thus be written as v = \hat{v} + \sum_{j=1}^{J} w_j a_j = \hat{v} + Aw, \tag{20} for some weight vector $w := (w_1, \dots, w_J)^\top \in \mathbb{R}^J$, where we have denoted the $j^{th}$ column of $A$ by $a_j \in \mathbb{R}^d$. If the columns of $A$ are linearly independent, then $a_1, \dots, a_J$ provide a basis for $\mathcal{R}(A)$. If not, then there will be some redundancy in the weights and the representation (20) will not be unique. We can now substitute $\hat{v} + Aw$ for $v$ in (18) and thus formulate the optimization over the weights $w$: \tilde{J}(w) := \frac{1}{2} \lVert y - h(\hat{v} + Aw)\rVert_R^2 + \frac{1}{2} \langle (AA^\top)^{\dagger}Aw, Aw\rangle. \tag{21}

We can simplify this even further using the following properties of the pseudoinverse:

  1. By definition, the pseudoinverse satisfies $A = A(A^{\dagger}A)$.
  2. The matrix $A^{\dagger}A$ is a projection onto the rowspace of $A$; in particular, it is idempotent (i.e., $(A^{\dagger}A)^2 = A^{\dagger}A$) and symmetric.
  3. The following identity holds: $A^{\dagger} = A^\top (AA^\top)^{\dagger}$.
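
These properties are easy to sanity-check numerically; the following sketch (a toy check of my own, using a deliberately rank-deficient $A$ so that $A^{\dagger}A$ is a non-trivial projection) verifies all three.

```python
import numpy as np

rng = np.random.default_rng(5)
d, J = 10, 4
# Rank-deficient A (rank 3 < J), so that A^+ A is not the identity on R^J.
A = rng.standard_normal((d, 3)) @ rng.standard_normal((3, J))
A_pinv = np.linalg.pinv(A)
P = A_pinv @ A                                      # the matrix A^+ A

print(np.allclose(A, A @ P))                        # property 1: A = A (A^+ A)
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # property 2: idempotent and symmetric
print(np.allclose(A_pinv, A.T @ np.linalg.pinv(A @ A.T)))   # property 3: A^+ = A^T (A A^T)^+
```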

We can thus simplify the second term in (21) as \begin{align} \langle (AA^\top)^{\dagger}Aw, Aw\rangle = \langle A^\top(AA^\top)^{\dagger}Aw, w\rangle = \langle (A^{\dagger}A)w, w\rangle &= \langle (A^{\dagger}A)^2w, w\rangle \newline &= \langle (A^{\dagger}A)w, (A^{\dagger}A)w\rangle \newline &= \lVert (A^{\dagger}A)w \rVert^2, \tag{22} \end{align} where the second equality uses the third pseudoinverse property above, and the third and fourth equalities follow from the fact that $A^{\dagger}A$ is idempotent and symmetric. We can thus re-parameterize as \tilde{w} := (A^{\dagger}A)w, \tag{23} and note that by the first pseudoinverse property we have Aw = A(A^{\dagger}A)w = A\tilde{w}, \tag{24} which allows us to replace $Aw$ by $A\tilde{w}$ in the first term of (21). We obtain \tilde{J}(\tilde{w}) := \frac{1}{2} \lVert y - h(\hat{v} + A\tilde{w})\rVert_R^2 + \frac{1}{2} \lVert \tilde{w} \rVert^2 . \tag{25} We see that (25) is the sum of a model-data fit term and a simple $L^2$ penalty on the weights. This final formulation is summarized below, where we have re-labelled $\tilde{w}$ as $w$ to lighten notation.

Unconstrained Formulation.
The optimization problem in (17) can equivalently be formulated as the unconstrained problem w_{\star} := \text{argmin}_{w \in \mathbb{R}^{J}} \tilde{J}(w), \qquad v_{\star} = \hat{v} + Aw_{\star}, \tag{26} with objective function \tilde{J}(w) := \frac{1}{2} \lVert y - h(\hat{v} + Aw)\rVert_R^2 + \frac{1}{2} \lVert w \rVert^2. \tag{27}
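
As a consistency check on this equivalence, the sketch below (again a toy example; the nonlinear map $h$ is an arbitrary illustrative choice) evaluates the state-space objective (18) at $v = \hat{v} + Aw$ and the weight-space objective (27) at the projected weights $\tilde{w} = A^{\dagger}Aw$, and confirms that the two values agree and that projecting the weights cannot increase the objective.

```python
import numpy as np

rng = np.random.default_rng(6)
d, p, J = 8, 3, 4
# Rank-deficient A so that the projection A^+ A is non-trivial.
A = rng.standard_normal((d, 3)) @ rng.standard_normal((3, J))
H = rng.standard_normal((p, d))
h = lambda v: np.tanh(H @ v)               # arbitrary toy nonlinear observation map
N = rng.standard_normal((p, p)); R = N @ N.T + np.eye(p)
Rinv = np.linalg.inv(R)
v_hat = rng.standard_normal(d)
y = rng.standard_normal(p)

def J_state(v):
    """Objective (18): state-space form with the pseudoinverse penalty."""
    r = y - h(v)
    dv = v - v_hat
    return 0.5 * r @ Rinv @ r + 0.5 * dv @ np.linalg.pinv(A @ A.T) @ dv

def J_weight(w):
    """Objective (27): unconstrained weight-space form."""
    r = y - h(v_hat + A @ w)
    return 0.5 * r @ Rinv @ r + 0.5 * w @ w

w = rng.standard_normal(J)
w_tilde = np.linalg.pinv(A) @ A @ w        # projected weights (23)
print(np.isclose(J_state(v_hat + A @ w), J_weight(w_tilde)))   # True: (18) and (27) agree
print(J_weight(w_tilde) <= J_weight(w) + 1e-12)                # True: projection cannot increase (27)
```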

One might initially take issue with the claim that (26) is actually unconstrained, given that it relies on the re-parameterization (23), which constrains the weights to lie in the range of $A^{\dagger}A$. However, recalling that $A^{\dagger}A$ is an orthogonal projection, we have the following projection property: \lVert \tilde{w} \rVert^2 = \lVert (A^{\dagger}A)w \rVert^2 \leq \lVert w \rVert^2. \tag{28} In other words, replacing any weight vector $w$ by its projection $\tilde{w}$ leaves the first term in (27) unchanged (by (24)) and can only decrease the penalty term. Minimizing over all of $\mathbb{R}^J$ therefore yields the same minimum as minimizing over the range of $A^{\dagger}A$, so we are justified in treating the weights as unconstrained in (26). In fact, we could have jumped right to this conclusion from (22) and avoided the need to define $\tilde{w}$ at all.

Solving the Optimization Problem

With the optimization problem (17) defined, we now consider the characterization of its solution. In general, this is challenging due to the potential nonlinearity of $h(\cdot)$. We start by considering the simplified linear case, before returning to the more general setting.

Linear $h(\cdot)$

Assume that $h(v) = Hv$ for some matrix $H \in \mathbb{R}^{p \times d}$. Under a data assimilation interpretation, this corresponds to the common assumption of a linear observation operator.
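
In this case the weight-space objective (27) is quadratic in $w$, so it can be minimized by solving its normal equations. The sketch below (my own illustration, ahead of the detailed treatment) does exactly this and checks numerically that the resulting $v_{\star} = \hat{v} + Aw_{\star}$ coincides with the Kalman-type update computed directly from $C = AA^\top$.

```python
import numpy as np

rng = np.random.default_rng(7)
d, p, J = 8, 3, 4
A = rng.standard_normal((d, J))
C = A @ A.T                                  # rank-deficient covariance factorization
H = rng.standard_normal((p, d))
N = rng.standard_normal((p, p)); R = N @ N.T + np.eye(p)
Rinv = np.linalg.inv(R)
v_hat = rng.standard_normal(d)
y = rng.standard_normal(p)

# With h(v) = Hv, objective (27) is quadratic in w; its normal equations read
#   (A^T H^T R^{-1} H A + I) w = A^T H^T R^{-1} (y - H v_hat).
HA = H @ A
innov = y - H @ v_hat
w_star = np.linalg.solve(HA.T @ Rinv @ HA + np.eye(J), HA.T @ Rinv @ innov)
v_star = v_hat + A @ w_star

# Compare with the Kalman-style analysis mean built from C = A A^T.
K = C @ H.T @ np.linalg.inv(H @ C @ H.T + R)
print(np.allclose(v_star, v_hat + K @ innov))   # True
```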
