Approximating Nonlinear Functions of Gaussians, Part I - Linearized Kalman Filter Extensions
I discuss the generic problem of approximating the distribution resulting from a non-linear transformation of a Gaussian random variable, and then show how this leads to extensions of the Kalman filter which yield approximate filtering algorithms in the non-linear setting.
In a previous post, I discussed the Kalman filter (KF), which provides the closed-form mean and covariance recursions characterizing the Gaussian filtering distributions for linear Gaussian hidden Markov models. In this post, we retain the Gaussian noise assumption, but generalize to nonlinear dynamics and observation operators. Our primary focus is the additive noise model \begin{align} v_{k+1} &= g(v_k) + \eta_{k+1}, && \eta_{k+1} \sim \mathcal{N}(0, Q) \tag{1} \newline y_{k+1} &= h(v_{k+1}) + \epsilon_{k+1}, && \epsilon_{k+1} \sim \mathcal{N}(0, R) \newline v_0 &\sim \mathcal{N}(m_0, C_0), &&\{\epsilon_k\} \perp \{\eta_k\} \perp v_0 \end{align} but we will also touch on the case where the noise itself is subject to the nonlinear mapping; i.e., \begin{align} v_{k+1} &= g(v_k, \eta_{k+1}) && \eta_{k+1} \sim \mathcal{N}(0, Q) \tag{2} \newline y_{k+1} &= h(v_{k+1}, \epsilon_{k+1}) && \epsilon_{k+1} \sim \mathcal{N}(0, R) \newline v_0 &\sim \mathcal{N}(m_0, C_0), &&\{\epsilon_k\} \perp \{\eta_k\} \perp v_0 \end{align} Note that even this more general formulation is still a special case of the generic Bayesian filtering problem, since we are still restricting to Gaussian noise and a Gaussian initial condition.
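To make the additive-noise model (1) concrete, here is a minimal simulation sketch. The particular choices of `g`, `h`, the dimensions, and the noise covariances are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonlinear operators g and h (illustrative choices only).
g = lambda v: np.array([v[0] + 0.1 * np.sin(v[1]), 0.9 * v[1]])  # dynamics, R^2 -> R^2
h = lambda v: np.array([v[0] ** 2])                              # observation, R^2 -> R^1

Q = 0.01 * np.eye(2)                 # dynamics noise covariance
R = 0.1 * np.eye(1)                  # observation noise covariance
m0, C0 = np.zeros(2), np.eye(2)      # initial condition v_0 ~ N(m0, C0)

# Simulate model (1): v_{k+1} = g(v_k) + eta_{k+1},  y_{k+1} = h(v_{k+1}) + eps_{k+1}.
v = rng.multivariate_normal(m0, C0)
states, observations = [], []
for k in range(50):
    v = g(v) + rng.multivariate_normal(np.zeros(2), Q)
    y = h(v) + rng.multivariate_normal(np.zeros(1), R)
    states.append(v)
    observations.append(y)
```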
Let $Y_k := \{y_1, \dots, y_k\}$ denote the set of observations up through time step $k$. We seek to characterize the filtering distributions $v_k|Y_k$, and to update them in an online fashion as more data arrives. I will denote the density (or, more generally, Radon-Nikodym derivative) of $v_k|Y_k$ by $\pi_k$. In the linear Gaussian setting these distributions are Gaussian, and can be computed analytically. The introduction of nonlinearity renders the filtering distributions non-Gaussian and, in general, analytically intractable, motivating the need for approximations. Certain methods, such as particle filters, are designed to handle situations where the departure from Gaussianity is severe. The methods discussed in this post, however, are applicable when Gaussian approximations are still reasonable. Given this, the algorithms discussed here all proceed by approximating the current filtering distribution by a Gaussian $\pi_k \approx \mathcal{N}(m_k, C_k)$. The algorithms then differ in the approximations they employ to deal with the nonlinear functions $g$ and $h$ in order to arrive at a Gaussian approximation $\pi_{k+1} \approx \mathcal{N}(m_{k+1}, C_{k+1})$ of the subsequent filtering distribution.
There are many methods that fit into this general Gaussian approximation framework. In this post we focus on methods rooted in linearization of the nonlinear functions. My plan is to follow this up with a post on the unscented Kalman filter (which utilizes a quadrature-based approximation) and then a bunch of posts on the ensemble Kalman filter (which opts for a Monte Carlo approach). Although the motivation stems from the filtering problem, we will focus mostly on the underlying fundamental problem here: approximating the distribution of a nonlinear function of a Gaussian random variable. The next section illustrates how this problem arises in the filtering context.
Motivating the Generic Problem
We recall from the post on Bayesian filtering that the map $\pi_k \mapsto \pi_{k+1}$ naturally decomposes into two steps: the forecast step and the analysis step. We assume here that the Gaussian approximation $\pi_k \approx \mathcal{N}(m_k, C_k)$ has been invoked, and consider approximations of the resulting map from $\mathcal{N}(m_k, C_k)$ to $\pi_{k+1}$.
Forecast
The forecast distribution is the distribution of $v_{k+1}|Y_k$, implied by feeding $v_k|Y_k$ through the stochastic dynamics model. In the additive noise case (1), we observe that this comes down to approximating the distribution of the random variable $g(v_k) + \eta_{k+1}$, where $v_k|Y_k \approx \mathcal{N}(m_k, C_k)$. However, since we will be invoking a Gaussian approximation of $g(v_k)$, in addition to the assumption that $g(v_k)$ and $\eta_{k+1}$ are independent, we can just focus our attention on the $g(v_k)$ term. Due to independence, once we obtain a Gaussian approximation $g(v_k) \approx \mathcal{N}(\hat{m}, \hat{C})$, the approximation $g(v_k) + \eta_{k+1} \approx \mathcal{N}(\hat{m}, \hat{C} + Q)$ is immediate.
In the non-additive noise case (2), the problem similarly comes down to approximating the distribution of $g(v_k, \eta_{k+1})$. Again, due to the independence assumptions, we have that $v_k|Y_k$ and $\eta_{k+1}$ are jointly distributed as \begin{align} \begin{bmatrix} v_k \newline \eta_{k+1} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} m_k \newline 0 \end{bmatrix}, \begin{bmatrix} C_k & 0 \newline 0 & Q \end{bmatrix} \right). \end{align} Thus, this situation also reduces to approximating the distribution of a nonlinear map of a Gaussian; in this case, the map is $(v_k, \eta_{k+1}) \mapsto g(v_k, \eta_{k+1})$, where $(v_k, \eta_{k+1})$ is the Gaussian input.
Analysis
Let’s now suppose that we have the forecast approximation $v_{k+1}|Y_k \approx \mathcal{N}(\hat{m}_{k+1}, \hat{C}_{k+1})$ in hand. The map from forecast to filtering distribution is defined by the action of conditioning on the new data $y_{k+1}$. In the additive noise case (1), this entails the following application of Bayes’ theorem, \begin{align} \pi_{k+1}(v_{k+1}) &\propto \mathcal{N}(y_{k+1}|h(v_{k+1}), R)\mathcal{N}(v_{k+1}|\hat{m}_{k+1}, \hat{C}_{k+1}). \end{align} Although everything is Gaussian here, the nonlinear function $h$ breaks the Gaussianity of $\pi_{k+1}$ in general. One idea to deal with this might be to run MCMC and invoke the approximation $\pi_{k+1} \approx \mathcal{N}(m_{k+1}, C_{k+1})$, with $m_{k+1}$ and $C_{k+1}$ set to their empirical estimates computed from the MCMC samples. This has the distinct disadvantage of requiring an MCMC run at every time step.
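For concreteness, the MCMC idea just described might look like the following sketch: a simple random-walk Metropolis sampler targeting $\pi_{k+1}$, followed by empirical moment matching. The proposal scale, iteration count, and burn-in rule are assumptions made purely for illustration.

```python
import numpy as np

def mcmc_gaussian_approx(y, h, R, m_hat, C_hat, n_iter=5000, step=0.2, seed=0):
    """Moment-match a Gaussian to pi_{k+1}(v) proportional to N(y | h(v), R) N(v | m_hat, C_hat)
    using random-walk Metropolis (illustrative baseline only)."""
    rng = np.random.default_rng(seed)
    R_inv, C_inv = np.linalg.inv(R), np.linalg.inv(C_hat)

    def log_post(v):
        r, d = y - h(v), v - m_hat
        return -0.5 * (r @ R_inv @ r + d @ C_inv @ d)

    v = np.array(m_hat, dtype=float)
    lp = log_post(v)
    samples = []
    for _ in range(n_iter):
        prop = v + step * rng.standard_normal(v.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # Metropolis accept/reject
            v, lp = prop, lp_prop
        samples.append(v)
    S = np.asarray(samples[n_iter // 2:])            # discard first half as burn-in
    return S.mean(axis=0), np.cov(S, rowvar=False)   # empirical (m_{k+1}, C_{k+1})
```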
In order to motivate alternative approximation methods, it is useful to recall the joint Gaussian view of the analysis step, which I discuss in this post. The idea there was that, in the linear Gaussian setting, $(v_{k+1}, y_{k+1})|Y_k$ has a joint Gaussian distribution. The filtering distribution is then obtained as a conditional distribution of this joint Gaussian, which is available in closed form. In the present setting of the additive noise model (1), this joint distribution is given by
\begin{align} \begin{bmatrix} v_{k+1} \newline y_{k+1} \end{bmatrix} \bigg| Y_k &= \begin{bmatrix} \hat{v}_{k+1} \newline h(\hat{v}_{k+1}) + \epsilon_{k+1} \end{bmatrix}, && \hat{v}_{k+1} \sim \mathcal{N}(\hat{m}_{k+1}, \hat{C}_{k+1}) \end{align}
which is again generally non-Gaussian due to the nonlinearity $h$. This perspective points to the idea of approximating this joint distribution as a Gaussian, so that an approximation of the filtering distribution then falls out as a conditional. Notice that we have found ourselves in a very similar situation to the forecast step, in that we again want to approximate the nonlinear mapping of a Gaussian with a Gaussian. The problem is thus to furnish a Gaussian approximation of \begin{align} \tilde{h}(\hat{v}_{k+1}, \epsilon_{k+1}) &= \begin{bmatrix} \hat{v}_{k+1} \newline h(\hat{v}_{k+1}) + \epsilon_{k+1} \end{bmatrix}, && \hat{v}_{k+1} \sim \mathcal{N}(\hat{m}_{k+1}, \hat{C}_{k+1}), \ \epsilon_{k+1} \sim \mathcal{N}(0, R). \tag{5} \end{align} In the non-additive error case (2), $h(\hat{v}_{k+1}, \epsilon_{k+1})$ replaces $h(\hat{v}_{k+1}) + \epsilon_{k+1}$ in the above expression. Note that the independence assumptions imply that $(\hat{v}_{k+1}, \epsilon_{k+1})$ is jointly Gaussian, so $\tilde{h}(\hat{v}_{k+1}, \epsilon_{k+1})$ is indeed a nonlinear map of a Gaussian.
The Generic Problem Setting
Now that we have identified the fundamental issues in the context of nonlinear filtering, we state the problem in generic terms. The notation used in this section should be viewed anew, not to be confused with the state space notation used above. The task is to provide a Gaussian approximation to a random variable $u = f(v)$, where $v$ is Gaussian-distributed and $f$ is a nonlinear function; more precisely, \begin{align} u &= f(v), && v \sim \mathcal{N}(m, C), \quad f: \mathbb{R}^n \to \mathbb{R}^m. \tag{6} \end{align} In the filtering context, the forecast step represented an instantiation of this problem with $f = g$, and hence a special case in which the dimensions of the domain and codomain of $f$ are equal. In the analysis step, $f$ is given by the map $v \mapsto (v, h(v))$ (ignoring the additive noise $\epsilon_{k+1}$ for now) and thus represents the case in which the codomain has larger dimension than the domain. Although both of these cases are subsumed by (6), it is also helpful to consider them separately, as the second case has special structure which can present a more challenging problem. We thus define \begin{align} \tilde{u} &= \tilde{f}(v) := \begin{bmatrix} v \newline f(v) \end{bmatrix}, \tag{7} \end{align} which captures this special case. With the generic problem stated, we now proceed to discuss specific methods which utilize different notions of linearization to produce Gaussian approximations of the distributions of $u$ and $\tilde{u}$.
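Throughout, it can be useful to have a brute-force Monte Carlo reference for (6) and (7): sample $v$, push the samples through $f$, and inspect the (generally non-Gaussian) empirical distribution that the methods below will approximate with a Gaussian. The particular `f`, `m`, and `C` in this sketch are placeholders, not tied to any specific model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder Gaussian input and nonlinear map f: R^2 -> R (illustrative only).
m = np.array([1.0, 0.0])
C = np.array([[0.3, 0.1],
              [0.1, 0.2]])
f = lambda v: np.array([np.exp(v[0]) * np.cos(v[1])])

# Brute-force reference for (6) and (7): sample v ~ N(m, C), push samples through f.
V = rng.multivariate_normal(m, C, size=100_000)
U = np.apply_along_axis(f, 1, V)        # samples of u = f(v)
U_tilde = np.hstack([V, U])             # samples of u_tilde = (v, f(v))

# Empirical moments, to compare against the Gaussian approximations derived below.
print("mean of u:", U.mean(axis=0))
print("cov of u_tilde:\n", np.cov(U_tilde, rowvar=False))
```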
Taylor Series Approximations
The first approach we consider leverages a Taylor series approximation of the nonlinear function $f$. When applied to the filtering problem, the resulting algorithm is known as the extended Kalman filter. We note that higher-order Taylor approximations are also possible in certain settings, but we restrict to first-order approximations here.
The Generic Method
We consider approximating the nonlinear function $f$ with a local linear approximation, given by the first-order Taylor series expansion around the current mean $m$, \begin{align} f(v) \approx f(m) + Df(m)[v - m]. \end{align} Note that I am applying the Jacobian notation, so that $[Df(m)]_{ij} = \frac{\partial f_i}{\partial v_j}(m)$. Under this approximation, we use the fact that $v \sim \mathcal{N}(m, C)$, together with the fact that affine maps of Gaussians remain Gaussian, to obtain \begin{align} u = f(v) & \approx f(m) + Df(m)[v - m] \tag{8} \newline &\sim \mathcal{N}(f(m), [Df(m)]C [Df(m)]^\top). \end{align}
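A minimal sketch of (8) is given below. Since no analytic derivative is assumed, the Jacobian is approximated by finite differences; `f`, `m`, and `C` are the same illustrative placeholders used above.

```python
import numpy as np

def numerical_jacobian(f, m, eps=1e-6):
    """Finite-difference Jacobian Df(m); a stand-in for an analytic derivative."""
    fm = np.atleast_1d(f(m))
    J = np.zeros((fm.size, m.size))
    for j in range(m.size):
        dm = np.zeros_like(m, dtype=float)
        dm[j] = eps
        J[:, j] = (np.atleast_1d(f(m + dm)) - fm) / eps
    return J

def linearize_gaussian(f, m, C):
    """Gaussian approximation (8): u = f(v), v ~ N(m, C)  =>  u approx N(f(m), Df(m) C Df(m)^T)."""
    J = numerical_jacobian(f, m)
    return np.atleast_1d(f(m)), J @ C @ J.T
```

For example, `mean_u, cov_u = linearize_gaussian(f, m, C)` can be compared directly against the Monte Carlo moments computed in the earlier snippet.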
The situation for $\tilde{u} = \tilde{f}(v)$ is quite similar: \begin{align} \tilde{f}(v) &\approx \tilde{f}(m) + D\tilde{f}(m)[v - m] \tag{9} \newline &= \begin{bmatrix} m \newline f(m) \end{bmatrix} + \begin{bmatrix} I \newline Df(m) \end{bmatrix}[v-m] \newline &\sim \mathcal{N}\left(\begin{bmatrix} m \newline f(m) \end{bmatrix}, \begin{bmatrix} I \newline Df(m) \end{bmatrix} C \begin{bmatrix} I \newline Df(m) \end{bmatrix}^\top \right) \newline &= \mathcal{N}\left(\begin{bmatrix} m \newline f(m) \end{bmatrix}, \begin{bmatrix} C & C[Df(m)]^\top \newline [Df(m)]C & [Df(m)]C[Df(m)]^\top \end{bmatrix} \right) \end{align} where the final equality simply expands the covariance product. It is important to stress that these are local approximations; the linearization is constructed using only the local derivative information at the point $m$. We would therefore expect the quality of the approximation to decay for points farther from $m$, and this decay to be more severe for functions $f$ which are highly nonlinear. Intuitively, then, we would expect the approximation (8) to be reasonable when the distribution of $v$ is tightly clustered around its mean. Distributions that are more diffuse will naturally lead to poorer approximations, given that more of the probability mass lies in regions where the local linear approximation is not adequate. The situation in (9) presents an even greater concern; the quality of this approximation relies on the joint distribution of $(v, f(v))$ staying close to its mean. Not only does this require the current distribution of $v$ to be concentrated about $m$, but also the image $f(v)$ to be clustered about $f(m)$. Thus, even if $v$ is tightly bound to its mean, a highly nonlinear map $f$ has the potential to yield a large spread of points in the codomain, and thus reduce the quality of the approximation.
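The corresponding sketch for (9) simply stacks the identity and Jacobian blocks; it reuses the hypothetical `numerical_jacobian` helper from the previous snippet.

```python
import numpy as np

def linearize_gaussian_joint(f, m, C):
    """Gaussian approximation (9) of u_tilde = (v, f(v)), v ~ N(m, C)."""
    J = numerical_jacobian(f, m)                      # finite-difference Df(m) from above
    mean = np.concatenate([m, np.atleast_1d(f(m))])
    cov = np.block([[C,      C @ J.T],
                    [J @ C,  J @ C @ J.T]])
    return mean, cov
```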
Application: The Extended Kalman Filter
We now apply these generic equations to the filtering settings (1) and (2), again breaking the problem into the forecast and analysis steps. The resulting approximate filtering algorithm is called the extended Kalman filter (EKF).
Forecast
Assume the filtering distribution at time $k$ is approximated by $v_k|Y_k \approx \mathcal{N}(m_k, C_k)$. Starting with the additive noise model (1), we see that we must approximate the distribution of $g(v_k) + \eta_{k+1}$. Applying (8) to $g(v_k)$ and then adding the independent Gaussian noise $\eta_{k+1} \sim \mathcal{N}(0, Q)$ yields the approximate forecast distribution \begin{align} \hat{v}_{k+1} := v_{k+1}|Y_k \sim \mathcal{N}(g(m_k), [Dg(m_k)]C_k [Dg(m_k)]^\top + Q). \tag{10} \end{align} This is quite similar to the forecast distribution for the Kalman filter, which is $\mathcal{N}(Gm_k, GC_kG^\top + Q)$ for the linear forward model $g(v) = Gv$. We see that the EKF forecast covariance is equivalent to that obtained from the Kalman filter applied with the linearized forward model $G := Dg(m_k)$.
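In code, the additive-noise EKF forecast (10) is a direct application of the linearization snippet above plus the noise covariance; `g` is the hypothetical dynamics function introduced earlier and the Jacobian is again a finite-difference stand-in.

```python
import numpy as np

def ekf_forecast_additive(g, m_k, C_k, Q):
    """EKF forecast (10) for the additive-noise model (1)."""
    G = numerical_jacobian(g, m_k)       # Dg(m_k), finite-difference stand-in
    m_hat = np.atleast_1d(g(m_k))
    C_hat = G @ C_k @ G.T + Q
    return m_hat, C_hat
```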
The case of the non-additive noise model (2) is similar, but now we must approximate the distribution of $g(v_k, \eta_{k+1})$. Recall that $(v_k, \eta_{k+1})$ is jointly Gaussian distributed, with mean $(m_k, 0)$ and block-diagonal covariance. Applying (8) thus yields \begin{align} v_{k+1}|Y_k &\sim \mathcal{N}\left(g(m_k,0), [Dg(m_k,0)]\begin{bmatrix} C_k & 0 \newline 0 & Q \end{bmatrix} [Dg(m_k,0)]^\top\right) \newline &= \mathcal{N}\left(g(m_k,0), [D_vg(m_k,0)]C_k [D_vg(m_k,0)]^\top + [D_{\eta}g(m_k,0)]Q [D_{\eta}g(m_k,0)]^\top\right), \tag{11} \end{align} where the second line follows from the block-diagonal structure of the covariance, and the subscripts $v$, $\eta$ indicate partial derivatives with respect to the respective arguments. Note the similarity between (10) and (11). The general form is the same, but the non-additive case requires derivatives with respect to the noise in order to approximate the effect of pushing $\eta_{k+1}$ through the nonlinear forward model. Following our intuition on when we expect the Taylor series linearization to be reasonable, we now observe that the approximation may deteriorate when either the current state or the stochastic noise is highly variable, in which case significant probability mass may be present in regions far from the point about which the Taylor series is expanded.
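The non-additive forecast (11) can be sketched by linearizing the map $(v, \eta) \mapsto g(v, \eta)$ in both arguments. Below, `g2` denotes a hypothetical two-argument dynamics function, and the partial Jacobians are again finite-difference approximations obtained by stacking the inputs.

```python
import numpy as np

def ekf_forecast_nonadditive(g2, m_k, C_k, Q):
    """EKF forecast (11) for model (2), where g2(v, eta) maps state and noise jointly."""
    n, p = m_k.size, Q.shape[0]
    z0 = np.concatenate([m_k, np.zeros(p)])                  # linearization point (m_k, 0)
    g_stacked = lambda z: np.atleast_1d(g2(z[:n], z[n:]))
    J = numerical_jacobian(g_stacked, z0)
    Jv, Je = J[:, :n], J[:, n:]                              # D_v g(m_k, 0), D_eta g(m_k, 0)
    m_hat = np.atleast_1d(g2(m_k, np.zeros(p)))
    C_hat = Jv @ C_k @ Jv.T + Je @ Q @ Je.T
    return m_hat, C_hat
```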
Analysis
Starting with the additive noise model, we recall that the analysis step requires approximation of (5). To this end, we apply (9) with $f := h$ and $v := \hat{v}_{k+1}$, where $\hat{v}_{k+1} \sim \mathcal{N}(\hat{m}_{k+1}, \hat{C}_{k+1})$. We actually require an approximation of $\tilde{h}(\hat{v}_{k+1}, \epsilon_{k+1})$, but due to independence we can simply add the noise covariance $R$ post-hoc. The combination of (9) with the addition of the noise term gives \begin{align} \begin{bmatrix} \hat{v}_{k+1} \newline h(\hat{v}_{k+1}) + \epsilon_{k+1} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \hat{m}_{k+1} \newline h(\hat{m}_{k+1}) \end{bmatrix}, \begin{bmatrix} \hat{C}_{k+1} & \hat{C}_{k+1}[Dh(\hat{m}_{k+1})]^\top \newline [Dh(\hat{m}_{k+1})]\hat{C}_{k+1} & [Dh(\hat{m}_{k+1})]\hat{C}_{k+1} [Dh(\hat{m}_{k+1})]^\top + R \end{bmatrix} \right). \tag{12} \end{align}
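As a sketch, the joint moments in (12) assemble as follows; `h` and the forecast moments are placeholders, and the Jacobian is again the hypothetical finite-difference helper from above. Conditioning this joint Gaussian on the observed $y_{k+1}$ is what then produces the EKF analysis update.

```python
import numpy as np

def ekf_joint_moments(h, m_hat, C_hat, R):
    """Joint Gaussian approximation (12) of (v_{k+1}, y_{k+1}) | Y_k under model (1)."""
    H = numerical_jacobian(h, m_hat)                         # Dh(m_hat_{k+1})
    mean = np.concatenate([m_hat, np.atleast_1d(h(m_hat))])
    cov = np.block([[C_hat,       C_hat @ H.T],
                    [H @ C_hat,   H @ C_hat @ H.T + R]])
    return mean, cov
```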