The Ensemble Kalman Filter
I introduce the ensemble Kalman filter as a Monte Carlo approximation to the Kalman filter in the linear Gaussian state space setting, then discuss how it is applied as an approximation even when these assumptions don't hold.
As discussed in a previous post, the Kalman filter (KF) provides the closed-form mean and covariance recursions characterizing the Gaussian filtering distributions for linear Gaussian hidden Markov models. In a follow-up post we relaxed the linearity assumption, and considered approximating nonlinear dynamics or observation operators using analytical linearizations, which in turn allowed us to fall back on the standard KF recursions. In this post, we introduce the Ensemble Kalman Filter (EnKF), another approximation scheme rooted in KF methodology. In place of gradient-based linearizations, the EnKF relies on Monte Carlo or particle-based approximations to address the nonlinearity.
In practice, the EnKF is commonly used as an approximation to the true Bayesian solution in settings with nonlinear dynamics or observation operators, and is in particular favored over competitors when the dimension of the state space is quite high (potentially in the millions or billions). In the linear Gaussian setting, it can be viewed as a Monte Carlo approximation to the KF. In this post our main focus will be on the discrete-time, nonlinear Gaussian state space model
\begin{align} v_{k+1} &= g(v_k) + \eta_{k+1}, && \eta_{k+1} \sim \mathcal{N}(0, Q) \tag{1} \newline y_{k+1} &= h(v_{k+1}) + \epsilon_{k+1}, && \epsilon_{k+1} \sim \mathcal{N}(0, R) \newline v_0 &\sim \mathcal{N}(m_0, C_0), &&\{\epsilon_k\} \perp \{\eta_k\} \perp v_0, \end{align} with states $v_k$, observations $y_k$, forward dynamics operator $g$, and observation operator $h$. In general, we will consider $g$ and $h$ to be nonlinear, though we will discuss simplifications when one or the other is linear. As in previous posts, we will use the notation $Y_k := \{y_1, \dots, y_k\}$ to denote the collection of observations through time $k$. We denote the forecast and filtering densities at time $k$ by $\hat{\pi}_k$ and $\pi_k$, respectively.
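To make the setup concrete, here is a minimal simulation sketch of model (1) in Python/NumPy. The specific choices of $g$, $h$, $Q$, $R$, and the dimensions are illustrative placeholders of my own, not anything prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder model ingredients (illustrative choices, not from the post).
d, p = 2, 1                        # state and observation dimensions (assumed)
g = lambda v: np.sin(v)            # nonlinear forward dynamics operator g
H_lin = np.array([[1.0, 0.0]])     # helper matrix used inside the nonlinear h below
h = lambda v: np.tanh(H_lin @ v)   # nonlinear observation operator h
Q = 0.1 * np.eye(d)                # dynamics noise covariance
R = 0.05 * np.eye(p)               # observation noise covariance
m0, C0 = np.zeros(d), np.eye(d)    # initial condition

def simulate(K):
    """Draw a trajectory v_0,...,v_K and observations y_1,...,y_K from model (1)."""
    v = rng.multivariate_normal(m0, C0)
    states, observations = [v], []
    for _ in range(K):
        v = g(v) + rng.multivariate_normal(np.zeros(d), Q)
        y = h(v) + rng.multivariate_normal(np.zeros(p), R)
        states.append(v)
        observations.append(y)
    return np.array(states), np.array(observations)

states, observations = simulate(K=50)
```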
Kalman Filter Review
Before introducing the EnKF we very briefly review the KF, which I discuss in detail in this post. The KF is applicable in the special case when the state space model (1) has linear dynamics and a linear observation operator; that is, $g(v) = Gv$ and $h(v) = Hv$ for some matrices $G$ and $H$. We recall that under these assumptions, the forecast and filtering distributions are both Gaussian and are given by
\begin{align} &v_{k+1}|Y_{k} \sim \mathcal{N}\left(\hat{m}_{k+1}, \hat{C}_{k+1} \right), &&v_{k+1}|Y_{k+1} \sim \mathcal{N}\left(m_{k+1}, C_{k+1} \right) \tag{2} \end{align}
where
\begin{align} \hat{C}_{k+1} &= G C_k G^\top + Q \newline \tag{3} \hat{m}_{k+1} &= Gm_k \newline C_{k+1} &= \left(H^\top R^{-1} H + \hat{C}^{-1}_{k+1}\right)^{-1} \newline m_{k+1} &= C_{k+1}\left(H^\top R^{-1}y_{k+1} + \hat{C}^{-1}_{k+1} \hat{m}_{k+1} \right). \end{align}
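As a quick reference, the recursions in (3) translate directly into code. The sketch below is a straightforward transcription using the precision form of the analysis update exactly as written in (3); in practice one would typically prefer the numerically stabler gain form, and the function name `kf_step` is simply a label of mine.

```python
import numpy as np

def kf_step(m, C, y, G, H, Q, R):
    """One Kalman filter step, transcribing the recursions in (3)."""
    # Forecast moments.
    m_hat = G @ m
    C_hat = G @ C @ G.T + Q
    # Analysis moments, written in the precision (information) form of (3).
    R_inv = np.linalg.inv(R)
    C_hat_inv = np.linalg.inv(C_hat)
    C_new = np.linalg.inv(H.T @ R_inv @ H + C_hat_inv)
    m_new = C_new @ (H.T @ R_inv @ y + C_hat_inv @ m_hat)
    return m_new, C_new
```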
The Big Picture
There are many different perspectives on the EnKF, so it is easy to get lost in the equations and terminology. We therefore take a moment to orient ourselves before proceeding with the EnKF algorithm. The problem we are trying to solve here is no different from that of the last couple of posts; namely, we want to characterize the filtering distributions at each time step in such a way that we can relatively cheaply compute the update once the new observation becomes available at the next time step. Given that the filtering distribution encodes a compromise between (1) the dynamical model's prediction and (2) the observed data, it is not surprising that this update boils down to (i) using the dynamical model to forecast one time step ahead, and then (ii) conditioning on the new data point. We can thus decompose the map $\pi_k \mapsto \pi_{k+1}$ as $\pi_k \mapsto \hat{\pi}_{k+1} \mapsto \pi_{k+1}$, with the two arrows encoding steps (i) and (ii), respectively.

In the linear Gaussian setting, all of these distributions turned out to be Gaussian, and hence characterizing these two updates reduced to deriving update equations for the mean and covariance. Under our current assumptions, this will not be the case in general. Despite this, the EnKF still leverages methodology rooted in the Kalman filter, and thus makes some implicit Gaussian assumptions. However, instead of propagating means and covariances, it propagates an ensemble of samples. Our update maps from above will thus assume the form $\{v_k^{(j)}\} \mapsto \{\hat{v}_{k+1}^{(j)}\} \mapsto \{v_{k+1}^{(j)}\}$ for an ensemble of samples indexed by $j = 1, \dots, J$. This sample-based approximation is in the spirit of the particle filter, but instead of trying to ensure the samples have exactly the correct distributions, the EnKF updates rely on approximations.

Given that the EnKF provides results that do not in general align with the exact Bayesian solution, there is a justifiable argument to abandon the Bayesian interpretation of the EnKF and instead view it as a derivative-free optimization algorithm. Indeed, the commonly cited metrics of the EnKF's superior performance over alternatives concern its ability to produce point predictions of the state. We will explore both of these perspectives throughout this post. The following section introduces the EnKF as a Monte Carlo approximation of the Bayesian filtering distributions, and a later section reinterprets the algorithm from an optimization perspective.
Introducing the EnKF: a Monte Carlo Approximation
Recall from the previous post that one way to view the challenge imposed by the nonlinear operators in (1) is that of approximating the distribution of nonlinear maps of Gaussian random variables. The extended Kalman filter addresses this challenge via derivative-based linearization. The EnKF is a derivative-free approach, instead opting for Monte Carlo based approximations. I will synonymously refer to such approximations as sample-based, ensemble-based, or particle-based. We now consider how to go about approximating both the forecast and analysis steps in this fashion.
Forecast
Suppose that we have access to an ensemble $\{v_k^{(j)}\}_{j=1}^{J}$ providing a particle-based approximation of the filtering distribution $\pi_k$. The goal is now to approximate the forecast distribution $\hat{\pi}_{k+1}$. To do so, we can define a new ensemble $\{\hat{v}_{k+1}^{(j)}\}_{j=1}^{J}$ by \begin{align} &\hat{v}_{k+1}^{(j)} := g(v_k^{(j)}) + \eta_{k+1}^{(j)}, &&\eta_{k+1}^{(j)} \overset{iid}{\sim} \mathcal{N}(0, Q). \tag{7} \end{align} This is a straightforward Monte Carlo approximation of $\hat{\pi}_{k+1}$; we have just fed the samples through the forward map $g$ and added the process noise. At this point, there are two sources of error in using $\{\hat{v}_{k+1}^{(j)}\}_{j=1}^{J}$ to approximate $\hat{\pi}_{k+1}$:
- Errors in the input ensemble accumulated from earlier steps of the algorithm; i.e., it may be that $v_k^{(j)}$ is not distributed according to $\pi_k$.
- Monte Carlo error stemming from the fact that we are using a finite ensemble size to represent the distribution.
The first source of error is systematic, while the latter can be reduced by increasing the ensemble size. At this point, we should note that the forecast step has not contributed any new systematic errors, instead just propagating existing ones. In other words, if $v_k^{(j)} \sim \pi_k$ for all $j$, then the only source of error in $\{\hat{v}_{k+1}^{(j)}\}_{j=1}^{J}$ would be of the Monte Carlo variety.
We let $\hat{m}_{k+1}$ and $\hat{C}_{k+1}$ denote the empirical (i.e., sample) mean and covariance matrix of the forecast ensemble. Note that since $\hat{\pi}_{k+1}$ may be non-Gaussian, it is likely not characterized by its first two moments alone. Nonetheless, we may consider a Gaussian approximation to the forecast distribution given by $\mathcal{N}(\hat{m}_{k+1}, \hat{C}_{k+1})$. Adopting such an approximation would introduce another source of systematic error stemming from the Gaussian approximation of a potentially non-Gaussian distribution.
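A minimal sketch of the forecast step (7), together with the empirical moments just described, might look as follows; the function name `forecast_step` and the row-per-particle storage are my own conventions.

```python
import numpy as np

rng = np.random.default_rng(1)

def forecast_step(ensemble, g, Q):
    """Propagate a filtering ensemble through (7); rows of `ensemble` are particles v_k^{(j)}."""
    J, d = ensemble.shape
    noise = rng.multivariate_normal(np.zeros(d), Q, size=J)
    v_hat = np.array([g(v) for v in ensemble]) + noise
    # Empirical (sample) mean and covariance of the forecast ensemble.
    m_hat = v_hat.mean(axis=0)
    C_hat = np.cov(v_hat, rowvar=False)
    return v_hat, m_hat, C_hat
```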
Analysis
We now focus on transforming the forecast ensemble $\{\hat{v}_{k+1}^{(j)}\}_{j=1}^{J}$ into a new ensemble $\{v_{k+1}^{(j)}\}_{j=1}^{J}$ that (approximately) encodes the operation of conditioning on the data $y_{k+1}$. We will pursue the same general strategy as we did with the extended Kalman filter in the previous post. Specifically, notice that the filtering distribution $\pi_{k+1}$ is a conditional of the joint distribution
\begin{align} &\begin{bmatrix} \hat{v}_{k+1} \newline \hat{y}_{k+1} \end{bmatrix} := \begin{bmatrix} v_{k+1} \newline y_{k+1} \end{bmatrix} \bigg| Y_k = \begin{bmatrix} v_{k+1} \newline h(v_{k+1}) + \epsilon_{k+1} \end{bmatrix} \bigg| Y_k. \tag{8} \end{align}
Note that $\hat{y}_{k+1}$ is the random variable equal in distribution to $y_{k+1}|Y_k$, the data distribution implied by the model forecast $\hat{v}_{k+1}$. The strategy is thus to approximate the joint distribution (8) with a Gaussian, and then use the standard Gaussian conditioning identities to obtain the conditional. We will utilize the forecast ensemble in deriving the Gaussian approximation. In particular, we consider a Gaussian approximation of (8) with mean and covariance set to the empirical mean and covariance of the particles $(\hat{v}_{k+1}^{(j)}, \hat{y}_{k+1}^{(j)})_{j=1}^{J}$, where
\begin{align} \hat{y}_{k+1}^{(j)} &:= h(\hat{v}^{(j)}_{k+1}) + \epsilon^{(j)}_{k+1}, &&\epsilon^{(j)}_{k+1} \overset{iid}{\sim} \mathcal{N}(0, R). \tag{9} \end{align}
Note that the notation $\hat{y}_{k+1}^{(j)}$ indicates that this quantity can be interpreted as a simulated observation at time $k+1$ under the assumption that the current state is equal to the forecast $\hat{v}_{k+1}^{(j)}$. Of course, this will typically differ from the data $y_{k+1}$ that is actually observed at this time step, and the discrepancy between these two quantities will help inform the adjustments to the forecast ensemble.
The Gaussian assumption yields the approximation
\begin{align} \begin{bmatrix} v_{k+1} \newline \hat{y}_{k+1} \end{bmatrix} \bigg| Y_k &\overset{d}{\approx} \mathcal{N}(\tilde{m}_{k+1}, \tilde{C}_{k+1}) := \mathcal{N}\left( \begin{bmatrix} \hat{m}_{k+1} \newline \hat{y}_{k+1} \end{bmatrix}, \begin{bmatrix} \hat{C}_{k+1} & \hat{C}^{vy}_{k+1} \newline \hat{C}^{yv}_{k+1} & \hat{C}^y_{k+1} \end{bmatrix} \right), \tag{10} \end{align}
where
\begin{align} \hat{m}_{k+1} &:= \frac{1}{J} \sum_{j=1}^{J} \hat{v}^{(j)}_{k+1} \tag{11} \newline \hat{y}_{k+1} &:= \frac{1}{J} \sum_{j=1}^{J} \hat{y}^{(j)}_{k+1} \newline \hat{C}_{k+1} &:= \frac{1}{J-1} \sum_{j=1}^{J} (\hat{v}_{k+1}^{(j)}-\hat{m}_{k+1})(\hat{v}_{k+1}^{(j)}-\hat{m}_{k+1})^\top \newline \hat{C}_{k+1}^{vy} &:= \frac{1}{J-1} \sum_{j=1}^{J} (\hat{v}_{k+1}^{(j)}-\hat{m}_{k+1})(\hat{y}_{k+1}^{(j)}-\hat{y}_{k+1})^\top \newline \hat{C}_{k+1}^{y} &:= \frac{1}{J-1} \sum_{j=1}^{J} (\hat{y}_{k+1}^{(j)}-\hat{y}_{k+1})(\hat{y}_{k+1}^{(j)}-\hat{y}_{k+1})^\top, \end{align} and $\hat{C}^{yv}_{k+1} := [\hat{C}^{vy}_{k+1}]^\top$. We have approximated the joint distribution with the Gaussian (10), whose mean and covariance have been furnished from empirical estimates derived from the forecast ensemble. Naturally, this approximation introduces a new source of systematic error in the algorithm. The exactness of the Gaussian approximation requires both (1) the forecast distribution $\hat{\pi}_{k+1}$ to be Gaussian; and (2) the observation operator $h$ to be linear. Each of these assumptions represents a potential source of systematic error. Of course, on top of this there is the Monte Carlo error in the sample mean and covariance estimates.
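Here is a sketch of how the simulated observations (9) and the empirical moments in (11) might be computed; the helper name `joint_moments` is mine, and the particles are assumed to be stored as rows of NumPy arrays.

```python
import numpy as np

rng = np.random.default_rng(2)

def joint_moments(v_hat, h, R):
    """Simulated observations (9) and the empirical moments (11)."""
    J = v_hat.shape[0]
    p = R.shape[0]
    y_hat = np.array([h(v) for v in v_hat]) + rng.multivariate_normal(np.zeros(p), R, size=J)
    m_hat = v_hat.mean(axis=0)           # sample mean of the forecast particles
    y_bar = y_hat.mean(axis=0)           # sample mean of the simulated observations
    dv = v_hat - m_hat
    dy = y_hat - y_bar
    C_hat = dv.T @ dv / (J - 1)          # \hat{C}_{k+1}
    C_vy = dv.T @ dy / (J - 1)           # \hat{C}^{vy}_{k+1}
    C_y = dy.T @ dy / (J - 1)            # \hat{C}^{y}_{k+1}
    return y_hat, m_hat, y_bar, C_hat, C_vy, C_y
```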
At this point, let's consider a few different ways we might leverage the joint Gaussian approximation (10) to generate the filtering ensemble $\{v_{k+1}^{(j)}\}_{j=1}^{J}$. The third method is the one actually employed by the EnKF.
Gaussian Conditional Moments
Perhaps the most straightforward idea is to apply the typical Gaussian conditioning identities to the joint Gaussian defined in (10). The conditional mean and covariance are given by

\begin{align} m_{k+1} &= \hat{m}_{k+1} + \hat{C}^{vy}_{k+1}[\hat{C}^y_{k+1}]^{-1}(y_{k+1}-\hat{y}_{k+1}) \tag{12} \newline C_{k+1} &= \hat{C}_{k+1} - \hat{C}^{vy}_{k+1}[\hat{C}^y_{k+1}]^{-1}\hat{C}^{yv}_{k+1}. \end{align}

Denoting the resulting conditional mean and covariance by $m_{k+1}$ and $C_{k+1}$, we could take $\mathcal{N}(m_{k+1}, C_{k+1})$ as a Gaussian approximation of the filtering distribution $\pi_{k+1}$ at time $k+1$. To continue the algorithm, the filtering ensemble $(v_{k+1}^{(j)})_{j=1}^{J}$ could then be generated by drawing $J$ independent samples from this Gaussian, which would be propagated forward at time $k+2$.
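In code, the conditional moments (12) are a couple of lines on top of the empirical quantities from (11); the sketch below assumes those are already available as NumPy arrays, and the function name is hypothetical.

```python
import numpy as np

def conditional_moments(m_hat, y_bar, C_hat, C_vy, C_y, y_obs):
    """Conditional mean and covariance (12) built from the empirical moments (11)."""
    gain = C_vy @ np.linalg.inv(C_y)    # \hat{C}^{vy} [\hat{C}^{y}]^{-1}
    m_new = m_hat + gain @ (y_obs - y_bar)
    C_new = C_hat - gain @ C_vy.T       # uses \hat{C}^{yv} = [\hat{C}^{vy}]^T
    return m_new, C_new
```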
Sample from Gaussian Conditional
Another approach is to essentially follow the same idea, but apply a neat trick, known as Matheron's rule, to avoid having to explicitly compute $m_{k+1}$ and $C_{k+1}$. The idea is to draw joint samples from the Gaussian approximation $\mathcal{N}(\tilde{m}_{k+1}, \tilde{C}_{k+1})$ and then shift them into agreement with the observed data:
\begin{align} v^{(j)}_{k+1} := \tilde{v}^{(j)} + \hat{C}_{k+1}^{vy}[\hat{C}_{k+1}^{y}]^{-1}(y_{k+1} - \tilde{y}^{(j)}), \qquad (\tilde{v}^{(j)},\tilde{y}^{(j)}) \sim \mathcal{N}(\tilde{m}_{k+1}, \tilde{C}_{k+1}). \end{align}
The resulting ensemble $(v_{k+1}^{(j)})_{j=1}^{J}$ consists of particles distributed according to the Gaussian with moments (12); in other words, this is simply a different algorithm for obtaining the same ensemble as in the previous method.
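A sketch of this conditional simulation via Matheron's rule is given below, assuming the joint mean and covariance from (10) have been assembled as NumPy arrays with the state block first; the function name is my own.

```python
import numpy as np

rng = np.random.default_rng(3)

def matheron_samples(m_tilde, C_tilde, d, y_obs, J):
    """Draw J approximate filtering particles by conditioning via Matheron's rule."""
    # Fresh joint draws (v, y) from the Gaussian approximation (10); state block comes first.
    joint = rng.multivariate_normal(m_tilde, C_tilde, size=J)
    v_tilde, y_tilde = joint[:, :d], joint[:, d:]
    C_vy = C_tilde[:d, d:]
    C_y = C_tilde[d:, d:]
    gain = C_vy @ np.linalg.inv(C_y)
    # Shift each joint draw toward agreement with the observed data.
    return v_tilde + (y_obs - y_tilde) @ gain.T
```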
The EnKF approach
The EnKF applies a slight modification to Matheron's rule. Instead of drawing fresh samples $(\tilde{v}^{(j)}, \tilde{y}^{(j)})$ from the Gaussian approximation $\mathcal{N}(\tilde{m}_{k+1}, \tilde{C}_{k+1})$, it plugs the existing forecast particles $(\hat{v}_{k+1}^{(j)}, \hat{y}_{k+1}^{(j)})$ into the update in their place. This is computationally convenient, since these particles are already in hand, but they are only approximately distributed according to $\mathcal{N}(\tilde{m}_{k+1}, \tilde{C}_{k+1})$, so the resulting filtering ensemble is only approximately distributed according to the conditional Gaussian with moments (12). The resulting analysis update is stated as equation (14) in the summary below.
Summary
We have derived (a particular version of) the EnKF by opting for an ensemble representation of the forecast and filtering distributions, and dealing with the nonlinear updates required by the analysis step through the use of Gaussian approximations. Our derivations are summarized below.

EnKF Algorithm: Version 1. Given the current ensemble $\{v_k^{(j)}\}_{j=1}^{J}$ at time $k$, the forecast and filtering ensembles at time $k+1$ are generated as follows.

Forecast. \begin{align} &\hat{v}_{k+1}^{(j)} := g(v_k^{(j)}) + \eta_{k+1}^{(j)}, &&\eta_{k+1}^{(j)} \overset{iid}{\sim} \mathcal{N}(0, Q). \tag{13} \end{align}

Analysis. \begin{align} &v^{(j)}_{k+1} := \hat{v}^{(j)}_{k+1} + \hat{C}_{k+1}^{vy}[\hat{C}_{k+1}^{y}]^{-1}(y_{k+1} - \hat{y}_{k+1}^{(j)}), &&\hat{y}_{k+1}^{(j)} \sim \mathcal{N}(h(\hat{v}_{k+1}^{(j)}),R) \tag{14} \end{align} where $\hat{C}_{k+1}^{vy}$ and $\hat{C}_{k+1}^{y}$ are defined in (11).
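Putting the two steps together, one EnKF iteration (13)-(14) might be sketched as follows; the function name `enkf_step`, the row-per-particle storage, and the direct matrix inverse are my own simplifications.

```python
import numpy as np

rng = np.random.default_rng(4)

def enkf_step(ensemble, y_obs, g, h, Q, R):
    """One EnKF iteration: forecast (13) followed by the stochastic analysis (14)."""
    J, d = ensemble.shape
    p = R.shape[0]
    # Forecast: push each particle through the dynamics and add process noise.
    v_hat = np.array([g(v) for v in ensemble]) + rng.multivariate_normal(np.zeros(d), Q, size=J)
    # Simulated observation for each forecast particle.
    y_hat = np.array([h(v) for v in v_hat]) + rng.multivariate_normal(np.zeros(p), R, size=J)
    # Empirical cross-covariance and observation covariance from (11).
    dv = v_hat - v_hat.mean(axis=0)
    dy = y_hat - y_hat.mean(axis=0)
    C_vy = dv.T @ dy / (J - 1)
    C_y = dy.T @ dy / (J - 1)
    gain = C_vy @ np.linalg.inv(C_y)
    # Analysis: shift each forecast particle by the gain applied to its innovation.
    return v_hat + (y_obs - y_hat) @ gain.T
```

Iterating this function over the observation sequence produces the filtering ensembles at each time step.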
Special Case: Linear Observation Operator
Let's now specialize to the common setting where the observation operator is linear: $h(v) = Hv$. Indeed, this assumption is typical when presenting the EnKF in textbooks and the literature (e.g., Sanz-Alonso et al., 2018; Evensen, 2009). This assumption has no effect on the forecast step, but allows for some analytical computations to replace ensemble approximations in the analysis step. Consider the choice of covariance for the $y_{k+1}$ portion of the vector in the Gaussian approximation (10). Previously we considered a Monte Carlo estimate of this covariance based on propagating samples through the nonlinear observation operator $h$. Under the linear operator $H$ we now see that
\begin{align} \text{Cov}[y_{k+1}|Y_k] &= \text{Cov}[Hv_{k+1}|Y_k] + \text{Cov}[\epsilon_{k+1}|Y_k] \newline &= H \text{Cov}[v_{k+1}|Y_k] H^\top + R \newline &\approx H\hat{C}_{k+1} H^\top + R, \tag{15} \end{align}
where we have inserted the current approximation $\hat{C}_{k+1}$ for the covariance of the forecast distribution at time $k+1$. Similarly, \begin{align} \text{Cov}[v_{k+1}, y_{k+1}|Y_k] &= \text{Cov}[v_{k+1},Hv_{k+1}|Y_k] + \text{Cov}[v_{k+1},\epsilon_{k+1}|Y_k] \newline &= \text{Cov}[v_{k+1},v_{k+1}|Y_k]H^\top \newline &\approx \hat{C}_{k+1} H^\top. \tag{16} \end{align}
Expression (10) now simplifies to \begin{align} \begin{bmatrix} v_{k+1} \newline y_{k+1} \end{bmatrix} \bigg| Y_k &\overset{d}{\approx} \mathcal{N}\left( \begin{bmatrix} \hat{m}_{k+1} \newline H\hat{m}_{k+1} \end{bmatrix}, \begin{bmatrix} \hat{C}_{k+1} & \hat{C}_{k+1}H^\top \newline H\hat{C}_{k+1} & H\hat{C}_{k+1}H^\top + R \end{bmatrix} \right). \tag{17} \end{align}
The moments still rely on the empirical estimates $\hat{m}_{k+1}$ and $\hat{C}_{k+1}$, but (17) inserts some partial analytic computations in place of the extra sample-based approximations in (10). Recall from the discussion following (11) that the Gaussian approximation (10) in general introduces errors stemming from (1) the non-Gaussianity of the samples $\hat{v}_{k+1}^{(j)}$; and (2) the nonlinearity of $h$. The assumption that $h$ is linear eliminates this second source of error. However, in general the Gaussian assumption will still represent an approximation due to the first source of error.
We can now once again apply Matheron’s rule to sample from the conditional implied by (17) to obtain the analysis update. This update, the analog of (14), is summarized in the modified algorithm below.
EnKF Algorithm: Linear Observation Operator. Given the current ensemble $\{v_k^{(j)}\}_{j=1}^{J}$ at time $k$, the forecast and filtering ensembles at time $k+1$ are generated as follows.
Forecast. \begin{align} &\hat{v}_{k+1}^{(j)} := g(v_k^{(j)}) + \eta_{k+1}^{(j)}, &&\eta_{k+1}^{(j)} \overset{iid}{\sim} \mathcal{N}(0, Q). \tag{18} \end{align}
Analysis. \begin{align} &v^{(j)}_{k+1} := \hat{v}^{(j)}_{k+1} + \hat{C}_{k+1}H^\top[H\hat{C}_{k+1}H^\top + R]^{-1}(y_{k+1} - \hat{y}_{k+1}^{(j)}), &&\hat{y}_{k+1}^{(j)} \sim \mathcal{N}(H\hat{v}_{k+1}^{(j)},R) \tag{19} \end{align} where $\hat{C}_{k+1}$ is defined in (11).
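Here is a sketch of the analysis update (19) for linear $H$; as before, the helper name and the direct inverse are my own shortcuts, and in high dimensions one would solve the linear system rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(5)

def enkf_analysis_linear(v_hat, y_obs, H, R):
    """Stochastic EnKF analysis (19) for a linear observation operator H."""
    J = v_hat.shape[0]
    p = R.shape[0]
    dv = v_hat - v_hat.mean(axis=0)
    C_hat = dv.T @ dv / (J - 1)                              # forecast covariance from (11)
    gain = C_hat @ H.T @ np.linalg.inv(H @ C_hat @ H.T + R)  # gain appearing in (19)
    y_hat = v_hat @ H.T + rng.multivariate_normal(np.zeros(p), R, size=J)
    return v_hat + (y_obs - y_hat) @ gain.T
```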
Considerations for Including Noise in the Analysis Step
Stochastic Update
We now take a moment to consider how the data appears in the analysis updates (14) and (19). These updates rely on the quantity $y_{k+1} - \hat{y}_{k+1}^{(j)}$, which measures the discrepancy between the observed data $y_{k+1}$ and a simulated observation $\hat{y}_{k+1}^{(j)}$ computed under the assumption that the underlying state is drawn from the forecast distribution at time $k+1$. In other words, this is the difference between the observation and the model forecast, with the latter adjusted to account for the noise in the observation. There are other ways we can interpret this; consider: \begin{align} y_{k+1} - \hat{y}_{k+1}^{(j)} &= y_{k+1} - [h(\hat{v}_{k+1}^{(j)}) + \epsilon_{k+1}^{(j)}] \newline &= [y_{k+1} - \epsilon_{k+1}^{(j)}] - h(\hat{v}_{k+1}^{(j)}) \newline &=: y^{(j)}_{k+1} - h(\hat{v}_{k+1}^{(j)}), \tag{20} \end{align} where we have defined the perturbed observation $y^{(j)}_{k+1} := y_{k+1} - \epsilon_{k+1}^{(j)}$. All we have done here is to rewrite the expression so as to view the observation $y_{k+1}$ as being perturbed by the noise $\epsilon_{k+1}^{(j)}$, instead of the model forecast $h(\hat{v}_{k+1}^{(j)})$. Note that since the noise $\epsilon_{k+1}^{(j)}$ has a symmetric distribution about zero, the alternative definition $y^{(j)}_{k+1} := y_{k+1} + \epsilon_{k+1}^{(j)}$ is equivalent from a probabilistic perspective, and indeed both conventions exist in the literature.

The perturbations to the data in the EnKF update (14) fell right out of our derivation of the EnKF; specifically, this followed from the viewpoint we adopted of the update (14) as simply drawing a sample from the conditional distribution implied by (10). Using terminology favored by the spatial statistics community, this perspective on the EnKF might be termed approximate conditional simulation (Katzfuss et al., 2016), with the “approximate” part highlighting that the joint Gaussian assumption (10) is generally an approximation. Even in the linear Gaussian setting the conditional simulation is still approximate, owing to the fact that the covariances in (10) are estimated from a finite set of samples. Regardless of the interpretation, it is important to note that the update (14) is stochastic; i.e., even viewed as conditional on the forecast ensemble, additional noise is introduced by sampling the simulated (or perturbed, depending on the perspective) observations. The particle updates (14) are then given by an affine transformation of the forecast particles, with the magnitude of the adjustment controlled by the difference between the simulated and observed data.
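The algebra in (20) can be checked numerically in a few lines; the concrete numbers below are arbitrary placeholders of my own.

```python
import numpy as np

rng = np.random.default_rng(6)

# The two ways of writing the innovation in (20), for a single particle.
# h_val stands in for h(v_hat_j); the numbers are arbitrary placeholders.
y_obs = np.array([0.7])
h_val = np.array([0.4])
R = np.array([[0.05]])
eps = rng.multivariate_normal(np.zeros(1), R)

innovation_simulated = y_obs - (h_val + eps)   # simulated-observation viewpoint
innovation_perturbed = (y_obs - eps) - h_val   # perturbed-observation viewpoint
assert np.allclose(innovation_simulated, innovation_perturbed)
# Since eps is symmetric about zero, defining the perturbed data as y_obs + eps instead
# yields the same distribution of innovations, which is the other common convention.
```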
Deterministic Update
While the stochastic update naturally falls out of the approximate conditional simulation derivation we originally considered, this is but one of many perspectives on the EnKF. A popular alternative to the stochastic update (14) eliminates the use of simulated or perturbed observations, instead opting for a deterministic update. There are many variations on the deterministic update idea, but most fall under the heading of “ensemble square-root filters” (Tippett et al., 2003).
TODO:
- Perturbed data ensemble is necessary for alignment with the KF in the linear Gaussian setting.
- Benefits of using ensemble representation of the data noise covariance.
The Optimization Perspective
The SDE/MCMC Perspective
See Evensen (2009)
The Ensemble Span
References
- G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, New York, 2007.
- G. Evensen. The Ensemble Kalman Filter for Combined State and Parameter Estimation. Monte Carlo Techniques for Data Assimilation in Large Systems, 2009.
- D. Sanz-Alonso et al. Inverse Problems and Data Assimilation. 2018.
- D. Kelly, K. J. H. Law, and A. M. Stuart. Well-posedness and accuracy of the ensemble Kalman filter in discrete and continuous time. Nonlinearity, 27(10):2579, 2014.
- K. J. H. Law, A. M. Stuart, and K. Zygalakis. Data Assimilation. Springer, 2015.
- Y. Chen and D. Oliver. Ensemble randomized maximum likelihood method as an iterative ensemble smoother. Mathematical Geosciences, 44(1):1–26, 2012.
- J. Carrillo, F. Hoffmann, A. Stuart, and U. Vaes. The ensemble Kalman filter in the near-Gaussian setting. arXiv preprint arXiv:2212.13239, 2022.
- A. Carrassi, M. Bocquet, L. Bertino, and G. Evensen. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change, 9(5), 2018.
- E. Calvello, S. Reich, and A. M. Stuart. Ensemble Kalman Methods: A Mean Field Perspective. arXiv preprint, 2022.
- Roth et al. The Ensemble Kalman filter: a signal processing perspective. 2017.
- Katzfuss et al. Understanding the Ensemble Kalman Filter. 2016.
- M. K. Tippett, J. L. Anderson, C. H. Bishop, T. M. Hamill, and J. S. Whitaker. Ensemble Square-Root Filters. Monthly Weather Review, 131:1485–1490, 2003.