Several Perspectives on the Ensemble Kalman Filter

A deep dive into the EnKF analysis update

This post motivates and explores the posterior approximation utilized by a single step of the celebrated EnKF algorithm.
Inverse-Problem
Data-Assimilation
Optimization
Sampling
Computational Statistics
Published

October 3, 2025

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\Pr}{\mathbb{P}} \newcommand{\given}{\mid} \newcommand{\Def}{:=} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Gaussian}{\mathcal{N}} \newcommand{\x}{x} \newcommand{\y}{y} \newcommand{\h}{h} \newcommand{\H}{\mathsf{H}} \newcommand{\yobs}{y^{\dagger}} \newcommand{\noise}{\epsilon} \newcommand{\covNoise}{\Sigma} \newcommand{\m}{m} \newcommand{\C}{C} \newcommand{\Cyx}{\C_{\y\x}} \newcommand{\Cxy}{\C_{\x\y}} \newcommand{\Cy}{\C_{\y}} \newcommand{\my}{m_y} \newcommand{\prior}{\pi} \newcommand{\priorEmp}{\pi_J} \newcommand{\jointEmp}{\joint_J} \newcommand{\post}{\pi^\star} \newcommand{\joint}{\overline{\prior}} \newcommand{\dimObs}{n} \newcommand{\dimPar}{d} \newcommand{\map}{\mathsf{T}} \newcommand{\proj}{\mathcal{P}_{\Gaussian}} \]

The ensemble Kalman filter (EnKF) is a family of algorithms that perform approximate inference in probabilistic state space models. The crux of this inference task is the requirement of solving a sequence of nonlinear, potentially high-dimensional, inverse problems. To approximate the solutions to these inverse problems, EnKF algorithms employ Monte Carlo methods rooted in the linear Gaussian methodology of the Kalman filter. In this post, we take a deep dive into this methodology, viewing the EnKF approximations from several different perspectives. Our goal is not to focus on the algorithmic aspects of state space inference (I cover this in other posts), but rather to strip away unnecessary details to really understand the approximations driving the EnKF.

1 The Core Inverse Problems

The celebrated “EnKF update equations” describe how to update a model’s forecast distribution (represented via samples) based on the information in newly observed data. This act of “assimilating” the data point is often referred to as the analysis step of the EnKF. From a generic perspective, the analysis step requires approximating the solution of an inverse problem. Our focus is on this single-step inverse problem; we do not consider issues related to the iterative nature of the EnKF.

The EnKF may be derived at varying levels of generality, leading to slight variations on the classical update equations. We present two common formulations, starting with the most general.

Inverse Problem 1: The General Case

Consider a joint distribution \((\x,\y) \sim \joint\) over an unobserved variable \(\x \in \R^\dimPar\) and an observable \(\y \in \R^\dimObs\). Given samples \[ \begin{align} &(x_j, y_j) \sim \joint, &&j = 1, 2, \dots, J \end{align} \tag{1}\] and a data realization \(\yobs \in \R^\dimObs\), the task is to approximate the posterior distribution \[ \post \Def \mathrm{law}(\x \given \y = \yobs). \tag{2}\]

Note that \(\post\) depends on \(\yobs\), but we suppress this in the notation. This is an instance of simulation-based inference; unlike in traditional inverse problem setups, we have access to neither the likelihood nor the prior density. The posterior must be approximated by leveraging the information in the samples \((\x_j, \y_j)\).

It is also common to consider the special case where the prior distribution on \(\x\) is only accessible via samples, but the conditional distribution of \(\y \given \x\) has a known parametric form. In particular, in physical applications one commonly assumes the Gaussian conditional density \(\Gaussian(\y \given \h(\x), \covNoise)\), where \(\h: \R^\dimPar \to \R^\dimObs\) is a known, deterministic map. This yields the following inverse problem.

Inverse Problem 2: Additive Gaussian Noise

Suppose that the joint distribution \((\x, \y) \sim \joint\) is defined by the probability model

\[ \begin{align} &\y = \h(\x) + \noise, &&\noise \sim \Gaussian(0, \covNoise) \\ &\x \sim \prior, \end{align} \tag{3}\]

where \(\x\) and \(\noise\) are a priori independent and \(\covNoise \in \R^{n \times n}\) is a known positive definite matrix.

In this setting, we assume that the prior \(\prior\) is only accessible via a finite set of samples \(\{x_j\}_{j=1}^{J}\). The goal is once again to approximate the posterior \(\post = \mathrm{law}(\x \given \y = \yobs)\) for a particular data realization \(\yobs\).

In certain settings, it is also common to further simplify the model by assuming that the map \(\h\) is linear. We will explore this special case in addition to the more general settings.

In either case, we are interested in developing posterior inference methods that are fast and scalable to settings where the parameter dimension \(\dimPar\) may be quite large. Standard inference algorithms like Markov chain Monte Carlo (MCMC) are not well-suited to these constraints. In particular, a key challenge here is the lack of access to either the prior density or, potentially, the likelihood. Instead, we must make do with the empirical approximations \[ \begin{align} &\priorEmp \Def \frac{1}{J} \sum_{j=1}^{J} \delta_{\x_j}, &&\jointEmp \Def \frac{1}{J} \sum_{j=1}^{J} \delta_{\x_j, \y_j} \end{align} \tag{4}\] where \(\delta_{u}\) denotes the Dirac measure centered at \(u\).

Notation

We utilize the notation \(m_x \Def \E(\x)\), \(\my \Def \E(\y)\), \(C_x \Def \Cov(\x)\), \(\Cy \Def \Cov(\y)\), \(\Cxy \Def \Cov(\x,\y)\), and \(\Cyx \Def \Cxy^\top\) for the first two moments of \(\x\) and \(\y\) under the joint distribution \(\joint\).

2 Background: Linear Gaussian Inverse Problems

Classical linear Gaussian methodology forms the foundation for the EnKF update. We review a few key results from this methodology that will be relevant throughout this post. For more background on linear Gaussian inverse problems, see this post.

We begin by stating the well-known form of a Gaussian conditional distribution.

Gaussian Conditioning Identity

Let \(\x \in \R^\dimPar\) and \(\y \in \R^\dimObs\) be random vectors with joint distribution

\[ \begin{align} \begin{bmatrix} \x \\ \y \end{bmatrix} &\sim \Gaussian\left( \begin{bmatrix} m_x \\ \my \end{bmatrix}, \begin{bmatrix} C_x & \Cxy \\ \Cyx & \Cy \end{bmatrix} \right), \end{align} \tag{5}\]

where the matrix \(\Cy\) is assumed positive definite. Then for a fixed vector \(\yobs \in \R^\dimObs\), the conditional distribution of \(\x\) given \(\y = \yobs\) is \[ \x \given [\y = \yobs] \sim \Gaussian(m^\star, C^\star), \] where \[ \begin{align} m^\star &= m_x + K(\yobs - \my) \\ C^\star &= C_x - K\Cyx \\ K &= \Cxy \Cy^{-1}. \end{align} \tag{6}\] The matrix \(K \in \R^{\dimPar \times \dimObs}\) is often called the Kalman gain.

To draw samples from the conditional distribution \(\text{law}(\x \given \y = \yobs)\), one could compute the moments \(m^\star\) and \(C^\star\) using the above equations and then sample \(\Gaussian(m^\star, C^\star)\) using standard methods. Alternatively, one could draw samples \((\x, \y)\) from the joint Gaussian and then apply a suitable transformation to them to yield conditional (posterior) samples. The transformation that achieves this is referred to as Matheron’s rule.

Matheron’s Rule

Let \((x,y)\) have the joint Gaussian distribution given in Equation 5 and let \(K, m^\star, C^\star\) be defined as in Equation 6.

It then follows that \(\map(\x, \y) \sim \Gaussian(m^\star, C^\star)\), where \(\map: \R^\dimPar \times \R^\dimObs \to \R^\dimPar\) is given by \[ \map(\x, \y) \Def x + K(\yobs - \y). \tag{7}\]

We will refer to \(\map\) as the Matheron transport map.

We could alternatively view \(\map\) as mapping from \(\R^\dimPar\) to \(\R^\dimPar\) by interpreting \(\y\) as a random parameter of the map; i.e., \(\map_{y}(x) \Def \map(x,y)\). From this perspective, \(\map\) constitutes a random map that transports marginal samples of \(\x\) to conditional samples.
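As a quick empirical check (my own illustration, not code from the post), the sketch below draws joint Gaussian samples, pushes them through the Matheron map in Equation 7, and compares the sample moments of the transported points to the conditional moments from Equation 6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint Gaussian over (x, y) with x in R^2 and y in R^1.
m_x, m_y = np.array([0.0, 1.0]), np.array([0.5])
C_x = np.array([[1.0, 0.3], [0.3, 2.0]])
C_xy = np.array([[0.4], [0.1]])
C_y = np.array([[1.5]])
y_obs = np.array([1.2])

K = np.linalg.solve(C_y, C_xy.T).T       # Kalman gain (Equation 6)
m_star = m_x + K @ (y_obs - m_y)         # conditional mean
C_star = C_x - K @ C_xy.T                # conditional covariance

# Draw joint samples and apply the Matheron map T(x, y) = x + K (y_obs - y).
m_joint = np.concatenate([m_x, m_y])
C_joint = np.block([[C_x, C_xy], [C_xy.T, C_y]])
xy = rng.multivariate_normal(m_joint, C_joint, size=200_000)
x, y = xy[:, :2], xy[:, 2:]
x_cond = x + (y_obs - y) @ K.T

# Both differences should be small and shrink as the sample size grows.
print(np.abs(x_cond.mean(axis=0) - m_star).max())
print(np.abs(np.cov(x_cond, rowvar=False) - C_star).max())
```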

We conclude this section by recalling a special case of the Gaussian conditioning identity from Equation 6. When the observation operator \(\h\) in Equation 3 is linear, we can apply Equation 6 to solve the resulting linear Gaussian inverse problem.

Linear Gaussian Posterior

The linear Gaussian inverse problem

\[ \begin{align} &\y = \H\x + \noise, &&\noise \sim \Gaussian(0, \covNoise) \\ &\x \sim \Gaussian(m,C) \end{align} \tag{8}\]

has posterior distribution \(\x \given [\y = \yobs] \sim \Gaussian(m^\star, C^\star)\), with moments given by \[ \begin{align} m^\star &= m + K(\yobs - \H m) \\ C^\star &= C - K \H C \\ K &= C \H^\top (\H C \H^\top + \covNoise)^{-1} \end{align} \tag{9}\]
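As a sanity check (a sketch of my own, not from the post), Equation 9 can be verified numerically against the generic conditioning identity in Equation 6 by plugging in the joint moments implied by Equation 8, namely \(\my = \H m\), \(\Cy = \H C \H^\top + \covNoise\), and \(\Cxy = C \H^\top\).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 2

# A random linear Gaussian problem (Equation 8).
H = rng.normal(size=(n, d))
A = rng.normal(size=(d, d))
C = A @ A.T + np.eye(d)        # prior covariance (positive definite)
B = rng.normal(size=(n, n))
Sigma = B @ B.T + np.eye(n)    # noise covariance (positive definite)
m = rng.normal(size=d)
y_obs = rng.normal(size=n)

# Posterior moments via Equation 9.
K = C @ H.T @ np.linalg.inv(H @ C @ H.T + Sigma)
m_star = m + K @ (y_obs - H @ m)
C_star = C - K @ H @ C

# Same result via the generic conditioning identity (Equation 6) with the
# implied joint moments: m_y = H m, C_y = H C H^T + Sigma, C_xy = C H^T.
C_xy, C_y = C @ H.T, H @ C @ H.T + Sigma
K2 = np.linalg.solve(C_y, C_xy.T).T
print(np.allclose(m_star, m + K2 @ (y_obs - H @ m)))   # True
print(np.allclose(C_star, C - K2 @ C_xy.T))            # True
```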

3 Linear Ensemble Transforms

We now return to the general problems in Equation 1 and Equation 3. Both setups face the challenge posed by the empirical nature of the prior. There are many approaches to deal with this issue. For example, a kernel density estimate can be employed to smooth \(\priorEmp\), yielding a continuous prior approximation. Alternatively, importance sampling techniques work directly with \(\{x_j\}_{j=1}^{J}\), re-weighting the particles to achieve a sample-based approximation of the posterior \(\post\). However, both of these methods are known to degenerate as the parameter dimension \(\dimPar\) increases.

This post focuses on methods that produce approximate samples from \(\post\) by applying a map \(\map: \R^\dimPar \to \R^\dimPar\) to each prior sample; that is, \[ x_j^\star \Def \map(x_j), \qquad j = 1, 2, \dots, J. \] The function \(\map\) is sometimes called a transport map, and may be deterministic or stochastic. In particular, we focus on linear transport maps, yielding a family of algorithms known as linear ensemble transforms. Ensemble Kalman transforms construct linear transport maps \(\map\) that depend on the empirical distribution \(\priorEmp\) only through its first two moments, the sample mean and covariance: \[ \begin{align} \hat{m} &\Def \frac{1}{J} \sum_{j=1}^{J} x_j \\ \hat{C} &\Def \frac{1}{J-1} \sum_{j=1}^{J} (x_j - \hat{m})(x_j - \hat{m})^\top. \end{align} \tag{10}\]
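In code, with the ensemble stored as a \(J \times \dimPar\) array (a storage convention I am assuming here, not one fixed by the post), the sample moments in Equation 10 are simply:

```python
import numpy as np

def ensemble_moments(X):
    """Sample mean and covariance (Equation 10) of an ensemble X of shape (J, d)."""
    m_hat = X.mean(axis=0)
    C_hat = np.cov(X, rowvar=False)   # np.cov uses the 1/(J-1) normalization by default
    return m_hat, C_hat
```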

4 Gaussian Prior Approximation

In this section, we introduce the EnKF in the setting of Equation 3 where \(\h(\x) = \H\x\) for some matrix \(\H \in \R^{\dimObs \times \dimPar}\). We generalize beyond this linear assumption in future sections.

4.1 Gaussian Projection of the Prior

The classical EnKF that arises in this setting is rooted in the idea of estimating a parametric prior approximation using \(\{x_j\}_{j=1}^{J}\). While different distributional families can be used, ensemble Kalman methods employ a Gaussian approximation. Given a square-integrable distribution \(\nu\), let \(\proj \nu\) denote the Gaussian “projection” of \(\nu\); that is, the Gaussian distribution that matches the mean and covariance of \(\nu\). Given this definition, the Gaussian prior approximation employed by the EnKF can be phrased as follows.

Gaussian Prior Approximation

Given the setup in Equation 3, the EnKF uses the samples \(\{x_j\}_{j=1}^{J}\) to fit a Gaussian prior approximation \[ \hat{\prior} \Def \proj \priorEmp = \Gaussian(\hat{m}, \hat{C}), \tag{11}\] where \(\hat{m},\hat{C}\) are the empirical mean and covariance, as defined in Equation 10.

The parametric approximation has a regularizing effect in high-dimensions, and the choice of a Gaussian distribution allows for convenient analytical expressions. This is particularly true when the observation operator \(\h\) is linear, in which case the problem reduces to the linear Gaussian setting. Indeed, by treating \(\Gaussian(\hat{m}, \hat{C})\) as if it were the true prior, we obtain the modified inverse problem \[ \begin{align} &\y = \H\x + \noise, &&\noise \sim \Gaussian(0, \covNoise) \\ &\x \sim \Gaussian(\hat{m},\hat{C}), \end{align} \tag{12}\] which is of the linear Gaussian variety (see Equation 8). Following Equation 9, we find that the posterior under this model is Gaussian, with moments \[ \begin{align} \hat{m}^\star &= \hat{m} + \hat{K}(\yobs - \H \hat{m}) \\ \hat{C}^\star &= \hat{C} - \hat{K} \H \hat{C}, \end{align} \tag{13}\] where \[ \hat{K} = \hat{C} \H^\top (\H \hat{C} \H^\top + \covNoise)^{-1}. \tag{14}\]
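The following sketch (my own, with assumed variable names) computes \(\hat{K}\), \(\hat{m}^\star\), and \(\hat{C}^\star\) directly from an ensemble, following Equation 13 and Equation 14.

```python
import numpy as np

def enkf_gaussian_posterior_moments(X, H, Sigma, y_obs):
    """Moments of the Gaussian-approximation posterior (Equations 13 and 14).

    X     : (J, d) array of prior samples.
    H     : (n, d) linear observation operator.
    Sigma : (n, n) observation noise covariance.
    y_obs : (n,) observed data.
    """
    m_hat = X.mean(axis=0)
    C_hat = np.cov(X, rowvar=False)                  # sample covariance, 1/(J-1) normalization
    S = H @ C_hat @ H.T + Sigma                      # predicted observation covariance
    K_hat = np.linalg.solve(S, H @ C_hat).T          # Kalman gain (Equation 14)
    m_star = m_hat + K_hat @ (y_obs - H @ m_hat)     # Equation 13
    C_star = C_hat - K_hat @ H @ C_hat
    return m_star, C_star, K_hat
```

Note that the linear solve is against an \(\dimObs \times \dimObs\) matrix, so no \(\dimPar \times \dimPar\) matrix ever needs to be inverted.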

4.2 Approximate Transport

At this point, we could take \(\Gaussian(\hat{m}^\star, \hat{C}^\star)\) as our posterior approximation. However, in the iterative EnKF context, one seeks a Monte Carlo approximation of the posterior, such that each prior sample \(x_j\) is mapped to a posterior sample \(x^\star_j\). Adopting such a transport-based approach ensures correspondence between the prior and posterior samples, which is important in typical EnKF applications where the \(j^{\text{th}}\) sample represents a particular model trajectory through time. Our task now becomes to find a suitable transport map; this goal is formalized below.

Moment Matching Transport

Given the Gaussian prior approximation \(\Gaussian(\hat{m},\hat{C})\) in Equation 11, we seek a (potentially stochastic) transport map \(\map\) that produces samples \(x_j^\star = \map(x_j)\) with empirical mean and covariance matching \(\hat{m}^\star\) and \(\hat{C}^\star\) in Equation 13; that is, \[ \begin{align} \hat{m}^\star &= \frac{1}{J} \sum_{j=1}^{J} x_j^\star \\ \hat{C}^\star &= \frac{1}{J-1} \sum_{j=1}^{J} (x_j^\star - \hat{m}^\star)(x_j^\star - \hat{m}^\star)^\top. \end{align} \tag{15}\] If the map \(\map\) is stochastic, then these equalities are required to hold in expectation.

By enforcing the constraints in Equation 15, we are requiring that the posterior samples should have mean and covariance matching the posterior mean and covariance of the linear Gaussian problem Equation 12, which itself was the result of a Gaussian prior approximation. If \(\map\) is deterministic, then the posterior samples will have sample mean and covariance exactly matching these quantities. If \(\map\) is stochastic then the moments will match in expectation (with respect to the noise in \(\map\)), but the sample moments of the realized ensemble \(\{x_j^\star\}_{j=1}^{J}\) may differ.

4.3 The Perturbed Observation EnKF Update

The goal of transporting prior to posterior samples should bring to mind the Matheron transport map defined in Equation 7. We have already seen that the Matheron map transports marginal to conditional samples under a joint Gaussian model. Even though the prior samples \(x_j\) are not necessarily Gaussian distributed, we can still apply this same map in this context to satisfy the moment constraints in Equation 15 in expectation. This yields the classical “perturbed observation” EnKF update.

Perturbed Observation EnKF Update

Consider the inverse problem Equation 3 with a linear observation operator \(\h(\x) = \H\x\). The perturbed observation EnKF updates prior samples \(\{x_j\}_{j=1}^{J}\) to posterior samples \(\{x^\star_j\}_{j=1}^{J}\) as follows (a code sketch of these steps is given after the list):

  1. Fit a Gaussian prior approximation \(\hat{\prior} = \proj \priorEmp = \Gaussian(\hat{m}, \hat{C})\) as in Equation 11.
  2. For each \(j = 1, \dots, J\) sample \(y_j \sim \Gaussian(\H x_j, \covNoise)\) and compute \[ x_j^\star \Def \hat{\map}_{y_j}(x_j), \] where \[ \hat{\map}_{y}(x) \Def x + \hat{K}(\yobs - y), \] with \(\hat{K}\) defined in Equation 14.
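
The two steps above translate directly into code. Here is a minimal NumPy sketch of the update; the function name and toy problem are my own assumptions, not part of the post:

```python
import numpy as np

def perturbed_obs_enkf_update(X, H, Sigma, y_obs, rng):
    """Perturbed observation EnKF update for a linear observation operator.

    X     : (J, d) array of prior samples.
    H     : (n, d) linear observation operator.
    Sigma : (n, n) observation noise covariance.
    y_obs : (n,) observed data.
    Returns a (J, d) array of updated samples.
    """
    J, n = X.shape[0], H.shape[0]
    # Step 1: Gaussian prior approximation via the sample covariance (Equation 11),
    # and the corresponding Kalman gain (Equation 14).
    C_hat = np.cov(X, rowvar=False)
    S = H @ C_hat @ H.T + Sigma
    K_hat = np.linalg.solve(S, H @ C_hat).T
    # Step 2: perturb the observations and apply the (estimated) Matheron map.
    Y = X @ H.T + rng.multivariate_normal(np.zeros(n), Sigma, size=J)
    return X + (y_obs - Y) @ K_hat.T

# Usage on a toy two-dimensional problem.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(2), np.array([[1.0, 0.4], [0.4, 1.0]]), size=500)
H = np.array([[1.0, 0.0]])
Sigma = np.array([[0.5]])
X_star = perturbed_obs_enkf_update(X, H, Sigma, np.array([1.0]), rng)
print(X_star.mean(axis=0))
print(np.cov(X_star, rowvar=False))
```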

It is important to recognize that the ensemble \(\{x_j^\star\}_{j=1}^{J}\) produced by this update is not necessarily Gaussian. In the special case where \(\{x_j\}_{j=1}^{J} \sim \Gaussian(m, C)\), the pairs \((x_j, y_j)\) are jointly Gaussian and hence \(x_j^\star = \hat{\map}_{y_j}(x_j)\) is also Gaussian. However, even in this case the distribution of \(x_j^\star\) deviates from the exact Gaussian posterior due to the empirical estimate of the Kalman gain.

In general, the prior samples \(x_j\) will be non-Gaussian, implying that the \(x_j^\star\) may also be non-Gaussian. However, since the Matheron map is affine, it still acts on the mean and covariance of the samples in the same way. For completeness, a derivation verifying that the constraints in Equation 15 hold in expectation is provided below.

To verify this, condition on the prior ensemble \(\{x_j\}_{j=1}^{J}\), so that the sample moments \(\hat{m}\), \(\hat{C}\) and the gain \(\hat{K}\) are fixed; the only remaining randomness comes from the observation perturbations. Write \(y_j = \H x_j + \noise_j\) with \(\noise_j \sim \Gaussian(0, \covNoise)\) drawn independently, and let \(\bar{\noise} \Def \frac{1}{J} \sum_{j=1}^{J} \noise_j\). The updated samples are then \(x_j^\star = \hat{\map}_{y_j}(x_j) = x_j + \hat{K}(\yobs - \H x_j - \noise_j)\).

For the mean constraint, the sample mean of the updated ensemble is \[ \frac{1}{J} \sum_{j=1}^{J} x_j^\star = \hat{m} + \hat{K}(\yobs - \H \hat{m}) - \hat{K} \bar{\noise}, \] so taking the expectation over the perturbations gives \[ \E\left[\frac{1}{J} \sum_{j=1}^{J} x_j^\star\right] = \hat{m} + \hat{K}(\yobs - \H \hat{m}) = \hat{m}^\star. \]

For the covariance constraint, note that the deviations from the sample mean are \[ x_j^\star - \frac{1}{J} \sum_{k=1}^{J} x_k^\star = (I - \hat{K}\H)(x_j - \hat{m}) - \hat{K}(\noise_j - \bar{\noise}). \] The cross terms vanish in expectation, since the \(x_j\) are fixed and \(\E[\noise_j - \bar{\noise}] = 0\), while the sample covariance of the \(\noise_j\) is an unbiased estimate of \(\covNoise\). Therefore,

\[\begin{align} \E\left[\frac{1}{J-1} \sum_{j=1}^{J} \left(x_j^\star - \tfrac{1}{J}\textstyle\sum_k x_k^\star\right)\left(x_j^\star - \tfrac{1}{J}\textstyle\sum_k x_k^\star\right)^\top\right] &= (I - \hat{K}\H)\hat{C}(I - \hat{K}\H)^\top + \hat{K}\covNoise\hat{K}^\top \\ &= \hat{C} - \hat{K}\H\hat{C} - \hat{C}\H^\top\hat{K}^\top + \hat{K}(\H\hat{C}\H^\top + \covNoise)\hat{K}^\top \\ &= \hat{C} - \hat{K}\H\hat{C} \\ &= \hat{C}^\star, \end{align}\]

where the final equality uses \(\hat{K}(\H\hat{C}\H^\top + \covNoise)\hat{K}^\top = \hat{C}\H^\top\hat{K}^\top\), which follows directly from the definition of \(\hat{K}\) in Equation 14. □
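This calculation is easy to check numerically. The sketch below (again my own check, not from the post) fixes a prior ensemble, repeats the stochastic update over many independent perturbation draws, and compares the averaged sample moments of the updated ensembles with \(\hat{m}^\star\) and \(\hat{C}^\star\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, J = 2, 1, 30

# A fixed prior ensemble; only the observation perturbations are resampled below.
H = np.array([[1.0, 0.5]])
Sigma = np.array([[0.3]])
y_obs = np.array([1.0])
X = rng.multivariate_normal(np.zeros(d), np.array([[1.0, 0.4], [0.4, 1.0]]), size=J)

m_hat, C_hat = X.mean(axis=0), np.cov(X, rowvar=False)
S = H @ C_hat @ H.T + Sigma
K_hat = np.linalg.solve(S, H @ C_hat).T
m_star = m_hat + K_hat @ (y_obs - H @ m_hat)     # Equation 13
C_star = C_hat - K_hat @ H @ C_hat

# Average the empirical moments of the updated ensemble over many perturbation draws.
n_reps = 20_000
mean_acc, cov_acc = np.zeros(d), np.zeros((d, d))
for _ in range(n_reps):
    Y = X @ H.T + rng.multivariate_normal(np.zeros(n), Sigma, size=J)
    X_star = X + (y_obs - Y) @ K_hat.T
    mean_acc += X_star.mean(axis=0)
    cov_acc += np.cov(X_star, rowvar=False)

# Both differences shrink toward zero as n_reps grows.
print(np.abs(mean_acc / n_reps - m_star).max())
print(np.abs(cov_acc / n_reps - C_star).max())
```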

5 Joint Gaussian Approximation