Ensemble Kalman Methodology for Inverse Problems
Particle-based Kalman methods for the approximate solution of Bayesian inverse problems.
In this post we provide a brief introduction to how ideas related to the Ensemble Kalman Filter (EnKF) can be applied to approximate the posterior distribution of a Bayesian inverse problem. While the EnKF is traditionally applied to time-varying dynamical systems, we show how to introduce artificial dynamics into an otherwise static inverse problem in order to port the EnKF methodology to the inverse problem setting. After defining the inverse problem formulation, we start simply by demonstrating how combining joint Gaussian assumptions with Monte Carlo approximations can provide a means of approximate Bayesian inference. We then generalize this idea by considering a sequence of such joint Gaussian approximations, providing a clear link to the EnKF. These methods are characterized by a finite sequence of updates that map the prior distribution to an (approximate) posterior distribution. We then generalize further by considering a slight adjustment to the EnKF update rules such that the resulting algorithms consist of an infinite sequence of updates. Practical algorithms are defined by truncating after a finite number of steps. Algorithms falling into this class have come to be known as ensemble Kalman inversion (EKI), and have gained much recent interest. We conclude by briefly highlighting connections to measure transport.
Setup: Bayesian Inverse Problems
In this post, we consider the inverse problem formulation \begin{align} y &= \mathcal{G}(u) + \epsilon \tag{1} \newline u &\sim \pi_0 \newline \epsilon &\sim \mathcal{N}(0, \Sigma), \end{align} whereby the task is to recover a parameter $u$ from noisy observations $y$ resulting from the mapping of $u$ through a forward model $\mathcal{G}$. The Bayesian approach to inverse problems treats $u$ as a random variable, such that the statistical model (1) defines a joint distribution over the random vector $(u, y)$. The solution of the Bayesian inverse problem is then defined to be the conditional distribution of $u|y$; i.e., the posterior distribution.

The model (1) defines the joint distribution by separately defining the two components in the product \begin{align} p(u, y) &= p(y|u)\pi_0(u). \tag{2} \end{align} We note that the model (1) implicitly assumes a priori independence between the noise $\epsilon$ and parameter $u$. The first term in (2), when viewed as a function of $u$, is called the likelihood. For convenience later on, we will introduce notation for (up to a constant) the negative log-likelihood, \begin{align} \Phi(u) &:= -\log p(y|u) - C, \tag{3} \end{align} which is sometimes also called the potential or model-data misfit. Throughout most of this post, we restrict to the choice of a Gaussian likelihood (as in (1)), in which case \begin{align} -\log p(y|u) &= -\log \mathcal{N}(y|\mathcal{G}(u),\Sigma) \newline &= \frac{1}{2}\log\det(2\pi \Sigma) + \frac{1}{2}\lVert y - \mathcal{G}(u)\rVert^2_{\Sigma} \tag{4} \end{align} and \begin{align} &\Phi(u) = \frac{1}{2}\lVert y - \mathcal{G}(u)\rVert^2_{\Sigma}, &&C = \frac{1}{2}\log\det(2\pi \Sigma). \tag{5} \end{align} As above, we will use the notation $\lVert x \rVert^2_{A} := x^\top A^{-1} x$ to denote the Euclidean norm weighted by the inverse of a positive definite matrix $A$.

The second term in (2), $\pi_0(u)$, is called the prior. In this post, we will restrict to the finite-dimensional setting $u \in \mathbb{R}^d$, $y \in \mathbb{R}^p$, and assume that the prior distribution is given by a density $\pi_0$. We denote the resulting posterior density (the density describing the distribution of $u|y$) by $\pi$. The methods discussed below can be generalized to settings where $u$ lives in a function space; e.g., the prior distribution may be given by a Gaussian process.
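To fix ideas, here is a minimal Python sketch of a toy instance of model (1). All specific choices here (the dimensions, the map $\mathcal{G}$, the prior, and the noise level) are illustrative assumptions, not part of the general formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of model (1): a nonlinear forward map G: R^d -> R^p,
# a Gaussian prior pi_0 = N(m0, C0), and Gaussian noise N(0, Sigma).
d, p = 2, 3
A = rng.normal(size=(p, d))

def forward(u):
    """Nonlinear forward model G(u) (an arbitrary illustrative choice)."""
    return A @ u + 0.1 * np.sin(u).sum()

m0, C0 = np.zeros(d), np.eye(d)           # prior moments
Sigma = 0.5**2 * np.eye(p)                # noise covariance
Sigma_inv = np.linalg.inv(Sigma)

u_true = rng.multivariate_normal(m0, C0)  # draw u ~ pi_0
y = forward(u_true) + rng.multivariate_normal(np.zeros(p), Sigma)

def potential(u):
    """Model-data misfit Phi(u) = 0.5 * ||y - G(u)||^2_Sigma, as in (5)."""
    r = y - forward(u)
    return 0.5 * r @ Sigma_inv @ r
```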
Background: Joint Gaussian Conditioning
Joint Gaussian Assumption
We now begin by considering approximation of the posterior by way of
a certain Gaussian approximation. In particular, we will assume that
$(u, y)$ are jointly Gaussian distributed, in which case standard Gaussian conditioning
identities can be applied to yield an approximation of $u|y$. Given that
conditionals of Gaussians are also Gaussian, this approach produces a Gaussian
approximation to the posterior $\pi$. However, borrowing an idea from EnKF
methodology, we will consider a slight modification with the ability to produce
non-Gaussian approximations. To avoid notational confusion between
exact and approximate distributions, we will denote by $(\hat{u}, \hat{y})$
the random variables defining the joint Gaussian approximation. The Gaussian
assumption thus takes the form
\begin{align}
\begin{bmatrix} \hat{u} \newline \hat{y} \end{bmatrix}
\sim
\mathcal{N}\left(
\begin{bmatrix} \overline{u} \newline \overline{y} \end{bmatrix},
\begin{bmatrix} \hat{C} & \hat{C}^{uy} \newline
\hat{C}^{yu} & \hat{C}^{y} \end{bmatrix}
\right) \tag{6}
\end{align}
where the moments $\overline{u}$, $\overline{y}$, $\hat{C}$, $\hat{C}^{uy}$,
and $\hat{C}^{y}$ defining this approximation are quantities that
we must specify. We use the notation $\hat{C}^{yu} := (\hat{C}^{uy})^\top$.
Note that if the forward model $\mathcal{G}$ is linear and
the prior distribution $\pi_0$ is Gaussian, then the joint Gaussian approximation
(6) is actually exact, and the moments can be computed in closed-form. In other
words, with the moments properly defined, $(\hat{u}, \hat{y}) \overset{d}{=} (u, y)$
and therefore
$\hat{u}|[\hat{y}=y] \overset{d}{=} u|y$; that is, the posterior approximation
is exact. This special case is typically referred to as the linear Gaussian
setting, which I discuss in depth in
this post. When $\mathcal{G}$
is nonlinear and/or $\pi_0$ is non-Gaussian, then (6) will truly
be an approximation and the above equalities will not hold.
In the following subsections, we briefly review some properties of joint Gaussians and their conditional distributions. We work with the joint distribution (6), assuming the means and covariances are known. With the necessary background established, we then discuss practical algorithms for estimating these moments and producing approximations of $\pi$.
Gaussian Conditional Moments
Regardless of whether or not we are truly in the linear Gaussian setting, let us suppose that we have constructed the joint distribution (6). Using standard facts about Gaussian distributions, we know the conditional $\hat{u}|[\hat{y}=y]$ is also Gaussian with moments given by \begin{align} \hat{u}|[\hat{y}=y] &\sim \mathcal{N}(\hat{m}_{*}, \hat{C}_{*}) \tag{7} \newline \hat{m}_{*} &:= \overline{u} + \hat{C}^{uy}[\hat{C}^y]^{-1}(y - \overline{y}) \tag{8} \newline \hat{C}_{*} &:= \hat{C} - \hat{C}^{uy}[\hat{C}^y]^{-1}\hat{C}^{yu}. \tag{9} \end{align}
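The conditional moment formulas (8) and (9) translate directly into code. Below is a minimal numpy sketch; the function name and signature are my own, and we solve linear systems in $\hat{C}^y$ rather than forming its inverse explicitly.

```python
import numpy as np

def gaussian_condition(u_bar, y_bar, C, C_uy, C_y, y):
    """Moments (8)-(9) of the Gaussian conditional u-hat | y-hat = y,
    given the moments of the joint Gaussian (6)."""
    K = np.linalg.solve(C_y, C_uy.T).T        # gain C^{uy} [C^y]^{-1}
    m_star = u_bar + K @ (y - y_bar)          # conditional mean, eq. (8)
    C_star = C - K @ C_uy.T                   # conditional covariance, eq. (9)
    return m_star, C_star
```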
Gaussian Conditional Simulation
As an alternative to explicitly computing the conditional moments (8) and (9), we can consider a Monte Carlo representation of $\hat{u}|[\hat{y}=y]$. The conditional distribution can be directly simulated (without computing (8) and (9)) using the fact \begin{align} \hat{u}|[\hat{y}=y] &\overset{d}{=} \hat{u} + \hat{C}^{uy}[\hat{C}^y]^{-1}(y - \hat{y}), \tag{10} \end{align} which can be quickly verified by computing the mean and covariance of each side. Note that the randomness in the righthand side is inherited from the random variables $\hat{u}$ and $\hat{y}$ (distributed jointly according to (6)), while $y$ here is viewed as a specific realization of the data (and is thus non-random). The result (10), known as Matheron’s Rule, provides the basis for the following algorithm to draw independent samples from the conditional distribution $\hat{u}|[\hat{y}=y]$.
Gaussian Conditional Simulation via Matheron's Rule.
Independent samples $\hat{u}^{(1)}_{*}, \dots, \hat{u}^{(J)}_{*}$ can be simulated from the distribution $\hat{u}|[\hat{y}=y]$ by repeating the following procedure for each $j = 1, \dots, J$:
1. Draw a joint sample $(\hat{u}^{(j)}, \hat{y}^{(j)})$ from the Gaussian (6). Marginally, \begin{align} &\hat{u}^{(j)} \sim \mathcal{N}(\overline{u},\hat{C}), &&\hat{y}^{(j)} \sim \mathcal{N}(\overline{y},\hat{C}^y), \tag{11} \end{align} but note that $\hat{u}^{(j)}$ and $\hat{y}^{(j)}$ must be correlated according to (6); drawing them independently of one another would inflate the conditional covariance. 2. Return \begin{align} \hat{u}^{(j)}_{*} &:= \hat{u}^{(j)} + \hat{C}^{uy}[\hat{C}^y]^{-1}(y - \hat{y}^{(j)}). \tag{12} \end{align}
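A minimal numpy sketch of this procedure (vectorized over $j$); the function name and signature are my own.

```python
import numpy as np

def matheron_sample(u_bar, y_bar, C, C_uy, C_y, y, J, rng):
    """Draw J iid samples from u-hat | y-hat = y via Matheron's rule (10),
    given the moments of the joint Gaussian (6)."""
    d = len(u_bar)
    # Step 1: joint draws of (u-hat, y-hat) from (6); the pair must be
    # correlated, since independent marginal draws would give the wrong
    # conditional covariance.
    mean = np.concatenate([u_bar, y_bar])
    cov = np.block([[C, C_uy], [C_uy.T, C_y]])
    uy = rng.multivariate_normal(mean, cov, size=J)
    u_hat, y_hat = uy[:, :d], uy[:, d:]
    # Step 2: shift each sample toward the observed data y, eq. (12).
    K = np.linalg.solve(C_y, C_uy.T).T        # gain C^{uy} [C^y]^{-1}
    return u_hat + (y - y_hat) @ K.T
```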
Posterior Approximation Algorithm
Up to this point, we have not discussed the choice of the moments defining the joint Gaussian approximation (6). We now provide these definitions, leading to concrete algorithms for posterior approximation. We will adopt a Monte Carlo strategy by sampling independently from the true joint distribution defined by (1); i.e., $(u^{(j)}, y^{(j)}) \overset{iid}{\sim} p(u, y)$ for $j = 1, \dots, J$. We next compute the empirical means and covariances of these random draws and insert them into the joint Gaussian approximation (6). The subsection below focuses on the estimation of these moments, which we will then follow by utilizing the above Gaussian conditioning results to derive various approximations of $\pi$.
Estimating the Gaussian Moments
The first step in our approach requires generating the prior ensemble $\{u^{(j)}\}_{j=1}^{J}$, constructed by sampling according to model (1); i.e., \begin{align} &u^{(j)} \sim \pi_0, &&\epsilon^{(j)} \sim \mathcal{N}(0, \Sigma). \tag{14} \end{align} We now consider estimating the first two moments of this joint distribution. Starting with the $u$ marginal, we define the sample estimates \begin{align} \overline{u} &:= \frac{1}{J}\sum_{j=1}^{J} u^{(j)} \tag{15} \newline \hat{C} &:= \frac{1}{J-1}\sum_{j=1}^{J} (u^{(j)}-\overline{u})(u^{(j)}-\overline{u})^\top. \tag{16} \end{align} Alternatively, if the prior takes the form of a well-known distribution, then we can simply set $\overline{u}$ and/or $\hat{C}$ to the known moments of this distribution. We can likewise consider such mean and covariance estimates for the $\hat{y}$ portion of (6), defined with respect to the ensemble \begin{align} y^{(j)} &:= \mathcal{G}(u^{(j)}) + \epsilon^{(j)}, \qquad j = 1, \dots, J. \tag{17} \end{align} However, we can simplify matters a bit by performing part of the calculations analytically, owing to the simple additive Gaussian error structure. Noting that under (1) we have \begin{align} \mathbb{E}[y] &= \mathbb{E}[\mathcal{G}(u) + \epsilon] = \mathbb{E}[\mathcal{G}(u)] \tag{18} \newline \text{Cov}[y] &= \text{Cov}[\mathcal{G}(u) + \epsilon] = \text{Cov}[\mathcal{G}(u)] + \text{Cov}[\epsilon] = \text{Cov}[\mathcal{G}(u)] + \Sigma, \tag{19} \end{align} we can focus our efforts on substituting sample-based estimates for the first term in both (18) and (19). Doing so yields \begin{align} \overline{y} &:= \frac{1}{J} \sum_{j=1}^{J} \mathcal{G}(u^{(j)}) \tag{20} \newline \hat{C}^y &:= \frac{1}{J-1} \sum_{j=1}^{J} (\mathcal{G}(u^{(j)})-\overline{y})(\mathcal{G}(u^{(j)})-\overline{y})^\top + \Sigma. \tag{21} \end{align} We similarly define the last remaining quantity $\hat{C}^{uy}$ by noting that \begin{align} \text{Cov}[u, y] &= \text{Cov}[u, \mathcal{G}(u) + \epsilon] = \text{Cov}[u, \mathcal{G}(u)]. \tag{22} \end{align} We thus consider the estimator \begin{align} \hat{C}^{uy} &:= \frac{1}{J-1} \sum_{j=1}^{J} (u^{(j)} - \overline{u})(\mathcal{G}(u^{(j)}) - \overline{y})^\top. \tag{23} \end{align}
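The moment estimates (15), (16), and (20)-(23) are a few lines of numpy. A minimal sketch, with my own (hypothetical) function name and array conventions:

```python
import numpy as np

def estimate_moments(u_samples, G_samples, Sigma):
    """Sample moments (15)-(16) and (20)-(23) from a prior ensemble.
    u_samples: (J, d) array of draws u^(j) ~ pi_0.
    G_samples: (J, p) array of forward evaluations G(u^(j))."""
    J = u_samples.shape[0]
    u_bar = u_samples.mean(axis=0)                 # eq. (15)
    y_bar = G_samples.mean(axis=0)                 # eq. (20)
    du = u_samples - u_bar
    dG = G_samples - y_bar
    C = du.T @ du / (J - 1)                        # eq. (16)
    C_y = dG.T @ dG / (J - 1) + Sigma              # eq. (21), exact noise term
    C_uy = du.T @ dG / (J - 1)                     # eq. (23)
    return u_bar, y_bar, C, C_uy, C_y
```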
Gaussian Approximations
At this point, we have obtained the joint approximation (6) derived by computing the sample moments as described in the previous subsection. The question is now how best to use this joint approximation in approximating the true posterior. The most straightforward approach is to simply define our approximate posterior to be the Gaussian conditional $\mathcal{N}(\hat{m}_{*}, \hat{C}_{*})$. This conditional is available in closed-form and is defined in equations (7), (8), and (9). Suppose, for whatever reason, we are only interested in a sample-based representation of this Gaussian conditional. In this case, we can apply the algorithm summarized in (11) and (12) based on Matheron’s Rule. Explicitly, we construct an approximate posterior ensemble $\{\hat{u}^{(j)}_{*}\}_{j=1}^{J}$ via the update equation in (12), \begin{align} \hat{u}^{(j)}_{*} &:= \hat{u}^{(j)} + \hat{C}^{uy}[\hat{C}^y]^{-1}(y - \hat{y}^{(j)}), \tag{24} \end{align} for $j = 1, \dots, J$. At this point we should be quite clear about the definition of each term in (24). The mean and covariance estimates appearing in this expression are all derived from the ensemble $\{(u^{(j)}, y^{(j)})\}_{j=1}^{J}$ sampled from the true joint distribution implied by (1). On the other hand, $(\hat{u}^{(j)}, \hat{y}^{(j)})$ are samples from the approximate joint distribution defined in (6). This is simply an exact implementation of Matheron’s rule under the joint distribution (6); i.e., the resulting ensemble contains iid samples from $\mathcal{N}(\hat{m}_{*}, \hat{C}_{*})$. If it seems odd to dwell on this point, we note that we are primarily doing so to provide motivation for the alternative method discussed below.
Non-Gaussian Approximation via EnKF Update
As discussed above, a direct application of Matheron’s rule to (6) results in an ensemble of samples from a Gaussian distribution, which we interpret as a sample-based approximation to the true posterior. We now consider constructing the ensemble via the slightly different update rule \begin{align} u^{(j)}_{*} &:= u^{(j)} + \hat{C}^{uy}[\hat{C}^y]^{-1}(y - y^{(j)}), \tag{25} \end{align} for $j = 1, \dots, J$. This is precisely the update equation used in the analysis (i.e., update) step of the EnKF, so we refer to (25) as the EnKF update. The only difference from (24) is that we have replaced the samples $(\hat{u}^{(j)}, \hat{y}^{(j)})$ from (6) with the samples $(u^{(j)}, y^{(j)})$ from the true joint distribution, as defined in (14) and (17). In other words, we now utilize the samples from the initial ensemble (the same samples used to compute the mean and covariance estimates). Thus, equation (25) can be viewed as a particle-by-particle update to the initial ensemble, \begin{align} \{u^{(j)}\}_{j=1}^{J} \mapsto \{u^{(j)}_{*}\}_{j=1}^{J}. \tag{26} \end{align} The update (25) is somewhat heuristic, and it is difficult to assess the properties of the updated ensemble. Given that the initial ensemble will not in general constitute samples from a Gaussian (since $\pi_0$ is in general non-Gaussian), the updated ensemble will also generally be non-Gaussian distributed. If the inverse problem (1) is linear Gaussian, then the updates (24) and (25) are essentially equivalent. Beyond this special case, the hope is that the use of the true prior samples in (25) will better capture non-Gaussianity in the posterior distribution, as opposed to the Gaussian approximation implied by (24). The EnKF update can alternatively be justified from an optimization perspective, which we won’t discuss here; see my post on the EnKF for details. We conclude this section by summarizing the final algorithm.
Posterior Approximation via EnKF Update.
1. Generate the initial ensembles $\{u^{(j)}\}_{j=1}^{J}$ and $\{y^{(j)}\}_{j=1}^{J}$, where $u^{(j)} \overset{iid}{\sim} \pi_0$ and $y^{(j)} := \mathcal{G}(u^{(j)}) + \epsilon^{(j)}$ with $\epsilon^{(j)} \overset{iid}{\sim} \mathcal{N}(0, \Sigma)$.
2. Compute the sample estimates $\overline{u}$, $\hat{C}$, $\overline{y}$, $\hat{C}^y$, and $\hat{C}^{uy}$ as defined in (15), (16), and (20)-(23).
3. Return the updated ensemble by applying the EnKF update \begin{align} \hat{u}^{(j)}_* &:= u^{(j)} + \hat{C}^{uy}[\hat{C}^y]^{-1}(y-y^{(j)}), &&y^{(j)} \sim \mathcal{N}(\mathcal{G}(u^{(j)}), \Sigma). \tag{27} \end{align}
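The three steps above fit in a few lines of numpy. The following sketch assumes hypothetical user-supplied callables sample_prior(J, rng), returning a (J, d) array of prior draws, and forward, mapping a parameter vector to model outputs; neither name comes from the algorithm statement itself.

```python
import numpy as np

def enkf_posterior_ensemble(sample_prior, forward, y, Sigma, J, rng):
    """One-shot posterior approximation via the EnKF update (27)."""
    u = sample_prior(J, rng)                          # step 1: u^(j) ~ pi_0
    G = np.array([forward(ui) for ui in u])           # G(u^(j))
    eps = rng.multivariate_normal(np.zeros(len(y)), Sigma, size=J)
    y_ens = G + eps                                   # y^(j) ~ N(G(u^(j)), Sigma)

    # Step 2: sample moments (15)-(16) and (20)-(23).
    u_bar, y_bar = u.mean(axis=0), G.mean(axis=0)
    du, dG = u - u_bar, G - y_bar
    C_uy = du.T @ dG / (J - 1)
    C_y = dG.T @ dG / (J - 1) + Sigma

    # Step 3: particle-by-particle EnKF update (27).
    K = np.linalg.solve(C_y, C_uy.T).T                # gain C^{uy} [C^y]^{-1}
    return u + (y - y_ens) @ K.T
```

In the linear Gaussian setting, the returned ensemble is distributed according to the exact posterior as $J \to \infty$; beyond that setting it is the heuristic non-Gaussian approximation discussed above.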
Artificial Dynamics
As noted above, the approximation algorithm summarized in (27) can be viewed as performing an iteration of the EnKF algorithm. In this section, we take this idea further by encoding the structure of the inverse problem (1) in a dynamical system. Once these artificial dynamics are introduced, we can then consider applying standard algorithms for Bayesian estimation of time-evolving systems (e.g., the EnKF) to solve the inverse problem. We emphasize that (1), the original problem we are trying to solve, is static. The dynamics we introduce here are purely artificial, introduced for the purpose of making a mathematical connection with the vast field of time-varying systems. This allows us to port methods that have demonstrated success in the time-varying domain to our current problem of interest.
Prior-to-Posterior Map in Finite Time
We start by generically considering how to construct a sequence of probability distributions that maps from the prior to the posterior in a finite number of steps. We will find it convenient to write the likelihood with respect to the potential $\Phi$ defined in (3), such that the posterior distribution solving (1) is given by
\begin{align} \pi(u) &= \frac{1}{Z} \pi_0(u)\exp(-\Phi(u)), &&Z = \int \pi_0(u)\exp(-\Phi(u)) du. \tag{28} \end{align}
The following result describes a likelihood tempering approach to constructing the desired sequence of densities.
Prior-to-Posterior Sequence of Densities.
For a positive integer $K$, define the sequence of densities $\pi_0, \pi_1, \dots, \pi_K$ recursively by \begin{align} \pi_{k+1}(u) &:= \frac{1}{Z_{k+1}}\pi_k(u)\exp\left(-\frac{1}{K}\Phi(u)\right), &&Z_{k+1} := \int \pi_k(u)\exp\left(-\frac{1}{K}\Phi(u)\right) du, \tag{29} \end{align} for $k = 0, 1, \dots, K-1$. Then the final density satisfies $\pi_K = \pi$, where $\pi$ is defined in (28).
Proof. We first consider the normalizing constant at the final iteration: \begin{align} Z_K &= \int \pi_{K-1}(u)\exp\left(-\frac{1}{K}\Phi(u)\right) du \newline &= \frac{1}{Z_{K-1} \cdots Z_1} \int \pi_{0}(u)\left[\exp\left(-\frac{1}{K}\Phi(u)\right)\right]^K du \newline &= \frac{1}{Z_{K-1} \cdots Z_1} \int \pi_{0}(u)\exp\left(-\Phi(u)\right) du \newline &= \frac{Z}{Z_1 \cdots Z_{K-1}}, \end{align} where the second equality simply iterates the recursion (29). Rearranging, we see that \begin{align} Z_1 Z_2 \cdots Z_K = Z. \tag{30} \end{align}
We similarly iterate the recursion for $\pi_K$: \begin{align} \pi_K(u) &= \frac{1}{Z_K} \pi_{K-1}(u) \exp\left(-\frac{1}{K}\Phi(u)\right) \newline &= \frac{1}{Z_K Z_{K-1} \cdots Z_1} \pi_{0}(u) \left[\exp\left(-\frac{1}{K}\Phi(u)\right)\right]^K \newline &= \frac{1}{Z} \pi_{0}(u) \exp\left(-\Phi(u)\right) \newline &= \pi(u), \end{align} where the third equality uses (30). $\blacksquare$
The sequence defined by (29) essentially splits the single Bayesian inverse problem (1) into a sequence of $K$ inverse problems. In particular, the update $\pi_k \to \pi_{k+1}$ implies solving the inverse problem \begin{align} y|u &\sim \tilde{p}(y|u) \tag{31} \newline u &\sim \pi_k(u), \end{align} with the tempered likelihood \begin{align} \tilde{p}(y|u) &\propto \exp\left(-\frac{1}{K}\Phi(u)\right). \tag{32} \end{align} Each update thus encodes the action of conditioning on the data $y$, but under the modified likelihood (32), which “spreads out” the information in the data over the $K$ time steps. If we were to consider using the true likelihood $p(y|u)$ in each time step, this would artificially inflate the information content in $y$, as if we had $K$ independent data vectors instead of just one.
Gaussian Special Case
Suppose the likelihood is Gaussian, $p(y|u) = \mathcal{N}(y|\mathcal{G}(u), \Sigma)$, with associated potential $\Phi(u) = \frac{1}{2}\lVert y - \mathcal{G}(u)\rVert^2_{\Sigma}$. The tempered likelihood in this case is \begin{align} \tilde{p}(y|u) &\propto \exp\left(-\frac{1}{2}\lVert y - \mathcal{G}(u)\rVert^2_{K\Sigma}\right) \propto \mathcal{N}(y|\mathcal{G}(u), K\Sigma). \tag{33} \end{align} The modified likelihood remains Gaussian, and is simply the original likelihood with the variance inflated by a factor of $K$. This matches the intuition from above; the variance is increased to account for the fact that we are conditioning on the same data vector $K$ times.
If, in addition to the Gaussian likelihood, the prior $\pi_0$ is Gaussian and the map $\mathcal{G}$ is linear, then the final posterior and each intermediate distribution $\pi_k$ will also be Gaussian. In this case, the recursion (29) defines a sequence of linear Gaussian Bayesian inverse problems.
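In this linear Gaussian special case, the tempering claim can be checked numerically: $K$ exact conditioning steps with the inflated noise covariance $K\Sigma$ reproduce a single conditioning step with covariance $\Sigma$. A small sketch (all problem sizes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, K = 2, 2, 10
G = rng.normal(size=(p, d))               # linear forward map
m, C = np.zeros(d), np.eye(d)             # prior N(m, C)
Sigma = 0.3**2 * np.eye(p)
y = G @ rng.normal(size=d) + rng.multivariate_normal(np.zeros(p), Sigma)

def gaussian_update(m, C, noise_cov):
    """Exact linear-Gaussian conditioning of N(m, C) on y = Gu + noise."""
    S = G @ C @ G.T + noise_cov
    Kg = C @ G.T @ np.linalg.inv(S)       # Kalman gain
    return m + Kg @ (y - G @ m), C - Kg @ G @ C

# K tempered updates with inflated noise K*Sigma, per (29) and (33)...
mK, CK = m, C
for _ in range(K):
    mK, CK = gaussian_update(mK, CK, K * Sigma)

# ...agree with a single update using the original noise covariance.
m1, C1 = gaussian_update(m, C, Sigma)
assert np.allclose(mK, m1) and np.allclose(CK, C1)
```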
Introducing Artificial Dynamics
The previous section considered a discrete process on the level of densities; i.e., the dynamics (29) describe the evolution of $\pi_k$. Our goal is now to design an artificial dynamical system that treats $u$ as the state variable, such that $\pi_k$ describes the filtering distribution of the state at iteration $k$. In theory, we can then draw samples from $\pi = \pi_K$ by applying standard filtering algorithms to this artificial dynamical system.
There are many approaches we could take here, but let us start with the simplest. We know that the update $\pi_k \to \pi_{k+1}$ should encode the action of conditioning on $y$ with respect to the tempered likelihood. Thus, let’s consider the following dynamics and observation model: \begin{align} u_{k+1} &= u_k \tag{34} \newline y_{k+1} &= \mathcal{G}(u_{k+1}) + \epsilon_{k+1}, &&\epsilon_{k+1} \sim \mathcal{N}(0, K\Sigma) \tag{35} \newline u_0 &\sim \pi_0. \tag{36} \end{align} The lines (34) and (36) define our artificial dynamics in $u$, with the former providing the evolution equation and the latter the initial condition. These dynamics are rather uninteresting; the evolution operator is the identity, meaning that the state remains fixed at its initial condition. All of the interesting bits here come into play in the observation model (35). We observe that, by construction, the filtering distribution of this dynamical system at time step $k$ is given by $\pi_k$: \begin{align} \pi_k(u) &= p(u_k = u \mid y_1 = y, \dots, y_k = y). \tag{37} \end{align} To be clear, we emphasize that the quantities $y_k$ in the observation model (35) are random variables, and $y_k = y$ indicates the condition that the random variable $y_k$ is equal to the fixed data realization $y$.
Extending the State Space
We now provide an alternative, but equivalent, formulation that gives another useful perspective. Observe that the observation model (35) can be written as $y_k = Hv_k + \epsilon_k$, where we have defined \begin{align} H &:= \begin{bmatrix} 0 & I \end{bmatrix} \in \mathbb{R}^{p \times (d+p)}, &&v_k := \begin{bmatrix} u_k \newline \mathcal{G}(u_k) \end{bmatrix} \in \mathbb{R}^{d+p}. \tag{38} \end{align} We will now adjust the dynamical system (34)-(36) to describe the dynamics with respect to the state vector $v_k$: \begin{align} v_{k+1} &= v_k \tag{40} \newline y_{k+1} &= Hv_{k+1} + \epsilon_{k+1}, &&\epsilon_{k+1} \sim \mathcal{N}(0, K\Sigma) \tag{41} \newline u_0 &\sim \pi_0. \tag{42} \end{align} We continue to write the initial condition (42) with respect to $u_0$, but note that the distribution on $u_0$ induces an initial distribution for $v_0$. Why extend the state space in this way? For one, the observation operator in (41) is now linear. Linearity of the observation operator is a common assumption in the data assimilation literature, so satisfying this assumption allows us more flexibility in choosing a filtering algorithm. The main reason I opt to include the state space extension here is that this is the approach taken in Iglesias et al (2012), which is the first paper to systematically propose and analyze the application of the EnKF to inverse problems. In effect, if you have defined the EnKF with respect to a linear observation operator (as is commonly done), then the extended state space formulation allows you to extend the algorithm to the nonlinear case. As we will see, what you ultimately get is identical to the joint Gaussian approximation viewpoint used in deriving (27).
This extended state space formulation still gives rise to the sequence $\pi_k$ as before. However, the distribution $\pi_k$ is now a marginal of the filtering distribution for $v_k$ (the marginal corresponding to the first $d$ entries of $v_k$).
Applying Filtering Algorithms
We have now considered a couple options to formulate the Bayesian inverse problem (1) as a discrete dynamical system, such that the filtering distributions of the artificial system are given by $\pi_k$, as defined in (29). The distribution at the final time step satisfies $\pi_K = \pi$. This points to a possible algorithmic approach to posterior inference: if we can propagate an ensemble of particles such that they are distributed according to $\pi_k$ at time step $k$, then the ensemble will represent the posterior at time step $K$. Particle filtering methods might be considered to exactly implement this Monte Carlo scheme. However, in this post we consider approximate methods rooted in Kalman methodology. Specifically, let’s consider applying the EnKF to the model given in (40)-(42). We could also consider (34)-(36) and get the same result, but I’ll follow Iglesias et al (2012) in using the extended state space formulation to generalize the linear observation operator EnKF to nonlinear operators. A direct application of the EnKF to the system (40)-(42) gives the following recursive algorithm.
Posterior Approximation via EnKF Update in Finite Time.
Time k=0:
Generate the initial ensemble $\{v^{(j)}_0\}_{j=1}^{J}$, where \begin{align} &v^{(j)}_0 := \begin{bmatrix} u^{(j)}_0 \newline \mathcal{G}(u^{(j)}_0) \end{bmatrix}, &&u^{(j)}_0 \overset{iid}{\sim} \pi_0. \tag{43} \end{align} Time k+1:
1. Perform the forecast step: \begin{align} \hat{v}^{(j)}_{k+1} &:= v^{(j)}_k, \qquad j = 1, \dots, J. \tag{44} \end{align} Since the dynamics (40) are the identity, the forecast simply carries the current ensemble forward.
2. Compute the sample estimates: \begin{align} \overline{v}_{k+1} &:= \frac{1}{J} \sum_{j=1}^{J} \hat{v}^{(j)}_{k+1} \tag{45} \newline \hat{C}^v_{k+1} &:= \frac{1}{J-1} \sum_{j=1}^{J} (\hat{v}^{(j)}_{k+1} - \overline{v}_{k+1}) (\hat{v}^{(j)}_{k+1} - \overline{v}_{k+1})^\top. \end{align} 3. Perform the analysis step: \begin{align} v^{(j)}_{k+1} &:= \hat{v}^{(j)}_{k+1} + \hat{C}^{v}_{k+1}H^\top[H\hat{C}^{v}_{k+1}H^\top + K\Sigma]^{-1}(y-y^{(j)}), &&y^{(j)} \sim \mathcal{N}(H\hat{v}_{k+1}^{(j)}, K\Sigma). \tag{46} \end{align}
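A numpy sketch of the full finite-time iteration, reusing the hypothetical sample_prior/forward helpers assumed in the earlier sketches. Note that, reading the identity dynamics (40) literally, the forecast step (44) never re-evaluates $\mathcal{G}$; many EKI implementations instead re-apply $\mathcal{G}$ to the parameter components at each iteration, which matters once the analysis update breaks the relation $v^{(j)} = [u^{(j)}, \mathcal{G}(u^{(j)})]$.

```python
import numpy as np

def eki_finite_time(sample_prior, forward, y, Sigma, J, K, rng):
    """Finite-time EKI per (43)-(46): K EnKF analysis steps applied to the
    extended state v = [u, G(u)] with identity forecast dynamics and the
    noise covariance inflated to K*Sigma."""
    u0 = sample_prior(J, rng)                          # u_0^(j) ~ pi_0
    d, p = u0.shape[1], len(y)
    G0 = np.array([forward(ui) for ui in u0])
    v = np.hstack([u0, G0])                            # eq. (43)
    H = np.hstack([np.zeros((p, d)), np.eye(p)])       # eq. (38)

    for _ in range(K):
        v_hat = v                                      # forecast (44): identity
        v_bar = v_hat.mean(axis=0)                     # eq. (45)
        dv = v_hat - v_bar
        C_v = dv.T @ dv / (J - 1)
        # Analysis step (46) with perturbed observations y^(j).
        S = H @ C_v @ H.T + K * Sigma
        Kg = np.linalg.solve(S, H @ C_v).T             # gain C^v H^T S^{-1}
        y_pert = (v_hat @ H.T
                  + rng.multivariate_normal(np.zeros(p), K * Sigma, size=J))
        v = v_hat + (y - y_pert) @ Kg.T

    return v[:, :d]                                    # u-marginal approximates pi
```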
References
- The Ensemble Kalman Filter for Inverse Problems (Iglesias et al, 2012)