An Introduction to Bayesian Filtering and Smoothing

I provide an overview of the general framework for statistical filtering, viewed from the Bayesian perspective.

In this post I provide an overview of the filtering problem for discrete stochastic dynamical systems, which is nicely cast as a problem of Bayesian inference. In short, the problem we are trying to solve is as follows: suppose we are trying to predict the state of a dynamical process which evolves over time. For example, the process could be the global climate system and the state could be the temperature at some fixed locations. Suppose we have a model for the system of interest, which is stochastic in nature; the randomness might be inherent to the system under study or might come from uncertainties stemming from our lack of understanding of the dynamics. We could thus step our model forward in time to try to predict the true state of the system. However, suppose we have a second source of information: measurements of the process. Just like the dynamics, the observations are subject to noise, which in this case typically represents some sort of measurement error. The filtering problem seeks to combine these two imperfect sources of information to produce an estimate of the true state, as well as an estimate of the uncertainty - this will be made precise below. For other references on this topic, I recommend (Sanz-Alonso et al., 2023), (Särkkä, 2009), (Evensen et al., 2022), and (Reich & Cotter, 2015).

The Probabilistic State Space Model

The stochastic dynamics and observational process are defined below, formulated as a Markov state space model (i.e., a hidden Markov model), \begin{align} v_{k+1} &\sim G(v_k, \cdot) \tag{1} \newline y_{k+1} &\sim H(v_{k+1}, \cdot) \newline v_0 &\sim \mu_0, \end{align} where $G$ and $H$ are probability kernels defining the stochastic dynamics and observation process, respectively, and $v_0$ is the initial condition, distributed according to $\mu_0$. We will also make the following independence assumptions:

  1. For all $j, k$, the distributions $H(v_k, \cdot)$ and $G(v_j, \cdot)$ are independent.
  2. For all $j \neq k$, the distributions $H(v_k, \cdot)$ and $H(v_j, \cdot)$ are independent.
  3. For all $j \neq k$, the distributions $G(v_k, \cdot)$ and $G(v_j, \cdot)$ are independent.
  4. The initial condition $v_0$ is independent of all of these distributions as well.

In words, the first assumption requires that the dynamical noise be independent of the observation noise, the second assumes the observation noise is independent across time, and the third implies a time-homogeneous Markov model for the dynamics. This state space model is quite general; in a series of posts on various filtering methods we will focus on special cases of this general formulation. We will often restrict our attention to the popular case of an additive noise model, in which the dynamics $G(v_k, \cdot)$ are given by a deterministic update $g(v_k)$ plus some noise $\eta_{k+1}$, and similarly for the observation process; concretely, \begin{align} v_{k+1} &= g(v_k) + \eta_{k+1} \tag{2} \newline y_{k+1} &= h(v_{k+1}) + \epsilon_{k+1}, \newline v_0 &\sim \mu_0, \end{align}

where we emphasize that in general the non-random functions $g$ and $h$ may be nonlinear and the random variables $\{\eta_k\}$, $\{\epsilon_k\}$ may be non-Gaussian. The independence assumptions are easier to state in this special case, reducing to the assumption that $\{\eta_k\} \perp \{\epsilon_k\} \perp v_0$; that is, all of these random variables are pairwise independent. For example, $\eta_k$ is independent of all other $\eta_j$ and all $\epsilon_j$, in addition to $v_0$. In future posts, we will consider algorithms that are specialized to the additive noise setting, but in this introduction all of the calculations will go through under the generic formulation (1). To complete our setup, we assume $v_k \in \mathbb{R}^d$ and $y_k \in \mathbb{R}^n$ and introduce the notation $Y_k := \{y_1, \dots, y_k\}$, the set of all observations up through time $k$. We similarly write $V_k := \{v_0, v_1, \dots, v_k\}$ to denote the set of states through time $k$. For $\ell \leq k$, we will also occasionally make use of the shorthand $V_{\ell:k} := \{v_{\ell}, v_{\ell+1}, \dots, v_{k}\}$ and likewise for $Y_{\ell:k}$.
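To make the additive-noise formulation (2) concrete, here is a minimal simulation sketch for a toy one-dimensional model; the particular choices of $g$, $h$, the Gaussian noises, and the noise scales are illustrative assumptions, not part of the general setup.

```python
import numpy as np

def simulate(K, g, h, q_std, r_std, mu0_mean=0.0, mu0_std=1.0, seed=0):
    """Simulate the additive-noise state space model (2) for K steps."""
    rng = np.random.default_rng(seed)
    v = np.empty(K + 1)
    y = np.full(K + 1, np.nan)  # y[0] is a placeholder; observations start at k = 1
    v[0] = rng.normal(mu0_mean, mu0_std)  # v_0 ~ mu_0
    for k in range(K):
        v[k + 1] = g(v[k]) + rng.normal(0.0, q_std)      # dynamics noise eta_{k+1}
        y[k + 1] = h(v[k + 1]) + rng.normal(0.0, r_std)  # observation noise eps_{k+1}
    return v, y

# Toy scalar model: g(v) = sin(v), h(v) = v.
v, y = simulate(K=50, g=np.sin, h=lambda v: v, q_std=0.1, r_std=0.5)
```

Trajectories like these will serve as the ground truth and data when testing filtering algorithms numerically.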

A Pure Probabilistic Modeling Perspective

Defining the probability kernels $G$ and $H$ is a convenient way to encode the probabilistic structure of the state space model. We can alternatively take a very generic probabilistic perspective and simply view the model as defining a joint distribution over all random quantities; supposing that we’re considering a fixed time window $0 \leq k \leq K$, the joint distribution is \begin{align} p(V_K, Y_K) = p(v_0, v_1, \dots, v_K, y_1, \dots, y_K). \end{align} In general, we could define this distribution however we want and make it quite complicated. The Markovian and independence assumptions vastly simplify things, making the distribution much easier to work with. These assumptions are as follows:

  1. The Markov property for state transitions. \begin{align} p(v_k|V_{k-1}, Y_{k-1}) &= p(v_k|v_{k-1}) \newline p(v_{k-1}|V_{k:K}, Y_{k:K}) &= p(v_{k-1}|v_k) \end{align}
  2. Conditional independence for observations. \begin{align} p(y_k|V_k, Y_{k-1}) &= p(y_k|v_k) \end{align}

It is this perspective that is taken in (Särkkä, 2009) (see page 52).

The Filtering and Smoothing Problems

We are now ready to define the two primary problems of interest in this setting, filtering and smoothing, plus a bonus problem which we will encounter as an intermediate task in addressing the first two.

  1. The filtering problem is to characterize the conditional distribution $\mu_k := \mathcal{L}(v_k|Y_k)$ ($\mathcal{L}$ for law); in Bayesian parlance, this is the posterior distribution of the current state $v_k$ given all of the data observed up through time $k$. We call $\mu_k$ the filtering distribution at time $k$, which is assumed to have a Lebesgue density $\pi_k$.
  2. The smoothing problem is to characterize the conditional distribution $\mathcal{L}(v_k|Y_K)$ with $K > k$; that is, future observations are used to inform inference regarding past states. The distribution $\mathcal{L}(v_k|Y_K)$ is referred to as the smoothing distribution.
  3. The prediction problem is to characterize the distribution $\mathcal{L}(v_{k+l}|Y_k)$; that is, to perform inference about a future state using observations which are not fully up-to-date. We call $\mathcal{L}(v_{k+l}|Y_k)$ the prediction, or forecast, distribution.

These definitions convey the sequential nature of these problems; the goal is not limited to performing inference for a fixed $k$, but extends to developing algorithms that are easily updated when new data arrive at future times. We can also view this from a more traditional static inference perspective by considering the distribution $\mathcal{L}(V_K|Y_K)$. This joint distribution encodes both the filtering and smoothing distributions. Indeed, its marginal distribution in the last slot recovers the filtering distribution at time $K$. Moreover, for any $k < K$, the marginal consisting of the first $k$ slots is a smoothing distribution. Note that some authors choose to define the filtering and smoothing distributions slightly differently. For example, Sanz-Alonso, Stuart, and Taeb actually call the joint distribution $\mathcal{L}(V_K|Y_K)$ the smoothing distribution, while Särkkä reserves this term for the case where the time index of the state under consideration is strictly less than the time index of the available data. The former approach serves to emphasize the key conceptual difference between sequentially updating $\mathcal{L}(v_k|Y_k)$ and performing one-shot inference for $\mathcal{L}(V_K|Y_K)$.

Finally, a few notational notes. In most of the subsequent exposition, I will assume that all of the distributions in question admit Lebesgue densities, and abuse notation by writing $p(\cdot)$ for most densities, with the special notation $\pi_k(\cdot)$ reserved for the density of the filtering distribution $\mu_k$.

The Bayesian Perspective

Let’s consider a fixed time horizon $1 \leq k \leq K$ here, but note that our primary interest will ultimately center on how to update the current filtering distribution at later times: $K+1$, $K+2$, etc. A Bayesian analysis assumes all of the quantities under consideration are governed by probability distributions. In this case, the quantities are the states $V_K$ and the observations $Y_K$. In this context, the states $V_K$ are the unobserved quantities of interest, the role typically assumed by the parameters in a Bayesian statistical model. The stochastic dynamics model provides the prior distribution $p(V_K)$ over these unknown quantities, while the observation process supplies the likelihood which connects the observed data $Y_K$ to the unobserved states $V_K$. The state space model defined above provides the information needed to define the joint distribution over all the variables $\{V_K, Y_K\} = \{v_0, v_1, \dots, v_K, y_1, \dots, y_K\}$. This joint distribution assumes a very simple form due to the Markovian assumption on the dynamics and the independence assumptions stated above, \begin{align} p(V_K, Y_K) &= p(Y_K|V_K)p(V_K) \newline &= \left[\prod_{k=1}^{K} p(y_k|v_k)\right] \cdot \left[\prod_{k=1}^{K} p(v_k|v_{k-1}) \right] \cdot p(v_0) \newline &= p(v_0) \prod_{k=1}^{K} p(v_k|v_{k-1}) p(y_k|v_k). \end{align} The first product in the second line follows from the second independence assumption, while the second uses the Markovian structure in the dynamics. Note that the conditional distribution $p(V_K|Y_K)$ (from which the filtering and smoothing distributions of interest can be recovered by marginalization) is proportional to $p(V_K, Y_K)$. We will use this fact below when considering how to sample from these distributions.
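The factorized joint density is straightforward to compute term by term. As a sketch, here is the log joint density for the one-dimensional additive Gaussian-noise case; the Gaussian noise and prior choices are illustrative assumptions.

```python
import numpy as np

def log_gauss(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def log_joint(v, y, g, h, q_std, r_std, mu0_mean=0.0, mu0_std=1.0):
    """log p(V_K, Y_K) = log p(v_0) + sum_{k=1}^K [log p(v_k|v_{k-1}) + log p(y_k|v_k)]."""
    lp = log_gauss(v[0], mu0_mean, mu0_std)       # prior term p(v_0)
    for k in range(1, len(v)):
        lp += log_gauss(v[k], g(v[k - 1]), q_std)  # transition term p(v_k|v_{k-1})
        lp += log_gauss(y[k], h(v[k]), r_std)      # likelihood term p(y_k|v_k)
    return lp
```

Working in log space avoids the numerical underflow that the raw product of many small densities would cause.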

Can we just do MCMC?

Since we’re keeping things quite general in this post, nonlinearities and non-Gaussianity in the state space equations will unsurprisingly lead to filtering distributions which cannot be computed in closed form. In this section we consider how we might go about sampling from $\mu_k$ using Markov chain Monte Carlo (MCMC), and in doing so highlight some limitations of this approach. Let’s consider designing an MCMC sampler targeting $p(V_K|Y_K)$, recalling that this would yield both the filtering distribution at time $K$ and all of the smoothing distributions up through time $K$. The structure of this problem lends itself to an MCMC scheme that updates each state one at a time; that is, either a Gibbs or Metropolis-within-Gibbs algorithm, depending on whether or not it is possible to directly sample the conditionals $p(v_k|V_{-k}, Y_K)$ (I use the notation $V_{-k} := \{v_0, \dots, v_{k-1}, v_{k+1}, \dots, v_K\}$). Recalling the form of the joint distribution, these conditionals are given by \begin{align} p(v_k|V_{-k}, Y_K) &\propto p(y_k|v_k)p(v_k|v_{k-1})p(v_{k+1}|v_k). \end{align} Aside from the issue that the MCMC sampler might fail to mix well when the state dimension $d$ or number of time steps $K$ is large, this approach has one other glaring issue. If we acquire a single new data point $y_{K+1}$, then the sampler must be run from scratch to sample the new posterior $p(V_{K+1}|Y_{K+1})$. Since the dimensionality of the state vector grows as $K$ increases, each subsequent MCMC run becomes more challenging and ultimately computationally infeasible. This motivates approaches which are amenable to fast updates when new data come online. The form of the filtering distribution makes it particularly suitable for such recursive (i.e., sequential) inference schemes.
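As an illustration of the Metropolis-within-Gibbs idea, here is a sketch for a one-dimensional additive Gaussian-noise model. The random-walk proposal, the step size, and the choice to hold $v_0$ fixed are simplifying assumptions made purely for illustration.

```python
import numpy as np

def log_conditional(vk, k, v, y, g, h, q_std, r_std):
    """log p(v_k | V_{-k}, Y_K) up to an additive constant (Gaussian noises)."""
    lp = -0.5 * ((y[k] - h(vk)) / r_std) ** 2        # p(y_k|v_k)
    lp += -0.5 * ((vk - g(v[k - 1])) / q_std) ** 2   # p(v_k|v_{k-1})
    if k + 1 < len(v):                               # last state has no successor term
        lp += -0.5 * ((v[k + 1] - g(vk)) / q_std) ** 2  # p(v_{k+1}|v_k)
    return lp

def mwg_sweep(v, y, g, h, q_std, r_std, step=0.3, rng=None):
    """One random-walk Metropolis-within-Gibbs sweep over v_1, ..., v_K
    (v_0 is held fixed here for simplicity)."""
    if rng is None:
        rng = np.random.default_rng()
    for k in range(1, len(v)):
        prop = v[k] + step * rng.standard_normal()   # random-walk proposal
        log_a = (log_conditional(prop, k, v, y, g, h, q_std, r_std)
                 - log_conditional(v[k], k, v, y, g, h, q_std, r_std))
        if np.log(rng.uniform()) < log_a:            # Metropolis accept/reject
            v[k] = prop
    return v
```

Note that the sweep touches every one of the $K$ states; when a new observation arrives, the whole chain over the enlarged state vector must be re-run, which is exactly the limitation discussed above.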

Sequential Inference for the Filtering Problem

Motivated by the previous section, we seek solutions which allow an estimate of the filtering distribution to be updated sequentially as time progresses. We begin by considering what exact inference would entail in this setting; i.e., the computations required for updating the filtering density in closed form. Think about what is required to map $\mu_k$ to $\mu_{k+1}$: first, the dynamics model generates a prediction for $v_{k+1}$, and then we receive a new observation $y_{k+1}$. These two sources of information are combined to produce the new filtering distribution $\mu_{k+1}$. Thus, it is natural to break the update from $\mu_k$ to $\mu_{k+1}$ into two steps: prediction and analysis. Assuming we already know $\mu_k$, we now characterize these two steps, the composition of which results in the map $\mu_k \mapsto \mu_{k+1}$.

Forecast

The first step concerns the map $\mu_k \mapsto \mathcal{L}(v_{k+1}|Y_k)$, which represents the state of knowledge about $v_{k+1}$ given the model predictions through time $k+1$, but only the observations up through time $k$. This is the forecast distribution defined above, with time lag $l=1$. We will denote this special case of the forecast distribution by $\hat{\mu}_{k+1}$, with density $\hat{\pi}_{k+1}(\cdot) = p(\cdot|Y_k)$. The forecast distribution may be derived by first considering the joint distribution \begin{align} p(v_k, v_{k+1}|Y_k) &= p(v_{k+1}|v_k, Y_k) p(v_k|Y_k) = p(v_{k+1}|v_k)p(v_k|Y_k) = p(v_{k+1}|v_k)\pi_k(v_k), \end{align} where the second equality uses the Markov assumption. The forecast distribution then follows by marginalizing out $v_k$, \begin{align} \hat{\pi}_{k+1}(v_{k+1}) &= p(v_{k+1}|Y_k) = \int p(v_k, v_{k+1}|Y_k)\, dv_k = \int p(v_{k+1}|v_k)\pi_k(v_k)\, dv_k. \end{align} This is a version of the Chapman-Kolmogorov equation applied to the Markovian dynamics model, conditional on $Y_k$. We observe that the current filtering density $\pi_k(v_k)$ is transformed into the forecast density $\hat{\pi}_{k+1}(v_{k+1})$ by means of the above integral, which defines the map $\mu_k \mapsto \hat{\mu}_{k+1}$. Note that this is alternatively referred to as the predict step, and the associated distribution the predictive distribution.
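If we have samples (particles) approximating $\pi_k$, the Chapman-Kolmogorov integral can be realized by Monte Carlo: drawing one transition $v_{k+1} \sim p(\cdot|v_k)$ per particle produces samples whose marginal is exactly $\hat{\pi}_{k+1}$. A sketch, assuming additive Gaussian dynamics noise:

```python
import numpy as np

def forecast_step(particles, g, q_std, rng):
    """Map samples from pi_k to samples from pi_hat_{k+1}: one draw from the
    transition density per particle realizes the Chapman-Kolmogorov integral
    by Monte Carlo (additive Gaussian dynamics noise assumed)."""
    return g(particles) + q_std * rng.standard_normal(particles.shape)
```

This is the prediction step of ensemble and particle methods: the integral over $v_k$ is never computed explicitly; it is replaced by propagating each sample.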

Analysis

We have decomposed the map $\mu_k \mapsto \mu_{k+1}$ as the composition $\mu_k \mapsto \hat{\mu}_{k+1} \mapsto \mu_{k+1}$. The analysis step is defined to be the second map in this composition. It involves combining the forecast distribution with the new observation $y_{k+1}$, and thus simply requires an application of Bayes’ theorem: \begin{align} \pi_{k+1}(v_{k+1}) = p(v_{k+1}|Y_{k+1}) &= p(v_{k+1}|Y_{k}, y_{k+1}) \newline &= \frac{p(y_{k+1}|v_{k+1},Y_k)p(v_{k+1}|Y_k)}{p(y_{k+1}|Y_k)} \newline &= \frac{p(y_{k+1}|v_{k+1})\hat{\pi}_{k+1}(v_{k+1})}{p(y_{k+1}|Y_k)}, \end{align} where the final line uses the independence assumptions and the definition of the forecast distribution. Note that Bayes’ theorem is being applied with respect to the data $y_{k+1}$, with everything conditional on $Y_k$; in the context of the analysis step, the forecast distribution assumes the role of the prior distribution on the state $v_{k+1}$. Alternative names for this step include the update or correction step.
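The analysis step can likewise be approximated with samples, via importance weighting: each forecast particle is weighted by the likelihood $p(y_{k+1}|v_{k+1})$ and the ensemble is resampled in proportion to the weights. A sketch, assuming additive Gaussian observation noise:

```python
import numpy as np

def analysis_step(particles, y_new, h, r_std, rng):
    """Approximate the Bayes update pi_hat_{k+1} -> pi_{k+1}: weight each
    forecast particle by the likelihood (Gaussian observation noise assumed)
    and resample in proportion to the weights."""
    logw = -0.5 * ((y_new - h(particles)) / r_std) ** 2
    w = np.exp(logw - logw.max())   # subtract the max for numerical stability
    w /= w.sum()                    # self-normalized importance weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]
```

The self-normalization of the weights plays the role of the denominator $p(y_{k+1}|Y_k)$, which never needs to be evaluated explicitly.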

Viewing the Filtering Update as an Operator on Probability Distributions

In the previous section we viewed the map $\mu_k \mapsto \hat{\mu}_{k+1} \mapsto \mu_{k+1}$ through the manner in which it transforms probability density functions. Here we take a step back and appreciate the structure of these maps from a more generic perspective.

Prediction

Abstractly, we can view the prediction step as the result of feeding the current distribution $\mu_k$ through the one-step Markov kernel $G$, \begin{align} \hat{\mu}_{k+1}(\cdot) &= (\mu_k G)(\cdot) := \int G(v_k, \cdot)\, \mu_k(dv_k). \end{align} Viewing the integral as an operation on $\mu_k$, observe that this map is linear, even though the dynamics model may apply a nonlinear map to the states.

Analysis

The operator mapping $\hat{\mu}_{k+1}$ to $\mu_{k+1}$ arises from Bayes’ rule, and thus can be thought of as the composition of a likelihood operator followed by a normalization operator. Assuming that the observation process admits probability densities, Bayes’ rule can be abstractly stated as \begin{align} \frac{d\mu_{k+1}}{d\hat{\mu}_{k+1}}(v_{k+1}) &= \frac{p(y_{k+1}|v_{k+1})}{\int p(y_{k+1}|v_{k+1})\, \hat{\mu}_{k+1}(dv_{k+1})}, \end{align} which expresses the idea that the posterior is a re-weighted version of the prior, with the weights provided by the likelihood $p(y_{k+1}|v_{k+1})$; the denominator is simply a normalization constant. By definition of the Radon-Nikodym derivative, this yields

\begin{align} \mu_{k+1}(A) &= \frac{\int_A p(y_{k+1}|v_{k+1}) \hat{\mu}_{k+1}(dv_{k+1})}{\int p(y_{k+1}|v_{k+1}) \hat{\mu}_{k+1}(dv_{k+1})} \end{align}

where $A$ is a measurable set. Viewing this as an operation on $\hat{\mu}_{k+1}$, observe that the numerator represents a linear transformation of the probability distribution. However, the operation defined by the denominator is non-linear, and hence the map from $\hat{\mu}_{k+1}$ to $\mu_{k+1}$ is non-linear. Denoting this map by $\mathcal{A}$ (for analysis), we can write $\mu_{k+1} = \mathcal{A}(\hat{\mu}_{k+1})$.

In this section we have taken an alternative view of the filtering problem. Instead of considering maps acting on states, we view the state space model as defining maps which act on probability measures. We showed that the update of the filtering distribution from $\mu_k$ to $\mu_{k+1}$ can be decomposed as $\mu_k \mapsto \hat{\mu}_{k+1} \mapsto \mu_{k+1}$, where the first (prediction) operator is linear and Markovian, while the second (analysis) operator is a non-linear likelihood map. Thus, the complete map between subsequent filtering distributions is given by \begin{align} \mu_{k+1} = \mathcal{A}(\mu_k G). \end{align} Many filtering algorithms can be viewed as providing approximations of these two component operators.
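The bootstrap particle filter is perhaps the simplest example of such an approximation: it alternates a sampled version of the prediction operator $\mu_k G$ with a weight-and-resample approximation of the analysis operator $\mathcal{A}$. A minimal one-dimensional sketch, assuming additive Gaussian noises and an illustrative $N(0,1)$ initial distribution:

```python
import numpy as np

def bootstrap_filter(y, g, h, q_std, r_std, n=1000, seed=0):
    """Bootstrap particle filter: each step applies a sampled version of the
    prediction operator mu_k G, then a weight-and-resample approximation of
    the analysis operator A. Assumes additive Gaussian noises and an
    illustrative N(0, 1) initial distribution; y[0] is an unused placeholder."""
    rng = np.random.default_rng(seed)
    particles = rng.standard_normal(n)  # samples from mu_0
    means = []
    for k in range(1, len(y)):
        # Prediction: push the ensemble through the dynamics (mu_k G).
        particles = g(particles) + q_std * rng.standard_normal(n)
        # Analysis: reweight by the likelihood and resample (A).
        logw = -0.5 * ((y[k] - h(particles)) / r_std) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        particles = particles[rng.choice(n, size=n, p=w)]
        means.append(particles.mean())  # filtering mean estimate at time k
    return np.array(means)
```

Crucially, processing $y_{K+1}$ requires only one more prediction/analysis cycle on the current ensemble, in contrast to the MCMC approach, which would need to be re-run from scratch.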

Addressing the Smoothing Problem

We now turn from filtering to smoothing, where data up through the current time step are used to inform estimates of past states; that is, the target distribution here is $p(v_k|Y_K)$ with $K > k$. Given this backward-looking nature, this problem is also referred to as reanalysis. We will see that the smoothing problem is strictly more difficult than the filtering problem; in fact, smoothing algorithms typically require solving the filtering problem as an initial step. In the filtering section, we established the recursion $p(v_k|Y_k) \to p(v_{k+1}|Y_{k+1})$, which naturally breaks into the forecast and analysis steps. In seeking a similar result for the smoothing problem we will instead find the relationship \begin{align} p(v_k|Y_K) &= \pi_k(v_k) \int \frac{p(v_{k+1}|v_k)}{p(v_{k+1}|Y_k)} p(v_{k+1}|Y_K)\, dv_{k+1}, \end{align} which establishes a backward recursion $p(v_{k+1}|Y_K) \to p(v_k|Y_K)$. Before deriving this, let’s spend a minute thinking about why it makes sense. The smoothing distribution at time $k$ depends on a product of two terms. The first is the filtering distribution at time $k$, which captures the information up through the current time. The information coming from the future is encoded in the second term, which can be viewed as a weighted average of the smoothing distribution at time $k+1$, $p(v_{k+1}|Y_K)$. The weights in this average are $\frac{p(v_{k+1}|v_k)}{p(v_{k+1}|Y_k)}$, the ratio of the distribution predicted by the dynamics alone over the forecast distribution. As an example, consider the extreme case in which this ratio always equals $1$, which causes the integral to evaluate to $1$ and hence $p(v_k|Y_K)$ to equal the filtering distribution $p(v_k|Y_k)$. Does this make sense? Well, if the weights in the integral are always one, then conditional on $Y_k$, knowing $v_k$ doesn’t help predict $v_{k+1}$. So given $Y_k$, the states $v_k$ and $v_{k+1}$ are essentially unrelated, and the link between $v_k$ and any useful information from the future has been severed. All of the useful information is already contained in the filtering distribution.

Before deriving the recursion, I emphasize again that (1) the formula defines a backward recursion; and (2) the formula depends on the filtering distribution. The consequences of this will become clear in future posts, where we see that many smoothing algorithms adopt a two-stage “forward filter, backward sample” structure. Without further ado, the backward recursion is derived below. \begin{align} p(v_k|Y_K) &= \int p(v_k, v_{k+1}|Y_K)\, dv_{k+1} \newline &= \int p(v_k|v_{k+1}, Y_K)p(v_{k+1}|Y_K)\, dv_{k+1} \newline &= \int p(v_k|v_{k+1}, Y_k)p(v_{k+1}|Y_K)\, dv_{k+1} \newline &= \int \frac{p(v_{k+1}|v_k, Y_k)p(v_k|Y_k)}{p(v_{k+1}|Y_k)} p(v_{k+1}|Y_K)\, dv_{k+1} \newline &= \int \frac{p(v_{k+1}|v_k)\pi_k(v_k)}{p(v_{k+1}|Y_k)} p(v_{k+1}|Y_K)\, dv_{k+1} \newline &= \pi_k(v_k) \int \frac{p(v_{k+1}|v_k)}{p(v_{k+1}|Y_k)} p(v_{k+1}|Y_K)\, dv_{k+1}. \end{align} The third and fifth lines follow from the conditional independence assumptions, while the fourth applies Bayes’ rule.
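For a one-dimensional model, both the forward filtering pass and this backward recursion can be carried out numerically on a grid, which makes the "forward filter, backward smooth" structure explicit. The following sketch discretizes all densities over a fixed grid; the Gaussian prior and additive Gaussian noises are illustrative assumptions.

```python
import numpy as np

def grid_smoother(y, g, h, q_std, r_std, grid):
    """Forward filter, then apply the backward smoothing recursion, with all
    densities represented as vectors over `grid` (1-d toy model; Gaussian
    noises and N(0, 1) prior assumed; y[0] is an unused placeholder)."""
    dx = grid[1] - grid[0]
    # T[i, j] approximates the transition density p(v_{k+1} = grid[j] | v_k = grid[i]).
    T = np.exp(-0.5 * ((grid[None, :] - g(grid)[:, None]) / q_std) ** 2)
    T /= T.sum(axis=1, keepdims=True) * dx   # normalize each row as a density
    # Forward pass: filtering densities pi_k and forecast densities pi_hat_{k+1}.
    filt = np.exp(-0.5 * grid ** 2)
    filt /= filt.sum() * dx                  # pi_0 = N(0, 1), discretized
    filts, preds = [filt], []
    for k in range(1, len(y)):
        pred = (filt * dx) @ T               # Chapman-Kolmogorov (forecast step)
        lik = np.exp(-0.5 * ((y[k] - h(grid)) / r_std) ** 2)
        filt = pred * lik
        filt /= filt.sum() * dx              # Bayes update (analysis step)
        filts.append(filt)
        preds.append(pred)                   # preds[k-1] stores pi_hat_k
    # Backward pass: p(v_k|Y_K) = pi_k(v_k) * int T(v_k, v') / pi_hat_{k+1}(v') * smooth(v') dv'.
    smooth = filts[-1]                       # at time K, smoothing = filtering
    smooths = [smooth]
    for k in range(len(y) - 2, -1, -1):
        ratio = smooth / preds[k]            # preds[k] is pi_hat_{k+1}
        smooth = filts[k] * ((T * dx) @ ratio)
        smooth /= smooth.sum() * dx          # renormalize against discretization error
        smooths.append(smooth)
    return filts, smooths[::-1]
```

The two loops mirror the structure emphasized above: the backward pass cannot begin until the forward pass has produced every filtering and forecast density it needs.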

  1. Sanz-Alonso, D., Stuart, A. M., & Taeb, A. (2023). Inverse Problems and Data Assimilation. https://arxiv.org/abs/1810.06191
  2. Särkkä, S. (2009). Bayesian Estimation of Time-Varying Systems: Discrete-Time Systems. https://api.semanticscholar.org/CorpusID:5616218
  3. Evensen, G., Vossepoel, F. C., & van Leeuwen, P. J. (2022). Data Assimilation Fundamentals: A Unified Formulation of the State and Parameter Estimation Problem. https://library.oapen.org/handle/20.500.12657/54434
  4. Reich, S., & Cotter, C. (2015). Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press.