Preliminaries
{% katexmm %} Let \(x, y \in \mathbb{R}^n\). Throughout this post, we write \(\langle x, y \rangle = x^\top y\) for the standard inner product on \(\mathbb{R}^n\), and \(\lVert x \rVert_2 = \sqrt{\langle x, x \rangle}\) the norm induced by this inner product. We write \(L(\mathcal{X}, \mathcal{Y})\) to denote the set of linear maps from \(\mathcal{X}\) to \(\mathcal{Y}\). We will frequently consider linear functions of the form \(\ell: \mathbb{R}^n \to \mathbb{R}\), and denote the set of all such functions as \((\mathbb{R}^n)^* := L(\mathbb{R}^n, \mathbb{R})\). Every \(\ell \in (\mathbb{R}^n)^*\) can be uniquely represented as an inner product with some vector \(y \in \mathbb{R}^n\). When we wish to make this identification explicit, we will write \(\ell_y\) to denote the linear map given by \(\ell_y(x) = \langle x, y \rangle\). Likewise, if we are working with a generic linear map \(\ell \in (\mathbb{R}^n)^*\), then we will write \(y_{\ell} \in \mathbb{R}^n\) to denote the unique vector satisfying \(\ell(x) = \langle x, y_{\ell} \rangle\). We will also loosely refer to \(\ell_y(x)\) as a projection onto \(y\). Note that if \(y\) has unit norm, then this is precisely the magnitude of the orthogonal projection of \(x\) onto \(y\). {% endkatexmm %}
Sigma Algebra
{% katexmm %} We recall from the previous post that a univariate Gaussian measure is defined on the Borel \(\sigma\)-algebra \(\mathcal{B}(\mathbb{R})\). Analogously, we will define an \(n\)-dimensional Gaussian measure on the Borel sets \(\mathcal{B}(\mathbb{R}^n)\). There are two reasonable approaches to define \(\mathcal{B}(\mathbb{R}^n)\), and I want to take a moment to highlight them since the same two options will present themselves when we consider defining the Borel sets over infinite-dimensional spaces.
Option 1: Leverage the Standard Topology on \(\mathbb{R}^n\)
A Borel \(\sigma\)-algebra can be defined for any space that comes equipped with a topology; i.e., a collection of open sets. The Borel \(\sigma\)-algebra is then defined as the smallest \(\sigma\)-algebra that contains all of these open sets. In the present setting, this means \[ \mathcal{B}(\mathbb{R}^n) := \sigma\left\{\mathcal{O} \subseteq \mathbb{R}^n : \mathcal{O} \text{ is open} \right\}, \tag{1} \] where \(\sigma(\mathcal{S})\) denotes the \(\sigma\)-algebra generated by a collection of sets \(\mathcal{S}\). A nice perspective on \(\mathcal{B}(\mathbb{R}^n)\) is that it is the smallest \(\sigma\)-algebra that ensures all continuous functions \(f: \mathbb{R}^n \to \mathbb{R}\) are measurable. We note that this is not a property of Borel \(\sigma\)-algebras more generally, but one that does hold in the special case of \(\mathbb{R}^n\); see this StackExchange post for some details.
Option 2: Product of One-Dimensional Borel Sets
A second reasonable approach is to try extending what we have already defined in one dimension, which means simply taking Cartesian products of one-dimensional Borel sets: \[ \mathcal{B}(\mathbb{R}^n) := \sigma\left\{B_1 \times \cdots \times B_n: B_i \in \mathcal{B}(\mathbb{R}), \ i = 1, \dots, n \right\}. \tag{2} \] It turns out that the resulting \(\sigma\)-algebra agrees with that defined in option 1, so there is no ambiguity in the notation. {% endkatexmm %}
Definition: One-Dimensional Projections
{% katexmm %} With the \(\sigma\)-algebra defined, we now consider how to define a Gaussian measure \(\mu\) on the measurable space \((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\). We will explore a few different equivalent definitions, starting with this one: a measure is Gaussian if all of its one-dimensional projections are (univariate) Gaussians. As in the univariate setting, we call a random variable \(X\) Gaussian if its law is a Gaussian measure. Recall that each linear map \(\ell\) can be identified with a unique \(y \in \mathbb{R}^n\) such that \(\ell(x) = \langle x, y \rangle\), which we indicate by writing \(\ell_y = \ell\). We thus see that \(\mu \circ \ell_{y}^{-1}\) is the distribution of the random variable \(\langle X, y \rangle\), where \(X \sim \mu\). The previous definition can therefore be re-stated in the language of random variables as follows: an \(n\)-dimensional random vector \(X\) is Gaussian if every linear combination of the entries of \(X\) is univariate Gaussian. More precisely:

Definition. A probability measure \(\mu\) defined on the Borel measurable space \((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\) is called Gaussian if, for all linear maps \(\ell \in (\mathbb{R}^n)^*\), the pushforward measure \(\mu \circ \ell^{-1}\) is Gaussian on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\).
Definition. Let \((\Omega, \mathcal{A}, \mathbb{P})\) be a probability space and \(X: \Omega \to \mathbb{R}^n\) a random vector. Then \(X\) is called Gaussian if \(\langle X, y \rangle\) is a univariate Gaussian random variable for all \(y \in \mathbb{R}^n\).
Notice that by choosing \(y := e_j\) (the vector with a \(1\) in its \(j^{\text{th}}\) entry and zeros everywhere else), this definition immediately tells us that a Gaussian random vector has univariate Gaussian marginal distributions. That is, if \(X = (X_1, \dots, X_n)^\top\), then \(X_i\) is univariate Gaussian for all \(i = 1, \dots, n\). {% endkatexmm %}
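{% katexmm %} To make this definition concrete, here is a small numerical sanity check (a sketch in NumPy; the particular \(m\), \(C\), and \(y\) below are arbitrary choices, not anything from the text). Projecting samples of \(X \sim \mathcal{N}(m, C)\) onto a direction \(y\) should yield a univariate Gaussian with mean \(\langle m, y\rangle\) and variance \(\langle Cy, y\rangle\), facts derived formally later in the post. {% endkatexmm %}

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary mean vector and (positive semidefinite) covariance matrix.
m = np.array([1.0, -2.0, 0.5])
B = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, -0.3, 0.8]])
C = B @ B.T

# Draw samples of X ~ N(m, C); each row is one sample.
X = rng.multivariate_normal(mean=m, cov=C, size=200_000)

# Project onto a fixed direction y: the samples of <X, y> should look like a
# univariate Gaussian with mean <m, y> and variance <Cy, y>.
y = np.array([0.3, -1.0, 2.0])
proj = X @ y

print("sample mean of <X,y>:", proj.mean(), "   theory:", m @ y)
print("sample var  of <X,y>:", proj.var(),  "   theory:", y @ C @ y)
```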
Fourier Transform
{% katexmm %} Just as in the univariate case, the Fourier transform \(\hat{\mu}\) provides an alternate, equivalent characterization of Gaussian measures. First, we recall how the Fourier transform is defined in the multivariate setting.

Definition. Let \(\mu\) be a measure on \((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\). Then the Fourier transform of \(\mu\) is defined as \[ \hat{\mu}(y) := \int_{\mathbb{R}^n} e^{i\langle x, y\rangle} \mu(dx), \qquad y \in \mathbb{R}^n. \tag{3} \]
We can alternatively view \(\hat{\mu}\) as a function of \(\ell_y \in (\mathbb{R}^n)^*\); that is, \[ \ell_y \mapsto \int_{\mathbb{R}^n} e^{i \ell_y(x)} \mu(dx). \] Note that this is similar in spirit to the definition of the \(n\)-dimensional Gaussian measure, in the sense that the extension from one to multiple dimensions is achieved by considering one-dimensional linear projections. This idea will also provide the basis for an extension to infinite dimensions.
With this background established, we can state the following theorem, which gives an alternate definition of Gaussian measures. The proof, given in the appendix, also provides expressions for the mean and covariance of \(\mu\) as a byproduct.

Theorem. A probability measure \(\mu\) defined on the Borel measurable space \((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\) is Gaussian if and only if its Fourier transform is of the form \[ \hat{\mu}(y) = \exp\left\{i\langle m, y\rangle - \frac{1}{2}\langle Cy, y\rangle \right\}, \tag{4} \] for some fixed vector \(m \in \mathbb{R}^{n}\) and symmetric, positive semi-definite matrix \(C \in \mathbb{R}^{n \times n}\).
Corollary. Let \(\mu\) be a Gaussian measure with Fourier transform \(\hat{\mu}(y) = \exp\left\{i\langle y, m\rangle - \frac{1}{2}\langle Cy, y\rangle\right\}\). Then the mean vector and covariance matrix of \(\mu\) are given by \[\begin{align} m &= \int x \mu(dx) \tag{5} \newline C &= \int (x-m)(x-m)^\top \mu(dx). \end{align}\]
{% endkatexmm %}
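{% katexmm %} As a quick illustration of the theorem (a sketch only; the values of \(m\), \(C\), and \(y\) below are arbitrary), we can estimate the Fourier transform \(\mathbb{E}[e^{i\langle X, y\rangle}]\) by Monte Carlo with NumPy and compare it against the closed form (4). {% endkatexmm %}

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary mean and covariance.
m = np.array([0.5, -1.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])

# Monte Carlo estimate of the Fourier transform at a fixed frequency y.
X = rng.multivariate_normal(mean=m, cov=C, size=500_000)
y = np.array([0.7, -0.2])

mc_estimate = np.mean(np.exp(1j * (X @ y)))              # empirical E[exp(i<X,y>)]
closed_form = np.exp(1j * (m @ y) - 0.5 * (y @ C @ y))   # right-hand side of (4)

print("Monte Carlo estimate:", mc_estimate)
print("closed form (4)     :", closed_form)
```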
Density Function
The one-dimensional projections and Fourier transform provide equivalent definitions of multivariate Gaussian measures. The more familiar notion of the Gaussian density provides a third characterization, with the caveat that it only pertains to the case where the covariance matrix \(C\) is positive definite.

Proposition. Let \(\mu\) be a Gaussian measure with mean vector \(m\) and covariance matrix \(C\), as in (5). Then \(\mu\) admits a Lebesgue density if and only if \(C\) is positive definite, in which case \[ \frac{d\mu}{d\lambda}(x) = \text{det}(2\pi C)^{-1/2}\exp\left\{-\frac{1}{2} \langle C^{-1}(x-m), x-m\rangle\right\}. \tag{6} \]
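{% katexmm %} For a concrete check of (6) (an illustrative sketch, assuming \(C\) is positive definite; the particular \(m\), \(C\), and \(x\) below are arbitrary), we can evaluate the density directly with NumPy and compare against SciPy's `multivariate_normal` implementation. {% endkatexmm %}

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary mean, positive definite covariance, and evaluation point.
m = np.array([1.0, 2.0])
C = np.array([[1.5, 0.4],
              [0.4, 0.9]])
x = np.array([0.3, 2.5])

# Density formula (6), evaluated directly.
diff = x - m
density = np.linalg.det(2 * np.pi * C) ** -0.5 * np.exp(-0.5 * diff @ np.linalg.solve(C, diff))

print("formula (6):", density)
print("scipy      :", multivariate_normal(mean=m, cov=C).pdf(x))
```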
Transformation of Standard Gaussian Random Variables
In this section we provide yet another characterization of Gaussian measures. We consider a generative perspective, whereby a Gaussian random vector \(X \in \mathbb{R}^n\) arises via a linear transformation of \(n\) iid \(\mathcal{N}(0,1)\) random variables.
Proposition. Let \(Z_1, \dots, Z_n\) be iid \(\mathcal{N}(0, 1)\) random variables stacked into the column vector \(Z \in \mathbb{R}^n\). Then, for any fixed vector \(m \in \mathbb{R}^n\) and matrix \(A \in \mathbb{R}^{n \times n}\), the random variable given by \[ X := m + AZ \tag{7} \] has a Gaussian distribution \(\mathcal{N}(m, AA^\top)\).
Conversely, let \(X \in \mathbb{R}^n\) be a Gaussian random variable. Then there exists a vector \(m \in \mathbb{R}^n\) and a matrix \(A \in \mathbb{R}^{n \times n}\) such that \(X \overset{d}{=} m + AZ\), where \(Z\) is a vector of \(n\) iid \(\mathcal{N}(0,1)\) random variables.
{% katexmm %} Another way to think about this is that we have defined a transport map \(T: \mathbb{R}^n \to \mathbb{R}^n\) such that \[\begin{align} T(Z) &= X, &&\text{where } T(z) = m + Az. \end{align}\] That is, we feed in vectors with iid standard Gaussian components, and get out vectors with distribution \(\mathcal{N}(m, AA^\top)\). This is a very practical way to look at multivariate Gaussians, immediately providing the basis for a sampling algorithm. Indeed, suppose we want to draw iid samples from the distribution \(\mathcal{N}(m, C)\). Then the above proposition gives us a way to do so, provided that we can (1) draw univariate \(\mathcal{N}(0,1)\) samples; and (2) factorize the matrix \(C\) as \(C = AA^\top\) for some \(A\). This procedure is summarized in the below corollary.
Corollary. The following algorithm produces a sample from the distribution \(\mathcal{N}(m, C)\).

1. Draw \(n\) iid samples \(Z_i \sim \mathcal{N}(0,1)\) and stack them in a column vector \(Z\).
2. Compute a factorization \(C = AA^\top\).
3. Return \(m + AZ\).

Repeating steps 1 and 3 will produce independent samples from \(\mathcal{N}(m, C)\) (the matrix factorization need not be re-computed each time).
As for the factorization, the Cholesky decomposition is a standard choice when \(C\) is positive definite. When \(C\) is merely positive semidefinite, the eigendecomposition provides another option, since \[ C = UDU^\top = UD^{1/2} D^{1/2} U^\top = (UD^{1/2})(UD^{1/2})^\top \] so setting \(A := UD^{1/2}\) does the trick. Note that \(C\) is positive semidefinite so \(D\) is just a diagonal matrix with nonnegative values on the diagonal.
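The following sketch implements this sampling procedure in NumPy (the function name `sample_gaussian` and the specific \(m\) and \(C\) are my own illustrative choices, not anything from the text). It supports both the Cholesky route (for positive definite \(C\)) and the eigendecomposition route described above.

```python
import numpy as np

def sample_gaussian(m, C, n_samples, rng, method="cholesky"):
    """Draw samples from N(m, C) via X = m + A Z, where C = A A^T."""
    if method == "cholesky":
        A = np.linalg.cholesky(C)                 # requires C positive definite
    else:
        lam, U = np.linalg.eigh(C)                # C = U D U^T
        lam = np.clip(lam, 0.0, None)             # guard against tiny negative round-off
        A = U * np.sqrt(lam)                      # A = U D^{1/2} (scales columns of U)
    Z = rng.standard_normal((n_samples, len(m)))  # iid N(0, 1) entries
    return m + Z @ A.T                            # each row is one draw of m + A Z

rng = np.random.default_rng(2)
m = np.array([0.0, 1.0, -1.0])
B = rng.standard_normal((3, 3))
C = B @ B.T  # an arbitrary positive semidefinite covariance

X = sample_gaussian(m, C, 100_000, rng, method="eigh")
print("sample mean       :", X.mean(axis=0))
print("sample covariance :\n", np.cov(X, rowvar=False))
print("target covariance :\n", C)
```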
Finally, note that we can expand (7) as
\[\begin{equation}
X = m + \sum_{i=1}^{n} Z_i A_i,
\end{equation}\] where the \(A_i\) are the columns of \(A\). This shows that we are representing the random vector \(X\) as a “randomized summation”; that is, a linear combination of nonrandom vectors \(A_i\) in which the coefficients \(Z_i\) are random. In the case of the eigendecomposition, this summation assumes the form \[\begin{equation}
X = m + \sum_{i=1}^{n} \sqrt{\lambda_i} Z_i U_i, \tag{8}
\end{equation}\] where the \(\lambda_i\) are the diagonal values of \(D\). This expression can be viewed as a special case of the Karhunen-Loève expansion {% cite alexanderianKLExpansion %}.
{% endkatexmm %}
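{% katexmm %} A brief numerical illustration of the expansion (8) (a sketch; the specific \(m\) and \(C\) below are arbitrary): building \(X\) term-by-term from the eigenpairs of \(C\) gives exactly the same draw as applying \(A = UD^{1/2}\) to \(Z\). {% endkatexmm %}

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary mean and covariance.
m = np.array([1.0, -1.0, 0.0])
B = rng.standard_normal((3, 3))
C = B @ B.T

lam, U = np.linalg.eigh(C)        # C = U D U^T; columns of U are the u_i
lam = np.clip(lam, 0.0, None)     # guard against tiny negative round-off
Z = rng.standard_normal(3)        # iid N(0, 1) coefficients

# Expansion (8): X = m + sum_i sqrt(lambda_i) Z_i u_i ...
X_sum = m + sum(np.sqrt(lam[i]) * Z[i] * U[:, i] for i in range(3))
# ... coincides with m + A Z for A = U D^{1/2}.
X_mat = m + (U * np.sqrt(lam)) @ Z

print(np.allclose(X_sum, X_mat))  # True: both give the same draw from N(m, C)
```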
Conditional Distributions
{% katexmm %} Throughout this section, let \((X,Y)\) denote a Gaussian vector in \(\mathbb{R}^{d+n}\), such that \(X \in \mathbb{R}^d\) and \(Y \in \mathbb{R}^n\). We are interested in the conditional distribution of \(X\) given \(Y\).
Conditional Expectation
We start by characterizing the distribution of the conditional expectation \(\mathbb{E}(X|Y)\), which we recall is a \(Y\)-measurable random vector in \(\mathbb{R}^d\). The first part of the result below, which does not require any of the covariances involved to be positive definite, is Theorem 1.2.8 in {% cite BogachevGaussian %}.

Theorem. The conditional expectation \(\mathbb{E}(X|Y)\) is a Gaussian vector, and can be written as \[ \mathbb{E}(X|Y) = \mathbb{E}X + K(Y - \mathbb{E}Y), \tag{9} \] for a nonrandom linear operator \(K: \mathbb{R}^n \to \mathbb{R}^d\). If \(\text{Cov}[Y]\) is positive definite, then \[ K = \text{Cov}[X,Y]\text{Cov}[Y]^{-1}. \tag{10} \]
Proof. We consider the zero-mean case \(\mathbb{E}X = \mathbb{E}Y = 0\), since we can simply subtract off the means in the non-centered case. Since Gaussian random variables are square integrable, the conditional expectation can be characterized as an \(L^2\) projection \[ \mathbb{E}(X|Y) = \text{argmin}_{g} \mathbb{E}\lVert X - g(Y) \rVert^2, \tag{11} \] where the minimum is taken over all measurable functions \(g: \mathbb{R}^n \to \mathbb{R}^d\) such that \(g(Y)\) is square integrable. The Hilbert projection theorem provides the orthogonality condition \(\langle X - \mathbb{E}(X|Y), g(Y) \rangle_{L^2} = 0\) for the optimum. We thus seek to show that an operator \(K\) can be constructed satisfying \(\langle X - KY, g(Y) \rangle_{L^2} = 0\) for all such functions \(g\). Note that \((X-KY,Y)\) is jointly Gaussian, as this vector results from a linear map applied to \((X,Y)\). Consequently, the condition \(\text{Cov}[X-KY,Y]=0\) implies the independence of \(X-KY\) and \(Y\), which in turn implies that the \(L^2\) orthogonality condition holds for all \(g\). Thus, we must show that \(K\) can be defined to satisfy \[ \text{Cov}[X-KY,Y]=0, \tag{12} \] or equivalently \[ \text{Cov}[X,Y] = K\text{Cov}[Y]. \] If \(\text{Cov}[Y]\) is positive definite, then it is invertible, so \(K = \text{Cov}[X,Y]\text{Cov}[Y]^{-1}\). We now deal with the positive semidefinite case. Consider the eigendecomposition \(\text{Cov}[Y] = U \Lambda U^\top + V \Gamma V^\top\), partitioned so that \(\Lambda\) contains the nonzero eigenvalues, while \(\Gamma\) is the zero matrix (and hence the second term vanishes). Plugging in this expression, the condition becomes \(\text{Cov}[X,Y] = KU\Lambda U^\top\). Right multiplying both sides by \(U\) yields \[ \text{Cov}[X,Y] U = KU\Lambda. \] Treating this condition column-by-column, we require \(\text{Cov}[X,Y] u_j = \lambda_j Ku_j\). We can thus define \(K\) on the range of \(U\) by \[ Ku_j := \frac{1}{\lambda_j}\text{Cov}[X,Y]u_j, \] which is valid since \(\lambda_j > 0\). To extend the definition to all of \(\mathbb{R}^n\), note that any \(y \in \mathbb{R}^n\) can be written as \(y = \tilde{y} + y^\perp\), where \(\tilde{y}\) is in the range of \(U\) and \(y^\perp\) in the orthogonal complement, given by the range of \(V\). We then define \[ Ky := K\tilde{y}, \] i.e., \(K\) is set to zero on the range of \(V\). It remains to verify that this \(K\) actually satisfies (12). Since \(\text{Var}[V^\top Y] = V^\top \text{Cov}[Y] V = \Gamma = 0\), we have \(V^\top Y = 0\) almost surely (recall we are in the centered case), and hence \(\text{Cov}[X,Y]V = \text{Cov}[X, V^\top Y] = 0\). Therefore \[ \text{Cov}[X,Y] = \text{Cov}[X,Y]\left(UU^\top + VV^\top\right) = \text{Cov}[X,Y]UU^\top = KU\Lambda U^\top = K\text{Cov}[Y], \] so the condition (12) is indeed satisfied. We have thus defined a linear operator \(K: \mathbb{R}^n \to \mathbb{R}^d\) satisfying \(\text{Cov}[X-KY,Y]=0\). Since \(\mathbb{E}[X|Y] = KY\) is a linear function of the Gaussian vector \(Y\), it is itself Gaussian. \(\qquad \blacksquare\)
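As a numerical sketch of this construction (not part of the formal argument; the joint distribution below is an arbitrary example built so that \(\text{Cov}[Y]\) is singular), the operator \(K\) can be computed as \(\text{Cov}[X,Y]\) times the Moore-Penrose pseudoinverse of \(\text{Cov}[Y]\), which inverts \(\text{Cov}[Y]\) on its range and acts as zero on its null space, coinciding with the eigenbasis construction above. We then check empirically that \(\text{Cov}[X - KY, Y] \approx 0\).

```python
import numpy as np

rng = np.random.default_rng(4)

d, n, N = 2, 3, 200_000

# Build a jointly Gaussian (X, Y) from iid standard normals W; the last
# coordinate of Y is an exact linear combination of the first two, so Cov[Y]
# is singular.
W = rng.standard_normal((N, 4))
X = W[:, :2] @ rng.standard_normal((2, d))
Y_part = W[:, 2:] @ rng.standard_normal((2, 2)) + 0.5 * W[:, :2]
Y = np.column_stack([Y_part, Y_part @ np.array([1.0, -2.0])])

# Empirical covariance blocks.
joint = np.cov(X, Y, rowvar=False)        # (d + n) x (d + n)
cov_XY = joint[:d, d:]                    # Cov[X, Y], d x n
cov_Y = joint[d:, d:]                     # Cov[Y], n x n (singular)

# K inverts Cov[Y] on its range and is zero on its null space.
K = cov_XY @ np.linalg.pinv(cov_Y, rcond=1e-10)

resid = X - Y @ K.T                            # samples of X - K Y
print(np.cov(resid, Y, rowvar=False)[:d, d:])  # Cov[X - KY, Y], approximately zero
```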
Notice that the above result implies that the conditional expectation satisfies \[ \mathbb{E}[X|Y] \sim \mathcal{N}\left(\mathbb{E}[X], K\text{Cov}[Y]K^\top \right). \tag{13} \] The fact that the expectation is \(\mathbb{E}[X]\) can be seen from direct calculation or as a consequence of the law of iterated expectation. A very useful perspective is to view \(X\) as a sum of \(\mathbb{E}[X|Y]\) and some residual term \(\epsilon\).
Corollary. The random vector \(X\) can be decomposed as \[\begin{align} &X = \mathbb{E}[X|Y] + \epsilon, &&\epsilon := X - \mathbb{E}[X|Y], \tag{14} \end{align}\] where \(\mathbb{E}[X|Y]\) and \(\epsilon\) are independent.
Proof. Observe that \((\mathbb{E}[X|Y], \epsilon)\) is jointly Gaussian, as this vector results from a linear map applied to \((X,Y)\). Moreover, \(\epsilon\) is uncorrelated with any (square-integrable) measurable function of \(Y\), due to the orthogonality property of the conditional expectation. \(\mathbb{E}[X|Y]\) is one such function, and thus \(\epsilon\) and \(\mathbb{E}[X|Y]\) are uncorrelated. Since they are jointly Gaussian, they are also independent. \(\qquad \blacksquare\)
Conditional Distribution
In practice, we are typically interested in conditioning on a single realization \(y\) of the random vector \(Y\). The concept that captures this notion is regular conditional probability. A full discussion of this would take us too far afield for this post; just note that when we talk of the distribution of \(X|Y=y\), we are technically referring to the regular conditional probability kernel. When the Gaussian distribution of \((X,Y)\) is nonsingular, this concept reduces to the usual notion of conditional densities.
We now leverage our characterization of \(\mathbb{E}[X|Y]\) to derive the conditional distribution \(X|Y=y\). To do so, we utilize a basic fact about regular conditional probability; a proof can be found in Lemma 2 of {% cite pathwiseGP %}. The result states that if the decomposition \[ X \overset{d}{=} g(Y) + \epsilon \tag{15} \] holds, and \(Y\) is independent of \(\epsilon\), then \[ (X|Y=y) \overset{d}{=} g(y) + \epsilon. \tag{16} \]
The Gaussian conditional can thus be characterized by combining this fact with (14).

Corollary. The conditional distribution \(X|Y=y\) is Gaussian, with mean and covariance given by \[\begin{align} \mathbb{E}[X|Y=y] &= \mathbb{E}[X] + K(y-\mathbb{E}[Y]) \tag{17} \newline \text{Cov}[X|Y=y] &= \text{Cov}[X] + K\text{Cov}[Y]K^\top - \text{Cov}[X,Y]K^\top - K\text{Cov}[Y,X] \tag{18} \end{align}\]
Proof. The expression \(X = \mathbb{E}[X|Y] + \epsilon\) given in (14) is of the form (15), since \(Y\) and \(\epsilon\) are independent. Thus, (16) gives \[ (X|Y=y) \overset{d}{=} \mathbb{E}[X|y] + \epsilon, \tag{19} \] where \[\begin{align} \mathbb{E}[X|y] &= \mathbb{E}[X] + K(y-\mathbb{E}[Y]) \newline \epsilon &= X - \mathbb{E}[X|Y] = X - \mathbb{E}[X] - K(Y-\mathbb{E}[Y]). \end{align}\] The right-hand side of (19) results from the application of a linear function to \((X,Y)\) and thus is Gaussian, verifying that \(X|Y=y\) is Gaussian. The conditional mean (17) follows immediately upon noting that \(\mathbb{E}[\epsilon] = 0\). For the covariance, we have \[\begin{align} \text{Cov}[\mathbb{E}[X|y] + \epsilon] &= \text{Cov}[\epsilon] \newline &= \text{Cov}[X - KY] \newline &= \text{Cov}[X] + K\text{Cov}[Y]K^\top - \text{Cov}[X,Y]K^\top - K\text{Cov}[Y,X]. \qquad \blacksquare \end{align}\]
If \(\text{Cov}[Y]\) is positive definite, then we can substitute expression (10) for \(K\) to yield the beloved “Gaussian conditioning identities.”
Corollary. If \(\text{Cov}[Y]\) is positive definite, then (17) and (18) are given by \[\begin{align} \mathbb{E}[X|Y=y] &= \mathbb{E}[X] + \text{Cov}[X,Y]\text{Cov}[Y]^{-1}(y-\mathbb{E}[Y]) \tag{20} \newline \text{Cov}[X|Y=y] &= \text{Cov}[X] - \text{Cov}[X,Y]\text{Cov}[Y]^{-1}\text{Cov}[Y,X] \tag{21} \end{align}\]
{% endkatexmm %}
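{% katexmm %} The identities (20) and (21) translate directly into code. Below is a sketch of a helper (the name `condition` and all numerical values are my own illustrative choices, and the plain matrix inverse is used purely for clarity), along with a consistency check based on the decomposition (14): by the law of total covariance, \(\text{Cov}[X] = \text{Cov}[X|Y=y] + K\text{Cov}[Y]K^\top\). {% endkatexmm %}

```python
import numpy as np

def condition(mean, cov, ix, iy, y_obs):
    """Mean and covariance of X | Y = y_obs for (X, Y) ~ N(mean, cov),
    where X consists of components ix and Y of components iy."""
    m_x, m_y = mean[ix], mean[iy]
    C_xx = cov[np.ix_(ix, ix)]
    C_xy = cov[np.ix_(ix, iy)]
    C_yy = cov[np.ix_(iy, iy)]
    K = C_xy @ np.linalg.inv(C_yy)                     # K = Cov[X,Y] Cov[Y]^{-1}, as in (10)
    cond_mean = m_x + K @ (np.asarray(y_obs) - m_y)    # identity (20)
    cond_cov = C_xx - K @ C_xy.T                       # identity (21)
    return cond_mean, cond_cov

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
cov = B @ B.T + 0.1 * np.eye(4)   # ensure the joint covariance is positive definite
mean = np.array([0.0, 1.0, -1.0, 2.0])

ix, iy = [0, 1], [2, 3]
cond_mean, cond_cov = condition(mean, cov, ix, iy, y_obs=[0.0, 1.5])
print("conditional mean:", cond_mean)
print("conditional cov :\n", cond_cov)

# Consistency check (law of total covariance; the conditional covariance
# does not depend on y): Cov[X] = Cov[X|Y=y] + K Cov[Y] K^T.
K = cov[np.ix_(ix, iy)] @ np.linalg.inv(cov[np.ix_(iy, iy)])
print(np.allclose(cov[np.ix_(ix, ix)],
                  cond_cov + K @ cov[np.ix_(iy, iy)] @ K.T))
```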
Covariance Operator
{% katexmm %} As shown in (5) (and derived in the appendix), the covariance matrix associated with a Gaussian measure \(\mu\) satisfies \[\begin{align} C &= \int (x - m)(x - m)^\top \mu(dx), \end{align}\] where \(m\) and \(C\) are the quantities given in the Fourier transform (4). We take a step further in this section by viewing the covariance as an operator rather than a matrix. Definitions of the covariance operator differ slightly across textbooks and the literature; we will try to touch on the different conventions here and explain their connections. As a starting point, we consider the following definition.

Definition. Let \(\mu\) be a Gaussian measure with Fourier transform given by (4). Then the covariance operator of \(\mu\) is defined as the function \(\mathcal{C}: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}\) given by \[ \mathcal{C}\left(y, y^\prime \right) = \langle Cy, y^\prime\rangle. \tag{22} \]
We immediately have a variety of equivalent expressions for this operator: \[\begin{align} \mathcal{C}\left(y, y^\prime \right) &= \langle Cy, y^\prime\rangle \newline &= y^\top \left[\int (x - m)(x - m)^\top \mu(dx)\right] y^\prime \newline &= \int \langle y, x - m\rangle \langle y^\prime, x - m \rangle \mu(dx). \tag{23} \end{align}\] In terms of the random variable \(X \sim \mu\), we can also write this as \[\begin{align} \mathcal{C}\left(y, y^\prime \right) &= \int \langle y, x - m\rangle \langle y^\prime, x - m \rangle \mu(dx) \newline &= \int \left(\langle y, x\rangle - \langle y, \mathbb{E}[X]\rangle\right) \left(\langle y^\prime, x\rangle - \langle y^\prime, \mathbb{E}[X]\rangle\right) \mu(dx) \newline &= \mathbb{E}\left[\left(\langle y, X\rangle - \mathbb{E} \langle y,X\rangle\right) \left(\langle y^\prime, X\rangle - \mathbb{E} \langle y^\prime,X\rangle\right)\right] \newline &= \text{Cov}\left[\langle y,X\rangle, \langle y^\prime,X\rangle \right]. \end{align}\] In words, the covariance operator \(\mathcal{C}\left(y, y^\prime \right)\) outputs the covariance between the one-dimensional projections of \(X\) along the directions \(y\) and \(y^\prime\). Given that the multivariate Gaussian measure is defined in terms of its one-dimensional projections, this should feel fairly natural. In fact, we see that the Fourier transform of \(\mu\) can be written as \[ \hat{\mu}(y) = \exp\left\{i\langle y, m\rangle - \frac{1}{2}\mathcal{C}(y,y) \right\}. \] When the same argument is fed into both slots of the covariance operator (as is the case in the Fourier transform expression above), the result is seen to correspond to the variance of the one-dimensional projection: \[ \mathcal{C}(y,y) = \text{Var}\left[\langle y, X\rangle \right]. \] Closely related is the operator \(\mathcal{C}^\prime: \mathbb{R}^n \to \mathbb{R}^n\) defined by \[ \mathcal{C}^\prime(y) := \int (x-m) \langle y, x-m\rangle \mu(dx) = Cy, \tag{24} \] which simply views the covariance matrix \(C\) as a linear map; it is related to the bilinear form by \(\mathcal{C}(y, y^\prime) = \langle \mathcal{C}^\prime(y), y^\prime\rangle\). We will return to \(\mathcal{C}^\prime\) below.
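The identity \(\mathcal{C}(y, y^\prime) = \text{Cov}\left[\langle y, X\rangle, \langle y^\prime, X\rangle\right]\) is easy to check numerically; here is a small sketch in NumPy (the specific \(m\), \(C\), \(y\), and \(y^\prime\) below are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(6)

# Arbitrary mean and covariance.
m = np.array([1.0, 0.0, -2.0])
B = rng.standard_normal((3, 3))
C = B @ B.T

X = rng.multivariate_normal(mean=m, cov=C, size=400_000)

y = np.array([1.0, -0.5, 0.2])
yp = np.array([0.3, 0.3, -1.0])

bilinear = y @ C @ yp                    # <Cy, y'>
empirical = np.cov(X @ y, X @ yp)[0, 1]  # Cov[<y, X>, <y', X>] from samples

print("bilinear form <Cy, y'> :", bilinear)
print("sample covariance      :", empirical)
```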
Inner Products
One feature that makes the covariance operator a convenient mathematical object to study is the inner product structure it provides. Indeed, the following result states that the covariance operator is almost an inner product, and is a true inner product when the covariance matrix \(C\) is positive definite.

Proposition. Let \(\mu\) be a Gaussian measure with Fourier transform given by (4). Then the covariance operator (22) is symmetric, bilinear, and positive semidefinite. If \(C\), the covariance matrix of \(\mu\), is positive definite, then the covariance operator is also positive definite and thus defines an inner product.
Proof. Bilinearity follows immediately from definition (22). Symmetry similarly follows, and is more immediately obvious in expression (23). Since \(C\) is positive semidefinite, \[ \mathcal{C}(y,y) = \langle Cy, y\rangle \geq 0, \] so \(\mathcal{C}\) is also positive semidefinite. The inequality is strict when \(C\) is positive definite and \(y \neq 0\), in which case \(\mathcal{C}(\cdot, \cdot)\) is an inner product. \(\qquad \blacksquare\)
We can therefore think of \(\mathcal{C}\) as defining a new inner product by weighting the Euclidean inner product by a positive definite matrix \(C\).
Alternative Definition
As mentioned above, definitions of the covariance operator vary slightly in the literature. One basic modification commonly seen is to assume that \(\mu\) is centered (zero mean) and thus define the covariance operator as \[ \mathcal{C}(y, y^\prime) := \int \langle y, x\rangle \langle y^\prime, x\rangle \mu(dx). \tag{25} \] This is done primarily for convenience, as one can always center a Gaussian measure and then add back the mean when needed. Indeed, assume we are working with a Gaussian measure \(\mu\) with mean \(m\). To apply (25), we center the measure, which formally means considering the pushforward \(\nu := \mu \circ T^{-1}\), where \(T(x) := x - m\). Using subscripts to indicate the measure associated with each operator, we apply the change-of-variables theorem to obtain \[\begin{align} \mathcal{C}_{\nu}(y, y^\prime) &= \int \langle y, x\rangle \langle y^\prime, x\rangle (\mu \circ T^{-1})(dx) \newline &= \int \langle y, T(x)\rangle \langle y^\prime, T(x)\rangle \mu(dx) \newline &= \int \langle y, x-m \rangle \langle y^\prime, x-m \rangle \mu(dx), \end{align}\] which agrees with (23), our (uncentered) definition of \(\mathcal{C}\). Thus, our original definition (23) can be thought of as first centering the measure and then applying (25). We could similarly have defined a centered version of \(\mathcal{C}^\prime\), via \[ \mathcal{C}^\prime(y) := \int x \langle y,x\rangle \mu(dx). \] This is simply (24) with \(m=0\).
Dual Space Interpretation
As we have done repeatedly throughout this post, we can identify \(\mathbb{R}^n\) with its dual \((\mathbb{R}^n)^*\). This may seem needlessly pedantic in the present context, but becomes necessary when defining Gaussian measures on infinite-dimensional spaces. The expression (23) provides the natural jumping-off point for reinterpreting the covariance operator as acting on linear functionals. To this end, we can consider re-defining the covariance operator as \(\mathcal{C}: (\mathbb{R}^n)^* \times (\mathbb{R}^n)^* \to \mathbb{R}\), where \[ \mathcal{C}(\ell, \ell^\prime) := \int \ell(x-m) \ell^\prime(x-m) \mu(dx). \tag{26} \] By identifying each \(y \in \mathbb{R}^n\) with its dual vector \(\ell_y \in (\mathbb{R}^n)^*\), this definition is seen to agree with (22). Note that \(\ell\) and \(\ell^\prime\) are linear, so we could have equivalently defined \(\mathcal{C}\) as \[ \mathcal{C}(\ell, \ell^\prime) := \int (\ell(x)-\ell(m)) (\ell^\prime(x)-\ell^\prime(m)) \mu(dx). \]
We can similarly apply the dual space interpretation to \(\mathcal{C}^\prime\). There are a few different ways we can think about this. Let’s start by identifying the codomain of \(\mathcal{C}^\prime\) with its dual and hence re-define this operator as \(\mathcal{C}^\prime: \mathbb{R}^n \to (\mathbb{R}^n)^*\), where \[ \mathcal{C}^\prime(y)(\cdot) := \mathcal{C}(y, \cdot) = \langle Cy, \cdot\rangle = \ell_{Cy}(\cdot). \tag{27} \] Under this definition, \(\mathcal{C}^\prime\) maps an input \(y \in \mathbb{R}^n\) to a linear functional \(\ell_{Cy} \in (\mathbb{R}^n)^*\). Alternatively, we could identify the domain with its dual, and instead consider the operator \(\mathcal{C}^\prime: (\mathbb{R}^n)^* \to \mathbb{R}^n\), where \[ \mathcal{C}^\prime(\ell) := \int (x-m) \ell(x-m) \mu(dx). \tag{28} \] We can of course combine these two ideas and consider the map \(\mathcal{C}^\prime: (\mathbb{R}^n)^* \to (\mathbb{R}^n)^*\). However, thinking ahead to more abstract settings, it is actually a bit more interesting to consider \(\mathcal{C}^\prime: (\mathbb{R}^n)^* \to (\mathbb{R}^n)^{**}\) by identifying \(\mathbb{R}^n\) with its double dual. From this perspective, the operator is defined by \[ \mathcal{C}^\prime(\ell)(\ell^\prime) := \mathcal{C}(\ell, \ell^\prime) = \int \ell(x-m) \ell^\prime(x-m) \mu(dx). \tag{29} \] Notice that in this case \(\mathcal{C}^\prime\) maps a dual vector \(\ell\) to a double dual vector \(\ell^\prime \mapsto \mathcal{C}(\ell, \ell^\prime)\) (i.e., the output is itself a function that accepts a linear functional as input). Since \(\mathbb{R}^n\), \((\mathbb{R}^n)^*\), and \((\mathbb{R}^n)^{**}\) are all isomorphic, these various perspectives are interesting but perhaps a bit overkill in the present setting. When we consider the infinite-dimensional setting in the subsequent post, however, not all of them will generalize; the key will be identifying the perspective that does. {% endkatexmm %}
Appendix
Proof of (4): Fourier Transform Characterization
{% katexmm %} Assume that the probability measure \(\mu\) has a Fourier transform given by \[ \hat{\mu}(y) = \exp\left\{i \langle m, y\rangle - \frac{1}{2}\langle Cy, y\rangle \right\}, \] for some nonrandom vector \(m \in \mathbb{R}^n\) and symmetric positive semidefinite matrix \(C \in \mathbb{R}^{n \times n}\). We must show that the pushforward \(\mu \circ \ell_y^{-1}\) is Gaussian for an arbitrary \(\ell_y \in \left(\mathbb{R}^n\right)^*\). We will do so by invoking the known form of the Fourier transform for univariate Gaussians. To this end, let \(t \in \mathbb{R}\) and consider \[\begin{align} \mathcal{F}\left(\mu \circ \ell_y^{-1}\right)(t) &= \int e^{its} \left(\mu \circ \ell_y^{-1} \right)(ds) \newline &= \int e^{it \ell_y(x)} \mu(dx) \newline &= \int e^{i \langle ty, x\rangle} \mu(dx) \newline &= \hat{\mu}(ty) \newline &= \exp\left(i \langle m, ty\rangle - \frac{1}{2}\langle C(ty), ty\rangle \right) \newline &= \exp\left(it \langle m, y\rangle - \frac{1}{2}t^2\langle Cy, y\rangle \right), \end{align}\] where the second equality uses the change-of-variables formula, and the final uses the assumed form of \(\hat{\mu}\). Also recall the alternate notation for the Fourier transform: \(\hat{\mu}(y) = \mathcal{F}(\mu)(y)\). We recognize the final expression above as the Fourier transform of a univariate Gaussian measure with mean \(\langle y, m\rangle\) and variance \(\langle Cy, y\rangle\), evaluated at frequency \(t\). This implies that \(\mu \circ \ell_y^{-1}\) is Gaussian. Since \(\ell_y \in \left(\mathbb{R}^n \right)^*\) was arbitrary, it follows by definition that \(\mu\) is Gaussian.
Conversely, assume that \(\mu\) is Gaussian. Then, \(\mu \circ \ell_y^{-1}\) is univariate Gaussian for all \(\ell_y \in \left(\mathbb{R}^n \right)^*\). We must show that \(\hat{\mu}\) assumes the claimed form. Letting \(y \in \mathbb{R}^n\), we have \[\begin{align}
\hat{\mu}(y)
&= \int e^{i \langle y, x\rangle} \mu(dx) \newline
&= \int e^{is} \left(\mu \circ \ell_y^{-1}\right)(ds) \newline
&= \mathcal{F}\left(\mu \circ \ell_y^{-1}\right)(1) \newline
&= \exp\left(i m(y) - \frac{1}{2}\sigma^2(y) \right),
\end{align}\] where \(m(y)\) and \(\sigma^2(y)\) are the mean and variance of \(\mu \circ \ell_y^{-1}\), respectively. The first equality again uses the change-of-variables formula, while the last expression follows from the assumption that \(\mu \circ \ell_y^{-1}\) is Gaussian, and hence must have a Fourier transform of this form. It remains to verify that \(m(y) = \langle y, m\rangle\) and \(\sigma^2(y) = \langle Cy, y\rangle\) to complete the proof. By definition, the
mean of \(\mu \circ \ell_y^{-1}\) is given by \[\begin{align}
m(y) &= \int \ell_y(x) \mu(dx) \newline
&= \int \langle y, x\rangle \mu(dx) \newline
&= \left\langle y, \int x \mu(dx) \right\rangle \newline
&=: \langle y, m \rangle,
\end{align}\] where we have used the linearity of integration and defined the nonrandom vector \(m := \int x \mu(dx)\). Now, for the variance we have \[\begin{align}
\sigma^2(y)
&= \int \left[\ell_y(x) - m(y) \right]^2 \mu(dx) \newline
&= \int \left[\langle y, x\rangle - \langle y, m \rangle \right]^2 \mu(dx) \newline
&= \int \langle y, x-m\rangle^2 \mu(dx) \newline
&= y^\top \left[\int (x-m)(x-m)^\top \mu(dx) \right] y \newline
&=: y^\top C y \newline
&= \langle Cy, y \rangle.
\end{align}\] Note that \(\sigma^2(y)\) is the expectation of a nonnegative quantity, so \(\langle Cy, y \rangle \geq 0\) for all \(y \in \mathbb{R}^n\); i.e.,
\(C\) is positive semidefinite. We have thus shown that \[\begin{align}
\hat{\mu}(y) &= \exp\left(i\langle y, m\rangle - \frac{1}{2}\langle Cy,y\rangle \right),
\end{align}\] with \(C\) a positive semidefinite matrix, as required. \(\qquad \blacksquare\) {% endkatexmm %}
Proof of (6): Density Function
{% katexmm %} Let’s start by assuming \(\mu\) is a Gaussian measure with mean \(m\) and positive definite covariance matrix \(C\). Then \(C\) admits an eigendecomposition \(C = UDU^\top\), where the columns \(u_1, \dots, u_n\) of \(U\) are orthonormal and \(D = \text{diag}\left(\lambda_1, \dots, \lambda_n\right)\) with \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n > 0\). By definition of a Gaussian measure, the one-dimensional projections \(\mu \circ \ell^{-1}_{u_i}\) are Gaussian, with respective means \(\langle m, u_i\rangle\) and variances \(\langle Cu_i, u_i\rangle = \langle \lambda_i u_i, u_i\rangle = \lambda_i > 0\) (see the above proof for the derivation of the mean and variance). Note that the positive definite assumption ensures that the variances are all strictly positive. Since the variances are positive, each of these univariate Gaussians admits a density \[ \frac{d\left(\mu \circ \ell^{-1}_{u_i}\right)}{d\lambda}(t) = (2\pi\lambda_i)^{-1/2}\exp\left\{-\frac{1}{2\lambda_i}\left(t - \langle m, u_i\rangle \right)^2\right\}, \] for \(i = 1, \dots, n\). We will now show that \(\mu\) can be written as the product of \(n\) independent univariate Gaussian measures, leveraging the Fourier transform to establish this fact. Letting \(y \in \mathbb{R}^n\), we lighten notation by writing \(\alpha_i := \langle y, u_i\rangle\) and \(\beta_i := \langle m, u_i \rangle\); \(y\) and \(m\) can thus be represented with respect to the eigenbasis as \[\begin{align} &y = \sum_{i=1}^{n} \alpha_i u_i, &m = \sum_{i=1}^{n} \beta_i u_i. \end{align}\] Taking the Fourier transform of \(\mu\), we have \[\begin{align} \hat{\mu}(y) &= \exp\left(i\langle y,m\rangle - \frac{1}{2}\langle Cy,y\rangle \right) \newline &= \exp\left(i\left\langle \sum_{i=1}^{n} \alpha_i u_i, \sum_{i=1}^{n} \beta_i u_i \right\rangle - \frac{1}{2}\left\langle \sum_{i=1}^{n} \alpha_i Cu_i, \sum_{i=1}^{n} \alpha_i u_i \right\rangle \right) \newline &= \exp\left(i\sum_{i=1}^{n} \alpha_i \beta_i - \frac{1}{2}\sum_{i=1}^{n}\lambda_i \alpha_i^2 \right) \newline &= \prod_{i=1}^{n} \exp\left(i\alpha_i \beta_i - \frac{1}{2}\lambda_i \alpha_i^2 \right) \newline &= \prod_{i=1}^{n} \mathcal{F}\left(\mathcal{N}(\beta_i, \lambda_i) \right)(\alpha_i). \end{align}\] In other words, in the coordinates \(z := U^\top x\) the measure \(\mu\) is the product of the independent univariate Gaussians \(\mathcal{N}(\beta_i, \lambda_i)\). Multiplying their densities and changing variables back via \(x = Uz\) (an orthogonal transformation, with unit Jacobian) gives \[ \frac{d\mu}{d\lambda}(x) = \prod_{i=1}^{n}(2\pi\lambda_i)^{-1/2}\exp\left\{-\frac{1}{2\lambda_i}\left(\langle x, u_i\rangle - \beta_i\right)^2\right\} = \text{det}(2\pi C)^{-1/2}\exp\left\{-\frac{1}{2}\langle C^{-1}(x-m), x-m\rangle\right\}, \] which is the claimed density (6). For the converse direction, note that if \(C\) is singular, there is a nonzero \(v\) with \(Cv = 0\), so \(\langle v, X\rangle = \langle v, m\rangle\) almost surely; \(\mu\) is then supported on a proper affine subspace of \(\mathbb{R}^n\), which has Lebesgue measure zero, and hence \(\mu\) admits no Lebesgue density. \(\qquad \blacksquare\)
Proof of (7): Transformation of Standard Gaussian
For completeness, we start by proving the following basic fact.
Lemma. Let \(Z_i \overset{iid}{\sim} \mathcal{N}(0, 1)\) and define the random vector \(Z := \begin{bmatrix} Z_1, \dots, Z_n \end{bmatrix}^\top\). Then the law of \(Z\) is multivariate Gaussian, in particular \(\mathcal{N}(0, I)\).
Proof. Let \(\mu\) and \(\nu\) denote the law of \(Z\) and \(Z_i\), respectively. Observe that \(\mu\) is the product measure constructed from \(n\) copies of \(\nu\); that is, \(\mu = \nu \otimes \cdots \otimes \nu\). We will establish the Gaussianity of \(\mu\) by appealing to the Fourier transform. Let \(y \in \mathbb{R}^n\) and consider \[\begin{align} \hat{\mu}(y) &= \int e^{i \langle y, x\rangle} \mu(dx) \newline &= \int \prod_{i=1}^{n} \exp\left(i y_i x_i\right) (\nu \otimes \cdots \otimes \nu)(dx_1, \dots, dx_n) \newline &= \prod_{i=1}^{n} \int \exp\left(iy_i x_i \right) \nu(dx_i) \newline &= \prod_{i=1}^{n} \hat{\nu}(y_i) \newline &= \prod_{i=1}^{n} \exp\left(-\frac{1}{2}y_i^2\right) \newline &= \exp\left(-\frac{1}{2} \sum_{i=1}^{n} y_i^2 \right) \newline &= \exp\left(-\frac{1}{2} \langle Iy, y \rangle \right), \end{align}\] where we have used the Fourier transform of the univariate Gaussian measure \(\nu\). We recognize the final expression to be the Fourier transform of a Gaussian measure with mean vector \(0\) and covariance matrix \(I\). \(\qquad \blacksquare\)
Proof of (7). Proceeding with the main result, we first show that the random variable \(X := m + AZ\) has law \(\mathcal{N}(m, AA^\top)\). This follows from the above lemma and basic facts about Fourier transforms. Recall that we write \(\mathcal{L}(Y)\) to denote the law of a random variable \(Y\), and thus \(\hat{\mathcal{L}}(Y)\) is the Fourier transform of this law. We are interested in the Fourier transform \[
\hat{\mathcal{L}}(X)
= \hat{\mathcal{L}}(m + AZ),
\] which is easily derived if one recalls the effect of affine transformations on Fourier transforms. To be self-contained, we derive the required results here; let \(Y\) be an arbitrary \(n\)-dimensional random vector, and \(m\), \(A\) be non-random as above. Then,
\[\begin{align}
\hat{\mathcal{L}}(AY)(x)
&= \mathbb{E}\left[\exp\left(i\langle x, AY\rangle \right) \right]
= \mathbb{E}\left[\exp\left(i\langle A^\top x, Y\rangle \right) \right]
= \hat{\mathcal{L}}(Y)(A^\top x)
\end{align}\] and \[\begin{align}
\hat{\mathcal{L}}(m + Y)(x)
&= \mathbb{E}\left[\exp\left(i\langle x, m+Y\rangle \right) \right]
= \exp\left(i\langle x, m\rangle \right)\mathbb{E}\left[\exp\left(i\langle x, Y\rangle \right) \right]
= \exp\left(i\langle x, m\rangle \right) \hat{\mathcal{L}}(Y)(x).
\end{align}\] We combine these two results to the present problem, obtaining \[\begin{align}
\hat{\mathcal{L}}(X)(x)
&= \hat{\mathcal{L}}(m + AZ)(x) \newline
&= \exp\left(i\langle x, m\rangle \right) \hat{\mathcal{L}}(Z)(A^\top x) \newline
&= \exp\left(i\langle x, m\rangle \right)\exp\left(-\frac{1}{2}\langle A^\top x, A^\top x\rangle \right) \newline
&= \exp\left(i\langle x, m\rangle \right)\exp\left(-\frac{1}{2}\langle AA^\top x, x\rangle \right) \newline
&= \exp\left(i\langle x, m\rangle - \frac{1}{2}\langle AA^\top x, x\rangle \right),
\end{align}\] where we have used the fact that \(\langle A^\top x, A^\top x\rangle = \langle AA^\top x, x\rangle\). We recognize the final expression as the Fourier transform of \(\mathcal{N}(m, AA^\top)\), so \(X \sim \mathcal{N}(m, AA^\top)\), as claimed. For the converse, suppose \(X\) is Gaussian with mean \(m\) and covariance \(C\), as given by (5). Factoring \(C = AA^\top\) (e.g., via the eigendecomposition discussed above), the first part of the proof shows that \(m + AZ \sim \mathcal{N}(m, AA^\top) = \mathcal{N}(m, C)\), and hence \(X \overset{d}{=} m + AZ\). \(\qquad \blacksquare\)
{% endkatexmm %}
References
- Gaussian Measures (Vladimir Bogachev)
- An Introduction to Stochastic PDEs (Martin Hairer)
- Gaussian Hilbert Spaces (Svante Janson)
- Gaussian Random Functions (M.A. Lifshits)
- Probability: Theory and Examples (Rick Durrett)
- Conditional Means and Covariances of Normal Variables with Singular Covariance Matrix (George Marsaglia)
TODOs
- Proof that zeros in covariance matrix imply independence.
- Proof that zeros in precision matrix imply conditional independence.