Gaussian Measures, Part 2 - The Multivariate Case

A fairly deep dive into Gaussian measures in finitely many dimensions. The next step in building up to the infinite-dimensional case.

Preliminaries

Let $x, y \in \mathbb{R}^n$. Throughout this post, we write $\langle x, y \rangle = x^\top y$ for the standard inner product on $\mathbb{R}^n$, and $\lVert x \rVert_2 = \sqrt{\langle x, x \rangle}$ for the norm induced by this inner product. We write $L(\mathcal{X}, \mathcal{Y})$ to denote the set of linear maps from $\mathcal{X}$ to $\mathcal{Y}$. We will frequently consider linear functions of the form $\ell: \mathbb{R}^n \to \mathbb{R}$, and denote the set of all such functions by $(\mathbb{R}^n)^* := L(\mathbb{R}^n, \mathbb{R})$. Every $\ell \in (\mathbb{R}^n)^*$ can be uniquely represented as an inner product with some vector $y \in \mathbb{R}^n$. When we wish to make this identification explicit, we will write $\ell_y$ to denote the linear map given by $\ell_y(x) = \langle x, y \rangle$. Likewise, if we are working with a generic linear map $\ell \in (\mathbb{R}^n)^*$, we will write $y_{\ell} \in \mathbb{R}^n$ to denote the unique vector satisfying $\ell(x) = \langle x, y_{\ell} \rangle$. We will also loosely refer to $\ell_y(x)$ as a projection onto $y$. Note that if $y$ has unit norm, then this is precisely the magnitude of the orthogonal projection of $x$ onto $y$.

Sigma Algebra

We recall from the previous post that a univariate Gaussian measure is defined on the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$. Analogously, we will define an $n$-dimensional Gaussian measure on the Borel sets $\mathcal{B}(\mathbb{R}^n)$. There are two reasonable approaches to defining $\mathcal{B}(\mathbb{R}^n)$, and I want to take a moment to highlight them, since the same two options will present themselves when we consider defining the Borel sets over infinite-dimensional spaces.

Option 1: Leverage the Standard Topology on $\mathbb{R}^n$

A Borel $\sigma$-algebra can be defined for any space that comes equipped with a topology; i.e., a collection of open sets. The Borel $\sigma$-algebra is then defined as the smallest $\sigma$-algebra that contains all of these open sets. In the present setting, this means
$$
\mathcal{B}(\mathbb{R}^n) := \sigma\left\{\mathcal{O} \subseteq \mathbb{R}^n : \mathcal{O} \text{ is open} \right\}, \tag{1}
$$
where $\sigma(\mathcal{S})$ denotes the $\sigma$-algebra generated by a collection of sets $\mathcal{S}$. A nice perspective on $\mathcal{B}(\mathbb{R}^n)$ is that it is the smallest $\sigma$-algebra that ensures all continuous functions $f: \mathbb{R}^n \to \mathbb{R}$ are measurable. We note that this is not a property of Borel $\sigma$-algebras more generally, but one that does hold in the special case of $\mathbb{R}^n$; see this StackExchange post for some details.

Option 2: Product of One-Dimensional Borel Sets

A second reasonable approach is to try extending what we have already defined in one dimension, which means simply taking Cartesian products of one-dimensional Borel sets:
$$
\mathcal{B}(\mathbb{R}^n) := \sigma\left\{B_1 \times \cdots \times B_n: B_i \in \mathcal{B}(\mathbb{R}), \ i = 1, \dots, n \right\}. \tag{2}
$$
It turns out that the resulting $\sigma$-algebra agrees with the one defined in option 1, so there is no ambiguity in the notation.

Definition: One-Dimensional Projections

With the $\sigma$-algebra defined, we now consider how to define a Gaussian measure $\mu$ on the measurable space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$. We will explore a few different equivalent definitions, starting with this one: a measure is Gaussian if all of its one-dimensional projections are (univariate) Gaussians.

Definition. A probability measure $\mu$ defined on the Borel measurable space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ is called Gaussian if, for all linear maps $\ell \in (\mathbb{R}^n)^*$, the pushforward measure $\mu \circ \ell^{-1}$ is Gaussian on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

As in the univariate setting, we call a random variable $X$ Gaussian if its law $\mathcal{L}(X)$ is a Gaussian measure. Recall that each linear map $\ell$ can be identified with a unique $y \in \mathbb{R}^n$ such that $\ell(x) = \langle x, y \rangle$, which we indicate by writing $\ell = \ell_y$. We thus see that $\mu \circ \ell_{y}^{-1}$ is the distribution of the random variable $\langle X, y \rangle$. The previous definition can therefore be re-stated in the language of random variables as follows: an $n$-dimensional random variable $X$ is Gaussian if every linear combination of the entries of $X$ is univariate Gaussian. More precisely:

Definition. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and $X: \Omega \to \mathbb{R}^n$ a random vector. Then $X$ is called Gaussian if $\langle X, y \rangle$ is a univariate Gaussian random variable for all $y \in \mathbb{R}^n$.

Notice that by choosing $y := e_j$ (the vector with a $1$ in its $j^{\text{th}}$ entry and zeros everywhere else), this definition immediately tells us that a Gaussian random vector has univariate Gaussian marginal distributions. That is, if $X = (X_1, \dots, X_n)^\top$, then $X_j$ is univariate Gaussian for each $j = 1, \dots, n$.
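As a quick sanity check of this definition, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the mean vector, covariance matrix, and projection directions are arbitrary choices for illustration). It draws samples of a Gaussian vector and applies a standard normality test to a few one-dimensional projections $\langle X, y \rangle$, including a coordinate projection onto $e_1$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mean vector and covariance matrix, chosen only for illustration.
m = np.array([1.0, -2.0, 0.5])
C = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])

# Draw samples from N(m, C) using NumPy's built-in sampler.
X = rng.multivariate_normal(m, C, size=50_000)   # shape (50000, 3)

# The definition requires <X, y> to be Gaussian for *all* y; here we only
# spot-check a few directions with the D'Agostino-Pearson normality test.
directions = [np.array([1.0, 0.0, 0.0])] + [rng.standard_normal(3) for _ in range(3)]
for y in directions:
    proj = X @ y                                  # samples of <X, y>
    stat, pval = stats.normaltest(proj)
    print(f"normality test p-value for projection onto {np.round(y, 2)}: {pval:.3f}")
```

Large p-values (no evidence against normality) are what we expect for every direction tested.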

Fourier Transform

Just as in the univariate case, the Fourier transform $\hat{\mu}$ provides an alternate, equivalent characterization of Gaussian measures. First, we recall how the Fourier transform is defined in the multivariate setting.

Definition. Let $\mu$ be a measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$. Then the Fourier transform of $\mu$ is defined as
$$
\hat{\mu}(y) := \int_{\mathbb{R}^n} e^{i\langle x, y\rangle} \mu(dx), \qquad y \in \mathbb{R}^n. \tag{3}
$$

We can alternatively view $\hat{\mu}$ as a function of $\ell_y \in (\mathbb{R}^n)^*$; that is,
$$
\ell_y \mapsto \int_{\mathbb{R}^n} e^{i \ell_y(x)} \mu(dx).
$$
Note that this is similar in spirit to the definition of the $n$-dimensional Gaussian measure, in the sense that the extension from one to multiple dimensions is achieved by considering one-dimensional linear projections. This idea will also provide the basis for an extension to infinite dimensions.

With this background established, we can state the following, which gives an alternate definition of Gaussian measures.

Theorem. A probability measure $\mu$ defined on the Borel measurable space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ is Gaussian if and only if its Fourier transform is of the form
$$
\hat{\mu}(y) = \exp\left\{i\langle m, y\rangle - \frac{1}{2}\langle Cy, y\rangle \right\}, \tag{4}
$$
for some fixed vector $m \in \mathbb{R}^{n}$ and symmetric, positive semidefinite matrix $C \in \mathbb{R}^{n \times n}$.

The proof, which is given in the appendix, also provides the expressions for the mean and covariance of $\mu$ as a byproduct.

Corollary. Let $\mu$ be a Gaussian measure with Fourier transform $\hat{\mu}(y) = \exp\left\{i\langle y, m\rangle - \frac{1}{2}\langle Cy, y\rangle\right\}$. Then the mean vector and covariance matrix of $\mu$ are given by
\begin{align}
m &= \int x \, \mu(dx), \tag{5} \newline
C &= \int (x-m)(x-m)^\top \mu(dx).
\end{align}
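To make the theorem and corollary concrete, here is a minimal numerical sketch (assuming NumPy; the particular $m$, $C$, and frequency vector $y$ are arbitrary choices). It compares a Monte Carlo estimate of $\hat{\mu}(y) = \mathbb{E}[e^{i\langle X, y\rangle}]$ against the closed form (4), and the empirical mean and covariance against $m$ and $C$ from (5).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters, chosen only for illustration.
m = np.array([0.5, -1.0])
C = np.array([[1.0, 0.4],
              [0.4, 2.0]])

X = rng.multivariate_normal(m, C, size=200_000)  # samples from N(m, C)
y = np.array([0.7, -0.3])                        # an arbitrary frequency vector

# Monte Carlo estimate of the Fourier transform vs. the closed form (4).
mc_ft = np.mean(np.exp(1j * (X @ y)))
closed_ft = np.exp(1j * (m @ y) - 0.5 * (y @ C @ y))
print(mc_ft, closed_ft)                          # should roughly agree

# Empirical mean and covariance vs. m and C, as in (5).
print(X.mean(axis=0), m)
print(np.cov(X, rowvar=False), C, sep="\n")
```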

Density Function

The one-dimensional projections and Fourier transform provide equivalent definitions of multivariate Gaussian measures. The more familiar notion of the Gaussian density provides a third characterization, with the caveat that it only applies when the covariance matrix $C$ is positive definite.

Proposition. Let $\mu$ be a Gaussian measure with mean vector $m$ and covariance matrix $C$, as in (5). Then $\mu$ admits a Lebesgue density if and only if $C$ is positive definite, in which case $$ \frac{d\mu}{d\lambda}(x) = \text{det}(2\pi C)^{-1/2}\exp\left\{-\frac{1}{2} \langle C^{-1}(x-m), x-m\rangle\right\}. \tag{6} $$
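As a quick check of formula (6), the following sketch (assuming NumPy and SciPy; the mean and positive definite covariance are arbitrary choices) evaluates the density directly and compares it against SciPy's multivariate normal implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mean and (positive definite) covariance, for illustration only.
m = np.array([1.0, 2.0])
C = np.array([[1.5, 0.5],
              [0.5, 1.0]])

def gaussian_density(x, m, C):
    """Evaluate the Lebesgue density (6) at a point x."""
    d = x - m
    quad = d @ np.linalg.solve(C, d)               # <C^{-1}(x - m), x - m>
    norm_const = np.linalg.det(2 * np.pi * C) ** (-0.5)
    return norm_const * np.exp(-0.5 * quad)

x = np.array([0.3, 2.4])
print(gaussian_density(x, m, C))                   # formula (6)
print(multivariate_normal(mean=m, cov=C).pdf(x))   # reference implementation
```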

Transformation of Standard Gaussian Random Variables

In this section we provide yet another characterization of Gaussian measures. We consider a generative perspective, whereby a Gaussian random vector $X \in \mathbb{R}^n$ arises via a linear transformation of $n$ iid $\mathcal{N}(0,1)$ random variables.

Proposition. Let $Z_1, \dots, Z_n$ be iid $\mathcal{N}(0, 1)$ random variables stacked into the column vector $Z \in \mathbb{R}^n$. Then, for any fixed vector $m \in \mathbb{R}^n$ and matrix $A \in \mathbb{R}^{n \times n}$, the random variable given by $$ X := m + AZ \tag{7} $$ has a Gaussian distribution $\mathcal{N}(m, AA^\top)$. Conversely, let $X \in \mathbb{R}^n$ be a Gaussian random variable. Then there exist a vector $m \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times n}$ such that $X$ has the same distribution as $m + AZ$.

Another way to think about this is that we have defined a transport map $T: \mathbb{R}^n \to \mathbb{R}^n$ such that
\begin{align}
T(Z) &= X, &&\text{where } T(z) = m + Az.
\end{align}
That is, we feed in vectors with iid standard Gaussian components, and get out vectors with distribution $\mathcal{N}(m, AA^\top)$. This is a very practical way to look at multivariate Gaussians, as it immediately provides the basis for a sampling algorithm. Indeed, suppose we want to draw iid samples from the distribution $\mathcal{N}(m, C)$. Then the above proposition gives us a way to do so, provided that we can (1) draw univariate $\mathcal{N}(0,1)$ samples; and (2) factorize the matrix $C$ as $C = AA^\top$ for some $A$. This procedure is summarized in the corollary below.

Corollary. The following algorithm produces a sample from the distribution $\mathcal{N}(m, C)$.
1. Draw $n$ iid samples $Z_i \sim \mathcal{N}(0,1)$ and stack them in a column vector $Z$.
2. Compute a factorization $C = AA^\top$.
3. Return $m + AZ$.
Repeating steps 1 and 3 will produce independent samples from $\mathcal{N}(m, C)$ (the matrix factorization need not be re-computed each time).

As for the factorization, the Cholesky decomposition is a standard choice when $C$ is positive definite. When $C$ is only positive semidefinite, the eigendecomposition provides another option, since
$$
C = UDU^\top = UD^{1/2} D^{1/2} U^\top = (UD^{1/2})(UD^{1/2})^\top,
$$
so setting $A := UD^{1/2}$ does the trick. Note that since $C$ is positive semidefinite, $D$ is a diagonal matrix with nonnegative entries on the diagonal, so $D^{1/2}$ is well-defined.
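Below is a minimal sketch of this sampling algorithm (assuming NumPy; the target $m$ and $C$ are arbitrary choices), implementing both the Cholesky route and the eigendecomposition route for the factorization $C = AA^\top$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical target mean and covariance, for illustration only.
m = np.array([0.0, 1.0, -1.0])
C = np.array([[2.0, 0.6, 0.0],
              [0.6, 1.0, 0.3],
              [0.0, 0.3, 0.8]])

def sample_cholesky(m, C, n_samples, rng):
    """Sample from N(m, C) via the Cholesky factorization C = A A^T (C positive definite)."""
    A = np.linalg.cholesky(C)
    Z = rng.standard_normal((n_samples, len(m)))   # iid N(0, 1) entries
    return m + Z @ A.T                             # each row is m + A z

def sample_eigen(m, C, n_samples, rng):
    """Sample from N(m, C) via the eigendecomposition C = U D U^T (works for PSD C)."""
    eigvals, U = np.linalg.eigh(C)
    A = U @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None)))  # A = U D^{1/2}
    Z = rng.standard_normal((n_samples, len(m)))
    return m + Z @ A.T

samples = sample_cholesky(m, C, 100_000, rng)
print(samples.mean(axis=0))                        # should approximate m
print(np.cov(samples, rowvar=False))               # should approximate C
```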

Covariance Operator

As shown in (5) (and derived in the appendix), the covariance matrix associated with a Gaussian measure $\mu$ satisfies
\begin{align}
C &= \int (x - m)(x - m)^\top \mu(dx),
\end{align}
where $m$ and $C$ are the quantities given in the Fourier transform (4). We take a step further in this section by viewing the covariance as an operator rather than a matrix. Definitions of the covariance operator differ slightly across textbooks and the literature; we will try to touch on the different conventions here and explain their connections. As a starting point, we consider the following definition.

Definition. Let $\mu$ be a Gaussian measure with Fourier transform given by (4). Then the covariance operator of $\mu$ is defined as the function $\mathcal{C}: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ given by
$$
\mathcal{C}\left(y, y^\prime \right) = \langle Cy, y^\prime\rangle. \tag{9}
$$

We immediately have a variety of equivalent expressions for this operator:
\begin{align}
\mathcal{C}\left(y, y^\prime \right) &= \langle Cy, y^\prime\rangle \newline
&= y^\top \left[\int (x - m)(x - m)^\top \mu(dx)\right] y^\prime \newline
&= \int \langle y, x - m\rangle \langle y^\prime, x - m \rangle \mu(dx). \tag{10}
\end{align}
In terms of the random variable $X \sim \mu$, we can also write this as
\begin{align}
\mathcal{C}\left(y, y^\prime \right) &= \int \langle y, x - m\rangle \langle y^\prime, x - m \rangle \mu(dx) \newline
&= \int \left(\langle y, x\rangle - \langle y, \mathbb{E}[X]\rangle\right) \left(\langle y^\prime, x\rangle - \langle y^\prime, \mathbb{E}[X]\rangle\right) \mu(dx) \newline
&= \mathbb{E}\left[\left(\langle y, X\rangle - \mathbb{E} \langle y,X\rangle\right) \left(\langle y^\prime, X\rangle - \mathbb{E} \langle y^\prime,X\rangle\right)\right] \newline
&= \text{Cov}\left[\langle y,X\rangle, \langle y^\prime,X\rangle \right].
\end{align}
In words, the covariance operator $\mathcal{C}\left(y, y^\prime \right)$ outputs the covariance between the one-dimensional projections of $X$ along the directions $y$ and $y^\prime$. Given that the multivariate Gaussian measure is defined in terms of its one-dimensional projections, this should feel fairly natural. In fact, we see that the Fourier transform of $\mu$ can be written as
$$
\hat{\mu}(y) = \exp\left\{i\langle y, m\rangle - \frac{1}{2}\mathcal{C}(y,y) \right\}.
$$
When the same argument is fed into both slots of the covariance operator (as is the case in the Fourier transform expression above), the result is the variance of the one-dimensional projection:
$$
\mathcal{C}(y,y) = \text{Var}\left[\langle y, X\rangle \right].
$$
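The identification $\mathcal{C}(y, y^\prime) = \text{Cov}[\langle y, X\rangle, \langle y^\prime, X\rangle]$ is easy to check numerically. Here is a minimal sketch (assuming NumPy; the parameters and directions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical parameters, for illustration only.
m = np.array([0.5, -0.5])
C = np.array([[1.0, 0.7],
              [0.7, 2.0]])

X = rng.multivariate_normal(m, C, size=200_000)
y = np.array([1.0, 2.0])
y_prime = np.array([-0.5, 1.0])

# Covariance operator evaluated via the matrix: <Cy, y'>.
print(y_prime @ C @ y)

# Empirical covariance of the projections <y, X> and <y', X>.
print(np.cov(X @ y, X @ y_prime)[0, 1])
```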

Inner Products

One feature that makes the covariance operator a convenient mathematical object to study is the inner product structure it provides. Indeed, the following result states that the covariance operator is almost an inner product, and is a true inner product when the covariance matrix $C$ is positive definite.

Proposition. Let $\mu$ be a Gaussian measure with Fourier transform given by (4). Then the covariance operator (9) is symmetric, bilinear, and positive semidefinite. If $C$, the covariance matrix of $\mu$, is positive definite, then the covariance operator is also positive definite and thus defines an inner product.

Proof. Bilinearity follows immediately from definition (9). Symmetry similarly follows, and is perhaps most obvious from expression (10). Since $C$ is positive semidefinite,
$$
\mathcal{C}(y,y) = \langle Cy, y\rangle \geq 0,
$$
so $\mathcal{C}$ is also positive semidefinite. The inequality is strict when $C$ is positive definite and $y \neq 0$, in which case $\mathcal{C}(\cdot, \cdot)$ is an inner product. $\qquad \blacksquare$

We can therefore think of $\mathcal{C}$ as defining a new inner product, obtained by weighting the Euclidean inner product by a positive definite matrix $C$.

Our definition of the covariance operator $\mathcal{C}$ arises from looking at the quadratic form $\langle Cy, y\rangle$ (the expression that appears in the Fourier transform) in a new way. In particular, we viewed this as a function of two arguments, such that the above quadratic form is the value the function takes when both arguments happen to be $y$. We could look at this from yet another perspective by considering the quadratic form as a function of only one of its arguments, say, the left one. This gives another useful operator that is closely related to $\mathcal{C}$.

Definition. Let $\mu$ be a Gaussian measure with mean $m$ and covariance matrix $C$. We define the operator $\mathcal{C}^\prime: \mathbb{R}^n \to \mathbb{R}^n$ by
$$
\mathcal{C}^\prime(y) := Cy. \tag{12}
$$

By plugging in the definition of the covariance matrix, we see that this is equivalent to
$$
\mathcal{C}^\prime(y) = Cy = \left(\int (x-m)(x-m)^\top \mu(dx)\right)y = \int (x-m) \langle x-m, y\rangle \mu(dx). \tag{13}
$$
We thus have the following connection between $C$, $\mathcal{C}$, and $\mathcal{C}^\prime$:
$$
\langle Cy, y^\prime \rangle = \mathcal{C}(y, y^\prime) = \langle \mathcal{C}^\prime(y), y^\prime\rangle.
$$
While some sources also refer to $\mathcal{C}^\prime$ as the covariance operator, we will reserve this term for $\mathcal{C}$. The following result is immediate, since $\mathcal{C}^\prime$ inherits the claimed properties from $C$.

Proposition. The linear operator $\mathcal{C}^\prime$ is self-adjoint and positive semidefinite.

At this point, the definition of $\mathcal{C}^\prime$ may seem rather unnecessary given its similarity to $C$. These are, after all, essentially the same objects, aside from the fact that we view $C$ as an element of $\mathbb{R}^{n \times n}$ and $\mathcal{C}^\prime$ as an element of $L(\mathbb{R}^n, \mathbb{R}^n)$, the set of linear maps from $\mathbb{R}^n$ to $\mathbb{R}^n$. These distinctions will become more consequential when we start considering Gaussian measures in more abstract settings.
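As a small numerical illustration of identity (13), the following sketch (assuming NumPy; the parameters are arbitrary choices) compares the matrix-vector product $Cy$ with a Monte Carlo evaluation of the integral $\int (x-m)\langle x-m, y\rangle \mu(dx)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical parameters, for illustration only.
m = np.array([1.0, 0.0])
C = np.array([[1.2, 0.4],
              [0.4, 0.9]])

X = rng.multivariate_normal(m, C, size=500_000)
y = np.array([0.3, -1.1])

# Matrix form of the operator: C'(y) = Cy.
print(C @ y)

# Monte Carlo approximation of the integral in (13): E[(X - m) <X - m, y>].
centered = X - m
print(np.mean(centered * (centered @ y)[:, None], axis=0))
```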

Alternative Definition

As mentioned above, definitions of the covariance operator vary slightly in the literature. One common modification is to assume that $\mu$ is centered (zero mean) and thus define the covariance operator as
$$
\mathcal{C}(y, y^\prime) := \int \langle y, x\rangle \langle y^\prime, x\rangle \mu(dx). \tag{14}
$$
This is done primarily for convenience, as one can always center a Gaussian measure and then add back the mean when needed. Indeed, assume we are working with a Gaussian measure $\mu$ with mean $m$. To apply (14), we center the measure, which formally means considering the pushforward $\nu := \mu \circ T^{-1}$, where $T(x) := x - m$. Using subscripts to indicate the measure associated with each operator, we apply the change-of-variables theorem to obtain
\begin{align}
\mathcal{C}_{\nu}(y, y^\prime) &= \int \langle y, x\rangle \langle y^\prime, x\rangle (\mu \circ T^{-1})(dx) \newline
&= \int \langle y, T(x)\rangle \langle y^\prime, T(x)\rangle \mu(dx) \newline
&= \int \langle y, x-m \rangle \langle y^\prime, x-m \rangle \mu(dx),
\end{align}
which agrees with (10), our (uncentered) definition of $\mathcal{C}$. Thus, our original definition (10) can be thought of as first centering the measure and then applying (14). We could similarly have defined $\mathcal{C}^\prime$ in this way, via
$$
\mathcal{C}^\prime(y) := \int x \langle y,x\rangle \mu(dx).
$$
This is simply (13) with $m=0$.

Dual Space Interpretation

As we have done repeatedly throughout this post, we can identify $\mathbb{R}^n$ with its dual $(\mathbb{R}^n)^*$. This may seem needlessly pedantic in the present context, but it becomes necessary when defining Gaussian measures on infinite-dimensional spaces. The expression (10) provides the natural jumping-off point for reinterpreting the covariance operator as acting on linear functionals. To this end, we can consider re-defining the covariance operator as $\mathcal{C}: (\mathbb{R}^n)^* \times (\mathbb{R}^n)^* \to \mathbb{R}$, where
$$
\mathcal{C}(\ell, \ell^\prime) := \int \ell(x-m) \ell^\prime(x-m) \mu(dx). \tag{15}
$$
By identifying each $y \in \mathbb{R}^n$ with its dual vector $\ell_y \in (\mathbb{R}^n)^*$, this definition is seen to agree with (9). Note that $\ell$ and $\ell^\prime$ are linear, so we could have equivalently defined $\mathcal{C}$ as
$$
\mathcal{C}(\ell, \ell^\prime) := \int \left(\ell(x)-\ell(m)\right) \left(\ell^\prime(x)-\ell^\prime(m)\right) \mu(dx).
$$

We can similarly apply the dual space interpretation to $\mathcal{C}^\prime$. There are a few different ways to think about this. Let's start by identifying the codomain of $\mathcal{C}^\prime$ with its dual, and hence re-define this operator as $\mathcal{C}^\prime: \mathbb{R}^n \to (\mathbb{R}^n)^*$, where
$$
\mathcal{C}^\prime(y)(\cdot) := \mathcal{C}(y, \cdot) = \langle Cy, \cdot\rangle = \ell_{Cy}(\cdot). \tag{16}
$$
Under this definition, $\mathcal{C}^\prime$ maps an input $y \in \mathbb{R}^n$ to a linear functional $\ell_{Cy} \in (\mathbb{R}^n)^*$. Alternatively, we could identify the domain with its dual, and instead consider the operator $\mathcal{C}^\prime: (\mathbb{R}^n)^* \to \mathbb{R}^n$, where
$$
\mathcal{C}^\prime(\ell) := \int (x-m) \ell(x-m) \mu(dx). \tag{17}
$$
We can of course combine these two ideas and consider the map $\mathcal{C}^\prime: (\mathbb{R}^n)^* \to (\mathbb{R}^n)^*$. However, thinking ahead to more abstract settings, it is actually a bit more interesting to consider $\mathcal{C}^\prime: (\mathbb{R}^n)^* \to (\mathbb{R}^n)^{**}$ by identifying $\mathbb{R}^n$ with its double dual. From this perspective, the operator is defined by
$$
\mathcal{C}^\prime(\ell)(\ell^\prime) := \mathcal{C}(\ell, \ell^\prime) = \int \ell(x-m) \ell^\prime(x-m) \mu(dx). \tag{18}
$$
Notice that in this case $\mathcal{C}^\prime$ maps a dual vector $\ell$ to a double dual vector $\ell^\prime \mapsto \mathcal{C}(\ell, \ell^\prime)$ (i.e., the output is itself a function that accepts a linear functional as input). Since $\mathbb{R}^n$, $(\mathbb{R}^n)^*$, and $(\mathbb{R}^n)^{**}$ are all isomorphic, these various perspectives are interesting but perhaps a bit overkill in the present setting. When we consider the infinite-dimensional setting in the subsequent post, not all of these perspectives will generalize; the key will be identifying the perspective that does.
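The dual-space viewpoint in (16) can be mimicked in code by letting the operator return a function rather than a vector. A toy sketch (assuming NumPy; the matrix and vectors are arbitrary choices, and the function names are purely illustrative):

```python
import numpy as np

# Hypothetical covariance matrix, for illustration only.
C = np.array([[1.0, 0.2],
              [0.2, 0.8]])

def ell(y):
    """Identify y in R^n with the linear functional ell_y(x) = <x, y>."""
    return lambda x: float(np.dot(x, y))

def C_prime(y):
    """View C' as mapping y in R^n to the linear functional ell_{Cy}, as in (16)."""
    return ell(C @ y)

y = np.array([1.0, -1.0])
x = np.array([0.5, 2.0])
print(C_prime(y)(x))        # the functional ell_{Cy} evaluated at x
print(float(x @ C @ y))     # the same number computed directly as <Cy, x>
```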

Conditional Distributions

Appendix

Proof of (4): Fourier Transform Characterization

Assume that the probability measure $\mu$ has a Fourier transform given by
$$
\hat{\mu}(y) = \exp\left\{i \langle m, y\rangle - \frac{1}{2}\langle Cy, y\rangle \right\},
$$
for some nonrandom vector $m \in \mathbb{R}^n$ and symmetric positive semidefinite matrix $C \in \mathbb{R}^{n \times n}$. We must show that the pushforward $\mu \circ \ell_y^{-1}$ is Gaussian for an arbitrary $\ell_y \in \left(\mathbb{R}^n\right)^*$. We will do so by invoking the known form of the Fourier transform for univariate Gaussians. To this end, let $t \in \mathbb{R}$ and consider
\begin{align}
\mathcal{F}\left(\mu \circ \ell_y^{-1}\right)(t) &= \int e^{its} \left(\mu \circ \ell_y^{-1} \right)(ds) \newline
&= \int e^{it \ell_y(x)} \mu(dx) \newline
&= \int e^{i \langle ty, x\rangle} \mu(dx) \newline
&= \hat{\mu}(ty) \newline
&= \exp\left(i \langle m, ty\rangle - \frac{1}{2}\langle C(ty), ty\rangle \right) \newline
&= \exp\left(it \langle m, y\rangle - \frac{1}{2}t^2\langle Cy, y\rangle \right),
\end{align}
where the second equality uses the change-of-variables formula, and the final equality uses the assumed form of $\hat{\mu}$. Also recall the alternate notation for the Fourier transform: $\hat{\mu}(y) = \mathcal{F}(\mu)(y)$. We recognize the final expression above as the Fourier transform of a univariate Gaussian measure with mean $\langle y, m\rangle$ and variance $\langle Cy, y\rangle$, evaluated at frequency $t$. This implies that $\mu \circ \ell_y^{-1}$ is Gaussian. Since $\ell_y \in \left(\mathbb{R}^n \right)^*$ was arbitrary, it follows by definition that $\mu$ is Gaussian.

Conversely, assume that $\mu$ is Gaussian. Then $\mu \circ \ell_y^{-1}$ is univariate Gaussian for all $\ell_y \in \left(\mathbb{R}^n \right)^*$. We must show that $\hat{\mu}$ assumes the claimed form. Letting $y \in \mathbb{R}^n$, we have
\begin{align}
\hat{\mu}(y) &= \int e^{i \langle y, x\rangle} \mu(dx) \newline
&= \int e^{is} \left(\mu \circ \ell_y^{-1}\right)(ds) \newline
&= \mathcal{F}\left(\mu \circ \ell_y^{-1}\right)(1) \newline
&= \exp\left(i m(y) - \frac{1}{2}\sigma^2(y) \right),
\end{align}
where $m(y)$ and $\sigma^2(y)$ are the mean and variance of $\mu \circ \ell_y^{-1}$, respectively. The first equality again uses the change-of-variables formula, while the last expression follows from the assumption that $\mu \circ \ell_y^{-1}$ is Gaussian, and hence must have a Fourier transform of this form. It remains to verify that $m(y) = \langle y, m\rangle$ and $\sigma^2(y) = \langle Cy, y\rangle$ to complete the proof. By definition, the mean of $\mu \circ \ell_y^{-1}$ is given by
\begin{align}
m(y) &= \int \ell_y(x) \mu(dx) \newline
&= \int \langle y, x\rangle \mu(dx) \newline
&= \left\langle y, \int x \mu(dx) \right\rangle \newline
&=: \langle y, m \rangle,
\end{align}
where we have used the linearity of integration and defined the nonrandom vector $m := \int x \mu(dx)$. Now, for the variance we have
\begin{align}
\sigma^2(y) &= \int \left[\ell_y(x) - m(y) \right]^2 \mu(dx) \newline
&= \int \left[\langle y, x\rangle - \langle y, m \rangle \right]^2 \mu(dx) \newline
&= \int \langle y, x-m\rangle^2 \mu(dx) \newline
&= y^\top \left[\int (x-m)(x-m)^\top \mu(dx) \right] y \newline
&=: y^\top C y \newline
&= \langle Cy, y \rangle.
\end{align}
Note that $\sigma^2(y)$ is the expectation of a nonnegative quantity, so $\langle Cy, y \rangle \geq 0$ for all $y \in \mathbb{R}^n$; i.e., $C$ is positive semidefinite. We have thus shown that
\begin{align}
\hat{\mu}(y) &= \exp\left(i\langle y, m\rangle - \frac{1}{2}\langle Cy,y\rangle \right),
\end{align}
with $C$ a positive semidefinite matrix, as required. $\qquad \blacksquare$

Proof of (6): Density Function

Let's start by assuming $\mu$ is a Gaussian measure with mean $m$ and positive definite covariance matrix $C$. Then $C$ admits an eigendecomposition $C = UDU^\top$, where the columns $u_1, \dots, u_n$ of $U$ are orthonormal and $D = \text{diag}\left(\lambda_1, \dots, \lambda_n\right)$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n > 0$. By the definition of a Gaussian measure, the one-dimensional projections $\mu \circ \ell^{-1}_{u_i}$ are Gaussian, with respective means $\langle m, u_i\rangle$ and variances $\langle Cu_i, u_i\rangle = \langle \lambda_i u_i, u_i\rangle = \lambda_i > 0$ (see the above proof for the derivation of the mean and variance). Note that the positive definite assumption ensures that the variances are all strictly positive. Since the variances are positive, each of these univariate Gaussians admits a density
$$
\frac{d\left(\mu \circ \ell^{-1}_{u_i}\right)}{d\lambda}(t) = (2\pi\lambda_i)^{-1/2}\exp\left\{-\frac{1}{2\lambda_i}\left(t - \langle m, u_i\rangle \right)^2\right\},
$$
for $i = 1, \dots, n$. We will now show that $\mu$ can be written as the product of $n$ independent univariate Gaussian measures, leveraging the Fourier transform to establish this fact. Letting $y \in \mathbb{R}^n$, we lighten notation by writing $\alpha_i := \langle y, u_i\rangle$ and $\beta_i := \langle m, u_i \rangle$; $y$ and $m$ can thus be represented with respect to the eigenbasis as
\begin{align}
&y = \sum_{i=1}^{n} \alpha_i u_i, &m = \sum_{i=1}^{n} \beta_i u_i.
\end{align}
Taking the Fourier transform of $\mu$, we have
\begin{align}
\hat{\mu}(y) &= \exp\left(i\langle y,m\rangle - \frac{1}{2}\langle Cy,y\rangle \right) \newline
&= \exp\left(i\left\langle \sum_{i=1}^{n} \alpha_i u_i, \sum_{i=1}^{n} \beta_i u_i \right\rangle - \frac{1}{2}\left\langle \sum_{i=1}^{n} \alpha_i Cu_i, \sum_{i=1}^{n} \alpha_i u_i \right\rangle \right) \newline
&= \exp\left(i\sum_{i=1}^{n} \alpha_i \beta_i - \frac{1}{2}\sum_{i=1}^{n}\lambda_i \alpha_i^2 \right) \newline
&= \prod_{i=1}^{n} \exp\left(i\alpha_i \beta_i - \frac{1}{2}\lambda_i \alpha_i^2 \right) \newline
&= \prod_{i=1}^{n} \mathcal{F}\left(\mathcal{N}(\beta_i, \lambda_i) \right)(\alpha_i).
\end{align}
The Fourier transform of $\mu$ therefore factorizes, in the coordinates $\alpha_i = \langle y, u_i\rangle$ defined by the eigenbasis, into the Fourier transforms of the univariate measures $\mathcal{N}(\beta_i, \lambda_i)$. By uniqueness of Fourier transforms, $\mu$ agrees with the product of these measures in the eigenbasis coordinates, and hence its Lebesgue density is the product of the corresponding univariate densities. Rewriting this product in the original coordinates (using $\sum_{i=1}^{n} \langle x - m, u_i\rangle^2 / \lambda_i = \langle C^{-1}(x-m), x-m\rangle$ and $\prod_{i=1}^{n} \lambda_i = \det C$) yields the claimed density (6).

Proof of (7): Transformation of Standard Gaussian

For completeness, we start by proving the following basic fact.

Lemma. Let $Z_i \overset{iid}{\sim} \mathcal{N}(0, 1)$ and define the random vector $Z := \begin{bmatrix} Z_1, \dots, Z_n \end{bmatrix}^\top$. Then the law of $Z$ is multivariate Gaussian; in particular, $Z \sim \mathcal{N}(0, I)$.

Proof. Let $\mu$ and $\nu$ denote the laws of $Z$ and $Z_i$, respectively. Observe that $\mu$ is the product measure constructed from $n$ copies of $\nu$; that is, $\mu = \nu \otimes \cdots \otimes \nu$. We will establish the Gaussianity of $\mu$ by appealing to the Fourier transform. Let $y \in \mathbb{R}^n$ and consider
\begin{align}
\hat{\mu}(y) &= \int e^{i \langle y, x\rangle} \mu(dx) \newline
&= \int \prod_{i=1}^{n} \exp\left(i y_i x_i\right) (\nu \otimes \cdots \otimes \nu)(dx_1, \dots, dx_n) \newline
&= \prod_{i=1}^{n} \int \exp\left(iy_i x_i \right) \nu(dx_i) \newline
&= \prod_{i=1}^{n} \hat{\nu}(y_i) \newline
&= \prod_{i=1}^{n} \exp\left(-\frac{1}{2}y_i^2\right) \newline
&= \exp\left(-\frac{1}{2} \sum_{i=1}^{n} y_i^2 \right) \newline
&= \exp\left(-\frac{1}{2} \langle Iy, y \rangle \right),
\end{align}
where we have used the Fourier transform of the univariate Gaussian measure $\nu$ (along with Fubini's theorem to factor the integral). We recognize the final expression as the Fourier transform of a Gaussian measure with mean vector $0$ and covariance matrix $I$. $\qquad \blacksquare$

Proof of (7). Proceeding with the main result, we first show that the random variable $X := m + AZ$ has law $\mathcal{N}(m, AA^\top)$. This follows immediately from the above lemma and basic facts about the effect of affine transformations on Fourier transforms. Recall that we write $\mathcal{L}(Y)$ to denote the law of a random variable $Y$, so that $\hat{\mathcal{L}}(Y)$ is the Fourier transform of this law. We are interested in the Fourier transform
$$
\hat{\mathcal{L}}(X) = \hat{\mathcal{L}}(m + AZ).
$$
To be self-contained, we derive the required facts here; let $Y$ be an arbitrary $n$-dimensional random vector, and $m$, $A$ be nonrandom as above. Then,
\begin{align}
\hat{\mathcal{L}}(AY)(x) &= \mathbb{E}\left[\exp\left(i\langle x, AY\rangle \right) \right] = \mathbb{E}\left[\exp\left(i\langle A^\top x, Y\rangle \right) \right] = \hat{\mathcal{L}}(Y)(A^\top x)
\end{align}
and
\begin{align}
\hat{\mathcal{L}}(m + Y)(x) &= \mathbb{E}\left[\exp\left(i\langle x, m+Y\rangle \right) \right] = \exp\left(i\langle x, m\rangle \right)\mathbb{E}\left[\exp\left(i\langle x, Y\rangle \right) \right] = \exp\left(i\langle x, m\rangle \right) \hat{\mathcal{L}}(Y)(x).
\end{align}
We combine these two results in the present problem, obtaining
\begin{align}
\hat{\mathcal{L}}(X)(x) &= \hat{\mathcal{L}}(m + AZ)(x) \newline
&= \exp\left(i\langle x, m\rangle \right) \hat{\mathcal{L}}(Z)(A^\top x) \newline
&= \exp\left(i\langle x, m\rangle \right)\exp\left(-\frac{1}{2}\langle A^\top x, A^\top x\rangle \right) \newline
&= \exp\left(i\langle x, m\rangle \right)\exp\left(-\frac{1}{2}\langle AA^\top x, x\rangle \right) \newline
&= \exp\left(i\langle x, m\rangle - \frac{1}{2}\langle AA^\top x, x\rangle \right),
\end{align}
where we have used the fact that $\langle A^\top x, A^\top x\rangle = \langle AA^\top x, x\rangle$, together with the form of $\hat{\mathcal{L}}(Z)$ established in the lemma. We recognize the final expression as the Fourier transform of $\mathcal{N}(m, AA^\top)$, which establishes the first claim. For the converse, it suffices to factorize the covariance matrix of $X$ as $C = AA^\top$ (e.g., via the eigendecomposition discussed above), in which case $m + AZ$ has the same distribution as $X$.

