If you have trouble viewing this, try the pdf of this post. You can download the code used to produce the figures in this post.

In my last post, I showed that the multivariate normal, abbreviated multinormal, is a good model for the noise **w** in a linearized x-ray system model. In this post, I will discuss some of the properties of the multinormal distribution. I will show a rationale for its expression using vectors and matrices. This will lead me to discuss matrix calculus. I will describe diagonalizing and whitening transformations and derive the moment generating functions of the uninormal and multinormal to show that linear combinations of multinormals are also multinormal. This post will provide math background for my discussions of detection and maximum likelihood estimation with the linearized x-ray model.

Matrices and vectors are natural ways to arrange and keep track of multivariate data, so they are widely used in statistical signal processing. Most references, however, simply state the matrix expression for the multinormal distribution. I will give a rationale for it and use it to derive some basic properties of the distribution. For this, I start with the univariate normal (uninormal) distribution discussed in elementary probability books (see, for example, Sec. 5.4 of Ross[3])

(1)
*f*(*x*) = (1)/(√(2*π*)*σ*) exp[ − (1)/(2)((*x* − *m*)/(*σ*))^{2}]

where *m* is the expected value and *σ*^{2} is the variance. Suppose we have a set of *n* independent uninormal random variables. Their joint density function is

(2)
*f*(**x**) = *f*_{1}(*x*_{1})*f*_{2}(*x*_{2})…*f*_{n}(*x*_{n}) = (1)/((2*π*)^{n⁄2}*σ*_{1}*σ*_{2}…*σ*_{n}) exp[ − (1)/(2)∑_{k = 1}^{n}((*x*_{k} − *m*_{k})/(*σ*_{k}))^{2}]

We can summarize the {*x*_{k}} as the components of a vector **x**, the expected values as a vector **m**, and the variances as the diagonal covariance matrix

**C** = *diagonal*[*σ*^{2}_{1}, …, *σ*^{2}_{n}].

The matrix is diagonal since the covariance of independent random variables is zero. We notice that *σ*_{1}*σ*_{2}…*σ*_{n} is the square root of the determinant of the covariance, |**C**|^{1⁄2}, and that we can express the exponent as the quadratic form (**x − m**)^{T}**C**^{ − 1}(**x − m**). This leads to the final expression

(3)
*φ*(**x**) = (1)/((2*π*)^{n⁄2}|**C**|^{1⁄2}) exp[ − (1)/(2)(**x − m**)^{T}**C**^{ − 1}(**x − m**)]
Eq. 3↑ was derived for independent variables, but I will now show that it makes sense to use it for the general multinormal distribution when **C** is any symmetric, positive definite matrix and **m** is any vector of real numbers. First, since the exponential function is always non-negative and the leading factor in (3↑) is greater than 0, it is clear that *φ*(**x**) ≥ 0. Next, I will show that ∫*φ*(**x**)*d***x** = 1 by taking a detour through matrix calculus and the principal components of **C**. The principal components give us an orthogonal transformation that diagonalizes the matrix. We can also use them to “whiten” the covariance. Applying the transformation allows us to transform ∫*φ*(**x**)*d***x** into a product of integrals of uninormals, each of which is equal to one. Finally, I will derive the moment generating function of the multinormal and use it to show that linear combinations of multinormals are also normal.
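To make the equivalence of the product form (2↑) and the matrix form (3↑) concrete before the general proof, here is a small numerical check. It is a sketch of my own (numpy, diagonal covariance), not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
m = rng.normal(size=n)             # mean vector
sigma = rng.uniform(0.5, 2.0, n)   # per-component standard deviations
C = np.diag(sigma**2)              # diagonal covariance for independent components
x = rng.normal(size=n)             # an arbitrary evaluation point

# Product of univariate normal densities, Eq. 2
prod_uni = np.prod(np.exp(-0.5*((x - m)/sigma)**2) / (np.sqrt(2*np.pi)*sigma))

# Matrix form, Eq. 3, with the same diagonal C
quad = (x - m) @ np.linalg.inv(C) @ (x - m)
multi = np.exp(-0.5*quad) / ((2*np.pi)**(n/2) * np.sqrt(np.linalg.det(C)))

assert np.isclose(prod_uni, multi)
```

For a diagonal covariance the quadratic form reduces to the sum in Eq. 2 and the determinant to the product of the variances, so the two expressions agree to rounding error.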

I will introduce and derive some results from matrix calculus that will be used here and in other posts. I also list some matrix manipulation basics in Eq. 20↓ at the end of this post. For other results, you can refer to The Matrix Cookbook, which is available free online and has a huge list of formulas. The book by Harville[1] provides proofs for many of the formulas.

The derivative of a matrix whose elements are functions of a scalar is simply the matrix of the derivatives

(4)
[(∂**A**)/(∂*t*)]_{jk} = (∂*A*_{jk})/(∂*t*)

Similarly, the integral of a matrix is the matrix of the integrals of its components

(5)
[∫**A***dt*]_{jk} = ∫*A*_{jk}*dt*
The matrix formulas are applicable to a vector, which I take to be a column matrix.

Suppose we have a scalar function of a matrix *g*(**x**). An example is the atmospheric temperature as a function of the 3D position. The derivative is a matrix (actually a vector) and is the familiar gradient with components

(6)
[(∂*g*)/(∂**x**)]_{k} = (∂*g*)/(∂*x*_{k})
From this we can derive the derivative of the dot product with respect to one of the vectors. Since

*g*(**x**) = **a**^{T}**x** = **x**^{T}**a** = *a*_{1}*x*_{1} + … + *a*_{n}*x*_{n}

[(∂*g*)/(∂**x**)]_{k} = *a*_{k}

so

(7)
(∂**a**^{T}**x**)/(∂**x**) = (∂**x**^{T}**a**)/(∂**x**) = **a**
I will also use the derivative of a quadratic form *g*(**x**) = **x**^{T}**Ax**. This can be derived by writing out the products; see Sec. 15.3 of Harville[1].

(8)
(∂**x**^{T}**Ax**)/(∂**x**) = (**A** + **A**^{T})**x** = 2**Ax**   (**A** symmetric)
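Eq. 8 is easy to verify numerically with central finite differences. This is an illustrative sketch of my own (numpy, random **A** and **x**), not part of Harville's derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))        # a general (non-symmetric) matrix
x = rng.normal(size=n)

g = lambda v: v @ A @ v            # quadratic form x^T A x

# Analytic gradient from Eq. 8
grad_analytic = (A + A.T) @ x

# Central finite differences along each coordinate direction
eps = 1e-6
grad_fd = np.array([(g(x + eps*e) - g(x - eps*e)) / (2*eps)
                    for e in np.eye(n)])

assert np.allclose(grad_analytic, grad_fd, atol=1e-5)
```

Note that the general (**A** + **A**^{T})**x** form is needed here; 2**Ax** would only match the finite differences if **A** were symmetric.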

The principal components expansion of the covariance can be derived by noticing that the exponent of (3↑) defines a family of hyper-ellipsoids centered on **m**. Translating the origin to the mean value so **z = x − m**, the exponent is proportional to

(9)
**z**^{T}**C**^{ − 1}**z** = *c*

which defines a hyperellipsoid for each positive real number *c*. The first principal component is the line passing through **m** with the longest distance to any point on the surface of the ellipsoid (see the discussion in Ch. 3 of Morrison[2]). We can find this line by maximizing the squared distance **z**^{T}**z** subject to (9↑) using Lagrange multipliers. The maximum is found by adding a multiplier term to the distance and setting the derivative of

*g*(**z**) = **z**^{T}**z** − *λ*[**z**^{T}**C**^{ − 1}**z** − *c*]

equal to zero.

Applying (7↑), the derivative of the first term is

(∂**z**^{T}z)/(∂**z**) = 2** z**

Using (8↑) for the derivative of a quadratic form and the fact that **C** and therefore its inverse is symmetric

(∂*g*)/(∂**z**) = 2(** z** − *λ***C**^{ − 1}z)

Setting this equation equal to zero, we see that **z** is an eigenvector of the inverse of the covariance matrix

(10)
**C**^{ − 1}**z** = (1)/(*λ*)**z**

Since **C** is invertible, **z** is also an eigenvector of the covariance matrix, **Cz** = *λ***z**.

The derivative will be zero for any eigenvector **z**_{k}, but only one eigenvector corresponds to the maximum distance. Premultiplying (10↑) by **z**^{T}_{k},

**z**^{T}_{k}**C**^{ − 1}**z**_{k} = (1)/(*λ*_{k})**z**^{T}_{k}**z**_{k} = *c*

we see that the squared distance is **z**^{T}_{k}**z**_{k} = *λ*_{k}*c*. Therefore we can maximize the distance from the centroid by choosing the eigenvector corresponding to the largest eigenvalue. This gives us the principal axis of the ellipsoid. The other eigenvectors correspond to the remaining principal components with smaller distances, and the axes can be listed in order of decreasing eigenvalues.
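The eigenvalue characterization of the principal axis can also be checked numerically. The sketch below (my own illustration, using numpy's `eigh`) scales random directions onto the ellipsoid **z**^{T}**C**^{ − 1}**z** = *c* and confirms that no point lies farther from the centroid than √(*λ*_{max}*c*):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(3, 3))
C = B @ B.T + 3*np.eye(3)        # a symmetric, positive definite covariance
Cinv = np.linalg.inv(C)
c = 1.0                          # the constant defining the ellipsoid

lam, Phi = np.linalg.eigh(C)     # eigenvalues in ascending order

# The point z = sqrt(lam_k * c) * phi_k lies on the ellipsoid and its
# squared distance from the centroid is lam_k * c
for k in range(3):
    z = np.sqrt(lam[k]*c) * Phi[:, k]
    assert np.isclose(z @ Cinv @ z, c)
    assert np.isclose(z @ z, lam[k]*c)

# Scale random directions onto the ellipsoid; none can be farther than
# the principal semi-axis sqrt(lam_max * c)
U = rng.normal(size=(3, 1000))
scale = np.sqrt(c / np.sum(U * (Cinv @ U), axis=0))
Z = U * scale
dist2 = np.sum(Z**2, axis=0)
assert dist2.max() <= lam[-1]*c + 1e-9
```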

We can construct a coordinate transformation matrix **Φ** that diagonalizes the covariance matrix from the eigenvectors. First, we normalize the eigenvectors to unit length by dividing by their norm; we can always do this because they are guaranteed to have non-zero norm. We then form a matrix with the eigenvectors as columns. To avoid confusion, I will denote the eigenvectors as **φ**_{k}

**Φ** = [**φ**_{1} ⋯ **φ**_{n}]

With this definition

(11)
**CΦ** = **ΦD**

where **D** = *diagonal*[*λ*_{1}, …, *λ*_{n}] is the diagonal matrix of eigenvalues.
We can show that if **φ**_{j} and **φ**_{k} are two eigenvectors with different eigenvalues *λ*_{j} and *λ*_{k}, then they are orthogonal. The eigen equations for these components are

(12)
**Cφ**_{j} = *λ*_{j}**φ**_{j}
**Cφ**_{k} = *λ*_{k}**φ**_{k}

Premultiplying the top equation in (12↑) by **φ**^{T}_{k}, and taking the transpose of the bottom equation and postmultiplying by **φ**_{j},

**φ**^{T}_{k}**Cφ**_{j} = *λ*_{j}**φ**^{T}_{k}**φ**_{j}
**φ**^{T}_{k}**C**^{T}**φ**_{j} = *λ*_{k}**φ**^{T}_{k}**φ**_{j}

Since **C** is symmetric, the left hand sides of the equations are equal, so subtracting them

(*λ*_{j} − *λ*_{k})**φ**^{T}_{k}**φ**_{j} = 0

The eigenvalues are different so **φ**^{T}_{k}φ_{j} = 0 and the eigenvectors corresponding to different eigenvalues are orthogonal. This implies that the eigenvector matrix **Φ** is orthogonal so its transpose is equal to its inverse, **Φ**^{T} = Φ^{ − 1}.

The covariance of the transformed coordinates **z’ = Φ**^{T}**z** is

(13)
**C’** = ⟨**z’z’**^{T}⟩ = ⟨**Φ**^{T}**zz**^{T}**Φ**⟩ = **Φ**^{T}**CΦ** = **Φ**^{T}**ΦD** = **D**
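A quick numerical confirmation of Eqs. 11 and 13, using numpy's `eigh` to get the normalized eigenvectors (a sketch of my own, not part of the post's derivation):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
C = B @ B.T + np.eye(4)          # symmetric positive definite covariance

lam, Phi = np.linalg.eigh(C)     # columns of Phi are the normalized eigenvectors
D = np.diag(lam)

# Phi is orthogonal: its transpose is its inverse
assert np.allclose(Phi.T @ Phi, np.eye(4))

# Eq. 11: C Phi = Phi D, and Eq. 13: Phi^T C Phi = D
assert np.allclose(C @ Phi, Phi @ D)
assert np.allclose(Phi.T @ C @ Phi, D)
```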

The covariance of the **Φ**-transformed coordinates is diagonal, but the variances are different. In some cases we want them to be equal, i.e. whitened. Studying Eq. 13↑, the whitening transform is **Φ**_{w} = **ΦD**^{ − 1⁄2}. Since **D** is diagonal, **D**^{ − 1⁄2} = *diagonal*[1⁄√(*λ*_{1}), …, 1⁄√(*λ*_{n})]. The covariance of the transformed coordinates **z’ = Φ**^{T}_{w}**z** is
**C**_{w}’ = ⟨**z’z’**^{T}⟩ = ⟨**D**^{ − 1⁄2}**Φ**^{T}**zz**^{T}**ΦD**^{ − 1⁄2}⟩ = **D**^{ − 1⁄2}**Φ**^{T}**CΦD**^{ − 1⁄2} = **D**^{ − 1⁄2}**Φ**^{T}**ΦDD**^{ − 1⁄2} = **D**^{ − 1⁄2}**D**^{1⁄2} = **I**

We can use the whitening transform to define a useful factorization of the covariance. Defining **V = Φ**^{T}_{w},

**V**^{T}**V** = **Φ**_{w}**Φ**^{T}_{w} = **ΦD**^{ − 1⁄2}**D**^{ − 1⁄2}**Φ**^{T} = **Φ**(**ΦD**)^{ − 1} = **Φ**(**CΦ**)^{ − 1} = **ΦΦ**^{ − 1}**C**^{ − 1} = **C**^{ − 1}

where the third step uses the fact that the transpose of an orthogonal matrix is its inverse, so **D**^{ − 1}**Φ**^{T} = **D**^{ − 1}**Φ**^{ − 1} = (**ΦD**)^{ − 1}, and the fourth uses (11↑).
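Both the whitening transform and the factorization can be verified numerically. This numpy sketch (my own illustration) builds **Φ**_{w} = **ΦD**^{ − 1⁄2} from a random positive definite **C**:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(4, 4))
C = B @ B.T + np.eye(4)                  # symmetric positive definite covariance

lam, Phi = np.linalg.eigh(C)
Phi_w = Phi @ np.diag(1/np.sqrt(lam))    # whitening transform Phi D^{-1/2}

# Whitened covariance Phi_w^T C Phi_w is the identity
assert np.allclose(Phi_w.T @ C @ Phi_w, np.eye(4))

# Factorization: with V = Phi_w^T, V^T V = C^{-1}
V = Phi_w.T
assert np.allclose(V.T @ V, np.linalg.inv(C))
```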

In this section, I will show that the density function for the general multinormal (Eq. 3↑) has the correct normalizing constant. For this, I need to compute its integral. Defining an unnormalized density function as

*d*(**x**) = exp[ − (1⁄2)(**x − m**)^{T}**C**^{ − 1}(**x − m**)]

we need to compute ∫*d*(**x**)*d***x**. Translating the origin to **m** so **z = x − m**, and transforming by **Φ** so **z = Φu**, the exponent in the multinormal distribution is

(14)
**z**^{T}**C**^{ − 1}**z** = **u**^{T}**Φ**^{T}**C**^{ − 1}**Φu**

Postmultiplying (11↑) by **Φ**^{ − 1}, **C = ΦDΦ**^{ − 1}, so **C**^{ − 1} = **ΦD**^{ − 1}**Φ**^{ − 1}. Substituting in (14↑),

**z**^{T}**C**^{ − 1}**z** = **u**^{T}**Φ**^{T}**ΦD**^{ − 1}**Φ**^{ − 1}**Φu** = **u**^{T}**D**^{ − 1}**u**

The last step follows because the transpose of an orthogonal matrix is its inverse, **Φ**^{T} = Φ^{ − 1}. Since **D** is diagonal
**D**^{ − 1} = *diagonal*[1⁄*λ*_{1}, …, 1⁄*λ*_{n}]

The exponent is then

**u**^{T}**D**^{ − 1}**u** = ∑_{k = 1}^{n}(*u*^{2}_{k})/(*λ*_{k})

and

∫*d*(**x**)*d***x** = ∫exp[ − (1)/(2)∑_{k = 1}^{n}(*u*^{2}_{k})/(*λ*_{k})]|**Φ**|*d***u**

For an orthogonal transformation the Jacobian determinant has absolute value one, |**Φ**| = 1, and we can write the transformed integrand as

exp[ − (1)/(2)∑_{k = 1}^{n}(*u*^{2}_{k})/(*λ*_{k})] = *g*_{1}(*u*_{1})*g*_{2}(*u*_{2})…*g*_{n}(*u*_{n})

where
*g*_{k}(*u*_{k}) = exp⎡⎣ − (1)/(2)(*u*^{2}_{k})/(*λ*_{k})⎤⎦

We know from the uninormal distribution that ∫exp[ − (1)/(2)(*u*^{2}_{k})/(*λ*_{k})]*du*_{k} = √(2*π**λ*_{k}). Therefore, the integral of *d*(**x**) is the product of the univariate integrals and ∫*d*(**x**)*d***x** = (2*π*)^{n⁄2}√(*λ*_{1}*λ*_{2}…*λ*_{n}). The product of the eigenvalues is the determinant of the **D** matrix defined in (11↑), and since **C = ΦDΦ**^{ − 1} and **Φ** is orthogonal, |**C**| = |**D**|. The integral is therefore

∫*d*(**x**)*d***x** = (2*π*)^{n⁄2}|**C**|^{1⁄2}

so the proposed multinormal density (3↑) has the correct normalizing constant.
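The normalizing constant can also be checked by brute-force quadrature in a low-dimensional case. The sketch below (my own, *n* = 2, with a grid wide enough that the tails are negligible) compares the grid sum of *d*(**x**) with (2*π*)^{n⁄2}|**C**|^{1⁄2}:

```python
import numpy as np

# 2-D example: integrate the unnormalized density d(x) on a grid
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])
Cinv = np.linalg.inv(C)

# d(x) depends only on z = x - m, so integrate over z directly;
# [-10, 10] is many standard deviations wide, so the tails are negligible
z = np.linspace(-10.0, 10.0, 401)
dz = z[1] - z[0]
Z1, Z2 = np.meshgrid(z, z)
quad = Cinv[0, 0]*Z1**2 + 2*Cinv[0, 1]*Z1*Z2 + Cinv[1, 1]*Z2**2
integral = np.exp(-0.5*quad).sum() * dz**2

# Expected value: (2*pi)^{n/2} |C|^{1/2} with n = 2
expected = (2*np.pi) * np.sqrt(np.linalg.det(C))
assert np.isclose(integral, expected, rtol=1e-3)
```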

An important property of multinormal variables is that linear combinations are also normal. I will prove this by deriving the moment generating function, which is also useful for other purposes. Let’s start with the moment generating function of the “standard” uninormal random variable N(0, 1) with 0 mean and variance equal to 1. By definition this is

(15)
*M*_{N}(*t*) = ⟨*e*^{tX}⟩
 = (1)/(√(2*π*))∫exp(*tx*)exp( − *x*^{2}⁄2)*dx*
 = (1)/(√(2*π*))∫exp( − (*x*^{2} − 2*tx* + *t*^{2})/(2) + (*t*^{2})/(2))*dx*
 = (1)/(√(2*π*))*e*^{t^{2}⁄2}∫exp( − ((*x* − *t*)^{2})/(2))*dx*
 = *e*^{t^{2}⁄2}

The general uninormal random variable N(*m*, *σ*^{2}) is a linear combination of a constant and the “standard” uninormal, *Y* = *m* + *σ*N. Its moment generating function is

(16)
*M*_{Y}(*t*) = ⟨*e*^{t(m + σN)}⟩ = *e*^{mt}⟨*e*^{σtN}⟩ = *e*^{mt}*M*_{N}(*σt*) = exp(*mt* + (*σ*^{2}*t*^{2})/(2))
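Eq. 16 can be sanity-checked by Monte Carlo: sample *Y* = *m* + *σ*N and compare the empirical average of exp(*tY*) with the closed form. A sketch of my own (numpy, arbitrary *m*, *σ*, *t*):

```python
import numpy as np

rng = np.random.default_rng(5)
m, sigma, t = 1.0, 0.7, 0.5

# Empirical MGF <exp(tY)> for Y = m + sigma*N, N standard normal
Y = m + sigma * rng.standard_normal(200_000)
mgf_empirical = np.exp(t * Y).mean()

# Eq. 16: M_Y(t) = exp(m t + sigma^2 t^2 / 2)
mgf_formula = np.exp(m*t + sigma**2 * t**2 / 2)

assert np.isclose(mgf_empirical, mgf_formula, rtol=0.02)
```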

Now, let’s consider the vector case. The joint moment generating function is defined to be *M*_{X}(**t**) = ⟨exp(**t**^{T}**X**)⟩, where **t** and **X** are now vectors of length *n*. If **N**(**0**, **1**) is the “standard” multinormal random vector with independent components, **0** mean, and all variances equal to 1,

(17)
*M*_{N}(**t**) = ⟨exp(*t*_{1}N_{1} + *t*_{2}N_{2} + … + *t*_{n}N_{n})⟩
 = ⟨*e*^{t_{1}N_{1}}⟩⋯⟨*e*^{t_{n}N_{n}}⟩   (components independent)
 = exp[(*t*^{2}_{1})/(2) + … + (*t*^{2}_{n})/(2)]   (using Eq. 15↑)
 = exp((1)/(2)**t**^{T}**t**)

We can use this to derive the moment generating function of the general multinormal distribution. To do this, I will derive the formula for the linear combination of a general multivariate random vector, not necessarily normal. Let **Y = A + B**^{T}X where **X** is an *n* × 1 random vector, **A** is an *m* × 1 constant vector and **B** is an *n* × *m* constant matrix. The moment generating function of **Y** is

(18)
*M*_{Y}(**t**) = ⟨exp(**t**^{T}**A** + **t**^{T}**B**^{T}**X**)⟩ = exp(**t**^{T}**A**)⟨exp(**t**^{T}**B**^{T}**X**)⟩ = exp(**t**^{T}**A**)*M*_{X}(**Bt**)

To derive the moment generating function of a general multinormal, let **Y = m + S**^{T}**N**(**0**, **1**) where **N**(**0**, **1**) is the standard multinormal (i.e. with zero mean and unit diagonal covariance). Applying the general result for a linear combination (18↑),

*M*_{Y}(**t**) = exp(**t**^{T}**m**)*M*_{N}(**St**) = exp[**t**^{T}**m** + (1)/(2)**t**^{T}**S**^{T}**St**] = exp[**t**^{T}**m** + (1)/(2)**t**^{T}**Ct**]

where the covariance **C = S**^{T}**S**.
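The multinormal moment generating function can be checked the same way by sampling **Y = m + S**^{T}**N**. This numpy sketch (my own, with an arbitrary factor **S**) compares the empirical ⟨exp(**t**^{T}**Y**)⟩ with exp(**t**^{T}**m** + ½**t**^{T}**Ct**):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
m = np.array([0.2, -0.1, 0.4])
S = 0.3 * rng.normal(size=(n, n))   # any factor S; the covariance is C = S^T S
C = S.T @ S
t = np.array([0.3, -0.2, 0.1])

# Sample Y = m + S^T N with N the standard multinormal
N = rng.standard_normal((n, 500_000))
Y = m[:, None] + S.T @ N

# Empirical joint MGF <exp(t^T Y)> vs the closed form exp(t^T m + t^T C t / 2)
mgf_empirical = np.exp(t @ Y).mean()
mgf_formula = np.exp(t @ m + 0.5 * t @ C @ t)

assert np.isclose(mgf_empirical, mgf_formula, rtol=0.02)
```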

To show that linear combinations of multinormals are also multinormal, I will use the moment generating function of a linear combination of a general multinormal random vector, **Y**_{normal} = **A** + **B**^{T}**N**

(19)
*M*_{Y_{normal}}(**t**) = exp(**t**^{T}**A**)*M*_{N}(**Bt**)   (applying (18↑))
 = exp(**t**^{T}**A**)exp(**t**^{T}**B**^{T}**m** + (1)/(2)**t**^{T}**B**^{T}**CBt**)   (applying the multinormal MGF)
 = exp(**t**^{T}(**A** + **B**^{T}**m**) + (1)/(2)**t**^{T}**B**^{T}**CBt**)

This is the moment generating function of a multinormal random vector with expected value **A** + **B**^{T}**m** and covariance **B**^{T}**CB**.
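The mean and covariance of a linear combination **Y = A + B**^{T}**X** can be confirmed by simulation. In the sketch below (my own; it samples **X** via a Cholesky factor, which is one valid choice of the factor **S** above), the sample moments match **A** + **B**^{T}**m** and **B**^{T}**CB**:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 3, 2
m = np.array([1.0, -2.0, 0.5])
F = rng.normal(size=(n, n))
C = F @ F.T + np.eye(n)              # covariance of X
A = np.array([0.5, -1.0])            # p x 1 constant vector
B = rng.normal(size=(n, p))          # n x p constant matrix

# Sample X ~ N(m, C) using a Cholesky factor, then form Y = A + B^T X
L = np.linalg.cholesky(C)
X = m[:, None] + L @ rng.standard_normal((n, 500_000))
Y = A[:, None] + B.T @ X

# Eq. 19 predicts mean A + B^T m and covariance B^T C B
assert np.allclose(Y.mean(axis=1), A + B.T @ m, atol=0.05)
assert np.allclose(np.cov(Y), B.T @ C @ B, atol=0.3)
```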

With this general result, we can also show that the components of a multinormal random vector are uninormal. In (19↑), let **A = 0** and **B** = [1 0 ⋯ 0]^{T}. Then *M*_{Y}(*t*) = exp(*m*_{1}*t* + (1)/(2)*C*_{11}*t*^{2}), which is the moment generating function of a uninormal with mean *m*_{1} and variance *C*_{11}. The same argument applies to every component.

The material discussed in this post will be used during my discussion of statistical detection and estimation theory applied to x-ray imaging. Eq. 20↓ summarizes basic matrix manipulations

(20)
(**A + B**)^{T} = **A**^{T} + **B**^{T}
(**ABC**)^{T} = **C**^{T}**B**^{T}**A**^{T}
(**ABC**)^{ − 1} = **C**^{ − 1}**B**^{ − 1}**A**^{ − 1}   (**A**, **B**, **C** square, invertible)
(**A**^{T})^{ − 1} = (**A**^{ − 1})^{T}

Eq. 21↓ shows some basic matrix calculus formulas

(21)
(∂**a**^{T}**x**)/(∂**x**) = (∂**x**^{T}**a**)/(∂**x**) = **a**   (**a**, **x** equal-length column vectors)
(∂**x**^{T}**Ax**)/(∂**x**) = (**A** + **A**^{T})**x**   (**A** a matrix, **x** a vector)
(∂**x**^{T}**Ax**)/(∂**x**) = 2**Ax**   (**A** symmetric)

Eq. 22↓ has some formulas for normal distributions

(22)
N(*m*, *σ*^{2}; *x*) = (1)/(√(2*π*)*σ*)exp[ − (1)/(2)((*x* − *m*)/(*σ*))^{2}]   (univariate normal; mean *m*, variance *σ*^{2})
*M*(*t*) = exp(*mt* + (*σ*^{2}*t*^{2})/(2))   (univariate moment generating function)
**N**(**m**, **C**; **x**) = (1)/((2*π*)^{n⁄2}|**C**|^{1⁄2})exp[ − (1⁄2)(**x − m**)^{T}**C**^{ − 1}(**x − m**)]   (multivariate normal; mean **m**, covariance **C**)
*M*(**t**) = exp[**t**^{T}**m** + (1)/(2)**t**^{T}**Ct**]   (multivariate normal moment generating function)

Eq. 23↓ lists the diagonalizing and whitening transforms

(23)
**Cφ**_{k} = *λ*_{k}**φ**_{k}   (eigenvectors of the covariance)
**CΦ = ΦD**   (**Φ** the matrix of eigenvectors, **D** the diagonal matrix of eigenvalues)
**z’ = Φ**^{T}**z**   (diagonalizing transform)
**Φ**_{w} = **ΦD**^{ − 1⁄2}, **z**_{white} = **Φ**^{T}_{w}**z**   (whitening transform)
**V = Φ**^{T}_{w}, **C**^{ − 1} = **V**^{T}**V**   (whitening factorization)

Last edited Jan 06, 2012

© 2012 by Aprend Technology and Robert E. Alvarez

Linking is allowed but reposting or mirroring is expressly forbidden.

[1] D. A. Harville: *Matrix Algebra From a Statistician's Perspective*. Springer, 2008.

[2] D. F. Morrison: *Multivariate Statistical Methods*. Thomson/Brooks/Cole, 2005.

[3] S. Ross: *A First Course in Probability*. Prentice Hall, 1997.