If you have trouble viewing this, try the pdf of this post. You can download the code used to produce the figures in this post.

In my last post, I showed that the multivariate normal, abbreviated multinormal, is a good model for the noise **w** in a linearized x-ray system model. In this post, I will discuss some of the properties of the multinormal distribution. I will show a rationale for its expression using vectors and matrices. This will lead me to discuss matrix calculus. I will describe diagonalizing and whitening transformations and derive the moment generating functions of the uninormal and multinormal to show that linear combinations of multinormals are also multinormal. This post will provide math background for my discussions of detection and maximum likelihood estimation with the linearized x-ray model.

Matrices and vectors are natural ways to arrange and keep track of multivariate data, so they are widely used in statistical signal processing. Most references, however, simply state the matrix expression for the multinormal distribution. I will give a rationale for it and use it to derive some basic properties of the distribution. For this, I start with the univariate normal (uninormal) distribution discussed in elementary probability books (see, for example, Sec. 5.4 of Ross[3])

(1)
*f*(*x*) = (1)/(√(2*π*)*σ*) exp[ − (1)/(2)((*x* − *m*)/(*σ*))^{2}]

where *m* is the expected value and *σ*^{2} is the variance. Suppose we have a set of *n* independent uninormal random variables. Their joint density function is

(2)
*f*(**x**) = *f*_{1}(*x*_{1})*f*_{2}(*x*_{2})…*f*_{n}(*x*_{n}) = (1)/((2*π*)^{n⁄2}*σ*_{1}*σ*_{2}…*σ*_{n}) exp[ − (1)/(2)∑_{k = 1}^{n}((*x*_{k} − *m*_{k})/(*σ*_{k}))^{2}]

We can summarize the {*x*_{k}} as the components of a vector **x**, the expected values as a vector **m**, and the variances as the diagonal covariance matrix

**C** = *diagonal*[*σ*^{2}_{1}, …, *σ*^{2}_{n}].

The matrix is diagonal since the covariance of independent random variables is zero. We notice that *σ*_{1}*σ*_{2}…*σ*_{n} is the square root of the determinant of the covariance, |**C**|^{1⁄2}, and that we can express the exponent as the quadratic form (**x − m**)^{T}**C**^{ − 1}(**x − m**). This leads to the final expression

(3)
*φ*(**x**) = (1)/((2*π*)^{n⁄2}|**C**|^{1⁄2}) exp[ − (1)/(2)(**x − m**)^{T}**C**^{ − 1}(**x − m**)]
Eq. 3↑ was derived for independent variables, but I will now show that it makes sense to use it for the general multinormal distribution when **C** is any symmetric, positive definite matrix and **m** is any vector of real numbers. First, since the exponential function is always non-negative and the leading factor in (3↑) is greater than 0, it is clear that *φ*(**x**) ≥ 0. Next, I will show that ∫*φ*(**x**)*d***x** = 1 by taking a detour through matrix calculus and the principal components of **C**. The principal components give us an orthogonal transformation that diagonalizes the matrix. We can also use them to “whiten” the covariance. Applying the transformation allows us to transform ∫*φ*(**x**)*d***x** into a product of integrals of uninormals, each of which is equal to one. Finally, I will derive the moment generating function of the multinormal and use it to show that linear combinations of multinormals are also normal.
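To make the equivalence of the product form (2↑) and the matrix form (3↑) concrete before the general proof, here is a small numerical check. It is a sketch of my own (numpy, diagonal covariance), not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
m = rng.normal(size=n)             # mean vector
sigma = rng.uniform(0.5, 2.0, n)   # per-component standard deviations
C = np.diag(sigma**2)              # diagonal covariance for independent components
x = rng.normal(size=n)             # an arbitrary evaluation point

# Product of univariate normal densities, Eq. 2
prod_uni = np.prod(np.exp(-0.5*((x - m)/sigma)**2) / (np.sqrt(2*np.pi)*sigma))

# Matrix form, Eq. 3, with the same diagonal C
quad = (x - m) @ np.linalg.inv(C) @ (x - m)
multi = np.exp(-0.5*quad) / ((2*np.pi)**(n/2) * np.sqrt(np.linalg.det(C)))

assert np.isclose(prod_uni, multi)
```

For a diagonal covariance the quadratic form reduces to the sum in Eq. 2 and the determinant to the product of the variances, so the two expressions agree to rounding error.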

I will introduce and derive some results from matrix calculus that will be used here and in other posts. I also list some matrix manipulation basics in Eq. 20↓ at the end of this post. For other results, you can refer to The Matrix Cookbook, which is available free online and has a huge list of formulas. The book by Harville[1] provides proofs for many of the formulas.

The derivative of a matrix whose elements are functions of a scalar is simply the matrix of the derivatives

(4)
[(∂**A**)/(∂*t*)]_{jk} = (∂*A*_{jk})/(∂*t*)

Similarly, the integral of a matrix is the matrix of the integrals of its components

(5)
[∫**A***dt*]_{jk} = ∫*A*_{jk}*dt*
The matrix formulas are applicable to a vector, which I take to be a column matrix.

Suppose we have a scalar function of a matrix *g*(**x**). An example is the atmospheric temperature as a function of the 3D position. The derivative is a matrix (actually a vector) and is the familiar gradient with components

(6)
[(∂*g*)/(∂**x**)]_{k} = (∂*g*)/(∂*x*_{k})
From this we can derive the derivative of the dot product with respect to one of the vectors. Since

*g*(**x**) = **a**^{T}**x** = **x**^{T}**a** = *a*_{1}*x*_{1} + … + *a*_{n}*x*_{n}

[(∂*g*)/(∂**x**)]_{k} = *a*_{k}

so

(7)
(∂**a**^{T}**x**)/(∂**x**) = (∂**x**^{T}**a**)/(∂**x**) = **a**
I will also use the derivative of a quadratic form *g*(**x**) = **x**^{T}**Ax**. This can be derived by writing out the products; see Sec. 15.3 of Harville[1].

(8)
(∂**x**^{T}**Ax**)/(∂**x**) = (**A** + **A**^{T})**x** = 2**Ax**   (**A** symmetric)
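Eq. 8 is easy to verify numerically with central finite differences. This is an illustrative sketch of my own (numpy, random **A** and **x**), not part of Harville's derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))        # a general (non-symmetric) matrix
x = rng.normal(size=n)

g = lambda v: v @ A @ v            # quadratic form x^T A x

# Analytic gradient from Eq. 8
grad_analytic = (A + A.T) @ x

# Central finite differences along each coordinate direction
eps = 1e-6
grad_fd = np.array([(g(x + eps*e) - g(x - eps*e)) / (2*eps)
                    for e in np.eye(n)])

assert np.allclose(grad_analytic, grad_fd, atol=1e-5)
```

Note that the general (**A** + **A**^{T})**x** form is needed here; 2**Ax** would only match the finite differences if **A** were symmetric.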

The principal components expansion of the covariance can be derived by noticing that the exponent of (3↑) defines a family of hyper-ellipsoids centered on **m**. Translating the origin to the mean value so **z = x − m**, the exponent is proportional to

(9)
**z**^{T}**C**^{ − 1}**z** = *c*

which defines a hyperellipsoid for each positive real number *c*. The first principal component is the line passing through **m** with the longest distance to any point on the surface of the ellipsoid (see the discussion in Ch. 3 of Morrison[2]). We can find this line by maximizing the squared distance **z**^{T}**z** subject to (9↑) using Lagrange multipliers. The maximum is found by adding a multiplier term to the distance and setting the derivative of

*g*(**z**) = **z**^{T}**z** − *λ*[**z**^{T}**C**^{ − 1}**z** − *c*]

equal to zero.

Applying (7↑), the derivative of the first term is

(∂**z**^{T}z)/(∂**z**) = 2** z**

Using (8↑) for the derivative of a quadratic form and the fact that **C** and therefore its inverse is symmetric

(∂*g*)/(∂**z**) = 2(** z** − *λ***C**^{ − 1}z)

Setting this equation equal to zero, we see that **z** is an eigenvector of the inverse of the covariance matrix

(10)
**C**^{ − 1}**z** = (1)/(*λ*)**z**

Since **C** is invertible, **z** is also an eigenvector of the covariance matrix, **Cz** = *λ***z**.

The derivative will be zero for any eigenvector **z**_{k}, but only one eigenvector corresponds to the maximum distance. Premultiplying (10↑) by **z**^{T}_{k},

**z**^{T}_{k}**C**^{ − 1}**z**_{k} = (1)/(*λ*_{k})**z**^{T}_{k}**z**_{k} = *c*

we see that the squared distance is **z**^{T}_{k}**z**_{k} = *λ*_{k}*c*. Therefore we can maximize the distance from the centroid by choosing the eigenvector corresponding to the largest eigenvalue. This gives us the principal axis of the ellipsoid. The other eigenvectors correspond to the remaining principal components with smaller distances, and the axes can be listed in order of decreasing eigenvalues.
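The eigenvalue characterization of the principal axis can also be checked numerically. The sketch below (my own illustration, using numpy's `eigh`) scales random directions onto the ellipsoid **z**^{T}**C**^{ − 1}**z** = *c* and confirms that no point lies farther from the centroid than √(*λ*_{max}*c*):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(3, 3))
C = B @ B.T + 3*np.eye(3)        # a symmetric, positive definite covariance
Cinv = np.linalg.inv(C)
c = 1.0                          # the constant defining the ellipsoid

lam, Phi = np.linalg.eigh(C)     # eigenvalues in ascending order

# The point z = sqrt(lam_k * c) * phi_k lies on the ellipsoid and its
# squared distance from the centroid is lam_k * c
for k in range(3):
    z = np.sqrt(lam[k]*c) * Phi[:, k]
    assert np.isclose(z @ Cinv @ z, c)
    assert np.isclose(z @ z, lam[k]*c)

# Scale random directions onto the ellipsoid; none can be farther than
# the principal semi-axis sqrt(lam_max * c)
U = rng.normal(size=(3, 1000))
scale = np.sqrt(c / np.sum(U * (Cinv @ U), axis=0))
Z = U * scale
dist2 = np.sum(Z**2, axis=0)
assert dist2.max() <= lam[-1]*c + 1e-9
```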

We can construct a coordinate transformation matrix **Φ** that diagonalizes the covariance matrix from the eigenvectors. First, we normalize the eigenvectors to unit length by dividing by their norm; we can always do this because they are guaranteed to have non-zero norm. We then form a matrix with the eigenvectors as columns. To avoid confusion, I will denote the eigenvectors as **φ**_{k}

**Φ** = [**φ**_{1} ⋯ **φ**_{n}]

With this definition

(11)
**CΦ** = **ΦD**

where **D** = *diagonal*[*λ*_{1}, …, *λ*_{n}] is the diagonal matrix of eigenvalues.
We can show that if **φ**_{j} and **φ**_{k} are two eigenvectors with different eigenvalues *λ*_{j} and *λ*_{k}, then they are orthogonal. The eigen equations for these components are

(12)
**Cφ**_{j} = *λ*_{j}**φ**_{j}
**Cφ**_{k} = *λ*_{k}**φ**_{k}

Premultiplying the top equation in (12↑) by **φ**^{T}_{k}, and taking the transpose of the bottom equation and postmultiplying by **φ**_{j},

**φ**^{T}_{k}**Cφ**_{j} = *λ*_{j}**φ**^{T}_{k}**φ**_{j}
**φ**^{T}_{k}**C**^{T}**φ**_{j} = *λ*_{k}**φ**^{T}_{k}**φ**_{j}

Since **C** is symmetric, the left hand sides of the equations are equal, so subtracting them

(*λ*_{j} − *λ*_{k})**φ**^{T}_{k}**φ**_{j} = 0

The eigenvalues are different so **φ**^{T}_{k}φ_{j} = 0 and the eigenvectors corresponding to different eigenvalues are orthogonal. This implies that the eigenvector matrix **Φ** is orthogonal so its transpose is equal to its inverse, **Φ**^{T} = Φ^{ − 1}.

The covariance of the transformed coordinates **z’ = Φ**^{T}**z** is

(13)
**C’** = ⟨**z’z’**^{T}⟩ = ⟨**Φ**^{T}**zz**^{T}**Φ**⟩ = **Φ**^{T}**CΦ** = **Φ**^{T}**ΦD** = **D**
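A quick numerical confirmation of Eqs. 11 and 13, using numpy's `eigh` to get the normalized eigenvectors (a sketch of my own, not part of the post's derivation):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
C = B @ B.T + np.eye(4)          # symmetric positive definite covariance

lam, Phi = np.linalg.eigh(C)     # columns of Phi are the normalized eigenvectors
D = np.diag(lam)

# Phi is orthogonal: its transpose is its inverse
assert np.allclose(Phi.T @ Phi, np.eye(4))

# Eq. 11: C Phi = Phi D, and Eq. 13: Phi^T C Phi = D
assert np.allclose(C @ Phi, Phi @ D)
assert np.allclose(Phi.T @ C @ Phi, D)
```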

The covariance of the **Φ**-transformed coordinates is diagonal, but the variances are different. In some cases we want them to be equal, i.e. whitened. Studying Eq. 13↑, the whitening transform is **Φ**_{w} = **ΦD**^{ − 1⁄2}. Since **D** is diagonal, **D**^{ − 1⁄2} = *diagonal*[1⁄√(*λ*_{1}), …, 1⁄√(*λ*_{n})]. The covariance of the transformed coordinates **z’ = Φ**^{T}_{w}**z** is
**C**_{w}’ = ⟨**z’z’**^{T}⟩ = ⟨**D**^{ − 1⁄2}**Φ**^{T}**zz**^{T}**ΦD**^{ − 1⁄2}⟩ = **D**^{ − 1⁄2}**Φ**^{T}**CΦD**^{ − 1⁄2} = **D**^{ − 1⁄2}**Φ**^{T}**ΦDD**^{ − 1⁄2} = **D**^{ − 1⁄2}**D**^{1⁄2} = **I**

We can use the whitening transform to define a useful factorization of the covariance. Defining **V = Φ**^{T}_{w},

**V**^{T}**V** = **Φ**_{w}**Φ**^{T}_{w} = **ΦD**^{ − 1⁄2}**D**^{ − 1⁄2}**Φ**^{T} = **Φ**(**ΦD**)^{ − 1} = **Φ**(**CΦ**)^{ − 1} = **ΦΦ**^{ − 1}**C**^{ − 1} = **C**^{ − 1}

where the third step uses the fact that the transpose of an orthogonal matrix is its inverse, so **D**^{ − 1}**Φ**^{T} = **D**^{ − 1}**Φ**^{ − 1} = (**ΦD**)^{ − 1}, and the fourth uses (11↑).
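Both the whitening transform and the factorization can be verified numerically. This numpy sketch (my own illustration) builds **Φ**_{w} = **ΦD**^{ − 1⁄2} from a random positive definite **C**:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(4, 4))
C = B @ B.T + np.eye(4)                  # symmetric positive definite covariance

lam, Phi = np.linalg.eigh(C)
Phi_w = Phi @ np.diag(1/np.sqrt(lam))    # whitening transform Phi D^{-1/2}

# Whitened covariance Phi_w^T C Phi_w is the identity
assert np.allclose(Phi_w.T @ C @ Phi_w, np.eye(4))

# Factorization: with V = Phi_w^T, V^T V = C^{-1}
V = Phi_w.T
assert np.allclose(V.T @ V, np.linalg.inv(C))
```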

In this section, I will show that the density function for the general multinormal (Eq. 3↑) has the correct normalizing constant. For this, I need to compute its integral. Defining an unnormalized density function as

*d*(**x**) = exp[ − (1⁄2)(**x − m**)^{T}**C**^{ − 1}(**x − m**)]

we need to compute ∫*d*(**x**)*d***x**. Translating the origin to **m** so **z = x − m**, and transforming by **Φ** so **z = Φu**, the exponent in the multinormal distribution is

(14)
**z**^{T}**C**^{ − 1}**z** = **u**^{T}**Φ**^{T}**C**^{ − 1}**Φu**

Postmultiplying (11↑) by **Φ**^{ − 1}, **C = ΦDΦ**^{ − 1}, so **C**^{ − 1} = **ΦD**^{ − 1}**Φ**^{ − 1}. Substituting in (14↑),

**z**^{T}**C**^{ − 1}**z** = **u**^{T}**Φ**^{T}**ΦD**^{ − 1}**Φ**^{ − 1}**Φu** = **u**^{T}**D**^{ − 1}**u**

The last step follows because the transpose of an orthogonal matrix is its inverse, **Φ**^{T} = Φ^{ − 1}. Since **D** is diagonal
**D**^{ − 1} = *diagonal*[1⁄*λ*_{1}, …, 1⁄*λ*_{n}]

The exponent is then

**u**^{T}**D**^{ − 1}**u** = ∑_{k = 1}^{n}(*u*^{2}_{k})/(*λ*_{k})

and

∫*d*(**x**)*d***x** = ∫exp[ − (1)/(2)∑_{k = 1}^{n}(*u*^{2}_{k})/(*λ*_{k})]|**Φ**|*d***u**

For an orthogonal transformation the Jacobian determinant has absolute value one, |**Φ**| = 1, and we can write the transformed integrand as

exp[ − (1)/(2)∑_{k = 1}^{n}(*u*^{2}_{k})/(*λ*_{k})] = *g*_{1}(*u*_{1})*g*_{2}(*u*_{2})…*g*_{n}(*u*_{n})

where
*g*_{k}(*u*_{k}) = exp⎡⎣ − (1)/(2)(*u*^{2}_{k})/(*λ*_{k})⎤⎦

We know from the uninormal distribution that ∫exp[ − (1)/(2)(*u*^{2}_{k})/(*λ*_{k})]*du*_{k} = √(2*π**λ*_{k}). Therefore, the integral of *d*(**x**) is the product of the univariate integrals and ∫*d*(**x**)*d***x** = (2*π*)^{n⁄2}√(*λ*_{1}*λ*_{2}…*λ*_{n}). The product of the eigenvalues is the determinant of the **D** matrix defined in (11↑), and since **C = ΦDΦ**^{ − 1} and **Φ** is orthogonal, |**C**| = |**D**|. The integral is therefore

∫*d*(**x**)*d***x** = (2*π*)^{n⁄2}|**C**|^{1⁄2}

so the proposed multinormal density (3↑) has the correct normalizing constant.
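The normalizing constant can also be checked by brute-force quadrature in a low-dimensional case. The sketch below (my own, *n* = 2, with a grid wide enough that the tails are negligible) compares the grid sum of *d*(**x**) with (2*π*)^{n⁄2}|**C**|^{1⁄2}:

```python
import numpy as np

# 2-D example: integrate the unnormalized density d(x) on a grid
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])
Cinv = np.linalg.inv(C)

# d(x) depends only on z = x - m, so integrate over z directly;
# [-10, 10] is many standard deviations wide, so the tails are negligible
z = np.linspace(-10.0, 10.0, 401)
dz = z[1] - z[0]
Z1, Z2 = np.meshgrid(z, z)
quad = Cinv[0, 0]*Z1**2 + 2*Cinv[0, 1]*Z1*Z2 + Cinv[1, 1]*Z2**2
integral = np.exp(-0.5*quad).sum() * dz**2

# Expected value: (2*pi)^{n/2} |C|^{1/2} with n = 2
expected = (2*np.pi) * np.sqrt(np.linalg.det(C))
assert np.isclose(integral, expected, rtol=1e-3)
```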

An important property of multinormal variables is that linear combinations are also normal. I will prove this by deriving the moment generating function, which is also useful for other purposes. Let’s start with the moment generating function of the “standard” uninormal random variable N(0, 1) with 0 mean and variance equal to 1. By definition this is

(15)
*M*_{N}(*t*) = ⟨*e*^{tX}⟩
 = (1)/(√(2*π*))∫exp(*tx*)exp( − *x*^{2}⁄2)*dx*
 = (1)/(√(2*π*))∫exp( − (*x*^{2} − 2*tx* + *t*^{2})/(2) + (*t*^{2})/(2))*dx*
 = (1)/(√(2*π*))*e*^{t^{2}⁄2}∫exp( − ((*x* − *t*)^{2})/(2))*dx*
 = *e*^{t^{2}⁄2}

The general uninormal random variable N(*m*, *σ*^{2}) is a linear combination of a constant and the “standard” uninormal, *Y* = *m* + *σ*N. Its moment generating function is

(16)
*M*_{Y}(*t*) = ⟨*e*^{t(m + σN)}⟩ = *e*^{mt}⟨*e*^{σtN}⟩ = *e*^{mt}*M*_{N}(*σt*) = exp(*mt* + (*σ*^{2}*t*^{2})/(2))
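Eq. 16 can be sanity-checked by Monte Carlo: sample *Y* = *m* + *σ*N and compare the empirical average of exp(*tY*) with the closed form. A sketch of my own (numpy, arbitrary *m*, *σ*, *t*):

```python
import numpy as np

rng = np.random.default_rng(5)
m, sigma, t = 1.0, 0.7, 0.5

# Empirical MGF <exp(tY)> for Y = m + sigma*N, N standard normal
Y = m + sigma * rng.standard_normal(200_000)
mgf_empirical = np.exp(t * Y).mean()

# Eq. 16: M_Y(t) = exp(m t + sigma^2 t^2 / 2)
mgf_formula = np.exp(m*t + sigma**2 * t**2 / 2)

assert np.isclose(mgf_empirical, mgf_formula, rtol=0.02)
```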

Now, let’s consider the vector case. The joint moment generating function is defined to be *M*_{X}(**t**) = ⟨exp(**t**^{T}**X**)⟩, where **t** and **X** are now vectors of length *n*. If **N**(**0**, **1**) is the “standard” multinormal random vector with independent components, **0** mean, and all variances equal to 1,

(17)
*M*_{N}(**t**) = ⟨exp(*t*_{1}N_{1} + *t*_{2}N_{2} + … + *t*_{n}N_{n})⟩
 = ⟨*e*^{t_{1}N_{1}}⟩⋯⟨*e*^{t_{n}N_{n}}⟩   (components independent)
 = exp[(*t*^{2}_{1})/(2) + … + (*t*^{2}_{n})/(2)]   (using Eq. 15↑)
 = exp((1)/(2)**t**^{T}**t**)

We can use this to derive the moment generating function of the general multinormal distribution. To do this, I will derive the formula for the linear combination of a general multivariate random vector, not necessarily normal. Let **Y = A + B**^{T}X where **X** is an *n* × 1 random vector, **A** is an *m* × 1 constant vector and **B** is an *n* × *m* constant matrix. The moment generating function of **Y** is

(18)
*M*_{Y}(**t**) = ⟨exp(**t**^{T}**A** + **t**^{T}**B**^{T}**X**)⟩ = exp(**t**^{T}**A**)⟨exp(**t**^{T}**B**^{T}**X**)⟩ = exp(**t**^{T}**A**)*M*_{X}(**Bt**)

To derive the moment generating function of a general multinormal, let **Y = m + S**^{T}**N**(**0**, **1**) where **N**(**0**, **1**) is the standard multinormal (i.e. with zero mean and unit diagonal covariance). Applying the general result for a linear combination (18↑),

*M*_{Y}(**t**) = exp(**t**^{T}**m**)*M*_{N}(**St**) = exp[**t**^{T}**m** + (1)/(2)**t**^{T}**S**^{T}**St**] = exp[**t**^{T}**m** + (1)/(2)**t**^{T}**Ct**]

where the covariance **C = S**^{T}**S**.
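The multinormal moment generating function can be checked the same way by sampling **Y = m + S**^{T}**N**. This numpy sketch (my own, with an arbitrary factor **S**) compares the empirical ⟨exp(**t**^{T}**Y**)⟩ with exp(**t**^{T}**m** + ½**t**^{T}**Ct**):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
m = np.array([0.2, -0.1, 0.4])
S = 0.3 * rng.normal(size=(n, n))   # any factor S; the covariance is C = S^T S
C = S.T @ S
t = np.array([0.3, -0.2, 0.1])

# Sample Y = m + S^T N with N the standard multinormal
N = rng.standard_normal((n, 500_000))
Y = m[:, None] + S.T @ N

# Empirical joint MGF <exp(t^T Y)> vs the closed form exp(t^T m + t^T C t / 2)
mgf_empirical = np.exp(t @ Y).mean()
mgf_formula = np.exp(t @ m + 0.5 * t @ C @ t)

assert np.isclose(mgf_empirical, mgf_formula, rtol=0.02)
```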

To show that linear combinations of multinormals are also multinormal, I will use the moment generating function of a linear combination of a general multinormal random vector, **Y**_{normal} = **A** + **B**^{T}**N**

(19)
*M*_{Y_{normal}}(**t**) = exp(**t**^{T}**A**)*M*_{N}(**Bt**)   (applying (18↑))
 = exp(**t**^{T}**A**)exp(**t**^{T}**B**^{T}**m** + (1)/(2)**t**^{T}**B**^{T}**CBt**)   (applying the multinormal MGF)
 = exp(**t**^{T}(**A** + **B**^{T}**m**) + (1)/(2)**t**^{T}**B**^{T}**CBt**)

This is the moment generating function of a multinormal random vector with expected value **A** + **B**^{T}**m** and covariance **B**^{T}**CB**.
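The mean and covariance of a linear combination **Y = A + B**^{T}**X** can be confirmed by simulation. In the sketch below (my own; it samples **X** via a Cholesky factor, which is one valid choice of the factor **S** above), the sample moments match **A** + **B**^{T}**m** and **B**^{T}**CB**:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 3, 2
m = np.array([1.0, -2.0, 0.5])
F = rng.normal(size=(n, n))
C = F @ F.T + np.eye(n)              # covariance of X
A = np.array([0.5, -1.0])            # p x 1 constant vector
B = rng.normal(size=(n, p))          # n x p constant matrix

# Sample X ~ N(m, C) using a Cholesky factor, then form Y = A + B^T X
L = np.linalg.cholesky(C)
X = m[:, None] + L @ rng.standard_normal((n, 500_000))
Y = A[:, None] + B.T @ X

# Eq. 19 predicts mean A + B^T m and covariance B^T C B
assert np.allclose(Y.mean(axis=1), A + B.T @ m, atol=0.05)
assert np.allclose(np.cov(Y), B.T @ C @ B, atol=0.3)
```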

With this general result, we can also show that the components of a multinormal random vector are uninormal. In (19↑), let **A = 0** and **B** = [1 0 ⋯ 0]^{T}. Then *M*_{Y}(*t*) = exp(*m*_{1}*t* + (1)/(2)*C*_{11}*t*^{2}), which is the moment generating function of a uninormal with mean *m*_{1} and variance *C*_{11}. The same argument applies to every component.

The material discussed in this post will be used during my discussion of statistical detection and estimation theory applied to x-ray imaging. Eq. 20↓ summarizes basic matrix manipulations

(20)
(**A + B**)^{T} = **A**^{T} + **B**^{T}
(**ABC**)^{T} = **C**^{T}**B**^{T}**A**^{T}
(**ABC**)^{ − 1} = **C**^{ − 1}**B**^{ − 1}**A**^{ − 1}   (**A**, **B**, **C** square, invertible)
(**A**^{T})^{ − 1} = (**A**^{ − 1})^{T}

Eq. 21↓ shows some basic matrix calculus formulas

(21)
(∂**a**^{T}**x**)/(∂**x**) = (∂**x**^{T}**a**)/(∂**x**) = **a**   (**a**, **x** equal-length column vectors)
(∂**x**^{T}**Ax**)/(∂**x**) = (**A** + **A**^{T})**x**   (**A** a matrix, **x** a vector)
(∂**x**^{T}**Ax**)/(∂**x**) = 2**Ax**   (**A** symmetric)

Eq. 22↓ has some formulas for normal distributions

(22)
N(*m*, *σ*^{2}; *x*) = (1)/(√(2*π*)*σ*)exp[ − (1)/(2)((*x* − *m*)/(*σ*))^{2}]   (univariate normal; mean *m*, variance *σ*^{2})
*M*(*t*) = exp(*mt* + (*σ*^{2}*t*^{2})/(2))   (univariate moment generating function)
**N**(**m**, **C**; **x**) = (1)/((2*π*)^{n⁄2}|**C**|^{1⁄2})exp[ − (1⁄2)(**x − m**)^{T}**C**^{ − 1}(**x − m**)]   (multivariate normal; mean **m**, covariance **C**)
*M*(**t**) = exp[**t**^{T}**m** + (1)/(2)**t**^{T}**Ct**]   (multivariate normal moment generating function)

Eq. 23↓ lists the diagonalizing and whitening transforms

(23)
**Cφ**_{k} = *λ*_{k}**φ**_{k}   (eigenvectors of the covariance)
**CΦ = ΦD**   (**Φ** the matrix of eigenvectors, **D** the diagonal matrix of eigenvalues)
**z’ = Φ**^{T}**z**   (diagonalizing transform)
**Φ**_{w} = **ΦD**^{ − 1⁄2}, **z**_{white} = **Φ**^{T}_{w}**z**   (whitening transform)
**V = Φ**^{T}_{w}, **C**^{ − 1} = **V**^{T}**V**   (whitening factorization)

Last edited Jan 06, 2012

© 2012 by Aprend Technology and Robert E. Alvarez

Linking is allowed but reposting or mirroring is expressly forbidden.

[1] D. A. Harville: *Matrix Algebra From a Statistician's Perspective*. Springer, 2008.

[2] D. F. Morrison: *Multivariate Statistical Methods*. Thomson/Brooks/Cole, 2005.

[3] S. Ross: *A First Course in Probability*. Prentice Hall, 1997.