If you have trouble viewing this, try the pdf of this post. You can download the code used to produce the figures in this post.

# Multivariate normal random variables

In my last post, I showed that the multivariate normal, abbreviated multinormal, is a good model for the noise w in a linearized x-ray system model. In this post, I will discuss some of the properties of the multinormal distribution. I will show a rationale for its expression using vectors and matrices. This will lead me to discuss matrix calculus. I will describe diagonalizing and whitening transformations and derive the moment generating functions of the uninormal and multinormal to show that linear combinations of multinormals are also multinormal. This post will provide math background for my discussions of detection and maximum likelihood estimation with the linearized x-ray model.

## Matrix expression for the multinormal distribution

Matrices and vectors are natural ways to arrange and keep track of multivariate data so they are widely used in statistical signal processing but most references simply state the matrix expression for the multinormal distribution. I will give a rationale for it and use it to derive some basic properties of the distribution. For this, I start with the univariate normal (uninormal) distribution discussed in elementary probability books (see for example Sec. 5.4 of Ross[3])
(1) f(x) = (1)/((2π)σ)exp − (1)/(2)(x − m)/(σ)2.
where m is the expected value and σ2 is the variance. Suppose we have a set of n independent uninormal random variables, their joint density function is
(2) f(x)  =  f1(x1)f2(x2)…fn(xn)  =  (1)/((2π)n2σ1σ2σn)exp − (1)/(2)nk = 1(xk − mk)/(σk)2
We can summarize the {xk} as the components of a vector x, the expected values as a vector m, and the variances as the diagonal covariance matrix
C =  σ21 0 0 σ2n .
The matrix is diagonal since the covariance of independent random variables is zero. We notice that σ1σ2σn is the square root of the determinant of the covariance |C|12, and that we can express the exponent as the quadratic form (x − m)TC − 1(x − m). This leads to the final expression
(3) f(x) = (1)/((2π)n2| C|12)exp[ − 12( x − m)TC − 1(x − m)].
Eq. 3↑ was derived for independent variables but I will now show that it makes sense to use it for the general multinormal distribution when C is any symmetric, positive definite matrix and m is any vector of real numbers. First, since the exponential function is always non-negative, and the leading factor in (3↑) is greater 0,  it is clear that φ(x) ≥ 0. Next, I will show that φ(x)dx = 1 by taking a detour through matrix calculus and the principal components of C. The principal components give us an orthogonal transformation that diagonalizes the matrix. We can also use them to “whiten” the covariance. Applying the transformation allows us to transform φ(x)dx into a product of integrals of uninormals, each of which is equal to one. Finally, I will derive the moment generating function of the multinormal and use it to show that linear combinations of multinormals are also normal.

## Matrix calculus

I will introduce and derive some results from matrix calculus that will be used here and in other posts. I also list some matrix manipulation basics in Eq. 20↓ at the end of this post. For other results, you can refer to The Matrix Cookbook, which is available free online, and has a huge list of formulas. The book by Harville[1] proves provides proofs for many of the formulas.
The derivative of a matrix whose elements are functions of a scalar is simply the matrix of the derivatives
(4) (dA)/(dt)jk = (dAjk)/(dt)
Similarly the integral of a matrix is the integrals of its components
(5) [A(t)dt]jk = Ajk(t)dt
The matrix formulas are applicable to a vector, which I take to be a column matrix.
Suppose we have a scalar function of a matrix g(x). An example is the atmospheric temperature as a function of the 3D position. The derivative is a matrix (actually a vector) and is the familiar gradient with components
(6) (g)/(x)k = [g]k = (g)/(xk)
From this we can derive the derivative of the dot product with respect to one of the vectors. Since
g(x) = aTx = xTa = a1x1 + … + anxn
(g)/(x)k = ak
so
(7) (aTx)/(x) = (xTa)/(x) = a
I will also use the derivative of a quadratic form g(x) = xTAx. This can be derived by writing out the products, see Sec. 15.3 of Harville[1].
(8) (xTAx)/(x)  =  (A + AT)x  =  2 Ax A symmetric

## Principal components of the covariance

The principal components expansion of the covariance can be derived by noticing that the exponent of (3↑) defines a family of hyper-ellipsoids centered on m. Translating the origin to the mean value so z = x − m, the exponent is proportional to
(9) zTC − 1z = c
which defines a hyperellipsoid for each positive real number c. The first principal component is the line passing through m with the longest distance to any point x on the surface of the ellipsoid (see the discussion in Ch. 3 of Morrison[2]). We can find this line by maximizing the distance (squared) zTz subject to (9↑) by using Lagrange multipliers. The maximum is found by setting the derivative of the distance after adding a term equal to zero.
g(z) = zTz − λ[zTC − 1z − c]
Applying (7↑), the derivative of the first term is
(zTz)/(z) = 2 z
Using (8↑) for the derivative of a quadratic form and the fact that C and therefore its inverse is symmetric
(g)/(z) = 2( z − λC − 1z)
Setting this equation equal to zero, we see that z is an eigenvector of the inverse of covariance matrix.
(10) C − 1z = (1)/(λ)z
Since C is invertible z is also an eigenvector of the covariance matrix, Cz = λz.
The derivative will be zero for any eigenvector zk but only one eigenvector corresponds to the maximum difference. Premultiplying (10↑) by zTk,
zTkC − 1zk = (1)/(λk)zTkzk = c
we see that the distance is zTkzk = λkc. Therefore we can maximize the distance from the centroid by choosing the eigenvector corresponding to the largest eigenvalue. This gives us the principal axis of the ellipsoid. The other eigenvectors correspond to the remaining principal components with smaller distances. The other axes can be listed in order of decreasing eigenvalues.

## Diagonalizing transform

We can construct a coordinate transformation matrix Φ that diagonalizes the covariance matrix from the eigenvectors. First, we normalize the eigenvectors to have unit length by dividing by their length i.e. their norm. We can always do this because they are guaranteed to have non-zero norm. We then form a matrix with the eigenvectors as columns. To avoid confusion, I will denote the eigenvectors as φk
Φ = [ φ1 φn ]
With this definition
(11)  = Φ λ1 0 0 λn  = ΦD
We can show that if φj and φk are two eigenvectors with different eigenvalues λj and λk, then they are orthogonal. The eigen equations for these components are
(12) j = λjφj k = λkφk
Premultiplying the top equation in (12↑) by φTk and taking the transpose of the bottom equation and postmultiplying by φj
φTkj = λjφTkφj φTkCTφj = λkφTkφj
Since C is symmetric, the left hand sides of the equations are equal, so subtracting them
(λk − λk)φTkφj = 0
The eigenvalues are different so φTkφj = 0 and the eigenvectors corresponding to different eigenvalues are orthogonal. This implies that the eigenvector matrix Φ is orthogonal so its transpose is equal to its inverse, ΦT = Φ − 1.
The covariance of the transformed coordinates z’ = ΦTz is
(13) C’  =  z’z’T  =  ΦTzzTΦ  =  ΦT  =  ΦTΦD  =  D

## Whitening transform

The covariance of the Φ transformed coordinates is diagonal but the variances are different. In some cases, we want them to be equal i.e. whitened. Studying Eq. 13↑, the whitening transform is Φw = ΦD − 12. Since D is diagonal, D − 12 = diagonal[1(λ1), …, 1(λn)] . The covariance of the transformed coordinates z’ = ΦTwz is
Cw  =  z’z’T  =  D − 12ΦTzzTΦD − 12  =  D − 12ΦTCΦD − 12  =  D − 12ΦTΦDD − 12  =  D − 12D12  =  I
We can use the whitening transform to define a useful factorization of the covariance. Defining V = ΦTw
VTV  =  ΦWΦTw  =  ΦD − 12D − 12ΦT  =  Φ(ΦD) − 1 the transpose of an orthogonal is also the inverse  =  Φ() − 1  =  ΦΦ − 1C − 1  =  C − 1

## Proof that ∫f(x)dx = 1 for multinormal

In this section, I will show that the density function for the general multinormal Eq. 3↑) has the correct normalizing constant. For this, I need to compute the integral. Defining an unnormalized density function as
d(x) = exp[ − 12( x − m)TC − 1(x − m)]
we need to compute d(x)dx. Translating the origin to m so z = x − m,  and transforming by Φ so z = Φu, the exponent in the multinormal distribution is
(14) (x − m)TC − 1(x − m) = zTC − 1z = (Φu)TC − 1(Φu)
Premultiplying (11↑) by Φ − 1, C = ΦDΦ − 1. Therefore C − 1 = ΦD − 1Φ − 1. Substituting in (14↑),
zTC − 1z = uTΦTΦD − 1Φ − 1Φu = uTD − 1u
The last step follows because the transpose of an orthogonal matrix is its inverse, ΦT = Φ − 1. Since D is diagonal
D − 1 =  1λ1 0 0 1 ⁄ λn
The exponent is then
uTD − 1u = nk = 1(u2k)/(λk)
and
d(x)dx = exp − (1)/(2)nk = 1(u2k)/(λk)|Φ|du
For an orthogonal transformation the determinant is one,|Φ| = 1 and we can write the transformed integrand as
exp − (1)/(2)nk = 1(u2k)/(λk) = g1(u1)g2(u2)…gn(un)
where
gk(uk) = exp − (1)/(2)(u2k)/(λk)
We know from the uninormal distribution that exp − (1)/(2)(u2k)/(λk)duk = (2πλk). Therefore, the integral of d(x) is the product of the univariate integrals and d(x)dx = (2π)n2(λ1λ2λn). The product of the eigenvalues is the determinant of the D matrix defined in (11↑) and since C = ΦDΦ − 1and |Φ| is orthogonal |C| = |D|. The integral is therefore
d(x)dx = (2π)n2| C|12
so the proposed multinormal density (3↑) has the correct normalizing constant.

## Linear combinations of multinormal random variables--moment generating functions

An important property of multinormal variables is that linear combinations are also normal. I will prove this by deriving the moment generating function, which is also useful for other purposes. Let’s start with the moment generating function of the “standard” uninormal random variable N(0, 1) with 0 mean and variance equal to 1. By definition this is
(15) MN(t)  =  etX  =  (1)/((2π))exp(tx)exp( − x22)dx  =  (1)/((2π))exp − (x2 − 2tx + t2)/(2) + (t2)/(2)dx  =  (1)/((2π))et22exp − ((x − t)2)/(2)dx  =  et22
The general uninormal random variable N(m, σ2) is a linear combination of a constant and the “standard” uninormal Y = m + σN. Its moment generating function is
(16) MY(t)  =  et(m + σN)  =  emteσtN  =  emtMN(σt)  =  expmt + (σ2tt)/(2)
Now, let’s consider the vector case. The joint moment generating function is defined to be MX(t) = exp(tTX), where t and X are now vectors of length n. If N(0, 1) is the “standard” multinormal random vector with independent components, 0 mean and all variances equal to 1
(17) MN(t)  =  exp(t1 N1 + t2 N2 + …tnNn)  =  et1 N1etnNn components independent  =  exp(t21)/(2) + …(t2n)/(2)  use Eq. \refeq:std − uninormal − MGF  =  exp(1)/(2) tTt
We can use this to derive the moment generating function of the general multinormal distribution. To do this, I will derive the formula for the linear combination of a general multivariate random vector, not necessarily normal. Let Y = A + BTX where X is an n × 1 random vector, A is an m × 1 constant vector and B is an n × m constant matrix. The moment generating function of Y is
(18) MY(t)  =  exp(tTA + tTBTX)  =  exp(tTA)exp(tTBTX)  =  exp(tTA)MX(Bt)
To derive the moment generating function of a general multinormal let Y = m + STN(0, 1) where N(0, 1) is the standard multinormal (i.e. with zero mean and unit diagonal covariance). Applying the general result for a linear combination (18↑)
MY(t)  =  exp(tTm)MN(St)  =  exptTm + (1)/(2) tTSTSt  =  exptTm + (1)/(2) tTCt
where the covariance C = STS.
To show that linear combinations of multinormals are also multinormal, I will use the moment generating function of a linear combination of general multinormal random variable, Ynormal = A + BTN
(19) MYnormal(t)  =  exp(tTA)MN(Bt) apply (\refeq:MGF − lin − comb)  =  exp(tTA)exp(tTBm + (1)/(2)tTBTCBt) apply (\refeq:multionorm − MGF)  =  exp(tT(A + Bm) + (1)/(2)tTBCBTt)
This is the moment generating function of a multinormal random vector with expected value A + Bm and covariance BCBT.
With this general results, we can also show that the components of a multinormal random vector are uninormal. In (19↑), let A = 0 and B = [ 1 0 0 ]T, then MY(t) = exp(m1 + (1)/(2)C11t21), which is uninormal with mean m1 and variance C11. This result applies to all the components.

## Summary

The material discussed in this post will be used during my discussion of statistical detection and estimation theory applied to x-ray imaging. Eq. 20↓ summarizes basic matrix manipulations
(20) (A + B)T  =  AT + BT (ABC)T  =  CTBTAT (ABC) − 1  =   C − 1B − 1A − 1 A, B, C square, invertible (AT) − 1  =  ( A − 1)T
Eq. 21↓ shows some basic matrix calculus formulas
(21) (aTx)/(x) = (xTa)/(x) = a a,b equal length column vectors (xTAx)/(x) = (A + AT)x A matrix, x vector (xTAx)/(x) = 2Ax A symmetric
Eq. 22↓ has some formulas for normal distributions
(22) N(0, 1;x) = (1)/((2π)σ)exp − (1)/(2)(x − m)/(σ)2 mean = m,  variance = σ2  univariate normal expmt + (v2tt)/(2)  univariate moment generating function N(m, C;x) = (1)/((2π)n2| C|12)exp[ − 12( x − m)TC − 1(x − m)] mean = m,  covariance = C multivariate exptTm + (1)/(2) tTCt multivariate normal moment generating function
Eq. 23↓ lists the diagonalizing and whitening transforms
(23) k = λkφk eigenvectors of covariance CΦ = ΦD matrix of eigenvectors D diagonal matrix of eigenvalues z’ = ΦTz diagonalizing transform Φw = ΦD − 12,  zwhite = ΦTwz whitening transform V = ΦTw whitening factor C − 1 = VTV whitening factorization
Last edited Jan 06, 2012
© 2012 by Aprend Technology and Robert E. Alvarez
Linking is allowed but reposting or mirroring is expressly forbidden.

# References

[1] David A. Harville: Matrix algebra from a statistician's perspective. Springer, 2008.

[2] Donald F. Morrison: Multivariate statistical methods. Thomson/Brooks/Cole, 2005.

[3] Sheldon M. Ross: A First Course in Probability. Prentice Hall College Div, 1997.