Multivariate normal random variables

In my last post, I showed that the multivariate normal, abbreviated multinormal, is a good model for the noise w in a linearized x-ray system model. In this post, I will discuss some of the properties of the multinormal distribution. I will show a rationale for its expression using vectors and matrices. This will lead me to discuss matrix calculus. I will describe diagonalizing and whitening transformations and derive the moment generating functions of the uninormal and multinormal to show that linear combinations of multinormals are also multinormal. This post will provide math background for my discussions of detection and maximum likelihood estimation with the linearized x-ray model.

Matrix expression for the multinormal distribution

Matrices and vectors are natural ways to arrange and keep track of multivariate data, so they are widely used in statistical signal processing. Most references, however, simply state the matrix expression for the multinormal distribution. I will give a rationale for it and use it to derive some basic properties of the distribution. For this, I start with the univariate normal (uninormal) distribution discussed in elementary probability books (see for example Sec. 5.4 of Ross[3])

(1) $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^{2}\right]$

where m is the expected value and σ² is the variance. If we have a set of n independent uninormal random variables, their joint density function is

(2) $f(x) = f_1(x_1)f_2(x_2)\cdots f_n(x_n) = \frac{1}{(2\pi)^{n/2}\sigma_1\sigma_2\cdots\sigma_n}\exp\left[-\frac{1}{2}\sum_{k=1}^{n}\left(\frac{x_k-m_k}{\sigma_k}\right)^{2}\right]$

We can summarize the {x_k} as the components of a vector x, the expected values as a vector m, and the variances as the diagonal covariance matrix

$C = \begin{bmatrix} \sigma_1^{2} & & 0 \\ & \ddots & \\ 0 & & \sigma_n^{2} \end{bmatrix}$

The matrix is diagonal since the covariance of independent random variables is zero. We notice that σ₁σ₂…σ_n is the square root of the determinant of the covariance, |C|^{1/2}, and that we can express the sum in the exponent as the quadratic form (x − m)^T C^{−1} (x − m). This leads to the final expression

(3) $f(x) = \frac{1}{(2\pi)^{n/2}|C|^{1/2}}\exp\left[-\tfrac{1}{2}(x-m)^{T}C^{-1}(x-m)\right]$
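
As a quick check of this step, here is a minimal sketch in Python with NumPy (my choice of language for illustration; it is not the code used for this post). It evaluates the matrix expression of Eq. 3 for a diagonal covariance and compares it with the product of uninormal densities from Eq. 1. The function names `multinormal_pdf` and `uninormal_pdf` and all numerical values are hypothetical.

```python
import numpy as np

def multinormal_pdf(x, m, C):
    """Evaluate the matrix expression of Eq. 3 at the point x."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(m, dtype=float)
    C = np.asarray(C, dtype=float)
    n = m.size
    z = x - m
    # Quadratic form (x - m)^T C^{-1} (x - m), computed without forming C^{-1}
    quad = z @ np.linalg.solve(C, z)
    norm = (2.0 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * quad) / norm

def uninormal_pdf(x, m, sigma):
    """Evaluate the univariate normal density of Eq. 1."""
    return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Hypothetical example values, chosen only for illustration
m = np.array([1.0, -2.0, 0.5])
sigma = np.array([0.5, 2.0, 1.5])
x = np.array([0.7, -1.0, 2.0])

C_diag = np.diag(sigma ** 2)                    # diagonal covariance of independent variables
joint = multinormal_pdf(x, m, C_diag)           # Eq. 3
product = np.prod(uninormal_pdf(x, m, sigma))   # Eq. 2 as a product of Eq. 1 terms
print(joint, product)
```

The two printed numbers should agree to rounding error, which is all that the step from Eq. 2 to Eq. 3 claims.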

Eq. 3↑ was derived for independent variables, but I will now show that it makes sense to use it for the general multinormal distribution when C is any symmetric, positive definite matrix and m is any vector of real numbers. First, since the exponential function is always positive and the leading factor in (3↑) is greater than 0, it is clear that f(x) ≥ 0. Next, I will show that ∫f(x)dx = 1 by taking a detour through matrix calculus and the principal components of C. The principal components give us an orthogonal transformation that diagonalizes the covariance matrix. We can also use them to “whiten” the covariance. Applying the transformation allows us to transform ∫f(x)dx into a product of integrals of uninormals, each of which is equal to one. Finally, I will derive the moment generating function of the multinormal and use it to show that linear combinations of multinormals are also multinormal.

Matrix calculus

I will introduce and derive some results from matrix calculus that will be used here and in other posts. I also list some matrix manipulation basics in Eq. 20↓ at the end of this post. For other results, you can refer to The Matrix Cookbook, which is available free online and has a huge list of formulas. The book by Harville[1] provides proofs for many of the formulas.

The derivative of a matrix whose elements are functions of a scalar is simply the matrix of the derivatives

(4) $\left[\frac{dA}{dt}\right]_{jk} = \frac{dA_{jk}}{dt}$

Similarly, the integral of a matrix is the matrix of the integrals of its elements

(5) $\left[\int A(t)\,dt\right]_{jk} = \int A_{jk}(t)\,dt$

The matrix formulas are applicable to a vector, which I take to be a column matrix.

Suppose we have a scalar function g(x) of a vector x. An example is the atmospheric temperature as a function of the 3D position. The derivative with respect to x is a column vector, the familiar gradient, with components

(6) $\left[\frac{\partial g}{\partial x}\right]_{k} = [\nabla g]_{k} = \frac{\partial g}{\partial x_k}$

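From this we can derive the derivative of the dot product with respect to one of the vectors. Since

$g(x) = a^{T}x = x^{T}a = a_1x_1 + \cdots + a_nx_n$

the k-th component of the gradient is

$\left[\frac{\partial g}{\partial x}\right]_{k} = a_k$

and therefore

(7) $\frac{\partial\, a^{T}x}{\partial x} = \frac{\partial\, x^{T}a}{\partial x} = a$

I will also use the derivative of a quadratic form g(x) = x^T A x. This can be derived by writing out the products; see Sec. 15.3 of Harville[1].

(8) $\frac{\partial\, x^{T}Ax}{\partial x} = (A + A^{T})\,x, \qquad \frac{\partial\, x^{T}Ax}{\partial x} = 2Ax \ \text{ if } A \text{ is symmetric}$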
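
The two gradient formulas can be spot-checked numerically. The sketch below, in Python with NumPy, compares the formulas for the gradient of a^T x and of x^T A x with central finite differences. The helper `numerical_gradient`, the random test data, and the step size are assumptions made for this illustration only.

```python
import numpy as np

def numerical_gradient(g, x, h=1e-6):
    """Central-difference approximation to the gradient of Eq. 6."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = h
        grad[k] = (g(x + step) - g(x - step)) / (2.0 * h)
    return grad

# Hypothetical random test data
rng = np.random.default_rng(0)
n = 4
a = rng.normal(size=n)
A = rng.normal(size=(n, n))      # general, not symmetric
A_sym = 0.5 * (A + A.T)          # symmetric matrix built from A
x = rng.normal(size=n)

grad_linear = numerical_gradient(lambda v: a @ v, x)              # compare with a
grad_quad = numerical_gradient(lambda v: v @ A @ v, x)            # compare with (A + A^T) x
grad_quad_sym = numerical_gradient(lambda v: v @ A_sym @ v, x)    # compare with 2 A_sym x

print(np.max(np.abs(grad_linear - a)))
print(np.max(np.abs(grad_quad - (A + A.T) @ x)))
print(np.max(np.abs(grad_quad_sym - 2.0 * A_sym @ x)))
```

All three printed differences should be at the level of finite-difference rounding error.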

Principal components of the covariance

The principal components expansion of the covariance can be derived by noticing that the exponent of (3↑) defines a family of hyper-ellipsoids centered on m. Translating the origin to the mean value so z = x − m, the exponent is proportional to

(9) z^TC^− 1z = c

which defines a hyperellipsoid for each positive real number c. The first principal component is the line passing through m with the longest distance to any point x on the surface of the ellipsoid (see the discussion in Ch. 3 of Morrison[2]). We can find this line by maximizing the squared distance z^T z subject to (9↑) by using Lagrange multipliers. We subtract the constraint multiplied by a Lagrange multiplier λ from the squared distance and set the derivative of the result equal to zero.

$g(z) = z^{T}z - \lambda\left[z^{T}C^{-1}z - c\right]$

Applying (8↑) with A equal to the identity matrix, the derivative of the first term is

$\frac{\partial\, z^{T}z}{\partial z} = 2z$

Using (8↑) again for the quadratic form in the second term and the fact that C, and therefore its inverse, is symmetric

$\frac{\partial g}{\partial z} = 2\left(z - \lambda C^{-1}z\right)$

Setting this derivative equal to zero, we see that z is an eigenvector of the inverse of the covariance matrix

(10) $C^{-1}z = \frac{1}{\lambda}z$

Since C is invertible z is also an eigenvector of the covariance matrix, Cz = λz.

The derivative will be zero for any eigenvector z_k, but only one eigenvector corresponds to the maximum distance. Premultiplying (10↑) by z^T_k,

$z_k^{T}C^{-1}z_k = \frac{1}{\lambda_k}z_k^{T}z_k = c$

we see that the squared distance is z^T_k z_k = λ_k c. Therefore we can maximize the distance from the centroid by choosing the eigenvector corresponding to the largest eigenvalue. This gives us the principal axis of the ellipsoid. The other eigenvectors correspond to the remaining principal components with smaller distances, and their axes can be listed in order of decreasing eigenvalues.
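
To make the geometry concrete, here is a small sketch in Python with NumPy for a 2 × 2 case. It scales each eigenvector so that it lies on the ellipse z^T C^{−1} z = c, confirms that the squared distance is λ_k c, and scans points around the ellipse to confirm that none is farther from the center than the principal axis. The covariance matrix and the value of c are hypothetical.

```python
import numpy as np

# Hypothetical symmetric, positive definite covariance and ellipse level
C = np.array([[4.0, 1.5],
              [1.5, 1.0]])
c = 2.0
C_inv = np.linalg.inv(C)

# eigh handles symmetric matrices; eigenvalues are returned in ascending order
lam, vecs = np.linalg.eigh(C)

# Scale each eigenvector so that z^T C^{-1} z = c, then record z^T z
sq_dist = []
for k in range(lam.size):
    v = vecs[:, k]
    z = v * np.sqrt(c / (v @ C_inv @ v))
    sq_dist.append(z @ z)
sq_dist = np.array(sq_dist)

# Scan unit directions d; the point r*d lies on the ellipse when r^2 = c / (d^T C^{-1} d)
theta = np.linspace(0.0, 2.0 * np.pi, 2001)
dirs = np.stack([np.cos(theta), np.sin(theta)])          # shape (2, 2001)
radius_sq = c / np.einsum("in,ij,jn->n", dirs, C_inv, dirs)
max_sq_on_ellipse = radius_sq.max()

print(sq_dist, lam * c)                  # squared distances along the axes vs. lambda_k * c
print(max_sq_on_ellipse, lam[-1] * c)    # largest scanned squared distance vs. lambda_max * c
```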

Diagonalizing transform

We can construct a coordinate transformation matrix Φ that diagonalizes the covariance matrix from the eigenvectors. First, we normalize the eigenvectors to have unit length by dividing by their length i.e. their norm. We can always do this because they are guaranteed to have non-zero norm. We then form a matrix with the eigenvectors as columns. To avoid confusion, I will denote the eigenvectors as φ_k

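$\Phi = \begin{bmatrix} \varphi_1 & \cdots & \varphi_n \end{bmatrix}$

With this definition

(11) $C\Phi = \Phi\begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} = \Phi D$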

We can show that if φ_j and φ_k are two eigenvectors with different eigenvalues λ_j and λ_k, then they are orthogonal. The eigen equations for these components are

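(12) $C\varphi_j = \lambda_j\varphi_j, \qquad C\varphi_k = \lambda_k\varphi_k$

Premultiplying the first equation in (12↑) by φ^T_k, and taking the transpose of the second equation and postmultiplying it by φ_j,

$\varphi_k^{T}C\varphi_j = \lambda_j\varphi_k^{T}\varphi_j, \qquad \varphi_k^{T}C^{T}\varphi_j = \lambda_k\varphi_k^{T}\varphi_j$

Since C is symmetric, the left hand sides of the equations are equal, so subtracting them

$(\lambda_j - \lambda_k)\,\varphi_k^{T}\varphi_j = 0$

The eigenvalues are different so φ^T_k φ_j = 0 and the eigenvectors corresponding to different eigenvalues are orthogonal. If an eigenvalue is repeated, we can still choose mutually orthogonal eigenvectors within its eigenspace because C is symmetric. Together with the unit-length normalization, this implies that the eigenvector matrix Φ is orthogonal so its transpose is equal to its inverse, Φ^T = Φ^{−1}.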

The covariance of the transformed coordinates z' = Φ^T z is

(13) $C' = \langle z'z'^{T}\rangle = \langle \Phi^{T}zz^{T}\Phi\rangle = \Phi^{T}C\Phi = \Phi^{T}\Phi D = D$
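
The diagonalizing transform is easy to reproduce numerically. The following sketch in Python with NumPy uses `numpy.linalg.eigh`, which returns unit-length, mutually orthogonal eigenvectors of a symmetric matrix as the columns of its second output. The covariance matrix is a hypothetical example.

```python
import numpy as np

# Hypothetical symmetric, positive definite covariance matrix
C = np.array([[4.0, 1.2, 0.5],
              [1.2, 3.0, 0.3],
              [0.5, 0.3, 2.0]])

lam, Phi = np.linalg.eigh(C)   # eigenvalues (ascending) and eigenvectors as columns
D = np.diag(lam)

# Covariance of the transformed coordinates z' = Phi^T z, as in Eq. 13
C_prime = Phi.T @ C @ Phi

print(np.allclose(C @ Phi, Phi @ D))          # Eq. 11: C Phi = Phi D
print(np.allclose(Phi.T @ Phi, np.eye(3)))    # Phi is orthogonal
print(np.round(C_prime, 12))                  # diagonal, equal to D up to rounding
```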

Whitening transform

The covariance of the Φ transformed coordinates is diagonal but the variances are different. In some cases, we want them to be equal i.e. whitened. Studying Eq. 13↑, the whitening transform is Φ_w = ΦD^{−1/2}. Since D is diagonal, D^{−1/2} = diag[1/√λ₁, …, 1/√λ_n]. The covariance of the transformed coordinates z' = Φ^T_w z is

$$\begin{aligned} C_w' = \langle z'z'^{T}\rangle &= \langle D^{-1/2}\Phi^{T}zz^{T}\Phi D^{-1/2}\rangle \\ &= D^{-1/2}\Phi^{T}C\Phi D^{-1/2} \\ &= D^{-1/2}\Phi^{T}\Phi D D^{-1/2} \\ &= D^{-1/2}D^{1/2} = I \end{aligned}$$

We can use the whitening transform to define a useful factorization of the inverse covariance. Defining V = Φ^T_w

$$\begin{aligned} V^{T}V = \Phi_w\Phi_w^{T} &= \Phi D^{-1/2}D^{-1/2}\Phi^{T} \\ &= \Phi(\Phi D)^{-1} && \text{the transpose of an orthogonal matrix is also its inverse} \\ &= \Phi(C\Phi)^{-1} \\ &= \Phi\Phi^{-1}C^{-1} = C^{-1} \end{aligned}$$
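
Here is a minimal sketch of the whitening transform in Python with NumPy. It checks the two algebraic results of this section exactly and then whitens random samples to show that their sample covariance is close to the identity. The covariance matrix, the random seed, and the number of samples are hypothetical choices.

```python
import numpy as np

# Hypothetical symmetric, positive definite covariance matrix
C = np.array([[4.0, 1.2, 0.5],
              [1.2, 3.0, 0.3],
              [0.5, 0.3, 2.0]])

lam, Phi = np.linalg.eigh(C)
Phi_w = Phi @ np.diag(lam ** -0.5)   # whitening transform Phi D^{-1/2}
V = Phi_w.T                          # whitening factor

C_w = Phi_w.T @ C @ Phi_w            # covariance of the whitened coordinates
C_inv_from_V = V.T @ V               # should reproduce C^{-1}

# Whiten zero-mean samples drawn with covariance C; each row of z is one sample
rng = np.random.default_rng(1)
z = rng.multivariate_normal(np.zeros(3), C, size=100_000)
z_white = z @ Phi_w                  # row form of z' = Phi_w^T z
sample_cov = np.cov(z_white, rowvar=False)

print(np.round(C_w, 12))
print(np.allclose(C_inv_from_V, np.linalg.inv(C)))
print(np.round(sample_cov, 2))
```

The sample covariance only approximates the identity because it is estimated from a finite number of samples.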

Proof that ∫f(x)dx = 1 for multinormal

In this section, I will show that the density function for the general multinormal (Eq. 3↑) has the correct normalizing constant. For this, I need to compute the integral. Defining an unnormalized density function as

$d(x) = \exp\left[-\tfrac{1}{2}(x-m)^{T}C^{-1}(x-m)\right]$

we need to compute ∫d(x)dx. Translating the origin to m so z = x − m, and transforming by Φ so z = Φu, the exponent in the multinormal distribution is

(14) $(x-m)^{T}C^{-1}(x-m) = z^{T}C^{-1}z = (\Phi u)^{T}C^{-1}(\Phi u)$

Postmultiplying (11↑) by Φ^{−1}, C = ΦDΦ^{−1}. Therefore C^{−1} = ΦD^{−1}Φ^{−1}. Substituting in (14↑),

$z^{T}C^{-1}z = u^{T}\Phi^{T}\Phi D^{-1}\Phi^{-1}\Phi u = u^{T}D^{-1}u$

The last step follows because the transpose of an orthogonal matrix is its inverse, Φ^T = Φ^− 1. Since D is diagonal

$D^{-1} = \begin{bmatrix} 1/\lambda_1 & & 0 \\ & \ddots & \\ 0 & & 1/\lambda_n \end{bmatrix}$

The exponent is then

$u^{T}D^{-1}u = \sum_{k=1}^{n}\frac{u_k^{2}}{\lambda_k}$

and

$\int d(x)\,dx = \int \exp\left[-\frac{1}{2}\sum_{k=1}^{n}\frac{u_k^{2}}{\lambda_k}\right]|\Phi|\,du$

For an orthogonal transformation the determinant is ±1, so the Jacobian factor, which is the absolute value of the determinant, is |Φ| = 1 and we can write the transformed integrand as

$\exp\left[-\frac{1}{2}\sum_{k=1}^{n}\frac{u_k^{2}}{\lambda_k}\right] = g_1(u_1)g_2(u_2)\cdots g_n(u_n)$

where

$g_k(u_k) = \exp\left[-\frac{1}{2}\frac{u_k^{2}}{\lambda_k}\right]$

We know from the uninormal distribution that ∫exp[−u_k²/(2λ_k)]du_k = √(2πλ_k), where each λ_k is positive because C is positive definite. Therefore, the integral of d(x) is the product of the univariate integrals and ∫d(x)dx = (2π)^{n/2}√(λ₁λ₂…λ_n). The product of the eigenvalues is the determinant of the D matrix defined in (11↑), and since C = ΦDΦ^{−1} and Φ is orthogonal, |C| = |D|. The integral is therefore

$\int d(x)\,dx = (2\pi)^{n/2}|C|^{1/2}$

so the proposed multinormal density (3↑) has the correct normalizing constant.
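
The normalizing constant can also be checked by brute force in two dimensions. The sketch below, in Python with NumPy, sums the unnormalized density d(x) on a fine grid and compares the result with (2π)^{n/2}|C|^{1/2} and with the product of the univariate integrals √(2πλ_k). The mean, covariance, grid extent, and grid spacing are hypothetical choices, and the plain Riemann sum is simply a convenient approximation for a smooth, rapidly decaying integrand.

```python
import numpy as np

# Hypothetical 2D example
m = np.array([0.5, -1.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])
C_inv = np.linalg.inv(C)

# Grid wide enough that the tails outside it are negligible
x1 = np.linspace(m[0] - 10.0, m[0] + 10.0, 801)
x2 = np.linspace(m[1] - 10.0, m[1] + 10.0, 801)
h1 = x1[1] - x1[0]
h2 = x2[1] - x2[0]
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
Z = np.stack([X1 - m[0], X2 - m[1]], axis=-1)    # shape (801, 801, 2), holds x - m

# Quadratic form z^T C^{-1} z at every grid point
quad = np.einsum("...i,ij,...j->...", Z, C_inv, Z)

# Riemann-sum approximation of the integral of d(x)
integral = np.sum(np.exp(-0.5 * quad)) * h1 * h2

expected = (2.0 * np.pi) ** (2 / 2) * np.sqrt(np.linalg.det(C))
lam = np.linalg.eigvalsh(C)
eig_product = np.prod(np.sqrt(2.0 * np.pi * lam))   # product of univariate integrals

print(integral, expected, eig_product)
```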

Linear combinations of multinormal random variables--moment generating functions

An important property of multinormal variables is that linear combinations are also normal. I will prove this by deriving the moment generating function, which is also useful for other purposes. Let’s start with the moment generating function of the “standard” uninormal random variable N(0, 1) with 0 mean and variance equal to 1. By definition this is

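(15) $$\begin{aligned} M_N(t) = \langle e^{tX}\rangle &= \frac{1}{\sqrt{2\pi}}\int \exp(tx)\exp\left(-\frac{x^{2}}{2}\right)dx \\ &= \frac{1}{\sqrt{2\pi}}\int \exp\left(-\frac{x^{2}-2tx+t^{2}}{2}+\frac{t^{2}}{2}\right)dx \\ &= \frac{1}{\sqrt{2\pi}}e^{t^{2}/2}\int \exp\left(-\frac{(x-t)^{2}}{2}\right)dx \\ &= e^{t^{2}/2} \end{aligned}$$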

The general uninormal random variable N(m, σ²) is a linear combination of a constant and the “standard” uninormal Y = m + σN. Its moment generating function is

(16) $M_Y(t) = \langle e^{t(m+\sigma N)}\rangle = e^{mt}\langle e^{\sigma tN}\rangle = e^{mt}M_N(\sigma t) = \exp\left(mt + \frac{\sigma^{2}t^{2}}{2}\right)$
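
As a sanity check on the uninormal result, the following sketch in Python with NumPy evaluates ⟨e^{tY}⟩ by numerical integration and compares it with the closed form exp(mt + σ²t²/2). The function names `mgf_numeric` and `mgf_closed`, the integration range, and the parameter values are assumptions made for this illustration.

```python
import numpy as np

def mgf_numeric(t, m, sigma, half_width=15.0, num=20001):
    """Riemann-sum approximation of <exp(tY)> for Y ~ N(m, sigma^2)."""
    x = np.linspace(m - half_width * sigma, m + half_width * sigma, num)
    dx = x[1] - x[0]
    pdf = np.exp(-0.5 * ((x - m) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.sum(np.exp(t * x) * pdf) * dx

def mgf_closed(t, m, sigma):
    """Closed form exp(m t + sigma^2 t^2 / 2)."""
    return np.exp(m * t + 0.5 * sigma ** 2 * t ** 2)

# Hypothetical parameter values
m, sigma = 1.0, 0.8
t_values = [-1.0, 0.0, 0.7]
numeric = [mgf_numeric(t, m, sigma) for t in t_values]
closed = [mgf_closed(t, m, sigma) for t in t_values]
for t, a, b in zip(t_values, numeric, closed):
    print(t, a, b)
```

Setting m = 0 and σ = 1 in the same functions gives the check for the standard uninormal.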

Now, let’s consider the vector case. The joint moment generating function is defined to be M_X(t) = ⟨exp(t^TX)⟩, where t and X are now vectors of length n. If N(0, 1) is the “standard” multinormal random vector with independent components, 0 mean and all variances equal to 1

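(17) $$\begin{aligned} M_N(t) &= \langle \exp(t_1N_1 + t_2N_2 + \cdots + t_nN_n)\rangle \\ &= \langle e^{t_1N_1}\rangle\cdots\langle e^{t_nN_n}\rangle && \text{components independent} \\ &= \exp\left[\frac{t_1^{2}}{2} + \cdots + \frac{t_n^{2}}{2}\right] && \text{use Eq. 15} \\ &= \exp\left(\tfrac{1}{2}t^{T}t\right) \end{aligned}$$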

We can use this to derive the moment generating function of the general multinormal distribution. To do this, I will derive the formula for the linear combination of a general multivariate random vector, not necessarily normal. Let Y = A + B^TX where X is an n × 1 random vector, A is an m × 1 constant vector and B is an n × m constant matrix. The moment generating function of Y is

(18) M_Y(t) = ⟨exp(t^TA + t^TB^TX)⟩ = exp(t^TA)⟨exp(t^TB^TX)⟩ = exp(t^TA)M_X(Bt)

To derive the moment generating function of a general multinormal let Y = m + S^TN(0, 1) where N(0, 1) is the standard multinormal (i.e. with zero mean and unit diagonal covariance). Applying the general result for a linear combination (18↑)

$M_Y(t) = \exp(t^{T}m)\,M_N(St) = \exp\left[t^{T}m + \tfrac{1}{2}t^{T}S^{T}St\right] = \exp\left[t^{T}m + \tfrac{1}{2}t^{T}Ct\right]$

where the covariance C = S^TS.
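
The construction Y = m + S^T N also gives a practical way to generate multinormal samples. The sketch below, in Python with NumPy, takes S from the Cholesky factorization, which is one possible choice of S satisfying C = S^T S (the post itself does not specify how S is obtained). The mean, covariance, seed, and sample count are hypothetical.

```python
import numpy as np

# Hypothetical mean and symmetric, positive definite covariance
m = np.array([1.0, -0.5, 2.0])
C = np.array([[4.0, 1.2, 0.5],
              [1.2, 3.0, 0.3],
              [0.5, 0.3, 2.0]])

L = np.linalg.cholesky(C)   # lower triangular with C = L L^T
S = L.T                     # then S^T S = L L^T = C

# Standard multinormal samples; each row is one draw of N(0, I)
rng = np.random.default_rng(2)
N_std = rng.standard_normal(size=(100_000, 3))

# Row form of Y = m + S^T N
Y = m + N_std @ S

sample_mean = Y.mean(axis=0)
sample_cov = np.cov(Y, rowvar=False)
print(np.allclose(S.T @ S, C))
print(np.round(sample_mean, 2))
print(np.round(sample_cov, 2))
```

The sample mean and covariance should be close to m and C, with differences that shrink as the number of samples grows.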

To show that linear combinations of multinormals are also multinormal, I will use the moment generating function of a linear combination of a general multinormal random vector N with mean m and covariance C, Y_normal = A + B^T N

(19) $$\begin{aligned} M_{Y_{normal}}(t) &= \exp(t^{T}A)\,M_N(Bt) && \text{apply (18)} \\ &= \exp(t^{T}A)\exp\left(t^{T}B^{T}m + \tfrac{1}{2}t^{T}B^{T}CBt\right) && \text{apply the general multinormal moment generating function} \\ &= \exp\left(t^{T}(A + B^{T}m) + \tfrac{1}{2}t^{T}B^{T}CBt\right) \end{aligned}$$

This is the moment generating function of a multinormal random vector with expected value A + B^T m and covariance B^T C B.
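
A sampling experiment can illustrate the mean and covariance of a linear combination. In the sketch below, in Python with NumPy, X is multinormal with mean m and covariance C, and Y = A + B^T X with B an n × m matrix as defined earlier; with those shapes the mean of Y is A + B^T m and its covariance is B^T C B. All numerical values, the seed, and the sample count are hypothetical, and the comparison only covers the first two moments, not the full distribution.

```python
import numpy as np

# Hypothetical multinormal X with n = 3 components
m = np.array([1.0, -0.5, 2.0])
C = np.array([[4.0, 1.2, 0.5],
              [1.2, 3.0, 0.3],
              [0.5, 0.3, 2.0]])

# Hypothetical linear combination Y = A + B^T X with 2 components
A = np.array([0.5, -1.0])
B = np.array([[1.0, 0.0],
              [2.0, -1.0],
              [0.5, 1.0]])      # shape (3, 2)

rng = np.random.default_rng(3)
X = rng.multivariate_normal(m, C, size=200_000)   # each row is one sample of X
Y = A + X @ B                                     # row form of Y = A + B^T X

mean_pred = A + B.T @ m       # predicted expected value
cov_pred = B.T @ C @ B        # predicted covariance

sample_mean = Y.mean(axis=0)
sample_cov = np.cov(Y, rowvar=False)
print(np.round(sample_mean, 2), mean_pred)
print(np.round(sample_cov, 2))
print(cov_pred)
```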

With this general result, we can also show that the components of a multinormal random vector are uninormal. In (19↑), let A = 0 and B = [ 1 0 ⋯ 0 ]^T, so that t is a scalar; then M_Y(t) = exp(m₁t + ½C₁₁t²), which is the moment generating function of a uninormal with mean m₁ and variance C₁₁. This result applies to all the components.

Summary

The material discussed in this post will be used during my discussion of statistical detection and estimation theory applied to x-ray imaging. Eq. 20↓ summarizes basic matrix manipulations

(20) $$\begin{aligned} (A + B)^{T} &= A^{T} + B^{T} \\ (ABC)^{T} &= C^{T}B^{T}A^{T} \\ (ABC)^{-1} &= C^{-1}B^{-1}A^{-1} && A, B, C \text{ square, invertible} \\ (A^{T})^{-1} &= (A^{-1})^{T} \end{aligned}$$

Eq. 21↓ shows some basic matrix calculus formulas

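(21) $$\begin{aligned} \frac{\partial\, a^{T}x}{\partial x} &= \frac{\partial\, x^{T}a}{\partial x} = a && a, x \text{ equal-length column vectors} \\ \frac{\partial\, x^{T}Ax}{\partial x} &= (A + A^{T})\,x && A \text{ matrix, } x \text{ vector} \\ \frac{\partial\, x^{T}Ax}{\partial x} &= 2Ax && A \text{ symmetric} \end{aligned}$$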

Eq. 22↓ has some formulas for normal distributions

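(22) $$\begin{aligned} N(m, \sigma^{2}; x) &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^{2}\right] && \text{univariate normal, mean} = m, \text{ variance} = \sigma^{2} \\ M(t) &= \exp\left(mt + \frac{\sigma^{2}t^{2}}{2}\right) && \text{univariate moment generating function} \\ N(m, C; x) &= \frac{1}{(2\pi)^{n/2}|C|^{1/2}}\exp\left[-\tfrac{1}{2}(x-m)^{T}C^{-1}(x-m)\right] && \text{multivariate normal, mean} = m, \text{ covariance} = C \\ M(t) &= \exp\left[t^{T}m + \tfrac{1}{2}t^{T}Ct\right] && \text{multivariate normal moment generating function} \end{aligned}$$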

Eq. 23↓ lists the diagonalizing and whitening transforms

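(23) $$\begin{aligned} C\varphi_k &= \lambda_k\varphi_k && \text{eigenvectors of covariance} \\ C\Phi &= \Phi D && \Phi \text{ matrix of eigenvectors, } D \text{ diagonal matrix of eigenvalues} \\ z' &= \Phi^{T}z && \text{diagonalizing transform} \\ \Phi_w &= \Phi D^{-1/2}, \quad z_{white} = \Phi_w^{T}z && \text{whitening transform} \\ V &= \Phi_w^{T} && \text{whitening factor} \\ C^{-1} &= V^{T}V && \text{whitening factorization} \end{aligned}$$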

Last edited Jan 06, 2012

Linking is allowed but reposting or mirroring is expressly forbidden.

References

[1] David A. Harville: Matrix algebra from a statistician's perspective. Springer, 2008.

[2] Donald F. Morrison: Multivariate statistical methods. Thomson/Brooks/Cole, 2005.

[3] Sheldon M. Ross: A First Course in Probability. Prentice Hall College Div, 1997.