Review of Statistics and Information Theory Basics
Lecture 2, COMP / ELEC / STAT 502   © E. Merényi
January 12, 2012



• Outline

Probability basics
  Random variables, CDF, pdf, moments
  Covariance, joint CDF and pdf

Information theory basics
  Information, entropy, differential entropy
  Maximum Entropy principle
  Further entropy definitions
  Mutual information and relation to entropy

Additional properties of random variables


• CDF of random variable

Let \xi, \eta be continuous scalar random variables (r.v.-s).

(Cumulative) probability distribution function (CDF):

    F_\xi(x) = \Pr(\xi \le x) = \int_{-\infty}^{x} f_\xi(u)\,du
    \quad\text{and}\quad
    F_\eta(y) = \Pr(\eta \le y) = \int_{-\infty}^{y} f_\eta(v)\,dv        (2.1)

where

    F_\xi'(x) = f_\xi(x) \quad\text{and}\quad F_\eta'(y) = f_\eta(y)        (2.2)

exist almost everywhere for continuous r.v.-s by definition. F_\xi(x) is monotonically non-decreasing, right-continuous, and F_\xi(-\infty) = 0, F_\xi(\infty) = 1.
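A minimal numerical sketch of (2.1) and (2.2), assuming SciPy is available and using the standard normal purely as an example distribution: a centered difference of the CDF recovers the pdf, and the pdf integrates to 1 (cf. (2.3) on the next slide).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Illustrate F'(x) = f(x) for the standard normal distribution (eq. 2.2).
x = 1.3
h = 1e-5
deriv = (norm.cdf(x + h / 2) - norm.cdf(x - h / 2)) / h   # centered difference of the CDF
print(deriv, norm.pdf(x))                                 # the two values agree closely

# The pdf integrates to 1 over the whole real line.
total, _ = quad(norm.pdf, -np.inf, np.inf)
print(total)                                              # ~1.0
```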

• pdf of random variable

Probability density function (pdf): f_\xi(x) and f_\eta(y) are the pdfs of \xi and \eta, respectively. Then

    0 \le f_\xi(x) \quad\text{and}\quad 0 \le f_\eta(y),
    \qquad \int_{-\infty}^{\infty} f_\xi(x)\,dx = 1 \quad\text{and}\quad \int_{-\infty}^{\infty} f_\eta(y)\,dy = 1        (2.3)

and f_\xi(x) and f_\eta(y) are continuous almost everywhere.

Interpretation of f_\xi(x) as probability:

    f_\xi(x) = \lim_{h \to 0} \frac{F_\xi(x + h/2) - F_\xi(x - h/2)}{h}
             = \lim_{h \to 0} \frac{P[\xi \le x + h/2] - P[\xi \le x - h/2]}{h}
             = \lim_{h \to 0} \frac{P[x - h/2 < \xi \le x + h/2]}{h}        (2.4)

The larger f_\xi(x), the larger the probability that \xi falls close to x.

• Moments of random variable

The kth moment M_\xi^{(k)} of \xi is defined as the expected value of the kth power of the variable:

    M_\xi^{(k)} = E[\xi^k] = \int_{-\infty}^{\infty} x^k\,dF_\xi(x) = \int_{-\infty}^{\infty} x^k f_\xi(x)\,dx        (2.5)

The kth central moment m_\xi^{(k)} is defined similarly, except that the variable is shifted to obtain zero mean (so that m_\xi^{(1)} = 0):

    m_\xi^{(k)} = \int_{-\infty}^{\infty} (x - E[\xi])^k\,dF_\xi(x) = \int_{-\infty}^{\infty} (x - E[\xi])^k f_\xi(x)\,dx        (2.6)
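A sketch of (2.5) and (2.6), assuming NumPy/SciPy and using a Gaussian only as a convenient example: the moments are evaluated by numerical integration of x^k f(x).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 2.0, 3.0
f = lambda x: norm.pdf(x, loc=mu, scale=sigma)

def moment(k):
    """k-th (raw) moment, eq. (2.5): integral of x^k f(x) dx."""
    return quad(lambda x: x**k * f(x), -np.inf, np.inf)[0]

def central_moment(k):
    """k-th central moment, eq. (2.6): integral of (x - E[xi])^k f(x) dx."""
    m1 = moment(1)
    return quad(lambda x: (x - m1)**k * f(x), -np.inf, np.inf)[0]

print(moment(1))          # ~2.0  (the mean mu)
print(central_moment(2))  # ~9.0  (the variance sigma^2)
```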

• Moments of random variable

Quantities of interest related to the first four moments, as estimated from a sample \xi_1, \ldots, \xi_n, are

    Mean = M_\xi^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \xi_i        (2.7)

    Variance = m_\xi^{(2)} = \frac{1}{n-1} \sum_{i=1}^{n} (\xi_i - M_\xi^{(1)})^2        (2.8)

    Standard Deviation (STD) = \sqrt{Variance}        (2.9)
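The sample estimates (2.7)-(2.9) in NumPy, as a sketch; ddof=1 gives the 1/(n-1) denominator of (2.8).

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=3.0, size=10_000)   # n observations of xi

mean = sample.mean()                    # eq. (2.7)
variance = sample.var(ddof=1)           # eq. (2.8), with the 1/(n-1) factor
std = np.sqrt(variance)                 # eq. (2.9)
print(mean, variance, std)              # close to 2, 9, 3
```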

• Moments of random variable

    Skewness = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\xi_i - M_\xi^{(1)}}{STD} \right)^3 = \frac{m_\xi^{(3)}}{(m_\xi^{(2)})^{3/2}}        (2.10)

    (Normalized) Kurtosis = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\xi_i - M_\xi^{(1)}}{STD} \right)^4 - 3 = \frac{m_\xi^{(4)}}{(m_\xi^{(2)})^2} - 3        (2.11)
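A sketch of (2.10) and (2.11), assuming NumPy and using population-style 1/n averages as in the formulas. The uniform and Laplacian samples anticipate the sub/supergaussian classification on the next slide.

```python
import numpy as np

def skewness(x):
    """Eq. (2.10): average of ((x_i - mean)/std)^3."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

def normalized_kurtosis(x):
    """Eq. (2.11): average of ((x_i - mean)/std)^4, minus 3."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
print(normalized_kurtosis(rng.uniform(size=100_000)))    # < 0: subgaussian
print(normalized_kurtosis(rng.normal(size=100_000)))     # ~ 0: gaussian
print(normalized_kurtosis(rng.laplace(size=100_000)))    # > 0 (about +3): supergaussian
```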

• Moments of random variable

Kurtosis is often used as a quantitative measure of non-Gaussianity. If kurt(\xi) denotes the kurtosis of \xi,

    kurt(\xi) < 0  for a subgaussian or platykurtic distribution
    kurt(\xi) = 0  for a gaussian or mesokurtic distribution
    kurt(\xi) > 0  for a supergaussian or leptokurtic distribution        (2.12)

• Covariance of random variables

The covariance of random variables \xi and \eta is defined as

    c_{\xi\eta} = E[(\xi - M_\xi^{(1)})(\eta - M_\eta^{(1)})]        (2.13)

or in general, for the elements x_i of random vector x,

    c_{ij} = E[(x_i - M_{x_i}^{(1)})(x_j - M_{x_j}^{(1)})]        (2.14)

    C_x = E[(x - M_x^{(1)})(x - M_x^{(1)})^T]        (2.15)

Similarly, the cross-covariance of random vectors x, y is

    C_{xy} = E[(x - M_x^{(1)})(y - M_y^{(1)})^T]        (2.16)
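A sketch of (2.15) and (2.16) with NumPy; the data, the mixing matrix A, and the dimensions are illustrative. Covariance and cross-covariance appear as sample averages of outer products of mean-removed vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=(n, 3))                                  # n observations of a 3-dim random vector x
y = x @ rng.normal(size=(3, 2)) + rng.normal(size=(n, 2))    # a 2-dim vector y correlated with x

xc = x - x.mean(axis=0)                      # remove M^(1)_x
yc = y - y.mean(axis=0)                      # remove M^(1)_y

Cx  = xc.T @ xc / n                          # eq. (2.15): E[(x - M_x)(x - M_x)^T]
Cxy = xc.T @ yc / n                          # eq. (2.16): cross-covariance
print(np.allclose(Cx, np.cov(x.T, bias=True)))   # matches NumPy's covariance estimate
print(Cxy.shape)                             # (3, 2)
```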

• Joint CDF and pdf of random variables

Joint CDF and pdf of (continuous) random variables \xi and \eta:

    F_{\xi,\eta}(x, y) = \Pr(\xi \le x, \eta \le y)        (2.17)

    F_{\xi,\eta}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{\xi,\eta}(u, v)\,dv\,du        (2.18)

where

    f_{\xi,\eta}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_{\xi,\eta}(x, y)        (2.19)

exists almost everywhere, and it is the joint pdf of \xi and \eta. Properties of F_{\xi,\eta}(x, y) and f_{\xi,\eta}(x, y) are analogous to eqs. (2.1)-(2.3).

• Marginal distributions

    F_\xi(x) = F_{\xi,\eta}(x, \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{\xi,\eta}(u, v)\,dv\,du        (2.20)

    F_\eta(y) = F_{\xi,\eta}(\infty, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{y} f_{\xi,\eta}(u, v)\,dv\,du        (2.21)

• Statistical independence

A pair of joint random variables \xi and \eta are statistically independent iff their joint CDF factorizes into the product of the marginal CDFs:

    F_{\xi,\eta}(x, y) = F_\xi(x)\,F_\eta(y)        (2.22)

For two jointly continuous random variables \xi and \eta, equivalently,

    f_{\xi,\eta}(x, y) = f_\xi(x)\,f_\eta(y)        (2.23)

since

    F_{\xi,\eta}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{\xi,\eta}(u, v)\,dv\,du
                       = \int_{-\infty}^{x} f_\xi(u)\,du \int_{-\infty}^{y} f_\eta(v)\,dv
                       = F_\xi(x)\,F_\eta(y)        (2.24)

(Prove the other direction at home.)

• Conditional probability

    F_{\xi|\eta}(x|y) = F_{\xi,\eta}(x, y) / F_\eta(y)        (2.25)
    f_{\xi|\eta}(x|y) = f_{\xi,\eta}(x, y) / f_\eta(y)        (2.26)

Obviously, F_{\xi|\eta}(x|y) = F_\xi(x) iff \xi and \eta are independent. Equivalently, f_{\xi|\eta}(x|y) = f_\xi(x) iff \xi and \eta are independent.

    f_\xi(x) = \int_{-\infty}^{\infty} f_{\xi|\eta}(x|y)\,f_\eta(y)\,dy

    E[\xi|\eta] = \int_{-\infty}^{\infty} x\,f_{\xi|\eta}(x|y)\,dx


• Information

Let X be a discrete random variable, taking its values from H = {x_1, x_2, \ldots, x_n}. Let p_i be the probability of x_i, with p_i \ge 0 and \sum_{i=1}^{n} p_i = 1. The amount of information in observing the event X = x_i was defined by Shannon as

    I(x_i) = -\log_a(p_i)        (2.27)

If a = 2 then I is given in bits, if a = e then I is given in nats, and if a = 10 then I is given in digits.        (2.28)

• Shannon entropy

The expected value of I(x_i) over the entire set of events H is the entropy of X:

    H(X) = -\sum_{i=1}^{n} p_i \log(p_i)        (2.29)

(Note that X in H(X) is not an argument of a function but the label of the random variable.) The convention 0 \log 0 = 0 is used. The entropy is the average amount of information gained by observing one event from all possible events for the random variable X (the average information per message), the average length of the shortest description of a random variable, the amount of uncertainty, and a measure of the randomness of X.
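A minimal sketch of (2.27)-(2.29) in Python, assuming NumPy; base-2 logarithms are used, so the results are in bits, and the 0 log 0 = 0 convention is handled explicitly. The last two lines anticipate properties (2.30) and (2.31) on the next slide.

```python
import numpy as np

def information(p, base=2.0):
    """Eq. (2.27): amount of information in an event of probability p."""
    return -np.log(p) / np.log(base)

def entropy(p, base=2.0):
    """Eq. (2.29): H(X) = -sum p_i log p_i, with the 0*log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(information(0.5))            # 1 bit
print(entropy([0.5, 0.5]))         # 1 bit: a fair coin
print(entropy([1.0, 0.0]))         # 0: no uncertainty
print(entropy([0.25] * 4))         # 2 bits = log2(4), the maximum for |H| = 4
```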

• Shannon entropy

Some basic properties of H(X):

    H(X) = 0 iff p_i = 1 for some i and p_k = 0 for all k \ne i.        (2.30)

(If an event occurs with probability 1 then there is no uncertainty.)

    0 \le H(X) \le \log|H|, with equality iff p_i = 1/|H| = 1/n for all i.        (2.31)

(The uniform distribution carries maximum uncertainty.)

• Differential entropy

The differential entropy of a continuous random variable \xi: a quantity analogous to the entropy of a discrete random variable can be defined for continuous random variables.

Reminder: the random variable \xi is continuous if its distribution function F_\xi(x) is continuous.

Definition: Let f_\xi(x) = F_\xi'(x) be the pdf of \xi. The set S : \{x \mid f_\xi(x) > 0\} is called the support set of \xi. The differential entropy of \xi is defined as

    h(\xi) = -\int_{S} f_\xi(x) \log f_\xi(x)\,dx = E[-\log f_\xi(x)]        (2.32)

where S is the support set of \xi. The lower case letter h is used to distinguish differential entropy from the discrete (or absolute) entropy H.
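A sketch of (2.32), assuming SciPy and natural logarithms (nats): numerical integration of -f log f for a Gaussian, compared with the known closed form (1/2) log(2 pi e sigma^2) and with SciPy's own entropy() method.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

sigma = 2.0
f = lambda x: norm.pdf(x, scale=sigma)

# Eq. (2.32): h(xi) = -integral_S f(x) log f(x) dx, in nats.
# Integrating over [-10 sigma, 10 sigma] covers essentially all of the support.
h_numeric, _ = quad(lambda x: -f(x) * np.log(f(x)), -10 * sigma, 10 * sigma)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # known closed form for a Gaussian
print(h_numeric, h_closed, norm(scale=sigma).entropy()) # all three agree
```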

• Discrete vs differential entropy

The differential entropy of a continuous variable \xi can be derived as a limiting case of the discrete entropy of the quantized version of \xi. (See, for example, Cover and Thomas, p. 228.) If \Delta = 2^{-n} is the bin size of the quantization, and the quantized version of \xi is denoted by \xi^\Delta, then

    H(\xi^\Delta) + \log\Delta \to h(\xi) \quad \text{as} \quad n \to \infty \ (\Delta \to 0)        (2.33)

That is, the absolute entropy of an n-bit quantization of a continuous random variable is approximately h(\xi) - \log 2^{-n} = h(\xi) + n (with base-2 logarithms). It follows that the absolute entropy of a continuous (non-quantized) variable is \infty.
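A sketch of (2.33), assuming NumPy: a standard normal sample is histogram-quantized with bin size Delta = 2^{-n}, and the discrete entropy plus log Delta is seen to approach h(xi) = (1/2) log(2 pi e). Sample sizes and bin choices are illustrative; the agreement is approximate because the entropy is estimated from a finite sample.

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.normal(size=1_000_000)                     # standard normal: h(xi) = 0.5*log(2*pi*e)
h_true = 0.5 * np.log(2 * np.pi * np.e)

for n in (1, 2, 4, 6):
    delta = 2.0 ** (-n)                             # bin size of the quantization
    bins = np.arange(xi.min(), xi.max() + delta, delta)
    counts, _ = np.histogram(xi, bins=bins)
    p = counts[counts > 0] / counts.sum()
    H_discrete = -np.sum(p * np.log(p))             # absolute entropy of quantized xi, in nats
    print(n, H_discrete + np.log(delta), h_true)    # left side approaches h(xi) = 1.4189...
```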

• The Maximum Entropy principle

Consider the following: suppose we have a stochastic system with a set of known states but unknown probabilities, and we also know some constraints on the distribution of the states. How can we choose the probability model that is optimum in some sense among the possibly infinite number of models that may satisfy the constraints?

The Maximum Entropy principle (due to Jaynes, 1957) states that when an inference is made on the basis of incomplete information, it should be drawn from the probability distribution that maximizes the entropy, subject to constraints on the distribution.

• Finding the Maximum Entropy distribution

Solving this problem, in general, involves the maximization of the differential entropy

    h(\xi) = -\int_{-\infty}^{\infty} f_\xi(x) \log f_\xi(x)\,dx        (2.34)

over all pdfs f_\xi(x) of the r.v. \xi, subject to the following constraints:

    f_\xi(x) \ge 0        (2.35)

    \int_{-\infty}^{\infty} f_\xi(x)\,dx = 1        (2.36)

    \int_{-\infty}^{\infty} f_\xi(x)\,g_i(x)\,dx = \alpha_i \quad \text{for} \quad i = 2, \ldots, m        (2.37)

where f_\xi(x) = 0 holds outside of the support set of \xi, and the \alpha_i represent the prior knowledge about \xi.

• Finding the Maximum Entropy distribution

Using the Lagrange multipliers \lambda_i (i = 1, 2, \ldots, m), formulate the objective function

    J(f) = \int_{-\infty}^{\infty} \left[ -f_\xi(x)\log f_\xi(x) + \lambda_1 f_\xi(x) + \sum_{i=2}^{m} \lambda_i g_i(x) f_\xi(x) \right] dx        (2.38)

Differentiating the integrand with respect to f_\xi and setting the result to 0 produces

    -1 - \log f_\xi(x) + \lambda_1 + \sum_{i=2}^{m} \lambda_i g_i(x) = 0        (2.39)

• Finding the Maximum Entropy distribution

Solving this for f_\xi(x) yields the maximum entropy distribution for the given constraints:

    f_\xi(x) = \exp\left( -1 + \lambda_1 + \sum_{i=2}^{m} \lambda_i g_i(x) \right)        (2.40)

Example: If the prior knowledge consists of the mean \mu and the variance \sigma^2 of \xi, then

    \int_{-\infty}^{\infty} f_\xi(x)\,(x - \mu)^2\,dx = \sigma^2        (2.41)

and from the constraints it follows that g_2(x) = (x - \mu)^2 and \alpha_2 = \sigma^2. The Maximum Entropy distribution for this case is the Gaussian distribution, as shown on page 28 of Colin Fyfe 2 (CF2), or in Haykin, Chapter 10, page 491.
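A short worked sketch of the algebra the slide points to in CF2/Haykin, showing how (2.40) specializes to the Gaussian under the mean and variance constraints:

```latex
% Constraints: \int f_\xi(x)\,dx = 1 and \int f_\xi(x)(x-\mu)^2\,dx = \sigma^2, so g_2(x) = (x-\mu)^2.
% Substituting into (2.40):
f_\xi(x) = \exp\!\left(-1 + \lambda_1 + \lambda_2 (x-\mu)^2\right),
% an exponential of a quadratic in x. Normalizability requires \lambda_2 < 0; choosing
\lambda_2 = -\tfrac{1}{2\sigma^2}, \qquad \exp(-1+\lambda_1) = \tfrac{1}{\sqrt{2\pi}\,\sigma}
% satisfies both constraints and gives
f_\xi(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2},
% i.e. the Gaussian density of eq. (2.59).
```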

• Joint differential entropy

Joint differential entropy:

    h(\xi, \eta) = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{\xi,\eta}(u, v) \log f_{\xi,\eta}(u, v)\,du\,dv        (2.42)

For more details on joint differential entropy, see Cover and Thomas (C-T), p. 230.

• Conditional differential entropy

In CF 2/2 you have seen the formula for the conditional entropy of a discrete random variable X given Y. It represents the amount of uncertainty remaining about the system input X after the system output Y has been observed. Analogous to that is the conditional differential entropy for continuous random variables:

    h(\xi|\eta) = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{\xi,\eta}(x, y) \log f_{\xi|\eta}(x|y)\,dx\,dy        (2.43)

Since f_{\xi|\eta}(x|y) = f_{\xi,\eta}(x, y)/f_\eta(y),

    h(\xi|\eta) = h(\xi, \eta) - h(\eta).        (2.44)

• Relative entropy (Kullback-Leibler distance)

The relative entropy is an entropy-based measure of the difference between two pdfs f_\xi(x) and g_\xi(x), defined as

    D(f_\xi \,\|\, g_\xi) = \int_{-\infty}^{\infty} f_\xi(x) \log \frac{f_\xi(x)}{g_\xi(x)}\,dx        (2.45)

Note that D(f_\xi \| g_\xi) is finite only if the support set of f_\xi is contained in the support set of g_\xi. The K-L distance is a measure of the penalty (error) for assuming g_\xi(x) when in reality the density function of \xi is f_\xi(x). The convention 0 \log(0/0) = 0 is used.

It follows (and can be proven) that D(f_\xi \| g_\xi) \ge 0, with equality iff f_\xi(x) = g_\xi(x) almost everywhere. The relative entropy or K-L distance is also called the K-L divergence.
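A sketch of (2.45), assuming SciPy and natural logarithms: the K-L distance between two Gaussian densities by numerical integration, checked against the known closed form for this pair (the closed form is used only as a check, it is not part of the slides).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# D(f || g) for two Gaussian densities, eq. (2.45), in nats.
m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0
f = lambda x: norm.pdf(x, m1, s1)
g = lambda x: norm.pdf(x, m2, s2)

D_numeric, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -12, 12)

# Known closed form for two Gaussians, used here only as a check.
D_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(D_numeric, D_closed)   # both ~0.443, and D >= 0 as stated
```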

• Mutual information

Mutual information is the measure of the amount of information one random variable contains about another:

    I(\xi; \eta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{\xi,\eta}(x, y) \log \frac{f_{\xi,\eta}(x, y)}{f_\xi(x)\,f_\eta(y)}\,dx\,dy        (2.46)

The concept of mutual information is of profound importance in the design of self-organizing systems, where the objective is to develop an algorithm that can learn an input-output relationship of interest on the basis of the input patterns alone. The Maximum mutual information principle of Linsker (1988) states that the synaptic connections of a multilayered ANN develop so as to maximize the amount of information that is preserved when signals are transformed at each processing stage of the network, subject to certain constraints. This is based on earlier ideas that came from the recognition that encoding of data from a scene for the purpose of redundancy reduction is related to the identification of specific features in the scene (by the biological perceptual machinery).

• Properties of Mutual Information

    I(\xi; \eta) = \int\!\!\int f_{\xi,\eta}(x, y) \log \frac{f_{\xi,\eta}(x, y)}{f_\xi(x)\,f_\eta(y)}\,dx\,dy

               = \int\!\!\int f_{\xi,\eta}(x, y) \log \frac{f_{\xi,\eta}(x, y)}{f_\eta(y)}\,dx\,dy - \int\!\!\int f_{\xi,\eta}(x, y) \log f_\xi(x)\,dx\,dy

               = -h(\xi|\eta) - \int \left( \int f_{\xi,\eta}(x, y)\,dy \right) \log f_\xi(x)\,dx

               = -h(\xi|\eta) - \int f_\xi(x) \log f_\xi(x)\,dx

               = h(\xi) - h(\xi|\eta)
               = h(\eta) - h(\eta|\xi)
               = h(\xi) + h(\eta) - h(\xi, \eta)        (2.47)

• Properties of Mutual Information

    I(\xi; \eta) = I(\eta; \xi)        (2.48)

    I(\xi; \eta) \ge 0        (2.49)

    I(\xi; \eta) = D(f_{\xi,\eta}(x, y) \,\|\, f_\xi(x)\,f_\eta(y))        (2.50)

The last equation expresses mutual information as a special case of the K-L distance, with f_{\xi,\eta}(x, y) as density #1 and f_\xi(x)\,f_\eta(y) as density #2. The larger the K-L distance, the less independent \xi and \eta are, because f_{\xi,\eta}(x, y) and f_\xi(x)\,f_\eta(y) are farther apart, so I is large. See also the diagram in CF 2/2 or in Haykin p. 494 (Fig. 10.1) showing insightful relations among mutual information, entropy, and conditional and joint entropy.
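A sketch verifying the identities (2.47)-(2.50) on a small discrete joint distribution (sums replace the integrals; base-2 logarithms, so bits). The particular 2x2 table is illustrative.

```python
import numpy as np

# A 2x2 joint pmf P(xi, eta); rows index xi, columns index eta.
P = np.array([[0.30, 0.20],
              [0.10, 0.40]])
px = P.sum(axis=1)            # marginal of xi
py = P.sum(axis=0)            # marginal of eta

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Mutual information directly from eq. (2.46), equivalently the K-L distance of eq. (2.50).
I_direct = np.sum(P * np.log2(P / np.outer(px, py)))

# The same quantity from the entropy decomposition of eq. (2.47).
I_entropy = H(px) + H(py) - H(P.ravel())

print(I_direct, I_entropy)    # equal (~0.125 bits), and >= 0 as in eq. (2.49)
```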


• Sufficient condition of independence

Statistically independent random variables satisfy the following:

    E[g(\xi)\,h(\eta)] = E[g(\xi)]\,E[h(\eta)]        (2.51)

where g(\xi) and h(\eta) are any absolutely integrable functions of \xi and \eta, respectively. This follows from

    E[g(\xi)\,h(\eta)] = \int\!\!\int g(u)\,h(v)\,f_{\xi,\eta}(u, v)\,du\,dv
                       = \int g(u)\,f_\xi(u)\,du \int h(v)\,f_\eta(v)\,dv
                       = E[g(\xi)]\,E[h(\eta)]        (2.52)

• Uncorrelatedness vs independence

Statistical independence is a stronger property than uncorrelatedness. \xi and \eta are uncorrelated iff

    E[\xi\eta] = E[\xi]\,E[\eta]        (2.53)

In general, two random vectors x, y are uncorrelated iff

    C_{xy} = E[(x - M_x^{(1)})(y - M_y^{(1)})^T] = 0        (2.54)

or equivalently,

    R_{xy} = E[x y^T] = E[x]\,E[y^T] = M_x^{(1)} (M_y^{(1)})^T        (2.55)

where R_{xy} is the correlation matrix of x, y. For zero mean, the correlation matrix is the same as the covariance matrix.

• Uncorrelatedness, contd., and whiteness

As a special case, the components of a random vector x are mutually uncorrelated iff

    C_x = E[(x - M_x^{(1)})(x - M_x^{(1)})^T] = D = Diag(c_{11}, c_{22}, \ldots, c_{nn}) = Diag(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)        (2.56)

Random vectors with

    M_x^{(1)} = 0 \quad \text{and} \quad C_x = R_x = I        (2.57)

i.e., having zero mean and unit covariance (and hence unit correlation) matrix, are called white. Whiteness is a stronger property than uncorrelatedness but weaker than statistical independence.
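A sketch of whitening, i.e. of producing vectors that satisfy (2.57), via the eigendecomposition of the covariance matrix; this is one common choice of whitening transform, not the only one, and the data below are illustrative. After the transform the sample covariance is approximately the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A = rng.normal(size=(3, 3))
x = rng.normal(size=(n, 3)) @ A.T          # correlated, non-white data (rows = observations)

xc = x - x.mean(axis=0)                    # zero mean: M^(1)_x = 0
Cx = xc.T @ xc / n                         # covariance matrix
evals, E = np.linalg.eigh(Cx)              # Cx = E diag(evals) E^T
V = E @ np.diag(evals ** -0.5) @ E.T       # whitening matrix V = Cx^(-1/2)
z = xc @ V.T                               # whitened vectors

print(np.round(z.T @ z / n, 3))            # ~ identity: Cz = Rz = I, as in eq. (2.57)
```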

• Density functions of frequently used distributions

Uniform:

    f_\xi(x) = \frac{1}{b - a} \ \text{for} \ x \in [a, b], \quad 0 \ \text{elsewhere}        (2.58)

Gaussian:

    f_\xi(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}        (2.59)

where \mu and \sigma^2 denote the mean and variance, respectively.

Laplacian:

    f_\xi(x) = \frac{\lambda}{2}\, e^{-\lambda |x|}, \quad \lambda > 0        (2.60)

• Exponential power family of density functions

These density functions are special cases of the generalized gaussian or exponential power family, which has the general formula (for zero mean):

    f_\xi(x) = C \exp\left( -\frac{|x|^\alpha}{E[|x|^\alpha]} \right)        (2.61)

As is easy to verify, \alpha = 2 yields the Gaussian, \alpha = 1 the Laplacian, and \alpha \to \infty the uniform density function. Values of \alpha < 2 give rise to supergaussian densities, \alpha > 2 to subgaussian ones.

(Figure: the Gaussian (2.59) and Laplacian (2.60) densities plotted for comparison.)
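A sketch linking (2.61) to the kurtosis classification of (2.12), using SciPy's generalized normal family (scipy.stats.gennorm); its shape parameter plays the role of alpha here, and its density exp(-|x|^beta) differs from (2.61) only in scaling, which does not affect the normalized kurtosis.

```python
from scipy.stats import gennorm

# Excess (normalized) kurtosis of the exponential power family as a function of the exponent.
for alpha in (0.5, 1.0, 2.0, 4.0, 8.0):
    k = gennorm.stats(alpha, moments='k')
    print(alpha, float(k))
# alpha < 2: positive kurtosis (supergaussian); alpha = 2: 0 (Gaussian);
# alpha > 2: negative kurtosis (subgaussian), approaching the uniform's -1.2.
```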

• Uncorrelated but not independent example

Example of uncorrelated but not independent random variables:

Assume that \xi and \eta are discrete valued and their joint probabilities are 0.25 for each of (0, 1), (0, -1), (1, 0) and (-1, 0). Then it is easy to calculate that \xi and \eta are uncorrelated. However, they are not independent, since

    E[\xi^2 \eta^2] = 0 \ne 0.25 = E[\xi^2]\,E[\eta^2]        (2.62)
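A numerical sketch of this example, assuming NumPy: sample (xi, eta) uniformly from the four points and check both claims of (2.62).

```python
import numpy as np

rng = np.random.default_rng(0)
points = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)], dtype=float)
xy = points[rng.integers(0, 4, size=200_000)]        # each point with probability 0.25
xi, eta = xy[:, 0], xy[:, 1]

print(np.mean(xi * eta))                             # exactly 0, and the means are 0: uncorrelated
print(np.mean(xi**2 * eta**2))                       # exactly 0
print(np.mean(xi**2) * np.mean(eta**2))              # ~0.25: so not independent, as in eq. (2.62)
```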
