
2015 Information Theory Lecture Notes

Brian M. Kurkoski

December 11, 2015


Contents

1 Measuring Information: Entropy
    1.1 Entropy and the Horse Race
    1.2 Entropy
        1.2.1 Binary Entropy Function
        1.2.2 Joint Entropy
    1.3 Conditional Entropy
    1.4 Properties of Conditional Entropy
    1.5 Chain Rules for Entropy
    1.6 More Numerical Examples
    1.7 Source Code
        1.7.1 Matlab Basics
        1.7.2 Binary Entropy Function
        1.7.3 Compute Entropy

2 Tour of Probability Theory
    2.1 Probability of Events
    2.2 Single, Joint and Conditional Random Variables
        2.2.1 Jointly Distributed Random Variables
        2.2.2 Conditional Probability Distributions
        2.2.3 Bayes Rule, Total Probability, All-Knowing Joint Distribution
        2.2.4 Example: Discrete Memoryless Channel
    2.3 Independence, Expected Value, Mean and Variance
        2.3.1 Independence and Conditional Independence
        2.3.2 Expected Value
        2.3.3 Union Bound
    2.4 Random Vectors
        2.4.1 Random Vectors
        2.4.2 Binary Random Vector Example
    2.5 Law of Large Numbers
        2.5.1 Markov Inequality
        2.5.2 Chebyshev Inequality
        2.5.3 Random Vectors: How Close Is the Sample Mean?
        2.5.4 Law of Large Numbers
    2.6 Source Code
        2.6.1 Basic Probability Operations
        2.6.2 Random Variable Generation
        2.6.3 Sample Mean Experiments

3 Mutual Information and KL Divergence
    3.1 Mutual Information
        3.1.1 Properties of Mutual Information
        3.1.2 Conditional Mutual Information and Chain Rules
        3.1.3 Numerical Example
    3.2 Kullback-Leibler Distance (Relative Entropy)
        3.2.1 Consequences of Non-Negativity of KL Divergence
    3.3 Data Processing Inequality and Markov Chains
        3.3.1 Markov Chains
        3.3.2 Data Processing Inequality
    3.4 Descriptions Using Expectation

4 Source Coding for Independent Vector Sources
    4.1 Sample Entropy and Typical Sets
        4.1.1 Sample Entropy
        4.1.2 Typical Sets and Typical Sequences


Introduction

These are the lecture notes for I232 Information Theory, taught at the Japan Advanced Institute of Science and Technology. These lecture notes use Elements of Information Theory by Cover and Thomas as a starting point. I strongly recommend you use this book during the course.

The goal of these lecture notes is to provide enough material that students can succeed in the course. This course has an online component that includes videos, quizzes and homework. The course website is http://www.jaist.ac.jp/celeste/moodle/, where you can find "I232 Information Theory."

These lecture notes are updated as the course progresses. Generally speaking, one chapter will be added and put online before each lecture. If you find errors, please post a note in the online Discussion Forum.


Page 7: Information Theory Lecture Notes

Chapter 1

Measuring Information: Entropy

There are three important ways to measure information: entropy, mutual information and the Kullback-Leibler divergence (KL divergence). The phrase "measuring information" is meant in an informal sense, since entropy, mutual information and KL divergence are not measures or metrics in the mathematical sense. This chapter introduces entropy.

1.1 Entropy and the Horse Race

Imagine a race of eight horses. We want to send a message to another person indicating which horse won the race, and we'll consider the average number of bits needed for the message. If we do not know the probability of winning, we could assign a 3-bit message to each horse, and transmit that. Assign the 3-bit message 000 to horse A, the message 001 to horse B, etc. The average message length is 3 bits.

But, we can assign shorter messages to the horses more likely to win, longer messages to horses less likely to win. One such assignment, using a variable-length message, is shown in Table 1.1. Horse A is most likely to win, and has the shortest 1-bit message 0; horse B is next most likely and has a 2-bit message 10, etc. Using the messages in the table, the average message length is computed as:

1 · (1/2) + 2 · (1/4) + 3 · (1/8) + 4 · (1/16) + 4 · 6 · (1/64) = 2 bits    (1.1)

That is, it is possible to describe the winner with 2 bits, rather than 3 bits. This is connected with the idea of entropy, which is the minimum number of bits to describe a random variable.


horse name     probability of win   3-bit message   variable-length message
Adios          1/2                  000             0
Big Brown      1/4                  001             10
Cigar          1/8                  010             110
Deep Impact    1/16                 011             1110
Easy Goer      1/64                 100             111100
Funny Cide     1/64                 101             111101
Go Man Go      1/64                 110             111110
Hyperion       1/64                 111             111111

Table 1.1: A list of horses and their probability of a win. A 3-bit message requires 3 bits on average to transmit the identity of the winning horse. A variable-length message requires only 2 bits on average to transmit.
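The average message length and the entropy of the winning-horse distribution can be checked numerically. The following is a minimal Matlab sketch; the probabilities and message lengths are taken directly from Table 1.1:

px  = [1/2 1/4 1/8 1/16 1/64 1/64 1/64 1/64];   % probability of a win, from Table 1.1
len = [1 2 3 4 6 6 6 6];                        % variable-length message lengths
avgLen = len * px(:)                            % average message length, 2 bits
H = -sum( px .* log2(px) )                      % entropy of the distribution, also 2 bits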

1.2 Entropy

Entropy is a measure of the uncertainty in the random variable.

Definition. The entropy H(X) of a discrete random variable X with probability distribution p_X(x) is defined by:

H(X) = -\sum_{x ∈ X} p_X(x) \log p_X(x)    (1.2)

where we take 0 \log 0 = 0. Unless otherwise noted, the logarithm is base 2, which corresponds to measuring entropy in bits. (If the base is e, the measure is not bits, but a curious unit called "nats.")

Example — entropy of a fair coin flip. Let X = {0, 1} and p_X(0) = p_X(1) = 1/2; then H(X) is the entropy of a fair coin flip:

H(X) = -(1/2) \log (1/2) - (1/2) \log (1/2) = 1 bit    (1.3)

Example. Find the entropy H(X) of the result X of rolling a six-sided die, where X = {1, 2, 3, 4, 5, 6} and p_X(x) = 1/6:

H(X) = -6 · ( (1/6) \log (1/6) ) = \log 6 ≈ 2.585 bits    (1.4)

Example. In the example of the horse race, the random variable X indicates the winner of the race and has distribution:

p_X(x) = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64].    (1.5)

The entropy is:

H(X) = 2 bits.    (1.6)


In this case, the average code length was 2 bits, the same as H(X), but this is not always the case. It will be shown in following lectures that the entropy is the lower bound on the average code length.

Example. Find the entropy of the following random variable X with X = {a, b, c, d}:

p_X(x) = \begin{cases} 1/2 & x = a \\ 1/4 & x = b \\ 1/8 & x = c \\ 1/8 & x = d \end{cases}    (1.7)

Then,

H(X) = -( (1/2) \log (1/2) + (1/4) \log (1/4) + (1/8) \log (1/8) + (1/8) \log (1/8) ) = 7/4 bits.    (1.8)

Entropy satisfies the following inequality.

0 ≤ H(X) ≤ \log |X|,    (1.9)

The lower bound and upper bound are both achieved by specific distributions.

Theorem — Uniform distribution maximizes entropy. The uniform distribution is p_X(x) = 1/|X| for x ∈ X, and so H(X) = \sum_{x ∈ X} (1/|X|) \log |X| = \log |X|. This will be proved later.

The entropy of a constant is 0. Since a constant is not random, it has no uncertainty and the entropy is 0.

Entropy is non-negative. The lower bound H(X) ≥ 0 can be shown by noting that p_X(x) ≤ 1 means \log (1/p_X(x)) ≥ 0.

1.2.1 Binary Entropy Function

An important random variable is the binary random variable. Let X with X = {0, 1} have distribution:

p_X(x) = \begin{cases} p & \text{if } x = 0 \\ 1 - p & \text{if } x = 1 \end{cases}    (1.10)

Then, H(X) = -p \log p - (1-p) \log(1-p). Clearly H(X) is a function of p, and this is called the binary entropy function:

h(p) = -p \log p - (1-p) \log(1-p)    (1.11)

The binary entropy function is plotted in Fig. 1.1. Binary entropy is 0 when p = 0 or p = 1. In these cases, X is a constant, and the entropy of a constant is 0. The binary entropy function has the maximum value of 1 when p = 1/2. Like flipping a coin with heads and tails equally likely, the uncertainty is maximum when p = 1/2.


Figure 1.1: Binary entropy function.

1.2.2 Joint Entropy

Definition. The joint entropy H(X,Y) of discrete random variables X and Y jointly distributed as p_{X,Y}(x,y) is:

H(X,Y) = -\sum_{x ∈ X} \sum_{y ∈ Y} p_{X,Y}(x,y) \log p_{X,Y}(x,y)    (1.12)

Note that H(X,Y) does not take negative values.

Example. Let X = {apple, banana} and Y = {yellow, green}. Find the joint entropy H(X,Y) when X and Y are jointly distributed according to:

p_{XY}(x,y)    y = yellow   y = green
x = apple      1/8          1/4
x = banana     1/2          1/8

The joint entropy is:

H(X,Y) = -( (1/2) \log (1/2) + (1/4) \log (1/4) + (1/8) \log (1/8) + (1/8) \log (1/8) )    (1.13)
       = 7/4 bits.    (1.14)

Note the similarity with (1.8).
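The joint entropy can also be computed directly from the joint distribution matrix. A minimal Matlab sketch; pxy below is the joint distribution of the apple/banana example (there are no zero entries here, so no special handling of 0·log2(0) is needed):

pxy = [1/8 1/4 ; 1/2 1/8];     % rows: x = apple, banana; columns: y = yellow, green
p = pxy(:);                    % stack the matrix entries into a vector
H = -sum( p .* log2(p) )       % joint entropy H(X,Y) = 7/4 = 1.75 bits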

1.3 Conditional Entropy

Conditional entropy H(Y|X) is the uncertainty of Y given that X is known. If X and Y are correlated, then knowledge of X can reduce the uncertainty of Y. Conditional entropy is one of the most important concepts in information theory. There are two types of conditional entropy, H(Y|X = x) and H(Y|X).


Definition. The conditional entropy H(Y|X = x) is given by:

H(Y|X = x) = -\sum_{y ∈ Y} p_{Y|X}(y|x) \log p_{Y|X}(y|x)    (1.15)

Definition. The conditional entropy H(Y|X) of discrete random variables X and Y jointly distributed as p_{X,Y}(x,y) is:

H(Y|X) = \sum_{x ∈ X} p_X(x) H(Y|X = x)    (1.16)
       = -\sum_{x ∈ X} p_X(x) \sum_{y ∈ Y} p_{Y|X}(y|x) \log p_{Y|X}(y|x)    (1.17)
       = -\sum_{x ∈ X} \sum_{y ∈ Y} p_X(x) p_{Y|X}(y|x) \log p_{Y|X}(y|x)    (1.18)
       = -\sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log p_{Y|X}(y|x).    (1.19)

Note that H(Y|X) is a number, while H(Y|X = x) is a function (a function of x).

As a simple example, let X be the result of rolling a six-sided die, where X = {1, 2, 3, 4, 5, 6} and p_X(x) = 1/6. Let Y indicate whether X is odd or even, so Y = {odd, even}. Clearly if you know X, then you know Y, so there is no uncertainty:

H(Y|X) = 0 bits.    (1.20)

On the other hand, if you only know Y = even, then X could be 2, 4 or 6, each with probability 1/3:

H(X|Y = even) = \log 3    (1.21)

and similarly H(X|Y = odd) = \log 3. Since p_Y(odd) = p_Y(even) = 1/2, we have:

H(X|Y) = (1/2) \log 3 + (1/2) \log 3 = \log 3.    (1.22)

Example. Let Y be a random variable that indicates how Steve goes to work, Y = {Bicycle, Train}. Let X be a random variable that indicates the weather, X = {Sunny, Rainy}. Let the conditional probability p_{Y|X}(y|x) be given by:

p_{Y|X}(y|x)   y = B   y = T
x = S          1/2     1/2
x = R          0       1

and let p_X(S) = 2/3 and p_X(R) = 1/3. Find H(Y|X = x) and H(Y|X).


Solution

H(Y|X = S) = -\sum_{y ∈ Y} p_{Y|X}(y|S) \log p_{Y|X}(y|S) = -(1/2) \log (1/2) - (1/2) \log (1/2)    (1.23)
           = 1    (1.24)
H(Y|X = R) = -\sum_{y ∈ Y} p_{Y|X}(y|R) \log p_{Y|X}(y|R) = -1 \log 1 - 0 \log 0    (1.25)
           = 0    (1.26)

Here, H(Y|X = x) is a function of x, which can be written as a vector [1, 0]. Next,

H(Y|X) = \sum_{x ∈ X} p_X(x) H(Y|X = x)    (1.27)
       = p_X(S) H(Y|X = S) + p_X(R) H(Y|X = R)    (1.28)
       = 2/3.    (1.29)
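This computation can be checked numerically with the computeConditionalEntropy function given later in Section 1.7.3. A minimal sketch, assuming that function (and computeEntropy, which it calls) is on the Matlab path:

pygx = [1/2 1/2 ; 0 1];    % rows are x = Sunny, Rainy; columns are y = Bicycle, Train
px = [2/3 1/3];
H = computeConditionalEntropy(pygx, px)    % H(Y|X) = 2/3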

1.4 Properties of Conditional Entropy

Theorem — Conditioning reduces entropy: H(X|Y) ≤ H(X), with equality if and only if X and Y are independent.

If X and Y are correlated, then knowing something about Y will reduce your uncertainty about X. This is an important theorem, and will be proved later. Note H(Y|X) = H(Y) if and only if X and Y are independent.

Theorem — Conditional entropy of functions. For any function g(·), H(g(X)|X) = 0. (The function g is deterministic, that is, not random. A student once asked "what if g is random?" We usually do not think of functions as being random, but if it were, then yes, it could increase entropy.)

The theorem says that if you know a random variable, then you also know a function of that random variable. A special case is g(x) = x, so that H(X|X) = 0: if you know a variable, then there is no uncertainty about it.

Theorem — Entropy conditioned on a deterministic function. Let g(x) be a bijective function, that is, g(x) and g^{-1}(x) have one unique value for all x ∈ X. Then, H(X|g(X)) = 0.

Note that if g(·) is not bijective, then the entropy may not be zero, which is illustrated in the following example.

Example. Let X be defined on X = {-1, 0, 1} with p_X(x) = [1/4, 1/2, 1/4]. Let g(x) = x^2 and let Y = g(X), so that Y = {0, 1} and p_Y(0) = 1/2 and p_Y(1) = 1/2. Compute H(Y|X) and H(X|Y).

Solution. First compute H(Y|X):

H(Y|X) = p_X(-1) H(Y|X = -1) + p_X(0) H(Y|X = 0) + p_X(1) H(Y|X = 1)


Given X is known, there is no uncertainty about Y, that is, H(Y|X = x) = 0, so H(Y|X) = 0.

On the other hand, to compute H(X|Y), note that x^2 is not bijective, since (-1)^2 = 1^2. The conditional distribution p_{X|Y}(x|y) is:

p_{X|Y}(x|y)   x = -1   x = 0   x = 1
y = 0          0        1       0
y = 1          1/2      0       1/2

and p_Y(y) = [1/2, 1/2]. In particular, H(X|Y = 0) = 0 since if y = 0 we could only have x = 0. But H(X|Y = 1) is not 0, because x could be -1 or 1.

H(X|Y = 1) = -\sum_{x ∈ X} p_{X|Y}(x|1) \log p_{X|Y}(x|1)    (1.30)
           = -( (1/2) \log (1/2) + 0 \log 0 + (1/2) \log (1/2) )    (1.31)
           = 1    (1.32)

So that:

H(X|Y) = H(X|Y = 0) p_Y(0) + H(X|Y = 1) p_Y(1)    (1.33)
       = 0 · (1/2) + 1 · (1/2)    (1.34)
       = 1/2.    (1.35)
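The same answer can be obtained numerically, here without the helper functions of Section 1.7.3. A minimal Matlab sketch using the conditional distribution table above:

pxgy = [0 1 0 ; 1/2 0 1/2];             % rows are y = 0, 1; columns are x = -1, 0, 1
py = [1/2 1/2];
t = pxgy; t(t < 1e-10) = 1e-10;         % avoid 0*log2(0) = NaN
HgivenY = -sum( pxgy .* log2(t), 2 );   % H(X|Y=y) for each row: [0; 1]
H = py * HgivenY                        % H(X|Y) = 1/2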

1.5 Chain rules for Entropy

Theorem — Chain Rule for Entropy For random variables X and Y:

H(X,Y) = H(X) + H(Y|X) (1.36)

Proof:

H(X,Y) = -\sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log p_{XY}(x,y)    (1.37)
       = -\sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log p_X(x) p_{Y|X}(y|x)    (1.38)
       = -\sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log p_X(x) - \sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log p_{Y|X}(y|x)
       = -\sum_{x ∈ X} p_X(x) \log p_X(x) - \sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log p_{Y|X}(y|x)    (1.39)
       = H(X) + H(Y|X)    (1.40)

Note that H(X|Y) ≠ H(Y|X), but:

H(X) - H(X|Y) = H(Y) - H(Y|X)    (1.41)


Theorem — Generalized Chain Rule for Entropy. Let X_1, X_2, ..., X_n be jointly distributed as p_{X_1,...,X_n}(x_1, ..., x_n). Then,

H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)    (1.42)

Proof:

H(X_1, X_2) = H(X_1) + H(X_2|X_1)    (1.43)
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3|X_1)    (1.44)
                 = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1)    (1.45)

Continuing this process iteratively, we have:

H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ··· + H(X_n|X_{n-1}, ..., X_2, X_1)
                      = \sum_{i=1}^{n} H(X_i|X_{i-1}, ..., X_1)

Theorem — Independence bound on entropy. Let X_1, X_2, ..., X_n be random variables jointly distributed as p_X(x_1, ..., x_n). Then,

H(X_1, X_2, ..., X_n) ≤ \sum_{i=1}^{n} H(X_i)    (1.46)

Proof:

H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1}, ..., X_1) ≤ \sum_{i=1}^{n} H(X_i)    (1.47)

where the inequality holds because conditioning reduces entropy.

1.6 More Numerical Examples

Example. Find the entropies H(X,Y), H(X|Y) and H(Y|X) for random variables X, Y with X = Y = {1, 2, 3, 4} jointly distributed as:

p_{X,Y} = [ 1/8   1/16  1/32  1/32
            1/16  1/8   1/32  1/32
            1/16  1/16  1/16  1/16
            1/4   0     0     0    ]    (1.48)

(x is in the columns, y is in the rows). Then, the marginal distributions of X and Y are:

p_X(x) = [1/2, 1/4, 1/8, 1/8]    (1.49)
p_Y(y) = [1/4, 1/4, 1/4, 1/4]    (1.50)

Hence, H(X) = 7/4 bits and H(Y) = 2 bits. The joint entropy H(X,Y) is:

H(X,Y) = -\sum_{x=1}^{4} \sum_{y=1}^{4} p_{XY}(x,y) \log p_{XY}(x,y) = 27/8 bits    (1.51)


And the conditional entropies are:

H(X|Y) = H(X,Y) - H(Y) = 27/8 - 2 = 11/8 bits, and    (1.52)
H(Y|X) = H(X,Y) - H(X) = 27/8 - 7/4 = 13/8 bits.    (1.53)
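These values can be verified numerically from the joint distribution matrix. A minimal Matlab sketch (pxy below has x in the columns and y in the rows, matching (1.48)):

pxy = [1/8 1/16 1/32 1/32 ; 1/16 1/8 1/32 1/32 ; 1/16 1/16 1/16 1/16 ; 1/4 0 0 0];
px = sum(pxy,1);                    % marginal p_X(x): sum over y (rows index y)
py = sum(pxy,2)';                   % marginal p_Y(y): sum over x (columns index x)
p = pxy(:); p(p < 1e-10) = 1e-10;   % avoid log2(0)
Hxy = -sum( pxy(:) .* log2(p) )     % H(X,Y) = 27/8 = 3.375 bits
Hx  = -sum( px .* log2(px) )        % H(X) = 7/4 bits
Hy  = -sum( py .* log2(py) )        % H(Y) = 2 bits
HxGy = Hxy - Hy                     % H(X|Y) = 11/8 bits
HyGx = Hxy - Hx                     % H(Y|X) = 13/8 bits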

1.7 Source code

1.7.1 Matlab basics

Command line computations:

>> log2(3)

ans =

    1.5850

>> px = [1/2 1/4 1/4];
>> -sum( px .* log2(px) )

ans =

    1.5000

Open a file in the text editor (create if it does not exist):

>> edit binaryEntropyFunction.m

Then cut and paste the source code shown below into the file.

Run a command, then display a variable:

>> H = binaryEntropyFunction(0.11);
>> H

H =

    0.4999

Run a command and show the output (no semicolon):

>> binaryEntropyFunction(0.11)

ans =

    0.4999


1.7.2 Binary Entropy Function

Compute the binary entropy function h(p):

function h = binaryEntropyFunction(p)

% clip probabilities away from 0 and 1 to avoid 0*log2(0) = NaN
p(find(p < 1E-10)) = 1E-10;
p(find(p > 1 - 1E-10)) = 1 - 1E-10;

h = - p .* log2(p) - (1-p) .* log2(1-p);

Make a plot of the binary entropy function:

>> p = linspace(0,1,101);
>> hp = binaryEntropyFunction(p);
>> plot(p,hp);
>> xlabel('source probability p')
>> ylabel('entropy h(p)')

1.7.3 Compute Entropy

Given an arbitrary source distribution pX(x), find the entropy H(X).

>> px = rand(1,100);       % 100 random numbers between 0 and 1
>> px = px / sum(px);      % force px to sum to 1
>> H = computeEntropy(px)  % compute the entropy

function H = computeEntropy(px)

assert( all(px >= 0), 'px are not all non-negative');
assert( abs( 1 - sum(px) ) < 1E-10, 'px does not sum to 1');

% changing 0 to 1E-10 avoids px*log2(px) = NaN
px(find(px < 1E-10)) = 1E-10;

H = -sum( px .* log2(px) );

Given pY|X(y|x) and pX(x), compute the conditional entropy H(Y|X).

function H = computeConditionalEntropy(pygx,px)

assert( all(px >= 0), 'px are not all non-negative');
assert( abs( 1 - sum(px) ) < 1E-10, 'px does not sum to 1');

[X,Y] = size(pygx);
assert(X == length(px), 'number X elements in px and pygx disagree')

t = zeros(1,X);    % t(x) is H(Y | X = x)
for x = 1:X;
    t(x) = computeEntropy(pygx(x,:));
end

H = t * px(:);     % H(Y|X) = sum_x p(x)*t(x)


Chapter 2

Tour of Probability Theory

This chapter describes some of the basic probability used in information theory.

2.1 Probability of Events

An event is an outcome, or a set of outcomes, of an experiment (a subset of the sample space) to which a probability is assigned. A single outcome may be an element of many different events.

2.2 Single, Joint and Conditional Random Variables

A random variable is a variable that takes on a value that represents the outcome of a probabilistic experiment. For example, in rolling a die, there are 6 possible values, 1, 2, 3, 4, 5 or 6, and each is equally likely.

Probability is defined using a set X and a function p_X(x). The set X is called the sample space and the function p_X(x) is called the probability mass function or probability distribution. The function p_X(x) must satisfy two conditions:

0 ≤ p_X(x) ≤ 1    (2.6)

for all x ∈ X, and

\sum_{x ∈ X} p_X(x) = 1.    (2.7)

Sometimes X ~ p_X(x) is written to mean "X is distributed as p_X(x)."

Sans-serif type X is used for random variables, distinct from italic X. Calligraphic script X is used for sets, and the sample space is a set. In the example of rolling a die with 6 sides, the sample space is X = {1, 2, 3, 4, 5, 6}, and the 6 events are equally likely: p_X(x) = 1/6 for x ∈ X.


• Random variable: A random variable X has a probability distribution p_X(x) for x in the sample space X, where:

  0 ≤ p_X(x) ≤ 1 and \sum_{x ∈ X} p_X(x) = 1.    (2.1)

  Here, p_X(x) is Pr[X = x], that is, "the probability the random variable X is equal to x."

• Jointly distributed random variables X, Y have a joint probability distribution p_{XY}(x,y) for x ∈ X and y ∈ Y.

• Independence of X and Y means p_{X,Y}(x,y) = p_X(x) p_Y(y).

• Definition of conditional probability:

  p_{X|Y}(x|y) = p_{XY}(x,y) / p_Y(y).    (2.2)

• Bayes rule:

  p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_Y(y)    (2.3)

• Expected value:

  E[X] = \sum_{x ∈ X} x p_X(x).    (2.4)

• Expected value of a function g(x):

  E[g(X)] = \sum_{x ∈ X} g(x) p_X(x).    (2.5)

• Marginalization: p_X(x) = \sum_{y ∈ Y} p_{XY}(x,y)

• Theorem of total probability: p_X(x) = \sum_{y ∈ Y} p_{X|Y}(x|y) p_Y(y)

• Independent and identically distributed (i.i.d.): a sequence of n i.i.d. random variables X_1, X_2, ..., X_n has distribution p_{X_1}(x_1) p_{X_2}(x_2) ··· p_{X_n}(x_n).

Figure 2.1: Summary of important random variable relationships.


Another example: if Z represents the outcome of flipping a coin, which has heads and tails each with probability 1/2, then Z = {heads, tails} and p_Z(heads) = p_Z(tails) = 1/2.

For example, let X be a random variable denoting the outcome of rolling a die with 6 sides. Then the sample space is X = {1, 2, 3, 4, 5, 6}, and the probability distribution is:

p_X(x) = \begin{cases} 1/6 & x ∈ {1, 2, 3, 4, 5, 6} \\ 0 & \text{otherwise} \end{cases}    (2.8)

A discrete random variable X has real, discrete values, and so the sample space X is likewise discrete, and often X is a subset of the integers. The probability distribution can be written in various ways:

Pr(X = x) or p(x) or p_X(x)    (2.9)

for x ∈ X. Note that X is a random variable, while x is a constant. The terms Pr(X = x) and p_X(x) show both X and x, but sometimes we use p(x) to save space. Writing Pr(X) is undesirable (without the "= x"), because it does not explicitly show the dependence on x.

Often p_X(x) is written as a vector, for example:

p_X(x) = [1/2, 1/4, 1/8, 1/8],    (2.10)

which is understood to mean

p_X(x) = \begin{cases} 1/2 & x = 1 \\ 1/4 & x = 2 \\ 1/8 & x = 3 \\ 1/8 & x = 4 \end{cases}    (2.11)

when X is the integers {1, 2, 3, 4} or X is otherwise not important.

Example. A binary random variable X (often called a Bernoulli random variable), with parameter p, has a sample space with two values X = {0, 1} and it takes on one value with probability p, and the other value with probability 1 - p:

p_X(0) = Pr(X = 0) = 1 - p    (2.12)
p_X(1) = Pr(X = 1) = p    (2.13)

and could also be written as p_X(x) = [1 - p, p].

2.2.1 Jointly Distributed Random Variables

Jointly distributed random variables X and Y are defined for sample spaces X and Y respectively, with a joint probability distribution:

p_{XY}(x,y) = Pr(X = x, Y = y).    (2.14)


Again, we almost always write p_{XY}(x,y) or p(x,y) instead of Pr(X = x, Y = y). The joint probability distribution satisfies

0 ≤ p_{XY}(x,y) ≤ 1,    (2.15)

for all x ∈ X and all y ∈ Y, and:

\sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) = 1    (2.16)

The marginal distribution of X is:

p_X(x) = \sum_{y ∈ Y} p_{XY}(x,y)    (2.17)

and the marginal distribution of Y is:

p_Y(y) = \sum_{x ∈ X} p_{XY}(x,y)    (2.18)

Example. Consider the jointly distributed random variables X and Y, with X = {1, 2, 3}, Y = {1, 2, 3} and probability distribution given in matrix form:

p_{XY}(x,y) = [ 1/2   0     0
                1/4   1/16  0
                0     1/8   1/16 ],    (2.19)

where the rows correspond to values of x and columns correspond to values of y, so p_{XY}(2,1) = 1/4. Then the marginal distribution p_X(x) is:

p_X(x) = \sum_{y=1}^{3} p_{XY}(x,y),    (2.20)

which has values 1/2, 5/16, 3/16. Similarly, the marginal distribution p_Y(y) has values 3/4, 3/16, 1/16.

Understanding joint distributions. The following is a perspective on joint distributions. Consider X = {apple, banana} and Y = {3, 5}. Suppose the joint distribution p_{XY}(x,y) is:

p_{XY}(x,y)   y = 3   y = 5
apple         2/14    3/14
banana        4/14    5/14

It is sometimes convenient to think of the pair (X,Y) as a single random variable Z on the sample space:

Z = {(apple, 3), (apple, 5), (banana, 3), (banana, 5)},    (2.21)


where Z = X × Y is the Cartesian product. The probability distribution on Z is:

p_Z(z) = \begin{cases} 2/14 & \text{if } z = (apple, 3) \\ 3/14 & \text{if } z = (apple, 5) \\ 4/14 & \text{if } z = (banana, 3) \\ 5/14 & \text{if } z = (banana, 5) \end{cases}    (2.22)

The random variable Z and the random variable pair (X,Y) are connected in this way.

2.2.2 Conditional probability distributions

For two jointly distributed random variables X and Y, the conditional distribution of X given Y:

p_{X|Y}(x|y) = Pr(X = x | Y = y)    (2.23)

is defined as:

Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y), or    (2.24)
p_{X|Y}(x|y) = p_{XY}(x,y) / p_Y(y).    (2.25)

The term p_{X|Y}(x|y) or Pr(X = x | Y = y) is read as "the probability X is equal to x, given Y is equal to y." For any value of y,

\sum_{x ∈ X} p_{X|Y}(x|y) = 1    (2.26)

must hold.

Example. Consider again the joint distribution in (2.19). Compute p_{Y|X}(2|3), that is, Pr(Y = 2 | X = 3):

p_{Y|X}(2|3) = p_{X,Y}(3,2) / p_X(3) = (1/8) / (3/16) = 2/3    (2.27)

The conditional distribution p_{Y|X}(y|x) can be written in matrix form:

p_{Y|X}(y|x) = [ 1     0     0
                 4/5   1/5   0
                 0     2/3   1/3 ].    (2.28)

A conditional probability distribution matrix is usually written so that the rows sum to 1 (rather than the columns summing to 1).
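Going from the joint distribution to the conditional distribution matrix amounts to dividing each row by its row sum. A minimal Matlab sketch for the joint distribution (2.19):

pxy = [1/2 0 0 ; 1/4 1/16 0 ; 0 1/8 1/16];    % rows are x, columns are y
px  = sum(pxy,2);                             % marginal p_X(x) as a column vector
pygx = pxy ./ repmat(px,1,size(pxy,2));       % each row of pxy divided by p_X(x)
rats(pygx)                                    % display as rational numbers, matching (2.28)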


2.2.3 Bayes Rule, Total Probability, All-Knowing Joint Distribution

Bayes' rule (or Bayes' theorem) states:

p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_Y(y),    (2.29)

and is easily proved by observing that p_{XY}(x,y) can be written two ways: p_{XY}(x,y) = p_{X|Y}(x|y) p_Y(y) and p_{XY}(x,y) = p_{Y|X}(y|x) p_X(x).

The theorem of total probability or law of total probability is related to marginalization (2.17), (2.18). For X, Y jointly distributed as p_{XY}(x,y):

p_X(x) = \sum_{y ∈ Y} p_{X|Y}(x|y) p_Y(y),    (2.30)

obtained using the definition of conditional probability. (One could also write \sum_{y ∈ Y} p_{Y|X}(y|x) p_X(x), but this is not especially useful, because usually we are trying to find p_X(x) from p_{X|Y}(x|y) and p_Y(y).)

The theorem of total probability has various forms. For example, if X, Y and Z are jointly distributed, then:

p_{Y|X}(y|x) = \sum_{z ∈ Z} p_{Y|XZ}(y|x,z) p_{Z|X}(z|x)    (2.31)

is also a useful form of the theorem of total probability.

For two random variables, the joint distribution p_{XY}(x,y) is the "all-knowing distribution." This means that given p_{XY}(x,y), it is possible to compute p_X(x), p_Y(y), p_{X|Y}(x|y) and p_{Y|X}(y|x). If you are given only p_{Y|X}(y|x), then you additionally need p_X(x) to find the joint distribution using p_{XY}(x,y) = p_{Y|X}(y|x) p_X(x). In fact, sometimes p_{Y|X}(y|x) p_X(x) is called the "joint distribution."

The phrase "X, Y are jointly distributed as p_{XY}(x,y)" simply means that X and Y are random variables that have a known joint distribution, and they are not necessarily independent.

2.2.4 Example: Discrete Memoryless Channel

An example of a conditional probability distribution is a discrete memoryless channel, or DMC. As a model of a communications channel, one symbol x from an alphabet X is transmitted, and another symbol y from an alphabet Y is received. Given that X was transmitted, the probability that Y was received is given by the conditional distribution p_{Y|X}(y|x), which specifies the DMC; X is called the channel input, Y is called the channel output.

Example. For a discrete memoryless channel with p_X(x) = [1/2, 1/4, 1/4] and with conditional probability distribution:

p_{Y|X}(y|x) = [ 2/3   1/3   0
                 0     2/3   1/3
                 1/3   0     2/3 ],    (2.32)

where the rows are x and the columns are y, compute p_Y(y).


Figure 2.2: Example of a discrete memoryless channel, or DMC, with input X and output Y.

We have

p_Y(y) = \sum_{x ∈ X} p_{XY}(x,y) = \sum_{x ∈ X} p_{Y|X}(y|x) p_X(x)    (2.33)

From this, find:

p_Y(1) = (2/3)·(1/2) + 0 + (1/3)·(1/4) = 5/12    (2.34)
p_Y(2) = (1/3)·(1/2) + (2/3)·(1/4) + 0 = 4/12    (2.35)
p_Y(3) = 0 + (1/3)·(1/4) + (2/3)·(1/4) = 3/12    (2.36)

The computation can also be performed in matrix form:

[1/2  1/4  1/4] · [ 2/3   1/3   0
                    0     2/3   1/3
                    1/3   0     2/3 ] = [5/12  4/12  3/12].    (2.37)

2.3 Independence, Expected Value, Mean and Variance

2.3.1 Independence and Conditional Independence

Let two random variables X and Y have a joint distribution pXY(x, y).

Definition. Two variables X and Y are independent if and only if:

p_{XY}(x,y) = p_X(x) p_Y(y).    (2.38)

If further p_X(x) = p_Y(x) for all x ∈ X, then we say X and Y are independent and identically distributed, often abbreviated iid. If X and Y are independent, then:

p_{X|Y}(x|y) = p_X(x),    (2.39)

which is easily obtained using p_{XY}(x,y) = p_{X|Y}(x|y) p_Y(y).

Let three variables X, Y and Z be jointly distributed. Even if X and Y are not independent, it is possible that X and Y are conditionally independent given Z, if the following holds:

p_{XY|Z}(x,y|z) = p_{X|Z}(x|z) · p_{Y|Z}(y|z)    (2.40)

for all x ∈ X, all y ∈ Y and all z ∈ Z.

2.3.2 Expected Value

The expected value E[X] of a random variable X with probability distribution p_X(x) is:

E[X] = \sum_{x ∈ X} x p_X(x)    (2.41)

The expected value E[X] is also called the mean.

Example. The expected value E[X] of the binary random variable X given in (2.12) is:

E[X] = 0 · (1 - p) + 1 · p = p.    (2.42)

Example. If X is rolling a die, then E[X] is given by:

E[X] = \sum_{x=1}^{6} p_X(x) x    (2.43)
     = (1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6    (2.44)
     = 3.5    (2.45)

The expectation of a function g of a random variable X is:

E[g(X)] = \sum_{x ∈ X} g(x) p_X(x)    (2.46)

Note that g(x) is a deterministic function of x, that is, g is not random.

Example. Find E[X^2] when X is the outcome of a die roll:

E[X^2] = \sum_{x=1}^{6} p_X(x) x^2    (2.47)
       = (1/6)·1^2 + (1/6)·2^2 + (1/6)·3^2 + (1/6)·4^2 + (1/6)·5^2 + (1/6)·6^2    (2.48)
       = 91/6    (2.49)


Example. Let X ~ p_X(x) = [1/2, 1/4, 1/4], and let g(x) = -\log p_X(x). Find E[g(X)]:

E[g(X)] = -\sum_{x=1}^{3} p_X(x) \log p_X(x)    (2.50)
        = -(1/2) \log (1/2) - 2 · (1/4) \log (1/4) = 3/2.    (2.51)

Writing E[g(X)] this way is a way to write the entropy H(X), which will be shown later.

The variance of X, Var[X], is:

Var[X] = E[X^2] - (E[X])^2.    (2.52)

The variance Var[X] of the die roll random variable X is:

Var[X] = E[X^2] - (E[X])^2    (2.53)
       = 91/6 - (7/2)^2 = 35/12    (2.54)
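A quick numerical check of the die-roll mean, second moment and variance (a minimal Matlab sketch):

x  = 1:6;
px = ones(1,6)/6;
EX   = sum( x .* px )       % E[X] = 3.5
EX2  = sum( x.^2 .* px )    % E[X^2] = 91/6 = 15.1667
VarX = EX2 - EX^2           % Var[X] = 35/12 = 2.9167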

For two jointly distributed random variables X and Y, the conditional expectation E[X|Y = y] is:

E[X|Y = y] = \sum_{x ∈ X} x p_{X|Y}(x|y)    (2.55)

The conditional expectation E[X|Y = y] is a function of y, for example, f(y) = E[X|Y = y]; for any particular value y, it is a number.

Note that E[X|Y] is distinct from E[X|Y = y]. In particular, E[X|Y] is a random variable equal to f(Y), where f(y) = E[X|Y = y]. Since E[X|Y] is a random variable, it has an expectation:

E[ E[X|Y] ] = E[X],    (2.56)

which is the law of total expectation.

Expectation and Variance of Multiple Random Variables. For any X_1, X_2, ..., X_n and constants a_1, a_2, ..., a_n (that is, independence is not required):

E[a_1 X_1 + a_2 X_2 + ··· + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + ··· + a_n E[X_n]    (2.57)

For any independent X_1, X_2, ..., X_n and constants a_1, a_2, ..., a_n:

Var[a_1 X_1 + a_2 X_2 + ··· + a_n X_n] = a_1^2 Var[X_1] + a_2^2 Var[X_2] + ··· + a_n^2 Var[X_n]    (2.58)

2.3.3 Union Bound

Information theory deals primarily with random variables and their probability distributions. But we also deal with probability of events. If E is an event that can either occur or not occur, then Pr(E) is "the probability that the event E occurs."


Let A_1, A_2, ..., A_n be events. The union bound is:

Pr( \bigcup_{i=1}^{n} A_i ) ≤ \sum_{i=1}^{n} Pr(A_i)    (2.59)

For n = 2, we have:

Pr(A_1 ∪ A_2) ≤ Pr(A_1) + Pr(A_2)    (2.60)

since Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B).

For example, in some communication system, let E_1 be the event that an error of one type occurs, with Pr(E_1) = 0.1. Let E_2 be the event that another type of error occurs, with Pr(E_2) = 0.05. If either E_1 or E_2 occurs, then the system fails, an event called E. Even if we don't know the correlation between E_1 and E_2, we can upper bound Pr(E) by the union bound:

Pr(E) ≤ Pr(E_1) + Pr(E_2)    (2.61)

so in this example, Pr(E) ≤ 0.15.

2.4 Random Vectors

2.4.1 Random Vectors

A random vector is a sequence of n random variables, independent and identically distributed. A random variable X_i has distribution p_X(x) on sample space X. Then, the random vector X:

X = (X_1, X_2, X_3, ..., X_n)    (2.62)

is a sequence of independent and identically distributed random variables, X_i ~ p_X(x). The sample space of X is the Cartesian product X^n.

The joint distribution of the vector X is p_X(x), and since the X_i are independent, the joint distribution is:

p_X(x) = p_{X_1}(x_1) p_{X_2}(x_2) ··· p_{X_n}(x_n) = \prod_{i=1}^{n} p_X(x_i),    (2.63)

where p_X(x) is the distribution of the random variable X.

Binary random vectors. If X_i is a binary random variable with X = {0, 1} and probability of a one equal to p, 0 ≤ p ≤ 1, as given in (2.12), then we say X is a binary random vector. The binary random vector X has parameters n and p, and is the outcome of a sequence of n yes/no experiments. The sample space is X^n = {0, 1}^n.

The vector X has z ones and n - z zeros. Let Z be the sum of X:

Z = \sum_{i=1}^{n} X_i    (2.64)


so that Z is a random variable expressing the number of ones in X. This is the binomial random variable. For z ∈ Z = {0, 1, 2, ..., n}, the probability distribution p_Z(z) is:

p_Z(z) = \binom{n}{z} p^z (1-p)^{n-z} for z = 0, 1, ..., n,    (2.65)

where \binom{n}{z} = n! / ( z! (n-z)! ). Here \binom{n}{z} is the number of distinct ways to place z ones into n positions.

2.4.2 Binary Random Vector Example

Consider an example of a binary random vector with n = 4 and p = 1/4. The sample space is:

X^4 = {0000, 0001, 0010, ..., 1111}.    (2.66)

Now we answer two questions: (a) what is the probability of x = 0110? (b) What is the probability of any sequence x with two ones?

(a) The probability of x = 0110 is Pr[X = 0110] = p_X(0110), computed using (2.63):

p_X(0110) = p_X(0) p_X(1) p_X(1) p_X(0)    (2.67)
          = (1-p) · p · p · (1-p) = p^2 (1-p)^2    (2.68)
          = 9/256 ≈ 0.03515    (2.69)

Another sequence with two ones and two zeros, for example 1010, will also occur with probability 9/256.

(b) The sequences with two ones and two zeros are:

{1100, 1010, 1001, 0110, 0101, 0011}    (2.70)

There are \binom{4}{2} = 6 sequences, and so the probability of any of them occurring is 6 · 9/256 = 27/128. This is the same as the binomial random variable with Z = 2, so using (2.65):

p_Z(2) = \binom{4}{2} p^2 (1-p)^2 = 27/128.    (2.71)

The event “X is any sequence with z ones” is the same as the event “Z = z.”
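Both probabilities can be checked numerically. A minimal Matlab sketch (nchoosek is the built-in binomial coefficient):

n = 4; p = 1/4;
pSeq = p^2 * (1-p)^2                        % one particular sequence with two ones, 9/256
pZ2  = nchoosek(n,2) * p^2 * (1-p)^(n-2)    % probability of two ones in total, 27/128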

2.5 Law of Large Numbers

A central result of information theory, Shannon's channel coding theorem, is a clever application of the law of large numbers. This section reviews the law of large numbers.


2.5.1 Markov inequality

In probability theory, the Markov inequality gives an upper bound for the probability that a non-negative function of a random variable is greater than or equal to some positive constant. Markov's inequality relates probabilities to expectations, and provides bounds for the cumulative distribution function of a random variable. The Markov inequality is often not a tight bound, but is nonetheless useful.

Markov Inequality. If X is any nonnegative random variable and a > 0, then:

Pr(X ≥ a) ≤ E[X] / a    (2.72)

An example of an application of Markov's inequality is the fact that (assuming incomes are non-negative) no more than 1/5 of the population can have more than 5 times the average income.

2.5.2 Chebyshev inequality

Let X be a random variable with finite expected value E[X] and finite non-zero variance Var[X]. Then for any real number ε > 0, the Chebyshev inequality states:

Pr(|X - E[X]| < ε) ≥ 1 - Var[X]/ε^2.    (2.73)

Some textbooks may write the inequality as:

Pr(|X - E[X]| ≥ ε) ≤ Var[X]/ε^2.    (2.74)

Note that this is only interesting when Var[X] ≤ ε^2.

For example, suppose X has expected value μ and variance σ^2, and we choose ε^2 = 2σ^2. Using (2.73), the probability that X is in the interval:

(μ - ε, μ + ε) = (μ - √2 σ, μ + √2 σ)    (2.75)

is 1 - σ^2/ε^2 = 0.5 or greater.

Because it can be applied to completely arbitrary distributions (unknown except for mean and variance), the inequality generally gives a poor bound compared to what might be possible if something is known about the distribution involved.

2.5.3 Random Vectors: How Close Is the Sample Mean?

This section describes the sample mean of a random vector, and how close it can be to the true mean. In simple cases, the probability that the sample mean is close to the true mean can be computed exactly. This section uses the Chebyshev inequality to give a lower bound on this probability.


Let X = X_1 X_2 ··· X_n be a random vector of n random variables, independent and identically distributed, each with mean E[X] and variance Var[X]. Let \bar{X}_n be the sample mean of these variables:

\bar{X}_n = (1/n) \sum_{i=1}^{n} X_i    (2.76)

Since the X_i are random variables, \bar{X}_n is also a random variable. Its expected value E[\bar{X}_n] is equal to E[X], easily shown using (2.57).

We are interested in sequences X for which the sample mean \bar{X}_n is "epsilon close" to its mean E[\bar{X}_n] = E[X]:

|\bar{X}_n - E[X]| ≤ ε,    (2.77)

where ε > 0 is some small constant value. Since \bar{X}_n is a random variable, we are interested in the probability:

Pr(|\bar{X}_n - E[X]| ≤ ε).    (2.78)

Computing this probability directly can be difficult; instead, a lower bound q:

Pr(|\bar{X}_n - E[X]| ≤ ε) ≥ q,    (2.79)

will be found using the Chebyshev inequality.

We consider a specific case where the X_i are iid binary random variables with probability of a one equal to p, so that E[X] = p. The number of ones in X is \sum_{i=1}^{n} X_i, and the "average number of ones" is \bar{X}_n. The mean and variance of \bar{X}_n are:

E[\bar{X}_n] = p    (2.80)
Var[\bar{X}_n] = p(1-p)/n.    (2.81)

Consider the case of n = 15 and p = 1/3. What is the average number of ones you expect in any realization of X? You should expect this value to be p, but if you take an actual realization of X, for example:

x = 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0    (2.82)

there are 6 ones, so the average number of ones is 6/15 = 0.4, which is close to, but not equal to, E[X] = 1/3. (Alternatively, you might expect, on average, np = 5 ones.)

Using the Chebyshev inequality, we can find a lower bound on the probability that \bar{X}_n is within ε = 2/15 of its mean, that is, the probability that the sample mean \bar{X}_n takes on values in 1/3 ± 2/15, or (3/15, 7/15). An equivalent condition is that X has 3, 4, 5, 6 or 7 ones. (Even though a strict inequality is inside the probability, if 2/15 is replaced with 2/15 plus a very small number, then all the values in the range 1/3 ± 2/15 will be included.) The numerical mean and variance are E[\bar{X}_n] = 1/3 and Var[\bar{X}_n] = 2/(9n) ≈ 0.0148. Then, using (2.73):

Pr(|\bar{X}_n - E[X]| < 2/15) ≥ 1 - (2/135)/(4/225) = 1/6,    (2.83)


That is, the Chebyshev inequality states that the probability that |\bar{X}_n - E[X]| is less than ε = 2/15 is greater than 1/6. The exact value is

\sum_{i=3}^{7} \binom{15}{i} p^i (1-p)^{15-i} ≈ 0.8324.    (2.84)
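Both the Chebyshev lower bound and the exact binomial probability can be computed in a few lines of Matlab. A minimal sketch; the binomial sum is written with nchoosek so no toolbox functions are needed:

n = 15; p = 1/3; epsilon = 2/15;
chebyshevBound = 1 - ( p*(1-p)/n ) / epsilon^2    % 1/6 = 0.1667
exact = 0;
for i = 3:7
    exact = exact + nchoosek(n,i) * p^i * (1-p)^(n-i);
end
exact                                             % 0.8324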

We are interested in the following case: for a fixed ε and fixed q, what value of n is needed to satisfy (2.79)? When n is small, this n can be found by direct computation. But for large n and an arbitrary distribution, we can form a lower bound using the Chebyshev inequality. If n is increased, then the lower bound q provided by the Chebyshev bound will increase towards 1. The table below shows the values of n required such that the Chebyshev bound will satisfy the specified value of q.

q        Chebyshev bound on n    exact value of n
1/6      15                      —
0.9      125                     30
0.99     1250                    80
0.999    12500                   134

In addition, Fig. 2.3 shows the probability distribution of the random variable \bar{X}_n for various values of n, where it is clear that the distribution tends to concentrate around the mean as n becomes larger. Thus, any desired probability q can be achieved by making n sufficiently large, for any fixed ε. The law of large numbers lets n go to infinity, and shows that q can go to 1.

2.5.4 Law of Large Numbers

The law of large numbers plays a central role in information theory. The law of large numbers shows that the sample mean of a sequence of random variables approaches the true mean, as the number of random variables goes to infinity.

The law of large numbers states that the sample mean converges in probability towards the expected value. That is, for any positive number ε,

lim_{n→∞} Pr(|\bar{X}_n - E[X]| < ε) = 1.    (2.85)

To prove this, first let the variance of the iid X_i be Var[X]. For the sample mean \bar{X}_n, it is easy to show E[\bar{X}_n] = E[X] and Var[\bar{X}_n] = Var[X]/n, using (2.57) and (2.58). Apply the Chebyshev inequality (2.73) to \bar{X}_n:

Pr(|\bar{X}_n - E[\bar{X}_n]| < ε) ≥ 1 - Var[\bar{X}_n]/ε^2    (2.86)

and then:

Pr(|\bar{X}_n - E[X]| < ε) ≥ 1 - Var[X]/(n ε^2)    (2.87)


Figure 2.3: Example of the sample mean of a binary random vector with p = 1/3, for n = 15, 30, 60, 120 and 216.

Taking the limit of both sides:

lim_{n→∞} Pr(|\bar{X}_n - E[X]| < ε) = 1    (2.88)

since lim_{n→∞} 1 - Var[X]/(n ε^2) = 1.

Interpreting this result, the law of large numbers states that for any nonzero margin ε, no matter how small, with a sufficiently large sample size there will be a very high probability that the average of the observations will be close to the expected value, that is, within the margin. (These lecture notes use the weak law of large numbers. The strong law of large numbers states that Pr(lim_{n→∞} \bar{X}_n = E[X]) = 1, which is a more powerful result, but more difficult to prove.)

The related central limit theorem states that the average of a large number of samples will asymptotically approach a Gaussian distribution, when the average is appropriately scaled. The central limit theorem is a powerful result, but is not used in this course.


2.6 Source Code

2.6.1 Basic Probability Operations

Compute p_Y(y) = \sum_{x} p_X(x) p_{Y|X}(y|x) as a vector-matrix product (pygx is "probability of Y given X").

>> pygx = [1 0 0 ; 4/5 1/5 0 ; 0 2/3 1/3]

pygx =

    1.0000         0         0
    0.8000    0.2000         0
         0    0.6667    0.3333

>> px = [1/2 1/4 1/4];
>> py = px * pygx

py =

    0.7000    0.2167    0.0833

Marginalization: p_Y(y) = \sum_{x ∈ X} p_{XY}(x,y) and p_X(x) = \sum_{y ∈ Y} p_{XY}(x,y).

>> pxy = [1/2 0 0 ; 1/4 1/16 0 ; 0 1/8 1/16]

pxy =

    0.5000         0         0
    0.2500    0.0625         0
         0    0.1250    0.0625

>> py = sum(pxy,1)

py =

    0.7500    0.1875    0.0625

>> px = sum(pxy,2)'    % transpose

px =

    0.5000    0.3125    0.1875

>> rats(px)            % display as rational numbers

ans =

    1/2       5/16      3/16

Compute pXY(x, y) = pX(x)pY|X(y|x).


[X,Y] = size(pygx);
pxy = repmat(px(:),1,Y) .* pygx;

2.6.2 Random Variable Generation

Generate n samples of a random variable according to a distribution pX(x).

function X = randomSamples(px,n)

if nargin < 2
    n = 1;    % default n
end

Fx = cumsum([0 px(1:end-1)]);    % cumulative distribution
X = zeros(1,n);                  % pre-allocate
for ii = 1:n;
    X(ii) = find(Fx < rand,1,'last');    % generate sample
end

Generate n = 10 samples from the distribution p_X(x) = [1/4, 1/2, 1/4]:

>> randomSamples([1/4 1/2 1/4],10)

ans =

     2     3     3     2     2     1     2     2     2     2

2.6.3 Sample Mean Experiments

Conduct a large number of experiments to count how often the sample mean is within ε of the true mean. Then use this count to estimate the probability of being within ε of the true mean.

clear all

px = [1/4 1/2 1/4];
trueMean = [1 2 3] * px(:);    % true mean of the random variable
n = 50;                        % number of samples
epsilon = 0.1;
numberOfExperiments = 10000;

for ii = 1:numberOfExperiments
    x = randomSamples(px,n);
    sampleMean(ii) = mean(x);
end
numberWithinEpsilon = length(find( abs(sampleMean - trueMean) < epsilon ))
probability = numberWithinEpsilon / numberOfExperiments


Chapter 3

Mutual Information and KL Divergence

This chapter reviews two ways of "measuring" information. Informally, the mutual information I(X;Y) between two random variables X and Y is how much knowing one tells you about the other. The Kullback-Leibler divergence (KL divergence), D(p(x)||q(x)), can be considered as a distance between a true probability distribution p(x) and an approximate distribution q(x).

3.1 Mutual Information

The mutual information I(X;Y) is the reduction in the uncertainty of X by knowing Y.

Definition. Let X and Y be jointly distributed random variables. Then the mutual information I(X;Y) between X and Y is:

I(X;Y) = H(X) - H(X|Y)    (3.1)

Alternatively, mutual information may be defined as follows. Consider random variables X and Y with a joint probability distribution function p_{X,Y}(x,y) and marginal distributions p_X(x) and p_Y(y). Then I(X;Y) is given by:

I(X;Y) = \sum_{x ∈ X} \sum_{y ∈ Y} p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)}    (3.2)

The use of (3.1) is so common that it will be referred to as the "definition of mutual information," even though many textbooks take (3.2) as the definition. It is straightforward to show they are equivalent.


Figure 3.1: Venn diagram expressing the relationships between H(X|Y), I(X;Y) and H(Y|X).

3.1.1 Properties of Mutual Information

This section describes key properties of mutual information.

Symmetry of Mutual Information. From (3.2), it is easy to see that mutual information is symmetric in its two variables:

I(X;Y) = I(Y;X).    (3.3)

The mutual information between X and itself, I(X;X), is called the self information, and

I(X;X) = H(X).    (3.4)

Corollary. I(X;Y) = H(X) + H(Y) - H(X,Y).

By applying the chain rule H(X,Y) = H(X) + H(Y|X),

I(X;Y) = H(X) + H(Y) - H(X,Y)    (3.5)

or, by rearranging terms,

H(X,Y) = H(X) + H(Y) - I(X;Y)    (3.6)

The relationship between mutual information and entropy can be expressed by the Venn diagram in Fig. 3.1. The entropies H(X) and H(Y) are represented by circles, and the joint entropy H(X,Y) is represented by the union of the two circles. The mutual information I(X;Y) is the intersection of the two circles. The left circle H(X) has two parts, so that H(X) = I(X;Y) + H(X|Y), from the definition of mutual information.

Non-Negativity of Mutual Information. Mutual information is non-negative:

I(X;Y) ≥ 0,    (3.7)

with equality if and only if X and Y are independent. This is shown in the next section.


Mutual information upper bound. Random variables X and Y take values from the alphabets X and Y, respectively. Mutual information is upper bounded as:

I(X;Y) ≤ \log |X|    (3.8)
I(X;Y) ≤ \log |Y|    (3.9)

The first bound can be shown by writing I(X;Y) = H(X) - H(X|Y) and using H(X) ≤ \log |X| and H(X|Y) ≥ 0.

3.1.2 Conditional Mutual Information and Chain Rules

Just as there is conditional entropy H(X|Z), there is also conditional mutual information I(X;Y|Z).

Definition. The conditional mutual information of random variables X and Y given Z is:

I(X;Y|Z) = H(X|Z) - H(X|Y,Z)    (3.10)

There is a chain rule for mutual information:

I(X,Z;Y) = I(X;Y|Z) + I(Z;Y)    (3.11)

Theorem — Chain Rule for Mutual Information.

I(X_1, X_2, ..., X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, X_{i-2}, ..., X_1)    (3.12)

Proof:

I(X_1, X_2, ..., X_n; Y) = H(X_1, X_2, ..., X_n) - H(X_1, X_2, ..., X_n | Y)    (3.13)
    = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1) - \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1, Y)    (3.14)
    = \sum_{i=1}^{n} I(X_i; Y | X_1, X_2, ..., X_{i-1})    (3.15)

3.1.3 Numerical Example

A discrete memoryless channel has input X and output Y. Given the channel conditional distribution p_{Y|X}(y|x) and an input distribution p_X(x), we often compute the mutual information I(X;Y). It is convenient to write mutual information (3.2) as:

I(X;Y) = \sum_{x ∈ X} \sum_{y ∈ Y} p_{Y|X}(y|x) p_X(x) \log \frac{p_{Y|X}(y|x)}{p_Y(y)}    (3.16)


Find the mutual information I(X;Y) for the DMC given by:

p_{Y|X}(y|x) = [ 0.89  0.11
                 0.11  0.89 ]    (3.17)

(rows correspond to x = 0, 1) with input distribution p_X(x) = [1/4, 3/4].

To apply (3.16), first find p_Y(y):

p_Y(y) = \sum_{x ∈ X} p_{Y|X}(y|x) p_X(x) = [1/4  3/4] · [ 0.89  0.11
                                                           0.11  0.89 ]    (3.18)
       = [0.305  0.695]    (3.19)

Then,

I(X;Y) = 0.89 · (1/4) \log (0.89/0.305) + 0.11 · (1/4) \log (0.11/0.695)
       + 0.11 · (3/4) \log (0.11/0.305) + 0.89 · (3/4) \log (0.89/0.695)
       = 0.3874

Later, we will show that for this DMC and this input distribution, the maximum communications capacity is 0.3874 bits per channel use.
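The computation of I(X;Y) above can be checked numerically; a minimal Matlab sketch, with the double sum in (3.16) written as explicit loops for clarity:

pygx = [0.89 0.11 ; 0.11 0.89];    % rows are x, columns are y
px   = [1/4 3/4];
py   = px * pygx;                  % [0.305 0.695]
I = 0;
for x = 1:2
    for y = 1:2
        I = I + px(x) * pygx(x,y) * log2( pygx(x,y) / py(y) );
    end
end
I                                  % 0.3874 bits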


3.2 Kullback-Leibler Distance (Relative Entropy)

The information divergence, relative entropy, or Kullback-Leibler distance D(p||q) is a measure of a "distance" between two distributions p(x) and q(x). It is helpful to think of p(x) as a true distribution, and q(x) as an approximating distribution. It is a measure of the information lost when q is used to approximate p.

Definition. The Kullback-Leibler distance D(p||q) between two probability distribution functions p(x) and q(x) is defined as:

D(p||q) = \sum_{x ∈ X} p(x) \log \frac{p(x)}{q(x)}    (3.20)

with the conventions 0 \log (0/q) = 0 and p \log (p/0) = ∞.

The KL divergence is a fundamental distance measure of importance. It will be used to prove several properties of entropy and mutual information.

KL divergence is non-negative:

D(p||q) ≥ 0    (3.21)

This is an important property and will be proved in a future chapter. KL divergence is not symmetric in p and q, that is, in general:

D(p||q) ≠ D(q||p).    (3.22)
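A small numerical example illustrates the asymmetry. A minimal Matlab sketch; the two distributions p and q below are hypothetical, chosen only for illustration (neither has zero entries, so the conventions above are not needed):

p = [1/2 1/4 1/4];
q = [1/3 1/3 1/3];
Dpq = sum( p .* log2(p ./ q) )    % D(p||q), approximately 0.0850 bits
Dqp = sum( q .* log2(q ./ p) )    % D(q||p), approximately 0.0817 bits

The two values differ, showing that D(p||q) is not symmetric in p and q.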

The mutual information I(X;Y) is the relative entropy between the joint distribution p_{XY}(x,y) and the product distribution p_X(x) p_Y(y):

I(X;Y) = \sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log \frac{p_{XY}(x,y)}{p_X(x) p_Y(y)}    (3.23)
       = D( p_{XY}(x,y) || p_X(x) p_Y(y) )    (3.24)

3.2.1 Consequences of Non-Negativity of KL divergence

The non-negativity of KL divergence allows proving a few results that were stated earlier.

The non-negativity of mutual information follows from the non-negativity of KL divergence: I(X;Y) = D(p(x,y) || p(x)p(y)) ≥ 0.

Theorem — Uniform distribution maximizes entropy. Let X take on values from X. Then H(X) ≤ \log |X|, with equality if and only if X has a uniform distribution over X.

Proof: Let u(x) = 1/|X| be the uniform probability distribution function over X, and let p_X(x) be the probability distribution for X. Then,

D( p_X(x) || u(x) ) = \sum_{x ∈ X} p_X(x) \log \frac{p_X(x)}{u(x)}    (3.25)
                    = \sum_{x ∈ X} p_X(x) \log \frac{1}{u(x)} + \sum_{x ∈ X} p_X(x) \log p_X(x)    (3.26)
                    = \log |X| - H(X).    (3.27)

Hence, by the non-negativity of relative entropy,

0 ≤ D( p_X(x) || u(x) ) = \log |X| - H(X)    (3.28)

3.3 Data Processing Inequality and Markov Chains

The data processing inequality expresses the idea that processing cannot increase information. First, a brief introduction to Markov chains is given.

3.3.1 Markov Chains

Here a three-variable Markov chain is described (later, general Markov chains that use a long sequence of random variables will be described). Let X, Y and Z be jointly distributed random variables. If the conditional probability p_{Z|XY}(z|x,y) does not change if X is dropped:

Pr(Z = z | X = x, Y = y) = Pr(Z = z | Y = y)    (3.29)
p_{Z|XY}(z|x,y) = p_{Z|Y}(z|y)    (3.30)

then X → Y → Z forms a Markov chain. The idea of Markovity is expressed by "the future (Z) depends on the present (Y) and not the past (X)."

The random variables X, Y and Z are jointly distributed as p_{XYZ}(x,y,z). By the definition of conditional probability:

p_{XYZ}(x,y,z) = p_{Z|XY}(z|x,y) · p_{XY}(x,y)    (3.31)
               = p_{Z|XY}(z|x,y) · p_{Y|X}(y|x) · p_X(x)    (3.32)

Since the dependence on X can be dropped, an equivalent condition to form a Markov chain is that the joint probability distribution factors as:

p_{XYZ}(x,y,z) = p_{Z|Y}(z|y) p_{Y|X}(y|x) p_X(x)    (3.33)

Here are three properties of Markov chains:

(1) X → Y → Z if and only if X and Z are conditionally independent given Y, that is:

p_{XZ|Y}(x,z|y) = p_{XYZ}(x,y,z) / p_Y(y) = p_{XY}(x,y) p_{Z|Y}(z|y) / p_Y(y) = p_{X|Y}(x|y) p_{Z|Y}(z|y)

(2) X → Y → Z implies Z → Y → X.

(3) If g is a function of Y, then X → Y → g(Y) forms a Markov chain.


3.3.2 Data Processing Inequality

Theorem — Data Processing Inequality. If X → Y → Z then I(X;Y) ≥ I(X;Z).

Proof: By the chain rule: I(X;Y,Z) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y). Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0. Since I(X;Y|Z) ≥ 0, we have:

I(X;Y) ≥ I(X;Z)    (3.34)

Note also that if X → Y → Z then I(X;Y|Z) ≤ I(X;Y). Observation reduces dependence between random variables.

Also, since Z → Y → X also forms a Markov chain, I(Y;Z) ≥ I(X;Z) also holds.

If Z = g(Y), then X → Y → Z, specifically X → Y → g(Y), and therefore I(X;Y) ≥ I(X; g(Y)). As a result, a deterministic function g cannot increase information.

3.4 Descriptions Using Expectation

Entropy, mutual information and KL divergence can be described using expectation.

Recall that for a random variable X with distribution p_X(x) and a function g, the expectation of g(X) is given by:

E[g(X)] = \sum_{x ∈ X} p_X(x) g(x).    (3.35)

If we take the function g(x) = -\log p_X(x), then:

E[g(X)] = -\sum_{x ∈ X} p_X(x) \log p_X(x),    (3.36)

which is the entropy H(X), that is:

H(X) = -E[ \log p_X(X) ]    (3.37)

Take a close look at (3.37). The expectation is taken with respect to X, and the expectation contains p_X(X) and not p_X(x). Cover and Thomas called this expression "eerily self-referential," and it certainly is!

Similarly, the joint entropy H(X,Y) of discrete random variables X and Y is expressed by:

H(X,Y) = -E[ \log p_{XY}(X,Y) ]    (3.38)
       = -\sum_{x ∈ X} \sum_{y ∈ Y} p_{X,Y}(x,y) \log p_{X,Y}(x,y)    (3.39)


and conditional entropy:

H(Y|X) = -\sum_{x ∈ X} \sum_{y ∈ Y} p_{XY}(x,y) \log p_{Y|X}(y|x)    (3.40)
       = -E[ \log p_{Y|X}(Y|X) ]    (3.41)

The mutual information I(X;Y) can be expressed as:

I(X;Y) = E[ \log \frac{p_{X,Y}(X,Y)}{p_X(X) p_Y(Y)} ]    (3.42)

And the KL divergence is expressed as:

D(p||q) = E[ \log \frac{p(X)}{q(X)} ]    (3.43)
        = \sum_{x ∈ X} p_X(x) \log \frac{p(x)}{q(x)}    (3.44)

Conditional mutual information is expressed as:

I(X;Y|Z) = H(X|Z) - H(X|Y,Z)    (3.45)
         = E[ \log \frac{p_{XY|Z}(X,Y|Z)}{p_{X|Z}(X|Z) p_{Y|Z}(Y|Z)} ]    (3.46)


Chapter 4

Source Coding for Independent Vector Sources

This section considers source coding of random vector sources:

X = (X_1, X_2, ..., X_n)    (4.1)

where the X_i are independent and identically distributed. The central result is that there exists a source code C with expected length L(C), such that:

H(X) ≤ L(C) ≤ H(X) + ε',    (4.2)

where ε' > 0 can be made as small as needed, as n gets large. This improves the result for single sources, that is, H(X) ≤ L(C) ≤ H(X) + 1, where the upper bound included +1.

To achieve this, this lecture covers the following steps:

• Introduce sample entropy and typical sets,

• describe a "vector compression scheme" and a theorem that its expected length satisfies (4.2),

• introduce the asymptotic equipartition property (AEP), and

• prove the theorem using the AEP.

A future lecture shows a similar result that holds even for vector sources which are not independent, specifically when X is a stationary stochastic process.


4.1 Sample Entropy and Typical Sets

4.1.1 Sample Entropy

The probability review lecture for random vectors X = X_1 X_2 ··· X_n compared the true mean E[X_i] with the sample mean of realizations x_1 x_2 ··· x_n:

\bar{x}_n = (1/n) \sum_{i=1}^{n} x_i.    (4.3)

As n gets larger, the sample mean tends to concentrate around the true mean.

In the same way, there is a sample entropy for a realization x distributed as p_X(x), and the sample entropies will tend to concentrate around the true entropy H(X) as n gets large.

Definition. For any fixed sequence x, jointly distributed according to p_X(x), the sample entropy is defined as:

-(1/n) \log p_X(x).    (4.4)

For the important case when X_1 X_2 ... X_n are iid with marginal distribution p_X(x_i), the sample entropy of the sequence x = x_1 x_2 ... x_n is:

-(1/n) \log p_X(x) = -(1/n) \log \prod_{i=1}^{n} p_X(x_i) = -(1/n) \sum_{i=1}^{n} \log p_X(x_i)    (4.5)

since p_X(x) = p_X(x_1) p_X(x_2) ··· p_X(x_n). Note that (4.3) is not unlike (4.5), but an important difference is that the sample entropy is given with respect to the distribution p_X(x).

Example. Consider n = 4 iid binary random variables X_1 X_2 X_3 X_4 with p_X(x) = [3/4, 1/4]. Find the sample entropy of all possible sequences.

Start with x = 0100. Using (4.5), the sample entropy of x = 0100 is:

-(1/4) \sum_{i=1}^{4} \log p_X(x_i) = -(1/4) ( 3 \log (3/4) + \log (1/4) ) = 2 - (3/4) \log 3 ≈ 0.8113.    (4.6)

Any sequence with 3 zeros and 1 one has the same sample entropy of 0.8113.

The possible values for p_X(x) are:

{0.31641, 0.10547, 0.03516, 0.01172, 0.00391},    (4.7)

which correspond to sequences with 0, 1, 2, 3 and 4 ones, respectively. The corresponding sample entropies are:

{0.41504, 0.81128, 1.20752, 1.60376, 2.0}.    (4.8)

A list of all n = 4 sequences, their probability p_X(x), and their sample entropy are shown in Table 4.1.
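The values in (4.7) and (4.8) can be generated with a few lines of Matlab. A minimal sketch that computes the probability and sample entropy of a sequence with z ones, for z = 0, ..., 4:

n = 4; p = 1/4;
for z = 0:n                                  % z = number of ones in the sequence
    prob = p^z * (1-p)^(n-z);                % probability of one such sequence
    sampleH = -(1/n) * log2(prob);           % sample entropy of that sequence
    fprintf('%d ones: p(x) = %.5f, sample entropy = %.5f\n', z, prob, sampleH);
end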


4.1.2 Typical Sets and Typical Sequences

Typical sets and typical sequences are central to proofs in information theory. Roughly speaking, a sequence x = x_1 x_2 ... x_n is an ε-typical sequence if its sample entropy (4.5) is within ε of the true entropy H(X). The set of all typical sequences is called the typical set. For a given n and ε, the typical set is denoted T_ε^{(n)}.

Definition. Consider a random vector X = X_1 X_2 ··· X_n distributed as p_X(x). For a parameter ε ≥ 0, a typical sequence x = (x_1, x_2, ..., x_n) ∈ X^n is any sequence that satisfies:

2^{-n(H(X)+ε)} ≤ p_X(x) ≤ 2^{-n(H(X)-ε)}, or    (4.9)
H(X) - ε ≤ -(1/n) \log p_X(x) ≤ H(X) + ε.    (4.10)

Another way to write this is:

| H(X) - ( -(1/n) \log p_X(x) ) | ≤ ε,    (4.11)

showing the sample entropy is within ε of the true entropy H(X).

Definition. The typical set T_ε^{(n)} is the set of all typical sequences, that is, all those sequences in X^n that satisfy (4.10). The size of the typical set is |T_ε^{(n)}|.

For a typical sequence x ∈ T_ε^{(n)}, the probability of that sequence p_X(x) is close to 2^{-nH(X)}. The probability that a randomly drawn vector is in the typical set is Pr(X ∈ T_ε^{(n)}), which can be written as

Pr(X ∈ T_ε^{(n)}) = \sum_{x ∈ T_ε^{(n)}} p_X(x).    (4.12)

By summing over x ∈ T_ε^{(n)} in (4.9),

|T_ε^{(n)}| 2^{-n(H(X)+ε)} ≤ Pr(X ∈ T_ε^{(n)}) ≤ |T_ε^{(n)}| 2^{-n(H(X)-ε)}    (4.13)

since \sum_{x ∈ T_ε^{(n)}} 1 = |T_ε^{(n)}|.

Consider binary random vectors X. The typical set consists of sequences with a, a+1, ..., d ones, and the size of the typical set will be:

|T_ε^{(n)}| = \sum_{i=a}^{d} \binom{n}{i}.    (4.14)

Example. Continue the previous example of a binary random vector with n = 4. The true entropy H(X_i) is 0.8113. For some ε, the typical set T_ε^{(n)} is the set of those sequences that have sample entropy -(1/n) \log p_X(x) within ε of H(X). For a small ε = 0.01, the typical set is:

T_{0.01}^{(4)} = {1000, 0100, 0010, 0001},    (4.15)


Table 4.1: List of all sequences for a binary random variable sequence X^4 with p = 1/4. * indicates sequences in T_{0.01}^{(4)}. † indicates sequences in T_{0.45}^{(4)}.

x1 x2 x3 x4         p(x1,x2,x3,x4)   -(1/n) log p(x1,x2,x3,x4)
0  0  0  0   †      0.31641          0.41504
1  0  0  0   † *    0.10547          0.81128
0  1  0  0   † *    0.10547          0.81128
0  0  1  0   † *    0.10547          0.81128
0  0  0  1   † *    0.10547          0.81128
1  1  0  0   †      0.03516          1.20752
1  0  1  0   †      0.03516          1.20752
0  1  1  0   †      0.03516          1.20752
1  0  0  1   †      0.03516          1.20752
0  1  0  1   †      0.03516          1.20752
0  0  1  1   †      0.03516          1.20752
1  1  1  0          0.01172          1.60376
1  1  0  1          0.01172          1.60376
1  0  1  1          0.01172          1.60376
0  1  1  1          0.01172          1.60376
1  1  1  1          0.00391          2.00000

Figure 4.1: Probability versus -(1/n) log p_{X^4}(x1, x2, x3, x4), for sequences of 0, 1, 2, 3 and 4 ones, with H(X), H(X) - ε and H(X) + ε marked.


that is, all sequences with 1 one. The size of the typical set is clearly |T_ε^{(n)}| = 4:

|T_{0.01}^{(4)}| = \sum_{i=1}^{1} \binom{4}{i} = 4    (4.16)

For a larger ε = 0.45 the typical set is expanded to include:

T_{0.45}^{(4)} = {all sequences with 0, 1 or 2 ones}.    (4.17)

In that case, the size of the typical set is:

|T_{0.45}^{(4)}| = \sum_{i=0}^{2} \binom{4}{i}    (4.18)
                 = 1 + 4 + 6 = 11    (4.19)

The probability of a sequence with a given sample entropy is shown in Fig. 4.1. As n gets larger, the typical sequences will concentrate near the entropy, as shown in Fig. 4.2. As n → ∞ we can make the probability that a randomly drawn sequence is in the typical set go to 1.

The set X^n can be partitioned into two sets, the typical set T_ε^{(n)} and its complement \bar{T}_ε^{(n)}. For the example above with n = 4 and ε = 0.01,

X^n = T_ε^{(n)} ∪ \bar{T}_ε^{(n)}

where the typical set is T_ε^{(n)} = {0001, 0010, 0100, 1000} and its complement contains the remaining sequences {0000, 0011, 0101, 1001, 0110, 1010, 1100, 0111, 1011, 1101, 1110, 1111}.
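Membership in the typical set can also be computed directly from the definition (4.10). A minimal Matlab sketch for n = 4 and p = 1/4, printing the typical set sizes for ε = 0.01 and ε = 0.45:

n = 4; p = 1/4;
H = -p*log2(p) - (1-p)*log2(1-p);                     % true entropy, 0.8113
for epsilon = [0.01 0.45]
    count = 0;
    for z = 0:n                                        % z = number of ones
        sampleH = -(1/n)*log2( p^z * (1-p)^(n-z) );    % sample entropy of such a sequence
        if abs(sampleH - H) <= epsilon
            count = count + nchoosek(n,z);             % all sequences with z ones are typical
        end
    end
    fprintf('epsilon = %.2f: typical set size = %d\n', epsilon, count);
end

This reproduces |T_{0.01}^{(4)}| = 4 and |T_{0.45}^{(4)}| = 11 from (4.16) and (4.19).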


Figure 4.2: Sample entropy of x versus probability of x, for a p = 1/4 binary random variable, for n = 8, 16, 32 and 256. For ε = 0.1, the sample entropies in the typical set are shown in red.