
Lecture 6 - Information theory

Jan Bouda

FI MU

April 19, 2009


Part I

Uncertainty and entropy


Uncertainty

Given a random experiment, it is natural to ask how uncertain we are about its outcome. Compare two experiments: tossing an unbiased coin and throwing a fair six-sided die. The first experiment has two possible outcomes and the second has six, and both have the uniform probability distribution. Our intuition says that we are more uncertain about an outcome of the second experiment.

Next, compare tossing an ideal coin with a binary message source emitting 0 and 1, each with probability 1/2. Intuitively, we should expect the uncertainty about an outcome of each of these experiments to be the same. Therefore the uncertainty should be based only on the probability distribution and not on the concrete sample space.

Consequently, the uncertainty about a particular random experiment can be specified as a function of the probability distribution {p1, p2, . . . , pn}, and we will denote it H(p1, p2, . . . , pn).


Uncertainty - requirements

1 Let us fix the number of outcomes of an experiment and compare the uncertainty of different probability distributions. A natural requirement is that the most uncertain experiment is the one with the uniform probability distribution, i.e. H(p1, . . . , pn) is maximal for p1 = · · · = pn = 1/n.

2 A permutation of the probability distribution does not change the uncertainty, i.e. for any permutation π : {1, . . . , n} → {1, . . . , n} it holds that H(p1, p2, . . . , pn) = H(pπ(1), pπ(2), . . . , pπ(n)).

3 Uncertainty should be nonnegative and equal to zero if and only if we are sure about the outcome of the experiment: H(p1, p2, . . . , pn) ≥ 0, and it is equal to zero if and only if pi = 1 for some i.

4 If we include into an experiment an outcome with zero probability, this does not change our uncertainty, i.e. H(p1, . . . , pn, 0) = H(p1, . . . , pn).


Uncertainty - requirements

5 As justified before, the uniform probability distribution on n outcomes cannot be more uncertain than the uniform probability distribution on n + 1 outcomes, i.e.

H(1/n, . . . , 1/n) ≤ H(1/(n + 1), . . . , 1/(n + 1)),

where the first H has n arguments and the second n + 1.

6 H(p1, . . . , pn) is a continuous function of its parameters.

7 An experiment consisting of a simultaneous throw of an m-sided and an n-sided die is exactly as uncertain as independent throws of the m-sided and the n-sided die, implying

H(1/(mn), . . . , 1/(mn)) = H(1/m, . . . , 1/m) + H(1/n, . . . , 1/n),

where the arguments number mn, m and n, respectively.


Entropy and uncertainty

8 Let us consider a random choice of one of n + m balls, m being red and n being blue. Let p = ∑_{i=1}^{m} pi be the probability that a red ball is chosen and q = ∑_{i=m+1}^{m+n} pi be the probability that a blue one is chosen. Then the uncertainty about which ball is chosen is the uncertainty whether a red or a blue ball is chosen, plus the weighted uncertainty about which particular ball is chosen given that a red/blue ball was chosen. Formally,

H(p1, . . . , pm, pm+1, . . . , pm+n) = H(p, q) + pH(p1/p, . . . , pm/p) + qH(pm+1/q, . . . , pm+n/q).    (1)

It can be shown that any function satisfying Axioms 1-8 is of the form

H(p1, . . . , pm) = −(log_a 2) ∑_{i=1}^{m} pi log_2 pi,    (2)

showing that the function is defined uniquely up to multiplication by a constant, which effectively changes only the base of the logarithm.


Entropy and uncertainty

Alternatively, we may show that the function H(p1, . . . , pm) of Eq. (2) is uniquely specified by the axioms

1 H(1/2, 1/2) = 1.

2 H(p, 1 − p) is a continuous function of p.

3 H(p1, . . . , pm) = H(p1 + p2, p3, . . . , pm) + (p1 + p2)H(p1/(p1 + p2), p2/(p1 + p2)).


Entropy

The function H(p1, . . . , pn) we informally introduced is called the (Shannon) entropy and, as justified above, it measures our uncertainty about an outcome of an experiment.

Definition

Let X be a random variable with probability distribution p(x). Then the (Shannon) entropy of the random variable X is defined as

H(X) = − ∑_{x∈Im(X)} P(X = x) log P(X = x).

In the definition we use the convention that 0 log 0 = 0, which is justified by lim_{x→0} x log x = 0. Alternatively, we may sum only over nonzero probabilities.
As explained above, all required properties are independent of multiplication by a constant, which only changes the base of the logarithm in the definition of the entropy. Therefore, in the rest of this part we will use the logarithm without an explicit base. In case we want to measure information in bits, we should use logarithm base 2.
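As a concrete illustration (ours, not part of the original slides), the definition translates directly into a few lines of Python; the helper name entropy and the example distributions are purely illustrative.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a finite probability vector."""
    assert abs(sum(probs) - 1.0) < 1e-9
    return -sum(p * log2(p) for p in probs if p > 0)   # terms with p = 0 are skipped (0 log 0 = 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: fair coin
print(entropy([1/6] * 6))    # ~2.585 bits: fair six-sided die
```

The two calls reproduce the coin and die comparison from the beginning of this part.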


Entropy

Let φ : R → R be a function. Let us recall that the expectation of the transformed random variable is E[φ(X)] = ∑_{x∈Im(X)} φ(x)P(X = x).
Using this formalism we may write most of the information-theoretic quantities. In particular, the entropy can be expressed as

H(X) = E[log(1/p(X))],

where p(x) = P(X = x).

Lemma

H(X) ≥ 0.

Proof.

0 < p(x) ≤ 1 implies log(1/p(x)) ≥ 0.


Part II

Joint and Conditional entropy


Joint entropy

In order to examine the entropy of more complex random experiments described by correlated random variables, we have to introduce the entropy of a pair (or n-tuple) of random variables.

Definition

Let X and Y be random variables distributed according to the probability distribution p(x, y) = P(X = x, Y = y). We define the joint (Shannon) entropy of random variables X and Y as

H(X, Y) = − ∑_{x∈Im(X)} ∑_{y∈Im(Y)} p(x, y) log p(x, y),

or, alternatively,

H(X, Y) = −E[log p(X, Y)] = E[log(1/p(X, Y))].


Conditional entropy

An important question is how uncertain we are about an outcome of a random variable X given an outcome of a random variable Y. Naturally, our uncertainty about an outcome of X given Y = y is

H(X|Y = y) = − ∑_{x∈Im(X)} P(X = x|Y = y) log P(X = x|Y = y).    (3)

The uncertainty about an outcome of X given an (unspecified) outcome of Y is naturally defined as the sum of the quantities (3) weighted according to P(Y = y), i.e.


Conditional Entropy

Definition

Let X and Y be random variables distributed according to the probability distribution p(x, y) = P(X = x, Y = y). Let us denote p(x|y) = P(X = x|Y = y). The conditional entropy of X given Y is

H(X|Y) = ∑_{y∈Im(Y)} p(y)H(X|Y = y)
       = − ∑_{y∈Im(Y)} p(y) ∑_{x∈Im(X)} p(x|y) log p(x|y)
       = − ∑_{x∈Im(X)} ∑_{y∈Im(Y)} p(x, y) log p(x|y)
       = −E[log p(X|Y)].    (4)
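To make the definitions concrete, here is a small Python sketch (ours, not from the lecture) that computes H(X, Y) and H(X|Y) directly from a joint probability table; the table and the function names are illustrative only.

```python
from math import log2

def joint_entropy(p):
    """H(X,Y) in bits for a joint table p[x][y]."""
    return -sum(v * log2(v) for row in p for v in row if v > 0)

def cond_entropy_x_given_y(p):
    """H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), with p(x|y) = p(x,y)/p(y)."""
    p_y = [sum(p[x][y] for x in range(len(p))) for y in range(len(p[0]))]
    return -sum(p[x][y] * log2(p[x][y] / p_y[y])
                for x in range(len(p)) for y in range(len(p[0])) if p[x][y] > 0)

# rows index x, columns index y
p = [[1/8,  1/16, 1/32, 1/32],
     [1/16, 1/8,  1/32, 1/32],
     [1/16, 1/16, 1/16, 1/16],
     [1/4,  0,    0,    0   ]]
print(joint_entropy(p))            # 3.375 bits
print(cond_entropy_x_given_y(p))   # 1.625 bits
```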


Conditional Entropy

Using the previous definition, we may ask how much information we learn on average about X given an outcome of Y. Naturally, we may interpret it as the decrease of our uncertainty about X when we learn the outcome of Y, i.e. H(X) − H(X|Y). Analogously, the amount of information we obtain when we learn the outcome of X is H(X).

Theorem (Chain rule of conditional entropy)

H(X, Y) = H(Y) + H(X|Y).


Chain rule of conditional entropy

Proof.

H(X, Y) = − ∑_{x∈Im(X)} ∑_{y∈Im(Y)} p(x, y) log p(x, y)
        = − ∑_{x,y} p(x, y) log[p(y)p(x|y)]
        = − ∑_{x,y} p(x, y) log p(y) − ∑_{x,y} p(x, y) log p(x|y)
        = − ∑_{y∈Im(Y)} p(y) log p(y) − ∑_{x,y} p(x, y) log p(x|y)
        = H(Y) + H(X|Y).    (5)


Chain rule of conditional entropy

Proof.

Alternatively, we may use log p(X, Y) = log p(Y) + log p(X|Y) and take the expectation on both sides to get the desired result.

Corollary (Conditioned chain rule)

H(X, Y|Z) = H(Y|Z) + H(X|Y, Z).

Note that in general H(Y|X) ≠ H(X|Y). On the other hand, H(X) − H(X|Y) = H(Y) − H(Y|X), showing that information is symmetric.
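A quick numeric sanity check (ours, not from the lecture) of the chain rule and of this symmetry, on an arbitrarily chosen joint distribution:

```python
from math import log2, isclose

def H(probs):
    return -sum(q * log2(q) for q in probs if q > 0)

p = {('a', 0): 0.30, ('a', 1): 0.10,    # joint distribution p(x, y)
     ('b', 0): 0.15, ('b', 1): 0.45}

p_x = {x: sum(v for (x2, _), v in p.items() if x2 == x) for (x, _) in p}
p_y = {y: sum(v for (_, y2), v in p.items() if y2 == y) for (_, y) in p}

H_XY, H_X, H_Y = H(p.values()), H(p_x.values()), H(p_y.values())
H_X_given_Y = -sum(v * log2(v / p_y[y]) for (x, y), v in p.items())
H_Y_given_X = -sum(v * log2(v / p_x[x]) for (x, y), v in p.items())

assert isclose(H_XY, H_Y + H_X_given_Y)                # chain rule
assert isclose(H_X - H_X_given_Y, H_Y - H_Y_given_X)   # information is symmetric
print(H_X - H_X_given_Y)                               # the common value is I(X;Y)
```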


Part III

Relative Entropy and Mutual Information


Relative entropy

Let us start with the definition of the relative entropy, which measures the inefficiency of assuming that a given distribution is q(x) when the true distribution is p(x).

Definition

The relative entropy or Kullback-Leibler distance between two probability distributions p(x) and q(x) is defined as

D(p‖q) = ∑_{x∈Im(X)} p(x) log( p(x) / q(x) ) = E[log( p(X) / q(X) )].

In the definition we use the conventions 0 log(0/q) = 0 and p log(p/0) = ∞. Importantly, the relative entropy is always nonnegative, and it is zero if and only if p(x) = q(x). It is not a distance in the mathematical sense, since it is not symmetric in its arguments and it does not satisfy the triangle inequality.
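A minimal sketch (ours, not from the slides) implementing this definition, including the two conventions; it also illustrates that D(p‖q) vanishes for p = q and that D is not symmetric:

```python
from math import log2, inf

def kl_divergence(p, q):
    """D(p||q) in bits for two probability vectors of the same length."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue           # convention: 0 log(0/q) = 0
        if qi == 0:
            return inf         # convention: p log(p/0) = infinity
        d += pi * log2(pi / qi)
    return d

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))                          # > 0
print(kl_divergence(p, p))                          # 0.0: zero when the distributions agree
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: D is not symmetric
```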


Mutual information

Mutual information measures the information one random variable contains about another random variable. It is the decrease of the uncertainty about an outcome of a random variable given an outcome of another random variable, as already discussed above.

Definition

Let X and Y be random variables distributed according to the probability distribution p(x, y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product of the marginal distributions:

I(X; Y) = ∑_{x∈Im(X)} ∑_{y∈Im(Y)} p(x, y) log( p(x, y) / (p(x)p(y)) )
        = D(p(x, y)‖p(x)p(y)) = E[log( p(X, Y) / (p(X)p(Y)) )].    (6)
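The following illustrative Python sketch (ours) computes I(X; Y) exactly as this relative entropy between the joint distribution and the product of the marginals; the example joint distributions are made up for illustration.

```python
from math import log2

def mutual_information(p):
    """I(X;Y) = D(p(x,y) || p(x)p(y)) in bits, for a joint dict {(x, y): prob}."""
    p_x, p_y = {}, {}
    for (x, y), v in p.items():
        p_x[x] = p_x.get(x, 0.0) + v
        p_y[y] = p_y.get(y, 0.0) + v
    return sum(v * log2(v / (p_x[x] * p_y[y])) for (x, y), v in p.items() if v > 0)

# a correlated pair: Y equals X with probability 0.9
print(mutual_information({(0, 0): 0.45, (0, 1): 0.05,
                          (1, 0): 0.05, (1, 1): 0.45}))   # ~0.531 bits
# an independent pair: the joint factorizes, so the mutual information is zero
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))   # 0.0 bits
```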


Mutual Information and Entropy

Theorem

I(X; Y) = H(X) − H(X|Y).

Proof.

I(X; Y) = ∑_{x,y} p(x, y) log( p(x, y) / (p(x)p(y)) ) = ∑_{x,y} p(x, y) log( p(x|y) / p(x) )
        = − ∑_{x,y} p(x, y) log p(x) + ∑_{x,y} p(x, y) log p(x|y)
        = − ∑_{x} p(x) log p(x) − ( − ∑_{x,y} p(x, y) log p(x|y) )
        = H(X) − H(X|Y).    (7)


Mutual information

From symmetry we also get I(X; Y) = H(Y) − H(Y|X): X says about Y as much as Y says about X. Using H(X, Y) = H(X) + H(Y|X) we get

Theorem

I(X; Y) = H(X) + H(Y) − H(X, Y).

Note that I(X; X) = H(X) − H(X|X) = H(X).


Part IV

Properties of Entropy and Mutual Information


General Chain Rule for Entropy

Theorem

Let X1,X2, . . . ,Xn be random variables. Then

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).

Proof.

We use repeated application of the chain rule for a pair of random variables

H(X1, X2) = H(X1) + H(X2|X1),
H(X1, X2, X3) = H(X1) + H(X2, X3|X1)
             = H(X1) + H(X2|X1) + H(X3|X2, X1),
...    (8)

H(X1, X2, . . . , Xn) = H(X1) + H(X2|X1) + · · · + H(Xn|Xn−1, . . . , X1)
                      = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).
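As a numeric check (ours, not from the lecture), the sketch below verifies the general chain rule for three binary random variables, computing each conditional entropy directly from its definition rather than through the chain rule itself; all names and the random joint distribution are illustrative.

```python
import itertools, random
from math import log2, isclose

random.seed(0)
outcomes = list(itertools.product([0, 1], repeat=3))     # outcomes (x1, x2, x3)
w = [random.random() for _ in outcomes]
p = {o: wi / sum(w) for o, wi in zip(outcomes, w)}        # random joint p(x1, x2, x3)

def marginal(p, idx):
    """Marginal distribution of the coordinates listed in idx (in that order)."""
    m = {}
    for o, v in p.items():
        key = tuple(o[i] for i in idx)
        m[key] = m.get(key, 0.0) + v
    return m

def cond_entropy(p, target, given):
    """H(X_target | X_given) = -sum p(g, t) log p(t|g); coordinates are index lists."""
    joint, prior = marginal(p, given + target), marginal(p, given)
    return -sum(v * log2(v / prior[key[:len(given)]])
                for key, v in joint.items() if v > 0)

H_joint = -sum(v * log2(v) for v in p.values() if v > 0)
chain = (cond_entropy(p, [0], []) + cond_entropy(p, [1], [0])
         + cond_entropy(p, [2], [0, 1]))
assert isclose(H_joint, chain)
print(H_joint, chain)
```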


Conditional Mutual Information

Definition

The conditional mutual information between random variables X and Y given Z is defined as

I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = E[log( p(X, Y|Z) / (p(X|Z)p(Y|Z)) )],

where the expectation is taken over p(x, y, z).

Theorem (Chain rule for mutual information)

I(X1, X2, . . . , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y|Xi−1, . . . , X1).


Conditional Relative Entropy

Definition

The conditional relative entropy is the average of the relative entropies between the conditional probability distributions p(y|x) and q(y|x), weighted by the probability distribution p(x). Formally,

D(p(y|x)‖q(y|x)) = ∑_x p(x) ∑_y p(y|x) log( p(y|x) / q(y|x) ) = E[log( p(Y|X) / q(Y|X) )].

The relative entropy between two joint distributions can be expanded as the sum of a relative entropy and a conditional relative entropy.

Theorem (Chain rule for relative entropy)

D(p(x, y)‖q(x, y)) = D(p(x)‖q(x)) + D(p(y|x)‖q(y|x)).


Chain Rule for Relative Entropy

Proof.

D(p(x, y)‖q(x, y)) = ∑_x ∑_y p(x, y) log( p(x, y) / q(x, y) )
                   = ∑_x ∑_y p(x, y) log( p(x)p(y|x) / (q(x)q(y|x)) )
                   = ∑_{x,y} p(x, y) log( p(x) / q(x) ) + ∑_{x,y} p(x, y) log( p(y|x) / q(y|x) )
                   = D(p(x)‖q(x)) + D(p(y|x)‖q(y|x)).    (9)
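A short numeric verification (ours, not from the slides) of this chain rule on two small joint distributions with full support:

```python
from math import log2, isclose

p = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def marg_x(d):
    return {x: d[(x, 0)] + d[(x, 1)] for x in (0, 1)}

px, qx = marg_x(p), marg_x(q)

D_joint = sum(v * log2(v / q[k]) for k, v in p.items())
D_marg = sum(v * log2(v / qx[x]) for x, v in px.items())
# conditional relative entropy D(p(y|x) || q(y|x)), averaged over p(x)
D_cond = sum(p[(x, y)] * log2((p[(x, y)] / px[x]) / (q[(x, y)] / qx[x]))
             for x in (0, 1) for y in (0, 1))

assert isclose(D_joint, D_marg + D_cond)
print(D_joint, D_marg + D_cond)
```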


Part V

Jensen’s inequality


Convex and concave functions

Before introducing Jensen's inequality, let us briefly recall the definitions of convex and concave functions, which are crucial in this part.

Definition

A function f(x) is said to be convex on a set S if for every x1, x2 ∈ S and 0 ≤ λ ≤ 1

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).

A function is strictly convex if, for x1 ≠ x2, the equality holds only if λ = 0 or λ = 1. A function f is concave if −f is convex. A function f is strictly concave if −f is strictly convex.

Theorem

If the function has a second derivative which is nonnegative (positive) everywhere, then the function is convex (strictly convex).


Convex and Concave Functions

Proof.

We use the Taylor series expansion of the function around x0:

f(x) = f(x0) + f′(x0)(x − x0) + (f′′(x∗)/2)(x − x0)²,

where x∗ lies between x0 and x. By our initial assumption the term f′′(x∗) is always nonnegative, and the same holds for the last addend. Let x0 = λx1 + (1 − λ)x2 with λ ∈ [0, 1] and take x = x1, giving

f(x1) ≥ f(x0) + f′(x0)[(1 − λ)(x1 − x2)].    (10)

Similarly, taking x = x2 we obtain

f(x2) ≥ f(x0) + f′(x0)[λ(x2 − x1)].    (11)


Multiplying (10) by λ and (11) by (1 − λ) and adding, we obtain the convexity:

λf(x1) + (1 − λ)f(x2) ≥ λ(f(x0) + f′(x0)[(1 − λ)(x1 − x2)]) + (1 − λ)(f(x0) + f′(x0)[λ(x2 − x1)])
                      = λf(x0) + (1 − λ)f(x0) + λ(1 − λ)f′(x0)(x1 − x2) − (1 − λ)λf′(x0)(x1 − x2)
                      = f(x0) = f(λx1 + (1 − λ)x2).

The proof for the strict convexity is analogous.


Jensen’s Inequality

The last theorem immediately shows the strict convexity of x^2, e^x and x log x for x ≥ 0, and the strict concavity of log x and √x for x ≥ 0.

The following inequality is behind most of the fundamental theorems in information theory, and in mathematics in general.

Theorem (Jensen’s inequality)

If f is a convex function and X is a random variable, then

E[f(X)] ≥ f(E(X)).    (12)

Moreover, if f is strictly convex, then equality in (12) implies that X = E(X) with probability 1, i.e. X is a constant.
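A tiny illustration (ours, not from the slides) of the inequality for the convex function f(x) = x², on a three-point random variable with made-up values and probabilities:

```python
xs = [1.0, 2.0, 6.0]          # values of X
ps = [0.5, 0.3, 0.2]          # probabilities

f = lambda t: t ** 2                          # a convex function
E_fX = sum(p * f(x) for p, x in zip(ps, xs))  # E[f(X)]  (approx. 8.9)
f_EX = f(sum(p * x for p, x in zip(ps, xs)))  # f(E(X))  (approx. 2.3**2 = 5.29)
print(E_fX, f_EX, E_fX >= f_EX)               # the last value is True
```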


Jensen’s Inequality

Proof.

We prove this inequality by induction on the number of elements in Im(X). For a probability distribution on two points we have

E(f(X)) = p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2) = f(E(X)),    (13)

which follows directly from convexity. Suppose the theorem holds for distributions on k − 1 points. Then, putting p′i = pi/(1 − pk) for i = 1, 2, . . . , k − 1, we have

E(f(X)) = ∑_{i=1}^{k} pi f(xi) = pk f(xk) + (1 − pk) ∑_{i=1}^{k−1} p′i f(xi)
        ≥ pk f(xk) + (1 − pk) f( ∑_{i=1}^{k−1} p′i xi )    (14)
        ≥ f( pk xk + (1 − pk) ∑_{i=1}^{k−1} p′i xi )
        = f( ∑_{i=1}^{k} pi xi ) = f(E(X)),

where the first inequality follows from the induction hypothesis and the second one from the convexity of f.


Information Inequality

Theorem (Information inequality)

Let p(x) and q(x), x ∈ X, be two probability distributions. Then

D(p‖q) ≥ 0

with equality if and only if p(x) = q(x) for all x.


Information Inequality

Proof.

Let A = {x |p(x) > 0} be the support set of p(x). Then

−D(p‖q) = − ∑_{x∈A} p(x) log( p(x) / q(x) ) = ∑_{x∈A} p(x) log( q(x) / p(x) )
        ≤ log ∑_{x∈A} p(x) ( q(x) / p(x) )    (∗)
        = log ∑_{x∈A} q(x) ≤ log ∑_{x∈X} q(x)
        = log 1 = 0,    (15)

where (∗) follows from Jensen's inequality.


Since log t is a strictly concave function of t (so that − log t is strictly convex), we have equality in (∗) if and only if q(x)/p(x) is constant on A, say q(x) = c p(x). Equality in the second inequality requires ∑_{x∈A} q(x) = ∑_{x∈X} q(x) = 1, which forces c = 1, i.e. p(x) = q(x) for all x.

Corollary (Nonnegativity of mutual information)

For any two random variables X, Y

I(X; Y) ≥ 0

with equality if and only if X and Y are independent.

Proof.

I(X; Y) = D(p(x, y)‖p(x)p(y)) ≥ 0, with equality if and only if p(x, y) = p(x)p(y), i.e. X and Y are independent.


Consequences of Information Inequality

Corollary

D(p(y|x)‖q(y|x)) ≥ 0

with equality if and only if p(y|x) = q(y|x) for all y and x with p(x) > 0.

Corollary

I(X; Y|Z) ≥ 0

with equality if and only if X and Y are conditionally independent given Z.

Theorem

H(X) ≤ log |Im(X)|, with equality if and only if X has a uniform distribution over Im(X).


Consequences of Information Inequality

Proof.

Let u(x) = 1/|Im(X)| be the uniform probability distribution over Im(X) and let p(x) be the probability distribution of X. Then

D(p‖u) = ∑_x p(x) log( p(x) / u(x) )
       = − ∑_x p(x) log u(x) − ( − ∑_x p(x) log p(x) )
       = log |Im(X)| − H(X).

Nonnegativity of the relative entropy now gives H(X) ≤ log |Im(X)|, with equality if and only if p = u.

Theorem (Conditioning reduces entropy)

H(X|Y) ≤ H(X)

with equality if and only if X and Y are independent.


Consequences of Information Inequality

Proof.

0 ≤ I(X; Y) = H(X) − H(X|Y).

The previous theorem says that, on average, knowledge of a random variable Y reduces our uncertainty about another random variable X. However, there may exist y such that H(X|Y = y) > H(X).
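One possible joint distribution illustrating this (our example, not from the slides) is sketched below: conditioning on the particular value Y = 2 increases the uncertainty about X, while H(X|Y) is still smaller than H(X).

```python
from math import log2

def H(probs):
    return -sum(q * log2(q) for q in probs if q > 0)

p = {(1, 1): 0.0, (2, 1): 3/4,     # joint distribution p(x, y)
     (1, 2): 1/8, (2, 2): 1/8}

p_x = {x: p[(x, 1)] + p[(x, 2)] for x in (1, 2)}   # p(x=1)=1/8, p(x=2)=7/8
p_y = {y: p[(1, y)] + p[(2, y)] for y in (1, 2)}   # p(y=1)=3/4, p(y=2)=1/4

H_X = H(p_x.values())                                         # ~0.544 bits
H_X_given_y2 = H([p[(x, 2)] / p_y[2] for x in (1, 2)])        # X uniform given Y=2: 1 bit
H_X_given_Y = sum(p_y[y] * H([p[(x, y)] / p_y[y] for x in (1, 2)]) for y in (1, 2))

print(H_X_given_y2 > H_X)    # True: conditioning on Y = 2 increases the uncertainty
print(H_X_given_Y <= H_X)    # True: on average conditioning still reduces entropy
```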

Theorem (Independence bound on entropy)

Let X1,X2, . . . ,Xn be drawn according to p(x1, x2, . . . , xn). Then

H(X1, X2, . . . , Xn) ≤ ∑_{i=1}^{n} H(Xi)

with equality if and only if Xi ’s are mutually independent.


Consequences of Information Inequality

Proof.

We use the chain rule for entropy

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1) ≤ ∑_{i=1}^{n} H(Xi),    (16)

where the inequality follows directly from the previous theorem. We have equality if and only if, for every i, Xi is independent of Xi−1, . . . , X1.


Part VI

Log Sum Inequality and Its Applications


Log Sum Inequality

Theorem (Log sum inequality)

For nonnegative numbers a1, a2, . . . , an and b1, b2, . . . , bn it holds that

∑_{i=1}^{n} ai log( ai / bi ) ≥ ( ∑_{i=1}^{n} ai ) log( ∑_{i=1}^{n} ai / ∑_{i=1}^{n} bi )

with equality if and only if ai/bi = const.

In the theorem we again use the conventions that 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.
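A quick numeric illustration (ours, not from the lecture) of the inequality and of its equality case, for arbitrarily chosen positive numbers:

```python
from math import log2

a = [2.0, 5.0, 1.0]
b = [4.0, 1.0, 3.0]

lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * log2(sum(a) / sum(b))
print(lhs, rhs, lhs >= rhs)          # strict inequality here, since a_i/b_i is not constant

# equality case: a_i / b_i is the same for every i (here a_i = 2 * b_i)
a2 = [2 * bi for bi in b]
lhs2 = sum(ai * log2(ai / bi) for ai, bi in zip(a2, b))
rhs2 = sum(a2) * log2(sum(a2) / sum(b))
print(lhs2, rhs2)                    # both sides equal 16.0
```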


Log Sum Inequality

Proof.

Assume WLOG that ai > 0 and bi > 0. The function f(t) = t log t is strictly convex, since f′′(t) = (1/t) log e > 0 for all positive t. We use Jensen's inequality to get

∑_i αi f(ti) ≥ f( ∑_i αi ti )

for αi ≥ 0, ∑_i αi = 1. Setting αi = bi / ∑_{j=1}^{n} bj and ti = ai/bi we obtain

∑_i ( ai / ∑_j bj ) log( ai / bi ) ≥ ( ∑_i ai / ∑_j bj ) log( ∑_i ai / ∑_j bj ),

and multiplying both sides by ∑_j bj yields the desired result.


Consequences of Log Sum Inequality

Theorem

D(p‖q) is convex in the pair (p, q), i.e. if (p1, q1) and (p2, q2) are two pairs of probability distributions, then

D(λp1 + (1− λ)p2‖λq1 + (1− λ)q2) ≤ λD(p1‖q1) + (1− λ)D(p2‖q2)

for all 0 ≤ λ ≤ 1.

Theorem (Concavity of entropy)

H(p) is a concave function of p.

Theorem

Let (X, Y) ∼ p(x, y) = p(x)p(y|x). The mutual information I(X; Y) is a concave function of p(x) for fixed p(y|x) and a convex function of p(y|x) for fixed p(x).


Part VII

Data Processing inequality


Data Processing Inequality

Theorem

X → Y → Z is a Markov chain if and only if X and Z are independent when conditioned on Y, i.e.

p(x, z|y) = p(x|y)p(z|y).

Note that X → Y → Z implies Z → Y → X. Also, if Z = f(Y), then X → Y → Z.

Theorem (Data processing inequality)

If X → Y → Z, then I(X; Y) ≥ I(X; Z).


Data Processing Inequality

Proof.

We expand mutual information using the chain rule in two different ways as

I(X; Y, Z) = I(X; Z) + I(X; Y|Z)
           = I(X; Y) + I(X; Z|Y).    (17)

Since X and Z are conditionally independent given Y, we have I(X; Z|Y) = 0. Since I(X; Y|Z) ≥ 0, we have

I(X; Y) ≥ I(X; Z).
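As an illustration (ours, not from the lecture), the sketch below builds a Markov chain X → Y → Z from two binary symmetric channels with assumed crossover probabilities e1 and e2 and checks that I(X; Z) ≤ I(X; Y):

```python
from math import log2

def mutual_information(p):
    """I(X;Y) in bits for a joint dict {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), v in p.items():
        p_x[x] = p_x.get(x, 0.0) + v
        p_y[y] = p_y.get(y, 0.0) + v
    return sum(v * log2(v / (p_x[x] * p_y[y])) for (x, y), v in p.items() if v > 0)

e1, e2 = 0.1, 0.2                       # assumed channel crossover probabilities
pX = {0: 0.5, 1: 0.5}
pXY = {(x, y): pX[x] * (1 - e1 if x == y else e1) for x in (0, 1) for y in (0, 1)}
# p(x, z) = sum_y p(x, y) p(z|y); Z depends on X only through Y, so X -> Y -> Z
pXZ = {(x, z): sum(pXY[(x, y)] * (1 - e2 if y == z else e2) for y in (0, 1))
       for x in (0, 1) for z in (0, 1)}

print(mutual_information(pXY))   # I(X;Y)
print(mutual_information(pXZ))   # I(X;Z) <= I(X;Y), as the inequality predicts
```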
