
Lecture 5: Measures of Information for Continuous Random Variables

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University

[email protected]

October 26, 2015


From Discrete to Continuous

So far we have focused on discrete (finite-alphabet) r.v.'s:

- Entropy and mutual information for discrete r.v.'s.
- Lossless source coding for discrete stationary sources.
- Channel coding over discrete memoryless channels.

In this lecture and the next two lectures, we extend the basic principles and fundamental theorems to continuous random sources and channels. In particular:

- Mutual information for continuous r.v.'s. (this lecture)
- Lossy source coding for continuous stationary sources. (Lecture 7)
- Gaussian channel capacity. (Lecture 6)


Outline

1. First we investigate the basic information measures – entropy, mutual information, and KL divergence – when the r.v.'s are continuous. We will see that both mutual information and KL divergence are well defined, while the entropy of a continuous r.v. is not.

2. Then, we introduce differential entropy as a continuous r.v.'s counterpart of Shannon entropy, and discuss the related properties.


1 Entropy and Mutual Information

2 Differential Entropy


Entropy of a Continuous Random Variable

Question: What is the entropy of a continuous real-valued random variable X? Suppose X has the probability density function (p.d.f.) f(x).

Let us discretize X to answer this question, as follows:

- Partition R into length-∆ intervals:  R = ⋃_{k=−∞}^{∞} [k∆, (k+1)∆).
- Suppose that f(x) is continuous; then by the mean-value theorem,
  ∀ k ∈ Z, ∃ x_k ∈ [k∆, (k+1)∆) such that f(x_k) = (1/∆) ∫_{k∆}^{(k+1)∆} f(x) dx.
- Set [X]_∆ ≜ x_k if X ∈ [k∆, (k+1)∆), with p.m.f. p(x_k) = f(x_k)∆.


[Figure: the p.d.f. f(x) of X, where F_X(x) ≜ P{X ≤ x} and f(x) ≜ dF_X(x)/dx.]

[Figures: the p.d.f. f(x) partitioned into length-∆ intervals, with a representative point x_k (e.g., x_1, x_3, x_5) marked in each interval.]


Observation: lim_{∆→0} H([X]_∆) = H(X) (intuitively), while

H([X]_∆) = −∑_{k=−∞}^{∞} (f(x_k)∆) log(f(x_k)∆)
         = −∆ ∑_{k=−∞}^{∞} f(x_k) log f(x_k) − log ∆
         → −∫_{−∞}^{∞} f(x) log f(x) dx + ∞ = ∞   as ∆ → 0.

Hence, H(X) = ∞ if −∫_{−∞}^{∞} f(x) log f(x) dx = E[log 1/f(X)] exists.

It is quite intuitive that the entropy of a continuous random variable can be arbitrarily large, because it can take infinitely many possible values.
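To see the blow-up concretely, here is a small numerical sketch (an addition, not from the slides; it assumes numpy/scipy and uses natural logarithms). Quantizing X ∼ N(0,1) with bin width ∆ gives H([X]_∆) ≈ h(X) + log(1/∆), which diverges as ∆ → 0 while H([X]_∆) + log ∆ stays near the finite integral:

```python
# Minimal sketch (not from the slides): discretize X ~ N(0,1) into length-Delta
# bins; H([X]_Delta) grows like log(1/Delta), while H([X]_Delta) + log(Delta)
# approaches -int f log f dx = 0.5*log(2*pi*e) (nats).
import numpy as np
from scipy.stats import norm

def discretized_entropy(delta, lo=-10.0, hi=10.0):
    """Entropy (nats) of the quantized variable [X]_Delta for X ~ N(0, 1)."""
    edges = np.arange(lo, hi + delta, delta)
    p = np.diff(norm.cdf(edges))   # bin probabilities P{X in [k*D, (k+1)*D)}
    p = p[p > 0]
    return -np.sum(p * np.log(p))

h_X = 0.5 * np.log(2 * np.pi * np.e)   # differential entropy of N(0, 1)
for delta in [0.5, 0.1, 0.02, 0.004]:
    H = discretized_entropy(delta)
    print(f"Delta={delta:6.3f}  H={H:8.4f}  H+log(Delta)={H + np.log(delta):7.4f}  h(X)={h_X:7.4f}")
```

Here the bin probabilities are computed exactly from the CDF rather than via the mean-value point; for small ∆ the two agree.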


Mutual Information between Continuous Random Variables

How about the mutual information between two continuous r.v.'s X and Y, with joint p.d.f. f_{X,Y}(x, y) and marginal p.d.f.'s f_X(x) and f_Y(y)?

Again, we use discretization:

- Partition the plane R² into ∆×∆ squares: R² = ⋃_{k,j=−∞}^{∞} I_k^∆ × I_j^∆, where I_k^∆ ≜ [k∆, (k+1)∆).
- Suppose that f_{X,Y}(x, y) is continuous; then by the mean-value theorem (MVT), ∀ k, j ∈ Z, ∃ (x_k, y_j) ∈ I_k^∆ × I_j^∆ such that
  f_{X,Y}(x_k, y_j) = (1/∆²) ∫_{I_k^∆ × I_j^∆} f_{X,Y}(x, y) dx dy.
- Set ([X]_∆, [Y]_∆) ≜ (x_k, y_j) if (X, Y) ∈ I_k^∆ × I_j^∆, with p.m.f. p(x_k, y_j) = f_{X,Y}(x_k, y_j)∆².


By the MVT, ∀ k, j ∈ Z, ∃ x_k ∈ I_k^∆ and y_j ∈ I_j^∆ such that

p(x_k) = ∫_{I_k^∆} f_X(x) dx = ∆ f_X(x_k),    p(y_j) = ∫_{I_j^∆} f_Y(y) dy = ∆ f_Y(y_j).

Observation: lim_{∆→0} I([X]_∆ ; [Y]_∆) = I(X ; Y) (intuitively), while

I([X]_∆ ; [Y]_∆) = ∑_{k,j=−∞}^{∞} p(x_k, y_j) log [ p(x_k, y_j) / (p(x_k) p(y_j)) ]
                 = ∑_{k,j=−∞}^{∞} (f_{X,Y}(x_k, y_j)∆²) log [ f_{X,Y}(x_k, y_j)∆² / (f_X(x_k) f_Y(y_j)∆²) ]
                 = ∆² ∑_{k,j=−∞}^{∞} f_{X,Y}(x_k, y_j) log [ f_{X,Y}(x_k, y_j) / (f_X(x_k) f_Y(y_j)) ]
                 → ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / (f_X(x) f_Y(y)) ] dx dy   as ∆ → 0.

Hence, I(X ; Y) = E[ log f_{X,Y}(X,Y) / (f_X(X) f_Y(Y)) ] if the improper integral exists.
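As a sanity check (an addition, not from the slides; assumes numpy/scipy, nats), the sketch below evaluates the discretized double sum on a ∆×∆ grid for a jointly Gaussian pair with correlation ρ and compares it with the known closed form I(X;Y) = −½ log(1 − ρ²):

```python
# Sketch: discretized mutual information on a Delta x Delta grid for a
# bivariate Gaussian with correlation rho, versus -0.5*log(1 - rho^2) (nats).
import numpy as np
from scipy.stats import multivariate_normal, norm

rho, delta, L = 0.8, 0.05, 6.0
centers = np.arange(-L, L, delta) + delta / 2       # grid points x_k, y_j

# Cell probabilities p(x_k, y_j) ~ f_XY(x_k, y_j) * Delta^2 (midpoint rule),
# and marginal cell probabilities p(x_k) ~ f_X(x_k) * Delta.
X, Y = np.meshgrid(centers, centers, indexing="ij")
f_joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
p_joint = f_joint.pdf(np.dstack([X, Y])) * delta**2
p_marg = norm.pdf(centers) * delta
p_prod = np.outer(p_marg, p_marg)

mask = p_joint > 0
I_disc = np.sum(p_joint[mask] * np.log(p_joint[mask] / p_prod[mask]))
print(I_disc, -0.5 * np.log(1 - rho**2))            # both ~0.5108 nats
```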


Mutual Information

Unlike entropy, which is only well-defined for discrete random variables, in general we can define the mutual information between two real-valued random variables (not necessarily continuous or discrete) as follows.

Definition 1 (Mutual information)
The mutual information between two random variables X and Y is defined as
I(X ; Y) = sup_{P,Q} I([X]_P ; [Y]_Q),
where the supremum is taken over all pairs of partitions P and Q of R.

Similar to mutual information, KL divergence can also be defined between two probability measures, no matter whether the probability distributions are discrete, continuous, etc.

Remark: Although defining information measures in such a general way is nice, these definitions do not provide explicit ways to compute these information measures.



Differential Entropy

For continuous r.v.'s, it turns out to be useful to define the counterparts of entropy and conditional entropy, as follows:

Definition 2 (Differential entropy and conditional differential entropy)
The differential entropy of a continuous r.v. X with p.d.f. f(x) is defined as
h(X) ≜ E[ log 1/f(X) ],
if the (improper) integral exists.
The conditional differential entropy of a continuous r.v. X given Y, where (X, Y) has joint p.d.f. f(x, y) and conditional p.d.f. f(x|y), is defined as
h(X | Y) ≜ E[ log 1/f(X|Y) ],
if the (improper) integral exists.

We have the following theorem immediately from the previous discussion:

Theorem 1 (Mutual information between two continuous r.v.'s)
I(X ; Y) = E[ log f(X,Y) / (f(X) f(Y)) ] = h(X) − h(X | Y).
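For instance (an addition, not from the slides; assumes numpy, nats), Theorem 1 can be checked by Monte Carlo for a jointly Gaussian pair, using the facts that X ∼ N(0,1) and X | Y = y ∼ N(ρy, 1 − ρ²):

```python
# Sketch: estimate h(X), h(X|Y) and their difference for a jointly Gaussian
# (X, Y) with correlation rho, and compare with I(X;Y) = -0.5*log(1 - rho^2).
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x, y = xy[:, 0], xy[:, 1]

def log_gauss_pdf(z, var):
    return -0.5 * np.log(2 * np.pi * var) - z**2 / (2 * var)

h_X = -np.mean(log_gauss_pdf(x, 1.0))                           # E[log 1/f(X)]
h_X_given_Y = -np.mean(log_gauss_pdf(x - rho * y, 1 - rho**2))  # X|Y ~ N(rho*Y, 1-rho^2)
print(h_X - h_X_given_Y, -0.5 * np.log(1 - rho**2))             # both ~0.51 nats
```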


Kullback-Leibler Divergence

Definition 3 (KL divergence between densities)
The Kullback-Leibler divergence between two probability density functions f(x) and g(x) is defined as
D(f ∥ g) ≜ E[ log f(X)/g(X) ],
if the (improper) integral exists. The expectation is taken over the r.v. X ∼ f(x).

By Jensen’s inequality, it is straightforward to see that the non-negativityof KL divergence remains.

Proposition 1 (Non-negativity of KL divergence)
D(f ∥ g) ≥ 0, with equality iff f = g almost everywhere (i.e., except on a set of zero probability).

Note: D(f ∥ g) is finite only if the support of f(x) is contained in the support of g(x).
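A quick way to get a feel for Definition 3 (an addition, not from the slides; assumes numpy/scipy, nats) is to integrate f log(f/g) numerically for two Gaussian densities and compare with the well-known closed form:

```python
# Sketch: D(f || g) for f = N(0,1), g = N(1,4), by numerical integration,
# versus the closed form log(s2/s1) + (s1^2 + (m1-m2)^2)/(2*s2^2) - 1/2 (nats).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
f, g = norm(m1, s1).pdf, norm(m2, s2).pdf

kl_numeric, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -20.0, 20.0)
kl_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(kl_numeric, kl_closed)   # both ~0.4431
```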


Properties that Extend to Continuous R.V.’s

Proposition 2 (Chain rule)
h(X, Y) = h(X) + h(Y | X),    h(X^n) = ∑_{i=1}^{n} h(X_i | X^{i−1}).

Proposition 3 (Conditioning reduces differential entropy)
h(X | Y) ≤ h(X),    h(X | Y, Z) ≤ h(X | Z).

Proposition 4 (Non-negativity of mutual information)
I(X ; Y) ≥ 0,    I(X ; Y | Z) ≥ 0.
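A minimal closed-form check of the chain rule (an addition, not from the slides; it uses the Gaussian differential entropy formulas, the joint one being derived in Example 3 below):

```python
# Sketch: chain rule h(X,Y) = h(X) + h(Y|X) for a jointly Gaussian pair with
# unit variances and correlation rho (nats), using closed-form entropies.
import numpy as np

rho = 0.6
K = np.array([[1.0, rho], [rho, 1.0]])
h_XY = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_X = 0.5 * np.log(2 * np.pi * np.e)
h_Y_given_X = 0.5 * np.log(2 * np.pi * np.e * (1 - rho**2))   # Y|X ~ N(rho*X, 1-rho^2)
print(h_XY, h_X + h_Y_given_X)   # equal up to floating-point error
```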


Examples

Example 1 (Differential entropy of a uniform r.v.)
For a r.v. X ∼ Unif[a, b], that is, with p.d.f. f_X(x) = (1/(b−a)) 1{a ≤ x ≤ b}, the differential entropy is h(X) = log(b − a).

Example 2 (Differential entropy of N(0, 1))
For a r.v. X ∼ N(0, 1), that is, with p.d.f. f_X(x) = (1/√(2π)) e^{−x²/2}, the differential entropy is h(X) = (1/2) log(2πe).
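Both examples are easy to confirm numerically (an addition, not from the slides; assumes numpy/scipy, nats) by integrating −f log f directly:

```python
# Sketch: verify h(Unif[a,b]) = log(b-a) and h(N(0,1)) = 0.5*log(2*pi*e)
# by numerical integration of -f(x)*log(f(x)) (nats).
import numpy as np
from scipy.stats import norm, uniform
from scipy.integrate import quad

a, b = 2.0, 5.0
f_unif = uniform(loc=a, scale=b - a).pdf
h_unif, _ = quad(lambda x: -f_unif(x) * np.log(f_unif(x)), a, b)
print(h_unif, np.log(b - a))                    # both ~1.0986

f_gauss = norm(0.0, 1.0).pdf
h_gauss, _ = quad(lambda x: -f_gauss(x) * np.log(f_gauss(x)), -10.0, 10.0)
print(h_gauss, 0.5 * np.log(2 * np.pi * np.e))  # both ~1.4189
```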


New Properties of Differential Entropy

Differential entropy can be negative.
Since b − a can be made arbitrarily small, h(X) = log(b − a) can be negative. Hence, the non-negativity of entropy does not extend to differential entropy.

Scaling will change the differential entropy.
Consider X ∼ Unif[0, 1]. Then 2X ∼ Unif[0, 2]. Hence,
h(X) = log 1 = 0,   h(2X) = log 2 = 1   =⇒   h(X) ≠ h(2X).
This is in sharp contrast to entropy: H(X) = H(g(X)) as long as g(·) is an invertible function.


Scaling and Translation

Proposition 5 (Scaling and Translation in the Scalar Case)
Let X be a continuous random variable with differential entropy h(X).
- Translation does not change the differential entropy: for a constant c, h(X + c) = h(X).
- Scaling shifts the differential entropy: for a constant a ≠ 0, h(aX) = h(X) + log|a|.

Proposition 6 (Scaling and Translation in the Vector Case)
Let X be a continuous random vector with differential entropy h(X).
- For a constant vector c, h(X + c) = h(X).
- For an invertible matrix A ∈ R^{n×n}, h(AX) = h(X) + log|det A|.

The proofs of these propositions are left as exercises (simple calculus).
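A rough empirical check of the scalar scaling property (an addition, not from the slides; assumes numpy, nats). It uses a simple histogram plug-in estimate of differential entropy, so the numbers are approximate:

```python
# Sketch: check h(aX) ~ h(X) + log|a| for X ~ Unif[0,1] and a = 3,
# using a crude histogram-based estimate of differential entropy (nats).
import numpy as np

def hist_entropy(samples, bins=400):
    counts, edges = np.histogram(samples, bins=bins)
    p = counts / counts.sum()      # bin probabilities
    width = np.diff(edges)         # bin widths
    mask = p > 0
    # -sum p_k * log(density estimate in bin k), where density ~ p_k / width_k
    return -np.sum(p[mask] * np.log(p[mask] / width[mask]))

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=500_000)
a = 3.0
print(hist_entropy(x), 0.0)            # h(X) = log(1 - 0) = 0
print(hist_entropy(a * x), np.log(a))  # h(aX) = h(X) + log(a) ~ 1.0986
```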


Differential Entropy of Gaussian Random Vectors

Example 3 (Differential entropy of Gaussian random vectors)
For an n-dim random vector X ∼ N(m, K), the differential entropy is h(X) = (1/2) log( (2πe)^n det K ).

sol: For an n-dim random vector X ∼ N(m, K), we can rewrite X as
X = AW + m,
where AA^T = K and W consists of i.i.d. W_i ∼ N(0, 1), i = 1, …, n. Hence, by the translation and scaling properties of differential entropy:
h(X) = h(W) + log|det A| = ∑_{i=1}^{n} h(W_i) + (1/2) log(det K)
     = (n/2) log(2πe) + (1/2) log(det K) = (1/2) log( (2πe)^n det K ).
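The formula can also be cross-checked against scipy (an addition, not from the slides; scipy.stats.multivariate_normal reports differential entropy in nats):

```python
# Sketch: compare 0.5*log((2*pi*e)^n * det K) with scipy's Gaussian entropy.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
K = B @ B.T + n * np.eye(n)        # a random positive-definite covariance
m = rng.standard_normal(n)

closed_form = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
print(closed_form, multivariate_normal(mean=m, cov=K).entropy())   # equal
```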


Maximum Differential Entropy

The uniform distribution maximizes entropy for a r.v. with finite support. For differential entropy, the maximization problem needs to be associated with constraints on the distribution (otherwise, it is easy to make it infinite).

The following theorem asserts that, under a second-moment constraint, a zero-mean Gaussian maximizes the differential entropy.

Theorem 2 (Maximum Differential Entropy under Covariance Constraint)
Let X be a random vector with mean m and covariance matrix
E[ (X − m)(X − m)^T ] = K,
and let X_G be Gaussian with the same covariance K. Then,
h(X) ≤ h(X_G) = (1/2) log( (2πe)^n det K ).


pf: First, we can assume WLOG that both X and X_G are zero-mean. Let the p.d.f. of X be f(x) and the p.d.f. of X_G be f_G(x). Hence,

0 ≤ D(f ∥ f_G) = E_f[log f(X)] − E_f[log f_G(X)] = −h(X) − E_f[log f_G(X)].

Note that log f_G(x) is a quadratic function of x, and X, X_G have the same second moment. Hence,

E_f[log f_G(X)] = E_{f_G}[log f_G(X)] = −h(X_G),

=⇒ 0 ≤ D(f ∥ f_G) = −h(X) + h(X_G)  =⇒  h(X) ≤ h(X_G).
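To illustrate Theorem 2 (an addition, not from the slides; closed forms only, nats): fixing the variance to 1, the Gaussian indeed has the largest differential entropy among a few familiar densities.

```python
# Sketch: with unit variance, compare differential entropies (nats) against
# the Gaussian upper bound 0.5*log(2*pi*e) from Theorem 2.
import numpy as np

h_gauss   = 0.5 * np.log(2 * np.pi * np.e)   # ~1.4189
h_uniform = 0.5 * np.log(12.0)               # Unif on an interval of length sqrt(12): ~1.2425
h_laplace = 1.0 + 0.5 * np.log(2.0)          # Laplace with scale b = 1/sqrt(2): 1 + log(2b) ~ 1.3466
print(h_gauss, h_uniform, h_laplace)         # Gaussian is the largest
```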


Summary


- Mutual information between two continuous r.v.'s X and Y with joint density f_{X,Y}:
  I(X ; Y) = E[ log f_{X,Y}(X,Y) / (f_X(X) f_Y(Y)) ].
- Differential entropy and conditional differential entropy:
  h(X) ≜ E[ log 1/f_X(X) ],    h(X | Y) ≜ E[ log 1/f_{X|Y}(X|Y) ].
- I(X ; Y) = h(X) − h(X | Y) = h(Y) − h(Y | X).
- KL divergence between densities f and g: D(f ∥ g) ≜ E_f[ log f(X)/g(X) ].
- Chain rule, conditioning reduces differential entropy, and non-negativity of mutual information and KL divergence: these all continue to hold.
- Differential entropy can be negative; moreover, h(X) ≤ h(X, Y) does not hold in general.
