
Lecture 5: Measures of Information for Continuous Random Variables

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University

[email protected]

October 26, 2015


From Discrete to Continuous

So far we have focused on discrete (finite-alphabet) r.v.'s:

- Entropy and mutual information for discrete r.v.'s.
- Lossless source coding for discrete stationary sources.
- Channel coding over discrete memoryless channels.

In this lecture and the next two lectures, we extend the basic principles and fundamental theorems to continuous random sources and channels. In particular:

- Mutual information for continuous r.v.'s. (this lecture)
- Lossy source coding for continuous stationary sources. (Lecture 7)
- Gaussian channel capacity. (Lecture 6)


Outline

1. First we investigate the basic information measures – entropy, mutual information, and KL divergence – when the r.v.'s are continuous. We will see that both mutual information and KL divergence are well defined, while the entropy of a continuous r.v. is not.

2. Then, we introduce differential entropy as a continuous r.v.'s counterpart of Shannon entropy, and discuss the related properties.


1 Entropy and Mutual Information

2 Differential Entropy


Entropy of a Continuous Random Variable

Question: What is the entropy of a continuous real-valued random variable X? Suppose X has the probability density function (p.d.f.) f(x).

Let us discretize X to answer this question, as follows:

- Partition R into length-∆ intervals:  R = ⋃_{k=−∞}^{∞} [k∆, (k+1)∆).
- Suppose that f(x) is continuous; then by the mean-value theorem,
  ∀ k ∈ Z, ∃ x_k ∈ [k∆, (k+1)∆) such that f(x_k) = (1/∆) ∫_{k∆}^{(k+1)∆} f(x) dx.
- Set [X]_∆ ≜ x_k if X ∈ [k∆, (k+1)∆), with p.m.f. p(x_k) = f(x_k)∆.


[Figure: the p.d.f. f(x) of X, where F_X(x) ≜ P{X ≤ x} and f(x) ≜ dF_X(x)/dx.]

[Figures: the p.d.f. f(x) partitioned into length-∆ intervals, with a representative point x_k (e.g., x_1, x_3, x_5) marked in each interval.]


Observation: lim_{∆→0} H([X]_∆) = H(X) (intuitively), while

H([X]_∆) = −∑_{k=−∞}^{∞} (f(x_k)∆) log(f(x_k)∆)
         = −∆ ∑_{k=−∞}^{∞} f(x_k) log f(x_k) − log ∆
         → −∫_{−∞}^{∞} f(x) log f(x) dx + ∞ = ∞   as ∆ → 0.

Hence, H(X) = ∞ if −∫_{−∞}^{∞} f(x) log f(x) dx = E[log 1/f(X)] exists.

It is quite intuitive that the entropy of a continuous random variable can be arbitrarily large, because it can take infinitely many possible values.
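To see the blow-up concretely, here is a small numerical sketch (an addition, not from the slides; it assumes numpy/scipy and uses natural logarithms). Quantizing X ∼ N(0,1) with bin width ∆ gives H([X]_∆) ≈ h(X) + log(1/∆), which diverges as ∆ → 0 while H([X]_∆) + log ∆ stays near the finite integral:

```python
# Minimal sketch (not from the slides): discretize X ~ N(0,1) into length-Delta
# bins; H([X]_Delta) grows like log(1/Delta), while H([X]_Delta) + log(Delta)
# approaches -int f log f dx = 0.5*log(2*pi*e) (nats).
import numpy as np
from scipy.stats import norm

def discretized_entropy(delta, lo=-10.0, hi=10.0):
    """Entropy (nats) of the quantized variable [X]_Delta for X ~ N(0, 1)."""
    edges = np.arange(lo, hi + delta, delta)
    p = np.diff(norm.cdf(edges))   # bin probabilities P{X in [k*D, (k+1)*D)}
    p = p[p > 0]
    return -np.sum(p * np.log(p))

h_X = 0.5 * np.log(2 * np.pi * np.e)   # differential entropy of N(0, 1)
for delta in [0.5, 0.1, 0.02, 0.004]:
    H = discretized_entropy(delta)
    print(f"Delta={delta:6.3f}  H={H:8.4f}  H+log(Delta)={H + np.log(delta):7.4f}  h(X)={h_X:7.4f}")
```

Here the bin probabilities are computed exactly from the CDF rather than via the mean-value point; for small ∆ the two agree.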


Mutual Information between Continuous Random Variables

How about the mutual information between two continuous r.v.'s X and Y, with joint p.d.f. f_{X,Y}(x, y) and marginal p.d.f.'s f_X(x) and f_Y(y)?

Again, we use discretization:

- Partition the plane R² into ∆×∆ squares: R² = ⋃_{k,j=−∞}^{∞} I_k^∆ × I_j^∆, where I_k^∆ ≜ [k∆, (k+1)∆).
- Suppose that f_{X,Y}(x, y) is continuous; then by the mean-value theorem (MVT), ∀ k, j ∈ Z, ∃ (x_k, y_j) ∈ I_k^∆ × I_j^∆ such that
  f_{X,Y}(x_k, y_j) = (1/∆²) ∫_{I_k^∆ × I_j^∆} f_{X,Y}(x, y) dx dy.
- Set ([X]_∆, [Y]_∆) ≜ (x_k, y_j) if (X, Y) ∈ I_k^∆ × I_j^∆, with p.m.f. p(x_k, y_j) = f_{X,Y}(x_k, y_j)∆².


By the MVT, ∀ k, j ∈ Z, ∃ x_k ∈ I_k^∆ and y_j ∈ I_j^∆ such that

p(x_k) = ∫_{I_k^∆} f_X(x) dx = ∆ f_X(x_k),    p(y_j) = ∫_{I_j^∆} f_Y(y) dy = ∆ f_Y(y_j).

Observation: lim_{∆→0} I([X]_∆ ; [Y]_∆) = I(X ; Y) (intuitively), while

I([X]_∆ ; [Y]_∆) = ∑_{k,j=−∞}^{∞} p(x_k, y_j) log [ p(x_k, y_j) / (p(x_k) p(y_j)) ]
                 = ∑_{k,j=−∞}^{∞} (f_{X,Y}(x_k, y_j)∆²) log [ f_{X,Y}(x_k, y_j)∆² / (f_X(x_k) f_Y(y_j)∆²) ]
                 = ∆² ∑_{k,j=−∞}^{∞} f_{X,Y}(x_k, y_j) log [ f_{X,Y}(x_k, y_j) / (f_X(x_k) f_Y(y_j)) ]
                 → ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / (f_X(x) f_Y(y)) ] dx dy   as ∆ → 0.

Hence, I(X ; Y) = E[ log f_{X,Y}(X,Y) / (f_X(X) f_Y(Y)) ] if the improper integral exists.
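As a sanity check (an addition, not from the slides; assumes numpy/scipy, nats), the sketch below evaluates the discretized double sum on a ∆×∆ grid for a jointly Gaussian pair with correlation ρ and compares it with the known closed form I(X;Y) = −½ log(1 − ρ²):

```python
# Sketch: discretized mutual information on a Delta x Delta grid for a
# bivariate Gaussian with correlation rho, versus -0.5*log(1 - rho^2) (nats).
import numpy as np
from scipy.stats import multivariate_normal, norm

rho, delta, L = 0.8, 0.05, 6.0
centers = np.arange(-L, L, delta) + delta / 2       # grid points x_k, y_j

# Cell probabilities p(x_k, y_j) ~ f_XY(x_k, y_j) * Delta^2 (midpoint rule),
# and marginal cell probabilities p(x_k) ~ f_X(x_k) * Delta.
X, Y = np.meshgrid(centers, centers, indexing="ij")
f_joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
p_joint = f_joint.pdf(np.dstack([X, Y])) * delta**2
p_marg = norm.pdf(centers) * delta
p_prod = np.outer(p_marg, p_marg)

mask = p_joint > 0
I_disc = np.sum(p_joint[mask] * np.log(p_joint[mask] / p_prod[mask]))
print(I_disc, -0.5 * np.log(1 - rho**2))            # both ~0.5108 nats
```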


Mutual Information

Unlike entropy, which is only well-defined for discrete random variables, in general we can define the mutual information between two real-valued random variables (not necessarily continuous or discrete) as follows.

Definition 1 (Mutual information)
The mutual information between two random variables X and Y is defined as
I(X ; Y) = sup_{P,Q} I([X]_P ; [Y]_Q),
where the supremum is taken over all pairs of partitions P and Q of R.

Similar to mutual information, KL divergence can also be defined between two probability measures, no matter whether the probability distributions are discrete, continuous, etc.

Remark: Although defining information measures in such a general way is nice, these definitions do not provide explicit ways to compute these information measures.



Differential Entropy

For continuous r.v.'s, it turns out to be useful to define the counterparts of entropy and conditional entropy, as follows:

Definition 2 (Differential entropy and conditional differential entropy)
The differential entropy of a continuous r.v. X with p.d.f. f(x) is defined as
h(X) ≜ E[ log 1/f(X) ],
if the (improper) integral exists.
The conditional differential entropy of a continuous r.v. X given Y, where (X, Y) has joint p.d.f. f(x, y) and conditional p.d.f. f(x|y), is defined as
h(X | Y) ≜ E[ log 1/f(X|Y) ],
if the (improper) integral exists.

We have the following theorem immediately from the previous discussion:

Theorem 1 (Mutual information between two continuous r.v.'s)
I(X ; Y) = E[ log f(X,Y) / (f(X) f(Y)) ] = h(X) − h(X | Y).
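For instance (an addition, not from the slides; assumes numpy, nats), Theorem 1 can be checked by Monte Carlo for a jointly Gaussian pair, using the facts that X ∼ N(0,1) and X | Y = y ∼ N(ρy, 1 − ρ²):

```python
# Sketch: estimate h(X), h(X|Y) and their difference for a jointly Gaussian
# (X, Y) with correlation rho, and compare with I(X;Y) = -0.5*log(1 - rho^2).
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x, y = xy[:, 0], xy[:, 1]

def log_gauss_pdf(z, var):
    return -0.5 * np.log(2 * np.pi * var) - z**2 / (2 * var)

h_X = -np.mean(log_gauss_pdf(x, 1.0))                           # E[log 1/f(X)]
h_X_given_Y = -np.mean(log_gauss_pdf(x - rho * y, 1 - rho**2))  # X|Y ~ N(rho*Y, 1-rho^2)
print(h_X - h_X_given_Y, -0.5 * np.log(1 - rho**2))             # both ~0.51 nats
```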


Kullback-Leibler Divergence

Definition 3 (KL divergence between densities)
The Kullback-Leibler divergence between two probability density functions f(x) and g(x) is defined as
D(f ∥ g) ≜ E[ log f(X)/g(X) ],
if the (improper) integral exists. The expectation is taken over the r.v. X ∼ f(x).

By Jensen’s inequality, it is straightforward to see that the non-negativityof KL divergence remains.

Proposition 1 (Non-negativity of KL divergence)
D(f ∥ g) ≥ 0, with equality iff f = g almost everywhere (i.e., except on a set of zero probability).

Note: D(f ∥ g) is finite only if the support of f(x) is contained in the support of g(x).
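A quick way to get a feel for Definition 3 (an addition, not from the slides; assumes numpy/scipy, nats) is to integrate f log(f/g) numerically for two Gaussian densities and compare with the well-known closed form:

```python
# Sketch: D(f || g) for f = N(0,1), g = N(1,4), by numerical integration,
# versus the closed form log(s2/s1) + (s1^2 + (m1-m2)^2)/(2*s2^2) - 1/2 (nats).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
f, g = norm(m1, s1).pdf, norm(m2, s2).pdf

kl_numeric, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -20.0, 20.0)
kl_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(kl_numeric, kl_closed)   # both ~0.4431
```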


Properties that Extend to Continuous R.V.’s

Proposition 2 (Chain rule)
h(X, Y) = h(X) + h(Y | X),    h(X^n) = ∑_{i=1}^{n} h(X_i | X^{i−1}).

Proposition 3 (Conditioning reduces differential entropy)
h(X | Y) ≤ h(X),    h(X | Y, Z) ≤ h(X | Z).

Proposition 4 (Non-negativity of mutual information)
I(X ; Y) ≥ 0,    I(X ; Y | Z) ≥ 0.
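A minimal closed-form check of the chain rule (an addition, not from the slides; it uses the Gaussian differential entropy formulas, the joint one being derived in Example 3 below):

```python
# Sketch: chain rule h(X,Y) = h(X) + h(Y|X) for a jointly Gaussian pair with
# unit variances and correlation rho (nats), using closed-form entropies.
import numpy as np

rho = 0.6
K = np.array([[1.0, rho], [rho, 1.0]])
h_XY = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_X = 0.5 * np.log(2 * np.pi * np.e)
h_Y_given_X = 0.5 * np.log(2 * np.pi * np.e * (1 - rho**2))   # Y|X ~ N(rho*X, 1-rho^2)
print(h_XY, h_X + h_Y_given_X)   # equal up to floating-point error
```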


Examples

Example 1 (Differential entropy of a uniform r.v.)
For a r.v. X ∼ Unif[a, b], that is, with p.d.f. f_X(x) = (1/(b−a)) 1{a ≤ x ≤ b}, the differential entropy is h(X) = log(b − a).

Example 2 (Differential entropy of N(0, 1))
For a r.v. X ∼ N(0, 1), that is, with p.d.f. f_X(x) = (1/√(2π)) e^{−x²/2}, the differential entropy is h(X) = (1/2) log(2πe).
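Both examples are easy to confirm numerically (an addition, not from the slides; assumes numpy/scipy, nats) by integrating −f log f directly:

```python
# Sketch: verify h(Unif[a,b]) = log(b-a) and h(N(0,1)) = 0.5*log(2*pi*e)
# by numerical integration of -f(x)*log(f(x)) (nats).
import numpy as np
from scipy.stats import norm, uniform
from scipy.integrate import quad

a, b = 2.0, 5.0
f_unif = uniform(loc=a, scale=b - a).pdf
h_unif, _ = quad(lambda x: -f_unif(x) * np.log(f_unif(x)), a, b)
print(h_unif, np.log(b - a))                    # both ~1.0986

f_gauss = norm(0.0, 1.0).pdf
h_gauss, _ = quad(lambda x: -f_gauss(x) * np.log(f_gauss(x)), -10.0, 10.0)
print(h_gauss, 0.5 * np.log(2 * np.pi * np.e))  # both ~1.4189
```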


New Properties of Differential Entropy

Differential entropy can be negative.
Since b − a can be made arbitrarily small, h(X) = log(b − a) can be negative. Hence, the non-negativity of entropy does not extend to differential entropy.

Scaling will change the differential entropy.
Consider X ∼ Unif[0, 1]. Then 2X ∼ Unif[0, 2]. Hence,
h(X) = log 1 = 0,   h(2X) = log 2 = 1   =⇒   h(X) ≠ h(2X).
This is in sharp contrast to entropy: H(X) = H(g(X)) as long as g(·) is an invertible function.


Scaling and Translation

Proposition 5 (Scaling and Translation in the Scalar Case)
Let X be a continuous random variable with differential entropy h(X).
- Translation does not change the differential entropy: for a constant c, h(X + c) = h(X).
- Scaling shifts the differential entropy: for a constant a ≠ 0, h(aX) = h(X) + log|a|.

Proposition 6 (Scaling and Translation in the Vector Case)
Let X be a continuous random vector with differential entropy h(X).
- For a constant vector c, h(X + c) = h(X).
- For an invertible matrix A ∈ R^{n×n}, h(AX) = h(X) + log|det A|.

The proofs of these propositions are left as exercises (simple calculus).
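A rough empirical check of the scalar scaling property (an addition, not from the slides; assumes numpy, nats). It uses a simple histogram plug-in estimate of differential entropy, so the numbers are approximate:

```python
# Sketch: check h(aX) ~ h(X) + log|a| for X ~ Unif[0,1] and a = 3,
# using a crude histogram-based estimate of differential entropy (nats).
import numpy as np

def hist_entropy(samples, bins=400):
    counts, edges = np.histogram(samples, bins=bins)
    p = counts / counts.sum()      # bin probabilities
    width = np.diff(edges)         # bin widths
    mask = p > 0
    # -sum p_k * log(density estimate in bin k), where density ~ p_k / width_k
    return -np.sum(p[mask] * np.log(p[mask] / width[mask]))

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=500_000)
a = 3.0
print(hist_entropy(x), 0.0)            # h(X) = log(1 - 0) = 0
print(hist_entropy(a * x), np.log(a))  # h(aX) = h(X) + log(a) ~ 1.0986
```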


Differential Entropy of Gaussian Random Vectors

Example 3 (Differential entropy of Gaussian random vectors)
For an n-dim random vector X ∼ N(m, K), the differential entropy is h(X) = (1/2) log( (2πe)^n det K ).

sol: For an n-dim random vector X ∼ N(m, K), we can rewrite X as
X = AW + m,
where AA^T = K and W consists of i.i.d. W_i ∼ N(0, 1), i = 1, …, n. Hence, by the translation and scaling properties of differential entropy:
h(X) = h(W) + log|det A| = ∑_{i=1}^{n} h(W_i) + (1/2) log(det K)
     = (n/2) log(2πe) + (1/2) log(det K) = (1/2) log( (2πe)^n det K ).
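The formula can also be cross-checked against scipy (an addition, not from the slides; scipy.stats.multivariate_normal reports differential entropy in nats):

```python
# Sketch: compare 0.5*log((2*pi*e)^n * det K) with scipy's Gaussian entropy.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
K = B @ B.T + n * np.eye(n)        # a random positive-definite covariance
m = rng.standard_normal(n)

closed_form = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
print(closed_form, multivariate_normal(mean=m, cov=K).entropy())   # equal
```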


Maximum Differential Entropy

The uniform distribution maximizes entropy for a r.v. with finite support. For differential entropy, the maximization problem needs to be associated with constraints on the distribution (otherwise, it is easy to make it infinite).

The following theorem asserts that, under a second-moment constraint, a zero-mean Gaussian maximizes the differential entropy.

Theorem 2 (Maximum Differential Entropy under Covariance Constraint)
Let X be a random vector with mean m and covariance matrix
E[ (X − m)(X − m)^T ] = K,
and let X_G be Gaussian with the same covariance K. Then,
h(X) ≤ h(X_G) = (1/2) log( (2πe)^n det K ).


pf: First, we can assume WLOG that both X and X_G are zero-mean. Let the p.d.f. of X be f(x) and the p.d.f. of X_G be f_G(x). Hence,

0 ≤ D(f ∥ f_G) = E_f[log f(X)] − E_f[log f_G(X)] = −h(X) − E_f[log f_G(X)].

Note that log f_G(x) is a quadratic function of x, and X, X_G have the same second moment. Hence,

E_f[log f_G(X)] = E_{f_G}[log f_G(X)] = −h(X_G),

=⇒ 0 ≤ D(f ∥ f_G) = −h(X) + h(X_G)  =⇒  h(X) ≤ h(X_G).
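To illustrate Theorem 2 (an addition, not from the slides; closed forms only, nats): fixing the variance to 1, the Gaussian indeed has the largest differential entropy among a few familiar densities.

```python
# Sketch: with unit variance, compare differential entropies (nats) against
# the Gaussian upper bound 0.5*log(2*pi*e) from Theorem 2.
import numpy as np

h_gauss   = 0.5 * np.log(2 * np.pi * np.e)   # ~1.4189
h_uniform = 0.5 * np.log(12.0)               # Unif on an interval of length sqrt(12): ~1.2425
h_laplace = 1.0 + 0.5 * np.log(2.0)          # Laplace with scale b = 1/sqrt(2): 1 + log(2b) ~ 1.3466
print(h_gauss, h_uniform, h_laplace)         # Gaussian is the largest
```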


Summary


- Mutual information between two continuous r.v.'s X and Y with joint density f_{X,Y}:
  I(X ; Y) = E[ log f_{X,Y}(X,Y) / (f_X(X) f_Y(Y)) ].
- Differential entropy and conditional differential entropy:
  h(X) ≜ E[ log 1/f_X(X) ],    h(X | Y) ≜ E[ log 1/f_{X|Y}(X|Y) ].
- I(X ; Y) = h(X) − h(X | Y) = h(Y) − h(Y | X).
- KL divergence between densities f and g: D(f ∥ g) ≜ E_f[ log f(X)/g(X) ].
- Chain rule, conditioning reduces differential entropy, and non-negativity of mutual information and KL divergence: these all continue to hold.
- Differential entropy can be negative; moreover, h(X) ≤ h(X, Y) does not hold in general.
