
Lecture 4: Noisy Channel Coding
Channel Capacity and the Weak Converse; Achievability Proof and Source-Channel Separation

I-Hsiang Wang
Department of Electrical Engineering, National Taiwan University

[email protected]

October 9, 2015


The Channel Coding Problem

[Block diagram: message $w$ → Channel Encoder → $x^N$ → Noisy Channel → $y^N$ → Channel Decoder → $\hat{w}$.]

Meta Description

1. Message: Random message $W \sim \mathrm{Unif}[1:2^K]$.
2. Channel: Consists of an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, and a family of conditional distributions $\{p(y_k \mid x^k, y^{k-1}) \mid k \in \mathbb{N}\}$ determining the stochastic relationship between the output symbol $y_k$ and the input symbol $x_k$ along with all past signals $(x^{k-1}, y^{k-1})$.
3. Encoder: Encodes the message $w$ into a length-$N$ codeword $x^N \in \mathcal{X}^N$.
4. Decoder: Reconstructs the message $w$ from the channel output $y^N$.
5. Efficiency: Maximize the code rate $R \triangleq \frac{K}{N}$ bits/channel use, given a certain decoding criterion.


Decoding Criterion: Vanishing Error Probability


A key performance measure: the error probability $P_e^{(N)} \triangleq \mathbb{P}\{\hat{W} \neq W\}$.

Question: Is it possible to get zero error probability?
Ans: Probably not, unless the channel noise has some special structure.
Following the development of lossless source coding, Shannon turned his attention to answering the following question:

Is it possible to have a sequence of encoder/decoder pairs such that $P_e^{(N)} \to 0$ as $N \to \infty$? If so, what is the largest possible code rate $R$ at which vanishing error probability is possible?

Recall: In lossless source coding, we saw that the infimum of compression rates at which vanishing error probability is possible is $H(\{S_i\})$.


[Figure: trade-off among the rate $R$, the block length $N$, and the probability of error $P_e^{(N)}$.]

Capacity: take $N \to \infty$ and require $P_e^{(N)} \to 0$ $\implies \sup R = C$.
Error exponent: take $N \to \infty$ and fix the rate $R$ $\implies \min P_e^{(N)} \approx 2^{-N E(R)}$.
Finite block length: fix $N$ and require $P_e^{(N)} \le \varepsilon$ $\implies \sup R = C - \sqrt{\frac{V}{N}}\, Q^{-1}(\varepsilon) + O\!\left(\frac{\log N}{N}\right)$.

Remark: For source coding, one can establish a similar framework.


In this lecture we only focus on capacity.

In other words, we ignore the issue of finite block length (FBL). FBL performance can be obtained via techniques extending from the CLT.

We do not pursue finer analysis of the error probability via large deviation techniques either.


Discrete Memoryless Channel (DMC)

In order to demonstrate the key ideas in channel coding, in this lecture we shall focus on discrete memoryless channels (DMC), defined below.

Definition 1 (Discrete Memoryless Channel)
A discrete channel $(\mathcal{X}, \{p(y_k \mid x^k, y^{k-1}) \mid k \in \mathbb{N}\}, \mathcal{Y})$ is memoryless if
$$\forall\, k \in \mathbb{N}, \quad p(y_k \mid x^k, y^{k-1}) = p_{Y|X}(y_k \mid x_k).$$
In other words, $Y_k - X_k - (X^{k-1}, Y^{k-1})$ forms a Markov chain.
Here the conditional p.m.f. $p_{Y|X}$ is called the channel law or channel transition function.

Question: Is our definition of a channel sufficient to specify $p(y^N \mid x^N)$, the stochastic relationship between the channel input (codeword) $x^N$ and the channel output $y^N$?


$$p(y^N \mid x^N) = \frac{p(x^N, y^N)}{p(x^N)}$$
$$p(x^N, y^N) = \prod_{k=1}^{N} p(x_k, y_k \mid x^{k-1}, y^{k-1}) = \prod_{k=1}^{N} p(y_k \mid x^k, y^{k-1})\, p(x_k \mid x^{k-1}, y^{k-1})$$
Hence, we need to further specify $\{p(x_k \mid x^{k-1}, y^{k-1}) \mid k \in \mathbb{N}\}$, which cannot be obtained from $p(x^N)$.

Interpretation: $\{p(x_k \mid x^{k-1}, y^{k-1}) \mid k \in \mathbb{N}\}$ is induced by the encoding function, which implies that the encoder can potentially make use of past channel outputs, i.e., feedback.


DMC without Feedback

[Block diagrams: channel encoder feeding the noisy channel, without feedback (the encoder sees only $w$) vs. with feedback (the encoder also sees the delayed channel output $y^{k-1}$ through a delay element D).]

Suppose that in the system the encoder has no knowledge about the realization of the channel output; then $p(x_k \mid x^{k-1}, y^{k-1}) = p(x_k \mid x^{k-1})$ for all $k \in \mathbb{N}$, and the channel is said to have no feedback. In this case, specifying $\{p(y_k \mid x^k, y^{k-1}) \mid k \in \mathbb{N}\}$ suffices to specify $p(y^N \mid x^N)$.

Proposition 1 (DMC without Feedback)
For a DMC $(\mathcal{X}, p_{Y|X}, \mathcal{Y})$ without feedback, $p(y^N \mid x^N) = \prod_{k=1}^{N} p_{Y|X}(y_k \mid x_k)$.
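
To make the memoryless factorization concrete, here is a minimal sketch (my own, not from the lecture) that samples a channel output sequence $y^N$ from a codeword $x^N$ for a DMC without feedback; the transition-matrix convention W[x, y] = p_{Y|X}(y|x), the function name, and the BSC example are assumptions made for illustration.

```python
import numpy as np

def simulate_dmc(x_seq, W, rng=np.random.default_rng()):
    """Sample y^N from a DMC without feedback: each y_k ~ p_{Y|X}(. | x_k), independently."""
    return np.array([rng.choice(W.shape[1], p=W[x]) for x in x_seq])

# Example: BSC with crossover probability 0.1
W_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
x = np.array([0, 1, 1, 0, 1])
print(simulate_dmc(x, W_bsc))   # e.g. [0 1 1 0 1], with occasional flipped symbols
```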


Overview

In this lecture, we would like to establish the following (informally described) noisy channel coding theorem due to Shannon:

For a DMC $(\mathcal{X}, p_{Y|X}, \mathcal{Y})$, the maximum code rate with vanishing error probability is the channel capacity
$$C \triangleq \max_{p_X(\cdot)} I(X; Y).$$
The above holds regardless of the availability of feedback.

To demonstrate this result, we organize the lecture as follows:
1. Give the problem formulation, state the main theorem, and visit a couple of examples to show how to compute channel capacity.
2. Prove the converse part: an achievable rate cannot exceed $C$.
3. Prove the achievability part with a random coding argument.


Outline
1. Channel Capacity and the Weak Converse
   - Channel Capacity
   - Proof of the Weak Converse
   - Feedback Capacity
2. Achievability Proof and Source-Channel Separation
   - Achievability Proof
   - Source-Channel Separation


Channel Coding without Feedback: Problem Setup


1. A $(2^{NR}, N)$ channel code consists of
   - an encoding function (encoder) $\mathrm{enc}_N : [1:2^K] \to \mathcal{X}^N$ that maps each message $w$ to a length-$N$ codeword $x^N$, where $K \triangleq \lceil NR \rceil$;
   - a decoding function (decoder) $\mathrm{dec}_N : \mathcal{Y}^N \to [1:2^K] \cup \{*\}$ that maps a channel output sequence $y^N$ to a reconstructed message $\hat{w}$ or an error message $*$.
2. The error probability is defined as $P_e^{(N)} \triangleq \mathbb{P}\{\hat{W} \neq W\}$.
3. A rate $R$ is said to be achievable if there exists a sequence of $(2^{NR}, N)$ codes such that $P_e^{(N)} \to 0$ as $N \to \infty$. The channel capacity is defined as $C \triangleq \sup\{R \mid R \text{ is achievable}\}$.


Channel Coding Theorem for Discrete Memoryless Channel

Theorem 1 (Channel Coding Theorem for DMC without Feedback)
The capacity $C$ of the DMC $p(y|x)$ without feedback is given by
$$C = \max_{p(x)} I(X; Y). \tag{1}$$

The capacity formula (1) is intuitive, since $I(X;Y)$ represents the amount of information about the channel input $X$ that one can infer from the channel output $Y$. The maximization over $p(x)$ stands for choosing the best possible input distribution so that the amount of information transfer is maximized.


Rest of the lecture:
1. First we give some examples of noisy channels to show how to compute capacity.
2. Then, we prove that for any rate $R > C$, it is impossible to have vanishing error probability (converse).
3. Finally, we prove that for any $R < C$, there exists a sequence of encoding/decoding schemes such that the error probability vanishes as the block length tends to $\infty$ (achievability), based on a probabilistic argument called random coding.


Binary Symmetric Channel

A binary symmetric channel (BSC) consists of
- binary input/output alphabets $\mathcal{X} = \mathcal{Y} = \{0, 1\}$;
- channel law $p(y|x) = \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix}$.
The capacity of the BSC is $C_{\mathrm{BSC}} = 1 - H_b(p)$.
[Figure: BSC transition diagram; each input is flipped with probability $p$ and passed through unchanged with probability $1-p$.]

To compute the BSC capacity, observe $I(X;Y) = H(Y) - H(Y|X)$, and
- $H(Y|X=0) = H(Y|X=1) = H_b(p) \implies H(Y|X) = H_b(p)$;
- $H(Y) \le \log 2 = 1$, with equality iff $Y$ is uniform.
Question: Is it possible to choose a $p(x)$ such that $Y$ is uniform?
Ans: Yes, choose $X$ to be uniform $\implies C = \max_{p(x)} I(X;Y) = 1 - H_b(p)$.
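
As a quick numerical sanity check (my own sketch, not part of the slides), the following evaluates $I(X;Y) = H(Y) - H(Y|X)$ for the BSC over a grid of input distributions and confirms that the uniform input attains $1 - H_b(p)$; the parameter $q = \mathbb{P}\{X=1\}$ and the function names are assumptions for illustration.

```python
import numpy as np

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def mutual_info_bsc(q, p):
    """I(X;Y) in bits for a BSC(p) with input distribution P{X=1} = q."""
    py1 = q*(1-p) + (1-q)*p          # P{Y=1}
    return Hb(py1) - Hb(p)           # H(Y) - H(Y|X)

p = 0.1
grid = np.linspace(0.0, 1.0, 1001)
best_q = max(grid, key=lambda q: mutual_info_bsc(q, p))
print(best_q, mutual_info_bsc(best_q, p), 1 - Hb(p))   # ≈ 0.5, 0.531, 0.531
```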


Binary Erasure Channel

A binary erasure channel (BEC) consists of
- binary input $\mathcal{X} = \{0, 1\}$ and output with erasure $\mathcal{Y} = \{0, 1, *\}$;
- channel law $p(y|x) = \begin{bmatrix} 1-p & p & 0 \\ 0 & p & 1-p \end{bmatrix}$ (rows indexed by $x \in \{0,1\}$, columns by $y \in \{0, *, 1\}$).
The capacity of the BEC is $C_{\mathrm{BEC}} = 1 - p$.
[Figure: BEC transition diagram; each input is erased to $*$ with probability $p$ and passed through unchanged with probability $1-p$.]

Suppose we begin with $I(X;Y) = H(Y) - H(Y|X)$. Then,
- $H(Y|X=0) = H(Y|X=1) = H_b(p) \implies H(Y|X) = H_b(p)$;
- $H(Y) \le \log 3$, with equality iff $Y$ is uniform.
Question: Is it possible to choose a $p(x)$ such that $Y$ is uniform?
Ans: No. So, we cannot say that $\max_{p(x)} H(Y) = \log 3$.


[Figures: the BEC transition diagram from $X$ to $Y$, and the reverse channel from $Y$ to $X$.]

Instead, we can start with $I(X;Y) = H(X) - H(X|Y)$. Then, we have the reverse channel law
$$p(x|y) = \begin{bmatrix} 1 & 0 \\ \alpha & 1-\alpha \\ 0 & 1 \end{bmatrix}, \quad \text{where } \alpha \triangleq \mathbb{P}\{X = 0\}$$
(rows indexed by $y \in \{0, *, 1\}$, columns by $x \in \{0, 1\}$).
- $H(X|Y=0) = H(X|Y=1) = 0$ and $H(X|Y=*) = H_b(\alpha) = H(X)$ $\implies H(X|Y) = \mathbb{P}\{Y=*\}\, H(X) = p\, H(X)$.
- $H(X) \le 1$, with equality iff $X$ is uniform.
Hence, $C_{\mathrm{BEC}} = \max_{p(x)} (1-p) H(X) = 1 - p$.


Erasure Channel

We can generalize the BEC to the following erasure channel:
- input $\mathcal{X}$, output $\mathcal{Y} = \mathcal{X} \cup \{*\}$;
- channel law $p(y|x) = \begin{cases} 1-p, & y = x \\ p, & y = * \\ 0, & \text{otherwise.} \end{cases}$
A motivation for this model comes from networking, where the erasure $*$ models a packet drop.

Exercise 1
Show that the capacity of the erasure channel is $C_{\mathrm{EC}} = (1-p) \log|\mathcal{X}|$.
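
A small numerical check of the formula stated in Exercise 1 (my own sketch, not from the slides): build the erasure-channel transition matrix for an $m$-ary input, evaluate $I(X;Y)$ at the uniform input, and compare with $(1-p)\log|\mathcal{X}|$. The matrix layout (last column = erasure symbol) and function names are assumptions for illustration.

```python
import numpy as np

def erasure_channel(m, p):
    """Transition matrix W[x, y] for an m-ary erasure channel; the last column is the erasure symbol *."""
    return np.hstack([(1 - p) * np.eye(m), p * np.ones((m, 1))])

def mutual_info(p_x, W):
    """I(X;Y) in bits for input distribution p_x and channel law W[x, y] = p(y|x)."""
    joint = p_x[:, None] * W
    p_y = joint.sum(axis=0)
    mask = joint > 0
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.sum(joint[mask] * np.log2((W / p_y)[mask])))

m, p = 4, 0.3
uniform = np.full(m, 1.0 / m)
print(mutual_info(uniform, erasure_channel(m, p)), (1 - p) * np.log2(m))   # both ≈ 1.4 bits
```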


Symmetric Channel

In computing the capacity of the BSC, we observed that
1. $H(Y|X) = H_b(p)$ regardless of $p(x)$. Why? Because all rows of $p(y|x)$ are permutations of the same probability vector $[\,p \;\; 1-p\,]$.
2. $H(Y) = \log|\mathcal{Y}|$ can be attained, that is, $Y$ can be made uniform by choosing $X$ to be uniform. Why? Because all columns of $p(y|x)$ have the same sum $\sum_x p(y|x)$.

Definition 2 (Symmetric Channel)
A symmetric channel is a channel with channel law $p(y|x)$ satisfying (1) all rows of $p(y|x)$ are permutations of the same probability vector $\mathbf{p}$, and (2) all columns of $p(y|x)$ have the same sum $\sum_x p(y|x)$.

Exercise 2
Show that the capacity of a symmetric channel is $\log|\mathcal{Y}| - H(\mathbf{p})$.


Computing Capacity of DMC via Convex Optimization

For a DMC, we are able to find its capacity efficiently by invoking efficient algorithms for solving convex programs, since $I(X;Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$.

Proposition 2
$I(X;Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$.

pf: By definition, $I(X;Y) = H(Y) - H(Y|X)$.
- $H(Y|X) = \sum_x p(x) H(Y|X=x)$ is a linear function of $p(x)$, because $H(Y|X=x) = -\sum_y p(y|x) \log p(y|x)$ is constant for fixed $p(y|x)$.
- $H(Y)$ is a concave function of $p(y)$, and $p(y)$ is a linear function of $p(x)$ for fixed $p(y|x)$. Hence, $H(Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$.
Putting the above together completes the proof.
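
The slides do not name a specific algorithm; one standard choice that exploits this concavity is the Blahut–Arimoto iteration. Below is a minimal sketch (my own, under the transition-matrix convention W[x, y] = p(y|x)) that estimates $C = \max_{p(x)} I(X;Y)$ for a DMC; the example reproduces the BSC capacity $1 - H_b(0.1) \approx 0.531$ bits.

```python
import numpy as np

def blahut_arimoto(W, iters=500, tol=1e-12):
    """Estimate C = max_{p(x)} I(X;Y) in bits for a DMC with channel law W[x, y] = p(y|x)."""
    n_x = W.shape[0]
    p = np.full(n_x, 1.0 / n_x)                          # start from the uniform input
    for _ in range(iters):
        p_y = p @ W                                      # induced output distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            d = np.where(W > 0, W * np.log(W / p_y), 0.0).sum(axis=1)   # D(W(.|x) || p_y), in nats
        p_new = p * np.exp(d)                            # multiplicative update
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    p_y = p @ W
    with np.errstate(divide="ignore", invalid="ignore"):
        I = np.where(W > 0, p[:, None] * W * np.log2(W / p_y), 0.0).sum()
    return I, p

# Example: BSC with crossover 0.1; expect C = 1 - H_b(0.1) ≈ 0.531 bits at the uniform input
W_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(blahut_arimoto(W_bsc))
```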


Proof of the (Weak) Converse (1)

We would like to show that for every sequence of $(2^{NR}, N)$ codes such that $P_e^{(N)} \to 0$ as $N \to \infty$, the rate $R \le \max_{p(x)} I(X;Y)$.

pf: Note that $W \sim \mathrm{Unif}[1:2^K]$ and hence $K = H(W)$.
$$\begin{aligned}
NR \le H(W) &= I(W; \hat{W}) + H(W \mid \hat{W}) & (2)\\
&\le I(W; Y^N) + \left(1 + P_e^{(N)} \log(2^K + 1)\right) & (3)\\
&\le \sum_{k=1}^{N} I(W; Y_k \mid Y^{k-1}) + \left(1 + P_e^{(N)}(NR + 2)\right) & (4)
\end{aligned}$$
(2) is due to $K = \lceil NR \rceil \ge NR$ and the chain rule.
(3) is due to $W - Y^N - \hat{W}$ and Fano's inequality.
(4) is due to the chain rule and $2^K + 1 \le 2^{NR+1} + 1 \le 2 \times 2^{NR+1}$, so that $\log(2^K + 1) \le NR + 2$.


Proof of the (Weak) Converse (2)

Set $\varepsilon_N \triangleq \frac{1}{N}\left(1 + P_e^{(N)}(NR + 2)\right)$; we see that $\varepsilon_N \to 0$ as $N \to \infty$ because $\lim_{N\to\infty} P_e^{(N)} = 0$.

The next step is to relate $\sum_{k=1}^{N} I(W; Y_k \mid Y^{k-1})$ to $I(X;Y)$, by the following manipulation:
$$\begin{aligned}
I(W; Y_k \mid Y^{k-1}) &\le I(W, Y^{k-1}; Y_k) \le I(W, Y^{k-1}, X_k; Y_k) & (5)\\
&= I(X_k; Y_k) \le \max_{p(x)} I(X; Y) & (6)
\end{aligned}$$
(5) is due to the fact that conditioning reduces entropy.
(6) is due to the DMC: $p(y_k \mid x^k, y^{k-1}, w) = p(y_k \mid x^k, y^{k-1}) = p(y_k \mid x_k)$
$\implies Y_k - X_k - (W, X^{k-1}, Y^{k-1}) \implies Y_k - X_k - (W, Y^{k-1})$.


Proof of the (Weak) Converse (3)

Hence, we have
$$NR \le \sum_{k=1}^{N} I(W; Y_k \mid Y^{k-1}) + N\varepsilon_N \le N \max_{p(x)} I(X;Y) + N\varepsilon_N$$
$$\implies R \le \max_{p(x)} I(X;Y) + \varepsilon_N, \quad \forall\, N.$$
Taking $N \to \infty$, we have: $R \le \max_{p(x)} I(X;Y)$ if $R$ is achievable.

Remark: Similar to the source coding problem, a stronger version of the converse holds in the channel coding problem as well: if $R > C$, then $P_e^{(N)} \to 1$ as $N \to \infty$ for any encoding/decoding functions.


Channel Coding with Feedback: Problem Setup

[Block diagram: as before, but the channel encoder also observes the channel output $y^{k-1}$ through a delay element D (feedback).]

1. A $(2^{NR}, N)$ channel code consists of
   - an encoding function (encoder) $\mathrm{enc}_N : [1:2^K] \times \mathcal{Y}^{N-1} \to \mathcal{X}^N$ that maps each message $w$ to a length-$N$ codeword $x^N$, where $K \triangleq \lceil NR \rceil$, and the $k$-th symbol $x_k$ is a function of $(w, y^{k-1})$ for all $k \in [1:N]$;
   - a decoding function (decoder) $\mathrm{dec}_N : \mathcal{Y}^N \to [1:2^K] \cup \{*\}$ that maps a channel output sequence $y^N$ to a reconstructed message $\hat{w}$ or an error message $*$.
2. The error probability is defined as $P_e^{(N)} \triangleq \mathbb{P}\{\hat{W} \neq W\}$.
3. A rate $R$ is said to be achievable if there exists a sequence of $(2^{NR}, N)$ codes such that $P_e^{(N)} \to 0$ as $N \to \infty$. The channel capacity is defined as $C \triangleq \sup\{R \mid R \text{ is achievable}\}$.


Dependency Graph: Without vs. With Feedback

[Figure: dependency graph of $(W, X^N, Y^N, \hat{W})$ without feedback; each $X_k$ is a function of $W$ only (via $\mathrm{enc}_N$), each $Y_k$ depends on $X_k$ through $p_{Y|X}$, and $\hat{W}$ is a function of $Y^N$ (via $\mathrm{dec}_N$).]


[Figure: dependency graph with feedback; each $X_k$ now depends on $W$ and the past outputs $Y^{k-1}$, while each $Y_k$ still depends on $X_k$ through $p_{Y|X}$.]


Feedback Capacity

Theorem 2 (Channel Coding Theorem for DMC with Feedback)
The capacity of the DMC $p(y|x)$ with feedback is given by (1), the same as that without feedback. In other words, feedback does not increase the channel capacity of a DMC.

The proof is immediate: in the converse proof of the channel coding theorem without feedback, we do not make use of the assumption that there is no feedback. In other words, the proof is identical even with feedback.
Remark: Although feedback does not increase capacity, it does greatly improve the reliability (error exponent) and the finite-blocklength performance. Furthermore, the design and the complexity of the coding scheme may also be greatly simplified and reduced thanks to feedback. The details are outside the scope of this lecture.


Overview

In order to prove the achievability part of Theorem 1, we need to show the following mathematical statement: for all $R < C$, $R \ge 0$, there exists a sequence of $(2^{NR}, N)$ codes such that $\lim_{N\to\infty} P_e^{(N)} = 0$.

In general, to prove the existence of certain objects satisfying some desirable properties, there are two possible ways:
1. Explicitly construct an object and prove that the properties hold.
2. Assume that no object can satisfy the properties, and derive a contradiction.
The achievability proof presented in this lecture is more of the second flavor, and in fact belongs to the so-called probabilistic method.


The Probabilistic Method

What is the probabilistic method? Roughly speaking, in order to show the existence of certain objects satisfying some desirable properties,
- one first imposes a particular probability distribution over the space of possible objects;
- then, by showing that the properties hold "on average" or with non-zero probability, one concludes the existence of such an object.

Example 1
Given a set of $n$-dimensional unit vectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\}$, show that there exist $x_i \in \{\pm 1\}$, $i = 1, \ldots, k$, such that $\left|\sum_{i=1}^{k} x_i \mathbf{v}_i\right| \ge \sqrt{k}$.


pf: Let $\{X_i\}_{i=1}^{k}$ be i.i.d. r.v.'s with $\mathbb{P}\{X_i = 1\} = \mathbb{P}\{X_i = -1\} = \frac{1}{2}$.
Define $\mathbf{V} \triangleq \sum_{i=1}^{k} X_i \mathbf{v}_i$. Compute $\mathbb{E}\left[|\mathbf{V}|^2\right]$ as follows:
$$\mathbb{E}\left[|\mathbf{V}|^2\right] = \mathbb{E}\left[\mathbf{V}^T \mathbf{V}\right] = \mathbb{E}\left[\left(\sum_{i=1}^{k} X_i \mathbf{v}_i^T\right)\left(\sum_{j=1}^{k} X_j \mathbf{v}_j\right)\right] = \mathbb{E}\left[\sum_{i=1}^{k}\sum_{j=1}^{k} X_i X_j\, \mathbf{v}_i^T \mathbf{v}_j\right] = \sum_{i=1}^{k}\sum_{j=1}^{k} \mathbb{E}[X_i X_j]\, \mathbf{v}_i^T \mathbf{v}_j$$
Since $\{X_i\}$ are mutually independent, $\mathbb{E}[X_i X_j] = \mathbb{E}[X_i]\,\mathbb{E}[X_j] = 0$ for $i \neq j$.
Therefore $\mathbb{E}\left[|\mathbf{V}|^2\right] = \sum_{i=1}^{k} \mathbb{E}\left[X_i^2\right] \|\mathbf{v}_i\|^2 = k$.
Hence, there exist $x_i \in \{\pm 1\}$, $i = 1, \ldots, k$, such that $\left|\sum_{i=1}^{k} x_i \mathbf{v}_i\right| \ge \sqrt{k}$. Otherwise, $\mathbb{E}\left[|\mathbf{V}|^2\right]$ would be less than $k$, a contradiction.
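
To see the argument in action, here is a small sketch (my own, not from the slides) that draws random $\pm 1$ sign patterns for $k$ unit vectors, confirms that the empirical average of $|\mathbf{V}|^2$ is about $k$, and exhibits a sign pattern achieving $|\sum_i x_i \mathbf{v}_i| \ge \sqrt{k}$; the dimensions, seed, and number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 8
V = rng.normal(size=(k, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)        # k unit vectors v_1, ..., v_k (rows)

signs = rng.choice([-1, 1], size=(10000, k))         # 10000 random sign patterns
norms_sq = np.linalg.norm(signs @ V, axis=1) ** 2    # |sum_i x_i v_i|^2 for each pattern

print(norms_sq.mean())                               # ≈ k, matching E[|V|^2] = k
best = signs[np.argmax(norms_sq)]
print(np.linalg.norm(best @ V) >= np.sqrt(k))        # True: some pattern reaches sqrt(k)
```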


Paul Erdős (1913 – 1996)


Coding over Noisy Channel

Before we prove the main theorem, let us set up a few notations related to coding over a noisy channel.
1. Codebook: $\mathsf{c} = \{x^N(1), x^N(2), \ldots, x^N(2^K)\}$ consists of the $2^K$ codewords and is the range of the encoding function.
2. ML Decoder (maximum likelihood): the optimal decoder that minimizes the probability of error $P_e^{(N)}$ when the messages are uniformly chosen (uniform prior): $\hat{w}_{\mathrm{ML}} = \arg\max_{w \in [1:2^K]} p(y^N \mid x^N(w))$.
3. Probability of error of message $m$: $\lambda_m \triangleq \mathbb{P}\{\hat{W} \neq W \mid W = m\}$.
In principle, one can derive the ML decoding rule and compute $P_e^{(N)}$ for a given codebook. But there are some challenges toward proving the channel coding theorem.
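
For concreteness: on a BSC with crossover probability $p < 1/2$, the ML rule reduces to minimum Hamming-distance decoding, since $p(y^N \mid x^N) = p^{d_H(x^N, y^N)}(1-p)^{N - d_H(x^N, y^N)}$ decreases in the Hamming distance. A minimal sketch of this special case (my own; the toy codebook and function name are assumptions for illustration):

```python
import numpy as np

def ml_decode_bsc(y, codebook):
    """ML decoding over a BSC(p), p < 1/2: pick the codeword closest to y in Hamming distance."""
    dists = (codebook != y).sum(axis=1)      # Hamming distance to each codeword
    return int(np.argmin(dists))             # message index (0-based)

# Toy codebook with 2^K = 4 messages and block length N = 6
codebook = np.array([[0, 0, 0, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
y = np.array([1, 0, 1, 0, 0, 1])             # noisy observation
print(ml_decode_bsc(y, codebook))             # decodes to the closest codeword, here message index 2
```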


Challenges and Work-Arounds

First, the expression for the ML error probability is usually intractable, and it is hard to obtain any insight regarding its asymptotic behavior. Second, it is unclear how to construct the codebook and the corresponding decoding scheme.
In summary, to prove the achievability part of the channel coding theorem, there are two main challenges we must overcome:
1. How to show the existence of good codebooks?
   We circumvent the issue of explicit construction by using a random coding argument (a kind of probabilistic method).
2. How to analyze the error probability?
   We circumvent the issue of ML decoding error analysis by using a suboptimal decoder and deriving upper bounds on its probability of error.


Proof Program

1. Random Codebook Generation:
   Generate an ensemble of codebooks according to a certain probability distribution. Hence, the codebook $\mathsf{C}$ becomes a random object.
2. Error Probability Analysis:
   Goal: Show that as $N \to \infty$, $\mathbb{E}_{\mathsf{C}}\left[P_{e,\mathrm{ML}}^{(N)}(\mathsf{C})\right] \to 0$, and conclude that there must exist a codebook $\mathsf{c}$ such that the decoding error probability $P_{e,\mathrm{ML}}^{(N)} \to 0$.
   To simplify the analysis, we shall introduce suboptimal decoders and give a tractable upper bound on the error probability using the union of events bound.


Random Codebook Generation

A simple way is to generate the $2^K$ codewords i.i.d., each codeword drawn according to $p(x^N) = \prod_{i=1}^{N} p_X(x_i)$.
In other words, if we stack all $2^K$ codewords together into a $2^K \times N$ matrix $\mathsf{C}$, the elements of the matrix $\mathsf{C}$ will be i.i.d. according to $p_X$ (each row is a codeword):
$$\mathsf{c} = \begin{bmatrix} X_1(1) & X_2(1) & \cdots & X_N(1) \\ X_1(2) & X_2(2) & \cdots & X_N(2) \\ \vdots & \vdots & \ddots & \vdots \\ X_1(2^K) & X_2(2^K) & \cdots & X_N(2^K) \end{bmatrix}, \qquad p(\mathsf{c}) \triangleq \mathbb{P}\{\mathsf{C} = \mathsf{c}\} = \prod_{w=1}^{2^K} \prod_{i=1}^{N} p_X(x_i(w)).$$
It turns out that the symmetry in this codebook ensemble distribution helps simplify the analysis.


Encoding and Decoding

For a realization $\mathsf{c}$ of the random codebook ensemble $\mathsf{C}$, we describe the encoding and decoding methods below.
Encoding: for a message $m \in [1:2^K]$, choose the $m$-th row of the codebook $\mathsf{c}$ and send it out.
Decoding: ideally one would like to use the ML decoding rule
$$\hat{w}_{\mathrm{ML}} = \arg\max_{w \in [1:2^K]} p(y^N \mid x^N(w)).$$
However, the performance of the ML decoder is usually not tractable, as mentioned before. Instead, we introduce a suboptimal decoder based on typical sequences:
$$\hat{w}_{\mathrm{T}} = \text{the unique } w \text{ such that } (x^N(w), y^N) \in \mathcal{T}_\varepsilon^{(N)}(X, Y).$$
Note: other suboptimal decoders can also be used, such as threshold decoders. (A simulation sketch of this random-coding scheme is given below.)
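
The following sketch (my own, not from the slides) walks through the mechanics just described: generate a random codebook, send one codeword over a BSC, and decode with the joint-typicality rule. The parameters are toy assumptions chosen so the simulation runs quickly; the rate here is far below capacity and the block length is modest, so the point is the mechanics rather than the sharp capacity threshold.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Toy assumptions: BSC(0.1) with uniform input, K = 10 messages' worth of bits, N = 1000
p, N, K, eps = 0.1, 1000, 10, 0.3
p_xy = np.array([[0.5 * (1 - p), 0.5 * p],
                 [0.5 * p, 0.5 * (1 - p)]])     # p_{X,Y}(a, b) for uniform X

def jointly_typical(x, y):
    """(x, y) in T_eps^(N): empirical joint pmf within (1 ± eps) of p_xy for every pair (a, b)."""
    for a, b in product(range(2), range(2)):
        freq = np.mean((x == a) & (y == b))
        if abs(freq - p_xy[a, b]) > eps * p_xy[a, b]:
            return False
    return True

codebook = rng.integers(0, 2, size=(2 ** K, N))   # random codebook: 2^K i.i.d. uniform rows
w = 0                                              # transmit the codeword of message index 0
y = codebook[w] ^ (rng.random(N) < p)              # pass it through the BSC

decoded = [m for m in range(2 ** K) if jointly_typical(codebook[m], y)]
print(decoded == [w])   # typically True: only the sent codeword is jointly typical with y
```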


Error Probability Analysis (1)

Since the ML decoder is optimal, we can analyze the performance of the typicality decoder and use it as an upper bound. Hence, our goal becomes proving $\lim_{N\to\infty} \mathbb{E}_{\mathsf{C}}\left[P_{e,\mathrm{T}}^{(N)}(\mathsf{C})\right] = 0$.
1. The first step is to use the symmetry of the codebook ensemble to simplify $\mathbb{E}_{\mathsf{C}}\left[P_{e,\mathrm{T}}^{(N)}(\mathsf{C})\right]$ and argue that we can focus on analyzing the error probability of the first codeword $X^N(1)$, averaged over $\mathsf{C}$:
$$\mathbb{E}_{\mathsf{C}}\left[P_{e,\mathrm{T}}^{(N)}(\mathsf{C})\right] = \mathbb{E}_{\mathsf{C}}\left[2^{-K} \sum_{m=1}^{2^K} \lambda_m(\mathsf{C})\right] = 2^{-K} \sum_{m} \mathbb{E}_{\mathsf{C}}[\lambda_m(\mathsf{C})] = 2^{-K} \sum_{m} \mathbb{E}_{\mathsf{C}}[\lambda_1(\mathsf{C})] = \mathbb{E}_{\mathsf{C}}[\lambda_1(\mathsf{C})]$$
$$= \mathbb{P}\{\text{Error, averaged over } \mathsf{C} \mid W = 1\}.$$


Error Probability Analysis (2)

2. For notational simplicity, use $\mathcal{E}$ to denote the "Error" event and drop the "averaged over $\mathsf{C}$". Our next focus is to upper bound $\mathbb{P}\{\mathcal{E} \mid W = 1\} \triangleq \mathbb{P}_1(\mathcal{E})$.
The trick here is to distinguish two different kinds of errors: $\mathcal{E} = \mathcal{E}_a \cup \mathcal{E}_t$, where
$$\mathcal{E}_a \triangleq \left\{(X^N(1), Y^N) \notin \mathcal{T}_\varepsilon^{(N)}\right\}, \qquad \mathcal{E}_t \triangleq \left\{(X^N(w), Y^N) \in \mathcal{T}_\varepsilon^{(N)} \text{ for some } w \neq 1\right\}.$$
The core question is whether or not the joint sequence $(X^N(w), Y^N)$ is $\varepsilon$-typical. Let us define $\mathcal{A}_w \triangleq \left\{(X^N(w), Y^N) \in \mathcal{T}_\varepsilon^{(N)}\right\}$.
We can then rewrite $\mathcal{E}_a = \mathcal{A}_1^c$, $\mathcal{E}_t = \cup_{w \neq 1} \mathcal{A}_w$, and hence
$$\mathcal{E} = \mathcal{E}_a \cup \mathcal{E}_t = \mathcal{A}_1^c \cup \left(\cup_{w \neq 1} \mathcal{A}_w\right).$$


Error Probability Analysis (3)

3. We are now ready to apply the union of events bound:
$$\mathbb{P}_1\{\mathcal{E}\} = \mathbb{P}_1\left\{\mathcal{A}_1^c \cup \left(\cup_{w \neq 1} \mathcal{A}_w\right)\right\} \le \mathbb{P}_1\{\mathcal{A}_1^c\} + \sum_{w=2}^{2^K} \mathbb{P}_1\{\mathcal{A}_w\}.$$
Next, we shall develop upper bounds on
- the probability that the actually transmitted codeword $X^N(1)$ and the actually received signal $Y^N$ are not (jointly) typical;
- the probability that some other (random) codeword $X^N(w)$, $w \neq 1$, and the actually received signal $Y^N$ are (jointly) typical.

Lemma 1 (A Key Lemma)
$\mathbb{P}_1\{\mathcal{A}_1\} \ge 1 - \varepsilon$ for $N$ large enough, and $\mathbb{P}_1\{\mathcal{A}_w\} \le 2^{-N(I(X;Y) - \delta(\varepsilon))}$ for all $w \neq 1$, where $\delta(\varepsilon) \to 0$ as $\varepsilon \to 0$.


Error Probability Analysis (4)

4. Finally, let us put all of the above together and apply Lemma 1:
$$\begin{aligned}
\mathbb{E}_{\mathsf{C}}\left[P_{e,\mathrm{T}}^{(N)}(\mathsf{C})\right] \triangleq \mathbb{P}\{\mathcal{E}\} = \mathbb{P}\{\mathcal{E} \mid W = 1\} \triangleq \mathbb{P}_1\{\mathcal{E}\} &\le \mathbb{P}_1\{\mathcal{A}_1^c\} + \sum_{w=2}^{2^K} \mathbb{P}_1\{\mathcal{A}_w\}\\
&\le \varepsilon + \sum_{w=2}^{2^K} 2^{-N(I(X;Y) - \delta(\varepsilon))}\\
&\le \varepsilon + 2 \cdot 2^{-N(I(X;Y) - \delta(\varepsilon) - R)},
\end{aligned}$$
where the last step uses $2^K - 1 \le 2^{\lceil NR \rceil} \le 2 \cdot 2^{NR}$. As long as $R < I(X;Y) - \delta(\varepsilon)$, we can make $\mathbb{P}\{\mathcal{E}\} \le 2\varepsilon$ for $N$ large enough; since $\varepsilon > 0$ is arbitrary, this gives $\lim_{N\to\infty} \mathbb{E}_{\mathsf{C}}\left[P_{e,\mathrm{T}}^{(N)}(\mathsf{C})\right] = 0$.


Completion of the Achievability Proof

We have shown that as long as $R < I(X;Y) - \delta(\varepsilon)$, $\lim_{N\to\infty} \mathbb{E}_{\mathsf{C}}\left[P_{e,\mathrm{T}}^{(N)}(\mathsf{C})\right] = 0$, and hence there must exist a realization of the codebook $\mathsf{c}$ such that $P_{e,\mathrm{T}}^{(N)}(\mathsf{c}) \to 0$ as $N \to \infty$.
Finally, taking the codebook-generating distribution
$$p_X = \arg\max_{p(x)} I(X;Y),$$
we conclude that every $R < C = \max_{p(x)} I(X;Y)$ is achievable.


Proof of Lemma 1 (1): Recap of Typicality

Recall: by definition, an $\varepsilon$-typical (vector) sequence $(x^n, y^n)$ satisfies
$$|\pi(a, b \mid x^n, y^n) - p_{X,Y}(a, b)| \le \varepsilon\, p_{X,Y}(a, b), \quad \forall\, (a, b) \in \mathcal{X} \times \mathcal{Y}.$$
(Note: we can think of $(X, Y)$ as a single r.v. and apply the same definition of typicality.)
Hence, if $(X^n, Y^n) \sim \prod_{i=1}^{n} p_{X,Y}(x_i, y_i)$, then we have
0. $(x^n, y^n) \in \mathcal{T}_\varepsilon^{(n)}(X, Y) \implies x^n \in \mathcal{T}_\varepsilon^{(n)}(X)$ and $y^n \in \mathcal{T}_\varepsilon^{(n)}(Y)$.
1. $\forall\, (x^n, y^n) \in \mathcal{T}_\varepsilon^{(n)}(X, Y)$: $\left|-\frac{1}{n}\log p(x^n, y^n) - H(X, Y)\right| \le \delta(\varepsilon)$, where $\delta(\varepsilon) = \varepsilon H(X, Y)$.
2. $\mathbb{P}\left\{\mathcal{T}_\varepsilon^{(n)}(X, Y)\right\} \ge 1 - \varepsilon$ for $n$ large enough.
3. $\left|\mathcal{T}_\varepsilon^{(n)}(X, Y)\right| \le 2^{n(H(X,Y) + \delta(\varepsilon))}$.
4. $\left|\mathcal{T}_\varepsilon^{(n)}(X, Y)\right| \ge (1 - \varepsilon)\, 2^{n(H(X,Y) - \delta(\varepsilon))}$ for $n$ large enough.
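
As an illustration of Property 2 (the LLN at work), here is a small sketch (my own, not from the slides) that estimates $\mathbb{P}\{\mathcal{T}_\varepsilon^{(n)}(X,Y)\}$ by sampling i.i.d. pairs from a joint pmf and checking the typicality condition as $n$ grows; the joint pmf, $\varepsilon$, and sample counts are arbitrary assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])      # joint pmf of (X, Y), e.g. BSC(0.1) with uniform input
eps = 0.2

def is_typical(x, y):
    """Empirical joint pmf within a multiplicative (1 ± eps) band of p_xy for every (a, b)."""
    for a, b in product(range(2), range(2)):
        freq = np.mean((x == a) & (y == b))
        if abs(freq - p_xy[a, b]) > eps * p_xy[a, b]:
            return False
    return True

for n in (100, 1000, 10000):
    hits = 0
    for _ in range(1000):
        flat = rng.choice(4, size=n, p=p_xy.ravel())   # sample (X_i, Y_i) i.i.d. from p_xy
        x, y = flat // 2, flat % 2
        hits += is_typical(x, y)
    print(n, hits / 1000)   # fraction of eps-typical samples; tends to 1 as n grows
```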


Proof of Lemma 1 (2): Typical with Actual Codeword

Let us first consider $\mathbb{P}_1\{\mathcal{A}_1\} = \mathbb{P}\left\{(X^N(1), Y^N) \in \mathcal{T}_\varepsilon^{(N)} \,\middle|\, W = 1\right\}$.
We are averaging over the random codebook ensemble $\mathsf{C}$, and the random codebook is generated element-by-element i.i.d. based on $p_X$. The DMC without feedback implies $p(y^N \mid x^N) = \prod_{i=1}^{N} p_{Y|X}(y_i \mid x_i)$.
Hence, given $W = 1$, $(X^N(1), Y^N)$ has the joint distribution
$$p(x^N, y^N) = p(x^N)\, p(y^N \mid x^N) = \prod_{i=1}^{N} p_X(x_i) \cdot \prod_{i=1}^{N} p_{Y|X}(y_i \mid x_i) = \prod_{i=1}^{N} p_{X,Y}(x_i, y_i).$$
By Property 2 (LLN), we see that for $N$ large enough,
$$\mathbb{P}_1\{\mathcal{A}_1\} = \mathbb{P}\left\{(X^N(1), Y^N) \in \mathcal{T}_\varepsilon^{(N)} \,\middle|\, W = 1\right\} \ge 1 - \varepsilon.$$


Proof of Lemma 1 (3): Typical with a Wrong Codeword

Consider $\mathbb{P}_1\{\mathcal{A}_w\} = \mathbb{P}\left\{(X^N(w), Y^N) \in \mathcal{T}_\varepsilon^{(N)} \,\middle|\, W = 1\right\}$ for $w \neq 1$.
Note that we are averaging over the random codebook ensemble $\mathsf{C}$, and the random codebook is generated element-by-element i.i.d. based on $p_X$. Hence, although $X^N(1)$ and $X^N(w)$ have the same marginal distribution $p_X$, they are independent of each other. Due to the DMC, $(X^N(1), Y^N) \perp\!\!\!\perp X^N(w)$. Hence, $Y^N \perp\!\!\!\perp X^N(w)$, and
$$\mathbb{P}_1\{\mathcal{A}_w\} = \sum_{(x^N, y^N) \in \mathcal{T}_\varepsilon^{(N)}} p(x^N)\, p(y^N) \le \underbrace{2^{N(1+\varepsilon)H(X,Y)}}_{\text{cardinality bound on the typical set}} \cdot \underbrace{2^{-N(1-\varepsilon)H(X)}}_{\text{bound on the prob. of a typical } x^N} \cdot \underbrace{2^{-N(1-\varepsilon)H(Y)}}_{\text{bound on the prob. of a typical } y^N} = 2^{-N(I(X;Y) - \delta(\varepsilon))},$$
where $\delta(\varepsilon) = \varepsilon\,(H(X,Y) + H(X) + H(Y)) \to 0$ as $\varepsilon \to 0$.


Some Reflections

Reflection 1: Mutual independence of codewords.
In the random coding argument of the proof, the $2^K \times N$ elements of the codebook matrix $\mathsf{C}$ are generated i.i.d., and hence the $2^K$ rows $\{X^N(1), \ldots, X^N(2^K)\}$ are mutually independent. However, in the proof we only require pairwise independence: $X^N(1) \perp\!\!\!\perp X^N(w)$ for all $w \neq 1$.

Reflection 2: Typicality decoder.
We use the typicality decoder rather than the optimal ML decoder to find tractable upper bounds on the error probability. Other suboptimal decoders can be used as well. For example, the following threshold decoder also works:
$$\hat{w}_{\mathrm{th}} \triangleq \text{the unique } w \text{ such that } \tfrac{1}{N}\, i\!\left(x^N(w); y^N\right) > \beta,$$
where $i(x^N; y^N) \triangleq \log\frac{p(x^N, y^N)}{p(x^N)\,p(y^N)} = \sum_{k=1}^{N} \log\frac{p_{Y|X}(y_k \mid x_k)}{p_Y(y_k)}$, and $\beta \triangleq I(X;Y) - \varepsilon$.
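
To make the information density concrete, here is a small sketch (my own, not from the slides) that evaluates $\frac{1}{N}\, i(x^N; y^N)$ for a BSC with uniform input (so $p_Y \equiv 1/2$) and compares it against the threshold $\beta = 1 - H_b(p) - \varepsilon$; for the true codeword the normalized information density concentrates around $I(X;Y)$, while for an independent wrong codeword it is far below the threshold. The parameters and function names are assumptions for illustration.

```python
import numpy as np

p, N, eps = 0.1, 2000, 0.05
rng = np.random.default_rng(3)

def info_density_rate(x, y, p):
    """(1/N) * sum_k log2( p_{Y|X}(y_k|x_k) / p_Y(y_k) ) for a BSC(p) with uniform input (p_Y = 1/2)."""
    terms = np.where(x == y, np.log2((1 - p) / 0.5), np.log2(p / 0.5))
    return terms.mean()

Hb = lambda q: -q*np.log2(q) - (1-q)*np.log2(1-q)
x = rng.integers(0, 2, N)                     # transmitted codeword
y = x ^ (rng.random(N) < p)                   # BSC output
x_wrong = rng.integers(0, 2, N)               # an independent (wrong) codeword

beta = (1 - Hb(p)) - eps                      # threshold I(X;Y) - eps
print(info_density_rate(x, y, p), info_density_rate(x_wrong, y, p), beta)
# true codeword: ≈ 0.53 > beta;   wrong codeword: ≈ -0.7, far below beta
```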


Joint Source-Channel Coding: Problem Setup

[Block diagram: Source $\{S_i\}$ → Encoder → $x^{N_c}$ → Channel $p_{Y|X}$ → $y^{N_c}$ → Decoder → $\hat{s}^{N_s}$ → Destination.]

Source model: discrete stationary ergodic with entropy rate $H(\{S_i\})$.
Channel model: DMC $p_{Y|X}$ with channel capacity $C(p_{Y|X})$.
1. A $\left(|\mathcal{S}|^{N_c R}, N_c\right)$ joint source-channel code consists of
   - an encoding function (encoder) $\mathrm{enc}_{N_c} : \mathcal{S}^{N_s} \to \mathcal{X}^{N_c}$ that maps each source sequence $s^{N_s}$ to a length-$N_c$ codeword $x^{N_c}$, where $N_s \triangleq \lceil N_c R \rceil$;
   - a decoding function (decoder) $\mathrm{dec}_{N_c} : \mathcal{Y}^{N_c} \to \mathcal{S}^{N_s}$ that maps a channel output sequence $y^{N_c}$ to a reconstructed sequence $\hat{s}^{N_s}$.
2. The error probability is defined as $P_e^{(N_c)} \triangleq \mathbb{P}\left\{\hat{S}^{N_s} \neq S^{N_s}\right\}$.
3. A rate $R$ is said to be achievable if there exists a sequence of $\left(|\mathcal{S}|^{N_c R}, N_c\right)$ codes such that $P_e^{(N_c)} \to 0$ as $N_c \to \infty$.


Source-Channel Separation Theorem

Theorem 3 (Source-Channel Separation)
1. If $R < \frac{C}{H(\{S_i\})}$, then $R$ is achievable, i.e., lossless reconstruction of the source $\{S_i\}$ is possible via the noisy channel $p_{Y|X}$.
2. Conversely, if $R > \frac{C}{H(\{S_i\})}$, then $R$ is not achievable, i.e., lossless reconstruction is impossible.

[Block diagram of the separation architecture: Source → Source Encoder → $b^K$ (binary interface) → Channel Encoder → $x^{N_c}$ → Noisy Channel → $y^{N_c}$ → Channel Decoder → $\hat{b}^K$ → Source Decoder → $\hat{s}^{N_s}$ → Destination.]


Proof of Achievability

pf (Achievability Part):
- Choose a $(2^{N_s R_s}, N_s)$ lossless source code with $R_s = H(\{S_i\}) + \varepsilon_s$.
- Choose a $(2^{N_c R_c}, N_c)$ channel code with $R_c = C - \varepsilon_c$.
Due to the channel coding theorem, the binary sequence $b^K$ living at the digital interface between the source and channel coders can be decoded with vanishing error probability.
Due to the lossless source coding theorem, the source sequence can be reconstructed with vanishing error probability as long as the bit sequence $b^K$ is successfully decoded by the channel decoder.
Concatenating the above two codes, we see that as long as $N_s R_s < N_c R_c \iff \frac{N_s}{N_c} < \frac{R_c}{R_s} = \frac{C - \varepsilon_c}{H(\{S_i\}) + \varepsilon_s}$, the separation scheme is able to reconstruct the source sequence with vanishing error probability.
Since $\varepsilon_s, \varepsilon_c$ can be made arbitrarily small, any $R < \frac{C}{H(\{S_i\})}$ is achievable.
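
As a worked numerical illustration (my own, not from the slides): for a Bernoulli(0.2) i.i.d. source sent over a BSC(0.1), the separation threshold is $\frac{C}{H(\{S_i\})} = \frac{1 - H_b(0.1)}{H_b(0.2)}$, roughly 0.74 source symbols per channel use. The chosen source and channel parameters are assumptions for the example.

```python
import numpy as np

Hb = lambda q: -q*np.log2(q) - (1-q)*np.log2(1-q)

H_source = Hb(0.2)            # entropy rate of a Bernoulli(0.2) i.i.d. source, ≈ 0.722 bits/symbol
C_channel = 1 - Hb(0.1)       # capacity of a BSC(0.1), ≈ 0.531 bits/channel use
print(C_channel / H_source)   # maximal achievable R ≈ 0.74 source symbols per channel use
```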


Proof of Converse

pf (Converse Part): We shall prove that every achievable $R$ satisfies $R \le \frac{C}{H(\{S_i\})}$.
$$\begin{aligned}
N_s H(\{S_i\}) \le H\left(S^{N_s}\right) &= I\left(S^{N_s}; \hat{S}^{N_s}\right) + H\left(S^{N_s} \,\middle|\, \hat{S}^{N_s}\right) & (7)\\
&\le I\left(S^{N_s}; Y^{N_c}\right) + \left(1 + P_e^{(N_c)} N_s \log|\mathcal{S}|\right) & (8)\\
&\le \sum_{k=1}^{N_c} I\left(S^{N_s}; Y_k \mid Y^{k-1}\right) + \left(1 + P_e^{(N_c)} N_s \log|\mathcal{S}|\right)\\
&\le N_c\left(C + \varepsilon_{N_c}\right), \quad \text{where } \varepsilon_{N_c} \to 0 \text{ as } N_c \to \infty. & (9)
\end{aligned}$$
(7) is due to the property of the entropy rate and the chain rule.
(8) is due to $S^{N_s} - Y^{N_c} - \hat{S}^{N_s}$ and Fano's inequality.
(9) is due to similar steps as in the channel coding converse proof.
Hence, $R \le \frac{N_s}{N_c} \le \frac{C + \varepsilon_{N_c}}{H(\{S_i\})}$, and letting $N_c \to \infty$ gives $R \le \frac{C}{H(\{S_i\})}$ if $R$ is achievable.


Summary


- Channel coding theorem: $C = \max_{p(x)} I(X;Y)$ for a DMC $p_{Y|X}$, with or without feedback.
- Weak converse: Fano's inequality, the data processing inequality, and the DMC assumption.
- Achievability: random coding argument, typicality decoder.
- Feedback does not increase the capacity of a DMC.
- Symmetric channel capacity $= \log|\mathcal{Y}| - H(\mathbf{p})$, where $\mathbf{p}$ is the probability vector whose permutations form the rows of $p_{Y|X}$.
- Erasure channel capacity $= (1-p)\log|\mathcal{X}|$.
- Joint source-channel coding theorem: $R < \frac{C}{H(\{S_i\})} \implies R$ is achievable; $R > \frac{C}{H(\{S_i\})} \implies R$ is not achievable.
- Source-channel separation is optimal.
