Lecture 2: Basic Information Theory. TSBK01 Image Coding and Data Compression. Jörgen Ahlberg, Div. of Sensor Technology, Swedish Defence Research Agency (FOI)


Page 1

Lecture 2: Basic Information Theory

TSBK01 Image Coding and Data Compression

Jörgen Ahlberg, Div. of Sensor Technology

Swedish Defence Research Agency (FOI)

Page 2

Today

1. What is information theory about?

2. Stochastic (information) sources.

3. Information and entropy.

4. Entropy for stochastic sources.

5. The source coding theorem.

Page 3

Part 1: Information Theory

Claude Shannon: "A Mathematical Theory of Communication", The Bell System Technical Journal, 1948.

Sometimes referred to as "Shannon-Weaver", since the standalone publication has a foreword by Weaver. Be careful!

Page 4

Quotes about Shannon

"What is information? Sidestepping questions about meaning, Shannon showed that it is a measurable commodity."

"Today, Shannon's insight helps shape virtually all systems that store, process, or transmit information in digital form, from compact discs to computers, from facsimile machines to deep space probes."

"Information theory has also infiltrated fields outside communications, including linguistics, psychology, economics, biology, even the arts."

Page 5

Source → Source coder → Channel coder → Channel → Channel decoder → Source decoder → Sink, receiver

– Source: any source of information.
– Source coder: change to an efficient representation, i.e., data compression.
– Channel coder: change to an efficient representation for transmission, i.e., error control coding.
– Channel: anything transmitting or storing information – a radio link, a cable, a disk, a CD, a piece of paper, …
– Channel decoder: recover from channel distortion.
– Source decoder: uncompress.

Page 6

Fundamental Entities

Source → Source coder → Channel coder → Channel → Channel decoder → Source decoder → Sink, receiver

H: The information content of the source.
R: Rate from the source coder.
C: Channel capacity.

Page 7

Fundamental Theorems

Source → Source coder → Channel coder → Channel → Channel decoder → Source decoder → Sink, receiver

Shannon 1 (the source coding theorem and the channel coding theorem, simplified): Error-free transmission is possible if R ≥ H and C ≥ R.

Shannon 2: Source coding and channel coding can be optimized independently, and binary symbols can be used as intermediate format. Assumption: arbitrarily long delays.

Page 8

Part 2: Stochastic Sources

A source outputs symbols X_1, X_2, …

Each symbol takes its value from an alphabet A = (a_1, a_2, …).

Model: P(X_1, …, X_N) is assumed to be known for all combinations.

Source → X_1, X_2, …

Example 1: A text is a sequence of symbols, each taking its value from the alphabet A = (a, …, z, A, …, Z, 1, 2, …, 9, !, ?, …).

Example 2: A (digitized) grayscale image is a sequence of symbols, each taking its value from the alphabet A = (0, 1) or A = (0, …, 255).

Page 9

Two Special Cases

1. The Memoryless Source: Each symbol is independent of the previous ones.
   P(X_1, X_2, …, X_n) = P(X_1) · P(X_2) · … · P(X_n)

2. The Markov Source: Each symbol depends on the previous one.
   P(X_1, X_2, …, X_n) = P(X_1) · P(X_2|X_1) · P(X_3|X_2) · … · P(X_n|X_{n-1})
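As a small illustration (not from the slides), the two factorizations can be written out directly; the probability tables below are made-up examples.

```python
# Hypothetical probability tables, only to illustrate the two factorizations.
p = {'a': 0.5, 'b': 0.3, 'c': 0.2}                 # memoryless: P(X)
p_init = {'a': 0.5, 'b': 0.3, 'c': 0.2}            # Markov: P(X1)
p_cond = {'a': {'a': 0.6, 'b': 0.3, 'c': 0.1},     # Markov: P(next | previous)
          'b': {'a': 0.2, 'b': 0.5, 'c': 0.3},
          'c': {'a': 0.1, 'b': 0.1, 'c': 0.8}}

def prob_memoryless(seq):
    """P(X1, ..., Xn) = P(X1) * P(X2) * ... * P(Xn)."""
    prod = 1.0
    for x in seq:
        prod *= p[x]
    return prod

def prob_markov(seq):
    """P(X1, ..., Xn) = P(X1) * P(X2|X1) * ... * P(Xn|Xn-1)."""
    prod = p_init[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        prod *= p_cond[prev][nxt]
    return prod

print(prob_memoryless('abc'), prob_markov('abc'))   # 0.03 vs 0.045
```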

Page 10

The Markov Source

A symbol depends only on the previous symbol, so the source can be modelled by a state diagram.

[State diagram: a ternary source with alphabet A = (a, b, c); the arrows between the states a, b, c carry the transition probabilities 1.0, 0.5, 0.7, 0.3, 0.3, 0.2.]

Page 11

The Markov Source

Assume we are in state a, i.e., X_k = a. The probabilities for the next symbol are:

P(X_{k+1} = a | X_k = a) = 0.3
P(X_{k+1} = b | X_k = a) = 0.7
P(X_{k+1} = c | X_k = a) = 0

Page 12

The Markov Source

So, if X_{k+1} = b, we know that X_{k+2} will equal c:

P(X_{k+2} = a | X_{k+1} = b) = 0
P(X_{k+2} = b | X_{k+1} = b) = 0
P(X_{k+2} = c | X_{k+1} = b) = 1

Page 13

The Markov Source

If all the states can be reached, the stationary probabilities for the states can be calculated from the given transition probabilities.

Markov models can also be used to represent sources with dependencies more than one step back.
– Use a state diagram with several symbols in each state.

Stationary probabilities? Those are the probabilities π_i = P(X_k = a_i) for any k when X_{k-1}, X_{k-2}, … are not given.
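A minimal sketch of how the stationary probabilities could be computed for the ternary example above. The transitions out of a and b are given on the slides; assigning the remaining labels 0.5, 0.2, 0.3 to the arrows out of c is an assumption here.

```python
import numpy as np

# Transition matrix, P[k, l] = P(next state = l | current state = k), state order (a, b, c).
# The a and b rows follow the slides; the c row is an assumed assignment of 0.5, 0.2, 0.3.
P = np.array([[0.3, 0.7, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.2, 0.3]])

# The stationary probabilities pi solve pi = pi P with sum(pi) = 1,
# i.e., pi is the left eigenvector of P belonging to eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()
print(dict(zip('abc', pi.round(4))))   # approximately {'a': 0.2959, 'b': 0.2899, 'c': 0.4142}
```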

Page 14

Analysis and Synthesis

Stochastic models can be used for analysing a source.
– Find a model that represents the real-world source well, and then analyse the model instead of the real world.

Stochastic models can be used for synthesizing a source.
– Use a random number generator in each step of a Markov model to generate a sequence simulating the source (sketched below).
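A sketch of the synthesis idea for the same ternary example (the transitions out of c are the assumed ones from the previous sketch):

```python
import random

# Transition probabilities of the ternary Markov example (the c row is an assumption).
next_probs = {'a': {'a': 0.3, 'b': 0.7},
              'b': {'c': 1.0},
              'c': {'a': 0.5, 'b': 0.2, 'c': 0.3}}

def synthesize(n, start='a'):
    """Generate n symbols by drawing each next symbol with a random number generator."""
    state, out = start, []
    for _ in range(n):
        symbols, weights = zip(*next_probs[state].items())
        state = random.choices(symbols, weights=weights)[0]
        out.append(state)
    return ''.join(out)

print(synthesize(30))   # e.g. 'bcabccaabc...'; note that 'ac', 'ba' and 'bb' never occur
```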

Page 15

Show plastic slides!

Page 16

Part 3: Information and Entropy

Assume a binary memoryless source, e.g., a flip of a coin. How much information do we receive when we are told that the outcome is heads?
– If it's a fair coin, i.e., P(heads) = P(tails) = 0.5, we say that the amount of information is 1 bit.
– If we already know that it will be (or was) heads, i.e., P(heads) = 1, the amount of information is zero!
– If the coin is not fair, e.g., P(heads) = 0.9, the amount of information is more than zero but less than one bit!
– Intuitively, the amount of information received is the same if P(heads) = 0.9 or P(heads) = 0.1.

Page 17

Self Information

So, let's look at it the way Shannon did. Assume a memoryless source with
– alphabet A = (a_1, …, a_n)
– symbol probabilities (p_1, …, p_n).

How much information do we get when finding out that the next symbol is a_i?

According to Shannon, the self information of a_i is

i(a_i) = -log p_i

Page 18

Why? Assume two independent events A and B, with probabilities P(A) = p_A and P(B) = p_B.

For both events to happen, the probability is p_A · p_B. However, the amounts of information should be added, not multiplied.

Logarithms satisfy this!

No, we want the information to increase with decreasing probabilities, so let's use the negative logarithm.
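As a one-line check of this, using the definition of self information above:

i(A and B) = -log(p_A · p_B) = -log p_A - log p_B = i(A) + i(B)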

Page 19

Self Information

Example 1:

Example 2:

Which logarithm? Pick the one you like! If you pick the natural log, you'll measure in nats, if you pick the 10-log, you'll get Hartleys, and if you pick the 2-log (like everyone else), you'll get bits.
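A small sketch of self information in the different units; the example probabilities are the ones from the coin discussion earlier.

```python
import math

def self_information(p, base=2):
    """Self information -log(p): base 2 gives bits, base e gives nats, base 10 gives Hartleys."""
    return -math.log(p, base)

print(self_information(0.5))            # 1.0 bit   (fair coin)
print(self_information(1.0))            # 0.0 bits  (outcome already known)
print(self_information(0.9))            # ~0.152 bits
print(self_information(0.5, math.e))    # ~0.693 nats
print(self_information(0.5, 10))        # ~0.301 Hartleys
```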

Page 20

Self Information

On average over all the symbols, we get:

H(X) = -∑_i p_i · log p_i

H(X) is called the first order entropy of the source. This can be regarded as the degree of uncertainty about the following symbol.

Page 21

Entropy

Example: the Binary Memoryless Source (BMS), emitting e.g. 0 1 1 0 1 0 0 0 …

Let P(1) = p and P(0) = 1 - p. Then

H(X) = -p · log p - (1 - p) · log(1 - p)

often denoted h(p).

[Plot of h(p) for 0 ≤ p ≤ 1: h(0) = h(1) = 0, with maximum value 1 at p = 0.5.]

The uncertainty (information) is greatest when p = 0.5.
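A sketch of h(p), showing the symmetry around p = 0.5 and the maximum there:

```python
import math

def h(p):
    """Binary entropy h(p) = -p log2 p - (1 - p) log2 (1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0                      # by convention, 0 * log 0 = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.5, 0.9):
    print(p, round(h(p), 3))            # 0.469, 1.0, 0.469 -> maximum at p = 0.5
```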

Page 22

Entropy: Three Properties

1. It can be shown that 0 ≤ H ≤ log N.

2. Maximum entropy (H = log N) is reached when all symbols are equiprobable, i.e., p_i = 1/N.

3. The difference log N - H is called the redundancy of the source.
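A small sketch of the three properties for a made-up distribution over N = 4 symbols:

```python
import math

def entropy(probs):
    """First order entropy H = -sum_i p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]       # example distribution, N = 4
H = entropy(probs)
H_max = math.log2(len(probs))           # log N, reached for equiprobable symbols
print(H, H_max, H_max - H)              # 1.75 <= 2.0, redundancy 0.25 bits/symbol
```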

Page 23

Part 4: Entropy for Memory Sources

Assume a block of source symbols (X_1, …, X_n) and define the block entropy:

H(X_1, …, X_n) = -∑ P(X_1, …, X_n) · log P(X_1, …, X_n)

where the summation is done over all possible combinations of n symbols.

The entropy for a memory source is then defined as:

H = lim_{n→∞} (1/n) · H(X_1, …, X_n)

That is, let the block length go towards infinity and divide by n to get the number of bits/symbol.
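A sketch of the definition applied to the ternary Markov example: the per-symbol block entropy H(X_1, …, X_n)/n is computed by brute-force enumeration and decreases towards the entropy rate as n grows. The c transitions and the stationary probabilities are the assumed values from the earlier sketches.

```python
import itertools
import math

symbols = 'abc'
P = {'a': {'a': 0.3, 'b': 0.7, 'c': 0.0},            # transition probabilities;
     'b': {'a': 0.0, 'b': 0.0, 'c': 1.0},            # the c row is an assumption
     'c': {'a': 0.5, 'b': 0.2, 'c': 0.3}}
pi = {'a': 50 / 169, 'b': 49 / 169, 'c': 70 / 169}   # stationary probabilities

def block_entropy(n):
    """H(X1, ..., Xn) = -sum over all n-symbol blocks of P(block) log2 P(block)."""
    H = 0.0
    for block in itertools.product(symbols, repeat=n):
        p = pi[block[0]]
        for prev, nxt in zip(block, block[1:]):
            p *= P[prev][nxt]
        if p > 0:
            H -= p * math.log2(p)
    return H

for n in (1, 2, 4, 8):
    print(n, round(block_entropy(n) / n, 4))         # decreases towards ~0.88 bits/symbol
```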

Page 24

Entropy for a Markov Source

The entropy for a state S_k can be expressed as

H(S_k) = -∑_l P_kl · log P_kl

where P_kl is the transition probability from state k to state l.

Averaging over all states with their stationary probabilities π_k, we get the entropy for the Markov source as

H = ∑_k π_k · H(S_k)
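A sketch computing the per-state entropies and the Markov source entropy for the ternary example (same assumptions as above); the result agrees with the limit that the block entropies approach.

```python
import math

P = {'a': {'a': 0.3, 'b': 0.7, 'c': 0.0},            # the c row is an assumption
     'b': {'a': 0.0, 'b': 0.0, 'c': 1.0},
     'c': {'a': 0.5, 'b': 0.2, 'c': 0.3}}
pi = {'a': 50 / 169, 'b': 49 / 169, 'c': 70 / 169}   # stationary probabilities

# Per-state entropy H(S_k) = -sum_l P_kl log2 P_kl.
H_state = {k: -sum(p * math.log2(p) for p in row.values() if p > 0)
           for k, row in P.items()}

# Markov source entropy: average over the states with the stationary probabilities.
H_markov = sum(pi[k] * H_state[k] for k in P)
print({k: round(v, 3) for k, v in H_state.items()}, round(H_markov, 3))
# state b is deterministic, so H(S_b) = 0; H_markov is about 0.88 bits/symbol
```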

Page 25

The Run-length Source

Certain sources generate long runs or bursts of equal symbols.

Example: [state diagram with two states, A and B]

Probability for a burst of length r: P(r) = (1 - ρ)^(r-1) · ρ

Entropy: H_R = -∑_{r=1}^{∞} P(r) · log P(r)

If the average run length is μ, then H_R / μ = H_M.
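A sketch of the run-length distribution and its entropy. The symbol ρ (the probability that a run ends after each symbol) is a reconstruction, and its value here is arbitrary; the infinite sum is truncated.

```python
import math

rho = 0.2                                    # assumed probability that a run ends

def P_run(r):
    """P(r) = (1 - rho)^(r - 1) * rho, the probability of a run of length r."""
    return (1 - rho) ** (r - 1) * rho

H_R = -sum(P_run(r) * math.log2(P_run(r)) for r in range(1, 5000))   # truncated sum
mu = 1 / rho                                 # average run length of this distribution
print(round(H_R, 4), round(H_R / mu, 4))     # H_R / mu is measured in bits per source symbol
```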

Page 26

Part 5: The Source Coding Theorem

The entropy is the smallest number of bits allowing error-free representation of the source.

Why is this? Let's take a look at typical sequences!

Page 27

Typical Sequences

Assume a long sequence from a binary memoryless source with P(1) = p.

Among n bits, there will be approximately w = n · p ones.

Thus, there are M = (n over w) such typical sequences!

Only these sequences are interesting. All other sequences will appear with smaller probability the larger n is.

Page 28

How many are the typical sequences?

log M = log (n over w) ≈ n · h(p) bits

Enumerating the typical sequences thus needs log M bits, i.e., approximately h(p) bits per symbol!
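A sketch checking that log M / n approaches h(p) for growing n (math.comb needs Python 3.8 or later):

```python
import math

def h(p):
    """Binary entropy function, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.1
for n in (100, 1000, 10000):
    w = round(n * p)                    # approximately n*p ones in a typical sequence
    M = math.comb(n, w)                 # number of typical sequences
    print(n, round(math.log2(M) / n, 4), round(h(p), 4))   # log2(M)/n tends to h(p)
```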

Page 29

How many bits do we need?

Thus, we need H(X) bits per symbol to code any typical sequence!

Page 30

The Source Coding Theorem

Does tell us
– that we can represent the output from a source X using H(X) bits/symbol.
– that we cannot do better.

Does not tell us
– how to do it.

Page 31

Summary

The mathematical model of communication.
– Source, source coder, channel coder, channel, …
– Rate, entropy, channel capacity.

Information theoretical entities
– Information, self-information, uncertainty, entropy.

Sources
– BMS, Markov, RL.

The Source Coding Theorem.