
Lecture 5: Arithmetic Coding




Contents

Introduction

Coding a sequence

Arithmetic coding: Description

Arithmetic coding: Encoder

Basic idea in arithmetic coding

Arithmetic Coding Algorithm

Encoder Example

Determine new interval

Decoding Algorithm

Arithmetic coding: Issues

Mapping symbols

Example 1

Generating a tag

Example 2

Decoding Example

Interval Update

Example 3

Main Results on Uniqueness and Efficiency

Introduction

The first thing to understand about arithmetic coding is what it produces. Arithmetic coding is a common technique used in both lossless and lossy data compression. It is an entropy encoding technique in which frequently seen symbols are encoded with fewer bits than rarely seen symbols. Arithmetic coding takes a message (often a file) composed of symbols (nearly always eight-bit characters) and converts it to a floating-point number greater than or equal to zero and less than one.

The second thing to understand about arithmetic coding is that it relies on a model to characterize the symbols it is processing. The job of the model is to tell the encoder what the probability of a character is in a given message. If the model gives an accurate probability of the characters in the message, they will be encoded very close to optimally. If the model misrepresents the probabilities of symbols, the encoder may actually expand a message instead of compressing it!

Arithmetic coding is similar to Huffman coding; both achieve compression by reducing the average number of bits required to represent a symbol.


Coding a Sequence

o In order to distinguish one sequence of symbols from another, we need to tag it with a unique identifier. One possible set of tags for representing sequences of symbols is the set of numbers in the unit interval [0,1).

o Because the number of numbers in the unit interval is infinite, it should be possible to assign a unique tag to each distinct sequence of symbols.

o The cumulative distribution function (cdf), which we use in developing the arithmetic code, is the function that maps sequences of symbols (random variables) into the unit interval.

o Before we begin our development of the arithmetic code, we need to establish some notation. Recall that a random variable maps the outcomes, or sets of outcomes, of an experiment to values on the real number line.


Arithmetic Coding: Description

In the following discussion, we will use M as the size of the alphabet of the data source, N[x] as symbol x's probability, and Q[x] as symbol x's cumulative probability, i.e., Q[x] = N[1] + N[2] + ... + N[x], with Q[0] = 0.

Assume we know the probabilities of each symbol of the data source. We can then allocate to each symbol an interval whose width is proportional to its probability, such that the intervals do not overlap. This can be done by using the cumulative probabilities as the two ends of each interval: the two ends of the interval for symbol x are Q[x-1] and Q[x]. Symbol x is said to own the range [Q[x-1], Q[x]).
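To make the notation concrete, here is a small Python sketch (added for illustration, not part of the original slides) that builds the cumulative bounds Q from the probabilities N; the symbols and probabilities are the ones used in the encoder example later in the lecture:

# Build the cumulative bounds Q[x-1], Q[x] from the probabilities N[x].
# Each symbol x then owns the half-open interval [Q[x-1], Q[x]).
probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}      # N[x]

ranges = {}                                            # symbol -> (Q[x-1], Q[x])
cumulative = 0.0
for symbol, p in probs.items():
    ranges[symbol] = (cumulative, cumulative + p)
    cumulative += p

print(ranges)   # A: [0.0, 0.4), B: [0.4, 0.7), C: [0.7, 0.9), D: [0.9, 1.0), up to float rounding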


Arithmetic Coding: Encoder

We begin with the interval [0,1) and subdivide it iteratively. For each symbol entered, the current interval is divided according to the probabilities of the alphabet, and the sub-interval corresponding to that symbol is picked as the interval to be subdivided next. The procedure continues until all symbols in the message have been processed.

Since the symbols' intervals do not overlap, each possible message is assigned a unique interval. We can represent the message by the interval's two ends [L,H). In fact, any single value in the interval is enough to serve as the encoded code, and usually the lower end L is selected.

Basic idea in arithmetic coding

– Represent each string x of length n by a unique interval [L,R) in [0,1).
– The width R-L of the interval [L,R) represents the probability of x occurring.
– The interval [L,R) can itself be represented by any number, called a tag, within the half-open interval.
– The first k significant bits of the tag .t1t2t3... form the code of x; k is chosen so that .t1t2t3...tk000... still lies in the interval [L,R).

Arithmetic Coding Algorithm

• Symbol probabilities: P(a1), P(a2), ..., P(am)
• Cumulative probabilities: C(ai) = P(a1) + P(a2) + ... + P(a(i-1)), with C(a1) = 0
• To encode x1x2...xn:

Initialize L = 0 and R = 1;
for i = 1 to n do
    W = R - L;
    L = L + W * C(xi);
    R = L + W * P(xi);
end for
t = (L + R) / 2;
choose the code for the tag t

Note that R is computed from the already updated L, so R = L_old + W * (C(xi) + P(xi)), the upper end of xi's sub-interval.
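A minimal Python sketch of this encoder loop (added for illustration; it is equivalent to the pseudocode above, except that both interval ends are computed from the old L using the cumulative bounds C(xi) and C(xi) + P(xi)):

# Arithmetic encoder: shrink [L, R) once per symbol, then pick a tag inside it.
def encode(message, ranges):
    # ranges maps each symbol x to (C(x), C(x) + P(x)), its sub-interval of [0, 1).
    L, R = 0.0, 1.0
    for x in message:
        W = R - L                      # width of the current interval
        low, high = ranges[x]
        R = L + W * high               # new upper end (computed from the old L)
        L = L + W * low                # new lower end
    return L, R, (L + R) / 2           # final interval and a tag in its middle

ranges = {"A": (0.0, 0.4), "B": (0.4, 0.7), "C": (0.7, 0.9), "D": (0.9, 1.0)}
print(encode("BC", ranges))            # interval approx. [0.61, 0.67), tag approx. 0.64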


Encoder example

Symbol x   Probability N[x]   [Q[x-1], Q[x])
A          0.4                [0.0, 0.4)
B          0.3                [0.4, 0.7)
C          0.2                [0.7, 0.9)
D          0.1                [0.9, 1.0)

Determine new interval

Symbol B: 0.7 - 0.4 = 0.3 is the size of the new interval.

Interval [low, high):
A: [0.4,  0.4 x 0.3 + 0.4)  = [0.4,  0.52)
B: [0.52, 0.3 x 0.3 + 0.52) = [0.52, 0.61)
C: [0.61, 0.2 x 0.3 + 0.61) = [0.61, 0.67)
D: [0.67, 0.1 x 0.3 + 0.67) = [0.67, 0.70)

Symbol C: 0.67 - 0.61 = 0.06 is the size of the new interval.

A: [0.61,  0.4 x 0.06 + 0.61)  = [0.61,  0.634)
B: [0.634, 0.3 x 0.06 + 0.634) = [0.634, 0.652)
C: [0.652, 0.2 x 0.06 + 0.652) = [0.652, 0.664)
D: [0.664, 0.1 x 0.06 + 0.664) = [0.664, 0.67)


Decoding Algorithm

When decoding, the code value v is located within the current code interval to find the symbol x such that Q[x-1] <= v < Q[x]. The procedure iterates until all symbols are decoded.

Decoding procedure:
• Find the symbol x whose range straddles v, i.e., Q[x-1] <= v < Q[x], and output x
• Find R where R = Q[x] - Q[x-1]
• Update v where v = ( v - Q[x-1] ) / R
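A matching decoder sketch in Python (added for illustration). It assumes the number of symbols is known in advance, which is one of the termination options discussed on the issues slide below. The value 0.6232 used here is a tag inside the interval for the message BCAB; its lower end, 0.6196, is the value decoded step by step on the next slide:

# Arithmetic decoder: find the symbol interval containing v, output the symbol,
# then rescale v back into [0, 1) and repeat.
def decode(v, ranges, n_symbols):
    out = []
    for _ in range(n_symbols):
        for x, (low, high) in ranges.items():
            if low <= v < high:                 # Q[x-1] <= v < Q[x]
                out.append(x)
                v = (v - low) / (high - low)    # v = (v - Q[x-1]) / R
                break
    return "".join(out)

ranges = {"A": (0.0, 0.4), "B": (0.4, 0.7), "C": (0.7, 0.9), "D": (0.9, 1.0)}
print(decode(0.6232, ranges, 4))                # BCAB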


Decoding the value v = 0.6196:

v        Output x   Q[x-1]   Q[x]   R = Q[x] - Q[x-1]   ( v - Q[x-1] ) / R
0.6196   B          0.4      0.7    0.3                 0.732
0.732    C          0.7      0.9    0.2                 0.16
0.16     A          0.0      0.4    0.4                 0.4
0.4      B          0.4      0.7    0.3                 0.0

The decoded message is BCAB.


Arithmetic Coding: Issues

The zero-frequency problem: Each symbol's predicted probability must not be zero, or its interval will have zero width and interval renormalization will fail. This is called the zero-frequency problem. Adaptive models that decay their counts online may run into this problem.

The EOF problem:

Assume we pick the lower end of the interval as the encoded code. Two messages may then yield the same code if one message is identical to the other except for a finite number of copies of the first symbol of the table (the symbol whose interval starts at 0, not the first symbol of the sequence) appended as a suffix.

For example, BCAB, BCABA, BCABAA, and BCABAAA all have the same lower interval end but different upper ends (try it).

The simplest solution is to let the decoder know the length of the encoded message. This works if the message size is fixed, or if it can be transmitted first. However, it is not feasible if the data size is not known beforehand, such as with live broadcast data, or if transmitting it is too costly, such as with tapes whose size is unknown at the beginning.

Another solution is to introduce a special EOF symbol into the alphabet. The symbol takes a small interval and is used only at the end of the message. When the decoder detects the EOF symbol, it knows the end of the message has been reached.
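A small check of the EOF problem (added for illustration): appending the first symbol of the table, A, whose interval starts at 0, never moves the lower end of the interval, so the lower-end code alone cannot tell these messages apart:

# Appending A (whose interval starts at 0) leaves the lower end L unchanged,
# so BCAB, BCABA, BCABAA, ... all share the same lower bound.
ranges = {"A": (0.0, 0.4), "B": (0.4, 0.7), "C": (0.7, 0.9), "D": (0.9, 1.0)}

def lower_bound(message):
    L, R = 0.0, 1.0
    for x in message:
        W = R - L
        low, high = ranges[x]
        R = L + W * high
        L = L + W * low
    return L

for m in ("BCAB", "BCABA", "BCABAA"):
    print(m, lower_bound(m))            # the same lower bound (0.6196) every time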

Mapping symbols

We need to map the source symbols or letters to numbers by using the mapping

X(ai) = i,  for ai in A, i = 1, 2, ..., m,

where A = {a1, a2, ..., am} is the alphabet for a discrete source and X is a random variable. This mapping means that given a probability model P for the source, we also have a probability density function for the random variable,

P(X = i) = P(ai),

and the cumulative distribution function can be defined as

FX(i) = P(X = 1) + P(X = 2) + ... + P(X = i).

Example 1

Consider a three-letter alphabet A = {a1, a2, a3} with P(a1) = 0.7, P(a2) = 0.1, and P(a3) = 0.2. Using the mapping above, we get FX(1) = 0.7, FX(2) = 0.8, and FX(3) = 1.

Tag Generation

In order to see how the tag generation procedure works mathematically, we start with sequences of length one. Suppose we have a source that puts out symbols from some alphabet A = {a1, a2, ..., am}. We can map the symbols {ai} to the real numbers {i}. Define the tag TX(ai) as

TX(ai) = FX(i-1) + (1/2) P(X = i) = P(X = 1) + ... + P(X = i-1) + (1/2) P(X = i).

For each ai, TX(ai) will have a unique value lying inside the interval [FX(i-1), FX(i)). This value can be used as a unique tag for ai.


Consider a simple dice-throwing experiment with a fair die. The outcomes of a roll of the die can be mapped into the numbers {1, 2, ..., 6}. For a fair die, P(X = k) = 1/6 for k = 1, ..., 6.

Therefore, using the tag formula above, we can find the tag for X = 2 as

TX(2) = P(X = 1) + (1/2) P(X = 2) = 1/6 + 1/12 = 0.25,

and the tag for X = 5 as

TX(5) = P(X = 1) + ... + P(X = 4) + (1/2) P(X = 5) = 4/6 + 1/12 = 0.75.
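The same two tags can be checked with a few lines of Python (added for illustration), using exact fractions to avoid rounding:

# Tag for a single die outcome i: T(i) = F(i-1) + P(i)/2, with P(i) = 1/6.
from fractions import Fraction

P = {i: Fraction(1, 6) for i in range(1, 7)}

def tag(i):
    below = sum(P[k] for k in range(1, i))     # F_X(i-1)
    return below + P[i] / 2

print(tag(2), tag(5))                          # 1/4 3/4, i.e. 0.25 and 0.75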


We can extend the previous example so that the sequence consists of two rolls of a die. Using the ordering scheme described above, the outcomes (in order) are 11, 12, 13, ..., 66, each with probability 1/36. The tags can then be generated using the same formula. For example, the tag for the sequence (1 3) is

TX(13) = P(11) + P(12) + (1/2) P(13) = 2/36 + 1/72 = 5/72 (about 0.0694).

The requirement that the probability of all sequences of a given length be explicitly calculated can be as prohibitive as the requirement that we have codewords for all sequences of a given length.
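The tag for a two-roll sequence can be computed the same way (an added sketch); under the ordering 11, 12, ..., 66 every sequence has probability 1/36:

# Tag for a sequence of two fair-die rolls: sum of the probabilities of all
# sequences that come before it, plus half of its own probability.
from fractions import Fraction

p = Fraction(1, 36)                                              # P(any two-roll sequence)
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]    # ordered 11, 12, ..., 66

def sequence_tag(seq):
    preceding = outcomes.index(seq)              # how many sequences come before seq
    return preceding * p + p / 2

print(sequence_tag((1, 3)))                      # 5/72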

Example 2

Encode the string "ace" using the probability ranges from the following table (only 'a', 'c', and 'e' appear in the worked example below; the ranges for 'b' and 'd' are implied by the remaining gaps):

Symbol   Probability   Range [low, high)
a        0.30          [0.00, 0.30)
b        0.15          [0.30, 0.45)
c        0.25          [0.45, 0.70)
d        0.10          [0.70, 0.80)
e        0.20          [0.80, 1.00)


Start with lower and upper probability bounds of 0 and 1.

Encode 'a':
current range = 1 - 0 = 1
upper bound = 0 + (1 x 0.30) = 0.30
lower bound = 0 + (1 x 0.00) = 0.00

Encode 'c':
current range = 0.30 - 0.00 = 0.30
upper bound = 0.00 + (0.30 x 0.70) = 0.210
lower bound = 0.00 + (0.30 x 0.45) = 0.135

Encode 'e':
current range = 0.210 - 0.135 = 0.075
upper bound = 0.135 + (0.075 x 1.00) = 0.210
lower bound = 0.135 + (0.075 x 0.80) = 0.195

The string "ace" may be encoded by any value within the probability range [0.195, 0.210); pick 0.2.

(These steps follow the update rules W = R - L, L = L + W * C(xi), R = L + W * P(xi) from the Arithmetic Coding Algorithm slide.)
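The same three steps can be reproduced with a short Python loop (added for illustration), using the ranges from the table in Example 2:

# Re-run the interval update for "ace".
ranges = {"a": (0.00, 0.30), "b": (0.30, 0.45), "c": (0.45, 0.70),
          "d": (0.70, 0.80), "e": (0.80, 1.00)}

low, high = 0.0, 1.0
for symbol in "ace":
    width = high - low
    sym_low, sym_high = ranges[symbol]
    high = low + width * sym_high      # new upper bound (from the old lower bound)
    low = low + width * sym_low        # new lower bound
print(low, high)                       # approx. 0.195 0.21, so 0.2 is a valid code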

Decoding Example

Using the probability ranges from the table in Example 2, decode the three-character string encoded as 0.20.

Decode the first symbol:
0.20 is within [0.00, 0.30), so 0.20 encodes 'a'.
Remove the effect of 'a' from the encoded value:
current range = 0.30 - 0.00 = 0.30
encoded value = (0.20 - 0.00) / 0.30 = 0.67 (rounded)

Decode the second symbol:
0.67 is within [0.45, 0.70), so 0.67 encodes 'c'.
Remove the effect of 'c' from the encoded value:
current range = 0.70 - 0.45 = 0.25
encoded value = (0.67 - 0.45) / 0.25 = 0.88

Decode the third symbol:
0.88 is within [0.80, 1.00), so 0.88 encodes 'e'.

The decoded string is "ace".
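The decoding steps can likewise be reproduced in Python (added for illustration); the intermediate values differ slightly from the rounded ones above, but the decoded symbols are the same:

# Walk the code value 0.20 back through the ranges from Example 2.
ranges = {"a": (0.00, 0.30), "b": (0.30, 0.45), "c": (0.45, 0.70),
          "d": (0.70, 0.80), "e": (0.80, 1.00)}

value, decoded = 0.20, ""
for _ in range(3):                                    # the message length is known to be 3
    for symbol, (low, high) in ranges.items():
        if low <= value < high:
            decoded += symbol
            value = (value - low) / (high - low)      # remove the symbol's effect
            break
print(decoded)                                        # ace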

Interval Update

Algorithm (l is the lower bound, u is the upper bound):

lower bound = 0
upper bound = 1
while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range x upper bound of new symbol)
    lower bound = lower bound + (current range x lower bound of new symbol)
end while

The upper bound must be updated before the lower bound, so that both are computed from the old lower bound.


Example 3


Main Results on Uniqueness and Efficiency
