Data Compression

Terminology

Physical versus logical
– Physical
    Performed on data regardless of what information it contains
    Translates a series of bits to another series of bits
– Logical
    Knowledge-based
    Change "United Kingdom" to "UK"

Symmetric
– Compression and decompression use roughly the same techniques and take about as long
– Data transmission that requires compression and decompression on the fly needs this type of algorithm

Asymmetric
– Most common: compression takes much more time than decompression
    In an image database, each image is compressed once and decompressed many times
– Less common: decompression takes much more time than compression
    Creating many backup files that will hardly ever be read

Non-adaptive
– Uses a static dictionary of predefined substrings that are known to occur with high frequency

Adaptive
– Dictionary is built from scratch as the data is processed

Semi-adaptive
– In pass 1, an optimal dictionary is constructed
– In pass 2, the actual compression occurs

Lossless
– decompress(compress(data)) = data

Lossy
– decompress(compress(data)) ≠ data
– A small change in pixel values may be invisible, however
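The two definitions can be checked directly. The sketch below is only an illustration: it uses Python's standard `zlib` module as a stand-in lossless codec, and a toy lossy step (an assumption, not a method from these slides) that keeps only the top 6 bits of each 8-bit pixel. Packing the 6-bit values tightly is omitted.

```python
import zlib

def lossy_compress(pixels: bytes) -> bytes:
    """Toy lossy step (assumed for illustration): drop the two low-order bits."""
    return bytes(p >> 2 for p in pixels)

def lossy_decompress(packed: bytes) -> bytes:
    """Restore the original scale; the dropped bits cannot be recovered."""
    return bytes(p << 2 for p in packed)

pixels = bytes([124, 125, 122, 120, 122, 119, 117, 118])

# Lossless: the round trip reproduces the input exactly.
lossless_roundtrip = zlib.decompress(zlib.compress(pixels))

# Lossy: the round trip is close to the input, but not identical.
lossy_roundtrip = lossy_decompress(lossy_compress(pixels))
max_error = max(abs(a - b) for a, b in zip(pixels, lossy_roundtrip))
```

Here the lossy round trip changes each pixel by at most 3 gray levels out of 256, the kind of small change the slide describes as potentially invisible.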

Pixel Packing

Run-Length Encoding

A repeating string of characters, called a run, is coded into two bytes
– First byte contains the run count, one less than the number of repetitions
– Second byte contains the run value, the character being repeated

‘77777zzzyyyyyyV’ becomes ‘472z5y0V’
– 15-byte string becomes 8 bytes long
– Compression ratio of almost 2 to 1

Some strings become twice as long
– ‘7fu5JLY9jhYIujG’
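The scheme can be sketched in a few lines. This is a minimal sketch, assuming the run count is stored as a raw byte value (so one pair covers at most 256 repetitions) rather than as the digit characters used for readability on the slide; the function names are illustrative.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode each run as two bytes: (run length - 1, run value)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte repeats, up to 256 repetitions.
        while i + run < len(data) and data[i + run] == data[i] and run < 256:
            run += 1
        out.append(run - 1)   # run count, one less than the number of repetitions
        out.append(data[i])   # run value
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    """Expand each (count, value) pair back into count + 1 copies of value."""
    out = bytearray()
    for k in range(0, len(packed), 2):
        out.extend(bytes([packed[k + 1]]) * (packed[k] + 1))
    return bytes(out)

good = rle_encode(b"77777zzzyyyyyyV")   # counts 4, 2, 5, 0 as on the slide
bad = rle_encode(b"7fu5JLY9jhYIujG")    # no repeats: every byte costs two
```

The first string shrinks from 15 to 8 bytes, while the second, which contains no runs, grows from 15 to 30 bytes.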

Lempel-Ziv-Welch (LZW)

Lossless
Used in GIF, TIFF, the V.42bis modem compression standard, and PostScript Level 2
Substitutional or dictionary-based
– Algorithm builds a data dictionary
– A code is emitted if a pattern is found in the dictionary; if the pattern is not already in the dictionary, it is added
– The dictionary does not have to be sent for decompression; the decoder rebuilds it from the codes
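The following is a minimal sketch of the idea, assuming byte input, an initial dictionary of the 256 single-byte strings, and integer codes of unbounded width. Real formats such as GIF and TIFF add variable code widths and clear codes, which are left out here.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Emit the code of the longest known pattern, then add pattern + next byte."""
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                  # keep extending the pattern
        else:
            codes.append(dictionary[current])    # emit code for the known pattern
            dictionary[candidate] = next_code    # add the new pattern
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the same dictionary from the codes alone."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code refers to the entry that is just about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out += entry
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return bytes(out)

message = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(message)
```

The decoder never receives the dictionary: each entry it adds is implied by the previous pattern and the first byte of the current one.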

History
– 1977
    Abraham Lempel and Jacob Ziv published a paper on a universal data compression algorithm
    Called LZ77
– 1978
    Lempel and Ziv formulated an improved, dictionary-based data compression algorithm
    Called LZ78

– 1981
    While working for Sperry, Lempel and Ziv, with some other researchers, filed for a patent on LZ78
    Granted in 1984
– 1984
    While working for Sperry, Terry Welch modified LZ78
    Result was the LZW algorithm
    Published in IEEE Computer

– 1985
    Sperry was granted a patent for Welch’s modification and for the implementation of LZW
– 1986
    Sperry and Burroughs merged to form Unisys
    Ownership of the Sperry patent transferred to Unisys

– 1987
    CompuServe created the GIF file format
    Required use of the LZW algorithm
    CompuServe didn’t check patents for LZW
    Unisys also didn’t realize GIF used LZW
– 1988
    Aldus released Revision 5.0 of the TIFF file format
    Used the LZW algorithm
– 1990
    Unisys licensed Adobe for use of the LZW patent for PostScript

– 1991
    Unisys licensed Aldus for use of the LZW patent in TIFF
– 1993
    Unisys became aware that the GIF file format used LZW
    Negotiations began with CompuServe

– 1994
    Unisys and CompuServe came to an understanding under which CompuServe’s use of the LZW algorithm would be licensed for the application of the GIF file format in software used primarily to access the CompuServe Information Service
– 1995
    America Online and Prodigy also entered into license agreements with Unisys for LZW

GIF is not in the public domain

Some people were suspicious of CompuServe’s announcement that it was getting a license from Unisys
– In the programming community it had been known for many years that GIF used LZW and that LZW was patented by Unisys

– Unisys claimed that CompuServe found out only rather late that this was the case
– GIF was becoming an integral part of the WWW for exchanging low-resolution graphics

Eventually, Unisys’ LZW patent and licensing agreements held
– Unisys reduced license fees after 1995
– Unisys wouldn’t charge anything for inadvertent infringement by GIF software products delivered prior to 1995
    License fees were still required for updates delivered after 1995

It was not illegal to own, transmit, or receive GIF files, only to compress or decompress them without a license

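[Figure: the sequence below shown in a sliding window, split into a search buffer (already encoded) and a lookahead buffer (still to be encoded)]

3 1 2 5 1 3 1 4 1 2 5 1 5 5 1 5 5 1 4

The next symbol to encode, the first 4, has no match in the search buffer:
offset = 0
length = 0
Output is (0, 0, code(4))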

3 1 2 5 1 3 1 4 1 2 5 1 5 5 1 5 5 1 4

The lookahead buffer now starts with 1 2 5 1, which also occurs 7 positions back in the search buffer:
offset = 7
length = 4


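3 1 2 5 1 3 1 4 1 2 5 1 5 5 1 5 5 1 4

The lookahead buffer now starts with 5 1 5 5 1; the match begins 3 positions back and runs on into the lookahead buffer itself:
offset = 3
length = 5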
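The three offset/length pairs above can be reproduced with a small longest-match search. This is a sketch under stated assumptions: the buffer sizes (8 symbols of search buffer, 6 of lookahead) are guesses, since the original figure is not available, and `longest_match` is an illustrative name.

```python
def longest_match(data, pos, search_size=8, lookahead_size=6):
    """Return (offset, length) of the longest match for data[pos:] that starts
    inside the search buffer. The match may run on into the lookahead buffer."""
    best_offset, best_length = 0, 0
    # Keep one lookahead symbol free for the literal that ends each triple.
    max_length = min(lookahead_size, len(data) - pos) - 1
    for start in range(max(0, pos - search_size), pos):
        length = 0
        while length < max_length and data[start + length] == data[pos + length]:
            length += 1
        if length > best_length:
            best_offset, best_length = pos - start, length
    return best_offset, best_length

sequence = [3, 1, 2, 5, 1, 3, 1, 4, 1, 2, 5, 1, 5, 5, 1, 5, 5, 1, 4]

# Positions 7, 8 and 13 are where the three steps shown above begin.
steps = [longest_match(sequence, pos) for pos in (7, 8, 13)]
```

With these assumed buffer sizes, the three calls give (0, 0), (7, 4) and (3, 5), the same pairs as in the walkthrough.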


JPEG

Joint Photographic Experts Group

1982
– ISO (International Organization for Standardization) formed the Photographic Experts Group (PEG)
    To develop methods of transmitting video, images, and text over ISDN (Integrated Services Digital Network) lines

1986
– A subgroup of CCITT (International Telegraph and Telephone Consultative Committee) began to look at methods of compressing color and gray-scale data for fax transmission
– The methods were similar to those being considered by PEG

1987
– The two groups combined into JPEG

Most previous compression methods did a poor job of compressing continuous-tone image data

Very few file formats can support 24-bit raster images
– GIF only works for 256 colors
– LZW doesn’t work well on scanned image data
– TIFF and BMP didn’t compress this type of image data very well

JPEG compresses continuous-tone image data with a pixel depth of 6 to 24 bits with good efficiency

JPEG itself doesn’t define a standard file format

Toolkit of methods with a quality/compression trade-off

Lossy
– Discards information that the human eye cannot easily see
    Slight changes in color are not perceived well
    Slight changes in intensity are perceived well

Works well with color or gray-scale continuous-tone images: photographs, video stills, and complex graphics that resemble natural objects

Doesn’t work well for animations, ray tracing, line art, black-and-white documents, and typical vector graphics

The end user can tune the quality of a JPEG encoder through the Q-factor, which ranges from 1 to 100
– Q-factor = 1 produces the smallest, worst-quality images
– Q-factor = 100 produces the largest, best-quality images

The optimal value of the Q-factor is image dependent

JPEG introduces artifacts in images containing large areas of a single color

JPEG is slow if implemented in software

Baseline JPEG
– Minimal subset of JPEG that all JPEG-aware applications are required to support

Color transform
– JPEG encodes each component in a color model separately
– JPEG is independent of any color space model

– Best compression ratios result if a luminance (gray scale)/chrominance (color) color space, such as YUV, is used
    Human eyes are more sensitive to luminance information (Y) than to chrominance information (U, V)
    Other color models spread the information humans are sensitive to across all 3 of their components
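As an illustration of such a transform, the sketch below converts 8-bit RGB to YCbCr, a luminance/chrominance space closely related to the YUV named above. Using the JFIF conversion constants is an assumption; the slides do not say which variant is meant.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) using the JFIF constants.

    Y carries the luminance; Cb and Cr carry the chrominance, centered on 128.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

gray = rgb_to_ycbcr(128, 128, 128)   # a gray pixel has neutral chrominance
red = rgb_to_ycbcr(255, 0, 0)
```

A gray pixel maps to Y = 128 with both chrominance values at the neutral 128, which shows how all the brightness information ends up in the single Y component.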

Down-sampling
– Average groups of pixels together
– To exploit the eye’s lesser sensitivity to chrominance information, we use fewer pixels for the chrominance channels
    In an image of 1000 × 1000 pixels, we might use 1000 × 1000 luminance pixels but only 500 × 500 chrominance pixels
    Each chrominance pixel covers the same area as a 2 × 2 block of luminance pixels

– For each 2 × 2 block, we can store 6 pixel values (4 luminance values and 1 chrominance value for each of the 2 chrominance channels) instead of 12 (4 pixel values for each of 3 channels)
    This 50% reduction in data has almost no perceivable effect
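The averaging and the 6-versus-12 count can be sketched as follows, assuming each channel is a list of equal-length rows with even dimensions. The helper names are illustrative.

```python
def downsample_2x2(channel):
    """Replace each 2 x 2 group of samples in a channel by its average."""
    return [
        [
            (channel[r][c] + channel[r][c + 1]
             + channel[r + 1][c] + channel[r + 1][c + 1]) / 4
            for c in range(0, len(channel[0]), 2)
        ]
        for r in range(0, len(channel), 2)
    ]

def values_stored(width, height, subsample_chroma):
    """Count stored samples: one luminance channel plus two chrominance channels."""
    luma = width * height
    chroma = (width // 2) * (height // 2) if subsample_chroma else width * height
    return luma + 2 * chroma

small = downsample_2x2([[10, 20, 30, 40],
                        [30, 40, 50, 60]])
full = values_stored(1000, 1000, subsample_chroma=False)
reduced = values_stored(1000, 1000, subsample_chroma=True)
```

For the 1000 × 1000 image this gives 3,000,000 samples without down-sampling and 1,500,000 with it, the 50% reduction stated above.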

Discrete cosine transform
– For each color channel, the image data is divided into 8 × 8 blocks
– The DCT is applied to each block
    The low-order, or DC, term represents the average value in the block
    Successive higher-order, or AC, terms represent the strength of more rapid changes across the block

– High-frequency data can be discarded
– The DCT is lossless except for roundoff errors
– The DCT is the most costly step in JPEG

[Figure: scan order of each 8 × 8 block of pixels for the DCT]
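The figure itself is not available here. Baseline JPEG reads the 64 coefficients of each block in a zigzag order, from low to high frequencies, and the slide presumably showed that order (an assumption). A minimal sketch that generates it, with an illustrative function name:

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):                       # s = row + column
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        diagonal = [(r, s - r) for r in rows]        # top-right to bottom-left
        if s % 2 == 0:
            diagonal.reverse()                       # even diagonals run upward
        order.extend(diagonal)
    return order

order = zigzag_order()
```

The order starts (0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), so the DC term comes first and the highest-frequency terms, which are the most likely to be zero after quantization, come last.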

An 8 × 8 block from an 8-bit image

124 125 122 120 122 119 117 118
121 121 120 119 119 120 120 118
126 124 123 122 121 121 120 120
124 124 125 125 126 125 124 124
127 127 128 129 130 128 127 125
143 142 143 142 140 139 139 139
150 148 152 152 152 152 150 151
156 159 158 155 158 158 157 156

The DCT coefficients corresponding to the previous 8 × 8 block (the top-left entry is the DC coefficient; all others are AC coefficients)

  39.88   6.56  -2.24   1.22  -0.37  -1.08   0.79   1.13
-102.43   4.56   2.26   1.12   0.35  -0.63  -1.05  -0.48
  37.77   1.31   1.77   0.25  -1.50  -2.21  -0.10   0.23
  -5.67   2.24  -1.32  -0.81   1.41   0.22  -0.13   0.17
  -3.37  -0.74  -1.75   0.77  -0.62  -2.65  -1.30   0.76
   5.98  -0.13  -0.45  -0.77   1.99  -0.26   1.46   0.00
   3.97   5.52   2.39  -0.55  -0.051 -0.84  -0.52  -0.13
  -3.43   0.51  -1.07   0.87   0.96   0.09   0.33   0.01
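A direct, unoptimized 2-D DCT shows where these numbers come from. The sketch assumes the orthonormal DCT-II and a level shift of 128 before the transform; both are inferred from the DC value 39.88 in the table, not stated on the slides.

```python
import math

BLOCK = [
    [124, 125, 122, 120, 122, 119, 117, 118],
    [121, 121, 120, 119, 119, 120, 120, 118],
    [126, 124, 123, 122, 121, 121, 120, 120],
    [124, 124, 125, 125, 126, 125, 124, 124],
    [127, 127, 128, 129, 130, 128, 127, 125],
    [143, 142, 143, 142, 140, 139, 139, 139],
    [150, 148, 152, 152, 152, 152, 150, 151],
    [156, 159, 158, 155, 158, 158, 157, 156],
]

def dct_2d(block, shift=128):
    """Orthonormal 2-D DCT-II of a square block, after subtracting `shift`."""
    n = len(block)

    def scale(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):          # vertical frequency
        for v in range(n):      # horizontal frequency
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += ((block[x][y] - shift)
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

coeffs = dct_2d(BLOCK)
```

The DC term comes out as 39.875: the level-shifted samples sum to 319, and the orthonormal scaling divides that by 8. The large coefficient below it reflects the steady brightening from the top rows to the bottom rows of the block.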

Quantization
– Divide the DCT output by a quantization coefficient and round the result to an integer
    The larger the coefficient, the more data is lost
    Each of the 64 positions of the DCT output block has its own coefficient
      Higher-order terms have a larger coefficient
    Different coefficients are used for the luminance and chrominance channels

– This is the step controlled by the quality factor
– Selecting quantization coefficients is an art

Sample quantization table
– Coefficients based on human perception

16  11  10  16  24   40   51   61
12  12  14  19  26   58   60   55
14  13  16  24  40   57   69   56
14  17  22  29  51   87   80   62
18  22  37  56  68  109  103   77
24  35  55  64  81  104  113   92
49  64  78  87 103  121  120  101
72  92  95  98 112  100  103   99

Labels
– The label $\mathrm{lab}_{ij}$ corresponding to the quantized value of the transform coefficient $c_{ij}$ is

$$\mathrm{lab}_{ij} = \left\lfloor \frac{c_{ij}}{Q_{ij}} + 0.5 \right\rfloor$$

  where $Q_{ij}$ is the $(i,j)$th element of the quantization table

Quantizer labels corresponding to the previous 8 × 8 block

 2  1  0  0  0  0  0  0
-9  0  0  0  0  0  0  0
 3  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
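The labels can be recomputed from the DCT coefficients and the sample quantization table given earlier. This sketch copies both tables from the slides and applies floor(c / Q + 0.5) to each position; the `dequantize` helper, which shows what a decoder would recover, is an illustrative addition.

```python
import math

QUANT_TABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

DCT_COEFFS = [
    [39.88, 6.56, -2.24, 1.22, -0.37, -1.08, 0.79, 1.13],
    [-102.43, 4.56, 2.26, 1.12, 0.35, -0.63, -1.05, -0.48],
    [37.77, 1.31, 1.77, 0.25, -1.50, -2.21, -0.10, 0.23],
    [-5.67, 2.24, -1.32, -0.81, 1.41, 0.22, -0.13, 0.17],
    [-3.37, -0.74, -1.75, 0.77, -0.62, -2.65, -1.30, 0.76],
    [5.98, -0.13, -0.45, -0.77, 1.99, -0.26, 1.46, 0.00],
    [3.97, 5.52, 2.39, -0.55, -0.051, -0.84, -0.52, -0.13],
    [-3.43, 0.51, -1.07, 0.87, 0.96, 0.09, 0.33, 0.01],
]

def quantize(coeffs, table):
    """Label for each coefficient: floor(c / Q + 0.5)."""
    return [[math.floor(c / q + 0.5) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, table)]

def dequantize(labels, table):
    """Decoder side: multiply each label by its quantization coefficient."""
    return [[label * q for label, q in zip(l_row, q_row)]
            for l_row, q_row in zip(labels, table)]

labels = quantize(DCT_COEFFS, QUANT_TABLE)
restored = dequantize(labels, QUANT_TABLE)
nonzero = sum(1 for row in labels for label in row if label != 0)
```

Only 4 of the 64 labels are nonzero. The decoder recovers 32 for the DC term rather than 39.88, which is where the loss in JPEG is introduced.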

Encoding

Huffman-compress the resulting coefficients
– Arithmetic coding can be used as well

Huffman Coding

Lossless

The two least probable symbols, c and e, are merged into (ce):

Symbol  Probability        Symbol  Probability
a       .10                a       .10
b       .20                b       .20
c       .04                (ce)    .11
d       .10                d       .10
e       .07                f       .20
f       .20                g       .29
g       .29

Symbol        Probability  Symbol            Probability
(ad)          .20          ((ad)(ce))        .31
b             .20          b                 .20
(ce)          .11          f                 .20
f             .20          g                 .29
g             .29

Symbol        Probability  Symbol            Probability
((ad)(ce))    .31          (((ad)(ce))g)     .60
(bf)          .40          (bf)              .40
g             .29

[Figure: Huffman tree with leaves, left to right, a d c e g b f; each branch is labeled 0 or 1]

Symbol  Code
a       0000
b       10
c       0010
d       0001
e       0011
f       11
g       01
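The merging procedure can be sketched with a priority queue. This is a minimal sketch, not the exact construction on the slides: the probabilities are scaled to integer weights out of 100 to avoid floating-point ties, and ties between equal weights are broken by insertion order, so the individual codes may differ from the table above while the average code length stays the same.

```python
import heapq
from itertools import count

def huffman_codes(weights):
    """Build a prefix code by repeatedly merging the two lightest nodes."""
    tiebreak = count()  # unique counter so heap entries never compare nodes
    heap = [(w, next(tiebreak), symbol) for symbol, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}

    def assign(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix or "0"

    assign(heap[0][2], "")
    return codes

WEIGHTS = {"a": 10, "b": 20, "c": 4, "d": 10, "e": 7, "f": 20, "g": 29}
codes = huffman_codes(WEIGHTS)
total_bits = sum(WEIGHTS[s] * len(codes[s]) for s in WEIGHTS)  # per 100 symbols
```

Both this code and the one in the table average 2.62 bits per symbol (262 bits per 100 symbols), compared with 3 bits for a fixed-length code over seven symbols.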

Arithmetic Coding

Lossless

Symbol  Probability
a       .3
b       .2
c       .1
d       .4

String = cadd

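[Figure: successive subdivision of the interval as each symbol of cadd is read; * marks the subinterval selected at each step]

Symbol read   Current interval    Boundaries between a, b, c, d   Selected subinterval (*)
c             [0.0000, 1.0000)    0.3000, 0.5000, 0.6000          [0.5000, 0.6000)
a             [0.5000, 0.6000)    0.5300, 0.5500, 0.5600          [0.5000, 0.5300)
d             [0.5000, 0.5300)    0.5090, 0.5150, 0.5180          [0.5180, 0.5300)
d             [0.5180, 0.5300)    0.5216, 0.5240, 0.5252          [0.5252, 0.5300)

– The tag for the string cadd is any number in [0.5252, 0.5300)
– One such number is .10000111 in binary (0.52734375 in decimal)
– Thus, the code for cadd is 10000111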
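The interval narrowing can be reproduced exactly with rational arithmetic. This sketch assumes the symbols occupy the unit interval in the order a, b, c, d, as in the subdivision above; `tag_interval` is an illustrative name.

```python
from fractions import Fraction

PROBABILITIES = {
    "a": Fraction(3, 10),
    "b": Fraction(2, 10),
    "c": Fraction(1, 10),
    "d": Fraction(4, 10),
}

def tag_interval(message, probabilities):
    """Narrow [0, 1) once per symbol and return the final [low, high)."""
    # Lower end of each symbol's slice of the unit interval.
    cumulative, running = {}, Fraction(0)
    for symbol, p in probabilities.items():
        cumulative[symbol] = running
        running += p

    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        low += width * cumulative[symbol]   # move to the symbol's slice
        width *= probabilities[symbol]      # shrink to the slice's size
    return low, low + width

low, high = tag_interval("cadd", PROBABILITIES)
tag = Fraction(int("10000111", 2), 2 ** 8)   # binary .10000111
```

The final interval is [0.5252, 0.53), and the 8-bit binary fraction .10000111 = 0.52734375 falls inside it, so those 8 bits identify the string.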

JPEG Extensions

Progressive
– For applications that need to receive JPEG data streams and display them on the fly
– A baseline JPEG image can be displayed only after all of the image data has been received

– Instead of interlacing, where a majority of the image must be sent before the viewer can tell what it is, we send successively better-resolution images

Lossless JPEG

Fractal Compression

Suppose we have a linear, non-identity function of one variable, $g$, having $x_f$ as a fixed point
– $g(x_f) = x_f$

We can compute the fixed point by the sequence of approximations $x^*, g(x^*), g(g(x^*)), g(g(g(x^*))), \ldots$, where $x^*$ is any initial approximation (the sequence converges when the slope of $g$ has magnitude less than 1)

Example
– $f(x) = ax + b$
– The fixed point is the solution to $x_f = a x_f + b$, or

$$x_f = \frac{b}{1 - a}$$

– For $a = 0.5$, $b = 1$, we have $x_f = 2$

– To calculate the fixed point by the previous approximation, use the initial guess 1 and calculate g(1), g(g(1)), g(g(g(1))), …, where g(x) = x/2 + 1
    The approximations are 1.5, 1.75, 1.875, 1.9375, …, which converge to 2, the fixed point
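The iteration is easy to check numerically. The helper name `iterate` is an illustrative choice.

```python
def iterate(g, x0, steps):
    """Return g(x0), g(g(x0)), ... for the given number of steps."""
    values = []
    x = x0
    for _ in range(steps):
        x = g(x)
        values.append(x)
    return values

def g(x):
    return x / 2 + 1                        # a = 0.5, b = 1

approximations = iterate(g, 1.0, 4)
far_start = iterate(g, -1000.0, 80)[-1]     # a very different initial guess
closed_form = 1 / (1 - 0.5)                 # b / (1 - a)
```

The first four values are 1.5, 1.75, 1.875 and 1.9375. Starting from -1000 instead still ends at 2, because each step halves the distance to the fixed point.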

Given an image $I$, treated as an array of integers, suppose we have a non-identity function $g$ such that $g(I) = I$

If it were cheaper to encode $g$ than to encode $I$, we could communicate $g$ and reconstruct $I$ by the sequence of approximations $I_0, g(I_0), g(g(I_0)), g(g(g(I_0))), \ldots$, where $I_0$ is the all-zero image

Partition the image into equal-size range blocks

For each range block $R_k$, find a domain block $D_k$, twice the size of a range block, and a function $g_k$ such that $g_k(D_k) \approx R_k$

Consider the function $g = \bigcup_k g_k$
– This function has a fixed point $I_f = g(I_f)$, where $I_f \approx I$

$g_k$ is a composition of a geometric transformation followed by a massic transformation
– Geometric transformation
    Moves the domain block
    Changes the size of the domain block
– Massic transformation
    Adjusts the intensity and orientation of pixels
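A block transform of this kind might be sketched as below. The details are assumptions made for illustration: the geometric step shrinks the domain block by averaging 2 × 2 groups, the massic step applies a contrast factor and a brightness offset (hypothetical parameter names), and the orientation changes (rotations and flips) and the search for the best domain block are left out.

```python
def geometric(domain_block):
    """Shrink a 2n x 2n domain block to n x n by averaging 2 x 2 groups."""
    size = len(domain_block) // 2
    return [
        [
            (domain_block[2 * r][2 * c] + domain_block[2 * r][2 * c + 1]
             + domain_block[2 * r + 1][2 * c] + domain_block[2 * r + 1][2 * c + 1]) / 4
            for c in range(size)
        ]
        for r in range(size)
    ]

def massic(block, contrast, brightness):
    """Adjust pixel intensities: scale by contrast, then add brightness."""
    return [[contrast * p + brightness for p in row] for row in block]

def block_transform(domain_block, contrast, brightness):
    """Geometric transformation followed by massic transformation."""
    return massic(geometric(domain_block), contrast, brightness)

domain = [
    [8, 8, 4, 4],
    [8, 8, 4, 4],
    [0, 0, 12, 12],
    [0, 0, 12, 12],
]
approximation = block_transform(domain, contrast=0.5, brightness=10)
```

An encoder along these lines would store, for each range block, only the position of the chosen domain block and the two massic parameters. A contrast factor below 1 in magnitude keeps the overall map contractive, which is what lets the fixed-point iteration converge.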

[Figure: the function $g_k$ maps the domain block $D_k$ onto the range block $R_k$]
