Coding Methods in Embedded Computing Wayne Wolf Dept. of Electrical Engineering Princeton University


© 2004 Embedded Systems Group

Outline

Lv/Henkel/Lekatsas/Wolf: Adaptive dictionary method for bus encoding

Lin/Xie/Wolf: Dictionary coding for code compression


Adaptive bus encoding

Goal: Reduce bus energy
• A significant part of energy is related to I/O
• Inter-wire capacitances have a significant impact

Approach: Explore data properties
• Past success in address buses
• Few approaches for data buses

Results:
• 28% average power reduction
• One additional line for a 32-line bus
• No additional cycles
• Applies to both address and data buses


Related work

• Stan/Burleson [TVLSI95]: Bus-invert encoding
• Panda/Dutt [TVLSI99]: Reduce address bus switching by memory access exploitation
• Benini et al. [GLS-VLSI97]: T0 encoding
• Musoll et al. [TVLSI98]: Working zone encoding
• Sotiriadis/Chandrakasan [ICCAD00]: Transition pattern coding
• Kim et al. [DAC00]: Coupling-sensitive scheme


Bus model (I)

[Figure: general two-line bus, drawn as a distributed RC network with driver resistances $R_i$ and load capacitances $C_{Load}$.]

$c_I$: linear inter-wire capacitance
$c_L$: linear capacitance
$r$: linear resistance
$R_i$: internal driver resistance


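Bus model (II)

$En_s(V^I, V^f)$: energy related to transition
$En_i(V_1^I, V_2^I, V_1^f, V_2^f)$: energy related to inter-wire capacitance

[Figure: two-line bus with driver resistances $R_i$, line capacitances $C_L$, and inter-wire capacitance $C_I$.]

Simplify bus model by quantizing energy values: 0, 1, 2.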


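Bus model (III)

Bus energy for multiple-line buses:

$En(k) = C_L V_{dd}^2 \left\{ \sum_{i=0}^{N-2} En_i[V_i(k-1), V_{i+1}(k-1), V_i(k), V_{i+1}(k)] + \sum_{i=0}^{N-1} En_s[V_i(k-1), V_i(k)] \right\}$

$En_i$: energy related to inter-wire capacitance between adjacent lines
$En_s$: energy related to the transition on one line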
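A minimal Python sketch of the multi-line sum: one transition term per line plus one inter-wire term per adjacent pair, for the step from word k-1 to word k. The slides only say the energy values are quantized to 0, 1, 2; the two cost functions below (`en_s`, `en_i`) are assumptions chosen for illustration, and the result is in units of $C_L V_{dd}^2$.

```python
def bits(word, n):
    """Return the n line values of a bus word, line 0 first."""
    return [(word >> i) & 1 for i in range(n)]


def en_s(v_prev, v_next):
    # Assumed quantization: 1 unit if the line toggles, otherwise 0.
    return int(v_prev != v_next)


def en_i(a_prev, b_prev, a_next, b_next):
    # Assumed quantization for two adjacent lines a and b:
    # 0 if they move together (or both hold), 1 if only one toggles,
    # 2 if they toggle in opposite directions.
    return abs((a_next - a_prev) - (b_next - b_prev))


def bus_energy(prev_word, next_word, n=32):
    """Normalized energy of one bus transaction on an n-line bus."""
    p, q = bits(prev_word, n), bits(next_word, n)
    inter = sum(en_i(p[i], p[i + 1], q[i], q[i + 1]) for i in range(n - 1))
    trans = sum(en_s(p[i], q[i]) for i in range(n))
    return inter + trans
```

Under these assumed costs, two neighbours switching in opposite directions cost more than two neighbours switching together, which is the effect a pure Hamming-distance count misses.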


Source properties on data buses

Correlation of transition signaling code on adjacent lines: D(x) = n_x / N = transitions / total transactions

[Figure: D(x), from 0% to 100%, plotted against bit number for compress, ijpeg, li, adpcm enc, m88ksim, and gcc.]
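As a rough illustration of D(x) = n_x / N, the sketch below counts how often each bit line toggles over a trace of bus words. It assumes that N is the number of consecutive word pairs in the trace; the function name and the trace format are hypothetical.

```python
def transition_density(words, n=32):
    """Return D(x) = n_x / N for each bit line x over a trace of bus words."""
    if len(words) < 2:
        raise ValueError("need at least two bus words")
    total = len(words) - 1              # assumed N: word-to-word transactions
    counts = [0] * n                    # n_x: toggles seen on line x
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur               # 1 bits mark lines that toggled
        for x in range(n):
            counts[x] += (diff >> x) & 1
    return [c / total for c in counts]
```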


Source properties (II)

Adjacent bit lines in a word are correlated.

[Figure: percentage (0% to 100%) for adjacent bit-line pairs 0-1, 3-4, 6-7, ..., 30-31, for blowfishdec, CRC32, FFT, go, gsm, ispell, jpeg, lame, qsort, and rsynth.]


Source properties (III)

10 most frequently-occurring patterns:

[Figure: percentage (0% to 100%) covered by the 10 most frequent patterns for blowfishdec, CRC32, FFT, go, gsm, ispell, jpeg, lame, qsort, rsynth, and the average (AVG).]


Energy savings from different compression schemes.

Compare transition, interwire energy savings.

[Figure: savings (axis from -0.6 to 0.8) in size, transition energy (ET), and inter-wire energy (EI) for arithmetic, Huffman, and LZ77 coding.]


Dictionary techniques

Look up symbol strings in dictionary; replace with shorter code.

Types of dictionaries:
• Static dictionary
• Adaptive dictionary

Example dictionary entries: ‘is’, ‘the’, ‘are’, ‘do’


Approach

Use dictionary scheme to take advantage of frequent patterns.

Word divided into upper (key), index, and bypassed parts, from bit N down to bit 0:
• Upper part, width: N-wi-wo
• Index part, width: wi
• Bypassed part, width: wo
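The split can be sketched in a few lines of Python. The bit positions follow the slices shown on the architecture slide (Data[0:wo-1], Data[wo:wo+wi-1], Data[wo+wi:N-1]); the widths wi = 4 and wo = 4 and the function names are assumptions for illustration only.

```python
def split_word(word, wi=4, wo=4):
    """Split a bus word into (upper, index, bypass) parts."""
    bypass = word & ((1 << wo) - 1)            # Data[0 : wo-1]
    index = (word >> wo) & ((1 << wi) - 1)     # Data[wo : wo+wi-1]
    upper = word >> (wo + wi)                  # Data[wo+wi : N-1]
    return upper, index, bypass


def join_word(upper, index, bypass, wi=4, wo=4):
    """Inverse of split_word."""
    return (upper << (wi + wo)) | (index << wo) | bypass
```

For example, with these widths the 32-bit word 0xF1234FF0 splits into upper part 0xF1234F, index 0xF, and bypassed part 0x0.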


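Adaptive Dictionary Encoding Scheme (ADES)

[Figure: the input passes through a bus encoder (dictionary encoder), crosses the bus in compressible/compressed form together with a status line, and passes through a bus decoder (dictionary decoder) to the output.]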


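Encoder miss

[Figure: encoder miss example. Dictionary contents: 0xFFFF000, 0x0000000, 0x1234FFE, 0xFEFA830. The input word 0xF1234FF0 is split into upper, index, and non-compressed parts; the upper part 0xF1234FF is compared (=?) against the dictionary and matches no entry.]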


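Decoder hit

[Figure: decoder hit example. Dictionary contents: 0xFFFF000, 0x0000000, 0x1234FFE, 0xFEFA830. The upper bus lines are don't-care (X); the decoder reads 0x1234FFE from the dictionary and outputs 0x1234FFEB.]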


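Decoder miss

[Figure: decoder miss example. Dictionary contents: 0xFFFF000, 0x0000000, 0x1234FFE, 0xFEFA830. The decoder receives 0xF1234FF0 and writes the upper part 0xF1234FF into the dictionary.]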


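Architecture

[Figure: ADES encoder and decoder. Each side holds a dictionary of entries Upper_part0 ... Upper_partk. The encoder compares (=?) Data[wo+wi:N-1] with the dictionary and drives the upper bus lines through tri-state buffers; the decoder writes or reads (W/R) its dictionary. Bus signals: Data[0:wo-1] (width wo), Data[wo:wo+wi-1] (width wi), Data[wo+wi:N-1] (width N-wi-wo), and a 1-bit status line.]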
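The hit/miss behaviour can be sketched as two bus ends that keep identical dictionaries. This is a simplified model under stated assumptions, not the authors' implementation: the index part is assumed to address the dictionary directly, a hit is assumed to send status 1 with the upper lines left undriven (modeled as `None`), and a miss is assumed to send status 0 with the full word while both ends store the upper part. The class name `AdesEnd` and the widths are hypothetical.

```python
class AdesEnd:
    """One end of the bus (encoder or decoder) with its own dictionary copy."""

    def __init__(self, wi=4, wo=4):
        self.wi, self.wo = wi, wo
        self.table = [None] * (1 << wi)     # one upper part per index value

    def _split(self, word):
        bypass = word & ((1 << self.wo) - 1)
        index = (word >> self.wo) & ((1 << self.wi) - 1)
        upper = word >> (self.wo + self.wi)
        return upper, index, bypass

    def encode(self, word):
        """Return (status, upper, index, bypass) as placed on the bus."""
        upper, index, bypass = self._split(word)
        if self.table[index] == upper:
            return (1, None, index, bypass)  # hit: upper lines not driven
        self.table[index] = upper            # miss: remember the upper part
        return (0, upper, index, bypass)

    def decode(self, status, upper, index, bypass):
        """Rebuild the original word from what arrived on the bus."""
        if status == 1:
            upper = self.table[index]        # hit: read the dictionary
        else:
            self.table[index] = upper        # miss: write the dictionary
        return (upper << (self.wi + self.wo)) | (index << self.wo) | bypass
```

Because both ends update their tables on exactly the same transactions, the decoder can always recover the upper part on a hit without it crossing the bus.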


Area, delay, energy

• Area: 750 gates
• Energy: primarily consumed by a relatively small memory
• Latency: encoding/decoding can be finished in one cycle


Results: experimental setup

• SimpleScalar simulator
• 32-bit data bus
• Various real-world applications in SPEC95 and Mediabench

Application   Description
Adpcm-enc     ADPCM encoder for voice
Adpcm-dec     ADPCM decoder for voice
Compress      File compression program in UNIX system
Gcc           GNU C compiler
Go            Game program in SPEC95
Ijpeg         JPEG encoder/decoder program
Li            Lisp interpreter
M88ksim       A small operating system
Perl          Perl language interpreter


Results: detail

[Figure: energy reduction (0 to 0.3, where 1 = 100%) as a function of wo and wi.]


Results: comparison

Scheme         Avg. energy per mem. access   Avg. energy reduced   Additional lines   Gates (approx.)   Delay
Raw            1.94e-11 J                    0%                    N/A                N/A               N/A
BI4            1.90e-11 J                    2.5%                  4                  100               Low
WZE            1.60e-11 J                    17.8%                 4                  1800              High
TPC            3.28e-11 J                    -68.9%                12                 N/A               Low
ADES with BI   1.38e-11 J                    28.9%                 2                  750               Low


Results: graphical comparison of energy savings

[Figure: energy savings (0% to 40%) on data, address, and mixed buses for ADES, TPC, WZE, and GI.]


Summary: adaptive bus encoding

• Upcoming technologies induce inter-wire capacitances of the same order of magnitude as intrinsic capacitances
• Ordinary methods (e.g., Hamming-distance minimization) can't capture those effects
• ADES exploits information redundancy on data buses
  • Average 28% energy savings on data bus
  • Extendable to address buses
  • Low cost


Code compression

• Memory size is critical for embedded systems
• Program size grows with application complexity
• Code compression is one way to reduce code size
• Code size grows when RISC or VLIW architectures are used
• Improved VLIW code compression is needed

[Figure: code size of MPEG2 encoder in kByte (axis 0 to 200) for Intel x86, Thumb, Sharc, TMS320C6x, and IA-64; bar labels 50.6, 68.2, 106.2, 149.1, 182.1 (Xie, 2002).]


Requirements on code compression

• Random access
  • Start decompression at block boundaries
  • Synchronize model and arithmetic coder
• Byte alignment
  • Faster decoding
  • Easier and more compact indexing
• Indexing
  • LAT
• Patching branch offsets (only for code compression)

[Figure: an index (Base, l1, l2, l3, ..., lk) locating compressed blocks b1 to b4 (block1 to block4) in memory.]
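A small sketch of how an index table supports random access when blocks are byte aligned. LAT is assumed here to mean a line address table that stores one byte offset per compressed block; the function names and the in-memory layout are hypothetical.

```python
def build_lat(compressed_blocks, base=0):
    """Return the byte offset of each compressed block, starting at `base`."""
    lat, offset = [], base
    for blk in compressed_blocks:
        lat.append(offset)
        offset += len(blk)      # byte alignment: every block is whole bytes
    return lat


def fetch_block(memory, lat, k):
    """Slice compressed block k out of memory using the table."""
    start = lat[k]
    end = lat[k + 1] if k + 1 < len(lat) else len(memory)
    return memory[start:end]
```

With such a table, a branch to block k needs one table lookup before decompression can start at that block boundary.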


Previous work

• Wolfe and Chanin (1992)
• IBM CodePack (1998)
• Larin and Conte (1999)
  • Huffman coding
• Xie et al. (2001-02)
  • F2VCC and V2FCC

[Figure: PowerPC 40x embedded processor with cache, connected over the processor local bus to a decompression core (with decoder table) and external memory.]


Our approach

Problem definition
• Propose code compression schemes to reduce code size on VLIW embedded systems
• Texas Instruments’ TMS320C6x VLIW DSP

Our contribution
• Branch blocks
  • Branch targets are fixed once the code is compiled
  • Average: 80.1 blocks, 454 bytes
• LZW-based code compression schemes
• Selective code compression schemes


Compression/decompression

Compression engine:
1. Read new data.
2. Branch target? If yes, refresh the table.
3. Generate the codeword.
4. Output.

Decompression engine:
1. Read a codeword.
2. Branch target? If yes, refresh the coding table.
3. Read the coding table.
4. Output the phrase indicated by the codeword.
5. Update the coding table if necessary.



Decompression architecture

Works for pre-/post-cache:

[Figure: (a) Memory (compressed code) -> I-cache (compressed code) -> decompression engine with table -> processor (original code). (b) Memory (compressed code) -> decompression engine with table -> I-cache (original code) -> processor (original code).]


LZW data compression

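Input: a a b ab aba aa
Output: 0 0 1 3 5 2

[Figure: the compression engine looks up the longest phrase in its table (N+1 entries) and emits a codeword; the decompression engine looks up the codeword in its table (N entries) to recover the original phrase, with a special case when the codeword equals N.]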

Welch (1984) modified Ziv-Lempel (1978)

Generate coding table on-the-fly

Search for the longest phrase already in the table

Output the index of the phrase

Add the phrase with the next element as a new table entry

Decompression lags compression by one codeword


Example

Index   Phrase   Derivation
0       a        Initial
1       b        Initial
2       aa       0 + a
3       ab       0 + b
4       ba       1 + a
5       aba      3 + a
6       abaa     5 + a
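The example can be reproduced with a short textbook LZW sketch in Python. The function names are hypothetical; the initial table holds a = 0 and b = 1 as in the table above. The decoder shows the one-codeword lag: when it receives a code that is not yet in its table, the phrase must be the previous phrase plus that phrase's first symbol.

```python
def lzw_compress(data, alphabet):
    """Return the list of codewords for `data`, building the table on the fly."""
    table = {sym: i for i, sym in enumerate(alphabet)}
    out, phrase = [], ""
    for sym in data:
        if phrase + sym in table:
            phrase += sym                       # keep extending the match
        else:
            out.append(table[phrase])           # emit longest known phrase
            table[phrase + sym] = len(table)    # add phrase + next element
            phrase = sym
    if phrase:
        out.append(table[phrase])
    return out


def lzw_decompress(codes, alphabet):
    """Return the list of phrases encoded by `codes`."""
    table = list(alphabet)
    if not codes:
        return []
    prev = table[codes[0]]
    phrases = [prev]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:                                   # code not in the table yet
            entry = prev + prev[0]
        table.append(prev + entry[0])           # decoder table catches up
        phrases.append(entry)
        prev = entry
    return phrases
```

Compressing the string "aabababaaa" with the alphabet "ab" yields the codewords 0 0 1 3 5 2; codeword 5 is the case where the decoder has to rebuild the phrase aba before the entry exists in its own table.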



LZW-based code compression

• Use BYTE (0x00 ~ 0xFF) as basic element.
• Variable-to-fixed code compression.

Longer codeword means:
• Larger table (exponentially)
• More decompression overhead
• Useless when the block is too small
• Use more bits to encode same phrase
• CR: 83, 83, 84, 87% for 9-12 bit LZW

Wider decoding table means:
• Larger table (linearly)
• Wider decoding bandwidth
• Less than 1% CR difference for 8-20 bytes
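To make the codeword-length trade-off concrete, here is a hedged sketch of byte-wise LZW on a single block with a fixed codeword width. It assumes a fresh 256-entry table per block, a table that stops growing at 2**code_bits entries, byte-aligned compressed blocks, and a compression ratio defined as compressed size over original size; none of these details are spelled out in the slides, and the function names are hypothetical.

```python
def lzw_block_codes(block, code_bits):
    """LZW over bytes with a fresh table capped at 2**code_bits entries."""
    table = {bytes([b]): b for b in range(256)}
    limit = 1 << code_bits
    codes, phrase = [], b""
    for b in block:
        cand = phrase + bytes([b])
        if cand in table:
            phrase = cand
        else:
            codes.append(table[phrase])
            if len(table) < limit:          # table full: stop adding phrases
                table[cand] = len(table)
            phrase = bytes([b])
    if phrase:
        codes.append(table[phrase])
    return codes


def compressed_size(block, code_bits):
    """Bytes needed for the block's codewords, rounded up to a whole byte."""
    n_codes = len(lzw_block_codes(block, code_bits))
    return (n_codes * code_bits + 7) // 8


def compression_ratio(blocks, code_bits):
    """Assumed definition: total compressed bytes / total original bytes."""
    original = sum(len(b) for b in blocks)
    compressed = sum(compressed_size(b, code_bits) for b in blocks)
    return compressed / original
```

A two-byte block costs two 9-bit codewords (3 bytes after alignment), which is larger than the original; this is the small-block effect listed above.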


Compression ratio vs. codeword size for two examples

[Figure: compression ratio (axis 0.60 to 1.30) vs. codeword size (9-bit to 16-bit) for ADPCM decoder (small) and MPEG2ENC (large). Data labels: 0.73, 0.84, 0.88, 0.96, 1.04, 1.12, 1.20, 1.28 and 0.72, 0.79, 0.77, 0.76, 0.74, 0.74, 0.78, 0.84.]


Compression ratio vs. codeword size on benchmark set


Selective code compression

Motivation
• Branch blocks vary in size
• No benefit to use longer codeword if the block cannot fill up the coding table
• Only 12.8% of the branch blocks can fill up 9-bit LZW table
• Only < 1% of the branch blocks can fill up 12-bit LZW table

Selective code compression
• Apply different compression methods on different branch blocks
• Block size, instruction frequency, … are collected during profiling
• Profile is used to determine the compression method

[Figure: flow from source program to branch blocks, profiling, method selection, compression, and compressed code.]


Selective compression (cont’d.)

Minimum table-usage selective compression (MTUSC)
• Calculate the number of phrases generated during compression
• Select the smallest table that all the phrases could fit in
• Average compression ratio is 79.2%

Minimum code-size selective compression (MCSSC)
• Some compressed blocks use more bytes than original data
• Compress the blocks using different codeword lengths
• The smallest compressed or uncompressed block is selected
• Average compression ratio is 76.8%

Dynamic LZW
• Codeword length grows as compression goes on
• 75.8% and 75.2% for MTUSC and MCSSC, respectively
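The two selection rules can be sketched as follows. This is an illustration under assumptions, not the authors' tool: byte-wise LZW with a fresh 256-entry table per block, candidate codeword widths of 9 to 12 bits, and byte-aligned output. The names `lzw_code_count`, `select_mtusc`, and `select_mcssc` are hypothetical.

```python
def lzw_code_count(block, code_bits=None):
    """Return (codewords emitted, table entries used) for one block.

    code_bits=None lets the table grow without limit.
    """
    table = {bytes([b]) for b in range(256)}    # membership is all we need
    limit = None if code_bits is None else 1 << code_bits
    n_codes, phrase = 0, b""
    for b in block:
        cand = phrase + bytes([b])
        if cand in table:
            phrase = cand
            continue
        n_codes += 1
        if limit is None or len(table) < limit:
            table.add(cand)
        phrase = bytes([b])
    if phrase:
        n_codes += 1
    return n_codes, len(table)


def packed_bytes(n_codes, code_bits):
    return (n_codes * code_bits + 7) // 8


def select_mtusc(block, widths=(9, 10, 11, 12)):
    """MTUSC: smallest codeword width whose table holds every phrase."""
    _, entries = lzw_code_count(block)
    for w in widths:
        if entries <= (1 << w):
            return w
    return widths[-1]


def select_mcssc(block, widths=(9, 10, 11, 12)):
    """MCSSC: (size in bytes, width); width None means keep uncompressed."""
    best = (len(block), None)
    for w in widths:
        n_codes, _ = lzw_code_count(block, w)
        best = min(best, (packed_bytes(n_codes, w), w), key=lambda t: t[0])
    return best
```

For a two-byte block, `select_mcssc` keeps the block uncompressed because every candidate width produces at least three bytes.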


Experiments

Benchmarks
• Collected from Texas Instruments and Mediabench

Compression ratio
• Longer codewords work better in large benchmarks
• Dynamic MCSSC is always the best


Compression ratio vs. algorithm


Average throughput

1.72 bytes for 12-bit LZW and 1.82 bytes for dynamic MCSSC

[Figure: average throughput in bytes (axis 1.3 to 2.1) for 12-bit LZW, MTUSC, MCSSC (4), MCSSC (32), and MCSSC (d).]


Parallel decompression

Parallel decompression
• Execution time: 0.51x, 0.27x, 0.14x
• Throughput: 3.31, 6.37, 12.29 bytes

Hardware features
• 2-30 kBytes decoding table
• < 4500 µm² using TSMC 0.25 µm model
• 5508 cycles to decompress 9344 bytes ADPCM decoder
• 90k cycles to decompress 182k bytes MPEG-2 encoder

[Figure: with current code = 300, two decoders DC1 and DC2 handle codes 295 and 277 in one case and codes 295 and 301 in the other.]


Comparison with previous work

Work           Target    Method     Compression ratio   Decompression overhead   Decompression bandwidth       Decoding
Wolfe/Chanin   MIPS      Huffman    73%                 < 1 mm²                  1 byte                        serial
CodePack       PowerPC   CodePack   60%                 < 1 mm²                  1 byte                        serial
Lekatsas       MIPS      SAMC       57%                 4K table                 NA                            serial
Xie            TMS320    F2V        65%                 6-48K table              4.9 bits avg, 13 bits max     IID is parallel
                         V2F        70%-82%             2-30K table              89 bits max
Us             C6x       LZW        83%-87%             < 0.05 mm²               1.3-1.7 avg                   parallel
                         MCSSC      75%                 30K table                1.8 bytes avg, 13 bytes max   parallel


Code compression summary

• We proposed code compression schemes that use branch blocks as the compression unit.
• Compression ratio is around 83% for LZW-based compression and around 75% for selective compression.
• Low power is achieved because less memory is required.
• Compared to previous work, our schemes have less decompression overhead and larger decompression bandwidth, with a comparable compression ratio.
• Parallel decompression could be applied to achieve faster decompression, which suits VLIW architectures.
• Compiler techniques could be used to generate programs more suitable for code compression.
• Other schemes that take advantage of branch blocks remain to be found.
