Coding Methods in Embedded Computing
Wayne Wolf
Dept. of Electrical Engineering
Princeton University
© 2004 Embedded Systems Group
Outline
Lv/Henkel/Lekatsas/Wolf: Adaptive dictionary method for bus encoding
Lin/Xie/Wolf: Dictionary coding for code compression
Adaptive bus encoding
Goal: Reduce bus energy
• A significant part of energy is related to I/O
• Significant impact of inter-wire capacitances
Approach: Explore data properties
• Past success in address buses
• Few approaches for data buses
Results: 28% average power reduction
• One additional line for a 32-line bus
• No additional cycles
Applies to both address and data buses
Related work
Stan/Burleson [TVLSI95]: Bus-invert encoding
Panda/Dutt [TVLSI99]: Reduce address bus switching by memory access exploitation
Benini et al. [GLS-VLSI97]: T0 encoding
Musoll et al. [TVLSI98]: Working zone encoding
Sotiriadis/Chandrakasan [ICCAD00]: Transition pattern coding
Kim et al. [DAC00]: Coupling sensitive scheme
Bus model (I)
[Figure: general two-line bus]
$R_i$: internal driver resistance
$r$: linear resistance
$c_L$: linear capacitance
$c_I$: linear inter-wire capacitance
$l$: line length
Bus model (II)
[Figure: two-line bus model with driver resistance $R_i$, line capacitance $C_L$, and inter-wire capacitance $C_I$]
$E_{ns}(V^I, V^f)$: energy related to transition
$E_{ni}(V_1^I, V_2^I, V_1^f, V_2^f)$: energy related to inter-wire capacitance
Simplify bus model by quantizing energy values: 0, 1, 2.
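Bus model (III)
Bus energy for multiple-line buses:
$$E_n(k) = C_L V_{dd}^2 \left\{ \sum_{i=0}^{N-2} E_{ni}[V_i(k-1), V_{i+1}(k-1), V_i(k), V_{i+1}(k)] + \sum_{i=0}^{N-1} E_{ns}[V_i(k-1), V_i(k)] \right\}$$
The first sum is the energy related to inter-wire capacitance between adjacent lines; the second sum is the energy related to transitions on each line.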
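The multi-line energy expression can be sketched in a few lines of Python. The slides only say that energy values are quantized to 0, 1, 2; the specific quantization below (1 unit per toggling line, and the magnitude of the change in voltage difference between adjacent lines for the inter-wire term) is an assumption made for illustration, as are the function names.

```python
def bit(word, i):
    """Logic value (0 or 1) of line i in a bus word."""
    return (word >> i) & 1

def ens(v_prev, v_cur):
    # Assumed quantization: 1 unit when the line toggles, else 0.
    return int(v_prev != v_cur)

def eni(a_prev, b_prev, a_cur, b_cur):
    # Assumed quantization: change in the voltage difference between two
    # adjacent lines, which takes the values 0, 1, or 2.
    return abs((a_cur - b_cur) - (a_prev - b_prev))

def bus_energy(prev, cur, n, c_l=1.0, v_dd=1.0):
    """Energy of the transaction prev -> cur on an n-line bus."""
    inter = sum(eni(bit(prev, i), bit(prev, i + 1), bit(cur, i), bit(cur, i + 1))
                for i in range(n - 1))
    trans = sum(ens(bit(prev, i), bit(cur, i)) for i in range(n))
    return c_l * v_dd ** 2 * (inter + trans)

# Two lines switching in opposite directions cost more than two lines
# switching in the same direction.
opposite = bus_energy(0b01, 0b10, 2)
same = bus_energy(0b00, 0b11, 2)
```

Under this assumed quantization, a Hamming-distance count alone would rate both transactions equally, while the inter-wire term separates them.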
Source properties on data buses
Correlation of transition signaling code on adjacent lines: $D(x) = n_x/N$ = transitions / total transactions
[Figure: D(x), from 0% to 100%, versus bit number for compress, ijpeg, li, adpcm enc, m88ksim, and gcc]
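As a rough sketch of how a statistic like D(x) might be gathered from a bus trace, the following Python counts, for each bit position x, the fraction of transactions in which that line changes. Reading x as a bit position and N as the number of consecutive word pairs is an assumption; the function name and the sample trace are hypothetical.

```python
def transition_frequency(words, width=32):
    """Return D(x) = n_x / N for x in 0..width-1, where n_x is the number of
    transactions in which bit x differs from the previous word."""
    if len(words) < 2:
        raise ValueError("need at least two words to observe a transition")
    counts = [0] * width
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur  # 1 bits mark the lines that toggled
        for x in range(width):
            counts[x] += (diff >> x) & 1
    n = len(words) - 1     # assumed meaning of "total transactions"
    return [c / n for c in counts]

# Hypothetical 4-bit trace: bit 0 toggles twice, bit 1 once, bits 2-3 never.
d = transition_frequency([0x0, 0x1, 0x3, 0x2], width=4)
```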
Source properties (II)
Adjacent bit lines in a word are correlated.
[Figure: correlation, from 0% to 100%, for adjacent bit pairs 0-1, 3-4, 6-7, ..., 30-31, for blowfishdec, CRC32, FFT, go, gsm, ispell, jpeg, lame, qsort, and rsynth]
Source properties (III)
10 most frequently-occurring patterns:
[Figure: share of bus traffic, from 0% to 100%, covered by the 10 most frequent patterns for blowfishdec, CRC32, FFT, go, gsm, ispell, jpeg, lame, qsort, rsynth, and the average (AVG)]
Energy savings from different compression schemes.
Compare transition, interwire energy savings.
[Figure: bar chart, values from -0.6 to 0.8, comparing Arithmetic, Huffman, and LZ77 coding on Size, ET (transition energy), and EI (inter-wire energy)]
Dictionary techniques
Look up symbol strings in dictionary; replace with shorter code.
Types of dictionaries:
• Static dictionary
• Adaptive dictionary
[Figure: example dictionary with entries 'is', 'the', 'are', 'do']
Approach
Use dictionary scheme to take advantage of frequent patterns.
Word divided into upper (key), index, and bypassed parts, from bit N-1 down to bit 0:
• Upper part, width: N-wi-wo
• Index part, width: wi
• Bypassed part, width: wo
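The three-way split can be illustrated with plain bit operations. This is a minimal sketch: the function name is hypothetical, and the default widths (N = 32, wi = 2, wo = 2) are assumptions chosen to agree with the four-entry dictionary and hexadecimal values shown on the encoder and decoder slides.

```python
def split_word(word, n=32, wi=2, wo=2):
    """Split an n-bit bus word into (upper, index, bypassed) parts.

    Bits [0, wo) are bypassed, bits [wo, wo+wi) select a dictionary entry,
    and the remaining n-wi-wo bits form the upper part (the key).
    """
    bypassed = word & ((1 << wo) - 1)
    index = (word >> wo) & ((1 << wi) - 1)
    upper = (word >> (wo + wi)) & ((1 << (n - wi - wo)) - 1)
    return upper, index, bypassed

upper, index, bypassed = split_word(0x1234FFEB)
```

With these widths, the word 0x1234FFEB splits into upper part 0x1234FFE, index 2, and bypassed part 3.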
Adaptive Dictionary Encode Scheme (ADES)
[Figure: block diagram. Input -> bus encoder (dictionary encoder) -> bus and status line -> bus decoder (dictionary decoder) -> output. The encoder marks a word as compressible and sends it compressed.]
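Encoder miss
[Figure: encoder with a four-entry dictionary]
Dictionary entries 0-3: 0xFFFF000, 0x0000000, 0x1234FFE, 0xFEFA830
Input word: 0xF1234FF0 (upper part 0xF1234FF, index part 0, non-compressed part 0)
Comparison (=?): entry 0 holds 0xFFFF000, which differs from the upper part 0xF1234FF, so this is a miss
Output: status line 0, index part 0, non-compressed part 0, upper part 0xF1234FF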
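Decoder hit
[Figure: decoder with a four-entry dictionary]
Dictionary entries 0-3: 0xFFFF000, 0x0000000, 0x1234FFE, 0xFEFA830
Input: status line 1, index part 2, non-compressed part 3, upper part X (don't care)
Dictionary operation: read entry 2
Output word: 0x1234FFEB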
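Decoder miss
[Figure: decoder with a four-entry dictionary]
Dictionary entries 0-3: 0xFFFF000, 0x0000000, 0x1234FFE, 0xFEFA830
Input: status line 0, index part 0, non-compressed part 0, upper part 0xF1234FF
Dictionary operation: write 0xF1234FF to entry 0
Output word: 0xF1234FF0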
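The encoder-miss, decoder-hit, and decoder-miss cases can be replayed with a small behavioral model. This is only a sketch under stated assumptions, not the hardware design: the widths (wi = 2, wo = 2) are inferred from the slide values, the class and method names are hypothetical, and holding the previous value on the upper lines during a hit is an assumption about how transitions are avoided.

```python
WI, WO = 2, 2  # assumed index and bypassed widths (four-entry dictionary)

def split(word):
    """Return (upper, index, bypassed) for a 32-bit word."""
    return word >> (WI + WO), (word >> WO) & ((1 << WI) - 1), word & ((1 << WO) - 1)

class Encoder:
    def __init__(self, entries):
        self.dictionary = list(entries)
        self.upper_lines = 0  # value currently driven on the upper bus lines

    def encode(self, word):
        upper, index, bypassed = split(word)
        hit = self.dictionary[index] == upper
        if not hit:
            # Miss: send the full word and update the dictionary entry.
            self.dictionary[index] = upper
            self.upper_lines = upper
        # Hit (assumption): upper lines keep their old value, so they do not toggle.
        return int(hit), self.upper_lines, index, bypassed

class Decoder:
    def __init__(self, entries):
        self.dictionary = list(entries)

    def decode(self, status, upper_lines, index, bypassed):
        if status:                      # hit: read the upper part from the dictionary
            upper = self.dictionary[index]
        else:                           # miss: take it from the bus and write it back
            upper = upper_lines
            self.dictionary[index] = upper
        return (upper << (WI + WO)) | (index << WO) | bypassed

entries = [0xFFFF000, 0x0000000, 0x1234FFE, 0xFEFA830]
enc, dec = Encoder(entries), Decoder(entries)
miss = enc.encode(0xF1234FF0)   # encoder miss from the slide
hit = enc.encode(0x1234FFEB)    # entry 2 already holds 0x1234FFE
```

Because both sides apply the same update on a miss, the two dictionaries stay identical without any extra traffic; only the one status line tells the decoder which case applies.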
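ADES architecture
[Figure: encoder and decoder datapaths. Each side holds a dictionary of upper parts (Upper_part 0, Upper_part 1, ..., Upper_part k). Data[wo:wo+wi-1] (width wi) selects the entry, Data[wo+wi:N-1] (width N-wi-wo) is compared (=?) with it in the encoder, and Data[0:wo-1] (width wo) is bypassed. The comparison result drives the status line (width 1) and a tri-state buffer; the decoder dictionary has a write/read (W/R) control.]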
Area, delay, energy
Area: 750 gates
Energy: primarily consumed by a relatively small memory
Latency: encoding/decoding can be finished in one cycle
Results: experimental setup
SimpleScalar simulator
32-bit data bus
Various real-world applications in SPEC95 and Mediabench

Application | Description
Adpcm-enc | ADPCM encoder for voice
Adpcm-dec | ADPCM decoder for voice
Compress | File compression program in UNIX system
Gcc | GNU C compiler
Go | Go is a game program in SPEC95
Ijpeg | JPEG encoder/decoder program
Li | Lisp interpreter
M88ksim | A small operating system
Perl | Perl language interpreter
Results: detail
[Figure: energy reduction (1 = 100%), from 0 to 0.3, as a function of wo and wi]
Results: comparison
Scheme | Avg. energy per mem. access | Avg. energy reduced | Num. of additional lines | Number of gates (approximately) | Delay
Raw | 1.94e-11 J | 0% | N/A | N/A | N/A
BI4 | 1.90e-11 J | 2.5% | 4 | 100 | Low
WZE | 1.60e-11 J | 17.8% | 4 | 1800 | High
TPC | 3.28e-11 J | -68.9% | 12 | N/A | Low
ADES with BI | 1.38e-11 J | 28.9% | 2 | 750 | Low
Results: graphical comparison of energy savings
[Figure: bar chart of energy savings, from 0% to 40%, on Data, Addr, and Mix buses for ADES, TPC, WZE, and GI]
Summary: adaptive bus encoding
Upcoming technologies induce inter-wire capacitances of the same order of magnitude as intrinsic capacitances
Ordinary methods (e.g., Hamming-distance minimization) can't capture those effects
Exploits information redundancy on data buses
ADES:
• Average 28% energy savings on data bus
• Extendable to address buses
• Low cost
Code compression
Memory size is critical for embedded systems
Program size grows with application complexity
Code size grows as RISC or VLIW is used
Code compression is a solution to reduce code size
Improved VLIW code compression is needed
[Figure: bar chart, code size of MPEG2 encoder in kBytes (axis 0 to 200) for Intel x86, Thumb, Sharc, TMS320C6x, and IA-64; bar labels 50.6, 68.2, 106.2, 149.1, 182.1 (Xie, 2002)]
Requirements on code compression
Random access
• Start decompression at block boundaries
• Synchronize model and arithmetic coder
Byte alignment
• Faster decoding
• Easier and more compact indexing
Indexing
• LAT (line address table)
Patching branch offsets (only for code compression)
[Figure: compressed blocks (block1 ... block4, boundaries b1 ... b4) located through an index holding a base and entries l1, l2, l3, ..., lk]
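One way to picture the indexing requirement is a table that maps each block to the address where its compressed form starts, so decompression can begin at any block boundary. The sketch below assumes the index stores a base address plus per-block compressed lengths, which is one reading of the figure; the function name and the numbers are hypothetical.

```python
from itertools import accumulate

def build_lat(base, block_lengths):
    """Return the start address of each compressed block.

    Block k starts at base plus the sum of the compressed lengths (in bytes)
    of all blocks before it; byte alignment keeps every entry a whole address.
    """
    offsets = accumulate([0] + list(block_lengths[:-1]))
    return [base + off for off in offsets]

# Hypothetical example: four compressed blocks of 5, 3, 7, and 4 bytes.
lat = build_lat(0x1000, [5, 3, 7, 4])
```

A branch to the third block would then fetch compressed bytes starting at `lat[2]`, without decoding the two blocks before it.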
Previous work
Wolfe and Chanin (1992)
IBM CodePack (1998)
Larin and Conte (1999)
• Huffman coding
Xie et al. (2001-02)
• F2VCC and V2FCC
[Figure: block diagram with PowerPC 40x embedded processor, cache, processor local bus, decompression core, decoder table, and external memory]
Our approach
Problem definition
• Propose code compression schemes to reduce code size on VLIW embedded systems
• Texas Instruments' TMS320C6x VLIW DSP
Our contribution
• Branch blocks
  • Branch targets are fixed once the code is compiled
  • Average: 80.1 blocks, 454 bytes
• LZW-based code compression schemes
• Selective code compression schemes
Compression/decompression
[Figure: flowcharts of the two engines]
Compression engine: read new data -> branch target? (if yes, refresh the coding table) -> generate codeword output -> update the coding table if necessary.
Decompression engine: read a codeword -> branch target? (if yes, refresh the coding table) -> read the coding table -> output the phrase indicated by the codeword -> update the coding table if necessary.
Decompression architecture
Works for pre-/post-cache:
(a) Memory (compressed code) -> I-cache (compressed code) -> decompression engine with table -> processor (original code)
(b) Memory (compressed code) -> decompression engine with table -> I-cache (original code) -> processor (original code)
LZW data compression
Welch (1984) modified Ziv-Lempel (1978)
Generate coding table on the fly
Search for the longest phrase already in the table
Output the index of the phrase
Add the phrase with the next element as a new table entry
Decompression lags compression by one codeword
Input: a a b ab aba aa
Output: 0 0 1 3 5 2
[Figure: compression engine and decompression engine, each with its own table. The compressor sends the codeword of the longest phrase and the decompressor outputs the original phrase. Table labels: N+1 on the compression side, N on the decompression side, with a check "= N?".]
Example
Index | Phrase | Derivation
0 | a | Initial
1 | b | Initial
2 | aa | 0 + a
3 | ab | 0 + b
4 | ba | 1 + a
5 | aba | 3 + a
6 | abaa | 5 + a

Input: a a b ab aba aa
Output: 0 0 1 3 5 2
[Figure: compression engine and decompression engine with their tables, as on the previous slide]
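The table above can be reproduced with a short textbook LZW sketch in Python. The function names are hypothetical and the two-symbol initial alphabet follows the slide. The decoder branch for a codeword that is not yet in its table is the standard LZW handling of the case where decompression lags compression by one entry; in this example it is triggered by codeword 5.

```python
def lzw_compress(data, alphabet):
    """Return (codewords, table) for data, starting from the given alphabet."""
    table = {sym: i for i, sym in enumerate(alphabet)}
    codes, phrase = [], ""
    for sym in data:
        if phrase + sym in table:
            phrase += sym                        # keep extending the longest match
        else:
            codes.append(table[phrase])          # output index of longest phrase
            table[phrase + sym] = len(table)     # add phrase + next element
            phrase = sym
    if phrase:
        codes.append(table[phrase])
    return codes, table

def lzw_decompress(codes, alphabet):
    """Return the list of phrases encoded by codes."""
    table = list(alphabet)
    if not codes:
        return []
    prev = table[codes[0]]
    phrases = [prev]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # Codeword refers to the entry the compressor has just created.
            entry = prev + prev[0]
        table.append(prev + entry[0])
        phrases.append(entry)
        prev = entry
    return phrases

codes, table = lzw_compress("aabababaaa", "ab")
phrases = lzw_decompress(codes, "ab")
```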
LZW-based code compression
Use BYTE (0x00 ~ 0xFF) as basic element.
Variable-to-fixed code compression:
Longer codeword means:
• Larger table (exponentially)
• More decompression overhead
• Useless when the block is too small
• Use more bits to encode same phrase
• CR: 83, 83, 84, 87% for 9-12 bit LZW
Wider decoding table means:
• Larger table (linearly)
• Wider decoding bandwidth
• Less than 1% CR difference for 8-20 bytes
Compression ratio vs. codeword size for two examples
[Figure: compression ratio (axis 0.60 to 1.30) versus codeword size, 9-bit to 16-bit, for ADPCM decoder (small) and MPEG2ENC (large). Data labels, first series: 0.73, 0.84, 0.88, 0.96, 1.04, 1.12, 1.20, 1.28; second series: 0.72, 0.79, 0.77, 0.76, 0.74, 0.74, 0.78, 0.84]
Selective code compression
Motivation
• Branch blocks vary in size
• No benefit to use a longer codeword if the block cannot fill up the coding table
• Only 12.8% of the branch blocks can fill up a 9-bit LZW table
• Only < 1% of the branch blocks can fill up a 12-bit LZW table
Selective code compression
• Apply different compression methods on different branch blocks
• Block size, instruction frequency, ... are collected during profiling
• Profile is used to determine the compression method
[Figure: source program -> branch blocks -> profiling -> method selection -> compression -> compressed code]
Selective compression (cont’d.)
Minimum table-usage selective compression (MTUSC)
• Calculate the number of phrases generated during compression
• Select the smallest table that all the phrases fit in
• Average compression ratio is 79.2%
Minimum code-size selective compression (MCSSC)
• Some compressed blocks use more bytes than the original data
• Compress the blocks using different codeword lengths
• The smallest compressed or uncompressed block is selected
• Average compression ratio is 76.8%
Dynamic LZW
• Codeword length grows as compression goes on
• 75.8% and 75.2% for MTUSC and MCSSC
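The two selection rules can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: each branch block is compressed independently with a fresh byte-based LZW table, a b-bit codeword addresses a table of 2^b entries (256 of them the initial bytes), compressed size is rounded up to whole bytes, and the candidate widths 9 to 12 bits follow the LZW slide. All function names are hypothetical.

```python
import math

def lzw_codes(block, max_entries):
    """Byte-based LZW on one block; returns (codewords, phrases added)."""
    table = {bytes([b]): b for b in range(256)}
    codes, phrase = [], b""
    for b in block:
        cand = phrase + bytes([b])
        if cand in table:
            phrase = cand
        else:
            codes.append(table[phrase])
            if len(table) < max_entries:      # stop growing once the table is full
                table[cand] = len(table)
            phrase = bytes([b])
    if phrase:
        codes.append(table[phrase])
    return codes, len(table) - 256

def mtusc_width(block, widths=(9, 10, 11, 12)):
    """Smallest codeword width whose table holds every generated phrase."""
    _, phrases = lzw_codes(block, 1 << max(widths))
    for bits in widths:
        if 256 + phrases <= (1 << bits):
            return bits
    return max(widths)

def mcssc_choice(block, widths=(9, 10, 11, 12)):
    """Return (size in bytes, width); width None means keep the block uncompressed."""
    best = (len(block), None)
    for bits in widths:
        codes, _ = lzw_codes(block, 1 << bits)
        size = math.ceil(len(codes) * bits / 8)
        if size < best[0]:
            best = (size, bits)
    return best

repetitive = b"ab" * 50                 # hypothetical 100-byte branch block
size, width = mcssc_choice(repetitive)
tiny = mcssc_choice(b"ab")              # too short to benefit from compression
```

The tiny block shows why the uncompressed option matters: two 9-bit codewords already need 3 bytes, more than the 2 original bytes.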
Experiments
Benchmarks
• Collected from Texas Instruments and Mediabench
Compression ratio
• Longer codeword works better in large benchmarks
• Dynamic MCSSC is always the best
Average throughput
1.72 bytes for 12-bit LZW and 1.82 bytes for dynamic MCSSC
[Figure: average throughput in bytes (axis 1.3 to 2.1) for 12-bit LZW, MTUSC, MCSSC (4), MCSSC (32), and MCSSC (d)]
Parallel decompression
Parallel decompression
• Execution time: 0.51x, 0.27x, 0.14x
• Throughput: 3.31, 6.37, 12.29 bytes
Hardware features
• 2-30 kBytes decoding table
• < 4500 μm² using TSMC 0.25 μm model
• 5508 cycles to decompress 9344 bytes (ADPCM decoder)
• 90k cycles to decompress 182k bytes (MPEG-2 encoder)
[Figure: two decoders, DC1 and DC2, with current code = 300; one case shows codes 295 and 277, the other codes 295 and 301]
Comparison with previous work
Wolfe/Chanin | MIPS | Huffman | 73% | < 1 mm² | 1 byte | serial
CodePack | PowerPC | CodePack | 60% | < 1 mm² | 1 byte | serial
Lekatsas | MIPS | SAMC | 57% | 4K table | NA | serial
Xie | TMS320 | F2V | 65% | 6-48K table | 4.9 bits avg, 13 bits max | IID is parallel
Xie | TMS320 | V2F | 70%-82% | 2-30K table | 89 bits max | IID is parallel
Us | C6x | LZW | 83%-87% | < 0.05 mm² | 1.3-1.7 avg | parallel
Us | C6x | MCSSC | 75% | 30K table | 1.8 bytes avg, 13 bytes max | parallel
Code compression summary
We proposed code compression schemes using branch blocks as the compression unit.
Compression ratio is around 83% for LZW and 75% for MCSSC.
Low power is achieved because less memory is required.
Compared to previous work, our schemes have less decompression overhead and larger decompression bandwidth, with a comparable compression ratio.
Parallel decompression could be applied to achieve faster decompression, which suits VLIW architectures.
Compiler techniques could be used to generate source programs more suitable for code compression.
Find other schemes that can take advantage of branch blocks.