Code Compaction for UniCore on Link-Time Optimization Platform
Zhang Jiyu, Compilation Toolchain Group
MPRC
ICDFN 2006, August
Compilation Process
• Artifacts
– Design: ideas
– Source Code: *.cpp, *.c, *.h, Makefile, ...
– Assembly Code: *.asm, *.s, linking scripts, ...
– Object Files: *.o, *.a, *.so, ...
– Executable: DLL, executable
– Profile Data: execution frequency, traces, ...
• Steps
– 1. Coding
– 2. Compile
– 3. Assemble
– 4. Linking
– 5. Execute
– 6. Profile Guided Optimization
Our Optimization Process
• Artifacts
– Design: ideas
– Source Code: *.cpp, *.c, *.h, Makefile, ...
– Assembly Code: *.asm, *.s, linking scripts, ...
– Object Files: *.o, *.a, *.so, ...
– Executable: DLL, executable
– Profile Data: execution frequency, traces, ...
• Steps
– 1. Coding
– 2. Compile
– 3. Assemble
– 4. Linking & Link-Time Optimization
– 5. Execute
– 6. Profile Guided Optimization
CLOU is a Link-time Optimizer for UniCore
[Figure: CLOU workflow, a graph modified from Diablo. Input object files (Code, Data, Meta) pass through the stages Linking, Translation to IR, CFG construction & Optimizations, Layout and Assembling, producing the final Exec]
Code Compaction based on CLOU
• Motivation of code compaction
– Limited memory and energy resources for embedded systems
– Code density affects both memory and energy consumption
• Goal: reducing code size without losing performance
• Code compaction at different levels
1. Typical optimizations for code size reduction at link-time
2. Hot/cold code splitting
3. New mixed code generation method
[Figure: code layout after each of the three levels (1, 2, 3), with the Code region divided into Hot Code and Cold Code]
Typical Optimizations for Code Size Reduction
• Redundant code elimination
– Computations whose results have already been computed and are guaranteed to be available at that point
• Unreachable code elimination
– Code fragments to which there is no control flow path from the entry node
– Many of them follow useless comparisons
• Dead code elimination
– Computations whose results are never used
• Peephole optimization
• Procedural abstraction (might lead to performance loss)
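As an illustration of the unreachable code elimination listed above, the following is a minimal sketch, not CLOU's implementation. It assumes the control flow graph is given as a plain dictionary from block name to successor names; the function name `remove_unreachable` and the block names are hypothetical.

```python
from collections import deque


def remove_unreachable(cfg, entry):
    """Return a copy of cfg that keeps only blocks reachable from entry.

    cfg maps a block name to the list of its successor block names.
    """
    reachable = {entry}
    worklist = deque([entry])
    while worklist:
        block = worklist.popleft()
        for succ in cfg.get(block, []):
            if succ not in reachable:
                reachable.add(succ)
                worklist.append(succ)
    return {b: list(succs) for b, succs in cfg.items() if b in reachable}


# Hypothetical CFG: no control flow path leads from "entry" to "dead".
cfg = {
    "entry": ["check"],
    "check": ["body", "exit"],
    "body": ["check"],
    "dead": ["exit"],
    "exit": [],
}
pruned = remove_unreachable(cfg, "entry")
```

The block "dead" is dropped because the breadth-first walk from the entry node never visits it; every other block is kept with its successor list unchanged.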
Experiments for Typical Optimizations for Code Size Reduction
• Benchmark: Mediabench
• Code size reduction
– Average: 12.8%
– Max: 22.3%
• Performance improvement
– Average: 2.4%
– Max: 4.2%
Hot/Cold Code Splitting
• Less code transferred from remote to local, from disk to memory, or from memory to cache
– Question: might it be too conservative or lead to performance loss?
• Hot and cold code are split through basic block reordering
[Figure: three layouts (1, 2, 3) of the blocks Code, Condition, Hot Code, Cold Code and More Code]
Hot/Cold Code Splitting
• PH (Pettis and Hansen): a popular greedy approach
• Structural-analysis-based basic block reordering
– Most of a program can be decomposed into several typical structures
– Cost model for each structure
– Minimal-cost layout: the optimal layout for each local structure, based on profiling information
[Figure: typical structures over basic blocks B1 ... Bn, with edges labelled by frequencies such as x, y and z: (a) Block model, (b) If-then, (c) If-then-else, (d) While-loop, (e) Repeat-loop, (f) Natural-loop, (g) Natural-loop]
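To make the greedy idea concrete, here is a simplified sketch of chain merging along the hottest edges, in the spirit of PH but not a faithful reproduction of the published algorithm or of the tool presented here. The function name `greedy_layout`, the block names and the edge frequencies are all assumptions made for illustration.

```python
def greedy_layout(blocks, edges, entry):
    """Order basic blocks by merging chains along the hottest edges first.

    blocks: list of block names.
    edges: dict mapping (src, dst) to a profiled execution frequency.
    """
    chain_of = {b: [b] for b in blocks}  # every block starts as its own chain
    hottest_first = sorted(edges.items(), key=lambda kv: -kv[1])
    for (src, dst), _freq in hottest_first:
        a, b = chain_of[src], chain_of[dst]
        # Merge only when src ends one chain and dst starts another, so that
        # dst becomes the fall-through successor of src. The entry block is
        # never allowed to move away from the head of its chain.
        if a is not b and a[-1] == src and b[0] == dst and dst != entry:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Emit the entry chain first, then the remaining chains.
    layout, placed = [], set()
    for blk in [entry] + blocks:
        if blk not in placed:
            chain = chain_of[blk]
            layout.extend(chain)
            placed.update(chain)
    return layout


# Hypothetical if-then-else: the B2 side is hot, the B3 side is cold.
blocks = ["B1", "B2", "B3", "B4"]
edges = {("B1", "B2"): 90, ("B1", "B3"): 10, ("B2", "B4"): 90, ("B3", "B4"): 10}
layout = greedy_layout(blocks, edges, "B1")
```

With these frequencies the hot path B1, B2, B4 ends up contiguous and the cold block B3 is pushed to the end, which is the hot/cold separation that block reordering aims for.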
Basic Block Reordering
• Cost model
– Different kinds of control flow edges have different costs
– For a specific order, $Cost = \sum_{e \in \text{control flow edges}} Cost(e) \times Frequency(e)$
– A list can be obtained for each structure
– f(structure, frequencies of all edges) → the best order of basic blocks for the local structure
[Figure: four example layouts (a) to (d) of blocks A, B and C and the branch code each one needs: (a) no branch, (b) "b L1", (c) "cmp …; beq L1", (d) "cmp …; beq L1; b L2"]
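The cost model can be sketched as a sum of Cost(e) × Frequency(e) over all control flow edges, evaluated for each candidate order. The sketch below is illustrative only: the two edge costs, the function names and the frequencies are assumptions, and the exhaustive search over permutations is practical only for the small local structures discussed here, not for a whole program.

```python
from itertools import permutations

FALLTHROUGH_COST = 0  # assumed cost when dst is placed right after src
TAKEN_COST = 1        # assumed cost when control has to jump to reach dst


def layout_cost(order, edges):
    """Sum Cost(e) * Frequency(e) over all control flow edges."""
    position = {blk: i for i, blk in enumerate(order)}
    total = 0
    for (src, dst), freq in edges.items():
        falls_through = position[dst] == position[src] + 1
        total += (FALLTHROUGH_COST if falls_through else TAKEN_COST) * freq
    return total


def best_layout(entry, others, edges):
    """Try every order that keeps the entry block first; keep the cheapest."""
    candidates = ([entry, *perm] for perm in permutations(others))
    return min(candidates, key=lambda order: layout_cost(order, edges))


# Hypothetical if-then-else where the B3 side is hot.
edges = {("B1", "B2"): 10, ("B1", "B3"): 90, ("B2", "B4"): 10, ("B3", "B4"): 90}
best = best_layout("B1", ["B2", "B3", "B4"], edges)
best_cost = layout_cost(best, edges)
```

For these frequencies the cheapest order places B3 and B4 directly after B1, so both hot edges become fall-throughs and only the two cold edges pay the taken-branch cost.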
Experiments
• Complexity: O(N log N), N: number of basic blocks
• Experiment results (not using other link-time optimizations)
– Normalized cycle counts
– Normalized cache miss rate
[Figure: two bar charts, "Overall performance" (normalized cycle counts) and "Instruction cache misses" (normalized cache miss rate), comparing ORIG, PH and SABO on adpcm-encode, epic, jpeg-encode, pegwit-encode, mipmap and osdemo]
Mixed Code Generation
• Dual-width instruction set
– 32-bit ISA: more powerful
– 16-bit ISA: more compact
  • Less coding space for operations
  • Narrower register fields
  • Narrower immediate fields
32-bit:
  add r0, r0, 0xff800000
16-bit:
  str r2, [addr]
  mov r2, 0xff
  lsl r2, #1
  add r2, #1
  lsl r2, #23
  add r0, r2
  ld r2, [addr]
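The arithmetic behind the 16-bit sequence can be checked with a short simulation. This is a sketch under the assumption of 32-bit registers with wrap-around; the helper names are hypothetical. Note that the final shift has to be 23 bits for the result to equal 0xff800000, since 0x1ff shifted by 24 bits loses its top bit and yields 0xff000000.

```python
MASK32 = 0xFFFFFFFF  # registers are assumed to be 32 bits wide


def synthesize(final_shift):
    """Build a large constant in r2 from an 8-bit immediate and shifts."""
    r2 = 0xFF                           # mov r2, 0xff
    r2 = (r2 << 1) & MASK32             # lsl r2, #1      -> 0x1fe
    r2 = (r2 + 1) & MASK32              # add r2, #1      -> 0x1ff
    r2 = (r2 << final_shift) & MASK32   # lsl r2, #final_shift
    return r2


def add_large_immediate(r0, final_shift=23):
    """Equivalent of the single 32-bit 'add r0, r0, <constant>'."""
    return (r0 + synthesize(final_shift)) & MASK32


constant = synthesize(23)
```

One 32-bit instruction thus becomes seven 16-bit instructions once the spill and reload of r2 are counted, which is why code that needs wide immediates is a poor fit for the 16-bit ISA.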
Mixed Code Generation
• Related work in dual-width instruction set design and mixed code generation
– Coarse-grained function-level mixed code generation
  • By BX in ARM and JALX in MIPS
– Simple fine-grained instruction-level mixed code generation
  • By BX in ARM and JALX in MIPS
  • By a single specific mode-changing instruction
– Specialized coding
  • A one-leading instruction word indicates one 32-bit instruction; a zero-leading instruction word indicates two 16-bit instructions
  • 16-bit ISA extensions
• Problem: these approaches always lead to performance loss
Potential benefit
• Analysis of Programs in Mediabench
27851 different instructions in all programs:
log2(27851) ≈ 14.8, i.e. 15 bits
Rank   UniCore32 Instruction   Average Percentage
1      mov                     23%
2      ldr                     16%
3      cmp                     8%
4      add                     8%
5      str                     6%
6      b                       5%
Total                          66%
Two Main Kinds of Frequent Instructions
• Two-operand instructions
– mov rd, rm or short immediate
– cmp rn, rm or short immediate
• Branch/Jump
– Distribution of immediate offsets of branch instructions
[Figure: histogram of the percentage of branch instructions (0 to 0.2) against the number of bits needed for the immediate offset (1 to 17)]
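A distribution of this kind can be computed as sketched below. The sketch assumes offsets are signed and encoded in two's complement, so the sign bit counts towards the width; whether the original measurement counted it the same way is not stated. The function names and the sample offsets are hypothetical.

```python
from collections import Counter


def bits_needed(offset):
    """Smallest two's-complement width that can hold a signed offset."""
    if offset >= 0:
        return offset.bit_length() + 1
    return (~offset).bit_length() + 1


def width_distribution(offsets):
    """Fraction of offsets that need each bit width."""
    counts = Counter(bits_needed(o) for o in offsets)
    total = len(offsets)
    return {bits: counts[bits] / total for bits in sorted(counts)}


# Hypothetical branch offsets, in instruction units.
offsets = [3, -4, 12, -100, 2000, -1, 7, 130]
distribution = width_distribution(offsets)
```

If most of the mass sits at small widths, as the histogram suggests, a short offset field covers the majority of branches and those branches become candidates for a 16-bit encoding.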
The Idea of Mode-Changing Instruction Set (MC)
• Extend the 32-bit ISA with a small MC instruction set (using the reserved coding space)
– Change the CPU mode
– Perform its own normal operation
• Scan for suitable 32-bit instructions to be encoded as 16-bit instructions
• A mixed code fragment with MC instructions:
  32-bit instructions
  MC instruction | UniCore16 instruction
  UniCore16 instruction | UniCore16 instruction
  …
  UniCore16 instruction | UniCore16 instruction
  MC instruction | UniCore16 instruction
  32-bit instructions
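The scan for convertible instructions can be sketched as follows. This is an illustration under several assumptions that the slides do not confirm: MC instructions occupy 16 bits, the first and last instruction of a converted run are the MC instructions, and runs are trimmed to an even length so that the following 32-bit code stays word aligned. The function names and the `min_run` parameter are hypothetical.

```python
def mark_modes(compressible, min_run=2):
    """Label each instruction as "32", "MC" or "16".

    compressible[i] is True when instruction i is assumed to have a
    16-bit encoding. A run is converted only if it has at least min_run
    members after trimming to an even length.
    """
    modes = ["32"] * len(compressible)
    i = 0
    while i < len(compressible):
        if not compressible[i]:
            i += 1
            continue
        j = i
        while j < len(compressible) and compressible[j]:
            j += 1
        run = j - i
        run -= run % 2  # assumed alignment rule: keep an even number of halves
        if run >= min_run:
            for k in range(i, i + run):
                modes[k] = "16"
            modes[i] = modes[i + run - 1] = "MC"  # enter and leave 16-bit mode
        i = j
    return modes


def code_size(modes):
    """Size in bytes: 4 per 32-bit instruction, 2 per MC or 16-bit one."""
    return sum(4 if m == "32" else 2 for m in modes)


flags = [False, True, True, True, True, False]
modes = mark_modes(flags)
```

For the six-instruction example the result is one 32-bit instruction, an MC instruction, two UniCore16 instructions, another MC instruction and a final 32-bit instruction, taking 16 bytes instead of 24.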
Modification to Micro Architecture
• Mixed code execution in the UniCore-I pipeline
[Figure: pipeline diagram (IF, DEC, EXE, MEM, WB) for Inst 1 (BX, UniCore32) followed by Inst 2 to Inst 5 (UniCore16)]
• Improved mixed code execution in the UniCore-I pipeline
[Figure: pipeline diagram (IF, DEC, EXE, MEM, WB) for Inst 1 (UniCore32), Inst 2 (MC), Inst 3 and Inst 4 (UniCore16), Inst 5 (MC) and Inst 6 (UniCore32)]
– No extra cycles
– One more 16-bit instruction-fetch buffer
– An MC-decoder
Mixed Code Generation
[Figure: tool flow with the components program, Instruction Analyzer, Mode-Changing Instructions, Link-Time Optimizer, mixed-coded program and Simulator]
Experiment Results
• Normalized code size (results not using other link-time optimizations)
[Figure: bar chart of normalized code size (0 to 1.2) for UniCore32, UniCore16 and Mixed]
Conclusion
• Code compaction on Link-Time Optimization Platform
– Compiler optimizations applied at link time
  • Typical optimizations for code size reduction
– Program layout optimization
  • Hot/cold code splitting through basic block reordering
– Machine code generation
  • Mixed code generation
• Experiment results
– Average code size reduction: 32.9%
– Average performance improvement: 9.1%
Thank you
• Instruction Analysis
Instruction format type classifications:
– 3 regs, all in r0-r7 / r8-r15 / r16-r23 / r24-r31
– 2 regs, one in r0-r31, one in r0-r16 / r17-r31
– 1 reg and 1 imme, imme field: 4-6 bits
– 1 imme, imme field: 9 bits
reg: short for register; imme: short for immediate field
EXPERIMENT RESULTS
• Normalized dynamic instruction numbers
• Normalized cycle counts
[Figure: two bar charts, one for normalized dynamic instruction numbers and one for normalized cycle counts, comparing UniCore32, UniCore16 and Mixed on adpcm-encode, adpcm-decode, epic, unepic, pegwit-encode, pegwit-decode, jpeg-encode, jpeg-decode, mpeg2-encode, mpeg2-decode, mesa-mipmap, mesa-texgen and mesa-osdemo]