Code Compaction for UniCore on Link-Time Optimization Platform

Zhang Jiyu
Compilation Toolchain Group, MPRC

August, ICDFN 2006

Compilation Process

• Artifacts at each stage
– Design: ideas
– Source Code: *.cpp, *.c, *.h, Makefile, ...
– Assembly Code: *.asm, *.s, linking scripts, ...
– Object Files: *.o, *.a, *.so, ...
– Executable: DLL, executable
– Profile Data: execution frequency, traces, ...

• Steps between stages
1. Coding
2. Compile
3. Assemble
4. Linking
5. Execute
6. Profile Guided Optimization


Our Optimization Process

• Artifacts at each stage: Design, Source Code, Assembly Code, Object Files, Executable, Profile Data

• Steps between stages
1. Coding
2. Compile
3. Assemble
4. Linking & Link-Time Optimization
5. Execute




CLOU is a Link-time Optimizer for UniCore

[Figure: CLOU workflow, a graph modified from Diablo. Input object files carry Code, Data, and Meta sections. Stages shown: Linking; Translation to IR; CFG construction & Optimizations; Layout; Assembling. Output: Exec.]


Code Compaction based on CLOU

• Motivation of code compaction
– Limited memory and energy resources for embedded systems
– Code density affects both memory and energy consumption

• Goal: reducing code size without losing performance
• Code compaction at different levels

1. Typical optimizations for code size reduction at link-time
2. Hot/cold code splitting
3. New mixed code generation method

[Figure: a code region shown at each of the three levels (1, 2, 3), partitioned into Hot Code and Cold Code blocks.]


Typical Optimizations for Code Size Reduction

• Redundant code elimination
– Computations whose results have been computed previously and are guaranteed to be available at that point
• Unreachable code elimination
– Code fragments that cannot be reached by any control flow path from the entry node
– Many of them follow useless comparisons
• Dead code elimination
– Computations whose results are never used
• Peephole optimization
• Procedural abstraction (might lead to performance loss)
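Unreachable code elimination, as described on this slide, amounts to a reachability walk over the control flow graph. The sketch below is a minimal illustration, not CLOU's implementation; the dictionary-based CFG and the block names are hypothetical.

```python
from collections import deque

def remove_unreachable(cfg, entry):
    """Return a copy of cfg keeping only blocks reachable from entry.

    cfg maps a block name to the list of its successor block names.
    """
    reachable = {entry}
    worklist = deque([entry])
    while worklist:
        block = worklist.popleft()
        for succ in cfg.get(block, []):
            if succ not in reachable:
                reachable.add(succ)
                worklist.append(succ)
    return {b: list(s) for b, s in cfg.items() if b in reachable}

# Hypothetical CFG: "dead" and "dead2" have no path from "entry".
cfg = {
    "entry": ["loop"],
    "loop": ["loop", "exit"],
    "exit": [],
    "dead": ["exit", "dead2"],
    "dead2": ["dead"],
}
print(sorted(remove_unreachable(cfg, "entry")))  # ['entry', 'exit', 'loop']
```

Note that "dead" and "dead2" reference each other, so counting predecessors alone would not remove them; the walk from the entry node does.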


Experiments for Typical Optimizations for Code Size Reduction

• Benchmark: Mediabench
• Code size reduction
– Average: 12.8%
– Max: 22.3%
• Performance improvement
– Average: 2.4%
– Max: 4.2%


Hot/Cold Code Splitting

• Less code transferred from remote to local, from disk to memory, or from memory to cache
– Question: might it be too conservative, or lead to performance loss?
• Split hot and cold code through basic block reordering

[Figure: three layouts (1, 2, 3) of the same code, built from the blocks Condition, Code, Hot Code, Cold Code, and More Code, showing hot and cold code being separated.]

Hot/Cold Code Splitting

• PH: a popular greedy approach
• Structural analysis based basic block reordering
– Most of a program can be decomposed into several typical structures
– A cost model for each structure
– The minimal-cost layout is taken as the optimal layout for each local structure, based on profiling information

[Figure: typical structures built from basic blocks B1 ... Bn, with edge frequencies labeled x, y, z: (a) Block model, (b) If-then, (c) If-then-else, (d) While-loop, (e) Repeat-loop, (f) Natural-loop, (g) Natural-loop.]
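For comparison with the structural approach, the greedy baseline can be sketched as follows, assuming "PH" refers to the Pettis-Hansen bottom-up chain-merging heuristic. The block names and edge counts are hypothetical, and the final chain-ordering step of the full heuristic is left out.

```python
def ph_chains(blocks, edge_freq):
    """Greedy bottom-up chain merging.

    blocks: list of block names in their original order.
    edge_freq: dict mapping (src, dst) to a profiled execution count.
    Returns a list of chains, each a list of block names.
    """
    chain_of = {b: [b] for b in blocks}  # block -> the chain containing it
    # Visit edges from hottest to coldest; ties are broken by name.
    ordered = sorted(edge_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    for (src, dst), _freq in ordered:
        a, b = chain_of[src], chain_of[dst]
        # Merge only when src ends one chain and dst starts another,
        # so that the hot edge becomes a fall-through.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Collect distinct chains in order of first appearance.
    seen, chains = set(), []
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains

# Hypothetical profile: the "hot" arm is taken 90 times out of 100.
blocks = ["cond", "cold", "hot", "more"]
edge_freq = {
    ("cond", "hot"): 90, ("cond", "cold"): 10,
    ("hot", "more"): 90, ("cold", "more"): 10,
}
print(ph_chains(blocks, edge_freq))  # [['cond', 'hot', 'more'], ['cold']]
```

The hot path ends up contiguous and the cold block is pushed into its own chain, which is the hot/cold separation the slides aim for.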


Basic Block Reordering

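• Cost Model
– Different kinds of control flow edges have different costs
– For a specific order, the total cost is

$$\mathrm{Cost} = \sum_{e \,\in\, \text{control flow edges}} \mathrm{Cost}(e) \times \mathrm{Frequency}(e)$$

– A list can be obtained for each structure: f(structure, frequencies of all edges) → the best order of basic blocks for the local structure

[Figure: kinds of control flow edges between blocks A, B, and C under a given layout: (a) straight-line code with no branch; (b) unconditional branch (b L1); (c) conditional branch (cmp …; beq L1); (d) conditional branch followed by an unconditional branch (cmp …; beq L1; b L2).]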
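The cost formula can be made concrete with a small sketch that evaluates every order of one local structure and keeps the cheapest. The two edge costs (0 for a fall-through, 1 for any taken branch) and the frequencies are assumptions chosen for illustration; the slides do not give the actual cost values.

```python
from itertools import permutations

FALL_THROUGH_COST = 0  # assumed: destination is placed right after the source
TAKEN_BRANCH_COST = 1  # assumed: any edge that needs a taken branch

def layout_cost(order, edge_freq):
    """Sum of Cost(e) * Frequency(e) over all control flow edges."""
    pos = {block: i for i, block in enumerate(order)}
    total = 0
    for (src, dst), freq in edge_freq.items():
        adjacent = pos[dst] == pos[src] + 1
        total += (FALL_THROUGH_COST if adjacent else TAKEN_BRANCH_COST) * freq
    return total

def best_order(entry, others, edge_freq):
    """Try every order that keeps the entry block first; return the cheapest."""
    candidates = ((entry,) + p for p in permutations(others))
    return min(candidates, key=lambda o: (layout_cost(o, edge_freq), o))

# Hypothetical if-then-else: B1 branches to B2 (x = 10) or B3 (y = 90),
# and both arms join at B4.
edge_freq = {
    ("B1", "B2"): 10, ("B1", "B3"): 90,
    ("B2", "B4"): 10, ("B3", "B4"): 90,
}
order = best_order("B1", ("B2", "B3", "B4"), edge_freq)
print(order, layout_cost(order, edge_freq))
```

With these numbers the cheapest order is B1, B3, B4, B2 at a cost of 20: the frequent arm and the join block fall through, and the rare arm B2 is moved to the end. Exhaustive search is only practical here because each local structure has a handful of blocks.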


Experiments

• Complexity: O(N log N), where N is the number of basic blocks
• Experiment results (not using other link-time optimizations)
– Normalized cycle counts
– Normalized cache miss rate

[Figure: two bar charts comparing ORIG, PH, and SABO on adpcm-encode, epic, jpeg-encode, pegwit-encode, mipmap, and osdemo. Chart "Overall performance": normalized cycle counts, y-axis from 0.75 to 1.05. Chart "Instruction cache misses": normalized cache miss rate, y-axis from 0.5 to 1.6.]


Mixed Code Generation

• Dual-width Instruction Set
– 32-bit ISA: more powerful
– 16-bit ISA: more compact
  • Less coding space for operations
  • Smaller register fields
  • Smaller immediate fields

32-bit:
    add r0, r0, 0xff800000

16-bit:
    str r2, [addr]
    mov r2, 0xff
    lsl r2, #1
    add r2, #1
    lsl r2, #23
    add r0, r2
    ld  r2, [addr]
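The arithmetic behind this example can be checked with a short simulation on 32-bit values. This is only a sketch: the spill and reload of r2 are omitted because they do not affect r0, and the final shift amount is taken as 23, since 0x1ff << 23 is the value that equals 0xff800000.

```python
MASK32 = 0xFFFFFFFF

def add32(r0, imm):
    """Single 32-bit instruction: add r0, r0, imm."""
    return (r0 + imm) & MASK32

def add16_sequence(r0, final_shift=23):
    """16-bit sequence that builds the constant in r2, then adds it to r0."""
    r2 = 0xFF                           # mov r2, 0xff
    r2 = (r2 << 1) & MASK32             # lsl r2, #1   -> 0x1fe
    r2 = (r2 + 1) & MASK32              # add r2, #1   -> 0x1ff
    r2 = (r2 << final_shift) & MASK32   # lsl r2, #23  -> 0xff800000
    return (r0 + r2) & MASK32           # add r0, r2

r0 = 0x12345678
print(hex(add32(r0, 0xFF800000)), hex(add16_sequence(r0)))

# Code size: one 4-byte instruction versus seven 2-byte instructions.
print(1 * 4, 7 * 2)  # 4 14
```

Both paths produce the same r0, but the 16-bit version takes 14 bytes and seven instructions where the 32-bit version takes 4 bytes and one, which is why converting every instruction to 16-bit form is not always a win.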



• Related work in dual-width instruction set design and mixed code generation
– Coarse-grained function-level mixed code generation
  • By BX in ARM and JALX in MIPS
– Simple fine-grained instruction-level mixed code generation
  • By BX in ARM and JALX in MIPS
  • By a single specific mode-changing instruction
– Specialized coding
  • An instruction word with a leading one indicates one 32-bit instruction; an instruction word with a leading zero indicates two 16-bit instructions
  • 16-bit ISA extensions
• Problem: these approaches always lead to performance loss


Potential benefit

• Analysis of programs in Mediabench
– 27851 different instructions in all programs: ⌈log2(27851)⌉ = 15, so 15 bits are enough to distinguish them

Rank   UniCore32 instruction   Average percentage
1      mov                     23%
2      ldr                     16%
3      cmp                     8%
4      add                     8%
5      str                     6%
6      b                       5%
       Total                   66%
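The two figures on this slide can be reproduced with a few lines of arithmetic. The numbers are the ones quoted on the slide; the variable names are illustrative.

```python
distinct_instructions = 27851  # figure quoted on the slide

# Smallest n such that 2**n >= distinct_instructions.
index_bits = (distinct_instructions - 1).bit_length()

# Average share of the six most frequent UniCore32 instructions, in percent.
top_six = {"mov": 23, "ldr": 16, "cmp": 8, "add": 8, "str": 6, "b": 5}

print(index_bits, sum(top_six.values()))  # 15 66
```

Since 2^14 = 16384 is too small and 2^15 = 32768 is enough, 15 bits can index every distinct instruction, which fits inside a 16-bit encoding.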


Two Main Kinds of Frequent Instructions

• Two-operand instructions
– mov rd, rm (or short immediate)
– cmp rn, rm (or short immediate)
• Branch/Jump
– Distribution of immediate offsets of branch instructions

[Figure: histogram of branch immediate offsets. x-axis: number of bits needed (1 to 17); y-axis: percentage (0 to 0.2).]
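A distribution like the one in this chart can be gathered by computing, for each branch, the narrowest field that holds its offset. The sketch below assumes signed two's-complement offsets and uses made-up sample offsets; neither detail comes from the slides.

```python
from collections import Counter

def bits_needed(offset):
    """Smallest two's-complement width that can hold a signed offset."""
    if offset >= 0:
        return offset.bit_length() + 1
    return (~offset).bit_length() + 1

def width_histogram(offsets):
    """Map each field width to the fraction of offsets that need it."""
    counts = Counter(bits_needed(o) for o in offsets)
    return {width: n / len(offsets) for width, n in sorted(counts.items())}

# Hypothetical branch offsets, in instructions.
offsets = [4, -8, 100, -3, 2000]
print(width_histogram(offsets))  # {3: 0.2, 4: 0.4, 8: 0.2, 12: 0.2}
```

Summing the fractions up to a given width tells how many branches would fit a 16-bit encoding with an immediate field of that size.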


The Idea of Mode-Changing Instruction Set (MC)

• Extend the 32-bit ISA with a small MC instruction set (using the reserved coding space). Each MC instruction:
– Changes the CPU mode
– Performs its own normal operation
• Scan for suitable 32-bit instructions to be encoded into 16-bit instructions
• A mixed code fragment with MC instructions:

    32-bit instructions
    MC instruction        | UniCore16 instruction
    UniCore16 instruction | UniCore16 instruction
    …                     | …
    UniCore16 instruction | UniCore16 instruction
    MC instruction        | UniCore16 instruction
    32-bit instructions
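The scanning step can be pictured with a toy model. Everything specific here is an assumption made for illustration: instructions are reduced to a flag saying whether a 16-bit form exists, any such instruction is allowed to act as the MC instruction at either end of a run, and runs are trimmed to an even length to keep 32-bit word alignment. The real selection rules are not given on the slides.

```python
def plan_mixed_code(short_ok, min_run=2):
    """Mark runs of compressible instructions and estimate code size.

    short_ok: list of bools, True if the instruction has a 16-bit form.
    Returns (modes, size_in_bytes); modes holds "32", "16", or "MC".
    """
    modes = ["32"] * len(short_ok)
    i = 0
    while i < len(short_ok):
        if not short_ok[i]:
            i += 1
            continue
        j = i
        while j < len(short_ok) and short_ok[j]:
            j += 1
        run = j - i
        run -= run % 2                  # assumed: keep word alignment
        if run >= min_run:
            modes[i:i + run] = ["16"] * run
            modes[i] = "MC"             # switches to 16-bit mode
            modes[i + run - 1] = "MC"   # switches back to 32-bit mode
        i = j
    size = sum(4 if m == "32" else 2 for m in modes)
    return modes, size

flags = [False, True, True, True, True, True, False]
modes, size = plan_mixed_code(flags)
print(modes, size, 4 * len(flags))
```

For this input the run of five compressible instructions is trimmed to four, giving 20 bytes instead of 28. Because each MC instruction also performs its own operation, no extra instructions are inserted at the mode boundaries.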


Modification to Micro Architecture

• Mixed code execution in Unicore-I pipeline

[Figure: five-stage pipeline (IF, DEC, EXE, MEM, WB) timing for the sequence Inst 1 (BX, UniCore32), Inst 2 to Inst 5 (UniCore16).]

• Improved mixed code execution in Unicore-I pipeline

[Figure: five-stage pipeline (IF, DEC, EXE, MEM, WB) timing for the sequence Inst 1 (UniCore32), Inst 2 (MC), Inst 3 and Inst 4 (UniCore16), Inst 5 (MC), Inst 6 (UniCore32).]

– No extra cycles
– One more 16-bit instruction-fetch buffer
– An MC-decoder



[Figure: tool flow with the labels program, Instruction Analyzer, Mode-Changing Instructions, Link-Time Optimizer, mixed-coded program, and Simulator.]


Experiment Results

• Normalized code size (results not using other link-time optimizations)

[Figure: bar chart of normalized code size for UniCore32, UniCore16, and Mixed; y-axis from 0 to 1.2.]


Conclusion

• Code compaction on Link-Time Optimization Platform
– Compiler optimizations applied at link time
  • Typical optimizations for code size reduction
– Program layout optimization
  • Hot/cold code splitting through basic block reordering
– Machine code generation
  • Mixed code generation
• Experiment results
– Average code size reduction: 32.9%
– Average performance improvement: 9.1%


Thank you



• Instruction Analysis

Instruction format type classifications:
– 3 regs, all in r0-r7 / r8-r15 / r16-r23 / r24-r31
– 2 regs, one in r0-r31, one in r0-r16 / r17-r31
– 1 reg and 1 imme, imme field: 4-6 bits
– 1 imme, imme field: 9 bits

reg: short for register; imme: short for immediate field
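Two of these format checks are easy to express directly. The sketch below covers the three-register case, where all registers must fall in the same group of eight, and a generic immediate-width test. Treating the immediate as unsigned is an assumption, and the function names are illustrative.

```python
def three_reg_bank(rd, rn, rm):
    """Return the shared bank index (0 to 3), or None if the registers
    span more than one of r0-r7, r8-r15, r16-r23, r24-r31."""
    banks = {r // 8 for r in (rd, rn, rm)}
    return banks.pop() if len(banks) == 1 else None

def fits_unsigned(value, bits):
    """True if value fits an unsigned immediate field of the given width."""
    return 0 <= value < (1 << bits)

print(three_reg_bank(0, 3, 7))    # 0
print(three_reg_bank(7, 8, 9))    # None
print(fits_unsigned(511, 9), fits_unsigned(512, 9))  # True False
```

An instruction analyzer could run checks like these over every instruction of a program to decide which ones qualify for a 16-bit form.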


Experiment Results

• Normalized dynamic instruction numbers

[Figure: bar chart, y-axis from 0 to 6, over the benchmarks adpcm-encode, adpcm-decode, epic, unepic, pegwit-encode, pegwit-decode, jpeg-encode, jpeg-decode, mpeg2-encode, mpeg2-decode, mesa-mipmap, mesa-texgen, and mesa-osdemo.]

• Normalized cycle counts

[Figure: bar chart, y-axis from 0 to 5, over the same thirteen benchmarks.]
