Dynamic Binary Optimization – Part 1 2006. 9.25 Nam, E Hyun

Page 1: Dynamic Binary Optimization  –  Part 1

Dynamic Binary Optimization – Part 1

2006. 9.25

Nam, E Hyun

Page 2: Dynamic Binary Optimization  –  Part 1

Contents

  • Overview
  • Dynamic program behavior
  • Profiling
  • Optimizing translation blocks

Page 3: Dynamic Binary Optimization  –  Part 1

Overview : Optimization

  • Optimization
    • Migration of VM consideration from compatibility to performance
  • Goal
    • To close the gap between a guest’s emulated performance and native platform performance
  • Type
    • Translation block chaining
    • Enlarging the translation block
    • Reordering translated instructions
    • Conventional compiler optimization techniques

Example: x86 source code, its PowerPC translation, and the optimized translation.

    addl %edx,4(%eax)
    movl 4(%eax),%edx

    addi r16,r4,4
    lwzx r17,r2,r16
    add  r7,r17,r7
    addi r16,r4,4
    stwx r7,r2,r16

    addi r16,r4,4
    lwzx r17,r2,r16
    add  r7,r17,r7
    stwx r7,r2,r16

Page 4: Dynamic Binary Optimization  –  Part 1

Overview : Profile

  • Profile
    • Statistics regarding a program’s behavior
    • A guide for making optimization decisions
    • A common optimization strategy is to use profiling to determine the paths that are predominantly followed by control flow
  • Type of profile information
    • Instructions (or basic blocks) that are more heavily executed
    • Sequences in which BBs are most commonly executed
    • Behavior of particular data variables and addresses

Page 5: Dynamic Binary Optimization  –  Part 1

Overview : Profile

  • Advantage of profile information
    • Providing information that may not have been available when a program was originally compiled

Figure: three versions of the same code fragment.

    Version 1
      BB A:  … ; R3 ← … ; R7 ← … ; R1 ← R2 + R3 ; Br L1 if R3==0
      BB B:  … ; R6 ← R1 + R6 ; …
      BB C:  L1: R1 ← 0 ; …
    Version 2
      BB A:  … ; R3 ← … ; R7 ← … ; Br L1 if R3==0
      BB B:  … ; R6 ← R1 + R6 ; …
      BB C:  L1: R1 ← 0 ; …
    Version 3
      BB A:  … ; R3 ← … ; R7 ← … ; Br L1 if R3==0
      BB B:  … ; R6 ← R1 + R6 ; …
      BB C:  L1: R1 ← 0 ; …
      Compensation code:  R1 ← R2 + R3

Page 6: Dynamic Binary Optimization  –  Part 1

Overview : BB rearrangement

  • Definition
    • A method of arranging basic blocks so that the predominant path has its instructions in consecutive memory locations
  • Advantages
    • Good locality
    • Efficient instruction fetching
  • Type
    • Trace
    • Superblock
    • Tree group

Figure: the original blocks and the rearranged code.

    Original
      BB A:  … ; R3 ← … ; R7 ← … ; R1 ← R2 + R3 ; Br L1 if R3==0
      BB B:  … ; R6 ← R1 + R6 ; …
      BB C:  L1: R1 ← 0 ; …
    Rearranged
      Superblock:  … ; R3 ← … ; R7 ← … ; Br L1 if R3!=0 ; L1: R1 ← 0 ; …
      BB B:  … ; R6 ← R1 + R6 ; …
      Compensation code:  R1 ← R2 + R3

Page 7: Dynamic Binary Optimization  –  Part 1

Overview : Staged emulation

  • Relation between emulation and optimization
    • Tightly integrated with emulation
    • Optimization is part of an emulation framework that supports staged emulation
  • Staged emulation
    • Based on the tradeoff between start-up time and steady state performance
    • Interpretation
    • Binary translation
    • Dynamic binary optimization

Page 8: Dynamic Binary Optimization  –  Part 1

Overview : Staged emulation

  • Stages of staged emulation
    • Interpretation
    • BB translation (e.g. chaining)
    • Optimized translation (e.g. superblock)
    • Highly optimized translation

Figure components: Binary memory image, Interpreter, Translator, Optimizer, Emulation manager, BB cache, Code cache, Profile data.

Page 9: Dynamic Binary Optimization  –  Part 1

Overview : Spectrum of emulation

    Interpret → Basic translation → Optimized blocks → Highly optimized blocks

    Interpret end            Highly optimized end
    Fast startup             Very slow startup
    Slow steady state        Fast steady state
    Simple profiling         Extensive profiling
    Low overhead             High overhead

Page 10: Dynamic Binary Optimization  –  Part 1

Overview : Staged emulation strategy

  • Strategy decision factors
    • Source and target ISA
    • Type of VM being implemented
    • Design objective
    • Tradeoff between the performance obtained from optimization and the overhead of optimization and profiling
  • Example
    • Original HP Dynamo system, Digital FX!32
      • Interpretation → optimized, translated code
    • DynamoRIO
      • Simple binary translation → optimization
    • Shade
      • Interpretation → simple binary translation

Page 11: Dynamic Binary Optimization  –  Part 1

Contents

  • Overview
  • Dynamic program behavior
  • Profiling
  • Optimizing translation blocks

Page 12: Dynamic Binary Optimization  –  Part 1

Dynamic program behavior

  • Goal
    • Optimization depends on the program’s structure and dynamic behavior
    • By profiling, the optimization system can learn about the program’s structure and dynamic behavior
  • Important characteristics of programs
    • High predictability of dynamic control flow
    • Correlation of branch direction between the current and the most recent previous execution

Figure: fraction of static conditional branches (axis 0 to 50) by percent taken (0-10% through >90%).

Figure: percent of dynamic branches decided the same as the previous time (axis 0 to 100) for 176.gcc, 181.mcf, 197.parser, 252.eon, 256.bzip2, 171.swim, 173.applu, 177.mesa, 187.facerec and 189.lucas.

Page 13: Dynamic Binary Optimization  –  Part 1

Dynamic program behavior

  • Important characteristics of programs
    • Backward branches are typically taken
    • Predictability of indirect jumps
      • Switch statement
      • Return from procedure call
    • Predictability of data values

Figure: percent of indirect jumps (axis 0 to 25) by number of different destinations (1 through >9).

Figure: fraction with constant value (axis 0 to 0.7) by instruction type (All, Add/Sub, Load, Logic, Shift, Set), shown for static and dynamic instructions.

Page 14: Dynamic Binary Optimization  –  Part 1

Contents

  • Overview
  • Dynamic program behavior
  • Profiling
    • Overview
    • Role
    • Type
    • Collecting the profile data
    • Profile during interpretation
    • Profiling translated code
    • Overhead
  • Optimizing translation blocks

Page 15: Dynamic Binary Optimization  –  Part 1

Profiling : Role

  • Definition
    • The process of collecting instruction and data statistics for an executing program
  • Usage
    • Input to the code-optimization process
  • Principle of profiling
    • Predictability of programs
    • Past behavior will often hold for future behavior

Page 16: Dynamic Binary Optimization  –  Part 1

Profiling : Role

  • Traditional profiling & optimization procedure
    • Decomposing the source program into a control flow graph
    • Analyzing the graph and inserting probes to collect profile information
    • Running the program with a typical data input
    • Generating profile data
    • Static profile log analysis
    • Generating optimized code
  • Property
    • Fully analyzed
    • Optimal placement of probes
    • Entire program run and complete profile

Figure: HLL program → compiler frontend → control flow graph (blocks A to F) → compiler backend → instrumented code → program execution with test data → program statistics → optimizing compiler → optimized binary.

Page 17: Dynamic Binary Optimization  –  Part 1

Profiling : Role

  • Difficulty, requirement and limitation in dynamic optimization
    • Program structure is not known when a program begins
    • Program structure must be discovered in an incremental way
    • Profiling probes cannot easily be inserted in a globally optimal manner
    • Optimization decisions must be made as early as possible
    • Statistics come from a partial execution of the program

Figure components: Program binary, Program data, Interpreter, Partial program statistics, Translator/optimizer, and a partially discovered control flow graph (blocks A, B, D, E).

Page 18: Dynamic Binary Optimization  –  Part 1

Profiling : Role

  • Tradeoff between overhead and benefit
    • Overhead : initial analysis + actual collection of profile data
    • Benefit : execution time reduction due to optimization
  • Static optimization
    • Overhead is paid once
  • Dynamic optimization
    • Overhead is paid every time a guest program runs
    • Benefits must outweigh the overhead

Page 19: Dynamic Binary Optimization  –  Part 1

Profiling : Type of profile data

  • Frequency of execution of different code regions
    • Hotspot
    • Interpretation VS binary translation
  • Profile data based on control flow (branch and jump) predictability
    • Can be used for determining aspects of a program’s dynamic execution behavior
    • Used as a basis for gathering and rearranging BBs into larger units
    • Used to guide specific optimizations
  • Address
  • Data

Page 20: Dynamic Binary Optimization  –  Part 1

Profiling : Type of profile data

  • Basics
    • Nodes : BBs
    • Edges : flow of control
  • BB profile
    • Numbers are counts of the corresponding BB’s execution
  • Edge profile
    • BB profile can be derived from edge profile
  • Path profile
    • Approximate the path profile by using a heuristic based on the edge profile

Figure: control flow graph of blocks A to F, shown twice. BB profile: A(65), B(50), C(15), D(25), E(48), F(17). Edge profile counts as extracted: 50, 12, 13, 2, 10, 15, 38, 48, 17, 15.
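The statement that a BB profile can be derived from an edge profile can be illustrated with a minimal C sketch. It assumes that a block's count equals the sum of the counts on its incoming edges, which misses any execution that enters the region from outside. The assignment of the figure's edge counts to particular arcs, including the back arcs from E and F to A, is inferred from the node counts and is not stated in the slides. The names `edge_profile` and `bb_profile_from_edges` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { A, B, C, D, E, F, NUM_BLOCKS };

struct edge {
    int from;
    int to;
    unsigned count;   /* number of times control flowed along this edge */
};

/* Assumed mapping of the figure's edge counts to arcs (not given in the source). */
static const struct edge edge_profile[] = {
    {A, B, 50}, {A, C, 15}, {B, D, 12}, {B, E, 38}, {C, D, 13},
    {C, F, 2},  {D, E, 10}, {D, F, 15}, {E, A, 48}, {F, A, 17},
};

/* Derive per-block execution counts by summing each block's incoming edges. */
void bb_profile_from_edges(const struct edge *edges, size_t n,
                           unsigned counts[NUM_BLOCKS])
{
    assert(edges != NULL && counts != NULL);
    for (int b = 0; b < NUM_BLOCKS; b++)
        counts[b] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(edges[i].to >= 0 && edges[i].to < NUM_BLOCKS);
        counts[edges[i].to] += edges[i].count;
    }
}
```

Under the assumed mapping, the derived counts agree with the BB profile in the figure, for example D = 12 + 13 = 25. The reverse derivation is not possible, which is why the edge profile is the richer of the two.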

Page 21: Dynamic Binary Optimization  –  Part 1

Profile : collecting the profile

  • Instrumentation based profiling
    • Target program related events
    • Count all instances of the event being profiled
    • Many different events can be monitored simultaneously
    • Monitoring method : HW, SW
  • Sampling based profiling
    • Program runs in its unmodified form
    • Program is interrupted and an instance of a program related event is captured
  • Tradeoff
    • Instrumentation based
      • Slow, but can collect a given amount of profile data over a much shorter period of time
    • Sampling based
      • Fast, but requires a longer time to collect the same amount of profile information

Page 22: Dynamic Binary Optimization  –  Part 1

Profile : collecting the profile

  • Strategy
    • Collection technique depends on the emulation spectrum
  • Interpretation
    • SW instrumentation is about the only choice
  • Optimizing binary translation, dynamic optimization system
    • Instrumentation
  • Already well optimized, longer running program
    • Sampling

Page 23: Dynamic Binary Optimization  –  Part 1

Profile : profiling during interpretation

  • Key points
    • Source instructions are actually accessed as data
    • Profiling code must be added to the interpreter routines
    • Profiling is applied to certain classes of instructions rather than to specific instructions
      • E.g. backward branches
  • Method
    • BB profile
      • Profile code should be added to all control transfer instructions, after the PC has been updated
    • Edge profile
      • Both the PC of the control transfer instruction and the target PC are used to define a specific edge

Page 24: Dynamic Binary Optimization  –  Part 1

Profile : profiling during interpretation

  • Profile table
    • Access method
      • BB profile : via the PC value of the control transfer destination
      • Edge profile : via the PC values that define an edge
      • Hash function
    • Contents of entry
      • Basic block or edge count
      • For conditional branches, taken count and not-taken count

Page 25: Dynamic Binary Optimization  –  Part 1

Profile : profiling during interpretation

    Instruction function list
    .
    .
    Branch_conditional(inst) {
        BO = extract(inst, 25, 5);
        BI = extract(inst, 20, 5);
        displacement = extract(inst, 15, 14) * 4;
        .
        .
        // code to compute whether branch should be taken
        .
        .
        profile_addr = lookup(PC);        // find this branch's profile table entry
        if (branch_taken) {
            profile_cnt(profile_addr, taken);
            PC = PC + displacement;
        } else {
            profile_cnt(profile_addr, nottaken);
            PC = PC + 4;
        }
    }

Figure: the branch PC is passed through HASH to select a profile table entry holding the PC, the taken count and the not-taken count.
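The profile table that the interpreter routine consults can be sketched as below. This is a minimal illustration, not the deck's implementation: the table size, the hash on the low PC bits (assuming 4-byte instructions) and the linear-probing collision policy are all assumptions, and the names `profile_lookup` and `profile_count` are hypothetical stand-ins for the slide's `lookup` and `profile_cnt`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u            /* assumed size; must be a power of two */

struct profile_entry {
    uint32_t pc;                    /* tag: PC of the profiled branch */
    uint32_t taken;                 /* taken count */
    uint32_t not_taken;             /* not-taken count */
    int      valid;
};

static struct profile_entry profile_table[TABLE_SIZE];

/* Assumed hash: drop the two low bits of a 4-byte-aligned PC, then mask. */
static unsigned hash_pc(uint32_t pc)
{
    return (pc >> 2) & (TABLE_SIZE - 1u);
}

/* Find the entry for pc, creating it on first use. Returns NULL if the table is full. */
struct profile_entry *profile_lookup(uint32_t pc)
{
    unsigned start = hash_pc(pc);
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        struct profile_entry *e = &profile_table[(start + i) & (TABLE_SIZE - 1u)];
        if (!e->valid) {            /* free slot: claim it for this PC */
            e->valid = 1;
            e->pc = pc;
            e->taken = 0;
            e->not_taken = 0;
            return e;
        }
        if (e->pc == pc)            /* one compare against the stored tag */
            return e;
    }
    return NULL;
}

/* Increment the counter matching the branch outcome. */
void profile_count(struct profile_entry *e, int branch_taken)
{
    assert(e != NULL);
    if (branch_taken)
        e->taken++;
    else
        e->not_taken++;
}
```

In the common case a lookup costs one hash computation, one load and one compare, and an update costs one load, one add and one store.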

Page 26: Dynamic Binary Optimization  –  Part 1

Profile : profiling during interpretation

  • Profile count decaying
    • Problem of profile table
      • A count field can overflow
    • Solution
      • Key point
        • Optimization methods focus not on absolute counts but on relative frequency
        • Recent program event history is more valuable than older history
      • Decay process
        • Periodically divide all the profile counts by 2
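The decay process can be sketched in a few lines of C. The 16-bit counter width is an assumption, and the saturating increment is an extra safeguard added for illustration; the slides only describe the periodic halving. The names `count_increment` and `profile_decay` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating increment (an added assumption): the field never wraps to zero. */
static inline uint16_t count_increment(uint16_t c)
{
    return c == UINT16_MAX ? c : (uint16_t)(c + 1);
}

/* Decay step: halve every counter. Ratios between counters are roughly
 * preserved, while older events lose weight with every decay period. */
void profile_decay(uint16_t *counts, size_t n)
{
    assert(counts != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        counts[i] >>= 1;
}
```

How often to call the decay step is a tuning decision that the slides leave open.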

Page 27: Dynamic Binary Optimization  –  Part 1

Profile : profiling during interpretation

  • Profiling jump instructions
    • Difficulties of jumps compared with conditional branches
      • Switch statement : the target changes frequently
      • Return from procedure call : many target addresses
    • Solution
      • Key point
        • Profile-driven optimization of indirect jumps tends to focus on those jumps that very frequently have the same target
      • Maintain a profile table with a small number of target addresses and track only the more recently used targets
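A small per-jump table that tracks only recently used targets might look like the following sketch. The capacity of two targets, the move-to-front ordering and the eviction of the least recently used target are assumptions chosen for illustration; `jump_profile` and `jump_profile_record` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TARGETS 2               /* assumed capacity per indirect jump */

struct jump_profile {
    uint32_t target[MAX_TARGETS];   /* ordered most recently used first */
    uint32_t count[MAX_TARGETS];
    int      used;                  /* number of valid slots */
};

/* Record one execution of the jump with the given target address. */
void jump_profile_record(struct jump_profile *p, uint32_t target)
{
    assert(p != NULL);
    int i;
    for (i = 0; i < p->used; i++)
        if (p->target[i] == target)
            break;

    uint32_t cnt;
    if (i < p->used) {
        cnt = p->count[i] + 1;      /* known target: bump its count */
    } else {
        cnt = 1;                    /* new target */
        if (p->used < MAX_TARGETS)
            p->used++;
        i = p->used - 1;            /* take a free slot or evict the oldest */
    }
    /* Shift the more recent entries down and place this target in front. */
    for (; i > 0; i--) {
        p->target[i] = p->target[i - 1];
        p->count[i]  = p->count[i - 1];
    }
    p->target[0] = target;
    p->count[0]  = cnt;
}
```

A jump whose front entry holds nearly all of the counts is the kind of indirect jump that the slide singles out as worth optimizing.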

Page 28: Dynamic Binary Optimization  –  Part 1

Profile : profiling translated code

  • Instrumenting individual instructions
    • Each individual instruction can have its own custom profiling code
      • Profiling can be selectively applied
      • Profile counters can be assigned to each static instruction
    • Profile counters can be directly addressed without hashing
    • Profile code can be easily inserted and removed as needed

Figure: a translated basic block followed by two stubs.

    Fall-through stub
      Increment edge counter(i)
      If (counter(i) > trigger) invoke optimizer
      Else branch to fall-through BB
    Branch target stub
      Increment edge counter(j)
      If (counter(j) > trigger) invoke optimizer
      Else branch to target BB
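The behavior of one stub can be modeled in C as follows. In a real translator the stub is emitted as target machine code; this sketch only models its logic, and the trigger value and the names `edge_stub` and `edge_stub_execute` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRIGGER 50u                 /* assumed threshold for invoking the optimizer */

enum stub_action { BRANCH_TO_TARGET, INVOKE_OPTIMIZER };

/* One counter per static edge, addressed directly, so no hashing is needed. */
struct edge_stub {
    uint32_t counter;
    uint32_t target_pc;             /* where the stub branches when not yet hot */
};

/* Model of one pass through a stub: count the edge, then decide what to do. */
enum stub_action edge_stub_execute(struct edge_stub *s)
{
    assert(s != NULL);
    s->counter++;
    return s->counter > TRIGGER ? INVOKE_OPTIMIZER : BRANCH_TO_TARGET;
}
```

Once the optimizer has run, the stub can be removed, which reflects the slide's point that profile code is easily inserted and removed.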

Page 29: Dynamic Binary Optimization  –  Part 1

Profiling : Overhead

  • Performance overhead
    • Example
      • To access the hash table : hash function + 1 load + 1 compare
      • To increment the proper count : 1 load + 1 store + 1 add
    • Profiling during interpretation VS profiling translated code
      • Absolute overhead VS relative overhead
  • Memory overhead
    • Profile table
  • Overhead reduction method
    • Reducing the number of instrumentation points
      • Heuristic + using collected data
    • Code duplication
      • Attractive for same-ISA optimization ( 4.7 )

Page 30: Dynamic Binary Optimization  –  Part 1

Contents

  • Overview
  • Dynamic program behavior
  • Profiling
  • Optimizing translation blocks
    • Overview
    • Improving locality
    • Traces
    • Superblocks
    • Dynamic superblock formation
    • Tree group

Page 31: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Overview

  • Two strategies
    • Improving locality
    • Optimization on enlarged translation blocks

Page 32: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Improving locality

  • Locality
    • Temporal
    • Spatial
  • Problem
    • Cache space
    • Performance
      • Low instruction fetch bandwidth

Figure: control flow graph of basic blocks A to G annotated with edge execution counts, next to the original code layout.

    A   Br cond1 == true
    B   Br cond2 == false
    C   Br uncond
    D   Br cond3 == true
    E   Br uncond
    F
    G   Br cond4 == true

Cache line example: E (Br uncond) | F | F | F

Page 33: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Improving locality

  • Rearrange the layout of the blocks in memory
    • Conditional branch tests are reversed
    • Unconditional branches are removed or added
    • Instruction fetch efficiency is improved

Figure: code layout before and after rearrangement.

    Before                       After
    A   Br cond1 == true         A   Br cond1 == false
    B   Br cond2 == false        D   Br cond3 == true
    C   Br uncond                E   (Br uncond is removed)
    D   Br cond3 == true         G   Br cond4 == true
    E   Br uncond                    Br uncond
    F                            B   Br cond2 == false
    G   Br cond4 == true         C   Br uncond
                                 F   Br uncond

Page 34: Dynamic Binary Optimization  –  Part 1

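Optimizing translation blocks : Improving locality

  • Procedure inlining

Figure: the original code (A, call proc xyz, B, …, K, call proc xyz, L, with proc xyz consisting of X, Y, Z and return) and the code after inlining, in which the body of proc xyz is copied to the call sites.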

Page 35: Dynamic Binary Optimization  –  Part 1

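Optimizing translation blocks : Improving locality

  • Partial procedure inlining
    • In dynamic optimization system

Figure: the same original code as for full inlining (A, call proc xyz, B, …, K, call proc xyz, L, with proc xyz consisting of X, Y, Z and return). After partial inlining, one copy contains blocks A, B, X, Y and the other contains A, B, X, Z, so each copy holds only part of proc xyz.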

Page 36: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Improving locality

  • Pros and cons of procedure inlining
    • Pros
      • Increases spatial locality
      • Removes overhead
        • Call and return instructions are removed
        • Save/restore instructions are removed
    • Cons
      • Increases code size
      • Increases register “pressure”
        • Inlined code needs more registers than a procedure call
    • Consequently, procedure inlining is typically used only for procedures that are very frequently called and are very small

Page 37: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks

  • Three ways of rearranging basic blocks according to control flow
    • Trace formation
    • Superblock formation
      • Most widely used in VM implementation
    • Tree group
      • Useful when control flow is difficult to predict
      • Provides wider scope for optimization

Page 38: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Traces

  • Traces
    • Chunks of contiguous instructions containing multiple BBs
    • Traces > Superblock
  • Static trace forming steps
    1. Collect a profile using test data
    2. Begin with a start point
       • The most frequently executed BB that is not already part of a trace
    3. Collect BBs along the most common control path until a stopping condition is met
       • A block already belonging to another trace is reached
       • A procedure call/return boundary is reached
    4. Collect the BBs into a trace
       • Reverse branch tests, removing/adding unconditional branches
    5. Stop if every BB belongs to a trace; otherwise go to step 2
  • In a dynamic environment, traces are not commonly used as translation blocks
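Steps 2 to 5 can be sketched in C as below. The profile numbers are hypothetical values chosen for illustration (the deck's own figure did not survive extraction intact), procedure call/return boundaries are not modeled, and `form_traces` is a hypothetical name. Step 4's branch reversal is a code-layout concern and is left out.

```c
#include <assert.h>
#include <stddef.h>

enum { A, B, C, D, E, F, G, NBLOCKS };
#define NO_TRACE (-1)

/* Hypothetical profile: block execution counts and edge counts. */
static const unsigned sample_blocks[NBLOCKS] = { 100, 30, 29, 70, 68, 3, 100 };
static const unsigned sample_edges[NBLOCKS][NBLOCKS] = {
    [A] = { [B] = 30, [D] = 70 },
    [B] = { [C] = 29, [F] = 1 },
    [C] = { [G] = 29 },
    [D] = { [E] = 68, [F] = 2 },
    [E] = { [G] = 68 },
    [F] = { [G] = 3 },
    [G] = { [A] = 97 },
};

/* Assign every block to a trace; returns the number of traces formed. */
int form_traces(const unsigned block_count[NBLOCKS],
                const unsigned edge_count[NBLOCKS][NBLOCKS],
                int trace_of[NBLOCKS])
{
    assert(trace_of != NULL);
    int ntraces = 0;
    for (int b = 0; b < NBLOCKS; b++)
        trace_of[b] = NO_TRACE;

    for (;;) {
        /* Step 2: hottest block that is not yet part of a trace. */
        int start = -1;
        for (int b = 0; b < NBLOCKS; b++)
            if (trace_of[b] == NO_TRACE &&
                (start < 0 || block_count[b] > block_count[start]))
                start = b;
        if (start < 0)
            break;                      /* step 5: every block is placed */

        /* Step 3: follow the most common successor until a stop condition. */
        int cur = start;
        while (cur >= 0) {
            trace_of[cur] = ntraces;
            int next = -1;
            for (int s = 0; s < NBLOCKS; s++)
                if (edge_count[cur][s] > 0 &&
                    (next < 0 || edge_count[cur][s] > edge_count[cur][next]))
                    next = s;
            if (next >= 0 && trace_of[next] != NO_TRACE)
                next = -1;              /* successor already belongs to a trace */
            cur = next;
        }
        ntraces++;
    }
    return ntraces;
}
```

With the hypothetical profile above, this groups the blocks as A D E G, then B C, then F.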

Page 39: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Traces

Figure: the control flow graph of blocks A to G with edge execution counts, divided into Trace1, Trace2 and Trace3, and the resulting code layout.

    A   Br cond1 == false
    D   Br cond3 == true
    E   (Br uncond is removed)
    G   Br cond4 == true
        Br uncond
    B   Br cond2 == false
    C   Br uncond
    F   Br uncond

Page 40: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Superblocks

  • Superblocks VS Traces
    • Side entrances : allowed in traces, not in superblocks
  • Problems in forming superblocks
    • A large number of small superblocks
    • Too small to provide many opportunities for optimization
  • Tail duplication
    • The process of replicating code that appears at the end of a superblock in order to form other superblocks

Page 41: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Superblocks

Figure: the control flow graph of blocks A to G with edge execution counts, before and after tail duplication. After duplication, block G appears three times.

Page 42: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : Overview

  • Dynamic
    • Formed incrementally as the source code is being emulated
  • Complication
    • BB replication leads to more choices
  • Key questions
    • Starting point
    • Continuation
    • Stopping point

Page 43: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : starting point

  • Heavily used block
    • By using profile information
  • Method for determining profile points
    • All basic blocks
    • Heuristics
      • Targets of backward branches are candidate starting points
      • Exit arcs from an existing superblock
  • Start threshold
    • When a profiled BB’s execution frequency reaches this value, a new superblock is started
    • Depends on the emulation tradeoff
    • A few tens to hundreds of executions is typical

Page 44: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : Continuation

  • Continuation
    • Which subsequent blocks should be collected and added as the superblock is grown
  • Most frequently used approach
    • Node profile information is used to identify the most likely successor BB
    • Continuation threshold
      • A relatively complete set of profile data must be collected for all BBs
      • Typically half of the start point threshold
    • Continuation set
      • At the time superblock formation is to begin, the set of all BBs that have reached the continuation threshold is collected

Page 45: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : Continuation

  • Most frequently used procedure
    1. Start threshold reached
    2. Collect the continuation set
    3. Build a superblock from the hottest BB, following control flow edges
       • Including only BBs in the continuation set
    4. Superblock is completed
    5. Take the hottest remaining BB as a new start point and repeat until all blocks in the continuation set are exhausted
    6. Emulation process resumes with profiling
       • Until another BB reaches the start threshold
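Growing a single superblock with the most frequently used heuristic might look like the sketch below. It assumes a hypothetical profile, chooses each successor by edge count, and models only three stop conditions: a start point is reached, the successor is outside the continuation set, or a maximum length is hit. The names `grow_superblock` and `MAX_SB_LEN` are hypothetical.

```c
#include <assert.h>

enum { A, B, C, D, E, F, G, NBLOCKS };
#define MAX_SB_LEN 8                /* assumed maximum superblock length */

/* Hypothetical profile: block execution counts and edge counts. */
static const unsigned sample_blocks[NBLOCKS] = { 100, 30, 29, 70, 68, 3, 100 };
static const unsigned sample_edges[NBLOCKS][NBLOCKS] = {
    [A] = { [B] = 30, [D] = 70 },
    [B] = { [C] = 29, [F] = 1 },
    [C] = { [G] = 29 },
    [D] = { [E] = 68, [F] = 2 },
    [E] = { [G] = 68 },
    [F] = { [G] = 3 },
    [G] = { [A] = 97 },
};

/* Grow one superblock from start; writes block ids to out, returns its length. */
int grow_superblock(int start,
                    const unsigned block_count[NBLOCKS],
                    const unsigned edge_count[NBLOCKS][NBLOCKS],
                    unsigned continuation_threshold,
                    const int is_sb_start[NBLOCKS],
                    int out[MAX_SB_LEN])
{
    assert(start >= 0 && start < NBLOCKS);
    int len = 0;
    int cur = start;
    while (len < MAX_SB_LEN) {
        out[len++] = cur;
        /* Most likely successor: the outgoing edge with the highest count. */
        int next = -1;
        for (int s = 0; s < NBLOCKS; s++)
            if (edge_count[cur][s] > 0 &&
                (next < 0 || edge_count[cur][s] > edge_count[cur][next]))
                next = s;
        if (next < 0)
            break;                              /* no successor */
        if (next == start || is_sb_start[next])
            break;                              /* reached a start point */
        if (block_count[next] < continuation_threshold)
            break;                              /* not in the continuation set */
        cur = next;
    }
    return len;
}
```

Unlike static trace formation, a block that already sits inside another superblock may be collected again. With the hypothetical profile, growing from A yields A D E G, and a later superblock grown from B ends with its own copy of G.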

Page 46: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : Continuation

  • Most recently used approach
    • Edge profile information
    • Algorithm
      • Assumption
        • The very next sequence of blocks following a start point is also likely to be a common path
      • Simply follows the actual dynamic control flow path one edge at a time
    • Advantage
      • Only candidate start points need to be profiled
      • No need to use profiling for continuation blocks
      • Profile overhead is substantially reduced
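The most recently used approach needs no successor profile, only the path that execution actually takes once a start point becomes hot. A minimal sketch, assuming the executed path is available as an array of block ids and that recording stops at any start point or at a maximum length; `record_superblock` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

enum { A, B, C, D, E, F, G, NBLOCKS };

/* Collect the blocks executed from path[0] onward into out.
 * Stops when a start point is reached again or when max_len blocks
 * have been collected; returns the superblock length. */
size_t record_superblock(const int *path, size_t path_len,
                         const int *is_start, size_t max_len, int *out)
{
    assert(path != NULL && is_start != NULL && out != NULL);
    size_t len = 0;
    for (size_t i = 0; i < path_len && len < max_len; i++) {
        /* Any start point reached after the first block ends the superblock. */
        if (i > 0 && is_start[path[i]])
            break;
        out[len++] = path[i];
    }
    return len;
}
```

Because the recorded path is whatever happened to execute next, a less common path is sometimes captured instead of the predominant one.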

Page 47: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : stopping point

  • Types of heuristics to determine the stop condition
    • The start point of the same superblock is reached
    • A start point of some other superblock is reached
    • A superblock has reached some maximum length
      • A BB can be used in more than one superblock
      • There may be multiple copies of a given BB
      • Explosion of code size
    • When using the most frequently used heuristic, there are no more candidate BBs that have reached the candidate threshold
    • An indirect jump is reached, or there is a procedure call

Page 48: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : Example

  • Most frequently used
    • Start point threshold : 100
    • Continuation threshold : 50

Figure: control flow graph of blocks A to G with edge execution counts.

Page 49: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Dynamic superblock formation : Example

  • Most recently used
    • Profile point is just A, because A is the target of a backward branch
    • Most likely superblocks : ADEG, BCG, FG
    • However, there is about a 30% chance of : ABCG, DEG, FG
    • There are cases where the most recently executed method may not select superblocks quite as well as the most frequently executed method
    • Start point threshold : 100
    • Continuation threshold : 50

Figure: control flow graph of blocks A to G with edge execution counts.

Page 50: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Tree group

  • Background
    • Problems when applying superblocks to branches that tend to split their decisions almost evenly
      • Side exits are frequently taken, causing compensation code overhead
      • Optimizations are typically not done along the side exits, losing performance improvement opportunities
  • Traces, Superblock VS Tree group
    • Tree group
      • Conditional branch outcomes are more evenly balanced
      • Generalization of superblock
      • Multiple flows of control
    • Superblocks
      • Conditional branches are predominantly decided one way
      • Single flow of control

Page 51: Dynamic Binary Optimization  –  Part 1

Optimizing translation blocks : Tree group

Figure: tree group formed from the control flow graph of blocks A to G, annotated with edge execution counts. Block G appears three times.