Run Time Optimization

15-745: Optimizing Compilers

Pedro Artigas

Motivation

- A good reason: compiling a language that contains run-time constructs
  - Java dynamic class loading
  - Perl or Matlab eval(“statement”)
  - Faster than interpreting
- A better reason: may use program information only available at run time

Example of run-time information

- The processor that will be used to run the program
  - inc ax is faster on a Pentium III
  - add ax,1 is faster on a Pentium 4
  - No need to recompile if generating code at run time
- The actual program input/run-time behavior
  - Is my profile information accurate for the current program input? YES!

The life cycle of a program

Stage      Scope            Analysis
Compile    One object file  Global analysis
Link       One binary       Whole-program analysis
Load/Run   One process      Analysis? No, observation!

Larger scope, better information about program behavior

New strategies are possible

- Pessimistic vs. optimistic approaches
- Ex: Does int *a point to the same location as int *b?
  - Compile time/Pessimistic: prove that in ANY execution those pointers point to different addresses
  - Run time/Optimistic: up to now in the current execution, a and b point to different locations
    - Assume this holds
    - If the assumption breaks, invalidate the generated code and generate new code
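The optimistic strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `OptimisticKernel`, the kernel it specializes (`mem[a] = 1; mem[b] = 2; return mem[a]`), and the use of list indices in place of pointers are all hypothetical and do not come from any real run-time compiler.

```python
class OptimisticKernel:
    """Hypothetical specializer for: mem[a] = 1; mem[b] = 2; return mem[a]."""

    def __init__(self):
        # Optimistic: a and b have never aliased in this execution so far.
        self.assume_no_alias = True
        self.recompilations = 0
        self.code = self._generate()

    def _generate(self):
        """Stand-in for run-time code generation: returns a Python closure."""
        if self.assume_no_alias:
            def fast(mem, a, b):
                mem[a] = 1
                mem[b] = 2
                return 1          # no reload: the store to mem[b] cannot touch mem[a]
            return fast

        def general(mem, a, b):
            mem[a] = 1
            mem[b] = 2
            return mem[a]         # reload: the store through b may have changed mem[a]
        return general

    def run(self, mem, a, b):
        if self.assume_no_alias and a == b:
            # The assumption broke: invalidate the old code and generate new code.
            self.assume_no_alias = False
            self.code = self._generate()
            self.recompilations += 1
        return self.code(mem, a, b)


kernel = OptimisticKernel()
print(kernel.run([0, 0], 0, 1), kernel.recompilations)
print(kernel.run([0, 0], 0, 0), kernel.recompilations)
```

The guard in `run` is the price of optimism: it is paid on every call, so it has to be much cheaper than the work the specialized code saves.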

A sanity check

- Using run-time information does not require run-time code generation
- Example: versioning
- The ISA may allow cheaper tests
  - IA-64
  - Transmeta

  if (a != b) {
    <code generated assuming a != b>
  } else {
    <code generated assuming a == b>
  }
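A minimal sketch of versioning, applied here to array ranges rather than to two single pointers (that generalization is an assumption of this example, as is the function name `copy_versioned`). Both versions exist ahead of time; a run-time test only selects between them.

```python
def copy_versioned(mem, dst, src, n):
    """Same result as: for i in range(n): mem[dst + i] = mem[src + i]."""
    if dst + n <= src or src + n <= dst:
        # Version built assuming the two ranges do not overlap:
        # a bulk copy is safe and easy to optimize.
        mem[dst:dst + n] = mem[src:src + n]
    else:
        # Version built assuming the ranges may overlap:
        # keep the original element-by-element order.
        for i in range(n):
            mem[dst + i] = mem[src + i]
    return mem


print(copy_versioned([1, 2, 3, 0, 0, 0], 3, 0, 3))
print(copy_versioned([1, 2, 3, 4, 5], 1, 0, 3))
```

No code is generated at run time here, which is the point of the slide: the run-time information is consumed by a test, not by a compiler.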

Drawbacks

- Code generation has to be FAST
  - Rule of thumb: almost linear in program size
- Code quality: compromise on quality to achieve fast code generation
  - Shoot for good, not great
- This usually also means: no time for classical Iterative Data Flow Analysis (IDFA) at run time

No classical IDFA: Solutions

- Quasi-static and/or staged compilation
  - Perform IDFA at compile time
  - Specialize the dynamic code generator for the obtained information
  - That is, encode the obtained data flow information in the “binary”
- Do not rely on classical IDFA
  - Use algorithms that do not require it. Ex: dominator-based value numbering (coming up!)
  - Generate code in a style that does not require it. Ex: one-entry, multiple-exit traces, as in Deco and Dynamo

Code generation strategies

- Compiling a language that requires run-time code generation: compile adaptively
  - Use a very simple and fast code generation scheme
  - Re-compile frequently used regions using more advanced techniques

Adaptive Compilation: Motivation

- Very simple code generation: higher execution cost
- Elaborate code generation: higher compilation cost
- Problem: we may not know in advance how frequently a region will execute
- Measure frequencies and re-compile dynamically

[Figure: total cost versus execution count for the fast compiler and the optimizing compiler, with the re-compilation threshold marked. Strategies compared: fast only, optimizing only, 2-level recompilation, optimal (oracle).]
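The trade-off can be made concrete with a toy cost model. All numbers below (compile costs, per-execution costs, the threshold of 20) are made up for illustration, and the function names are hypothetical.

```python
def total_cost(n, compile_cost, cost_per_run):
    """Cost of compiling a region once and then executing it n times."""
    return compile_cost + n * cost_per_run


def two_level_cost(n, fast, opt, threshold):
    """Start with the fast compiler; re-compile once the count passes threshold.

    fast and opt are (compile_cost, cost_per_run) pairs.
    """
    if n <= threshold:
        return total_cost(n, *fast)
    return total_cost(threshold, *fast) + total_cost(n - threshold, *opt)


def oracle_cost(n, fast, opt):
    """An oracle knows n in advance and picks the cheaper compiler up front."""
    return min(total_cost(n, *fast), total_cost(n, *opt))


FAST = (1, 10)    # hypothetical: cheap to compile, slow code
OPT = (100, 2)    # hypothetical: expensive to compile, fast code

for n in (5, 1000):
    print(n, total_cost(n, *FAST), total_cost(n, *OPT),
          two_level_cost(n, FAST, OPT, threshold=20), oracle_cost(n, FAST, OPT))
```

With these numbers the 2-level strategy matches the oracle for a cold region and stays close to it for a hot one, while each single-compiler strategy is far off in one of the two cases.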

Code generation strategies

- Compiling selected regions that benefit from run-time code generation: pick only the regions that should benefit the most
- Which regions?
  - Select them statically
  - Use profile information
  - Re-compile (that is, select them dynamically)
  - Usually all of the above

Code optimization unit

- What is the run-time unit of optimization?
- Option: procedures/static code regions
  - Similar to static compilers
- Option: traces
  - Start at the target of a backward branch
  - Include all the instructions in a path
  - May include procedure calls and returns
  - Branches: fall through = remain in the trace; target = exit the trace

[Figure: a CFG with blocks 1, 2, 3 and 4, and the traces 1-2-4 and 3-4 formed from it.]

Current strategies

                               Static region            Trace
JIT compilers                  Java JITs, Matlab JITs   ?
Run-time performance engines   DyC, Fabius              Dynamo, Deco

Run-time code generation: Case studies

- Two examples of algorithms that are suitable for run-time code generation
  - Run-time CSE/PRE replacement: dominator-based value numbering
  - Run-time register allocation: linear scan register allocation

Sidebar

- With traces, CSE/PRE become almost trivial
- No need for register allocation if optimizing a binary (ex: Dynamo)

[Figure: repeated occurrences of A+B, labeled CSE and PRE.]

Review: Local value numbering

- Store expressions already computed (in a hash table)
- Store the variable name -> VN mapping in the VN array
- Store the VN -> variable name mapping in the Name array
- Same value number => same value

  for each basic block
    Table.empty()
    for each computed expression (“x = y op z”)
      if V = Table.lookup(“VN[y] op VN[z]”)
        // Expression was computed in the past, check if the result is available
        if VN[Name[V]] == V    // expression is still there
          replace “x = y op z” with “x = Name[V]”
        else
          Name[V] = “x”
        VN[“x”] = V            // updated after the check, in case x is Name[V]
      else
        // New expression, add to the table
        VN[“x”] = new_value_number()
        Table.insert(“VN[y] op VN[z]”, VN[“x”])
        Name[VN[“x”]] = “x”
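A runnable sketch of local value numbering for a single basic block. The instruction format, a `(dest, op, left, right)` tuple, and the `"copy"` pseudo-operation are assumptions of this example. The table is keyed on the value numbers of the operands, and the availability check is made before the destination's value number is updated. Commutativity is not exploited.

```python
def local_value_numbering(block):
    """block: list of (dest, op, left, right) tuples for one basic block.

    Returns a new list in which redundant computations have been replaced
    by (dest, "copy", holder, None).
    """
    table = {}    # (op, VN[left], VN[right]) -> value number
    vn = {}       # variable name -> value number
    name = {}     # value number -> variable name that received that value
    counter = 0

    def number_of(var):
        nonlocal counter
        if var not in vn:             # block input: give it a fresh value number
            counter += 1
            vn[var] = counter
            name[counter] = var
        return vn[var]

    out = []
    for dest, op, left, right in block:
        key = (op, number_of(left), number_of(right))
        v = table.get(key)
        if v is not None and vn.get(name[v]) == v:
            # Computed before and the holder still contains the value.
            out.append((dest, "copy", name[v], None))
        else:
            out.append((dest, op, left, right))
            if v is None:             # new expression, add it to the table
                counter += 1
                v = counter
                table[key] = v
            name[v] = dest            # dest now holds this value
        vn[dest] = v
    return out


block = [("a", "+", "b", "c"),
         ("d", "+", "b", "c"),   # redundant: becomes a copy of a
         ("b", "+", "a", "d"),   # redefines b
         ("e", "+", "b", "c")]   # not redundant: b has a new value number
for instruction in local_value_numbering(block):
    print(instruction)
```

Each instruction costs a constant number of dictionary operations, which is where the linear running time comes from.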

Local value numbering

- Works in linear time in program size
  - Assuming accesses to the array and the hash table occur in constant time
- Can we make it work in a scope larger than a basic block? (Hint: Yes)
- What are the potential problems?

Problems

- How to propagate the hash table contents across basic blocks?
- How to make sure that it is safe to access the location containing the expression in other basic blocks?
- How do we make sure that the location containing the expression is fresh?
- Remember: no IDFA

Control flow issues

- At split points things are simple
  - Just keep the content of the hash table from the predecessor
- What about merge points?
  - We do not know if the same expression was computed in all incoming paths
  - We do not want to check the fact anyway (why?)
  - Reset the state of the hash table to a safe state it had in the past
  - Which program point in the past? The immediate dominator of the merge block

Data flow issues

- Making sure the def of an expression is fresh and reaches the blocks of interest
- How? By construction! SSA
  - All names are fresh (single assignment)
  - All defs dominate their uses (regular uses, not uses in φ functions)
  - This holds because, by construction, we introduce new defs using φ functions at every point where it would not hold

Dominator/SSA based value numbering

  DVN(Block B)
    Table.PushScope()
    // First process the φ expressions
    for each φ exp “n = φ(…)”
      if (exp is redundant or meaningless)    // meaningless: φ(x0, x0)
        VN[“n”] = Table.lookup(“φ(…)” or “x0”)
        remove(“n = φ(…)”)
      else
        VN[“n”] = “n”
        Table.insert(“φ(…)”, VN[“n”])
    // Then the regular ones
    for each exp “x = y op z”
      if (“v” = Table.lookup(“y op z”))
        VN[“x”] = “v”
        remove(“x = y op z”)
      else
        VN[“x”] = “x”
        Table.insert(“y op z”, VN[“x”])
    // Propagate info about φ inputs and call DVN recursively
    for each successor s of B
      adjust the φ inputs
    for each dominator tree child c, in CFG reverse post-order
      DVN(c)
    Table.PopScope()

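Example

CFG: block 1 branches to blocks 2 and 3, which merge at block 4.

Block 1:
  u0 = a0 + b0
  v0 = c0 + d0
  w0 = e0 + f0

Block 2:
  x0 = c0 + d0
  y0 = c0 + d0

Block 3:
  u1 = a0 + b0
  x1 = e0 + f0
  y1 = e0 + f0

Block 4:
  u2 = φ(u0, u1)
  x2 = φ(x0, x1)
  y2 = φ(y0, y1)
  u3 = a0 + b0

Name -> VN table to fill in: u0, v0, w0, x0, y0, u1, x1, y1, u2, x2, y2, u3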
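The DVN procedure can be run on this example with a short Python sketch. Several things here are assumptions of the sketch rather than part of the slides: the `cfg` dictionary layout, the grouping of the statements into blocks 1 to 4 of a diamond-shaped CFG, the precomputed dominator tree, and the use of `collections.ChainMap` as the scoped hash table. Operands are replaced by their value numbers before the lookup.

```python
from collections import ChainMap


def dvn(block, cfg, table, vn, removed):
    """Value-number block, then recurse into its dominator-tree children."""
    scope = table.new_child()                        # Table.PushScope()
    for phi in cfg["phis"].get(block, []):           # first the phi functions
        args = tuple(phi["args"])
        key = ("phi", block) + args
        if len(set(args)) == 1:                      # meaningless: phi(x0, x0)
            vn[phi["dest"]] = args[0]
            removed.append(phi["dest"])
        elif key in scope:                           # redundant: same phi in this block
            vn[phi["dest"]] = scope[key]
            removed.append(phi["dest"])
        else:
            vn[phi["dest"]] = phi["dest"]
            scope[key] = phi["dest"]
    for dest, op, left, right in cfg["code"].get(block, []):   # then regular expressions
        key = (op, vn.get(left, left), vn.get(right, right))
        if key in scope:
            vn[dest] = scope[key]
            removed.append(dest)
        else:
            vn[dest] = dest
            scope[key] = dest
    for succ in cfg["succs"].get(block, []):         # adjust the phi inputs
        slot = cfg["preds"][succ].index(block)
        for phi in cfg["phis"].get(succ, []):
            arg = phi["args"][slot]
            phi["args"][slot] = vn.get(arg, arg)
    for child in cfg["dom_children"].get(block, []):  # assumed in reverse post-order
        dvn(child, cfg, scope, vn, removed)
    # Returning discards scope, which plays the role of Table.PopScope().


cfg = {
    "code": {1: [("u0", "+", "a0", "b0"), ("v0", "+", "c0", "d0"), ("w0", "+", "e0", "f0")],
             2: [("x0", "+", "c0", "d0"), ("y0", "+", "c0", "d0")],
             3: [("u1", "+", "a0", "b0"), ("x1", "+", "e0", "f0"), ("y1", "+", "e0", "f0")],
             4: [("u3", "+", "a0", "b0")]},
    "phis": {4: [{"dest": "u2", "args": ["u0", "u1"]},
                 {"dest": "x2", "args": ["x0", "x1"]},
                 {"dest": "y2", "args": ["y0", "y1"]}]},
    "succs": {1: [2, 3], 2: [4], 3: [4]},
    "preds": {2: [1], 3: [1], 4: [2, 3]},
    "dom_children": {1: [2, 3, 4]},                  # block 1 dominates all the others
}
vn, removed = {}, []
dvn(1, cfg, ChainMap(), vn, removed)
print(vn)
print(removed)
```

Because block 4 is a dominator-tree child of block 1, it sees only the table entries made in block 1. That is how the table is reset to the state it had at the immediate dominator of the merge block.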

Problems

- Does not catch:
  - [Figure: x0 = a0+b0 and x1 = a0+b0 in separate blocks, followed by x2 = φ(x0, x1)]
  - [Figure: x0 = a0+b0, followed by x1 = φ(x0, x2) and x2 = a0+b0]
- But it performs almost as well as CSE
- And runs much faster
  - Linear time? (YES? NO?)

Homework #4

- The DVN algorithm scans the CFG in a similar way to the second phase of SSA translation
  - SSA translation phase #1: placing φ functions
  - SSA translation phase #2: assigning unique numbers to variables
- Combine both and save one pass
  - Gives us a smaller constant
  - But, at run time, it pays off!

Run time register allocation

- Graph coloring? Not an option
  - Even the simple stack-based heuristic shown in class is O(n^2)
  - Not even counting: building the graph, move coalescing optimization
- But register allocation is VERY important in terms of performance
  - Remember, memory is REALLY slow
- We need a simple but effective (almost) linear time algorithm

Let’s start simple

- Start with a local (basic block) linear time algorithm
  - Assuming only one def and one use per variable (more constrained than SSA)
  - Assuming that if a variable is spilled it must remain spilled (Why?)
- Can we find an optimum linear time algorithm? (Hint: Yes)
- Ideas? Think about liveness first …

Simple Algorithm: Computing Liveness

- One def and one use per variable, only one block
  - A live range is merely the interval between the def and the use
- Live interval: interval between the first def and the last use
  - OBS: live range = live interval if there is no control flow and only one def and one use
- We could compute live intervals using a linear scan if we store the def instructions (beginning of the interval) in a hash table

Example

S1: A=1

S2: B=2

S3: C=3

S4: D=A

S5: E=B

S6: use(E)

S7: use(D)

S8: use(C)
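The live intervals of this example can be computed in one pass. The sketch below assumes a hypothetical instruction encoding, `(defined_variable_or_None, [used_variables])`, with positions numbered from 1 to match S1 to S8.

```python
def live_intervals(instructions):
    """instructions: list of (dest_or_None, [used_vars]) for one basic block.

    Returns {var: (start, end)} with 1-based positions.
    """
    start = {}    # hash table: variable -> position of its def
    end = {}      # variable -> position of its last use seen so far
    for pos, (dest, uses) in enumerate(instructions, start=1):
        for var in uses:
            end[var] = pos
        if dest is not None and dest not in start:
            start[dest] = pos
            end.setdefault(dest, pos)   # a def without a later use ends where it starts
    return {var: (start[var], end[var]) for var in start}


program = [("A", []),        # S1: A=1
           ("B", []),        # S2: B=2
           ("C", []),        # S3: C=3
           ("D", ["A"]),     # S4: D=A
           ("E", ["B"]),     # S5: E=B
           (None, ["E"]),    # S6: use(E)
           (None, ["D"]),    # S7: use(D)
           (None, ["C"])]    # S8: use(C)
print(live_intervals(program))
```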

Now Register Allocation

- Another linear scan
  - Keep the active intervals in a list (active)
  - Assumption: an interval, when spilled, will remain spilled
- Two scenarios
  - #1: |active| < R: no problem
  - #2: |active| = R: must spill. Which interval?

Spilling heuristic

- Since there is no second chance (that is, a spilled variable will always remain spilled): spill the interval that ends last
- Intuition: as one spill must occur …
  - Pick the one that makes the remaining allocation least constrained
  - That is, the interval that ends last
- This is the provably optimum solution (given all the constraints)

Linear Scan Register Allocation

  active = {}
  freeregs = {all_registers}
  for each interval I (in order of increasing start point)
    // Expire old intervals
    for each interval J in active
      if J.end > I.start
        continue
      active.remove(J)
      freeregs.insert(J.register)
    end for each interval J
    if active.length() == R
      // Must spill: pick either the last interval in active or the new interval
      spill_candidate = active.last()
      if (spill_candidate.end > I.end)
        I.register = spill_candidate.register
        spill(spill_candidate)
        active.remove(spill_candidate)
        active.insert_sorted(I)    // sorted by end point
      else
        spill(I)
    else
      // No constraints
      I.register = freeregs.pop()  // get any register from the free list
      active.insert_sorted(I)      // sorted by end point
  end for each interval I

Example (R=2)

S1: A=1
S2: B=2
S3: C=3
S4: D=A
S5: E=B
S6: use(E)
S7: use(D)
S8: use(C)

[Figure: live intervals of A, B, C, D and E drawn against statements S1 to S8.]
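A runnable sketch of linear scan on this example, with the live intervals read off the statements (A: S1 to S4, B: S2 to S5, C: S3 to S8, D: S4 to S7, E: S5 to S6). The `Interval` class and the register names `r1` and `r2` are hypothetical, and at least one register is assumed. For brevity the active list is re-sorted after each insertion instead of using a sorted insert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interval:
    name: str
    start: int
    end: int
    register: Optional[str] = None
    spilled: bool = False


def linear_scan(intervals, registers):
    """Assign a register to each interval, or mark it as spilled, in place."""
    free = list(registers)
    active = []                                   # kept sorted by increasing end point
    for cur in sorted(intervals, key=lambda i: i.start):
        # Expire old intervals: their registers become free again.
        for old in [j for j in active if j.end <= cur.start]:
            active.remove(old)
            free.append(old.register)
        if len(active) == len(registers):
            candidate = active[-1]                # the active interval that ends last
            if candidate.end > cur.end:
                cur.register = candidate.register  # steal its register
                candidate.register = None
                candidate.spilled = True
                active.remove(candidate)
                active.append(cur)
                active.sort(key=lambda i: i.end)
            else:
                cur.spilled = True                # the new interval ends last: spill it
        else:
            cur.register = free.pop()             # any free register will do
            active.append(cur)
            active.sort(key=lambda i: i.end)
    return intervals


example = [Interval("A", 1, 4), Interval("B", 2, 5), Interval("C", 3, 8),
           Interval("D", 4, 7), Interval("E", 5, 6)]
for interval in linear_scan(example, ["r1", "r2"]):
    print(interval.name, "spilled" if interval.spilled else interval.register)
```

With two registers only C is spilled: it is the interval that ends last when the third value arrives, and A's and B's registers are later reused by D and E.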

Is the second pass really linear?

- Invariant: active.length() <= R
- Complexity: O(R*n)
- R is usually a small constant (128 at most)
- Therefore: O(n)

And we are done! Right?

- YES and NO
- Use the same algorithm as before for register assignment
- Program representation: linear list of instructions
- Live intervals are not precise anymore given control flow and multiple defs/uses
- Not optimum, but still FAST
- Code quality: within 10% of graph coloring for SPEC95 benchmarks (one problem with this claim)

The worst problem: Obtaining precise live intervals

- How to obtain precise live interval information FAST?
- The claim of 10% relies on live interval information obtained using liveness analysis (IDFA)
  - IDFA is SLOW, O(n^3)
- Most recent solutions:
  - Use the local interval algorithm for variables that only live inside one basic block
  - Use liveness analysis for more global variables
  - Alleviates the problem, does not fully solve it

More problems: Live intervals may not be precise

- OBS: the idea of lifetime holes leads to allocators that also try to use these holes to assign the same register to other live ranges (bin-packing)
- Such an allocator is used in the Alpha family of compilers (GEM compilers)

Other problems: Linearization order

- Register allocation quality depends on the chosen block linearization order
- Choose a good order in practice:
  - Layout order
  - Depth-first traversal of the CFG
- Both only 10% slower than graph coloring

Graph coloring versus Linear scan

Compilation cost scaling

Conclusion

- Run-time code generation provides new optimization opportunities
- Challenges
  - Identify new optimization opportunities
  - Design new compilation strategies (example: optimistic versus conservative)
  - Design algorithms and implementations that minimize run-time overhead and do not compromise much on code quality
- Recent examples indicate: extending fast local methods is a promising way to obtain fast run-time code generation
