CMPUT680 - Winter 2006

CMPUT 329 - Computer Organization and Architecture II

1


Topic I: Superblock and Hyperblock Formation

José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680


2

Instruction Level Parallelism Optimizations

The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor.

Superscalar or Very Long Instruction Word (VLIW) processors can reduce the execution time even when the number of instructions executed increases moderately, as long as the dependence height is reduced.


3

Speculative and Predicated Execution

Speculative Execution: execution of an instruction before knowing that its execution is required.

Predicated Execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.

Superblock: structure used to implement compiler-controlled speculative execution.

If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.


4

Trace Scheduling (Fisher, 1981)

Some optimization and scheduling decisions may decrease the execution time for one control path while increasing the execution time for another path.

Thus decisions should favor the more frequently executed paths to improve overall performance.

Trace scheduling divides a procedure into a set of frequently executed traces (paths).


5

Trace Scheduling

There may be conditional branches out of the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).

These control-flow transitions are ignored during trace scheduling.

After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.


6

Bookkeeping for Trace Scheduling

Trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5

What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?


7




Instr 3
Instr 4


8




What bookkeeping is required when Instr 5 moves above the side entrance in the trace?


9




Instr 5


10

Superblocks

A superblock is a trace without side entrances, i.e., control can only enter from the top, but it can leave at one or more exit points.

The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored (thus these constraints do not inhibit optimizations that favor frequently executed paths).


11

Superblock Formation (Example)

[Figure: control flow graph between entry Y and exit Z, annotated with block and edge execution counts.]


12


[Figure: the same control flow graph with a candidate trace that ends in node F.]

Is this a superblock?

No. A superblock cannot have side entrances, and this set of nodes has two side entrances into node F. How do we convert it into a superblock?


13


[Figure: the control flow graph after tail duplication. Block F now executes 90 times, with edge counts 89.1 and 0.9.]

Tail duplication is the duplication of the basic blocks that appear after a side entrance. It eliminates the side entrances and transforms a trace into a superblock.
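The definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm from the course: the function name `tail_duplicate`, the successor-map representation, and the small graph (shaped like the example, with A assumed as the trace head) are all assumptions.

```python
def tail_duplicate(succs, trace):
    """Return a new successor map in which `trace` has no side entrances.

    succs: dict mapping every block name to the list of its successors.
    trace: list of block names, head first (hypothetical representation).
    """
    succs = {b: list(s) for b, s in succs.items()}
    preds = {b: [] for b in succs}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)

    # First trace block (other than the head) that is entered from
    # anywhere but its predecessor on the trace.
    first = None
    for i in range(1, len(trace)):
        if any(p != trace[i - 1] for p in preds[trace[i]]):
            first = i
            break
    if first is None:
        return succs  # already a superblock

    # Duplicate every block from the first side entrance to the end.
    tail = trace[first:]
    copy = {b: b + "'" for b in tail}
    for b in tail:
        succs[copy[b]] = [copy.get(t, t) for t in succs[b]]

    # Redirect each side entrance to the corresponding copy.
    for i in range(first, len(trace)):
        block = trace[i]
        for p in preds[block]:
            if p != trace[i - 1]:
                succs[p] = [copy[block] if t == block else t for t in succs[p]]
    return succs


cfg = {"A": ["B", "C"], "B": ["E", "D"], "C": ["F"], "D": ["F"],
       "E": ["F"], "F": ["A", "Z"], "Z": []}
result = tail_duplicate(cfg, ["A", "B", "E", "F"])
```

In this sketch the edges from C and D are redirected to the copy F', so the trace A, B, E, F is entered only at A.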

[Figure, continued: the duplicated block F' executes 10 times, with edge counts 9.9 and 0.1.]


14

Common Subexpression Elimination in Superblocks

Original Code

opA: mul r1,r2,3
opB: add r2,r2,199    (off-trace block)
opC: mul r3,r2,3

Code After Superblock Formation

opA: mul r1,r2,3
opC: mul r3,r2,3
opB: add r2,r2,199    (off-trace block)
opC': mul r3,r2,3     (duplicated tail)

Code After Common Subexpression Elimination

opA: mul r1,r2,3
opC: mov r3,r1
opB: add r2,r2,199    (off-trace block)
opC': mul r3,r2,3     (duplicated tail)


15

Operation Migration in Superblocks

Original Code

… mov r0,r1
… mov r0,r2
… mov r0,r3
… add r1,r1,4
  add r2,r2,4
  add r3,r3,4

X   Y   Z

After Operation Migration

…
…
…
… add r1,r1,4
  add r2,r2,4
  add r3,r3,4

mov r0,r1
mov r0,r2
mov r0,r3

X   Y   Z


16

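Global Variable Migration in Superblock Loops

Original Program Segment

OpA: ld_i r4, x, r0
OpB: add r4, r4, r1
OpC: st_i x, r0, r4
OpD: add r1, r1, 1
OpE: add r0, r0, 1

[Figure: the loop's control flow (edge count 100) next to the contents of memory (values 0, 10, 20, 30, indexed as MEM[r0+x]) and of registers r4, r1 = 1, and r0 = 1.]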


17


[Animation, slides 17 to 27: step-by-step execution of the loop, showing MEM[r0+x], r4, r1, and r0 being updated as OpA through OpE execute.]


28


After Variable Migration

OpC: st_i x, r0, r4
OpC': st_i x, r0, r4
OpE: add r0, r0, 1
OpA: ld_i r4, x, r0
OpB: add r4, r4, r1
OpD: add r1, r1, 1

[Figure: the instructions above arranged in control flow graph blocks, with edge counts 0 and 100.]


29

Superblock Enlarging Optimizations

By enlarging a superblock, we can provide the scheduler with more independent instructions to choose from for each cycle.

Superblock enlarging optimizations:
Branch target expansion
Loop unrolling
Loop peeling


30

Branch Target Expansion

Idea: expand the superblock with the target of a likely taken branch.

[Figure: a superblock containing blt r1, r2, L3; beq r3, r4, L5; and jump L4, with labels L1, L2, and L3 and edge counts 20 and 100, shown before and after branch target expansion.]


31

Superblock Loops

A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node.

We will study the extension of some common loop optimizations to superblocks.


32

Dependence Removing Optimizations

The goal is to eliminate data dependences between instructions within frequently executed superblocks.

Dependence removing optimizations include:
Register renaming
Accumulator variable expansion
Induction variable expansion
Search variable expansion
Operation combining
Strength reduction
Tree height reduction


33

Instruction Latencies for Examples

Function         Latency
Int ALU          1
Int multiply     3
Int divide       10
Branch           1
Memory load      2
Memory store     1
FP ALU           3
FP conversion    3
FP multiply      3
FP divide        10


34

Register Renaming Example

for (j=0; j<n; j++) { C(j) = A(j)+B(j) }

Original Loop

L1: ld_f  f2, A, r1     (a)
    ld_f  f3, B, r1     (b)
    add_f f4, f2, f3    (c)
    st_f  C, r1, f4     (d)
    add   r1, r1, 4     (e)
    blt   r1, r5, L1    (f)

Assembly Code

For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus, for the code above, we obtain the following schedule.


35


[Figure: code schedule for the original loop, plotting instructions a through f against cycles.]

Code Schedule: 7 cycles / 1 iteration


36



After Loop Unrolling

L1: ld_f  f2, A, r1     (a)
    ld_f  f3, B, r1     (b)
    add_f f4, f2, f3    (c)
    st_f  C, r1, f4     (d)
    add   r1, r1, 4     (e)
    ld_f  f2, A, r1     (f)
    ld_f  f3, B, r1     (g)
    add_f f4, f2, f3    (h)
    st_f  C, r1, f4     (i)
    add   r1, r1, 4     (j)
    ld_f  f2, A, r1     (k)
    ld_f  f3, B, r1     (l)
    add_f f4, f2, f3    (m)
    st_f  C, r1, f4     (n)
    add   r1, r1, 4     (o)
    blt   r1, r5, L1    (p)


37

Loop Unrolling

[Figure: code schedule for the unrolled loop, plotting instructions a through p against cycles.]

Code Schedule: 19 cycles / 3 iterations = 6.3 cycles / iteration





38

Register Renaming



After Register Renaming





39

Loop Unrolling and Register Renaming

[Figure: code schedule after loop unrolling and register renaming, plotting instructions a through p against cycles.]


40

Accumulator Variable Expansion

An accumulator variable accumulates a sum or product in each iteration of a loop.

Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators (k is the number of accumulation instructions). The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.
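At the source level, the transformation can be pictured with the following Python sketch. It is an assumption-laden illustration (the name `dot_expanded` and its parameters are invented here): a dot product is unrolled by k = 3, each copy of the body adds into its own temporary accumulator, and the temporaries are summed at the loop exit. Note that for floating-point data this reassociation can change rounding.

```python
def dot_expanded(a, b, init=0.0, k=3):
    """Dot product with k independent partial accumulators (illustrative)."""
    # The first temporary starts from the original accumulator value,
    # the others start from zero.
    acc = [init] + [0.0] * (k - 1)
    n = len(a)
    main = n - n % k
    # Unrolled body: the k accumulations no longer depend on each other.
    for i in range(0, main, k):
        for j in range(k):
            acc[j] += a[i + j] * b[i + j]
    # Leftover iterations when n is not a multiple of k.
    for i in range(main, n):
        acc[0] += a[i] * b[i]
    # Loop exit: combine the temporaries into the live accumulator.
    total = acc[0]
    for part in acc[1:]:
        total += part
    return total


value = dot_expanded([1, 2, 3, 4, 5, 6, 7], [1] * 7, init=10.0)
```

The dependence chain through the accumulator is k times shorter inside the loop, at the cost of k - 1 extra additions at the exit.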


41

Accumulator Expansion Example

for (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }

Original Loop

    ld_f  f1, C, r2     (-)
L1: ld_f  f3, A, r4     (a)
    ld_f  f5, B, r6     (b)
    mul_f f7, f3, f5    (c)
    add_f f1, f1, f7    (d)
    add   r4, r4, 4     (e)
    add   r6, r6, r8    (f)
    blt   r4, r9, L1    (g)
    st_f  C, r2, f1     (-)

Assembly Code

For all examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus, for the code above, we obtain the following schedule.


42

Accumulator Expansion Example

[Figure: code schedule for the original loop, plotting instructions a through g against cycles.]



43


After Unrolling and Renaming

    ld_f  f1, C, r2       (-)
L1: ld_f  f31, A, r41     (a)
    ld_f  f51, B, r61     (b)
    mul_f f71, f31, f51   (c)
    add_f f1, f1, f71     (d)
    add   r42, r41, 4     (e)
    add   r62, r61, r8    (f)
    ld_f  f32, A, r42     (g)
    ld_f  f52, B, r62     (h)
    mul_f f72, f32, f52   (i)
    add_f f1, f1, f72     (j)
    add   r43, r42, 4     (k)
    add   r63, r62, r8    (l)
    ld_f  f33, A, r43     (m)
    ld_f  f53, B, r63     (n)
    mul_f f73, f33, f53   (o)
    add_f f1, f1, f73     (p)
    add   r41, r43, 4     (q)
    add   r61, r63, r8    (r)
    blt   r4, r9, L1      (s)
    st_f  C, r2, f1       (-)


44


[Figure: code schedule for the unrolled and renamed loop, plotting instructions a through s against cycles. The accumulations d, j, and p into f1 remain serialized.]



45

Accumulator Expansion

    ld_f  f11, C, r2      (-)
    mov_f f12, 0          (-)
    mov_f f13, 0          (-)
L1: ld_f  f31, A, r41     (a)
    ld_f  f51, B, r61     (b)
    mul_f f71, f31, f51   (c)
    add_f f11, f11, f71   (d)
    add   r42, r41, 4     (e)
    add   r62, r61, r8    (f)
    ld_f  f32, A, r42     (g)
    ld_f  f52, B, r62     (h)
    mul_f f72, f32, f52   (i)
    add_f f12, f12, f72   (j)
    add   r43, r42, 4     (k)
    add   r63, r62, r8    (l)
    ld_f  f33, A, r43     (m)
    ld_f  f53, B, r63     (n)
    mul_f f73, f33, f53   (o)
    add_f f13, f13, f73   (p)
    add   r41, r43, 4     (q)
    add   r61, r63, r8    (r)
    blt   r4, r9, L1      (s)
    add_f f11, f11, f12   (-)
    add_f f11, f11, f13   (-)
    st_f  C, r2, f11      (-)

[Figure: code schedule after accumulator expansion, plotting instructions a through s against cycles.]



46

Induction Variable Expansion

An induction variable is used to index through loop iterations and through regular data structures, such as arrays.

Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.


47

Induction Variable Expansion Example

for (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j + K; }

Original Loop

L1: ld_f  f3, A, r2     (a)
    ld_f  f4, B, r2     (b)
    mul_f f5, f3, f4    (c)
    st_f  C, r2, f5     (d)
    add   r2, r2, r7    (e)
    add   r1, r1, 1     (f)
    blt   r1, r6, L1    (g)

Assembly Code

[Figure: code schedule for the original loop, plotting instructions a through g against cycles.]



48


After Unrolling and Renaming

L1: ld_f  f31, A, r21     (a)
    ld_f  f41, B, r21     (b)
    mul_f f51, f31, f41   (c)
    st_f  C, r21, f51     (d)
    add   r22, r21, r7    (e)
    ld_f  f32, A, r22     (f)
    ld_f  f42, B, r22     (g)
    mul_f f52, f32, f42   (h)
    st_f  C, r22, f52     (i)
    add   r23, r22, r7    (j)
    ld_f  f33, A, r23     (k)
    ld_f  f43, B, r23     (l)
    mul_f f53, f33, f43   (m)
    st_f  C, r23, f53     (n)
    add   r21, r23, r7    (o)
    add   r1, r1, 3       (p)
    blt   r1, r6, L1      (q)


49


[Figure: code schedule after unrolling and renaming, plotting instructions a through q against cycles.]

After Unrolling and Renaming: 8 cycles / 3 iterations ≈ 2.7 cycles / iteration


50

Induction Variable Expansion

[Figure: code schedule after induction variable expansion, plotting instructions a through q against cycles.]

6 cycles / 3 iterations = 2 cycles / iteration

    mov   r21, r2         (-)
    add   r22, r21, r7    (-)
    add   r23, r22, r7    (-)
    mul   r71, r7, 3      (-)
L1: ld_f  f31, A, r21     (a)
    ld_f  f41, B, r21     (b)
    mul_f f51, f31, f41   (c)
    st_f  C, r21, f51     (d)
    ld_f  f32, A, r22     (f)
    ld_f  f42, B, r22     (g)
    mul_f f52, f32, f42   (h)
    st_f  C, r22, f52     (i)
    ld_f  f33, A, r23     (k)
    ld_f  f43, B, r23     (l)
    mul_f f53, f33, f43   (m)
    st_f  C, r23, f53     (n)
    add   r21, r21, r71   (e)
    add   r22, r22, r71   (j)
    add   r23, r23, r71   (o)
    add   r1, r1, 3       (p)
    blt   r1, r6, L1      (q)


51

Search Variable Expansion

A search variable is a single value (e.g., a minimum or a maximum) computed for a collection of data.

Search variable expansion eliminates dependences between definitions of search variables and their uses in unrolled loops.

Each search variable is expanded into k temporary independent variables. At the exit of the loop, the value of the original search variable is obtained by comparing the values of the temporary search variables.
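A source-level picture of this expansion, as a hedged Python sketch (the name `min_expanded` and the choice k = 3 are assumptions made for illustration): each copy of the unrolled body updates its own temporary minimum, and the temporaries are compared once at the loop exit.

```python
def min_expanded(values, k=3):
    """Minimum of `values` computed with k independent search variables."""
    best = [float("inf")] * k
    n = len(values)
    main = n - n % k
    # Unrolled body: each comparison updates a different temporary,
    # so the k updates do not depend on each other.
    for i in range(0, main, k):
        for j in range(k):
            if values[i + j] < best[j]:
                best[j] = values[i + j]
    # Leftover iterations when n is not a multiple of k.
    for i in range(main, n):
        if values[i] < best[0]:
            best[0] = values[i]
    # Loop exit: compare the temporaries to recover the search variable.
    return min(best)


smallest = min_expanded([5, 3, 8, 1, 9, 2, 7])
```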


52

Superblock Scheduling

Superblock scheduling is a two-step process:

Step 1: Build the dependence graph.
Step 2: Perform list scheduling using the dependence graph, the instruction latencies, and the resource constraints of the processor.


53

List Scheduling

List scheduling employs heuristics to choose, among all ready nodes, the combination of nodes that should be scheduled in the current cycle.

A node is ready if:
(i) all its parents in the dependence graph have been scheduled;
(ii) the result produced by each parent is available; and
(iii) the resources required by the node are available.
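The readiness conditions can be made concrete with a small Python sketch. It rests on several assumptions that are not in the slides: the only resource constraint is a fixed issue width, the heuristic is "largest latency-weighted height first", and the dependence graph in the example contains flow dependences only.

```python
def list_schedule(latency, parents, issue_width):
    """Return {node: start cycle} for a dependence DAG (illustrative)."""
    children = {n: [] for n in latency}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Heuristic priority: longest latency-weighted path to a leaf.
    height = {}

    def h(n):
        if n not in height:
            height[n] = latency[n] + max((h(c) for c in children[n]), default=0)
        return height[n]

    start = {}
    cycle = 0
    while len(start) < len(latency):
        # (i) parents scheduled and (ii) their results available.
        ready = [n for n in latency
                 if n not in start
                 and all(p in start and start[p] + latency[p] <= cycle
                         for p in parents.get(n, []))]
        ready.sort(key=lambda n: (-h(n), n))
        # (iii) resources: at most issue_width nodes per cycle.
        for n in ready[:issue_width]:
            start[n] = cycle
        cycle += 1
    return start


latency = {"a": 2, "b": 2, "c": 3, "d": 1, "e": 1, "f": 1}
parents = {"c": ["a", "b"], "d": ["c"], "f": ["e"]}
schedule = list_schedule(latency, parents, issue_width=2)
```

With an issue width of 2, the two loads are issued first because they head the longest dependence chain.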


54

Speculative Execution in Superblocks

To produce an efficient schedule, the compiler must be able to move instructions above and below branches.

[Figure: superblock SB1 contains R: x ← y+z followed by the branch S: bnz r1. The taken path of S leads to basic block B2, which contains point P.]

LIVE-OUT(BR) is the set of variables that may be used before being redefined when the branch BR is taken.

In the example, LIVE-OUT(S) is the set of variables that are live at point P.


55


If we want to move instruction R below the branch instruction S, two situations might occur:

1) x ∈ LIVE-OUT(S)
2) x ∉ LIVE-OUT(S)

What is the code that the compiler should produce for each situation?


56


If we want to move instruction R below the branch instruction S, two situations might occur:

1) x ∈ LIVE-OUT(S): insert a copy of instruction R in the branch target.
2) x ∉ LIVE-OUT(S): no compensation code is required.


57


1) x ∈ LIVE-OUT(S): must introduce R' in basic block B2.

SB1:  …
      S: bnz r1
      …
      R: x ← y+z
B2:   R': x ← y+z
      ...
      P

2) x ∉ LIVE-OUT(S): no compensation code is required.

SB1:  …
      S: bnz r1
      …
      R: x ← y+z
B2:   ...
      P


58


Upward code motion is more common, because it reduces the critical path of a superblock (e.g., moving a load instruction upward to hide the load latency).

There are two major restrictions on moving an instruction J from below to above a branch BR:
Restriction 1: The destination of J is not in LIVE-OUT(BR).
Restriction 2: J will never cause an exception that may terminate program execution when BR is taken.


59


Restriction 1 is usually removed by register renaming. By renaming the destination register of instruction J, we ensure that it is not in LIVE-OUT(BR).

There are two extreme interpretations of restriction 2.

Restricted Speculation Model: fully enforce restriction 2.

Therefore only instructions that cannot cause exceptions are candidates for speculative execution (e.g., memory loads, memory stores, integer divides, and all floating-point instructions cannot be speculated).


60


General Speculation Model: completely ignore restriction 2.

This model requires that the processor provide non-excepting, or silent, versions of all potentially excepting instructions in the instruction set architecture. If an exception occurs for a silent instruction, it is simply ignored, and garbage is written in the destination.
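The two restrictions and the two speculation models can be summarized as a legality check. The Python sketch below is a simplification under stated assumptions: the function name, the coarse opcode classes, and the string-valued `model` parameter are all hypothetical, and the general model is taken exactly as described above (restriction 2 is ignored because silent instructions are assumed to exist).

```python
# Opcode classes that may raise an exception, following the list given
# for the restricted model (class names are hypothetical).
POTENTIALLY_EXCEPTING = {"load", "store", "int_divide", "fp"}


def can_hoist_above_branch(opcode_class, dest, live_out, model):
    """Decide whether instruction J may move above branch BR."""
    # Restriction 1: the destination must not be live when BR is taken.
    # Register renaming is the usual way to make this test pass.
    if dest in live_out:
        return False
    # Restriction 2 depends on the speculation model.
    if model == "restricted":
        return opcode_class not in POTENTIALLY_EXCEPTING
    if model == "general":
        return True  # assumes a silent version of the instruction exists
    raise ValueError("unknown speculation model: " + model)


load_restricted = can_hoist_above_branch("load", "r4", {"r1"}, "restricted")
load_general = can_hoist_above_branch("load", "r4", {"r1"}, "general")
```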


61

Example for Speculative Execution

avg = 0;
weight = 0;
count = 0;
while (prt != NULL) {
    count = count + 1;
    if (prt->wt > 0)
        weight = weight - prt->wt;
    else
        weight = weight + prt->wt;
    prt = prt->next;
}
if (count != 0)
    avg = weight/count;

C code segment

(i1)       ld_i r1, prt, 0
(i2)       mov r7, 0          // avg
(i3)       mov r2, 0          // count
(i4)       mov r3, 0          // weight
(i5)       beq r1, 0, L3
(i6)   L0: add r2, r2, 1
(i7)       ld_i r4, r1, 0     // prt->wt
(i8)       bge r4, 0, L1
(i9)       sub r3, r3, r4
(i10)      jmp L2
(i11)  L1: add r3, r3, r4
(i12)  L2: ld_i r1, r1, 4
(i13)      bne r1, 0, L0
(i14)  L3: beq r2, 0, L4
(i15)      div r7, r3, r2
(i16)      st_i avg, 0, r7
(i17)  L4:

Assembly code segment


62

[Figure: control flow graph of the loop with basic blocks BB2 (i6, i7, i8), BB3 (i9, i10), BB4 (i11), and BB5 (i12, i13), annotated with edge counts.]

Trace Selection for the Loop


63

[Figure: trace selection for the loop, and the control flow graph after superblock formation and branch target expansion. BB3' contains i9, i12', and i13'. The result is two superblocks, SB1 and SB2.]


64


[Figure: control flow graph after superblock formation and branch target expansion, with superblocks SB1 and SB2.]

           ld_i r1, prt, 0
           mov r7, 0          // avg
           mov r2, 0          // count
           mov r3, 0          // weight
           beq r1, 0, L3
(i6)   L0: add r2, r2, 1
(i7)       ld_i r4, r1, 0     // prt->wt
(i8)       bge r4, 0, LA
(i11)      add r3, r3, r4
(i12)      ld_i r1, r1, 4     // prt->next
(i13)      bne r1, 0, L0
(i9)   LA: sub r3, r3, r4
(i12')     ld_i r1, r1, 4     // prt->next
(i13')     bne r1, 0, L0
(i14)  L3: beq r2, 0, L4
(i15)      div r7, r3, r2
(i16)      st_i avg, 0, r7
(i17)  L4:



65



(I1)   L0: add r2, r2, 1
(I2)       ld_i r4, r1, 0     // prt->wt
(I3)       blt r4, 0, L1

(I4)       add r3, r3, r4
(I5)       ld_i r5, r1, 4     // prt->next
(I6)       beq r5, 0, L3

(I7)       add r2, r2, 1
(I8)       ld_i r6, r5, 0     // prt->wt
(I9)       blt r6, 0, L1'

(I10)      add r3, r3, r6
(I11)      ld_i r1, r5, 4     // prt->next
(I12)      bne r1, 0, L0

       L3: beq r2, 0, L4
           div r7, r3, r2
           st_i avg, 0, r7

       L4:

      L1': mov r1, r5
           mov r4, r6

       L1: sub r32, r3, r4
           ld_i r1, r1, 4
           bne r1, 0, L0


66



(I1)   L0: add r2, r2, 1
(I2)       ld_i r4, r1, 0     // prt->wt
(I3)       blt r4, 0, L1

(I4)       add r3, r3, r4
(I5)       ld_i r5, r1, 4     // prt->next
(I6)       beq r5, 0, L3

(I7)       add r2, r2, 1
(I8)       ld_i r6, r5, 0     // prt->wt
(I9)       blt r6, 0, L1'

(I10)      add r3, r3, r6
(I11)      ld_i r1, r5, 4     // prt->next
(I12)      bne r1, 0, L0

           div r7, r3, r2
           st_i avg, 0, r7

       L4:

      L1': mov r1, r5
           mov r4, r6

       L1: sub r32, r3, r4
           ld_i r1, r1, 4
           bne r1, 0, L0

       L3: beq r2, 0, L4


67

Hyperblocks

Suggested Reading

Scott A. Mahlke’s Ph.D. Thesis, chap. 7.


68

Hyperblock

A hyperblock is a collection of connected basic blocks in which control may only enter through the first block (the entry block).

Control flow may leave from any number of blocks in the hyperblock.

Before scheduling, all control flow between basic blocks within a hyperblock is removed via if-conversion.


69

Hyperblock Formation

A five-step procedure is used to form hyperblocks:

1. region identification

2. loop backedge coalescing

3. block selection

4. tail duplication

5. if-conversion


70

Running Example: wc

Mahlke uses the inner loop of wc, the Linux program that counts the number of characters, words, and lines in a file, as a running example.


71

The source code

    linect = wordct = charct = token = 0;
    for ( ; ; ) {
A:      if (--(fp)->cnt < 0)
C:          c = filbuf(fp);
        else
B:          c = *(fp)->ptr++;
D:      if (c == EOF) break;
E:      charct++;
        if ((' ' < c) &&
F:          (c < 0177)) {
H:          if (!token) {
K:              wordct++;
                token++;
            }
            continue;
        }
G:      if (c == '\n')
I:          linect++;
J:      else if ((c != ' ') &&
L:               (c != '\t')) continue;
M:      token = 0;
    }


72

The Assembly Code

LA: ld_i r98, r3, 0
    add r27, r98, -1
    st_i r3, 0, r27
    blt r98, 1, LC
LB: ld_i r30, r3, 4
    add r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq r4, -1, EXIT
LE: ld_i r33, r73, 0
    add r32, r33, 1
    st_i r73, 0, r32
    bge 32, r4, LG
LF: bge r4, 127, LG
LH: bne 0, r2, LA
LK: ld_i r36, r72, 0
    add r35, r36, 1
    st_i r72, 0, r35
    add r2, r2, 1
    jmp LA
LG: beq r4, r10, LI
LJ: bne r4, 32, LL
LM: mov r2, 0
    jmp LA
LI: ld_i r39, r71, 0
    add r38, r39, 1
    st_i r71, 0, r38
    jmp LM
LL: bne r4, 9, LA
    jmp LM
LC: mov Parm0, r3
    jsr filbuf
    mov r4, Ret0
    jmp LD


73

Control Flow Graph

[Figure: control flow graph of the wc inner loop, with basic blocks A through M and EXIT, annotated with edge execution counts.]


74

Statistics of the Example

wc is formed by small basic blocks with a large percentage of branches.

It contains 13 basic blocks and 34 instructions.

14 branches:
8 conditional
5 unconditional
1 subroutine call


75

Step 1: Region Identification

A region is a group of basic blocks with a single entry block that dominates all the blocks in the region.

Regions are used because they provide easy-to-compute outer boundaries for hyperblocks.

A basic block can only reside in a single region.

A second constraint imposed on region formation is that regions may not contain internal cycles (this constraint is relaxed later).

In wc, the entire control flow graph forms a region.


76

Step 2: Backedge Coalescing

If-conversion can only remove non-loop branches.

Thus we need to coalesce all backedges into a single backedge. This allows the control logic that chooses which backedge is taken to be eliminated by if-conversion.

To coalesce the backedges, we introduce a new node that will be the origin of the new single backedge. Then we retarget all existing backedges to this new node.
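On an edge-list representation of the CFG, the retargeting step can be sketched as follows. This is an illustrative sketch only: the function name, the edge-list format, and the assumption that the set of backedges has already been identified are not part of the slides. The example uses the four backedges of the wc loop (from H, K, L, and M to A) together with a few forward edges.

```python
def coalesce_backedges(edges, backedges, header, new_node="N"):
    """Retarget every backedge to `new_node`, then add one backedge
    from `new_node` to the loop header."""
    result = [(src, new_node) if (src, dst) in backedges else (src, dst)
              for src, dst in edges]
    result.append((new_node, header))
    return result


edges = [("A", "B"), ("H", "A"), ("H", "K"), ("K", "A"),
         ("L", "A"), ("L", "M"), ("M", "A")]
backedges = {("H", "A"), ("K", "A"), ("L", "A"), ("M", "A")}
coalesced = coalesce_backedges(edges, backedges, header="A")
```

After the call, N is the only block with an edge into A, so the branches in H, K, L, and M become ordinary forward branches that if-conversion can remove.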


77

CFG Before Backedge Coalescing

[Figure: control flow graph of the wc inner loop before backedge coalescing, with edge execution counts.]


78

CFG After Backedge Coalescing

[Figure: control flow graph of the wc inner loop after backedge coalescing. All backedges now target the new block N, and a single backedge (count 105K) goes from N to A.]


79

Step 3: Block Selection

Two conflicting goals:

(1) More blocks can potentially improve performance by eliminating branches among the blocks included.

(2) Too many blocks may result in performance loss due to over-saturation of processor resources or increased dependence height.


80

Enumerating Execution Paths

An execution path is a path of control flow from the entry block to an exit block in the region.

Mahlke assigns a priority to each execution path. This priority indicates the path's relative importance.

Paths are included in the hyperblock from the highest to the lowest priority, based on the available resources.

Mahlke also estimates the available resources and the resource use of each path.


81

Path Priority Function

The path priority function combines four elements:
(1) path execution frequency;
(2) number of instructions in the path;
(3) path dependence height;
(4) hazard conditions on the path.

Intuition: include paths with fewer instructions, with lower dependence height, that have few hazard conditions, and that are executed very often.

Hazard conditions include procedure calls and unresolvable memory stores.


82


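\[
\begin{aligned}
\mathit{dep\_ratio}_i &= 1.0 - \frac{\mathit{dep\_height}_i}{\max_{1 \le j \le N} \mathit{dep\_height}_j} \\
\mathit{op\_ratio}_i &= 1.0 - \frac{\mathit{num\_ops}_i}{\max_{1 \le j \le N} \mathit{num\_ops}_j} \\
\mathit{priority}_i &= \mathit{probability}_i \times \mathit{hazard}_i \times (\mathit{dep\_ratio}_i + \mathit{op\_ratio}_i + K)
\end{aligned}
\]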

Mahlke uses a hazard multiplier of 0.25 for all paths containing a subroutine call or an unresolvable memory reference, and 1.0 for all other paths.


83



The constant K ensures that the path with the largest dependence height and the most operations still has a non-zero priority. Mahlke used K = 0.1.
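A small Python sketch may help to check the priority function on concrete numbers. The dictionary layout and the name `path_priorities` are assumptions made for this example, and the two paths are invented, not taken from wc. The hazard multipliers (0.25 and 1.0) and K = 0.1 are the values quoted for Mahlke's work.

```python
def path_priorities(paths, K=0.1):
    """Priority of each path: probability * hazard * (dep_ratio + op_ratio + K)."""
    max_dep = max(p["dep_height"] for p in paths)
    max_ops = max(p["num_ops"] for p in paths)
    priorities = []
    for p in paths:
        dep_ratio = 1.0 - p["dep_height"] / max_dep
        op_ratio = 1.0 - p["num_ops"] / max_ops
        hazard = 0.25 if p["has_hazard"] else 1.0
        priorities.append(p["probability"] * hazard * (dep_ratio + op_ratio + K))
    return priorities


# Two hypothetical paths: a short hazard-free one and a long one with a call.
example_paths = [
    {"probability": 0.5, "dep_height": 4, "num_ops": 10, "has_hazard": False},
    {"probability": 0.5, "dep_height": 8, "num_ops": 20, "has_hazard": True},
]
short_path, long_path = path_priorities(example_paths)
```

The long path has both ratios equal to zero, so only K keeps its priority above zero.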


84

Block Selection Algorithm

ISSUE_WIDTH = 1 to 8    /* as specified in the machine description file */
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10

block_selection(region) {
    enumerate all paths in the region
    calculate priority of each path
    sort paths from highest to lowest priority
    /* Initialization of loop variables */
    avail_resources = ISSUE_WIDTH * dep_height_1 * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_paths = ∅
    for (i = 1 to num_paths) {
        /* Check if there are enough resources available to include the path */
        if ((num_ops_i + used_resources) > avail_resources) {
            continue
        }
        /* Prevent paths with large relative dependence heights from being included */
        if (dep_height_i > (dep_height_1 * MAX_DEP_GROWTH)) {
            continue
        }


85

Block Selection Algorithm

        /* Do not include paths with a small relative priority to that of the last included path */
        if (priority_i < (last_priority * MIN_PATH_PRIORITY_RATIO)) {
            continue
        }
        /* Include the path in the hyperblock */
        selected_paths = selected_paths ∪ {path_i}
        used_resources = used_resources + num_ops_i
        last_priority = priority_i
    }
    selected_blocks = all blocks contained within selected_paths
    return selected_blocks
}
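The pseudocode translates almost line for line into Python. The sketch below is one possible rendering under stated assumptions: paths arrive as dictionaries with precomputed `priority`, `num_ops`, `dep_height`, and `blocks` fields (a hypothetical layout), and the four example paths are invented to exercise each test in the loop.

```python
def block_selection(paths, issue_width, res_multiplier=2,
                    max_dep_growth=3, min_path_priority_ratio=0.10):
    """Select hyperblock blocks from prioritized paths (illustrative)."""
    ordered = sorted(paths, key=lambda p: p["priority"], reverse=True)
    top_dep_height = ordered[0]["dep_height"]
    avail_resources = issue_width * top_dep_height * res_multiplier
    used_resources = 0
    last_priority = 0.0
    selected_blocks = set()
    for p in ordered:
        # Not enough resources left for this path.
        if p["num_ops"] + used_resources > avail_resources:
            continue
        # Dependence height too large relative to the best path.
        if p["dep_height"] > top_dep_height * max_dep_growth:
            continue
        # Priority too small relative to the last included path.
        if p["priority"] < last_priority * min_path_priority_ratio:
            continue
        selected_blocks.update(p["blocks"])
        used_resources += p["num_ops"]
        last_priority = p["priority"]
    return selected_blocks


candidates = [
    {"priority": 0.50, "num_ops": 10, "dep_height": 5, "blocks": ["A", "B"]},
    {"priority": 0.30, "num_ops": 8, "dep_height": 6, "blocks": ["A", "C"]},
    {"priority": 0.20, "num_ops": 30, "dep_height": 5, "blocks": ["A", "E"]},
    {"priority": 0.01, "num_ops": 2, "dep_height": 5, "blocks": ["A", "D"]},
]
chosen = block_selection(candidates, issue_width=2)
```

Here the third path is rejected for lack of resources and the fourth because its priority is below 10% of the last included path.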


86

Block Selection

[Figure: control flow graph of the wc inner loop after backedge coalescing, with edge execution counts.]

Execution paths in the region:

 1. A-B-D-E-F-H-N
 2. A-B-D-E-F-H-K-N
 3. A-B-D-E-G-J-M-N
 4. A-B-D-E-G-J-L-M-N
 5. A-B-D-E-G-I-M-N
 6. A-B-D-E-G-J-L-N
 7. A-B-D
 8. A-C-D-E-F-H-N
 9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-C-D-E-F-G-I-M-N
20. A-C-D-E-F-G-J-M-N
21. A-C-D-E-F-G-J-L-M-N
22. A-C-D-E-F-G-J-L-N


87

Block Selection

[Figure: control flow graph of the wc inner loop with the selected paths marked.]






88

Path Selection

Some paths that are not selected by the block selection algorithm are also included in the hyperblock because all their blocks belong to selected paths.

An alternative procedure could have eliminated these paths from the path set before the selection.

But the cost of such elimination would be higher than maintaining these extra paths in the set.


89

Block Selection

[Figure: control flow graph of the wc inner loop with the selected blocks marked.]






90

Step 4: Tail Duplication

To convert the set of selected blocks into a hyperblock (with a single entry block), control flow from non-selected blocks (side entry points) must be eliminated.

The tail duplication algorithm first marks all blocks that have side entry points.

Then the algorithm marks all blocks that can be reached from marked blocks.

All marked blocks form the tails that must be duplicated.
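The marking phase can be sketched in Python as follows. The sketch makes several assumptions: the function name and successor-map format are invented, reachability is followed only through selected blocks and stops at the entry block, and the CFG is transcribed from the wc assembly code with C as the only non-selected block.

```python
def blocks_to_duplicate(succs, selected, entry):
    """Mark the selected blocks that form the tails to be duplicated."""
    marked = set()
    # Step 1: blocks with side entry points, i.e. selected blocks (other
    # than the entry) that have a predecessor outside the selection.
    for src, targets in succs.items():
        if src not in selected:
            for t in targets:
                if t in selected and t != entry:
                    marked.add(t)
    # Step 2: everything reachable from a marked block inside the selection.
    worklist = list(marked)
    while worklist:
        block = worklist.pop()
        for t in succs.get(block, []):
            if t in selected and t != entry and t not in marked:
                marked.add(t)
                worklist.append(t)
    return marked


wc_cfg = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["EXIT", "E"],
    "E": ["G", "F"], "F": ["G", "H"], "G": ["I", "J"], "H": ["N", "K"],
    "I": ["M"], "J": ["L", "M"], "K": ["N"], "L": ["N", "M"],
    "M": ["N"], "N": ["A"], "EXIT": [],
}
tails = blocks_to_duplicate(wc_cfg, selected=set("ABDEFGHIJKLMN"), entry="A")
```

With C excluded, the side entry is the edge from C to D, so D and every block after it must be duplicated.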


91

Tail Duplication

[Figure: control flow graph of the wc inner loop with the selected blocks and the side entry point marked.]


92

Tail Duplication

[Figure: control flow graph of the wc inner loop with the blocks to be duplicated marked.]


93

Tail Duplication

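[Figure: control flow graph after tail duplication. The duplicated blocks D', E', F', G', H', I', J', K', L', M', and N' receive the control flow that used to enter the hyperblock through the side entry point, and edge counts are redistributed between the original blocks and their copies.]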


94

Anatomy of a Predicate Computation Operation

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)

This instruction assigns values to Pout1 and Pout2. The value assigned depends on:

• the result of the comparison
• the value of Pin
• the type of Pout1 and Pout2


95

<cmp> = eq | ne | gt

<type> = U | /U | OR | /OR | AND | /AND

Example: pge p4(OR), p2(/U), r4, 127 (p1)

cmp = ge, Pin = p1, Pout1 = p4, Pout2 = p2, src1 = r4, src2 = 127


96




U or /U: always write into the destination register.

if type = U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 1
    else Pout = 0

if type = /U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 0
    else Pout = 1


97




OR or /OR

OR: write into the destination register only if Pin = 1 and <cmp> is true:

if type = OR and Pin = 1 and src1 <cmp> src2 then Pout = 1

/OR: write into the destination register only if Pin = 1 and <cmp> is false:

if type = /OR and Pin = 1 and src1 !<cmp> src2 then Pout = 1

Used when the execution of a block is enabled by one of multiple conditions.

OR-type predicates must be initialized to 0 before their use.


98




AND or /AND

AND: write into the destination register only if Pin = 1 and <cmp> is false:

if type = AND and Pin = 1 and src1 !<cmp> src2 then Pout = 0

/AND: write into the destination register only if Pin = 1 and <cmp> is true:

if type = /AND and Pin = 1 and src1 <cmp> src2 then Pout = 0

Used when the execution of a block requires several conditions to be true.

AND-type predicates are often initialized to 1.


99

Predicate Comparison Truth Table

• Pin predicates the entire predicate computation instruction.
• Notice that for an unconditional type, the value 0 is written in Pout even when Pin is 0.

                        Pout
Pin  Comparison    U    /U    OR    /OR    AND    /AND
0    0             0    0     -     -      -      -
0    1             0    0     -     -      -      -
1    0             0    1     -     1      0      -
1    1             1    0     1     -      -      0
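The truth table can be encoded directly as a function of the predicate type, Pin, the comparison result, and the previous value of Pout (a "-" entry leaves Pout unchanged). The Python sketch below is only an executable restatement of the table; the function name and argument order are assumptions.

```python
def pred_define(ptype, pin, cmp_true, old):
    """New value of Pout for one destination of a predicate define."""
    # Unconditional types always write, even when Pin is 0.
    if ptype == "U":
        return 1 if (pin and cmp_true) else 0
    if ptype == "/U":
        return 1 if (pin and not cmp_true) else 0
    # All other types leave Pout unchanged when Pin is 0.
    if not pin:
        return old
    if ptype == "OR":
        return 1 if cmp_true else old
    if ptype == "/OR":
        return 1 if not cmp_true else old
    if ptype == "AND":
        return 0 if not cmp_true else old
    if ptype == "/AND":
        return 0 if cmp_true else old
    raise ValueError("unknown predicate type: " + ptype)


# pge p4(OR), p2(/U), r4, 127 (p1) with r4 = 65 and p1 = 1 (values assumed):
r4, p1, p4_before = 65, 1, 0
p4_after = pred_define("OR", p1, r4 >= 127, p4_before)
p2_after = pred_define("/U", p1, r4 >= 127, 0)
```

For this input the comparison is false, so p4 keeps its cleared value and p2 is set to 1.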



100

Predicate Comparison Truth Table

Example: pge p4(OR), p2(/U), r4, 127 (p1)

p1   Comparison    p4(OR)    p2(/U)
0    0             -         0
0    1             -         0
1    0             -         1
1    1             1         0


101

Predicate Types

Unconditional predicates are used for control dependence sets that have a single edge.

OR-type predicates are used for predicates with multiple edges in their control dependence sets. (OR-type predicates must be cleared before entering the hyperblock.)


102

Step 5: If-conversion

For graph drawing, Mahlke uses the convention that the left edge out of a basic block is the true condition and the right one is the false condition.

[Figure: block G with successors I (left edge) and J (right edge).]

In this control flow graph the control dependences of blocks I and J are:

I: brG
J: /brG


103


[Figure: control flow graph of the hyperblock after tail duplication. The duplicated tail is drawn as a single node D'-N'.]

Block    Control Dependences    Predicate Assignment
A        none                   null
B        none                   null
D        none                   null
E        none                   null
F        brE                    p1 (U)
G        /brE, /brF             p4 (OR)
H        brF                    p2 (U)
I        brG                    p7 (U)
J        /brG                   p5 (U)
K        brH                    p3 (U)
L        /brJ                   p8 (U)
M        brI, brJ, brL          p6 (OR)
N        none                   null


104


[Figure: control flow graph of the hyperblock after tail duplication, with the duplicated tail drawn as a single node D'-N'.]



105

Step 5: If-conversion (example)

[Figure: control flow graph of the hyperblock with blocks E, F, and G highlighted.]


106

[Figure: control flow graph of the hyperblock with blocks E, F, and G highlighted (continued).]




107

[Figure: control flow graph of the hyperblock shown next to the code produced so far by if-conversion.]

pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U), -, 0, r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
...


108


[Figure: part of the control flow graph (blocks E, F, G, H, I, J and EXIT) shown next to the same partial if-converted listing.]




109

Inner Loop After If-conversion

pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U), -, 0, r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
peq p6(OR), p8(/U), r4, 32 (p5)
ld_i r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_i r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_i r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_i r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8)
mov r2, 0 (p6)
jmp loop


110

Predicate Hierarchy Graph

The Predicate Hierarchy Graph (PHG) is a directed acyclic graph representing the Boolean equations used to compute all the predicates in a hyperblock.

There are two types of nodes in the PHG: predicate nodes and condition nodes.

Two PHG nodes x and y are connected if the value specified by x is used to directly compute the value of y.

The PHG is used to derive relationships among predicates.


111

Example of PHG Construction

pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4 [c1, /c1]
pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]
peq p3(U), -, 0, r2 (p2) [c3]
peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]
peq p7(U), -, r4, r10 (p4) [c4]
peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]
ld_i r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_i r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_i r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_i r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8) [c6]
mov r2, 0 (p6)
jmp loop

T


112



pge p4(OR), p1(/U), 32, r4 [c1, /c1]

[Figure: PHG after this define, with nodes T, c1, /c1, p1, p4.]


113



pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]

[Figure: PHG after this define, with nodes T, c1, /c1, p1, c2, /c2, p4, p2.]


114



peq p3(U), -, 0, r2 (p2) [c3]

[Figure: PHG after this define, with nodes T, c1, /c1, p1, c2, /c2, p4, p2, c3, p3.]


115



peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]

[Figure: PHG after this define, with nodes T, c1, /c1, p1, c2, /c2, p4, c4, /c4, p5, p6, p2, c3, p3.]


116



peq p7(U), -, r4, r10 (p4) [c4]

[Figure: PHG after this define, with nodes T, c1, /c1, p1, c2, /c2, p4, c4 (twice), /c4, p5, p6, p2, c3, p3, p7.]


117



peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]

[Figure: PHG after this define, with nodes T, c1, /c1, p1, c2, /c2, p4, c4 (twice), /c4, p5, c5, /c5, p8, p6, p2, c3, p3, p7.]


118



peq p6(OR), -, r4, 9 (p8) [c6]

[Figure: PHG after this define, with nodes T, c1, /c1, p1, c2, /c2, p4, c4 (twice), /c4, p5, c5, /c5, p8, c6, p6, p2, c3, p3, p7.]


119



[Figure: completed PHG with predicate nodes T and p1 through p8, and condition nodes c1, /c1, c2, /c2, c3, c4 (twice), /c4, c5, /c5, c6.]
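The construction walked through on the preceding slides can be sketched in Python. This is an illustrative reading, not Mahlke's implementation: the `DEFINES` list is transcribed from the annotated predicate defines, unguarded defines are treated as guarded by the root node T, and the names `build_phg` and `guards_of` are hypothetical. Each define contributes one condition node per destination, placed between the guarding predicate and the destination predicate.

```python
from collections import defaultdict

# (guarding predicate, [(destination predicate, condition label), ...]),
# one entry per predicate define, in program order.
DEFINES = [
    ("T",  [("p4", "c1"), ("p1", "/c1")]),
    ("p1", [("p4", "c2"), ("p2", "/c2")]),
    ("p2", [("p3", "c3")]),
    ("p4", [("p6", "c4"), ("p5", "/c4")]),
    ("p4", [("p7", "c4")]),
    ("p5", [("p6", "c5"), ("p8", "/c5")]),
    ("p8", [("p6", "c6")]),
]


def build_phg(defines):
    """Return the PHG as a dict: node -> set of successor nodes.

    Predicate nodes are strings; condition nodes are (define index, label)
    tuples, so the two uses of c4 stay separate nodes as drawn on the slide.
    """
    edges = defaultdict(set)
    for index, (guard, dests) in enumerate(defines):
        for dest, cond in dests:
            cond_node = (index, cond)
            edges[guard].add(cond_node)   # guard feeds the condition
            edges[cond_node].add(dest)    # condition feeds the destination
    return dict(edges)


def guards_of(edges, pred):
    """Predicates whose conditions feed directly into `pred`."""
    found = set()
    for node, children in edges.items():
        if isinstance(node, str):         # only predicate nodes can be guards
            for cond in children:
                if pred in edges.get(cond, ()):
                    found.add(node)
    return found


PHG = build_phg(DEFINES)

if __name__ == "__main__":
    for p in ("p4", "p6", "p7"):
        print(p, "is computed under", sorted(guards_of(PHG, p)))
```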


120

Purpose of PHG

The PHG allows the compiler to derive relations among the predicates. Mahlke identifies three predicate relations:

Ancestor: pi is an ancestor of pj if all conditions used to compute pj are derived from pi. The compiler can be sure that pj may be true only when pi is also true.

Control Path: There is a control path between pi and pj if there is at least one set of conditions under which both pi and pj are true. The compiler knows that pi and pj may be true at the same time.

Implies: pi implies pj if the conditions that make pi true guarantee that pj will also be true.


121

Imply Relationship

[Figure: the annotated listing from "Example of PHG Construction" next to its completed PHG.]

p7 implies p6: p7 is set by condition c4 under p4, and the same condition c4 under p4 also sets the OR-type predicate p6.


122

Ancestor Relationship

[Figure: the annotated listing from "Example of PHG Construction" next to its completed PHG.]

Which predicate nodes are ancestors of p5?

T, p4, and p5
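One way to check this answer mechanically is sketched below in Python. It rests on an interpretation that is not stated on the slides: pi is treated as an ancestor of pj when every way of computing pj goes through pi, which is the same idea as dominance in a DAG. The `GUARDS` table lists, for each predicate, the predicates it is defined under in the if-converted listing; the names `GUARDS` and `ancestors` are hypothetical, and, as on the slide, a predicate is counted among its own ancestors.

```python
# For each predicate: the guarding predicates of the defines that write it.
# Unguarded defines are treated as guarded by the root T.
GUARDS = {
    "p1": ["T"],
    "p2": ["p1"],
    "p3": ["p2"],
    "p4": ["T", "p1"],
    "p5": ["p4"],
    "p6": ["p4", "p5", "p8"],
    "p7": ["p4"],
    "p8": ["p5"],
}


def ancestors(pred, guards=GUARDS):
    """Predicates that lie on every path from T to `pred` (including itself)."""
    if pred == "T":
        return {"T"}
    # A node is a common ancestor only if it is an ancestor of every guard.
    common = set.intersection(*(ancestors(g, guards) for g in guards[pred]))
    return {pred} | common


if __name__ == "__main__":
    print(sorted(ancestors("p5")))
```

p1 is not reported as an ancestor of p5 because p4 can also be set directly under T by condition c1, without going through p1.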


123

Control Path Relationship

[Figure: the annotated listing from "Example of PHG Construction" next to its completed PHG.]

Which predicate nodes are in the same control path as p5?

T, p1, p4, p5, p6, p8
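Since the PHG represents Boolean equations, the control path and implies relations can be cross-checked by brute force. The Python sketch below is an illustration under stated assumptions: the equations are read off the annotated predicate defines, the six conditions c1..c6 are treated as independent Boolean inputs (the actual comparisons are not modelled), and the helper names `predicates`, `implies`, and `control_path` are hypothetical.

```python
from itertools import product


def predicates(c1, c2, c3, c4, c5, c6):
    """Evaluate every predicate for one assignment of the conditions."""
    p = {"T": True}
    p["p1"] = not c1                         # /c1 under T
    p["p2"] = p["p1"] and not c2             # /c2 under p1
    p["p3"] = p["p2"] and c3                 # c3 under p2
    p["p4"] = c1 or (p["p1"] and c2)         # OR type: c1 under T, c2 under p1
    p["p5"] = p["p4"] and not c4             # /c4 under p4
    p["p7"] = p["p4"] and c4                 # c4 under p4
    p["p8"] = p["p5"] and not c5             # /c5 under p5
    # OR type: c4 under p4, c5 under p5, c6 under p8
    p["p6"] = (p["p4"] and c4) or (p["p5"] and c5) or (p["p8"] and c6)
    return p


# All 64 assignments of the six conditions.
WORLDS = [predicates(*bits) for bits in product([False, True], repeat=6)]


def implies(a, b):
    """True if b holds in every assignment in which a holds."""
    return all(w[b] for w in WORLDS if w[a])


def control_path(a, b):
    """True if some assignment makes a and b true together."""
    return any(w[a] and w[b] for w in WORLDS)


if __name__ == "__main__":
    names = ["T"] + [f"p{i}" for i in range(1, 9)]
    print("control path with p5:", [n for n in names if control_path("p5", n)])
    print("p7 implies p6:", implies("p7", "p6"))
```

Enumerating assignments is only practical for small hyperblocks; it is used here as a check on the relations read from the graph, not as a description of how a compiler would derive them.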


124

Classical/ILP Optimizations in Predicated Code

Example: Copy Propagation

A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)

Is the copy propagation from instruction A to instruction C legal?

Depends on what we know about the relationship between p1, p2, and p3. If it is possible that p1 is false and p3 is true, the propagation would be wrong!



125


Example: Copy Propagation


For instance, if we know that:
(1) p1 is an ancestor of both p2 and p3, and
(2) p2 and p3 are mutually exclusive,
then we can do the copy propagation safely.

[Figure: PHG fragment with nodes p1, pk, cm, /cm, p2, p3.]


126


Example: Instruction Scheduling

A: ld_i r1, r2, r3 (p2)
B: add r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul r6, r1, r7 (p3)

What are the data dependencies in the code above? Depends on what we know about the relationship between p2 and p3.


127




[Figure: PHG fragment with nodes pk, cm, /cm, p2, p3.]

For instance, if we know that p2 and p3 are mutually exclusive, we have this DDG:

[Figure: DDG over instructions A, B, C, D.]


128




[Figure: PHG fragment with nodes pk, cm, cm, p2, p3.]

But if p2 implies p3, then we have this DDG:

[Figure: DDG over instructions A, B, C, D.]
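A small Python sketch of predicate-aware dependence analysis for the four instructions A to D. It is an illustration under assumptions, not the scheduler's algorithm: each predicate is modelled as the set of cases in which it is true, two instructions are assumed to need an edge only if their predicates can be true together, and a flow dependence is dropped when an intervening definition always executes with the reader. The names `Op`, `CODE`, and `dependences` are hypothetical, and the edge sets it prints are what this sketch derives, which may differ in detail from the slide's drawings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    dest: str
    srcs: tuple
    pred: str


CODE = [
    Op("A", "r1", ("r2", "r3"), "p2"),   # ld_i r1, r2, r3 (p2)
    Op("B", "r4", ("r1",), "p2"),        # add  r4, r1, 4  (p2)
    Op("C", "r1", ("r5",), "p3"),        # ld_i r1, r5, 0  (p3)
    Op("D", "r6", ("r1", "r7"), "p3"),   # mul  r6, r1, r7 (p3)
]


def dependences(code, true_in):
    """true_in maps each predicate to the set of cases in which it is true."""
    def may_coexist(a, b):
        return bool(true_in[a.pred] & true_in[b.pred])

    def always_runs_with(k, j):
        return true_in[j.pred] <= true_in[k.pred]

    edges = set()
    for i, a in enumerate(code):
        for j in range(i + 1, len(code)):
            b = code[j]
            if not may_coexist(a, b):
                continue                  # never both executed: no dependence
            if a.dest in b.srcs:
                killed = any(k.dest == a.dest and always_runs_with(k, b)
                             for k in code[i + 1:j])
                if not killed:
                    edges.add((a.name, b.name, "flow"))
            if b.dest in a.srcs:
                edges.add((a.name, b.name, "anti"))
            if a.dest == b.dest:
                edges.add((a.name, b.name, "output"))
    return edges


if __name__ == "__main__":
    # p2 and p3 defined from complementary conditions: mutually exclusive.
    print(sorted(dependences(CODE, {"p2": {"cm"}, "p3": {"/cm"}})))
    # p2 and p3 defined from the same condition: p2 implies p3.
    print(sorted(dependences(CODE, {"p2": {"cm"}, "p3": {"cm"}})))
```

In the mutually exclusive case only the two flow dependences A to B and C to D remain, so the two pairs can be scheduled independently even though both write r1.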


129

Predicate-Specific Optimizations

- Predicate Promotion
- Branch Combining
- Predicate Loop Peeling


130

Predicate Promotion

The idea is to speculate the execution of instructions by replacing their predicate with a less constrained predecessor predicate.

Because the ancestor predicate is computed with fewer conditions, the execution of the promoted instruction is speculative.

The advantage of predicate promotion is the reduction of the dependence chain in a hyperblock.


131

Conditions for Simple Predicate Promotion

The predicate of an instruction op(x) can be promoted to its predecessor predicate if all the following conditions are true:
1. op(x) is predicated
2. op(x) has a destination register
3. op(x) has a speculative version
4. there is a unique op(y) lexically before op(x) such that dest(y) = pred(x)
5. dest(x) is not live at op(y)
6. for any op(j) such that there is a path op(j)…op(y), dest(x) ≠ dest(j)
7. it is profitable to promote op(x)


132

Example of Predicate Promotion (qsort)

Before promotion:

 1 LA: ld_i r20, r24, r101
 2     ld_i r23, r2, r102
 3     pge p126(U), p127(U), r20, r23
 4 LB: ld_i r6, r123, 0 (p126)
 5     add r123, r123, 8 (p126)
 6     add r9, r9, 1 (p126)
 7     add r101, r101, 8 (p126)
 8 LC: ld_i r6, r124, 8 (p127)
 9     add r124, r124, 8 (p127)
10     add r124, r124, 8 (p127)
11     add r102, r102, 8 (p127)
12 LD: st_i r114, 0, r23
13     st_i r114, 4, r6
14     add r7, r7, 1
15     add r114, r114, 8
16     bge r9, r3, EXIT
17 LE: blt r8, r1, LA

After promotion:

 1 LA: ld_i r20, r24, r101
 2     ld_i r23, r2, r102
 3     pge p126(U), p127(U), r20, r23
 4 LB: ld_i r6, r123, 0
 5     add r123, r123, 8 (p126)
 6     add r9, r9, 1 (p126)
 7     add r101, r101, 8 (p126)
 8 LC: ld_i r60, r124, 8
8a     mov r6, r60 (p127)
 9     add r124, r124, 8 (p127)
10     add r124, r124, 8 (p127)
11     add r102, r102, 8 (p127)
12 LD: st_i r114, 0, r23
13     st_i r114, 4, r6
14     add r7, r7, 1
15     add r114, r114, 8
16     bge r9, r3, EXIT
17 LE: blt r8, r1, LA
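The promotion conditions can be sketched as a Python check and tried on a few instructions from the qsort listing. Several things here are assumptions rather than part of the slides: condition 6 is read as "no instruction between op(y) and op(x) writes dest(x)", liveness (condition 5) and profitability (condition 7) are supplied by the caller, the set `LIVE` is invented for the example, and the names `Op`, `can_promote`, and `QSORT` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Op:
    text: str
    dests: Tuple[str, ...] = ()        # registers or predicates written
    pred: Optional[str] = None         # guarding predicate, None if unguarded
    has_speculative_form: bool = True  # assumption: non-excepting version exists


def can_promote(ops, x, live_at, profitable=lambda op: True):
    """Check the seven conditions for promoting ops[x] (simple promotion)."""
    op = ops[x]
    if op.pred is None:                                   # 1. predicated
        return False
    if not op.dests:                                      # 2. has a destination
        return False
    if not op.has_speculative_form:                       # 3. can be speculated
        return False
    definers = [y for y in range(x) if op.pred in ops[y].dests]
    if len(definers) != 1:                                # 4. unique op(y)
        return False
    y = definers[0]
    if set(op.dests) & live_at(y):                        # 5. dest(x) not live
        return False
    for j in range(y + 1, x):                             # 6. (as interpreted)
        if set(op.dests) & set(ops[j].dests):
            return False
    return profitable(op)                                 # 7. profitability


QSORT = [
    Op("ld_i r20, r24, r101", ("r20",)),
    Op("ld_i r23, r2, r102", ("r23",)),
    Op("pge p126(U), p127(U), r20, r23", ("p126", "p127")),
    Op("ld_i r6, r123, 0", ("r6",), "p126"),
    Op("add r123, r123, 8", ("r123",), "p126"),
    Op("ld_i r6, r124, 8", ("r6",), "p127"),
    Op("st_i r114, 4, r6"),
]

# Assumed liveness at the predicate define: the pointers are live, r6 is not.
LIVE = {"r123", "r124"}


def live_at(y):
    return LIVE


if __name__ == "__main__":
    for i, op in enumerate(QSORT):
        print(op.text, "->", can_promote(QSORT, i, live_at))
```

Under these assumptions the first load of r6 is promotable, while the second is blocked by condition 6 because the first load writes r6 in between. That is consistent with the listing after promotion, where the second load is renamed to r60 and a predicated `mov r6, r60` is added.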


133

Branch Combining

Problem: too many infrequently executed branches in a hyperblock

A: bge r1, r5, EXIT1
   ld_c r3, r1, 0
   beq r3, 10, EXIT2
   beq r3, 0, EXIT3
   bge r2, r6, EXIT4
   st_c r2, 0, r3
   add r1, r1, 1
   add r2, r2, 1
   jmp A

Example: a loop in grep

[Figure annotations on the listing: 14, 4035, 0, 0.]


134

Branch Combining

Solution: replace a group of exit branches by a corresponding group of predicate define instructions.

All predicate definitions write into the same predicate register using the OR-type semantics.

The resultant predicate will be set to 1 if any of the exit branches were to be taken.

Because not exiting the hyperblock is the most common case, the predicate will usually be false.


135

Branch Combining

Original loop body:

 1 A: bge r1, r5, EXIT1
 2    ld_c r3, r1, -1
 3    beq r3, 10, EXIT2
 4    beq r3, 0, EXIT3
 5    bge r2, r6, EXIT4
 6    st_c r2, -1, r3
 7    bge r1, r7, EXIT5
 8    ld_c r4, r1, 0
 9    beq r4, 10, EXIT6
10    beq r4, 0, EXIT7
11    bge r2, r8, EXIT8
12    st_c r2, 0, r4
13    add r1, r1, 2
14    add r2, r2, 2
15    jmp A

After branch combining:

 0 A: pclr p1
 1    pge p1(OR), r1, r5
 2    ld_c r3, r1, -1
 3    peq p1(OR), r3, 10
 4    peq p1(OR), r3, 0
 5    pge p1(OR), r2, r6
 7    pge p1(OR), r1, r7
 8    ld_c r4, r1, 0
 9    peq p1(OR), r4, 10
10    peq p1(OR), r4, 0
11    pge p1(OR), r2, r8
16    jmp Decode (p1)
 6'   st_c r2, -1, r3
12    st_c r2, 0, r4
13    add r1, r1, 2
14    add r2, r2, 2
15    jmp A

Decode:
 1    bge r1, r5, EXIT1
 3    beq r3, 10, EXIT2
 4    beq r3, 0, EXIT3
 5    bge r2, r6, EXIT4
 6    st_c r2, -1, r3
 7    bge r1, r7, EXIT5
 9    beq r4, 10, EXIT6
10    beq r4, 0, EXIT7
11    jmp EXIT8
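The control-flow effect of branch combining can be illustrated with a short Python sketch. It models only which exit is taken: the exit conditions are passed in as already-evaluated Booleans, which glosses over the speculation of the loads and the movement of the stores, and the function names `run_sequential` and `run_combined` are hypothetical.

```python
from itertools import product


def run_sequential(exit_conditions):
    """Original code: test each exit branch in order, return the first taken."""
    for label, taken in exit_conditions:
        if taken:
            return label
    return None                        # no exit taken: fall through to 'jmp A'


def run_combined(exit_conditions):
    """Combined code: OR all exit conditions into p1, decode only if p1 is set."""
    p1 = False                         # pclr p1
    for _, taken in exit_conditions:   # one OR-type define per former branch
        p1 = p1 or taken
    if not p1:
        return None                    # common case: 'jmp Decode (p1)' not taken
    # Decode block: replay the original branches to find which exit applies.
    return run_sequential(exit_conditions)


if __name__ == "__main__":
    labels = [f"EXIT{i}" for i in range(1, 9)]
    # Both versions agree for every combination of the eight exit conditions.
    agree = all(
        run_sequential(list(zip(labels, bits)))
        == run_combined(list(zip(labels, bits)))
        for bits in product([False, True], repeat=len(labels))
    )
    print("same exit in all cases:", agree)
```

The point of the transformation is visible in `run_combined`: when no exit condition holds, the hyperblock executes a single untaken jump instead of eight untaken branches.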


136

Instruction Between Combined Branches

Instructions between combined branches are speculated.

For instructions that are between combined branches but cannot be speculated, the following must be done:

(1) move the instructions below the combined exit branch in the hyperblock.

(2) replicate these instructions in their original position with respect to the exit branches in the decode block.


137

Backend Compilation with Hyperblocks

[Figure: backend compilation flow. Boxes shown: Lcode generation; Classical Optim.; Hyperblock/Superblock Formation; ILP/Predicate-Specific Optimizations; Classical Optim.; Instruction Scheduling; Register Allocation. Supporting components: PHG, CFG Generator, Equation Solver. Labels: predicate relations, dataflow information, predicate aware.]
