Page 1: CMPUT680 - Winter 2006

CMPUT 329 - Computer Organization and Architecture II

1

CMPUT680 - Winter 2006

Topic I: Superblock and Hyperblock Formation

José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680

Page 2: CMPUT680 - Winter 2006


Instruction Level Parallelism Optimizations

The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor.

Superscalar or Very Long Instruction Word (VLIW) processors can reduce the execution time even when the number of instructions executed increases moderately, as long as the dependence height is reduced.

Page 3: CMPUT680 - Winter 2006


Speculative and Predicated Execution

Speculative Execution: execution of an instruction before knowing that its execution is required.

Predicated Execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.

Superblock: structure used to implement compiler-controlled speculative execution.

If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.

Page 4: CMPUT680 - Winter 2006


Trace Scheduling (Fisher, 1981)

Some optimization and scheduling decisions may decrease the execution time for one control path while increasing the execution time for another path.

Thus decisions should favor more frequently executed paths to improve overall performance.

Trace scheduling divides a procedure into a set of frequently executed traces (paths).
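
One way to picture trace selection is a greedy walk over profile counts. The sketch below is a simplification and not Fisher's full algorithm: the function name `select_trace`, the dictionary layout, and the profile numbers are all made up for illustration.

```python
def select_trace(succ_counts, seed):
    """Grow a trace from `seed` by repeatedly following the hottest outgoing edge.

    succ_counts maps a block to {successor: execution count of that edge}.
    Growth stops at a block with no successors or when the hottest successor
    is already in the trace (for example, a loop backedge).
    """
    trace = [seed]
    visited = {seed}
    block = seed
    while True:
        edges = succ_counts.get(block, {})
        if not edges:
            break
        hottest = max(edges, key=edges.get)
        if hottest in visited:
            break
        trace.append(hottest)
        visited.add(hottest)
        block = hottest
    return trace


# Hypothetical profile: A branches to B 90 times and to C 10 times, and so on.
profile = {
    "A": {"B": 90, "C": 10},
    "B": {"E": 90, "D": 0},
    "C": {"F": 10},
    "D": {"F": 0},
    "E": {"F": 90},
    "F": {"A": 99, "Z": 1},
}
print(select_trace(profile, "A"))
```

The blocks left out of the trace (C and D here) would seed further, colder traces.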

Page 5: CMPUT680 - Winter 2006


Trace Scheduling

There may be conditional branches from the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).

These control-flow transitions are ignored during trace scheduling.

After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.

Page 6: CMPUT680 - Winter 2006

Bookkeeping for Trace Scheduling

[Figure: the original trace (Instr 1, Instr 2, Instr 3, Instr 4, Instr 5) next to the scheduled trace (Instr 2, Instr 3, Instr 4, Instr 1, Instr 5).]

What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

Page 7: CMPUT680 - Winter 2006

Bookkeeping for Trace Scheduling

[Figure: the original trace (Instr 1, Instr 2, Instr 3, Instr 4, Instr 5) and the scheduled trace (Instr 2, Instr 3, Instr 4, Instr 1, Instr 5), with an added bookkeeping block containing Instr 3 and Instr 4.]

Page 8: CMPUT680 - Winter 2006

Bookkeeping for Trace Scheduling

[Figure: the original trace (Instr 1, Instr 2, Instr 3, Instr 4, Instr 5) next to the scheduled trace (Instr 1, Instr 5, Instr 2, Instr 3, Instr 4).]

What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

Page 9: CMPUT680 - Winter 2006

Bookkeeping for Trace Scheduling

[Figure: the original trace (Instr 1, Instr 2, Instr 3, Instr 4, Instr 5) and the scheduled trace (Instr 1, Instr 5, Instr 2, Instr 3, Instr 4), with an added bookkeeping block containing Instr 5.]

Page 10: CMPUT680 - Winter 2006


Superblocks

A superblock is a trace without side entrances, i.e., control can only enter from the top, but it can leave at one or more exit points.

The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored (thus these constraints do not inhibit optimizations that favor frequently executed paths).
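
The definition can be checked mechanically. The following is a minimal sketch, assuming the control-flow graph is given as a predecessor map; `side_entrances` and `is_superblock` are hypothetical helper names, not part of any compiler.

```python
def side_entrances(pred, trace):
    """Return the (source, target) edges that enter `trace` below its first block.

    pred maps a block to the list of its predecessors. The only edge allowed
    into a non-head trace block is the one from the block just before it.
    """
    found = []
    for pos in range(1, len(trace)):
        block = trace[pos]
        for source in pred.get(block, []):
            if source != trace[pos - 1]:
                found.append((source, block))
    return found


def is_superblock(pred, trace):
    """A trace is a superblock when it has no side entrances."""
    return not side_entrances(pred, trace)


# Hypothetical graph: F is reached from E (on the trace) and from C and D (off it).
pred = {"B": ["A"], "E": ["B"], "F": ["E", "C", "D"]}
print(side_entrances(pred, ["A", "B", "E", "F"]))
print(is_superblock(pred, ["A", "B", "E"]))
```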

Page 11: CMPUT680 - Winter 2006

Superblock Formation (Example)

[Figure: control-flow graph, shown twice on the slide, with entry Y and exit Z. Block execution counts: A 100, B 90, C 10, D 0, E 90, F 100. Edge counts as extracted: 1; 90, 10; 90, 0; 0, 90; 10, 99; 1.]

Page 12: CMPUT680 - Winter 2006

Superblock Formation (Example)

[Figure: the same control-flow graph, with entry Y, exit Z, block execution counts A 100, B 90, C 10, D 0, E 90, F 100, and edge counts 1; 90, 10; 90, 0; 0, 90; 10; 99; 1.]

Is this a superblock?

No, a superblock cannot have side entrances, and this set of nodes has two side entrances into node F. How do we convert it into a superblock?

Page 13: CMPUT680 - Winter 2006

Superblock Formation (Example)

Tail duplication is the duplication of basic blocks that appear after a side entrance, to eliminate side entrances and transform a trace into a superblock.

[Figure: the control-flow graph after tail duplication. Block F now has an execution count of 90, with outgoing edge counts 89.1 and 0.9. Its copy F' has an execution count of 10, an incoming edge count of 10, and outgoing edge counts 9.9 and 0.1.]
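
Tail duplication can be sketched on a successor map as follows. This is an illustrative model only: `tail_duplicate` is a hypothetical name, clones are named by appending a prime, and profile counts are ignored.

```python
def tail_duplicate(succ, trace):
    """Return a new successor map in which `trace` has no side entrances.

    succ maps a block to the list of its successors. Every trace block from
    the first side entrance down to the end of the trace is cloned, and every
    edge into those blocks other than the trace's own fall-through edge is
    redirected to the clone.
    """
    index = {block: i for i, block in enumerate(trace)}
    preds = {}
    for block, outs in succ.items():
        for target in outs:
            preds.setdefault(target, []).append(block)

    # Position of the first trace block (below the head) that has a side entrance.
    first = next(
        (i for i in range(1, len(trace))
         if any(p != trace[i - 1] for p in preds.get(trace[i], []))),
        None,
    )
    if first is None:                      # already a superblock
        return {b: list(outs) for b, outs in succ.items()}

    clone = {b: b + "'" for b in trace[first:]}
    new_succ = {}
    for block, outs in succ.items():
        new_succ[block] = [
            t if t not in clone or index.get(block, -2) + 1 == index[t] else clone[t]
            for t in outs
        ]
    for block, copy in clone.items():      # the copies chain to each other
        new_succ[copy] = [clone.get(t, t) for t in succ.get(block, [])]
    return new_succ


cfg = {"A": ["B", "C"], "B": ["E", "D"], "C": ["F"], "D": ["F"], "E": ["F"], "F": ["A", "Z"]}
result = tail_duplicate(cfg, ["A", "B", "E", "F"])
print(result)
```

After the call, C and D branch to the copy F', so the trace A, B, E, F is entered only at A.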

Page 14: CMPUT680 - Winter 2006

Common Subexpression Elimination in Superblocks

Original Code (edge counts shown: 1 and 1)

    opA: mul r1,r2,3
    opC: mul r3,r2,3
  separate block:
    opB: add r2,r2,199

Code After Superblock Formation (edge count shown: 1)

    opA: mul r1,r2,3
    opC: mul r3,r2,3
  separate block:
    opB: add r2,r2,199
    opC': mul r3,r2,3

Code After Common Subexpression Elimination (edge count shown: 1)

    opA: mul r1,r2,3
    opC: mov r3,r1
  separate block:
    opB: add r2,r2,199
    opC': mul r3,r2,3

Page 15: CMPUT680 - Winter 2006

Operation Migration in Superblocks

Original Code (blocks in the order extracted; X, Y, and Z label the exits)

    … mov r0,r1
    … mov r0,r2
    … mov r0,r3
    … add r1,r1,4
      add r2,r2,4
      add r3,r3,4

After Operation Migration (blocks in the order extracted; X, Y, and Z label the exits)

    … add r1,r1,4
      add r2,r2,4
      add r3,r3,4
    mov r0,r1
    mov r0,r2
    mov r0,r3

Page 16: CMPUT680 - Winter 2006

Global Variable Migration in Superblock Loops

Original Program Segment (edge count shown: 100)

    OpA: ld_I r4, x, r0
    OpB: add r4, r4, r1
    OpC: st_I x, r0, r4

    OpD: add r1, r1, 1

    OpE: add r0, r0, 1

[Animation, slides 16 to 27: the same segment is repeated on every slide while its execution is traced. Each frame shows four memory values, a marker MEM[r0+x] for the accessed location, and the registers r4, r1, and r0. Values as shown in each frame:]

    Slide   Memory values    r4   r1   r0
    16      0, 10, 20, 30    -    1    1
    17      0, 10, 20, 30    10   1    1
    18      0, 10, 20, 30    11   1    1
    19      0, 11, 20, 30    11   1    1
    20      0, 11, 20, 30    11   2    1
    21      0, 11, 20, 30    11   2    1
    22      0, 11, 20, 30    12   2    1
    23      0, 12, 20, 30    12   2    1
    24      0, 12, 20, 30    12   2    2
    25      0, 12, 20, 30    20   2    2
    26      0, 12, 20, 30    21   2    2
    27      0, 12, 21, 30    21   2    2

Page 28: CMPUT680 - Winter 2006

Global Variable Migration in Superblock Loops

Original Program Segment (edge counts shown: 100 and 0)

    OpA: ld_I r4, x, r0
    OpB: add r4, r4, r1
    OpC: st_I x, r0, r4

    OpD: add r1, r1, 1

    OpE: add r0, r0, 1

After Variable Migration (edge counts shown: 100 and 0; blocks in the order extracted)

    OpC: st_i x, r0, r4

    OpC': st_i x, r0, r4
    OpE: add r0, r0, 1

    OpA: ld_I r4, x, r0

    OpB: add r4, r4, r1

    OpD: add r1, r1, 1

Page 29: CMPUT680 - Winter 2006


Superblock Enlarging Optimizations

By enlarging a superblock, we can provide the scheduler with more independent instructions to choose from for each cycle.

Superblock enlarging optimizations:
  Branch target expansion
  Loop unrolling
  Loop peeling

Page 30: CMPUT680 - Winter 2006

Branch Target Expansion

Idea: expand the superblock with the target of a likely-taken branch.

[Figure: code before and after branch target expansion. Before: blt r1, r2, L3; beq r3, r4, L5; L1: jump L4; labels L2 and L3; edge counts 20 and 100. After: blt r1, r2, L3; beq r3, r4, L5; L1: jump L4; label L2; edge count 20.]

Page 31: CMPUT680 - Winter 2006


Superblock Loops

A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node.

We will study the extension of some common loop optimizations to superblocks.

Page 32: CMPUT680 - Winter 2006


Dependence Removing Optimizations

The goal is to eliminate data dependences between instructions within frequently executed superblocks.

Dependence removing optimizations include:
  Register renaming
  Accumulator variable expansion
  Induction variable expansion
  Search variable expansion
  Operation combining
  Strength reduction
  Tree height reduction

Page 33: CMPUT680 - Winter 2006


Instruction Latencies for Examples

    Function        Latency (cycles)
    Int ALU         1
    Int multiply    3
    Int divide      10
    Branch          1
    Memory load     2
    Memory store    1
    FP ALU          3
    FP conversion   3
    FP multiply     3
    FP divide       10

Page 34: CMPUT680 - Winter 2006


Register Renaming Example

For (j=0; j<n; j++) { C(j) = A(j)+B(j) }

Original Loop

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    blt r1, r5, L1 (f)

Assembly Code

For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.

Page 35: CMPUT680 - Winter 2006

Register Renaming Example

For (j=0; j<n; j++) { C(j) = A(j)+B(j) }

Original Loop

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    blt r1, r5, L1 (f)

Assembly Code

[Figure: code schedule, instructions a to f plotted against cycles. a and b occupy two cycles each, c three cycles, and d, e, and f one cycle each.]

7 cycles / 1 iteration

Page 36: CMPUT680 - Winter 2006

Register Renaming Example

Original Assembly Code

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    blt r1, r5, L1 (f)

After Loop Unrolling

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    ld_f f2, A, r1 (f)
    ld_f f3, B, r1 (g)
    add_f f4, f2, f3 (h)
    st_f C, r1, f4 (i)
    add r1, r1, 4 (j)
    ld_f f2, A, r1 (k)
    ld_f f3, B, r1 (l)
    add_f f4, f2, f3 (m)
    st_f C, r1, f4 (n)
    add r1, r1, 4 (o)
    blt r1, r5, L1 (p)

Page 37: CMPUT680 - Winter 2006

Loop Unrolling

After Loop Unrolling

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    ld_f f2, A, r1 (f)
    ld_f f3, B, r1 (g)
    add_f f4, f2, f3 (h)
    st_f C, r1, f4 (i)
    add r1, r1, 4 (j)
    ld_f f2, A, r1 (k)
    ld_f f3, B, r1 (l)
    add_f f4, f2, f3 (m)
    st_f C, r1, f4 (n)
    add r1, r1, 4 (o)
    blt r1, r5, L1 (p)

[Figure: code schedule, instructions a to p plotted against cycles 0 to 15 and beyond.]

19 cycles / 3 iterations = 6.3 cycles / iteration

Page 38: CMPUT680 - Winter 2006

Register Renaming

After Loop Unrolling

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    ld_f f2, A, r1 (f)
    ld_f f3, B, r1 (g)
    add_f f4, f2, f3 (h)
    st_f C, r1, f4 (i)
    add r1, r1, 4 (j)
    ld_f f2, A, r1 (k)
    ld_f f3, B, r1 (l)
    add_f f4, f2, f3 (m)
    st_f C, r1, f4 (n)
    add r1, r1, 4 (o)
    blt r1, r5, L1 (p)

After Register Renaming

L1: ld_f f21, A, r11 (a)
    ld_f f31, B, r11 (b)
    add_f f41, f21, f31 (c)
    st_f C, r11, f41 (d)
    add r12, r11, 4 (e)
    ld_f f22, A, r12 (f)
    ld_f f32, B, r12 (g)
    add_f f42, f22, f32 (h)
    st_f C, r12, f42 (i)
    add r13, r12, 4 (j)
    ld_f f23, A, r13 (k)
    ld_f f33, B, r13 (l)
    add_f f43, f23, f33 (m)
    st_f C, r13, f43 (n)
    add r11, r13, 4 (o)
    blt r11, r5, L1 (p)

Page 39: CMPUT680 - Winter 2006

Loop Unrolling and Register Renaming

After Register Renaming

L1: ld_f f21, A, r11 (a)
    ld_f f31, B, r11 (b)
    add_f f41, f21, f31 (c)
    st_f C, r11, f41 (d)
    add r12, r11, 4 (e)
    ld_f f22, A, r12 (f)
    ld_f f32, B, r12 (g)
    add_f f42, f22, f32 (h)
    st_f C, r12, f42 (i)
    add r13, r12, 4 (j)
    ld_f f23, A, r13 (k)
    ld_f f33, B, r13 (l)
    add_f f43, f23, f33 (m)
    st_f C, r13, f43 (n)
    add r11, r13, 4 (o)
    blt r11, r5, L1 (p)

[Figure: code schedule, instructions a to p plotted against cycles; the three iterations overlap.]

8 cycles / 3 iterations = 2.7 cycles / iteration
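
The three cycle counts in this example (7, 19, and 8) can be reproduced with a small model. This is a sketch under assumptions of my own, not the scheduler used for the slides: resources are unlimited, a consumer waits for the producer's full latency, an instruction that overwrites a register may issue in the same cycle as an earlier reader of it, and memory dependences are ignored. The latencies come from the table above; `schedule` and `body` are hypothetical names.

```python
LATENCY = {"ld_f": 2, "add_f": 3, "st_f": 1, "add": 1, "blt": 1}


def schedule(code):
    """Issue every instruction at the earliest cycle its register dependences allow.

    code is a list of (opcode, destination registers, source registers) in
    program order. Returns (issue cycle per instruction, total cycles).
    """
    issue = []
    for i, (op, dsts, srcs) in enumerate(code):
        start = 0
        for j in range(i):
            prev_op, prev_dsts, prev_srcs = code[j]
            done = issue[j] + LATENCY[prev_op]
            if set(prev_dsts) & set(srcs):    # flow dependence: wait for the value
                start = max(start, done)
            if set(prev_srcs) & set(dsts):    # anti dependence: not before the reader issues
                start = max(start, issue[j])
            if set(prev_dsts) & set(dsts):    # output dependence: later write finishes last
                start = max(start, done - LATENCY[op] + 1)
        issue.append(start)
    total = max(s + LATENCY[op] for s, (op, _, _) in zip(issue, code))
    return issue, total


def body(f2, f3, f4, r_in, r_out):
    """One iteration of C(j) = A(j) + B(j); r_out receives r_in + 4."""
    return [
        ("ld_f", [f2], [r_in]),
        ("ld_f", [f3], [r_in]),
        ("add_f", [f4], [f2, f3]),
        ("st_f", [], [r_in, f4]),
        ("add", [r_out], [r_in]),
    ]


def branch(reg):
    return [("blt", [], [reg, "r5"])]


original = body("f2", "f3", "f4", "r1", "r1") + branch("r1")
unrolled = body("f2", "f3", "f4", "r1", "r1") * 3 + branch("r1")
renamed = (body("f21", "f31", "f41", "r11", "r12")
           + body("f22", "f32", "f42", "r12", "r13")
           + body("f23", "f33", "f43", "r13", "r11")
           + branch("r11"))

for name, code in [("original", original), ("unrolled", unrolled), ("renamed", renamed)]:
    print(name, schedule(code)[1], "cycles")
```

Without renaming, each copy of the body reuses r1, f2, f3, and f4, so the anti and output dependences serialize the three iterations; renaming leaves only the short chain of address increments.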

Page 40: CMPUT680 - Winter 2006


Accumulator Variable Expansion

An accumulator variable accumulates a sum or product in each iteration of a loop.

Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators (k is the number of accumulation instructions). The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.
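
At source level, the transformation looks roughly like the sketch below. The function names and the choice of k = 3 are illustrative assumptions. Integers are used so that both versions agree exactly; with floating-point data the reordered additions can round differently.

```python
def dot_single(a, b):
    """Inner product with one accumulator: each addition waits for the previous one."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_expanded(a, b, k=3):
    """Unrolled by k with k temporary accumulators, summed at the loop exit."""
    partial = [0] * k
    n = len(a)
    main = n - n % k
    for i in range(0, main, k):
        for u in range(k):                 # the k updates are independent of each other
            partial[u] += a[i + u] * b[i + u]
    for i in range(main, n):               # leftover iterations when k does not divide n
        partial[0] += a[i] * b[i]
    return sum(partial)                    # combine the temporaries where the value is live


a = list(range(1, 11))
b = list(range(10, 0, -1))
print(dot_single(a, b), dot_expanded(a, b))
```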

Page 41: CMPUT680 - Winter 2006

Accumulator Expansion Example

For (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }

Original Loop

    ld_f f1, C, r2 (-)
L1: ld_f f3, A, r4 (a)
    ld_f f5, B, r6 (b)
    mul_f f7, f3, f5 (c)
    add_f f1, f1, f7 (d)
    add r4, r4, 4 (e)
    add r6, r6, r8 (f)
    blt r4, r9, L1 (g)
    st_f C, r2, f1 (-)

Assembly Code

For all examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.

Page 42: CMPUT680 - Winter 2006

Accumulator Expansion Example

For (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }

Original Loop

    ld_f f1, C, r2 (-)
L1: ld_f f3, A, r4 (a)
    ld_f f5, B, r6 (b)
    mul_f f7, f3, f5 (c)
    add_f f1, f1, f7 (d)
    add r4, r4, 4 (e)
    add r6, r6, r8 (f)
    blt r4, r9, L1 (g)
    st_f C, r2, f1 (-)

Assembly Code

[Figure: code schedule, instructions a to g plotted against cycles. a and b occupy two cycles each, c and d three cycles each, and e, f, and g one cycle each.]

8 cycles / 1 iteration

Page 43: CMPUT680 - Winter 2006

Loop Unrolling and Register Renaming

Assembly Code

    ld_f f1, C, r2 (-)
L1: ld_f f3, A, r4 (a)
    ld_f f5, B, r6 (b)
    mul_f f7, f3, f5 (c)
    add_f f1, f1, f7 (d)
    add r4, r4, 4 (e)
    add r6, r6, r8 (f)
    blt r4, r9, L1 (g)
    st_f C, r2, f1 (-)

After Unrolling and Renaming

    ld_f f1, C, r2 (-)
L1: ld_f f31, A, r41 (a)
    ld_f f51, B, r61 (b)
    mul_f f71, f31, f51 (c)
    add_f f1, f1, f71 (d)
    add r42, r41, 4 (e)
    add r62, r61, r8 (f)
    ld_f f32, A, r42 (g)
    ld_f f52, B, r62 (h)
    mul_f f72, f32, f52 (i)
    add_f f1, f1, f72 (j)
    add r43, r42, 4 (k)
    add r63, r62, r8 (l)
    ld_f f33, A, r43 (m)
    ld_f f53, B, r63 (n)
    mul_f f73, f33, f53 (o)
    add_f f1, f1, f73 (p)
    add r41, r43, 4 (q)
    add r61, r63, r8 (r)
    blt r41, r9, L1 (s)
    st_f C, r2, f1 (-)

Page 44: CMPUT680 - Winter 2006

Loop Unrolling and Register Renaming

    ld_f f1, C, r2 (-)
L1: ld_f f31, A, r41 (a)
    ld_f f51, B, r61 (b)
    mul_f f71, f31, f51 (c)
    add_f f1, f1, f71 (d)
    add r42, r41, 4 (e)
    add r62, r61, r8 (f)
    ld_f f32, A, r42 (g)
    ld_f f52, B, r62 (h)
    mul_f f72, f32, f52 (i)
    add_f f1, f1, f72 (j)
    add r43, r42, 4 (k)
    add r63, r62, r8 (l)
    ld_f f33, A, r43 (m)
    ld_f f53, B, r63 (n)
    mul_f f73, f33, f53 (o)
    add_f f1, f1, f73 (p)
    add r41, r43, 4 (q)
    add r61, r63, r8 (r)
    blt r41, r9, L1 (s)
    st_f C, r2, f1 (-)

[Figure: code schedule, instructions a to s plotted against cycles 0 to 15.]

14 cycles / 3 iterations = 4.7 cycles / iteration

Page 45: CMPUT680 - Winter 2006

Accumulator Expansion

    ld_f f11, C, r2 (-)
    mov_f f12, 0 (-)
    mov_f f13, 0 (-)
L1: ld_f f31, A, r41 (a)
    ld_f f51, B, r61 (b)
    mul_f f71, f31, f51 (c)
    add_f f11, f11, f71 (d)
    add r42, r41, 4 (e)
    add r62, r61, r8 (f)
    ld_f f32, A, r42 (g)
    ld_f f52, B, r62 (h)
    mul_f f72, f32, f52 (i)
    add_f f12, f12, f72 (j)
    add r43, r42, 4 (k)
    add r63, r62, r8 (l)
    ld_f f33, A, r43 (m)
    ld_f f53, B, r63 (n)
    mul_f f73, f33, f53 (o)
    add_f f13, f13, f73 (p)
    add r41, r43, 4 (q)
    add r61, r63, r8 (r)
    blt r41, r9, L1 (s)
    add_f f11, f11, f12 (-)
    add_f f11, f11, f13 (-)
    st_f C, r2, f11 (-)

[Figure: code schedule, instructions a to s plotted against cycles 0 to 15.]

10 cycles / 3 iterations = 3.3 cycles / iteration

Page 46: CMPUT680 - Winter 2006


Induction Variable Expansion

An induction variable is used to index through loop iterations and through regular data structures, such as arrays.

Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.
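
The idea can be sketched at source level by comparing the sequence of index values that each form produces. The function names, the unroll factor k = 3, and the sample numbers are assumptions made for the illustration.

```python
def indices_chained(start, stride, trips, k=3):
    """Unrolled loop with one induction variable: each update needs the previous one."""
    used = []
    j = start
    for _ in range(trips):
        for _ in range(k):
            used.append(j)
            j = j + stride
    return used


def indices_expanded(start, stride, trips, k=3):
    """One induction variable per unrolled copy, each stepped by k * stride."""
    j = [start + u * stride for u in range(k)]   # set up once, before the loop
    step = stride * k                            # also computed before the loop
    used = []
    for _ in range(trips):
        used.extend(j)                 # the k uses need no update in between
        j = [v + step for v in j]      # k updates that do not depend on each other
    return used


print(indices_chained(100, 8, 2))
print(indices_expanded(100, 8, 2))
```

Both functions visit the same indices in the same order; only the dependence chain between the updates changes.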

Page 47: CMPUT680 - Winter 2006

Induction Variable Expansion Example

For (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j + K }

Original Loop

L1: ld_f f3, A, r2 (a)
    ld_f f4, B, r2 (b)
    mul_f f5, f3, f4 (c)
    st_f C, r2, f5 (d)
    add r2, r2, r7 (e)
    add r1, r1, 1 (f)
    blt r1, r6, L1 (g)

Assembly Code

[Figure: code schedule, instructions a to g plotted against cycles. a and b occupy two cycles each, c three cycles, and d, e, f, and g one cycle each.]

6 cycles / 1 iteration

Page 48: CMPUT680 - Winter 2006

Loop Unrolling and Register Renaming

Assembly Code

L1: ld_f f3, A, r2 (a)
    ld_f f4, B, r2 (b)
    mul_f f5, f3, f4 (c)
    st_f C, r2, f5 (d)
    add r2, r2, r7 (e)
    add r1, r1, 1 (f)
    blt r1, r6, L1 (g)

After Unrolling and Renaming

L1: ld_f f31, A, r21 (a)
    ld_f f41, B, r21 (b)
    mul_f f51, f31, f41 (c)
    st_f C, r21, f51 (d)
    add r22, r21, r7 (e)
    ld_f f32, A, r22 (f)
    ld_f f42, B, r22 (g)
    mul_f f52, f32, f42 (h)
    st_f C, r22, f52 (i)
    add r23, r22, r7 (j)
    ld_f f33, A, r23 (k)
    ld_f f43, B, r23 (l)
    mul_f f53, f33, f43 (m)
    st_f C, r23, f53 (n)
    add r21, r23, r7 (o)
    add r1, r1, 3 (p)
    blt r1, r6, L1 (q)

Page 49: CMPUT680 - Winter 2006

Loop Unrolling and Register Renaming

After Unrolling and Renaming

L1: ld_f f31, A, r21 (a)
    ld_f f41, B, r21 (b)
    mul_f f51, f31, f41 (c)
    st_f C, r21, f51 (d)
    add r22, r21, r7 (e)
    ld_f f32, A, r22 (f)
    ld_f f42, B, r22 (g)
    mul_f f52, f32, f42 (h)
    st_f C, r22, f52 (i)
    add r23, r22, r7 (j)
    ld_f f33, A, r23 (k)
    ld_f f43, B, r23 (l)
    mul_f f53, f33, f43 (m)
    st_f C, r23, f53 (n)
    add r21, r23, r7 (o)
    add r1, r1, 3 (p)
    blt r1, r6, L1 (q)

[Figure: code schedule, instructions a to q plotted against cycles 0 to 15.]

8 cycles / 3 iterations = 2.7 cycles / iteration

Page 50: CMPUT680 - Winter 2006

Induction Variable Expansion

After Induction Variable Expansion

    mov r21, r2 (-)
    add r22, r21, r7 (-)
    add r23, r22, r7 (-)
    mul r71, r7, 3 (-)
L1: ld_f f31, A, r21 (a)
    ld_f f41, B, r21 (b)
    mul_f f51, f31, f41 (c)
    st_f C, r21, f51 (d)
    ld_f f32, A, r22 (f)
    ld_f f42, B, r22 (g)
    mul_f f52, f32, f42 (h)
    st_f C, r22, f52 (i)
    ld_f f33, A, r23 (k)
    ld_f f43, B, r23 (l)
    mul_f f53, f33, f43 (m)
    st_f C, r23, f53 (n)
    add r21, r21, r71 (e)
    add r22, r22, r71 (j)
    add r23, r23, r71 (o)
    add r1, r1, 3 (p)
    blt r1, r6, L1 (q)

[Figure: code schedule, instructions a to q plotted against cycles 0 to 15; the three iterations start together.]

6 cycles / 3 iterations = 2 cycles / iteration

Page 51: CMPUT680 - Winter 2006


Search Variable Expansion

A search variable is a single value (e.g., a minimum or a maximum) computed for a collection of data.

Search variable expansion eliminates dependences between definitions of search variables and their uses in unrolled loops.

Each search variable is expanded into k temporary independent variables. At the exit of the loop, the value of the original search variable is obtained by comparing the values of the temporary search variables.
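
A source-level sketch for a maximum search, with hypothetical names and k = 3 chosen only for illustration:

```python
def max_expanded(values, k=3):
    """Maximum search unrolled by k, with one temporary search variable per copy."""
    best = [float("-inf")] * k
    n = len(values)
    main = n - n % k
    for i in range(0, main, k):
        for u in range(k):                 # each copy compares against its own temporary
            if values[i + u] > best[u]:
                best[u] = values[i + u]
    for i in range(main, n):               # leftover iterations
        if values[i] > best[0]:
            best[0] = values[i]
    return max(best)                       # compare the temporaries at the loop exit


data = [3, 9, -2, 7, 11, 0, 5]
print(max_expanded(data))
```

In this sketch an empty input returns negative infinity, which is the initial value of every temporary.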

Page 52: CMPUT680 - Winter 2006


Superblock Scheduling

Superblock scheduling is a two-step process:

Step 1: Build the dependence graph.
Step 2: List scheduling using the dependence graph, instruction latencies, and resource constraints of the processor.

Page 53: CMPUT680 - Winter 2006


List Scheduling

List scheduling employs heuristics to choose, among all ready nodes, the combination of nodes that should be scheduled in the current cycle.

A node is ready if:
(i) all its parents in the dependence graph have been scheduled;
(ii) the result produced by each parent is available; and
(iii) the resources required by the node are available.
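
The readiness test above can be turned into a small list scheduler. This is a sketch under simplifying assumptions: the only resource constraint is an issue width (at most `width` nodes per cycle), every latency is at least one cycle, and the priority heuristic is the longest latency-weighted path to a leaf. All names are hypothetical.

```python
def list_schedule(latency, preds, width):
    """Return {node: issue cycle} for a dependence DAG.

    latency: {node: cycles until the node's result is available}
    preds:   {node: list of parent nodes in the dependence graph}
    width:   maximum number of nodes issued per cycle
    """
    succs = {n: [] for n in latency}
    for node, parents in preds.items():
        for p in parents:
            succs[p].append(node)

    height = {}

    def critical(node):
        """Longest latency-weighted path from `node` to a leaf."""
        if node not in height:
            height[node] = latency[node] + max((critical(s) for s in succs[node]), default=0)
        return height[node]

    issue = {}
    cycle = 0
    while len(issue) < len(latency):
        ready = [
            n for n in latency
            if n not in issue
            and all(p in issue and issue[p] + latency[p] <= cycle   # conditions (i) and (ii)
                    for p in preds.get(n, []))
        ]
        ready.sort(key=lambda n: (-critical(n), n))
        for n in ready[:width]:                                     # condition (iii)
            issue[n] = cycle
        cycle += 1
    return issue


# Two loads feeding an FP add feeding a store, with the latencies used in the examples.
lat = {"a": 2, "b": 2, "c": 3, "d": 1}
deps = {"c": ["a", "b"], "d": ["c"]}
print(list_schedule(lat, deps, width=2))
print(list_schedule(lat, deps, width=1))
```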

Page 54: CMPUT680 - Winter 2006

Speculative Execution in Superblocks

To produce an efficient schedule, the compiler must be able to move instructions above and below branches.

LIVE-OUT(BR) is the set of variables that may be used before being redefined when the branch BR is taken.

In the example, LIVE-OUT(S) is the set of variables that are live at point P.

[Figure: superblock SB1 contains R: x ← y+z followed later by the branch S: bnz r1; the taken path of S leads to point P in basic block B2.]

Page 55: CMPUT680 - Winter 2006

Speculative Execution in Superblocks

If we want to move instruction R below the branch instruction S, two situations might occur:

1) x ∈ LIVE-OUT(S)
2) x ∉ LIVE-OUT(S)

What is the code that the compiler should produce for each situation?

[Figure: superblock SB1 contains R: x ← y+z followed later by the branch S: bnz r1; the taken path of S leads to point P in basic block B2.]

Page 56: CMPUT680 - Winter 2006

Speculative Execution in Superblocks

If we want to move instruction R below the branch instruction S, two situations might occur:

1) x ∈ LIVE-OUT(S): insert a copy of instruction R in the branch target.
2) x ∉ LIVE-OUT(S): no compensation code is required.

[Figure: superblock SB1 contains R: x ← y+z followed later by the branch S: bnz r1; the taken path of S leads to point P in basic block B2.]

Page 57: CMPUT680 - Winter 2006

Speculative Execution in Superblocks

1) x ∈ LIVE-OUT(S): must introduce R' in basic block B2.

[Figure: in SB1, R: x ← y+z now follows S: bnz r1; a copy R': x ← y+z is placed at point P in basic block B2.]

2) x ∉ LIVE-OUT(S): no compensation code is required.

[Figure: in SB1, R: x ← y+z now follows S: bnz r1; basic block B2 is unchanged.]
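
The two cases can be decided with a liveness scan of the branch target. The sketch below is simplified: the target is modeled as a single straight-line block, instructions are (destination, sources) pairs, and the helper names are hypothetical.

```python
def live_in(block, live_at_end=frozenset()):
    """Variables that may be used before being redefined on entry to `block`.

    block is a list of (destination, list of sources) in program order.
    """
    live = set(live_at_end)
    for dest, sources in reversed(block):
        live.discard(dest)        # a definition ends the liveness of the old value
        live.update(sources)      # a use makes the variable live above this point
    return live


def needs_compensation(dest, target_block, live_at_end=frozenset()):
    """True when sinking an instruction that defines `dest` below the branch
    requires a copy of it in the branch target (case 1)."""
    return dest in live_in(target_block, live_at_end)


# R: x <- y + z is moved below S. Two hypothetical versions of block B2:
b2_uses_x = [("t", ["x", "w"])]                    # x is read before any redefinition
b2_redefines_x = [("x", ["w"]), ("t", ["x"])]      # x is overwritten first
print(needs_compensation("x", b2_uses_x))
print(needs_compensation("x", b2_redefines_x))
```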

Page 58: CMPUT680 - Winter 2006


Speculative Execution in Superblocks

Upward code motion is more common, as it reduces the critical path of a superblock (e.g., moving a load instruction upward to hide the load latency).

There are two major restrictions on moving an instruction J from below to above a branch BR:
Restriction 1: The destination of J is not in LIVE-OUT(BR).
Restriction 2: J will never cause an exception that may terminate program execution when BR is taken.

Page 59: CMPUT680 - Winter 2006


Speculative Execution in Superblocks

Restriction 1 is usually removed by register renaming. By renaming the destination register of instruction J, we ensure that it is not in LIVE-OUT(BR).

There are two extreme interpretations of restriction 2.

Restricted Speculation Model: fully enforce restriction 2.

Therefore only instructions that cannot cause exceptions are candidates for speculative execution (e.g., memory loads, memory stores, integer divides, and all floating-point instructions cannot be speculated).

Page 60: CMPUT680 - Winter 2006


Speculative Execution in Superblocks

General Speculation Model: completely ignore restriction 2.

Requires that the processor provide non-excepting or silent versions of all potentially excepting instructions in the instruction set architecture. If an exception occurs for a silent instruction, it is simply ignored, and garbage is written in the destination.
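
The two restrictions and the two models can be summarized as a legality check. This sketch covers only restrictions 1 and 2 as stated here; the opcode names in `MAY_EXCEPT` are hypothetical classes, and other concerns such as the memory side effects of a hoisted store are outside its scope.

```python
# Hypothetical opcode classes that may raise an exception.
MAY_EXCEPT = {"ld", "st", "div", "fadd", "fmul", "fdiv"}


def can_hoist(opcode, dest, live_out, model):
    """May an instruction be moved from below to above a branch BR?

    dest:     destination register of the instruction (None if it has none)
    live_out: LIVE-OUT(BR)
    model:    "restricted" or "general"
    """
    if dest in live_out:                 # restriction 1; renaming dest would lift it
        return False
    if model == "restricted":            # restriction 2 fully enforced
        return opcode not in MAY_EXCEPT
    if model == "general":               # restriction 2 ignored; needs silent instructions
        return True
    raise ValueError("unknown speculation model: " + model)


print(can_hoist("ld", "r4", {"r3"}, "restricted"))
print(can_hoist("ld", "r4", {"r3"}, "general"))
```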

Page 61: CMPUT680 - Winter 2006

Example for Speculative Execution

C code segment

avg = 0;
weight = 0;
count = 0;
while(prt != NULL) {
    count = count + 1;
    if(prt->wt > 0) weight = weight - prt->wt;
    else weight = weight + prt->wt;
    prt = prt->next;
}
if(count != 0) avg = weight/count;

Assembly code segment

(i1)      ld_i r1, prt, 0
(i2)      mov r7, 0          // avg
(i3)      mov r2, 0          // count
(i4)      mov r3, 0          // weight
(i5)      beq r1, 0, L3
(i6)  L0: add r2, r2, 1
(i7)      ld_i r4, r1, 0     // prt->wt
(i8)      bge r4, 0, L1
(i9)      sub r3, r3, r4
(i10)     jmp L2
(i11) L1: add r3, r3, r4
(i12) L2: ld_i r1, r1, 4
(i13)     bne r1, 0, L0
(i14) L3: beq r2, 0, L4
(i15)     div r7, r3, r2
(i16)     st_i avg, 0, r7
(i17) L4:

Page 62: CMPUT680 - Winter 2006

Example for Speculative Execution

Assembly code segment

(i1)      ld_i r1, prt, 0
(i2)      mov r7, 0          // avg
(i3)      mov r2, 0          // count
(i4)      mov r3, 0          // weight
(i5)      beq r1, 0, L3
(i6)  L0: add r2, r2, 1
(i7)      ld_i r4, r1, 0     // prt->wt
(i8)      bge r4, 0, L1
(i9)      sub r3, r3, r4
(i10)     jmp L2
(i11) L1: add r3, r3, r4
(i12) L2: ld_i r1, r1, 4
(i13)     bne r1, 0, L0
(i14) L3: beq r2, 0, L4
(i15)     div r7, r3, r2
(i16)     st_i avg, 0, r7
(i17) L4:

Trace Selection for the Loop

[Figure: control-flow graph of the loop. BB2 contains i6, i7, i8; BB3 contains i9, i10; BB4 contains i11; BB5 contains i12, i13. Edge counts as extracted: 10, 10, 90, 90, 99, 1, 1.]

Page 63: CMPUT680 - Winter 2006

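Example for Speculative Execution

Trace Selection for the Loop

[Figure: control-flow graph of the loop. BB2 contains i6, i7, i8; BB3 contains i9, i10; BB4 contains i11; BB5 contains i12, i13. Edge counts as extracted: 10, 10, 90, 90, 99, 1, 1.]

After superblock formation and branch target expansion

[Figure: superblock SB1 consists of BB2 (i6, i7, i8), BB4 (i11), and BB5 (i12, i13); superblock SB2 consists of BB3' (i9, i12', i13'). Edge counts as extracted: 10, 90, 90, 99(1/10), 1(9/10), 1, 1(1/10), 99(1/10).]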

Page 64: CMPUT680 - Winter 2006

Example for Speculative Execution

After superblock formation and branch target expansion

[Figure: superblock SB1 consists of BB2 (i6, i7, i8), BB4 (i11), and BB5 (i12, i13); superblock SB2 consists of BB3' (i9, i12', i13'). Edge counts as extracted: 10, 90, 90, 99(1/10), 1(9/10), 1, 1(1/10), 99(1/10).]

Assembly code segment

          ld_i r1, prt, 0
          mov r7, 0          // avg
          mov r2, 0          // count
          mov r3, 0          // weight
          beq r1, 0, L3
(i6)  L0: add r2, r2, 1
(i7)      ld_i r4, r1, 0     // prt->wt
(i8)      bge r4, 0, LA
(i11)     add r3, r3, r4
(i12)     ld_i r1, r1, 4     // prt->next
(i13)     bne r1, 0, L0
(i9)  LA: sub r3, r3, r4
(i12')    ld_i r1, r1, 4     // prt->next
(i13')    bne r1, 0, L0
(i14) L3: beq r2, 0, L4
(i15)     div r7, r3, r2
(i16)     st_i avg, 0, r7
(i17) L4:

Page 65: CMPUT680 - Winter 2006

Example for Speculative Execution

           ld_i r1, prt, 0
           mov r7, 0          // avg
           mov r2, 0          // count
           mov r3, 0          // weight
           beq r1, 0, L3

(I1)   L0: add r2, r2, 1
(I2)       ld_i r4, r1, 0     // prt->wt
(I3)       blt r4, 0, L1

(I4)       add r3, r3, r4
(I5)       ld_i r5, r1, 4     // prt->next
(I6)       beq r5, 0, L3

(I7)       add r2, r2, 1
(I8)       ld_i r6, r5, 0     // prt->wt
(I9)       blt r6, 0, L1'

(I10)      add r3, r3, r6
(I11)      ld_i r1, r5, 4     // prt->next
(I12)      bne r1, 0, L0

       L3: beq r2, 0, L4
           div r7, r3, r2
           st_I avg, 0, r7

       L4:

      L1': mov r1, r5
           mov r4, r6

       L1: sub r32, r3, r4
           ld_i r1, r1, 4
           bne r1, 0, L0

Page 66: CMPUT680 - Winter 2006

Example for Speculative Execution

           ld_i r1, prt, 0
           mov r7, 0          // avg
           mov r2, 0          // count
           mov r3, 0          // weight
           beq r1, 0, L3

(I1)   L0: add r2, r2, 1
(I2)       ld_i r4, r1, 0     // prt->wt
(I3)       blt r4, 0, L1

(I4)       add r3, r3, r4
(I5)       ld_i r5, r1, 4     // prt->next
(I6)       beq r5, 0, L3

(I7)       add r2, r2, 1
(I8)       ld_i r6, r5, 0     // prt->wt
(I9)       blt r6, 0, L1'

(I10)      add r3, r3, r6
(I11)      ld_i r1, r5, 4     // prt->next
(I12)      bne r1, 0, L0

           div r7, r3, r2
           st_I avg, 0, r7

       L4:

      L1': mov r1, r5
           mov r4, r6

       L1: sub r32, r3, r4
           ld_i r1, r1, 4
           bne r1, 0, L0

       L3: beq r2, 0, L4

Page 67: CMPUT680 - Winter 2006


Hyperblocks

Suggested Reading

Scott A. Mahlke’s Ph.D. Thesis, chap. 7.

Page 68: CMPUT680 - Winter 2006


Hyperblock

A hyperblock is a collection of connected basic blocks in which control may only enter through the first block (entry block).

Control flow may leave from any number of blocks in the hyperblock.

Before scheduling, all control flow between basic blocks within a hyperblock is removed via if-conversion.
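
The effect of if-conversion can be sketched by writing the same computation twice, once with a branch and once as a straight line of predicated operations. The example computation, the predicate names, and the tiny interpreter loop are all made up for illustration; real predication is done by the hardware, which nullifies operations whose predicate is false.

```python
def run_branching(weight, wt):
    """Original form: control flow selects which operation executes."""
    if wt < 0:
        weight = weight - wt
    else:
        weight = weight + wt
    return weight


def run_predicated(weight, wt):
    """If-converted form: no branch, every operation is guarded by a predicate."""
    p_neg = wt < 0             # predicate-defining comparison replaces the branch
    p_pos = not p_neg          # complementary predicate
    ops = [
        (p_neg, lambda w: w - wt),   # sub weight, weight, wt  (if p_neg)
        (p_pos, lambda w: w + wt),   # add weight, weight, wt  (if p_pos)
    ]
    for predicate, op in ops:        # all operations are fetched in sequence;
        if predicate:                # only those with a true predicate take effect
            weight = op(weight)
    return weight


for wt in (-3, 0, 5):
    print(wt, run_branching(10, wt), run_predicated(10, wt))
```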

Page 69: CMPUT680 - Winter 2006


Hyperblock Formation

A five-step procedure is used to form hyperblocks:

1. region identification

2. loop backedge coalescing

3. block selection

4. tail duplication

5. if-conversion

Page 70: CMPUT680 - Winter 2006


Running Example: wc

Mahlke uses the inner loop of wc, the Linux program that counts the number of characters, words, and lines in a file, as a running example.

Page 71: CMPUT680 - Winter 2006

The source code

    linect = wordct = charct = token = 0;
    for ( ; ; ) {
A:      if (--(fp)->cnt < 0)
C:          c = filbuf(fp);
        else
B:          c = *(fp)->ptr++;
D:      if (c == EOF) break;
E:      charct++;
        if ((' ' < c) &&
F:          (c < 0177)) {
H:          if (!token) {
K:              wordct++; token++;
            }
            continue;
        }
G:      if (c == '\n')
I:          linect++;
J:      else if ((c != ' ') &&
L:               (c != '\t')) continue;
M:      token = 0;
    }


The Assembly Code

LA: ld_i r98, r3, 0
    add r27, r98, -1
    st_i r3, 0, r27
    blt r98, 1, LC
LB: ld_i r30, r3, 4
    add r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq r4, -1, EXIT
LE: ld_i r33, r73, 0
    add r32, r33, 1
    st_i r73, 0, r32
    bge 32, r4, LG
LF: bge r4, 127, LG
LH: bne 0, r2, LA
LK: ld_i r36, r72, 0
    add r35, r36, 1
    st_i r72, 0, r35
    add r2, r2, 1
    jmp LA
LG: beq r4, r10, LI
LJ: bne r4, 32, LL
LM: mov r2, 0
    jmp LA
LI: ld_i r39, r71, 0
    add r38, r39, 1
    st_i r71, 0, r38
    jmp LM
LL: bne r4, 9, LA
    jmp LM
LC: mov Parm0, r3
    jsr filbuf
    mov r4, Ret0
    jmp LD


Control Flow Graph

[Figure: control flow graph of the wc inner loop. Nodes are the basic blocks A through M plus EXIT; edges are annotated with profiled execution counts ranging up to 105K.]


Statistics of the Example

wc is formed by small basic blocks with a large percentage of branches.

It contains 13 basic blocks and 34 instructions, 14 of which are branches:

- 8 conditional
- 5 unconditional
- 1 subroutine call


Step 1: Region Identification

A region is a group of basic blocks with a single entry block that dominates all the blocks in the region.

Regions are used because they provide easy-to-compute outer boundaries for hyperblocks.

A basic block can only reside in a single region.

A second constraint imposed on region formation is that regions may not contain internal cycles (this constraint is relaxed later).

In wc, the entire control flow graph forms a region.


Step 2: Backedge Coalescing

If-conversion can only remove non-loop branches.

Thus we need to coalesce all backedges into a single backedge. This allows the control logic that chooses which backedge is taken to be eliminated by if-conversion.

To coalesce the backedges, we introduce a new node that will be the origin of the new single backedge. Then we retarget all existing backedges to this new node.
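The retargeting step can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding of the CFG, the function name `coalesce_backedges`, and the assumption that every edge into the header inside the dictionary is a backedge are choices made for this sketch. The edges themselves are read off the wc assembly listing.

```python
def coalesce_backedges(cfg, header, new_node="N"):
    """Return a copy of cfg in which every edge into `header` is retargeted
    to `new_node`, and `new_node` carries the single backedge to `header`.

    Assumes every edge into `header` that appears in cfg is a backedge
    (edges from outside the loop are not part of the dictionary).
    """
    result = {}
    for block, successors in cfg.items():
        result[block] = [new_node if s == header else s for s in successors]
    result[new_node] = [header]
    return result


# Successor lists of the wc inner loop, derived from its branch targets.
WC_CFG = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["EXIT", "E"],
    "E": ["F", "G"],
    "F": ["H", "G"],
    "G": ["I", "J"],
    "H": ["A", "K"],
    "I": ["M"],
    "J": ["M", "L"],
    "K": ["A"],
    "L": ["A", "M"],
    "M": ["A"],
    "EXIT": [],
}

coalesced = coalesce_backedges(WC_CFG, "A")
sources_of_backedge = [b for b, succs in coalesced.items() if "A" in succs]
print(sources_of_backedge)
```

After the call, H, K, L, and M all branch to the new node N, and N is the only block that still branches to A.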


CFG Before Backedge Coalescing

[Figure: CFG of the wc inner loop before backedge coalescing; blocks H, K, L, and M each branch back to the loop header A. Edges are annotated with profiled execution counts.]


CFG After Backedge Coalescing

[Figure: CFG of the wc inner loop after backedge coalescing; a new block N receives the former backedges and is the origin of the single backedge to A, annotated 105K.]


Step 3: Block Selection

Two conflicting goals:

(1) More blocks can potentially improve performance by eliminating branches among the blocks included.

(2) Too many blocks may result in performance loss due to over-saturation of processor resources or increased dependence height.


Enumerating Execution Paths

An execution path is a path of control flow from the entry block to an exit block in the region.

Mahlke assigns a priority to each execution path. This priority indicates the path's relative importance.

Paths are included in the hyperblock from the highest to the lowest priority, based on the available resources.

Mahlke also estimates the available resources and the resource use of each path.


Path Priority Function

The path priority function combines four elements:

(1) path execution frequency;
(2) number of instructions in the path;
(3) path dependence height;
(4) hazard conditions on the path.

Intuition: include paths with fewer instructions, with lower dependence height, that have few hazard conditions, and that are executed very often.

Hazard conditions include procedure calls and unresolvable memory stores.


Path Priority Function

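\[
\mathit{dep\_ratio}_i = 1.0 - \left( \frac{\mathit{dep\_height}_i}{\max_{1 \le j \le N} \mathit{dep\_height}_j} \right)
\]

\[
\mathit{op\_ratio}_i = 1.0 - \left( \frac{\mathit{num\_ops}_i}{\max_{1 \le j \le N} \mathit{num\_ops}_j} \right)
\]

\[
\mathit{priority}_i = \mathit{probability}_i \times \mathit{hazard}_i \times \left( \mathit{dep\_ratio}_i + \mathit{op\_ratio}_i + K \right)
\]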

Mahlke uses a hazard multiplier of 0.25 for all paths containing a subroutine call or an unresolvable memory reference, and 1.0 for all other paths.


The constant K makes the path with the largest dependence height and the most operations have a non-zero priority. Mahlke used K = 0.1.
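As a rough illustration of the priority function, the sketch below evaluates it for three made-up paths. The `Path` record, the function name `path_priorities`, and all the numbers in the example are hypothetical; only the formula, the hazard multiplier of 0.25, and K = 0.1 come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    probability: float   # profiled execution frequency of the path
    num_ops: int         # number of instructions on the path
    dep_height: int      # dependence height of the path
    has_hazard: bool = False  # subroutine call or unresolvable memory reference


def path_priorities(paths, k=0.1, hazard_multiplier=0.25):
    """Compute priority_i = probability_i * hazard_i * (dep_ratio_i + op_ratio_i + K)."""
    max_height = max(p.dep_height for p in paths)
    max_ops = max(p.num_ops for p in paths)
    priorities = {}
    for p in paths:
        dep_ratio = 1.0 - p.dep_height / max_height
        op_ratio = 1.0 - p.num_ops / max_ops
        hazard = hazard_multiplier if p.has_hazard else 1.0
        priorities[p.name] = p.probability * hazard * (dep_ratio + op_ratio + k)
    return priorities


# Hypothetical paths, chosen only to exercise the formula.
example = [
    Path("hot", probability=0.7, num_ops=10, dep_height=5),
    Path("cold", probability=0.2, num_ops=20, dep_height=10),
    Path("call", probability=0.1, num_ops=15, dep_height=8, has_hazard=True),
]
priorities = path_priorities(example)
for name, value in priorities.items():
    print(name, round(value, 5))
```

The "cold" path is both the longest and the tallest, so both of its ratios are zero; K alone keeps its priority above zero.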


Block Selection Algorithm

ISSUE_WIDTH = 1 to 8   /* as specified in the machine description file */
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10

block_selection(region) {
    enumerate all paths in the region
    calculate priority of each path
    sort paths from highest to lowest priority

    /* Initialization of loop variables */
    avail_resources = ISSUE_WIDTH * dep_height_1 * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_paths = ∅

    for (i = 1 to num_paths) {
        /* Check if there are enough resources available to include the path */
        if ((num_ops_i + used_resources) > avail_resources) {
            continue
        }
        /* Prevent paths with large relative dependence heights from being included */
        if (dep_height_i > (dep_height_1 * MAX_DEP_GROWTH)) {
            continue
        }
        /* Do not include paths with a small relative priority to that of the last included path */
        if (priority_i < (last_priority * MIN_PATH_PRIORITY_RATIO)) {
            continue
        }
        /* Include the path in the hyperblock */
        selected_paths = selected_paths ∪ path_i
        used_resources = used_resources + num_ops_i
        last_priority = priority_i
    }
    selected_blocks = all blocks contained within selected_paths
    return selected_blocks
}
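A direct Python transcription of the pseudocode may make the three filters easier to follow. It is a sketch under assumptions: each path is a plain dictionary with hypothetical keys `blocks`, `priority`, `num_ops`, and `dep_height`, the priorities are taken as given, and the five example paths are invented.

```python
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10


def block_selection(paths, issue_width):
    """Select blocks following the block selection pseudocode.

    `paths` is a list of dicts with keys "blocks", "priority",
    "num_ops", and "dep_height" (names chosen for this sketch).
    """
    ordered = sorted(paths, key=lambda p: p["priority"], reverse=True)
    first = ordered[0]  # highest-priority path sets the resource budget
    avail_resources = issue_width * first["dep_height"] * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_blocks = set()
    for path in ordered:
        # Not enough resources left for this path.
        if path["num_ops"] + used_resources > avail_resources:
            continue
        # Dependence height too large relative to the best path.
        if path["dep_height"] > first["dep_height"] * MAX_DEP_GROWTH:
            continue
        # Priority too small relative to the last included path.
        if path["priority"] < last_priority * MIN_PATH_PRIORITY_RATIO:
            continue
        selected_blocks.update(path["blocks"])
        used_resources += path["num_ops"]
        last_priority = path["priority"]
    return selected_blocks


# Hypothetical paths, not taken from the wc profile.
candidate_paths = [
    {"blocks": ["A", "B", "D"], "priority": 0.80, "num_ops": 8, "dep_height": 4},
    {"blocks": ["A", "B", "H"], "priority": 0.50, "num_ops": 6, "dep_height": 4},
    {"blocks": ["A", "B", "E"], "priority": 0.30, "num_ops": 10, "dep_height": 6},
    {"blocks": ["A", "G"], "priority": 0.25, "num_ops": 5, "dep_height": 20},
    {"blocks": ["A", "C", "F"], "priority": 0.02, "num_ops": 6, "dep_height": 5},
]
selected = block_selection(candidate_paths, issue_width=2)
print(sorted(selected))
```

With an issue width of 2 the budget is 2 * 4 * 2 = 16 operations, so the first two paths fit (14 operations) and every later path is rejected by the resource check. Like the pseudocode, the sketch charges the full operation count of each path even when some of its blocks are already selected.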


Block Selection

[Figure: CFG of the wc inner loop after backedge coalescing, shown next to the list of enumerated execution paths.]

 1. A-B-D-E-F-H-N
 2. A-B-D-E-F-H-K-N
 3. A-B-D-E-G-J-M-N
 4. A-B-D-E-G-J-L-M-N
 5. A-B-D-E-G-I-M-N
 6. A-B-D-E-G-J-L-N
 7. A-B-D
 8. A-C-D-E-F-H-N
 9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-C-D-E-F-G-I-M-N
20. A-C-D-E-F-G-J-M-N
21. A-C-D-E-F-G-J-L-M-N
22. A-C-D-E-F-G-J-L-N
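The 22 paths can be reproduced with a small depth-first enumeration. The sketch below assumes the successor lists derived from the wc assembly after backedge coalescing, treats N and EXIT as exit points, and uses a hypothetical helper name `enumerate_paths`. Paths that leave through D end in an explicit EXIT node here, whereas the list above writes them as A-B-D and A-C-D.

```python
def enumerate_paths(cfg, entry, exits):
    """Enumerate all acyclic paths from `entry` to any block in `exits`."""
    paths = []

    def walk(block, prefix):
        prefix = prefix + [block]
        if block in exits:
            paths.append(prefix)
            return
        for successor in cfg[block]:
            walk(successor, prefix)

    walk(entry, [])
    return paths


# Region after backedge coalescing: former backedges now target N.
REGION = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["EXIT", "E"],
    "E": ["F", "G"],
    "F": ["H", "G"],
    "G": ["I", "J"],
    "H": ["N", "K"],
    "I": ["M"],
    "J": ["M", "L"],
    "K": ["N"],
    "L": ["N", "M"],
    "M": ["N"],
}

paths = enumerate_paths(REGION, "A", exits={"N", "EXIT"})
for number, path in enumerate(paths, start=1):
    print(number, "-".join(path))
```

The enumeration order differs from the numbering on the slide, but the set of paths is the same.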



Path Selection

Some paths that are not selected by the block selection algorithm are also included in the hyperblock because all their blocks belong to selected paths.

An alternative procedure could have eliminated these paths from the path set before the selection.

But the cost of such elimination would be higher than maintaining these extra paths in the set.



Step 4: Tail Duplication

To convert the set of selected blocks into a hyperblock (with a single entry block), control flow from non-selected blocks (side entry points) must be eliminated.

The tail duplication algorithm first marks all blocks that have side entry points.

Then the algorithm marks all blocks that can be reached from marked blocks.

All marked blocks form the tails that must be duplicated.
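The two marking passes can be sketched as a small worklist algorithm. The function name `blocks_to_duplicate`, the dictionary encoding of the CFG, and the selected set used below (every block of the wc loop except C) are assumptions made for illustration; edges that return to the entry block are not followed, since entering through the entry block is always allowed.

```python
def blocks_to_duplicate(cfg, selected, entry):
    """Return the selected blocks that form the tail to be duplicated."""
    # Pass 1: mark selected blocks that are entered from a non-selected block.
    marked = set()
    for block, successors in cfg.items():
        if block in selected:
            continue
        for successor in successors:
            if successor in selected and successor != entry:
                marked.add(successor)

    # Pass 2: mark every selected block reachable from a marked block.
    worklist = list(marked)
    while worklist:
        block = worklist.pop()
        for successor in cfg.get(block, []):
            if successor in selected and successor != entry and successor not in marked:
                marked.add(successor)
                worklist.append(successor)
    return marked


# wc inner loop after backedge coalescing.
CFG = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["EXIT", "E"],
    "E": ["F", "G"],
    "F": ["H", "G"],
    "G": ["I", "J"],
    "H": ["N", "K"],
    "I": ["M"],
    "J": ["M", "L"],
    "K": ["N"],
    "L": ["N", "M"],
    "M": ["N"],
    "N": ["A"],
    "EXIT": [],
}

# Assumed selection: everything except C, the block with the subroutine call.
selected_blocks = set("ABDEFGHIJKLMN")
tail = blocks_to_duplicate(CFG, selected_blocks, entry="A")
print(sorted(tail))
```

Under this assumed selection the only side entry is the edge from C to D, and everything reachable from D inside the selection is marked, which gives the blocks D through N.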


Tail Duplication

[Figure: CFG of the wc inner loop after backedge coalescing, used to illustrate which blocks are marked for tail duplication.]



Tail Duplication

[Figure: CFG after tail duplication. Blocks D through N are duplicated as D' through N'; the original blocks keep the high execution counts (up to 105K), while the duplicated tail carries only small counts.]


Anatomy of a Predicate Computation Operation

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)

This instruction assigns values to Pout1 and Pout2. The value assigned depends on:

- the result of the comparison
- the value of Pin
- the type of Pout1 and Pout2


Anatomy of a Predicate Computation Operation

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)

<cmp> = eq | ne | gt

<type> = U | /U | OR | /OR | AND | /AND

Example:

pge p4(OR), p2(/U), r4, 127 (p1)

cmp = ge, Pin = p1, Pout1 = p4, Pout2 = p2, src1 = r4, src2 = 127


Anatomy of a Predicate Computation Operation

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)

<type> = U | /U | OR | /OR | AND | /AND

U or /U: always write into the destination register:

if type = U then
    if Pin = 0 then Pout = 0
    else if src1 <cmp> src2 then Pout = 1
    else Pout = 0

if type = /U then
    if Pin = 0 then Pout = 0
    else if src1 <cmp> src2 then Pout = 0
    else Pout = 1


Anatomy of a Predicate Computation Operation

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)

<type> = U | /U | OR | /OR | AND | /AND

OR or /OR: write into the destination register only if Pin = 1 and <cmp> is true (OR) or false (/OR):

if type = OR and Pin = 1 and src1 <cmp> src2 then Pout = 1

if type = /OR and Pin = 1 and src1 !<cmp> src2 then Pout = 1

Used when the execution of a block is enabled by one of multiple conditions.

OR-type predicates must be initialized to 0 before their use.


Anatomy of a Predicate Computation Operation

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)

<type> = U | /U | OR | /OR | AND | /AND

AND or /AND: write into the destination register only if Pin = 1 and <cmp> is false (AND) or true (/AND):

if type = AND and Pin = 1 and src1 !<cmp> src2 then Pout = 0

if type = /AND and Pin = 1 and src1 <cmp> src2 then Pout = 0

Used when the execution of a block requires several conditions to be true.

AND-type predicates are often initialized to 1.
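The six destination types described on these slides can be summarized in one small function. This is a sketch rather than a model of any particular instruction set: the function name `predicate_define`, the string encoding of the types, and the convention of returning the old value when the destination is not written are assumptions made for illustration.

```python
def predicate_define(ptype, pin, cmp_result, old):
    """Return the value of Pout after a predicate define of type `ptype`.

    `pin` is the guarding predicate, `cmp_result` the outcome of
    src1 <cmp> src2, and `old` the previous value of Pout, which is
    returned unchanged when the destination is not written.
    """
    if ptype == "U":       # always written
        return 1 if (pin and cmp_result) else 0
    if ptype == "/U":      # always written, complemented comparison
        return 1 if (pin and not cmp_result) else 0
    if ptype == "OR":      # only ever sets the predicate
        return 1 if (pin and cmp_result) else old
    if ptype == "/OR":
        return 1 if (pin and not cmp_result) else old
    if ptype == "AND":     # only ever clears the predicate
        return 0 if (pin and not cmp_result) else old
    if ptype == "/AND":
        return 0 if (pin and cmp_result) else old
    raise ValueError(f"unknown predicate type: {ptype}")


# Print one row per (Pin, comparison) pair; "-" marks "not written".
TYPES = ["U", "/U", "OR", "/OR", "AND", "/AND"]
for pin in (0, 1):
    for cmp_result in (0, 1):
        row = [predicate_define(t, pin, cmp_result, old="-") for t in TYPES]
        print(pin, cmp_result, row)
```

Passing a marker such as "-" as the old value makes it easy to see which types leave the destination untouched for a given input.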


Predicate Comparison Truth Table

• Pin predicates the entire predicate computation instruction.
• Notice that for an unconditional type, the value 0 is written in Pout even when Pin is 0.

Value written to Pout for each type ("-" means Pout is not written):

Pin  Comparison |  U   /U   OR   /OR   AND   /AND
 0       0      |  0    0    -    -     -     -
 0       1      |  0    0    -    -     -     -
 1       0      |  0    1    -    1     0     -
 1       1      |  1    0    1    -     -     0

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)


Predicate Comparison Truth Table

Example:

pge p4(OR), p2(/U), r4, 127 (p1)

p1  Comparison |  p4 (OR)   p2 (/U)
 0       0     |     -         0
 0       1     |     -         0
 1       0     |     -         1
 1       1     |     1         0


Predicate Types

Unconditional predicates are used for control dependence sets that have a single edge.

OR-type predicates are used for predicates with multiple edges in their control dependence sets. (OR-type predicates must be cleared before entering the hyperblock.)


Step 5: If-conversion

For graph drawing, Mahlke uses the convention that the left edge out of a basic block is the true condition and the right one is the false condition.

[Figure: block G with a left (true) edge to block I and a right (false) edge to block J.]

In this control flow graph the control dependences of blocks I and J are:

I: brG
J: /brG


Step 5: If-conversion

[Figure: CFG of the selected region after tail duplication; the duplicated tail is drawn as a single node D'-N'.]

Block   Control dependences   Predicate assignment
A       none                  null
B       none                  null
D       none                  null
E       none                  null
F       brE                   p1 (U)
G       /brE, /brF            p4 (OR)
H       brF                   p2 (U)
I       brG                   p7 (U)
J       /brG                  p5 (U)
K       brH                   p3 (U)
L       /brJ                  p8 (U)
M       brI, brJ, brL         p6 (OR)
N       none                  null


Step 5: If-conversion (example)

[Figure: CFG of the region during if-conversion, shown next to the predicated code generated so far.]

pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U), -, 0, r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
...



Inner Loop After If-conversion

pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U), -, 0, r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
peq p6(OR), p8(/U), r4, 32 (p5)
ld_i r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_i r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_i r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_i r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8)
mov r2, 0 (p6)
jmp loop


Predicate Hierarchy Graph

The Predicate Hierarchy Graph (PHG) is a directed acyclic graph representing the Boolean equations used to compute all the predicates in a hyperblock.

There are two types of nodes in the PHG: predicate nodes and condition nodes.

Two PHG nodes x and y are connected if the value specified by x is used to directly compute the value of y.

The PHG is used to derive relationships among predicates.


Example of PHG Construction

pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4 [c1, /c1]
pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]
peq p3(U), -, 0, r2 (p2) [c3]
peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]
peq p7(U), -, r4, r10 (p4) [c4]
peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]
ld_i r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_i r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_i r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_i r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8) [c6]
mov r2, 0 (p6)
jmp loop

[Figure: the PHG initially contains only the root predicate node T (true).]

The predicate-defining instructions are processed in program order. Each one adds its condition nodes below its guarding predicate (T when the instruction is not predicated) and connects them to the predicates it defines:

1. pge p4(OR), p1(/U), 32, r4 [c1, /c1]: T → c1 → p4 and T → /c1 → p1
2. pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]: p1 → c2 → p4 and p1 → /c2 → p2
3. peq p3(U), -, 0, r2 (p2) [c3]: p2 → c3 → p3
4. peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]: p4 → c4 → p6 and p4 → /c4 → p5
5. peq p7(U), -, r4, r10 (p4) [c4]: p4 → c4 → p7
6. peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]: p5 → c5 → p6 and p5 → /c5 → p8
7. peq p6(OR), -, r4, 9 (p8) [c6]: p8 → c6 → p6

[Figure: the completed PHG, with predicate nodes T and p1 through p8 and condition nodes c1 through c6.]


Purpose of PHG

The PHG allows the compiler to derive relations among the predicates. Mahlke identifies three predicate relations:

Ancestor: pi is an ancestor of pj if all conditions used to compute pj are derived from pi. The compiler can be sure that pj may be true only when pi is also true.

Control Path: there is a control path between pi and pj if there is at least one set of conditions under which both pi and pj are true. The compiler knows that pi and pj may be true at the same time.

Implies: pi implies pj if the conditions that make pi true guarantee that pj will also be true.
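For a small hyperblock, the control path and implies relations can be checked by brute force instead of by walking the PHG. The sketch below is that substitute technique, not Mahlke's algorithm: it writes the Boolean equation of each wc predicate in terms of the conditions c1 through c6 (read off the predicate-defining instructions of the if-converted loop), treats the conditions as independent Booleans, and tries all 64 assignments. The function names are hypothetical.

```python
from itertools import product


def predicates(c1, c2, c3, c4, c5, c6):
    """Boolean equations of the wc predicates in terms of the conditions."""
    p1 = not c1                      # pge ..., p1(/U), 32, r4
    p4 = c1 or (p1 and c2)           # OR-type, set by c1 or by c2 under p1
    p2 = p1 and not c2
    p3 = p2 and c3
    p5 = p4 and not c4
    p7 = p4 and c4
    p8 = p5 and not c5
    p6 = (p4 and c4) or (p5 and c5) or (p8 and c6)  # OR-type
    return {"p1": p1, "p2": p2, "p3": p3, "p4": p4,
            "p5": p5, "p6": p6, "p7": p7, "p8": p8}


# Predicate values under every assignment of the six conditions.
ALL_CASES = [predicates(*bits) for bits in product([False, True], repeat=6)]
NAMES = sorted(ALL_CASES[0])


def implies(pi, pj):
    """pi implies pj: whenever pi is true, pj is true as well."""
    return all(case[pj] for case in ALL_CASES if case[pi])


def control_path(pi, pj):
    """pi and pj can be true at the same time."""
    return any(case[pi] and case[pj] for case in ALL_CASES)


print("p7 implies p6:", implies("p7", "p6"))
print("on a control path with p5:",
      [p for p in NAMES if control_path("p5", p)])
```

Exhaustive enumeration grows exponentially with the number of conditions, which is one reason a compiler would rather query a graph structure such as the PHG.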


Imply Relationship

[Figure: the completed PHG for the predicated wc loop.]

p7 implies p6: p7 is set under p4 by condition c4, and the same condition under p4 also sets the OR-type predicate p6.


Ancestor Relationship

[Figure: the completed PHG for the predicated wc loop.]

Which predicate nodes are ancestors of p5?

T, p4, and p5


Control Path Relationship

[Figure: the completed PHG for the predicated wc loop.]

Which predicate nodes are in the same control path as p5?

T, p1, p4, p5, p6, p8


Classical/ILP Optimizations in Predicated Code

Example: Copy Propagation

A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)

Is the copy propagation from instruction A to instruction C legal? After the propagation the code would be:

A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r2, 0 (p3)

It depends on what we know about the relationship between p1, p2, and p3. If it is possible that p1 is false and p3 is true, the propagation would be wrong!


Classical/ILP Optimizations in Predicated Code

Example: Copy Propagation

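A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)

For instance, if we know that:

(1) p1 is an ancestor of both p2 and p3, and
(2) p2 and p3 are mutually exclusive,

then we can do the copy propagation safely.

[Figure: PHG fragment in which p1 is an ancestor of pk, and p2 and p3 are defined from pk by the complementary conditions cm and /cm.]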
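The legality argument can be checked mechanically for this three-instruction example. The sketch below runs the original and the propagated sequence on symbolic register values for every predicate combination that a given constraint allows, and reports whether the results always agree. The encoding of instructions as (destination, function, predicate) triples and all function names are assumptions made for illustration.

```python
from itertools import product


def run(program, preds, registers):
    """Execute predicated instructions on symbolic register values."""
    registers = dict(registers)
    for dest, compute, guard in program:
        if preds[guard]:
            registers[dest] = compute(registers)
    return registers


ORIGINAL = [
    ("r1", lambda r: r["r2"], "p1"),                    # A: mov r1, r2 (p1)
    ("r2", lambda r: ("add", r["r3"], r["r4"]), "p2"),  # B: add r2, r3, r4 (p2)
    ("r5", lambda r: ("load", r["r1"]), "p3"),          # C: ld_i r5, r1, 0 (p3)
]
PROPAGATED = [
    ORIGINAL[0],
    ORIGINAL[1],
    ("r5", lambda r: ("load", r["r2"]), "p3"),          # C: ld_i r5, r2, 0 (p3)
]
INITIAL = {f"r{i}": f"init_r{i}" for i in range(1, 6)}


def propagation_is_safe(allowed):
    """True if both programs agree for every allowed (p1, p2, p3)."""
    for p1, p2, p3 in product([False, True], repeat=3):
        if not allowed(p1, p2, p3):
            continue
        preds = {"p1": p1, "p2": p2, "p3": p3}
        if run(ORIGINAL, preds, INITIAL) != run(PROPAGATED, preds, INITIAL):
            return False
    return True


def ancestor_only(p1, p2, p3):
    # p2 and p3 may be true only when p1 is true.
    return (p1 or not p2) and (p1 or not p3)


def ancestor_and_exclusive(p1, p2, p3):
    return ancestor_only(p1, p2, p3) and not (p2 and p3)


print(propagation_is_safe(lambda p1, p2, p3: True))   # no knowledge
print(propagation_is_safe(ancestor_only))
print(propagation_is_safe(ancestor_and_exclusive))
```

Only the last call reports that the propagation is safe: without mutual exclusion, instruction B may overwrite r2 between the copy and the load.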


Classical/ILP Optimizations in Predicated Code

Example: Instruction Scheduling

A: ld_i r1, r2, r3 (p2)
B: add r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul r6, r1, r7 (p3)

What are the data dependences in the code above? It depends on what we know about the relationship between p2 and p3.


Classical/ILP Optimizations in Predicated Code

Example: Instruction Scheduling

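A: ld_i r1, r2, r3 (p2)
B: add r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul r6, r1, r7 (p3)

[Figure: PHG fragment in which p2 and p3 are defined from pk by the complementary conditions cm and /cm.]

For instance, if we know that p2 and p3 are mutually exclusive, we have this DDG:

[Figure: data dependence graph with nodes A, B, C, and D. With mutually exclusive predicates, only the flow dependences through r1 from A to B and from C to D remain.]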


Classical/ILP Optimizations in Predicated Code

Example: Instruction Scheduling

A: ld_i r1, r2, r3 (p2)B: add r4, r1, 4 (p2)C: ld_i r1, r5, 0 (p3)D: mul r6, r1, r7 (p3)

[Figure: predicate hierarchy graph with nodes pk, p2, and p3; both edges are labeled cm.]

But if p2 implies p3, then we have this DDG:

[Figure: DDG with A at the top, B and C in the middle, and D at the bottom.]
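How the predicate relation changes the dependence graph can be sketched in a few lines. This is a conservative pairwise analysis over register operands only, assuming the set of mutually exclusive predicate pairs is supplied by the caller (the tuple encoding and the function name are illustrative, not from the slides). It does not remove a flow dependence that is killed by an intervening definition.

```python
from itertools import combinations

# (label, destination register, source registers, guarding predicate)
CODE = [
    ("A", "r1", ("r2", "r3"), "p2"),  # ld_i r1, r2, r3 (p2)
    ("B", "r4", ("r1",), "p2"),       # add  r4, r1, 4  (p2)
    ("C", "r1", ("r5",), "p3"),       # ld_i r1, r5, 0  (p3)
    ("D", "r6", ("r1", "r7"), "p3"),  # mul  r6, r1, r7 (p3)
]

def dependences(code, disjoint):
    """Return (earlier, later, kind) edges.

    Pairs guarded by mutually exclusive predicates never execute together,
    so no dependence is recorded between them.
    """
    edges = []
    for (la, da, sa, pa), (lb, db, sb, pb) in combinations(code, 2):
        if frozenset((pa, pb)) in disjoint:
            continue
        if da in sb:
            edges.append((la, lb, "flow"))    # later reads what earlier writes
        if db in sa:
            edges.append((la, lb, "anti"))    # later overwrites what earlier reads
        if da == db:
            edges.append((la, lb, "output"))  # both write the same register
    return edges

exclusive = dependences(CODE, {frozenset(("p2", "p3"))})
conservative = dependences(CODE, set())
```

With p2 and p3 mutually exclusive, only the flow edges A to B and C to D remain, so the two pairs can be scheduled independently. Without that knowledge (for example when p2 implies p3), the output dependence A to C and the anti dependence B to C appear and serialize the code.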

Page 129: CMPUT680 - Winter 2006

Predicate-Specific Optimizations

- Predicate Promotion
- Branch Combining
- Predicate Loop Peeling

Page 130: CMPUT680 - Winter 2006

Predicate Promotion

The idea is to speculate the execution of instructions by replacing their predicate by a less constrained predecessor predicate.

Because the ancestor predicate is computed with fewer conditions, the execution of the promoted instruction is speculative.

The advantage of predicate promotion is the reduction of the dependence chain in a hyperblock.

Page 131: CMPUT680 - Winter 2006

Conditions for Simple Predicate Promotion

The predicate of an instruction op(x) can be promoted to its predecessor predicate if all the following conditions are true:
1. op(x) is predicated
2. op(x) has a destination register
3. op(x) has a speculative version
4. there is a unique op(y) lexically before op(x) such that dest(y) = pred(x)
5. dest(x) is not live at op(y)
6. for any op(j) such that there is a path op(j)…op(y), dest(x) ≠ dest(j)
7. it is profitable to promote op(x)

Page 132: CMPUT680 - Winter 2006

Example of Predicate Promotion (qsort)

Before predicate promotion:

1 LA: ld_i r20, r24, r101
2 ld_i r23, r2, r102
3 pge p126(U), p127(U), r20, r23
4 LB: ld_i r6, r123, 0 (p126)
5 add r123, r123, 8 (p126)
6 add r9, r9, 1 (p126)
7 add r101, r101, 8 (p126)
8 LC: ld_i r6, r124, 8 (p127)
9 add r124, r124, 8 (p127)
10 add r124, r124, 8 (p127)
11 add r102, r102, 8 (p127)
12 LD: st_i r114, 0, r23
13 st_i r114, 4, r6
14 add r7, r7, 1
15 add r114, r114, 8
16 bge r9, r3, EXIT
17 LE: blt r8, r1, LA

After predicate promotion:

1 LA: ld_i r20, r24, r101
2 ld_i r23, r2, r102
3 pge p126(U), p127(U), r20, r23
4 LB: ld_i r6, r123, 0
5 add r123, r123, 8 (p126)
6 add r9, r9, 1 (p126)
7 add r101, r101, 8 (p126)
8 LC: ld_i r60, r124, 8
8a mov r6, r60 (p127)
9 add r124, r124, 8 (p127)
10 add r124, r124, 8 (p127)
11 add r102, r102, 8 (p127)
12 LD: st_i r114, 0, r23
13 st_i r114, 4, r6
14 add r7, r7, 1
15 add r114, r114, 8
16 bge r9, r3, EXIT
17 LE: blt r8, r1, LA
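A small sketch of why the promoted code computes the same r6 as the predicated code. It models only instructions 4, 8, and 8a, and assumes that p126 and p127 are complementary (the pge instruction is taken to define one predicate and its complement) and that both loads are safe to execute speculatively. The memory contents and addresses are made up for illustration.

```python
def predicated(mem, r123, r124, p126, p127):
    """Instructions 4 and 8 before promotion: each load is guarded."""
    r6 = None
    if p126:
        r6 = mem[r123 + 0]   # 4: ld_i r6, r123, 0 (p126)
    if p127:
        r6 = mem[r124 + 8]   # 8: ld_i r6, r124, 8 (p127)
    return r6

def promoted(mem, r123, r124, p126, p127):
    """Instructions 4, 8, and 8a after promotion: both loads always run."""
    r6 = mem[r123 + 0]       # 4: promoted, predicate removed
    r60 = mem[r124 + 8]      # 8: promoted, destination renamed to r60
    if p127:
        r6 = r60             # 8a: mov r6, r60 (p127)
    return r6

# Hypothetical memory image and base addresses.
mem = {100: 11, 208: 22}
results = []
for taken in (True, False):
    before = predicated(mem, 100, 200, taken, not taken)
    after = promoted(mem, 100, 200, taken, not taken)
    results.append((before, after))
```

Both versions yield the same r6 on either path. The second load needs the renamed destination r60 because writing r6 unconditionally at instruction 8 would clobber the value loaded at instruction 4 when p126 is true.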

Page 133: CMPUT680 - Winter 2006

Branch Combining

Problem: too many infrequently executed branches in a hyperblock

1 A: bge r1, r5, EXIT1
2 ld_c r3, r1, 0
3 beq r3, 10, EXIT2
4 beq r3, 0, EXIT3
5 bge r2, r6, EXIT4
6 st_c r2, 0, r3
7 add r1, r1, 1
8 add r2, r2, 1
9 jmp A

Example: a loop in grep

[Figure annotations beside the code: 14, 4035, 0, 0]

Page 134: CMPUT680 - Winter 2006

Branch Combining

Solution: replace a group of exit branches by a corresponding group of predicate define instructions.

All predicate definitions write into the same predicate register using the OR-type semantics.

The resultant predicate will be set to 1 if any of the exit branches were to be taken.

Because not exiting the hyperblock is the most common case, the predicate will usually be false.
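A minimal sketch of the OR-type accumulation, using the four exit conditions of the grep loop shown earlier. The helper names are hypothetical, and r3 is assumed to already hold the loaded value. An OR-type define sets the predicate to 1 when its comparison is true and leaves it unchanged otherwise, so the predicate must be cleared first.

```python
def por(p, condition):
    """OR-type predicate define: write 1 if the condition holds, else keep p."""
    return True if condition else p

def combined_exit_predicate(r1, r2, r3, r5, r6):
    p1 = False                 # pclr p1
    p1 = por(p1, r1 >= r5)     # was: bge r1, r5, EXIT1
    p1 = por(p1, r3 == 10)     # was: beq r3, 10, EXIT2
    p1 = por(p1, r3 == 0)      # was: beq r3, 0, EXIT3
    p1 = por(p1, r2 >= r6)     # was: bge r2, r6, EXIT4
    return p1

stay = combined_exit_predicate(r1=3, r2=3, r3=65, r5=100, r6=100)
leave = combined_exit_predicate(r1=3, r2=3, r3=10, r5=100, r6=100)
```

Here stay is False (no exit condition holds) and leave is True (the loaded value equals 10). Because the defines only ever set the predicate, they do not depend on one another and can be issued in any order or in parallel; a single branch on p1 then replaces the four exit branches.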

Page 135: CMPUT680 - Winter 2006

Branch Combining

Before branch combining:

1 A: bge r1, r5, EXIT1
2 ld_c r3, r1, -1
3 beq r3, 10, EXIT2
4 beq r3, 0, EXIT3
5 bge r2, r6, EXIT4
6 st_c r2, -1, r3
7 bge r1, r7, EXIT5
8 ld_c r4, r1, 0
9 beq r4, 10, EXIT6
10 beq r4, 0, EXIT7
11 bge r2, r8, EXIT8
12 st_c r2, 0, r4
13 add r1, r1, 2
14 add r2, r2, 2
15 jmp A

After branch combining:

0 A: pclr p1
1 pge p1(OR), r1, r5
2 ld_c r3, r1, -1
3 peq p1(OR), r3, 10
4 peq p1(OR), r3, 0
5 pge p1(OR), r2, r6
7 pge p1(OR), r1, r7
8 ld_c r4, r1, 0
9 peq p1(OR), r4, 10
10 peq p1(OR), r4, 0
11 pge p1(OR), r2, r8
16 jmp Decode (p1)
6' st_c r2, -1, r3
12 st_c r2, 0, r4
13 add r1, r1, 2
14 add r2, r2, 2
15 jmp A

Decode:
1 bge r1, r5, EXIT1
3 beq r3, 10, EXIT2
4 beq r3, 0, EXIT3
5 bge r2, r6, EXIT4
6 st_c r2, -1, r3
7 bge r1, r7, EXIT5
9 beq r4, 10, EXIT6
10 beq r4, 0, EXIT7
11 jmp EXIT8

Page 136: CMPUT680 - Winter 2006

Instructions Between Combined Branches

Instructions between combined branches are speculated.

For instructions that are between combined branches but cannot be speculated, the following must be done:

(1) move the instructions below the combined exit branch in the hyperblock.

(2) replicate these instructions in their original position with respect to the exit branches in the decode block.

Page 137: CMPUT680 - Winter 2006

Backend Compilation with Hyperblocks

[Figure: backend compilation flow. Phase boxes: Register Allocation; Instruction Scheduling; Classical Optim.; ILP/Predicate-Specific Optimizations; Hyperblock/Superblock Formation; Classical Optim.; Lcode generation. Supporting components: PHG, CFG Generator, Equation Solver. Labels: predicate relations, dataflow information, predicate aware.]