52
Spring 2014 Jim Hogg - UW - CSE - P501 T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength Reduction Loop Unrolling Memory Caches

Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Embed Size (px)

Citation preview

Page 1: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Spring 2014 Jim Hogg - UW - CSE - P501 T-1

CSE P501 – Compiler Construction

Loop optimizations

Dominators – discovering loops

Loop Hoisting Operator Strength Reduction Loop Unrolling

Memory Caches

Page 2: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loops

Most of the time executing a program is spent in loops. (Why is this true?)

So focus on loops when trying to make a program run faster

So: How to recognize loops? How to improve those loops

Spring 2014 Jim Hogg - UW - CSE - P501 T-2

Page 3: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

What’s a Loop?

Spring 2014 Jim Hogg - UW - CSE - P501 T-3

• In source code, a loop is the set of statements in the body of a for/while construct

• But, in a language that permits free use of GOTOs, how do we recognize a loop?

• In a control-flow-graph (node = basic-block, arc = flow-of-control), how do we recognize a loop?

Page 4: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Any Loops in this Code?

Spring 2014 Jim Hogg - UW - CSE - P501 T-4

i = 0goto L8

L7: i++L8: if (i < N) goto L9

s = 0j = 0goto L5

L4: j++L5: N--

if(j >= N) goto L3if (a[j+1] >= a[j]) goto L2t = a[j+1]a[j+1] = a[j]a[j] = ts = 1

L2: goto L4L3: if(s != ) goto L1 else goto L9L1: goto L7L9: return

Anyone recognize or guess the algorithm?

Page 5: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Any Loops in this Flowgraph?

Spring 2014 Jim Hogg - UW - CSE - P501 T-5

Page 6: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loop in a Flowgraph: Intuition

Spring 2014 Jim Hogg - UW - CSE - P501 T-6

Header Node

• Cluster of nodes, such that:

• There's one node called the "header"• I can reach all nodes in the cluster from the header• I can get back to the header from all nodes in the cluster• Only once entrance - via the header• One or more exits

Page 7: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

What’s a Loop?

In a control flow graph, a set of nodes S such that:

S includes a unique header node h From any node in S there is a path (of directed

edges) leading to h There is a path from h to every node in S There is no edge from any node outside S to any

node in S other than via h

Spring 2014 Jim Hogg - UW - CSE - P501 T-7

Page 8: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Entries and Exits

In a loop

The entry node is one with some predecessor outside the loop

An exit node is one that has a successor node outside the loop

Corollary: A loop may have multiple exit nodes, but only one entry node

Spring 2014 Jim Hogg - UW - CSE - P501 T-8

Page 9: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

We use dominators to discover loops in flowgraph

Recall

Every control flow graph has a unique start node 0 Node d dominates node n if every path from 0 to n must go

thru d A node x dominates itself

You can't reach y, from 0, without passing thru d

Spring 2014 Jim Hogg - UW - CSE - P501 T-9

Dominators

Page 10: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Dominators by Inspection

Jim Hogg - UW - CSE - P501 T-10

1

7

86

5

2

0

3

4

For each Node, n, which nodes dominate it?

We denote this Dom(n) - the dominators of n

Page 11: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Dominators by Inspection

Jim Hogg - UW - CSE - P501 T-11

1

7

86

5

2

0

3

4

n Dom(n)

0 {0}

1 {0,1}

2 {0,1,2}

3 {0,1,3}

4 {0,1,3,4}

5 {0,1,5}

6 {0,1,5,6}

7 {0,1,5,7}

8 {0,1,5,8}

Page 12: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Dominators: Intuition

T-12

1

7

86

5

2

0

3

4

01

012

015

0158

0156

015678

01235678

012345678

0

1

7

86

5

2

0

3

4

01

012

015

0158

0156

0157

013

014

0

01235678

Travel everywhere you can!

It only counts if you reached here by every incoming path

Page 13: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Dominators by Calculation

T-13

1

7

86

5

2

0

3

4

Dom(n) is the set of nodes that dominate n

Calculate by iterating to fixed point: Dom(n) = {n} Intersectp Domp

Initial conditions: Dom(0) = 0 Otherwise, Dom(n) = U = set of all

nodesn preds - #1 #2

0 {} {0} {0} {0}

1 {0,3} U {0,1} {0,1}

2 {1} U {0,1,2} {0,1,2}

3 {2,7} U {0,1,2,3} {0,1,3}

4 {3} U {0,1,2,3,4}

{0,1,3,4}

5 {1} U {0,1,5} {0,1,5}

6 {5} U {0,1,5,6} {0,1,5,6}

7 {6,8} U {0,1,5,6,7}

{0,1,5,7}

8 {5} U {0,1,5,8} {0,1,5,8}

Page 14: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

T-14

n preds - 1

0 {} {0}

{0}

1 {0} U {0,1}

2 {1} U {0,1,2}

3 {2,7} U

4 {3} U

5 {1} U

6 {5} U

7 {6,8} U

8 {5} U

n preds - 1

0 {} {0}

{0}

1 {0} U {0,1}

2 {1} U {0,1,2}

3 {2,7} U {0,1,2,3}

4 {3} U

5 {1} U

6 {5} U

7 {6,8} U

8 {5} U

n preds - 1

0 {} {0} {0}

1 {0} U {0,1}

2 {1} U {0,1,2}

3 {2,7} U {0,1,2,3}

4 {3} U {0,1,2,3,4}

5 {1} U

6 {5} U

7 {6,8} U

8 {5} U

n preds - 1

0 {} {0} {0}

1 {0} U {0,1}

2 {1} U

3 {2,7} U

4 {3} U

5 {1} U

6 {5} U

7 {6,8} U

8 {5} U

Use most recent values to converge faster

Dominators by Calculation: Details

Page 15: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Dominator Algorithm

Iterative Dataflow Easier than previous Iterative Dataflow solutions Ignores content of each basic block Concerned only with how blocks link - the structure of the

flowgraph

Double-check Cooper&Torczon page 481, bottom table iteration #1 for node B3 Answer = {0,1,3} Should it be {0,1,2,3} ?

Spring 2014 Jim Hogg - UW - CSE - P501 T-15

Page 16: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Immediate Dominators

Every node n has a single immediate dominator Denote the immediate dominator of node n as IDom(n) IDom(n) is not n itself - we are interested now in strict

dominance strict in the same sense as strict subset: versus

IDom(n) strictly dominates n and is the nearest dominator of n nearest: IDom(n) does not dominate any other strict dominator of n

Theorem: If both a and b dominate n, then either a dominates b or b dominates a

Proof: reductio ad absurdum Therefore, IDom(n) is unique

Spring 2014 Jim Hogg - UW - CSE - P501 T-16

Page 17: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Dominator Tree

Jim Hogg - UW - CSE - P501 T-17

1

7

86

5

2

0

3

4

n Dom(n) IDom(n)

0 {0} -

1 {0,1} 0

2 {0,1,2} 1

3 {0,1,3} 1

4 {0,1,3,4}

3

5 {0,1,5} 1

6 {0,1,5,6}

5

7 {0,1,5,7}

5

8 {0,1,5,8}

5

2 4 6 7 8

3 5

1

0

(Strict) Dominator Tree.Can read off the Immediate Dominator of each node

Eg: 3 dominates 4

Page 18: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Identifying Loops in Flowgraph

Spring 2014 Jim Hogg - UW - CSE - P501 T-18

0

1

2 3

4 5

6

7

8

9

10

11

Node Dom IDom

0 -

1 0 0

2 0 1 2 1

3 0 1 3 1

4 0 1 3 4 3

5 0 1 3 5 3

6 0 1 3 4 6 4

7 0 1 3 4 6 7 6

8 0 1 3 4 6 7 8

7

9 0 1 3 9 3

10 0 1 3 9 10 9

11 0 1 3 9 10 11

10

0

1

32

4 5

6

7

8

9

10

11

Dominator Tree

Page 19: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loops in Flowgraph

Spring 2014 Jim Hogg - UW - CSE - P501 T-19

0

1

2 3

4 5

6

7

8

9

10

11

0

1

32

4 5

6

7

8

9

10

11

Dominator Tree

A back edge is an edge from n to h where h dom nNatural Loop = h and n and all nodes that can reach n without going thru h

dom? Back Edg

e

Natural Loop

1 dom 2

21 {1,2}

1 dom 3

31 {1,3}

6 dom 7

76 {6,7}

4 dom 8

84 {4,6,7,8}

Page 20: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Inner Loops

Inner loops are more important for optimization

most execution time is spent there

Spring 2014 Jim Hogg - UW - CSE - P501 T-20

for (i = 1; i <= 100; ++i) { ... // executes 100 times for (j = 1; j <= 100; ++j) { ... // executes 10,000 times! } ... // executes 100 times}

So, how do we identify inner loops? . . .

Page 21: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Nested Loops in Flowgraph

Spring 2014 Jim Hogg - UW - CSE - P501 T-21

0

1

2 3

4 5

6

7

8

9

10

11

• A nested, or inner, loop is one whose nodes are a strict subset of another loop

• Eg: {6,7} {4,6,7,8}

dom? Back Edg

e

Natural Loop

1 dom 2

21 {1,2}

1 dom 3

31 {1,3}

6 dom 7

76 {6,7}

4 dom 8

84 {4,6,7,8}

Page 22: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loop Surprise

Spring 2014 Jim Hogg - UW - CSE - P501 T-22

4

5 6

7

8

10

• 4 dom 7

• 74 (back edge)

• The natural loop includes 4 and 7 and all nodes that can reach 7 without passing thru 4

• Ie: {4,5,6,7,8,10}

• So technically, loop 74 has inner loop 107

• We can specify a loop by giving its back-edge

• We are not examining all kinds of loops.• Eg: nested loops that share a header• Eg: unnatural, or "irreducible" loops

Page 23: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

An Unnatural Loop!

Spring 2014 Jim Hogg - UW - CSE - P501 T-23

1

32

• We feel that {2, 3} should be a loop . . .• But 2 does not dominate 3, since we can reach 3 without going

thru 2• And 3 does not dominate 2, since we can reach 2 without going

thru 3• 1 dominates 2 and 3, but there is no back edge 21 or 31• Checkmate!

• Natural loops have one entry• Unnatural loops have multiple entries

• All loops constructed using FOR, WHILE, BREAK are guaranteed natural, by-construction

Page 24: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Idea: If x = a op b, inside the loop, always does the same thing each time, then:

hoist x = a op b out of the loop inside the loop simply re-use x without re-calculating it similar to CSE, but now a bigger payoff: 'trip-count' times!

But need a few safety checks before hoisting . . .

Spring 2014 Jim Hogg - UW - CSE - P501 T-24

Loop 'Hoisting'

Page 25: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loop Preheader

Often we need a place to park code right before the start of a loop. Sometimes called a "landing pad"

Easy if there is a single node preceding the loop header h But this isn’t the case in general We don't want to execute extra code every time thru the loop!

So insert a preheader node p

Include an edge ph Change all edges xh to be xp

Spring 2014 Jim Hogg - UW - CSE - P501 T-25

h

p

h

back-edge

back-edge

Page 26: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Check for Loop-Invariance

We can hoist d: x = a op b to loop pre-header so long as:

a and b are constant, or All the defs for a and b that reach d are outside the loop, or One def of a reaches d, but that def is loop-invariant

Use this to build an iterative algorithm

Base cases: constants and operands defined outside the loop Then: repeatedly find definitions with loop-invariant operands

Spring 2014 Jim Hogg - UW - CSE - P501 T-26

Page 27: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Hoisting is allowed if . . .

Assume we already checked d: x = a + b is loop-invariant

Hoist instruction to loop pre-header if

if x is live-out from the loop, d must dominate all loop exits, and

there is only one def of x in the loop, and x is not live-out of the loop pre-header (ie, possibly used

before reaching d:)

Need to modify this if a + b could have side-effects or could raise an exception

Spring 2014 Jim Hogg - UW - CSE - P501 T-27

Page 28: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Hoisting: Possible?

Example 1

Spring 2014 Jim Hogg - UW - CSE - P501 T-28

L0: t = 0L1: i = i + 1 t = a + b M[i] = t if i < n goto L1L2: x = t

• a and b are loop-invariant

• L0: is the pre-header• can I hoist t = a + b?

L0: t = 0L1: if i ≥ n goto L2 i = i + 1 t = a + b M[i] = t goto L1L2: x = t

Example 2

Page 29: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Hoisting: Possible?

L0: t = 0L1: i = i + 1 t = a + b M[i] = t t = 0 M[j] = t if i < n goto L1L2: x = t

L0: t = 0L1: M[j] = t i = i + 1 t = a + b M[i] = t if i < n goto L1L2: x = t

Spring 2014 Jim Hogg - UW - CSE - P501 T-29

Example 3 Example 4

• a and b are loop-invariant

• L0: is the pre-header• can I hoist t = a + b?

Page 30: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Spring 2014 Jim Hogg - UW - CSE - P501 T-30

Operator Strength Reduction (OSR)

• Replace an expensive operator (eg: multiplication) in a loop with a cheaper operator (eg: addition)

• Usually requires replacing the original "induction variables"

Page 31: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

OSR Example

Spring 2014 Jim Hogg - UW - CSE - P501 T-31

s = 0 i = 0L1: if i ≥ n goto L2 j = i*4 k = j+a x = M[k] s = s+x i = i+1 goto L1L2:

Original

Induction-variable analysis: discover i and j as related IVs

Strength reduction to replace *4 with an addition

Replace the i ≥ n test Assorted copy propagation

To Optimize:

Page 32: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

An induction variable, i, is one whose value varies systematically inside a loop. Eg:

for i = 0 1 2 3 ... i * c ; 0 5 10 15 ... i c ; 7 8 9 10 ... where c is a region constant

Replace multiplications with (cheaper) additions

Eg: i = i * c is replace with i += c, but generates same sequence of values

Spring 2014 Jim Hogg - UW - CSE - P501 T-32

Induction Variable

Page 33: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

After OSR

Spring 2014 Jim Hogg - UW - CSE - P501 T-33

s = 0 i = 0L1: if i ≥ n goto L2 j = i * 4 k = j + a x = M[k] s = s + x i = i + 1 goto L1L2:

i = 0, 1, 2, 3 ...j = 0, 4, 8, 12 ...k = a, 4+a, 8+a, 12+a ...

s = 0 k’ = a t = n * 4 c = a + bL1: if k’ ≥ c goto L2 x = M[k’] s = s + x k’ = k’ + 4 goto L1L2:

Original Transformed

• Loop counter i is gone!• k replaced with k'• t is temp - used to calculate

c

The algorithm is messy. See Cooper&Toczon, p584 (not required for the exam)

Page 34: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Optimizing Induction Variables (IVs)

Strength reduction: if a "derived" IV is defined as j = i * c, replace with an addition inside the loop

Elimination: after strength reduction some IVs are not used, so delete them

Rewrite comparisons: If a variable is used only in comparisons against CIVs and in its own definition, modify the comparison to use a related IV

Spring 2014 Jim Hogg - UW - CSE - P501 T-34

Page 35: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Candidates: Loop with small body Then loop overhead (increment+test+jump) is a

significant fraction (or even bigger!) than the useful computation

Idea: reduce overhead by unrolling put two or more copies of the body inside the loop

Bonus: more opportunities to optimize

Spring 2014 Jim Hogg - UW - CSE - P501 T-35

Loop Unrolling

Page 36: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

sum = 0;for (i = 0; i < 100; ++i) sum += a[i];

Spring 2014 Jim Hogg - UW - CSE - P501 T-36

sum = 0;for (i = 0; i < 100; i += 2) { sum += a[i]; sum += a[i+1];}

Original

Unrolled 2x

• Example depends upon loop trip-count being a multiple of 2

• In general, add a 'tidy-up' step

Page 37: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Unrolling Algorithm Results

L1: x = M[i] sum = sum + x i = i + 4 if i < n goto L1L2:

L1: x = M[i] sum = sum + x i = i + 4 x = M[i] sum = sum + x i = i + 4 if i < n goto

L1L2:

Spring 2014 Jim Hogg - UW - CSE - P501 T-37

Original

Unrolled 2x

• Example depends upon loop trip-count being a multiple of 2

• In general, add a 'tidy-up' step• This code can be further optimized (eg: OSR on i)

Page 38: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Further Optimized

Spring 2014 Jim Hogg - UW - CSE - P501 T-38

L1: x = M[i] sum = sum + x x = M[i+4] sum = sum + x i = i + 8 if i < n goto L1L2:

L1: x = M[i] sum = sum + x i = i + 4 x = M[i] sum = sum + x i = i + 4 if i < n goto

L1L2:

• Actual optimizations depend partly upon the power of our IR

• (and subsequently on the richness of the target ISA)

Before After

Page 39: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Postscript on Loop Unrolling

This example only unrolls the loop by a factor of 2

In general, unroll by a factor of K

More aggressive unroll for reductions (like sum, product) Need a 'tidy-up' mini-loop where trip-count is not a multiple

of K May unroll short loops entirely

Why not unroll more? Code bloat - increases memory footprint for decreasing

perf win Increases register pressure

Spring 2014 Jim Hogg - UW - CSE - P501 T-39

Page 40: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

One of the great triumphs of computer hardware design

Effect is a large, fast memory

Reality is a series of progressively larger, slower, cheaper stores, with frequently accessed data automatically staged to faster storage (cache, main memory, disk)

Programmer/compiler typically treats it as one large store.

Hardware maintains cache coherency - well, mostly!

Spring 2014 Jim Hogg - UW - CSE - P501 T-40

Memory Caches

Page 41: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Intel Haswell Caches

T-41

Core Core

L1 = 64 KB per core

L2 = 256 KB per core

L3 = 2-8 MB shared

Main Memory

Page 42: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Just How Slow is Operand Access?

Instruction ~5 per cycle

Register 1 cycle

L1 CACHE ~4 cycles L2 CACHE ~10 cycles L3 CACHE (unshared line) ~40 cycles

DRAM ~100 ns

Spring 2014 Jim Hogg - UW - CSE - P501 T-42

Page 43: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Memory Issues

Byte load/store is often slower than whole (physical) word load/store Unaligned access is often extremely slow

Temporal locality: accesses to recently accessed data will usually find it in the cache

Spatial locality: accesses to data near recently used data will usually be fast

“near” = in the same cache block

But – alternating accesses to blocks that map to the same cache line will cause thrashing (cache-line "aliasing")

Spring 2014 Jim Hogg - UW - CSE - P501 T-43

• Increases in CPU speed have out-paced increases in memory access times

• Memory accesses now often determine overall program speed• "Instruction Count" is no longer the only performance metric to

optimize-for

Page 44: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Data Alignment

Data objects (structs) often are similar in size to a cache line/block (≈ 64 bytes)

Better if objects don’t span blocks

Some strategies Allocate objects sequentially; bump to next block boundary if

useful Allocate objects of same common size in separate pools (all

size-2, size-4, etc)

Tradeoff: speed for some wasted space

Spring 2014 Jim Hogg - UW - CSE - P501 T-44

Page 45: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Instruction Alignment

Align frequently executed basic blocks on cache boundaries (or avoid spanning cache blocks)

Branch targets (particularly loops) may be faster if they start on a cache line boundary In optimized code, will often see multi-byte NOPs as padding, to

align loop header Alignment varies with chip. Eg: current-gen Intel 16 bytes; current-

gen AMD prefers 32 bytes

Try to move infrequent code (startup, exceptions) away from hot code

Optimizing compiler may perform basic-block ordering ("layout")

Spring 2014 Jim Hogg - UW - CSE - P501 T-45

Page 46: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loop Interchange

Textbook Matrix Multiply: A = B C aij += bik * ckj

Spring 2014 T-46

for (i = 0; i < n; i++) for (j = 0; j < n; j++) for (k = 0; k < n; k++) a[i,j] += b[i,k] * c[k,j];

j

B C

i

A

j

As k increments (innermost/fastest), we move along a row of B, but down a column of C. The latter results in poor use of cache, and low performance

Page 47: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loop Interchange

Spring 2014 T-47

C

• As k increments (innermost/fastest), we move down a column of C.• Each read fills a cache line (typically 64 bytes)• Algorithm uses only 4 bytes of that read• Then fills another cache line - so every access to C incurs a cache-

miss!• No reuse of cache line; poor locality; low performance

j

64 bytes wide cache line

• Eg: int C[1000,1000]• Row 0 begins at address 0 = 0 % 64• Row 1 begins at address 4000 = 16 % 64• Row 2 begins at address 8000 = 32 % 64

Page 48: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loop Interchange

Textbook Matrix Multiply: A = B C aij += bik * ckj

Spring 2014 T-48

for (i = 0; i < n; i++) for (k = 0; k < n; k++) for (j = 0; j < n; j++) a[i,j] += b[i,k] * c[k,j];

B C

i

A

k

As j increments (innermost/fastest), we move along a row of A and along a row of C. This pattern produces good locality; reuse of cache line; and high performance. (Note that we now have to visit calculation of a[i,j] multiple times)

Page 49: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Loop Interchange

Spring 2014 T-49

C

• As k increments (innermost/fastest), we move down a column of C.

• Each read fills a cache line (typically 64 bytes)• Algorithm uses only 4 bytes of that read• Then fills another cache line• No reuse of cache line; poor locality; low performance

j

64 bytes wide cache line

Page 50: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Blocking/Tiling Matrix Multiply

Spring 2014 Jim Hogg - UW - CSE - P501 T-50

Textbook Matrix Multiply: A = B C aij += bik * ckj

j

B C

i

A

j

a11 a12 a13 a14 a15 a16

a21 a22 a23 a24 a25 a26

a31 a32 a33 a34 a35 a36

a41 a42 a43 a44 a45 a46

a51 a52 a53 a54 a55 a56

a61 a62 a63 a64 a64 a66

A11 A12

A21 A22

Aij += Bik * Ckj

Page 51: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Blocking/Tiling Matrix Multiply

Spring 2014 Jim Hogg - UW - CSE - P501 T-51

B CA

A11 A12

A21 A22

Aij += Bik * Ckj

= +

B C

• In practice, partition A,B,C into small enough chunks that they fit into cache

• Partitions need not be square, but must be conformable

Page 52: Spring 2014Jim Hogg - UW - CSE - P501T-1 CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength

Parallelizing Code

There is a long literature about how to rearrange loops for better locality and to detect parallelism

Some compilers can auto-parallelize user code

Much better if directed by user (eg: using parallel-for language construct; or parallel-library)

Some starting points

Latest edition of Dragon book, ch. 11 Allen & Kennedy Optimizing Compilers for Modern

Architectures Wolfe, High-Performance Compilers for Parallel Computing

(won't be in the exam)

Spring 2014 Jim Hogg - UW - CSE - P501 T-52