CR18: Advanced Compilers L06: Code Generation Tomofumi Yuki 1


CR18: Advanced Compilers

L06: Code Generation

Tomofumi Yuki


Code Generation

Completing the transformation loop

Problem: how to generate code to scan
  - a polyhedron?
  - a union of polyhedra?
How to generate tiled code?
How to generate parametrically tiled code?


Evolution of Code Gen

Ancourt & Irigoin 1991
  - single polyhedron scanning
LooPo: Griebl & Lengauer 1996
  - 1st step to unions of polyhedra
  - scan bounding box + guards
Omega Code Gen 1995
  - generate inefficient code (convex hull + guards)
  - then try to remove inefficiencies


Evolution of Code Gen

LoopGen: Quilleré-Rajopadhye-Wilde 2000
  - efficiently scanning unions of polyhedra
CLooG: Bastoul 2004
  - improvements to QRW algorithm
  - robust and well maintained implementation
AST Generation: Grosser 2015
  - Polyhedral AST generation is more than scanning polyhedra
  - scanning is not enough!


Scanning a Polyhedron

Scanning Polyhedra with DO Loops [1991]

Problem: generate bounds on loops
  - outermost loop: constants and params
  - inner loop: + surrounding iterators

Approach: Fourier-Motzkin elimination
  - projecting out variables
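
As a rough illustration of the projection step, the following Python sketch performs one round of Fourier-Motzkin elimination on constraints written as `c·x + c0 >= 0`. The tuple encoding, the helper name `eliminate`, and the triangular example domain (i ≤ N, j ≥ 0, i - j ≥ 0) are choices made here for illustration. The elimination works over the rationals, so for general polyhedra the result may over-approximate the projection of the integer points.

```python
from itertools import product

def eliminate(constraints, v):
    """Project out variable index v from constraints of the form
    sum(c[k] * x[k]) + c[-1] >= 0 (the last entry is the constant term)."""
    lower = [c for c in constraints if c[v] > 0]   # lower bounds on x[v]
    upper = [c for c in constraints if c[v] < 0]   # upper bounds on x[v]
    result = [c for c in constraints if c[v] == 0]
    for lo, up in product(lower, upper):
        # Positive combination of the two rows that cancels x[v].
        combined = tuple(-up[v] * a + lo[v] * b for a, b in zip(lo, up))
        # Keep the row unless it is trivially true (0 >= -k with k >= 0).
        if any(combined[:-1]) or combined[-1] < 0:
            result.append(combined)
    return result

# Variables are ordered (i, j, N); the last tuple entry is the constant term.
triangle = [
    (-1, 0, 1, 0),   # N - i >= 0
    (0, 1, 0, 0),    # j >= 0
    (1, -1, 0, 0),   # i - j >= 0
]
outer_i = eliminate(triangle, 1)   # project out j: bounds for an outer i loop
outer_j = eliminate(triangle, 0)   # project out i: bounds for an outer j loop
print(outer_i)
print(outer_j)
```

Here `outer_i` encodes 0 ≤ i ≤ N and `outer_j` encodes 0 ≤ j ≤ N. The bounds of the inner loop are read off the original constraints that mention the inner variable.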


Single Polyhedron Example

What is the loop nest for lex. scan?

[Figure: triangular domain in the (i, j) plane bounded by i≤N, j≥0, i-j≥0]

for i = 0 .. N
  for j = 0 .. i
    S;


Single Polyhedron Example

What is the loop nest for permuted case? j as the outer loop

[Figure: triangular domain in the (i, j) plane bounded by i≤N, j≥0, i-j≥0]

for j = 0 .. N
  for i = j .. N
    S;
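
The two loop nests can be compared directly. The Python sketch below (function names are illustrative, and N is fixed to a small value) enumerates the triangle in both orders and checks that only the order differs, not the set of points.

```python
def scan_i_outer(N):
    # for i = 0 .. N, for j = 0 .. i
    return [(i, j) for i in range(N + 1) for j in range(i + 1)]

def scan_j_outer(N):
    # for j = 0 .. N, for i = j .. N
    return [(i, j) for j in range(N + 1) for i in range(j, N + 1)]

N = 3
lex = scan_i_outer(N)
permuted = scan_j_outer(N)
print(lex[:4])        # first points with i outermost
print(permuted[:4])   # first points with j outermost
print(sorted(lex) == sorted(permuted))
```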


Scanning Unions of Polyhedra
Consider scanning two statements

S1: [N]->{ S1[i]->[i] : 0≤i≤N }
S2: [N]->{ S2[i]->[i+5] : 0≤i≤N }

After CoB (change of basis):

S1: [N]->{ [i] : 0≤i≤N }
S2: [N]->{ [i] : 5≤i≤N+5 }

Naïve approach: bounding box

for (i=0 .. N+5)
  if (0<=i && i<=N) S1;
  if (5<=i && i<=N+5) S2;


Slightly Better than BBox

Make disjoint domains

for (i=0 .. 4)
  S1;
for (i=5 .. N)
  S1;
  S2;
for (i=N+1 .. N+5)
  S2;

But this is also problematic: code size can quickly grow

S1: [N]->{ S1[i]->[i] : 0≤i<N }
S2: [N]->{ S2[i]->[i] : 0≤i<M }
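
A small Python sketch, written for this note rather than taken from any code generator, makes the trade-off concrete: the guarded bounding-box loop and the three separated loops produce the same execution trace, but the first evaluates two guards per iteration while the second needs one loop per disjoint piece. It assumes both statements are scheduled over 0 ≤ i ≤ N with S2 shifted by 5, and N ≥ 4 so that the pieces appear in this order.

```python
def guarded(N):
    trace = []
    for i in range(0, N + 6):            # bounding box: 0 .. N+5
        if 0 <= i <= N:
            trace.append(("S1", i))
        if 5 <= i <= N + 5:
            trace.append(("S2", i))
    return trace

def separated(N):
    trace = []
    for i in range(0, 5):                # only S1 is active
        trace.append(("S1", i))
    for i in range(5, N + 1):            # both statements are active
        trace.append(("S1", i))
        trace.append(("S2", i))
    for i in range(N + 1, N + 6):        # only S2 is active
        trace.append(("S2", i))
    return trace

print(all(guarded(N) == separated(N) for N in range(4, 20)))
```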


QRW Algorithm

Key: Recursive Splitting
Given a set of n-D domains to scan
  - start at d=1 and context=parameters
  1. Restrict the domains to the context
  2. Project the domains to outer d-dimensions
  3. Make the projections disjoint
  4. Recurse for each disjoint projection
     - d=d+1, context=a piece of the projection
  5. Sort the resulting loops
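
The recursion can be sketched in Python for the special case where every domain is an axis-aligned box, so that projection and separation reduce to interval arithmetic. This is a toy model written for illustration, not the QRW implementation: the box restriction, the names `separate` and `generate`, the loop variables `x0`, `x1`, and the two example domains are all assumptions.

```python
def separate(intervals):
    """Split 1-D integer intervals {name: (lo, hi)} into disjoint pieces.
    Returns a list of (lo, hi, active_names) sorted by lo."""
    cuts = sorted({lo for lo, _ in intervals.values()} |
                  {hi + 1 for _, hi in intervals.values()})
    pieces = []
    for lo, nxt in zip(cuts, cuts[1:]):
        active = [s for s, (a, b) in intervals.items()
                  if a <= lo and nxt - 1 <= b]
        if active:
            pieces.append((lo, nxt - 1, active))
    return pieces

def generate(boxes, depth=0, indent=""):
    """boxes: {statement: [(lo, hi), ...]}, one interval per dimension."""
    ndim = len(next(iter(boxes.values())))
    if depth == ndim:
        return [f"{indent}{s};" for s in boxes]
    lines = []
    projections = {s: b[depth] for s, b in boxes.items()}   # project
    for lo, hi, active in separate(projections):            # separate, sorted
        lines.append(f"{indent}for (x{depth}={lo}..{hi})")
        inner = {s: boxes[s] for s in active}                # restrict
        lines += generate(inner, depth + 1, indent + "  ")  # recurse
    return lines

# Hypothetical example: two overlapping square domains.
boxes = {"S1": [(0, 4), (0, 4)], "S2": [(2, 6), (2, 6)]}
code = generate(boxes)
print("\n".join(code))
```

With general polyhedra the inner bounds depend on the outer iterators and the separated pieces are not automatically ordered, which is why the algorithm needs polyhedral operations and an explicit sorting step.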


Example

Scan the following domains

[Figure: domains of S1 and S2 in the (i, j) plane]

d=1, context=universe

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort



for (i=0..1) ...

for (i=2..6) ...


Example

Scan the following domains

[Figure: domains of S1 and S2 in the (i, j) plane]

d=2, context=0≤i≤1

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=0..1) ...

is refined to

for (i=0..1)
  for (j=0..4)
    S1;



Example

Scan the following domains

[Figure: domains of S1 and S2 in the (i, j) plane, with inner loops labeled L1, L2, L3, L4]

d=2, context=2≤i≤6

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=2..6) ...


Example

Scan the following domains

[Figure: domains of S1 and S2 in the (i, j) plane, with inner loops labeled L1, L2, L3, L4]

d=2, context=2≤i≤6

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=2..6)
  L2
  L1
  L3
  L4


CLooG: Chunky Loop Generator
A few problems in QRW Algorithm
  - high complexity
  - code size is not controlled
CLooG uses:
  - pattern matching to avoid costly polyhedral operations during separation
  - may stop the recursion at some depth and generate loops with guards to reduce size


Tiled Code Generation

Tiling with fixed size
  - we did this earlier
Tiling with parametric size
  - problem: non-affine!


Tiling Review

What does the tiled code look like?

for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S;

With tile size ts, iterating over tile origins:

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S

With tile size ts, iterating over tile indices:

for (ti=0; ti<=floor(N/ts); ti++)
  for (tj=0; tj<=floor(N/ts); tj++)
    for (i=ti*ts; i<min(N+1,(ti+1)*ts); i++)
      for (j=tj*ts; j<min(N+1,(tj+1)*ts); j++)
        S
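
To see that both tiled forms cover the same iterations as the original nest, the following Python sketch (helper names are illustrative) enumerates all three versions for small N and ts. Note that the tile-index form multiplies the iterator ti by the tile size ts, which is what makes the bounds non-affine once ts becomes a parameter.

```python
def original(N):
    return [(i, j) for i in range(N + 1) for j in range(N + 1)]

def tiled_by_origin(N, ts):
    pts = []
    for ti in range(0, N + 1, ts):                 # ti is a tile origin
        for tj in range(0, N + 1, ts):
            for i in range(ti, min(N + 1, ti + ts)):
                for j in range(tj, min(N + 1, tj + ts)):
                    pts.append((i, j))
    return pts

def tiled_by_index(N, ts):
    pts = []
    for ti in range(0, N // ts + 1):               # ti is a tile index
        for tj in range(0, N // ts + 1):
            for i in range(ti * ts, min(N + 1, (ti + 1) * ts)):
                for j in range(tj * ts, min(N + 1, (tj + 1) * ts)):
                    pts.append((i, j))
    return pts

N, ts = 7, 3
print(tiled_by_origin(N, ts) == tiled_by_index(N, ts))
print(sorted(tiled_by_origin(N, ts)) == original(N))
```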


Two Approaches

Use fixed-size tiling
  - if the tile size is a constant, the code stays affine
  - pragmatic choice by many tools
Use non-polyhedral code generation
  - much better for tuning tile sizes
  - makes sense for semi-automatic tools


Difficulties in Tiled Code Gen
This is still a very simplified view
In practice, we tile after transformation
  - skewing, etc.
Let’s see the tiled iteration space with tvis

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S


Full Tiles, Inset / Outset

Partial tiles have a lot of control overhead
Challenges for parametric tiled code gen
  - make sure to scan the outset
  - but also separate the inset
  - use efficient point loops for inset
All without polyhedral analysis


Point Loops for Full/Partial Tile

Full Tile Point Loop

for (i=ti; i<ti+si; i++)
  for (j=tj; j<tj+sj; j++)
    for (k=tk; k<tk+sk; k++)
      ...

Partial/Empty Tile Point Loop

for (i=max(ti,...); i<min(ti+si,...); i++)
  for (j=max(tj,...); j<min(tj+sj,...); j++)
    for (k=max(tk,...); k<min(tk+sk,...); k++)
      if (...)
        ...
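
The switch between the two kinds of point loops can be sketched in Python for a deliberately simple case: a 2-D square domain 0 ≤ i, j ≤ N with a single tile size ts. The function name `scan_tile` and the square domain are assumptions made for this illustration; for a square, clamping the bounds is enough and no inner guard is needed.

```python
def scan_tile(ti, tj, ts, N):
    """Return (kind, points) for the tile with origin (ti, tj)."""
    points = []
    if 0 <= ti and ti + ts - 1 <= N and 0 <= tj and tj + ts - 1 <= N:
        # Full tile: bounds depend only on the tile origin and tile size.
        for i in range(ti, ti + ts):
            for j in range(tj, tj + ts):
                points.append((i, j))
        return "full", points
    # Partial or empty tile: every bound is clamped against the domain.
    for i in range(max(ti, 0), min(ti + ts, N + 1)):
        for j in range(max(tj, 0), min(tj + ts, N + 1)):
            points.append((i, j))
    return "partial", points

N, ts = 9, 4
for ti in range(0, N + 1, ts):
    for tj in range(0, N + 1, ts):
        kind, pts = scan_tile(ti, tj, ts, N)
        print((ti, tj), kind, len(pts))
```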


Progression of Parametric Tiling
Perfectly nested, single loop
  - TLoG [Renganarayana et al. 2007]
Multiple levels of tiling
  - HiTLoG [Renganarayana et al. 2007]
  - PrimeTile [Hartono 2009]
Parallelizing the tiles
  - DynTile [Hartono 2010]
  - D-Tiling [Kim 2011]


Computing the Outset

We start with some domain
  - expand in each dimension by (symbolic) tile size - 1
  - except for upper bounds

Domain: {[i,j]: 0≤i≤10 and i≤j≤i+10}

Outset: {[i,j]: -(ts-1)≤i≤10 and -(ts-1)+i≤j and -(ts-1)+j≤i+10}


Computing the Inset

We start with some domain
  - shrink in each dimension by (symbolic) tile size - 1
  - except for lower bounds

Domain: {[i,j]: 0≤i≤10 and i≤j≤i+10}

Inset: {[i,j]: 0≤i≤10-(ts-1) and i≤j-(ts-1) and j≤i+10-(ts-1)}
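
The outset and inset constraints for this example can be checked by brute force. The Python sketch below fixes ts to concrete values (the slides keep it symbolic), treats (ti, tj) as the origin of a ts × ts tile, and verifies two properties: every tile that contains a domain point has its origin in the outset, and a tile lies entirely inside the domain exactly when its origin is in the inset. All function names are illustrative.

```python
def in_domain(i, j):
    return 0 <= i <= 10 and i <= j <= i + 10

def in_outset(ti, tj, ts):
    e = ts - 1
    return -e <= ti <= 10 and ti - e <= tj and tj - e <= ti + 10

def in_inset(ti, tj, ts):
    e = ts - 1
    return 0 <= ti <= 10 - e and ti <= tj - e and tj <= ti + 10 - e

def check(ts, lo=-8, hi=24):
    """Test every tile origin in [lo, hi) x [lo, hi) for tile size ts."""
    for ti in range(lo, hi):
        for tj in range(lo, hi):
            inside = [in_domain(i, j)
                      for i in range(ti, ti + ts)
                      for j in range(tj, tj + ts)]
            if any(inside):
                assert in_outset(ti, tj, ts)             # non-empty tiles are visited
            assert in_inset(ti, tj, ts) == all(inside)   # inset = full tiles
    return True

print(check(4))
```

The outset may also contain origins of empty tiles, which is why the partial/empty point loop still needs its bounds and guards.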


Syntactic Manipulation

We cannot use polyhedral code generators
  - so back to modifying AST
Modify the loop bounds to
  - get loops that visit outset
  - get guards to switch point-loops
Up to here is HiTLoG/PrimeTile


Problem: Parallelization

After tiling, there is parallelism
  - However, it requires skewing of tiles
We need non-polyhedral skewing
The key equation:

[Equation: not captured in this transcript]

where
  - d: number of tiled dimensions
  - ti: tile origins
  - ts: tile sizes


D-Tiling

The equation enables skewing of tiles
  - If either the time or one of the tile origins is unknown, it can be computed from the others

Generated Code: (tix is the (d-1)th tile origin)

for (time=start:end)
  for (ti1=ti1LB:ti1UB)
    …
      for (tix=tixLB:tixUB) {
        tid = f(time, ti1, …, tix);
        //compute tile ti1,ti2,…,tix,tid
      }
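
The equation itself does not appear in this text. As a sketch only, assume it has the usual wavefront form time = ti1/ts1 + ... + tid/tsd, with tile origins that are multiples of the tile sizes. Under that assumption, a 2-D Python version of the generated loop structure (the function name and the rectangular tile space are hypothetical) solves the equation for the last tile origin:

```python
def wavefront_schedule(n1, n2, ts1, ts2):
    """List (time, ti1, ti2) for tile origins ti1 in [0, n1), ti2 in [0, n2)."""
    nt1 = -(-n1 // ts1)            # number of tiles along dimension 1 (ceiling)
    nt2 = -(-n2 // ts2)            # number of tiles along dimension 2 (ceiling)
    schedule = []
    for time in range(nt1 + nt2 - 1):
        for ti1 in range(0, n1, ts1):
            # Solve the assumed equation time = ti1/ts1 + ti2/ts2 for ti2.
            ti2 = (time - ti1 // ts1) * ts2
            if 0 <= ti2 < n2:      # skip tiles outside the tiled space
                schedule.append((time, ti1, ti2))
    return schedule

schedule = wavefront_schedule(10, 7, 4, 3)
time_of = {(ti1, ti2): t for t, ti1, ti2 in schedule}
for t, ti1, ti2 in schedule:
    print(t, (ti1, ti2))
```

Tiles that share a `time` value have no tile between them along either axis direction, so with dependences pointing backwards in every dimension they can be executed in parallel.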


Distributed Memory Parallelization
Problems implicitly handled by the shared memory now need explicit treatment
Communication
  - Which processors need to send/receive?
  - Which data to send/receive?
  - How to manage communication buffers?
Data partitioning
  - How do you allocate memory across nodes?


MPI Code Generator

Distributed Memory Parallelization
  - Tiling based
  - Parameterized tile sizes
  - C+MPI implementation
Uniform dependences as key enabler
  - Many affine dependences can be uniformized
Shared memory performance carried over to distributed memory
  - Scales as well as PLuTo but to multiple nodes


Related Work (Polyhedral)

Polyhedral Approaches
  - Initial idea [Amarasinghe1993]
  - Analysis for fixed-size tiling [Claßen2006]
  - Further optimization [Bondhugula2011]
“Brute Force” polyhedral analysis for handling communication
  - No hope of handling parametric tile size
  - Can handle arbitrary affine programs


Outline

Introduction
“Uniform-ness” of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
Comparison against PLuTo with PolyBench
Conclusions and Future Work


Affine vs Uniform

Affine Dependences: f = Ax+b
  - Examples: (i,j->j,i), (i,j->i,i), (i->0)
Uniform Dependences: f = Ix+b
  - Examples: (i,j->i-1,j), (i->i-1)
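
The distinction is purely about the linear part A. The Python sketch below encodes the five example dependences as (A, b) pairs, an encoding chosen here for illustration, and classifies each one by testing whether A is the identity matrix.

```python
def is_uniform(A):
    """True when the linear part A of f(x) = Ax + b is the identity matrix."""
    n = len(A)
    return all(len(row) == n for row in A) and all(
        A[r][c] == (1 if r == c else 0) for r in range(n) for c in range(n))

def apply(A, b, x):
    """Evaluate f(x) = Ax + b for an integer point x."""
    return tuple(sum(a * v for a, v in zip(row, x)) + off
                 for row, off in zip(A, b))

deps = {
    "(i,j->j,i)":   ([[0, 1], [1, 0]], [0, 0]),
    "(i,j->i,i)":   ([[1, 0], [1, 0]], [0, 0]),
    "(i->0)":       ([[0]], [0]),
    "(i,j->i-1,j)": ([[1, 0], [0, 1]], [-1, 0]),
    "(i->i-1)":     ([[1]], [-1]),
}
for name, (A, b) in deps.items():
    print(name, "uniform" if is_uniform(A) else "affine, not uniform")
```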


Uniformization

[Figure: the dependence (i->0) shown before uniformization, and the dependence (i->i-1) shown after]


Uniformization

Uniformization is a classic technique
  - “solved” in the 1980’s
  - has been “forgotten” in the multi-core era
Any affine dependence can be uniformized
  - by adding a dimension [Roychowdhury1988]
Nullspace pipelining
  - simple technique for uniformization
  - many dependences are uniformized
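
A minimal Python sketch of the pipelining idea, for the one-dimensional broadcast (i->0): instead of every iteration reading the value produced at iteration 0, the value is passed from iteration i-1 to iteration i, which turns the dependence into the uniform (i->i-1). The computation `a0 + i` and the array name `pipe` are made up for this example.

```python
def broadcast(a0, n):
    # Every iteration reads the same source value: dependence (i -> 0).
    return [a0 + i for i in range(n)]

def pipelined(a0, n):
    # The value travels along i one step at a time: dependence (i -> i-1).
    pipe = [None] * n
    out = []
    for i in range(n):
        pipe[i] = a0 if i == 0 else pipe[i - 1]
        out.append(pipe[i] + i)
    return out

print(broadcast(10, 5) == pipelined(10, 5))
```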


Uniformization and Tiling

Uniformization does not influence tilability


PolyBench [Pouchet2010]

Collection of 30 polyhedral kernels
  - Proposed by Pouchet as a benchmark for polyhedral compilation
  - Goal: Small enough benchmark so that individual results are reported; no averages
Kernels from:
  - data mining
  - linear algebra kernels, solvers
  - dynamic programming
  - stencil computations


Uniform-ness of PolyBench

5 of them are “incorrect” and are excluded
Embedding: Match dimensions of statements
Phase Detection: Separate program into phases
  - Output of a phase is used as inputs to the other

Stage                  | Number of Fully Uniform Programs
Uniform at Start       | 8/25 (32%)
After Embedding        | 13/25 (52%)
After Pipelining       | 21/25 (84%)
After Phase Detection  | 24/25 (96%)



Basic Strategy: Tiling

We focus on tilable programs


Dependences in Tilable Space

All in the non-positive direction


Wave-front Parallelization

All tiles with the same color can run in parallel


Assumptions

Uniform in at least one of the dimensions
  - The uniform dimension is made outermost
Tilable space is fully permutable
One-dimensional processor allocation
Large enough tile sizes
  - Dependences do not span multiple tiles
Then, communication is extremely simplified


Processor Allocation

Outermost tile loop is distributed

[Figure: tiled (i1, i2) iteration space partitioned among processors P0, P1, P2, P3]


Values to be Communicated

Faces of the tiles (may be thicker than 1)

[Figure: tiled (i1, i2) iteration space on processors P0, P1, P2, P3, showing the tile faces to be communicated]


Naïve Placement of Send and Receive Codes

Receiver is the consumer tile of the values

[Figure: tiled (i1, i2) iteration space on processors P0, P1, P2, P3, with send (S) and receive (R) markers on the tiles]


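Problems in Naïve Placement

Receiver is in the next wave-front time

[Figure: tiled (i1, i2) iteration space on processors P0, P1, P2, P3, with send (S) and receive (R) markers and wave-front times t=0, t=1, t=2, t=3]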


Problems in Naïve Placement
Receiver is in the next wave-front time
Number of communications “in-flight” = amount of parallelism
MPI_Send can deadlock
  - May not return control if system buffer is full
Asynchronous communication is required
  - Must manage your own buffer
  - required buffer = amount of parallelism, i.e., number of virtual processors


Proposed Placement of Send and Receive codes

Receiver is one tile below the consumer

[Figure: tiled (i1, i2) iteration space on processors P0, P1, P2, P3, with send (S) and receive (R) markers on the tiles]


Placement within a Tile

Naïve Placement:
  - Receive -> Compute -> Send
Proposed Placement:
  - Issue asynchronous receive (MPI_Irecv)
  - Compute
  - Issue asynchronous send (MPI_Isend)
  - Wait for values to arrive
Overlap of computation and communication
Only two buffers per physical processor

[Figure: timeline of the overlap, with one Recv Buffer and one Send Buffer]


Evaluation

Compare performance with PLuTo
  - Shared memory version with same strategy
Cray: 24 cores per node, up to 96 cores
Goal: Similar scaling as PLuTo
Tile sizes are searched with educated guesses
PolyBench
  - 7 are too small
  - 3 cannot be tiled or have limited parallelism
  - 9 cannot be used due to PLuTo/PolyBench issue


Performance Results

Linear extrapolation from speed up of 24 cores
Broadcast cost at most 2.5 seconds