people.rennes.inria.fr/Tomofumi.Yuki/.../yuki-dissertation-slides.pdf

Beyond Shared Memory Loop Parallelism in the Polyhedral Model

Tomofumi Yuki Ph.D Dissertation

10/30 2012

The Problem

Figure from www.spiral.net/problem.html

2

Parallel Processing

- A small niche in the past, a hot topic today
- Ultimate solution: automatic parallelization
  - Extremely difficult problem
  - After decades of research, limited success
- Other solutions: programming models
  - Libraries (MPI, OpenMP, CnC, TBB, etc.)
  - Parallel languages (UPC, Chapel, X10, etc.)
  - Domain-specific languages (stencils, etc.)

3

Contributions

[Diagram labels: Polyhedral Model (40+ years of research; linear algebra, ILP; CLooG, ISL, Omega, PLuTo); AlphaZ; MDE; MPI Code Generation; Polyhedral X10; X10]

4

Polyhedral State-of-the-art

- Tiling-based parallelization
- Extensions to parameterized tile sizes
  - First step [Renganarayana2007]
  - Parallelization + imperfectly nested loops [Hartono2010, Kim2010]
- PLuTo approach is now used by many people
  - Wave-front of tiles: better strategy than maximum parallelism [Bondhugula2008]
- Many advances in shared memory context

5

How far can shared memory go?

- The Memory Wall is still there
- Does it make sense for 1000 cores to share memory? [Berkeley View, Shalf 07, Kumar 05]
  - Power
  - Coherency overhead
  - False sharing
  - Hierarchy?
  - Data volume (tera- to petabytes)

6

Distributed Memory Parallelization

- Problems implicitly handled by shared memory now need explicit treatment
- Communication
  - Which processors need to send/receive?
  - Which data to send/receive?
  - How to manage communication buffers?
- Data partitioning
  - How do you allocate memory across nodes?

7

MPI Code Generator

- Distributed memory parallelization
  - Tiling based
  - Parameterized tile sizes
  - C+MPI implementation
- Uniform dependences as key enabler
  - Many affine dependences can be uniformized
- Shared memory performance carried over to distributed memory
  - Scales as well as PLuTo but to multiple nodes

8

Related Work (Polyhedral)

- Polyhedral approaches
  - Initial idea [Amarasinghe1993]
  - Analysis for fixed-size tiling [Claßen2006]
  - Further optimization [Bondhugula2011]
- “Brute force” polyhedral analysis for handling communication
  - No hope of handling parametric tile sizes
  - Can handle arbitrary affine programs

9

Outline

- Introduction
- “Uniform-ness” of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work

10

Affine vs Uniform

- Affine dependences: f = Ax + b
  - Examples
    - (i,j->j,i)
    - (i,j->i,i)
    - (i->0)
- Uniform dependences: f = Ix + b
  - Examples
    - (i,j->i-1,j)
    - (i->i-1)
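The distinction can be checked mechanically. The sketch below is only an illustration, not part of the original slides: it represents a dependence function f = Ax + b as a matrix `A` and a vector `b` (both names are assumptions introduced here) and calls the dependence uniform when `A` is the identity.

```python
def is_uniform(A):
    """Return True if the linear part A of f(x) = A x + b is the identity."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False  # a non-square A cannot be the identity
    return all(A[r][c] == (1 if r == c else 0)
               for r in range(n) for c in range(n))


def apply_dependence(A, b, x):
    """Evaluate f(x) = A x + b for an iteration vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi
            for row, bi in zip(A, b)]


# The examples from the slide, written as (A, b) pairs.
swap = ([[0, 1], [1, 0]], [0, 0])          # (i,j -> j,i)
diagonal = ([[1, 0], [1, 0]], [0, 0])      # (i,j -> i,i)
broadcast = ([[0]], [0])                   # (i -> 0)
shift_2d = ([[1, 0], [0, 1]], [-1, 0])     # (i,j -> i-1,j)
shift_1d = ([[1]], [-1])                   # (i -> i-1)
```

Only the last two examples pass the check, matching the slide's split between affine and uniform dependences.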

11

Uniformization

- (i->0)
- (i->i-1)

12

Uniformization

- Uniformization is a classic technique
  - “solved” in the 1980s
  - has been “forgotten” in the multi-core era
- Any affine dependence can be uniformized
  - by adding a dimension [Roychowdhury1988]
- Nullspace pipelining
  - simple technique for uniformization
  - many dependences are uniformized
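As a rough illustration of pipelining, consider the broadcast dependence (i->0) from the earlier slide: every iteration reads the same value. Iterations that read the same value differ by a vector in the nullspace of the dependence's linear part, so the value can be passed along that direction instead. The sketch below is a toy example under that reading; the function names and the computation `a[0] + i` are made up for illustration.

```python
def with_broadcast(a, n):
    # Every iteration i reads a[0]: the dependence is (i -> 0),
    # which is affine but not uniform.
    return [a[0] + i for i in range(n)]


def with_pipelining(a, n):
    # Introduce a propagation variable p that carries a[0] along i:
    # p[0] = a[0], p[i] = p[i-1]. Each iteration now depends only on
    # its neighbour, i.e. the uniform dependence (i -> i-1).
    p = [None] * n
    out = []
    for i in range(n):
        p[i] = a[0] if i == 0 else p[i - 1]
        out.append(p[i] + i)
    return out
```

Both versions compute the same values; only the shape of the dependence changes.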

13

Uniformization and Tiling

- Uniformization does not influence tilability

14

PolyBench [Pouchet2010]

- Collection of 30 polyhedral kernels
- Proposed by Pouchet as a benchmark for polyhedral compilation
  - Goal: small enough benchmark so that individual results are reported; no averages
- Kernels from:
  - data mining
  - linear algebra kernels, solvers
  - dynamic programming
  - stencil computations

15

Uniform-ness of PolyBench

- 5 of the 30 kernels are “incorrect” and are excluded, leaving 25
- Embedding: match dimensions of statements
- Phase detection: separate program into phases
  - Output of a phase is used as input to the other

Stage                    Number of Fully Uniform Programs
Uniform at Start         8/25 (32%)
After Embedding          13/25 (52%)
After Pipelining         21/25 (84%)
After Phase Detection    24/25 (96%)

16

Outline

- Introduction
- Uniform-ness of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work

17

Basic Strategy: Tiling

- We focus on tilable programs

18

Dependences in Tilable Space

- All in the non-positive direction

19

Wave-front Parallelization

- All tiles with the same color can run in parallel
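A minimal sketch of how such groups can be formed, assuming a two-dimensional tile grid in which each tile depends only on its neighbours at (t1-1, t2) and (t1, t2-1). The function names and the choice of t1 + t2 as the wave-front time are assumptions made for illustration.

```python
from collections import defaultdict


def wavefronts(n1, n2):
    """Group the tiles of an n1 x n2 tile grid by wave-front time t1 + t2."""
    groups = defaultdict(list)
    for t1 in range(n1):
        for t2 in range(n2):
            groups[t1 + t2].append((t1, t2))
    return [groups[time] for time in sorted(groups)]


def predecessors(tile):
    """Tiles that must finish first, given non-positive dependences."""
    t1, t2 = tile
    return [(a, b) for a, b in ((t1 - 1, t2), (t1, t2 - 1))
            if a >= 0 and b >= 0]


def same_front_is_independent(n1, n2):
    """Check that no tile depends on another tile of its own wave-front."""
    for front in wavefronts(n1, n2):
        members = set(front)
        if any(pred in members for tile in front for pred in predecessors(tile)):
            return False
    return True
```

Every predecessor of a tile has a strictly smaller wave-front time, so tiles within one group are independent of each other.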

20

Assumptions

- Uniform in at least one of the dimensions
  - The uniform dimension is made outermost
- Tilable space is fully permutable
- One-dimensional processor allocation
- Large enough tile sizes
  - Dependences do not span multiple tiles
- Then, communication is extremely simplified

21

Processor Allocation

- Outermost tile loop is distributed

[Figure: tiled iteration space with axes i1 and i2; processors P0, P1, P2, P3 laid out along i1]

22

Values to be Communicated

- Faces of the tiles (may be thicker than 1)

[Figure: tiled iteration space with axes i1 and i2; processors P0, P1, P2, P3 laid out along i1]

23

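Naïve Placement of Send and Receive Codes

- Receiver is the consumer tile of the values

[Figure: tiled iteration space with axes i1 and i2; processors P0, P1, P2, P3; sending tiles marked S and receiving tiles marked R]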

24

Problems in Naïve Placement

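- Receiver is in the next wave-front time

[Figure: tiled iteration space with axes i1 and i2; processors P0, P1, P2, P3; sending tiles marked S and receiving tiles marked R; wave-front times t=0, t=1, t=2, t=3]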

25

Problems in Naïve Placement

- Receiver is in the next wave-front time
- Number of communications “in-flight” = amount of parallelism
- MPI_Send can deadlock
  - May not return control if system buffer is full
- Asynchronous communication is required
  - Must manage your own buffer
  - Required buffer = amount of parallelism
    - i.e., number of virtual processors

26

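Proposed Placement of Send and Receive Codes

- Receiver is one tile below the consumer

[Figure: tiled iteration space with axes i1 and i2; processors P0, P1, P2, P3; sending tiles marked S and receiving tiles marked R]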

27

Placement within a Tile

- Naïve placement:
  - Receive -> Compute -> Send
- Proposed placement:
  - Issue asynchronous receive (MPI_Irecv)
  - Compute
  - Issue asynchronous send (MPI_Isend)
  - Wait for values to arrive
- Overlap of computation and communication
- Only two buffers per physical processor
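The ordering of the proposed placement can be emulated without MPI. The sketch below is an assumption-laden toy, not the generated C+MPI code: Python threads stand in for processors, a `queue.Queue` stands in for the network, a future stands in for the MPI_Irecv request, and each "tile" is a single number. It only shows the order of the four steps and that one receive buffer and one send buffer per processor suffice.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def run_wavefront(num_procs, tiles_per_proc):
    """Emulate the proposed per-tile ordering on a 1-D chain of processors."""
    # links[p] carries tile faces from processor p-1 to processor p.
    links = [queue.Queue() for _ in range(num_procs)]
    result = [[0] * tiles_per_proc for _ in range(num_procs)]

    def worker(p):
        # One helper thread per processor plays the role of the pending receive.
        with ThreadPoolExecutor(max_workers=1) as nic:
            # Prologue: the face for the first tile is received up front.
            recv_buf = links[p].get() if p > 0 else 0
            below = 0  # value of the previous tile on the same processor
            for t in range(tiles_per_proc):
                pending = None
                if p > 0 and t + 1 < tiles_per_proc:
                    # 1. Post the receive for the NEXT tile (like MPI_Irecv).
                    pending = nic.submit(links[p].get)
                # 2. Compute the current tile (a toy one-value "tile").
                send_buf = recv_buf + below + 1
                result[p][t] = below = send_buf
                if p + 1 < num_procs:
                    # 3. Hand the face to the right neighbour (like MPI_Isend).
                    links[p + 1].put(send_buf)
                if pending is not None:
                    # 4. Wait for the posted receive to complete (like MPI_Wait).
                    recv_buf = pending.result()

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(num_procs)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result


def sequential_reference(num_procs, tiles_per_proc):
    """Same toy computation, executed in plain sequential order."""
    ref = [[0] * tiles_per_proc for _ in range(num_procs)]
    for p in range(num_procs):
        for t in range(tiles_per_proc):
            left = ref[p - 1][t] if p > 0 else 0
            below = ref[p][t - 1] if t > 0 else 0
            ref[p][t] = left + below + 1
    return ref
```

Because the receive for the next tile is posted before the current tile is computed, the transfer can proceed while the processor is busy, which is the overlap the slide refers to.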

[Figure labels: Overlap; Recv Buffer; Send Buffer]

28

Evaluation

- Compare performance with PLuTo
  - Shared memory version with same strategy
- Cray: 24 cores per node, up to 96 cores
- Goal: similar scaling as PLuTo
- Tile sizes are searched with educated guesses
- PolyBench
  - 7 are too small
  - 3 cannot be tiled or have limited parallelism
  - 9 cannot be used due to PLuTo/PolyBench issue

29

Performance Results

30

- Linear extrapolation from speed up of 24 cores
- Broadcast cost at most 2.5 seconds

[Figure: “Summary of AlphaZ Performance Comparison with PLuTo”. Y-axis: speed up with respect to PLuTo with 1 core (0 to 100). Benchmarks: correlation, covariance, 2mm, 3mm, gemm, syr2k, syrk, lu, fdtd-2d, jacobi-2d-imper, seidel-2d. Series: PLuTo 24 cores; AlphaZ 24 cores; PLuTo 96 cores (extrapolated); AlphaZ 96 cores (No Bcast)]

AlphaZ System

- System for polyhedral design space exploration
- Key features not explored by other tools:
  - Memory allocation
  - Reductions
- Case studies to illustrate the importance of unexplored design space [LCPC2012]
- Polyhedral Equational Model [WOLFHPC2012]
- MDE applied to compilers [MODELS2011]

31

Polyhedral X10 [PPoPP2013?]

- Work with Vijay Saraswat and Paul Feautrier
- Extension of array data flow analysis to X10
  - supports finish/async but not clocks
- finish/async can express more than doall
  - Focus of polyhedral model so far: doall
- Dataflow result is used to detect races
  - With polyhedral precision, we can guarantee program regions to be race-free

32

Conclusions

- Polyhedral compilation has lots of potential
  - Memory/reductions are not explored
  - Successes in automatic parallelization
  - Race-free guarantee
- Handling arbitrary affine programs may be overkill
  - Uniformization makes a lot of sense
  - Distributed memory parallelization made easy
  - Can handle most of PolyBench

33

Future Work

- Many direct extensions
  - Hybrid MPI+OpenMP with multi-level tiling
  - Partial uniformization to satisfy pre-condition
  - Handling clocks in Polyhedral X10
- Broader applications of the polyhedral model
  - Approximations
  - Larger granularity: blocks of computations instead of statements
  - Abstract interpretations [Alias2010]

34

Acknowledgements

- Advisor: Sanjay Rajopadhye
- Committee members:
  - Wim Böhm
  - Michelle Strout
  - Edwin Chong
- Unofficial co-advisor: Steven Derrien
- Members of
  - Mélange, HPCM, CAIRN
  - Dave Wonnacott, Haverford students

35

Backup Slides

36

Uniformization and Tiling

- Tilability is preserved

37

D-Tiling Review [Kim2011]

- Parametric tiling for shared memory
- Uses non-polyhedral skewing of tiles
  - Required for wave-front execution of tiles
- The key equation:
  - $time = \sum_{i=1}^{d} \frac{ti_i}{ts_i}$
  - where
    - d: number of tiled dimensions
    - ti: tile origins
    - ts: tile sizes

38

D-Tiling Review cont.

- The equation enables skewing of tiles
  - If one of the time or the tile origins is unknown, it can be computed from the others
- Generated code (tix is the (d-1)-th tile origin):

39

for (time=start:end)
  for (ti1=ti1LB:ti1UB)
    …
      for (tix=tixLB:tixUB) {
        tid = f(time, ti1, …, tix);
        // compute tile ti1,ti2,…,tix,tid
      }
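One possible reading of the function f in the generated code is that it solves the wave-front equation (time equals the sum of tile origin divided by tile size over all tiled dimensions) for the last tile origin. The sketch below follows that reading and assumes every tile origin is an exact multiple of its tile size; the function names are hypothetical and not taken from D-Tiling.

```python
def wavefront_time(origins, sizes):
    """Wave-front time of a tile: sum of (tile origin / tile size)."""
    return sum(ti // ts for ti, ts in zip(origins, sizes))


def last_tile_origin(time, outer_origins, sizes):
    """Solve the wave-front equation for the d-th tile origin.

    outer_origins holds ti_1 .. ti_(d-1); sizes holds ts_1 .. ts_d.
    Assumes each origin is an exact multiple of its tile size.
    """
    partial = sum(ti // ts for ti, ts in zip(outer_origins, sizes))
    return (time - partial) * sizes[-1]


# Example with three tiled dimensions and tile sizes 4 x 8 x 16.
sizes = (4, 8, 16)
time = wavefront_time((8, 16, 48), sizes)       # 2 + 2 + 3
tid = last_tile_origin(time, (8, 16), sizes)    # recovers the third origin
```

Calling the same function with `time + 1` gives the last origin of the tile in the next wave-front time for the same outer origins.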

Placement of Receive Code using D-Tiling

- Slight modification to the use of the equation
  - Visit tiles in the next wave-front time

40

for (time=start:end)
  for (ti1=ti1LB:ti1UB)
    …
      for (tix=tixLB:tixUB) {
        tidNext = f(time+1, ti1, …, tix);
        // receive and unpack buffer for
        // tile ti1,ti2,…,tix,tidNext
      }

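Proposed Placement of Send and Receive Codes

- Receiver is one tile below the consumer

[Figure: tiled iteration space with axes i1 and i2; processors P0, P1, P2, P3; sending tiles marked S and receiving tiles marked R]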

41

Extensions to Schedule Independent Mapping

- Schedule Independent Mapping [Strout1998]
  - Universal Occupancy Vectors (UOVs)
  - Legal storage mapping for any legal execution
  - Uniform dependence programs only
- Universality of UOVs can be restricted
  - e.g., to tiled execution
- For tiled execution, shortest UOV can be found without any search

42

LU Decomposition

[Figure: “lu”. X-axis: number of cores (0 to 96). Y-axis: speed up with respect to PLuTo with 1 core (0 to 96). Series: PLuTo; AlphaZ; AlphaZ (No Bcast)]

43

seidel-2d

[Figure: “seidel-2d”. X-axis: number of cores (0 to 96). Y-axis: speed up with respect to PLuTo with 1 core (0 to 96). Legend includes PLuTo]


44

seidel-2d (no 8x8x8)

[Figure: “seidel-2d (without 8x8x8 tiles)”. X-axis: number of cores (0 to 96). Y-axis: speed up with respect to PLuTo with 1 core (0 to 96). Legend includes PLuTo]


45

jacobi-2d-imper

[Figure: “jacobi-2d-imper”. X-axis: number of cores (0 to 96). Y-axis: speed up with respect to PLuTo with 1 core (0 to 96). Legend includes PLuTo]


46

Related Work (Non-Polyhedral)

- Global communications [Li1990]
  - Translation from shared memory programs
  - Pattern matching for global communications
- Paradigm [Banerjee1995]
  - No loop transformations
  - Finds parallel loops and inserts necessary communications
- Tiling based [Goumas2006]
  - Perfectly nested uniform dependences

47

adi.c: Performance

- PLuTo does not scale because the outer loop is not tiled

[Figure, two panels. “Speedup of Optimized Code on Xeon”: X-axis number of threads (cores), up to 8; Y-axis speed up compared to original code, up to 8. “Speedup of Optimized Code on Cray XT6m”: X-axis number of threads (cores), up to 24; Y-axis speed up compared to original code, up to 24. Series in both panels: AlphaZ; PLuTo]

48

UNAfold: Performance

- Complexity reduction is empirically confirmed

[Figure, two panels. “Execution Time of UNAfold”: X-axis sequence length (N), 200 to 1400; Y-axis execution time in seconds, 0 to 2500. “Log plot of Execution Time”: X-axis log of sequence length, 2.0 to 3.2; Y-axis log of execution time, 0 to 8; reference lines y = 4x + b1 and y = 3x + b2. Series in both panels: original; simplified]
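On a log-log plot, a running time proportional to N^k appears as a straight line of slope k, which is why the reference lines y = 4x + b1 and y = 3x + b2 are informative. The sketch below estimates that slope with a least-squares fit. The timings it is applied to are synthetic values generated from exact power laws, made up for illustration; they are not the measurements from the slide.

```python
import math


def loglog_slope(ns, times):
    """Least-squares slope of log10(time) against log10(n)."""
    xs = [math.log10(n) for n in ns]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic timings (assumed, not measured): exact N^4 and N^3 growth.
lengths = [200, 400, 800, 1400]
quartic = [1e-9 * n ** 4 for n in lengths]
cubic = [1e-7 * n ** 3 for n in lengths]

slope_quartic = loglog_slope(lengths, quartic)
slope_cubic = loglog_slope(lengths, cubic)
```

With measured data the fitted slopes would only be approximate, but a drop from about 4 to about 3 is the signature of the reduction in exponent that the two reference lines suggest.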

49

Contributions

- The AlphaZ System
  - Polyhedral compiler with full control to the user
  - Equational view of the polyhedral model
- MPI Code Generator
  - The first code generator with parametric tiling
  - Double buffering
- Polyhedral X10
  - Extension to the polyhedral model
  - Race-free guarantee of X10 programs

50
