
Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories (PPoPP 2008)

Page 1

Muthu Baskaran¹, Uday Bondhugula¹, Sriram Krishnamoorthy¹, J. Ramanujam², Atanas Rountev¹, P. Sadayappan¹

¹Department of Computer Science & Engineering, The Ohio State University

²Department of Electrical and Computer Engineering, Louisiana State University

Page 2

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 3

Single-processor performance
◦ Improved by ~50%/yr for almost two decades (clock speed, ILP, …)
◦ Clock speed increased over 100x

Limits to single-processor performance growth
◦ Increase in power density
◦ Flattening of clock speed due to power limitation

Transistor density continues to rise unabated; multiple cores are now the best option for sustained performance growth

Page 4

Need to optimize memory bandwidth and latency in multi-core architectures

Traditional solution: introduce a cache hierarchy

Drawback
◦ Caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times

Solution in many modern architectures: fast on-chip explicitly managed memory, i.e., scratchpad memory (local memory store)

Page 5

Scratchpads
◦ Software-managed
 Control over data movement
 Easier to model performance
 Burden on programmer/compiler to manage and utilize
◦ Lower power per chip area required compared to cache

Some modern architectures having scratchpad memories
◦ GPU
◦ Cell
◦ MPSoC
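Software management is the key difference from a cache, and the pattern it imposes can be made concrete with a small sketch. The following plain-C fragment is illustrative only: the names, the TILE size, and the use of memcpy as a stand-in for a device-specific copy/DMA primitive are assumptions, not any particular architecture's API. It shows the explicit staging a scratchpad demands: move a block in, compute out of the local copy, move results back if needed.

```c
#include <string.h>

#define TILE 64

/* Stand-in for the fast local store; on real hardware this would be
   scratchpad memory, and the memcpy a DMA or copy intrinsic. */
static double local_buf[TILE];

double sum_staged(const double *global, int n) {
    double sum = 0.0;
    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? (n - base) : TILE;
        /* explicit "move in": the program, not the hardware, decides
           what is resident in the fast memory */
        memcpy(local_buf, global + base, (size_t)len * sizeof(double));
        /* compute strictly out of the local copy */
        for (int i = 0; i < len; i++)
            sum += local_buf[i];
        /* no "move out" needed here: this tile is read-only */
    }
    return sum;
}
```

Because every transfer is explicit, the transfer count and volume are visible in the code, which is what makes performance easier to model than with a hardware-managed cache.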

Page 6

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 7

Effective management of on-chip scratchpads in multi-core architectures
◦ Utilize the limited capacity of the scratchpad
◦ Optimize data movement

Effective computation mapping in many-core architectures with multiple levels of parallelism
◦ Exploit available parallelism
◦ Account for scratchpad capacity constraints

Page 8

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 9

Orchestration of data movement between off-chip global and on-chip scratchpad memory

Decisions on
◦ What data elements to move in and out of the scratchpad
◦ When to move data
◦ How to move data
◦ How to access the data elements copied to the scratchpad

Page 10

1. Allocation of storage space (as arrays) in the scratchpad memory for local copies

2. Determination of access functions of arrays in scratchpad memories

3. Generation of code for moving data between scratchpad (local) and off-chip (global) memories

Page 11

Targeted at affine programs
◦ Dense arrays
◦ Loop bounds – affine functions of outer loop variables, constants, and program parameters
◦ Array access functions – affine functions of surrounding loop variables, constants, and program parameters

Developed using the polyhedral model
◦ An algebraic framework for representing affine programs (statement domains, dependences, array access functions) and affine program transformations

Page 12

for (i=1; i<=4; i++)
  for (j=2; j<=4; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

Iteration vector: xS1 = (i, j)

Iteration domain IS1 (1 ≤ i ≤ 4, 2 ≤ j ≤ 4), written as affine inequalities:

  [  1  0  -1 ]   [ i ]
  [ -1  0   4 ] . [ j ]  ≥ 0
  [  0  1  -2 ]   [ 1 ]
  [  0 -1   4 ]

Access functions of the three references to a in S1:

  F1a(xS1) = [1 0; 0 1] . (i, j)T + (0, 0)T    (the reference a[i][j])
  F2a(xS1) = [0 1; 1 0] . (i, j)T + (0, 0)T    (the reference a[j][i])
  F3a(xS1) = [1 0; 0 1] . (i, j)T + (0, -1)T   (the reference a[i][j-1])

Data space of the first reference: DS1a = F1a(IS1)

(The slide also depicts IS1 as the lattice points bounded by i ≥ 1, i ≤ 4, j ≥ 2, j ≤ 4, and the data space of a as a grid spanning (0,0) to (m,m).)

Page 13

Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
◦ Access functions of array references may be non-uniformly generated

For architectures (e.g., the nVIDIA GeForce GPU) supporting direct data access from off-chip memory
◦ Estimate the extent of reuse of data to determine whether or not to copy it to the scratchpad
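The copy-or-not decision can be driven by a simple reuse estimate. The sketch below is a hypothetical illustration of the idea, not the paper's exact heuristic: copying a region into the scratchpad costs one off-chip transfer per distinct element, so staging only pays off when each element is accessed more than once on average.

```c
/* Reuse-based staging decision: total accesses to a region divided by
   the number of distinct elements gives the average reuse per element.
   Staging costs one transfer per element, so require reuse > 1. */
int worth_staging(long total_accesses, long distinct_elements) {
    if (distinct_elements <= 0)
        return 0;                      /* empty region: nothing to copy */
    return total_accesses > distinct_elements;
}
```

For example, a region of 50 elements read 200 times (reuse 4) is worth staging, while a region streamed through exactly once is better accessed directly from off-chip memory.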

Page 14

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

◦ Find the set of all data spaces accessed by all references to an array (each data space is the image of the iteration space of the statement holding the reference under the reference's access function)
◦ Partition the set of all data spaces into maximal disjoint non-overlapping subsets of data spaces
◦ Find the bounding box of each partition of data spaces
◦ Allocate a local memory array for each bounding box

For array A this yields two local arrays:

Local array LA0: lb(i) = 10; ub(i) = 14; lb(j) = 11; ub(j) = 20
Local array LA1: lb(i) = 20; ub(i) = 28; lb(j) = 11; ub(j) = 15

(The slide's figure shades these two regions of A, rows 10–14 and 20–28 over columns 11–20.)

Page 15

◦ The array dimension in the scratchpad may be lower than the original array dimension, depending on the accessed data

◦ Access function in the local memory array: the original access function, or a reduced access function, with offsets given by the lower bounds (in each dimension) of the scratchpad array

Original code:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Code with accesses remapped to local arrays:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k=11; k<=20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}

Page 16

◦ Generation of loop structure: scanning of the polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of
 read references – for moving data into the scratchpad
 write references – for moving data out of the scratchpad

◦ Generation of the loop body (data movement statement): copy from a location in the scratchpad buffer to an off-chip memory location, or vice versa

/* Data move-in code */
for (i=10; i<=14; i++) {
  for (j=11; j<=20; j++)
    LA0[i-10][j-11] = A[i][j];
}
for (i=20; i<=28; i++) {
  for (j=max(i-13,11); j<=min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];
}

/* Data move-out code */
for (i=10; i<=14; i++) {
  for (j=11; j<=15; j++)
    A[i][j] = LA0[i-10][j-11];
}

Page 17

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 18

◦ Architectural components
 Slow off-chip (global) memory
 Two levels of parallelism
  – a set of multiprocessors
  – a set of processor cores in each multiprocessor
 Scratchpad on each multiprocessor, shared by its processor cores

(The slide's figure shows the off-chip memory connected to a row of multiprocessors, each with its own scratchpad.)

Page 19

◦ Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
 Finds tiling transformations, or hyperplanes, for sequences of imperfectly nested loops
 Enables communication-minimal parallelization and locality optimization
 Identifies loops to tile for parallelism and data locality

◦ Multiple levels of tiling for exploiting parallelism across multiple parallel levels

◦ Additional (sequential) tiling at each level with scratchpad memory, if the data required by a tile executing at that level exceeds the memory
 Data movement at the start and end of each sequential tile
 Synchronization points to ensure consistency

Page 20

// Original loop nest
FORALL i = 1, Ni
 FORALL j = 1, Nj
  FOR k = 1, WS
   FOR l = 1, WS
    S1
   END FOR
  END FOR
 END FORALL
END FORALL

// Multi-level tiled code
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
 FORALL jT = 1, Nj, Tj
  // Tiling to satisfy scratchpad memory limit
  FOR i' = iT, min(iT+Ti-1,Ni), ti'
   FOR j' = jT, min(jT+Tj-1,Nj), tj'
    FOR k' = 1, WS, tk'
     FOR l' = 1, WS, tl'
      <Data move-in code>
      // Tiling to distribute at the inner level
      FORALL it = i', min(i'+ti'-1,Ni), ti
       FORALL jt = j', min(j'+tj'-1,Nj), tj
        FOR i = it, min(it+ti-1,Ni)
         FOR j = jt, min(jt+tj-1,Nj)
          FOR k = k', min(k'+tk'-1,WS)
           FOR l = l', min(l'+tl'-1,WS)
            S1
           END FOR
          END FOR
         END FOR
        END FOR
       END FORALL
      END FORALL
      <Data move-out code>
     END FOR
    END FOR
   END FOR
  END FOR
 END FORALL
END FORALL

Page 21

◦ Handling scratchpad memory constraints

Cost model for data movement:

  C = N × (S + (V × L) / P)

  N – number of data movements
  S – synchronization cost per data movement
  V – number of elements per data movement (based on tile sizes)
  L – cost to transfer one element
  P – number of processes involved in the data movement

Tile size search formulation
 Constraint: memory requirement within the scratchpad limit
 Objective function: minimize the data movement cost C

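The search itself can be as simple as a sweep over candidate tile sizes, keeping the cheapest one whose footprint fits the scratchpad. Below is a hypothetical one-dimensional sketch using the cost model above; the footprint model (one scratchpad element per tile element) is an illustrative stand-in for the per-array memory terms, not the paper's formulation.

```c
/* One-dimensional tile-size search: minimize
     C = N * (S + V*L/P),  with N = ceil(n/t) movements of V = t elements,
   subject to the footprint (t elements) fitting in the scratchpad. */
int best_tile(int n, double S, double L, double P, int capacity_elems) {
    int best_t = 1;
    double best_c = 1e300;
    for (int t = 1; t <= n && t <= capacity_elems; t++) {
        double moves = (double)((n + t - 1) / t);    /* N = ceil(n/t) */
        double c = moves * (S + (double)t * L / P);  /* cost model C  */
        if (c < best_c) { best_c = c; best_t = t; }
    }
    return best_t;
}
```

With a high per-movement synchronization cost S the search pushes toward the largest tile that fits; with S = 0 the total volume dominates and any tile size that divides n is equally good.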

Page 22

Variables: tile sizes t1, t2, …, tm

• Loop nest of m loops with tile sizes t1, t2, …, tm
• nl local arrays
• Mj – memory (as a function of tile sizes) for local array j
• Vinj and Voutj – volume (as a function of tile sizes) moved in to and out of local array j, respectively
• rj – position in the loop nest where the data movement code of array j is placed
• Mup – total scratchpad memory

Memory constraint:

  Σ (j = 1 to nl) Mj ≤ Mup

Objective function (total data movement cost; the data movement code of array k executes once per iteration of the rk loops surrounding it):

  min over t1, …, tm of
    Σ (k = 1 to nl) [ (Π (i = 1 to rk) Ni/ti) · (S + Vink·L/P)
                    + (Π (i = 1 to rk) Ni/ti) · (S + Voutk·L/P) ]

Page 23

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 24

Machine information:

NVIDIA GeForce 8800 GTX
 16 multiprocessors × 8 cores @ 1.35 GHz
 768 MB off-chip memory
 16 KB scratchpad per multiprocessor (16 × 16 KB)

(Performance results figure.)

Page 25

(Performance results figure; machine details as on page 24.)

Page 26

(Performance results figure, with tile sizes chosen by the model; machine details as on page 24.)

Page 27

(Performance results figure, with tile sizes chosen by the model; machine details as on page 24.)

Page 28

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 29

◦ Scratchpad memory management
 Data reuse – Issenin et al. [DAC06]
 Allocation for uniformly generated references – Schreiber and Cronquist [HPLTR04], Anantharaman and Pande [RTSS98], Kandemir et al. [CAD04]

◦ Improving performance on cached architectures
 Ferrante et al. [LCPC92]
 Gallivan et al. [ICS88]

◦ Multi-level tiling
 Fatahalian et al. [SC06] – various levels of memory
 Bikshandi et al. [PPOPP06] and Renganarayanan et al. [SC07, IPDPS07] – parallelism and locality

Page 30

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 31

Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads

1. Data management in scratchpad memory
 a. Data allocation
 b. Access in the scratchpad
 c. Code generation for data movement

2. Mapping of computation in regular programs onto multiple levels of parallel units

Experimental evaluation using an nVIDIA GPU

Page 32

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 33

Developing an end-to-end compiler framework for modern many-core architectures like GPUs

The algorithms developed in this work are an integral part of the overall compiler framework

Further optimize transformations such as tiling for modern architectures like GPUs, using model-driven empirical search

Page 34