
Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories (PPoPP 2008)

Page 1

Muthu Baskaran¹, Uday Bondhugula¹, Sriram Krishnamoorthy¹, J. Ramanujam², Atanas Rountev¹, P. Sadayappan¹

¹Department of Computer Science & Engineering, The Ohio State University

²Department of Electrical and Computer Engineering, Louisiana State University

Page 2

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 3

Single-processor performance
◦ Improved by ~50%/yr for almost two decades (clock speed, ILP, …)
◦ Clock speed increased over 100x

Limits to single-processor performance growth
◦ Increase in power density
◦ Flattening of clock speed due to power limitation

Transistor density continues to rise unabated; multiple cores are now the best option for sustained performance growth

Page 4

Need to optimize memory bandwidth and latency in multi-core architectures

Traditional solution: introduce a cache hierarchy

Drawback
◦ Caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times

Solution in many modern architectures: fast on-chip explicitly managed memory, i.e., scratchpad memory (local memory store)

Page 5

Scratchpads
◦ Software-managed
 Control over data movement
 Easier to model performance
 Burden on programmer/compiler to manage and utilize
◦ Lower power per chip area required compared to cache

Some modern architectures having scratchpad memories
◦ GPU
◦ Cell
◦ MPSoC
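Software management is the key difference from a cache, and the pattern it imposes can be made concrete with a small sketch. The following plain-C fragment is illustrative only: the names, the TILE size, and the use of memcpy as a stand-in for a device-specific copy/DMA primitive are assumptions, not any particular architecture's API. It shows the explicit staging a scratchpad demands: move a block in, compute out of the local copy, move results back if needed.

```c
#include <string.h>

#define TILE 64

/* Stand-in for the fast local store; on real hardware this would be
   scratchpad memory, and the memcpy a DMA or copy intrinsic. */
static double local_buf[TILE];

double sum_staged(const double *global, int n) {
    double sum = 0.0;
    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? (n - base) : TILE;
        /* explicit "move in": the program, not the hardware, decides
           what is resident in the fast memory */
        memcpy(local_buf, global + base, (size_t)len * sizeof(double));
        /* compute strictly out of the local copy */
        for (int i = 0; i < len; i++)
            sum += local_buf[i];
        /* no "move out" needed here: this tile is read-only */
    }
    return sum;
}
```

Because every transfer is explicit, the transfer count and volume are visible in the code, which is what makes performance easier to model than with a hardware-managed cache.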

Page 6

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 7

Effective management of on-chip scratchpads in multi-core architectures
◦ Utilize the limited capacity of the scratchpad
◦ Optimize data movement

Effective computation mapping in many-core architectures with multiple levels of parallelism
◦ Exploit available parallelism
◦ Account for scratchpad capacity constraints

Page 8

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 9

Orchestration of data movement between off-chip global and on-chip scratchpad memory

Decisions on
◦ What data elements to move in and out of the scratchpad
◦ When to move data
◦ How to move data
◦ How to access the data elements copied to the scratchpad

Page 10

1. Allocation of storage space (as arrays) in the scratchpad memory for local copies

2. Determination of access functions of arrays in scratchpad memories

3. Generation of code for moving data between scratchpad (local) and off-chip (global) memories

Page 11

Targeted at affine programs
◦ Dense arrays
◦ Loop bounds – affine functions of outer loop variables, constants, and program parameters
◦ Array access functions – affine functions of surrounding loop variables, constants, and program parameters

Developed using the polyhedral model
◦ An algebraic framework for representing affine programs (statement domains, dependences, array access functions) and affine program transformations

Page 12

for (i=1; i<=4; i++)
  for (j=2; j<=4; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

Iteration vector: xS1 = (i, j)

Iteration domain IS1 (1 ≤ i ≤ 4, 2 ≤ j ≤ 4), written as affine inequalities:

  [  1  0  -1 ]   [ i ]
  [ -1  0   4 ] . [ j ]  ≥ 0
  [  0  1  -2 ]   [ 1 ]
  [  0 -1   4 ]

Access functions of the three references to a in S1:

  F1a(xS1) = [1 0; 0 1] . (i, j)T + (0, 0)T    (the reference a[i][j])
  F2a(xS1) = [0 1; 1 0] . (i, j)T + (0, 0)T    (the reference a[j][i])
  F3a(xS1) = [1 0; 0 1] . (i, j)T + (0, -1)T   (the reference a[i][j-1])

Data space of the first reference: DS1a = F1a(IS1)

(The slide also depicts IS1 as the lattice points bounded by i ≥ 1, i ≤ 4, j ≥ 2, j ≤ 4, and the data space of a as a grid spanning (0,0) to (m,m).)

Page 13

Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
◦ Access functions of array references may be non-uniformly generated

For architectures (e.g., the nVIDIA GeForce GPU) supporting direct data access from off-chip memory
◦ Estimate the extent of reuse of data to determine whether or not to copy it to the scratchpad
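The copy-or-not decision can be driven by a simple reuse estimate. The sketch below is a hypothetical illustration of the idea, not the paper's exact heuristic: copying a region into the scratchpad costs one off-chip transfer per distinct element, so staging only pays off when each element is accessed more than once on average.

```c
/* Reuse-based staging decision: total accesses to a region divided by
   the number of distinct elements gives the average reuse per element.
   Staging costs one transfer per element, so require reuse > 1. */
int worth_staging(long total_accesses, long distinct_elements) {
    if (distinct_elements <= 0)
        return 0;                      /* empty region: nothing to copy */
    return total_accesses > distinct_elements;
}
```

For example, a region of 50 elements read 200 times (reuse 4) is worth staging, while a region streamed through exactly once is better accessed directly from off-chip memory.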

Page 14

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

◦ Find the set of all data spaces accessed by all references to an array (each data space is the image of the iteration space of the statement holding the reference under the reference's access function)
◦ Partition the set of all data spaces into maximal disjoint non-overlapping subsets of data spaces
◦ Find the bounding box of each partition of data spaces
◦ Allocate a local memory array for each bounding box

For array A this yields two local arrays:

Local array LA0: lb(i) = 10; ub(i) = 14; lb(j) = 11; ub(j) = 20
Local array LA1: lb(i) = 20; ub(i) = 28; lb(j) = 11; ub(j) = 15

(The slide's figure shades these two regions of A, rows 10–14 and 20–28 over columns 11–20.)

Page 15

◦ The array dimension in the scratchpad may be lower than the original array dimension, depending on the accessed data

◦ Access function in the local memory array: the original access function, or a reduced access function, with offsets given by the lower bounds (in each dimension) of the scratchpad array

Original code:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Code with accesses remapped to local arrays:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k=11; k<=20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}

Page 16

◦ Generation of loop structure: scanning of the polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of
 read references – for moving data into the scratchpad
 write references – for moving data out of the scratchpad

◦ Generation of the loop body (data movement statement): copy from a location in the scratchpad buffer to an off-chip memory location, or vice versa

/* Data move-in code */
for (i=10; i<=14; i++) {
  for (j=11; j<=20; j++)
    LA0[i-10][j-11] = A[i][j];
}
for (i=20; i<=28; i++) {
  for (j=max(i-13,11); j<=min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];
}

/* Data move-out code */
for (i=10; i<=14; i++) {
  for (j=11; j<=15; j++)
    A[i][j] = LA0[i-10][j-11];
}

Page 17

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 18

◦ Architectural components
 Slow off-chip (global) memory
 Two levels of parallelism
  – a set of multiprocessors
  – a set of processor cores in each multiprocessor
 Scratchpad on each multiprocessor, shared by its processor cores

(The slide's figure shows the off-chip memory connected to a row of multiprocessors, each with its own scratchpad.)

Page 19

◦ Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
 Finds tiling transformations, or hyperplanes, for sequences of imperfectly nested loops
 Enables communication-minimal parallelization and locality optimization
 Identifies loops to tile for parallelism and data locality

◦ Multiple levels of tiling for exploiting parallelism across multiple parallel levels

◦ Additional (sequential) tiling at each level with scratchpad memory, if the data required by a tile executing at that level exceeds the memory
 Data movement at the start and end of each sequential tile
 Synchronization points to ensure consistency

Page 20

// Original loop nest
FORALL i = 1, Ni
 FORALL j = 1, Nj
  FOR k = 1, WS
   FOR l = 1, WS
    S1
   END FOR
  END FOR
 END FORALL
END FORALL

// Multi-level tiled code
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
 FORALL jT = 1, Nj, Tj
  // Tiling to satisfy scratchpad memory limit
  FOR i' = iT, min(iT+Ti-1,Ni), ti'
   FOR j' = jT, min(jT+Tj-1,Nj), tj'
    FOR k' = 1, WS, tk'
     FOR l' = 1, WS, tl'
      <Data move-in code>
      // Tiling to distribute at the inner level
      FORALL it = i', min(i'+ti'-1,Ni), ti
       FORALL jt = j', min(j'+tj'-1,Nj), tj
        FOR i = it, min(it+ti-1,Ni)
         FOR j = jt, min(jt+tj-1,Nj)
          FOR k = k', min(k'+tk'-1,WS)
           FOR l = l', min(l'+tl'-1,WS)
            S1
           END FOR
          END FOR
         END FOR
        END FOR
       END FORALL
      END FORALL
      <Data move-out code>
     END FOR
    END FOR
   END FOR
  END FOR
 END FORALL
END FORALL

Page 21

◦ Handling scratchpad memory constraints

Cost model for data movement:

  C = N × (S + (V × L) / P)

  N – number of data movements
  S – synchronization cost per data movement
  V – number of elements per data movement (based on tile sizes)
  L – cost to transfer one element
  P – number of processes involved in the data movement

Tile size search formulation
 Constraint: memory requirement within the scratchpad limit
 Objective function: minimize the data movement cost C

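The search itself can be as simple as a sweep over candidate tile sizes, keeping the cheapest one whose footprint fits the scratchpad. Below is a hypothetical one-dimensional sketch using the cost model above; the footprint model (one scratchpad element per tile element) is an illustrative stand-in for the per-array memory terms, not the paper's formulation.

```c
/* One-dimensional tile-size search: minimize
     C = N * (S + V*L/P),  with N = ceil(n/t) movements of V = t elements,
   subject to the footprint (t elements) fitting in the scratchpad. */
int best_tile(int n, double S, double L, double P, int capacity_elems) {
    int best_t = 1;
    double best_c = 1e300;
    for (int t = 1; t <= n && t <= capacity_elems; t++) {
        double moves = (double)((n + t - 1) / t);    /* N = ceil(n/t) */
        double c = moves * (S + (double)t * L / P);  /* cost model C  */
        if (c < best_c) { best_c = c; best_t = t; }
    }
    return best_t;
}
```

With a high per-movement synchronization cost S the search pushes toward the largest tile that fits; with S = 0 the total volume dominates and any tile size that divides n is equally good.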

Page 22

Variables: tile sizes t1, t2, …, tm

• Loop nest of m loops with tile sizes t1, t2, …, tm
• nl local arrays
• Mj – memory (as a function of tile sizes) for local array j
• Vinj and Voutj – volume (as a function of tile sizes) moved in to and out of local array j, respectively
• rj – position in the loop nest where the data movement code of array j is placed
• Mup – total scratchpad memory

Memory constraint:

  Σ (j = 1 to nl) Mj ≤ Mup

Objective function (total data movement cost; the data movement code of array k executes once per iteration of the rk loops surrounding it):

  min over t1, …, tm of
    Σ (k = 1 to nl) [ (Π (i = 1 to rk) Ni/ti) · (S + Vink·L/P)
                    + (Π (i = 1 to rk) Ni/ti) · (S + Voutk·L/P) ]

Page 23

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 24

Machine information:

NVIDIA GeForce 8800 GTX
 16 multiprocessors × 8 cores @ 1.35 GHz
 768 MB off-chip memory
 16 KB scratchpad per multiprocessor (16 × 16 KB)

(Performance results figure.)

Page 25

(Performance results figure; machine details as on page 24.)

Page 26

(Performance results figure, with tile sizes chosen by the model; machine details as on page 24.)

Page 27

(Performance results figure, with tile sizes chosen by the model; machine details as on page 24.)

Page 28

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 29

◦ Scratchpad memory management
 Data reuse – Issenin et al. [DAC06]
 Allocation for uniformly generated references – Schreiber and Cronquist [HPLTR04], Anantharaman and Pande [RTSS98], Kandemir et al. [CAD04]

◦ Improving performance on cached architectures
 Ferrante et al. [LCPC92]
 Gallivan et al. [ICS88]

◦ Multi-level tiling
 Fatahalian et al. [SC06] – various levels of memory
 Bikshandi et al. [PPOPP06] and Renganarayanan et al. [SC07, IPDPS07] – parallelism and locality

Page 30

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 31

Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads

1. Data management in scratchpad memory
 a. Data allocation
 b. Access in the scratchpad
 c. Code generation for data movement

2. Mapping of computation in regular programs onto multiple levels of parallel units

Experimental evaluation using an nVIDIA GPU

Page 32

Outline: Introduction · Challenges · Automatic Data Management · Multi-level Tiling · Experiments · Related Work · Summary · Ongoing and Future Work

Page 33

Developing an end-to-end compiler framework for modern many-core architectures like GPUs

The algorithms developed in this work are an integral part of the overall compiler framework

Further optimize transformations such as tiling for modern architectures like GPUs, using model-driven empirical search

Page 34