Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories (PPoPP 2008)
Muthu Baskaran1  Uday Bondhugula1  Sriram Krishnamoorthy1
J. Ramanujam2  Atanas Rountev1  P. Sadayappan1
1Department of Computer Science & Engineering, The Ohio State University
2Department of Electrical and Computer Engineering, Louisiana State University
Outline
◦ Introduction
◦ Challenges
◦ Automatic Data Management
◦ Multi-level Tiling
◦ Experiments
◦ Related Work
◦ Summary
◦ Ongoing and Future Work
Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories, PPoPP 2008
Single-processor performance
◦ Improved by ~50%/yr for almost two decades (clock speed, ILP, …)
◦ Clock speed increased over 100×
Limits to single-processor performance growth
◦ Increase in power density
◦ Flattening of clock speed due to power limitations
Transistor density continues to rise unabated; multiple cores are now the best option for sustained performance growth.
Need to optimize memory bandwidth and latency in multi-core architectures.
Traditional solution: a cache hierarchy.
Drawback
◦ Caches are hardware-managed – difficult to model miss behavior and to predict program execution times
Solution in many modern architectures: fast on-chip explicitly managed memory – scratchpad memory (local memory store).
Scratchpads
◦ Software-managed
   Control over data movement
   Easier to model performance
   Burden on programmer/compiler to manage and utilize
◦ Lower power per chip area compared to caches
Some modern architectures with scratchpad memories
◦ GPU
◦ Cell
◦ MPSoC
Effective management of on-chip scratchpads in multi-core architectures
◦ Utilize the limited capacity of the scratchpad
◦ Optimize data movement
Effective computation mapping in many-core architectures with multiple levels of parallelism
◦ Exploit available parallelism
◦ Account for scratchpad capacity constraints
Orchestration of data movement between off-chip (global) memory and on-chip scratchpad memory. Decisions on
◦ What data elements to move in and out of the scratchpad
◦ When to move data
◦ How to move data
◦ How to access the data elements copied to the scratchpad
1. Allocation of storage space (as arrays) in the scratchpad memory for local copies
2. Determination of access functions of arrays in scratchpad memories
3. Generation of code for moving data between scratchpad (local) and off-chip (global) memories
Targeted at affine programs
◦ Dense arrays
◦ Loop bounds – affine functions of outer loop variables, constants, and program parameters
◦ Array access functions – affine functions of surrounding loop variables, constants, and program parameters
Developed using the polyhedral model
◦ An algebraic framework for representing affine programs – statement domains, dependences, array access functions – and affine program transformations
for (i = 1; i <= 4; i++)
  for (j = 2; j <= 4; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];
Iteration vector: xS1 = (i, j)

Iteration domain IS1, one inequality per row of the system IS1 · (i, j, 1)^T ≥ 0:

  [  1   0  -1 ]               i - 1 ≥ 0   (i ≥ 1)
  [ -1   0   4 ]    ( i )     -i + 4 ≥ 0   (i ≤ 4)
  [  0   1  -2 ] ·  ( j )  ≥ 0  j - 2 ≥ 0   (j ≥ 2)
  [  0  -1   4 ]    ( 1 )     -j + 4 ≥ 0   (j ≤ 4)

Access functions of the three references to array a:

  F1a(xS1) = [ 1 0 ; 0 1 ] · (i, j)^T + (0,  0)^T    — a[i][j]
  F2a(xS1) = [ 0 1 ; 1 0 ] · (i, j)^T + (0,  0)^T    — a[j][i]
  F3a(xS1) = [ 1 0 ; 0 1 ] · (i, j)^T + (0, -1)^T    — a[i][j-1]

Data space of the reference a[i][j]: DS1a = F1a(IS1)

[Figure: the iteration domain of S1 (i ≥ 1, i ≤ 4, j ≥ 2, j ≤ 4) within the data space of array a, spanning (0,0) to (m,m)]
Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
◦ Access functions of array references may be non-uniformly generated
For architectures (e.g., the NVIDIA GeForce GPU) that support direct data access from off-chip memory
◦ Estimate the extent of reuse of data to determine whether or not to copy to the scratchpad
[Figure: regions of array A accessed by the example below — rows 10–14 × columns 11–20 and rows 20–28 × columns 11–15]
for (i = 10; i <= 14; i++) {
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}
Local array LA0: lb(i) = 10; ub(i) = 14; lb(j) = 11; ub(j) = 20
Local array LA1: lb(i) = 20; ub(i) = 28; lb(j) = 11; ub(j) = 15
◦ Find the set of all data spaces accessed by all references to an array
   Access function of the reference
   Iteration space of the statement that holds the reference
◦ Partition the set of all data spaces into maximal disjoint non-overlapping subsets of data spaces
◦ Find the bounding box of each partition of data spaces
◦ Allocate a local memory array for each bounding box
◦ Array dimensionality in the scratchpad may be lower than the original array dimensionality, depending on the accessed data
◦ Access function in the local memory array
   Original access function, or a reduced access function, with offsets – the lower bounds (in each dimension) of the scratchpad array
Original code:

for (i = 10; i <= 14; i++) {
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Transformed code (accesses remapped to local arrays):

for (i = 10; i <= 14; i++) {
  for (j = 10; j <= 14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k = 11; k <= 20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}
◦ Generation of loop structure
   Scanning of polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of
    read references – for moving data into the scratchpad
    write references – for moving data out of the scratchpad
◦ Generation of loop body (data movement statement)
   Copy from a location in the scratchpad buffer to an off-chip memory location, or vice versa
/* Data move-in code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 20; j++)
    LA0[i-10][j-11] = A[i][j];
for (i = 20; i <= 28; i++)
  for (j = max(i-13,11); j <= min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];

/* Data move-out code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 15; j++)
    A[i][j] = LA0[i-10][j-11];
◦ Architectural components
   Slow off-chip (global) memory
   Two levels of parallelism
    A set of multiprocessors
    A set of processor cores in each multiprocessor
   A scratchpad on each multiprocessor, shared by its processor cores
[Figure: off-chip memory connected to a set of multiprocessors, each with its own scratchpad]
◦ Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
   Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops
   Enables communication-minimal parallelization and locality optimization
   Identifies loops to tile for parallelism and data locality
◦ Multiple levels of tiling for exploiting parallelism across multiple parallel levels
◦ Additional (sequential) tiling at each level with a scratchpad memory, if the data required by a tile executing at that level exceeds memory
   Data movement at the start and end of each sequential tile
   Synchronization points to ensure consistency
Original loop nest:

FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL

Multi-level tiled version:

// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy the scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1,Ni), ti'
      FOR j' = jT, min(jT+Tj-1,Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            <Data move-in code>
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1,Ni), ti
              FORALL jt = j', min(j'+tj'-1,Nj), tj
                FOR i = it, min(it+ti-1,Ni)
                  FOR j = jt, min(jt+tj-1,Nj)
                    FOR k = k', min(k'+tk'-1,WS)
                      FOR l = l', min(l'+tl'-1,WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
            <Data move-out code>
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL
◦ Handling scratchpad memory constraints
   Cost model for data movement:

     C = N × (S + (V × L) / P)

     N – number of data movements
     S – synchronization cost per data movement
     V – number of elements per data movement (based on tile sizes)
     L – cost to transfer one element
     P – number of processes involved in the data movement
   Tile size search formulation
     Constraint: memory requirement within the scratchpad limit
     Objective function: minimize the data movement cost C
Tile size search formulation, for a loop nest of m loops with tile sizes t1, t2, …, tm and nl local arrays:

• Mj – memory (as a function of the tile sizes) for local array j
• Vj^in and Vj^out – volume (as a function of the tile sizes) moved in to and out of local array j, respectively
• rj – position in the loop nest where the data movement code of array j is placed (the trip counts of the loops surrounding position rj determine the numbers of movements Nj^in and Nj^out)
• Mup – total scratchpad memory

Memory constraint:

  Σ_{j=1..nl} Mj ≤ Mup

Objective function:

  minimize Σ_{j=1..nl} [ Nj^in × (S + (Vj^in × L)/P) + Nj^out × (S + (Vj^out × L)/P) ]

Variables: t1, t2, …, tm
Machine information:
NVIDIA GeForce 8800 GTX
◦ 16 multiprocessors × 8 cores @ 1.35 GHz
◦ 768 MB off-chip memory
◦ 16 × 16 KB scratchpad
[Performance plots; tile sizes selected by the cost model]
◦ Scratchpad memory management
   Data reuse – Issenin et al. [DAC06]
   Allocation for uniformly generated references – Schreiber and Cronquist [HPL TR04], Anantharaman and Pande [RTSS98], Kandemir et al. [CAD04]
   Improving performance on cached architectures – Ferrante et al. [LCPC92], Gallivan et al. [ICS88]
◦ Multi-level tiling
   Fatahalian et al. [SC06] – various levels of memory
   Bikshandi et al. [PPOPP06] and Renganarayanan et al. [SC07, IPDPS07] – parallelism and locality
Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
1. Data management in scratchpad memory
   a. Data allocation
   b. Access in the scratchpad
   c. Code generation for data movement
2. Mapping of computation in regular programs onto multiple levels of parallel units
Experimental evaluation using an NVIDIA GPU
Developing an end-to-end compiler framework for modern many-core architectures such as GPUs
◦ The algorithms developed in this work are an integral part of the overall compiler framework
Further optimizing transformations such as tiling for modern architectures, using model-driven empirical search