CellSs: A Programming Model for the Cell BE
Architecture
Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta
Barcelona Supercomputing Center (BSC-CNS), Technical University of Catalonia (UPC)
Index
•Motivation
•Programming models
•CellSs sample codes
•Compilation environment
•Execution behavior
•Results
•Related work
•Conclusions & ongoing work
Motivation
* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper
Motivation

So, what is the Cell BE?
•Architecture point of view: one PPE and eight SPEs; separate address spaces; tiny local memories; bandwidth limits; a thin SMT processor core
•User point of view: hard to optimize
•Programmer's point of view: time scales ranging from nanoseconds through hundreds of microseconds up to minutes/hours
Programming models

Concept mapping (superscalar processor → Cell → Grid):
•Instructions → block operations → full binary
•Functional units → SPEs → remote machines
•Fetch & decode unit → PPE → local machine
•Registers (name space) → main memory → files
•Registers (storage) → SPU memory → files

Standard sequential languages:
•On standard processors, they run sequentially
•On the Cell, they run in parallel

Constraint: block algorithms
CellSs sample code: Matrix multiply

int main(int argc, char **argv) {
  int i, j, k;
  ...
  initialize(A, B, C);
  for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
      for (k = 0; k < NB; k++)
        block_addmultiply(C[i][j], A[i][k], B[k][j]);
  ...
}

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Figure: the matrix is an NB×NB grid of B×B blocks]
CellSs sample code: Matrix multiply

int main(int argc, char **argv) {
  int i, j, k;
  ...
  initialize(A, B, C);
  for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
      for (k = 0; k < NB; k++)
        block_addmultiply(C[i][j], A[i][k], B[k][j]);
  ...
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

(The annotated function runs on the SPEs; the loops in main unroll into a stream of task invocations.)

[Figure: the matrix is an NB×NB grid of B×B blocks]
CellSs sample code: Sparse LU

int main(int argc, char **argv) {
  int ii, jj, kk;
  ...
  for (kk = 0; kk < NB; kk++) {
    lu0(A[kk][kk]);
    for (jj = kk+1; jj < NB; jj++)
      if (A[kk][jj] != NULL)
        fwd(A[kk][kk], A[kk][jj]);
    for (ii = kk+1; ii < NB; ii++)
      if (A[ii][kk] != NULL) {
        bdiv(A[kk][kk], A[ii][kk]);
        for (jj = kk+1; jj < NB; jj++)
          if (A[kk][jj] != NULL) {
            if (A[ii][jj] == NULL)
              A[ii][jj] = allocate_clean_block();
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
          }
      }
  }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);

[Figure: the matrix is an NB×NB grid of B×B blocks]
CellSs sample code: Sparse LU

int main(int argc, char **argv) {
  int ii, jj, kk;
  ...
  for (kk = 0; kk < NB; kk++) {
    lu0(A[kk][kk]);
    for (jj = kk+1; jj < NB; jj++)
      if (A[kk][jj] != NULL)
        fwd(A[kk][kk], A[kk][jj]);
    for (ii = kk+1; ii < NB; ii++)
      if (A[ii][kk] != NULL) {
        bdiv(A[kk][kk], A[ii][kk]);
        for (jj = kk+1; jj < NB; jj++)
          if (A[kk][jj] != NULL) {
            if (A[ii][jj] == NULL)
              A[ii][jj] = allocate_clean_block();
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
          }
      }
  }
}

#pragma css task inout(diag[B][B])
void lu0(float *diag);
#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);
#pragma css task input(row[B][B], col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);
#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);

[Figure: the matrix is an NB×NB grid of B×B blocks]

Data-dependent parallelism
Dynamic main memory allocation
CellSs sample code: Checking LU

int main(int argc, char *argv[]) {
  ...
  copy_mat(A, origA);
  LU(A);
  split_mat(A, L, U);
  clean_mat(A);
  sparse_matmult(L, U, A);
  compare_mat(origA, A);
}

#pragma css task input(Src) out(Dst)
void copy_block(float Src[BS][BS], float Dst[BS][BS]);

void copy_mat(float *Src, float *Dst) {
  ...
  for (ii = 0; ii < NB; ii++)
    for (jj = 0; jj < NB; jj++)
      ...
      copy_block(Src[ii][jj], block);
  ...
}

#pragma css task input(A) out(L, U)
void split_block(float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat(float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB]) {
  ...
  for (ii = 0; ii < NB; ii++)
    for (jj = 0; jj < NB; jj++) {
      ...
      split_block(LU[ii][ii], L[ii][ii], U[ii][ii]);
      ...
    }
}
Compilation environment

app.c → CSS compiler → app_spe.c + app_ppe.c
app_spe.c → SPE compiler → app_spe.o → SPE linker (with llib_css-spe.so) → SPE executable → SPE embedder → PPE object
app_ppe.c → PPE compiler → app_ppe.o
app_ppe.o + PPE object → PPE linker (with llib_css-ppe.so) → Cell executable

(Backend compilers, linkers, and embedder come from the IBM SDK.)
Execution behavior

PPU: runs the user main program linked with the CellSs PPU library, as two threads:
•Main thread: task generation; data-dependence analysis and data renaming; builds the task graph
•Helper thread: scheduling and work assignment of ready tasks; synchronization with the SPUs; collects finalization signals
•Main memory holds the user data plus the extra storage used for renaming

SPU0, SPU1, SPU2, ...: run the CellSs SPU library together with the original task code. Each task execution stages in its input data (DMA in), executes the task, stages out the results (DMA out), and synchronizes with the PPU.
Execution behavior: Matrix multiply

#pragma css task input(A, B) inout(C)
block_addmultiply(C[i][j], A[i][k], B[k][j])

•For each operation, the two input blocks A[i][k] and B[k][j] are transferred from PPE memory to the SPE local storage
•Clusters of dependent tasks are scheduled to the same SPE
•The inout block C[i][j] is kept in the local storage and put back to PPE memory only once (reuse)
Execution behavior: Matrix multiply
[Trace view. Clustering: chains of 7 block multiplies (270 µs each); block size 64×64 floats; stage in/out and reuse highlighted. Main thread performs task generation; the helper thread runs alongside.]
Execution behavior: Matrix multiply
[Trace: waiting for SPE availability; schedule & dispatch.]
Execution behavior: Matrix multiply
[Trace: stage out and notification; task generation; dispatch, schedule, and graph update.]
Execution behavior: Sparse LU
Priority hints:
#pragma css task highpriority …
•Increase parallelism / support scheduling
•Support reuse
Execution behavior: J_Check_LU
copy_mat (A, origA);
LU (A);
split_mat (A, L, U);
clean_mat(A);
sparse_matmult (L, U, A);
compare_mat (origA, A);

[Trace comparison: without CellSs vs. with CellSs]
Execution behavior: J_Check
Execution behavior: Other views
•Stage-in bandwidth
•Stage-out bandwidth
•Task generation lookahead: full unrolling before execution vs. overlapped generation/execution
Scalability

[Figure: Matmul speedup results; speedup vs. number of SPUs (1–8)]

Faster tasks (pre-fetching data)

[Figure: SparseLU speedup results; speedup vs. number of SPUs (1–8)]

[Figure: Matmul speedup results (v2); speedup vs. number of SPUs (1–8)]
Related work
•Sequoia
• Just presented!
•Charm++
• Runtime tailored to Cell BE
• Offload API
•Octopiler (IBM)
• Auto-SIMDization
• OpenMP as programming model
• Single shared-memory abstraction
Conclusions & Ongoing work
•Cell Superscalar offers a simple programming model for the Cell BE
• Allows easy porting of applications
• General
• Constraints: blocking
•Ongoing work
• Run-time optimization: overheads, halos, scheduling algorithms, overlapping of phases, overlays, speculation, short-circuits, more helper threads, lazy renaming, …
• Garbage collection
•Applications
• Bio
• Engineering
•To be distributed as open source soon
THANKS!
Visit us at BSC booth #1800 for further information