CellSs: A Programming Model for the Cell BE
Architecture
Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta
Barcelona Supercomputing Center (BSC-CNS), Technical University of Catalonia (UPC)
Index
•Motivation
•Programming models
•CellSs sample codes
•Compilation environment
•Execution behavior
•Results
•Related work
•Conclusions & ongoing work
Motivation
* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper
Motivation

So, what is the Cell BE?
•Architecture point of view: one PPE and eight SPEs; separate address spaces; tiny local memories; bandwidth limits; a thin SMT processor core
•User point of view: hard to optimize
•Programmer's point of view: time scales ranging from nanoseconds through hundreds of microseconds up to minutes/hours
Programming models

Concept mapping (superscalar processor → Cell → Grid):
•Instructions → block operations → full binary
•Functional units → SPEs → remote machines
•Fetch & decode unit → PPE → local machine
•Registers (name space) → main memory → files
•Registers (storage) → SPU memory → files

Standard sequential languages:
•On standard processors, they run sequentially
•On the Cell, they run in parallel

Constraint: block algorithms
CellSs sample code: Matrix multiply

int main(int argc, char **argv) {
  int i, j, k;
  ...
  initialize(A, B, C);
  for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
      for (k = 0; k < NB; k++)
        block_addmultiply(C[i][j], A[i][k], B[k][j]);
  ...
}

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Figure: the matrix is an NB×NB grid of B×B blocks]
CellSs sample code: Matrix multiply

int main(int argc, char **argv) {
  int i, j, k;
  ...
  initialize(A, B, C);
  for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
      for (k = 0; k < NB; k++)
        block_addmultiply(C[i][j], A[i][k], B[k][j]);
  ...
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

(The annotated function runs on the SPEs; the loops in main unroll into a stream of task invocations.)

[Figure: the matrix is an NB×NB grid of B×B blocks]
CellSs sample code: Sparse LU

int main(int argc, char **argv) {
  int ii, jj, kk;
  ...
  for (kk = 0; kk < NB; kk++) {
    lu0(A[kk][kk]);
    for (jj = kk+1; jj < NB; jj++)
      if (A[kk][jj] != NULL)
        fwd(A[kk][kk], A[kk][jj]);
    for (ii = kk+1; ii < NB; ii++)
      if (A[ii][kk] != NULL) {
        bdiv(A[kk][kk], A[ii][kk]);
        for (jj = kk+1; jj < NB; jj++)
          if (A[kk][jj] != NULL) {
            if (A[ii][jj] == NULL)
              A[ii][jj] = allocate_clean_block();
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
          }
      }
  }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);

[Figure: the matrix is an NB×NB grid of B×B blocks]
CellSs sample code: Sparse LU

int main(int argc, char **argv) {
  int ii, jj, kk;
  ...
  for (kk = 0; kk < NB; kk++) {
    lu0(A[kk][kk]);
    for (jj = kk+1; jj < NB; jj++)
      if (A[kk][jj] != NULL)
        fwd(A[kk][kk], A[kk][jj]);
    for (ii = kk+1; ii < NB; ii++)
      if (A[ii][kk] != NULL) {
        bdiv(A[kk][kk], A[ii][kk]);
        for (jj = kk+1; jj < NB; jj++)
          if (A[kk][jj] != NULL) {
            if (A[ii][jj] == NULL)
              A[ii][jj] = allocate_clean_block();
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
          }
      }
  }
}

#pragma css task inout(diag[B][B])
void lu0(float *diag);
#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);
#pragma css task input(row[B][B], col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);
#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);

[Figure: the matrix is an NB×NB grid of B×B blocks]

Data-dependent parallelism
Dynamic main memory allocation
CellSs sample code: Checking LU

int main(int argc, char *argv[]) {
  ...
  copy_mat(A, origA);
  LU(A);
  split_mat(A, L, U);
  clean_mat(A);
  sparse_matmult(L, U, A);
  compare_mat(origA, A);
}

#pragma css task input(Src) out(Dst)
void copy_block(float Src[BS][BS], float Dst[BS][BS]);

void copy_mat(float *Src, float *Dst) {
  ...
  for (ii = 0; ii < NB; ii++)
    for (jj = 0; jj < NB; jj++)
      ...
      copy_block(Src[ii][jj], block);
  ...
}

#pragma css task input(A) out(L, U)
void split_block(float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat(float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB]) {
  ...
  for (ii = 0; ii < NB; ii++)
    for (jj = 0; jj < NB; jj++) {
      ...
      split_block(LU[ii][ii], L[ii][ii], U[ii][ii]);
      ...
    }
}
Compilation environment

app.c → CSS compiler → app_spe.c + app_ppe.c
app_spe.c → SPE compiler → app_spe.o → SPE linker (with llib_css-spe.so) → SPE executable → SPE embedder → PPE object
app_ppe.c → PPE compiler → app_ppe.o
app_ppe.o + PPE object → PPE linker (with llib_css-ppe.so) → Cell executable

(Backend compilers, linkers, and embedder come from the IBM SDK.)
Execution behavior

PPU: runs the user main program linked with the CellSs PPU library, as two threads:
•Main thread: task generation; data-dependence analysis and data renaming; builds the task graph
•Helper thread: scheduling and work assignment of ready tasks; synchronization with the SPUs; collects finalization signals
•Main memory holds the user data plus the extra storage used for renaming

SPU0, SPU1, SPU2, ...: run the CellSs SPU library together with the original task code. Each task execution stages in its input data (DMA in), executes the task, stages out the results (DMA out), and synchronizes with the PPU.
Execution behavior: Matrix multiply

#pragma css task input(A, B) inout(C)
block_addmultiply(C[i][j], A[i][k], B[k][j])

•For each operation, the two input blocks A[i][k] and B[k][j] are transferred from PPE memory to the SPE local storage
•Clusters of dependent tasks are scheduled to the same SPE
•The inout block C[i][j] is kept in the local storage and put back to PPE memory only once (reuse)
Execution behavior: Matrix multiply
[Trace view. Clustering: chains of 7 block multiplies (270 µs each); block size 64×64 floats; stage in/out and reuse highlighted. Main thread performs task generation; the helper thread runs alongside.]
Execution behavior: Matrix multiply
[Trace: waiting for SPE availability; schedule & dispatch.]
Execution behavior: Matrix multiply
[Trace: stage out and notification; task generation; dispatch, schedule, and graph update.]
Execution behavior: Sparse LU
Priority hints:
#pragma css task highpriority …
•Increase parallelism / support scheduling
•Support reuse
Execution behavior: J_Check_LU
copy_mat (A, origA);
LU (A);
split_mat (A, L, U);
clean_mat(A);
sparse_matmult (L, U, A);
compare_mat (origA, A);

[Trace comparison: without CellSs vs. with CellSs]
Execution behavior: J_Check
Execution behavior: Other views
•Stage-in bandwidth
•Stage-out bandwidth
•Task generation lookahead: full unrolling before execution vs. overlapped generation/execution
Scalability

[Figure: Matmul speedup results; speedup vs. number of SPUs (1–8)]

Faster tasks (pre-fetching data)

[Figure: SparseLU speedup results; speedup vs. number of SPUs (1–8)]

[Figure: Matmul speedup results (v2); speedup vs. number of SPUs (1–8)]
Related work
•Sequoia
• Just presented!
•Charm++
• Runtime tailored to Cell BE
• Offload API
•Octopiler (IBM)
• Auto-SIMDization
• OpenMP as programming model
• Single shared-memory abstraction
Conclusions & Ongoing work
•Cell Superscalar offers a simple programming model for the Cell BE
• Allows easy porting of applications
• General
• Constraints: blocking
•Ongoing work
• Run-time optimization: overheads, halos, scheduling algorithms, overlapping of phases, overlays, speculation, short-circuits, more helper threads, lazy renaming, …
• Garbage collection
•Applications
• Bio
• Engineering
•To be distributed as open source soon
THANKS!
Visit us at BSC booth #1800 for further information