CellSs: A Programming Model for the Cell BE Architecture Pieter Bellens, Josep M. Perez, Rosa M. Badia , Jesus Labarta Barcelona Supercomputing Center (BSC-CNS) Technical University of Catalonia (UPC) [email protected]


CellSs: A Programming Model for the Cell BE Architecture

Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta

Barcelona Supercomputing Center (BSC-CNS)
Technical University of Catalonia (UPC)

[email protected]


Index

•Motivation

•Programming models

•CellSs sample codes

•Compilation environment

•Execution behavior

•Results

•Related work

•Conclusions & ongoing work


Motivation

* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper

Motivation

So, what is the Cell BE?

• User point of view

• Architecture point of view: one PPE and eight SPEs

• Programmer's point of view: separate address spaces, tiny local memory, bandwidth; thin processor, SMT; hard to optimize

Programming models

Concepts mapping:

Superscalar processor  | Cell BE          | Grid
-----------------------|------------------|----------------
ns                     | 100 µs           | minutes/hours
Instructions           | Block operations | Full binary
Functional units       | SPEs             | Remote machines
Fetch & decode unit    | PPE              | Local machine
Registers (name space) | Main memory      | Files
Registers (storage)    | SPU memory       | Files

Standard sequential languages:

• On standard processors, the code runs sequentially

• On the Cell, it runs in parallel

Constraint: block algorithms


CellSs sample code: Matrix multiply

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    int i, j, k;

    /* Accumulate the product of two BS x BS blocks into C. */
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

int main(int argc, char **argv)
{
    int i, j, k;
    …

    initialize(A, B, C);

    /* One block operation per (i, j, k) triple of the NB x NB block grid. */
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    ...
}

[Figure: matrix of NB x NB blocks, each of B x B elements]


CellSs sample code: Matrix multiply

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    int i, j, k;

    /* Accumulate the product of two BS x BS blocks into C. */
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

int main(int argc, char **argv)
{
    int i, j, k;
    …

    initialize(A, B, C);

    /* Each call below becomes a task. */
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    ...
}

[Figure: matrix of NB x NB blocks, each of B x B elements; callouts: SPE, unroll]
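A compiler that does not recognize the css pragma ignores it, so the annotated source stays an ordinary sequential C program. The sketch below checks that property on a small case: it runs the blocked loop nest from the slide and compares the result with a plain triple loop over the full matrix. The values of NB and BS, the helper elem and the function blocked_matches_naive are illustrative assumptions and are not part of CellSs.

```c
#include <assert.h>

#define NB 2            /* blocks per matrix dimension (illustrative) */
#define BS 4            /* elements per block dimension (illustrative) */
#define N  (NB * BS)    /* elements per full matrix dimension */

/* A compiler without CellSs support ignores this pragma, so the
   function is called like any other C function. */
#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Hypothetical helper: element (r, c) of the full matrix lives in
   block (r / BS, c / BS) at offset (r % BS, c % BS). */
static float *elem(float M[NB][NB][BS][BS], int r, int c)
{
    assert(r >= 0 && r < N && c >= 0 && c < N);
    return &M[r / BS][c / BS][r % BS][c % BS];
}

/* Returns 1 if the blocked product equals the element-wise product. */
int blocked_matches_naive(void)
{
    static float A[NB][NB][BS][BS], B[NB][NB][BS][BS], C[NB][NB][BS][BS];

    /* Small integer values keep every float operation exact. */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            *elem(A, r, c) = (float)((r + c) % 3);
            *elem(B, r, c) = (float)((2 * r + c) % 5);
            *elem(C, r, c) = 0.0f;
        }

    /* Same loop nest as on the slide. */
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);

    /* Reference: plain triple loop over the full N x N matrices. */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            float ref = 0.0f;
            for (int k = 0; k < N; k++)
                ref += *elem(A, r, k) * *elem(B, k, c);
            if (ref != *elem(C, r, c))
                return 0;
        }
    return 1;
}
```

Calling blocked_matches_naive() from a test program is enough to exercise the sketch. Some compilers may warn about the unknown pragma.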


CellSs sample code: Sparse LU

int main(int argc, char **argv)
{
    int ii, jj, kk;
    …

    for (kk = 0; kk < NB; kk++) {
        lu0(A[kk][kk]);
        for (jj = kk + 1; jj < NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii = kk + 1; ii < NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv(A[kk][kk], A[ii][kk]);
                for (jj = kk + 1; jj < NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj] == NULL)
                            A[ii][jj] = allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);

[Figure: matrix of NB x NB blocks, each of B x B elements]


CellSs sample code: Sparse LU

int main(int argc, char **argv)
{
    int ii, jj, kk;
    …

    for (kk = 0; kk < NB; kk++) {
        lu0(A[kk][kk]);
        for (jj = kk + 1; jj < NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii = kk + 1; ii < NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv(A[kk][kk], A[ii][kk]);
                for (jj = kk + 1; jj < NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj] == NULL)
                            A[ii][jj] = allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

#pragma css task inout(diag[B][B])
void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);

#pragma css task input(row[B][B], col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);

#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);

• Data dependent parallelism

• Dynamic main memory allocation

[Figure: matrix of NB x NB blocks, each of B x B elements]
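Both properties of the Sparse LU loop nest, data dependent parallelism and dynamic allocation, can be seen without doing any arithmetic. The sketch below replaces each block pointer with a 0/1 presence flag, walks the same loops, and counts how many lu0, fwd, bdiv and bmod calls would be issued and how many blocks would be allocated as fill-in. The struct task_counts, the function count_sparse_lu_tasks and the value NB = 3 are hypothetical and chosen for illustration only; the kernels themselves are not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define NB 3   /* blocks per matrix dimension (illustrative) */

/* Hypothetical tally of the calls the main loop would issue. */
struct task_counts {
    int lu0, fwd, bdiv, bmod;
    int allocated;   /* blocks created by fill-in */
};

/* present[i][j] != 0 stands for A[i][j] != NULL in the slide's code.
   The array is updated in place when fill-in creates a block. */
struct task_counts count_sparse_lu_tasks(int present[NB][NB])
{
    struct task_counts c = {0, 0, 0, 0, 0};

    assert(present != NULL);
    for (int kk = 0; kk < NB; kk++) {
        c.lu0++;
        for (int jj = kk + 1; jj < NB; jj++)
            if (present[kk][jj])
                c.fwd++;
        for (int ii = kk + 1; ii < NB; ii++)
            if (present[ii][kk]) {
                c.bdiv++;
                for (int jj = kk + 1; jj < NB; jj++)
                    if (present[kk][jj]) {
                        if (!present[ii][jj]) {
                            present[ii][jj] = 1;   /* allocate_clean_block() */
                            c.allocated++;
                        }
                        c.bmod++;
                    }
            }
    }
    return c;
}
```

For a fully populated 3 x 3 block matrix this gives 3 lu0, 3 fwd, 3 bdiv and 5 bmod calls with no allocation. With only the diagonal plus blocks (0,2) and (1,0) present, it gives 3 lu0, 2 fwd, 1 bdiv and 1 bmod calls and one allocated block, so the set of tasks depends on the sparsity pattern.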


CellSs sample code: Checking LU

int main(int argc, char *argv[])
{
    ...
    copy_mat(A, origA);
    LU(A);
    split_mat(A, L, U);
    clean_mat(A);
    sparse_matmult(L, U, A);
    compare_mat(origA, A);
}

#pragma css task input(Src) out(Dst)
void copy_block(float Src[BS][BS], float Dst[BS][BS]);

void copy_mat(float *Src[NB][NB], float *Dst[NB][NB])
{
    ...
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
            ...
            copy_block(Src[ii][jj], block);
    ...
}

#pragma css task input(A) out(L, U)
void split_block(float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat(float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB])
{
    ...
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            ...
            split_block(LU[ii][jj], L[ii][jj], U[ii][jj]);
            ...
        }
}


Compilation environment

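• The CSS compiler splits app.c into app_spe.c and app_ppe.c

• SPE side: app_spe.c → SPE compiler → app_spe.o → SPE linker (with llib_css-spe.so) → SPE executable → SPE embedder → PPE object

• PPE side: app_ppe.c → PPE compiler → app_ppe.o → PPE linker (with the PPE object and llib_css-ppe.so) → Cell executable

• The SPE and PPE compilers and linkers and the SPE embedder are part of the SDK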


Execution behavior

[Diagram: runtime structure]

• PPU: user main program with the CellSs PPU lib

  • Main thread: data dependence, data renaming

  • Helper thread: scheduling, work assignment, synchronization

• Memory: user data, renaming, task graph

• SPU0, SPU1, SPU2: original task code with the CellSs SPU lib; each SPU repeats DMA in, task execution, DMA out, synchronization

• Flows: tasks, finalization signal, stage in/out data
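The data dependence step can be illustrated with a last-writer table: a task that reads a block depends on the task that most recently wrote it. The sketch below is an assumption about how such tracking might look, not the code of the CellSs runtime; struct tracker, track_param and count_matmul_edges are hypothetical names. It models true (read-after-write) dependences only and leaves out renaming.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64   /* capacity of the illustrative tracking table */

enum direction { DIR_INPUT, DIR_INOUT };

/* Hypothetical bookkeeping: for each data block seen so far, the id of
   the last task that wrote it (-1 if none). */
struct tracker {
    const void *block[MAX_BLOCKS];
    int last_writer[MAX_BLOCKS];
    int n_blocks;
    int n_edges;        /* read-after-write edges added to the task graph */
};

static int find_or_add(struct tracker *t, const void *p)
{
    for (int s = 0; s < t->n_blocks; s++)
        if (t->block[s] == p)
            return s;
    assert(t->n_blocks < MAX_BLOCKS);
    t->block[t->n_blocks] = p;
    t->last_writer[t->n_blocks] = -1;
    return t->n_blocks++;
}

/* Record one parameter of task `task`. Returns the id of the task that
   produced the data, or -1 if no earlier task wrote it. */
int track_param(struct tracker *t, int task, const void *p, enum direction d)
{
    int s = find_or_add(t, p);
    int producer = t->last_writer[s];

    if (producer >= 0)
        t->n_edges++;               /* edge: producer -> task */
    if (d == DIR_INOUT)
        t->last_writer[s] = task;   /* this task is now the last writer */
    return producer;
}

/* Replays the matrix multiply loop nest for an nb x nb block grid
   (1 <= nb <= 4) and returns the number of dependence edges found. */
int count_matmul_edges(int nb)
{
    static char A[4][4], B[4][4], C[4][4];  /* one byte stands for one block */
    struct tracker t = { .n_blocks = 0, .n_edges = 0 };
    int task = 0;

    assert(nb >= 1 && nb <= 4);
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            for (int k = 0; k < nb; k++) {
                track_param(&t, task, &A[i][k], DIR_INPUT);
                track_param(&t, task, &B[k][j], DIR_INPUT);
                track_param(&t, task, &C[i][j], DIR_INOUT);
                task++;
            }
    return t.n_edges;
}
```

In this model the A and B blocks never create edges because no task writes them, while each C block forms a chain of nb tasks with nb - 1 edges. A 2 x 2 block grid therefore yields 4 edges and a 3 x 3 grid yields 18.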


Execution behavior: Matrix multiply

...

#pragma css task input(A, B) inout(C)
block_addmultiply(C[i][j], A[i][k], B[k][j])

[Figure: blocks C[i][j], A[i][k] and B[k][j]]

• For each operation, two blocks of data are fetched from PPE memory into SPE local storage

• Clusters of dependent tasks are scheduled to the same SPE

• The inout block is kept in the local storage and only put in PPE memory once (reuse)


Execution behavior: Matrix multiply

[Trace figure] Annotations:

• Clustering: chain of 7 block multiplies (270 µs)

• Size of block: 64x64 floats

• Stage in/out

• Reuse

• Main thread: task generation

• Helper thread
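The effect of reuse on a clustered chain can be estimated with simple arithmetic. The sketch below counts block transfers for a chain of tasks that all update the same inout block. The reuse case follows the slides: two input blocks are fetched per task and the inout block is written back once (plus one initial fetch, which is an assumption here). The baseline without reuse, in which every task stages in all three blocks and stages out the result, is also an assumption, and all function names are hypothetical.

```c
#include <assert.h>

#define BYTES_PER_FLOAT 4   /* single-precision element */

/* Size in bytes of one bs x bs block of floats. */
long block_bytes(long bs)
{
    assert(bs > 0);
    return bs * bs * BYTES_PER_FLOAT;
}

/* Assumed baseline: every task stages in A, B and C and stages out C. */
long transfers_without_reuse(long chain_len)
{
    assert(chain_len >= 1);
    return 4 * chain_len;
}

/* With reuse: A and B are fetched per task; the inout block C is
   fetched once at the start and put back once at the end. */
long transfers_with_reuse(long chain_len)
{
    assert(chain_len >= 1);
    return 2 * chain_len + 2;
}
```

For the chain of 7 block multiplies with 64x64 float blocks, each block is 16384 bytes, and the count drops from 28 transfers (458752 bytes) to 16 transfers (262144 bytes) under these assumptions.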


Execution behavior: Matrix multiply

Waiting for SPE availability

Schedule & dispatch


Execution behavior: Matrix multiply

Stage out and notification

Task generation

Dispatch

Schedule

Graph update


Execution behavior: Sparse LU

Priority hints

#pragma css task highpriority …

Increase parallelism / support scheduling

Support reuse


Execution behavior: J_Check_LU

copy_mat(A, origA);
LU(A);
split_mat(A, L, U);
clean_mat(A);
sparse_matmult(L, U, A);
compare_mat(origA, A);

[Figure: execution traces without CellSs and with CellSs]

...


Execution behavior: J_Check


Execution behavior: Other views

Stage in bandwidth

Stage out bandwidth

Task generation lookahead

Full unrolling before execution

Overlapped generation/execution


Scalability

[Figures: three speedup plots, each with the number of SPUs on the x-axis and speedup on the y-axis: "Matmul speedup results", "SparseLU speedup results" and "Matmul speedup results (v2)". Annotation: "Faster tasks (pre-fetching data)".]


Related work

•Sequoia

• Just presented!

•Charm++

• Runtime tailored to Cell BE

• Offload API

•Octopiler (IBM)

• Auto-SIMDization

• OpenMP as programming model

• Single shared-memory abstraction


Conclusions & Ongoing work

•Cell Superscalar offers a simple programming model for the Cell BE

• Allows easy porting of applications

• General

• Constraints:

• Blocking

•Ongoing work

• Runtime optimization: overheads, halos, scheduling algorithms, overlap phases, overlays, speculation, short-circuits, more helper threads, lazy renaming, …

• Garbage collection

•Applications

• Bio

• Engineering

•To be distributed as open source soon


THANKS!

Visit us at BSC booth #1800 for further information