The Potential of the Cell Processor for Scientific Computing

Sam Williams (samw@cs.berkeley.edu)
Computational Research Division / Future Technologies Group
July 12, 2006


Outline

• Cell Architecture
• Programming Cell
• Benchmarks & Performance
– Stencils on Structured Grids
– Sparse Matrix-Vector Multiplication
– Matrix-Matrix Multiplication
– FFTs
• Summary


Cell Architecture


Cell Chip Architecture

• 9-core SMP
– One core is a conventional cache-based PPC
– The other 8 are local-memory-based SIMD processors (SPEs)
• Cores connected via 4 rings (4 x 128b @ 1.6GHz)
• 25.6GB/s memory bandwidth (128b @ 1.6GHz) to XDR DRAM
• I/O channels can be used to build multichip SMPs


Cell Chip Characteristics

• Core frequency up to 3.2GHz
– Limited access to 2.1GHz hardware
– Faster parts (2.4GHz & 3.2GHz) became available after the paper
• ~221mm² chip
– Much smaller than Itanium2 (1/2 to 1/3 the area)
– About the size of a Power5
– Comparable to Opteron/Pentium before their die shrinks
• ~500W blades (2 chips + DRAM + network)


SPE Architecture

• 128b SIMD
– Dual issue (FP/ALU + load/store/permute/etc.), but not in DP
– Statically scheduled, in-order
– 4 FMA SP datapaths, 6-cycle latency
– 1 FMA DP datapath, 13-cycle latency (including 7 stall cycles)
– Software-managed BTB (branch hints)
• Local-memory-based architecture (two steps to access DRAM)
– DMA commands are queued in the MFC
– 128b @ 1.6GHz to the local store (256KB)
– Aligned loads from the local store to the register file (128 x 128b)
• ~3W per SPE @ 3.2GHz


Programming Cell


Estimation, Simulation and Exploration

• Perform analysis before writing any code
• Conceptualize the algorithm with
– double buffering + long DMAs + an in-order machine
– static timing analysis + memory traffic modeling
• Model performance (a minimal sketch follows below)
– For regular data structures, spreadsheet modeling works
– SpMV requires more advanced modeling
• Full System Simulator
• Cell+ (a modeled variant with improved double-precision issue)
– How severely does a DP throughput of 1 SIMD instruction every 7 or 8 cycles impair performance?
– Cell+: 1 DP SIMD instruction every 2 cycles
– Allows dual issue of DP instructions with loads/stores/permutes/branches
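As an illustration of the spreadsheet-style modeling above, here is a minimal sketch in C: under double buffering, predicted time is the max of compute time and DMA time. All constants (problem size, peak rates) are illustrative assumptions, not measurements from the slides.

#include <stdio.h>

/* Minimal performance model for a double-buffered SPE kernel:
 * computation and DMA overlap, so time is the max of the two.
 * All constants below are illustrative assumptions. */
int main(void) {
    double flops       = 7.0 * 256 * 256 * 256;   /* e.g. 7-point stencil, 256^3 grid */
    double bytes       = 16.0 * 256 * 256 * 256;  /* one 8B load + one 8B store per point */
    double peak_gflops = 14.6;                    /* assumed DP peak, 8 SPEs @ 3.2GHz */
    double peak_gbs    = 25.6;                    /* XDR DRAM bandwidth */

    double t_compute = flops / (peak_gflops * 1e9);
    double t_dma     = bytes / (peak_gbs * 1e9);
    double t         = t_compute > t_dma ? t_compute : t_dma;

    printf("%s-bound, predicted %.2f GFLOP/s\n",
           t_compute > t_dma ? "compute" : "memory", flops / t / 1e9);
    return 0;
}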


Cell Programming

• Modified SPMD (Single Program Multiple Data)
– Dual Program Multiple Data
– Hierarchical SPMD
– A rather clunky approach compared to MPI or pthreads
• The Power core is used to:
– Load/initialize data structures
– Spawn SPE threads
– Parallelize data structures
– Pass pointers to the SPEs
– Synchronize SPE threads
– Communicate with other processors
– Perform other I/O operations


SPE Programming

• Software-controlled memory
– Use pointers from the PPC to construct global addresses
– The programmer handles transfers from global memory to the local store
– The compiler handles transfers from the local store to the register file
– Most applicable when the address stream is long and independent
• DMAs
– Granularity is 1, 2, 4, 8, or N×16 bytes
– Issued with very low-level intrinsics (no error checking)
– GETL = stanza gather/scatter = distributed in global memory, packed in the local store
• Double buffering (see the sketch below)
– The SPU and MFC operate in parallel
– Operate on the current data set while transferring the next/last
– Time is the max of computation time and communication time
– Prologue and epilogue penalties
• Although more verbose, intrinsics are an effective way of delivering peak performance.
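To make the double-buffering pattern concrete, here is a minimal sketch using the Cell SDK's spu_mfcio.h intrinsics. The buffer size, element type, and process_chunk kernel are illustrative assumptions, and total is assumed to be a positive multiple of CHUNK.

#include <spu_mfcio.h>

#define CHUNK 2048  /* doubles per DMA (16KB, the MFC's maximum transfer size) */

volatile double buf[2][CHUNK] __attribute__((aligned(128)));

void process_chunk(volatile double *p, int n);  /* hypothetical compute kernel */

/* Stream 'total' doubles from effective address 'ea', overlapping DMA
 * transfers (MFC) with computation (SPU) using two buffers and two tags. */
void stream(unsigned long long ea, int total) {
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK * sizeof(double), cur, 0, 0);   /* prologue */
    for (int i = CHUNK; i < total; i += CHUNK) {
        int nxt = cur ^ 1;
        mfc_get(buf[nxt], ea + i * sizeof(double),
                CHUNK * sizeof(double), nxt, 0, 0);  /* start the next transfer */
        mfc_write_tag_mask(1 << cur);                /* wait only for the current one */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);              /* compute while nxt arrives */
        cur = nxt;
    }
    mfc_write_tag_mask(1 << cur);                    /* epilogue */
    mfc_read_tag_status_all();
    process_chunk(buf[cur], CHUNK);
}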


Benchmark Kernels

– Stencil Computations on Structured Grids
– Sparse Matrix-Vector Multiplication
– Matrix-Matrix Multiplication
– 1D FFTs
– 2D FFTs


Processors Used

Note: Cell performance does not include the Power core.


Stencil Operations on Structured Grids


Stencil Operations

• Simple example: the heat equation
– ∂T/∂t = k∇²T
– Parabolic PDE on a 3D discretized scalar domain
• Jacobi style (read from the current grid, write to the next grid)
– 7 FLOPs per point, typically double precision
– Next[x,y,z] = Alpha*Current[x,y,z]
  + Current[x-1,y,z] + Current[x+1,y,z]
  + Current[x,y-1,z] + Current[x,y+1,z]
  + Current[x,y,z-1] + Current[x,y,z+1]
– Doesn't exploit the FMA pipeline well
– Essentially 6 streams presented to the memory subsystem (see the C sketch below)
• Explicit ghost zones bound the grid

[Diagram: the 7-point stencil: the point [x,y,z] and its six neighbors [x-1,y,z], [x+1,y,z], [x,y-1,z], [x,y+1,z], [x,y,z-1], [x,y,z+1]]
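For concreteness, a scalar C version of the Jacobi update above; the grid dimensions and ghost-zone convention are illustrative, and the tuned SPE version would be SIMDized and DMA-driven:

/* Naive 7-point Jacobi sweep with one-cell ghost zones.
 * Grids are NX x NY x NZ including ghosts; Alpha weights the center. */
#define NX 258
#define NY 258
#define NZ 258
#define IDX(x, y, z) ((z) * NY * NX + (y) * NX + (x))

void sweep(const double *Current, double *Next, double Alpha) {
    for (int z = 1; z < NZ - 1; z++)
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++)        /* 7 FLOPs per point */
                Next[IDX(x, y, z)] = Alpha * Current[IDX(x, y, z)]
                    + Current[IDX(x - 1, y, z)] + Current[IDX(x + 1, y, z)]
                    + Current[IDX(x, y - 1, z)] + Current[IDX(x, y + 1, z)]
                    + Current[IDX(x, y, z - 1)] + Current[IDX(x, y, z + 1)];
}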


Optimization - Planes

• The naive approach (as on a cacheless vector machine) is to load 5 streams and store one
• That is 7 FLOPs per 48 bytes of DRAM traffic
– memory limits performance to 25.6 GB/s × 7/48 ≈ 3.7 GFLOP/s
• A better approach is to make each DMA the size of a plane
– cache the 3 most recent planes (z-1, z, z+1) in the local store
– there are then only two streams (one load, one store), i.e. 7 FLOPs per 16 bytes
– memory now limits performance to 11.2 GFLOP/s
• Each plane must still be computed on after it is loaded (see the sketch below)
– e.g. forall Current_local[x,y] update Next_local[x,y]
– Note: computation can severely limit performance
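A sketch of the plane-cached traversal, reusing the previous listing's grid macros; plane_in and plane_out are hypothetical stand-ins for the DMA transfers (on real hardware a full plane may exceed the 256KB local store, which is what the cache blocking on a later slide addresses):

void plane_in(double (*p)[NX], int z);   /* hypothetical: DMA plane z in */
void plane_out(double (*p)[NX], int z);  /* hypothetical: DMA plane z out */

/* Keep planes z-1, z, z+1 resident in a circular 3-plane buffer and
 * stream one plane in and one plane out per step: two memory streams. */
void sweep_planes(double Alpha) {
    static double P[3][NY][NX], Out[NY][NX];
    plane_in(P[0], 0);
    plane_in(P[1], 1);
    for (int z = 1; z < NZ - 1; z++) {
        plane_in(P[(z + 1) % 3], z + 1);            /* fetch plane z+1 */
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++)
                Out[y][x] = Alpha * P[z % 3][y][x]
                    + P[z % 3][y][x - 1] + P[z % 3][y][x + 1]
                    + P[z % 3][y - 1][x] + P[z % 3][y + 1][x]
                    + P[(z - 1) % 3][y][x] + P[(z + 1) % 3][y][x];
        plane_out(Out, z);                          /* store plane z */
    }
}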


Optimization - Double Buffering

• Add an input stream buffer and an output stream buffer (keep 6 planes in the local store)
• The two phases (transfer & compute) are now overlapped
• Thus it is possible to hide the faster of the DMA transfer time and the computation time


Optimization - Cache Blocking

• Domains can be quite large (~1GB)
• A single plane, let alone 6, might not fit in the local store
• Partition the domain (and thus the planes) into cache blocks so that 6 planes can fit in the local store
• Added benefit: cache blocks are independent and thus can be parallelized across multiple SPEs
• Memory efficiency can be diminished, as an intra-grid ghost zone is implicitly created


Optimization - Register Blocking

• Instead of computing on pencils, compute on ribbons (4x2); a scalar sketch follows below
• Hides functional unit & local store latency
• Minimizes local store memory traffic
• Minimizes loop overhead
• May not be beneficial / noticeable for cache-based machines
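A scalar sketch of the 4x2 ribbon, reusing the stencil listing's macros; the real SPE kernel would express this with 128b SIMD intrinsics and fully unrolled registers, and the remainder loops are omitted:

/* Register-blocked inner loop: update a 4-wide x 2-deep ribbon per
 * iteration instead of a single-point pencil; the 8 partial results
 * stay in registers, amortizing neighbor loads and loop overhead. */
void ribbon_row(const double *Current, double *Next, double Alpha,
                int y, int z) {
    for (int x = 1; x + 3 < NX - 1; x += 4)
        for (int dy = 0; dy < 2; dy++)       /* would be fully unrolled */
            for (int dx = 0; dx < 4; dx++) {
                int i = IDX(x + dx, y + dy, z);
                Next[i] = Alpha * Current[i]
                    + Current[i - 1]       + Current[i + 1]
                    + Current[i - NX]      + Current[i + NX]
                    + Current[i - NX * NY] + Current[i + NX * NY];
            }
}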


Optimization - Time Skewing

• If the application allows it, perform multiple time steps within the local store
• Only appropriate for memory-bound implementations, since it improves computational intensity
– i.e. single precision, or improved (Cell+) double precision
• Simple approach
– Overlapping trapezoids in the time-space plot
– Can be inefficient due to duplicated work
• If performing two steps, the local store must now hold 9 planes (thus smaller cache blocks)
• If performing n steps, the local store must hold 3(n+1) planes

[Diagram: overlapping trapezoids spanning time steps t, t+1, and t+2]


Stencil Performance

[Bar charts: stencil GFLOP/s, values as labeled on the charts, in legend order]
Double precision: 2.1GHz Cell (8 SPEs) 4.49, 2.1GHz Cell (simulator) 4.77, 3.2GHz Cell+ (model) 10.6, 3.2GHz Cell (simulator) 7.25, X1E 3.91, Opteron 0.57, Itanium2 1.19
Single precision: Cell 65.8, X1E 3.26, Opteron 1.07, Itanium2 1.97

• Notes:
– The performance model, when accounting for limited dual issue, matches well with the Full System Simulator and hardware
– Double precision runs don't exploit time skewing
– In single precision, time skewing of 4 steps was used
– Results are the best of 128³ and 256³ problem sizes


Sparse Matrix-Vector Multiplication


• Most of the matrix entries are zero, so the nonzeros are sparsely distributed
• Can be used for unstructured grid problems
• Issues
– Like DGEMM, it can exploit an FMA well
– Very low computational intensity (1 FMA for every 12+ bytes)
– Non-FP instructions can dominate
– Can be very irregular
– Row lengths can be unique and very short


Compressed Sparse Row

• Compressed Sparse Row (CSR) is the standard format (a reference implementation follows)
– Array of nonzero values
– Array of the corresponding column for each nonzero value
– Array of row starts, containing the index (in the above arrays) of the first nonzero in each row
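For reference, a scalar CSR SpMV in C; the array names are conventional rather than from the slides:

/* y = A*x for an m-row sparse matrix in CSR format.
 * val[k]     : k-th nonzero value
 * col[k]     : column index of the k-th nonzero
 * rowstart[i]: index of row i's first nonzero; rowstart[m] == nnz */
void spmv_csr(int m, const double *val, const int *col,
              const int *rowstart, const double *x, double *y) {
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = rowstart[i]; k < rowstart[i + 1]; k++)
            sum += val[k] * x[col[k]];  /* one FMA per ~12 bytes read */
        y[i] = sum;
    }
}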


Optimization - Double Buffer Nonzeros

• Computation and communication are approximately equally expensive
• While operating on the current set of nonzeros, load the next (~1K-nonzero buffers)
• Needs somewhat intricate code to stop and restart a row between buffers
• Can nearly double performance


Optimization - SIMDization

• BCSR
– Nonzeros are grouped into r x c blocks
– O(nnz) explicit zeros are added
– Choose r & c so that blocks mesh well with 128b registers
– Performance can suffer, especially in DP, as computing on zeros is very wasteful
– Can hide latency and amortize loop overhead
– Only used in the initial performance model
• Row Padding
– Pad rows to the nearest multiple of 128b
– Might require O(N) explicit zeros
– Loop overhead is still present
– Generally works better in double precision


Optimization - Cache Blocked Vectors

• Doubleword DMA gathers from DRAM can be expensive
• Cache block the source and destination vectors
• The local store is finite, so what's the best aspect ratio?
• DMA large blocks into the local store
• Gather operations into the local store
– ISA vs. memory subsystem inefficiencies
– Exploits temporal and spatial locality within the SPE
• In effect, the sparse matrix is explicitly blocked into submatrices, and we can skip, or otherwise optimize, empty submatrices
• Indices are now relative to the cache block
– half words
– reduces memory traffic by 16%


Optimization - Load Balancing

• Potentially irregular problem
• Load imbalance can severely hurt performance
• Partitioning by rows is clearly insufficient
• Partitioning by nonzeros is inefficient when the average number of nonzeros per row per cache block is small
• Define a cost function of the number of row starts and the number of nonzeros (a sketch follows below)
• Determine the parameters via static timing analysis or tuning
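A sketch of one such cost-model partitioner; the weights alpha (per row start) and beta (per nonzero) are the tunable parameters the slide mentions, and the greedy splitting is an illustrative choice:

/* Split m CSR rows among nspe SPEs so each gets ~equal modeled cost,
 * where cost(row i) = alpha + beta * nnz(i).
 * On return, SPE s owns rows [first[s], first[s+1]); first[nspe] = m. */
void partition(int m, const int *rowstart, int nspe,
               double alpha, double beta, int *first) {
    double total = alpha * m + beta * rowstart[m];
    double per_spe = total / nspe, acc = 0.0;
    int s = 0;
    first[0] = 0;
    for (int i = 0; i < m && s < nspe - 1; i++) {
        acc += alpha + beta * (rowstart[i + 1] - rowstart[i]);
        if (acc >= per_spe * (s + 1))
            first[++s] = i + 1;
    }
    while (s < nspe - 1) first[++s] = m;  /* handle degenerate cases */
    first[nspe] = m;
}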


Benchmark Matrices

• 4 nonsymmetric SPARSITY matrices
• 6 symmetric SPARSITY matrices
• 7-point heat equation matrix


Other Approaches

• BeBOP / OSKI on the Itanium2 & Opteron
– uses BCSR
– auto-tunes for the optimal r x c blocking
– the Cell implementation is similar
• Cray's routines on the X1E
– Report the best of CSRP, Segmented Scan & Jagged Diagonal


Double Precision SpMV Performance

• The 16-SPE (dual-chip) version needs a broadcast optimization
• Frequency helps (the kernel is mildly computationally bound)
• Cell+ doesn't help much more (non-FP instruction bandwidth is the limit)

[Bar chart: double precision SpMV GFLOP/s for the matrix categories PDE, FEM, Memory, Circuit, CFD, and Average. Series: 2.1GHz Cell (16 SPEs), 2.1GHz Cell (8 SPEs), 2.1GHz Cell (simulator), 3.2GHz Cell+ (simulator), 3.2GHz Cell (simulator), X1E (best of JAG, CSRP, ELL, ...), Opteron (OSKI), Itanium2 (OSKI). The Cell configurations span roughly 1.3 to 5.6 GFLOP/s; the X1E, Opteron, and Itanium2 fall roughly between 0.2 and 1.6 GFLOP/s.]


Dense Matrix-Matrix Multiplication (performance model only)


Dense Matrix-Matrix Multiplication

• Blocking (a sketch of the blocked kernel follows below)
– Explicit (BDL, block data layout) or implicit blocking (gather stanzas)
– A hybrid method would be to convert and store to DRAM on the fly
– Choose a block size so that the kernel is computationally bound: 64² in single precision
• much easier in double precision (14x the computational time, 2x the transfer time)
• Parallelization
– Partition A & C among the SPEs
– Future work: Cannon's algorithm
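The shape of the blocked kernel in C for reference; BLK is the block edge (the slides suggest 64 in single precision), N is assumed to be a multiple of BLK, and the BDL conversion and DMA machinery are omitted:

#define BLK 64  /* 64x64 SP blocks keep the kernel computationally bound */

/* C += A*B for N x N row-major matrices, N a multiple of BLK.
 * Each (ib, jb) block of C could be assigned to a different SPE. */
void gemm_blocked(int N, const float *A, const float *B, float *C) {
    for (int ib = 0; ib < N; ib += BLK)
        for (int jb = 0; jb < N; jb += BLK)
            for (int kb = 0; kb < N; kb += BLK)
                for (int i = ib; i < ib + BLK; i++)
                    for (int k = kb; k < kb + BLK; k++) {
                        float a = A[i * N + k];
                        for (int j = jb; j < jb + BLK; j++)
                            C[i * N + j] += a * B[k * N + j];  /* FMA */
                    }
}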


GEMM Performance

[Bar charts: GEMM GFLOP/s, values as labeled on the charts, in legend order]
Double precision: 3.2GHz Cell+ (model) 51.1, 3.2GHz Cell (model) 14.6, X1E 16.9, Opteron 4.0, Itanium2 5.4
Single precision: Cell 204.7, X1E 29.5, Opteron 7.8, Itanium2 3.0


Fourier Transforms (performance model only)


1D Fast Fourier Transforms

• Naive algorithm
– Load the roots of unity
– Load the data (cyclic layout)
– Local work, on-chip transpose, local work
– i.e. the SPEs cooperate on a single FFT
– No overlap of communication and computation


2D Fast Fourier Transforms

• Each SPE performs 2 * (N/8) FFTs
• Double buffer the rows
– overlap communication and computation
– 2 incoming, 2 outgoing
• Straightforward algorithm (N² 2D FFT): N simultaneous FFTs, transpose, N simultaneous FFTs, transpose (a sketch follows below)
• Long DMAs necessitate the transposes
• The transposes represent about 50% of total SP execution time
• SP simultaneous FFTs run at ~75 GFLOP/s
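The row-column structure described above, sketched in C; fft1d is a hypothetical stand-in for any in-place 1D complex FFT, not an API from the slides:

#include <complex.h>

void fft1d(double complex *row, int n);  /* hypothetical in-place 1D FFT */

/* 2D FFT of an N x N complex grid by the row-column method:
 * N row FFTs, transpose, N row FFTs, transpose back. The explicit
 * transposes keep every DMA long and contiguous. */
void fft2d(double complex *a, double complex *tmp, int N) {
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < N; i++)
            fft1d(&a[i * N], N);            /* N simultaneous 1D FFTs */
        for (int i = 0; i < N; i++)         /* transpose a into tmp */
            for (int j = 0; j < N; j++)
                tmp[j * N + i] = a[i * N + j];
        for (int i = 0; i < N * N; i++)     /* copy back */
            a[i] = tmp[i];
    }
}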


FFT Performance

[Bar chart: FFT GFLOP/s for small 1D and large 2D problems, in double and single precision. Series: Cell+, Cell, X1E, Opteron, Itanium2. Cell and Cell+ lead throughout, peaking around 38 GFLOP/s in single precision; the other platforms' values range from roughly 0.3 to 8 GFLOP/s.]


Conclusions


• Far more predictable than conventional out-of-order machines
• Even in double precision, Cell obtains much better performance on a surprising variety of codes
• Cell can eliminate unneeded memory traffic and hide memory latency, and thus achieves a much higher percentage of memory bandwidth
• The instruction set can be very inefficient for poorly SIMDizable or misaligned codes
• Loop overheads can heavily dominate performance
• The programming model is clunky


Acknowledgments

• This work (paper at CF'06, poster at EDGE'06) is a collaboration with the following FTG members: John Shalf, Lenny Oliker, Shoaib Kamil, Parry Husbands, and Kathy Yelick
• Additional thanks to Joe Gebis and David Patterson
• X1E FFT numbers provided by Bracy Elton and Adrian Tate
• Cell access provided by:
– Mike Perrone (IBM Watson)
– Otto Wohlmuth (IBM Germany)
– Nehal Desai (Los Alamos National Lab)


Questions?