
IBM Cell for Linear Algebra Computations


Page 1: IBM Cell for Linear Algebra Computations

Presented by

The Innovative Computing Laboratory, University of Tennessee, Knoxville

Oak Ridge National Laboratory

IBM Cell for Linear Algebra Computations

Page 2: IBM Cell for Linear Algebra Computations


SPE ~ 25 Gflop/s peak

With the hype around the Cell and the PS3, we became interested. The PlayStation 3's CPU is based on a "Cell" processor. Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing element: SPU + DMA engine).

An SPE is a self-contained vector processor that acts independently of the others. Its 4-way SIMD floating-point units are capable of a total of 25.6 Gflop/s at 3.2 GHz.

204.8 Gflop/s peak! The catch is that this is for 32-bit floating point (single precision, SP), while 64-bit floating point runs at only 14.6 Gflop/s total for all 8 SPEs.

Divide the SP peak by 14: a factor of 2 because of DP and a factor of 7 because of latency issues.
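As a back-of-the-envelope check, the per-SPE and total SP peaks quoted above follow from the 4-way SIMD units issuing a fused multiply-add (2 flops) per lane per cycle at 3.2 GHz:

    4 SIMD lanes × 2 flops (fused multiply-add) × 3.2 GHz = 25.6 Gflop/s per SPE
    8 SPEs × 25.6 Gflop/s per SPE = 204.8 Gflop/s SP peak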

Page 3: IBM Cell for Linear Algebra Computations


Processor               Size   SGEMM/DGEMM   Size   SGEMV/DGEMV
AMD Opteron 246         3000   2.00          5000   1.70
UltraSPARC-IIe          3000   1.64          5000   1.66
Intel PIII Coppermine   3000   2.03          5000   2.09
PowerPC 970             3000   2.04          5000   1.44
Intel Woodcrest         3000   1.81          5000   2.18
Intel XEON              3000   2.04          5000   1.82
Intel Centrino Duo      3000   2.71          5000   2.21

Performance of single precision on conventional processors: we see a similar situation on commodity processors, that is, SP is roughly 2x as fast as DP on many systems.

• The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle DP, 4 flops/cycle SP.
• IBM PowerPC has AltiVec: 8 flops/cycle SP, 4 flops/cycle DP (no DP on AltiVec itself).

Single precision is faster because of higher parallelism in the SSE/vector units, reduced data motion, and higher locality in cache.
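The ratios above were measured with vendor BLAS; as a rough illustration of the same effect, one can time float32 against float64 matrix multiplies with NumPy on a current machine (a minimal sketch, not the benchmark behind the table):

    import time
    import numpy as np

    def gemm_gflops(dtype, n=3000, reps=3):
        """Time C = A @ B in the given precision and return the best Gflop/s."""
        rng = np.random.default_rng(0)
        a = rng.standard_normal((n, n)).astype(dtype)
        b = rng.standard_normal((n, n)).astype(dtype)
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            a @ b
            best = min(best, time.perf_counter() - t0)
        return 2.0 * n**3 / best / 1e9  # a matrix multiply does 2*n^3 flops

    sp = gemm_gflops(np.float32)
    dp = gemm_gflops(np.float64)
    print(f"SGEMM {sp:.1f} Gflop/s, DGEMM {dp:.1f} Gflop/s, ratio {sp/dp:.2f}")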

Page 4: IBM Cell for Linear Algebra Computations


32 or 64 bit floating point precision?

Long ago, 32-bit floating point was the norm; it is still used in scientific apps, but its use is limited.

Most apps use 64-bit floating point because of:

• Accumulation of round-off error: a 10 Tflop/s computer running for 4 hours performs more than 10^17 floating-point operations.
• Ill-conditioned problems: the IEEE SP exponent has too few bits (8 bits, range roughly 10^±38).
• Critical sections that need higher precision; sometimes extended precision (128-bit floating point) is needed.

However, some applications can get by with 32-bit floating point in parts of the computation. Mixed precision is a possibility: approximate in lower precision and then refine or improve the solution to high precision.

Page 5: IBM Cell for Linear Algebra Computations


Idea goes something like this…

Exploit 32-bit floating point as much as possible, especially for the bulk of the computation. Correct or update the solution with selective use of 64-bit floating point to provide refined results.

Intuitively: compute a 32-bit result, calculate a correction to that result using selected higher precision, and apply the correction to the 32-bit result in high precision.

Page 6: IBM Cell for Linear Algebra Computations


    L U = lu(A)                      SINGLE   O(n^3)
    x = L\(U\b)                      SINGLE   O(n^2)
    r = b - Ax                       DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)                  SINGLE   O(n^2)
        x = x + z                    DOUBLE   O(n)
        r = b - Ax                   DOUBLE   O(n^2)
    END

• Iterative refinement for dense systems, Ax = b, can work this way.

Mixed-precision iterative refinement
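As a concrete illustration of the loop above, here is a minimal sketch in Python with SciPy's LU routines; it follows the algorithm on this slide but is not the Cell implementation, and the function name and stopping test are choices made here:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, max_iter=30):
        """Solve Ax = b: O(n^3) LU factorization in float32, O(n^2) refinement in float64."""
        A64 = np.asarray(A, dtype=np.float64)
        b64 = np.asarray(b, dtype=np.float64)
        n = A64.shape[0]

        lu, piv = lu_factor(A64.astype(np.float32))                         # SINGLE, O(n^3)
        x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)  # SINGLE, O(n^2)

        for _ in range(max_iter):
            r = b64 - A64 @ x                                               # DOUBLE, O(n^2)
            backward_err = np.linalg.norm(r, np.inf) / (
                np.linalg.norm(A64, np.inf) * np.linalg.norm(x, np.inf)
                + np.linalg.norm(b64, np.inf))
            if backward_err <= np.sqrt(n) * np.finfo(np.float64).eps:
                break
            z = lu_solve((lu, piv), r.astype(np.float32))                   # SINGLE, O(n^2)
            x = x + z.astype(np.float64)                                    # DOUBLE, O(n)
        return x

    # Example: solve a random system and print the relative residual
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 1000))
    b = rng.standard_normal(1000)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))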

Page 7: IBM Cell for Linear Algebra Computations


Mixed-precision iterative refinement

Iterative refinement for dense systems, Ax = b, can work this way.

Wilkinson, Moler, Stewart, and Higham provide error bounds for SP floating-point results refined using DP floating point.

It can be shown that, using this approach, we can compute the solution to 64-bit floating point precision.

Requires extra storage, total is 1.5 times normal.

O(n3) work is done in lower precision.

O(n2) work is done in high precision.

Problems arise if the matrix is ill-conditioned in SP, i.e., the condition number is around 10^8 or larger.
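A quick way to screen for that failure mode, using the slide's rough 10^8 threshold (the helper name and the explicit condition-number computation are just for illustration):

    import numpy as np

    def sp_refinement_likely_ok(A, threshold=1e8):
        """Rough rule of thumb: SP-factorization-based refinement tends to fail as cond(A) approaches ~10^8."""
        return np.linalg.cond(np.asarray(A, dtype=np.float64)) < threshold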

Page 8: IBM Cell for Linear Algebra Computations


Single precision is faster than DP because of:

• Higher parallelism within the vector units: 4 ops/cycle (usually) instead of 2 ops/cycle.
• Reduced data motion: 32-bit data instead of 64-bit data.
• Higher locality in cache: more data items fit in cache.

[Figure legend. Architectures (BLAS): 1 Intel Pentium III Coppermine (Goto), 2 Intel Pentium III Katmai (Goto), 3 Sun UltraSPARC IIe (Sunperf), 4 Intel Pentium IV Prescott (Goto), 5 Intel Pentium IV-M Northwood (Goto), 6 AMD Opteron (Goto), 7 Cray X1 (libsci), 8 IBM PowerPC G5 (2.7 GHz) (VecLib), 9 Compaq Alpha EV6 (CXML), 10 IBM SP Power3 (ESSL), 11 SGI Octane (ATLAS).]

Results for mixed precision iterative refinement for dense Ax = b

Page 9: IBM Cell for Linear Algebra Computations



Architecture (BLAS-MPI)            No. of procs     n       DP solve/SP solve   DP solve/iter ref   No. of iter
AMD Opteron (Goto – OpenMPI MX)         32         22627          1.85                1.79               6
AMD Opteron (Goto – OpenMPI MX)         64         32000          1.90                1.83               6

Results for mixed precision iterative refinement for dense Ax = b

Page 10: IBM Cell for Linear Algebra Computations


What about the Cell?

PowerPC at 3.2 GHz: DGEMM at 5 Gflop/s; AltiVec peak at 25.6 Gflop/s, with 10 Gflop/s achieved in SGEMM.

The 8 SPUs: 204.8 Gflop/s peak!

The catch is that this is for 32 bit floating point (single precision SP).

64-bit floating point runs at only 14.6 Gflop/s total for all 8 SPEs. Divide the SP peak by 14: a factor of 2 because of DP and a factor of 7 because of latency issues.

Page 11: IBM Cell for Linear Algebra Computations


Each SPE has a 256 KB local store; the injection bandwidth is 25.6 GB/s.

Worst-case, memory-bound operations (no reuse of data), with three data movements (2 in, 1 out) per 2 ops (SAXPY), would run at about 4.3 Gflop/s on the Cell (25.6 GB/s × 2 ops / 12 B).

Moving data around on the Cell
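The same bandwidth bound can be evaluated for any streaming kernel with no data reuse; a tiny sketch (the function is illustrative, not from the slides):

    def bandwidth_bound_gflops(bandwidth_gb_s, flops_per_element, bytes_per_element):
        """Upper bound on Gflop/s for a memory-bound kernel with no data reuse."""
        return bandwidth_gb_s * flops_per_element / bytes_per_element

    # SP SAXPY on the Cell: 2 flops per element, 12 bytes moved (load x, load y, store y)
    print(bandwidth_bound_gflops(25.6, 2, 12))  # about 4.3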

Page 12: IBM Cell for Linear Algebra Computations


[Figure: IBM Cell 3.2 GHz, Ax = b. Gflop/s vs. matrix size (0–4500). Curves: SP peak (204 Gflop/s), 8 SGEMMs (embarrassingly parallel), SP Ax=b (IBM), DP peak (15 Gflop/s), DP Ax=b (IBM). The SP solve takes 0.30 s versus 3.9 s for the DP solve.]

Page 13: IBM Cell for Linear Algebra Computations


[Figure: IBM Cell 3.2 GHz, Ax = b. Gflop/s vs. matrix size (0–4500). Curves: SP peak (204 Gflop/s), 8 SGEMMs (embarrassingly parallel), SP Ax=b (IBM), DSGESV (mixed precision), DP peak (15 Gflop/s), DP Ax=b (IBM). Timings: SP solve 0.30 s, DSGESV 0.47 s, DP solve 3.9 s, an 8.3x speedup of DSGESV over the DP solve.]

Page 14: IBM Cell for Linear Algebra Computations


For the SPEs: standard C code and C-language SIMD extensions (intrinsics).

[Figure: single precision performance vs. mixed precision performance using iterative refinement, achieving 64-bit accuracy.]

Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0
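For symmetric positive definite systems the same refinement idea applies with Cholesky in place of LU; a minimal SciPy sketch under the same caveats as the LU version earlier (not the Cell code, and the fixed iteration count is arbitrary):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def mixed_precision_cholesky_solve(A, b, iters=5):
        """Ax = b for symmetric positive definite A: float32 Cholesky, float64 refinement."""
        A64 = np.asarray(A, dtype=np.float64)
        b64 = np.asarray(b, dtype=np.float64)
        chol = cho_factor(A64.astype(np.float32))                       # SINGLE, O(n^3)
        x = cho_solve(chol, b64.astype(np.float32)).astype(np.float64)  # SINGLE, O(n^2)
        for _ in range(iters):
            r = b64 - A64 @ x                                           # DOUBLE residual
            x = x + cho_solve(chol, r.astype(np.float32)).astype(np.float64)
        return x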

Page 15: IBM Cell for Linear Algebra Computations


Cholesky—Using two Cell chips

Page 16: IBM Cell for Linear Algebra Computations


Intriguing potential:

• Exploit lower precision as much as possible. The payoff is performance: faster floating point and less data to move.
• Automatically switch between SP and DP to match the desired accuracy: compute the solution in SP and then a correction to the solution in DP.
• Potential for GPUs, FPGAs, and special-purpose processors. What about 16-bit floating point?
• Use as little precision as you can get away with and then improve the accuracy.
• Applies to sparse direct and iterative linear systems, and to eigenvalue and optimization problems where Newton's method is used.

Page 17: IBM Cell for Linear Algebra Computations


IBM/Mercury Cell blade:

• From IBM or Mercury
• 2 Cell chips, each with 8 SPEs
• 512 MB per Cell
• ~$8K–17K
• Some software available

Page 18: IBM Cell for Linear Algebra Computations


Sony PlayStation 3 cluster (PS3-T):

• Cell blade, from IBM or Mercury: 2 Cell chips, each with 8 SPEs; 512 MB per Cell; ~$8K–17K; some software available.
• PS3, from Wal-Mart: 1 Cell chip with 6 SPEs; 256 MB per PS3; $600; downloadable software; dual boot.

Page 19: IBM Cell for Linear Algebra Computations


Cell hardware overview:

• 3.2 GHz clock; one PowerPC PE plus 8 SPEs at 25.6 Gflop/s (SP) each
• 25 GB/s injection bandwidth; 200 GB/s between SPEs
• 32-bit peak: 8 × 25.6 Gflop/s = 204.8 Gflop/s
• 64-bit peak: 8 × 1.8 Gflop/s ≈ 14.6 Gflop/s
• 512 MiB memory

Page 20: IBM Cell for Linear Algebra Computations


PS3 hardware overview:

• 3.2 GHz clock; one PowerPC PE plus 6 usable SPEs at 25.6 Gflop/s (SP) each (the other SPEs are disabled/broken: yield issues)
• 25 GB/s injection bandwidth; 200 GB/s between SPEs
• 32-bit peak: 6 × 25.6 Gflop/s = 153.6 Gflop/s
• 64-bit peak: 6 × 1.8 Gflop/s ≈ 10.8 Gflop/s
• 1 Gb/s NIC
• 256 MiB memory
• Linux runs on top of the GameOS hypervisor

Page 21: IBM Cell for Linear Algebra Computations


PlayStation 3 LU codes

[Figure: Gflop/s vs. matrix size (0–2500). Curves: SP peak (153.6 Gflop/s), 6 SGEMMs (embarrassingly parallel), SP Ax=b (IBM), DP peak (10.9 Gflop/s).]

Page 22: IBM Cell for Linear Algebra Computations


PlayStation 3 LU codes

[Figure: Gflop/s vs. matrix size (0–2500). Curves: SP peak (153.6 Gflop/s), 6 SGEMMs (embarrassingly parallel), SP Ax=b (IBM), DSGESV (mixed precision), DP peak (10.9 Gflop/s).]

Page 23: IBM Cell for Linear Algebra Computations


Cholesky on the PS3, Ax = b, A = A^T, x^T A x > 0

Page 24: IBM Cell for Linear Algebra Computations



HPC in the living room

Page 25: IBM Cell for Linear Algebra Computations



[Execution trace: gold = computation (8 ms), blue = communication (20 ms).]

Matrix multiply on a 4-node PlayStation 3 cluster

What's good:
• Very cheap: about $4 per Gflop/s (of 32-bit floating-point theoretical peak).
• Fast local computations between the SPEs.
• Perfect overlap between communication and computation is possible (with Open MPI running): the PPE does communication via MPI while the SPEs do computation via SGEMMs (see the sketch after this list).

What's bad:
• Gigabit network card: 1 Gb/s is too little for so much computational power (150 Gflop/s per node).
• Linux can only run on top of the GameOS hypervisor: extremely high network access latencies (120 µs) and low bandwidth (600 Mb/s).
• Only 256 MB of local memory.
• Only 6 SPEs.
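A minimal sketch of that overlap pattern with mpi4py and NumPy, written for a generic cluster (the ring exchange, tags, and tile size are illustrative; this is not the PS3 cluster code):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n = 1024  # tile size, chosen arbitrarily for the sketch
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    incoming = np.empty_like(b)

    # Post non-blocking communication (the role the PPE plays on the Cell) ...
    right = (rank + 1) % size
    left = (rank - 1) % size
    send_req = comm.Isend(b, dest=right, tag=0)
    recv_req = comm.Irecv(incoming, source=left, tag=0)

    # ... and overlap it with local single-precision GEMM work (the SPEs' role).
    c = a @ b

    MPI.Request.Waitall([send_req, recv_req])  # communication completes behind the compute
    c += a @ incoming                          # continue with the tile that just arrived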

Page 26: IBM Cell for Linear Algebra Computations



SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3

See the web page for details: www.netlib.org/utk/people/JackDongarra/PAPERS/scop3.pdf

A user's guide for scientific computing on the PS3.

Page 27: IBM Cell for Linear Algebra Computations


Conclusions

For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.

This strategy needs to be rebalanced—barriers to progress are increasingly on the software side.

Moreover, the return on investment is more favorable to software: hardware has a half-life measured in years, while software has a half-life measured in decades.

The high-performance ecosystem is out of balance across hardware, OS, compilers, software, algorithms, and applications; there is no Moore's Law for software, algorithms, and applications.

Page 28: IBM Cell for Linear Algebra Computations



Collaborators / support

Alfredo Buttari, UTK

Julien Langou, UColorado

Julie Langou, UTK

Piotr Luszczek, MathWorks

Jakub Kurzak, UTK

Stan Tomov, UTK