
Page 1:

ON THE FUTURE OF HIGH PERFORMANCE COMPUTING:
HOW TO THINK FOR PETA AND EXASCALE COMPUTING

JACK DONGARRA
UNIVERSITY OF TENNESSEE
OAK RIDGE NATIONAL LAB

Page 2:

What Is LINPACK?

LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.

LINPACK: "LINear algebra PACKage," written in Fortran 66.

The project had its origins in 1974.

The project had four primary contributors: myself when I was at Argonne National Lab, Jim Bunch from the University of California, San Diego, Cleve Moler, who was at the University of New Mexico at that time, and Pete Stewart from the University of Maryland. LINPACK as a software package has been largely superseded by LAPACK, which was designed to run efficiently on shared-memory vector supercomputers.

Page 3:

Computing in 1974

High-performance computers: IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030

• Fortran 66; trying to achieve software portability
• Floating-point operations were expensive; we didn't think much about the energy used
• Run efficiently: BLAS (Level 1) vector operations

Software released in 1979, about the time of the Cray 1.
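The "BLAS (Level 1) vector operations" LINPACK is built on are of the daxpy/saxpy kind: one vector update per call, one multiply-add per element moved. A minimal C sketch (the name my_daxpy is illustrative, not the reference BLAS):

```c
/* Level-1 BLAS style operation of the kind LINPACK's inner loops use:
   y := alpha*x + y. Illustrative sketch, not the reference implementation. */
void my_daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];   /* one multiply-add per element loaded */
}
```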

Page 4:

LINPACK Benchmark?

The LINPACK Benchmark is a measure of a computer's floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations.

Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the LINPACK Benchmark report.

LINPACK Benchmark:
• Dense linear system solved with LU factorization using partial pivoting
• Operation count is 2/3 n^3 + O(n^2)
• Benchmark measure: MFlop/s
• The original benchmark measures the execution rate of a Fortran program on a matrix of size 100x100.
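To make the measurement concrete, here is a minimal sketch of what the benchmark does: time an LU-based solve of a dense system and divide the standard operation count by the elapsed time. It assumes the LAPACKE interface is available and uses n = 1000 for illustration (the original benchmark fixed n = 100); the real HPL code is distributed-memory and far more elaborate.

```c
/* Sketch of the LINPACK benchmark measurement: solve Ax=b by LU with partial
   pivoting and report (2/3*n^3 + 2*n^2) / time. Link against LAPACKE/BLAS. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void)
{
    const int n = 1000;                        /* illustrative; original used n = 100 */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * sizeof(double));
    int *ipiv = malloc((size_t)n * sizeof(int));
    for (size_t i = 0; i < (size_t)n * n; i++) A[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++) b[i] = rand() / (double)RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    LAPACKE_dgesv(LAPACK_COL_MAJOR, n, 1, A, n, ipiv, b, n);  /* LU solve, b <- x */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 / 3.0 * n * (double)n * n + 2.0 * (double)n * n;
    printf("n = %d: %.1f MFlop/s\n", n, flops / secs / 1e6);
    free(A); free(b); free(ipiv);
    return 0;
}
```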

Page 5:

Accidental Benchmarker
• Appendix B of the LINPACK Users' Guide was designed to help users extrapolate execution time for the LINPACK software package.
• First benchmark report from 1977: Cray 1 to DEC PDP-10.

Page 6:

Top500 List of Supercomputers

H. Meuer, H. Simon, E. Strohmaier, & JD

• Listing of the 500 most powerful computers in the world
• Yardstick: Rmax from LINPACK MPP benchmark — Ax = b, dense problem (TPP performance: rate as a function of problem size)
• Updated twice a year: SC'xy in the States in November, meeting in Germany in June
• All data available from www.top500.org

Page 7:

Over the Last 20 Years - Performance Development

[Chart: TOP500 performance development, 1993-2012, log scale from 100 Mflop/s to 100 Pflop/s, showing SUM, N=1, and N=500. In 1993: SUM = 1.17 TFlop/s, N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s. In 2012: SUM = 123 PFlop/s, N=1 = 16.3 PFlop/s, N=500 = 60.8 TFlop/s. N=500 trails N=1 by roughly 6-8 years. For comparison: my laptop (70 Gflop/s), my iPad2 & iPhone 4s (1.02 Gflop/s).]

Page 8:

June 2012: The TOP10

Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | MFlops/Watt
1 | DOE / NNSA, Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 8.6 | 1895
2 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 830
3 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2069
4 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.52 | 823
5 | Nat. SuperComputer Center in Tianjin | Tianhe-1A, NUDT, Intel (6c) + Nvidia GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636
6 | DOE / OS, Oak Ridge Nat Lab | Jaguar, Cray, AMD (16c) + custom | USA | 298,592 | 1.94 | 74 | 5.14 | 377
7 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | 0.821 | 2099
8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 131,072 | 1.38 | 82 | 0.657 | 2099
9 | Commissariat a l'Energie Atomique (CEA) | Curie, Bull, Intel (8c) + IB | France | 77,184 | 1.36 | 82 | 2.25 | 604
10 | Nat. Supercomputer Center in Shenzhen | Nebulae, Dawning, Intel (6c) + Nvidia GPU (14c) + IB | China | 120,640 | 1.27 | 43 | 2.58 | 493
500 | Energy Comp | IBM Cluster, Intel + IB | Italy | 4,096 | 0.061 | 93* | |

Page 9:

Accelerators (58 systems)

[Chart: number of TOP500 systems with accelerators, 2006-2012.]

By accelerator type:
• NVIDIA 2090 (31)
• NVIDIA 2050 (12)
• NVIDIA 2070 (10)
• IBM PowerXCell 8i (2)
• ATI GPU (2)
• Intel MIC (1)
• Clearspeed CSX600 (0)

By country: 27 US, 7 China, 4 Japan, 3 Russia, 2 France, 2 Germany, 2 India, 2 Italy, 2 Poland, 1 Australia, 1 Brazil, 1 Canada, 1 Singapore, 1 Spain, 1 Taiwan, 1 UK

Page 10:

Countries Share

Absolute Counts

US: 252

China: 68

Japan: 35

UK: 25

France: 22

Germany: 20


Page 11:

28 Systems at > 1 Pflop/s (Peak) — 9/20/2012

Page 12:

Top500 in France

rank | manufacturer | computer | installation_site_name | procs | r_peak (Gflop/s)
9 | Bull | Bullx B510, Xeon E5-2680 8C 2.70 GHz, InfB | CEA/TGCC-GENCI | 77184 | 1667174
17 | Bull | Bull bullx super-node S6010/S6030 | CEA | 138368 | 1254550
29 | IBM | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | EDF R&D | 65536 | 838861
30 | IBM | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | IDRIS/GENCI | 65536 | 838861
45 | Bull | Bullx B510, Xeon E5 (Sandy Bridge-EP) 8C 2.70 GHz, InfB | Bull | 20480 | 442368
69 | HP | HP POD - Cluster Platform 3000, X5675 3.06 GHz, InfB | Airbus | 24192 | 296110
75 | SGI | SGI Altix ICE 8200EX, Xeon E5472 3.0/X5560 2.8 GHz | (GENCI-CINES) | 23040 | 267878
95 | HP | Cluster Platform 3000 BL2x220, L54xx 2.5 GHz, InfB | Government | 24704 | 247040
97 | Bull | Bullx B510, Xeon E5-2680 8C 2.70 GHz, InfB | CEA | 9440 | 203904
104 | IBM | iDataPlex, Xeon X56xx 6C 2.93 GHz, InfB | EDF R&D | 16320 | 191270
115 | Bull | Bullx B505, Xeon E56xx (Westmere-EP) 2.40 GHz, InfB | CEA | 7020 | 274560
149 | IBM | Blue Gene/P Solution | IDRIS | 40960 | 139264
162 | Bull | Bullx B505, Xeon E5640 2.67 GHz, InfB | CEA/TGCC-GENCI | 5040 | 198162
168 | Bull | BULL Novascale R422-E2 | CEA | 11520 | 129998
171 | Bull | Bullx B510, Xeon E5 (Sandy Bridge-EP) 8C 2.70 GHz, InfB | Bull | 5760 | 124416
173 | SGI | SGI Altix ICE 8200EX, Xeon quad core 3.0 GHz | Total | 10240 | 122880
201 | IBM | Blue Gene/P Solution | EDF R&D | 32768 | 111411
240 | HP | Cluster Platform 3000, Xeon X5570 2.93 GHz, InfB | Manufacturing | 8576 | 100511
244 | Bull | Bull bullx super-node S6010/S6030 | Bull | 11520 | 104602
245 | Bull | Bullx S6010 Cluster, Xeon 2.26 GHz 8-core, InfB | CEA/TGCC-GENCI | 11520 | 104417
365 | IBM | xSeries x3550M3 Cluster, Xeon X5650 6C 2.66 GHz, GigE | Information | 13068 | 139044
477 | IBM | iDataPlex, Xeon E55xx QC 2.26 GHz, GigE | Financial | 13056 | 118026

Page 13:

Linpack Efficiency

[Chart: LINPACK efficiency (Rmax as a percentage of peak, 0-100%) plotted by TOP500 rank, 1-500.]


Page 16:

Performance Development in Top500

[Chart: TOP500 performance development extrapolated from 1994 through 2020, log scale from 100 Gflop/s to 1 Eflop/s, with trend lines for N=1 and N=500 continuing toward the 1 Eflop/s mark.]

Page 17:

The High Cost of Data Movement

• Flop/s, or percentage of peak flop/s, becomes much less relevant.
• Algorithms & software: minimize data movement; perform more work per unit of data movement.

Approximate power costs (in picojoules):

Operation | 2011 | 2018
DP FMADD flop | 100 pJ | 10 pJ
DP DRAM read | 4800 pJ | 1920 pJ
Local interconnect | 7500 pJ | 2500 pJ
Cross system | 9000 pJ | 3500 pJ

Source: John Shalf, LBNL
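A quick way to read this table: divide the cost of a DRAM read by the cost of a flop to see how many flops an algorithm must perform per operand fetched from memory before arithmetic, rather than data movement, dominates the energy budget. The small program below just does that division with the numbers above:

```c
/* How many flops per DRAM operand before arithmetic energy matches
   data-movement energy, using the table on this slide. */
#include <stdio.h>

int main(void)
{
    const double flop_pj_2011 = 100.0, dram_pj_2011 = 4800.0;
    const double flop_pj_2018 = 10.0,  dram_pj_2018 = 1920.0;

    printf("2011: %.0f flops per DRAM read to break even\n",
           dram_pj_2011 / flop_pj_2011);                      /* 48  */
    printf("2018: %.0f flops per DRAM read to break even\n",
           dram_pj_2018 / flop_pj_2018);                      /* 192 */
    /* The required arithmetic intensity quadruples, hence: minimize data
       movement and perform more work per unit of data movement. */
    return 0;
}
```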

Page 18:

Energy Cost Challenge

• At ~$1M per MW, energy costs are substantial.
• 10 Pflop/s in 2011 uses ~10 MW.
• 1 Eflop/s in 2018: > 100 MW.
• DOE target: 1 Eflop/s in 2018 at 20 MW.
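A small back-of-the-envelope check of these numbers (treating "~$1M per MW" as an annual figure is my assumption, not stated on the slide):

```c
/* Rough energy budget for an exascale system using the numbers above.
   ASSUMPTION: the ~$1M per MW figure is interpreted as a yearly cost. */
#include <stdio.h>

int main(void)
{
    const double eflops     = 1e18;   /* 1 Eflop/s target               */
    const double target_mw  = 20.0;   /* DOE power target               */
    const double usd_per_mw = 1e6;    /* ~$1M per MW (assumed per year) */

    printf("Required efficiency: %.0f Gflops/W\n",
           eflops / (target_mw * 1e6) / 1e9);        /* -> 50 Gflops/W */
    printf("Energy bill at 20 MW:  ~$%.0fM\n", target_mw * usd_per_mw / 1e6);
    printf("Energy bill at 100 MW: ~$%.0fM\n", 100.0 * usd_per_mw / 1e6);
    return 0;
}
```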

Page 19:

Potential System Architecture
with a cap of $200M and 20 MW

Systems | 2012: BG/Q Computer | 2022 | Difference Today & 2022
System peak | 20 Pflop/s | 1 Eflop/s | O(100)
Power | 8.6 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) |
System memory | 1.6 PB (16*96*1024) | 32-64 PB | O(10)
Node performance | 205 GF/s (16*1.6GHz*8) | 1.2 or 15 TF/s | O(10) - O(100)
Node memory BW | 42.6 GB/s | 2-4 TB/s | O(1000)
Node concurrency | 64 threads | O(1k) or 10k | O(100) - O(1000)
Total node interconnect BW | 20 GB/s | 200-400 GB/s | O(10)
System size (nodes) | 98,304 (96*1024) | O(100,000) or O(1M) | O(100) - O(1000)
Total concurrency | 5.97 M | O(billion) | O(1,000)
MTTI | 4 days | O(<1 day) | -O(10)


Page 21:

Critical Issues at Peta & Exascale for Algorithm and Software Design

• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain the lower bounds on communication.
• Mixed precision methods: 2x speed of ops and 2x speed for data movement (see the sketch after this list).
• Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
• Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
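As a concrete instance of the mixed-precision item above: factor once in single precision (fast, and half the data movement), then recover double-precision accuracy with a few cheap refinement steps. LAPACK's dsgesv routine implements this idea for Ax = b; the sketch below spells the pattern out by hand with sgetrf/sgetrs plus double-precision residuals. The function name and convergence test are illustrative, and error checks are omitted.

```c
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

/* Solve A*x = b (A is n-by-n, column-major) with single-precision LU plus
   double-precision iterative refinement. Returns the number of refinement
   steps taken, or -1 if the residual never dropped below tol. */
int mixed_precision_solve(int n, const double *A, const double *b,
                          double *x, int max_iter, double tol)
{
    float  *As   = malloc((size_t)n * n * sizeof(float));
    float  *ws   = malloc((size_t)n * sizeof(float));
    double *r    = malloc((size_t)n * sizeof(double));
    int    *ipiv = malloc((size_t)n * sizeof(int));
    int     ret  = -1;

    /* The O(n^3) factorization is done entirely in single precision. */
    for (size_t i = 0; i < (size_t)n * n; i++) As[i] = (float)A[i];
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);

    /* Initial solve in single precision, promoted to double. */
    for (int i = 0; i < n; i++) ws[i] = (float)b[i];
    LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, ws, n);
    for (int i = 0; i < n; i++) x[i] = (double)ws[i];

    for (int it = 0; it < max_iter; it++) {
        /* Residual r = b - A*x in double precision (only O(n^2) work). */
        cblas_dcopy(n, b, 1, r, 1);
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);
        if (cblas_dnrm2(n, r, 1) < tol) { ret = it; break; }

        /* Correction: solve A*z = r with the cached single-precision factors. */
        for (int i = 0; i < n; i++) ws[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, ws, n);
        for (int i = 0; i < n; i++) x[i] += (double)ws[i];
    }
    free(As); free(ws); free(r); free(ipiv);
    return ret;
}
```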

Page 22:

Major Changes to Algorithms/Software

• Must rethink the design of our algorithms and software.
• Manycore and hybrid architectures are disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.

Page 23:

Dense Linear Algebra Software Evolution

• LINPACK (70's): vector operations, built on Level 1 BLAS
• LAPACK (80's): block operations, built on Level 3 BLAS
• ScaLAPACK (90's): block-cyclic data distribution, built on PBLAS and BLACS (message passing)
• PLASMA (00's): tile operations, tile layout, dataflow scheduling
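The jump from LINPACK's Level-1 BLAS to LAPACK's Level-3 BLAS is visible in a single call: a blocked trailing-matrix update becomes one dgemm on blocks instead of many vector operations. A minimal sketch assuming the CBLAS interface (the name trailing_update is illustrative):

```c
/* The Level-3 BLAS building block behind LAPACK's blocked algorithms:
   trailing update C := C - A*B done as a single dgemm on blocks. */
#include <cblas.h>

void trailing_update(int m, int n, int k,
                     const double *A, int lda,   /* m-by-k panel          */
                     const double *B, int ldb,   /* k-by-n panel          */
                     double       *C, int ldc)   /* m-by-n trailing block */
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, -1.0, A, lda, B, ldb, 1.0, C, ldc);
}
```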

Page 24:

PLASMA Principles

• Tile algorithms: minimize capacity misses
• Tile matrix layout: minimize conflict misses
• Dynamic DAG scheduling: minimizes idle time, more overlap, asynchronous ops

[Diagram: a multicore node with several CPUs, each with its own cache, sharing memory; LAPACK vs. PLASMA execution on such a node.]
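To illustrate the tile-layout point: the matrix is copied from standard column-major storage into contiguous nb-by-nb tiles, so each task works on one contiguous block of memory. A minimal sketch (the name to_tile_layout is mine; PLASMA's own conversion routines handle the general, non-divisible case):

```c
/* Copy an n-by-n column-major matrix (leading dimension lda) into contiguous
   nb-by-nb tiles stored one after another. Assumes n is a multiple of nb. */
#include <string.h>

void to_tile_layout(int n, int nb, const double *A, int lda, double *T)
{
    int nt = n / nb;                                /* tiles per dimension */
    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < nt; ti++) {
            double *tile = T + (size_t)(tj * nt + ti) * nb * nb;
            for (int j = 0; j < nb; j++)            /* one tile column at a time */
                memcpy(tile + (size_t)j * nb,
                       A + (size_t)(tj * nb + j) * lda + ti * nb,
                       nb * sizeof(double));
        }
}
```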

Page 25:

Fork-Join Parallelization of LU and QR

Parallelize the update:
• Easy and done in any reasonable software.
• This is the 2/3 n^3 term in the FLOP count.
• Can be done efficiently with LAPACK + multithreaded BLAS (dgemm).

[Diagram: execution trace, cores vs. time, showing the fork-join pattern.]

Page 26:

PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures

Objectives:
• High utilization of each core
• Scaling to large numbers of cores
• Synchronization-reducing algorithms

Methodology:
• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout

Arbitrary DAGs with dynamic scheduling replace fork-join parallelism; a sketch of the idea follows below.

[Diagram: execution traces over time, fork-join parallelism vs. DAG-scheduled parallelism.]
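To make "dynamic DAG scheduling" concrete, here is a tile Cholesky factorization expressed with OpenMP task dependences, a sketch of the same idea QUARK implements (this is not PLASMA's code; tile_cholesky, nt, and nb are illustrative names, and A[i][j] points to the tile in block row i, column j of the lower triangle). The runtime extracts the DAG from the depend clauses and keeps cores busy without global fork-join barriers.

```c
/* Tile Cholesky (lower, A = L*L^T) written with OpenMP task dependences to
   illustrate dynamic DAG scheduling. Each A[i][j] is a pointer to an
   nb-by-nb column-major tile; only the lower block triangle is referenced.
   Compile with -fopenmp and link against CBLAS/LAPACKE. */
#include <cblas.h>
#include <lapacke.h>

void tile_cholesky(int nt, int nb, double *A[nt][nt])
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        /* factor the diagonal tile */
        #pragma omp task depend(inout: A[k][k][0:nb*nb])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, A[k][k], nb);

        /* triangular solves down the panel */
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: A[k][k][0:nb*nb]) depend(inout: A[i][k][0:nb*nb])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, nb, nb, 1.0, A[k][k], nb, A[i][k], nb);
        }

        /* trailing-submatrix updates; the runtime overlaps these with the
           next panel as soon as their inputs are ready (no global barrier) */
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: A[i][k][0:nb*nb]) depend(inout: A[i][i][0:nb*nb])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        nb, nb, -1.0, A[i][k], nb, 1.0, A[i][i], nb);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k][0:nb*nb], A[j][k][0:nb*nb]) depend(inout: A[i][j][0:nb*nb])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                            -1.0, A[i][k], nb, A[j][k], nb, 1.0, A[i][j], nb);
            }
        }
    }
}
```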

Page 27:

Communication Avoiding QR Example

A. Pothen and P. Raghavan, "Distributed orthogonal factorization," in Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, vol. II, Applications, pp. 1610-1620, Pasadena, CA, Jan. 1988. ACM / Penn State.

[Diagram: a tall-skinny matrix is split into domains D0-D3; each domain performs a local tile QR (Domain_Tile_QR) producing R0-R3, and the small R factors are then merged pairwise in a reduction tree (R0+R1 -> R0, R2+R3 -> R2, R0+R2 -> R0).]
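The pattern in the figure can be sketched directly: each domain factors its block of rows locally, and the n-by-n R factors are then merged pairwise by factoring stacked pairs, so domains only ever exchange small triangles. A hedged C sketch using LAPACKE_dgeqrf (PLASMA's actual tile kernels differ; local_qr_r and merge_r are my names):

```c
/* Sketch of the tree reduction in communication-avoiding (tall-skinny) QR,
   for an m-by-n matrix with m >> n split into row blocks. */
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

/* Factor a rows-by-n block (column-major) in place and copy its n-by-n R out. */
static void local_qr_r(double *block, int rows, int n, double *R)
{
    double *tau = malloc((size_t)n * sizeof(double));
    LAPACKE_dgeqrf(LAPACK_COL_MAJOR, rows, n, block, rows, tau);
    for (int j = 0; j < n; j++)                 /* upper triangle holds R */
        for (int i = 0; i < n; i++)
            R[i + (size_t)j * n] = (i <= j) ? block[i + (size_t)j * rows] : 0.0;
    free(tau);
}

/* Merge two n-by-n R factors: QR of [R_top; R_bot], result overwrites R_top. */
static void merge_r(double *R_top, double *R_bot, int n)
{
    double *stacked = malloc((size_t)2 * n * n * sizeof(double));
    for (int j = 0; j < n; j++) {
        memcpy(stacked + (size_t)j * 2 * n,     R_top + (size_t)j * n, n * sizeof(double));
        memcpy(stacked + (size_t)j * 2 * n + n, R_bot + (size_t)j * n, n * sizeof(double));
    }
    local_qr_r(stacked, 2 * n, n, R_top);
    free(stacked);
}
```

For the four domains in the figure: local QR of D0-D3 gives R0-R3; then merge_r(R0, R1), merge_r(R2, R3), and finally merge_r(R0, R2) leave the R factor of the whole matrix in R0 (up to row signs).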


Page 32:

PowerPack 2.0

The PowerPack platform consists of software and hardware instrumentation.

Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/

Page 33:

Power for QR Factorization

• Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS
• The matrix is very tall and skinny (m x n = 1,152,000 x 288)

Traces compared:
• PLASMA's communication-reducing QR factorization (DAG based)
• PLASMA's conventional QR factorization (DAG based)
• MKL's QR factorization (fork-join based)
• LAPACK's QR factorization (fork-join based)

Page 34:

The Standard Tridiagonal Reduction xSYTRD

Step k: apply Q A Q*, then update and move to step k+1.

LAPACK xSYTRD:
1. Apply left-right transformations Q A Q* to the panel [A22; A32].
2. Update the remaining submatrix A33.

For the symmetric eigenvalue problem, this first stage takes:
• 90% of the time if only eigenvalues are needed
• 50% of the time if eigenvalues and eigenvectors are needed

Page 35:

The Standard Tridiagonal Reduction xSYTRD

Characteristics:
1. Phase 1 requires:
   o 4 panel-vector multiplications,
   o 1 symmetric matrix-vector multiplication with A33,
   o cost 2(n-k)^2 b flops.
2. Phase 2 requires:
   o a symmetric update of A33 using SYRK,
   o cost 2(n-k)^2 b flops.

Observations:
• Too many Level 2 BLAS ops,
• relies on panel factorization,
• total cost 4n^3/3,
• bulk synchronous phases,
• memory-bound algorithm.

Page 36:

Symmetric Eigenvalue Problem

• Standard reduction algorithms are very slow on multicore.
• Step 1: reduce the dense matrix to band form.
  • Matrix-matrix operations, high degree of parallelism.
• Step 2: bulge chasing on the band matrix.
  • Done by groups and cache aware.

Page 37:

Symmetric Eigenvalues / Singular Values

• Block DAG-based reduction to banded form, then pipelined group chasing to tridiagonal form.
• The reduction to condensed form accounts for the factor-of-50 improvement over LAPACK.
• Execution rates based on 4/3 n^3 ops (eigenvalues only / singular values only).
• Experiments on eight-socket six-core AMD Opteron 2.4 GHz processors with MKL v10.3.

Page 38:

Summary

• These are old ideas (today: SMPSs, StarPU, Charm++, ParalleX, Swarm, ...).
• Major challenges are ahead for extreme computing: power, levels of parallelism, communication, hybrid, fault tolerance, ... and many others not discussed here.
• Not just a programming assignment.
• This opens up many new opportunities for applied mathematicians and computer scientists.

Page 39:

Collaborators / Software / Support

PLASMA
• http://icl.cs.utk.edu/plasma/

MAGMA
• http://icl.cs.utk.edu/magma/

QUARK (runtime for shared memory)
• http://icl.cs.utk.edu/quark/

PaRSEC (Parallel Runtime Scheduling and Execution Control)
• http://icl.cs.utk.edu/parsec/

Collaborating partners:
• University of Tennessee, Knoxville
• University of California, Berkeley
• University of Colorado, Denver
• INRIA, France
• KAUST, Saudi Arabia

These tools are being applied to a range of applications beyond dense linear algebra: sparse direct methods, sparse iterative methods, and fast multipole methods.
