50
1 Jack Dongarra University of Tennessee and Oak Ridge National Laboratory http://www.cs.utk.edu/~ http://www.cs.utk.edu/~ dongarra dongarra / / Seminario Seminario EEUU/Venezuela de EEUU/Venezuela de Computacin Computacin de Alto de Alto Rendimiento Rendimiento 2000 2000 US/Venezuela Workshop on High Performance Computing 2000 US/Venezuela Workshop on High Performance Computing 2000 - - WoHPC WoHPC 2000 2000 4 al 6 de 4 al 6 de Abril Abril del 2000 del 2000 Puerto La Cruz Venezuela Puerto La Cruz Venezuela

Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

  • Upload
    trandat

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

1

Jack DongarraUniversity of Tennessee andOak Ridge National Laboratoryhttp://www.cs.utk.edu/~http://www.cs.utk.edu/~dongarradongarra//

SeminarioSeminario EEUU/Venezuela deEEUU/Venezuela de ComputaciónComputación de Altode Alto RendimientoRendimiento 20002000US/Venezuela Workshop on High Performance Computing 2000 US/Venezuela Workshop on High Performance Computing 2000 -- WoHPCWoHPC 20002000

4 al 6 de4 al 6 de AbrilAbril del 2000 del 2000 Puerto La Cruz VenezuelaPuerto La Cruz Venezuela

Page 2: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

2

♦ Look at trends in HPC�Top500 statistics

♦ NetSolve�Example of grid middleware

♦ Performance on today�s architecture�ATLAS effort

♦ Tools for performance evaluation�Performance API (PAPI)

Page 3: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

3

- Started in 6/93 by JD,Hans W. Meuer and Erich Strohmaier

- Basis for analyzing the HCP market- Quantify observations - Detection of trends(market, architecture, technology)

Page 4: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

4

- Listing of the 500 most powerfulComputers in the World

- Yardstick: Rmax from LINPACK MPPAx=b, dense problem

- Updated twice a yearSC�xy in NovemberMeeting in Mannheim in June

- All data available from www.top500.org

Size

Rat

e

TPP performance

Page 5: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

5

♦ A way for tracking trends�in performance�in market�in classes of HPC systems�Architecture�Technology

♦ Original classes of machines�Sequential�SMPs�MPPs�SIMDs

♦ Two new classes�Beowulf-class

systems�Clustering of SMPs

and DSMs �Requires additional

terminology� �Constellation�

Page 6: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

6

Cluster of ClustersCluster of Clusters

♦ An ensemble of N nodes each comprising p computing elements

♦ The p elements are tightly bound shared memory (e.g. smp, dsm)

♦ The N nodes are loosely coupled, i.e.: distributed memory

♦ p is greater than N♦ Distinction is which

layer gives us the most power through parallelism

4TF Blue Pacific SST 3 x 480 4-way SMP nodes3.9 TF peak performance 2.6 TB memory2.5 Tb/s bisectional bandwidth62 TB disk6.4 GB/s delivered I/O bandwidth

Page 7: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

Year91 93

Gflop/s

9795

TMC CM-5 (1024)TMC CM-2

(2048)NEC SX-3 (4)

Fujitsu VPP-500 (140)Intel Paragon (6788)

Hitachi CP-PACS (2040)

Fujitsu VP-2600 Cray Y-MP (8)

99

250

100

0

500

750

1000

1500

Intel ASCI Red (9152)

SGI ASCI Blue Mountain (5040)

2000 ASCI Blue Pacific SST(5808)

Intel ASCI Red Xeon (9632)

Page 8: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

RANK MANU-FACTURER COMPUTER

RMAX[GF/S]

INSTALLATION SITE COUNTRY YEAR AREA OF INSTALLATION # PROC

1 Intel ASCI Red 2379.6 Sandia National Labs Albuquerque

USA 1999 Research 9632

2 IBM ASCI Blue-Pacific SST,

IBM SP 604E2144 Lawrence Livermore

National Laboratory USA 1999 Research 5808

3 SGI ASCI Blue Mountain

1608 Los Alamos National Lab USA 1998 Research 6144

4 SGI T3E 1200 891.5 Government USA 1998 Classified 1084

5 Hitachi SR8000 873.6 University of Tokyo Japan 1999 Academic 128

6 SGI T3E 900 815.1 Government USA 1997 Classified 1324

7 SGI Orgin 2000 690.9 Los Alamos National Lab /ACL

USA 1999 Research 2048

8 Cray/SGI T3E 900 675.7 Naval Oceanographic Office, Bay Saint Louis USA 1999 Research

Weather 1084

9 SGI T3E 1200 671.2 Deutscher Wetterdienst Germany 1999 Research Weather

812

10 IBM SP Power3 558.13 UCSD/San Diego Supercomputer Center,

IBM/Poughkeepsie

USA 1999 Research 1024

Page 9: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

9

50.97 TF/s

2.38 TF/s

33.1 GF/s

Jun-9

3Nov

-93Ju

n-94

Nov-94

Jun-9

5Nov

-95Ju

n-96

Nov-96

Jun-9

7Nov

-97Ju

n-98

Nov-98

Jun-9

9Nov

-99

Intel XP/S140Sandia

Fujitsu'NWT' NAL

SNI VP200EXUni Dresden

Cray Y-MP M94/4

KFA Jülich

CrayY-MP C94/364

'EPA' USA

Hitachi/TsukubaCP-PACS/2048

SGIPOWER

CHALLANGE GOODYEAR

IntelASCI Red

Sandia

IntelASCI Red

Sandia

Sun UltraHPC 1000

News International

SunHPC 10000Merril Lynch

Fujitsu'NWT' NAL

IntelASCI Red

Sandia

IBM 604e69 procNabisco

N=1

N=500

SUM

1 Gflop/s

1 Tflop/s

100 Mflop/s

100 Gflop/s

100 Tflop/s

10 Gflop/s

10 Tflop/s

[60G - 400 M][2.4 Tflop/s 33 Gflop/s],[1 40 Tf] Schwab #12, 1/2 each year, 100 > 100 Gf, faster than Moore�s law

Page 10: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

10

0.1

1

10

100

1000

10000

100000

1000000

Jun-93

Nov-94

Jun-96

Nov-97

Jun-99

Nov-00

Jun-02

Nov-03

Jun-05

Nov-06

Jun-08

Nov-09

Perf

orm

ance

[GFl

op/s

]

N=1

N=500

Sum

N=10

1 TFlop/s

1 PFlop/s

ASCI

Earth Simulator

Entry 1 T 2005 and 1 P 2010

Page 11: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

11

Single Processor

SMP

MPP

SIMD Constellation Cluster

0

100

200

300

400

500

Jun-9

3Nov-9

3Ju

n-94

Nov-94

Jun-9

5Nov-9

5Ju

n-96

Nov-96

Jun-9

7Nov-9

7Ju

n-98

Nov-98

Jun-9

9Nov-9

9

Page 12: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

12

Cray SGI

IBM

SunConvex/HP

TMC

Intel

"Japan Inc"others

0

100

200

300

400

500

Jun-

93Nov

-93Ju

n-94

Nov-94

Jun-

95Nov

-95Ju

n-96

Nov-96

Jun-

97Nov

-97Ju

n-98

Nov-98

Jun-

99Nov

-99

Tara -> Cray Inc $15M

Page 13: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

Inmos Transputer

Alpha

IBMHP

intel

MIPS

SUN

Other

0

100

200

300

400

500

Jun-93

Nov-93

Jun-94

Nov-94

Jun-

95Nov-9

5Ju

n-96

Nov-96

Jun-

97Nov-9

7Ju

n-98

Nov-98

Jun-

99Nov

-99

Page 14: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

14

Research

Industry

Academic

ClassifiedVendor

0

100

200

300

400

500

Jun-93

Nov-93

Jun-94

Nov-94

Jun-95

Nov-95

Jun-96

Nov-96

Jun-97

Nov-97

Jun-98

Nov-98

Jun-99

Nov-99

Page 15: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

15

♦ Clustering of shared memory machines for scalability� Emergence of PC commodity systems

� Pentium/Alpha based, Linux or NT driven� �Supercomputer performance at mail-order prices�

� Beowulf-Class Systems (Linux+PC)� Distributed Shared Memory (clusters of

processors connected)� Shared address space w/deep memory hierarchy

♦ Efficiency of message passing and data parallel programming� Helped by standards efforts such as PVM, MPI,

Open-MP and HPF♦ Many of the machines as a single user environments♦ Pure COTS

Page 16: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

16

RANK MANU-FACTURER COMPUTER RMAX INSTALLATION SITE COUNTRY YEAR AREA OF

INSTALLATION#

PROC

33 Sun HPC 450Cluster

272.1 Sun, Burlington USA 1999 Vendor 720

34 Compaq Alpha Server SC 271.4 Compaq Computer Corp.Littleton USA 1999 Vendor 512

� � � � � � � � �

44 Self-made Cplant Cluster 232.6 Sandia NationalLaboratories USA 1999 Research 580

� � � � � � � � �

169 Self-made Alphleet Cluster 61.3 Institute of Physical andChemical Res. (RIKEN) Japan 1999 Research 140

� � � � � � � � �

265 Self-made Avalon Cluster 48.6 Los Alamos National Lab/CNLS USA 1998 Research 140

� � � � � � � � �351 Siemens hpcLine Cluster 41.45 Universitaet Paderborn/PC2 Germany 1999 Academic 192

� � � � � � � � �454 Self-made Parnass2 Cluster 34.23 University of Bonn/

Applied MathematicGermany 1999 Academic 128

Page 17: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

17

♦ Allow networked resources to be integrated into the desktop.

♦ Not just hardware, but also make available software resources.

♦ Locate and �deliver� software or solutions to the user in a directly usable and �conventional� form.

♦ Part of the motivation - software maintenance

Page 18: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

18

Page 19: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

19

♦ Three basic scenarios:�Client, servers and

agents anywhere on Internet (3(10)-150(80-ws/mpp)-Mcell)

�Client, servers and agents on an Intranet�Client, server and agent on the same

machine♦ �Blue Collar� Grid Based Computing

�User can set things up, no �su� required�Doesn�t require deep knowledge of network

programming♦ Focus on Matlab users

�OO language, objects are matrices (pse, eg os)

�One of the most popular desktop systems for numerical computing, 400K Users

Page 20: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

20

� No knowledge of networking involved� Hide complexity of numerical software� Computation location transparency� Provides access to Virtual Libraries :

� Component grid-based framework� Central management of library resources� User not concerned with most up-to-date version� Automatic tie to Netlib repository in project

� Provides synchronous or asynchronous calls(User level parallelism)

Page 21: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

call NETSL(�DGESV()�,NSINFO,N,1,A,MAX,IPIV,B,MAX,INFO)

Asynchronous Calls also available

>> define sparse matrix A>> define rhs>> [x, its] = netsolve('itmeth',�petsc�, A, rhs, 1.e-6

Page 22: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

http://www.cs.utk.edu/netsolve

� Framework to easily add arbitrary software ...� Many numerical libraries being integrated by the

NetSolve team � Many software being integrated by users

Agent : � Gateway to the computational servers

Computational Server :� Various Software resources installed on

various Hardware Resources� Configurable and Extensible :

� Performs Load Balancing among the resources

Page 23: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

http://www.cs.utk.edu/netsolve

NetSolve agent :predicts the execution times and sorts the servers

Prediction for a server based on :� Its distance over the network

- Latency and Bandwidth- Statistical Averaging

� Its performance (LINPACK benchmark)� Its workload� The problem size and the algorithm complexity

Cached data Quick estimate

workload out of date ?

Page 24: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

24

University of Tennessee�s Grid Prototype: University of Tennessee�s Grid Prototype: ScalableScalable IntracampusIntracampus Research Grid:Research Grid: SInRGSInRG

Page 25: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

25

netsl(�command1�, A, B, C);netsl(�command2�, A, C, D);netsl(�command3�, D, E, F);

Client Server

command1(A, B)

result C

Client Server

command2(A, C)

result D

Client Server

command3(D, E)

result F

netsl_begin_sequence( );netsl(�command1�, A, B, C);netsl(�command2�, A, C, D);netsl(�command3�, D, E, F);netsl_end_sequence(C, D);

Client Server

sequence(A, B, E)

Server

Client Serverresult F

input A,intermediate output C

intermediate output D,input E

Page 26: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

26

NetSolveClient

NetSolve Servers

NetSolve Agent

NetSolve Services

MDS

GASS

GRAM

Client-ProxyInterface

NetSolveProxy

Client-ProxyInterfaceGlobusProxy

Client-ProxyInterface

NinfProxy

Client-ProxyInterface

LegionProxy

NinfServices Legion

Services

Page 27: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

27

♦ Kerberos used to maintain Access Control Lists and manage access to computational resources.

♦ NetSolve properly handles authorized and non-authorized components together in the same system.

♦ In use by DOD Modernization program at Army Research Lab

Page 28: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

28

♦ Multiple requests to single problem.

♦ Previous Solution:

♦ New Solution:�� Single call to netsl_farm( );Single call to netsl_farm( );

� Many calls to netslnb( ); /* non-blocking */

Page 29: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

SCIRun torso defibrillator application

Page 30: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

30

(A bit in the future)(A bit in the future)

� NetSolve examines the user�s data together with the resources available (in the grid sense) and makes decisions on best time to solution dynamically.

� Decision based on size of problem, hardware/software, network connection etc.

� Method and Placement.

Dense

Rectangular General Symmetric

Sparse

Matrixtype

Direct Iterative

GeneralSymmetric

Symmetric General

♦♦ x =x = netsolvenetsolve(�linear system�,A,b)(�linear system�,A,b)

Page 31: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

31

Where Does the Performance Go? orWhere Does the Performance Go? orWhy Should I Cares About the Memory Hierarchy?Why Should I Cares About the Memory Hierarchy?

µProc60%/yr.(2X/1.5yr)

DRAM9%/yr.(2X/10 yrs)1

10

100

1000

1980

1981

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

DRAM

CPU19

82

Processor-MemoryPerformance Gap:(grows 50% / year)

Perf

orm

ance

Time

�Moore�s Law�

Processor-DRAM Memory Gap (latency)

Page 32: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

♦ By taking advantage of the principle of locality:� Present the user with as much memory as is available in

the cheapest technology.� Provide access at the speed offered by the fastest

technology.

Control

Datapath

SecondaryStorage(Disk)

Processor

Registers

MainMemory(DRAM)

Level2 and 3Cache

(SRAM)

On-C

hipC

ache1s 10,000,000s

(10s ms)100,000 s(.1s ms)

Speed (ns): 10s 100s100s

Gs

Size (bytes):Ks Ms

TertiaryStorage

(Disk/Tape)

10,000,000,000s (10s sec)

10,000,000 s(10s ms)

Ts

DistributedMemory

Remote ClusterMemory

Page 33: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

33

♦ Today�s processors can achieve high-performance, but this requires extensive machine-specific hand tuning.

♦ Hardware and software have a large design space w/many parameters� Blocking sizes, loop nesting permutations, loop unrolling

depths, software pipelining strategies, register allocations, and instruction schedules.

� Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.

♦ About a year ago no tuned BLAS for Pentium for Linux.♦ Need for quick/dynamic deployment of optimized routines.♦ ATLAS - Automatic Tuned Linear Algebra Software

� PhiPac from Berkeley� FFTW from MIT (http://www.fftw.org)

Page 34: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

34

♦ An adaptive software architecture�High-performance�Portability�Elegance

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.

Page 35: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

35

(DGEMM n = 500)(DGEMM n = 500)

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.

0.0

100.0

200.0

300.0

400.0

500.0

600.0

700.0

800.0

900.0

AMD Athlon-600

DEC ev56-5

33DEC ev6

-500

HP9000

/735/1

35IBM PPC60

4-112

IBM Power2-16

0IBM Power3-

200

Pentium Pro-20

0Pentiu

m II-26

6Pentiu

m III-55

0

SGI R10

000ip

28-20

0

SGI R12

000ip

30-27

0

Sun UltraSparc2

-200

Architectures

MFL

OPS

Vendor BLASATLAS BLASF77 BLAS

Page 36: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

36

♦ Code is iteratively generated & timed until optimal case is found. We try:� Differing NBs� Breaking false

dependencies� M, N and K loop unrolling

♦ Designed for RISC arch� Super Scalar� Need reasonable C

compiler♦ Takes ~20 minutes to run

♦ Two phases:� Probes the systems for

system features� Does a parameter study

♦ On-chip multiply optimizes for:� TLB access� L1 cache reuse� FP unit usage� Memory fetch� Register reuse� Loop overhead

minimization♦ New model of HP

programming where critical code is machine generated using parameter optimization.

Page 37: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

37

0

50

100

150

200

250

300

350

DGEMM DSYMM DSYRK DSYR2K DTRMM DTRSM

BLAS

MFLO

PS

Vendor BLASATLAS BLASReference BLAS

Page 38: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

38

0100200300400500600700800

100

200

300

400

500

600

700

800

900

1000

Size

Mfl

op

/s

Intel BLAS 1 proc ATLAS 1proc Intel BLAS 2 proc ATLAS 2 proc

MultiMulti--Threaded DGEMMThreaded DGEMMIntel PIII 550 MHzIntel PIII 550 MHz

Page 39: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

39

♦ Software Release, available today:�Level 1, 2, and 3 BLAS implementations�See: www.netlib.org/atlas/

♦ Near Future:�Multi-treading�Optimize message passing system�Extend these ideas to Java directly�Sparse Matrix-Vector ops

♦ Futures:�Runtime adaptation

�Sparsity analysis� Iterative code improvement

�Specialization for user applications�Adaptive libraries

Page 40: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

40

♦ Timing and performance evaluation has been an art�Resolution of the clock�Issues about cache effects�Different systems

♦ Situation about to change�Today�s processors have internal counters

Page 41: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

41

♦ Hidden from users.♦ On most platforms the APIs, if they exist, are not appropriate for a common user, functional or well documented.

♦ Existing performance counter APIs�Cray T3E�SGI MIPS R10000�IBM Power series�DEC Alpha pfm pseudo-device interface�Windows 95, NT and Linux

Page 42: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

42

�Cycle count�Floating point

instruction count�Integer instruction

count�Instruction count�Load/store count�Branch taken / not

taken count�Branch mispredictions

�Pipeline stalls due to memory subsystem

�Pipeline stalls due to resource conflicts

�I/D cache misses for different levels

�Cache invalidations�TLB misses�TLB invalidations

Page 43: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

43

♦ Performance Application Programming Interface

♦ The purpose of PAPI is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors

♦ Used by Tau (A. Malony) andSvPablo (D. Reed)

Page 44: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

44

♦ Application is instrumented with PAPI�One simple �call�

♦ Will be layered over the best existing vendor-specific APIs for these platforms

♦ Sections of code that are of interest are designated with specific colors� Using a call to mark_perfometer(�color�)

♦ Application is started and a Java window containing the Perfometer application is also started

Page 45: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

PerfometerPerfometer

Call Perfometer()

Page 46: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

46

♦ Top500� Hans W. Meuer, Mannheim U� Erich Strohmaier, UTK

♦ NetSolve� Dorian Arnold, UTK� Susan Blackford, UTK� Henri Casanova, UCSD� Michelle Miller, UTK� Ganapathy Raman, UTK� Sathish Vadhiyar, UTK

♦ ATLAS� Clint Whaley, UTK� Antoine Petitet, UTK

♦ PAPI� Shirley Browne, UTK� Nathan Garner, UTK� Kevin London, UTK� Phil Mucci, UTK

For additional information see�www.top500.orgwww.netlib.org/atlas/www.netlib.org/netsolve/www.cs.utk.edu/~dongarra/

Page 47: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

47

NetSolve Agent

Software RepositoryComputational Resources

http://www.cs.utk.edu/netsolve/NetSolve Client

Page 48: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

48

♦ List of Top 100 Clusters �IEEE Task Force on Cluster Computing�Interested in assembling a list of the Top n

Clusters�Based on current metric�Starting to put together software to

facilitate running and collection of data.♦ Sparse Benchmark

�Look at the performance in terms of sparse matrix operations

�Iterative solvers�Beginning to collect data

Page 49: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

49

♦ Tool integration� Globus - Middleware infrastructure (ANL/SSI)� Condor - Workstation farm (U Wisconsin)� NWS - Network Weather Service (U Tennessee)� SCIRun - Computational steering (U Utah)� Ninf - NetSolve-like system, (Tsukuba U)

♦ Library usage� LAPACK/ScaLAPACK - Parallel dense linear solvers� SuperLU/MA28 - Parallel sparse direct linear solvers(UCB/RAL)� PETSc/Aztec - Parallel iterative solvers (ANL/SNL)� Other areas as well (not just linear algebra)

♦ Applications� MCell - Microcellular physiology (UCSD/Salk)� IPARS - Reservoir Simulator (UTexas, Austin)� Virtual Human - Pulmonary System Model (ORNL)� RSICC - Radiation Safety sw/simulation (ORNL)� LUCAS - Land usage modeling (U Tennessee)� ImageVision - Computer Graphics and Vision (Graz U)

Page 50: Jack Dongarra University of Tennessee and Oak Ridge National …pages.cs.wisc.edu/~bart/US-Ven/Talks/Dongarra.pdf · 2 ♦Look at trends in HPC Top500 statistics ♦NetSolve Example

50

PETScPETScPETScPETSc SuperLSuperLSuperLSuperLUUUUSPOOLESSPOOLESSPOOLESSPOOLESMA2MA2MA2MA28888

♦ Iterative and direct solvers: PETSc, Aztec, SuperLU, Ma28, �

♦ Support for compressed row/column sparse matrix storage --significantly reduces network data transmission

♦ Sequential and parallel implementations available