Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory


1

Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/~dongarra/

US/Venezuela Workshop on High Performance Computing 2000 (Seminario EEUU/Venezuela de Computación de Alto Rendimiento 2000) - WoHPC 2000

April 4-6, 2000, Puerto La Cruz, Venezuela

2

♦ Look at trends in HPC: Top500 statistics

♦ NetSolve: an example of grid middleware

♦ Performance on today's architectures: the ATLAS effort

♦ Tools for performance evaluation: the Performance API (PAPI)

3

- Started in 6/93 by Jack Dongarra, Hans W. Meuer, and Erich Strohmaier

- Basis for analyzing the HPC market
- Quantify observations
- Detection of trends (market, architecture, technology)

4

- Listing of the 500 most powerful computers in the world

- Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem)

- Updated twice a year: at SC'xy in November and at the Mannheim meeting in June

- All data available from www.top500.org
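For reference (this is the standard LINPACK/Top500 convention rather than something spelled out on the slide), Rmax is the rate achieved on the largest problem size run, crediting the solver with the standard operation count:

\[
R_{\max} \;=\; \frac{\tfrac{2}{3} N_{\max}^{3} + 2 N_{\max}^{2}}{t_{\mathrm{solve}}}
\]

where N_max is the largest matrix order the system ran and t_solve is the measured time to factor and solve the dense system Ax = b.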

[Figure: TPP performance: benchmark rate versus problem size]

5

♦ A way of tracking trends:
- In performance
- In the market
- In classes of HPC systems (architecture, technology)

♦ Original classes of machines: sequential, SMPs, MPPs, SIMDs

♦ Two new classes:
- Beowulf-class systems
- Clustering of SMPs and DSMs, which requires additional terminology: 'constellation'

6

Cluster of Clusters

♦ An ensemble of N nodes each comprising p computing elements

♦ The p elements are tightly coupled via shared memory (e.g. SMP, DSM)

♦ The N nodes are loosely coupled, i.e. distributed memory

♦ p is greater than N
♦ The distinction is which layer gives us the most power through parallelism

4TF Blue Pacific SST: 3 x 480 4-way SMP nodes, 3.9 TF peak performance, 2.6 TB memory, 2.5 Tb/s bisection bandwidth, 62 TB disk, 6.4 GB/s delivered I/O bandwidth
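A minimal sketch of applying the p-versus-N rule above, using a hypothetical classify() helper. By this rule the Blue Pacific configuration (N = 3 x 480 = 1440 nodes, p = 4 processors per node) counts as a cluster of SMPs rather than a constellation, since p < N.

#include <stdio.h>

/* Illustrative only: classify a system by the p-versus-N rule above. */
const char *classify(int n_nodes, int p_per_node)
{
    /* constellation: more processors inside a node than nodes overall */
    return (p_per_node > n_nodes) ? "constellation" : "cluster";
}

int main(void)
{
    /* ASCI Blue Pacific SST: 3 x 480 four-way SMP nodes */
    printf("%s\n", classify(3 * 480, 4));   /* prints "cluster" */
    return 0;
}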

[Chart: peak LINPACK performance (Gflop/s, 0-2000) of leading systems, 1991-2000: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), Intel ASCI Red Xeon (9632).]

RANK | MANUFACTURER | COMPUTER | RMAX [GF/s] | INSTALLATION SITE | COUNTRY | YEAR | AREA OF INSTALLATION | # PROC
1 | Intel | ASCI Red | 2379.6 | Sandia National Labs, Albuquerque | USA | 1999 | Research | 9632
2 | IBM | ASCI Blue-Pacific SST, IBM SP 604E | 2144 | Lawrence Livermore National Laboratory | USA | 1999 | Research | 5808
3 | SGI | ASCI Blue Mountain | 1608 | Los Alamos National Lab | USA | 1998 | Research | 6144
4 | SGI | T3E 1200 | 891.5 | Government | USA | 1998 | Classified | 1084
5 | Hitachi | SR8000 | 873.6 | University of Tokyo | Japan | 1999 | Academic | 128
6 | SGI | T3E 900 | 815.1 | Government | USA | 1997 | Classified | 1324
7 | SGI | Origin 2000 | 690.9 | Los Alamos National Lab/ACL | USA | 1999 | Research | 2048
8 | Cray/SGI | T3E 900 | 675.7 | Naval Oceanographic Office, Bay Saint Louis | USA | 1999 | Research Weather | 1084
9 | SGI | T3E 1200 | 671.2 | Deutscher Wetterdienst | Germany | 1999 | Research Weather | 812
10 | IBM | SP Power3 | 558.13 | UCSD/San Diego Supercomputer Center, IBM/Poughkeepsie | USA | 1999 | Research | 1024

9

[Chart: Top500 performance, Jun-93 through Nov-99, log scale from 100 Mflop/s to 100 Tflop/s, showing the N=1 system, the N=500 system, and the Sum of all 500 systems. Nov-99 values: Sum = 50.97 TF/s, N=1 (Intel ASCI Red, Sandia) = 2.38 TF/s, N=500 = 33.1 GF/s. Labeled systems include the Intel XP/S140 (Sandia), Fujitsu 'NWT' (NAL), Hitachi/Tsukuba CP-PACS/2048, and Intel ASCI Red (Sandia) at the top of the list, and machines such as the SNI VP200EX (Uni Dresden), Cray Y-MP M94/4 (KFA Jülich), Cray Y-MP C94/364 ('EPA', USA), SGI Power Challenge (Goodyear), Sun Ultra HPC 1000 (News International), Sun HPC 10000 (Merrill Lynch), and a 69-processor IBM 604e (Nabisco) further down; Schwab appears at #12. Growth is faster than Moore's law, and more than 100 systems now exceed 100 Gflop/s.]

10

[Chart: Top500 performance extrapolation, Jun-93 through Nov-09, performance in GFlop/s on a log scale (0.1 to 1,000,000), with trend lines for N=1, N=10, N=500, and Sum, reference lines at 1 TFlop/s and 1 PFlop/s, and ASCI and the Earth Simulator marked. Projection: entry at 1 TFlop/s around 2005 and 1 PFlop/s around 2010.]
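A rough check of what these projections imply, taking the Nov-99 values from the previous chart (N=1 at 2.38 TF/s, N=500 at 33.1 GF/s) and reading the annotation as a 1 PFlop/s system around 2010 and a 1 TFlop/s N=500 entry around 2005:

\[
T_{\mathrm{double}}^{\,N=1} \approx \frac{10\ \text{yr}}{\log_2\!\left(\frac{1\,\mathrm{PF/s}}{2.38\,\mathrm{TF/s}}\right)} \approx \frac{10}{8.7} \approx 1.15\ \text{yr},
\qquad
T_{\mathrm{double}}^{\,N=500} \approx \frac{5.5\ \text{yr}}{\log_2\!\left(\frac{1\,\mathrm{TF/s}}{33.1\,\mathrm{GF/s}}\right)} \approx \frac{5.5}{4.9} \approx 1.1\ \text{yr}.
\]

Either way the implied doubling time is a bit over a year, which is what "faster than Moore's law" (roughly 2x every 18 months) refers to.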

11

[Chart: number of Top500 systems (0-500) by architecture class: single processor, SMP, MPP, SIMD, constellation, cluster; Jun-93 through Nov-99.]

12

[Chart: number of Top500 systems (0-500) by manufacturer: Cray, SGI, IBM, Sun, Convex/HP, TMC, Intel, 'Japan Inc', others; Jun-93 through Nov-99. Annotation: Tera -> Cray Inc., $15M.]

[Chart: number of Top500 systems (0-500) by processor family: Alpha, IBM, HP, Intel, MIPS, Sun, other (with a note on the Inmos Transputer); Jun-93 through Nov-99.]

14

[Chart: number of Top500 systems (0-500) by installation type: research, industry, academic, classified, vendor; Jun-93 through Nov-99.]

15

♦ Clustering of shared-memory machines for scalability
- Emergence of PC commodity systems
- Pentium/Alpha based, Linux or NT driven
- 'Supercomputer performance at mail-order prices'
- Beowulf-class systems (Linux + PCs)
- Distributed shared memory (clusters of connected processors)
- Shared address space with a deep memory hierarchy

♦ Efficiency of message passing and data-parallel programming
- Helped by standards efforts such as PVM, MPI, OpenMP and HPF

♦ Many of the machines run as single-user environments
♦ Pure COTS (commodity off-the-shelf) systems

16

RANK | MANUFACTURER | COMPUTER | RMAX [GF/s] | INSTALLATION SITE | COUNTRY | YEAR | AREA OF INSTALLATION | # PROC
33 | Sun | HPC 450 Cluster | 272.1 | Sun, Burlington | USA | 1999 | Vendor | 720
34 | Compaq | AlphaServer SC | 271.4 | Compaq Computer Corp., Littleton | USA | 1999 | Vendor | 512
... | | | | | | | |
44 | Self-made | Cplant Cluster | 232.6 | Sandia National Laboratories | USA | 1999 | Research | 580
... | | | | | | | |
169 | Self-made | Alphleet Cluster | 61.3 | Institute of Physical and Chemical Res. (RIKEN) | Japan | 1999 | Research | 140
... | | | | | | | |
265 | Self-made | Avalon Cluster | 48.6 | Los Alamos National Lab/CNLS | USA | 1998 | Research | 140
... | | | | | | | |
351 | Siemens | hpcLine Cluster | 41.45 | Universitaet Paderborn/PC2 | Germany | 1999 | Academic | 192
... | | | | | | | |
454 | Self-made | Parnass2 Cluster | 34.23 | University of Bonn/Applied Mathematics | Germany | 1999 | Academic | 128

17

♦ Allow networked resources to be integrated into the desktop.

♦ Not just hardware, but also make available software resources.

♦ Locate and 'deliver' software or solutions to the user in a directly usable and 'conventional' form.

♦ Part of the motivation: software maintenance

18

19

♦ Three basic scenarios:
- Client, servers, and agents anywhere on the Internet (3(10)-150(80-ws/mpp)-Mcell)
- Client, servers, and agents on an intranet
- Client, server, and agent on the same machine

♦ 'Blue collar' grid-based computing
- User can set things up; no 'su' required
- Doesn't require deep knowledge of network programming

♦ Focus on Matlab users
- OO language, objects are matrices (pse, eg os)
- One of the most popular desktop systems for numerical computing, 400K users

20

- No knowledge of networking involved
- Hides the complexity of numerical software
- Computation location transparency
- Provides access to virtual libraries:
  - Component grid-based framework
  - Central management of library resources
  - User not concerned with the most up-to-date version
  - Automatic tie to the Netlib repository in the project
- Provides synchronous or asynchronous calls (user-level parallelism)

call NETSL('DGESV()',NSINFO,N,1,A,MAX,IPIV,B,MAX,INFO)

Asynchronous calls are also available.

>> define sparse matrix A
>> define rhs
>> [x, its] = netsolve('itmeth', 'petsc', A, rhs, 1.e-6)

http://www.cs.utk.edu/netsolve
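For completeness, the same DGESV request from C. This is only a sketch: it assumes the C entry point is a variadic netsl() taking the problem name followed by the problem's arguments, mirroring the Fortran call above, with the NetSolve status returned as the function value; the exact prototype and argument order should be checked against the NetSolve client documentation.

#include <stdio.h>

/* Assumed prototype (normally provided by the NetSolve client header). */
int netsl(char *problem_name, ...);

/* Sketch: solve A x = b remotely with the blocking NetSolve call. */
int solve_remote(int n, double *A, int lda, int *ipiv, double *b, int ldb)
{
    int info = 0;
    int status = netsl("dgesv()", n, 1, A, lda, ipiv, b, ldb, &info);
    if (status < 0)
        fprintf(stderr, "NetSolve request failed (status %d)\n", status);
    return info;   /* LAPACK-style info from the remote DGESV */
}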

- Framework to easily add arbitrary software ...
- Many numerical libraries being integrated by the NetSolve team
- Much software being integrated by users

Agent:
- Gateway to the computational servers
- Performs load balancing among the resources

Computational server:
- Various software resources installed on various hardware resources
- Configurable and extensible

http://www.cs.utk.edu/netsolve

NetSolve agent: predicts the execution times and sorts the servers.

The prediction for a server is based on:
- Its distance over the network (latency and bandwidth, with statistical averaging)
- Its performance (LINPACK benchmark)
- Its workload
- The problem size and the algorithm complexity

[Diagram: cached data gives a quick estimate; a server is re-queried when its workload information is out of date.]

A toy version of this kind of estimate is sketched below.
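This is not NetSolve's actual formula, just an illustration of the model described above: predicted time = data-transfer time from the measured latency and bandwidth, plus compute time from the algorithm's flop count, the server's LINPACK rating, and its current workload.

/* Per-server data the agent might keep (illustrative fields only). */
struct server_info {
    double latency_s;       /* measured network latency, seconds      */
    double bandwidth_Bps;   /* measured bandwidth, bytes per second   */
    double linpack_flops;   /* LINPACK performance, flop/s            */
    double workload;        /* current load, 0.0 (idle) .. 1.0 (busy) */
};

/* Toy execution-time estimate for one server; the agent would sort
 * candidate servers by this value and pick the smallest. */
double predict_time(const struct server_info *s,
                    double problem_bytes, double problem_flops)
{
    double transfer = s->latency_s + problem_bytes / s->bandwidth_Bps;
    double rate = s->linpack_flops * (1.0 - s->workload);
    double compute = problem_flops / (rate > 0.0 ? rate : 1.0);
    return transfer + compute;
}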

24

University of Tennessee's Grid Prototype: Scalable Intracampus Research Grid (SInRG)

25

netsl("command1", A, B, C);
netsl("command2", A, C, D);
netsl("command3", D, E, F);

[Diagram: without sequencing, each call is a separate client-server exchange: command1(A, B) returns C, command2(A, C) returns D, command3(D, E) returns F.]

netsl_begin_sequence( );
netsl("command1", A, B, C);
netsl("command2", A, C, D);
netsl("command3", D, E, F);
netsl_end_sequence(C, D);

[Diagram: with request sequencing, the client sends a single sequence(A, B, E) request; input A and intermediate output C, then intermediate output D and input E, flow between servers, and only the final result F is returned to the client.]

26

[Diagram: the NetSolve client talks through a uniform client-proxy interface to a NetSolve proxy (NetSolve agent, servers, and services), a Globus proxy (MDS, GASS, GRAM), a Ninf proxy (Ninf services), or a Legion proxy (Legion services).]

27

♦ Kerberos used to maintain Access Control Lists and manage access to computational resources.

♦ NetSolve properly handles authorized and non-authorized components together in the same system.

♦ In use by the DoD Modernization Program at the Army Research Lab

28

♦ Multiple requests for a single problem.

♦ Previous solution:
- Many calls to netslnb( ); /* non-blocking */

♦ New solution:
- Single call to netsl_farm( );

SCIRun torso defibrillator application

30

(A bit in the future)

- NetSolve examines the user's data together with the resources available (in the grid sense) and dynamically decides how to reach the best time to solution.
- The decision is based on problem size, hardware/software, network connection, etc.
- Method and placement.

[Diagram: decision tree over the matrix type: dense (rectangular, general, symmetric) versus sparse; for sparse matrices, direct versus iterative methods, each for general and symmetric matrices.]

♦ x = netsolve('linear system', A, b)
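An illustrative sketch of the decision tree above (not NetSolve's internal code): pick a method from the matrix's storage and structure. The solver names in the strings are standard examples (LAPACK DGESV/DSYSV, CG, GMRES), not a claim about what NetSolve dispatches to.

typedef enum { DENSE, SPARSE }        storage_t;
typedef enum { GENERAL, SYMMETRIC }   structure_t;

/* Illustrative only: map the decision tree onto a solver choice. */
const char *choose_solver(storage_t storage, structure_t structure,
                          int prefer_iterative)
{
    if (storage == DENSE)
        return (structure == SYMMETRIC) ? "dense symmetric (e.g. LAPACK DSYSV)"
                                        : "dense general (e.g. LAPACK DGESV)";
    /* sparse: direct or iterative branch, each with general/symmetric leaves */
    if (prefer_iterative)
        return (structure == SYMMETRIC) ? "sparse iterative, symmetric (e.g. CG)"
                                        : "sparse iterative, general (e.g. GMRES)";
    return (structure == SYMMETRIC) ? "sparse direct, symmetric"
                                    : "sparse direct, general";
}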

31

Where Does the Performance Go? or: Why Should I Care About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Chart, 1980-2000: microprocessor performance improves ~60%/year (2x every 1.5 years, 'Moore's Law') while DRAM improves ~9%/year (2x every 10 years); the processor-memory performance gap grows about 50% per year.]

♦ By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
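A small, self-contained illustration of why locality matters: the same matrix sum traversed row by row (unit stride, cache friendly for C's row-major layout) versus column by column (stride N, roughly one cache miss per element once the matrix no longer fits in cache).

#define N 2048
static double a[N][N];   /* row-major in C */

/* Unit-stride traversal: consecutive elements share cache lines. */
double sum_rowwise(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Stride-N traversal: each access touches a new cache line once
 * the working set exceeds the cache, so it runs much slower. */
double sum_colwise(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}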

[Diagram: the memory hierarchy, from processor registers and on-chip cache through level 2 and 3 cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape), extended by distributed memory and remote cluster memory; access times grow from nanoseconds (registers, cache) through tens of milliseconds (disk) to tens of seconds (tape), while capacities grow from hundreds of bytes to terabytes.]

33

♦ Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.

♦ Hardware and software have a large design space with many parameters (a blocked-kernel sketch follows this list):
- Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
- Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.

♦ About a year ago there was no tuned BLAS for the Pentium under Linux.
♦ Need for quick/dynamic deployment of optimized routines.
♦ ATLAS - Automatically Tuned Linear Algebra Software
- Similar efforts: PhiPAC from Berkeley, FFTW from MIT (http://www.fftw.org)
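To make the 'blocking sizes' parameter concrete, here is a plain blocked matrix multiply with a single tunable block size NB. This is the kind of kernel whose best NB, unrolling, and scheduling ATLAS searches for empirically; it is not ATLAS's generated code.

#define NB 64   /* block size: the kind of parameter that gets tuned per machine */

/* C += A * B for n x n row-major matrices, blocked so an NB x NB
 * working set of each operand stays resident in cache. */
void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}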

34

♦ An adaptive software architecture: high performance, portability, elegance

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.

35

(DGEMM, n = 500)


[Chart: DGEMM (n = 500) performance in Mflop/s (0-900) for vendor BLAS, ATLAS BLAS, and reference F77 BLAS across architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200.]

36

♦ Code is iteratively generated and timed until the optimal case is found (see the search-loop sketch after this list). We try:
- Differing NBs
- Breaking false dependencies
- M, N and K loop unrolling

♦ Designed for RISC architectures:
- Superscalar
- Needs a reasonable C compiler

♦ Takes ~20 minutes to run

♦ Two phases:
- Probes the system for its features
- Does a parameter study

♦ On-chip multiply optimizes for:
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization

♦ A new model of high-performance programming where critical code is machine generated using parameter optimization.
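A skeleton of that generate-and-time loop, with a hypothetical time_kernel() helper standing in for "compile and run one candidate, report its Mflop/s"; the real ATLAS search also varies unrolling factors, scheduling, and more.

#include <stdio.h>

/* Hypothetical helper (not a real ATLAS API): builds and times a
 * candidate kernel with block size nb, returning measured Mflop/s. */
extern double time_kernel(int nb);

int search_best_nb(void)
{
    static const int candidates[] = {16, 24, 32, 40, 48, 56, 64, 80, 96, 128};
    const int ncand = (int)(sizeof candidates / sizeof candidates[0]);
    int best_nb = candidates[0];
    double best_mflops = 0.0;

    /* Generate, time, and keep the fastest variant. */
    for (int i = 0; i < ncand; i++) {
        double mflops = time_kernel(candidates[i]);
        printf("NB=%-4d -> %.1f Mflop/s\n", candidates[i], mflops);
        if (mflops > best_mflops) {
            best_mflops = mflops;
            best_nb = candidates[i];
        }
    }
    return best_nb;   /* the winning parameter is baked into the installed library */
}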

37

[Chart: Level 3 BLAS performance in Mflop/s (0-350) for DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM, and DTRSM, comparing vendor BLAS, ATLAS BLAS, and reference BLAS.]

38

Multi-Threaded DGEMM, Intel PIII 550 MHz

[Chart: Mflop/s (0-800) versus matrix size (100-1000) for Intel BLAS and ATLAS, each on 1 and 2 processors.]

39

♦ Software release, available today:
- Level 1, 2, and 3 BLAS implementations
- See: www.netlib.org/atlas/

♦ Near future:
- Multi-threading
- Optimize the message passing system
- Extend these ideas to Java directly
- Sparse matrix-vector ops

♦ Futures:
- Runtime adaptation
- Sparsity analysis
- Iterative code improvement
- Specialization for user applications
- Adaptive libraries

40

♦ Timing and performance evaluation has been an art:
- Resolution of the clock
- Issues with cache effects
- Different systems

♦ The situation is about to change:
- Today's processors have internal counters

41

♦ Hidden from users.
♦ On most platforms the APIs, if they exist, are not appropriate for a common user, not functional, or not well documented.

♦ Existing performance counter APIs:
- Cray T3E
- SGI MIPS R10000
- IBM Power series
- DEC Alpha pfm pseudo-device interface
- Windows 95, NT and Linux

42

- Cycle count
- Floating-point instruction count
- Integer instruction count
- Instruction count
- Load/store count
- Branch taken / not taken count
- Branch mispredictions
- Pipeline stalls due to memory subsystem
- Pipeline stalls due to resource conflicts
- I/D cache misses for different levels
- Cache invalidations
- TLB misses
- TLB invalidations

43

♦ Performance Application Programming Interface

♦ The purpose of PAPI is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors

♦ Used by Tau (A. Malony) and SvPablo (D. Reed)
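A minimal example of reading two counters through PAPI's low-level C interface. This reflects the present-day PAPI API, which may differ in detail from the version described in this 2000 talk, and it omits error handling for brevity.

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };   /* cycles, FP operations */
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 2);

    PAPI_start(eventset);
    /* ... section of code being measured ... */
    PAPI_stop(eventset, values);

    printf("cycles = %lld, fp ops = %lld\n", values[0], values[1]);
    return 0;
}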

44

♦ The application is instrumented with PAPI
- One simple 'call'

♦ Will be layered over the best existing vendor-specific APIs for these platforms

♦ Sections of code that are of interest are designated with specific colors
- Using a call to mark_perfometer('color')

♦ The application is started, and a Java window containing the Perfometer application is also started

Perfometer

Call Perfometer()

46

♦ Top500
- Hans W. Meuer, Mannheim U
- Erich Strohmaier, UTK

♦ NetSolve
- Dorian Arnold, UTK
- Susan Blackford, UTK
- Henri Casanova, UCSD
- Michelle Miller, UTK
- Ganapathy Raman, UTK
- Sathish Vadhiyar, UTK

♦ ATLAS
- Clint Whaley, UTK
- Antoine Petitet, UTK

♦ PAPI
- Shirley Browne, UTK
- Nathan Garner, UTK
- Kevin London, UTK
- Phil Mucci, UTK

For additional information see:
www.top500.org
www.netlib.org/atlas/
www.netlib.org/netsolve/
www.cs.utk.edu/~dongarra/

47

[Diagram: the NetSolve client, NetSolve agent, software repository, and computational resources; http://www.cs.utk.edu/netsolve/]

48

♦ List of Top 100 Clusters
- IEEE Task Force on Cluster Computing
- Interested in assembling a list of the Top n clusters
- Based on current metric
- Starting to put together software to facilitate running and collection of data

♦ Sparse benchmark
- Look at the performance in terms of sparse matrix operations
- Iterative solvers
- Beginning to collect data

49

♦ Tool integration
- Globus - middleware infrastructure (ANL/ISI)
- Condor - workstation farm (U Wisconsin)
- NWS - Network Weather Service (U Tennessee)
- SCIRun - computational steering (U Utah)
- Ninf - NetSolve-like system (Tsukuba U)

♦ Library usage
- LAPACK/ScaLAPACK - parallel dense linear solvers
- SuperLU/MA28 - parallel sparse direct linear solvers (UCB/RAL)
- PETSc/Aztec - parallel iterative solvers (ANL/SNL)
- Other areas as well (not just linear algebra)

♦ Applications
- MCell - microcellular physiology (UCSD/Salk)
- IPARS - reservoir simulator (UTexas, Austin)
- Virtual Human - pulmonary system model (ORNL)
- RSICC - radiation safety software/simulation (ORNL)
- LUCAS - land usage modeling (U Tennessee)
- ImageVision - computer graphics and vision (Graz U)

50

[Diagram: sparse solver options: PETSc, SuperLU, SPOOLES, MA28, with a recommended choice indicated.]

♦ Iterative and direct solvers: PETSc, Aztec, SuperLU, MA28, ...

♦ Support for compressed row/column sparse matrix storage, which significantly reduces network data transmission

♦ Sequential and parallel implementations available
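As an illustration of the compressed storage mentioned above, a standard compressed sparse row (CSR) layout: only the nonzeros, their column indices, and one row-pointer array need to cross the network instead of the full dense matrix.

/* Compressed sparse row (CSR) storage for an m x n matrix with nnz nonzeros. */
struct csr_matrix {
    int m, n, nnz;
    double *values;   /* nonzero values, length nnz                      */
    int    *col_ind;  /* column index of each nonzero, length nnz        */
    int    *row_ptr;  /* start of each row in values/col_ind, length m+1 */
};

/* Transmission cost drops from m*n doubles to
 * nnz doubles + nnz ints + (m+1) ints. */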
