Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory


1

Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/~dongarra/

US/Venezuela Workshop on High Performance Computing 2000 (Seminario EEUU/Venezuela de Computación de Alto Rendimiento 2000) - WoHPC 2000

April 4-6, 2000, Puerto La Cruz, Venezuela

2

♦ Look at trends in HPC: Top500 statistics

♦ NetSolve: an example of grid middleware

♦ Performance on today's architectures: the ATLAS effort

♦ Tools for performance evaluation: the Performance API (PAPI)

3

- Started in 6/93 by Jack Dongarra, Hans W. Meuer, and Erich Strohmaier

- Basis for analyzing the HPC market
- Quantify observations
- Detection of trends (market, architecture, technology)

4

- Listing of the 500 most powerful computers in the world

- Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem)

- Updated twice a year: at SC'xy in November and at the Mannheim meeting in June

- All data available from www.top500.org
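For reference (this is the standard LINPACK/Top500 convention rather than something spelled out on the slide), Rmax is the rate achieved on the largest problem size run, crediting the solver with the standard operation count:

\[
R_{\max} \;=\; \frac{\tfrac{2}{3} N_{\max}^{3} + 2 N_{\max}^{2}}{t_{\mathrm{solve}}}
\]

where N_max is the largest matrix order the system ran and t_solve is the measured time to factor and solve the dense system Ax = b.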

[Figure: TPP performance: benchmark rate versus problem size]

5

♦ A way of tracking trends:
- In performance
- In the market
- In classes of HPC systems (architecture, technology)

♦ Original classes of machines: sequential, SMPs, MPPs, SIMDs

♦ Two new classes:
- Beowulf-class systems
- Clustering of SMPs and DSMs, which requires additional terminology: 'constellation'

6

Cluster of Clusters

♦ An ensemble of N nodes each comprising p computing elements

♦ The p elements are tightly coupled via shared memory (e.g. SMP, DSM)

♦ The N nodes are loosely coupled, i.e. distributed memory

♦ p is greater than N
♦ The distinction is which layer gives us the most power through parallelism

4TF Blue Pacific SST: 3 x 480 4-way SMP nodes, 3.9 TF peak performance, 2.6 TB memory, 2.5 Tb/s bisection bandwidth, 62 TB disk, 6.4 GB/s delivered I/O bandwidth
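A minimal sketch of applying the p-versus-N rule above, using a hypothetical classify() helper. By this rule the Blue Pacific configuration (N = 3 x 480 = 1440 nodes, p = 4 processors per node) counts as a cluster of SMPs rather than a constellation, since p < N.

#include <stdio.h>

/* Illustrative only: classify a system by the p-versus-N rule above. */
const char *classify(int n_nodes, int p_per_node)
{
    /* constellation: more processors inside a node than nodes overall */
    return (p_per_node > n_nodes) ? "constellation" : "cluster";
}

int main(void)
{
    /* ASCI Blue Pacific SST: 3 x 480 four-way SMP nodes */
    printf("%s\n", classify(3 * 480, 4));   /* prints "cluster" */
    return 0;
}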

[Chart: peak LINPACK performance (Gflop/s, 0-2000) of leading systems, 1991-2000: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), Intel ASCI Red Xeon (9632).]

RANK | MANUFACTURER | COMPUTER | RMAX [GF/s] | INSTALLATION SITE | COUNTRY | YEAR | AREA OF INSTALLATION | # PROC
1 | Intel | ASCI Red | 2379.6 | Sandia National Labs, Albuquerque | USA | 1999 | Research | 9632
2 | IBM | ASCI Blue-Pacific SST, IBM SP 604E | 2144 | Lawrence Livermore National Laboratory | USA | 1999 | Research | 5808
3 | SGI | ASCI Blue Mountain | 1608 | Los Alamos National Lab | USA | 1998 | Research | 6144
4 | SGI | T3E 1200 | 891.5 | Government | USA | 1998 | Classified | 1084
5 | Hitachi | SR8000 | 873.6 | University of Tokyo | Japan | 1999 | Academic | 128
6 | SGI | T3E 900 | 815.1 | Government | USA | 1997 | Classified | 1324
7 | SGI | Origin 2000 | 690.9 | Los Alamos National Lab/ACL | USA | 1999 | Research | 2048
8 | Cray/SGI | T3E 900 | 675.7 | Naval Oceanographic Office, Bay Saint Louis | USA | 1999 | Research Weather | 1084
9 | SGI | T3E 1200 | 671.2 | Deutscher Wetterdienst | Germany | 1999 | Research Weather | 812
10 | IBM | SP Power3 | 558.13 | UCSD/San Diego Supercomputer Center, IBM/Poughkeepsie | USA | 1999 | Research | 1024

9

[Chart: Top500 performance, Jun-93 through Nov-99, log scale from 100 Mflop/s to 100 Tflop/s, showing the N=1 system, the N=500 system, and the Sum of all 500 systems. Nov-99 values: Sum = 50.97 TF/s, N=1 (Intel ASCI Red, Sandia) = 2.38 TF/s, N=500 = 33.1 GF/s. Labeled systems include the Intel XP/S140 (Sandia), Fujitsu 'NWT' (NAL), Hitachi/Tsukuba CP-PACS/2048, and Intel ASCI Red (Sandia) at the top of the list, and machines such as the SNI VP200EX (Uni Dresden), Cray Y-MP M94/4 (KFA Jülich), Cray Y-MP C94/364 ('EPA', USA), SGI Power Challenge (Goodyear), Sun Ultra HPC 1000 (News International), Sun HPC 10000 (Merrill Lynch), and a 69-processor IBM 604e (Nabisco) further down; Schwab appears at #12. Growth is faster than Moore's law, and more than 100 systems now exceed 100 Gflop/s.]

10

[Chart: Top500 performance extrapolation, Jun-93 through Nov-09, performance in GFlop/s on a log scale (0.1 to 1,000,000), with trend lines for N=1, N=10, N=500, and Sum, reference lines at 1 TFlop/s and 1 PFlop/s, and ASCI and the Earth Simulator marked. Projection: entry at 1 TFlop/s around 2005 and 1 PFlop/s around 2010.]
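A rough check of what these projections imply, taking the Nov-99 values from the previous chart (N=1 at 2.38 TF/s, N=500 at 33.1 GF/s) and reading the annotation as a 1 PFlop/s system around 2010 and a 1 TFlop/s N=500 entry around 2005:

\[
T_{\mathrm{double}}^{\,N=1} \approx \frac{10\ \text{yr}}{\log_2\!\left(\frac{1\,\mathrm{PF/s}}{2.38\,\mathrm{TF/s}}\right)} \approx \frac{10}{8.7} \approx 1.15\ \text{yr},
\qquad
T_{\mathrm{double}}^{\,N=500} \approx \frac{5.5\ \text{yr}}{\log_2\!\left(\frac{1\,\mathrm{TF/s}}{33.1\,\mathrm{GF/s}}\right)} \approx \frac{5.5}{4.9} \approx 1.1\ \text{yr}.
\]

Either way the implied doubling time is a bit over a year, which is what "faster than Moore's law" (roughly 2x every 18 months) refers to.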

11

[Chart: number of Top500 systems (0-500) by architecture class: single processor, SMP, MPP, SIMD, constellation, cluster; Jun-93 through Nov-99.]

12

[Chart: number of Top500 systems (0-500) by manufacturer: Cray, SGI, IBM, Sun, Convex/HP, TMC, Intel, 'Japan Inc', others; Jun-93 through Nov-99. Annotation: Tera -> Cray Inc., $15M.]

[Chart: number of Top500 systems (0-500) by processor family: Alpha, IBM, HP, Intel, MIPS, Sun, other (with a note on the Inmos Transputer); Jun-93 through Nov-99.]

14

[Chart: number of Top500 systems (0-500) by installation type: research, industry, academic, classified, vendor; Jun-93 through Nov-99.]

15

♦ Clustering of shared-memory machines for scalability
- Emergence of PC commodity systems
- Pentium/Alpha based, Linux or NT driven
- 'Supercomputer performance at mail-order prices'
- Beowulf-class systems (Linux + PCs)
- Distributed shared memory (clusters of connected processors)
- Shared address space with a deep memory hierarchy

♦ Efficiency of message passing and data-parallel programming
- Helped by standards efforts such as PVM, MPI, OpenMP and HPF

♦ Many of the machines run as single-user environments
♦ Pure COTS (commodity off-the-shelf) systems

16

RANK | MANUFACTURER | COMPUTER | RMAX [GF/s] | INSTALLATION SITE | COUNTRY | YEAR | AREA OF INSTALLATION | # PROC
33 | Sun | HPC 450 Cluster | 272.1 | Sun, Burlington | USA | 1999 | Vendor | 720
34 | Compaq | AlphaServer SC | 271.4 | Compaq Computer Corp., Littleton | USA | 1999 | Vendor | 512
... | | | | | | | |
44 | Self-made | Cplant Cluster | 232.6 | Sandia National Laboratories | USA | 1999 | Research | 580
... | | | | | | | |
169 | Self-made | Alphleet Cluster | 61.3 | Institute of Physical and Chemical Res. (RIKEN) | Japan | 1999 | Research | 140
... | | | | | | | |
265 | Self-made | Avalon Cluster | 48.6 | Los Alamos National Lab/CNLS | USA | 1998 | Research | 140
... | | | | | | | |
351 | Siemens | hpcLine Cluster | 41.45 | Universitaet Paderborn/PC2 | Germany | 1999 | Academic | 192
... | | | | | | | |
454 | Self-made | Parnass2 Cluster | 34.23 | University of Bonn/Applied Mathematics | Germany | 1999 | Academic | 128

17

♦ Allow networked resources to be integrated into the desktop.

♦ Not just hardware, but also make available software resources.

♦ Locate and 'deliver' software or solutions to the user in a directly usable and 'conventional' form.

♦ Part of the motivation: software maintenance

18

19

♦ Three basic scenarios:
- Client, servers, and agents anywhere on the Internet (3(10)-150(80-ws/mpp)-Mcell)
- Client, servers, and agents on an intranet
- Client, server, and agent on the same machine

♦ 'Blue collar' grid-based computing
- User can set things up; no 'su' required
- Doesn't require deep knowledge of network programming

♦ Focus on Matlab users
- OO language, objects are matrices (pse, eg os)
- One of the most popular desktop systems for numerical computing, 400K users

20

- No knowledge of networking involved
- Hides the complexity of numerical software
- Computation location transparency
- Provides access to virtual libraries:
  - Component grid-based framework
  - Central management of library resources
  - User not concerned with the most up-to-date version
  - Automatic tie to the Netlib repository in the project
- Provides synchronous or asynchronous calls (user-level parallelism)

call NETSL('DGESV()',NSINFO,N,1,A,MAX,IPIV,B,MAX,INFO)

Asynchronous calls are also available.

>> define sparse matrix A
>> define rhs
>> [x, its] = netsolve('itmeth', 'petsc', A, rhs, 1.e-6)

http://www.cs.utk.edu/netsolve
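For completeness, the same DGESV request from C. This is only a sketch: it assumes the C entry point is a variadic netsl() taking the problem name followed by the problem's arguments, mirroring the Fortran call above, with the NetSolve status returned as the function value; the exact prototype and argument order should be checked against the NetSolve client documentation.

#include <stdio.h>

/* Assumed prototype (normally provided by the NetSolve client header). */
int netsl(char *problem_name, ...);

/* Sketch: solve A x = b remotely with the blocking NetSolve call. */
int solve_remote(int n, double *A, int lda, int *ipiv, double *b, int ldb)
{
    int info = 0;
    int status = netsl("dgesv()", n, 1, A, lda, ipiv, b, ldb, &info);
    if (status < 0)
        fprintf(stderr, "NetSolve request failed (status %d)\n", status);
    return info;   /* LAPACK-style info from the remote DGESV */
}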

- Framework to easily add arbitrary software ...
- Many numerical libraries being integrated by the NetSolve team
- Much software being integrated by users

Agent:
- Gateway to the computational servers
- Performs load balancing among the resources

Computational server:
- Various software resources installed on various hardware resources
- Configurable and extensible

http://www.cs.utk.edu/netsolve

NetSolve agent: predicts the execution times and sorts the servers.

The prediction for a server is based on:
- Its distance over the network (latency and bandwidth, with statistical averaging)
- Its performance (LINPACK benchmark)
- Its workload
- The problem size and the algorithm complexity

[Diagram: cached data gives a quick estimate; a server is re-queried when its workload information is out of date.]

A toy version of this kind of estimate is sketched below.
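This is not NetSolve's actual formula, just an illustration of the model described above: predicted time = data-transfer time from the measured latency and bandwidth, plus compute time from the algorithm's flop count, the server's LINPACK rating, and its current workload.

/* Per-server data the agent might keep (illustrative fields only). */
struct server_info {
    double latency_s;       /* measured network latency, seconds      */
    double bandwidth_Bps;   /* measured bandwidth, bytes per second   */
    double linpack_flops;   /* LINPACK performance, flop/s            */
    double workload;        /* current load, 0.0 (idle) .. 1.0 (busy) */
};

/* Toy execution-time estimate for one server; the agent would sort
 * candidate servers by this value and pick the smallest. */
double predict_time(const struct server_info *s,
                    double problem_bytes, double problem_flops)
{
    double transfer = s->latency_s + problem_bytes / s->bandwidth_Bps;
    double rate = s->linpack_flops * (1.0 - s->workload);
    double compute = problem_flops / (rate > 0.0 ? rate : 1.0);
    return transfer + compute;
}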

24

University of Tennessee's Grid Prototype: Scalable Intracampus Research Grid (SInRG)

25

netsl("command1", A, B, C);
netsl("command2", A, C, D);
netsl("command3", D, E, F);

[Diagram: without sequencing, each call is a separate client-server exchange: command1(A, B) returns C, command2(A, C) returns D, command3(D, E) returns F.]

netsl_begin_sequence( );
netsl("command1", A, B, C);
netsl("command2", A, C, D);
netsl("command3", D, E, F);
netsl_end_sequence(C, D);

[Diagram: with request sequencing, the client sends a single sequence(A, B, E) request; input A and intermediate output C, then intermediate output D and input E, flow between servers, and only the final result F is returned to the client.]

26

[Diagram: the NetSolve client talks through a uniform client-proxy interface to a NetSolve proxy (NetSolve agent, servers, and services), a Globus proxy (MDS, GASS, GRAM), a Ninf proxy (Ninf services), or a Legion proxy (Legion services).]

27

♦ Kerberos used to maintain Access Control Lists and manage access to computational resources.

♦ NetSolve properly handles authorized and non-authorized components together in the same system.

♦ In use by the DoD Modernization Program at the Army Research Lab

28

♦ Multiple requests for a single problem.

♦ Previous solution:
- Many calls to netslnb( ); /* non-blocking */

♦ New solution:
- Single call to netsl_farm( );

SCIRun torso defibrillator application

30

(A bit in the future)

- NetSolve examines the user's data together with the resources available (in the grid sense) and dynamically decides how to reach the best time to solution.
- The decision is based on problem size, hardware/software, network connection, etc.
- Method and placement.

[Diagram: decision tree over the matrix type: dense (rectangular, general, symmetric) versus sparse; for sparse matrices, direct versus iterative methods, each for general and symmetric matrices.]

♦ x = netsolve('linear system', A, b)
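An illustrative sketch of the decision tree above (not NetSolve's internal code): pick a method from the matrix's storage and structure. The solver names in the strings are standard examples (LAPACK DGESV/DSYSV, CG, GMRES), not a claim about what NetSolve dispatches to.

typedef enum { DENSE, SPARSE }        storage_t;
typedef enum { GENERAL, SYMMETRIC }   structure_t;

/* Illustrative only: map the decision tree onto a solver choice. */
const char *choose_solver(storage_t storage, structure_t structure,
                          int prefer_iterative)
{
    if (storage == DENSE)
        return (structure == SYMMETRIC) ? "dense symmetric (e.g. LAPACK DSYSV)"
                                        : "dense general (e.g. LAPACK DGESV)";
    /* sparse: direct or iterative branch, each with general/symmetric leaves */
    if (prefer_iterative)
        return (structure == SYMMETRIC) ? "sparse iterative, symmetric (e.g. CG)"
                                        : "sparse iterative, general (e.g. GMRES)";
    return (structure == SYMMETRIC) ? "sparse direct, symmetric"
                                    : "sparse direct, general";
}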

31

Where Does the Performance Go? or: Why Should I Care About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Chart, 1980-2000: microprocessor performance improves ~60%/year (2x every 1.5 years, 'Moore's Law') while DRAM improves ~9%/year (2x every 10 years); the processor-memory performance gap grows about 50% per year.]

♦ By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
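A small, self-contained illustration of why locality matters: the same matrix sum traversed row by row (unit stride, cache friendly for C's row-major layout) versus column by column (stride N, roughly one cache miss per element once the matrix no longer fits in cache).

#define N 2048
static double a[N][N];   /* row-major in C */

/* Unit-stride traversal: consecutive elements share cache lines. */
double sum_rowwise(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Stride-N traversal: each access touches a new cache line once
 * the working set exceeds the cache, so it runs much slower. */
double sum_colwise(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}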

[Diagram: the memory hierarchy, from processor registers and on-chip cache through level 2 and 3 cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape), extended by distributed memory and remote cluster memory; access times grow from nanoseconds (registers, cache) through tens of milliseconds (disk) to tens of seconds (tape), while capacities grow from hundreds of bytes to terabytes.]

33

♦ Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.

♦ Hardware and software have a large design space with many parameters (a blocked-kernel sketch follows this list):
- Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
- Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.

♦ About a year ago there was no tuned BLAS for the Pentium under Linux.
♦ Need for quick/dynamic deployment of optimized routines.
♦ ATLAS - Automatically Tuned Linear Algebra Software
- Similar efforts: PhiPAC from Berkeley, FFTW from MIT (http://www.fftw.org)
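To make the 'blocking sizes' parameter concrete, here is a plain blocked matrix multiply with a single tunable block size NB. This is the kind of kernel whose best NB, unrolling, and scheduling ATLAS searches for empirically; it is not ATLAS's generated code.

#define NB 64   /* block size: the kind of parameter that gets tuned per machine */

/* C += A * B for n x n row-major matrices, blocked so an NB x NB
 * working set of each operand stays resident in cache. */
void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}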

34

♦ An adaptive software architecture: high performance, portability, elegance

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.

35

(DGEMM, n = 500)


[Chart: DGEMM (n = 500) performance in Mflop/s (0-900) for vendor BLAS, ATLAS BLAS, and reference F77 BLAS across architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200.]

36

♦ Code is iteratively generated and timed until the optimal case is found (see the search-loop sketch after this list). We try:
- Differing NBs
- Breaking false dependencies
- M, N and K loop unrolling

♦ Designed for RISC architectures:
- Superscalar
- Needs a reasonable C compiler

♦ Takes ~20 minutes to run

♦ Two phases:
- Probes the system for its features
- Does a parameter study

♦ On-chip multiply optimizes for:
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization

♦ A new model of high-performance programming where critical code is machine generated using parameter optimization.
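A skeleton of that generate-and-time loop, with a hypothetical time_kernel() helper standing in for "compile and run one candidate, report its Mflop/s"; the real ATLAS search also varies unrolling factors, scheduling, and more.

#include <stdio.h>

/* Hypothetical helper (not a real ATLAS API): builds and times a
 * candidate kernel with block size nb, returning measured Mflop/s. */
extern double time_kernel(int nb);

int search_best_nb(void)
{
    static const int candidates[] = {16, 24, 32, 40, 48, 56, 64, 80, 96, 128};
    const int ncand = (int)(sizeof candidates / sizeof candidates[0]);
    int best_nb = candidates[0];
    double best_mflops = 0.0;

    /* Generate, time, and keep the fastest variant. */
    for (int i = 0; i < ncand; i++) {
        double mflops = time_kernel(candidates[i]);
        printf("NB=%-4d -> %.1f Mflop/s\n", candidates[i], mflops);
        if (mflops > best_mflops) {
            best_mflops = mflops;
            best_nb = candidates[i];
        }
    }
    return best_nb;   /* the winning parameter is baked into the installed library */
}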

37

[Chart: Level 3 BLAS performance in Mflop/s (0-350) for DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM, and DTRSM, comparing vendor BLAS, ATLAS BLAS, and reference BLAS.]

38

Multi-Threaded DGEMM, Intel PIII 550 MHz

[Chart: Mflop/s (0-800) versus matrix size (100-1000) for Intel BLAS and ATLAS, each on 1 and 2 processors.]

39

♦ Software release, available today:
- Level 1, 2, and 3 BLAS implementations
- See: www.netlib.org/atlas/

♦ Near future:
- Multi-threading
- Optimize the message passing system
- Extend these ideas to Java directly
- Sparse matrix-vector ops

♦ Futures:
- Runtime adaptation
- Sparsity analysis
- Iterative code improvement
- Specialization for user applications
- Adaptive libraries

40

♦ Timing and performance evaluation has been an art:
- Resolution of the clock
- Issues with cache effects
- Different systems

♦ The situation is about to change:
- Today's processors have internal counters

41

♦ Hidden from users.
♦ On most platforms the APIs, if they exist, are not appropriate for a common user, not functional, or not well documented.

♦ Existing performance counter APIs:
- Cray T3E
- SGI MIPS R10000
- IBM Power series
- DEC Alpha pfm pseudo-device interface
- Windows 95, NT and Linux

42

- Cycle count
- Floating-point instruction count
- Integer instruction count
- Instruction count
- Load/store count
- Branch taken / not taken count
- Branch mispredictions
- Pipeline stalls due to memory subsystem
- Pipeline stalls due to resource conflicts
- I/D cache misses for different levels
- Cache invalidations
- TLB misses
- TLB invalidations

43

♦ Performance Application Programming Interface

♦ The purpose of PAPI is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors

♦ Used by Tau (A. Malony) and SvPablo (D. Reed)
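A minimal example of reading two counters through PAPI's low-level C interface. This reflects the present-day PAPI API, which may differ in detail from the version described in this 2000 talk, and it omits error handling for brevity.

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };   /* cycles, FP operations */
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 2);

    PAPI_start(eventset);
    /* ... section of code being measured ... */
    PAPI_stop(eventset, values);

    printf("cycles = %lld, fp ops = %lld\n", values[0], values[1]);
    return 0;
}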

44

♦ The application is instrumented with PAPI
- One simple 'call'

♦ Will be layered over the best existing vendor-specific APIs for these platforms

♦ Sections of code that are of interest are designated with specific colors
- Using a call to mark_perfometer('color')

♦ The application is started, and a Java window containing the Perfometer application is also started

Perfometer

Call Perfometer()

46

♦ Top500
- Hans W. Meuer, Mannheim U
- Erich Strohmaier, UTK

♦ NetSolve
- Dorian Arnold, UTK
- Susan Blackford, UTK
- Henri Casanova, UCSD
- Michelle Miller, UTK
- Ganapathy Raman, UTK
- Sathish Vadhiyar, UTK

♦ ATLAS
- Clint Whaley, UTK
- Antoine Petitet, UTK

♦ PAPI
- Shirley Browne, UTK
- Nathan Garner, UTK
- Kevin London, UTK
- Phil Mucci, UTK

For additional information see:
www.top500.org
www.netlib.org/atlas/
www.netlib.org/netsolve/
www.cs.utk.edu/~dongarra/

47

[Diagram: the NetSolve client, NetSolve agent, software repository, and computational resources; http://www.cs.utk.edu/netsolve/]

48

♦ List of Top 100 Clusters
- IEEE Task Force on Cluster Computing
- Interested in assembling a list of the Top n clusters
- Based on current metric
- Starting to put together software to facilitate running and collection of data

♦ Sparse benchmark
- Look at the performance in terms of sparse matrix operations
- Iterative solvers
- Beginning to collect data

49

♦ Tool integration
- Globus - middleware infrastructure (ANL/ISI)
- Condor - workstation farm (U Wisconsin)
- NWS - Network Weather Service (U Tennessee)
- SCIRun - computational steering (U Utah)
- Ninf - NetSolve-like system (Tsukuba U)

♦ Library usage
- LAPACK/ScaLAPACK - parallel dense linear solvers
- SuperLU/MA28 - parallel sparse direct linear solvers (UCB/RAL)
- PETSc/Aztec - parallel iterative solvers (ANL/SNL)
- Other areas as well (not just linear algebra)

♦ Applications
- MCell - microcellular physiology (UCSD/Salk)
- IPARS - reservoir simulator (UTexas, Austin)
- Virtual Human - pulmonary system model (ORNL)
- RSICC - radiation safety software/simulation (ORNL)
- LUCAS - land usage modeling (U Tennessee)
- ImageVision - computer graphics and vision (Graz U)

50

[Diagram: sparse solver options: PETSc, SuperLU, SPOOLES, MA28, with a recommended choice indicated.]

♦ Iterative and direct solvers: PETSc, Aztec, SuperLU, MA28, ...

♦ Support for compressed row/column sparse matrix storage, which significantly reduces network data transmission

♦ Sequential and parallel implementations available
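As an illustration of the compressed storage mentioned above, a standard compressed sparse row (CSR) layout: only the nonzeros, their column indices, and one row-pointer array need to cross the network instead of the full dense matrix.

/* Compressed sparse row (CSR) storage for an m x n matrix with nnz nonzeros. */
struct csr_matrix {
    int m, n, nnz;
    double *values;   /* nonzero values, length nnz                      */
    int    *col_ind;  /* column index of each nonzero, length nnz        */
    int    *row_ptr;  /* start of each row in values/col_ind, length m+1 */
};

/* Transmission cost drops from m*n doubles to
 * nnz doubles + nnz ints + (m+1) ints. */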
