1
A Look at Library Software for Linear Algebra: Past, Present, and Future
Jack Dongarra University of Tennessee
and Oak Ridge National Laboratory
2
1960's

1961
– IBM Stretch delivered to LANL
1962
– Virtual memory from U. Manchester, T. Kilburn
1964
– AEC urges manufacturers to look at "radical new" machine structures.
  » This leads to CDC Star-100, TI ASC, and Illiac-IV.
– CDC 6600; S. Cray's design
  » Functional parallelism, leading to RISC
  » (3 times faster than the IBM Stretch)
1965
– DEC ships first PDP-8 & IBM ships 360
– First CS PhD, U of Penn, Richard Wexelblat
– Wilkinson's The Algebraic Eigenvalue Problem published
1966
– DARPA contract with U of I to build the ILLIAC IV
– Fortran 66
1967
– Forsythe & Moler published
  » Fortran, Algol and PL/I
1969
– ARPANET work begins
  » 4 computers connected: UC-SB, UCLA, SRI and U of Utah
– CDC 7600
  » Pipelined architecture; 3.1 Mflop/s
– Unix developed by Thompson and Ritchie
3
Wilkinson-Reinsch Handbook

"In general we have aimed to include only algorithms that provide something approaching an optimal solution, this may be from the point of view of generality, elegance, speed, or economy of storage."
– Part 1: Linear Systems
– Part 2: The Algebraic Eigenvalue Problem

Before publishing the Handbook
– Virginia Klema and others at ANL began translation of the Algol procedures into Fortran subroutines
4
1970's

1970
– NATS Project conceived
  » Concept of certified math software and the process involved with its production
– Purdue Math Software Symposium
– NAG project begins
1971
– Handbook for Automatic Computation, Vol. II
  » Landmark in the development of numerical algorithms and software
  » Basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL, and the F chapters of NAG
– IBM 370/195
  » Pipelined architecture; out-of-order execution; 2.5 Mflop/s
– Intel 4004; 60 Kop/s
– IMSL founded
1972
– Cray Research founded
– Intel 8008
– 1/4-size Illiac IV installed at NASA Ames
  » 15 Mflop/s achieved; 64 processors
– Paper by S. Reddaway on massive bit-level parallelism
– EISPACK available
  » 150 installations; EISPACK Users' Guide
  » 5 versions: IBM, CDC, Univac, Honeywell and PDP, distributed free via the Argonne Code Center
– M. Flynn publishes paper on architectural taxonomy
– ARPANet
  » 37 computers connected
1973
– BLAS report in SIGNUM
  » Lawson, Hanson, & Krogh
5
NATS Project

National Activity for Testing Software (NSF, Argonne, Texas and Stanford)
Project to explore the problems of testing, certifying, disseminating and maintaining quality math software.
– First EISPACK, later FUNPACK
  » Influenced other "PACK"s: ELLPACK, FISHPACK, ITPACK, MINPACK, PDEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK . . .
Key attributes of math software
– reliability, robustness, structure, usability, and validity
6
EISPACK

Algol versions of the algorithms written in Fortran
– Restructured to avoid underflow
– Check user's claims, such as whether a matrix is positive definite
– Format programs in a unified fashion
  » Burt Garbow
– Field test sites
1971 U of Michigan Summer Conference
– JHW algorithms & CBM software
1972 Software released via Argonne Code Center
– 5 versions
Software certified in the sense that reports of poor or incorrect performance "would gain the immediate attention from the developers"
7
EISPACK

EISPAC Control Program, Boyle
– One interface allowed access to the whole package on IBM machines
Argonne's interactive RESCUE system
– Allowed us to easily manipulate tens of thousands of lines
Generalizer/Selector
– Convert the IBM version to a general form
– Selector extracts the appropriate version
1976 Extensions to the package ready - EISPACK II
8
1970's continued

1974
– Intel 8080
– Level 1 BLAS activity started by community; Purdue 5/74
– LINPACK meeting in summer at ANL
1975
– First issue of TOMS
– Second LINPACK meeting at ANL
  » Lay the groundwork and hammer out what was and was not to be included in the package. Proposal submitted to the NSF.
1976
– Cray 1 - model for vector computing
  » 4 Mflop/s in '79; 12 Mflop/s in '83; 27 Mflop/s in '89
– LINPACK work during summer at ANL
– EISPACK Second Edition of Users' Guide
1977
– DEC VAX 11/780; super mini
  » .14 Mflop/s; 4.3 GB virtual memory
– LINPACK test software developed & sent
– EISPACK second release
– IEEE Arithmetic standard meetings
  » Paper by Palmer on Intel std for fl pt
1978
– Fortran 77
– LINPACK software released
  » Sent to NESC and IMSL for distribution
1979
– John Cocke designs 801
– ICL DAP delivered to QMC, London
– Level 1 BLAS published/released
– LINPACK Users' Guide
  » Appendix: 17 machines, PDP-10 to Cray-1
9
Basic Linear Algebra Subprograms

BLAS, 1973-1977. Consensus on:
– Names
– Calling sequences
– Functional descriptions
– Low-level linear algebra operations

Success results from
– Extensive public involvement
– Careful consideration of implications

A design tool for software in numerical linear algebra
Improve readability and aid documentation
Aid modularity and maintenance, and improve robustness of software calling the BLAS
Improve portability, without sacrificing efficiency, through standardization
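The calling sequences the community agreed on survive today; the Level 1 routine DAXPY(N, DA, DX, INCX, DY, INCY), for example, computes y ← αx + y over strided vectors. A minimal Python transliteration, for illustration only (the name axpy, list-based vectors, and the positive-stride assumption are ours, not part of the standard):

```python
def axpy(n, alpha, x, incx, y, incy):
    """y <- alpha*x + y over n elements, walking x and y with
    strides incx and incy, mirroring the DAXPY calling sequence.
    Assumes positive strides (the real BLAS also handles negative)."""
    ix, iy = 0, 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y

# axpy(3, 2.0, [1.0, 2.0, 3.0], 1, [1.0, 1.0, 1.0], 1) -> [3.0, 5.0, 7.0]
```

The stride arguments let one routine traverse rows, columns, or diagonals of a column-major matrix, one reason the interface generalized so well.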
10
CRAY and VAX
VAXination of groups and departments
Cray’s introduction of vector computing
Both had a significant impact on scientific computing
11
LINPACK

June 1974, ANL
– Jim Pool's meeting
February 1975, ANL
– Groundwork for the project
January 1976
– Funded by NSF and DOE: ANL/UNM/UM/UCSD
Summer 1976 - Summer 1977
Fall 1977
– Software to 26 test sites
December 1978
– Software released, NESC and IMSL

LINPACK

Research into the mechanics of software production.
Provide a yardstick against which future software would be measured.
Produce a library used by people and by those who wished to modify/extend the software to handle special problems.
Hope it would be used in the classroom.
Machine independent and efficient
– No mixed-mode arithmetic
[Cover art: the LINPACK staircase logo - the letters L I N P A C K repeated, dropping the leading letter on each row]
13
LINPACK Efficiency

Condition estimator
Inner loops via BLAS
Column access
Unrolling loops (done for the BLAS)
BABE algorithm for tridiagonal matrices
TAMPR system allowed easy generation of versions
LINPACK Benchmark
– Today reports machines from the Cray T90 to the Palm Pilot
  » (1.0 Gflop/s to 1.4 Kflop/s)
for i = 1:4:n
  y(i)   = y(i)   + a*x(i)
  y(i+1) = y(i+1) + a*x(i+1)
  y(i+2) = y(i+2) + a*x(i+2)
  y(i+3) = y(i+3) + a*x(i+3)
end
14
1980's Vector Computing (Parallel Processing)

1980
– Total computers in use in the US exceeds 1M
– CDC introduces Cyber 205
1981
– IBM introduces the PC
  » Intel 8088/DOS
– BBN Butterfly delivered
– FPS delivers FPS-164
  » Start of mini-supercomputer market
– SGI founded by Jim Clark and others
– Loop unrolling at outer level for data locality and parallelism
  » Amounts to matrix-vector operations
– Cuppen's method for symmetric eigenvalue D&C published
  » Talk at Oxford Gatlinburg (1980)
1982
– Illiac IV decommissioned
– Steve Chen's group at Cray produces X-MP
– First Denelcor HEP installed (.21 Mflop/s)
– Sun Microsystems, Convex and Alliant founded
1983
– Total computers in use in the US exceed 10M
– DARPA starts Strategic Computing Initiative
  » Helps fund Thinking Machines, BBN, WARP
– Cosmic Cube hypercube running at Caltech
  » John Palmer, after seeing the Caltech machine, leaves Intel to found Ncube
– Encore, Sequent, TMC, SCS, Myrias founded
– Cray 2 introduced
– NEC SX-1 and SX-2, Fujitsu ships VP-200
– ETA Systems spun off from CDC
– Golub & Van Loan published
15
1980's continued

1984
– NSFNET; 5000 computers; 56 Kb/s lines
– MathWorks founded
– EISPACK third release
– Netlib begins 1/3/84
– Level 2 BLAS activity started
  » Gatlinburg (Waterloo), Purdue, SIAM
– Intel Scientific Computers started by J. Rattner
  » Produce commercial hypercube
– Cray X-MP 1 processor, 21 Mflop/s Linpack
– Multiflow founded by J. Fisher; VLIW architecture
– Apple introduces Mac & IBM introduces PC AT
– IJK paper
1985
– IEEE Standard 754 for floating point
– IBM delivers 3090 vector; 16 Mflop/s Linpack, 138 Peak
– TMC demos CM-1 to DARPA
– Intel produces first iPSC/1 Hypercube
  » 80286 connected via Ethernet controllers
– Fujitsu VP-400; NEC SX-2; Cray 2; Convex C1
– Ncube/10; .1 Mflop/s Linpack 1 processor
– FPS-264; 5.9 Mflop/s Linpack, 38 Peak
– IBM begins RP3 project
– Stellar (Poduska), Ardent (Michels), and Supertek Computers founded
– Denelcor closes doors
16
1980's (Lost Decade for Parallel Software)

1986
– # of computers in US exceeds 30M
– TMC ships CM-1; 64K 1-bit processors
– Cray X-MP
– IBM and MIPS release first RISC WS
1987
– ETA Systems family of supercomputers
– Sun Microsystems introduces its first RISC WS
– IBM invests in Steve Chen's SSI
– Cray Y-MP
– First NA-DIGEST
– Level 3 BLAS work begun
– LAPACK: Prospectus for Development of a LA Library for HPC
1988
– AMT delivers first re-engineered DAP
– Intel produces iPSC/2
– Stellar and Ardent begin delivering single-user graphics workstations
– Level 2 BLAS paper published
1989
– # of computers in the US > 50M
– Stellar and Ardent merge, forming Stardent
– S. Cray leaves Cray Research to form Cray Computer
– Ncube 2nd generation machine
– ETA out of business
– Intel 80486 and i860; 1M transistors
  » i860 RISC & 64-bit floating point
17
EISPACK 3 and BLAS 2 & 3

Machine independence for EISPACK
Reduce the possibility of overflow and underflow.
Mods to the Algol from S. Hammarling
Rewrite reductions to tridiagonal form to involve sequential access to memory
"Official" double precision version
Inverse iteration routines modified to reduce the size for reorthogonalization.

BLAS (Level 1) vector operations provide for too much data movement.
Community effort to define extensions
– Matrix-vector ops
– Matrix-matrix ops
18
Netlib - Mathematical Software and Data

Began in 1985
– JD and Eric Grosse, AT&T Bell Labs
Motivated by the need for cost-effective, timely distribution of high-quality mathematical software to the community.
Designed to send, by return electronic mail, requested items.
Automatic mechanism for electronic dissemination of freely available software.
– Still in use and growing
– Mirrored at 9 sites around the world
Moderated collection / distributed maintenance

NA-DIGEST and NA-Net
– Gene Golub, Mark Kent and Cleve Moler
19
Netlib Growth

[Chart: number of requests per year, 1985-1997, growing from near zero to roughly 10 million, broken out by access method: ftp, xnetlib, gopher, http]

Just over 6,000 requests in 1985
Over 29,000,000 requests total
Over 9 million hits in 1997; 5.4 million so far in 1998
LAPACK best seller: 1.6M hits
20
1990's

1990
– Internet; World Wide Web
– Motorola introduces 68040
– NEC ships SX-3; first Japanese parallel vector supercomputer
– IBM announces RS/6000 family
  » Has FMA instruction
– Intel hypercube based on 860 chip
  » 128 processors
– Alliant delivers FX/2800 based on i860
– Fujitsu VP-2600
– PVM project started
– Level 3 BLAS published
1991
– Stardent to sell business and close
– Cray C-90
– Kendall Square Research delivers 32-processor KSR-1
– TMC produces CM-200 and announces CM-5 MIMD computer
– DEC announces the Alpha
– TMC produces the first CM-5
– Fortran 90
– Workshop to consider a Message Passing Standard, beginnings of MPI
  » Community effort
– Xnetlib running
1992
– LAPACK software released & Users' Guide published
21
Architectures

[Chart: number of systems (# Systems, 0 to 400) by architecture type: MPP, PVP, SMP]
22
Parallel Processing Comes of Age

"There are three rules for programming parallel computers. We just don't know what they are yet." -- Gary Montry

"Embarrassingly Parallel" - Cleve Moler
"Humiliatingly Parallel"
23
Memory Hierarchy and LAPACK

ijk-implementations
The loop order affects the order in which data are referenced; some orderings are better at keeping data in the higher levels of the memory hierarchy.
Applies to matrix multiply and reductions to condensed form
– May do slightly more flops
– Up to 3 times faster
for _ = 1:n
  for _ = 1:n
    for _ = 1:n
      a(i,j) = a(i,j) + b(i,k)*c(k,j)
    end
  end
end

(the three loops range over i, j, and k in any order)
24
BLAS                 Memory Refs   Flops   Flops/Memory Refs
Level 1: y = y+ax    3n            2n      2/3
Level 2: y = y+Ax    n^2           2n^2    2
Level 3: C = C+AB    4n^2          2n^3    n/2
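These ratios follow from simple counting: an axpy reads two vectors and writes one (3n references) for 2n flops, a matrix-vector product is dominated by reading A (about n^2 references) for 2n^2 flops, and a matrix-matrix multiply touches about 4n^2 matrix entries while doing 2n^3 flops. A quick sanity check (the function name is ours, for illustration):

```python
def flops_per_memref(n):
    """Flops per memory reference for the three BLAS levels,
    using the standard counts for y=y+a*x, y=y+A*x, C=C+A*B."""
    return {
        "level1": (2 * n) / (3 * n),        # 2/3, independent of n
        "level2": (2 * n**2) / (n**2),      # 2, independent of n
        "level3": (2 * n**3) / (4 * n**2),  # n/2, grows with n
    }
```

Only the Level 3 ratio grows with n, which is why matrix-multiply-rich algorithms can approach peak speed on machines with deep memory hierarchies.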
Why Higher Level BLAS?

Can only do arithmetic on data at the top of the hierarchy.
Higher level BLAS let us do this.
Development of blocked algorithms is important for performance.
[Chart: Mflop/s vs. order of vectors/matrices (10 to 500) for Level 1, Level 2, and Level 3 BLAS on an IBM RS/6000-590 (66 MHz, 264 Mflop/s peak)]
25
History of Block Partitioned Algorithms

The ideas are not new. Early algorithms made use of a small main memory with tapes as secondary storage.
Recent work centers on the use of vector registers, Level 1 and 2 cache, main memory, and "out of core" memory.
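The block-partitioned idea can be sketched with a blocked matrix multiply: carve the matrices into nb-by-nb blocks so each block sub-problem can stay in fast memory while it is reused. A sketch in Python with plain lists (the structure and names are ours, not LAPACK's; a tuned implementation would call the BLAS on each block):

```python
def blocked_matmul(A, B, nb):
    """C = A*B for square matrices (lists of lists), computed block
    by block with block size nb, so each nb-by-nb sub-problem can
    remain in fast memory. A sketch of the idea, not tuned code."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # C[ii:ii+nb, jj:jj+nb] += A[ii:ii+nb, kk:kk+nb] * B[kk:kk+nb, jj:jj+nb]
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + nb, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

In LAPACK the same partitioning is applied to factorizations, so the bulk of the work lands in Level 3 BLAS calls on the blocks.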
26
LAPACK

Linear Algebra library in Fortran 77
– Solution of systems of equations
– Solution of eigenvalue problems
Combines algorithms from LINPACK and EISPACK into a single package
Block algorithms
– Efficient on a wide range of computers
  » RISC, Vector, SMPs
User interface similar to LINPACK
– Single, Double, Complex, Double Complex
Built on the Level 1, 2, and 3 BLAS
HP-48G to CRAY T-90
Cray Y-MP, Cholesky factorization, n=500 (Mflop/s)

                           1 Proc   8 Procs
j-variant, LINPACK           72        72
j-variant, Level 2 BLAS     251       378
j-variant, Level 3 BLAS     287      1225
k-variant, Level 3 BLAS     290      1414
27

Rate = ops / Time, where ops = (2/3)n^3 + 2n^2
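The LINPACK benchmark rate is this nominal operation count for a dense solve of order n divided by the measured time, regardless of how many operations a particular implementation actually performs. A small sketch (the function name is ours, for illustration):

```python
def linpack_rate(n, seconds):
    """LINPACK benchmark rate in Mflop/s for a dense solve of
    order n: nominal operation count (2/3)n^3 + 2n^2 divided by
    the elapsed time in seconds."""
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds / 1e6
```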
28
1990's continued

1993
– Intel Pentium systems start to ship
– ScaLAPACK prototype software released
  » First portable library for distributed memory machines
  » Intel, TMC and workstations using PVM
– PVM 3.0 available
1994
– MPI-1 finished
1995
– Templates project
1996
– Internet; 34M users
– Nintendo 64
  » More computing power than a Cray 1 and much, much better graphics
1997
– MPI-2 finished
– Fortran 95
1998
– Issues of parallel and numerical stability
– Divide time
– DSM architectures
– "New" algorithms
  » Chaotic iteration
  » Sparse LU w/o pivoting
  » Pipelined HQR
  » Graph partitioning
  » Algorithmic bombardment
29
Templates Project

Iterative methods for large sparse systems
– Communicate "state of the art" algorithms to the HPC community
– Subtle algorithmic issues addressed, e.g. convergence, preconditioners, data structures
– Performance and parallelism considerations
– Gave computational scientists algorithms in the form they wanted.
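The Templates book presented each method in a compact, language-neutral form the reader could transliterate. A Python rendering of the unpreconditioned conjugate gradient template for a symmetric positive definite system Ax = b (the names and the matvec-callback interface are our choices, for illustration):

```python
def cg(matvec, b, x0, tol=1e-10, maxit=1000):
    """Conjugate gradient for A x = b, A symmetric positive
    definite; matvec(v) returns A*v. Plain-list vectors."""
    n = len(b)
    x = list(x0)
    r = [b[i] - ax for i, ax in enumerate(matvec(x))]  # residual
    p = list(r)                                        # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(maxit):
        if rs**0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

The caller supplies only a matrix-vector product, so the same template works for any sparse storage format - one of the "data structures" points the project addressed.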
30
ScaLAPACK

Library of software for dense & banded problems
  » Sparse direct solvers being developed
Distributed memory - message passing
– PVM and MPI
MIMD computers, networks of workstations, and clumps of SMPs
SPMD Fortran 77 with object-based design
Built on various modules
– PBLAS (BLACS and BLAS)
  » PVM, MPI, IBM SP, CRI T3, Intel, TMC
  » Provides the right level of notation.
31
High-Performance Computing Directions

Move toward shared memory
– SMPs and Distributed Shared Memory
– Shared address space w/ deep memory hierarchy
Clustering of shared memory machines for scalability
– Emergence of PC commodity systems
  » Pentium based, NT or Linux driven
  » At UTK, a cluster of 14 (dual) Pentium-based nodes: 7.2 Gflop/s
Efficiency of message passing and data parallel programming
– Helped by standards efforts such as PVM, MPI and HPF
Complementing "Supercomputing" with Metacomputing: the Computational Grid
32
Heterogeneous Computing

Heterogeneity introduces new bugs in parallel code.
Slightly different floating point can make data-dependent branches go different ways when we expect identical behavior.
A "correct" algorithm on a network of identical workstations may fail if a slightly different machine is introduced.
Some bugs are easy to fix (compare s < tol on one processor and broadcast the result).
Some are hard to fix (handling denorms; getting the same answer independent of the number of processors).
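The branch-divergence failure mode can be reproduced on one machine by summing the same partial results in two different orders, as two processors in a heterogeneous network might. A toy sketch (the values and variable names are contrived for illustration):

```python
# Two "processors" reduce the same three partial results, but in
# different orders. In double precision the 1.0 is absorbed by 1e16
# in one association but survives the other, so a data-dependent
# branch (s < tol) goes different ways on identical input.
vals = [1e16, 1.0, -1e16]

s_proc0 = (vals[0] + vals[1]) + vals[2]  # 1.0 absorbed: sum is 0.0
s_proc1 = (vals[0] + vals[2]) + vals[1]  # cancellation first: sum is 1.0

tol = 0.5
branch0 = s_proc0 < tol
branch1 = s_proc1 < tol  # the two processors disagree
```

The easy fix from the slide: evaluate s < tol on one processor and broadcast the boolean, so every processor takes the same branch.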
33
Java - For Numerical Computations?

Java likely to be a dominant language.
Provides for machine-independent code.
A C++-like language.
No pointers, gotos, overloading of arithmetic ops, or explicit memory deallocation.
Portability achieved via an abstract machine.
Java is a convenient user-interface builder which allows one to quickly develop customized interfaces.
34
Network Enabled Servers
Allow networked resources to be integrated into the desktop.
Many hosts, co-existing in a loose confederation tied together with high-speed links.
Users have the illusion of a very powerful computer on the desk.
Locate and “deliver” software or solutions to the user in a directly usable and “conventional” form.
Part of the motivation: software maintenance.
35
Future: Petaflops (10^15 fl pt ops/s)

A Pflop for 1 second ≈ a typical workstation computing for 1 year.
From an algorithmic standpoint
– concurrency
– data locality
– latency & sync
– floating point accuracy
– dynamic redistribution of workload
– new languages and constructs
– role of numerical libraries
– algorithm adaptation to hardware failure
May be feasible and "affordable" by the year 2010
Today, 10^15 flops amounts to about a year of computing on our workstations.
36
Summary

As a community we have a lot to be proud of in terms of the algorithms and software we have produced.
– generality, elegance, speed, or economy of storage
Software is still being used, in many cases 30 years after it was written.
37