1
A Look at Library Software for Linear Algebra: Past, Present, and Future
Jack Dongarra University of Tennessee
and Oak Ridge National Laboratory
2
1960's

1961
– IBM Stretch delivered to LANL
1962
– Virtual memory from U. Manchester, T. Kilburn
1964
– AEC urges manufacturers to look at "radical new" machine structures.
  » This leads to CDC Star-100, TI ASC, and Illiac-IV.
– CDC 6600; S. Cray's design
  » Functional parallelism, leading to RISC
  » (3 times faster than the IBM Stretch)
1965
– DEC ships first PDP-8 & IBM ships 360
– First CS PhD, U of Penn, Richard Wexelblat
– Wilkinson's The Algebraic Eigenvalue Problem published
1966
– DARPA contract with U of I to build the ILLIAC IV
– Fortran 66
1967
– Forsythe & Moler published
  » Fortran, Algol and PL/I
1969
– ARPANET work begins
  » 4 computers connected: UC-SB, UCLA, SRI and U of Utah
– CDC 7600
  » Pipelined architecture; 3.1 Mflop/s
– Unix developed by Thompson and Ritchie
3
Wilkinson-Reinsch Handbook

"In general we have aimed to include only algorithms that provide something approaching an optimal solution, this may be from the point of view of generality, elegance, speed, or economy of storage."
– Part 1: Linear Systems
– Part 2: The Algebraic Eigenvalue Problem

Before publishing the Handbook
– Virginia Klema and others at ANL began translation of the Algol procedures into Fortran subroutines
4
1970's

1970
– NATS Project conceived
  » Concept of certified math software and the process involved with its production
– Purdue Math Software Symposium
– NAG project begins
1971
– Handbook for Automatic Computation, Vol. II
  » Landmark in the development of numerical algorithms and software
  » Basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL, and the F chapters of NAG
– IBM 370/195
  » Pipelined architecture; out-of-order execution; 2.5 Mflop/s
– Intel 4004; 60 Kop/s
– IMSL founded
1972
– Cray Research founded
– Intel 8008
– 1/4-size Illiac IV installed at NASA Ames
  » 15 Mflop/s achieved; 64 processors
– Paper by S. Reddaway on massive bit-level parallelism
– EISPACK available
  » 150 installations; EISPACK Users' Guide
  » 5 versions: IBM, CDC, Univac, Honeywell and PDP, distributed free via the Argonne Code Center
– M. Flynn publishes paper on architectural taxonomy
– ARPANet
  » 37 computers connected
1973
– BLAS report in SIGNUM
  » Lawson, Hanson, & Krogh
5
NATS Project

National Activity for Testing Software (NSF, Argonne, Texas and Stanford)
Project to explore the problems of testing, certifying, disseminating and maintaining quality math software.
– First EISPACK, later FUNPACK
  » Influenced other "PACK"s: ELLPACK, FISHPACK, ITPACK, MINPACK, PDEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK . . .
Key attributes of math software
– reliability, robustness, structure, usability, and validity
6
EISPACK

Algol versions of the algorithms written in Fortran
– Restructured to avoid underflow
– Check user's claims, such as whether a matrix is positive definite
– Format programs in a unified fashion
  » Burt Garbow
– Field test sites
1971 U of Michigan Summer Conference
– JHW algorithms & CBM software
1972 Software released via Argonne Code Center
– 5 versions
Software certified in the sense that reports of poor or incorrect performance "would gain the immediate attention from the developers"
7
EISPACK

EISPAC Control Program, Boyle
– One interface allowed access to the whole package on IBM machines
Argonne's interactive RESCUE system
– Allowed us to easily manipulate tens of thousands of lines
Generalizer/Selector
– Convert the IBM version to a general form
– Selector extracts the appropriate version
1976 Extensions to the package ready - EISPACK II
8
1970's continued

1974
– Intel 8080
– Level 1 BLAS activity started by community; Purdue 5/74
– LINPACK meeting in summer at ANL
1975
– First issue of TOMS
– Second LINPACK meeting at ANL
  » Lay the groundwork and hammer out what was and was not to be included in the package. Proposal submitted to the NSF.
1976
– Cray 1 - model for vector computing
  » 4 Mflop/s in '79; 12 Mflop/s in '83; 27 Mflop/s in '89
– LINPACK work during summer at ANL
– EISPACK Second Edition of Users' Guide
1977
– DEC VAX 11/780; super mini
  » .14 Mflop/s; 4.3 GB virtual memory
– LINPACK test software developed & sent
– EISPACK second release
– IEEE Arithmetic standard meetings
  » Paper by Palmer on Intel std for fl pt
1978
– Fortran 77
– LINPACK software released
  » Sent to NESC and IMSL for distribution
1979
– John Cocke designs 801
– ICL DAP delivered to QMC, London
– Level 1 BLAS published/released
– LINPACK Users' Guide
  » Appendix: 17 machines, PDP-10 to Cray-1
9
Basic Linear Algebra Subprograms

BLAS, 1973-1977. Consensus on:
– Names
– Calling sequences
– Functional descriptions
– Low-level linear algebra operations

Success results from
– Extensive public involvement
– Careful consideration of implications

A design tool for software in numerical linear algebra
Improve readability and aid documentation
Aid modularity and maintenance, and improve robustness of software calling the BLAS
Improve portability, without sacrificing efficiency, through standardization
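The calling sequences the community agreed on survive today; the Level 1 routine DAXPY(N, DA, DX, INCX, DY, INCY), for example, computes y ← αx + y over strided vectors. A minimal Python transliteration, for illustration only (the name axpy, list-based vectors, and the positive-stride assumption are ours, not part of the standard):

```python
def axpy(n, alpha, x, incx, y, incy):
    """y <- alpha*x + y over n elements, walking x and y with
    strides incx and incy, mirroring the DAXPY calling sequence.
    Assumes positive strides (the real BLAS also handles negative)."""
    ix, iy = 0, 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y

# axpy(3, 2.0, [1.0, 2.0, 3.0], 1, [1.0, 1.0, 1.0], 1) -> [3.0, 5.0, 7.0]
```

The stride arguments let one routine traverse rows, columns, or diagonals of a column-major matrix, one reason the interface generalized so well.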
10
CRAY and VAX
VAXination of groups and departments
Cray’s introduction of vector computing
Both had a significant impact on scientific computing
11
LINPACK

June 1974, ANL
– Jim Pool's meeting
February 1975, ANL
– Groundwork for the project
January 1976
– Funded by NSF and DOE: ANL/UNM/UM/UCSD
Summer 1976 - Summer 1977
Fall 1977
– Software to 26 test sites
December 1978
– Software released, NESC and IMSL

LINPACK

Research into the mechanics of software production.
Provide a yardstick against which future software would be measured.
Produce a library used by people and by those who wished to modify/extend the software to handle special problems.
Hope it would be used in the classroom.
Machine independent and efficient
– No mixed-mode arithmetic
[Cover art: the LINPACK staircase logo - the letters L I N P A C K repeated, dropping the leading letter on each row]
13
LINPACK Efficiency

Condition estimator
Inner loops via BLAS
Column access
Unrolling loops (done for the BLAS)
BABE algorithm for tridiagonal matrices
TAMPR system allowed easy generation of versions
LINPACK Benchmark
– Today reports machines from the Cray T90 to the Palm Pilot
  » (1.0 Gflop/s to 1.4 Kflop/s)
for i = 1:4:n
  y(i)   = y(i)   + a*x(i)
  y(i+1) = y(i+1) + a*x(i+1)
  y(i+2) = y(i+2) + a*x(i+2)
  y(i+3) = y(i+3) + a*x(i+3)
end
14
1980's Vector Computing (Parallel Processing)

1980
– Total computers in use in the US exceeds 1M
– CDC introduces Cyber 205
1981
– IBM introduces the PC
  » Intel 8088/DOS
– BBN Butterfly delivered
– FPS delivers FPS-164
  » Start of mini-supercomputer market
– SGI founded by Jim Clark and others
– Loop unrolling at outer level for data locality and parallelism
  » Amounts to matrix-vector operations
– Cuppen's method for symmetric eigenvalue D&C published
  » Talk at Oxford Gatlinburg (1980)
1982
– Illiac IV decommissioned
– Steve Chen's group at Cray produces X-MP
– First Denelcor HEP installed (.21 Mflop/s)
– Sun Microsystems, Convex and Alliant founded
1983
– Total computers in use in the US exceed 10M
– DARPA starts Strategic Computing Initiative
  » Helps fund Thinking Machines, BBN, WARP
– Cosmic Cube hypercube running at Caltech
  » John Palmer, after seeing the Caltech machine, leaves Intel to found Ncube
– Encore, Sequent, TMC, SCS, Myrias founded
– Cray 2 introduced
– NEC SX-1 and SX-2, Fujitsu ships VP-200
– ETA Systems spun off from CDC
– Golub & Van Loan published
15
1980's continued

1984
– NSFNET; 5000 computers; 56 Kb/s lines
– MathWorks founded
– EISPACK third release
– Netlib begins 1/3/84
– Level 2 BLAS activity started
  » Gatlinburg (Waterloo), Purdue, SIAM
– Intel Scientific Computers started by J. Rattner
  » Produce commercial hypercube
– Cray X-MP 1 processor, 21 Mflop/s Linpack
– Multiflow founded by J. Fisher; VLIW architecture
– Apple introduces Mac & IBM introduces PC AT
– IJK paper
1985
– IEEE Standard 754 for floating point
– IBM delivers 3090 vector; 16 Mflop/s Linpack, 138 Peak
– TMC demos CM-1 to DARPA
– Intel produces first iPSC/1 Hypercube
  » 80286 connected via Ethernet controllers
– Fujitsu VP-400; NEC SX-2; Cray 2; Convex C1
– Ncube/10; .1 Mflop/s Linpack 1 processor
– FPS-264; 5.9 Mflop/s Linpack, 38 Peak
– IBM begins RP3 project
– Stellar (Poduska), Ardent (Michels), and Supertek Computers founded
– Denelcor closes doors
16
1980's (Lost Decade for Parallel Software)

1986
– # of computers in US exceeds 30M
– TMC ships CM-1; 64K 1-bit processors
– Cray X-MP
– IBM and MIPS release first RISC WS
1987
– ETA Systems family of supercomputers
– Sun Microsystems introduces its first RISC WS
– IBM invests in Steve Chen's SSI
– Cray Y-MP
– First NA-DIGEST
– Level 3 BLAS work begun
– LAPACK: Prospectus for Development of a LA Library for HPC
1988
– AMT delivers first re-engineered DAP
– Intel produces iPSC/2
– Stellar and Ardent begin delivering single-user graphics workstations
– Level 2 BLAS paper published
1989
– # of computers in the US > 50M
– Stellar and Ardent merge, forming Stardent
– S. Cray leaves Cray Research to form Cray Computer
– Ncube 2nd generation machine
– ETA out of business
– Intel 80486 and i860; 1M transistors
  » i860 RISC & 64-bit floating point
17
EISPACK 3 and BLAS 2 & 3

Machine independence for EISPACK
Reduce the possibility of overflow and underflow.
Mods to the Algol from S. Hammarling
Rewrite reductions to tridiagonal form to involve sequential access to memory
"Official" double precision version
Inverse iteration routines modified to reduce the size for reorthogonalization.

BLAS (Level 1) vector operations provide for too much data movement.
Community effort to define extensions
– Matrix-vector ops
– Matrix-matrix ops
18
Netlib - Mathematical Software and Data

Began in 1985
– JD and Eric Grosse, AT&T Bell Labs
Motivated by the need for cost-effective, timely distribution of high-quality mathematical software to the community.
Designed to send, by return electronic mail, requested items.
Automatic mechanism for electronic dissemination of freely available software.
– Still in use and growing
– Mirrored at 9 sites around the world
Moderated collection / distributed maintenance

NA-DIGEST and NA-Net
– Gene Golub, Mark Kent and Cleve Moler
19
Netlib Growth

[Chart: number of requests per year, 1985-1997, growing from near zero to roughly 10 million, broken out by access method: ftp, xnetlib, gopher, http]

Just over 6,000 requests in 1985
Over 29,000,000 requests total
Over 9 million hits in 1997; 5.4 million so far in 1998
LAPACK best seller: 1.6M hits
20
1990's

1990
– Internet; World Wide Web
– Motorola introduces 68040
– NEC ships SX-3; first Japanese parallel vector supercomputer
– IBM announces RS/6000 family
  » Has FMA instruction
– Intel hypercube based on 860 chip
  » 128 processors
– Alliant delivers FX/2800 based on i860
– Fujitsu VP-2600
– PVM project started
– Level 3 BLAS published
1991
– Stardent to sell business and close
– Cray C-90
– Kendall Square Research delivers 32-processor KSR-1
– TMC produces CM-200 and announces CM-5 MIMD computer
– DEC announces the Alpha
– TMC produces the first CM-5
– Fortran 90
– Workshop to consider a Message Passing Standard, beginnings of MPI
  » Community effort
– Xnetlib running
1992
– LAPACK software released & Users' Guide published
21
Architectures

[Chart: number of systems (# Systems, 0 to 400) by architecture type: MPP, PVP, SMP]
22
Parallel Processing Comes of Age

"There are three rules for programming parallel computers. We just don't know what they are yet." -- Gary Montry

"Embarrassingly Parallel" - Cleve Moler
"Humiliatingly Parallel"
23
Memory Hierarchy and LAPACK

ijk-implementations
The loop order affects the order in which data are referenced; some orderings are better at keeping data in the higher levels of the memory hierarchy.
Applies to matrix multiply and reductions to condensed form
– May do slightly more flops
– Up to 3 times faster
for _ = 1:n
  for _ = 1:n
    for _ = 1:n
      a(i,j) = a(i,j) + b(i,k)*c(k,j)
    end
  end
end

(the three loops range over i, j, and k in any order)
24
BLAS                 Memory Refs   Flops   Flops/Memory Refs
Level 1: y = y+ax    3n            2n      2/3
Level 2: y = y+Ax    n^2           2n^2    2
Level 3: C = C+AB    4n^2          2n^3    n/2
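These ratios follow from simple counting: an axpy reads two vectors and writes one (3n references) for 2n flops, a matrix-vector product is dominated by reading A (about n^2 references) for 2n^2 flops, and a matrix-matrix multiply touches about 4n^2 matrix entries while doing 2n^3 flops. A quick sanity check (the function name is ours, for illustration):

```python
def flops_per_memref(n):
    """Flops per memory reference for the three BLAS levels,
    using the standard counts for y=y+a*x, y=y+A*x, C=C+A*B."""
    return {
        "level1": (2 * n) / (3 * n),        # 2/3, independent of n
        "level2": (2 * n**2) / (n**2),      # 2, independent of n
        "level3": (2 * n**3) / (4 * n**2),  # n/2, grows with n
    }
```

Only the Level 3 ratio grows with n, which is why matrix-multiply-rich algorithms can approach peak speed on machines with deep memory hierarchies.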
Why Higher Level BLAS?

Can only do arithmetic on data at the top of the hierarchy.
Higher level BLAS let us do this.
Development of blocked algorithms is important for performance.
[Chart: Mflop/s vs. order of vectors/matrices (10 to 500) for Level 1, Level 2, and Level 3 BLAS on an IBM RS/6000-590 (66 MHz, 264 Mflop/s peak)]
25
History of Block Partitioned Algorithms

The ideas are not new. Early algorithms made use of a small main memory with tapes as secondary storage.
Recent work centers on the use of vector registers, Level 1 and 2 cache, main memory, and "out of core" memory.
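The block-partitioned idea can be sketched with a blocked matrix multiply: carve the matrices into nb-by-nb blocks so each block sub-problem can stay in fast memory while it is reused. A sketch in Python with plain lists (the structure and names are ours, not LAPACK's; a tuned implementation would call the BLAS on each block):

```python
def blocked_matmul(A, B, nb):
    """C = A*B for square matrices (lists of lists), computed block
    by block with block size nb, so each nb-by-nb sub-problem can
    remain in fast memory. A sketch of the idea, not tuned code."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # C[ii:ii+nb, jj:jj+nb] += A[ii:ii+nb, kk:kk+nb] * B[kk:kk+nb, jj:jj+nb]
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + nb, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

In LAPACK the same partitioning is applied to factorizations, so the bulk of the work lands in Level 3 BLAS calls on the blocks.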
26
LAPACK

Linear Algebra library in Fortran 77
– Solution of systems of equations
– Solution of eigenvalue problems
Combines algorithms from LINPACK and EISPACK into a single package
Block algorithms
– Efficient on a wide range of computers
  » RISC, Vector, SMPs
User interface similar to LINPACK
– Single, Double, Complex, Double Complex
Built on the Level 1, 2, and 3 BLAS
HP-48G to CRAY T-90
Cray Y-MP, Cholesky factorization, n=500 (Mflop/s)

                           1 Proc   8 Procs
j-variant, LINPACK           72        72
j-variant, Level 2 BLAS     251       378
j-variant, Level 3 BLAS     287      1225
k-variant, Level 3 BLAS     290      1414
27

Rate = ops / Time, where ops = (2/3)n^3 + 2n^2
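The LINPACK benchmark rate is this nominal operation count for a dense solve of order n divided by the measured time, regardless of how many operations a particular implementation actually performs. A small sketch (the function name is ours, for illustration):

```python
def linpack_rate(n, seconds):
    """LINPACK benchmark rate in Mflop/s for a dense solve of
    order n: nominal operation count (2/3)n^3 + 2n^2 divided by
    the elapsed time in seconds."""
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds / 1e6
```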
28
1990's continued

1993
– Intel Pentium systems start to ship
– ScaLAPACK prototype software released
  » First portable library for distributed memory machines
  » Intel, TMC and workstations using PVM
– PVM 3.0 available
1994
– MPI-1 finished
1995
– Templates project
1996
– Internet; 34M users
– Nintendo 64
  » More computing power than a Cray 1 and much, much better graphics
1997
– MPI-2 finished
– Fortran 95
1998
– Issues of parallel and numerical stability
– Divide time
– DSM architectures
– "New" algorithms
  » Chaotic iteration
  » Sparse LU w/o pivoting
  » Pipelined HQR
  » Graph partitioning
  » Algorithmic bombardment
29
Templates Project

Iterative methods for large sparse systems
– Communicate "state of the art" algorithms to the HPC community
– Subtle algorithmic issues addressed, e.g. convergence, preconditioners, data structures
– Performance and parallelism considerations
– Gave computational scientists algorithms in the form they wanted.
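The Templates book presented each method in a compact, language-neutral form the reader could transliterate. A Python rendering of the unpreconditioned conjugate gradient template for a symmetric positive definite system Ax = b (the names and the matvec-callback interface are our choices, for illustration):

```python
def cg(matvec, b, x0, tol=1e-10, maxit=1000):
    """Conjugate gradient for A x = b, A symmetric positive
    definite; matvec(v) returns A*v. Plain-list vectors."""
    n = len(b)
    x = list(x0)
    r = [b[i] - ax for i, ax in enumerate(matvec(x))]  # residual
    p = list(r)                                        # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(maxit):
        if rs**0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

The caller supplies only a matrix-vector product, so the same template works for any sparse storage format - one of the "data structures" points the project addressed.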
30
ScaLAPACK

Library of software for dense & banded problems
  » Sparse direct solvers being developed
Distributed memory - message passing
– PVM and MPI
MIMD computers, networks of workstations, and clumps of SMPs
SPMD Fortran 77 with object-based design
Built on various modules
– PBLAS (BLACS and BLAS)
  » PVM, MPI, IBM SP, CRI T3, Intel, TMC
  » Provides the right level of notation.
31
High-Performance Computing Directions

Move toward shared memory
– SMPs and Distributed Shared Memory
– Shared address space w/ deep memory hierarchy
Clustering of shared memory machines for scalability
– Emergence of PC commodity systems
  » Pentium based, NT or Linux driven
  » At UTK, a cluster of 14 (dual) Pentium-based nodes: 7.2 Gflop/s
Efficiency of message passing and data parallel programming
– Helped by standards efforts such as PVM, MPI and HPF
Complementing "Supercomputing" with Metacomputing: the Computational Grid
32
Heterogeneous Computing

Heterogeneity introduces new bugs in parallel code.
Slightly different floating point can make data-dependent branches go different ways when we expect identical behavior.
A "correct" algorithm on a network of identical workstations may fail if a slightly different machine is introduced.
Some bugs are easy to fix (compare s < tol on one processor and broadcast the result).
Some are hard to fix (handling denorms; getting the same answer independent of the number of processors).
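The branch-divergence failure mode can be reproduced on one machine by summing the same partial results in two different orders, as two processors in a heterogeneous network might. A toy sketch (the values and variable names are contrived for illustration):

```python
# Two "processors" reduce the same three partial results, but in
# different orders. In double precision the 1.0 is absorbed by 1e16
# in one association but survives the other, so a data-dependent
# branch (s < tol) goes different ways on identical input.
vals = [1e16, 1.0, -1e16]

s_proc0 = (vals[0] + vals[1]) + vals[2]  # 1.0 absorbed: sum is 0.0
s_proc1 = (vals[0] + vals[2]) + vals[1]  # cancellation first: sum is 1.0

tol = 0.5
branch0 = s_proc0 < tol
branch1 = s_proc1 < tol  # the two processors disagree
```

The easy fix from the slide: evaluate s < tol on one processor and broadcast the boolean, so every processor takes the same branch.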
33
Java - For Numerical Computations?

Java likely to be a dominant language.
Provides for machine-independent code.
A C++-like language.
No pointers, gotos, overloading of arithmetic ops, or explicit memory deallocation.
Portability achieved via an abstract machine.
Java is a convenient user-interface builder which allows one to quickly develop customized interfaces.
34
Network Enabled Servers
Allow networked resources to be integrated into the desktop.
Many hosts, co-existing in a loose confederation tied together with high-speed links.
Users have the illusion of a very powerful computer on the desk.
Locate and “deliver” software or solutions to the user in a directly usable and “conventional” form.
Part of the motivation: software maintenance.
35
Future: Petaflops (10^15 fl pt ops/s)

A Pflop for 1 second ≈ a typical workstation computing for 1 year.
From an algorithmic standpoint
– concurrency
– data locality
– latency & sync
– floating point accuracy
– dynamic redistribution of workload
– new languages and constructs
– role of numerical libraries
– algorithm adaptation to hardware failure
May be feasible and "affordable" by the year 2010
Today, 10^15 flops amounts to about a year of computing on our workstations.
36
Summary

As a community we have a lot to be proud of in terms of the algorithms and software we have produced.
– generality, elegance, speed, or economy of storage
Software is still being used, in many cases 30 years after it was written.
37