Source: icl.cs.utk.edu/20/presentations/Ed_Anderson.pdf

  • Slide 1

    20 Years of the LINPACK Benchmark

    Edward Anderson, Lockheed Martin (contractor to U.S. EPA)

    ICL 1989-1991

  • Slide 2

    Argonne MCS, 1988

  • Slides 3–7 (no extracted text)

  • Slide 8

    Internals of SGEMM on CRAY T3D

    Basic operation: 16-by-64 matrix times 64-by-n matrix

    [Figure: A (16×64) × B (64×n) = C (16×n)]

    Innermost kernel: 16×64 matrix-vector multiply, 16-element result.
    The 16-by-64 block of A completely filled the 8 KB Dcache.
    Elements of B were loaded from memory.
    The result vector is held in registers until combined into C.
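The blocking scheme above can be sketched in Python/NumPy (a stand-in for the original Cray Fortran/assembly; the function name and loop structure are illustrative, not the actual kernel): a 16×64 panel of A stays resident, each column of B is streamed, and the 16-element result accumulates before being stored into C.

```python
import numpy as np

def sgemm_t3d_style(A, B):
    """Sketch of the T3D blocking: C (16 x n) = A (16 x 64) @ B (64 x n).

    The 16x64 panel of A stays resident (it exactly filled the 8 KB
    Dcache: 16 * 64 words * 8 bytes = 8192 bytes); each column of B is
    streamed from memory, and the 16-element result vector accumulates
    in "registers" (here: a temporary) before being combined into C.
    """
    m, k = A.shape          # 16, 64
    n = B.shape[1]
    C = np.zeros((m, n), dtype=A.dtype)
    for j in range(n):      # one 16x64 matrix-vector multiply per column
        c = np.zeros(m, dtype=A.dtype)   # 16-element result vector
        for l in range(k):
            c += A[:, l] * B[l, j]       # streamed element of B
        C[:, j] = c
    return C

A = np.arange(16 * 64, dtype=np.float64).reshape(16, 64)
B = np.ones((64, 5))
assert np.allclose(sgemm_t3d_style(A, B), A @ B)
```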

  • Slide 9

    Cray’s first 100 Gflop/s result

  • Slide 10

    Internals of SGEMM on CRAY T3E

    [Figure: A (nb×nb) × B (nb×n) = C (nb×n)]

    96 KB Scache (12 KW), 3-way associative.
    Random replacement policy limited usable cache to 4096 W = 64x64.
    Issue rate of 2 FP ops/CP could only manage a 12-element C vector.
    For LINPACK, chose nb = 48.

  • Slide 11

    2-D block cyclic data distribution

    Example on a 2x3 processor grid (processors 0–5 numbered row-wise;
    the pattern tiles the block rows and columns of the matrix):

        0 1 2 0 1 2 ...
        3 4 5 3 4 5 ...
        0 1 2 0 1 2 ...
        3 4 5 3 4 5 ...
        ...
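The block-to-processor mapping can be written down in a few lines. This is a sketch (the function name is mine), consistent with the deck's example in which processor 0 owns block rows 1, 3, 5, 7 and block columns 1, 4, 7 (1-based):

```python
def owner(ib, jb, p=2, q=3):
    """Processor owning global block (ib, jb) (0-based) under a 2-D
    block cyclic distribution on a p x q grid, with processors numbered
    row-wise (0 1 2 / 3 4 5 on a 2x3 grid)."""
    return (ib % p) * q + (jb % q)

# Processor 0 owns block rows 0, 2, 4, ... and block columns 0, 3, 6, ...
# (rows 1, 3, 5, 7 and columns 1, 4, 7 in the slide's 1-based labels).
assert owner(0, 0) == 0 and owner(2, 3) == 0
assert owner(1, 1) == 4
```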

  • Slide 12

    Local view: ScaLAPACK/HPL
    Blocks stored by columns, on processor 0:

    Advantage: With the abstraction of BLAS and LAPACK routines, we can maintain much of the LAPACK design.

    Disadvantage: As the matrix size gets large, distribution blocks get spread out through memory.

        A1,1  A1,4  A1,7
        A3,1  A3,4  A3,7
        A5,1  A5,4  A5,7
        A7,1  A7,4  A7,7

    A(mb*nrblks, nb*ncblks)

  • Slide 13

    Local view: Cray LBM code
    Blocks stored by rows, processor 0:

        A1,1  A1,4  A1,7    ->  A(mb,nb,ncblks,1)
        A3,1  A3,4  A3,7    ->  A(mb,nb,ncblks,2)
        A5,1  A5,4  A5,7    ->  A(mb,nb,ncblks,3)
        A7,1  A7,4  A7,7    ->  A(mb,nb,ncblks,4)

    A(mb,nb,ncblks,nrblks)

    Advantage: Distribution blocks and block rows are contiguous.
    Disadvantage: More indexing with the 4-D array.
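The contiguity difference between the two local layouts can be demonstrated with NumPy in Fortran order (a toy sketch; the sizes and variable names are illustrative, not from the original code):

```python
import numpy as np

mb, nb, ncblks, nrblks = 4, 4, 3, 4   # toy local block sizes and counts

# ScaLAPACK/HPL-style local storage: one 2-D array.  A distribution
# block is a strided submatrix (its columns sit mb*nrblks apart).
A2d = np.zeros((mb * nrblks, nb * ncblks), order='F')
block_2d = A2d[1 * mb:2 * mb, 2 * nb:3 * nb]      # block (1,2): strided view

# Cray LBM-style 4-D storage: each distribution block is contiguous,
# and so is each local block row.
A4d = np.zeros((mb, nb, ncblks, nrblks), order='F')
block_4d = A4d[:, :, 2, 1]                        # block (1,2): contiguous

assert block_4d.flags['F_CONTIGUOUS']
assert not block_2d.flags['F_CONTIGUOUS']
```

The extra trailing dimensions cost more index arithmetic, which is the trade-off the slide names.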

  • Slide 14

    Main computation kernel
    After communication of the column and row blocks, each processor
    performs a matrix-matrix multiplication of the form:

        C = C - A x B

    In ScaLAPACK/HPL, one call to SGEMM is required for this operation.
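In BLAS terms this update is a single SGEMM call with alpha = -1 and beta = 1, i.e. SGEMM('N', 'N', m, n, k, -1.0, A, lda, B, ldb, 1.0, C, ldc). A minimal NumPy stand-in:

```python
import numpy as np

# Trailing update C <- C - A @ B as one GEMM-style operation.
rng = np.random.default_rng(0)
m, n, k = 6, 5, 4
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))

C_ref = C - A @ B
C -= A @ B          # one "SGEMM" for the whole local update
assert np.allclose(C, C_ref)
```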

  • Slide 15

    Optimizing data layout for SGEMM
    With row-wise storage of blocks, one call to SGEMM is needed to
    update each local block row:

        C_1 = C_1 - A_1 x B
        C_2 = C_2 - A_2 x B
        C_3 = C_3 - A_3 x B   (B is the same block in each call)

  • Slide 16

    SGI’s first Tflop/s result

    This same code was used on ASCI Blue Mountain (Nov. 1998):

    By the next Top 500 list in June 1999, LANL was #2.

  • Slide 17

    EPA – RTP, North Carolina

  • Slide 18

    EPA vs. Top 500 (circa 2001)

    [Chart: performance in Gflop/s (log scale, 0.1 to 10000) at 6-month
    intervals from 1993/06 to 2001/06, with series for the #1, #100, and
    #500 Top 500 systems and for EPA's machines.]

  • Slide 19 (no extracted text)

  • Slide 20

    CRAY X1 MSP

  • Slide 21

    Internals of SGEMM on CRAY X1

    [Figure: A (nb×nb) × B (nb×n) = C (nb×n), with the work split
    across SSP0–SSP3]

    2 MB Ecache (shared by the SSPs) allowed nb up to 256 or 512.
    Each SSP computed four 64 x nb matrix-vector multiplies concurrently.
    For LINPACK, nb was approx. 256.
    Achieved 90%+ of peak for all processor counts.

  • Slide 22

    When a p×q grid isn’t enough

    Shortcomings of the existing algorithm:
    • Many processors are idle during column factorization.
    • Not every processor count factors neatly into N = p×q.

    Example:
        124 = 4 x 31
        123 = 3 x 41
        122 = 2 x 61
        121 = 11 x 11

    • On some systems it is better to leave one or two processors idle
      (Np – 1 may not factor neatly).
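The awkwardness of nearby processor counts is easy to see by computing the most nearly square p×q factorization for each (a small sketch; the function name is mine):

```python
def best_grid(N):
    """Most nearly square p x q factorization with p * q == N, p <= q."""
    p = max(d for d in range(1, int(N**0.5) + 1) if N % d == 0)
    return p, N // p

# The slide's examples: neighboring counts factor very differently.
assert best_grid(124) == (4, 31)
assert best_grid(123) == (3, 41)
assert best_grid(122) == (2, 61)
assert best_grid(121) == (11, 11)
```

Dropping from 121 to 120 processors, by contrast, allows a 10 x 12 grid, which is the "leave one or two idle" option the slide mentions.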

  • Slide 23

    The Virtual Processor Grid

    Generalize the 2-D grid factorization to p×q = k×Np, where k ≥ 1,
    p ≤ Np, and lcm(p, Np) = k×Np.

    Example: 6 processors in a 4×3 virtual grid:

        0 1 2
        3 4 5
        0 1 2
        3 4 5

    This becomes the tiling pattern for the distributed 2-D matrix.
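A small sketch of the rule above, assuming row-wise assignment of physical processors to virtual grid positions (consistent with the 4×3 example, in which each of the 6 processors appears k = 2 times; the function name is mine):

```python
from math import lcm

def virtual_grid(Np, p):
    """Virtual p x q grid for Np processors, following the slide's rule:
    p*q = k*Np with k >= 1, p <= Np, and lcm(p, Np) = k*Np."""
    k = lcm(p, Np) // Np            # how many times each processor appears
    q = k * Np // p                 # virtual grid width
    # assign processors row-wise; each appears k times in the grid
    return [[(r * q + c) % Np for c in range(q)] for r in range(p)]

grid = virtual_grid(Np=6, p=4)      # lcm(4, 6) = 12 = 2 * 6, so k = 2, q = 3
assert grid == [[0, 1, 2], [3, 4, 5], [0, 1, 2], [3, 4, 5]]
```

With k = 1 this reduces to the ordinary p×q grid, so the virtual grid is a strict generalization.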

  • Slide 24

    Data structure for VPG
    Recall the 4-D array used for row-wise storage of blocks:

        A( mb, nb, ncblks, nrblks )

    Now add another dimension for the virtual processor index:

        A( mb, nb, ncblks, nrblks, nvpi )

    Maintains contiguousness of distribution blocks and row blocks.
    Need to add a loop over the virtual processor indices.
    No extra storage except for some extra buffers for each virtual processor.

  • Slide 25 (no extracted text)