Source: icl.cs.utk.edu/20/presentations/Ed_Anderson.pdf

  • Slide 1

    20 Years of the LINPACK Benchmark

    Edward Anderson, Lockheed Martin (contractor to U.S. EPA)

    ICL 1989-1991

  • Slide 2

    Argonne MCS, 1988

  • Slides 3–7 (no extracted text)

  • Slide 8

    Internals of SGEMM on CRAY T3D

    Basic operation: 16-by-64 matrix times 64-by-n matrix

    [Figure: A (16×64) × B (64×n) = C (16×n)]

    Innermost kernel: 16×64 matrix-vector multiply, 16-element result.
    The 16-by-64 block of A completely filled the 8 KB Dcache.
    Elements of B were loaded from memory.
    The result vector is held in registers until combined into C.
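The blocking scheme above can be sketched in Python/NumPy (a stand-in for the original Cray Fortran/assembly; the function name and loop structure are illustrative, not the actual kernel): a 16×64 panel of A stays resident, each column of B is streamed, and the 16-element result accumulates before being stored into C.

```python
import numpy as np

def sgemm_t3d_style(A, B):
    """Sketch of the T3D blocking: C (16 x n) = A (16 x 64) @ B (64 x n).

    The 16x64 panel of A stays resident (it exactly filled the 8 KB
    Dcache: 16 * 64 words * 8 bytes = 8192 bytes); each column of B is
    streamed from memory, and the 16-element result vector accumulates
    in "registers" (here: a temporary) before being combined into C.
    """
    m, k = A.shape          # 16, 64
    n = B.shape[1]
    C = np.zeros((m, n), dtype=A.dtype)
    for j in range(n):      # one 16x64 matrix-vector multiply per column
        c = np.zeros(m, dtype=A.dtype)   # 16-element result vector
        for l in range(k):
            c += A[:, l] * B[l, j]       # streamed element of B
        C[:, j] = c
    return C

A = np.arange(16 * 64, dtype=np.float64).reshape(16, 64)
B = np.ones((64, 5))
assert np.allclose(sgemm_t3d_style(A, B), A @ B)
```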

  • Slide 9

    Cray’s first 100 Gflop/s result

  • Slide 10

    Internals of SGEMM on CRAY T3E

    [Figure: A (nb×nb) × B (nb×n) = C (nb×n)]

    96 KB Scache (12 KW), 3-way associative.
    Random replacement policy limited usable cache to 4096 W = 64x64.
    Issue rate of 2 FP ops/CP could only manage a 12-element C vector.
    For LINPACK, chose nb = 48.

  • Slide 11

    2-D block cyclic data distribution

    Example on a 2x3 processor grid (processors 0–5 numbered row-wise;
    the pattern tiles the block rows and columns of the matrix):

        0 1 2 0 1 2 ...
        3 4 5 3 4 5 ...
        0 1 2 0 1 2 ...
        3 4 5 3 4 5 ...
        ...
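The block-to-processor mapping can be written down in a few lines. This is a sketch (the function name is mine), consistent with the deck's example in which processor 0 owns block rows 1, 3, 5, 7 and block columns 1, 4, 7 (1-based):

```python
def owner(ib, jb, p=2, q=3):
    """Processor owning global block (ib, jb) (0-based) under a 2-D
    block cyclic distribution on a p x q grid, with processors numbered
    row-wise (0 1 2 / 3 4 5 on a 2x3 grid)."""
    return (ib % p) * q + (jb % q)

# Processor 0 owns block rows 0, 2, 4, ... and block columns 0, 3, 6, ...
# (rows 1, 3, 5, 7 and columns 1, 4, 7 in the slide's 1-based labels).
assert owner(0, 0) == 0 and owner(2, 3) == 0
assert owner(1, 1) == 4
```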

  • Slide 12

    Local view: ScaLAPACK/HPL
    Blocks stored by columns, on processor 0:

    Advantage: With the abstraction of BLAS and LAPACK routines, we can maintain much of the LAPACK design.

    Disadvantage: As the matrix size gets large, distribution blocks get spread out through memory.

        A1,1  A1,4  A1,7
        A3,1  A3,4  A3,7
        A5,1  A5,4  A5,7
        A7,1  A7,4  A7,7

    A(mb*nrblks, nb*ncblks)

  • Slide 13

    Local view: Cray LBM code
    Blocks stored by rows, processor 0:

        A1,1  A1,4  A1,7    ->  A(mb,nb,ncblks,1)
        A3,1  A3,4  A3,7    ->  A(mb,nb,ncblks,2)
        A5,1  A5,4  A5,7    ->  A(mb,nb,ncblks,3)
        A7,1  A7,4  A7,7    ->  A(mb,nb,ncblks,4)

    A(mb,nb,ncblks,nrblks)

    Advantage: Distribution blocks and block rows are contiguous.
    Disadvantage: More indexing with the 4-D array.
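The contiguity difference between the two local layouts can be demonstrated with NumPy in Fortran order (a toy sketch; the sizes and variable names are illustrative, not from the original code):

```python
import numpy as np

mb, nb, ncblks, nrblks = 4, 4, 3, 4   # toy local block sizes and counts

# ScaLAPACK/HPL-style local storage: one 2-D array.  A distribution
# block is a strided submatrix (its columns sit mb*nrblks apart).
A2d = np.zeros((mb * nrblks, nb * ncblks), order='F')
block_2d = A2d[1 * mb:2 * mb, 2 * nb:3 * nb]      # block (1,2): strided view

# Cray LBM-style 4-D storage: each distribution block is contiguous,
# and so is each local block row.
A4d = np.zeros((mb, nb, ncblks, nrblks), order='F')
block_4d = A4d[:, :, 2, 1]                        # block (1,2): contiguous

assert block_4d.flags['F_CONTIGUOUS']
assert not block_2d.flags['F_CONTIGUOUS']
```

The extra trailing dimensions cost more index arithmetic, which is the trade-off the slide names.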

  • Slide 14

    Main computation kernel
    After communication of the column and row blocks, each processor
    performs a matrix-matrix multiplication of the form:

        C = C - A x B

    In ScaLAPACK/HPL, one call to SGEMM is required for this operation.
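In BLAS terms this update is a single SGEMM call with alpha = -1 and beta = 1, i.e. SGEMM('N', 'N', m, n, k, -1.0, A, lda, B, ldb, 1.0, C, ldc). A minimal NumPy stand-in:

```python
import numpy as np

# Trailing update C <- C - A @ B as one GEMM-style operation.
rng = np.random.default_rng(0)
m, n, k = 6, 5, 4
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))

C_ref = C - A @ B
C -= A @ B          # one "SGEMM" for the whole local update
assert np.allclose(C, C_ref)
```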

  • Slide 15

    Optimizing data layout for SGEMM
    With row-wise storage of blocks, one call to SGEMM is needed to
    update each local block row:

        C_1 = C_1 - A_1 x B
        C_2 = C_2 - A_2 x B
        C_3 = C_3 - A_3 x B   (B is the same block in each call)

  • Slide 16

    SGI’s first Tflop/s result

    This same code was used on ASCI Blue Mountain (Nov. 1998):

    By the next Top 500 list in June 1999, LANL was #2.

  • Slide 17

    EPA – RTP, North Carolina

  • Slide 18

    EPA vs. Top 500 (circa 2001)

    [Chart: performance in Gflop/s (log scale, 0.1 to 10000) at 6-month
    intervals from 1993/06 to 2001/06, with series for the #1, #100, and
    #500 Top 500 systems and for EPA's machines.]

  • Slide 19 (no extracted text)

  • Slide 20

    CRAY X1 MSP

  • Slide 21

    Internals of SGEMM on CRAY X1

    [Figure: A (nb×nb) × B (nb×n) = C (nb×n), with the work split
    across SSP0–SSP3]

    2 MB Ecache (shared by the SSPs) allowed nb up to 256 or 512.
    Each SSP computed four 64 x nb matrix-vector multiplies concurrently.
    For LINPACK, nb was approx. 256.
    Achieved 90%+ of peak for all processor counts.

  • Slide 22

    When a p×q grid isn’t enough

    Shortcomings of the existing algorithm:
    • Many processors are idle during column factorization.
    • Not every processor count factors neatly into N = p×q.

    Example:
        124 = 4 x 31
        123 = 3 x 41
        122 = 2 x 61
        121 = 11 x 11

    • On some systems it is better to leave one or two processors idle
      (Np – 1 may not factor neatly).
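The awkwardness of nearby processor counts is easy to see by computing the most nearly square p×q factorization for each (a small sketch; the function name is mine):

```python
def best_grid(N):
    """Most nearly square p x q factorization with p * q == N, p <= q."""
    p = max(d for d in range(1, int(N**0.5) + 1) if N % d == 0)
    return p, N // p

# The slide's examples: neighboring counts factor very differently.
assert best_grid(124) == (4, 31)
assert best_grid(123) == (3, 41)
assert best_grid(122) == (2, 61)
assert best_grid(121) == (11, 11)
```

Dropping from 121 to 120 processors, by contrast, allows a 10 x 12 grid, which is the "leave one or two idle" option the slide mentions.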

  • Slide 23

    The Virtual Processor Grid

    Generalize the 2-D grid factorization to p×q = k×Np, where k ≥ 1,
    p ≤ Np, and lcm(p, Np) = k×Np.

    Example: 6 processors in a 4×3 virtual grid:

        0 1 2
        3 4 5
        0 1 2
        3 4 5

    This becomes the tiling pattern for the distributed 2-D matrix.
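A small sketch of the rule above, assuming row-wise assignment of physical processors to virtual grid positions (consistent with the 4×3 example, in which each of the 6 processors appears k = 2 times; the function name is mine):

```python
from math import lcm

def virtual_grid(Np, p):
    """Virtual p x q grid for Np processors, following the slide's rule:
    p*q = k*Np with k >= 1, p <= Np, and lcm(p, Np) = k*Np."""
    k = lcm(p, Np) // Np            # how many times each processor appears
    q = k * Np // p                 # virtual grid width
    # assign processors row-wise; each appears k times in the grid
    return [[(r * q + c) % Np for c in range(q)] for r in range(p)]

grid = virtual_grid(Np=6, p=4)      # lcm(4, 6) = 12 = 2 * 6, so k = 2, q = 3
assert grid == [[0, 1, 2], [3, 4, 5], [0, 1, 2], [3, 4, 5]]
```

With k = 1 this reduces to the ordinary p×q grid, so the virtual grid is a strict generalization.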

  • Slide 24

    Data structure for VPG
    Recall the 4-D array used for row-wise storage of blocks:

        A( mb, nb, ncblks, nrblks )

    Now add another dimension for the virtual processor index:

        A( mb, nb, ncblks, nrblks, nvpi )

    Maintains contiguousness of distribution blocks and row blocks.
    Need to add a loop over the virtual processor indices.
    No extra storage except for some extra buffers for each virtual processor.

  • Slide 25 (no extracted text)