Vector Computing

  • 8/6/2019 Vector Computing

    1/28

    Eine Zeitreise in die Welt der Computer.1

    Pipelined Vector Processing and Scientific Computation

    John G. Zabolitzky


    Applications of High-Performance Computing

    Weather prediction, climatic simulation

    fluid dynamics simulation (aerodynamics for aerospace, automobile, combustion, ....)

    basic science

    cosmology

    quantum mechanical many-body problems

    chemistry

    solid-state

    quantum fluids

    high-energy physics

    cryptography

    weapons research

    energy research

    nuclear reactor simulation

    fusion research

    many many more


    Terminal State of Scalar Computing: CDC 7600, 1968

    Maximum RISC performance of 1 operation/cycle achieved

    No further improvement possible without change of paradigm

    36 MHz => 36 MIPS => 5 MFLOPS real

    The CDC 7600 (designed by Seymour Cray) was the most powerful of all computers from 1968 to 1976, when the Cray-1 achieved more than 10 times its performance.


    Pipelined Scalar Execution

    [Diagram: successive instructions staggered along the time axis, each passing through the fetch, decode, and execute stages, so that several instructions are in flight at once]

    Pipelined execution on parallel functional units


    Scalar Code Example

    DO i=1,100 a(i)=b(i)*c(i)

    load b, inc address

    load c, inc address

    multiply

    store a, inc address

    decrement count, loop?

    5 instructions = 5 cycles (optimum) for one multiply

    pipelined multiply: could start one multiply each and every cycle => only 20% efficient use

    expensive multiplier sits idle most of the time
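The 20% figure follows directly from the instruction count. A minimal Python sketch, assuming one instruction issues per cycle as in the optimum case above (the function name `multiplier_utilization` is illustrative, not from the slides):

```python
def multiplier_utilization(instructions_per_iteration: int,
                           multiplies_per_iteration: int = 1) -> float:
    """Fraction of cycles in which the pipelined multiplier starts new work.

    Assumes exactly one instruction issues per cycle (the optimum case).
    """
    return multiplies_per_iteration / instructions_per_iteration


# The loop body listed above: five instructions, one of which is a multiply.
loop_body = ["load b", "load c", "multiply", "store a", "decrement/branch"]
utilization = multiplier_utilization(len(loop_body))
print(f"multiplier busy {utilization:.0%} of the time")
```

Any extra instruction in the loop body lowers the figure further, which is why a scalar loop cannot keep the multiplier busy.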


    Architectural Alternatives

    * Pipelined Scalar (RISC) as outlined before

    * Pipelined Vector (this presentation further down)

    * SIMD (Single Instruction Multiple Data) parallel arithmetic (e.g., ILLIAC IV)

    too expensive, inefficient: larger number of lightly used multipliers

    * Superscalar = multiple issue in one cycle

    all modern single-chip CPUs (Intel to TI); keep all functions busy

    * VLIW (Very Long Instruction Word) = Variant of Superscalar

    * MIMD (Multiple Instruction Multiple Data) true parallel streams, e.g. Cray T3E, IBM Blue Gene, IBM Cell: may be superimposed on top of ANY CPU architecture


    Vector Computation

    Scientific codes spend a high percentage of their time looping over simple data structures

    DO i=1,100 a(i) = b*c(i) + d(i)

    simple logical structure ==>

    set up such that one multiply/cycle

    one instruction for entire loop

    MFLOP rate = cycle rate or multiple thereof

    specialized for scientific/engineering tasks
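To make "MFLOP rate = cycle rate or multiple thereof" concrete: once the pipeline is full, one element completes per cycle, so the rate is the clock rate times the floating-point operations done per element. A rough Python sketch under that assumption; `peak_mflops` is an illustrative name, and the 80 MHz clock is borrowed from the Cray-1 figures quoted later in these slides:

```python
def peak_mflops(clock_mhz: float, flops_per_element: int,
                elements_per_cycle: float = 1.0) -> float:
    """Peak rate of a full vector pipeline, ignoring startup time."""
    return clock_mhz * flops_per_element * elements_per_cycle


# a(i) = b(i) * c(i): one multiply per element.
print(peak_mflops(80.0, 1))
# a(i) = b*c(i) + d(i): a multiply and an add per element,
# assuming both functional units run concurrently.
print(peak_mflops(80.0, 2))
```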


    Vector Pipeline c(i)=a(i)*b(i)

    (time runs downward; entries are the element index i currently in each stage)

    fetch a(i), b(i)   multip. 1   multip. 2   multip. 3   multip. 4   store c(i)
    1
    2                  1
    3                  2           1
    4                  3           2           1
    5                  4           3           2           1
    6                  5           4           3           2           1
    7                  6           5           4           3           2
    8                  7           6           5           4           3

    Inventor: Henry Ford


    Need to vectorize: some of it is automatic, but high quality requires hand optimization

    Naive scalar code for matrix multiply

    s = 0.0

    do j = 1, n

      s = s + a(i,j)*b(j,k)

    end do

    Recursive on s => adder pipeline blocked

    vector code for matrix multiply

    do i = 1, n

      c(i,k) = c(i,k) + a(i,j)*b(j,k)

    end do

    Independent vector elements, but 1.5x bandwidth

    Frequently good idea: exchange inner/outer loop
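The loop exchange can be checked on a small case. The sketch below is plain Python, not vector code: it only shows that both loop orders give the same product, and that in the interchanged form the innermost loop updates independent elements c(i,k). Function names are illustrative, and indices are 0-based rather than 1-based as on the slide.

```python
def matmul_dot_product(a, b, n):
    """Innermost loop over j: each step needs the previous value of s."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            s = 0.0
            for j in range(n):
                s = s + a[i][j] * b[j][k]  # recursive on s
            c[i][k] = s
    return c


def matmul_interchanged(a, b, n):
    """Innermost loop over i: the n updates of c[i][k] do not depend on each other."""
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            bjk = b[j][k]  # scalar operand, constant during the inner loop
            for i in range(n):
                c[i][k] = c[i][k] + a[i][j] * bjk
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_dot_product(a, b, 2))
print(matmul_interchanged(a, b, 2))
```

The interchanged inner loop reads c(i,k) and a(i,j) and writes c(i,k), three memory streams instead of the two loads of the dot-product form, which is the extra bandwidth noted above.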


    First Vector Computers

    Control Data Corporation (CDC) STAR-100 [STring ARray 100 MFLOPS]

    memory-to-memory architecture

    therefore long startup times (~n00 cycles)

    very slow scalar unit (~2 MFLOPS)

    overall disappointing performance

    contracted 1967, announced 1972, delivered 1974

    total of 4 machines, 2 of them at Lawrence Livermore Laboratory (LLL)

    Thornton (CDC) and Fernbach (LLL) lose their jobs


    CDC STAR-100

    Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis


    Texas Instruments ASC

    Advanced Scientific Computer, early 1970s

    architecturally similar to CDC STAR-100

    7 units sold

    TI dropped out of mainframe computer manufacturing after this machine


    Vector Performance I

    MFLOP rate (MFLOPS) as function of vector length n

    scalar: ~constant (only some loop overhead, then n * loop time)

    vector: (n = length of vector)

    # cycles = startup + n / nflop_per_cycle

    rate/clock = #ops / #cycles ~ n / (startup + n)

    half rate at vector length n ~ startup

    full rate needs n >> startup => Long Vector Machine
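The startup model above is easy to tabulate. A minimal Python sketch, assuming one result per cycle once the pipeline is full; the function names and the startup values 5, 50 and 500 cycles are illustrative choices, not measurements of any particular machine:

```python
def vector_cycles(n: int, startup: float, flops_per_cycle: float = 1.0) -> float:
    """Cycles to process a vector of length n: startup + n / flops_per_cycle."""
    return startup + n / flops_per_cycle


def rate_per_clock(n: int, startup: float, flops_per_cycle: float = 1.0) -> float:
    """Operations per clock: n / (startup + n) when flops_per_cycle is 1."""
    return n / vector_cycles(n, startup, flops_per_cycle)


for startup in (5, 50, 500):
    rates = ", ".join(f"n={n}: {rate_per_clock(n, startup):.2f}"
                      for n in (10, 100, 1000, 10000))
    print(f"startup={startup:3d}  {rates}")
```

With `flops_per_cycle` equal to 1, `rate_per_clock(startup, startup)` is exactly 0.5, which is the half-rate point stated above.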


    Performance vs. Startup, Length

    [Plot: operations per clock as a function of vector length; one curve per combination of startup time s and peak rate r, plus a scalar reference line]


    Vector Performance II

    Vector/Scalar Subsections

    ALL codes have some scalar (non-vectorizable) sections

    total time = (scalar fraction)/(scalar rate) + (vector fraction)/(vector rate)

    effective rate = 1 / total time

    example: 10% at 1 MFLOPS and 90% at 100 MFLOPS =>

    1 / (0.1/1 + 0.9/100) = 100 / (0.1 * 100 + 0.9 * 1) = 9.2 MFLOPS !!!
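The example can be reproduced in a few lines. A minimal Python sketch of the time-weighted average above; `effective_rate` is an illustrative name, and the fractions are fractions of the total operation count:

```python
def effective_rate(scalar_fraction: float, scalar_rate: float,
                   vector_rate: float) -> float:
    """Overall rate when part of the work runs at scalar speed.

    Time per operation is the fraction-weighted sum of 1/rate,
    and the effective rate is its reciprocal.
    """
    vector_fraction = 1.0 - scalar_fraction
    time_per_op = scalar_fraction / scalar_rate + vector_fraction / vector_rate
    return 1.0 / time_per_op


# The slide's example: 10% scalar at 1 MFLOPS, 90% vector at 100 MFLOPS.
print(round(effective_rate(0.10, 1.0, 100.0), 1))  # 9.2

# Sweep of the scalar fraction for the same two rates.
for fraction in (0.0, 0.01, 0.05, 0.10, 0.50):
    print(f"{fraction:.0%} scalar -> {effective_rate(fraction, 1.0, 100.0):.1f} MFLOPS")
```

Even a small scalar fraction dominates the total time, which is why the scalar unit's speed matters as much as the vector peak.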


    Vector Version of Amdahl's Law

    [Plot: performance as a function of scalar fraction, one curve per vector rate r]


    Vector Computer Design Guide

    Must have SHORT vector startup => can work with short vectors

    Must have FASTEST POSSIBLE scalar unit => can afford scalar sections

    irregular data structures ==> need gather, scatter, merge operations (and a few more)

    x(i) = a(index(i)) * b(i)

    y(index(i)) = c(i) + d(i)

    where (a(i) > b(i)) c(i) = d(i)
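The three statements above correspond to gather, scatter and masked merge. The Python sketch below spells out what each one computes, element by element. It says nothing about how vector hardware implements them, the function names are illustrative, and indices are 0-based.

```python
def gather_multiply(a, b, index):
    """x(i) = a(index(i)) * b(i): read a through an index vector (gather)."""
    return [a[index[i]] * b[i] for i in range(len(index))]


def scatter_add(y, c, d, index):
    """y(index(i)) = c(i) + d(i): write y through an index vector (scatter).

    If an index repeats, the last write wins in this sequential version.
    """
    for i in range(len(index)):
        y[index[i]] = c[i] + d[i]
    return y


def masked_merge(a, b, c, d):
    """where (a(i) > b(i)) c(i) = d(i): take d where the mask is true, else keep c."""
    return [d[i] if a[i] > b[i] else c[i] for i in range(len(c))]


index = [3, 0, 2]
print(gather_multiply([10, 20, 30, 40], [1, 2, 3], index))
print(scatter_add([0, 0, 0, 0], [1, 2, 3], [10, 20, 30], index))
print(masked_merge([1, 5, 3], [2, 4, 3], [0, 0, 0], [7, 8, 9]))
```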


    Cray Research, Inc.

    Founded by Seymour Cray (father of CDC 6600/7600) in 1972 (STAR-100 known)

    first Cray-1 delivered in 1976 to Los Alamos Scientific Laboratory (LASL)

    8 vector registers of 64 elements each

    Vector load/store instructions

    fastest scalar computer of its time

    160 MFLOPS peak rate (2 ops/cycle @ 80 MHz), few cycles startup


    Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis

    Seymour Cray

    Cray-1

    1976

    Single Processor

    80 MFLOPS

    1 Mword = 8 Mbyte


    Block Diagram Cray YMP-EL, only one of four identical CPUs shown, simplified

    [Block diagram; labels recoverable from the figure:]

    - shared main memory: 128 MW x 64 bit = 1 Gbyte, 4 ports/proc, 4 x 4 x 33 MHz => 4.2 Gbyte/sec

    - 8 vector registers (Vi), 64 elements, 64 bit; 4 vector execution units, 33 MHz

    - 8 scalar registers (Si), 64 bit; scalar functional units; 64 word buffer memory

    - 8 address registers (Ai), 32 bit; address functional units; 64 word buffer memory

    - 8 instruction buffers, 32 words each; instruction issue

    - Y1 channel, 40 Mbyte/sec

    - 48 shared registers

    Large working set:

    - 8 vector registers, 64 words

    - 8 scalar registers

    - 8 address registers

    - large instruction buffer

    Performance Features:

    - vector processing: one operation affects 64 vector elements, streamed through functional unit

    - small vector startup time

    - chaining between vector ops

    - large, fast semiconductor memory


    Cray Research, Inc. (continued)

    1982 Cray-XMP (Steve Chen improvements, up to 4 processors, shared memory)

    1985 Cray-2, 256 Mword memory, 4 processors, immersion cooled

    1988 Cray-YMP (last Chen machine)

    1991 Cray C90 (up to 16 vector CPUs, shared memory)

    1993 Cray T3D (massively parallel Alpha)

    one and only Cray-3 delivered to NCAR (Cray Comp Corp)

    1994 Cray J90 (up to 32 vector CPUs, shared memory), air cooled

    1995 Cray T3E (most successful MPP machine), Cray T90 (parallel vector, immersion cooled)

    Cray-4 abandoned (Cray Computer Corporation ch. 11)

    1996 acquired by Silicon Graphics

    1998 Cray SV1 (parallel vector, air cooled)

    2000 acquired by Tera Computer Company => Cray, Inc.

    2002 Cray X1, parallel vector, immersion spray cooled

    2004 Cray X1e, enhanced version of X1

    Cray XT3, AMD based 3D Torus massively parallel machine


    CDC Cyber 200 Family

    - 1980, enhanced version of STAR-100

    - reduced startup time, ~ 50 cycles

    - fast scalar unit

    - rich instruction repertoire

    - still memory-to-memory, 400 MFLOPS peak

    - Cyber 203, Cyber 205, ETA-10 [10 GFLOPS]

    - vector FORTRAN language extensions provided

    - terminated in 1989 since unprofitable

    - around 40 Cyber 200, 34 ETA-10 sold


    Minnesota Supercomputer Center

    Minneapolis, 1986

    Cray-2, CDC Cyber 205


    NEC Japan

    - 1983 SX-1 single processor vector 650 MFLOPS

    - 1985 SX-2 single processor vector 1300 MFLOPS

    - 1990 SX-3 four processors at ~ 5 GFLOPS each, 4 Gbyte = 0.5 Gword memory

    - 1995 SX-4 32 processors at ~ 2 GFLOPS each (CMOS; all previous ECL)

    - 1998 SX-5 up to 512 processors 8 GFLOPS each

    - 2002 SX-6 up to 1024 processors 8 GFLOPS each

    - 2004 SX-7 up to 2048 processors 8.8 GFLOPS each

    - 2004 SX-8 up to 4096 processors 16 GFLOPS each


    IBM - Sony - Toshiba CELL processor

    - 8 vector CPUs + PU on single chip

    - 256 kbyte = 32 kword local storage (very small !!)

    - 12 word/cycle internal interconnect = 386 Gbyte/sec

    - 24 Gbyte/sec = 3 Gword/sec main memory

    - 76 Gbyte/sec = 9.5 Gword/sec communication

    - @ 4 GHz clock: 256 GFLOPS (32 bit) peak

    - 26 GFLOPS (64 bit) peak

    - max 4.5 Gbyte addressable, 512 Mbyte implemented

    - system interconnect ?

    - used within Sony Playstation 3

    - Mercury, IBM blades available; 512 Mbyte only

    - highly imbalanced for scientific computation


    IBM - Sony - Toshiba CELL processor

    - 90 nm SOI, 8 layers Cu interconnect

    - 234 M Transistors

    - 221 mm² die size

    - significant potential in future revisions

    - but: 80 W @ 1.1 V, 4.0 GHz is too much

    - 180 W @ 1.4 V, 5.6 GHz is much too much

    - work needed in power reduction

    - larger internal memory

    - 64 bit arithmetic improved


    IBM - Sony - Toshiba CELL processor

    From: S. Williams et al., Lawrence Berkeley Laboratory

    - single Cell chip performance

    - compared with Cray X1E single vector processor and several commodity microprocessors (AMD, Intel)

    - already current version shows impressive speedup, at cost of significant programming complexity (explicit storage moves as opposed to caching)

    - slightly enhanced Cell (Cell+) simulation provides very significant additional speedup (more efficient DP)

    - current version insufficient for major impact

    - future versions may change that, great potential