Vector Computing

8/6/2019 Vector Computing


A journey through time into the world of computers.

Pipelined Vector Processing and Scientific Computation

John G. Zabolitzky




Applications of High-Performance Computing

Weather prediction, climatic simulation

fluid dynamics simulation (aerodynamics for aerospace, automobile, combustion, ...)

basic science

cosmology

quantum mechanical many-body problems

chemistry

solid-state

quantum fluids

high-energy physics

cryptography

weapons research

energy research

nuclear reactor simulation

fusion research

many many more




Terminal State of Scalar Computing: CDC 7600, 1968

Maximum RISC performance of 1 operation/cycle achieved

No further improvement possible without change of paradigm

36 MHz => 36 MIPS => 5 MFLOPS real

The CDC 7600 (designed by Seymour Cray) was the most powerful of all computers from 1968 to 1976, when the Cray-1 achieved > 10 times its performance




Pipelined Scalar Execution

[Figure: timing diagram. Successive instructions (instruction 1 to 4 and beyond) each pass through fetch, decode, and execute stages, overlapped along the time axis.]

Pipelined execution on parallel functional units




Scalar Code Example

DO i=1,100 a(i)=b(i)*c(i)

load b, inc address

load c, inc address

multiply

store a, inc address

decrement count, loop?

5 instructions = 5 cycles (optimum) for one multiply

pipelined multiply: could start one multiply each and every cycle => only 20% efficient use

expensive multiplier sits idle most of the time
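
The 20% figure follows from the instruction count alone. As a minimal sketch (the function name is illustrative, and it assumes the optimum case on this slide, one instruction issued per cycle), the multiplier utilization can be computed as:

```python
def multiplier_utilization(instructions_per_iteration: int,
                           multiplies_per_iteration: int = 1) -> float:
    """Fraction of cycles in which the multiplier starts a new operation.

    Assumes one instruction issues per cycle (the optimum case), so the
    number of cycles per loop iteration equals the instruction count.
    """
    if instructions_per_iteration <= 0:
        raise ValueError("need at least one instruction per iteration")
    return multiplies_per_iteration / instructions_per_iteration


# The five instructions of the scalar loop body listed on the slide.
LOOP_INSTRUCTIONS = ["load b", "load c", "multiply", "store a", "loop test"]

utilization = multiplier_utilization(len(LOOP_INSTRUCTIONS))
print(f"multiplier starts an operation in {utilization:.0%} of cycles")
```

Any extra instruction in the loop body lowers the utilization further, which is why the scalar paradigm cannot keep a pipelined multiplier busy.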




Architectural Alternatives

* Pipelined Scalar (RISC) as outlined before

* Pipelined Vector (this presentation further down)

* SIMD (Single Instruction Multiple Data) parallel arithmetic (e.g., ILLIAC IV)

too expensive, inefficient: larger number of lightly used multipliers

* Superscalar = multiple issue in one cycle

all modern single-chip CPUs (Intel to TI); keep all functions busy

* VLIW (Very Long Instruction Word) = Variant of Superscalar

* MIMD (Multiple Instruction Multiple Data) true parallel streams, e.g. Cray T3E, IBM Blue Gene, IBM Cell: may be superimposed on top of ANY CPU architecture




Vector Computation

Scientific codes spend a high percentage of their time looping over simple data structures

DO i=1,100 a(i) = b*c(i) + d(i)

simple logical structure ==>

set up such that one multiply/cycle

one instruction for entire loop

MFLOP rate = cycle rate or multiple thereof

specialized for scientific/engineering tasks
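
To make "one instruction for entire loop" and "MFLOP rate = cycle rate or multiple thereof" concrete, here is a small sketch. Both function names are hypothetical, and the Python function only models the semantics of a whole-array operation, not the hardware timing.

```python
from typing import Sequence


def vector_multiply_add(b: float, c: Sequence[float],
                        d: Sequence[float]) -> list:
    """Whole-array form of the loop  a(i) = b*c(i) + d(i).

    One call stands in for one vector operation over all elements.
    """
    if len(c) != len(d):
        raise ValueError("c and d must have the same length")
    return [b * ci + di for ci, di in zip(c, d)]


def peak_mflops(clock_mhz: float, flops_per_cycle: int = 1) -> float:
    """Peak rate when the pipeline delivers flops_per_cycle results per cycle."""
    return clock_mhz * flops_per_cycle


a = vector_multiply_add(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(a)
# Illustrative clock: 80 MHz with a multiply and an add completing each cycle.
print(peak_mflops(80.0, flops_per_cycle=2), "MFLOPS peak")
```

The peak rate is an upper bound; startup time and scalar sections reduce what is actually achieved.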




Vector Pipeline c(i)=a(i)*b(i)

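Stages: fetch a(i) and b(i), multip. 1, multip. 2, multip. 3, multip. 4, store c(i). Entries give the element index i in each stage; time runs downward.

time   fetch   multip. 1   multip. 2   multip. 3   multip. 4   store c(i)
1      1       -           -           -           -           -
2      2       1           -           -           -           -
3      3       2           1           -           -           -
4      4       3           2           1           -           -
5      5       4           3           2           1           -
6      6       5           4           3           2           1
7      7       6           5           4           3           2
8      8       7           6           5           4           3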

Inventor: Henry Ford
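
The assembly-line behavior of the pipeline diagram above can be reproduced with a short simulation. This is a sketch under simple assumptions (one new element enters per cycle, every stage takes exactly one cycle); the function and stage names are illustrative.

```python
def pipeline_occupancy(n_elements: int, n_stages: int, n_cycles: int) -> list:
    """For each cycle, list the element index in each stage (None if empty).

    Element i enters the first stage in cycle i and advances one stage
    per cycle, so stage s holds element (cycle - s) when that is valid.
    """
    table = []
    for cycle in range(1, n_cycles + 1):
        row = []
        for stage in range(n_stages):
            element = cycle - stage
            row.append(element if 1 <= element <= n_elements else None)
        table.append(row)
    return table


STAGES = ["fetch", "mult 1", "mult 2", "mult 3", "mult 4", "store"]

for cycle, row in enumerate(pipeline_occupancy(100, len(STAGES), 8), start=1):
    cells = " ".join("-" if e is None else str(e) for e in row)
    print(cycle, cells)
```

With six stages, cycle 6 is the first in which every stage is busy; from then on one result is stored per cycle.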




Need to Vectorize; some automatic, high quality requires hand-optimization

Naive scalar code for matrix multiply

s=0.0

do j=1,n

s=s+a(i,j)*b(j,k)

end do

Recursive on s => adder pipeline blocked

vector code for matrix multiply

do i=1,n

c(i,k)=c(i,k)+a(i,j)*b(j,k)

end do

Independent vector elements, but 1.5x memory bandwidth (two loads and one store per multiply-add instead of two loads)

Frequently a good idea: exchange inner/outer loop
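
The loop exchange can be checked with a small sketch. The two hypothetical functions below compute the same product; only the loop order differs. Python itself does not vectorize either version, so this illustrates the data dependence, not the speed.

```python
def matmul_dot_product(a, b, n):
    """Inner loop over j: every step updates the same scalar s (recursive)."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            s = 0.0
            for j in range(n):
                s = s + a[i][j] * b[j][k]  # each add waits for the previous one
            c[i][k] = s
    return c


def matmul_vector_order(a, b, n):
    """Inner loop over i: each step updates a different element c[i][k]."""
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            bjk = b[j][k]  # constant during the inner loop
            for i in range(n):
                c[i][k] = c[i][k] + a[i][j] * bjk  # independent elements
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_dot_product(a, b, 2))
print(matmul_vector_order(a, b, 2))
```

In the second form the inner loop reads c and a and writes c, which is the extra memory traffic the slide refers to.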




First Vector Computers

Control Data Corporation (CDC) STAR-100 [STring ARray 100 MFLOPS]

memory-to-memory architecture

therefore long startup times (~n00 cycles)

very slow scalar unit (~2 MFLOPS)

overall disappointing performance

contracted 1967, announced 1972, delivered 1974

total of 4 machines, 2 at Lawrence Livermore Laboratory

Thornton (CDC) and Fernbach (LLL) lose their jobs




[Photograph: CDC STAR-100. Courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis.]




Texas Instruments ASC

Advanced Scientific Computer, early 1970s

architecturally similar to CDC STAR-100

7 units sold

TI dropped out of mainframe computer manufacturing after this machine




Vector Performance I

MFLOP rate (MFLOPS) as function of vector length n

scalar: ~constant (only some loop overhead, then n * loop time)

vector: (n = length of vector)

# cycles = startup + n / nflop_per_cycle

rate/clock = #ops / #cycles ~ n / (startup + n)

half rate at vector length n ~ startup

full rate needs n >> startup => Long Vector Machine
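
The cycle-count model above is easy to evaluate numerically. In this sketch the function names and the sample startup values are assumptions chosen for illustration; only the formula comes from the slide.

```python
def vector_cycles(n: int, startup: float, flops_per_cycle: float = 1.0) -> float:
    """Cycles for a vector operation of length n: startup + n / flops_per_cycle."""
    return startup + n / flops_per_cycle


def rate_per_clock(n: int, startup: float, flops_per_cycle: float = 1.0) -> float:
    """Operations completed per clock cycle for a vector of length n."""
    return n / vector_cycles(n, startup, flops_per_cycle)


# Hypothetical startup times: a short-startup and a long-startup machine.
for startup in (5, 100):
    print(f"startup = {startup} cycles")
    for n in (10, 100, 1000, 10000):
        print(f"  n = {n:5d}  rate/clock = {rate_per_clock(n, startup):.3f}")
```

With one operation per cycle, the rate reaches exactly half of peak at n = startup, so a machine with a long startup needs correspondingly long vectors to approach full speed.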




Performance vs. Startup, Length

[Figure: flop/clock versus vector length for several (s, r) parameter combinations, with the scalar rate shown for comparison.]




Vector Performance II

Vector/Scalar Subsections

ALL codes have some scalar (non-vectorizable) sections

total time = (scalar fraction)/(scalar rate) + (vector fraction)/(vector rate)

example: time per operation = 10% / 1 MFLOPS + 90% / 100 MFLOPS

effective rate = 100 / (0.1 * 100 + 0.9 * 1) = 9.2 MFLOPS !!!
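
The total-time formula can be wrapped in a small function to explore other mixes. This is a sketch; the function name is hypothetical and the fractions are fractions of the operation count, as in the example above.

```python
def effective_rate(scalar_fraction: float, scalar_rate: float,
                   vector_rate: float) -> float:
    """Overall rate when scalar_fraction of the operations run at scalar_rate.

    Time per operation is the weighted sum of the two per-operation times;
    the effective rate is its reciprocal (same units as the input rates).
    """
    if not 0.0 <= scalar_fraction <= 1.0:
        raise ValueError("scalar_fraction must be between 0 and 1")
    vector_fraction = 1.0 - scalar_fraction
    time_per_op = scalar_fraction / scalar_rate + vector_fraction / vector_rate
    return 1.0 / time_per_op


# The slide's example: 10% scalar at 1 MFLOPS, 90% vector at 100 MFLOPS.
print(f"{effective_rate(0.10, 1.0, 100.0):.1f} MFLOPS")

# Raising the scalar rate helps far more than raising the vector rate.
print(f"{effective_rate(0.10, 10.0, 100.0):.1f} MFLOPS")
print(f"{effective_rate(0.10, 1.0, 1000.0):.1f} MFLOPS")
```

The comparison at the end shows why the scalar unit dominates: a tenfold faster scalar unit lifts the total substantially, while a tenfold faster vector unit barely changes it.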




Vector Version of Amdahl's Law

[Figure: performance as a function of scalar fraction, one curve for each of several values of r.]




Vector Computer Design Guide

Must have SHORT vector startup => can work with short vectors

Must have FASTEST POSSIBLE scalar unit => can afford scalar sections

irregular data structures ==> need gather, scatter, merge operations (and a few more)

x(i) = a(index(i)) * b(i)

y(index(i)) = c(i) + d(i)

where (a(i) > b(i)) c(i) = d(i)
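
The three statements above correspond to gather, scatter, and merge, in that order. Their semantics can be sketched in plain Python; the function names are illustrative, and indices are 0-based here, unlike the 1-based Fortran notation on the slide.

```python
def gather_multiply(a, index, b):
    """Gather: x(i) = a(index(i)) * b(i)."""
    return [a[j] * bi for j, bi in zip(index, b)]


def scatter_store(y, index, c, d):
    """Scatter: y(index(i)) = c(i) + d(i). Returns an updated copy of y."""
    result = list(y)
    for j, ci, di in zip(index, c, d):
        result[j] = ci + di
    return result


def masked_merge(a, b, c, d):
    """Merge: where (a(i) > b(i)) c(i) = d(i). Returns an updated copy of c."""
    return [di if ai > bi else ci for ai, bi, ci, di in zip(a, b, c, d)]


index = [2, 0, 1]
print(gather_multiply([10.0, 20.0, 30.0], index, [1.0, 2.0, 3.0]))
print(scatter_store([0.0, 0.0, 0.0], index, [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
print(masked_merge([1, 5, 3], [2, 4, 3], [0, 0, 0], [9, 9, 9]))
```

On a vector machine each of these runs as vector operations over the whole index or mask vector rather than as a scalar loop.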




Cray Research, Inc.

Founded by Seymour Cray (father of CDC 6600/7600) in 1972 (STAR-100 known)

first Cray-1 delivered in 1976 to Los Alamos Scientific Laboratory (LASL)

8 vector registers of 64 elements each

Vector load/store instructions

fastest scalar computer of its time

160 MFLOPS peak rate (2 ops/cycle @ 80 MHz), few cycles startup




Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis

Seymour Cray

Cray-1

1976

Single Processor

80 MFLOPS

1 Mword = 8 Mbyte




Block Diagram Cray YMP-EL, only one of four identical CPUs shown, simplified

[Figure: block diagram. Recoverable labels: 8 vector registers (Vi), 64 elements, 64 bit; 4 vector execution units, 33 MHz; 8 scalar registers (Si), 64 bit, with scalar functional units and 64-word buffer memory; 8 address registers (Ai), 32 bit, with address functional units and 64-word buffer memory; 8 instruction buffers, 32 words each; instruction issue; 48 shared registers; shared main memory, 128 MW of 64 bit = 1 Gbyte, 4 ports per processor, 4.2 Gbyte/sec; Y1 channel, 40 Mbyte/sec.]

Large working set:

- 8 vector registers, 64 words

- 8 scalar registers

- 8 address registers

- large instruction buffer

Performance Features:

- vector processing: one operation affects 64 vector elements, streamed through functional unit

- small vector startup time

- chaining between vector ops

- large, fast semiconductor memory




Cray Research, Inc. (continued)

1982 Cray-XMP (Steve Chen improvements, up to 4 processors, shared memory)

1985 Cray-2, 256 Mword memory, 4 processors, immersion cooled

1988 Cray-YMP (last Chen machine)

1991 Cray C90 (up to 16 vector CPUs, shared memory)

1993 Cray T3D (massively parallel Alpha)

one and only Cray-3 delivered to NCAR (Cray Computer Corporation)

1994 Cray J90 (up to 32 vector CPUs, shared memory), air cooled

1995 Cray T3E (most successful MPP machine), Cray T90 (parallel vector, immersion cooled)

Cray-4 abandoned (Cray Computer Corporation files for Chapter 11 bankruptcy)

1996 acquired by Silicon Graphics

1998 Cray SV1 (parallel vector, air cooled)

2000 acquired by Tera Computer Company => Cray, Inc.

2002 Cray X1, parallel vector, immersion spray cooled

2004 Cray X1e, enhanced version of X1

Cray XT3, AMD based 3D Torus massively parallel machine




CDC Cyber 200 Family

- 1980, enhanced version of STAR-100

- reduced startup time, ~ 50 cycles

- fast scalar unit

- rich instruction repertoire

- still memory-to-memory, 400 MFLOPS peak

- Cyber 203, Cyber 205, ETA-10 [10 GFLOPS]

- vector FORTRAN language extensions provided

- terminated in 1989 since unprofitable

- around 40 Cyber 200, 34 ETA-10 sold




Minnesota Supercomputer Center

Minneapolis, 1986

Cray-2, CDC Cyber205




NEC Japan

- 1983 SX-1 single processor vector 650 MFLOPS

- 1985 SX-2 single processor vector 1300 MFLOPS

- 1990 SX-3 four processors at ~ 5 GFLOPS each, 4 Gbyte = 0.5 Gword memory

- 1995 SX-4 32 processors at ~ 2 GFLOPS each (CMOS; all previous ECL)

- 1998 SX-5 up to 512 processors 8 GFLOPS each


- 2004 SX-7 up to 2048 processors 8.8 GFLOPS each





IBM - Sony - Toshiba CELL processor

- 8 vector CPUs + PU on single chip

- 256 kbyte = 32 kword local storage (very small !!)

- 12 word/cycle internal interconnect = 386 Gbyte/sec

- 24 Gbyte/sec = 3 Gword/sec main memory

- 76 Gbyte/sec = 9.5 Gword/sec communication

- @ 4 GHz clock: 256 GFLOPS (32 bit) peak

- 26 GFLOPS (64 bit) peak

- max 4.5 Gbyte addressable, 512 Mbyte implemented

- system interconnect ?

- used within Sony Playstation 3

- Mercury, IBM blades available; 512 Mbyte only

- highly imbalanced for scientific computation





- 90 nm SOI, 8 layers Cu interconnect

- 234 M Transistors

- 221 mm² die size

- significant potential in future revisions

- but: 80 W @ 1.1 V, 4.0 GHz is too much

- 180 W @ 1.4 V, 5.6 GHz is much too much

- work needed in power reduction

- larger internal memory

- 64 bit arithmetic improved





From: S. Williams et al., Lawrence Berkeley Laboratory

- single Cell chip performance

- compared with Cray X1E single vector processor and several commodity microprocessors (AMD, Intel)

- already current version shows impressive speedup, at cost of significant programming complexity (explicit storage moves as opposed to caching)

- slightly enhanced Cell (Cell+) simulation provides very significant additional speedup (more efficient DP)

- current version insufficient for major impact

- future versions may change that, great potential
