Anshul Kumar, CSE IITD CS718 : Data Parallel Processors 27th April, 2006


Data Parallel Architectures

• SIMD Processors
– Multiple processing elements driven by a single instruction stream
• Associative Processors
– SIMD like processors with associative memory
• Vector Processors
– Uni-processors with vector instructions
• Systolic Arrays
– Application specific VLSI structures


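SIMD

[Figure: SIMD organization with a control unit (C), processors (P) and memory (M), linked by one instruction stream (IS) and multiple data streams (DS)]

One of the earliest models of parallel computers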


ILLIAC IV SIMD Model

[Figure: a control unit (CU) drives processing elements PE1, PE2, ..., PEn, each a processor (P) with its own memory (M); the PEs are linked by an interconnection network, and I/O is attached through a bus]

Planned for 64 x 4 PEs; only 64 were built


Burroughs Scientific Processor (BSP) Model

[Figure: processors P1, P2, ..., Pn and memory modules M1, M2, ..., Mk connected through an interconnection network, with a control unit (CU) and an I/O bus]


SIMD algorithms: sum of vector elements

step 1: S_i = a_i + a_{i+1}, i = 0, 2, 4, 6
step 2: S_i = S_i + S_{i+2}, i = 0, 4
step 3: S_i = S_i + S_{i+4}, i = 0

Partial sums:
start:  a0 a1 a2 a3 a4 a5 a6 a7
step 1: a0+a1, a2+a3, a4+a5, a6+a7
step 2: a0+a1+a2+a3, a4+a5+a6+a7
step 3: a0+a1+a2+a3+a4+a5+a6+a7

OR

step 1: S_i = a_i + a_{i+4}, i = 0, 1, 2, 3
step 2: S_i = S_i + S_{i+2}, i = 0, 1
step 3: S_i = S_i + S_{i+1}, i = 0
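The two reduction orders can be tried out with a small sequential simulation. This is only a sketch: the function names are made up for illustration, the inner loop stands in for the PEs that would all act in the same SIMD step, and the vector length is assumed to be a power of two.

```python
def tree_sum_adjacent(a):
    """Pair neighbours first: strides 1, 2, 4, ... (first scheme)."""
    s = list(a)
    n = len(s)
    stride = 1
    steps = 0
    while stride < n:
        # One SIMD step: each active PE i computes S[i] = S[i] + S[i + stride].
        for i in range(0, n, 2 * stride):
            s[i] += s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def tree_sum_halving(a):
    """Pair element i with element i + n/2 first: strides n/2, ..., 2, 1 (second scheme)."""
    s = list(a)
    stride = len(s) // 2
    steps = 0
    while stride >= 1:
        # One SIMD step: PEs 0 .. stride-1 are active.
        for i in range(stride):
            s[i] += s[i + stride]
        stride //= 2
        steps += 1
    return s[0], steps


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(tree_sum_adjacent(data))  # (31, 3)
    print(tree_sum_halving(data))   # (31, 3)
```

Both orders finish in log2(8) = 3 steps. They differ in which PEs stay active and how far apart the operands are, which matters once data distribution and the interconnection network are taken into account.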


No. of processors vs time

Adding vector elements:
– n processors – log n steps
– n/log n processors – log n steps

Matrix multiplication:
– n processors – n^2 steps
– n^2 processors – n steps
– n^3 processors – log n steps
– n^3/log n processors – log n steps

Important factors: data distribution, network
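One way to see why n/log n processors still need only on the order of log n steps for vector addition: each processor first adds a block of about log n elements serially, and the partial sums are then combined by a tree. The sketch below counts parallel steps under that assumption. The function name and the block-size choice are illustrative and not taken from the slides, and the input is assumed to be non-empty.

```python
import math


def blocked_sum(a):
    """Return (sum, processors used, parallel steps) for a two-phase reduction."""
    n = len(a)
    block = max(1, int(math.log2(n)))   # elements handled serially by one processor
    p = math.ceil(n / block)            # roughly n / log n processors

    # Phase 1: all processors work at the same time, each doing block - 1 additions.
    partial = [sum(a[k * block:(k + 1) * block]) for k in range(p)]
    steps = block - 1

    # Phase 2: tree reduction over the p partial sums, one level per step.
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)           # pad so every element has a partner
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
        steps += 1
    return partial[0], p, steps


if __name__ == "__main__":
    print(blocked_sum(list(range(16))))  # (120, 4, 5)
```

For n = 16 this uses 4 processors and 5 steps, compared with 16 processors and 4 steps for the plain tree. The step count stays within a constant factor of log n while the processor count drops by a factor of log n.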


Rise and fall of SIMDs

• Introduced in 60's (e.g. Illiac, BSP)
• Problems:
– not cost effective
– serial fraction and Amdahl's law
– I/O bottleneck
• Overshadowed by Vector Processors
• Resurrected in 80's (MPP from Goodyear, Connection Machine from Thinking Machines Corp., MP-1 from MasPar)
• Did not survive because of high cost
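The serial-fraction problem can be made concrete with Amdahl's law, speedup = 1 / (f + (1 - f)/p) for serial fraction f and p processors. A minimal sketch follows; the function name and the sample numbers are chosen only for illustration.

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup when a fraction of the work cannot be parallelized."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # With 64 PEs, a 10% serial fraction caps the speedup well below 64.
    for f in (0.0, 0.01, 0.1, 0.5):
        print(f, round(amdahl_speedup(f, 64), 2))
```

Even a small serial fraction leaves most of a large PE array idle, which is one reason the early SIMD machines were hard to justify on cost.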


Related ideas

• Coarse grain SIMD with off the shelf processors (synchronized MIMD), e.g. CM5 of Thinking Machines

• This gave rise to SPMD (single program multiple data)

• MMX and SIMD instructions in Pentium
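The SPMD idea can be sketched in a few lines: every worker runs the same program and uses its rank to pick its own slice of the data. Threads stand in for processors here, and the names `spmd_program` and `run_spmd` are hypothetical; a real SPMD machine would run one copy of the program per node.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_program(rank, size, data):
    """The single program: each worker processes only the slice selected by its rank."""
    chunk = (len(data) + size - 1) // size          # ceiling division
    mine = data[rank * chunk:(rank + 1) * chunk]
    return sum(x * x for x in mine)


def run_spmd(data, size=4):
    """Launch `size` copies of the same program and combine their partial results."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        partials = list(pool.map(lambda r: spmd_program(r, size, data), range(size)))
    return sum(partials)


if __name__ == "__main__":
    print(run_spmd(list(range(10))))  # 285
```

Unlike lock-step SIMD, the workers are not synchronized instruction by instruction; they only meet when the partial results are combined.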


Vector Processors

[Figure: vector processor organization with I-cache, D-cache and memory control; I-unit and control; vector registers (V-reg), GPRs and address unit; vector functional units (VFU) and a functional unit (FU); buses to memory]


Four Generations of CRAY systems (vector processors)

System   CPUs   Clock (MHz)   Flops/clock/CPU   Words moved/clk/CPU   Mflops   Gates/chip
CRAY-1   1      80            2                 1                     80       2
X-MP     4      105           2                 3                     840      16
Y-MP     8      166           2                 3                     2667     2500
C90      16     240           4                 6                     15360    10000
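The Mflops column appears to be the product CPUs x clock (MHz) x flops per clock per CPU. That relation is an assumption read off the table, not something the slide states. A short sketch to recompute it from the other columns:

```python
def peak_mflops(cpus, clock_mhz, flops_per_clock):
    """Peak rate assuming every CPU completes flops_per_clock results each cycle."""
    return cpus * clock_mhz * flops_per_clock


# (system, CPUs, clock in MHz, flops/clock/CPU, Mflops as listed in the table)
TABLE = [
    ("CRAY-1", 1, 80, 2, 80),
    ("X-MP", 4, 105, 2, 840),
    ("Y-MP", 8, 166, 2, 2667),
    ("C90", 16, 240, 4, 15360),
]

if __name__ == "__main__":
    for name, cpus, clock, flops, listed in TABLE:
        print(name, "computed:", peak_mflops(cpus, clock, flops), "listed:", listed)
```

Under this assumption the X-MP and C90 rows match exactly, and the Y-MP row matches if the clock is taken as about 166.7 MHz. The CRAY-1 row lists 80, whereas the product gives 160.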


Cray History

• http://www.cray.com/company/history.html


CRAY C90

• 8GB central memory shared by 16 CPUs
• 128 CPU - mem paths
• word = 64 bits + 16 ECC
• Dual vector pipes
• 128 element segments

Memory

8 sections

8x8 sub sections

8x8x2 bank groups

8x8x2x8 banks
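The hierarchy multiplies out to 8 x 8 x 2 x 8 = 1024 banks. The sketch below splits a flat bank number into one index per level. The numbering order, with the lowest level varying fastest, is purely an illustrative assumption and is not claimed to be the C90's actual interleaving scheme.

```python
# Fan-out at each level of the memory hierarchy, top level first.
LEVELS = [("section", 8), ("subsection", 8), ("bank_group", 2), ("bank", 8)]


def total_banks(levels=LEVELS):
    """Multiply the fan-outs to get the number of banks."""
    total = 1
    for _, fanout in levels:
        total *= fanout
    return total


def locate_bank(bank_index, levels=LEVELS):
    """Split a flat bank number into per-level indices (mixed-radix digits)."""
    if not 0 <= bank_index < total_banks(levels):
        raise ValueError("bank index out of range")
    coords = {}
    for name, fanout in reversed(levels):        # lowest level varies fastest
        bank_index, coords[name] = divmod(bank_index, fanout)
    return coords


if __name__ == "__main__":
    print(total_banks())      # 1024
    print(locate_bank(8))
```

A large number of independent banks is what lets several CPUs with multiple memory paths each keep their vector pipes supplied at the same time.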


Convex C4/XA system

• CPU: 7.5 ns clock, 1620 MFLOPs
• Mem: 32 MB x 32 banks, 64 bit word, 50ns access time
• 3 FP pipes, 2 results each
• Vector regs - FPU crossbar
• 1.1 GB/s per I/O port

[Figure: 5 x 5 crossbar connecting CPUs, memories and I/O utilities]


Other examples

NEC SX-X
• 4 CPUs
• 4 x 2 pipes each

Fujitsu VP5000
• 7 - 222 CPUs
• 2 LS pipes
• 3 Func pipes
• 2 mask pipes

Fujitsu VP2000
• 1 - 2 CPUs


Systolic Arrays (H.T. Kung 1978)

Simplicity, Regularity, Concurrency, Communication

Example : Band matrix multiplication

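C = A x B, where A and B are 6 x 6 band matrices:

    | A11 A12  0   0   0   0  |
    | A21 A22 A23  0   0   0  |
A = | A31 A32 A33 A34  0   0  |
    |  0  A42 A43 A44 A45  0  |
    |  0   0  A53 A54 A55 A56 |
    |  0   0   0  A64 A65 A66 |

    | B11 B12  0   0   0   0  |
    | B21 B22 B23  0   0   0  |
B = | B31 B32 B33 B34  0   0  |
    |  0  B42 B43 B44 B45  0  |
    |  0   0  B53 B54 B55 B56 |
    |  0   0   0  B64 B65 B66 |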

[Figure: snapshots of the systolic array at T=0 through T=6. Elements of A and B move into the array step by step; products such as A11 B11, A12 B21 and A21 B11 accumulate in the cells; C11 appears in the T=5 snapshot and C12 in the T=6 snapshot]
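The data movement in the snapshots can be imitated with a much simpler arrangement than the band-matrix array shown on the slides. The sketch below assumes a square n x n grid of multiply-accumulate cells where A flows left to right, B flows top to bottom, and each cell keeps its own C value. It is not Kung's band-matrix design; the function name and the cell layout are illustrative assumptions.

```python
def systolic_matmul(A, B):
    """Simulate an n x n grid of multiply-accumulate cells, one clock tick at a time."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # A value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]   # B value currently held by each cell

    for t in range(3 * n - 2):            # ticks until the last operands reach cell (n-1, n-1)
        # Shift: A values move one cell to the right, B values one cell down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]

        # Inject skewed inputs: row i is delayed by i ticks, column j by j ticks.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0

        # Every cell does one multiply-accumulate per tick, all in parallel.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Because of the input skew, cell (i, j) sees A[i][k] and B[k][j] on the same tick t = i + j + k, so each cell only ever talks to its neighbours. This is the regularity and local communication that the systolic approach relies on.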


WARP: Programmable Systolic Processor

[Kung, CMU 1987]

Complete contrast to the original idea

• not application specific

• not a single VLSI

• complex cell (pipelined FP adder, mult, FIFOs, RAM, cross bar)

• linear

• asynchronous


