Anshul Kumar, CSE IITD
CS718 : Data Parallel Processors
27th April, 2006
Data Parallel Architectures
• SIMD Processors
– Multiple processing elements driven by a single instruction stream
• Associative Processors
– SIMD like processors with associative memory
• Vector Processors
– Uni-processors with vector instructions
• Systolic Arrays
– Application specific VLSI structures
SIMD
[Figure: SIMD model, with blocks C, P, P and M connected by one IS path and two DS paths]
One of the earliest models of parallel computers
ILLIAC IV SIMD Model
[Figure: control unit (CU) driving PE1, PE2, ..., PEn, each a processor (P) with its own memory (M), linked by an interconnection network; I/O bus]
Planned for 64 x 4 PEs, built only 64
Burroughs Scientific Processor (BSP) Model
[Figure: CU driving processors P1, P2, ..., Pn, which reach memories M1, M2, ..., Mk through an interconnection network; I/O bus]
SIMD algorithms: sum of vector elements
Input: a0 a1 a2 a3 a4 a5 a6 a7
step 1: S_i = a_i + a_{i+1}, i = 0, 2, 4, 6 (gives a0+a1, a2+a3, a4+a5, a6+a7)
step 2: S_i = S_i + S_{i+2}, i = 0, 4 (gives a0+a1+a2+a3, a4+a5+a6+a7)
step 3: S_i = S_i + S_{i+4}, i = 0 (gives a0+a1+a2+a3+a4+a5+a6+a7)
OR
step 1: S_i = a_i + a_{i+4}, i = 0, 1, 2, 3
step 2: S_i = S_i + S_{i+2}, i = 0, 1
step 3: S_i = S_i + S_{i+1}, i = 0
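The two schedules above can be sketched in Python as follows. This is a minimal sequential simulation, not code from the lecture: the function names are hypothetical, the vector length is assumed to be a power of two, and each pass of the inner loop stands in for one SIMD step in which all additions would happen at once.

```python
def tree_sum_adjacent(a):
    """First schedule: add neighbours at distance 1, then 2, then 4, ..."""
    s = list(a)
    n = len(s)
    stride = 1
    steps = 0
    while stride < n:
        # One SIMD step: every active element i adds its partner i + stride.
        for i in range(0, n - stride, 2 * stride):
            s[i] = s[i] + s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def tree_sum_halving(a):
    """Second schedule: add partners at distance n/2, then n/4, ..., then 1."""
    s = list(a)
    half = len(s) // 2  # assumes len(a) is a power of two
    steps = 0
    while half >= 1:
        # One SIMD step: elements 0 .. half-1 add the element half positions away.
        for i in range(half):
            s[i] = s[i] + s[i + half]
        half //= 2
        steps += 1
    return s[0], steps
```

For eight elements both functions take three steps, matching the step counts on the slide. The second schedule keeps the active elements contiguous (0 to half-1), while the first leaves them spread out at multiples of the stride.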
No. of processors vs time
Adding vector elements:
– n processors – log n steps
– n/log n processors – log n steps
Matrix multiplication:
– n processors – n^2 steps
– n^2 processors – n steps
– n^3 processors – log n steps
– n^3/log n processors – log n steps
Important factors: data distribution, network
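One way to see why n/log n processors still finish a vector sum in a number of steps proportional to log n is sketched below. It assumes each processor first adds a block of about log n elements sequentially and the partial sums are then combined by a tree; the function name and the step accounting are illustrative assumptions, not taken from the lecture.

```python
import math


def blocked_tree_sum(a):
    """Return (sum, processors used, parallel steps) for a blocked reduction."""
    a = list(a)
    n = len(a)
    block = max(1, int(math.log2(n)))  # about log n elements per processor

    # Phase 1: every processor sums its own block; the blocks run in parallel,
    # so this phase costs block - 1 sequential additions.
    partial = [sum(a[i:i + block]) for i in range(0, n, block)]
    processors = len(partial)
    steps = block - 1

    # Phase 2: tree reduction over the partial sums, one step per level.
    stride = 1
    while stride < processors:
        for i in range(0, processors - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    return partial[0], processors, steps
```

With 16 elements this uses 4 processors and 5 steps (3 within the blocks, 2 in the tree), compared with 4 steps on 16 processors: the step count stays within a constant factor of log n while the processor count drops.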
Rise and fall of SIMDs
• Introduced in 60's (e.g. Illiac, BSP)
• Problems:
– not cost effective
– serial fraction and Amdahl's law
– I/O bottleneck
• Overshadowed by Vector Processors
• Resurrected in 80's (MPP from Goodyear, Connection Machine from Thinking Machines Corporation, MP-1 from MasPar)
• Did not survive because of high cost
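The serial-fraction problem listed above can be made concrete with Amdahl's law, speedup = 1 / (s + (1 - s)/p), where s is the fraction of work that cannot be parallelized and p is the number of processing elements. The sketch below is illustrative; the function name and the sample values are assumptions chosen for the example.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # With 10% serial work, adding PEs can never push the speedup to 10.
    for p in (64, 256, 1024):
        print(p, round(amdahl_speedup(0.10, p), 2))
```

Even a small serial fraction caps the benefit of a large PE array, which is one reason a machine with many simple PEs could fail to be cost effective.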
Related ideas
• Coarse grain SIMD with off the shelf processors (synchronized MIMD), e.g. CM5 of Thinking Machines
• This gave rise to SPMD (single program multiple data)
• MMX and SIMD instructions in Pentium
Vector Processors
[Figure: vector processor block diagram with I-cache, D-cache, memory control, I-unit and control, V-reg, GPRs, address unit, two VFUs, an FU, buses, and memory]
Four Generations of CRAY systems (vector processors)
System   CPUs   Clock (MHz)   Flops/clock/CPU   Words moved/clk/CPU   Mflops   Gates/chip
CRAY-1      1            80                 2                     1       80            2
X-MP        4           105                 2                     3      840           16
Y-MP        8           166                 2                     3     2667         2500
C90        16           240                 4                     6    15360        10000
Cray History
• http://www.cray.com/company/history.html
CRAY C90
• 8 GB central memory shared by 16 CPUs
• 128 CPU - mem paths
• word = 64 bits + 16 ECC
• Dual vector pipes
• 128 element segments
Memory organization:
• 8 sections
• 8x8 subsections
• 8x8x2 bank groups
• 8x8x2x8 banks
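The memory hierarchy above multiplies out to 8 x 8 x 2 x 8 = 1024 banks. The sketch below computes that count and shows one way consecutive word addresses could be spread across the hierarchy. The mapping is a hypothetical low-order interleaving chosen for illustration; the slide does not state how the C90 actually assigns addresses to banks, and the names are invented for the example.

```python
# Fan-out at each level, from the slide: sections, subsections per section,
# bank groups per subsection, banks per bank group.
LEVELS = (8, 8, 2, 8)


def total_banks(levels=LEVELS):
    """Multiply the fan-out of every level to get the number of banks."""
    total = 1
    for count in levels:
        total *= count
    return total


def locate_word(word_address, levels=LEVELS):
    """Hypothetical mapping (an assumption, not the documented C90 scheme):
    consecutive words rotate through sections first, then subsections,
    then bank groups, then banks."""
    coords = []
    rest = word_address
    for count in levels:
        coords.append(rest % count)
        rest //= count
    return tuple(coords)  # (section, subsection, bank group, bank)
```

Under this assumed mapping a unit-stride vector access touches a different section on each of eight consecutive words and returns to the same bank only after 1024 words.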
Convex C4/XA system
• CPU: 7.5 ns clock, 1620 MFLOPs
• Mem: 32 MB x 32 banks, 64 bit word, 50 ns access time
• 3 FP pipes, 2 results each
• Vector regs - FPU crossbar
• 1.1 GB/s per I/O port
[Figure: 5 x 5 crossbar connecting CPUs, memories, and I/O utilities]
Other examples
NEC SX-X
• 4 CPUs
• 4 x 2 pipes each
Fujitsu VP5000
• 7 - 222 CPUs
• 2 LS pipes
• 3 Func pipes
• 2 mask pipes
Fujitsu VP2000
• 1 - 2 CPUs
Systolic Arrays (H.T. Kung 1978)
Simplicity, Regularity, Concurrency, Communication
Example : Band matrix multiplication
[Figure: band matrices A and B (entries 11 through 66 inside the band, zeros outside) and their product C, followed by snapshots of the systolic array at T=0 through T=6 in which elements of A and B stream into the array, meet as products such as A11 B11 and A12 B21, and accumulate into C11 and C12]
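The data flow in the snapshots can be imitated with a small cycle-by-cycle simulation. Note the substitution: the sketch below uses a simpler square mesh in which each cell keeps one element of C in place while A moves right and B moves down, not the array of the band matrix example where the C values also move. All names are hypothetical, and the matrices are assumed square and dense.

```python
def systolic_matmul(A, B):
    """Simulate an n x n mesh of multiply-accumulate cells computing C = A * B."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # A value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]  # B value currently held by each cell

    for t in range(3 * n - 2):  # the last operands reach cell (n-1, n-1) at t = 3n - 3
        # A values move one cell to the right, B values one cell down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]

        # Feed the edges: row i of A and column j of B are delayed by i and j
        # cycles, with zeros before and after the real data.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0

        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

Each cell talks only to its immediate neighbours and repeats the same operation every cycle, which is the regularity and local communication the slide lists; the skewed, zero-padded input streams play the role of the staggered A and B elements in the snapshots.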
WARP: Programmable Systolic Processor
[Kung, CMU 1987]
Complete contrast to the original idea
• not application specific
• not a single VLSI
• complex cell (pipelined FP adder, mult, FIFOs, RAM, cross bar)
• linear
• asynchronous
References
• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures : A Design Space Approach", Addison Wesley, 1997.
• K. Hwang, "Advanced Computer Architecture : Parallelism, Scalability, Programmability", McGraw Hill, 1993.