Parallel Computer Architectures

Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu

Source: csl.skku.edu/uploads/ICE3003F09/22-parch.pdf (ICE3003: Computer Architecture)


Exploiting Parallelism

Instruction-Level Parallelism
• Pipelining, superscalar, OOO, branch prediction, VLIW, …

Task-Level (or Thread-Level) Parallelism
• Simultaneous Multithreading
• Multicore
• Multiprocessors
• Clusters

Data-Level Parallelism
• Vector architecture
• SIMD instructions

ICE3003: Computer Architecture | Fall 2009 | Jin-Soo Kim ([email protected])

Hardware Multithreading (1)

Run multiple threads of execution in parallel
• Replicate registers, PC, etc.
• Fast switching between threads

Fine-grain multithreading
• Switch threads after each cycle
• Interleave instruction execution
• If one thread stalls, others are executed

Coarse-grain multithreading
• Only switch on long stall (e.g., L2-cache miss)
• Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
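The difference between the two switching policies can be sketched with a toy cycle counter. This is a minimal model and not a description of any real pipeline: the function name `simulate`, the single-issue core, the `long_stall` threshold of 10 cycles, and the encoding of each instruction as "stall cycles that follow it" are all assumptions made for illustration.

```python
def simulate(threads, policy, long_stall=10):
    """Cycles needed to issue every instruction on a single-issue core.

    threads[t][i] is the stall (in cycles) that thread t suffers after
    issuing its i-th instruction, before it may issue the next one.
    policy is "fine" (round-robin every cycle) or "coarse" (switch only
    when the running thread hits a stall of at least long_stall cycles).
    """
    if policy not in ("fine", "coarse"):
        raise ValueError("policy must be 'fine' or 'coarse'")
    n = len(threads)
    pc = [0] * n                 # next instruction index per thread
    ready_at = [0] * n           # first cycle each thread may issue again
    long_stalled = [False] * n   # did the last issued instruction stall long?
    current = n - 1 if policy == "fine" else 0
    cycle = 0

    def runnable(t):
        return pc[t] < len(threads[t]) and ready_at[t] <= cycle

    while any(pc[t] < len(threads[t]) for t in range(n)):
        round_robin = [(current + 1 + k) % n for k in range(n)]
        if policy == "fine":
            # Fine-grain: rotate to the next thread that can issue.
            pick = next((t for t in round_robin if runnable(t)), None)
        elif runnable(current):
            # Coarse-grain: keep running the current thread.
            pick = current
        elif pc[current] < len(threads[current]) and not long_stalled[current]:
            # Coarse-grain: a short stall is not hidden, the core idles.
            pick = None
        else:
            # Coarse-grain: long stall (or thread finished), switch.
            pick = next((t for t in round_robin if runnable(t)), None)
        if pick is not None:
            stall = threads[pick][pc[pick]]
            pc[pick] += 1
            ready_at[pick] = cycle + 1 + stall
            long_stalled[pick] = stall >= long_stall
            current = pick
        cycle += 1
    return cycle


if __name__ == "__main__":
    short = [[3, 0], [3, 0]]     # two threads, each with one 3-cycle stall
    long_ = [[20, 0], [20, 0]]   # two threads, each with one 20-cycle stall
    print(simulate(short, "fine"), simulate(short, "coarse"))
    print(simulate(long_, "fine"), simulate(long_, "coarse"))
```

In this toy model the short-stall workload finishes in 6 cycles under the fine-grain policy and 10 cycles under the coarse-grain policy, while the long-stall workload takes 23 cycles under both, which matches the slide's point that coarse-grain switching hides only long stalls.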

Hardware Multithreading (2)

Simultaneous multithreading (SMT)
• In a multiple-issue, dynamically scheduled processor
  – Schedule instructions from multiple threads
  – Instructions from independent threads execute when function units are available
  – Within threads, dependencies handled by scheduling and register renaming
• Makes a single physical processor appear as multiple logical processors
• Uses processor resources more effectively
• Example: Intel Pentium 4 HT
  – Two threads: duplicated registers, shared function units and caches

Hardware Multithreading (3)

Intel HyperThreading
[Figure: logical processor 0 and logical processor 1, each with its own architectural state (registers), on top of the execution units, cache(s), and system bus]

Hardware Multithreading (4)

Examples

Hardware Multithreading (5)

Future of multithreading
• Will it survive? In what form?
• Power considerations ⇒ simplified microarchitectures
  – Simpler forms of multithreading
  – e.g., Sun UltraSPARC T2: fine-grained multithreading
• Tolerating cache-miss latency
  – Thread switch may be most effective
• Multiple simple cores might share resources more effectively

Multicore (1)

Multicore is mainstream

Multicore (2)

Multicore (3)

Memory wall
• CPU 55%/year, Memory 10%/year (1986~2000)
• Caches show diminishing returns

ILP wall
• Control dependency
• Data dependency

Power wall
• Dynamic power ∝ Frequency³
• Static power ∝ Frequency
• Total power ∝ The number of cores

Multicore (4)

[Figure: performance and power relative to a single core at nominal clock; annotation: "More MIPS/watt". Source: Intel]

Configuration                      Performance   Power
Single-core, raise clock (20%)     1.13x         1.73x
Single-core                        1.00x         1.00x
Single-core, lower clock (20%)     0.87x         0.51x
Dual-core, lower clock (20%)       1.73x         1.02x
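The power figures in the Intel chart follow from the cubic relation between dynamic power and frequency stated on the power-wall slide. The sketch below reproduces them; the function names are made up for illustration, and it assumes that total power is purely dynamic and scales linearly with the number of cores. The performance factors (1.13x, 0.87x, 1.73x) are taken from the chart as given and are not derived here.

```python
def relative_dynamic_power(freq_scale, cores=1):
    """Power relative to one core at nominal clock, assuming P ∝ cores × f³."""
    return cores * freq_scale ** 3


def relative_perf_per_watt(perf_scale, freq_scale, cores=1):
    """Performance per watt relative to the single-core baseline (1.0)."""
    return perf_scale / relative_dynamic_power(freq_scale, cores)


if __name__ == "__main__":
    # (label, performance factor from the chart, clock factor, cores)
    configs = [
        ("single-core, clock +20%", 1.13, 1.2, 1),
        ("single-core, nominal", 1.00, 1.0, 1),
        ("single-core, clock -20%", 0.87, 0.8, 1),
        ("dual-core, clock -20%", 1.73, 0.8, 2),
    ]
    for label, perf, freq, cores in configs:
        power = relative_dynamic_power(freq, cores)
        ratio = relative_perf_per_watt(perf, freq, cores)
        print(f"{label:26s} power {power:.2f}x  perf/watt {ratio:.2f}x")
```

Under these assumptions 1.2³ ≈ 1.73, 0.8³ ≈ 0.51, and 2 × 0.8³ ≈ 1.02, so the dual-core configuration delivers the over-clocked chip's power budget worth of performance at roughly the baseline power.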

Multicore (5)

Core types
• Homogeneous
• Heterogeneous

Cache organization
• Shared L2+ cache
• Independent L2+ cache

On-die interconnects
• Bus
• Crossbar
• Ring, Mesh, Torus, …

Multicore (6)

Intel dual-core processors
• Homogeneous cores
• Bus-based
• Different cache configurations

Multicore (7)

Sun UltraSPARC T2 (Niagara2)
• Homogeneous
• Crossbar-based
• Shared L2 cache
• 8 SPARC cores, 8 threads / core
• 16KB I$ per core, 8KB D$ per core
• 4MB shared L2, 8 banks, 16-way

Multicore (8)

STI Cell Broadband Engine
• Heterogeneous
• Ring-based
• One PPE based on 64-bit PowerPC
• Eight SPEs, each with 256KB Local Store
• Globally coherent DMA
• EIB: Four 16B rings

PPE: Power Processor Element
SPE: Synergistic Processing Element
EIB: Element Interconnect Bus

Multicore (9)

Cavium Octeon CN3860
• System-on-a-chip for networking equipment
• 16 MIPS cores (integer only)

Multicore (10)

Tilera Tile64
• Homogeneous
• Mesh-based
• Independent L2 cache

Multicore (11)

ARM Cortex-A9 MPCore
• 1 ~ 4 cores
• ARMv7 architecture
• L1: 16KB, 32KB, or 64KB

Parallel Architectures (1)

SMP: shared memory multiprocessors
• Hardware provides single physical address space for all processors
• Synchronize shared variables using locks
• Memory access time: UMA vs. NUMA
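As a small illustration of synchronizing a shared variable with a lock, the sketch below has several threads increment one counter that lives in a single address space. It uses Python's standard `threading` module; the function name `parallel_count` and its parameters are illustrative only, and Python threads stand in for the processors of an SMP.

```python
import threading


def parallel_count(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write of the shared variable is the
            # critical section; only one thread may be inside at a time.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(parallel_count())  # num_threads * increments when the lock is held
```

Without the lock, the increment is a read followed by a write, so two threads can read the same old value and one update can be lost.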

Parallel Architectures (2)

Distributed memory architecture
• Each processor has private physical address space
• Hardware sends/receives messages between processors
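The message-passing style can be sketched as follows. This is only a model: Python threads share one address space, so the private memories are imitated by giving each node its own slice of the data, and a `queue.Queue` stands in for the interconnect. The names `distributed_sum`, `node`, and `mailbox` are hypothetical.

```python
import queue
import threading


def distributed_sum(data, num_nodes=4):
    """Each node sums its private slice and sends the partial sum to node 0."""
    mailbox = queue.Queue()  # models the link that delivers messages to node 0

    def node(rank, private_slice):
        partial = sum(private_slice)   # touches only this node's own data
        mailbox.put((rank, partial))   # "send" the result as a message

    # Distribute the data: node r owns elements r, r + num_nodes, ...
    slices = [data[r::num_nodes] for r in range(num_nodes)]
    workers = [
        threading.Thread(target=node, args=(r, slices[r]))
        for r in range(num_nodes)
    ]
    for w in workers:
        w.start()

    total = 0
    for _ in range(num_nodes):
        _rank, partial = mailbox.get()  # "receive": blocks until a message arrives
        total += partial
    for w in workers:
        w.join()
    return total


if __name__ == "__main__":
    print(distributed_sum(list(range(100))))
```

No node ever reads another node's slice; the only communication is the explicit send and receive, which is the defining property of the distributed memory model.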

Parallel Architectures (3)

Clusters
• A network of independent computers
• Each node has private memory and OS
• Connected using I/O system
  – E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
  – Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
  – Administration cost (prefer virtual machines)
  – Low interconnect bandwidth
    » cf. processor/memory bandwidth on an SMP

Data-Level Parallelism

SIMD (Single-Instruction Multiple-Data) architecture
• Operate element-wise on vectors of data
  – E.g., MMX and SSE instructions in x86: Multiple data elements in 128-bit wide registers
• All processors execute the same instruction at the same time
  – Each with different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications

Vector Processors (1)

Characteristics
• Highly pipelined function units
• Stream data from/to vector registers to units
  – Data collected from memory into registers
  – Results stored from registers to memory

Example: Vector extension to MIPS
• 32 x 64-element registers (64-bit elements)
• Vector instructions
  – lv, sv: load/store vector
  – addv.d: add vectors of double
  – addvs.d: add scalar to each element of vector of double
• Significantly reduces instruction-fetch bandwidth

Vector Processors (2)

Example: DAXPY (Y = a * X + Y)
• Conventional MIPS code

      l.d   $f0,a($sp)      ;load scalar a
      addiu r4,$s0,#512     ;upper bound of what to load
loop: l.d   $f2,0($s0)      ;load x(i)
      mul.d $f2,$f2,$f0     ;a × x(i)
      l.d   $f4,0($s1)      ;load y(i)
      add.d $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d   $f4,0($s1)      ;store into y(i)
      addiu $s0,$s0,#8      ;increment index to x
      addiu $s1,$s1,#8      ;increment index to y
      subu  $t0,r4,$s0      ;compute bound
      bne   $t0,$zero,loop  ;check if done

Vector Processors (3)

Example: DAXPY (Y = a * X + Y) (cont’d)
• Vector MIPS code

      l.d     $f0,a($sp)    ;load scalar a
      lv      $v1,0($s0)    ;load vector x
      mulvs.d $v2,$v1,$f0   ;vector-scalar multiply
      lv      $v3,0($s1)    ;load vector y
      addv.d  $v4,$v2,$v3   ;add y to product
      sv      $v4,0($s1)    ;store the result
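The two DAXPY listings compute the same result; what changes is the number of instructions executed. The sketch below mirrors both in Python and counts dynamic instructions under the assumptions visible in the listings: 64 elements of 8 bytes each (the `#512` bound), 2 setup instructions plus 9 per loop iteration in the conventional code, and 6 instructions in the vector code. The function names are illustrative.

```python
VECTOR_LENGTH = 64  # elements per vector register in the vector MIPS extension


def daxpy_scalar(a, x, y):
    """One element per iteration, like the conventional MIPS loop."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def daxpy_vector(a, x, y):
    """One whole vector per step, mirroring lv / mulvs.d / lv / addv.d / sv."""
    if not (len(x) == len(y) == VECTOR_LENGTH):
        raise ValueError("this sketch handles exactly one full vector")
    v1 = list(x)                            # lv      $v1
    v2 = [a * e for e in v1]                # mulvs.d $v2
    v3 = list(y)                            # lv      $v3
    v4 = [p + q for p, q in zip(v2, v3)]    # addv.d  $v4
    y[:] = v4                               # sv      $v4
    return y


def dynamic_instruction_counts(n=VECTOR_LENGTH):
    """Instructions executed by the two listings for n <= 64 elements."""
    scalar = 2 + 9 * n   # l.d + addiu, then 9 instructions per iteration
    vector = 6           # the six instructions of the vector listing
    return scalar, vector


if __name__ == "__main__":
    x = [float(i) for i in range(VECTOR_LENGTH)]
    y1 = [1.0] * VECTOR_LENGTH
    y2 = [1.0] * VECTOR_LENGTH
    assert daxpy_scalar(2.0, x, y1) == daxpy_vector(2.0, x, y2)
    print(dynamic_instruction_counts())
```

For one 64-element vector this gives 578 instructions for the conventional loop against 6 for the vector version, which is the reduction in instruction-fetch bandwidth the slides refer to.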

Vector Processors (4)

Vector architectures and compilers
• Simplify data-parallel programming
• Explicit statement of absence of loop-carried dependencies
  – Reduced checking in hardware
• Regular access patterns benefit from interleaved and burst memory
• Avoid control hazards by avoiding loops
• More general than ad-hoc media extensions (such as MMX, SSE)
  – Better match with compiler technology

MMX (1)

Intel MMX Technology
• Introduced in Pentium MMX and Pentium II
• SIMD (Single-Instruction Multiple-Data) execution model
• For media and communications applications
• Eight 64-bit MMX registers (MM0-MM7) are shared with x87 FPU registers (R0-R7)
• Three new packed data types

MMX (2)

SIMD execution model
• Add, Subtract, Multiply, Multiply and Add, …
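To make the packed execution model concrete, the sketch below treats a Python integer as a 64-bit MMX register holding eight unsigned bytes and adds two registers lane by lane. The two variants are modeled on the wraparound byte add (PADDB) and the unsigned saturating byte add (PADDUSB); the helper names are invented for this example, and the code emulates the lane arithmetic only, not the instructions' encoding or timing.

```python
LANES = 8          # eight packed bytes in one 64-bit register
LANE_MASK = 0xFF


def unpack_bytes(reg64):
    """Split a 64-bit value into eight byte lanes, lane 0 = least significant."""
    return [(reg64 >> (8 * i)) & LANE_MASK for i in range(LANES)]


def pack_bytes(lanes):
    """Combine eight byte lanes back into one 64-bit value."""
    reg = 0
    for i, b in enumerate(lanes):
        reg |= (b & LANE_MASK) << (8 * i)
    return reg


def packed_add_wrap(a, b):
    """Lane-wise add with wraparound; carries never cross lane boundaries."""
    lanes = [(x + y) & LANE_MASK for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


def packed_add_saturate(a, b):
    """Lane-wise add that clamps each lane at 0xFF instead of wrapping."""
    lanes = [min(x + y, LANE_MASK) for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


if __name__ == "__main__":
    a = 0x01020304050607F0
    b = 0x1010101010101020
    print(hex(packed_add_wrap(a, b)))      # low lane wraps: 0xF0 + 0x20 -> 0x10
    print(hex(packed_add_saturate(a, b)))  # low lane clamps: 0xF0 + 0x20 -> 0xFF
```

Saturation is the reason such instructions suit media code: an overflowing pixel value stays at full brightness instead of wrapping around to a dark one.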

SSE (1)

Streaming SIMD Extensions
• Introduced in Pentium III
• For advanced 2-D/3-D graphics, motion video, image processing, speech recognition, audio synthesis, telephony, and video conferencing
• Eight new 128-bit XMM registers (XMM0-XMM7)
• New 32-bit MXCSR register
  – Control and status bits for operations on XMM registers
• New 128-bit packed single-precision FP data type

SSE (2)

SSE packed/scalar single-precision FP operations
• Add, Multiply, Divide, Reciprocal, Square root, Reciprocal of square root, Max, Min

[Figure: Scalar single-precision FP operation]

SSE logical operations
• And, Or, Xor
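The packed and scalar forms differ only in how many lanes they touch. The sketch below models an XMM register as a list of four single-precision values (index 0 is the lowest lane) and emulates the behavior of the packed add (ADDPS) and the scalar add (ADDSS). The helper `f32` and the list representation are assumptions of this example; single precision is imitated by round-tripping through a 4-byte float with the standard `struct` module.

```python
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(dest, src):
    """Packed add: all four lanes are added independently."""
    return [f32(f32(d) + f32(s)) for d, s in zip(dest, src)]


def addss(dest, src):
    """Scalar add: only lane 0 is added; lanes 1-3 pass through from dest."""
    return [f32(f32(dest[0]) + f32(src[0]))] + [f32(d) for d in dest[1:]]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [10.0, 20.0, 30.0, 40.0]
    print(addps(a, b))  # every lane changes
    print(addss(a, b))  # only the lowest lane changes
```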

SSE (3)

SSE shuffle operation
• Any two from src
• Any two from dest

SSE unpack operation
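The lane selection performed by shuffle and unpack is easier to see in code. The sketch below emulates the single-precision shuffle (SHUFPS) and the low/high unpack (UNPCKLPS, UNPCKHPS) on four-element lists, with index 0 as the lowest lane. The list representation and the Python function names are assumptions of this example; the selection rules follow the instructions' documented behavior, where the low two result lanes come from the destination operand and the high two from the source.

```python
def shufps(dest, src, imm8):
    """Result lanes 0-1 are picked from dest, lanes 2-3 from src.

    Each 2-bit field of imm8 selects one of the four lanes of its operand.
    """
    return [
        dest[imm8 & 0b11],
        dest[(imm8 >> 2) & 0b11],
        src[(imm8 >> 4) & 0b11],
        src[(imm8 >> 6) & 0b11],
    ]


def unpcklps(dest, src):
    """Interleave the two low lanes of dest and src."""
    return [dest[0], src[0], dest[1], src[1]]


def unpckhps(dest, src):
    """Interleave the two high lanes of dest and src."""
    return [dest[2], src[2], dest[3], src[3]]


if __name__ == "__main__":
    d = ["d0", "d1", "d2", "d3"]
    s = ["s0", "s1", "s2", "s3"]
    print(shufps(d, s, 0b11_10_01_00))  # ['d0', 'd1', 's2', 's3']
    print(shufps(d, s, 0b00_01_10_11))  # ['d3', 'd2', 's1', 's0']
    print(unpcklps(d, s))               # ['d0', 's0', 'd1', 's1']
    print(unpckhps(d, s))               # ['d2', 's2', 'd3', 's3']
```

Passing the same register as both operands to `shufps` gives an arbitrary permutation or broadcast of its four lanes, for example `shufps(d, d, 0)` replicates lane 0 everywhere.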

SSE2

Streaming SIMD Extensions 2 (SSE2)
• Introduced in Pentium 4 and Intel Xeon
• For advanced 3-D graphics, speech recognition, video encoding/decoding, E-commerce, Internet, scientific and engineering applications
• Six new data types

SSE3, SSSE3, SSE4

Streaming SIMD Extensions 3 (SSE3)
• Introduced in Pentium 4 (Prescott)
• 13 new instructions

Supplemental SSE3 (SSSE3)
• Introduced in Intel Xeon 5100 and Intel Core 2
• 16 new instructions each on MMX or XMM registers

Streaming SIMD Extensions 4 (SSE4)
• SSE4.1: 47 new instructions (Penryn)
• SSE4.2: 7 new instructions (Core i7)

Summary

Flynn’s taxonomy on parallel architectures

                                  Data Streams
                                  Single                   Multiple
Instruction Streams   Single      SISD: Uniprocessors      SIMD: Vector processors, Multimedia extension (Intel MMX/SSE, ...)
                      Multiple    MISD: Systolic array?    MIMD (cf. SPMD): Multicore multiprocessors, MPP, SMP, Cluster, ...