Parallel Computer Architectures

Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
Exploiting Parallelism

Instruction-Level Parallelism
• Pipelining, superscalar, OOO, branch prediction, VLIW, …

Task-Level (or Thread-Level) Parallelism
• Simultaneous Multithreading
• Multicore
• Multiprocessors
• Clusters

Data-Level Parallelism
• Vector architecture
• SIMD instructions
ICE3003: Computer Architecture | Fall 2009 | Jin-Soo Kim ([email protected])
Hardware Multithreading (1)

Run multiple threads of execution in parallel
• Replicate registers, PC, etc.
• Fast switching between threads

Fine-grain multithreading
• Switch threads after each cycle
• Interleave instruction execution
• If one thread stalls, others are executed

Coarse-grain multithreading
• Only switch on long stall (e.g., L2-cache miss)
• Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
Hardware Multithreading (2)

Simultaneous multithreading (SMT)
• In a multiple-issue, dynamically scheduled processor
  – Schedule instructions from multiple threads
  – Instructions from independent threads execute when function units are available
  – Within threads, dependencies handled by scheduling and register renaming
• Makes a single physical processor appear as multiple logical processors
• Uses processor resources more effectively
• Example: Intel Pentium 4 HT
  – Two threads: duplicated registers, shared function units and caches
Hardware Multithreading (3)

Intel HyperThreading
[Block diagram: logical processor 0 and logical processor 1, each with its own architectural state (registers); execution units, cache(s), and the system bus are shared.]
Hardware Multithreading (4)

Examples
Hardware Multithreading (5)

Future of multithreading
• Will it survive? In what form?
• Power considerations ⇒ simplified microarchitectures
  – Simpler forms of multithreading
  – e.g., Sun UltraSPARC T2: fine-grained multithreading
• Tolerating cache-miss latency
  – Thread switch may be most effective
• Multiple simple cores might share resources more effectively
Multicore (1)

Multicore is mainstream
Multicore (2)
Multicore (3)

Memory wall
• CPU 55%/year, Memory 10%/year (1986~2000)
• Caches show diminishing returns

ILP wall
• Control dependency
• Data dependency

Power wall
• Dynamic power ∝ Frequency³
• Static power ∝ Frequency
• Total power ∝ The number of cores
Multicore (4)

More MIPS/watt (source: Intel)
• Single-core, clock raised 20%: 1.13x performance at 1.73x power
• Single-core baseline: 1.00x performance at 1.00x power
• Single-core, clock lowered 20%: 0.87x performance at 0.51x power
• Dual-core, clock lowered 20%: 1.73x performance at 1.02x power
Multicore (5)

Core types
• Homogeneous
• Heterogeneous

Cache organization
• Shared L2+ cache
• Independent L2+ cache

On-die interconnects
• Bus
• Crossbar
• Ring, Mesh, Torus, …
Multicore (6)

Intel dual-core processors
• Homogeneous cores
• Bus-based
• Different cache configurations
Multicore (7)

Sun UltraSPARC T2 (Niagara2)
• Homogeneous
• Bus-based
• Shared L2 cache
• 8 SPARC cores, 8 threads / core
• 16KB I$ per core, 8KB D$ per core
• 4MB shared L2, 8 banks, 16-way
Multicore (8)

STI Cell Broadband Engine
• Heterogeneous
• Ring-based
• One PPE based on 64-bit PowerPC
• Eight SPEs, each with 256KB Local Store
• Globally coherent DMA
• EIB: Four 16B rings

PPE: Power Processor Element
SPE: Synergistic Processing Element
EIB: Element Interconnect Bus
Multicore (9)

Cavium Octeon CN3860
• System-on-a-chip for networking equipment
• 16 MIPS cores (integer only)
Multicore (10)

Tilera Tile64
• Homogeneous
• Mesh-based
• Independent L2 cache
Multicore (11)

ARM Cortex-A9 MPCore
• 1 ~ 4 cores
• ARMv7 architecture
• L1: 16KB, 32KB, or 64KB
Parallel Architectures (1)

SMP: shared memory multiprocessors
• Hardware provides single physical address space for all processors
• Synchronize shared variables using locks
• Memory access time: UMA vs. NUMA
Parallel Architectures (2)

Distributed memory architecture
• Each processor has private physical address space
• Hardware sends/receives messages between processors
Parallel Architectures (3)

Clusters
• A network of independent computers
• Each node has private memory and OS
• Connected using I/O system
  – E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
  – Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
  – Administration cost (prefer virtual machines)
  – Low interconnect bandwidth
    » cf. processor/memory bandwidth on an SMP
Data-Level Parallelism

SIMD (Single-Instruction Multiple-Data) architecture
• Operate element-wise on vectors of data
  – E.g., MMX and SSE instructions in x86: multiple data elements in 128-bit wide registers
• All processors execute the same instruction at the same time
  – Each with different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications
Vector Processors (1)

Characteristics
• Highly pipelined function units
• Stream data from/to vector registers to units
  – Data collected from memory into registers
  – Results stored from registers to memory

Example: Vector extension to MIPS
• 32 x 64-element registers (64-bit elements)
• Vector instructions
  – lv, sv: load/store vector
  – addv.d: add vectors of double
  – addvs.d: add scalar to each element of vector of double
• Significantly reduces instruction-fetch bandwidth
Vector Processors (2)

Example: DAXPY (Y = a * X + Y)
• Conventional MIPS code

        l.d    $f0,a($sp)       ;load scalar a
        addiu  r4,$s0,#512      ;upper bound of what to load
  loop: l.d    $f2,0($s0)       ;load x(i)
        mul.d  $f2,$f2,$f0      ;a × x(i)
        l.d    $f4,0($s1)       ;load y(i)
        add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
        s.d    $f4,0($s1)       ;store into y(i)
        addiu  $s0,$s0,#8       ;increment index to x
        addiu  $s1,$s1,#8       ;increment index to y
        subu   $t0,r4,$s0       ;compute bound
        bne    $t0,$zero,loop   ;check if done
Vector Processors (3)

Example: DAXPY (Y = a * X + Y) (cont'd)
• Vector MIPS code

        l.d      $f0,a($sp)     ;load scalar a
        lv       $v1,0($s0)     ;load vector x
        mulvs.d  $v2,$v1,$f0    ;vector-scalar multiply
        lv       $v3,0($s1)     ;load vector y
        addv.d   $v4,$v2,$v3    ;add y to product
        sv       $v4,0($s1)     ;store the result
Vector Processors (4)

Vector architectures and compilers
• Simplify data-parallel programming
• Explicit statement of absence of loop-carried dependencies
  – Reduced checking in hardware
• Regular access patterns benefit from interleaved and burst memory
• Avoid control hazards by avoiding loops
• More general than ad-hoc media extensions (such as MMX, SSE)
  – Better match with compiler technology
MMX (1)

Intel MMX Technology
• Introduced in Pentium MMX and Pentium II
• SIMD (Single-Instruction Multiple-Data) execution model
• For media and communications applications
• Eight 64-bit MMX registers (MM0-MM7) are shared with x87 FPU registers (R0-R7)
• Three new packed data types
MMX (2)

SIMD execution model
• Add, Subtract, Multiply, Multiply and Add, …
SSE (1)

Streaming SIMD Extensions
• Introduced in Pentium III
• For advanced 2-D/3-D graphics, motion video, image processing, speech recognition, audio synthesis, telephony, and video conferencing
• Eight new 128-bit XMM registers (XMM0-XMM7)
• New 32-bit MXCSR register
  – Control and status bits for operations on XMM registers
• New 128-bit packed single-precision FP data type
SSE (2)

SSE packed/scalar single-precision FP operations
• Add, Multiply, Divide, Reciprocal, Square root, Reciprocal of square root, Max, Min
• Scalar single-precision FP operation

SSE logical operations
• And, Or, Xor
SSE (3)

SSE shuffle operation
• Any two from src
• Any two from dest

SSE unpack operation
SSE2

Streaming SIMD Extensions 2 (SSE2)
• Introduced in Pentium 4 and Intel Xeon
• For advanced 3-D graphics, speech recognition, video encoding/decoding, E-commerce, Internet, scientific and engineering applications
• Six new data types
SSE3, SSSE3, SSE4

Streaming SIMD Extensions 3 (SSE3)
• Introduced in Pentium 4 (Prescott)
• 13 new instructions

Supplemental SSE3 (SSSE3)
• Introduced in Intel Xeon 5100 and Intel Core 2
• 16 new instructions, each on MMX or XMM registers

Streaming SIMD Extensions 4 (SSE4)
• SSE4.1: 47 new instructions (Penryn)
• SSE4.2: 7 new instructions (Core i7)
Summary

Flynn's taxonomy on parallel architectures

                           |  Data Streams
                           |  Single           |  Multiple
  -------------------------+-------------------+------------------------------
  Instruction     Single   |  SISD:            |  SIMD:
  Streams                  |  Uniprocessors    |  Vector processors,
                           |                   |  Multimedia extension
                           |                   |  (Intel MMX/SSE, ...)
  -------------------------+-------------------+------------------------------
                  Multiple |  MISD:            |  MIMD (cf. SPMD):
                           |  Systolic array?  |  Multicore multiprocessors,
                           |                   |  MPP, SMP, Cluster, ...