
Solihin Chapter 1 slides


Chapter 1: Perspectives

Copyright @ 2005-2008 Yan Solihin

Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: "Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008".

Fundamentals of Computer Architecture - Chapter 1 2

Evolution in Microprocessors


Fundamentals of Computer Architecture - Chapter 1 3

Key Points
- Increasingly more components can be integrated on a single chip
- Speed of integration tracks Moore's law: doubling every 18-24 months
- Performance tracked the speed of integration until recently
- At the architecture level, there are two techniques:
  - Instruction Level Parallelism
  - Cache Memory
- The performance gain from uniprocessor systems was so significant that it made multiprocessor systems unprofitable

Fundamentals of Computer Architecture - Chapter 1 4

Illustration
- A 100-processor system with perfect speedup, compared to a single-processor system:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - ...
  - Year 10: 0.9x faster
  (the gap shrinks by roughly the annual uniprocessor growth rate, about 1.6x per year: 100/1.6 = 62.5, 62.5/1.6 = 39, ...)
- Single-processor performance catches up in just a few years!
- Even worse:
  - It takes longer to develop a multiprocessor system
  - Low volume means prices must be very high
  - High prices delay adoption
  - Perfect speedup is unattainable


Fundamentals of Computer Architecture - Chapter 1 5

Why did uniprocessor performance grow so fast?
- ~ half from circuit improvement (smaller transistors, faster clock, etc.)
- ~ half from architecture/organization:
  - Instruction Level Parallelism (ILP)
    - Pipelining: RISC, CISC with RISC back-end
    - Superscalar
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploiting spatial and temporal locality
    - Multiple cache levels

Fundamentals of Computer Architecture - Chapter 1 6

But uniprocessor performance growth is stalling
- Source of uniprocessor performance growth: instruction-level parallelism (ILP)
  - Parallel execution of independent instructions from a single thread
- ILP growth has slowed abruptly
  - Memory wall: processor speed grows at 55%/year, memory speed grows at 7%/year
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)
- Power efficiency
  - Thermal packaging limit vs. cost


Fundamentals of Computer Architecture - Chapter 1 7

Types of Parallelism
Instruction level (ECE 521)
- Pipelining: instructions A (a load), B, and C overlap in the five stages

    A:  IF  ID  EX  MEM  WB
    B:      IF  ID  EX   MEM  WB
    C:          IF  ID   EX   MEM  WB

Fundamentals of Computer Architecture - Chapter 1 8

Superscalar / VLIW
Original:

    LD   F0, 34(R2)
    ADDD F4, F0, F2
    LD   F7, 45(R3)
    ADDD F8, F7, F6

Schedule as:

    LD   F0, 34(R2)   |  LD   F7, 45(R3)
    ADDD F4, F0, F2   |  ADDD F8, F7, F6

+ Moderate degree of parallelism (sometimes 50)
- Requires fast communication (register level)


Fundamentals of Computer Architecture - Chapter 1 9

Why ILP is slowing
- Branch prediction accuracy is already > 90%
  - Hard to improve it further
- Pipelines are already deep (~20-30 stages)
  - But critical dependence loops do not change
  - Memory latency requires more clock cycles to satisfy
- Processor width is already high
  - Quadratically increasing complexity to increase the width
- Cache size
  - Effective, but also shows diminishing returns
  - In general, the size must be doubled to cut the miss rate in half

Fundamentals of Computer Architecture - Chapter 1 10

Current Trend: Multicore and Manycore

Aspects      Intel Clovertown    AMD Barcelona                          IBM Cell
# cores      4                   4                                      8+1
Clock freq   2.66 GHz            2.3 GHz                                3.2 GHz
Core type    OOO superscalar     OOO superscalar                        2-issue SIMD
Caches       2x4MB L2            512KB L2 (private), 2MB L3 (shared)    256KB local store
Chip power   120 Watts           95 Watts                               100 Watts


Fundamentals of Computer Architecture - Chapter 1 11

Historical Perspectives
- 80s to early 90s: prime time for parallel architecture research
  - A microprocessor could not fit on one chip, so multiple chips (and processors) were naturally needed
  - J-machine, M-machine, Alewife, Tera, HEP, etc.
- 90s: at the low end, uniprocessor systems' speed grew much faster than parallel systems' speed
  - A microprocessor fits on a chip. So do the branch predictor, multiple functional units, large caches, etc.!
  - The microprocessor also exploits parallelism (pipelining, multiple issue, VLIW), parallelism originally invented for multiprocessors
  - Many parallel computer vendors went bankrupt
  - Prestigious but small high-performance computing market

Fundamentals of Computer Architecture - Chapter 1 12

"If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get ½ million miles per gallon and it would be cheaper to throw it away than to park it."

Gordon Moore, Intel Corporation


Fundamentals of Computer Architecture - Chapter 1 13

- 90s: emergence of distributed (vs. parallel) machines
  - Progress in network technologies:
    - Network bandwidth grew faster than Moore's law
    - Fast interconnection networks became cheap
  - Connect cheap uniprocessor systems into a large distributed machine
  - Networks of Workstations, Clusters, GRID
- 00s: parallel architectures are back
  - Transistors per chip >> transistors per microprocessor
  - Harder to get more performance from a uniprocessor
  - SMT (Simultaneous Multithreading), CMP (Chip Multiprocessor), ultimately massive CMP
  - E.g. Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara, etc.

Fundamentals of Computer Architecture - Chapter 1 14

What is a Parallel Architecture?

"A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast."

-- Almasi & Gottlieb


Fundamentals of Computer Architecture - Chapter 1 15

Parallel computers
- A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast. [Almasi & Gottlieb]
- "collection of processing elements"
  - How many? How powerful is each? Scalability?
  - A few very powerful ones (e.g., Altix) vs. many small ones (BlueGene)
- "that can communicate"
  - How do PEs communicate? (shared memory vs. message passing)
  - Interconnection network (bus, multistage, crossbar, ...)
  - Evaluation criteria: cost, latency, throughput, scalability, and fault tolerance

Fundamentals of Computer Architecture - Chapter 1 16

"and cooperate"
- Issues: granularity, synchronization, and autonomy
- Synchronization allows sequencing of operations to ensure correctness
- Granularity up => parallelism down, communication down, overhead down
  - Statement/instruction level: 2-10 instructions (ECE 521)
  - Loop level: 10-1K instructions
  - Task level: 1K-1M instructions
  - Program level: > 1M instructions
- Autonomy
  - SIMD (single instruction stream) vs. MIMD (multiple instruction streams)


Fundamentals of Computer Architecture - Chapter 1 17

"solve a large problem fast"
- General vs. special-purpose machine?
- Any machine can solve certain problems well
- What domains?
  - Highly (embarrassingly) parallel apps: many scientific codes
  - Medium parallel apps: many engineering apps (finite elements, VLSI CAD)
  - Not parallel apps: compilers, editors (do we care?)

Fundamentals of Computer Architecture - Chapter 1 18

Why parallel computers?
- Absolute performance: can we afford to wait?
  - Folding of a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
  - Weather forecasting: timeliness is crucial
- Cost/performance
  - Harder to improve performance on a single processor
  - Bigger monolithic processor vs. many simple processors
- Power/performance
- Reliability and availability
- Key enabling technologies:
  - Advances in microprocessor and interconnect technology
  - Advances in software technology


Fundamentals of Computer Architecture - Chapter 1 19

Scope of CSC/ECE 506
- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
- Programming models:
  - Shared memory
  - Message passing
  - Hybrid

Fundamentals of Computer Architecture - Chapter 1 20

Loop-level parallelism
- Each iteration can be computed independently:

    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];

- Here each iteration cannot be computed independently, so the loop has no loop-level parallelism:

    for (i=0; i<8; i++)
        a[i] = b[i] + a[i-1];

+ Very high parallelism (> 1K)
+ Often easy to achieve load balance
- Some loops are not parallel
- Some apps do not have many loops

(A parallelization sketch of the first loop follows below.)
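As a minimal parallelization sketch of the first loop, using OpenMP (named later in these slides as a shared-memory tool; the program itself is illustrative, not part of the slides):

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

        /* the iterations are independent, so the runtime may
           distribute them across processors */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        for (int i = 0; i < N; i++)
            printf("a[%d] = %.0f\n", i, a[i]);
        return 0;
    }

Compile with, e.g., gcc -fopenmp. The second loop cannot be annotated this way: the read of a[i-1] creates a dependence between consecutive iterations.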


Fundamentals of Computer Architecture - Chapter 1 21

Task-level parallelism
- Arbitrary code segments in a single program
- Across loops:

    ...
    for (i=0; i<n; i++)
        sum = sum + a[i];
    for (i=0; i<n; i++)
        prod = prod * a[i];
    ...

- Across subroutines:

    Cost = getCost();
    A = computeSum();
    B = A + Cost;

- Threads: e.g. an editor: GUI, printing, parsing
+ Larger granularity => low overheads, little communication
- Low degree of parallelism
- Hard to balance

(An OpenMP sections sketch of the across-loop case follows below.)
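A minimal sketch of the across-loop case with OpenMP sections (illustrative, not from the slides): the sum and product loops touch disjoint accumulators, so each can run as its own task:

    #include <stdio.h>

    #define N 100

    int main(void) {
        double a[N], sum = 0.0, prod = 1.0;
        for (int i = 0; i < N; i++)
            a[i] = 1.0 + 1.0 / (i + 1);

        #pragma omp parallel sections
        {
            #pragma omp section        /* task 1: the sum loop */
            for (int i = 0; i < N; i++)
                sum = sum + a[i];

            #pragma omp section        /* task 2: the product loop */
            for (int i = 0; i < N; i++)
                prod = prod * a[i];
        }

        printf("sum = %f, prod = %f\n", sum, prod);
        return 0;
    }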

Fundamentals of Computer Architecture - Chapter 1 22

Program-level parallelism
- Various independent programs execute together
- gmake:

    gcc -c code1.c   // assign to proc1
    gcc -c code2.c   // assign to proc2
    gcc -c main.c    // assign to proc3
    gcc main.o code1.o code2.o

+ No communication
- Hard to balance
- Few opportunities

(A Makefile sketch follows below.)
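As a usage sketch (the file names come from the slide; the Makefile itself is assumed), gmake discovers this parallelism from the dependence graph, and make -j3 runs the three independent compilations concurrently:

    # Makefile sketch; recipe lines must be indented with tabs
    main: main.o code1.o code2.o
            gcc -o main main.o code1.o code2.o

    main.o: main.c
            gcc -c main.c

    code1.o: code1.c
            gcc -c code1.c

    code2.o: code2.c
            gcc -c code2.c

The final link step still waits for all three objects, so the parallelism is limited to the independent compiles.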


Fundamentals of Computer Architecture - Chapter 1 23

Scope of CSC/ECE 506
- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - *Shared memory machines (SMP and DSM)
    - Clusters
- Programming models:
  - Shared memory
  - Message passing
  - Hybrid

Fundamentals of Computer Architecture - Chapter 1 24

Taxonomy of Parallel Computers
The Flynn taxonomy classifies machines along two axes:
- Single or multiple instruction streams
- Single or multiple data streams

1. SISD machine (most desktops and laptops)
- Only one instruction fetch stream
- Most of today's workstations or desktops

[Figure: a single control unit feeds one instruction stream to one ALU, which operates on one data stream]


Fundamentals of Computer Architecture - Chapter 1 25

SIMD
- Examples: vector processors, SIMD extensions (MMX)
- A single instruction operates on multiple data items
- Pseudo-SIMD is popular for multimedia extensions

[Figure: one control unit broadcasts a single instruction stream to ALUs 1 through n, each operating on its own data stream]

SISD:
    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];

SIMD:
    a = b + c; // vector addition
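As a hedged illustration of the pseudo-SIMD multimedia-extension style on x86 (using SSE rather than the MMX named above; the program is not from the slides), the SISD loop can be rewritten so that one instruction adds four floats at a time:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float b[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        float c[8] = {7, 6, 5, 4, 3, 2, 1, 0};
        float a[8];

        /* two iterations of four-wide adds replace eight scalar adds */
        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
        }

        for (int i = 0; i < 8; i++)
            printf("%.0f ", a[i]);
        printf("\n");
        return 0;
    }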

Fundamentals of Computer Architecture - Chapter 1 26

MISD machine
- Example: CMU Warp
- Systolic arrays

[Figure: control units 1 through n each issue their own instruction stream to ALUs 1 through n, which pass a single data stream from one ALU to the next]


Fundamentals of Computer Architecture - Chapter 1 27

Systolic Arrays (contd.)
- Practical realizations (e.g. iWARP) use quite general processors
  - Enable a variety of algorithms on the same hardware
- But dedicated interconnect channels
  - Data transfer directly from register to register across a channel
- Specialized, and same problems as SIMD
  - General-purpose systems work well for the same algorithms (locality etc.)

Example: systolic array for 1-D convolution

    y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)

[Figure: inputs x1..x8 stream through four cells holding weights w4, w3, w2, w1; outputs y1, y2, y3 emerge. Each cell latches x and accumulates y:
    xout = x
    x = xin
    yout = yin + w * xin ]
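For reference, a short sequential sketch (not in the slides) of the convolution the array computes; in the systolic version, each of the four cells performs one multiply-accumulate per step as the x values stream past:

    #include <stdio.h>

    #define N 8
    #define TAPS 4

    int main(void) {
        float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        float w[TAPS] = {0.25f, 0.25f, 0.25f, 0.25f};  /* illustrative weights */
        float y[N - TAPS + 1];

        /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
        for (int i = 0; i < N - TAPS + 1; i++) {
            y[i] = 0.0f;
            for (int j = 0; j < TAPS; j++)
                y[i] += w[j] * x[i + j];
        }

        for (int i = 0; i < N - TAPS + 1; i++)
            printf("y[%d] = %.2f\n", i, y[i]);
        return 0;
    }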

Fundamentals of Computer Architecture - Chapter 1 28

MIMD machine
- Independent processors connected together to form a multiprocessor system
- Physical organization:
  - Determines which memory hierarchy level is shared
- Programming abstraction:
  - Shared memory:
    - On a chip: Chip Multiprocessor (CMP)
    - Interconnected by a bus: Symmetric Multiprocessor (SMP)
    - Point-to-point interconnection: Distributed Shared Memory (DSM)
  - Distributed memory:
    - Clusters, Grid


Fundamentals of Computer Architecture - Chapter 1 29

MIMD Physical Organization

Shared Cache Architecture:
[Figure: two processors P share the caches and memory M]
- CMP (or Simultaneous Multi-Threading)
- e.g.: Pentium 4 chip, IBM Power4 chip, SUN Niagara, Pentium D, etc.
- Implies shared memory hardware

UMA (Uniform Memory Access) Shared Memory:
[Figure: processors P, each with private caches, connect through a network to memory M]
- Pentium Pro Quad, Sun Enterprise, etc.
- What interconnection network?
  - Bus
  - Multistage
  - Crossbar
  - etc.
- Implies shared memory hardware

Fundamentals of Computer Architecture - Chapter 1 30

MIMD Physical Organization (2)

NUMA (Non-Uniform Memory Access) Shared Memory:
[Figure: each node pairs a processor P and its caches with a local memory M; nodes connect through a network]
- SGI Origin, Altix, IBM p690, AMD Hammer-based systems
- What interconnection network?
  - Crossbar
  - Mesh
  - Hypercube
  - etc.
- Also referred to as Distributed Shared Memory


Fundamentals of Computer Architecture - Chapter 1 31

MIMD Physical Organization (3)

Distributed System/Memory:
[Figure: each node has a processor P with caches, local memory M, and its own I/O; nodes connect through a network]
- Also called clusters, grid
- Don't confuse it with distributed shared memory

Fundamentals of Computer Architecture - Chapter 1 32

Parallel vs. Distributed Computers

[Figure: two plots against system size, one of cost and one of performance, each comparing parallel and distributed computers]

- Small-scale machines: a parallel system is cheaper
- Large-scale machines: a distributed system is cheaper
- Performance: a parallel system is better (but more expensive)
- System size: a parallel system is limited, and its cost grows fast
- However, one must also consider software cost


Fundamentals of Computer Architecture - Chapter 1 33

Scope of CSC/ECE 506
- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - MIMD
    - Shared memory machines (SMP and DSM)
- Programming models:
  - Shared memory
  - Message passing
  - Hybrid (e.g., UPC)
  - Data parallel

Fundamentals of Computer Architecture - Chapter 1 34

Programming Models
- Shared Memory / Shared Address Space:
  - Each processor can see the entire memory
  - Programming model = thread programming in uniprocessor systems

[Figure: processors P connect to a single shared memory M]

(A POSIX threads sketch follows below.)
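A minimal sketch of the shared-address-space model with POSIX threads (illustrative; the slides only state the model): both threads read and write the same global arrays, with no explicit communication:

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    double a[N], b[N], c[N];   /* one shared address space: visible to all threads */

    /* each thread computes its half of a[] = b[] + c[] */
    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (int i = id * N / 2; i < (id + 1) * N / 2; i++)
            a[i] = b[i] + c[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        for (int i = 0; i < N; i++)
            printf("%.0f ", a[i]);
        printf("\n");
        return 0;
    }

Compile with, e.g., gcc -pthread.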


Fundamentals of Computer Architecture - Chapter 1 35

- Distributed Memory / Message Passing / Multiple Address Spaces:
  - A processor can only directly access its own local memory. All communication happens by explicit messages.

[Figure: each processor P has its own private memory M; there is no shared memory]

(An MPI sketch follows below.)
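A minimal message-passing sketch using MPI (named later in these slides for the Earth Simulator; the program is illustrative): rank 1 cannot read rank 0's memory, so the value must travel in an explicit message:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double x = 3.14;   /* lives only in rank 0's local memory */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double x;
            /* rank 1 must receive the value explicitly */
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }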

Fundamentals of Computer Architecture - Chapter 1 36

Shared Memory compared to Message Passing
+ Can easily be automated (parallelizing compiler, OpenMP)
+ Shared variables are not communicated, but must be guarded
- How to provide shared memory? Complex hardware
- Synchronization overhead grows fast with more processors
+/- Difficult to debug; not intuitive for users


Fundamentals of Computer Architecture - Chapter 1 37

Data Parallel Programming Paradigm & Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically a single thread of control, performing sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model
  - Array of many simple, cheap processors, each with little memory
    - Processors don't sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations
  - Matches simple differential equation solvers
  - Centralizes the high cost of instruction fetch/sequencing

Fundamentals of Computer Architecture - Chapter 1 38

Application of Data Parallelism
- Each PE contains an employee record with his/her salary

    if salary > 100K then
        salary = salary * 1.05
    else
        salary = salary * 1.10

- Logically, the whole operation is a single step
- Some processors are enabled for the arithmetic operation, others disabled
- Other examples:
  - Finite differences, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Some recent machines:
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - Maspar MP-1 and MP-2

(A data-parallel sketch follows below.)
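A hedged sketch of the salary update in data-parallel style, simulated here with an OpenMP loop (illustrative, not from the slides); the per-element conditional plays the role of enabling and disabling PEs:

    #include <stdio.h>

    #define NUM_EMP 4

    int main(void) {
        double salary[NUM_EMP] = {80e3, 120e3, 95e3, 150e3};  /* assumed sample data */

        /* logically one step: every element is updated under its own predicate */
        #pragma omp parallel for
        for (int i = 0; i < NUM_EMP; i++)
            salary[i] *= (salary[i] > 100e3) ? 1.05 : 1.10;

        for (int i = 0; i < NUM_EMP; i++)
            printf("%.2f\n", salary[i]);
        return 0;
    }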


Fundamentals of Computer Architecture - Chapter 1 39

Common Today
- Systolic arrays: idea adopted in graphics and network processors
- Dataflow: idea adopted in superscalar processors
- Shared memory: most small-scale servers (up to 128 processors)
  - Now in workstations/desktops/laptops, too
- Message passing: most large-scale systems
  - Clusters, grid (hundreds to thousands of processors)
- Data parallel/SIMD:
  - Small scale: SIMD multimedia extensions (MMX, VIS)
  - Large scale: vector processors

Fundamentals of Computer Architecture - Chapter 1 40

Top 500 Supercomputers
- http://www.top500.org
- Let's look at the Earth Simulator
  - Was #1 in 2004, now #10 in 2006
- Hardware:
  - 5,120 500 MHz NEC CPUs (640 8-way nodes)
  - 8 GFLOPS per CPU (41 TFLOPS total)
    - 30+ TFLOPS sustained performance!
  - 2 GB per CPU (4 x 512 MB FPLRAM modules), 10 TB total memory
    - Shared memory inside each node
  - 640 x 640 crossbar switch between the nodes
  - 16 GB/s inter-node bandwidth
  - 20 kVA power consumption per node


Fundamentals of Computer Architecture - Chapter 1 41

Programming Model
- Within a CPU: data parallel, using automatic vectorization
  - Instruction level
- Within a node (8 CPUs): shared memory using OpenMP
  - Loop level
- Across nodes: message passing using MPI-2 or HPF
  - Algorithm level

(A hybrid MPI + OpenMP skeleton follows below.)
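A minimal hybrid skeleton of the node/inter-node split described above, OpenMP within a node and MPI across nodes (illustrative; the Earth Simulator's actual codes are not shown in the slides):

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* each MPI rank (e.g., one per node) owns a slice of the data */
        double local[N], local_sum = 0.0, global_sum;
        for (int i = 0; i < N; i++)
            local[i] = rank + 1;

        /* within the node: loop-level shared-memory parallelism */
        #pragma omp parallel for reduction(+ : local_sum)
        for (int i = 0; i < N; i++)
            local_sum += local[i];

        /* across nodes: explicit message passing */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }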

Fundamentals of Computer Architecture - Chapter 1 42

"The machine room sits at approximately 4th floor level. The 3rd floor level is taken by hundreds of kilometers of copper cabling, and the lower floors house the air conditioning and electrical equipment. The structure is enclosed in a cooling shell, with the air pumped from underneath through the cabinets, and collected to the two long sides of the building. The aeroshell gives the building its 'pumped-up' appearance. The machine room is electromagnetically shielded to prevent interference from the nearby expressway and rail. Even the halogen light sources are outside the shield, and the light is distributed by a grid of scattering pipes under the ceiling. The entire structure is mechanically isolated from the surroundings, suspended in order to make it less prone to earthquake damage. All attachments (power, cooling, access walkways) are flexible."


Fundamentals of Computer Architecture - Chapter 1 43

Fundamentals of Computer Architecture - Chapter 1 44

- Linpack performance: 40 TFLOPS, 80% of peak
- Real-world performance: 33-66% of peak (vs. less than 15% for clusters)
- Cost? Hint: starts with 4
- Maintenance: $15M per year
- Failure rate: one processor per week
- "A Distributed Memory Parallel Computing System in which 640 processor nodes are interconnected by a Single-Stage Crossbar Network"


Fundamentals of Computer Architecture - Chapter 1 45

Fastest (#1 as of Aug 2006): BlueGene
- 65,536 processors
- Each processor: PowerPC 440, 700 MHz (2.8 GFLOPS)
- Rpeak: 183 TFLOPS
- Rmax: 136 TFLOPS

Fundamentals of Computer Architecture - Chapter 1 46

Limitations of very large machines
- Niche market
- Power wall
  - By using low-power processors, BlueGene can scale to a very large processor count
  - Many practical issues: electricity, cooling, etc.
- Programming wall
  - Extremely hard to extract performance out of a very large machine