Page 1: Lecture 10: Parallelism & Clusters

Lecture 10: Parallelism & Clusters

Department of Electrical Engineering, Stanford University
EE282 – Fall 2008, Christos Kozyrakis
http://eeclass.stanford.edu/ee282

Page 2: Lecture 10: Parallelism & Clusters

Announcements

• Graded quiz 1 available on Wed
  – Solutions available online
• HW2 available online
  – Due on 11/12
• PA-1 due on 10/29

Page 3: Lecture 10: Parallelism & Clusters

Review: Parallel Systems

• Differentiating factors to keep in mind
  – Degree of integration
  – Which resources are parallelized?
  – Uniform vs. non-uniform storage access
  – Communication through memory or I/O accesses
• These choices have implications on
  – Scaling, suitability to specific apps, cost, software infrastructure, …
• Parallelization approaches
  – Data or domain parallelism
  – Task or functional parallelism
  – Task pipelining
  – Combinations…

[Figure: parallel-system organizations built from processor/cache (P/$), memory (M), I/O, and network blocks, differing in where memory and I/O attach relative to the interconnect.]

Page 4: Lecture 10: Parallelism & Clusters

Review: Limitations to Parallelism

• Major issues to keep in mind
  – Serially dominated workload
  – Parallel overhead (e.g., excessive or slow communication)
  – I/O bottlenecks
  – Load imbalance
  – Locality issues
• Metrics
  – Speedup
    • Amdahl's Law
    • Don't forget overheads of parallelism
  – Efficiency
    • May be misleading if parallel resources are cheap
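
As a quick worked example of the speedup metric (added here, not on the original slide): Amdahl's Law says that if a fraction f of the work can be parallelized over N processors,

  Speedup(N) = 1 / ((1 - f) + f/N)

With f = 0.9 and N = 16, Speedup = 1 / (0.1 + 0.9/16) ≈ 6.4, far below 16, and the limit as N grows is only 1/(1 - f) = 10. Any parallel overhead (communication, synchronization, load imbalance) pushes the real speedup lower still.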

Page 5: Lecture 10: Parallelism & Clusters

Review: Shared-Memory vs. Message-Passing

• Shared memory
  – Single address space for all CPUs
  – Communication through regular load/store operations (implicit)
  – Synchronization using locks and barriers
• Message passing
  – Private address spaces for CPUs
  – Communication through explicit send/receive operations (through a memory or I/O network)
  – Synchronization using blocking messages

Page 6: Lecture 10: Parallelism & Clusters

An Example – Iterative Solver

double a[2][MAXI+2][MAXJ+2]; // two copies of state;
                             // use one to compute the other

for (s=0; s<STEPS; s++) {
  k = s&1;  // 0 1 0 1 0 1 ...
  m = k^1;  // 1 0 1 0 1 0 ...
  forall(i=1; i<=MAXI; i++) {      // do iterations in parallel
    forall(j=1; j<=MAXJ; j++) {
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];
    }
  }
}

Page 7: Lecture 10: Parallelism & Clusters

Domain Decomposition

• Divide the 64x64 matrix A over 16 processors
  – Each processor computes a 16x16 submatrix
• Processor 6
  – Owns [i][j] = [32..47][16..31]
  – Shares [i][j] = [31][16..31] and three other strips
• Each processor
  – Communicates to get the shared data it needs
  – Computes its data
  – Synchronizes

[Figure: the 64x64 index space split into a 4x4 grid of 16x16 blocks owned by processors 0-15, with block boundaries at indices 15, 31, and 47 along both i and j.]
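
To make the decomposition concrete, here is a minimal C sketch (not from the slides) of how each processor p in the 4x4 grid could derive the bounds of its 16x16 block; the istart/iend/jstart/jend names match the shared-memory code on the next slide, and the 0-based indexing follows the figure:

/* Hypothetical helper: bounds of processor p's 16x16 block on the 64x64 domain,
   with the 16 processors arranged row-major in a 4x4 grid as in the figure. */
#define BLOCK 16
#define PGRID 4   /* processors per dimension */

void block_bounds(int p, int *istart, int *iend, int *jstart, int *jend) {
    *istart = (p % PGRID) * BLOCK;   /* p = 6: i = 32..47 */
    *iend   = *istart + BLOCK - 1;
    *jstart = (p / PGRID) * BLOCK;   /* p = 6: j = 16..31 */
    *jend   = *jstart + BLOCK - 1;
}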

Page 8: Lecture 10: Parallelism & Clusters

Shared Memory Code

Fork N processes
each process p computes istart[p], iend[p], jstart[p], jend[p]

for (s=0; s<STEPS; s++) {
  k = s&1;
  m = k^1;
  forall(i=istart[p]; i<=iend[p]; i++) {    // e.g., 32..47
    forall(j=jstart[p]; j<=jend[p]; j++) {  // e.g., 16..31
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];         // implicit comm.
    }
  }
  barrier();
}

Page 9: Lecture 10: Parallelism & Clusters

Message Passing Code

Fork N processes and distribute subarrays to processors
Each processor computes north[p], south[p], east[p], west[p],
  -1 if no neighbor in that direction

for (s=0; s<STEPS; s++) {
  k = s&1; m = k^1;
  if (north[p] >= 0) send(north[p], NORTH, a[m][1][1..MAXSUBJ]);
  if (east[p]  >= 0) send(east[p],  EAST,  a[m][1..MAXSUBI][1]);
  same for south and west
  if (north[p] >= 0) receive(NORTH, a[m][0][1..MAXSUBJ]);
  same for the other directions
  forall(i=1; i<=MAXSUBI; i++) {
    forall(j=1; j<=MAXSUBJ; j++) {
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];
    }
  }
}

Page 10: Lecture 10: Parallelism & Clusters

Shared-Memory vs. Message-Passing Programming

• Shared memory
  – Typically easier to write the first correct version
    • Communication through loads/stores; just get synchronization right
  – Typically more difficult to write the fully optimized version
    • Difficult to tell which loads/stores lead to communication
  – Often more difficult to scale
    • Can create fine-grain communication/synchronization
• Message passing
  – Typically more difficult to write the first correct version
  – Typically easier to write the fully optimized version
    • Communication/synchronization on sends/receives
  – Often easier to scale
    • Typically leads to coarse-grain communication/synchronization

Page 11: Lecture 10: Parallelism & Clusters

Convergence of Models

• Can do
  – Message-passing programs on top of shared-memory hardware
    • Loads/stores to shared buffers to implement messages
    • This is how custom message-passing machines work
      – But no coherence…
  – Shared-memory programs on top of message-passing hardware
    • Use the virtual memory system to implement sharing
• Can combine shared-memory & message-passing hardware
  – Message-passing cluster with each node a shared-memory multiprocessor
• Within a chip (multi-core or CMP system), we can greatly improve both shared-memory and message-passing models
  – Lower latency, more bandwidth, simpler networks, specialized HW support, …
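
To make the first bullet concrete, here is a minimal sketch (not from the slides) of a one-slot message "channel" built purely from loads and stores to a shared buffer; it assumes a coherent shared-memory node and ignores the memory-ordering details (barriers or C11 atomics) a real implementation would need:

/* Hypothetical one-slot channel in shared memory (illustration only). */
#include <string.h>

#define MSG_BYTES 64

struct channel {
    volatile int full;            /* 0 = empty, 1 = message present */
    char payload[MSG_BYTES];
};

void ch_send(struct channel *ch, const void *msg, int len) {
    while (ch->full) ;                        /* spin until the slot is free   */
    memcpy((void *)ch->payload, msg, len);    /* store the message body        */
    ch->full = 1;                             /* "send": flag the data ready   */
}

void ch_recv(struct channel *ch, void *msg, int len) {
    while (!ch->full) ;                       /* spin until a message arrives  */
    memcpy(msg, (const void *)ch->payload, len);
    ch->full = 0;                             /* "receive": free the slot      */
}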

Page 12: Lecture 10: Parallelism & Clusters

Clusters

Page 13: Lecture 10: Parallelism & Clusters

What is a Cluster?

• A cluster is a type of parallel or distributed processing system that consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
• Clusters are message-passing machines
  – Disjoint address spaces
• A typical cluster uses:
  – Commodity off-the-shelf parts, computers, and networks
  – Low-latency communication protocols

Page 14: Lecture 10: Parallelism & Clusters

The History of Clusters

• In the 1980s, it was a vector SMP
  – Custom components throughout
• In the 1990s, it was a massively parallel computer
  – Commodity off-the-shelf CPUs, everything else custom
• … but today, it is a cluster
  – COTS components everywhere

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

Page 15: Lecture 10: Parallelism & Clusters

Systems View of a Cluster

[Figure: a master node connected to the LAN/WAN, a file server / gateway, and cluster management tools, all tied to a set of compute nodes through the cluster interconnect.]

Page 16: Lecture 10: Parallelism & Clusters

Cluster Pros/Cons

• Advantages
  – Low cost: they use high-volume, commodity components
  – Scale: easier to expand/scale than any other parallel system
    • 10,000s of processors; can scale while service is uninterrupted
  – Error isolation: separate address spaces limit error effects
  – Repair: easier to replace a machine in a cluster than a component in a shared-memory system
• Disadvantages
  – Administration cost: just like administering N independent machines
    • Shared-memory systems behave like a single machine
  – Communication overhead
    • Typically goes through the I/O bus, the OS, and long networking protocols
  – Dealing with distributed storage

Page 17: Lecture 10: Parallelism & Clusters

Dealing with Cluster Shortcomings

• Administration cost
  – Clones of identical PCs
  – 3 steps: reboot, reinstall OS, recycle
  – At $1000/PC, cheaper to discard than to figure out what is wrong and repair it?
• Network performance (more discussion later)
  – Storage area networks
  – I/O acceleration
    • Network interface at the memory bus, direct user access
• Storage
  – Separation of long-term storage and computation
  – If separate storage servers or file servers, the cluster is no worse (?)

Page 18: Lecture 10: Parallelism & Clusters

A Sample Cluster Design

[Figure: 32 compute nodes in a rack, connected through rack switches (PowerConnect 2016) to a cluster switch. The master node reaches the external network over Gigabit Ethernet (fibre); the data network among nodes is Gigabit Ethernet (copper); a separate control and out-of-band network uses 100BaseT copper. A storage node connects to an EMC disk store, and a control node drives a rack-mount LCD panel/keyboard.]

Page 19: Lecture 10: Parallelism & Clusters

Rack-mounted Systems

• Advantages
  – Dense packing
  – Simpler cabling
• Typical rack: 19" wide
  – Height measured in RUs
  – 1 RU = 1.75"
• Collocation sites charge by rack
  – Space + power supply
• A typical installation can support up to 1,000 Watts per rack
  – Assuming 12MW/building, 10,000 square feet, 10 square feet per rack

Page 20: Lecture 10: Parallelism & Clusters

Cluster Hardware: the Nodes

• A single element within the cluster
• Compute Node
  – Just computes – little else
  – Private IP address – no user access
• Master/Head/Front-End Node
  – User login
  – Job scheduler
  – Public IP address – connects to the external network
• Management/Administrator Node
  – Systems/cluster management functions
  – Secure administrator address
• I/O Node
  – Access to data
  – Generally internal to the cluster or to the data centre

Page 21: Lecture 10: Parallelism & Clusters

Technology Advancements in 5 Years

Codename   Release date    GHz  Number    Peak FLOP      Peak GFLOPS  Linpack on 256
                                of cores  per CPU cycle  per CPU      processors
Foster     September 2001  1.7  1         2              3.4          288.9*
Woodcrest  June 2006       3.0  2         4              24           4781**

Example:
* From the November 2001 top500 supercomputer list (cluster of Dell Precision 530)
** Intel internal cluster built in 2006
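
As a sanity check on the table (arithmetic added here): peak GFLOPS per CPU = GHz x FLOP/cycle x cores, so Foster reaches 1.7 x 2 x 1 = 3.4 GFLOPS while Woodcrest reaches 3.0 x 4 x 2 = 24 GFLOPS, roughly a 7x improvement per socket in five years; the measured 256-processor Linpack results in the last column grow even faster (about 16x).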

Page 22: Lecture 10: Parallelism & Clusters

Example

• Circa 2003, a custom Google PC consisted of:
  – 1 to 2 CPUs
    • From a 533MHz Celeron to a 1.4GHz Pentium III
  – 256MB SDRAM (100MHz)
  – ~2 IDE disks (typically 5400RPM)
  – 100Mbps Ethernet link to a switch
• Why a low-end design?
  – Cost
  – Power

Page 23: Lecture 10: Parallelism & Clusters

Cluster Interconnect

Interconnect        Typical latency (usec)   Typical bandwidth (MB/s)
100 Mbps Ethernet   75                       8
1 Gbit/s Ethernet   60-90                    90
10 Gb/s Ethernet    12-20                    800
SCI*                1.5-4                    200-600
Myricom Myrinet*    2.2-3                    250-1200
InfiniBand*         2-4                      900-1400
Quadrics QsNet*     3-5                      600-900

Page 24: Lecture 10: Parallelism & Clusters

Network Latency

• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes
• Latency: delay between send and receive times
  – Latency tends to vary widely across architectures
  – Vendors often report hardware latencies (wire time)
  – Application programmers care about software latencies (user program to user program)
• Observations:
  – Hardware/software latencies often differ by 1-2 orders of magnitude
  – Maximum hardware latency varies with diameter, but the variation in software latency is usually negligible
• Latency is important for programs with many small messages

Page 25: Lecture 10: Parallelism & Clusters

Overall Network Latency

[Figure: timeline of a message from sender to receiver, showing sender overhead (processor busy), time of flight, transmission time (size ÷ bandwidth), transport latency, and receiver overhead (processor busy), summing to the total latency.]

Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead

Note: don't forget that packets have a header and trailer…
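
A rough worked example (added here; the component values are assumptions, not from the slides): suppose sender plus receiver overhead and time of flight together contribute about 60 usec on a 1 Gbit/s Ethernet cluster (consistent with the small-message latency in the interconnect table), and the link sustains ~90 MB/s. For an 8 KB message, transmission time ≈ 8192 B ÷ 90 MB/s ≈ 91 usec, so total latency ≈ 60 + 91 ≈ 150 usec; for a 64 B message the fixed overheads dominate almost completely.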

Page 26: Lecture 10: Parallelism & Clusters

Network Bandwidth

• The bandwidth of a link = w * 1/t
  – w is the number of wires
  – t is the time per bit
• Bandwidth typically in Gigabytes (GB), i.e., 8 * 2^30 bits
• Unidirectional: in one direction; bidirectional: in both directions
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead
• Bandwidth is important for applications with mostly large messages

[Figure: packet format – header (routing and control), data payload, trailer (error code).]
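
For example (numbers assumed here for illustration): if each packet carries P bytes of payload and H bytes of header/trailer, effective bandwidth = link bandwidth * P / (P + H). A 1460-byte payload with ~60 bytes of framing and protocol headers delivers roughly 96% of the link bandwidth, while a 64-byte payload with the same overhead delivers only about half.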

Page 27: Lecture 10: Parallelism & Clusters

Bisection Bandwidth

• Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the "narrowest" part of the network

[Figure: two example bisection cuts – one that crosses a single link (bisection bw = link bw) and one across a 2D mesh that crosses sqrt(n) links (bisection bw = sqrt(n) * link bw) – plus an example of a cut that is not a bisection.]

• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others

Page 28: Lecture 10: Parallelism & Clusters

Networking Background

• Topology (how things are connected)
  – Crossbar, ring, 2-D mesh and 2-D torus, hypercube, omega network
• Routing algorithm:
  – Example: all east-west then all north-south (avoids deadlock)
• Switching strategy:
  – Circuit switching: full path reserved for the entire message, like the telephone
  – Packet switching: message broken into separately-routed packets, like the post office
• Flow control (what if there is congestion):
  – Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.

Page 29: Lecture 10: Parallelism & Clusters

Network Topology

• In the past, there was considerable research in network topology and in mapping algorithms to topology
  – Key cost to be minimized: number of "hops" between nodes
  – Modern networks hide hop cost, so topology is no longer a major factor in application performance
• Why is topology interesting?
  – Algorithms may have a communication topology
  – Topology affects
    • Bisection bandwidth
    • Latency
    • Observed congestion
• Trade-off: connectivity vs. cost
  – Number of switches, outgoing/incoming links per switch

Page 30: Lecture 10: Parallelism & Clusters

Linear and Ring Topologies

• Linear array
  – Diameter = n-1; average distance ~n/3
  – Bisection bandwidth = 1 (units are link bandwidth)
• Torus or Ring
  – Diameter = n/2; average distance ~n/4
  – Bisection bandwidth = 2
  – Natural for algorithms that work with 1D arrays
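
For instance (worked numbers added here): with n = 64 nodes, a linear array has diameter 63 and a bisection of a single link, while a ring halves the diameter to 32 and doubles the bisection bandwidth to 2 links.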

Page 31: Lecture 10: Parallelism & Clusters

Meshes and Tori

• Two-dimensional mesh
  – Diameter = 2 * (sqrt(n) – 1)
  – Bisection bandwidth = sqrt(n)
• Two-dimensional torus
  – Diameter = sqrt(n)
  – Bisection bandwidth = 2 * sqrt(n)
• Generalizes to higher dimensions (3D torus)
• Natural for algorithms that work with 2D and/or 3D arrays
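
Again with n = 64 (a worked example added here): the 2D mesh has diameter 2 * (8 – 1) = 14 and bisection bandwidth 8 links, while the 2D torus cuts the diameter to 8 and doubles the bisection bandwidth to 16 links thanks to the wrap-around connections.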

Page 32: Lecture 10: Parallelism & Clusters

Hypercubes

• Number of nodes n = 2^d for dimension d
  – Diameter = d
  – Bisection bandwidth = n/2
• 0d, 1d, 2d, 3d, 4d
• Gray-code addressing:
  – Each node connected to d others with 1 bit different

[Figure: a 3-cube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111; each node links to the three labels that differ in exactly one bit.]
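
A small C sketch (added here, not from the slides) of the addressing rule: the neighbors of node x in a d-dimensional hypercube are exactly the d labels obtained by flipping one bit of x.

#include <stdio.h>

/* Print the d neighbors of node x in a d-dimensional hypercube.
   Each neighbor differs from x in exactly one bit position. */
void hypercube_neighbors(unsigned x, int d) {
    for (int i = 0; i < d; i++)
        printf("neighbor %d of node %u: %u\n", i, x, x ^ (1u << i));
}

int main(void) {
    hypercube_neighbors(6, 3);   /* node 110 in a 3-cube -> 111, 100, 010 */
    return 0;
}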

Page 33: Lecture 10: Parallelism & Clusters

Trees

• Diameter = log n
• Bisection bandwidth = 1
• Easy layout as planar graph
• Many tree algorithms (e.g., summation)
• Fat trees avoid the bisection bandwidth problem:
  – More (or wider) links near the top

Page 34: Lecture 10: Parallelism & Clusters

Butterflies

• Diameter = log n
• Bisection bandwidth = n
• Cost: lots of wires
• Natural for FFT

[Figure: a 2x2 butterfly switch (ports labeled 0 and 1) and a multistage butterfly network built from such switches.]

Page 35: Lecture 10: Parallelism & Clusters

Cluster Architecture View: HW + SW

Application:   Parallel benchmarks (Perf, Ring, HINT, NAS, …), real applications
Middleware:    MPI, PVM, shmem
OS:            Linux, other OSes
Protocol:      TCP/IP, VIA, proprietary
Interconnect:  Ethernet, Myrinet, Infiniband, Quadrics
Hardware:      Desktop, workstation, 1P/2P server, 4U+ server

Page 36: Lecture 10: Parallelism & Clusters

Cluster Design Exercise from Old Version of Textbook

• Goal: design a cluster with 32 processors, 32GB DRAM, 32-64 disks
• Choices
  – Type of processor (board)
  – Type of DRAM DIMMs
  – Location of disks (local vs. across the network)
• Constraints:
  – Cost
  – Rack-mounted system

Page 37: Lecture 10: Parallelism & Clusters

Components & Costs

                        1-way           2-way           8-way
Processors/box          1 (1 RU)        2 (1 RU)        4 (8 RU)
Frequency / L2 size     1GHz / 256KB    1GHz / 256KB    700MHz / 1MB
Box + 1 processor       $1,759          $1,939          $14,614
Extra processor         -               $799            $1,799
0.5GB SDRAM DIMM        $549            $749            $1,069
1GB SDRAM DIMM          -               $1,689          $2,369
36GB SCSI disk          $579            $639            $639
73GB SCSI disk          -               $1,299          $1,299
LAN switch (8p, 1RU)    $6,280          $6,280          $6,280
LAN switch (30p, 2RU)   $15,995         $15,995         $15,995
LAN adapter             $795            $795            $795
44-RU rack              $1,975          $1,975          $1,975

Page 38: Lecture 10: Parallelism & Clusters

Cluster with 1-way processors

• 32 boxes, 32x2x0.5GB DRAM, 32x2 36GB disks (2.3TB)
• 32 LAN adapters, 2 30-port LAN switches
• 1 rack (32+2+2 RUs)
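
A rough total from the price list on the previous slide (arithmetic added here; the original cost-comparison chart is not reproduced): 32 x $1,759 (boxes) + 64 x $549 (0.5GB DIMMs) + 64 x $579 (36GB disks) + 32 x $795 (LAN adapters) + 2 x $15,995 (30-port switches) + $1,975 (rack) = $56,288 + $35,136 + $37,056 + $25,440 + $31,990 + $1,975 ≈ $188,000.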

Page 39: Lecture 10: Parallelism & Clusters

Cluster with 2-way processors

• 16 boxes, 16x2 1GB DRAM, 16x2 73GB disks (2.3TB)
• 16 LAN adapters, 1 30-port LAN switch
• 1 rack (16+2 RUs)
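
The same back-of-the-envelope estimate for this configuration (arithmetic added here, assuming each box takes one extra processor to reach 2-way): 16 x ($1,939 + $799) + 32 x $1,689 (1GB DIMMs) + 32 x $1,299 (73GB disks) + 16 x $795 (adapters) + $15,995 (switch) + $1,975 (rack) ≈ $170,000, somewhat cheaper than the 1-way design, mostly by halving the number of boxes, adapters, and switches.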

Page 40: Lecture 10: Parallelism & Clusters

Cluster with 8-way processors

• 4 boxes, 4x8 1GB DRAM, 4x2 73GB disks
• 4 storage expansion slots (6 disks, 3 RU each)
• 4 LAN adapters, 1 8-port LAN switch
• 2 racks (4*8+4*3+1 RUs)

Page 41: Lecture 10: Parallelism & Clusters

Cost Comparison

Page 42: Lecture 10: Parallelism & Clusters

Disks across the Network

• Disks per processor box:
  – Advantage: cheaper
  – Disadvantage: no fault tolerance
• Disks across the network (e.g., Fibre Channel SAN)
  – Advantage: can organize in fault-tolerant groups (see later lecture)
  – Disadvantage: cost
    • Controllers, enclosure, cables, extra rack space

Page 43: Lecture 10: Parallelism & Clusters

Cost Comparison with SAN-based Disks

Page 44: Lecture 10: Parallelism & Clusters

Total Cost of Ownership

• Software: OS, database server, …
  – Most companies charge one license per processor (or box)
  – E.g., Win2K for 1-4 CPUs $800, for 1-8 CPUs $3,295
  – E.g., SQL Server for 1 CPU $16,000
• HW maintenance
• Rack space rental + power supply
  – $800-$1,500 per month
• Internet access
• Operator
  – $100,000 per year
• Backups
  – Tapes, devices, etc.
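
To put these figures in perspective (a rough comparison added here, using the numbers above and the earlier hardware estimates): over 3 years, an operator at $100,000/year costs $300,000 and rack space at $800-$1,500/month adds roughly $30,000-$54,000 per rack, so operations alone already exceed the ~$170,000-$190,000 hardware cost of the one-rack clusters sketched earlier, before counting per-processor software licenses, maintenance, Internet access, and backups.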

Page 45: Lecture 10: Parallelism & Clusters

3-year Ownership Cost Comparison

Page 46: Lecture 10: Parallelism & Clusters

Some General Observations

• Cost of HW is typically less than cost of ownership
  – Software, maintenance, backups, …
• Space and power consumption can be important for clusters
  – Same as with embedded computing ☺
• Smaller nodes are typically cheaper and faster
  – Simpler & faster processors
  – Less sharing of resources within each node
• Some integration is still beneficial
  – 2-way to 4-way SMPs lead to space and cost savings
• Heterogeneous clusters are often a necessity
  – Build the cluster incrementally from available components