Page 1: Lecture 10: Parallelism & Clusters

Lecture 10: Parallelism & Clusters

Department of Electrical Engineering, Stanford University
EE282 – Fall 2008, Christos Kozyrakis
http://eeclass.stanford.edu/ee282

Page 2: Lecture 10: Parallelism & Clusters

Announcements

• Graded quiz 1 available on Wed
  – Solutions available online
• HW2 available online
  – Due on 11/12
• PA-1 due on 10/29

Page 3: Lecture 10: Parallelism & Clusters

Review: Parallel Systems

• Differentiating factors to keep in mind
  – Degree of integration
  – Which resources are parallelized?
  – Uniform vs. non-uniform storage access
  – Communication through memory or I/O accesses
• These choices have implications on
  – Scaling, suitability to specific apps, cost, software infrastructure, …
• Parallelization approaches
  – Data or domain parallelism
  – Task or functional parallelism
  – Task pipelining
  – Combinations…

[Figure: parallel-system organizations built from processor/cache (P/$), memory (M), I/O, and network blocks, differing in where memory and I/O attach relative to the interconnect.]

Page 4: Lecture 10: Parallelism & Clusters

Review: Limitations to Parallelism

• Major issues to keep in mind
  – Serially dominated workload
  – Parallel overhead (e.g., excessive or slow communication)
  – I/O bottlenecks
  – Load imbalance
  – Locality issues
• Metrics
  – Speedup
    • Amdahl's Law
    • Don't forget overheads of parallelism
  – Efficiency
    • May be misleading if parallel resources are cheap
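
As a quick worked example of the speedup metric (added here, not on the original slide): Amdahl's Law says that if a fraction f of the work can be parallelized over N processors,

  Speedup(N) = 1 / ((1 - f) + f/N)

With f = 0.9 and N = 16, Speedup = 1 / (0.1 + 0.9/16) ≈ 6.4, far below 16, and the limit as N grows is only 1/(1 - f) = 10. Any parallel overhead (communication, synchronization, load imbalance) pushes the real speedup lower still.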

Page 5: Lecture 10: Parallelism & Clusters

Review: Shared-Memory vs. Message-Passing

• Shared memory
  – Single address space for all CPUs
  – Communication through regular load/store operations (implicit)
  – Synchronization using locks and barriers
• Message passing
  – Private address spaces for CPUs
  – Communication through explicit send/receive operations (through a memory or I/O network)
  – Synchronization using blocking messages

Page 6: Lecture 10: Parallelism & Clusters

An Example – Iterative Solver

double a[2][MAXI+2][MAXJ+2]; // two copies of state;
                             // use one to compute the other

for (s=0; s<STEPS; s++) {
  k = s&1;  // 0 1 0 1 0 1 ...
  m = k^1;  // 1 0 1 0 1 0 ...
  forall(i=1; i<=MAXI; i++) {      // do iterations in parallel
    forall(j=1; j<=MAXJ; j++) {
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];
    }
  }
}

Page 7: Lecture 10: Parallelism & Clusters

Domain Decomposition

• Divide the 64x64 matrix A over 16 processors
  – Each processor computes a 16x16 submatrix
• Processor 6
  – Owns [i][j] = [32..47][16..31]
  – Shares [i][j] = [31][16..31] and three other strips
• Each processor
  – Communicates to get the shared data it needs
  – Computes its data
  – Synchronizes

[Figure: the 64x64 index space split into a 4x4 grid of 16x16 blocks owned by processors 0-15, with block boundaries at indices 15, 31, and 47 along both i and j.]
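
To make the decomposition concrete, here is a minimal C sketch (not from the slides) of how each processor p in the 4x4 grid could derive the bounds of its 16x16 block; the istart/iend/jstart/jend names match the shared-memory code on the next slide, and the 0-based indexing follows the figure:

/* Hypothetical helper: bounds of processor p's 16x16 block on the 64x64 domain,
   with the 16 processors arranged row-major in a 4x4 grid as in the figure. */
#define BLOCK 16
#define PGRID 4   /* processors per dimension */

void block_bounds(int p, int *istart, int *iend, int *jstart, int *jend) {
    *istart = (p % PGRID) * BLOCK;   /* p = 6: i = 32..47 */
    *iend   = *istart + BLOCK - 1;
    *jstart = (p / PGRID) * BLOCK;   /* p = 6: j = 16..31 */
    *jend   = *jstart + BLOCK - 1;
}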

Page 8: Lecture 10: Parallelism & Clusters

Shared Memory Code

Fork N processes
each process p computes istart[p], iend[p], jstart[p], jend[p]

for (s=0; s<STEPS; s++) {
  k = s&1;
  m = k^1;
  forall(i=istart[p]; i<=iend[p]; i++) {    // e.g., 32..47
    forall(j=jstart[p]; j<=jend[p]; j++) {  // e.g., 16..31
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];         // implicit comm.
    }
  }
  barrier();
}

Page 9: Lecture 10: Parallelism & Clusters

Message Passing Code

Fork N processes and distribute subarrays to processors
Each processor computes north[p], south[p], east[p], west[p],
  -1 if no neighbor in that direction

for (s=0; s<STEPS; s++) {
  k = s&1; m = k^1;
  if (north[p] >= 0) send(north[p], NORTH, a[m][1][1..MAXSUBJ]);
  if (east[p]  >= 0) send(east[p],  EAST,  a[m][1..MAXSUBI][1]);
  same for south and west
  if (north[p] >= 0) receive(NORTH, a[m][0][1..MAXSUBJ]);
  same for the other directions
  forall(i=1; i<=MAXSUBI; i++) {
    forall(j=1; j<=MAXSUBJ; j++) {
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];
    }
  }
}

Page 10: Lecture 10: Parallelism & Clusters

Shared-Memory vs. Message-Passing Programming

• Shared memory
  – Typically easier to write the first correct version
    • Communication through loads/stores; just get synchronization right
  – Typically more difficult to write the fully optimized version
    • Difficult to tell which loads/stores lead to communication
  – Often more difficult to scale
    • Can create fine-grain communication/synchronization
• Message passing
  – Typically more difficult to write the first correct version
  – Typically easier to write the fully optimized version
    • Communication/synchronization on sends/receives
  – Often easier to scale
    • Typically leads to coarse-grain communication/synchronization

Page 11: Lecture 10: Parallelism & Clusters

Convergence of Models

• Can do
  – Message-passing programs on top of shared-memory hardware
    • Loads/stores to shared buffers to implement messages
    • This is how custom message-passing machines work
      – But no coherence…
  – Shared-memory programs on top of message-passing hardware
    • Use the virtual memory system to implement sharing
• Can combine shared-memory & message-passing hardware
  – Message-passing cluster with each node a shared-memory multiprocessor
• Within a chip (multi-core or CMP system), we can greatly improve both shared-memory and message-passing models
  – Lower latency, more bandwidth, simpler networks, specialized HW support, …
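
To make the first bullet concrete, here is a minimal sketch (not from the slides) of a one-slot message "channel" built purely from loads and stores to a shared buffer; it assumes a coherent shared-memory node and ignores the memory-ordering details (barriers or C11 atomics) a real implementation would need:

/* Hypothetical one-slot channel in shared memory (illustration only). */
#include <string.h>

#define MSG_BYTES 64

struct channel {
    volatile int full;            /* 0 = empty, 1 = message present */
    char payload[MSG_BYTES];
};

void ch_send(struct channel *ch, const void *msg, int len) {
    while (ch->full) ;                        /* spin until the slot is free   */
    memcpy((void *)ch->payload, msg, len);    /* store the message body        */
    ch->full = 1;                             /* "send": flag the data ready   */
}

void ch_recv(struct channel *ch, void *msg, int len) {
    while (!ch->full) ;                       /* spin until a message arrives  */
    memcpy(msg, (const void *)ch->payload, len);
    ch->full = 0;                             /* "receive": free the slot      */
}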

Page 12: Lecture 10: Parallelism & Clusters

Clusters

Page 13: Lecture 10: Parallelism & Clusters

What is a Cluster?

• A cluster is a type of parallel or distributed processing system that consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
• Clusters are message-passing machines
  – Disjoint address spaces
• A typical cluster uses:
  – Commodity off-the-shelf parts, computers, and networks
  – Low-latency communication protocols

Page 14: Lecture 10: Parallelism & Clusters

The History of Clusters

• In the 1980s, it was a vector SMP
  – Custom components throughout
• In the 1990s, it was a massively parallel computer
  – Commodity off-the-shelf CPUs, everything else custom
• … but today, it is a cluster
  – COTS components everywhere

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

Page 15: Lecture 10: Parallelism & Clusters

Systems View of a Cluster

[Figure: a master node connected to the LAN/WAN, a file server / gateway, and cluster management tools, all tied to a set of compute nodes through the cluster interconnect.]

Page 16: Lecture 10: Parallelism & Clusters

Cluster Pros/Cons

• Advantages
  – Low cost: they use high-volume, commodity components
  – Scale: easier to expand/scale than any other parallel system
    • 10,000s of processors; can scale while service is uninterrupted
  – Error isolation: separate address spaces limit error effects
  – Repair: easier to replace a machine in a cluster than a component in a shared-memory system
• Disadvantages
  – Administration cost: just like administering N independent machines
    • Shared-memory systems behave like a single machine
  – Communication overhead
    • Typically goes through the I/O bus, the OS, and long networking protocols
  – Dealing with distributed storage

Page 17: Lecture 10: Parallelism & Clusters

Dealing with Cluster Shortcomings

• Administration cost
  – Clones of identical PCs
  – 3 steps: reboot, reinstall OS, recycle
  – At $1000/PC, cheaper to discard than to figure out what is wrong and repair it?
• Network performance (more discussion later)
  – Storage area networks
  – I/O acceleration
    • Network interface at the memory bus, direct user access
• Storage
  – Separation of long-term storage and computation
  – If separate storage servers or file servers, the cluster is no worse (?)

Page 18: Lecture 10: Parallelism & Clusters

A Sample Cluster Design

[Figure: 32 compute nodes in a rack, connected through rack switches (PowerConnect 2016) to a cluster switch. The master node reaches the external network over Gigabit Ethernet (fibre); the data network among nodes is Gigabit Ethernet (copper); a separate control and out-of-band network uses 100BaseT copper. A storage node connects to an EMC disk store, and a control node drives a rack-mount LCD panel/keyboard.]

Page 19: Lecture 10: Parallelism & Clusters

Rack-mounted Systems

• Advantages
  – Dense packing
  – Simpler cabling
• Typical rack: 19" wide
  – Height measured in RUs
  – 1 RU = 1.75"
• Collocation sites charge by rack
  – Space + power supply
• A typical installation can support up to 1,000 Watts per rack
  – Assuming 12MW/building, 10,000 square feet, 10 square feet per rack

Page 20: Lecture 10: Parallelism & Clusters

Cluster Hardware: the Nodes

• A single element within the cluster
• Compute Node
  – Just computes – little else
  – Private IP address – no user access
• Master/Head/Front-End Node
  – User login
  – Job scheduler
  – Public IP address – connects to the external network
• Management/Administrator Node
  – Systems/cluster management functions
  – Secure administrator address
• I/O Node
  – Access to data
  – Generally internal to the cluster or to the data centre

Page 21: Lecture 10: Parallelism & Clusters

Technology Advancements in 5 Years

Codename   Release date    GHz  Number    Peak FLOP      Peak GFLOPS  Linpack on 256
                                of cores  per CPU cycle  per CPU      processors
Foster     September 2001  1.7  1         2              3.4          288.9*
Woodcrest  June 2006       3.0  2         4              24           4781**

Example:
* From the November 2001 top500 supercomputer list (cluster of Dell Precision 530)
** Intel internal cluster built in 2006
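
As a sanity check on the table (arithmetic added here): peak GFLOPS per CPU = GHz x FLOP/cycle x cores, so Foster reaches 1.7 x 2 x 1 = 3.4 GFLOPS while Woodcrest reaches 3.0 x 4 x 2 = 24 GFLOPS, roughly a 7x improvement per socket in five years; the measured 256-processor Linpack results in the last column grow even faster (about 16x).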

Page 22: Lecture 10: Parallelism & Clusters

Example

• Circa 2003, a custom Google PC consisted of:
  – 1 to 2 CPUs
    • From a 533MHz Celeron to a 1.4GHz Pentium III
  – 256MB SDRAM (100MHz)
  – ~2 IDE disks (typically 5400RPM)
  – 100Mbps Ethernet link to a switch
• Why a low-end design?
  – Cost
  – Power

Page 23: Lecture 10: Parallelism & Clusters

Cluster Interconnect

Interconnect        Typical latency (usec)   Typical bandwidth (MB/s)
100 Mbps Ethernet   75                       8
1 Gbit/s Ethernet   60-90                    90
10 Gb/s Ethernet    12-20                    800
SCI*                1.5-4                    200-600
Myricom Myrinet*    2.2-3                    250-1200
InfiniBand*         2-4                      900-1400
Quadrics QsNet*     3-5                      600-900

Page 24: Lecture 10: Parallelism & Clusters

Network Latency

• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes
• Latency: delay between send and receive times
  – Latency tends to vary widely across architectures
  – Vendors often report hardware latencies (wire time)
  – Application programmers care about software latencies (user program to user program)
• Observations:
  – Hardware/software latencies often differ by 1-2 orders of magnitude
  – Maximum hardware latency varies with diameter, but the variation in software latency is usually negligible
• Latency is important for programs with many small messages

Page 25: Lecture 10: Parallelism & Clusters

Overall Network Latency

[Figure: timeline of a message from sender to receiver, showing sender overhead (processor busy), time of flight, transmission time (size ÷ bandwidth), transport latency, and receiver overhead (processor busy), summing to the total latency.]

Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead

Note: don't forget that packets have a header and trailer…
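
A rough worked example (added here; the component values are assumptions, not from the slides): suppose sender plus receiver overhead and time of flight together contribute about 60 usec on a 1 Gbit/s Ethernet cluster (consistent with the small-message latency in the interconnect table), and the link sustains ~90 MB/s. For an 8 KB message, transmission time ≈ 8192 B ÷ 90 MB/s ≈ 91 usec, so total latency ≈ 60 + 91 ≈ 150 usec; for a 64 B message the fixed overheads dominate almost completely.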

Page 26: Lecture 10: Parallelism & Clusters

Network Bandwidth

• The bandwidth of a link = w * 1/t
  – w is the number of wires
  – t is the time per bit
• Bandwidth typically in Gigabytes (GB), i.e., 8 * 2^30 bits
• Unidirectional: in one direction; bidirectional: in both directions
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead
• Bandwidth is important for applications with mostly large messages

[Figure: packet format – header (routing and control), data payload, trailer (error code).]
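
For example (numbers assumed here for illustration): if each packet carries P bytes of payload and H bytes of header/trailer, effective bandwidth = link bandwidth * P / (P + H). A 1460-byte payload with ~60 bytes of framing and protocol headers delivers roughly 96% of the link bandwidth, while a 64-byte payload with the same overhead delivers only about half.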

Page 27: Lecture 10: Parallelism & Clusters

Bisection Bandwidth

• Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the "narrowest" part of the network

[Figure: two example bisection cuts – one that crosses a single link (bisection bw = link bw) and one across a 2D mesh that crosses sqrt(n) links (bisection bw = sqrt(n) * link bw) – plus an example of a cut that is not a bisection.]

• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others

Page 28: Lecture 10: Parallelism & Clusters

Networking Background

• Topology (how things are connected)
  – Crossbar, ring, 2-D mesh and 2-D torus, hypercube, omega network
• Routing algorithm:
  – Example: all east-west then all north-south (avoids deadlock)
• Switching strategy:
  – Circuit switching: full path reserved for the entire message, like the telephone
  – Packet switching: message broken into separately-routed packets, like the post office
• Flow control (what if there is congestion):
  – Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.

Page 29: Lecture 10: Parallelism & Clusters

Network Topology

• In the past, there was considerable research in network topology and in mapping algorithms to topology
  – Key cost to be minimized: number of "hops" between nodes
  – Modern networks hide hop cost, so topology is no longer a major factor in application performance
• Why is topology interesting?
  – Algorithms may have a communication topology
  – Topology affects
    • Bisection bandwidth
    • Latency
    • Observed congestion
• Trade-off: connectivity vs. cost
  – Number of switches, outgoing/incoming links per switch

Page 30: Lecture 10: Parallelism & Clusters

Linear and Ring Topologies

• Linear array
  – Diameter = n-1; average distance ~n/3
  – Bisection bandwidth = 1 (units are link bandwidth)
• Torus or Ring
  – Diameter = n/2; average distance ~n/4
  – Bisection bandwidth = 2
  – Natural for algorithms that work with 1D arrays
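
For instance (worked numbers added here): with n = 64 nodes, a linear array has diameter 63 and a bisection of a single link, while a ring halves the diameter to 32 and doubles the bisection bandwidth to 2 links.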

Page 31: Lecture 10: Parallelism & Clusters

Meshes and Tori

• Two-dimensional mesh
  – Diameter = 2 * (sqrt(n) – 1)
  – Bisection bandwidth = sqrt(n)
• Two-dimensional torus
  – Diameter = sqrt(n)
  – Bisection bandwidth = 2 * sqrt(n)
• Generalizes to higher dimensions (3D torus)
• Natural for algorithms that work with 2D and/or 3D arrays
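
Again with n = 64 (a worked example added here): the 2D mesh has diameter 2 * (8 – 1) = 14 and bisection bandwidth 8 links, while the 2D torus cuts the diameter to 8 and doubles the bisection bandwidth to 16 links thanks to the wrap-around connections.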

Page 32: Lecture 10: Parallelism & Clusters

Hypercubes

• Number of nodes n = 2^d for dimension d
  – Diameter = d
  – Bisection bandwidth = n/2
• 0d, 1d, 2d, 3d, 4d
• Gray-code addressing:
  – Each node connected to d others with 1 bit different

[Figure: a 3-cube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111; each node links to the three labels that differ in exactly one bit.]
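
A small C sketch (added here, not from the slides) of the addressing rule: the neighbors of node x in a d-dimensional hypercube are exactly the d labels obtained by flipping one bit of x.

#include <stdio.h>

/* Print the d neighbors of node x in a d-dimensional hypercube.
   Each neighbor differs from x in exactly one bit position. */
void hypercube_neighbors(unsigned x, int d) {
    for (int i = 0; i < d; i++)
        printf("neighbor %d of node %u: %u\n", i, x, x ^ (1u << i));
}

int main(void) {
    hypercube_neighbors(6, 3);   /* node 110 in a 3-cube -> 111, 100, 010 */
    return 0;
}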

Page 33: Lecture 10: Parallelism & Clusters

Trees

• Diameter = log n
• Bisection bandwidth = 1
• Easy layout as planar graph
• Many tree algorithms (e.g., summation)
• Fat trees avoid the bisection bandwidth problem:
  – More (or wider) links near the top

Page 34: Lecture 10: Parallelism & Clusters

Butterflies

• Diameter = log n
• Bisection bandwidth = n
• Cost: lots of wires
• Natural for FFT

[Figure: a 2x2 butterfly switch (ports labeled 0 and 1) and a multistage butterfly network built from such switches.]

Page 35: Lecture 10: Parallelism & Clusters

Cluster Architecture View: HW + SW

Application:   Parallel benchmarks (Perf, Ring, HINT, NAS, …), real applications
Middleware:    MPI, PVM, shmem
OS:            Linux, other OSes
Protocol:      TCP/IP, VIA, proprietary
Interconnect:  Ethernet, Myrinet, Infiniband, Quadrics
Hardware:      Desktop, workstation, 1P/2P server, 4U+ server

Page 36: Lecture 10: Parallelism & Clusters

Cluster Design Exercise from Old Version of Textbook

• Goal: design a cluster with 32 processors, 32GB DRAM, 32-64 disks
• Choices
  – Type of processor (board)
  – Type of DRAM DIMMs
  – Location of disks (local vs. across the network)
• Constraints:
  – Cost
  – Rack-mounted system

Page 37: Lecture 10: Parallelism & Clusters

Components & Costs

                        1-way           2-way           8-way
Processors/box          1 (1 RU)        2 (1 RU)        4 (8 RU)
Frequency / L2 size     1GHz / 256KB    1GHz / 256KB    700MHz / 1MB
Box + 1 processor       $1,759          $1,939          $14,614
Extra processor         -               $799            $1,799
0.5GB SDRAM DIMM        $549            $749            $1,069
1GB SDRAM DIMM          -               $1,689          $2,369
36GB SCSI disk          $579            $639            $639
73GB SCSI disk          -               $1,299          $1,299
LAN switch (8p, 1RU)    $6,280          $6,280          $6,280
LAN switch (30p, 2RU)   $15,995         $15,995         $15,995
LAN adapter             $795            $795            $795
44-RU rack              $1,975          $1,975          $1,975

Page 38: Lecture 10: Parallelism & Clusters

Cluster with 1-way processors

• 32 boxes, 32x2x0.5GB DRAM, 32x2 36GB disks (2.3TB)
• 32 LAN adapters, 2 30-port LAN switches
• 1 rack (32+2+2 RUs)
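
A rough total from the price list on the previous slide (arithmetic added here; the original cost-comparison chart is not reproduced): 32 x $1,759 (boxes) + 64 x $549 (0.5GB DIMMs) + 64 x $579 (36GB disks) + 32 x $795 (LAN adapters) + 2 x $15,995 (30-port switches) + $1,975 (rack) = $56,288 + $35,136 + $37,056 + $25,440 + $31,990 + $1,975 ≈ $188,000.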

Page 39: Lecture 10: Parallelism & Clusters

Cluster with 2-way processors

• 16 boxes, 16x2 1GB DRAM, 16x2 73GB disks (2.3TB)
• 16 LAN adapters, 1 30-port LAN switch
• 1 rack (16+2 RUs)
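
The same back-of-the-envelope estimate for this configuration (arithmetic added here, assuming each box takes one extra processor to reach 2-way): 16 x ($1,939 + $799) + 32 x $1,689 (1GB DIMMs) + 32 x $1,299 (73GB disks) + 16 x $795 (adapters) + $15,995 (switch) + $1,975 (rack) ≈ $170,000, somewhat cheaper than the 1-way design, mostly by halving the number of boxes, adapters, and switches.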

Page 40: Lecture 10: Parallelism & Clusters

Cluster with 8-way processors

• 4 boxes, 4x8 1GB DRAM, 4x2 73GB disks
• 4 storage expansion slots (6 disks, 3 RU each)
• 4 LAN adapters, 1 8-port LAN switch
• 2 racks (4*8+4*3+1 RUs)

Page 41: Lecture 10: Parallelism & Clusters

Cost Comparison

Page 42: Lecture 10: Parallelism & Clusters

Disks across the Network

• Disks per processor box:
  – Advantage: cheaper
  – Disadvantage: no fault tolerance
• Disks across the network (e.g., Fibre Channel SAN)
  – Advantage: can organize in fault-tolerant groups (see later lecture)
  – Disadvantage: cost
    • Controllers, enclosure, cables, extra rack space

Page 43: Lecture 10: Parallelism & Clusters

Cost Comparison with SAN-based Disks

Page 44: Lecture 10: Parallelism & Clusters

Total Cost of Ownership

• Software: OS, database server, …
  – Most companies charge one license per processor (or box)
  – E.g., Win2K for 1-4 CPUs $800, for 1-8 CPUs $3,295
  – E.g., SQL Server for 1 CPU $16,000
• HW maintenance
• Rack space rental + power supply
  – $800-$1,500 per month
• Internet access
• Operator
  – $100,000 per year
• Backups
  – Tapes, devices, etc.
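
To put these figures in perspective (a rough comparison added here, using the numbers above and the earlier hardware estimates): over 3 years, an operator at $100,000/year costs $300,000 and rack space at $800-$1,500/month adds roughly $30,000-$54,000 per rack, so operations alone already exceed the ~$170,000-$190,000 hardware cost of the one-rack clusters sketched earlier, before counting per-processor software licenses, maintenance, Internet access, and backups.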

Page 45: Lecture 10: Parallelism & Clusters

3-year Ownership Cost Comparison

Page 46: Lecture 10: Parallelism & Clusters

Some General Observations

• Cost of HW is typically less than cost of ownership
  – Software, maintenance, backups, …
• Space and power consumption can be important for clusters
  – Same as with embedded computing ☺
• Smaller nodes are typically cheaper and faster
  – Simpler & faster processors
  – Less sharing of resources within each node
• Some integration is still beneficial
  – 2-way to 4-way SMPs lead to space and cost savings
• Heterogeneous clusters are often a necessity
  – Build the cluster incrementally from available components