CS591x - Cluster Computing and Parallel Programming
Parallel Computer Architecture and Software Models
It's all about performance
Greater performance is the reason for parallel computing
Many types of scientific and engineering programs are too large and too complex for traditional uniprocessors
Such large problems are common in ocean modeling, weather modeling, astrophysics, solid state physics, power systems…
FLOPS – a measure of performance
FLOPS – Floating Point Operations per Second… a measure of how much computation can be done in a certain amount of time
MegaFLOPS – MFLOPS – 10^6 FLOPS
GigaFLOPS – GFLOPS – 10^9 FLOPS
TeraFLOPS – TFLOPS – 10^12 FLOPS
PetaFLOPS – PFLOPS – 10^15 FLOPS
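As a rough illustration, a machine's FLOPS rate can be estimated by timing a known number of floating-point operations. Here is a minimal, unscientific sketch in C (the loop size and method are illustrative only; real benchmarks such as LINPACK are far more careful):

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    const long n = 100000000;        /* 10^8 floating-point additions */
    double sum = 0.0;
    clock_t start = clock();
    for (long i = 0; i < n; i++)
        sum += 1.0e-8;               /* one FLOP per iteration */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    /* printing sum keeps the compiler from deleting the loop */
    printf("%.1f MFLOPS (sum = %f)\n", n / secs / 1.0e6, sum);
    return 0;
}
```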
How fast …
Cray 1 – ~150 MFLOPS
Pentium 4 – 3-6 GFLOPS
IBM's BlueGene – 70+ TFLOPS
PSC's Big Ben – 10 TFLOPS
Humans – it depends:
 as calculators – 0.001 MFLOPS
 as information processors – 10 PFLOPS
FLOPS vs. MIPS
FLOPS is only concerned with floating point calculations
Other performance issues:
 memory latency
 cache performance
 I/O capacity
 …
See…
www.Top500.org – biannual performance reports and rankings of the fastest computers in the world
Performance
Speedup(n processors) = time(1 processor)/time(n processors)
** Culler, Singh and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
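A worked example of the formula in C, with made-up timings (the 120 s and 18 s figures are hypothetical placeholders, not measurements from the slides):

```c
#include <stdio.h>

int main(void) {
    double t1 = 120.0;  /* hypothetical runtime on 1 processor (seconds)  */
    double tn = 18.0;   /* hypothetical runtime on 8 processors (seconds) */
    int n = 8;
    printf("speedup(%d processors) = %.2f\n", n, t1 / tn);  /* 6.67 */
    return 0;
}
```

Note the speedup (6.67) is less than the processor count (8); the next slides explain why that gap is typical.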
Consider…
from: www.lib.utexas.edu/maps/indian_ocean.html
… a model of the Indian Ocean:
 73,000,000 square kilometers
 one data point per 100 meters: 7,300,000,000 surface points
 model the ocean at depth, say every 10 meters up to 200 meters: 20 depth data points
 every 10 minutes for 4 hours: 24 time steps
So –
73 × 10^6 (sq. km of surface) × 10^2 (points per sq. km) × 20 (depth points) × 24 (time steps) = 3,504,000,000,000 data points in the model grid
Suppose 100 instructions per grid point: 350,400,000,000,000 instructions in the model
Then -
Imagine that you have a computer that can run 1 billion (10^9) instructions per second:
3.504 × 10^14 / 10^9 = 350,400 seconds, or about 97 hours
But –
On a 10 teraflops computer:
3.504 × 10^14 / 10^13 = 35.04 seconds
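The whole estimate fits in a few lines of C. This sketch just reproduces the slide's arithmetic (like the slide, it treats instructions per second and FLOPS interchangeably):

```c
#include <stdio.h>

int main(void) {
    double points = 73e6   /* sq. km of ocean surface */
                  * 1e2    /* points per sq. km       */
                  * 20     /* depth levels            */
                  * 24;    /* time steps              */
    double instrs = points * 100.0;   /* 100 instructions per grid point */
    printf("grid points:  %.4g\n", points);              /* 3.504e12 */
    printf("instructions: %.4g\n", instrs);              /* 3.504e14 */
    printf("at 10^9 instr/s:  %.0f s (~%.0f hours)\n",
           instrs / 1e9, instrs / 1e9 / 3600.0);         /* ~97 hours */
    printf("at 10 TFLOPS:     %.1f s\n", instrs / 1e13); /* ~35 s     */
    return 0;
}
```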
Gaining performance
Pipelining:
 execute more instructions, faster
 more instructions in execution at the same time in a single processor
 not usually an attractive strategy these days – why?
Instruction Level Parallelism (ILP)
Based on the fact that many instructions do not depend on the instructions before them…
Processor has extra hardware to execute several instructions at the same time… multiple adders…
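A minimal sketch of the kind of independence ILP hardware exploits (the variable names are illustrative only):

```c
#include <stdio.h>

int main(void) {
    double x = 1, y = 2, u = 3, v = 4;
    double a = x + y;   /* these two adds have no data dependence, so  */
    double b = u + v;   /* a superscalar CPU can issue both at once    */
    double c = a * b;   /* the multiply depends on both, so it waits   */
    printf("%f\n", c);
    return 0;
}
```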
Pipelining and ILP not the solution to our problem – why?
Only incremental improvements in performance
Already been done
We need orders-of-magnitude improvements in performance
Gaining Performance
Vector Processors
Scientific and engineering computations are often vector and matrix operations
 e.g. graphic transformations – shift object x to the right
Redundant arithmetic hardware and vector registers operate on an entire vector in one step (SIMD)
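For a concrete picture, here is an ordinary C loop of the kind a vector unit handles in one step. The loop itself is plain scalar C; the assumption is that a vector processor or vectorizing compiler maps it onto SIMD hardware:

```c
#include <stdio.h>

#define N 8

int main(void) {
    float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float z[N];
    for (int i = 0; i < N; i++)   /* one SIMD instruction can apply    */
        z[i] = x[i] + y[i];       /* this add to the whole vector      */
    for (int i = 0; i < N; i++)
        printf("%g ", z[i]);      /* prints 9 nine times? no: all 9s   */
    printf("\n");
    return 0;
}
```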
Gaining Performance
Vector Processors
Declining popularity for a while – hardware expensive
Popularity returning:
 applications – science, engineering, cryptography, media/graphics
 Earth Simulator
Parallel Computer Architecture
Shared Memory Architectures
Distributed Memory Architectures
Shared Memory Systems
Multiple processors connected to, and sharing, the same pool of memory
SMP – symmetric multiprocessing
Every processor has, potentially, access to and control of every memory location
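As a sketch of what programming this model looks like, here is a small OpenMP example (OpenMP is one common shared-memory API; these slides don't prescribe it). Every thread reads and writes the same array, because all of memory is shared:

```c
/* compile with: cc -fopenmp example.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    #pragma omp parallel for      /* threads split the iterations;     */
    for (int i = 0; i < N; i++)   /* all of them see the same a[]      */
        a[i] = 2.0 * i;
    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```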
Shared Memory Computers
[Diagram: several processors connected to a single shared memory]
Shared Memory Computers
[Diagram: multiple processors connected to multiple memory modules]
Shared Memory Computer
[Diagram: multiple processors connected to multiple memory modules through a switch]
Shared Memory Computers
SGI Origin2000 (Balder) at NCSA:
 256 250 MHz R10000 processors
 128 Gbytes of memory
Shared Memory Computers
Rachel at PSC:
 64 1.15 GHz EV7 processors
 256 Gbytes of shared memory
Distributed Memory Systems
Multiple processors, each with their own memory
Interconnected to share/exchange data and processing
The modern architectural approach to supercomputers
Supercomputers and clusters are similar
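As a sketch of programming this model, here is a minimal MPI example (MPI is the usual message-passing API for clusters, though this slide doesn't name it). Each process has its own private copy of `value`; data moves between memories only by explicit messages:

```c
/* compile with mpicc, run with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;   /* exists only in rank 0's memory until sent */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```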
Clusters – distributed memory
[Diagram: six processor/memory pairs connected by an interconnect]
Cluster – Distributed Memory with SMP
[Diagram: SMP nodes, each with two processors (Proc1, Proc2) sharing a local memory, connected by an interconnect]
Distributed Memory Supercomputer
BlueGene/L (DOE/IBM):
 0.7 GHz PowerPC 440
 32,768 processors
 70 Teraflops
Distributed Memory Supercomputer
Thunder at LLNL:
 number 5 on the Top500 list
 20 Teraflops
 4,096 1.4 GHz Itanium processors
Grid Computing Systems
What is a Grid?
 Means different things to different people

Distributed processors:
 around campus
 around the state
 around the world
Grid Computing Systems
Widely distributed
Loosely connected (e.g. over the Internet)
No central management
Grid Computing Systems
Connected clusters / other dedicated scientific computers
[Diagram: clusters interconnected over I2/Abilene]
Grid Computing Systems
[Diagram: a control/scheduler coordinating machines across the Internet; idle cycles are harvested]
Grid Computing Systems
Dedicated Grids:
 TeraGrid
 Sabre
 NASA Information Power Grid
Cycle Harvesting Grids:
 Condor
 *GlobalGridForum (Parabon)
 Seti@home
Let’s revisit speedup…
We can achieve speedup (theoretically) by using more processors…
…but a number of factors may limit speedup:
 interprocessor communication
 interprocess synchronization
 load balance
Amdahl’s Law
According to Amdahl's Law:
 Speedup = 1 / (S + (1 - S)/N)
where S is the purely sequential part of the program and N is the number of processors
Amdahl's Law
What does it mean?
 Part of a program is parallelizable
 Part of the program must be sequential (S)
Amdahl's Law says:
 speedup is constrained by the portion of the program that must remain sequential, relative to the part that is parallelized

Note: if S is very small, the problem is "embarrassingly parallel"
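A short C sketch that evaluates the formula for a few hypothetical values of S shows how hard the ceiling is:

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (S + (1 - S)/N) */
double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double fracs[] = {0.5, 0.1, 0.01};   /* sequential fraction S */
    int n = 1000;                        /* processors            */
    for (int i = 0; i < 3; i++)
        printf("S = %.2f: speedup on %d processors = %.1f\n",
               fracs[i], n, amdahl(fracs[i], n));
    /* Even with S = 0.01, speedup is capped near 1/S = 100,
       no matter how many processors are added. */
    return 0;
}
```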
Software models for parallel computing
Shared Memory
Distributed Memory
Data Parallel
Flynn’s Taxonomy
Single Instruction/Single Data - SISD
Multiple Instruction/Single Data - MISD
Single Instruction/Multiple Data - SIMD
Multiple Instruction/Multiple Data - MIMD
Single Program/Multiple Data - SPMD
Next
Cluster Computer Architecture
Linux