Extreme Computing: Infrastructure
Stratis D. Viglas
School of Informatics, University of Edinburgh
Stratis D. Viglas www.inf.ed.ac.uk
Infrastructure Parallelism and parallel/concurrent programming
Outline
Infrastructure
Overview
Distributed file systems
Replication and fault tolerance
Virtualisation
Parallelism and parallel/concurrent programming
Services
The world is going parallel
• To be fair, the world was always parallel
• We just didn’t comprehend it, didn’t pay attention to it, or we were not
exposed to the parallelism
• Parallelism used to be elective
• There were tools (both software and hardware) for parallelism and we
could choose whether to use them or not
• Or, special problems that were a better fit for parallel solutions
• Parallelism is now enforced
• Massive potential for parallelism at the infrastructure level
• Application programmers are forced to think in parallel terms
• Multicore chips; parallel machines in your pocket
Implicit vs. explicit parallelism
• High-level classification of parallel architectures
• Implicit: there but you do not see it
• Most of the work is carried out at the architecture level
• Pipelined instruction execution is a typical example
• Explicit: there and you can (and had better) use it
• Parallelism potential is exposed to the programmer
• When applications are not implemented in parallel, we’re not making
best use of the hardware
• Multicore chips and parallel programming frameworks (e.g., MPI, MapReduce)
Implicit parallelism through superscalar execution
• Issue a varying number of instructions per clock tick
• Static scheduling
  • Compiler techniques to identify parallelism potential
  • In-order execution (i.e., instructions are not reordered)
• Dynamic scheduling: instruction-level parallelism (ILP)
  • Let the CPU (and sometimes the compiler) examine the next few hundreds of instructions and identify parallelism potential
  • Schedule them in parallel as operands/resources become available
  • There are plenty of registers; use them to eliminate dependencies
  • Execute instructions out of order
    • Change the order of execution to one that is more inherently parallel
• Speculative execution
  • Say there is a branch instruction, but which branch will be taken is unknown at issue time
  • Maintain statistics and start executing one branch
Pipelining and superscalar execution
[Figure: pipeline diagrams over cycles 1–8 for instructions i to i+4. Stages: IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), WB (Write Back). Top: standard pipelining, one instruction issued per cycle. Bottom: a 2-issue superscalar architecture issuing one integer and one floating-point instruction per cycle.]
ILP, data dependencies, and hazards
• CPU and compiler must preserve program order: the order in which the instructions would be executed if the original program were executed sequentially
• Study the dependencies among the program's instructions
• Data dependencies are important
  • They indicate the possibility of a hazard
  • They determine the order in which computations should take place
  • Most importantly: they set an upper bound on how much parallelism can possibly be exploited
• Goal: exploit parallelism by preserving program order only where it affects the outcome of the program
Speculative execution
• Greater ILP: overcome control dependencies by speculating in hardware on the outcome of branches and executing the program as if the guess were correct
• Dynamic scheduling: fetch and issue instructions
• Speculation: fetch, issue, and execute a stream of instructions as if branch predictions were correct
• Different predictors
  • Branch predictor: outcome of branching instructions
  • Value predictor: outcome of certain computations
  • Prefetching: lock on to memory access patterns and fetch data/instructions ahead of time
• But predictors make mistakes
  • Upon a mistake, we need to empty the pipeline(s) of all erroneously predicted data/instructions
• Power consumption: more circuitry on the chip means more power, which means more heat
Explicit parallelism
[Figure: a stream of operations, with throughput measured in operations per cycle and latency in cycles.]
• Parallelism potential is exposed to software
  • Both to the compiler and the programmer
• Various different forms
  • From loosely coupled multiprocessors to tightly coupled very long instruction word (VLIW) architectures
• Little's law: parallelism = throughput × latency
  • To maintain a throughput of T operations per cycle when each operation has a latency of L cycles, we need to execute T × L independent operations in parallel
• To maintain fixed parallelism
  • Decreased latency results in increased throughput
  • Decreased throughput allows increased latency
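Little's law as stated above can be written directly as a one-line function; the function name and example numbers are illustrative, not from the slides:

```python
# Little's law for parallel execution: to sustain a throughput of T
# operations per cycle when each operation takes L cycles, T * L
# independent operations must be in flight simultaneously.

def required_parallelism(throughput_per_cycle: float, latency_cycles: float) -> float:
    """Independent operations needed in flight (Little's law)."""
    return throughput_per_cycle * latency_cycles

# Example: sustaining 4 operations/cycle when each takes 5 cycles
print(required_parallelism(4, 5))  # 20 independent operations needed
```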
Types of parallelism
[Figure: four timelines illustrating the types of parallelism.]
• Pipelined parallelism: different operations in different stages
• Data-level parallelism: same computation over different data
• Instruction-level parallelism: no two threads performing the same computation at the same time
• Thread-level parallelism: different threads performing different computations
Single Instruction Multiple Data (SIMD) architecture
• Precursor to massively data-parallel architectures
• Central controller broadcasts instructions to multiple processing
elements (PEs)
• One controller for the whole array
• One copy of the program
• All computations fully synchronised
[Figure: an array controller broadcasts control to a row of processing elements (PEs), each with its own memory (Mem); the PEs exchange data over an inter-PE connection network.]
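A toy sketch of the SIMD execution model, purely for illustration (no real SIMD hardware involved): one "controller" broadcasts the same instruction to every processing element, which applies it to its own data in lock-step.

```python
# SIMD model in miniature: one instruction, many data elements.
# Each list element stands in for the local data of one PE.

def simd_apply(instruction, lanes):
    """Broadcast one instruction to all processing elements (lanes)."""
    return [instruction(x) for x in lanes]

# Example: every PE squares its local value in the same "cycle"
print(simd_apply(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```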
Massively Parallel Processors (MPP)
• Message passing between microprocessors (µP) through a network
interface (NI)
• Architecture scales to thousands of nodes
• Data layout must be handled by software
• Any two nodes can exchange data, but only through explicit request/reply messages
• Message passing has high overhead
  • Invoke the OS on each message
  • Access to the NI is also expensive
  • Sending is cheap, receiving is expensive (poll the NI or invoke the OS through an interrupt)
[Figure: microprocessors (µP), each with its own memory (Mem) and network interface (NI), connected through an interconnect network.]
Shared memory multiprocessors
• Just what the name suggests
• Will work with any data placement; any memory location is
accessible by any processing node
• Standard load and store instructions used to communicate data
• No OS involvement
• Low software overhead
• Only some special synchronisation primitives
• Shared memory is usually implemented as physically distributed
memory modules
• Each processor has its cache, so coherency is an issue
• Snooping: broadcast data requests to all processors
• Processors snoop to see if they have a copy of data and act accordingly
• Works well if there is a common bus
• Dominates in small-scale machines
• Directory-based: keep track of what is stored where in a directory
• Directory is usually distributed for scalability
• Point-to-point requests to handle inconsistencies
• Scales better than snooping
Multicore chips
[Figure: a multicore chip with four CPUs, each with a private L1 cache and CPU interface; a snooping unit and a shared L2 cache; an interrupt distributor handling hardware and private interrupts (IRQ); and connections to the northbridge (memory I/O) and southbridge (peripherals I/O).]
• Also known as chip multiprocessors
• Multiple processors, their cache memories, and the interconnect on the same die
• Latencies for chip-to-chip communication drastically reduced
Enforced parallelism
• The era of programmers not caring what's under the hood is over
• To gain performance, we need to understand the infrastructure
• Different software design decisions based on the architecture
• But without getting too close to the processor
  • Problems with portability
• Common set of problems faced, regardless of the architecture
  • Concurrency, synchronisation, type of parallelism
Concurrent programming
• Sequential program: a single thread of control that executes one instruction and, when it is finished, moves on to the next one
• Concurrent program: a collection of autonomous sequential threads executing logically in parallel
• The physical execution model is irrelevant
  • Multiprogramming: multiplexed threads on a uniprocessor
  • Multiprocessing: multiplexed threads on a multiprocessor system
  • Distributed processing: multiplexed processes on different machines
• Concurrency is not only parallelism
  • Difference between logical and physical models
  • Interleaved concurrency: logically simultaneous processing
  • Parallelism: physically simultaneous processing
Reasons for concurrent programming
• Natural application structure
  • The world is not sequential; easier to program in terms of multiple and independent activities
• Increased throughput and responsiveness
  • No reason to block an entire application due to a single event
  • No reason to block the entire system due to a single application
• Enforced by hardware
  • Multicore chips are the standard
• Inherent distribution of contemporary large-scale systems
  • Single application running on multiple machines
  • Client/server, peer-to-peer, clusters
• Concurrent programming introduces problems
  • Need multiple threads for increased performance without compromising the application
  • Need to synchronise concurrent threads for correctness and safety
Synchronisation
• To increase throughput we interleave threads
• Not all interleavings of threads yield acceptable and correct programs
• Correctness: same end effect as if all threads were executed sequentially
• Synchronisation serves two purposes
  • Ensure safety of shared state updates
  • Coordinate the actions of threads
• So we need a way to restrict the interleavings explicitly at the software level
Thread safety
• Multiple threads access a shared resource simultaneously
• Access is safe if and only if
  • Accesses have no effect on the state of the resource (for example, reading the same variable)
  • Accesses are idempotent (for example, y = x²)
  • Accesses are mutually exclusive
    • In other words, accesses are serialised
[Figure: the "too much milk" problem, 3:05 to 3:50. You arrive home, look in the fridge, see no milk, go to the supermarket, buy milk, and put it in the fridge; your roommate does the same, with the two sequences interleaved. Resource not protected: too much milk.]
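The "too much milk" scenario above is a classic check-then-act race: both parties see an empty fridge before either has bought milk. A minimal sketch of the fix, holding a lock across the check and the action so they form one atomic unit (variable names are illustrative):

```python
# Check-then-act made atomic: without the lock, two threads can both
# observe an empty fridge and both buy milk.
import threading

fridge = []                      # shared resource
fridge_lock = threading.Lock()

def buy_milk_if_needed() -> None:
    with fridge_lock:            # "look in fridge" and "buy milk" are
        if not fridge:           # now a single critical section
            fridge.append("milk")

threads = [threading.Thread(target=buy_milk_if_needed) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(fridge)  # ['milk'] -- exactly one carton, never two
```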
Mutual exclusion and atomicity
• Prevent more than one thread from accessing a critical section at a
given time
• Once a thread is in the critical section, no other thread can enter the
critical section until the first thread has left the critical section
• No interleavings of threads within the critical section
• Serialised access
• Critical sections are executed as atomic units
• A single operation, executed to completion
• Can be arbitrary in size (e.g., a whole block of code, or an increment to a variable)
• Multiple ways to implement this
• Semaphores, mutexes, spinlocks, copy-on-write, . . .
• For instance, the synchronized keyword in Java ensures only one
thread accesses the synchronised block
• At the end of the day, it’s all locks around a shared resource
• Shared or no lock to read resource (multiple threads can have it)
• Exclusive lock to update resource (there can be only one)
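A minimal sketch of mutual exclusion around a critical section, here an increment to a shared variable, one of the examples mentioned above. Without the lock, the read-modify-write of `counter += 1` could interleave across threads and lose updates:

```python
# Many threads increment a shared counter; the lock serialises the
# critical section so no update is lost.
import threading

counter = 0
counter_lock = threading.Lock()

def increment(times: int) -> None:
    global counter
    for _ in range(times):
        with counter_lock:       # critical section, executed atomically
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- no lost updates
```

This is the same guarantee the `synchronized` keyword gives in Java: at most one thread at a time executes the protected block.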
Potential concurrency problems
• Deadlock: two or more threads stop and wait for each other
  • Usually caused by a cycle in the lock graph
• Livelock: two or more threads continue to execute but make no real progress toward their goal
  • Example: each thread in a pair of threads undoes what the other thread has done
  • Real-world example: the awkward walking shuffle; walking towards someone, you both try to get out of each other's way but end up choosing the same path
• Starvation: some thread gets deferred forever
  • Each thread should get a fair chance to make progress
• Race condition: a possible interleaving of threads results in an undesired computation result
  • Two threads access a variable simultaneously and one access is a write
  • A mutex usually solves the problem; contemporary hardware has the compare-and-swap atomic operation
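Since deadlock needs a cycle in the lock graph, the standard cure is to give all locks a global order and have every thread acquire them in that order. A sketch, with illustrative names; acquiring the locks in opposite orders in the two threads would risk deadlock:

```python
# Two threads, two locks. Both threads acquire lock_a before lock_b,
# so no cycle can form in the lock graph and deadlock is impossible.
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def task_one() -> None:
    with lock_a:                 # global order: a before b
        with lock_b:
            pass                 # work on both shared resources

def task_two() -> None:
    with lock_a:                 # same order here, even if this thread
        with lock_b:             # "logically" wants b first
            pass

t1 = threading.Thread(target=task_one)
t2 = threading.Thread(target=task_two)
t1.start(); t2.start()
t1.join(); t2.join()
print("both threads finished")   # no circular wait, so both terminate
```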
Parallel performance analysis
• Extent of parallelism in an algorithm
  • A measure of how parallelisable an algorithm is and what we can expect in terms of its ideal performance
• Granularity of parallelism
  • A measure of how well the data/work of the computation is distributed across the processors of the parallel machine
  • The ratio of computation over communication
  • Computation stages are typically separated from communication by synchronisation events
• Locality of computation
  • A measure of how much communication needs to take place
Parallelism extent: Amdahl’s law
[Figure: two timelines. Top: a program with a 25-second sequential part, a 50-second parallelisable part, and another 25-second sequential part, 100 seconds in total. Bottom: the same program with the parallel part reduced to 10 seconds by parallel execution, 60 seconds in total.]
• Program speedup is defined by the fraction of code that can be parallelised
• In the example, speedup = sequential time / parallel time = 100 seconds / 60 seconds ≈ 1.67
• speedup = 1 / ((1 − p) + p/n)
  • p: fraction of work that is parallelisable
  • n: number of parallel processors
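Amdahl's law as stated above transcribes directly into code; the function name is illustrative:

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
# parallelisable fraction of the work and n the number of processors.

def amdahl_speedup(p: float, n: int) -> float:
    """Ideal speedup on n processors when fraction p is parallelisable."""
    return 1.0 / ((1.0 - p) + p / n)

# The slide's example: half the 100 seconds is parallelisable (p = 0.5)
# and the parallel part shrinks from 50 s to 10 s, i.e. n = 5.
print(round(amdahl_speedup(0.5, 5), 2))          # 1.67
# As n grows, speedup approaches 1 / (1 - p) = 2 for p = 0.5
print(round(amdahl_speedup(0.5, 1_000_000), 2))  # 2.0
```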
Implications of Amdahl’s law
• Speedup tends to 1/(1 − p) as the number of processors tends to infinity
• Parallel programming is only worthwhile when programs have a lot ofwork that is parallel in nature (so called embarrassingly parallel)
[Figure: speedup vs. number of processors. Linear speedup is shown as a reference; typical speedup is less than linear; super-linear speedup is possible due to caching and locality effects.]
Fine vs. coarse granularity
Fine-grain parallelism
• Low computation to
communication ratio
• Small amounts of computation
between communication stages
• Less opportunity for
performance enhancement
• High communication overhead
Coarse-grain parallelism
• High computation to
communication ratio
• Large amounts of computation
between communication stages
• More opportunity for
performance enhancement
• Harder to balance efficiently
Load balancing
• Processors finishing early have to wait for the processor with the largest amount of work to complete; this leads to idle time and lower utilisation
• Static load balancing: the programmer makes decisions and fixes a priori the amount of work distributed to each processor
  • Works well for homogeneous parallel processing
    • All processors are the same
    • Each core has an equal amount of work
  • Not so well for heterogeneous parallel processing
    • Some processors are faster than others
    • Difficult to distribute work evenly
• Dynamic load balancing: when one processor finishes its allocated work it takes work from the processor with the heaviest workload
  • Also known as task stealing
  • Ideal for uneven workloads and heterogeneous systems
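A simplified sketch of dynamic load balancing: instead of fixing assignments a priori, workers pull the next task from a shared queue whenever they finish, so faster workers naturally absorb more of an uneven workload. (Real task stealing uses per-worker deques with idle workers stealing from busy ones; the shared queue here is a simplification.)

```python
# Workers draw tasks from a shared thread-safe queue until it is empty.
import queue
import threading

tasks: "queue.Queue[int]" = queue.Queue()
for cost in [5, 1, 1, 1, 8, 1, 1, 2]:   # uneven task costs
    tasks.put(cost)

done = []
done_lock = threading.Lock()

def worker() -> None:
    while True:
        try:
            cost = tasks.get_nowait()   # take the next available task
        except queue.Empty:
            return                      # no work left: stop
        with done_lock:
            done.append(cost)           # record the completed task

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sum(done))  # 20 -- all work completed, split dynamically
```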
Communication modes
• Processors need to exchange data and/or control messages
  • Control messages are usually short, data messages longer
• Computation and communication are concurrent operations
  • Similar to instruction pipelining
• Synchronous communication: the sender is notified when the message is received
• Asynchronous communication: the sender only knows the message was sent
• Blocking communication: the sender waits until the message is fully transmitted; the receiver waits until the message is fully received
• Non-blocking communication: processing continues even if the message has not been transmitted
• Depending on the platform, the programmer might be exposed to the communication mechanism and have to orchestrate it
  • Point-to-point, broadcast/reduce, all-to-all, scatter/gather
  • Radically different programming paradigms
Communication cost model
• C = f · (o + l + n/(m·B) + t − overlap)
  • C: cost of communication
  • f: frequency of message exchange
  • o: overhead per message
  • l: latency per message
  • n: total amount of data sent
  • m: number of messages exchanged
  • B: communication bandwidth
  • t: contention per message
  • overlap: latency hidden by overlapping communication with computation through concurrency
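The cost model above transcribes directly into a function; the example numbers are hypothetical, purely for illustration:

```python
# Communication cost model: C = f * (o + l + n/(m*B) + t - overlap).
# n/m is the data per message, so n/(m*B) is its transmission time.

def communication_cost(f: float, o: float, l: float, n: float,
                       m: float, B: float, t: float, overlap: float) -> float:
    """Total communication cost C for f message exchanges."""
    return f * (o + l + n / (m * B) + t - overlap)

# Hypothetical numbers: 100 exchanges, 2 us overhead and 5 us latency per
# message, 1e6 bytes over 100 messages at 1e3 bytes/us, 1 us contention,
# 3 us hidden by overlapping communication with computation.
print(communication_cost(f=100, o=2, l=5, n=1e6, m=100, B=1e3, t=1, overlap=3))
# 1500.0
```

The `overlap` term shows why overlapping communication with computation matters: every microsecond of hidden latency is subtracted from each message's cost, multiplied by the exchange frequency.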
Locality
• Ensure that each processor is close to the data it processes
  • Move processing to data, not data to processing
• Data transfer is either a known cost or a hidden cost
  • Known if we explicitly move data; hidden if data needs to be transferred due to a high-level programming interface
• A good task scheduler will minimise data movement
• Possible scenario: parallel computation is serialised due to bus contention and lack of bandwidth
  • Distribute data across processors to relieve contention and increase effective bandwidth
  • Also known as partitioning, or declustering
Example architectures
• Uniform access
  • Centrally located store; processors are equidistant
  • Typical example: physical memory of a shared memory multiprocessor system
• Non-uniform access
  • Physically partitioned data, but accessible by all
  • Processors see the same address space, but placement of data affects performance
  • Typical example: Non-Uniform Memory Access (NUMA) architectures for multicore systems