Parallel Processing
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings
Parallel Processing 2
Objective
To present the most prominent approaches
to parallel computer organization.
Parallel Processing 3
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 4
Taxonomy
Flynn's taxonomy classifies machines by the number of instruction streams
being executed simultaneously and the number of data streams being
processed simultaneously:

                     instruction streams
                     1          many
data streams   1     SISD       MISD
            many     SIMD       MIMD
Parallel Processing 5
Single Instruction, Single Data
SISD
Single processor
Single instruction stream
Data stored in single memory
Uni-processor (von Neumann architecture)
[Figure: a control unit sends an instruction stream to a processing unit, which exchanges a data stream with a memory unit.]
Parallel Processing 6
Single Instruction, Multiple Data
SIMD
A single machine instruction controls simultaneous execution.
Multiple processing elements
Each processing element has
associated data memory
The same instruction
executed by all processing
elements but on different set
of data.
Main subclasses: vector and
array processors.
[Figure: one control unit broadcasts a single instruction stream to multiple processing elements, each exchanging its own data stream with its own memory unit.]
Parallel Processing 7
Multiple Instruction, Single Data
MISD
Sequence of data transmitted to a set of processors.
Each processor executes a different instruction
sequence on different “parts” of the same data sequence.
Has never been implemented.
Parallel Processing 8
Multiple Instruction, Multiple Data
MIMD
Set of processors
simultaneously
execute different
instruction sequences.
Different sets of data.
Main subclasses:
multiprocessors and
multicomputers
[Figure: multiple control units, each issuing an instruction stream to its own processing unit; all processing units exchange data streams with a single shared memory unit - a multiprocessor.]
Parallel Processing 9
Multiple Instruction, Multiple Data
MIMD
[Figure: multiple control units, each issuing an instruction stream to its own processing unit with its own memory unit; the nodes communicate over an interconnection network - a multicomputer.]
Parallel Processing 10
Taxonomy tree
Processor organizations
  Single instruction, single data stream (SISD)
  Single instruction, multiple data stream (SIMD)
    Vector processor
    Array processor
  Multiple instruction, single data stream (MISD)
  Multiple instruction, multiple data stream (MIMD)
    Multiprocessor - shared memory (tightly coupled)
      Symmetric multiprocessor (SMP)
      Nonuniform memory access (NUMA)
    Multicomputer - distributed memory (loosely coupled)
      Clusters
Parallel Processing 11
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Vector Computation
Parallel Processing 12
MIMD - Overview
Set of general purpose processors.
Each can execute all the necessary instructions.
Further classified by method of processor
communication.
Parallel Processing 13
Communication Models
Multiprocessors
All CPUs are able to execute all the necessary
instructions.
All access the same physical shared memory
All share the same address space.
Communication through shared memory via
LOAD/STORE instructions → tightly coupled.
Simple programming model.
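As a rough sketch of the tightly-coupled model, the Python threads below share one address space and communicate purely through ordinary loads and stores to a shared list (the names `shared` and `worker` are illustrative, not from the slides):

```python
import threading

# Shared "memory": every thread addresses the same list, so threads
# communicate through plain loads and stores (tightly coupled).
shared = [0] * 16
lock = threading.Lock()   # ordering of concurrent stores still needs care

def worker(section):
    # Each thread fills its own 4-element section of the shared array.
    for i in range(section * 4, section * 4 + 4):
        with lock:
            shared[i] = i * i

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared[:4])   # [0, 1, 4, 9]
```

No explicit message is ever sent: each thread simply writes its results where the others can read them, which is what makes the model easy to program.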
Parallel Processing 14
Communication Models
Multiprocessors (example)
a) Multiprocessor with 16 CPUs sharing a common memory.
b) Memory in 16 sections; each one processed by one processor.
Parallel Processing 15
Communication Models
Multicomputers
Each CPU has a private memory → distributed
memory system.
Each CPU has a particular address space
Communication through send/receive primitives →
loosely coupled system.
More complex programming model
Parallel Processing 16
Communication Models
Multicomputers (example)
Multicomputer with 16 CPUs each with its own private memory
Image (see previous figure) distributed among the 16 CPUs
Parallel Processing 17
Communication Models
Multiprocessors vs. Multicomputers
Multiprocessors:
Potentially easier to program
Building a shared memory for hundreds of CPUs is not easy → not
scalable.
Memory contention is a potential performance bottleneck.
Multicomputers:
More difficult to program.
Building multicomputers with thousands of CPUs is not difficult →
scalable.
Parallel Processing 18
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Vector Computation
Parallel Processing 19
Symmetric Multiprocessors
A stand-alone computer with the following
characteristics:
Two or more similar processors of comparable capacity.
Processors share same memory and I/O.
Processors are connected by a bus or other internal
connection.
Memory access time is approximately the same for each
processor.
Parallel Processing 20
SMP Advantages
Performance
If some work can be done in parallel
Availability
Since all processors can perform the same functions, failure of a single
processor does not necessarily halt the system.
Incremental growth
User can enhance performance by adding additional processors.
Scaling
Vendors can offer range of products based on number of processors.
Parallel Processing 21
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Vector Computation
Parallel Processing 22
Time Shared Bus
Characteristics:
Simplest form.
Structure and interface similar to single processor system
Following features provided:
Addressing - distinguish modules on the bus.
Arbitration - any module can be a temporary master.
Time sharing - if one module has the bus, others must wait and may
have to suspend.
Now have multiple processors as well as multiple I/O
modules.
Parallel Processing 23
Time Shared Bus - SMP
Parallel Processing 24
Time Shared Bus
Advantages:
  Simplicity
  Flexibility
  Reliability
Disadvantages:
  Performance limited by bus cycle time.
  Each processor should have a local cache to reduce the number of bus accesses.
  This leads to cache coherence problems, solved in hardware (see later).
Parallel Processing 25
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 26
Cache Coherence Problem
[Figure: CPU A and CPU K, each with a private cache, attached to a shared bus and shared memory.]
1- CPU A reads data (miss)
2- CPU K reads the same data (miss)
3- CPU K writes (changes) the data (hit)
4- CPU A reads the data (hit) - outdated!
Parallel Processing 27
Snoopy Protocols
Cache controllers may have a snoop unit, which monitors the shared bus
to detect any coherence-relevant activity and acts so as to ensure
data coherence.
It increases bus traffic.
Parallel Processing 28
Snoopy Protocols
[Figure: CPU A and CPU K on the shared bus, both caching line x.]
1- CPU K writes (changes) the data (hit)
2- the write propagates to the shared memory
3- the snoop invalidates or updates the data in CPU A's cache
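The invalidate variant of snooping can be sketched as a toy Python model (a deliberate simplification: a single write-through cache level, no MESI states, and all class and variable names are invented for illustration):

```python
# Minimal write-invalidate snoop sketch: each cache holds
# {address: value}; a write broadcasts on the "bus", and every other
# cache's snooper drops its now-stale copy of the line.

class Cache:
    def __init__(self, bus):
        self.lines = {}
        bus.append(self)          # attach this cache's snooper to the bus
        self.bus = bus

    def read(self, memory, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, memory, addr, value):
        self.lines[addr] = value
        memory[addr] = value                  # write-through to memory
        for other in self.bus:                # snoop: invalidate others
            if other is not self:
                other.lines.pop(addr, None)

bus, memory = [], {0x10: 5}
cache_a, cache_k = Cache(bus), Cache(bus)
cache_a.read(memory, 0x10)        # 1- CPU A reads (miss)
cache_k.read(memory, 0x10)        # 2- CPU K reads (miss)
cache_k.write(memory, 0x10, 9)    # 3- CPU K writes; A's copy invalidated
print(cache_a.read(memory, 0x10)) # 4- A misses again, re-fetches 9, not 5
```

The final read by CPU A misses because the snooper dropped the line, so it picks up the current value instead of the stale one.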
Parallel Processing 29
MESI State Transition Diagram
Parallel Processing 30
L1-L2 Cache Consistency
L1 caches do not connect to the bus → do not
engage in the snoop protocol.
Simple solution:
L1 is “write-through”.
Updates and invalidations in L2 must be propagated to
L1.
Approaches for a write-back L1 exist → more
complex.
Parallel Processing 31
Cache Coherence
interconnections other than a shared bus
Directory Protocols
Collect and maintain information about copies of data in
cache.
Typically a central directory stored in main memory.
Requests are checked against directory.
Appropriate transfers are performed.
Creates central bottleneck.
Effective in large-scale systems with complex
interconnection schemes (according to Stallings).
Parallel Processing 32
Cache Coherence
Software Solutions
Compiler and operating system deal with the problem.
Overhead is transferred to compile time.
Design complexity is transferred from hardware to software.
However, software tends to make conservative decisions →
inefficient cache utilization.
Code is analyzed to determine safe periods for caching shared variables.
Combined HW+SW solutions exist.
Parallel Processing 33
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 34
Increasing Performance
MIPS rate = f × IPC,
where f is the processor clock frequency and IPC the average
number of instructions executed per cycle.
Pipelining and superscalar organization increase IPC.
Mechanisms to maximize the utilization of each pipeline
may be reaching a limit due to:
  complexity
  power consumption
Parallel Processing 35
Threads and Processes
Process:
An instance of a program running on a computer
Resource ownership, such as main memory,
I/O channels, I/O devices, and files
Scheduling/execution
Process switch
Parallel Processing 36
Threads and Processes
Thread:
A dispatchable unit of work within a process
Includes processor context (which includes the program counter and stack pointer) and data area for stack (to enable subroutine branching)
Threads within a process share resources
Threads are able to be executed independently!
Switching processor between threads within same process
Typically less costly than process switch
Parallel Processing 37
Implicit and Explicit Multithreading
Explicit:
  user-level threads, visible to the application program
  kernel-level threads, visible to the OS
  All commercial processors and most experimental ones use explicit multithreading.
Implicit:
  threads defined statically by the compiler or dynamically by the hardware.
Parallel Processing 38
Approaches to Explicit Multithreading
Interleaved (fine-grained)
Processor issues instruction(s) from a single thread at a time
Switching thread at each clock cycle
If thread is blocked it is skipped
Blocked (coarse-grained)
Processor issues instruction(s) from a single thread at a time
Thread executed until event causes delay e.g. cache miss
Effective on in-order processor
Avoids pipeline stall
Simultaneous (SMT)
Processor issues instructions from multiple threads at a time to execution units of superscalar processor
Chip multiprocessing
Processor is replicated on a single chip
Each processor handles separate threads
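The difference between the interleaved and blocked policies can be illustrated with a toy issue-order model in Python (assumed simplifications: single-issue, one instruction label per cycle, and a trailing `*` marks an instruction that triggers a latency event such as a cache miss):

```python
# Toy model: threads are lists of instruction labels; compare
# interleaved (switch every cycle) with blocked (switch on a stall).

def interleaved(threads):
    order, i = [], 0
    threads = [list(t) for t in threads]   # work on copies
    while any(threads):
        t = threads[i % len(threads)]
        if t:
            order.append(t.pop(0).rstrip('*'))
        i += 1                             # switch thread every cycle
    return order

def blocked(threads):
    order = []
    threads = [list(t) for t in threads]
    while any(threads):
        for t in threads:
            while t:                       # run until a stall occurs
                instr = t.pop(0)
                order.append(instr.rstrip('*'))
                if instr.endswith('*'):
                    break                  # latency event: switch thread
    return order

a = ['A1', 'A2*', 'A3']
b = ['B1', 'B2', 'B3']
print(interleaved([a, b]))  # ['A1', 'B1', 'A2', 'B2', 'A3', 'B3']
print(blocked([a, b]))      # ['A1', 'A2', 'B1', 'B2', 'B3', 'A3']
```

Interleaving alternates threads regardless of events, while the blocked policy stays with thread A until its stall on A2 and only then switches to B.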
Parallel Processing 39
Scalar Processor Approaches
Single-threaded scalar
Simple pipeline
No multithreading
Interleaved multithreaded scalar
Easiest multithreading to implement
Switch threads at each clock cycle
Pipeline stages kept close to fully occupied
Hardware needs to switch thread context between cycles
Blocked multithreaded scalar
Thread executed until latency event occurs
Would stop pipeline
Processor switches to another thread
Parallel Processing 40
Scalar Diagrams
[Figure: issue diagrams for the three scalar approaches; a stall caused by some dependency takes 3 time slots to clear.]
Parallel Processing 41
Superscalar Processor Approaches (1)
Superscalar
No multithreading
Interleaved multithreading superscalar:
Each cycle, as many instructions as possible issued from single thread
Delays due to thread switches eliminated
Number of instructions issued in cycle limited by dependencies
Blocked multithreaded superscalar
Instructions from one thread
Blocked multithreading used
Parallel Processing 42
Superscalar Diagrams (1)
Parallel Processing 43
Superscalar Processor Approaches (2)
Very long instruction word (VLIW), e.g. IA-64:
  Multiple instructions in a single word.
  Typically constructed by the compiler.
  Operations that may be executed in parallel are placed in the same word.
  May pad with no-ops.
Interleaved multithreading VLIW:
  Similar efficiencies to interleaved multithreading on a superscalar architecture.
Blocked multithreaded VLIW:
  Similar efficiencies to blocked multithreading on a superscalar architecture.
Parallel Processing 44
Superscalar Diagrams (2)
Parallel Processing 45
Superscalar Processor Approaches (3)
Simultaneous multithreading (SMT):
  Issues multiple instructions at a time.
  One thread may fill all horizontal slots.
  Instructions from two or more threads may be issued.
  With enough threads, the maximum number of instructions can be issued on each cycle.
Chip multiprocessor:
  Multiple processors replicated on one chip, each a two-issue superscalar processor.
  Each processor is assigned a thread.
  Can issue up to two instructions per cycle per thread.
Parallel Processing 46
Superscalar Diagrams (3)
Parallel Processing 47
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 48
Clusters
“A group of interconnected whole computers
working together as a unified computing resource
that can create the illusion of being one machine.”
Parallel Processing 49
Clusters
Benefits
Absolute scalability
Incremental scalability
High availability
Superior price/performance
Server applications
Each computer called a node
Parallel Processing 50
Cluster Configurations
Standby Server, No Shared Disk
Parallel Processing 51
Cluster Configurations
With Shared Disk
Parallel Processing 52
Clustering Methods
Method: Passive Standby
  Description: A secondary server takes over in case of primary server failure.
  Benefits: Easy to implement.
  Limitations: High cost because the secondary server is unavailable for other processing tasks.

Method: Active Secondary
  Description: The secondary server is also used for processing tasks.
  Benefits: Reduced cost because secondary servers can be used for processing.
  Limitations: Increased complexity.

Method: Separate Servers
  Description: Separate servers have their own disks. Data is continuously copied from the primary to the secondary server.
  Benefits: High availability.
  Limitations: High network and server overhead due to copying operations.

Method: Servers Connected to Disks
  Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server.
  Benefits: Reduced network and server overhead due to elimination of copying operations.
  Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

Method: Servers Share Disks
  Description: Multiple servers simultaneously share access to disks.
  Benefits: Low network and server overhead. Reduced risk of downtime caused by disk failure.
  Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology.
Parallel Processing 53
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 54
NUMA Systems
Terminology:
Uniform memory access (UMA):
  All processors have access to all parts of memory → load & store.
  Access time to all regions of memory is the same.
  Access time to memory is the same for all processors.
  As used by SMP.
Nonuniform memory access (NUMA):
  All processors have access to all parts of memory → load & store.
  Access time of a processor differs depending on the region of memory.
  Different processors access different regions of memory at different speeds.
Cache-coherent NUMA (CC-NUMA):
  Cache coherence is maintained among the caches of the various processors.
  Significantly different from SMP and clusters.
Parallel Processing 55
Motivation for NUMA
SMP has a practical limit to the number of processors.
Bus traffic limits it to between 16 and 64 processors.
In clusters each node has own memory
Apps do not see large global memory
Coherence maintained by software not hardware
NUMA retains SMP flavour while giving large scale
multiprocessing
e.g. Silicon Graphics Origin NUMA 1024 MIPS R10000 processors
Objective
to maintain transparent system wide memory while permitting
multiprocessor nodes, each with own bus or internal interconnection
system.
Parallel Processing 56
CC-NUMA Organization
Single addressable memory space.
Memory request order:
1. L1 cache (local to processor)
2. L2 cache (local to processor)
3. Main memory (local to node)
4. Remote memory (automatically and transparently to the processor)
Parallel Processing 57
NUMA Pros & Cons
Higher levels of parallelism than SMP, with minor software changes
Performance can break down if there is too much access to remote memory
Can be avoided by:
L1 & L2 cache design reducing all memory access
Need good temporal and spatial locality of software
Virtual memory management moving pages to nodes that are using them most
Keeping it cache coherent is costly in terms of HW and performance
Parallel Processing 58
Exercises
Exercise 1:
Consider a situation in which two processors (P1 and P2) in an SMP configuration, over time, require access to the same line of data from main memory. Both processors have a cache and use the MESI protocol. Initially both caches have an invalid copy of the line. Eventually, processor P1 reads line x. If this is the start of a sequence of accesses, complete the table below indicating the state-transitions of this line in the caches of both processors.
Accesses          State transition in cache 1   State transition in cache 2
P2 reads x        E→S                           I→S
P1 writes to x    S→M                           S→I
P1 writes to x    M→M                           I→I
P2 reads x        M→S                           I→S
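The transitions in the table can be checked against a toy two-cache, one-line MESI model (a sketch only: data values and the writeback a Modified line performs on a remote read are not modeled, and the function names are invented):

```python
# Toy MESI model for one line in two caches (states 'M', 'E', 'S', 'I').

def mesi_read(state, i):
    j = 1 - i
    if state[i] == 'I':                      # read miss goes on the bus
        state[i] = 'E' if state[j] == 'I' else 'S'
        if state[j] != 'I':
            state[j] = 'S'                   # any other holder drops to S
    return state                             # read hits change nothing

def mesi_write(state, i):
    j = 1 - i
    if state[i] != 'M':                      # write hit in M stays M
        if state[i] in ('S', 'I'):
            state[j] = 'I'                   # invalidate the other copy
        state[i] = 'M'                       # E upgrades silently to M
    return state

state = ['I', 'I']                # caches of P1 (index 0) and P2 (index 1)
mesi_read(state, 0);  print(state)   # ['E', 'I']  P1 reads x (miss)
mesi_read(state, 1);  print(state)   # ['S', 'S']  P2 reads x
mesi_write(state, 0); print(state)   # ['M', 'I']  P1 writes to x
mesi_write(state, 0); print(state)   # ['M', 'I']  P1 writes to x again
mesi_read(state, 1);  print(state)   # ['S', 'S']  P2 reads x
```

The printed sequence reproduces the rows of the table above, starting from the initial read by P1 described in the exercise text.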
Parallel Processing 59
Exercises
Exercise 2:
Consider an SMP with both L1 and L2 caches using the MESI protocol. One of four states is associated with each line in the L2 cache. Are all four states also needed for each line in the L1 cache? If so, why? If not, explain which state or states can be eliminated.
Parallel Processing 60
Exercises
Exercise 3:
Consider a pipeline similar to the ones seen in chapter on Pipelining, which is redrawn in Figure a with the fetch and decode stages ignored to represent the execution of thread A. Figure b illustrates the execution of a separate thread B. In both cases, a simple pipelined processor is used.
1 2 3 4 5 6 7 8 9 10 11 12
CO A1 A2 A3 A4 A5 A15 A16
FO A1 A2 A3 A4 A15 A16
EI A1 A2 A3 A15 A16
WO A1 A2 A3 A15 A16
1 2 3 4 5 6 7 8 9 10 11 12
CO B1 B2 B3 B4 B5 B6 B7
FO B1 B2 B3 B4 B5 B6 B7
EI B1 B2 B3 B4 B5 B6 B7
WO B1 B2 B3 B4 B5 B6 B7
cycle
Thread A
Thread B
Parallel Processing 61
Exercises
Exercise 3a:
Show an instruction issue diagram, for each of the two threads
Exercise 3b :
Assume that the two threads are to be executed in parallel on a chip multiprocessor, with each of the two processors on the chip using a simple pipeline. Show an instruction issue diagram. Also show a pipeline execution diagram.
Thread A
Thread B
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 62
Exercises
Exercise 3c:
Assume a two-issue superscalar architecture. Repeat part (b) for an interleaved multithreading superscalar implementation, assuming no data dependencies. Note: There is no unique answer, you need to make assumptions about latency and priority
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 63
Exercises
Exercise 3d :
Repeat part c for a blocked multithreading
superscalar implementation.
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 64
Exercises
Exercise 3e :
Repeat for a four-issue SMT architecture.
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 65
Exercises
Exercise 4:
An application program is executed on a nine-computer cluster. A benchmark program took time T on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine computers. The remaining time, the application had to run on a single computer.
a) Calculate the effective speedup under the aforementioned condition as compared to executing the program on a single computer. Also calculate the percentage of code that has been parallelized (programmed or compiled so as to use the cluster mode) in the preceding program.
b) Suppose that we are able to effectively use 17 computers rather than 9 computers on the parallelized portion of the code. Calculate the effective speedup that is achieved
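One way to sanity-check the answers is Amdahl's law, speedup = 1 / ((1 - f) + f / n). Since the parallel portion takes 0.25T on nine computers, it would take 0.25T × 9 = 2.25T on one, so f = 2.25 / (2.25 + 0.75) = 0.75. A minimal sketch (the name `speedup` and the rounding are assumptions, not from the slides):

```python
# Amdahl's-law helper: f is the parallelizable fraction of the
# single-computer run, n the number of computers.
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

print(round(speedup(0.75, 9), 2))    # 3.0  (part a: nine computers)
print(round(speedup(0.75, 17), 2))   # 3.4  (part b: seventeen computers)
```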
Parallel Processing 66
Exercises
Exercise 5:
The figures below show the state diagrams of two possible cache coherence protocols. Deduce and explain each protocol, and compare each to MESI.
W(i) = write to line by processor i
R(i) = read line by processor i
Z(i) = displace line by processor i
W(j) = write to line by processor j (j ≠ i)
R(j) = read line by processor j (j ≠ i)
Z(j) = displace line by processor j (j ≠ i)
Note: the state diagrams are for a given line in cache i.
[Figure a: two-state diagram with states Valid and Invalid.]
[Figure b: three-state diagram with states Exclusive, Shared, and Invalid.]
Parallel Processing 67
Parallel Processing
END