Page 1: Parallel Processing

Raul Queiroz Feitosa

Parts of these slides are from the support material provided by W. Stallings

Page 2: Objective

To present the most prominent approaches to parallel computer organization.

Page 3: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems

Page 4: Taxonomy

Flynn's taxonomy classifies architectures by how many instruction streams are executed simultaneously and how many data streams are processed simultaneously:

                         instructions executed simultaneously
                         1            many
data processed    1      SISD         MISD
simultaneously    many   SIMD         MIMD

Page 5: Single Instruction, Single Data (SISD)

Single processor
Single instruction stream
Data stored in a single memory
Uni-processor (von Neumann architecture)

(Diagram: the control unit drives the processing unit through the instruction stream; the processing unit exchanges a data stream with the memory unit.)

Page 6: Single Instruction, Multiple Data (SIMD)

A single machine instruction controls simultaneous execution.
Multiple processing elements.
Each processing element has an associated data memory.
The same instruction is executed by all processing elements, but on different sets of data.
Main subclasses: vector and array processors.

(Diagram: one control unit broadcasts a single instruction stream to several processing elements; each processing element exchanges its own data stream with its own memory unit.)

Page 7: Multiple Instruction, Single Data (MISD)

A sequence of data is transmitted to a set of processors.
Each processor executes a different instruction sequence on different "parts" of the same data sequence.
It has never been implemented.

Page 8: Multiple Instruction, Multiple Data (MIMD)

A set of processors simultaneously executes different instruction sequences on different sets of data.
Main subclasses: multiprocessors and multicomputers.

(Diagram, multiprocessor: each control unit drives its own processing unit with its own instruction stream; all processing units exchange data streams with a single shared memory unit.)

Page 9: Multiple Instruction, Multiple Data (MIMD)

(Same definition as on the previous page; this slide shows the multicomputer organization.)

(Diagram, multicomputer: each node pairs a control unit, a processing unit, and a private memory unit; the nodes communicate through an interconnection network.)

Page 10: Taxonomy tree

Processor organizations
  Single instruction, single data stream (SISD)
  Single instruction, multiple data stream (SIMD)
    Vector processor
    Array processor
  Multiple instruction, single data stream (MISD)
  Multiple instruction, multiple data stream (MIMD)
    Multiprocessor: shared memory (tightly coupled)
      Symmetric multiprocessor (SMP)
      Nonuniform memory access (NUMA)
    Multicomputer: distributed memory (loosely coupled)
      Clusters

Page 11: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems
Vector Computation

Page 12: MIMD - Overview

A set of general-purpose processors.
Each can process all necessary instructions.
Further classified by the method of processor communication.

Page 13: Communication Models

Multiprocessors

All CPUs are able to process all necessary instructions.
All access the same physical shared memory.
All share the same address space.
Communication through shared memory via LOAD/STORE instructions → tightly coupled.
Simple programming model (see the sketch below).
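To make the tightly coupled model concrete, here is a minimal C11 sketch (ours, not from the slides; names such as shared_value are illustrative): two threads communicate purely through loads and stores on memory they share, with an atomic flag providing the ordering that real programs need.

/* Minimal sketch of tightly coupled communication: two threads share one
 * address space and exchange data with ordinary loads and stores.
 * Compile with: cc -std=c11 -pthread shared_mem.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int shared_value;         /* exchanged through plain LOAD/STORE */
static atomic_int ready = 0;     /* flag that orders the two accesses */

static void *producer(void *arg) {
    (void)arg;
    shared_value = 42;                                      /* STORE */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                   /* wait for the store */
    printf("consumer read %d\n", shared_value);             /* LOAD */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Here threads on one machine stand in for the processors of an SMP; the point is only that communication is nothing more than a store by one party and a load by the other.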

Page 14: Communication Models

Multiprocessors (example)

a) Multiprocessor with 16 CPUs sharing a common memory.
b) Memory in 16 sections, each one processed by one processor.

Page 15: Communication Models

Multicomputers

Each CPU has a private memory → distributed memory system.
Each CPU has its own address space.
Communication through send/receive primitives → loosely coupled system.
More complex programming model (see the sketch below).
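For contrast, a minimal sketch of the loosely coupled model, assuming the standard MPI library is available (the example itself is ours, not from the slides): each process owns its memory, and data moves only through explicit MPI_Send/MPI_Recv calls.

/* Minimal sketch of loosely coupled communication: two processes with
 * private memories exchange data only through explicit send/receive.
 * Compile with mpicc; run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* explicit send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* explicit receive */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

No memory is shared: each rank's value is private, and the interconnect is visible to the programmer as explicit calls.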

Page 16: Communication Models

Multicomputers (example)

Multicomputer with 16 CPUs, each with its own private memory.
The image (see the previous figure) is distributed among the 16 CPUs.

Page 17: Communication Models

Multiprocessors vs. Multicomputers

Multiprocessors:
  Potentially easier to program.
  Building a shared memory for hundreds of CPUs is not easy → not scalable.
  Memory contention is a potential performance bottleneck.

Multicomputers:
  More difficult to program.
  Building multicomputers with 1000's of CPUs is not difficult → scalable.

Page 18: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems
Vector Computation

Page 19: Symmetric Multiprocessors

A stand-alone computer with the following characteristics:
  Two or more similar processors of comparable capacity.
  Processors share the same memory and I/O.
  Processors are connected by a bus or other internal connection.
  Memory access time is approximately the same for each processor.

Page 20: SMP Advantages

Performance
  If some work can be done in parallel.
Availability
  Since all processors can perform the same functions, failure of a single processor does not necessarily halt the system.
Incremental growth
  Users can enhance performance by adding processors.
Scaling
  Vendors can offer a range of products based on the number of processors.

Page 21: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems
Vector Computation

Page 22: Time Shared Bus

Characteristics:
  Simplest form.
  Structure and interface similar to a single-processor system.
  Features provided:
    Addressing - distinguishes modules on the bus.
    Arbitration - any module can be temporary master.
    Time sharing - if one module has the bus, others must wait and may have to suspend.
  There are now multiple processors as well as multiple I/O modules.

Page 23: Time Shared Bus - SMP

(Figure: organization of an SMP around a time-shared bus.)

Page 24: Time Shared Bus

Advantages:
  Simplicity
  Flexibility
  Reliability

Disadvantages:
  Performance limited by bus cycle time.
  Each processor should have a local cache to reduce the number of bus accesses.
  This leads to problems with cache coherence, solved in hardware (see later).

Page 25: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems

Page 26: Cache Coherence Problem

1- CPU A reads data (miss)
2- CPU K reads the same data (miss)
3- CPU K writes (changes) the data (hit)
4- CPU A reads the data (hit): the copy is outdated!

(Diagram: CPU A and CPU K, each with its own cache, attached to a shared bus together with the shared memory; after step 3, cache A still holds the old value while cache K holds the new one.)

Page 27: Snoopy Protocols

Cache controllers may have a snoop unit, which monitors the shared bus to detect any coherence-relevant activity and acts so as to assure data coherence.

This increases bus traffic.

Page 28: Snoopy Protocols

1- CPU K writes (changes) the data (hit)
2- the write propagates to the shared memory
3- the snoop unit invalidates or updates the data in CPU A

(Diagram: the same two-CPU shared-bus system; CPU K's write is seen on the bus by cache A's snoop logic.)

Page 29: MESI State Transition Diagram

(Figure: the MESI state transition diagram; a textual sketch follows.)
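Since the diagram itself is an image, here is a rough C sketch (ours, heavily simplified: it ignores bus transactions, write-back traffic, and read-for-ownership details) of the next-state logic for one cache line:

/* Rough MESI next-state function for a single cache line (sketch only). */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy) {
    switch (e) {
    case LOCAL_READ:        /* a hit keeps the state; a miss fills the line */
        if (s == INVALID)
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;
    case LOCAL_WRITE:       /* gain exclusive ownership; other copies are invalidated */
        return MODIFIED;
    case BUS_READ:          /* another cache reads: dirty data is written back first,
                               and every valid copy becomes shared */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE:         /* another cache writes: our copy is now stale */
        return INVALID;
    }
    return s;
}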

Page 30: L1-L2 Cache Consistency

L1 caches do not connect to the bus → they do not take part in the snoop protocol.

Simple solution:
  L1 is write-through.
  Updates and invalidations in L2 must be propagated to L1.

Approaches for a write-back L1 exist → more complex.

Page 31: Cache Coherence - Directory Protocols

For interconnections other than a shared bus.

Collect and maintain information about copies of data in caches.
Typically a central directory stored in main memory.
Requests are checked against the directory.
Appropriate transfers are performed.
Creates a central bottleneck.
Effective in large-scale systems with complex interconnection schemes (according to Stallings).
(A rough directory sketch follows.)
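As an illustration of the bookkeeping involved, here is a rough C sketch (ours, not from the slides: a full-map directory with one presence bit per processor; real protocols track more state and exchange real messages over the interconnect):

/* Sketch of a full-map directory entry and the checks a directory
 * controller makes on each request. */
#include <stdint.h>
#include <stdio.h>

#define NPROCS 8

typedef struct {
    uint8_t sharers;   /* bit i set: processor i holds a copy */
    int     dirty;     /* one cache holds a modified copy */
    int     owner;     /* valid only when dirty */
} dir_entry_t;

/* Processor p asks to write a line: every other copy must be invalidated. */
void dir_handle_write(dir_entry_t *e, int p) {
    for (int i = 0; i < NPROCS; i++)
        if (i != p && (e->sharers & (1u << i)))
            printf("invalidate copy in cache %d\n", i);  /* stand-in for a message */
    e->sharers = (uint8_t)(1u << p);
    e->dirty = 1;
    e->owner = p;
}

/* Processor p asks to read: a dirty copy is fetched back first. */
void dir_handle_read(dir_entry_t *e, int p) {
    if (e->dirty)
        printf("fetch modified line from cache %d\n", e->owner);
    e->dirty = 0;
    e->sharers |= (uint8_t)(1u << p);
}

Every request funnels through this shared structure, which is exactly the central bottleneck the slide mentions.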

Page 32: Cache Coherence - Software Solutions

Compiler and operating system deal with the problem.
Overhead is transferred to compile time.
Design complexity is transferred from hardware to software.
However, software tends to make conservative decisions → inefficient cache utilization.
Code is analyzed to determine safe periods for caching shared variables.
Combined HW+SW solutions exist.

Page 33: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems

Page 34: Increasing Performance

MIPS rate = f × IPC, where f is the processor clock frequency and IPC the average number of instructions per cycle (a numeric illustration follows at the end of this page).

Pipelining and superscalar execution increase IPC.

Mechanisms to maximize the utilization of each pipeline may be reaching a limit, due to:
  Complexity
  Power consumption
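A quick numeric illustration (the values are ours, not from the slides):

\[ \text{MIPS rate} = f \times \text{IPC} = 2000~\text{MHz} \times 1.5 = 3000~\text{MIPS}. \]

Since the two factors multiply, performance can be raised through either one; the multithreading approaches that follow aim at raising effective IPC.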

Page 35: Threads and Processes

Process:
  An instance of a program running on a computer.
  Owns resources such as main memory, I/O channels, I/O devices, and files.
  Is a unit of scheduling/execution.
  Replacing it on the processor is a process switch.

Page 36: Threads and Processes

Thread:
  A dispatchable unit of work within a process.
  Includes a processor context (program counter and stack pointer) and a data area for its stack (to enable subroutine branching).
  Threads within a process share resources.
  Threads are able to be executed independently!
  Switching the processor between threads within the same process is typically less costly than a process switch (see the sketch below).
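A minimal C sketch of these ideas (ours, not from the slides): two threads in one process see the same global variable, while each keeps its own stack variables and context.

/* Sketch: threads in one process share globals and the heap, but each has
 * its own stack and processor context. Compile with: cc -pthread threads.c */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                   /* visible to every thread */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = *(int *)arg;                        /* lives on this thread's own stack */
    pthread_mutex_lock(&lock);
    shared_counter++;                            /* same variable in both threads */
    printf("thread %d sees counter = %d\n", id, shared_counter);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);
    pthread_join(t1, NULL);                      /* switching between these threads is */
    pthread_join(t2, NULL);                      /* cheaper than a full process switch */
    return 0;
}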

Page 37: Implicit and Explicit Multithreading

Explicit:
  user-level threads, visible to the application program
  kernel-level threads, which are visible to the OS
All commercial processors and most experimental ones use explicit multithreading.

Implicit:
  threads defined statically by the compiler or dynamically by hardware

Page 38: Approaches to Explicit Multithreading

Interleaved (fine-grained)
  The processor issues instruction(s) from a single thread at a time.
  It switches thread at each clock cycle.
  If a thread is blocked, it is skipped.

Blocked (coarse-grained)
  The processor issues instruction(s) from a single thread at a time.
  A thread executes until an event causes a delay, e.g. a cache miss.
  Effective on in-order processors.
  Avoids pipeline stalls.
  (Both policies are contrasted in the toy simulation after this list.)

Simultaneous (SMT)
  The processor issues instructions from multiple threads at a time to the execution units of a superscalar processor.

Chip multiprocessing
  The whole processor is replicated on a single chip.
  Each processor handles separate threads.
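A toy C simulation (ours, not from the slides: single-issue, two threads, a fixed stall pattern) contrasting the interleaved and blocked policies above: interleaved rotates threads every cycle, blocked switches only when the running thread stalls.

/* Toy single-issue pipeline: interleaved vs. blocked multithreading.
 * Each thread is a string of instructions; '.' marks a stall cycle.
 * In this toy version a stalled thread wastes its slot (printed as '-');
 * real fine-grained designs would skip it. */
#include <stdio.h>

#define NTHREADS 2
#define CYCLES   14

static void simulate(const char *label, const char *thr[], int blocked) {
    int pos[NTHREADS] = {0}, cur = 0;
    printf("%-12s", label);
    for (int cycle = 0; cycle < CYCLES; cycle++) {
        if (!blocked)
            cur = cycle % NTHREADS;              /* interleaved: rotate every cycle */
        char c = thr[cur][pos[cur]];
        if (c == '\0') {                         /* current thread finished */
            cur = (cur + 1) % NTHREADS;
            c = thr[cur][pos[cur]];
            if (c == '\0') break;                /* all threads finished */
        }
        pos[cur]++;
        if (c == '.') {                          /* stall: issue slot wasted */
            if (blocked)
                cur = (cur + 1) % NTHREADS;      /* blocked: switch away on a stall */
            putchar('-');
        } else {
            putchar(c);
        }
    }
    putchar('\n');
}

int main(void) {
    /* thread a stalls twice after its third instruction; thread b never stalls */
    const char *thr[NTHREADS] = { "aaa..aa", "bbbbbbb" };
    simulate("interleaved", thr, 0);
    simulate("blocked",     thr, 1);
    return 0;
}

Running it prints one issue slot per cycle for each policy, making the wasted slots of each approach visible.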

Page 39: Scalar Processor Approaches

Single-threaded scalar
  Simple pipeline.
  No multithreading.

Interleaved multithreaded scalar
  Easiest multithreading to implement.
  Switches threads at each clock cycle.
  Pipeline stages are kept close to fully occupied.
  Hardware needs to switch thread context between cycles.

Blocked multithreaded scalar
  A thread executes until a latency event occurs that would stop the pipeline; the processor then switches to another thread.

Page 40: Scalar Diagrams

(Diagrams: issue slots over time for the three scalar approaches; a stall caused by some dependency takes 3 time slots to clear.)

Page 41: Superscalar Processor Approaches (1)

Superscalar
  No multithreading.

Interleaved multithreading superscalar
  Each cycle, as many instructions as possible are issued from a single thread.
  Delays due to thread switches are eliminated.
  The number of instructions issued in a cycle is limited by dependencies.

Blocked multithreaded superscalar
  Instructions come from one thread at a time.
  Blocked multithreading is used.

Page 42: Superscalar Diagrams (1)

(Diagrams: issue slots for the approaches on the previous page.)

Page 43: Superscalar Processor Approaches (2)

Very long instruction word (VLIW), e.g. IA-64
  Multiple instructions in a single word.
  Typically constructed by the compiler.
  Operations that may be executed in parallel are placed in the same word.
  May be padded with no-ops.

Interleaved multithreading VLIW
  Similar efficiencies to interleaved multithreading on a superscalar architecture.

Blocked multithreaded VLIW
  Similar efficiencies to blocked multithreading on a superscalar architecture.

Page 44: Superscalar Diagrams (2)

(Diagrams: issue slots for the VLIW approaches on the previous page.)

Page 45: Superscalar Processor Approaches (3)

Simultaneous multithreading (SMT)
  Issues multiple instructions at a time.
  One thread may fill all horizontal slots.
  Instructions from two or more threads may be issued.
  With enough threads, the maximum number of instructions can be issued on each cycle.

Chip multiprocessor
  Multiple processors, each a two-issue superscalar processor.
  Each processor is assigned a thread.
  Can issue up to two instructions per cycle per thread.

Page 46: Superscalar Diagrams (3)

(Diagrams: issue slots for SMT and for the chip multiprocessor.)

Page 47: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems

Page 48: Clusters

"A group of interconnected whole computers working together as a unified resource, creating the illusion of being one machine."

Page 49: Clusters

Benefits:
  Absolute scalability
  Incremental scalability
  High availability
  Superior price/performance

Typical use: server applications.
Each computer is called a node.

Page 50: Cluster Configurations

Standby Server, No Shared Disk

Page 51: Cluster Configurations

With Shared Disk

Page 52: Clustering Methods

Passive Standby
  Description: A secondary server takes over in case of primary server failure.
  Benefits: Easy to implement.
  Limitations: High cost, because the secondary server is unavailable for other processing tasks.

Active Secondary
  Description: The secondary server is also used for processing tasks.
  Benefits: Reduced cost, because secondary servers can be used for processing.
  Limitations: Increased complexity.

Separate Servers
  Description: Separate servers have their own disks. Data is continuously copied from the primary to the secondary server.
  Benefits: High availability.
  Limitations: High network and server overhead due to copying operations.

Servers Connected to Disks
  Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server.
  Benefits: Reduced network and server overhead due to elimination of copying operations.
  Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

Servers Share Disks
  Description: Multiple servers simultaneously share access to disks.
  Benefits: Low network and server overhead. Reduced risk of downtime caused by disk failure.
  Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology.

Page 53: Outline

Taxonomy
MIMD systems
  Symmetric multiprocessing
    Time shared bus
    Cache coherence
  Multithreading
  Clusters
  NUMA systems

Page 54: NUMA Systems - Terminology

Uniform memory access (UMA)
  All processors have access to all parts of memory → load & store.
  Access time to all regions of memory is the same.
  Access time to memory is the same for all processors.
  As used by SMPs.

Nonuniform memory access (NUMA)
  All processors have access to all parts of memory → load & store.
  A processor's access time differs depending on the region of memory.
  Different processors access different regions of memory at different speeds.

Cache-coherent NUMA (CC-NUMA)
  Cache coherence is maintained among the caches of the various processors.
  Significantly different from SMP and clusters.

Page 55: Motivation for NUMA

SMP has a practical limit on the number of processors:
  bus traffic limits it to between 16 and 64 processors.

In clusters, each node has its own memory:
  applications do not see a large global memory;
  coherence is maintained by software, not hardware.

NUMA retains the SMP flavour while giving large-scale multiprocessing:
  e.g. the Silicon Graphics Origin NUMA with 1024 MIPS R10000 processors.

Objective: to maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system.

Page 56: CC-NUMA Organization

Single addressable memory space.

Memory request order:
1. L1 cache (local to processor)
2. L2 cache (local to processor)
3. Main memory (local to node)
4. Remote memory (automatic and transparent to the processor)

Page 57: NUMA Pros & Cons

Higher levels of parallelism than SMP, with minor software changes.

Performance can break down if there is too much access to remote memory. This can be avoided by:
  L1 & L2 cache design reducing all memory accesses
    (needs good temporal and spatial locality in the software)
  virtual memory management moving pages to the nodes that use them most

Keeping it cache coherent is costly in terms of hardware and performance.

Page 58: Exercises

Exercise 1:
Consider a situation in which two processors (P1 and P2) in an SMP configuration, over time, require access to the same line of data from main memory. Both processors have a cache and use the MESI protocol. Initially both caches have an invalid copy of the line. Eventually, processor P1 reads line x. If this is the start of a sequence of accesses, complete the table below, indicating the state transitions of this line in the caches of both processors.

Accesses         State transition in cache 1   State transition in cache 2
P2 reads x       E → S                         I → S
P1 writes to x   S → M                         S → I
P1 writes to x   M → M                         I → I
P2 reads x       M → S                         I → S

Page 59: Exercises

Exercise 2:
Consider an SMP with both L1 and L2 caches using the MESI protocol. One of four states is associated with each line in the L2 cache. Are all four states also needed for each line in the L1 cache? If so, why? If not, explain which state or states can be eliminated.

Page 60: Exercises

Exercise 3:
Consider a pipeline similar to the ones seen in the chapter on Pipelining, redrawn in figure (a) with the fetch and decode stages omitted, to represent the execution of thread A. Figure (b) illustrates the execution of a separate thread B. In both cases a simple pipelined processor is used.

Thread A, cycles 1 to 12 ([gap] marks the stall between A5 and A15; exact column positions are not recoverable from the source figure):
CO: A1 A2 A3 A4 A5 [gap] A15 A16
FO: A1 A2 A3 A4 [gap] A15 A16
EI: A1 A2 A3 [gap] A15 A16
WO: A1 A2 A3 [gap] A15 A16

Thread B, cycles 1 to 12 (no stalls; each stage lags the previous one by one cycle):
CO: B1 B2 B3 B4 B5 B6 B7
FO: B1 B2 B3 B4 B5 B6 B7
EI: B1 B2 B3 B4 B5 B6 B7
WO: B1 B2 B3 B4 B5 B6 B7

Page 61: Exercises

Exercise 3a:
Show an instruction issue diagram for each of the two threads.

Exercise 3b:
Assume that the two threads are to be executed in parallel on a chip multiprocessor, with each of the two processors on the chip using a simple pipeline. Show an instruction issue diagram. Also show a pipeline execution diagram.

(Blank answer grids for Thread A and Thread B, cycles 1 to 12.)

Page 62: Exercises

Exercise 3c:
Assume a two-issue superscalar architecture. Repeat part (b) for an interleaved multithreading superscalar implementation, assuming no data dependencies. Note: there is no unique answer; you need to make assumptions about latency and priority.

(Blank answer grid, cycles 1 to 12.)

Page 63: Exercises

Exercise 3d:
Repeat part (c) for a blocked multithreading superscalar implementation.

(Blank answer grid, cycles 1 to 12.)

Page 64: Exercises

Exercise 3e:
Repeat for a four-issue SMT architecture.

(Blank answer grid, cycles 1 to 12.)

Page 65: Exercises

Exercise 4:
An application program is executed on a nine-computer cluster. A benchmark program took time T on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine computers. The remaining time, the application had to run on a single computer.

a) Calculate the effective speedup under the aforementioned condition, as compared to executing the program on a single computer. Also calculate α, the percentage of code that has been parallelized (programmed or compiled so as to use the cluster mode) in the preceding program.

b) Suppose that we are able to effectively use 17 computers rather than 9 on the parallelized portion of the code. Calculate the effective speedup that is achieved.

(A worked solution follows.)
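A worked solution (ours, applying Amdahl's law; the slides pose the question without an answer). On a single computer the parallel portion would take nine times as long, so

\[ T_1 = 0.75\,T + 9 \times 0.25\,T = 3T, \qquad \text{speedup} = \frac{T_1}{T} = 3, \qquad \alpha = \frac{2.25\,T}{3T} = 0.75 \ (75\%). \]

For part (b), with 17 computers on the parallelized portion:

\[ \text{speedup} = \frac{1}{(1-\alpha) + \alpha/17} = \frac{1}{0.25 + 0.75/17} \approx 3.4. \]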

Page 66: Exercises

Exercise 5:
The figures below show the state diagrams of two possible cache coherence protocols. Deduce and explain each protocol, and compare each to MESI.

W(i) = write to line by processor i
R(i) = read line by processor i
Z(i) = displace line by processor i
W(j) = write to line by processor j (j ≠ i)
R(j) = read line by processor j (j ≠ i)
Z(j) = displace line by processor j (j ≠ i)

Note: the state diagrams are for a given line in cache i.

(Diagram 1: two states, Valid and Invalid. Diagram 2: three states, Exclusive, Shared, and Invalid. Transitions are labelled with the events defined above; the full diagrams are figures in the original.)

Page 67: Parallel Processing

END