Parallel Processing
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings
Parallel Processing 2
Objective
To present the most prominent approaches
to parallel computer organization.
Parallel Processing 3
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 4
Taxonomy
Flynn's taxonomy classifies machines by the number of instruction streams
being executed simultaneously and the number of data streams being
processed simultaneously:

                     instruction streams
                     1          many
data streams   1     SISD       MISD
            many     SIMD       MIMD
Parallel Processing 5
Single Instruction, Single Data
SISD
Single processor
Single instruction stream
Data stored in single memory
Uni-processor (von Neumann architecture)
[Figure: a control unit sends an instruction stream to a processing unit, which exchanges a data stream with a memory unit.]
Parallel Processing 6
Single Instruction, Multiple Data
SIMD
A single machine instruction controls simultaneous execution.
Multiple processing elements
Each processing element has
associated data memory
The same instruction
executed by all processing
elements but on different set
of data.
Main subclasses: vector and
array processors.
[Figure: one control unit broadcasts a single instruction stream to multiple processing elements, each exchanging its own data stream with its own memory unit.]
Parallel Processing 7
Multiple Instruction, Single Data
MISD
Sequence of data transmitted to a set of processors.
Each processor executes a different instruction
sequence on different “parts” of the same data sequence.
Has never been implemented.
Parallel Processing 8
Multiple Instruction, Multiple Data
MIMD
Set of processors
simultaneously
execute different
instruction sequences.
Different sets of data.
Main subclasses:
multiprocessors and
multicomputers
[Figure: multiple control units, each issuing an instruction stream to its own processing unit; all processing units exchange data streams with a single shared memory unit - a multiprocessor.]
Parallel Processing 9
Multiple Instruction, Multiple Data
MIMD
[Figure: multiple control units, each issuing an instruction stream to its own processing unit with its own memory unit; the nodes communicate over an interconnection network - a multicomputer.]
Parallel Processing 10
Taxonomy tree
Processor organizations
  Single instruction, single data stream (SISD)
  Single instruction, multiple data stream (SIMD)
    Vector processor
    Array processor
  Multiple instruction, single data stream (MISD)
  Multiple instruction, multiple data stream (MIMD)
    Multiprocessor - shared memory (tightly coupled)
      Symmetric multiprocessor (SMP)
      Nonuniform memory access (NUMA)
    Multicomputer - distributed memory (loosely coupled)
      Clusters
Parallel Processing 11
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Vector Computation
Parallel Processing 12
MIMD - Overview
Set of general purpose processors.
Each can execute all the necessary instructions.
Further classified by method of processor
communication.
Parallel Processing 13
Communication Models
Multiprocessors
All CPUs are able to execute all the necessary
instructions.
All access the same physical shared memory
All share the same address space.
Communication through shared memory via
LOAD/STORE instructions → tightly coupled.
Simple programming model.
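As a rough sketch of the tightly-coupled model, the Python threads below share one address space and communicate purely through ordinary loads and stores to a shared list (the names `shared` and `worker` are illustrative, not from the slides):

```python
import threading

# Shared "memory": every thread addresses the same list, so threads
# communicate through plain loads and stores (tightly coupled).
shared = [0] * 16
lock = threading.Lock()   # ordering of concurrent stores still needs care

def worker(section):
    # Each thread fills its own 4-element section of the shared array.
    for i in range(section * 4, section * 4 + 4):
        with lock:
            shared[i] = i * i

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared[:4])   # [0, 1, 4, 9]
```

No explicit message is ever sent: each thread simply writes its results where the others can read them, which is what makes the model easy to program.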
Parallel Processing 14
Communication Models
Multiprocessors (example)
a) Multiprocessor with 16 CPUs sharing a common memory.
b) Memory in 16 sections; each one processed by one processor.
Parallel Processing 15
Communication Models
Multicomputers
Each CPU has a private memory → distributed
memory system.
Each CPU has a particular address space
Communication through send/receive primitives →
loosely coupled system.
More complex programming model
Parallel Processing 16
Communication Models
Multicomputers (example)
Multicomputer with 16 CPUs each with its own private memory
Image (see previous figure) distributed among the 16 CPUs
Parallel Processing 17
Communication Models
Multiprocessors vs. Multicomputers
Multiprocessors:
Potentially easier to program
Building a shared memory for hundreds of CPUs is not easy → not
scalable.
Memory contention is a potential performance bottleneck.
Multicomputers:
More difficult to program.
Building multicomputers with thousands of CPUs is not difficult →
scalable.
Parallel Processing 18
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Vector Computation
Parallel Processing 19
Symmetric Multiprocessors
A stand-alone computer with the following
characteristics:
Two or more similar processors of comparable capacity.
Processors share same memory and I/O.
Processors are connected by a bus or other internal
connection.
Memory access time is approximately the same for each
processor.
Parallel Processing 20
SMP Advantages
Performance
If some work can be done in parallel
Availability
Since all processors can perform the same functions, failure of a single
processor does not necessarily halt the system.
Incremental growth
User can enhance performance by adding additional processors.
Scaling
Vendors can offer range of products based on number of processors.
Parallel Processing 21
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Vector Computation
Parallel Processing 22
Time Shared Bus
Characteristics:
Simplest form.
Structure and interface similar to single processor system
Following features provided:
Addressing - distinguish modules on the bus.
Arbitration - any module can be a temporary master.
Time sharing - if one module has the bus, others must wait and may
have to suspend.
Now have multiple processors as well as multiple I/O
modules.
Parallel Processing 23
Time Shared Bus - SMP
Parallel Processing 24
Time Shared Bus
Advantages:
  Simplicity
  Flexibility
  Reliability
Disadvantages:
  Performance limited by bus cycle time.
  Each processor should have a local cache to reduce the number of bus accesses.
  This leads to cache coherence problems, solved in hardware (see later).
Parallel Processing 25
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 26
Cache Coherence Problem
[Figure: CPU A and CPU K, each with a private cache, attached to a shared bus and shared memory.]
1- CPU A reads data (miss)
2- CPU K reads the same data (miss)
3- CPU K writes (changes) the data (hit)
4- CPU A reads the data (hit) - outdated!
Parallel Processing 27
Snoopy Protocols
Cache controllers may have a snoop unit, which monitors the shared bus
to detect any coherence-relevant activity and acts so as to ensure
data coherence.
It increases bus traffic.
Parallel Processing 28
Snoopy Protocols
[Figure: CPU A and CPU K on the shared bus, both caching line x.]
1- CPU K writes (changes) the data (hit)
2- the write propagates to the shared memory
3- the snoop invalidates or updates the data in CPU A's cache
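The invalidate variant of snooping can be sketched as a toy Python model (a deliberate simplification: a single write-through cache level, no MESI states, and all class and variable names are invented for illustration):

```python
# Minimal write-invalidate snoop sketch: each cache holds
# {address: value}; a write broadcasts on the "bus", and every other
# cache's snooper drops its now-stale copy of the line.

class Cache:
    def __init__(self, bus):
        self.lines = {}
        bus.append(self)          # attach this cache's snooper to the bus
        self.bus = bus

    def read(self, memory, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, memory, addr, value):
        self.lines[addr] = value
        memory[addr] = value                  # write-through to memory
        for other in self.bus:                # snoop: invalidate others
            if other is not self:
                other.lines.pop(addr, None)

bus, memory = [], {0x10: 5}
cache_a, cache_k = Cache(bus), Cache(bus)
cache_a.read(memory, 0x10)        # 1- CPU A reads (miss)
cache_k.read(memory, 0x10)        # 2- CPU K reads (miss)
cache_k.write(memory, 0x10, 9)    # 3- CPU K writes; A's copy invalidated
print(cache_a.read(memory, 0x10)) # 4- A misses again, re-fetches 9, not 5
```

The final read by CPU A misses because the snooper dropped the line, so it picks up the current value instead of the stale one.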
Parallel Processing 29
MESI State Transition Diagram
Parallel Processing 30
L1-L2 Cache Consistency
L1 caches do not connect to the bus → do not
engage in the snoop protocol.
Simple solution:
L1 is “write-through”.
Updates and invalidations in L2 must be propagated to
L1.
Approaches for a write-back L1 exist → more
complex.
Parallel Processing 31
Cache Coherence
interconnections other than a shared bus
Directory Protocols
Collect and maintain information about copies of data in
cache.
Typically a central directory stored in main memory.
Requests are checked against directory.
Appropriate transfers are performed.
Creates central bottleneck.
Effective in large-scale systems with complex
interconnection schemes (according to Stallings).
Parallel Processing 32
Cache Coherence
Software Solutions
Compiler and operating system deal with the problem.
Overhead is transferred to compile time.
Design complexity is transferred from hardware to software.
However, software tends to make conservative decisions →
inefficient cache utilization.
Code is analyzed to determine safe periods for caching shared variables.
Combined HW+SW solutions exist.
Parallel Processing 33
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 34
Increasing Performance
MIPS rate = f × IPC,
where f is the processor clock frequency and IPC the average
number of instructions executed per cycle.
Pipelining and superscalar organization increase IPC.
Mechanisms to maximize the utilization of each pipeline
may be reaching a limit due to:
  complexity
  power consumption
Parallel Processing 35
Threads and Processes
Process:
An instance of a program running on a computer
Resource ownership, such as main memory,
I/O channels, I/O devices, and files
Scheduling/execution
Process switch
Parallel Processing 36
Threads and Processes
Thread:
A dispatchable unit of work within a process
Includes processor context (which includes the program counter and stack pointer) and data area for stack (to enable subroutine branching)
Threads within a process share resources
Threads are able to be executed independently!
Switching processor between threads within same process
Typically less costly than process switch
Parallel Processing 37
Implicit and Explicit Multithreading
Explicit:
  user-level threads, visible to the application program
  kernel-level threads, visible to the OS
  All commercial processors and most experimental ones use explicit multithreading.
Implicit:
  threads defined statically by the compiler or dynamically by the hardware.
Parallel Processing 38
Approaches to Explicit Multithreading
Interleaved (fine-grained)
Processor issues instruction(s) from a single thread at a time
Switching thread at each clock cycle
If thread is blocked it is skipped
Blocked (coarse-grained)
Processor issues instruction(s) from a single thread at a time
Thread executed until event causes delay e.g. cache miss
Effective on in-order processor
Avoids pipeline stall
Simultaneous (SMT)
Processor issues instructions from multiple threads at a time to execution units of superscalar processor
Chip multiprocessing
Processor is replicated on a single chip
Each processor handles separate threads
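The difference between the interleaved and blocked policies can be illustrated with a toy issue-order model in Python (assumed simplifications: single-issue, one instruction label per cycle, and a trailing `*` marks an instruction that triggers a latency event such as a cache miss):

```python
# Toy model: threads are lists of instruction labels; compare
# interleaved (switch every cycle) with blocked (switch on a stall).

def interleaved(threads):
    order, i = [], 0
    threads = [list(t) for t in threads]   # work on copies
    while any(threads):
        t = threads[i % len(threads)]
        if t:
            order.append(t.pop(0).rstrip('*'))
        i += 1                             # switch thread every cycle
    return order

def blocked(threads):
    order = []
    threads = [list(t) for t in threads]
    while any(threads):
        for t in threads:
            while t:                       # run until a stall occurs
                instr = t.pop(0)
                order.append(instr.rstrip('*'))
                if instr.endswith('*'):
                    break                  # latency event: switch thread
    return order

a = ['A1', 'A2*', 'A3']
b = ['B1', 'B2', 'B3']
print(interleaved([a, b]))  # ['A1', 'B1', 'A2', 'B2', 'A3', 'B3']
print(blocked([a, b]))      # ['A1', 'A2', 'B1', 'B2', 'B3', 'A3']
```

Interleaving alternates threads regardless of events, while the blocked policy stays with thread A until its stall on A2 and only then switches to B.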
Parallel Processing 39
Scalar Processor Approaches
Single-threaded scalar
Simple pipeline
No multithreading
Interleaved multithreaded scalar
Easiest multithreading to implement
Switch threads at each clock cycle
Pipeline stages kept close to fully occupied
Hardware needs to switch thread context between cycles
Blocked multithreaded scalar
Thread executed until latency event occurs
Would stop pipeline
Processor switches to another thread
Parallel Processing 40
Scalar Diagrams
[Figure: issue diagrams for the three scalar approaches; a stall caused by some dependency takes 3 time slots to clear.]
Parallel Processing 41
Superscalar Processor Approaches (1)
Superscalar
No multithreading
Interleaved multithreading superscalar:
Each cycle, as many instructions as possible issued from single thread
Delays due to thread switches eliminated
Number of instructions issued in cycle limited by dependencies
Blocked multithreaded superscalar
Instructions from one thread
Blocked multithreading used
Parallel Processing 42
Superscalar Diagrams (1)
Parallel Processing 43
Superscalar Processor Approaches (2)
Very long instruction word (VLIW), e.g. IA-64:
  Multiple instructions in a single word.
  Typically constructed by the compiler.
  Operations that may be executed in parallel are placed in the same word.
  May pad with no-ops.
Interleaved multithreading VLIW:
  Similar efficiencies to interleaved multithreading on a superscalar architecture.
Blocked multithreaded VLIW:
  Similar efficiencies to blocked multithreading on a superscalar architecture.
Parallel Processing 44
Superscalar Diagrams (2)
Parallel Processing 45
Superscalar Processor Approaches (3)
Simultaneous multithreading (SMT):
  Issues multiple instructions at a time.
  One thread may fill all horizontal slots.
  Instructions from two or more threads may be issued.
  With enough threads, the maximum number of instructions can be issued on each cycle.
Chip multiprocessor:
  Multiple processors replicated on one chip, each a two-issue superscalar processor.
  Each processor is assigned a thread.
  Can issue up to two instructions per cycle per thread.
Parallel Processing 46
Superscalar Diagrams (3)
Parallel Processing 47
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 48
Clusters
“A group of interconnected whole computers
working together as a unified computing resource
that can create the illusion of being one machine.”
Parallel Processing 49
Clusters
Benefits
Absolute scalability
Incremental scalability
High availability
Superior price/performance
Server applications
Each computer called a node
Parallel Processing 50
Cluster Configurations
Standby Server, No Shared Disk
Parallel Processing 51
Cluster Configurations
With Shared Disk
Parallel Processing 52
Clustering Methods
Method: Passive Standby
  Description: A secondary server takes over in case of primary server failure.
  Benefits: Easy to implement.
  Limitations: High cost because the secondary server is unavailable for other processing tasks.

Method: Active Secondary
  Description: The secondary server is also used for processing tasks.
  Benefits: Reduced cost because secondary servers can be used for processing.
  Limitations: Increased complexity.

Method: Separate Servers
  Description: Separate servers have their own disks. Data is continuously copied from the primary to the secondary server.
  Benefits: High availability.
  Limitations: High network and server overhead due to copying operations.

Method: Servers Connected to Disks
  Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server.
  Benefits: Reduced network and server overhead due to elimination of copying operations.
  Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

Method: Servers Share Disks
  Description: Multiple servers simultaneously share access to disks.
  Benefits: Low network and server overhead. Reduced risk of downtime caused by disk failure.
  Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology.
Parallel Processing 53
Outline
Taxonomy
MIMD systems
Symmetric multiprocessing
Time shared bus
Cache coherence
Multithreading
Clusters
NUMA systems
Parallel Processing 54
NUMA Systems
Terminology:
Uniform memory access (UMA):
  All processors have access to all parts of memory → load & store.
  Access time to all regions of memory is the same.
  Access time to memory is the same for all processors.
  As used by SMP.
Nonuniform memory access (NUMA):
  All processors have access to all parts of memory → load & store.
  Access time of a processor differs depending on the region of memory.
  Different processors access different regions of memory at different speeds.
Cache-coherent NUMA (CC-NUMA):
  Cache coherence is maintained among the caches of the various processors.
  Significantly different from SMP and clusters.
Parallel Processing 55
Motivation for NUMA
SMP has a practical limit to the number of processors.
Bus traffic limits it to between 16 and 64 processors.
In clusters each node has own memory
Apps do not see large global memory
Coherence maintained by software not hardware
NUMA retains SMP flavour while giving large scale
multiprocessing
e.g. Silicon Graphics Origin NUMA 1024 MIPS R10000 processors
Objective
to maintain transparent system wide memory while permitting
multiprocessor nodes, each with own bus or internal interconnection
system.
Parallel Processing 56
CC-NUMA Organization
Single addressable memory space.
Memory request order:
1. L1 cache (local to processor)
2. L2 cache (local to processor)
3. Main memory (local to node)
4. Remote memory (automatically and transparently to the processor)
Parallel Processing 57
NUMA Pros & Cons
Higher levels of parallelism than SMP, with minor software changes
Performance can break down if there is too much access to remote memory
Can be avoided by:
L1 & L2 cache design reducing all memory access
Need good temporal and spatial locality of software
Virtual memory management moving pages to nodes that are using them most
Keeping it cache coherent is costly in terms of HW and performance
Parallel Processing 58
Exercises
Exercise 1:
Consider a situation in which two processors (P1 and P2) in an SMP configuration, over time, require access to the same line of data from main memory. Both processors have a cache and use the MESI protocol. Initially both caches have an invalid copy of the line. Eventually, processor P1 reads line x. If this is the start of a sequence of accesses, complete the table below indicating the state-transitions of this line in the caches of both processors.
Accesses          State transition in cache 1   State transition in cache 2
P2 reads x        E→S                           I→S
P1 writes to x    S→M                           S→I
P1 writes to x    M→M                           I→I
P2 reads x        M→S                           I→S
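The transitions in the table can be checked against a toy two-cache, one-line MESI model (a sketch only: data values and the writeback a Modified line performs on a remote read are not modeled, and the function names are invented):

```python
# Toy MESI model for one line in two caches (states 'M', 'E', 'S', 'I').

def mesi_read(state, i):
    j = 1 - i
    if state[i] == 'I':                      # read miss goes on the bus
        state[i] = 'E' if state[j] == 'I' else 'S'
        if state[j] != 'I':
            state[j] = 'S'                   # any other holder drops to S
    return state                             # read hits change nothing

def mesi_write(state, i):
    j = 1 - i
    if state[i] != 'M':                      # write hit in M stays M
        if state[i] in ('S', 'I'):
            state[j] = 'I'                   # invalidate the other copy
        state[i] = 'M'                       # E upgrades silently to M
    return state

state = ['I', 'I']                # caches of P1 (index 0) and P2 (index 1)
mesi_read(state, 0);  print(state)   # ['E', 'I']  P1 reads x (miss)
mesi_read(state, 1);  print(state)   # ['S', 'S']  P2 reads x
mesi_write(state, 0); print(state)   # ['M', 'I']  P1 writes to x
mesi_write(state, 0); print(state)   # ['M', 'I']  P1 writes to x again
mesi_read(state, 1);  print(state)   # ['S', 'S']  P2 reads x
```

The printed sequence reproduces the rows of the table above, starting from the initial read by P1 described in the exercise text.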
Parallel Processing 59
Exercises
Exercise 2:
Consider an SMP with both L1 and L2 caches using the MESI protocol. One of four states is associated with each line in the L2 cache. Are all four states also needed for each line in the L1 cache? If so, why? If not, explain which state or states can be eliminated.
Parallel Processing 60
Exercises
Exercise 3:
Consider a pipeline similar to the ones seen in chapter on Pipelining, which is redrawn in Figure a with the fetch and decode stages ignored to represent the execution of thread A. Figure b illustrates the execution of a separate thread B. In both cases, a simple pipelined processor is used.
1 2 3 4 5 6 7 8 9 10 11 12
CO A1 A2 A3 A4 A5 A15 A16
FO A1 A2 A3 A4 A15 A16
EI A1 A2 A3 A15 A16
WO A1 A2 A3 A15 A16
1 2 3 4 5 6 7 8 9 10 11 12
CO B1 B2 B3 B4 B5 B6 B7
FO B1 B2 B3 B4 B5 B6 B7
EI B1 B2 B3 B4 B5 B6 B7
WO B1 B2 B3 B4 B5 B6 B7
cycle
Thread A
Thread B
Parallel Processing 61
Exercises
Exercise 3a:
Show an instruction issue diagram, for each of the two threads
Exercise 3b :
Assume that the two threads are to be executed in parallel on a chip multiprocessor, with each of the two processors on the chip using a simple pipeline. Show an instruction issue diagram. Also show a pipeline execution diagram.
Thread A
Thread B
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 62
Exercises
Exercise 3c:
Assume a two-issue superscalar architecture. Repeat part (b) for an interleaved multithreading superscalar implementation, assuming no data dependencies. Note: There is no unique answer, you need to make assumptions about latency and priority
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 63
Exercises
Exercise 3d :
Repeat part c for a blocked multithreading
superscalar implementation.
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 64
Exercises
Exercise 3e :
Repeat for a four-issue SMT architecture.
1 2 3 4 5 6 7 8 9 10 11 12
cycle
Parallel Processing 65
Exercises
Exercise 4:
An application program is executed on a nine-computer cluster. A benchmark program took time T on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine computers. The remaining time, the application had to run on a single computer.
a) Calculate the effective speedup under the aforementioned condition as compared to executing the program on a single computer. Also calculate the percentage of code that has been parallelized (programmed or compiled so as to use the cluster mode) in the preceding program.
b) Suppose that we are able to effectively use 17 computers rather than 9 computers on the parallelized portion of the code. Calculate the effective speedup that is achieved
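One way to sanity-check the answers is Amdahl's law, speedup = 1 / ((1 - f) + f / n). Since the parallel portion takes 0.25T on nine computers, it would take 0.25T × 9 = 2.25T on one, so f = 2.25 / (2.25 + 0.75) = 0.75. A minimal sketch (the name `speedup` and the rounding are assumptions, not from the slides):

```python
# Amdahl's-law helper: f is the parallelizable fraction of the
# single-computer run, n the number of computers.
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

print(round(speedup(0.75, 9), 2))    # 3.0  (part a: nine computers)
print(round(speedup(0.75, 17), 2))   # 3.4  (part b: seventeen computers)
```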
Parallel Processing 66
Exercises
Exercise 5:
The figures below show the state diagrams of two possible cache coherence protocols. Deduce and explain each protocol, and compare each to MESI.
W(i) = write to line by processor i
R(i) = read line by processor i
Z(i) = displace line by processor i
W(j) = write to line by processor j (j ≠ i)
R(j) = read line by processor j (j ≠ i)
Z(j) = displace line by processor j (j ≠ i)
Note: the state diagrams are for a given line in cache i.
[Figure a: two-state diagram with states Valid and Invalid.]
[Figure b: three-state diagram with states Exclusive, Shared, and Invalid.]
Parallel Processing 67
Parallel Processing
END