Multiprocessors and Thread-Level Parallelism


  • 7/25/2019 Multi processors and thread level parallelism

    1/74

    UNIT 5

    MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM


    CONTENT

    INTRODUCTION

    SYMMETRIC AND SHARED MEMORY ARCHITECTURES

    PERFORMANCE OF SYMMETRIC SHARED MEMORY ARCHITECTURES

    DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

    BASICS OF SYNCHRONIZATION

    MODELS OF MEMORY CONSISTENCY


    FACTORS THAT TREND TOWARD MULTIPROCESSORS

    1. A growing interest in servers and server performance

    2. A growth in data-intensive applications

    3. The insight that increasing performance on the desktop is less important

    4. An improved understanding of how to use multiprocessors effectively

    5. The advantage of leveraging a design investment by replication rather than unique design


    A TAXONOMY OF PARALLEL ARCHITECTURES

    1. Single instruction stream, single data stream (SISD)

    2. Single instruction stream, multiple data stream (SIMD)

    3. Multiple instruction stream, single data stream (MISD)

    4. Multiple instruction stream, multiple data stream (MIMD)


    SIMD

    The same instruction is executed by multiple processors using different data streams

    Exploits data-level parallelism

    Each processor has its own data memory

    A single instruction memory; a control processor fetches and dispatches instructions

    SISD

    Uniprocessor


    MIMD

    Each processor fetches its own instructions and operates on its own data

    Exploits thread-level parallelism


    FACTORS THAT CONTRIBUTED TO THE RISE OF MIMD

    1. Flexibility: functions as a single-user multiprocessor that focuses on high performance for one application, or runs multiple tasks simultaneously

    2. Cost-performance: uses the same microprocessors found in workstations and single-processor servers; multicore chips leverage the design investment through replication


    CLUSTERS

    One class of MIMD

    Use standard components and standard network technology

    Two types: commodity clusters and custom clusters


    COMMODITY CLUSTERS

    Rely on 3rdparty processors and interconnect

    technology

    Are often blade / rack mounted servers

    Focus on throughputNo communication among threads

    Assembled by users rather than vendors


    CUSTOM CLUSTERS

    The designer customizes either the detailed node design or the interconnect design, or both

    Exploit large amounts of parallelism

    Require a significant amount of communication during computation

    More efficient

    Ex.: IBM Blue Gene


    MULTICORE

    Multiple processors placed on a single die

    A.k.a. on-chip multiprocessing or single-chip multiprocessing

    Multiple cores share resources (cache, I/O bus)

    Ex.: IBM Power5


    PROCESS

    A segment of code that may be run independently

    The process state contains all the information necessary to execute that program

    Each process is independent of the others: a multiprogramming environment


    THREADS

    Multiple processors executing a single program share the code and the address space

    Grain size must be sufficiently large to exploit parallelism efficiently

    Independent threads within a process are identified by the programmer or created by the compiler

    Loop iterations within a thread exploit data-level parallelism


    MIMD CLASSIFICATION

    1. Centralized shared memory architectures

    2. Distributed memory processors


    CENTRALIZED SHARED MEMORY ARCHITECTURES

    A few dozen processors share a single centralized memory

    Large caches or multiple memory banks

    Scaling is done using point-to-point connections, switches and multiple memory banks

    Symmetric relationship, uniform access time

    Called a Symmetric shared-memory Multiprocessor (SMP) or Uniform Memory Access (UMA) architecture


    DISTRIBUTED MEMORY MULTIPROCESSORS

    Physically distributed memory

    Supports a large number of processors and high memory bandwidth

    Raises the need for a high-bandwidth interconnect: direct networks (switches) and indirect networks (multidimensional meshes) are used


    BENEFITS:

    1. Cost-effective way to scale memory bandwidth

    2. Reduces latency to access local memory

    DRAWBACKS:

    1. Communicating data between processors becomes more complex

    2. Software is needed to manage the increased memory bandwidth


    MODELS FOR COMMUNICATION AND MEMORY ARCHITECTURE

    1. Communication occurs through a shared address space

    Physically separated memories form one logical shared address space

    Called a Distributed Shared Memory (DSM) or Non-Uniform Memory Access (NUMA) architecture

    A memory reference can be made by any processor to any memory location

    Access time depends on the location of the data in memory


    2. The address space consists of multiple private address spaces

    Addresses are logically disjoint and cannot be addressed by a remote processor

    The same physical address on two different processors refers to two different memory locations

    Each processor-memory module is a separate computer

    Communication is done via message passing

    A.k.a. message-passing multiprocessors


    CHALLENGES OF MULTIPROCESSING

    1. Limited parallelism available in programs

    2. Relatively high cost of communication

    3. Large latency of remote access

    4. Difficult to achieve good speedup

    Performance is measured using Amdahl's law


    SOLUTION

    Limited parallelism: algorithms with better parallel performance

    Access latency: architecture design and programming

    Reduce the frequency of remote accesses: hardware and software mechanisms

    Tolerate latency: multithreading and prefetching


    PROBLEM

    Suppose you want to achieve a speedup of 80 with

    100 processors. What fraction of the original

    computation can be sequential?


    Assume that the program operates in only two modes:

    1. Parallel with all processors fully used (the enhanced mode)

    2. Serial with only one processor in use

    Speedup in enhanced mode = number of processors

    Fraction of enhanced mode = fraction of time spent in parallel mode


    Fraction in enhanced (parallel) mode = 99.75%

    Only 0.25% of the original computation can be sequential


    SYMMETRIC SHARED MEMORY ARCHITECTURE

    The use of multilevel caches substantially reduces the memory bandwidth demands of a processor

    Solution: small-scale multiprocessors where several processors share a single physical memory connected by a shared bus

    Benefit: cost-effective

    They support caching of both private and shared data


    Private data: Used by a single processor

    Shared data: Shared between multiple processors

    How are these cached?


    WHAT IS MULTIPROCESSOR CACHE COHERENCE?

    A memory system is said to be coherent if:

    1. A read by processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P

    2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses

    3. Writes to the same location are serialized; two writes to the same location by any two processors are seen in the same order by all processors


    Coherence: defines the behavior of reads and writes to the same memory location

    Consistency: defines the behavior of reads and writes with respect to accesses to other memory locations


    BASIC SCHEMES FOR ENFORCING COHERENCE

    Coherent caches provide:

    1. Migration: a data item can be moved to a local cache and used there

    2. Replication: shared data can be simultaneously read from multiple caches

    The protocols that maintain coherence for multiple processors are called cache coherence protocols


    1. Directory based: the sharing status of a block of physical memory is kept in just one location, the directory

    2. Snooping: every cache that has a copy of the data from a block of physical memory also has the sharing status of the block; no centralized state is kept


    SNOOPING PROTOCOLS

    1. Write invalidate

    2. Write update


    BASIC IMPLEMENTATION TECHNIQUES

    1. The processor acquires bus access and broadcasts the address to be invalidated on the bus

    2. Processors continuously snoop on the bus, watching for addresses

    3. Each processor checks whether the address on the bus is in its cache

    4. If so, it invalidates the corresponding data in its cache

    5. If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation are serialized by the bus


    Write update: broadcasts the write to all the cached copies; consumes bandwidth

    Write-through cache: written data is sent to memory, so the most recent value of a data item can always be fetched from memory


    Write-back cache: every processor snoops the addresses on the bus

    If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request

    This in turn causes the memory access to be aborted

    The cache block is then retrieved from the processor's cache


    To track whether a cache block is shared, an extra bit called the state bit is associated with each cache block

    When a write to a shared block occurs, the cache generates an invalidation on the bus and marks the block as exclusive

    The processor with this sole copy of the block is called the owner of the block


    When the invalidation is sent, the state of the owner's cache block is changed from shared to exclusive

    Later, if another processor requests the cache block, the state has to be made shared again


    WRITE INVALIDATE FOR A WRITE-BACK CACHE

    Circles: cache states

    Arcs: state transitions

    Labels on the arcs: the stimulus that causes the state transition

    Bold: bus actions caused by the transitions


    LIMITATIONS

    As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource becomes a bottleneck

    A single bus has to carry both the coherence traffic and the normal memory traffic

    Designers can use multiple buses and interconnection networks to attain a midway approach between centralized and distributed memory


    PERFORMANCE OF SYMMETRIC SHARED MEMORY MULTIPROCESSORS

    Coherence misses can be broken into two sources:

    1. True sharing miss: the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block; a subsequent attempt by another processor to read a modified word in that cache block results in a miss

    2. False sharing miss: the block is invalidated because some word in the cache block other than the one being read is written into


    PROBLEM 3: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit. Any miss that would occur if the block size were one word is designated a true sharing miss.


    DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

    A directory keeps the state of every cached block

    The information in the directory includes which caches have copies of the block, whether they are dirty, and so on

    An entry in the directory is associated with each block

    To prevent the directory from becoming a bottleneck, the directory is distributed along with the memory


    DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

    The state of each cache block can be one of the following:

    1. Shared: one or more processors have the block cached, and the value in memory (as well as in all the caches) is up to date

    2. Uncached: no processor has a copy of the cache block

    3. Modified: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date; that processor is the owner of the block


    To keep track of each potentially shared block, a bit vector is maintained per block; each bit indicates whether the corresponding processor has a copy of the block

    Local node: the node where a request originates

    Home node: the node where the memory location and the directory entry of an address reside

    Remote node: a node that has a copy of the block


    DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

    When the block is in the uncached state, the possible requests for it are:

    1. Read miss: the requesting processor is sent the block from memory; the state of the block is made shared

    2. Write miss: the requesting processor is sent the value and becomes the sharing (owner) node; the block is made exclusive


    When the block is in the shared state, the memory value is up to date:

    1. Read miss: the requesting processor is sent the requested data from memory, and the requesting processor is added to the sharing set

    2. Write miss: the requesting processor is sent the value; all other processors in the sharers set are sent invalidate messages, which contain the identity of the requesting processor; the state of the block is made exclusive


    When the block is in the exclusive state, the current value of the block is held in the owner processor's cache:

    1. Read miss: the owner processor is sent a data fetch message; the state of the block is made shared, and the requesting processor is added to the sharers set, which already contains the identity of the owner


    2. Data write back: the owner processor is replacing the block and hence the block has to be written back; the memory copy is made up to date, the block becomes uncached, and the sharers set is emptied

    3. Write miss: the block has a new owner; a message is sent to the old owner to invalidate the block, and the state of the block remains exclusive


    SYNCHRONIZATION

    Synchronization mechanisms are built with user-level software routines that rely on hardware-supplied synchronization instructions

    Atomic operations: the ability to atomically read and modify a memory location

    Atomic exchange: interchanges a value in a register with a value in memory

    Locks: 0 is used to indicate that the lock is free; 1 is used to indicate that the lock is unavailable

  • 7/25/2019 Multi processors and thread level parallelism

    60/74

    Test-and-set: tests a value and sets it if the value passes the test

    Fetch-and-increment: returns the value of a memory location and atomically increments it


    IMPLEMENTING LOCKS USING COHERENCE

    Spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds

    They are used when the lock is expected to be held for a very short amount of time and the process of acquiring the lock is low latency


    Simple implementation:

    A processor could continually try to acquire the lock using an atomic operation, e.g. an exchange, and test the returned value

    To release the lock, the processor stores a 0 to the lock variable


    Coherence mechanism:

    Use the cache coherence mechanism to maintain the lock value coherently

    A processor can then spin on a locally cached copy of the lock rather than going to global memory on every attempt

    Locality in lock access: the processor that used the lock last will likely use it again in the near future


    Spin procedure:

    A processor reads the lock variable to test its state

    The read is repeated until the value indicates that the lock is unlocked

    The processor then races with all the other waiting processors

    All processors use a swap instruction that reads the old value and stores a 1 into the lock variable


    The single winner will see a 0, and the losers will see the 1 placed there by the winner

    The winning processor executes the code after the lock and then releases it by storing a 0 into the lock variable; the race then starts again


    MODELS OF MEMORY CONSISTENCY

    Consistency answers two questions:

    1. When must a processor see a value that has been updated by another processor?

    2. In what order must a processor observe the data writes of another processor?



    Sequential consistency: requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved


    A program is synchronized if all accesses to shared data are ordered by synchronization operations

    Data race: variables are updated without ordering by synchronization; the execution outcome depends on the relative speed of the processors

    Synchronization operations?


    RELAXED CONSISTENCY MODELS

    Allow reads and writes to complete out of order, but use synchronization operations to enforce ordering

    X -> Y: operation X must complete before operation Y

    Four possible orderings: R -> W, R -> R, W -> R, W -> W


    1. Relaxing W -> R yields the total store ordering or processor consistency model

    2. Relaxing W -> W ordering yields a model known as partial store order

    3. Relaxing R -> W and R -> R yields the weak ordering and release consistency models


    1. Define the four major categories of computer systems

    2. List the factors that led to the rise of MIMD multiprocessors

    3. Illustrate the basic architecture of a centralized shared memory multiprocessor

    4. Illustrate the basic architecture of a distributed memory multiprocessor

    5. Distinguish between private data and shared data


    6. Define the cache coherence problem

    7. List the conditions required for a memory system to be coherent

    8. Define the cache coherence protocols

    9. Analyze the implementation of a cache coherence protocol

    10. Illustrate the performance of symmetric shared memory multiprocessors with a commercial workload application


    11. Illustrate the working of a distributed memory multiprocessor

    12. Demonstrate the transitions in a directory-based system

    13. Define spin locks

    14. Define the ordering of a relaxed consistency model