Multiprocessors and Thread-Level Parallelism


  • 7/25/2019 Multi processors and thread level parallelism

    1/74

    UNIT 5

    MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM


    CONTENT

    INTRODUCTION

    SYMMETRIC AND SHARED MEMORY ARCHITECTURES

    PERFORMANCE OF SYMMETRIC SHARED MEMORY ARCHITECTURES

    DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

    BASICS OF SYNCHRONIZATION

    MODELS OF MEMORY CONSISTENCY


    FACTORS THAT TREND TOWARD MULTIPROCESSORS

    1. A growing interest in servers and server performance

    2. A growth in data-intensive applications

    3. The insight that increasing performance on the desktop is less important

    4. An improved understanding of how to use multiprocessors effectively

    5. The advantage of leveraging a design investment by replication rather than unique design


    A TAXONOMY OF PARALLEL ARCHITECTURES

    1. Single instruction stream, single data stream (SISD)

    2. Single instruction stream, multiple data stream (SIMD)

    3. Multiple instruction stream, single data stream (MISD)

    4. Multiple instruction stream, multiple data stream (MIMD)


    SIMD

    The same instruction is executed by multiple processors using different data streams

    Exploits data-level parallelism

    Each processor has its own data memory

    A single instruction memory; a control processor fetches and dispatches instructions

    SISD

    Uniprocessor


    MIMD

    Each processor fetches its own instructions and operates on its own data

    Exploits thread-level parallelism


    FACTORS THAT CONTRIBUTED TO THE RISE OF MIMD

    1. Flexibility: functions as a single-user multiprocessor that focuses on high performance for one application, or runs multiple tasks simultaneously

    2. Cost-performance: uses the same microprocessors found in workstations and single-processor servers; multicore chips leverage the design investment through replication


    CLUSTERS

    One class of MIMD

    Use standard components and standard network technology

    Two types: commodity clusters and custom clusters


    COMMODITY CLUSTERS

    Rely on 3rdparty processors and interconnect

    technology

    Are often blade / rack mounted servers

    Focus on throughputNo communication among threads

    Assembled by users rather than vendors


    CUSTOM CLUSTERS

    The designer customizes either the detailed node design or the interconnect design, or both

    Exploit large amounts of parallelism

    Require a significant amount of communication during computation

    More efficient

    Ex.: IBM Blue Gene


    MULTICORE

    Multiple processors placed on a single die

    A.k.a. on-chip multiprocessing or single-chip multiprocessing

    Multiple cores share resources (cache, I/O bus)

    Ex.: IBM Power5


    PROCESS

    A segment of code that may be run independently

    The process state contains all the information necessary to execute that program

    Each process is independent of the others: a multiprogramming environment


    THREADS

    Multiple processors executing a single program share the code and the address space

    Grain size must be sufficiently large to exploit parallelism efficiently

    Independent threads within a process are identified by the programmer or created by the compiler

    Loop iterations within a thread exploit data-level parallelism


    MIMD CLASSIFICATION

    1. Centralized shared memory architectures

    2. Distributed memory processors


    CENTRALIZED SHARED MEMORY ARCHITECTURES

    A few dozen processors share a single centralized memory

    Large caches or multiple memory banks

    Scaling is done using point-to-point connections, switches and multiple memory banks

    Symmetric relationship, uniform access time

    Called a Symmetric shared-memory Multiprocessor (SMP) or Uniform Memory Access (UMA) architecture


    DISTRIBUTED MEMORY MULTIPROCESSORS

    Physically distributed memory

    Supports a large number of processors and high memory bandwidth

    Raises the need for a high-bandwidth interconnect: direct networks (switches) and indirect networks (multidimensional meshes) are used


    BENEFITS:

    1. Cost-effective way to scale memory bandwidth

    2. Reduces latency to access local memory

    DRAWBACKS:

    1. Communicating data between processors becomes more complex

    2. Software is needed to manage the increased memory bandwidth


    MODELS FOR COMMUNICATION AND MEMORY ARCHITECTURE

    1. Communication occurs through a shared address space

    Physically separated memories form one logical shared address space

    Called a Distributed Shared Memory (DSM) or Non-Uniform Memory Access (NUMA) architecture

    A memory reference can be made by any processor to any memory location

    Access time depends on the location of the data in memory


    2. The address space consists of multiple private address spaces

    Addresses are logically disjoint and cannot be addressed by a remote processor

    The same physical address on two different processors refers to two different memory locations

    Each processor-memory module is a separate computer

    Communication is done via message passing

    A.k.a. message-passing multiprocessors


    CHALLENGES OF MULTIPROCESSING

    1. Limited parallelism available in programs

    2. Relatively high cost of communication

    3. Large latency of remote access

    4. Difficult to achieve good speedup

    Performance is measured using Amdahl's law


    SOLUTION

    Limited parallelism: algorithms with better parallel performance

    Access latency: architecture design and programming

    Reduce the frequency of remote accesses: hardware and software mechanisms

    Tolerate latency: multithreading and prefetching


    PROBLEM

    Suppose you want to achieve a speedup of 80 with

    100 processors. What fraction of the original

    computation can be sequential?


    Assume that the program operates in only two modes:

    1. Parallel with all processors fully used (the enhanced mode)

    2. Serial with only one processor in use

    Speedup in enhanced mode = number of processors

    Fraction of enhanced mode = fraction of time spent in parallel mode


    Fraction in enhanced (parallel) mode = 99.75%

    Only 0.25% of the original computation can be sequential


    SYMMETRIC SHARED MEMORY ARCHITECTURE

    The use of multilevel caches substantially reduces the memory bandwidth demands of a processor

    Solution: small-scale multiprocessors where several processors share a single physical memory connected by a shared bus

    Benefit: cost-effective

    They support caching of both private and shared data


    Private data: Used by a single processor

    Shared data: Shared between multiple processors

    How are these cached?


    WHAT IS MULTIPROCESSOR CACHE COHERENCE?

    A memory system is said to be coherent if:

    1. A read by processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P

    2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses

    3. Writes to the same location are serialized; two writes to the same location by any two processors are seen in the same order by all processors


    Coherence: defines the behavior of reads and writes to the same memory location

    Consistency: defines the behavior of reads and writes with respect to accesses to other memory locations


    BASIC SCHEMES FOR ENFORCING COHERENCE

    Coherent caches provide:

    1. Migration: a data item can be moved to a local cache and used there

    2. Replication: shared data can be simultaneously read from multiple caches

    The protocols that maintain coherence for multiple processors are called cache coherence protocols


    1. Directory based: the sharing status of a block of physical memory is kept in just one location, the directory

    2. Snooping: every cache that has a copy of the data from a block of physical memory also has the sharing status of the block; no centralized state is kept


    SNOOPING PROTOCOLS

    1. Write invalidate

    2. Write update


    BASIC IMPLEMENTATION TECHNIQUES

    1. The processor acquires bus access and broadcasts the address to be invalidated on the bus

    2. Processors continuously snoop on the bus, watching for addresses

    3. Each processor checks whether the address on the bus is in its cache

    4. If so, it invalidates the corresponding data in its cache

    5. If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation are serialized by the bus


    Write update: broadcasts the write to all the cached copies; consumes bandwidth

    Write-through cache: written data is sent to memory, so the most recent value of a data item can always be fetched from memory


    Write-back cache: every processor snoops the addresses on the bus

    If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request

    This in turn causes the memory access to be aborted

    The cache block is then retrieved from the processor's cache


    To track whether a cache block is shared, an extra bit called the state bit is associated with each cache block

    When a write to a shared block occurs, the cache generates an invalidation on the bus and marks the block as exclusive

    The processor with this sole copy of the block is called the owner of the block


    When the invalidation is sent, the state of the owner's cache block is changed from shared to exclusive

    Later, if another processor requests the cache block, the state has to be made shared again


    WRITE INVALIDATE FOR A WRITE-BACK CACHE

    Circles: cache states

    Arcs: state transitions

    Labels on the arcs: the stimulus that causes the state transition

    Bold: bus actions caused by the transitions


    LIMITATIONS

    As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource becomes a bottleneck

    A single bus has to carry both the coherence traffic and the normal memory traffic

    Designers can use multiple buses and interconnection networks to attain a midway approach between centralized and distributed memory


    PERFORMANCE OF SYMMETRIC SHARED MEMORY MULTIPROCESSORS

    Coherence misses can be broken into two sources:

    1. True sharing miss: the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block; a subsequent attempt by another processor to read a modified word in that cache block results in a miss

    2. False sharing miss: the block is invalidated because some word in the cache block other than the one being read is written into


    PROBLEM 3: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit. Any miss that would occur if the block size were one word is designated a true sharing miss.


    DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

    A directory keeps the state of every cached block

    The information in the directory includes which caches have copies of the block, whether they are dirty, and so on

    An entry in the directory is associated with each block

    To prevent the directory from becoming a bottleneck, the directory is distributed along with the memory


    DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

    The state of each cache block can be one of the following:

    1. Shared: one or more processors have the block cached, and the value in memory (as well as in all the caches) is up to date

    2. Uncached: no processor has a copy of the cache block

    3. Modified: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date; that processor is the owner of the block


    To keep track of each potentially shared block, a bit vector is maintained per block; each bit indicates whether the corresponding processor has a copy of the block

    Local node: the node where a request originates

    Home node: the node where the memory location and the directory entry of an address reside

    Remote node: a node that has a copy of the block


    DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

    When the block is in the uncached state, the possible requests for it are:

    1. Read miss: the requesting processor is sent the block from memory; the state of the block is made shared

    2. Write miss: the requesting processor is sent the value and becomes the sharing (owner) node; the block is made exclusive


    When the block is in the shared state, the memory value is up to date:

    1. Read miss: the requesting processor is sent the requested data from memory, and the requesting processor is added to the sharing set

    2. Write miss: the requesting processor is sent the value; all other processors in the sharers set are sent invalidate messages, which contain the identity of the requesting processor; the state of the block is made exclusive


    When the block is in the exclusive state, the current value of the block is held in the owner processor's cache:

    1. Read miss: the owner processor is sent a data fetch message; the state of the block is made shared, and the requesting processor is added to the sharers set, which already contains the identity of the owner


    2. Data write back: the owner processor is replacing the block and hence the block has to be written back; the memory copy is made up to date, the block becomes uncached, and the sharers set is emptied

    3. Write miss: the block has a new owner; a message is sent to the old owner to invalidate the block, and the state of the block remains exclusive


    SYNCHRONIZATION

    Synchronization mechanisms are built with user-level software routines that rely on hardware-supplied synchronization instructions

    Atomic operations: the ability to atomically read and modify a memory location

    Atomic exchange: interchanges a value in a register with a value in memory

    Locks: 0 is used to indicate that the lock is free; 1 is used to indicate that the lock is unavailable

  • 7/25/2019 Multi processors and thread level parallelism

    60/74

    Test-and-set: tests a value and sets it if the value passes the test

    Fetch-and-increment: returns the value of a memory location and atomically increments it


    IMPLEMENTING LOCKS USING COHERENCE

    Spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds

    They are used when the lock is expected to be held for a very short amount of time and the process of acquiring the lock is low latency


    Simple implementation:

    A processor could continually try to acquire the lock using an atomic operation, e.g. an exchange, and test the returned value

    To release the lock, the processor stores a 0 to the lock variable


    Coherence mechanism:

    Use the cache coherence mechanism to maintain the lock value coherently

    A processor can then spin on a locally cached copy of the lock rather than going to global memory on every attempt

    Locality in lock access: the processor that used the lock last will likely use it again in the near future


    Spin procedure:

    A processor reads the lock variable to test its state

    The read is repeated until the value indicates that the lock is unlocked

    The processor then races with all the other waiting processors

    All processors use a swap instruction that reads the old value and stores a 1 into the lock variable


    The single winner will see a 0, and the losers will see the 1 placed there by the winner

    The winning processor executes the code after the lock and then releases it by storing a 0 into the lock variable; the race then starts again


    MODELS OF MEMORY CONSISTENCY

    Consistency answers two questions:

    1. When must a processor see a value that has been updated by another processor?

    2. In what order must a processor observe the data writes of another processor?



    Sequential consistency: requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved


    A program is synchronized if all accesses to shared data are ordered by synchronization operations

    Data race: variables are updated without ordering by synchronization; the execution outcome depends on the relative speed of the processors

    Synchronization operations?


    RELAXED CONSISTENCY MODELS

    Allow reads and writes to complete out of order, but use synchronization operations to enforce ordering

    X -> Y: operation X must complete before operation Y

    Four possible orderings: R -> W, R -> R, W -> R, W -> W


    1. Relaxing W -> R yields the total store ordering or processor consistency model

    2. Relaxing W -> W ordering yields a model known as partial store order

    3. Relaxing R -> W and R -> R yields the weak ordering and release consistency models


    1. Define the four major categories of computer systems

    2. List the factors that led to the rise of MIMD multiprocessors

    3. Illustrate the basic architecture of a centralized shared memory multiprocessor

    4. Illustrate the basic architecture of a distributed memory multiprocessor

    5. Distinguish between private data and shared data


    6. Define the cache coherence problem

    7. List the conditions required for a memory system to be coherent

    8. Define the cache coherence protocols

    9. Analyze the implementation of a cache coherence protocol

    10. Illustrate the performance of symmetric shared memory multiprocessors with a commercial workload application


    11. Illustrate the working of a distributed memory multiprocessor

    12. Demonstrate the transitions in a directory-based system

    13. Define spin locks

    14. Define the ordering of a relaxed consistency model