
Shared-Memory Multiprocessors

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University


Roadmap

UI Cedar
  Architecture overview
  Operating system primitives
  Multiprocessing primitives

Run queue organization
  Centralized
  Distributed
  Hierarchical organization

UI Cedar Architecture

Shared-memory MIMD system
Experimental system built at the University of Illinois

Processors are grouped into clusters
Uses a hierarchical organization
Three levels of memory hierarchy:

  Local memory
  Cluster memory
  Global memory

These refer to the same physical memory

UI Cedar Architecture (cont’d)

[Figure: Cedar system organization] (CCU: Cluster Control Unit)

UI Cedar Architecture (cont’d)

Local memory: local to each processor; no need to go through any network

Cluster memory: processors in a cluster can access this memory via the local interconnection network

Global memory: any processor can access this memory via the global interconnection network

UI Cedar Architecture (cont’d)

Processor cluster (PC): the smallest execution unit

Typically 8 processors

A compound function (CF, a chunk of the program) can be assigned to one or more PCs

Each processor contains an FP unit
No local data registers (unusual)
Local memory can be used as a large register set: it can be dynamically partitioned into pseudo-vector registers of different sizes
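A hypothetical sketch of the pseudo-vector-register idea: a bump allocator that carves one processor's local memory into vector registers of different sizes. The memory size, names, and allocator are illustrative assumptions, not Cedar's actual mechanism.

/* Hypothetical sketch: partitioning local memory into pseudo-vector
   registers of different sizes. Illustrative only, not Cedar's API. */
#include <stddef.h>
#include <stdio.h>

#define LOCAL_MEM_WORDS 4096              /* assumed local-memory size */

static double local_mem[LOCAL_MEM_WORDS]; /* one processor's local memory */
static size_t next_free = 0;

/* Allocate a pseudo-vector register of len words; NULL if out of space. */
static double *alloc_vreg(size_t len) {
    if (next_free + len > LOCAL_MEM_WORDS) return NULL;
    double *reg = &local_mem[next_free];
    next_free += len;
    return reg;
}

int main(void) {
    double *v32  = alloc_vreg(32);        /* a short vector register */
    double *v512 = alloc_vreg(512);       /* a long vector register */
    printf("v32 at %p, v512 at %p\n", (void *)v32, (void *)v512);
    return 0;
}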

UI Cedar Architecture (cont’d)

Processor cluster (PC) is controlled by the CCU
The CCU serves as a synchronization unit:

  Starts all processors when the data is moved from global to local memory
  Signals the GCU (Global Control Unit) when a CF is done

Local network: either a crossbar or a bus

Global network: based on an extension of the Omega network

UI Cedar Architecture (cont’d)

At least two paths from every switch (except the last stage)

Adds redundancy to the original Omega network

Improves fault tolerance and reduces conflicts
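To make the routing concrete, here is a hedged sketch of destination-tag routing through a plain Omega network, the baseline that Cedar extends; the redundant paths described above are not modeled, and the 8-port size and names are illustrative.

/* Sketch: destination-tag routing through a standard Omega network of
   log2(N) shuffle-exchange stages. Cedar's extra paths are not modeled. */
#include <stdio.h>

#define STAGES 3                   /* N = 8 inputs/outputs */

/* Print the switch output (0 = upper, 1 = lower) taken at each stage
   when routing from src to dst. */
void route(unsigned src, unsigned dst) {
    unsigned addr = src;
    printf("route %u -> %u:", src, dst);
    for (int s = 0; s < STAGES; s++) {
        /* Perfect shuffle: rotate the address left by one bit. */
        addr = ((addr << 1) | (addr >> (STAGES - 1))) & ((1u << STAGES) - 1);
        /* Stage s uses destination bit (STAGES-1-s); that bit selects
           the switch output and becomes the low bit of the address. */
        unsigned bit = (dst >> (STAGES - 1 - s)) & 1u;
        addr = (addr & ~1u) | bit;
        printf(" %u", bit);
    }
    printf(" (arrives at %u)\n", addr);
}

int main(void) {
    route(2, 5);
    route(0, 7);
    return 0;
}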

UI Cedar Architecture (cont’d)

Memory system

Each PC contains eight 16K memory modules

Memory hierarchy is user transparent
CCUs and the GCU move program code from global to local memory in large blocks

Transfer time is overlapped with computation

UI Cedar Architecture (cont’d)

Cache system is implemented in local memories for global memory accesses
Not all accesses are cached; only those predetermined by the programmer or compiler

To avoid cache consistency problems, only the following are cached:
  Read-only data, or
  Data written by a single processor (i.e., private data)

UI Cedar Architecture (cont’d)

GCU uses macro-dataflow:
  Reduces scheduling and other overheads
  Considers large structures (arrays) as one object
  Several operations are combined to reduce scheduling overhead

Each PC is considered an execution unit; each PC executes a Compound Function (CF)

Views the program as a directed graph whose nodes are CFs

Large data structures are stored in global memory, so there is no structure-copying problem

Synchronization Primitive

Synchronization is supported via a sync variable
It is a special data type supported by the hardware
Consists of two contiguous items in global memory; each item is either 4 bytes (single precision) or 8 bytes (double precision)

  First item (key): always an integer
  Second item (data): unspecified type (integer, floating point, logical, or address)

Synchronization Primitive (cont’d)

Sync expression:

  sync(key-relation; key-op; data-op)

  key-relation:  key relop expression | void
  key-op:        lvalue = key | key = expression | lvalue = ++key |
                 lvalue = --key | ++key | --key | void
  data-op:       lvalue = data | data = expression | void

Synchronization Primitive (cont’d)

Sync expression semantics:
  The key-relation is evaluated
  If true, key-op and data-op are done indivisibly
  The result of the sync expression is the value of the key-relation

If the key-relation is omitted, key-op and data-op are done unconditionally

When the data-op is missing, key does not have to be the key field of a sync variable; it can be any integer

Synchronization Primitive (cont’d)

Sync expression example

while (!sync(lock == 0; ++lock));

/* spin-wait until lock is free */

/* and then set lock */

accum += delta;

lock = 0; /* unlock */
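For readers more used to conventional primitives, here is a minimal sketch of how the same spin-lock pattern could be emulated with C11 atomics. This is an analogy, not Cedar's hardware mechanism: the sync variable evaluates the key-relation and key-op indivisibly, which a compare-exchange approximates. The names lock, accum, and delta come from the example above.

/* Sketch: emulating sync(lock == 0; ++lock) with C11 atomics. */
#include <stdatomic.h>

atomic_int lock = 0;                    /* plays the role of the key */
double accum = 0.0;

void add_delta(double delta) {
    int expected = 0;
    /* Test the key-relation and update the key in one indivisible step. */
    while (!atomic_compare_exchange_weak(&lock, &expected, 1))
        expected = 0;                   /* CAS failed: reset and retry */
    accum += delta;                     /* critical section */
    atomic_store(&lock, 0);             /* lock = 0  (unlock) */
}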


Memory Attributes

Three types:

Locality
  Global
  Cluster

Page type
  Shared
  Private

Access privilege
  Read, write, execute, or a combination of these

Memory Attributes (cont’d)

Locality attribute
  Specifies where the page should be located in the hierarchy
  Global pages are mapped to physical global memory
  Cluster pages are mapped to cluster memory
  Details of the physical mapping are not visible to a user program

Xylem always places a page according to its attribute when a user program references it

Memory Attributes (cont’d)

Page type attribute
  Specifies whether the page is shared or private
  Indicates how a task logically sees the page

Private pages belong to a single task
  Any modifications can be seen only by that task; other tasks do not see these changes

Modifications done to a shared page can be seen by other tasks

Multiprocessing Support

Cedar compiler takes FORTRAN source code, analyzes it for implicit parallelism, and generates a control flow graph

User Control Block (UCB)
  Created when the user first logs in
  Multiple logins do not create multiple UCBs (one UCB per user)

Process Control Block (PCB)
  When a process is created (via Unix fork), one PCB and a single task control block (TCB) are created
  The new task is scheduled; it can create other tasks linked to the same PCB


Multiprocessing Support (cont’d)

Five primitives are provided (a threads-based emulation is sketched below):

  create_task()
  delete_task()
  start_task()
  end_task(): stop the task
  wait_task(): wait for another task
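The five primitives map loosely onto ordinary threading APIs. Below is a hedged sketch of one way to emulate them with POSIX threads; the TCB layout, the idle/busy flags, and everything besides the five primitive names are illustrative assumptions, not Xylem's implementation. Two simplifications: end_task() is modeled by the task's entry function returning, and this sketch cannot interrupt a busy task the way the real start_task can.

/* Hedged pthreads emulation of the five task primitives. */
#include <pthread.h>

#define MAX_TASKS 64
typedef void (*entry_t)(void);

typedef struct {                 /* emulated TCB */
    pthread_t thread;
    pthread_mutex_t m;
    pthread_cond_t cv;
    entry_t pc;                  /* where start_task says to begin */
    int busy, in_use;
} tcb_t;

static tcb_t tasks[MAX_TASKS];

static void *trampoline(void *arg) {
    tcb_t *t = arg;
    pthread_mutex_lock(&t->m);
    while (t->in_use) {
        while (t->in_use && !t->busy)        /* idle state */
            pthread_cond_wait(&t->cv, &t->m);
        if (!t->in_use) break;
        entry_t pc = t->pc;
        pthread_mutex_unlock(&t->m);
        pc();                                /* returning models end_task() */
        pthread_mutex_lock(&t->m);
        t->busy = 0;                         /* back to idle */
        pthread_cond_broadcast(&t->cv);      /* unblock wait_task callers */
    }
    pthread_mutex_unlock(&t->m);
    return 0;
}

int create_task(void) {                      /* new TCB: idle, not scheduled */
    for (int i = 0; i < MAX_TASKS; i++)
        if (!tasks[i].in_use) {
            tcb_t *t = &tasks[i];
            pthread_mutex_init(&t->m, 0);
            pthread_cond_init(&t->cv, 0);
            t->busy = 0;
            t->in_use = 1;
            pthread_create(&t->thread, 0, trampoline, t);
            return i;                        /* integer task identifier */
        }
    return -1;
}

void start_task(int n, entry_t pc) {         /* mark busy, run at pc */
    tcb_t *t = &tasks[n];
    pthread_mutex_lock(&t->m);
    t->pc = pc;
    t->busy = 1;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->m);
}

void wait_task(int n) {                      /* block until task is idle */
    tcb_t *t = &tasks[n];
    pthread_mutex_lock(&t->m);
    while (t->busy)
        pthread_cond_wait(&t->cv, &t->m);
    pthread_mutex_unlock(&t->m);
}

void delete_task(int n) {                    /* deallocate TCB resources */
    tcb_t *t = &tasks[n];
    pthread_mutex_lock(&t->m);
    t->in_use = 0;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->m);
    pthread_join(t->thread, 0);
}

Usage mirrors Example 1 below: RIGHT = create_task(); start_task(RIGHT, C); ... wait_task(RIGHT); delete_task(RIGHT).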

Multiprocessing Support (cont’d)

create_task()

Creates a new TCB, attached to the caller’s PCB

Not scheduled for execution

Task is in idle state

Returns an integer to identify the task

No child-parent relationship


Multiprocessing Support (cont’d)

delete_task(tasknum)

Deletes the task identified by tasknum

TCB and associated resources are deallocated

If the task was executing, it is terminated

Error if tasknum is unknown


Multiprocessing Support (cont’d)

start_task(tasknum, pc)

Forces the task identified by tasknum to begin execution at location pc

Task is marked busy and scheduled for execution

If the task is already busy, it is interrupted with no way of returning

Error if tasknum is unknown

Multiprocessing Support (cont’d)

end_task()

Marks the calling task as idle and stops its execution

All tasks waiting for this task are unblocked

It does not deallocate resources allocated to the task

A task that waits for this one can delete it, start it at another location, or let it remain idle

Multiprocessing Support (cont’d)

wait_task(tasknum)

Blocks the calling task until the specified task (i.e., tasknum) enters the idle state

A task enters the idle state when it is created and when it calls end_task

If the specified task is already in the idle state, the calling task continues immediately

Error if tasknum is unknown

Example 1

global shared integer: FLAG, MIDDLE
local private integer: RIGHT

A: <body of node A>
   FLAG = 0
   RIGHT = create_task()
   call start_task(RIGHT, C)
   goto B

B: <body of B>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto D

[Task graph: A branches to B and C; B leads to D, C leads to F; whichever of B and C finishes second spawns E; D, E, and F join at G]

Example 1 (cont’d)

C: <body of C>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto F

D: <body of D>
   goto GD

E: <body of E>
   goto GE

F: <body of F>
   goto GF


Example 1 (cont’d)

GE:
GF: call end_task()

GD: call wait_task(RIGHT)
    call wait_task(MIDDLE)
    call delete_task(RIGHT)
    call delete_task(MIDDLE)


Example 2

      DO 101 I = 1,N
      DO 101 J = 1,210
101   A(I,J) = B(I,J) + C(J)

      DO 102 I = 1,10000
      F(I) = ABS(F(I))
102   IF (G(I) .LT. 0) F(I) = -F(I)

[Task graph: A branches to B (DOALL 101) and C (DOALL 102), which join at D]

Example 2 (cont’d)

local private integer T

A: T = create_task()
   call start_task(T, C)
   goto B

B: doall 101
   goto DB

C: doall 102
   goto DC

DB: call wait_task(T)
    call delete_task(T)
    goto next_node

DC: call end_task()

Example 2 (cont’d)

C:  N = 10
    local private integer tasknum(N), T, J
    global shared integer I

    I = 0
    do J = 1, N
       T = create_task()
       tasknum(J) = T
       call start_task(T, CC)
    enddo

    do J = 1, N
       T = tasknum(J)
       call wait_task(T)
       call delete_task(T)
    enddo

CC: local private integer J, K

    dowhile (SYNC(I < 100; J = ++I))
       J = J*100 - 99
       do K = J, J+99
          F(K) = abs(F(K))
          if (G(K) .LT. 0) F(K) = -F(K)
       enddo
    endwhile
    call end_task()

Run Queue Organization

Run queue organizations:

Centralized: a single global queue

Distributed: local queues

Hybrid: multiple queues (e.g., hierarchical organization)

Run Queue Organization (cont’d)

Centralized organization
  A single global queue
  Tasks are accessible to all processors
  Mutually exclusive access to the global queue is required (sketched below)
  Can lead to queue-access contention for a large number of processors
  Good for small systems
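A minimal sketch of the contention point: every dequeue takes one global lock, so all processors serialize on it. The queue layout and names are illustrative assumptions, not a real scheduler.

/* Sketch: a centralized run queue with a single lock. */
#include <pthread.h>
#include <stddef.h>

typedef struct task { struct task *next; } task_t;

typedef struct {
    pthread_mutex_t lock;        /* the single point of contention */
    task_t *head;
} run_queue;

task_t *dequeue(run_queue *q) {  /* every processor calls this */
    pthread_mutex_lock(&q->lock);
    task_t *t = q->head;
    if (t) q->head = t->next;
    pthread_mutex_unlock(&q->lock);
    return t;                    /* NULL when the queue is empty */
}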

Run Queue Organization (cont’d)

Distributed organization
  A local queue at each processor
  Tasks are accessible only to the associated processor
  Needs a task placement policy
  Excellent scalability; good for large systems
  Load balancing is a problem

Run Queue Organization (cont’d)

Performance comparison:
  Run queue access time is not negligible
  Number of processors = 64
  Average number of tasks per job = 64 (exponentially distributed)
  Average task service time = 1 time unit (exponentially distributed)
  Run queue access time f = 0% to 4% of the task service time

Run Queue Organization (cont’d)

[Figure: Mean response time vs. utilization, centralized organization; curves for f = 0%, 1%, 2%, 3%, 4%]

Run Queue Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed organization; curves for f = 0% and f = 4%]

Run Queue Organization (cont’d)

[Figure: Mean response time vs. service time CV, distributed vs. centralized organization]

Improving Performance

Centralized organization: need to minimize access contention

Autonomous policy (Nelson & Squillante)
  Every access brings a set of tasks, reducing the number of accesses to the central queue (sketched below)
  Potential problems:
    Load imbalance
    Optimal set size depends on the system load
    Large service time CV can cause performance deterioration
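A hedged sketch of the autonomous policy's core idea, mirroring the illustrative run_queue above: one lock acquisition fetches up to batch tasks instead of one. The structures are assumptions; picking batch is exactly the set-size tuning problem noted above.

/* Sketch of the autonomous policy: amortize lock traffic by taking
   a batch of tasks per central-queue access. */
#include <pthread.h>
#include <stddef.h>

typedef struct task { struct task *next; } task_t;
typedef struct { pthread_mutex_t lock; task_t *head; } run_queue;

size_t dequeue_batch(run_queue *q, task_t **out, size_t batch) {
    size_t n = 0;
    pthread_mutex_lock(&q->lock);
    while (n < batch && q->head) {   /* grab up to batch tasks */
        out[n++] = q->head;
        q->head = q->head->next;
    }
    pthread_mutex_unlock(&q->lock);
    return n;                        /* tasks now held locally */
}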

Improving Performance (cont’d)

Cooperative policy (Nelson & Squillante)
  Every access also brings tasks for other processors
  Moves tasks from the central queue to other processors’ local queues
  Uses a “join the shortest queue” policy
  Improves load balancing
  Performs better than the autonomous policy and the distributed organization
  Potential problems:
    Difficult to implement for large systems
    The scheduler needs to maintain state information on other processors (their local queue lengths)

Improving Performance (cont’d)

Distributed organization: we have to address the load imbalance problem

Oblivious placement policies:
  Random
  Round robin (cyclic)

Adaptive placement policies (a probe-based sketch appears with the implementation discussion below):
  Shortest queue
  Shortest response time (SRT) queue

Improving Performance (cont’d)

[Figure: Mean response time vs. utilization for random, round robin, shortest queue, and SRT queue placement]

Improving Performance (cont’d)

[Figure: Mean response time vs. service time CV for random, round robin, shortest queue, and SRT queue placement]

Improving Performance (cont’d)

Implementation problems with adaptive policies: system state overhead

Both shortest queue and SRT queue policies need system state information

To reduce this overhead, state information is collected from only a subset of P (P < N) processors
P = number of probes used to collect state information

If P is small, we succeed in reducing the overhead; in practice, a small number of probes is sufficient, as the sketch below illustrates
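A hedged sketch of probe-limited shortest-queue placement: sample P random processors and place the task on the least loaded of them. The stubbed queue lengths, the RNG, and all names are illustrative assumptions.

/* Sketch: shortest-queue placement with P probes instead of a full
   scan of all N local queues. */
#include <stdlib.h>

#define N 64                         /* processors, as in the comparison */

static int qlen[N];                  /* stub: per-processor queue lengths */
static int queue_len(int p) { return qlen[p]; }

int place_task(int probes) {
    int best = rand() % N;           /* first probe */
    int best_len = queue_len(best);
    for (int i = 1; i < probes; i++) {
        int p = rand() % N;
        if (queue_len(p) < best_len) {
            best = p;
            best_len = queue_len(p);
        }
    }
    return best;                     /* processor to enqueue the task on */
}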

Improving Performance (cont’d)

[Figure: Mean response time vs. number of probes (1 to 10) for shortest queue and SRT queue placement]

Improving Performance (cont’d)

A problem with the SRT queue policy: it needs a priori knowledge of execution times
Often we may get only an estimate, subject to estimation errors

ESRT queue policy
  Uses an estimate that is within X% of the actual service time (the experiments used 30%)

SRT queue policy
  Assumes the exact service time is known beforehand

Improving Performance (cont’d)

[Figure: Mean response time vs. service time CV for shortest queue, SRT queue, and ESRT queue placement]

Hierarchical Organization

Goal is to get the best of both organizations:
  Avoid bottleneck problems, as in the distributed organization
  Good load sharing, as in the centralized organization
  Should be self-scheduling, with no state-information collection

Hierarchical organization provides all these desired features: it performs close to the centralized organization but scales well like the distributed organization (a sketch follows)
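A hedged sketch of the hierarchical idea: run queues form a tree, tasks enter at the root, and an empty queue pulls a batch down from its parent, so no global state is consulted. The tree layout, the depth bound, and the transfer rule are illustrative assumptions; the transfer ratio Tr is the parameter varied on the following slides.

/* Sketch: a tree of run queues; each level refills from its parent,
   so load spreads without global state collection (self-scheduling).
   Locking is omitted for brevity. */
#include <stddef.h>

#define MAX_DEPTH 8                 /* assumed tree-depth bound */

typedef struct task { struct task *next; } task_t;

typedef struct node {
    struct node *parent;            /* NULL at the root */
    task_t *head;                   /* this level's queue */
    int procs_below;                /* processors under this node */
} node_t;

/* Move up to Tr * procs_below tasks from the parent down to n. */
static void pull_from_parent(node_t *n, int tr) {
    if (!n->parent) return;
    int want = tr * n->procs_below;
    while (want-- > 0 && n->parent->head) {
        task_t *t = n->parent->head;
        n->parent->head = t->next;
        t->next = n->head;
        n->head = t;
    }
}

/* A processor self-schedules: refill empty queues along the path
   from the root, then take one task from its local queue. */
task_t *get_task(node_t *leaf, int tr) {
    node_t *path[MAX_DEPTH];
    int depth = 0;
    for (node_t *n = leaf; n; n = n->parent)
        path[depth++] = n;
    for (int i = depth - 1; i >= 0; i--)   /* root toward leaf */
        if (!path[i]->head)
            pull_from_parent(path[i], tr);
    task_t *t = leaf->head;
    if (t) leaf->head = t->next;
    return t;                              /* NULL if no work anywhere */
}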


Hierarchical Organization (cont’d)

[Figure: hierarchical run queue organizations with transfer ratios Tr = 1 and Tr = 2]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, centralized organization; curves for f = 0% to f = 4%]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed vs. hierarchical organization]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. number of tasks (fixed task size), centralized organization; curves for f = 0% to f = 4%]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. number of tasks (fixed task size), distributed vs. hierarchical organization]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. number of tasks (fixed job size), distributed vs. hierarchical organization]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. service time CV, distributed, hierarchical, and centralized organizations]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. service time CV, distributed and hierarchical organizations at utilizations 0.5 and 0.75]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed and hierarchical organizations, N = 64 and N = 128]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed and hierarchical organizations, N = 64 and N = 128]

Hierarchical Organization (cont’d)

[Figure: Ratio of mean response time vs. utilization for branching factors B = 2 and B = 8 relative to B = 4, at f = 2% and f = 4%]

Hierarchical Organization (cont’d)

[Figure: Ratio of mean response time vs. utilization for transfer ratios Tr = 2 and Tr = 0.5, at f = 2% and f = 4%]

Hierarchical Organization (cont’d)

Adaptive number of tasks, Policy 1
  Moves a number of tasks proportional to the number of tasks queued at the parent
  At a minimum, moves the static amount: Tr * (number of processors below the child queue)

Hierarchical Organization (cont’d)

Adaptive number of tasks, Policy 2
  Moves a number of tasks proportional to the number of tasks queued at the parent
  But keeps this value the same for all children of the parent
  At a minimum, moves the static amount: Tr * (number of processors below the child queue)

(Both policies are sketched below.)
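A hedged sketch of the two adaptive transfer rules as plain functions. The proportionality constant alpha and all names are illustrative assumptions, not the exact formulas behind the results.

/* Sketch: how many tasks a child pulls from its parent under the
   two adaptive policies; alpha is an assumed tuning constant. */

/* Policy 1: proportional to the parent's queue length, with the
   static transfer (Tr * processors below the child) as a floor. */
int tasks_to_move_p1(int parent_qlen, int tr, int procs_below,
                     double alpha) {
    int proportional = (int)(alpha * parent_qlen);
    int floor_static = tr * procs_below;
    return proportional > floor_static ? proportional : floor_static;
}

/* Policy 2: same rule, but the proportional share is computed once
   per parent and kept identical for all of its children. */
int tasks_to_move_p2(int parent_qlen, int n_children, int tr,
                     int procs_below, double alpha) {
    int per_child = (int)(alpha * parent_qlen) / n_children;
    int floor_static = tr * procs_below;
    return per_child > floor_static ? per_child : floor_static;
}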

Hierarchical Organization (cont’d)

[Figure: Ratio of mean response time vs. utilization, adaptive Policy 1 vs. Policy 2]
