Shared-Memory Multiprocessors
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
Carleton University © S. Dandamudi 2
Roadmap
UI Cedar
- Architecture overview
- Operating system primitives
- Multiprocessing primitives
Run queue organization
- Centralized
- Distributed
- Hierarchical organization
UI Cedar Architecture
Shared-memory MIMD system
- Experimental system built at the University of Illinois
- Processors are grouped into clusters
- Uses a hierarchical organization
Three levels of memory hierarchy
- Local memory
- Cluster memory
- Global memory
- Refer to the same physical memory
UI Cedar Architecture (cont’d)
CCU: Cluster Control Unit
UI Cedar Architecture (cont’d)
Local memory
- Local to each processor
- No need to go through any network
Cluster memory
- Processors in a cluster can access this memory
- Access is via the local interconnection network
Global memory
- Any processor can access this memory
- Access is via the global interconnection network
UI Cedar Architecture (cont’d)
Processor cluster (PC)
- Smallest execution unit
- Typically 8 processors
A compound function (a chunk of program)
- Can be assigned to one or more PCs
Each processor contains an FP unit
- No local data registers (unusual)
- Local memory can be used as a large register set
- Local memory can be dynamically partitioned into pseudo-vector registers of different sizes
UI Cedar Architecture (cont’d)
Processor cluster (PC)
- Controlled by the CCU
- The CCU serves as a synchronization unit
  - Starts all processors when the data is moved from global to local memory
  - Signals the GCU when a CF (compound function) is done
Local network
- Either a crossbar or a bus
Global network
- Based on an extension of the Omega network
UI Cedar Architecture (cont’d)
At least 2 paths from every switch (except the last stage)
- Adds redundancy to the original Omega network
- Improves fault tolerance and reduces conflicts
UI Cedar Architecture (cont’d)
Memory system
- Each PC contains eight 16K memory modules
- Memory hierarchy is user transparent
  - CCUs and the GCU move program code from global to local memory in large blocks
  - Transfer time is overlapped with computation
UI Cedar Architecture (cont’d)
Cache system
- Implemented in local memories for global memory accesses
- Not all accesses are cached
  - Only those predetermined by the programmer or compiler
- To avoid cache consistency problems, caches only
  - Read-only data, or
  - Data written by a single processor (i.e., private data)
UI Cedar Architecture (cont’d)
GCU
- Uses macro-dataflow
  - To reduce scheduling and other overheads
  - Treats large structures (arrays) as one object
  - Several operations are combined to reduce scheduling overhead
- Each PC is considered an execution unit
  - Each PC executes a Compound Function (CF)
- Views the program as a directed graph
  - Nodes are CFs
- Large data structures are stored in global memory
  - No structure-copying problem
Synchronization Primitive
Synchronization is supported via a sync variable
- A special data type supported by the hardware
- Consists of two contiguous items in global memory
  - Each item is either 4 bytes (single precision) or 8 bytes (double precision)
- First item: key
  - Always an integer
- Second item: data
  - Unspecified type (integer, floating point, logical, or address)
Synchronization Primitive (cont’d)
Sync expression
    sync(key-relation; key-op; data-op)

key-relation:
    key relop expression
    void
key-op:
    lvalue = key
    key = expression
    lvalue = ++key
    lvalue = --key
    ++key
    --key
    void
data-op:
    lvalue = data
    data = expression
    void
Synchronization Primitive (cont’d)
Sync expression semantics
- The key-relation is evaluated
  - If true, key-op and data-op are done indivisibly
- The result of the sync expression is the value of the key-relation
- If the key-relation is omitted
  - Key-op and data-op are done unconditionally
- When the data-op is missing
  - The key does not have to be the key field of a sync variable
  - It can be any integer
Synchronization Primitive (cont’d)
Sync expression example
while (!sync(lock == 0; ++lock));
/* spin-wait until lock is free */
/* and then set lock */
accum += delta;
lock = 0; /* unlock */
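On hardware without Cedar's sync data type, the same spin-wait-then-update pattern can be sketched with C11 atomics. This is a minimal analog, not Cedar's actual mechanism; `lock`, `accum`, and `delta` mirror the names in the example above:

```c
#include <stdatomic.h>

atomic_int lock = 0;    /* plays the role of the sync variable's key */
double accum = 0.0;

/* Spin until the compare-and-set succeeds, mimicking
   sync(lock == 0; ++lock): test and update happen indivisibly. */
void add_delta(double delta) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock, &expected, 1))
        expected = 0;             /* lock was busy: reset and retry */
    accum += delta;               /* critical section */
    atomic_store(&lock, 0);      /* unlock */
}
```

As with the slide's example, the indivisible test-and-increment is what prevents two processors from both observing `lock == 0` and entering the critical section together.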
Memory Attributes
Three types
- Locality
  - Global
  - Cluster
- Page type
  - Shared
  - Private
- Access privilege
  - Read, write, execute
  - A combination of these
Memory Attributes (cont’d)
Locality attribute
- Specifies where the page should be located in the hierarchy
- Global pages are mapped to physical global memory
- Cluster pages are mapped to cluster memory
- Details of the physical mapping are not visible to a user program
  - Xylem always places a page according to its attribute when a user program references it
Memory Attributes (cont’d)
Page type attribute
- Specifies whether the page is shared or private
- Indicates how a task logically sees the page
  - Private pages belong to a single task
    - Any modifications can be seen only by that task
    - Other tasks do not see these changes
  - Modifications done to a shared page can be seen by other tasks
Multiprocessing Support
Cedar compiler takes FORTRAN source code
- Analyzes it for implicit parallelism
- Generates a control flow graph
User Control Block (UCB)
- Created when the user first logs in
- Multiple logins do not create multiple UCBs (one UCB per user)
Process Control Block (PCB)
- When a process is created (via Unix fork)
  - One PCB and a single task control block (TCB) are created
  - The new task is scheduled
  - This task can create other tasks linked to the same PCB
Multiprocessing Support (cont’d)
Multiprocessing Support (cont’d)
Five primitives are provided
- create_task()
- delete_task()
- start_task()
- end_task()
  - Stops a task
- wait_task()
  - Waits for another task
Multiprocessing Support (cont’d)
create_task()
- Creates a new TCB
  - Attached to the caller's PCB
- Not scheduled for execution
  - Task is in idle state
- Returns an integer to identify the task
- No child-parent relationship
Multiprocessing Support (cont’d)
delete_task(tasknum)
- Deletes the task identified by tasknum
- TCB and associated resources are deallocated
- If the task was executing, it is terminated
- Error if tasknum is unknown
Multiprocessing Support (cont’d)
start_task(tasknum, pc)
- Forces the task identified by tasknum to begin execution at location pc
- Task is marked busy and scheduled for execution
- If the task is already busy, it is interrupted with no way of returning
- Error if tasknum is unknown
Multiprocessing Support (cont’d)
end_task()
- Marks the calling task as idle and stops its execution
- All tasks waiting for this task are unblocked
- Does not deallocate resources allocated to the task
- A task that waits for this one can
  - Delete it,
  - Start it at another location, or
  - Let it remain idle
Multiprocessing Support (cont’d)
wait_task(tasknum)
- Blocks the calling task until the specified task (i.e., tasknum) enters idle state
- A task enters idle state
  - When it is created
  - When it calls end_task
- If the specified task is already in idle state, the calling task continues immediately
- Error if tasknum is unknown
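A rough single-machine analog of these five primitives can be sketched with POSIX threads. The TCB layout, the fixed task table, and the mutex/condition-variable signalling are illustrative assumptions, not Xylem's implementation:

```c
#include <pthread.h>

#define MAX_TASKS 64
enum { IDLE, BUSY };

typedef struct {                  /* simplified task control block */
    int in_use, started;
    int state;                    /* IDLE or BUSY */
    void (*pc)(void);             /* "program counter": entry function */
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t idle_cv;       /* signaled when the task goes idle */
} tcb_t;

static tcb_t tasks[MAX_TASKS];

int create_task(void) {           /* new TCB: idle, not scheduled */
    for (int i = 0; i < MAX_TASKS; i++)
        if (!tasks[i].in_use) {
            tasks[i].in_use = 1;
            tasks[i].started = 0;
            tasks[i].state = IDLE;
            pthread_mutex_init(&tasks[i].lock, NULL);
            pthread_cond_init(&tasks[i].idle_cv, NULL);
            return i;             /* integer identifying the task */
        }
    return -1;
}

static void *trampoline(void *arg) {
    tcb_t *t = arg;
    t->pc();                      /* run from the given "pc" */
    pthread_mutex_lock(&t->lock); /* implicit end_task(): go idle */
    t->state = IDLE;
    pthread_cond_broadcast(&t->idle_cv);  /* unblock all waiters */
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

int start_task(int tasknum, void (*pc)(void)) {
    if (tasknum < 0 || tasknum >= MAX_TASKS || !tasks[tasknum].in_use)
        return -1;                /* error: unknown tasknum */
    tcb_t *t = &tasks[tasknum];
    t->state = BUSY;
    t->pc = pc;
    t->started = 1;
    return pthread_create(&t->thread, NULL, trampoline, t);
}

int wait_task(int tasknum) {      /* block until the task is idle */
    if (tasknum < 0 || tasknum >= MAX_TASKS || !tasks[tasknum].in_use)
        return -1;
    tcb_t *t = &tasks[tasknum];
    pthread_mutex_lock(&t->lock);
    while (t->state != IDLE)      /* created tasks start out idle */
        pthread_cond_wait(&t->idle_cv, &t->lock);
    pthread_mutex_unlock(&t->lock);
    return 0;
}

int delete_task(int tasknum) {    /* free the TCB and its resources */
    if (tasknum < 0 || tasknum >= MAX_TASKS || !tasks[tasknum].in_use)
        return -1;
    tcb_t *t = &tasks[tasknum];
    if (t->started)
        pthread_join(t->thread, NULL);
    t->in_use = 0;
    return 0;
}
```

Note that `wait_task` on a freshly created, never-started task returns immediately, matching the idle-on-creation rule above.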
Example 1 (cont’d)
global shared integer: FLAG, MIDDLE
local private integer: RIGHT

A: <body of node A>
   FLAG = 0
   RIGHT = create_task()
   call start_task(RIGHT, C)
   goto B

B: <body of B>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto D

[Figure: task graph with node A at the top, successors B and C, then D, E, and F, joining at G]
Example 1 (cont’d)
C: <body of C>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto F

D: <body of D>
   goto GD

E: <body of E>
   goto GE

F: <body of F>
   goto GF
Example 1 (cont’d)
GE:
GF:
   call end_task()

GD:
   call wait_task(RIGHT)
   call wait_task(MIDDLE)
   call delete_task(RIGHT)
   call delete_task(MIDDLE)
Example 2 (cont’d)
      DO 101 I = 1,N
      DO 101 J = 1,210
  101 A(I,J) = B(I,J) + C(J)

      DO 102 I = 1,10000
      F(I) = ABS(F(I))
  102 IF (G(I) .LT. 0) F(I) = -F(I)
[Figure: task graph with node A forking into B (DOALL 101) and C (DOALL 102), which join at D]
Example 2 (cont’d)
local private integer T

A: T = create_task()
   call start_task(T, C)
   goto B

B: doall 101
   goto DB

C: doall 102
   goto DC

DB: call wait_task(T)
    call delete_task(T)
    goto next_node

DC: call end_task()
Example 2 (cont’d)
C: N = 10
   local private integer tasknum(N), T, J
   global shared integer I

   I = 0
   do J = 1, N
      T = create_task()
      tasknum(J) = T
      call start_task(T, CC)
   enddo
   do J = 1, N
      T = tasknum(J)
      call wait_task(T)
      call delete_task(T)
   enddo

CC: local private integer J, K

    dowhile (SYNC(I < 100; J = ++I))
       J = J*100 - 99
       do K = J, J+99
          F(K) = abs(F(K))
          if (G(K) .LT. 0) F(K) = -F(K)
       enddo
    endwhile
    call end_task()
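The CC routine above uses the sync variable to hand out 100-iteration chunks to whichever task asks next. A C analog of that self-scheduling loop, with an atomic counter standing in for the shared I (array sizes and names are illustrative):

```c
#include <stdatomic.h>
#include <math.h>

#define CHUNKS 100
#define CHUNK  100

atomic_int next_chunk = 0;      /* plays the role of shared I */
double F[CHUNKS * CHUNK];
double G[CHUNKS * CHUNK];

/* Each worker repeatedly claims the next 100-iteration chunk, like
   SYNC(I < 100; J = ++I) in the CC routine: the test and increment
   are one indivisible operation, so no chunk is handed out twice. */
void worker(void) {
    int j;
    while ((j = atomic_fetch_add(&next_chunk, 1)) < CHUNKS) {
        int base = j * CHUNK;   /* FORTRAN's J*100 - 99, zero-based */
        for (int k = base; k < base + CHUNK; k++) {
            F[k] = fabs(F[k]);
            if (G[k] < 0) F[k] = -F[k];
        }
    }
}
```

Running `worker` from several threads divides the 10000 iterations among them without any central scheduler, which is the point of the slide's scheme.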
Run Queue Organization
Run queue organizations
- Centralized
  - A single global queue
- Distributed
  - Local queues
- Hybrid
  - Multiple queues
  - Hierarchical organization
Run Queue Organization (cont’d)
Centralized organization
- A single global queue
- Tasks are accessible to all processors
- Mutually exclusive access to the global queue is required
- Can lead to queue access contention for a large number of processors
- Good for small systems
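The mutually exclusive access requirement can be sketched as a single mutex-protected queue; every processor passes through the one lock, which is exactly the contention point noted above. The circular-buffer layout and all names here are illustrative:

```c
#include <pthread.h>

#define QSIZE 1024

typedef struct {
    int tasks[QSIZE];          /* task identifiers */
    int head, tail, count;
    pthread_mutex_t lock;      /* the serialization point every
                                  processor contends on */
} run_queue_t;

static run_queue_t global_q = { .lock = PTHREAD_MUTEX_INITIALIZER };

void enqueue(run_queue_t *q, int task) {
    pthread_mutex_lock(&q->lock);
    q->tasks[q->tail] = task;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_mutex_unlock(&q->lock);
}

int dequeue(run_queue_t *q) {  /* returns -1 if the queue is empty */
    pthread_mutex_lock(&q->lock);
    int task = -1;
    if (q->count > 0) {
        task = q->tasks[q->head];
        q->head = (q->head + 1) % QSIZE;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return task;
}
```

With many processors, time spent waiting on `q->lock` grows with the dispatch rate, which is the access-contention effect the performance comparison below quantifies as f.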
Run Queue Organization (cont’d)
Distributed organization
- A local queue at each processor
- Tasks are accessible only to the associated processor
- Needs a task placement policy
- Excellent scalability
  - Good for large systems
- Load balancing is a problem
Run Queue Organization (cont’d)
Performance comparison
- Run queue access time is not negligible
- # of processors = 64
- Average # of tasks/job = 64 (exponentially distributed)
- Average task service time = 1 time unit (exponentially distributed)
- Run queue access time f = 0% to 4% of task service time
Run Queue Organization (cont’d)
[Plot: mean response time vs. utilization for the centralized organization, with curves for f = 0%, 1%, 2%, 3%, and 4%]
Run Queue Organization (cont’d)
[Plot: mean response time vs. utilization for the distributed organization, with curves for f = 4% and f = 0%]
Run Queue Organization (cont’d)
[Plot: mean response time vs. service time CV, comparing the distributed and centralized organizations]
Improving Performance
Centralized organization
- Need to minimize access contention
- Autonomous policy (Nelson & Squillante)
  - Every access brings a set of tasks
  - Reduces the number of accesses to the central queue
  - Potential problems
    - Load imbalance
    - Optimal set size depends on the system load
    - Large service time CV can cause performance deterioration
Improving Performance (cont’d)
Cooperative policy (Nelson & Squillante)
- Every access brings tasks for other processors as well
  - Moves tasks from the central queue to other processors' local queues
  - Uses the "join the shortest queue" policy
- Improves load balancing
- Performs better than the Autonomous policy and the distributed organization
- Potential problems
  - Difficult to implement for large systems
  - The scheduler needs to maintain state information on other processors (their local queue lengths)
Improving Performance (cont’d)
Distributed organization
- We have to address the load imbalance problem
- Oblivious placement policies
  - Random
  - Round robin (cyclic)
- Adaptive placement policies
  - Shortest queue
  - Shortest response time (SRT) queue
Improving Performance (cont’d)
[Plot: mean response time vs. utilization for the Random, Round robin, Shortest queue, and SRT queue placement policies]
Improving Performance (cont’d)
[Plot: mean response time vs. service time CV for the Random, Round robin, Shortest queue, and SRT queue placement policies]
Improving Performance (cont’d)
Implementation problems with adaptive policies
- System state overhead
  - Both shortest queue and SRT queue policies need system state information
  - To reduce this overhead, state information is collected from only a subset of P (P < N) processors
    - P = # of probes to collect state information
  - If P is small, we are successful in reducing the overhead
  - In practice, a small number of probes is sufficient
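Probe-limited placement can be sketched as follows: instead of examining all N queues, the scheduler samples P of them at random and picks the shortest. The queue-length array is an illustrative stand-in for the collected state information:

```c
#include <stdlib.h>

/* Pick a destination processor by probing n_probes randomly chosen
   processors and choosing the one with the shortest run queue,
   instead of examining all n_procs queues. */
int place_task(const int queue_len[], int n_procs, int n_probes) {
    int best = rand() % n_procs;          /* first probe */
    for (int i = 1; i < n_probes; i++) {
        int p = rand() % n_procs;
        if (queue_len[p] < queue_len[best])
            best = p;
    }
    return best;
}
```

The overhead is proportional to `n_probes` rather than `n_procs`, which is why a small probe count already captures most of the benefit shown in the next plot.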
Improving Performance (cont’d)
[Plot: mean response time vs. number of probes (1 to 10) for the Shortest queue and SRT queue policies]
Improving Performance (cont’d)
A problem with the SRT queue policy
- Needs a priori knowledge of execution times
- Often we may get only an estimate
  - Subject to estimation errors
ESRT queue policy
- Uses an estimate that is within X% of the actual service time
- In the experiments we used 30%
SRT queue policy
- Assumes the exact service time is known beforehand
Improving Performance (cont’d)
[Plot: mean response time vs. service time CV for the Shortest queue, SRT queue, and ESRT queue policies]
Hierarchical Organization
Goal is to have the best of both organizations
- Avoids bottleneck problems, like the distributed organization
- Good load sharing, as in the centralized organization
- Should be self-scheduling
  - No state information collection
Hierarchical organization provides all these desired features
- Performs close to the centralized organization but scales well like the distributed organization
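The self-scheduling transfer can be sketched as a queue tree in which a node that runs dry pulls a batch of tasks from its parent, sized by Tr times the number of processors below it. The structures here are illustrative assumptions, not the simulated system's actual design:

```c
#include <stddef.h>

typedef struct node {
    struct node *parent;
    int tasks[256];             /* task identifiers queued here */
    int count;
    int procs_below;            /* # processors in this node's subtree */
} node_t;

/* Pull a batch of tasks from the parent when this queue is empty.
   Batch size = Tr * procs_below (the static transfer rule); no
   global state is consulted, which makes the scheme self-scheduling. */
int refill(node_t *n, double Tr) {
    if (n->count > 0 || n->parent == NULL)
        return n->count;        /* still has work, or is the root */
    int want = (int)(Tr * n->procs_below);
    if (want < 1) want = 1;
    int take = n->parent->count < want ? n->parent->count : want;
    for (int i = 0; i < take; i++)
        n->tasks[n->count++] = n->parent->tasks[--n->parent->count];
    return n->count;
}
```

Each processor only ever touches its own queue and, occasionally, its parent's, so contention stays local to small groups instead of concentrating on one global lock.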
Hierarchical Organization (cont’d)
Hierarchical Organization (cont’d)
[Figure: task transfers in the hierarchical organization for transfer factors Tr = 1 and Tr = 2]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. utilization for the centralized organization, with curves for f = 0% to 4%]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. utilization for the distributed and hierarchical organizations]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. number of tasks for the centralized organization (fixed task size), with curves for f = 0% to 4%]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. number of tasks for the distributed and hierarchical organizations (fixed task size)]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. number of tasks for the distributed and hierarchical organizations (fixed job size)]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. service time CV for the distributed, hierarchical, and centralized organizations]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. service time CV for the distributed and hierarchical organizations at utilizations 0.5 and 0.75]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. utilization for the distributed and hierarchical organizations with N = 64 and N = 128]
Hierarchical Organization (cont’d)
[Plot: a second comparison of mean response time vs. utilization for the distributed and hierarchical organizations, N = 64 and N = 128]
Hierarchical Organization (cont’d)
[Plot: ratio of mean response times vs. utilization for branching factors B = 2 and B = 8 relative to B = 4, at f = 2% and f = 4%]
Hierarchical Organization (cont’d)
[Plot: ratio of mean response times vs. utilization for Tr = 2 and Tr = 0.5, at f = 2% and f = 4%]
Hierarchical Organization (cont’d)
Adaptive number of tasks: Policy 1
- Moves a number of tasks proportional to the number of tasks queued at the parent
- At least as many as in the static policy: Tr * (# processors below the child queue)
Hierarchical Organization (cont’d)
Adaptive number of tasks: Policy 2
- Moves a number of tasks proportional to the number of tasks queued at the parent
- But keeps this value the same for all children of the parent
- At least as many as in the static policy: Tr * (# processors below the child queue)
Hierarchical Organization (cont’d)
[Plot: ratio of mean response times vs. utilization for Policy 1 and Policy 2]
Last slide