
Shared-Memory Multiprocessors

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University


Roadmap

UI Cedar
  Architecture overview
  Operating system primitives
  Multiprocessing primitives

Run queue organization
  Centralized
  Distributed
  Hierarchical organization

UI Cedar Architecture

Shared-memory MIMD system
Experimental system built at the University of Illinois

Processors are grouped into clusters
Uses a hierarchical organization
Three levels of memory hierarchy:

  Local memory
  Cluster memory
  Global memory

These refer to the same physical memory

UI Cedar Architecture (cont’d)

[Figure: Cedar system organization] (CCU: Cluster Control Unit)

UI Cedar Architecture (cont’d)

Local memory: local to each processor; no need to go through any network

Cluster memory: processors in a cluster can access this memory via the local interconnection network

Global memory: any processor can access this memory via the global interconnection network

UI Cedar Architecture (cont’d)

Processor cluster (PC): the smallest execution unit

Typically 8 processors

A compound function (CF, a chunk of the program) can be assigned to one or more PCs

Each processor contains an FP unit
No local data registers (unusual)
Local memory can be used as a large register set: it can be dynamically partitioned into pseudo-vector registers of different sizes
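A hypothetical sketch of the pseudo-vector-register idea: a bump allocator that carves one processor's local memory into vector registers of different sizes. The memory size, names, and allocator are illustrative assumptions, not Cedar's actual mechanism.

/* Hypothetical sketch: partitioning local memory into pseudo-vector
   registers of different sizes. Illustrative only, not Cedar's API. */
#include <stddef.h>
#include <stdio.h>

#define LOCAL_MEM_WORDS 4096              /* assumed local-memory size */

static double local_mem[LOCAL_MEM_WORDS]; /* one processor's local memory */
static size_t next_free = 0;

/* Allocate a pseudo-vector register of len words; NULL if out of space. */
static double *alloc_vreg(size_t len) {
    if (next_free + len > LOCAL_MEM_WORDS) return NULL;
    double *reg = &local_mem[next_free];
    next_free += len;
    return reg;
}

int main(void) {
    double *v32  = alloc_vreg(32);        /* a short vector register */
    double *v512 = alloc_vreg(512);       /* a long vector register */
    printf("v32 at %p, v512 at %p\n", (void *)v32, (void *)v512);
    return 0;
}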

UI Cedar Architecture (cont’d)

Processor cluster (PC) is controlled by the CCU
The CCU serves as a synchronization unit:

  Starts all processors when the data is moved from global to local memory
  Signals the GCU (Global Control Unit) when a CF is done

Local network: either a crossbar or a bus

Global network: based on an extension of the Omega network

UI Cedar Architecture (cont’d)

At least two paths from every switch (except the last stage)

Adds redundancy to the original Omega network

Improves fault tolerance and reduces conflicts
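To make the routing concrete, here is a hedged sketch of destination-tag routing through a plain Omega network, the baseline that Cedar extends; the redundant paths described above are not modeled, and the 8-port size and names are illustrative.

/* Sketch: destination-tag routing through a standard Omega network of
   log2(N) shuffle-exchange stages. Cedar's extra paths are not modeled. */
#include <stdio.h>

#define STAGES 3                   /* N = 8 inputs/outputs */

/* Print the switch output (0 = upper, 1 = lower) taken at each stage
   when routing from src to dst. */
void route(unsigned src, unsigned dst) {
    unsigned addr = src;
    printf("route %u -> %u:", src, dst);
    for (int s = 0; s < STAGES; s++) {
        /* Perfect shuffle: rotate the address left by one bit. */
        addr = ((addr << 1) | (addr >> (STAGES - 1))) & ((1u << STAGES) - 1);
        /* Stage s uses destination bit (STAGES-1-s); that bit selects
           the switch output and becomes the low bit of the address. */
        unsigned bit = (dst >> (STAGES - 1 - s)) & 1u;
        addr = (addr & ~1u) | bit;
        printf(" %u", bit);
    }
    printf(" (arrives at %u)\n", addr);
}

int main(void) {
    route(2, 5);
    route(0, 7);
    return 0;
}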

UI Cedar Architecture (cont’d)

Memory system

Each PC contains eight 16K memory modules

Memory hierarchy is user transparent
CCUs and the GCU move program code from global to local memory in large blocks

Transfer time is overlapped with computation

UI Cedar Architecture (cont’d)

Cache system is implemented in local memories for global memory accesses
Not all accesses are cached; only those predetermined by the programmer or compiler

To avoid cache consistency problems, only the following are cached:
  Read-only data, or
  Data written by a single processor (i.e., private data)

UI Cedar Architecture (cont’d)

GCU uses macro-dataflow:
  Reduces scheduling and other overheads
  Considers large structures (arrays) as one object
  Several operations are combined to reduce scheduling overhead

Each PC is considered an execution unit; each PC executes a Compound Function (CF)

Views the program as a directed graph whose nodes are CFs

Large data structures are stored in global memory, so there is no structure-copying problem

Synchronization Primitive

Synchronization is supported via a sync variable
It is a special data type supported by the hardware
Consists of two contiguous items in global memory; each item is either 4 bytes (single precision) or 8 bytes (double precision)

  First item (key): always an integer
  Second item (data): unspecified type (integer, floating point, logical, or address)

Synchronization Primitive (cont’d)

Sync expression:

  sync(key-relation; key-op; data-op)

  key-relation:  key relop expression | void
  key-op:        lvalue = key | key = expression | lvalue = ++key |
                 lvalue = --key | ++key | --key | void
  data-op:       lvalue = data | data = expression | void

Synchronization Primitive (cont’d)

Sync expression semantics:
  The key-relation is evaluated
  If true, key-op and data-op are done indivisibly
  The result of the sync expression is the value of the key-relation

If the key-relation is omitted, key-op and data-op are done unconditionally

When the data-op is missing, key does not have to be the key field of a sync variable; it can be any integer

Synchronization Primitive (cont’d)

Sync expression example

while (!sync(lock == 0; ++lock));

/* spin-wait until lock is free */

/* and then set lock */

accum += delta;

lock = 0; /* unlock */
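For readers more used to conventional primitives, here is a minimal sketch of how the same spin-lock pattern could be emulated with C11 atomics. This is an analogy, not Cedar's hardware mechanism: the sync variable evaluates the key-relation and key-op indivisibly, which a compare-exchange approximates. The names lock, accum, and delta come from the example above.

/* Sketch: emulating sync(lock == 0; ++lock) with C11 atomics. */
#include <stdatomic.h>

atomic_int lock = 0;                    /* plays the role of the key */
double accum = 0.0;

void add_delta(double delta) {
    int expected = 0;
    /* Test the key-relation and update the key in one indivisible step. */
    while (!atomic_compare_exchange_weak(&lock, &expected, 1))
        expected = 0;                   /* CAS failed: reset and retry */
    accum += delta;                     /* critical section */
    atomic_store(&lock, 0);             /* lock = 0  (unlock) */
}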


Memory Attributes

Three types:

Locality
  Global
  Cluster

Page type
  Shared
  Private

Access privilege
  Read, write, execute, or a combination of these

Memory Attributes (cont’d)

Locality attribute
  Specifies where the page should be located in the hierarchy
  Global pages are mapped to physical global memory
  Cluster pages are mapped to cluster memory
  Details of the physical mapping are not visible to a user program

Xylem always places a page according to its attribute when a user program references it

Memory Attributes (cont’d)

Page type attribute
  Specifies whether the page is shared or private
  Indicates how a task logically sees the page

Private pages belong to a single task
  Any modifications can be seen only by that task; other tasks do not see these changes

Modifications done to a shared page can be seen by other tasks

Multiprocessing Support

Cedar compiler takes FORTRAN source code, analyzes it for implicit parallelism, and generates a control flow graph

User Control Block (UCB)
  Created when the user first logs in
  Multiple logins do not create multiple UCBs (one UCB per user)

Process Control Block (PCB)
  When a process is created (via Unix fork), one PCB and a single task control block (TCB) are created
  The new task is scheduled; it can create other tasks linked to the same PCB


Multiprocessing Support (cont’d)

Five primitives are provided (a threads-based emulation is sketched below):

  create_task()
  delete_task()
  start_task()
  end_task(): stop the task
  wait_task(): wait for another task
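The five primitives map loosely onto ordinary threading APIs. Below is a hedged sketch of one way to emulate them with POSIX threads; the TCB layout, the idle/busy flags, and everything besides the five primitive names are illustrative assumptions, not Xylem's implementation. Two simplifications: end_task() is modeled by the task's entry function returning, and this sketch cannot interrupt a busy task the way the real start_task can.

/* Hedged pthreads emulation of the five task primitives. */
#include <pthread.h>

#define MAX_TASKS 64
typedef void (*entry_t)(void);

typedef struct {                 /* emulated TCB */
    pthread_t thread;
    pthread_mutex_t m;
    pthread_cond_t cv;
    entry_t pc;                  /* where start_task says to begin */
    int busy, in_use;
} tcb_t;

static tcb_t tasks[MAX_TASKS];

static void *trampoline(void *arg) {
    tcb_t *t = arg;
    pthread_mutex_lock(&t->m);
    while (t->in_use) {
        while (t->in_use && !t->busy)        /* idle state */
            pthread_cond_wait(&t->cv, &t->m);
        if (!t->in_use) break;
        entry_t pc = t->pc;
        pthread_mutex_unlock(&t->m);
        pc();                                /* returning models end_task() */
        pthread_mutex_lock(&t->m);
        t->busy = 0;                         /* back to idle */
        pthread_cond_broadcast(&t->cv);      /* unblock wait_task callers */
    }
    pthread_mutex_unlock(&t->m);
    return 0;
}

int create_task(void) {                      /* new TCB: idle, not scheduled */
    for (int i = 0; i < MAX_TASKS; i++)
        if (!tasks[i].in_use) {
            tcb_t *t = &tasks[i];
            pthread_mutex_init(&t->m, 0);
            pthread_cond_init(&t->cv, 0);
            t->busy = 0;
            t->in_use = 1;
            pthread_create(&t->thread, 0, trampoline, t);
            return i;                        /* integer task identifier */
        }
    return -1;
}

void start_task(int n, entry_t pc) {         /* mark busy, run at pc */
    tcb_t *t = &tasks[n];
    pthread_mutex_lock(&t->m);
    t->pc = pc;
    t->busy = 1;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->m);
}

void wait_task(int n) {                      /* block until task is idle */
    tcb_t *t = &tasks[n];
    pthread_mutex_lock(&t->m);
    while (t->busy)
        pthread_cond_wait(&t->cv, &t->m);
    pthread_mutex_unlock(&t->m);
}

void delete_task(int n) {                    /* deallocate TCB resources */
    tcb_t *t = &tasks[n];
    pthread_mutex_lock(&t->m);
    t->in_use = 0;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->m);
    pthread_join(t->thread, 0);
}

Usage mirrors Example 1 below: RIGHT = create_task(); start_task(RIGHT, C); ... wait_task(RIGHT); delete_task(RIGHT).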

Multiprocessing Support (cont’d)

create_task()

Creates a new TCB, attached to the caller’s PCB

Not scheduled for execution

Task is in idle state

Returns an integer to identify the task

No child-parent relationship


Multiprocessing Support (cont’d)

delete_task(tasknum)

Deletes the task identified by tasknum

TCB and associated resources are deallocated

If the task was executing, it is terminated

Error if tasknum is unknown


Multiprocessing Support (cont’d)

start_task(tasknum, pc)

Forces the task identified by tasknum to begin execution at location pc

Task is marked busy and scheduled for execution

If the task is already busy, it is interrupted with no way of returning

Error if tasknum is unknown

Multiprocessing Support (cont’d)

end_task()

Marks the calling task as idle and stops its execution

All tasks waiting for this task are unblocked

It does not deallocate resources allocated to the task

A task that waits for this one can delete it, start it at another location, or let it remain idle

Multiprocessing Support (cont’d)

wait_task(tasknum)

Blocks the calling task until the specified task (i.e., tasknum) enters the idle state

A task enters the idle state when it is created and when it calls end_task

If the specified task is already in the idle state, the calling task continues immediately

Error if tasknum is unknown

Example 1

global shared integer: FLAG, MIDDLE
local private integer: RIGHT

A: <body of node A>
   FLAG = 0
   RIGHT = create_task()
   call start_task(RIGHT, C)
   goto B

B: <body of B>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto D

[Task graph: A branches to B and C; B leads to D, C leads to F; whichever of B and C finishes second spawns E; D, E, and F join at G]

Example 1 (cont’d)

C: <body of C>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto F

D: <body of D>
   goto GD

E: <body of E>
   goto GE

F: <body of F>
   goto GF


Example 1 (cont’d)

GE:
GF: call end_task()

GD: call wait_task(RIGHT)
    call wait_task(MIDDLE)
    call delete_task(RIGHT)
    call delete_task(MIDDLE)


Example 2

      DO 101 I = 1,N
      DO 101 J = 1,210
101   A(I,J) = B(I,J) + C(J)

      DO 102 I = 1,10000
      F(I) = ABS(F(I))
102   IF (G(I) .LT. 0) F(I) = -F(I)

[Task graph: A branches to B (DOALL 101) and C (DOALL 102), which join at D]

Example 2 (cont’d)

local private integer T

A: T = create_task()
   call start_task(T, C)
   goto B

B: doall 101
   goto DB

C: doall 102
   goto DC

DB: call wait_task(T)
    call delete_task(T)
    goto next_node

DC: call end_task()

Example 2 (cont’d)

C:  N = 10
    local private integer tasknum(N), T, J
    global shared integer I

    I = 0
    do J = 1, N
       T = create_task()
       tasknum(J) = T
       call start_task(T, CC)
    enddo

    do J = 1, N
       T = tasknum(J)
       call wait_task(T)
       call delete_task(T)
    enddo

CC: local private integer J, K

    dowhile (SYNC(I < 100; J = ++I))
       J = J*100 - 99
       do K = J, J+99
          F(K) = abs(F(K))
          if (G(K) .LT. 0) F(K) = -F(K)
       enddo
    endwhile
    call end_task()

Run Queue Organization

Run queue organizations:

Centralized: a single global queue

Distributed: local queues

Hybrid: multiple queues (e.g., hierarchical organization)

Run Queue Organization (cont’d)

Centralized organization
  A single global queue
  Tasks are accessible to all processors
  Mutually exclusive access to the global queue is required (sketched below)
  Can lead to queue-access contention for a large number of processors
  Good for small systems
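A minimal sketch of the contention point: every dequeue takes one global lock, so all processors serialize on it. The queue layout and names are illustrative assumptions, not a real scheduler.

/* Sketch: a centralized run queue with a single lock. */
#include <pthread.h>
#include <stddef.h>

typedef struct task { struct task *next; } task_t;

typedef struct {
    pthread_mutex_t lock;        /* the single point of contention */
    task_t *head;
} run_queue;

task_t *dequeue(run_queue *q) {  /* every processor calls this */
    pthread_mutex_lock(&q->lock);
    task_t *t = q->head;
    if (t) q->head = t->next;
    pthread_mutex_unlock(&q->lock);
    return t;                    /* NULL when the queue is empty */
}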

Run Queue Organization (cont’d)

Distributed organization
  A local queue at each processor
  Tasks are accessible only to the associated processor
  Needs a task placement policy
  Excellent scalability; good for large systems
  Load balancing is a problem

Run Queue Organization (cont’d)

Performance comparison:
  Run queue access time is not negligible
  Number of processors = 64
  Average number of tasks per job = 64 (exponentially distributed)
  Average task service time = 1 time unit (exponentially distributed)
  Run queue access time f = 0% to 4% of the task service time

Run Queue Organization (cont’d)

[Figure: Mean response time vs. utilization, centralized organization; curves for f = 0%, 1%, 2%, 3%, 4%]

Run Queue Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed organization; curves for f = 0% and f = 4%]

Run Queue Organization (cont’d)

[Figure: Mean response time vs. service time CV, distributed vs. centralized organization]

Improving Performance

Centralized organization: need to minimize access contention

Autonomous policy (Nelson & Squillante)
  Every access brings a set of tasks, reducing the number of accesses to the central queue (sketched below)
  Potential problems:
    Load imbalance
    Optimal set size depends on the system load
    Large service time CV can cause performance deterioration
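A hedged sketch of the autonomous policy's core idea, mirroring the illustrative run_queue above: one lock acquisition fetches up to batch tasks instead of one. The structures are assumptions; picking batch is exactly the set-size tuning problem noted above.

/* Sketch of the autonomous policy: amortize lock traffic by taking
   a batch of tasks per central-queue access. */
#include <pthread.h>
#include <stddef.h>

typedef struct task { struct task *next; } task_t;
typedef struct { pthread_mutex_t lock; task_t *head; } run_queue;

size_t dequeue_batch(run_queue *q, task_t **out, size_t batch) {
    size_t n = 0;
    pthread_mutex_lock(&q->lock);
    while (n < batch && q->head) {   /* grab up to batch tasks */
        out[n++] = q->head;
        q->head = q->head->next;
    }
    pthread_mutex_unlock(&q->lock);
    return n;                        /* tasks now held locally */
}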

Improving Performance (cont’d)

Cooperative policy (Nelson & Squillante)
  Every access also brings tasks for other processors
  Moves tasks from the central queue to other processors’ local queues
  Uses a “join the shortest queue” policy
  Improves load balancing
  Performs better than the autonomous policy and the distributed organization
  Potential problems:
    Difficult to implement for large systems
    The scheduler needs to maintain state information on other processors (their local queue lengths)

Improving Performance (cont’d)

Distributed organization: we have to address the load imbalance problem

Oblivious placement policies:
  Random
  Round robin (cyclic)

Adaptive placement policies (a probe-based sketch appears with the implementation discussion below):
  Shortest queue
  Shortest response time (SRT) queue

Improving Performance (cont’d)

[Figure: Mean response time vs. utilization for random, round robin, shortest queue, and SRT queue placement]

Improving Performance (cont’d)

[Figure: Mean response time vs. service time CV for random, round robin, shortest queue, and SRT queue placement]

Improving Performance (cont’d)

Implementation problems with adaptive policies: system state overhead

Both shortest queue and SRT queue policies need system state information

To reduce this overhead, state information is collected from only a subset of P (P < N) processors
P = number of probes used to collect state information

If P is small, we succeed in reducing the overhead; in practice, a small number of probes is sufficient, as the sketch below illustrates
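A hedged sketch of probe-limited shortest-queue placement: sample P random processors and place the task on the least loaded of them. The stubbed queue lengths, the RNG, and all names are illustrative assumptions.

/* Sketch: shortest-queue placement with P probes instead of a full
   scan of all N local queues. */
#include <stdlib.h>

#define N 64                         /* processors, as in the comparison */

static int qlen[N];                  /* stub: per-processor queue lengths */
static int queue_len(int p) { return qlen[p]; }

int place_task(int probes) {
    int best = rand() % N;           /* first probe */
    int best_len = queue_len(best);
    for (int i = 1; i < probes; i++) {
        int p = rand() % N;
        if (queue_len(p) < best_len) {
            best = p;
            best_len = queue_len(p);
        }
    }
    return best;                     /* processor to enqueue the task on */
}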

Improving Performance (cont’d)

[Figure: Mean response time vs. number of probes (1 to 10) for shortest queue and SRT queue placement]

Improving Performance (cont’d)

A problem with the SRT queue policy: it needs a priori knowledge of execution times
Often we may get only an estimate, subject to estimation errors

ESRT queue policy
  Uses an estimate that is within X% of the actual service time (the experiments used 30%)

SRT queue policy
  Assumes the exact service time is known beforehand

Improving Performance (cont’d)

[Figure: Mean response time vs. service time CV for shortest queue, SRT queue, and ESRT queue placement]

Hierarchical Organization

Goal is to get the best of both organizations:
  Avoid bottleneck problems, as in the distributed organization
  Good load sharing, as in the centralized organization
  Should be self-scheduling, with no state-information collection

Hierarchical organization provides all these desired features: it performs close to the centralized organization but scales well like the distributed organization (a sketch follows)
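A hedged sketch of the hierarchical idea: run queues form a tree, tasks enter at the root, and an empty queue pulls a batch down from its parent, so no global state is consulted. The tree layout, the depth bound, and the transfer rule are illustrative assumptions; the transfer ratio Tr is the parameter varied on the following slides.

/* Sketch: a tree of run queues; each level refills from its parent,
   so load spreads without global state collection (self-scheduling).
   Locking is omitted for brevity. */
#include <stddef.h>

#define MAX_DEPTH 8                 /* assumed tree-depth bound */

typedef struct task { struct task *next; } task_t;

typedef struct node {
    struct node *parent;            /* NULL at the root */
    task_t *head;                   /* this level's queue */
    int procs_below;                /* processors under this node */
} node_t;

/* Move up to Tr * procs_below tasks from the parent down to n. */
static void pull_from_parent(node_t *n, int tr) {
    if (!n->parent) return;
    int want = tr * n->procs_below;
    while (want-- > 0 && n->parent->head) {
        task_t *t = n->parent->head;
        n->parent->head = t->next;
        t->next = n->head;
        n->head = t;
    }
}

/* A processor self-schedules: refill empty queues along the path
   from the root, then take one task from its local queue. */
task_t *get_task(node_t *leaf, int tr) {
    node_t *path[MAX_DEPTH];
    int depth = 0;
    for (node_t *n = leaf; n; n = n->parent)
        path[depth++] = n;
    for (int i = depth - 1; i >= 0; i--)   /* root toward leaf */
        if (!path[i]->head)
            pull_from_parent(path[i], tr);
    task_t *t = leaf->head;
    if (t) leaf->head = t->next;
    return t;                              /* NULL if no work anywhere */
}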


Hierarchical Organization (cont’d)

[Figure: hierarchical run queue organizations with transfer ratios Tr = 1 and Tr = 2]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, centralized organization; curves for f = 0% to f = 4%]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed vs. hierarchical organization]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. number of tasks (fixed task size), centralized organization; curves for f = 0% to f = 4%]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. number of tasks (fixed task size), distributed vs. hierarchical organization]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. number of tasks (fixed job size), distributed vs. hierarchical organization]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. service time CV, distributed, hierarchical, and centralized organizations]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. service time CV, distributed and hierarchical organizations at utilizations 0.5 and 0.75]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed and hierarchical organizations, N = 64 and N = 128]

Hierarchical Organization (cont’d)

[Figure: Mean response time vs. utilization, distributed and hierarchical organizations, N = 64 and N = 128]

Hierarchical Organization (cont’d)

[Figure: Ratio of mean response time vs. utilization for branching factors B = 2 and B = 8 relative to B = 4, at f = 2% and f = 4%]

Hierarchical Organization (cont’d)

[Figure: Ratio of mean response time vs. utilization for transfer ratios Tr = 2 and Tr = 0.5, at f = 2% and f = 4%]

Hierarchical Organization (cont’d)

Adaptive number of tasks, Policy 1
  Moves a number of tasks proportional to the number of tasks queued at the parent
  At a minimum, moves the static amount: Tr * (number of processors below the child queue)

Hierarchical Organization (cont’d)

Adaptive number of tasks, Policy 2
  Moves a number of tasks proportional to the number of tasks queued at the parent
  But keeps this value the same for all children of the parent
  At a minimum, moves the static amount: Tr * (number of processors below the child queue)

(Both policies are sketched below.)
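A hedged sketch of the two adaptive transfer rules as plain functions. The proportionality constant alpha and all names are illustrative assumptions, not the exact formulas behind the results.

/* Sketch: how many tasks a child pulls from its parent under the
   two adaptive policies; alpha is an assumed tuning constant. */

/* Policy 1: proportional to the parent's queue length, with the
   static transfer (Tr * processors below the child) as a floor. */
int tasks_to_move_p1(int parent_qlen, int tr, int procs_below,
                     double alpha) {
    int proportional = (int)(alpha * parent_qlen);
    int floor_static = tr * procs_below;
    return proportional > floor_static ? proportional : floor_static;
}

/* Policy 2: same rule, but the proportional share is computed once
   per parent and kept identical for all of its children. */
int tasks_to_move_p2(int parent_qlen, int n_children, int tr,
                     int procs_below, double alpha) {
    int per_child = (int)(alpha * parent_qlen) / n_children;
    int floor_static = tr * procs_below;
    return per_child > floor_static ? per_child : floor_static;
}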

Hierarchical Organization (cont’d)

[Figure: Ratio of mean response time vs. utilization, adaptive Policy 1 vs. Policy 2]
