Shared-Memory Multiprocessors
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
Carleton University © S. Dandamudi 2
Roadmap
UI Cedar
- Architecture overview
- Operating system primitives
- Multiprocessing primitives
Run queue organization
- Centralized
- Distributed
- Hierarchical organization
UI Cedar Architecture
Shared-memory MIMD system
- Experimental system built at the University of Illinois
- Processors are grouped into clusters
- Uses a hierarchical organization
Three levels of memory hierarchy
- Local memory
- Cluster memory
- Global memory
- Refer to the same physical memory
UI Cedar Architecture (cont’d)
CCU: Cluster Control Unit
UI Cedar Architecture (cont’d)
Local memory
- Local to each processor
- No need to go through any network
Cluster memory
- Processors in a cluster can access this memory
- Access is via the local interconnection network
Global memory
- Any processor can access this memory
- Access is via the global interconnection network
UI Cedar Architecture (cont’d)
Processor cluster (PC)
- Smallest execution unit
- Typically 8 processors
A compound function (a chunk of program)
- Can be assigned to one or more PCs
Each processor contains an FP unit
- No local data registers (unusual)
- Local memory can be used as a large register set
- Local memory can be dynamically partitioned into pseudo-vector registers of different sizes
UI Cedar Architecture (cont’d)
Processor cluster (PC)
- Controlled by the CCU
- The CCU serves as a synchronization unit
  - Starts all processors when the data is moved from global to local memory
  - Signals the GCU when a CF (compound function) is done
Local network
- Either a crossbar or a bus
Global network
- Based on an extension of the Omega network
UI Cedar Architecture (cont’d)
At least 2 paths from every switch (except the last stage)
- Adds redundancy to the original Omega network
- Improves fault tolerance and reduces conflicts
UI Cedar Architecture (cont’d)
Memory system
- Each PC contains eight 16K memory modules
- Memory hierarchy is user transparent
  - CCUs and the GCU move program code from global to local memory in large blocks
  - Transfer time is overlapped with computation
UI Cedar Architecture (cont’d)
Cache system
- Implemented in local memories for global memory accesses
- Not all accesses are cached
  - Only those predetermined by the programmer or compiler
- To avoid cache consistency problems, caches only
  - Read-only data, or
  - Data written by a single processor (i.e., private data)
UI Cedar Architecture (cont’d)
GCU
- Uses macro-dataflow
  - To reduce scheduling and other overheads
  - Treats large structures (arrays) as one object
  - Several operations are combined to reduce scheduling overhead
- Each PC is considered an execution unit
  - Each PC executes a Compound Function (CF)
- Views the program as a directed graph
  - Nodes are CFs
- Large data structures are stored in global memory
  - No structure-copying problem
Synchronization Primitive
Synchronization is supported via a sync variable
- A special data type supported by the hardware
- Consists of two contiguous items in global memory
  - Each item is either 4 bytes (single precision) or 8 bytes (double precision)
- First item: key
  - Always an integer
- Second item: data
  - Unspecified type (integer, floating point, logical, or address)
Synchronization Primitive (cont’d)
Sync expression
    sync(key-relation; key-op; data-op)

key-relation:
    key relop expression
    void
key-op:
    lvalue = key
    key = expression
    lvalue = ++key
    lvalue = --key
    ++key
    --key
    void
data-op:
    lvalue = data
    data = expression
    void
Synchronization Primitive (cont’d)
Sync expression semantics
- The key-relation is evaluated
  - If true, key-op and data-op are done indivisibly
- The result of the sync expression is the value of the key-relation
- If the key-relation is omitted
  - Key-op and data-op are done unconditionally
- When the data-op is missing
  - The key does not have to be the key field of a sync variable
  - It can be any integer
Synchronization Primitive (cont’d)
Sync expression example
while (!sync(lock == 0; ++lock));
/* spin-wait until lock is free */
/* and then set lock */
accum += delta;
lock = 0; /* unlock */
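On hardware without Cedar's sync data type, the same spin-wait-then-update pattern can be sketched with C11 atomics. This is a minimal analog, not Cedar's actual mechanism; `lock`, `accum`, and `delta` mirror the names in the example above:

```c
#include <stdatomic.h>

atomic_int lock = 0;    /* plays the role of the sync variable's key */
double accum = 0.0;

/* Spin until the compare-and-set succeeds, mimicking
   sync(lock == 0; ++lock): test and update happen indivisibly. */
void add_delta(double delta) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock, &expected, 1))
        expected = 0;             /* lock was busy: reset and retry */
    accum += delta;               /* critical section */
    atomic_store(&lock, 0);      /* unlock */
}
```

As with the slide's example, the indivisible test-and-increment is what prevents two processors from both observing `lock == 0` and entering the critical section together.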
Memory Attributes
Three types
- Locality
  - Global
  - Cluster
- Page type
  - Shared
  - Private
- Access privilege
  - Read, write, execute
  - A combination of these
Memory Attributes (cont’d)
Locality attribute
- Specifies where the page should be located in the hierarchy
- Global pages are mapped to physical global memory
- Cluster pages are mapped to cluster memory
- Details of the physical mapping are not visible to a user program
  - Xylem always places a page according to its attribute when a user program references it
Memory Attributes (cont’d)
Page type attribute
- Specifies whether the page is shared or private
- Indicates how a task logically sees the page
  - Private pages belong to a single task
    - Any modifications can be seen only by that task
    - Other tasks do not see these changes
  - Modifications done to a shared page can be seen by other tasks
Multiprocessing Support
Cedar compiler takes FORTRAN source code
- Analyzes it for implicit parallelism
- Generates a control flow graph
User Control Block (UCB)
- Created when the user first logs in
- Multiple logins do not create multiple UCBs (one UCB per user)
Process Control Block (PCB)
- When a process is created (via Unix fork)
  - One PCB and a single task control block (TCB) are created
  - The new task is scheduled
  - This task can create other tasks linked to the same PCB
Multiprocessing Support (cont’d)
Multiprocessing Support (cont’d)
Five primitives are provided
- create_task()
- delete_task()
- start_task()
- end_task()
  - Stops a task
- wait_task()
  - Waits for another task
Multiprocessing Support (cont’d)
create_task()
- Creates a new TCB
  - Attached to the caller's PCB
- Not scheduled for execution
  - Task is in idle state
- Returns an integer to identify the task
- No child-parent relationship
Multiprocessing Support (cont’d)
delete_task(tasknum)
- Deletes the task identified by tasknum
- TCB and associated resources are deallocated
- If the task was executing, it is terminated
- Error if tasknum is unknown
Multiprocessing Support (cont’d)
start_task(tasknum, pc)
- Forces the task identified by tasknum to begin execution at location pc
- Task is marked busy and scheduled for execution
- If the task is already busy, it is interrupted with no way of returning
- Error if tasknum is unknown
Multiprocessing Support (cont’d)
end_task()
- Marks the calling task as idle and stops its execution
- All tasks waiting for this task are unblocked
- Does not deallocate resources allocated to the task
- A task that waits for this one can
  - Delete it,
  - Start it at another location, or
  - Let it remain idle
Multiprocessing Support (cont’d)
wait_task(tasknum)
- Blocks the calling task until the specified task (i.e., tasknum) enters idle state
- A task enters idle state
  - When it is created
  - When it calls end_task
- If the specified task is already in idle state, the calling task continues immediately
- Error if tasknum is unknown
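A rough single-machine analog of these five primitives can be sketched with POSIX threads. The TCB layout, the fixed task table, and the mutex/condition-variable signalling are illustrative assumptions, not Xylem's implementation:

```c
#include <pthread.h>

#define MAX_TASKS 64
enum { IDLE, BUSY };

typedef struct {                  /* simplified task control block */
    int in_use, started;
    int state;                    /* IDLE or BUSY */
    void (*pc)(void);             /* "program counter": entry function */
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t idle_cv;       /* signaled when the task goes idle */
} tcb_t;

static tcb_t tasks[MAX_TASKS];

int create_task(void) {           /* new TCB: idle, not scheduled */
    for (int i = 0; i < MAX_TASKS; i++)
        if (!tasks[i].in_use) {
            tasks[i].in_use = 1;
            tasks[i].started = 0;
            tasks[i].state = IDLE;
            pthread_mutex_init(&tasks[i].lock, NULL);
            pthread_cond_init(&tasks[i].idle_cv, NULL);
            return i;             /* integer identifying the task */
        }
    return -1;
}

static void *trampoline(void *arg) {
    tcb_t *t = arg;
    t->pc();                      /* run from the given "pc" */
    pthread_mutex_lock(&t->lock); /* implicit end_task(): go idle */
    t->state = IDLE;
    pthread_cond_broadcast(&t->idle_cv);  /* unblock all waiters */
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

int start_task(int tasknum, void (*pc)(void)) {
    if (tasknum < 0 || tasknum >= MAX_TASKS || !tasks[tasknum].in_use)
        return -1;                /* error: unknown tasknum */
    tcb_t *t = &tasks[tasknum];
    t->state = BUSY;
    t->pc = pc;
    t->started = 1;
    return pthread_create(&t->thread, NULL, trampoline, t);
}

int wait_task(int tasknum) {      /* block until the task is idle */
    if (tasknum < 0 || tasknum >= MAX_TASKS || !tasks[tasknum].in_use)
        return -1;
    tcb_t *t = &tasks[tasknum];
    pthread_mutex_lock(&t->lock);
    while (t->state != IDLE)      /* created tasks start out idle */
        pthread_cond_wait(&t->idle_cv, &t->lock);
    pthread_mutex_unlock(&t->lock);
    return 0;
}

int delete_task(int tasknum) {    /* free the TCB and its resources */
    if (tasknum < 0 || tasknum >= MAX_TASKS || !tasks[tasknum].in_use)
        return -1;
    tcb_t *t = &tasks[tasknum];
    if (t->started)
        pthread_join(t->thread, NULL);
    t->in_use = 0;
    return 0;
}
```

Note that `wait_task` on a freshly created, never-started task returns immediately, matching the idle-on-creation rule above.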
Example 1 (cont’d)
global shared integer: FLAG, MIDDLE
local private integer: RIGHT

A: <body of node A>
   FLAG = 0
   RIGHT = create_task()
   call start_task(RIGHT, C)
   goto B

B: <body of B>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto D

[Figure: task graph with node A at the top, successors B and C, then D, E, and F, joining at G]
Example 1 (cont’d)
C: <body of C>
   if (.NOT. SYNC(FLAG == 0; ++FLAG)) then
      MIDDLE = create_task()
      call start_task(MIDDLE, E)
   endif
   goto F

D: <body of D>
   goto GD

E: <body of E>
   goto GE

F: <body of F>
   goto GF
Example 1 (cont’d)
GE:
GF:
   call end_task()

GD:
   call wait_task(RIGHT)
   call wait_task(MIDDLE)
   call delete_task(RIGHT)
   call delete_task(MIDDLE)
Example 2 (cont’d)
      DO 101 I = 1,N
      DO 101 J = 1,210
  101 A(I,J) = B(I,J) + C(J)

      DO 102 I = 1,10000
      F(I) = ABS(F(I))
  102 IF (G(I) .LT. 0) F(I) = -F(I)
[Figure: task graph with node A forking into B (DOALL 101) and C (DOALL 102), which join at D]
Example 2 (cont’d)
local private integer T

A: T = create_task()
   call start_task(T, C)
   goto B

B: doall 101
   goto DB

C: doall 102
   goto DC

DB: call wait_task(T)
    call delete_task(T)
    goto next_node

DC: call end_task()
Example 2 (cont’d)
C: N = 10
   local private integer tasknum(N), T, J
   global shared integer I

   I = 0
   do J = 1, N
      T = create_task()
      tasknum(J) = T
      call start_task(T, CC)
   enddo
   do J = 1, N
      T = tasknum(J)
      call wait_task(T)
      call delete_task(T)
   enddo

CC: local private integer J, K

    dowhile (SYNC(I < 100; J = ++I))
       J = J*100 - 99
       do K = J, J+99
          F(K) = abs(F(K))
          if (G(K) .LT. 0) F(K) = -F(K)
       enddo
    endwhile
    call end_task()
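The CC routine above uses the sync variable to hand out 100-iteration chunks to whichever task asks next. A C analog of that self-scheduling loop, with an atomic counter standing in for the shared I (array sizes and names are illustrative):

```c
#include <stdatomic.h>
#include <math.h>

#define CHUNKS 100
#define CHUNK  100

atomic_int next_chunk = 0;      /* plays the role of shared I */
double F[CHUNKS * CHUNK];
double G[CHUNKS * CHUNK];

/* Each worker repeatedly claims the next 100-iteration chunk, like
   SYNC(I < 100; J = ++I) in the CC routine: the test and increment
   are one indivisible operation, so no chunk is handed out twice. */
void worker(void) {
    int j;
    while ((j = atomic_fetch_add(&next_chunk, 1)) < CHUNKS) {
        int base = j * CHUNK;   /* FORTRAN's J*100 - 99, zero-based */
        for (int k = base; k < base + CHUNK; k++) {
            F[k] = fabs(F[k]);
            if (G[k] < 0) F[k] = -F[k];
        }
    }
}
```

Running `worker` from several threads divides the 10000 iterations among them without any central scheduler, which is the point of the slide's scheme.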
Run Queue Organization
Run queue organizations
- Centralized
  - A single global queue
- Distributed
  - Local queues
- Hybrid
  - Multiple queues
  - Hierarchical organization
Run Queue Organization (cont’d)
Centralized organization
- A single global queue
- Tasks are accessible to all processors
- Mutually exclusive access to the global queue is required
- Can lead to queue access contention for a large number of processors
- Good for small systems
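The mutually exclusive access requirement can be sketched as a single mutex-protected queue; every processor passes through the one lock, which is exactly the contention point noted above. The circular-buffer layout and all names here are illustrative:

```c
#include <pthread.h>

#define QSIZE 1024

typedef struct {
    int tasks[QSIZE];          /* task identifiers */
    int head, tail, count;
    pthread_mutex_t lock;      /* the serialization point every
                                  processor contends on */
} run_queue_t;

static run_queue_t global_q = { .lock = PTHREAD_MUTEX_INITIALIZER };

void enqueue(run_queue_t *q, int task) {
    pthread_mutex_lock(&q->lock);
    q->tasks[q->tail] = task;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_mutex_unlock(&q->lock);
}

int dequeue(run_queue_t *q) {  /* returns -1 if the queue is empty */
    pthread_mutex_lock(&q->lock);
    int task = -1;
    if (q->count > 0) {
        task = q->tasks[q->head];
        q->head = (q->head + 1) % QSIZE;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return task;
}
```

With many processors, time spent waiting on `q->lock` grows with the dispatch rate, which is the access-contention effect the performance comparison below quantifies as f.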
Run Queue Organization (cont’d)
Distributed organization
- A local queue at each processor
- Tasks are accessible only to the associated processor
- Needs a task placement policy
- Excellent scalability
  - Good for large systems
- Load balancing is a problem
Run Queue Organization (cont’d)
Performance comparison
- Run queue access time is not negligible
- # of processors = 64
- Average # of tasks/job = 64 (exponentially distributed)
- Average task service time = 1 time unit (exponentially distributed)
- Run queue access time f = 0% to 4% of task service time
Run Queue Organization (cont’d)
[Plot: mean response time vs. utilization for the centralized organization, with curves for f = 0%, 1%, 2%, 3%, and 4%]
Run Queue Organization (cont’d)
[Plot: mean response time vs. utilization for the distributed organization, with curves for f = 4% and f = 0%]
Run Queue Organization (cont’d)
[Plot: mean response time vs. service time CV, comparing the distributed and centralized organizations]
Improving Performance
Centralized organization
- Need to minimize access contention
- Autonomous policy (Nelson & Squillante)
  - Every access brings a set of tasks
  - Reduces the number of accesses to the central queue
  - Potential problems
    - Load imbalance
    - Optimal set size depends on the system load
    - Large service time CV can cause performance deterioration
Improving Performance (cont’d)
Cooperative policy (Nelson & Squillante)
- Every access brings tasks for other processors as well
  - Moves tasks from the central queue to other processors' local queues
  - Uses the "join the shortest queue" policy
- Improves load balancing
- Performs better than the Autonomous policy and the distributed organization
- Potential problems
  - Difficult to implement for large systems
  - The scheduler needs to maintain state information on other processors (their local queue lengths)
Improving Performance (cont’d)
Distributed organization
- We have to address the load imbalance problem
- Oblivious placement policies
  - Random
  - Round robin (cyclic)
- Adaptive placement policies
  - Shortest queue
  - Shortest response time (SRT) queue
Improving Performance (cont’d)
[Plot: mean response time vs. utilization for the Random, Round robin, Shortest queue, and SRT queue placement policies]
Improving Performance (cont’d)
[Plot: mean response time vs. service time CV for the Random, Round robin, Shortest queue, and SRT queue placement policies]
Improving Performance (cont’d)
Implementation problems with adaptive policies
- System state overhead
  - Both shortest queue and SRT queue policies need system state information
  - To reduce this overhead, state information is collected from only a subset of P (P < N) processors
    - P = # of probes to collect state information
  - If P is small, we are successful in reducing the overhead
  - In practice, a small number of probes is sufficient
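Probe-limited placement can be sketched as follows: instead of examining all N queues, the scheduler samples P of them at random and picks the shortest. The queue-length array is an illustrative stand-in for the collected state information:

```c
#include <stdlib.h>

/* Pick a destination processor by probing n_probes randomly chosen
   processors and choosing the one with the shortest run queue,
   instead of examining all n_procs queues. */
int place_task(const int queue_len[], int n_procs, int n_probes) {
    int best = rand() % n_procs;          /* first probe */
    for (int i = 1; i < n_probes; i++) {
        int p = rand() % n_procs;
        if (queue_len[p] < queue_len[best])
            best = p;
    }
    return best;
}
```

The overhead is proportional to `n_probes` rather than `n_procs`, which is why a small probe count already captures most of the benefit shown in the next plot.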
Improving Performance (cont’d)
[Plot: mean response time vs. number of probes (1 to 10) for the Shortest queue and SRT queue policies]
Improving Performance (cont’d)
A problem with the SRT queue policy
- Needs a priori knowledge of execution times
- Often we may get only an estimate
  - Subject to estimation errors
ESRT queue policy
- Uses an estimate that is within X% of the actual service time
- In the experiments we used 30%
SRT queue policy
- Assumes the exact service time is known beforehand
Improving Performance (cont’d)
[Plot: mean response time vs. service time CV for the Shortest queue, SRT queue, and ESRT queue policies]
Hierarchical Organization
Goal is to have the best of both organizations
- Avoids bottleneck problems, like the distributed organization
- Good load sharing, as in the centralized organization
- Should be self-scheduling
  - No state information collection
Hierarchical organization provides all these desired features
- Performs close to the centralized organization but scales well like the distributed organization
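The self-scheduling transfer can be sketched as a queue tree in which a node that runs dry pulls a batch of tasks from its parent, sized by Tr times the number of processors below it. The structures here are illustrative assumptions, not the simulated system's actual design:

```c
#include <stddef.h>

typedef struct node {
    struct node *parent;
    int tasks[256];             /* task identifiers queued here */
    int count;
    int procs_below;            /* # processors in this node's subtree */
} node_t;

/* Pull a batch of tasks from the parent when this queue is empty.
   Batch size = Tr * procs_below (the static transfer rule); no
   global state is consulted, which makes the scheme self-scheduling. */
int refill(node_t *n, double Tr) {
    if (n->count > 0 || n->parent == NULL)
        return n->count;        /* still has work, or is the root */
    int want = (int)(Tr * n->procs_below);
    if (want < 1) want = 1;
    int take = n->parent->count < want ? n->parent->count : want;
    for (int i = 0; i < take; i++)
        n->tasks[n->count++] = n->parent->tasks[--n->parent->count];
    return n->count;
}
```

Each processor only ever touches its own queue and, occasionally, its parent's, so contention stays local to small groups instead of concentrating on one global lock.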
Hierarchical Organization (cont’d)
Hierarchical Organization (cont’d)
[Figure: task transfers in the hierarchical organization for transfer factors Tr = 1 and Tr = 2]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. utilization for the centralized organization, with curves for f = 0% to 4%]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. utilization for the distributed and hierarchical organizations]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. number of tasks for the centralized organization (fixed task size), with curves for f = 0% to 4%]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. number of tasks for the distributed and hierarchical organizations (fixed task size)]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. number of tasks for the distributed and hierarchical organizations (fixed job size)]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. service time CV for the distributed, hierarchical, and centralized organizations]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. service time CV for the distributed and hierarchical organizations at utilizations 0.5 and 0.75]
Hierarchical Organization (cont’d)
[Plot: mean response time vs. utilization for the distributed and hierarchical organizations with N = 64 and N = 128]
Hierarchical Organization (cont’d)
[Plot: a second comparison of mean response time vs. utilization for the distributed and hierarchical organizations, N = 64 and N = 128]
Hierarchical Organization (cont’d)
[Plot: ratio of mean response times vs. utilization for branching factors B = 2 and B = 8 relative to B = 4, at f = 2% and f = 4%]
Hierarchical Organization (cont’d)
[Plot: ratio of mean response times vs. utilization for Tr = 2 and Tr = 0.5, at f = 2% and f = 4%]
Hierarchical Organization (cont’d)
Adaptive number of tasks: Policy 1
- Moves a number of tasks proportional to the number of tasks queued at the parent
- At least as many as in the static policy: Tr * (# processors below the child queue)
Hierarchical Organization (cont’d)
Adaptive number of tasks: Policy 2
- Moves a number of tasks proportional to the number of tasks queued at the parent
- But keeps this value the same for all children of the parent
- At least as many as in the static policy: Tr * (# processors below the child queue)
Hierarchical Organization (cont’d)
[Plot: ratio of mean response times vs. utilization for Policy 1 and Policy 2]
Last slide