View
8
Download
0
Category
Preview:
Citation preview
On the Relevance of Wait-free Coordination Algorithms inShared-Memory HPC: The Global Virtual Time Case
Alessandro PellegriniFrancesco Quaglia
High Performance and DependableComputing Systems Group
Sapienza, University of Rome
InfQ 2014
Current Technological Trend
• Implications of Moore’s Lawhave changed since 2003
• 130W is considered an upperbound (the power wall)
P = ACV 2f
2 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Current Technological Trend
• Implications of Moore’s Lawhave changed since 2003
• 130W is considered an upperbound (the power wall)
P = ACV 2f
2 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Current Technological Trend
• Implications of Moore’s Lawhave changed since 2003
• 130W is considered an upperbound (the power wall)
P = ACV 2f
2 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Multicore Software Scaling
3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Multicore Software Scaling
3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Multicore Software Scaling
3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Multicore Software Scaling
3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Multicore Software Scaling
3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Problem Statement in HPC
• Many HPC applications enforce data separation◦ The amount of synchronization points are already reduced
• Coordination algorithms easily become the bottleneck of HPCSystems on multicore architectures
• They cannot be avoided at all◦ Necessary for the overall correctness
• How to deal with performance issues of HPC Systems onmulticores?
4 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Wait-Free Algorithms
• All threads complete their job in a finite number of steps
• Algorithms in which threads communicate using some sort ofatomic-update machine instruction◦ Compare and Swap (CAS)◦ Load-Link/Store-Conditional (LL/SC)
• Both CAS & LL/SC will fail if their initial conditions change
• The “fix” for failing is usually to loop and try again.
5 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Wait-Free Algorithms
• All threads complete their job in a finite number of steps
• Algorithms in which threads communicate using some sort ofatomic-update machine instruction◦ Compare and Swap (CAS)◦ Load-Link/Store-Conditional (LL/SC)
• Both CAS & LL/SC will fail if their initial conditions change
• The “fix” for failing is usually to loop and try again.
5 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Target Architecture: PDES
• Exchanged information unit is the discrete event
• The involved objects state is updated according to the flow ofevents
6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Target Architecture: PDES
• Exchanged information unit is the discrete event
• The involved objects state is updated according to the flow ofevents
Communication Network
Machine
Processor
Kernel
LPLP
LP
Processor
Kernel
LPLP
LP
Machine
Processor
Kernel
LPLP
LP
Processor
Kernel
LPLP
LP
... ...
...
6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Target Architecture: PDES
• Exchanged information unit is the discrete event
• The involved objects state is updated according to the flow ofevents
Communication Network
Machine
CPU
Kernel
LPLP
LP LPLP
LP LPLP
LP LPLP
LP
...
...
CPU CPU CPU
Machine
CPU
Kernel
...CPU CPU CPU
Kernel
6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Target Architecture: PDES
• Exchanged information unit is the discrete event
• The involved objects state is updated according to the flow ofevents
Machine
CPU
LPLP
LP LPLP
LP
CPU CPU CPU
Simulation Kernel
Discrete Event
Runtime
Environment
6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj
LPk Execution Time
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5
LPk Execution Time7
10 Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5
LPk Execution Time7 17
10
17
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5 10
20
LPk Execution Time7 17 25
10
17
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5 10
20
12
LPk Execution Time7 17 25
10
17
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5 10
20
Straggler Message
12
LPk Execution Time7 17 25
10
17
Rollback Execution:
Recovering state at
LVT 10
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5 10
20
Straggler Message
12
LPk Execution Time7 17 25
10
17 17
Anti-message
anti-message
reception
Rollback Execution:
Recovering state at
LVT 10
Rollback Execution:
Recovering State at
LVT 7
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5 10
20 12
Straggler Message
12
LPk Execution Time7 17 25
10
17
17 17
Anti-message
anti-message
reception
Rollback Execution:
Recovering state at
LVT 10
Rollback Execution:
Recovering State at
LVT 7
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Coordination Algorithm in PDES: GVT
LPi
LPj 15
5 10
20 12
Straggler Message
12
LPk Execution Time7 17 25
10
17
17 17
Anti-message
anti-message
reception
Rollback Execution:
Recovering state at
LVT 10
Rollback Execution:
Recovering State at
LVT 7
Execution Time
Execution Time
7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Fujimoto & Hybinette
• Reference algorithm has been so far Fujimoto’s and Hybinette’sGVT for Shared-Memory Multiprocessors (1998)
• It targets observable Time Warp Systems◦ Essentially the send operation leads to the direct incorporation into
the message queue◦ This removes the notion of in-transit messages
• It relies on the notion of GVT computation phase
8 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Fujimoto & Hybinette (2)
procedure Main-Loopwhile there are events to process do
LocalGVTFlag ← GVTFlag . Atomically set to NPEReceive messages and process rollbacksReceive antimessages, process annihilations and rollbacksSelect next event Mif LocalGVTFlag > 0 and LocalMin not computed then . CS
PEMin[PE] ← min{SendMin, TS(M)}GVTFlags ← GVTFlags - 1if GVTFlag = 0 then
GVT ← min{PEMin[1], . . . , PEMin[NPE]}end if
end ifend while
end procedure
9 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Wait-Free GVT Algorithm
• Each worker thread manages a set of LPs
• Event-queues are directly accessible by worker threads◦ No message/anti-message is ever in-flight across worker-threads
• GVT protocol entails determining the right moment to look atdata structures and compute the local minimum◦ It is computed after the incorporation of incoming
messages/anti-messages into the event-queue◦ No need to account for anti-messages and canceled events timestamp
• The algorithm is wait-free because it relies on a sequence of phases
10 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Wait-Free GVT Algorithm (2)
wall-clock-time
t1 t2
GVT_flag is
set to TRUE
phase-A phase-send
t3
phase-BWT1
WT2
WT3
m
ts
11 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Test-Bed Scenario
• 32-cores HP ProLiant Server
• 64GB of RAM
• 2.6.32-5-amd64 Linux kernel
• ROOT-Sim Simulation Kernel v. 1.0.2
• Modified version of PHOLD benchmark◦ homogeneous LPs that perform memory allocation/deallocation
operations and read/write operations◦ LPs modify their memory layout◦ few and small-granularity events ⇒ increased likelihood of competing
for the critical section
12 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Experimental Results
0
2
4
6
8
10
12
14
16
8 16 24 32
Ove
rall
Exe
cutio
n T
ime
(sec
onds
)
Number of Concurrent Worker Threads
WF FH
13 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Experimental Results (2)
0
0.5
1
1.5
2
8 16 24 32
Num
ber
of S
pinl
ock
Cyc
les
per
WC
T U
nit
Number of Concurrent Worker Threads
FH
14 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Experimental Results (3)
0
5
10
15
20
25
30
35
8 12 16 20 24
Ove
rall
Exe
cutio
n T
ime
(sec
onds
)
Number of Busy (with external workload) CPUs
WF FH
15 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Conclusions
• HPC Applications usually have few synchronization points◦ There is a high demand for enhanced synchronization schemes
• Modern multicore architectures enhance the impact ofsynchronization points◦ Contention due to locks can hamper scalability
• Wait-free algorithms can be a viable solution to be studied to facethis issue◦ They can provide enhanced scalability◦ There is a benefit in non-dedicated environments as well
16 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Thanks for your attention
Questio
ns?
Questio
ns?
pellegrini@dis.uniroma1.it
http://www.dis.uniroma1.it/∼pellegrini
http://www.dis.uniroma1.it/∼ROOT-Sim
17 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Recommended