NUCA Locks
Uppsala University, Department of Information Technology
Uppsala Architecture Research Team [UART]
Efficient Synchronization for Non-Uniform Communication Architecture
Zoran Radovic and Erik Hagersten
{zoran.radovic, erik.hagersten}@it.uu.se
Supercomputing 2002, Uppsala Architecture Research Team (UART), NUCA Locks
Synchronization Basics
Locks are used to protect the shared critical section data
A := 0
BARRIER
LOCK(L)   A := A + 1   UNLOCK(L)
LOCK(L)   B := A + 5   UNLOCK(L)
Simple Spin Locks
test_and_test&set (TATAS), ’84
TATAS with exponential backoff (TATAS_EXP), ’90
Many variations
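[Figure: processors P1 … Pn, each with a cache ($), share one memory that holds the lock word. The lock switches between FREE and BUSY; while P3 holds it, the other processors busy-wait or back off.]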
TATAS_LOCK(L) {
  while (tas(L)) {   // atomic test&set returned BUSY
    while (*L) ;     // spin on ordinary reads until the lock looks FREE
  }
}
TATAS_UNLOCK(L) {
  *L = 0;            // = FREE
}
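The TATAS_EXP variant named above can be sketched with C11 atomics as follows. This is a minimal illustration, not the code measured in the talk: the type `tatas_lock`, the function names, and the backoff constants `BACKOFF_BASE` and `BACKOFF_CAP` are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical tuning values; real ones depend on the machine. */
enum { BACKOFF_BASE = 4, BACKOFF_CAP = 1024 };

typedef struct { atomic_int word; } tatas_lock;   /* 0 = FREE, 1 = BUSY */

void tatas_init(tatas_lock *l) { atomic_init(&l->word, 0); }

/* One test&set attempt; returns true if the caller now holds the lock. */
bool tatas_try_lock(tatas_lock *l) {
    return atomic_exchange_explicit(&l->word, 1, memory_order_acquire) == 0;
}

/* Busy loop used as the backoff delay. */
static void tatas_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

void tatas_exp_lock(tatas_lock *l) {
    unsigned delay = BACKOFF_BASE;
    while (!tatas_try_lock(l)) {
        /* Back off, doubling the delay up to the cap, and only retry
           the test&set once a plain read sees the lock FREE. */
        do {
            tatas_delay(delay);
            if (delay < BACKOFF_CAP) delay *= 2;
        } while (atomic_load_explicit(&l->word, memory_order_relaxed) != 0);
    }
}

void tatas_exp_unlock(tatas_lock *l) {
    assert(atomic_load_explicit(&l->word, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

The backoff spaces out the retries, so fewer test&set operations hit the lock word while it is BUSY.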
Performance Under Contention
[Figure: CS cost versus amount of contention, with curves for spin locks and spin locks with backoff.]
IF (more contention) THEN less efficient CS …
Making it Scalable: Queues …
First-come, first-served order
Starvation avoidance
Maximal fairness
Reduced traffic
Queue-based locks
• HW: QOLB ’89
• SW: MCS ’91
• SW: CLH ’93
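To make the queue idea concrete, here is a minimal sketch of a CLH-style lock in C11. It is an illustration under stated assumptions, not the implementation benchmarked in the talk: the names `clh_lock`, `clh_handle`, and the functions are invented for the example, each thread is assumed to own one `clh_handle`, and cleanup of the nodes is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct clh_node { atomic_bool busy; } clh_node;
typedef struct { _Atomic(clh_node *) tail; } clh_lock;
typedef struct { clh_node *mine; clh_node *pred; } clh_handle; /* one per thread */

static clh_node *clh_new_node(void) {
    clh_node *n = malloc(sizeof *n);
    assert(n != NULL);
    atomic_init(&n->busy, false);
    return n;
}

/* The queue starts with one dummy node that is already released. */
void clh_init(clh_lock *l) { atomic_init(&l->tail, clh_new_node()); }

void clh_handle_init(clh_handle *h) { h->mine = clh_new_node(); h->pred = NULL; }

void clh_acquire(clh_lock *l, clh_handle *h) {
    atomic_store_explicit(&h->mine->busy, true, memory_order_relaxed);
    /* Enqueue at the tail; the previous tail is our predecessor. */
    h->pred = atomic_exchange_explicit(&l->tail, h->mine, memory_order_acq_rel);
    /* Each waiter spins on its own predecessor's flag, in arrival order. */
    while (atomic_load_explicit(&h->pred->busy, memory_order_acquire)) { }
}

void clh_release(clh_handle *h) {
    clh_node *done = h->mine;
    h->mine = h->pred;   /* recycle the predecessor's node for the next acquire */
    atomic_store_explicit(&done->busy, false, memory_order_release);
}
```

Because every waiter spins on a different flag and the lock is passed strictly in arrival order, a release disturbs only one waiter, which is where the first-come, first-served order and the reduced traffic come from.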
Queue Locks Under Contention
[Figure: CS cost versus amount of contention, with curves for spin locks, spin locks with backoff, and queue-based locks.]
Queue-based locks: IF (more contention) THEN constant CS cost …
Non-Uniform Memory Architecture (NUMA)
Many NUMA optimizations are proposed
• Page migration
• Page replication
[Figure: two nodes connected by a switch; each node has processors P1 … Pn with caches ($) and a local memory. Relative access cost: 1 (local) versus 2 – 10 (remote).]
Non-Uniform Communication Architecture (NUCA)
NUCA examples (NUCA ratios):
• 1992: Stanford DASH (~ 4.5)
• 1996: Sequent NUMA-Q (~ 10)
• 1999: Sun WildFire (~ 6), our NUCA
• 2000: Compaq DS-320 (~ 3.5)
• Future: CMP, SMT (~ 10)
[Figure: two nodes connected by a switch; each node has processors P1 … Pn with caches ($) and a local memory. NUCA ratio: 1 (local) versus 2 – 10 (remote).]
Our Goals
Design a scalable spin lock that exploits the NUCAs
Creating node affinity
• For lock handover
• For CS data
“Stable lock”
Reducing the traffic compared with the test&set locks
Outline
Background & Motivation
NUMA vs. NUCA
The RH Lock
Performance Results
Application Study
Conclusions
Key Ideas Behind RH Lock
Minimizing global traffic at lock handover
• Only one thread per node will try to acquire a remotely owned lock
Maximizing node locality of NUCAs
• Hand over the lock to a neighbor in the same node
• Creates locality for the critical section (CS) data as well
• Especially good for large CS and high contention
RH lock in a nutshell:
• Double TATAS_EXP: one node-local lock + one “global”
The RH Lock Algorithm
[Figure: two cabinets, each with its own memory. Cabinet 1 holds processors P1 … P16 and cabinet 2 holds P17 … P32, each with a cache ($). Each memory shows lock words labeled Lock1 and Lock2 whose value is FREE, L_FREE, REMOTE, or a thread ID. The animation follows P2 and P19 contending for a lock.]
Acquire: SWAP(my_TID, Lock)
• If (FREE or L_FREE): you've got it!
• If “REMOTE”: spin remotely with CAS(FREE, REMOTE) until FREE (w/ exp backoff)
• Else: TATAS(my_TID, Lock) until FREE or L_FREE
Release: CAS(my_TID, FREE), else L_FREE
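The acquire and release steps on this slide can be sketched in C11 for the 2-node case. This is a reading of the slide, not the authors' code: the type `rh_lock`, the function names, the numeric encoding of FREE, L_FREE, and REMOTE, and the backoff caps are all assumptions, and thread IDs are assumed to start at `RH_FIRST_TID`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed encoding: three special values, thread IDs from RH_FIRST_TID up. */
enum { RH_FREE = 0, RH_L_FREE = 1, RH_REMOTE = 2, RH_FIRST_TID = 3 };

typedef struct { atomic_int copy[2]; } rh_lock;   /* one lock word per node */

/* Busy-wait, then double the delay up to cap (exponential backoff). */
static void rh_pause(unsigned *delay, unsigned cap) {
    for (volatile unsigned i = 0; i < *delay; i++) { }
    if (*delay < cap) *delay *= 2;
}

/* Initially node 0 owns the lock globally; node 1 sees it as REMOTE. */
void rh_init(rh_lock *l) {
    atomic_init(&l->copy[0], RH_FREE);
    atomic_init(&l->copy[1], RH_REMOTE);
}

void rh_acquire(rh_lock *l, int node, int tid) {
    atomic_int *local = &l->copy[node], *remote = &l->copy[1 - node];
    unsigned delay = 4;
    assert(tid >= RH_FIRST_TID);
    for (;;) {
        int seen = atomic_exchange(local, tid);            /* SWAP(my_TID, Lock) */
        if (seen == RH_FREE || seen == RH_L_FREE) return;  /* got it */
        if (seen == RH_REMOTE) {
            /* We are this node's only remote spinner: CAS(FREE, REMOTE). */
            int expected = RH_FREE;
            while (!atomic_compare_exchange_strong(remote, &expected, RH_REMOTE)) {
                expected = RH_FREE;
                rh_pause(&delay, 4096);
            }
            return;
        }
        /* A neighbor holds or awaits the lock: spin locally on reads. */
        while (atomic_load(local) >= RH_FIRST_TID) rh_pause(&delay, 256);
    }
}

void rh_release(rh_lock *l, int node, int tid) {
    int expected = tid;
    /* Nobody swapped in behind us: free globally. Otherwise hand over locally. */
    if (!atomic_compare_exchange_strong(&l->copy[node], &expected, RH_FREE))
        atomic_store(&l->copy[node], RH_L_FREE);
}
```

In this sketch a release always prefers a waiting neighbor, so nothing bounds how long the other node waits.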
Our NUCA: Sun WildFire
[Figure: Sun WildFire (WF) with two nodes of 14 processors each, connected by a switch; each processor has a cache ($) and each node a local memory. NUCA ratio: 1 (local) versus 6 (remote).]
NUCA-performance
[Figure: two plots for the 2-node WildFire with 14 processors per node, each with curves for TATAS, TATAS_EXP, MCS, CLH, and RH. First plot: Time [microseconds] (0 to 60) versus Number of Processors (0 to 28). Second plot: Node handoffs [%] (0 to 100) versus Number of Processors (0 to 28).]
New Microbenchmark
for (i = 0; i < iterations; i++) {
  LOCK(L);
  delay(critical_work); // CS
  UNLOCK(L);
  static_delay();
  random_delay();
}
More realistic node handoffs for queue-based locks
Constant number of processors
Amount of Critical Section (CS) work can be increased
• we can control the “amount of contention”
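The “node handoffs” metric plotted on the surrounding slides can be computed from a trace of lock acquisitions. The sketch below assumes a node handoff means that the lock moves to a thread on a different node than the previous holder; the function name `node_handoff_percent` and the trace format are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* owner_node[i] is the node of the thread that made the i-th acquisition.
   Returns the share of handovers that crossed a node boundary, in percent. */
double node_handoff_percent(const int *owner_node, size_t acquisitions) {
    assert(owner_node != NULL);
    if (acquisitions < 2) return 0.0;   /* no handover happened yet */
    size_t crossings = 0;
    for (size_t i = 1; i < acquisitions; i++) {
        if (owner_node[i] != owner_node[i - 1]) crossings++;
    }
    return 100.0 * (double)crossings / (double)(acquisitions - 1);
}
```

For the trace 0, 0, 1, 1, 0 there are four handovers, two of which cross nodes, so the function returns 50.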
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: two plots, each with curves for TATAS, TATAS_EXP, MCS, CLH, and RH. First plot: Time [seconds] (3 to 13) versus critical_work (0 to 2000). Second plot: Node handoffs [%] (0 to 100) versus critical_work (0 to 2000).]
Traffic Measurements
New microbenchmark; critical_work = 1500
[Figure: bar chart of local and global transactions for TATAS_EXP, MCS, CLH, and RH, on a scale of 0.0 to 1.0.]
Application Performance
Raytrace Speedup
[Figure: Speedup (0 to 9) versus Number of Processors (0 to 28) on WildFire, with curves for TATAS, TATAS_EXP, MCS, and CLH; a second build of the slide adds the RH curve.]
RH Lock Under Contention
[Figure: CS cost versus amount of contention, with curves for spin locks, spin locks with backoff, queue-based locks, and the RH lock.]
Total Traffic: Raytrace
[Figure: bar chart of local and global transactions for TATAS, TATAS_EXP, MCS, CLH, and RH, on a scale of 0.0 to 1.4.]
Application Performance
28-processor runs
[Figure: bar chart of normalized speedup (0.0 to 3.0) for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, and their average, with bars for TATAS, TATAS_EXP, MCS, CLH, and RH.]
Conclusions
First-come, first-served not desirable for NUCAs
The RH lock exploits NUCAs by
• creating locality through CS affinity (stable lock)
• reducing traffic compared with the test&set locks
The first lock that performs better under contention
Global traffic is significantly reduced
Applications with contended locks scale better with RH locks on NUCAs
Any Drawbacks?
Proof-of-concept NUCA-aware lock for 2 nodes
Hard to port to some architectures
• Memory needs to be allocated/placed in different nodes
Lock storage is proportional to #NUCA nodes
Sensitive to starvation
• “Non-uniform nature” of the algorithm
• No mechanism for lowering the risk of starvation
Can We Fix It?
We propose a new set of NUCA-aware locks
• Hierarchical Backoff Locks (HBO)
• HPCA-9: Anaheim, California, February 2003
Teaser …
• Portable
• Scalable to many NUCA nodes
• Only cas atomic operations are used
• Only node_id is needed
• Lowers the risk of starvation
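One way to read the teaser is a single lock word that records the holder's node_id, acquired with cas only, where waiters in the holder's node back off less than waiters in other nodes. The sketch below illustrates that reading; it is not taken from the HBO paper, and the type `hbo_lock`, the function names, and the two backoff caps are assumptions. It also leaves out any starvation-avoidance mechanism.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed values: FREE marker and backoff caps for same-node and other-node waiters. */
enum { HBO_FREE = -1, HBO_LOCAL_CAP = 64, HBO_REMOTE_CAP = 4096 };

typedef struct { atomic_int holder_node; } hbo_lock;   /* HBO_FREE or a node_id */

void hbo_init(hbo_lock *l) { atomic_init(&l->holder_node, HBO_FREE); }

void hbo_acquire(hbo_lock *l, int node_id) {
    unsigned delay = 4;
    assert(node_id >= 0);
    for (;;) {
        int seen = HBO_FREE;
        /* cas(FREE -> node_id); on failure, seen holds the holder's node. */
        if (atomic_compare_exchange_strong(&l->holder_node, &seen, node_id))
            return;
        /* Waiters in the holder's node use the smaller cap, so they retry
           sooner and the lock tends to stay within that node. */
        unsigned cap = (seen == node_id) ? HBO_LOCAL_CAP : HBO_REMOTE_CAP;
        for (volatile unsigned i = 0; i < delay; i++) { }
        delay = (delay * 2 > cap) ? cap : delay * 2;
    }
}

void hbo_release(hbo_lock *l) { atomic_store(&l->holder_node, HBO_FREE); }
```

Only one lock word is needed regardless of the number of nodes, which fits the “portable” and “scalable to many NUCA nodes” points above.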