[email protected] Dissertation Seminar Nov 18, 2005
Dissertation Seminar, 18/11 – 2005
Auditorium Minus, Museum Gustavianum
Software Techniques for Distributed Shared Memory
Zoran Radovic
[email protected]
Outline
NUCA Locks
DSZOOM – Software-based Shared Memory
TMA – Trap-based Memory Architecture
Vasaloppet: “Contention Problem in Sweden”
Traditional cross-country ski race, 90 km …
85.6533 km to go… CSCS
Spin Locks under Contention
[Figure: critical section (CS) cost vs. amount of contention; curves: spin locks, spin locks with backoff]
IF (more contention) THEN less efficient CS …
“The more important the slower it runs…”
Queue-based Locks
[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks with backoff, queue-based locks]
IF (more contention) THEN constant CS cost …
This Dissertation
[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks with backoff, queue-based locks, NUCA locks]
IF (more contention) THEN more efficient CS …
“The more important the faster it runs…”
NUCA Locks (Basic Idea)
[Figure: three nodes, each with processors (P), caches ($), and a local memory, connected by a switch; the Lock/Unlock and Test labels mark lock traffic within and between nodes]
1) Reduce traffic
- one CPU per node is testing…
2) Improve lock handover
3) More efficient CS
- local traffic is cheaper
The HBO Lock (the simplest HBO)
What do we need?
• node_id
• Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
lock-acquire:
If the lock-value is in the state FREE:
• The node_id is CAS-ed into the lock location
Else: 2 cases
• The lock is “local”: spin with small backoff
• The lock is “remote”: spin with large backoff
Simple but fairly effective…
Creates communication affinity
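The lock-acquire steps above can be sketched in C11 as follows. This is only an illustration under stated assumptions, not the dissertation's implementation: the `FREE` encoding, the two backoff constants, and the names `hbo_lock_t`, `hbo_acquire`, and `hbo_release` are all hypothetical, and a real version would likely use a tuned, growing backoff.

```c
#include <assert.h>
#include <stdatomic.h>

#define FREE           (-1)  /* assumed encoding of the FREE state */
#define BACKOFF_LOCAL   16   /* illustrative spin counts, not tuned values */
#define BACKOFF_REMOTE 256

/* The lock word holds FREE or the node_id of the current holder. */
typedef struct { _Atomic int owner_node; } hbo_lock_t;

static void hbo_init(hbo_lock_t *l) { atomic_init(&l->owner_node, FREE); }

/* Busy-wait for a fixed number of iterations. */
static void backoff(int iterations) {
    for (volatile int i = 0; i < iterations; i++) { }
}

static void hbo_acquire(hbo_lock_t *l, int node_id) {
    for (;;) {
        int expected = FREE;
        /* CAS(Lock_address, FREE, node_id) */
        if (atomic_compare_exchange_strong(&l->owner_node, &expected, node_id))
            return;  /* lock was FREE and now carries our node_id */
        /* CAS failed: expected now holds the holder's node_id.
         * Same node ("local"): small backoff. Other node ("remote"): large. */
        backoff(expected == node_id ? BACKOFF_LOCAL : BACKOFF_REMOTE);
    }
}

static void hbo_release(hbo_lock_t *l) {
    assert(atomic_load(&l->owner_node) != FREE);  /* must be held */
    atomic_store(&l->owner_node, FREE);
}
```

Because waiters on the holder's node retry sooner than waiters on other nodes, the lock tends to be handed over within a node, which is one way to read the "communication affinity" remark.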
Performance Results
Realistic microbenchmark, 2-node WildFire, 28 CPUs
[Figure: iteration time [seconds] (3 to 12) vs. critical_work (0 to 2000); series: Spin, MCS, HBO; inset labels: WF, 14 14]
[Figure: node handoffs [%] (0 to 60) vs. critical_work (0 to 2000)]
Fairness?
Fairness Study
Realistic microbenchmark, 2-node WildFire, 28 CPUs
[Figure: number of finished processors (0 to 28) vs. time [seconds] (0 to 15); series: Spin, MCS, HBO]
Application Performance
28-processor runs
[Figure: normalized speedup (0 to 2.5) for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, and Average; series: Spin, Spin EXP, MCS, HBO; annotation: ≈ 4x]
Total Traffic: Raytrace
[Figure: total traffic for Raytrace (axis 0 to 1.4), split into local transactions and global transactions; bars: Spin, Spin EXP, MCS, HBO]
HBO Locks inside Linux Kernel
Patch provided by Silicon Graphics, Inc.
• Linux-IA64 kernel implementation, May 2005
Page-fault handler runs 3x faster
• 60 processors
Outline
NUCA Locks
DSZOOM – Software-based Shared Memory
TMA – Trap-based Memory Architecture
The DSZOOM Proposal
Run entire protocol in requesting-processor
• No protocol agent communication!
Assumes user-level remote memory access
• put, get, and atomics [ InfiniBand ]
Fine-grain memory protocols (e.g., 64 bytes)
Hardware-like memory models [Shasta, Blizzard, Sirocco]
“Squeezing” Protocols into Binaries…
Surrounding code (shown in both programs):
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...
Original Program (the load to be instrumented):
    ld [%o1 + 64], %o0
DSZOOM Program (the same load followed by the fast-path protocol code):
    ld [%o1 + 64], %o0
    mov 255, %g6
    and %g6, %o0, %g6
    cmp %g6, 170
    bne 0x24450
    nop
Fast-path Protocol Code: inserted inline, as above
Slow-path Protocol Code: C-code
Binary/Assembler level instrumentation
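Read as C, the instrumented load above does roughly the following. This is a hedged sketch: the constant 170 and the 255 mask come from the listing, but the names `checked_load`, `slow_path_fn`, and `demo_slow_path` are hypothetical, and the reading that a match diverts to the slow-path protocol code is an interpretation of the slide, not something it states.

```c
#include <stdint.h>

#define MAGIC_BYTE 170u  /* the value compared against in the listing (0xAA) */

/* Hypothetical hook standing in for the slow-path protocol code (C-code). */
typedef uint32_t (*slow_path_fn)(const uint32_t *addr);

/* Fast-path check that follows every instrumented load. */
static inline uint32_t checked_load(const uint32_t *addr, slow_path_fn slow_path) {
    uint32_t value = *addr;             /* ld  [%o1 + 64], %o0            */
    if ((value & 255u) != MAGIC_BYTE)   /* mov 255 / and / cmp 170 / bne  */
        return value;                   /* common case: data is usable    */
    /* Low byte matches: assumed to mean the data may be invalid, so the
     * slow path must decide (it also covers valid data that happens to
     * end in the same byte). */
    return slow_path(addr);
}

/* Demo slow path for illustration only: counts calls, returns memory as-is. */
static int slow_path_calls = 0;
static uint32_t demo_slow_path(const uint32_t *addr) {
    slow_path_calls++;
    return *addr;
}
```

In the common case the check costs a handful of extra instructions per load and no extra memory reference.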
Write Permission Caching
Problem: store instrumentation relies on locking
• More complex instrumentation
Solution: write permission cache (WPC)
• Small and fast software-managed cache
• Keeps write permissions
The WPC idea:
• Exploit store locality
• Dynamically reduce the number of memory references in store checking code
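A minimal C sketch of the WPC idea, assuming a single-entry cache and a 64-byte coherence unit. The names `wpc_t`, `checked_store`, and `acquire_write_permission` are hypothetical, and the slow path here is only a stand-in that counts calls; the locking and protocol work it represents is not modeled.

```c
#include <stdint.h>

#define LINE_SHIFT 6             /* assumed 64-byte coherence unit */
#define WPC_EMPTY  UINTPTR_MAX   /* no line cached yet */

typedef struct {
    uintptr_t line;                 /* line whose write permission is held */
    unsigned long slow_path_calls;  /* statistics for this sketch only */
} wpc_t;

static void wpc_init(wpc_t *w) {
    w->line = WPC_EMPTY;
    w->slow_path_calls = 0;
}

/* Hypothetical slow path: stands in for the lock-based store protocol
 * that obtains write permission for `line` and gives up the old entry. */
static void acquire_write_permission(wpc_t *w, uintptr_t line) {
    w->slow_path_calls++;
    w->line = line;
}

/* Store check: one compare against the WPC entry on a hit. */
static void checked_store(wpc_t *w, uint32_t *addr, uint32_t value) {
    uintptr_t line = (uintptr_t)addr >> LINE_SHIFT;
    if (line != w->line)                 /* WPC miss */
        acquire_write_permission(w, line);
    *addr = value;                       /* permission is cached: just store */
}
```

With store locality, consecutive stores to the same line hit the cached entry, so only the first store to a line pays for the slow path.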
Other “Features”
Two kinds of protocols
• Invalidate
• Update
Many optimizations
• Instrumentation scheduling (update and invalidate)
• Instrumentation batching (invalidate)
• WPC-based write batching (update)
• WPC-based dirty-data filtering (update)
• Private-data filtering (update)
• # of WPC entries (update and invalidate)
• Coherence unit size (update and invalidate)
Coherence Flags and Profiling
Coherence flags
• Similar to optimization flags of compilers
• Possible scenario:
    gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c
Execution profiling
• Similar to profile feedback of compilers
• Helps find appropriate coherence flag settings
• Low-overhead implementation in DSZOOM: less than 30 percent overhead
• Works for both small and large input sets
DSZOOM Results
2-node WildFire, 16 CPUs
[Figure: normalized execution time (0.0 to 3.0) for fft, lu-c, lu-nc, radix, barnes, fmm, ocean-c, ocean-nc, radiosity, raytrace, water-nsq, water-sp, and average; series: HW-DSM, inv-64, inv-dwpc-64, PROFILED BEST; annotations: 1.45x, 1.11x]
Outline
NUCA Locks
DSZOOM – Software-based Shared Memory
TMA – Trap-based Memory Architecture
Instrumentation Drawbacks
Surrounding code (shown in both programs):
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...
Original Program:
    ld [%o1 + 64], %o0
DSZOOM Program:
    ld [%o1 + 64], %o0
    mov 255, %g6
    and %g6, %o0, %g6
    cmp %g6, 170
    bne 0x24450
    nop
Fast-path Protocol Code: inserted inline, as above
Slow-path Protocol Code: C-code
• Binary transparency?
• Run-time execution overhead
Trap-Based Memory Architectures
Basic idea
• Detect fine-grained coherence violations in hardware
• Trigger a coherence trap when one occurs
• Maintain coherence by software protocols
No memory system modifications
• Minimal processor modifications
Binary transparency
• No need to instrument binaries/applications
TMA Lite
Proof-of-concept Implementation
Load permission check
• Hardware implementation of software check
• Predefined “magic-value” convention
Store permission check
• Hardware WPC
• Can be seen as a very small cache
• Operates on virtual addresses
• Accessed in parallel with the data TLB
TMA Lite Performance
[TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]
[Figure: normalized execution time (0 to 2.5); series: HW-DSM, DSZOOM DWPC, PROFILED BEST, TMA; annotations: 1.75x, 1.01x]
Topics not Presented
RH lock algorithm
• Controlled (un)fairness
HBO_GT and HBO_GT_SD algorithms
• Global throttling and starvation detection
DSZOOM implementation details
• Instrumentation challenges; scheduling, batching, etc.
• Bandwidth filtering techniques; dirty- & private-data
Innovative TMA simulation tricks
• Low-level “good days” hacks
• Reusing Simics checkpoints