Behavior of Synchronization Methods in Commonly Used Languages and Systems
Yiannis Nikolakopoulos ([email protected])
Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden
Developing a multithreaded application…
The boss wants .NET
The client wants speed…
(C++?)
Java is nice
Multicores everywhere
The worker threads need to access data
Concurrent Data Structures
Then we need Synchronization.
Implementing Concurrent Data Structures

Implementation choices:
• Coarse Grain Locking
• Fine Grain Locking
• Test And Set
• Array Locks
• Lock Free
• And more!

The implementation can become the performance bottleneck, and its behavior also depends on the runtime system and the hardware platform. Which is the fastest/most scalable?
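The test-and-set style locks listed above can be sketched in C++ as follows (a minimal illustration of the two spinning strategies, not the study's benchmark code):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Test-And-Set (TAS) lock: every acquisition attempt is an atomic
// exchange, so spinning threads keep invalidating the lock's cache line.
class TasLock {
    std::atomic<bool> held{false};
public:
    void lock()   { while (held.exchange(true, std::memory_order_acquire)) { } }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Test-Test-And-Set (TTAS) lock: spin on a plain read first, so waiting
// threads mostly hit their local cache until the lock appears free.
class TtasLock {
    std::atomic<bool> held{false};
public:
    void lock() {
        for (;;) {
            while (held.load(std::memory_order_relaxed)) { }  // local spin
            if (!held.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Increment a shared counter from nThreads threads; the final value is
// exactly nThreads * iters only if the lock provides mutual exclusion.
template <class Lock>
long run(int nThreads, long iters) {
    Lock lk;
    long counter = 0;
    std::vector<std::thread> ts;
    for (int i = 0; i < nThreads; ++i)
        ts.emplace_back([&] {
            for (long j = 0; j < iters; ++j) {
                lk.lock();
                ++counter;            // critical section
                lk.unlock();
            }
        });
    for (auto& t : ts) t.join();
    return counter;
}
```

Under contention, TTAS generates far less cache-coherence traffic than TAS, which is one reason the two behave differently across architectures.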
Problem Statement
• How does the interplay of the above parameters and the different synchronization methods affect the performance and behavior of concurrent data structures?
Outline
• Introduction
• Experiment Setup
• Highlights of Study and Results
• Conclusion
Which data structures to study?
Represent different levels of contention:
• Queue - 1 or 2 contention points
• Hash table - multiple contention points
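The queue's "1 or 2 contention points" can be made concrete with a two-lock queue sketch in the style of Michael and Scott's two-lock algorithm (an illustration, not necessarily the implementation measured in the study): enqueuers serialize only on the tail lock and dequeuers only on the head lock.

```cpp
#include <mutex>
#include <optional>

// Two-lock queue: a singly linked list with a dummy head node.
// Enqueuers contend only on tailLock, dequeuers only on headLock,
// giving the queue its (at most) two contention points.
class TwoLockQueue {
    struct Node { int value; Node* next; };
    Node* head;                    // always points to a dummy node
    Node* tail;                    // last node in the list
    std::mutex headLock, tailLock;
public:
    TwoLockQueue() { head = tail = new Node{0, nullptr}; }
    ~TwoLockQueue() {
        while (head) { Node* n = head->next; delete head; head = n; }
    }
    void enqueue(int v) {
        Node* n = new Node{v, nullptr};
        std::lock_guard<std::mutex> g(tailLock);   // contention point 1
        tail->next = n;
        tail = n;
    }
    std::optional<int> dequeue() {
        std::lock_guard<std::mutex> g(headLock);   // contention point 2
        Node* first = head->next;
        if (!first) return std::nullopt;           // queue is empty
        int v = first->value;
        delete head;                               // retire the old dummy
        head = first;                              // first becomes the dummy
        return v;
    }
};
```

A coarse-grain variant would use a single lock (one contention point); a hash table spreads operations over many bucket-level contention points instead.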
How do we choose an implementation?
Possible criteria:
• Framework dependencies
• Programmability
• "Good" performance
Interpreting “good”
• Throughput: the more operations completed per time unit, the better.
• Is this enough?
Non-fairness
What to measure?
• Throughput: data structure operations completed per time unit.
• Fairness: operations completed by thread i relative to the average operations per thread.
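One plausible reading of the fairness fraction above (operations by thread i over the average operations per thread) can be sketched in C++; taking the minimum over threads is an assumption of this sketch, not necessarily the study's exact definition:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Fairness sketch: the slowest thread's operation count divided by the
// average operations per thread. A value of 1.0 means every thread
// completed the same number of operations; values near 0 mean some
// thread was starved.
double fairness(const std::vector<long>& opsPerThread) {
    if (opsPerThread.empty()) return 1.0;
    long minOps = *std::min_element(opsPerThread.begin(), opsPerThread.end());
    double avg  = std::accumulate(opsPerThread.begin(), opsPerThread.end(), 0.0)
                  / opsPerThread.size();
    return avg == 0.0 ? 1.0 : minOps / avg;
}
```

This yields the 0-to-1 range seen on the fairness axes of the plots that follow.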
Implementation Parameters
Programming environments: C++, Java, C# (.NET, Mono)
Synchronization methods (all environments): TAS, TTAS, Lock-free, Array lock
- C++: PMutex, lock-free memory management
- Java: Reentrant lock, synchronized
- C#: lock construct, Mutex
NUMA architectures:
- Intel Nehalem, 2 x 6 cores (24 HW threads)
- AMD Bulldozer, 4 x 12 cores (48 HW threads)
Do they influence fairness?
Experiment Parameters
• Different levels of contention
• Number of threads
• Measured time intervals
Outline
• Introduction
• Experiment Setup
• Highlights of Study and Results
– Queue: fairness, Intel vs AMD, throughput vs fairness
– Hash Table: Intel vs AMD, scalability
• Conclusion
Observations: Queue
Fairness can change across different measurement intervals (24 threads, high contention).
[Chart: fairness vs. measurement interval (400 to 10000 ms), C# (.NET); series: Intel and AMD, Lock-free and TAS]
Observations: Queue
Significantly different fairness behavior in different architectures (24 threads, high contention).
[Chart: fairness vs. measurement interval (400 to 10000 ms), Java; series: Intel TAS, TTAS, Synchronized, Lock-free]
Observations: Queue
Significantly different fairness behavior in different architectures (24 threads, high contention); lock-free is less affected in this case.
[Chart: fairness vs. measurement interval (400 to 10000 ms), Java; series: Intel and AMD for TAS, TTAS, Synchronized, Lock-free]
Queue: Throughput vs Fairness
[Charts: C++ queue on Intel, 0.6 s interval; left: fairness vs. number of threads (2 to 48), right: operations per ms (thousands) vs. number of threads; series: TTAS, Lock-free, PMutex]
Observations: Hash table
• Operations are distributed in different buckets
• Things get interesting when #threads > #buckets
• Tradeoff between throughput and fairness
– Different winners and losers
– Contention is lowered in the linked list components
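The per-bucket contention described above can be illustrated with a minimal fine-grain locked hash set (a sketch with one mutex per bucket; the implementations in the study are more elaborate): once #threads exceeds #buckets, collisions on some bucket lock become unavoidable.

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <vector>

// Fine-grain locked hash set: one mutex per bucket, so two threads
// contend only when their keys hash to the same bucket. Contention
// inside a bucket is further limited to a short linked-list traversal.
class LockedHashSet {
    std::vector<std::list<int>> buckets;
    std::vector<std::mutex> locks;
    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % buckets.size();
    }
public:
    explicit LockedHashSet(std::size_t nBuckets)
        : buckets(nBuckets), locks(nBuckets) {}
    bool insert(int key) {                         // false if already present
        std::size_t i = index(key);
        std::lock_guard<std::mutex> g(locks[i]);
        for (int k : buckets[i]) if (k == key) return false;
        buckets[i].push_back(key);
        return true;
    }
    bool contains(int key) {
        std::size_t i = index(key);
        std::lock_guard<std::mutex> g(locks[i]);
        for (int k : buckets[i]) if (k == key) return true;
        return false;
    }
};
```

With, say, 4 buckets and 24 threads, at least 6 threads map to each bucket in the worst case, which is exactly the regime where the throughput/fairness tradeoffs above appear.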
Observations: Hash table
Fairness differences in Hash table across architectures (24 threads, high contention).
[Chart: fairness vs. measurement interval (400 to 10000 ms), C# (Mono); series: Intel TAS, TTAS, Lock-free]
Observations: Hash table
Fairness differences in Hash table across architectures (24 threads, high contention); lock-free is again not affected.
[Chart: fairness vs. measurement interval (400 to 10000 ms), C# (Mono); series: Intel and AMD for TAS, TTAS, Lock-free]
Observations: Hash table
In C++, custom memory management and lock-free implementations excel in scalability and performance.
[Charts: successful operations per ms (thousands) vs. number of threads (2 to 48); left: C++ with TAS, TTAS, Lock-free, Array Lock, PMutex, Lock-free with memory management; right: Java with TAS, TTAS, Lock-free, Array Lock, Reentrant, Reentrant Fair, Synchronized]
Conclusion
• Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots
• Scalability comes from more complex, inherently parallel designs and implementations
• Tradeoff between throughput and fairness
– LF Hash table
– Reentrant lock vs Array Lock vs LF Queue
• Fairness can be heavily influenced by HW
– Interesting exceptions
Which is the fastest/most scalable? Is fairness influenced by NUMA?