Wait-Free Queues with Multiple Enqueuers and Dequeuers
Alex Kogan and Erez Petrank
Computer Science, Technion, Israel
FIFO queues
One of the most fundamental and common data structures.
[Figure: a queue holding 5, 3, 2; dequeue removes an element at the head while enqueue adds 9 at the tail.]
Concurrent FIFO queues
A concurrent implementation supports "correct" concurrent adding and removing of elements; correct = linearizable.
Access to the shared memory must be synchronized.
[Figure: several threads concurrently invoke enqueue(9) and dequeue on a queue holding 3, 2; one of the dequeues finds the queue empty.]
Non-blocking synchronization
No thread is blocked waiting for another thread to complete, e.g., no locks / critical sections.
Progress guarantees:
- Obstruction-freedom: progress is guaranteed only in the eventual absence of interference
- Lock-freedom: among all threads trying to apply an operation, one will succeed
- Wait-freedom: a thread completes its operation in a bounded number of steps
Lock-freedom
Among all threads trying to apply an operation, one will succeed.
- Opportunistic approach: make attempts until succeeding
- Only global progress is guaranteed: all but one thread may starve
Many efficient and scalable lock-free queue implementations exist.
Wait-freedom
A thread completes its operation in a bounded number of steps, regardless of what other threads are doing.
A highly desired property of any concurrent data structure, but commonly regarded as inefficient and too costly to achieve.
Particularly important in several domains:
- real-time systems
- systems operating under an SLA
- heterogeneous environments
Related work: existing wait-free queues
Limited concurrency [Lamport'83, David'04, Jayanti & Petrovic'05]:
- one enqueuer and one dequeuer
- multiple enqueuers, one concurrent dequeuer
- multiple dequeuers, one concurrent enqueuer
Universal constructions [Herlihy'91]:
- a generic method to transform any (sequential) object into a lock-free/wait-free concurrent object
- expensive, impractical implementations
(Almost) no experimental results.
Related work: lock-free queue [Michael & Scott'96]
- One of the most scalable and efficient lock-free implementations
- Widely adopted by industry; part of the Java concurrency package
- Relatively simple and intuitive implementation, based on a singly-linked list of nodes
[Figure: a singly-linked list holding 12, 4, 17, with head and tail pointers.]
MS-queue brief review: enqueue
[Figure: enqueue(9) on a list holding 12 and 17: a first CAS links the new node after the last node, and a second CAS advances tail.]
[Figure: a concurrent enqueue(5) whose CAS on the last node's next pointer fails must first help advance tail with a CAS before retrying.]
MS-queue brief review: dequeue
[Figure: dequeue on a list holding 12, 17, 9: a CAS advances head past the node holding 12, which is returned.]
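The MS-queue review above can be summarized in code. This is a minimal sketch of the Michael & Scott algorithm in Java (class and method names are ours), showing the two-CAS enqueue, the helping of a lagging tail, and the dummy-node dequeue:

```java
import java.util.concurrent.atomic.AtomicReference;

// A minimal sketch of the Michael & Scott lock-free queue.
class MSQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head, tail;

    MSQueue() {
        Node<T> dummy = new Node<>(null);          // sentinel node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T v) {
        Node<T> node = new Node<>(v);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (next == null) {
                // first CAS: link the new node after the last node
                if (last.next.compareAndSet(null, node)) {
                    // second CAS: swing tail; may fail if another thread helped
                    tail.compareAndSet(last, node);
                    return;
                }
            } else {
                // tail is lagging: help advance it, then retry
                tail.compareAndSet(last, next);
            }
        }
    }

    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == last) {
                if (next == null) return null;      // empty queue
                tail.compareAndSet(last, next);     // help lagging tail
            } else if (head.compareAndSet(first, next)) {
                return next.value;                  // next becomes the new dummy
            }
        }
    }
}
```

Note that a failed CAS simply triggers a retry; this is exactly the opportunistic, lock-free (but not wait-free) behavior discussed above.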
Our idea (in a nutshell)
- Based on the lock-free queue by Michael & Scott
- Helping mechanism: each operation is applied in a bounded time
- "Wait-free" implementation scheme: each operation is applied exactly once
Helping mechanism
Each operation is assigned a dynamic age-based priority, inspired by the Doorway mechanism used in the Bakery mutex.
Each thread accessing the queue:
- chooses a monotonically increasing phase number
- writes down its phase and operation info in a special state array
- helps all threads with a non-larger phase to apply their operations
State entry per thread:
- phase: long
- pending: boolean
- enqueue: boolean
- node: Node
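The per-thread state entry above can be sketched as an immutable record that is replaced atomically; the class names and the helper below are ours, not the paper's code:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal node type so the sketch is self-contained.
class Node {
    final int value;
    Node(int value) { this.value = value; }
}

// Sketch of the per-thread state entry listed above; the record is immutable,
// so threads publish progress by CASing a fresh OpDesc into the state array.
class OpDesc {
    final long phase;        // age-based priority of the operation
    final boolean pending;   // true until the operation is linearized
    final boolean enqueue;   // true for enqueue, false for dequeue
    final Node node;         // node to insert (enqueue) or node referred to (dequeue)
    OpDesc(long phase, boolean pending, boolean enqueue, Node node) {
        this.phase = phase; this.pending = pending;
        this.enqueue = enqueue; this.node = node;
    }
}

class StateArrayDemo {
    // one entry per thread, updated only through CAS
    static AtomicReferenceArray<OpDesc> makeState(int nThreads) {
        AtomicReferenceArray<OpDesc> state = new AtomicReferenceArray<>(nThreads);
        for (int i = 0; i < nThreads; i++)
            state.set(i, new OpDesc(-1, false, true, null));   // idle entry
        return state;
    }
}
```

Keeping the entry immutable means a helper can never observe a half-updated descriptor: it either sees the old record or the new one.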
Helping mechanism in action
[Figure sequence: a four-entry state array with columns phase, pending, enqueue, node. A thread arriving with phase 10 scans the array and decides "I need to help!" for pending entries with phases 4 and 9, and "I do not need to help!" for non-pending entries; a thread arriving later with phase 11 must in turn help the still-pending phase-10 entry.]
The number of operations that may linearize before any given operation is bounded; hence, wait-freedom.
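The scan illustrated above can be sketched as follows. This is a minimal sketch in which the actual helping calls are replaced by recording which entries would be helped; the class, a trimmed-down state record, and all names are ours:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the helping scan: a thread with phase p helps every pending
// operation whose phase is not larger than p.
class HelpScan {
    // trimmed state record with just the fields the scan reads
    static class OpDesc {
        final long phase; final boolean pending; final boolean enqueue;
        OpDesc(long phase, boolean pending, boolean enqueue) {
            this.phase = phase; this.pending = pending; this.enqueue = enqueue;
        }
    }

    final AtomicReferenceArray<OpDesc> state;
    final List<Integer> helped = new ArrayList<>();   // stands in for helpEnq/helpDeq calls

    HelpScan(OpDesc[] entries) { state = new AtomicReferenceArray<>(entries); }

    void help(long myPhase) {
        for (int tid = 0; tid < state.length(); tid++) {
            OpDesc d = state.get(tid);
            // help only pending operations whose phase is not larger than ours,
            // i.e., operations that started no later than we did
            if (d.pending && d.phase <= myPhase) helped.add(tid);
        }
    }
}
```

Reproducing the scenario in the figure (phases 4 and 9 pending, phase 11 pending but younger, one non-pending entry), a thread with phase 10 helps exactly the first two.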
Optimized helping
The basic scheme has two drawbacks:
- the number of steps executed by each thread on every operation depends on n (the number of threads), even when there is no contention
- it creates scenarios where many threads help the same operations, e.g., when many threads access the queue concurrently, producing a large amount of redundant work
Optimization: help one thread at a time, in a cyclic manner
- faster threads help slower peers in parallel
- reduces the amount of redundant work
How to choose the phase numbers
Every time ti chooses a phase number, it is greater than the number of any thread that made its choice before ti:
- this defines a logical order on operations and provides wait-freedom
Like in the Bakery mutex: scan through state and take the maximal phase value + 1; requires O(n) steps.
Alternative: use an atomic counter; requires O(1) steps.
[Figure: state entries with phases 4, 3, and 5; a scanning thread computes the new phase 6.]
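Both choices can be sketched side by side; the class and method names below are ours:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Two ways to pick a fresh phase number, as described above.
class PhaseChoice {
    final AtomicLongArray phases;            // current phase of each thread's state entry
    final AtomicLong counter = new AtomicLong();

    PhaseChoice(int nThreads) { phases = new AtomicLongArray(nThreads); }

    // Bakery-style: O(n) scan over the state entries, then max + 1.
    long maxPhasePlusOne() {
        long max = -1;
        for (int i = 0; i < phases.length(); i++)
            max = Math.max(max, phases.get(i));
        return max + 1;
    }

    // Counter-based: O(1) fetch-and-increment on a shared atomic counter.
    long nextFromCounter() { return counter.incrementAndGet(); }
}
```

Either way, a thread's phase is strictly larger than the phase of every thread that finished choosing before it, which is all the ordering the helping mechanism needs.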
“Wait-free” design scheme
Break each operation into three atomic steps, which:
- can be executed by different threads
- cannot be interleaved
1. Initial change of the internal structure: concurrent operations realize that there is an operation-in-progress
2. Updating the state of the operation-in-progress as being performed (linearized)
3. Fixing the internal structure: finalizing the operation-in-progress
Internal structures
[Figure: the queue as a singly-linked list of nodes with head and tail pointers, next to the per-thread state array with columns phase, pending, enqueue, node.]
enqTid: int — holds the ID of the thread that performs / has performed the insertion of the node into the queue.
[Figure: each node is annotated with its enqTid; two elements were enqueued by Thread 0, one element was enqueued by Thread 1.]
deqTid: int — holds the ID of the thread that performs / has performed the removal of the node from the queue.
[Figure: the first node's deqTid is set to 1, marking the element dequeued by Thread 1.]
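The node layout described above can be sketched as follows; enqTid and deqTid come from the slides, while the class name and the choice of an atomic deqTid claimed via CAS are our reconstruction:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the node layout: enqTid records who inserts the node, and
// deqTid starts at -1 ("not dequeued") and is claimed with a single CAS,
// which serves as the dequeuer's initial change of the internal structure.
class WFNode {
    final int value;
    final AtomicReference<WFNode> next = new AtomicReference<>(null);
    final int enqTid;                                    // inserting thread's ID
    final AtomicInteger deqTid = new AtomicInteger(-1);  // removing thread's ID
    WFNode(int value, int enqTid) { this.value = value; this.enqTid = enqTid; }
}
```

Because deqTid can move from -1 to a thread ID at most once, two threads can never both believe they removed the same node.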
enqueue operation
[Figure sequence: thread 2 performs enqueue(6). It creates a new node (value 6, enqTid 2) and announces the operation in its state entry (phase 10, pending = true, enqueue = true). Then the three atomic steps are applied, each with a CAS: Step 1 links the new node after the last node; Step 2 resets the pending flag in the state entry, marking the operation as performed; Step 3 fixes the internal structure by advancing tail.]
[Figure sequence: concurrently, thread 0 creates a node and announces enqueue(3) with phase 11; its operation goes through the same three CAS steps, which may be executed by helping threads.]
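Putting the pieces together, the enqueue path can be sketched as below. This is our condensed reconstruction of the scheme described in the slides, not the paper's code: it covers only enqueue, uses the basic (non-cyclic) helping, and all names are ours.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the wait-free enqueue: announce, help all older-or-equal
// pending operations, and drive each one through the three CAS steps.
class WFEnqueueSketch {
    static final class Node {
        final int value; final int enqTid;
        final AtomicReference<Node> next = new AtomicReference<>(null);
        Node(int value, int enqTid) { this.value = value; this.enqTid = enqTid; }
    }
    static final class OpDesc {
        final long phase; final boolean pending; final boolean enqueue; final Node node;
        OpDesc(long phase, boolean pending, boolean enqueue, Node node) {
            this.phase = phase; this.pending = pending;
            this.enqueue = enqueue; this.node = node;
        }
    }

    final AtomicReference<Node> head, tail;
    final AtomicReferenceArray<OpDesc> state;

    WFEnqueueSketch(int nThreads) {
        Node sentinel = new Node(0, -1);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
        state = new AtomicReferenceArray<>(nThreads);
        for (int i = 0; i < nThreads; i++) state.set(i, new OpDesc(-1, false, true, null));
    }

    void enq(int tid, int value) {
        long phase = maxPhase() + 1;                       // announce with a fresh phase
        state.set(tid, new OpDesc(phase, true, true, new Node(value, tid)));
        for (int i = 0; i < state.length(); i++) {         // help all ops with phase <= ours
            OpDesc d = state.get(i);
            if (d.pending && d.phase <= phase) helpEnq(i, d.phase);
        }
        helpFinish();                                      // make sure tail is fixed
    }

    void helpEnq(int tid, long phase) {
        while (isPending(tid, phase)) {
            Node last = tail.get(), next = last.next.get();
            if (last != tail.get()) continue;
            if (next == null) {
                // Step 1: link the announced node after the last node
                if (isPending(tid, phase)
                        && last.next.compareAndSet(null, state.get(tid).node))
                    helpFinish();                          // then Steps 2 and 3
            } else {
                helpFinish();                              // someone is mid-insert: finish it
            }
        }
    }

    void helpFinish() {
        Node last = tail.get(), next = last.next.get();
        if (next == null) return;
        int tid = next.enqTid;                             // owner of the in-progress insert
        OpDesc cur = state.get(tid);
        if (last == tail.get() && cur.node == next) {
            // Step 2: mark the operation as performed (pending = false)
            state.compareAndSet(tid, cur, new OpDesc(cur.phase, false, true, next));
            // Step 3: fix the internal structure by swinging tail
            tail.compareAndSet(last, next);
        }
    }

    boolean isPending(int tid, long phase) {
        OpDesc d = state.get(tid);
        return d.pending && d.phase == phase;
    }

    long maxPhase() {
        long max = -1;
        for (int i = 0; i < state.length(); i++) max = Math.max(max, state.get(i).phase);
        return max;
    }
}
```

The key property is that each of the three steps succeeds exactly once, no matter which thread executes it: a losing CAS simply means some helper already performed that step.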
dequeue operation
[Figure sequence: thread 2 performs dequeue on a queue holding 12, 4, 17. It announces the operation in its state entry (phase 10, pending = true, enqueue = false), then updates its state entry with a CAS to refer to the first node. The three atomic steps follow, each with a CAS: Step 1 writes the dequeuer's ID into the first node's deqTid field; Step 2 resets the pending flag in the state entry, marking the operation as performed; Step 3 fixes the internal structure by advancing head.]
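The CAS sequence of a single dequeue can be sketched in isolation. This sketch shows only the steps illustrated above for one uncontended operation on a pre-built list; it deliberately omits the helping logic, and all names are ours:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Single-operation sketch of the dequeue steps: announce, point the state
// entry at the first node, claim deqTid (Step 1), flip pending (Step 2),
// advance head (Step 3).
class WFDequeueSketch {
    static final class Node {
        final int value;
        final AtomicReference<Node> next = new AtomicReference<>(null);
        final AtomicInteger deqTid = new AtomicInteger(-1);
        Node(int value) { this.value = value; }
    }
    static final class OpDesc {
        final boolean pending; final Node node;   // node: first node, set before Step 1
        OpDesc(boolean pending, Node node) { this.pending = pending; this.node = node; }
    }

    final AtomicReference<Node> head;
    final AtomicReference<OpDesc> myState = new AtomicReference<>(new OpDesc(false, null));

    WFDequeueSketch(int... values) {
        Node sentinel = new Node(0);
        head = new AtomicReference<>(sentinel);
        Node cur = sentinel;
        for (int v : values) { Node n = new Node(v); cur.next.set(n); cur = n; }
    }

    int deq(int tid) {
        Node first = head.get().next.get();       // the element to be removed
        OpDesc announced = new OpDesc(true, first);
        myState.set(announced);                   // announce + refer to the first node
        // Step 1: claim the node by CASing our ID into deqTid
        first.deqTid.compareAndSet(-1, tid);
        // Step 2: linearize by flipping pending off
        myState.compareAndSet(announced, new OpDesc(false, first));
        // Step 3: fix the structure: the claimed node becomes the new sentinel
        head.compareAndSet(head.get(), first);
        return first.value;
    }
}
```

Recording the first node in the state entry before Step 1 is what lets helpers agree on which value the pending dequeue must return.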
Performance evaluation
Architecture: two 2.5 GHz quad-core Xeon E5420 processors; two 1.6 GHz quad-core Xeon E5310 processors
# threads: 8 / 8 / 8
RAM: 16GB / 16GB / 16GB
OS: CentOS 5.5 Server / Ubuntu 8.10 Server / RedHat Enterprise 5.3 Server
Java: Sun's Java SE Runtime 1.6.0 update 22, 64-bit Server VM
Benchmarks
Enqueue-Dequeue benchmark:
- the queue is initially empty
- each thread iteratively performs enqueue and then dequeue
- 1,000,000 iterations per thread
50%-Enqueue benchmark:
- the queue is initialized with 1000 elements
- each thread decides uniformly at random which operation to perform, with equal odds for enqueue and dequeue
- 1,000,000 operations per thread
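The Enqueue-Dequeue benchmark loop can be sketched as below, with java.util.concurrent.ConcurrentLinkedQueue standing in for the compared implementations; the class name is ours, and the iteration count is reduced here from the 1,000,000 used in the slides:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the Enqueue-Dequeue benchmark: each thread repeatedly enqueues
// and then dequeues; we measure total completion time across all threads.
class EnqDeqBench {
    static long run(Queue<Integer> q, int nThreads, int iters) {
        Thread[] ts = new Thread[nThreads];
        long start = System.nanoTime();
        for (int t = 0; t < nThreads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < iters; i++) {
                    q.add(i);      // enqueue, then
                    q.poll();      // dequeue
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.nanoTime() - start;   // completion time in nanoseconds
    }
}
```

Since every thread enqueues before it dequeues, the queue never underflows and ends the run empty, making the workload identical across the compared queues.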
Tested algorithms
Compared implementations:
- MS-queue
- Base wait-free queue
- Optimized wait-free queue
  - Opt 1: optimized helping (help one thread at a time)
  - Opt 2: atomic counter-based phase calculation
Measured: completion time as a function of the number of threads.
Enqueue-Dequeue benchmark
TBD: add figures
The impact of optimizations
TBD: add figures
Optimizing further: false sharing
False sharing is created on accesses to the state array; it is resolved by padding the state entries with dummy fields.
TBD: add figures
Optimizing further: memory management
Every attempt to update state is preceded by an allocation of a new record:
- these records can be reused when the attempt fails
- (more) validation checks can be performed to reduce the number of failed attempts
When an operation is finished, remove the reference from state to the list node, helping the garbage collector.
Implementing the queue without GC
Apply the Hazard Pointers technique [Michael'04]:
- each thread is associated with hazard pointers: single-writer multi-reader registers used by threads to point to objects they may access later
- when an object should be deleted, a thread stores its address in a special stack; once in a while, it scans the stack and recycles objects only if no hazard pointer points to them
In our case, the technique can be applied with a slight modification in the dequeue method.
Summary
First wait-free queue implementation supporting multiple enqueuers and dequeuers.
Wait-freedom incurs an inherent trade-off:
- it bounds the completion time of a single operation
- it has a cost in the “typical” case
This additional cost can be reduced to a tolerable level.
The proposed design scheme might be applicable to other wait-free data structures.
Thank you! Questions?