Wait-Free Queues with Multiple Enqueuers and Dequeuers
Alex Kogan and Erez Petrank
Computer Science, Technion, Israel
FIFO queues
One of the most fundamental and common data structures.
[Figure: a queue holding 5, 3, 2; dequeue removes an element at the head while enqueue adds 9 at the tail.]
Concurrent FIFO queues
A concurrent implementation supports "correct" concurrent adding and removing of elements; correct = linearizable.
Access to the shared memory must be synchronized.
[Figure: several threads concurrently invoke enqueue(9) and dequeue on a queue holding 3, 2; one of the dequeues finds the queue empty.]
Non-blocking synchronization
No thread is blocked waiting for another thread to complete, e.g., no locks / critical sections.
Progress guarantees:
- Obstruction-freedom: progress is guaranteed only in the eventual absence of interference
- Lock-freedom: among all threads trying to apply an operation, one will succeed
- Wait-freedom: a thread completes its operation in a bounded number of steps
Lock-freedom
Among all threads trying to apply an operation, one will succeed.
- Opportunistic approach: make attempts until succeeding
- Only global progress is guaranteed: all but one thread may starve
Many efficient and scalable lock-free queue implementations exist.
Wait-freedom
A thread completes its operation in a bounded number of steps, regardless of what other threads are doing.
A highly desired property of any concurrent data structure, but commonly regarded as inefficient and too costly to achieve.
Particularly important in several domains:
- real-time systems
- systems operating under an SLA
- heterogeneous environments
Related work: existing wait-free queues
Limited concurrency [Lamport'83, David'04, Jayanti & Petrovic'05]:
- one enqueuer and one dequeuer
- multiple enqueuers, one concurrent dequeuer
- multiple dequeuers, one concurrent enqueuer
Universal constructions [Herlihy'91]:
- a generic method to transform any (sequential) object into a lock-free/wait-free concurrent object
- expensive, impractical implementations
(Almost) no experimental results.
Related work: lock-free queue [Michael & Scott'96]
- One of the most scalable and efficient lock-free implementations
- Widely adopted by industry; part of the Java concurrency package
- Relatively simple and intuitive implementation, based on a singly-linked list of nodes
[Figure: a singly-linked list holding 12, 4, 17, with head and tail pointers.]
MS-queue brief review: enqueue
[Figure: enqueue(9) on a list holding 12 and 17: a first CAS links the new node after the last node, and a second CAS advances tail.]
[Figure: a concurrent enqueue(5) whose CAS on the last node's next pointer fails must first help advance tail with a CAS before retrying.]
MS-queue brief review: dequeue
[Figure: dequeue on a list holding 12, 17, 9: a CAS advances head past the node holding 12, which is returned.]
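The MS-queue review above can be summarized in code. This is a minimal sketch of the Michael & Scott algorithm in Java (class and method names are ours), showing the two-CAS enqueue, the helping of a lagging tail, and the dummy-node dequeue:

```java
import java.util.concurrent.atomic.AtomicReference;

// A minimal sketch of the Michael & Scott lock-free queue.
class MSQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head, tail;

    MSQueue() {
        Node<T> dummy = new Node<>(null);          // sentinel node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T v) {
        Node<T> node = new Node<>(v);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (next == null) {
                // first CAS: link the new node after the last node
                if (last.next.compareAndSet(null, node)) {
                    // second CAS: swing tail; may fail if another thread helped
                    tail.compareAndSet(last, node);
                    return;
                }
            } else {
                // tail is lagging: help advance it, then retry
                tail.compareAndSet(last, next);
            }
        }
    }

    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == last) {
                if (next == null) return null;      // empty queue
                tail.compareAndSet(last, next);     // help lagging tail
            } else if (head.compareAndSet(first, next)) {
                return next.value;                  // next becomes the new dummy
            }
        }
    }
}
```

Note that a failed CAS simply triggers a retry; this is exactly the opportunistic, lock-free (but not wait-free) behavior discussed above.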
Our idea (in a nutshell)
- Based on the lock-free queue by Michael & Scott
- Helping mechanism: each operation is applied in a bounded time
- "Wait-free" implementation scheme: each operation is applied exactly once
Helping mechanism
Each operation is assigned a dynamic age-based priority, inspired by the Doorway mechanism used in the Bakery mutex.
Each thread accessing the queue:
- chooses a monotonically increasing phase number
- writes down its phase and operation info in a special state array
- helps all threads with a non-larger phase to apply their operations
State entry per thread:
- phase: long
- pending: boolean
- enqueue: boolean
- node: Node
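The per-thread state entry above can be sketched as an immutable record that is replaced atomically; the class names and the helper below are ours, not the paper's code:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal node type so the sketch is self-contained.
class Node {
    final int value;
    Node(int value) { this.value = value; }
}

// Sketch of the per-thread state entry listed above; the record is immutable,
// so threads publish progress by CASing a fresh OpDesc into the state array.
class OpDesc {
    final long phase;        // age-based priority of the operation
    final boolean pending;   // true until the operation is linearized
    final boolean enqueue;   // true for enqueue, false for dequeue
    final Node node;         // node to insert (enqueue) or node referred to (dequeue)
    OpDesc(long phase, boolean pending, boolean enqueue, Node node) {
        this.phase = phase; this.pending = pending;
        this.enqueue = enqueue; this.node = node;
    }
}

class StateArrayDemo {
    // one entry per thread, updated only through CAS
    static AtomicReferenceArray<OpDesc> makeState(int nThreads) {
        AtomicReferenceArray<OpDesc> state = new AtomicReferenceArray<>(nThreads);
        for (int i = 0; i < nThreads; i++)
            state.set(i, new OpDesc(-1, false, true, null));   // idle entry
        return state;
    }
}
```

Keeping the entry immutable means a helper can never observe a half-updated descriptor: it either sees the old record or the new one.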
Helping mechanism in action
[Figure sequence: a four-entry state array with columns phase, pending, enqueue, node. A thread arriving with phase 10 scans the array and decides "I need to help!" for pending entries with phases 4 and 9, and "I do not need to help!" for non-pending entries; a thread arriving later with phase 11 must in turn help the still-pending phase-10 entry.]
The number of operations that may linearize before any given operation is bounded; hence, wait-freedom.
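The scan illustrated above can be sketched as follows. This is a minimal sketch in which the actual helping calls are replaced by recording which entries would be helped; the class, a trimmed-down state record, and all names are ours:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the helping scan: a thread with phase p helps every pending
// operation whose phase is not larger than p.
class HelpScan {
    // trimmed state record with just the fields the scan reads
    static class OpDesc {
        final long phase; final boolean pending; final boolean enqueue;
        OpDesc(long phase, boolean pending, boolean enqueue) {
            this.phase = phase; this.pending = pending; this.enqueue = enqueue;
        }
    }

    final AtomicReferenceArray<OpDesc> state;
    final List<Integer> helped = new ArrayList<>();   // stands in for helpEnq/helpDeq calls

    HelpScan(OpDesc[] entries) { state = new AtomicReferenceArray<>(entries); }

    void help(long myPhase) {
        for (int tid = 0; tid < state.length(); tid++) {
            OpDesc d = state.get(tid);
            // help only pending operations whose phase is not larger than ours,
            // i.e., operations that started no later than we did
            if (d.pending && d.phase <= myPhase) helped.add(tid);
        }
    }
}
```

Reproducing the scenario in the figure (phases 4 and 9 pending, phase 11 pending but younger, one non-pending entry), a thread with phase 10 helps exactly the first two.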
Optimized helping
The basic scheme has two drawbacks:
- the number of steps executed by each thread on every operation depends on n (the number of threads), even when there is no contention
- it creates scenarios where many threads help the same operations, e.g., when many threads access the queue concurrently, producing a large amount of redundant work
Optimization: help one thread at a time, in a cyclic manner
- faster threads help slower peers in parallel
- reduces the amount of redundant work
How to choose the phase numbers
Every time ti chooses a phase number, it is greater than the number of any thread that made its choice before ti:
- this defines a logical order on operations and provides wait-freedom
Like in the Bakery mutex: scan through state and take the maximal phase value + 1; requires O(n) steps.
Alternative: use an atomic counter; requires O(1) steps.
[Figure: state entries with phases 4, 3, and 5; a scanning thread computes the new phase 6.]
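Both choices can be sketched side by side; the class and method names below are ours:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Two ways to pick a fresh phase number, as described above.
class PhaseChoice {
    final AtomicLongArray phases;            // current phase of each thread's state entry
    final AtomicLong counter = new AtomicLong();

    PhaseChoice(int nThreads) { phases = new AtomicLongArray(nThreads); }

    // Bakery-style: O(n) scan over the state entries, then max + 1.
    long maxPhasePlusOne() {
        long max = -1;
        for (int i = 0; i < phases.length(); i++)
            max = Math.max(max, phases.get(i));
        return max + 1;
    }

    // Counter-based: O(1) fetch-and-increment on a shared atomic counter.
    long nextFromCounter() { return counter.incrementAndGet(); }
}
```

Either way, a thread's phase is strictly larger than the phase of every thread that finished choosing before it, which is all the ordering the helping mechanism needs.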
“Wait-free” design scheme
Break each operation into three atomic steps, which:
- can be executed by different threads
- cannot be interleaved
1. Initial change of the internal structure: concurrent operations realize that there is an operation-in-progress
2. Updating the state of the operation-in-progress as being performed (linearized)
3. Fixing the internal structure: finalizing the operation-in-progress
Internal structures
[Figure: the queue as a singly-linked list of nodes with head and tail pointers, next to the per-thread state array with columns phase, pending, enqueue, node.]
enqTid: int — holds the ID of the thread that performs / has performed the insertion of the node into the queue.
[Figure: each node is annotated with its enqTid; two elements were enqueued by Thread 0, one element was enqueued by Thread 1.]
deqTid: int — holds the ID of the thread that performs / has performed the removal of the node from the queue.
[Figure: the first node's deqTid is set to 1, marking the element dequeued by Thread 1.]
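The node layout described above can be sketched as follows; enqTid and deqTid come from the slides, while the class name and the choice of an atomic deqTid claimed via CAS are our reconstruction:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the node layout: enqTid records who inserts the node, and
// deqTid starts at -1 ("not dequeued") and is claimed with a single CAS,
// which serves as the dequeuer's initial change of the internal structure.
class WFNode {
    final int value;
    final AtomicReference<WFNode> next = new AtomicReference<>(null);
    final int enqTid;                                    // inserting thread's ID
    final AtomicInteger deqTid = new AtomicInteger(-1);  // removing thread's ID
    WFNode(int value, int enqTid) { this.value = value; this.enqTid = enqTid; }
}
```

Because deqTid can move from -1 to a thread ID at most once, two threads can never both believe they removed the same node.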
enqueue operation
[Figure sequence: thread 2 performs enqueue(6). It creates a new node (value 6, enqTid 2) and announces the operation in its state entry (phase 10, pending = true, enqueue = true). Then the three atomic steps are applied, each with a CAS: Step 1 links the new node after the last node; Step 2 resets the pending flag in the state entry, marking the operation as performed; Step 3 fixes the internal structure by advancing tail.]
[Figure sequence: concurrently, thread 0 creates a node and announces enqueue(3) with phase 11; its operation goes through the same three CAS steps, which may be executed by helping threads.]
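Putting the pieces together, the enqueue path can be sketched as below. This is our condensed reconstruction of the scheme described in the slides, not the paper's code: it covers only enqueue, uses the basic (non-cyclic) helping, and all names are ours.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the wait-free enqueue: announce, help all older-or-equal
// pending operations, and drive each one through the three CAS steps.
class WFEnqueueSketch {
    static final class Node {
        final int value; final int enqTid;
        final AtomicReference<Node> next = new AtomicReference<>(null);
        Node(int value, int enqTid) { this.value = value; this.enqTid = enqTid; }
    }
    static final class OpDesc {
        final long phase; final boolean pending; final boolean enqueue; final Node node;
        OpDesc(long phase, boolean pending, boolean enqueue, Node node) {
            this.phase = phase; this.pending = pending;
            this.enqueue = enqueue; this.node = node;
        }
    }

    final AtomicReference<Node> head, tail;
    final AtomicReferenceArray<OpDesc> state;

    WFEnqueueSketch(int nThreads) {
        Node sentinel = new Node(0, -1);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
        state = new AtomicReferenceArray<>(nThreads);
        for (int i = 0; i < nThreads; i++) state.set(i, new OpDesc(-1, false, true, null));
    }

    void enq(int tid, int value) {
        long phase = maxPhase() + 1;                       // announce with a fresh phase
        state.set(tid, new OpDesc(phase, true, true, new Node(value, tid)));
        for (int i = 0; i < state.length(); i++) {         // help all ops with phase <= ours
            OpDesc d = state.get(i);
            if (d.pending && d.phase <= phase) helpEnq(i, d.phase);
        }
        helpFinish();                                      // make sure tail is fixed
    }

    void helpEnq(int tid, long phase) {
        while (isPending(tid, phase)) {
            Node last = tail.get(), next = last.next.get();
            if (last != tail.get()) continue;
            if (next == null) {
                // Step 1: link the announced node after the last node
                if (isPending(tid, phase)
                        && last.next.compareAndSet(null, state.get(tid).node))
                    helpFinish();                          // then Steps 2 and 3
            } else {
                helpFinish();                              // someone is mid-insert: finish it
            }
        }
    }

    void helpFinish() {
        Node last = tail.get(), next = last.next.get();
        if (next == null) return;
        int tid = next.enqTid;                             // owner of the in-progress insert
        OpDesc cur = state.get(tid);
        if (last == tail.get() && cur.node == next) {
            // Step 2: mark the operation as performed (pending = false)
            state.compareAndSet(tid, cur, new OpDesc(cur.phase, false, true, next));
            // Step 3: fix the internal structure by swinging tail
            tail.compareAndSet(last, next);
        }
    }

    boolean isPending(int tid, long phase) {
        OpDesc d = state.get(tid);
        return d.pending && d.phase == phase;
    }

    long maxPhase() {
        long max = -1;
        for (int i = 0; i < state.length(); i++) max = Math.max(max, state.get(i).phase);
        return max;
    }
}
```

The key property is that each of the three steps succeeds exactly once, no matter which thread executes it: a losing CAS simply means some helper already performed that step.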
dequeue operation
[Figure sequence: thread 2 performs dequeue on a queue holding 12, 4, 17. It announces the operation in its state entry (phase 10, pending = true, enqueue = false), then updates its state entry with a CAS to refer to the first node. The three atomic steps follow, each with a CAS: Step 1 writes the dequeuer's ID into the first node's deqTid field; Step 2 resets the pending flag in the state entry, marking the operation as performed; Step 3 fixes the internal structure by advancing head.]
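The CAS sequence of a single dequeue can be sketched in isolation. This sketch shows only the steps illustrated above for one uncontended operation on a pre-built list; it deliberately omits the helping logic, and all names are ours:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Single-operation sketch of the dequeue steps: announce, point the state
// entry at the first node, claim deqTid (Step 1), flip pending (Step 2),
// advance head (Step 3).
class WFDequeueSketch {
    static final class Node {
        final int value;
        final AtomicReference<Node> next = new AtomicReference<>(null);
        final AtomicInteger deqTid = new AtomicInteger(-1);
        Node(int value) { this.value = value; }
    }
    static final class OpDesc {
        final boolean pending; final Node node;   // node: first node, set before Step 1
        OpDesc(boolean pending, Node node) { this.pending = pending; this.node = node; }
    }

    final AtomicReference<Node> head;
    final AtomicReference<OpDesc> myState = new AtomicReference<>(new OpDesc(false, null));

    WFDequeueSketch(int... values) {
        Node sentinel = new Node(0);
        head = new AtomicReference<>(sentinel);
        Node cur = sentinel;
        for (int v : values) { Node n = new Node(v); cur.next.set(n); cur = n; }
    }

    int deq(int tid) {
        Node first = head.get().next.get();       // the element to be removed
        OpDesc announced = new OpDesc(true, first);
        myState.set(announced);                   // announce + refer to the first node
        // Step 1: claim the node by CASing our ID into deqTid
        first.deqTid.compareAndSet(-1, tid);
        // Step 2: linearize by flipping pending off
        myState.compareAndSet(announced, new OpDesc(false, first));
        // Step 3: fix the structure: the claimed node becomes the new sentinel
        head.compareAndSet(head.get(), first);
        return first.value;
    }
}
```

Recording the first node in the state entry before Step 1 is what lets helpers agree on which value the pending dequeue must return.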
Performance evaluation
Architecture: two 2.5 GHz quad-core Xeon E5420 processors; two 1.6 GHz quad-core Xeon E5310 processors
# threads: 8 / 8 / 8
RAM: 16GB / 16GB / 16GB
OS: CentOS 5.5 Server / Ubuntu 8.10 Server / RedHat Enterprise 5.3 Server
Java: Sun's Java SE Runtime 1.6.0 update 22, 64-bit Server VM
Benchmarks
Enqueue-Dequeue benchmark:
- the queue is initially empty
- each thread iteratively performs enqueue and then dequeue
- 1,000,000 iterations per thread
50%-Enqueue benchmark:
- the queue is initialized with 1000 elements
- each thread decides uniformly at random which operation to perform, with equal odds for enqueue and dequeue
- 1,000,000 operations per thread
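The Enqueue-Dequeue benchmark loop can be sketched as below, with java.util.concurrent.ConcurrentLinkedQueue standing in for the compared implementations; the class name is ours, and the iteration count is reduced here from the 1,000,000 used in the slides:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the Enqueue-Dequeue benchmark: each thread repeatedly enqueues
// and then dequeues; we measure total completion time across all threads.
class EnqDeqBench {
    static long run(Queue<Integer> q, int nThreads, int iters) {
        Thread[] ts = new Thread[nThreads];
        long start = System.nanoTime();
        for (int t = 0; t < nThreads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < iters; i++) {
                    q.add(i);      // enqueue, then
                    q.poll();      // dequeue
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.nanoTime() - start;   // completion time in nanoseconds
    }
}
```

Since every thread enqueues before it dequeues, the queue never underflows and ends the run empty, making the workload identical across the compared queues.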
Tested algorithms
Compared implementations:
- MS-queue
- Base wait-free queue
- Optimized wait-free queue
  - Opt 1: optimized helping (help one thread at a time)
  - Opt 2: atomic counter-based phase calculation
Measured: completion time as a function of the number of threads.
Enqueue-Dequeue benchmark
TBD: add figures
The impact of optimizations
TBD: add figures
Optimizing further: false sharing
False sharing is created on accesses to the state array; it is resolved by padding the state entries with dummy fields.
TBD: add figures
Optimizing further: memory management
Every attempt to update state is preceded by an allocation of a new record:
- these records can be reused when the attempt fails
- (more) validation checks can be performed to reduce the number of failed attempts
When an operation is finished, remove the reference from state to the list node, helping the garbage collector.
Implementing the queue without GC
Apply the Hazard Pointers technique [Michael'04]:
- each thread is associated with hazard pointers: single-writer multi-reader registers used by threads to point to objects they may access later
- when an object should be deleted, a thread stores its address in a special stack; once in a while, it scans the stack and recycles objects only if no hazard pointer points to them
In our case, the technique can be applied with a slight modification in the dequeue method.
Summary
First wait-free queue implementation supporting multiple enqueuers and dequeuers.
Wait-freedom incurs an inherent trade-off:
- it bounds the completion time of a single operation
- it has a cost in the “typical” case
This additional cost can be reduced to a tolerable level.
The proposed design scheme might be applicable to other wait-free data structures.
Thank you! Questions?