Concurrent Data Structures in Architectures with Limited Shared Memory Support


Concurrent Data Structures in Architectures with Limited Shared Memory Support

Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas

Distributed Computing and Systems
Chalmers University of Technology
Gothenburg, Sweden

Yiannis Nikolakopoulos ioaniko@chalmers.se


Concurrent Data Structures
• Parallel/concurrent programming: share data among threads/processes, sharing a uniform address space (shared memory)
• Inter-process/thread communication and synchronization: both a tool and a goal


Concurrent Data Structures: Implementations
• Coarse-grained locking
 – Easy but slow…
• Fine-grained locking
 – Fast/scalable but: error-prone, deadlocks
• Non-blocking
 – Atomic hardware primitives (e.g. TAS, CAS)
 – Good progress guarantees (lock-/wait-freedom)
 – Scalable
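As a concrete example of the TAS primitive the slides keep returning to, a test-and-set spinlock can be sketched with C11 atomics; `atomic_flag` is a portable stand-in for the SCC's per-core hardware TAS register (names here are illustrative, not from the talk):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* TAS spinlock: the SCC exposes one hardware test-and-set register
 * per core; C11's atomic_flag is a portable analogue. */
typedef struct { atomic_flag flag; } tas_lock_t;

void tas_lock_init(tas_lock_t *l) {
    atomic_flag_clear(&l->flag);
}

void tas_lock_acquire(tas_lock_t *l) {
    /* Spin until the test-and-set observes the flag clear. */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ; /* busy-wait */
}

void tas_lock_release(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

Note that a bare TAS lock is blocking; the lock-/wait-free structures the slide mentions need stronger primitives such as CAS, which is exactly what the SCC lacks.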


What’s happening in hardware?
• Multi-cores → many-cores
• “Cache coherency wall” [Kumar et al 2011]
 – A shared address space will not scale
 – Universal atomic primitives (CAS, LL/SC) harder to implement
• Shared memory → message passing
[Diagram: tile with IA cores, caches, shared and local memory]


• Networks on chip (NoC)
• Short distance between cores
• Message passing model support
• Shared memory support
• Eliminated cache coherency
• Limited support for synchronization primitives
[Diagram: tile with IA cores, caches, shared and local memory]
Can we have data structures that are fast, scalable, and with good progress guarantees?


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores arranged on 24 tiles
• NoC connects all tiles
• TestAndSet register per core


SCC: Architecture Overview
• Memory controllers: to private & shared main memory
• Message Passing Buffer (MPB): 16 KB per tile


Programming Challenges in SCC
• Message passing, but…
 – MPB small for large data transfers
 – Data replication is difficult
• No universal atomic primitives (CAS); no wait-free implementations [Herlihy 91]


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Concurrent FIFO Queues
• Main idea:
 – Data are stored in shared off-chip memory
 – Message passing for communication/coordination
• 2 design methodologies:
 – Lock-based synchronization (2-lock Queue)
 – Message passing-based synchronization (MP-Queue, MP-Acks)


2-lock Queue
• Array based, in shared off-chip memory (SHM)
• Head/tail pointers in MPBs
• 1 lock for each pointer [Michael&Scott96]
• TAS-based locks on 2 cores
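The layout above can be sketched in C, with C11 atomics standing in for the SCC's TAS registers and MPB-resident pointers; the names and fixed capacity are illustrative, not from the paper:

```c
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_CAP 1024          /* illustrative capacity */

/* One data node in shared off-chip memory (SHM). */
typedef struct {
    int value;
} node_t;

/* Two-lock array queue in the style of Michael & Scott: one TAS lock
 * per pointer.  On the SCC the locks live in per-core TAS registers,
 * the head/tail pointers in MPBs, and the nodes in off-chip SHM. */
typedef struct {
    atomic_flag head_lock;      /* TAS lock on one core     */
    atomic_flag tail_lock;      /* TAS lock on another core */
    size_t head, tail;          /* head/tail pointers (MPB) */
    node_t nodes[QUEUE_CAP];    /* data nodes (SHM)         */
} two_lock_queue_t;

void two_lock_queue_init(two_lock_queue_t *q) {
    atomic_flag_clear(&q->head_lock);
    atomic_flag_clear(&q->tail_lock);
    q->head = q->tail = 0;
}
```

Placing each lock on a different core matters on the SCC because a TAS register is physically local to one tile; the later "lock location" measurements probe exactly this.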


2-lock Queue: “Traditional” Enqueue Algorithm
• Acquire lock
• Read & update tail pointer (MPB)
• Add data (SHM)
• Release lock
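The steps above can be sketched as follows (a single-file sketch: a fixed-size array queue with illustrative names, C11 atomics in place of the hardware TAS):

```c
#include <stdatomic.h>

#define CAP 1024                 /* illustrative capacity */

typedef struct {
    atomic_flag tail_lock;       /* TAS lock guarding the tail */
    int tail;                    /* tail pointer (MPB)         */
    int data[CAP];               /* data nodes (SHM)           */
} queue_t;

void queue_init(queue_t *q) { atomic_flag_clear(&q->tail_lock); q->tail = 0; }

static void lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Traditional enqueue: the slow off-chip data write happens while the
 * tail lock is still held, so the lock serializes that write too. */
int enqueue(queue_t *q, int v) {
    lock(&q->tail_lock);              /* acquire lock            */
    if (q->tail == CAP) { unlock(&q->tail_lock); return -1; }
    int slot = q->tail;               /* read tail pointer (MPB) */
    q->tail = slot + 1;               /* update tail pointer     */
    q->data[slot] = v;                /* add data (SHM)          */
    unlock(&q->tail_lock);            /* release lock            */
    return 0;
}
```

Holding the lock across the SHM write is what the optimized version on the next slide removes.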


2-lock Queue: Optimized Enqueue Algorithm
• Acquire lock
• Read & update tail pointer (MPB)
• Release lock
• Add data to node (SHM)
• Set memory flag to dirty
Why? No cache coherency!


2-lock Queue: Dequeue Algorithm
• Acquire lock
• Read & update head pointer
• Release lock
• Check flag
• Read node data
What about progress?
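A matching dequeue sketch (same assumptions as the enqueue sketches; the emptiness check is simplified for a single-file example) makes the progress question concrete:

```c
#include <stdatomic.h>

#define CAP 1024

typedef struct { atomic_int dirty; int value; } slot_t;

typedef struct {
    atomic_flag head_lock, tail_lock;
    int head, tail;              /* head/tail pointers (MPB) */
    slot_t slots[CAP];           /* data nodes (SHM)         */
} queue_t;

void queue_init(queue_t *q) {
    atomic_flag_clear(&q->head_lock);
    atomic_flag_clear(&q->tail_lock);
    q->head = q->tail = 0;
    for (int i = 0; i < CAP; i++) atomic_init(&q->slots[i].dirty, 0);
}

static void lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Optimized enqueue from the previous slide, for completeness. */
int enqueue(queue_t *q, int v) {
    lock(&q->tail_lock);
    if (q->tail == CAP) { unlock(&q->tail_lock); return -1; }
    int slot = q->tail++;
    unlock(&q->tail_lock);
    q->slots[slot].value = v;
    atomic_store_explicit(&q->slots[slot].dirty, 1, memory_order_release);
    return 0;
}

/* Dequeue: the lock only covers the head pointer; the dequeuer then
 * spins on the slot's dirty flag.  If the matching enqueuer never
 * finishes its SHM write, this spin never ends -- the progress
 * concern the slide raises. */
int dequeue(queue_t *q, int *out) {
    lock(&q->head_lock);                  /* acquire lock           */
    if (q->head == q->tail) { unlock(&q->head_lock); return -1; }
    int slot = q->head++;                 /* read & update head ptr */
    unlock(&q->head_lock);                /* release lock           */
    while (!atomic_load_explicit(&q->slots[slot].dirty,
                                 memory_order_acquire))
        ;                                 /* check flag (spin)      */
    *out = q->slots[slot].value;          /* read node data         */
    atomic_store(&q->slots[slot].dirty, 0);  /* slot reusable       */
    return 0;
}
```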


2-lock Queue: Implementation
• Head/tail pointers (MPB)
• Data nodes (SHM)
• Locks? On which tile(s)?


Message Passing-based Queue
• Data nodes in SHM
• Access coordinated by a server node that keeps the head/tail pointers
• Enqueuers/dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit
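The server side of this design can be sketched single-threaded; this is a simplified model (illustrative names and slot layout; the server here writes the data itself, whereas in the actual design the enqueuer writes to SHM and sets the dirty bit):

```c
#include <stdatomic.h>

#define CORES 4                  /* illustrative client count */
#define CAP   64                 /* illustrative capacity     */

typedef enum { REQ_NONE, REQ_ENQ, REQ_DEQ } req_t;

/* One dedicated per-client request slot in the MPB. */
typedef struct {
    _Atomic req_t kind;          /* written by client, cleared by server */
    int arg;                     /* value to enqueue                     */
    _Atomic int done;            /* server sets when the reply is ready  */
    int reply;                   /* dequeued value, or -1 if empty       */
} mpb_slot_t;

typedef struct {
    mpb_slot_t slots[CORES];     /* dedicated MPB slots            */
    int head, tail;              /* only the server touches these  */
    int data[CAP];               /* data nodes in SHM              */
} mp_queue_t;

void mp_queue_init(mp_queue_t *q) {
    for (int i = 0; i < CORES; i++) {
        atomic_init(&q->slots[i].kind, REQ_NONE);
        atomic_init(&q->slots[i].done, 0);
        q->slots[i].arg = q->slots[i].reply = 0;
    }
    q->head = q->tail = 0;
}

/* One polling pass by the server core over all client slots. */
void server_poll(mp_queue_t *q) {
    for (int i = 0; i < CORES; i++) {
        mpb_slot_t *s = &q->slots[i];
        req_t k = atomic_load(&s->kind);
        if (k == REQ_NONE) continue;
        if (k == REQ_ENQ) {
            if (q->tail < CAP) { q->data[q->tail++] = s->arg; s->reply = 0; }
            else               { s->reply = -1; }             /* full */
        } else {                                              /* REQ_DEQ */
            s->reply = (q->head < q->tail) ? q->data[q->head++] : -1;
        }
        atomic_store(&s->kind, REQ_NONE);
        atomic_store(&s->done, 1);        /* client spins on this flag */
    }
}
```

Because only the server ever touches head and tail, no atomic read-modify-write on shared pointers is needed, which is exactly what makes this design viable without CAS.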


MP-Queue
[Diagram: an enqueuer (ENQ) and a dequeuer (DEQ) message the server, which holds the TAIL and HEAD pointers; the dequeuer SPINs while the enqueuer ADDs DATA]
What if this fails and the data is never flagged? “Pairwise blocking”: only 1 dequeue blocks.


Adding Acknowledgements
• No more flags! Enqueue sends an ACK when done
• Server maintains a private queue of pointers in SHM
• On ACK: server adds the data location to its private queue
• On dequeue: server returns only ACKed locations
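The server-side bookkeeping can be sketched as plain C (illustrative names; the client-side messaging and SHM writes are elided, and the private ACK queue is a simple array here):

```c
#define CAP 64                   /* illustrative capacity */

typedef struct {
    int data[CAP];               /* data nodes in SHM                       */
    int next_slot;               /* next slot handed out to an enqueuer     */
    int acked[CAP];              /* server-private queue of ACKed locations */
    int ack_head, ack_tail;
} mpacks_queue_t;

/* Server grants a slot to an enqueuer (reply to an ENQ request). */
int server_grant_slot(mpacks_queue_t *q) {
    return (q->next_slot < CAP) ? q->next_slot++ : -1;
}

/* Server receives an ACK: slot `loc` now holds valid data, so it
 * becomes visible to dequeuers. */
void server_on_ack(mpacks_queue_t *q, int loc) {
    q->acked[q->ack_tail++] = loc;
}

/* Server answers a DEQ request from the ACKed locations only, so no
 * dequeuer can ever block on a half-finished enqueue. */
int server_dequeue(mpacks_queue_t *q, int *out) {
    if (q->ack_head == q->ack_tail) return -1;   /* nothing ACKed yet */
    *out = q->data[q->acked[q->ack_head++]];
    return 0;
}
```

A consequence visible in the sketch: data is dequeued in ACK order rather than slot-grant order, which is how a slow enqueuer stops blocking everyone else.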


MP-Acks
[Diagram: the enqueuer sends ENQ and later ACK to the server, which holds TAIL and HEAD; the dequeuer sends DEQ]
No blocking between enqueues/dequeues.


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Evaluation
Benchmark:
• Each core performs Enq/Deq at random
• High/low contention
Questions:
• Performance? Scalability?
• Is it the same for all cores?


Measures
• Throughput: data structure operations completed per time unit [Cederman et al 2013]
• Fairness: operations by core i relative to the average operations per core
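The labels on this slide (operations by core i, average operations per core) suggest normalizing each core's operation count by the mean; one plausible reconstruction of such a measure (the precise definition is in Cederman et al 2013, not on the slide) is:

```latex
% n_i: operations completed by core i;  P: number of cores
\bar{n} = \frac{1}{P} \sum_{j=1}^{P} n_j ,
\qquad
\text{fairness} = \min_{i} \frac{n_i}{\bar{n}}
```

A value near 1 means every core completes close to the average number of operations; a value near 0 means some core is starved.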


Throughput – High Contention


Fairness – High Contention


Throughput VS Lock Location


Throughput VS Lock Location


Conclusion
• Lock-based queue
 – High throughput
 – Less fair
 – Sensitive to lock locations, NoC performance
• MP-based queues
 – Lower throughput
 – Fairer
 – Better liveness properties
 – Promising scalability


Thank you!

ivanw@chalmers.se
ioaniko@chalmers.se


BACKUP SLIDES


Experimental Setup
• 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
• Randomized Enq/Deq operations
• High/low contention
• One thread per core
• 600 ms per execution
• Averaged over 12 runs


Concurrent FIFO Queues
• Typical 2-lock queue [Michael&Scott96]
