Concurrent Data Structures in Architectures with Limited Shared Memory Support


Concurrent Data Structures in Architectures with Limited Shared Memory Support

Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas

Distributed Computing and Systems
Chalmers University of Technology
Gothenburg, Sweden

Yiannis Nikolakopoulos ioaniko@chalmers.se


Concurrent Data Structures
• Parallel/concurrent programming: share data among threads/processes, sharing a uniform address space (shared memory)
• Inter-process/thread communication and synchronization: both a tool and a goal


Concurrent Data Structures: Implementations
• Coarse-grained locking
 – Easy but slow…
• Fine-grained locking
 – Fast/scalable but: error-prone, deadlocks
• Non-blocking
 – Atomic hardware primitives (e.g. TAS, CAS)
 – Good progress guarantees (lock-/wait-freedom)
 – Scalable
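As a concrete example of the TAS primitive the slides keep returning to, a test-and-set spinlock can be sketched with C11 atomics; `atomic_flag` is a portable stand-in for the SCC's per-core hardware TAS register (names here are illustrative, not from the talk):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* TAS spinlock: the SCC exposes one hardware test-and-set register
 * per core; C11's atomic_flag is a portable analogue. */
typedef struct { atomic_flag flag; } tas_lock_t;

void tas_lock_init(tas_lock_t *l) {
    atomic_flag_clear(&l->flag);
}

void tas_lock_acquire(tas_lock_t *l) {
    /* Spin until the test-and-set observes the flag clear. */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ; /* busy-wait */
}

void tas_lock_release(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

Note that a bare TAS lock is blocking; the lock-/wait-free structures the slide mentions need stronger primitives such as CAS, which is exactly what the SCC lacks.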


What’s happening in hardware?
• Multi-cores → many-cores
• “Cache coherency wall” [Kumar et al 2011]
 – A shared address space will not scale
 – Universal atomic primitives (CAS, LL/SC) harder to implement
• Shared memory → message passing
[Diagram: tile with IA cores, caches, shared and local memory]


• Networks on chip (NoC)
• Short distance between cores
• Message passing model support
• Shared memory support
• Eliminated cache coherency
• Limited support for synchronization primitives
[Diagram: tile with IA cores, caches, shared and local memory]
Can we have data structures that are fast, scalable, and with good progress guarantees?


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores arranged on 24 tiles
• NoC connects all tiles
• TestAndSet register per core


SCC: Architecture Overview
• Memory controllers: to private & shared main memory
• Message Passing Buffer (MPB): 16 KB per tile


Programming Challenges in SCC
• Message passing, but…
 – MPB small for large data transfers
 – Data replication is difficult
• No universal atomic primitives (CAS); no wait-free implementations [Herlihy 91]


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Concurrent FIFO Queues
• Main idea:
 – Data are stored in shared off-chip memory
 – Message passing for communication/coordination
• 2 design methodologies:
 – Lock-based synchronization (2-lock Queue)
 – Message passing-based synchronization (MP-Queue, MP-Acks)


2-lock Queue
• Array based, in shared off-chip memory (SHM)
• Head/tail pointers in MPBs
• 1 lock for each pointer [Michael&Scott96]
• TAS-based locks on 2 cores
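The layout above can be sketched in C, with C11 atomics standing in for the SCC's TAS registers and MPB-resident pointers; the names and fixed capacity are illustrative, not from the paper:

```c
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_CAP 1024          /* illustrative capacity */

/* One data node in shared off-chip memory (SHM). */
typedef struct {
    int value;
} node_t;

/* Two-lock array queue in the style of Michael & Scott: one TAS lock
 * per pointer.  On the SCC the locks live in per-core TAS registers,
 * the head/tail pointers in MPBs, and the nodes in off-chip SHM. */
typedef struct {
    atomic_flag head_lock;      /* TAS lock on one core     */
    atomic_flag tail_lock;      /* TAS lock on another core */
    size_t head, tail;          /* head/tail pointers (MPB) */
    node_t nodes[QUEUE_CAP];    /* data nodes (SHM)         */
} two_lock_queue_t;

void two_lock_queue_init(two_lock_queue_t *q) {
    atomic_flag_clear(&q->head_lock);
    atomic_flag_clear(&q->tail_lock);
    q->head = q->tail = 0;
}
```

Placing each lock on a different core matters on the SCC because a TAS register is physically local to one tile; the later "lock location" measurements probe exactly this.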


2-lock Queue: “Traditional” Enqueue Algorithm
• Acquire lock
• Read & update tail pointer (MPB)
• Add data (SHM)
• Release lock
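The steps above can be sketched as follows (a single-file sketch: a fixed-size array queue with illustrative names, C11 atomics in place of the hardware TAS):

```c
#include <stdatomic.h>

#define CAP 1024                 /* illustrative capacity */

typedef struct {
    atomic_flag tail_lock;       /* TAS lock guarding the tail */
    int tail;                    /* tail pointer (MPB)         */
    int data[CAP];               /* data nodes (SHM)           */
} queue_t;

void queue_init(queue_t *q) { atomic_flag_clear(&q->tail_lock); q->tail = 0; }

static void lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Traditional enqueue: the slow off-chip data write happens while the
 * tail lock is still held, so the lock serializes that write too. */
int enqueue(queue_t *q, int v) {
    lock(&q->tail_lock);              /* acquire lock            */
    if (q->tail == CAP) { unlock(&q->tail_lock); return -1; }
    int slot = q->tail;               /* read tail pointer (MPB) */
    q->tail = slot + 1;               /* update tail pointer     */
    q->data[slot] = v;                /* add data (SHM)          */
    unlock(&q->tail_lock);            /* release lock            */
    return 0;
}
```

Holding the lock across the SHM write is what the optimized version on the next slide removes.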


2-lock Queue: Optimized Enqueue Algorithm
• Acquire lock
• Read & update tail pointer (MPB)
• Release lock
• Add data to node (SHM)
• Set memory flag to dirty
Why? No cache coherency!


2-lock Queue: Dequeue Algorithm
• Acquire lock
• Read & update head pointer
• Release lock
• Check flag
• Read node data
What about progress?
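A matching dequeue sketch (same assumptions as the enqueue sketches; the emptiness check is simplified for a single-file example) makes the progress question concrete:

```c
#include <stdatomic.h>

#define CAP 1024

typedef struct { atomic_int dirty; int value; } slot_t;

typedef struct {
    atomic_flag head_lock, tail_lock;
    int head, tail;              /* head/tail pointers (MPB) */
    slot_t slots[CAP];           /* data nodes (SHM)         */
} queue_t;

void queue_init(queue_t *q) {
    atomic_flag_clear(&q->head_lock);
    atomic_flag_clear(&q->tail_lock);
    q->head = q->tail = 0;
    for (int i = 0; i < CAP; i++) atomic_init(&q->slots[i].dirty, 0);
}

static void lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Optimized enqueue from the previous slide, for completeness. */
int enqueue(queue_t *q, int v) {
    lock(&q->tail_lock);
    if (q->tail == CAP) { unlock(&q->tail_lock); return -1; }
    int slot = q->tail++;
    unlock(&q->tail_lock);
    q->slots[slot].value = v;
    atomic_store_explicit(&q->slots[slot].dirty, 1, memory_order_release);
    return 0;
}

/* Dequeue: the lock only covers the head pointer; the dequeuer then
 * spins on the slot's dirty flag.  If the matching enqueuer never
 * finishes its SHM write, this spin never ends -- the progress
 * concern the slide raises. */
int dequeue(queue_t *q, int *out) {
    lock(&q->head_lock);                  /* acquire lock           */
    if (q->head == q->tail) { unlock(&q->head_lock); return -1; }
    int slot = q->head++;                 /* read & update head ptr */
    unlock(&q->head_lock);                /* release lock           */
    while (!atomic_load_explicit(&q->slots[slot].dirty,
                                 memory_order_acquire))
        ;                                 /* check flag (spin)      */
    *out = q->slots[slot].value;          /* read node data         */
    atomic_store(&q->slots[slot].dirty, 0);  /* slot reusable       */
    return 0;
}
```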


2-lock Queue: Implementation
• Head/tail pointers (MPB)
• Data nodes (SHM)
• Locks? On which tile(s)?


Message Passing-based Queue
• Data nodes in SHM
• Access coordinated by a server node that keeps the head/tail pointers
• Enqueuers/dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit
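The server side of this design can be sketched single-threaded; this is a simplified model (illustrative names and slot layout; the server here writes the data itself, whereas in the actual design the enqueuer writes to SHM and sets the dirty bit):

```c
#include <stdatomic.h>

#define CORES 4                  /* illustrative client count */
#define CAP   64                 /* illustrative capacity     */

typedef enum { REQ_NONE, REQ_ENQ, REQ_DEQ } req_t;

/* One dedicated per-client request slot in the MPB. */
typedef struct {
    _Atomic req_t kind;          /* written by client, cleared by server */
    int arg;                     /* value to enqueue                     */
    _Atomic int done;            /* server sets when the reply is ready  */
    int reply;                   /* dequeued value, or -1 if empty       */
} mpb_slot_t;

typedef struct {
    mpb_slot_t slots[CORES];     /* dedicated MPB slots            */
    int head, tail;              /* only the server touches these  */
    int data[CAP];               /* data nodes in SHM              */
} mp_queue_t;

void mp_queue_init(mp_queue_t *q) {
    for (int i = 0; i < CORES; i++) {
        atomic_init(&q->slots[i].kind, REQ_NONE);
        atomic_init(&q->slots[i].done, 0);
        q->slots[i].arg = q->slots[i].reply = 0;
    }
    q->head = q->tail = 0;
}

/* One polling pass by the server core over all client slots. */
void server_poll(mp_queue_t *q) {
    for (int i = 0; i < CORES; i++) {
        mpb_slot_t *s = &q->slots[i];
        req_t k = atomic_load(&s->kind);
        if (k == REQ_NONE) continue;
        if (k == REQ_ENQ) {
            if (q->tail < CAP) { q->data[q->tail++] = s->arg; s->reply = 0; }
            else               { s->reply = -1; }             /* full */
        } else {                                              /* REQ_DEQ */
            s->reply = (q->head < q->tail) ? q->data[q->head++] : -1;
        }
        atomic_store(&s->kind, REQ_NONE);
        atomic_store(&s->done, 1);        /* client spins on this flag */
    }
}
```

Because only the server ever touches head and tail, no atomic read-modify-write on shared pointers is needed, which is exactly what makes this design viable without CAS.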


MP-Queue
[Diagram: an enqueuer (ENQ) and a dequeuer (DEQ) message the server, which holds the TAIL and HEAD pointers; the dequeuer SPINs while the enqueuer ADDs DATA]
What if this fails and the data is never flagged? “Pairwise blocking”: only 1 dequeue blocks.


Adding Acknowledgements
• No more flags! Enqueue sends an ACK when done
• Server maintains a private queue of pointers in SHM
• On ACK: server adds the data location to its private queue
• On dequeue: server returns only ACKed locations
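The server-side bookkeeping can be sketched as plain C (illustrative names; the client-side messaging and SHM writes are elided, and the private ACK queue is a simple array here):

```c
#define CAP 64                   /* illustrative capacity */

typedef struct {
    int data[CAP];               /* data nodes in SHM                       */
    int next_slot;               /* next slot handed out to an enqueuer     */
    int acked[CAP];              /* server-private queue of ACKed locations */
    int ack_head, ack_tail;
} mpacks_queue_t;

/* Server grants a slot to an enqueuer (reply to an ENQ request). */
int server_grant_slot(mpacks_queue_t *q) {
    return (q->next_slot < CAP) ? q->next_slot++ : -1;
}

/* Server receives an ACK: slot `loc` now holds valid data, so it
 * becomes visible to dequeuers. */
void server_on_ack(mpacks_queue_t *q, int loc) {
    q->acked[q->ack_tail++] = loc;
}

/* Server answers a DEQ request from the ACKed locations only, so no
 * dequeuer can ever block on a half-finished enqueue. */
int server_dequeue(mpacks_queue_t *q, int *out) {
    if (q->ack_head == q->ack_tail) return -1;   /* nothing ACKed yet */
    *out = q->data[q->acked[q->ack_head++]];
    return 0;
}
```

A consequence visible in the sketch: data is dequeued in ACK order rather than slot-grant order, which is how a slow enqueuer stops blocking everyone else.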


MP-Acks
[Diagram: the enqueuer sends ENQ and later ACK to the server, which holds TAIL and HEAD; the dequeuer sends DEQ]
No blocking between enqueues/dequeues.


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Evaluation
Benchmark:
• Each core performs Enq/Deq at random
• High/low contention
Questions:
• Performance? Scalability?
• Is it the same for all cores?


Measures
• Throughput: data structure operations completed per time unit [Cederman et al 2013]
• Fairness: operations by core i relative to the average operations per core
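The labels on this slide (operations by core i, average operations per core) suggest normalizing each core's operation count by the mean; one plausible reconstruction of such a measure (the precise definition is in Cederman et al 2013, not on the slide) is:

```latex
% n_i: operations completed by core i;  P: number of cores
\bar{n} = \frac{1}{P} \sum_{j=1}^{P} n_j ,
\qquad
\text{fairness} = \min_{i} \frac{n_i}{\bar{n}}
```

A value near 1 means every core completes close to the average number of operations; a value near 0 means some core is starved.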


Throughput – High Contention


Fairness – High Contention


Throughput VS Lock Location


Throughput VS Lock Location


Conclusion
• Lock-based queue
 – High throughput
 – Less fair
 – Sensitive to lock locations, NoC performance
• MP-based queues
 – Lower throughput
 – Fairer
 – Better liveness properties
 – Promising scalability


Thank you!

ivanw@chalmers.se
ioaniko@chalmers.se


BACKUP SLIDES


Experimental Setup
• 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
• Randomized Enq/Deq operations
• High/low contention
• One thread per core
• 600 ms per execution
• Averaged over 12 runs


Concurrent FIFO Queues
• Typical 2-lock queue [Michael&Scott96]
