Multithreading dos and don'ts
ACCU 2015, Bristol – June 2015
Hubert Matthews – Oxyware Ltd
Copyright © 2015 Oxyware Ltd 2/55
Why this talk?
Don't – why not
• Avoid multithreaded programming if you can
  – It's harder to write, to read, to understand and to test than single-threaded code
  – It may appear to work but may just not have failed sufficiently visibly yet
  – Often distracts from the underlying application problem and focuses developers on technical issues
  – A great consumer of developer time and generator of frustration
  – Often avoidable
  – May not deliver the performance benefits you expect
Do – why
• You need it to access the full power of the machine (measure, don't guess!)
• You need to scale your application and your application is CPU-bound
• You are running in a threaded environment
• There is an obvious parallel decomposition of the problem or algorithm
• Other approaches to covering latency and I/O are worse or not available
• You are brave or a masochist (or have an ego)
Alternatives
Single-threaded approaches
• event-driven code
• asynchronous I/O
  • asio, libaio (Linux)
  • overlapped I/O (Windows)
• non-blocking TCP
• UDP
• coroutines or fibers
• separate processes
Multi-threaded approaches
• concurrent library (tasks)
  • TBB, PPL
• concurrent library (data)
  • OpenMP
• message passing
  • MPI
A whole new set of problems
• Getting single-threaded code working well and with good performance can be a challenge
• Multithreading provides a whole new set of ways of getting it wrong or going slower
  – The problems may not show up except under load or at the most inconvenient time
  – They may not be reproducible
  – They will be hard to debug or to measure
  – Knowing which of these problems you have is hard
focus on getting good single-threaded performance first before going multithreaded
Problem decomposition
• Before considering how to implement a parallel solution you have to split the problem up into pieces and find an algorithm for processing and recombining these pieces
  – This split may be trivial for "embarrassingly parallel" problems or hard (travelling salesman problem)
• Classic approaches
  – Data parallel (sections of an array)
  – Task parallel (web requests)
• Interaction between sub-items is key
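The data-parallel split described above can be sketched in a few lines: divide an array into contiguous sections, process each on its own thread, and recombine the partial results. This is a minimal illustration, not code from the talk; the function name, chunking scheme and data are all made up for the example.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Data-parallel decomposition: each thread sums one contiguous section of
// the array into its own slot, then the partials are recombined serially.
// Each thread writes only to its own slot, so no locking is needed.
long parallelSum(const std::vector<int>& data, unsigned numThreads)
{
    std::vector<long> partial(numThreads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / numThreads;

    for (unsigned t = 0; t != numThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end =
            (t + 1 == numThreads) ? data.size() : begin + chunk;
        workers.emplace_back([&partial, &data, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0L);
        });
    }
    for (auto& w : workers)
        w.join();                       // wait for all pieces
    return std::accumulate(partial.begin(), partial.end(), 0L);  // recombine
}
```

(The adjacent slots in `partial` may share a cache line – see the false sharing slide later for why that can hurt performance.)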
Goldilocks
• Parallel approaches have to find the "sweet spot" between two extremes
• Too fine-grained
  – Data – computation dominated by overhead
  – Threads – context switching overhead
• Too coarse-grained
  – Data – load balancing problems
  – Threads – insufficient items to keep threads busy
Testing
• Single-threaded code can be unit tested
  – Repeatable results from isolated code
• Multithreaded code cannot be unit tested easily or reliably
  – Non-deterministic outputs
  – Making them deterministic may be possible
  – Errors are transient (data races)
  – Problems are often performance-related and show up only at scale or under load
allow for scaling down to a single thread for test before scaling up for production
Avoid sharing mutable data
• Shared mutable data is the evil of all computing!
• Read-only data can be shared safely without locks
• Const is your friend
• Pure message-passing approach avoids this

[Diagram: 2×2 quadrant of mutable/immutable versus shared/not shared]
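The "read-only data can be shared safely without locks" point can be shown directly: with no writers there is no data race, so multiple threads may read the same const object concurrently. A minimal sketch; the function name and data are illustrative.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Two threads read the same const vector concurrently with no locking.
// Nobody writes, so there is no data race and no synchronisation needed.
long sumFromTwoReaders(const std::vector<int>& shared)
{
    long sum1 = 0, sum2 = 0;
    std::thread t1([&] {
        sum1 = std::accumulate(shared.begin(), shared.end(), 0L);
    });
    std::thread t2([&] {
        sum2 = std::accumulate(shared.begin(), shared.end(), 0L);
    });
    t1.join();
    t2.join();
    return sum1 + sum2;   // both readers saw a consistent, unchanging view
}
```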
Shared writes don't scale
(graphic by Dmitry Vyukov, http://www.1024cores.net, CC BY-NC-SA 3.0)
single writer principle for speed/scale
Why shared writes don't scaleWhy shared writes don't scale
Core1
L1
L2
Core2
L1
L2
L3
RAM
32+32 KB 1-3 cycles
256 KB 5-20 cycles
4 MB 30-50 cycles
16 GB 100-300 cycles
MESI
• Caches have to Caches have to communicate communicate to ensure to ensure coherent view coherent view
• MESI protocol MESI protocol passes passes messages messages between cachesbetween caches
• Shared writes Shared writes limited by limited by MESI commsMESI comms
MESI
Copyright © 2015 Oxyware Ltd 14/55
Shared writes – cache ping-pong
• Cache line passed between caches
• Hardware serialises writes to the same line
• Therefore zero scalability!!!
• For speed, don't pass ownership: Single Writer Principle
Example – contention costs

std::atomic<int> counter(0);

void count()
{
    for (auto i = 0; i != numLoops; ++i)
        counter++;
}

// run with 4 threads on a 4-core machine
// -O3 -march=native

$ time ./a.out
real 0m1.675s
user 0m6.384s
sys  0m0.007s

$ time taskset -c 1 ./a.out
real 0m0.476s
user 0m0.470s
sys  0m0.006s

3.5 times faster and 13.6 times less CPU when run on one core
Example – contention costs (cont'd)

$ perf stat -D 100 -e cycles,instructions,stalled-cycles-backend,cs ./a.out

Performance counter stats for './a.out':

  16,635,079,957 cycles
     247,668,647 instructions            # 0.01 insns per cycle
                                         # 65.91 stalled cycles per insn
  16,324,609,637 stalled-cycles-backend  # 98.13% backend cycles idle
           1,912 cs

  1.569828247 seconds time elapsed

$ perf stat -D 100 -e cycles,instructions,stalled-cycles-backend,cs taskset -c 1 ./a.out

Performance counter stats for 'taskset -c 1 ./a.out':

   1,135,801,575 cycles
     197,626,046 instructions            # 0.17 insns per cycle
                                         # 5.19 stalled cycles per insn
   1,026,469,027 stalled-cycles-backend  # 90.37% backend cycles idle
             115 cs

  0.480011707 seconds time elapsed
Don't under-synchronise
• Shared variables need to be synchronised correctly
  – Do not rely on guesswork
  – Do not try to cheat
  – Do not rely on unspecified ordering or visibility
• Under-synchronised variables are subject to data races (at least one reader and one writer)
• Causes transient and unreproducible errors
use locks on shared mutable data structures or use single atomics as synchronisation points
Don't over-synchronise
• Shared variables need to be synchronised correctly
  – Too much locking will make the code serialised
  – Locks are there to slow your program down until it is (hopefully) correct
  – Watch out for deadlock and livelock
  – Performance reduces back to slower than a single thread in the worst case because of locking overhead (locks are shared writes)
  – Amdahl's Law kicks in
don't keep adding locks – have a clear plan
Amdahl's Law
• Serial code limits scale, regardless of the number of threads or cores available

Serial portion of code   Maximum speedup
1%                       100x
5%                       20x
10%                      10x
20%                      5x
25%                      4x

avoid non-read-only data sharing to allow for maximum parallelism
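The table above is the limiting case of Amdahl's formula: with serial fraction s and N processors, speedup(N) = 1 / (s + (1 − s)/N), which tends to 1/s as N grows. A small helper (not from the talk) makes the numbers concrete:

```cpp
// Amdahl's Law: speedup with serial fraction s on N processors.
// As N -> infinity this approaches 1/s, which is the "Maximum speedup"
// column in the table (e.g. s = 0.05 caps the speedup at 20x).
double amdahlSpeedup(double serialFraction, double processors)
{
    return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
}
```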
Deadlock and livelock
• If you have more than one lock in your program you may end up in a deadlock (deadly embrace)
  – Have only one lock (may limit performance)
  – Increases lock hold time
• Locking order is important
  – C++11's std::lock
  – Use addresses of locks to guarantee ordering
  – Release order is not important
  – May require exposing internal locks to callers
Time-based synchronisation

locks don't sequence start and finish
use futures to synchronise in time (A comes after B)

[Diagram: horizontal sync (locks) versus time-based sync (futures) in a fork/join model (Amdahl)]
Hardware v. software threads
• There are a limited number of hardware threads available
  – 1 per core
  – 2 per core for Intel hyperthreading
• If there are more s/w than h/w threads then they will have to take turns (oversubscription)
  – Leads to context switching
  – Slow; 1000s of cycles to switch
    • Call to operating system
    • Scheduling
    • Cold cache and TLB
one software thread per hardware thread
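To follow the "one software thread per hardware thread" rule you need to know how many hardware threads there are. The standard query is std::thread::hardware_concurrency(); a small wrapper (illustrative, not from the talk) handles the case where the value is not computable:

```cpp
#include <thread>

// Size a thread pool to the hardware. hardware_concurrency() may
// return 0 when the value is not computable, so fall back to a default.
unsigned poolSize()
{
    unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 2;   // assumed fallback for unknown hardware
}
```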
Queue-based systems
• Systems that are based on queues can have performance issues caused by:
  – Context switching when queues are empty or full
  – Voluntary context switching
  – Shared writes to queue (insert and remove)
  – Processing per item is too small
  – Can be difficult to run in single-threaded mode
  – c.f. Disruptor pattern
be careful with queues if performance is important
Lock hold time and scope
• The time that a lock is held for determines the amount of parallelism
  – Shorter hold times are better
  – Shorter times may also indicate less shared state
• However, small lock scopes may not protect the data across lock scopes adequately
  – Need to consider business-level transactions and logical units of work
  – Can lead to application-level errors because of concurrently changing data
TOCTOU and application errors
• Time-of-check to time-of-use (TOCTOU) errors can lead to application errors
  – Note: these are not data races caused by synchronisation errors (i.e. locking errors)
  – These are caused by concurrent modifications at the application level
  – Usually caused by inappropriate APIs

Sample conversation
Me: Is there any ice cream left, please?
Waiter: I'll check... yes there is
Me: I'll have some please
Waiter: Oops, we've just run out
TOCTOU and application errors (2)
• Locking doesn't help
• Need a different API
  – e.g. putIfAbsent()
  – popIfNotEmpty()
• Single batched operation with exception and failure notification
• Compare-and-set (CAS) is a classic approach (retries)

[Diagram: two threads interleaving read–modify–write cycles on the same data]
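The CAS-with-retries approach mentioned above combines the check and the use into one atomic step, so no other thread can invalidate the check in between. Here is a sketch of a popIfNotEmpty-style operation on a counter ("take one only if any are left"); the function name is made up for the example.

```cpp
#include <atomic>

// Decrement the counter only if it is positive, atomically.
// compare_exchange_weak succeeds only if the counter still holds the
// value we checked; on failure it reloads 'current' and we retry.
bool decrementIfPositive(std::atomic<int>& counter)
{
    int current = counter.load();
    while (current > 0) {
        if (counter.compare_exchange_weak(current, current - 1))
            return true;   // check and use happened as one atomic step
    }
    return false;          // observed empty: fail explicitly, don't block
}
```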
Volatile
• Volatile in C and C++ is of no use for multithreading
  – In Java and C# it means "atomic"
• It disables caching in registers
• It forces memory accesses
• It doesn't ensure cross-thread visibility
• It doesn't affect compiler or hardware reordering of operations
do not use volatile variables except for memory-mapped device I/O
Spin loops and polling
• Spin loops are disastrous on single-threaded machines
  – They just burn CPU cycles until the OS reschedules the thread
• On a multithreaded machine they are of use when they use less CPU than the overhead of a context switch (1000s of cycles)
• Polling with a sleep() uses little CPU but has longer wakeup latency (avg 1/2 polling time)
avoid spin loops unless you have measured the latency-CPU tradeoff
Blocking I/O
• Programs can be CPU, memory, network or I/O limited
• I/O-limited programs that use blocking I/O will often use too many threads to handle blocking calls for I/O
• This causes lots of context switching
• Investigate asynchronous approaches
  – libaio, non-blocking sockets, overlapped I/O
• Damage-limitation approaches such as I/O thread pools
• Watch out for copying of data (zero copy)
Interrupting threads and shutdown
• Don't even think about trying to interrupt another thread
• Plan a clean shutdown mechanism for your program
• Often will involve a cooperative approach
  – Shared stop/start/state flag
  – Shutdown message in message-passing applications
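The shared-flag approach above looks like this: the worker polls an atomic flag at a safe point in its loop and exits cleanly, rather than being killed mid-operation. A minimal sketch; the flag and loop body are illustrative.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> stopRequested(false);
std::atomic<long> iterations(0);

// Cooperative shutdown: the worker checks the flag between units of
// work, so it always exits at a point where its state is consistent.
void worker()
{
    while (!stopRequested.load()) {
        ++iterations;        // stand-in for one unit of real work
    }
    // clean-up runs here; the thread exits in a known state
}
```

Usage: start the thread, set the flag, then join – the join is guaranteed to complete once the flag is seen.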
Thread priorities and scheduling
• If your application requires the use of thread priorities to operate correctly then it's almost certainly broken
• Beware of priority inversion and locking issues
• Thread scheduling is rarely the correct solution
  – Probably implies locking issues and too much contention; fix that first
  – Can be useful in limited circumstances to provide run-to-completion semantics
Deletion
• Be careful about deleting data in concurrent systems; another thread may still have a reference
• Reference counting can help but counters must be thread-safe (std::shared_ptr is, mostly...)
• Avoid concurrent deletions: tbb::concurrent_vector is append-only
• Separate the program into phases so that deletion is in a safe serial part
• Garbage collection is a big win here
Multiple atomics
• Can be used successfully individually
• The problem becomes more about transactional correctness
  – Do the atomics make sense together?
  – Race window between modifying both
  – Initialisation order
  – Visibility of updates

1) use atomic<Data *> instead of multiple atomics (even better, use atomic<const Data *>)
2) use std::call_once for initialisation
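Both recommendations above can be sketched together: publish one atomic pointer to a fully built object instead of several separate atomics, and guard the initialisation with std::call_once. Readers then see either the whole old object or the whole new one, never a half-updated mix. The Config type and field values are made up for illustration.

```cpp
#include <atomic>
#include <mutex>

struct Config {
    int timeoutMs;
    int retries;
};

std::atomic<const Config*> currentConfig(nullptr);
std::once_flag initFlag;

// std::call_once guarantees the initialiser runs exactly once, even if
// several threads race to call initConfig().
void initConfig()
{
    std::call_once(initFlag, [] {
        // Build the object fully, then publish it with one atomic store:
        // no reader can observe a partially initialised Config.
        // (Deliberately leaked here; reclaiming the old object safely is
        // the "Deletion" problem covered on its own slide.)
        currentConfig.store(new Config{500, 3});
    });
}

const Config* readConfig()
{
    return currentConfig.load();   // both fields are consistent together
}
```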
Immutable data and safe publishing
• Immutable data can be shared without locking
• Fewer errors and easy to understand
• Be careful about deletion; are there still references to the object?!
• std::shared_ptr<> has atomic counters but not its body so it must be locked
• Helps with exception safety, transactions and copy-on-write optimisation
publish safely using std::atomic<const Data *>
Error handling
• Propagating errors from one thread to another is tricky
• std::future<> catches exceptions thrown in the called thread and rethrows them in the calling thread when f.get() is called
• Make sure you have a try/catch at the top level of every thread you start
use std::future<> for time sequencing and easier error handling
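The future-based propagation described above is worth seeing in miniature: an exception thrown inside the async task is captured in the future and rethrown in the calling thread at get(). The failing task and function name here are illustrative.

```cpp
#include <future>
#include <stdexcept>
#include <string>

// The worker throws; the exception crosses the thread boundary inside
// the future and surfaces in the caller at f.get().
std::string describeFailure()
{
    std::future<int> f = std::async(std::launch::async, []() -> int {
        throw std::runtime_error("worker failed");  // thrown in the worker
    });
    try {
        f.get();                     // rethrown here, in the calling thread
    } catch (const std::runtime_error& e) {
        return e.what();             // handled where it is convenient
    }
    return "no error";
}
```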
False sharing

[Diagram: two threads writing to separate variables that sit on the same cache line]

• Separate threads can access separate variables on the same cache line (often 64 bytes long)
• Writes by one thread invalidate the cache line for the other thread, leading to "cache ping-pong"
• Major performance killer – effectively shared writes

watch out for false sharing – use padding to length of cache line
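The padding remedy looks like this: align each per-thread counter to cache-line size so each one lives on its own line and one thread's writes do not invalidate the other's cache. 64 bytes is a common line size, not a guarantee (C++17 adds std::hardware_destructive_interference_size for this); the function and loop counts are illustrative.

```cpp
#include <thread>

// alignas(64) pads each counter to a (typical) cache line, so the two
// threads below never write to the same line – no cache ping-pong.
struct alignas(64) PaddedCounter {
    long value = 0;
};

long countInParallel(long loops)
{
    PaddedCounter counters[2];   // one per thread, on separate cache lines
    std::thread t1([&] { for (long i = 0; i != loops; ++i) ++counters[0].value; });
    std::thread t2([&] { for (long i = 0; i != loops; ++i) ++counters[1].value; });
    t1.join();
    t2.join();
    return counters[0].value + counters[1].value;
}
```

Drop the alignas(64) and the result is still correct but slower under load – which is exactly what makes false sharing hard to spot.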
Parallel algorithms
• Some libraries can run in parallel mode without you having to start any threads or do any synchronisation
• Gcc does this for STL algorithms if compiled with -D_GLIBCXX_PARALLEL and -fopenmp
use "free" parallelism if available
Read/write ratio
• Different approaches are appropriate for read-mostly or write-mostly access patterns
• Also depends on lock hold time
• Short hold, lots of writes => CAS, spin lock et al
• Short hold, mixed R/W => distributed mutex
• Short (zero) hold => RCU-style lock free
• Beware of reader/writer lock scaling
select an approach based on data-access patterns
Fast/slow paths
• Know what operations need to be fast
  – Frequent operations
• Avoid locks on the fast path
  – Mutexes, I/O, memory allocation, etc
• Push work to the slow path
  – Maybe use a queue or a background thread
  – Block slow path until there are no fast-path users
  – RCU, garbage collection, distributed read/write mutex
know what needs to be fast
Example – distributed R/W mutex

[Diagram: four read threads, each locking its own per-core mutex; one write thread locking all four]

• A per-core mutex can be locked by only one read thread at a time, so the mutex is uncontended and therefore fast; the read mutex cache line is not shared across cores
• The write thread locks all mutexes to block readers; a slow operation
• Windows: GetCurrentProcessorNumber()
• Linux: sched_getcpu()
• See http://1024cores.net for more details
Distributed R/W mutex read performance

struct alignas(64) PerCoreLock {
    std::mutex lock;
};

PerCoreLock locks[numThreads];

void lockLoop()
{
    for (auto i = 0; i != numLoops; ++i) {
        auto core = sched_getcpu();
        std::lock_guard<std::mutex> guard(locks[core].lock);
    }
}

code as shown:
$ time ./a.out                  $ time taskset -c 1 ./a.out
real 0m0.537s                   real 0m1.992s
user 0m2.019s                   user 0m1.985s
sys  0m0.003s                   sys  0m0.005s

no alignas(64) – false sharing:
$ time ./a.out                  $ time taskset -c 1 ./a.out
real 0m4.204s                   real 0m2.091s
user 0m10.514s                  user 0m2.077s
sys  0m0.005s                   sys  0m0.004s

one shared lock – context switching:
$ time ./a.out                  $ time taskset -c 1 ./a.out
real 0m10.097s                  real 0m2.057s
user 0m11.862s                  user 0m2.045s
sys  0m24.078s                  sys  0m0.003s
Memory model and ordering
• We want our programs to run quickly
• Modern hardware reorders instructions and can run multiple instructions at once
• Compilers can reorder instructions too (e.g. to cover possible cache misses, delay slots, etc)
• Some languages (Java, C++11) have defined a memory model to say what reordering means at the language level
• Don't go there unless you can prove through measurement it's necessary
Need for a memory model

• Correctness is now based on memory, not just code
• Need to control caching (register, L1, L2, etc)

[Diagram: in single-threaded C++03, code reaches memory via compiler reordering (constrained by sequence points) and hardware reordering and caching (constrained by volatile reads and writes); in C++11, two threads' code passes through the same reordering stages, with the cross-thread "happens before" and "synchronizes with" relations defining ordering and visibility]
Instruction interleaving

• "Sequential consistency" means operations in separate threads are interleaved and that all threads see the same interleaving
  – Sequence is preserved within and across threads
• This is the "natural" mental model for programmers to think of thread execution order and memory
  – It is also the C++11 default memory ordering

// thread 1
x = 1;  // 1
r1 = y; // 2

// thread 2
y = 1;  // 3
r2 = x; // 4

// 6 possible sequentially consistent execution orders
1234  // x=1; r1=y; y=1; r2=x;
1324  // x=1; y=1; r1=y; r2=x;
1342  // x=1; y=1; r2=x; r1=y;
3412  // y=1; r2=x; x=1; r1=y;
3142  // y=1; x=1; r2=x; r1=y;
3124  // y=1; x=1; r1=y; r2=x;
Instruction reordering

• In order to gain performance both the compiler and the hardware may reorder instructions
  – the compiler may move loads earlier (to allow for cache misses)
  – the hardware may not write back to memory immediately (store buffers)
• x, y, r1 and r2 are all independent so code can be reordered
• Even worse, changes in one thread may not be visible in another thread so results are not defined – data race

// thread 1 (could be executed in reverse order)
x = 1;  // 1
r1 = y; // 2

// thread 2
y = 1;  // 3
r2 = x; // 4

// 4 factorial (== 24) possible execution orders
1234  // x=1; r1=y; y=1; r2=x;
4321  // r2=x; y=1; r1=y; x=1;
3124  // y=1; x=1; r1=y; r2=x;
// etc...
Hardware memory reordering

[Table: which memory reorderings each architecture permits – columns: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, AMD64, IA-64, zSeries; rows: loads reordered after loads, loads reordered after stores, stores reordered after stores, stores reordered after loads, atomics reordered with loads, atomics reordered with stores, dependent loads reordered, incoherent instruction cache pipeline]

• Hardware can reorder memory operations in different ways
  – Can also depend on operating system
    • Solaris on SPARC uses Total Store Order (TSO)
    • Linux on SPARC uses Relaxed Memory Order (RMO)

http://en.wikipedia.org/wiki/Memory_ordering
Synchronisation with seq. consistency

• This code is correct when sequentially consistent
  – thread 2 doesn't access x until it has been set by thread 1
• But in the presence of reordering it can fail
• The problem is that we haven't specified that cross-thread order or visibility is important
• We need to use synchronisation variables – atomics
• Making everything atomic is slow – 30–60 cycles
  – cache synchronisation is slow and has limited bandwidth

// thread 1
x = 42;
x_init = true;

// thread 2
while (! x_init) {}
y = x;
Synchronisation with atomicsSynchronisation with atomics
• This now works without relying on having sequential consistency everywhere (just atomics)
  – atomics prevent the compiler moving code across accesses
  – atomics also cause memory updates to be visible
• Load and store use sequential consistency
  – uses default parameter of std::memory_order_seq_cst
// thread 1
std::atomic<bool> x_init;
int x;

x = 42;
x_init.store(true);
// or x_init = true;

// thread 2
extern std::atomic<bool> x_init;
extern int x;

while (! x_init.load());
y = x;
// or while (! x_init);
Copyright © 2015 Oxyware Ltd 49/55
Low-level synchronisation detail
• Blue fences prevent the compiler reordering code
  – they don't generate any runtime code
• Red fences force memory to make changes visible
  – they do generate code: fence, lock prefix, CAS opcodes
  – depends heavily on underlying hardware (c.f. reordering)
  – only need one of the two red fences, usually on store
• One reason that threads can't just be a library
// thread 1
std::atomic<bool> x_init;

x = 42;
// blue fence    (prevents x and x_init being reordered)
x_init.store(true);
// red fence     (makes store to x_init visible)

// thread 2
extern std::atomic<bool> x_init;

while (/* red fence */ ! x_init.load());  // red fence in loop forces load of latest value of x_init
// blue fence    (prevents reordering)
y = x;
Copyright © 2015 Oxyware Ltd 50/55
Generated assembler code

// thread 1
x = 42;
// blue fence
x_init.store(true);
// red fence

// thread 2
while (/* red fence */ ! x_init.load());
// blue fence
y = x;

g++ 4.7 output, x86 (sequentially consistent by default):

// thread 1, with atomic bool x_init   (fence needed on store for x86)
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1
    mfence

// thread 1, with plain bool x_init
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1

// thread 2, with atomic bool x_init   (no fence needed on load for x86)
.L25:
    movzx eax, BYTE PTR x_init
    test al, al
    je .L25
    mov eax, DWORD PTR x
    mov DWORD PTR y, eax

// thread 2, with plain bool x_init    (infinite loop because visibility not specified)
    cmp BYTE PTR x_init, 0
    jne .L3
.L5:
    jmp .L5
.L3:
    mov eax, DWORD PTR x
    mov DWORD PTR y, eax
Copyright © 2015 Oxyware Ltd 51/55
Using memory order flags

// thread 1
x = 42;
// blue fence
x_init.store(true, std::memory_order_release);

// thread 2
while (! x_init.load(std::memory_order_acquire));
// blue fence
y = x;

// thread 1, with x_init seq_cst       (mfence needed on seq_cst store)
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1
    mfence

// thread 1, with x_init release       (no fence for release store)
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1

// thread 2, with atomic bool x_init   (no fence needed on load for x86;
//                                      same code for x_init acquire)
.L25:
    movzx eax, BYTE PTR x_init
    test al, al
    je .L25
    mov eax, DWORD PTR x
    mov DWORD PTR y, eax

• Release provides only the blue fence (no writes in this thread reordered after the store)
• Acquire means no reads in this thread reordered before here
• Controls compiler reordering but not hardware
Copyright © 2015 Oxyware Ltd 52/55
Memory model advice

• This is a complex and subtle area and you should avoid using it unless you can prove that you can't get adequate performance without it
  – yes, really, I mean it....
• Even experts get confused by this stuff!
  – did I mention you should avoid it....
• If you do use it, use acquire on load and release on store
  – Anything else will be a source of subtle bugs
Copyright © 2015 Oxyware Ltd 53/55
Memory model example

std::atomic<int> counter(0);

void count()
{
    for (auto i = 0; i != numLoops; ++i)
        counter.store(5);
        //counter.store(5, std::memory_order_seq_cst);
        //counter.store(5, std::memory_order_release);
}

// code as shown
$ time ./a.out              $ time taskset -c 1 ./a.out
real 0m2.039s               real 0m1.004s
user 0m5.932s               user 0m1.002s
sys  0m0.002s               sys  0m0.001s

// m_o_seq_cst (default)
$ time ./a.out              $ time taskset -c 1 ./a.out
real 0m2.118s               real 0m0.999s
user 0m6.496s               user 0m0.994s
sys  0m0.009s               sys  0m0.004s

// m_o_release (no mfence)
$ time ./a.out              $ time taskset -c 1 ./a.out
real 0m0.097s               real 0m0.057s
user 0m0.225s               user 0m0.055s
sys  0m0.002s               sys  0m0.002s
Copyright © 2015 Oxyware Ltd 54/55
Concurrency spectrum

• Invest your time in splitting up the problem
• You know your domain
• Leave concurrency parts to others

processes/messaging  →  TBB/PPL  →  atomics, futures  →  threads/memory ordering
                        (everything to the right of processes is shared memory)

keep as far to the left as possible
Copyright © 2015 Oxyware Ltd 55/55
Summary

• Multithreaded programming is tricky
  – New skills and ideas and ways to get it wrong
• Focus on partitioning the problem
  – Determines data sharing, locking, work breakdown and scheduling
• Avoid shared mutable data where possible
• Know your access patterns
• Scale down as well as up
• Balance extremes of grain size, lock extent, etc
• Don't try and be clever