Multithreading dos and don'ts
ACCU 2015, Bristol – June 2015
Hubert Matthews – Oxyware Ltd
Copyright © 2015 Oxyware Ltd 2/55
Why this talk?
Don't – why not
• Avoid multithreaded programming if you can
  – It's harder to write, to read, to understand and to test than single-threaded code
  – It may appear to work but may just not have failed sufficiently visibly yet
  – Often distracts from the underlying application problem and focuses developers on technical issues
  – A great consumer of developer time and generator of frustration
  – Often avoidable
  – May not deliver the performance benefits you expect
Do – why
• You need it to access the full power of the machine (measure, don't guess!)
• You need to scale your application and your application is CPU-bound
• You are running in a threaded environment
• There is an obvious parallel decomposition of the problem or algorithm
• Other approaches to covering latency and I/O are worse or not available
• You are brave or a masochist (or have an ego)
Alternatives
Single-threaded approaches
• event-driven code
• asynchronous I/O
  • asio, libaio (Linux)
  • overlapped I/O (Windows)
• non-blocking TCP
• UDP
• coroutines or fibers
• separate processes
Multi-threaded approaches
• concurrent library (tasks)
  • TBB, PPL
• concurrent library (data)
  • OpenMP
• message passing
  • MPI
A whole new set of problems
• Getting single-threaded code working well and with good performance can be a challenge
• Multithreading provides a whole new set of ways of getting it wrong or going slower
  – The problems may not show up except under load or at the most inconvenient time
  – They may not be reproducible
  – They will be hard to debug or to measure
  – Knowing which of these problems you have is hard
focus on getting good single-threaded performance first before going multithreaded
Problem decomposition
• Before considering how to implement a parallel solution you have to split the problem up into pieces and find an algorithm for processing and recombining these pieces
  – This split may be trivial for "embarrassingly parallel" problems or hard (travelling salesman problem)
• Classic approaches
  – Data parallel (sections of an array)
  – Task parallel (web requests)
• Interaction between sub-items is key
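The data-parallel split described above can be sketched in a few lines: divide an array into contiguous sections, process each on its own thread, and recombine the partial results. This is a minimal illustration, not code from the talk; the function name, chunking scheme and data are all made up for the example.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Data-parallel decomposition: each thread sums one contiguous section of
// the array into its own slot, then the partials are recombined serially.
// Each thread writes only to its own slot, so no locking is needed.
long parallelSum(const std::vector<int>& data, unsigned numThreads)
{
    std::vector<long> partial(numThreads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / numThreads;

    for (unsigned t = 0; t != numThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end =
            (t + 1 == numThreads) ? data.size() : begin + chunk;
        workers.emplace_back([&partial, &data, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0L);
        });
    }
    for (auto& w : workers)
        w.join();                       // wait for all pieces
    return std::accumulate(partial.begin(), partial.end(), 0L);  // recombine
}
```

(The adjacent slots in `partial` may share a cache line – see the false sharing slide later for why that can hurt performance.)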
Goldilocks
• Parallel approaches have to find the "sweet spot" between two extremes
• Too fine-grained
  – Data – computation dominated by overhead
  – Threads – context switching overhead
• Too coarse-grained
  – Data – load balancing problems
  – Threads – insufficient items to keep threads busy
Testing
• Single-threaded code can be unit tested
  – Repeatable results from isolated code
• Multithreaded code cannot be unit tested easily or reliably
  – Non-deterministic outputs
  – Making them deterministic may be possible
  – Errors are transient (data races)
  – Problems are often performance-related and show up only at scale or under load
allow for scaling down to a single thread for test before scaling up for production
Avoid sharing mutable data
• Shared mutable data is the evil of all computing!
• Read-only data can be shared safely without locks
• Const is your friend
• Pure message-passing approach avoids this

[Diagram: 2×2 quadrant of mutable/immutable versus shared/not shared]
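The "read-only data can be shared safely without locks" point can be shown directly: with no writers there is no data race, so multiple threads may read the same const object concurrently. A minimal sketch; the function name and data are illustrative.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Two threads read the same const vector concurrently with no locking.
// Nobody writes, so there is no data race and no synchronisation needed.
long sumFromTwoReaders(const std::vector<int>& shared)
{
    long sum1 = 0, sum2 = 0;
    std::thread t1([&] {
        sum1 = std::accumulate(shared.begin(), shared.end(), 0L);
    });
    std::thread t2([&] {
        sum2 = std::accumulate(shared.begin(), shared.end(), 0L);
    });
    t1.join();
    t2.join();
    return sum1 + sum2;   // both readers saw a consistent, unchanging view
}
```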
Shared writes don't scale
(graphic by Dmitry Vyukov, http://www.1024cores.net, CC BY-NC-SA 3.0)
single writer principle for speed/scale
Why shared writes don't scaleWhy shared writes don't scale
Core1
L1
L2
Core2
L1
L2
L3
RAM
32+32 KB 1-3 cycles
256 KB 5-20 cycles
4 MB 30-50 cycles
16 GB 100-300 cycles
MESI
• Caches have to Caches have to communicate communicate to ensure to ensure coherent view coherent view
• MESI protocol MESI protocol passes passes messages messages between cachesbetween caches
• Shared writes Shared writes limited by limited by MESI commsMESI comms
MESI
Copyright © 2015 Oxyware Ltd 14/55
Shared writes – cache ping-pong
• Cache line passed between caches
• Hardware serialises writes to the same line
• Therefore zero scalability!!!
• For speed, don't pass ownership: Single Writer Principle
Example – contention costs

std::atomic<int> counter(0);

void count()
{
    for (auto i = 0; i != numLoops; ++i)
        counter++;
}

// run with 4 threads on a 4-core machine
// -O3 -march=native

$ time ./a.out
real 0m1.675s
user 0m6.384s
sys  0m0.007s

$ time taskset -c 1 ./a.out
real 0m0.476s
user 0m0.470s
sys  0m0.006s

3.5 times faster and 13.6 times less CPU when run on one core
Example – contention costs (cont'd)

$ perf stat -D 100 -e cycles,instructions,stalled-cycles-backend,cs ./a.out

Performance counter stats for './a.out':

  16,635,079,957 cycles
     247,668,647 instructions            # 0.01 insns per cycle
                                         # 65.91 stalled cycles per insn
  16,324,609,637 stalled-cycles-backend  # 98.13% backend cycles idle
           1,912 cs

  1.569828247 seconds time elapsed

$ perf stat -D 100 -e cycles,instructions,stalled-cycles-backend,cs taskset -c 1 ./a.out

Performance counter stats for 'taskset -c 1 ./a.out':

   1,135,801,575 cycles
     197,626,046 instructions            # 0.17 insns per cycle
                                         # 5.19 stalled cycles per insn
   1,026,469,027 stalled-cycles-backend  # 90.37% backend cycles idle
             115 cs

  0.480011707 seconds time elapsed
Don't under-synchronise
• Shared variables need to be synchronised correctly
  – Do not rely on guesswork
  – Do not try to cheat
  – Do not rely on unspecified ordering or visibility
• Under-synchronised variables are subject to data races (at least one reader and one writer)
• Causes transient and unreproducible errors
use locks on shared mutable data structures or use single atomics as synchronisation points
Don't over-synchronise
• Shared variables need to be synchronised correctly
  – Too much locking will make the code serialised
  – Locks are there to slow your program down until it is (hopefully) correct
  – Watch out for deadlock and livelock
  – Performance reduces back to slower than a single thread in the worst case because of locking overhead (locks are shared writes)
  – Amdahl's Law kicks in
don't keep adding locks – have a clear plan
Amdahl's Law
• Serial code limits scale, regardless of the number of threads or cores available

Serial portion of code   Maximum speedup
1%                       100x
5%                       20x
10%                      10x
20%                      5x
25%                      4x

avoid non-read-only data sharing to allow for maximum parallelism
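The table above is the limiting case of Amdahl's formula: with serial fraction s and N processors, speedup(N) = 1 / (s + (1 − s)/N), which tends to 1/s as N grows. A small helper (not from the talk) makes the numbers concrete:

```cpp
// Amdahl's Law: speedup with serial fraction s on N processors.
// As N -> infinity this approaches 1/s, which is the "Maximum speedup"
// column in the table (e.g. s = 0.05 caps the speedup at 20x).
double amdahlSpeedup(double serialFraction, double processors)
{
    return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
}
```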
Deadlock and livelock
• If you have more than one lock in your program you may end up in a deadlock (deadly embrace)
  – Have only one lock (may limit performance)
  – Increases lock hold time
• Locking order is important
  – C++11's std::lock
  – Use addresses of locks to guarantee ordering
  – Release order is not important
  – May require exposing internal locks to callers
Time-based synchronisation

locks don't sequence start and finish
use futures to synchronise in time (A comes after B)

[Diagram: horizontal sync (locks) versus time-based sync (futures) in a fork/join model (Amdahl)]
Hardware v. software threads
• There are a limited number of hardware threads available
  – 1 per core
  – 2 per core for Intel hyperthreading
• If there are more s/w than h/w threads then they will have to take turns (oversubscription)
  – Leads to context switching
  – Slow; 1000s of cycles to switch
    • Call to operating system
    • Scheduling
    • Cold cache and TLB
one software thread per hardware thread
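To follow the "one software thread per hardware thread" rule you need to know how many hardware threads there are. The standard query is std::thread::hardware_concurrency(); a small wrapper (illustrative, not from the talk) handles the case where the value is not computable:

```cpp
#include <thread>

// Size a thread pool to the hardware. hardware_concurrency() may
// return 0 when the value is not computable, so fall back to a default.
unsigned poolSize()
{
    unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 2;   // assumed fallback for unknown hardware
}
```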
Queue-based systems
• Systems that are based on queues can have performance issues caused by:
  – Context switching when queues are empty or full
  – Voluntary context switching
  – Shared writes to queue (insert and remove)
  – Processing per item is too small
  – Can be difficult to run in single-threaded mode
  – c.f. Disruptor pattern
be careful with queues if performance is important
Lock hold time and scope
• The time that a lock is held for determines the amount of parallelism
  – Shorter hold times are better
  – Shorter times may also indicate less shared state
• However, small lock scopes may not protect the data across lock scopes adequately
  – Need to consider business-level transactions and logical units of work
  – Can lead to application-level errors because of concurrently changing data
TOCTOU and application errors
• Time-of-check to time-of-use (TOCTOU) errors can lead to application errors
  – Note: these are not data races caused by synchronisation errors (i.e. locking errors)
  – These are caused by concurrent modifications at the application level
  – Usually caused by inappropriate APIs

Sample conversation
Me: Is there any ice cream left, please?
Waiter: I'll check... yes there is
Me: I'll have some please
Waiter: Oops, we've just run out
TOCTOU and application errors (2)
• Locking doesn't help
• Need a different API
  – e.g. putIfAbsent()
  – popIfNotEmpty()
• Single batched operation with exception and failure notification
• Compare-and-set (CAS) is a classic approach (retries)

[Diagram: two threads interleaving read–modify–write cycles on the same data]
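The CAS-with-retries approach mentioned above combines the check and the use into one atomic step, so no other thread can invalidate the check in between. Here is a sketch of a popIfNotEmpty-style operation on a counter ("take one only if any are left"); the function name is made up for the example.

```cpp
#include <atomic>

// Decrement the counter only if it is positive, atomically.
// compare_exchange_weak succeeds only if the counter still holds the
// value we checked; on failure it reloads 'current' and we retry.
bool decrementIfPositive(std::atomic<int>& counter)
{
    int current = counter.load();
    while (current > 0) {
        if (counter.compare_exchange_weak(current, current - 1))
            return true;   // check and use happened as one atomic step
    }
    return false;          // observed empty: fail explicitly, don't block
}
```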
Volatile
• Volatile in C and C++ is of no use for multithreading
  – In Java and C# it means "atomic"
• It disables caching in registers
• It forces memory accesses
• It doesn't ensure cross-thread visibility
• It doesn't affect compiler or hardware reordering of operations
do not use volatile variables except for memory-mapped device I/O
Spin loops and polling
• Spin loops are disastrous on single-threaded machines
  – They just burn CPU cycles until the OS reschedules the thread
• On a multithreaded machine they are of use when they use less CPU than the overhead of a context switch (1000s of cycles)
• Polling with a sleep() uses little CPU but has longer wakeup latency (avg 1/2 polling time)
avoid spin loops unless you have measured the latency-CPU tradeoff
Blocking I/O
• Programs can be CPU, memory, network or I/O limited
• I/O-limited programs that use blocking I/O will often use too many threads to handle blocking calls for I/O
• This causes lots of context switching
• Investigate asynchronous approaches
  – libaio, non-blocking sockets, overlapped I/O
• Damage-limitation approaches such as I/O thread pools
• Watch out for copying of data (zero copy)
Interrupting threads and shutdown
• Don't even think about trying to interrupt another thread
• Plan a clean shutdown mechanism for your program
• Often will involve a cooperative approach
  – Shared stop/start/state flag
  – Shutdown message in message-passing applications
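The shared-flag approach above looks like this: the worker polls an atomic flag at a safe point in its loop and exits cleanly, rather than being killed mid-operation. A minimal sketch; the flag and loop body are illustrative.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> stopRequested(false);
std::atomic<long> iterations(0);

// Cooperative shutdown: the worker checks the flag between units of
// work, so it always exits at a point where its state is consistent.
void worker()
{
    while (!stopRequested.load()) {
        ++iterations;        // stand-in for one unit of real work
    }
    // clean-up runs here; the thread exits in a known state
}
```

Usage: start the thread, set the flag, then join – the join is guaranteed to complete once the flag is seen.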
Thread priorities and scheduling
• If your application requires the use of thread priorities to operate correctly then it's almost certainly broken
• Beware of priority inversion and locking issues
• Thread scheduling is rarely the correct solution
  – Probably implies locking issues and too much contention; fix that first
  – Can be useful in limited circumstances to provide run-to-completion semantics
Deletion
• Be careful about deleting data in concurrent systems; another thread may still have a reference
• Reference counting can help but counters must be thread-safe (std::shared_ptr is, mostly...)
• Avoid concurrent deletions: tbb::concurrent_vector is append-only
• Separate the program into phases so that deletion is in a safe serial part
• Garbage collection is a big win here
Multiple atomics
• Can be used successfully individually
• The problem becomes more about transactional correctness
  – Do the atomics make sense together?
  – Race window between modifying both
  – Initialisation order
  – Visibility of updates

1) use atomic<Data *> instead of multiple atomics (even better, use atomic<const Data *>)
2) use std::call_once for initialisation
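Both recommendations above can be sketched together: publish one atomic pointer to a fully built object instead of several separate atomics, and guard the initialisation with std::call_once. Readers then see either the whole old object or the whole new one, never a half-updated mix. The Config type and field values are made up for illustration.

```cpp
#include <atomic>
#include <mutex>

struct Config {
    int timeoutMs;
    int retries;
};

std::atomic<const Config*> currentConfig(nullptr);
std::once_flag initFlag;

// std::call_once guarantees the initialiser runs exactly once, even if
// several threads race to call initConfig().
void initConfig()
{
    std::call_once(initFlag, [] {
        // Build the object fully, then publish it with one atomic store:
        // no reader can observe a partially initialised Config.
        // (Deliberately leaked here; reclaiming the old object safely is
        // the "Deletion" problem covered on its own slide.)
        currentConfig.store(new Config{500, 3});
    });
}

const Config* readConfig()
{
    return currentConfig.load();   // both fields are consistent together
}
```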
Immutable data and safe publishing
• Immutable data can be shared without locking
• Fewer errors and easy to understand
• Be careful about deletion; are there still references to the object?!
• std::shared_ptr<> has atomic counters but not its body so it must be locked
• Helps with exception safety, transactions and copy-on-write optimisation
publish safely using std::atomic<const Data *>
Error handling
• Propagating errors from one thread to another is tricky
• std::future<> catches exceptions thrown in the called thread and rethrows them in the calling thread when f.get() is called
• Make sure you have a try/catch at the top level of every thread you start
use std::future<> for time sequencing and easier error handling
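The future-based propagation described above is worth seeing in miniature: an exception thrown inside the async task is captured in the future and rethrown in the calling thread at get(). The failing task and function name here are illustrative.

```cpp
#include <future>
#include <stdexcept>
#include <string>

// The worker throws; the exception crosses the thread boundary inside
// the future and surfaces in the caller at f.get().
std::string describeFailure()
{
    std::future<int> f = std::async(std::launch::async, []() -> int {
        throw std::runtime_error("worker failed");  // thrown in the worker
    });
    try {
        f.get();                     // rethrown here, in the calling thread
    } catch (const std::runtime_error& e) {
        return e.what();             // handled where it is convenient
    }
    return "no error";
}
```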
False sharing

[Diagram: two threads writing to separate variables that sit on the same cache line]

• Separate threads can access separate variables on the same cache line (often 64 bytes long)
• Writes by one thread invalidate the cache line for the other thread, leading to "cache ping-pong"
• Major performance killer – effectively shared writes

watch out for false sharing – use padding to length of cache line
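The padding remedy looks like this: align each per-thread counter to cache-line size so each one lives on its own line and one thread's writes do not invalidate the other's cache. 64 bytes is a common line size, not a guarantee (C++17 adds std::hardware_destructive_interference_size for this); the function and loop counts are illustrative.

```cpp
#include <thread>

// alignas(64) pads each counter to a (typical) cache line, so the two
// threads below never write to the same line – no cache ping-pong.
struct alignas(64) PaddedCounter {
    long value = 0;
};

long countInParallel(long loops)
{
    PaddedCounter counters[2];   // one per thread, on separate cache lines
    std::thread t1([&] { for (long i = 0; i != loops; ++i) ++counters[0].value; });
    std::thread t2([&] { for (long i = 0; i != loops; ++i) ++counters[1].value; });
    t1.join();
    t2.join();
    return counters[0].value + counters[1].value;
}
```

Drop the alignas(64) and the result is still correct but slower under load – which is exactly what makes false sharing hard to spot.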
Parallel algorithms
• Some libraries can run in parallel mode without you having to start any threads or do any synchronisation
• Gcc does this for STL algorithms if compiled with -D_GLIBCXX_PARALLEL and -fopenmp
use "free" parallelism if available
Read/write ratio
• Different approaches are appropriate for read-mostly or write-mostly access patterns
• Also depends on lock hold time
• Short hold, lots of writes => CAS, spin lock et al
• Short hold, mixed R/W => distributed mutex
• Short (zero) hold => RCU-style lock free
• Beware of reader/writer lock scaling
select an approach based on data-access patterns
Fast/slow paths
• Know what operations need to be fast
  – Frequent operations
• Avoid locks on the fast path
  – Mutexes, I/O, memory allocation, etc
• Push work to the slow path
  – Maybe use a queue or a background thread
  – Block slow path until there are no fast-path users
  – RCU, garbage collection, distributed read/write mutex
know what needs to be fast
Example – distributed R/W mutex

[Diagram: four read threads, each locking its own per-core mutex; one write thread locking all four]

• A per-core mutex can be locked by only one read thread at a time, so the mutex is uncontended and therefore fast; the read mutex cache line is not shared across cores
• The write thread locks all mutexes to block readers; a slow operation
• Windows: GetCurrentProcessorNumber()
• Linux: sched_getcpu()
• See http://1024cores.net for more details
Distributed R/W mutex read performance

struct alignas(64) PerCoreLock {
    std::mutex lock;
};

PerCoreLock locks[numThreads];

void lockLoop()
{
    for (auto i = 0; i != numLoops; ++i) {
        auto core = sched_getcpu();
        std::lock_guard<std::mutex> guard(locks[core].lock);
    }
}

code as shown:
$ time ./a.out                  $ time taskset -c 1 ./a.out
real 0m0.537s                   real 0m1.992s
user 0m2.019s                   user 0m1.985s
sys  0m0.003s                   sys  0m0.005s

no alignas(64) – false sharing:
$ time ./a.out                  $ time taskset -c 1 ./a.out
real 0m4.204s                   real 0m2.091s
user 0m10.514s                  user 0m2.077s
sys  0m0.005s                   sys  0m0.004s

one shared lock – context switching:
$ time ./a.out                  $ time taskset -c 1 ./a.out
real 0m10.097s                  real 0m2.057s
user 0m11.862s                  user 0m2.045s
sys  0m24.078s                  sys  0m0.003s
Memory model and ordering
• We want our programs to run quickly
• Modern hardware reorders instructions and can run multiple instructions at once
• Compilers can reorder instructions too (e.g. to cover possible cache misses, delay slots, etc)
• Some languages (Java, C++11) have defined a memory model to say what reordering means at the language level
• Don't go there unless you can prove through measurement it's necessary
Need for a memory model

• Correctness is now based on memory, not just code
• Need to control caching (register, L1, L2, etc)

[Diagram: in single-threaded C++03, code reaches memory via compiler reordering (constrained by sequence points) and hardware reordering and caching (constrained by volatile reads and writes); in C++11, two threads' code passes through the same reordering stages, with the cross-thread "happens before" and "synchronizes with" relations defining ordering and visibility]
Instruction interleaving

• "Sequential consistency" means operations in separate threads are interleaved and that all threads see the same interleaving
  – Sequence is preserved within and across threads
• This is the "natural" mental model for programmers to think of thread execution order and memory
  – It is also the C++11 default memory ordering

// thread 1
x = 1;  // 1
r1 = y; // 2

// thread 2
y = 1;  // 3
r2 = x; // 4

// 6 possible sequentially consistent execution orders
1234  // x=1; r1=y; y=1; r2=x;
1324  // x=1; y=1; r1=y; r2=x;
1342  // x=1; y=1; r2=x; r1=y;
3412  // y=1; r2=x; x=1; r1=y;
3142  // y=1; x=1; r2=x; r1=y;
3124  // y=1; x=1; r1=y; r2=x;
Instruction reordering

• In order to gain performance both the compiler and the hardware may reorder instructions
  – the compiler may move loads earlier (to allow for cache misses)
  – the hardware may not write back to memory immediately (store buffers)
• x, y, r1 and r2 are all independent so code can be reordered
• Even worse, changes in one thread may not be visible in another thread so results are not defined – data race

// thread 1 (could be executed in reverse order)
x = 1;  // 1
r1 = y; // 2

// thread 2
y = 1;  // 3
r2 = x; // 4

// 4 factorial (== 24) possible execution orders
1234  // x=1; r1=y; y=1; r2=x;
4321  // r2=x; y=1; r1=y; x=1;
3124  // y=1; x=1; r1=y; r2=x;
// etc...
Hardware memory reordering

[Table: which memory reorderings each architecture permits – columns: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, AMD64, IA-64, zSeries; rows: loads reordered after loads, loads reordered after stores, stores reordered after stores, stores reordered after loads, atomics reordered with loads, atomics reordered with stores, dependent loads reordered, incoherent instruction cache pipeline]

• Hardware can reorder memory operations in different ways
  – Can also depend on operating system
    • Solaris on SPARC uses Total Store Order (TSO)
    • Linux on SPARC uses Relaxed Memory Order (RMO)

http://en.wikipedia.org/wiki/Memory_ordering
Synchronisation with seq. consistency

• This code is correct when sequentially consistent
  – thread 2 doesn't access x until it has been set by thread 1
• But in the presence of reordering it can fail
• The problem is that we haven't specified that cross-thread order or visibility is important
• We need to use synchronisation variables – atomics
• Making everything atomic is slow – 30–60 cycles
  – cache synchronisation is slow and has limited bandwidth

// thread 1
x = 42;
x_init = true;

// thread 2
while (! x_init) {}
y = x;
Synchronisation with atomicsSynchronisation with atomics
• This now works without relying on having sequential consistency everywhere (just atomics)
  – atomics prevent the compiler moving code across accesses
  – atomics also cause memory updates to be visible
• Load and store use sequential consistency
  – uses default parameter of std::memory_order_seq_cst
// thread 1
std::atomic<bool> x_init;
int x;

x = 42;
x_init.store(true);
// or x_init = true;

// thread 2
extern std::atomic<bool> x_init;
extern int x;

while (! x_init.load());
y = x;
// or while (! x_init);
Copyright © 2015 Oxyware Ltd 49/55
Low-level synchronisation detail
• Blue fences prevent the compiler reordering code
  – they don't generate any runtime code
• Red fences force memory to make changes visible
  – they do generate code: fence, lock prefix, CAS opcodes
  – depends heavily on underlying hardware (c.f. reordering)
  – only need one of the two red fences, usually on store
• One reason that threads can't just be a library
// thread 1
std::atomic<bool> x_init;

x = 42;
// blue fence    (prevents x and x_init being reordered)
x_init.store(true);
// red fence     (makes store to x_init visible)

// thread 2
extern std::atomic<bool> x_init;

while (/* red fence */ ! x_init.load());  // red fence in loop forces load of latest value of x_init
// blue fence    (prevents reordering)
y = x;
Copyright © 2015 Oxyware Ltd 50/55
Generated assembler code

// thread 1
x = 42;
// blue fence
x_init.store(true);
// red fence

// thread 2
while (/* red fence */ ! x_init.load());
// blue fence
y = x;

g++ 4.7 output, x86 (sequentially consistent by default):

// thread 1, with atomic bool x_init   (fence needed on store for x86)
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1
    mfence

// thread 1, with plain bool x_init
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1

// thread 2, with atomic bool x_init   (no fence needed on load for x86)
.L25:
    movzx eax, BYTE PTR x_init
    test al, al
    je .L25
    mov eax, DWORD PTR x
    mov DWORD PTR y, eax

// thread 2, with plain bool x_init    (infinite loop because visibility not specified)
    cmp BYTE PTR x_init, 0
    jne .L3
.L5:
    jmp .L5
.L3:
    mov eax, DWORD PTR x
    mov DWORD PTR y, eax
Copyright © 2015 Oxyware Ltd 51/55
Using memory order flags

// thread 1
x = 42;
// blue fence
x_init.store(true, std::memory_order_release);

// thread 2
while (! x_init.load(std::memory_order_acquire));
// blue fence
y = x;

// thread 1, with x_init seq_cst       (mfence needed on seq_cst store)
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1
    mfence

// thread 1, with x_init release       (no fence for release store)
    mov DWORD PTR x, 42
    mov BYTE PTR x_init, 1

// thread 2, with atomic bool x_init   (no fence needed on load for x86;
//                                      same code for x_init acquire)
.L25:
    movzx eax, BYTE PTR x_init
    test al, al
    je .L25
    mov eax, DWORD PTR x
    mov DWORD PTR y, eax

• Release provides only the blue fence (no writes in this thread reordered after the store)
• Acquire means no reads in this thread reordered before here
• Controls compiler reordering but not hardware
Copyright © 2015 Oxyware Ltd 52/55
Memory model advice

• This is a complex and subtle area and you should avoid using it unless you can prove that you can't get adequate performance without it
  – yes, really, I mean it....
• Even experts get confused by this stuff!
  – did I mention you should avoid it....
• If you do use it, use acquire on load and release on store
  – Anything else will be a source of subtle bugs
Copyright © 2015 Oxyware Ltd 53/55
Memory model example

std::atomic<int> counter(0);

void count()
{
    for (auto i = 0; i != numLoops; ++i)
        counter.store(5);
        //counter.store(5, std::memory_order_seq_cst);
        //counter.store(5, std::memory_order_release);
}

// code as shown
$ time ./a.out              $ time taskset -c 1 ./a.out
real 0m2.039s               real 0m1.004s
user 0m5.932s               user 0m1.002s
sys  0m0.002s               sys  0m0.001s

// m_o_seq_cst (default)
$ time ./a.out              $ time taskset -c 1 ./a.out
real 0m2.118s               real 0m0.999s
user 0m6.496s               user 0m0.994s
sys  0m0.009s               sys  0m0.004s

// m_o_release (no mfence)
$ time ./a.out              $ time taskset -c 1 ./a.out
real 0m0.097s               real 0m0.057s
user 0m0.225s               user 0m0.055s
sys  0m0.002s               sys  0m0.002s
Copyright © 2015 Oxyware Ltd 54/55
Concurrency spectrum

• Invest your time in splitting up the problem
• You know your domain
• Leave concurrency parts to others

processes/messaging  →  TBB/PPL  →  atomics, futures  →  threads/memory ordering
                        (everything to the right of processes is shared memory)

keep as far to the left as possible
Copyright © 2015 Oxyware Ltd 55/55
Summary

• Multithreaded programming is tricky
  – New skills and ideas and ways to get it wrong
• Focus on partitioning the problem
  – Determines data sharing, locking, work breakdown and scheduling
• Avoid shared mutable data where possible
• Know your access patterns
• Scale down as well as up
• Balance extremes of grain size, lock extent, etc
• Don't try and be clever