Spin Locks and Contention

Based on slides by Maurice Herlihy & Nir Shavit

Tomer Gurevich

Mutual Exclusion

• Most programs aren’t embarrassingly parallel

• “critical sections” of the code must be executed by one thread at a time to ensure correctness

• use locks for mutual exclusion

Art of Multiprocessor Programming 2

Example: concurrent counter

[Figure: timelines for Thread 1 and Thread 2; each reads the value 1 (R1) and then writes 2 (W2).]

Locks

[Figure: a thread acquires the lock, executes its critical section (CS), and resets the lock upon exit.]

• The lock introduces a sequential bottleneck
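The concurrent-counter problem goes away once the increment is a critical section. The following is a minimal sketch, assuming `java.util.concurrent.locks.ReentrantLock` as a stand-in for the locks developed later in these slides; the class and method names are illustrative and do not come from the slides.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: LockedCounter and run() are hypothetical names.
class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private long value = 0;

    void increment() {
        lock.lock();              // enter the critical section
        try {
            value++;              // read-modify-write, one thread at a time
        } finally {
            lock.unlock();        // reset the lock on exit
        }
    }

    long get() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }

    // Starts `threads` threads that each increment `perThread` times.
    static long run(int threads, int perThread) {
        LockedCounter c = new LockedCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) c.increment();
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return c.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000)); // prints 400000
    }
}
```

Every increment happens inside the lock, so no update is lost, but all threads now take turns, which is the sequential bottleneck noted above.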

What should you do if you can’t get a lock?

• Keep trying
  – “spin” or “busy-wait”
  – Good if delays are short
• Give up the processor
  – Good if delays are long
  – Always good on uniprocessor

Outline

• Spinlock review
• TAS-lock optimizations
• Queue locks
• Abortable locks

Review: Test-and-Set

• Atomic operation
• Test-and-set (addr, new_val)
  – Set the current value of the word addr to new_val
  – Return the old value
• TAS aka “getAndSet”

Review: Test-and-Set

public class AtomicBoolean {
  boolean value;

  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;
    value = newValue;
    return prior;
  }
}
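The class above models the semantics of test-and-set. As a small illustration, the same behaviour can be observed with the real `java.util.concurrent.atomic.AtomicBoolean` from the JDK; the `GetAndSetDemo` class and its `demo()` method are hypothetical names used only for this sketch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: shows that getAndSet returns the old value.
class GetAndSetDemo {
    static boolean[] demo() {
        AtomicBoolean flag = new AtomicBoolean(false);
        boolean first = flag.getAndSet(true);    // old value was false
        boolean second = flag.getAndSet(true);   // old value is now true
        return new boolean[] { first, second, flag.get() };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints: false true true
    }
}
```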

Test-and-Set Locks

• Locking
  – Lock is free: value is false
  – Lock is taken: value is true
• Acquire lock by calling TAS
  – If result is false, you win
  – If result is true, you lose
• Release lock by writing false

Test-and-set Lock

class TASlock {
  AtomicBoolean state = new AtomicBoolean(false);  // lock state is AtomicBoolean

  void lock() {
    while (state.getAndSet(true)) {}  // keep trying until lock acquired
  }

  void unlock() {
    state.set(false);  // release lock by resetting state to false
  }
}
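To try the TAS lock end to end, here is a minimal runnable sketch. The lock itself follows the slide; the `TASLockDemo` harness, its thread count, and its iteration count are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        while (state.getAndSet(true)) {}   // spin until TAS returns false
    }

    void unlock() {
        state.set(false);                  // release by writing false
    }
}

// Hypothetical harness, not from the slides.
class TASLockDemo {
    private static int counter;            // only touched while holding the lock

    static int run(int threads, int perThread) {
        counter = 0;
        TASLock lock = new TASLock();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```

The plain `int` counter is safe here because every access sits between `lock()` and `unlock()`, and the atomic operations on `state` order those accesses.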

Space Complexity

• TAS spin-lock has small “footprint”
• N thread spin-lock uses O(1) space

Performance

• Experiment
  – n threads
  – Increment shared counter 1 million times
• How long should it take?
• How long does it take?

Mystery #1

[Figure: time vs. threads, with one curve for the TAS lock and one for the ideal.]

What is going on?

Bus-Based Architectures

[Figure: processors, each with its own cache, connected to a shared memory by a bus.]

Processor Issues Load Request

[Figure: a processor whose cache lacks the data sends a load request over the bus (“Gimme data”).]

Memory Responds

[Figure: memory answers over the bus (“Got your data right here”) and the data is placed in the requesting cache.]

Processor Issues Load Request

[Figure: a second processor requests the same data (“Gimme data”); the cache that already holds it answers “I got data”.]

Other Processor Responds

[Figure: the other processor’s cache supplies the data over the bus, so both caches now hold a copy.]

Cache Coherence

• We have lots of copies of data
  – Original copy in memory
  – Cached copies at processors
• Some processor modifies its own copy
  – What do we do with the others?
  – How to avoid confusion?

Modify Cached Data

[Figure: memory and two caches on the bus each hold a copy of data; one processor modifies its cached copy.]

What’s up with the other copies?

Modified cache data

• Other caches invalidate data
• This cache acquires write permission
• Memory can be updated later

What’s wrong with TASLock?

• TAS invalidates cache lines
• Spinners
  – Miss in cache
  – Go to bus
• Thread wants to release lock
  – delayed behind spinners

Test-and-Test-and-Set Locks

• Lurking stage
  – Wait until lock “looks” free
  – Spin while read returns true (lock taken)
• Pouncing stage
  – As soon as lock “looks” available
  – Read returns false (lock free)
  – Call TAS to acquire lock
  – If TAS loses, back to lurking

Test-and-test-and-set Lock

class TTASlock {
  AtomicBoolean state = new AtomicBoolean(false);

  void lock() {
    while (true) {
      while (state.get()) {}                // wait until lock looks free
      if (!state.getAndSet(true)) return;   // then try to acquire it
    }
  }
}
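The slide shows only `lock()`. A minimal runnable sketch of the whole TTAS lock follows, assuming `unlock()` is the same as for the TAS lock (write false); the `TTASLockDemo` harness is hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (state.get()) {}                 // lurk: read-only spin
            if (!state.getAndSet(true)) return;    // pounce: TAS once it looks free
        }
    }

    void unlock() {
        state.set(false);                          // assumed: same release as the TAS lock
    }
}

// Hypothetical harness, not from the slides.
class TTASLockDemo {
    private static int counter;                    // only touched while holding the lock

    static int run(int threads, int perThread) {
        counter = 0;
        TTASLock lock = new TTASLock();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```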

Graph

[Figure: time vs. threads, with curves for the TAS lock, the TTAS lock, and the ideal.]

Test-and-test-and-set

• Wait until lock “looks” free
  – Spin on local cache
  – No bus use while lock busy
• Problem: when lock is released
  – Invalidation storm …

Local Spinning while Lock is Busy

[Figure: every cache holds the value busy, so spinners read their local copy without using the bus.]

On Release

[Figure: the releasing thread writes free; the copies in the other caches become invalid.]

• Everyone misses, rereads
• Everyone tries TAS

An important observation

[Figure: timeline of a spin lock, with marks at times d, r1d, and r2d.]

• If the lock looks free
• But I fail to get it
• There must be contention
• Better to back off than to collide again

Solution: delay

[Figure: timeline of a spin lock, with backoff windows d, 2d, and 4d.]

If I fail to get lock
  – wait random duration before retry
  – Each subsequent failure doubles expected wait

Exponential Backoff Lock

public class Backoff implements Lock {
  AtomicBoolean state = new AtomicBoolean(false);

  public void lock() {
    int delay = MIN_DELAY;                       // fix minimum delay
    while (true) {
      while (state.get()) {}                     // wait until lock looks free
      if (!state.getAndSet(true)) return;        // if we win, return
      sleep(random() % delay);                   // back off for random duration
      if (delay < MAX_DELAY) delay = 2 * delay;  // double max delay, within reason
    }
  }
}
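The slide code leaves `sleep`, `random`, `MIN_DELAY`, and `MAX_DELAY` undefined. A minimal runnable sketch follows, assuming `LockSupport.parkNanos` for the pause, `ThreadLocalRandom` for the random duration, and delay bounds in nanoseconds that were picked arbitrarily for illustration; none of these choices come from the slides.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class BackoffLock {
    // Assumed tuning values in nanoseconds; the slides give no numbers.
    static final int MIN_DELAY = 1_000;
    static final int MAX_DELAY = 1_000_000;

    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        int delay = MIN_DELAY;
        while (true) {
            while (state.get()) {}                  // wait until lock looks free
            if (!state.getAndSet(true)) return;     // if we win, return
            // Lost the race: pause for a random duration in [0, delay) ns.
            LockSupport.parkNanos(ThreadLocalRandom.current().nextInt(delay));
            if (delay < MAX_DELAY) delay = 2 * delay;
        }
    }

    void unlock() {
        state.set(false);
    }
}

// Hypothetical harness, not from the slides.
class BackoffLockDemo {
    private static int counter;                     // only touched while holding the lock

    static int run(int threads, int perThread) {
        counter = 0;
        BackoffLock lock = new BackoffLock();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```

`parkNanos` may return early, which is harmless here because the loop simply re-checks the lock. Suitable delay bounds depend on the machine, which is the portability issue the slides raise.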

Spin-Waiting Overhead

[Figure: time vs. threads, with curves for the TTAS lock and the backoff lock.]

Backoff: Other Issues

• Good
  – Easy to implement
  – Beats TTAS lock
• Bad
  – Must choose parameters carefully
  – Not portable across platforms

Summary: basic TAS-Lock

• Performs well under low contention, but basic spinlocks aren’t scalable
• All threads spin on the same shared memory location, causing a lot of bus traffic
• No fairness, so a thread might starve

Queue locks

• Keep FIFO order
• Scalable locks
• Harder to implement
• Hurt performance under low contention

Anderson Queue Lock

[Figure: a flags array, initially T F F F F F F F, and a next counter; the lock is idle.]

• An acquiring thread calls getAndIncrement on next to claim a slot
• The flag in its slot is T, so the lock is acquired (“Mine!”)
• A second acquiring thread calls getAndIncrement, gets the following slot, and spins while its flag is F
• On release, the flags become F T F F F F F F and the waiting thread acquires the lock

Problem: false sharing

• Each thread spins on a different variable, so there should be no contention.
• But adjacent array elements are contained within the same cache line…

The Solution: Padding

[Figure: the flags array padded to T / / / F / / /, so that each flag falls on its own cache line (Line 1, Line 2).]

• Spin on my line
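The slides show the Anderson lock only as diagrams. A minimal sketch of the array-based lock with padding follows, assuming 64-byte cache lines (16 four-byte ints per line) and an `AtomicIntegerArray` in place of a boolean array so that flag writes are visible across threads. The names `ALock`, `PAD`, and `mySlot`, and the `ALockDemo` harness, are illustrative; the `Thread.yield()` in the spin loop is an addition to keep the demo responsive when threads outnumber cores.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class ALock {
    // Assumption: 64-byte cache lines, so flags sit 16 ints apart.
    // Java does not guarantee array alignment, so the padding is approximate.
    private static final int PAD = 16;

    private final int size;                          // must be >= number of contending threads
    private final AtomicInteger next = new AtomicInteger(0);
    private final AtomicIntegerArray flags;          // 1 = go ahead (T), 0 = wait (F)
    private final ThreadLocal<Integer> mySlot = ThreadLocal.withInitial(() -> 0);

    ALock(int capacity) {
        size = capacity;
        flags = new AtomicIntegerArray(capacity * PAD);
        flags.set(0, 1);                             // slot 0 starts as T
    }

    void lock() {
        // Sketch only: ignores overflow of the next counter.
        int slot = next.getAndIncrement() % size;    // claim a slot
        mySlot.set(slot);
        while (flags.get(slot * PAD) == 0) {         // spin on my own line
            Thread.yield();                          // demo-only addition
        }
    }

    void unlock() {
        int slot = mySlot.get();
        flags.set(slot * PAD, 0);                    // reset my slot for reuse
        flags.set(((slot + 1) % size) * PAD, 1);     // hand over to the next slot
    }
}

// Hypothetical harness, not from the slides.
class ALockDemo {
    private static int counter;                      // only touched while holding the lock

    static int run(int threads, int perThread) {
        counter = 0;
        ALock lock = new ALock(threads);
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1_000)); // prints 4000
    }
}
```

Setting `PAD` to 1 gives the unpadded layout, in which neighbouring flags share a cache line and a handover disturbs the other spinners.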

Performance

• Shorter handover than backoff
• Curve is practically flat
• Scalable performance

[Figure: curves for the queue lock and the TTAS lock.]

Anderson Queue Lock

• Good
  – Easy to implement queue lock
• Bad
  – Not space efficient
  – What if unknown number of threads?
  – What if small number of actual contenders?

CLH Lock

• FIFO order
• Small, constant-size overhead per thread

CLH Queue Lock

class Qnode {
  AtomicBoolean locked = new AtomicBoolean(true);
}

CLH Queue Lock

class CLHLock implements Lock {
  AtomicReference<Qnode> tail;                                      // queue tail
  ThreadLocal<Qnode> myNode = ThreadLocal.withInitial(Qnode::new);  // thread-local Qnode

  public void lock() {
    Qnode pred = tail.getAndSet(myNode.get());  // swap in my node
    while (pred.locked.get()) {}                // spin until predecessor releases lock
  }
}

Initially

[Figure: tail points to a single node whose locked field is false; the lock is idle.]

Purple Wants the Lock

[Figure: Purple creates a node with locked = true and swaps it into tail, obtaining the old tail node as its predecessor.]

Purple Has the Lock

[Figure: the predecessor node is false, so Purple has acquired the lock.]

Red Wants the Lock

[Figure: Red creates a node with locked = true, swaps it into tail, and spins on Purple’s node, which is still true. The predecessor references form an implicit linked list.]

CLH Queue Lock

class CLHLock implements Lock {
  …
  public void unlock() {
    myNode.get().locked.set(false);  // notify successor
    myNode.set(pred);                // recycle predecessor’s node (pred as obtained in lock())
  }
}
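The slide fragments omit a few details needed to run the CLH lock: `tail` has to start at a node whose `locked` flag is false, the predecessor has to be remembered between `lock()` and `unlock()`, and a recycled node has to be marked locked again before reuse. A minimal sketch under those assumptions follows; the `myPred` field and the `CLHLockDemo` harness are illustrative names, and the `Thread.yield()` in the spin loop is an addition to keep the demo responsive when threads outnumber cores.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class CLHLock {
    static final class QNode {
        final AtomicBoolean locked = new AtomicBoolean(false);
    }

    // Assumed initial state: one dummy node that is not locked.
    private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();   // hypothetical helper field

    void lock() {
        QNode node = myNode.get();
        node.locked.set(true);                 // needed because nodes are recycled
        QNode pred = tail.getAndSet(node);     // swap in my node
        myPred.set(pred);                      // remember predecessor for unlock()
        while (pred.locked.get()) {            // spin until predecessor releases lock
            Thread.yield();                    // demo-only addition
        }
    }

    void unlock() {
        QNode node = myNode.get();
        node.locked.set(false);                // notify successor
        myNode.set(myPred.get());              // recycle predecessor's node
    }
}

// Hypothetical harness, not from the slides.
class CLHLockDemo {
    private static int counter;                // only touched while holding the lock

    static int run(int threads, int perThread) {
        counter = 0;
        CLHLock lock = new CLHLock();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1_000)); // prints 4000
    }
}
```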

Purple Releases

[Figure: Purple sets its node to false; Red, spinning on that node, sees false (“Bingo!”) and acquires the lock. Purple is released and Red is acquired.]

Space Usage

• Let
  – L = number of locks
  – N = number of threads
• ALock
  – O(LN)
• CLH lock
  – O(L+N)

CLH Lock

• Good
  – Lock release affects successor only
  – Small, constant-sized space
• Bad
  – Doesn’t work for uncached NUMA architectures

NUMA Architectures

• Acronym:
  – Non-Uniform Memory Architecture
• Illusion:
  – Flat shared memory
• Truth:
  – No caches (sometimes)
  – Some memory regions faster than others

MCS Lock

• FIFO order, list-based queue lock
• Similar to CLH
• Spin on local memory only, solving the NUMA problem

MCS Lock

• Each node now contains a “next” field.
• Each thread spins locally on its own node’s “locked” field.
• Upon release, notify the next node that you have finished.
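The slides describe the MCS lock only in bullets. A minimal sketch of those three points follows, assuming `volatile` fields for `locked` and `next` and a `tail` that starts as null; the `MCSLockDemo` harness is hypothetical, and the `Thread.yield()` calls in the spin loops are an addition to keep the demo responsive when threads outnumber cores.

```java
import java.util.concurrent.atomic.AtomicReference;

class MCSLock {
    static final class QNode {
        volatile boolean locked;               // my own flag: I spin here
        volatile QNode next;                   // successor, filled in by the successor
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    void lock() {
        QNode node = myNode.get();
        QNode pred = tail.getAndSet(node);     // append my node to the queue
        if (pred != null) {                    // someone is ahead of me
            node.locked = true;
            pred.next = node;                  // link myself behind the predecessor
            while (node.locked) {              // spin on my own node only
                Thread.yield();                // demo-only addition
            }
        }
    }

    void unlock() {
        QNode node = myNode.get();
        if (node.next == null) {
            // No visible successor: try to mark the queue empty.
            if (tail.compareAndSet(node, null)) return;
            // A successor swapped into tail but has not linked itself yet.
            while (node.next == null) {
                Thread.yield();                // demo-only addition
            }
        }
        node.next.locked = false;              // notify the next node that I finished
        node.next = null;                      // reset my node for reuse
    }
}

// Hypothetical harness, not from the slides.
class MCSLockDemo {
    private static int counter;                // only touched while holding the lock

    static int run(int threads, int perThread) {
        counter = 0;
        MCSLock lock = new MCSLock();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1_000)); // prints 4000
    }
}
```

Unlike CLH, each thread spins on a flag in its own node, which is why the scheme suits machines where remote memory is slow or uncached.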

Abortable Locks

• What if you want to give up waiting for a lock?
• For example
  – Timeout
  – Database transaction aborted by user

Back-off Lock

• Aborting is trivial
  – Just return from lock() call
• Extra benefit:
  – No cleaning up
  – Immediate return

Queue Locks

• Can’t just quit
  – Thread in line behind will starve
• Need a graceful way out

Abortable CLH Lock

• When a thread gives up
  – Removing node in a wait-free way is hard
• Idea:
  – let successor deal with it.

Queue Locks

[Figure: a queue of nodes; the head holds the lock (locked, true) and the threads behind it are spinning on true.]

[Figure: a waiting thread times out and marks its node as aborted.]

[Figure: the successor sees that its predecessor aborted and goes on spinning on the node before it.]
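The “let the successor deal with it” idea can be sketched as a timeout lock. The version below is a minimal illustration, not the slides’ code: it assumes a `pred` field per node (null while waiting or holding, a sentinel `AVAILABLE` after release, or the aborting thread’s own predecessor after an abort) and a fresh node per acquisition. The `TOLockDemo` harness and the `Thread.yield()` in the wait loop are additions for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class TOLock {
    static final class QNode {
        volatile QNode pred;   // null: waiting/holding; AVAILABLE: released; other: aborted
    }

    private static final QNode AVAILABLE = new QNode();
    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = new ThreadLocal<>();

    boolean tryLock(long time, TimeUnit unit) {
        long start = System.nanoTime();
        long patience = unit.toNanos(time);
        QNode node = new QNode();
        myNode.set(node);
        QNode myPred = tail.getAndSet(node);
        if (myPred == null || myPred.pred == AVAILABLE) return true;  // lock was free
        while (System.nanoTime() - start < patience) {
            QNode predPred = myPred.pred;
            if (predPred == AVAILABLE) return true;       // predecessor released
            if (predPred != null) myPred = predPred;      // predecessor aborted: skip it
            Thread.yield();                               // demo-only addition
        }
        // Timed out. If I am last in line, just unlink; otherwise leave a
        // pointer to my predecessor so that my successor can skip me.
        if (!tail.compareAndSet(node, myPred)) node.pred = myPred;
        return false;
    }

    void unlock() {
        QNode node = myNode.get();
        if (!tail.compareAndSet(node, null)) node.pred = AVAILABLE;
    }
}

// Hypothetical harness, not from the slides.
class TOLockDemo {
    static boolean[] run() {
        TOLock lock = new TOLock();
        boolean[] r = new boolean[3];
        r[0] = lock.tryLock(100, TimeUnit.MILLISECONDS);      // uncontended
        Thread t = new Thread(() ->
            r[1] = lock.tryLock(50, TimeUnit.MILLISECONDS));  // lock is held: times out
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        lock.unlock();
        r[2] = lock.tryLock(100, TimeUnit.MILLISECONDS);      // free again after the abort
        lock.unlock();
        return r;
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints: true false true
    }
}
```

An aborting thread never touches its neighbours’ nodes; it only publishes where its successor should look next, which avoids removing a node from the middle of the queue.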

One Lock To Rule Them All?

• TTAS+Backoff, CLH, MCS, TOLock…
• Each better than others in some way
• There is no one solution
• Lock we pick really depends on:
  – the application
  – the hardware
  – which properties are important

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.

• You are free:
  – to Share: to copy, distribute and transmit the work
  – to Remix: to adapt the work
• Under the following conditions:
  – Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
  – Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to
  – http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.