40
Agenda Lecture Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks Next discussion papers Selecting Locking Primitives for Parallel Programming Selecting Locking Designs for Parallel Programs 2

Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Embed Size (px)

Citation preview

Page 1: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Agenda

• Lecture • Bottom-up motivation • Shared memory primitives • Shared memory synchronization

• Barriers and locks

• Next discussion papers • Selecting Locking Primitives for Parallel Programming • Selecting Locking Designs for Parallel Programs

2

Page 2: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Acknowledgments

• Pseudo code from: • “Algorithms for Scalable Synchronization on Shared-Memory

Multiprocessors”, Mellor-Crummey & Scott, ACM TOCS, Feb 1991 • http://www.cs.rochester.edu/research/synchronization/pseudocode/

ss.html

3

Page 3: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Barriers

4

Page 4: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Common Parallel Idiom: Barriers

5

• Physics simulation computation • Divide up each timestep computation into N independent pieces • Each timestep: compute independently, synchronize

• Example: each thread executes: • segment_size = total_particles / number_of_threads • my_start_particle = thread_id * segment_size • my_end_particle = my_start_particle + segment_size - 1 • for (timestep = 0; timestep += delta; timestep < stop_time):

• calculate_forces(t, my_start_particle, my_end_particle) • barrier() • update_locations(t, my_start_particle, my_end_particle) • barrier()

• Barrier? All threads wait until all threads have reached it

Page 5: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

• Merge-sort 4096 elements with four threads

• Step #1: • Sort each 1/4th of array • (N/4)*log(N/4) = 1024*10 = 10240 comparisons

• Step #2: • Two N/2 merges • 2048 comparisons

• Step #3: • Final merge • 4096 comparisons

• Total: 3x speed up four threads • Parallel: 16384 comparisons • Sequential: ~50k comparisons

Example: Barrier-Based Merge Sort

6

Barrier

Barrier

t0 t1 t2 t3

Step 1

Step 2

Step 3

Page 6: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Global Synchronization Barrier

• At a barrier • All threads wait until all other threads have reached it

• Strawman implementation (wrong!) global (shared) count : integer := P procedure central_barrier if fetch_and_decrement(&count) == 1 count := P else repeat until count == P

• What is wrong with the above code?

7

Barrier

t0 t1 t2 t3

Page 7: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Sense-Reversing Barrier

• Correct barrier implementation: global (shared) count : integer := P global (shared) sense : Boolean := true local (private) local_sense : Boolean := true procedure central_barrier // each processor toggles its own sense local_sense := !local_sense if fetch_and_decrement(&count) == 1 count := P // last processor toggles global sense sense := local_sense else repeat until sense == local_sense

• Single counter makes this a “centralized” barrier

8

Page 8: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Other Barrier Implementations

• Problem with centralized barrier • All processors must increment each counter • Each read/modify/write is a serialized coherence action

• Each one is a cache miss • O(n) if threads arrive simultaneously, slow for lots of processors

• Combining Tree Barrier • Build a logk(n) height tree of counters (one per cache block) • Each thread coordinates with k other threads (by thread id) • Last of the k processors, coordinates with next higher node in tree • As many coordination address are used, misses are not serialized • O(log n) in best case

• Static and more dynamic variants • Tree-based arrival, tree-based or centralized release

9

Page 9: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Barrier Performance (from 1991)

10

56 . J M. Mellor-Crummey and M. L. Scott

120

M dissemination

I ~ treeloo–

.

A

80 – “

Time(p.) ‘0 -

40 –

20 –

o0

-. tournament

A arriv~ tree

4 counter

flag wakeup)

I I I I I I I I I2 4 6 8 10 12 14 16 18

Processors

Fig. 21. Performance of barmers on the Symmetry

smoother performance curve: each additional processor adds another level tosome path through the tree, or becomes the second child of some node in thewakeup tree that is delayed slightly longer than its sibling.

Figure 21 shows the performance on the Sequent Symmetry of severaldifferent barriers. Results differ sharply from those on the Butterfly for twoprincipal reasons. First, it is acceptable on the Symmetry for more than oneprocessor to spin on the same location; each obtains a copy in its cache.Second, no significant advantage arises from distributing writes across thememory modules of the machine; the shared bus enforces an overall serializa-tion. The dissemination barrier requires O ( P log P) bus transactions toachieve a P-processor barrier. The other four algorithms require O(P) trans-actions, and all perform better than the dissemination barrier for P > 8.

Below the maximum number of processors in our tests, the fastest barrieron the Symmetry used a centralized counter with a sense-reversing wakeupflag (from Figure 8). P bus transactions are required to tally arrivals, 1 totoggle the sense-reversing flag (invalidating all the cached copies), and P – 1to effect the subsequent reloads. Our tree barrier generates 2 P – 2 writes toflag variables on which other processors are waiting, necessitating an addi-tional 2 P – 2 reloads. By using a central sense-reversing flag for wakeup(instead of the wakeup tree), we can eliminate half of this overhead. Theresulting algorithm is identified as “arrival tree” in Figure 21. Though the

ACM Transactmns on Computer Systems, Vol 9, No 1, February 1991

From Mellor-Crummey & Scott, ACM TOCS, 1991

Page 10: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Locks

11

Page 11: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Common Parallel Idiom: Locking

12

• Protecting a shared data structure

• Example: parallel tree walk, apply f() to each node • Global “set” object, initialized with pointer to root of tree • Each thread, while (true):

• node* next = set.remove() • if next == NULL: return // terminate thread • func(code->value) // computationally intense function • if (next->right != NULL):

• set.insert(next->right) • if (next->left != NULL):

• set.insert(next->left)

• How do we protect the “set” data structure? • Also, to perform well, what element should it “remove” each step?

Page 12: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Common Parallel Idiom: Locking

13

• Parallel tree walk, apply f() to each node • Global “set” object, initialized with pointer to root of tree • Each thread, while (true):

• acquire(set.lock_ptr) • node* next = set.pop() • release(set.lock_ptr) • if next == NULL:

• return // terminate thread • func(node->value) // computationally intense • acquire(set.lock_ptr) • if (next->right != NULL)

• set.insert(next->right) • if (next->left != NULL)

• set.insert(next->left) • release(set.lock_ptr)

Put lock/unlock into pop() method?

Put lock/unlock into insert() method?

Page 13: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

• Only one thread can hold a “lock” at a time • Used a provide serialized access to a data object

• If another threads tries to acquire a held lock • Must wait until other thread performs a release

• Performance implications • Lock contention limits parallelism • Lock acquire/release time adds overheads

• Correctness implications • Just one example:

• Thread #1: Holds lock A, tries to acquire B • Thread #2: Holds lock B, tries to acquire A • Classic deadlock!

Lock-Based Mutual Exclusion

LOC K

WA I T

LOC K

LOC K

LOC K

WA I T

WA I T

t0 t1 t2 t3

Page 14: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Simple Boolean Spin Locks

• Simplest lock: • Single variable, two states: locked, unlocked • When unlocked: atomically transition from unlocked to locked • When locked: keep checking (spin) until the lock is unlocked

• Busy waiting versus “blocking” • In a multicore, busy wait for other thread to release lock

• Likely to happen soon, assuming critical sections are small • Likely nothing “better” for the processor to do anyway

• In a single processor, if trying to acquire a held lock, block • The only sensible option is to tell the O.S. to context switch • O.S. knows not to reschedule thread until lock is released

• Blocking has high overhead (O.S. call) • IMHO, rarely makes sense in multicore (parallel) programs

15

Page 15: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

Spin Locks and Contention

Companion slides for The Art of Multiprocessor

Programming by Maurice Herlihy & Nir Shavit

Page 16: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

17

Focus so far: Correctness

• Models – Accurate (we never lied to you)

– But idealized (so we forgot to mention a few things)

• Protocols – Elegant – Important – But naïve

Page 17: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

18

New Focus: Performance

• Models – More complicated (not the same as complex!)

– Still focus on principles (not soon obsolete)

• Protocols – Elegant (in their fashion)

– Important (why else would we pay attention)

– And realistic (your mileage may vary)

Page 18: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

19

What Should you do if you can’t get a lock?

• Keep trying – “spin” or “busy-wait” – Good if delays are short

• Give up the processor – Good if delays are long – Always good on uniprocessor

(1)

Page 19: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

20

What Should you do if you can’t get a lock?

• Keep trying – “spin” or “busy-wait” – Good if delays are short

• Give up the processor – Good if delays are long – Always good on uniprocessor

our focus

Page 20: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

21

Basic Spin-Lock

CS

Resets lock upon exit

spin lock

critical section

...

Page 21: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

22

Basic Spin-Lock

CS

Resets lock upon exit

spin lock

critical section

...

…lock introduces sequential bottleneck

Page 22: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

23

Basic Spin-Lock

CS

Resets lock upon exit

spin lock

critical section

...

…lock suffers from contention

Page 23: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

24

Basic Spin-Lock

CS

Resets lock upon exit

spin lock

critical section

...Notice: these are distinct phenomena

…lock suffers from contention

Page 24: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

25

Basic Spin-Lock

CS

Resets lock upon exit

spin lock

critical section

...

…lock suffers from contention

Seq Bottleneck ! no parallelism

Page 25: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

26

Basic Spin-Lock

CS

Resets lock upon exit

spin lock

critical section

...Contention ! ???

…lock suffers from contention

Page 26: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

27

Review: Test-and-Set

• Boolean value • Test-and-set (TAS)

– Swap true with current value – Return value tells if prior value was true

or false

• Can reset just by writing false • TAS aka “getAndSet”

Page 27: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

28

Review: Test-and-Setpublic class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) {

boolean prior = value; value = newValue; return prior; } }

(5)

Page 28: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

29

Review: Test-and-Setpublic class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) {

boolean prior = value; value = newValue; return prior; } } Package

java.util.concurrent.atomic

Page 29: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

30

Review: Test-and-Setpublic class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) {

boolean prior = value; value = newValue; return prior; } }

Swap old and new values

Page 30: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

31

Review: Test-and-SetAtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true)

Page 31: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

32

Review: Test-and-SetAtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true)

(5)

Swapping in true is called “test-and-set” or TAS

Page 32: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

33

Test-and-Set Locks

• Locking – Lock is free: value is false – Lock is taken: value is true

• Acquire lock by calling TAS – If result is false, you win – If result is true, you lose

• Release lock by writing false

Page 33: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

34

Test-and-set Lockclass TASlock { AtomicBoolean state = new AtomicBoolean(false);

void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Page 34: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

35

Test-and-set Lockclass TASlock { AtomicBoolean state = new AtomicBoolean(false);

void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Lock state is AtomicBoolean

Page 35: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

36

Test-and-set Lockclass TASlock { AtomicBoolean state = new AtomicBoolean(false);

void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Keep trying until lock acquired

Page 36: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

37

Test-and-set Lockclass TASlock { AtomicBoolean state = new AtomicBoolean(false);

void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Release lock by resetting state to false

Page 37: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

38

Space Complexity

• TAS spin-lock has small “footprint” • N thread spin-lock uses O(1) space • As opposed to O(n) Peterson/Bakery • How did we overcome the Ω(n) lower

bound? • We used a RMW operation…

Page 38: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

39

Performance

• Experiment – n threads – Increment shared counter 1 million times

• How long should it take? • How long does it take?

Page 39: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

40

Graph

ideal

tim

e

threads

no speedup because of sequential bottleneck

Page 40: Agenda - Computing Science - Simon Fraser Universityashriram/Courses/2016/CS431/slides/Week3… · Agenda • Lecture • Bottom-up motivation • Shared memory primitives • Shared

Art of Multiprocessor Programming© Herlihy-Shavit

2007

41

Mystery #1ti

me

threads

TAS lock

Ideal

(1)

What is going on?