45

Click here to load reader

Lock-Free, Wait-Free Hash Table

Embed Size (px)

DESCRIPTION

In this presentation, Dr. Cliff Click describes a totally lock-free hashtable with extremely low-cost and near perfect scaling. Readers pay no more than HashMap readers: just the cost of computing the hash, loading and comparing the key, and returning the value. Writers must use AtomicUpdate instead of a simple assignment but otherwise pay the same as readers. In particular, there is no required order between loads and stores; correctness is assured, no matter how the hardware orders memory operations. A state-based technique demonstrates the correctness of the algorithm. This novel approach is very straightforward and much easier to understand than the usual "happens-before" memory-order-based reasoning.

Citation preview

Page 1: Lock-Free, Wait-Free Hash Table

A Lock-Free Hash Table

2007 JavaOne Conference

A Lock-Free Hash Table

Dr. Cliff ClickChief JVM Architect & Distinguished Engineerblogs.azulsystems.com/cliffAzul SystemsMay 8, 2007

Page 2: Lock-Free, Wait-Free Hash Table

2| ©2003 Azul Systems, Inc.

www.azulsystems.com

• Constant-Time Key-Value Mapping• Fast arbitrary function• Extendable, defined at runtime• Used for symbol tables, DB caching, network

access, url caching, web content, etc• Crucial for Large Business Applications

─ > 1MLOC• Used in Very heavily multi-threaded apps

─ > 1000 threads

Hash Tables

Page 3: Lock-Free, Wait-Free Hash Table

3| ©2003 Azul Systems, Inc.

www.azulsystems.com

Popular Java Implementations

• Java's HashTable─ Single threaded; scaling bottleneck

• HashMap─ Faster but NOT multi-thread safe

• java.util.concurrent.HashMap─ Striped internal locks; 16-way the default

• Azul, IBM, Sun sell machines >100cpus• Azul has customers using all cpus in same app• Becomes a scaling bottleneck!

Page 4: Lock-Free, Wait-Free Hash Table

4| ©2003 Azul Systems, Inc.

www.azulsystems.com

A Lock-Free Hash Table

• No locks, even during table resize─ No spin-locks─ No blocking while holding locks─ All CAS spin-loops bounded─ Make progress even if other threads die....

• Requires atomic update instruction: CAS (Compare-And-Swap), LL/SC (Load-Linked/Store-Conditional, PPC only) or similar

• Uses sun.misc.Unsafe for CAS

Page 5: Lock-Free, Wait-Free Hash Table

5| ©2003 Azul Systems, Inc.

www.azulsystems.com

• Slightly faster than j.u.c for 99% reads < 32 cpus• Faster with more cpus (2x faster)

─ Even with 4096-way striping─ 10x faster with default striping

• 3x Faster for 95% reads (30x vs default)• 8x Faster for 75% reads (100x vs default)• Scales well up to 768 cpus, 75% reads

─ Approaches hardware bandwidth limits

A Faster Hash Table

Page 6: Lock-Free, Wait-Free Hash Table

6| ©2003 Azul Systems, Inc.

www.azulsystems.com

Agenda

• Motivation• “Uninteresting” Hash Table Details• State-Based Reasoning• Resize• Performance• Q&A

Page 7: Lock-Free, Wait-Free Hash Table

7| ©2003 Azul Systems, Inc.

www.azulsystems.com

Some “Uninteresting” Details

• Hashtable: A collection of Key/Value Pairs• Works with any collection• Scaling, locking, bottlenecks of the collection

management responsibility of that collection• Must be fast or O(1) effects kill you• Must be cache-aware• I'll present a sample Java solution

─ But other solutions can work, make sense

Page 8: Lock-Free, Wait-Free Hash Table

8| ©2003 Azul Systems, Inc.

www.azulsystems.com

“Uninteresting” Details

• Closed Power-of-2 Hash Table─ Reprobe on collision─ Stride-1 reprobe: better cache behavior

• Key & Value on same cache line• Hash memoized

─ Should be same cache line as K + V─ But hard to do in pure Java

• No allocation on get() or put()• Auto-Resize

Page 9: Lock-Free, Wait-Free Hash Table

9| ©2003 Azul Systems, Inc.

www.azulsystems.com

Example get() code

idx = hash = key.hashCode();while( true ) { // reprobing loop idx &= (size-1); // limit idx to table size k = get_key(idx); // start cache miss early h = get_hash(idx); // memoized hash if( k == key || (h == hash && key.equals(k)) ) return get_val(idx);// return matching value if( k == null ) return null; idx++; // reprobe}

Page 10: Lock-Free, Wait-Free Hash Table

10| ©2003 Azul Systems, Inc.

www.azulsystems.com

“Uninteresting” Details

• Could use prime table + MOD─ Better hash spread, fewer reprobes─ But MOD is 30x slower than AND

• Could use open table─ put() requires allocation─ Follow 'next' pointer instead of reprobe─ Each 'next' is a cache miss─ Lousy hash -> linked-list traversal

• Could put Key/Value/Hash on same cache line• Other variants possible, interesting

Page 11: Lock-Free, Wait-Free Hash Table

11| ©2003 Azul Systems, Inc.

www.azulsystems.com

Agenda

• Motivation• “Uninteresting” Hash Table Details• State-Based Reasoning• Resize• Performance• Q&A

Page 12: Lock-Free, Wait-Free Hash Table

12| ©2003 Azul Systems, Inc.

www.azulsystems.com

Ordering and Correctness

• How to show table mods correct?─ put, putIfAbsent, change, delete, etc.

• Prove via: fencing, memory model, load/store ordering, “happens-before”?

• Instead prove* via state machine• Define all possible {Key,Value} states• Define Transitions, State Machine• Show all states “legal”

*Warning: hand-wavy proof follows

Page 13: Lock-Free, Wait-Free Hash Table

13| ©2003 Azul Systems, Inc.

www.azulsystems.com

State-Based Reasoning

• Define all {Key,Value} states and transitions• Don't Care about memory ordering:

─ get() can read Key, Value in any order─ put() can change Key, Value in any order─ put() must use CAS to change Key or Value

─ But not double-CAS• No fencing required for correctness!

─ (sometimes stronger guarantees are wanted and will need fencing)

• Proof is simple!

Page 14: Lock-Free, Wait-Free Hash Table

14| ©2003 Azul Systems, Inc.

www.azulsystems.com

• A Key slot is:─ null – empty─ K – some Key; can never change again

• A Value slot is:─ null – empty─ T – tombstone─ V – some Values

• A state is a {Key,Value} pair• A transition is a successful CAS

Valid States

Page 15: Lock-Free, Wait-Free Hash Table

15| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine

{null,null}

Empty

{K,T}

{null,T/V}

Partially inserted K/V pair -Reader-only state

{K,V}

Standard K/V pair

insert

dele

teinsert

change

deleted key

{K,null}

Partially inserted K/V pair

Page 16: Lock-Free, Wait-Free Hash Table

16| ©2003 Azul Systems, Inc.

www.azulsystems.com

Example put(key,newval) code:

idx = hash = key.hashCode();while( true ) { // Key-Claim stanza idx &= (size-1); k = get_key(idx); // State: {k,?} if( k == null && // {null,?} -> {key,?} CAS_key(idx,null,key) ) break; // State: {key,?} h = get_hash(idx); // get memoized hash if( k == key || (h == hash && key.equals(k)) ) break; // State: {key,?} idx++; // reprobe}

Page 17: Lock-Free, Wait-Free Hash Table

17| ©2003 Azul Systems, Inc.

www.azulsystems.com

Example put(key,newval) code

// State: {key,?}oldval = get_val(idx); // State: {key,oldval}// Transition: {key,oldval} -> {key,newval}if( CAS_val(idx,oldval,newval) ) { // Transition worked ... // Adjust size} else { // Transition failed; oldval has changed // We can act “as if” our put() worked but // was immediately stomped over}return oldval;

Page 18: Lock-Free, Wait-Free Hash Table

18| ©2003 Azul Systems, Inc.

www.azulsystems.com

Some Things to Notice

• Once a Key is set, it never changes─ No chance of returning Value for wrong Key─ Means Keys leak; table fills up with dead Keys

─ Fix in a few slides...• No ordering guarantees provided!

─ Bring Your Own Ordering/Synchronization• Weird {null,V} state meaningful but uninteresting

─ Means reader got an empty key and so missed─ But possibly prefetched wrong Value

Page 19: Lock-Free, Wait-Free Hash Table

19| ©2003 Azul Systems, Inc.

www.azulsystems.com

Some Things to Notice

• There is no machine-wide coherent State!• Nobody guaranteed to read the same State

─ Except on the same CPU with no other writers• No need for it either• Consider degenerate case of a single Key• Same guarantees as:

─ single shared global variable ─ many readers & writers, no synchronization─ i.e., darned little

Page 20: Lock-Free, Wait-Free Hash Table

20| ©2003 Azul Systems, Inc.

www.azulsystems.com

A Slightly Stronger Guarantee

• Probably want “happens-before” on Values─ java.util.concurrent provides this

• Similar to declaring that shared global 'volatile'• Things written into a Value before put()

─ Are guaranteed to be seen after a get()• Requires st/st fence before CAS'ing Value

─ “free” on Sparc, X86• Requires ld/ld fence after loading Value

─ “free” on Azul

Page 21: Lock-Free, Wait-Free Hash Table

21| ©2003 Azul Systems, Inc.

www.azulsystems.com

Agenda

• Motivation• “Uninteresting” Hash Table Details• State-Based Reasoning• Resize• Performance• Q&A

Page 22: Lock-Free, Wait-Free Hash Table

22| ©2003 Azul Systems, Inc.

www.azulsystems.com

Resizing The Table

• Need to resize if table gets full• Or just re-probing too often• Resize copies live K/V pairs

─ Doubles as cleanup of dead Keys─ Resize (“cleanse”) after any delete─ Throttled, once per GC cycle is plenty often

• Alas, need fencing, 'happens before'• Hard bit for concurrent resize & put():

─ Must not drop the last update to old table

Page 23: Lock-Free, Wait-Free Hash Table

23| ©2003 Azul Systems, Inc.

www.azulsystems.com

Resizing

• Expand State Machine• Side-effect: mid-resize is a valid State• Means resize is:

─ Concurrent – readers can help, or just read&go─ Parallel – all can help─ Incremental – partial copy is OK

• Pay an extra indirection while resize in progress─ So want to finish the job eventually

• Stacked partial resizes OK, expected

Page 24: Lock-Free, Wait-Free Hash Table

24| ©2003 Azul Systems, Inc.

www.azulsystems.com

get/put during Resize

• get() works on the old table─ Unless see a sentinel

• put() or other mod must use new table• Must check for new table every time

─ Late writes to old table 'happens before' resize• Copying K/V pairs is independent of get/put• Copy has many heuristics to choose from:

─ All touching threads, only writers, unrelated background thread(s), etc

Page 25: Lock-Free, Wait-Free Hash Table

25| ©2003 Azul Systems, Inc.

www.azulsystems.com

New State: 'use new table' Sentinel

• X: sentinel used during table-copy─ Means: not in old table, check new

• A Key slot is:─ null, K─ X – 'use new table', not any valid Key─ null → K OR null → X

• A Value slot is:─ null, T, V─ X – 'use new table', not any valid Value─ null → {T,V}* → X

Page 26: Lock-Free, Wait-Free Hash Table

26| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine – old table

{null,null}

Empty {K,T}

{null,T/V/X}

Partially inserted K/V pair

{K,V}

Standard K/V pair

insert

dele

te

insert

change

Deleted key

{K,V} into newer tablecopy

{K,X}

kill

kill{X,null}

check newer table

{K,null}

States {X,T/V/X} not possible

Page 27: Lock-Free, Wait-Free Hash Table

27| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine: Copy One Pair

{null,null} {X,null}

empty

Page 28: Lock-Free, Wait-Free Hash Table

28| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine: Copy One Pair

{K,T/null} {K,X}

dead or partially inserted

{null,null} {X,null}

empty

Page 29: Lock-Free, Wait-Free Hash Table

29| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine: Copy One Pair

{K,V1}

alive, but old

{K,V2}

{K,X}

{K,T/null} {K,X}

{K,V2}

dead or partially inserted

old tablenew table

{null,null} {X,null}

empty

Page 30: Lock-Free, Wait-Free Hash Table

30| ©2003 Azul Systems, Inc.

www.azulsystems.com

Copying Old To New

• New States V', T' – primed versions of V,T─ Prime'd values in new table copied from old─ Non-prime in new table is recent put()

─ “happens after” any prime'd value─ Prime allows 2-phase commit─ Engineering: wrapper class (Java), steal bit (C)

• Must be sure to copy late-arriving old-table write• Attempt to copy atomically

─ May fail & copy does not make progress─ But old, new tables not damaged

Page 31: Lock-Free, Wait-Free Hash Table

31| ©2003 Azul Systems, Inc.

www.azulsystems.com

New States: Prime'd

• A Key slot is:─ null, K, X

• A Value slot is:─ null, T, V, X─ T',V' – primed versions of T & V

─ Old things copied into the new table─ “2-phase commit”

─ null → {T',V'}* → {T,V}* → X• State Machine again...

Page 32: Lock-Free, Wait-Free Hash Table

32| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine – new table

{null,null}

Empty {K,T}

{null,T/T'/V/V'/X}

Partially inserted K/V pair

{K,V}

Standard K/V pair

insert

dele

te

insert

Deleted key

{K,V} into newer tablecopy

{K,X}

kill

kill{X,null}

check newer table

{K,null}

States {X,T/T'/V/V'/X} not possible

{K,T'/V'}

copy in from older table

Page 33: Lock-Free, Wait-Free Hash Table

33| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine – new table

{null,null}

Empty {K,T}

{null,T/T'/V/V'/X}

Partially inserted K/V pair

{K,V}

Standard K/V pair

insert

dele

te

insert

Deleted key

{K,V} into newer tablecopy

{K,X}

kill

kill{X,null}

check newer table

{K,null}

States {X,T/T'/V/V'/X} not possible

{K,T'/V'}

copy in from older table

Page 34: Lock-Free, Wait-Free Hash Table

34| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine – new table

{null,null}

Empty {K,T}

{null,T/T'/V/V'/X}

Partially inserted K/V pair

{K,V}

Standard K/V pair

insert

dele

te

insert

Deleted key

{K,V} into newer tablecopy

{K,X}

kill

kill{X,null}

check newer table

{K,null}

States {X,T/T'/V/V'/X} not possible

{K,T'/V'}

copy in from older table

Page 35: Lock-Free, Wait-Free Hash Table

35| ©2003 Azul Systems, Inc.

www.azulsystems.com

State Machine: Copy One Pair

partial copy

{K,X}

K,V' in new tableX in old table

copy complete

Fence{K,V'

x}

{K,V'1}

{K,V1}

{K,V'1}

{K,V1}Fence

read V'x

read V1

oldnew

Page 36: Lock-Free, Wait-Free Hash Table

36| ©2003 Azul Systems, Inc.

www.azulsystems.com

Some Things to Notice

• Old value could be V or T ─ or V' or T' (if nested resize in progress)

• Skip copy if new Value is not prime'd─ Means recent put() overwrote any old Value

• If CAS into new fails─ Means either put() or other copy in progress─ So this copy can quit

• Any thread can see any state at any time─ And CAS to the next state

Page 37: Lock-Free, Wait-Free Hash Table

37| ©2003 Azul Systems, Inc.

www.azulsystems.com

Agenda

• Motivation• “Uninteresting” Hash Table Details• State-Based Reasoning• Resize• Performance• Q&A

Page 38: Lock-Free, Wait-Free Hash Table

38| ©2003 Azul Systems, Inc.

www.azulsystems.com

• Measure insert/lookup/remove of Strings• Tight loop: no work beyond HashTable itself and

test harness (mostly RNG)• “Guaranteed not to exceed” numbers• All fences; full ConcurrentHashMap semantics• Variables:

─ 99% get, 1% put (typical cache) vs 75 / 25─ Dual Athalon, Niagara, Azul Vega1, Vega2─ Threads from 1 to 800─ NonBlocking vs 4096-way ConcurrentHashMap─ 1K entry table vs 1M entry table

Microbenchmark

Page 39: Lock-Free, Wait-Free Hash Table

39| ©2003 Azul Systems, Inc.

www.azulsystems.com

1 2 3 4 5 6 7 80

5

10

15

20

25

30

Threads

M-o

ps/s

ec

NB-99

CHM-99

NB-75

CHM-75

1 2 3 4 5 6 7 80

5

10

15

20

25

30

Threads

M-o

ps/s

ecNB

CHM

1K Table 1M Table

AMD 2.4GHz – 2 (ht) cpus

Page 40: Lock-Free, Wait-Free Hash Table

40| ©2003 Azul Systems, Inc.

www.azulsystems.com

0 8 16 24 32 40 48 56 640

10

20

30

40

50

60

70

80

Threads

M-o

ps/s

ec

0 8 16 24 32 40 48 56 640

10

20

30

40

50

60

70

80

Threads

M-o

ps/s

ec

1K Table 1M Table

NB

CHM

Niagara – 8x4 cpus

CHM-99, NB-99

CHM-75, NB-75

Page 41: Lock-Free, Wait-Free Hash Table

41| ©2003 Azul Systems, Inc.

www.azulsystems.com

0 100 200 300 4000

100

200

300

400

500

Threads

M-o

ps/s

ec

0 100 200 300 4000

100

200

300

400

500

Threads

M-o

ps/s

ec

NB-99

CHM-99

NB-75

CHM-75

1K Table 1M Table

Azul Vega1 – 384 cpus

CHM

NB

Page 42: Lock-Free, Wait-Free Hash Table

42| ©2003 Azul Systems, Inc.

www.azulsystems.com

0 100 200 300 400 500 600 700 8000

200

400

600

800

1000

1200

Threads

M-o

ps/s

ec

0 100 200 300 400 500 600 700 8000

200

400

600

800

1000

1200

Threads

M-o

ps/s

ec

NB-99

CHM-99

NB-75

CHM-75

1K Table 1M Table

NB

CHM

Azul Vega2 – 768 cpus

Page 43: Lock-Free, Wait-Free Hash Table

43| ©2003 Azul Systems, Inc.

www.azulsystems.com

Summary

• A faster lock-free HashTable• Faster for more CPUs• Much faster for higher table modification rate• State-Based Reasoning:

─ No ordering, no JMM, no fencing● Any thread can see any state at any time

● Must assume values change at each step● State graphs really helped coding & debugging● Resulting code is small & fast

Page 44: Lock-Free, Wait-Free Hash Table

44| ©2003 Azul Systems, Inc.

www.azulsystems.com

Summary

● Obvious future work:● Tools to check states● Tools to write code

● Seems applicable to other data structures as well● Concurrent append j.u.Vector● Scalable near-FIFO work queues

● Code & Video available at:

http://blogs.azulsystems.com/cliff/

Page 45: Lock-Free, Wait-Free Hash Table

Thank You

WWW.AZULSYSTEMS.COM

#1 Platform for Business Critical Java™