And now for something completely different...

MongoDB Europe 2016 - Building WiredTiger


WiredTiger: Fast data structures in C

Keith Bostic MongoDB WiredTiger team [email protected]

#MDBE16

You are here: database layers

[Diagram: database layers: Middleware, Networking, Query APIs, Storage Engine]

Storage engines are performance critical

[Diagram: Middleware, Networking and Query APIs above three storage engines (mmapV1, RocksDB, WiredTiger); label: ACID transactional guarantees]

WiredTiger

• From (some of) the folks that brought you Berkeley DB
• High performance data engine
  • scalable throughput with low latency
• MongoDB’s default storage engine
  • a general-purpose workhorse

Next

→ Hardware (is the problem)
• Hazard pointers
• Skiplists
• Ticket locks

Modern servers have many CPUs/cores

[Diagram: core 1, core 2, core 3, ... core N]

Each core has multiple memory caches

[Diagram: core 1, core 2, core 3, ... core N, each with two or more caches]

Cache coherence: cores “snoop” on writes

[Diagram: core 1, core 2, core 3, ... core N, each with two or more caches, sharing Main Memory]

Traditional data engines struggle with this architecture

• Writing “shared” memory is slow
  • but databases exist to manage shared access to data!
• Snoopy cache-coherence scales poorly

Programmers solve with locking

• Locks are complex objects
  • get exclusive access to the lock state
  • review and update the lock state
  • “publish” (ensure every CPU sees the changes)
  • release exclusive access

Locking is slow

• Every operation requires exclusive access
  • even shared (“read”) locks require a lock/unlock cycle
  • thread stall is inevitable
• Locks require notification of every CPU
• Locks require exclusive access to the memory bus

Locking is expensive

• A lock per object is too much memory
  • POSIX locks are cache-aligned, up to 128B
  • grouping objects under locks makes contention worse
• More complexity to make locks “fair” and avoid starvation
  • add thread queues
  • wake up the next thread waiting for the lock

We need to find something else

If we can’t use locks, what do we use instead?

Today we’re going to talk about ways to get rid of locks.

WiredTiger is written in C

• Java or C++ are better choices for system programming
  • automatic memory management vs. malloc/free
  • exception handling vs. explicit error paths
  • widespread availability of reusable components
• Giving up programmer productivity

C is “portable assembler”

• Marshal typed values to/from unaligned memory
  • streaming compression, encryption, checksums
  • unstructured I/O to/from stable storage
• Light-weight access to shared data
  • use the underlying machine primitives that make up locks
  • algorithms/structures based on those primitives

You may have seen this last year:

Next

• Hardware
→ Hazard pointers
• Skiplists
• Ticket locks

Pages in the WiredTiger cache

[Diagram: cache pages (page 2, page 6, page 8, page 9)]

• Lots and lots (and lots) of pages
• MongoDB worker threads read from disk
• WiredTiger server threads evict to disk

A reasonable page-locking implementation

• MongoDB worker threads read, modify pages
• WiredTiger server threads evict pages from the cache
• Allocate a lock per page
  • MongoDB worker threads share pages
  • WiredTiger eviction threads require exclusive access

Page locking in the WiredTiger cache

[Diagram: pages 2, 6, 8 and 9, each with its own lock, accessed by reader, writer and eviction threads]

• thread stall on read locks!
• vulnerable to starvation
• too much memory

Introducing memory barriers

• Memory barriers
  • order reads, writes or both across a line of code
  • compiler won’t cache values or reorder across a barrier

•  Locks imply memory barriers
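A minimal sketch of what ordering writes "across a line of code" can look like, assuming C11 `<stdatomic.h>` fences rather than WiredTiger's own barrier macros. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;          /* ordinary shared data */
static atomic_bool ready;    /* flag that announces the data */

/* Writer: the release fence keeps the payload write ahead of the flag write. */
void publish(int value) {
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: the acquire fence keeps the payload read behind the flag read. */
bool try_consume(int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Without the two fences, the compiler or the CPU would be free to reorder the flag and payload accesses, and a reader could see `ready` set while still reading a stale `payload`.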


Something faster

• Hazard pointers: a technique for avoiding locks
• MongoDB worker threads
  • “log” intention to access a page
  • publish: a memory barrier to ensure global CPU visibility
• Write to a per-thread memory location
  • write won’t collide with other worker threads

What about eviction starvation?

• Add a per-page “blocker”
  • MongoDB worker won’t proceed if the page is blocked
• Cheap:
  • it’s only a bit of information
  • a read-only operation for workers

Worker threads

• Publish intent to access the page
• Memory barrier to ensure global CPU visibility
• If the page is not blocked, it’s accessible
• Clear intent to access when done

Hazard pointers for workers

[Diagram: pages 2, 6, 8 and 9, each with a blocking flag; the reader and writer threads each keep a per-thread list of the pages they are accessing (entries shown: page 2, page 6, page 9)]

Eviction server

• Block future worker thread access
• Memory barrier to ensure global CPU visibility
• Review worker thread access intentions
  • can either wait or quit

• Unblock worker thread access when done
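The worker and eviction steps can be sketched together as follows. This is an illustration under stated assumptions, not WiredTiger's implementation: it uses C11 atomics, a fixed table of `MAX_WORKERS` threads, and a single hazard slot per worker where the slides describe a per-thread list. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 4                       /* hypothetical fixed thread count */

typedef struct { atomic_bool blocked; } page_t;          /* per-page blocker */
typedef struct { _Atomic(page_t *) hazard; } worker_t;   /* per-thread slot */

static worker_t workers[MAX_WORKERS];

/* Worker: publish intent, then check the blocker. */
bool page_acquire(worker_t *w, page_t *p) {
    assert(w != NULL && p != NULL);
    atomic_store(&w->hazard, p);            /* sequentially consistent publish */
    if (atomic_load(&p->blocked)) {
        atomic_store(&w->hazard, NULL);     /* blocked: withdraw intent */
        return false;
    }
    return true;
}

/* Worker: clear intent when done with the page. */
void page_release(worker_t *w) {
    assert(w != NULL);
    atomic_store(&w->hazard, NULL);
}

/* Eviction: block new access, then review every worker's published intent. */
bool page_try_evict(page_t *p) {
    assert(p != NULL);
    atomic_store(&p->blocked, true);
    for (size_t i = 0; i < MAX_WORKERS; i++) {
        if (atomic_load(&workers[i].hazard) == p) {
            atomic_store(&p->blocked, false);   /* in use: quit and unblock */
            return false;
        }
    }
    return true;    /* exclusive access; the page stays blocked for the caller */
}
```

The worker writes its slot and then reads the blocker, while eviction writes the blocker and then reads the slots. With sequentially consistent operations at least one side sees the other's write, so a worker and the eviction thread should never both believe they own the page.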


Hazard pointers for workers and eviction

[Diagram: pages 2, 6, 8 and 9, each with a blocking flag; reader and writer threads with per-thread lists of the pages they are accessing; plus the eviction thread]

Something faster: hazard pointers

Replaces two lock/unlock pairs for each page access ... with a single memory barrier instruction.

• Transfers work to the eviction server
  • MongoDB worker latency is what we care about
• Memory costs
  • per-worker-thread list
  • per-page blocking flag

Next

• Hardware
• Hazard pointers
→ Skiplists
• Ticket locks

Introducing atomic instructions

• Atomic increment or decrement
  • read a value
  • change it and store it back without the possibility of racing
• Based on compare-and-swap (CAS) instruction
  • read value
  • update value if the value is unchanged
  • but fail if the value has changed
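As a sketch of how an atomic increment can be built from compare-and-swap, assuming C11 `<stdatomic.h>` (the function name `cas_increment` is hypothetical):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Increment *v with a CAS loop and return the new value. */
uint64_t cas_increment(_Atomic uint64_t *v) {
    assert(v != NULL);
    uint64_t old = atomic_load(v);              /* read value */
    /* Update only if the value is unchanged; on failure the CAS refreshes
     * `old` with the current value and the loop tries again. */
    while (!atomic_compare_exchange_weak(v, &old, old + 1)) {
    }
    return old + 1;
}
```

In practice `atomic_fetch_add` does the same job in one call; the loop is spelled out here only to show the read, compare and retry steps.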


Atomic prepend to singly-linked list

Update head if (and only if) head’s value is unchanged

[Diagram: NEW node linked in front of the current head]

new.next = head
compare_and_swap(head, new.next, new)
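The two-line pseudocode above might look like this in C11. It is a sketch only: the types `node_t` and `list_t` and the function `list_prepend` are hypothetical, and the retry loop is an addition the slide does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

typedef struct {
    _Atomic(node_t *) head;
} list_t;

/* Prepend n; safe against other threads prepending at the same time. */
void list_prepend(list_t *list, node_t *n) {
    assert(list != NULL && n != NULL);
    n->next = atomic_load(&list->head);          /* new.next = head */
    /* Swap head from n->next to n. If another thread changed head first,
     * the CAS stores the current head into n->next and we retry. */
    while (!atomic_compare_exchange_weak(&list->head, &n->next, n)) {
    }
}
```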


How WiredTiger uses skiplists

• WiredTiger pages start with a disk image
  • a compact representation we don’t want to modify
• Inserts and updates for the disk image are stored in skiplists

Skiplists start with a linked list

Singly-linked list with sorted values: 7, 10, 13, 18, 21, 24

7 → 10 → 13 → 18 → 21 → 24

Skiplists: add additional linked lists

Each higher level “skips” over more of the list

level 3: 7 → 21
level 2: 7 → 13 → 21 → 24
level 1: 7 → 10 → 13 → 18 → 21 → 24

Search for 18

search starts at the top-level

level 3: 7 → 21
level 2: 7 → 13 → 21 → 24
level 1: 7 → 10 → 13 → 18 → 21 → 24
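A short sketch of the search walk, assuming a three-level skiplist like the one in the figure. The node layout and the names `sk_node_t`, `skiplist_t` and `sk_search` are hypothetical. The code reads plain pointers, so it illustrates the walk itself; a version used alongside concurrent inserts would need atomic loads.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_LEVEL 3

typedef struct sk_node {
    int key;
    struct sk_node *next[SK_MAX_LEVEL];   /* next[0] is the bottom level */
} sk_node_t;

typedef struct {
    sk_node_t *head[SK_MAX_LEVEL];
} skiplist_t;

/* Return the node holding key, or NULL if the key is absent. */
sk_node_t *sk_search(skiplist_t *s, int key) {
    assert(s != NULL);
    sk_node_t **links = s->head;          /* forward links at the current position */
    for (int lvl = SK_MAX_LEVEL - 1; lvl >= 0;) {
        sk_node_t *n = links[lvl];
        if (n == NULL || n->key > key) {
            lvl--;                        /* overshoot: drop down one level */
            continue;
        }
        if (n->key == key)
            return n;
        links = n->next;                  /* still short: move right */
    }
    return NULL;
}
```

For the list in the figure, a search for 18 moves right to 7 on the top level, drops a level because 21 is too large, moves right to 13, drops again, and finds 18 on the bottom level.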


Skiplists, the great

Replaces a lock/unlock pair over the entire skiplist with one atomic memory instruction per object level

• Insert without locking
• Search without locking, while inserting
• Forward & backward traversal without locking, while inserting
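One way "one atomic instruction per level" can work is sketched below with C11 atomics. This is an illustration, not WiredTiger's code: the names are hypothetical, the caller picks the node height (a real skiplist usually picks it at random), and removal is not handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_LEVEL 3

typedef struct sk_node {
    int key;
    _Atomic(struct sk_node *) next[SK_MAX_LEVEL];
} sk_node_t;

typedef struct {
    _Atomic(sk_node_t *) head[SK_MAX_LEVEL];
} skiplist_t;

/* Insert n (key already set) with the given height; false on a duplicate key. */
bool sk_insert(skiplist_t *s, sk_node_t *n, int height) {
    assert(s != NULL && n != NULL && height >= 1 && height <= SK_MAX_LEVEL);
    for (;;) {
        _Atomic(sk_node_t *) *link[SK_MAX_LEVEL];   /* predecessor link per level */
        sk_node_t *succ[SK_MAX_LEVEL];              /* successor seen per level */
        _Atomic(sk_node_t *) *links = s->head;

        /* Search from the top level, remembering where n belongs on each level. */
        for (int lvl = SK_MAX_LEVEL - 1; lvl >= 0;) {
            sk_node_t *cur = atomic_load(&links[lvl]);
            if (cur != NULL && cur->key == n->key)
                return false;
            if (cur != NULL && cur->key < n->key) {
                links = cur->next;
                continue;
            }
            link[lvl] = &links[lvl];
            succ[lvl] = cur;
            lvl--;
        }

        /* Point n at its successors before it becomes visible to anyone. */
        for (int lvl = 0; lvl < height; lvl++)
            atomic_store(&n->next[lvl], succ[lvl]);

        /* The bottom-level CAS makes n reachable; if we lost a race, search again. */
        if (!atomic_compare_exchange_strong(link[0], &succ[0], n))
            continue;

        /* Upper levels are shortcuts: on a lost race, leave the node shorter. */
        for (int lvl = 1; lvl < height; lvl++)
            if (!atomic_compare_exchange_strong(link[lvl], &succ[lvl], n))
                break;
        return true;
    }
}
```

Only the bottom level decides whether a key is in the list, which is why a failed CAS on an upper level can simply be abandoned: searches stay correct, they just skip a little less.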


Skiplists, the good

• Simpler code than a Btree
  • WiredTiger binary search ~200 lines of code
  • a typical skiplist search < 20
• Fast search
  • a Btree guarantees search in logarithmic time
  • skiplists don’t offer a guarantee, but are usually close

Skiplists, the not-so-good

• Cache-unfriendly
  • every indirection is a CPU cache miss
• Memory-unfriendly
  • needs more memory for a data set than a Btree
• Removal requires locking
  • WiredTiger is an MVCC engine (multiple values per key)
  • removal is less important to WiredTiger

Next

• Hardware
• Hazard pointers
• Skiplists
→ Ticket locks

Ticket locks

• WiredTiger still needs to lock objects
  • but we can make locks faster
• Ticket locks
  • customers take a unique ticket number
  • customers served in ticket order

Ticket locks

[Diagram: “Please Take a Number” tickets 39 through 43 and a “Now Serving” sign]

Ticket locks

• Two incrementing counters:
  • ticket: the next available ticket number
  • serving: the ticket number now being served
• Thread takes a ticket number
• Thread increments “next available”
• Thread waits for “serving” to match its ticket number
• When thread finishes, increments “serving”
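The two-counter scheme can be sketched with C11 atomics as below. The names `ticket_lock_t`, `ticket_lock` and `ticket_unlock` are hypothetical, and the wait is a bare spin; a production lock would likely pause or yield while waiting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_uint ticket;     /* the next available ticket number */
    atomic_uint serving;    /* the ticket number now being served */
} ticket_lock_t;            /* zero-initialized means unlocked */

void ticket_lock(ticket_lock_t *l) {
    assert(l != NULL);
    /* Take a ticket and bump "next available" in one atomic step. */
    unsigned me = atomic_fetch_add(&l->ticket, 1);
    /* Wait until our number comes up. */
    while (atomic_load(&l->serving) != me) {
    }
}

void ticket_unlock(ticket_lock_t *l) {
    assert(l != NULL);
    atomic_fetch_add(&l->serving, 1);   /* serve the next ticket */
}
```

Because tickets are handed out in arrival order and served in numeric order, no waiting thread can be overtaken, which is where the fairness comes from.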


Ticket locks serialize threads

[Diagram: Thread A and Thread B take tickets 39 and 40 and are served in ticket order as “Now Serving” advances]

Ticket locks are almost what we need

• Ticket locks avoid starvation and are “fair”
• Smaller memory footprint
• Can be made significantly faster than POSIX locks
  • remember our compare-and-swap instructions!
• But POSIX read/write locks can be shared between readers

Ticket locks: shared vs. exclusive

• Three incrementing counters:
  • ticket: the next available ticket number
  • readers: the next reader to be served
  • writers: the next writer to be served

Readers run in parallel

[Diagram: Threads A, B and C take tickets 39 to 41; separate “Readers” and “Writers” counters advance so that consecutive readers run at the same time]

Multiple variable updates without locking

• Updating multiple counters would require locking ... but we can write the bus width atomically

• Encode the entire lock state in a single 8B value

    lock {
        uint16_t readers;
        uint16_t writers;
        uint16_t ticket;   // 64K simultaneous threads
        uint16_t unused;
    }
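A sketch of a shared/exclusive ticket lock that keeps all three counters in one 64-bit word, so any combination of them can be updated with a single compare-and-swap. This is one possible reading of the slide, not WiredTiger's source: the bit layout, the helper `rw_bump` and every function name are assumptions, and the waits are bare spins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { _Atomic uint64_t state; } rwticket_t;   /* zero means unlocked */

/* Field accessors; the position of each counter in the word is arbitrary. */
static uint16_t rw_readers(uint64_t s) { return (uint16_t)s; }
static uint16_t rw_writers(uint64_t s) { return (uint16_t)(s >> 16); }
static uint16_t rw_ticket(uint64_t s)  { return (uint16_t)(s >> 32); }

/* Add to any of the three counters in one CAS; returns the prior state. */
static uint64_t rw_bump(rwticket_t *l, uint16_t dr, uint16_t dw, uint16_t dt) {
    assert(l != NULL);
    uint64_t old = atomic_load(&l->state), next;
    do {
        next = (uint64_t)(uint16_t)(rw_readers(old) + dr)
             | (uint64_t)(uint16_t)(rw_writers(old) + dw) << 16
             | (uint64_t)(uint16_t)(rw_ticket(old) + dt) << 32;
    } while (!atomic_compare_exchange_weak(&l->state, &old, next));
    return old;
}

void rw_read_lock(rwticket_t *l) {
    uint16_t me = rw_ticket(rw_bump(l, 0, 0, 1));           /* take a ticket */
    while (rw_readers(atomic_load(&l->state)) != me) {      /* wait for our turn */
    }
    rw_bump(l, 1, 0, 0);     /* admit the next reader right away */
}

void rw_read_unlock(rwticket_t *l) {
    rw_bump(l, 0, 1, 0);     /* one fewer holder ahead of the next writer */
}

void rw_write_lock(rwticket_t *l) {
    uint16_t me = rw_ticket(rw_bump(l, 0, 0, 1));           /* take a ticket */
    while (rw_writers(atomic_load(&l->state)) != me) {      /* wait for all earlier holders */
    }
}

void rw_write_unlock(rwticket_t *l) {
    rw_bump(l, 1, 1, 0);     /* advance both counters in one atomic update */
}
```

A reader advances `readers` as soon as it is admitted, so a reader holding the next ticket can enter alongside it. A writer waits on `writers`, which only reaches its ticket after every earlier holder has unlocked.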


Ticket locks

Replaces two higher-level lock/unlock calls ... with two atomic instructions.

That’s a (very) fast introduction....

• Hazard pointers
• Skiplists
• Ticket locks

Open Source implementations are available in WiredTiger, including Public Domain ticket locks.


WiredTiger distribution

• Standalone application database toolkit library
  • key-value store (NoSQL)
  • row-store, column-store and LSM engines
  • schema layer includes data types and indexes
• Another MongoDB Open Source contribution
  • WiredTiger available for other applications
  • https://github.com/wiredtiger

Thank you! Keith Bostic [email protected]