ForestDB: Next Generation Storage Engine for Couchbase Chin Hong | VP Product Management, Couchbase

Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015


Page 1: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB: Next Generation Storage Engine for Couchbase

Chin Hong | VP Product Management, Couchbase

Page 2: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

©2015 Couchbase Inc. 2

Contents
• Why do we need a new KV storage engine?
• ForestDB
  – HB+-Trie
  – Performance evaluations
• Optimizations for Flash-Based SSDs
• Summary

Page 3: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Why do we need a new KV storage engine?

Page 4: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Modern Web / Mobile / IoT Applications

• Operate on huge volumes of unstructured data
• Significant amount of new data is constantly generated from hundreds of millions of users or devices
• Still require high performance and scalability in managing their ever-growing database
• Underlying storage engine, complementing fast memory and fast network, is one of the most critical parts in database systems to provide high performance / scalability

Page 5: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Common Storage Structures – B+-Tree

B+-Tree
• Reads: Good read performance if the fan-out is high ("short" tree) for small fixed-length keys. Read performance degrades for variable-length keys.
• Writes: Update-in-place results in random writes and bad write latency. Append-only file improves write performance but requires periodic compaction.
• Engines: SQLite, InnoDB, Couchstore (append-only), WiredTiger B+

[Figure: B+-Tree node holding key and value (or pointer) entries; annotation: longer keys]

Page 6: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Common Storage Structures – LSM-Tree

LSM-Tree
• Reads: Reads may have to traverse multiple trees, so they are typically worse than with a B+-Tree.
• Writes: WAL (Write-Ahead Log) improves writes, with in-memory trees that are appended to the end of the log.
• Engines: LevelDB, RocksDB, Cassandra, WiredTiger LSM

[Figure: in-memory tree and sequential log; flush/merge into the C1 tree, then merge into the C2 tree; capacity increases exponentially]

Page 7: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Goals for Next-Generation Storage Engine

• Fast and scalable index structure for variable or fixed-length long keys
  – Targeting not only SSDs but also legacy HDDs
• Less storage space overhead
  – Reduce write amplification
• Efficient for different key patterns
  – Keys with or without common prefixes
• Efficient for mixed workloads

Page 8: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB

Page 9: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB

• Key-Value storage engine developed by Couchbase Caching / Storage team
• Its main index structure is built from Hierarchical B+-Tree based Trie or HB+-Trie
  – Originally presented at ACM SIGMOD in 2011 by Jung-Sang Ahn
  – ForestDB paper accepted for publication in IEEE Trans. on Computers
• Significantly better read and write performance with less storage overhead
• Underlying storage engine for secondary index, mobile, and key-value engine in Couchbase

Page 10: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Main Features

• Multi-Version Concurrency Control (MVCC) with append-only storage model
• Write-Ahead Logging (WAL)
• A value can be retrieved by its sequence number or disk offset in addition to a key
• Custom compare function to support a customized key order
• Snapshot support to provide different views of database
• Rollback to revert the database to a specific point
• Ranged iteration by keys or sequence numbers
• Transactional support with read_committed or read_uncommitted isolation level
• Multiple key-value instances per database file
• Manual or auto compaction configured per database file

Page 11: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB: Main Index Structure

Page 12: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

HB+Trie (Hierarchical B+Tree based Trie)

• Trie (prefix tree) whose node is a B+Tree
• A key is split into a list of fixed-size chunks (sub-strings of the key)

Variable-length key: a83jgls83jgo29a…
Fixed-size (e.g. 4-byte) chunks: Chunk1 = a83j, Chunk2 = gls8, Chunk3 = 3jgo, …

[Figure: lexicographically ordered traversal. Each node of the HB+Trie is a B+Tree. Search the first B+Tree using Chunk1, the next using Chunk2, the next using Chunk3, until the document is reached.]
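The chunking step on this slide can be illustrated with a minimal Python sketch. The function name split_into_chunks is hypothetical and not part of the ForestDB API; it only shows how a variable-length key becomes a list of fixed-size chunks, with the last chunk allowed to be shorter.

```python
def split_into_chunks(key: bytes, chunk_size: int = 4) -> list[bytes]:
    """Split a variable-length key into fixed-size chunks.

    Hypothetical helper for illustration only; the final chunk may be
    shorter than chunk_size when the key length is not a multiple of it.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [key[i:i + chunk_size] for i in range(0, len(key), chunk_size)]


# The first three chunks of the key shown on the slide.
chunks = split_into_chunks(b"a83jgls83jgo", chunk_size=4)
print(chunks)  # [b'a83j', b'gls8', b'3jgo']
```

Each chunk would then serve as the search key for one B+Tree level of the trie.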

Page 13: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Insert ‘aaaa’

[Figure: root B+Tree using the 1st chunk as key]

Page 14: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Insert ‘aaaa’: distinguishable by first chunk ‘a’

[Figure: root B+Tree using the 1st chunk as key; entry a → aaaa]

Page 15: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Insert ‘bbbb’: distinguishable by first chunk ‘b’

[Figure: root B+Tree using the 1st chunk as key; entries a → aaaa, b → bbbb]

Page 16: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Insert ‘aaab’: cannot distinguish using first chunk ‘a’

[Figure: root B+Tree using the 1st chunk as key; entries a → aaaa, b → bbbb]

Page 17: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Insert ‘aaab’: cannot distinguish using first chunk ‘a’. First distinguishable chunk: 4th

[Figure: root B+Tree using the 1st chunk as key; entries a → aaaa, b → bbbb]

Page 18: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Store skipped common prefix ‘aa’

[Figure: root B+Tree (1st chunk) with entries a → child B+Tree, b → bbbb. The child B+Tree uses the 4th chunk as key, skipping common prefix ‘aa’; entries a → aaaa, b → aaab]

Page 19: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Insert ‘bbcd’: cannot distinguish using first chunk ‘b’

[Figure: root B+Tree (1st chunk) with entries a → child B+Tree, b → bbbb. The child B+Tree uses the 4th chunk as key, skipping common prefix ‘aa’; entries a → aaaa, b → aaab]

Page 20: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Insert ‘bbcd’: cannot distinguish using first chunk ‘b’. First distinguishable chunk: 3rd

[Figure: root B+Tree (1st chunk) with entries a → child B+Tree, b → bbbb. The child B+Tree uses the 4th chunk as key, skipping common prefix ‘aa’; entries a → aaaa, b → aaab]

Page 21: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

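Prefix Compression

• As in the original trie, each node (B+Tree) is created on-demand (except for root node)
• Example: Chunk size = 1 byte

Store skipped common prefix ‘b’

[Figure: root B+Tree (1st chunk) with entries a → child B+Tree A, b → child B+Tree B. Child A uses the 4th chunk as key, skipping common prefix ‘aa’; entries a → aaaa, b → aaab. Child B uses the 3rd chunk as key, skipping common prefix ‘b’; entries b → bbbb, c → bbcd]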
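The walkthrough above can be checked with a small Python sketch. The helper first_distinguishable_chunk is a hypothetical name, not ForestDB code, and it assumes the two keys already collide on chunk 1 in the parent B+Tree. It returns the 1-based number of the first chunk where the keys differ, plus the common chunks that are skipped between the parent's chunk and that one. Keys where one is a prefix of the other are not handled in this simplified version.

```python
def first_distinguishable_chunk(existing: bytes, new: bytes, chunk_size: int = 1):
    """Return (1-based chunk number, skipped common prefix) for two keys.

    Illustrative only. The skipped prefix is the run of identical chunks
    after chunk 1 (already consumed by the parent B+Tree) and before the
    first chunk in which the two keys differ.
    """
    a = [existing[i:i + chunk_size] for i in range(0, len(existing), chunk_size)]
    b = [new[i:i + chunk_size] for i in range(0, len(new), chunk_size)]
    for idx, (chunk_a, chunk_b) in enumerate(zip(a, b)):
        if chunk_a != chunk_b:
            skipped = b"".join(a[1:idx])
            return idx + 1, skipped
    raise ValueError("keys are identical or one is a prefix of the other")


print(first_distinguishable_chunk(b"aaaa", b"aaab"))  # (4, b'aa')
print(first_distinguishable_chunk(b"bbbb", b"bbcd"))  # (3, b'b')
```

Both results match the slides: the new B+Tree for ‘aaaa’/‘aaab’ is keyed on the 4th chunk with skipped prefix ‘aa’, and the one for ‘bbbb’/‘bbcd’ is keyed on the 3rd chunk with skipped prefix ‘b’.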

Page 22: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Compact Index Structure

• Majority of keys can be indexed by first chunk
• There will be only one B+Tree on HB+Trie
• We don’t need to store & compare entire key string
• When keys have common prefixes (e.g., secondary index keys)
• When keys are sufficiently long & uniform random (e.g., UUID or hash value)

Example: Chunk size = 4 bytes

Insert a83jfl2iejzm302k, dpwk3gjrieorigje, z9382h3igor8eh4k, 283hgoeir8goerha, 023o8f9o8zufisue

[Figure: a single B+Tree using the 1st chunk as key; entries a83j → a83jfl2iejzm302k, dpwk → dpwk3gjrieorigje, z938 → z9382h3igor8eh4k, 283h → 283hgoeir8goerha, 023o → 023o8f9o8zufisue]
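As a quick check of this example, the following Python sketch takes the five keys from the slide and confirms that their first 4-byte chunks are all distinct, which is why a single root B+Tree is enough here. The variable names are illustrative only.

```python
# The five keys inserted on the slide.
keys = [
    b"a83jfl2iejzm302k",
    b"dpwk3gjrieorigje",
    b"z9382h3igor8eh4k",
    b"283hgoeir8goerha",
    b"023o8f9o8zufisue",
]
CHUNK_SIZE = 4  # bytes, as in the slide's example

# Only the first chunk of each key is needed as the B+Tree key.
first_chunks = [key[:CHUNK_SIZE] for key in keys]

# If every first chunk is unique, no child B+Tree has to be created.
single_root_tree = len(set(first_chunks)) == len(keys)

print(first_chunks)
print(single_root_tree)  # True
```

In this case the root stores 4 bytes per entry instead of the full 16-byte key.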

Page 23: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Compact Index Structure - Benefits

Suppose that
– Node size: 4 KB, Key length: 64 bytes, Pointer (or value) size: 8 bytes
– Indexing 1 billion keys

Fanout
  Original B+Tree: 4096 / (64+8) ~= 56
  HB+Trie (4-byte chunk): 4096 / (4+8) ~= 341
Height
  Original B+Tree: log_56(1000^3) ~= 6
  HB+Trie (4-byte chunk): log_341(1000^3) ~= 4
Space needed for the index
  Original B+Tree: 4KB * (56^0 + 56^1 + ... + 56^5) ~= 2139.07 GB
  HB+Trie (4-byte chunk): 4KB * (341^0 + 341^1 + ... + 341^3) ~= 151.70 GB

14.1 times smaller

• Compaction overhead can be reduced significantly
• Buffer cache can accommodate more pages and manage them more efficiently
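The figures on this slide can be reproduced with a short Python sketch. The node-count model is an assumption inferred from the slide's numbers rather than stated on it: the index is treated as a full tree of the computed height, so the node count is the sum of fanout^level over all levels, and GB is taken as 2^30 bytes. The function name index_estimate is hypothetical.

```python
def index_estimate(node_size: int, key_size: int, pointer_size: int, num_keys: int):
    """Estimate fanout, height, and index size in GB (2**30 bytes).

    Assumption: a full tree of the computed height, so the number of
    nodes is fanout**0 + fanout**1 + ... + fanout**(height - 1).
    """
    fanout = node_size // (key_size + pointer_size)
    # Smallest height whose leaf level can address num_keys entries.
    height, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    nodes = sum(fanout ** level for level in range(height))
    space_gb = nodes * node_size / 2**30
    return fanout, height, space_gb


btree = index_estimate(4096, 64, 8, 10**9)   # full 64-byte keys
hbtrie = index_estimate(4096, 4, 8, 10**9)   # 4-byte chunks as keys

print(btree[0], btree[1], f"{btree[2]:.2f}")     # 56 6 2139.07
print(hbtrie[0], hbtrie[1], f"{hbtrie[2]:.2f}")  # 341 4 151.70
print(f"{btree[2] / hbtrie[2]:.1f}")             # 14.1
```

Under these assumptions the ratio of the two sizes comes out at about 14.1, matching the "14.1 times smaller" annotation.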

Page 24: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB: Write Ahead Logging

Page 25: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB Index Structures

ForestDB maintains two index structures
• HB+Trie: key index
• Sequence B+Tree: sequence number (8-byte integer) index
• Retrieve the file offset to a value using key or sequence number

[Figure: DB file as a sequence of documents (Doc, Doc, Doc, …); the HB+Trie is searched by key and the B+Tree by sequence number, and both point to document offsets in the file]

Page 26: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Write Ahead Logging

Append updates first, and reflect them in the main indexes later

Main purposes
• To maximize write throughput by sequential writes (append-only updates)
• To reduce # of index nodes to be written by batching updates

[Figure: DB file containing docs, index nodes, and DB headers (H); a DB header is appended for every commit. WAL indexes are in-memory structures (hash tables): the ID index maps h(key) to offset, and the seq no. index maps h(seq no) to offset]

Page 27: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Write Ahead Logging

Append updates first, and reflect them in the main indexes later

Main purposes
• To maximize write throughput by sequential writes (append-only updates)
• To reduce # of index nodes to be written by batched updates

[Figure: DB file containing docs, index nodes, and DB headers (H); a DB header is appended for every commit. WAL indexes are in-memory structures (hash tables): the ID index maps h(key) to offset, and the seq no. index maps h(seq no) to offset]

<Key query>
1. Retrieve WAL index first
2. If hit, return immediately
3. If miss, retrieve HB+Trie (or B+Tree)
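The key query steps above can be sketched as a toy Python model. Everything here is a simplification for illustration and not ForestDB's implementation: ToyStore, wal_threshold, and flush_wal are hypothetical names, a list stands in for the append-only DB file, and plain dicts stand in for both the in-memory WAL index and the HB+Trie.

```python
class ToyStore:
    """Toy model of WAL-first lookup over an append-only file."""

    def __init__(self, wal_threshold: int = 4):
        self.log = []          # append-only "DB file": (key, value) records
        self.wal_index = {}    # in-memory hash table: key -> offset in log
        self.main_index = {}   # stand-in for the HB+Trie: key -> offset
        self.wal_threshold = wal_threshold
        self.index_flushes = 0

    def set(self, key, value):
        # Append the update first; the main index is updated later in a batch.
        self.log.append((key, value))
        self.wal_index[key] = len(self.log) - 1
        if len(self.wal_index) >= self.wal_threshold:
            self.flush_wal()

    def flush_wal(self):
        # Reflect all batched WAL entries in the main index at once.
        self.main_index.update(self.wal_index)
        self.wal_index.clear()
        self.index_flushes += 1

    def get(self, key):
        offset = self.wal_index.get(key)          # 1. check the WAL index first
        if offset is None:
            offset = self.main_index.get(key)     # 3. on a miss, use the main index
        if offset is None:
            return None
        return self.log[offset][1]                # 2. on a hit, return immediately


store = ToyStore(wal_threshold=3)
store.set("a", 1)
store.set("b", 2)
store.set("c", 3)    # third entry triggers a batched flush
store.set("a", 10)   # newer version lives only in the WAL index
print(store.get("a"), store.get("b"), store.index_flushes)  # 10 2 1
```

Batching is what reduces the number of index nodes written: three updates above cause a single index flush rather than three.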

Page 28: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB: Evaluation

Page 29: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

KV Storage Engine Configurations

LevelDB
– Compression is disabled
– Write buffer size: 256 MB (initial load), 4 MB (otherwise)
– Buffer cache size: 8 GB

RocksDB
– Compression is disabled
– Write buffer size: 256 MB (initial load), 4 MB (otherwise)
– Maximum number of background compaction threads: 8
– Maximum number of background memtable flushes: 8
– Maximum number of write buffers: 8
– Buffer cache size: 8 GB (uncompressed)

ForestDB
– Compression is disabled
– WAL size: 4,096 documents
– Buffer cache size: 8 GB

Page 30: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Evaluation Setup

Evaluation Environments
– 64-bit machine running CentOS 6.5
– Intel Xeon 2.00 GHz CPU (6 cores, 12 threads)
– 32 GB RAM and Crucial M4 SSD

Data
– Key size 32 bytes and value size 1 KB
– Load 100M items
– Logical data size 100 GB total

Page 31: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Initial Load Performance

[Chart: initial load time per storage engine; annotation: 3x ~ 6x less time]

Page 32: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Read-Only Performance

[Chart: Throughput in operations per second (axis 0 to 30,000) vs. # reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB; annotation: 2x ~ 5x]

Page 33: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Write-Only Performance

[Chart: Throughput in operations per second (axis 0 to 12,000) vs. write batch size in # documents (1, 4, 16, 64, 256) for ForestDB, LevelDB, and RocksDB; annotation: 3x ~ 5x]

- Small batch size (e.g., < 10) is not usually common

Page 34: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Mixed Workload Performance

Single writer + multiple readers

[Chart: Mixed (Unrestricted) Performance in operations per second (axis 0 to 12,000) vs. # reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB; annotation: 2x ~ 5x]

Page 35: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Optimizations for Solid-State Drives

Page 36: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Characteristics of Flash Storage (vs. Hard Disk)

▪ No-overwrite and FTL layer
  – Overwrite is not allowed
  – Another layer of address mapping inside flash storage
▪ Limited lifetime
▪ Write time in flash storage ~ write amount
▪ Write time in hard disk ~ mechanical disk head movement

Page 37: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Copy-on-Write in ForestDB

▪ Document update
  ▪ Copy-on-Write, instead of in-place-update
  ▪ Need periodic compaction to reclaim stale blocks
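A minimal Python sketch of the copy-on-write update and compaction idea follows. It is a toy model under stated assumptions, not ForestDB code: AppendOnlyFile is a hypothetical class, a list stands in for the file's blocks, and a dict stands in for the index.

```python
class AppendOnlyFile:
    """Toy copy-on-write store: updates append, old versions become stale."""

    def __init__(self):
        self.blocks = []   # append-only list of (key, value) records
        self.index = {}    # key -> offset of the latest version

    def update(self, key, value):
        # Copy-on-write: never overwrite, always append a new version.
        self.blocks.append((key, value))
        self.index[key] = len(self.blocks) - 1

    def stale_blocks(self) -> int:
        # Blocks no longer referenced by the index.
        return len(self.blocks) - len(self.index)

    def compact(self) -> "AppendOnlyFile":
        # Copy only the live (latest) versions into a new file, in file order.
        new_file = AppendOnlyFile()
        for key, offset in sorted(self.index.items(), key=lambda item: item[1]):
            new_file.update(key, self.blocks[offset][1])
        return new_file


f = AppendOnlyFile()
f.update("k1", "v1")
f.update("k1", "v2")   # the first version of k1 is now stale
f.update("k2", "v1")
compacted = f.compact()
print(len(f.blocks), f.stale_blocks())                   # 3 1
print(len(compacted.blocks), compacted.stale_blocks())   # 2 0
```

Note that compaction in this model rewrites every live document, which is the cost the later SHARE slides aim to avoid.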

Page 38: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Copy-On-Write in ForestDB (2)

▪ Why CoW?
  ▪ 1) Write atomicity, 2) multi-version concurrency control
  ▪ A reasonable solution in HDD to improve write performance
    ▪ Writing blocks sequentially to avoid random disk head movements
▪ Problems with CoW in flash storage
  ▪ Writing new data and index pages → write amplification → low performance
  ▪ Flash storage lifetime

Page 39: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Opportunities in Flash Storage (1)

▪ SHARE interface: explicit address remapping

Page 40: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Opportunities in Flash Storage (2)

▪ ForestDB Compaction with SHARE
  ▪ No write of valid documents to new file

Page 41: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

SHARE Implementation

▪ Firmware extension for SHARE
  ▪ OpenSSD Board (http://www.openssd-project.org/)
  ▪ Atomic and recoverable

Page 42: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Performance Evaluation

▪ Normal time performance: YCSB’s workload-F

Page 43: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Performance Evaluation (2)

▪ Compaction performance

                        Elapsed Time (sec)    Written Bytes (MB)
Original ForestDB       227.5                 1126.4
ForestDB with SHARE     88.4                  150.6
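To put the compaction table in relative terms, the ratios can be computed directly from its four values. This Python sketch uses only the numbers shown on the slide; the variable names are illustrative.

```python
# Values copied from the compaction performance table.
original = {"elapsed_s": 227.5, "written_mb": 1126.4}
with_share = {"elapsed_s": 88.4, "written_mb": 150.6}

# How many times faster compaction finishes with SHARE.
speedup = original["elapsed_s"] / with_share["elapsed_s"]

# How many times fewer bytes are written with SHARE.
write_reduction = original["written_mb"] / with_share["written_mb"]

print(f"{speedup:.2f}x faster, {write_reduction:.2f}x fewer bytes written")
# 2.57x faster, 7.48x fewer bytes written
```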

Page 44: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

NVM Express* – Architected for NVM

• NVM Express* is a standardized high performance software interface for PCI Express* (PCIe*) SSDs
• Developed by an open industry consortium

*Other names and brands may be claimed as the property of others.

Page 45: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Throughput Testing: Throughput (op/s)

[Chart: read throughput and write throughput in op/s (axis 0 to 70,000) on SATA vs. NVMe; data labels: 30755, 7282, 47345, 63946]

• 4 readers, 1 writer
• Max capacity for each thread
• 4 files, 4 instances

Page 46: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Summary

Page 47: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

ForestDB - Summary

• Compact and efficient storage for a variety of data – HB+Trie
• In-memory WAL indexes to improve write/read performance
• Optimized for new SSD storage technology
• Unified storage engine that performs well for various workloads
• Unified storage engine that scales from small devices to large servers
  • Couchbase Server secondary index
  • Couchbase Lite
  • Couchbase Server KV engine

Page 48: Next Generation Storage Engine: ForestDB – Couchbase Live New York 2015

Get Started with Couchbase Server 4.0: developer.couchbase.com/server

Get Trained on Couchbase: training.couchbase.com