ForestDB: Next Generation Storage Engine for Couchbase
Chin Hong | VP Product Management, Couchbase
©2015 Couchbase Inc. 2
Contents
- Why do we need a new KV storage engine?
- ForestDB
  - HB+-Trie
  - Performance evaluations
- Optimizations for Flash-Based SSDs
- Summary
Why do we need a new KV storage engine?
Modern Web / Mobile / IoT Applications
- Operate on huge volumes of unstructured data
- A significant amount of new data is constantly generated by hundreds of millions of users or devices
- Still require high performance and scalability in managing their ever-growing databases
- The underlying storage engine, complementing fast memory and fast networks, is one of the most critical parts of a database system for providing high performance and scalability
Common Storage Structures – B+-Tree

- Reads: good read performance if the fan-out is high ("short" tree) for small fixed-length keys; read performance degrades for variable-length keys
- Writes: update-in-place results in random writes and bad write latency; an append-only file improves write performance but requires periodic compaction
- Engines: SQLite, InnoDB, Couchstore (append-only), WiredTiger B+

[Diagram: B+-Tree node holding keys and values (or pointers); longer keys reduce the fan-out]
Common Storage Structures – LSM-Tree

- Reads: may have to traverse multiple trees – typically worse than a B+-Tree
- Writes: a WAL (Write-Ahead Log) improves writes, with in-memory trees that are appended to the end of the log
- Engines: LevelDB, RocksDB, Cassandra, WiredTiger LSM

[Diagram: in-memory tree flushed/merged into on-disk C1 and C2 trees via a sequential log; capacity increases exponentially at each level]
Goals for a Next-Generation Storage Engine
- Fast and scalable index structure for variable- or fixed-length long keys
  - Targeting not only SSDs but also legacy HDDs
- Less storage space overhead
  - Reduce write amplification
- Efficient for different key patterns
  - Keys with or without common prefixes
- Efficient for mixed workloads
ForestDB
ForestDB
- Key-value storage engine developed by the Couchbase Caching / Storage team
- Its main index structure is built from a Hierarchical B+-Tree based Trie, or HB+-Trie
  - Originally presented at ACM SIGMOD in 2011 by Jung-Sang Ahn
  - ForestDB paper accepted for publication in IEEE Transactions on Computers
- Significantly better read and write performance with less storage overhead
- Underlying storage engine for the secondary index, mobile, and key-value engine in Couchbase
Main Features
- Multi-Version Concurrency Control (MVCC) with an append-only storage model
- Write-Ahead Logging (WAL)
- A value can be retrieved by its sequence number or disk offset in addition to its key
- Custom compare function to support a customized key order
- Snapshot support to provide different views of the database
- Rollback to revert the database to a specific point
- Ranged iteration by keys or sequence numbers
- Transaction support with read_committed or read_uncommitted isolation levels
- Multiple key-value instances per database file
- Manual or auto compaction configured per database file
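The append-only MVCC model above can be sketched conceptually: every update appends a new record rather than overwriting, and a snapshot is just a cut-off sequence number. A minimal Python illustration of the idea (not the ForestDB API; all names here are invented for the sketch):

```python
class AppendOnlyKV:
    """Toy append-only store: updates never overwrite; a snapshot
    is simply the current log length (a cut-off sequence number)."""

    def __init__(self):
        self.log = []        # append-only list of (seqnum, key, value)
        self.index = {}      # key -> offset of the latest record

    def set(self, key, value):
        seqnum = len(self.log) + 1
        self.log.append((seqnum, key, value))
        self.index[key] = len(self.log) - 1
        return seqnum

    def get(self, key):
        off = self.index.get(key)
        return self.log[off][2] if off is not None else None

    def snapshot(self):
        return len(self.log)   # everything up to this seqnum is visible

    def get_at(self, key, snap_seqnum):
        # Scan backwards for the newest version visible in the snapshot.
        for seqnum, k, v in reversed(self.log[:snap_seqnum]):
            if k == key:
                return v
        return None
```

Because old versions stay in the log, a reader holding a snapshot keeps seeing the value as of that sequence number while writers keep appending, which is the essence of MVCC with append-only storage.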
ForestDB: Main Index Structure
HB+-Trie (Hierarchical B+-Tree based Trie)
- A trie (prefix tree) whose nodes are B+-Trees
- A key is split into a list of fixed-size chunks (sub-strings of the key); e.g., a variable-length key 'a83jgls83jgo29a…' splits into fixed-size (e.g., 4-byte) chunks 'a83j', 'gls8', '3jgo', …
- Lookup descends the trie, searching one B+-Tree per chunk: chunk 1, then chunk 2, then chunk 3, …
- Supports lexicographically ordered traversal

[Diagram: each HB+-Trie node is a B+-Tree whose entries point either to a document or to the next-level B+-Tree]
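The chunk decomposition above is straightforward; a small Python sketch, using the slide's example key and chunk size (how ForestDB handles a short trailing chunk is not shown on the slide, so this sketch simply returns the remainder):

```python
def split_into_chunks(key: bytes, chunk_size: int = 4):
    """Split a variable-length key into fixed-size chunks, as an
    HB+-Trie does before descending one B+-Tree per chunk."""
    return [key[i:i + chunk_size] for i in range(0, len(key), chunk_size)]

# The slide's example key 'a83jgls83jgo29a...' with 4-byte chunks:
chunks = split_into_chunks(b"a83jgls83jgo29a")
# -> [b'a83j', b'gls8', b'3jgo', b'29a']
```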
Prefix Compression

As in an original trie, each node (a B+-Tree) is created on demand (except for the root node). Example walkthrough with chunk size = 1 byte:

1. Insert 'aaaa': the root B+-Tree indexes keys by their 1st chunk; 'aaaa' is stored under chunk 'a'.
2. Insert 'bbbb': distinguishable from 'aaaa' by the first chunk 'b', so it is stored in the same root B+-Tree under 'b'.
3. Insert 'aaab': 'aaaa' and 'aaab' cannot be distinguished by the first chunk 'a'. The first distinguishable chunk is the 4th, so a new B+-Tree is created that indexes by the 4th chunk, storing the skipped common prefix 'aa'; 'aaaa' and 'aaab' now live under chunks 'a' and 'b' of that node.
4. Insert 'bbcd': 'bbbb' and 'bbcd' cannot be distinguished by the first chunk 'b'. The first distinguishable chunk is the 3rd, so a new B+-Tree indexing by the 3rd chunk is created, storing the skipped common prefix 'b'; 'bbbb' and 'bbcd' live under chunks 'b' and 'c' of that node.

By skipping chunks that form a common prefix, the trie stays shallow: a new level is created only at the first chunk that actually distinguishes the colliding keys.
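The branching decision in the walkthrough above (find the first chunk where two colliding keys differ, and create a B+-Tree at that chunk with the skipped prefix stored alongside) can be sketched in Python; this illustrates the idea, not ForestDB's actual code:

```python
def first_distinguishable_chunk(key1: bytes, key2: bytes, chunk_size: int = 1):
    """Return (chunk_index, skipped_prefix): the 1-based index of the first
    chunk where the two keys differ, plus the common chunks skipped between
    the routing chunk (chunk 1) and the distinguishing chunk."""
    n = max(len(key1), len(key2))
    for i in range(0, n, chunk_size):
        c1, c2 = key1[i:i + chunk_size], key2[i:i + chunk_size]
        if c1 != c2:
            idx = i // chunk_size + 1
            return idx, key1[chunk_size:i]   # chunk 1 was already consumed
    raise ValueError("keys are identical")
```

On the slide's examples: 'aaaa' vs. 'aaab' differ first at the 4th chunk with skipped prefix 'aa'; 'bbbb' vs. 'bbcd' differ first at the 3rd chunk with skipped prefix 'b'.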
Compact Index Structure

When keys have common prefixes (e.g., secondary index keys), or when keys are sufficiently long and uniformly random (e.g., UUIDs or hash values), the majority of keys can be indexed by their first chunk alone.

Example: chunk size = 4 bytes. Insert 'a83jfl2iejzm302k', 'dpwk3gjrieorigje', 'z9382h3igor8eh4k', '283hgoeir8goerha', '023o8f9o8zufisue'. All five keys are distinguishable by their first chunk ('a83j', 'dpwk', 'z938', '283h', '023o'), so there is only one B+-Tree in the HB+-Trie, and we don't need to store and compare the entire key strings.
Compact Index Structure - Benefits

Suppose that:
- Node size: 4 KB, key length: 64 bytes, pointer (or value) size: 8 bytes
- Indexing 1 billion keys

                         Original B+-Tree         HB+-Trie (4-byte chunk)
Fan-out                  4096 / (64+8) ~= 56      4096 / (4+8) ~= 341
Height                   log56(10^9) ~= 6         log341(10^9) ~= 4
Index space (4 KB/node)  ~= 2139.07 GB            ~= 151.70 GB

The HB+-Trie index is 14.1 times smaller:
- Compaction overhead can be reduced significantly
- The buffer cache can accommodate more pages and manage them more efficiently
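The fan-out and height figures above can be checked with a few lines of arithmetic (a quick sanity check, not part of the original deck):

```python
import math

NODE = 4096            # node size in bytes
KEYS = 1_000_000_000   # 1 billion keys

# Fan-out = node size / (key-or-chunk size + pointer size)
fanout_btree = NODE // (64 + 8)    # 64-byte key + 8-byte pointer
fanout_hbtrie = NODE // (4 + 8)    # 4-byte chunk + 8-byte pointer

# Height ~= ceil(log_fanout(#keys))
height_btree = math.ceil(math.log(KEYS, fanout_btree))
height_hbtrie = math.ceil(math.log(KEYS, fanout_hbtrie))

print(fanout_btree, fanout_hbtrie)   # 56 341
print(height_btree, height_hbtrie)   # 6 4
```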
ForestDB: Write Ahead Logging
ForestDB Index Structures

ForestDB maintains two index structures:
- HB+-Trie: key index
- Sequence B+-Tree: sequence number (8-byte integer) index

Either index retrieves the file offset to a value, using a key or a sequence number respectively.

[Diagram: DB file as a sequence of documents; the HB+-Trie (by key) and the sequence B+-Tree (by sequence number) both point to document offsets in the file]
Write Ahead Logging

Append updates first, and reflect them in the main indexes later.

Main purposes:
- Maximize write throughput via sequential writes (append-only updates)
- Reduce the number of index nodes to be written by batching updates

The WAL indexes are in-memory structures (hash tables): an ID index mapping h(key) -> offset, and a sequence-number index mapping h(seq no) -> offset. A DB header is appended to the file for every commit.

[Diagram: DB file layout of docs, index nodes, and a DB header (H) appended per commit; in-memory WAL hash tables point to document offsets]

Key query:
1. Retrieve the WAL index first
2. On a hit, return immediately
3. On a miss, retrieve the HB+-Trie (or sequence B+-Tree)
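The WAL-first read path described above can be sketched as follows; the class and method names are illustrative, not the ForestDB API:

```python
class WalIndexedStore:
    """Toy model of the WAL-first read path: check the in-memory WAL
    hash table before falling back to the main (persistent) index."""

    def __init__(self, main_index):
        self.wal_index = {}           # key -> file offset of a recent append
        self.main_index = main_index  # dict standing in for the HB+-Trie

    def write(self, key, offset):
        self.wal_index[key] = offset  # appended update, indexed in memory

    def flush_wal(self):
        # Batch-reflect WAL entries into the main index, then clear the WAL;
        # batching is what reduces the number of index nodes written.
        self.main_index.update(self.wal_index)
        self.wal_index.clear()

    def lookup(self, key):
        if key in self.wal_index:        # steps 1-2: WAL hit returns at once
            return self.wal_index[key]
        return self.main_index.get(key)  # step 3: miss falls back to HB+-Trie
```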
ForestDB: Evaluation
KV Storage Engine Configurations

LevelDB:
- Compression is disabled
- Write buffer size: 256 MB (initial load), 4 MB (otherwise)
- Buffer cache size: 8 GB

RocksDB:
- Compression is disabled
- Write buffer size: 256 MB (initial load), 4 MB (otherwise)
- Maximum number of background compaction threads: 8
- Maximum number of background memtable flushes: 8
- Maximum number of write buffers: 8
- Buffer cache size: 8 GB (uncompressed)

ForestDB:
- Compression is disabled
- WAL size: 4,096 documents
- Buffer cache size: 8 GB
Evaluation Setup

Evaluation environment:
- 64-bit machine running CentOS 6.5
- Intel Xeon 2.00 GHz CPU (6 cores, 12 threads)
- 32 GB RAM and a Crucial M4 SSD

Data:
- Key size 32 bytes and value size 1 KB
- Load 100M items
- Logical data size 100 GB total
Initial Load Performance

[Chart: ForestDB takes 3x ~ 6x less time than the other engines]
Read-Only Performance

[Chart: throughput (operations per second, 0-30000) vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB; ForestDB is 2x ~ 5x faster]
Write-Only Performance

[Chart: throughput (operations per second, 0-12000) vs. write batch size in documents (1, 4, 16, 64, 256) for ForestDB, LevelDB, and RocksDB; ForestDB is 3x ~ 5x faster]

Note: a small batch size (e.g., < 10) is not usually common.
Mixed Workload Performance

[Chart: mixed (unrestricted) throughput (operations per second, 0-12000) vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB with a single writer plus multiple readers; ForestDB is 2x ~ 5x faster]
Optimizations for Solid-State Drives
Characteristics of Flash Storage (vs. Hard Disk)

- No overwrite and an FTL layer
  - Overwrite is not allowed
  - Another layer of address mapping inside the flash storage
- Limited lifetime
- Write time in flash storage ~ amount written; write time in a hard disk ~ mechanical disk head movement
Copy-on-Write in ForestDB

Document update:
- Copy-on-write, instead of in-place update
- Needs periodic compaction to reclaim stale blocks
Copy-on-Write in ForestDB (2)

Why CoW?
- 1) Write atomicity, 2) multi-version concurrency control
- A reasonable solution on HDDs to improve write performance: writing blocks sequentially avoids random disk head movements

Problems with CoW on flash storage:
- Writing new data and index pages -> write amplification -> low performance
- Reduced flash storage lifetime
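Compaction in an append-only / copy-on-write file reclaims stale blocks by copying only the live version of each record into a new file. A conceptual Python sketch (illustrative of the mechanism, not ForestDB internals):

```python
def compact(log, index):
    """Copy only the latest (live) version of each key into a new log,
    reclaiming the space held by stale versions."""
    new_log, new_index = [], {}
    for key, offset in index.items():   # index: key -> offset of live record
        new_index[key] = len(new_log)
        new_log.append(log[offset])     # copy the live record only
    return new_log, new_index

# Append-only update history: 'a' was rewritten, so offset 0 is stale.
log = [("a", "v1"), ("b", "v1"), ("a", "v2")]
index = {"a": 2, "b": 1}
new_log, new_index = compact(log, index)
```

This copying of still-valid documents is exactly the write amplification that the SHARE interface below avoids, by remapping the old blocks into the new file instead of rewriting them.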
Opportunities in Flash Storage (1)

SHARE interface: explicit address remapping
Opportunities in Flash Storage (2)

ForestDB compaction with SHARE:
- No write of valid documents to the new file
SHARE Implementation

Firmware extension for SHARE:
- OpenSSD Board (http://www.openssd-project.org/)
- Atomic and recoverable
Performance Evaluation

Normal-time performance: YCSB's Workload F
Performance Evaluation (2)

Compaction performance:

                       Elapsed Time (sec)   Written Bytes (MB)
Original ForestDB      227.5                1126.4
ForestDB with SHARE    88.4                 150.6
NVM Express* – Architected for NVM

- NVM Express* is a standardized, high-performance software interface for PCI Express* (PCIe*) SSDs
- Developed by an open industry consortium

*Other names and brands may be claimed as the property of others.
Throughput Testing: Throughput (op/s)

[Chart: read and write throughput for SATA vs. NVMe; values shown: 30755, 7282, 47345, and 63946 op/s]

Configuration: 4 readers, 1 writer; max capacity for each thread; 4 files, 4 instances
Summary
ForestDB - Summary

- Compact and efficient storage for a variety of data – HB+-Trie
- In-memory WAL indexes to improve write/read performance
- Optimized for new SSD storage technology
- Unified storage engine that performs well for various workloads
- Unified storage engine that scales from small devices to large servers
  - Couchbase Server secondary index
  - Couchbase Lite
  - Couchbase Server KV engine
Get Started with Couchbase Server 4.0: developer.couchbase.com/server
Get Trained on Couchbase: training.couchbase.com