Agenda
§ BlueStore overview
§ Rocksdb overview
§ BlueStore latency over OSD
§ Rocksdb optimization
Ceph: Architecture
[Figure: commodity servers as storage nodes, each with disks, OSDs and an ObjectStore per OSD; monitor nodes (M) alongside the storage nodes.]
§ BlueStore is the default ObjectStore.
BlueStore
• BlueStore = Block + NewStore
Ø Data is written directly to the block device.
Ø Key/value database (Rocksdb) for metadata.
Ø Lightweight file system (BlueFS).
Metadata
S* - “superblock” properties for the entire store
B* - block allocation metadata (blocks, size, blocks_per_key etc)
b* - allocation bitmap
T* - stats (bytes used, compressed, etc.)
C* - collection name -> cnode_t
O* - object name -> onode mapping
X* - shared blobs
L* - deferred writes
M* - omap data
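The prefix list above can be read as a lookup table. The sketch below is only an illustration: it assumes keys are plain strings whose first character is the prefix, which simplifies the real binary key encoding, and `classify_key` is a hypothetical helper name.

```python
# Prefix -> metadata kind, copied from the list above.
PREFIX_KINDS = {
    "S": "superblock properties",
    "B": "block allocation metadata",
    "b": "allocation bitmap",
    "T": "stats",
    "C": "collection name -> cnode_t",
    "O": "object name -> onode mapping",
    "X": "shared blobs",
    "L": "deferred writes",
    "M": "omap data",
}


def classify_key(key: str) -> str:
    """Return the metadata kind for a key, based on its first character.

    Assumption: the key is a string and its first character is the prefix.
    """
    if not key:
        raise ValueError("empty key")
    return PREFIX_KINDS.get(key[0], "unknown")


if __name__ == "__main__":
    for sample in ("L0000000000000001", "Mfoo", "bbar"):
        print(sample, "->", classify_key(sample))
```

Note that the prefixes are case-sensitive: `B` and `b` name different kinds of metadata.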
Metadata (cont.)
• What kind of metadata?
16k random write: 11113220 key-value pairs in total
Prefix   Key-value pairs   Size of keys (MB)   Size of values (MB)
M        4758712           157                 577
O        3182396           229                 1580
b        3172112           30                  48
Total                                          2205
4k random write: 11120873 key-value pairs in total
Prefix   Key-value pairs   Size of keys (MB)   Size of values (MB)
L        3177321           30                  6264
M        4764738           157                 578
O        3178352           228                 2202
Total                                          9044
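To make the numbers above easier to compare, the following sketch computes each prefix's share of the value bytes. It uses only the value sizes reported above; the helper name `value_share` is made up for this example. Under these figures, deferred writes (L) account for roughly 69% of value bytes in the 4k run, while onodes (O) dominate the 16k run.

```python
# Value sizes in MB per key prefix, taken from the two runs above.
VALUE_MB = {
    "16k random write": {"M": 577, "O": 1580, "b": 48},
    "4k random write": {"L": 6264, "M": 578, "O": 2202},
}


def value_share(sizes: dict) -> dict:
    """Return each prefix's fraction of the total value bytes."""
    total = sum(sizes.values())
    return {prefix: mb / total for prefix, mb in sizes.items()}


if __name__ == "__main__":
    for workload, sizes in VALUE_MB.items():
        shares = value_share(sizes)
        summary = ", ".join(f"{p}: {s:.1%}" for p, s in shares.items())
        print(f"{workload}: total {sum(sizes.values())} MB ({summary})")
```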
BlueStore – small IO (rewrite)
STATE_PREPARE: _txc_add_transaction / add deferred IO to KV write batch
STATE_AIO_WAIT: no aios
STATE_IO_DONE: _txc_finish_io
STATE_KV_QUEUED: put into kv_queue, kv_cond.notify
STATE_KV_SUBMITTED: _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE: _txc_committed_kv
STATE_DEFERRED_QUEUED, STATE_DEFERRED_CLEANUP: _deferred_queue / _kv_finalize_thread
_txc_finish
STATE_DONE
• The key/value DB acts as a WAL (deferred IO).
• Data is first written to the KV DB, and the write then returns to the upper layer.
• Later, the data is written to the block device.
• The deferred IO entry is then deleted from the KV DB.
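The four steps above can be sketched as a toy model. This is not BlueStore code: the class, the in-memory dicts standing in for Rocksdb and the block device, and the key format are all assumptions made for illustration.

```python
class DeferredWriteStore:
    """Toy model: the KV DB doubles as a write-ahead log for small writes."""

    def __init__(self):
        self.kv = {}       # stands in for the key/value DB (hypothetical)
        self.block = {}    # stands in for the raw block device (hypothetical)
        self.next_seq = 0

    def small_write(self, offset: int, data: bytes) -> str:
        """Log the write in the KV DB; the caller is acknowledged on return."""
        key = f"L{self.next_seq:016d}"  # 'L' prefix marks a deferred write
        self.next_seq += 1
        self.kv[key] = (offset, data)
        return key

    def flush_deferred(self) -> int:
        """Later: apply logged writes to the block device, then drop the entries."""
        keys = sorted(k for k in self.kv if k.startswith("L"))
        for key in keys:
            offset, data = self.kv[key]
            self.block[offset] = data   # data reaches the block device
            del self.kv[key]            # deferred entry leaves the KV DB
        return len(keys)


if __name__ == "__main__":
    store = DeferredWriteStore()
    store.small_write(4096, b"abcd")
    print("pending in KV:", len(store.kv), "on block:", len(store.block))
    store.flush_deferred()
    print("pending in KV:", len(store.kv), "on block:", len(store.block))
```

The point of the sketch is the ordering: the write is durable in the KV DB before it is acknowledged, and each deferred entry is both inserted into and deleted from the KV DB over its lifetime.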
BlueStore – big IO
STATE_PREPARE: _txc_add_transaction
STATE_AIO_WAIT: _txc_aio_submit / txc_aio_finish
STATE_IO_DONE: _txc_finish_io
STATE_KV_QUEUED: put into kv_queue, kv_cond.notify
STATE_KV_SUBMITTED: _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE: _txc_committed_kv
STATE_FINISHING: _txc_finish
STATE_DONE
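A compact way to compare the small-IO and big-IO paths is to write both state sequences down and diff them. The state names below come from the two slides; the enum, the list names and `path_difference` are illustrative and not part of Ceph.

```python
from enum import Enum, auto


class TxcState(Enum):
    STATE_PREPARE = auto()
    STATE_AIO_WAIT = auto()
    STATE_IO_DONE = auto()
    STATE_KV_QUEUED = auto()
    STATE_KV_SUBMITTED = auto()
    STATE_KV_DONE = auto()
    STATE_DEFERRED_QUEUED = auto()
    STATE_DEFERRED_CLEANUP = auto()
    STATE_FINISHING = auto()
    STATE_DONE = auto()


S = TxcState

# State order as listed on the small-IO slide.
SMALL_IO_PATH = [
    S.STATE_PREPARE, S.STATE_AIO_WAIT, S.STATE_IO_DONE, S.STATE_KV_QUEUED,
    S.STATE_KV_SUBMITTED, S.STATE_KV_DONE, S.STATE_DEFERRED_QUEUED,
    S.STATE_DEFERRED_CLEANUP, S.STATE_DONE,
]

# State order as listed on the big-IO slide.
BIG_IO_PATH = [
    S.STATE_PREPARE, S.STATE_AIO_WAIT, S.STATE_IO_DONE, S.STATE_KV_QUEUED,
    S.STATE_KV_SUBMITTED, S.STATE_KV_DONE, S.STATE_FINISHING, S.STATE_DONE,
]


def path_difference(a, b):
    """Return the states that appear in path a but not in path b."""
    return [state for state in a if state not in b]


if __name__ == "__main__":
    print("only in small IO:", [s.name for s in path_difference(SMALL_IO_PATH, BIG_IO_PATH)])
    print("only in big IO:", [s.name for s in path_difference(BIG_IO_PATH, SMALL_IO_PATH)])
```

Both paths share everything up to STATE_KV_DONE; only the small-IO path passes through the two deferred states.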
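Rocksdb
[Figure: Rocksdb write path. Memory: MemTable and Immutable MemTable, which receive writes. Disk: .sst files (SSTables) at Level 0, Level 1 and Level 2, plus the .log, manifest and current files. Flush moves data from memory to disk.]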
• A key-value database derived from Google's LevelDB and developed further by Facebook.
• Based on an LSM tree (log-structured merge-tree).
§ Flush
§ Compaction
§ Write: goes into the memTable.
§ Read: checks the memTable first, then Level 0, Level 1, and so on until the value is found.
§ Write amplification
§ Read amplification
§ Space amplification
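The write, flush, read and compaction behaviour listed above can be illustrated with a toy LSM. It is a deliberately simplified sketch, not how Rocksdb is implemented: `ToyLSM`, its tiny memtable limit and the probe counter are assumptions for illustration, and there is no log, no immutable memtable and no deletes.

```python
class ToyLSM:
    """Minimal LSM sketch: one memtable, Level 0 tables, one merged Level 1."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.levels = [[]]            # levels[0]: flushed tables, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        """Writes go into the memtable; a full memtable is flushed."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into a sorted Level 0 table."""
        if self.memtable:
            self.levels[0].append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def compact(self):
        """Merge Level 1 and all Level 0 tables into a single Level 1 table."""
        merged = {}
        if len(self.levels) > 1:
            for table in self.levels[1]:      # oldest data first
                merged.update(table)
        for table in self.levels[0]:          # newer tables overwrite older ones
            merged.update(table)
        self.levels = [[], [dict(sorted(merged.items()))]]

    def get(self, key):
        """Return (value, tables_probed): memtable first, then level by level."""
        probes = 1
        if key in self.memtable:
            return self.memtable[key], probes
        for level in self.levels:
            for table in reversed(level):     # newest table first
                probes += 1
                if key in table:
                    return table[key], probes
        return None, probes


if __name__ == "__main__":
    db = ToyLSM(memtable_limit=2)
    for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
        db.put(k, v)
    print("before compaction:", db.get("b"))
    db.compact()
    print("after compaction:", db.get("b"))
```

The probe count returned by `get` is a rough stand-in for read amplification: the older a key is, the more tables a lookup touches until compaction merges them.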
Compaction
rbd10_sata_dev_nvme_db_4k
§ Fio+librbd on 10 rbd images.
§ 4k random write.
§ Use Intel P3700 as db+wal, Intel S3520 as block device.
[Figure: BlueStore IO time span. Percentage (0% to 100%) over time (3-minute intervals, 1 to 20); series: state_io_percentage, state_kv_queued_percentage, state_kv_percentage, other_percentage.]
[Figure: OSD time span. Percentage (0% to 100%) over time (3-minute intervals, 1 to 20); series: bluestore_percentage, other_percentage.]
rbd10_nvme_all_4k
§ Fio+librbd on 10 rbd images.
§ 4k random write.
§ Use Intel P3700 for all.
[Figure: BlueStore IO time span. Percentage (0% to 100%) over time (3-minute intervals, 1 to 20); series: state_io_percentage, state_kv_queued_percentage, state_kv_percentage, other_percentage.]
[Figure: OSD time span. Percentage (0% to 100%) over time (3-minute intervals, 1 to 20); series: bluestore_percentage, other_percentage.]
Rocksdb Optimization
• New merge style.
Ø Merge key/value pairs recursively.
Ø Decrease the data flushed to disk.
[Figure: db.write.micros over time (intervals 1 to 24), y-axis 0 to 160; series: normal_merge2, dup2.]
[Figure: db.Get.micros over time (intervals 1 to 24), y-axis 0 to 900; series: normal_merge2, dup2.]
[Figure: old vs. new merge style across memtable1 to memtable4; labels: Merge, Dedup.]
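One way to picture how deduplicating across memtables could reduce the data flushed to disk is sketched below. This is a guess at the idea, not the actual patch: it assumes that a key both written and deleted within the set of memtables being merged has no older version on disk, so neither the value nor the delete marker needs to be flushed. `TOMBSTONE` and `dedup_memtables` are hypothetical names.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted key


def dedup_memtables(memtables):
    """Merge memtables (oldest first) and keep only the newest entry per key.

    Assumption: keys whose newest entry is a delete have no older version
    on disk, so they can be dropped entirely instead of being flushed.
    """
    merged = {}
    for table in memtables:
        merged.update(table)  # newer memtables overwrite older entries
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}


if __name__ == "__main__":
    # Deferred-write entries (L*) are inserted and later deleted.
    m1 = {"L1": b"data1", "O1": b"onode-v1"}
    m2 = {"L1": TOMBSTONE, "L2": b"data2"}
    m3 = {"L2": TOMBSTONE, "O1": b"onode-v2"}
    before = sum(len(m) for m in (m1, m2, m3))
    after = dedup_memtables([m1, m2, m3])
    print(f"entries before: {before}, entries flushed: {len(after)}")
```

Short-lived keys such as deferred-write entries benefit most from this kind of merge, since they can disappear before ever reaching an .sst file.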
Rocksdb Optimization (cont.)
• Enable column families.
Ø Create different column families for different kinds of metadata.
Ø Set different options based on the attributes of each type of metadata.
• Omap
• Deferred IOs
• Other
Ø This is the first step. Further optimization based on column families is in progress.
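The column-family split above might look roughly like the following sketch. The column family names and every option value are placeholders chosen for illustration, not the settings used in this work; only the option names `write_buffer_size` and `min_write_buffer_number_to_merge` are real Rocksdb options.

```python
MB = 1 << 20

# Key prefix -> column family (names are hypothetical).
CF_FOR_PREFIX = {
    "M": "omap",       # omap data
    "L": "deferred",   # deferred IOs
}
DEFAULT_CF = "other"

# Per-column-family options. All values are placeholders, not tuned settings.
CF_OPTIONS = {
    "omap": {"write_buffer_size": 64 * MB, "min_write_buffer_number_to_merge": 2},
    "deferred": {"write_buffer_size": 256 * MB, "min_write_buffer_number_to_merge": 4},
    "other": {"write_buffer_size": 64 * MB, "min_write_buffer_number_to_merge": 1},
}


def cf_for_key(key: str) -> str:
    """Pick a column family from the key's prefix (first character)."""
    if not key:
        raise ValueError("empty key")
    return CF_FOR_PREFIX.get(key[0], DEFAULT_CF)


def options_for_key(key: str) -> dict:
    """Return the options of the column family the key would be routed to."""
    return CF_OPTIONS[cf_for_key(key)]


if __name__ == "__main__":
    for sample in ("Mkey", "L0001", "Oobject"):
        print(sample, "->", cf_for_key(sample), options_for_key(sample))
```

Separating the families lets short-lived data such as deferred IOs be buffered and flushed differently from long-lived data such as omap.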
Impact of write buffer size
• write_buffer_size
• Uses Intel P3700.
• 4k random IOs.
• Different values of write_buffer_size and min_write_buffer_number_to_merge.
[Figure: Bluestore/commit_lat (us), y-axis 1000 to 2600, over time (3-minute intervals, 1 to 20); series: 64M_merge8, 64M_merge4, 256M_merge2, 256M_merge1.]
[Figure: bluestore/txc count, y-axis 3500000 to 4900000, over time (3-minute intervals, 1 to 20); series: 64M_merge8, 64M_merge4, 256M_merge2, 256M_merge1.]
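A quick way to relate the four configurations is to estimate how much data accumulates in memory before a flush. The sketch assumes each series label encodes write_buffer_size in MB followed by min_write_buffer_number_to_merge (for example, `64M_merge8` as 64 MB and 8), and approximates the buffered data as the product of the two. The helper names are made up for this example.

```python
import re


def parse_config(label: str):
    """Split a label like '64M_merge8' into (write_buffer_size_mb, merge_count)."""
    match = re.fullmatch(r"(\d+)M_merge(\d+)", label)
    if match is None:
        raise ValueError(f"unexpected label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def buffered_mb_before_flush(label: str) -> int:
    """Approximate MB held in memtables before a flush is triggered."""
    size_mb, merge_count = parse_config(label)
    return size_mb * merge_count


if __name__ == "__main__":
    for label in ("64M_merge8", "64M_merge4", "256M_merge2", "256M_merge1"):
        print(f"{label}: ~{buffered_mb_before_flush(label)} MB before flush")
```

Under this reading, `64M_merge8` and `256M_merge2` buffer about the same amount of data, as do `64M_merge4` and `256M_merge1`, so each pair varies the number of memtables merged while holding the total buffered data roughly constant.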