MongoDB Storage Engine with RocksDB LSM Tree

Denis Protivenskii, Software Engineer, Percona

Contents

- What is MongoRocks?
- RocksDB overview
- MongoDB contracts for storage engines
- The most problematic operation

What is MongoRocks?


RocksDB overview

RocksDB for the user

Key-value storage:

- Get(k) → v
- Put(k, v)
- Delete(k)
- Merge ...

Level organization

Write-ahead log

Every next level is several times larger than the previous one

Keys are ordered within a level

Compaction starts when a level is too large

The next level may not fit the compacted data

Compaction may run recursively

Files in levels are immutable

- Compaction creates new files, and old ones get deleted when no longer used
- Files are written sequentially to disk, which speeds up I/O
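The level rules above can be illustrated with a toy model. This is a minimal sketch, not RocksDB code: `TinyLSM`, `memtable_limit`, and `fanout` are names invented for this example, each level is a single sorted immutable run, and the write-ahead log and deletes are left out.

```python
import bisect


class TinyLSM:
    """Toy leveled LSM-tree: a memtable plus one sorted, immutable run per level."""

    def __init__(self, memtable_limit=2, fanout=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit  # hypothetical tuning knob
        self.fanout = fanout                  # every next level is `fanout` times larger
        self.levels = []                      # levels[i] is a sorted tuple of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.levels:  # upper levels hold newer data
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def _capacity(self, level):
        return self.memtable_limit * self.fanout ** (level + 1)

    def _flush(self):
        incoming = tuple(sorted(self.memtable.items()))
        self.memtable = {}
        self._merge_into(0, incoming)

    def _merge_into(self, level, incoming):
        if level == len(self.levels):
            self.levels.append(())
        merged = dict(self.levels[level])
        merged.update(incoming)  # newer data wins
        run = tuple(sorted(merged.items()))  # a new run is built, the old one is dropped
        if len(run) > self._capacity(level):
            # Level is too large: push everything down, possibly recursively.
            self.levels[level] = ()
            self._merge_into(level + 1, run)
        else:
            self.levels[level] = run
```

Runs are tuples, so a merge never modifies an existing run; it always builds a replacement, which mirrors the immutable files described above.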

MongoDB + RocksDB

Data organization in MongoDB

- “Containers” for data and indexes receive unique string identifiers (ident)
- Each element must have a unique id within its container

Data organization in RocksDB

How to represent MongoDB’s data structure in a “plain” storage like RocksDB?

Data organization in MongoRocks

<ident + id> for every container’s element

coll1 ind1_1 ind1_2 coll2 … indN_M

- ident is longer than 20 characters, which is an extra cost for every data element
- This ident length comes from its use as a filename in WiredTiger and mmapv1

How to save on ident length properly?

- A hash of ident is a bad choice, as short hashes may collide
- Instead: an auto-increment counter (called a prefix) and a map of ident → prefix

<prefix + id> for every container’s element

prefix_0 prefix_1 prefix_2 prefix_3 … prefix_N
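A minimal sketch of the ident → prefix idea, assuming a 4-byte prefix and an 8-byte id. These widths, the names `PrefixRegistry` and `make_key`, and the ident strings in the usage note are assumptions for illustration, not the MongoRocks format.

```python
import struct


class PrefixRegistry:
    """Maps long ident strings to short auto-increment prefixes."""

    def __init__(self):
        self._next = 0
        self._by_ident = {}  # ident -> prefix; a real engine would persist this map

    def prefix_for(self, ident):
        if ident not in self._by_ident:
            self._by_ident[ident] = self._next
            self._next += 1
        return self._by_ident[ident]


def make_key(prefix, record_id):
    # Big-endian encoding keeps byte order equal to numeric order,
    # so all keys of one container are adjacent in the sorted key space.
    return struct.pack(">IQ", prefix, record_id)
```

For example, `make_key(reg.prefix_for("collection-0-hypothetical"), 42)` yields a 12-byte key regardless of how long the ident is, and a counter cannot collide the way a short hash can.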

Index format in MongoRocks

K = <prefix + value + order + id (loc)>

V = <typeof value>

Comes from MongoDB

How to look up an id if it is part of the key?

Index format in MongoRocks

- The storage must support the search operations lower_bound and upper_bound
- They allow positioning on the closest key and decoding it
- RocksDB has iterators for this purpose
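The lookup can be sketched over a sorted list of keys. This is an illustration only: keys are modeled as Python tuples `(prefix, value, loc)`, the order component is omitted, and `bisect` stands in for a storage iterator seek. `find_locs` is a name invented here.

```python
import bisect


def find_locs(keys, prefix, value):
    """Return the ids (locs) of all index entries for `value` in one index.

    `keys` must be sorted. The seek target (prefix, value) sorts before every
    full key (prefix, value, loc), so bisect_left acts as lower_bound.
    """
    i = bisect.bisect_left(keys, (prefix, value))
    locs = []
    # Iterate forward from the closest key and decode the id from each key.
    while i < len(keys) and keys[i][:2] == (prefix, value):
        locs.append(keys[i][2])
        i += 1
    return locs
```

The id is never searched for directly: the seek lands on the first key that starts with the wanted value, and the id is read out of the key itself.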

The most problematic operation

Deleting data in MongoRocks

- Deleting an element (document or index entry) just puts a delete operation D into the LSM-tree
- As a result, the tree fills up with garbage (old data and delete ops), which slows down iteration

The solution!

Deleting data in MongoRocks

- Ask for the iterator’s statistics after iteration
- If too much data was skipped, run compaction for this range
- The range is always a prefix
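A toy model of the steps above, assuming string keys that start with their prefix. `Store`, `scan_and_maybe_compact`, and the `max_skipped` threshold are hypothetical names and values for this sketch; they are not RocksDB or MongoRocks APIs.

```python
TOMBSTONE = object()  # stands in for the delete operation D


class Store:
    def __init__(self):
        self.entries = {}     # key -> value or TOMBSTONE
        self.compactions = 0

    def put(self, key, value):
        self.entries[key] = value

    def delete(self, key):
        self.entries[key] = TOMBSTONE  # a delete is just another write

    def scan_prefix(self, prefix):
        """Return live keys plus the number of tombstones the scan had to skip."""
        live, skipped = [], 0
        for key in sorted(self.entries):
            if not key.startswith(prefix):
                continue
            if self.entries[key] is TOMBSTONE:
                skipped += 1
            else:
                live.append(key)
        return live, skipped

    def compact_prefix(self, prefix):
        """Drop tombstones inside one prefix range only."""
        self.entries = {
            k: v for k, v in self.entries.items()
            if not (k.startswith(prefix) and v is TOMBSTONE)
        }
        self.compactions += 1


def scan_and_maybe_compact(store, prefix, max_skipped=3):
    live, skipped = store.scan_prefix(prefix)
    if skipped > max_skipped:  # too much garbage seen during iteration
        store.compact_prefix(prefix)
    return live
```

Compaction is triggered by what a real scan observed, so ranges that are never iterated pay no extra cost.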

This was the easier part of the problem, though...

Deleting collections in MongoRocks

- Need to iterate over all data and indexes of the collection and delete every item
- This creates a lot of garbage
- It makes little sense compared to engines that just drop files on disk

Compaction filters

Deleting collections in MongoRocks

- Create a filter with the prefixes of dropped containers
- Start compaction for the prefix
- Compaction calls the filter for every item, and the filter decides whether the item is deleted
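The filter idea can be sketched as follows, assuming keys are `(prefix, id)` tuples. `DroppedPrefixFilter` and `compact` are hypothetical names for this example; a real RocksDB compaction filter is a C++ callback and is not shown here.

```python
class DroppedPrefixFilter:
    """Remembers the prefixes of dropped containers."""

    def __init__(self):
        self.dropped = set()

    def drop(self, prefix):
        self.dropped.add(prefix)

    def should_remove(self, key):
        prefix, _id = key
        return prefix in self.dropped


def compact(run, key_filter):
    """Rewrite a sorted run, asking the filter about every item."""
    return [(key, value) for key, value in run
            if not key_filter.should_remove(key)]
```

Dropping a collection then costs one set insertion, and the data disappears as a side effect of compaction instead of through per-item deletes.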

To be able to rerun compaction after a crash, a marker for the dropped prefix is persisted and kept until the compaction has finished

It can be even better

Deleting collections in MongoRocks

Fully contains range to drop

- DeleteFilesInRange deletes files whose keys all lie within the requested range
- It requires care, as it deletes files immediately even if some keys are still in use (by snapshots)
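Which files qualify can be shown with a small model. This is a sketch of the selection rule only, under the assumption of a half-open range `[begin, end)` chosen for this example; `SstFile` and `files_fully_in_range` are hypothetical names, not the RocksDB implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SstFile:
    name: str
    smallest: bytes  # smallest key stored in the file
    largest: bytes   # largest key stored in the file


def files_fully_in_range(files, begin, end):
    """Files whose whole key span lies inside [begin, end).

    Files that only overlap the range hold keys of other containers too,
    so they cannot be dropped and must go through compaction instead.
    """
    return [f for f in files if begin <= f.smallest and f.largest < end]
```

With one-byte prefixes, dropping prefix `b"\x01"` would use `begin=b"\x01"` and `end=b"\x02"`; only files lying entirely between the two bounds are returned.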

What’s missing

Deleting collections in MongoRocks

- MongoDB doesn’t send notifications about the logical drop of a collection or a database
- WiredTiger and mmapv1 don’t need them, as they delete files on disk
- This forces MongoRocks to compact every prefix separately

oplog

Capped collections in MongoRocks

MongoDB has a specific collection type built as a circular buffer

It was developed solely for the oplog (the replication log)

- The oplog is pretty large (5% of disk size, no more than 50 GB by default)
- Because of the many overwrites, the oplog is polluted with garbage, which affects the performance of the whole storage
- Separate code monitors the oplog size and the number of ‘tombstones’ in it
- oplog compaction gets higher priority (in the queue of compaction operations)
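One way to picture the priority rule is a heap-based queue. This is a hedged sketch: `CompactionQueue` and its two priority classes are assumptions made for this example and do not describe the actual MongoRocks scheduler.

```python
import heapq
import itertools


class CompactionQueue:
    """Compaction jobs ordered by priority, then by arrival order."""

    OPLOG, NORMAL = 0, 1  # lower number is served first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per class

    def schedule(self, prefix, is_oplog=False):
        priority = self.OPLOG if is_oplog else self.NORMAL
        heapq.heappush(self._heap, (priority, next(self._seq), prefix))

    def next_job(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

An oplog job scheduled last is still served first, while ordinary prefixes keep their arrival order.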

Radical solution

Column families in MongoRocks

- A classic storage engine has one B-tree per “container” (data or index)
- MongoRocks has one LSM-tree for all “containers”

More LSM-trees!

Column families in MongoRocks

- RocksDB supports a set of LSM-trees (column families) with a shared WAL to provide transactional logic
- First developed for MySQL (MyRocks project)
- MongoRocks should have a separate LSM-tree for the oplog, maybe even a separate LSM-tree for every prefix
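The shared-WAL idea can be sketched with plain dictionaries. This is a toy model under stated assumptions: `MultiTreeStore` is a hypothetical name, each "tree" is a dict, and the WAL is an in-memory list; it is not the RocksDB column family API.

```python
class MultiTreeStore:
    """Several independent trees (one per family) behind one shared log."""

    def __init__(self):
        self.wal = []       # shared log: one record per batch
        self.families = {}  # family name -> its own tree (a dict in this sketch)

    def write_batch(self, ops):
        """ops: list of (family, key, value).

        The whole batch is logged as a single WAL record before it is applied,
        so a write touching several trees is recovered entirely or not at all.
        """
        self.wal.append(tuple(ops))
        self._apply(ops)

    def _apply(self, ops):
        for family, key, value in ops:
            self.families.setdefault(family, {})[key] = value

    @classmethod
    def recover(cls, wal):
        """Rebuild every tree by replaying the shared log."""
        store = cls()
        for record in wal:
            store.wal.append(record)
            store._apply(record)
        return store
```

A document insert and its oplog entry can go into one batch, for example `[("coll1", "id1", "doc"), ("oplog", "ts1", "insert id1")]`, while each family stays a separate tree.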

Conclusion

- MongoDB contracts still include some details, typical for other engines, that are not applicable to MongoRocks
- It’s good to order keys in a storage somehow
- The problem of deleting keys may be solved using different optimizations
- The idea of multiple LSM-trees is a step forward


Questions?

Thank you!