Monkey: Optimal Navigable Key-Value Store Niv Dayan, Manos Athanassoulis, Stratos Idreos

Source: cs-people.bu.edu/mathan/publications/slides/sigmod2017-Monkey.pdf

  • Monkey: Optimal Navigable Key-Value Store

    Niv Dayan, Manos Athanassoulis, Stratos Idreos

  • Figure: price per GB over time

    storage is cheaper

    workload: inserts & updates

    need for write-optimized database structures

  • Figure: timeline from 1996 (LSM-tree invented) to now (Key-Value Stores)

    need for write-optimized database structures

  • LSM-tree Key-Value Stores

    What are they really?

  • Figure: LSM-tree with an in-memory buffer (level 0) and storage levels 1, 2, 3

    updates go to the buffer in memory

    sort & flush: the buffer is written to storage as runs

    sort-merge: runs are merged into the next level

    exponentially increasing capacities

    O(log(N)) levels
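
A minimal sketch of why exponentially increasing capacities give O(log(N)) levels. The function name, the buffer size, and the size ratio used below are illustrative assumptions, not values from the slides.

```python
def levels_needed(entries, buffer_entries, size_ratio):
    """Count storage levels needed to hold `entries`.

    Assumption for illustration: level i holds buffer_entries * size_ratio**i
    entries, so each level is size_ratio times larger than the one above it.
    """
    levels = 0
    capacity = buffer_entries
    total = 0
    while total < entries:
        levels += 1
        capacity *= size_ratio  # capacity of the level just added
        total += capacity       # cumulative capacity of all storage levels
    return levels


# Growing the data 1000x adds only log_10(1000) = 3 levels.
small = levels_needed(10**6, buffer_entries=1000, size_ratio=10)
large = levels_needed(10**9, buffer_entries=1000, size_ratio=10)
```

With these assumed parameters the level count grows by a constant each time the data grows by a factor of the size ratio, which is the logarithmic behaviour the slide states.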

  • Figure: lookup for key X across storage levels 1, 2, 3, with fence pointers and Bloom filters in memory

    fence pointers: one I/O per run

    Bloom filters: for each run, the filter returns a true negative, a false positive, or a true positive for X
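
The lookup path in the figure can be sketched as below. This is a toy model under stated assumptions: the `BloomFilter`, `Run`, and `lookup` names, the page size, and the hashing scheme are all hypothetical, and a page read stands in for one I/O.

```python
import bisect
import hashlib


class BloomFilter:
    """Toy Bloom filter; one byte per bit for clarity, not compactness."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class Run:
    """A sorted run split into pages, with fence pointers and a filter."""

    def __init__(self, entries, page_size=4, bits_per_entry=10):
        items = sorted(entries.items())
        self.pages = [items[i:i + page_size]
                      for i in range(0, len(items), page_size)]
        self.fences = [page[0][0] for page in self.pages]  # min key per page
        self.filter = BloomFilter(max(1, bits_per_entry * len(items)), 7)
        for key, _ in items:
            self.filter.add(key)

    def get(self, key):
        """Return (value or None, page reads). Values are assumed non-None."""
        if not self.filter.might_contain(key):
            return None, 0                      # negative: skip the run
        idx = bisect.bisect_right(self.fences, key) - 1
        if idx < 0:
            return None, 0                      # below the smallest key
        for k, v in self.pages[idx]:            # exactly one page read
            if k == key:
                return v, 1                     # true positive
        return None, 1                          # false positive: wasted I/O


def lookup(runs, key):
    """Probe runs from newest to oldest; return (value, total page reads)."""
    ios = 0
    for run in runs:
        value, cost = run.get(key)
        ios += cost
        if value is not None:
            return value, ios
    return None, ios


runs = [Run({1: "a", 5: "b"}), Run({1: "old", 9: "c"})]
newest_value, newest_ios = lookup(runs, 1)
```

Because a Bloom filter has no false negatives, a key that is present always costs one page read in its run; every other run adds a read only on a false positive.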

  • Performance & Cost Tradeoffs

    memory vs. lookups: bigger filters, fewer false positives

    lookups vs. updates: more merging, fewer runs

  • Figure: design space with axes lookup cost, update cost, and main memory

    existing systems, fixed memory: merge more or merge less to trade update cost against lookup cost

    less memory or more memory changes the lookup cost vs. update cost curve

  • Figure: lookup cost vs. update cost for fixed memory, showing existing systems (x) and the Pareto frontier

    Problem 1: suboptimal filters allocation

    Problem 2: hard to tune

    Bloom filters size: lookups vs. memory

    merge policy greed: lookups vs. updates

    max throughput

  • Monkey: Optimal Navigable Key-Value Store

  • Monkey: Optimal Navigable Key-Value Store (overview of observations, steps, insights)

    filters: fixed false positive rates are suboptimal; lookup cost $= \sum_i p_i$; optimize allocation; asymptotically better; memory vs. lookups

    merge policy: ad-hoc trade-offs; log, LSM-tree, sorted array; navigate; updates vs. lookups

    performance?: lookup cost and update cost, Monkey vs. existing; memory, lookups, updates; answer what-if design questions

  • Figure: lookup cost vs. update cost for fixed memory, showing WiredTiger, Cassandra, HBase, RocksDB, LevelDB, Monkey, the Pareto frontier, and the max throughput point

  • Figure: memory holds the buffer, fence pointers, and Bloom filters; storage holds the data

    Bloom filters: X bits per entry

  • Bloom filters: X bits per entry, with the same false positive rate $p$ for every filter

    false positive rate $p = e^{-\frac{M}{N} \cdot \ln(2)^2}$, for $M$ bits and $N$ entries

    worst-case I/O overhead: $O(\sum p) = O(\sum e^{-M/N}) = O(\log(N) \cdot e^{-M/N})$, since there are $O(\log(N))$ levels

    Can we do better?
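
A small sketch of the uniform allocation described above, assuming one run per level so that the worst-case overhead is the per-filter false positive rate times the number of levels. The function names and the 10 bits per entry are illustrative assumptions.

```python
import math


def bloom_fpr(bits_per_entry):
    """False positive rate p = e^(-(M/N) * ln(2)^2) for M/N bits per entry."""
    return math.exp(-bits_per_entry * math.log(2) ** 2)


def uniform_overhead(bits_per_entry, levels):
    """Expected wasted I/Os when every level's filter has the same rate p."""
    return levels * bloom_fpr(bits_per_entry)


# With a fixed bits-per-entry budget, the overhead grows with the level count.
overhead_3_levels = uniform_overhead(10, 3)
overhead_6_levels = uniform_overhead(10, 6)
```

At 10 bits per entry the rate is a little under 1%, and doubling the number of levels doubles the expected wasted I/Os, which is the log(N) factor in the bound.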

  • Figure: lookup for key X over the data runs, with fence pointers and Bloom filters; each false positive costs one I/O

    the filter using most memory saves at most 1 I/O

    reallocate some of that memory: same memory, fewer lookup I/Os

  • false positive rates: relax, model, optimize

    relax: $0 < p_0 < 1$, $0 < p_1 < 1$, $0 < p_2 < 1$

    model: lookup cost $= f(p_0, p_1, \ldots)$ and memory footprint $= f(p_0, p_1, \ldots)$

    optimize: in terms of $p_0, p_1, \ldots$

  • Bloom filters with false positive rates $p_0, p_1, p_2, \ldots$

    lookup cost $= \sum_i p_i$

    false positive rate $= e^{-\frac{\text{bits}}{\text{entries}} \cdot \ln(2)^2}$, so $\text{bits} = -\text{entries} \cdot \frac{\ln(\text{false positive rate})}{\ln(2)^2}$

    memory footprint $= \text{bits}(p_0, N) + \text{bits}(p_1, N/T) + \text{bits}(p_2, N/T^2) + \ldots$

    memory $= -c \cdot N \cdot \sum_i \frac{\ln(p_i)}{T^i}$, with constant $c$, entries $N$, size ratio $T$, false positive rates $p_i$

    optimize
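
One way to carry out the optimize step is with a Lagrange multiplier. The derivation below is a sketch under two assumptions that are not stated on the slides: the number of levels is treated as given, and the upper bounds $p_i < 1$ are ignored. The multiplier $\lambda$ is introduced here for illustration.

```latex
\[
\begin{aligned}
\text{minimize } R &= \sum_i p_i
  \quad \text{subject to} \quad
  M = -c \cdot N \cdot \sum_i \frac{\ln(p_i)}{T^i} \\
\mathcal{L} &= \sum_i p_i
  + \lambda \left( M + c \cdot N \cdot \sum_i \frac{\ln(p_i)}{T^i} \right) \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= 1 + \frac{\lambda \cdot c \cdot N}{T^i \cdot p_i} = 0
  \;\Longrightarrow\;
  p_i = \frac{-\lambda \cdot c \cdot N}{T^i} = \frac{p_0}{T^i}
\end{aligned}
\]
```

Under these assumptions the stationary point sets each false positive rate proportional to $1/T^i$, that is, proportional to the number of entries the filter covers.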

  • State-of-the-Art Bloom filters vs. Monkey Bloom filters (same memory)

    State-of-the-Art false positive rates: $p, p, p, \ldots$

    Monkey false positive rates: $p_0/T^2, p_0/T, p_0, \ldots$ (exponential decrease)

    per filter: $p_0/T^2 < p$, $p_0/T < p$, $p_0 > p$

    lookup cost: Monkey $\sum_i p_i = O(e^{-M/N})$ < State-of-the-Art $\sum p = O(\log(N) \cdot e^{-M/N})$

    $N$: number of entries; $M$: overall memory for Bloom filters

    asymptotic win: lookup cost increases at a slower rate as data grows
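
The comparison can be checked numerically. The sketch below gives both schemes the same total filter memory and compares the sum of false positive rates. It assumes one run per level, the largest level holding N entries, and every Monkey rate staying below 1; all function names and the chosen parameters are illustrative.

```python
import math

LN2_SQ = math.log(2) ** 2


def level_sizes(n_largest, size_ratio, levels):
    """Entries per level: N, N/T, N/T^2, ... (index 0 is the largest)."""
    return [n_largest / size_ratio ** i for i in range(levels)]


def filter_bits(fpr, entries):
    """Bits a Bloom filter needs for `entries` keys at rate `fpr`."""
    return -entries * math.log(fpr) / LN2_SQ


def uniform_fprs(total_bits, sizes):
    """Same bits per entry everywhere, hence the same rate p everywhere."""
    p = math.exp(-(total_bits / sum(sizes)) * LN2_SQ)
    return [p] * len(sizes)


def monkey_fprs(total_bits, sizes, size_ratio):
    """Rates p0, p0/T, p0/T^2, ... with p0 solved from the memory budget."""
    count = len(sizes)
    s1 = sum(1 / size_ratio ** i for i in range(count))
    s2 = sum(i / size_ratio ** i for i in range(count))
    ln_p0 = (math.log(size_ratio) * s2 - total_bits * LN2_SQ / sizes[0]) / s1
    p0 = math.exp(ln_p0)
    if p0 >= 1:
        raise ValueError("budget too small for this sketch (p0 must be < 1)")
    return [p0 / size_ratio ** i for i in range(count)]


sizes = level_sizes(1_000_000, size_ratio=10, levels=5)
budget = 10 * sum(sizes)  # 10 bits per entry overall
uniform = uniform_fprs(budget, sizes)
monkey = monkey_fprs(budget, sizes, size_ratio=10)
uniform_cost = sum(uniform)
monkey_cost = sum(monkey)
```

With these assumed parameters the uniform scheme sums to roughly 0.041 and the Monkey scheme to roughly 0.012, while both use the same number of bits. The largest level's rate is slightly higher than the uniform rate and all smaller levels are lower, matching the per-filter comparison on the slide.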

  • Monkey Bloom filters: false positive rates $p_0/T^2, p_0/T, p_0, \ldots$

    lookup cost $= \sum_i p_i$ is a convergent geometric series

    memory $= -c \cdot \text{entries} \cdot \sum_i \frac{\ln(p_i)}{T^i}$

    memory $= -c \cdot \text{entries} \cdot \ln(\text{lookup cost})$

    this models the lookups vs. memory trade-off
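
A sketch of how the geometric series leads to the closed form, assuming infinitely many levels for simplicity and writing $R$ for the lookup cost (a symbol introduced here):

```latex
\[
\begin{aligned}
R &= \sum_{i \ge 0} \frac{p_0}{T^i} = p_0 \cdot \frac{T}{T-1} \\
M &= -c \cdot N \cdot \sum_{i \ge 0} \frac{\ln(p_0) - i \ln(T)}{T^i}
   = -c \cdot N \cdot \frac{T}{T-1}
     \left( \ln(p_0) - \frac{\ln(T)}{T-1} \right) \\
  &= -c \cdot N \cdot \frac{T}{T-1}
     \left( \ln(R) - \ln\frac{T}{T-1} - \frac{\ln(T)}{T-1} \right)
\end{aligned}
\]
```

For a fixed size ratio $T$, every term other than $\ln(R)$ is a constant, so memory scales as $-\ln(\text{lookup cost})$ times the number of entries, which is the relation on the slide with the $T$-dependent factors absorbed into $c$.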

  • Identify: size ratio, merge policy

    Map: lookups vs. updates, across log, LSM-tree, and sorted array

    Navigate: workload and hardware determine the optimal design for maximum throughput

  • Merge Policies

    Tiering (write-optimized): T runs per level; merge & flush

    Leveling (read-optimized): 1 run per level; merge; flush when T times bigger

  • Tiering (write-optimized) vs. Leveling (read-optimized), size ratio $T$

    runs per level: Tiering $T$; Leveling 1

    lookup cost, State-of-the-Art filters: Tiering $O(T \cdot \log_T(N) \cdot e^{-M/N})$ (runs per level × levels × false positive rate); Leveling $O(\log_T(N) \cdot e^{-M/N})$ (levels × false positive rate)

    lookup cost, Monkey filters: Tiering $O(T \cdot e^{-M/N})$ (runs per level × false positive rate); Leveling $O(e^{-M/N})$

    update cost: Tiering $O(\log_T(N))$ (levels); Leveling $O(T \cdot \log_T(N))$ (merges per level × levels)

    at the smallest size ratio both policies have 1 run per level: lookup cost $O(e^{-M/N})$, update cost $O(\log(N))$

    at the largest size ratio Tiering becomes a log: $O(N)$ runs per level, lookup cost $O(N \cdot e^{-M/N})$, update cost $O(1)$

    at the largest size ratio Leveling becomes a sorted array: 1 run per level, lookup cost $O(e^{-M/N})$, update cost $O(N)$

  • Figure: lookup cost vs. update cost as the size ratio $T$ varies; Tiering and Leveling meet at T=2, Tiering extends towards a log and Leveling towards a sorted array, with the LSM-tree in between

    workload and hardware determine the optimal point for maximum throughput
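
Navigating this design space can be sketched as a search over merge policy and size ratio. The cost expressions are the big-O terms from the slides with constants dropped, so the numbers are only illustrative; the weighting by lookup fraction, the function names, and the range of size ratios are assumptions.

```python
import math


def costs(policy, size_ratio, entries, filter_bits):
    """Return (lookup_cost, update_cost) using the slides' big-O terms.

    Constants are dropped, so values are only comparable in shape.
    """
    fpr_term = math.exp(-filter_bits / entries)
    levels = math.log(entries, size_ratio)
    if policy == "tiering":
        return size_ratio * fpr_term, levels
    if policy == "leveling":
        return fpr_term, size_ratio * levels
    raise ValueError(f"unknown merge policy: {policy}")


def navigate(lookup_fraction, entries, filter_bits, size_ratios=range(2, 33)):
    """Pick the (policy, size ratio) with the lowest weighted cost.

    lookup_fraction is the share of lookups in the workload (0.0 to 1.0).
    Returns (weighted_cost, policy, size_ratio).
    """
    best = None
    for policy in ("tiering", "leveling"):
        for t in size_ratios:
            lookup, update = costs(policy, t, entries, filter_bits)
            weighted = lookup_fraction * lookup + (1 - lookup_fraction) * update
            if best is None or weighted < best[0]:
                best = (weighted, policy, t)
    return best


update_only = navigate(0.0, entries=1e6, filter_bits=1e7)
lookup_only = navigate(1.0, entries=1e6, filter_bits=1e7)
```

Under this toy model an update-only workload is pushed to tiering with the largest allowed size ratio (towards a log), and a lookup-only workload to leveling. A real tuner would also need hardware constants such as page size and read/write cost, which this sketch omits.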

  • Figure: better asymptotic scalability. Lookup latency (ms) vs. number of entries (log scale), LevelDB vs. Monkey

  • Figure (F): workload adaptability. Lookup latency (ms) vs. % lookups in workload, for LevelDB, fixed Monkey, and navigable Monkey; points are labelled T4, T2, L4, L6, L8, L16

  • http://daslab.seas.harvard.edu/crimsondb/

    self-designs navigates what-if?

  • Monkey: Optimal Navigable Key-Value Store

    more in paper: 0 < memory < ∞; buffer, filters, cache; skewed & range lookups

    http://daslab.seas.harvard.edu/monkey/

    Thanks!