Retrieval and Perfect Hashing


  • 8/12/2019 Retrieval and Perfect Hashing


    Retrieval and Perfect Hashing using Fingerprinting

    Ingo Müller (1, 2), Peter Sanders (1), Robert Schulze (2), and Wei Zhou (1, 2)

    (1) Department of Informatics of KIT
    (2) HANA Core of SAP AG

    Institute for Theoretical Informatics, Algorithms II

    KIT: University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association, www.kit.edu


    The Retrieval Problem


    Perfect Hash Function (PHF)
        Map each key $s \in S$ to a unique integer $i \in ID$

    Retrieval data structure
        Associate a value $v \in V$ with each $s \in S$

    Classical implementation: hash table
        Store key/ID or key/value pairs

    Optimization: do not store $S$
        Undefined behavior for $s \notin S$

    Applications
        Look-up in dictionaries of in-memory DBMSs (like the SAP HANA database [1])
        Many more... (see Botelho et al. [2])

    Example (figure residue; labels: S, V, optimization):
        S        V
        ...      ...
        flag     noun, verb
        flare    n, v
        flask    n
        flash    n, v
        ...      ...
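    A minimal sketch of retrieval without storing the keys, using the example dictionary above. The brute-force seed search, the helper names (`h`, `build_phf`, `retrieve`), and the choice of BLAKE2b are assumptions made for illustration; they are not the methods cited on these slides and only scale to tiny key sets.

```python
import hashlib

def h(key: str, seed: int, n: int) -> int:
    """Seeded hash of key into range(n); BLAKE2b is an arbitrary choice."""
    digest = hashlib.blake2b(key.encode(), salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % n

def build_phf(keys: list) -> int:
    """Brute force: try seeds until every key gets a distinct ID in range(n)."""
    n = len(keys)
    seed = 0
    while len({h(k, seed, n) for k in keys}) != n:
        seed += 1
    return seed

dictionary = {"flag": "noun, verb", "flare": "n, v", "flask": "n", "flash": "n, v"}
keys = list(dictionary)
seed = build_phf(keys)

# Only the values are kept; the key set S itself is not stored.
values = [None] * len(keys)
for k in keys:
    values[h(k, seed, len(keys))] = dictionary[k]

def retrieve(key: str) -> str:
    return values[h(key, seed, len(values))]

print(retrieve("flask"))   # a key from S: its associated value
print(retrieve("flame"))   # not in S: some arbitrary stored value comes back
```

    The second call shows what "undefined behavior" means in practice: without the keys, the structure cannot tell that "flame" was never inserted.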


    Known Methods


    Perfect Hash Functions
        Practical implementations exist: BPZ [3], CHD [4], etc.
        Store only a constant, sometimes optimal, number of extra bits
        Retrieval: use a PHF to index an array of values

    Direct Retrieval Data Structures
        CHM [5], etc.

                                Prior work               Our solution
    Construction                complicated              simpler
                                inherently sequential    easily parallelizable
                                slow                     faster
    Dynamic operations          no (rebuild)             yes
    Cache misses per query      2                        1


    Fingerprint Retrieval (FiRe): Overview

    [Figure: Level 1, Level 2, ..., Level L, followed by a PHF-based retrieval data structure (Level L+1). Each level consists of m = n/b buckets; a bucket holds fingerprints and a cells with values v_1, v_2, ..., v_a.]

    $\mathit{bucket} = \mathrm{hash}_1(\mathit{key}) \in \{1, \dots, m\}$, $\mathit{fingerprint} = \mathrm{hash}_2(\mathit{key}) \in \{1, \dots, k\}$

    Recursively overflow to next level on fingerprint collision/full bucket

    Fingerprints implemented as bit vector for simplicity and speed
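    The overview can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: hashes are 0-based, a plain `dict` stands in for the PHF-based fallback, and the names (`hashes`, `Level`, `build`, `query`), the default parameters, and the BLAKE2b hash are all hypothetical.

```python
import hashlib

def hashes(key: str, level: int, m: int, k: int):
    """Return (bucket, fingerprint) of key on a level; 0-based in this sketch."""
    d = hashlib.blake2b(key.encode(), salt=level.to_bytes(8, "little")).digest()
    return int.from_bytes(d[:8], "little") % m, int.from_bytes(d[8:16], "little") % k

class Level:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m                   # one k-bit fingerprint vector per bucket
        self.cells = [[] for _ in range(m)]   # values, ordered by fingerprint

def build(items: dict, a: int = 4, b: int = 3, k: int = 16, max_levels: int = 8):
    levels, rest = [], dict(items)
    for lvl in range(max_levels):
        if not rest:
            break
        m = max(1, len(rest) // b)
        level = Level(m, k)
        groups = {}                           # (bucket, fingerprint) -> keys
        for key in rest:
            groups.setdefault(hashes(key, lvl, m, k), []).append(key)
        overflow = {}
        # Sorted order means cells are filled in increasing fingerprint order.
        for (bucket, fp), ks in sorted(groups.items()):
            if len(ks) == 1 and len(level.cells[bucket]) < a:
                level.bits[bucket] |= 1 << fp
                level.cells[bucket].append(rest[ks[0]])
            else:                             # fingerprint collision or full bucket
                for key in ks:
                    overflow[key] = rest[key]
        levels.append(level)
        rest = overflow
    return levels, rest                       # rest: stand-in for the PHF fallback

def query(levels, fallback, key):
    for lvl, level in enumerate(levels):
        bucket, fp = hashes(key, lvl, level.m, level.k)
        bits = level.bits[bucket]
        if (bits >> fp) & 1:
            rank = bin(bits & ((1 << fp) - 1)).count("1")  # set bits below fp
            return level.cells[bucket][rank]
    return fallback.get(key)

items = {f"key{i}": i for i in range(1000)}
levels, fallback = build(items)
print(query(levels, fallback, "key42"), len(levels), len(fallback))
```

    A fingerprint bit is set only for an element that is actually stored on that level, so an overflowed key never finds its bit set on the level it left and falls through to the next one.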


    Fingerprint Retrieval (FiRe): Asymptotics

    [Figure: L levels of m buckets, each bucket with fingerprints and a cells, followed by a PHF fallback]

    Notation
        n elements
        m = n/b buckets
        a cells per bucket
        r-bit values

    Expected linear construction time
        L FiRe levels, $O(n)$ for each level
        Even for $L \to \infty$: geometric series, as only a constant fraction of the elements overflow

    Constant worst-case query time, since L is constant
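    The geometric-series argument can be written out as follows. The symbols $q$ (the fraction of elements that overflow from one level to the next), $n_i$ (the number of elements reaching level $i$), and $c$ (the per-element construction cost) are introduced here for illustration and do not appear on the slides; the bound is meant in expectation.

```latex
n_i \le q^{\,i-1}\, n \quad (0 < q < 1),
\qquad
\sum_{i=1}^{L} c\, n_i \;\le\; c\, n \sum_{i=0}^{\infty} q^{\,i} \;=\; \frac{c\, n}{1-q} \;=\; O(n).
```

    The bound does not depend on $L$, which is why the construction time stays linear even when the number of levels is unbounded.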


    Fingerprint Retrieval (FiRe): Formulae

    Let $a_1$ be the expected number of elements in a bucket

    Space overhead per element: $s \approx \frac{r\,(a - a_1) + \text{size}_{\text{fingerprints}}}{a_1}$ bits

    Cache misses per query: $l \approx \frac{b}{a_1}$

    Calculation of $a_1$: see our paper
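    As a rough numerical illustration, the quantities on this slide can be estimated by simulating a single level. The sketch reads the slide's relations as s ≈ (r (a - a_1) + fingerprint size) / a_1 and l ≈ b / a_1, and takes a_1 to mean the number of elements a bucket actually keeps; both readings, the uniform random hashing, the function names, and the example parameters are assumptions. The paper gives the actual calculation of a_1.

```python
import random

def simulate_a1(a: int, b: int, k: int, m: int = 20_000, seed: int = 1) -> float:
    """Estimate a_1: average number of elements a bucket keeps on one level."""
    rng = random.Random(seed)
    buckets = [{} for _ in range(m)]          # per bucket: fingerprint -> count
    for _ in range(m * b):                    # n = m * b elements
        bucket = buckets[rng.randrange(m)]
        fp = rng.randrange(k)
        bucket[fp] = bucket.get(fp, 0) + 1
    kept = 0
    for bucket in buckets:
        # Only elements with a unique fingerprint can stay, at most a per bucket.
        unique = sum(1 for count in bucket.values() if count == 1)
        kept += min(a, unique)
    return kept / m

def trade_off(a: int, b: int, k: int, r: int, fingerprint_bits: int):
    a1 = simulate_a1(a, b, k)
    s = (r * (a - a1) + fingerprint_bits) / a1   # space overhead per element [bits]
    l = b / a1                                   # expected cache misses per query
    return a1, s, l

# Example parameters (assumed): bit-vector fingerprints, so k bits per bucket.
a1, s, l = trade_off(a=15, b=12, k=32, r=32, fingerprint_bits=32)
print(f"a1 = {a1:.2f}, s = {s:.2f} bits, l = {l:.2f} cache misses")
```

    Raising b fills the buckets better (lower s) but makes more elements overflow (higher l), which is the trade-off the parameters control.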


    Fingerprint Retrieval (FiRe): Space/Time Trade-Off

    Space overhead s and expected number of cache misses l depend on
        a: #cells per bucket
        b: average #elements per bucket (= n/m)
        k: #possible fingerprint values
        r: size of each value

    How to choose parameters?
        a and k such that a bucket fits into a cache line
        Depending on desired trade-off

    [Figure: space overhead s [bits] (0 to 40) versus cache misses l (1 to 3) for r = 32 bits. Curves: opt. encoding a=14, k=149; bit vector a=15, k=32; bit vector a=14, k=64; bit vector a=13, k=96; bit vector a=12, k=128]
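    The bit-vector configurations in the plot legend are consistent with a bucket that exactly fills one cache line. The short check below assumes a 64-byte (512-bit) cache line and a bucket made of a k-bit fingerprint vector plus a cells of r bits each; the cache-line size and the 32-bit granularity of k are assumptions, not stated on the slide.

```python
CACHE_LINE_BITS = 512   # assumption: 64-byte cache line
R = 32                  # value size in bits, as in the plot

def bucket_bits(a: int, k: int, r: int = R) -> int:
    """Size of one bucket: k-bit fingerprint vector plus a cells of r bits."""
    return k + a * r

def configs(r: int = R, line: int = CACHE_LINE_BITS, word: int = 32):
    """All (a, k) with k a multiple of `word` that fill the line exactly."""
    result = []
    for k in range(word, line, word):
        a, remainder = divmod(line - k, r)
        if remainder == 0 and a > 0:
            result.append((a, k))
    return result

for a, k in configs()[:4]:
    print(f"a={a}, k={k}: {bucket_bits(a, k)} bits")
```

    Under these assumptions the first four results are the legend's bit-vector settings: (15, 32), (14, 64), (13, 96), and (12, 128).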


    Fingerprint Retrieval (FiRe): Dynamization

    Updates and deletions (easy)
        Update associated value in place
        Ignore deletions

    [Figure: static part (the S to V table: flag: noun, verb; flare: n, v; flask: n; flash: n, v) and dynamic part, with arrows labeled "queries" and "updates"]

    Insertions
        Needs a dynamic part with keys + some book-keeping information
        Answer queries with the static part (FiRe)
        Idea:
            Overflow new and old element if fingerprints collide
            Block fingerprint for future inserts
            Rebuild when some stability criterion is violated
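    One possible reading of the insertion idea, sketched for a single bucket. The class `DynamicBucket`, its fields, and the decision to also block a fingerprint when the bucket is full are assumptions made for this illustration; the rebuild criterion is not modeled, and the slides do not specify these details.

```python
class DynamicBucket:
    def __init__(self, a: int, k: int):
        self.a, self.k = a, k
        self.bits = 0          # static part: k-bit fingerprint vector
        self.cells = []        # static part: values, ordered by fingerprint
        self.owner = {}        # dynamic part: fingerprint -> key stored under it
        self.blocked = set()   # dynamic part: fingerprints closed for inserts

    def _rank(self, fp: int) -> int:
        return bin(self.bits & ((1 << fp) - 1)).count("1")

    def insert(self, key, fp: int, value):
        """Return the (key, value) pairs that must move to the next level."""
        if fp in self.blocked:
            return [(key, value)]
        if fp in self.owner:                   # collision: old and new overflow
            old = (self.owner.pop(fp), self.cells.pop(self._rank(fp)))
            self.bits &= ~(1 << fp)
            self.blocked.add(fp)
            return [old, (key, value)]
        if len(self.cells) >= self.a:          # full bucket
            # Block too, so a later key with this fingerprint cannot shadow
            # the element that just overflowed.
            self.blocked.add(fp)
            return [(key, value)]
        self.cells.insert(self._rank(fp), value)
        self.bits |= 1 << fp
        self.owner[fp] = key
        return []

    def lookup(self, fp: int):
        """Query path: uses only the static part (bits and cells)."""
        if (self.bits >> fp) & 1:
            return self.cells[self._rank(fp)]
        return None                            # continue on the next level

bucket = DynamicBucket(a=2, k=8)
first = bucket.insert("flag", 3, "noun, verb")
second = bucket.insert("flare", 3, "n, v")
third = bucket.insert("flask", 3, "n")
print(first, second, third, bucket.lookup(3))
```

    The keys kept in the dynamic part are what make it possible to re-insert the old element on the next level, since the static part stores values only.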


    Fingerprint Perfect Hashing (FiPHa)


    Fingerprint-Based Perfect Hashing (FiPHa)
        Special case with large buckets of empty values
        Associated ID is calculated as $\mathrm{rank}_{\mathrm{bucket}}(v) \cdot a + \mathrm{rank}_{\mathrm{fingerprint}}(v)$

    Very space efficient (2.79 bits overhead with 2.78 cache misses)
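    A small sketch of the ID computation on one level, reading the slide's formula as bucket rank times a plus fingerprint rank. Here the bucket rank is taken to be the 0-based bucket index and the fingerprint rank the number of set bits below the element's fingerprint; both interpretations, the function name `fipha_id`, and the example bit vector are assumptions. Offsets between levels are not modeled.

```python
def fipha_id(bucket_index: int, bucket_bits: int, fingerprint: int, a: int):
    """ID of an element on one level, or None if it overflowed."""
    if not (bucket_bits >> fingerprint) & 1:
        return None                      # not stored here: try the next level
    # Rank of the fingerprint: number of set bits below it in the bit vector.
    rank = bin(bucket_bits & ((1 << fingerprint) - 1)).count("1")
    return bucket_index * a + rank

# Bucket 5 with fingerprints 1, 2 and 4 set, at most a = 4 elements per bucket.
bits = 0b10110
print([fipha_id(5, bits, fp, a=4) for fp in range(5)])
```

    Since at most a fingerprints are set per bucket, the rank is always below a, so IDs from different buckets of the same level cannot collide.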


    Experimental Results: Settings

    Configurations of a, b, k such that
        $l \approx 1.05$ cache misses: FiRe5
        $l \approx 1.25$ cache misses: FiRe25
        $l \approx 1.50$ cache misses: FiRe50

    Retrieval data structure with FiPHa as PHF (3.78 cache misses)

    Baselines
        BPZ, CHD-0.5/0.99 from the C Minimal Perfect Hashing Library [6]
        CHM-2/3 from our implementation

    Datasets
        Keys: 100 million unique random 32-bit integers
        Values: integers of size r = 8 bits


    Experimental Results: Build Times

    [Figure: build time [ns/element] (0 to 3000) for r = 8 bits; bars for FiRe5, FiRe25, FiRe50, FiPHa, CHM-2, CHM-3, BPZ, CHD-0.5, CHD-0.99]

    4 to 17 times faster sequential construction for FiRe

    FiPHa slower, but faster than competitors


    Experimental Results: Space Overhead

    [Figure: space overhead [bits/element] (0 to 35) for r = 8 bits; bars for FiRe5, FiRe25, FiRe50, FiPHa, CHM-2, CHM-3, BPZ, CHD-0.5, CHD-0.99]

    FiRe50 has comparable overhead to most competitors

    FiPHa almost on par with best competitor (CHD-0.99)


    Experimental Results: Query Times

    [Figure: query time [ns/element] (0 to 400) for r = 8 bits; bars for FiRe5, FiRe25, FiRe50, FiPHa, CHM-2, CHM-3, BPZ, CHD-0.5, CHD-0.99]

    FiRe has the best query times due to low number of cache misses

    FiPHa comparable to CHD-0.99, but has much faster construction


    Summary and Future Work


    Fingerprint Retrieval (FiRe) and Perfect Hashing (FiPHa)
        Simple concept, easy implementation
        Fast evaluation due to low number of cache misses
        Extremely fast construction, even with sequential implementation
        Small space overhead
        Highly configurable trade-off
        Support for updates, insertions, and deletions

    Future Work
        Find more compact, yet practical representation of fingerprints
        Adapt idea of cuckoo hashing to fingerprinting
        Improve trade-off with different settings for each level
        Adapt fingerprinting idea to other data structures


    Thank you


    References I


    [1] F. Färber et al., "SAP HANA Database: Data management for modern business applications," SIGMOD Rec., vol. 40, no. 4, pp. 45–51, 2012. [Online]. Available: http://doi.acm.org/10.1145/2094114.2094126

    [2] F. C. Botelho and N. Ziviani, "External perfect hashing for very large key sets," in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, 2007, pp. 653–662.

    [3] F. C. Botelho, R. Pagh, and N. Ziviani, "Simple and space-efficient minimal perfect hash functions," in Algorithms and Data Structures. Springer, 2007, pp. 139–150.

    [4] D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger, "Hash, displace, and compress," in Algorithms - ESA 2009. Springer, 2009, pp. 682–693.

    [5] Z. J. Czech, G. Havas, and B. S. Majewski, "An optimal algorithm for generating minimal perfect hash functions," Information Processing Letters, vol. 43, no. 5, pp. 257–264, 1992.

    [6] D. de Castro Reis, D. Belazzougui, F. C. Botelho, and N. Ziviani, "CMPH - C Minimal Perfect Hashing Library." [Online]. Available: http://cmph.sourceforge.net