38
Hashing with linear probing Rasmus Pagh IT University of Copenhagen Teoripärlor, Lund, September 16, 2007 1

Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

  • Upload
    ngodang

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hashing with linear probing

Rasmus Pagh

IT University of Copenhagen

Teoripärlor, Lund, September 16, 2007

1

Page 2: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hashing with linear probing

2

Page 3: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hashing with linear probing

2

Page 4: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hashing with linear probing

2

Page 5: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hashing with linear probing

2

Page 6: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hashing with linear probing

2

Page 7: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hashing with linear probing

It was settled in the 60s that this is inferior to e.g. double hashing. So why care?

2

Page 8: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

3

Page 9: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

389 km/h

20 km/h

3

Page 10: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Race car vs golf car

• Linear probing uses a sequential scan and is thus cache-friendly.

• On my laptop: 24x speed difference between sequential and random access!

• Experimental studies have shown linear probing to be faster than other methods for load factor α in the range 30-70%.

For 4-byte words

For “small” keys

4

Page 11: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Race car vs golf car

• Linear probing uses a sequential scan and is thus cache-friendly.

• On my laptop: 24x speed difference between sequential and random access!

• Experimental studies have shown linear probing to be faster than other methods for load factor α in the range 30-70%.

• But: No theory behind the hash functions used for linear probing in practice.

For 4-byte words

For “small” keys

4

Page 12: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

History of linear probing

• First described in 1954.

• Analyzed in 1962 by D. Knuth, aged 24. Assumes hash function h is truly random.

• Over 30 papers using this assumption.

• Siegel and Schmidt (1990) showed that it suffices that h is O(log n)-wise independent.

5

Page 13: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

History of linear probing

• First described in 1954.

• Analyzed in 1962 by D. Knuth, aged 24. Assumes hash function h is truly random.

• Over 30 papers using this assumption.

• Siegel and Schmidt (1990) showed that it suffices that h is O(log n)-wise independent.

5

Page 14: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

History of linear probing

• First described in 1954.

• Analyzed in 1962 by D. Knuth, aged 24. Assumes hash function h is truly random.

• Over 30 papers using this assumption.

• Siegel and Schmidt (1990) showed that it suffices that h is O(log n)-wise independent.

Our main result: It suffices that h is 5-wise independent.

5

Page 15: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

log(n)-wise independence• Siegel (1989) showed time-space trade-offs

for evaluation of a function from a log(n)-wise independent family:

• Upper bound 2 is theoretically appealing, but has a huge constant factor – and uses many random memory accesses!

Time SpaceLower bound log(n)

log(s/ log n) s

Upper bound 1! O(log n) O(log n)Upper bound 2 O(1) n!

6

Page 16: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

log(n)-wise independence• Siegel (1989) showed time-space trade-offs

for evaluation of a function from a log(n)-wise independent family:

• Upper bound 2 is theoretically appealing, but has a huge constant factor – and uses many random memory accesses!

Time SpaceLower bound log(n)

log(s/ log n) s

Upper bound 1! O(log n) O(log n)Upper bound 2 O(1) n!

6

Page 17: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

5-wise independence

• Polynomial hash function:

• Tabulation-based hash function:

Carter and Wegman (FOCS ’79)

Thorup and Zhang (SODA ‘04)

Already quite fast

Within factor 2 of the fastest universal hash functions

h(x) =

(4∑

i=0

aixi mod p

)mod r

h(x1, x2) = T1[x1]! T2[x2]! T3[x1 + x2]

7

Page 18: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Today

• Background and motivation

• Hash functions

‣ New analysis of linear probingJoint work with Anna Pagh and Milan Ružić

•Hash functions - details

8

Page 19: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Total cost of insertions

((analysis on blackboard))

9

Page 20: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Hash function details

((analysis on blackboard))

h(x1, x2) = T1[x1]! T2[x2]! T3[x1 + x2]

10

Page 21: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Single insertion upper bound

11

Page 22: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Single insertion upper bound

11

Page 23: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Single insertion upper bound

1. Choose max t so B balls hash to B-t

slots, for some B

{

11

Page 24: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Single insertion upper bound

1. Choose max t so B balls hash to B-t

slots, for some B

{ {2. Choose max C such that C balls hash to C+t slots

11

Page 25: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Single insertion upper bound

1. Choose max t so B balls hash to B-t

slots, for some B

{ {2. Choose max C such that C balls hash to C+t slots

11

Page 26: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Single insertion upper bound

1. Choose max t so B balls hash to B-t

slots, for some B

{ {2. Choose max C such that C balls hash to C+t slots

11

Page 27: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Single insertion upper bound

1. Choose max t so B balls hash to B-t

slots, for some B

{ {2. Choose max C such that C balls hash to C+t slots

Cost( )≤1+C+t

Lemma:

11

Page 28: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Proof idea• Lemma: If operation on x goes on for more

than k steps, then there are “unusually many” keys with hash values in either:

1) Some interval with h(x) as right endpoint, or2) The interval [h(x),h(x)+k]

!h(x) h(x) + k

12

Page 29: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Proof idea

• To bound cost, upper bound probability of each event using tail bounds for sums of random variables with limited independence.

• Lemma: If operation on x goes on for more than k steps, then there are “unusually many” keys with hash values in either:

1) Some interval with h(x) as right endpoint, or2) The interval [h(x),h(x)+k]

!h(x) h(x) + k

12

Page 30: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Our main resultTheorem 2 Consider any sequence of insertions, dele-tions, and lookups in a linear probing hash table usinga 5-wise independent hash function. Then the expectedcost of any operation, performed at load factor !, is

O(1 + (1! !)!3) .

As a consequence, the expected average cost of successfullookups is O(1 + (1! !)!2).

13

Page 31: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Our main result

factor (1! !)!1

from what can beproved using fullindependence

Theorem 2 Consider any sequence of insertions, dele-tions, and lookups in a linear probing hash table usinga 5-wise independent hash function. Then the expectedcost of any operation, performed at load factor !, is

O(1 + (1! !)!3) .

As a consequence, the expected average cost of successfullookups is O(1 + (1! !)!2).

13

Page 32: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

End remarks

• Theory and practice of linear probing now (seem) much closer.

• We can generalize to variable key lengths.

14

Page 33: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

End remarks

• Theory and practice of linear probing now (seem) much closer.

• We can generalize to variable key lengths.

• Open:

‣ Still many hashing schemes where theory does not provide satisfactory methods.

‣ Tighter analysis, lower independence?

14

Page 34: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Advertisement

• Call for applications, PhD scholarships at ITU: http://www1.itu.dk/sw66047.asp

• For project proposals in algorithms, talk to me before October 1.

15

Page 35: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Problems

• Show that the total number of insertion steps is independent of the insertion order.

• Show how to efficiently implement deletions in a linear probing hash table.

• Show that the following hash function is 5-wise independent (hint: recursion):

h(x1, x2, x3, x4) = T1[x1]! T2[x2]! T3[x3]! T4[x4]!T5[x1 + x2]! T6[x3 + x4]!T7[x1 + x2 + x3 + x4]

16

Page 36: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

Practical exercise

• Implement a dictionary for null-terminated strings using linear probing. Beware of the “bad” way(s) of implementing it!

• Use the dictionary to count the number of distinct words in “Love and War”.http://www.gutenberg.org/etext/2600

• Compare the performance to the standard hash table in your programming environment.

17

Page 37: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

18

Page 38: Hashing with linear probingfileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/raslide.pdf · Hashing with linear probing It was settled in the 60s that this is inferior to e.g. double

T H E N DE

18