Hash tables - cs.bgu.ac.ilds202/wiki.files/hash.pdf

Hash tables

Basic Probability Theory

Two events A,B are independent if

Pr[A ∩ B] = Pr[A] · Pr[B]

Conditional probability:

$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}$$

The expectation of a (discrete) random variable X is

$$E[X] = \sum_k k \cdot \Pr[X = k]$$

In particular, if the values of X are 0 or 1, then E[X] = Pr[X = 1].

Union bound: For events A1, . . . , Ak,

$$\Pr\left[\bigcup_{i=1}^{k} A_i\right] \le \sum_{i=1}^{k} \Pr[A_i]$$

Linearity of expectation: For random variables X1, . . . , Xk,

$$E\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} E[X_i]$$

Dictionary

Definition

A dictionary is a data structure that stores a set S of elements, where each element x has a field x.key, and supports the following operations:

Search(S, k): Return an element x ∈ S with x.key = k.

Insert(S, x): Add x to S.

Delete(S, x): Delete x from S.

We will assume that

The keys are from U = {0, . . . , u − 1}.
The keys are distinct.

Direct addressing

S is stored in an array T[0..u − 1], with one entry for every possible key. The entry T[k] contains a pointer to the element with key k, if such an element exists, and NULL otherwise.

Search(T, k): return T[k].

Insert(T, x): T[x.key] ← x.

Delete(T, x): T[x.key] ← NULL.

What is the problem with this structure?

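[Figure: direct-address table T with slots 0 to 9. The universe is U = {0, . . . , 9} and the stored keys are S = {2, 3, 5, 8}; each of these slots points to an element (key and satellite data), and every other slot is NULL.]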
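The three operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DirectAddressTable` and the `(key, value)` element layout are hypothetical and do not come from the slides.

```python
class DirectAddressTable:
    """Direct-address table for keys in {0, ..., u - 1} (illustrative sketch)."""

    def __init__(self, u):
        # One slot per possible key, whether or not the key is ever stored.
        self.slots = [None] * u

    def search(self, key):
        return self.slots[key]

    def insert(self, key, value):
        self.slots[key] = (key, value)

    def delete(self, key):
        self.slots[key] = None


table = DirectAddressTable(10)
for key in (2, 3, 5, 8):
    table.insert(key, f"data{key}")
table.delete(3)
used = sum(slot is not None for slot in table.slots)
```

Every operation is a single array access, but the array always occupies u slots, even when only a handful of keys are stored.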

Hashing

In order to reduce the space complexity of direct addressing, map the keys to a smaller range {0, . . . , m − 1} using a hash function.

There is a problem of collisions (two or more keys that are mapped to the same value).

There are two ways to handle collisions:

Chaining
Open addressing

Hash table with chaining

Let h : U → {0, . . . ,m − 1} be a hash function (m < u).

S is stored in a table T[0..m − 1] of linked lists. The element x ∈ S is stored in the list T[h(x.key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from T [h(x .key)].

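[Figure: S = {6, 9, 19, 26, 30}, m = 5, h(x) = x mod 5. The list T[0] holds 30, T[1] holds 6 and 26, T[4] holds 19 and 9, and T[2] and T[3] are empty.]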
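A minimal sketch of a hash table with chaining, assuming integer keys without satellite data and h(x) = x mod m as in the example. The class name `ChainedHashTable` is hypothetical, and Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Hash table with chaining, h(x) = x mod m (illustrative sketch)."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]  # one chain per slot

    def _h(self, key):
        return key % self.m

    def insert(self, key):
        # Insert at the head of the chain, as in Insert(T, x).
        self.table[self._h(key)].insert(0, key)

    def search(self, key):
        # Scan only the chain that the key hashes to.
        return key in self.table[self._h(key)]

    def delete(self, key):
        self.table[self._h(key)].remove(key)


t = ChainedHashTable(5)
for key in (9, 19, 26, 6, 30):
    t.insert(key)
chains = [list(chain) for chain in t.table]  # snapshot of the chains
found_26 = t.search(26)
t.delete(26)
```

With this insertion order the chains match the example: 30 in slot 0, 6 then 26 in slot 1, and 19 then 9 in slot 4.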

Analysis

Assumption of simple uniform hashing: any element is equally likely to hash into any of the m slots, independently of where other elements have hashed into.

The above assumption is true when the keys are chosen uniformly and independently at random (with repetitions), and the hash function satisfies |{k ∈ U : h(k) = i}| = u/m for every i ∈ {0, . . . , m − 1}.

We want to analyze the performance of hashing under the assumption of simple uniform hashing. This is the balls into bins problem.

Suppose we randomly place n balls into m bins. Let X be the number of balls in bin 1.

The time complexity of a random search in a hash table is Θ(1 + X).

The expectation of X

Let α = n/m.

Claim

E [X ] = α.

Proof.

Let I_j be a random variable which is 1 if the j-th ball is in bin 1, and 0 otherwise. Then $X = \sum_{j=1}^{n} I_j$, so

$$E[X] = E\left[\sum_{j=1}^{n} I_j\right] = \sum_{j=1}^{n} E[I_j] = \sum_{j=1}^{n} \Pr[I_j = 1] = n \cdot \frac{1}{m} = \alpha.$$

The distribution of X

Claim

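$\Pr[X = r] \approx e^{-\alpha}\frac{\alpha^r}{r!}$ (i.e., X has approximately Poisson distribution).

Proof.

$$\Pr[X = r] = \binom{n}{r}\left(\frac{1}{m}\right)^{r}\left(1 - \frac{1}{m}\right)^{n-r} = \frac{n(n-1)\cdots(n-r+1)}{r!} \cdot \frac{1}{m^r}\left(1 - \frac{1}{m}\right)^{n-r}$$

If m and n are large, $n(n-1)\cdots(n-r+1) \approx n^r$ and $\left(1 - \frac{1}{m}\right)^{n-r} \approx e^{-n/m}$. Thus, $\Pr[X = r] \approx e^{-n/m}\frac{(n/m)^r}{r!}$.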
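The quality of the Poisson approximation can be checked numerically. The sketch below compares the exact binomial probability with the Poisson formula for one arbitrarily chosen pair n = 10000, m = 5000 (so α = 2); the function names and the chosen values are assumptions made for illustration.

```python
import math


def binomial_pmf(n, m, r):
    """Exact Pr[X = r] when n balls are thrown into m bins."""
    return math.comb(n, r) * (1 / m) ** r * (1 - 1 / m) ** (n - r)


def poisson_pmf(alpha, r):
    """Poisson approximation e^(-alpha) * alpha^r / r!."""
    return math.exp(-alpha) * alpha ** r / math.factorial(r)


n, m = 10_000, 5_000
alpha = n / m
# Absolute gap between the exact value and the approximation for r = 0..7.
gaps = [abs(binomial_pmf(n, m, r) - poisson_pmf(alpha, r)) for r in range(8)]
```

For parameters of this size the two values agree to about three decimal places; the agreement degrades when n and m are small.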

The maximum number of balls in a bin

Let Y be the maximum number of balls in some bin.

Claim

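For m = n, $\Pr[Y \ge R] \le \frac{1}{n^{1.5}}$, where $R = e \ln n / \ln\ln n$.

Proof.

Let X_i be the number of balls in bin i.

$$\Pr[Y \ge r] \le \sum_{i=1}^{n} \Pr[X_i \ge r] = n \cdot \Pr[X \ge r]$$

$$\Pr[X \ge r] \le \binom{n}{r}\left(\frac{1}{n}\right)^{r} \le \frac{n^r}{r!} \cdot \frac{1}{n^r} = \frac{1}{r!} \le \left(\frac{e}{r}\right)^{r}$$

where the last inequality is true since $e^r = \sum_{i=0}^{\infty} \frac{r^i}{i!} \ge \frac{r^r}{r!}$.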

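The distribution of $\max_i X_i$

Proof (continued).

$$\begin{aligned}
\Pr[Y \ge R] &\le n \cdot \left(\frac{e}{R}\right)^{R} = n\left(\frac{\ln\ln n}{\ln n}\right)^{e \ln n/\ln\ln n} \\
&= e^{\ln n}\, e^{(\ln\ln\ln n - \ln\ln n) \cdot e \ln n/\ln\ln n} \\
&= e^{-(e-1)\ln n + \ln n \cdot \frac{e \ln\ln\ln n}{\ln\ln n}} \\
&\le e^{-1.5 \ln n} = \frac{1}{n^{1.5}}.
\end{aligned}$$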

Rehashing

We wish to maintain n = O(m) in order to have Θ(1) search time.

This can be achieved by rehashing. Suppose we want n ≤ m. When the table has m elements and a new element is inserted, create a new table of size 2m and copy all elements into the new table.

The cost of rehashing is Θ(n).
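The doubling rule can be sketched as follows, assuming a chained table with h(x) = x mod m whose hash function is recomputed for the new size. The class name `GrowingHashTable` and the initial size 4 are hypothetical choices.

```python
class GrowingHashTable:
    """Chained hash table that doubles when full (illustrative sketch)."""

    def __init__(self, m=4):
        self.m = m
        self.n = 0
        self.table = [[] for _ in range(m)]

    def insert(self, key):
        if self.n == self.m:
            # The table already holds m elements: move to a table of size 2m.
            self._rehash(2 * self.m)
        self.table[key % self.m].insert(0, key)
        self.n += 1

    def _rehash(self, new_m):
        # Theta(n) work: every stored key is hashed again for the new size.
        old_keys = [k for chain in self.table for k in chain]
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for k in old_keys:
            self.table[k % new_m].insert(0, k)


t = GrowingHashTable(m=4)
for key in range(9):
    t.insert(key)
stored = sorted(k for chain in t.table for k in chain)
```

Inserting 9 keys into a table that starts at size 4 triggers two rehashes (to 8 and then to 16), and n ≤ m holds after every insertion.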

Universal hash functions

Definition

A collection H of hash functions is universal if for every pair of distinct keys x, y ∈ U, $\Pr_{h \in H}[h(x) = h(y)] \le \frac{2}{m}$.

Example

Let p be a prime number larger than u.
$f_{a,b}(x) = ((ax + b) \bmod p) \bmod m$
$H_{p,m} = \{f_{a,b} \mid a \in \{1, 2, \ldots, p-1\},\ b \in \{0, 1, \ldots, p-1\}\}$
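As a sketch, a function can be drawn from this family by picking a and b at random, and for small parameters the collision probability of a fixed pair of keys can be computed exhaustively. The values p = 101 and m = 10 (so keys are assumed to be below 101) and the helper names are illustrative assumptions.

```python
import random


def make_hash(p, m, rng=random):
    """Draw f_{a,b} from H_{p,m} uniformly at random."""
    a = rng.randrange(1, p)  # a in {1, ..., p - 1}
    b = rng.randrange(0, p)  # b in {0, ..., p - 1}
    return lambda x: ((a * x + b) % p) % m


def collision_fraction(x, y, p, m):
    """Fraction of functions in H_{p,m} that map x and y to the same slot."""
    hits = sum(
        ((a * x + b) % p) % m == ((a * y + b) % p) % m
        for a in range(1, p)
        for b in range(p)
    )
    return hits / ((p - 1) * p)


p, m = 101, 10
h = make_hash(p, m)
slot = h(42)
frac = collision_fraction(3, 47, p, m)
```

The exhaustive count covers all (p − 1)·p functions, so `frac` is the exact collision probability for the pair (3, 47) rather than an estimate.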

Theorem

Suppose that H is a universal collection of hash functions. If a hash table for S is built using a randomly chosen h ∈ H, then for every k ∈ U, the expected time of Search(S, k) is Θ(1 + n/m).

Proof.

Let X = length of T[h(k)]. Then $X = \sum_{y \in S} I_y$, where $I_y = 1$ if h(y.key) = h(k) and $I_y = 0$ otherwise.

$$E[X] = E\left[\sum_{y \in S} I_y\right] = \sum_{y \in S} E[I_y] = \sum_{y \in S} \Pr_{h \in H}[h(y.\mathrm{key}) = h(k)] \le 1 + n \cdot \frac{2}{m}.$$

Time complexity of chaining

Under the assumption of simple uniform hashing, the expected time of a search is Θ(1 + α).

If α = Θ(1), and under the assumption of simple uniform hashing, the worst case time of a search is Θ(log n / log log n), with probability at least $1 - 1/n^{\Theta(1)}$.

If the hash function is chosen from a universal collection at random, the expected time of a search is Θ(1 + α).

The worst case time of insert is Θ(1) if there is no rehashing.

Interpreting keys as natural numbers

How can we convert floats or ASCII strings to natural numbers?

An ASCII string can be interpreted as a number in base 128.

Example

For the string CLRS, the ASCII values of the characters are C = 67, L = 76, R = 82, S = 83. So CLRS is

$$(67 \cdot 128^3) + (76 \cdot 128^2) + (82 \cdot 128^1) + (83 \cdot 128^0) = 141{,}764{,}947.$$
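The conversion can be reproduced with a short sketch. The function name `string_to_number` is hypothetical, and the input is assumed to contain only ASCII characters.

```python
def string_to_number(s, base=128):
    """Interpret an ASCII string as a natural number written in the given base."""
    value = 0
    for ch in s:
        # Shift the digits read so far one position left, then add the new digit.
        value = value * base + ord(ch)
    return value


clrs = string_to_number("CLRS")
```

Python integers have arbitrary precision, so this works for long strings too, but the numbers grow quickly with the string length.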

Horner’s rule

Horner's rule:

$$a_d x^d + a_{d-1} x^{d-1} + \cdots + a_1 x^1 + a_0 = (\cdots((a_d x + a_{d-1})x + a_{d-2})x + \cdots)x + a_0.$$

Code:
y = a_d
for i = d − 1 to 0
    y = a_i + x·y

If d is large the value of y is too big.

Solution: evaluate the polynomial modulo p:
y = a_d
for i = d − 1 to 0
    y = (a_i + x·y) mod p
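A runnable version of the modular loop, as a sketch. The function name `horner_mod`, the coefficient order (a_d first), and the modulus p = 1,000,003 are assumptions chosen for illustration.

```python
def horner_mod(coeffs, x, p):
    """Evaluate a_d*x^d + ... + a_0 modulo p, with coeffs = [a_d, ..., a_0]."""
    y = coeffs[0] % p
    for a in coeffs[1:]:
        # Reducing after every step keeps y below p at all times.
        y = (a + x * y) % p
    return y


codes = [ord(c) for c in "CLRS"]  # [67, 76, 82, 83]
p = 1_000_003                     # illustrative modulus
value = horner_mod(codes, 128, p)
small = horner_mod([2, 0, 1], 3, 7)  # (2*3^2 + 1) mod 7
```

Applied to the ASCII codes of CLRS with x = 128, the result equals 141,764,947 mod p, without ever forming a number larger than about 128·p.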

Rabin-Karp pattern matching algorithm

Suppose we are given strings P and T, and we want to find all occurrences of P in T.
The Rabin-Karp algorithm is as follows:

Compute h(P)
for every substring T′ of T of length |P|
    if h(T′) = h(P) check whether T′ = P.

The values h(T′) for all T′ can be computed in Θ(|T|) time using rolling hash.

Example

Let T = BGUCS and P = GUC. Let T1 = BGU, T2 = GUC.

$$h(T_1) = (66 \cdot 128^2 + 71 \cdot 128 + 85) \bmod p$$

$$h(T_2) = (71 \cdot 128^2 + 85 \cdot 128 + 67) \bmod p = \left((h(T_1) - 66 \cdot 128^2) \cdot 128 + 67\right) \bmod p$$
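The algorithm with a rolling hash can be sketched as below. The function name `rabin_karp`, base 128, and the modulus p = 1,000,003 are illustrative assumptions; each hash match is confirmed by a direct string comparison, as in the description above.

```python
def rabin_karp(text, pattern, base=128, p=1_000_003):
    """Return the start indices of all occurrences of pattern in text."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []
    high = pow(base, k - 1, p)  # weight of the leading character
    hp = ht = 0
    for i in range(k):
        hp = (hp * base + ord(pattern[i])) % p
        ht = (ht * base + ord(text[i])) % p
    matches = []
    for start in range(n - k + 1):
        # Compare the strings only when the hash values agree.
        if ht == hp and text[start:start + k] == pattern:
            matches.append(start)
        if start + k < n:
            # Rolling step: drop text[start], shift, append text[start + k].
            ht = ((ht - ord(text[start]) * high) * base + ord(text[start + k])) % p
    return matches


found = rabin_karp("BGUCS", "GUC")
```

The rolling step is the same identity as in the example: subtract the leading character's contribution, multiply by 128, and add the new character, all modulo p.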

Applications

Data deduplication: Suppose that we have many files, and some files have duplicates. In order to save storage space, we want to store only one instance of each distinct file.

Distributed storage: Suppose we have many files, and we want to store them on several servers.

Distributed storage

Let m denote the number of servers.

The simple solution is to use a hash function h : U → {1, . . . , m}, and assign file x to server h(x).

The problem with this solution is that if we add a server, we need to do rehashing which will move most files between servers.

Distributed storage: Consistent hashing

Suppose that the servers have identifiers s1, . . . , sm.

Let h : U → [0, 1] be a hash function.

For each server i associate a point h(s_i) on the unit circle.

For each file f, assign f to the server whose point is the first point encountered when traversing the unit circle anti-clockwise starting from h(f).

[Figure: unit circle with the point 0 marked and the points of server 1, server 3, and server 2.]

When a new server m + 1 is added, let i be the server whose point is the first server point after h(s_{m+1}).

We only need to reassign some of the files that were assigned to server i.

The expected number of file reassignments is n/(m + 1).

[Figure: the same unit circle after adding the point of server 4.]
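A minimal sketch of consistent hashing, under several assumptions that are not in the slides: points in [0, 1) are derived from SHA-256 digests, the traversal direction is modeled as increasing point value with wrap-around, and the names `point`, `assign`, `s1` to `s4`, and `file0` to `file999` are hypothetical.

```python
import bisect
import hashlib


def point(name):
    """Map an identifier to a point in [0, 1) (stand-in for the hash function h)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest, "big") / 2 ** 256


def assign(files, servers):
    """Assign each file to the first server point at or after its own point."""
    ring = sorted((point(s), s) for s in servers)
    points = [pt for pt, _ in ring]
    result = {}
    for f in files:
        # The modulo wraps around the circle when no server point follows.
        i = bisect.bisect_left(points, point(f)) % len(ring)
        result[f] = ring[i][1]
    return result


files = [f"file{i}" for i in range(1000)]
before = assign(files, ["s1", "s2", "s3"])
after = assign(files, ["s1", "s2", "s3", "s4"])
moved = [f for f in files if before[f] != after[f]]
```

Every file in `moved` ends up on the new server, and all of them come from the single server that follows the new point on the circle; the other files keep their assignment.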

Linear probing

In the following, we assume that the elements in the hash table are keys with no satellite information.

To insert an element k, try to insert to T[h(k)]. If T[h(k)] is not empty, try T[(h(k) + 1) mod m], then try T[(h(k) + 2) mod m], etc.

Code:
for i = 0, . . . , m − 1
    j = (h(k) + i) mod m
    if T[j] = NULL OR ...
        T[j] = k
        return
error “hash table overflow”

[Figure: empty table T[0..9], with h(k) = k mod 10.]

Example, with m = 10 and h(k) = k mod 10:

insert(T,14): T[4] is empty, so T[4] = 14.
insert(T,20): T[0] is empty, so T[0] = 20.
insert(T,34): T[4] is occupied, so T[5] = 34.
insert(T,19): T[9] is empty, so T[9] = 19.
insert(T,55): T[5] is occupied, so T[6] = 55.
insert(T,49): T[9] and T[0] are occupied, so T[1] = 49.

Search

How to perform Search?

[Figure: search(T,55) on the table built above. h(55) = 5, so T[5] and then T[6] are examined, and 55 is found in T[6].]

Search(T, k) is performed as follows:
for i = 0, . . . , m − 1
    j = (h(k) + i) mod m
    if T[j] = k
        return j
    if T[j] = NULL
        return NULL
return NULL

[Figure: search(T,25). h(25) = 5; T[5] and T[6] hold other keys and T[7] is NULL, so the search returns NULL.]
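The insert and search procedures for linear probing can be sketched in Python and run on the example keys. The sketch assumes h(k) = k mod m, uses `None` for NULL, and ignores deletions (so the elided second condition of the insert test is not modeled); the function names are hypothetical.

```python
def lp_insert(T, k):
    """Insert key k with linear probing; return the slot that was used."""
    m = len(T)
    for i in range(m):
        j = (k % m + i) % m
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")


def lp_search(T, k):
    """Return the slot holding k, or None if k is not in the table."""
    m = len(T)
    for i in range(m):
        j = (k % m + i) % m
        if T[j] == k:
            return j
        if T[j] is None:
            # An empty slot ends the probe sequence: k cannot be further on.
            return None
    return None


T = [None] * 10
slots = [lp_insert(T, k) for k in (14, 20, 34, 19, 55, 49)]
hit = lp_search(T, 55)
miss = lp_search(T, 25)
```

The keys land in slots 4, 0, 5, 9, 6, and 1 in that order; searching for 55 stops at slot 6, and searching for 25 stops at the empty slot 7.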

Delete

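Delete method 1: To delete element k, store a special value DELETED in the slot that contains k.

[Figure: delete(T,34). The slot T[5], which held 34, now contains DELETED (D); the keys 14, 20, 19, 55, and 49 stay in place.]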

Delete method 2: Erase k from the table (replace it by NULL) and also erase all the elements in the block of k. Then, reinsert the latter elements to the table.

Example: delete(T,34). The block of 34 consists of the slots T[4], T[5], T[6], which hold 14, 34, and 55. After erasing the block, the table holds 20 in T[0], 49 in T[1], and 19 in T[9]. Reinserting 14 and 55 places them in T[4] and T[5].
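Delete method 2 can be sketched as follows, assuming h(k) = k mod m and taking the block of k to be the maximal run of occupied slots (cyclically) that contains k. The function names are hypothetical, and k is located with a plain list scan for brevity.

```python
def insert(T, k):
    """Linear-probing insert with h(k) = k mod m."""
    m = len(T)
    for i in range(m):
        j = (k % m + i) % m
        if T[j] is None:
            T[j] = k
            return
    raise OverflowError("hash table overflow")


def delete_with_reinsert(T, k):
    """Erase k and its whole block, then reinsert the other block elements."""
    m = len(T)
    pos = T.index(k)  # raises ValueError if k is absent
    block = [pos]
    j = (pos - 1) % m
    while T[j] is not None and j != pos:      # extend the block to the left
        block.append(j)
        j = (j - 1) % m
    j = (pos + 1) % m
    while T[j] is not None and j not in block:  # extend the block to the right
        block.append(j)
        j = (j + 1) % m
    others = [T[j] for j in block if T[j] != k]
    for j in block:
        T[j] = None
    for key in others:
        insert(T, key)


T = [20, 49, None, None, 14, 34, 55, None, None, 19]
delete_with_reinsert(T, 34)
```

On the example table the block is slots 4 to 6; after the call, 14 is back in slot 4 and 55 has moved to slot 5.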

Open addressing

Open addressing is a generalization of linear probing.

Let h : U × {0, . . . , m − 1} → {0, . . . , m − 1} be a hash function such that {h(k, 0), h(k, 1), . . . , h(k, m − 1)} is a permutation of {0, . . . , m − 1} for every k ∈ U.

The slots examined during search/insert are h(k, 0), then h(k, 1), h(k, 2), etc.

In the example, h(k, i) = (h1(k) + i·h2(k)) mod 13, where
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)

[Figure: table T[0..12] containing the keys 79, 69, 98, 72, and 50; insert(T,14) examines h(14, 0) = 1, h(14, 1) = 5, and h(14, 2) = 9.]

Insertion is the same as in linear probing:
for i = 0, . . . , m − 1
    j = h(k, i)
    if T[j] = NULL OR T[j] = DELETED
        T[j] = k
        return
error “hash table overflow”

Deletion is done using delete method 1 defined above (using special value DELETED).

Double hashing

In the double hashing method,

h(k, i) = (h1(k) + i·h2(k)) mod m

for some hash functions h1 and h2.
The value h2(k) must be relatively prime to m for the entire hash table to be searched. This can be ensured by either

Taking m to be a power of 2, and h2 such that its image contains only odd numbers.
Taking m to be a prime number, and h2 such that its image contains only integers from {1, . . . , m − 1}.

For example,

h1(k) = k mod m

h2(k) = 1 + (k mod m′)

where m is prime and m′ < m.
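The probe sequence of double hashing can be computed directly. The sketch below uses the values from the earlier open addressing example, m = 13 and m′ = 11; the function name `probe_sequence` and the parameter name `m2` are hypothetical.

```python
def probe_sequence(k, m=13, m2=11):
    """Slots h(k, 0), ..., h(k, m - 1) for double hashing."""
    h1 = k % m
    h2 = 1 + (k % m2)  # always in {1, ..., m2}, so never 0
    return [(h1 + i * h2) % m for i in range(m)]


seq = probe_sequence(14)
is_permutation = sorted(seq) == list(range(13))
```

For k = 14, h1 = 1 and h2 = 4, so the sequence starts 1, 5, 9 and visits every slot exactly once, because 4 is relatively prime to 13.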

Time complexity of open addressing

Assume uniform hashing: the probe sequence of each key is equally likely to be any of the m! permutations of {0, . . . , m − 1}.

Assuming uniform hashing and no deletions, the expected number of probes in a search is

At most $\frac{1}{1-\alpha}$ for unsuccessful search.

At most $\frac{1}{\alpha} \ln \frac{1}{1-\alpha}$ for successful search.
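To get a feel for these bounds, they can be evaluated for a few load factors. The function names and the sample values of α below are illustrative assumptions; the formulas are the two bounds stated above and require 0 < α < 1.

```python
import math


def unsuccessful_bound(alpha):
    """Upper bound 1 / (1 - alpha) on expected probes, unsuccessful search."""
    return 1 / (1 - alpha)


def successful_bound(alpha):
    """Upper bound (1 / alpha) * ln(1 / (1 - alpha)), successful search."""
    return (1 / alpha) * math.log(1 / (1 - alpha))


bounds = {
    alpha: (unsuccessful_bound(alpha), successful_bound(alpha))
    for alpha in (0.25, 0.5, 0.75, 0.9)
}
```

At α = 0.5 the bounds are 2 probes and 2 ln 2 ≈ 1.39 probes; both grow without limit as α approaches 1, which is why the load factor is kept bounded away from 1.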
