º°©°¡® - cs.bgu.ac.ilds202/wiki.files/Presentation08-hash.pdf · Analysis of Chaining Assume simple uniform hashing : ± each key is equally likely to be hashed to any slot

1

Motivation

• Many applications require a dynamic set that supports only the dictionary operations:

– insert

– search

– delete

• Example:

– A symbol table in a compiler.

2

Keys

• We will consider all keys to be (possibly large) natural numbers.

3

• Suppose: – The keys are from a universe U={0, 1, … u-1}

– Keys are distinct

• The idea: – Set up an array T[0..u-1] in which

• T[i] = x if x T and x.key = i

• T[i] = null otherwise

– Operations take O(1) time! • search (T, k) return T[k]

• insert (T, x) T[x.key] x

• delete (T, x) T[x.key] null

• So – what’s the pro le ?

Direct Addressing

4

Direct Addressing

• Direct addressing works well when the size of

the universe U is relatively small

• But what if the keys are 32-bit integers?

– It will have 232 entries

5

Hashing

• Solution:

– map keys to smaller range 0 .. m-1

• This mapping is called a hashing

6

Collisions

• Two hashed key may collide with one another

• Solution:

– chaining

– open addressing

7

Chaining

• Chaining puts elements that hash to the same

slot in a linked list.

8

Search, Insert and Delete

• search(T, k)

– search for an element with key k in list T[h(k)]

• insert(T, x)

– insert x at the head of list T[h(x.key)]

• delete(T, x)

– delete x from the list T[h(x.key)]

9

Analysis of Chaining

• Assume simple uniform hashing: – each key is equally likely to be hashed to any slot.

• Given n keys and m slots in the table: the load factor = n/m is the average # keys per slot

• We will show that the average cost of an unsuccessful search for a key is Θ(1+).

• We will show that the average cost of a successful search is Θ(1+/2) = Θ(1+).

• Hence, the average cost is Θ(1+).

• Thus, if n = O(m), α = / = O / = O 1), and the average cost is Θ(1)

10


• Theorem: – An unsuccessful search takes expected time Θ(1+α .

• Proof: – Simple uniform hashing any key not already in the table

is equally likely to hash to any of the m slots.

– To search unsuccessfully for any key k, need to search to the end of the list T[h(k)].

– This list has expe ted le gth E[le gth of T[h k ]] = α. – Therefore, the expected number of elements examined in

a u su essful sear h is α. – Adding in the time to compute the hash function, the total

time required is Θ 1 + α .

11


• Theorem: – An successful search takes expected time Θ(1+α .

• Proof: – Assume that the element x being searched for is equally likely to be

any of the n elements stored in the table.

– The number of elements examined during a successful search for x is 1 more than the number of elements that appear before x in x’s list.

– These are the elements inserted after x was inserted (because we insert at the head of the list).

– So we need to find the average, over the n elements x in the table, of how many elements were inserted into x’s list after x was inserted.

– For i = 1, 2, . . . , n, let xi be the i th element inserted into the table, and let ki = key[xi].

– For all i and j, define indicator random variable Xi j = I{h(ki) = h(kj)}.

12


13

Keys

• How can we convert floats or ASCII strings to natural numbers?

• Example: – Co sider CLR“

• ASCII values: C = 67, L = 76, R = 82, S = 83.

– There are 128 basic ASCII values.

– “o i terpret CLR“ • (67 ∙ 1283)+ (76 ∙ 1282)+ (82 ∙ 1281)+ (83 ∙ 1280) =

141,764,947.

14

Hor er’s rule

• Hor er’s rule:

• Code:

y = ad

for i = d-1 to 0

y = ai+x·y

• If d is large the value of y is too big.

• Solution: evaluate the polynomial modulo m:

y = ad

for i = d-1 to 0

y = (ai+x·y) mod m

0210

1

1

1

1 ...)))(...(... axxaxaxaaxaxaxa ddd

d

d

d

d

Choosing A Hash Function

• Clearly choosing the hash function well is

crucial

– What will a worst-case hash function do?

– What will be the time to search in this case?

• What are desirable features of the hash

function?

– Should distribute keys uniformly into slots

– Should not depend on patterns in the data

16

Choosing A Hash Function

• Unfortunately, it is typically not possible to check this conditions – One rarely knows the probability distribution

according to which the keys are drawn

– The keys may not be drawn independently.

• Occasionally we do know the distribution – If the keys are known to be random real numbers k

independently and uniformly distributed in the range 0≤k<1

– The hash function h(k) = km satisfies the condition of simple uniform hashing.

17

The Division Method

h(k) = k mod m

• For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 4.

• Hashing by division is quite fast.

• What happens if m is a power of 2 (say 2P)?

18

The Multiplication Method

• For a constant 0 < A < 1: h(k) = m·frac(kA), where frac(x) is the fractional part of x.

• Advantage: Value of m is not critical.

• Usually m = 2P for some integer p.

• Suppose that the word size is w bits.

• Choose A = s/2w, where is an integer in the range 0 < s < 2w

• The computation of h(k) can be done using integer operations:

– ks = kA·2w, so the rightmost w bits of ks are equal to frac(kA)·2w.

– m·frac(kA) = first p bits of frac(kA) after floating point = leftmost p bits of the rightmost w bits of ks.

19

Universal Hashing

• Peak a hash function randomly when the

algorithm begins

– A way to randomize the algorithm to control the

input distribution.

– Need a good family of hash functions to choose

from

20

Universal Set

• A finite collection H of hash functions is

universal

– if for each k, l U, where k ≠ l, the u er of hash functions h H for which h(k) = h(l) is

≤|H|/ , where is the size of the hash ta le.

• Alternatively, H is universal

– if, with a hash function h chosen randomly from H,

the probability of a collision between two

different keys is no more than 1/m.

21

Universal Hashing

• Theorem: – Choose h from a universal family of hash functions

– Hash n keys into a table T of m slots, n m

– Then the expected number of collisions involving a particular key x is less than 1

• Proof: – For each pair of keys x, y, let Ixy the indicator that y and x collide.

– E[Ixy] = 1/m

– Let Cx be total number of collisions involving key x

–

– Since n m, we have E[Cx] < 1

• Corollary

– Using chaining and universal hashing, the expected time for each search operation is O(1).

m

1n]E[I]IE[]E[C

xyTy

xy

xyTy

xyx

22

A Universal Hash Set of Functions

• Let U = {1, …, u} • Fix a prime p > u.

• For every a є {1, …, p-1}, b є {0, …, p-1}, define

hab(k) = ((ak + b) mod p) mod m, where m is the

size of the table

• H = {hab(k) } is universal

23

Open Addressing

• Basic idea:

– If slot is full, try another slot, etc., until an open slot is found (probing)

– To search, follow same sequence of probes as would be used when inserting the element

• If reach element with correct key, return it

• If reach a null pointer, element is not in table

• Good for fixed sets (adding but no deletion)

– Example: spell checking

• Ta le eed ’t e u h igger tha n

– But, comparing to chaining, there is more space available.

24

Hash Function

such that

for each k є U

{h(k,0),h(k,1), ..., h(k, m-1)}

is a permutation of

{0, 1, ..., m -1}

25

Insert & Search

26

Deletion

• Cannot just put null into the slot containing

the key we want to delete.

• Solution:

– Use a special value DELETED when marking a slot

as empty during deletion.

• The disadvantage:

– Search time is no longer dependent on the load

factor α.

27

Linear Probing

Let

h' : U → {0, 1, ..., m - 1} be an hash function

Define

h(k, i) = (h'(k) + i) mod m

• Easy to implement

• Suffers from a primary clustering

– Long runs of occupied sequences build up.

28

Double hashing

h(k, i) = (h1(k) + ih2(k)) mod m

where

h1 and h2 are hash functions

• Advantage:

– Two keys with same hash may have different steps.

– Only m2 different probe sequences.

29

Example

• h1(k) = k mod 13

• h2(k) = 1 + (k mod 11)

30

Analysis of Open-Address Hashing

• Theorem: – Given an open-address hash ta le with load fa tor α <

1,

– The expected number of probes in a successful search is at most

• assuming uniform hashing

– Each key is equally likely to have any of the m! permutations of 0, 1, . . . , − 1 as its probe sequence

• assuming that each key in the table is equally likely to be searched for.

31

Analysis of Open-Address Hashing

• Theorem: – Given an open-address hash table with load factor

α < 1,

– The expected number of probes in an unsuccessful search is at most 1/(1-α ,

• assuming uniform hashing.

• If α is a o sta t, a sear h ru s i O 1) time.

• Theorem: – The expected number of probes to insert is at most

1/(1 − α .

32

Bloom Filter

33

• . • URL 10,000,000

. , ( -URL .)

ו • ר ב . פ , .

• , , (false positive) ,

.

34

• -URL . • URL .

, -URL . • , -

URL . ," . -false positive .

35

36

פ –פ

• URL 10,000,000 . • T 80,000,000 , h .

-T -0. • URL x 1 - T[h(x)]. • URL x ,

T[h(x)] . • T[h(x)]=0 . • T[h(x)]=1 . • , -T[h(x)]=1 ?

• y h(x)=h(y). •

10,000,000/80,000,000=0.125

37

-פ

• m k . • : 0. • x ,

x k , k " " .

• y , k

"."

• n .

38

-פ

• , .

• " " ( k,m - n )

.

39

(m=12,k=3)

0 0 0 0 0 0 0 0 0 0 0 0

:

:x1

0 0 0 0 0 1 0 1 0 0 1 0

x1

:x2

0 0 1 1 0 1 0 1 0 0 1 0

x1 x2

40

(m=12,k=3)

:y1

0 0 1 1 0 1 0 1 0 0 1 0

x1 x2 y1

: false positive y2

0 0 1 1 0 1 0 1 0 0 1 0

x1 x2 y2

41

false positive

• , false positive k

1. • n ,

0 :

" " 1 , 0 kn

false positive

• k 1 :

פ false positive • m ( )-n

, k ( ) false

positive ? :false positive

• k k 0 .

.

)1ln(/ /

)1(mkn

ekkmknee

)1ln(/ mkn

ekg

פ false positive

• false positive :

• Byte 8 bit m/n=8 false positive 0.0214.

2lnn

mk

Documents

º°©°¡® - cs.bgu.ac.ilds202/wiki.files/Presentation08-hash.pdf · Analysis of Chaining Assume simple uniform hashing : ± each key is equally likely to be hashed to any slot