39
CS 2230 CS II: Data structures Meeting 29: Hashing Brandon Myers University of Iowa https://en.wikipedia.org/wiki/Hash_function

CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

CS2230CSII:Datastructures

Meeting29:HashingBrandonMyers

UniversityofIowa

https://en.wikipedia.org/wiki/Hash_function

Page 2: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Today’slearningobjectives

• IdentifyvariousdatastructurestoimplementaSet• Calculatethememoryusageofhashingdatastructures• ExecutetheSetmethodsforvarioushashsetimplementations,includingwhentherearecollisions• Identifyimportantpropertiesofhashcodes

Page 3: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

RoadmapPropose to represent a set of integers as an array of booleans so that we can

search in O(1) time

Wow that's fast! But it has the problem that it requires

too much memory!

Reduce the size of the array, but now elements

collide on the same index

Deal with collisions with a variety of methods

("chaining, probing")

Represent sets of any object by using a hash

function to turn the object into an integer

Page 4: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

WhatarewayswecanrepresentaSetofintegers?

1014

21

2

4

7

https://b.socrative.com/login/student/roomCS2230Xids1000-2999roomCS2230Yids3000+

IdentifyvariousdatastructurestoimplementaSet

Page 5: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Datastructure howtosearchforaspecificvalue

ifyouknowwhereitisstored(e.g.,indexorreference)

unsortedarrayofintegers searchfromstartuntilwefindit

gotothe index

10 14 4 15 7 2110 14 4 15 7 21

find(4)

10 14 4 15 7

0 1 2 3 4

21

5

get(2)

Page 6: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Datastructure howtosearchforaspecificvalue

ifyouknowwhereitisstored(e.g.,indexorreference)

unsortedarrayofintegers searchfromstartuntilwefindit

gotothe index

sortedarrayofintegers binarysearch gototheindex

10 14 4 15 7 21

4 7 10 14 15 21

10 14 4 15 7 21

find(4)

10 14 4 15 7

0 1 2 3 4

21

5

get(2)

4 7 10 14 15

0 1 2 3 4

21

5

find(7)

4 7 10 14 15

0 1 2 3 4

21

5

get(1)

Page 7: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Datastructure howtosearchforaspecificvalue

ifyouknowwhereitisstored(e.g.,indexorreference)

unsortedarrayofintegers searchfromstartuntilwefindit

gotothe index

sortedarrayofintegers binarysearch gototheindex

binarysearchtreeofintegers searchfromroot gotothenode

10 14 4 15 7 21

4 7 10 14 15 21

6

3

42

6

3

42

10 14 4 15 7 21

find(4)

10 14 4 15 7

0 1 2 3 4

21

5

get(2)

4 7 10 14 15

0 1 2 3 4

21

5

find(7)

4 7 10 14 15

0 1 2 3 4

21

5

get(1)

6

3

42

find(4)

Page 8: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Datastructure howtosearchforaspecificvalue

ifyouknowwhereitisstored(e.g.,indexorreference)

unsortedarrayofintegers searchfromstartuntilwefindit

gotothe index

sortedarrayofintegers binarysearch gototheindex

binarysearchtreeofintegers searchfromroot gotothenode

hugearray ofbooleans (truemeansthevalueisintheset)

use thevalueasanindex sameassearch

10 14 4 15 7 21

4 7 10 14 15 21

6

3

42

6

3

42

10 14 4 15 7 21

find(4)

10 14 4 15 7

0 1 2 3 4

21

5

get(2)

4 7 10 14 15

0 1 2 3 4

21

5

find(7)

4 7 10 14 15

0 1 2 3 4

21

5

get(1)

6

3

42

find(4)

F T T F ... F

0 1 3 Integer.MAX_VALUE

find(3)

2

F T T F ... F

0 1 3 Integer.MAX_VALUE

find(3)

2

F T T F ... F

0 1 3 Integer.MAX_VALUE

2

Page 9: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Thisdatastructureisgreat!FindanyvalueinO(1)time!

Problems?

boolean[] set = new boolean[Integer.MAX_INT+1];set[1] = true; // add 1set[2] = true; // add 2

F T T F ... F

0 1 3 Integer.MAX_VALUE

find(3)

2

Page 10: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Calculatethememoryusageofhashingdatastructures

Let𝑀(𝑛)betheamountofmemory thisSetuses,where𝑛=numberofelementsintheSet.Whichistrueandthebestbound?

a) 𝑀 𝑛 ∈ 𝑂(1)b) 𝑀 𝑛 ∈ 𝑂(𝑙𝑜𝑔𝑛)c) 𝑀 𝑛 ∈ 𝑂(𝑛)d) 𝑀 𝑛 ∈ 𝑂(𝐼𝑛𝑡𝑒𝑔𝑒𝑟.𝑀𝐴𝑋_𝐼𝑁𝑇)e) 𝑀 𝑛 ∈ 𝑂(𝑛 ∗ 𝐼𝑛𝑡𝑒𝑔𝑒𝑟.𝑀𝐴𝑋_𝐼𝑁𝑇)

F T T F ... F

0 1 3 Integer.MAX_VALUE

2

https://b.socrative.com/login/student/roomCS2230Xids1000-2999roomCS2230Yids3000+

Page 11: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Forexample...

Integer.MAX_VALUE= 2>? − 1boolean datatypeis1to2bytes

231-1*2bytes=~4GBevenifyoursetisnearlyempty!

Ifyouarecleverandrepresenttheboolean as1biteach(0=false,1=true)thenyoucangetdownto268MB

Evenif268MBfitsinyourcomputer’sRAM,realitybitesyou:ifyourelementsareuniformlyrandomlydistributedacrossthose268MBthentheelementsofyoursetwon’tallbeinyourcomputer’sfastcachememory,whichhasacapacityinthe100sofKB(takeCS:2630tolearnmore!)

F T T F ... F

0 1 3 Integer.MAX_VALUE

2

Page 12: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

F F F F F F

0 1 32 4 5

FixingthememoryproblemLimitthearraytoasmallercapacity,say6

add(2)

F F T F F F

0 1 32 4 5add(7)

F T T F F F

0 1 32 4 5

howtoadd(𝑖):marktrueatindex𝑖𝑚𝑜𝑑𝑐𝑎𝑝𝑎𝑐𝑖𝑡𝑦

(bonus:wecanalsostorenegativeintegersnow)

anewproblem!Itlookslike1isintheset(and13,19,25,...)eventhoughweonlyadded7

Page 13: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

F F T F F F

0 1 32 4 5add(7)

F T T F F F

0 1 32 4 5

anewproblem!Itlookslike1isintheset(and14,21,28,...)eventhoughweonlyadded7

Sincemanyvalues(1,7,13,19,25,...)maptoindex1,weneedtokeeptrackofwhich keyisstoredthere

null 7 2 null null null

0 1 32 4 5

We’llhavethearraystoreIntegers,wherenullmeansthebucketisemptyandanon-nullvalueisthekeystoredthere

add(2)

Integer[] set = new Integer[6]; // capacity=6 set[2 % 6] = 2; set[7 % 6] = 7;

%meansmod

Page 14: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

ExecutetheSetmethodsforvarioushashsetimplementations

null null null null null null

0 1 32 4 5

null

6Supposeoursetisinitiallyemptyasabove.Whatwillitlooklikeafterthefollowingelementsareadded?-1,19,17,21,and8

21 8 null 17 null 19 -1

-1 19 17 21 8 null null -1 8 17 19 21 null null

null 19 8 21 null 17 -1

a)

c)

b)

d)

https://b.socrative.com/login/student/roomCS2230Xids1000-2999roomCS2230Yids3000+

Page 15: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

LearningobjectivesforFinalproject• Design,implement,andtestanapplicationbasedonawrittenspecification• ChooseappropriateADTsandefficientdatastructuresforvarioustasks• Useversioncontroltocollaborateonacodingprojectwithanotherperson

Page 16: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Finalproject:Semanticsimilarityofwords

_3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets....

3similarwordsto“time”? come,0.6202651310028829sleep,0.613304123466795time,0.6082294707042364

Page 17: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Iamasickman.Iamaspitefulman.Iamanunattractiveman.Ibelievemyliverisdiseased.However,Iknownothingatallaboutmydisease,anddonotknowforcertainwhatailsme.

Theword“man”appearsinthefirstthreesentences.Itssemanticdescriptorvectorwouldbe:

[I=3,am=3,a=2,sick=1,man=0,spiteful=1,an=1,unattractive=1,believe=0,my=0,liver=0,is=0,diseased=0,However=0,know=0,nothing=0,at=0,all=0,about=0,disease=0,and=0,do=0,not=0,for=0,certain=0,what=0,ails=0,me=0]

Theword“liver”occursinthefourthsentence,soitssemanticdescriptorvectoris:

[I=1,am=0,a=0,sick=0,man=0,spiteful=0,an=0,unattractive=0,believe=1,my=1,liver=0,is=1,diseased=1,However=0,know=0,nothing=0,at=0,all=0,about=0,disease=0,and=0,do=0,not=0,for=0,certain=0,what=0,ails=0,me=0]

ourdefinitionofsemanticmeaning:thenumberoftimesawordappearswithotherwordsinthesamesentence.Eachwordbecomesavector.

Page 18: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

variousmeasuresofsimilarityoftwovectorscosinesimilaritynegativeEuclideandistancenegativeEuclideandistanceofnorms

usethesemeasurestoanswerqueriesaboutthewordsinatext

Page 19: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Projectduedates

• Nov17,11:59pm:Milestone1inGitHub(noslipdays)• FinishedPart1• PROGRESS_REPORT_NOV17.txt

• Nov29,11:59pm:Milestone2inGitHub(noslipdays)• FinishedPart3,andinitialdraftofPart4'swrittenanswers• PROGRESS_REPORT_NOV29.txt

• Dec6,11:59pm:FinalversioninGitHub(upto2slipdaysifatleast1partnerhasthem)• FinishedallParts

Page 20: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

null 7 2 null null null

0 1 32 4 5

Collisions!

uhoh...

add(13)//13%6=1

Page 21: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Youknowthatfeeling...whensomeonetakesyourparkingspot

Page 22: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

null 7 2 null null null

0 1 32 4 5

Dealingwithcollisions

add(13)//13%6=1

null 7 2 13 null null

0 1 32 4 5

null null null null

0 1 32 4 5

7

13 \

2 \

LinearprobingGotothenextspotuntilyoufindan opening

ChainingEachbucketisalinkedlistofelementsstoredthere

...andothertechniques!

Page 23: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

ExecutetheSetmethodsforvarioushashsetimplementations,includingwhentherearecollisions

null null null null null null

0 1 32 4 5

null

6Supposeoursetisinitiallyemptyasabove.Whatwillitlooklikeafterthefollowingelementsareadded,assumingweuselinearprobing?9,18,23,17

null null 9 null 18 null null

null null 9 23 18 17 null null null 23 17 4 null null

null null 9 18 23 17 null

a)

c)

b)

d)

https://b.socrative.com/login/student/roomCS2230Xids1000-2999roomCS2230Yids3000+

Page 24: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Howshouldweimplementremove()ifweareusinglinearprobing?(e.g.,remove(7))

a) setthethebuckettonullb) Removetheelementandmoveallelementsafteritleftby

onespacec) Moveallelementsafterit(uptothenextnull)leftbyone

spaced) leaveaspecialmarkerinthebucketthatmeansitisdeletede) thereisnogoodwaytoallowremove()

null 7 2 13 null null

0 1 32 4 5

https://b.socrative.com/login/student/roomCS2230Xids1000-2999roomCS2230Yids3000+

ExecutetheSetmethodsforvarioushashsetimplementations,includingwhentherearecollisions

Page 25: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing
Page 26: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

The hash set in your smartphone’s processor that you didn’t know about

allyourdata(e.g.,runningprograms,theOS,andtheirdata)

cachestoresasubsetofyourdata

itissmallbutthatmakesitfast!

Page 27: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

The hash set in your smartphone’s processor that you didn’t know about

Thecacheisbasicallyahashsethere’sonewhereeachbucketcanonlyhold1key

comparethekeywiththekeyinthebucket

thekey

hashfunctionistakesomeofthebitsofthememoryaddress

Page 28: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

The hash set in your smartphone’s processor that you didn’t know about

here’sonewhereeachbucketcanholdupto4keys

comparethekeywiththekeysinthebucket

thekey

hashfunctionistakesomeofthebitsofthememoryaddress

Thecacheisbasicallyahashset

Page 29: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Puttingnon-integersintoasetString[] set = new String[capacity];set[???] = "Cat";

Whereshouldweputthestring“Cat”?

Page 30: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Puttingnon-integersintoasetString[] set = new String[capacity];set[???] = "Cat";

Whereshouldweputthestring“Cat”?

useahashfunction

ahashfunctionisjustanyfunctionthatturnsanobjectintoaninteger

Page 31: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

null null null null null null

0 1 32 4 5

null

6Supposethehashfunctionforastringisthelength

Whatisthecontentsafterinserting“Cat”,“Dog”,“Froggy”?AssumeweuseLinearProbing.

null null null “Cat” “Dog” ”Froggy” nulla)

b)

c)

d)

“Froggy” null null “Cat” “Dog” null null

“Cat” ”Dog” ”Froggy” null null null null

null null null “Cat” “Dog” null “Froggy”

ExecutetheSetmethodsforvarioushashsetimplementations,includingwhentherearecollisions

https://b.socrative.com/login/student/roomCS2230Xids1000-2999roomCS2230Yids3000+

Page 32: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

forexample,OracleJavadistribution’shashfunctionforStrings

ahashfunctionisjustanyfunctionthatturnsanobjectintoaninteger

public class String {// a string is stored as an array of// "chars" (characters)private final char value[];

// hash function for Stringpublic int hashCode() {

int h = hash;if (h == 0 && value.length > 0) {

char val[] = value;

for (int i = 0; i < value.length; i++) {h = 31 * h + val[i];

}hash = h;

}return h;

}}

Page 33: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

AparaphraseofObject.hashCodespecificationintheJavaAPI• Thegeneralcontractof hashCode is:

• duringthesamerunofyourprogram,hashCode onaspecificobjectmustalwaysreturnthesameresult

• o1.equals(o2)⇒o1.hashCode()==o2.hashCode()

• Importanttoknowthat

o1.hashCode()==o2.hashCodeDOESNOTIMPLYo1.equals(o2)i.e.,itisokayfortwodifferentobjectstohavethesamehashCode (andprettymuchimpossibletoavoid)

https://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()

Page 34: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

IfIhavethefollowingcode

whichofthefollowingstatementsistrue

a) IfyouoverrideCat.equals youmustoverrideCat.hashCode

b) YoumustoverrideDog.equals andDog.hashCodec) YoumustoverrideCat.equals,Cat.hashCode,

Dog.equals,andDog.hashCode

Map<Cat, Dog> x = new HashMap<Cat,Dog>();

Identifyimportantpropertiesofhashcodes

https://b.socrative.com/login/student/roomCS2230Xids1000-2999roomCS2230Yids3000+

Page 35: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Runningtimeofsuccessfulfind?• Linearprobing

• expectedlengthofasequenceofnon-nulls:𝑂 1 + ??JK

where𝛼 istheloadfactor

• where𝛼 = #NOOPQRSTOUQUORVW

(𝛼 iscalledtheloadfactor)

• worstcase:O(n)ifthetableisallowedtogetnearlyfull(i.e.𝛼 isverycloseto1)

null 7 2 13 null null

0 1 32 4 5

Sincetherunningtimedependson𝛼,weshoulddecreaseitbygrowingthearraywhen𝛼 becomestoolarge

Ruleofthumb:if𝛼 increasesbeyond0.5or0.75,growthecapacity

Page 36: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Runningtimeofsuccessfulfind?• Chaining

• Whatistheexpectedlengthofthelongestchain?Whatistheaveragelengthofachain?

• Ofcourse,wewantourhashfunctiontodistributekeyswell(ifeverythinghashestoaconstantnumberofbuckets,lookuptimewouldbeO(n))

• Ifyouareluckyenoughfortheitemstobeuniformlydistributedacrossbucketsthentheaveragelengthofchainswouldbe1/𝛼

• However,thebirthdayparadoxfrom(see,DiscreteMath)tellsusthattheprobabilityofsomecollisionsishighevenifkeysaredrawnfromuniformdistribution

• Therefore,𝛼 shouldstillbekeptsufficientlysmallerthan1

null null null null

0 1 32 4 5

7

13 \

2 \

Page 37: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Today’slearningobjectives

• IdentifyvariousdatastructurestoimplementaSet• Calculatethememoryusageofhashingdatastructures• ExecuteSetmethodsforvarioushashsetimplementations,includingwhentherearecollisions• Identifyimportantpropertiesofhashcodes

Page 38: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Resources

Visualizationsofprobingandchaininghashtables!

http://www.cs.usfca.edu/~galles/visualization/OpenHash.html

http://www.cs.usfca.edu/~galles/visualization/ClosedHash.html

http://www.cs.usfca.edu/~galles/visualization/ClosedHashBucket.html

Page 39: CS 2230 CS II: Data structureshomepage.cs.uiowa.edu/.../lecture-033-hashing.pdf · •Identify various data structures to implement a Set •Calculate the memory usage of hashing

Acknowledgements

Cachediagrams– Ferry24Milan