Transcript
Page 1:

David Evans
http://www.cs.virginia.edu/evans

CS150: Computer Science
University of Virginia
Computer Science

Class 38: Googling

Page 2:

Some searches...

“David Evans”

“Dave Evans”

“idiot”

“lawn lighting”: tomorrow at 6pm (but Google doesn’t know that!)

Page 3:

Building a Web Search Engine

• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents

Page 4:

Crawling

Crawler:
activeURLs = [ "www.yahoo.com" ]
while len(activeURLs) > 0:
    newURLs = [ ]
    for URL in activeURLs:
        page = downloadPage (URL)
        newURLs += extractLinks (page)
    activeURLs = newURLs

Problems:
– Will keep revisiting the same pages
– Will take a very long time to get a good view of the web
– Will annoy web server admins
– downloadPage and extractLinks must be very robust

Page 5:

Crawling

Crawler:
activeURLs = [ "www.yahoo.com" ]
visitedURLs = [ ]
while len(activeURLs) > 0:
    newURLs = [ ]
    for URL in activeURLs:
        visitedURLs += URL
        page = downloadPage (URL)
        newURLs += extractLinks (page)
    activeURLs = newURLs - visitedURLs

What is the complexity?
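For concreteness, here is a rough, runnable Python sketch of the crawler above (not the lecture's code): downloadPage and extractLinks are simplified stand-ins that use only the standard library, a set replaces the visitedURLs list, and a small cap keeps the example from crawling indefinitely.

import re
import urllib.request

def downloadPage(URL):
    try:
        with urllib.request.urlopen(URL, timeout=5) as response:
            return response.read().decode("utf-8", errors="replace")
    except Exception:
        return ""          # robustness: dead links and timeouts are routine

def extractLinks(page):
    # crude stand-in: absolute http(s) links inside href attributes only
    return re.findall(r'href="(https?://[^"]+)"', page)

activeURLs = ["http://www.yahoo.com"]
visitedURLs = set()
while len(activeURLs) > 0 and len(visitedURLs) < 50:   # cap so the sketch terminates
    newURLs = []
    for URL in activeURLs:
        visitedURLs.add(URL)
        page = downloadPage(URL)
        newURLs += extractLinks(page)
    # visit each page only once
    activeURLs = [u for u in set(newURLs) if u not in visitedURLs]

A real crawler would also respect robots.txt and rate-limit itself, per the “annoy web server admins” problem on the previous slide.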

Page 6:

Distributed Crawler

activeURLs = [ "www.yahoo.com" ]
visitedURLs = [ ]
while len(activeURLs) > 0:
    newURLs = [ ]
    parfor URL in activeURLs:
        visitedURLs += URL
        page = downloadPage (URL)
        newURLs += extractLinks (page)
    activeURLs = newURLs - visitedURLs

Is this as “easy” to distribute as finding aliens?
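One way to approximate the parfor on a single machine is a thread pool that downloads each batch concurrently; a sketch follows, reusing the downloadPage and extractLinks stand-ins from the previous sketch. The genuinely hard part of distributed crawling (sharing visitedURLs and the URL frontier across many machines) is what makes it less “easy” than an embarrassingly parallel job like searching for aliens.

from concurrent.futures import ThreadPoolExecutor

def crawlOne(URL):
    # downloadPage and extractLinks as defined in the earlier sketch
    return URL, extractLinks(downloadPage(URL))

activeURLs = ["http://www.yahoo.com"]
visitedURLs = set()
with ThreadPoolExecutor(max_workers=8) as pool:
    while len(activeURLs) > 0 and len(visitedURLs) < 50:
        newURLs = []
        # each batch of downloads runs concurrently across the worker threads
        for URL, links in pool.map(crawlOne, activeURLs):
            visitedURLs.add(URL)
            newURLs += links
        activeURLs = [u for u in set(newURLs) if u not in visitedURLs]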

Page 7:

Building a Web Search Engine

• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents

Page 8:

Building an Index

• What if we just stored all the pages?

Answering a query would be Θ(size of the database), since we would need to look at all the characters in the database.

For Google: about 4 billion pages (the actual size is now considered a corporate secret) × 60 KB (average web page size) = ~184 trillion

Linear time is not nearly good enough when n is in the trillions.

Page 9:

Reverse Index

Word      Locations
“David”   [ …, http://www.cs.virginia.edu/~evans/index.html:12, … ]
“Evans”   [ …, http://www.cs.virginia.edu/~evans/index.html:19, … ]

What is the time complexity of search now?
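A toy sketch of how such a reverse index could be built, using the same URL:position notation as the table above; buildReverseIndex and the sample page text are made up for illustration.

def buildReverseIndex(pages):
    # pages: dict mapping URL -> page text
    index = {}
    for URL, text in pages.items():
        for position, word in enumerate(text.split()):
            # record every location where each word appears
            index.setdefault(word, []).append(URL + ":" + str(position))
    return index

index = buildReverseIndex({
    "http://www.cs.virginia.edu/~evans/index.html": "David Evans teaches CS150",
})
print(index["David"])   # ['http://www.cs.virginia.edu/~evans/index.html:0']

Looking up a word now costs a search over the index entries plus the size of the result, not the size of the whole database.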

Page 10:

Best Possible Searching

• Searching Problem:
  – Input: a target key key, a list of n <key, value> pairs, sorted by key using a comparison function cf
  – Output: if key is in the list, the value associated with key; otherwise, not found
• What is the best possible solution to the general searching problem?
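As a concrete candidate, here is a sketch of the standard comparison-based approach, binary search over the sorted pairs; cf is assumed to return a negative, zero, or positive value, like a comparison function.

def binarySearch(key, pairs, cf):
    low, high = 0, len(pairs) - 1
    while low <= high:
        mid = (low + high) // 2
        k, v = pairs[mid]
        c = cf(key, k)
        if c == 0:
            return v
        elif c < 0:
            high = mid - 1    # each comparison eliminates half the candidates
        else:
            low = mid + 1
    return "not found"

pairs = sorted([("Evans", 19), ("David", 12), ("CS150", 38)])
print(binarySearch("David", pairs, lambda a, b: (a > b) - (a < b)))   # 12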

Page 11:

Recall Class 13: Sorting problem is Ω(n log n)

• There are n! possible orderings
• Each comparison can eliminate at best ½ of them
• So, the best possible sorting procedure is Ω(log₂ n!)
• Stirling’s approximation: log(n!) = Θ(n log n)
  – So, the best possible sorting procedure is Ω(log(n!)) = Ω(n log n)

Recall that log turns multiplication into addition: log mn = log m + log n
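The Ω(n log n) bound can also be reached directly, without quoting Stirling, using only the log-of-a-product rule above; a worked sketch in LaTeX notation:

\log_2(n!) = \log_2\bigl(n \cdot (n-1) \cdots 2 \cdot 1\bigr)
           = \log_2 n + \log_2 (n-1) + \cdots + \log_2 1
           \ge \underbrace{\log_2 (n/2) + \cdots + \log_2 (n/2)}_{\text{the largest } n/2 \text{ terms}}
           = \tfrac{n}{2} \log_2 \tfrac{n}{2}
           = \Omega(n \log n)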

Page 12:

Searching Problem is Θ(log n)

• It is Ω(log n)
  – Each comparison can eliminate at best ½ of all the elements from consideration
• It is O(log n)
  – We know a procedure that solves it in Θ(log n)
• For Google: n is the number of distinct words on the web (hundreds of millions?)
  Θ(log n) is not good enough

Page 13:

Faster Searching?

• The proof that searching is Ω(log n) relied on knowing that the best a comparison can do is eliminate ½ the entries
• Can we do better?
  – Without knowing anything about the comparison: no
  – With some knowledge of the comparison: yes
• What if one comparison can eliminate O(n) of the entries?

Page 14:

Bin Searching

First Letter   Items
a              [ <“aardvark”, [http://www.aardvarksareus.com, …]>, … ]
b              [ … ]
…              …
z              [ …, <“zweitgeist”, […]> ]

def binsearch (key, table):
    return search (key, table[key[0]])

What is the time complexity of binsearch?
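A runnable sketch of first-letter binning (illustrative, not the lecture's code): one lookup on the first letter discards the other 25 bins, but with a fixed number of bins each bin still holds O(n) entries, so the inner search per bin still dominates. A plain linear scan stands in for that inner search here.

def binsearch(key, table):
    bin = table.get(key[0], [])          # one comparison on the first letter picks the bin
    for k, value in bin:                 # stand-in for an ordinary search within the bin
        if k == key:
            return value
    return "not found"

table = {
    "a": [("aardvark", ["http://www.aardvarksareus.com"])],
    "z": [("zweitgeist", [])],
}
print(binsearch("aardvark", table))      # ['http://www.aardvarksareus.com']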

Page 15:

Searching in O(1)

• To do better than Θ(log n), the number of bins must scale with n
  – The average number of elements in a bin must be O(1)
  – One comparison must eliminate O(n) of the elements

Page 16:

Hash Tables

• Bin = H(key, number of bins)
  – H is a hash function
  – We’ve seen cryptographic hash functions, where H must be collision resistant
  – For this we don’t need that; we just need H to distribute the keys well across the bins
• Finding a good H is difficult
  – You can download Google’s from http://goog-sparsehash.sourceforge.net/
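A minimal hash-table sketch in this spirit (not Google's implementation): Python's built-in hash stands in for a carefully engineered H, since here H only needs to spread keys evenly across bins rather than resist collisions.

nbins = 1024

def H(key, nbins):
    return hash(key) % nbins     # spread keys across the bins

bins = [[] for _ in range(nbins)]

def insert(key, value):
    bins[H(key, nbins)].append((key, value))

def lookup(key):
    for k, v in bins[H(key, nbins)]:     # expected O(1) entries per bin
        if k == key:
            return v
    return "not found"

insert("aardvark", 1024235)
print(lookup("aardvark"))    # 1024235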

Page 17:

Google’s Lexicon

• 1998: 14 million words (much more today)
• Lookup word in H(word, nbins)
• Maps to WordID

Key        Words
0          [ <“aardvark”, 1024235>, ... ]
1          [ <“aaa”, 224155>, ..., <“zzz”, 29543> ]
...        ...
nbins – 1  [ <“abba”, 25583>, ..., <“zeit”, 50395> ]

Page 18:

Google’s Reverse Index

WordId     ndocs
00000000   3
00000001   15
...        ...
16777215   105

Each WordId entry also holds a pointer into the inverted barrels.

(From the 1998 paper... may have changed some since then)

Lexicon: 293 MB (1998)
Inverted Barrels: 41 GB (1998)

Page 19:

Inverted Barrels

Each barrel entry: docid (27 bits), nhits (5 bits), then the hits (16 bits each).

Plain hit (16 bits): capitalized: 1 bit, font size: 3 bits, position: 12 bits
(positions cover the first 4095 characters; everything after that is lumped together)

Hits in anchors and titles carry extra info (and fewer position bits).
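A sketch of how a 16-bit plain hit could be packed with the field widths above (1 + 3 + 12 bits); the particular bit ordering here is an assumption for illustration, not the layout from the 1998 paper.

def packHit(capitalized, fontsize, position):
    # 1-bit flag, 3-bit font size, 12-bit word position
    assert capitalized in (0, 1) and 0 <= fontsize < 8 and 0 <= position < 4096
    return (capitalized << 15) | (fontsize << 12) | position

def unpackHit(hit):
    return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF

hit = packHit(1, 3, 19)        # e.g. a capitalized word at position 19
print(unpackHit(hit))          # (1, 3, 19)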

Page 20:

Building a Web Search Engine

• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents

Page 21:

Finding the “Best” Documents

• Humans rate them
  – “Jerry and David’s Guide to the World Wide Web” (became Yahoo!)
• Machines rate them
  – Count number of occurrences of keyword
    • Easy for sites to rig this
  – Machine language understanding not good enough
• Business Model
  – Whoever pays you the most is listed first

Page 22:

Random Walk Model

Initialize all page ranks = 0
p = select a random URL
for as long as you feel like:
    p.rank = p.rank + 1
    p = select a random link from Links (p)

Eventually, the ranks measure the probability that a random web surfer would encounter each page.

Problems with this?
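A runnable sketch of the random-walk model on a made-up three-page web; Links maps each page to its outgoing links. It also exposes one of the problems: the walk would break if it ever reached a page with no outgoing links.

import random

Links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # toy link graph

rank = {page: 0 for page in Links}
p = random.choice(list(Links))
for step in range(100000):            # "for as long as you feel like"
    rank[p] += 1
    p = random.choice(Links[p])       # select a random link from Links(p)

print(rank)   # visit counts estimate how often a random surfer hits each page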

Page 23:

Back Links

http://www.google.com/search?hl=en&lr=&q=link%3Awww.cs.virginia.edu%2F%7Eevans%2Findex.html&btnG=Search

219 backlinks

Page 24:

Counting Back Links

• link:http://www.deainc.com/
  – 109 backlinks (hey, I should be first!)
• Back links are not a good measure
  – Most of mine are from my own pages
    • But Google doesn’t always know that
  – Some pages are more important than others

Page 25:

PageRank

Weight the back links by the popularity of the linking page

def PageRank (u):
    rank = 0
    for b in BackLinks (u):
        rank = rank + PageRank (b) / Links (b)
    return rank

Would this work?

Page 26:

Converging PageRank

• Ranks of all pages depend on ranks of all other pages

• Keep recalculating ranks until they converge

def CalculatePageRanks (urls):
    initially, every rank is 1
    for as many times as necessary:
        calculate a new rank for each page (using the old ranks of the other pages)
        replace the old ranks with the new ranks

How do the initial ranks affect the results?

How many iterations are necessary?
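A runnable sketch of CalculatePageRanks on a toy graph, dividing each backlink's rank by its number of outgoing links as in the PageRank definition two slides back; Links and BackLinks here are made-up illustration data.

Links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
BackLinks = {u: [b for b in Links if u in Links[b]] for u in Links}

ranks = {u: 1.0 for u in Links}                  # initially, every rank is 1
for iteration in range(50):                      # "for as many times as necessary"
    newRanks = {u: sum(ranks[b] / len(Links[b]) for b in BackLinks[u])
                for u in Links}                  # new ranks from old ranks
    ranks = newRanks                             # replace old ranks with new ranks

print(ranks)   # B, which only A half-links to, ends up with the lowest rank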

Page 27:

PageRank

• Crawlable web (1998): 150 million pages, 1.7 billion links
• Database of 322 million links
  – Converges in ~50 iterations
• Initialization matters
  – All pages = 1: very democratic; models a browser equally likely to start on any random page
  – www.yahoo.com = 1, all others = 0: more like what Google probably uses

Page 28:

Query Work

• To respond to one query (2002):
  – Read 100 MB of data
  – Tens of billions of CPU cycles
• Google in 2002:
  – 15,000 commodity PCs
    • Racks of 88 2-GB PCs, $278,000 per rack
    • Power: 10 MWh/month ($1,500)
  – With 15,000 PCs, there will always be some with faults: load balancing, data partitioning

Page 29:

Building a Web Search Engine

• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents

Ready to go become the next Google?

Page 30:

Charge

• Before becoming the next Google, you need to finish PS8!
• Tomorrow: 6pm, Lighting of the Lawn
• Friday’s class:
  – A few other neat things about Google
  – Guidelines for project presentations
  – Exam review (email me your topics and questions)
• Monday: project presentations