Building Googlebot Youngjin Kim October 15, 2013

212 building googlebot - deview - google drive



http://www.creditwritedowns.com/2011/07/european-monetary-union-titanic.html


From the web to your query

● Query processing
  1. Look up keywords in the index => every relevant page
  2. Rank pages and display the result

● Google's index of the web
  keyword => { page1, page2, ... }

● Building the index requires processing the current version of all of the pages on the web...
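
The keyword => { page1, page2, ... } mapping can be illustrated with a toy inverted index. This is only a sketch: the page names and texts are made up, whitespace tokenization is an assumption, and requiring every query keyword to match (AND semantics) is one possible reading of "every relevant page". Ranking is left out.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of pages whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, query):
    """Return the pages containing every keyword of the query (no ranking)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages, for illustration only.
pages = {
    "page1": "building a web crawler",
    "page2": "web search index",
    "page3": "crawler politeness",
}
index = build_index(pages)
print(sorted(lookup(index, "web")))          # ['page1', 'page2']
print(sorted(lookup(index, "web crawler")))  # ['page1']
```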


All of the pages on the web!?!


60 Trillion Pages And Counting!


Our local copy of the web

● Crawling
  ○ Googlebot

● Storage
  ○ Google File System (GFS), BigTable

● Processing
  ○ MapReduce

● Data Centers
  ○ Job control, Fault-Tolerance, High-Speed Networking, Power/Cooling, etc.


Finding every page with Googlebot

● Basic discovery crawl
  1. Start with the set of known links
  2. Crawl every link (pages change!)
  3. Extract every new link, repeat

[Diagram labels: CrawlStatus, Crawl Pages, WebPage, Extract Links]
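
The discovery loop can be sketched on a tiny in-memory graph. Everything here is hypothetical: `FAKE_WEB` and `fetch_links` stand in for real fetching and link extraction, and the sketch makes a single pass, so it does not recrawl pages that change.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def fetch_links(url):
    """Stand-in for crawling a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def discovery_crawl(seeds):
    """Crawl outward from the known links until no new link turns up."""
    crawl_status = {}              # URL -> status; plays the role of CrawlStatus
    frontier = deque(seeds)        # links waiting to be crawled
    while frontier:
        url = frontier.popleft()
        if url in crawl_status:    # already crawled in this pass
            continue
        crawl_status[url] = "crawled"
        for link in fetch_links(url):
            if link not in crawl_status:
                frontier.append(link)
    return crawl_status

print(sorted(discovery_crawl(["a"])))  # ['a', 'b', 'c', 'd']
```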


Key considerations in crawling

● Polite crawling
  ○ Do not overload websites and DNS (no DoS!)
  ○ Understand web serving infrastructure

● Prioritize among discovered links
  ○ Crawl is a giant queuing system
  ○ Predicting serving capacity

● Do not waste resources
  ○ Ignore spam/broken links
  ○ Skip links with duplicate content
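
One way to picture politeness and prioritization together is a priority queue that refuses to hand out a link while its host is still cooling down. This is a simplified sketch, not how Googlebot works: the `PoliteScheduler` class, the fixed per-host delay, and the numeric priorities are all assumptions, and time is passed in explicitly to keep the example deterministic.

```python
import heapq
from urllib.parse import urlsplit

class PoliteScheduler:
    """Priority queue of links that enforces a minimum delay per host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay   # assumed fixed delay between fetches, in seconds
        self.heap = []               # (priority, sequence, url); lowest priority value first
        self.next_allowed = {}       # host -> earliest time of the next fetch
        self.seq = 0                 # tie-breaker that keeps insertion order

    def add(self, url, priority):
        heapq.heappush(self.heap, (priority, self.seq, url))
        self.seq += 1

    def pop_ready(self, now):
        """Return the best link whose host may be fetched at `now`, or None."""
        deferred = []
        chosen = None
        while self.heap:
            item = heapq.heappop(self.heap)
            host = urlsplit(item[2]).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.min_delay
                chosen = item[2]
                break
            deferred.append(item)    # host is still cooling down
        for item in deferred:        # put skipped links back in the queue
            heapq.heappush(self.heap, item)
        return chosen
```

With links for `foo.com` and `bar.com` queued, a second `foo.com` link is held back until the delay has elapsed, while the `bar.com` link can go out immediately.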


Mirrors

● Hosts with exactly the same content
  deview.kr
  www.deview.kr

● Paths within hosts with the same content
  www.cs.unc.edu/Courses/comp426-f09/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
  www.cs.unc.edu/Courses/comp590-001-f08/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
  www.cs.unc.edu/Courses/comp590-001-f08/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
  www.cs.unc.edu/Courses/jbs/tools/downloads/tomcat/jakarta-tomcat/4.1.29/webapps/tomcat-docs

● Unrestricted mirroring across hosts and paths
  ○ Distributed graph mining
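
For the simplest case only (whole hosts with exactly the same content), mirror detection can be sketched by fingerprinting each host and grouping equal fingerprints. The page bodies below are invented, and the approach is an assumption for illustration; it does not cover path-level or unrestricted mirroring, which the slide attributes to distributed graph mining.

```python
import hashlib
from collections import defaultdict

def host_signature(pages):
    """Fingerprint a host as the set of (path, content checksum) pairs."""
    return frozenset(
        (path, hashlib.sha256(body.encode("utf-8")).hexdigest())
        for path, body in pages.items()
    )

def find_host_mirrors(hosts):
    """Group hosts whose crawled paths and contents match exactly."""
    groups = defaultdict(list)
    for host, pages in hosts.items():
        groups[host_signature(pages)].append(host)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical crawl results: host -> {path: page body}.
hosts = {
    "deview.kr": {"/": "home", "/talks": "list"},
    "www.deview.kr": {"/": "home", "/talks": "list"},
    "example.org": {"/": "other"},
}
print(find_host_mirrors(hosts))  # [['deview.kr', 'www.deview.kr']]
```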


Optimizing our crawling

● Efficient crawling requires duplicate handling
  ○ Predict whether a newly discovered link points to duplicate content
  ○ Must happen before crawling

useful(link, status_table) => { yes, no }


Duplicates in Dynamic Pages

● Duplicates are most common in dynamic links
  http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2
  http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b
  http://foo.com/forum/viewtopic.php?t=3808&sid=3b1a8e27
  http://foo.com/forum/viewtopic.php?t=3808&sid=2a21f059
  ...

● Significance analysis
  ○ Parameter t is relevant
  ○ Parameter sid is irrelevant

● Duplicate prediction
  http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a => same content as the links above


Equivalence rules and class names

● Equivalence rule for a cluster
  ○ Set of relevant parameters
  ○ Set of irrelevant parameters

● Equivalence class name
  ○ Remove irrelevant parameters
    ECN(link1) = ECN(link2) => Same content!
  ○ For the previous example
    ECN(http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a) = http://foo.com/forum/viewtopic.php?t=3808
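
Computing an equivalence class name amounts to dropping the irrelevant parameters from the query string. A minimal sketch using Python's `urllib.parse`; the function name `ecn` and passing the irrelevant set as an argument are choices made for this example, and parameter order is left untouched.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def ecn(link, irrelevant):
    """Equivalence class name: the link with its irrelevant parameters removed."""
    parts = urlsplit(link)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in irrelevant
    ]
    return parts._replace(query=urlencode(kept)).geturl()

link = "http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a"
print(ecn(link, {"sid"}))  # http://foo.com/forum/viewtopic.php?t=3808
```

Two links are predicted to carry the same content when their `ecn` values are equal under the cluster's rule.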


Modified crawl algorithm

● Representative table
  ○ Equivalence class name => representative link

● Given a new link
  1. Identify cluster
  2. Look up equivalence rule
  3. Apply rule to determine equivalence class name
  4. Look up table of representatives
  5. Crawl link if no representative found
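
The five steps can be sketched as a single decision function. Several details are assumptions made for illustration: a cluster is taken to be host plus path, a rule is reduced to its set of irrelevant parameters, and the link is registered as representative at the moment it is approved for crawling.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def cluster_of(link):
    """Assumed clustering: links sharing host and path form one cluster."""
    parts = urlsplit(link)
    return parts.netloc + parts.path

def ecn(link, irrelevant):
    """Equivalence class name: the link without its irrelevant parameters."""
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in irrelevant]
    return parts._replace(query=urlencode(kept)).geturl()

def should_crawl(link, rules, representatives):
    """Return True if the link's equivalence class has no representative yet."""
    irrelevant = rules.get(cluster_of(link), set())  # steps 1-2: cluster and rule
    name = ecn(link, irrelevant)                     # step 3: equivalence class name
    if name in representatives:                      # step 4: representative lookup
        return False
    representatives[name] = link                     # step 5: crawl, remember the link
    return True

rules = {"foo.com/forum/viewtopic.php": {"sid"}}     # hypothetical learned rule
representatives = {}
base = "http://foo.com/forum/viewtopic.php"
print(should_crawl(base + "?t=3808&sid=126bc5f2", rules, representatives))  # True
print(should_crawl(base + "?t=3808&sid=d5b8483b", rules, representatives))  # False
```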


Equivalence rule generation

● Find every crawled link under a cluster
  cluster = { link1 : content1, link2 : content2, ... }

● Study evidence
  1. Insignificance analysis
  2. Significance analysis
  3. Parameter classification
  4. Equivalence rule construction

rule(cluster) = { param1 : RELEVANT, param2 : IRRELEVANT, param3 : CONFLICT, ... }


1. Insignificance analysis

● Group links by content
  content1 = { link11, link12, ... }
  content2 = { link21, link22, ... }
  ...

● For each parameter
  ○ For each content group with this parameter
    ■ If parameter values are not the same, add the number of links in the group to the parameter's insignificance index
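
A sketch of the insignificance index for one parameter, run on the foo.com/directory cluster used in the examples later in the deck. The content labels "A" and "B" stand in for content checksums, and counting only the links that actually carry the parameter is one reading of "the number of links".

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def insignificance_index(cluster, param):
    """Sum the sizes of content groups in which `param` takes more than one value.

    `cluster` maps each crawled link to a content identifier.
    """
    groups = defaultdict(list)                 # content -> values of `param`
    for link, content in cluster.items():
        query = dict(parse_qsl(urlsplit(link).query))
        if param in query:                     # only links carrying the parameter
            groups[content].append(query[param])
    return sum(len(values) for values in groups.values() if len(set(values)) > 1)

cluster = {
    "http://foo.com/directory?P=1&Q=3": "A",
    "http://foo.com/directory?P=2&Q=3": "A",
    "http://foo.com/directory?P=1&Q=2": "B",
    "http://foo.com/directory?P=2&Q=2": "B",
    "http://foo.com/directory?P=3&Q=2": "B",
    "http://foo.com/directory?P=4&Q=2": "B",
}
print(insignificance_index(cluster, "P"))  # 6
print(insignificance_index(cluster, "Q"))  # 0
```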


2. Significance analysis

● For each parameter
  ○ Remove the parameter from every link
    ■ Group content by remainder link
      remainder1 = { content11, content12, ... }
      remainder2 = { content21, content22, ... }
      ...
    ■ For each remainder group, increment the significance index by the number of unique contents minus 1
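
The significance index can be sketched the same way: strip the parameter, group contents by what is left of the link, and count the extra distinct contents in each group. The cluster is the foo.com/directory example from later in the deck, with "A" and "B" as stand-ins for content checksums.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl, urlencode

def significance_index(cluster, param):
    """Count distinct contents beyond the first in each remainder group.

    `cluster` maps each crawled link to a content identifier.
    """
    groups = defaultdict(set)                  # remainder link -> unique contents
    for link, content in cluster.items():
        parts = urlsplit(link)
        rest = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
        remainder = parts._replace(query=urlencode(rest)).geturl()
        groups[remainder].add(content)
    return sum(len(contents) - 1 for contents in groups.values())

cluster = {
    "http://foo.com/directory?P=1&Q=3": "A",
    "http://foo.com/directory?P=2&Q=3": "A",
    "http://foo.com/directory?P=1&Q=2": "B",
    "http://foo.com/directory?P=2&Q=2": "B",
    "http://foo.com/directory?P=3&Q=2": "B",
    "http://foo.com/directory?P=4&Q=2": "B",
}
print(significance_index(cluster, "P"))  # 0
print(significance_index(cluster, "Q"))  # 2
```

Removing Q leaves the remainders P=1 and P=2 each pointing at both contents, which is where the two increments come from; P=3 and P=4 each see a single content and add nothing.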


3. Parameter classification

● For each parameter
  ○ Compute content relevance (or irrelevance) value
  ○ Sample criteria: 90/10 rule
    ■ If relevance > 90% => parameter is RELEVANT
    ■ If relevance < 10% => parameter is IRRELEVANT
    ■ Otherwise, parameter is CONFLICT

Content_Relevance = Significance_Index / (Significance_Index + Insignificance_Index)

Content_Irrelevance = Insignificance_Index / (Significance_Index + Insignificance_Index)
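
The 90/10 rule translates directly into a small function. The thresholds are the sample values from the slide, expressed as percentages; returning CONFLICT when both indices are zero is an assumption, since the slide does not cover that case.

```python
def classify(significance, insignificance, high=90.0, low=10.0):
    """Classify a parameter from its two indices using the sample 90/10 rule."""
    total = significance + insignificance
    if total == 0:
        return "CONFLICT"          # assumption: no evidence either way
    relevance = 100.0 * significance / total   # content relevance, in percent
    if relevance > high:
        return "RELEVANT"
    if relevance < low:
        return "IRRELEVANT"
    return "CONFLICT"

print(classify(0, 6))  # IRRELEVANT
print(classify(2, 0))  # RELEVANT
print(classify(1, 1))  # CONFLICT
```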


Example: P is content-irrelevant

Cluster
  Content A
    http://foo.com/directory?P=1&Q=3
    http://foo.com/directory?P=2&Q=3
  Content B
    http://foo.com/directory?P=1&Q=2
    http://foo.com/directory?P=2&Q=2
    http://foo.com/directory?P=3&Q=2
    http://foo.com/directory?P=4&Q=2

Insignificance Analysis of P
  Content A: 2 links, different Ps
  Content B: 4 links, different Ps
  P's insignificance index = 2 + 4 = 6
  P's content-irrelevance value = 100%

Significance Analysis of P
  Q = 3: 2 links, Content A
  Q = 2: 4 links, Content B
  P's significance index = 0
  P's content-relevance value = 0%


Example: Q is content-relevant

Cluster
  Content A
    http://foo.com/directory?P=1&Q=3
    http://foo.com/directory?P=2&Q=3
  Content B
    http://foo.com/directory?P=1&Q=2
    http://foo.com/directory?P=2&Q=2
    http://foo.com/directory?P=3&Q=2
    http://foo.com/directory?P=4&Q=2

Insignificance Analysis of Q
  Content A: 2 links, same Q
  Content B: 4 links, same Q
  Q's insignificance index = 0
  Q's content-irrelevance value = 0%

Significance Analysis of Q
  P = 1: 2 links, Content A & B
  P = 2: 2 links, Content A & B
  Q's significance index = 1 + 1 = 2
  Q's content-relevance value = 100%


Facing the Real World

● Limitations
  ○ Co-changing parameters
  ○ Noisy data
  ○ Parameters not used in the standard way
  ○ Need for continuous validation

● State-of-the-art
  ○ White-box vs black-box

● Search is not solved
  ○ Not even crawling is solved!


Defining duplicates

● Identical pages
● Identical visible content
● Essentially identical visible content
  ○ Ignore page generation time
  ○ Ignore breaking news side bar
  ○ etc.

● What is the right answer?
  Two pages should be considered duplicates if our users would consider them duplicates

● How to translate this notion into a checksum?
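
As a starting point only, the second definition (identical visible content) can be approximated by hashing the text a reader would see. This sketch is an assumption, not an answer to the question on the slide: it skips script and style contents, collapses whitespace, and does nothing about generation timestamps or sidebars.

```python
import hashlib
import re
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def visible_checksum(html):
    """Checksum of the whitespace-normalized visible text of a page."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = "<html><body><h1>Hello</h1><p>World</p></body></html>"
b = "<html><body>\n<h1>Hello</h1>\n<script>var t = 1;</script><p>World</p></body></html>"
print(visible_checksum(a) == visible_checksum(b))  # True
```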


Q & A


Thank You!