Google, Web Crawling, and Distributed Synchronization
Zachary G. Ives, University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 1, 2008
Slide 2: Administrivia
- Homework due TONIGHT
- Please email me a list of project partners by THURSDAY
Slide 3: Google Architecture [Brin/Page 98]
- Focus was on scalability to the size of the Web
- First to really exploit link analysis
- Started as an academic project @ Stanford; became a startup
- Est. 450K commodity servers in 2006!!!
Slide 4: Google Document Repository
- "BigFile" distributed filesystem (GFS)
  - Support for 2^64 bytes across multiple drives, filesystems
  - Manages its own file descriptors, resources
- Repository
  - Contains every HTML page (this is the cached page entry), compressed with zlib (faster than bzip)
  - Two indices: DocID -> documents; URL checksum -> DocID
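The repository design above can be sketched in a few lines. This is a minimal toy model, not Google's implementation: the class name, the use of `zlib.crc32` as the URL checksum, and in-memory dicts standing in for the on-disk indices are all assumptions for illustration.

```python
import zlib

class Repository:
    """Toy sketch of the 1998 repository: pages stored zlib-compressed,
    with a DocID -> document index and a URL-checksum -> DocID index."""
    def __init__(self):
        self.docs = {}           # DocID -> zlib-compressed HTML
        self.by_checksum = {}    # URL checksum -> DocID
        self.next_id = 0

    def add(self, url: str, html: bytes) -> int:
        doc_id = self.next_id
        self.next_id += 1
        self.docs[doc_id] = zlib.compress(html)
        # crc32 stands in for whatever checksum Google actually used
        self.by_checksum[zlib.crc32(url.encode())] = doc_id
        return doc_id

    def lookup_url(self, url: str) -> bytes:
        doc_id = self.by_checksum[zlib.crc32(url.encode())]
        return zlib.decompress(self.docs[doc_id])
```

Both lookups (by DocID and by URL) resolve to the same compressed entry, which is what makes the cached-page feature cheap.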
Slide 5: Lexicon
- The list of searchable words (presumably, today it's used to suggest alternative words as well)
- The "root" of the inverted index
- As of 1998, 14 million "words"; kept in memory (was 256MB)
- Two parts:
  - Hash table of pointers to words and the "barrels" (partitions) they fall into
  - List of words (null-separated)
- Much larger today – may still fit in memory??
Slide 6: Indices – Inverted and "Forward"
- Inverted index divided into "barrels" (partitions by range)
  - Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
- Forward index uses the same barrels
  - Used to find multi-word queries with words in same barrel
  - Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
- Two barrels: short (anchor and title); full (all text)
- Note: barrels are now called "shards"

[Figure: layout of the forward barrels (total 43 GB), the lexicon (293 MB), and the inverted barrels (41 GB); lexicon entries (WordID, ndocs) point into barrels of (DocID, nhits, hit list) records. Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm]
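The barrel structure above can be illustrated with a small sketch. Everything here is a simplification under stated assumptions: the barrel count, the WordID space, and range partitioning by arithmetic are illustrative choices, and the hit lists are reduced to plain position lists.

```python
# Hypothetical sketch of an inverted index range-partitioned into barrels.
NUM_BARRELS = 4
MAX_WORDID = 1 << 16  # assumed WordID space for the example

def barrel_for(word_id):
    # Range partitioning: each barrel covers a contiguous WordID range.
    return word_id * NUM_BARRELS // MAX_WORDID

class InvertedBarrels:
    def __init__(self):
        # one dict per barrel: word_id -> {doc_id: hit list}
        self.barrels = [dict() for _ in range(NUM_BARRELS)]

    def add_hit(self, word_id, doc_id, position):
        postings = self.barrels[barrel_for(word_id)].setdefault(word_id, {})
        postings.setdefault(doc_id, []).append(position)

    def doclist(self, word_id):
        """The per-word posting list the lexicon would point at."""
        return self.barrels[barrel_for(word_id)].get(word_id, {})
```

Because two words in the same barrel share a partition, a multi-word query whose terms co-locate can be answered without touching the other barrels, which is the point of the forward index reusing the same partitioning.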
Slide 7: Hit Lists (Not Mafia-Related)
- Used in inverted and forward indices
- Goal was to minimize the size – the bulk of the data is in hit entries
- For the 1998 version, made it down to 2 bytes per hit (though that's likely climbed since then), special-cased to:
  - Plain:  cap: 1 | font: 3 | position: 12
  - Fancy:  cap: 1 | font: 7 | type: 4 | position: 8
  - Anchor: cap: 1 | font: 7 | type: 4 | hash: 4 | pos: 4
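The plain-hit layout above (1 capitalization bit, 3 font bits, 12 position bits = 16 bits) can be packed with ordinary shifts and masks. The function names are invented for this sketch; the field order within the 2 bytes is an assumption, since the slide only gives the widths.

```python
def pack_plain_hit(cap, font, position):
    """Pack a plain hit into 2 bytes: cap (1 bit), font (3 bits),
    position (12 bits). Field order is assumed for illustration."""
    assert 0 <= cap <= 1 and 0 <= font <= 7 and 0 <= position <= 4095
    return (cap << 15) | (font << 12) | position

def unpack_plain_hit(hit):
    """Inverse of pack_plain_hit: recover (cap, font, position)."""
    return hit >> 15, (hit >> 12) & 0x7, hit & 0xFFF
```

Fancy and anchor hits would reuse the same trick with different field widths, signaled by the font field being set to 7.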
Slide 8: Google's Distributed Crawler (in 1998)
- Single URL Server – the coordinator
  - A queue that farms out URLs to crawler nodes
  - Implemented in Python!
- Crawlers have 300 open connections apiece
  - Each needs its own DNS cache – DNS lookup is a major bottleneck
  - Based on asynchronous I/O (which is now supported in Java)
- Many caveats in building a "friendly" crawler (we'll talk about some momentarily)
Slide 9: Google's Search Algorithm
1. Parse the query
2. Convert words into WordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we're at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
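The core of steps 3–4 is a conjunctive scan over sorted doclists. Here is a minimal sketch of that scan in isolation, assuming each doclist is just a sorted list of DocIDs (real doclists carry hit lists too, and ranking is omitted entirely).

```python
def conjunctive_matches(doclists):
    """Yield DocIDs present in every sorted doclist, i.e. documents
    that match all query terms (steps 4 and 7 of the slide)."""
    if not doclists:
        return
    iters = [iter(dl) for dl in doclists]
    cur = [next(it, None) for it in iters]
    while all(d is not None for d in cur):
        hi = max(cur)
        if all(d == hi for d in cur):
            yield hi                       # matched in every doclist
            cur = [next(it, None) for it in iters]
        else:
            # advance every lagging doclist to at least `hi`
            for i, it in enumerate(iters):
                while cur[i] is not None and cur[i] < hi:
                    cur[i] = next(it, None)
```

Because each doclist is sorted by DocID, every list is traversed at most once, so the scan is linear in the total posting size regardless of how many terms the query has.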
Slide 10: Ranking in Google
- Considers many types of information:
  - Position, font size, capitalization
  - Anchor text
  - PageRank
  - Count of occurrences (basically, TF) in a way that tapers off (not clear if they do IDF?)
- Multi-word queries consider proximity also
Slide 11: Google's Resources
In 1998:
- 24M web pages
- About 55GB data w/o repository; about 110GB with repository
- Lexicon 293MB
- Worked quite well with a low-end PC
By 2006, > 25 billion pages, 400M queries/day:
- Don't attempt to include all barrels on every machine!
  - e.g., 5+TB repository on special servers separate from index servers
- Many special-purpose indexing services (e.g., images)
- Much greater distribution of data (2008, ~450K PCs?), huge net BW
- Advertising needs to be tied in (100,000 advertisers claimed)
Slide 12: Web Crawling
- You've already built a basic crawler
- First, some politeness criteria
- Then we'll look at the Mercator distributed crawler as an example
Slide 13: Basic Crawler (Spider/Robot) Process
- Start with some initial page P0
- Collect all URLs from P0 and add them to the crawler queue
  - Consider <base href> tag, anchor links, optionally image links, CSS, DTDs, scripts
- Considerations:
  - What order to traverse (polite to do BFS – why?)
  - How deep to traverse
  - What to ignore (coverage)
  - How to escape "spider traps" and avoid cycles
  - How often to crawl
Slide 14: Essential Crawler Etiquette
- Robot exclusion protocols
- First, ignore pages with:
  <META NAME="ROBOTS" CONTENT="NOINDEX">
- Second, look for robots.txt at the root of the web server
  - See http://www.robotstxt.org/wc/robots.html
- To exclude all robots from a server:
  User-agent: *
  Disallow: /
- To exclude one robot from two directories:
  User-agent: BobsCrawler
  Disallow: /news/
  Disallow: /tmp/
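Python's standard library already implements this protocol, so a crawler need not parse robots.txt by hand. The sketch below feeds the slide's second example file directly to `urllib.robotparser` (normally you would call `set_url(...)` and `read()` to fetch it over HTTP); the hostname is hypothetical.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse the slide's example rules in place of fetching /robots.txt.
rp.parse("""
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
""".splitlines())

# BobsCrawler is shut out of the two directories but nothing else;
# other robots are unaffected by this entry.
blocked = rp.can_fetch("BobsCrawler", "http://example.com/news/today.html")
allowed = rp.can_fetch("BobsCrawler", "http://example.com/index.html")
other = rp.can_fetch("SomeOtherBot", "http://example.com/news/today.html")
```

A polite crawler checks `can_fetch` before every request and caches the parsed rules per host.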
Slide 15: Mercator
- A scheme for building a distributed web crawler
- Expands a "URL frontier"
- Avoids re-crawling the same URLs
- Also considers whether a document has been seen before
  - Every document has signature/checksum info computed as it's crawled
Slide 16: Mercator Web Crawler
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it's new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
Slide 17: Mercator's Polite Frontier Queues
- Tries to go beyond the breadth-first approach – want to have only one crawler thread per server
- Distributed URL frontier queue:
  - One subqueue per worker thread
  - The worker thread is determined by hashing the hostname of the URL
  - Thus, only one outstanding request per web server
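The hostname-hashing rule above is a one-liner. This sketch assumes a fixed worker count and uses MD5 purely as a stable hash (Mercator's actual hash function and queue management are more involved).

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 8  # assumed worker-thread count for the example

def worker_for(url):
    """Mercator-style politeness: hash the hostname, not the full URL,
    so every URL from a given server lands in the same worker's
    subqueue -- hence at most one outstanding request per server."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode()).digest()
    return digest[0] % NUM_WORKERS
```

Hashing the full URL instead would spread one site's pages across all workers and defeat the per-server politeness guarantee.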
Slide 18: Mercator's HTTP Fetcher
- First, needs to ensure robots.txt is followed
  - Caches the contents of robots.txt for various web sites as it crawls them
- Designed to be extensible to other protocols
- Had to write their own HTTP requestor in Java – their Java version didn't have timeouts
  - Today, can use setSoTimeout()
  - Can also use Java non-blocking I/O if you wish:
    http://www.owlmountain.com/tutorials/NonBlockingIo.htm
  - But they use multiple threads and synchronous I/O
19
Other Caveats Infinitely long URL names (good way to
get a buffer overflow!) Aliased host names Alternative paths to the same host Can catch most of these with signatures of
document data (e.g., MD5) Crawler traps (e.g., CGI scripts that link to
themselves using a different name) May need to have a way for human to override
certain URL paths – see Section 5 of paper
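The signature check mentioned above is straightforward: fingerprint the document body, and treat a repeated fingerprint as an alias even if the URL is new. The function names and the module-level seen-set are illustrative only.

```python
import hashlib

seen_signatures = set()

def content_signature(body: bytes) -> str:
    """MD5 fingerprint of the document body; two URLs that alias the
    same content yield the same signature even if the URLs differ."""
    return hashlib.md5(body).hexdigest()

def is_duplicate(body: bytes) -> bool:
    """True if this exact content was already crawled."""
    sig = content_signature(body)
    if sig in seen_signatures:
        return True
    seen_signatures.add(sig)
    return False
```

This catches exact duplicates (aliased hosts, alternative paths); near-duplicates (e.g., pages differing only in a timestamp) need fuzzier fingerprints and are out of scope here.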
Slide 20: Mercator Document Statistics

Page type    Percent
text/html      69.2%
image/gif      17.9%
image/jpeg      8.1%
text/plain      1.5%
pdf             0.9%
audio           0.4%
zip             0.4%
postscript      0.3%
other           1.4%

[Figure: histogram of document sizes (60M pages)]
Slide 21: Further Considerations
- May want to prioritize certain pages as being most worth crawling
  - Focused crawling tries to prioritize based on relevance
- May need to refresh certain pages more often
Slide 22: Web Search Summarized
- Two important factors:
  - An indexing and ranking scheme that allows the most relevant documents to be prioritized highest
  - A crawler that manages to (1) be well-mannered, (2) avoid traps, and (3) scale
- We'll be using Pastry to distribute the work of crawling and to distribute the data (what Google calls "barrels")
Slide 23: Synchronization
The issue:
- What do we do when decisions need to be made in a distributed system?
  - e.g., Who decides what action should be taken? Whose conflicting option wins? Who gets killed in a deadlock? Etc.
Some options:
- Central decision point
- Distributed decision points with locking/scheduling ("pessimistic")
- Distributed decision points with arbitration ("optimistic")
- Transactions (to come)
Slide 24: Multi-process Decision Arbitration: Mutual Exclusion (i.e., Locking)
a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
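The coordinator's state machine in steps a–c can be sketched as below. This is a single-process model of the message protocol, with the return values standing in for the coordinator's replies (the "no reply" case is modeled as "QUEUED").

```python
from collections import deque

class Coordinator:
    """Centralized mutual exclusion sketch: one process holds the
    critical region; later requesters are queued (no reply) until the
    holder releases."""
    def __init__(self):
        self.holder = None
        self.waiting = deque()

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            return "GRANTED"          # step (a): permission granted
        self.waiting.append(pid)
        return "QUEUED"               # step (b): coordinator stays silent

    def release(self, pid):
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder            # step (c): next process gets its OK
```

The scheme costs only 3 messages per entry/exit, but the coordinator is a single point of failure, which is what motivates the distributed variants on the next slides.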
Slide 25: Distributed Locking: Still Based on a Central Arbitrator
a) Two processes want to enter the same critical region at the same moment.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
Slide 26: Eliminating the Request/Response: A Token Ring Algorithm
a) An unordered group of processes on a network.
b) A logical ring constructed in software.
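The token-ring idea can be modeled in a few lines: the token visits processes in ring order, and a process enters its critical region only while holding it. This is a simulation sketch of the schedule, not a networked implementation; the function name and signature are invented.

```python
def token_ring_schedule(ring, requests, rounds=1):
    """Simulate the token circulating `rounds` times around the logical
    ring; processes in `requests` enter the critical region when the
    token reaches them, then pass it on."""
    entered = []
    for _ in range(rounds):
        for pid in ring:              # token passes around the ring
            if pid in requests:
                entered.append(pid)   # enter critical region, release token
    return entered
```

Mutual exclusion holds trivially (only one token exists), but a process may wait up to n − 1 token hops for entry, and a lost token must be detected and regenerated.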
27
Comparison of Costs
Algorithm Messages per entry/exit
Delay before entry (in message times) Problems
Centralized 3 2 Coordinator crash
Distributed 2 ( n – 1 ) 2 ( n – 1 ) Crash of any process
Token ring 1 to 0 to n – 1 Lost token, process crash
Slide 28: Electing a Leader: The Bully Algorithm Is Won by the Highest ID
- Suppose the old leader dies…
- Process 4 needs a decision; it runs an election
- Processes 5 and 6 respond, telling 4 to stop
- Now 5 and 6 each hold an election
Slide 29: Totally-Ordered Multicasting: Arbitrates via Priority
[Figure: updating a replicated database and leaving it in an inconsistent state]
Slide 30: Another Approach: Time
- We can choose the decision that happened first
- … But how do we know what happened first?
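The standard answer to the closing question is a logical clock. As a preview, here is a minimal Lamport-clock sketch (not from the slides): the counter ticks on local events and fast-forwards past the sender's timestamp on receive, so "happened before" implies a smaller timestamp.

```python
class LamportClock:
    """Lamport logical clock: order events without synchronized
    physical clocks. If event A happened before event B, A's
    timestamp is guaranteed to be smaller."""
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Message receive: jump past the sender's timestamp."""
        self.time = max(self.time, sender_time) + 1
        return self.time
```

The converse does not hold: a smaller timestamp does not prove the event happened first, which is why totally-ordered multicast adds tie-breaking (e.g., by process ID) on top.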