Parallel Crawlers
Junghoo Cho (UCLA)
Hector Garcia-Molina (Stanford)
May 2002
Ke Gong
Crawler
Single-process crawler
Hard to scale
Heavy Network loading
Parallel crawler
Scalability
Increase number of crawling processes
Network loading dispersion
Crawl geographically adjacent pages
Network loading reduction
Crawling only through local network
Architecture of a parallel crawler
Independent
Each crawling process starts with its own set of seed URLs and follows links without consulting other crawling processes.
Dynamic Assignment
A central coordinator logically divides the Web into small partitions and dynamically assigns each partition to a crawling process for download.
Static Assignment
The Web is partitioned and assigned to each crawling process (C-proc) before the crawl starts.
Three modes for static assignment
• Firewall mode: each C-proc downloads only pages within its own partition and discards inter-partition links. S1: a->b->c
• Cross-over mode: a C-proc primarily downloads its own pages, but also follows links into other partitions. S1: a->b->c->g->h->d->e
• Exchange mode: C-procs exchange inter-partition URLs, so each page is downloaded by the process that owns it. S1: a->b->c->d->e
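A minimal sketch of the three modes on a tiny hypothetical link graph. The graph, partition, and function names below are illustrative, chosen so that process S1 reproduces the traces above; they are not from the paper:

```python
# Hypothetical link graph and two-way partition (illustrative only):
# process 1 owns pages a-e, process 2 owns pages f-h.
LINKS = {"a": ["b"], "b": ["c"], "c": ["g"], "d": ["e"], "e": [],
         "f": [], "g": ["h"], "h": ["d"]}
PARTITION = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1, "f": 2, "g": 2, "h": 2}

def parallel_crawl(seeds, mode):
    """Run one crawling process per partition; return pages each downloads."""
    frontiers = {pid: [seed] for pid, seed in seeds.items()}
    downloaded = {pid: [] for pid in seeds}
    while any(frontiers.values()):
        for pid in list(frontiers):
            while frontiers[pid]:
                url = frontiers[pid].pop(0)
                if url in downloaded[pid]:
                    continue
                if PARTITION[url] != pid:
                    if mode == "firewall":
                        continue                    # discard foreign link
                    if mode == "exchange":
                        frontiers[PARTITION[url]].append(url)  # forward to owner
                        continue
                    # cross-over: download the foreign page anyway
                downloaded[pid].append(url)
                frontiers[pid].extend(LINKS[url])
    return downloaded
```

With seeds {1: "a", 2: "f"}, process 1 retrieves only a, b, c in firewall mode (pages d and e are never covered by anyone), the exchange-mode trace a->b->c->d->e once process 2 forwards d back, and a->b->c->g->h->d->e in cross-over mode.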
URL exchange minimization
Batch communication:
Instead of transferring an inter-partition URL immediately after it is discovered, a crawling process may wait to collect a set of URLs and send them in a batch.
Replication:
If we replicate the most “popular” URLs at each crawling process and stop transferring them between processes, we may significantly reduce URL exchanges.
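Both ideas can be sketched together. In this minimal illustration (the class name and `send` callback are hypothetical, not from the paper), inter-partition URLs are buffered and flushed in batches, while URLs replicated at every process are never sent at all:

```python
class URLExchanger:
    """Sketch of batched URL exchange with popular-URL replication."""

    def __init__(self, send, batch_size=100, replicated=()):
        self.send = send                    # callback delivering a batch of URLs
        self.batch_size = batch_size
        self.replicated = set(replicated)   # popular URLs known to every process
        self.pending = []

    def discovered(self, url):
        """Queue an inter-partition URL; flush once the batch is full."""
        if url in self.replicated:
            return                          # every process already has it
        self.pending.append(url)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send any remaining queued URLs as one batch."""
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
```

Larger batches and a larger replicated set both trade a little staleness for fewer exchange messages, which is the tension the evaluation section measures.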
Partition Function
1. URL-hash based: hash the whole URL
• Pages in the same site can be assigned to different C-procs
• Locality of links not reflected
2. Site-hash based: hash only the site name of the URL
• Locality preserved
• Partitions evenly loaded
3. Hierarchical: partition by domain name, country, or other features
• Fewer inter-partition links
• Partitions not evenly loaded
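The two hash-based partition functions can be sketched as follows. This is a minimal illustration; the use of MD5 is my choice, as the paper does not prescribe a particular hash:

```python
import hashlib
from urllib.parse import urlparse

def url_hash_partition(url, n_procs):
    """URL-hash based: hash the whole URL, so pages of one site
    may be assigned to different C-procs."""
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % n_procs

def site_hash_partition(url, n_procs):
    """Site-hash based: hash only the site name, so an entire site
    stays with one C-proc and link locality is preserved."""
    site = urlparse(url).netloc
    return int(hashlib.md5(site.encode()).hexdigest(), 16) % n_procs
```

Under site hashing, two pages of the same site always map to the same process, so their (mostly intra-site) links never cross partitions.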
Parallel Crawler Models
We will evaluate how the different modes affect our crawling results.
So now, we need an evaluation model!
Evaluation Models
Overlap: (N - I) / I
N: the total number of pages downloaded
I: the number of unique pages downloaded
Coverage: I / U
U: the total number of pages the crawler has to download
I: the number of unique pages downloaded
Quality: |P_N ∩ A_N| / |P_N|
P_N: the top N important pages from an ideal crawler
A_N: the top N important pages from the actual crawler
Communication overhead: C / N
C: the total number of inter-partition URLs exchanged
N: the total number of pages downloaded
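These metrics translate directly into code. A minimal sketch (the function names are mine; note that in the paper coverage is I/U, the fraction of the U required pages that were actually downloaded):

```python
def overlap(n_total, n_unique):
    """Overlap = (N - I) / I: duplicate downloads per unique page."""
    return (n_total - n_unique) / n_unique

def coverage(n_unique, n_universe):
    """Coverage = I / U: fraction of the target pages actually downloaded."""
    return n_unique / n_universe

def quality(actual_top, ideal_top):
    """Quality = |P_N ∩ A_N| / |P_N|: agreement with an ideal crawler's
    top-N important pages."""
    return len(set(actual_top) & set(ideal_top)) / len(ideal_top)

def comm_overhead(n_exchanged, n_total):
    """Communication overhead = C / N: URL exchanges per downloaded page."""
    return n_exchanged / n_total
```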
Dataset
• Pages were collected using our Stanford WebBase crawler over a two-week period in December 1999.
• The WebBase crawler started with the 1 million URLs listed in Open Directory (http://www.dmoz.org) and followed links.
• The dataset contains 40M pages.
• Many dynamically generated pages were still downloaded by the crawler.
Firewall mode and coverage
1. When a relatively small number of crawling processes run in parallel, a firewall-mode crawler provides good coverage.
2. The firewall mode is not a good choice when the crawler wants high coverage with many crawling processes.
3. Increasing the number of seed URLs helps reach better coverage.
Crossover mode and overlap
1. With a larger number of crawling processes, we have to accept a higher overlap in order to obtain the same coverage.
2. Overlap stays at zero until the coverage becomes relatively large.
3. High coverage in cross-over mode implies high overlap.
Exchange mode and communication
1. Site-hash partitioning has significantly lower communication overhead than URL-hash partitioning.
2. The network bandwidth used for URL exchange is relatively small compared to the actual page-download bandwidth.
3. We can significantly reduce the communication overhead by replicating a relatively small number of URLs.
Quality and batch communication
• As the number of crawling processes increases, the quality of downloaded pages degrades unless the processes exchange messages often.
• The quality of the firewall-mode crawler (x = 0) is significantly worse than that of the single-process crawler (x → ∞) when the crawler downloads a relatively small fraction of the pages.
• The communication overhead does not increase linearly with the number of URL exchanges.
• One does not need a large number of URL exchanges to achieve high quality.
The crawler downloaded 500K pages.
Summary
• Firewall mode is a good choice if we run fewer than 4 crawling processes and want high coverage.
• Cross-over crawlers incur quite significant overlap.
• A crawler based on the exchange mode consumes little network bandwidth for URL exchanges (less than 1% of the total bandwidth) and can minimize other overheads by adopting the batch communication technique.
• By replicating between 10,000 and 100,000 popular URLs, we can reduce the communication overhead by roughly 40%.