Parallel Crawlers
Junghoo Cho (UCLA)
Hector Garcia-Molina (Stanford)
May 2002
Ke Gong
Crawler
Single-process crawler
Hard to scale
Heavy Network loading
Parallel crawler
Scalability
Increase number of crawling processes
Network loading dispersion
Crawl geographically adjacent pages
Network loading reduction
Crawling only through local network
Architecture of a parallel crawler
Independent
Each crawling process starts with its own set of seed URLs and follows links without consulting other crawling processes.
Dynamic Assignment
A central coordinator logically divides the Web into small partitions and dynamically assigns each partition to a crawling process for download.
Static Assignment
The Web is partitioned and assigned to each crawling process (C-proc) before the crawl starts.
Three modes for static assignment
• Firewall mode: each C-proc downloads only pages within its own partition and discards inter-partition links. S1: a->b->c
• Cross-over mode: a C-proc primarily downloads its own pages, but also follows links into other partitions. S1: a->b->c->g->h->d->e
• Exchange mode: C-procs exchange inter-partition URLs, so each page is downloaded by the process that owns it. S1: a->b->c->d->e
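A minimal sketch of the three modes on a tiny hypothetical link graph. The graph, partition, and function names below are illustrative, chosen so that process S1 reproduces the traces above; they are not from the paper:

```python
# Hypothetical link graph and two-way partition (illustrative only):
# process 1 owns pages a-e, process 2 owns pages f-h.
LINKS = {"a": ["b"], "b": ["c"], "c": ["g"], "d": ["e"], "e": [],
         "f": [], "g": ["h"], "h": ["d"]}
PARTITION = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1, "f": 2, "g": 2, "h": 2}

def parallel_crawl(seeds, mode):
    """Run one crawling process per partition; return pages each downloads."""
    frontiers = {pid: [seed] for pid, seed in seeds.items()}
    downloaded = {pid: [] for pid in seeds}
    while any(frontiers.values()):
        for pid in list(frontiers):
            while frontiers[pid]:
                url = frontiers[pid].pop(0)
                if url in downloaded[pid]:
                    continue
                if PARTITION[url] != pid:
                    if mode == "firewall":
                        continue                    # discard foreign link
                    if mode == "exchange":
                        frontiers[PARTITION[url]].append(url)  # forward to owner
                        continue
                    # cross-over: download the foreign page anyway
                downloaded[pid].append(url)
                frontiers[pid].extend(LINKS[url])
    return downloaded
```

With seeds {1: "a", 2: "f"}, process 1 retrieves only a, b, c in firewall mode (pages d and e are never covered by anyone), the exchange-mode trace a->b->c->d->e once process 2 forwards d back, and a->b->c->g->h->d->e in cross-over mode.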
URL exchange minimization
Batch communication:
Instead of transferring an inter-partition URL immediately after it is discovered, a crawling process may wait to collect a set of URLs and send them in a batch.
Replication:
If we replicate the most “popular” URLs at each crawling process and stop transferring them between processes, we may significantly reduce URL exchanges.
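Both ideas can be sketched together. In this minimal illustration (the class name and `send` callback are hypothetical, not from the paper), inter-partition URLs are buffered and flushed in batches, while URLs replicated at every process are never sent at all:

```python
class URLExchanger:
    """Sketch of batched URL exchange with popular-URL replication."""

    def __init__(self, send, batch_size=100, replicated=()):
        self.send = send                    # callback delivering a batch of URLs
        self.batch_size = batch_size
        self.replicated = set(replicated)   # popular URLs known to every process
        self.pending = []

    def discovered(self, url):
        """Queue an inter-partition URL; flush once the batch is full."""
        if url in self.replicated:
            return                          # every process already has it
        self.pending.append(url)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send any remaining queued URLs as one batch."""
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
```

Larger batches and a larger replicated set both trade a little staleness for fewer exchange messages, which is the tension the evaluation section measures.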
Partition Function
1. URL-hash based: hash the whole URL
• Pages in the same site can be assigned to different C-procs
• Locality of links not reflected
2. Site-hash based: hash only the site name of the URL
• Locality preserved
• Partitions evenly loaded
3. Hierarchical: partition by domain name, country, or other features
• Fewer inter-partition links
• Partitions not evenly loaded
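The two hash-based partition functions can be sketched as follows. This is a minimal illustration; the use of MD5 is my choice, as the paper does not prescribe a particular hash:

```python
import hashlib
from urllib.parse import urlparse

def url_hash_partition(url, n_procs):
    """URL-hash based: hash the whole URL, so pages of one site
    may be assigned to different C-procs."""
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % n_procs

def site_hash_partition(url, n_procs):
    """Site-hash based: hash only the site name, so an entire site
    stays with one C-proc and link locality is preserved."""
    site = urlparse(url).netloc
    return int(hashlib.md5(site.encode()).hexdigest(), 16) % n_procs
```

Under site hashing, two pages of the same site always map to the same process, so their (mostly intra-site) links never cross partitions.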
Parallel Crawler Models
We will evaluate how the different modes affect our crawling results.
So now, we need an evaluation model!
Evaluation Models
Overlap: (N - I) / I
N: the total number of pages downloaded
I: the number of unique pages downloaded
Coverage: I / U
U: the total number of pages the crawler has to download
I: the number of unique pages downloaded
Quality: |P_N ∩ A_N| / |P_N|
P_N: the top N important pages from an ideal crawler
A_N: the top N important pages from the actual crawler
Communication overhead: C / N
C: the total number of inter-partition URLs exchanged
N: the total number of pages downloaded
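These metrics translate directly into code. A minimal sketch (the function names are mine; note that in the paper coverage is I/U, the fraction of the U required pages that were actually downloaded):

```python
def overlap(n_total, n_unique):
    """Overlap = (N - I) / I: duplicate downloads per unique page."""
    return (n_total - n_unique) / n_unique

def coverage(n_unique, n_universe):
    """Coverage = I / U: fraction of the target pages actually downloaded."""
    return n_unique / n_universe

def quality(actual_top, ideal_top):
    """Quality = |P_N ∩ A_N| / |P_N|: agreement with an ideal crawler's
    top-N important pages."""
    return len(set(actual_top) & set(ideal_top)) / len(ideal_top)

def comm_overhead(n_exchanged, n_total):
    """Communication overhead = C / N: URL exchanges per downloaded page."""
    return n_exchanged / n_total
```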
Dataset
• Pages were collected using our Stanford WebBase crawler over a two-week period in December 1999.
• The WebBase crawler started with the 1 million URLs listed in Open Directory (http://www.dmoz.org) and followed links.
• The dataset contains 40M pages.
• Many dynamically generated pages were still downloaded by the crawler.
Firewall mode and coverage
1. When a relatively small number of crawling processes run in parallel, a firewall-mode crawler provides good coverage.
2. The firewall mode is not a good choice when the crawler wants high coverage with many crawling processes.
3. Increasing the number of seed URLs helps reach better coverage.
Crossover mode and overlap
1. With a larger number of crawling processes, we have to accept a higher overlap in order to obtain the same coverage.
2. Overlap stays at zero until the coverage becomes relatively large.
3. High coverage in cross-over mode implies high overlap.
Exchange mode and communication
1. Site-hash partitioning has significantly lower communication overhead than URL-hash partitioning.
2. The network bandwidth used for URL exchange is relatively small compared to the actual page-download bandwidth.
3. We can significantly reduce the communication overhead by replicating a relatively small number of URLs.
Quality and batch communication
• As the number of crawling processes increases, the quality of downloaded pages degrades unless the processes exchange messages often.
• The quality of the firewall-mode crawler (x = 0) is significantly worse than that of the single-process crawler (x → ∞) when the crawler downloads a relatively small fraction of the pages.
• The communication overhead does not increase linearly with the number of URL exchanges.
• One does not need a large number of URL exchanges to achieve high quality.
The crawler downloaded 500K pages.
Summary
• Firewall mode is a good choice if we run fewer than 4 crawling processes and want high coverage.
• Cross-over crawlers incur quite significant overlap.
• A crawler based on the exchange mode consumes little network bandwidth for URL exchanges (less than 1% of the total bandwidth) and can minimize other overheads by adopting the batch communication technique.
• By replicating between 10,000 and 100,000 popular URLs, we can reduce the communication overhead by roughly 40%.