CRAWLER DESIGN

YÜCEL SAYGIN

These slides are based on the book “Mining the Web” by Soumen Chakrabarti

Refer to the “Crawling the Web” chapter for more information

Challenges

The amount of information

In 1994 the World Wide Web Worm indexed 110K pages

In 1997: millions of pages

In 2004: billions of pages

In 2010: ???? of pages

Complexity of the link graph

Basics

HTTP: Hypertext Transfer Protocol

TCP: Transmission Control Protocol

IP: Internet Protocol

HTML: Hypertext Markup Language

URL: Uniform Resource Locator

<a href="http://www.cse.iitb.ac.in/">The IIT Bombay Computer Science Department</a>

protocol: http

Server host name: www.cse.iitb.ac.in

File path: /
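
The three parts labeled above can be pulled out of a URL programmatically. The following is a minimal sketch using Python's standard `urllib.parse` module; the fallback to port 80 is an assumption based on the default HTTP port.

```python
from urllib.parse import urlsplit

url = "http://www.cse.iitb.ac.in/"
parts = urlsplit(url)

protocol = parts.scheme      # the part before "://"
host = parts.hostname        # server host name, lowercased by urlsplit
path = parts.path            # file path on the server
port = parts.port or 80      # no explicit port in the URL, so fall back to the HTTP default

print(protocol, host, path, port)
```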

Basics

A click on the hyperlink is converted to a network request by the browser

The browser will then fetch and display the web page pointed to by the URL

The server host name (like www.cse.iitb.ac.in) needs to be translated into an IP address such as 144.16.111.14 to contact the server using TCP

Basics

DNS (Domain Name System) is a distributed database of name-to-IP address mappings

This database is maintained by known servers

A click on the hyperlink http://www.cse.iitb.ac.in/ is translated into

telnet www.cse.iitb.ac.in 80

80 is the default HTTP port

MIME Header

MIME: Multipurpose Internet Mail Extensions, a standard for email and web content transfer.
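
To make the request and the header concrete, here is a rough sketch of what a client might send once connected to port 80, and how the header block of a reply can be separated from the body. The sample response is hypothetical and hard-coded so the sketch runs without a network. The helper names `build_get_request` and `parse_response_head` are illustrative only.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the bytes a client would send after connecting to port 80."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",   # an empty line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw: bytes):
    """Split a raw HTTP response into status code, header dict, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, _, header_text = head.decode("iso-8859-1").partition("\r\n")
    status_code = int(status_line.split()[1])
    headers = {}
    for line in header_text.split("\r\n"):
        if line:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


# Hypothetical canned response, used here instead of a live server.
sample = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)

request = build_get_request("www.cse.iitb.ac.in")
status, headers, body = parse_response_head(sample)
```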

Crawling

There is no directory of all accessible URLs

The main strategy is to

start from a set of seed web pages

Extract URLs from those pages

Apply the same technique to the pages behind those URLs

It may not be possible to retrieve all the pages on the Web with this technique since new pages are added every day

Use a queue structure and mark visited nodes
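
The queue-and-visited-set strategy can be sketched as a breadth-first traversal. In this sketch the Web is replaced by a small in-memory link graph (`LINKS`, a made-up example), and `extract_links` stands in for fetching a page and parsing its links.

```python
from collections import deque

# Hypothetical link graph: page URL -> URLs found on that page.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


def crawl(seeds, extract_links):
    """Breadth-first crawl: returns pages in the order they were fetched."""
    queue = deque(seeds)
    visited = set(seeds)      # mark on enqueue so a URL is queued at most once
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


order = crawl(["a"], lambda url: LINKS.get(url, []))
```

Marking a URL as visited when it is enqueued, rather than when it is fetched, keeps the same URL from sitting in the queue several times.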

Crawling

Writing a basic crawler is easy

Writing a large-scale crawler is challenging

The basic steps of crawling are

URL to IP conversion using the DNS server

Socket connection to the server and sending the request

Receiving the requested page

For small pages, DNS lookup and socket connection take more time than receiving the requested page

We need to overlap the processing and waiting times for the above three steps

Crawling

Storage requirements are huge

Need to store the list of URLs and the retrieved pages on disk

Storing the URLs on disk is also needed for persistence

Pages are stored in compressed form (Google uses zlib for compression, about 3 to 1)
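
As a small illustration of storing pages in compressed form, the sketch below round-trips a page through Python's `zlib` module. The page content is made up and highly repetitive, so the ratio it prints is not representative of real pages.

```python
import zlib

# Hypothetical page; repetitive markup compresses well, real ratios vary.
page = ("<html><body>" + "<p>Crawling the Web</p>" * 200 + "</body></html>").encode("utf-8")

compressed = zlib.compress(page, 6)      # 6 is a middle-of-the-road compression level
restored = zlib.decompress(compressed)

ratio = len(page) / len(compressed)
print(f"{len(page)} bytes -> {len(compressed)} bytes (ratio {ratio:.1f} to 1)")
```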

Large Scale Crawler Tips

Fetch hundreds of pages at the same time to increase bandwidth utilization

Use more than one DNS server for concurrent DNS lookups

Using asynchronous sockets is better than multi-threading

Eliminate duplicates to reduce the number of redundant fetches and to avoid spider traps (infinite sets of fake URLs)

DNS Caching

Address mapping is a significant bottleneck

A crawler can generate more requests per unit time than a DNS server can handle

Caching the DNS entries helps

The DNS cache needs to be refreshed periodically (whenever it is idle)
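
One way to picture DNS caching is a dictionary of answers that expire after a fixed time. The sketch below is only an illustration: the `DnsCache` class, the `ttl` value, and the stand-in resolver are all assumptions. A real crawler might pass `socket.gethostbyname` as the resolver.

```python
import time


class DnsCache:
    """Cache name-to-IP answers and expire them after `ttl` seconds."""

    def __init__(self, resolve, ttl=3600.0, clock=time.monotonic):
        self._resolve = resolve      # callable: host name -> IP string
        self._ttl = ttl
        self._clock = clock
        self._entries = {}           # host -> (ip, time stored)

    def lookup(self, host):
        entry = self._entries.get(host)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]          # fresh cache hit, no DNS traffic
        ip = self._resolve(host)
        self._entries[host] = (ip, now)
        return ip


# Stand-in resolver so the sketch runs offline.
calls = []


def fake_resolve(host):
    calls.append(host)
    return "144.16.111.14"


fake_now = [0.0]                     # controllable clock for the demonstration
cache = DnsCache(fake_resolve, ttl=60.0, clock=lambda: fake_now[0])
first = cache.lookup("www.cse.iitb.ac.in")
second = cache.lookup("www.cse.iitb.ac.in")   # served from the cache
fake_now[0] = 61.0
third = cache.lookup("www.cse.iitb.ac.in")    # entry expired, resolver called again
```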

Concurrent page requests

Can be achieved by

Multithreading

Non-blocking sockets with event handlers

Multithreading

A set of threads is created

After the server name is translated to an IP address, a thread

creates a client socket

Connects to the HTTP service on the server

Sends the HTTP request header

Reads the socket until EOF

Closes the socket

Blocking system calls are used to suspend the thread until the requested data is available

Multithreading

A fixed number of worker threads share a work-queue of pages to fetch

Handling concurrent access to data structures is a problem. Mutual exclusion needs to be handled properly

Disk access cannot be orchestrated when multiple concurrent threads are used

Non-blocking sockets could be a better approach!
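
A minimal sketch of the worker-thread arrangement follows, assuming a hypothetical `fake_fetch` in place of the blocking connect/send/read sequence. The shared results dictionary is guarded by a lock, which is the mutual exclusion the slide refers to.

```python
import queue
import threading


def fake_fetch(url):
    # Stand-in for the blocking connect/send/read sequence.
    return f"<html>{url}</html>"


def worker(work_queue, results, results_lock):
    while True:
        url = work_queue.get()
        if url is None:              # sentinel: no more work for this thread
            work_queue.task_done()
            return
        page = fake_fetch(url)
        with results_lock:           # mutual exclusion around shared state
            results[url] = page
        work_queue.task_done()


NUM_WORKERS = 4
work_queue = queue.Queue()
results = {}
results_lock = threading.Lock()

threads = [
    threading.Thread(target=worker, args=(work_queue, results, results_lock))
    for _ in range(NUM_WORKERS)
]
for t in threads:
    t.start()

urls = [f"http://example.com/page{i}" for i in range(20)]
for url in urls:
    work_queue.put(url)
for _ in threads:
    work_queue.put(None)             # one sentinel per worker

work_queue.join()
for t in threads:
    t.join()
```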

Non-blocking sockets

Connect, send, and receive calls return immediately without blocking for network data

The status of the network can be polled later on

The “select” system call lets the application wait for data to be available on the socket

This way completion of page fetching is serialized

No need for locks or semaphores

Can append the pages to the file on disk without being interrupted
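
The single-threaded, select-driven style can be sketched with Python's `selectors` module, which wraps `select` and its relatives. To keep the sketch runnable offline, each "server" is one end of a local `socket.socketpair()` that writes a made-up page and closes. A real crawler would register sockets connected to remote HTTP servers instead.

```python
import selectors
import socket

# Hypothetical pages; each socketpair stands in for one server connection.
PAGES = {
    "page-a": b"<html>A</html>",
    "page-b": b"<html>B</html>",
    "page-c": b"<html>C</html>",
}

selector = selectors.DefaultSelector()
buffers = {}
for name, content in PAGES.items():
    client, server = socket.socketpair()
    server.sendall(content)      # the "server" writes its page ...
    server.close()               # ... and closes, so the client will see EOF
    client.setblocking(False)    # recv on this socket never blocks
    selector.register(client, selectors.EVENT_READ, data=name)
    buffers[name] = bytearray()

completed = []                   # stands in for the single output file on disk
while selector.get_map():        # loop until every socket has been unregistered
    for key, _events in selector.select():
        chunk = key.fileobj.recv(4096)
        if chunk:
            buffers[key.data] += chunk
        else:                    # empty read means EOF: this fetch is complete
            selector.unregister(key.fileobj)
            key.fileobj.close()
            completed.append((key.data, bytes(buffers[key.data])))
selector.close()
```

Because only one thread runs the loop, appending to `completed` needs no lock. Pages are written one at a time, in the order their fetches finish.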

Link Extraction and Normalization

An HTML page is searched for links to add to the work-pool

URLs extracted from pages need to be preprocessed before they are added to the work-pool

Duplicate elimination is necessary but difficult

Since the mapping between hostnames and IP addresses is many-to-many, i.e., a computer may have many IP addresses and many hostnames

Extracted URLs are converted to canonical form by

Using the canonical hostname provided by the DNS response

Adding an explicit port number

Converting relative addresses to absolute addresses
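
The normalization steps can be sketched with `urllib.parse`. The `canonical_host` parameter is a hypothetical hook for the canonical name from a DNS response, which this sketch does not look up. Dropping the fragment is an extra assumption that is not in the list above.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(base_url, link, canonical_host=None):
    """Resolve `link` against `base_url` and return a normalized absolute URL.

    `canonical_host` is a hypothetical hook for the canonical name taken
    from a DNS response; when omitted, the lowercased host is used as is.
    """
    absolute = urljoin(base_url, link)           # relative -> absolute
    parts = urlsplit(absolute)
    host = canonical_host or (parts.hostname or "")
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)   # explicit port
    path = parts.path or "/"
    # The fragment is dropped: it names a position inside the same page.
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))


a = canonicalize("http://www.cse.iitb.ac.in/research/index.html", "../people/")
b = canonicalize("http://WWW.CSE.IITB.AC.IN", "/people/#faculty")
```

Both calls end up at the same canonical string, so a later duplicate check sees them as one URL.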

Some more tips

A server may disallow crawling using “robots.txt” found in the HTTP root directory

robots.txt specifies a list of path prefixes that crawlers should not try to fetch
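
Python's standard library includes a robots.txt parser, which makes the prefix check easy to try out. The file content and the crawler name `MyCrawler` below are made up for the sketch. A crawler would fetch the real file from the server root before requesting any other path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("MyCrawler", "http://www.cse.iitb.ac.in/index.html")
blocked = parser.can_fetch("MyCrawler", "http://www.cse.iitb.ac.in/private/notes.html")
```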

Eliminating already visited URLs

The IsUrlVisited module in the architecture does that job

The same page could be linked from many different sites

Checking if the page is already visited eliminates redundant page requests

Comparing the URL strings may take a long time since it involves disk access and checking against all the stored URLs

Eliminating already visited URLs

Duplicate checking is done by applying a hash function

MD5, originally designed for digital signature applications

The MD5 algorithm takes a message of arbitrary length as input and produces a 128-bit "fingerprint" or "message digest" as output

“it is computationally infeasible to produce two messages having the same message digest”

http://www.w3.org/TR/1998/REC-DSig-label/MD5-1_0

Even the hashed URLs need to be stored on disk due to storage and persistence requirements

Spatial and temporal locality of URL access means fewer disk accesses when URL hashes are cached
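
A minimal in-memory sketch of the visited check follows, storing 128-bit MD5 digests instead of the URL strings. The function names are illustrative. The on-disk storage and caching mentioned above are left out.

```python
import hashlib


def url_fingerprint(url: str) -> bytes:
    """Return the 16-byte (128-bit) MD5 digest of a URL string."""
    return hashlib.md5(url.encode("utf-8")).digest()


visited = set()


def is_url_visited(url: str) -> bool:
    """Report whether `url` was seen before, recording it if not."""
    fp = url_fingerprint(url)
    if fp in visited:
        return True
    visited.add(fp)
    return False


first = is_url_visited("http://www.cse.iitb.ac.in/")
second = is_url_visited("http://www.cse.iitb.ac.in/")
```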

Eliminating already visited URLs

We need to utilize spatial locality as much as possible

But MD5 will distribute similar URL strings uniformly over its range

A two-block or two-level hash function is used

Use different hash functions for the host address and the path

A B-tree could be used to index the host name, so that a retrieved page of the index contains URLs from the same host
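
The two-level idea can be sketched by concatenating a host hash with a path hash, so that keys for the same host share a prefix and sit next to each other in any ordered index. The choice of SHA-1 and MD5 and the 8-byte widths are assumptions made for illustration. A sorted list stands in for the B-tree.

```python
import hashlib
from urllib.parse import urlsplit


def two_level_key(url: str) -> bytes:
    """Concatenate a host hash (prefix) with a path hash (suffix)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").encode("utf-8")
    path = (parts.path or "/").encode("utf-8")
    # Assumption: 8 bytes per part, each from a different hash function.
    host_hash = hashlib.sha1(host).digest()[:8]
    path_hash = hashlib.md5(path).digest()[:8]
    return host_hash + path_hash


urls = [
    "http://www.cse.iitb.ac.in/people/",
    "http://www.w3.org/TR/",
    "http://www.cse.iitb.ac.in/research/",
    "http://www.w3.org/Protocols/",
]

# A sorted list stands in for the ordered keys of a B-tree.
ordered = sorted(urls, key=two_level_key)
hosts_in_order = [urlsplit(u).hostname for u in ordered]
```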

Spider Traps

Malicious pages designed to crash the crawlers

Simply add 64K of null characters in the middle of a URL to crash the lexical analyzer

Infinitely deep web sites

Using dynamically generated links via CGI scripts

Need to check the link length

No technique is foolproof

Generate periodic statistics for the crawler to eliminate dominating sites

Disable crawling active content
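
Two of the guards above, the link-length check and per-site statistics, can be sketched as a filter applied before a URL enters the work-pool. Both limits below are assumed values chosen for the demonstration, and the host name `trap.example` is made up.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256      # assumed limit, not from the slides
MAX_PAGES_PER_HOST = 3    # assumed limit, kept tiny for the demonstration

pages_per_host = Counter()


def accept_url(url: str) -> bool:
    """Reject URLs that look like trap output; count accepted ones per host."""
    if len(url) > MAX_URL_LENGTH or "\x00" in url:
        return False
    host = urlsplit(url).hostname or ""
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False      # this site already dominates the crawl
    pages_per_host[host] += 1
    return True


deep = "http://trap.example/" + "a/" * 500
deep_accepted = accept_url(deep)
results = [accept_url(f"http://trap.example/cal?day={i}") for i in range(5)]
```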

Avoiding duplicate pages

A page can be accessed via different URLs

Eliminating duplicate pages will also help eliminate spider traps

MD5 can be used for that purpose

Minor changes cannot be handled with MD5

Can divide the page into blocks
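
To illustrate why blocks help, the sketch below hashes fixed-size blocks of two pages that differ in a single character. The whole-page digests disagree, while most block digests match. The block size and the overlap measure are assumptions, and the pages are made up.

```python
import hashlib


def block_digests(text: str, block_size: int = 64) -> set:
    """MD5 digest of each fixed-size block of the page text."""
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    return {hashlib.md5(b.encode("utf-8")).digest() for b in blocks}


def block_overlap(page_a: str, page_b: str) -> float:
    """Fraction of distinct blocks the two pages share (Jaccard overlap)."""
    a, b = block_digests(page_a), block_digests(page_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: same body, one character differs in the last block.
body = "".join(f"paragraph {i:04d} of the page. " for i in range(40))
page_1 = body + "visitor count: 1"
page_2 = body + "visitor count: 2"

whole_page_equal = (
    hashlib.md5(page_1.encode("utf-8")).digest()
    == hashlib.md5(page_2.encode("utf-8")).digest()
)
overlap = block_overlap(page_1, page_2)
```

Fixed-size blocks are a simplification. An insertion near the top of a page shifts every later block, so a practical scheme would likely split on content boundaries such as paragraphs.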

Denial of Service

HTTP servers protect themselves against denial of service (DoS) attacks

DoS attacks send frequent requests to the same server to slow down its operation

Therefore frequent requests from the same IP are prohibited

Crawlers need to consider such cases, both as a courtesy and to avoid legal action

Need to limit the active requests to a given server IP address at any time

Maintain a queue of requests for each server

This will also reduce the effect of spider traps
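
A per-server queue with a cap on active requests might look like the following sketch. The class name `PoliteScheduler`, its methods, and the second IP address are hypothetical.

```python
from collections import defaultdict, deque


class PoliteScheduler:
    """Per-server FIFO queues with a cap on in-flight requests per server."""

    def __init__(self, max_active_per_server=1):
        self.max_active = max_active_per_server
        self.queues = defaultdict(deque)     # server -> pending URLs
        self.active = defaultdict(int)       # server -> requests in flight

    def add(self, server, url):
        self.queues[server].append(url)

    def next_request(self):
        """Return (server, url) for a server below its cap, or None."""
        for server, pending in self.queues.items():
            if pending and self.active[server] < self.max_active:
                self.active[server] += 1
                return server, pending.popleft()
        return None

    def done(self, server):
        self.active[server] -= 1


sched = PoliteScheduler(max_active_per_server=1)
sched.add("144.16.111.14", "/a")
sched.add("144.16.111.14", "/b")
sched.add("10.0.0.7", "/x")

r1 = sched.next_request()   # first server, "/a"
r2 = sched.next_request()   # first server is busy, so the second server is served
r3 = sched.next_request()   # both servers busy: nothing to hand out
sched.done("144.16.111.14")
r4 = sched.next_request()   # first server free again: "/b"
```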

Text Repository

The pages that are fetched are dumped into a text repository

The text repository is significantly large

Needs to be compressed (Google uses zlib for 3-to-1 compression)

Google implements its own file system

Berkeley DB (www.sleepycat.com) can also be used

Stores a database within a single file

Provides several access methods such as B-tree or sequential

Refreshing Crawled Pages

The HTTP protocol could be used to check if a page has changed since the last time it was crawled

But using HTTP to check whether a page is modified takes a lot of time

If a page expires after a certain time, this could be extracted from the HTTP header

If we had a score that reflects the probability of change since the last visit

We can sort the pages with respect to that score and crawl them in that order

Use the past behavior to model the future!
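
One simple way to turn past behavior into a score is the smoothed fraction of earlier revisits on which the page was found changed. This estimator is an assumption chosen for illustration, not a method given in the slides, and the revisit logs are made up.

```python
def change_score(history):
    """Estimate the chance a page has changed, from past visit outcomes.

    `history` is a list of booleans, one per past revisit: True if the
    page had changed since the visit before. Add-one smoothing keeps
    pages with little history away from the extremes 0 and 1.
    """
    changed = sum(history)
    return (changed + 1) / (len(history) + 2)


# Hypothetical revisit logs.
histories = {
    "http://news.example/": [True, True, True, False, True],
    "http://static.example/about.html": [False, False, False, False, False],
    "http://blog.example/": [True, False, False, True, False],
}

# Pages most likely to have changed come first in the refresh order.
refresh_order = sorted(
    histories, key=lambda url: change_score(histories[url]), reverse=True
)
```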

Your crawler

Use the w3c-libwww API to implement your crawler

Start from a very simple implementation and go on from that!

Sample code and algorithms are provided in the handouts
