CRAWLER DESIGN
YÜCEL SAYGIN
These slides are based on the book “Mining the Web” by Soumen Chakrabarti.
Refer to the “Crawling the Web” chapter for more information.
Challenges
The amount of information
  In 1994, the World Wide Web Worm indexed 110K pages
  In 1997: millions of pages
  In 2004: billions of pages
  In 2010: ???? pages
Complexity of the link graph
Basics
HTTP: Hypertext Transfer Protocol
TCP: Transmission Control Protocol
IP: Internet Protocol
HTML: Hypertext Markup Language
URL: Uniform Resource Locator
<a href="http://www.cse.iitb.ac.in/">The IIT Bombay Computer Science Department</a>
In this URL, http is the protocol, www.cse.iitb.ac.in is the server host name, and / is the file path.
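The three URL parts named above can be pulled apart with Python's standard library. This is a minimal sketch; the helper name url_parts is an assumption made for illustration.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the three parts named on the slide (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,    # e.g. "http"
        "host": parts.hostname,      # server host name
        "path": parts.path or "/",   # file path; an empty path means the root
    }

parts = url_parts("http://www.cse.iitb.ac.in/")
print(parts)
```

The host part is what later gets translated to an IP address; the path is what the request asks the server for.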
Basics
A click on the hyperlink is converted to a network request by the browser.
The browser then fetches and displays the web page pointed to by the URL.
The server host name (like www.cse.iitb.ac.in) needs to be translated into an IP address such as 144.16.111.14 to contact the server using TCP.
Basics
DNS (Domain Name System) is a distributed database of name-to-IP address mappings.
This database is maintained by known servers.
A click on the hyperlink is translated into a connection equivalent to
  telnet www.cse.iitb.ac.in 80
where 80 is the default HTTP port.
MIME Header
MIME: Multipurpose Internet Mail Extensions, a standard for email and web content transfer.
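To make the request and the MIME header concrete, here is a hedged sketch that builds the text a client would type into the telnet session and parses the header of a reply. The function names and the canned response are assumptions for illustration; no network access is performed.

```python
import email

def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request (illustrative helper)."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")

def parse_response_head(raw):
    """Split a raw HTTP response into status line, MIME-style headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, _, header_text = head.decode("iso-8859-1").partition("\r\n")
    headers = email.message_from_string(header_text)  # HTTP headers use MIME syntax
    return status_line, headers, body

# A made-up response, standing in for what a server might send back.
canned = (b"HTTP/1.0 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"<html></html>")

status, headers, body = parse_response_head(canned)
print(build_get_request("www.cse.iitb.ac.in"))
print(status, headers["Content-Type"], len(body))
```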
Crawling
There is no directory of all accessible URLs.
The main strategy is to
  Start from a set of seed web pages
  Extract URLs from those pages
  Apply the same technique to the pages at those URLs
It may not be possible to retrieve all the pages on the Web with this technique, since new pages are added every day.
Use a queue structure and mark visited nodes.
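The seed-extract-repeat strategy with a queue and a visited set can be sketched as a breadth-first traversal. The link graph TOY_WEB and the get_links callback are assumptions that stand in for real page fetching and link extraction.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl: a queue of URLs to fetch plus a set of visited URLs."""
    queue = deque(seeds)
    visited = set(seeds)          # mark seeds as visited when they are enqueued
    order = []                    # URLs in the order they were "fetched"
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

# Toy link graph used instead of the real Web (hypothetical data).
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Marking a URL as visited at enqueue time, rather than at fetch time, keeps the same URL from entering the queue twice.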
Crawling
Writing a basic crawler is easy.
Writing a large-scale crawler is challenging.
The basic steps of crawling are
  URL to IP conversion using the DNS server
  Socket connection to the server and sending the request
  Receiving the requested page
For small pages, the DNS lookup and socket connection take more time than receiving the requested page.
We need to overlap the processing and waiting times of the above three steps.
Crawling
Storage requirements are huge.
Need to store the list of URLs and the retrieved pages on disk.
Storing the URLs on disk is also needed for persistence.
Pages are stored in compressed form (Google uses zlib for compression, about 3 to 1).
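A small sketch of compressing a page with zlib before storing it. The sample page is synthetic and highly repetitive, so its ratio will be far better than the 3 to 1 quoted for real pages; the helper names are assumptions.

```python
import zlib

def compress_page(html):
    """Compress page text with zlib before writing it to the repository."""
    return zlib.compress(html.encode("utf-8"))

def decompress_page(blob):
    """Recover the original page text from its compressed form."""
    return zlib.decompress(blob).decode("utf-8")

# Synthetic page (an assumption for the demo, not real crawl data).
page = "<html><body>" + "<p>crawler design</p>" * 200 + "</body></html>"
blob = compress_page(page)
ratio = len(page.encode("utf-8")) / len(blob)
print(f"{len(page)} bytes -> {len(blob)} bytes, ratio {ratio:.1f} to 1")
```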
Large Scale Crawler Tips
Fetch hundreds of pages at the same time to increase bandwidth utilization.
Use more than one DNS server for concurrent DNS lookups.
Using asynchronous sockets is better than multithreading.
Eliminate duplicates to reduce the number of redundant fetches and to avoid spider traps (infinite sets of fake URLs).
DNS Caching
Address mapping is a significant bottleneck.
A crawler can generate more requests per unit time than a DNS server can handle.
Caching the DNS entries helps.
The DNS cache needs to be refreshed periodically (whenever it is idle).
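One way to picture DNS caching is a dictionary of host-to-address entries that expire after a time-to-live. This is only a sketch: the class name DnsCache, the one-hour default, and the injectable resolver and clock are assumptions, and expiry on lookup is a simpler policy than refreshing during idle time.

```python
import socket
import time

class DnsCache:
    """Cache name-to-IP mappings and re-resolve entries older than ttl_seconds."""

    def __init__(self, resolver=socket.gethostbyname, ttl_seconds=3600.0,
                 clock=time.monotonic):
        self._resolver = resolver   # function: host name -> IP address string
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}          # host -> (address, time stored)

    def lookup(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]                  # cache hit: no DNS traffic
        address = self._resolver(host)       # cache miss or stale entry
        self._entries[host] = (address, now)
        return address
```

Usage with the real resolver would be DnsCache().lookup("www.cse.iitb.ac.in"); a fake resolver can be passed in for testing without network access.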
Concurrent page requests
Can be achieved by
  Multithreading
  Non-blocking sockets with event handlers
Multithreading
  A set of threads is created
  After the server name is translated to an IP address, a thread
    Creates a client socket
    Connects to the HTTP service on the server
    Sends the HTTP request header
    Reads the socket until EOF
    Closes the socket
  Blocking system calls are used to suspend the thread until the requested data is available
Multithreading
A fixed number of worker threads share a work queue of pages to fetch.
Handling concurrent access to data structures is a problem; mutual exclusion needs to be handled properly.
Disk access cannot be orchestrated when multiple concurrent threads are used.
Non-blocking sockets could be a better approach!
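The worker-thread design can be sketched with a shared work queue and a lock around the shared result table. The fetch argument is a placeholder for the real blocking page fetch, and the names here are assumptions for illustration.

```python
import queue
import threading

def fetch_all(urls, fetch, num_workers=4):
    """Fetch every URL using a fixed pool of worker threads sharing one work queue."""
    work = queue.Queue()
    for url in urls:
        work.put(url)

    results = {}
    lock = threading.Lock()       # mutual exclusion for the shared results dict

    def worker():
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return            # no work left: the thread exits
            page = fetch(url)     # in a real crawler this call blocks on the network
            with lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

# Stand-in fetch function (hypothetical); no network access is made.
pages = fetch_all(["u1", "u2", "u3"], lambda url: "page:" + url)
print(pages)
```

Every structure touched by more than one thread needs this kind of protection, which is the bookkeeping the slide warns about.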
Non-blocking sockets
Connect, send, and receive calls return immediately without blocking for network data.
The status of the network can be polled later on.
The “select” system call lets the application wait for data to become available on a socket.
This way, completion of page fetching is serialized
  No need for locks or semaphores
  Can append the pages to the file on disk without being interrupted
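A hedged sketch of the select-based approach using Python's selectors module. To stay self-contained it uses local socket pairs in place of real server connections; that substitution and the function name are assumptions. All completions are handled by one thread, so the completed list needs no lock.

```python
import selectors
import socket

def fetch_all(payloads):
    """Read several sockets concurrently in a single thread with a select loop."""
    sel = selectors.DefaultSelector()
    buffers = {}
    for name, body in payloads.items():
        # socketpair() stands in for a connection to a web server.
        server_end, client_end = socket.socketpair()
        server_end.sendall(body)      # the "server" sends its page and closes
        server_end.close()
        client_end.setblocking(False)
        sel.register(client_end, selectors.EVENT_READ, data=name)
        buffers[name] = bytearray()

    completed = []                    # appended to by one thread only
    while sel.get_map():              # loop until every socket is unregistered
        for key, _events in sel.select():
            chunk = key.fileobj.recv(4096)
            if chunk:
                buffers[key.data] += chunk
            else:                     # EOF: this page is complete
                sel.unregister(key.fileobj)
                key.fileobj.close()
                completed.append((key.data, bytes(buffers[key.data])))
    sel.close()
    return completed

pages = fetch_all({"a": b"<html>a</html>", "b": b"<html>b</html>"})
print(pages)
```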
Link Extraction and Normalization
An HTML page is searched for links to add to the work pool.
URLs extracted from pages need to be preprocessed before they are added to the work pool.
Duplicate elimination is necessary but difficult
  Since the mapping between hostnames and IP addresses is many-to-many, i.e., a computer may have many IP addresses and many hostnames
Extracted URLs are converted to canonical form by
  Using the canonical hostname provided by the DNS response
  Adding an explicit port number
  Converting relative addresses to absolute addresses
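Link extraction and normalization might look like the following sketch. It resolves relative links and adds an explicit port; lowercasing the host is only a stand-in for the canonical hostname that a DNS response would provide. The class and function names and the default-port table are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def canonicalize(base_url, href):
    """Make href absolute, lowercase the host, add the port, drop the fragment."""
    parts = urlsplit(urljoin(base_url, href))
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))

def extract_links(base_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    return [canonicalize(base_url, href) for href in parser.links]

html = '<a href="../c.html#top">x</a> <a href="HTTP://WWW.CSE.IITB.AC.IN">y</a>'
links = extract_links("http://www.cse.iitb.ac.in/a/b.html", html)
print(links)
```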
Some more tips
A server may disallow crawling using “robots.txt”, found in the HTTP root directory.
Robots.txt specifies a list of path prefixes that crawlers should not try to fetch.
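Python's standard library includes a robots.txt parser, which gives a quick way to see the path-prefix rule in action. The robots.txt content, the crawler name "mycrawler", and the example.com URLs are made up for this sketch; the rules are parsed from a string rather than fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might appear at the server's root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())   # parse from memory instead of the network

allowed = robots.can_fetch("mycrawler", "http://example.com/public/index.html")
blocked = robots.can_fetch("mycrawler", "http://example.com/private/notes.html")
print(allowed, blocked)
```

A crawler would run this check for each URL before adding it to the work pool.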
Eliminating already visited URLs
The IsUrlVisited module in the architecture does that job.
The same page could be linked from many different sites.
Checking whether the page has already been visited eliminates redundant page requests.
Comparing URL strings may take a long time, since it involves disk access and checking against all the stored URLs.
Eliminating already visited URLs
Duplicate checking is done by applying a hash function.
MD5 was originally designed for digital signature applications.
The MD5 algorithm takes a message of arbitrary length as input and produces a 128-bit "fingerprint" or "message digest" as output.
“it is computationally infeasible to produce two messages having the same message digest”
http://www.w3.org/TR/1998/REC-DSig-label/MD5-1_0
Even the hashed URLs need to be stored on disk due to storage and persistence requirements.
Spatial and temporal locality of URL access means fewer disk accesses when URL hashes are cached.
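A minimal sketch of the visited-URL check using 128-bit MD5 fingerprints. It keeps the digests in an in-memory set, whereas the slides describe disk storage with a cache on top; the class name VisitedUrls is an assumption.

```python
import hashlib

class VisitedUrls:
    """Remember URLs by their 16-byte MD5 fingerprint instead of the full string."""

    def __init__(self):
        self._digests = set()     # in-memory stand-in for the on-disk store

    @staticmethod
    def fingerprint(url):
        return hashlib.md5(url.encode("utf-8")).digest()   # 128 bits = 16 bytes

    def add_if_new(self, url):
        """Return True if the URL was not seen before, and record it."""
        digest = self.fingerprint(url)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

visited = VisitedUrls()
first = visited.add_if_new("http://www.cse.iitb.ac.in/")
second = visited.add_if_new("http://www.cse.iitb.ac.in/")
print(first, second)
```

Fixed-size fingerprints make each comparison cheap regardless of how long the URL is.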
Eliminating already visited URLs
We need to utilize spatial locality as much as possible.
But MD5 scatters similar URL strings uniformly over its output range, which destroys that locality.
A two-block or two-level hash function is used
  Use different hash functions for the host address and the path
  A B-tree could be used to index on the host name, so that a retrieved page of the index contains the URLs of the same host
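The two-level idea can be sketched by hashing the host and the path separately and concatenating the results, so that all URLs on one host share a key prefix and sort next to each other in an ordered index. The 3-byte and 5-byte widths, the use of truncated MD5 for both parts, and the function name are assumptions for this illustration.

```python
import hashlib
from urllib.parse import urlsplit

def two_level_key(url, host_bytes=3, path_bytes=5):
    """Build a key whose prefix depends only on the host and whose suffix on the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_hash = hashlib.md5(host.encode("utf-8")).digest()[:host_bytes]
    path_hash = hashlib.md5(path.encode("utf-8")).digest()[:path_bytes]
    return host_hash + path_hash

key_a = two_level_key("http://www.cse.iitb.ac.in/index.html")
key_b = two_level_key("http://www.cse.iitb.ac.in/people/faculty.html")
key_c = two_level_key("http://www.w3.org/TR/")
print(key_a.hex(), key_b.hex(), key_c.hex())
```

Because links on a page mostly point to the same host, consecutive lookups tend to land in the same region of the index.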
Spider Traps
Malicious pages designed to crash crawlers
  Simply add 64K of null characters in the middle of a URL to crash the lexical analyzer
Infinitely deep web sites
  Using dynamically generated links via CGI scripts
  Need to check the link length
No technique is foolproof
  Generate periodic statistics for the crawler to eliminate dominating sites
  Disable crawling of active content
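A simple guard against the traps listed above checks a URL before it enters the work pool. The length check comes from the slide; the null-character check, the path-depth check, and both thresholds are assumptions chosen for this sketch.

```python
from urllib.parse import urlsplit

def is_suspicious_url(url, max_length=2048, max_depth=20):
    """Heuristic filter: reject URLs that look like spider-trap output."""
    if "\x00" in url:              # null characters meant to break the lexer
        return True
    if len(url) > max_length:      # over-long links, e.g. from generated paths
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return len(segments) > max_depth   # "infinitely deep" generated sites

ok = is_suspicious_url("http://www.cse.iitb.ac.in/people/faculty.html")
deep = is_suspicious_url("http://example.com/" + "a/" * 50)
nulls = is_suspicious_url("http://example.com/x" + "\x00" * 65536 + "y")
print(ok, deep, nulls)
```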
Avoiding duplicate pages
A page can be accessed via different URLs.
Eliminating duplicate pages will also help eliminate spider traps.
MD5 of the page content can be used for that purpose.
Minor changes cannot be handled with MD5, since any change produces a different digest
  Can divide the page into blocks and compare them block by block
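The difference between a whole-page digest and per-block digests can be seen in a short sketch. Treating each line as a block and measuring overlap as shared blocks over all distinct blocks are assumptions made for this example.

```python
import hashlib

def page_digest(text):
    """One MD5 digest for the whole page: any change gives a different value."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def block_digests(text):
    """One MD5 digest per block; here a block is a line (an assumption)."""
    return {hashlib.md5(line.encode("utf-8")).hexdigest()
            for line in text.splitlines()}

def block_overlap(page_a, page_b):
    """Fraction of distinct blocks that the two pages have in common."""
    a, b = block_digests(page_a), block_digests(page_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original = "\n".join(f"paragraph {i}" for i in range(10))
edited = original.replace("paragraph 3", "paragraph three")

same_digest = page_digest(original) == page_digest(edited)
overlap = block_overlap(original, edited)
print(same_digest, round(overlap, 2))
```

The whole-page digests differ after a one-line edit, while the block overlap stays high, which marks the two pages as near-duplicates.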
Denial of Service
HTTP servers protect themselves against denial of service (DoS) attacks.
DoS attacks send frequent requests to the same server to slow down its operation.
Therefore frequent requests from the same IP are prohibited.
Crawlers need to consider such cases for courtesy and to avoid legal action
  Need to limit the active requests to a given server IP address at any time
  Maintain a queue of requests for each server
  This will also reduce the effect of spider traps
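Per-server queues with a cap on active requests can be sketched as follows. The class name PoliteScheduler, the default limit of one active request per server, and the server labels in the demo are assumptions.

```python
from collections import defaultdict, deque

class PoliteScheduler:
    """Keep one request queue per server and cap the requests in flight to each."""

    def __init__(self, max_active_per_server=1):
        self.max_active = max_active_per_server
        self.queues = defaultdict(deque)   # server -> pending URLs
        self.active = defaultdict(int)     # server -> requests in flight

    def add(self, server, url):
        self.queues[server].append(url)

    def next_request(self):
        """Return (server, url) for a server below its limit, or None."""
        for server, pending in self.queues.items():
            if pending and self.active[server] < self.max_active:
                self.active[server] += 1
                return server, pending.popleft()
        return None

    def done(self, server):
        """Call when a request to this server has completed."""
        self.active[server] -= 1

scheduler = PoliteScheduler()
scheduler.add("server-a", "/1")
scheduler.add("server-a", "/2")
scheduler.add("server-b", "/3")
picked = [scheduler.next_request(), scheduler.next_request(), scheduler.next_request()]
print(picked)
```

The third call returns None because each server already has a request in flight; a site that floods the crawler with links only lengthens its own queue.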
Text Repository
The pages that are fetched are dumped into a text repository.
The text repository is very large
  Needs to be compressed (Google uses zlib for 3 to 1 compression)
Google implements its own file system.
Berkeley DB (www.sleepycat.com) can also be used
  Stores a database within a single file
  Provides several access methods such as B-tree or sequential
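As a rough illustration of a single-file repository of compressed pages, the sketch below uses Python's built-in sqlite3 module. This is a swapped-in stand-in, not Berkeley DB and not Google's file system; the class name, table layout, and in-memory default are assumptions.

```python
import sqlite3
import zlib

class PageRepository:
    """Store zlib-compressed pages keyed by URL in one SQLite database file."""

    def __init__(self, path=":memory:"):      # pass a file name to persist on disk
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)")

    def put(self, url, html):
        blob = zlib.compress(html.encode("utf-8"))
        with self._db:                          # commit the insert as a transaction
            self._db.execute(
                "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
                (url, blob))

    def get(self, url):
        row = self._db.execute(
            "SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
        return None if row is None else zlib.decompress(row[0]).decode("utf-8")

repo = PageRepository()
repo.put("http://www.cse.iitb.ac.in/", "<html>IIT Bombay CSE</html>")
print(repo.get("http://www.cse.iitb.ac.in/"))
```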
Refreshing Crawled Pages
The HTTP protocol could be used to check whether a page has changed since the last time it was crawled.
But using HTTP to check whether a page has been modified takes a lot of time.
If a page expires after a certain time, this can be extracted from the HTTP header.
If we had a score that reflects the probability of change since the last visit
  We could sort the pages by that score and crawl them in that order
Use the past behavior to model the future!
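One very simple way to turn past behavior into a score is the smoothed fraction of past visits on which the page had changed. This estimator, the function names, and the sample history are assumptions for illustration, not the model from the book.

```python
def change_score(visits, changes):
    """Estimated probability that a page has changed, from past observations.

    Add-one smoothing keeps the score away from 0 and 1 for rarely visited pages.
    """
    return (changes + 1) / (visits + 2)

def refresh_order(history):
    """Sort URLs so that pages most likely to have changed are crawled first."""
    return sorted(history, key=lambda url: change_score(*history[url]), reverse=True)

# Hypothetical history: url -> (number of visits, number of visits that saw a change)
history = {"news": (10, 9), "about": (10, 0), "blog": (10, 4)}
order = refresh_order(history)
print(order)
```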
Your crawler
Use the w3c-libwww API to implement your crawler.
Start from a very simple implementation and build on it!
Sample code and algorithms are provided in the handouts.