
Chapter 19

Web Crawler

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 19-2

Chapter Objectives

• Provide a case study example from problem statement through implementation

• Demonstrate how hash tables and graphs can be used to solve a problem


Web Crawler

• A web crawler is a system that searches the web, beginning with a user-designated web page, looking for a designated target string

• A web crawler follows all of the links on each page that it encounters until there are no more pages or until it reaches a designated limit


Web Crawler

• For this case study, we will create a graphical web crawler with the following requirements:

– Enter a designated starting web page

– Enter a target string for which to search

– Limit the search to 50 pages

– Display the results when done


Web Crawler - Design

• Our web crawler system consists of three high-level components:

– The driver

– The graphical user interface

– The web crawler implementation

• Makes use of graphs and hash tables
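As a rough sketch, the crawler's core state could be held in standard Java collections; the adjacency map below is a hypothetical stand-in for a graph class, and the field and method names are illustrative, not the book's:

```java
import java.util.*;

public class CrawlerState {
    // Pages discovered but not yet searched (the HashSet from the design)
    Set<String> toBeSearched = new HashSet<>();

    // The crawl graph: each page maps to the pages it links to
    // (a plain adjacency map standing in for a dedicated graph class)
    Map<String, List<String>> graph = new HashMap<>();

    // Pages on which the target string was found
    List<String> results = new ArrayList<>();

    // Record a newly discovered page in both the search set and the graph
    void discover(String fromPage, String newPage) {
        toBeSearched.add(newPage);
        graph.computeIfAbsent(fromPage, k -> new ArrayList<>()).add(newPage);
    }
}
```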


Web Crawler - Design

• The algorithm for the web crawler is as follows:

– Add the starting page to a HashSet of pages to be searched and to our graph

– Remove a page from the set of pages to be searched

– Search the page for the target string

• If the string exists, add the page to the list of results

– Search the page for links

• If links have not already been searched, add them to the set of pages to be searched and to our graph

– Repeat the three previous steps until our limit is reached or the set is empty
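The steps above can be sketched as a single loop. To keep the example self-contained, a Map stands in for the live web (URL to page text) and links take a hypothetical href="..." form; a real implementation would fetch pages over HTTP instead:

```java
import java.util.*;
import java.util.regex.*;

public class CrawlLoop {
    // Hypothetical in-memory "web": URL -> page text (stands in for HTTP fetches)
    static Map<String, String> web = new HashMap<>();

    static List<String> crawl(String startPage, String target, int limit) {
        Set<String> seen = new HashSet<>();          // pages already added to the search
        Deque<String> toSearch = new ArrayDeque<>(); // pages waiting to be searched
        List<String> results = new ArrayList<>();
        Pattern link = Pattern.compile("href=\"([^\"]+)\"");

        toSearch.add(startPage);
        seen.add(startPage);
        int searched = 0;
        while (!toSearch.isEmpty() && searched < limit) {
            String page = toSearch.remove();         // remove a page from the set
            searched++;
            String content = web.getOrDefault(page, "");
            if (content.contains(target))            // search the page for the target
                results.add(page);
            Matcher m = link.matcher(content);       // search the page for links
            while (m.find()) {
                String next = m.group(1);
                if (seen.add(next))                  // add() returns false if already seen
                    toSearch.add(next);
            }
        }
        return results;
    }
}
```

Tracking seen pages in a HashSet gives constant-time duplicate checks, which is why the design pairs the graph with a hash table.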


FIGURE 19.1 User interface design


FIGURE 19.2 UML description