Chapter 19
Web Crawler
Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 19-2
Chapter Objectives
• Provide a case study example from problem statement through implementation
• Demonstrate how hash tables and graphs can be used to solve a problem
Web Crawler
• A web crawler is a system that searches the web, beginning with a user-designated web page, looking for a designated target string
• A web crawler follows all of the links on each page it encounters until there are no more pages or until it reaches a designated limit
Web Crawler
• For this case study, we will create a graphical web crawler with the following requirements:
– Enter a designated starting web page
– Enter a target string for which to search
– Limit the search to 50 pages
– Display the results when done
Web Crawler - Design
• Our web crawler system consists of three high-level components:
– The driver
– The graphical user interface
– The web crawler implementation
• Makes use of graphs and hash tables
Web Crawler - Design
• The algorithm for the web crawler is as follows:
– Add the starting page to a HashSet of pages to be searched and to our graph
– Remove a page from the set of pages to be searched
– Search the page for the target string
• If the string exists, add the page to the list of results
– Search the page for links
• If the links have not already been searched, add them to the set of pages to be searched and to our graph
– Repeat the three previous steps until our limit is reached or the set is empty
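The steps above can be sketched in Java. This is a minimal illustration, not the book's code: it substitutes an in-memory map of page names to page text for real HTTP fetches, and the `link:` token syntax, class name, and method names are assumptions made so the example is self-contained.

```java
import java.util.*;

// Sketch of the crawl loop: a HashSet of pages to be searched, a "seen"
// set standing in for the graph's vertices, and a page limit of 50.
public class CrawlerSketch {
    static final int LIMIT = 50;

    // Simulated web: page name -> page text. Links are embedded as
    // whitespace-separated tokens of the form "link:pageName".
    static Map<String, String> web = new HashMap<>();

    public static List<String> crawl(String start, String target) {
        Set<String> toSearch = new HashSet<>(); // pages waiting to be searched
        Set<String> seen = new HashSet<>();     // pages already added to the graph
        List<String> results = new ArrayList<>();

        // Add the starting page to the set of pages to be searched
        toSearch.add(start);
        seen.add(start);

        int searched = 0;
        while (!toSearch.isEmpty() && searched < LIMIT) {
            // Remove a page from the set of pages to be searched
            Iterator<String> it = toSearch.iterator();
            String page = it.next();
            it.remove();
            searched++;

            String content = web.getOrDefault(page, "");

            // Search the page for the target string
            if (content.contains(target)) {
                results.add(page);
            }

            // Search the page for links; add unseen links to the set
            for (String token : content.split("\\s+")) {
                if (token.startsWith("link:")) {
                    String linked = token.substring(5);
                    if (seen.add(linked)) { // true only if not seen before
                        toSearch.add(linked);
                    }
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        web.put("A", "hello link:B link:C");
        web.put("B", "target here link:C");
        web.put("C", "nothing");
        System.out.println(crawl("A", "target")); // only page B contains the target
    }
}
```

Note that the `seen` set is what prevents revisiting pages reachable by multiple links, which is the same role the graph plays in the design above.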
FIGURE 19.1 User interface design
FIGURE 19.2 UML description