

Computer Networks and ISDN Systems 30 (1998) 119-130

SPHINX: a framework for creating personal, site-specific Web crawlers

Robert C. Miller *, Krishna Bharat ¹

Abstract

Crawlers, also called robots and spiders, are programs that browse the World Wide Web autonomously. This paper describes SPHINX, a Java toolkit and interactive development environment for Web crawlers. Unlike other crawler development systems, SPHINX is geared towards developing crawlers that are Web-site-specific, personally customized, and relocatable. SPHINX allows site-specific crawling rules to be encapsulated and reused in content analyzers, known as classifiers. Personal crawling tasks can be performed (often without programming) in the Crawler Workbench, an interactive environment for crawler development and testing. For efficiency, relocatable crawlers developed using SPHINX can be uploaded and executed on a remote Web server. © 1998 Published by Elsevier Science B.V. All rights reserved.

Keywords: Crawlers; Robots; Spiders; Web automation; Web searching; Java; End-user programming; Mobile code

1. Introduction

Today's World Wide Web is roamed by numerous automatons, variously called crawlers, robots, bots, and spiders. These are programs that browse the Web unassisted, and follow links and process pages in a largely autonomous fashion. Crawlers perform many useful services, including full text indexing [5], link maintenance [8], downloading [3], printing [4], and visualization [17]. However, this list of services is by no means exhaustive, and users often require custom services that no existing robot provides. With the rapid growth and increasing importance of the Web in daily life, one expects a corresponding growth in demand for personalized Web automation. This will allow a user to delegate repetitive tasks to a robot, and generate alternative views and summaries of Web content that meet the user's needs. Unfortunately, it is still quite a laborious task to build a crawler.

* Corresponding author. E-mail: rcm@cs.cmu.edu
¹ E-mail: [email protected]

State-of-the-art Web crawlers are generally hand-coded programs in Perl, C/C++, or Java. They typically require the use of a network access library to retrieve Web pages, and some form of pattern matching to find links within pages and process textual content. With these techniques, even simple crawlers take serious work. Consider the following example: as a first step in creating a "photo album" of pictures of all the researchers at your institution, you might want the agent to crawl over the institution's list of home pages and collect all the images found. Unless your institution is enormous, coding this one-shot crawler in C or Perl could easily take longer than visiting each page in a browser and saving the images manually. For a novice programmer, writing a full crawler requires a substantial learning curve.

The picture-collection task is an example of a personal crawl. This is a task of interest to perhaps a single user, who may want to perform the operation only a small number of times. In other application domains, personal automation is often handled by macros or embedded scripting languages, but such facilities are not commonly available for the Web crawling domain.

Site-specific crawlers are also ill-supported by current crawler development techniques. A site-specific crawler is tailored to the syntax of the particular Web sites it crawls (presentation style, linking, and URL naming schemes). Examples of site-specific crawlers include metasearch engines [21], homepage finders [22], robots for personalized news retrieval [13], and comparison-shopping robots [7]. Site-specific crawlers are created by trial-and-error. The programmer needs to develop rules and patterns for navigating within the site and parsing local documents by a process of "reverse engineering". No good model exists for modular construction of site-specific crawlers from reusable components, so site-specific rules engineered for one crawler are difficult to transfer to another crawler. This causes users building crawlers for a site to repeat the work of others.

A third area with room for improvement in present techniques is relocatability. A relocatable crawler is capable of executing on a remote host on the network. Existing Web crawlers tend to be nonrelocatable, pulling all the pages back to their home site before they process them. This may be inefficient and wasteful of network bandwidth, since crawlers often tend to look at more pages than they actually need. Further, this strategy may not work in some cases because of access restrictions at the source site. For example, when a crawler resides on a server [17] outside a company's firewall, then it cannot be used to crawl the Web inside the firewall, even if the user invoking it has permission to do so. Users often have a home computer connected by a slow phone line to a fast Internet Service Provider (ISP) who may in turn communicate with fast Web servers. The bottleneck in this communication is the user's link to the ISP. In such cases the best location for a user-specific crawler is the ISP or, in the case of site-specific crawlers, the Web site itself. On-site execution provides the crawler with high-bandwidth access to the data, and permits the site to do effective billing, access restriction, and load control.

This paper describes SPHINX (Site-oriented Processors for HTML INformation eXtraction), a Java-based toolkit and interactive development environment for Web crawlers. SPHINX explicitly supports crawlers that are site-specific, personal, and relocatable. The toolkit provides a novel application framework which supports multithreaded page retrieval and crawl visualization, and introduces the notion of classifiers, which are reusable site-specific content analyzers. The interactive development environment, called the Crawler Workbench, is implemented as a Java applet. The Workbench allows users to specify and invoke simple personal crawlers without programming, and supports interactive development and debugging of site-specific crawlers. The SPHINX toolkit also provides library support for crawling in Java, including HTML parsing, pattern matching, and common Web page transformations.

2. Crawling model

This section describes the model of Web traversal adopted by the SPHINX crawling toolkit, and the basic Java classes that implement the model. The Java interface is used by a programmer writing a crawler directly in Java. Users of the Crawler Workbench need not learn it, unless they want to extend the capabilities of the Workbench with custom-programmed code.

2.1. Crawling over the Web graph

SPHINX regards the Web as a directed graph of pages and links, which are reflected in Java as Page and Link objects. A Page object represents a parsed Web page, with methods to retrieve the page's URL, title, and HTML parse tree. Also among the attributes of Page is a collection of Link objects representing the outgoing hyperlinks found on the page. A Link object describes not only the target of the link (a URL), but also the source (a page and an HTML element within that page). Each Link and Page object also stores a set of arbitrary string-valued attributes, which are used by classifiers for labelling (discussed in the next section).
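
As a rough sketch of how crawler code sees this graph, the fragment below walks a page's outgoing links and prints them. Only getURL() appears in the text above; getTitle(), getLinks(), and the attribute accessor used here are assumed names, not documented SPHINX signatures.

    // Sketch only: apart from getURL(), the accessor names below are
    // assumptions about the Page/Link API, used purely for illustration.
    void printOutgoingLinks (Page page) {
        System.out.println("Page: " + page.getURL() + "  (" + page.getTitle() + ")");
        Link[] links = page.getLinks();                   // assumed accessor for outgoing hyperlinks
        for (int i = 0; i < links.length; i++) {
            String label = links[i].getLabel("anchor");   // assumed accessor for a string-valued attribute
            System.out.println("  -> " + links[i].getURL() + "  [" + label + "]");
        }
    }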

To write a crawler in SPHINX, the programmer extends the Crawler class by overriding two callback methods:
• boolean shouldVisit(Link l) decides whether link l should be followed ("visited") by the crawler.
• void visit(Page p) processes the page p that was reached by following a link.
Here is an example of a simple crawler. This crawler only follows links whose URL matches a certain pattern, where pattern-matching is handled by the abstract Pattern class. Several subclasses of Pattern are provided with SPHINX, including Wildcard (Unix shell-style wildcard matching, used in the example below), Regexp (Perl 5-style regular expressions), and Tagexp (regular expressions over the alphabet of HTML tags).

    class FacultyListCrawler extends Crawler {

        // This simple crawler visits a certain university department
        // and lists home-page URLs of the academic faculty. All the
        // homepage URLs are assumed to obey the following pattern:
        static Pattern facultyHomePage =
            new Wildcard ("http://www.cs.someplace.edu/Faculty/*/home.html");

        boolean shouldVisit (Link link) {
            return facultyHomePage.match( link.getURL().toString() );
            // match() returns true if the link's URL matches
            // the facultyHomePage pattern
        }

        void visit (Page page) {
            System.out.println ( page.getURL() );
            // getURL() returns the URL that the page represents
        }
    }
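
A crawler like this is then pointed at one or more root links and started. The sketch below shows the general shape of that step; the names setRoot() and run(), and the Link(String) constructor, are assumptions about the toolkit's API rather than signatures given in this paper.

    // Hypothetical usage: setRoot(), run(), and the Link(String) constructor
    // stand in for whatever SPHINX actually calls these operations.
    FacultyListCrawler crawler = new FacultyListCrawler();
    crawler.setRoot(new Link("http://www.cs.someplace.edu/Faculty/index.html"));
    crawler.run();   // returns when the ready queue is empty or a limit is reached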

The implementation of Crawler uses multiple threads to retrieve pages and links, and passes them to the callback methods according to the following scheme. The crawler contains a queue of "ready" links, which have been approved by shouldVisit but not yet retrieved. In each iteration, a crawling thread takes a link from this queue, downloads the page to which it points, and passes the page to visit for user-defined processing. Then the thread iterates through the collection of links on the page, calling shouldVisit on each link. Links approved by shouldVisit are put in the queue, and the thread returns to the queue for another link to download.
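
The scheme can be summarized in code. The following is a simplified, single-threaded sketch of the loop each crawling thread runs; the Vector-based queue, the downloadPage() placeholder, and the getLinks() accessor are illustrative stand-ins for the toolkit's internal machinery.

    // Simplified, single-threaded sketch of the retrieval scheme described
    // above; downloadPage() is a placeholder for SPHINX's own fetching code.
    abstract class CrawlLoopSketch {
        abstract Page downloadPage (Link link);   // placeholder: fetch and parse the link's target

        void crawl (Crawler crawler, java.util.Vector ready) {
            // "ready" holds links approved by shouldVisit but not yet retrieved
            while (!ready.isEmpty()) {
                Link next = (Link) ready.elementAt(0);
                ready.removeElementAt(0);             // take a link from the queue
                Page page = downloadPage(next);
                crawler.visit(page);                  // user-defined processing
                Link[] links = page.getLinks();       // assumed accessor for outgoing links
                for (int i = 0; i < links.length; i++)
                    if (crawler.shouldVisit(links[i]))
                        ready.addElement(links[i]);   // approved links join the queue
            }
        }
    }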

The ready queue is initialized with a set of links, called the starting points or roots of the crawl. The crawler continues retrieving pages until its queue is empty, unless the programmer cancels it by calling abort() or the crawl exceeds some predefined limit (such as number of pages visited, number of bytes retrieved, or total wall-clock time).

Since the links of the Web form a directed graph, one can usefully regard the SPHINX crawling model from the standpoint of graph traversal. Breadth-first and depth-first traversal are simulated by different policies on the ready queue: first-in-first-out for breadth-first, last-in-first-out for depth-first. In our implementation, the ready queue is actually a priority queue, so shouldVisit can define an arbitrary traversal strategy by attaching priority values to links. This facility could, for example, be used to implement a heuristically-guided search through the Web.
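
The mechanism for attaching a priority to a link is not spelled out above, so the sketch below simply assumes a setPriority() method on Link (and a getAnchorText() accessor); it illustrates the idea of a crude best-first crawl driven from shouldVisit.

    // Hypothetical best-first crawler: setPriority() and getAnchorText() are
    // assumed method names, used only to illustrate priority-driven traversal.
    class BestFirstCrawler extends Crawler {
        boolean shouldVisit (Link link) {
            String anchor = link.getAnchorText();                      // assumed accessor
            double score = (anchor != null
                            && anchor.toLowerCase().indexOf("research") >= 0) ? 1.0 : 0.1;
            link.setPriority(score);      // assumed hook: higher scores are fetched earlier
            return true;                  // follow everything, but in score order
        }

        void visit (Page page) {
            System.out.println(page.getURL());
        }
    }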


A traversal of the Web runs the risk of falling into a cycle and becoming trapped. This is especially true in cases where pages are dynamically created. SPHINX does simple cycle avoidance by ignoring links whose URL has already been visited. More insidious cycles can be broken by setting a depth limit on the traversal, so that pages at a certain (user-specified) depth from the root set are arbitrarily treated as leaves with no links. A better solution would be to use a classifier that detects if a "similar" page has been seen before, to avoid visiting the same sort of page twice. A page resemblance technique such as [3] can be used for this purpose.
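
Such a depth limit can be expressed directly in shouldVisit. In this sketch the accessor getDepth(), counting hops from the root set, is an assumed name; the text above only states that a user-specified limit of this kind exists.

    // Sketch of a user-imposed depth limit; getDepth() is an assumed accessor
    // for the link's distance from the crawl's root set.
    class DepthLimitedCrawler extends Crawler {
        static final int MAX_DEPTH = 5;

        boolean shouldVisit (Link link) {
            return link.getDepth() <= MAX_DEPTH;   // deeper pages are treated as leaves
        }

        void visit (Page page) {
            // process the page as usual
        }
    }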

Like other crawling toolkits, SPHINX also supports the informal standard for robot exclusion [14], which allows a Web site administrator to indicate portions of the site that should not be visited by robots.

A site-specific crawler typically contains a number of rules for interpreting the Web site for which it is designed. Rules help the crawler choose which links on the site to follow (say, articles) and which to avoid (say, pictures), and which pages or parts of pages to process (body text) and which to ignore (decoration). SPHINX provides a facility for encapsulating this sort of knowledge in reusable objects called classifiers. A classifier is a helper object attached to a crawler which annotates pages and links with useful information. For example, a homepage classifier might use some heuristics to determine if a retrieved page is a personal homepage, and if so, set the homepage attribute on the page. (SPHINX's Link and Page objects allow arbitrary string-valued attributes to be set programmatically.) A crawler's visit method might use this attribute to decide how to process the page - for instance, if the homepage attribute is set, the crawler might look for an image of the person's face.

The programmer writes a classifier by implementing the Classifier interface, which has one method: classify(Page p), which takes a retrieved page and annotates it with attributes. Attributes may be set on the page itself, on regions of the page (represented as a byte offset and length in the page), and on outgoing links. A classifier is used by registering it with a crawler; the main loop of the crawler takes care of passing retrieved pages to every registered classifier before submitting them to shouldVisit and visit.
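
A minimal classifier along the lines of the homepage example above might look like the following. The Classifier interface and its classify(Page) method are as described; the attribute setter setLabel(), the getTitle() accessor, and the heuristics themselves are assumptions made for illustration.

    // Sketch of a homepage classifier. classify(Page) is the interface method
    // described above; setLabel() and getTitle() are assumed accessor names.
    class HomepageClassifier implements Classifier {
        public void classify (Page page) {
            String title = page.getTitle();
            String url = page.getURL().toString();
            boolean looksLikeHomepage =
                   (title != null && title.toLowerCase().indexOf("home page") >= 0)
                || url.endsWith("/index.html")
                || url.indexOf("/~") >= 0;              // user directories often hold homepages
            if (looksLikeHomepage)
                page.setLabel("homepage", "true");      // assumed string-valued attribute setter
        }
    }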

We have written a number of classifiers. The StandardClassifier, which is registered by default with all crawlers, deduces simple link relationships by comparing the link's source and target URLs. (For example, a link from http://www.yahoo.com/ to http://www.yahoo.com/Arts/ is labeled as a descendent link, because the source URL is a prefix of the target URL.) Another useful classifier is CategoryClassifier, which scores a page according to its similarity to various categories in some category hierarchy, such as Yahoo!² [25]. A category is represented by a collection of documents (e.g., documents in the Yahoo! category Sports:Baseball). The match between a page and a category is computed using a Vector Space document/collection similarity metric, commonly used in Information Retrieval [20].

Classifiers are especially useful for encapsulating Web-site-specific knowledge in a modular way. This allows them to be interchanged when appropriate, and reused in different crawlers. For instance, we have a classifier for the AltaVista³ search engine, which detects and parses AltaVista query result pages. This classifier is used in several of our example applications that use a search engine to start the crawl (see Section 4). When AltaVista released a new page design during the summer of 1997, it was sufficient to fix the AltaVista classifier - no changes were needed to the crawlers that use it. We were also able to experiment with other search engines, such as HotBot, simply by writing a HotBot classifier that sets the same attributes as the AltaVista classifier. Thus the classifier mechanism makes it straightforward to build extensible meta-services, such as metasearch engines, in SPHINX.
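
Because classifiers communicate only through attributes, a crawler that consumes them stays search-engine-neutral. The sketch below follows only links labelled result or more-results (the labels used again in Section 3); hasLabel() is an assumed accessor for the string-valued attributes described earlier.

    // Sketch of an engine-neutral crawler driven by labels that a
    // search-engine classifier (AltaVista, HotBot, ...) attaches to links.
    // hasLabel() is an assumed accessor name.
    class SearchResultCrawler extends Crawler {
        boolean shouldVisit (Link link) {
            return link.hasLabel("result") || link.hasLabel("more-results");
        }

        void visit (Page page) {
            System.out.println("Query hit: " + page.getURL());
        }
    }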

3. Crawler Workbench

SPHINX has a graphical user interface called the Crawler Workbench that supports developing, running, and visualizing crawlers in a Web browser. A number of commonly-used shouldVisit and visit operations are built into the Workbench, enabling the user to specify and run simple crawls without programming. The Workbench can run any SPHINX crawler, as long as it inherits from the basic Crawler class.

² http://www.yahoo.com/
³ http://altavista.digital.com/

The Workbench includes a customizable crawler with a selection of predefined shouldVisit and visit operations for the user to choose from. For more customized operations, the Workbench allows a programmer to write Javascript code for shouldVisit and visit (if the Web browser supports it), manipulating Pages and Links as Java objects.

The built-in shouldVisit predicates test whether a link should be followed, based on the link's URL, anchor, or attributes attached to it by a classifier. The built-in visit operations include, among others:
• Save, which stores visited pages to the local filesystem, in a directory structure mirroring the organization of files at the server, as revealed by the URLs seen;
• Concatenate, which concatenates the visited pages into a single HTML document, making it easy to view, save, or print them all at once (this feature has also been called linearization [17]);
• Extract, which matches a pattern (such as a regular expression) against each visited page and stores all the matching text to a file.
Each of the visit operations can be parameterized by a page predicate. This predicate is used to determine whether a visited page should actually be processed. It can be based on the page's title, URL, content, or attributes attached to it by a classifier. The page predicate is needed for two reasons: first, shouldVisit cannot always tell from the link alone whether a page will be interesting, and in such cases the crawler may actually need to fetch the page and let the page predicate decide whether to process it. Second, it may be necessary to crawl through uninteresting pages to reach interesting ones, and so shouldVisit may need to be more general than the processing criterion.
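
Expressed programmatically rather than through the Workbench, an Extract-style operation guarded by a page predicate amounts to something like the sketch below. The Regexp pattern class is named earlier; the getContent() and getTitle() accessors and the predicate wiring are assumptions.

    // Sketch of an Extract-style crawler with a page predicate: pages are
    // crawled broadly, but only pages passing the predicate are searched.
    // getTitle() and getContent() are assumed accessor names.
    class ExtractCrawler extends Crawler {
        Pattern email = new Regexp("[A-Za-z0-9._-]+@[A-Za-z0-9.-]+");

        boolean pagePredicate (Page page) {               // hypothetical predicate
            String title = page.getTitle();
            return title != null && title.indexOf("Faculty") >= 0;
        }

        boolean shouldVisit (Link link) {
            return true;                                  // crawl broadly...
        }

        void visit (Page page) {
            if (!pagePredicate(page)) return;             // ...but process selectively
            if (email.match(page.getContent()))
                System.out.println(page.getURL() + " contains an e-mail address");
        }
    }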

An example of using the customizable crawler for a personal crawling task is shown in Fig. 1. The task is to print a hardcopy of Robert Harper's Introduction to Standard ML⁴, a Web document divided into multiple HTML pages. In the Crawler Workbench, the user enters the first page of the document as the starting URL and sets some gross limits, constraining the crawler to a single Web server and pages at most 5 hops away from the starting URL (Fig. 1a). Unfortunately these limits are not sufficient, because the document contains links to pages outside the document (such as Standard ML at Carnegie Mellon⁵) which we don't want to print. A little investigation indicates that all the relevant pages have the prefix "http://www.cs.cmu.edu/~rwh/introsml/", so the user specifies this constraint as a regular expression on URLs that the crawler may follow (1b). Running the crawler produces a graph visualization of the Web pages visited (1c), which the user can browse to check that the appropriate pages are being retrieved. This appears to be the case, so (1d) the user specifies that the crawler should now concatenate the visited pages together to make a single page that can be printed (1e). This example shows how the Crawler Workbench provides a visual, interactive, iterative development environment for Web crawling.

When combined with a library of classifiers (which can be dynamically loaded into the Workbench), the customizable crawler enables some interesting crawls that would otherwise require programming. Let us suppose, for example, that the user wants to save all the pages found by an AltaVista query for off-line viewing or processing. Using the Crawler Workbench with the AltaVista classifier, the user can interactively specify a crawler that follows links labeled result (which point to pages that matched the query) or more-results (which point to more pages of query results), but not links to AltaVista's help documentation or advertisers. This kind of crawl would be difficult to specify in other site-download tools [2], which lack the ability to make semantic distinctions among links.

⁴ http://www.cs.cmu.edu/~rwh/introsml/index.htm
⁵ http://www.cs.cmu.edu/~petel/smlguide

Fig. 1. Using the Crawler Workbench to linearize a document graph for printing.

While a crawler is running, the Workbench dynamically displays the growing crawl graph, which consists of the pages and links the crawler has encountered. Two visualizations of the crawl graph are available in our prototype (Fig. 2). First is a graph view (Fig. 2a), which renders pages as nodes and links as directed edges, positioned automatically by a dynamic polynomial-approximation layout algorithm [24]. Nodes are displayed concisely, as icons. To get more information about a node, the user can position the mouse over an icon, and the page title and URL will appear in a popup "tip" window. The user can also drag nodes around with the mouse to improve the layout, if desired. The second visualization is an outline view (Fig. 2b), which is a hierarchical list of the pages visited by the crawler. A page can be expanded or collapsed to show or hide its descendents. Both views are linked to the Web browser, so that double-clicking a page displays it in the browser. The availability of dynamic, synchronized views of the crawl proves to be very valuable as the user explores a site and builds a custom crawler for it.

The visualizations are connected to the crawler by event broadcast. When some interesting event occurs in the crawl - such as following a link or retrieving a page - the crawler sends an event to all registered visualizations, which update their displays accordingly. The visualizations also have direct access to the crawl graph, represented internally as a linked collection of Page and Link objects. This architecture enables visualizations and crawlers to be developed independently. Furthermore, since the Crawler implementation takes care of broadcasting events at appropriate times, the visualizations in the Workbench can be connected to any SPHINX crawler, as long as it inherits from Crawler.
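
The event-broadcast coupling could be pictured with a hypothetical listener interface like the one below. SPHINX's actual event classes and registration method are not named in the text, so every identifier here (CrawlListener, pageRetrieved, addCrawlListener) is illustrative only.

    // Hypothetical listener wiring for the event broadcast described above;
    // none of these names are taken from the SPHINX API.
    interface CrawlListener {
        void linkFollowed (Link link);
        void pageRetrieved (Page page);
    }

    class PageCounterView implements CrawlListener {
        int pages = 0;
        public void linkFollowed (Link link) { /* e.g., add an edge to a graph view */ }
        public void pageRetrieved (Page page) {
            pages++;                                      // e.g., update a "pages crawled" display
        }
    }

    // registration by the Workbench (method name assumed):
    //     crawler.addCrawlListener(new PageCounterView());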

Our graph visualization allows the programmer to create mappings between attributes of the Page object and presentation attributes such as colour, label, node size, and icon shape. There is much room for improvement. Data visualization techniques such as fisheye views [11], overview diagrams [18], cone trees [19], and hyperbolic trees [15], might greatly enhance our crawl display.

4. Applications

Some interesting crawlers written with the SPHINX toolkit are briefly described below.

4.1. Category ranking

The precision of search engine queries often suffers from the polysemy problem: query words with multiple meanings introduce spurious matches. This problem gets worse if there exists a bias on the Web towards certain interpretations of a word. For example, querying AltaVista for the term "amoeba" turns up far more references to the Amoeba distributed operating system⁶ than to the unicellular organism, at least among the first 50 results.

One way to avoid the polysemy problem is to allow the user to indicate a general-purpose category, such as Biology, to define the context of the query. This category is used to re-rank the results of a search engine query. We use the CategoryClassifier (described previously) to compute the page's similarity to a collection of pages considered representative of the category. In our implementation, we used page collections defined by Yahoo!⁷ to build up our predefined categories.

⁶ http://www.am.cs.vu.nl/
⁷ http://www.yahoo.com/

Fig. 2. A Web crawl visualized as (a) a graph, and (b) an outline.

The category-ranking application shown in Fig. 3 is implemented as a SPHINX crawler that submits the user's query to a search engine and retrieves some number of results (usually 50, which corresponds to 5 result pages). For a fast approximation to the category ranking, the crawler first computes the category similarity of each page's brief description returned by the search engine. This allows a preliminary ranking of the results to be shown to the user immediately. Meanwhile, the crawler retrieves the full pages in the background, in order to compute the category scores more accurately. Pages are retrieved in rank order based on the score computed during the initial ranking. Here SPHINX's priority-driven crawling scheme is exploited.
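
A compressed sketch of that two-phase scoring appears below: the search engine's brief description yields the fast preliminary score, which also sets the fetch priority, and the full page yields the slower accurate score. CategoryClassifier and the priority-driven queue come from the text; scoreAgainst(), the label names, getLabel(), setPriority(), and getContent() are assumed for illustration.

    // Sketch of the fast/slow category ranking; scoreAgainst(), getLabel(),
    // hasLabel(), setPriority(), and getContent() are assumed names.
    class CategoryRankingCrawler extends Crawler {
        CategoryClassifier category;    // e.g., scores text against "Biology"

        boolean shouldVisit (Link link) {
            if (!link.hasLabel("result")) return false;          // follow query results only
            String description = link.getLabel("description");   // snippet from the result page
            double fast = category.scoreAgainst(description);    // preliminary ("Fast") score
            link.setPriority(fast);                              // fetch promising pages first
            return true;
        }

        void visit (Page page) {
            double slow = category.scoreAgainst(page.getContent());  // accurate ("Slow") score
            System.out.println(slow + "\t" + page.getURL());
        }
    }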

Note that category-ranking is not in any way restricted to search queries. The pages encountered in any crawl can be scored using the CategoryClassifier and presented to the user in decreasing order of predicted usefulness.

4.2. Language modelling

A matter of concern to the speech recognition community is how to deal with a "non-stationary" source: a source of words whose probability distribution changes over time. For example, consider a news broadcast. One minute, the broadcaster is reading a story on Iraq, in which the words "U.N.", "Albright", "ambassador", and "weapons" have higher probability of occurrence than in general. The next minute, the topic has shifted to the Iditarod dog sled race in Alaska, in which the words "snow", "cold", and "mush" are more probable than usual. Speech recognition systems which have been trained on a fixed corpus of text have difficulty adapting to topic changes, and have no hope of comprehending completely new topics that were missing from the training corpus.

A novel solution to this problem (described in [1]) is just-in-time language modelling. A just-in-time language modelling scheme uses recent utterances to the speech recognition system to query a text database, and then retrains its language model using relevant text from the database. The prototype system actually uses the World Wide Web as its text database. A small SPHINX crawler submits the last utterance as a search engine query (with common words removed), then retrieves pages in order of relevance until enough words have been accumulated to retrain the language model.
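
A compressed sketch of that crawler's control flow is shown below. The word-count threshold, the use of a search-engine classifier's result label, and the abort() call follow the description above; the getContent() accessor, hasLabel(), and the threshold value are stand-ins.

    // Sketch of the just-in-time language-modelling crawler: gather text from
    // result pages until enough words have been seen, then stop the crawl.
    // getContent() and hasLabel() are assumed accessor names.
    class JustInTimeCrawler extends Crawler {
        static final int ENOUGH_WORDS = 50000;    // illustrative threshold
        int wordsSeen = 0;
        StringBuffer trainingText = new StringBuffer();

        boolean shouldVisit (Link link) {
            return link.hasLabel("result");       // follow search-engine result links only
        }

        void visit (Page page) {
            String text = page.getContent();
            wordsSeen += new java.util.StringTokenizer(text).countTokens();
            trainingText.append(text);
            if (wordsSeen >= ENOUGH_WORDS)
                abort();                          // enough text gathered; retrain the model
        }
    }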

In the course of developing this crawler, we experimented with several different search engines, in order to find the right balance of speed, precision, and recall needed to generate good training text.


Fig. 3. Results of an AltaVista query, ordered by similarity to Biology. "Rank" shows AltaVista's original ranking (where the first result returned has rank 1). "Fast" shows the category-similarity score of the brief description. "Slow" shows the score of the full page. The three pages with the highest "Slow" scores are relevant to the unicellular organism, despite appearing rather late in AltaVista's ranking.

Since all search engine dependencies are encapsulated in SPHINX classifiers, it was straightforward to retarget the crawler to AltaVista⁸, HotBot⁹, and finally News Index¹⁰, which proved best for the broadcast news domain.

5. Relocatable crawlers

This section describes features of the SPHINX crawling architecture that contribute to developing relocatable crawlers. Relocatable crawlers are capable of executing on a remote host on the network. We present some scenarios in which relocatable crawling is useful and discuss the issues involved in supporting them in SPHINX. Support for relocation is currently being added to the toolkit, so this discussion should be regarded as work-in-progress.

Relocation serves a variety of purposes, highlighted in the following scenarios.

⁸ http://altavista.digital.com/
⁹ http://www.hotbot.com/
¹⁰ http://www.newsindex.com/

Scenario #1: the crawler is downloaded to a Web browser. Consider a downloaded Java applet that does some crawling as part of its operation. WebCutter/Mapuccino [17], for example, is an applet that generates Web visualizations.

Scenario #2: the crawler is uploaded to a "crawling server". When the user has a slow computer or a low-bandwidth connection, it may be desirable to upload a crawler to some other host with better speed or faster network links.

Scenario #3: the crawler is uploaded to the target Web server. For a site-specific crawler, the best possible bandwidth and latency can be obtained by executing on the Web server itself.

Much of the infrastructure for relocation is already implemented in Java, such as secure platform-independent execution, code shipping, object serialization, and remote method invocation. Two issues remain to be addressed by SPHINX: the network access policy for an untrusted crawler, and the user interface to a remote crawler. We consider each of these issues in turn.

In all three scenarios, the crawler may be untrusted by its execution host (the browser or server). We assume that the execution host has a local, trusted copy of the SPHINX toolkit. Since SPHINX handles network access for the untrusted crawler, it can filter the crawler's page requests according to an arbitrary, host-defined policy. For example, the page-request policy for Scenario #3 might restrict the crawler to pages on the local Web server, forbidding remote page requests. Other uses for the page-request policy include URL remapping (e.g., converting http: URLs into corresponding file: URLs on the Web server's local filesystem), load control (e.g., blocking the crawler's requests until the load on the Web server is light), resource limits, auditing, or billing. The page-request policy is implemented as a Java class registered by the execution host.
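
The host-defined page-request policy could take a form like the hypothetical class below, which confines an uploaded crawler to the local Web server (Scenario #3). The PageRequestPolicy interface, its allow() method, and the registration call are invented here purely to illustrate the idea; the text only says that the policy is a Java class registered by the execution host.

    // Hypothetical page-request policy for Scenario #3: interface name,
    // allow() method, and registration call are illustrative, not SPHINX API.
    interface PageRequestPolicy {
        boolean allow (java.net.URL url);
    }

    class LocalServerOnlyPolicy implements PageRequestPolicy {
        String localHost;

        LocalServerOnlyPolicy (String localHost) {
            this.localHost = localHost;
        }

        public boolean allow (java.net.URL url) {
            // forbid remote page requests; permit only the hosting Web server
            return url.getHost().equalsIgnoreCase(localHost);
        }
    }

    // registration by the execution host (method name assumed):
    //     sphinxHost.setPageRequestPolicy(new LocalServerOnlyPolicy("www.example.com"));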

Turning to the second issue, a crawler uploaded to a remote server must have some means of communicating results back to the user. First, any events generated by the crawler are automatically redirected back to the user's Crawler Workbench to visualize the remote crawl. The remote crawler can also create HTML output pages and direct SPHINX to send them back to the Workbench for display. The user may choose to disconnect from a long-running crawl and return at a later point, in which case the crawling events and HTML pages are batched by the remote server and delivered at the user's request.

Relocatable crawlers combine local, interactive development with run-time efficiency. A crawler can be developed and debugged locally in the Crawler Workbench, and then shipped to a remote server, where it can run at maximum speed.

6. Related work

Most actual crawlers are written in languages such as Perl, Tcl, or C, using libraries such as the W3C Reference Library¹¹ [10] or libwww-perl¹² [9]. These libraries typically do not provide an interactive development environment. Several interactive systems allow the user to specify a crawl and visualize the results, such as MacroBot¹³ [12] and WebCutter/Mapuccino [17]. The TkWWW Robot [23] was the first system to integrate crawling into a Web browser.

¹¹ http://www.w3.org/Library/
¹² http://www.ics.uci.edu/pub/websoft/libwww-perl/
¹³ http://www.ipgroup.com/macrobot/

Several authors have considered issues of mobile agents on the Web. Mobility is related to our notion of relocation, except that mobile agents are usually capable of migrating from host to host during execution. Lingnau et al. [16] described using HTTP to transport and communicate with mobile agents. Duda [6] defined the Mobile Assistant Programming (MAP) model, a set of primitives for mobility, persistence, and fault-tolerance, and implemented a mobile agent in Scheme that moves to a remote server to search a collection of Web documents.

7. Summary and future work

The SPHINX toolkit and Crawler Workbench are designed to support crawlers that are Web-site-specific, personal, and relocatable. In the SPHINX architecture, crawlers leverage off classifiers, which encapsulate Web-site-specific knowledge in a reusable form. The Crawler Workbench is the first general-purpose crawling tool that runs as a Java applet inside a Web browser. The Workbench provides the user with an opportunity to customize the crawler, and also several visualizations to gauge the effectiveness of the crawler and improve it iteratively. Relocatable crawlers can be developed and debugged locally in the Workbench, then uploaded and run on a remote server at full speed.

Running the Crawler Workbench inside a browser provides a fair level of integration between browsing and crawling, so that crawling tasks that arise during browsing can be solved immediately without changing context. Tighter integration between browsing and crawling would be desirable, however. For example, the user should be able to steer a running crawler by clicking on links in a browser window. Current browsers provide no way for a Java applet to obtain these kinds of events.

A prototype of SPHINX has been implemented, using Java version 1.0.2 and Netscape Communicator 4.0. Work is currently underway to prepare a version of SPHINX for public release, which is anticipated sometime in mid-1998. The public version will support relocatable crawlers and a wider variety of browsers. For more information, see the SPHINX home page at http://www.cs.cmu.edu/~rcm/websphinx/.

Acknowledgements

The authors heartily thank Daniel Tunkelang for his polynomial-approximated graph drawing algorithm and its Java implementation. Thanks also to Brad Myers, Eric Tilton, Steve Glassman, and Monika Henzinger for suggestions for improving this paper. This work was done during the first author's 1997 summer internship at DEC Systems Research Center.

References

[1] A. Berger, R. Miller, Just-in-time language modelling, submitted for publication to International Conference on Acoustics, Speech, and Signal Processing, ICASSP '98.
[2] Blue Squirrel Software, WebWhacker, http://www.bluesquirrel.com/whacker/
[3] A. Broder, S. Glassman, M. Manasse, and G. Zweig, Syntactic clustering of the Web, in: Proc. of WWW6, April 1997.
[4] Canon Information Systems Research Australia Pty. Ltd., Canon WebRecord, http://www.research.canon.com.au/webrecord/
[5] Digital Equipment Corporation, AltaVista, http://altavista.digital.com/
[6] A. Duda, S. Perret, A network programming model for efficient information access on the WWW (http://www5conf.inria.fr/fich_html/papers/P42/Overview.html), in: Proc. of WWW5, Paris, France, 1996.
[7] O. Etzioni, A scalable comparison-shopping agent for the World Wide Web (ftp://ftp.cs.washington.edu/pub/etzioni/softbots/agents97.ps), in: Proceedings of Autonomous Agents '97.
[8] R.T. Fielding, Maintaining distributed hypertext infostructures: welcome to MOMspider's Web (http://www.ics.uci.edu/pub/websoft/MOMspider/WWW94/paper.html), in: Proc. of WWW1, Geneva, Switzerland, May 1994.
[9] R.T. Fielding, libwww-perl, http://www.ics.uci.edu/pub/websoft/libwww-perl/, 1996.
[10] H. Frystyk, Towards a uniform library of common code: a presentation of the World Wide Web library (http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/DDay/frystyk/LibraryPaper.html), in: Proc. of WWW2: Mosaic and the Web, Chicago, USA, 1994.
[11] G.W. Furnas, Generalized fisheye views, in: Proc. of CHI '86 Human Factors in Computing Systems, Boston, USA, April 1986, ACM Press, pp. 16-23.
[12] Information Projects Group, Inc., MacroBot, http://www.ipgroup.com/macrobot/, 1997.
[13] T. Kamba, K. Bharat, and M. Albers, The Krakatoa Chronicle: an interactive, personalized newspaper on the Web (http://www.w3.org/pub/Conferences/WWW4/), in: Proc. of WWW4, Boston, USA, December 1995.
[14] M. Koster, A standard for robot exclusion, http://info.webcrawler.com/mak/projects/robots/norobots.html, June 1994.
[15] J. Lamping, R. Rao, and P. Pirolli, A focus + context technique based on hyperbolic geometry for visualizing large hierarchies (http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/jl_bdy.htm), in: Proceedings of CHI '95, Denver, USA, 1995, pp. 401-408.
[16] A. Lingnau et al., An HTTP-based infrastructure for mobile agents (http://www.w3.org/pub/Conferences/WWW4/Papers/150/), in: Proc. of WWW4, Boston, USA, December 1995.
[17] Y.S. Maarek et al., WebCutter: a system for dynamic and tailorable site mapping (http://proceedings.www6conf.org/HyperNews/get/PAPER10.html), in: Proc. of WWW6, Santa Clara, USA, April 1997.
[18] S. Mukherjea, J.D. Foley, and S. Hudson, Visualizing complex hypermedia networks through multiple hierarchical views (http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/sm_bdy.htm), in: Proceedings of CHI '95, Denver, USA, 1995, pp. 331-337.
[19] G.G. Robertson, J.D. Mackinlay, and S.K. Card, Cone trees: animated 3D visualizations of hierarchical information, in: Proceedings of CHI '91, New Orleans, USA, April-May 1991, pp. 189-194.
[20] G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management, 24(5): 513-523, 1988.
[21] E. Selberg, O. Etzioni, Multi-service search and comparison using the MetaCrawler (http://www.w3.org/pub/Conferences/WWW4/Papers/169/), in: Proc. of WWW4, Boston, USA, December 1995.
[22] J. Shakes et al., Dynamic reference sifting: a case study in the homepage domain (http://www6.nttlabs.com/HyperNews/get/PAPER39.html), in: Proc. of WWW6, Santa Clara, USA, April 1997.
[23] S. Spetka, The TkWWW robot: beyond browsing (http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/spetka/spetka.html), in: Proc. of WWW2: Mosaic and the Web, 1994.
[24] D. Tunkelang, Applying polynomial approximation to graph drawing (http://www.cs.cmu.edu/~quixote/proposal.ps), Ph.D. thesis proposal, Carnegie Mellon University, 1997.
[25] Yahoo! Inc., Yahoo!, http://www.yahoo.com/


Robert Miller is a Ph.D. student in Computer Science at Carnegie Mellon University. His research interests include Web automation, end-user programming, programming-by-demonstration, and automatic text processing. He earned B.S. and M.Eng. degrees in Computer Science from the Massachusetts Institute of Technology in 1995.

Krishna Bharat is a member of the research staff at Digital Equipment Corporation's Systems Research Center. His current research interests include Web content discovery and retrieval, user interface issues in automating tasks on the Web, and speech interaction with hand-held devices. Dr. Bharat received his Ph.D. in Computer Science from Georgia Institute of Technology in 1996, where he worked on tool and infrastructure support for building distributed user interface applications.