
10. Personalized and Focused Web Spiders

Michael Chau and Hsinchun Chen

Department of Management Information Systems, The University of Arizona, Tucson, Arizona 85721, USA

Abstract. As the size of the Web continues to grow, searching it for useful information has become increasingly difficult. Researchers have studied different ways to search the Web automatically using programs that have been known as spiders, crawlers, Web robots, Web agents, Webbots, etc. In this chapter, we will review research in this area, present two case studies, and suggest some future research directions.

10.1 Introduction

The number of indexable pages on the World-Wide Web has exceeded 2 billion and is still growing at a substantial rate [10.60]. It has become increasingly difficult to retrieve information on the Web. To address this problem, many programs have been built to automatically retrieve Web pages. These programs have been known by a variety of names: spiders, crawlers, Web robots, Web agents, Webbots, wanderers, and worms, among others. The term "spiders" will be used throughout this chapter.

Web spiders have been defined as "software programs that traverse the World Wide Web information space by following hypertext links and retrieving Web documents by standard HTTP protocol" [10.28]. By a broader definition, they can include any software that automatically retrieves Web documents by standard HTTP protocol, either by following hypertext links or by other methods. As such they include programs such as metasearch spiders (spiders that connect to different search engines and combine the results) [10.23, 10.76]. In the remainder of this chapter, our discussion accepts the broader definition. Other Web robots such as shopbots [10.34] and chatbots or chatterbots [10.87] are generally not considered as spiders.

Research in spiders began in the early 1990's, shortly after the World-Wide Web began to attract increasing traffic and attention. Wanderer, written in 1993, was claimed to be the first spider for the Web [10.42]. Many different versions of spiders have since been developed and studied. An overview of Web spider research is given in the following subsection.

10.1.1 Web Spider Research

In general, Web spider research directions can be classified into the following categories:

1. Speed and efficiency. In this category, researchers study different ways to increase the harvest speed of a spider. These projects focus on building fast spiders that can be scaled up to large collections by applying program-optimization techniques to operations such as I/O procedures and IP address lookup. Mercator [10.46, 10.47], Internet Archive's crawler [10.13, 10.50], and Google's crawler [10.11] are some examples. Currently, sophisticated spiders can download more than 10 million documents per day on a single workstation.

2. Spidering policy. Research in this category studies the behaviors of spiders and their impacts on other individuals and the Web as a whole. A well-designed, "polite" spider should avoid overloading Web servers [10.38]. Also, Webmasters or Web page authors should be able to specify whether they want to exclude particular spiders' access. There are two standard ways. The first one, called the robot exclusion protocol, allows Web site administrators to indicate, by specifying a file named robots.txt in the Web site's root directory, which parts of their site should not be visited by a robot [10.53]. In the second method, usually known as the robots META tag, Web page authors can indicate to visiting robots whether a document may be indexed, or used to extract more links [10.63]. Although these standards are not strictly enforced, most commercial spiders are reported to follow them. Some studies survey the use of these standards in Web sites and investigate their potential impacts [10.35]. (A minimal robots.txt check is sketched after this list.)

3. Information retrieval. Most Web spider research falls into this category. These studies investigate how different spidering algorithms and heuristics can be used so that spiders can retrieve relevant information from the Web more effectively. Many of these studies apply to Web spiders techniques that have been shown to be effective in traditional information retrieval applications, e.g., the vector space model [10.75]. In this chapter, we focus mainly on research in this category.
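To make the robot exclusion protocol from item 2 concrete, here is a minimal sketch of a politeness check using Python's standard urllib.robotparser module; the site URL and user-agent string are illustrative, not taken from the chapter.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (robot exclusion protocol).
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# A polite spider consults the parsed rules before fetching a page.
url = "http://www.example.com/private/page.html"
if rp.can_fetch("MySpider", url):
    print("Allowed to fetch", url)
else:
    print("Excluded by robots.txt:", url)
```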

10.1.2 Applications of Web Spiders

Spiders have been shown to be useful in various Web applications. There are four main areas where spiders have been widely used:

1. Personal search. Personal spiders try to search for Web pages of interest to a particular user. Because these spiders usually run on the client machine, more computational power is available for the search process and more functionalities are possible [10.17]. This will be discussed in more detail in Sect. 10.2.

2. Building collections. Web spiders have been extensively used to collect Web pages that are needed to create the underlying index of any search engine [10.11, 10.47, 10.62, 10.72]. In addition to building search engines, spiders can also be used to collect Web pages that are later processed to serve other purposes. For example, Bharat and Broder [10.8] used a spider to crawl the Yahoo hierarchy to create a lexicon of 400,000 words; Henzinger et al. [10.45] used Mercator to do random URL sampling from the Web; many others have used spiders to collect email addresses or resumes from the Web. More details on this type of spider will be presented in Sect. 10.3.

3. Archiving. A few projects have tried to archive particular Web sites or even the whole Web [10.50]. To meet the challenge of the enormous size of the Web, fast, powerful spiders are developed and used to download targeted Web sites into storage tapes.

4. Web statistics. The large number of pages collected by spiders is often used to provide useful, interesting statistics about the Web. Such statistics include the total number of servers on the Web, the average size of an HTML document, or the number of URLs that return a 404 (page not found) response. These statistics are useful in many different Web-related research projects, and many spiders have been designed primarily for this purpose [10.12, 10.42].

10.1.3 Analysis of Web Content and Structure

There has been much research on different ways of representing and analyzing the content and structure of the Web, which are very important to Web spiders that may need to rely on such information to guide their searches. In this section, different analysis techniques that are relevant to Web spider research will be reviewed. In general, Web analysis techniques can be classified into two main categories: (1) content-based approaches, and (2) link-based approaches.

In content-based approaches, the actual HTML content of a Web page is analyzed to induce information about the page. For example, the body text of a Web page can be analyzed to determine whether the page is relevant to a target domain. Indexing techniques can be used to extract the key concepts that represent a page. In addition, the relevance of a page can often be determined by looking at the title. Words and phrases that appear in the title or headings in the HTML structure are usually assigned a higher weight [10.11, 10.15].

Domain knowledge also can be incorporated into an analysis to improve the results. For example, words can be checked against a list of domain-specific terminology. A Web page containing words that are found in the list can be considered more relevant.

The URL address of a Web page often contains useful information about the page. For example, from the URL

"http://ourworld.compuserve.com/homepages/LungCancer/",

we can tell that the URL comes from the domain compuserve.com, and that it is likely to be related to the topic Lung Cancer. We also know that this page comes from a .com site, which may be considered less authoritative than pages from a .gov site. Some metrics also consider URLs with fewer slashes more useful than those with more slashes [10.2].
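As an illustration of these URL-based heuristics, the following sketch scores a URL by domain-term matches, top-level domain, and slash count; the term list and all weights are hypothetical, chosen only to show the idea.

```python
from urllib.parse import urlparse

# Hypothetical domain-specific vocabulary; in practice this would come
# from a curated lexicon for the target domain.
DOMAIN_TERMS = {"cancer", "lung", "oncology"}

def url_score(url: str) -> float:
    """Score a URL with the simple heuristics described above."""
    parsed = urlparse(url.lower())
    score = 0.0
    # Reward URLs whose host or path mentions a domain-specific term.
    text = parsed.netloc + parsed.path
    score += sum(1.0 for term in DOMAIN_TERMS if term in text)
    # Treat .gov/.edu hosts as more authoritative than .com sites.
    if parsed.netloc.endswith((".gov", ".edu")):
        score += 1.0
    # Penalize deep paths: fewer slashes are considered more useful [10.2].
    score -= 0.1 * parsed.path.count("/")
    return score

print(url_score("http://ourworld.compuserve.com/homepages/LungCancer/"))
```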

Link-based approaches have attracted increasing attention in recent years as Web link structure has come to be used to infer important information about pages. The basic assumption is that if the author of a Web page A places a link to a Web page B, he or she believes that B is relevant or similar to A, or of good quality [10.44]. We use the term in-links to indicate the hyperlinks pointing to a given page. Usually, the larger the number of in-links, the better a page is considered to be. The rationale is that a page referenced by more people is likely to be more important than a page that is seldom referenced. This is similar to citation analysis, in which an often-cited article is considered better than one never cited.

Link analysis has been used in more and more applications in recent years. Link analysis techniques were first used to guide searching in spider applications [10.30, 10.73, 10.79]. Focused Crawler [10.16] and HyPursuit [10.86] use hyperlink information to enhance Web page classification and clustering, respectively. Link analysis has also been applied to identify cyber communities on the Web (groups of individuals who share a common interest, together with the Web pages most popular among them) [10.39, 10.54].

By analyzing the pages containing a link of interest, it is also possible to obtain the anchor text that describes the link. Anchor text is the underlined, clickable text of an outgoing link in a Web page. Anchor text may provide a good description of the target page because it represents other people's actual description of the page. Several studies have tried to make use of anchor text or any text nearby to predict the content of the target page [10.1, 10.3]. Some studies also analyze the text that appears near a hyperlink [10.74].

In addition, it is reasonable to give a link from an authoritative source (such as Yahoo) a higher weight than a link from an unimportant personal homepage. Researchers have developed several algorithms to address this issue. Among these, PageRank [10.11] and HITS [10.51] are the two most widely used.

The PageRank algorithm computes a page's score by weighting each in-link to the page proportionally to the quality of the page containing the in-link [10.11]. The quality of these referring pages is also determined by PageRank. Thus, the PageRank of a page p is calculated recursively as follows:

PageRank(p) = (1 - d) + d \sum_{q \in in(p)} PageRank(q) / c(q)    (10.1)

where d is a damping factor between 0 and 1, and c(q) is the number of outgoing links in q.

Intuitively, a Web page can have a high PageRank if the page is linked from many other pages, and the scores will be even higher if these referring pages are also good pages (pages that have high PageRank scores). This is illustrated in Fig. 10.1. It is also interesting to note that the PageRank algorithm follows a random walk model: the PageRank of a page is proportional to the probability that a random surfer clicking on random links will arrive at that page.

The PageRank algorithm, applied in the commercial search engine Google, has been shown to be very effective for ranking search results [10.11]. Computation time, however, is a major problem in using PageRank. The PageRank score of each Web page has to be calculated iteratively, making it computationally expensive [10.43].
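A minimal sketch of the iterative computation in Eq. 10.1 on a toy in-memory link graph may clarify the procedure; the graph, damping factor, and convergence tolerance below are illustrative, not a production implementation.

```python
# Iterative PageRank (Eq. 10.1) over a dict {page: [pages it links to]}.
def pagerank(graph, d=0.85, max_iter=50, tol=1.0e-6):
    pages = list(graph)
    scores = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        new_scores = {}
        for p in pages:
            # Sum contributions from every page q that links to p,
            # each weighted by 1/c(q), the number of q's outgoing links.
            incoming = sum(scores[q] / len(graph[q])
                           for q in pages if p in graph[q])
            new_scores[p] = (1 - d) + d * incoming
        delta = max(abs(new_scores[p] - scores[p]) for p in pages)
        scores = new_scores
        if delta < tol:          # stop once the scores have converged
            break
    return scores

# Toy example: A and B link to each other; both link to C.
print(pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": []}))
```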

Fig. 10.1. PageRank and HITS: The PageRank score of a page p depends on the PageRank scores of the pages pointing to p (q1 to qn). In the HITS algorithm, the Authority score of a page p depends on the Hub scores of the pages pointing to p (q1 to qn); the Hub score of a page p depends on the Authority scores of the pages p is pointing to (r1 to rm)

Kleinberg [10.51] proposed the HITS (hyperlink-induced topic search) algorithm, which is similar to PageRank. In the HITS algorithm, authority pages are defined as high-quality pages related to a particular topic or search query. Hub pages are those that are not necessarily an authority themselves but provide pointers to other authority pages. A page to which many others point should be a good authority, and a page that points to many others should be a good hub. Based on this intuition, two scores are calculated in the HITS algorithm for each Web page: an authority score and a hub score, as illustrated in Fig. 10.1. They are calculated as follows:

AuthorityScore(p) = \sum_{q \in in(p)} HubScore(q)    (10.2)

HubScore(p) = \sum_{r \in out(p)} AuthorityScore(r)    (10.3)

A page with a high authority score is one pointed to by many hubs, and a page with a high hub score is one that points to many authorities. One example that applies the HITS algorithm is the Clever search engine [10.14], which achieves a higher user evaluation than the manually compiled directory of Yahoo. The Teoma search engine also uses a similar algorithm in its ranking process. Bharat and Henzinger [10.9] have added several extensions to the basic HITS algorithm, such as regulating how much a node, based on its relevance, influences its neighbors. Similarly to the PageRank algorithm, a drawback of the HITS algorithm is its high computational requirement, because the hub and authority scores have to be calculated iteratively.
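A minimal sketch of the HITS iteration in Eqs. 10.2 and 10.3, again over a toy in-memory link graph; the normalization step is a standard addition (not spelled out above) that keeps the scores from growing without bound.

```python
# HITS score iteration over a dict {page: [pages it links to]}.
def hits(graph, iterations=20):
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages pointing to p (Eq. 10.2).
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        # Hub score: sum of authority scores of pages p points to (Eq. 10.3).
        hub = {p: sum(auth[r] for r in graph[p]) for p in graph}
        # Normalize both vectors so the iteration converges.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```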

10.1.4 Graph Traversal Algorithms

Traditional graph search algorithms have been extensively studied in the field of computer science. Since most researchers view the Web as a directed graph with a set of nodes (pages) connected with directed edges (hyperlinks), some of these algorithms have been applied in Web applications. In this section, we review three categories of graph search algorithms that are relevant to our study, namely, (1) uninformed search, (2) informed search, and (3) parallel search [10.71, 10.90].

The first category of graph search algorithms consists of simple algorithms such as breadth-first search and depth-first search. These are also known as uninformed search, as they do not make use of any information to guide the search process. Breadth-first search, one of the most popular methods used in Web spiders, collects all pages on the current level before proceeding to the next level. Although these algorithms are easy to implement and use in different applications, they are usually not very efficient because of their simplicity.
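A minimal sketch of breadth-first crawling with a first-in-first-out queue; fetch_links is a hypothetical helper that downloads a page and returns the URLs it links to.

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    queue = deque(seed_urls)           # FIFO queue gives level-by-level order
    visited = set(seed_urls)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        for link in fetch_links(url):  # expand the current page's out-links
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return visited
```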

The second category is informed search, in which some information about each search node is available during the search process. Such information is used as the heuristics to guide the search. Best-first search, which explores the most promising node at each step, is one widely used example. This class of algorithms has been studied in different search engine spiders and search agent systems with different variations [10.21, 10.30]. Different metrics, such as number of in-links, PageRank score, keyword frequency, and similarity to the search query, have been used as guiding heuristics.
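Best-first search differs from the sketch above only in the frontier data structure: a priority queue ordered by the guiding heuristic. Here fetch_links and score are hypothetical helpers; score stands in for any of the metrics just mentioned.

```python
import heapq

def best_first_crawl(seed_urls, fetch_links, score, max_pages=100):
    # heapq is a min-heap, so push negated scores to pop the best URL first.
    frontier = [(-score(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited = set(seed_urls)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)   # most promising URL next
        collected.append(url)
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return collected
```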

Another category is parallel search. Algorithms in this category try to explore different parts of a search space in parallel. One example is the spreading activation algorithm used in artificial neural network models, which tries to achieve human-like performance by modeling the human nervous system. A neural network is a graph of many active nodes (neurons) that are connected by weighted links (synapses). A neural network uses spreading activation over the nodes to represent and retrieve concepts and knowledge [10.5, 10.25, 10.55]. Another example is genetic algorithms, which have increasingly been used in optimization problems such as financial portfolio optimization and resource allocation. A genetic algorithm is an evolutionary process designed to search for an optimal solution through crossover and mutation operations [10.66]. Gordon [10.41] provides a model for using genetic algorithms in textual analysis. Chen et al. [10.21] extended the model and used it in a Web spider. Although these algorithms are powerful and have been used in traditional information retrieval applications [10.19], they have not been widely applied in Web applications.


10.2 Web Spiders for Personal Search

Many Web spiders have been developed to help individual users search for useful information on the Web. Because these spiders usually run on the client machine, more CPU time and memory can be allocated to the search process and more functionalities are possible. These tools also allow users to have more control and personalization options during the search process.

10.2.1 Personal Web Spiders

tueMosaic is a prominent early example of personal Web spiders [10.32]. Using tueMosaic, users can enter keywords, specify the depth and width of search for links contained in the current homepages displayed, and request the spider agent to fetch homepages connected to the current homepage. tueMosaic uses the "fish search" algorithm, a modified best-first search method. Since its introduction, many more powerful personal spiders have been developed.

Some spiders have been designed to provide additional functionalities. The TkWWW robot is a program integrated in the TkWWW browser [10.80]. It can be dispatched from the browser to search Web neighborhoods for relevant pages, and it returns a list of links that look promising. SPHINX, a spider written in Java, allows users to perform breadth-first search and view the search results as a two-dimensional graph [10.67]. CI Spider performs linguistic analysis and clustering of the search results [10.20]. Collaborative Spider, an extended version of CI Spider, is a multi-agent system designed to improve search effectiveness by sharing relevant search sessions among users [10.18].

In other studies, spiders use more advanced algorithms during the search process. The Itsy Bitsy Spider searches the Web using a best-first search and a genetic algorithm approach [10.21, 10.22]. Each URL is modeled as an individual in the initial population. Crossover is defined as extracting the URLs that are pointed to by multiple starting URLs. Mutation is modeled by retrieving random URLs from Yahoo. Because the genetic algorithm is an optimization process, it is well suited to finding the best Web pages according to particular criteria. Webnaut is a later spider that also applies genetic algorithms [10.70]. Other advanced search algorithms also have been used in personal spiders. Yang et al. [10.91] apply hybrid simulated annealing in a personal spider application. Focused Crawler locates Web pages relevant to a pre-defined set of topics based on example pages provided by the user. In addition, it also analyzes the link structures among the Web pages collected [10.16]. Context Focused Crawler uses a Naive Bayesian classifier to guide the search process [10.33]. Relevance feedback has also been applied in spiders [10.4, 10.84].
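The genetic operators just described can be sketched roughly as follows. This is not the published Itsy Bitsy implementation: fetch_links, fitness, and random_urls are hypothetical helpers (fitness would measure similarity to the user's starting pages, and random_urls stands in for retrieving random URLs from a directory such as Yahoo), and all parameters are illustrative.

```python
import random

def ga_spider(seed_urls, fetch_links, fitness, random_urls,
              generations=10, population_size=20, mutation_rate=0.1):
    population = list(seed_urls)
    for _ in range(generations):
        # Crossover: favor URLs pointed to by multiple pages in the population.
        counts = {}
        for url in population:
            for link in fetch_links(url):
                counts[link] = counts.get(link, 0) + 1
        offspring = [u for u, n in counts.items() if n > 1] or list(counts)
        # Mutation: occasionally inject random URLs from an outside source.
        if random.random() < mutation_rate:
            offspring.extend(random_urls())
        # Selection: keep the fittest URLs for the next generation.
        population = sorted(set(population) | set(offspring),
                            key=fitness, reverse=True)[:population_size]
    return population
```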

Many commercial Web spiders are also available. WebRipper, WebMiner, and Teleport are examples of software that help users download specified files from given Web sites. Excalibur's RetrievalWare and Internet Spider collect, monitor, and index information from text documents on the Web as well as graphic files. Autonomy's products support a wide range of information collection and analysis tasks, which include automatic searching and monitoring of information sources in the Internet and corporate Intranets, and classifying documents into categories predefined by users or domain experts. Verity's knowledge-management products, such as Agent Server, Information Server, and Intelligent Classifier, also perform similar tasks in an integrated manner.

Another important category of personal spiders is composed of metaspiders, programs that connect to different search engines to retrieve search results. Lawrence and Giles [10.58] showed that the best search engine covered only about 16% of Web sites in 1999. Therefore, combining results from different search engines achieves more comprehensive coverage. MetaCrawler was the first metaspider reported [10.76, 10.77]. It provides a single interface allowing users to search simultaneously in six different search engines, namely Lycos, WebCrawler, Infoseek, Galaxy, OpenText, and Yahoo. MetaCrawler, now with many more search options available, is still in service.

Following the success of MetaCrawler, many metasearch services have been developed. Profusion allows users to choose among a list of six search engines that can be queried [10.40]. 37.com connects with 37 different search engines. Dogpile provides metasearch service for news, Usenet, white pages, yellow pages, images, etc., in addition to Web pages. SavvySearch connects with a large number of general and topic-specific search engines and selects those likely to return useful results based on past performance [10.49]. Similarly, Yu et al. [10.92] use link analysis to decide which are the appropriate search engines to be used, and Chen and Soo [10.27] use domain ontology in metasearch agents to assist users in query formulation. Blue Squirrel's WebSeeker and Copernic's software connect with other search engines, monitor Web pages for any changes, and schedule automatic searches. Grouper, an extension of MetaCrawler, clusters the search results from various search engines based on a suffix-tree clustering algorithm [10.93].

In addition to getting a list of URLs and summaries returned by other search engines, some metaspiders download and analyze all the documents in the result set. Inquirus, also known as the NECI metasearch engine, downloads actual result pages and generates a new summary of each page based on the search query. Pages that are no longer available (dead links) or which no longer contain the search terms are filtered from the search results [10.56, 10.57]. Meta Spider provides the same functionalities as Inquirus, but also performs linguistic analysis and clustering of the search results [10.23]. Another similar system is TetraFusion, which performs hierarchical and graph-based classification on the result set [10.31]. Focused Crawler [10.16] and Fetuccino [10.6] use search results from popular search engines and expand the result set based on these URLs. Dwork et al. [10.36] proposed the use of a Markov chain to combine search results from different search engines and achieved promising experimental results.

Recently, search spiders have also been developed on the basis of peer-to-peer (P2P) technology, following the success of other P2P applications such as Napster. JXTA Search, formerly known as InfraSearch, uses Gnutella as its backbone and links to other computers when a request is received from a user. The request is passed to neighboring computers to see if any of them can fulfil the request. Each computer can have its own strategy on how to respond to the request [10.85].

10.2.2 Case Study

In this section, we present the architecture of two search agents enhanced with postretrieval analysis capabilities. Competitive Intelligence Spider, or CI Spider, is a search agent that collects Web pages on a real-time basis from Web sites specified by the user and performs indexing and categorization analysis on them, to provide the user with a comprehensive view of the Web sites of interest [10.20]. The second tool, Meta Spider, has functionalities similar to those of the CI Spider but, instead of performing breadth-first search on a particular Web site, connects to different search engines on the Internet and integrates the results [10.23]. The architecture of CI Spider and Meta Spider is shown in Fig. 10.2. There are four main components, namely (1) user interface, (2) Internet spiders, (3) Arizona Noun Phraser, and (4) self-organizing map (SOM). These components work together as a unit to perform Web search and analysis. The Arizona Noun Phraser, developed at the University of Arizona, is the indexing tool used to index the key phrases that appear in each document collected from the Web by the Internet spiders. It extracts all the noun phrases from each document based on part-of-speech tagging and linguistic rules [10.83].

The SOM employs an artificial neural network algorithm to automatically cluster the Web pages collected into different regions on a two-dimensional map [10.24, 10.26, 10.52]. Each document is represented as an input vector of keywords, and a two-dimensional grid of output nodes is created. After the network is trained, the documents are submitted to the network and clustered into different regions. Each region is labeled by the phrase that is the key concept most accurately representing the cluster of documents in that region. More important concepts occupy larger regions, and similar concepts are grouped in a neighborhood [10.59]. The map is displayed through the user interface, and the user can view the documents in each region by clicking on it.
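A rough sketch of SOM training for document vectors is given below; it is a generic SOM, not the authors' implementation, and the grid size, learning rate, and neighborhood radius are illustrative.

```python
import numpy as np

def train_som(doc_vectors, grid=(5, 5), epochs=100, lr=0.5, radius=2.0):
    """Train a SOM on an array of document keyword vectors (n_docs x dim)."""
    _, dim = doc_vectors.shape
    # One weight vector per output node on the 2-D grid.
    weights = np.random.rand(grid[0], grid[1], dim)
    coords = np.array([[i, j] for i in range(grid[0]) for j in range(grid[1])])
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs    # shrink learning rate and radius
        for vec in doc_vectors:
            # Find the best-matching unit (node closest to the document).
            dists = np.linalg.norm(weights.reshape(-1, dim) - vec, axis=1)
            bmu = coords[dists.argmin()]
            # Move the BMU and its grid neighborhood toward the document.
            grid_dist = np.linalg.norm(coords - bmu, axis=1)
            influence = np.exp(-(grid_dist ** 2) / (2 * (radius * decay) ** 2))
            weights = weights + (lr * decay
                                 * influence.reshape(grid[0], grid[1], 1)
                                 * (vec - weights))
    return weights

# After training, each document is assigned to its best-matching node, and
# regions of nodes can be labeled with their dominant key phrase.
```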

Two separate experiments have been conducted to evaluate CI Spider and Meta Spider. The experimental tasks were designed to permit evaluation of how a combination of their functionalities performed in identifying the major themes related to a certain topic being searched. Thirty undergraduate subjects were recruited to participate in each experiment.

In the first experiment, CI Spider was compared with Lycos and manual "within-site" browsing and searching. Our experimental results showed that both the precision and recall rates for CI Spider were significantly higher than those of Lycos at a 5% significance level. CI Spider's usability also achieved a statistically higher value than those of Lycos and within-site browsing and searching.

In the second experiment, Meta Spider was compared with MetaCrawler and NorthernLight. In terms of precision, Meta Spider performed better than either of these, and the difference from NorthernLight was statistically significant. Meta Spider's recall rate was comparable to that of MetaCrawler and better than that of NorthernLight.


Fig. 10.2. Architecture of CI Spider and Meta Spider

We suggest that the main reason for the high precision rate of CI Spider and Meta Spider is their ability to fetch and verify the content of each Web page in real time. This means these two spiders can ensure that every page shown to the user contains the keyword being searched. On the other hand, the indexes of Lycos and NorthernLight, like those of most other search engines, were often outdated. The high recall rate of CI Spider is mainly attributable to its exhaustive searching characteristic. Lycos showed the lowest recall rate because, like most other commercial search engines, it samples only a number of Web pages in each Web site, thereby missing other pages that contain the search keyword. A user performing manual within-site browsing and searching is likely to miss some important pages because the process is mentally exhausting. Many subjects also commented that they liked the postretrieval capabilities of the Arizona Noun Phraser and the SOM.

10.3 Using Web Spiders to Create Specialized Search Engines

Use of Web search engines such as Google, AltaVista, NorthernLight, Excite, Lycos, and Infoseek is the most popular way to look for information on the Web. Many users begin their Web activities by submitting a query to a search engine. All these search engines rely on spiders to collect Web pages that are then processed to create their underlying search indexes. Examples include AltaVista's Scooter, Google's Googlebot, Lycos's T-Rex, and Excite's Architext.

Search engine spider research began as early as 1994 with World Wide Web Worm, the first widely used search engine spider, which indexed only a document's title and header [10.64]. The repository-based software engineering (RBSE) Spider was the first spider to have full-text indexing capabilities [10.37]. Soon after, many full-text indexing spiders were developed, including WebCrawler [10.72], Lycos [10.62], and Harvest [10.10]. All these spiders follow a simple breadth-first search algorithm that is still widely used now by the spiders behind most major commercial general-purpose search engines to crawl the Web. Some research has also studied the use of an incremental spider that tries to collect only Web pages that have changed [10.29].

10.3.1 Specialized Search Engines

Because of the enormous size of the Web, general-purpose search engines can no longer satisfy all the needs of those who are searching for more specific information. Many specialized search engines have been built to address various problems. These search engines specialize in particular Web site(s), topics (such as computers or medicine), languages (such as Chinese or Japanese), file types (such as images or research papers), and so on. Because their size is more manageable (much smaller than the entire Web), these search engines usually provide more precise results and more customizable functions. For instance, BuildingOnline specializes in searching in the building industry domain, CollegeBot searches for educational resources, and LawCrawler specializes in searching for legal information on the Web. There are also content-type-specific search engines. For example, DejaNews searches for news articles, and WebSeek searches for image files [10.78].

Like general-purpose search engines, these search engines rely on spiders to collect Web pages. This task is relatively easy for site-specific search engines, in which spiders can be restricted to downloading only Web pages with a given domain name in the URL [10.61, 10.82, 10.88, 10.89]. For example, spiders can be instructed to discard any URL not starting with the string "http://www.arizona.edu/". However, the task becomes more difficult for other specialized search engines, in which the spiders need to address two main problems:

1. The spiders need to identify, from a list of unvisited URLs, the ones most likely to contain relevant information. To improve efficiency, a spider should visit such Web pages first.

2. For each downloaded document, the spiders need to determine its relevance according to a specific purpose. The spiders should avoid irrelevant or poor-quality documents by determining the quality and reputation of each document.

This kind of search engine spider is sometimes also known as a focused spider or a focused crawler. In recent years, different focused search engine spiders have been developed and evaluated. We will next focus on research investigating the use of efficient spidering algorithms that aim to address the first problem.

10.3.2 Focused Spidering Algorithms for Specialized Search Engines

Despite its simplicity, breadth-first search is widely used in specialized search engines, primarily because it is easy to implement and fast to execute. Intuitively, if a URL is relevant to a target domain, it is likely that the Web pages in its neighborhood are also relevant. It has been shown that breadth-first search can discover high-quality pages early on in a spidering process. As the most important pages have many links pointing to them from numerous hosts, those links usually can be found early in the search process [10.68]. Many people choose to use free tools, such as WebGlimpse [10.61], Alkaline, ht://dig [10.82], and Greenstone [10.88], to build specialized search engines. While they work well for building site-specific search engines, it is more difficult to use them for building topic-specific search engines because there is no heuristic for locating and identifying relevant Web pages. To alleviate this problem, users are usually allowed to specify the maximum depth that a spider should explore [10.61, 10.81]. However, this tactic is usually inadequate and results in collections that are too diverse for specific topics.

Best-first search is another widely used algorithm in focused spiders. Depending on the application, different heuristics can be used, such as keyword frequency, similarity to starting examples, number of in-links, or PageRank score. Cho et al. [10.30] have shown that, in an experiment on the Stanford Web site, a best-first search spider using PageRank score performed better than a breadth-first search spider or a best-first search spider using the number of in-links. Chakrabarti et al. [10.16] combined similarity and link analysis in Focused Crawler, which visits URLs based on each page's probability of having relevant content.

Machine learning techniques also have been applied in search engine spiders. McCallum et al. [10.74] used reinforcement learning to design a spider that traverses the Web based on immediate and future reward as measured in terms of Web page relevance. That spider was used in Cora, a computer science research paper search engine [10.65].

10.3.3 Case Study

Seeking to combine different Web content and structure analysis techniques with traditional graph search techniques to build spider programs for vertical search engines, we developed and compared three versions of Web spiders, namely, (1) Breadth-First Search Spider, (2) PageRank Spider, and (3) Hopfield Net Spider. In this section, we describe the designs and approaches adopted in our study.

The Breadth-First Search Spider (or BFS Spider) collects all Web pages on the current level before proceeding to the next level. In other words, it visits URLs based on the order in which they are discovered. This is implemented using a first-in-first-out queue like that generally used in breadth-first search applications. It runs until the required number of pages are collected.

The PageRank Spider was adapted from the algorithm reported in [10.30]. Aiming to combine link-based analysis and a heuristics-based traversal algorithm, it was designed to perform best-first search using PageRank (as described earlier) as its heuristics. URLs with higher PageRank scores are to be visited earlier.

In each step, the spider gets the URL with the highest PageRank score, fetches the content, and extracts and enqueues all the outgoing links in the page. The process runs until the required number of pages have been collected. The PageRank score is calculated iteratively using the algorithm described earlier until convergence is reached. The damping factor d is set to 0.90 in our implementation. The hot queue approach used in the original study has also been adopted in the PageRank Spider for anchor text analysis. Two priority queues are established: hot queue and normal queue. The URLs within each queue are ordered by PageRank score in descending order. The spider first dequeues from the hot queue. If the hot queue is empty, the spider dequeues from the normal queue. In our design, a URL will be placed in the hot queue if the anchor text pointing to this URL contains a relevant term.
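The two-queue frontier can be sketched as follows; the class name and the relevant-term test are illustrative, and the PageRank score supplied to enqueue is assumed to be computed elsewhere as described above.

```python
import heapq

class TwoQueueFrontier:
    def __init__(self):
        self.hot, self.normal = [], []

    def enqueue(self, url, score, anchor_text, relevant_terms):
        # URLs whose anchor text contains a relevant term go to the hot queue.
        queue = (self.hot
                 if any(t in anchor_text.lower() for t in relevant_terms)
                 else self.normal)
        heapq.heappush(queue, (-score, url))  # min-heap, so negate the score

    def dequeue(self):
        # Always drain the hot queue before touching the normal queue.
        for queue in (self.hot, self.normal):
            if queue:
                return heapq.heappop(queue)[1]
        return None
```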

In the Hopfield Net Spider, the Web is viewed as a large network structure of massive, distributed knowledge composed of pages and hyperlinks contributed by all Web page authors. This can be viewed as a neural network, in which nodes are represented by pages and links are simply represented by hyperlinks. In this approach we model the Web as a Hopfield Net, which is a single-layered, weighted neural network [10.48]. Nodes are activated in parallel, and activation values from different sources are combined for each individual node until the activation scores of the nodes on the network reach a stable state (convergence).

Based on this spreading activation algorithm, which has been shown to be effective for knowledge retrieval and discovery in a Hopfield Net, we developed the Hopfield Net Spider to perform searching on the Web. In this approach, we aimed to combine a parallel search algorithm with content-based and link-based analysis. Our implementation incorporated the basic Hopfield Net spreading activation idea, but significant modification was made to take into consideration the unique characteristics of the Web.

The Hopfield Net Spider starts with a set of seed URLs represented as nodes, and then activates neighboring URLs, combines weighted links, and determines the weights of newly discovered nodes. The process repeats until the required number of URLs have been visited. The algorithm adopted is as follows:

1. Initialization with Seed URLs. An initial set of seed URLs is given to the system and each of them is represented as a node with a weight of 1. w_i(t) is defined as the weight of node i at iteration t:

w_i(0) = 1 for all seed URLs    (10.4)

The spider fetches and analyzes these seed Web pages in iteration 0. The new URLs found in these pages are added to the network.

2. Activation, Weight Computation, and Iteration. Proceeding to the next iteration, the weight of each node is calculated as follows:

w_i(t+1) = f_s\left( \sum_{h \in in(i)} w_{h,i} \, w_h(t) \right)    (10.5)

where w_{h,i} is the weight of the link between two nodes, and f_s is a sigmoid transformation function that normalizes a weight to a value between 0 and 1. Because w_{h,i} is the weight of the linkage between two URLs, it tries to estimate whether a URL i pointed to from a Web page h is relevant to the target domain, based on the use of anchor text in h. This score is calculated as a function of the number of anchor-text words relevant to the target domain. We use a slightly modified sigmoid function to make sure the resulting value is a positive number.
After the weights of all the nodes in the current iteration have been calculated, the spider needs to decide which node (URL) should be activated (visited) first. As the weights decide the order in which URLs are to be visited, they are critical to the effectiveness of the algorithm. The set of nodes in the current iteration are then visited and fetched from the Web in descending order of weight. In order to filter out low-quality URLs, nodes with a weight smaller than a threshold θ are not visited. The activation process is illustrated in Fig. 10.3.

Fig. 10.3. Spreading activation: Starting with a set of seed URLs, the Hopfield Net Spider activates neighboring URLs, combines weighted links, and determines the weights of newly discovered nodes. Nodes with a low weight (e.g., node 7 and node 24) are discarded

After all the pages with a weight greater than θ have been visited and downloaded, the weight of each node in the new iteration is updated to reflect the quality and relevance of the downloaded page content as follows:

w_i(t+1) = f_s( w_i(t+1) + p_i )    (10.6)

where p_i is a weight that represents the relevance of the textual content of a page i. The p_i score is a function of the number of phrases found in a page's content that are relevant to the target domain. A page with more relevant phrases will receive a higher score. Phrases can be extracted from each page using the Arizona Noun Phraser [10.83].

3. Stopping Condition. The above process is repeated until the required number of Web pages have been collected or until the average weight of all nodes in an iteration is less than a maximum allowable error (a small number).
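The overall loop can be sketched as follows. This is a loose illustration of steps 1-3, not the HelpfulMed implementation: fetch, link_weight, and page_score are hypothetical helpers (link_weight would score anchor-text relevance, page_score would count relevant phrases), and a plain sigmoid stands in for the modified transformation described above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hopfield_spider(seed_urls, fetch, link_weight, page_score,
                    theta=0.5, max_pages=500):
    weights = {u: 1.0 for u in seed_urls}      # Eq. 10.4: seeds start at 1
    frontier, visited = set(seed_urls), set()
    while frontier and len(visited) < max_pages:
        next_weights = {}
        # Visit this iteration's nodes in descending order of weight,
        # skipping nodes below the threshold theta.
        for url in sorted(frontier, key=weights.get, reverse=True):
            if weights[url] < theta or url in visited:
                continue
            visited.add(url)
            out_links, anchors, text = fetch(url)
            for link, anchor in zip(out_links, anchors):
                # Eq. 10.5: accumulate weighted activation from all parents.
                w = next_weights.get(link, 0.0)
                next_weights[link] = w + link_weight(anchor) * weights[url]
            # Eq. 10.6: fold in the relevance of the downloaded content.
            weights[url] = sigmoid(weights[url] + page_score(text))
        # Normalize the newly activated nodes and form the next frontier.
        for u, w in next_weights.items():
            weights[u] = sigmoid(w)
        frontier = {u for u in next_weights if u not in visited}
    return visited
```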

The three different approaches were implemented as the backend spiders for a medical search engine called HelpfulMed. A simulation experiment and a user study were carried out to evaluate the precision and execution time of the three spiders. The experimental results show that the Hopfield Net Spider had significantly better precision than both the BFS Spider and the PageRank Spider at the 1% level. The BFS Spider also did significantly better than the PageRank Spider at the 1% level. For execution time, we found that the Hopfield Net Spider used slightly more time than the BFS Spider, while the PageRank Spider spent almost 90 times more execution time than the other two spiders.

10.4 Conclusions

Over the past decade, Web spiders have evolved from simple breadth-first search spiders to intelligent, adaptive spiders. At the same time, the size of the Web has also grown by more than 250,000 times, from 130 Web hosts in June 1993 to more than 38,000,000 hosts in February 2002 [10.42, 10.69]. The content on the Web has also become more diverse in terms of topic, language, file type, encoding method, and so on, with many dynamically generated Web pages. Locating desired information on the Web is still not easy, despite the availability of various search spiders and search engines.

Spiders can be improved and extended in several ways:

– Currently, most spiders can index only static Web pages. As the amount of dynamic content on the Web increases, spiders need to be able to retrieve and manipulate dynamic content autonomously.

– Spiders can perform better indexing by applying computational linguistic analysis to extract meaningful entities rather than mere keywords from Web pages. This will become a more interesting issue as the Semantic Web [10.7] becomes more mature.

– As the quality and credibility of Web pages vary considerably, spiders need to use more advanced intelligent techniques to distinguish between good and bad pages.


– An ideal personal spider should behave like a human librarian who tries to understand and answer user queries through an autonomous or interactive process using natural language.

As we have witnessed, state-of-the-art search services such as WebCrawler and Lycos that were introduced less than a decade ago have been surpassed by services such as Google that utilize newer algorithms. As the Web continues to evolve, spiders and search engines also must evolve in order to accommodate the size and dynamics of the Web.

Acknowledgement

The CI Spider, Meta Spider, and HelpfulMed projects were funded in part by the following grants:

– NSF Digital Library Initiative-2 (PI: H. Chen), "High-performance Digital Library Systems: From Information Retrieval to Knowledge Management," IIS-9817473, April 1999–March 2002;

– NIH/NLM Grant (PI: H. Chen), "UMLS Enhanced Dynamic Agents to Manage Medical Knowledge," 1 R01 LM06919-1A1, February 2001–January 2004;

– NSF/CISE/CSS (PI: H. Chen), "An Intelligent CSCW Workbench: Analysis, Visualization, and Agents," IIS-9800696, June 1998–June 2001.

We would also like to thank all current and previous members of the Artificial Intelligence Lab at the University of Arizona who have contributed to these projects.

10.5 Appendix A: URLs of Spiders and Search Engines

37.com – http://www.37.com/
AltaVista – http://www.altavista.com/
Autonomy – http://www.autonomy.com/
Blue Squirrel's WebSeeker – http://www.bluesquirrel.com/
BuildingOnline – http://www.buildingonline.com/
CollegeBot – http://www.collegebot.com/
Copernic – http://www.copernic.com/
DejaNews – http://www.dejanews.com/
Dogpile – http://www.dogpile.com/
Excalibur – http://www.excalib.com/
Excite – http://www.excite.com/
Google – http://www.google.com/
HelpfulMed – http://ai.bpa.arizona.edu/helpfulmed/
Infoseek – http://infoseek.go.com/
JXTA Search – http://search.jxta.org
LawCrawler – http://www.lawcrawler.com/
Lycos – http://www.lycos.com/
MetaCrawler – http://www.metacrawler.com/
NorthernLight – http://www.northernlight.com/
SavvySearch – http://www.savvysearch.com/
Teoma – http://www.teoma.com/
Verity – http://www.verity.com/
WebSeek – http://www.ctr.columbia.edu/webseek/
Yahoo – http://www.yahoo.com/

References

10.1 E. Amitay: Using Common Hypertext Links to Identify the Best Phrasal Description of Target Web Documents. Proc. the 21st ACM-SIGIR Post-Conference Workshop on Hypertext Information Retrieval for the Web (Melbourne, Australia, 1998)

10.2 A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: Searching the Web. ACM Transactions on Internet Technology, 1 (1), 2-43 (2001)

10.3 R. Armstrong, D. Freitag, T. Joachims, T. Mitchell: WebWatcher: A Learning Apprentice for the World-Wide Web. Proc. the AAAI-95 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments (Stanford, California, USA, 1995)

10.4 M. Balabanovic, Y. Shoham: Learning Information Retrieval Agents: Experiment with Web Browsing. Proc. the AAAI-95 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments (Stanford, California, USA, 1995)

10.5 R.K. Belew: Adaptive Information Retrieval: Using a Connectionist Representation to Retrieve and Learn about Documents. Proc. the 12th ACM-SIGIR Conference on Research and Development in Information Retrieval (Cambridge, Massachusetts, USA, 1989)

10.6 I. Ben-Shaul, M. Herscovici, M. Jacovi, Y.S. Maarek, D. Pelleg, M. Shtalhaim, V. Soroka, S. Ur: Adding Support for Dynamic and Focused Search with Fetuccino. Proc. the 8th World-Wide Web Conference (Toronto, May 1999)

10.7 T. Berners-Lee: Weaving the Web (Harper, San Francisco, 1999)

10.8 K. Bharat, A. Broder: A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. Proc. the 7th International World-Wide Web Conference (Brisbane, Australia, 1998)

10.9 K. Bharat, M.R. Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment. Proc. the 21st ACM-SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998)

10.10 C. Bowman, P. Danzig, U. Manber, M. Schwartz: Scalable Internet Resource Discovery: Research Problems and Approaches. Communications of the ACM, 37 (8) 98-107 (1994)

10.11 S. Brin, L. Page: The Anatomy of a Large-scale Hypertextual Web Search Engine. Proc. the 7th International World-Wide Web Conference (Brisbane, Australia, 1998)

10.12 A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener: Graph Structure in the Web. Proc. the 9th International World-Wide Web Conference (Amsterdam, Netherlands, May 2000)

10.13 M. Burner: Crawling Towards Eternity: Building an Archive of the World-Wide Web. Web Techniques, 2 (5) (1997)

10.14 S. Chakrabarti, B. Dom, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, J. Kleinberg: Mining the Web's Link Structure. IEEE Computer, 32 (8), 60-67 (1999)

10.15 S. Chakrabarti, M. Joshi, V. Tawde: Enhanced Topic Distillation using Text, Markup Tags, and Hyperlinks. Proc. the 24th ACM-SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, USA, Sep 2001)

10.16 S. Chakrabarti, M. van den Berg, B. Dom: Focused Crawling: A New Approach to Topic-specific Web Resource Discovery. Proc. the 8th International World-Wide Web Conference (Toronto, Canada, May 1999)

10.17 M. Chau, D. Zeng, H. Chen: Personalized Spiders for Web Search and Analysis. Proc. the 1st ACM-IEEE Joint Conference on Digital Libraries (Roanoke, Virginia, USA, Jun 2001) pp. 79-87

10.18 M. Chau, D. Zeng, H. Chen, M. Huang, D. Hendriawan: Design and Evaluation of a Multi-agent Collaborative Web Mining System. Decision Support Systems (2002) in press

10.19 H. Chen: Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms. Journal of the American Society for Information Science, 46 (3), 194-216 (1995)

10.20 H. Chen, M. Chau, D. Zeng: CI Spider: A Tool for Competitive Intelligence on the Web. Decision Support Systems (2002) in press

10.21 H. Chen, Y. Chung, M. Ramsey, C.C. Yang: A Smart Itsy-Bitsy Spider for the Web. Journal of the American Society for Information Science, Special Issue on AI Techniques for Emerging Information Systems Applications, 49 (7), 604-618 (1998)

10.22 H. Chen, Y. Chung, M. Ramsey, C.C. Yang: An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching. Decision Support Systems, 23, 41-58 (1998)

10.23 H. Chen, H. Fan, M. Chau, D. Zeng: MetaSpider: Meta-searching and Categorization on the Web. Journal of the American Society of Information Science & Technology, 52 (13), 1134-1147 (1998)

10.24 H. Chen, A. Houston, R.R. Sewell, B. Schatz: Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. Journal of the American Society for Information Science, Special Issue on AI Techniques for Emerging Information Systems Applications, 49 (7) 582-603 (1998)

10.25 H. Chen, T. Ng: An Algorithmic Approach to Concept Exploration in a Large Knowledge Network (Automatic Thesaurus Consultation): Symbolic Branch and Bound Search vs. Connectionist Hopfield Net Activation. Journal of the American Society for Information Science, 46 (5) 348-369 (1995)

10.26 H. Chen, C. Schufels, R. Orwig: Internet Categorization and Search: A Self-organizing Approach. Journal of Visual Communication and Image Representation, 7 (1) 88-102 (1996)

10.27 Y.J. Chen, V.W. Soo: Ontology-based Information Gathering Agents. Proc. the 1st Asia-Pacific Conference on Web Intelligence (Maebashi City, Japan, Oct 2001) pp. 423-427

10.28 F.C. Cheong: Internet Agents: Spiders, Wanderers, Brokers, and Bots (New Riders Publishing, Indianapolis, Indiana, USA, 1996)

10.29 J. Cho, H. Garcia-Molina: The Evolution of the Web and Implications for an Incremental Crawler. Proc. the 26th International Conference on Very Large Databases (VLDB 2000) (Cairo, Egypt, 2000)

10.30 J. Cho, H. Garcia-Molina, L. Page: Efficient Crawling through URL Ordering. Proc. the 7th International World-Wide Web Conference (Brisbane, Australia, Apr 1998)

10.31 F. Crimmins, A.F. Smeaton, T. Dkaki, J. Mothe: TetraFusion: Information Discovery on the Internet. IEEE Intelligent Systems, Jul-Aug, 55-62 (1999)

10.32 P. DeBra, R. Post: Information Retrieval in the World-Wide Web: Making Client-based Searching Feasible. Proc. the 1st International World-Wide Web Conference (Geneva, Switzerland, 1994)

10.33 M. Diligenti, F. Coetzee, S. Lawrence, C.L. Giles, M. Gori: Focused Crawling using Context Graphs. Proc. the 26th International Conference on Very Large Databases (VLDB 2000) (Cairo, Egypt, 2000) pp. 527-534

10.34 R.B. Doorenbos, O. Etzioni, D.S. Weld: A Scalable Comparison-shopping Agent for the World-Wide Web. Proc. the 1st International Conference on Autonomous Agents (Agents '97) (Marina del Rey, California, USA, Feb 1997) pp. 39-48

10.35 M.C. Drott: Indexing Aids at Corporate Websites: The Use of robots.txt and META Tags. Information Processing and Management, 38, 209-219 (2002)

10.36 C. Dwork, R. Kumar, M. Noar, D. Sivakumar: Rank Aggregation Methods for the Web. Proc. the 10th International World-Wide Web Conference (Hong Kong, May 2001)

10.37 D. Eichmann: The RBSE Spider – Balancing Effective Search Against Web Load. Proc. the 1st International World-Wide Web Conference (Geneva, Switzerland, 1994)

10.38 D. Eichmann: Ethical Web Agents. Proc. the 2nd International World-Wide Web Conference (Chicago, Illinois, USA, 1994)

10.39 G.W. Flake, S. Lawrence, C.L. Giles, F. Coetzee: Self-organization of the Web and Identification of Communities. IEEE Computer, 35 (3), 66-71 (2002)

10.40 S. Gauch, G. Wang, M. Gomez: Profusion: Intelligent Fusion from Multiple Different Search Engines. Journal of Universal Computer Science, 2 (9) (1996)

10.41 M. Gordon: Probabilistic and Genetic Algorithms for Document Retrieval. Communications of the ACM, 31 (10) 1208-1218 (1988)

10.42 M. Gray: Internet Growth and Statistics: Credits and Background. [Online]. Available at http://www.mit.edu/people/mkgray/net/background.html (1993)

10.43 T.H. Haveliwala: Efficient Computation of PageRank. Stanford University Technical Report. Available at http://dbpubs.stanford.edu:8090/pub/1999-31 (1999)

10.44 M.R. Henzinger: Hyperlink Analysis for the Web. IEEE Internet Computing, 5 (1), 45-50 (2001)

10.45 M.R. Henzinger, A. Heydon, M. Mitzenmacher, M. Najork: On Near-uniform URL Sampling. Proc. the 9th International World-Wide Web Conference (Amsterdam, Netherlands, May 2000)

10.46 A. Heydon, M. Najork: Performance Limitations of the Java Core Libraries. Proc. the 1999 ACM Java Grande Conference (Jun 1999) pp. 35-41

10.47 A. Heydon, M. Najork: Mercator: A Scalable, Extensible Web Crawler. World-Wide Web, 219-229 (Dec 1999)

10.48 J.J. Hopfield: Neural Networks and Physical Systems with Collective Computational Abilities. Proc. the National Academy of Sciences, USA, 79 (4), 2554-2558 (1982)

10.49 A.E. Howe, D. Dreilinger: SavvySearch: A Meta-search Engine that Learns which Search Engines to Query. AI Magazine, 18 (2) 19-25 (1997)

10.50 B. Kahle: Preserving the Internet. Scientific American (Mar 1997)

10.51 J. Kleinberg: Authoritative Sources in a Hyperlinked Environment. Proc. the 9th ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California, USA, Jan 1998) pp. 668-677

10.52 T. Kohonen: Self-organizing Maps (Springer, Berlin, 1995)

10.53 M. Koster: A Standard for Robot Exclusion. [Online]. Available at http://www.robotstxt.org/wc/norobots.html (1994)

10.54 R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins: Trawling the Web for Emerging Cyber-communities. Proc. the 8th World-Wide Web Conference (Toronto, May 1999)

10.55 K.L. Kwok: A Neural Network for Probabilistic Information Retrieval. Proc. the 12th ACM-SIGIR Conference on Research and Development in Information Retrieval (Cambridge, Massachusetts, USA, Jun 1989) pp. 21-30

10.56 S. Lawrence, C.L. Giles: Inquirus, the NECI Meta Search Engine. Proc. the 7th International World-Wide Web Conference (Brisbane, Australia, Apr 1998)

10.57 S. Lawrence, C.L. Giles: Context and Page Analysis for Improved Web Search. IEEE Internet Computing, Jul-Aug, 38-46 (1998)

10.58 S. Lawrence, C.L. Giles: Accessibility of Information on the Web. Nature, 400, 107-109 (1999)

10.59 C. Lin, H. Chen, J. Nunamaker: Verifying the Proximity and Size Hypothesis for Self-organizing Maps. Journal of Management Information Systems, 16 (3) 61-73 (2000)

10.60 P. Lyman, H.R. Varian: How Much Information. [Online]. Available at http://www.sims.berkeley.edu/how-much-info/ (2000)

10.61 U. Manber, M. Smith, B. Gopal: WebGlimpse: Combining Browsing and Searching. Proc. the USENIX 1997 Annual Technical Conference (Anaheim, California, Jan 1997)

10.62 M.L. Mauldin: Lycos: Design Choices in an Internet Search Service. IEEE Expert, 12 (1) 8-11 (1997)

10.63 M.L. Mauldin: Spidering BOF Report. Report of the Distributed Indexing/Searching Workshop (Cambridge, Massachusetts, USA, May 1996)

10.64 O.A. McBryan: GENVL and WWWW: Tools for Taming the Web. Proc. the 1st International World-Wide Web Conference (Geneva, Switzerland, 1994)

10.65 A. McCallum, K. Nigam, J. Rennie, K. Seymore: A Machine Learning Approach to Building Domain-specific Search Engines. Proc. the International Joint Conference on Artificial Intelligence (IJCAI-99) (1999) pp. 662-667

10.66 Z. Michalewicz: Genetic Algorithms + Data Structures = Evolution Programs (Springer, Berlin, 1992)

10.67 R.C. Miller, K. Bharat: SPHINX: A Framework for Creating Personal, Site-specific Web Crawlers. Proc. the 7th International World-Wide Web Conference (Brisbane, Australia, Apr 1998)

10.68 M. Najork, J.L. Wiener: Breadth-first Search Crawling Yields High-quality Pages. Proc. the 10th International World-Wide Web Conference (Hong Kong, May 2001)

10.69 Netcraft: Web Server Survey. [Online]. Available at http://www.netcraft.com/Survey/Reports/0202/ (2002)

10.70 Z.Z. Nick, P. Themis: Web Search Using a Genetic Algorithm. IEEE Internet Computing, 5 (2) 18-26 (2001)

10.71 J. Pearl: Heuristics: Intelligent Search Strategies for Computer Problem Solving (Addison-Wesley Publishing Company, Reading, Massachusetts, USA, 1984)

10.72 B. Pinkerton: Finding What People Want: Experiences with the WebCrawler. Proc. the 2nd International World-Wide Web Conference (Chicago, Illinois, USA, 1994)

10.73 P. Pirolli, J. Pitkow, R. Rao: Silk from a Sow's Ear: Extracting Usable Structures from the Web. Proc. the ACM Conference on Human Factors in Computing Systems (Vancouver, Canada, Apr 1996)

10.74 J. Rennie, A.K. McCallum: Using Reinforcement Learning to Spider the Web Efficiently. Proc. the 16th International Conference on Machine Learning (ICML-99) (Bled, Slovenia, 1999) pp. 335-343

10.75 G. Salton: Another Look at Automatic Text-retrieval Systems. Communications of the ACM, 29 (7) 648-656 (1986)

10.76 E. Selberg, O. Etzioni: Multi-service Search and Comparison using the MetaCrawler. Proc. the 4th World-Wide Web Conference (Boston, Massachusetts, USA, Dec 1995)

10.77 E. Selberg, O. Etzioni: The MetaCrawler Architecture for Resource Aggregation on the Web. IEEE Expert, Jan-Feb, 11-14 (1997)

10.78 J. Smith, S.F. Chang: Visually Searching the Web for Content. IEEE Multimedia, 4 (3), 12-20 (1997)

10.79 E. Spertus: ParaSite: Mining Structural Information on the Web. Proc. the 6th International World-Wide Web Conference (Santa Clara, California, USA, Apr 1997)

10.80 S. Spetka: The TkWWW Robot: Beyond Browsing. Proc. the 2nd International World-Wide Web Conference (Chicago, Illinois, USA, 1994)

10.81 R.G. Sumner, K. Yang, B.J. Dempsey: An Interactive WWW Search Engine for User-defined Collections. Proc. the 3rd ACM Conference on Digital Libraries (Pittsburgh, Pennsylvania, USA, Jun 1998) pp. 307-308

10.82 The ht://dig Group: htdig Reference. [Online]. Available at http://www.htdig.org/htdig.html

10.83 K.M. Tolle, H. Chen: Comparing Noun Phrasing Techniques for Use with Medical Digital Library Tools. Journal of the American Society for Information Science, Special Issue on Digital Libraries, 51 (4) 352-370 (2000)

10.84 S. Vrettos, A. Stafylopatis: A Fuzzy Rule-based Agent for Web Retrieval-filtering. Proc. the 1st Asia-Pacific Conference on Web Intelligence (Maebashi City, Japan, Oct 2001) pp. 448-453

10.85 S. Waterhouse, D.M. Doolin, G. Kan, Y. Faybishenko: Distributed Search in P2P Networks. IEEE Internet Computing, 6 (1) 68-72 (2002)

10.86 R. Weiss, B. Velez, M.A. Sheldon: HyPursuit: A Hierarchical Network Search Engine that Exploits Content-link Hypertext Clustering. Proc. the ACM Conference on Hypertext (Washington, DC, USA, 1996)

10.87 J. Weizenbaum: Eliza – A Computer Program for the Study of Natural Language Communication between Man and Machine. Communications of the ACM, 9 (1), 36-45 (1966)

10.88 I.H. Witten, D. Bainbridge, S.J. Boddie: Greenstone: Open-source DL Software. Communications of the ACM, 44 (5), 47 (2001)

10.89 I.H. Witten, R.J. McNab, S.J. Boddie, D. Bainbridge: Greenstone: A Comprehensive Open-source Digital Library Software System. Proc. the 5th ACM Conference on Digital Libraries (San Antonio, Texas, USA, 2000) pp. 113-121

10.90 A.H. Whinston: Artificial Intelligence (Addison-Wesley Publishing Company Inc., Reading, Massachusetts, Second Edition, 1984)

10.91 C.C. Yang, J. Yen, H. Chen: Intelligent Internet Searching Agent Based on Hybrid Simulated Annealing. Decision Support Systems, 28, 269-277 (2000)

10.92 C. Yu, W. Meng, K.L. Liu: Efficient and Effective Metasearch for Text Databases Incorporating Linkages among Documents. Proc. the 2001 ACM SIGMOD International Conference on Management of Data (Dallas, Texas, May 2001) pp. 187-198

10.93 O. Zamir, O. Etzioni: Grouper: A Dynamic Clustering Interface to Web Search Results. Proc. the 8th World-Wide Web Conference (Toronto, May 1999)