
Project Description

Much has been written about the information explosion and the difficulties of finding pertinent

information in the ever-expanding World Wide Web (WWW). But the information explosion has also

changed the personal information space, and the intersection between that space and the WWW. Most

personal spaces now consist of hundreds of documents, most of which are not islands to themselves, but

are part of a web of documents related by direct hyperlinks and references, or by similarity of content. We

call this information space, consisting of user documents and the web documents referenced by them, the

personal web (see Figure 1).

Traditional file management tools are not well suited for navigating this personal web. Documents are

organized with three separate topical hierarchies: 1) the file system directory, showing only user

documents 2) the user's bookmark directory, showing only web documents, and 3) the hidden hierarchy

(graph) defined by the direct links (hyperlinks and citations) within documents. Such a disintegrated

information space complicates the work of the user. To navigate for information on a particular topic, a

user has to browse the topic subdirectory both in his file manager and in the bookmarks of his Internet

browser. Furthermore, these tools ignore direct-links within documents, so as the user navigates the file

and bookmark hierarchy, he cannot see the documents related to those listed by direct links. Without an

overview of these direct links, the user must open each interesting document in an editor and manually

search for the links within it. Even more laborious analysis must be performed if the user wishes to see all

documents that point to a particular document (i.e., inward links), or all documents with similar content.

Figure 1. The Personal Web and Periphery. A document is considered within the personal web if it has been saved or bookmarked by the user, or if the user has explicitly added a link to it. Other documents are considered n degrees of separation away from the personal web, where a degree of separation is defined as an explicit inward or outward link, or an implicit similar-content relationship (shown as a dotted line). [Figure labels: Personal Web, 1st Degree, 2nd Degree, Nth Degree.]

Whereas file management tools do a poor job of exposing the personal web, search tools ignore the

personal web and how it can help users find information in the WWW. Search tools search the same set

of documents, and return the same results, no matter what user makes the query and no matter what

documents the user is working on or has previously saved (bookmarked) in the personal space.

HIGH-LEVEL OBJECTIVE

The project’s objective is to design and implement tools better suited than conventional ones for

organizing and exploring the personal web. The following are sub-objectives defined to meet this goal:

1. De-emphasize the storage location of documents, and instead emphasize a document's context, that is, how it is related to other documents.

2. Identify various types of references within documents, other than just hyperlinks and citations, that can lead to related information (e.g., the name of a city leads to a map link).

3. Integrate desktop tools (e.g., editors, file managers) with WWW tools (browsers, search engines) so that exploration is continuous between the personal web and the WWW, and a "context switch" is not necessary to move among creation, document management, and information exploration.

4. Personalize searching of the WWW by augmenting search queries with "profiling" information mined from the personal web and by performing inward-out searches that begin in the personal web and crawl outward to its periphery within the WWW.

EXPECTED SIGNIFICANCE

The project can have a significant impact on next generation user interfaces for the personal desktop. It

can expose and clarify the limitations of conventional tools, and present alternatives that address those

limitations, along with evaluation data on the viability and significance of those alternatives.

CiteSeer [16] has revolutionized computer-science research by presenting research documents in context

and making it easy for a researcher to explore the history and relationships of ideas. The proposed project

can have a similar impact on a general audience by introducing tools that help users organize and explore

their own personal information spaces.

PRELIMINARY WORK AND RELATION TO LONGER-TERM GOALS

Up until a year or so ago, the principal investigator’s (PI's) research focused primarily on the design of

Programming by Example tools for building animated interfaces. After over a decade of work in this area

the research culminated in a journal article as well as a Communications of the ACM paper and a chapter in

Henry Lieberman's book, Your Wish is My Command: Programming by Example, both co-authored with

Brad Myers [see the Biographical Sketch for these publications].

David Wolber, Exploring the Personal Web

About that time, the PI served on the program committee for the 2000 International Conference on

Intelligent User Interfaces (IUI). Inspired by tools such as Watson [4] and Margin Notes [24,26], the PI

subsequently began exploring software agents that assist users in the creative process. He was further

inspired by his own secondary-career efforts writing fiction in the Master's of creative writing program at

USF, and the desire he perceived for better tools to help in the creative process.

In the summer of 2001, the PI and two USF students began work on WebTop, a tool based on the high-

level objectives described above. A working prototype has been designed and implemented, with

preliminary ideas and findings reported in [29,30]. The team is enthusiastic about the results so far, as the

prototype is already a useful and illuminating tool (to its handful of users), and numerous interesting and

important questions have presented themselves.

Figure 2. The WebTop interface. The right panel is the currently open document (either a page rendered in an embedded Internet Explorer browser, or an HTML file in the WebTop WYSIWYG editor). The left panel is the tree view showing, in this case, the open page's three inward links, six outward links, and five content-related links mined from the WWW. All of these related documents are considered one degree of separation away. The fourth outward link (ice.cs.usfca.edu/projects.shtml) has been expanded to show its relations, which are considered two degrees of separation from the open document.

The PI plans to focus on this new line of research for the long term, just as he spent the years 1990-1999

working in Programming by Example. The long-term goal is the design of an ideal tool (agent) for

organizing and exploring the personal information space, its periphery, and the WWW beyond.

OVERVIEW OF PRELIMINARY DESIGN

The initial WebTop design provides a tree view of the personal web that is tightly integrated with an

editor/browser (see Figure 2). The tree view is a zero-input interface [20], meaning it pushes document

context information to the user without the user typing keywords or clicking a mouse or switching

applications to a search engine. The user can instead intermittently glance at the information and follow

one of the suggested links, or just continue on with the current task.

When a document d is opened in the editor/browser, WebTop performs a number of operations:

1. Finds outward links: d is parsed to find its outward links. If d is a directory, its outward

links include all of its subdirectories and files. If d is a document, its outward links are its hyper-links

(in upcoming versions this will be generalized to include citation links and any other

explicit reference that can be mapped to a set of URLs). This list will also contain edge links,

which are discussed in the next section.

2. Finds inward links: If d is a web document, a wrapper is called that invokes an advanced search

facility in Google. This wrapper returns all web documents that refer to d. Whether d is a user or

web document, all documents from the personal web that refer to d are then added to the list of

inward links.

3. Finds content-related links: The n most important terms in d are identified using a TFIDF-based

algorithm [27]. These terms are then used to perform one or more of three possible searches. The

first uses a Google wrapper to perform a standard search of the WWW. The second searches only

documents within the personal web. The third begins in d and crawls out to the WWW following

inward and outward links, using a breadth-first search (BFS) algorithm [7]. The user may choose which types of

searches are performed in the options dialog.
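Steps 1 and 2 above can be illustrated with a minimal Python sketch. It is not the WebTop implementation: the function names, the in-memory `outward_index` dictionary, and the use of Python's standard HTML parser are assumptions made for illustration. The Google wrapper that finds inward links from the WWW is omitted; only the personal-web lookup is shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutwardLinkParser(HTMLParser):
    """Collects the targets of <a href="..."> tags in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the document's own URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.links:
                    self.links.append(absolute)


def find_outward_links(html_text, base_url):
    """Step 1 (hypothetical helper): parse d to find its outward hyper-links."""
    parser = OutwardLinkParser(base_url)
    parser.feed(html_text)
    return parser.links


def find_inward_links(target, outward_index):
    """Step 2, personal-web part: documents whose outward links include target.

    outward_index is an assumed mapping {document: [outward links]}.
    """
    return [doc for doc, links in outward_index.items() if target in links]
```

Inverting a stored index of outward links is one simple way to answer the inward-link question for personal documents; an actual system might instead maintain a separate inward index.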

The compiled lists of each relationship type are then ranked as described below (see Personalized

Document Ranking). A user-configurable number of links of each relationship type is then displayed, color-coded

by type, in the first level of the tree view. For example, the default configuration displays five inward

links, five outward links, and five content-related links. A “more” node is provided under the nodes of

each type. When expanded, additional relations at the same level in the tree (degree of separation) are

shown.


When the user expands any of the related links, the process is re-triggered and a set of documents two

degrees of separation from d are displayed. Note that whether the expanded first level link is an inward

(i), outward (o), or content-related (c) link of d, the second level will display relationships of all three

types. So the second level can show grandparents of d (i-i), grandchildren of d (o-o), siblings of d (i-o or

o-i) and relatives defined by the other combinations (c-c, c-i, i-c, c-o, o-c).
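The nine second-level combinations can be enumerated mechanically. The short sketch below is illustrative only; the family labels are taken from the paragraph above, while the names `LINK_TYPES`, `FAMILY_NAMES`, and `second_level_paths` are hypothetical.

```python
from itertools import product

# i = inward, o = outward, c = content-related
LINK_TYPES = ("i", "o", "c")

# Family labels as given in the text; every other path is an unnamed relative.
FAMILY_NAMES = {
    ("i", "i"): "grandparent",
    ("o", "o"): "grandchild",
    ("i", "o"): "sibling",
    ("o", "i"): "sibling",
}


def second_level_paths():
    """Map each two-step path (e.g. 'i-o') to its label."""
    return {
        "-".join(path): FAMILY_NAMES.get(path, "other relative")
        for path in product(LINK_TYPES, repeat=2)
    }
```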

An earlier version of WebTop showed the various relationship types in separate panels. This scheme was

advantageous in that inward links could be shown with a conceptually intuitive, backward oriented (left-

growing) tree, and outward links in a right-growing tree. Also, the separation of types was clearer than

with color-coding or icons. However, early users of WebTop complained that they couldn’t reach siblings

(i-o or o-i) with a single click in the tree view; they had to open an outward link as the “current

document” to see an (o-i) relationship. Similarly, they complained that it was difficult to view the inward

or outward links of the displayed content-related documents (c-i or c-o).

Personalized Document Ranking

WebTop exposes more relationship types than other tools, all within a single view, so its ranking

algorithm, i.e., the way it chooses which links to display, is extremely important. Current search tools and

reconnaissance agents use a combination of similarity comparison and popularity measurement to rank

documents. Similarity comparisons can include meta-data as well as internal content analysis to compute

how similar a document is to the seed document. Popularity measurements count how many inward links

a particular document has, and recursively, how many links its inward links have, etc. Such analysis can

identify “hubs” and “experts” [14] and, in general, documents that others have found to be relevant.

The link analysis of conventional search tools is completely impersonal. Every user gets the same results,

no matter whether a user's personal web contains zero or a thousand links to a page under consideration.

A personal web agent, on the other hand, can consider the user’s own preferences in ranking documents.

If numerous documents within the personal web refer to the document under consideration (or many

documents point to documents that point to it, etc.), that document can be ranked higher.

Various algorithms are possible for personalized link analysis. One involves the degrees of separation as

part of the measurement. Instead of just counting the number of documents from the personal web that

refer to a "found" document under consideration, a higher score is assigned to documents near the

currently open document (the seed document). The argument for such a scheme is similar to the old

dating adage: getting fixed up on a date is beneficial, since one is more likely to prefer friends of friends

than strangers. Here, the intuition is that one is more likely to prefer documents pointed to by neighbors

of the current document over documents pointed to by documents many degrees of separation away.
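One possible form of such a scheme is sketched below. The weighting 1 / (1 + degree) is an assumption chosen for illustration, not a formula from the proposal, and the inputs `referrers` (which personal-web documents refer to each candidate) and `degrees` (each document's degrees of separation from the seed) are assumed to have been computed beforehand.

```python
def personalized_score(candidate, referrers, degrees):
    """Score a found document by who refers to it and how near they are.

    referrers: {candidate: [personal-web documents that refer to it]}
    degrees:   {document: degrees of separation from the seed document}
    Each referring document contributes 1 / (1 + degree), so referrers
    near the seed count more than distant ones (assumed weighting).
    """
    score = 0.0
    for doc in referrers.get(candidate, ()):
        if doc in degrees:
            score += 1.0 / (1 + degrees[doc])
    return score


def rank_candidates(candidates, referrers, degrees):
    """Order candidates from highest to lowest personalized score."""
    return sorted(
        candidates,
        key=lambda c: personalized_score(c, referrers, degrees),
        reverse=True,
    )
```

With this weighting, a document cited by a direct neighbor of the seed outranks one cited only by a document three degrees away, which matches the "friends of friends" intuition.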


Note that such a scheme makes use of user profiling information to decide which documents to suggest.

Unlike tools that use a single user profile, or multiple profiles the user must explicitly specify [20], the

profiles in WebTop are implicitly defined and chosen dynamically as “the current document and its

neighbors”.

Besides improving the relevancy of suggested documents, personalized ranking can combat the “rich get

richer” monopolistic effect of global link analysis algorithms. Giving preference to hubs and experts has

the effect of making them even stronger hubs and experts, since more and more other documents will find

them and then refer to them. Because a personalized ranking system focuses on user preference, it is not

affected by the follow-the-leader mentality of the masses.

Multi-Reference Saving

WebTop is also designed to help users organize documents in the personal web. Specifically, the system

circumvents the traditional save and open operations and the traditional notion of directory in order to

collapse the directory, bookmark, and direct-link hierarchies into a single topic hierarchy. Physically, a

directory, bookmark, and document are essentially the same. A directory (bookmark) is just a set of links to

documents, different from a document (e.g., web page) only in that it is restricted to containing only links

and no other text or graphics. However, conventional tools treat directories in special ways. Editors allow

documents to be saved in directories, and do not allow directories to be edited or annotated. File managers

display directories differently from other documents, and browsers provide special directories for saving

bookmarks.

Figure 3. The WebTop SaveAs Dialog. A user saves a document by adding links to it from other documents. A user "bookmarks" a web document using this same dialog. The nodes shown in the dialog above may be directories, user documents, or web documents; each is treated as a collection of links. Note that the Save dialogs only show outward links, not inward links or content-related links as in the tree view.

The WebTop design provides for a fundamentally different method of saving and bookmarking files.

Whether editing an HTML or Word page, or browsing a web document, a user “saves” not by adding a

link in some special “directory” file, but by adding link(s) in regular documents. When the user selects

File | Save, the save dialog does not show the file system directory; this directory is completely hidden

from the user. Instead, it displays the outward links of a special HTML document called Root.html (see

Figure 3). Root.html is an HTML document that serves as the root of the personal web. It is different

from other documents only in that the editing and browsing tools display it in the save and open dialogs.

The system parses Root.html to find its outward links, then displays those linked documents as first-level

nodes in the tree within the save/open dialog. When the user expands one of these nodes, the system

parses the corresponding document and displays its outward links. Check boxes are presented next to

each node, so the user can select any number of documents to which a link to the current document

should be “saved”. After the user closes the dialog, the context view is updated to show the newly

specified inward links to the current document.

The open operation works in a similar manner. The open dialog appears displaying Root.html and its

outward links. The user can select any node, or expand a node to see the next level of documents. Double-

clicking on a node opens the document as the current document. The document appears in the

editor/browser, and its context is displayed in the tree view. Documents can also be opened directly from

the tree view by double clicking on a node.

With these fundamentally modified save and open operations, the direct-link becomes the basis for a

single, integrated topic hierarchy. Users can create documents that serve as “directories”, but these

documents are physically no different than other documents, so they can be annotated with text and

graphics. This annotated view appears when the document is opened in the main editor/browser. The

“directory” or “links-only” view of the document appears in the tree view.

Edge Links

Initially, the system was designed so that “saving” a document meant adding a direct hyper-link within

the parent document, if the parent was an HTML user document. The system added the links at the

bottom of the parent page. This scheme ensured that a user could always annotate the links in his

“directory” documents. But in early usability testing, we found that users were often disconcerted when

actual internal links were added to a parent document.

Because of this feedback, the system now, by default, adds an edge link to a parent document when a

document is saved. When a document is opened, the tree view displays both its internal and edge links,


while the document viewer (editor or browser) only shows its internal links. Edge links allow a user to

specify relationships between documents without changing either document’s internal text. This is useful

for user-owned documents, especially formal ones such as research papers, and it is essential for web

documents, since the user cannot modify their internal contents.
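The distinction between internal and edge links, together with multi-reference saving, can be sketched as a small data structure. This is a hedged illustration rather than the prototype's design: the class `PersonalWeb`, its method names, and the choice to keep edge links in a separate in-memory table are all assumptions.

```python
class PersonalWeb:
    """Toy model: internal links live in a document's text, edge links do not."""

    def __init__(self):
        self.internal_links = {}  # document -> links parsed from its own text
        self.edge_links = {}      # document -> links stored outside its text

    def save(self, document, parents):
        """Multi-reference save: add an edge link from every selected parent."""
        for parent in parents:
            links = self.edge_links.setdefault(parent, [])
            if document not in links:
                links.append(document)

    def tree_view_links(self, document):
        """The tree view shows both internal and edge links."""
        return self.internal_links.get(document, []) + self.edge_links.get(document, [])

    def viewer_links(self, document):
        """The editor/browser shows only the links in the document's text."""
        return list(self.internal_links.get(document, []))

    def inward_links(self, document):
        """Documents that point to the given one by either kind of link."""
        sources = set(self.internal_links) | set(self.edge_links)
        return sorted(s for s in sources if document in self.tree_view_links(s))
```

Because `save` never touches `internal_links`, a web page the user cannot modify can still serve as a parent, as the text requires.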

Summary of Preliminary Design

A prototype implementing most of the above features has been completed, and a handful of super users,

including the PI and the student developers, have enthusiastically begun using the system in their daily

work. From this experience, the PI has determined his research path for the next three years, as detailed in

the next section.

PROPOSED WORK PLAN

The research will be completed using an iterative approach, with four iterations of approximately nine

months' duration. Tasks include studies of related academic and commercial works, tool development,

including documentation and toolkit implementations, and evaluation through user studies and feedback

from public releases.

Designing user studies to evaluate the tool will be challenging, as the objectives of the tool include such

vague notions as, “improves the organization and exploration capabilities of a user,” and “lists more

‘relevant’ documents to the user.” The interface issue is just as challenging, as the proposed tool is taking

a difficult problem of reconnaissance agents, that of choosing the most relevant content-related

documents to list, and including inward and outward links in the mix as well.

Iteration 1

Task 1A: Complete implementation of WebTop 1 (as specified in the preliminary design above).

Work includes implementation of the personalized searching mechanism, full integration of the various

relationship types into a single view, and adding a high degree of user-configurability to the system (e.g.,

the ability to specify how many of each relationship to show, and what searches should be performed to

find content-related links). Simple known algorithms will be used in searching for and ranking

documents. Challenges include finding efficient algorithms for dynamically searching the personal web,

and/or integrating idle-time personal crawls to index documents in the neighborhood.

Along with coding, significant system testing will be completed to ready the tool for public consumption,

and the existing partial Unified Modeling Language (UML) documentation will be refined and updated.

Deliverable: WebTop Public Release 1, including documentation.

Task 1B: Perform a complete and updated related work survey. The proposed research spans

numerous areas within the Human-Computer Interface and Artificial Intelligence disciplines, and both

academic and commercial endeavors.

Deliverable: a paper, A Survey of Techniques for Integrating the Personal and World-Wide Webs.


Task 1C: Evaluate the prototype with informal user studies. As development proceeds, informal user

studies will be performed. These studies will consist of developers watching users use the system in a

free-form fashion, and watching them perform particular tasks provided by the project team. Feedback

will be used to guide the development process and to shape the formal user studies to be performed in

iteration 2.

Iteration 2

Task 2A: Obtain and analyze feedback from public release 1

Collect feedback from known solicited users and from unsolicited comments posted on the project website.

Task 2B: Perform formal user studies

Formal user studies will be performed to measure the effectiveness of WebTop's design. Detailed tasks

will be outlined for the users and specific data measured. Formal studies will be developed to answer

numerous specific questions, including the following:

1. Compare searches in the neighborhood of the personal web with global searches. Are the results

more relevant? Can a hybrid approach, providing results from both, please users?

2. Compare ranking algorithms that use personalized link analysis with standard ranking algorithms

(e.g., Google) that use global link analysis. Are the results more relevant?

3. Analyze the effectiveness of WebTop compared to conventional desktop and web tools. Given the

same amount of time and preparation, will a test group of research paper writers cite a "better" list of

related works? Will their papers be graded higher?

4. Analyze the effectiveness of displaying inward, outward, and content-related links within a single

view, compared to separate views. Does it have a negative effect on traditional file management

tasks?

5. Analyze the effectiveness of multi-reference saving and eliminating the distinction between

directory, document, and bookmark. Will the concept of directory become extinct, or will users define

documents that perform the same function and contain only lists of links?

Deliverable 1: Mining the Neighborhood of the Personal Web, a paper providing an overview of the

system and a detailed report of the findings from user studies 1 and 2.

Deliverable 2: Improved Writing Through Integrated Editing, File Management, and Searching, a paper

providing an overview of the system, focusing on user interface issues, and a detailed report of the

findings from user studies 3, 4, and 5.

Task 2C: Explore issues in interfacing with various application/document types. To promote

widespread real-world use, the system needs to parse various types of documents and invoke a WebTop-

like interface from various software tools (e.g., from Microsoft Word). A project objective is also to


develop software that identifies references, other than hyperlinks and citations, and maps them to a

collection of web pages.

Deliverable: A toolkit, built on top of COM, for identifying various references within various document

types and for specializing various tools such as Word, Explorer, and Acrobat. This toolkit has two

purposes. First, because USF is a predominantly undergraduate institution, there will be a high turnover

of students on the project each year, so an easy-to-use toolkit with proper documentation can significantly

reduce the start-up time for new student developers. Second, too often research efforts must duplicate the

same difficult implementation efforts due to "technical" results not being disseminated. Thus, this project

will follow the lead of [23] and make this toolkit publicly available.

Task 2D: Perform normal maintenance and debugging of software.

Deliverable: WebTop Public Release 1.1

Task 2E: Monitor new related academic publications and commercial releases.

Iteration 3

Task 3A: Design and implement WebTop Version 2. The new features for this version will be designed

using feedback acquired from user studies and from the first public release. It will include issues studied

in Task 2C concerning user interface alternatives and increasing the types of references that can be

mapped to other documents. It will also include a refinement of search and ranking algorithms using data

from the personal web, and some form of Group Web, whereby personal webs are used in collaboration.

Deliverable: WebTop Public Release 2.0.

Task 3B: Perform user studies on a particular focus group. User testing will be targeted to a group

consisting of writers and students in the USF Creative Writing Program. The goal is to obtain feedback on

the tool's usefulness specifically for the creative process. Feedback will be used to help in the new design

of the general tool, and in the consideration of specialized tool(s).

Task 3C: Monitor new related academic publications and commercial releases.

Iteration 4

Task 4A: Perform formal user studies on WebTop Release 2. These studies will target new features

and refinements designed during the previous iteration.

Task 4B: Design and implement specialized versions for focus group. The versions will contain

special features designed specifically to help a particular focus group (e.g., creative writers) better

accomplish their tasks. Features will include the identification of domain-specific "links" within

documents, e.g., when the tool identifies the name of an author in a document, it might display a list of

the author's works or similar authors.

Deliverable 1: WebTop for Writers, Release 1.0.

Deliverable 2: A paper titled, A Software Agent for Creative Writers.


Task 4C: Analyze and categorize findings.

Deliverable: A journal paper (TOCHI?) detailing the results of the entire project.

Team Member Responsibilities

The PI will be responsible for the literature and commercial system surveys and updates and the majority

of the writing. The PI will also oversee the design and implementation of the system(s), and the user

studies. The student team members' main responsibility will be implementing the system(s) and

performing user studies, though all project members will read the core related papers, help in the

design of the system, and suggest innovations from their experiences building the system. Some top

students will also take part in writing research papers.

RELATION TO PRESENT STATE OF THE FIELD

This section describes the proposed system from a number of perspectives and in relation to several

research areas.

WebTop as a Reconnaissance Agent

Reconnaissance agents, also called just-in-time information retrieval agents [26], expose similar content

relationships between documents. These agents mine the user's open documents to find the most

characteristic words, then automatically send the words to a search engine (e.g., Google). The user can

intermittently glance at the resulting suggested links and, without changing context from creation mode

(editing), choose whether to follow one of the suggestions, or continue on with the current task.

Reconnaissance agents are said to have zero-click interfaces [20] because searches are performed without

the user stopping work to initiate them.

The proposed project will explore generalizing the basic ideas behind reconnaissance agents. One

generalization is to expose direct-link associations, both inward and outward, and of various types (e.g.,

hyperlink, citation, proper name), along with the content-related documents. One might argue that

displaying the outward links is repetitive, since they appear in the document itself. However, many

documents do not fit in a single window, so often a user must scroll to view all of a document’s links.

Furthermore, the proposed tool can rank the outward links it displays, similar to Letizia [19,20].

Inward links can be as illuminating as outward links, but are ignored by most tools (Google is an

exception, as it has an advanced search feature that lists all documents that link to a given URL).

Following inward links is following the refinement and transformation of an idea, moving ahead in the

history of the idea, instead of backward, as with outward links.

A second generalization of reconnaissance agents is to display all types of links (inward, outward, and

content-related) within a tree structure, instead of the one-dimensional list displayed by most

reconnaissance agents. This allows a user to easily navigate to documents more than one degree of


separation from the open document (see Figure 1 above). With most reconnaissance agents, the user must

open a suggested document to see its content-related and direct links, and can only view one degree of

separation at any time. With the proposed tree view, the user can expand nodes to view a collection of

documents related by various degrees of separation, and various relationship types.

A third generalization of reconnaissance agents is to trigger searches not only when a user edits a

document [4,25,26], browses a web page [2,19,20], or uses some other application [22], but also as the

user traverses the documents in the file manager. When a user selects a document or directory in the file

manager tree view (in this case, the proposed super-file-manager tree view), the agent performs the

reconnaissance-agent-like automated search to find content-related documents and lists these along with

the explicitly linked documents. Because explicit links can include sub-directories and documents within

a directory, the proposed system can be described as “reconnaissance agent meets file manager meets document graph.”

A fourth generalization of reconnaissance agents concerns the search engine that is automatically

invoked. Reconnaissance agents like the Remembrance Agent [25] search the personal space for related

documents, while others search only the WWW [4], and one, Letizia [19,20], focuses only on pages one

outward link away from the currently open document. The proposed system will search the personal

space, the WWW, and the intersection of the two, i.e., the documents in the neighborhood of the personal

web. The latter is somewhat similar to Letizia's scheme, except that the search will include not only the outward

links of the current document, but also its inward links, outward links of inward links, inward links of

outward links, and in general, all documents within a certain radius of the currently open document.
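A radius-bounded neighborhood search of this kind can be sketched with a breadth-first traversal that follows both link directions. The function name `crawl_neighborhood` and the dictionaries `outward` and `inward` are hypothetical; in a real system they would be backed by a document parser and a link-search wrapper rather than in-memory data.

```python
from collections import deque


def crawl_neighborhood(seed, outward, inward, radius):
    """Return {document: degrees of separation} for documents within radius.

    outward / inward: assumed mappings {document: [linked documents]}.
    Both directions count as one degree of separation.
    """
    degrees = {seed: 0}
    queue = deque([seed])
    while queue:
        doc = queue.popleft()
        if degrees[doc] == radius:
            continue  # do not expand documents already at the radius
        for other in list(outward.get(doc, ())) + list(inward.get(doc, ())):
            if other not in degrees:
                degrees[other] = degrees[doc] + 1
                queue.append(other)
    return degrees
```

With a radius of 1 this yields only the direct inward and outward links of the open document; a radius of 2 adds siblings, grandparents, and grandchildren.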

Because the proposed project focuses on the personal web, it will not focus on designing new algorithms

for choosing the words in a document that most characterize it. Instead, the project will make use of

previous research in the area, and specifically some variant of the TFIDF [27] algorithm (which ranks a

word based on how often it occurs in the document, and how seldom it occurs in some large corpus).

Watson [4] uses a refined TFIDF, taking into account heuristics such as “words in bold are more

important” and “words at the top of a document are more important”. Margin Notes [24] uses TFIDF, but

it considers the document to be of too large a granularity to seed optimal searches. Margin Notes thus

treats each section of a document as a search seed, then lists the links related to each section in the

document’s margins. Intellizap [10] is similar, except that it considers only the selected text in the document.
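The basic TFIDF idea described above (frequent in the document, rare in the corpus) can be sketched as follows. This is one simple smoothed variant chosen for illustration, not the specific algorithm of [27] or of any of the cited tools; the tokenizer and the names `tokenize` and `top_terms` are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(document, corpus, n):
    """Return the n terms of document with the highest TF-IDF weight.

    TF is the term's share of the document's words; IDF is the smoothed
    log((1 + N) / (1 + number of corpus documents containing the term)).
    """
    counts = Counter(tokenize(document))
    total = sum(counts.values())
    corpus_terms = [set(tokenize(text)) for text in corpus]
    scores = {}
    for term, count in counts.items():
        containing = sum(1 for terms in corpus_terms if term in terms)
        idf = math.log((1 + len(corpus)) / (1 + containing))
        scores[term] = (count / total) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:n]
```

A word such as "the" that appears in every corpus document receives an IDF of zero and so never becomes a search term, however often it occurs in the open document.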

WebTop as a CiteSeer for the Personal Web

A number of systems expose the direct-link relationships between documents [3,13,16,21,23,28]. Some of

the most popular are those that display citation graphs of research papers [16,21]. These systems crawl the

web searching for research papers (in PDF files), then parse those papers to build a citation graph.

Butterfly [21] provides a 3D visualization of the citation graph of a collection of research papers. The left


side of the butterfly shows the documents that cite a given document (inward links), the right side shows

the documents it cites (outward links), and the third dimension provides information specific to the given

document. CiteSeer [16] displays similar data in 2D format. Besides exposing direct-link citation

relationships, it also exposes content-related documents in a “working bibliography” panel. CiteSeer is of

revolutionary utility to those of us who remember long nights in the library, manually building citation

graphs and searching through conference proceedings and journals.

One way to view the proposed tool is as a CiteSeer for the personal web. Instead of displaying

relationships between a domain-specific collection of documents, it displays relationships between

personal documents and between personal documents and web documents “in the neighborhood”.

The WebTop prototype displays only hyperlink and directory-document relationships. Ideally, a WebTop

user could also traverse CiteSeer-like citation links both within the personal web and out to the WWW.

But even if a CiteSeer-like citation parser were executed over the collection of documents in the personal

web, it would not allow “citation” navigation across the boundary to the WWW, since citations, unlike

hyperlinks (URLs), cannot be reached with a standard browser. Cameron has proposed a “universal

bibliographic and citation database linking every scholarly work ever written” [5], but until a major

search engine makes this a reality, users will not have the capability of following any citation from the

personal web to the WWW.

The issue with citation links is indicative of a more general issue that the proposed project will explore,

that of identifying any type of explicit reference, and automatically mapping it to one or more URLs.

WebTop as a User–Assisted Personal Crawler

Domain-specific search engines such as CiteSeer can provide more relevant results by restricting the search

space to a focused collection of documents. Focused crawlers [6,7,8,9] are often used to build up the

documents in such a domain. Beginning with a set of seed documents representing the domain’s topic,

crawling proceeds along outward links, with documents most similar to the seed documents visited first.

A personal crawler [6,7] is a domain-specific searcher, where the “domain” is represented with a set of

documents in the personal web. The conventional algorithm used is best-first search (BFS). All links of

the seed documents are ranked according to their similarity to the seed, then the best link is “visited”,

meaning its links are added to the list of ranked documents. The best link in the updated ranking list is

then visited, and so on. Crawling proceeds in this manner until a certain depth or width is reached. The

top ranked documents are then presented to the user.
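
The best-first procedure described above can be sketched with a priority queue. This is a simplified illustration under stated assumptions: `best_first_crawl`, `get_links`, `similarity`, and `max_visits` are hypothetical names, the link graph and similarity scores are hard-coded stand-ins for real fetching and text comparison, and the visit budget stands in for the depth or width limit.

```python
import heapq

def best_first_crawl(seed, get_links, similarity, max_visits=10):
    """Rank links by similarity to the seed, repeatedly visiting the
    best unvisited link and adding its links to the ranking."""
    scores = {}    # every document discovered so far -> similarity
    frontier = []  # max-heap (via negated scores) of unvisited documents

    def enqueue_links(doc):
        for link in get_links(doc):
            if link != seed and link not in scores:
                scores[link] = similarity(link)
                heapq.heappush(frontier, (-scores[link], link))

    enqueue_links(seed)
    visits = 0
    while frontier and visits < max_visits:
        _, best = heapq.heappop(frontier)
        enqueue_links(best)  # "visiting" adds the document's links
        visits += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical graph and similarity-to-seed scores.
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
sim = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.95}
ranked = best_first_crawl("seed", lambda d: graph.get(d, []), sim.get, max_visits=2)
```

With a budget of two visits the crawl expands `a` and then `c`, so the highly similar document `d` is never discovered because it sits behind the low-scoring page `b`.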

One weakness of BFS is that it will often miss highly relevant pages that are more than one link away

from the “current” document [9]. For instance, a crawler spawned from a document concerning

reconnaissance agents might arrive at the MIT research directory page, find it unrelated to the seed


documents, and thus never follow its links. This would be unfortunate, since Henry Lieberman of MIT, an

expert in agents, has a home page that links to numerous documents concerning reconnaissance agents.

One approach [9] to this problem uses machine learning techniques to teach the crawler to sacrifice

short-term gains in the interest of overall crawl performance, i.e., the crawler learns to identify promising

directory-like pages and follow their links, so relevant pages below them are considered.

The proposed system takes a different approach to the problem: it allows the user to assist the crawl.

Beginning with the open working document, the tool crawls out one level at a time, following both

inward and outward links, and displays the links in the tree view. The user assists this real-time crawl by

expanding the most desirable links. The tool assists the user by ranking the documents and displaying the

most related links.
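
One way to picture the user-assisted crawl is as a single-level expansion step that the user triggers on whichever node looks promising. The sketch below is illustrative only: `expand`, the Jaccard word-overlap similarity, the choice to rank against the working document, and the sample documents are all assumptions rather than details of the WebTop prototype.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def expand(doc, seed, outward, inward, words, top_k=5):
    """Return the inward and outward links of `doc`, one level out,
    ranked by similarity to the working document `seed`."""
    neighbors = set(outward.get(doc, ())) | set(inward.get(doc, ()))
    neighbors -= {doc, seed}
    return sorted(neighbors,
                  key=lambda n: (-jaccard(words[seed], words[n]), n))[:top_k]

# Hypothetical documents, represented by their characteristic words.
words = {
    "draft": {"agent", "crawler", "web"},
    "mit_dir": {"research", "directory"},
    "lieberman": {"agent", "web", "letizia"},
    "survey": {"agent", "crawler", "survey"},
}
outward = {"draft": ["mit_dir", "survey"], "mit_dir": ["lieberman"]}
inward = {"mit_dir": ["draft"], "survey": ["draft"], "lieberman": ["mit_dir"]}

first = expand("draft", "draft", outward, inward, words)     # initial view
second = expand("mit_dir", "draft", outward, inward, words)  # user's choice
```

Here the directory page ranks last in the first expansion, yet the user can still choose to expand it, which surfaces the relevant page beneath it.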

WebTop also crawls out more than one degree on each expansion, and displays these links as “content-

related” links. Currently, a BFS algorithm is used in this search, but a machine learning technique like [9]

could improve this process.

WebTop as an End-User Tool for Defining Domain-Specific Searches

Meta-search engines such as [17] automatically select domain-specific search areas to send keywords to

based on the user's current task (open documents). These systems keep a fixed list of search areas along

with profiles of each that are matched with data from the current documents.

WebTop provides all the facilities necessary for a user to define domain-specific searches. A user can

simply add links from document(s) in the personal web to pages in the WWW that make up the "domain".

When the user then selects the document pointing to those pages, or opens the document in the

editor/browser, WebTop invokes a search of the web beginning with that document and crawling through

the inward and outward links for some depth and breadth. In essence, WebTop dynamically chooses

search areas and dynamically builds them.

WebTop as an Integrated Searching/Browsing Tool

The two standard methods of exploration are keyword search and categorical browsing. The proposed

tool, in effect, provides a single-click interface for performing both types of navigation simultaneously.

Every time a document in the tree view is expanded, the user is browsing the sub-documents of the

"directory" defined by the document, and invoking a keyword search of the WWW or personal web

neighborhood (this is the content-related search).

RELATION TO WORK IN PROGRESS ELSEWHERE

The proposed project is most related to work in progress at the NEC Research group on Web

Technologies [31], the Intelligent Information Lab at Northwestern University [32], and the Artificial

Intelligence Lab at the University of Arizona [32]. The NEC group is responsible for CiteSeer and


numerous articles concerning focused crawling and context in web search [9,15,16]. The lab at

Northwestern is responsible for Watson [4] and numerous other software agents. The group at the

University of Arizona has focused on personal (client-based) crawls [7,8]. The proposed project deviates

from these others in its plan to integrate reconnaissance information and other contextual data (explicit

references) with traditional file management functionality, and in its particular focus on the personal

information space and its relationship with the WWW.

BROADER IMPACT OF THE PROJECT

The project will have a major impact on the USF computer science department and its students. First, as

the preliminary work has shown, students are extremely enthusiastic about building visual tools, software

agents, and web applications. This enthusiasm can lead (and already has led) them to great accomplishments

and an order-of-magnitude leap in knowledge and capabilities. Second, working on the project requires

students to master a number of technologies, including COM, web page wrappers, parsing, and advanced

graphical user interface components, so students will not only gain research experience but will learn

practical technologies that can help them get jobs. Finally, the research assistant funds will allow top

students to work on this important and educational research during the summer and school year, instead

of supporting their education through non-educational employment.

The project will have a positive effect on all computer science students, not just those hired as research

assistants. The PI has historically brought his research into the classroom, most productively by assigning

projects based on his research systems. For instance, over the years one of the most popular projects in his

software engineering courses has been the development of a simplified PBD interface builder, based on

the PI's Pavlov system [see Biographical Sketch for citations]. Recently, he assigned a simplified

reconnaissance agent as a project in his freshman programming course.

Impact on USF Computer Science

USF, and its Computer Science department, is on the cusp of transforming from a predominantly teaching

school to one also known for its research. NSF funding for the proposed project can significantly impact

this transformation through support for research assistants, the PI, and travel. It can also lead to

infrastructure improvements such as labs set aside specifically for research groups.

Potential Impact on Society

The proposed project explores the fundamental ways in which people organize and explore information,

and ultimately, the way in which people create new ideas. Because the focus is on the personal web, and

its intersection with the outside world, any innovations will affect all people who use computers in their

daily lives.