Project Description
Much has been written about the information explosion and the difficulties of finding pertinent
information in the ever-expanding World Wide Web (WWW). But the information explosion has also
changed the personal information space, and the intersection between that space and the WWW. Most
personal spaces now consist of hundreds of documents, most of which are not islands to themselves, but
are part of a web of documents related by direct hyperlinks and references, or by similarity of content. We
call this information space, consisting of user documents and the web documents referenced by them, the
personal web (see Figure 1).
Traditional file management tools are not well suited for navigating this personal web. Documents are
organized with three separate topical hierarchies: 1) the file system directory, showing only user
documents 2) the user's bookmark directory, showing only web documents, and 3) the hidden hierarchy
(graph) defined by the direct links (hyperlinks and citations) within documents. Such a disintegrated
information space complicates the work of the user. To navigate for information on a particular topic, a
user has to browse the topic subdirectory both in his file manager and in the bookmarks of his Internet
browser. Furthermore, these tools ignore direct-links within documents, so as the user navigates the file
and bookmark hierarchy, he cannot see the documents related to those listed by direct links. Without an
overview of these direct links, the user must open each interesting document in an editor and manually
search for the links within it. Even more laborious analysis must be performed if the user wishes to see all
documents that point to a particular document (i.e., inward links), or all documents with similar content.
Figure 1. The Personal Web and Periphery. A document is considered within the personal web if it has been saved or bookmarked by the user, or if the user has explicitly added a link to it. Other documents are considered n degrees of separation away from the personal web, where a degree of separation is defined as an explicit inward or outward link, or an implicit similar-content relationship (shown as a dotted line). [Figure labels: Personal Web, 1st Degree, 2nd Degree, Nth Degree.]
Whereas file management tools do a poor job of exposing the personal web, search tools ignore the
personal web and how it can help users find information in the WWW. Search tools search the same set
of documents, and return the same results, no matter what user makes the query and no matter what
documents the user is working on or has previously saved (bookmarked) in the personal space.
HIGH-LEVEL OBJECTIVE
The project’s objective is to design and implement tools better suited than conventional ones for
organizing and exploring the personal web. The following are sub-objectives defined to meet this goal:
1. De-emphasize the storage location of documents, and instead emphasize a document's context,
that is, how it is related to other documents.
2. Identify various types of references within documents, other than just hyperlinks and citations,
that can lead to related information (e.g., the name of a city leads to a map link).
3. Integrate desktop tools (e.g., editors, file managers) with WWW tools (browsers, search engines)
so that exploration is continuous between the personal web and the WWW, and a "context
switch" is not necessary to move among document creation, document management, and information
exploration.
4. Personalize searching of the WWW by augmenting search queries with "profiling" information
mined from the personal web and by performing inward-out searches that begin in the personal
web and crawl outward to its periphery within the WWW.
EXPECTED SIGNIFICANCE
The project can have a significant impact on next generation user interfaces for the personal desktop. It
can expose and clarify the limitation of conventional tools, and present alternatives that address those
limitations, along with evaluation data on the viability and significance of those alternatives.
CiteSeer [16] has revolutionized computer-science research by presenting research documents in context
and making it easy for a researcher to explore the history and relationships of ideas. The proposed project
can have a similar impact on a general audience by introducing tools that help users organize and explore
their own personal information spaces.
PRELIMINARY WORK AND RELATION TO LONGER-TERM GOALS
Up until a year or so ago, the principal investigator’s (PI's) research focused primarily on the design of
Programming by Example tools for building animated interfaces. After over a decade of work in this area
the research culminated in a journal article as well as an ACM Communications paper and a chapter in
David Wolber Exploring the Personal Web 2
Henry Lieberman's book, Your Wish is My Command: Programming by Example, both co-authored with
Brad Myers [see the Biographical Sketch for these publications].
About that time, the PI served on the program committee for the 2000 International Conference on
Intelligent User Interfaces (IUI). Inspired by tools such as Watson [4] and Margin Notes [24,26], the PI
subsequently began exploring software agents that assist users in the creative process. He was further
inspired by his own secondary-career efforts writing fiction in the Master's of creative writing program at
USF, and the desire he perceived for better tools to help in the creative process.
In the summer of 2001, the PI and two USF students began work on WebTop, a tool based on the high-
level objectives described above. A working prototype has been designed and implemented, with
preliminary ideas and findings reported in [29,30]. The team is enthusiastic about the results so far, as the
prototype is already a useful and illuminating tool (to its handful of users), and numerous interesting and
important questions have presented themselves.
Figure 2. The WebTop interface. The right panel is the currently open document (either a page rendered in an embedded Internet Explorer browser, or an HTML file in the WebTop WYSIWYG editor). The left panel is the tree view showing, in this case, the open page's three inward links, six outward links, and five content-related links mined from the WWW. All of these related documents are considered one degree of separation away. The fourth outward link (ice.cs.usfca.edu/projects.shtml) has been expanded to show its relations, which are considered two degrees of separation from the open document.
The PI plans to focus on this new line of research for the long term, just as he spent the years 1990-1999
working in Programming by Example. The long-term goal is the design of an ideal tool (agent) for
organizing and exploring the personal information space, its periphery, and the WWW beyond.
OVERVIEW OF PRELIMINARY DESIGN
The initial WebTop design provides a tree view of the personal web that is tightly integrated with an
editor/browser (see Figure 2). The tree view is a zero-input interface [20], meaning it pushes document
context information to the user without the user typing keywords or clicking a mouse or switching
applications to a search engine. The user can instead intermittently glance at the information and follow
one of the suggested links, or just continue on with the current task.
When a document d is opened in the editor/browser, WebTop performs a number of operations:
1. Finds outward links: d is parsed to find its outward links. If d is a directory, its outward
links include all of its subdirectories and files. If d is a document, its outward links are its
hyperlinks (in upcoming versions this will be generalized to include citation links and any other
explicit reference that can be mapped to a set of URLs). This list will also contain edge links,
which are discussed under Edge Links below.
2. Finds inward links: If d is a web document, a wrapper is called that invokes an advanced search
facility in Google. This wrapper returns all web documents that refer to d. Whether d is a user or
web document, all documents from the personal web that refer to d are then added to the list of
inward links.
3. Finds content-related links: The n most important terms in d are identified using a TFIDF-based
algorithm [27]. These terms are then used to perform one or more of three possible searches. The
first uses a Google wrapper to perform a standard search of the WWW. The second searches only
documents within the personal web. The third begins in d and crawls out to the WWW following
inward and outward links and using a BFS algorithm [7]. The user may choose which types of
searches are performed in the options dialog.
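The third search in step 3, a breadth-first crawl outward from d along both inward and outward links, can be sketched as follows. This is a minimal illustration rather than WebTop's implementation: the in-memory link maps, the `invert` helper, the function name `crawl_neighborhood`, and the `max_degree` cutoff are all assumptions introduced here.

```python
from collections import deque
from itertools import chain


def invert(outward):
    """Build an inward-link map from an outward-link map (hypothetical helper)."""
    inward = {}
    for source, targets in outward.items():
        for target in targets:
            inward.setdefault(target, []).append(source)
    return inward


def crawl_neighborhood(seed, outward, inward, max_degree=2):
    """Breadth-first crawl from `seed`, following links in both directions.

    Returns a dict mapping each reached document to its degree of
    separation from the seed, up to `max_degree`.
    """
    degree = {seed: 0}
    queue = deque([seed])
    while queue:
        doc = queue.popleft()
        if degree[doc] == max_degree:
            continue  # do not expand past the requested radius
        for neighbor in chain(outward.get(doc, []), inward.get(doc, [])):
            if neighbor not in degree:
                degree[neighbor] = degree[doc] + 1
                queue.append(neighbor)
    return degree


# Toy link graph: q -> p -> d -> {a, b}, a -> c -> e
outward = {"q": ["p"], "p": ["d"], "d": ["a", "b"], "a": ["c"], "c": ["e"]}
inward = invert(outward)
neighborhood = crawl_neighborhood("d", outward, inward, max_degree=2)
```

In this toy graph, a, b, and p are one degree from d, while c and q are two degrees away; e lies three degrees out and is not reached.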
The compiled lists for each relationship type are then ranked as described below (see Personalized
Document Ranking). A user-configurable number of links of each relationship type is then displayed, color-coded
by type, in the first level of the tree view. For example, the default configuration displays five inward
links, five outward links, and five content-related links. A “more” node is provided under the nodes of
each type. When expanded, additional relations at the same level in the tree (degree of separation) are
shown.
When the user expands any of the related links, the process is re-triggered and a set of documents two
degrees of separation from d are displayed. Note that whether the expanded first level link is an inward
(i), outward (o), or content-related (c) link of d, the second level will display relationships of all three
types. So the second level can show grandparents of d (i-i), grandchildren of d (o-o), siblings of d (i-o or
o-i) and relatives defined by the other combinations (c-c, c-i, i-c, c-o, o-c).
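The nine second-level combinations named above can be enumerated mechanically. The sketch below is only illustrative; the dictionary names and the label "other relative" are assumptions, not terms from the WebTop design.

```python
from itertools import product

# One-letter codes for the three relationship types described in the text.
KINDS = {"i": "inward", "o": "outward", "c": "content-related"}

# Family-style names the text gives to four of the two-step paths.
NAMED = {
    ("i", "i"): "grandparent",
    ("o", "o"): "grandchild",
    ("i", "o"): "sibling",
    ("o", "i"): "sibling",
}


def second_level_paths():
    """Label every two-step path (first-level type, second-level type)."""
    return {
        f"{first}-{second}": NAMED.get((first, second), "other relative")
        for first, second in product(KINDS, repeat=2)
    }


paths = second_level_paths()
```

Three relationship types at each of two levels give 3 x 3 = 9 paths, which matches the nine combinations listed in the paragraph.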
An earlier version of WebTop showed the various relationship types in separate panels. This scheme was
advantageous in that inward links could be shown with a conceptually intuitive, backward oriented (left-
growing) tree, and outward links in a right-growing tree. Also, the separation of types was clearer than
with color-coding or icons. However, early users of WebTop complained that they couldn’t reach siblings
(i-o or o-i) with a single click in the tree view-- they had to open an outward link as the “current
document” to see an (o-i) relationship. Similarly, they complained that it was difficult to view the inward
or outward links of the displayed content-related documents (c-i or c-o).
Personalized Document Ranking
WebTop exposes more relationship types than other tools, all within a single view, so its ranking
algorithm, i.e., the way it chooses which links to display, is extremely important. Current search tools and
reconnaissance agents use a combination of similarity comparison and popularity measurement to rank
documents. Similarity comparisons can include meta-data as well as internal content analysis to compute
how similar a document is to the seed document. Popularity measurements count how many inward links
a particular document has, and recursively, how many links its inward links have, etc. Such analysis can
identify “hubs” and “experts” [14] and, in general, documents that others have found to be relevant.
The link analysis of conventional search tools is completely impersonal. Every user gets the same results,
no matter whether a user's personal web contains zero or a thousand links to a page under consideration.
A personal web agent, on the other hand, can consider the user’s own preferences in ranking documents.
If numerous documents within the personal web refer to the document under consideration (or many
documents point to documents that point to it, etc.), that document can be ranked higher.
Various algorithms are possible for personalized link analysis. One involves the degrees of separation as
part of the measurement. Instead of just counting the number of documents from the personal web that
refer to a "found" document under consideration, a higher score is assigned to documents near the
currently open document (the seed document). The argument for such a scheme is similar to the old
dating adage: getting fixed up on a date is beneficial, since one is more likely to prefer friends of friends
than strangers. Here, the intuition is that one is more likely to prefer documents pointed to by neighbors
of the current document over documents pointed to by documents many degrees of separation away.
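One way to make the degree-of-separation weighting concrete is sketched below. The proposal does not fix a formula, so the decay 1 / (1 + distance) is purely an assumption for illustration, as are the function and parameter names.

```python
def personalized_score(candidate, inward, degree_from_seed, personal_web):
    """Score a candidate by the personal-web documents that link to it.

    Each referring document contributes more the closer it is to the seed.
    Assumption: the weight decays as 1 / (1 + degrees of separation).
    """
    score = 0.0
    for referrer in inward.get(candidate, []):
        if referrer not in personal_web:
            continue  # only the user's own web counts toward this score
        distance = degree_from_seed.get(referrer)
        if distance is not None:
            score += 1.0 / (1 + distance)
    return score


# Both candidates have two referrers, but x is cited by a near neighbor.
inward = {"x": ["near", "far"], "y": ["far", "farther"]}
degree_from_seed = {"seed": 0, "near": 1, "far": 4, "farther": 5}
personal_web = {"near", "far", "farther"}

score_x = personalized_score("x", inward, degree_from_seed, personal_web)
score_y = personalized_score("y", inward, degree_from_seed, personal_web)
```

A plain count of personal-web referrers would tie x and y at two each; with the proximity weight, x ranks higher because one of its referrers is a direct neighbor of the seed.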
Note that such a scheme makes use of user profiling information to decide which documents to suggest.
Unlike tools that use a single user profile, or multiple profiles the user must explicitly specify [20], the
profiles in WebTop are implicitly defined and chosen dynamically as "the current document and its
neighbors".
Besides improving the relevancy of suggested documents, personalized ranking can combat the “rich get
richer” monopolistic effect of global link analysis algorithms. Giving preference to hubs and experts has
the effect of making them even stronger hubs and experts, since more and more other documents will find
them and then refer to them. Because a personalized ranking system focuses on user preference, it is not
affected by the follow-the-leader mentality of the masses.
Multi-Reference Saving
WebTop is also designed to help users organize documents in the personal web. Specifically, the system
circumvents the traditional save and open operations and the traditional notion of directory in order to
collapse the directory, bookmark, and direct-link hierarchies into a single topic hierarchy. Physically, a
file, bookmark, and document are essentially the same. A directory (bookmark) is just a set of links to
documents, different from a document (e.g., web page) only in that it is restricted to containing only links
and no other text or graphics. However, conventional tools treat directories in special ways. Editors allow
documents to be saved in directories, and do not allow directories to be edited or annotated. File managers
display directories differently from other documents, and browsers provide special directories for saving
bookmarks.
Figure 3. The WebTop SaveAs Dialog. A user saves a document by adding links to it from other documents. A user "bookmarks" a web document using this same dialog. The nodes shown in the dialog above may be directories, user documents, or web documents-- each is treated as a collection of links. Note that the Save dialogs only show outward links, not inward links or content-related links as in the tree view.
The WebTop design provides for a fundamentally different method of saving and bookmarking files.
Whether editing an HTML or Word page, or browsing a web document, a user “saves” not by adding a
link in some special “directory” file, but by adding link(s) in regular documents. When the user selects
File | Save, the save dialog does not show the file system directory; this directory is completely hidden
from the user. Instead, it displays the outward links of a special HTML document called Root.html (see
Figure 3). Root.html is an HTML document that serves as the root of the personal web. It is different
from other documents only in that the editing and browsing tools display it in the save and open dialogs.
The system parses Root.html to find its outward links, then displays those linked documents as first-level
nodes in the tree within the save/open dialog. When the user expands one of these nodes, the system
parses the corresponding document and displays its outward links. Check boxes are presented next to
each node, so the user can select any number of documents to which a link to the current document
should be “saved”. After the user closes the dialog, the context view is updated to show the newly
specified inward links to the current document.
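The save flow just described (parse a document for its outward links, then record a link to the current document from every checked node) might look roughly like the sketch below. It uses Python's standard `html.parser`; the names `LinkCollector`, `outward_links`, `save_under`, and the in-memory `link_store` are hypothetical and do not come from WebTop.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def outward_links(html_text):
    """Return the outward links of a document, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def save_under(parents, document, link_store):
    """'Save' a document by linking to it from each checked parent node."""
    for parent in parents:
        children = link_store.setdefault(parent, [])
        if document not in children:
            children.append(document)


root_html = (
    '<html><body><a href="research.html">Research</a> '
    '<a href="fiction.html">Fiction</a></body></html>'
)
first_level = outward_links(root_html)  # nodes shown in the save dialog

link_store = {}
save_under(first_level, "draft.html", link_store)
```

Here the same document is "saved" under two parents at once, which is the multi-reference behavior the check boxes allow.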
The open operation works in a similar manner. The open dialog appears displaying Root.html and its
outward links. The user can select any node, or expand a node to see the next level of documents. Double-
clicking on a node opens the document as the current document. The document appears in the
editor/browser, and its context is displayed in the tree view. Documents can also be opened directly from
the tree view by double clicking on a node.
With these fundamentally modified save and open operations, the direct-link becomes the basis for a
single, integrated topic hierarchy. Users can create documents that serve as “directories”, but these
documents are physically no different than other documents, so they can be annotated with text and
graphics. This annotated view appears when the document is opened in the main editor/browser. The
“directory” or “links-only” view of the document appears in the tree view.
Edge Links
Initially, the system was designed so that “saving” a document meant adding a direct hyper-link within
the parent document, if the parent was an HTML user document. The system added the links at the
bottom of the parent page. This scheme ensured that a user could always annotate the links in his
“directory” documents. But in early usability testing, we found that users were often disconcerted by the
addition of actual internal links to a parent document.
Because of this feedback, the system now, by default, adds an edge link to a parent document when a
document is saved. When a document is opened, the tree view displays both its internal and edge links,
while the document viewer (editor or browser) only shows its internal links. Edge links allow a user to
specify relationships between documents without changing either document’s internal text. This is useful
for user-owned documents, especially formal ones such as research papers, and it is essential for web
documents, since the user cannot modify their internal contents.
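The distinction between internal and edge links can be pictured as a small data structure in which edge links are stored alongside a document rather than inside its text. This is a sketch under that assumption; the class and method names are illustrative and say nothing about how the prototype actually stores edge links.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    # Links that appear in the document's own text.
    internal_links: list = field(default_factory=list)
    # Links recorded outside the text, so the document itself is unchanged.
    edge_links: list = field(default_factory=list)

    def viewer_links(self):
        """Links visible in the editor or browser: internal links only."""
        return list(self.internal_links)

    def tree_view_links(self):
        """Links shown in the tree view: internal links plus edge links."""
        extra = [link for link in self.edge_links if link not in self.internal_links]
        return self.internal_links + extra


# A web page the user cannot edit still gains a user-defined relationship.
paper = Document("http://example.org/paper.html", internal_links=["ref1.html"])
paper.edge_links.append("my-notes.html")
```

After the edge link is added, the viewer still shows only `ref1.html`, while the tree view shows both `ref1.html` and `my-notes.html`.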
Summary of Preliminary Design
A prototype implementing most of the above features has been completed, and a handful of super users,
including the PI and the student developers, have enthusiastically begun using the system in their daily
work. From this experience, the PI has determined his research path for the next three years, as detailed in
the next section.
PROPOSED WORK PLAN
The research will be completed using an iterative approach, with four iterations of approximately nine
months' duration. Tasks include studies of related academic and commercial works, tool development,
including documentation and toolkit implementations, and evaluation through user studies and feedback
from public releases.
Designing user studies to evaluate the tool will be challenging, as the objectives of the tool include such
vague notions as, “improves the organization and exploration capabilities of a user,” and “lists more
‘relevant’ documents to the user.” The interface issue is just as challenging, as the proposed tool is taking
a difficult problem of reconnaissance agents, that of choosing the most relevant content-related
documents to list, and including inward and outward links in the mix as well.
Iteration 1
Task 1A: Complete implementation of WebTop 1 (as specified in the preliminary design above).
Work includes implementation of the personalized searching mechanism, full integration of the various
relationship types into a single view, and adding a high degree of user-configurability to the system (e.g.,
the ability to specify how many of each relationship to show, and what searches should be performed to
find content-related links). Simple known algorithms will be used in searching for and ranking
documents. Challenges include finding efficient algorithms for dynamically searching the personal web,
and/or integrating idle-time personal crawls to index documents in the neighborhood.
Along with coding, significant system testing will be completed to ready the tool for public consumption,
and the existing partial Unified Modeling Language (UML) documentation will be refined and updated.
Deliverable: WebTop Public Release 1, including documentation.
Task 1B: Perform a complete and updated related work survey. The proposed research spans
numerous areas within the Human-Computer Interface and Artificial Intelligence disciplines, and both
academic and commercial endeavors.
Deliverable: a paper, A Survey of Techniques for Integrating the Personal and World-Wide Webs.
Task 1C: Evaluate the prototype with informal user studies. As development proceeds, informal user
studies will be performed. These studies will consist of developers watching users use the system in a
free-form fashion, and watching them perform particular tasks provided by the project team. Feedback
will be used to guide the development process and to shape the formal user studies to be performed in
iteration 2.
Iteration 2
Task 2A: Obtain and analyze feedback from public release 1
Collect feedback from known solicited users and from unsolicited comments posted on the project website.
Task 2B: Perform formal user studies
Formal user studies will be performed to measure the effectiveness of WebTop's design. Detailed tasks
will be outlined for the users and specific data measured. Formal studies will be developed to answer
numerous specific questions, including the following:
1. Compare searches in the neighborhood of the personal web with global searches. Are the results
more relevant? Can a hybrid approach, providing results from both, please users?
2. Compare ranking algorithms that use personalized link analysis with standard ranking algorithms
(e.g., Google) that use global link analysis. Are the results more relevant?
3. Analyze the effectiveness of WebTop compared to conventional desktop and web tools. Given the
same amount of time and preparation, will a test group of research paper writers cite a "better" list of
related works? Will their papers be graded higher?
4. Analyze the effectiveness of displaying inward, outward, and content-related links within a single
view, compared to separate views. Does it have a negative effect on traditional file management
tasks?
5. Analyze the effectiveness of multi-reference saving and eliminating the distinction between
directory, document, and bookmark. Will the concept of directory become extinct, or will users define
documents that perform the same function and contain only lists of links?
Deliverable 1: Mining the Neighborhood of the Personal Web-- a paper providing an overview of the
system and a detailed report of the findings from user studies (1) and (2).
Deliverable 2: Improved Writing Through Integrated Editing, File Management, and Searching-- a paper
providing an overview of the system, focusing on user interface issues, and a detailed report of the
findings from user studies 3, 4, and 5.
Task 2C: Explore issues in interfacing with various application/document types. To promote
widespread real-world use, the system needs to parse various types of documents and invoke a WebTop-
like interface from various software tools (e.g., from Microsoft Word). A project objective is also to
develop software that identifies references, other than hyperlinks and citations, and maps them to a
collection of web pages.
Deliverable: A toolkit, built on top of COM, for identifying various references within various document
types and for specializing various tools such as Word, Explorer, and Acrobat. This toolkit has two
purposes. First, because USF is a predominantly undergraduate institution, there will be a high turnover
of students on the project each year, so an easy-to-use toolkit with proper documentation can significantly
reduce the start-up time for new student developers. Second, too often research efforts must duplicate the
same difficult implementation efforts due to "technical" results not being disseminated. Thus, this project
will follow the lead of [23] and make this toolkit publicly available.
Task 2D: Perform normal maintenance and debugging of software.
Deliverable: WebTop Public Release 1.1
Task 2E: Monitor new related academic publications and commercial releases.
Iteration 3
Task 3A: Design and implement WebTop Version 2. The new features for this version will be designed
using feedback acquired from user studies and from the first public release. It will include issues studied
in Task 2C concerning user interface alternatives and increasing the types of references that can be
mapped to other documents. It will also include a refinement of search and ranking algorithms using data
from the personal web, and some form of Group Web, whereby personal webs are used in collaboration.
Deliverable: WebTop Public Release 2.0.
Task 3B: Perform user studies on a particular focus group. User testing will be targeted to a group
consisting of writers and students in the USF Creative Writing Program. The goal is to obtain feedback on
the tool's usefulness specifically for the creative process. Feedback will be used to help in the new design
of the general tool, and in the consideration of specialized tool(s).
Task 3C: Monitor new related academic publications and commercial releases.
Iteration 4
Task 4A: Perform formal user studies on WebTop Release 2. These studies will target new features
and refinements designed during the previous iteration.
Task 4B: Design and implement specialized versions for focus group. The versions will contain
special features designed specifically to help a particular focus group (e.g., creative writers) better
accomplish their tasks. Features will include the identification of domain-specific "links" within
documents, e.g., when the tool identifies the name of an author in a document, it might display a list of
the author's works or similar authors.
Deliverable 1: WebTop for Writers, Release 1.0.
Deliverable 2: A paper titled, A Software Agent for Creative Writers.
Task 4C: Analyze and categorize findings.
Deliverable: A journal paper (TOCHI?) detailing the results of the entire project.
Team Member Responsibilities
The PI will be responsible for the literature and commercial system surveys and updates and the majority
of the writing. The PI will also oversee the design and implementation of the system(s), and the user
studies. The student team members' main responsibility will be implementing the system(s) and
performing user studies, though all project members will read the core related papers, help in the
design of the system, and suggest innovations from their experiences building the system. Some top
students will also take part in writing research papers.
RELATION TO PRESENT STATE OF THE FIELD
This section describes the proposed system from a number of perspectives and in relation to several
research areas.
WebTop as a Reconnaissance Agent
Reconnaissance agents, also called just-in-time information retrieval agents [26], expose similar content
relationships between documents. These agents mine the user's open documents to find the most
characteristic words, then automatically send the words to a search engine (e.g., Google). The user can
intermittently glance at the resulting suggested links and, without changing context from creation mode
(editing), choose whether to follow one of the suggestions, or continue on with the current task.
Reconnaissance agents are said to have zero-input interfaces [20] because searches are performed without
the user stopping work to initiate them.
The proposed project will explore generalizing the basic ideas behind reconnaissance agents. One
generalization is to expose direct-link associations, both inward and outward, and of various types (e.g.,
hyperlink, citation, proper name), along with the content-related documents. One might argue that
displaying the outward links is repetitive, since they appear in the document itself. However, many
documents do not fit in a single window, so often a user must scroll to view all of a document’s links.
Furthermore, the proposed tool can rank the outward links it displays, similar to Letizia [19,20].
Inward links can be as illuminating as outward links, but are ignored by most tools (Google is an
exception, as it has an advanced search feature that lists all documents that link to a given URL).
Following inward links means following the refinement and transformation of an idea, moving ahead in the
history of the idea, instead of backward, as with outward links.
A second generalization of reconnaissance agents is to display all types of links (inward, outward, and
content-related) within a tree structure, instead of the one-dimensional list displayed by most
reconnaissance agents. This allows a user to easily navigate to documents more than one degree of
separation from the open document (see Figure 1 above). With most reconnaissance agents, the user must
open a suggested document to see its content-related and direct links, and can only view one degree of
separation at any time. With the proposed tree view, the user can expand nodes to view a collection of
documents related by various degrees of separation, and various relationship types.
A third generalization of reconnaissance agents is to trigger searches not only when a user edits a
document [4,25,26], browses a web page [2,19,20], or uses some other application [22], but also as the
user traverses the documents in the file manager. When a user selects a document or directory in the file
manager tree view (in this case, the proposed super-file-manager tree view), the agent performs the
reconnaissance-agent-like automated search to find content-related documents and lists these along with
the explicitly linked documents. Because explicit links can include sub-directories and documents within
a directory, the proposed system is reconnaissance agent meets file manager meets document graph.
A fourth generalization of reconnaissance agents concerns the search engine that is automatically
invoked. Reconnaissance agents like the Remembrance Agent [25] search the personal space for related
documents, while others search only the WWW [4], and one, Letizia [19,20], focuses only on pages one
outward link away from the currently open document. The proposed system will search the personal
space, the WWW, and the intersection of the two, i.e., the documents in the neighborhood of the personal
web. The latter is somewhat similar to Letizia's scheme, except that the search will include not only outward
links of the current document, but also its inward links, outward links of inward links, inward links of
outward links, and in general, all documents within a certain radius of the currently open document.
Because the proposed project focuses on the personal web, it will not focus on designing new algorithms
for choosing the words in a document that most characterize it. Instead, the project will make use of
previous research in the area, and specifically some variant of the TFIDF [27] algorithm (which ranks a
word based on how often it occurs in the document, and how seldom it occurs in some large corpus).
Watson [4] uses a refined TFIDF, taking into account heuristics such as “words in bold are more
important” and “words at the top of a document are more important”. Margin Notes [24] uses TFIDF, but
it considers the document to be of too large a granularity to seed optimal searches. Margin Notes thus
treats each section of a document as a search seed, then lists the links related to each section in the
document’s margins. Intellizap [10] is similar, except that it considers only the selected text in the document.
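As a rough illustration of the basic TFIDF idea described above (a word scores high when it is frequent in the document and rare in a larger corpus), consider the sketch below. It is not the variant WebTop, Watson, or Margin Notes uses; the tokenizer, the smoothed inverse document frequency log((1 + N) / (1 + df)), and the function name `top_terms` are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(document, corpus, n=5):
    """Return the n terms of `document` with the highest TF-IDF score."""
    term_counts = Counter(tokenize(document))
    corpus_vocab = [set(tokenize(text)) for text in corpus]
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for vocab in corpus_vocab if term in vocab)
        # Assumed smoothing: a term found in every corpus document scores 0.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq))
        scores[term] = count * idf
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sang",
]
document = "the personal web links the personal documents"
query_terms = top_terms(document, corpus, n=2)
```

In this toy corpus, "the" appears in every document and so receives a score of zero, while "personal", which is frequent in the seed document and absent from the corpus, ranks first. Terms chosen this way would then be sent to the search engine as the query.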
WebTop as a CiteSeer for the Personal Web
A number of systems expose the direct-link relationships between documents [3,13,16,21,23,28]. Some of
the most popular are those that display citation graphs of research papers [16,21]. These systems crawl the
web searching for research papers (in PDF files), then parse those papers to build a citation graph.
Butterfly [21] provides a 3D visualization of the citation graph of a collection of research papers. The left
side of the butterfly shows the documents that cite a given document (inward links), the right side shows
the documents it cites (outward links), and the third dimension provides information specific to the given
document. CiteSeer [16] displays similar data in 2D format. Besides exposing direct-link citation
relationships, it also exposes content-related documents in a “working bibliography” panel. CiteSeer is of
revolutionary utility to those of us who remember long nights in the library, manually building citation
graphs and searching through conference proceedings and journals.
One way to view the proposed tool is as a CiteSeer for the personal web. Instead of displaying
relationships between a domain-specific collection of documents, it displays relationships between
personal documents and between personal documents and web documents “in the neighborhood”.
The WebTop prototype only displays hyper-link and directory-document relationships. Ideally, a WebTop
user could also traverse CiteSeer-like citation links both within the personal web and out to the WWW.
But even if a CiteSeer-like citation parser were executed over the collection of documents in the personal
web, it would not allow “citation” navigation across the boundary to the WWW, since citations, unlike
hyper-links (URLs), cannot be reached with a standard browser. Cameron has proposed a “universal
bibliographic and citation database linking every scholarly work ever written” [5], but until a major
search engine makes this a reality, users will not have the capability of following any citation from the
personal web to the WWW.
The issue with citation links is indicative of a more general issue that the proposed project will explore,
that of identifying any type of explicit reference, and automatically mapping it to one or more URLs.
WebTop as a User–Assisted Personal Crawler
Domain-specific search engines such as CiteSeer can provide more relevant results by restricting the search
space to a limited collection of documents. Focused crawlers [6,7,8,9] are often used to build up the
documents in such a domain. Beginning with a set of seed documents representing the domain’s topic,
crawling proceeds along outward links, with documents most similar to the seed documents visited first.
A personal crawler [6,7] is a domain-specific searcher, where the “domain” is represented with a set of
documents in the personal web. The conventional algorithm used is best-first search (BFS). All links of
the seed documents are ranked according to their similarity to the seed, then the best link is “visited”,
meaning its links are added to the list of ranked documents. The best link in the updated ranking list is
then visited, and so on. Crawling proceeds in this manner until a certain depth or width is reached. The
top ranked documents are then presented to the user.
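The best-first procedure just described can be sketched as follows, using a priority queue as the ranked list. This is an illustration under stated assumptions: the in-memory `pages` dictionary stands in for fetching documents, the Jaccard word overlap in `similarity` stands in for a TFIDF-based measure, and all names and sample pages are hypothetical.

```python
import heapq

def similarity(a, b):
    """Jaccard overlap of word sets; a stand-in for a TFIDF-based measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def best_first_crawl(pages, seeds, max_visits=10, top_n=5):
    """pages: {url: (text, [outward urls])}. Returns the top-ranked non-seed urls."""
    seed_text = " ".join(pages[s][0] for s in seeds)
    scored = {}    # url -> similarity to the seed documents
    frontier = []  # max-heap via negated scores

    def enqueue(urls):
        for url in urls:
            if url in pages and url not in seeds and url not in scored:
                scored[url] = similarity(seed_text, pages[url][0])
                heapq.heappush(frontier, (-scored[url], url))

    for s in seeds:
        enqueue(pages[s][1])

    visits = 0
    while frontier and visits < max_visits:
        _, url = heapq.heappop(frontier)  # the best link so far is "visited"
        visits += 1
        enqueue(pages[url][1])            # its links join the ranked list

    return sorted(scored, key=lambda u: (-scored[u], u))[:top_n]

pages = {
    "seed": ("reconnaissance agents search the web", ["agents", "sports"]),
    "agents": ("software agents search documents", ["letizia"]),
    "sports": ("football scores today", ["tickets"]),
    "letizia": ("letizia reconnaissance agents browse the web", []),
    "tickets": ("buy football tickets", []),
}
ranked = best_first_crawl(pages, ["seed"], max_visits=2)
```

Here the visit budget plays the role of the depth or width limit: with two visits the crawl follows "agents" and then "letizia", so the links of the unrelated "sports" page are never expanded.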
One weakness of BFS is that it will often miss highly relevant pages that are more than one link away
from the “current” document [9]. For instance, a crawler spawned from a document concerning
reconnaissance agents might arrive at the MIT research directory page, find it unrelated to the seed
documents, and thus never follow its links. This would be unfortunate, since Henry Lieberman of MIT, an
expert in agents, has a home page that links to numerous documents concerning reconnaissance agents.
One approach [9] to this problem uses machine learning techniques to teach the crawler to sacrifice
short-term gains in the interest of overall crawl performance, i.e., the crawler learns to identify promising
directory-like pages and follow their links, so relevant pages below them are considered.
The proposed system takes a different approach to the problem: it allows the user to assist the crawl.
Beginning with the open working document, the tool crawls out one level at a time, following both
inward and outward links, and displays the links in the tree view. The user assists this real-time crawl by
expanding the most desirable links. The tool assists the user by ranking the documents and displaying the
most related links.
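One step of this user-assisted crawl might look like the sketch below: each expansion gathers the inward and outward links of the chosen document and ranks them against the working document, and the user decides which node to expand next. The function names, the word-overlap ranking, and the sample data are hypothetical, chosen only to mirror the MIT directory example above.

```python
def word_overlap(a, b):
    """Number of distinct words two texts share (a stand-in relevance measure)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def expand(doc, outward, texts, working_text, top_n=3):
    """Return the inward and outward neighbors of `doc`, most related first."""
    inward = [src for src, targets in outward.items() if doc in targets]
    neighbors = set(outward.get(doc, [])) | set(inward)
    neighbors.discard(doc)
    ranked = sorted(
        neighbors,
        key=lambda d: (-word_overlap(working_text, texts.get(d, "")), d),
    )
    return ranked[:top_n]

outward = {
    "draft": ["mit_dir", "news"],
    "mit_dir": ["lieberman", "admissions"],
    "survey": ["draft"],
}
texts = {
    "mit_dir": "mit research directory",
    "news": "daily news headlines",
    "lieberman": "reconnaissance agents letizia",
    "admissions": "apply to mit",
    "survey": "survey of reconnaissance agents",
}
working_text = "notes on reconnaissance agents"

first_level = expand("draft", outward, texts, working_text)
# The user, recognizing the directory page as promising, expands it by hand.
second_level = expand("mit_dir", outward, texts, working_text)
```

The directory page scores poorly against the working document, so an automatic best-first crawl would be unlikely to follow it; the user's choice to expand it is what surfaces the relevant page beneath it.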
WebTop also crawls out more than one degree on each expansion, and displays these links as “content-
related” links. Currently, a BFS algorithm is used in this search, but a machine learning technique like [9]
could improve this process.
WebTop as an End-User Tool for Defining Domain-Specific Searches
Meta-search engines such as [17] automatically select domain-specific search areas to send keywords to
based on the user's current task (open documents). These systems keep a fixed list of search areas along
with profiles of each that are matched with data from the current documents.
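As a rough sketch of this kind of profile matching, the example below scores each search area by how many of its profile words appear in the open document. It illustrates the general idea only and does not reproduce the method of [17]; the area names, profile words, and the function `select_search_areas` are hypothetical.

```python
def select_search_areas(doc_text, profiles, top_n=2):
    """Pick the search areas whose profile words best match the open document."""
    doc_words = set(doc_text.lower().split())
    scores = {area: len(doc_words & set(words)) for area, words in profiles.items()}
    # Keep only areas with at least one matching word, best match first.
    matching = [area for area in scores if scores[area] > 0]
    return sorted(matching, key=lambda a: (-scores[a], a))[:top_n]

# A fixed list of search areas, each with a hand-written profile (assumed data).
profiles = {
    "research papers": ["citation", "paper", "algorithm"],
    "news": ["election", "weather"],
    "software": ["algorithm", "compiler", "bug"],
}
areas = select_search_areas("an algorithm paper with one citation", profiles)
```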
WebTop provides all the facilities necessary for a user to define domain-specific searches. A user can
simply add links from document(s) in the personal web to pages in the WWW that make up the "domain".
When the user then selects the document pointing to those pages, or opens the document in the
editor/browser, WebTop invokes a search of the web beginning with that document and crawling through
the inward and outward links for some depth and breadth. In essence, WebTop dynamically chooses
search areas and dynamically builds them.
WebTop as an Integrated Searching/Browsing Tool
The two standard methods of exploration are keyword search and categorical browsing. The proposed
tool, in effect, provides a single-click interface for performing both types of navigation simultaneously.
Every time a document in the tree view is expanded, the user is browsing the sub-documents of the
"directory" defined by the document, and invoking a keyword search of the WWW or personal web
neighborhood (this is the content-related search).
RELATION TO WORK IN PROGRESS ELSEWHERE
The proposed project is most related to work in progress at the NEC Research group on Web
Technologies [31], the Intelligent Information Lab at Northwestern University [32], and the Artificial
Intelligence Lab at the University of Arizona [32]. The NEC group is responsible for CiteSeer and
numerous articles concerning focused crawling and context in web search [9,15,16]. The lab at
Northwestern is responsible for Watson [4] and numerous other software agents. The group at the
University of Arizona has focused on personal (client-based) crawls [7,8]. The proposed project deviates
from these others in its plan to integrate reconnaissance information and other contextual data (explicit
references) with traditional file management functionality, and in its particular focus on the personal
information space and its relationship with the WWW.
BROADER IMPACT OF THE PROJECT
The project will have a major impact on the USF computer science department and its students. First, as
the preliminary work has shown, students are extremely enthusiastic about building visual tools, software
agents, and web applications. This enthusiasm can lead (and already has led) them to great accomplishments
and an order of magnitude leap in knowledge and capabilities. Second, working on the project requires
students to master a number of technologies, including COM, web page wrappers, parsing, and advanced
graphical user interface components, so students will not only gain research experience but will learn
practical technologies that can help them get jobs. Finally, the research assistant funds will allow top
students to work on this important and educational research during the summer and school year, instead
of supporting their education through non-educational employment.
The project will have a positive effect on all computer science students, not just those hired as research
assistants. The PI has historically brought his research into the classroom, most productively by assigning
projects based on his research systems. For instance, over the years one of the most popular projects in his
software engineering courses has been the development of a simplified PBD interface builder, based on
the PI's Pavlov system [see Biographical Sketch for citations]. Recently, he assigned a simplified
reconnaissance agent as a project in his freshman programming course.
Impact on USF Computer Science
USF, and its Computer Science department, is on the cusp of transforming from a predominantly teaching
school to one also known for its research. NSF funding for the proposed project can significantly impact
this transformation through support for research assistants, the PI, and travel. It can also lead to
infrastructure improvements such as labs set aside specifically for research groups.
Potential Impact on Society
The proposed project explores the fundamental ways in which people organize and explore information,
and ultimately, the way in which people create new ideas. Because the focus is on the personal web, and
its intersection with the outside world, any innovations will affect all people who use computers in their
daily lives.