Bringing Order to the Web : Automatically Categorizing Search Results
Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Hao Chen, Susan Dumais
Outline
Motivation
Objective
Introduction
Related Work
Text Classification
User Interface
User Study
Conclusions
Personal Opinion
Motivation
With the exponential growth of the Internet, it has become more and more difficult to find information.
Most web search services return a ranked list of web pages in response to a user’s search request.
Web pages on different topics, or on different aspects of the same topic, are mixed together in the returned list.
Objective
To combine the advantages of structured topic information in directories with the broad coverage of search engines, we built a system that takes the web pages returned by a search engine and classifies them into a known hierarchical structure such as LookSmart’s Web directory.
Introduction
Web search services such as AltaVista, InfoSeek, and MSNWebSearch help people to find information on the web.
Most of these systems return a ranked list of web pages in response to a user’s search request.
Introduction
The system consists of two main components:
A text classifier that categorizes web pages on-the-fly.
A user interface that presents the web pages within the category structure and allows the user to manipulate the structured view.
Related Work
Generating structure
Using structure to support search
Generating structure
Three general techniques have been used to organize documents into topical contexts:
Structural information (metadata) associated with each document
Clustering
Classification
Using structure to support search
A statistical text classification model is trained offline on a representative sample of Web pages with known category labels.
At query time, new search results are quickly classified on-the-fly into the learned category structure.
Benefits: known and consistent category labels, and easy incorporation of new items into the structure.
Text Classification: Data Set
A collection of web pages from LookSmart’s Web Directory
13 top-level categories, 150 second-level categories, and over 17,000 categories in total.
Text Classification: Pre-processing
Plain text was extracted from each web page.
In addition, the title, description, keyword, and image tag fields were extracted if they existed.
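The pre-processing step above can be sketched with Python's standard `html.parser`. This is only an illustration, not the authors' extractor: the class and function names are hypothetical, and reading the "image tag fields" as `alt` text is an assumption.

```python
from html.parser import HTMLParser


class PageFieldExtractor(HTMLParser):
    """Collect plain text plus title, meta description/keywords, and image alt text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "keywords": "",
                       "image_alt": [], "text": []}
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.fields[name] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            # Assumption: the "image tag field" is the alt text.
            self.fields["image_alt"].append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth or not text:
            return  # ignore script/style bodies and whitespace
        if self._in_title:
            self.fields["title"] += text
        else:
            self.fields["text"].append(text)


def extract_fields(html):
    """Return a dict of the extracted fields; missing fields stay empty."""
    parser = PageFieldExtractor()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    fields["text"] = " ".join(fields["text"])
    return fields
```

Fields that a page lacks simply stay empty, which matches the "if they existed" condition on the slide.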
Text Classification: Classification
A Support Vector Machine (SVM) algorithm was used as the classifier.
13,352 pre-classified web pages were used to train the model for the 13 top-level categories, and between 1,985 and 10,431 pages for the second-level categories.
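To make the train-offline, classify-at-query-time idea concrete, here is a minimal sketch of a linear SVM in plain Python. The slides only say that an SVM was used; the training procedure here (Pegasos-style stochastic subgradient descent on the hinge loss), the bag-of-words features, the one-vs-rest setup, the category names, and the toy training texts are all assumptions made for illustration.

```python
import random
from collections import Counter


def featurize(text):
    """Bag-of-words term counts (an illustrative feature choice)."""
    return Counter(text.lower().split())


def train_linear_svm(docs, labels, target, epochs=50, lam=0.01, seed=0):
    """Binary linear SVM (target vs. rest) trained with Pegasos-style SGD."""
    rng = random.Random(seed)
    examples = [(featurize(d), 1 if y == target else -1)
                for d, y in zip(docs, labels)]
    weights, t = {}, 0
    for _ in range(epochs):
        rng.shuffle(examples)
        for feats, y in examples:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(weights.get(w, 0.0) * c for w, c in feats.items())
            scale = 1.0 - eta * lam  # shrink step from the L2 regularizer
            for w in weights:
                weights[w] *= scale
            if margin < 1:  # hinge loss is active: move toward this example
                for w, c in feats.items():
                    weights[w] = weights.get(w, 0.0) + eta * y * c
    return weights


def train_one_vs_rest(docs, labels):
    """Offline step: one binary model per category."""
    return {cat: train_linear_svm(docs, labels, cat) for cat in sorted(set(labels))}


def classify(models, text):
    """Query-time step: pick the category whose model scores highest."""
    feats = featurize(text)

    def score(weights):
        return sum(weights.get(w, 0.0) * c for w, c in feats.items())

    return max(models, key=lambda cat: score(models[cat]))


# Hypothetical toy training set; the real system used LookSmart pages.
TRAIN = [
    ("jaguar car dealer price", "Automotive"),
    ("jaguar engine sedan luxury", "Automotive"),
    ("jaguar cat jungle habitat", "Animals"),
    ("jaguar predator wildlife rainforest", "Animals"),
    ("jaguars football team jacksonville", "Sports"),
    ("jaguars season game score", "Sports"),
]
docs, labels = zip(*TRAIN)
models = train_one_vs_rest(docs, labels)
print(classify(models, "car engine price sedan"))
```

The expensive part (`train_one_vs_rest`) runs once offline, while `classify` is a single sparse dot product per category, which is what makes on-the-fly categorization of search results feasible.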
User Interface
The search results were organized into hierarchical categories.
User Interface
Under each category, the web pages belonging to that category were listed.
The category could be expanded (or collapsed) on demand by the user.
To save screen space, only the title of each page was shown (the summary could be viewed as hover text).
User Interface
Only top-level categories are shown on the first screen:
Helps the user identify domains of interest quickly.
Saves a lot of screen space.
Classification accuracy is usually higher at the top level.
Computationally faster.
Subcategories add little when they contain only a few pages.
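The category view described in the User Interface slides can be sketched as a small grouping step. The `Top/Sub` category path format, the function names, and the sample results are hypothetical; the sketch only shows grouping ranked results by top-level category and listing titles for expanded categories.

```python
from collections import defaultdict


def group_by_top_level(results):
    """Group (title, category_path) pairs by top-level category, keeping rank order."""
    groups = defaultdict(list)
    for title, path in results:
        top_level = path.split("/")[0]  # assumed "Top/Sub" path format
        groups[top_level].append(title)
    return dict(groups)


def render(groups, expanded=()):
    """Show each category with its page count; list titles only when expanded."""
    lines = []
    for category, titles in groups.items():
        marker = "-" if category in expanded else "+"
        lines.append(f"[{marker}] {category} ({len(titles)})")
        if category in expanded:
            lines.extend(f"    {title}" for title in titles)
    return "\n".join(lines)


# Hypothetical ranked results for the query "jaguar".
results = [
    ("Jaguar Cars Official Site", "Automotive/Manufacturers"),
    ("Jaguar Facts", "Animals/Mammals"),
    ("Used Jaguar Dealers", "Automotive/Dealers"),
]
print(render(group_by_top_level(results), expanded={"Automotive"}))
```

Collapsed categories cost one line each, which is the screen-space saving the slide refers to.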
User Study
Compare the Category Interface to the List Interface.
User Study
Query: “jaguar”
Twenty items are shown initially.
Summaries are shown on hover.
Both interfaces contain a ShowMore button.
The Category interface also has a SubCategory button.
Eighteen subjects.
User Study
Each subject worked with three windows.
User Study-result
Subjective questionnaire measures
“easy to use” (6.4 vs. 3.9, t(17)=6.41; p<<0.001)
“liked using it” (6.7 vs. 4.3, t(17)=6.01; p<<0.001)
“confident that I could find the information if it was there” (6.3 vs. 4.4, t(17)=4.91; p<<0.001)
“easy to get a good sense of the range of alternatives” (6.4 vs. 4.2, t(17)=6.22; p<<0.001)
“prefer this to my usual search engine” (6.4 vs. 4.3, t(17)=4.13; p<<0.001)
On all of these measures, subjects much preferred the Category interface.
User Study-result
Subjective questionnaire measures
“Summaries in hover text were useful” in both interfaces (6.5 vs. 6.4, t(17)=0.36; p<0.72)
“ShowMore option was useful” (6.5 vs. 6.1, t(17)=1.94; p<0.07)
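The t(17) values reported above are consistent with a paired t-test over the eighteen subjects, each rating both interfaces. The slides do not name the test, so this reading is an assumption. Under it, with d_i the difference between subject i's two ratings and n = 18:

```latex
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\mathrm{df} = n - 1 = 17
```

A large |t| means the mean difference is large relative to how much the differences vary across subjects.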
User Study-result
Objective measures
Search time: 56 s for Category vs. 85 s for List (F(1,16)=12.94; p=.002)
User Study-result
Objective measures
There is no interaction between order and interface (F(1,16)=1.23; p=0.28)
User Study-result
Objective measures
Search time by query difficulty: Top20 (57 s) vs. NotTop20 (98 s) (F(1,56)=16.5; p<<.001)
No interaction between query difficulty and interface (F(1,56)=2.52; p=.12)
User Study-result
Objective measures
Subjects in the List interface performed more hovering (viewing summaries) in the course of finding the items than those in the Category interface (4.60 vs. 2.99, t(17)=-5.54; p<.001)
The number of pages subjects actually viewed in the right window is somewhat larger in the List interface (1.41 vs. 1.23, t(17)=-2.08; p<.053)
Subjects in the Category interface used more expansion operations (0.78 vs. 0.48, t(17)=3.54 ; p<.003)
Conclusions
SVM classifier
Consistent category information assists the user in quickly focusing in on task-relevant information.
The user study results convincingly demonstrate that the category interface is superior to the list interface on both subjective and objective measures.
Personal Opinion