Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate...

Preview:

DESCRIPTION

Motivation With the exponential growth of the Internet, it has become more and more difficult to find information. Most of web search services return a ranked list of web pages in response to a user’s search request. Web pages on different topics or different aspects of the same topic are mixed together in the returned list.

Citation preview

Bringing Order to the Web : Automatically Categorizing Search Results

Advisor : Dr. HsuGraduate : Keng-Wei ChangAuthor : Hao Chen

Susan Dumais

outline Motivation Objective Introduction Related Work Text Classification User Interface User Study Conclusions Personal Opinion

Motivation With the exponential growth of the Internet, it

has become more and more difficult to find information.

Most of web search services return a ranked list of web pages in response to a user’s search request.

Web pages on different topics or different aspects of the same topic are mixed together in the returned list.

Objective To combine the advantage of structured topic

information in directories and broad coverage in search engines, we built a system that takes the web pages returned by a search engine and classifies them into a known hierarchical structure such as LookSmart’s Web directory.

Introduction

Web search services such as AltaVista, InfoSeek, and MSNWebSearch help people to find information on the web.

Most of these systems return a ranked list of web pages in response to a user’s search request.

Introduction

The system consists of two main components A text classifier that categorizes web pages on-th

e-fly, A user interface that presents the web pages withi

n the category structure and allows the user to manipulate the structured view.

Related Work Generating structure Using structure to support search

Generating structure Three general techniques have been used to

organize documents into topical contexts. Structural information (meta data) associated with

each document. Clustering classification

Using structure to support search A statistical text classification model is trained

offline on a representative sample of Web pages with known category labels.

At query time, new search results are quickly classified on-the-fly into the learned category structure. The benefit of using known and consistent

category labels Easily incorporating new items into the structure.

Text Classification Data Set

A collection of web pages from LookSmart’s Web Directory

13 top-level categories, 150 second-level categories, and over 17,000 categories in total.

Text Classification Pre-processing

Extracted plain text from each web page. In addition, the title, description, keyword, and

image tag fields were also extracted if they existed.

Text Classification Classification

A Support Vector Machine (SVM) algorithm was used as the classifier.

Used 13,352 pre-classified web pages to train the model for the 13 top-level categories, and between 1,985 and 10,431 for second-level categories.

User Interface The search results were organized into hierarchical categories.

User Interface Under each category, web pages beloingin to

that category were listed. The category could be expanded (or collapse

d) on demand by the user. To save screen space, only the title of each p

age was shown (the summary can be viewed by hover text)

User Interface Only top-level categories on the first screen.

Help the user identify domains of interest quickly. Save a lot of screen space. Classification accuracy is usually in top level. Computationally faster If only few pages in the subcategories.

User Study Compare the Category Interface to the List

Interface

User Study query “jaguar” Twenty items are shown initially Summary are shown on hover Contain ShowMore button Category interface has a SubCategory button Eighteen subjects

User Study The subject worked with three windows

User Study-result Subjective questionnaire measures

“easy to use” (6.4 vs. 3.9, t(17)=6.41 ; p<<0.001) “liked using it” (6.7 vs. 4.3, t(17)=6.01 ; p<<0.001) “confident that I could find the information if it was

there” (6.3 vs. 4.4, t(17)=4.91 ; p<<0.001) “Easy to get a good sense of the range of

alternatives” (6.4 vs. 4.2, t(17)=6.22 ; p<<0.001) “prefer this to my usual search engine”

(6.4 vs. 4.3, t(17)=4.13 ; p<<0.001) On all of subjects much preferred the Category

User Study-result Subjective questionnaire measures

“Summaries in hover text was useful in both interfaces” (6.5 vs. 6.4, t(17)=0.36 ; p<0.72)

“ShowMore option was useful” (6.5 vs. 6.1, t(17)=1.94 ; p<0.07)

User Study-result Objective measures

Search Time 56s for Category 85s for List F(1,16)=12.94 ; p=.002

User Study-result Objective measures

There is no interaction between order and interface (F(1,16)=1.23 ; p=0.28)

User Study-result Objective measures

Top20(57s) NotTop20(98s) F(1,56)=16.5 ; p<<.001

No interaction between query difficulty and interface (F(1,56)=2.52 ; p=.12)

User Study-result Objective measures

Subjects performed in the course of finding the items than those in the Category interface (4.60 vs. 2.99, t(17)=-5.54 ; p<.001)

Subjects actually viewed in the right window is somewhat larger in the List interface (1.41 vs. 1.23, t(17)=-2.08; p<.053)

Subjects in the Category interface used more expansion operations (0.78 vs. 0.48, t(17)=3.54 ; p<.003)

Conclusions SVM classifier Consistent category information to assist the

user in quickly focusing in on task-relevant. Across user study, the results convincingly

demonstrate that the category interface is superior to the list interface in both subjective and objective measures.

Personal Opinion