Improving Search Engines using Online Communities


Anatoliy Gruzd. Research Forum, Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign, IL. March 14, 2007.


Anatoliy Gruzd <agruzd2@uiuc.edu>

Research Forum, Graduate School of Library and Information Science

University of Illinois, Urbana-Champaign, IL, March 14, 2007

It takes an [Internet] village …

Anatoliy Gruzd Community-created metadata

Agenda

1. Common search problems

2. Online bookmarking - http://del.icio.us

3. Pilot Study

4. Future work


Common search problems

The main drawback of all modern search engines is that they force the user to guess words that might appear in all relevant documents and at the same time will not appear in non-relevant documents.

1. A relevant page will not be retrieved if it does not contain the keywords that the user chose for searching.

2. Even if the user’s search keywords are found inside a web page, the page is not necessarily relevant to the user.


Query #1: weight loss

[Figure: Architecture of a typical search engine. Labels: User’s Query ("weight loss"), Web page ("weight loss ???"), Matching, Results.]
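The two problems listed above can be reproduced with a few lines of code. The following is a minimal sketch, assuming a matcher that requires every query word to occur in the page text; the page names and the second page's text are hypothetical and exist only for illustration, and a real search engine ranks documents rather than applying a strict filter.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_match(query, page_text):
    """Return True only if every query word occurs in the page text."""
    page_words = set(tokenize(page_text))
    return all(word in page_words for word in tokenize(query))

# Hypothetical page texts, for illustration only.
pages = {
    # Relevant to dieting, but never uses the words "weight" or "loss".
    "diet-recipes": "Recipes are grain-free, bean-free, potato-free, "
                    "dairy-free, and sugar-free.",
    # Contains both words, but is not about losing weight.
    "shipping-faq": "Report the weight of the parcel to avoid loss of "
                    "insurance cover.",
}

query = "weight loss"
results = [name for name, text in pages.items() if keyword_match(query, text)]
print(results)
```

The relevant page is missed (problem 1) and the non-relevant page is retrieved (problem 2), because the matcher sees only the words on the page.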

Query #1: weight loss

• http://www.paleofood.com/

Recipes are: grain-free, bean-free, potato-free, dairy-free, and sugar-free.


Query #2: assignment about "human brain" for homeschooling

This is an instructor’s blog for a Human Development class at The Evergreen State College. The page was retrieved because of two unrelated postings titled “Homeschoolers use selective socialization” and “Part Of Human Brain Functions Like A Digital Computer”.

Online bookmarking - http://del.icio.us

Common tags for http://www.paleofood.com/

• ethnic • evolutionary eating • food • allergies • german • naturopathic • primitivism • weight loss


[Figure: Search engine architecture with tags added. Labels: User’s Query ("weight loss"), Web page ("weight loss ???"), Tags, Matching, Results.]


Pilot Study

[Figure: Pilot study design. System A matches the user’s query against the web page to produce Results A; System B matches the same query against the tags to produce Results B.]
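The System A / System B comparison can be sketched in code. This is only an illustration of the design, not the pilot's implementation (the pilot used Indri for retrieval): the one-entry collection below reuses the paleofood.com recipe description and tags quoted earlier in the slides, and the function names `system_a` and `system_b` are hypothetical.

```python
def matches(query, words):
    """True if every query term is in the given collection of words."""
    return all(term in words for term in query.lower().split())

# A hypothetical mini-collection: page text plus community tags.
collection = [
    {
        "url": "http://www.paleofood.com/",
        "text": "recipes are grain-free bean-free potato-free "
                "dairy-free and sugar-free",
        "tags": ["ethnic", "evolutionary eating", "food", "allergies",
                 "german", "naturopathic", "primitivism", "weight loss"],
    },
]

def system_a(query):
    """Keyword-based: match the query against the page text."""
    return [p["url"] for p in collection
            if matches(query, set(p["text"].split()))]

def system_b(query):
    """Tag-based: match the query against the words of the page's tags."""
    hits = []
    for p in collection:
        tag_words = {w for tag in p["tags"] for w in tag.split()}
        if matches(query, tag_words):
            hits.append(p["url"])
    return hits

print(system_a("weight loss"))
print(system_b("weight loss"))
```

For the query "weight loss", the keyword-based system returns nothing for this page, while the tag-based system retrieves it because users tagged it with "weight loss".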


Pilot Study

• Search engine – Indri, a cooperative effort between the University of Massachusetts and Carnegie Mellon University

• Search queries – ~20-30 real user questions found on the Internet

• Pilot dataset – 454 health-related web pages


115 /Neurological_Disorders

101 /Cancer

54 /Immune_Disorders/Immune_Deficiency

53 /Endocrine_Disorders

35 /Cardiovascular_Disorders

26 /Respiratory_Disorde

23 /Digestive_Disorders

“The Open Directory Project is the largest, most comprehensive human-edited directory of the Web.”

http://dmoz.org

Started with ~64,000 URLs (from Top/Health/Conditions_and_Diseases)

-> only 544 were bookmarked by del.icio.us users

-> only 454 were accessible at the time of my experiment

Pilot dataset: 454 health-related web pages
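To put the dataset funnel in proportion, the quoted counts can be turned into percentages. This is a quick calculation that treats the approximate starting figure (~64,000) as exact, so the shares relative to it are approximate too.

```python
start = 64_000     # approximate number of URLs in the directory branch
bookmarked = 544   # URLs bookmarked by del.icio.us users
accessible = 454   # URLs still accessible at the time of the experiment

print(f"bookmarked / start:      {bookmarked / start:.2%}")
print(f"accessible / bookmarked: {accessible / bookmarked:.1%}")
print(f"accessible / start:      {accessible / start:.2%}")
```

Under that assumption, fewer than 1% of the directory's URLs had been bookmarked, and about five in six of the bookmarked URLs were still reachable.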


Noise in Tags

• toread • todo • interesting • imported • safari_export • system:unfiled • .imported
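One simple way to handle such tags is a stop list applied before indexing. The sketch below assumes exactly the noise tags listed above; the slides do not say how noise was actually handled in the pilot, and the helper name `remove_noise` is hypothetical.

```python
# Stop list built from the noise tags listed on the slide.
NOISE_TAGS = {"toread", "todo", "interesting", "imported",
              "safari_export", "system:unfiled", ".imported"}

def remove_noise(tags, noise=NOISE_TAGS):
    """Drop tags that describe the user's workflow rather than the page."""
    return [t for t in tags if t.strip().lower() not in noise]

tags = ["toread", "cancer", "system:unfiled", "Health", ".imported"]
print(remove_noise(tags))
```

A fixed stop list is easy to audit, but it only catches noise that someone has already noticed; a larger study might instead flag tags that are spread evenly across unrelated pages.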


Compound tags

• generalhealth • computersoftware • cancerpatients-supportgroups • highbloodpressure • whoiwanttosharewith
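Compound tags like these can be broken into words with dictionary-based segmentation. The following is a sketch of one possible approach, not something the slides describe: the tiny vocabulary is hypothetical and hand-picked to cover the examples above, and a real system would need a full word list and a way to choose between competing splits.

```python
from functools import lru_cache

# Hypothetical vocabulary covering only the example tags.
VOCAB = frozenset({
    "general", "health", "computer", "software", "cancer", "patients",
    "support", "groups", "high", "blood", "pressure",
    "who", "i", "want", "to", "share", "with",
})

@lru_cache(maxsize=None)
def _segment(s, vocab):
    """Return a tuple of vocabulary words spelling s, or None if impossible."""
    if not s:
        return ()
    # Try the longest prefix first, backtracking if the rest cannot be split.
    for end in range(len(s), 0, -1):
        if s[:end] in vocab:
            rest = _segment(s[end:], vocab)
            if rest is not None:
                return (s[:end],) + rest
    return None

def split_compound(tag, vocab=VOCAB):
    """Split a tag into words; return [tag] unchanged if no full split exists."""
    parts = []
    for chunk in tag.lower().split("-"):
        words = _segment(chunk, vocab)
        if words is None:
            return [tag]
        parts.extend(words)
    return parts

for tag in ["generalhealth", "cancerpatients-supportgroups",
            "whoiwanttosharewith"]:
    print(tag, "->", split_compound(tag))
```

Tags that cannot be fully covered by the vocabulary are returned unchanged, so unknown words are never silently mangled.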


Keywords-based (Results A, System A)

1. (---) /term "assignment"
2. (---) /term "brain [center]"
3. (+++) Neuroscience For Kids - Explore the nervous system

Tags-based (Results B, System B)

1. (+++) Neuroscience For Kids - Explore the nervous system
2. (+++)
3. (+++)

Common tags

• anatomy • psychology • biology • cognitive • education • reference • medical • human • homeschool


Future work

• Use a larger dataset

• Compare results across different subject domains and genres

• Explore ways to combine tags and keywords, and determine whether the combination improves the quality of results (if at all)
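One candidate for combining the two sources of evidence is a weighted sum of a keyword score and a tag score. The sketch below is purely an assumption about how such a combination might look, not a method proposed in the slides: the scoring function (fraction of query terms matched), the `tag_weight` parameter, and the example inputs are all hypothetical.

```python
def overlap_score(query_terms, words):
    """Fraction of query terms that appear among the given words."""
    if not query_terms:
        return 0.0
    words = set(words)
    return sum(term in words for term in query_terms) / len(query_terms)

def combined_score(query, page_text, tags, tag_weight=0.5):
    """Weighted sum of a keyword score and a tag score.

    tag_weight=0.0 reduces to keyword-only matching and
    tag_weight=1.0 to tag-only matching.
    """
    terms = query.lower().split()
    keyword_score = overlap_score(terms, page_text.lower().split())
    tag_words = [w for tag in tags for w in tag.lower().split()]
    tag_score = overlap_score(terms, tag_words)
    return (1 - tag_weight) * keyword_score + tag_weight * tag_score

text = "recipes are grain-free and sugar-free"
tags = ["food", "weight loss"]
print(combined_score("weight loss", text, tags))
print(combined_score("weight loss", text, tags, tag_weight=0.0))
```

Sweeping `tag_weight` between 0 and 1 on a set of judged queries would be one way to test whether the combination helps at all.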
