Search Session 12 LBSC 690 Information Technology

Page 1: Search Session 12 LBSC 690 Information Technology

Search

Session 12

LBSC 690

Information Technology

Page 2: Search Session 12 LBSC 690 Information Technology

Agenda

• The search process

• Information retrieval

• Recommender systems

• Evaluation

Page 3: Search Session 12 LBSC 690 Information Technology

Information “Retrieval”

• Find something that you want
  – The information need may or may not be explicit

• Known item search
  – Find the class home page

• Answer seeking
  – Is Lexington or Louisville the capital of Kentucky?

• Directed exploration
  – Who makes videoconferencing systems?

Page 4: Search Session 12 LBSC 690 Information Technology

Information Retrieval Paradigm

[Figure: diagram with the labels Search, Browse, Query, Select, Examine, Document, and Document Delivery]

Page 5: Search Session 12 LBSC 690 Information Technology

Supporting the Search Process

[Figure: flow diagram. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Items passed between stages: Query, Ranked List, Document, Document. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. Other labels: IR System, Nominate, Choose, Predict.]

Page 6: Search Session 12 LBSC 690 Information Technology

Supporting the Search Process

[Figure: flow diagram. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Items passed between stages: Query, Ranked List, Document, Document. Other labels: IR System, Indexing, Index, Acquisition, Collection.]

Page 7: Search Session 12 LBSC 690 Information Technology

Human-Machine Synergy

• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time

• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”

• Both are pretty bad at:
  – Mapping consistently between words and concepts

Page 8: Search Session 12 LBSC 690 Information Technology

Search Component Model

[Figure: component diagram. Query side (query processing): Information Need, Query Formulation, Query, Representation Function, Query Representation. Document side (document processing): Document, Representation Function, Document Representation. A Comparison Function takes both representations and yields a Retrieval Status Value. Other labels: Human Judgment, Utility.]

Page 9: Search Session 12 LBSC 690 Information Technology

Ways of Finding Text

• Searching metadata
  – Using controlled or uncontrolled vocabularies

• Free text
  – Characterize documents by the words they contain

• Social filtering
  – Exchange and interpret personal ratings

Page 10: Search Session 12 LBSC 690 Information Technology

“Exact Match” Retrieval

• Find all documents with some characteristic
  – Indexed as “Presidents -- United States”
  – Containing the words “Clinton” and “Peso”
  – Read by my boss

• A set of documents is returned
  – Hopefully, not too many or too few
  – Usually listed in date or alphabetical order

Page 11: Search Session 12 LBSC 690 Information Technology

Ranked Retrieval

• Put most useful documents near top of a list
  – Possibly useful documents go lower in the list

• Users can read down as far as they like
  – Based on what they read, time available, ...

• Provides useful results from weak queries
  – Untrained users find exact match harder to use

Page 12: Search Session 12 LBSC 690 Information Technology

Similarity-Based Retrieval

• Assume “most useful” = most similar to query

• Weight terms based on two criteria:
  – Repeated words are good cues to meaning
  – Rarely used words make searches more selective

• Compare weights with query
  – Add up the weights for each query term
  – Put the documents with the highest total first

Page 13: Search Session 12 LBSC 690 Information Technology

Simple Example: Counting Words

Documents:

1: Nuclear fallout contaminated Texas.

2: Information retrieval is interesting.

3: Information retrieval is complicated.

Query: recall and fallout measures for information retrieval

Term counts (columns: document 1, document 2, document 3, query):

nuclear        1  0  0  0
fallout        1  0  0  1
Texas          1  0  0  0
contaminated   1  0  0  0
interesting    0  1  0  0
complicated    0  0  1  0
information    0  1  1  1
retrieval      0  1  1  1
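The counting in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the course: the tokenizer, the `STOPWORDS` set, and the function names `terms` and `score` are all hypothetical. Each document is scored by how many distinct query terms it contains.

```python
import re

# The three documents and the query from the slide.
DOCS = {
    1: "Nuclear fallout contaminated Texas.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
QUERY = "recall and fallout measures for information retrieval"

# Assumption: small function words are ignored (the slide's term list omits "is").
STOPWORDS = {"is", "and", "for"}


def terms(text):
    """Lowercase the text and return its alphabetic tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def score(query, doc):
    """Count how many distinct query terms also occur in the document."""
    doc_terms = set(terms(doc))
    return sum(1 for t in set(terms(query)) if t in doc_terms)


scores = {doc_id: score(QUERY, text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

print(scores)   # {1: 1, 2: 2, 3: 2}
print(ranking)  # [2, 3, 1]
```

Document 1 still receives a score of 1 because it shares the word "fallout" with the query, even though it is about a different sense of the word. Raw counting cannot tell the two senses apart.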

Page 14: Search Session 12 LBSC 690 Information Technology

Discussion Point: Which Terms to Emphasize?

• Major factors
  – Uncommon terms are more selective
  – Repeated terms provide evidence of meaning

• Adjustments
  – Give more weight to terms in certain positions
    • Title, first paragraph, etc.
  – Give less weight to each term in longer documents
  – Ignore documents that try to “spam” the index
    • Invisible text, excessive use of the “meta” field, …

Page 15: Search Session 12 LBSC 690 Information Technology

“Okapi” Term Weights

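$$
w_{i,j} = \frac{TF_{i,j}}{0.5 + 1.5\,\frac{L_i}{\bar{L}} + TF_{i,j}} \cdot \log\frac{N - DF_j + 0.5}{DF_j + 0.5}
$$

[Figure, TF component: Okapi TF (0.0 to 1.0) plotted against raw TF (0 to 25), with one curve each for L / L̄ = 0.5, 1.0, and 2.0]

[Figure, IDF component: IDF (4.4 to 6.0) plotted against raw DF (0 to 25), comparing Classic and Okapi IDF]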
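The two components can be computed separately, as in the hedged sketch below. It assumes the commonly cited Okapi form, in which the TF component is TF / (0.5 + 1.5 · L/L̄ + TF) and the IDF component is log((N − DF + 0.5) / (DF + 0.5)). Here L is the document length, L̄ the average document length, N the collection size, and DF the number of documents containing the term. The function and parameter names are illustrative, not taken from any library.

```python
import math


def okapi_tf(tf, doc_len, avg_doc_len):
    """TF component: grows with raw TF but saturates below 1.0.

    Longer-than-average documents (doc_len / avg_doc_len > 1) are damped more.
    """
    return tf / (0.5 + 1.5 * (doc_len / avg_doc_len) + tf)


def okapi_idf(df, num_docs):
    """IDF component: rarer terms (small df) receive larger values."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))


def okapi_weight(tf, df, doc_len, avg_doc_len, num_docs):
    """Weight of one term in one document: TF component times IDF component."""
    return okapi_tf(tf, doc_len, avg_doc_len) * okapi_idf(df, num_docs)


# Example: a term occurring twice in an average-length document.
print(okapi_tf(2, doc_len=100, avg_doc_len=100))  # 0.5
```

Because the TF component saturates, the twentieth repetition of a word adds far less weight than the second one, which blunts simple keyword stuffing.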

Page 16: Search Session 12 LBSC 690 Information Technology

Index Quality

• Crawl quality
  – Comprehensiveness, dead links, duplicate detection

• Document analysis
  – Frames, metadata, imperfect HTML, …

• Document extension
  – Anchor text, source authority, category, language, …

• Document restriction (ephemeral text suppression)
  – Banner ads, keyword spam, …

Page 17: Search Session 12 LBSC 690 Information Technology

Indexing Anchor Text

• A type of “document expansion”
  – Terms near links describe content of the target

• Works even when you can’t index content
  – Image retrieval, uncrawled links, …

Page 18: Search Session 12 LBSC 690 Information Technology

Queries on the Web (1999)

• Low query construction effort
  – 2.35 (often imprecise) terms per query
  – 20% use operators
  – 22% are subsequently modified

• Low browsing effort
  – Only 15% view more than one page
  – Most look only “above the fold”
    • One study showed that 10% don’t know how to scroll!

Page 19: Search Session 12 LBSC 690 Information Technology

Types of User Needs

• Informational (30-40% of AltaVista queries)
  – What is a quark?

• Navigational
  – Find the home page of United Airlines

• Transactional
  – Data: What is the weather in Paris?
  – Shopping: Who sells a Vaio Z505RX?
  – Proprietary: Obtain a journal article

Page 20: Search Session 12 LBSC 690 Information Technology

Searching Other Languages

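[Figure: flow diagram. Stages: Query Formulation, Query Translation, Search, Selection, Examination, Use. Items passed between stages: Query, Translated Query, Ranked List, Document, Document. Feedback loop: Query Reformulation. Other labels: MT, Translated “Headlines”, English Definitions.]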

Page 22: Search Session 12 LBSC 690 Information Technology

Speech Retrieval Architecture

[Figure: architecture diagram with the components Speech Recognition, Boundary Tagging, Content Tagging, Query Formulation, Automatic Search, and Interactive Selection]

Page 23: Search Session 12 LBSC 690 Information Technology

Rating-Based Recommendation

• Use ratings to describe objects
  – Personal recommendations, peer review, …

• Beyond topicality:
  – Accuracy, coherence, depth, novelty, style, …

• Has been applied to many modalities
  – Books, Usenet news, movies, music, jokes, beer, …

Page 24: Search Session 12 LBSC 690 Information Technology

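Using Positive Information

Ratings table (columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear). Each row lists one person's grades in column order; empty cells are omitted and “?” marks a rating to be predicted.

Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A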

Page 25: Search Session 12 LBSC 690 Information Technology

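Using Negative Information

Ratings table (columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear). Each row lists one person's grades in column order; empty cells are omitted and “?” marks a rating to be predicted.

Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A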
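One way to use both kinds of evidence is a correlation-weighted prediction, sketched below. This is an assumption about how such a recommender might work, not the method behind the slides. The data is hypothetical (the user names, ride names, and grades are invented for illustration), and the functions `pearson` and `predict` are illustrative. A rater who agrees with the target user contributes positive evidence; a rater who consistently disagrees contributes negative evidence, because their likes predict the target's dislikes.

```python
import math

GRADE = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# Hypothetical letter-grade ratings (NOT the table from the slide).
RATINGS = {
    "target": {"ride1": "D", "ride2": "A", "ride3": "B"},
    "agrees": {"ride1": "D", "ride2": "A", "ride3": "B", "ride4": "F"},
    "disagrees": {"ride1": "A", "ride2": "D", "ride3": "C", "ride4": "A"},
}
NUMERIC = {u: {i: GRADE[g] for i, g in r.items()} for u, r in RATINGS.items()}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a = mean(a[i] for i in common)
    mean_b = mean(b[i] for i in common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common)
                    * sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(user, item, ratings, use_negative=True):
    """Predict a rating as the user's mean plus correlation-weighted deviations."""
    base = mean(ratings[user].values())
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = pearson(ratings[user], theirs)
        if sim < 0 and not use_negative:
            continue  # positive information only: skip users who disagree
        num += sim * (theirs[item] - mean(theirs.values()))
        den += abs(sim)
    return base + num / den if den else base


print(predict("target", "ride4", NUMERIC, use_negative=False))
print(predict("target", "ride4", NUMERIC, use_negative=True))
```

In this toy data both calls predict a low grade for `ride4`: the agreeing rater disliked it, and the disagreeing rater liked it, so the second call draws on two raters instead of one.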

Page 26: Search Session 12 LBSC 690 Information Technology

Problems with Explicit Ratings

• Cognitive load on users -- people don’t like to provide ratings

• Rating sparsity -- needs a number of raters to make recommendations

• No way to detect new items that have not been rated by any user

Page 27: Search Session 12 LBSC 690 Information Technology

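Implicit Evidence for Ratings

Table of observable behaviors, grouped by behavior category (rows) and minimum scope (columns: Segment, Object, Class). Behaviors are listed per row in the order they appear on the slide.

Examine: View, Select
Retain: Bookmark, Save, Purchase, Print, Delete, Subscribe
Reference: Quote, Cut & Paste, Cite, Link, Reply, Forward
Interpret: Annotate, Rate, Publish, Organize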

Page 28: Search Session 12 LBSC 690 Information Technology

Click Streams

• Browsing histories are easily captured
  – Send all links to a central site
  – Record from and to pages and user’s cookie
  – Redirect the browser to the desired page

• Reading time is correlated with interest
  – Can be used to build individual profiles
  – Used to target advertising by doubleclick.com
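The three capture steps can be sketched with Python's standard `urllib.parse` module. This is a simplified illustration, not a description of any real tracking service: the `TRACKER` address, the function names, and the in-memory `click_log` are all hypothetical, and a real server would answer with an HTTP redirect rather than return a string.

```python
from urllib.parse import parse_qs, urlencode, urlparse

TRACKER = "https://tracker.example/r"  # hypothetical central site

click_log = []  # (cookie, from_page, to_page) records


def wrap_link(from_page, to_page):
    """Step 1: rewrite a link so that the click goes to the central site."""
    return TRACKER + "?" + urlencode({"from": from_page, "to": to_page})


def handle_click(tracking_url, cookie):
    """Steps 2 and 3: record the click, then return the redirect target."""
    params = parse_qs(urlparse(tracking_url).query)
    from_page, to_page = params["from"][0], params["to"][0]
    click_log.append((cookie, from_page, to_page))
    return to_page  # a real server would send an HTTP redirect to this URL


url = wrap_link("http://a.example/index.html", "http://b.example/page.html")
print(handle_click(url, cookie="user-42"))  # http://b.example/page.html
```

Adding a timestamp to each record would let consecutive clicks by the same cookie approximate reading time.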

Page 29: Search Session 12 LBSC 690 Information Technology

Estimating Authority from Links

[Figure: link graph with pages labeled Hub and Authority]
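The hub and authority labels suggest the mutually reinforcing scores of Kleinberg's HITS algorithm; that reading is an assumption, since the slide shows only a diagram. In the hedged sketch below, a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph and page names are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """Iteratively estimate hub and authority scores from a link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: total hub score of the pages that link here.
        auth = normalize({p: sum(hub[q] for q in pages if p in links.get(q, ()))
                          for p in pages})
        # Hub: total authority score of the pages linked to from here.
        hub = normalize({p: sum(auth[t] for t in links.get(p, ()))
                         for p in pages})
    return hub, auth


# Hypothetical graph: one page links to two targets, another links to one of them.
links = {"portal": ["site_a", "site_b"], "blog": ["site_a"]}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # portal site_a
```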

Page 30: Search Session 12 LBSC 690 Information Technology

Information Retrieval Types

Source: Ayse Goker

Page 31: Search Session 12 LBSC 690 Information Technology

Hands On: Try Some Search Engines

• Web Pages (using spatial layout)
  – http://kartoo.com/

• Images (based on image similarity)
  – http://elib.cs.berkeley.edu/photos/blobworld/

• Multimedia (based on metadata)
  – http://singingfish.com

• Movies (based on recommendations)
  – http://www.movielens.umn.edu

• Grey literature (based on citations)
  – http://citeseer.ist.psu.edu/

Page 32: Search Session 12 LBSC 690 Information Technology

Evaluation

• What can be measured that reflects the searcher’s ability to use a system? (Cleverdon, 1966)

– Coverage of Information

– Form of Presentation

– Effort required/Ease of Use

– Time and Space Efficiency

– Effectiveness: Recall and Precision

Page 33: Search Session 12 LBSC 690 Information Technology

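Measures of Effectiveness

[Figure: Venn diagram of the Relevant (Rel) and Retrieved (Ret) document sets; RelRet is their overlap]

$$
\text{Recall} = \frac{|RelRet|}{|Rel|} \qquad \text{Precision} = \frac{|RelRet|}{|Ret|}
$$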
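As a small worked example, both measures can be computed from two sets of document identifiers: recall is the share of the relevant documents that were retrieved, and precision is the share of the retrieved documents that are relevant. The function name and the document IDs below are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |relevant and retrieved| / |retrieved|
    recall    = |relevant and retrieved| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    rel_ret = retrieved & relevant
    precision = len(rel_ret) / len(retrieved) if retrieved else 0.0
    recall = len(rel_ret) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(precision)  # 0.5
```

Here precision is 2/4 and recall is 2/3: retrieving more documents could raise recall, but usually at the cost of precision.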

Page 34: Search Session 12 LBSC 690 Information Technology

Precision-Recall Curves

[Figure: precision-recall curve; x-axis Recall (0 to 1), y-axis Precision (0 to 1)]

Source: Ellen Voorhees, NIST
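The points on such a curve can be derived from a single ranked list by recording precision each time another relevant document is found. The sketch below assumes binary relevance judgments and uses hypothetical document IDs; it does not reproduce the interpolation and averaging used in formal evaluations.

```python
def pr_points(ranked, relevant):
    """Return (recall, precision) pairs, one per relevant document found.

    `ranked` is the system's result list, best first; `relevant` is the set
    of documents judged relevant for the query.
    """
    relevant = set(relevant)
    points = []
    found = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points


# Relevant documents appear at ranks 1, 3, and 5.
for recall, precision in pr_points(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5"}):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Reading further down the list raises recall while precision tends to fall, which gives the curve its typical downward slope.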

Page 35: Search Session 12 LBSC 690 Information Technology

Affective Evaluation

• Measure stickiness through frequency of use
  – Non-comparative, long-term

• Key factors (from cognitive psychology):
  – Worst experience
  – Best experience
  – Most recent experience

• Highly variable effectiveness is undesirable
  – Bad experiences are particularly memorable

Page 36: Search Session 12 LBSC 690 Information Technology

Other Web Search Quality Factors

• Spam suppression
  – “Adversarial information retrieval”
  – Every source of evidence has been spammed
    • Text, queries, links, access patterns, …

• “Family filter” accuracy
  – Link analysis can be very helpful

Page 37: Search Session 12 LBSC 690 Information Technology

Summary

• Search is a process engaged in by people

• Human-machine synergy is the key

• Content and behavior offer useful evidence

• Evaluation must consider many factors