WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 14

Page 1: WEB BAR 2004  Advanced Retrieval and Web Mining

WEB BAR 2004 Advanced Retrieval and Web Mining

Lecture 14

Page 2: WEB BAR 2004  Advanced Retrieval and Web Mining

Today’s Topics

- Latent Semantic Indexing / Dimension reduction
- Interactive information retrieval / User interfaces
- Evaluation of interactive retrieval

Page 3: WEB BAR 2004  Advanced Retrieval and Web Mining

How LSI is used for Text Search

LSI is a technique for dimension reduction
- Similar to Principal Component Analysis (PCA)
- Addresses (near-)synonymy: car/automobile
- Attempts to enable concept-based retrieval

Pre-process docs using a technique from linear algebra called Singular Value Decomposition.

Reduce dimensionality:
- Fewer dimensions: more "collapsing of axes", better recall, worse precision
- More dimensions: less collapsing, worse recall, better precision

Queries are handled in this new (reduced) vector space.

Page 4: WEB BAR 2004  Advanced Retrieval and Web Mining

Input: term-document matrix A, where w_ij = (normalized) weighted count of term t_i in document d_j

Key idea: factorize this matrix.

[Diagram: the m x n matrix A, rows indexed by terms t_i, columns by documents d_j]
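For concreteness, here is a minimal sketch of building such a term-document matrix in Python with scikit-learn; the library choice and the toy corpus are assumptions, not part of the lecture:

```python
# Build a small term-document matrix A (terms x docs), as on the slide.
# Toy corpus and scikit-learn are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["car sale offer", "automobile dealer sale", "web mining lecture"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # docs x terms (n x m), sparse, tf-idf weighted
A = X.T.toarray()             # transpose to terms x docs (m x n)
print(A.shape)                # (m, 3)
print(vec.get_feature_names_out())
```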

Page 5: WEB BAR 2004  Advanced Retrieval and Web Mining

Matrix Factorization

[Diagram: A (m x n) = W (m x k, basis) x H (k x n, representation); column h_j of H corresponds to document d_j]

- h_j is the representation of d_j in terms of the basis W
- If rank(W) >= rank(A), then we can always find H so that A = WH
- Notice the duality of the problem
- More "semantic" dimensions -> LSI (latent semantic indexing)

Page 6: WEB BAR 2004  Advanced Retrieval and Web Mining

Minimization Problem

Minimize information loss: minimize ||A - W S V^T||

Given:
- a norm: for SVD, the 2-norm
- constraints on W, S, V: for SVD, W and V are orthonormal, and S is diagonal

Page 7: WEB BAR 2004  Advanced Retrieval and Web Mining

Matrix Factorizations: SVD

[Diagram: A (m x n) = W (m x k, basis) x S (k x k, diagonal of singular values) x V^T (k x n, representation)]

Restrictions on representation: W, V orthonormal; S diagonal

Page 8: WEB BAR 2004  Advanced Retrieval and Web Mining

Dimension Reduction

For some s << r (the rank), zero out all but the s biggest singular values in S. Denote this new version of S by S_s. Typically s is in the hundreds, while r could be in the (tens of) thousands.

Before: A = W S V^T
Let A_s = W S_s V^T = W_s S_s V_s^T

A_s is a good approximation to A: the best rank-s approximation according to the 2-norm (see the sketch below).
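A hedged numpy sketch of this truncation; the random matrix stands in for a real term-document matrix:

```python
# Rank-s approximation A_s = W S_s V^T via the SVD, with numpy.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 40))        # stand-in for an m x n term-document matrix

W, sing, Vt = np.linalg.svd(A, full_matrices=False)
s = 5                            # keep only the s largest singular values
A_s = W[:, :s] @ np.diag(sing[:s]) @ Vt[:s, :]

# Eckart-Young: the 2-norm error of the best rank-s approximation
# equals the (s+1)-th singular value.
print(np.linalg.norm(A - A_s, 2), sing[s])   # the two values agree
```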

Page 9: WEB BAR 2004  Advanced Retrieval and Web Mining

Dimension Reduction

[Diagram: A_s (m x n) = W (m x k, basis) x S_s (k x k, singular values with all but the s largest zeroed out) x V^T (k x n, representation)]

The columns of A_s represent the docs, but in s << m dimensions. Best rank-s approximation according to the 2-norm.

Page 10: WEB BAR 2004  Advanced Retrieval and Web Mining

More on W and V

Recall the m x n matrix of terms x docs, A. Define the term-term correlation matrix T = A A^T
- A^T denotes the matrix transpose of A
- T is a square, symmetric m x m matrix

Doc-doc correlation matrix D = A^T A
- D is a square, symmetric n x n matrix

Why?

Page 11: WEB BAR 2004  Advanced Retrieval and Web Mining

Eigenvectors

Denote by W the m x r matrix of eigenvectors of T.

Denote by V the n x r matrix of eigenvectors of D.

Denote by S the diagonal matrix whose entries are the square roots of the eigenvalues of T = A A^T, in sorted order.

It turns out that A = W S V^T is the SVD of A. Semi-precise intuition: the new dimensions are the principal components of term correlation space.
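A small numpy check of this relationship on toy data (illustration only): the eigenvalues of T = A A^T are the squares of A's singular values, so S holds their square roots:

```python
# Verify: eigenvalues of T = A A^T equal the squared singular values of A.
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 5))
T = A @ A.T

top_eigs = np.sort(np.linalg.eigvalsh(T))[::-1][:5]  # largest 5 eigenvalues
sing = np.linalg.svd(A, compute_uv=False)            # singular values of A
print(np.allclose(top_eigs, sing**2))                # True
```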

Page 12: WEB BAR 2004  Advanced Retrieval and Web Mining

Query processing

Exercise: How do you map the query into the reduced space?
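One common textbook answer, sketched below as a hint rather than the definitive solution: treat the query as a pseudo-document and fold it in via q_s = S_s^{-1} W_s^T q (the function name is made up for illustration):

```python
# Fold a query into the reduced LSI space: q_s = S_s^{-1} W_s^T q.
import numpy as np

def fold_in(q, W_s, sing_s):
    """q: m-dim term vector; W_s: m x s basis; sing_s: s singular values."""
    return (W_s.T @ q) / sing_s   # dividing by sing_s applies S_s^{-1}

# q_s can then be compared (e.g. by cosine similarity) to the documents'
# coordinates in the reduced space, the rows of V_s.
```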

Page 13: WEB BAR 2004  Advanced Retrieval and Web Mining

Take Away

LSI is optimal: the optimal solution for a given dimensionality
- Caveat: mathematically optimal is not necessarily "semantically" optimal

LSI is unique
- Except for signs and for singular values with the same value

Key benefits of LSI
- Enhances recall, addresses the synonymy problem
- But can decrease precision

Maintenance challenges
- Changing collections: recompute in intervals?

Performance challenges

Cheaper alternatives for recall enhancement
- E.g., pseudo-feedback

Use of LSI in deployed systems. Why?

Page 14: WEB BAR 2004  Advanced Retrieval and Web Mining

Resources: LSI

Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html

Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html

Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf

Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.

Page 15: WEB BAR 2004  Advanced Retrieval and Web Mining

Interactive Information Retrieval: User Interfaces

Page 16: WEB BAR 2004  Advanced Retrieval and Web Mining

The User in Information Access

[Flowchart: the user, starting from an information need, finds a starting point, formulates/reformulates a query, sends it to the system, receives results, and explores them; if done, stop, otherwise reformulate.]

Page 17: WEB BAR 2004  Advanced Retrieval and Web Mining

Main Focus of Information Retrieval

[The same flowchart, with the system step (query in, results out) highlighted: the focus of most IR!]

Page 18: WEB BAR 2004  Advanced Retrieval and Web Mining

Information Access in Context

[Flowchart: information access embedded in a larger task; the user's high-level goal drives a cycle of analyze, information access, and synthesize, until done.]

Page 19: WEB BAR 2004  Advanced Retrieval and Web Mining

The User in Information Access

[The information-access flowchart repeated: information need -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done?]

Page 20: WEB BAR 2004  Advanced Retrieval and Web Mining

Queries on the Web: Most Frequent on 2002/10/26

Page 21: WEB BAR 2004  Advanced Retrieval and Web Mining

Queries on the Web (2000)

Why only 9% sex?

Page 22: WEB BAR 2004  Advanced Retrieval and Web Mining

Intranet Queries (Aug 2000)

3351 bearfacts
3349 telebears
1909 extension
1874 schedule+of+classes
1780 bearlink
1737 bear+facts
1468 decal
1443 infobears
1227 calendar
989 career+center
974 campus+map
920 academic+calendar
840 map
773 bookstore
741 class+pass
738 housing
721 tele-bears
716 directory
667 schedule
627 recipes
602 transcripts
582 tuition
577 seti
563 registrar
550 info+bears
543 class+schedule
470 financial+aid

Source: Ray Larson

Page 23: WEB BAR 2004  Advanced Retrieval and Web Mining

Intranet Queries

Summary of sample data from 3 weeks of UCB queries

13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
6.7% Schedule of classes or final exams (6222)
5.4% Summer Session (5041)
3.2% Extension (2932)
3.1% Academic Calendar (2846)
2.4% Directories (2202)
1.7% Career Center (1588)
1.7% Housing (1583)
1.5% Map (1393)

Source: Ray Larson

Page 24: WEB BAR 2004  Advanced Retrieval and Web Mining

Types of Information Needs

- Need answer to a question (who won the superbowl?)
- Re-find a particular document
- Find a good recipe for tonight's dinner
- Exploration of a new area (browse sites about Mexico City)
- Authoritative summary of information (HIV review)

In most cases, only one interface! Cell phone / pda / camera / mp3 analogy

Page 25: WEB BAR 2004  Advanced Retrieval and Web Mining

The User in Information Access

[The information-access flowchart repeated: information need -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done?]

Page 26: WEB BAR 2004  Advanced Retrieval and Web Mining

Find Starting Point By Browsing

[Diagram: documents shown as x's; browsing from an entry point leads to a starting point for search (or the answer?)]

Page 27: WEB BAR 2004  Advanced Retrieval and Web Mining

Hierarchical browsing

[Diagram: a hierarchy browsed across levels 0, 1, and 2]

Page 28: WEB BAR 2004  Advanced Retrieval and Web Mining
Page 29: WEB BAR 2004  Advanced Retrieval and Web Mining

Visual Browsing: Hyperbolic Tree

Page 30: WEB BAR 2004  Advanced Retrieval and Web Mining

Visual Browsing: Hyperbolic Tree

Page 31: WEB BAR 2004  Advanced Retrieval and Web Mining

Visual Browsing: Themescape

Page 32: WEB BAR 2004  Advanced Retrieval and Web Mining

Scatter/Gather

Scatter/Gather allows the user to find a set of documents of interest through browsing. It iterates:
- Scatter: take the collection and scatter it into n clusters
- Gather: pick the clusters of interest and merge them

(A minimal clustering sketch follows below.)
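Here is a sketch of that loop, using k-means as the clustering step; k-means and the parameter choices are assumptions (the original system used its own algorithms, Buckshot and Fractionation):

```python
# Scatter/Gather loop: scatter into clusters, gather user-picked clusters.
import numpy as np
from sklearn.cluster import KMeans

def scatter(doc_ids, doc_vectors, n_clusters=5):
    """Scatter the current document set into n_clusters clusters."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(doc_vectors[doc_ids])
    return [[doc_ids[i] for i in np.flatnonzero(km.labels_ == c)]
            for c in range(n_clusters)]

def gather(clusters, chosen):
    """Merge the clusters the user picked into one document set."""
    return [d for c in chosen for d in clusters[c]]

# One iteration: docs = gather(scatter(docs, X), user_selection);
# repeat until a small set of interesting documents remains.
```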

Page 33: WEB BAR 2004  Advanced Retrieval and Web Mining

Scatter/Gather

Page 34: WEB BAR 2004  Advanced Retrieval and Web Mining

Browsing vs. Searching

Browsing and searching are often interleaved.

Information need dependent
- Open-ended (find information about Mexico City) -> browsing
- Specific (who won the superbowl) -> searching

User dependent
- Some users prefer searching, others browsing (confirmed in many studies: some hate to type)

Advantage of browsing: you don't need to know the vocabulary of the collection

Compare to the physical world: browsing vs. searching in a grocery store

Page 35: WEB BAR 2004  Advanced Retrieval and Web Mining

Browsers vs. Searchers

- 1/3 of users do not search at all
- 1/3 rarely search (or URLs only)
- Only 1/3 understand the concept of search

(ISP data from 2000)

Why?

Page 36: WEB BAR 2004  Advanced Retrieval and Web Mining

Starting Points

Methods for finding a starting point
- Select collections from a list (e.g., Highwire Press)
- Google!
- Hierarchical browsing, directories
- Visual browsing: hyperbolic tree, Themescape, Kohonen maps

Browsing vs. searching

Page 37: WEB BAR 2004  Advanced Retrieval and Web Mining

The User in Information Access

[The information-access flowchart repeated: information need -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done?]

Page 38: WEB BAR 2004  Advanced Retrieval and Web Mining

Form-based Query Specification (Infoseek)

Credit: Marti Hearst

Page 39: WEB BAR 2004  Advanced Retrieval and Web Mining

Boolean Queries

- Boolean logic is difficult for the average user.
- Some interfaces for average users support formulation of boolean queries.
- The current view is that non-expert users are best served with non-boolean or simple +/- boolean (pioneered by AltaVista).
- But boolean queries are the standard for certain groups of expert users (e.g., lawyers).

Page 40: WEB BAR 2004  Advanced Retrieval and Web Mining

Direct Manipulation Spec.: VQUERY (Jones 98)

Credit: Marti Hearst

Page 41: WEB BAR 2004  Advanced Retrieval and Web Mining

One Problem With Boolean Queries: Feast or Famine

[Screenshots: a "famine" result and a "feast" result]

Specifying a well-targeted query is hard. This is a bigger problem for Boolean queries.

Google: 1860 hits for "standard user dlink 650"; 0 hits after adding "no card found".

How general is the query?

Page 42: WEB BAR 2004  Advanced Retrieval and Web Mining

Boolean Queries

Summary
- Complex boolean queries are difficult for the average user
- Feast or famine problem
- Prior to Google, many IR researchers thought boolean queries were a bad idea.
- Google queries are strict conjunctions. Why is this working well?

Page 43: WEB BAR 2004  Advanced Retrieval and Web Mining

Parametric search example

Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.

Page 44: WEB BAR 2004  Advanced Retrieval and Web Mining

Parametric search example

We can add text search.

Page 45: WEB BAR 2004  Advanced Retrieval and Web Mining

Parametric search

Each document has, in addition to text, some "meta-data", e.g., Make, Model, City, Color.

A parametric search interface allows the user to combine a full-text query with selections on these parameters (a sketch follows below).
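A toy sketch of the idea: metadata filters combined with a naive full-text match. Field names and data are illustrative assumptions, not a real system's schema:

```python
# Parametric search = metadata selections AND full-text query.
cars = [
    {"make": "Honda", "model": "Civic", "city": "Palo Alto",
     "color": "red", "text": "well maintained, one owner"},
    {"make": "Ford", "model": "Focus", "city": "Berkeley",
     "color": "blue", "text": "new tires, minor scratches"},
]

def parametric_search(docs, query_terms, **filters):
    """Return docs matching every metadata filter and any query term."""
    return [d for d in docs
            if all(d.get(k) == v for k, v in filters.items())
            and any(t in d["text"] for t in query_terms)]

print(parametric_search(cars, ["owner"], make="Honda", color="red"))
```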

Page 46: WEB BAR 2004  Advanced Retrieval and Web Mining

Interfaces for term browsing

Page 47: WEB BAR 2004  Advanced Retrieval and Web Mining
Page 48: WEB BAR 2004  Advanced Retrieval and Web Mining

Re/Formulate Query

- Single text box (Google, Stanford intranet)
- Command-based (Socrates)
- Boolean queries
- Parametric search
- Term browsing
- Other methods: relevance feedback, query expansion, spelling correction, natural language / question answering

Page 49: WEB BAR 2004  Advanced Retrieval and Web Mining

The User in Information Access

[The information-access flowchart repeated: information need -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done?]

Page 50: WEB BAR 2004  Advanced Retrieval and Web Mining

Category Labels to Support Exploration

Example: ODP categories on Google

Advantages:
- Interpretable
- Capture summary information
- Describe multiple facets of content
- Domain dependent, and so descriptive

Disadvantages:
- Domain dependent, so costly to acquire
- May mis-match users' interests

Credit: Marti Hearst

Page 51: WEB BAR 2004  Advanced Retrieval and Web Mining

Evaluate Results. Context in Hierarchy: Cat-a-Cone

Page 52: WEB BAR 2004  Advanced Retrieval and Web Mining

Summarization to Support Exploration

Query-dependent summarization
- KWIC (keyword in context) lines (a la Google)

Query-independent summarization
- Summary written by author (if available)
- Automatically generated summary

(A toy KWIC sketch follows below.)
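A toy sketch of query-dependent KWIC lines; the window size and the crude tokenization are arbitrary assumptions:

```python
# KWIC: show each query-term occurrence with a window of context words.
def kwic(text, terms, window=4):
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower().strip(".,") in terms:
            lo, hi = max(0, i - window), i + window + 1
            lines.append("... " + " ".join(words[lo:hi]) + " ...")
    return lines

print(kwic("The jaguar is a large cat native to the Americas.", {"jaguar"}))
```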

Page 53: WEB BAR 2004  Advanced Retrieval and Web Mining

Visualize Document Structure for Exploration

Page 54: WEB BAR 2004  Advanced Retrieval and Web Mining

Result Exploration

User goal: do these results answer my question?

Methods
- Category labels
- Summarization
- Visualization of document structure

Other methods
- Metadata: URL, date, file size, author
- Hypertext navigation: can I find the answer by following a link? Browsing in general
- Clustering of results (jaguar example)

Page 55: WEB BAR 2004  Advanced Retrieval and Web Mining

Exercise

Current information retrieval user interfaces are designed for typical computer screens. How would you design a user interface for a wall-sized screen?

Observe your own information seeking behavior. Examples:
- WWW
- University library
- Grocery store

Are you a searcher or a browser?

How do you reformulate your query?
- Read bad hits, then minus terms
- Read good hits, then plus terms
- Try a completely different query
- …

Page 56: WEB BAR 2004  Advanced Retrieval and Web Mining

Take Away

[The information-access flowchart once more, with the system step (query in, results out) marked as the focus of most IR.]

Page 57: WEB BAR 2004  Advanced Retrieval and Web Mining

Evaluation of Interactive Retrieval

Page 58: WEB BAR 2004  Advanced Retrieval and Web Mining

Recap: Relevance Feedback

1. User sends query.
2. Search system returns results.
3. User marks some results as relevant and resubmits query plus relevant results.
4. Search system now has a better description of the information need and returns more relevant results.

One method: the Rocchio algorithm (a sketch follows below).
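For reference, a minimal sketch of the Rocchio update in its standard form; the alpha/beta/gamma values are conventional defaults, not numbers from the lecture:

```python
# Rocchio: move the query toward relevant docs, away from non-relevant ones.
import numpy as np

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q: query vector; rel_docs / nonrel_docs: arrays of doc vectors."""
    q_new = alpha * q
    if len(rel_docs):
        q_new += beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q_new -= gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q_new, 0)   # common practice: clip negative weights
```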

Page 59: WEB BAR 2004  Advanced Retrieval and Web Mining

Why Evaluate Relevance Feedback?

Simulated interactive retrieval consistently outperforms non-interactive retrieval (70% here).

Page 60: WEB BAR 2004  Advanced Retrieval and Web Mining

Relevance Feedback EvaluationCase Study

Example of evaluation of interactive information retrieval

Koenemann & Belkin 1996

Goal of study: show that relevance feedback improves retrieval effectiveness

Page 61: WEB BAR 2004  Advanced Retrieval and Web Mining

Details on the User Study

- 64 novice searchers: 43 female, 21 male, native English speakers
- TREC test bed: Wall Street Journal subset
- Two search topics: Automobile Recalls; Tobacco Advertising and the Young
- Relevance judgements from TREC and experimenter
- System was INQUERY (vector space with some bells and whistles)
- Subjects had a tutorial session to learn the system
- Their goal was to keep modifying the query until they had developed one that gets high precision
- Reweighting of terms similar to, but different from, Rocchio

Page 62: WEB BAR 2004  Advanced Retrieval and Web Mining

Credit: Marti Hearst

Page 63: WEB BAR 2004  Advanced Retrieval and Web Mining

Evaluation

Criterion: p@30 (precision at 30 documents); see the sketch below.

Compare:
- p@30 for users with relevance feedback
- p@30 for users without relevance feedback

Goal: show that users with relevance feedback do better
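For concreteness, a sketch of the criterion (a straightforward definition, not code from the study):

```python
# p@k: fraction of the top-k ranked results that are relevant.
def precision_at_k(ranked_doc_ids, relevant_ids, k=30):
    top = ranked_doc_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

# e.g. precision_at_k(results, judged_relevant, k=30) gives p@30
```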

Page 64: WEB BAR 2004  Advanced Retrieval and Web Mining

Precision vs. RF condition (from Koenemann & Belkin 96)

Credit: Marti Hearst

Page 65: WEB BAR 2004  Advanced Retrieval and Web Mining

Result

Subjects with relevance feedback had, on average, 17-34% better performance than subjects without relevance feedback.

Does this show conclusively that relevance feedback is better?

Page 66: WEB BAR 2004  Advanced Retrieval and Web Mining

But …
- Difference in precision numbers not statistically significant.
- Search times approximately equal.

Page 67: WEB BAR 2004  Advanced Retrieval and Web Mining

Take Away

Evaluating interactive systems is harder than evaluating algorithms.

Experiments involving humans have many confounding variables:
- Age
- Level of education
- Prior experience with search
- Search style (browsing vs. searching)
- Mac vs. Linux vs. MS user
- Mood, level of alertness, chemistry with experimenter, etc.

Showing statistical significance becomes harder as the number of confounding variables increases.

Also: human subject studies are resource-intensive.

It's hard to "scientifically prove" the superiority of relevance feedback.

Page 68: WEB BAR 2004  Advanced Retrieval and Web Mining

Other Evaluation Issues

Query variability
- Always compare methods on a query-by-query basis
- Methods with the same average performance can differ a lot in user friendliness

Inter-judge variability
- In general, judges disagree often
- Big impact on relevance assessment of a single document
- Little impact on ranking of systems

Redundancy
- A highly relevant document with no new information is useless
- Most IR measures don't measure redundancy

Page 69: WEB BAR 2004  Advanced Retrieval and Web Mining

Resources

FOA 4.3

MIR Ch. 10.8 – 10.10

Ellen Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, ACM Sigir 98

Harman, D.K. Overview of the Third REtrieval Conference (TREC-3). In: Overview of The Third Text REtrieval Conference (TREC-3). Harman, D.K. (Ed.). NIST Special Publication 500-225, 1995, pp. 1-19.

Marti A. Hearst, Jan O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR-96, 1996.

Paul Over, TREC-6 Interactive Track Report, NIST, 1998.

Page 70: WEB BAR 2004  Advanced Retrieval and Web Mining

Resources

MIR Ch. 10.0 – 10.7

Donna Harman, Overview of the fourth text retrieval conference (TREC 4), National Institute of Standards and Technology.

Cutting, Karger, Pedersen, Tukey. Scatter/Gather. ACM SIGIR. http://citeseer.nj.nec.com/cutting92scattergather.html

Hearst, Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval results in a large category hierarchy, ACM SIGIR.

http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm

http://otal.umd.edu/olive