WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 14
Today’s Topics
Latent Semantic Indexing / dimension reduction
Interactive information retrieval / user interfaces
Evaluation of interactive retrieval
How LSI is used for Text Search
LSI is a technique for dimension reduction:
Similar to Principal Component Analysis (PCA)
Addresses (near-)synonymy: car/automobile
Attempts to enable concept-based retrieval
Pre-process docs using a technique from linear algebra called Singular Value Decomposition.
Reduce dimensionality:
Fewer dimensions: more "collapsing of axes", better recall, worse precision.
More dimensions: less collapsing, worse recall, better precision.
Queries are handled in this new (reduced) vector space.
Input: term-document matrix A, with w_{i,j} = (normalized) weighted count of (t_i, d_j).
Key idea: factorize this matrix.
[Figure: the m × n matrix A (terms t_i × documents d_j) factorized as A = W × H; column h_j of H is the representation of d_j in terms of the basis W.]
If rank(W) ≥ rank(A), then we can always find H so that A = WH. Notice the duality of the problem.
More "semantic" dimensions -> LSI (latent semantic indexing).
Matrix Factorization
[Figure: A (m × n) = W (m × k) × H (k × n); column h_j of H represents document d_j in the basis W.]
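To make this concrete, here is a minimal numpy sketch of a term-document matrix used as a running example below; the four terms, four documents, and raw counts are invented for illustration (real systems would use normalized or weighted counts such as tf-idf).

import numpy as np

# Toy term-document matrix A (m = 4 terms x n = 4 docs).
# Entries are raw counts; w_ij would normally be weighted.
A = np.array([[1, 1, 0, 0],    # hypothetical term "car"
              [1, 0, 1, 0],    # hypothetical term "automobile"
              [0, 1, 0, 1],    # hypothetical term "search"
              [0, 0, 1, 1]],   # hypothetical term "engine"
             dtype=float)
m, n = A.shape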
Minimization Problem
Minimize information loss ‖A − W S Vᵀ‖. Given:
a norm: for SVD, the 2-norm
constraints on W, S, V: for SVD, W and V are orthonormal, and S is diagonal
[Figure: A = W S Vᵀ]
Matrix Factorizations: SVD
[Figure: A (m × n) = W (m × k) × S (k × k, singular values on the diagonal) × Vᵀ (k × n).]
Restrictions on representation: W, V orthonormal; S diagonal.
Dimension Reduction
For some s << Rank, zero out all but the s biggest singular values in S. Denote by S_s this new version of S.
Typically s is in the hundreds while r (the rank) could be in the (tens of) thousands.
Before: A = W S Vᵀ
Let A_s = W S_s Vᵀ = W_s S_s V_sᵀ
A_s is a good approximation to A: the best rank-s approximation according to the 2-norm.
Dimension Reduction
[Figure: A_s (m × n) = W (m × k) × S_s (k × k, all but the s largest singular values zeroed out) × Vᵀ (k × n).]
The columns of A_s represent the docs, but in s << m dimensions. Best rank-s approximation according to the 2-norm.
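A minimal sketch of the truncation with numpy, reusing the toy A above; the choice s = 2 is an arbitrary illustration.

# Full SVD, then zero out all but the s largest singular values.
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

s = 2
S_s = np.diag(sing_vals[:s])          # keep the s biggest singular values
A_s = W[:, :s] @ S_s @ Vt[:s, :]      # best rank-s approximation (2-norm)

# Documents in the reduced s-dimensional space: columns of S_s @ Vt[:s, :].
docs_reduced = S_s @ Vt[:s, :]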
More on W and V
Recall the m × n term-document matrix A. Define the term-term correlation matrix T = AAᵀ (Aᵀ denotes the transpose of A). T is a square, symmetric m × m matrix.
The doc-doc correlation matrix is D = AᵀA, a square, symmetric n × n matrix. Why?
Eigenvectors
Denote by W the m × r matrix of eigenvectors of T.
Denote by V the n × r matrix of eigenvectors of D.
Denote by S the diagonal matrix with the square roots of the eigenvalues of T = AAᵀ in sorted order.
It turns out that A = WSVᵀ is the SVD of A.
Semi-precise intuition: the new dimensions are the principal components of term correlation space.
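A quick numerical check of this relationship, continuing the toy sketch above (the eigenvalues of T are the squared singular values of A):

# Eigenvalues of T = A Aᵀ equal the squared singular values of A.
T = A @ A.T
eigvals = np.sort(np.linalg.eigvalsh(T))[::-1]   # descending order
assert np.allclose(eigvals, sing_vals ** 2)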
Query processing
Exercise: How do you map the query into the reduced space?
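If you want to check your answer: one standard approach is to fold the query in as a pseudo-document, sketched here as a continuation of the toy example (the query vector is hypothetical, not from the slides).

# Map query q (length m, same term weighting as A) into the reduced space:
# q_s = W_s.T @ q, which equals S_s times the textbook folding-in vector
# inv(S_s) @ W_s.T @ q, so it lives in the same scaling as docs_reduced.
q = np.array([1.0, 1.0, 0.0, 0.0])    # hypothetical query vector
q_s = W[:, :s].T @ q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_s, docs_reduced[:, j]) for j in range(n)]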
Take Away
LSI is optimal: the optimal solution for the given dimensionality. Caveat: mathematically optimal is not necessarily "semantically" optimal.
LSI is unique, except for signs and singular values with the same value.
Key benefits of LSI: enhances recall, addresses the synonymy problem. But it can decrease precision.
Maintenance challenges: changing collections. Recompute in intervals?
Performance challenges: there are cheaper alternatives for recall enhancement, e.g. pseudo-feedback.
Use of LSI in deployed systems? Why?
Resources: LSI
Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html
Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html
Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf
Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.
Interactive Information Retrieval: User Interfaces
The User in Information Access
[Flowchart: from an information need, the user finds a starting point, formulates/reformulates a query, sends it to the system, receives results, and explores them; if done, stop, otherwise reformulate and repeat.]
Main Focus of Information Retrieval
[Same flowchart; the "send to system / receive results" step is highlighted as the focus of most IR.]
Information Access in Context
[Flowchart: a high-level goal drives a loop of analyze, information access, and synthesize; if done, stop, otherwise analyze again.]
The User in Information Access
[The information access flowchart, repeated.]
Queries on the Web: Most Frequent on 2002/10/26
Queries on the Web (2000)
[Chart of query category frequencies not reproduced.]
Why only 9% sex?
Intranet Queries (Aug 2000)
3351 bearfacts, 3349 telebears, 1909 extension, 1874 schedule+of+classes, 1780 bearlink, 1737 bear+facts, 1468 decal, 1443 infobears, 1227 calendar, 989 career+center, 974 campus+map, 920 academic+calendar, 840 map, 773 bookstore, 741 class+pass, 738 housing, 721 tele-bears, 716 directory, 667 schedule, 627 recipes, 602 transcripts, 582 tuition, 577 seti, 563 registrar, 550 info+bears, 543 class+schedule, 470 financial+aid
Source: Ray Larson
Intranet Queries
Summary of sample data from 3 weeks of UCB queries
13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
6.7% Schedule of classes or final exams (6222)
5.4% Summer Session (5041)
3.2% Extension (2932)
3.1% Academic Calendar (2846)
2.4% Directories (2202)
1.7% Career Center (1588)
1.7% Housing (1583)
1.5% Map (1393)
Source: Ray Larson
Types of Information Needs
Need an answer to a question (who won the Superbowl?)
Re-find a particular document
Find a good recipe for tonight's dinner
Exploration of a new area (browse sites about Mexico City)
Authoritative summary of information (HIV review)
In most cases, only one interface! (Cell phone / PDA / camera / MP3 player analogy.)
The User in Information Access
[Flowchart repeated; focus now on finding a starting point.]
Find Starting Point By Browsing
[Figure: a collection of documents (x's); from an entry point, the user browses toward a starting point for search (or toward the answer itself).]
Hierarchical browsing
[Figure: browsing down a hierarchy, from Level 0 through Level 1 to Level 2.]
Visual Browsing: Hyperbolic Tree
Visual Browsing: Themescape
Scatter/Gather
Scatter/Gather allows the user to find a set of documents of interest through browsing. It iterates:
Scatter: take the collection and scatter it into n clusters.
Gather: pick the clusters of interest and merge them.
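A hedged sketch of that loop in Python; KMeans from scikit-learn stands in for the clustering step (the published Scatter/Gather work uses faster custom algorithms such as Buckshot/Fractionation), and chosen_fn is a hypothetical stand-in for the user's cluster selection.

import numpy as np
from sklearn.cluster import KMeans

def scatter_gather(X, chosen_fn, n_clusters=5, iterations=3):
    # X: doc-vector matrix (e.g. tf-idf), one row per document.
    active = np.arange(X.shape[0])        # start with the whole collection
    for _ in range(iterations):
        # Scatter: cluster the currently active documents.
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[active])
        # Gather: the user picks clusters of interest; merge them.
        keep = chosen_fn(active, labels)
        active = active[np.isin(labels, keep)]
    return active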
Scatter/Gather
Browsing vs. Searching
Browsing and searching are often interleaved.
Information need dependent: open-ended (find information about Mexico City) -> browsing; specific (who won the Superbowl?) -> searching.
User dependent: some users prefer searching, others browsing (confirmed in many studies: some hate to type).
Advantage of browsing: you don't need to know the vocabulary of the collection.
Compare to the physical world: browsing vs. searching in a grocery store.
Browsers vs. Searchers
1/3 of users do not search at all.
1/3 rarely search, or type URLs only.
Only 1/3 understand the concept of search. (ISP data from 2000.)
Why?
Starting Points
Methods for finding a starting point:
Select collections from a list (Highwire Press)
Google!
Hierarchical browsing, directories
Visual browsing: hyperbolic tree; Themescape, Kohonen maps
Browsing vs. searching
The User in Information Access
[Flowchart repeated; focus now on formulating/reformulating the query.]
Form-based Query Specification (Infoseek)
Credit: Marti Hearst
Boolean Queries
Boolean logic is difficult for the average user.
Some interfaces for average users support formulation of boolean queries.
The current view is that non-expert users are best served with non-boolean or simple +/- boolean queries (pioneered by AltaVista).
But boolean queries are the standard for certain groups of expert users (e.g., lawyers).
Direct Manipulation Spec: VQUERY (Jones 98)
Credit: Marti Hearst
One Problem With Boolean Queries: Feast or Famine
Specifying a well-targeted query is hard, and the feast-or-famine problem is bigger for Boolean queries.
Example: Google returns 1860 hits for "standard user dlink 650", but 0 hits after adding "no card found".
How general is the query?
Boolean Queries
Summary:
Complex boolean queries are difficult for the average user.
Feast or famine problem.
Prior to Google, many IR researchers thought boolean queries were a bad idea.
Google queries are strict conjunctions. Why is this working well?
Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.
Parametric search example
We can add text search.
Parametric search
Each document has, in addition to text, some “meta-data” e.g., Make, Model, City, Color
A parametric search interface allows the user to combine a full-text query with selections on these parameters
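A hedged sketch of the idea in Python; the field names (Make, City) follow the slide's example, while the function itself is illustrative, not any particular system's API.

def parametric_search(docs, text_query, **params):
    # docs: list of dicts like {"text": ..., "Make": ..., "Model": ...}.
    # Keep docs matching the full-text query and all metadata selections.
    hits = []
    for doc in docs:
        if text_query.lower() not in doc["text"].lower():
            continue
        if all(doc.get(k) == v for k, v in params.items()):
            hits.append(doc)
    return hits

# e.g. parametric_search(cars, "low mileage", Make="Honda", City="Palo Alto")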
Interfaces for term browsing
Re/Formulate Query
Single text box (Google, Stanford intranet)
Command-based (Socrates)
Boolean queries
Parametric search
Term browsing
Other methods: relevance feedback, query expansion, spelling correction, natural language / question answering
The User in Information Access
[Flowchart repeated; focus now on exploring the results.]
Category Labels to Support Exploration
Example: ODP categories on Google.
Advantages: interpretable; capture summary information; describe multiple facets of content; domain dependent, and so descriptive.
Disadvantages: domain dependent, so costly to acquire; may mismatch users' interests.
Credit: Marti Hearst
Evaluate Results: Context in Hierarchy (Cat-a-Cone)
Summarization to Support Exploration
Query-dependent summarization
KWIC (keyword in context) lines (a la google)
Query-independent summarization
Summary written by author (if available)
Automatically generated summary.
Visualize Document Structure for Exploration
Result Exploration
User goal: do these results answer my question?
Methods: category labels; summarization; visualization of document structure.
Other methods: metadata (URL, date, file size, author); hypertext navigation (can I find the answer by following a link?); browsing in general; clustering of results (jaguar example).
Exercise
Current information retrieval user interfaces are designed for typical computer screens.
How would you design a user interface for a wall-sized screen?
Observe your own information seeking behavior. Examples: WWW, university library, grocery store.
Are you a searcher or a browser? How do you reformulate your query?
Read bad hits, then minus terms; read good hits, then plus terms; try a completely different query; …
Take Away
[Flowchart repeated: most IR covers only the "send to system / receive results" step; interactive retrieval addresses the whole loop.]
Evaluation of Interactive Retrieval
Recap: Relevance Feedback
User sends query.
Search system returns results.
User marks some results as relevant and resubmits the query plus the relevant results.
The search system now has a better description of the information need and returns more relevant results.
One method: Rocchio algorithm.
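A minimal sketch of the Rocchio update in Python; the weights alpha, beta, gamma are conventional textbook defaults, not values from the slides.

import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query vector toward relevant docs, away from non-relevant ones.
    q_new = alpha * q
    if len(relevant):
        q_new += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0)   # common choice: clip negative weights to 0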
Why Evaluate Relevance Feedback?
Simulated interactive retrieval consistently outperforms non-interactive retrieval (70% here).
Relevance Feedback Evaluation: Case Study
Example of evaluation of interactive information retrieval: Koenemann & Belkin 1996.
Goal of study: show that relevance feedback improves retrieval effectiveness.
Details on the User Study
64 novice searchers: 43 female, 21 male, native English speakers.
TREC test bed: Wall Street Journal subset.
Two search topics: Automobile Recalls; Tobacco Advertising and the Young.
Relevance judgements from TREC and the experimenter.
System was INQUERY (vector space with some bells and whistles).
Subjects had a tutorial session to learn the system.
Their goal was to keep modifying the query until they had developed one that gets high precision.
Reweighting of terms similar to, but different from, Rocchio.
Credit: Marti Hearst
Evaluation
Criterion: p@30 (precision at 30 documents).
Compare p@30 for users with relevance feedback vs. p@30 for users without relevance feedback.
Goal: show that users with relevance feedback do better.
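For concreteness, a minimal sketch of the metric in Python (an illustrative helper, not code from the study):

def precision_at_k(ranked_doc_ids, relevant_ids, k=30):
    # Fraction of the top k returned documents that are relevant.
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k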
Precision vs. RF condition (from Koenemann & Belkin 96)
Credit: Marti Hearst
Result
Subjects with relevance feedback had, on average, 17-34% better performance than subjects without relevance feedback.
Does this show conclusively that relevance feedback is better?
But: the difference in precision numbers was not statistically significant, and search times were approximately equal.
Take Away
Evaluating interactive systems is harder than evaluating algorithms.
Experiments involving humans have many confounding variables: age, level of education, prior experience with search, search style (browsing vs. searching), Mac vs. Linux vs. MS user, mood, level of alertness, chemistry with the experimenter, etc.
Showing statistical significance becomes harder as the number of confounding variables increases.
Also: human subject studies are resource-intensive.
It's hard to "scientifically prove" the superiority of relevance feedback.
Other Evaluation Issues
Query variability: always compare methods on a query-by-query basis; methods with the same average performance can differ a lot in user friendliness.
Inter-judge variability: in general, judges disagree often; big impact on the relevance assessment of a single document; little impact on the ranking of systems.
Redundancy: a highly relevant document with no new information is useless; most IR measures don't measure redundancy.
Resources
FOA 4.3
MIR Ch. 10.8 – 10.10
Ellen Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, ACM Sigir 98
Harman, D.K. Overview of the Third Text REtrieval Conference (TREC-3). In: Overview of The Third Text REtrieval Conference (TREC-3). Harman, D.K. (Ed.). NIST Special Publication 500-225, 1995, pp. 1-19.
Marti A. Hearst, Jan O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR-96, 1996.
Paul Over, TREC-6 Interactive Track Report, NIST, 1998.
Resources
MIR Ch. 10.0 – 10.7
Donna Harman, Overview of the fourth text retrieval conference (TREC 4), National Institute of Standards and Technology.
Cutting, Karger, Pedersen, Tukey. Scatter/Gather. ACM SIGIR. http://citeseer.nj.nec.com/cutting92scattergather.html
Hearst, Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval results in a large category hierarchy, ACM SIGIR.
http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm
http://otal.umd.edu/olive