WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 14
Today’s Topics
Latent Semantic Indexing / dimension reduction
Interactive information retrieval / user interfaces
Evaluation of interactive retrieval
How LSI is used for Text Search
LSI is a technique for dimension reduction:
Similar to Principal Component Analysis (PCA)
Addresses (near-)synonymy: car/automobile
Attempts to enable concept-based retrieval
Pre-process docs using a technique from linear algebra called Singular Value Decomposition.
Reduce dimensionality:
Fewer dimensions: more "collapsing of axes", better recall, worse precision.
More dimensions: less collapsing, worse recall, better precision.
Queries are handled in this new (reduced) vector space.
Input: term-document matrix A, with w_{i,j} = (normalized) weighted count of (t_i, d_j).
Key idea: factorize this matrix.
[Figure: the m × n matrix A (terms t_i × documents d_j) factorized as A = W × H; column h_j of H is the representation of d_j in terms of the basis W.]
If rank(W) ≥ rank(A), then we can always find H so that A = WH. Notice the duality of the problem.
More "semantic" dimensions -> LSI (latent semantic indexing).
Matrix Factorization
[Figure: A (m × n) = W (m × k) × H (k × n); column h_j of H represents document d_j in the basis W.]
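To make this concrete, here is a minimal numpy sketch of a term-document matrix used as a running example below; the four terms, four documents, and raw counts are invented for illustration (real systems would use normalized or weighted counts such as tf-idf).

import numpy as np

# Toy term-document matrix A (m = 4 terms x n = 4 docs).
# Entries are raw counts; w_ij would normally be weighted.
A = np.array([[1, 1, 0, 0],    # hypothetical term "car"
              [1, 0, 1, 0],    # hypothetical term "automobile"
              [0, 1, 0, 1],    # hypothetical term "search"
              [0, 0, 1, 1]],   # hypothetical term "engine"
             dtype=float)
m, n = A.shape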
Minimization Problem
Minimize information loss ‖A − W S Vᵀ‖. Given:
a norm: for SVD, the 2-norm
constraints on W, S, V: for SVD, W and V are orthonormal, and S is diagonal
[Figure: A = W S Vᵀ]
Matrix Factorizations: SVD
[Figure: A (m × n) = W (m × k) × S (k × k, singular values on the diagonal) × Vᵀ (k × n).]
Restrictions on representation: W, V orthonormal; S diagonal.
Dimension Reduction
For some s << Rank, zero out all but the s biggest singular values in S. Denote by S_s this new version of S.
Typically s is in the hundreds while r (the rank) could be in the (tens of) thousands.
Before: A = W S Vᵀ
Let A_s = W S_s Vᵀ = W_s S_s V_sᵀ
A_s is a good approximation to A: the best rank-s approximation according to the 2-norm.
Dimension Reduction
[Figure: A_s (m × n) = W (m × k) × S_s (k × k, all but the s largest singular values zeroed out) × Vᵀ (k × n).]
The columns of A_s represent the docs, but in s << m dimensions. Best rank-s approximation according to the 2-norm.
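A minimal sketch of the truncation with numpy, reusing the toy A above; the choice s = 2 is an arbitrary illustration.

# Full SVD, then zero out all but the s largest singular values.
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

s = 2
S_s = np.diag(sing_vals[:s])          # keep the s biggest singular values
A_s = W[:, :s] @ S_s @ Vt[:s, :]      # best rank-s approximation (2-norm)

# Documents in the reduced s-dimensional space: columns of S_s @ Vt[:s, :].
docs_reduced = S_s @ Vt[:s, :]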
More on W and V
Recall the m × n term-document matrix A. Define the term-term correlation matrix T = AAᵀ (Aᵀ denotes the transpose of A). T is a square, symmetric m × m matrix.
The doc-doc correlation matrix is D = AᵀA, a square, symmetric n × n matrix. Why?
Eigenvectors
Denote by W the m × r matrix of eigenvectors of T.
Denote by V the n × r matrix of eigenvectors of D.
Denote by S the diagonal matrix with the square roots of the eigenvalues of T = AAᵀ in sorted order.
It turns out that A = WSVᵀ is the SVD of A.
Semi-precise intuition: the new dimensions are the principal components of term correlation space.
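A quick numerical check of this relationship, continuing the toy sketch above (the eigenvalues of T are the squared singular values of A):

# Eigenvalues of T = A Aᵀ equal the squared singular values of A.
T = A @ A.T
eigvals = np.sort(np.linalg.eigvalsh(T))[::-1]   # descending order
assert np.allclose(eigvals, sing_vals ** 2)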
Query processing
Exercise: How do you map the query into the reduced space?
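If you want to check your answer: one standard approach is to fold the query in as a pseudo-document, sketched here as a continuation of the toy example (the query vector is hypothetical, not from the slides).

# Map query q (length m, same term weighting as A) into the reduced space:
# q_s = W_s.T @ q, which equals S_s times the textbook folding-in vector
# inv(S_s) @ W_s.T @ q, so it lives in the same scaling as docs_reduced.
q = np.array([1.0, 1.0, 0.0, 0.0])    # hypothetical query vector
q_s = W[:, :s].T @ q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_s, docs_reduced[:, j]) for j in range(n)]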
Take Away
LSI is optimal: the optimal solution for the given dimensionality. Caveat: mathematically optimal is not necessarily "semantically" optimal.
LSI is unique, except for signs and singular values with the same value.
Key benefits of LSI: enhances recall, addresses the synonymy problem. But it can decrease precision.
Maintenance challenges: changing collections. Recompute in intervals?
Performance challenges: there are cheaper alternatives for recall enhancement, e.g. pseudo-feedback.
Use of LSI in deployed systems? Why?
Resources: LSI
Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html
Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html
Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf
Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.
Interactive Information Retrieval: User Interfaces
The User in Information Access
[Flowchart: from an information need, the user finds a starting point, formulates/reformulates a query, sends it to the system, receives results, and explores them; if done, stop, otherwise reformulate and repeat.]
Main Focus of Information Retrieval
[Same flowchart; the "send to system / receive results" step is highlighted as the focus of most IR.]
Information Access in Context
[Flowchart: a high-level goal drives a loop of analyze, information access, and synthesize; if done, stop, otherwise analyze again.]
The User in Information Access
[The information access flowchart, repeated.]
Queries on the Web: Most Frequent on 2002/10/26
Queries on the Web (2000)
[Chart of query category frequencies not reproduced.]
Why only 9% sex?
Intranet Queries (Aug 2000)
3351 bearfacts, 3349 telebears, 1909 extension, 1874 schedule+of+classes, 1780 bearlink, 1737 bear+facts, 1468 decal, 1443 infobears, 1227 calendar, 989 career+center, 974 campus+map, 920 academic+calendar, 840 map, 773 bookstore, 741 class+pass, 738 housing, 721 tele-bears, 716 directory, 667 schedule, 627 recipes, 602 transcripts, 582 tuition, 577 seti, 563 registrar, 550 info+bears, 543 class+schedule, 470 financial+aid
Source: Ray Larson
Intranet Queries
Summary of sample data from 3 weeks of UCB queries
13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
6.7% Schedule of classes or final exams (6222)
5.4% Summer Session (5041)
3.2% Extension (2932)
3.1% Academic Calendar (2846)
2.4% Directories (2202)
1.7% Career Center (1588)
1.7% Housing (1583)
1.5% Map (1393)
Source: Ray Larson
Types of Information Needs
Need an answer to a question (who won the Superbowl?)
Re-find a particular document
Find a good recipe for tonight's dinner
Exploration of a new area (browse sites about Mexico City)
Authoritative summary of information (HIV review)
In most cases, only one interface! (Cell phone / PDA / camera / MP3 player analogy.)
The User in Information Access
[Flowchart repeated; focus now on finding a starting point.]
Find Starting Point By Browsing
[Figure: a collection of documents (x's); from an entry point, the user browses toward a starting point for search (or toward the answer itself).]
Hierarchical browsing
[Figure: browsing down a hierarchy, from Level 0 through Level 1 to Level 2.]
Visual Browsing: Hyperbolic Tree
Visual Browsing: Themescape
Scatter/Gather
Scatter/Gather allows the user to find a set of documents of interest through browsing. It iterates:
Scatter: take the collection and scatter it into n clusters.
Gather: pick the clusters of interest and merge them.
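A hedged sketch of that loop in Python; KMeans from scikit-learn stands in for the clustering step (the published Scatter/Gather work uses faster custom algorithms such as Buckshot/Fractionation), and chosen_fn is a hypothetical stand-in for the user's cluster selection.

import numpy as np
from sklearn.cluster import KMeans

def scatter_gather(X, chosen_fn, n_clusters=5, iterations=3):
    # X: doc-vector matrix (e.g. tf-idf), one row per document.
    active = np.arange(X.shape[0])        # start with the whole collection
    for _ in range(iterations):
        # Scatter: cluster the currently active documents.
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[active])
        # Gather: the user picks clusters of interest; merge them.
        keep = chosen_fn(active, labels)
        active = active[np.isin(labels, keep)]
    return active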
Scatter/Gather
Browsing vs. Searching
Browsing and searching are often interleaved.
Information need dependent: open-ended (find information about Mexico City) -> browsing; specific (who won the Superbowl?) -> searching.
User dependent: some users prefer searching, others browsing (confirmed in many studies: some hate to type).
Advantage of browsing: you don't need to know the vocabulary of the collection.
Compare to the physical world: browsing vs. searching in a grocery store.
Browsers vs. Searchers
1/3 of users do not search at all.
1/3 rarely search, or type URLs only.
Only 1/3 understand the concept of search. (ISP data from 2000.)
Why?
Starting Points
Methods for finding a starting point:
Select collections from a list (Highwire Press)
Google!
Hierarchical browsing, directories
Visual browsing: hyperbolic tree; Themescape, Kohonen maps
Browsing vs. searching
The User in Information Access
[Flowchart repeated; focus now on formulating/reformulating the query.]
Form-based Query Specification (Infoseek)
Credit: Marti Hearst
Boolean Queries
Boolean logic is difficult for the average user.
Some interfaces for average users support formulation of boolean queries.
The current view is that non-expert users are best served with non-boolean or simple +/- boolean queries (pioneered by AltaVista).
But boolean queries are the standard for certain groups of expert users (e.g., lawyers).
Direct Manipulation Spec: VQUERY (Jones 98)
Credit: Marti Hearst
One Problem With Boolean Queries: Feast or Famine
Specifying a well-targeted query is hard, and the feast-or-famine problem is bigger for Boolean queries.
Example: Google returns 1860 hits for "standard user dlink 650", but 0 hits after adding "no card found".
How general is the query?
Boolean Queries
Summary:
Complex boolean queries are difficult for the average user.
Feast or famine problem.
Prior to Google, many IR researchers thought boolean queries were a bad idea.
Google queries are strict conjunctions. Why is this working well?
Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.
Parametric search example
We can add text search.
Parametric search
Each document has, in addition to text, some “meta-data” e.g., Make, Model, City, Color
A parametric search interface allows the user to combine a full-text query with selections on these parameters
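A hedged sketch of the idea in Python; the field names (Make, City) follow the slide's example, while the function itself is illustrative, not any particular system's API.

def parametric_search(docs, text_query, **params):
    # docs: list of dicts like {"text": ..., "Make": ..., "Model": ...}.
    # Keep docs matching the full-text query and all metadata selections.
    hits = []
    for doc in docs:
        if text_query.lower() not in doc["text"].lower():
            continue
        if all(doc.get(k) == v for k, v in params.items()):
            hits.append(doc)
    return hits

# e.g. parametric_search(cars, "low mileage", Make="Honda", City="Palo Alto")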
Interfaces for term browsing
Re/Formulate Query
Single text box (Google, Stanford intranet)
Command-based (Socrates)
Boolean queries
Parametric search
Term browsing
Other methods: relevance feedback, query expansion, spelling correction, natural language / question answering
The User in Information Access
[Flowchart repeated; focus now on exploring the results.]
Category Labels to Support Exploration
Example: ODP categories on Google.
Advantages: interpretable; capture summary information; describe multiple facets of content; domain dependent, and so descriptive.
Disadvantages: domain dependent, so costly to acquire; may mismatch users' interests.
Credit: Marti Hearst
Evaluate Results: Context in Hierarchy (Cat-a-Cone)
Summarization to Support Exploration
Query-dependent summarization
KWIC (keyword in context) lines (a la google)
Query-independent summarization
Summary written by author (if available)
Automatically generated summary.
Visualize Document Structure for Exploration
Result Exploration
User goal: do these results answer my question?
Methods: category labels; summarization; visualization of document structure.
Other methods: metadata (URL, date, file size, author); hypertext navigation (can I find the answer by following a link?); browsing in general; clustering of results (jaguar example).
Exercise
Current information retrieval user interfaces are designed for typical computer screens.
How would you design a user interface for a wall-sized screen?
Observe your own information seeking behavior. Examples: WWW, university library, grocery store.
Are you a searcher or a browser? How do you reformulate your query?
Read bad hits, then minus terms; read good hits, then plus terms; try a completely different query; …
Take Away
[Flowchart repeated: most IR covers only the "send to system / receive results" step; interactive retrieval addresses the whole loop.]
Evaluation of Interactive Retrieval
Recap: Relevance Feedback
User sends query.
Search system returns results.
User marks some results as relevant and resubmits the query plus the relevant results.
The search system now has a better description of the information need and returns more relevant results.
One method: Rocchio algorithm.
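A minimal sketch of the Rocchio update in Python; the weights alpha, beta, gamma are conventional textbook defaults, not values from the slides.

import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query vector toward relevant docs, away from non-relevant ones.
    q_new = alpha * q
    if len(relevant):
        q_new += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0)   # common choice: clip negative weights to 0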
Why Evaluate Relevance Feedback?
Simulated interactive retrieval consistently outperforms non-interactive retrieval (70% here).
Relevance Feedback Evaluation: Case Study
Example of evaluation of interactive information retrieval: Koenemann & Belkin 1996.
Goal of study: show that relevance feedback improves retrieval effectiveness.
Details on the User Study
64 novice searchers: 43 female, 21 male, native English speakers.
TREC test bed: Wall Street Journal subset.
Two search topics: Automobile Recalls; Tobacco Advertising and the Young.
Relevance judgements from TREC and the experimenter.
System was INQUERY (vector space with some bells and whistles).
Subjects had a tutorial session to learn the system.
Their goal was to keep modifying the query until they had developed one that gets high precision.
Reweighting of terms similar to, but different from, Rocchio.
Credit: Marti Hearst
Evaluation
Criterion: p@30 (precision at 30 documents).
Compare p@30 for users with relevance feedback vs. p@30 for users without relevance feedback.
Goal: show that users with relevance feedback do better.
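For concreteness, a minimal sketch of the metric in Python (an illustrative helper, not code from the study):

def precision_at_k(ranked_doc_ids, relevant_ids, k=30):
    # Fraction of the top k returned documents that are relevant.
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k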
Precision vs. RF condition (from Koenemann & Belkin 96)
Credit: Marti Hearst
Result
Subjects with relevance feedback had, on average, 17-34% better performance than subjects without relevance feedback.
Does this show conclusively that relevance feedback is better?
But: the difference in precision numbers was not statistically significant, and search times were approximately equal.
Take Away
Evaluating interactive systems is harder than evaluating algorithms.
Experiments involving humans have many confounding variables: age, level of education, prior experience with search, search style (browsing vs. searching), Mac vs. Linux vs. MS user, mood, level of alertness, chemistry with the experimenter, etc.
Showing statistical significance becomes harder as the number of confounding variables increases.
Also: human subject studies are resource-intensive.
It's hard to "scientifically prove" the superiority of relevance feedback.
Other Evaluation Issues
Query variability: always compare methods on a query-by-query basis; methods with the same average performance can differ a lot in user friendliness.
Inter-judge variability: in general, judges disagree often; big impact on the relevance assessment of a single document; little impact on the ranking of systems.
Redundancy: a highly relevant document with no new information is useless; most IR measures don't measure redundancy.
Resources
FOA 4.3
MIR Ch. 10.8 – 10.10
Ellen Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, ACM Sigir 98
Harman, D.K. Overview of the Third Text REtrieval Conference (TREC-3). In: Overview of The Third Text REtrieval Conference (TREC-3). Harman, D.K. (Ed.). NIST Special Publication 500-225, 1995, pp. 1-19.
Marti A. Hearst, Jan O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR-96, 1996.
Paul Over, TREC-6 Interactive Track Report, NIST, 1998.
Resources
MIR Ch. 10.0 – 10.7
Donna Harman, Overview of the fourth text retrieval conference (TREC 4), National Institute of Standards and Technology.
Cutting, Karger, Pedersen, Tukey. Scatter/Gather. ACM SIGIR. http://citeseer.nj.nec.com/cutting92scattergather.html
Hearst, Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval results in a large category hierarchy, ACM SIGIR.
http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm
http://otal.umd.edu/olive