9/19/2000 Information Organization and Retrieval
Vector and Probabilistic Ranking
Ray Larson & Marti Hearst
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Review
• Inverted files
• The Vector Space Model
• Term weighting
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2
• Also used for statistical ranking algorithms
How Inverted Files are Used
Dictionary and postings for a two-document collection (the postings file holds a (Doc #, Freq) pair for each term occurrence):

Term      N docs  Tot Freq
a           1       1
aid         1       1
all         1       1
and         1       1
come        1       1
country     2       2
dark        1       1
for         1       1
good        1       1
in          1       1
is          1       1
it          1       1
manor       1       1
men         1       1
midnight    1       1
night       1       1
now         1       1
of          1       1
past        1       1
stormy      1       1
the         2       4
their       1       1
time        2       2
to          1       2
was         1       2

Query on “time” AND “dark”:
– 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
– 1 doc with “dark” in dictionary -> ID 2 from posting file
– Therefore, only doc 2 satisfies the query.
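As a concrete sketch of the slide's example (the helper names are mine, not from the slides), the two-document collection can be indexed and the Boolean AND answered by intersecting posting lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict of doc_id -> within-document frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}
index = build_inverted_index(docs)

# Boolean AND: intersect the doc-ID sets of the two posting lists
time_docs = set(index["time"])
dark_docs = set(index["dark"])
print(sorted(time_docs & dark_docs))  # only doc 2 contains both terms
```

The same postings (with frequencies) are what a statistical ranking algorithm would consume.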
Vector Space Model
• Documents are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
Documents in 3D Space
Assumption: Documents that are “close together” in space are similar in meaning.
Vector Space Documents and Queries

docs  t1  t2  t3  RSV = Q·Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   1   0   1   4
Q     1   2   3

[Figure: documents D1–D11 and query Q plotted in term space (t1, t2, t3), illustrating Boolean term combinations]

Q is a query – also represented as a vector
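The RSV column above is just the inner product of the query vector with each binary document vector; a minimal sketch (function name mine) that reproduces the table's ranking:

```python
# Binary document vectors over terms (t1, t2, t3), query weights Q = (1, 2, 3)
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
}
Q = (1, 2, 3)

def rsv(q, d):
    """Retrieval status value: inner product Q . Di."""
    return sum(qi * di for qi, di in zip(q, d))

ranked = sorted(docs, key=lambda name: rsv(Q, docs[name]), reverse=True)
print([(name, rsv(Q, docs[name])) for name in ranked])  # D5 (RSV 6) ranks first
```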
Documents in Vector Space

[Figure: documents D1–D11 plotted in the term space (t1, t2, t3)]
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
– Recall the Zipf distribution
– Want to weight terms highly if they are
• frequent in relevant documents … BUT
• infrequent in the collection as a whole
• Automatically derived thesaurus terms
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0  10   0
D9    0   0   1
D10   0   3   5
D11   4   0   1
Assigning Weights
• tf x idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: assign a tf x idf weight to each term in each document
tf x idf

w_ik = tf_ik · log(N / n_k)

where:
– T_k = term k in document D_i
– tf_ik = frequency of term T_k in document D_i
– idf_k = inverse document frequency of term T_k in collection C
– N = total number of documents in collection C
– n_k = number of documents in C that contain T_k
– idf_k = log(N / n_k)
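A direct transcription of this weight (assuming, as in the worked idf examples elsewhere in these slides, that the log is base 10):

```python
import math

def tfidf(tf_ik, N, n_k):
    """w_ik = tf_ik * log10(N / n_k): raw term frequency scaled by
    inverse document frequency."""
    return tf_ik * math.log10(N / n_k)

# A term occurring 3 times in a document, and in 20 of 10000 documents:
print(round(tfidf(3, 10000, 20), 3))  # 3 * log10(500) = 8.097
```

A term present in every document (n_k = N) gets weight 0, no matter how frequent it is locally.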
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10000 documents (base-10 log):
– n_k = 10000: idf = log(10000/10000) = 0
– n_k = 5000: idf = log(10000/5000) = 0.301
– n_k = 20: idf = log(10000/20) = 2.698
– n_k = 1: idf = log(10000/1) = 4
Similarity Measures
Simple matching (coordination level match): |Q ∩ D|

Dice’s Coefficient: 2·|Q ∩ D| / (|Q| + |D|)

Jaccard’s Coefficient: |Q ∩ D| / |Q ∪ D|

Cosine Coefficient: |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))

Overlap Coefficient: |Q ∩ D| / min(|Q|, |D|)
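For binary (set-of-terms) representations, these coefficients are a few lines each; a sketch with an invented query/document pair:

```python
def simple_matching(q, d): return len(q & d)
def dice(q, d): return 2 * len(q & d) / (len(q) + len(d))
def jaccard(q, d): return len(q & d) / len(q | d)
def cosine(q, d): return len(q & d) / (len(q) ** 0.5 * len(d) ** 0.5)
def overlap(q, d): return len(q & d) / min(len(q), len(d))

Q = {"star", "galaxy", "nebula"}
D = {"star", "galaxy", "planet", "comet"}
print(jaccard(Q, D))  # 2 shared terms out of 5 distinct -> 0.4
```

All five agree on the numerator |Q ∩ D|; they differ only in how they normalize for the sizes of Q and D.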
Vector Space Visualization
Text Clustering

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

[Figure: document clusters plotted against Term 1 and Term 2]
K-Means Clustering
1. Create a pair-wise similarity measure
2. Find K centers using agglomerative clustering
– take a small sample
– group bottom up until K groups found
3. Assign each document to nearest center, forming new clusters
4. Repeat step 3 as necessary
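A minimal sketch of the assign-and-recenter loop (steps 3–4) on 2-D points; for simplicity the seeds are passed in directly rather than found by agglomerative clustering as step 2 describes:

```python
def kmeans(points, centers, iters=10):
    """Repeat: assign each point to its nearest center, then move each
    center to the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (9, 9)])
print(centers)  # each center moves to the mean of its 3-point cluster
```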
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
S/G Example: query on “star” (Encyclopedia text)

14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous
Clustering and re-clustering is entirely automated
Another use of clustering
• Use clustering to map the entire multidimensional document space into a large number of small clusters.
• “Project” these onto a 2D graphical representation:
Clustering Multi-Dimensional Document Space
(image from Wise et al 95)
Concept “Landscapes”

[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]

(e.g., Lin, Chen, Wise et al.)
• Too many concepts, or too coarse
• Single concept per document
• No titles
• Browsing without search
Clustering
• Advantages:
– See some main themes
• Disadvantage:
– Many ways documents could group together are hidden
• Thinking point: what is the relationship to classification systems and facets?
Today
• Vector Space Ranking
• Probabilistic Models and Ranking
• (lots of math)
tf x idf normalization
• Normalize the term weights (so longer documents are not unfairly given more weight)
– normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.

w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1}^{t} (tf_ik)^2 · [log(N / n_k)]^2 )
Vector space similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

sim(D_i, D_j) = Σ_{k=1}^{t} w_ik · w_jk

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)
Vector Space Similarity Measure
combine tf x idf into a similarity measure

D_i = (d_i1, d_i2, …, d_it)
Q = (w_q1, w_q2, …, w_qt), where w = 0 if a term is absent

if term weights normalized:

sim(Q, D_i) = Σ_{j=1}^{t} w_qj · w_dij

otherwise normalize in the similarity comparison:

sim(Q, D_i) = Σ_{j=1}^{t} w_qj · w_dij / sqrt( Σ_{j=1}^{t} (w_qj)^2 · Σ_{j=1}^{t} (w_dij)^2 )
To Think About
• How does this ranking algorithm behave?
– Make a set of hypothetical documents consisting of terms and their weights
– Create some hypothetical queries
– How are the documents ranked, depending on the weights of their terms and the queries’ terms?
Computing Similarity Scores
[Figure: two-dimensional term-weight space showing Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7), with cos α1 = 0.74 and cos α2 = 0.98]
Computing a similarity score
Say we have query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7). What does their similarity comparison yield?

sim(Q, D2) = (0.4 · 0.2 + 0.8 · 0.7) / sqrt( [(0.4)^2 + (0.8)^2] · [(0.2)^2 + (0.7)^2] ) = 0.64 / sqrt(0.42) ≈ 0.98
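The same computation as a small function (names mine), applied to the slide's vectors:

```python
def cosine_sim(q, d):
    """sim(Q,D) = sum(q_j * d_j) / (sqrt(sum q_j^2) * sqrt(sum d_j^2))"""
    dot = sum(qj * dj for qj, dj in zip(q, d))
    norm_q = sum(qj ** 2 for qj in q) ** 0.5
    norm_d = sum(dj ** 2 for dj in d) ** 0.5
    return dot / (norm_q * norm_d)

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_sim(Q, D2), 2))  # 0.98: D2 points in nearly the same direction as Q
print(round(cosine_sim(Q, D1), 2))  # ~0.73
```

D2 ranks above D1 because the cosine measures direction, not magnitude: D2's term weights are in almost the same proportion as the query's.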
Vector Space with Term Weights and Cosine Matching

D_i = (d_i1, w_di1; d_i2, w_di2; …; d_it, w_dit)
Q = (q_i1, w_qi1; q_i2, w_qi2; …; q_it, w_qit)

sim(Q, D_i) = Σ_{j=1}^{t} w_qj · w_dij / ( sqrt( Σ_{j=1}^{t} (w_qj)^2 ) · sqrt( Σ_{j=1}^{t} (w_dij)^2 ) )

[Figure: Term A / Term B plot of Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7), with angles α1 and α2]

Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)

sim(Q, D2) = (0.4 · 0.2 + 0.8 · 0.7) / sqrt( [(0.4)^2 + (0.8)^2] · [(0.2)^2 + (0.7)^2] ) = 0.64 / sqrt(0.42) ≈ 0.98

sim(Q, D1) = 0.56 / sqrt(0.58) ≈ 0.74
Weighting schemes
• We have seen something of
– Binary
– Raw term weights
– TF*IDF
• There are many other possibilities
– IDF alone
– Normalized term frequency
Term Weights in SMART
• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.
• Designed for laboratory experiments in IR
– Easy to mix and match different weighting methods
– Really terrible user interface
– Intended for use by code hackers (and even they have trouble using it)
Term Weights in SMART
• In SMART, weights are decomposed into three factors:

w_kd = freq_kd · collect_k / norm
SMART Freq Components
freq_kd =
– Binary: freq_kd ∈ {0, 1}
– maxnorm: freq_kd / max(freq_d)
– augmented: 1/2 + (1/2) · freq_kd / max(freq_d)
– log: ln(freq_kd) + 1
Collection Weighting in SMART
collect_k =
– Inverse: log(NDoc / Doc_k)
– squared: [log(NDoc / Doc_k)]^2
– probabilistic: log((NDoc - Doc_k) / Doc_k)
– frequency: 1 / Doc_k

(NDoc = number of documents in the collection; Doc_k = number of documents containing term k)
Term Normalization in SMART
norm =
– sum: Σ_j w_vector,j
– cosine: sqrt( Σ_j (w_vector,j)^2 )
– fourth: ( Σ_j (w_vector,j)^4 )^(1/4)
– max: max_j (w_vector,j)
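One way to see how the three factors combine: a sketch (function names mine, and the particular scheme chosen — augmented frequency, Inverse collection weight, cosine normalization — is just one of SMART's many mix-and-match combinations):

```python
import math

def augmented_tf(freq_kd, max_freq_d):
    # "augmented" frequency component: 1/2 + (1/2) * freq / max(freq)
    return 0.5 + 0.5 * freq_kd / max_freq_d

def inverse_collect(n_docs, doc_k):
    # "Inverse" collection component: log(NDoc / Doc_k)
    return math.log(n_docs / doc_k)

def smart_weights(freqs, doc_counts, n_docs):
    """freqs: term -> raw frequency in this document;
    doc_counts: term -> number of documents containing the term."""
    max_freq = max(freqs.values())
    raw = {t: augmented_tf(f, max_freq) * inverse_collect(n_docs, doc_counts[t])
           for t, f in freqs.items()}
    # cosine normalization: divide by sqrt of sum of squared weights
    norm = math.sqrt(sum(w ** 2 for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

w = smart_weights({"star": 3, "galaxy": 1}, {"star": 50, "galaxy": 5}, n_docs=1000)
print(w)  # the rarer term "galaxy" ends up weighted higher despite lower tf
```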
Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms
Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Rely on accurate estimates of probabilities
Probability Ranking Principle
• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
Stephen E. Robertson, J. Documentation 1977
Model 1 – Maron and Kuhns
• Concerned with estimating probabilities of relevance at the point of indexing:– If a patron came with a request using term ti,
what is the probability that she/he would be satisfied with document Dj ?
Bayes Formula
• Bayesian statistical inference used in both models…

P(B, C | A) = P(C, B | A)
P(B | A) · P(C | A, B) = P(C | A) · P(B | A, C)

therefore:

P(B | A, C) = P(B | A) · P(C | A, B) / P(C | A)
Model 1 Bayes
• A is the class of events of using the library
• D_i is the class of events of Document i being judged relevant
• I_j is the class of queries consisting of the single term I_j
• P(D_i | A, I_j) = probability that if a query is submitted to the system then a relevant document is retrieved

P(D_i | A, I_j) = P(D_i | A) · P(I_j | A, D_i) / P(I_j | A)
Model 2 – Robertson & Sparck Jones
Given a term t and a query q:

                        Document Relevance
                        +          -          Total
Document      +         r          n-r        n
indexing      -         R-r        N-n-R+r    N-n
              Total     R          N-R        N
Robertson-Sparck Jones Weights
• Retrospective formulation --

w = log [ (r / (R - r)) / ((n - r) / (N - n - R + r)) ]
Robertson-Sparck Jones Weights
• Predictive formulation --

w^(1) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]
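The predictive weight transcribes directly from the contingency table (the 0.5 corrections keep the estimate defined when any cell is zero); the example values below are invented for illustration:

```python
import math

def rsj_weight(r, R, n, N):
    """Predictive Robertson-Sparck Jones weight w(1).
    r: relevant docs containing the term, R: relevant docs,
    n: docs containing the term, N: docs in the collection."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term in 8 of 10 relevant docs but only 20 of 1000 docs overall
# gets a strongly positive weight:
print(round(rsj_weight(r=8, R=10, n=20, N=1000), 3))
```

A term distributed the same way inside and outside the relevant set gets a weight near zero, i.e. it carries no evidence about relevance.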
Model 1
• A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q.
Robertson, Maron & Cooper, 1982
Model 2
• Documents have many different properties; some documents
have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.
Robertson, Maron & Cooper, 1982
Probabilistic Models: Some Unifying Notation
• D = All present and future documents
• Q = All present and future queries
• (D_i, Q_j) = A document-query pair
• x = class of similar documents, x ⊆ D
• y = class of similar queries, y ⊆ Q
• Relevance (R) is a relation:

R = {(D_i, Q_j) | D_i ∈ D, Q_j ∈ Q, document D_i is judged relevant by the user submitting Q_j}
Probabilistic Models
• Model 1 -- Probabilistic Indexing, P(R|y,Di)
• Model 2 -- Probabilistic Querying, P(R|Qj,x)
• Model 3 -- Merged Model, P(R|Q_j, D_i)
• Model 0 -- P(R|y, x)
• Probabilities are estimated based on prior usage or relevance estimation
Probabilistic Models

[Figure: document space D and query space Q, with document class x containing D_i and query class y containing Q_j]
Logistic Regression
• Based on work by William Cooper, Fred Gey and Daniel Dabney.
• Builds a regression model for relevance prediction based on a set of training data
• Uses less restrictive independence assumptions than Model 2
– Linked Dependence
Probabilistic Models: Logistic Regression
• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:

log O(R | q_i, d_j) = c_0 + c_1·v_1 + c_2·v_2 + … + c_n·v_n

Term contributions summed:

log O(R | q_i, d_j) = Σ_{k=1}^{m} [ log O(R | q_i, d_j, t_k) - log O(R) ]

Probability of Relevance is inverse of log odds:

P(R | q_i, d_j) = 1 / (1 + e^(-log O(R | q_i, d_j)))
Logistic Regression
[Figure: relevance (0-100) plotted against term frequency in document (0-60)]
Probabilistic Models: Logistic Regression attributes
X_1 = (1/M) · Σ log QAF_t   (Average Absolute Query Frequency)
X_2 = sqrt(QL)              (Query Length)
X_3 = (1/M) · Σ log DAF_t   (Average Absolute Document Frequency)
X_4 = sqrt(DL)              (Document Length)
X_5 = (1/M) · Σ log IDF_t   (Average Inverse Document Frequency), where IDF_t = N / n_t (Inverse Document Frequency)
X_6 = log M                 (Number of Terms in common between query and document -- logged)

The sums run over the M terms shared by the query and the document.
Probabilistic Models: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained from:

log O(R | Q, D) = c_0 + Σ_{i=1}^{6} c_i · X_i

for the 6 X attribute measures shown previously.
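A sketch of the retrieval-time computation; the coefficient values here are invented placeholders (in a real system such as Cheshire II they come from fitting the regression to training data with relevance judgments):

```python
import math

# Hypothetical coefficients c0..c6 (would be estimated from training data)
coeffs = [-3.5, 0.4, -0.2, 0.3, -0.1, 0.5, 0.6]

def relevance_probability(x, c=coeffs):
    """P(R|Q,D) = 1 / (1 + exp(-(c0 + sum(ci * Xi)))) over the six attributes."""
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1 / (1 + math.exp(-log_odds))

x = [1.2, 2.0, 1.5, 9.0, 3.1, 1.6]  # example values of X1..X6
print(round(relevance_probability(x), 3))
```

Because the logistic function is monotonic in the log odds, ranking by probability and ranking by the linear score give the same document order.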
Probabilistic Models
Advantages:
• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector

Disadvantages:
• Relevance information is required -- or is “guestimated”
• Important indicators of relevance may not be term -- though terms only are usually used
• Optimally requires on-going collection of relevance information
Vector and Probabilistic Models
• Support “natural language” queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
– Vector assumes relevance
– Probabilistic relies on relevance judgments or estimates
Current use of Probabilistic Models
• Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…
w^(1) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]
Okapi BM25
Σ_{T ∈ Q} w^(1) · ( (k_1 + 1) · tf / (K + tf) ) · ( (k_3 + 1) · qtf / (k_3 + qtf) )

• Where:
– Q is a query containing terms T
– K is k_1 · ((1 - b) + b · dl / avdl)
– k_1, b and k_3 are parameters, usually 1.2, 0.75 and 7-1000
– tf is the frequency of the term in a specific document
– qtf is the frequency of the term in a topic from which Q was derived
– dl and avdl are the document length and the average document length measured in some convenient unit
– w^(1) is the Robertson-Sparck Jones weight
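A sketch of the BM25 score as defined above (function names and the example w(1), tf, and length values are invented for illustration):

```python
def bm25_term(w1, tf, qtf, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """One term's contribution: w(1) * (k1+1)tf/(K+tf) * (k3+1)qtf/(k3+qtf),
    where K = k1 * ((1-b) + b*dl/avdl)."""
    K = k1 * ((1 - b) + b * dl / avdl)
    return w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))

def bm25_score(query_terms, doc_tf, dl, avdl, weights, qtf):
    """Sum the contributions over the terms T in query Q."""
    return sum(bm25_term(weights[t], doc_tf.get(t, 0), qtf[t], dl, avdl)
               for t in query_terms)

score = bm25_score(["star"], {"star": 3}, dl=120, avdl=100,
                   weights={"star": 2.1}, qtf={"star": 1})
print(round(score, 3))
```

The tf factor saturates as tf grows, and K penalizes documents longer than average, which is what distinguishes BM25 from a plain tf x idf sum.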
Logistic Regression and Cheshire II
• The Cheshire II system uses Logistic Regression equations estimated from TREC full-text data.
• Demo (?)