Introduction to Information Retrieval
http://informationretrieval.org
IIR 1: Boolean Retrieval
Hinrich Schütze
Institute for Natural Language Processing, Universität Stuttgart
2008.04.22
Definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Boolean retrieval
Queries are Boolean expressions, e.g., Caesar and Brutus
The search engine returns all documents that satisfy the Boolean expression.
Does Google use the Boolean model?
Outline
1 Introduction
2 Inverted index
3 Processing Boolean queries
4 Course overview
Unstructured data in 1650
Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Why is grep not the solution?
Slow (for large collections)
“not Calpurnia” is non-trivial
Other operations (e.g., find the word Romans near countryman) not feasible
Ranked retrieval (best documents to return): focus of later lectures, but not this one
Term-document incidence matrix
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     1         1        0         0        0         1
Brutus      1         1        0         1        0         0
Caesar      1         1        0         1        1         1
Calpurnia   0         1        0         0        0         0
Cleopatra   1         0        0         0        0         0
mercy       1         0        1         1        1         1
worser      1         0        1         1        1         0
. . .

Entry is 1 if the term occurs. Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if the term does not occur. Example: Calpurnia does not occur in The Tempest.
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query Brutus and Caesar and not Calpurnia:
Take the vectors for Brutus, Caesar, and Calpurnia
Complement the vector of Calpurnia
Do a (bitwise) and on the three vectors:
110100 and 110111 and 101111 = 100100
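To make this concrete, here is a minimal sketch in Python (not part of the original slides), encoding each term’s row of the incidence matrix as an integer bit vector:

```python
# Bit order (left to right): Anthony and Cleopatra, Julius Caesar,
# The Tempest, Hamlet, Othello, Macbeth -- as in the matrix above.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
all_plays = 0b111111  # mask so that NOT stays within six bits

result = brutus & caesar & (~calpurnia & all_plays)
print(format(result, "06b"))  # -> 100100: Anthony and Cleopatra, Hamlet
```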
0/1 vector for Brutus
The row for Brutus in the term-document incidence matrix above is its 0/1 vector:

Brutus   1   1   0   1   0   0
Bigger collections
Consider N = 10^6 documents, each with about 1000 tokens
On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 GB
Assume there are M = 500,000 distinct terms in the collection
(Notice that we are making a term/token distinction.)
Can’t build the incidence matrix
The matrix has M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
But the matrix has no more than one billion 1s.
Matrix is extremely sparse.
What is a better representation?
We only record the 1s.
Inverted Index
For each term t, we store a list of all documents that contain t.
Brutus −→ 1 2 4 11 31 45 173 174
Caesar −→ 1 2 4 5 6 16 57 132 . . .
Calpurnia −→ 2 31 54 101
...
(dictionary: the terms on the left; postings: the lists of docIDs on the right)
Inverted index construction
1 Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2 Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .
3 Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
friend roman countryman so . . .
4 Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings (sketched below).
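A toy Python rendering of these four steps (illustrative only: the tokenization and normalization here are crude stand-ins for the linguistic preprocessing discussed in a later lecture):

```python
from collections import defaultdict

docs = {
    1: "Friends, Romans, countrymen. So let it be with Caesar",
    2: "So let it be",
}

def tokenize(text):
    return text.replace(",", " ").replace(".", " ").split()

index = defaultdict(list)                 # term -> postings list
for doc_id in sorted(docs):               # sorted so postings stay sorted
    seen = set()
    for token in tokenize(docs[doc_id]):
        term = token.lower()              # crude normalization
        if term not in seen:              # one posting per (term, doc)
            index[term].append(doc_id)
            seen.add(term)

print(index["so"])      # -> [1, 2]
print(index["caesar"])  # -> [1]
```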
Outline
1 Introduction
2 Inverted index
3 Processing Boolean queries
4 Course overview
Simple conjunctive query (two terms)
Consider the query: Brutus AND Calpurnia
To find all matching documents using the inverted index:
1 Locate Brutus in the dictionary
2 Retrieve its postings list from the postings file
3 Locate Calpurnia in the dictionary
4 Retrieve its postings list from the postings file
5 Intersect the two postings lists (see the sketch below)
6 Return the intersection to the user
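Step 5 runs in one linear pass if both postings lists are sorted by docID. A Python sketch of the standard merge (the same procedure the walkthrough below traces by hand):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # -> [2, 31]
```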
Intersecting two postings lists

Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia −→ 2 → 31 → 54 → 101
Intersection =⇒ 2 → 31
IIR 2: The term vocabulary and postings lists
Recall basic intersection algorithm
Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia −→ 2 → 31 → 54 → 101
Intersection =⇒ 2 → 31
Can we do better?
Skip lists
[Figure: a postings list augmented with skip pointers, which let the intersection skip ahead over runs of docIDs.]
Outline
1 Recap
2 The term vocabulary
3 Skip pointers
4 Phrase queries
Definitions
Word – A delimited string of characters as it appears in the text.
Term – A “normalized” word (case, morphology, spelling, etc.); an equivalence class of words.
Token – An instance of a word or term occurring in a document.
Type – The same as a term in most cases: an equivalence class of tokens.
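A small illustration of the token/type distinction (a sketch; whitespace tokenization and lowercasing stand in for normalization):

```python
text = "To be or not to be"
tokens = text.split()                 # instances: 6 tokens
types = {t.lower() for t in tokens}   # equivalence classes: 4 types

print(len(tokens), sorted(types))  # -> 6 ['be', 'not', 'or', 'to']
```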
Recall: Inverted index construction
Input:
Friends, Romans, countrymen. So let it be with Caesar . . .
Output:
friend roman countryman so . . .
Each token is a candidate for a postings entry.
What are valid tokens to emit?
Stop words
stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
Stop word elimination used to be standard in older IR systems.
But you need stop words for phrase queries, e.g., “King of Denmark”
Most web search engines index stop words.
Lemmatization
Reduce inflectional/variant forms to base form
Example: am, are, is → be
Example: car, cars, car’s, cars’ → car
Example: the boy’s cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form (the lemma).
Inflectional morphology (cutting → cut) vs. derivational morphology (destruction → destroy)
Stemming
Definition of stemming: crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge.
Language dependent
Often inflectional and derivational
Example for derivational: automate, automatic, automation all reduce to automat
Porter stemmer: A few rules
Rule          Example
SSES → SS     caresses → caress
IES  → I      ponies   → poni
SS   → SS     caress   → caress
S    →        cats     → cat
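A sketch of just these four rules in Python (Porter’s full stemmer has several more steps and conditions; rules are tried longest suffix first, and only the first match fires):

```python
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```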
Introduction to Information Retrieval
http://informationretrieval.org
IIR 6: Scoring, Term Weighting, The Vector Space Model
Hinrich Schütze
Institute for Natural Language Processing, Universität Stuttgart
2008.05.20
Outline
1 Recap
2 Term frequency
3 tf-idf weighting
4 The vector space
Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users.
Most users are not capable of writing Boolean queries (or they are, but they think it’s too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Problem with Boolean search: Feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: “standard user dlink 650” → 200,000 hits
Query 2: “standard user dlink 650 no card found”: 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
With a ranked list of documents it does not matter how large the retrieved set is.
Scoring as the basis of ranked retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score – say in [0, 1] – to each document
This score measures how well document and query “match”.
Query-document matching scores
We need a way of assigning a score to a query/document pair.
Let’s start with a one-term query.
If the query term does not occur in the document: score should be 0.
The more frequent the query term in the document, the higher the score.
Recall: Binary incidence matrix
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     1         1        0         0        0         1
Brutus      1         1        0         1        0         0
Caesar      1         1        0         1        1         1
Calpurnia   0         1        0         0        0         0
Cleopatra   1         0        0         0        0         0
mercy       1         0        1         1        1         1
worser      1         0        1         1        1         0
. . .
Each document is represented by a binary vector ∈ {0, 1}^|V|.
From now on, we will use the frequencies of terms
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     157       73       0         0        0         1
Brutus      4         157      0         2        0         0
Caesar      232       227      0         2        1         0
Calpurnia   0         10       0         0        0         0
Cleopatra   57        0        0         0        0         0
mercy       2         0        3         8        5         8
worser      2         0        1         1        1         5
. . .
Each document is represented by a count vector ∈ N^|V|.
Bag of words model
We do not consider the order of words in a document.
John is quicker than Mary and Mary is quicker than John are represented the same way.
This is called a bag of words model.
Term frequency tf
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores.
But how?
Raw term frequency is not what we want.
A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.
Log frequency weighting
The log frequency weight of term t in d is defined as follows:

w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q and d:

\text{matching-score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})
The score is 0 if none of the query terms is present in the document.
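As a sketch in Python (function names are illustrative, not from the lecture):

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def matching_score(query_terms, doc_tf):
    """Sum log-tf weights over terms occurring in both query and document.

    doc_tf maps each term to its raw frequency in the document.
    """
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

doc_tf = {"brutus": 10, "caesar": 2}
print(matching_score(["brutus", "caesar", "calpurnia"], doc_tf))
# -> (1 + 1) + (1 + 0.30) + 0 ≈ 3.30
```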
Outline
1 Recap
2 Term frequency
3 tf-idf weighting
4 The vector space
Document frequency
Rare terms are more informative than frequent terms.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant.
→ We want a high weight for rare terms like arachnocentric.
Consider a term in the query that is frequent in the collection (e.g., high, increase, line).
A document containing this term is more likely to be relevant than a document that doesn’t, but it’s not a sure indicator of relevance.
→ For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
We will use document frequency to factor this into computing the matching score.
The document frequency is the number of documents in the collection that the term occurs in.
idf weight
df_t is the document frequency, the number of documents that t occurs in.
df_t is an inverse measure of the informativeness of the term.
We define the idf weight of term t as follows (N is the total number of documents):

\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}

idf is a measure of the informativeness of the term.
Examples for idf
Compute idf_t using the formula \mathrm{idf}_t = \log_{10} \frac{1{,}000{,}000}{\mathrm{df}_t}:

term          df_t        idf_t
calpurnia             1       6
animal              100       4
sunday            1,000       3
fly              10,000       2
under           100,000       1
the           1,000,000       0
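The table can be reproduced in a few lines (a sketch, assuming N = 1,000,000 as in the example):

```python
import math

N = 1_000_000  # collection size used in the example
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} {math.log10(N / df):.0f}")
# calpurnia 6, animal 4, sunday 3, fly 2, under 1, the 0
```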
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}
Best known weighting scheme in information retrieval
Note: the “-” in tf-idf is a hyphen, not a minus sign!
Summary: tf-idf
Assign a tf-idf weight for each term t in each document d:

w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}

N: total number of documents
The weight increases with the number of occurrences within a document.
It increases with the rarity of the term in the collection.
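As a sketch, the same weight as a Python function (the zero case for a term absent from the document follows the log-tf definition above):

```python
import math

def tfidf_weight(tf, df, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in d, and in 1,000 of 10^6 documents:
print(tfidf_weight(tf=10, df=1000, N=1_000_000))  # -> 2 * 3 = 6.0
```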
Outline
1 Recap
2 Term frequency
3 tf-idf weighting
4 The vector space
Binary → count → weight matrix
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     5.25      3.18     0.0       0.0      0.0       0.35
Brutus      1.21      6.10     0.0       1.0      0.0       0.0
Caesar      8.59      2.54     0.0       1.51     0.25      0.0
Calpurnia   0.0       1.54     0.0       0.0      0.0       0.0
Cleopatra   2.85      0.0      0.0       0.0      0.0       0.0
mercy       1.51      0.0      1.90      0.12     5.25      0.88
worser      1.37      0.0      0.11      4.15     0.25      1.95
. . .

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
Documents as vectors
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
So we have a |V|-dimensional real-valued vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
This is a very sparse vector: most entries are zero.
Queries as vectors
Key idea 1: do the same for queries: represent them as vectors in the space
Key idea 2: rank documents according to their proximity to the query
How do we formalize vector space similarity?
First cut: distance between two points
( = distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea . . .
. . . because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea
[Figure: query q and documents d1, d2, d3 plotted in a plane with axes gossip and jealous.]
The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance
Rank documents according to angle with query
Thought experiment: take a document d and append it to itself. Call this document d′.
“Semantically” d and d′ have the same content.
The angle between the two documents is 0, corresponding to maximal similarity.
The Euclidean distance between the two documents can be quite large.
From angles to cosines
The following two notions are equivalent.
Rank documents according to the angle between query and document in decreasing order.
Rank documents according to cosine(query, document) in increasing order.
Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°].
Length normalization
How do we compute the cosine?
A vector can be (length-) normalized by dividing each of its components by its length; here we use the L2 norm:

\|x\|_2 = \sqrt{\sum_i x_i^2}

This maps vectors onto the unit sphere . . .
. . . since after normalization: \|x\|_2 = \sqrt{\sum_i x_i^2} = 1.0
As a result, longer documents and shorter documents have weights of the same order of magnitude.
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
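A sketch of L2 normalization, also replaying the d/d′ thought experiment on raw counts (vector values are illustrative):

```python
import math

def l2_normalize(vec):
    """Divide each component by the vector's L2 length ||x||_2."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

d = [10, 5, 0, 2]              # raw term counts of a document d
d_prime = [2 * x for x in d]   # appending d to itself doubles every count

print(l2_normalize(d))
print(l2_normalize(d_prime))   # identical: normalization removes length
```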
Cosine similarity between query and document
\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}

q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
|\vec{q}| and |\vec{d}| are the lengths of \vec{q} and \vec{d}.
This is the cosine similarity of \vec{q} and \vec{d} . . . or, equivalently, the cosine of the angle between \vec{q} and \vec{d}.
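A direct rendering of the formula (a sketch assuming two dense weight vectors of equal length):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine_similarity([1, 1, 0], [1, 0, 1]))  # -> 0.5
```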
Cosine similarity illustrated
[Figure: unit-length vectors v(q), v(d1), v(d2), v(d3) in the gossip/jealous plane; θ marks the angle between query and document vectors.]
Cosine: Example
How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

term frequencies (counts)
term         SaS    PaP    WH
affection    115     58    20
jealous       10      7    11
gossip         2      0     6
wuthering      0      0    38
log frequency weighting
term         SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.0    1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58
(To simplify this example, we don’t do idf weighting.)
log frequency weighting & cosine normalization
term          SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0.0     0.405
wuthering   0.0     0.0     0.588
cos(SaS,PaP) ≈ 0.789 · 0.832 + 0.515 · 0.555 + 0.335 · 0.0 + 0.0 · 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
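The slide’s numbers can be reproduced with a short script (a sketch that recomputes the normalized vectors from the raw counts above):

```python
import math

counts = {  # raw term frequencies from the table above
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

vecs = {name: normalize([log_tf(tf[t]) for t in terms])
        for name, tf in counts.items()}

def cos(a, b):  # dot product of unit vectors = cosine similarity
    return sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(round(cos("SaS", "PaP"), 2),  # -> 0.94
      round(cos("SaS", "WH"), 2),   # -> 0.79
      round(cos("PaP", "WH"), 2))   # -> 0.69
```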
Summary: Ranked retrieval in the vector space model
Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity between the query vector and each document vector
Rank documents with respect to the query
Return the top K (e.g., K = 10) to the user
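Finally, a compact, self-contained sketch tying the five steps together on a toy collection (the documents, query, and all names are illustrative; real systems add the tokenization, normalization, and index structures discussed earlier):

```python
import math
from collections import Counter

docs = {1: "caesar brutus caesar", 2: "brutus calpurnia", 3: "caesar rome"}
N = len(docs)
# document frequency of each term in the collection
df = Counter(t for text in docs.values() for t in set(text.split()))

def tfidf_vector(text):
    """Map each term to its tf-idf weight; unseen terms are skipped."""
    tf = Counter(text.split())
    return {t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
            for t in tf if df.get(t)}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_vecs = {i: tfidf_vector(text) for i, text in docs.items()}
query_vec = tfidf_vector("brutus caesar")

K = 2
ranking = sorted(doc_vecs, key=lambda i: cosine(query_vec, doc_vecs[i]),
                 reverse=True)
print(ranking[:K])  # docIDs of the top-K documents, best first
```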