Introduction to Information Retrieval
http://informationretrieval.org
IIR 1: Boolean Retrieval
Hinrich Schütze
Institute for Natural Language Processing, Universität Stuttgart
2008.04.22
Definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Boolean retrieval
Queries are Boolean expressions, e.g., Caesar and Brutus
The search engine returns all documents that satisfy the Boolean expression.
Does Google use the Boolean model?
Outline
1 Introduction
2 Inverted index
3 Processing Boolean queries
4 Course overview
Unstructured data in 1650
Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Why is grep not the solution?
Slow (for large collections)
“not Calpurnia” is non-trivial
Other operations (e.g., find the word Romans near countryman) not feasible
Ranked retrieval (best documents to return): focus of later lectures, but not this one
Term-document incidence matrix
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     1         1        0         0        0         1
Brutus      1         1        0         1        0         0
Caesar      1         1        0         1        1         1
Calpurnia   0         1        0         0        0         0
Cleopatra   1         0        0         0        0         0
mercy       1         0        1         1        1         1
worser      1         0        1         1        1         0
. . .

Entry is 1 if the term occurs. Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if the term does not occur. Example: Calpurnia does not occur in The Tempest.
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query Brutus and Caesar and not Calpurnia:
Take the vectors for Brutus, Caesar, and Calpurnia
Complement the vector of Calpurnia
Do a (bitwise) and on the three vectors:
110100 and 110111 and 101111 = 100100
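To make this concrete, here is a minimal sketch in Python (not part of the original slides), encoding each term’s row of the incidence matrix as an integer bit vector:

```python
# Bit order (left to right): Anthony and Cleopatra, Julius Caesar,
# The Tempest, Hamlet, Othello, Macbeth -- as in the matrix above.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
all_plays = 0b111111  # mask so that NOT stays within six bits

result = brutus & caesar & (~calpurnia & all_plays)
print(format(result, "06b"))  # -> 100100: Anthony and Cleopatra, Hamlet
```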
0/1 vector for Brutus
The row for Brutus in the term-document incidence matrix above is its 0/1 vector:

Brutus   1   1   0   1   0   0
Bigger collections
Consider N = 10^6 documents, each with about 1000 tokens
On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 GB
Assume there are M = 500,000 distinct terms in the collection
(Notice that we are making a term/token distinction.)
Can’t build the incidence matrix
The matrix has M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
But the matrix has no more than one billion 1s.
Matrix is extremely sparse.
What is a better representation?
We only record the 1s.
Inverted Index
For each term t, we store a list of all documents that contain t.
Brutus −→ 1 2 4 11 31 45 173 174
Caesar −→ 1 2 4 5 6 16 57 132 . . .
Calpurnia −→ 2 31 54 101
...
(dictionary: the terms on the left; postings: the lists of docIDs on the right)
Inverted index construction
1 Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2 Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .
3 Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
friend roman countryman so . . .
4 Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings (sketched below).
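A toy Python rendering of these four steps (illustrative only: the tokenization and normalization here are crude stand-ins for the linguistic preprocessing discussed in a later lecture):

```python
from collections import defaultdict

docs = {
    1: "Friends, Romans, countrymen. So let it be with Caesar",
    2: "So let it be",
}

def tokenize(text):
    return text.replace(",", " ").replace(".", " ").split()

index = defaultdict(list)                 # term -> postings list
for doc_id in sorted(docs):               # sorted so postings stay sorted
    seen = set()
    for token in tokenize(docs[doc_id]):
        term = token.lower()              # crude normalization
        if term not in seen:              # one posting per (term, doc)
            index[term].append(doc_id)
            seen.add(term)

print(index["so"])      # -> [1, 2]
print(index["caesar"])  # -> [1]
```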
Outline
1 Introduction
2 Inverted index
3 Processing Boolean queries
4 Course overview
Simple conjunctive query (two terms)
Consider the query: Brutus AND Calpurnia
To find all matching documents using the inverted index:
1 Locate Brutus in the dictionary
2 Retrieve its postings list from the postings file
3 Locate Calpurnia in the dictionary
4 Retrieve its postings list from the postings file
5 Intersect the two postings lists (see the sketch below)
6 Return the intersection to the user
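Step 5 runs in one linear pass if both postings lists are sorted by docID. A Python sketch of the standard merge (the same procedure the walkthrough below traces by hand):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # -> [2, 31]
```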
Intersecting two postings lists

Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia −→ 2 → 31 → 54 → 101
Intersection =⇒ 2 → 31
IIR 2: The term vocabulary and postings lists
Recall basic intersection algorithm
Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia −→ 2 → 31 → 54 → 101
Intersection =⇒ 2 → 31
Can we do better?
Skip lists
[Figure: a postings list augmented with skip pointers, which let the intersection skip ahead over runs of docIDs.]
Outline
1 Recap
2 The term vocabulary
3 Skip pointers
4 Phrase queries
Definitions
Word – A delimited string of characters as it appears in the text.
Term – A “normalized” word (case, morphology, spelling, etc.); an equivalence class of words.
Token – An instance of a word or term occurring in a document.
Type – The same as a term in most cases: an equivalence class of tokens.
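A small illustration of the token/type distinction (a sketch; whitespace tokenization and lowercasing stand in for normalization):

```python
text = "To be or not to be"
tokens = text.split()                 # instances: 6 tokens
types = {t.lower() for t in tokens}   # equivalence classes: 4 types

print(len(tokens), sorted(types))  # -> 6 ['be', 'not', 'or', 'to']
```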
Recall: Inverted index construction
Input:
Friends, Romans, countrymen. So let it be with Caesar . . .
Output:
friend roman countryman so . . .
Each token is a candidate for a postings entry.
What are valid tokens to emit?
Stop words
stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
Stop word elimination used to be standard in older IR systems.
But you need stop words for phrase queries, e.g., “King of Denmark”
Most web search engines index stop words.
Lemmatization
Reduce inflectional/variant forms to base form
Example: am, are, is → be
Example: car, cars, car’s, cars’ → car
Example: the boy’s cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form (the lemma).
Inflectional morphology (cutting → cut) vs. derivational morphology (destruction → destroy)
Stemming
Definition of stemming: crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge.
Language dependent
Often inflectional and derivational
Example for derivational: automate, automatic, automation all reduce to automat
Porter stemmer: A few rules
Rule          Example
SSES → SS     caresses → caress
IES  → I      ponies   → poni
SS   → SS     caress   → caress
S    →        cats     → cat
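A sketch of just these four rules in Python (Porter’s full stemmer has several more steps and conditions; rules are tried longest suffix first, and only the first match fires):

```python
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```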
Introduction to Information Retrieval
http://informationretrieval.org
IIR 6: Scoring, Term Weighting, The Vector Space Model
Hinrich Schütze
Institute for Natural Language Processing, Universität Stuttgart
2008.05.20
Outline
1 Recap
2 Term frequency
3 tf-idf weighting
4 The vector space
Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users.
Most users are not capable of writing Boolean queries (or they are, but they think it’s too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Problem with Boolean search: Feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: “standard user dlink 650” → 200,000 hits
Query 2: “standard user dlink 650 no card found”: 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
With a ranked list of documents it does not matter how large the retrieved set is.
Scoring as the basis of ranked retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score – say in [0, 1] – to each document
This score measures how well document and query “match”.
Query-document matching scores
We need a way of assigning a score to a query/document pair.
Let’s start with a one-term query.
If the query term does not occur in the document: score should be 0.
The more frequent the query term in the document, the higher the score.
Recall: Binary incidence matrix
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     1         1        0         0        0         1
Brutus      1         1        0         1        0         0
Caesar      1         1        0         1        1         1
Calpurnia   0         1        0         0        0         0
Cleopatra   1         0        0         0        0         0
mercy       1         0        1         1        1         1
worser      1         0        1         1        1         0
. . .
Each document is represented by a binary vector ∈ {0, 1}^|V|.
From now on, we will use the frequencies of terms
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     157       73       0         0        0         1
Brutus      4         157      0         2        0         0
Caesar      232       227      0         2        1         0
Calpurnia   0         10       0         0        0         0
Cleopatra   57        0        0         0        0         0
mercy       2         0        3         8        5         8
worser      2         0        1         1        1         5
. . .
Each document is represented by a count vector ∈ N^|V|.
Bag of words model
We do not consider the order of words in a document.
John is quicker than Mary and Mary is quicker than John are represented the same way.
This is called a bag of words model.
Term frequency tf
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores.
But how?
Raw term frequency is not what we want.
A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.
Log frequency weighting
The log frequency weight of term t in d is defined as follows:

w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q and d:

\text{matching-score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})
The score is 0 if none of the query terms is present in the document.
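As a sketch in Python (function names are illustrative, not from the lecture):

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def matching_score(query_terms, doc_tf):
    """Sum log-tf weights over terms occurring in both query and document.

    doc_tf maps each term to its raw frequency in the document.
    """
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

doc_tf = {"brutus": 10, "caesar": 2}
print(matching_score(["brutus", "caesar", "calpurnia"], doc_tf))
# -> (1 + 1) + (1 + 0.30) + 0 ≈ 3.30
```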
Outline
1 Recap
2 Term frequency
3 tf-idf weighting
4 The vector space
Document frequency
Rare terms are more informative than frequent terms.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant.
→ We want a high weight for rare terms like arachnocentric.
Consider a term in the query that is frequent in the collection (e.g., high, increase, line).
A document containing this term is more likely to be relevant than a document that doesn’t, but it’s not a sure indicator of relevance.
→ For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
We will use document frequency to factor this into computing the matching score.
The document frequency is the number of documents in the collection that the term occurs in.
idf weight
df_t is the document frequency, the number of documents that t occurs in.
df_t is an inverse measure of the informativeness of the term.
We define the idf weight of term t as follows (N is the total number of documents):

\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}

idf is a measure of the informativeness of the term.
Examples for idf
Compute idf_t using the formula \mathrm{idf}_t = \log_{10} \frac{1{,}000{,}000}{\mathrm{df}_t}:

term          df_t        idf_t
calpurnia             1       6
animal              100       4
sunday            1,000       3
fly              10,000       2
under           100,000       1
the           1,000,000       0
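The table can be reproduced in a few lines (a sketch, assuming N = 1,000,000 as in the example):

```python
import math

N = 1_000_000  # collection size used in the example
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} {math.log10(N / df):.0f}")
# calpurnia 6, animal 4, sunday 3, fly 2, under 1, the 0
```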
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}
Best known weighting scheme in information retrieval
Note: the “-” in tf-idf is a hyphen, not a minus sign!
Summary: tf-idf
Assign a tf-idf weight for each term t in each document d:

w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}

N: total number of documents
The weight increases with the number of occurrences within a document.
It increases with the rarity of the term in the collection.
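As a sketch, the same weight as a Python function (the zero case for a term absent from the document follows the log-tf definition above):

```python
import math

def tfidf_weight(tf, df, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in d, and in 1,000 of 10^6 documents:
print(tfidf_weight(tf=10, df=1000, N=1_000_000))  # -> 2 * 3 = 6.0
```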
Outline
1 Recap
2 Term frequency
3 tf-idf weighting
4 The vector space
Binary → count → weight matrix
            Anthony   Julius   The       Hamlet   Othello   Macbeth   . . .
            and       Caesar   Tempest
            Cleopatra
Anthony     5.25      3.18     0.0       0.0      0.0       0.35
Brutus      1.21      6.10     0.0       1.0      0.0       0.0
Caesar      8.59      2.54     0.0       1.51     0.25      0.0
Calpurnia   0.0       1.54     0.0       0.0      0.0       0.0
Cleopatra   2.85      0.0      0.0       0.0      0.0       0.0
mercy       1.51      0.0      1.90      0.12     5.25      0.88
worser      1.37      0.0      0.11      4.15     0.25      1.95
. . .

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
Documents as vectors
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
So we have a |V|-dimensional real-valued vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
This is a very sparse vector: most entries are zero.
Queries as vectors
Key idea 1: do the same for queries: represent them as vectors in the space
Key idea 2: rank documents according to their proximity to the query
How do we formalize vector space similarity?
First cut: distance between two points
( = distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea . . .
. . . because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea
[Figure: query q and documents d1, d2, d3 plotted in a plane with axes gossip and jealous.]
The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance
Rank documents according to angle with query
Thought experiment: take a document d and append it to itself. Call this document d′.
“Semantically” d and d′ have the same content.
The angle between the two documents is 0, corresponding to maximal similarity.
The Euclidean distance between the two documents can be quite large.
From angles to cosines
The following two notions are equivalent.
Rank documents according to the angle between query and document in decreasing order.
Rank documents according to cosine(query, document) in increasing order.
Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°].
Length normalization
How do we compute the cosine?
A vector can be (length-) normalized by dividing each of its components by its length; here we use the L2 norm:

\|x\|_2 = \sqrt{\sum_i x_i^2}

This maps vectors onto the unit sphere . . .
. . . since after normalization: \|x\|_2 = \sqrt{\sum_i x_i^2} = 1.0
As a result, longer documents and shorter documents have weights of the same order of magnitude.
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
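A sketch of L2 normalization, also replaying the d/d′ thought experiment on raw counts (vector values are illustrative):

```python
import math

def l2_normalize(vec):
    """Divide each component by the vector's L2 length ||x||_2."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

d = [10, 5, 0, 2]              # raw term counts of a document d
d_prime = [2 * x for x in d]   # appending d to itself doubles every count

print(l2_normalize(d))
print(l2_normalize(d_prime))   # identical: normalization removes length
```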
Cosine similarity between query and document
\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}

q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
|\vec{q}| and |\vec{d}| are the lengths of \vec{q} and \vec{d}.
This is the cosine similarity of \vec{q} and \vec{d} . . . or, equivalently, the cosine of the angle between \vec{q} and \vec{d}.
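A direct rendering of the formula (a sketch assuming two dense weight vectors of equal length):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine_similarity([1, 1, 0], [1, 0, 1]))  # -> 0.5
```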
Cosine similarity illustrated
[Figure: unit-length vectors v(q), v(d1), v(d2), v(d3) in the gossip/jealous plane; θ marks the angle between query and document vectors.]
Cosine: Example
How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

term frequencies (counts)
term         SaS    PaP    WH
affection    115     58    20
jealous       10      7    11
gossip         2      0     6
wuthering      0      0    38
log frequency weighting
term         SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.0    1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58
(To simplify this example, we don’t do idf weighting.)
log frequency weighting & cosine normalization
term          SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0.0     0.405
wuthering   0.0     0.0     0.588
cos(SaS,PaP) ≈ 0.789 · 0.832 + 0.515 · 0.555 + 0.335 · 0.0 + 0.0 · 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
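The slide’s numbers can be reproduced with a short script (a sketch that recomputes the normalized vectors from the raw counts above):

```python
import math

counts = {  # raw term frequencies from the table above
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

vecs = {name: normalize([log_tf(tf[t]) for t in terms])
        for name, tf in counts.items()}

def cos(a, b):  # dot product of unit vectors = cosine similarity
    return sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(round(cos("SaS", "PaP"), 2),  # -> 0.94
      round(cos("SaS", "WH"), 2),   # -> 0.79
      round(cos("PaP", "WH"), 2))   # -> 0.69
```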
Summary: Ranked retrieval in the vector space model
Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity between the query vector and each document vector
Rank documents with respect to the query
Return the top K (e.g., K = 10) to the user
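Finally, a compact, self-contained sketch tying the five steps together on a toy collection (the documents, query, and all names are illustrative; real systems add the tokenization, normalization, and index structures discussed earlier):

```python
import math
from collections import Counter

docs = {1: "caesar brutus caesar", 2: "brutus calpurnia", 3: "caesar rome"}
N = len(docs)
# document frequency of each term in the collection
df = Counter(t for text in docs.values() for t in set(text.split()))

def tfidf_vector(text):
    """Map each term to its tf-idf weight; unseen terms are skipped."""
    tf = Counter(text.split())
    return {t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
            for t in tf if df.get(t)}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_vecs = {i: tfidf_vector(text) for i, text in docs.items()}
query_vec = tfidf_vector("brutus caesar")

K = 2
ranking = sorted(doc_vecs, key=lambda i: cosine(query_vec, doc_vecs[i]),
                 reverse=True)
print(ranking[:K])  # docIDs of the top-K documents, best first
```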