
CIS 8590 – Fall 2008 NLP

Introduction to Information Retrieval

Slides by me,

The Inverted Index


Indexing

• Indexing is a technique borrowed from databases

• An index is a data structure that supports efficient lookups in a large data set
– E.g., hash indexes, R-trees, B-trees, etc.


Document Retrieval

• In search engines, the lookups have to find all documents that contain query terms.

• What’s the problem with using a tree-based index?

• A hash index?


Inverted Index

An inverted index stores an entry for every word, and a pointer to every document where that word is seen.


Vocabulary    Postings List
Word1         Document17, Document45123
...
WordN         Document991, Document123001
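A minimal sketch of this structure in Python (the build_index helper and the docs dict of docID → text are illustrative, not from the slides):

    from collections import defaultdict

    def build_index(docs):
        """Record-level inverted index: maps each word to the set of documents
        (its postings list) in which the word appears."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index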

Example

Example:
Document D1: “yes we got no bananas”

Document D2: “what you got”

Document D3: “yes I like what you got”


Vocabulary   Postings List
yes          D1, D3
we           D1
got          D1, D2, D3
no           D1
bananas      D1
what         D2, D3
you          D2, D3
I            D3
like         D3

Query “you got”:

“you” → {D2, D3}
“got” → {D1, D2, D3}

Whole query gives the intersection:

“you got” → {D2, D3} ∩ {D1, D2, D3} = {D2, D3}
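The same lookup as a small Python sketch, with the example postings written out as literal sets:

    # Postings lists from the example above, stored as Python sets.
    index = {
        "yes": {"D1", "D3"}, "we": {"D1"}, "got": {"D1", "D2", "D3"},
        "no": {"D1"}, "bananas": {"D1"}, "what": {"D2", "D3"},
        "you": {"D2", "D3"}, "i": {"D3"}, "like": {"D3"},
    }

    def and_query(index, query):
        """AND query: intersect the postings sets of every query term."""
        postings = [index.get(term, set()) for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(and_query(index, "you got"))   # {'D2', 'D3'}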

Variations

• Record-level index stores just document identifiers in the postings list

• Word-level index stores document IDs and offsets for the positions of the words in each document
– Supports phrase-based searches (why? see the sketch below)

Real search engines add all kinds of other information to their postings lists (see below).
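A rough sketch of a word-level (positional) index and the phrase lookup it makes possible; the helper names and example docs dict are just for illustration:

    from collections import defaultdict

    def build_positional_index(docs):
        """Word-level index: word -> {doc_id: [word positions]}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split()):
                index[word][doc_id].append(pos)
        return index

    def phrase_query(index, phrase):
        """A document matches only if the phrase terms occur at consecutive positions."""
        terms = phrase.lower().split()
        if not terms:
            return set()
        hits = set()
        for doc_id in index[terms[0]]:
            for start in index[terms[0]][doc_id]:
                if all(start + i in index[t].get(doc_id, []) for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits

    docs = {"D1": "yes we got no bananas", "D2": "what you got", "D3": "yes I like what you got"}
    print(phrase_query(build_positional_index(docs), "you got"))   # {'D2', 'D3'}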


Index Construction

Algorithm:
1. Scan through each document, word by word
   – Write a (term, docID) pair for each word to a TempIndex file
2. Sort TempIndex by terms
3. Iterate through the sorted TempIndex:
   merge all entries for the same term into one postings list.
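An in-memory sketch of this sort-and-merge construction, assuming a docs dict of docID → text (a real TempIndex would be written to disk):

    from itertools import groupby

    def construct_index(docs):
        # 1. Scan each document and emit one (term, docID) pair per word
        pairs = [(word, doc_id)
                 for doc_id, text in docs.items()
                 for word in text.lower().split()]
        # 2. Sort the pairs by term (and by docID within a term)
        pairs.sort()
        # 3. Merge all pairs for the same term into one postings list
        return {term: sorted({doc_id for _, doc_id in group})
                for term, group in groupby(pairs, key=lambda pair: pair[0])}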


Efficient Index Construction

Problem: Indexes can be huge. How can we efficiently build them?

- Blocked Sort-Based Indexing (BSBI)

- Single-Pass In-Memory Indexing (SPIMI)

What’s the difference?

Ranking Results


Problem: Too many matching results for every query

Using an inverted index is all fine and good, but if your document collection has 10^12 documents and someone searches for “banana”, they’ll get 90 million results.

We need to be able to return the “most relevant” results.

We need to rank the results.


Documents as Vectors


Example:
Document D1: “yes we got no bananas”

Document D2: “what you got”

Document D3: “yes I like what you got”

             yes  we  got  no  bananas  what  you  I  like
Vector V1:    1    1    1   1      1      0     0  0    0
Vector V2:    0    0    1   0      0      1     1  0    0
Vector V3:    1    0    1   0      0      1     1  1    1
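A small sketch of the vector construction, using binary present/absent weights over the fixed vocabulary above:

    def to_vector(text, vocabulary):
        """Binary term vector: 1 if the word occurs in the document, else 0."""
        words = set(text.lower().split())
        return [1 if term in words else 0 for term in vocabulary]

    vocabulary = ["yes", "we", "got", "no", "bananas", "what", "you", "i", "like"]
    v1 = to_vector("yes we got no bananas", vocabulary)       # [1, 1, 1, 1, 1, 0, 0, 0, 0]
    v2 = to_vector("what you got", vocabulary)                 # [0, 0, 1, 0, 0, 1, 1, 0, 0]
    v3 = to_vector("yes I like what you got", vocabulary)      # [1, 0, 1, 0, 0, 1, 1, 1, 1]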

What about queries?

In the vector space model, queries are treated as (very short) documents.

Example query: “bananas”


            yes  we  got  no  bananas  what  you  I  like
Query Q1:    0    0    0   0      1      0     0  0    0

Measuring Similarity

Similarity metric:

the size of the angle between document vectors.

“Cosine Similarity”:


CS(v1, v2) = cos(θ_{v1,v2}) = (v1 · v2) / (|v1| |v2|)
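The same formula as a small Python helper (an illustrative sketch):

    import math

    def cosine_similarity(v1, v2):
        """Cosine of the angle between two vectors: (v1 · v2) / (|v1| |v2|)."""
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0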

Ranking documents


             yes  we  got  no  bananas  what  you  I  like
Query Q1:     0    0    1   0      0      0     1  0    0
Vector V1:    1    1    1   1      1      0     0  0    0
Vector V2:    0    0    1   0      0      1     1  0    0
Vector V3:    1    0    1   0      0      1     1  1    1

CS(q1, v1) = (q1 · v1) / (|q1| |v1|) = 1 / (√2 · √5)

CS(q1, v2) = (q1 · v2) / (|q1| |v2|) = 2 / (√2 · √3)

CS(q1, v3) = (q1 · v3) / (|q1| |v3|) = 2 / (√2 · √6)
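A quick numerical check of these three scores using numpy (the decimal values correspond to the fractions above):

    import numpy as np

    q1 = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0])   # query "you got"
    vectors = {
        "V1": np.array([1, 1, 1, 1, 1, 0, 0, 0, 0]),
        "V2": np.array([0, 0, 1, 0, 0, 1, 1, 0, 0]),
        "V3": np.array([1, 0, 1, 0, 0, 1, 1, 1, 1]),
    }
    for name, v in vectors.items():
        score = (q1 @ v) / (np.linalg.norm(q1) * np.linalg.norm(v))
        print(name, round(float(score), 3))
    # V1 0.316, V2 0.816, V3 0.577  ->  ranking: V2 > V3 > V1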

All words are equal?

The TF-IDF measure weights different words more or less heavily, depending on how informative they are.
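One common formulation is tf × idf with idf = log(N / df); a sketch, not necessarily the exact weighting the slides have in mind:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """Weight term t in document d by tf(t, d) * log(N / df(t)),
        where df(t) is the number of documents containing t."""
        n_docs = len(docs)
        tokenized = {d: text.lower().split() for d, text in docs.items()}
        df = Counter(term for words in tokenized.values() for term in set(words))
        idf = {term: math.log(n_docs / count) for term, count in df.items()}
        return {d: {term: tf * idf[term] for term, tf in Counter(words).items()}
                for d, words in tokenized.items()}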


Compare Document Classification and Document Retrieval/Ranking

• Similarities:

• Differences:


Synonymy


Handling Synonymy in Retrieval

Problem:

Straightforward search for a term may miss the most relevant results, because those documents use a synonym of the term.

Examples:
Search for “Burma” will miss documents containing only “Myanmar”

Search for “document classification” will miss results for “text classification”

Search for “scientists” will miss results for “physicists”, “chemists”, etc.


Two approaches

1. Convert retrieval into a classification or clustering problem

- Relevance Feedback (classification)
- Pseudo-relevance Feedback (clustering)

2. Expand the query to include synonyms or other relevant terms

- Thesaurus-based
- Automatic query expansion


Relevance Feedback

Algorithm:

1. User issues a query q
2. System returns initial results D1
3. User labels some results (relevant or not)
4. System learns a classifier/ranker for relevance
5. System returns new result set D2


Relevance Feedback as Text Classification

• The system gets a set of labeled documents (+ = relevant, - = not relevant)
– This is exactly the input to a standard text classification problem

• Solution: convert labeled documents into vectors, then apply standard learning
– Rocchio, Naïve Bayes, k-NN, SVM, …
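For instance, a sketch of the classic Rocchio update (the alpha/beta/gamma values below are typical defaults, not from the slides):

    import numpy as np

    def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward the centroid of the relevant documents and
        away from the centroid of the non-relevant ones."""
        q = alpha * np.asarray(query_vec, dtype=float)
        if len(relevant):
            q += beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
        if len(nonrelevant):
            q -= gamma * np.mean(np.asarray(nonrelevant, dtype=float), axis=0)
        return np.clip(q, 0.0, None)   # negative term weights are usually dropped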


Details

• In relevance feedback, there are few labeled examples

• Efficiency is a concern – user is waiting online during training and testing

• Output is a ranking, not a binary classification
– But most classifiers can be converted into rankers, e.g., Naïve Bayes can rank according to the probability score, and an SVM can rank according to w^T x + b
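For instance, given a learned weight vector w and bias b (hypothetical values, not a specific library's API), ranking by the linear score is just a sort:

    import numpy as np

    def rank_by_score(doc_vectors, w, b):
        """Rank documents by the linear classifier score w^T x + b (higher = more relevant)."""
        scores = {doc_id: float(np.dot(w, x) + b) for doc_id, x in doc_vectors.items()}
        return sorted(scores, key=scores.get, reverse=True)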


Pseudo Relevance Feedback

IDEA: instead of waiting for the user to provide relevance judgments, just use the top-K documents to represent the + (relevant) class

• It’s a somewhat mind-bending thought, but this actually works in practice.

• Essentially, this is like one iteration of K-means clustering!
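A sketch of that idea: rank once with cosine similarity, pretend the top-k documents are relevant, take one centroid step, and rank again (doc_vectors is an assumed dict of docID → vector; k and beta are illustrative):

    import numpy as np

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0

    def pseudo_relevance_feedback(query_vec, doc_vectors, k=5, beta=0.75):
        q = np.asarray(query_vec, dtype=float)
        ranked = sorted(doc_vectors, key=lambda d: cosine(q, doc_vectors[d]), reverse=True)
        top_k = np.array([doc_vectors[d] for d in ranked[:k]], dtype=float)
        q_new = q + beta * top_k.mean(axis=0)   # one centroid step toward the "relevant" cluster
        return sorted(doc_vectors, key=lambda d: cosine(q_new, doc_vectors[d]), reverse=True)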


Clickstream Mining

(a.k.a. “indirect relevance feedback”)

IDEA: use the clicks that users make as proxies for relevance judgments

For example, if the search engine returns 10 documents for “bananas”, and users consistently click on the third link first, then increase the rank of that document and similar ones.


Query Expansion

IDEA: help users formulate “better” queries

“better” can mean

- More precise, to exclude more unrelated stuff

- More inclusive, to increase recall of documents that wouldn’t match a basic query


Query Term Suggestion

Problem:

Given a base query q, suggest a list of terms T={t1, …, tK} that could help the user refine the query.

One common technique is to suggest terms that frequently “co-occur” with terms already in the base query.


Co-occurrence

Terms t1 and t2 “co-occur” if they occur near each other in the same document.

There are many measures of co-occurrence, including:

PMI, MI, LSI-based scores, and others
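As one example, PMI can be estimated from document counts (a sketch; the count arguments are assumed inputs):

    import math

    def pmi(n_both, n_t1, n_t2, n_docs):
        """PMI(t1, t2) = log( P(t1, t2) / (P(t1) * P(t2)) ), with probabilities
        estimated as document frequencies over the collection."""
        p_both = n_both / n_docs
        p_t1, p_t2 = n_t1 / n_docs, n_t2 / n_docs
        return math.log(p_both / (p_t1 * p_t2)) if p_both > 0 else float("-inf")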


Computing Co-occurrence Example

A_{t,d} =

       d1  d2  d3  d4
  t1    2   0   1   0
  t2    1   0   0   1
  t3    0   1   2   0
  t4    1   2   0   2

Computing Co-occurrence Example

C_{t,t'} = A A^T

A =
       d1  d2  d3  d4
  t1    2   0   1   0
  t2    1   0   0   1
  t3    0   1   2   0
  t4    1   2   0   2

A^T =
       t1  t2  t3  t4
  d1    2   1   0   1
  d2    0   0   1   2
  d3    1   0   2   0
  d4    0   1   0   2
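The same computation in numpy: with A stored terms-by-documents as above, one matrix product gives all the pairwise counts:

    import numpy as np

    # Term-document matrix A (rows = t1..t4, columns = d1..d4), from the slide above.
    A = np.array([[2, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 1, 2, 0],
                  [1, 2, 0, 2]])

    C = A @ A.T   # C[i, j] = sum over documents of A[i, d] * A[j, d]
    print(C)
    # [[5 2 2 2]
    #  [2 2 0 3]
    #  [2 0 5 2]
    #  [2 3 2 9]]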

Query Log Mining

IDEA: use other people’s queries as suggestions for refinements of this query.

Example:

If I type “google” into the search bar, the search engine can suggest follow-up words that other people used, like:

“maps”, “earth”, “translate”, “wave”, …
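A toy sketch of mining follow-up words from a query log (the log contents and helper name are invented for illustration):

    from collections import Counter, defaultdict

    def followup_suggestions(query_log):
        """For each word, count which words other users typed immediately after it."""
        followups = defaultdict(Counter)
        for query in query_log:
            words = query.lower().split()
            for first, nxt in zip(words, words[1:]):
                followups[first][nxt] += 1
        return followups

    log = ["google maps", "google earth", "google maps directions", "google translate"]
    print([w for w, _ in followup_suggestions(log)["google"].most_common(3)])
    # e.g. ['maps', 'earth', 'translate']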
