Retrieval models {week 13}



The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.

from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

Retrieval models (i)

A retrieval model is a formal (mathematical) representation of the process of matching a query and a document

Forms the basis of ranking results

[Figure: user query terms matched against a collection of documents (doc 123, doc 234, doc 257, doc 345, doc 455, doc 567, doc 678, doc 789, doc 881, doc 913, doc 972)]

Retrieval models (ii)

Goal: retrieve exactly the documents that users want (whether they know it or not!)

A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance)

A good retrieval model also often considers topical relevance

Topical relevance

Given a query, topical relevance identifies documents judged to be on the same topic, even though keyword-based document scores might show a lack of relevance!

[Figure: topics related to the query "Abraham Lincoln" – Abraham Lincoln, Civil War, Tall Guys with Beards, Stovepipe Hats, U.S. Presidents]

User relevance

User relevance is difficult to quantify because of each user’s subjectivity

Humans often have difficulty explaining why one document is more relevant than another

Humans may disagree about a given document’s relevance in relation to the same query

Boolean retrieval model (i)

In the Boolean retrieval model, there are exactly two possible outcomes for query processing:

TRUE (an exact match of the query specification)

FALSE (otherwise)

Ranking is nonexistent: each matching document has a score of 1
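The Boolean model described above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's implementation; the toy corpus and document IDs are made up:

```python
# Minimal sketch of Boolean (exact-match) retrieval over a toy corpus.
# Document texts and IDs below are illustrative only.
docs = {
    234: "tropical fish live in warm water aquariums",
    345: "freshwater fish tanks and aquarium supplies",
    455: "the civil war and president abraham lincoln",
}

def boolean_and(query_terms, docs):
    """Return {doc_id: 1} for every document containing ALL query terms."""
    results = {}
    for doc_id, text in docs.items():
        words = set(text.split())
        if all(term in words for term in query_terms):
            results[doc_id] = 1  # no ranking: every match scores exactly 1
    return results

print(boolean_and(["tropical", "fish"], docs))  # {234: 1}
```

Note that every matching document gets the same score of 1, so the results carry no ranking information.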

Boolean retrieval model (ii)

Often the goal is to reduce the number of search results down to a manageable size; this is typically called searching by numbers

Given a small enough set of results, human users can continue their search manually

Still a useful strategy, but the “best” results may be omitted

Boolean retrieval model (iii)

Advantages:

Results are predictable and explainable

Efficient and easy to implement

Disadvantages:

Query results are essentially unranked (instead ordered by date or title)

Effectiveness of query results depends entirely on the user’s ability to formulate the query

Vector space model (i)

The vector space model is a decades-old IR approach for implementing term weighting and document ranking

Documents are represented as vectors Di in a t-dimensional vector space

Each element dij represents the weight of term j in document i

t is the number of index terms

Vector space model (ii)

Given n documents, we can use a matrix to represent all term weights:
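The matrix itself was an image on the original slide; a reconstruction consistent with the definitions above (rows are documents, columns are index terms) is:

```latex
\begin{pmatrix}
d_{11} & d_{12} & \cdots & d_{1t} \\
d_{21} & d_{22} & \cdots & d_{2t} \\
\vdots & \vdots & \ddots & \vdots \\
d_{n1} & d_{n2} & \cdots & d_{nt}
\end{pmatrix}
```

Row i of the matrix is the document vector Di = (di1, di2, ..., dit).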

Vector space model (iii)

[Example table: in the simplest case, the term weights are the term counts in each document]

Vector space model (iv)

Query Q is represented by a t-dimensional vector of weights: Q = (q1, q2, ..., qt)

Each qj is the weight of term j in the query

Vector space model (v)

Given the query “tropical fish,” query vector Qa is below:

Qa = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1)
Qb = (1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0)
Qc = (0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0)

What do query vectors Qb and Qc represent?
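Building such a binary query vector is straightforward to sketch. The 11-term vocabulary below is hypothetical (the slide does not show its actual term ordering), chosen alphabetically so that "fish" lands in position 4 and "tropical" in position 11:

```python
# Sketch: map a query onto a binary vector over an index vocabulary.
# This alphabetical 11-term vocabulary is an assumption for illustration.
vocabulary = ["aquarium", "bowl", "care", "fish", "freshwater",
              "goldfish", "homepage", "keep", "setup", "tank", "tropical"]

def query_vector(query, vocabulary):
    """Weight 1 for each vocabulary term present in the query, else 0."""
    terms = set(query.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]

print(query_vector("tropical fish", vocabulary))
# -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1], the same pattern as Qa
```

With any fixed vocabulary ordering, every query (and every document) becomes a point in the same t-dimensional space.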

Vector space model (vi)

Conceptually, the document vector closest to the query vector is the most relevant

In reality, the distance function is not a good measure of relevance

Use a similarity measure instead (and maximize)

First, think normalization

Cosine correlation (i)

The cosine correlation measures the cosine of the angle between query and document vectors

Normalize vectors such that all documents and queries are of equal length

Cosine correlation (ii)

The cosine function is shown in blue below:

http://en.wikipedia.org/wiki/File:Sine_cosine_one_period.svg

Cosine correlation (iii)

Given document Di and query Q, the cosine measure is given by:

normalization occurs in the denominator
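The formula itself was an image on the original slide; the standard cosine measure, consistent with the vector definitions above, is:

```latex
\mathrm{Cosine}(D_i, Q) =
  \frac{\sum_{j=1}^{t} d_{ij} \, q_j}
       {\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} q_{j}^{2}}}
```

The denominator is the product of the two vector lengths, which is where the normalization happens: scaling a vector by any positive constant leaves its cosine score unchanged.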

Term weighting (i)

Term weighting is often based on tf.idf: the term frequency (tf) quantifies the importance of a term in a document

▪ tfik is the term frequency weight of term k in document Di
▪ fik is the number of occurrences of term k in Di
▪ the normalizing denominator is the word count (of words considered) in document Di
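The tf formula was an image on the original slide; reconstructed from the legend above (the raw count normalized by document length), it reads:

```latex
tf_{ik} = \frac{f_{ik}}{\sum_{m} f_{im}}
```

where the denominator sums the counts of all (considered) words in document Di.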

Term weighting (ii)

Term weighting is often based on tf.idf: the inverse document frequency (idf) quantifies the importance of a term within the entire collection of documents

▪ idfk is the inverse document frequency weight for term k
▪ N is the number of documents in the collection
▪ nk is the number of documents in which term k occurs
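The idf formula was an image on the original slide; the standard form matching the legend above is:

```latex
idf_k = \log \frac{N}{n_k}
```

Rare terms (small nk) receive large weights, while a term that occurs in every document gets weight 0.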

Term weighting (iii)

Obtain term weights by multiplying term frequency and inverse document frequency values together

Perform this calculation for each term

As new/updated documents are processed, the algorithm must recalculate idf
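The whole pipeline (tf times idf, then cosine ranking) can be sketched end to end. This is a minimal illustration under assumed conventions (natural log, length-normalized tf); the toy documents are made up:

```python
# Sketch: tf.idf term weighting plus cosine ranking over a toy corpus.
# Documents, vocabulary, and the choice of natural log are assumptions.
import math

docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish tank setup and care",
    3: "freshwater aquarium fish and tank supplies",
}

vocabulary = sorted({w for text in docs.values() for w in text.split()})
N = len(docs)
# idf_k = log(N / n_k), where n_k counts documents containing term k
idf = {term: math.log(N / sum(term in text.split() for text in docs.values()))
       for term in vocabulary}

def tfidf_vector(text, vocabulary, idf):
    """tf (count / doc length) times idf, for each vocabulary term."""
    words = text.split()
    return [(words.count(term) / len(words)) * idf[term]
            for term in vocabulary]

def cosine(d, q):
    """Cosine of the angle between vectors d and q (0.0 if either is zero)."""
    num = sum(dj * qj for dj, qj in zip(d, q))
    den = math.sqrt(sum(dj * dj for dj in d) * sum(qj * qj for qj in q))
    return num / den if den else 0.0

query_vec = tfidf_vector("tropical fish", vocabulary, idf)
ranked = sorted(docs,
                key=lambda i: cosine(tfidf_vector(docs[i], vocabulary, idf),
                                     query_vec),
                reverse=True)
print(ranked)  # doc 1 ranks first for "tropical fish"
```

Note that "fish" occurs in all three documents, so its idf (and hence its weight) is 0 everywhere; the ranking is driven entirely by the rarer term "tropical". Adding or updating a document changes N and the nk counts, which is why idf must be recalculated.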

What next?

Read and study Chapter 7

Do Exercises 7.1, 7.2, 7.3, and 7.4
