Information Retrieval Modeling

CS 652 Information Extraction and Integration

2

Introduction

IR systems usually adopt index terms to index documents & process queries

Index term:

a keyword or group of selected/related words

any word (in general)

Problems of IR based on index terms:

oversimplification & loss of semantics

badly-formed queries

Stemming might be used:

connect: connecting, connection, connections

An inverted file is built for the chosen index terms
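A minimal sketch of how such an inverted file could be built; the tokenization, the tiny document collection, and the suffix-stripping stub are illustrative assumptions, not part of the slides:

```python
from collections import defaultdict

def normalize(word):
    # Crude stand-in for a real stemmer: a Porter-style stemmer would conflate
    # "connecting", "connection", "connections" into the index term "connect".
    return word.lower().strip(".,;:!?")

def build_inverted_index(docs):
    """Map each index term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index

# Illustrative toy collection
docs = {1: "Connection established to the network.",
        2: "Network latency measured while connecting."}
index = build_inverted_index(docs)
print(index["network"])   # {1, 2}
```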

3

Introduction

[Diagram: documents are represented by index terms; the user's information need is expressed as a query; retrieval matches the query against the index terms and ranks the matching documents]

4

Introduction

Matching at the index term level is quite imprecise

Consequence: no surprise that users are frequently dissatisfied

Since most users have no training in query formation, the problem is even worse

Result: frequent dissatisfaction of Web users

The issue of deciding relevance is critical for IR systems: ranking

5

Introduction

A ranking is an ordering of the retrieved documents that (hopefully) reflects the relevance of the documents to the user query

A ranking is based on fundamental premises regarding the notion of relevance, such as:

common sets of index terms

sharing of weighted terms

likelihood of relevance

Each set of premises leads to a distinct IR model

6

IR Models

Non-Overlapping ListsProximal Nodes

Structured Models

Retrieval: Adhoc Filtering

Browsing

U s e r

T a s k

Classic Models

Boolean Vector (Space) Probabilistic

Set Theoretic

Fuzzy Extended Boolean

Probabilistic

Inference Network Belief Network

Algebraic

Generalized Vector (Space) Latent Semantic Index Neural Networks

Browsing

Flat Structure Guided Hypertext

7

IR Models

The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system

USER TASK vs. LOGICAL VIEW OF DOCUMENTS:

Retrieval — Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic; Full Text: Classic, Set Theoretic, Algebraic, Probabilistic; Full Text + Structure: Structured

Browsing — Index Terms: Flat; Full Text: Flat, Hypertext; Full Text + Structure: Structure Guided, Hypertext

8

Retrieval: Ad Hoc vs. Filtering

Ad hoc retrieval: static set of documents & dynamic queries

[Diagram: a collection of "fixed size" is searched by a stream of different queries Q1, Q2, Q3, Q4, Q5]

9

Retrieval: Ad Hoc vs. Filtering

Filtering: dynamic set of documents & static queries

[Diagram: an incoming stream of documents is matched against static user profiles; the documents matching User 1's profile and User 2's profile are routed to the respective users]

10

Classic IR Models – Basic Concepts

Each document is described by a set of representative keywords or index terms

Index terms are document words (e.g., nouns) which have meaning by themselves and capture the main themes of a document

However, search engines assume that all words are index terms (full text representation)

User profile construction process:

1) user-provided keywords

2) relevance feedback cycle

3) system adjusted user profile description

4) repeat step (2) till stabilized
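A hedged sketch of how steps 1–4 might be realized; the term-counting update rule and the alpha damping factor are illustrative assumptions, not something the slides prescribe:

```python
from collections import Counter

def update_profile(profile, relevant_docs, alpha=0.5):
    """Step 3: adjust the keyword-weight profile toward terms that occur in
    documents the user marked as relevant during the feedback cycle (step 2)."""
    feedback = Counter()
    for text in relevant_docs:
        feedback.update(text.lower().split())
    for term, freq in feedback.items():
        profile[term] = profile.get(term, 0.0) + alpha * freq
    return profile

# Step 1: user-provided keywords
profile = {"information": 1.0, "extraction": 1.0}
# Steps 2-4: repeat after each feedback cycle until the profile stabilizes
profile = update_profile(profile, ["ontology based information extraction"])
print(profile)
```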

11

Classic IR Models – Basic Concepts

Not all terms are equally useful for representing the document contents: less frequent terms identify a narrower set of documents

The importance of the index terms is represented by weights associated with them

Let

ki be an index term

dj be a document

wij be a weight associated with (ki, dj), which quantifies the importance of ki for describing the contents of dj

12

Classic IR Models – Basic Concepts

ki is a generic index term

dj is a document

t is the total number of index terms in an IR system

K = (k1, k2, …, kt) is the set of all index terms

wij ≥ 0 is a weight associated with (ki, dj)

wij = 0 indicates that ki does not belong to dj

vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj

gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)

13

The Boolean Model

Simple model based on set theory & Boolean algebra

Queries specified as Boolean expressions

precise semantics

neat formalism, e.g., q = ka ∧ (kb ∨ ¬kc)

Terms are either present or absent, i.e., wij ∈ {0, 1}

Consider q = ka ∧ (kb ∨ ¬kc)

vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), the disjunctive normal form over (ka, kb, kc)

vec(qcc) = (1,1,0) is a conjunctive component

14

The Boolean Model

q = ka ∧ (kb ∨ ¬kc)

$$\mathrm{sim}(q, d_j) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$

[Venn diagram over Ka, Kb, Kc showing the regions corresponding to the conjunctive components (1,0,0), (1,1,0), and (1,1,1)]
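A minimal sketch of Boolean retrieval for the example query q = ka ∧ (kb ∨ ¬kc); the document vectors below are illustrative, and the matching rule is just the sim(q, dj) definition above restricted to the query terms:

```python
# Conjunctive components of q_dnf over the binary weights (ka, kb, kc)
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_vector):
    """1 if the document's binary vector over (ka, kb, kc) equals some
    conjunctive component of q_dnf, else 0 (no ranking, no partial match)."""
    return 1 if tuple(doc_vector) in Q_DNF else 0

print(sim((1, 0, 0)))  # 1: ka present, kb and kc absent -> ka AND (NOT kc) holds
print(sim((1, 0, 1)))  # 0: kb absent and kc present, so (kb OR NOT kc) fails
```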

15

Drawbacks of the Boolean Model

Retrieval is based on binary decision criteria with no notion of partial matching

No ranking of the documents is provided (absence of a grading scale)

Information need has to be translated into a Boolean expression which most users find awkward

The Boolean queries formulated by the users are most often too simplistic

Using the exact matching approach, the model frequently returns either too few or too many documents in response to a user query

16

The Vector (Space) Model (VSM)

Use of binary weights is too limiting

Non-binary weights provide consideration for partial matching

These term weights are used to compute a degree of similarity between a query and a document

A ranked set of documents provides better matching, i.e., partial matching vs. exact matching

Answer set of VSM is a lot more precise than the answer set retrieved by the Boolean model

17

The Vector (Space) Model

Define:

wij > 0 whenever ki ∈ dj

wiq ≥ 0 is the weight associated with the pair (ki, q)

vec(dj) = (w1j, w2j, ..., wtj), the document vector of dj

vec(q) = (w1q, w2q, ..., wtq), the query vector of q

To each term ki is associated a unitary vector vec(i)

The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

The t unitary vectors vec(i) form an orthonormal basis for a t-dimensional space

In this space, queries and documents are represented as weighted vectors

18

The Vector (Space) Model

$$\mathrm{sim}(q, d_j) = \cos(\theta) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\ \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$

where ⋅ is the inner product operator & |q| is the length of q

Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q, dj) ≤ 1

A document is retrieved even if it matches the query terms only partially

[Diagram: vectors dj and q in term space, with θ the angle between them]
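A small sketch of the cosine formula above; the two weighted vectors are illustrative:

```python
import math

def cosine_sim(d, q):
    """sim(q, d) = (sum_i w_ij * w_iq) / (|d| * |q|), in [0, 1] for non-negative weights."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

# Illustrative weighted vectors over three index terms
print(round(cosine_sim([2.0, 0.0, 1.0], [1.0, 2.0, 3.0]), 3))   # ~0.598
```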

19

The Vector (Space) Model

$$\mathrm{sim}(q, d_j) = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{|\vec{d}_j|\,|\vec{q}|}$$

How to compute the weights wij and wiq?

A good weight must take into account two effects:

1. quantification of intra-document contents (similarity): the tf factor, the term frequency within a document

2. quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i, j) × idf(i)

20

The Vector (Space) Model

Let

N be the total number of documents in the collection

ni be the number of documents which contain ki

freq(i, j) be the raw frequency of ki within dj

A normalized tf factor is given by f(i, j) = freq(i, j) / max_l freq(l, j), where the maximum is computed over all terms which occur within the document dj

The inverse document frequency (idf) factor is idf(i) = log(N / ni)

The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with term ki.
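A short sketch of the two factors just defined; the raw frequencies, collection size, and the use of the natural log are illustrative assumptions:

```python
import math

def normalized_tf(freq_ij, max_freq_j):
    """f(i, j) = freq(i, j) / max_l freq(l, j)."""
    return freq_ij / max_freq_j

def idf(N, n_i):
    """idf(i) = log(N / n_i)."""
    return math.log(N / n_i)

# ki occurs 3 times in dj, the most frequent term in dj occurs 5 times,
# and 10 of the N = 1000 documents contain ki
print(normalized_tf(3, 5))   # 0.6
print(idf(1000, 10))         # log(100) ~ 4.61 (natural log)
```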

21

The Vector (Space) Model

The best term-weighting schemes use weights which are given by

wij = f(i, j) × log(N / ni)

This strategy is called a tf-idf weighting scheme

For the query term weights, a suggestion is

wiq = (0.5 + 0.5 × freq(i, q) / max_l freq(l, q)) × log(N / ni)

The vector model with tf-idf weights is a good ranking strategy with general collections

The VSM is usually as good as the known ranking alternatives. It is also simple and fast to compute
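Putting the two factors together as in the weighting scheme above; a sketch in which the helper names and the sample numbers are assumptions:

```python
import math

def doc_weight(freq_ij, max_freq_j, N, n_i):
    """w_ij = f(i, j) * log(N / n_i): the tf-idf weight of term ki in document dj."""
    return (freq_ij / max_freq_j) * math.log(N / n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    """w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i)."""
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

# Same illustrative numbers as before
print(doc_weight(3, 5, 1000, 10))     # ~2.76
print(query_weight(1, 1, 1000, 10))   # ~4.61 (single-occurrence query term)
```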

22

The Vector (Space) Model

Advantages:

term-weighting improves quality of the answer set

partial matching allows retrieval of documents that approximate the query conditions

cosine ranking formula sorts documents according to degree of similarity to the query

A popular IR model because of its simplicity & speed

Disadvantages:

assumes mutual independence of index terms (??); not clear that this is bad though

23

The Vector (Space) Model

Example I

      k1  k2  k3   q·dj
d1     1   0   1    2
d2     1   0   0    1
d3     0   1   1    2
d4     1   0   0    1
d5     1   1   1    3
d6     1   1   0    2
d7     0   1   0    1
q      1   1   1

[Figure: documents d1–d7 and query q plotted in the three-dimensional term space spanned by k1, k2, k3]

24

The Vector (Space) Model

Example II

      k1  k2  k3   q·dj
d1     1   0   1    4
d2     1   0   0    1
d3     0   1   1    5
d4     1   0   0    1
d5     1   1   1    6
d6     1   1   0    3
d7     0   1   0    2
q      1   2   3

[Figure: the same documents in the k1, k2, k3 term space, now with a weighted query]

25

The Vector (Space) Model

Example III

      k1  k2  k3   q·dj
d1     2   0   1    5
d2     1   0   0    1
d3     0   1   3   11
d4     2   0   0    2
d5     1   2   4   17
d6     1   2   0    5
d7     0   5   0   10
q      1   2   3

[Figure: weighted document vectors and the weighted query in the k1, k2, k3 term space]
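The q·dj column in Examples I–III is just the inner product of the query vector with each document vector (the numerator of the cosine formula, with no length normalization). A quick check of Example III:

```python
def dot(q, d):
    return sum(wq * wd for wq, wd in zip(q, d))

# Example III: weighted document vectors over (k1, k2, k3) and query q = (1, 2, 3)
docs = {"d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
        "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0)}
q = (1, 2, 3)
print({name: dot(q, d) for name, d in docs.items()})
# d5 = 17 ranks first, then d3 = 11, d7 = 10, d1 = d6 = 5, d4 = 2, d2 = 1
```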

26

Probabilistic Model

Objective: to capture the IR problem using a probabilistic framework

Given a user query, there is an ideal answer set

Querying as specification of the properties of this ideal answer set (clustering)

But, what are these properties?

Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)

Improve by iteration

27

Probabilistic Model

An initial set of documents is retrieved somehow

User inspects these docs looking for the relevant ones (in truth, only top 10 - 20 need to be inspected)

IR system uses this information to refine description of ideal answer set

By repeating this process, it is expected that the description of the ideal answer set will improve

Keep in mind the need to guess the description of the ideal answer set at the very beginning

Description of ideal answer set is modeled in probabilistic terms

28

Probabilistic Ranking Principle

Given a user query q and a document d in a collection, the probabilistic model tries to estimate the probability that the user will find the document d interesting (i.e., relevant).

The model assumes that this probability of relevance depends on the query and the document representations only. The ideal answer set R should maximize the probability of relevance. Documents in R are predicted to be relevant to q.

Problems: How to compute probabilities of relevance? What is the sample space?

29

The Ranking

Probabilistic ranking is computed as:

sim(q, dj ) = P(dj relevant-to q) / P(dj non-relevant-to q)

This is the odds of the document dj being relevant to q

Taking the odds minimizes the probability of an erroneous judgement

Definitions:

wij, wiq ∈ {0, 1}, i.e., all index term weights are binary

P(R | vec(dj)): probability that doc dj is relevant to q, where R is the set of docs known to be relevant

P(R̄ | vec(dj)): probability that doc dj is not relevant to q, where R̄ is the complement of R

30

The Ranking

$$\mathrm{sim}(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)} = \frac{P(\vec{d}_j \mid R)\, P(R)}{P(\vec{d}_j \mid \bar{R})\, P(\bar{R})} \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})} \quad \text{(Bayes' rule)}$$

where,

P(vec(dj) | R): probability of randomly selecting dj from the set R of relevant documents

P(R): probability that a randomly selected doc from the collection is relevant

31

The Ranking

$$\mathrm{sim}(d_j, q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})} \sim \frac{\left[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R)\right]\left[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)\right]}{\left[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R})\right]\left[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})\right]}$$

where,

P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents

P(k̄i | R): probability that ki is not present in such a document

32

The Ranking

$$\mathrm{sim}(d_j, q) \sim \log \frac{\left[\prod P(k_i \mid R)\right]\left[\prod P(\bar{k}_i \mid R)\right]}{\left[\prod P(k_i \mid \bar{R})\right]\left[\prod P(\bar{k}_i \mid \bar{R})\right]}$$

$$\sim K \times \sum_i \left[\log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}\right] \quad \text{(K a constant for all documents)}$$

$$\sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left(\log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}\right)$$

33

The Initial Ranking

$$\mathrm{sim}(d_j, q) \sim \sum_i w_{iq}\, w_{ij} \left(\log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}\right)$$

How to obtain the probabilities P(ki | R) and P(ki | R̄)?

Estimates based on assumptions:

P(ki | R) = 0.5

P(ki | R̄) = ni / N, where ni is the number of docs that contain ki and N is the number of docs in the collection

Use this initial guess to retrieve an initial ranking

Improve upon this initial ranking
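A sketch of the initial ranking under these assumptions (binary weights; the toy document frequencies are illustrative). Note that with P(ki | R) = 0.5 the first log term vanishes, so the initial score reduces to an idf-like sum over the query terms present in the document:

```python
import math

def initial_sim(doc_terms, query_terms, doc_freq, N):
    """Initial probabilistic ranking with P(ki|R) = 0.5 and P(ki|not R) = n_i / N."""
    score = 0.0
    for term in query_terms & doc_terms:          # binary weights: only shared terms count
        p_r, p_nr = 0.5, doc_freq[term] / N
        score += math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
    return score

# Illustrative: N = 1000 docs, "retrieval" occurs in 10 docs, "model" in 400
doc_freq = {"retrieval": 10, "model": 400}
print(initial_sim({"retrieval", "model", "boolean"}, {"retrieval", "model"}, doc_freq, 1000))
```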

34

Improving the Initial Ranking

$$\mathrm{sim}(d_j, q) \sim \sum_i w_{iq}\, w_{ij} \left(\log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}\right)$$

Let

V: the set of docs initially retrieved & ranked

Vi: the subset of V that contain ki

Re-evaluate the estimates:

P(ki | R) = Vi / V (approximating P(ki | R) by the distribution of ki among the docs in V)

P(ki | R̄) = (ni − Vi) / (N − V) (assuming all non-retrieved docs are irrelevant)

Repeat recursively

35

Improving the Initial Ranking

$$\mathrm{sim}(d_j, q) \sim \sum_i w_{iq}\, w_{ij} \left(\log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}\right)$$

To avoid problems with V = 1 & Vi = 0:

P(ki | R) = (Vi + 0.5) / (V + 1)

P(ki | R̄) = (ni − Vi + 0.5) / (N − V + 1)

Or, alternatively:

P(ki | R) = (Vi + ni/N) / (V + 1)

P(ki | R̄) = (ni − Vi + ni/N) / (N − V + 1)

(ni / N is an adjustment factor)
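A sketch of one feedback iteration using the adjusted estimates above; V and Vi are treated as counts, and the collection numbers are illustrative:

```python
def reestimate(Vi, V, ni, N):
    """P(ki|R) ~ (Vi + 0.5) / (V + 1),  P(ki|not R) ~ (ni - Vi + 0.5) / (N - V + 1).

    Vi: retrieved docs (among the V inspected) that contain ki
    ni: docs in the whole collection that contain ki; N: collection size
    """
    p_r = (Vi + 0.5) / (V + 1)
    p_nr = (ni - Vi + 0.5) / (N - V + 1)
    return p_r, p_nr

# Illustrative: 10 docs inspected, 4 contain ki; ki occurs in 50 of 1000 docs
print(reestimate(4, 10, 50, 1000))   # (~0.41, ~0.047)
```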

36

Pluses and Minuses

Advantages:

Documents ranked in decreasing order of probability of relevance

Disadvantages:

Need to guess initial estimates for P(ki | R)

Method does not take into account tf and idf factors

assumes mutual independence of index terms (??); not clear that this is bad though

37

Brief Comparison of Classic Models

Boolean model does not provide for partial matches and is considered to be the weakest classic model

Salton and Buckley did a series of experiments that indicate that, in general, the vector (space) model outperforms the probabilistic model with general collections

This seems also to be the view of the research community