Retrieval Models II Vector Space, Probabilistic


Allan, Ballesteros, Croft, and/or Turtle

Properties of Inner Product

• The inner product is unbounded.

• Favors long documents with a large number of unique terms.

• Measures how many terms matched but not how many terms are not matched.


Inner Product -- Examples

Binary:
– D = 1, 1, 1, 0, 1, 1, 0
– Q = 1, 0, 1, 0, 0, 1, 1

sim(D, Q) = 3

Terms labeling the vector components: retrieval, database, architecture, computer, text, management, information

Size of vector = size of vocabulary = 7

0 means corresponding term not found in document or query

Weighted:
– D1 = 2T1 + 3T2 + 5T3
– D2 = 3T1 + 7T2 + 1T3
– Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
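The two examples above can be checked with a few lines of Python. This is only a minimal sketch: the function name `inner_product` and the plain-list vector representation are illustrative choices, not something the slides prescribe.

```python
def inner_product(d, q):
    """Sum of component-wise products of two equal-length weight vectors."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same length")
    return sum(dw * qw for dw, qw in zip(d, q))


# Binary example: one component per vocabulary term (1 = present, 0 = absent).
binary_d = [1, 1, 1, 0, 1, 1, 0]
binary_q = [1, 0, 1, 0, 0, 1, 1]

# Weighted example: components are the weights of T1, T2, T3.
d1 = [2, 3, 5]
d2 = [3, 7, 1]
q = [0, 0, 2]

print(inner_product(binary_d, binary_q))  # 3
print(inner_product(d1, q))               # 10
print(inner_product(d2, q))               # 2
```

Nothing in the score accounts for vector length, which is why the measure is unbounded and tends to favor long documents.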


Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors.

• Inner product normalized by the vector lengths.

$$\mathrm{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$$

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the space with axes t1, t2, t3]

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
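As a quick check of the cosine figures for D1, D2, and Q, here is a minimal Python sketch. The helper name `cos_sim` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cos_sim(d, q):
    """Inner product of d and q divided by the product of their lengths."""
    dot = sum(dw * qw for dw, qw in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    # Assumption: an all-zero vector matches nothing, so score it 0.0.
    return dot / norm if norm else 0.0


d1 = [2, 3, 5]
d2 = [3, 7, 1]
q = [0, 0, 2]

print(round(cos_sim(d1, q), 2))  # 0.81
print(round(cos_sim(d2, q), 2))  # 0.13
```

Because each score is divided by the vector lengths, the result is bounded, unlike the raw inner product.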


Simple Implementation

1. Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.

2. Convert query to a tf-idf-weighted vector q.

3. For each dj in D do

Compute score sj = cosSim(dj, q)

4. Sort documents by decreasing score.

5. Present top ranked documents to the user.

Time complexity: O(|V|·|D|). Bad for large V and D!

|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
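The five steps can be sketched in Python roughly as follows. This is one possible reading, not a reference implementation: the tf-idf variant (raw term count times log(N/df)), the tokenized toy documents, and all function names are assumptions introduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Step 1: turn each tokenized document into a tf-idf vector over vocabulary V."""
    vocab = sorted({term for doc in docs for term in doc})
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed weighting: idf = log(N / df); other variants are common.
    idf = {term: math.log(len(docs) / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, idf, vectors


def cos_sim(d, q):
    dot = sum(dw * qw for dw, qw in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0


def rank(docs, query):
    vocab, idf, vectors = tfidf_vectors(docs)
    # Step 2: weight the query the same way (terms outside V are ignored).
    q_tf = Counter(term for term in query if term in idf)
    q_vec = [q_tf[term] * idf[term] for term in vocab]
    # Step 3: one O(|V|) cosine per document, so O(|V|*|D|) overall.
    scores = [(cos_sim(vec, q_vec), j) for j, vec in enumerate(vectors)]
    # Step 4: sort by decreasing score (ties broken by document index).
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical three-document collection.
docs = [
    ["database", "text", "retrieval"],
    ["computer", "architecture"],
    ["text", "retrieval", "retrieval", "information"],
]
ranking = rank(docs, ["text", "retrieval"])
for score, j in ranking:  # Step 5: present the ranked list.
    print(j, round(score, 3))
```

The loop in `rank` touches every vocabulary position of every document, which is where the O(|V|·|D|) cost comes from.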


Comments on Vector Space Models

• Simple, mathematically based approach.

• Considers both local (tf) and global (idf) word occurrence frequencies.

• Provides partial matching and ranked results.

• Tends to work quite well in practice despite obvious weaknesses.

• Allows efficient implementation for large document collections.


Problems with Vector Space Model

• Assumption of term independence

• Missing semantic information (e.g. word sense).

• Missing syntactic information (e.g. phrase structure, word order, proximity information).

• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
– Given a two-term query “A B”, may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.
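A small numeric illustration of the “A B” case, under assumptions that are not in the slides: raw term frequencies as weights, the inner product as the similarity, and made-up counts for the two documents.

```python
def inner_product(d, q):
    return sum(dw * qw for dw, qw in zip(d, q))


# Components are the weights of terms (A, B); all counts are hypothetical.
query = [1, 1]
doc_a_only = [5, 0]   # contains A five times, never B
doc_both = [1, 1]     # contains A and B once each

score_a_only = inner_product(doc_a_only, query)
score_both = inner_product(doc_both, query)

print(score_a_only, score_both)  # 5 2
```

With these weights the document missing B outranks the one containing both terms, which a Boolean “A AND B” query would never allow.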


Statistical Models

• A document is typically represented by a bag of words (unordered words with frequencies).

• Bag = set that allows multiple occurrences of the same element.

• User specifies a set of desired terms with optional weights:
– Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >
– Unweighted query terms: Q = < database; text; information >
– No Boolean conditions specified in the query.


Statistical Retrieval

• Retrieval via similarity based on probability of relevance to Q

• Given Q, the set of all documents is partitioned into the sets rel and nonrel.
– The sets rel and nonrel change from query to query

• Output documents are ranked according to probability of relevance to the query.
– Pr(relevance) of each document to the query is not available in practice


Basic Probabilistic Retrieval Model

• We need a similarity function s so that:
– P(rel|Di) > P(rel|Dj) iff s(Q, Di) > s(Q, Dj)

• Retrieve if P(relevant|D) > P(non-relevant|D)
– calculate P(D|R)/P(D|NR)

• Different ways of estimating these probabilities lead to different probabilistic models
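The step from the retrieval condition to the ratio P(D|R)/P(D|NR) follows from Bayes' rule. A short derivation, where the prior probabilities P(R) and P(NR) are symbols introduced here for the sketch:

```latex
\[
P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)},
\qquad
P(NR \mid D) = \frac{P(D \mid NR)\,P(NR)}{P(D)}
\]
\[
P(R \mid D) > P(NR \mid D)
\iff
\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}
\]
```

The right-hand side does not depend on D, so for a fixed query, ranking by P(D|R)/P(D|NR) gives the same order as ranking by P(R|D).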


Probability

• Experiment: a specific set of actions whose results cannot be predicted with certainty
– e.g., rolling two dice and recording their values

• Simple outcome: each possible set of recorded data
– for the example, each pair is a simple outcome

(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)


Probability

• Sample Space
– non-empty set containing all possible simple outcomes of the experiment

(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)

– each element is known as a sample point


Probability

• Sample Space

(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)

• Event space: subsets of a sample space defined by a specific event or outcome
– e.g., the event that the sum of the two dice is 4


Event Space

• The probability of an event is the sum of the probabilities of the sample points associated with the event
– what is the probability that the sum is 4?

• Recall that sample points represent the possible outcomes of a statistical “experiment”

• 36 equally likely outcomes when rolling two dice
• 3 ways to get a sum of 4: (1,3) (2,2) (3,1)
• Pr(sum is 4) = 3/36 = 1/12
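The dice calculation can be reproduced by enumerating the sample space directly. A minimal sketch, assuming two fair six-sided dice so that every sample point has probability 1/36; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair of faces from two dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event: the subset of sample points whose faces sum to 4.
event = [point for point in sample_space if sum(point) == 4]

# With equally likely sample points, Pr(event) = |event| / |sample space|.
prob = Fraction(len(event), len(sample_space))

print(len(sample_space))  # 36
print(event)              # [(1, 3), (2, 2), (3, 1)]
print(prob)               # 1/12
```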


Event Space

• For a retrieval model, the event space is Q x D,
– where each sample point is a query-document pair
– each has an associated relevance judgment

• For a particular query, a probabilistic model tries to estimate P(R|D)


Probability Ranking Principle

• Ranking documents in decreasing order of pr(rel) to the query, where probabilities are estimated using all available evidence, produces the best possible effectiveness

– Assume relevance of a document is independent of other documents in the collection

– Bayes Decision Rule: Retrieve if P(R|D) > P(NR|D)
• minimizes the average probability of error

$$P(\text{error} \mid D) = \begin{cases} P(R \mid D) & \text{if we decide } NR \\ P(NR \mid D) & \text{if we decide } R \end{cases}$$

– equivalent to optimizing recall/fallout tradeoff


Basic Probabilistic Model

• Doc d = (t1, t2, …, tn)
– ti = 0 means index term ti absent, ti = 1 means term ti present
– pi = P(ti = 1|R) and 1 - pi = P(ti = 0|R)
– qi = P(ti = 1|NR) and 1 - qi = P(ti = 0|NR)

• Assume conditional independence
– P(d|R) is the product of the probabilities for the components of d (i.e. the product of the probabilities of getting a particular vector of 1’s and 0’s)

• Appearance of a term in a doc interpreted either as – evidence that document is relevant or

– evidence that document is non-relevant

• The key is finding a means of estimating pi and qi

– pi is prob term is present given Relevant

– qi is prob term is present given Not Relevant
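Written out, the conditional independence assumption gives the following products. This is a sketch using the slide's own symbols (t_i, p_i, q_i, n); the exponent notation is simply a compact way of selecting p_i when t_i = 1 and 1 - p_i when t_i = 0.

```latex
\[
P(d \mid R) = \prod_{i=1}^{n} p_i^{\,t_i} (1 - p_i)^{1 - t_i},
\qquad
P(d \mid NR) = \prod_{i=1}^{n} q_i^{\,t_i} (1 - q_i)^{1 - t_i}
\]
\[
\frac{P(d \mid R)}{P(d \mid NR)}
= \prod_{i=1}^{n} \left(\frac{p_i}{q_i}\right)^{t_i}
  \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - t_i}
\]
```

For example, with n = 3 and d = (1, 0, 1), P(d|R) = p_1 (1 - p_2) p_3.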


Basic Probabilistic Model

• Need to calculate
– “relevant to query given term appears” and
– “irrelevant to query given term appears”

• These values can be based upon some known relevance judgments


Estimation with Relevance Information

– N = total # docs, R = # rel docs, N - R = # nonrel docs
– ft = # docs w/ term
– Rt = # rel docs w/ term
– ft - Rt = # nonrel docs w/ term

• We can estimate the conditional probabilities w/ the table
– P(rel|t is present) = Rt / ft
– P(non-rel|t is present) = (ft - Rt) / ft
– P(t is present|rel) = Rt / R
– P(t is present|nonrel) = (ft - Rt) / (N - R)

Number of docs    Relevant    Nonrelevant          Total
Term t present    Rt          ft - Rt              ft
Term t absent     R - Rt      N - ft - (R - Rt)    N - ft
Total             R           N - R                N
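The four estimates can be read straight off the contingency table. A minimal Python sketch, using the counts from the worked example later in the deck (N = 20, R = 13, ft = 12, Rt = 11); the function name and dictionary keys are illustrative only.

```python
from fractions import Fraction


def contingency_estimates(N, R, f_t, R_t):
    """Conditional probability estimates from the 2x2 relevance/term table."""
    return {
        "P(rel|t)": Fraction(R_t, f_t),
        "P(nonrel|t)": Fraction(f_t - R_t, f_t),
        "P(t|rel)": Fraction(R_t, R),
        "P(t|nonrel)": Fraction(f_t - R_t, N - R),
    }


est = contingency_estimates(N=20, R=13, f_t=12, R_t=11)
for name, value in est.items():
    print(name, value, round(float(value), 3))
```

Exact fractions are used so the results match the table cell by cell; a zero count in any denominator would raise an error, which is the small-sample problem that maximum likelihood estimates run into.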


Estimation with Relevance Information

• wt = (Rt /(R - Rt)) / ((ft - Rt)/(N - ft - (R - Rt)))
– ratio of rel with term to rel without term, divided by the ratio of nonrel with term to nonrel without term

• Suppose N = 20; R = 13 relevant; term t appears in 11 rel; term appears in 12 docs
– wt = (11/(13-11)) / ((12-11)/(20-12-(13-11))) = 5.5/0.17 = 33

Number of docs    Relevant    Nonrelevant          Total
Term t present    Rt          ft - Rt              ft
Term t absent     R - Rt      N - ft - (R - Rt)    N - ft
Total             R           N - R                N


Estimation with Relevance Information

• Think of this as extent to which the term can discriminate b/w relevant and non-relevant docs

– wt = (Rt /(R - Rt)) / ((ft - Rt)/(N - ft - (R - Rt)))
– wt = 33 for N = 20; R = 13; Rt = 11; ft = 12
– t is strongly indicative of relevance since it frequently appears in rel documents and rarely in nonrel

– what if N = 20; R = 13; Rt = 4; ft = 7?
• wt = (4/9)/(3/4) = 0.59
• t counts slightly against the doc being relevant

– wt = 1 indicates neutral term since it appears randomly across relevant and nonrelevant docs
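Both worked values of wt can be reproduced with a direct transcription of the formula. A minimal sketch; the function name `term_weight` is an assumption, and no smoothing is applied, so a zero cell in the table would cause a division by zero.

```python
def term_weight(N, R, f_t, R_t):
    """Odds of the term among relevant docs divided by its odds among nonrelevant docs."""
    rel_odds = R_t / (R - R_t)                          # rel with term / rel without
    nonrel_odds = (f_t - R_t) / (N - f_t - (R - R_t))   # nonrel with / nonrel without
    return rel_odds / nonrel_odds


w_strong = term_weight(N=20, R=13, f_t=12, R_t=11)
w_weak = term_weight(N=20, R=13, f_t=7, R_t=4)

print(round(w_strong, 2))  # 33.0
print(round(w_weak, 2))    # 0.59
```

Values above 1 count toward relevance, values below 1 count against it, and 1 is neutral.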


Estimation with Relevance Information

• wt = (Rt /(R-Rt ))/ ((ft - Rt )/(N - ft - (R -Rt )))

• wt = 1 indicates neutral term since it appears randomly across relevant and nonrelevant docs

• assuming that the occurrences of terms in documents are independent
– document weight is the product of its term weights
• $w(d) = \prod_{t \in d} w_t$
– conventional to express as a sum of logs:
• $\log w(d) = \sum_{t \in d} \log w_t$
– neg. values indicate nonrel, 0 indicates there is as much evidence for relevance as for non-relevance
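The product and the sum of logs rank documents identically, since the logarithm is monotonic. A minimal sketch, assuming a hypothetical document that contains three terms with the weights 33, 0.59, and 1 used earlier, and natural logarithms (the slides do not fix the base).

```python
import math


def doc_score(term_weights):
    """Sum of log term weights; equals the log of the product of the weights."""
    return sum(math.log(w) for w in term_weights)


# Hypothetical document: one strong term, one weakly negative term, one neutral term.
weights = [33.0, 0.59, 1.0]
score = doc_score(weights)

print(round(score, 3))
print(math.isclose(score, math.log(math.prod(weights))))  # True
```

The neutral term contributes log 1 = 0, and the score is positive here because the evidence for relevance outweighs the evidence against it.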


Estimation

• Relevance information is usually not available

• Estimate prob based on information in query & collection
– previous queries can also be used with some learning approaches

• If qi (pr of occurrence in non-relevant documents) is estimated as ft /N, the second part of the weight is

$$\log \frac{1 - q_i}{q_i} = \log \frac{N - f_t}{f_t}$$

– which for large N is the IDF weight
– the non-relevant documents are approximated by the whole collection
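To see why log((N - ft)/ft) behaves like IDF for large N, compare it numerically with log(N/ft). A minimal sketch; the collection size, the document frequency, and the use of natural logarithms are assumptions chosen for illustration.

```python
import math


def nonrel_weight(N, f_t):
    """log((1 - q) / q) with q estimated as f_t / N."""
    return math.log((N - f_t) / f_t)


def idf(N, f_t):
    """Plain inverse document frequency, log(N / f_t)."""
    return math.log(N / f_t)


N, f_t = 1_000_000, 100  # hypothetical collection size and document frequency
print(round(nonrel_weight(N, f_t), 4), round(idf(N, f_t), 4))
```

The two differ only by log(1 - ft/N), which shrinks toward zero as N grows relative to ft.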


Estimation

• pi (probability of occurrence in relevant documents) can be estimated in various ways
– constant (Croft and Harper combination match)
– proportional to probability of occurrence in collection
– more accurately, proportional to log(probability of occurrence)

• Greif, 1998

• Maximum likelihood estimates have problems with small samples or zero values

• Estimating probabilities is the same problem as determining weighting formulae in less formal models


An Independence Assumption

• Typically, terms aren’t independent (e.g. phrases)
– Modeling dependence can be very complex

• The model assumes all terms are distributed independently in both rel and nonrel
– Very strong assumption! For example:
Q: “What is happening with the impeachment trial?”
– Occurrence of “impeachment” in relevant documents is assumed to be independent of occurrence of “trial”