Extending ranking with interword spacing analysis

Maria Carmela Daniele, Claudio Carpineto and Andrea Bernardini



Overview

I.  Word weighting based on interword spacing: σp

II.  Extension of quantistic weight through corpora analysis: σ*

III.  σ* application to ranking

IV.  Experiments

V.  Selective application of quantistic and frequentistic metrics based on:

a)  Document’s length

b)  Query hardness


Word weighting based on spacing between term occurrences: σp

•  A research branch that has developed over the last decade.

•  It builds on studies of the energy-level statistics of disordered quantum systems and was introduced by Ortuño et al. (2002).

•  Keywords are extracted from the distances between a term’s occurrences in a document, independently of any term-frequency analysis of the document.

•  Let’s see in more detail…


Reference Scenario

•  As in a quantum system, terms in a document are subject to attraction/repulsion phenomena, which are stronger among relevant terms than among common words.

•  Reference document: Charles Darwin’s “The Origin of Species”

•  In practice:

   Relevant words tend to cluster in a document (e.g. “INSTINCT”)

   Common words like “THE” are distributed uniformly


Definition of σp

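•  The weighting method is defined from the probability distribution of the distances between successive occurrences of a term.

•  A more efficient method characterizes that distribution by its standard deviation:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=0}^{n} \left( (x_{i+1} - x_i) - \mu \right)^2}$$

•  Normalizing with respect to the mean value μ:

$$\sigma_p = \frac{s}{\mu}$$

•  Example: “A great scientist must be a good teacher and a good researcher”, with word positions 1 to 12.

•  For the term “a” we get X = {1, 6, 10} and D = {0, 5, 4, 2} (d_i = x_{i+1} − x_i, where the boundary positions x_0 = 1 and x_4 = 12 are the first and last words of the text).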
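The definition of σp can be checked on the example sentence with a short Python sketch. It assumes, as the listed distances D imply, that the first and last word positions of the text (1 and 12) act as boundary points, that μ is the mean of the distances, and that n is the number of occurrences. The function name `sigma_p` is illustrative.

```python
import math

def sigma_p(positions, first, last):
    """Spacing-based weight: standard deviation of the gaps divided by their mean."""
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two occurrences (the sum is divided by n - 1)")
    # Add the text boundaries so that n occurrences give n + 1 distances.
    xs = [first] + sorted(positions) + [last]
    distances = [b - a for a, b in zip(xs, xs[1:])]
    mu = sum(distances) / len(distances)
    s = math.sqrt(sum((d - mu) ** 2 for d in distances) / (n - 1))
    return s / mu

sentence = "A great scientist must be a good teacher and a good researcher"
words = sentence.lower().split()
# 1-based positions of the term "a" in the sentence.
positions = [i for i, w in enumerate(words, start=1) if w == "a"]
weight = sigma_p(positions, first=1, last=len(words))
print(positions, round(weight, 3))
```

For the term “a” this gives the distances {0, 5, 4, 2}, μ = 2.75 and a weight of about 0.99.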


Extension of quantistic weighting through corpora analysis: σ*

•  We propose to modify the original metric with a factor σf based on the variance of term frequencies (Salton 1975). The factor σf is analogous to σp and it has a twofold goal:

1.  Penalize rare words, which can often be regarded as ‘noise’ in real document collections but tend to be overestimated by σp;

2.  Reward words that better discriminate a document from the rest of the collection. This feature is lacking in quantistic weighting.

with

$$s_f(w) = \sqrt{\frac{1}{N_D} \sum_{i=1}^{n} \left( f_i(w) - \mu_f \right)^2}$$
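As a rough illustration of the frequency factor, the sketch below computes the spread of a term's frequency over a collection. It assumes that f_i(w) is the frequency of w in document i, that μ_f is the mean of those frequencies, and that the sum runs over all N_D documents with a square root taken as for s; the slide does not spell these out. The function name `s_f` and the sample counts are made up.

```python
import math

def s_f(freqs):
    """Standard deviation of a term's per-document frequencies (assumed reading of s_f)."""
    n_docs = len(freqs)
    if n_docs == 0:
        raise ValueError("empty collection")
    mu_f = sum(freqs) / n_docs
    return math.sqrt(sum((f - mu_f) ** 2 for f in freqs) / n_docs)

# Hypothetical counts of one term in four documents.
concentrated = s_f([0, 0, 0, 12])   # term frequent in one document only
uniform = s_f([3, 3, 3, 3])         # term spread evenly over the collection
print(concentrated > uniform)
```

Under these assumptions a term concentrated in a few documents gets a larger value than one spread evenly, which matches the stated goal of rewarding discriminating words.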


Comparison between quantistic and frequentistic metrics

Rank   Tf-Idf   Tf-Idf*   σp          σ*
1      unto     lord      jesus       jesus
2      shall    god       christ      saul
3      lord     absalom   paul        absalom
4      thou     son       peter       jephthah
5      thy      king      disciples   jubile
6      thee     behold    faith       ascendeth
7      him      man       john        abimelech
8      god      judah     david       elias
9      his      land      saul        joab
10     hath     men       gospel      haman

•  Tf-Idf (with and without stop words) is used as the frequency-based metric

•  σp and σ* are used for the quantistic weighting

•  Reference document: the King James Bible

•  Idf and σf require a collection; we use the TREC WT10g collection to compute them


Application of σ* to ranking (1)

•  Since the quantistic and frequentistic weighting metrics have complementary features, we would like to combine them.

•  Using the σ* metric, it is possible to rank a collection of documents against a query q


Application of σ* to ranking (2)  

•  The combined metric is obtained through a linear combination of the Okapi BM25 and σ* metrics

•  A prerequisite for the linear combination is that the scores lie in a similar range

•  The scores are normalized through:

•  The scores are combined by:
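A minimal sketch of the combination step follows. The normalization used on the slide is not recoverable here, so min-max scaling is assumed purely for illustration. The weighting α·BM25 + (1 − α)·σ* is likewise an assumption, chosen so that α = 1 reduces to BM25 and α = 0 to σ*. All function names and scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def combine(bm25, sigma_star, alpha):
    """Rank documents by alpha * BM25 + (1 - alpha) * sigma*, both normalized."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    nb, ns = min_max(bm25), min_max(sigma_star)
    docs = set(nb) | set(ns)
    # A document missing from one run contributes 0 for that run.
    combined = {d: alpha * nb.get(d, 0.0) + (1 - alpha) * ns.get(d, 0.0) for d in docs}
    return sorted(combined, key=combined.get, reverse=True)

# Made-up scores for three documents and one query.
bm25 = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
sigma_star = {"d1": 0.4, "d2": 1.9, "d3": 1.1}
for alpha in (1.0, 0.5, 0.0):
    print(alpha, combine(bm25, sigma_star, alpha))
```

Normalizing first keeps either score from dominating only because of its scale, which is the prerequisite stated in the bullets.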


Experiments (1)

•  Collections:

   Web Track: about 1,690,000 documents

   Robust Track: more than 500,000 documents

•  Evaluation measure: MAP (mean average precision)

•  Lucene with the BM25 extension created by Perez-Iglesias
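For readers unfamiliar with the evaluation measure, the sketch below computes mean average precision in its usual form: for each query, precision is taken at the rank of every relevant document retrieved, summed, and divided by the total number of relevant documents; MAP is the mean over queries. The rankings and relevance sets are invented for illustration.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),               # AP = 1/2
]
print(round(mean_average_precision(runs), 4))
```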


Experiments (2)

•  The quantistic metric alone does not work well:

Collection   Topics             BM25    σ*      BM25+σ*
WT10g        501-550            0.143   0.057   0.153
Robust       301-450, 601-700   0.195   0.089   0.203

•  In our experiments, combining the quantistic method with BM25 significantly improves on the performance of classical IR methods.

•  We let the α parameter vary in the range [0,1]: the two extreme points, α = 1 and α = 0, coincide with the BM25 and σ* techniques respectively.

α              1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1    0
MAP (WT10g)    .1436  .1469  .1537  .1535  .1501  .1379  .1222  .096   .0819  .0679  .0547
MAP (Robust)   .1954  .2033  .2031  .1983  .1673  .1549  .1428  .1203  .1075  .9674  .0898

•  The results suggest that the method is sufficiently robust, because we found a range of α values in which the combined method performed well.
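The α sweep reported in the table can be summarized with a few lines of Python. The values are copied from the table, taking the first MAP row as WT10g and the second as Robust (they match the per-collection results). The Robust entry for α = 0.1 is printed as .9674, which is out of line with its neighbours, so this sketch leaves it out. The helper name `best_alpha` is illustrative.

```python
alphas = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
map_wt10g = [.1436, .1469, .1537, .1535, .1501, .1379, .1222, .096, .0819, .0679, .0547]
# None marks the alpha = 0.1 entry, skipped because the printed value (.9674) looks like a misprint.
map_robust = [.1954, .2033, .2031, .1983, .1673, .1549, .1428, .1203, .1075, None, .0898]

def best_alpha(alphas, maps):
    """Return (alpha, MAP, relative gain over alpha = 1, i.e. BM25 alone)."""
    baseline = maps[alphas.index(1.0)]
    pairs = [(m, a) for a, m in zip(alphas, maps) if m is not None]
    best_map, alpha = max(pairs)
    return alpha, best_map, (best_map - baseline) / baseline

for name, maps in [("WT10g", map_wt10g), ("Robust", map_robust)]:
    alpha, best, gain = best_alpha(alphas, maps)
    print(f"{name}: alpha={alpha} MAP={best:.4f} gain={gain:.1%}")
```

On these figures the best setting is α = 0.8 for WT10g and α = 0.9 for Robust, a relative gain of roughly 7% and 4% over BM25 alone.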


Query by query analysis  

[Figure: average precision (AvP, y-axis, 0.0 to 1.0) per query (query number, x-axis) for BM25, σ* and BM25+σ*]


Selective application of quantistic and frequentistic techniques

1. Relying on predictors of query difficulty to choose which metric to use (rationale: the quantistic method should be better on difficult queries)

2. Relying on document length to choose which metric to use (rationale: the quantistic method should be better for long documents)


Query hardness (1)

•  We used two well-known query performance predictors:

•  Simplified Clarity Score

•  σ1
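For reference, the Simplified Clarity Score is usually computed as the sum over query terms of P(w|Q)·log2(P(w|Q)/P_coll(w)), where P(w|Q) is the term's relative frequency in the query and P_coll(w) its relative frequency in the collection. The sketch below follows that common definition; whether the experiments used exactly this variant is an assumption, and the collection statistics are made up.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, coll_tf, coll_tokens):
    """SCS = sum over query terms of P(w|Q) * log2(P(w|Q) / P_coll(w))."""
    counts = Counter(query_terms)
    qlen = len(query_terms)
    score = 0.0
    for term, qtf in counts.items():
        p_q = qtf / qlen
        p_coll = coll_tf.get(term, 0) / coll_tokens
        if p_coll == 0:
            continue  # term unseen in the collection: skipped in this sketch
        score += p_q * math.log2(p_q / p_coll)
    return score

# Hypothetical collection of 1000 tokens with two term counts.
coll_tf = {"darwin": 250, "species": 125}
scs = simplified_clarity_score(["darwin", "species"], coll_tf, coll_tokens=1000)
print(scs)
```

A higher score means the query's term distribution diverges more from the collection's, which is usually read as a more specific, easier query.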


Query hardness (2)

• WT10g with the σ1 predictor

• Robust with the SCS predictor

• Predictor values on the x-axis

• MAP values on the y-axis (for both BM25 and σ*)

[Figure: two plots of MAP against predictor value. Robust: x-axis SCS, series BM25 and σ*. WT10g: x-axis σ1, series BM25 and σ*, each with a linear trend line.]


Document Length (1)

•  Why use document length? Because the quantistic method works better with long texts

                         BM25   σ*
Relevant retrieved       1544   3729
Relevant NOT retrieved   4239   2115


Document Length (2)

• Collection: WT10g

• σ*

• BM25

• X-Axis: document’s length expressed in number of words

• Y-Axis: Cumulative percentage of relevant documents (retrieved in Blue, not retrieved in Red)


Conclusions on using a selective application of frequentistic and quantistic weighting

•  Query hardness did not work.

•  Using document length was more promising


Conclusions and future work

•  Definition of an extended quantistic weighting method through corpora analysis.

•  Integration of quantistic and frequentistic ranking methods

•  A linear combination showed a significant performance improvement over the classical frequentistic method

•  Selective application: query hardness not useful, document length useful

•  This method could be applied to other Information Retrieval tasks, e.g.:

   •  Document Summarization: create a short version of a text

   •  Query Expansion: expand the query phrase (e.g. using synonyms)

   •  Search Result Clustering: group results in clusters


Conclusions

Thanks for listening! Questions?