SWLM: NEITHER GENERAL, NOR SPECIFIC, BUT SIGNIFICANT

MOSTAFA DEHGHANI



Page 1

SWLM: NEITHER GENERAL, NOR SPECIFIC, BUT SIGNIFICANT

MOSTAFA DEHGHANI


Page 2

REPRESENTATION MATTERS

▸ The performance of any data-oriented method is heavily dependent on the way the data is represented.

▸ A good representation:

▸ Precise

▸ Robust against noise

▸ Transferable over time

▸ Interpretable by human inspection

▸ …


Page 3

REPRESENTATION OF A SET OF OBJECTS

▸ In many applications, you may need to represent a set of objects (instead of a single object)

▸ Classification, Recommendation, Feedback, etc.

▸ A representation of a set should be:

▸ Discriminative enough to make this set distinguishable from other objects (other sets)

▸ General enough to represent the whole set, not just part of it
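A minimal sketch of how these two criteria could be quantified (not from the talk; the function names and scoring choices below are illustrative assumptions): distinctiveness as divergence from the collection background, generality as worst-case coverage of the set's members.

import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) over p's support; larger = p is easier to tell apart from q.
    return sum(pw * math.log((pw + eps) / (q.get(w, 0.0) + eps))
               for w, pw in p.items() if pw > 0)

def discriminative_score(rep, collection_lm):
    # Discriminative: how far the set's representation is from the collection model.
    return kl_divergence(rep, collection_lm)

def generality_score(rep, doc_term_sets):
    # General: the worst-case probability mass the representation assigns to
    # any single document of the set -- low if it only covers part of the set.
    return min(sum(rep.get(w, 0.0) for w in terms) for terms in doc_term_sets)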


Page 4

LUHN-BASED MODEL

▸ How to estimate a model for representing a set of textual entities capturing all, and only, the essential shared commonalities of these entities?

H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165, 1958.


Page 5

LUHN-BASED MODEL

▸ How to estimate a model for representing a set of textual entities capturing all, and only, the essential shared commonalities of these entities?

Significant Words Language Models

Revisiting Luhn!

H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165, 1958.


Page 6

TASK1: RELEVANCE FEEDBACK


➤ Task: (Pseudo-)Relevance Feedback

➤ Query Expansion

➤ To enrich the user's query to improve retrieval performance, given a set of judged documents (or using top-ranked retrieved documents as relevant documents)

➤ Main Goal:

➤ To make use of feedback documents to estimate more accurate query models representing the notion of relevance.

Mostafa Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx. “Luhn Revisited: Significant Words Language Models”, in CIKM’16.
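In the language-modeling framework, the estimated feedback model is typically interpolated with the original query model to form the expanded query. A minimal sketch (the weight alpha and the dictionary representation are assumptions of this sketch, not the paper's exact formulation):

def expand_query_model(query_lm, feedback_lm, alpha=0.5):
    # theta_Q' = (1 - alpha) * theta_Q + alpha * theta_F
    vocab = set(query_lm) | set(feedback_lm)
    return {w: (1 - alpha) * query_lm.get(w, 0.0) + alpha * feedback_lm.get(w, 0.0)
            for w in vocab}

# e.g. expand_query_model({"nobel": 0.5, "prize": 0.5}, feedback_lm, alpha=0.4)
# where feedback_lm is the model estimated from the judged documents.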


Page 7

TASK1: EXAMPLE


➤ Example: Top 50 terms in the language model estimated from the set of top seven relevant documents retrieved for topic 374, “Nobel prize winners”, of the TREC Robust04 test collection:

Standard-LM: prize 5.55e-02, nobel 3.36e-02, physics 2.35e-02, science 2.18e-02, ..., time 1.68e-02, ..., palestinian 1.34e-02, year 1.34e-02, ...

General-LM: new 3.70e-03, cent 2.98e-03, two 2.97e-03, dollars 2.76e-03, people 2.71e-03, ..., time 2.47e-03, ..., year 2.16e-03, ...

SMM [45]: prize 6.07e-02, nobel 4.37e-02, awards 3.43e-02, chemistry 3.23e-02, physics 2.82e-02, palestinian 2.18e-02, cesium 2.09e-02, arafat 1.94e-02, university 1.92e-02, ...

Specific-LM: insulin 2.25e-02, palestinian 2.15e-02, dehmelt 1.81e-02, oscillations 1.79e-02, waxman 1.69e-02, marcus 1.69e-02, attack 1.61e-02, ..., arafat 1.29e-02, ...

SWLM: prize 6.02e-02, nobel 4.53e-02, science 2.68e-02, award 2.43e-02, physics 1.94e-02, winner 1.90e-02, won 1.80e-02, peace 1.80e-02, discovery 1.71e-02, ...

Figure 2: Extracting significant terms from relevant feedback documents (topic 374 of the TREC Robust04 test collection: “Nobel prize winners”).

…systems, like mixture models [45] and parsimonious language models [16]. They tried to make the feedback model more distinctive by eliminating the effect of common terms from the model. However, instead of using fixed frequency cut-offs, they made use of a more advanced way to do this. Hiemstra et al. stated the following in their paper:

[...] our approach bears some resemblance with early work on information retrieval by Luhn, who specifies two word frequency cut-offs, an upper and a lower, to exclude non-significant words. The words exceeding the upper cut-off are considered to be common and those below the lower cut-off rare, and therefore not contributing significantly to the content of the document. Unlike Luhn, we do not exclude rare words and we do not have simple frequency cut-offs [...]

In a way, this paper completes the cycle following the vision of Luhn. We introduce a meaningful translation of specificity and generality against significance in the context of the feedback problem, and propose an effective way of establishing a representation consisting of significant words, by parsimonizing the feedback model toward not only the common observations, but also the rare observations.

Generally speaking, SWLM tries to estimate a language model from the set of feedback documents which is “specific” enough to distinguish the features of the feedback documents from other documents, by removing general terms, and at the same time “general” enough to capture all the shared features of the feedback documents as the notion of relevance, by excluding document-specific terms. To do so, SWLM assumes that terms in the feedback documents are drawn from three models: 1. the General model, representative of common observations; 2. the Specific model, representative of partial observations; and 3. the Relevance model, a latent model representing the notion of relevance. It then tries to extract the latent relevance model as the feedback language model.
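A minimal EM sketch of this three-model decomposition, under simplifying assumptions (the general and specific models and the mixture weights are held fixed here; the paper's full estimation also re-estimates the weights and treats specificity per document):

from collections import Counter

def estimate_swlm(docs, general_lm, specific_lm, weights=(0.5, 0.3, 0.2), iters=20):
    # docs: list of token lists. weights: fixed mixture weights for the
    # (significant-words, general, specific) components -- an assumption here.
    counts = [Counter(d) for d in docs]
    vocab = {w for c in counts for w in c}
    swlm = {w: 1.0 / len(vocab) for w in vocab}  # uniform initialization
    l_sw, l_g, l_s = weights
    for _ in range(iters):
        expected = Counter()
        for c in counts:
            for w, n in c.items():
                # E-step: responsibility of the latent model for these occurrences
                denom = (l_sw * swlm[w] + l_g * general_lm.get(w, 1e-12)
                         + l_s * specific_lm.get(w, 1e-12))
                expected[w] += n * l_sw * swlm[w] / denom
        total = sum(expected.values())
        swlm = {w: expected[w] / total for w in vocab}  # M-step: renormalize
    return swlm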

Figure 2 shows an example of estimating language models from the set of top seven relevant documents retrieved for topic 374, “Nobel prize winners”, of the TREC Robust04 test collection. Terms in each list are selected from the top 50 terms of the models estimated after stop-word removal. Standard-LM is the language model estimated using MLE, treating the feedback documents as a single document. SMM is the language model estimated using the simple mixture model [45], one of the most powerful feedback approaches, which generally tries to take background terms out of the feedback model. As for the components of SWLM, General-LM denotes common terms and is an estimate of the collection's language model; Specific-LM determines the probability of a term being specific within the feedback set, i.e. frequent in one of the feedback documents but not the others; and the latent significant words language model represents the feedback model.

As can be seen, by considering feedback documents as a mixture of a relevance model and a collection model, SMM penalizes some general terms like “time” and “year” by decreasing their probabilities. However, since some frequent words in the feedback set are not frequent in the whole collection, their probabilities are boosted, like “Palestinian” and “Arafat”, while they are not good indicators for the whole feedback set. The point is that although these terms are frequently observed, they only occur in some feedback documents, not most of them, which means that they are in fact “specific” terms, not distinctive terms. By considering both general terms and specific terms, SWLM estimates the significant words language model reflecting the mutual notion of relevance.
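For contrast, a sketch of the two-component SMM estimation the paragraph above describes (a latent topic model mixed with the fixed collection model; lam is an assumed mixing weight): because there is no specific component, mass pushed out of the collection model can land on document-specific terms like “arafat”.

from collections import Counter

def estimate_smm(docs, collection_lm, lam=0.5, iters=20):
    # Feedback docs modeled as (1 - lam) * topic + lam * collection; EM
    # re-estimates the latent topic model from the pooled term counts.
    counts = Counter(w for d in docs for w in d)
    topic = {w: 1.0 / len(counts) for w in counts}
    for _ in range(iters):
        expected = {w: n * (1 - lam) * topic[w]
                       / ((1 - lam) * topic[w] + lam * collection_lm.get(w, 1e-12))
                    for w, n in counts.items()}
        total = sum(expected.values())
        topic = {w: e / total for w, e in expected.items()}
    return topic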

Generally, the main aim of this paper is to develop an approach to estimate a robust model from a set of documents that captures all, and only, the essential shared commonalities of these documents. Having the task of feedback in information retrieval as the application, we break this down into three concrete research questions:

RQ1 How to estimate a significant words language model for a set of feedback documents capturing the mutual notion of relevance?

RQ2 How effective is the significant words language model in (pseudo-)relevance feedback?

RQ3 How does the significant words language model prevent the feedback model from being affected by non-relevant terms of non-relevant or partially relevant feedback documents?

The rest of the paper is structured as follows. First, in Section 2 we review related work. Then, we explain our approach for estimating significant words language models in Section 3. Sections 4, 5, and 6 present the experimental setup, the results of the experiments on the task of (pseudo-)relevance feedback, and a comprehensive analysis of the robustness of the proposed approach. Finally, Section 7 concludes the paper and discusses extensions as future work.

2. RELATED WORK

In this section, we discuss related studies on the problem of feedback in information retrieval. First, we talk about different feedback approaches, in particular methods in the language modeling framework. Then, after discussing initiatives focusing on the tasks of RF and PRF, we discuss the main challenges of these tasks and some previously proposed methods that address them.

It has been shown that there is a limit on providing increasingly better results for retrieval systems based only on the original query [41]. So, it is crucial to reformulate the search request using terms which reflect the user's information need in order to improve the performance of retrieval systems. To address this issue, automatic feedback methods for information retrieval were introduced fifty years ago [34] and have been extensively studied during the past decades [4, 6, 11–13, 15, 20, 24, 26, 32, 33, 35, 36, 38, 40, 45].

As the earliest relevance feedback approach in information retrieval, the Rocchio method [34] was proposed in the vector processing…


Page 8

TASK1: EXAMPLE

(The Figure 2 term lists and paper excerpt are repeated from Page 7.)

➤ Some terms are not discriminating (let's call them general terms!)

Page 9

TASK1: EXAMPLE

(The Figure 2 term lists and paper excerpt are repeated from Page 7.)

➤ Some terms are not discriminating (let's call them general terms!)

➤ Mixture model

➤ Some terms are not general (let's call them specific terms!)

Page 10

TASK1: EXAMPLE

(The Figure 2 term lists and paper excerpt are repeated from Page 7.)

➤ Some terms are not discriminating (let's call them general terms!)

➤ Mixture model

➤ Some terms are not general (let's call them specific terms!)

Page 11

LUHNIAN MODEL

➤ Terms that are neither specific nor general?

Page 12

LUHNIAN MODEL

➤ Terms that are neither specific nor general?

Significant Words Language Models

➤ Luhn Model, back in 1958.
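Luhn's original idea reduces to two frequency cut-offs, as the Hiemstra et al. quote on Page 7 recounts. A sketch (the cut-off values are whatever the analyst chooses; the function name is an assumption):

def luhn_significant_words(term_freqs, lower, upper):
    # Words above `upper` are too common, words below `lower` too rare;
    # only the words in between count as significant.
    return {w: f for w, f in term_freqs.items() if lower <= f <= upper}

# e.g. luhn_significant_words({"the": 120, "prize": 9, "dehmelt": 1}, lower=2, upper=40)
# -> {"prize": 9}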


Page 13

LUHNIAN MODEL

➤ Terms that are neither specific nor general?

➤ Toy example:

[Figure: a toy “set of docs” over the terms a, b, c, d, shown inside a larger collection that also contains documents with the rare terms x, w, y, z, s, u, t, v and many more occurrences of d.]

Significant Words Language Models

➤ Luhn Model, back in 1958.

Page 14

LUHNIAN MODEL

➤ Terms that are neither specific nor general?

➤ Toy example:

[Figure: the toy set of docs within the collection, as on the previous slide.]

[Bar chart: estimated probabilities of the terms A, B, C, D in the set of docs, on a 0.0–0.50 scale.]

Significant Words Language Models

➤ Luhn Model, back in 1958.
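A sketch of the maximum-likelihood estimates behind such a chart, assuming the slide's set of docs is four documents of roughly “a b c d d” (a reading of the garbled figure, so the exact numbers are an assumption):

from collections import Counter

def mle_lm(docs):
    # Maximum-likelihood term distribution, pooling the docs into one bag of words.
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: round(n / total, 3) for w, n in counts.items()}

set_of_docs = [list("abcdd"), list("abcdd"), list("abcdd"), list("abddd")]
collection  = set_of_docs + ["x w d y z d d d d".split(), "s u d t v d d d d".split()]

print(mle_lm(set_of_docs))  # {'a': 0.2, 'b': 0.2, 'c': 0.15, 'd': 0.45}
print(mle_lm(collection))   # 'd' dominates even more; rare terms x, w, ... appear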

Page 15

LUHNIAN MODEL

(Same toy example and chart as on the previous slide.)

Page 16

LUHNIAN MODEL

➤ Terms that are neither specific nor general?

➤ Toy example:

[Figure: the toy set of docs within the collection, as before.]

[Bar chart: re-estimated probabilities of the terms A, B, C, D, on a 0.0–0.70 scale.]

Significant Words Language Models

➤ Luhn Model, back in 1958.

Page 17

LUHNIAN MODEL

➤ Terms that are neither specific nor general?

➤ Toy example:

[Figure: the toy set of docs within the collection, as before.]

[Two bar charts comparing term distributions for A, B, C, D, both on a 0.0–0.70 scale.]

Significant Words Language Models

➤ Luhn Model, back in 1958.

Page 18

GENERAL IDEA


➤ SWLM assumes that terms in the feedback documents are drawn from three models:

➤ 1. General model

➤ 2. Specific model

➤ 3. Significant Words model

➤ Capturing the “mutual notion of relevance”

➤ Having a representation of relevance which is not only distinctive, but also supported by all the feedback documents.
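The generative assumption in the first bullet, as a sketch (the distributions and mixture weights here are made up for illustration, not the paper's estimated values):

import random

def sample_term(general_lm, specific_lm, swlm, weights=(0.3, 0.2, 0.5)):
    # Each term occurrence in a feedback document is assumed to be drawn from
    # the general, specific, or significant-words model, picked by mixture weight.
    model = random.choices([general_lm, specific_lm, swlm], weights=weights)[0]
    terms, probs = zip(*model.items())
    return random.choices(terms, weights=probs)[0]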

(The Figure 2 term lists and paper excerpt are repeated from Page 7.)

As can be seen, considering feedback documents as a mixture ofrelevance model and collection model, SMM penalizes some gen-

eral terms like “time” and “year” by decreasing their probabilities.However, since some frequent words in the feedback set are notfrequent in the whole collection, their probabilities are boosted, like“Palestinian” and “Arafat”, while they are not good indicators for thewhole feedback set. The point is although these terms are frequentlyobserved, they only occur in some feedback documents not mostof them, which means that they are in fact “specific” terms, notdistinctive terms. Considering both general terms and specific terms,SWLM estimates the significant language model reflecting mutualnotion of relevance.

Generally the main aim of this paper is to develop an approach toestimate a robust model from a set of documents that captures all,and only, the essential shared commonalities of these documents.Having the task of feedback in information retrieval as the applica-tion, we break this down into three concrete research questions:

RQ1 How to estimate significant words language model for a setof feedback documents capturing the mutual notion of rele-vance?

RQ2 How effective is significant words language model in (pseudo)relevance feedback?

RQ3 How significant words language model prevents the feedbackmodel to be affected by non-relevant terms of non-relevant orpartially relevant feedback documents?

The rest of the paper is structured as follows. First, in Section 2 wereview related work. Then, we explain our approach for estimatingsignificant words language models in Section 3. Sections 4, 5,and 6present the experimental setup, the results of the experiments on thetask of (pseudo) relevance feedback, and comprehensive analysison the robustness of the proposed approach. Finally, Section 7concludes the paper and discusses extensions as future work.

2. RELATED WORKIn this section, we discuss about related studies in the problem

of feedback in information retrieval. First, we talk about differentfeedback approaches in particular methods in the language modelingframework. Then, after discussing initiatives focusing on the task ofRF and PRF, we will discuss the main challenges of this tasks andsome already proposed methods addressed them.

It has been shown that there is a limitation on providing increas-ingly better results for retrieval systems only based on the originalquery [41]. So, it is crucial to reformulate the search request us-ing terms which reflect the user’s information need to improve theperformance of the retrieval systems. To address this issue, auto-matic feedback methods for information retrieval were introducedfifty years ago [34] and have been extensively studied during pastdecades [4, 6, 11–13, 15, 20, 24, 26, 32, 33, 35, 36, 38, 40, 45].

As the earliest relevance feedback approach in information retrieval, the Rocchio method [34] was proposed in the vector processing framework.

Figure 2: Extracting significant terms from relevant feedback documents (topic 374 of the TREC Robust04 test collection: "Nobel prize winners").

Standard-LM: prize 5.55e-02, nobel 3.36e-02, physics 2.35e-02, science 2.18e-02, ..., time 1.68e-02, ..., palestinian 1.34e-02, year 1.34e-02, ...
General-LM: new 3.70e-03, cent 2.98e-03, two 2.97e-03, dollars 2.76e-03, people 2.71e-03, ..., time 2.47e-03, ..., year 2.16e-03, ...
SMM [45]: prize 6.07e-02, nobel 4.37e-02, awards 3.43e-02, chemistry 3.23e-02, physics 2.82e-02, palestinian 2.18e-02, cesium 2.09e-02, arafat 1.94e-02, university 1.92e-02, ...
Specific-LM: insulin 2.25e-02, palestinian 2.15e-02, dehmelt 1.81e-02, oscillations 1.79e-02, waxman 1.69e-02, marcus 1.69e-02, attack 1.61e-02, ..., arafat 1.29e-02, ...
SWLM: prize 6.02e-02, nobel 4.53e-02, science 2.68e-02, award 2.43e-02, physics 1.94e-02, winner 1.90e-02, won 1.80e-02, peace 1.80e-02, discovery 1.71e-02, ...

Later work addressed this within language-modeling-based feedback systems, like mixture models [45] and parsimonious language models [16]. These approaches tried to make the feedback model more distinctive by eliminating the effect of common terms from the model; however, instead of using fixed frequency cut-offs, they made use of a more advanced way to do this. Hiemstra et al. state the following in their paper:

[. . .] our approach bears some resemblance with early work on information retrieval by Luhn, who specifies two word frequency cut-offs, an upper and a lower, to exclude non-significant words. The words exceeding the upper cut-off are considered to be common and those below the lower cut-off rare, and therefore not contributing significantly to the content of the document. Unlike Luhn, we do not exclude rare words and we do not have simple frequency cut-offs [. . .]

In a way, this paper completes the cycle, following the vision of Luhn. We introduce a meaningful translation of specificity and generality against significance in the context of the feedback problem, and propose an effective way of establishing a representation consisting of significant words, by parsimonizing the feedback model toward not only the common observations but also the rare observations.

Generally speaking, SWLM tries to estimate a language model from the set of feedback documents which is "specific" enough to distinguish the features of the feedback documents from other documents, by removing general terms, and at the same time "general" enough to capture all the shared features of the feedback documents as the notion of relevance, by excluding document-specific terms. To do so, SWLM assumes that terms in the feedback documents are drawn from three models: 1. the general model, representative of common observations; 2. the specific model, representative of partial observations; and 3. the relevance model, a latent model representing the notion of relevance. It then tries to extract the latent relevance model as the feedback language model.
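Spelled out, this amounts to a three-component mixture per feedback document; a sketch in our notation (the \lambda's are the per-document mixing weights that appear later in Figures 4 and 6):

p(t \mid d) = \lambda_{d,sw}\, p(t \mid \theta_{sw}) + \lambda_{d,g}\, p(t \mid \theta_{g}) + \lambda_{d,s}\, p(t \mid \theta_{s}), \qquad \lambda_{d,sw} + \lambda_{d,g} + \lambda_{d,s} = 1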


9

Page 19: MOSTAFA DEHGHANI SWLM: NEITHER GENERAL, NOR SPECIFIC, … · TASK1: EXAMPLE @m__dehghani Example: Top 50 terms in the language model estimated from the set of top seven relevant documents

HOW TO ESTIMATE SWLM?

@m__dehghani

10

Latent Variables: the per-document mixing weights \lambda_{d,sw}, \lambda_{d,g}, and \lambda_{d,s}, together with the significant words model \theta_{sw} itself.

Specific-LM: the probability of term t being important in one of the document models but not the others, marginalizing over all the documents.
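Taken literally, this description admits a formalization along the following lines; a hedged sketch of one plausible reading, not the paper's exact equation (\theta_{d} denotes the model of document d, F the feedback set):

p(t \mid \theta_{s}) \propto \sum_{d \in F} p(t \mid \theta_{d}) \prod_{d' \in F,\, d' \neq d} \bigl(1 - p(t \mid \theta_{d'})\bigr)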

We employ maximum a posteriori estimation to fit the model to the feedback documents, using a conjugate Dirichlet prior on \theta_{sw} built from the query model (Eq. 9 and Eq. 10 in the excerpt below), and solve the resulting problem with EM:

E-Step: compute, for each document and term, the posterior probability that the term was generated by the significant words model, the general model, or the specific model.

M-Step: re-estimate \theta_{sw} and the per-document mixing weights from these posteriors.

10
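A minimal sketch of this EM estimation, assuming fixed general and specific models, per-document mixing weights, and multinomial document models; the function and variable names are ours, not the paper's:

import numpy as np

def estimate_swlm(counts, p_general, p_specific, n_iter=50, eps=1e-12):
    # counts:     (D, V) term-count matrix of the D feedback documents
    # p_general:  (V,) fixed general (collection) model, sums to 1
    # p_specific: (V,) fixed specific model, sums to 1
    D, V = counts.shape
    p_sw = counts.sum(axis=0) / counts.sum()      # init theta_sw from pooled MLE
    lam = np.full((D, 3), 1.0 / 3.0)              # weights: [sw, general, specific]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each document/term pair
        comp = np.stack([np.tile(p_sw, (D, 1)),
                         np.tile(p_general, (D, 1)),
                         np.tile(p_specific, (D, 1))])        # (3, D, V)
        joint = lam.T[:, :, None] * comp
        resp = joint / (joint.sum(axis=0, keepdims=True) + eps)
        # M-step: re-estimate the latent significant words model ...
        p_sw = (counts * resp[0]).sum(axis=0)
        p_sw /= p_sw.sum() + eps
        # ... and the per-document mixing weights
        weighted = (counts[None] * resp).sum(axis=2).T        # (D, 3)
        lam = weighted / (weighted.sum(axis=1, keepdims=True) + eps)
    return p_sw, lam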

[Figure 3: Performance of employing user preferences-based and group-based customization on the contextual suggestion task. Bar chart of MAP for Group, Group+Preferences, and Preferences suggestions over the grouping criteria age, gender, group, t-type, t-duration, season, and preferences.]

Figure 2 represents the plate notation of SWLM. As shown, for each document the contribution of each of the three models, the \lambda's, is estimated. The general model, \theta_{g}, and the specific model, \theta_{s}, are treated as external observations, which enter the estimation process as infinitely strong priors.

4. MODEL'S APPLICATIONS
In order to assess the effectiveness of SWLM, we have employed it in different applications. In the following sections, we briefly discuss these applications, prompting RQ2: "What are the different applications of SWLM and how effective is it in these applications?"

4.1 Group Profiling
In this section, we address the question: "How to employ SWLM for group profiling, and how effective are the models for content customization?"

Group profiling aims to understand and model the characteristics of a group of objects. One of its important applications is content customization, which generally is the process of tailoring content to individual users' characteristics or preferences. In content customization, using individual preferences is not always possible: sometimes there is a new user in the system with no historical interactions and no rich preference information, or the user is not able to state his/her preferences explicitly. In these situations, group-based content customization can suggest content to the user based on the preferences of the groups the user belongs to.

We propose to use SWLM to extract the 'abstract' group-level latent model that captures all, and only, the essential features of the whole group. We have employed the resulting models in the task of contextual suggestion. Analysing different grouping criteria using the TREC 2015 contextual suggestion1 batch task dataset, we find that group-based suggestions using SWLM improve the performance of content customization [5]. Figure 3 reports the results of one of our experiments evaluating the performance of group-based suggestion with different grouping approaches, individual preferences-based suggestion, and combinations of the two.

4.2 Feedback
One of the applications in which applying SWLM leads to a better model is the feedback problem. In this section, we address the questions: "How to employ SWLM in (pseudo-)relevance feedback? How does it prevent the feedback model from being affected by non-relevant terms of non-relevant or partially relevant feedback documents?"

1 https://sites.google.com/site/treccontext/trec-2015

Table 1: Performance of different systems on pseudo-relevance feedback on different datasets. Baseline methods are maximum likelihood estimation without feedback (MLE) [10], the simple mixture model (SMM) [16], the relevance models (RM3) [11], and the maximum-entropy divergence minimization model (MEDMM) [13]. * indicates that the improvements over all other runs are significant at the 0.05 level using the two-tailed t-test.

          Robust04          WT10G             GOV2
Method    MAP      P@10     MAP      P@10     MAP      P@10
MLE       0.2501   0.4253   0.2058   0.3031   0.3037   0.5147
SMM       0.2745   0.4381   0.2087   0.3159   0.3140   0.5163
RM3       0.2732   0.4626   0.2291   0.3215   0.3245   0.5236
MEDMM     0.2842   0.4700   0.2308   0.3258   0.3311   0.5287
RSWLM     0.2874   0.4681   0.2407*  0.3346   0.3421*  0.5366

[Figure 4: Dealing with poison pills: effectiveness of different feedback systems facing a bad relevant document in topic 374 of TREC Robust04. Top panel: average precision of SMM, DMM, RM3, and RM4 as feedback documents 0-10 are added; bottom panel: the \lambda's in SWLM (\lambda_{d,sw}, \lambda_{d,g}, \lambda_{d,s}).]

The main goal of feedback systems is to extract a feedback model from a set of feedback documents, where the model represents the "relevant" documents. However, the existence of documents with a broader topic or multiple topics in the feedback set (both in relevance feedback and pseudo-relevance feedback) can distract the feedback model by adding bad expansion terms, leading to topic drift [7]. Using SWLM for estimating the feedback model enables us to tackle this challenge and extract a language model of the feedback documents capturing the essential terms representing the mutual notion of relevance, i.e., a representation of relevance that is not only distinctive but also supported by all the feedback documents.

It is clear how SWLM can be used to estimate a language model from the set of feedback documents. However, the original estimation process does not take information from the query into account when estimating the feedback model. In order to involve information from the original query, inspired by the work of Tao and Zhai [14], we modify the estimation process and incorporate the extra knowledge from the query model by defining a prior parameter, employing maximum a posteriori estimation to fit the model to the feedback documents, and solving the following problem:

\Theta^{*} = \arg\max_{\Theta}\, p(D \mid \Theta)\, p(\Theta) \qquad (9)

We define a conjugate Dirichlet prior on \theta_{sw} as follows:

p(\theta_{sw}) \propto \prod_{t \in V} p(t \mid \theta_{sw})^{\,\mu\, p(t \mid \theta_{q})} \qquad (10)

where \theta_{q} is the query model and \mu is the prior parameter controlling its influence.
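With a prior of this form, the MAP solution typically changes only the M-step of the EM procedure: the expected counts for \theta_{sw} are smoothed with pseudo-counts from the query model. A hedged sketch under that standard result, writing c(t, d) for the count of t in d and r_{sw}(t, d) for the E-step responsibility of the significant words component:

p(t \mid \theta_{sw}) = \frac{\sum_{d \in F} c(t,d)\, r_{sw}(t,d) + \mu\, p(t \mid \theta_{q})}{\sum_{t' \in V} \bigl( \sum_{d \in F} c(t',d)\, r_{sw}(t',d) + \mu\, p(t' \mid \theta_{q}) \bigr)}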



Page 29: MOSTAFA DEHGHANI SWLM: NEITHER GENERAL, NOR SPECIFIC, … · TASK1: EXAMPLE @m__dehghani Example: Top 50 terms in the language model estimated from the set of top seven relevant documents

SWLM APPLICATIONS: REGULARIZED SWLM

Mostafa Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx. “Luhn Revisited: Significant Words Language Models”, in CIKM’16. Mostafa Dehghani, S. Abnar, and J. Kamps. “The Healing Power of Poison: Helpful Non-relevant Documents in Feedback”, in CIKM’16. @m__dehghani

➤ Poison Pills: relevant documents that hurt the performance of retrieval after feedback:

➤ More than 5% of all relevant documents perform poorly

➤ In one third of all topics there exists at least one bad relevant document

Robustness of different systems against bad relevant documents based on robustness index

Dealing with poison pills: effectiveness of different feedback systems facing a bad relevant document in topic 374 of TREC Robust04

Table 4: Robustness of different systems against bad relevant documents based on the RI_d measure.

Dataset    SMM     DMM     RM3     RM4     RMM     MEDMM   SWLM    RSWLM
Robust04   0.8661  0.7845  0.8721  0.8618  0.8833  0.8944  0.9313  0.9365
WT10G      0.8476  0.8143  0.8833  0.8952  0.8952  0.9071  0.9571  0.9619
GOV2       0.8503  0.8054  0.8748  0.8558  0.8980  0.8939  0.9361  0.9170

6.2 Dealing with Poison Pills
Although it has been shown that, on average, the overall performance improves after feedback, for some topics employing some documents may decrease the average precision of the initial run. As discussed, in PRF this can be because the harming feedback documents are not relevant. However, this can also happen in RF: although the harming feedback document is relevant, there may be only a subset of it containing relevant information, so adding off-topic terms from this document to the query hurts the retrieval performance. These relevant documents that hurt the performance of retrieval after feedback are called "poison pills" [12, 39, 43].

Terra and Warren [39] studied the effect of poison pills. They used a single relevant document for feedback with several systems to find documents that make the precision drop in all systems. They showed that more than 5% of all relevant documents perform poorly and that in one third of all topics there exists at least one bad relevant document which can decrease the performance of retrieval after relevance feedback.

We have investigated this effect in the multiple-feedback-document experiments. In these experiments, for each topic with more than ten relevant documents, we add relevant documents one by one, based on their ranking in the initial run, to the relevance feedback set, and track the change in retrieval performance after feedback.

To evaluate the robustness of different systems against bad relevant documents, we adapt the Robustness Index (RI) measure [8] to the document level. For a set of tested relevant documents D_r, we define

RI_d = \frac{N_r^{+} - N_r^{-}}{|D_r|}

where N_r^{+} and N_r^{-} denote the number of relevant documents whose addition to the feedback set, in the above setting, respectively enhances or diminishes the retrieval performance in terms of AP, compared to not including them, and |D_r| is the total number of tested relevant documents. The higher the value of RI_d, the more robust the method is. Table 4 presents the RI_d of different systems on different datasets. As can be seen, both systems based on significant words language models are strongly robust against the bad effects of relevant documents.

Furthermore, we have looked into the results of the experiments in all the collections and extracted the set of relevant documents whose addition to the feedback set decreases the performance of feedback in all the baseline systems; these are the poison pills. Overall, we found 118 poison pills, and we observed that the performance of RSWLM in these situations always has the least drop; in 92% of the cases, it provides the best average precision after adding the poison pill.

As discussed by Terra and Warren [39], poison pills are usually relevant documents which have either a broad topic or several topics. In these situations, employing significant words language models enables the feedback system to control the contribution of these documents and prevents their specific or general terms from affecting the feedback model. Figure 6 shows how using significant words language models empowers the feedback system to deal with poison pills. In this figure, the performance of the different systems on topic 374 of the Robust04 dataset is illustrated. As can be seen, adding the seventh relevant document to the feedback set leads to a substantial drop in the performance of feedback in all the systems. The query is "Nobel prize winners" and the seventh document talks about the Nobel peace prize, but at the end it has a discussion concerning Middle East issues, which contains some highly frequent terms that are non-relevant to the query (see Figure 2).

[Figure 6: Dealing with poison pills: effectiveness of different feedback systems facing a bad relevant document in topic 374 of TREC Robust04. Top panel: average precision of SMM, DMM, RM3, RM4, RMM, MEDMM, SWLM, and RSWLM as feedback documents 0-10 are added; bottom panel: the \lambda's in SWLM (\lambda_{d,r}, \lambda_{d,g}, \lambda_{d,s}).]

However, RSWLM and SWLM are able to distinguish this document as a poison pill and, by reducing its contribution to the feedback model (i.e., a low value of \lambda_{d7,r}), they prevent the severe drop in feedback performance.

So, as an additional benefit, our proposed method can be seen as an automatic way to determine whether adding a specific relevant document to the feedback set hurts the retrieval performance for a specific topic, and it automatically takes benefit even from poisonous relevant feedback documents.

6.3 Sensitivity to the number of feedback documents
In order to investigate the sensitivity of our proposed method to the number of documents, we plot the performance of SWLM and RSWLM with regard to the number of documents in the feedback set for pseudo-relevance feedback in Figure 7. As can be seen, both methods are acceptably robust. SWLM is more sensitive, especially in the Web collections: when low-ranked documents are added, it is slightly affected by noise. RSWLM, however, is strongly robust and less sensitive to the number of feedback documents.

Furthermore, according to Figure 7, the performance of both systems in all collections is best when the number of feedback documents is around 10, an observation also valid for other feedback methods [24]. Moreover, this observation is in accordance with the charts in Figure 4, in which the top-10 documents always possess a strong contribution from the relevance model, i.e., high values of \lambda_{d,r}.

In this section, we discussed the robustness of SWLM from different points of view through different experiments, addressing in detail the question "How do significant words language models prevent the feedback model from being affected by non-relevant terms of non-relevant or partially relevant feedback documents?" The results show that SWLM and RSWLM provide robustness against non-relevant or partially relevant documents in PRF and against poison pills in RF. Furthermore, we demonstrate that the performance of SWLM remains stable across different numbers of feedback documents.

7. CONCLUSIONS
This paper concerns the problem of using feedback information to improve the performance of information retrieval. The main aim of this paper was to develop an approach to estimate a robust model from a set of documents that captures all, and only, the essential shared commonalities of these documents.

[Figure 5: Divergence of true relevance feedback models and pseudo-relevance feedback models in different systems, for queries with different ratios of relevant documents in the top-10 results. Panels: (a) Robust04, (b) WT10G, (c) GOV2; x-axis: ratio of relevant documents (0.1-1.0), y-axis: JS-divergence; systems: SMM, DMM, RM3, RM4, RMM, MEDMM, SWLM, RSWLM.]

In WT10G and GOV2, non-relevant retrieved documents are farther from the relevant retrieved documents than in the Robust04 dataset. According to the charts in Figure 5, in all collections SWLM and RSWLM have the least divergence at all ratios. This means our proposed models are more robust against being distracted by non-relevant documents. An interesting observation is that in all the collections, the behavior of SWLM and RSWLM is almost the same when at least half of the documents are relevant. In other words, we do not need regularization if at least half of the documents are of interest to the query's topic, either completely or partially.
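The divergences in Figure 5 compare the feedback model estimated from pseudo-relevant documents with the one estimated from true relevant documents. A minimal Jensen-Shannon divergence sketch over two aligned term distributions (our implementation, assuming the standard base-2 definition):

import numpy as np

def js_divergence(p, q, eps=1e-12):
    # p, q: aligned term-probability vectors over the same vocabulary
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))  # KL divergence in bits
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)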


11

Page 30: MOSTAFA DEHGHANI SWLM: NEITHER GENERAL, NOR SPECIFIC, … · TASK1: EXAMPLE @m__dehghani Example: Top 50 terms in the language model estimated from the set of top seven relevant documents

SWLM AS AN ANALYTICAL TOOL

▸ Score Decomposition:

▸ Dynamically determining the contribution of each document

Mostafa Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx. “Luhn Revisited: Significant Words Language Models”, Under submission in CIKM’16.

[Figure 4: Contribution of each of the relevance, general, and specific models in the top-100 documents as the feedback set, according to the \lambda's learned in the RSWLM (the average over all the queries). Panels: (a) Robust04, (b) WT10G, (c) GOV2; ternary plots over \lambda_{d,sw}, \lambda_{d,g}, and \lambda_{d,s}, shaded by document rank 1-100.]

technique in the evaluation of the results of these experiments [6, 10, 32]. In addition to the above metrics, we also report the robustness index, RI(Q), also called the reliability of improvement [8]. For a set of queries Q, the RI measure is defined as

RI(Q) = \frac{N^{+} - N^{-}}{|Q|}

where N^{+} is the number of queries helped by the feedback method and N^{-} is the number of queries hurt.

In our experiments, as baseline methods we have used the most popular unsupervised state-of-the-art methods for the feedback task proposed in the language modeling framework: maximum likelihood estimation without feedback (MLE) [17], the simple mixture model (SMM) [41], the divergence minimization model (DMM) [41], the relevance models (RM3 and RM4) [1, 18], the regularized mixture model (RMM) [34], and the maximum-entropy divergence minimization model (MEDMM) [24].

5. SWLM FOR FEEDBACK
In this section, we investigate our second research question: "How effective are significant words language models in (pseudo-)relevance feedback?" We report experimental results indicating the effectiveness of significant words language models and regularized significant words language models on the tasks of pseudo-relevance feedback (PRF) and true relevance feedback (TRF), and compare them to the baseline methods.

5.1 Pseudo Relevance Feedback
Pseudo-relevance feedback aims to expand the query to improve retrieval performance when no relevance judgements are available. In PRF, the underlying assumption is that the initial retrieval yields relevant documents which can be used to refine the query. Thus, assuming the top-ranked documents F = {d_1, . . . , d_|F|} from the initial run are relevant, the feedback model \theta_{F} is estimated and used for query expansion. Table 2 presents the results of employing significant words language models and regularized significant words language models as the feedback model, as well as the baseline methods, on the task of PRF. As can be seen, RSWLM significantly outperforms all the baselines in terms of MAP on WT10G and GOV2, which are noisy Web collections.3 Furthermore, it has the highest reliability of improvement in terms of the Robustness Index in all the collections. In the PRF task, RSWLM works better than SWLM, as it guides the estimator of the feedback model toward the query model and prevents it from being distracted by the noise of non-relevant documents.

3 Note that we only indicate when (R)SWLM is significantly better than all baseline methods; they are always significantly better than the non-expansion MLE baseline.
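Once \theta_{F} is estimated, query expansion in this framework typically interpolates it with the original query model before re-scoring; a hedged sketch, with the interpolation weight alpha and the top_k truncation as assumed parameters rather than values from the paper:

def expand_query(p_query, p_feedback, alpha=0.5, top_k=20):
    # p_query, p_feedback: dicts mapping term -> probability
    # keep only the strongest feedback terms, a common practical choice
    top = dict(sorted(p_feedback.items(), key=lambda kv: -kv[1])[:top_k])
    z = sum(top.values())
    top = {t: p / z for t, p in top.items()}  # renormalize truncated model
    terms = set(p_query) | set(top)
    return {t: (1 - alpha) * p_query.get(t, 0.0) + alpha * top.get(t, 0.0)
            for t in terms}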

Table 2: Performance of different systems on the task of PRF. * indicates that the improvements over no FB (MLE) and all the baseline feedback methods are statistically significant at the 0.05 level using the paired two-tailed t-test with Bonferroni correction.

          Robust04                 WT10G                    GOV2
Method    MAP      P@10    RI     MAP      P@10    RI      MAP      P@10    RI
MLE       0.2501   0.4253  n/a    0.2058   0.3031  n/a     0.3037   0.5147  n/a
SMM       0.2787   0.4416  0.37   0.2193   0.3264  0.23    0.3214   0.5230  0.41
DMM       0.2701   0.4370  0.31   0.2184   0.3170  0.14    0.3026   0.5211  0.29
RM3       0.2937   0.4683  0.40   0.2406   0.3317  0.26    0.3417   0.5360  0.45
RM4       0.2690   0.4402  0.32   0.2323   0.3273  0.18    0.3316   0.5208  0.37
RMM       0.2681   0.4384  0.28   0.2222   0.3209  0.21    0.3112   0.5193  0.33
MEDMM     0.2961   0.4719  0.45   0.2413   0.3440  0.25    0.3396   0.5377  0.43
SWLM      0.2918   0.4674  0.47   0.2462   0.3377  0.28    0.3423   0.5316  0.50
RSWLM     0.2945   0.4704  0.47   0.2506*  0.3427  0.31    0.3510*  0.5419  0.53

Although it has been shown that PRF always improves the average performance of retrieval [11], under some parameter settings it decreases the average precision for some topics. This is due to the fact that there might be non-relevant documents in the feedback set containing non-relevant terms, resulting in topic drift in the extracted feedback model [5, 12, 13]. Thus, as one of the main challenges in PRF, it is necessary to control the contribution of each feedback document to the feedback model based on its merit [13] for a specific query.

5.2 Relevance Decomposition
Significant words language models empower our proposed feedback method to dynamically determine the quality of each document. Figure 4 addresses the question "How does SWLM control the contribution of feedback documents to the feedback model based on their level of relevancy?" In this figure, as a sample, we take the top-100 documents as the feedback set and illustrate the average contribution of each of the significant words, general, and specific models in these documents, according to the \lambda's learned in the regularized significant words language models.

It is an interesting observation that in all the collections, the trend of change in the contribution of the three models is similar. In most cases, as the ranking goes down, the contribution of the significant words model decreases, in accordance with the relevance probability of documents based on their ranking. However, this decay is slower in the Robust04 dataset compared to the WT10G and GOV2 datasets. This is likely because Robust04 contains newswire articles, which are typically high-quality text with little noise, in contrast to WT10G and GOV2, which are web collections.

HIERARCHICAL SWLM

▸ Modelling Hierarchical Entities: Specification and Generalisation

Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. “Two-Way Parsimonious Classification Models for Evolving Hierarchies”, CLEF’16.
Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. “On Horizontal and Vertical Separation in Hierarchical Text Classification”, ICTIR’16.

TASK2: HIERARCHICAL CLASSIFICATION

▸ Hierarchical Classification:

▸ Two-dimensionally separable

▸ Transferable models for classification of entities in evolving hierarchies

Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. “Two-Way Parsimonious Classification Models for Evolving Hierarchies”, CLEF’16.
Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. “On Horizontal and Vertical Separation in Hierarchical Text Classification”, ICTIR’16.


TASK3: CONTEXTUAL SUGGESTION

▸ MAIN GOAL:

▸ Estimating effective profiles for different objects in the data.

▸ For building a positive profile for each user (as a set of positively rated documents)

▸ For building a negative profile for each user (as a set of negatively rated documents)

▸ For building a profile for a group of users (as a set of users sharing a specific property) - to be used in group-based suggestions.



SWLM FOR CONTEXTUAL SUGGESTION

▸ As explained, SWLM is an effective approach to estimate a model representing a “set” of documents.

▸ In contextual suggestion:

▸ Profile for an individual user

▸ as the set of attractions rated by the user

▸ positively rated / negatively rated

▸ Profile for a group of users

▸ as the set of users


Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. “Generalized Group Profiling for Content Customization”, CHIIR’16.
S. H. Hashemi, Mostafa Dehghani, and J. Kamps. “Parsimonious User and Group Profiling in Venue Recommendation”, TREC’15.



TASK3: GROUP PROFILING

▸ Group Profiling: to understand and model the characteristics of a group of entities.

▸ Content Customisation: Contextual Suggestion

Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. “Generalized Group Profiling for Content Customization”, CHIIR’16.
S. H. Hashemi, Mostafa Dehghani, and J. Kamps. “Parsimonious User and Group Profiling in Venue Recommendation”, TREC’15.
Mostafa Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. “Significant Words Language Models for Contextual Suggestions”, TREC’16.

3. EXPERIMENTS

In this section, we present our experiments to evaluate the effectiveness of the estimated language models of groups in the task of contextual suggestion. Furthermore, we analyse the effect of group granularity on group profiling. We first explain the data collection used in our experiments and then present the evaluation results.

3.1 Data Collection

In this research, we have made use of the TREC 2015 contextual suggestion dataset (https://sites.google.com/site/treccontext/trec-2015). Contextual suggestion is the task of searching for complex information needs that are highly dependent on both context and user interests. The dataset contains information from 207 users, including their age, gender, and a set of rated places or activities as the user preferences (ratings are in the range of -1 to 4). The task is to generate a ranked list of suggestions from a set of candidate attractions, given the user information as well as some information about the context, including the location of the trip, trip season, trip type, trip duration, and the type of group the person is travelling with. For each user, we consider rated suggestions annotated with ratings greater than 2 as relevant. Furthermore, we generate the user language models as a mixture of their relevant preferences, weighted by the ratings (a sketch follows). Based on the information in the dataset, we divide users into several groups. Groupings are based on the user information and the context information. Table 1 presents the grouping criteria, the groups, and the number of users in each group.
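As a concrete illustration of the profile estimation just described, the sketch below builds a user language model as a rating-weighted mixture of the user's relevant (rating > 2) preferences. The function name, and the exact weighting beyond "considering the rates", are assumptions.

```python
from collections import Counter

def user_profile(rated_docs):
    """rated_docs: list of (tokens, rating) pairs; returns a unigram model."""
    profile = Counter()
    weight_sum = 0.0
    for tokens, rating in rated_docs:
        if rating <= 2:                 # only positively rated items count as relevant
            continue
        counts = Counter(tokens)
        total = sum(counts.values())
        for w, c in counts.items():
            profile[w] += rating * c / total   # document model weighted by its rating
        weight_sum += rating
    if weight_sum == 0:
        return {}
    return {w: p / weight_sum for w, p in profile.items()}
```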

3.2 Group Profiling for Contextual Suggestion

In this section, we investigate our second research question: “How effective are group profiles for customizing content suggestion for individual users?”

We generate group-based rankings of suggestions to evaluate the quality of group profiles in content customization. To this end, one of the grouping approaches given in Table 1 is chosen, e.g. grouping based on users' age. Then we estimate the language model of each group, employing the approach explained in Section 2. Afterwards, given the information of the request, i.e. the user information and context information, the group the user belongs to is selected, and based on the similarity of the language model of the selected group and the language model of each candidate, the ranked list of suggestions is generated; one way to implement this ranking step is sketched below.
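One plausible implementation, under assumed names, scores each candidate by the negative KL divergence between the selected group's language model and the candidate's language model; the exact similarity measure is not specified above, so treat this as one common choice rather than the paper's method.

```python
import math

def kl_score(group_model, cand_model, vocab, eps=1e-9):
    """Negative KL(group || candidate): higher means more similar."""
    return -sum(group_model.get(w, eps) *
                math.log(group_model.get(w, eps) / cand_model.get(w, eps))
                for w in vocab)

def rank_candidates(group_model, candidates):
    """candidates: dict mapping name -> unigram model; best match first."""
    vocab = set(group_model)
    for m in candidates.values():
        vocab |= set(m)
    return sorted(candidates,
                  key=lambda c: kl_score(group_model, candidates[c], vocab),
                  reverse=True)
```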

Besides the group-based ranking, we generate a ranked list of suggestions based on the preferences of the user as a baseline. To do so, a language model is estimated as the mixture of the models of the user's preferences, weighted by their ratings, and based on the similarity of this preference language model and the candidate language model, a ranked list is generated.

Furthermore, according to the explanation in Section 2, the contribution of each of the specific and group models to each user's preferences is learned as the model parameters, i.e. λ_{u,s} and λ_{u,g}. Having these parameters empowers us to efficiently combine the score of the group-based strategy with the score of the preferences-based strategy, considering both individual preferences and group preferences for content customization, as sketched below. To evaluate the quality of the combination, we have done experiments considering different grouping criteria.
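A minimal sketch of that combination, assuming the learned weights are available per user, is a convex mixture of the two ranking scores; the normalization is an assumption, not the stated formula.

```python
def combined_score(pref_score, group_score, lambda_us, lambda_ug):
    """Convex combination of preferences-based and group-based scores.

    lambda_us / lambda_ug are the learned per-user contributions of the
    specific and group models; a higher lambda_us means the user's own
    profile is trusted more.
    """
    total = lambda_us + lambda_ug
    if total == 0:
        return 0.5 * (pref_score + group_score)
    return (lambda_us * pref_score + lambda_ug * group_score) / total
```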


[Figure 1: Performance of employing user preferences-based and group-based customization on the contextual suggestion task. MAP per grouping criterion (age, gender, group, t-type, t-duration, season, preferences), with bars for Group, Group+Preferences, and Preferences. Improvements of combining the group-based and preferences-based approaches over the preferences-based approach and the corresponding group-based approaches are statistically significant based on a one-tailed t-test, with p-value < 0.05.]

Figure 1 presents the performance of employing the different grouping approaches for group-based suggestion, as well as preferences-based suggestion. The combinations of preferences-based and group-based suggestion are also reported.

As can be seen, among the group-based strategies, suggestion based on the duration of the trip is the most effective strategy. The age of the user and the type of group the user travels with are also rather important, while the type of the trip is not so important. This could be due to the fact that, most of the time, the user's interests and favourite attractions do not change based on the type of trip, which could be “business” or “holiday”. On the other hand, combining the preferences-based suggestions with group-based suggestions leads to improvement under all grouping strategies. This means that in case of an incomplete user profile, customizing the content based on the groups the user belongs to implicitly fills in the missing information and improves the performance of the suggestions. However, this depends on the quality of the group profiles, which should reflect the essential common (not general, not specific) characteristics of the groups.

3.3 Effect of Group Granularity

In this section, we investigate our third research question: “How does the user's group granularity affect the quality of the group's profile?”

In the grouping stage, users can sometimes be grouped at different levels of granularity. For example, given the age of users, discretization can be done by binning with different bin sizes (see the sketch below). In this section, we analyse the effect of the granularity of groups, and consequently the size of the groups given a fixed volume of training data, on the quality of group profiling.
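For instance, a minimal sketch of the binning knob, with the bin width chosen purely for illustration:

```python
def age_group(age, bin_width=10):
    """Map an age to a coarse group label; larger bins give coarser groups."""
    lo = (age // bin_width) * bin_width
    return f"{lo}-{lo + bin_width - 1}"
```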



EFFECT OF GROUP PROFILING

▸ We noticed that group-based suggestion helps more when users tend to rate attractions in a neutral way, compared to the case where users express extreme opinions in their ratings.

improve the performance. However, the improvement we get from employing group profiles estimated by SWLM is not significant. We looked into the data to see in which cases adding group information helps and in which cases it is not effective. We observed that there is a correlation between the amount of improvement in contextual suggestion using group information and the rating behavior of users.

To do so, we simply take the average rating that a user gave to different attractions as their general tendency of rating (sketched below). Figure 1 shows the scatter plot of the change in p@10 after employing group-based information against different rating tendencies. According to the plot, group-based information works better when the user has a neutral tendency in their rating (around rate 2), and it is less likely to help when users have rather strong biases, rating attractions with high or low rates. This could be due to the fact that, in the case of a neutral user, there is less strong information coming from their profile, and the group-based information compensates for this lack of strong signals.
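The analysis itself is simple enough to sketch: take each user's mean rating as their tendency and pair it with the observed precision change after adding group information (all names here are assumptions).

```python
def rating_tendency(ratings):
    """Mean of a user's ratings, used as their general rating tendency."""
    return sum(ratings) / len(ratings)

def tendency_vs_improvement(users):
    """users: list of (ratings, delta_precision) pairs -> scatter points."""
    return [(rating_tendency(r), d) for r, d in users]
```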

4. CONCLUSIONS

In this paper, we presented the participation of the University of Amsterdam, ExPoSe team, in the TREC 2016 Contextual Suggestion Track. We described our approach, which employs Significant Words Language Models (SWLM) [1] as an effective method for estimating models representing the significant features of sets of attractions as user profiles and sets of users as group profiles.

We had two main research questions. The first research question was “How can SWLM help to estimate better user profiles for contextual suggestion?” We observed that using SWLM, we are able to better estimate a model representing the set of preferences positively rated by a user as their profile, compared to using the standard language model as the profiling approach. We also found that when using negatively rated attractions as negative samples along with positively rated attractions as positive samples, we may lose performance if the standard language model is used as the profiling approach, while with SWLM, taking negatively rated attractions into consideration may help improve the quality of suggestions.

Our second research question was “In what conditions does the information of group profiles estimated using SWLM improve the performance of contextual suggestion?” We investigated the effect of employing information from the groups a user belongs to on the performance of suggestions provided for individual users. We noticed that group-based suggestion helps more when users tend to rate attractions in a neutral way, compared to the case where users are subjective in their rating behaviour.

5. ACKNOWLEDGMENTS

This research is funded in part by the Netherlands Organization for Scientific Research through the Exploratory Political Search project (ExPoSe, NWO CI # 314.99.108), and by the Digging into Data Challenge through the Digging Into Linked Parliamentary Data project (DiLiPaD, NWO Digging into Data # 600.006.014).

[Figure 1: Improvement of the performance of contextual suggestion with the help of group profiles for users with different rating behavior. x-axis: Average Rate (1 to 3); y-axis: Absolute P@5 Improvement (−0.1 to 0.2).]

References

[1] M. Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx. Luhn revisited: Significant words language models. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM’16), 2016.

[2] M. Dehghani, H. Azarbonyad, J. Kamps, and M. Marx. Generalized group profiling for content customization. In CHIIR’16, pages 245–248, 2016.

[3] S. H. Hashemi, M. Dehghani, and J. Kamps. Parsimonious user and group profiling in venue recommendation. In TREC 2015. NIST, 2015.



Thank you

Take-home message: don't build your model on propertyless common observations or unreliable rare observations; take the significant ones.

Luhn and Mostafa!