November 10, 2004, Dmitriy Fradkin, CIKM'04

A Design Space Approach to Analysis of Information Retrieval Adaptive Filtering Systems

Dmitriy Fradkin, Paul Kantor

DIMACS,

Rutgers University

What Is This Work About?

• Small-scale view: We analyze differences between two implementations of the Rocchio method and discuss the choice of parameters.

• Large-scale view: The problem of constructing an IR/AF system can be seen as an optimization problem in a large design space. (Well-known methods are simply points in this space.)

Large-Scale View

• Use optimization methods to find optimal choices of parameters. These optimal choices do not have to correspond to well-known methods or standard practices.

• Design space optimization methods have been suggested for designing VLSI chips [Bahuman et al. 2002], airplanes [Schwabacher and Gelsey 1996; Zha et al. 1996] and HVAC systems [Szykman 1997].

What’s in a name?

• We find that even a single “name” involves an enormous number of design choices.

• TREC2002 Adaptive Filtering:
– DIMACS: Rocchio method
– Chinese Academy of Sciences (CAS): Rocchio method

• One method performs almost twice as well as the other.

For any system:

• Choose Data Representation
• Construct Initial Classifier
• Training Phase:
– Incorporate labeled examples
– Supplement with “pseudo positives” and “pseudo negatives”
– Set the threshold
• Filtering Phase: as new documents arrive
– Evaluate performance
– Update the classifier model
– Update threshold
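The filtering phase listed above can be sketched as a loop. This is a minimal illustration, not either system's code: the function name `adaptive_filter` and its callable arguments are hypothetical, and it assumes that feedback is only used for documents the system submits.

```python
def adaptive_filter(stream, score, threshold, update_model, update_threshold):
    """Run the filtering phase over (document, label) pairs.

    A document is submitted when its score reaches the current threshold.
    The label (None would stand for an unjudged document) is only used
    for submitted documents.
    """
    submitted = []
    for doc, label in stream:
        if score(doc) >= threshold:
            submitted.append(doc)
            update_model(doc, label)
            threshold = update_threshold(threshold, label)
    return submitted, threshold


# Toy usage: documents are their own scores; the threshold is raised by 1
# after each non-relevant submission.
docs = [(9, True), (2, False), (7, False), (6, True)]
submitted, final_threshold = adaptive_filter(
    docs,
    score=lambda d: d,
    threshold=6,
    update_model=lambda d, label: None,
    update_threshold=lambda t, label: t if label else t + 1,
)
```

In the toy run the last document is relevant but is no longer submitted, because the threshold was raised after the preceding non-relevant submission.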

All of these are usually:

• Characterized informally, as a choice, and the exclusion of alternatives.

• Seen as points on a map – but to understand the significance of these choices we need to explore the real territory.

• So: we must interpolate between the choices made in one method and those made in another.

Interpolation

• Identify the corresponding design decisions

• Develop a “path” between them – sometimes called a “homotopy” from the topological concept of smoothly distorting one shape (say, a coffee cup) into another (say, a doughnut).

• Study the effectiveness along various paths among design options.
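A straight line is the simplest path between two systems. The sketch below assumes every design decision has already been reduced to a numeric parameter; the function name `interpolate_design` and the parameter names in the example are illustrative only.

```python
def interpolate_design(design_a, design_b, lam):
    """Linear homotopy between two designs given as dicts of numeric parameters.

    lam = 0 returns design_a and lam = 1 returns design_b.
    """
    if design_a.keys() != design_b.keys():
        raise ValueError("designs must define the same parameters")
    return {k: (1.0 - lam) * design_a[k] + lam * design_b[k] for k in design_a}


# Hypothetical endpoint settings, for illustration only.
system_a = {"neg_weight": 0.125, "quit_rate": 0.02}
system_b = {"neg_weight": 1.8, "quit_rate": 0.0}
midpoint = interpolate_design(system_a, system_b, 0.5)
```

Using a separate `lam` per parameter instead of one shared value gives the other paths through the design space.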

Interpolation Aspects for IR/AF

• Term Representation

• Term Weighting

• Computing Scores

• Setting Classifier Threshold

• Document Set Representation

• Pseudolabeled Documents in Training

Interpolation Aspects (cont.)

• Query Initialization

• Unjudged documents in test

• Query Update

• Quitting Strategy

Example: Term Representation

$$f(t,d) = \begin{cases} 1 + \log f'(t,d), & \text{if } f'(t,d) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $f'(t,d)$ is the number of times term $t$ occurs in document $d$.
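A minimal sketch of this dampened term frequency, 1 + log f'(t,d) for a positive count and 0 otherwise. The natural logarithm is an assumption, since the slide does not state the base, and the function name is illustrative.

```python
import math


def term_representation(raw_count):
    """Return 1 + log(raw_count) for a positive count, and 0 otherwise.

    Natural log is assumed; the base is not given on the slide.
    """
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0
```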

Example: Term Weighting

• DIMACS: $i_D(t)$
• CAS: $i_C(t)$
• Homotopy:

$$i(t, \lambda_i) = \lambda_i\, i_C(t) + (1 - \lambda_i)\, i_D(t)$$

Both $i_D(t)$ and $i_C(t)$ are functions of $i'(t)$, the number of documents, in training set $T$, containing term $t$.

Example: Score Computation

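• DIMACS:

$$s_D(d,q) = d^{\top} W q, \qquad w_D(t) = i(t,\lambda_i)$$

• CAS:

$$s_C(d,q) = \frac{d^{\top} W q}{\|d\|\,\|q\|}, \qquad w_C(t) = i(t,\lambda_i)^2$$

• Homotopy:

$$w(t,\lambda_w) = i(t,\lambda_i)^{1+\lambda_w} \text{ on the diagonal, } 0 \text{ elsewhere}$$

$W$ is a diagonal matrix of weights. The term weight $i(t,\lambda_i)$ is built from $i'(t)$, the number of documents, in training set $T$, containing term $t$.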
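The sketch below shows one reading of this slide, and the reading itself is an assumption: an unnormalised weighted inner product at one end, a length-normalised one at the other, and diagonal weights equal to the term weight raised to the power 1 + lambda_w. Sparse vectors are plain dicts, the norms are taken unweighted, and all function names are hypothetical.

```python
import math


def weighted_inner_product(doc, query, idf, lam_w):
    """Sum of d_t * w_t * q_t over shared terms, with w_t = idf_t ** (1 + lam_w)."""
    return sum(
        d_t * idf[t] ** (1.0 + lam_w) * query[t]
        for t, d_t in doc.items()
        if t in query
    )


def cosine_normalised(doc, query, idf, lam_w):
    """Weighted inner product divided by the (unweighted) vector norms."""
    norm_d = math.sqrt(sum(v * v for v in doc.values()))
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return weighted_inner_product(doc, query, idf, lam_w) / (norm_d * norm_q)
```

With `lam_w = 0` the diagonal weight is the term weight itself, and with `lam_w = 1` it is the square of the term weight.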

Example: Score Interpolation

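Homotopy:

$$s(s_C, s_D, \lambda_S) = \lambda_S\, \varphi(s_C) + (1 - \lambda_S)\, s_D$$

The same mapping $\varphi$, which depends on $m(d)$, is used for scores and for thresholds to go from the CAS scale to the DIMACS scale.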

Example: Setting Thresholds

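Threshold for query $q$ after seeing document $i$:

• DIMACS: $\tau_D(q,i)$ is chosen to optimize utility
• CAS: $\tau_C(q,i)$ is updated from the previous threshold by a three-case rule (utility is negative; a condition involving the constant 6000; otherwise). The rule uses the constant 0.005 and $z$, the number of documents seen since the last submission.
• Homotopy:

$$\tau(q,i,\lambda_S) = \lambda_S\, \varphi(\tau_C(q,i)) + (1 - \lambda_S)\, \tau_D(q,i)$$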

Example: Set Representation

Each system maps a set $S$ of document vectors $x$ to a single vector $v(S)$:

• DIMACS: $v(S)$
• CAS: $v(S)$
• Homotopy: $v(S, \lambda_r)$

Example: Pseudo-labeled Documents

• The CAS method does not use pseudo-labeled documents in the training stage.

• DIMACS method: Given “density” parameters (d+ and d-) and “proportion” (p+ and p-), score unlabeled training documents and choose top and bottom sets according to “proportion”. Then pick documents out of these sets according to corresponding “density”.

• Interpolate between density and proportion parameters (DIMACS) and 0 (CAS).

Example: Query Initialization

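General formula: $q_{init}$ is a weighted sum of $q_{terms}$ and of set vectors $v'(D)$ for the initial and pseudo-labeled document sets; the weights include $x'$ and $y'$.

• One endpoint sets the weights to 3, 1, $x' = 0$, $y' = 0$ and 0.
• The other endpoint sets the weights to 1, 1, $x' = 2/|D|$ and 0.
• Homotopy: each weight is interpolated between its two endpoint values.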

Example: Unjudged Documents

• A submitted document for which there is no label is “unjudged”. DIMACS ignores such documents. CAS treats such a document as pseudo-negative if its score is less than 0.6.

• Can view this as a threshold:

$$u(\lambda_u) = 0.6\,\lambda_u + 0 \cdot (1 - \lambda_u) = 0.6\,\lambda_u$$
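A small sketch of the unjudged-document rule as a threshold that moves from 0 to 0.6. It assumes scores are non-negative, so that a cutoff of 0 marks nothing as pseudo-negative and the document is ignored; the function names are illustrative.

```python
def unjudged_cutoff(lam_u, cas_cutoff=0.6, dimacs_cutoff=0.0):
    """Interpolated score cutoff below which an unjudged document is pseudo-negative."""
    return lam_u * cas_cutoff + (1.0 - lam_u) * dimacs_cutoff


def is_pseudo_negative(score, lam_u):
    """True if an unjudged document with this score is treated as a negative example."""
    return score < unjudged_cutoff(lam_u)
```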

Example: Query Update

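General formula:

$$q = \alpha\, q_{init} + \beta\, v(D^+) - \gamma\, v(D^-) + x\, v(D_p^+) - y\, v(D_p^-)$$

DIMACS: $\alpha = 1,\ \beta = 1,\ \gamma = 0.125,\ x = 0,\ y = 0.0125$

CAS: $\alpha = 1,\ \beta = 1,\ \gamma = 1.8,\ x = 0,\ y = 1.3$

Homotopy:

$$q(\lambda_\gamma, \lambda_y) = q_{init} + v(D^+) - \bigl(1.8\,\lambda_\gamma + 0.125\,(1 - \lambda_\gamma)\bigr)\, v(D^-) - \bigl(1.3\,\lambda_y + 0.0125\,(1 - \lambda_y)\bigr)\, v(D_p^-)$$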

Example: Quitting Strategy

• DIMACS: if after 50 submissions the utility is negative, stop submitting for this topic
• CAS: no quitting strategy

Alternatively: quit if, after submitting $1/\rho$ documents, the utility is negative.

• DIMACS: $\rho = 0.02$
• CAS: $\rho = 0$
• Homotopy: $\rho(\lambda_q) = 0.02\,(1 - \lambda_q)$
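The quitting rule can be sketched with a single rate parameter: 0.02 corresponds to quitting after 50 submissions, and 0 corresponds to never quitting. The names `should_quit` and `lam_q` are illustrative, and applying the test at or after the submission limit is an assumption.

```python
def should_quit(num_submitted, utility, lam_q, dimacs_rate=0.02):
    """Quit once 1/rate documents were submitted and utility is still negative.

    rate = dimacs_rate * (1 - lam_q); a rate of 0 means never quit.
    """
    rate = dimacs_rate * (1.0 - lam_q)
    if rate <= 0.0:
        return False
    return num_submitted >= 1.0 / rate and utility < 0
```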

Experimental Evaluation

• TREC11 Data - Reuters Corpus v1
• 23,000 training; 800,000 test
• 100 topics (50 assessor, 50 intersection)
• 3 positive and 0 negative examples per topic

$$\mathrm{T11SU} = \frac{\max(\mathrm{T11NU}, -0.5) + 0.5}{1.5}$$

$$\mathrm{T11NU} = \frac{2|D^+| - (|D^-| + |D^u|)}{2|T^+|}$$

$T^+$ - all positive documents; $D^+$ - submitted positive; $D^-$ - submitted negative; $D^u$ - submitted unlabelled
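A short sketch of the evaluation measure. It assumes the usual TREC-11 filtering definitions: utility of 2 per relevant and -1 per non-relevant submission, unlabelled submissions counted as non-relevant, normalisation by the maximum utility 2|T+|, and a lower bound of -0.5 before scaling to [0, 1]. The function names are illustrative.

```python
def t11nu(n_pos, n_neg, n_unlabelled, total_pos):
    """Normalised utility: (2*|D+| - (|D-| + |Du|)) / (2*|T+|)."""
    utility = 2 * n_pos - (n_neg + n_unlabelled)
    return utility / (2 * total_pos)


def t11su(nu, min_nu=-0.5):
    """Scaled utility: clip at min_nu, then map [min_nu, 1] onto [0, 1]."""
    return (max(nu, min_nu) - min_nu) / (1.0 - min_nu)
```

Under these definitions a system that submits nothing has T11NU = 0 and therefore T11SU = 1/3.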

Diagonal Interpolation

[Figure: average T11SU as a function of lambda, with and without quitting]

Lambda                         0      0.2    0.4    0.6    0.8    1      CAS
Average T11SU, no quitting     0.033  0.103  0.26   0.364  0.404  0.394  0.405
Average T11SU, with quitting   0.113  0.139  0.263  0.364  0.404  0.394  0.405

Documents Retrieved

Parameter Analysis

• It is possible to analyze the effect of individual parameters at each point in the space by taking “small steps” along each parameter axis.

• Requires a lot of computational effort

• Results may not be easy to interpret
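The “small steps” idea can be sketched as a one-at-a-time perturbation around a chosen point. `evaluate` stands for a full filtering run that returns a score, which is why the procedure is expensive; the function and argument names are hypothetical.

```python
def one_at_a_time(evaluate, point, step):
    """Evaluate the system after moving each parameter by -step and by +step.

    Returns a dict keyed by (parameter name, signed step).
    """
    results = {}
    for name in point:
        for delta in (-step, step):
            moved = dict(point)
            moved[name] = point[name] + delta
            results[(name, delta)] = evaluate(moved)
    return results
```

With ten interpolation parameters this costs twenty full runs per point.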

Example of Parameter Analysis

Effect of individual parameters on the number of relevant and nonrelevant documents retrieved around the 0.8 point:

$\lambda$            0.7                    0.8                    0.9
                     relevant  nonrelevant  relevant  nonrelevant  relevant  nonrelevant
$\lambda_\alpha$     2086      975          ...       ...          2089      1010
$\lambda_\gamma$     2273      1233         ...       ...          1923      830
$\lambda_p$          2043      1129         ...       ...          2014      939
$\lambda_y$          2106      1005         ...       ...          2029      948
$\lambda_u$          2065      1037         ...       ...          2071      977
$\lambda_i$          2062      977          2062      977          2062      977
$\lambda_w$          2055      1000         ...       ...          2119      1007
$\lambda_S$          2153      1021         ...       ...          2123      1031
$\lambda_r$          2037      993          ...       ...          2149      1044
$\lambda_q$          2062      977          ...       ...          2062      977

Results based on topic type

Comparison of CAS results and the 0.8 diagonal homotopy point:

                      assessor                          intersection
                      # topics  avg. T11SU difference   # topics  avg. T11SU difference
CAS better than 0.8   18        -0.047                  25        -0.037
0.8 better than CAS   25        0.062                   13        0.011
CAS and 0.8 equal     7         0                       12        0
Total                 50        0.014                   50        -0.015

Additional Experiments

• Reordered TREC documents

• Experimented with 77 topics on OHSUMED dataset (1987-1988 as training data, 1989-1991 as test)

The results are similar to those on the original TREC task.

Result of Experiments with Reordering

Average results on 5 re-orderings of the TREC test set:

Lambda               0.0    0.8    1.0
Average T11SU        0.108  0.406  0.391
Standard Deviation   0.002  0.002  0.004

OHSUMED Results

Lambda                      0      0.2    0.4    0.6    0.7    0.8    0.9    1
Mean T11SU, no quitting     0.005  0.051  0.361  0.467  0.474  0.463  0.464  0.482
Mean T11SU, with quitting   0.138  0.132  0.319  0.467  0.474  0.463  0.464  0.482

[Figure: mean T11SU as a function of lambda, with and without quitting]

Documents Retrieved: OHSUMED

[Figure: relevant and not relevant documents retrieved as a function of lambda]

Discussion

• We demonstrate the design complexity hidden under the name “Rocchio method”

• We provide specific models for interpolating between design choices

• These interpolation options can also work for methods that differ much more from each other (for example, Rocchio and SVM).

Discussion (cont.)

• These models should help researchers explore their systems, and regions “between systems”

• Suggests a new approach to designing IR systems: finding a set of (interpolation) parameters optimizing performance

• This can be done with existing optimization methods.
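As one illustration of searching the interpolation parameters, the sketch below runs a plain coordinate search over a grid. It is a stand-in for whichever existing optimizer is preferred, not a method taken from this work; `evaluate` is a hypothetical function returning the performance (for example, average T11SU) of the system at a given parameter setting.

```python
def coordinate_search(evaluate, start, grid, sweeps=2):
    """Greedy coordinate search: try each grid value for one parameter at a time.

    Keeps a change only if it strictly improves the evaluation score.
    """
    best = dict(start)
    best_value = evaluate(best)
    for _ in range(sweeps):
        for name in list(start):
            for candidate in grid:
                trial = dict(best)
                trial[name] = candidate
                value = evaluate(trial)
                if value > best_value:
                    best, best_value = trial, value
    return best, best_value


# Toy objective with its maximum at x = 0.8, y = 0.2.
grid = [i / 5 for i in range(6)]
best, best_value = coordinate_search(
    lambda p: -((p["x"] - 0.8) ** 2) - (p["y"] - 0.2) ** 2,
    {"x": 0.0, "y": 0.0},
    grid,
)
```

Each call to `evaluate` is a complete filtering run, so the number of grid values and sweeps has to stay small in practice.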

A Note on Interpolation Limits

The need for two endpoint systems is not very restrictive:

• Some interpolation parameters can be moved beyond the [0,1] interval.

• The endpoints themselves can be moved.

Abstract Interpolation

• More abstractly: do not interpolate every single parameter; work at higher abstraction levels
• Ex: representation block, scoring block, thresholding block, etc.
• Can use this with several systems
• This is at a lower level than ensembles of classifiers.

Caveat

In moving to large design space we still face two major problems:

• The range of parameters cannot be explored exhaustively, and non-smooth optimization is needed

• This requires a lot of labeled data, which is usually produced manually and is in short supply.

Acknowledgments

• KD-D group via NSF grant EIA-0087022

• Andrei Anghelescu, Vladimir Menkov

• Jamie Callan

• Members of DIMACS MMS project

• CAS researchers

• Ian Soboroff

• Anonymous reviewers
