
A Design Space Approach to Analysis of Information Retrieval Adaptive Filtering Systems


November 10, 2004 Dmitriy Fradkin, CIKM'04 1

Dmitriy Fradkin, Paul Kantor

DIMACS, Rutgers University


What Is This Work About?

• Small-scale view: We analyze the differences between two implementations of the Rocchio method and discuss the choice of parameters.

• Large-scale view: The problem of constructing an IR/AF system can be seen as an optimization problem in a large design space. (Well-known methods are simply points in this space.)


Large-Scale View

• Use optimization methods to find optimal choices of parameters. These optimal choices do not have to correspond to well-known methods or standard practices.

• Design space optimization methods have been suggested for designing VLSI chips [Bahuman et al. 2002], airplanes [Schwabacher and Gelsey, 1996; Zha et al. 1996] and HVAC systems [Szykman 1997].


What’s in a name?

• We find that even a single “name” involves an enormous number of design choices.

• TREC2002 Adaptive Filtering:

– DIMACS: Rocchio method

– Chinese Academy of Sciences: Rocchio method

• One method performs almost twice as well as the other.


For any system:

• Choose Data Representation

• Construct Initial Classifier

• Training Phase:

• Incorporate labeled examples

• Supplement with “pseudo positives” and “pseudo negatives”

• Set the threshold

• Filtering Phase: as new documents arrive

• Evaluate performance

• Update the classifier model

• Update threshold
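The phases above can be sketched as a single loop. This is only an illustration under stated assumptions, not either system's code: `run_filter`, `score`, `update` and the toy stream are hypothetical names, and the sketch assumes that feedback is available only for documents the system submits.

```python
def run_filter(stream, score, threshold, update):
    """Adaptive filtering loop (illustrative sketch).

    stream    -- iterable of (doc, label) pairs; label is None when unjudged
    score     -- hypothetical callable mapping a document to a number
    threshold -- initial submission threshold
    update    -- hypothetical callable (doc, label, threshold) -> new threshold
    """
    submitted = []
    for doc, label in stream:
        if score(doc) >= threshold:
            # Only submitted documents yield feedback for the update step.
            submitted.append((doc, label))
            threshold = update(doc, label, threshold)
    return submitted, threshold


# Toy run: documents are already scores; a wrong submission raises the threshold.
toy_stream = [(0.9, True), (0.1, False), (0.7, False)]
submitted, final_threshold = run_filter(
    toy_stream,
    score=lambda doc: doc,
    threshold=0.5,
    update=lambda doc, label, thr: thr if label else thr + 0.3,
)
```

Every design decision listed on this slide (representation, initial classifier, threshold setting, updates) corresponds to one of the callables or arguments here.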


All of these are usually:

• Characterized informally, as a choice, and the exclusion of alternatives.

• Seen as points on a map – but to understand the significance of these choices we need to explore the real territory.

• So: we must interpolate between the choices made in one method and those made in another.


Interpolation

• Identify the corresponding design decisions

• Develop a “path” between them, sometimes called a “homotopy”, from the topological concept of smoothly distorting one shape (say, a coffee cup) into another (say, a doughnut).

• Study the effectiveness along various paths among design options.
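A minimal sketch of such a path, assuming each system can be written as a dictionary of numeric design parameters. The helper `interpolate` and the parameter names and values are placeholders chosen for illustration, not taken from the talk. Here λ = 0 gives the first system and λ = 1 the second.

```python
def interpolate(point_a, point_b, lam):
    """Linear homotopy between two design points that share the same keys."""
    if point_a.keys() != point_b.keys():
        raise ValueError("design points must share the same parameters")
    return {k: (1.0 - lam) * point_a[k] + lam * point_b[k] for k in point_a}


# Hypothetical design points; names and values are placeholders.
system_a = {"gamma": 0.5, "threshold": 0.2}
system_b = {"gamma": 1.5, "threshold": 0.6}

# Three points on the straight path from system_a to system_b.
path = [interpolate(system_a, system_b, lam) for lam in (0.0, 0.5, 1.0)]
```

Using one λ for all parameters walks along the diagonal of the design space; giving each parameter its own λ allows other paths.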


Interpolation Aspects for IR/AF

• Term Representation

• Term Weighting

• Computing Scores

• Setting Classifier Threshold

• Document Set Representation

• Pseudolabeled Documents in Training


Interpolation Aspects (cont.)

• Query Initialization

• Unjudged documents in test

• Query Update

• Quitting Strategy


Example: Term Representation

$$
f(t,d) = \begin{cases} 1 + \log f'(t,d), & \text{if } f'(t,d) > 0 \\ 0, & \text{otherwise} \end{cases}
$$

where $f'(t,d)$ is the number of times term $t$ occurs in document $d$.
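As a small sketch, assuming the formula on this slide is the usual dampened term frequency (1 + log f'(t,d) for a positive count, 0 otherwise). The base of the logarithm is not stated, so the natural log used here is an assumption, and the function name is illustrative.

```python
import math


def term_frequency_weight(raw_count):
    """Dampened term frequency: 1 + log(count) for positive counts, else 0.

    The natural logarithm is an assumption; the slide does not give the base.
    """
    if raw_count > 0:
        return 1.0 + math.log(raw_count)
    return 0.0


# A term that occurs once gets weight 1; repeated occurrences grow slowly.
weights = [term_frequency_weight(n) for n in (0, 1, 2, 10)]
```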


Example: Term Weighting

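• DIMACS: weight $i_D(t)$, a logarithmic function of $|T|$ and $i'(t)$

• CAS: weight $i_C(t)$, a logarithmic function of $|T|$ and $i'(t)$, defined piecewise on $i'(t)$

• Homotopy: $i(t,\lambda_i) = \lambda_i\, i_C(t) + (1-\lambda_i)\, i_D(t)$

$i'(t)$ is the number of documents, in training set $T$, containing term $t$.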


Example: Score Computation

• DIMACS: $s_D(d,q) = (d, W_D q)$, with $w_D(t) = i(t,\lambda_i)$

• CAS: $s_C(d,q) = \dfrac{(d, W_C q)}{\|d\|\,\|W_C q\|}$, with $w_C(t) = i(t,\lambda_i)^2$

• Homotopy: $W$ has entries $w(t,\lambda_w) = i(t,\lambda_i)^{1+\lambda_w}$ on the diagonal and 0 elsewhere

$i'(t)$ is the number of documents, in training set $T$, containing term $t$. $W$ is a diagonal matrix of weights.
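A sketch of the weight interpolation, reading the slide as giving diagonal weights w_D(t) = i(t) for DIMACS and w_C(t) = i(t)^2 for CAS, joined by w(t, λ_w) = i(t)^(1 + λ_w). That reading, the function names and the toy vectors are assumptions; the length normalization of the CAS score is left out.

```python
def diagonal_weights(idf, lam_w):
    """w(t, lam_w) = i(t) ** (1 + lam_w): i(t) at lam_w = 0, i(t)^2 at lam_w = 1."""
    return [value ** (1.0 + lam_w) for value in idf]


def weighted_dot(d, q, weights):
    """Unnormalized score (d, W q) with W = diag(weights)."""
    return sum(di * wi * qi for di, wi, qi in zip(d, weights, q))


# Toy vectors over a three-term vocabulary (illustrative values only).
idf = [1.0, 2.0, 3.0]
doc = [1.0, 0.0, 2.0]
query = [1.0, 1.0, 1.0]

score_at_0 = weighted_dot(doc, query, diagonal_weights(idf, 0.0))
score_at_1 = weighted_dot(doc, query, diagonal_weights(idf, 1.0))
```

Interpolating the exponent rather than the weights themselves keeps every intermediate weight a power of the same term statistic.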


Example: Score Interpolation

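Homotopy:

$$
s(s_C, s_D, \lambda_S) = \lambda_S\, \varphi(s_C) + (1-\lambda_S)\, s_D
$$

The same mapping, from the CAS scale to the DIMACS scale, is used for scores and for thresholds:

$$
\varphi(s_C(d)) = \frac{m_D}{m_C}\, s_C(d)
$$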
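A sketch of the blend, assuming the interpolated score is λ_S · φ(s_C) + (1 − λ_S) · s_D, where φ maps a CAS-scale value onto the DIMACS scale. The mapping is passed in as a function because its constants are not given here; the factor of 10 below is purely a placeholder.

```python
def blend_scores(s_cas, s_dimacs, lam_s, phi):
    """Interpolated score: lam_s * phi(s_cas) + (1 - lam_s) * s_dimacs."""
    return lam_s * phi(s_cas) + (1.0 - lam_s) * s_dimacs


def toy_phi(value):
    """Hypothetical rescaling from the CAS scale to the DIMACS scale."""
    return 10.0 * value


blended = [blend_scores(0.5, 3.0, lam, toy_phi) for lam in (0.0, 0.5, 1.0)]
```

Because thresholds live on the same scales as scores, the same function can blend a CAS threshold with a DIMACS threshold.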


Example: Setting Thresholds

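Threshold for query $q$ after seeing document $i$:

• DIMACS: $\tau_D(q,i)$ is chosen to optimize utility

• CAS:

$$
\tau_C(q,i) = \begin{cases} (1+\delta)\,\tau_C(q,i-1), & \text{if utility is negative} \\ (1-\delta)\,\tau_C(q,i-1), & \text{if } z_i > 6000 \\ \tau_C(q,i-1), & \text{otherwise} \end{cases}
$$

where $\delta = 0.005$ and $z_i$ is the number of documents seen since the last submission.

• Homotopy: $\tau(q,i,\lambda_S) = \lambda_S\, \varphi(\tau_C(q,i)) + (1-\lambda_S)\, \tau_D(q,i)$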


Example: Set Representation

• DIMACS: $v(S) = \dfrac{1}{|S|} \sum_{x \in S} x$

• CAS: $v(S) = \sum_{x \in S} x$

• Homotopy: $v(S,\lambda_r) = \dfrac{1}{1 + (1-\lambda_r)(|S|-1)} \sum_{x \in S} x$
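A sketch of the set-representation path, assuming DIMACS represents a document set by the mean of its vectors and CAS by their sum, with the scale factor 1 / (1 + (1 − λ_r)(|S| − 1)) joining the two. The function name and the toy vectors are illustrative.

```python
def set_vector(vectors, lam_r):
    """v(S, lam_r) = sum of vectors / (1 + (1 - lam_r) * (|S| - 1)).

    lam_r = 0 gives the mean of the set, lam_r = 1 gives the plain sum.
    """
    n = len(vectors)
    if n == 0:
        raise ValueError("document set must not be empty")
    scale = 1.0 / (1.0 + (1.0 - lam_r) * (n - 1))
    dims = len(vectors[0])
    return [scale * sum(vec[k] for vec in vectors) for k in range(dims)]


docs = [[1.0, 2.0], [3.0, 4.0]]
mean_like = set_vector(docs, 0.0)
sum_like = set_vector(docs, 1.0)
```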


Example: Pseudo-labeled Documents

• The CAS method does not make use of pseudo-labeled documents in the training stage

• DIMACS method: Given “density” parameters (d+ and d-) and “proportion” (p+ and p-), score unlabeled training documents and choose top and bottom sets according to “proportion”. Then pick documents out of these sets according to corresponding “density”.

• Interpolate between density and proportion parameters (DIMACS) and 0 (CAS).


Example: Query Initialization

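General Formula:

$$
q^{init} = \alpha'\, q_{terms} + \beta'\, v(D_i^+) - \gamma'\, v(D_i^-) + x'\, v(D_{ip}^+) - y'\, v(D_{ip}^-)
$$

DIMACS: $\alpha' = 1,\ \beta' = 1,\ x' = \dfrac{2}{|D_{ip}^+|},\ y' = \dfrac{5}{|D_{ip}^-|},\ \gamma' = 0$

CAS: $\alpha' = 3,\ \beta' = 1,\ x' = 0,\ y' = 0,\ \gamma' = 0$

Homotopy:

$$
q^{init}(\lambda_\alpha, \lambda_p) = (3\lambda_\alpha + (1-\lambda_\alpha))\, q_{terms} + v(D_i^+) + (1-\lambda_p)\, x'\, v(D_{ip}^+) - (1-\lambda_p)\, y'\, v(D_{ip}^-)
$$

where $x'$ and $y'$ take their DIMACS values.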


Example: Unjudged Documents

• A submitted document for which there is no label is “unjudged”. DIMACS ignores such documents. CAS treats such a document as pseudo-negative if its score is less than 0.6.

• Can view this as a threshold:

$$
u(\lambda_u) = 0.6\,\lambda_u + 0 \cdot (1-\lambda_u) = 0.6\,\lambda_u
$$
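A sketch of this threshold view, assuming the interpolated cutoff is 0.6·λ_u, so that λ_u = 0 reproduces DIMACS (no unjudged document is used, provided scores are non-negative) and λ_u = 1 reproduces the CAS cutoff of 0.6. The function names are illustrative.

```python
def unjudged_threshold(lam_u):
    """Cutoff below which an unjudged document counts as pseudo-negative."""
    return 0.6 * lam_u + 0.0 * (1.0 - lam_u)


def is_pseudo_negative(score, lam_u):
    """True if an unjudged document with this score is used as a negative."""
    return score < unjudged_threshold(lam_u)


# The same unjudged document is ignored at lam_u = 0 and used at lam_u = 1.
ignored = is_pseudo_negative(0.3, 0.0)
used = is_pseudo_negative(0.3, 1.0)
```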


Example: Query Update

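General Formula:

$$
q = \alpha\, q^{init} + \beta\, v(D^+) - \gamma\, v(D^-) + x\, v(D_p^+) - y\, v(D_p^-)
$$

DIMACS: $\alpha = 1,\ \beta = 1,\ \gamma = 0.125,\ x = 0,\ y = 0.0125$

CAS: $\alpha = 1,\ \beta = 1,\ \gamma = 1.8,\ x = 0,\ y = 1.3$

Homotopy:

$$
q(\lambda_\gamma, \lambda_y) = q^{init} + v(D^+) - (1.8\,\lambda_\gamma + 0.125\,(1-\lambda_\gamma))\, v(D^-) - (1.3\,\lambda_y + 0.0125\,(1-\lambda_y))\, v(D_p^-)
$$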
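A sketch of the interpolated Rocchio update. It assumes the usual Rocchio signs (positive for relevant, negative for nonrelevant and pseudo-negative documents) and assumes that the larger coefficients 1.8 and 1.3 belong to CAS and the smaller 0.125 and 0.0125 to DIMACS. The parameter names `lam_gamma` and `lam_y` and the toy vectors are illustrative.

```python
def mix(cas_value, dimacs_value, lam):
    """Linear interpolation: DIMACS value at lam = 0, CAS value at lam = 1."""
    return lam * cas_value + (1.0 - lam) * dimacs_value


def update_query(q_init, v_pos, v_neg, v_pseudo_neg, lam_gamma, lam_y):
    """q = q_init + v(D+) - gamma * v(D-) - y * v(Dp-), alpha = beta = 1."""
    gamma = mix(1.8, 0.125, lam_gamma)   # assumed CAS / DIMACS coefficients
    y = mix(1.3, 0.0125, lam_y)          # assumed CAS / DIMACS coefficients
    return [
        qi + p - gamma * n - y * pn
        for qi, p, n, pn in zip(q_init, v_pos, v_neg, v_pseudo_neg)
    ]


q_dimacs_like = update_query([1.0], [1.0], [1.0], [0.0], 0.0, 0.0)
q_cas_like = update_query([1.0], [1.0], [1.0], [0.0], 1.0, 1.0)
```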


Example: Quitting Strategy

• DIMACS: if after 50 submissions the utility is negative, stop submitting for this topic

• CAS: no quitting strategy

Alternatively: quit if, after submitting $1/\epsilon(\lambda_q)$ documents, the utility is negative, where

• DIMACS: $\epsilon = 0.02$

• CAS: $\epsilon = 0$

• Homotopy: $\epsilon(\lambda_q) = 0.02\,(1-\lambda_q)$
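A sketch of the interpolated quitting rule, assuming the cutoff is the reciprocal of a rate that falls linearly from 0.02 (DIMACS, 50 submissions) to 0 (CAS, never quit). The name `quit_rate`, and the choice to test at or beyond the cutoff rather than exactly at it, are assumptions.

```python
def quit_rate(lam_q):
    """Rate 0.02 * (1 - lam_q): 0.02 for DIMACS, 0 for CAS."""
    return 0.02 * (1.0 - lam_q)


def should_quit(num_submitted, utility, lam_q):
    """Quit once at least 1 / rate documents were submitted and utility < 0."""
    rate = quit_rate(lam_q)
    if rate <= 0.0:
        return False  # CAS endpoint: the cutoff is infinite, so never quit
    return num_submitted >= 1.0 / rate and utility < 0


decisions = [
    should_quit(60, -1.0, 0.0),    # DIMACS: past 50 submissions, negative utility
    should_quit(10, -1.0, 0.0),    # DIMACS: too early to quit
    should_quit(1000, -1.0, 1.0),  # CAS: no quitting strategy
]
```

Halfway along the path (λ_q = 0.5) the cutoff doubles to 100 submissions, so the rule becomes steadily more patient as it approaches CAS.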


Experimental Evaluation

• TREC11 Data - Reuters Corpus v1

• 23,000 training; 800,000 test

• 100 topics (50 assessor, 50 intersection)

• 3 positive and 0 negative examples per topic

$$
\mathrm{T11NU} = \frac{2\,|D^+| - (|D^-| + |D^u|)}{2\,|T^+|}
$$

$$
\mathrm{T11SU} = \frac{\max(\mathrm{T11NU}, -0.5) + 0.5}{1.5}
$$

$T^+$ - all positive documents; $D^+$ - submitted positive; $D^-$ - submitted negative; $D^u$ - submitted unlabelled
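A sketch of the evaluation measure, assuming the slide's formulas are T11NU = (2|D+| − (|D−| + |Du|)) / (2|T+|) and T11SU = (max(T11NU, −0.5) + 0.5) / 1.5. The function and argument names are illustrative.

```python
def t11nu(num_pos, num_neg, num_unlabelled, total_pos):
    """Normalized utility: (2|D+| - (|D-| + |Du|)) / (2|T+|)."""
    return (2.0 * num_pos - (num_neg + num_unlabelled)) / (2.0 * total_pos)


def t11su(num_pos, num_neg, num_unlabelled, total_pos):
    """Scaled utility: T11NU clipped below at -0.5, then mapped onto [0, 1]."""
    nu = t11nu(num_pos, num_neg, num_unlabelled, total_pos)
    return (max(nu, -0.5) + 0.5) / 1.5


perfect = t11su(10, 0, 0, 10)     # every relevant document, nothing else
silent = t11su(0, 0, 0, 10)       # nothing submitted
flooded = t11su(0, 100, 0, 10)    # only nonrelevant documents submitted
```

Under this scaling a system that submits nothing scores 1/3, so an average T11SU below that level means the system would have done better by staying silent.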


Diagonal Interpolation

Figure: average T11SU versus lambda along the diagonal, with and without quitting.

| Lambda | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 | CAS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Average T11SU, no quitting | 0.033 | 0.103 | 0.26 | 0.364 | 0.404 | 0.394 | 0.405 |
| Average T11SU, with quitting | 0.113 | 0.139 | 0.263 | 0.364 | 0.404 | 0.394 | 0.405 |


Documents Retrieved


Parameter Analysis

• It is possible to analyze the effect of individual parameters at each point in the space by taking “small steps” along each parameter axis.

• Requires a lot of computational effort

• Results may not be easy to interpret
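A sketch of this kind of analysis, assuming a design point is a dictionary of interpolation parameters and `evaluate` is some function that runs the filtering system at a point and returns a number. Both are hypothetical here; the toy `evaluate` below just sums the parameters.

```python
def one_at_a_time(evaluate, base_point, step):
    """Move each parameter one small step down and up, holding the rest fixed.

    Returns a dict keyed by (parameter name, signed step) with the result of
    evaluate at the perturbed point. Costs 2 * len(base_point) evaluations.
    """
    results = {}
    for name in base_point:
        for delta in (-step, step):
            point = dict(base_point)
            point[name] = base_point[name] + delta
            results[(name, delta)] = evaluate(point)
    return results


# Toy stand-in for a full filtering run (hypothetical objective).
base = {"lam_w": 0.8, "lam_q": 0.8}
effects = one_at_a_time(lambda p: sum(p.values()), base, 0.1)
```

With ten interpolation parameters, each base point needs twenty full filtering runs, which is where the computational cost comes from.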


Example of Parameter Analysis

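| Parameter | $\lambda$ = 0.7, relevant | $\lambda$ = 0.7, nonrelevant | $\lambda$ = 0.8, relevant | $\lambda$ = 0.8, nonrelevant | $\lambda$ = 0.9, relevant | $\lambda$ = 0.9, nonrelevant |
| --- | --- | --- | --- | --- | --- | --- |
| $\lambda_\alpha$ | 2086 | 975 | ... | ... | 2089 | 1010 |
| $\lambda_\gamma$ | 2273 | 1233 | ... | ... | 1923 | 830 |
| $\lambda_p$ | 2043 | 1129 | ... | ... | 2014 | 939 |
| $\lambda_y$ | 2106 | 1005 | ... | ... | 2029 | 948 |
| $\lambda_u$ | 2065 | 1037 | ... | ... | 2071 | 977 |
| $\lambda_i$ | 2062 | 977 | 2062 | 977 | 2062 | 977 |
| $\lambda_w$ | 2055 | 1000 | ... | ... | 2119 | 1007 |
| $\lambda_S$ | 2153 | 1021 | ... | ... | 2123 | 1031 |
| $\lambda_r$ | 2037 | 993 | ... | ... | 2149 | 1044 |
| $\lambda_q$ | 2062 | 977 | ... | ... | 2062 | 977 |

Effect of individual parameters on the number of relevant and nonrelevant documents retrieved around the 0.8 point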


Results based on topic type

| | assessor: # topics | assessor: avg. T11SU difference | intersection: # topics | intersection: avg. T11SU difference |
| --- | --- | --- | --- | --- |
| CAS better than 0.8 | 18 | -0.047 | 25 | -0.037 |
| 0.8 better than CAS | 25 | 0.062 | 13 | 0.011 |
| CAS and 0.8 equal | 7 | 0 | 12 | 0 |
| Total | 50 | 0.014 | 50 | -0.015 |

Comparison of CAS results and the 0.8 diagonal homotopy point


Additional Experiments

• Reordered TREC documents

• Experimented with 77 topics on OHSUMED dataset (1987-1988 as training data, 1989-1991 as test)

The results are similar to those on the original TREC task.


Result of Experiments with Reordering

Average results on 5 re-orderings of the TREC test set:

| Lambda | 0.0 | 0.8 | 1.0 |
| --- | --- | --- | --- |
| Average T11SU | 0.108 | 0.406 | 0.391 |
| Standard Deviation | 0.002 | 0.002 | 0.004 |


OHSUMED Results

| Lambda | 0 | 0.2 | 0.4 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean T11SU, no quitting | 0.005 | 0.051 | 0.361 | 0.467 | 0.474 | 0.463 | 0.464 | 0.482 |
| Mean T11SU, with quitting | 0.138 | 0.132 | 0.319 | 0.467 | 0.474 | 0.463 | 0.464 | 0.482 |

Figure: mean T11SU versus lambda on OHSUMED, with and without quitting.


Documents Retrieved: OHSUMED

Figure: number of documents retrieved (Relevant, Not Relevant) versus lambda.


Discussion

• We demonstrate the design complexity hidden under the name “Rocchio method”

• We provide specific models for interpolating between design choices

• These interpolation options can also work for methods that differ far more from each other (for example, Rocchio and SVM).


Discussion (cont.)

• These models should help researchers explore their systems, and regions “between systems”

• This suggests a new approach to designing IR systems: finding a set of (interpolation) parameters that optimizes performance

• This can be done with existing optimization methods.
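As the simplest possible illustration of treating design as optimization, here is a grid search along the diagonal, where one λ is shared by all interpolation parameters. The `evaluate` function stands in for a full filtering run and is hypothetical, as is the toy objective below; any existing optimizer could replace the grid.

```python
def best_diagonal_lambda(evaluate, steps=5):
    """Evaluate lambda = 0, 1/steps, ..., 1 and return the best-scoring value."""
    candidates = [k / steps for k in range(steps + 1)]
    return max(candidates, key=evaluate)


def toy_objective(lam):
    """Hypothetical objective peaking at lam = 0.8 (not real system output)."""
    return -(lam - 0.8) ** 2


best = best_diagonal_lambda(toy_objective)
```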


A Note on Interpolation Limits

The need for two endpoint systems is not very restrictive:

• Some interpolation parameters can be moved beyond the [0,1] interval.

• The endpoints themselves can be moved.


Abstract Interpolation

• More abstractly: do not interpolate every single parameter; work at higher abstraction levels

• Ex: representation block, scoring block, thresholding block, etc.

• Can use this with several systems

• This is at a lower level than ensembles of classifiers.


Caveat

In moving to a large design space we still face two major problems:

• The range of parameters cannot be explored exhaustively, and non-smooth optimization is needed

• The approach requires a lot of labeled data, which is usually produced manually and is in short supply.


Acknowledgments

• KD-D group via NSF grant EIA-0087022

• Andrei Anghelescu, Vladimir Menkov

• Jamie Callan

• Members of DIMACS MMS project

• CAS researchers

• Ian Soboroff

• Anonymous reviewers