
the hedge algorithm and applications

Virgil Pavlu

will it rain tomorrow?

I say it will rain
Nick says it will not rain
CNN says it will rain
FOX says it will not rain

TOMORROW: sunny all day…
what about Saturday?

weighted majority algorithm

Give the “weather-men” weights
Initial: uniform, or according to some prior belief

Run the forecast for several days. Every day:
Make our prediction by weighted-majority-vote
Get the real outcome
Update weights

“penalize” wrong predictors
“reward” good predictors
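A minimal sketch of the loop just described, in Python; the function name, the β = 0.5 penalty factor, and the toy data below are my choices, not from the slides.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """expert_predictions[t][i]: 0/1 prediction of expert i on day t."""
    n = len(expert_predictions[0])
    weights = [1.0] * n          # uniform initial weights
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # predict by weighted-majority vote
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        if (1 if vote_1 >= vote_0 else 0) != outcome:
            mistakes += 1
        # "penalize" wrong predictors; good predictors keep their weight
        weights = [w * beta if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes, weights
```

With three experts where expert 0 is always right, its weight stays at 1.0 while the others decay geometrically.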

more problems

trading stocks
IR metasearch
disease classification
where we need to “train” doctors

common underlying idea:
generalization of weighted-majority
why not exactly WM:
losses are real numbers instead of discrete 0-1
our loss may be a weighted sum of expert losses

this talk

Hedge algorithm for online allocation [Schapire, Freund ’96]

Applications
On classification: AdaBoost
On IR: metasearch algorithm

“how to use expert advice”?

N experts (or strategies)
maintain a set of weights over experts
loop for T episodes t = 1, 2, …, T:
allocate resources (belief) based on weights
receive losses
reweight the experts

online allocation - hedge algorithm

N strategies (systems); initial weights $w^1 \in [0,1]^N$ with $\sum_{i=1}^{N} w^1_i = 1$

at each episode $t$:
$$p^t_i = \frac{w^t_i}{\sum_{i=1}^{N} w^t_i}$$
receive losses $\boldsymbol{l}^t \in [0,1]^N$; episodic Hedge loss $L^t_H = \mathbf{p}^t \cdot \boldsymbol{l}^t$
update weights: $w^{t+1}_i = w^t_i \cdot \beta^{l^t_i}$

cumulative loss:
$$L_{HEDGE} = \sum_{t=1}^{T} \mathbf{p}^t \cdot \boldsymbol{l}^t$$
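The allocation loop above as a short Python sketch; the function names and the example loss sequence are my choices. It also checks the loss bound from [Schapire, Freund ’96].

```python
import math

def hedge(losses, beta=0.5):
    """losses[t][i] in [0,1]: loss of expert i at episode t."""
    n = len(losses[0])
    weights = [1.0 / n] * n                       # w^1: uniform, sums to 1
    total_loss = 0.0
    for loss_t in losses:
        z = sum(weights)
        probs = [w / z for w in weights]          # p_i^t = w_i^t / sum_i w_i^t
        total_loss += sum(p * l for p, l in zip(probs, loss_t))     # p^t . l^t
        weights = [w * beta ** l for w, l in zip(weights, loss_t)]  # w^{t+1}
    return total_loss

def hedge_bound(best_loss, n, beta=0.5):
    # L_Hedge <= (L_best * ln(1/beta) + ln N) / (1 - beta)
    return (best_loss * math.log(1 / beta) + math.log(n)) / (1 - beta)
```

For example, with two experts where the first never suffers loss, Hedge’s cumulative loss stays below `hedge_bound(0, 2)` ≈ 1.39.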

why hedge
goal: Hedge loss close to the loss of the best expert (bound)
proof idea:
relate episodic Hedge loss with the sum of weights
$$\sum_{i=1}^{N} w^{t+1}_i \le \Big(\sum_{i=1}^{N} w^t_i\Big)\big(1 - (1-\beta)\,L^t_{Hedge}\big)$$
relate cumulative Hedge loss with the sum of final weights
relate the sum of final weights with the loss of the best expert
$$L_{Hedge}(\beta) \le \frac{L_{best}\,\ln(1/\beta) + \ln N}{1-\beta}$$

episodic hedge loss

$$\sum_{i=1}^{N} w^{t+1}_i = \sum_{i=1}^{N} w^t_i\,\beta^{l^t_i} \le \sum_{i=1}^{N} w^t_i\big(1-(1-\beta)\,l^t_i\big)$$
using $\beta^l \le 1-(1-\beta)\,l$ for $\beta, l \in [0,1]$ (the convex function $f(x)=\beta^x$ lies below the chord $f(x)=1-(1-\beta)\,x$ on $[0,1]$)
$$= \Big(\sum_{i=1}^{N} w^t_i\Big)\sum_{i=1}^{N} p^t_i\big(1-(1-\beta)\,l^t_i\big), \qquad p^t_i = \frac{w^t_i}{\sum_{i=1}^{N} w^t_i}$$
$$= \Big(\sum_{i=1}^{N} w^t_i\Big)\Big(1-(1-\beta)\sum_{i=1}^{N} p^t_i\,l^t_i\Big) = \Big(\sum_{i=1}^{N} w^t_i\Big)\big(1-(1-\beta)\,\mathbf{p}^t\cdot\boldsymbol{l}^t\big)$$
and since $L^t_{Hedge} = \mathbf{p}^t\cdot\boldsymbol{l}^t$:
$$\sum_{i=1}^{N} w^{t+1}_i \le \Big(\sum_{i=1}^{N} w^t_i\Big)\big(1-(1-\beta)\,L^t_{Hedge}\big)$$

cumulative hedge loss

$$\sum_{i=1}^{N} w^{t+1}_i \le \Big(\sum_{i=1}^{N} w^t_i\Big)\big(1-(1-\beta)\,\mathbf{p}^t\cdot\boldsymbol{l}^t\big)$$
iterating over $t = 1,\dots,T$ (with $\sum_{i=1}^{N} w^1_i = 1$):
$$\sum_{i=1}^{N} w^{T+1}_i \le \prod_{t=1}^{T}\big(1-(1-\beta)\,\mathbf{p}^t\cdot\boldsymbol{l}^t\big) \le \prod_{t=1}^{T}\exp\big(-(1-\beta)\,\mathbf{p}^t\cdot\boldsymbol{l}^t\big)$$
using $1-x \le \exp(-x)$, and with $L_{Hedge} = \sum_{t=1}^{T}\mathbf{p}^t\cdot\boldsymbol{l}^t$:
$$\sum_{i=1}^{N} w^{T+1}_i \le \exp\big(-(1-\beta)\,L_{Hedge}\big)$$

[almost] as good as the best expert

for any expert, any $T$:
$$\sum_{i=1}^{N} w^{T+1}_i \ge w^{T+1}_{any} = w^1_{any}\,\beta^{L_{any}}, \qquad w^1_{any} = \frac{1}{N}$$
combining with the upper bound on $\sum_i w^{T+1}_i$:
$$L_{Hedge}(\beta) \le \frac{\ln N + L_{any}\,\ln(1/\beta)}{1-\beta}$$

optimality

$$L_{Hedge}(\beta) \le \frac{L_{best}\,\ln(1/\beta) + \ln N}{1-\beta}$$
i.e. a bound of the form
$$L_B \le c\,L_{best} + a\,\ln N$$
(Hedge essentially achieves the best possible constants $c$ and $a$ for bounds of this form)

how to choose β

$$L_{Hedge}(\beta) \le \frac{L_{best}\,\ln(1/\beta) + \ln N}{1-\beta}$$
β = confidence parameter
think of it as a trade-off
try to make the bound tight
binary search: perfect expert + (β = 0)

that wasn’t so bad

Hedge algorithm for online allocation

Applications
On classification: AdaBoost
On IR: metasearch algorithm

boosting

disease classification…
get past data
“train” a disease-predictor “doctor”
(“hypothesis”, “weak learner”)

how good is it (on training data)?
where is it wrong?
train a new predictor to correct the mistakes of the first one

hedge application: AdaBoost [Schapire, Freund ’96]

HEDGE → BOOSTING
given: experts → given: datapoints (the “experts” are the weighted datapoints, tracking the weak-learners’ performance)
incoming: losses → incoming: weak learners; compute error (loss); compute “belief” (β)
reweight: experts → reweight: datapoints

ADABOOST
$$h_t : X \to \{-1,+1\}, \qquad D_1(i) = \frac{1}{m}, \qquad \varepsilon_t = \Pr_{D_t}\big[h_t(x_i) \ne y_i\big]$$
$$\beta_t = \frac{\varepsilon_t}{1-\varepsilon_t} < 1$$
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\cdot\begin{cases}\beta_t & \text{if } y_i = h_t(x_i)\\ 1 & \text{if } y_i \ne h_t(x_i)\end{cases}$$
$$H_{final}(x) = \operatorname{sgn}\Big(\sum_t \ln(1/\beta_t)\,h_t(x)\Big)$$
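A self-contained AdaBoost sketch on 1-D data, with threshold stumps as the weak learners; the stump representation and the toy dataset are my choices. It uses the $e^{\pm\alpha_t}$ form of the update (shown on the “AdaBoost - technical” slide), which is equivalent to the β-form above up to per-round rescaling absorbed by $Z_t$.

```python
import math

def stump_predict(stump, x):
    sign, thresh = stump                 # h(x) = sign if x > thresh else -sign
    return sign if x > thresh else -sign

def best_stump(xs, ys, dist):
    """Weak learner: the stump with lowest weighted error under dist."""
    candidates = [(s, t) for s in (1, -1) for t in sorted(set(xs))]
    def weighted_error(stump):
        return sum(w for x, y, w in zip(xs, ys, dist)
                   if stump_predict(stump, x) != y)
    stump = min(candidates, key=weighted_error)
    return stump, weighted_error(stump)

def adaboost(xs, ys, rounds=3):
    m = len(xs)
    dist = [1.0 / m] * m                              # D_1(i) = 1/m
    ensemble = []                                     # pairs (alpha_t, h_t)
    for _ in range(rounds):
        stump, eps = best_stump(xs, ys, dist)
        eps = max(eps, 1e-10)                         # avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)       # "belief" in h_t
        ensemble.append((alpha, stump))
        # reweight datapoints: up-weight mistakes, down-weight correct ones
        dist = [w * math.exp(-alpha * y * stump_predict(stump, x))
                for x, y, w in zip(xs, ys, dist)]
        z = sum(dist)                                 # normalizer Z_t
        dist = [w / z for w in dist]
    def h_final(x):                                   # sgn(sum_t alpha_t h_t(x))
        return 1 if sum(a * stump_predict(s, x) for a, s in ensemble) > 0 else -1
    return h_final
```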

AdaBoost - example

Start with uniform distribution on data

Weak learners = halfplanes

round 1

round 2

round 3

final hypothesis

AdaBoost - analysis

generalization error

training error
$$\text{training error}(H_{final}) \le \prod_t \sqrt{4\,\varepsilon_t(1-\varepsilon_t)}$$
$$f(x) = \sqrt{4x(1-x)}, \quad x \in [0,1]$$

2003 Gödel Prize
Yoav Freund and Robert Schapire

“The prize was awarded to Yoav Freund and Robert Schapire for their paper "A Decision Theoretic Generalization of On-Line Learning and an Application to Boosting," Journal of Computer and System Sciences 55 (1997), pp. 119-139. This paper introduced AdaBoost, an adaptive algorithm to improve the accuracy of hypotheses in machine learning. The algorithm demonstrated novel possibilities in analysing data and is a permanent contribution to science even beyond computer science. Because of a combination of features, including its elegance, the simplicity of its implementation, its wide applicability, and its striking success in reducing errors in benchmark applications even while its theoretical assumptions are not known to hold, the algorithm set off an explosion of research in the fields of statistics, artificial intelligence, experimental machine learning, and data mining. The algorithm is now widely used in practice. The paper highlights the fact that theoretical computer science continues to be a fount of powerful and entirely novel ideas with significant and direct impact even in areas, such as data analysis, that have been studied extensively by other communities.”

what we did last year [Aslam, Pavlu, Savell]

Hedge algorithm for online allocation

Applications
On classification: AdaBoost
On IR: metasearch algorithm

metasearch problem
Search for: CIKM 2003

search and metasearch

Search engines:
Provide a ranked list of documents.
May provide relevance scores.

Metasearch engines:
Query multiple search engines.
May or may not combine results.

metasearch: Dogpile

metasearch: Metacrawler

metasearch: Profusion

metasearch algorithms

Heuristics and hacks:
Interleave, average rank, sum scores, etc.

Principled models:
Bayesian inference, election theory, etc.
On-line combination of expert advice.

online metasearch
Search for: CIKM 2003

hedge application: metasearch

HEDGE → METASEARCH
given: experts → given: search engines
incoming: losses → incoming: documents; compute losses
reweight: experts → reweight: search engines

a unified model

ranks importance

map ranks to values:
$$value(r) = \frac{1}{r} + \frac{1}{r+1} + \dots + \frac{1}{Z} \approx \ln\frac{Z}{r}$$
labels: RELEVANT = −1, NONRELEVANT = +1
$$LOSS(d,s) = label(d)\cdot value(rank_{d,s}) \approx label(d)\cdot\ln\frac{Z}{r}$$

total loss vs. total precision vs. average precision
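A quick numeric check of the rank-to-value map above (Z = 1000 is an arbitrary list length I picked): the tail harmonic sum tracks ln(Z/r) closely, except at the very top ranks.

```python
import math

def value(r, z=1000):
    # value(r) = 1/r + 1/(r+1) + ... + 1/Z
    return sum(1.0 / k for k in range(r, z + 1))

# the approximation value(r) ~ ln(Z/r) is good away from rank 1
for r in (1, 10, 100):
    print(r, round(value(r), 3), round(math.log(1000 / r), 3))
```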

feedback loop
“trust” in systems changes with t

metasearch list: order docs by average_value at episode t
$$average\_value_t(d) = \sum_{s=1}^{N} w^t_s \cdot value(rank_{d,s})$$
get feedback $label(d_t)$
compute losses:
$$LOSS(d_t,s) = label(d_t)\cdot value(rank_{d_t,s})$$
modify weights:
$$w^{t+1}_s = w^t_s \cdot \beta^{LOSS(d_t,s)}$$
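The feedback loop above can be sketched end-to-end; everything here (data, β, list length) is invented for illustration, with labels −1 = relevant, +1 = nonrelevant as on the “ranks importance” slide, so a system that ranks relevant documents high earns negative loss and gains weight.

```python
import math

def value(rank, z=100):
    return sum(1.0 / k for k in range(rank, z + 1))

def online_metasearch(ranked_lists, labels, beta=0.8, episodes=3):
    """ranked_lists[s]: doc ids in rank order; labels[d]: -1 rel / +1 nonrel."""
    n = len(ranked_lists)
    weights = [1.0 / n] * n
    judged = set()
    for _ in range(episodes):
        def avg_value(d):        # average_value_t(d) = sum_s w_s * value(rank)
            return sum(w * value(lst.index(d) + 1)
                       for w, lst in zip(weights, ranked_lists) if d in lst)
        pool = {d for lst in ranked_lists for d in lst} - judged
        if not pool:
            break
        d = max(sorted(pool), key=avg_value)   # next doc to judge (feedback)
        judged.add(d)
        for s, lst in enumerate(ranked_lists):  # losses and reweighting
            if d in lst:
                loss = labels[d] * value(lst.index(d) + 1)
                weights[s] *= beta ** loss
        z = sum(weights)
        weights = [w / z for w in weights]
    return weights, judged
```

With two systems that rank the same three documents in opposite orders, the system that puts the relevant ones on top ends with the larger weight.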

experiments
TREC 3,5,6,7,8

41-129 systems
50 queries per TREC
Metasearch combines all systems

Use TREC judgments as user feedback

metasearch - no feedback (yet)

MNZ = CombMNZ (Fox, Shaw, Lee et al.)
COND = Condorcet (Aslam, Montague)

metasearch – TREC 8

metasearch – TREC 3

metasearch – TREC 5

metasearch – TREC 6

metasearch – TREC 7

metasearch – TREC 3,5,6,7

actually… we do more than metasearch

“a unified model for metasearch, pooling and system evaluation”

conclusion

theoretic explanation of “being adaptive”
simple, elegant, intuitive
usually performs much better than the bound

END

AdaBoost - technical

Start with uniform distribution $D_1(i) = \frac{1}{m}$
At every round $t = 1$ to $T$, given $D_t$:
find a weak hypothesis $h_t : X \to \{-1,+1\}$
with error $\varepsilon_t = \Pr_{D_t}\big[h_t(x_i) \ne y_i\big]$
compute the “belief” in $h_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t} > 0$
update the distribution:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\cdot\begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i)\\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i)\end{cases}$$
final hypothesis:
$$H_{final}(x) = \operatorname{sgn}\Big(\sum_t \alpha_t\,h_t(x)\Big)$$
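The deck states the update twice: with $\beta_t = \varepsilon_t/(1-\varepsilon_t)$ earlier, and with $\alpha_t = \frac12\ln\frac{1-\varepsilon_t}{\varepsilon_t}$ here. A quick check that $e^{-2\alpha_t} = \beta_t$, so the two updates scale weights in the same ratio and agree after normalizing by $Z_t$.

```python
import math

def beta_t(eps):
    return eps / (1 - eps)

def alpha_t(eps):
    return 0.5 * math.log((1 - eps) / eps)

# exp(-2*alpha_t) == beta_t for any error rate in (0, 1/2)
for eps in (0.1, 0.25, 0.4):
    assert abs(math.exp(-2 * alpha_t(eps)) - beta_t(eps)) < 1e-12
```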

On-line Setup

On-line Metasearch [Aslam, Pavlu, Savell]

hedge application: metasearch

problem: metasearch
search engines
document judgment
metasearch

[hedge] solution
search engines are “experts”
judgments on documents are “losses”

an IR setup

generalization error - based on margins

generalization error [Schapire, Freund, Bartlett, Lee 1998]

training error [Schapire, Freund 1996]
$$\text{training error}(H_{final}) \le \prod_t \sqrt{4\,\varepsilon_t(1-\varepsilon_t)}$$

hedge approach

loss function

total precision $TP(s)$ = average of the precision values at all ranks
FACT: $\sum_{\text{all docs}} LOSS(d,s) = Z\,\big(C - 2\,TP(s)\big)$

Average the precision at ALL ranks
Normalize so the ideal system gets TP = 1
(the math is simpler)

map ranks to values:
$$value(r) = \frac{1}{r} + \frac{1}{r+1} + \dots + \frac{1}{Z} \approx \ln\frac{Z}{r}$$
labels: RELEVANT = −1, NONRELEVANT = +1
$$LOSS(d,s) = label(d)\cdot value(rank_{d,s}) \approx label(d)\cdot\ln\frac{Z}{r}$$

average_value at episode t
“trust” in systems changes with t

how does it work?
metasearch: include already-judged docs, order the rest by
$$average\_value_t(d) = \sum_{s=1}^{N} w^t_s \cdot value(rank_{d,s})$$
pooling: $d_t = \arg\max_d\ average\_value_t(d)$
system evaluation

get feedback; compute losses $LOSS(d_t,s) = label(d_t)\cdot value(rank_{d_t,s})$
modify weights $w^{t+1}_s = w^t_s\cdot\beta^{LOSS(d_t,s)}$

actually… we do more than metasearch

experiments
TREC 3,5,6,7,8

41-129 systems
50 queries per TREC
metasearch uses all systems

use TREC judgments as user feedback
system evaluation: incomplete judgments

experiments - relevant docs found

experiments - system evaluation

system evaluation – kendall’s τ

metasearch - no feedback (yet)

MNZ = CombMNZ (Fox, Shaw, Lee et al.)
COND = Condorcet (Aslam, Montague)

experiments – metasearch – TREC 8

metasearch – TREC 3,5,6,7

conclusion
a powerful machine learning approach

Hedge = AdaBoost core

works [usually] better than anything else we’ve seen
true, it uses feedback
but without feedback there are provable limitations

Why hedge [Schapire, Freund ’96]

$$\sum_{i=1}^{N} w^{T+1}_i \le \Big(\sum_{i=1}^{N} w^1_i\Big)\prod_{t=1}^{T}\big(1-(1-\beta)\,\mathbf{p}^t\cdot\boldsymbol{l}^t\big) \le \exp\Big(-(1-\beta)\sum_{t=1}^{T}\mathbf{p}^t\cdot\boldsymbol{l}^t\Big)$$
hedge loss at episode $t$: $\mathbf{p}^t\cdot\boldsymbol{l}^t$; cumulative loss: $L_{HEDGE}(\beta) = \sum_{t=1}^{T}\mathbf{p}^t\cdot\boldsymbol{l}^t$
$$L_{HEDGE} \le \frac{\ln N + L_{SYSTEM}\,\ln(1/\beta)}{1-\beta}$$

AdaBoost – distribution update

$$h_t : X \to \{-1,1\}, \qquad \varepsilon_t = \Pr_{D_t}\big[h_t(x_i)\ne y_i\big], \qquad \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t} > 0$$
“neutralize” the last weak hypothesis:
$$D_{t+1} = \frac{D_t}{Z_t}\,e^{\alpha_t}\ \text{ if } y_i \ne h_t(x_i), \qquad D_{t+1} = \frac{D_t}{Z_t}\,e^{-\alpha_t}\ \text{ if } y_i = h_t(x_i)$$

Application - metasearch

problem setup
On the same query:
Set of underlying systems
User feedback
Goal:
Find relevant documents
Produce online metasearch lists
Perform online system evaluation

We are looking for an adaptive approach

motivation

our unified model

loss function

total precision $TP(s)$ = average of the precision values at all ranks
FACT: $\sum_{\text{all docs}} LOSS(d,s) = Z\,\big(C - 2\,TP(s)\big)$

Average the precision at ALL ranks
Normalize so the ideal system gets TP = 1
(the math is simpler)

pooling – howto

Naturally “want” top ranks
If NON RELEVANT, then a NR in top ranks of the system lists
If RELEVANT, bingo.

system evaluation – howto

assume all docs not judged (so many?) to be NON RELEVANT
compute AvgPrecision for every system
one (or few) very good systems – use small β

metasearch - howto

Compute “pooling value” for each doc
Instead of “select the top doc” for pooling,
do “select the top 1000 docs” for metasearch

almost 1000 – docs already pooled are automatically at the top of the metasearch list

experiments
TREC

~100 systems
50 queries each competition

Use TREC qrels as user feedback
incomplete feedback

For comparison with depth-pooling we use the average number of pools (over queries)

experiments - relevant docs found

experiments - system evaluation

system evaluation – kendall’s tau

metasearch - no feedback (yet)

experiments – metasearch – TREC 8

metasearch – TREC 3,5,6,7

conclusion
A powerful machine learning approach

Hedge = AdaBoost core
Works [usually] better than anything else we’ve seen
True, it uses feedback
But without feedback there are provable limitations

It is missing a rigorous analysis
We are not very far away with that
Need a model assumption

After conclusion – don’t read

KEY: FIND RELEVANT DOCS


pooling - comparison with Cormack

motivation


pooling - howto

Naturally “want” top ranks
If NON RELEVANT, then a NR in top ranks of the system lists
If RELEVANT, bingo.

metasearch – howto
Before the next episode:

Compute “pooling value” for each doc

Instead of “select the top doc” for pooling,
do “select the top 1000 docs” for metasearch

almost 1000 – docs already pooled are automatically at the top of the metasearch list

“total” precision

Average the precision at ALL ranks
Normalize so the ideal system gets TP = 1

(the math is simpler)

Metas – trec3

Metas – trec5

Metas – trec6

Metas – trec7

Metas – trec3