31
Comparing Boolean and Probabilistic Information Retrieval Systems Across Queries and Disciplines Robert M. Losee Manning Hall, CB# 3360 University of North Carolina Chapel Hill, NC 27599-3360 U.S.A. Phone: 919-962-7150 Fax: 919-962-8071 [email protected] Published in J. of the American Society for Information Science 48 (2) 1997, 143–156. June 28, 1998 1

Comparing Boolean and Probabilistic Information Retrieval Systems

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

ComparingBooleanandProbabilisticInformationRetrieval SystemsAcrossQueriesandDisciplines

RobertM. LoseeManningHall, CB#3360

Universityof NorthCarolinaChapelHill, NC 27599-3360

U.S.A.

Phone:919-962-7150Fax: [email protected]

PublishedinJ. of the American Society for Information Science

48(2) 1997,143–156.

June28,1998

1

Abstract

Whetherusing Booleanqueriesor ranking documentsusing documentandterm weightswill result in betterretrieval performancehasbeenthe subjectofconsiderablediscussionamongdocumentretrieval systemusersandresearchers.We suggesta methodthatallows oneto analyticallycomparethetwo approachesto retrieval andexaminetheir relative merits. The performanceof informationretrieval systemsmaybedeterminedeitherby usingexperimentalsimulation,orthroughtheapplicationof analytictechniquesthat directly estimatethe retrievalperformance,given valuesfor query and databasecharacteristics.Using theseperformancepredictingtechniques,sampleperformancefiguresareprovided forqueriesusingtheBooleanand andor, aswell asfor probabilisticsystemsassum-ing statisticalterm independenceor term dependence.The variationof perfor-manceacrosssublanguages(usedin differentacademicdisciplines)andqueriesis examined.Theperformanceof modelsfailing to meetstatisticalandotheras-sumptionsis examined.

1 Intr oduction

Informationretrieval is thescienceandart of locatingandobtainingdocumentsbasedon informationneedsexpressedto a systemin a query language.Retrieval systemsoftenorderdocumentsin a mannerconsistentwith theassumptionsof Booleanlogic,by retrieving, for example,documentsthat have the termsdogs andcats, andby notretrieving documentswithout oneof theseterms. Systemsconsistentwith the prob-abilistic modelof retrieval locatedocumentsbasedon a query list of terms,suchas�dogs, cats � , or may acceptas input a naturallanguagequery, suchas I want infor-

mation on dogs and cats. A probabilisticsystemthenranksdocumentsfor retrieval byassigninganumericvalueto eachdocument,basedon theweightsfor querytermsandthefrequenciesof termoccurrencesin documents.

Many lecturesandtextbookscoveringonlineandCD-ROM searchinghave taughthow to “best” formulatea query, given the searcher’s knowledgeof the query anddatabasecharacteristics.Both practitionersandexperimentershave long known thatsomeretrieval techniqueswork betteron certainqueriesanddocumentsthando othertechniques.Realizingthis, someresearchers,including this author, have combined(with varyingdegreesof success)theBooleanandtermweightingmodelsto producemoregeneralmodelsthat areflexible and, ideally, moreeffective. Therehave beenmany theoreticalandimplementationorienteddiscussionsrelatingtermweightingsys-temsand Boolean-like retrieval, with representative works including [BR84, Cro86,Eva94, EC92,FW90, Lee94,Los94b, LB88, LBY86, Nie89, Rad82,Rad79,RT90,Sal84,Sme84,Tur94]. ThesemodelsandsystemsemphasizecombiningBooleanandprobabilisticor vectorbasedsystemsinto asinglesystem.Ourdiscussionbelow differsin that it trys to keepseparateBooleanandtermweightingsystemsthatassumesterm

2

independence.Given this knowledge,the bettersearchenginefor a particularqueryanddatabasecombinationcanbeselected.

As retrieval researchhasmatured,it hasbecomemoreobvious that different re-trieval techniqueshavedifferentstrengthsandweaknesses,with somedocumentsmosteasily retrieved usingonetechniqueandotherdocumentsbeingbestretrieved usingothermethods[BCB94, BCCC93, Lee95]. Using thesetechniques,a systemor indi-vidualmightbeableto chooseoneof severaldifferentmatchingproceduresbaseduponhow eachprocedurewill perform.

Ourapproachto comparingBooleanandtermweightingmodelshasbecomepossi-blebecauseof thedevelopmentof analyticmodelsof informationretrievalperformance[Los91, Los95a, Los96]. While moststudiesof retrieval systemsprovideexperimentalresults,theability to preciselydescribeandexplain theoperationof Booleanandtermweightingsystemsenablesus to calculatethe expectedperformanceof eithersystemundera varietyof assumptions.In addition,usingtechniquesdescribedbelow, it willbe possibleto understandanddescribetraditionalBooleanretrieval asa specialcaseof a term weightingsystem,enablingus to analyticallycomparethe performanceofBooleanandtermweightingsystems.

This article containslittle formal math in the body of the article, with equationsprovidedin theappendixthatwill beof interestto themoreseriousstudentof retrieval.

2 Modelsof Retrieval and Term Weighting

Several differentmodelsof informationretrieval systems,including thoseacceptingBooleanexpressionsasqueries,have beenpopularwith commercialvendorsandwiththe informationretrieval researchcommunity. While themajority of commercialsys-temshaveusedBooleanquerylanguages,thoseinterestedin formalmodelsof retrievalhave probablypublishedmoreon theprobabilisticandvectormodelsof retrieval thanon Booleanretrieval. The modelsof probabilisticretrieval provide searcherswith adecisionrule statingthat a documentshouldbe retrieved if a calculatedvaluethat isbasedon severalparametersis lessthana costbasedvalue;if thecalculatedvalueex-ceedsthecost,thedocumentshouldnotberetrieved.Thesecostsareoftendifficult forpatronsto articulate,in partbecausepatronshave little experiencevaluingor shoppingfor information,which is sooftenprovidedat little or nocostin many societies.

Thevaluescalculatedin thedecisionrule providedby theprobabilisticmodelusu-ally requireestimatingseveral parameters,denotedhereas: � , the probability that aparticularterm is presentin a relevant document;� , the probability that a particulartermis presentin anon-relevantdocument;and � , theprobabilitythataparticulartermis presentin a document(ignoringthequestionof documentrelevance).A full list ofvariablesis givenin AppendixC.Knowledgeaboutparametersis oftenlearnedthroughtwo processes.The first, the query, providessomeinformationexternalto the queryaboutwhat the userexpectsto find in relevantdocuments.The informationprovided

3

by thequeryis examinedin Losee[Los88]. Thesecondway thatparameters’valuesmaybeestimatedis throughrelevancefeedback,informationprovidedby thesearcheraboutwhat the userfinds to be of interest[Boo83,CY82, Los88, Moo93, RTMB86,Sme84,Spi95].

Whenestimatingprobabilitiesfor usein theprobabilisticmodel,it is necessarytoeitherassumestatisticalindependenceof termsor to formally incorporatesomeformof statisticaldependencebetweendocumentfeatures[Coo95, CL68, Cro86, LY82,Los94a, Los95a, VR77, YBLS83]. Termsareconsideredto be independentin mostcommercialandexperimentaltermweightingsystems;whentermfrequenciesarebi-nary, this form of systemwill bereferredto asconsistentwith the termindependence(IND) model. This assumesthat,given two terms,oneterm containsno informationaboutthe probabilityor likelihoodthat the otherterm beingconsideredwill occurinthe samedocumentor in the samerelevanceclass. Differentforms of theseassump-tionshave beenexamined[RSJ76,Rob77], with the mostrecentdiscussionprovidedby Cooper[Coo95]. The term independenceassumptionis obviously an inaccurateassumption:the termcats in a queryis obviously morelikely to co-occurwith termslike fur anddogs than it is to co-occurwith termssuchas tortellini or ravioli. Yet,makingtheindependenceassumptionallows for thetimely andaccurateestimationofparametervaluesandtheretrieval of documents.

Theprobabilityof two termsco-occurringmaybecomputedfrom theprobabilitiesof the termsoccurringindependentlyas well as a factor representingthe degreeofdependencebetweenthe terms(Equation2 in the Appendix.) This factorincludes ���which representsthe expectedproportionof documentscontainingboth terms. Thisvariableis similar to a correlation,but it is not the traditionalcorrelationcoefficientfoundin socialsciencestatistics.It is theaverageproductof thetermfrequencies,whatmightbecalledthe“activecomponent”in themorecommonlyfoundPearsonproductmomentcorrelationcoefficient [Los94a].

Vectorretrieval modelstake a geometricapproachto theretrieval problem,with aqueryandeachdocumentbeingrepresentedasadocumentvectoror arrow moving outfrom the origin (0,0) point in a space.A documentthat is very similar to the querywill have it’ s documentvectorat a smallangleto thequeryvector. Theanglebetweenthem(computedby a cosinemeasure)will be relatively small, while documentslesssimilar will have a largeranglebetweenthemselvesandthequery. Eachdimensioninthe“space”maybeusedto representthefrequenciesof aspecificterm.Assumingtermdependencein the vectorretrieval modelresultsin the adoptionof a somewhatmoreelaboratemodelof thetermspaceandtherelationshipbetweenvectors.Interestingly,in many cases,the formulasdevelopedby thoseusingtheprobabilisticmodelarethesameasthe formulasproducedby thoseusingthevectorapproach[Boo82]. For thisreason,we feel comfortabledevelopingour modelunifying term weightingsystemswith Booleanretrievalsystemsbyusingtheprobabilisticmodel.A similardevelopmentusingthevectormodelmaybeperformed,althoughtheinterpretationof theprocesses

4

will bedifferent.

3 Analytic Modelsof Retrieval Performance

Probabilisticandvectormodelsof retrieval have traditionallybeenevaluatedby sim-ulating retrieval systemsusing testdatabasescontainingsamplequeries,documents,andrelevancejudgements.In an analogousmanner, onecould determinethe areaofa rectanglethrougha numberof experimentalmethods,includingthesimplecountingof thenumberof tiny squaresof a givensizethatonecanphysicallyfit in therectan-gle. However, many tenyearsolds,usinga formulaic,analyticmethod,canmultiplytheheightof therectangletimesthewidth of therectangleto determinetherectangle’sarea.Theperformanceof retrieval systemssimilarly maybedeterminedanalytically,with theexpectedperformancebeingbasedon a numberof factorsthatdirectly deter-mineperformance,including � , � , and � for thetermsincludedin thequery.

The modelof retrieval performanceusedherepredictsthe average search length(ASL), thepositionof theaveragerelevantdocumentin therankedlist of documents.Insituationswhereretrieval is random,for example,theASL will betherankpositionofthemediandocumentin thedatabase.TheASL is usedherein lieu of moretraditionalmeasuressuchasprecisionandrecallbecauseit is easyto understand(theASL is theaveragenumberof documentsonewill needto examineto get to theaveragerelevantdocument)andbecauseof theeasewith which it canbepredicted.

TheASL maybecomputedfor thesimplecasewherethereis a singletermin thequeryby estimatingwheretheaveragerelevantdocumentwould be(probabilistically)in the ranked list of weighteddocuments,the proportionof the way throughthe unitspaceonewould needto move to get to theaveragepositionfor a relevantdocument.TheASL in the rangefrom 0 to 1 is referredto asthe “raw ASL.” We begin by esti-matingthepercentof documentswithout the termandthepercentof documentswiththeterm. For eachof thesecategoriesof documents,we calculatethemidpointof the“percentrange”anduseit asthe representative point for the range. Eachof the twopointsis thenmultiplied by thepercentof relevantdocumentswith thecorrespondingcharacteristic.This hastheeffectof finding theproportionof relevantdocumentswithandwithout theterm,aswell astheir relativeposition,in thescaledrangefrom � to �Thisvalueis thenmultipliedby thenumberof documentsin thedatabaseandthen ��� is added.A similarproceduremaybeusedfor largernumbersof terms.

If thereare 100 documents,for example,spreadingthe relevant documentsran-domly throughoutthe databasewould result in an ASL of ������� If therewere � doc-uments,all with the sameterm, andthe orderingwasperfect,the ASL would be ��Considera morecomplicatedcasewherethe queryis a singleterm andthereare10documents,4 of themrelevant, with 3 of the relevant documentsand1 non-relevantdocumenthaving the singleterm in the query. Then, �������� and ������� The “rawASL” (scaledin the rangefrom � to ) may becomputedby noting the midpoint for

5

thedocumentswith thetermis � � This canrepresent����� of therelevantdocuments,while themid point for thedocumentswith theterm, ���� canrepresentthe ���� of thedocumentswithout the term. Summing � ������� and ���! �� producesa raw ASL of��" ��� This,whenmultipliedby the10 documents( �� �� ) canbeaddedto ��� �� resultingin anASL of ������

Thesetechniquesareappliedthroughtherestof thearticletoahypothesizeddatabaseof #��� documents.Randomperformancefor thisdatabasewill haveanASL of ������� Astermsbecomepositivediscriminatorsof relevance,thatis, � increasesbeyond �$� theper-formancewill moveawayfrom this randomASL value.For all thegraphsshown here,the variable � wassetto %�� that is, #�"� of the documentshave the term in question.All resultsreportedherealsousetwo termsto simplify discussion,with � and � beingthe samefor both terms. This samenessis not requiredby our model,but acceptingthe sameparametervaluesfor both termsdecreasesthe numberof variablespresent,allowing us to examineandmake strongerstatementsaboutsomeothervariablesofinterest(e.g., � .)

The degreeof dependencebetweentwo termsis capturedin part by the averageproductof the binary term frequenciesfor the two terms(in all documents).This �value,whenusedto estimatetheunconditionedjoint probability(alongwith theproba-bility � for asingleterm)is denotedas �'&(� while thecorresponding� usedin estimatingtheconditionalprobabilitythata termpairoccursin a relevantdocumentis denotedas�()�

The performanceresultsshown in Figure1 vary parameters� and �'&* Thesere-trieval surfacesshow the effect of changesin parametervalues,somethingnot easilyshown usingmore traditionalexperimentalmethodologies,suchassimulationusingtestdatabases.Setting�+�,- resultsin thetermbeinganeutraldiscriminator(resultingin anASL of ������ ) while a �$&.�/��� is effectively a Pearsonproductmomentcorrela-tion of ��� with a “negative” correlationrepresentedby �$&10/2(�����3�(465 anda positivecorrelationby �$&879���� Thesurfacewith themeshrepresentsperformanceassumingthatthetermsareindependentandtheperformancesurfacewithoutthemeshrepresentsretrieval performanceassumingtermdependence.For theindependencemodel,the �()componentis computedusingEquation5 in the Appendixso that a Pearsonproductmomentcorrelationof � is modeledbetweenquerytermsin relevantdocuments.Thedependencebasedprobability is computedwith �() numericallyset to that valuethatresultsin the lowestASL, providing an indicationof thegreatestdegreeof differencethatmightbefound.

For theretrieval surfacesshown in Figure1, increasingthe � value( �'& ) slightly po-larizesthedocuments,increasingthe proportionof documentswith both termseitherpresentor bothabsent,resultingin a setof documentsthatareeasierto separateintorelevantandnon-relevantclasses.A strongereffect seenin the Figureis thatwhen �increases,theASL steadilyandmorestronglydecreases,dueto theincreasein thedis-criminationability of theterms.Performanceunderdependencein this caseis always

6

0.1

0.11

0.12

0.13p

0.005

0.00750.01

0.01250.015

c

47

48

49

50

51

52

ASL

0.1

0.11

0.12

0.13p

0.005

0.00750.01

0.01250.015

c

Figure1: Performance(down is “good” for ASL in thesefigures)for thetermindepen-dencemodel(surfacewith mesh)anddependencemodel(surfacewithoutmesh.)

7

superiorto or equalsthat with independence.The worst caseperformanceobtainedconsistentwith dependenceassumptionswould bethatobtainedconsistentwith inde-pendenceassumptions.

This andsomeotherperformancedescriptionsusedherecanbe saidto “mix ap-plesandoranges”in thesensethatthe �:) valuesfor two differentretrieval surfacesareseldomthe samefor a given �$& value,that is, thosepointsthataredirectly above oneanotherandhave matching�'& and � valuesoftenhave different �() values.This is be-causewearedisplayingthebestperformingdependencevalueconsistentwith theotherparametersusedin determiningtheperformanceof theindependencemodel.

Whencomparingtermdependenceandtermindependencemodels,it seemsappro-priateto assumethat theprobabilitiescomputedassumingtermdependenceareaccu-rate,while thoseassumingtermindependenceareoftenlessaccurate.WhenestimatingtheASL, it is necessarywhenconsideringeachpossibledocumentprofile (bothtermspresent,both termsabsent,etc.) that the probability be computedthat the profile isfoundin relevantdocumentsunderthemodelof interestandusingits assumptions.Asecondprobabilitymustbecomputedthatdescribestheproportionof documentsthathave this profile; this latter probability is always computedassumingthe (accurate)termdependencemodel.

4 BooleanOperators and Probabilistic Ranking

Retrieval systemsbasedon Booleanlogic have long served asthe cornerstoneof thecommercialdocumentretrieval systemmarket andremainvery importantbecauseoftherelative simplicity of thequerylanguageandtheeasewith which it canbeunder-stoodandimplemented.Themostcommonusefor aBooleanexpressionis tostatewhatcharacteristicsmustbepresentin materialto beretrievedin asystemthatretrievesandpresentsto usersbibliographicrecordsor full-text. A seconduseof Booleanexpres-sionslikely to increasein importanceover thenext decadeis in rulesincorporatedintodocumentandemailfiltering systems.Sucha rule might take theform of a statement,“If the documentcontainsterm foo, then placedocumentin folder bar and displaynoticeof arrival on thescreen.”

Booleanexpressionstypically usethreeoperators:and, or, andnot. A searchfordocumentsaboutboth dogsandcatsmight be expressedasdogs and cats. Logicalimplications,suchasdog implies mammal, if somethingis a dogthenit is a mammal,may be expressedwithout usingthe implicationoperator, e.g.,usinga statementlike“It is not thecasethatsomethingis a dog andis nota mammal.”

If weareto applytheanalyticmodelof retrieval performance,it becomesnecessaryto treatBooleansystemsasspecialforms of probabilisticretrieval systems.We sug-gesta way to do this here,by comparingthe rankingprovidedby individual Booleanoperatorswith therankingprovidedby systemsconsistentwith probabilisticmodels.

8

4.1 Conjunctive Normal Form

Any Booleanquerymay beexpressedin eitherof the commonnormalizedforms forBooleanexpressions:Conjunctive Normal Form (CNF), or Disjunctive Normal Form(DNF). CNF representsthe conjunctionof disjunctions,that is, a seriesof “anded”componentswith thesecomponents,in turn,consistsof the“oring” of individualterms(or thenegationsof theseterms.)Any Booleanexpressioncanbeconvertedinto CNF[KK84]. Similarly, a logical expressionin DNF is a disjunctionof conjunctions,a setof “ored” components,whereeachcomponentconsistsof andedterms.ExpressionsinCNFmaybetreatedastheconjunctionof statisticallyindependentcomponents[Boo85,LB88] andmayappearmore“natural” in somecircumstances,althoughDNF appearsto beappropriatein someothercircumstances[Cro86].

An expressionin DNF suchas2<;>=�?�@BA"CD;FEGA>H$IKJ�@�5L=MEN2O�PA��Q@8A"CD;FE�A>H'IRJ�@�5maybetransformedinto CNFby regroupingthetermsandBooleanoperators2O;S=T?�@U=�EV�6A��Q@M5WA"CD;FE�A>H'IRJ�@�Anotherexpression, CD=T�.2<;>=�?�@BA"CD;X�6A��Q@M5$�maybetransformedinto a queryin CNF:2YCD=T�W;>=�?�@�5Z=ME[2\C]=��^�6A��Q@M5$By convertingaBooleanexpressionof any complexity into oneof thesenormalforms,a rankingof documentsusingtheseprobabilisticmethodsmaybeeasilyimplementedthroughthesimplecombinationof themethodsfor theBooleanor, not, andand. Weassumebelow that all our querieshave beenconvertedto CNF, thussimplifying thetypesof operandseachof ourBooleanoperatorsmustaccept.

4.2 Boolean“and”

Theuseof theBooleanand maybeemulatedin aprobabilisticretrieval systemthroughtheuseof joint probabilitiesandassumingspecifictermdependencies.Assumea userdesiresto retrieve documentsaboutpoodles and allergies, with thosedocumentsnotcontainingboththesetermsbeingaccordeda secondaryplacein a rankedlist of docu-ments.Thesameorderingmaybedesiredandobtainedusinga probabilisticretrievalsystemif the joint probabilitiesof poodles andallergies areestimatedin sucha waythat thedocumentsareorderedso that thosewith both termsaretreatedasonesetofdocuments,andthosewithout both termsareretrievedafterward. It is alsonecessarythata probabilisticsystemtreatdocumentswith eitheror neitherof theterms(but not

9

with bothterms)asthoughthey hadthesamerankingvalue.Therankingmustbe:

Term 1 Term 21 11 00 1 (these3 aretreatedthesame.)0 0

For a given �_�]�$� and �'&(� thereis a unique �() suchthat the orderingrequiredby theBooleanand is obtained.It maybecomputedusingEquation9, thevaluefor �:) whichproducesthe rankingof documentsdescribedabove for the Booleanand, given thevalues� , � , and �'& .

Figure2 shows the level of performanceobtainedwhencomparinga querysuchasX andY with a probabilisticquerycontainingthesametwo terms,giventhat the �()valuesarecomputedsothatthey areconsistentwith theassumptionsdescribedabove,i.e. Equation9. In thisfigure,theresultsbeingcomparedfor agiven �'& usetwodifferentvaluesfor �()� The �:) valuefor theBooleanand is computedasabove,while the �() forthetermdependencemodelis chosensothattheASL is minimized.

Therelationshipbetweentheperformanceobtainedwith theBooleanand andthetermindependencemodelis shown in Figure3. Thevaluesof �() for bothmodelsarecomputedasdescribedearlier(Equations5and9). Interestingly, thetermindependencemodelis sometimessuperiorto theBooleanmodelandsometimesinferior. The“breakeven” point at which bothproducethesameperformanceis shown graphicallyin thefigureandmaybecomputedalgebraically(Equation10.)

As onewould expect,theseresultsillustratethattheBooleanand is not asgoodasprobabilisticretrieval takingadvantageof all termdependenceinformationexceptin asmallsetof circumstances.At thesametime, theand resultsin performancesuperiorto whatis obtainedassumingtermindependencewhen� is notmuchgreaterthan � andfor lowervaluesof �$&( We arecomparingtheperformanceexpectedfrom two differentretrieval models,eachof which makesassumptionsthat areoften not met. Figure3illustrateswhichmodel,with its assumptions,performsbetterin particularsituations.

TheBooleanand’s performanceis often inferior to whatwe obtainwith a systemconsistentwith term independenceassumptionsbecausethe and model treatsdocu-mentsthesamewhetherthey haveonly onetermor theotheror whenthey haveneitherterm. However, documentswith only oneof thetermsaremorelikely to beof interestto theuserthana documentwith neitherterm. Theprobabilisticmodelcanrankthosedocumentswith only asingletermin commonwith thequerybetweenthosedocumentswith bothtermsandthosewith neitherterm,resultingin performancesuperiorto thatobtainedwith theBooleanmodel.

The term independencemodel is sometimesinferior due to the performanceob-tainedwhenthe � valuedrops.Whenthereis anegativecorrelationbetweenterms(i.e.,

10

0.1

0.11

0.12

0.13p

0.005

0.00750.01

0.01250.015

c

47

48

49

50

51

52

ASL

0.1

0.11

0.12

0.13p

0.005

0.00750.01

0.01250.015

c

Figure2: Themeshedsurfacerepresentstheperformanceobtainedwith theand modeland the unmeshedsurfacerepresentsthe (superior)performanceobtainedwith termdependence.Thevalue � representstheprobabilityfor bothtermsthat they occurin arelevantdocument,assumingdependence,while � is a form of correlationmeasure,theaverageproductof thetermsin all documents( �$&*�5

11

0.10.11

0.12

0.13p

0.0050.0075

0.010.0125 0.015

c

48

50

52

ASL

0.10.11

0.12

0.13p

0.0050.0075

0.010.0125 0.015

c

Figure3: Theperformanceof theBooleanand is displayedasa meshedsurface,withthetermindependencemodeldisplayedasaplainsurface.

12

�'&`0a�(4 ), fewer documentsthathave onetermwill have theother, makingit hardertodiscriminatebetweentherelevantandnon-relevantdocuments.

4.3 Boolean“not”

The Booleannot may be emulatedthroughthe reversalof the probabilisticrankingfor the termin question.For example,processingtheBooleanqueryquilting maybeemulatedby retrievingdocumentsbasedontheprobabilisticmethodappliedto thetermquilting. TheBooleanquerynot quilting maybeemulatedasthereverseof theorderingobtainedwith the queryquilting. This is the sameasorderingthe documentsby theprobability that a documentdoesn’t have the term. Whenqueriesarein conjunctivenormalform, asweareassumingthey are,they will beatmostonetermastheoperandfor eachnot operator, andthuswe needonly addresssingletermsasoperandsfor not.Theperformanceof the term independenceandBooleanmodelswill be thesameforsimplenot operationsbecausethey eachhaveasingletermorconceptusedin documentordering.

4.4 Boolean“or”

The useof the Booleanor alsomay be emulatedby a probabilisticinformationre-trieval system(aswith theBooleanand) throughtheuseof joint probabilitiesandbyassumingspecifictermdependencies.A queryconsistingof two termsconnectedwiththe Booleanor operatorretrievesdocumentshaving eitheroneor both of the terms.The sameorderingmay be obtainedusinga probabilisticretrieval systemif the jointprobabilitiesof the two termsareselectedso that the documentsareorderedso thatdocumentswith either(or both)of the termsaretreatedasonesetof documents,andthosedocumentswith neitherterm constitutea secondset to be retrieved afterward.Therankingmustthenbe:

Term 1 Term 21 11 0 these3 aretreatedthesame0 10 0

As with theBooleanand, thereis a unique �:) for a givensetof parameters.This�() maybe computedusingEquation13, which determinesthe necessaryvaluefor �()whichproducestherankingof documentsdescribedabove for theBooleanor.

Figure4 shows thelevel of performanceobtainedwhencomparingaquerysuchasX orY with a probabilisticquerycontainingthesametwo termsandthat the �() valuesare computedso that they are consistentwith the assumptionsdescribedabove, i.e.Equation13. Two differentvaluesfor �:) areusedin producingthisfigure.The �() value

13

0.1

0.11

0.12

0.13p

0.005

0.00750.01

0.01250.015

c

47

48

49

50

51

52

ASL

0.1

0.11

0.12

0.13p

0.005

0.00750.01

0.01250.015

c

Figure4: Themeshedsurfacerepresentstheperformanceobtainedwith a Booleanorand the unmeshedsurfacerepresentsthe (superior)performanceobtainedwith termdependence.

14

for theBooleanor is computedconsistentwith Equation13, while the �:) for thetermdependencemodelis againchosensothattheASL is minimized.

The relationshipbetweenthe performanceobtainedwith the Booleanor andthetermindependencemodelis shown in Figure5. Performanceusingthetermindepen-dencemodelis sometimessuperiorto theBooleanmodelandsometimesinferior. The“breakeven” point at which bothproducethesameperformanceis shown graphicallyin thefigureandmaybecomputedalgebraically(Equation14.)

As one would expect, theseresultsillustrate that the Booleanor is not as goodasprobabilisticretrieval takingadvantageof all termdependenceinformation. At thesametime, the or sometimesresultsin performancesuperiorto what is obtainedas-sumingtermindependence.

The Booleanor’s performanceis often inferior to what we obtainwith a systemconsistentwith termindependenceassumptionsbecausetheor modeltreatsdocumentsthe samewhetherthey have only oneterm or both terms. The modelcanbe saidto“throw away” theobvious informationthata documentwith two querytermsis morelikely to beof interestto thesearcher. Theprobabilisticmodelsarenot limited by thisconstraintandcantakeadvantageof this information.

5 Retrieval Variations acrossDisciplinesand Queries

Academicdisciplinesvary in how their scholarsexpressideas,varying from highlyquantitative expressionsin mathematicsto complex nominalexpressionsfound in bi-ological literatureto the uniqueand preciseuseof commontermssuchas “truth,”“knowledge,” and“beauty” by philosophers.The differencesbetweenthe languagesusedin academicdisciplines(sublanguages)havebeentheobjectof considerablestudy[Bec87, Bon84, Haa95,LH95, Sag81,Tib92]. Someof thedifferencesbetweendisci-plinesmay be viewedon a spectrumfrom the hardto the soft sciences[LH95]. Thedisciplinesmayalsobeviewedin termsof thosefieldsthatcreateanddonateconceptsandtermsasagainstthosedisciplinesthatarenetborrowers[Los95b].

Using theanalytictechniquesdescribedabove, we have attemptedto examinetheeffectivenessof differentretrieval proceduresondifferentdisciplines.Usingadatabasedevelopedby StephanieHaas[Haa95, LH95], term frequenciesandthe relationshipsbetweentermsarecomputedandexaminedin the light of the requirementsfor eachof severaldifferentretrieval procedures.Thedatabaseconsistsof approximatelytwohundredtitlesandabstractsfrom eachof eightdisciplines:biology, economics,electri-calengineering,history, mathematics,physics,psychology, sociology. For theresearchdescribedin LoseeandHaas[LH95], listsof termswererandomlyextractedfrom eachof the eight databasesandtheir statusassublanguagetermsstudied. Termson theselistsarecategorized(for purposeshere)assublanguagetermswhenthey matchin parta dictionaryentryfrom a subjectdictionaryandarelabeledasgenerallanguagetermsif they do not matchin partanentry in thesubjectdictionary. Sublanguagetermsare

15

0.10.11

0.12

0.13p

0.0050.0075

0.010.0125 0.015

c

48

50

52

ASL

0.10.11

0.12

0.13p

0.0050.0075

0.010.0125 0.015

c

Figure5: Themeshedsurfacerepresentsretrieval performancewith theBooleanor andtheunmeshedsurfacerepresentstheprobabilisticIND model’sperformance.

16

Discipline p �() t �'& Disc.Seeprintedarticlefor tabulardata

Table1: Parametervaluesfor differentdisciplines.Disc. representsthetraditionalINDtermweight(discriminationvalue)computedusingthe � and � values.

consideredhereasa pool of subjectrelatedtermslikely to be found in a query, andthesetermsareusedbelow assamplequerytermsfrom whichall possiblepairsof sub-languagetermsarederived. Thesesublanguagetermpairsmaybeusedin estimating� and �() parameters,while a randomsetof tensof thousandsof termpairswereusedin estimating� and �'& parameters.For this study, termswerestemmedusingthePorter[Por80]stemmingalgorithm.

Table1 presentsthe average� and � valuesfor the eight disciplines,alongwiththe correlation-relatedparameters�() and �'&* The valueon the right handsideof thetableis the discriminationvalueof the “average”term, which is at its maximumformathematicsandsociology, with its low point for biology. The highestvaluesfor �aresimilarly found in mathematicsandsociology. The variable �'& is highestfor thefields of biology and economics. An examinationof a ranked list of the �'& valuesfor termpairsfrom all disciplinesfoundthatof the top twentyfive termpairs,fifteenwerefrom biology andeightwerefrom economics,with only two comingfrom otherdisciplines(psychologyandelectricalengineering.)Becauseof thesmallsamplesizesfor the“sublanguage”terms,wedon’t makestrongclaimsaboutspecificdisciplinesoraboutthecharacteristicsof hardor soft sciences.Theresultsherearemeantto provideexamplesof whatmaybeanticipatedfrom adiversesetof disciplines.

Differentretrieval modelsmaybeevaluatedaftermakingexplicit theassumptionsof eachparticularretrieval model. If the assumptionsare met, suchas thosemadeby Equations9 or 13 for the Booleanand andor, respectively, thenBooleanmodelscanbe describedprobabilistically. Performanceis estimatedby computingthe ASLfor a particularmodelandcomparingtheseASL valuesfor eachmethod,givena setof parameters.WhencomputingtheASL (Equation1), two typesof probabilitiesareused: thosedescribingthe actualdata(unconditionalprobabilities)andprobabilitiesconditioneduponrelevance.Theconditionalprobabilitiesmayreflecttheprobabilitiesasseenby the modelbeingevaluated,while the unconditionalprobabilitiesmay beunderstoodasdescribingtheactualdistributionof datain all documents,as(precisely)predictedby thetermdependencemodel. In caseswherewe know thedegreeof termdependenceacrossall documents,i.e., �$&R� wemaycalculatetheperformanceof amodelassumingthe �() valuesuggestedby the model. This performancemay be comparedwith theperformanceobtainedwith theactual�() value,with termdependence.

Using the assumptionsabove, the type of retrieval model(e.g.,and, or, or IND)onlyeffectstheretrievalperformancemodelthroughtheparameterestimatesfor �:)� For

17

Discipline IND and orSoc 6.56 92.51 0.93Math 8.51 87.89 3.60Econ 10.48 85.44 4.08Bio 10.55 84.92 4.52EE 11.18 85.34 3.48Phys 12.84 83.78 3.39Psych 15.05 81.58 3.37Hist 17.78 78.19 4.03

Table2: Thepercentof randomlyselectedsublanguagetermpairsfor differentdisci-plinesin which the retrieval modelindicatedis best. This assumestheparametersofthemodelareconsistentwith theretrieval model(i.e.,Equations5, 9, or 13).

UseMethod (Assuming) WhenBooleanand Equation9 bdc(e#fg:h^i!j � , Equation11&bdk<lc(e#fVm � , Equation16Booleanor Equation13 bdk<lg:h^i!j � , Equation15&bdk<lc(e#f j � , Equation16TermWeighting(IND) Equation5 b k<lg:h^i m � , Equation15&b c(e#fg:h^i m � , Equation11

Figure6:

eachsetof valuesfor �'&(�n� and� , thereis a �() valuethatprovidesoptimal performance.Using anothervalue for �:) , suchasone likely to be assignedby oneof the variousretrieval models,may provide performancesignificantly below that which could beobtainedwith optimalmodels.Examiningthedifferencebetweenretrievalperformancewith theactualparameter�:) andtheparameterasestimatedby themodelcanprovideanindicatorof theperformancelevel usingamodel’sparameterestimateof �() .

Table2 showsthepercentof randomlyselectedsublanguagetermpairsthatarebestretrievedwith eachof thethreeretrieval models,giventhattheassumptionsabout�() foreachof themodels(e.g.Equations9or13)hold. Thevariable� is thedisciplineaverageshown in Table1 andthevalues�'& and � aretakenfrom theindividual termpairs.Thedecisionrulesusedareprovidedin Figure6. Thevalue bdo`poUq denotesthedifferenceinperformancebetweenretrieval model andmodel � Whenthis differenceis positive,thefirst modelhasa higher(worse)performancethanthesecondone,while whenit isnegative, performancewith the secondmodelexceedsthe first, measuredin termsof

18

ASL, andthefirst modelis thebetterperformer. Thesearethe rulesusedto producetheresultsshown in Table2.

As anexampleof usingthesedecisionrules,let usassumethatbothtermshave �r�-� Onemaythencalculatethe b values(andthebettermodel,indicatedin parentheses)for thesetwo cases: �s�t�Zuv��� �s�t�_uw��"�b �'&Z�t� 4 �$&D�v� 4 uv���b�c:exfgQh^i uF���������y (IND) uF����" ���y (IND)b kKlgQh^i z ���������� (or) uF���������� (IND)b�kKlc:exf uF��������"� (or) uF��������y (or)

Whenthedegreeof dependencebetweentermsis increased(moving from theleft col-umn to the right), alongwith an increasein the discriminationvalueof the terms,itbecomesclearthattheretrieval performanceusingtheBooleanand dropscomparedtothatobtainedwith thetermindependencemodelor theBooleanor. HeretheBooleanor is superiorto theand.

Usingadifferentapproachwith theeightdatabasesdescribedaboveandwith thepa-rametervaluesshown in Table1,wefind thatwhenall theassumptionsof eachBooleanmodelaremet,andignoringEquations5, 9, and13, the resultsshown in Table3 areobtained.Heretheactualparametersfor thedataareusedandthedocumentgroupingsimposedby thedifferentmodelsaremethere.Insteadof usingtheestimatesof �:) forthe Booleanmodels,provided by Equations5, 9, and13, the averageexperimentallydeterminedvaluefor �() for eachdisciplineis used.Using thedisciplineaveragesfor�() insteadof thevaluesfor eachindividualpairmayhaveasignificantinfluenceon theperformanceresultsobtained.

WhenretrievingusingtheBooleanor, for example,thevaluesfor parameters�D�n�$�'�()and �$& areobtainedfrom thedata,unlike theearliertechniquesusedin producingTa-ble2 thatgenerated�:) sothatit fit certainassumptions.Documentswith or arerankedso thatdocumentswith either(or both) of the querytermsaretreatedasbeingof thesamerank,while thosedocumentswith neithertermarerankedafter thesefirst docu-ments.This is differentthanthemethodusedin producingTable2, which usedEqua-tions9 and13to estimate�() for theBooleanmodels,andall differentdocumentprofilesaretreatedasunique.Whenthevaluesfor parametersmeettherequirementsof Equa-tions9 or 13, rankingobtainedfrom a dependencebasedprobabilisticsystemwill bethesameaswith a Booleansystembecauseof theparametervalues.Thedatain Table3 arenot producedby assumingthatEquations9 or 13 hold but, instead,the iteration(summation)throughthe documentsis modifiedso that the propergroupingsareob-tained(requiredfor theprobabilisticmodelto providerankingidenticalto thatprovidedby asystemusingtheBooleanmodel.)ThismodificationcomputestheASL (Equation1), collapsingcertainsetsof termstogetherfor specificBooleanmodels.

19

Discipline IND and orfboxSeeprintedarticlefor tabulardata

Table3: Thepercentof randomlyselectedsublanguagetermpairsfor differentdisci-plinesin which the model indicatedis the mostappropriateretrieval model. All therequirementsof eachmodelaremet.

6 Summary and Conclusions

Severaldecisionmakingtechniqueshave beensuggestedfor thepractitionerwho hasthe option of choosingwhich searchlanguageor retrieval model to usefor a givendatabase,in situationssuchaswhena databaseis searchablethroughmorethanonevendor’s searchengine.Theindividual searchermayselectthesearchmechanismex-pectedto performbest,or anautomatedsystemmaymake thedecisionfor theuserorassisttheuser.

Themostimportantrecommendationwecanmaketo searchersis thatsystemscon-sistentwith termdependencemodelsshouldbeused,wheneverpossible.While perfor-manceis improvedby takingadvantageof thestatisticaldependencebetweentermfre-quencies,it maybecomputationallytooexpensivein many instancesfor systemdesign-ersandimplementorsto incorporatetermdependenceasanoption[Los94a, LBY86]

Searchersneedto carefullyconsidertheassumptionsthataremadeby theBooleanmodels.For example,in thecaseof and, all documentswithout both the termsbeingandedareassumedto have thesamechanceof beingof interestto theuser, which isusuallya badassumptionto make. Similarly, theBooleanor treatsall documentswitheitherof theoredtermsasbeingidenticalin importance,which is alsounrealistic.Ingeneral,documentswith only oneof a pair of termsaresomewhat lessvaluablethana documentwith both termsandarecorrespondinglymorevaluablethana documentwith neitherof the terms.As wasseenabove, thedocumentsareonly equalin value,having differentsetsof characteristics,whenspecificparametervaluesexist, which isseldomthecase.

Table3suggeststhattheBooleanandproducessuperiorperformanceto theBooleanor in many but not in all cases.This superiority, if it canbe shown to hold in otherdatabasesandwith morerealisticquerysets,would suggestthat systemsthat supplytheirown operatorsshouldusuallysupplytheBooleanand ratherthantheor. Expres-sionsin CNF maybemorenaturalfor endusers,who might befurtherencouragedtouseand. This is consistentwith thehistoricalemphasisontheconjunctivenormalformandthefrequentuseof theBooleanand by searchers[Gup87, MC81].

The relationshipbetweenthe Booleanand, the or, and the IND model may beunderstoodin termsof the natureof the correlationsbetweenterms. For example,the Booleanand requiresthat documentswith eitherterm or neitherterm be treated

20

identically by the ranking process. The rankingsof disciplineswith regardsto thedifferentmodelsin Table3 seemsto berelatedto theorderingobtainedwith � or �:) inTable1.

Therankingin Table3 obtainedwith theBooleanor is only equivalentto therank-ing obtainedwith theBooleanand whenthereis aperfectpositivecorrelationbetweenthetwo terms.Whenthis occurs,thepresenceor absenceof onetermwill dictatethepresenceor absenceof theother. Therewill notbeany documentswith only oneof theterms.This providesthesamerankingunderBooleanand andor by effectively elimi-nating all documentsexceptthosewith bothor with neitherterms:boththesetypesofdocumentshave thesamerelative rankpositionunderorderingwith and andwith or.

Booleanretrieval systemscan be studiedanalytically as specialcasesof proba-bilistic retrieval systems.Statementscomposedof a singleBooleanoperatorcanbemodeledprobabilisticallyandcompoundBooleanstatementscanbe treatedasnestedsimplerBooleanexpressionsplacedinto CNF. Differentformsof queriesmaybecom-paredandoptimalityexamined[Los94b].

Due to the differencesin sublanguagesusedin academicfields, we may expectsomedifferencesin the typesof queriesthat arebestsuitedto eachdiscipline. Thedatadiscussedshow thatdisciplinesdo vary in thedegreeto which differentretrievalmodelsshouldbeused.Dueto problemsassociatedwith thesmallnumberof abstractsandthesmallnumberof sublanguagetermsusedin thesample,it would beunwisetomake any generalizationsaboutthosesearchtechniquesmostappropriatefor specificdisciplines.

In summary, situationsin which usingtheBooleanand or theor producethebestresultscanbe analyticallydeterminedthroughthe useof appropriateformulaeusingthe valuesof individual andjoint term probabilitiesandfrequencies.Onemethodofdeterminingthebestmodelto userequiresthatwe take advantageof all the informa-tion availableaboutthe databaseandthe statisticalevidenceabouttermrelationshipsfor thosetermsin thequery. Usingthis information,andgiventheoptionof thesamedatabasebeingavailableon multiple systems,onewith a Booleansearchmechanismsandonewith probabilisticsearchcapabilities,the systemwith the bestexpectedre-trieval performancemaybeselected.While thetermdependencemodelwill almostal-waysbesuperior, it is seldomthecasethatthis modelis availableonexistingsystems.Instead,onemustoftenchoosebetweensystemsconsistentwith simpletermindepen-dencemodelsandsystemsconsistentwith Booleanassumptions.Using the methodsdescribedhere,it becomespossibleto estimatewhich methodwill besuperior, givenactualor estimatedparametervalues.

A Appendix – Computing Retrieval Performance

TheAverageSearchLength(ASL) maybecomputedin differentways,dependingonthedifferentretrieval assumptionswith which it is desiredthat theASL beconsistent

21

[Los95a, Los96]. For our two term model,both termsareassumedto have the sameprobabilityof occurrence( � ) in all documentsandthesameprobabilityof occurrence(� ) in relevantdocuments.WemaycomputetheASL thus{}|�~ ��� ����*�������*����� 2<; �n��� JT�O5}���� l*�Kf�� �����n���*���r� 2O; � 5_u ��� 2<; � 5 ��`�� uv���� (1)

where��EGJT;Y2OIR5 is thepredecessorof profile IQ� theiterationproceedsin theorder ���xT���'����'�����and � is thenumberof documentsin thedatabase.This assumesthatboth termsarepositive discriminators,makingthe orderprogressfrom documentswith the termstodocumentswithout theterms.

Theprobabilities(in Equation1) thatadocumentoccursin eitherrelevant( �r� 2O; �n� � JT�O5 )or in all documents( �r� 2O; � 5 ) maybecomputedusingtheBahadurLazarsfeldExpansion(BLE), whichmaybeusedto estimate(or computeexactly)probabilitiesof termpairs,triples,etc.Describedfor thegeneralcasein [YBLS83, Los94a],we considerheretheBLE for the two term case,that is, whenwe wish to compute��� 2¡ Z��¢�56 For this twotermcase,theBLE takesthefollowing form:�¤£S2: z �\5 �*¥ £T��¦"2( z ��5 �*¥ ¦[§KUu,¨ 2n2©  z �Y5P2O¢ z ��5Q5x2¡  z �Y5P2O¢ z ��5Q5�ª��2: z �Y5P2: z ��5 « (2)

where� and � aretheprobabilitiesthatthefirst andsecondtermshavetermfrequencies  and ¢ of �� andwhere�8� ¨ 2n2OA�5P2<H65n5 is theexpectedvaluefor theproductA and H .If thetermsareindependent,theprobabilitythattwo termsoccurwith frequencies  and ¢ , respectively, is � £ 2( z �\5 �*¥ £ � ¦ 2: z ��5 �*¥ ¦ (3)

Adding the assumptionthat the termsaredependentrequiresthat the following com-ponentbeaddedto (3):� £ 2: z �Y5 �*¥ £ � ¦ 2( z ��5 �*¥ ¦ 2©  z �Y5P2O¢ z ��5 ¨ 2n2©  z �Y5P2O¢ z ��5Q5n5�ª�¤2( z �Y5x2( z ��5 (4)

Notethatthefractionon theright is aconstantfor all documentprofiles.Whencomputingtheprobabilitiesof termpairsoccurringconsistentwith theterm

independencemodel,it is necessaryto estimateor compute�r� 2O; �n� � JT�O56� using �() asthe� component,the numeratorof the fractionon the right handsideof Equation4, andto compute�r� 2<; � 5 with the � componentrepresentedby �'&( Givenknowledgeof �'& , wemaycompute�() consistentwith theindependenceassumptions�6¬©­^®) �¯� 4 (5)

The true valuesof �D�n�$� and �'& areusedwhen �() is estimated.Using this estimateof

22

�() , thesamerankingis obtainedwith thetermdependencemodelaswouldbeobtainedusingthetraditionaltermindependencemodel.

B BooleanExpressions& Probabilistic Ranking

The rankingobtainedwith Booleanexpressionsmay be emulatedprobabilisticallyifa documentweightingsystemis usedthat is consistentwith probabilisticmodelsofretrieval andincorporatesfeaturedependencies.Using theanalyticmodelof retrievalperformance,we canlearnthosesituationswherea Booleanqueryor Booleansystemwould be superiorto usinga probabilisticsystemandwhenthe probabilisticsystemwouldbesuperiorto theBooleansystem.

We first considerthesimplestform of logical expression,a singletermsuchas  Z�with nooperators.Therankingof documentsusingtheBooleanexpression  retrievesfirst documentswith  Z� followedby documentswithoutterm  Z WedenotethisBooleanrankingas °s2¡ �5 This samerankingis obtainedby a probabilisticretrieval modelwithasingleterm.Theprobabilisticrankingbasedonprobabilisticweight �±2¡ �5 is denotedas ²�³n��2¡ �5Q´¤ For our purposes,we will denotethis equivalent rankingby the µ·¶symbol,thus °s2¡ �5�µ�¶ ²v³��±2¡ �5(´ª (6)

B.1 Negation

The simplestlogical operatoris negation. A Booleanquerysuchasnot x (denotedas ¸L  ) simply requiresthat the rankingbe reversedfrom that obtainedwith  Z Theprobabilisticrankingis simply the inverserankingfrom thatobtainedwith ��2¡ �5 , thatis, therankingobtainedwith ��2K¸L �5�� �±2© �56 Rankingwith therule of logical negationis thus °s2<¸L �5.µ·¶ ² ³ �±2© �5 ´ (7)

In somecircumstances,anoddsformulationmaybemoremathematicallytractableor useful.Onemight thencompute°+2K¸L �5º¹ f(fQ»µ�¶ ² ³ 2( z ��2¡ �5n5ª�G�±2¡ �5 ´ Similarly, theoddsformulationof equation6 is°s2© �5 ¹ f(fQ»µ·¶ ² ³ 2<��2¡ �5n5ª�`2* z �±2¡ �5Q5 ´

Theuseof suchoddsbasedformulaeaddresssomeof theconcernsthathave beenraisedaboutterm independencemodels[Coo95,RSJ76]. Theseconcernsarenot anissuein modelsthatassumeandfully computetermdependence.

23

B.2 Conjunction

Theretrieval characteristicsof a Booleansystempresentedwith a Booleanquerycon-sistingof the conjunctionof two termsmay be emulatedby the rankingprovided byprobabilisticretrieval if oneacceptstherule of ranking conjoined features:°s2© d¼½¢¤5rµ·¶ ² ³ �±2© Z��¢¤5 ´ (8)

In thisexpression,�±2© Z��¢�5r� �r� 2 � JT� �  Z��¢�56 Theoddsform of Equation(8) is°s2© �¼º¢�5rµ�¶ ²¿¾ �±2© Z��¢¤52: z �±2© Z��¢¤5:5�À Theextensionof theserankingsbeyondtwo termsis simple.

The rankingprovided by °s2¡ Á¼º¢�5 may be representedin a probabilisticrankingby usingthe joint probability �±2¡ Z�n¢¤5 , that is, the probability that   and ¢ will occurtogether. While this approachis intuitively appealing,it bringsup a problemthatoc-curswith multiterm systems.Emulatingthe rankingof the Booleansystemwithin aprobabilisticsystemrequiresthat thereonly be two typesof documents,thosewithboth   and ¢ , andthosewith either oneor theotheror neither. It thusbecomesnec-essaryto find a probabilisticformulationconsistentsuchthat ²v³��±2© ½�¿���¢��v�"5:´��² ³ �±2© ½�Â����¢��Ã�5 ´ ��² ³ �±2© ½������¢d����5 ´ÃÄ�Ų ³ �±2¡ s�,��n¢d�,�5 ´ This is ob-tainedif we assumethat �±2© ½�¿��n¢��Â�"51�9�±2Æ Ç�����n¢��¿�51�9�±2¡ s�����n¢��Â�"5 Ä���2¡ ½�Ã���¢d�Ã�56 A simplesolutionto this begins with equatingthe probabilitiesofterms  ±�� and ¢È�É , making �±2© ½�¿���¢��Â��5Á�Ê��2¡ ½�v����¢d�Ã�56 We thende-terminea degreeof dependencebetweenthe two termssuchthat ��2¡ ½�Ã���¢d�v�"5X���2¡ ½�v����¢d�v�"5d�Ë�±2© ½�v����¢d�Ã�56 Obviously, this canonly be true if the secondorderdependencecan“counter” thechangein oneof theotherparameters.

Whencomputingthe term dependencefor term pairs in the Booleanmodelsde-scribedin the body of the paper, it becomesnecessaryto compute�r� 2O; �Q� � JT�O5 using�() asthe � component,describedin AppendixA, andto compute�r� 2<; � 5 , wherethe �componentis representedby �'&( Beginningwith knowledgeof �'& in theBooleanand,wemaycompute�() consistentwith theaboveassumptionsas:�$Ì6Í#Î) �¿.u z .u��'&ªuÏ� z �$&Ð� z � (9)

Whenoneranksdocumentsassumingtermdependence,the truevaluesof �D�n�$� and �'&areused,and �() is estimatedasin Equation9, giving the samerankingaswould beobtainedwith aBooleanqueryusingtheand operator.

Assumingthis methodfor computing �() , one cancomputethe breakeven pointbetweentheBooleanand andthetermindependencemodelassumingEquation5, that

24

is, wherethetwo modelsproducethesameretrieval ordering.We denotethisas�6Ñ�Ò*ÓQÔÕ<Ö�× � � � z �[uØ� 4 uÈ�Zu� $�¤� z �P� 4 � z M� 4 u� 6� 4 � 4 z � z M�_u� 6��� (10)

More generally, the �'& value neededwhen a difference b�c(e#fg:hLi in the “raw ASL” onthe unit interval from � to (that is, the ASL beforeit is multiplied by the numberof documentsand �� added)is desiredbetweentheraw ASL obtainedwhenusingtheBooleanand andthe raw ASL obtainedwhenusingthe termindependencemodel,iscomputedas� Ñ ÒRÓQÔÕOÖ�× � z �[uØ� 4 u¯�Du� $�¤� z �6� 4 � z M� 4 u� 6� 4 � 4 u� Gb c(e#fg:h^iØz M�:b c:exfgQh^i z � z M�Zu� 6�¤� (11)

Notethatotherkindsof break-evenpointsmaybecomputedconsistentwith differentassumptions.

B.3 Disjunction

Whenwe saythat   is trueor ¢ is true,we canalsosaythat it is not thecasethat   isfalseandthat ¢ is falseat thesametime, that is, bothcan’t befalseif   is trueor ¢ istrue.Wecanthussaythat 2© BÙF¢�5r�Ú¸`2K¸L U¼N¸Û¢�56 Therule of ranking disjoined featuresis: °+2¡ ÁÙº¢¤5rµ·¶ ² ³ ��2¡ ½������¢����"5 ´ (12)

Usingmethodssimilar to thoseusedwith theBooleanand, we maycomputethatwhentheBooleanor operatorconnectsthetwo terms,the �() valuemustbe�'ÜPÝ) �v�$&Ð�\�M� (13)

Assumingthat �:) is thisvalue(for thenext threeequations),thebreakevenlevel ofperformancebetweena queryusingtheBooleanor andthetermindependencemodelis � Ñ]ÞOßÕOÖ�× � � �!�¤� (14)

Given b�kKlg:hLi , thedifferencein theraw ASL performancebetweentheBooleanor andthetermindependencemodel,we maycompute�'& as� Ñ]ÞOßÕOÖ�× � �P2Ð� 4 z 6� 4 �_u� �b kKlg:hLi 5� z 6�¤� (15)

Usingasimilartechnique,the �$& valuemaybecomputed,giventhe( bdk<lc(e#f ) by whichtheraw ASL for theBooleanor exceedstheraw ASL for theBooleanand, andgiventhat �() is computedconsistentwith theassumptionsof Equations9 or 13 in thecaseof

25

theBooleanand or or, respectively. We compute�'& thus:� Ñ Þ<ßÒ*ÓQÔ � �P2 z �Áu¯�Du� 6�¤� z M� 4 z �b kKlc(e#f u� M�:b kKlc:exf 5z �Áu¯�Du� 6��� z M� 4 (16)

Usingtheseequationsto determineb valuesallows oneto analyticallydeterminethedegreeof performanceincreaseor decreasethat would be obtainedin circumstancesconsistentwith theassumptionsdescribedearlierby movingfrom onemodeltoanother.

26

C List of Variables� JT� Of useto theuser.� Probabilitya termoccursin a relevantdocument.� Probabilitya termoccursin a document.� Expectedfrequency of theproductof two terms.�$& � for thesetof all documents.�:) � for thesetof relevantdocuments.b o poUq Differencein “raw ASL” betweenretrieval models,à � z à 4 .°+2¡ �5 Theorderedsetof documentsproducedby Booleanquery   .�±2© �5 Theprobabilisticvalueattachedto adocumentgivenquery   .²v³��±2© �5:´ Therankingbasedon thesetof probabilisticvaluesattachedto doc-umentsgivenquery   .à � µ�¶ à 4 Method à � producesthesamedocumentrankingasmethodà 4 .

References[BCB94] B. T. Bartell,G. W. Cottrell,andR. K. Belew. Automaticcombinationof multiple

ranked retrieval systems.In ACM Annual Conference on Research and Develop-ment in Information Retrieval, pages173–181,New York, 1994.ACM Press.

[BCCC93] N. J. Belkin, C. Cool, W. B. Croft, andJ. P. Callan. Theeffect of multiple queryrepresentationson informationretrieval perforamnce.In ACM Annual Conferenceon Research and Development in Information Retrieval, pages339–346,New York,1993.ACM Press.

[Bec87] Tony Becher. Disciplinarydiscourse.Studies in Higher Education, 12(3):261–274,1987.

[Bon84] SusanBonzi.Terminologicalconsistency in abstractandconcretedisciplines.Jour-nal of Documentation, 40(4):247–263,1984.

[Boo82] AbrahamBookstein.Explanationandgeneralizationof vectormodelsin informa-tion retrieval. In Research and Development in Information Retrieval: Proceed-ings of the 5th International Conference on Information Retrieval, pages118–132,Berlin, 1982.Springer-Verlag.

27

[Boo83] AbrahamBookstein.Informationretrieval: A sequentiallearningprocess.Journalof the American Society for Information Science, 34(4):331–342,September1983.

[Boo85] AbrahamBookstein. Implicationsof Booleanstructurefor probabilisticretrieval.In Proceedings of the Eigth Annual International ACM SIGIR Conference on Re-search and Development in Information Retrieval, pages11–17.ACM Press,June1985.

[BR84] J.D. Bovey andS.E.Robertson.An algorithmfor weightedsearchingonaBooleansystem.Information Technology, 3(2):84–87,April 1984.

[CL68] C.K. Chow andC.N. Liu. Approximatingdiscreteprobability distributionswithdependencetrees. IEEE Transactions on Information Theory, IT-14(3):462–467,May 1968.

[Coo95] William S.Cooper. Someinconsistenciesandmisidentifiedmodelingassumptionsin probabilisticinformationretrieval. ACM Transactions on Information Systems,13(1):100–111,January1995.

[Cro86] W. BruceCroft. Booleanqueriesandtermdependenciesin probabilisticretrievalmodels. Journal of the American Society for Information Science, 37(2):71–77,March1986.

[CY82] D. Chow andClementYu. On theconstructionof feedbackqueries.Journal of theAssociation for Computing Machinery, 29:127–151,January1982.

[EC92] DanielM. EverettandStevenC. Carter. Topologyof documentretrieval systems.Journal of the American Society for Information Science, 43(10):658–673,1992.

[Eva94] RossEvans. BeyondBoolean:Relevanceranking,naturallanguageandthenewsearchparadigm.In Proceedings of the Fifteenth National Online Meeting, pages121–128,Medford,NJ,1994.LearnedInformation.

[FW90] EdwardA. FoxandSheilaG.Winett.UsingvectorandextendedBooleanmatchingin anexpertsystemfor selectingfosterhomes.Journal of the American Society forInformation Science, 41(1):10–26,1990.

[Gup87] PadminiDasGupta.Booleaninterpretationof conjunctionsfor documentretrieval.Journal of the American Society for Information Science, 38(4):245–254,July1987.

[Haa95] StephanieW. Haas.Domainterminologypatternsin differentdisciplines:Evidencefrom abstracts. In Proceedings of the Fourth Annual Symposium on DocumentAnalysis and Information Retrieval, pages137–146,LasVegas,NV, April 1995.

[KK84] William KnealeandMarthaKneale.The Development of Logic. ClarendonPress,Oxford,1984.

28

[LB88] RobertM. LoseeandAbrahamBookstein.IntegratingBooleanqueriesin conjunc-tive normalform with probabilisticretrieval models. Information Processing andManagement, 24(3):315–321,1988.

[LBY86] RobertM. Losee,AbrahamBookstein,andClementT. Yu. Probabilisticmodelsfor documentretrieval: A comparisonof performanceon experimentalandsyn-thetic databases.In ACM Annual Conference on Research and Development inInformation Retrieval, pages258–264,1986.

[Lee94] JoonHo Lee. Propertiesof extendedBooleanmodelsin informationretrieval. InACM Annual Conference on Research and Development in Information Retrieval,pages182–190,New York, 1994.ACM Press.

[Lee95] JoonHo Lee. Combiningmultipleevidencefrom differentpropertiesof weightingschemes.In ACM Annual Conference on Research and Development in InformationRetrieval, pages180–188,New York, 1995.ACM Press.

[LH95] RobertM. LoseeandStephanieW. Haas. Sublanguageterms: Dictionaries,us-age,andautomaticclassification.Journal of the American Society for InformationScience, 46(7):519–529,1995.

[Los88] RobertM. Losee.Parameterestimationfor probabilisticdocumentretrieval mod-els. Journal of the American Society for Information Science, 39(1):8–16,January1988.

[Los91] RobertM. Losee. An analytic measurepredictinginformation retrieval systemperformance.Information Processing and Management, 27(1):1–13,1991.

[Los94a] RobertM. Losee.Termdependence:TruncatingtheBahadurLazarsfeldexpansion.Information Processing and Management, 30(2):293–303,1994.

[Los94b] RobertM. Losee.Upperboundsfor retrieval performanceandtheirusemeasuringperformanceandgeneratingoptimal Booleanqueries:Canit get any betterthanthis? Information Processing and Management, 30(2):193–203,1994.

[Los95a] RobertM. Losee.Determininginformationretrieval performancewithout experi-mentation.Information Processing and Management, 31(4):555–572,1995.

[Los95b] RobertM. Losee.Thedevelopmentandmigrationof conceptsfrom donorto bor-rowerdisciplines:Sublanguagetermusein hard& soft sciences.In MichaelE. D.Koenig and AbrahamBookstein,editors,Proceedings of the Fifth InternationalConference on Scientometrics and Informetrics, pages265–274,June1995.

[Los96] RobertM. Losee.Evaluatingretrieval performancegivendatabaseandquerychar-acteristics:Analytic determinationof performancesurfaces.Journal of the Ameri-can Society for Information Science, 47(1):95–105,1996.

[LY82] K. Lam andC. T. Yu. A clusteredsearchalgorithmincorporatingarbitrarytermdependencies.ACM Transactions on Database Systems, 7:500–508,1982.

29

[MC81] KarenMarkey andPaulineAthertonCochrane.ONTAP: An Online Traininga ndPractice Manual for ERIC Database Searchers. ERIC Clearinghouseon Informa-tion Resources,SyracuseU., NY, 2ndedition,October1981.

[Moo93] SungBeenMoon.Enhancing Retrieval Performance of Full-Text Retrieval SystemsUsing Relevance Feedback. PhD thesis,U. of North Carolina,ChapelHill, NC,1993.

[Nie89] Jianyun Nie. An informationretrieval modelbasedon modallogic. InformationProcessing and Management, 25(5):477–491,1989.

[Por80] M. F. Porter. An algorithmfor suffix stripping. Program, 14(3):130–137,July1980.

[Rad79] TadeuszRadecki.Fuzzysettheoreticalapproachto documentretrieval. Informa-tion Processing and Management, 15:247–259,1979.

[Rad82] TadeuszRadecki.A probabilisticapproachto informationretrieval in systemswithBooleansearchrequestformulations. Journal of the American Society for Infor-mation Science, 33(6):365–370,November1982.

[Rob77] StephenE. Robertson.Theprobabilityrankingprinciple in IR. Journal of Docu-mentation, 33(4):294–304,1977.

[RSJ76] StephenE. Robertsonand KarenSparckJones. Relevanceweightingof searchterms.Journal of the American Society for Information Science, 27:129–146,1976.

[RT90] S.E.RobertsonandC.L. Thompson.Weightedsearching:theCIRT experiment.InInformatics 10: Prospects for Intelligent Retrieval, pages153–166.ASLIB, Lon-don,1990.

[RTMB86] S. E. Robertson,C. L. Thompson,M. J. Macaskill,andJ. D. Bovey. Weighting,rankingandrelevancefeedbackin a front-endsystem.Journal of Information Sci-ence, 12(1/2):71–75,1986.

[Sag81] NaomiSager. Informationstructuresin texts of a sublangauge.In Proceedings ofthe 44th ASIS Annual Meeting, pages199–201,White Plains,NY, 1981.Knowl-edgeIndustryPublications.

[Sal84] GerardSalton.Theuseof extendedBooleanlogic in informationretrieval. Techni-cal ReportTR 84–588,CornellUniversity, ComputerScienceDept.,Ithaca,N.Y.,January1984.

[Sme84] A. F. Smeaton.Relevancefeedbackanda fuzzy setof searchtermsin an infor-mation retrieval system. Information Technology: Research and Development,3(1):15–23,January1984.

[Spi95] AmandaSpink.Termrelevancefeedbackandmediateddatabasesearching:Impli-cationsfor informationretrieval practiceandsystemsdesign.Information Process-ing and Management, 31(2):161–171,1995.

30

[Tib92] HelenR. Tibbo. Abstractingacrossthedisciplines:A contentanalysisof abstractsfrom the naturalsciences,the social sciences,and the humanitieswith implica-tionsfor standardizationandonlineinformationretrieval. Library and InformationScience Research, 14:31–56,1992.

[Tur94] HowardTurtle. Naturallanguagevs.Booleanqueryevaluation:A comparisonofretrieval performance.In ACM Annual Conference on Research and Developmentin Information Retrieval, pages212–220,New York, 1994.ACM Press.

[VR77] C.J.VanRijsbergen. A theoreticalbasisfor useof co-occurrencedatain informa-tion retrieval. Journal of Documentation, 33(2):106–119,June1977.

[YBLS83] ClementT. Yu, Chris Buckley, K. Lam, andGerardSalton. A generalizedtermdependencemodel in information retrieval. Information Technology: Researchand Development, 2(4):129–154,1983.

31