
December 16, 2003 INEX 2003 -- Ray R. Larson

Cheshire II at INEX 2003: Component and Algorithm Fusion

Ray R. Larson
School of Information Management and Systems
University of California, Berkeley

Overview

• Cheshire II feature overview
  – Logistic Regression Ranking and Boolean Operations
• Additions from INEX '02
  – XML Schemas and Element Retrieval
  – CORI, Okapi BM-25 ranking algorithms
  – Result Set sorting, merging and ranking operators
• Evaluation Results

Overview of Cheshire II

• It supports SGML and XML with components and component indexes
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol, support for SRW, OAI, SOAP, SDLIP also implemented
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports "nearest neighbor" searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and (new) Python
• Store SGML/XML as files or “Datastore” database

XML Element Extraction

• A new search “ElementSetName” is XML_ELEMENT_
• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
• The matching elements are extracted from the records matching the search and delivered in a simple format.
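The effect of an element-name request can be sketched in a few lines of Python. This is only an illustration of the idea, not the Cheshire II implementation: the function name `extract_elements` and the sample record are hypothetical, and only the plain element-name case is covered (not XPath or regular expressions).

```python
import xml.etree.ElementTree as ET


def extract_elements(record_xml, element_name):
    """Return (path, element) pairs for every element named element_name.

    Hypothetical helper: paths are built with 1-based positions among
    same-named siblings, in the style of /USMARC[1]/.../Fld245[1].
    """
    root = ET.fromstring(record_xml)
    matches = []

    def walk(node, path):
        if node.tag == element_name:
            matches.append((path, node))
        seen = {}  # tag -> how many children with that tag so far
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return matches


if __name__ == "__main__":
    # Made-up record, loosely shaped like a MARC-derived XML document.
    record = (
        "<USMARC><VarFlds><Titles>"
        "<Fld245><a>Alice</a></Fld245>"
        "</Titles></VarFlds></USMARC>"
    )
    for path, elem in extract_elements(record, "Fld245"):
        print(path, ET.tostring(elem, encoding="unicode"))
```

Each match is reported together with its path, so a client can tell which part of the record an extracted element came from.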

XML Extraction

% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245></ITEM><RESULT_DATA> … etc…

Boolean Search Capability

• All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”

Probabilistic Retrieval

• Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
  – zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  – zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks

Probabilistic Retrieval: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

for the 6 X attribute measures shown on the next slide. Note that we did NOT retrain the coefficients this year.

Probabilistic Retrieval: Logistic Regression attributes

$$
\begin{aligned}
X_1 &= \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j} && \text{Average Absolute Query Frequency} \\
X_2 &= \sqrt{QL} && \text{Query Length} \\
X_3 &= \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j} && \text{Average Absolute Component Frequency} \\
X_4 &= \sqrt{DL} && \text{Document Length} \\
X_5 &= \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j} && \text{Average Inverse Component Frequency} \\
IDF_{t_j} &= \frac{N - n_{t_j}}{n_{t_j}} && \text{Inverse Component Frequency} \\
X_6 &= \log M && \text{Number of Terms in common between query and Component -- logged}
\end{aligned}
$$
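As a rough illustration, the six attributes and the resulting estimate could be computed as below. This is a sketch under several assumptions, not the Cheshire II code: the function names are hypothetical, query length is taken as the total number of query term occurrences, the two length attributes use square roots, the linear combination is read as log-odds and passed through the logistic function, and the coefficient values are placeholders (the trained TREC coefficients are not given on the slides).

```python
import math


def lr_attributes(query_tf, comp_tf, comp_len, n_components, comp_freq):
    """Compute X1..X6 for one query/component pair.

    query_tf:     term -> frequency in the query
    comp_tf:      term -> frequency in the component
    comp_len:     component length (DL)
    n_components: number of components in the collection (N)
    comp_freq:    term -> number of components containing it (n_t), n_t < N
    """
    common = [t for t in query_tf if comp_tf.get(t, 0) > 0]
    m = len(common)  # M: number of terms shared by query and component
    if m == 0:
        return None  # nothing in common, nothing to rank
    x1 = sum(math.log(query_tf[t]) for t in common) / m
    x2 = math.sqrt(sum(query_tf.values()))  # assumed definition of QL
    x3 = sum(math.log(comp_tf[t]) for t in common) / m
    x4 = math.sqrt(comp_len)
    x5 = sum(
        math.log((n_components - comp_freq[t]) / comp_freq[t]) for t in common
    ) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


def relevance_probability(coeffs, xs):
    """coeffs = [c0, c1, ..., c6]; linear form treated as log-odds."""
    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Placeholder coefficients for illustration only.
    coeffs = [-3.5, 1.0, -0.1, 0.3, -0.02, 0.1, 2.0]
    xs = lr_attributes(
        {"xml": 1, "retrieval": 2},
        {"xml": 3, "retrieval": 1},
        comp_len=100,
        n_components=1000,
        comp_freq={"xml": 10, "retrieval": 100},
    )
    print(xs, relevance_probability(coeffs, xs))
```

Because the logistic function is monotonic, ranking by the raw linear score or by the converted probability gives the same order; the conversion only matters when scores from different methods are merged.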

Combining Boolean and Probabilistic Search Elements

• Two original approaches:
  – Boolean Approach

$$P(R \mid Q, D) = P(R \mid Q_{bool}, D)\, P(R \mid Q_{prob}, D)$$

$$
P(R \mid Q_{bool}, D) =
\begin{cases}
1 & \text{if Boolean eval successful for } D \\
0 & \text{otherwise}
\end{cases}
$$

  – Non-probabilistic “Fusion Search” Set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
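Under the Boolean approach, the Boolean part acts as a 0/1 filter on the probabilistic ranking. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def boolean_filtered_ranking(prob_scores, boolean_hits):
    """Multiply each probabilistic score by the 0/1 Boolean factor.

    prob_scores:  document id -> P(R | Q_prob, D)
    boolean_hits: set of document ids satisfying the Boolean query
    Documents with a Boolean factor of 0 drop out of the ranking.
    """
    combined = {
        doc: score * (1.0 if doc in boolean_hits else 0.0)
        for doc, score in prob_scores.items()
    }
    ranked = [(doc, s) for doc, s in combined.items() if s > 0.0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    scores = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
    print(boolean_filtered_ranking(scores, {"d2", "d3"}))
```

The highest-scoring document can disappear entirely if it fails the Boolean condition, which is the main behavioural difference from a weighted merger of the two result sets.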

Ranking Methods added since INEX '02

• CORI -- From Jamie Callan: Simple implementation of a weighting scheme for distributed search. Very effective for distributed search collection selection. Not used for official INEX runs.
• OKAPI BM-25 -- From Steve Robertson. This now seems to be the “default” retrieval algorithm in experimental IR
• New operators (later) let us mix and match ranking methods and Boolean operations

Okapi BM25

$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \, \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

• Where:
• $Q$ is a query containing terms $T$
• $K$ is $k_1((1 - b) + b \cdot dl / avdl)$
• $k_1$, $b$ and $k_3$ are parameters, usually 1.2, 0.75 and 7-1000
• $tf$ is the frequency of the term in a specific document
• $qtf$ is the frequency of the term in a topic from which $Q$ was derived
• $dl$ and $avdl$ are the document length and the average document length measured in some convenient unit
• $w^{(1)}$ is the Robertson-Sparck Jones weight.

$$w^{(1)} = \log \frac{\left( \dfrac{r + 0.5}{R - r + 0.5} \right)}{\left( \dfrac{n - r + 0.5}{N - n - R + r + 0.5} \right)}$$
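The BM25 formula translates fairly directly into code. The sketch below is an illustration rather than the Cheshire II implementation: the function names are hypothetical, and it assumes that r and R are the relevant-document counts and default to 0 when no relevance information is available, that n is the number of documents containing the term, and that N is the collection size.

```python
import math


def rsj_weight(n, N, r=0, R=0):
    """Robertson-Sparck Jones weight w(1).

    n: documents containing the term, N: documents in the collection,
    r: known relevant documents containing the term, R: known relevant
    documents (both 0 when there is no relevance information).
    """
    return math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )


def bm25_score(query_tf, doc_tf, dl, avdl, df, N, k1=1.2, b=0.75, k3=7.0):
    """Sum the BM25 contribution of every query term found in the document.

    query_tf: term -> qtf, doc_tf: term -> tf, df: term -> n
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # the term contributes nothing to this document
        w1 = rsj_weight(df[term], N)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


if __name__ == "__main__":
    print(bm25_score({"xml": 1}, {"xml": 2}, dl=120, avdl=100,
                     df={"xml": 10}, N=1000))
```

With r = R = 0 the weight reduces to log((N - n + 0.5) / (n + 0.5)), which goes negative for terms that occur in more than half of the collection; implementations often clamp or drop such terms.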

INEX '02 Fusion Search

[Figure: Query Results, Sort/Merge, Final Ranked List]

• Merge multiple resultsets and sort new set
  – Sort by index name/key (ATTRIBUTE)
  – Sort by rank (ELEMENTS)
    • Merges ranked results and Boolean results
  – Sort by XML/SGML Tag contents (TAG)

Merging and Ranking Operators

• Extends the capabilities of merging to include merger operations in queries like Boolean operators
• Fuzzy Logic Operators (not used for INEX)
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
• Containment operators: Restrict components to or with a particular parent
  – !RESTRICT_FROM
  – !RESTRICT_TO
• Merge Operators
  – !MERGE_SUM
  – !MERGE_MEAN
  – !MERGE_NORM
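The slides name these operators without defining them, so the following sketch is based entirely on assumed semantics: MERGE_SUM adds the scores of an item across both result sets, MERGE_MEAN averages them, MERGE_NORM min-max normalizes each set before averaging, and the fuzzy operators use the textbook min/max definitions. The function names are hypothetical and the actual Cheshire II behaviour may differ.

```python
def merge_sum(a, b):
    """Assumed MERGE_SUM: add scores; items in one set keep their score."""
    return {k: a.get(k, 0.0) + b.get(k, 0.0) for k in a.keys() | b.keys()}


def merge_mean(a, b):
    """Assumed MERGE_MEAN: average of the two scores (missing counts as 0)."""
    return {k: v / 2.0 for k, v in merge_sum(a, b).items()}


def normalize(scores):
    """Min-max normalize scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def merge_norm(a, b):
    """Assumed MERGE_NORM: normalize each set, then average."""
    return merge_mean(normalize(a), normalize(b))


def fuzzy_and(a, b):
    """Textbook fuzzy AND: minimum score over items present in both sets."""
    return {k: min(a[k], b[k]) for k in a.keys() & b.keys()}


def fuzzy_or(a, b):
    """Textbook fuzzy OR: maximum score over items present in either set."""
    return {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}


if __name__ == "__main__":
    logistic = {"d1": 2.0, "d2": 1.0}
    okapi = {"d2": 0.5, "d3": 0.25}
    print(merge_sum(logistic, okapi))
    print(merge_norm(logistic, okapi))
    print(fuzzy_and(logistic, okapi))
```

Normalizing before merging matters when the two result sets come from ranking methods with different score ranges, since otherwise the method with larger raw scores dominates the merged list.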

Query Generation - CO

• # 91 TITLE = Internet traffic
• (topicshort @+ {Internet traffic internet, web, traffic, measurement, congestion}) !MERGE_NORM (alltitles @+ {Internet traffic}) !MERGE_NORM (kwd @+ {Internet traffic}) !MERGE_NORM (topicshort @ {Internet traffic internet, web, traffic, measurement, congestion}) !MERGE_NORM (alltitles @ {Internet traffic}) !MERGE_NORM (kwd @ {Internet traffic})
• TARGETPATH = XML_ELEMENT_article
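The generated CO query has a regular shape: one subquery per index and ranking operator, all joined with !MERGE_NORM. A sketch of how such a string could be assembled is shown below. The function `co_query` and its parameters are hypothetical; only the index names, operators, and overall layout are taken from the example for topic 91.

```python
def co_query(title, keywords, indexes=("alltitles", "kwd"), operators=("@+", "@")):
    """Build a CO query string in the style of the topic 91 example.

    For each ranking operator, the topicshort index is searched with the
    title plus keywords, and the remaining indexes with the title alone.
    """
    topic_terms = f"{title} {', '.join(keywords)}"
    parts = []
    for op in operators:
        parts.append(f"(topicshort {op} {{{topic_terms}}})")
        for index in indexes:
            parts.append(f"({index} {op} {{{title}}})")
    return " !MERGE_NORM ".join(parts)


if __name__ == "__main__":
    print(co_query(
        "Internet traffic",
        ["internet", "web", "traffic", "measurement", "congestion"],
    ))
```

Called with the title and keywords of topic 91, this reproduces the query string shown on the slide.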

INEX CO Runs

Query Generation - SCAS

• #66 TITLE = /article[./fm//yr < '2000'] //sec[about(.,'"search engines"')]
• ((date < '2000')) !RESTRICT_FROM ((sec_words @ {"search engines"} !MERGE_MEAN (sec_words {$search engines$})))
• TARGETPATH = XML_ELEMENT_sec

Query Generation -- SCAS

• This run uses logistic regression matching combined with Boolean phrase matching and MERGE_MEAN partial result combinations. FUZZY_AND and FUZZY_OR operators were used in combining AND and OR elements within an "about" predicate. Containment operators were used to constrain component searches within ancestor elements, e.g.:

INEX SCAS Runs

Future Plans

• Bug fixes -- incorrect query generation for some SCAS queries, for example…
  – TITLE = //article[about(.,'security +biometrics') AND about(.//sec,'"facial recognition"')]
  – Submitted: (topicshort @ {security biometrics} !MERGE_MEAN (topicshort @ {biometrics biometrics biometrics biometrics}) ) !FUZZY_AND (sec_title @ {"facial recognition"} !MERGE_MEAN (sec_title {$facial recognition$}))
  – Should have included sec_words and Boolean subquery for biometrics merged with ranked subquery

Future Plans

• Add Language Model ranking for components
• Retrain Logistic Regression coefficients on INEX assessment data -- and experiment with including new variables, such as relative component size
• Find bugs in Okapi BM-25
• Find more bugs ahead of time, and be more consistent in runs!