December 9, 2002 Cheshire II at INEX -- Ray R. Larson Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval Ray R


Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval

Ray R. Larson
School of Information Management and Systems
University of California, Berkeley


Overview of Cheshire II

• It supports SGML and XML
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports “nearest neighbor” searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and (new) Python


SGML/XML Support

• Underlying native format for all data is SGML or XML
• The DTD defines the database contents
• Full SGML/XML parsing
• SGML/XML Format Configuration Files define the database location and indexes
• Various format conversions and utilities available for Z39.50 support (MARC, GRS-1


SGML/XML Support

• Configuration files for the Server are SGML/XML:
– They include elements describing all of the data files and indexes for the database.
– They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.


Indexing

• Any SGML/XML tagged field or attribute can be indexed:
– B-Tree and Hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys and “special keys”
– Mapping from any Z39.50 Attribute combination to a specific index
– Underlying postings information includes term frequency for probabilistic searching
• Component extraction with separate component indexes
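The postings point above can be illustrated with a minimal sketch. This is not Cheshire II code: it uses a plain Python dictionary rather than Berkeley DB, and the `build_postings` name, the whitespace tokenizer, and the sample documents are all assumptions made for illustration. It only shows the idea of keeping a term frequency alongside each posting so that a ranking function can use it later.

```python
from collections import Counter, defaultdict

def build_postings(docs):
    """Map each term to {doc_id: term frequency} (toy in-memory index)."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        # Lowercase and split on whitespace; a real indexer would also
        # apply stemming and field-specific extraction rules.
        for term, tf in Counter(text.lower().split()).items():
            postings[term][doc_id] = tf
    return postings

# Hypothetical sample documents.
docs = {
    "d1": "alice follows the white rabbit",
    "d2": "the rabbit checks the watch",
}
postings = build_postings(docs)
print(postings["rabbit"])
print(postings["the"])
```

Keeping the frequency in the posting itself means a probabilistic search can compute document-side statistics without re-reading the source record.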


Boolean Search Capability

• All Boolean operations are supported
– “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
– “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
– “zfind xtitle Alice#”
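As a rough illustration of the operations listed above, Boolean retrieval can be modelled with set algebra over record identifiers. The sketch below is an assumption-laden toy, not the Cheshire II query parser: the `index` contents, the `lookup` helper, and the `named_sets` dictionary are hypothetical, and grouping is written out explicitly in Python rather than parsed from a `zfind` string.

```python
# Hypothetical toy index: (field, term) -> set of record ids.
index = {
    ("author", "carroll"): {1, 2, 3},
    ("title", "alice"): {1, 4},
    ("subject", "fantasy"): {2, 5},
    ("subject", "mathematics"): {3},
}

def lookup(field, term):
    """Return matching record ids; a trailing '#' means prefix truncation."""
    if term.endswith("#"):
        prefix = term[:-1]
        hits = set()
        for (f, t), ids in index.items():
            if f == field and t.startswith(prefix):
                hits |= ids
        return hits
    return set(index.get((field, term), set()))

# author carroll and (title alice or subject fantasy) not subject mathematics
result = (
    lookup("author", "carroll")
    & (lookup("title", "alice") | lookup("subject", "fantasy"))
) - lookup("subject", "mathematics")

# A stored result set can take part in later Boolean operations by name.
named_sets = {"SET1": result}
refined = named_sets["SET1"] & lookup("subject", "fan#")
print(sorted(result), sorted(refined))
```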


Probabilistic Retrieval

• Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
– zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
– zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
– zfind topic @ government documents and title guidebooks


Probabilistic Retrieval: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

for the 6 X attribute measures shown on the next slide.


Probabilistic Retrieval: Logistic Regression attributes

• Average Absolute Query Frequency: $X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$
• Query Length: $X_2 = \sqrt{QL}$
• Average Absolute Document Frequency: $X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$
• Document Length: $X_4 = \sqrt{DL}$
• Average Inverse Document Frequency: $X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$
• Inverse Document Frequency: $IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$
• Number of Terms in common between query and document, logged: $X_6 = \log M$
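To make the six attribute measures concrete, here is a minimal sketch that computes them for one query/document pair and combines them with the linear form c0 + Σ ci Xi. Several things are assumptions rather than anything stated on the slides: the function names `lr_attributes` and `lr_score`, treating query and document length as total term counts, reading N as the collection size and n_t as the number of documents containing term t, and above all the coefficient values, which are placeholders and not the fitted TREC coefficients.

```python
import math

def lr_attributes(query_tf, doc_tf, doc_freq, n_docs):
    """Return [X1..X6] for one query/document pair, or None if no terms match.

    query_tf, doc_tf: term -> absolute frequency in the query / document.
    doc_freq: term -> number of documents containing the term (n_t).
    n_docs: number of documents in the collection (N).
    """
    common = [t for t in query_tf if t in doc_tf]
    m = len(common)  # M: terms shared by query and document
    if m == 0:
        return None
    ql = sum(query_tf.values())  # assumed: query length = total term count
    dl = sum(doc_tf.values())    # assumed: document length = total term count
    x1 = sum(math.log(query_tf[t]) for t in common) / m
    x2 = math.sqrt(ql)
    x3 = sum(math.log(doc_tf[t]) for t in common) / m
    x4 = math.sqrt(dl)
    # IDF per term is (N - n_t) / n_t, averaged in log space.
    x5 = sum(math.log((n_docs - doc_freq[t]) / doc_freq[t]) for t in common) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]

def lr_score(x, coeffs):
    """c0 + sum(c_i * X_i); coeffs holds c0 followed by c1..c6."""
    return coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))

# Placeholder coefficients for illustration only, not fitted values.
COEFFS = [-3.0, 0.5, -0.1, 0.4, -0.05, 0.3, 1.0]

x = lr_attributes(
    query_tf={"cheshire": 1, "cat": 2},
    doc_tf={"cat": 4, "grin": 1, "alice": 3},
    doc_freq={"cat": 10},
    n_docs=110,
)
score = lr_score(x, COEFFS)
print(x, score)
```

Because every attribute is derived from frequencies already held in the postings, the score can be computed at retrieval time without consulting the original records.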


Combining Boolean and Probabilistic Search Elements

• Two approaches:
– Boolean approach:
$$P(R \mid Q, D) = P(R \mid Q_{bool}, D)\, P(R \mid Q_{prob}, D)$$
$$P(R \mid Q_{bool}, D) = \begin{cases} 1 & \text{if Boolean eval successful for } D \\ 0 & \text{otherwise} \end{cases}$$
– Non-probabilistic “Fusion Search”: the set merger approach is a weighted merger of document scores from separate Boolean and probabilistic queries
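The product form P(R|Q,D) = P(R|Qbool,D) P(R|Qprob,D) means the Boolean part acts as a filter: a document that fails the Boolean evaluation gets probability 0, and one that passes keeps its probabilistic estimate. The sketch below shows that behaviour under stated assumptions; `combined_probability`, `rank_boolean_approach`, and the sample scores are hypothetical names and values, not part of Cheshire II.

```python
def combined_probability(doc_id, boolean_hits, prob_scores):
    """P(R|Q,D) as the product of the Boolean and probabilistic parts."""
    p_bool = 1.0 if doc_id in boolean_hits else 0.0
    return p_bool * prob_scores.get(doc_id, 0.0)

def rank_boolean_approach(boolean_hits, prob_scores):
    """Rank documents by combined probability, dropping zero scores."""
    scored = (
        (doc_id, combined_probability(doc_id, boolean_hits, prob_scores))
        for doc_id in prob_scores
    )
    kept = [(d, p) for d, p in scored if p > 0.0]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical inputs: documents satisfying the Boolean element, and
# probability estimates from the probabilistic element.
boolean_hits = {"d1", "d3"}
prob_scores = {"d1": 0.42, "d2": 0.77, "d3": 0.15}

ranked = rank_boolean_approach(boolean_hits, prob_scores)
print(ranked)
```

Note that `d2` has the highest probabilistic estimate but is excluded, because it does not satisfy the Boolean element.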


Cheshire II Searching

[Figure: Cheshire II searching diagram. Labels: Z39.50, Internet, Images, Scanned Text, Local, Remote.]


INEX Overview

[Figure: INEX overview diagram. Labels: Local Net, UI or Scripts, Map Query, Map Results, INEX Search Engine.]


Fusion Search

[Figure: Fusion Search diagram. Labels: Query, Results, Sort/Merge, Final Ranked List.]
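The fusion search flow (separate queries, their results, a sort/merge step, and a final ranked list) can be sketched as a weighted merger of per-document scores. The slides do not give the weighting scheme, so everything specific here is an assumption: the `fusion_merge` name, the equal weights, and the choice to give every Boolean hit a score of 1.0.

```python
from collections import defaultdict

def fusion_merge(result_lists, weights):
    """Merge several {doc_id: score} result sets into one ranked list.

    Each result set contributes weight * score to a document's total.
    """
    merged = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for doc_id, score in results.items():
            merged[doc_id] += weight * score
    # Sort by descending merged score, breaking ties by document id.
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical results: Boolean hits scored 1.0, plus probabilistic estimates.
boolean_results = {"d1": 1.0, "d3": 1.0}
prob_results = {"d1": 0.4, "d2": 0.8, "d3": 0.1}

final_ranked = fusion_merge([boolean_results, prob_results], weights=[0.5, 0.5])
print([doc_id for doc_id, _ in final_ranked])
```

Unlike the Boolean approach, a document that misses the Boolean query (`d2` here) still appears in the final list, only with a lower merged score.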


Distributed Metadata Servers

[Figure: Distributed Metadata Servers diagram. Labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers.]


Further Information

• Full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
• Project Web Site http://cheshire.berkeley.edu/
• Archives Hub http://www.archiveshub.ac.uk/
