December 9, 2002 Cheshire II at INEX -- Ray R. Larson
Cheshire II at INEX: Using a Hybrid Logistic Regression and Boolean Model for XML Retrieval
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley
Overview of Cheshire II
• It supports SGML and XML
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports “nearest neighbor” searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and (new) Python
SGML/XML Support
• Underlying native format for all data is SGML or XML
• The DTD defines the database contents
• Full SGML/XML parsing
• SGML/XML Format Configuration Files define the database location and indexes
• Various format conversions and utilities available for Z39.50 support (MARC, GRS-1
SGML/XML Support
• Configuration files for the Server are SGML/XML:
– They include elements describing all of the data files and indexes for the database.
– They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
Indexing
• Any SGML/XML tagged field or attribute can be indexed:
– B-Tree and Hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys and “special keys”
– Mapping from any Z39.50 Attribute combination to a specific index
– Underlying postings information includes term frequency for probabilistic searching
• Component extraction with separate component indexes
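The postings idea on this slide can be illustrated with a small in-memory inverted index that records a term frequency for each document. This is only a sketch: Cheshire II stores its indexes in Berkeley DB, whereas the example uses a plain dictionary, and the function name, the whitespace tokenization, and the sample documents are all assumptions made for illustration. No stemming is applied.

```python
from collections import Counter, defaultdict


def build_postings(docs):
    """Map each term to {doc_id: term_frequency}.

    `docs` is a hypothetical {doc_id: text} mapping; tokenization is a plain
    lowercase whitespace split, with no stemming.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            postings[term][doc_id] = freq
    return dict(postings)


# Illustrative documents only.
docs = {
    "d1": "cheshire cat grin cat",
    "d2": "march hare and cheshire cat",
}
postings = build_postings(docs)
```

Keeping the frequency next to each document id is what lets a ranked search read term statistics straight from the index instead of re-scanning the documents.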
Boolean Search Capability
• All Boolean operations are supported
– “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
– “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
– “zfind xtitle Alice#”
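The Boolean queries on this slide map naturally onto set operations over result sets. The sketch below mirrors the two example queries with Python sets. The index contents, the `lookup` helper, and the `named_sets` store are hypothetical, and two readings are assumed rather than taken from the slide: `not` is treated as "and not", and operators without parentheses are applied left to right.

```python
def lookup(index, field, term):
    """Return the set of record ids whose `field` contains `term`."""
    return set(index.get((field, term), ()))


# Hypothetical index: (field, term) -> record ids.
index = {
    ("author", "x"): {1, 2, 3, 4},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 4},
    ("subject", "A"): {4},
}

# author x and (title y or subject z) not subject A
result = (
    lookup(index, "author", "x")
    & (lookup(index, "title", "y") | lookup(index, "subject", "z"))
) - lookup(index, "subject", "A")

# Named sets kept for reuse, standing in for result sets stored on a server.
named_sets = {"SET1": result, "SET2": {9}}

# SET1 and subject widgets or SET2 (read left to right)
widgets = lookup(index, "subject", "widgets")
combined = (named_sets["SET1"] & widgets) | named_sets["SET2"]
```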
Probabilistic Retrieval
• Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
– zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
– zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
– zfind topic @ government documents and title guidebooks
Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:
$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$
for the 6 X attribute measures shown on the next slide.
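A minimal sketch of the retrieval-time estimate follows. The slide writes the estimate as a linear combination of the six attributes. The sketch assumes, as is usual for logistic regression, that this combination is a log-odds value that the logistic function then maps to a probability. The function name and every coefficient and attribute value are placeholders for illustration, not the coefficients trained on TREC data.

```python
import math


def relevance_probability(coefficients, attributes):
    """Logistic-regression relevance estimate.

    coefficients: [c0, c1, ..., c6]; attributes: [X1, ..., X6].
    Assumption: the linear combination is treated as log-odds and mapped
    to a probability with the logistic function.
    """
    if len(coefficients) != len(attributes) + 1:
        raise ValueError("need one intercept plus one coefficient per attribute")
    log_odds = coefficients[0] + sum(
        c * x for c, x in zip(coefficients[1:], attributes)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder values for illustration only, not trained coefficients.
c = [-3.5, 1.0, -0.1, 0.3, -0.02, 0.2, 1.5]
x = [0.7, 3.0, 1.2, 40.0, 2.5, 1.1]
p = relevance_probability(c, x)
```

Because the logistic function is monotonic, ranking documents by the log-odds or by the resulting probability gives the same order.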
Probabilistic Retrieval: Logistic Regression attributes
• $X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$: Average Absolute Query Frequency
• $X_2 = QL$: Query Length
• $X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$: Average Absolute Document Frequency
• $X_4 = DL$: Document Length
• $X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$: Average Inverse Document Frequency
• $IDF = \frac{N - n_t}{n_t}$: Inverse Document Frequency
• $X_6 = \log M$: Number of Terms in common between query and document -- logged
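The six attribute measures can be computed from simple term statistics, as in the sketch below. Several points are assumptions: the function name and input layout are hypothetical, query and document length are counted as total term occurrences, and the lengths are used raw because the slide text does not show whether they are transformed (for example by a square root). The sketch also assumes every shared term occurs in fewer than N documents, so that each IDF value is positive.

```python
import math


def lr_attributes(query_tf, doc_tf, doc_freq, n_docs):
    """Compute [X1..X6] for one query/document pair.

    query_tf, doc_tf: {term: frequency} for the query and the document.
    doc_freq: {term: number of documents containing the term}.
    n_docs: collection size N.
    """
    common = [t for t in query_tf if t in doc_tf]
    m = len(common)
    if m == 0:
        return None  # no shared terms, so the averages are undefined
    x1 = sum(math.log(query_tf[t]) for t in common) / m
    x2 = sum(query_tf.values())  # query length in term occurrences
    x3 = sum(math.log(doc_tf[t]) for t in common) / m
    x4 = sum(doc_tf.values())  # document length in term occurrences
    x5 = sum(math.log((n_docs - doc_freq[t]) / doc_freq[t]) for t in common) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


# Illustrative statistics only.
attrs = lr_attributes(
    query_tf={"cheshire": 1, "cat": 2},
    doc_tf={"cat": 3, "grin": 1},
    doc_freq={"cat": 10, "cheshire": 5, "grin": 2},
    n_docs=100,
)
```

Only the terms shared by query and document contribute to the three averages, which is why a pair with no terms in common yields no attribute vector here.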
Combining Boolean and Probabilistic Search Elements
• Two approaches:
– Boolean Approach:
$$P(R \mid Q, D) = P(R \mid Q_{bool}, D)\, P(R \mid Q_{prob}, D)$$
$$P(R \mid Q_{bool}, D) = \begin{cases} 1 & \text{if Boolean eval successful for } D \\ 0 & \text{otherwise} \end{cases}$$
– Non-probabilistic “Fusion Search”: the set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
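The Boolean approach amounts to using the Boolean result as a 0/1 filter on the probabilistic scores. The sketch below shows that product under stated assumptions: the function name, the input shapes, and the scores are hypothetical, and documents whose combined score is zero are simply dropped from the ranking.

```python
def boolean_filtered_ranking(prob_scores, boolean_matches):
    """Rank documents by P(R|Q_bool,D) * P(R|Q_prob,D).

    prob_scores: {doc_id: probabilistic estimate}.
    boolean_matches: set of doc_ids satisfying the Boolean part.
    """
    combined = {
        doc: (1.0 if doc in boolean_matches else 0.0) * score
        for doc, score in prob_scores.items()
    }
    # Keep only documents that passed the Boolean part, best score first.
    ranked = [(d, s) for d, s in combined.items() if s > 0.0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Illustrative scores only.
prob_scores = {"d1": 0.62, "d2": 0.80, "d3": 0.35}
ranking = boolean_filtered_ranking(prob_scores, {"d1", "d3"})
```

In this example d2 has the highest probabilistic score but is excluded because it fails the Boolean condition.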
Cheshire II Searching
[Figure: local and remote searching over the Internet via Z39.50; labeled content types are Images and Scanned Text.]
INEX Overview
[Figure: labels are Local, Net, UI or Scripts, Map Query and Map Results (each shown twice), and INEX Search Engine.]
Fusion Search
[Figure: labels are Query, Results, Sort/Merge, and Final Ranked List.]
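The sort/merge step of a fusion search can be sketched as a weighted merger of the scores returned by separate queries. The weighted sum used here is an assumed merge rule chosen for illustration, since the slides do not give the exact weighting. The function name, weights, and scores are likewise hypothetical.

```python
def fusion_merge(result_lists, weights):
    """Merge several {doc_id: score} result sets into one ranked list.

    Each score is scaled by its list's weight and summed per document.
    The weighted sum is an assumed merge rule for illustration.
    """
    if len(result_lists) != len(weights):
        raise ValueError("one weight per result list is required")
    merged = {}
    for results, weight in zip(result_lists, weights):
        for doc, score in results.items():
            merged[doc] = merged.get(doc, 0.0) + weight * score
    # Sort by descending score, breaking ties by doc id for stable output.
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))


# Illustrative result sets: Boolean hits score 1.0, probabilistic hits vary.
boolean_hits = {"d1": 1.0, "d3": 1.0}
prob_hits = {"d1": 0.5, "d2": 0.75, "d3": 0.25}
final = fusion_merge([boolean_hits, prob_hits], [0.5, 0.5])
```

Unlike a strict Boolean filter, this merger keeps d2 in the final list even though it appears only in the probabilistic results. It simply ranks lower.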
Distributed Metadata Servers
[Figure: labels are Replicated servers, Meta-Topical Servers, General Servers, and Database Servers.]
Further Information
• Full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
• Project Web Site: http://cheshire.berkeley.edu/
• Archives Hub: http://www.archiveshub.ac.uk/