2 September 2005 VLDB Tutorial on XML Full-Text Search
XML Full-Text Search: Challenges and Opportunities
Jayavel Shanmugasundaram
Cornell University
Sihem Amer-Yahia
AT&T Labs – Research
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Motivation
• XML is able to represent a mix of structured and text information.
• XML applications: digital libraries, content management.
• XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.
XML in Library of Congress
http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
<bill bill-stage="Introduced-in-House">
  <congress>109th CONGRESS</congress>
  <session>1st Session</session>
  <legis-num>H. R. 2739</legis-num>
  <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
  <action>
    <action-date date="20050526">May 26, 2005</action-date>
    <action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself, <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>) introduced the following bill; which was referred to the <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name></action-desc>
  </action>
  …
INEX Data
<article>
  <fno>K0271</fno>
  <doi>10.1041/K0271s-2004</doi>
  <fm>
    <hdr>
      <hdr1><ti>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</ti> <crt><issn>1041-4347</issn>/04/$20.00 © 2004 IEEE Published by the IEEE Computer Society</crt></hdr1>
      <hdr2><obi><volno>Vol. 16</volno>, <issno>No. 2</issno></obi> <pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt> <pp>pp. 271-288</pp></hdr2>
    </hdr>
    <tig><atl>A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems</atl><pn>pp. 271-288</pn><ref rid="K02711aff" type="aff">*</ref></tig>
    <au sequence="first"><fnm>Albert Mo Kim</fnm><snm> <ref aid="K0271a1" type="prb">Cheng</ref></snm><role>Senior Member</role><aff><onm>IEEE</onm></aff></au>
    <au sequence="additional"><fnm>Hsiu-yen</fnm><snm> Tsai</snm></au>
    <abs><p><b>Abstract</b>—This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-…
Example INEX Query
<inex_topic topic_id="275" query_type="CAS">
  <castitle>//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]</castitle>
  <description>sections about frequent itemsets from articles with abstract about data mining</description>
  <narrative>To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining". I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.</narrative>
</inex_topic>
Challenges in XML FT Search
• Searching over Semi-Structured Data
– Users may specify a search context and return context.
• Expressive Power and Extensibility
– Users should be able to express complex full-text searches and combine them with structural searches.
• Scores and Ranking
– Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
– The language should allow for an efficient implementation.
XML FT Search Definition
• Context expression: XML elements searched:
– pre-defined XML nodes.
– XPath/XQuery queries.
• Return expression: XML fragments returned:
– pre-defined meaningful XML fragments.
– XPath/XQuery to build answers.
• Search expression: FT search conditions:
– Boolean keyword search.
– proximity distance, scoping, thesaurus, stop words, stemming.
• Score expression:
– system-defined scoring function.
– user-defined scoring function.
– query-dependent keyword weights.
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Four Classes of Languages
• Keyword search (INEX Content-Only Queries)
– "book xml"
• Tag + Keyword search
– book: xml
• Path Expression + Keyword search
– /book[./title about "xml db"]
• XQuery + Complex full-text search
– for $b in /book
  let score $s := $b ftcontains "xml" && "db" distance 5
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
XRank [Guo et al., SIGMOD 2003]
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
XIRQL [Fuhr & Großjohann, SIGIR 2001]
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <em> The XQL language </em>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
[Callout: Index Node]
Similar Notion of Results
• Nearest Concept Queries – [Schmidt et al., ICDE 2002]
• XKSearch – [Xu & Papakonstantinou, SIGMOD 2005]
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
XSearch [Cohen et al., VLDB 2003]
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction"> Searching on structured text is becoming more important with XML … …
    </paper>
    <paper id="2">
      <title> XML Indexing </title>
      …
    </paper>
[Callout: Not a "meaningful" result]
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
XPath [W3C 2005]
• fn:contains($e, string) returns true iff $e contains string
//section[fn:contains(./title, “XML Indexing”)]
XIRQL [Fuhr & Großjohann, SIGIR 2001]
• Weighted extension to XQL (precursor to XPath)
//section[0.6 · .//* $cw$ “XQL” + 0.4 · .//section $cw$ “syntax”]
XXL [Theobald & Weikum, EDBT 2002]
• Introduces similarity operator ~
Select Z
From http://www.myzoos.edu/zoos.html
Where zoos.#.zoo As Z
  and Z.animals.(animal)?.specimen as A
  and A.species ~ "lion"
  and A.birthplace.#.country as B
  and A.region ~ B.content
NEXI [Trotman & Sigurbjornsson, INEX 2004]
• Narrowed Extended XPath I
• INEX Content-and-Structure (CAS) Queries
//article[about(.//title, apple) and about(.//sec, computer)]
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
Schema-Free XQuery [Li, Yu, Jagadish, VLDB 2003]
• Meaningful least common ancestor (mlcas)
for $a in doc("bib.xml")//author,
    $b in doc("bib.xml")//title,
    $c in doc("bib.xml")//year
where $a/text() = "Mary" and exists mlcas($a,$b,$c)
return <result> {$b,$c} </result>
XQuery Full-Text [W3C 2005]
• Two new XQuery constructs
1) FTContainsExpr
  • Expresses "Boolean" full-text search predicates
  • Seamlessly composes with other XQuery expressions
2) FTScoreClause
  • Extension to FLWOR expression
  • Can score FTContainsExpr and other expressions
FTContainsExpr
//book ftcontains “Usability” && “testing” distance 5
//book[./content ftcontains “Usability” with stems]/title
//book ftcontains /article[author=“Dawkins”]/title
FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN
[Callout: in any order]
Example
FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN <result score="{$s}"> {$b} </result>
FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN
[Callout: in any order]
Example
FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
ORDER BY $s
RETURN $b
FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN
[Callout: in any order]
Example
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN $b
XQuery Full-Text Evolution
• 2002: Quark Full-Text Language (Cornell)
• 2003: TeXQuery (Cornell, AT&T Labs); IBM, Microsoft, Oracle proposals
• 2004: XQuery Full-Text
• 2005: XQuery Full-Text (Second Draft)
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Full-Text Scoring
• Score value should reflect relevance of answer to user query. Higher scores imply a higher degree of relevance.
• Queries return document fragments. Granularity of returned results affects scoring.
• For queries containing conditions on structure, structural conditions may affect scoring.
• Existing proposals extend common scoring methods: probabilistic or vector-based similarity.
Granularity of Results
• Keyword queries
– compute possibly different scores for LCAs.
• Tag + Keyword queries
– compute scores based on tags and keywords.
• Path Expression + Keyword queries
– compute scores based on paths and keywords.
• XQuery + Complex full-text queries
– compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
Granularity of Results
• Document as hierarchical structure of elements as opposed to flat document.
– XXL [Theobald & Weikum, EDBT 2002]
– XIRQL [Fuhr & Großjohann, SIGIR 2001]
– XRANK [Guo et al., SIGMOD 2003]
• Propagate keyword weights along document structure.
XML Data Model
<workshop>
  date: 28 July …
  <title>: XML and …
  <editors>: David Carmel …
  <proceedings>
    <paper>
      <title>: XQL and …
      <author>: Ricardo …
      …
    <paper> …
Legend: containment edge, hyperlink edge
XXL [Theobald & Weikum, EDBT 2002]
• Compute similar terms with relevance score r1 using an ontology.
• Compute tf*idf of each term for a given element content with relevance score r2.
• Relevance of an element content for a term is r1*r2.
• r1 and r2 are computed as a weighted distance in an ontology graph.
• Probabilities of conjunctions multiplied (independence assumption) along elements of same path to compute path score.
Probabilistic Scoring
XIRQL [Fuhr & Großjohann, SIGIR 2001]
• Extension of XPath.
• Weighting and ranking:
– weighting of query terms:
  • P(wsum((0.6,a), (0.4,b))) = 0.6 · P(a) + 0.4 · P(b)
– probabilistic interpretation of Boolean connectors:
  • P(a && b) = P(a) · P(b)
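The two combination rules above can be tried out in a few lines. This is a minimal sketch, not XIRQL's implementation: the function names `wsum` and `p_and` and the probabilities 0.5 and 0.25 are assumptions made up for illustration.

```python
from math import prod

def wsum(weighted_probs):
    """Weighted sum of event probabilities: sum of w_i * P(e_i)."""
    return sum(w * p for w, p in weighted_probs)

def p_and(*probs):
    """Conjunction under the independence assumption: product of probabilities."""
    return prod(probs)

# Hypothetical probabilities for two conditions a and b (not from the slides).
p_a, p_b = 0.5, 0.25

score_wsum = wsum([(0.6, p_a), (0.4, p_b)])  # 0.6 * P(a) + 0.4 * P(b)
score_and = p_and(p_a, p_b)                  # P(a) * P(b)
```

The weighted sum rewards a strong match on either condition, while the product drops quickly as soon as one condition is weak.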
XIRQL Example
• Query:
– "Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago"
• Data:
– "Ernst Olbrich, Darmstadt, 1899"
• Weights and ranking:
– P(Olbrich p Ulbrich)=0.8 (phonetic similarity)
– P(1899 n 1903)=0.9 (numeric similarity)
– P(Darmstadt g Frankfurt)=0.7 (geographic distance)
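As a worked illustration, suppose the three vague predicates are combined conjunctively under the independence rule P(a && b) = P(a) · P(b). The slide does not state how they are combined, so the conjunction is an assumption. The overall match probability would then be:

```latex
P(\text{match}) = 0.8 \cdot 0.9 \cdot 0.7 = 0.504
```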
PageRank [Brin & Page 1998]
$$p(v) = \frac{1-d}{N_d} + d \sum_{(u,v) \in HE} \frac{p(u)}{N_h(u)}$$
[Figure: a node w with three outgoing hyperlink edges, each labeled d/3]
d: Probability of following hyperlink
1-d: Probability of random jump
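ElemRank [Guo et al. SIGMOD 2003]
$$e(v) = \frac{1-d_1-d_2-d_3}{N_d \cdot N_{de}(v)} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u)$$
(Denominator of the first term in the notation of the XRank paper: N_d is the number of documents, N_de(v) the number of elements in the document containing v.)
[Figure: a node w with three outgoing hyperlink edges labeled d1/3 each, two containment edges to subelements labeled d2/2 each, and an edge to its parent labeled d3]
d1: Probability of following hyperlink
d2: Probability of visiting a subelement
d3: Probability of visiting parent
1-d1-d2-d3: Probability of random jump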
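The random-surfer reading of ElemRank (follow a hyperlink with probability d1, visit a subelement with d2, visit the parent with d3, otherwise jump at random) can be sketched as a fixed-point iteration. This is a minimal sketch under stated assumptions, not the XRank implementation: the toy graph, the values of d1, d2, d3, and the random-jump term (spread uniformly over documents and then over the elements of a document, as in the XRank paper) are all assumptions.

```python
def elem_rank(parent, hyperlinks, doc_of, d1=0.35, d2=0.25, d3=0.25, iters=100):
    """Iterate the ElemRank recurrence on a small element graph.

    parent:     dict element -> parent element (None for a document root)
    hyperlinks: list of (source, target) hyperlink edges
    doc_of:     dict element -> document id
    """
    nodes = list(parent)
    children = {v: [] for v in nodes}
    for v, p in parent.items():
        if p is not None:
            children[p].append(v)
    out_links = {v: [] for v in nodes}
    for u, v in hyperlinks:
        out_links[u].append(v)
    docs = set(doc_of.values())
    doc_size = {d: sum(1 for v in nodes if doc_of[v] == d) for d in docs}
    jump = 1.0 - d1 - d2 - d3

    e = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        # Random jump: pick a document, then an element inside it.
        nxt = {v: jump / (len(docs) * doc_size[doc_of[v]]) for v in nodes}
        for u in nodes:
            for v in out_links[u]:        # follow a hyperlink
                nxt[v] += d1 * e[u] / len(out_links[u])
            for v in children[u]:         # visit a subelement
                nxt[v] += d2 * e[u] / len(children[u])
            if parent[u] is not None:     # visit the parent
                nxt[parent[u]] += d3 * e[u]
        e = nxt
    return e

# Hypothetical toy document: one workshop with two papers; paper2 cites paper1.
parent = {"workshop": None, "paper1": "workshop", "paper2": "workshop", "title1": "paper1"}
doc_of = {v: 1 for v in parent}
ranks = elem_rank(parent, [("paper2", "paper1")], doc_of)
```

In this toy graph paper1 ends up ranked above paper2, because it receives rank through the hyperlink from paper2 and from its own subelement.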
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
XSearch [Cohen et al., VLDB 2003]
• tf*ilf to compute weight of keyword for a leaf element.
• A vector is associated with each non-leaf element.
• sim(Q,N): sum of the cosine distances between the vectors associated with nodes in N and vectors associated with terms matched in Q.
[Equation: score of answer N for query Q, defined in terms of sim(Q,N), tsize(N), and anc_des(N)]
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
Vector-based Scoring
JuruXML [Mass et al INEX 2002]
• Transform query into (term,path) conditions:
article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]
• (term,path)-pairs:
hypercube, article/bm/bib/bibl/bb
mesh, article/bm/bib/bibl/bb
torus, article/bm/bib/bibl/bb
nonnumerical, article/bm/bib/bibl/bb
database, article/bm/bib/bibl/bb
• Modified cosine similarity as retrieval function for vague matching of path conditions.
JuruXML Vague Path Matching
• Modified vector-based cosine similarity
$$\rho(Q,D) = \frac{\sum_{(t_i,c_i^Q) \in Q} \; \sum_{(t_i,c_j^D) \in D} w_Q(t_i,c_i^Q) \, w_D(t_i,c_j^D) \, cr(c_i^Q,c_j^D)}{|Q| \, |D|}$$
$$cr(c_i^Q,c_j^D) = \begin{cases} \dfrac{1+|c_i^Q|}{1+|c_j^D|} & \text{if } c_i^Q \text{ is a subsequence of } c_j^D \\ 0 & \text{otherwise} \end{cases}$$
Example of length normalization: cr(article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5
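The context resemblance cr can be checked against the slide's example with a short sketch. It assumes that paths are slash-separated tag sequences and that a query path matches when its steps occur, in order, within the document path; the function names are made up for illustration.

```python
def is_subsequence(short, long):
    """True if every step of `short` occurs in `long` in the same order."""
    it = iter(long)
    return all(step in it for step in short)

def context_resemblance(query_path, doc_path):
    """cr = (1 + |query path|) / (1 + |document path|) on a match, else 0."""
    q = query_path.split("/")
    d = doc_path.split("/")
    if is_subsequence(q, d):
        return (1 + len(q)) / (1 + len(d))
    return 0.0

# The example from the slide: (1 + 2) / (1 + 5) = 3/6.
cr = context_resemblance("article/bibl", "article/bm/bib/bibl/bb")
```

A query path whose steps appear in the wrong order, such as bibl/article, gets a resemblance of 0 and so contributes nothing to the similarity.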
Query Relaxation on Structure
• Schlieder, EDBT 2002
• Delobel & Rousset, 2002
• Amer-Yahia et al, VLDB 2005
XML Query Relaxation [Amer-Yahia et al EDBT 2002]
FlexPath [Amer-Yahia et al SIGMOD 2004]
• Tree pattern relaxations:
– Leaf node deletion
– Edge generalization
– Subtree promotion
[Figure: a query tree pattern over book, info, author (Dickens), and edition (paperback), shown with relaxed variants (including an optional "edition?") and data trees whose authors are "C. Dickens" and "Charles Dickens"]
Adaptation of tf.idf to XML
Whirlpool [Marian et al ICDE 2005]
Each Information Retrieval (IR) notion and its XML counterpart:
• Document collection (IR) → XML document
• Document (IR) → XML node (result is a subtree rooted at a returned node with a given tag and satisfying structural predicates in the query)
• Keyword(s) (IR) → Tree pattern
• idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) (IR) → idf is a function of the fraction of returned nodes that match the query tree pattern
• tf (term frequency) is a function of the number of occurrences of the keyword in the document (IR) → tf is a function of the number of ways the query tree pattern matches the returned node
A Family of XML Scoring Methods [Amer-Yahia et al VLDB 2005]
• Twig scoring
– High quality
– Expensive computation
• Path scoring
• Binary scoring
– Low quality
– Fast computation
[Figure: the query tree pattern book, info, author (Dickens), edition (paperback), shown whole and decomposed into smaller patterns that are combined with "+"]
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
XIRQL + Relaxation
• XIRQL proposes vague predicates, but it is not clear how to combine them with all of XQuery.
• It remains an open issue how to relax all of XQuery, including structured and scalar predicates.
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
Main Issue
• Given: Query keywords
• Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order
Naïve Method
Naïve inverted lists:
Ricardo: 1 ; 5 ; 6 ; 8
XQL: 1 ; 5 ; 6 ; 7
Problems:
1. Space Overhead
2. Spurious Results
Main issue: Decouples representation of ancestors and descendants
Element IDs:
1 <workshop>
  2 date: 28 July …
  3 <title>: XML and …
  4 <editors>: David Carmel …
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and …
      8 <author>: Ricardo …
      …
    <paper> …
Dewey Encoding of IDs [1850s]
0 <workshop>
  0.0 date: 28 July …
  0.1 <title>: XML and …
  0.2 <editors>: David Carmel …
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and …
      0.3.0.1 <author>: Ricardo …
      …
    0.3.1 <paper> …
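The numbering on this slide can be reproduced with a short recursive walk. The sketch below is illustrative only: it uses Python's standard `xml.etree.ElementTree`, numbers attributes before child elements (which is what the slide's `0.0 date` suggests), and runs on a cut-down document whose content is assumed.

```python
import xml.etree.ElementTree as ET

def dewey_ids(elem, prefix=(0,)):
    """Yield (Dewey ID, label) pairs; a node's ID extends its parent's ID."""
    yield ".".join(map(str, prefix)), f"<{elem.tag}>"
    position = 0
    for name in elem.attrib:          # attributes are numbered first
        yield ".".join(map(str, prefix + (position,))), name
        position += 1
    for child in elem:                # then child elements, in document order
        yield from dewey_ids(child, prefix + (position,))
        position += 1

# Hypothetical, cut-down version of the workshop document.
doc = ET.fromstring(
    '<workshop date="28 July 2000"><title/><editors/>'
    '<proceedings><paper><title/><author/></paper><paper/></proceedings>'
    '</workshop>'
)
ids = dict(dewey_ids(doc))
```

Because every ID contains the IDs of all its ancestors as prefixes, the ancestor-descendant relationship can be tested on the IDs alone.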
XRank: Dewey Inverted List (DIL)
Each inverted-list entry holds a Dewey ID, a score, and a position list; entries are sorted by Dewey ID.
XQL: (5.0.3.0.0, score 85, positions 32 91), (8.0.3.8.3, score 38, positions 89), …
Ricardo: (5.0.3.0.1, score 82, positions 38), (8.2.1.4.2, score 99, positions 52), …
Store IDs of elements that directly contain keyword
- Avoids space overhead
DIL: Query Processing
• Merge query keyword inverted lists in Dewey ID Order
– Entries with common prefixes are processed together
• Compute Longest Common Prefix of Dewey IDs during the merge
– Longest common prefix ensures most specific results
– Also suppresses spurious results
• Keep top-m results seen so far in output heap
– Calculate rank using two-dimensional proximity metric
– Output contents of output heap after scanning inverted lists
• Algorithm works in a single scan over inverted lists
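The merge-and-longest-common-prefix idea can be sketched for the two-keyword case. This is a simplified illustration rather than the XRank algorithm itself: it ignores scores, positions, and the output heap, handles exactly two keywords, and represents Dewey IDs as integer tuples. The function names are assumptions; the IDs are the ones shown on the DIL slide.

```python
from heapq import merge

def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def most_specific_results(list_a, list_b):
    """Merge two Dewey-sorted lists and keep the deepest common ancestors."""
    merged = merge(((d, "a") for d in list_a), ((d, "b") for d in list_b))
    candidates = set()
    prev = None
    for dewey, keyword in merged:
        # Neighbouring entries for different keywords share a common ancestor.
        if prev is not None and prev[1] != keyword:
            prefix = common_prefix(prev[0], dewey)
            if prefix:                      # empty prefix: different documents
                candidates.add(prefix)
        prev = (dewey, keyword)
    # Drop candidates that are proper ancestors of another candidate.
    return sorted(c for c in candidates
                  if not any(o != c and o[:len(c)] == c for o in candidates))

xql = [(5, 0, 3, 0, 0), (8, 0, 3, 8, 3)]
ricardo = [(5, 0, 3, 0, 1), (8, 2, 1, 4, 2)]
results = most_specific_results(xql, ricardo)
```

Only prefixes that cover an entry from each list become candidates, which is how ancestors that merely contain one keyword are kept out of the result.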
XRank: Ranked Dewey Inverted List (RDIL)
[Figure: for each keyword (XQL, other keywords …), an inverted list sorted by score, together with a B+-tree on Dewey ID]
RDIL: Algorithm
• An element may be ranked highly in one list and low in another list
– B+-tree helps search for low ranked element
• When to stop scanning inverted lists?
– Based on Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold
– Can stop if we have sufficient results above the threshold
– Extension to most specific results
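The stopping rule can be illustrated with a plain Threshold Algorithm sketch. It rests on several assumptions: the overall score is the sum of the per-keyword scores, a dictionary lookup stands in for the B+-tree probe, and the scores are invented. It does not include XRank's extension to most specific results.

```python
import heapq

def threshold_top_k(lists, k):
    """lists: one list per keyword of (element, score), sorted by descending score."""
    lookup = [dict(lst) for lst in lists]   # stands in for random access via the B+-tree
    seen = {}
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        threshold = 0.0
        for lst in lists:
            if depth < len(lst):
                elem, score = lst[depth]
                threshold += score          # best score an unseen element could still reach
                if elem not in seen:
                    seen[elem] = sum(lk.get(elem, 0.0) for lk in lookup)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                      # no unseen element can enter the top k
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical per-keyword scores for three elements A, B, C.
xql = [("A", 0.9), ("B", 0.8), ("C", 0.1)]
ricardo = [("B", 0.7), ("C", 0.6), ("A", 0.2)]
best = threshold_top_k([xql, ricardo], k=1)
```

Here the scan stops after two entries per list: B's total already reaches the threshold, so C's low-scored entry in the first list is never read by sorted access.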
RDIL: Query Processing
[Figure: RDIL query processing over the Ricardo and XQL inverted lists, each with a B+-tree on Dewey ID. Labels on the slide: entry P: 9.0.4.2.0; B+-tree entries 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3; Rank(9.0.4); output heap and temp heap holding P and R; threshold = Score(P) + Max-Score, then threshold = Score(P) + Score(R)]
ID Order vs. Rank Order
• Approaches that combine benefits
• Long ID inverted list, short score inverted list
– HDIL (Guo et al., SIGMOD 2003)
• Chunk inverted list based on score, organize by ID within chunk
– FlexPath (Amer-Yahia et al., SIGMOD 2004)
– SVR (Guo et al., ICDE 2005)
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
XSearch Technique
• Given: An interconnection relationship R between nodes (semantic relationship)
– R is reflexive and symmetric
• Node interconnection index
– Given two nodes n and n' in a document d, find if (n,n') are in R*
• Use dynamic programming to compute closure
– Online vs. offline
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
XXL Indexing
• Element Path Index (EPI)
– Evaluates simple path expressions
• Element Content Index (ECI)
– Traditional inverted list (but replicates nested elements)
• Ontology Index (OI)
– Lookup similar concepts (for evaluating ~e)
– Returned in ranked order
Myaeng et al. [SIGIR 1994]
[Figure: inverted-list entry for the term XQL with fields Document ID (5), Element ID (85), Element Tag (act), and Probability (0.3), followed by further (Element Tag, Probability) pairs: play 0.2, plays 0.1, …]
Integrating Structure and IL [Kaushik et al., SIGMOD 2004]
[Figure: a structure index over the elements book, title, info, author, and edition with index node IDs 1 to 5, and a B+ tree over an inverted list. Example inverted-list entry for XQL: Document ID 5, Start ID 85, End ID 99, Depth 3, Index ID 5, Score 0.9, …]
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
Scoring Functions Critical for Top-k Query Processing
• Top-k answer quality depends on scoring function.
• Efficient top-k query processing requires scoring function to be:
– Monotone.
– Fast to compute.
Structural Join Relaxation
//book[./info[./author ftcontains "Dickens"][./edition ftcontains "paperback"]]
Exact query (tree pattern: book, info, author "Dickens", edition "paperback"):
pc(book,info)
pc(info,author)
pc(info,edition)
contains(author,"Dickens")
contains(edition,"paperback")
Relaxed query:
pc(book,info) or ad(book,info)
pc(info,author)
pc(info,edition) or ad(book,edition)
contains(author,"Dickens")
contains(edition,"paperback")
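The predicate view of relaxation can be sketched with sets of (kind, ancestor, descendant) triples. This is an illustrative encoding, not the FlexPath or Whirlpool data structure: `pc` and `ad` follow the slide, the helper names are made up, and a relaxed edge is stored as `ad` alone on the assumption that every parent-child pair is also an ancestor-descendant pair.

```python
# Structural predicates of the exact query from the slide.
query = {
    ("pc", "book", "info"),
    ("pc", "info", "author"),
    ("pc", "info", "edition"),
}

def generalize_edge(preds, parent, child):
    """Edge generalization: pc(parent, child) becomes ad(parent, child)."""
    return (preds - {("pc", parent, child)}) | {("ad", parent, child)}

def promote_subtree(preds, grandparent, parent, child):
    """Subtree promotion: child is only required to be a descendant of grandparent."""
    return (preds - {("pc", parent, child)}) | {("ad", grandparent, child)}

relaxed = generalize_edge(query, "book", "info")
relaxed = promote_subtree(relaxed, "book", "info", "edition")
```

After both steps the structural part of the query matches the relaxed predicates on the slide: info may be any descendant of book, and edition may appear anywhere under book.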
Quark/GalaTex Architecture
[Figure: .xml documents go through preprocessing and inverted lists generation, producing inverted lists; a full-text query goes through the XQFT parser, producing an equivalent XQuery query; the Quark/Galax XQuery engine evaluates it using the full-text primitives (FTWord, FTWindow, FTTimes, etc.) through an API on positions]
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
System Architecture
XQuery Engine IR Engine
Integration Layer
System Architecture
XQuery + IR Engine
Quark/GalaTex use this architecture
Structural Relaxation
FOR $b SCORE $s in FUZZY
/pub/book[. ftcontains “Usability” with stems]
ORDER BY $s
RETURN $b
Search Over Views
Data Source 1:
<books>
  <book> … </book>
  <book> … </book>
  …
</books>
Data Source 2:
<reviews>
  <review> … </review>
  <review> … </review>
  …
</reviews>
Integrated View:
<book>
  <reviews> … </reviews>
</book>
…
Other Open Issues
• Extensive experimental evaluation of scoring functions and ranking algorithms for XML (INEX).
• Joint scoring on full-text and scalar predicates.
• Score-aware algebra for XML for the joint optimization of queries on both structure and text.
Why not use SQL/MM (or variant)?
• Key difference: No strict demarcation between structured and text data in XML
– Can issue structured and text queries over same data
  • Find books with year > 1995
  • Find books containing keyword "1998"
– Can embed structured queries in text queries
  • Find books that contain the keywords that occur in the title of Richard Dawkins' books
• Other important differences
– XML/XQuery data model
– Composability of full-text primitives
Scoring Function (monotonicity)
• Required properties:
– Exact matches should be scored higher than relaxed matches (idf)
– Returned elements with several matches should be ranked higher than those with fewer matches (tf)
• How to combine tf and idf?
– tf.idf, as used by IR, violates above properties
– Ranking based on idf, then breaking ties using tf satisfies the properties
[Figure: two pairs of tree patterns (a) and (b) over book, title (Great Expectations), info, author (Dickens), and edition (paperback). First pair: score(a) >= score(b). Second pair: score(a) <= score(b)]
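The difference between the two ways of combining tf and idf can be seen on a toy example. The answers and their idf and tf values below are invented for illustration; only the two ranking rules come from the slide.

```python
# (label, idf, tf): hypothetical answers; a higher idf means a less relaxed match.
answers = [
    ("exact", 3.0, 1),
    ("exact_twice", 3.0, 2),
    ("relaxed", 1.0, 5),
]

# tf.idf as a product: a relaxed answer with many matches can overtake an exact one.
by_product = [a[0] for a in sorted(answers, key=lambda a: a[1] * a[2], reverse=True)]

# idf first, tf only to break ties: exact answers always stay ahead of relaxed ones.
by_idf_then_tf = [a[0] for a in sorted(answers, key=lambda a: (a[1], a[2]), reverse=True)]
```

With these numbers the product ranks the relaxed answer above one exact answer, while the lexicographic order keeps both exact answers on top and orders them by tf.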