1
Searching XML Documents via XML Fragments
D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer
Presented by Hui Fang
2
Background(1) --- Data
Database: Well-structured Data
Schema: Papers (Title, Authors, Conf., Journal)

Title    Authors                  Conf.  Journal
XIRQL    N. Fuhr, K. Großjohann   SIGIR  ---
Protein  John Smith               ---    Bioinformatics

Lack of flexibility
Lack of extensibility
IR: Un-structured Data
An example document:
Intel: New chip, new price war.
February 1, 2004: 6:32 PM EST.
Intel Corp. on Sunday said it had refreshed its line of microchips for desktop computers with a new version of the Pentium 4 processor, designed to run increasingly power-hungry office and home entertainment software faster. In 1998,…..
<title> </title>
<date> </date>
<content> </content>
Lack of the logical structure of a document.
DB+IR: Semi-structured Data
<paper>
<title> XIRQL </title>
<author> N. Fuhr </author>
<author> K. Großjohann </author>
<conf> SIGIR </conf>
</paper>
Why is semi-structured data important?
3
XML in a nutshell
• Hierarchical data format
• Nested element structure having a root
• Self-describing data (tags); the schema is attached to the data itself.
<book id="25">
<year>1997</year>
<author>Karen Sparck Jones</author>
<author>Peter Willett</author>
<publisher>Morgan Kaufmann</publisher>
<title>Readings in Information Retrieval</title>
</book>
…
Parts of the markup: start tag, content, end tag, attribute, element
Tree view:
book (id="25")
  year: 1997
  author: Karen Sparck Jones
  author: Peter Willett
  publisher: Morgan Kaufmann
  title: Readings in …
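The element, attribute and content structure of the book example can be inspected with a few lines of Python. This is a minimal sketch using the standard-library `xml.etree.ElementTree` parser; the helper name `describe` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The book example from the slide, with plain ASCII quotes.
BOOK_XML = """
<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>
"""

def describe(xml_text):
    """Return the root tag, its attributes, and (tag, content) for each child."""
    root = ET.fromstring(xml_text)
    children = [(child.tag, child.text.strip()) for child in root]
    return root.tag, dict(root.attrib), children

tag, attributes, children = describe(BOOK_XML)
print(tag, attributes)
for child_tag, text in children:
    print(f"  {child_tag}: {text}")
```

The tags travel with the data, so the structure is recovered without a separate schema.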
4
Background(2) --- Query
Database: Boolean Query
SQL (Structured Query Language):
SELECT title
FROM papers
WHERE conf = 'SIGIR'
Return the unranked tuples satisfying the query.
IR: Ranked Query
Keywords:
paper SIGIR
Return the ranked documents according to the relevance.
How to query semi-structured data (e.g. XML data) ?
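The contrast between the two query styles can be sketched in Python. The Boolean side runs the SQL above against an in-memory SQLite table; the ranked side uses a toy keyword-count score that only stands in for a real relevance formula. The documents `d1` to `d3` and both function names are assumptions made for illustration.

```python
import sqlite3

PAPERS = [
    ("XIRQL", "N. Fuhr, K. Großjohann", "SIGIR", None),
    ("Protein", "John Smith", None, "Bioinformatics"),
]

def boolean_query(conf):
    """Database style: return the unranked titles whose conf matches exactly."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (title TEXT, authors TEXT, conf TEXT, journal TEXT)")
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?, ?)", PAPERS)
    rows = conn.execute("SELECT title FROM papers WHERE conf = ?", (conf,)).fetchall()
    conn.close()
    return [title for (title,) in rows]

def ranked_query(keywords, documents):
    """IR style: order documents by a toy score (number of keyword occurrences)."""
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(k.lower()) for k in keywords)
        if score > 0:
            scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical free-text documents.
DOCS = {
    "d1": "a SIGIR paper about XML retrieval",
    "d2": "call for paper submissions",
    "d3": "protein folding results",
}

print(boolean_query("SIGIR"))
print(ranked_query(["paper", "SIGIR"], DOCS))
```

The SQL query either matches a tuple or not, while the keyword query returns partial matches too, ordered by score.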
5
Related Work
• DB-oriented approaches
– E.g. XML-QL, XQL, XQuery …
WHERE
<book>
<title>Harry Potter</title>
<author>$a</author> <year>$y</year>
</book> IN "books.xml", $y > 2002
CONSTRUCT
<result> <author>$a</author> </result>
• DB+IR approaches
– E.g. XIRQL
• IR-oriented approaches
– E.g. this paper
6
Problem Refinement --- CAS Search
• Document collection:
– XML documents
• Each document is a hierarchical structure of nested elements
• Markup in the document mainly serves to expose the logical structure of the document.
• Query
– content + explicit references to the XML structure
– specifies the target element to be returned
An example:
Retrieve all articles from the years 1999-2000 that deal with work on nonmonotonic reasoning. Do not retrieve articles that are calendars or calls for papers.
7
Approach
• Compare apples with apples
• Recall vector space models
– Both documents and queries are expressed in free text.
– Compare unstructured data to unstructured data
• This paper:
– Search XML documents via XML fragments
8
Query---XML Fragments(1)
• Topic 1: Find all books about fishing
<book> fishing </book>
• Topic 2: Find all books having a title about search
<book> <title> search </title> </book>
XQuery:
<results>
{
for $t in document("library.xml")//book/title
where contains($t/text(), "search")
return $t
}
</results>
XML fragments compared with XQuery:
More intuitive
More flexible
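One way to read an XML fragment query is as a set of (term, context) pairs, where the context is the path of enclosing tags. The sketch below assumes that reading; the function name `fragment_to_pairs` and the lowercasing of terms are illustrative choices, not taken from the slides.

```python
import xml.etree.ElementTree as ET

def fragment_to_pairs(fragment):
    """Return (term, context) pairs; context is the path of enclosing tags."""
    root = ET.fromstring(fragment)
    pairs = []

    def walk(element, path):
        context = f"{path}/{element.tag}"
        # Words directly inside this element, before its first child.
        pairs.extend((w.lower(), context) for w in (element.text or "").split())
        for child in element:
            walk(child, context)
            # Words after a child's end tag still belong to this element.
            pairs.extend((w.lower(), context) for w in (child.tail or "").split())

    walk(root, "")
    return pairs

print(fragment_to_pairs("<book> fishing </book>"))
print(fragment_to_pairs("<book> <title> search </title> </book>"))
```

Topic 1 yields the term in the context `/book`, Topic 2 yields it in `/book/title`, with no query language beyond the markup itself.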
9
Query --- XML Fragment(2)
• Limited expressiveness
– E.g. “Find figures that describe the CORBA architecture and the paragraphs that refer to those figures.”
Requires a “join” operation between two elements, “figures” and “paragraphs”
10
Recall: Text Retrieval Task
• Given a query
– According to the retrieval formula, compute the relevance score for each document;
– Rank the documents according to relevance score.
• Vector Space Model
– Represent doc/query by a vector of terms
– Relevance between doc and query corresponds to the distance between the two vectors d and q
$$score(q, d) = \frac{\sum_{t \in q \cap d} w_q(t)\, w_d(t)}{|q|\,|d|}$$
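The vector space ranking recalled here can be sketched in a few lines of Python. The sketch assumes raw term counts as the weights w_q(t) and w_d(t) and Euclidean vector lengths for |q| and |d|; the example documents are hypothetical.

```python
import math
from collections import Counter

def score(query, document):
    """Cosine-style score: shared-term weight products over the vector lengths."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def rank(query, documents):
    """Return document ids ordered by decreasing score."""
    return sorted(documents, key=lambda doc_id: score(query, documents[doc_id]), reverse=True)

# Hypothetical collection.
DOCS = {
    "d1": "paper SIGIR retrieval",
    "d2": "paper about protein folding",
    "d3": "chip price war",
}

print(rank("paper SIGIR", DOCS))
```

Documents and query are compared in the same representation, which is the "apples with apples" idea that the XML fragment approach carries over to structured data.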
11
Extending the Vector Space Model(1)
• Indexing unit: $(t_i, c_i)$
– E.g. (“Harry Potter”, /book/title)
– Can be matched with
• (“Harry Potter”, /book)
• (“Harry Potter”, /book/sec/title)
• Retrieval Formula
$$score(q, d) = \frac{\sum_{(t, c_i) \in q} \sum_{(t, c_k) \in d} w_q(t, c_i)\, w_d(t, c_k)\, cr(c_i, c_k)}{|q|\,|d|}$$
Context resemblance measure
Perfect match: $cr(c_i, c_k) = 1$ when $c_i = c_k$; 0 otherwise.
Partial match: $cr(c_i, c_k) = \frac{1 + |c_i|}{1 + |c_k|}$ when $c_i$ is a subsequence of $c_k$; 0 otherwise.
Fuzzy match: $cr(c_i, c_k) = StrSimilarity(c_i, c_k)$
Flat (ignore context): $cr(c_i, c_k) = 1$ for all $c_i, c_k$
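The four context resemblance measures can be sketched as small Python functions. Contexts are treated as paths of tag names and |c| as the number of tags, which is an assumption. The slides do not say how StrSimilarity is computed, so `difflib.SequenceMatcher` is swapped in here purely as a stand-in.

```python
from difflib import SequenceMatcher

def tags(context):
    """Split a path such as '/book/sec/title' into its tag names."""
    return [t for t in context.split("/") if t]

def is_subsequence(short, long):
    """True if the items of short appear in long, in order, gaps allowed."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)

def cr_perfect(ci, ck):
    return 1.0 if tags(ci) == tags(ck) else 0.0

def cr_partial(ci, ck):
    a, b = tags(ci), tags(ck)
    return (1 + len(a)) / (1 + len(b)) if is_subsequence(a, b) else 0.0

def cr_fuzzy(ci, ck):
    # Stand-in for StrSimilarity, which the slides leave unspecified.
    return SequenceMatcher(None, ci, ck).ratio()

def cr_flat(ci, ck):
    return 1.0

print(cr_partial("/book/title", "/book/sec/title"))
print(cr_partial("/book", "/book/title"))
print(cr_perfect("/book/title", "/book/sec/title"))
```

Under partial match, a query context scores higher the more of the document context it covers, and scores 0 when it is not a subsequence of the document context at all.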
12
Extending the Vector Space Model(2)
$$score(q, d) = \frac{\sum_{(t, c_i) \in q} \sum_{(t, c_k) \in d} w_q(t, c_i)\, w_d(t, c_k)\, cr(c_i, c_k)}{|q|\,|d|}$$
$w_d(t, c_k) = tf_d(t, c_k) \cdot idf(t, c_k)$, where $idf(t, c) = \log\left(\frac{|N|}{|N_{(t,c)}|}\right)$
If c is rare, idf(t,c) would be high in spite of t being very common.
“Merge-idf” variant:
$w_d(t, c_k) = tf_d(t, c_k) \cdot idf(t, C)$, where $C = \bigcup_k c_k$ and $cr(c_i, c_k) > 0$
“Merge” variant:
$tf_d(t, C) \cdot idf(t, C) \cdot cr(c_i, C)$
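The effect of merging contexts on idf can be illustrated with a toy collection. The sketch assumes that |N| is the number of documents, that |N_(t,c)| is the number of documents containing t in context c, and that "resembles" means the partial-match condition (the query context is a subsequence of the document context). The collection, the function names, and the 0.0 returned for unseen pairs are all illustrative assumptions.

```python
import math

def is_subsequence(short, long):
    remaining = iter(long)
    return all(tag in remaining for tag in short)

def resembles(ci, ck):
    """cr(ci, ck) > 0 under partial match."""
    return is_subsequence(ci.strip("/").split("/"), ck.strip("/").split("/"))

def idf(docs, term, context):
    """idf(t, c): based on documents containing the term in exactly this context."""
    n_tc = sum(1 for pairs in docs if (term, context) in pairs)
    return math.log(len(docs) / n_tc) if n_tc else 0.0

def merged_idf(docs, term, query_context):
    """idf(t, C): based on documents containing the term in any resembling context."""
    n_tC = sum(
        1 for pairs in docs
        if any(t == term and resembles(query_context, c) for t, c in pairs)
    )
    return math.log(len(docs) / n_tC) if n_tC else 0.0

# Hypothetical collection: each document is a set of (term, context) pairs.
DOCS = [
    {("xml", "/article/bdy")},
    {("xml", "/article/bdy")},
    {("xml", "/article/bdy"), ("xml", "/article/atl")},
    {("xml", "/article/bdy")},
]

print(idf(DOCS, "xml", "/article/atl"))
print(merged_idf(DOCS, "xml", "/article"))
```

The term occurs in every document, yet its idf in the rare context is log 4; over the merged context it drops to log 1 = 0, which is the correction the merge-idf variant aims for.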
13
Evaluation
• Runs
– Partial-match
– Partial-match.merge-idf
– Partial-match.merge
– Fuzzy-match.merge-idf
– Flat (ignore context)
14
Result(1)
• Result for “free-text-oriented” topics
– An example topic:
<yr>1995,1996,1997,1998,1999</yr>
<bdy>XML Electronic commerce </bdy>
15
Result(2)
• Result for “context-oriented” topics
– An example topic:
<atl> Content-Based retrieval of video databases</atl>
16
Summary
• Using XML fragments with an extended vector space model is promising.
• Use different solutions for different types of applications
• Something wrong?
17
Another Problem --- CO Search
• Document collection:
– XML documents
• Query:
– a set of keywords
• Task: Find the smallest element satisfying the query
Challenge: rank components instead of whole documents
18
Possible Solutions
Possible Method(1): treat each component as a document.
$$score(q, d) = \frac{\sum_{t \in q \cap d} w_q(t)\, w_d(t)}{|q|\,|d|}$$
where $w_D(t) = \log(TF_D(t)) \cdot \log\left(\frac{N}{DF(t)}\right)$
Problem with this method: XML components are nested.
<article>
t1
<sec> <p> t2</p></sec>
</article>
$N = 3$
$CF(t_1) = 1,\ CF(t_2) = 3$
$TF_{article}(t_1) = 1,\ TF_{article}(t_2) = 1$
$W_{article}(t_1) \neq W_{article}(t_2)$
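The counts in this example can be reproduced with a short Python sketch that treats every element as its own "document". It covers only the N and CF statistics and the resulting log(N/CF) factor, not the full weight; the function name `component_frequencies` is an assumption made for illustration.

```python
import math
import xml.etree.ElementTree as ET

ARTICLE = "<article> t1 <sec> <p> t2 </p> </sec> </article>"

def component_frequencies(xml_text):
    """Treat every element as a component; count components containing each term."""
    root = ET.fromstring(xml_text)
    # itertext() includes the text of nested elements, so outer components
    # also contain the terms of the components nested inside them.
    components = [set("".join(el.itertext()).split()) for el in root.iter()]
    cf = {}
    for terms in components:
        for term in terms:
            cf[term] = cf.get(term, 0) + 1
    return len(components), cf

n, cf = component_frequencies(ARTICLE)
idf = {term: math.log(n / count) for term, count in cf.items()}
print(n, cf)
print(idf)
```

Both terms occur once in the article, but t2 is counted again in `sec` and `p`, so its log(N/CF) factor collapses to 0 while that of t1 stays at log 3.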
19
Possible Solutions (Cont.)
Possible Method(2): counting TF at the component level; computing N & DF at the document level.
$$score(q, d) = \frac{\sum_{t \in q \cap d} w_q(t)\, w_d(t)}{|q|\,|d|}$$
where $w_D(t) = \log(TF_D(t)) \cdot \log\left(\frac{N}{DF(t)}\right)$
<article>
<sec>t1</sec>
<sec>t1</sec>
<sec>t2</sec>
</article>
$N = 1$
$DF(t_1) = 1,\ DF(t_2) = 1$
$TF_{sec1}(t_1) = TF_{sec2}(t_1) = TF_{sec3}(t_2) = 1$
$W_{sec1}(t_1) = W_{sec3}(t_2)$
Impossible to differentiate between the rankings of the three sections
20
Proposed Solution
• Create an index for each component type
– Elements in each index are regarded as documents
– Keep N, DF, TF for the specific component type
– Can apply the regular vector space model on each index
• Given a query
– Run the query in parallel on each index
– Return one ranked list of results from each index
• Normalize the scores in each index into the range (0,1)
– Achieved by computing $score(q, q)$
• Merge the normalized results into one ranked list of all components
Assumes the set of potential components to be returned is known in advance.
Assumes no nesting of the same component.
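The normalize-and-merge step can be sketched as follows. The sketch assumes that normalization means dividing each raw score by that index's score(q, q), which is one reading of the slide rather than a confirmed detail. The index names, component ids and score values are hypothetical.

```python
def merge_results(per_index_scores, self_scores):
    """Normalize each index's raw scores by its score(q, q), then merge and rank."""
    merged = []
    for index_name, scores in per_index_scores.items():
        q_q = self_scores[index_name]  # score of the query against itself in this index
        for component_id, raw in scores.items():
            merged.append((raw / q_q, index_name, component_id))
    # Highest normalized score first; ties broken by index name, then component id.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return merged

# Hypothetical raw scores from two component-type indexes.
PER_INDEX = {
    "article": {"a1": 2.0, "a2": 1.0},
    "sec": {"s1": 9.0, "s2": 3.0},
}
SELF_SCORES = {"article": 4.0, "sec": 12.0}

for normalized, index_name, component_id in merge_results(PER_INDEX, SELF_SCORES):
    print(f"{normalized:.2f} {index_name} {component_id}")
```

Raw scores from different indexes are not comparable because each index keeps its own N, DF and TF; dividing by a per-index reference score puts them on a common scale before merging.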
21
Conclusion
• Possible solutions to the following challenges.
– Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
• Document may no longer be the most natural unit
• Components in a document may be more appropriate
– Challenge 2 (Query): What is an appropriate query language?
• Keyword (free text) query is no longer the only choice
• Constraints on the structures can be posed
22
References
• Retrieving the Most Relevant XML Components, by Y. Mass and M. Mandelbrod. INEX’03 workshop.
• Searching XML Documents via XML Fragments, by D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. SIGIR’03.
• XIRQL: A Query Language for Information Retrieval in XML Documents, by N. Fuhr and K. Großjohann. SIGIR’02.