Searching XML Documents via XML Fragments

1


D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer

Presented by Hui Fang

2

Background (1) --- Data

Database: Well-structured Data

Schema: Papers (Title, Authors, Conf., Journal)

Title     Authors                  Conf.   Journal
XIRQL     N. Fuhr, K. Großjohann   SIGIR   ---
Protein   John Smith               ---     Bioinformatics

Lack of flexibility

Lack of extensibility

IR: Un-structured Data

An example document:

Intel: New chip, new price war.

February 1, 2004: 6:32 PM EST.

Intel Corp. on Sunday said it had refreshed its line of microchips for desktop computers with a new version of the Pentium 4 processor, designed to run increasingly power-hungry office and home entertainment software faster. In 1998, …

Lack of the logical structure of a document (no markup such as <title> </title>, <date> </date>, <content> </content>).

DB+IR: Semi-structured Data

<paper>

<title> XIRQL </title>

<author> N. Fuhr </author>

<author> K. Großjohann </author>

<conf> SIGIR </conf>

</paper>

Why is semi-structured data important?

3

XML in a nutshell

• Hierarchical data format
• Nested element structure having a root
• Self-describing data (tags): the schema is attached to the data itself.

<book id="25">

<year>1997</year>

<author> Karen Sparck Jones </author>

<author>Peter Willett </author>

<publisher>Morgan Kaufmann</publisher>

<title> Readings in Information Retrieval </title>

</book>

…

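[Figure: the book element drawn as a tree. Root: book, with attribute id="25". Children: year (1997), author (Karen Sparck Jones), author (Peter Willett), publisher (Morgan Kaufmann), title (Readings in …). Callouts label the start tag, the content, the end tag, the attribute, and the element.]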
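To make the terminology concrete, here is a minimal sketch that parses the book example with Python's standard `xml.etree.ElementTree` and reads back the element names, the `id` attribute, and the text content. The XML string is the slide's example with plain ASCII quotes; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book id="25">
  <year>1997</year>
  <author> Karen Sparck Jones </author>
  <author>Peter Willett </author>
  <publisher>Morgan Kaufmann</publisher>
  <title> Readings in Information Retrieval </title>
</book>
"""

# The root element of the hierarchy.
root = ET.fromstring(BOOK_XML)

# Each child is an element: a start tag, some content, and an end tag.
children = [(child.tag, child.text.strip()) for child in root]

# Attributes live on the start tag and are read by name.
book_id = root.get("id")

# Repeated elements (two authors) are simply siblings with the same tag.
authors = [a.text.strip() for a in root.findall("author")]

if __name__ == "__main__":
    print(root.tag, book_id)
    for tag, text in children:
        print(f"  {tag}: {text}")
```

The tag names travel with the data, which is what "self-describing" means here: no separate schema is needed to tell the year from the title.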

4

Background(2) --- Query

Database: Boolean Query

SQL (Structured Query Language):

SELECT title
FROM papers
WHERE conf = 'SIGIR'

Return the unranked tuples satisfying the query.

IR: Ranked Query

Keywords:

paper SIGIR

Return the documents ranked by relevance.
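The contrast between the two query styles can be sketched in a few lines. This is only an illustration: the rows, the three toy documents, and the scoring by raw term counts are assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Hypothetical rows for the Papers table and a hypothetical text collection.
PAPERS = [
    {"title": "XIRQL", "conf": "SIGIR"},
    {"title": "Protein", "conf": None},
]
DOCS = {
    "d1": "a SIGIR paper about XML retrieval",
    "d2": "a paper about protein folding",
    "d3": "new chip new price war",
}


def boolean_select(rows, conf):
    """Database style: a row either satisfies the predicate or it does not."""
    return [row["title"] for row in rows if row["conf"] == conf]


def ranked_search(docs, query):
    """IR style: score each document, then sort by descending score."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        # Raw count of query terms: a stand-in for a real relevance score.
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    print(boolean_select(PAPERS, "SIGIR"))
    print(ranked_search(DOCS, "paper SIGIR"))
```

The first function returns an unordered set of exact matches; the second returns every partially matching document, best first.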

How to query semi-structured data (e.g. XML data)?

5

Related Work

• DB-oriented approaches
  – E.g. XML-QL, XQL, XQuery …

WHERE
  <book>
    <title>Harry Potter</title>
    <author>$a</author>
    <year>$y</year>
  </book> IN "books.xml", $y > 2002
CONSTRUCT
  <result> <author>$a</author> </result>

• DB+IR approaches
  – E.g. XIRQL

• IR-oriented approaches
  – E.g. this paper

6

Problem Refinement --- CAS (Content-and-Structure) Search

• Document collection:
  – XML documents
    • Each document is a hierarchical structure of nested elements
    • Markup in the document mainly serves to expose the logical structure of the document.

• Query:
  – content + explicit references to the XML structure
  – specifies the target element that needs to be returned

An example:

Retrieve all articles from the years 1999-2000 that deal with work on nonmonotonic reasoning. Do not retrieve articles that are calendars or calls for papers.

7

Approach

• Compare apples with apples

• Recall vector space models
  – Both documents and queries are expressed in free text.
  – Compare unstructured data to unstructured data

• This paper:
  – Search XML documents via XML fragments

8

Query --- XML Fragments (1)

• Topic 1: Find all books about fishing

<book> fishing </book>

• Topic 2: Find all books having a title about search

<book> <title> search </title> </book>

The same Topic 2 query in XQuery:

<results>
{
  for $t in document("library.xml")//book/title
  where contains($t/text(), "search")
  return $t
}
</results>

Compared with XQuery, XML fragments are:

• More intuitive
• More flexible
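One way to see why fragments are convenient as queries is that they can be parsed like any other XML. The sketch below turns a fragment into (term, context) pairs, where the context is the path of the enclosing element. The function name `fragment_to_pairs` and the lowercasing and whitespace tokenization are assumptions made for illustration, not the paper's parser.

```python
import xml.etree.ElementTree as ET


def fragment_to_pairs(fragment):
    """Return the (term, context) pairs of an XML fragment query."""
    pairs = []

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        # Text directly inside this element: its own text plus the tails
        # that follow each child (both belong to `elem`, not to the child).
        own_text = [elem.text or ""] + [child.tail or "" for child in elem]
        for term in " ".join(own_text).lower().split():
            pairs.append((term, path))
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(fragment), "")
    return pairs


if __name__ == "__main__":
    print(fragment_to_pairs("<book> fishing </book>"))
    print(fragment_to_pairs("<book> <title> search </title> </book>"))
```

Documents can be run through the same function, so query and collection end up in one representation.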

9

Query --- XML Fragments (2)

• Limited expressiveness
  – E.g. “Finding figures that describe the CORBA architecture and the paragraphs that refer to those figures.”
  – Requires a “join” operation between two elements, “figures” and “paragraphs”

10

Recall: Text Retrieval Task

• Given a query
  – According to the retrieval formula, compute the relevance score for each document;
  – Rank the documents according to their relevance scores.

• Vector Space Model
  – Represent doc/query by a vector of terms
  – Relevance between doc and query is measured by how close the two vectors are (the cosine of the angle between them):

$$\rho(q, d) = \frac{\sum_{t \in q \cap d} w_q(t) \, w_d(t)}{|q| \, |d|}$$

[Figure: document vector d and query vector q drawn in term space]
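The retrieval task above can be sketched as follows. The weights here are raw term frequencies, which is an assumption made to keep the example short (the slide leaves the weights unspecified); `cosine` and `rank` are illustrative names.

```python
import math
from collections import Counter


def term_weights(text):
    """Raw term frequencies used as weights (an assumption for this sketch)."""
    return Counter(text.lower().split())


def norm(weights):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine(query, doc):
    """Sum of w_q(t) * w_d(t) over shared terms, divided by |q| * |d|."""
    wq, wd = term_weights(query), term_weights(doc)
    shared = set(wq) & set(wd)
    dot = sum(wq[t] * wd[t] for t in shared)
    denom = norm(wq) * norm(wd)
    return dot / denom if denom else 0.0


def rank(query, docs):
    """Document ids ordered by descending score (ties broken by id)."""
    scored = [(cosine(query, text), doc_id) for doc_id, text in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    docs = {"d1": "xml retrieval", "d2": "xml databases", "d3": "chips"}
    print(rank("xml retrieval", docs))
```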

11

Extending the Vector Space Model (1)

• Indexing unit: a (term, context) pair $(t_i, c_i)$
  – E.g. (“Harry Potter”, /book/title)
  – Can be matched with
    • (“Harry Potter”, /book)
    • (“Harry Potter”, /book/sec/title)

• Retrieval Formula

$$\rho(q, d) = \frac{\sum_{(t, c_i) \in q} \sum_{(t, c_k) \in d} w_q(t, c_i) \, w_d(t, c_k) \, cr(c_i, c_k)}{|q| \, |d|}$$

Context resemblance measure $cr(c_i, c_k)$:

• Perfect match: $cr(c_i, c_k) = 1$ when $c_i = c_k$; 0 otherwise.
• Partial match: $cr(c_i, c_k) = \frac{1 + |c_i|}{1 + |c_k|}$ when $c_i$ is a subsequence of $c_k$; 0 otherwise.
• Fuzzy match: $cr(c_i, c_k) = StrSimilarity(c_i, c_k)$
• Flat (ignore context): $cr(c_i, c_k) = 1$ for all $c_i, c_k$
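The four context resemblance measures and the extended score can be sketched as below. Several details are assumptions made for illustration: contexts are written as slash-separated paths, |c| is taken as the number of tags in the path, |q| and |d| are Euclidean norms of the weights, and `StrSimilarity` (not specified on the slide) is replaced by `difflib.SequenceMatcher.ratio` as a stand-in.

```python
import math
from difflib import SequenceMatcher


def tags(context):
    """Split a path such as '/book/title' into ['book', 'title']."""
    return [t for t in context.split("/") if t]


def is_subsequence(short, long):
    """True if `short` appears in `long` in order, not necessarily contiguously."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)


def cr_perfect(ci, ck):
    return 1.0 if tags(ci) == tags(ck) else 0.0


def cr_partial(ci, ck):
    a, b = tags(ci), tags(ck)
    return (1 + len(a)) / (1 + len(b)) if is_subsequence(a, b) else 0.0


def cr_fuzzy(ci, ck):
    # Stand-in for StrSimilarity; any string similarity in [0, 1] would do.
    return SequenceMatcher(None, ci, ck).ratio()


def cr_flat(ci, ck):
    return 1.0


def score(query, doc, cr):
    """query and doc map (term, context) -> weight; cr is one of the measures."""
    total = 0.0
    for (tq, ci), wq in query.items():
        for (td, ck), wd in doc.items():
            if tq == td:
                total += wq * wd * cr(ci, ck)
    nq = math.sqrt(sum(w * w for w in query.values()))
    nd = math.sqrt(sum(w * w for w in doc.values()))
    return total / (nq * nd) if nq and nd else 0.0


if __name__ == "__main__":
    q = {("harry", "/book/title"): 1.0}
    d = {("harry", "/book/sec/title"): 1.0}
    for cr in (cr_perfect, cr_partial, cr_fuzzy, cr_flat):
        print(cr.__name__, score(q, d, cr))
```

With the query context /book/title and the document context /book/sec/title, the perfect match contributes nothing while the partial match contributes (1 + 2) / (1 + 3).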

12

Extending the Vector Space Model (2)

$$\rho(q, d) = \frac{\sum_{(t, c_i) \in q} \sum_{(t, c_k) \in d} w_q(t, c_i) \, w_d(t, c_k) \, cr(c_i, c_k)}{|q| \, |d|}$$

$$w_d(t, c_k) = tf_d(t, c_k) \cdot idf(t, c_k), \quad \text{where} \quad idf(t, c) = \log\left(\frac{|N|}{|N_{(t,c)}|}\right)$$

If c is rare, idf(t,c) would be high in spite of t being very common.

“Merge-idf” variant:

$$w_d(t, c_k) = tf_d(t, c_k) \cdot idf(t, C), \quad \text{where} \quad C = \bigcup_k c_k \ \text{and} \ cr(c_i, c_k) > 0$$

“Merge” variant:

$$tf_d(t, C) \cdot idf(t, C) \cdot cr(c_i, C)$$
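The effect of a rare context on idf, and how merging contexts dampens it, can be checked on a toy collection. This sketch assumes |N| is the number of documents and |N_(t,c)| the number of documents containing t in context c; the four documents and the context paths are hypothetical. The merged set C is passed in directly rather than derived from cr.

```python
import math

# Hypothetical collection: doc id -> set of (term, context) pairs it contains.
DOCS = {
    "d1": {("xml", "/article/bdy")},
    "d2": {("xml", "/article/bdy")},
    "d3": {("xml", "/article/bdy"), ("xml", "/article/fm/atl")},
    "d4": {("retrieval", "/article/bdy")},
}


def idf(term, contexts, docs=DOCS):
    """log(|N| / |N_(t,C)|): counts docs holding `term` in any context of C."""
    n_tc = sum(
        1 for pairs in docs.values()
        if any((term, c) in pairs for c in contexts)
    )
    return math.log(len(docs) / n_tc) if n_tc else 0.0


# "xml" is common overall but rare inside the title context.
rare_context_idf = idf("xml", {"/article/fm/atl"})
# Merge-idf: the contexts that resemble the query context are pooled.
merged_idf = idf("xml", {"/article/fm/atl", "/article/bdy"})

if __name__ == "__main__":
    print(rare_context_idf, merged_idf)
```

The per-context value is log(4/1) while the merged value is log(4/3), so pooling contexts removes the inflation caused by the rare context alone.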

13

Evaluation

• Runs
  – Partial-match
  – Partial-match.merge-idf
  – Partial-match.merge
  – Fuzzy-match.merge-idf
  – Flat (ignore context)

14

Result (1)

• Result for “free-text-oriented” topics
  – An example topic:

<yr>1995,1996,1997,1998,1999</yr>

<bdy>XML Electronic commerce </bdy>

15

Result (2)

• Result for “context-oriented” topics
  – An example topic:

<atl> Content-Based retrieval of video databases</atl>

16

Summary

• Using XML fragments with an extended vector space model is promising.

• Use different solutions for different types of applications

• Something wrong?

17

Another Problem --- CO (Content-Only) Search

• Document collection:
  – XML documents

• Query:
  – a set of keywords

• Task: Find the smallest element satisfying the query

Challenge: rank components instead of whole documents

18

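Possible Solutions

Possible Method (1):

treat each component as a document.

<article>
  t1
  <sec> <p> t2 </p> </sec>
</article>

$$\rho(q, d) = \frac{\sum_{t \in q \cap d} w_q(t) \, w_d(t)}{|q| \, |d|}, \quad \text{where} \quad w_D(t) = \log(TF_D(t)) \cdot \log\left(\frac{N}{DF(t)}\right)$$

Problem with this method: XML components are nested.

Here CF(t) is the number of components containing t, which plays the role of DF(t) when every component is a document:

$$N = 3, \quad CF(t_1) = 1, \quad CF(t_2) = 3$$

$$TF_{article}(t_1) = 1, \quad TF_{article}(t_2) = 1$$

Compare $W_{article}(t_1)$ with $W_{article}(t_2)$: both terms occur once in the article, but nesting counts t2 three times in CF.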
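The statistics for Method (1) can be reproduced from the example article. This is a sketch under the slide's setup: every element (article, sec, p) counts as one "document", and a component contains a term if the term occurs anywhere inside it, including nested descendants. Only the idf factor log(N / CF) is computed, since the TF factor is the same for both terms.

```python
import math
import xml.etree.ElementTree as ET

ARTICLE = "<article> t1 <sec> <p> t2 </p> </sec> </article>"


def component_terms(elem):
    """All terms inside an element, including those of nested descendants."""
    return " ".join(elem.itertext()).split()


root = ET.fromstring(ARTICLE)
components = list(root.iter())  # article, sec, p
N = len(components)


def cf(term):
    """Number of components that contain the term."""
    return sum(1 for c in components if term in component_terms(c))


def idf(term):
    return math.log(N / cf(term))


if __name__ == "__main__":
    for term in ("t1", "t2"):
        tf_article = component_terms(root).count(term)
        print(term, "TF_article =", tf_article, "CF =", cf(term), "idf =", idf(term))
```

Both terms occur once in the article, yet t2 gets an idf of log(3/3) = 0 because its enclosing sec and p are counted as separate documents.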

19

Possible Solutions (Cont.)

Possible Method (2):

counting TF at the component level;

computing N & DF at the document level.

<article>
  <sec>t1</sec>
  <sec>t1</sec>
  <sec>t2</sec>
</article>

$$\rho(q, d) = \frac{\sum_{t \in q \cap d} w_q(t) \, w_d(t)}{|q| \, |d|}, \quad \text{where} \quad w_D(t) = \log(TF_D(t)) \cdot \log\left(\frac{N}{DF(t)}\right)$$

$$N = 1, \quad DF(t_1) = 1, \quad DF(t_2) = 1$$

$$TF_{sec1}(t_1) = TF_{sec2}(t_1) = TF_{sec3}(t_2) = 1$$

$$W_{sec1}(t_1) = W_{sec3}(t_2)$$

Impossible to differentiate between the rankings of the three sections
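The same kind of check for Method (2): TF is counted per section, while N and DF are counted per article. The sketch assumes the collection consists of the single example article; `section_stats` is an illustrative helper name. It lists the inputs to the weight formula rather than the weight itself, because identical inputs give identical weights whatever the exact formula.

```python
import xml.etree.ElementTree as ET

# The whole collection is the single example article (an assumption).
COLLECTION = ["<article><sec>t1</sec><sec>t1</sec><sec>t2</sec></article>"]
articles = [ET.fromstring(doc) for doc in COLLECTION]


def terms(elem):
    """All terms inside an element, including nested descendants."""
    return " ".join(elem.itertext()).split()


N = len(articles)  # document-level collection size


def df(term):
    """Document-level DF: number of articles containing the term."""
    return sum(1 for article in articles if term in terms(article))


def section_stats(article):
    """(term, TF in the section, DF, N) for each term of each section."""
    stats = []
    for sec in article.findall("sec"):
        sec_terms = terms(sec)
        for term in sorted(set(sec_terms)):
            stats.append((term, sec_terms.count(term), df(term), N))
    return stats


stats = section_stats(articles[0])

if __name__ == "__main__":
    for row in stats:
        print(row)
```

Every section yields TF = 1, DF = 1, N = 1, so t1, which appears in two sections, is weighted exactly like t2, which appears in one.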

20

Proposed Solution

• Create an index for each component type
  – Elements in each index are regarded as documents
  – Keep N, DF, TF for the specific component type
  – Can apply the regular vector space model on each index

• Given a query
  – Run the query in parallel on each index
  – Return a set of ranked result lists, one from each index

• Normalize the scores in each index into the range (0,1)
  – Achieved by computing $\rho(q, q)$

• Merge the normalized results into one ranked list of all components

Assumes the set of potential components to be returned is known in advance.

Assumes no nesting of the same component.
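The normalize-and-merge step can be sketched as follows. It assumes that normalization means dividing each raw score by the score of the query against itself in the same index; the raw scores and component ids below are made-up numbers standing in for whatever each per-type index returns.

```python
def normalize(results, self_score):
    """Divide every raw score by rho(q, q) computed on the same index."""
    return [(score / self_score, comp_id) for score, comp_id in results]


def merge(per_index):
    """Merge normalized result lists into one list, best score first."""
    merged = [pair for results in per_index.values() for pair in results]
    return sorted(merged, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical raw output of two per-type indexes for one query.
RAW = {
    "article": {
        "self_score": 4.0,
        "results": [(2.0, "article[1]"), (1.0, "article[2]")],
    },
    "sec": {
        "self_score": 1.0,
        "results": [(0.9, "sec[3]")],
    },
}

normalized = {
    ctype: normalize(entry["results"], entry["self_score"])
    for ctype, entry in RAW.items()
}
ranking = merge(normalized)

if __name__ == "__main__":
    for score, comp_id in ranking:
        print(f"{score:.2f}  {comp_id}")
```

On raw scores the first article (2.0) would outrank the section (0.9); after normalization the section comes first, which is why scores from different indexes are not merged directly.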

21

Conclusion

• Possible solutions to the following challenges:
  – Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
    • Document may no longer be the most natural unit
    • Components in a document may be more appropriate
  – Challenge 2 (Query): What is an appropriate query language?
    • Keyword (free text) query is no longer the only choice
    • Constraints on the structures can be posed

22

References

• Retrieving the most relevant XML components, by Y. Mass, M. Mandelbrod. INEX’03 workshop.

• Searching XML Documents via XML Fragments, by D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. SIGIR’03.

• XIRQL: A Query Language for Information Retrieval in XML Documents, by N. Fuhr, K. Großjohann. SIGIR’01.