On Efficient Part-match Querying of XML Data
DATESO 2004
Michal Krátký, [email protected]
Marek Andrt, marek.andrt@vsb.cz
Department of Computer Science, VŠB–Technical University of Ostrava
Czech Republic
www.fei.vsb.cz
Contents
Introduction – XML, query languages, indexing XML data, part-match querying.
Multi-dimensional approach to indexing XML data.
Extension of the multi-dimensional approach for keyword-based querying.
Index data structures. Preliminary experimental results.
Introduction
Native XML database: a set of documents is the database, and the DTD (or XML Schema) is its database schema.
XML query languages (XPath, XQL, XQuery,…).
A common feature is the ability to formulate paths in the XML graph (regular path expressions, XPath axes, and so on).
Indexing approaches are based on relational decomposition, tries, multi-dimensional structures, signatures, and so on.
Part-match querying XML data
Several approaches to keyword-based or phrase-based searching have been published: XQuery-IR (WebDb’02), XKeyword (ICDE’03), and so on.
Knowledge from information retrieval (IR) is applied. Query languages contain operators for matching term occurrences, for example contains() and ~=.
Multi-dimensional approach to indexing XML data
A graph is a set of paths. An XML document is decomposed into paths and labelled paths.
Labelled path lp ∈ XLP: s0, s1, ..., s_lPN
Path p ∈ XP: idU(u0), idU(u1), ..., idU(u_lLP), s
where idU(ui) is the unique number of a node ui.
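The decomposition can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the sample document is reconstructed from the values used in the books example of this talk, the two placeholder titles are hypothetical, and the numbering scheme (pre-order, attributes before child elements) is an assumption chosen because it reproduces the idU values shown on the slides.

```python
import xml.etree.ElementTree as ET

# Sample document rebuilt from the values in the example; the two
# "Placeholder title" strings are hypothetical.
DOC = (
    '<books>'
    '<book id="003-04212"><title>The Two Towers</title>'
    '<author>J.R.R. Tolkien</author></book>'
    '<book id="001-00863"><title>Placeholder title A</title>'
    '<author>J.R.R. Tolkien</author></book>'
    '<book id="045-00012"><title>Placeholder title B</title>'
    '<author>Joseph Heller</author></book>'
    '</books>'
)


def decompose(xml_text):
    """Return a list of (labelled_path, path_of_idU, string_value) triples."""
    root = ET.fromstring(xml_text)
    counter = 0
    result = []

    def visit(elem, names, ids):
        nonlocal counter
        # Assumed numbering: pre-order, attributes before child elements.
        names = names + (elem.tag,)
        ids = ids + (counter,)
        counter += 1
        for attr, value in elem.attrib.items():
            result.append((names + (attr,), ids + (counter,), value))
            counter += 1
        text = (elem.text or "").strip()
        if text:
            result.append((names, ids, text))
        for child in elem:
            visit(child, names, ids)

    visit(root, (), ())
    return result


PATHS = decompose(DOC)
for labelled_path, id_path, value in PATHS:
    print(",".join(labelled_path), id_path, repr(value))
```

Each triple pairs a labelled path (the names) with a path (the unique node numbers plus the string value), which is the raw material for the indexes.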
Indexes
Term index – stores the strings si of an XML document together with their identifiers idT(si).
Labelled path index – a storage of points representing labelled paths.
Path index – a storage of points representing paths.
Example: labelled path index, path index
Labelled paths: books,book,id; books,book,title and books,book,author. The points (0,1,2); (0,1,4) and (0,1,6) are created from the idT of the element and attribute names; they are assigned idLP = 0, 1 and 2.
For example, take the path to the value The Two Towers. It belongs to the labelled path books,book,title with idLP 1. The vector (1,0,1,3,5) is created from the idLP, the unique numbers idU of the elements, and the idT of the term.
Query for values of elements and attributes
XPath query: books/book[author=“Joseph Heller”]
Query processing has 3 phases, finding:
● idT of the terms from the term index,
● idLP 2 of the labelled path books,book,author from the labelled path index: point query (0,1,6),
● points from the path index: range query (2,0,0,0,12)×(2,max,max,max,12).
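The three phases can be traced with a small Python sketch. It is a toy model under stated assumptions: dictionaries and a linear scan stand in for the paged multi-dimensional structures, `MAX` stands in for the "max" coordinate, and idT 7 for 'J.R.R. Tolkien' is a made-up value (the other identifiers are taken from the example).

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, ...]
MAX = 2**31 - 1  # stand-in for the "max" coordinate of the range query

# Term index: string -> idT. idT 7 for 'J.R.R. Tolkien' is an assumption.
TERM_INDEX: Dict[str, int] = {
    "books": 0, "book": 1, "id": 2, "title": 4, "author": 6,
    "The Two Towers": 5, "J.R.R. Tolkien": 7, "Joseph Heller": 12,
}

# Labelled path index: point of idT values -> idLP.
LABELLED_PATH_INDEX: Dict[Point, int] = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}

# Path index: points (idLP, idU..., idT of the value).
PATH_INDEX: List[Point] = [
    (1, 0, 1, 3, 5),
    (2, 0, 1, 4, 7),
    (2, 0, 5, 8, 7),
    (2, 0, 9, 12, 12),
]


def range_query(points: Sequence[Point], low: Point, high: Point) -> List[Point]:
    """Linear-scan stand-in for a range query in a multi-dimensional index."""
    return [
        p for p in points
        if len(p) == len(low) and all(l <= x <= h for l, x, h in zip(low, p, high))
    ]


def query_value(names: Sequence[str], value: str) -> List[Point]:
    # Phase 1: idT of the terms from the term index.
    id_names = tuple(TERM_INDEX[n] for n in names)
    id_value = TERM_INDEX[value]
    # Phase 2: point query in the labelled path index gives idLP.
    id_lp = LABELLED_PATH_INDEX[id_names]
    # Phase 3: range query in the path index, node numbers left free.
    low = (id_lp,) + (0,) * len(names) + (id_value,)
    high = (id_lp,) + (MAX,) * len(names) + (id_value,)
    return range_query(PATH_INDEX, low, high)


print(query_value(["books", "book", "author"], "Joseph Heller"))
```

For the query on the slide, phase 3 issues the box (2,0,0,0,12) to (2,MAX,MAX,MAX,12), and the only matching point is the path 0,9,12 ending in 'Joseph Heller'.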
Enhanced querying
XPath axes are processed by a range query or a sequence of range queries. For example, the descendant axis of a node u: (0,idU(u0),…,idU(ul-1), idU(u),0,…, 0):(maxD,idU(u0),…,idU(ul-1), idU(u), maxD,…,maxD).
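A minimal sketch of how such a query box could be built, assuming points of a fixed dimension and a linear scan in place of the real index. The function names and `MAX_D` are illustrative; the two sample points come from the books example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]
MAX_D = 2**31 - 1  # stand-in for maxD


def descendant_box(ancestor_ids: Sequence[int], dims: int, max_d: int):
    """Build the (low, high) box for the descendant axis.

    ancestor_ids holds idU(u0), ..., idU(u): the path from the root to u.
    The first coordinate (idLP) and every coordinate after u are left free.
    """
    free = dims - 1 - len(ancestor_ids)
    low = (0,) + tuple(ancestor_ids) + (0,) * free
    high = (max_d,) + tuple(ancestor_ids) + (max_d,) * free
    return low, high


def range_query(points: Sequence[Point], low: Point, high: Point) -> List[Point]:
    """Linear-scan stand-in for the multi-dimensional range query."""
    return [
        p for p in points
        if len(p) == len(low) and all(l <= x <= h for l, x, h in zip(low, p, high))
    ]


# Two path-index points from the books example.
POINTS = [(1, 0, 1, 3, 5), (2, 0, 9, 12, 12)]

# Descendants of the first book element, whose path of idU values is 0,1.
LOW, HIGH = descendant_box((0, 1), dims=5, max_d=MAX_D)
print(range_query(POINTS, LOW, HIGH))
```

Only the point whose path starts with 0,1 falls inside the box; the path through the third book (0,9,...) is rejected on the third coordinate.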
Regular path expressions: for example, //title[name=‘Chaudhri’] is processed by a complex range query. The query can be processed in one run over the multi-dimensional data structure.
Comparison of approaches
Mainstream approaches (XISS, XPath Accelerator) index single elements (attributes). For example, the query /e1[e2=‘dog’] is processed by joining the results for the single elements.
Result formatting: for example, the result of the query //name is every matched subtree.
The Update and Insert operations are simple to support.
Keyword-based searching
Motivation: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE
A Path-Labelled Path-Term (PLT) index is added. It indexes a 3-dimensional space: (idP, idLP, idT).
idP is added to the point representing a path: (idP, idLP, idU0, idU1, …, idUl, s).
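As a rough illustration of the idea, the sketch below builds (idP, idLP, idT) points for the words of each path's string value and answers a term lookup restricted to one labelled path. The tokenisation (upper-casing, splitting on whitespace and commas), the sample strings, and the idLP values 0 and 1 are all assumptions made for the example, not details from the talk.

```python
from typing import Dict, List, Sequence, Set, Tuple


def build_plt(paths: Sequence[Tuple[int, str]]):
    """paths[idP] = (idLP, string value). Returns (term ids, PLT points)."""
    term_ids: Dict[str, int] = {}
    plt: Set[Tuple[int, int, int]] = set()
    for id_p, (id_lp, text) in enumerate(paths):
        # Assumed tokenisation: upper-case words split on spaces and commas.
        for word in text.upper().replace(",", " ").split():
            id_t = term_ids.setdefault(word, len(term_ids))
            plt.add((id_p, id_lp, id_t))
    return term_ids, plt


def paths_containing(term: str, id_lp: int, term_ids, plt) -> List[int]:
    """Range query (0..max, idLP, idT) over the PLT index, as a linear scan."""
    id_t = term_ids.get(term.upper())
    if id_t is None:
        return []
    return sorted(p for (p, lp, t) in plt if lp == id_lp and t == id_t)


# Hypothetical sample: idLP 0 = path to TITLE, idLP 1 = path to PERSONA.
PATHS = [
    (0, "The Tragedy of Othello"),
    (1, "OTHELLO, a noble Moor"),
    (1, "IAGO, his ancient"),
]
TERM_IDS, PLT = build_plt(PATHS)
print(paths_containing("OTHELLO", 1, TERM_IDS, PLT))
```

The returned idP values would then be used to fetch the full path points, whose first coordinate is idP.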
Path-Labelled Path-Term index Example
Query processing plan Example
Index data structures
Paged and balanced multi-dimensional data structures: (B)UB-trees, variants of R-trees.
Problems:
● indexing points with different dimensions,
● narrow range queries: a signature is applied for efficient processing (Signature R-tree).
Efficient processing of the complex range query.
Efficient processing of the complex range query
Complex range query = sequence of range queries: qb1,qb2,…,qbn.
The query can be processed in one run over the multi-dimensional data structure.
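The difference between the two strategies can be illustrated with a toy model. This is not the algorithm from the talk: it assumes leaf pages described by their minimum bounding boxes and counts one page read per visited page, as a stand-in for the disk access figures reported in the experiments. The pages and boxes are invented data.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]
Box = Tuple[Point, Point]  # (low corner, high corner)


def mbr(points: Sequence[Point]) -> Box:
    """Minimum bounding box of the points stored on one page."""
    dims = range(len(points[0]))
    return (tuple(min(p[d] for p in points) for d in dims),
            tuple(max(p[d] for p in points) for d in dims))


def intersects(a: Box, b: Box) -> bool:
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def contains(box: Box, p: Point) -> bool:
    return all(l <= x <= h for l, x, h in zip(box[0], p, box[1]))


def naive(pages: Sequence[Sequence[Point]], boxes: Sequence[Box]):
    """One independent range query per box; a page may be read repeatedly."""
    hits: List[Point] = []
    reads = 0
    for box in boxes:
        for page in pages:
            if intersects(mbr(page), box):
                reads += 1
                hits += [p for p in page if contains(box, p)]
    return sorted(set(hits)), reads


def one_run(pages: Sequence[Sequence[Point]], boxes: Sequence[Box]):
    """Single pass: each page is read at most once and tested against all boxes."""
    hits: List[Point] = []
    reads = 0
    for page in pages:
        live = [b for b in boxes if intersects(mbr(page), b)]
        if live:
            reads += 1
            hits += [p for p in page if any(contains(b, p) for b in live)]
    return sorted(set(hits)), reads


PAGES = [[(0, 0), (1, 1)], [(2, 2), (3, 3)], [(8, 8), (9, 9)]]
BOXES = [((0, 0), (2, 2)), ((1, 1), (3, 3))]
print(naive(PAGES, BOXES))
print(one_run(PAGES, BOXES))
```

Both strategies return the same points; in this toy data the naive version reads four pages and the single pass reads two, because the overlapping boxes touch the same pages.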
Experimental results
Protein Sequence Database XML document:
● document size: 683MB,
● number of elements: 21,305,818,
● number of attributes: 1,290,647,
● maximal length of a path: 7.
Index structures: BUB-forest, R*-forest, Signature BUB-tree and R*-tree; trees indexing spaces of dimension n=7 and n=9.
Experimental results
Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/authors/author='Smith, E.L.']
Experimental results: regular path expression
Query: //uid='89071748'; 5 labelled paths were matched.
Naive processing of the complex range query: DAC: 368
Efficient processing of the complex range query: DAC: 139
Time: 0.03s, Improvement: 2.5x
Preliminary experimental results: keyword-based searching
othello.xml:
● document size is 250kB,
● maximal length of the path: 6
● number of paths: 4,967
● number of labelled paths: 13
● number of terms: 8,744
● PLT index: 27,127
Preliminary experimental results: keyword-based searching
Query: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE
Labelled path index: result size: 1, DAC: 3
PLT index: result size: 1, DAC: 3
Path index: result size: 1, DAC: 13
Path index: result size: 1, DAC: 4
Conclusion
Θ(m × log n), Θ(c × m × log n) vs. Θ(m1 × m2), m1, m2 ≥ m.
Efficient processing of a query with an AND condition: a signature is applied.
The multi-dimensional approach may also be applied to term searching (e.g. *comp*).
The update operation on XML documents.
Comparison with other approaches on test collections (INEX, XMark, …).
http://www.cs.vsb.cz/arg
Paths, labelled paths
Paths 0,1,2,’003-04212’; 0,5,6,’001-00863’ and 0,9,10,’045-00012’ belong to the labelled path books,book,id.
. . .
Paths 0,1,4,’J.R.R. Tolkien’; 0,5,8,’J.R.R. Tolkien’ and 0,9,12,’Joseph Heller’ belong to the labelled path books,book,author.
Complex queries
Query for values and XPath axis processing, e.g. books/book[author='Joseph Heller']/title
● A combination of the techniques described above: query for a value, XPath axis processing.
Regular path expression queries, for example: books//author
● This query is processed by a sequence of range queries over the path and labelled path indexes: books,author - books,*,author - books,*,…,*,author.
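The sequence of range queries for books//author can be sketched as follows. This is an illustration under assumptions: the wildcard `*` is modelled as the full coordinate range 0..`MAX`, the maximal labelled path length is passed in explicitly, and a linear scan stands in for the labelled path index. The idT values 0 (books) and 6 (author) and the three points come from the books example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]
Box = Tuple[Point, Point]
MAX = 2**31 - 1  # stand-in for the largest idT


def descendant_step_queries(first_id: int, last_id: int,
                            max_len: int, max_id: int) -> List[Box]:
    """Boxes for first,last - first,*,last - ... up to max_len names."""
    queries = []
    for gaps in range(max_len - 1):
        low = (first_id,) + (0,) * gaps + (last_id,)
        high = (first_id,) + (max_id,) * gaps + (last_id,)
        queries.append((low, high))
    return queries


def match(points: Sequence[Point], queries: Sequence[Box]) -> List[Point]:
    """Linear-scan stand-in: a box only applies to points of its dimension."""
    return [
        p for p in points
        if any(len(p) == len(lo) and all(l <= x <= h for l, x, h in zip(lo, p, hi))
               for lo, hi in queries)
    ]


# Labelled path index points from the books example (idT of the names).
LABELLED_PATHS = [(0, 1, 2), (0, 1, 4), (0, 1, 6)]

QUERIES = descendant_step_queries(first_id=0, last_id=6, max_len=3, max_id=MAX)
print(match(LABELLED_PATHS, QUERIES))
```

With a maximal length of 3 there are two boxes, one for books,author and one for books,*,author; only the labelled path books,book,author matches.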
(B)UB-tree, R-tree
[Figure: UB-tree (Z-address, B-tree)]
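For readers unfamiliar with Z-addresses: a UB-tree orders multi-dimensional points by interleaving the bits of their coordinates and stores the resulting one-dimensional keys in a B-tree. A minimal sketch of the bit interleaving follows; the function name, the fixed bit width, and the choice of which dimension supplies the lowest bit are conventions picked for this example.

```python
from typing import Sequence


def z_address(point: Sequence[int], bits: int) -> int:
    """Interleave the lowest `bits` bits of every coordinate (Morton order).

    Bit b of coordinate d lands at position b * len(point) + d, so the
    first dimension supplies the least significant bit of each group.
    """
    dims = len(point)
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * dims + dim)
    return z


# Sorting by Z-address gives the one-dimensional order a B-tree would store.
POINTS = [(3, 3), (0, 1), (2, 1), (1, 0)]
ORDERED = sorted(POINTS, key=lambda p: z_address(p, bits=2))
print(ORDERED)
```

Points that are close in Z-order tend to be close in space, which is what lets a range query be answered by visiting a limited set of key intervals.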
Narrow range query: signature multi-dimensional data structures
Regions intersecting a query hyper-box are searched: O(NI × log_c n).
The ratio cR of relevant regions NR to intersecting regions NI becomes ≪ 1 with increasing dimension.
Signatures are applied to better filter out irrelevant regions: signature multi-dimensional structures.
Signature R-tree