On View Support for a Native XML DBMS

1

On View Support for a Native XML

DBMS

Ting Chen , Tok Wang Ling

School of Computing, National University of Singapore Daofeng Luo, Xiaofeng Meng Information School , Remin University of China

2

Outline View for XML Documents

Two Main Approaches Problems

ORA-SS: Object-Relationship-Attribute Model for Semi-structured Data ORA-SS class diagram and instance diagram ORA-SS for view schema definition

Element-Based Clustering (EBC) Basic Approach ORA-SS and EBC

XML View Transformation Problem Definition Algorithm

Conclusion

3

View for XML Documents

Two main approaches to define views Define views in script languages like XQuery or XSLT

General but demanding from users’ point of view because XQuery and XSLT scripts are complex

Difficult to optimize from performance view point Define views by Schema-Mapping

E.g. Clio[7] and eXeclon[3] A declarative approach; alleviate users from writing complex scripts

to perform view transformation Schema mappings can then be translated into XQuery (or XSLT)

scripts We focus on the problem of view transformation through

schema mapping

4


View for XML Documents via Schema-Mapping Problem: Current XML schema formats are not

able to express views with semantic constraints, resulting in ambiguity

E.g. (Next Slide) The source XML file contains information about researchers working under different projects and the publication list for each researcher.

5


The view schema in Fig (c) of the above diagram is such an example. It has at least two possible meanings which can lead to different view results!1. For each project, list all the papers published by project members; for each paper of the project, list all the authors of the paper.2. For each project, list all the papers published by project members; for each paper of the project, list all the authors of the paper who work for the project.

6

ORA-SS Data Model

ORA-SS[2] Object Class Relationship Type Attribute( Object attribute or Relationship attribute) E.g. An ORA-SS Instance Diagram

* Compared with the XML document in Slide 5, two extra fields (attribute Date and sub-element Position) are added in the above ORASS instance diagram

7

ORA-SS Data Model

ORA-SS Schema Diagram

•There are two binary relationship types in the schema: Project-Researcher(JR) and Researcher-Paper(RP). The set of papers under a researcher doesn’t depend on the project he/she works in.•Position is an attribute of relationship type JR instead of Researcher. This means that a researcher may hold different positions across projects he works in. •Date is a single-valued attribute of object class Paper. Different occurrences of the same paper will always have the same Date value. •J_Name,R_Name and P_ID are identifiers of object classes Project , Researcher and Paper respectively as indicated by solid circles. Identifier values are used to tell if two object occurrences are identical.

8

ORA-SS Data Model ORA-SS for View Schema Definition

It is able to define views with different semantics View (a) has two binary relationship types. The intention of the view

schema is to find all the papers published by researchers in a project; and for each paper to find all of its authors.

View (b) has only one ternary relationship type. The view is defined to find all the papers published by researchers in a project; however, for each paper View (b) only finds those authors working for the project.

View (a) View (b)

9

ORA-SS Data Model ORA-SS for View Schema Definition

The two view schemas in Slide 8 correspond to two different XSLT scripts. The following is the XSLT script for schema (a):

<root> <xsl:for-each-group select="root/Project" group-by="@J_Name"> <Project> <J_Name><xsl:value-of select="@J_Name"/></J_Name> <xsl:for-each-group select="current-group()/Researcher/Paper"

group-by="@P_Name"> <Paper> <xsl:variable name="vPName" select="@P_Name"/> <P_Name><xsl:value-of select="@P_Name"/></P_Name> <xsl:for-each-group select="/root/Project/Researcher[Paper/@P_Name =$vPName]" group-by="@R_Name">

<Researcher> <R_Name><xsl:value-of select="@R_Name"/></R_Name> </Researcher></xsl:for-each-group>

</Paper> </xsl:for-each-group> </Project> </xsl:for-each-group></root>

10

ORA-SS Data Model The following is the XSLT script for schema (b):

The main difference of two scripts lies in the third xsl:for-each-group directive for Researcher. Script for Schema (a) needs to search the whole document to find the complete author list of a paper because authors may not work for the same project. On the other hand, script for Schema (b) avoids the global search because it only needs find authors of the paper working for the same project.

<root> <xsl:for-each-group select="root/Project" group-by="@J_Name"> <Project> <J_Name><xsl:value-of select="@J_Name"/></J_Name> <xsl:for-each-group select="current-group()/Researcher/Paper"

group-by="@P_Name"> <Paper> <xsl:variable name="vPName" select="@P_Name"/> <P_Name><xsl:value-of select="@P_Name"/></P_Name>

<xsl:for-each-group select="current-group()/.." group-by="@R_Name"> <Researcher> <R_Name><xsl:value-of select="@R_Name"/></R_Name></Researcher></xsl:for-each-group>

</Paper> </xsl:for-each-group> </Project> </xsl:for-each-group></root>

11

Element-Based-Clustering (EBC) Element Based Clustering[5]

Extension of Element Based (EB[6]) Strategy Element nodes (records) with the same tag name are

clustered and organized as a list

A

A

B C

A

B

A

A

B C

A

B

12

Element-Based-Clustering (EBC)

EBC: Node labeling EBC gives labels for nodes in a XML document Labels for nodes in a XML data tree can be calculated

in the following manner:

1. The root element has label nil

2. Perform a pre-order traversal (i.e. Document order) on the XML document

For node x: (“+” here means string concatenation)label(x) = label(x.parent) + “.” + position of x in x.parent’s childList

Node A is ancestor of node B if Label(A) is the prefix of node Label(B) and vice versa

13

ORA-SS and EBC

Project j1(1) j2(2)

Researcher r1(1.1) r2(1.2) r2(2.1) r3(2.2)

Paper p1,05/2002(1.1.1) p1,05/2002(1.2.1) p2,03/2000(1.2.2)

•How does ORA-SS schema help tune XML document storage?

p1,05/2002(2.1.1) p1,03/2000(2.1.2) p2,05/2002(2.2.1)

Position Leader(1.2.3) Staff(2.1.3) Leader(2.2.2)

14

ORA-SS and EBC

1. Object identifier of an object will be stored together with the object. Objects of the same class form a cluster. Each cluster is a sequential file.

2. Relationship attribute values will be stored in separate cluster. 3. For object attribute values, we need some heuristics. It is attempting to store object attributes together with the object

since they are likely to be accessed at the same time. However, if an object class has too many objects attributes, to store all attribute values together with the object brings us to a situation similar to Subtree-Based storage strategy. One solution is to store only those essential (this can be determined by users) object attributes with the object and leave other to separated clusters.

4. Node (which can be object, relationship value and object attribute value) labels will be stored together with nodes.

15

Problem Definition

Given an XML document D1, a source schema V1 and a valid view schema V2 of V1, transform D1 to document D2, so that D2 is a valid document under V2.

View Transformation: Problem Definition

Source Document View Document

View SchemaSource Schema

User Defined Schema Mapping

View Transformation

16

View Transformation

Researcher

Proj

Paper

View Schema (Ambiguous)

Researcher

Proj

Paper

PR;2

JP;2

Researcher

Proj

Paper

JPR;3

Slide 8(a)

Slide 8(b)

•View schema defined in DTD can have different interpretations which result in different views

?

17

View Transformation View schema expressed in ORA-SS is unambiguous

The relationship set of an ORA-SS view schema clearly defines how view document should be constructed

Two basic techniques are used in construction of a single relationship in view schema Structural join: SJ (based on object labels)

For each relationship R in view schema, we first use structural join to find the set of paths of type R such that the object occurrences in each path locate on the same path in source document.

Value join : Merge (based on logical object identifiers)

In XML document an object can have many occurrences. Two occurrences are considered as the same if they have identical object identifier. The set of paths resulted from structural join is then value joined (or merged) using logical object keys.

18

View Transformation

What makes things complicated? A path in view schema can have more than one relationship and we need to

join two relationship together. E.g. View schema in Slide 8(a) contains two relationships.

Two relationships are “joined” on their overlapping object classes. So the results constructed for one relationship may be used in construction of another.

Value join (merge) based on logical keys destroys the “sorted-ness” of the output path list of structural join. The consequence is that the output of value join can’t be used in subsequent structural join with paths from other relationships efficiently. (Structural join requires sorted input lists)

Solution: Duplicated-Preserving Merge (D-Merge). It keeps the structural join output path list intact. For two occurrences of the same object in the list, their child contents will be merged and then each of them will have its own copy of the merged content. D-Merge is used only when a relationship needs to join with another relationship.

19

View Transformation: Algorithm The relationship set of a view schema determines the

view transformation process. Take the two view schemas in Slide 8 as an example:

View(a):

1. L := SJ(list(R), list(P), {P,R})

2. L := D-Merge(L, {P})

3. L := SJ(L, list(J), {J,P});

4. L := Merge(L,{J});

View(b):

1. L := SJ(list(J), list(P), {J,P})

2. L := SJ(L, list(R), {J,P,R});

3. L := Merge(L,{J});

20

View Transformation: Algorithm Structural Join

Based on Object Label Binary Structural Join[1]

Input

Two sorted (on node numbers) node lists: AList of potential ancestor nodes and DList of potential descendants nodes

Output

OutputList = [(ai; dj)] of join results, in which ai is the parent/ancestor of dj and ai is from AList and dj is from DList

EBC schemes stores elements with the same tag name in pre-ordered( i.e. sorted on element number) way; structural join can be naturally applied

21

View Transformation: Algorithm Structural Join:

Based on Object Label Complex Structural Join: (Example: structural join of three sorted

input lists) Input Sorted node lists: A,B,C

Output OutputList = [(ai; bj ; ck)] of join results, in which ai, bj , ck (from List A,B,C

respectively) are located on the same path in source document Two binary joins:

Step 1: Join A and B OutputList AB: [(ai; bj )] sorted on ai

Step 2: Join AB (using ai as node label) and C OutputList = [(ai; bj ; ck)] sorted on ai Important: ck should be on the same path as both ai AND bj

22

View Transformation: Algorithm A motivating example:

Paper

Proj

Researcher

RP;2

JR;2

Researcher

Proj

Paper

PR;2

JP;2

Source Schema View Schema

Source Document:

23

View Transformation: Algorithm Data Structures

Record Object label Object key (object identifier) ChildList: an array of children record references

Tuple A tuple consists of an array of records A tuple has a root record Example

Array of Tuples

Record:r1

(ChildList:

<1,2>), ROOT

Record:r2

(ChildList:

<3>)

Record:r3

(ChildList:

nil)

Record: r4

(ChildList:

nil)

Tuple Array Index: 0 1 2 3

r1

r2

r4

r3

Corresponding Tree:

24

View Transformation: Algorithm Operations

Structural Join (SJ) Input

L1: Array of Tuples of type <A1,A2,A3,…, An> L2: Array of Tuples of type <B1,B2,B3,…, Bn> Mask M: Bit Array, which specifies the object classes which participate in structural join

Output L3: Array of Tuples of type <A1,A2,A3,…, An,B1,B2,B3,…, Bn>

For each tuple in L3, its objects whose types are specified in M MUST locate on the same path in the source document

Merge (Merge) Input

L1: Array of Tuples of type < A1,A2,A3,…, An> Mask M: Bit Array, which specifies a list of object classes

Output L: Array of Tuples

For any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M, the two tuples will be merged.

Duplicate-Preserving Merge (D-Merge) Input

L1: Array of Tuples of type < A1,A2,A3,…, An> Mask M: Bit Array, which specifies a list of object classes

Output L: Array of Tuples

For any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M, the contents of the two tuples will be merged. t1 and t2 will then both have the merged content.

Note: Structural Join operation uses object number while Merge and D-Merge use object identifer.

25

View Transformation: Algorithm Definition 1:Relationship R1 is lower than R2, or R1 < R2, if the top-most participating object class of R1

is a descendent of the top-most participating object class of R2. Definition 2: A relationship is independent if its set of participating object classes is not included in any

other relationship. Otherwise the relationship is nested. Algorithm (Main steps):

Step 0. Initialize L1 := null, L2 := null, M := { }, //M contains the set of object classes that has been structurally joined L[O] : = null for each O in the view schema

Step 1. Put the independent relationships of the ORA-SS view schema into a partially ordered set (S, < )Step 2: While (S is not empty){

Extract the first relationship R from S;If (M∩R is not null) L1 := L[O] for the highest O(O has the smallest level in the view schema) in M∩R;Else L1 := null; For each object class O in (R – M) sorted decreasingly by their depths in the view schema M := M U {O}; L2 := Cluster[O]; //Cluster[O] is an array storing objects of class O if(L1 != null ) L1 := StructuralJoin(L1,L2,M∩R); else

L1: = L2; L[O] = L1;If R is the highest level relationship L1 := Merge(L1, R.top_most_object_class);Else L1 := D-Merge(L1, R – {object classes of R which doesn’t appear in

any other independent relationship}); }

Step 3: return L[Root]; where Root is the root of the view schema;

26

View Transformation: Algorithm

Example: View Schema Slide 8a Step 1: Construct Relationship Project-Paper

View Schema

Step 1.1. L := SJ (list(P), list(R), {P,R})

Step 1.2. L := D-Merge (L, {P})

Source Doc:

Researcher

Proj

PaperPR;2

JP;2

S = {PR,JP};

27

View Transformation: Algorithm Example: View Schema Slide 8a

Step 2: Construct Relationship Project-Paper

View Schema

Step 2.1. L := SJ(L, list(J), {J,P})

Step 2.2. L := Merge (L, {J} )

Source Doc:

Researcher

Proj

Paper

PR;2

JP;2

S = {JP};

28

Conclusion

We demonstrate how to combine an efficient XML storage scheme (EBC) and an expressive XML data model (ORA-SS) to provide XML view support for a native DBMS system.

ORA-SS, used as view schema definitions, can express a great variety of constraints graphically. More importantly, it can avoid ambiguity, which is a typical problem if by DTD or XML Schema is used as view schema format.

In our transformation method, a relationship type is the basic unit

of transformation. Both object label based structural join and logical key based value join are employed to construct view results. ORASS view schema information can guide correct view transformation.

29

Reference1. Shurug Al-Khalifa, H. V. Jagadish, Nick Kouda, Jignesh M. Patel, Divesh

Srivastava, YuqingWu. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In Proceedings of ICDE, 2002

2. Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee: ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data TR21/00, Technical Report, Department of Computer Science, National University of Singapore, December 2000.

3. eXcelon. An General XML Data Manager. http://www.exceloncorp.com/4. H. V. Jagadish, Shurug AL-Khalifa, et al. TIMBER: A Native XML Database.

Technical Report, University of Michigan,April 2002.5. Xiaofeng Meng, Daofeng Luo, Mong Li Lee, Jing An. OrientStore: A Schema

Based Native XML Storage System. In Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003

6. J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J.Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, Vol.26(3):54-66,September 1997.

7. Lucian Popa, Mauricio A. Hern´andez ,Yannis Velegrakis , Ren´ee J. Miller, Felix Naumann, Howard Ho. Mapping XML and Relational Schemas with Clio. In ICDE 2002 Demo, 2002