Rainbow: Bridging XML and Relational Databases Using a Flexible Mapping
The Design, Implementation, and Evaluation of the Rainbow System
A Major Qualifying Project Report
Submitted to the Faculty
Of the
WORCESTER POLYTECHNIC INSTITUTE
In partial fulfillment of the requirements for the
Degree of Bachelor of Science
By
Tien Vu, John Lee, Mirek Cymer
Date: 5/02/2001
Approved: Professor Elke A. Rundensteiner
Authorship Page
T = Tien Vu, J = John Lee, M = Mirek Cymer
1 Introduction: T
1.1 Motivation: T
1.2 Our Approach: T
1.3 The Rainbow System: T
1.4 MQP Project Goals: T
1.5 Additional Team Goals: T
1.6 Outline of the Remaining Sections: TJ
2 Background: M
2.1 Readings: M
2.2 Basics of XML and DTDs: M
2.3 Technologies
2.3.1 SQL, Relational Databases, and Oracle 8i: TJM
2.3.2 JDBC and ResultSet Classes: J
2.4 Software Development: TM
2.4.1 Object Oriented (OO) Design: TJ
2.4.2 Software Migration: TJ
3 DTD Metadata Management: J
3.1 Metadata Tables: J
3.2 Data Schema: J
3.3 DTD Manager and XML Manager Extensions: J
3.3.1 Original DTD Manager and XML Manager: J
3.3.2 Support for Multiple DTDs and XMLs: J
4 Implementation of the Rainbow System: J
4.1 Restructuring Subsystem: J
4.1.1 The Restructuring Functionality: J
4.1.2 A Prototype Design: J
4.1.3 Implementation Details: J
4.2 Restructuring Operators: T
4.2.1 Pushup and Pushdown Attribute Operators: T
4.2.2 Rename Item and Attribute Operators: T
4.2.3 Pushup and Pushdown Nesting Operators: T
4.2.4 Other Operators: T
4.3 Rainbow Graphical User Interface: M
5 Implementation Details: T
5.1 System Architecture: T
5.2 Code Facts: T
5.3 Existing System Packages: J
5.4 Implementation Environment: J
6 Experimental Evaluation: J
6.1 Experimental Setup: J
6.1.1 Scope and Design of a Test Plan: J
6.1.2 Designing an Experimental Test Bed: J
6.2 Performance Considerations: J
6.2.1 Restructuring Time: J
6.2.2 Query Time: J
6.3 Cost Factors: J
6.4 Experimental Data: J
6.5 Restructuring Setup Time Evaluations: J
6.5.1 Experiment 1: Scalability of Increase in Operations: J
6.5.2 Experiment 2: Operation Scalability: J
6.6 Query Time Evaluations for Restructured Schema: J
6.6.1 Experiment 3: Query Performance: J
6.7 Analyses: J
7 Conclusions: TJ
7.1 Summary of the Rainbow Project: T
7.2 Experience Gained and Lessons Learned: T
7.2.1 Object-Oriented Design: T
7.2.2 UML: T
7.2.3 The Java Programming Language: T
7.2.4 XML: TJ
7.2.5 Database Management Systems: TJ
7.2.6 Software Engineering Experience: TJ
7.2.7 Designing the Test Plan: TJ
7.2.8 Working as a Team: TJ
7.3 Future Work: TJ
References: T
Past Works and Books: TJ
Web-pages: TJ
Appendixes: TJM
Readme for System Environment Setup and Demo: TJM
Abstract
The use of Extensible Markup Language (XML) documents to model and exchange data over the web is becoming increasingly prominent and promising. Due to the maturity and performance of existing relational database technology, there is great interest in exploiting this technology as a backend engine to store, manage, and query XML data. It is well known that different relational schemas yield different query performance for a given workload. Hence, one fixed way of mapping XML into relational databases is not sufficient to reach an overall optimized query performance for a given query workload. The Rainbow system proposes a flexible mapping approach that first loads the XML data into a relational database system and then applies relational restructuring techniques, with the help of SQL queries and database views, to the loaded data and schema. A key ingredient of the Rainbow solution is the management, in relational format, of metadata describing both the XML structure information (DTD) and the chosen mapping. In this project, we have designed, implemented, and performed a preliminary evaluation of this flexible mapping component of the Rainbow system.
Table of Contents
1 Introduction
1.1 Motivation
1.2 Our Approach
1.3 The Rainbow System
1.4 MQP Project Goals
1.5 Additional Team Goals
1.6 Outline of the Remaining Sections
2 Background
2.1 Readings
2.2 Basics of XML and DTDs
2.3 Technologies
2.3.1 SQL, Relational Databases, and Oracle 8i
2.3.2 JDBC and ResultSet Classes
2.4 Software Development
2.4.1 Object Oriented (OO) Design
2.4.2 Software Migration
3 DTD Metadata Management
3.1 Metadata Tables
3.2 Data Schema
3.3 DTD Manager and XML Manager Extensions
3.3.1 Original DTD Manager and XML Manager
3.3.2 Support for Multiple DTDs and XMLs
4 Flexible Mapping Support in the Rainbow System
4.1 Restructuring Subsystem
4.1.1 The Restructuring Functionality
4.1.2 A Prototype Design
4.1.3 Implementation Details
4.2 Restructuring Operators
4.2.1 Pushup and Pushdown Attribute Operators
4.2.2 Rename Item and Attribute Operators
4.2.3 Pushup and Pushdown Nesting Operators
4.2.4 Other Operators
4.3 Rainbow Graphical User Interface
5 Implementation Details
5.1 System Architecture
5.2 Code Facts
5.3 Existing System Packages
5.4 Implementation Environment
6 Experimental Evaluation
6.1 Experimental Setup
6.1.1 Scope and Design of a Test Plan
6.1.2 Designing an Experimental Test Bed
6.2 Performance Considerations
6.2.1 Restructuring Time
6.2.2 Query Time
6.3 Cost Factors
6.4 Experimental Data
6.5 Evaluations of Restructuring Setup Time
6.5.1 Experiment 1: Scalability of Increase in Operations
6.5.2 Experiment 2: Operation Scalability
6.6 Query Time Evaluations for Restructured Schema
6.6.1 Experiment 3: Query Performance
6.7 Analyses
7 Conclusions
7.1 Summary of the Rainbow Project
7.2 Experience Gained and Lessons Learned
7.2.1 Object-Oriented Design
7.2.2 UML
7.2.3 The Java Programming Language
7.2.4 XML
7.2.5 Database Management Systems
7.2.6 Software Engineering Experience
7.2.7 Designing the Test Plan
7.2.8 Working as a Team
7.3 Future Work
References
Past Works and Books
Web-pages
Appendixes
Readme for System Environment Setup and Demo
List of Illustrations
Figure 1: Proposed Rainbow Architecture
Figure 2: Examples of XML Elements
Figure 3: XML Content Definitions
Figure 4: XML/DTD Example Documents
Figure 5: Algorithm of Mapping DTD into Relational Schema
Figure 6: DTD Manager and XML Manager
Figure 7: Restructure Function
Figure 8: Restructure Subsystem
Figure 9: Restructuring Subsystem Class Diagram
Figure 10: Pushup Attribute Operator
Figure 11: Pushup and Pushdown Attribute
Figure 12: Pushup and Pushdown Nesting
Figure 13: Screenshot 1
Figure 14: Screenshot 2
Figure 15: Screenshot 3
Figure 16: Rainbow Architecture with RDBMS
Figure 17: Statistics of Class Implementation
Figure 18: Experiment DTD
Figure 19: Batch versus Serial Restructuring
Figure 20: Restructuring Overhead Results
Figure 21: Join Query Performance Results
List of Tables
Table 1: Student Address Relation
Table 2: Relation Resulting from a Query Evaluation
Table 3: Item DTDM (DTDM-Item table)
Table 4: Nesting DTDM (DTDM-Nesting table)
Table 5: Attribute DTDM (DTDM-Attribute table)
Table 6: DTDMS for Figure 2’s DTD
Table 7: Data Relations for Figure 2’s XML Document
Table 8: Parameters of Restructuring Evaluations
1 Introduction
1.1 Motivation
The use of Extensible Markup Language (XML) documents to store information
is becoming increasingly prominent and promising [14]. XML’s main strength of
organizing information using a human-readable and machine-interpretable file format
makes it ideal for exchanging data between different systems. Unlike Hypertext Markup
Language (HTML), which stores information about the physical presentation of a web page,
XML represents information about the meaning of the data itself through appropriate tags [9].
An element of an XML document refers to a defined tag [9]. See Figure 2 for an
example of an XML document. The nesting of elements represents the logical hierarchy
among the elements. For this reason, information can be extracted more easily out of an
XML document in response to a user request.
Efforts by industry groups to specify a standard structure for XML documents, in
the form of Document Type Definitions (DTDs) or, more recently, in the form of XML
Schema, facilitate the exchange of XML documents among enterprises [14]. By enabling
automatic data flow among businesses, XML is pushing the world into the electronic
commerce era. Collecting, analyzing, mining, and managing XML data will hence
become tremendously important tasks for future web-based applications [2]. Such
applications require a system that can store, retrieve, update, and query XML documents.
One prominent method for building such a system is to store XML documents in
relational databases. Relational Database Management Systems (RDBMSs)
deal with data storage, querying, concurrency, and other features. Many database
vendors, such as Microsoft, IBM, Informix, and Oracle, have started to support XML in
their own RDBMSs. The benefits of managing XML with a relational
database are manifold, including the availability of mature database tools, efficient
query and analysis tools, and easy integration with existing business databases.
However, there are open issues to be resolved concerning XML. These issues include
mapping between XML and the relational model, XML update propagation, and XML
query translation and optimization. This MQP mainly focuses on the issue of
mapping between XML and the relational model.
1.2 Our Approach
Zhang et al. [2] propose a metadata-driven approach that addresses the issue of
flexible mapping. The proposed approach can generate a relational schema out of a DTD
and store the XML data compliant with that DTD in relational tables that can then be
queried with Structured Query Language (SQL) queries. This metadata-driven approach
includes the loading of a DTD into relational metatables, the construction of a relational
schema called Metadata Tables (DTDMs), the restructuring of the DTDMs for efficient
querying, the automatic construction of a relational schema for the XML documents
that conform to this DTD, and the loading of the XML data into the prepared relational
database schema.
As an additional feature, the approach of Zhang et al. [3] also handles updates to
the XML documents. The information in the database correctly represents the
information of the external sources, those that hold the up-to-date XML pages, because it is
updated by means of a synchronization process that utilizes the DTDMs. The initial storage of
the XML data utilizes a fixed mapping that retains the hierarchical semantics of the DTD
loaded. However, once the XML data is stored, a restructuring process may be called
upon to modify the DTDM schemas and the XML data, because different mappings yield
different query processing performance [8].
To reap the benefits of the metadata-driven approach, the data contained within
an XML document has to be accessed efficiently. This approach should allow for easy
retrieval and modification of the XML data in a database system by means of
SQL. Such an approach is preferable to requiring a programmer to
traverse the XML and DTD documents by means of specialized miniature
parsing programs. It would not only save development time and money through code reuse,
but would also reduce the risk of errors arising from additional programming.
1.3 The Rainbow System
Much of the information presented here is extracted from [8]. To keep track of
XML updates and provide optimal query performance, the metadata-driven system,
hereafter referred to as Rainbow, is composed of a DTD manager, a basic storage manager,
a schema creator, a restructurer, an XML query engine, and an XML schema, as depicted in
Figure 1. A working system was already in place and included some of the components
of this figure.
Figure 1: Proposed Rainbow Architecture
The DTD Manager will load DTD documents into our system by storing them in
DTDMs as part of the system dictionary tables. DTDMs model the DTD as a collection
of items, attributes and nesting relationships. After the DTDMs repository is loaded, the
schema creator will infer a relational schema from the DTDMs repository.
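As a rough illustration of the description above, the following sketch models items and nesting relationships as in-memory Java objects for the small prices/book DTD that appears in Figure 4. It is not the Rainbow implementation: the class DtdmSketch, its field names, and the use of Java collections in place of relational tables are assumptions made only for this example, and the code uses a more recent Java version than the Java 1.2 used in the project. Attributes are omitted because that DTD declares none.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for the DTDM tables; all names are assumed. */
final class DtdmSketch {

    /** One item row: an element type declared in the DTD. */
    static final class Item {
        final int id;
        final String name;

        Item(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** One nesting row: the parent item may contain the child item. */
    static final class Nesting {
        final int parentId;
        final int childId;
        final String quantifier; // "1" for exactly one, "*" for zero or more

        Nesting(int parentId, int childId, String quantifier) {
            this.parentId = parentId;
            this.childId = childId;
            this.quantifier = quantifier;
        }
    }

    final List<Item> items = new ArrayList<>();
    final List<Nesting> nestings = new ArrayList<>();

    /** Registers an element type and returns its generated id (1, 2, 3, ...). */
    int addItem(String name) {
        int id = items.size() + 1;
        items.add(new Item(id, name));
        return id;
    }

    /** Records that the parent element type may contain the child element type. */
    void nest(int parentId, int childId, String quantifier) {
        nestings.add(new Nesting(parentId, childId, quantifier));
    }

    /** Names of the element types nested directly under the given parent, in declaration order. */
    List<String> childrenOf(String parentName) {
        List<String> result = new ArrayList<>();
        for (Item parent : items) {
            if (!parent.name.equals(parentName)) {
                continue;
            }
            for (Nesting n : nestings) {
                if (n.parentId == parent.id) {
                    // Ids start at 1, so id - 1 is the list index.
                    result.add(items.get(n.childId - 1).name);
                }
            }
        }
        return result;
    }

    /** Loads the declarations prices (book*) and book (title, source, price). */
    static DtdmSketch pricesDtd() {
        DtdmSketch model = new DtdmSketch();
        int prices = model.addItem("prices");
        int book = model.addItem("book");
        int title = model.addItem("title");
        int source = model.addItem("source");
        int price = model.addItem("price");
        model.nest(prices, book, "*");
        model.nest(book, title, "1");
        model.nest(book, source, "1");
        model.nest(book, price, "1");
        return model;
    }
}
```

Calling DtdmSketch.pricesDtd().childrenOf("book") lists title, source, and price, which is the kind of structural lookup a schema creator would need before it can infer tables and columns.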
The basic storage manager maintains XML documents with the help of three
modules: an importer, an exporter, and a synchronizer. The importer imports XML
documents compliant with a previously specified DTD into our system. The exporter will export the
relational data into XML documents. The synchronizer is used to keep the internal
relational representation and external XML representation consistent with each other
under data updates.
The restructure operator library stores a collection of restructuring operators for
optimization purposes. An optimizer takes as input a given XML query load, specified by a
database administrator (DBA), and the DTDMs, which model the current structure of
the relational database. It generates a mapping by applying the restructuring
operators provided by the restructure operator library. A mapping specifies a
sequence of restructuring operators to be applied to the different element
types defined in that DTD. Then, the restructuring manager transforms the
initially loaded data into the desired optimized format, which is then used for
efficient querying.
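The idea of a mapping as an ordered sequence of operators can be sketched as follows. This is only an illustration under stated assumptions: the class MappingSketch, the Operator interface, and the two rename operators are hypothetical, the schema is reduced to a map from table names to column names, and the sketch ignores the data movement that an actual restructuring manager has to perform. It also uses a more recent Java version than the Java 1.2 used in the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a mapping as an ordered list of operators; not Rainbow code. */
final class MappingSketch {

    /** A restructuring operator turns one schema (table name to column names) into another. */
    interface Operator {
        Map<String, List<String>> apply(Map<String, List<String>> schema);
    }

    /** Assumed operator: renames a table and keeps its columns. */
    static Operator renameTable(final String from, final String to) {
        return schema -> {
            Map<String, List<String>> out = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> e : schema.entrySet()) {
                out.put(e.getKey().equals(from) ? to : e.getKey(), e.getValue());
            }
            return out;
        };
    }

    /** Assumed operator: renames one column of one table. */
    static Operator renameColumn(final String table, final String from, final String to) {
        return schema -> {
            if (!schema.containsKey(table)) {
                return schema; // nothing to do for an unknown table
            }
            Map<String, List<String>> out = new LinkedHashMap<>(schema);
            List<String> columns = new ArrayList<>(schema.get(table));
            Collections.replaceAll(columns, from, to);
            out.put(table, columns);
            return out;
        };
    }

    /** Applies the operators of a mapping one after another, in the given order. */
    static Map<String, List<String>> applyMapping(Map<String, List<String>> schema,
                                                  List<Operator> mapping) {
        Map<String, List<String>> current = schema;
        for (Operator op : mapping) {
            current = op.apply(current);
        }
        return current;
    }

    /** Builds a one-table schema and applies a two-step mapping to it. */
    static Map<String, List<String>> demo() {
        Map<String, List<String>> schema = new LinkedHashMap<>();
        schema.put("book", Arrays.asList("title", "source", "price"));
        List<Operator> mapping = new ArrayList<>();
        mapping.add(renameTable("book", "publication"));
        mapping.add(renameColumn("publication", "source", "vendor"));
        return applyMapping(schema, mapping);
    }
}
```

The order of the operators matters in this sketch: the second operator refers to the table by the name that the first operator produced.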
The end user can issue XML queries through the XML Query Engine subsystem.
The Query Translator, based on the mapping provided by the Optimizer, will translate the
XML query into a sequence of SQL queries. Then the relational query engine of the
RDBMS will execute the SQL queries, and return the corresponding relational query
result. The query result translator will translate the query result back into the XML
model and return it to the end user.
The Rainbow architecture was partially implemented when the project team
started working on the development of its components. The DTD Manager and Basic
Storage Manager were capable of loading a single DTD and a single XML document.
The Basic Storage Manager, hereafter referred to as the XML Manager, did not have a
synchronizer process to maintain the integrity of the internal data. Instead, the
synchronizer, called Clock, was a separate component developed at WPI but not yet
integrated. A set of constituent operators had been researched and designed for the Restructuring
Operator Library [8], but no implementations of the Optimizer or Restructuring Manager
were in place. Lastly, the XML Query Engine remained at its conceptual stage and had
yet to be realized.
1.4 MQP Project Goals
The scope of this MQP is to continue the necessary development of the remaining
subsystems and to extend the existing ones. With the benefit of an
extended schedule, the project team was able to pursue the extensions of the DTD
Manager and XML Manager and the design and development of a prototype of the
Restructuring Manager. With the completion of these tasks, the Rainbow system is now
able to store multiple DTDs and XML documents, and to restructure the initial fixed
mapping of the XML data utilizing an administrator-specified mapping. To evaluate the
system we developed, the project team designed a test bed and experimental outline and
performed experimental studies on the working system.
1.5 Additional Team Goals
An additional goal of this MQP is for the project team members to learn and
develop competency with database technology, XML, SQL, and Java programming
for RDBMSs. With respect to developing the subsystems of Rainbow, the
team aimed to learn how to maintain and extend existing software, and how to engineer
a complete software system from the design phase through to experimental evaluation.
Reuse of previous code that went beyond simple extension in functionality
became essential both for the team’s quicker adaptation to several of the
technologies mentioned and for completing the development of several subsystems within
the time constraints of the project.
By the conclusion of this project, members of the team had not only developed the
software engineering skills necessary to succeed in the field of Computer Science, but
had also learned to understand team dynamics. The project members had to
work closely to ensure that the separate tasks led to the development of compatible parts
and to show progress to the project advisor. To guarantee deliverables in a timely
manner, the team learned the presentation and communication skills pertinent to a
manageable work schedule.
1.6 Outline of the Remaining Sections
This project report has the following structure. In the following section, we
describe the background technologies and tools needed to understand
the remaining sections of this report. Section 3 describes the metadata model for the
Rainbow architecture, how it is used to load XML data, and the extensions to the
existing subsystems, namely the XML Manager and DTD Manager. Section 4 details the
Restructuring subsystem, including what a restructuring
operator is and the list of operators that are implemented for the subsystem.
Section 5 presents implementation details of the system. Section 6 details the
experiments conducted to evaluate the Restructuring subsystem.
Finally, a summary and discussion of future work in Section 7 concludes the report.
2 Background
2.1 Readings
The team made extensive use of the following references: chapters two and nine
of Database Management Systems [1] by Ramakrishnan, and a significant number of
documents contributed by graduate students and the professor for the purpose of this
project, including “Metadata-Driven Approach to Integrating XML and Relational Data”
[2], “Clock: Synchronizing Internal Relational Storage with External XML Document”
[3], “Incremental Maintenance of Virtual XML Repository” [5], “ISP-EAR555: XML
Relational Management” [4], “A Performance Evaluation of Alternative Mapping
Schemes for Storing XML Data in a Relational Database” [6], and “DyDa: Dynamic
Data Warehousing” [7]. Since the Relational Database Management System (RDBMS)
that hosted the information for the team project was Oracle 8i running on a Microsoft Windows NT
Server PC, the team learned the skills essential to manipulate information and navigate
through the system. The project team acquired background knowledge in design and
programming techniques, including the use of Java and its Java documentation
standard, XML, RDBMSs, and SQL.
2.2 Basics of XML and DTDs
XML is a markup language that allows a document to contain structured
information. A markup language is a mechanism to identify structures in a document.
The XML specification defines a standard way to add markup to documents. The content
of these documents may include descriptions, pictures, headings, etc. XML documents
also hold information about each type of content. Similar to an HTML document, an
XML document contains tags that specify these types of content. In HTML documents,
both the tag semantics and the tag set are fixed. Even with efforts by industry to improve
the flexibility of HTML, any changes are strictly confined by what the browser
vendors have implemented and by the fact that backward compatibility is paramount.
XML, on the other hand, specifies neither semantics nor a tag set. While HTML
specifies how a document should be displayed, it does not describe what kind of
information the document contains. XML allows document authors to organize
information in a flexible way. In fact, XML is really a meta-language for describing
markup languages. In other words, XML provides a facility to define tags and the
structural relationships between them. Since there is no predefined tag set, there cannot
be any preconceived semantics. All of the semantics of an XML document are
defined either by the applications that process it or by style sheets.
Many applications of XML are Internet-related, but XML is in no way limited to
Internet use. In fact, XML's main strength is organizing information, which makes it well suited
for exchanging data between different systems, regardless of whether the Internet is part
of the picture.
To work with XML, a program called an XML parser is needed. The parser reads
an XML document and makes its structure available to an application; a browser or other
viewer can then display the document in a user-friendly way based on a style sheet. Both
Microsoft and Netscape are working to add XML parsing capabilities to their browsers.
XML can benefit e-commerce by enabling back-end systems to communicate
business transaction information in a known format. For example, business partners can
standardize on a specific XML syntax to describe purchase orders and can then
automate the transfer of that information across otherwise incompatible systems.
An example of XML is given in the following figure:
Description: Empty element with attributes
Example: <ELEMENT ATTR1="value" ATTR2="value"/>
Description: Element with content and end tag
Example: <ELEMENT>Element Content Here</ELEMENT>
Description: Parent element with attributes and child elements
Example:
<PARENT ATTR1="value">
  <CHILD1>
    Content
  </CHILD1>
  <CHILD2 ATTR1="value"/>
</PARENT>
Figure 2: Examples of XML Elements
The allowable contents of an element type are EMPTY, ANY, Mixed, or children
element types [16].
Allowable content: EMPTY
Definition: Refers to elements that are empty.
Allowable content: ANY
Definition: Refers to anything at all, as long as XML rules are followed. ANY is useful when you have yet to decide the allowable contents of the element.
Allowable content: Children elements
Definition: You can place any number of element types within another element type. These are called children elements, and the elements they are placed in are called parent elements.
Allowable content: Mixed content
Definition: Refers to a combination of (#PCDATA) and children elements. PCDATA stands for parsed character data, that is, text that is not markup. An element whose allowable content is only (#PCDATA) may therefore not contain any children.
Figure 3: XML Content Definitions
For simplification purposes, we assume that the XML documents that this
project handles receive their tag definitions through one standalone external DTD.
Therefore, the tags contained within each XML document are defined in a separate DTD.
A DTD holds definitions for tag elements, nesting relationships of these elements, as well
as attributes of these elements and other relations of the data types. To reiterate, DTDs
are defined by industry groups to specify the standard schema of XML documents in
order to facilitate their exchange. Therefore, limiting the project scope to XML documents
that are compliant with a DTD is reasonable.
An example of a DTD and an XML document [14]:
DTD:
<!ELEMENT prices (book*)>
<!ELEMENT book (title, source, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT price (#PCDATA)>
Compliant XML:
<prices>
  <book>
    <title>Advanced Programming in the Unix environment</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title>TCP/IP Illustrated</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
</prices>
Figure 4: XML/DTD Example Documents
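To show how the nesting of elements makes the data easy to extract, the following sketch reads the compliant XML document of Figure 4 with the DOM parser that ships with the Java platform and turns each book element into one row of text. The class name PricesReader and the row format are assumptions made for this example and are not taken from the Rainbow code; the parser is used without DTD validation, and the code targets a more recent Java version than the Java 1.2 used in the project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not part of the Rainbow code base. */
final class PricesReader {

    /** The compliant XML document of Figure 4. */
    static final String SAMPLE =
        "<prices>"
      + "<book><title>Advanced Programming in the Unix environment</title>"
      + "<source>www.amazon.com</source><price>65.95</price></book>"
      + "<book><title>TCP/IP Illustrated</title>"
      + "<source>www.amazon.com</source><price>65.95</price></book>"
      + "</prices>";

    /** Returns one "title | source | price" row per book element. */
    static List<String> bookRows(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList books = doc.getElementsByTagName("book");
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < books.getLength(); i++) {
                Element book = (Element) books.item(i);
                rows.add(text(book, "title") + " | "
                       + text(book, "source") + " | "
                       + text(book, "price"));
            }
            return rows;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Trimmed text of the first descendant element with the given tag name. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        for (String row : bookRows(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

Each row produced this way has the same three fields as the book declaration in the DTD, which hints at why a book element type lends itself to a relational table with one column per child element.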
2.3 Technologies
2.3.1 SQL, Relational Databases, and Oracle 8i
SQL is a query language that allows users to access data in an RDBMS (Ullman,
1997). Commercial RDBMS products from corporations such as Oracle, Sybase,
Informix, Microsoft, and others allow a user to describe the data of interest that the user
wishes to receive through support of standard SQL. SQL provides these services by
allowing users to define relations, manipulate relations, and query them. These relations
are simple tables that each have a schema, and may or may not be interconnected by
various constraints and keys to form an entire relational schema. The collection of the schemas
of all the relations of concern is referred to as a relational schema. The execution
of a SQL query against the relational database returns a relation whose
schema is specified by the query.
Most of our information about relational databases came from [DMS]. We will
give a brief overview of how to access and manipulate data in SQL. The main objective
of this overview is to show the effectiveness of using SQL against an RDBMS
to manage XML documents in a relational database.
In a relational database, data is stored in tables. The following table relates Social
Security Number, Name, and Address:
StudentAddressTable
SSN        FirstName  LastName  Address             City         State
124368537  John       Lee       100 Institute Road  Jackson      Nebraska
339152314  Tien       Vu        23 Grover Street    Louisville   Louisiana
452078093  Mirek      Cymer     19 Terrace Ave      Miami Beach  Florida
736192613  Jane       Doe       34 Main Street      New York     New York
Table 1: Student-Address Relation
To see the address of each student, one could use the following SELECT statement:
SELECT FirstName, LastName, Address, City, State FROM
StudentAddressTable;
Table 2 contains the result of the above query against the relation in Table 1.
FirstName  LastName  Address             City         State
John       Lee       100 Institute Road  Jackson      Nebraska
Tien       Vu        23 Grover Street    Louisville   Louisiana
Mirek      Cymer     19 Terrace Ave      Miami Beach  Florida
Jane       Doe       34 Main Street      New York     New York
Table 2: Relation Resulting from a Query Evaluation
Let us look at what just happened in detail. The query asked for every row of
the StudentAddressTable, restricted to the columns called FirstName, LastName,
Address, City, and State. Note that all query statements end with a semicolon and that
table names and column names do not contain spaces. The general template of a
SELECT statement that retrieves all of the rows in a table is:
SELECT ColumnName, ColumnName, ... FROM TableName;
To get all columns of a table without typing all column names, use * as in:
SELECT * FROM TableName;
The SELECT statement can be written in a great number of ways, giving wide
access to the data contained in the tables. SQL also supports conditions in a WHERE
clause (for example, selecting only rows whose values are greater or less than a certain
amount). More complex conditions may be built with the usual logical operators AND,
NOT, and OR. SQL uses the keyword DISTINCT to eliminate duplicate rows, so that each
combination of the selected values (name, address, number, etc.) appears only once in
the result. Nested queries, objects, joins of tables, and more advanced SQL syntax
provide functionality that goes beyond what is needed for the scope of this project.
2.3.2 JDBC and ResultSet Classes
Due to the complexity of this system, it had to be implemented in a high-level
programming language. The language had to be object-oriented so that existing classes
could be extended, and it needed to be able to make calls to databases quickly and
easily. The language we chose was Java 1.2 due to its flexibility and its extensive use of
object-oriented principles such as inheritance, encapsulation, and polymorphism (Horton,
1997). Another feature that was convenient for the Rainbow system was its ability to
make calls to databases quickly and easily through Java Database Connectivity
(JDBC). In addition, Java manages to avoid many of the difficulties that can be
experienced when using other programming languages (Horton, 1998). Lastly, it was
more convenient to use Java because all of the existing code that was included in the
Rainbow design had been written in Java.
To make future work with our code easier, Javadoc comments were used
extensively throughout the code. Javadoc comments are comments contained within the
code that document it in a structured way, for example by citing the author of a class
and listing the parameters of a method. This helped the team read code more easily and
more quickly and will prove valuable to the members of future projects concerning
Rainbow [7].
To establish a connection to a DBMS, a Java program uses the JDBC API, which
defines connection objects (Taylor, 1997). The connection object, once initialized with
proper login information for a DBMS, allows a Java program to execute queries or
update statements on the database. The Java SQL package (java.sql) also provides
ResultSet, which allows Java programs to traverse a returned relation. It bridges the
two languages, Java and SQL, to overcome the impedance mismatch between them: the
ResultSet essentially allows Java to handle, one tuple at a time, the set of tuples
returned by a DBMS. Together, JDBC and ResultSet provide all the data retrieval and
manipulation functionality the project team needed to interface Java processes
with a DBMS.
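The interplay of the connection object and ResultSet described above can be sketched roughly as follows. This is only an illustrative sketch written against the standard java.sql API in present-day Java syntax, not code from the Rainbow system; the class name StudentAddressQuery and its methods are hypothetical, and the table is the StudentAddressTable example from Section 2.3.1.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Rainbow code base. */
class StudentAddressQuery {

    /** Builds the SELECT statement; the '?' is a JDBC parameter placeholder. */
    static String selectByState() {
        return "SELECT FirstName, LastName, City FROM StudentAddressTable WHERE State = ?";
    }

    /**
     * Runs the query on an already opened connection and walks the returned
     * relation tuple by tuple through the ResultSet cursor.
     */
    static List<String> namesInState(Connection conn, String state) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(selectByState())) {
            stmt.setString(1, state);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) { // advance to the next tuple, if any
                    rows.add(rs.getString("FirstName") + " " + rs.getString("LastName")
                            + ", " + rs.getString("City"));
                }
            }
        }
        return rows;
    }
}
```

A caller would typically obtain the Connection from java.sql.DriverManager.getConnection(url, user, password), using the URL and login information of its own database; those values are deliberately left out here.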
2.4 Software Development
2.4.1 Object Oriented (OO) Design
Because the project team strove to continue the development of the Rainbow
system in accordance with its architectural design, the team's main objective in
understanding and developing the project was to establish a way of translating the
architecture presented by the previous work into an actual system design, in addition to
extending the established initial subsystems.
Once the project team had a firm understanding of the Rainbow system
architecture, the next phase in developing a new subsystem from scratch was
to put it into a concrete design using the Unified Modeling Language (UML). UML is a
common design language that consists of many different diagram types (such as class
diagrams, activity diagrams, and sequence diagrams). These diagrams serve as a type of
‘blueprint’ for the entire system, as each gives a different level and type of description of
the system. To utilize the benefits of UML, the team found that it was necessary to
become familiar with the software design tool Object Domain [15]. Utilizing Object
Domain, the project team designed class diagrams for the Restructure subsystem.
2.4.2 Software Migration
The Rainbow system itself is very extensive, containing a large number of classes
and a large amount of code. The established subsystems provided a great deal of
existing code for the implementation of Rainbow. The project team encountered both
difficulties and advantages in reusing the existing code base. The code base needed
to be examined to determine the portions that were suitable for reuse, the portions that
needed to be modified and enhanced, and the portions that had to be completely
re-implemented because they could not support an extension. To accomplish the
re-engineering of the previous code, the team also had to make use of various software
engineering skills obtained from courses, the most important being proper
documentation. The team documented the code it added and also documented any
reused code that had been undocumented or insufficiently documented, once that code
was understood.
3 DTD Metadata Management
This section presents the details of the original metadata model that enables
flexible mapping as proposed by Zhang et al. [14]. The system assumes that the
compliant XML documents share exactly one external DTD document, that this DTD file
contains no nested DTDs, and that the XML documents contain no internal DTD. The
data model focuses only on XML documents that meet these requirements.
Figure 5: Algorithm of Mapping DTD into Relational Schema
As shown in Figure 5, the system first stores the DTD into metadata tables. Then
it can optionally restructure the metadata tables. At the end it will generate the relational
schema from the metadata. The storing module identifies the characteristics of the DTD
and stores them as metadata. The restructuring module identifies the multi-valued
attributes of the DTD and also identifies the items that could be represented as attributes.
Lastly, mapping a DTD into a relational schema is achieved by applying mapping rules
over the metadata tables storing the DTD.
[Figure 5 components: DTD, Store, Metadata, Restructure, Generate Relational Schema]
This metadata approach includes a storing stage, a mapping stage, and an
optional restructuring stage. We show how restructuring the metadata in the
restructuring stage makes the approach flexible enough to produce various relational
schemas. The following subsections explain these stages in more detail along
with a working example of storing a DTD and loading the XML document.
3.1 Metadata Tables
Storing the DTD properties into relational tables makes it practical to use
relational query facilities to query the metadata. The metadata tables keep track of the
mappings to allow the system to automatically load the XML data into the generated
relational schema.
Let us focus on the details of this metadata-driven approach to managing XML
data, an approach that loads a DTD into DTDMs in a relational database as part of the
process of managing XML data. To capture all the necessary information in the DTD,
there are three DTDMs, one for each of the three identified types of pertinent
information: items, nesting, and attributes. The Item relation holds every element type
defined in the DTD as well as every grouping of elements; that is, an item represents an
element type or group in a DTD. The Nesting relation captures the relationships among
the elements defined in a DTD. Finally, the Attribute relation captures all the
attributes defined for the elements in the DTD; an attribute is a property of an item.
The following tables have been extracted from [3].
Tables 3 through 5 depict the schema of each of the three DTDMs.
Fields  Meaning
ID      Internal ID for items.
Name    Element Type or Group Name.
Type    Defines the type of this item from the domain: PCDATA, ELEMENT.ELEMENT,
        ELEMENT.EMPTY, ELEMENT.ANY, ELEMENT.MIX, and GROUP.
Table 3: Item DTDM (DTDM-Item table)
The Type field defines the type of an item, or rather the type of the element content
in an element type declaration. ELEMENT.ELEMENT represents element content.
ELEMENT.MIX represents mixed content. ELEMENT.EMPTY represents empty
content. ELEMENT.ANY represents ANY content. There are two additional item types:
PCDATA represents a PCDATA definition, and GROUP represents a group definition.
Fields    Meaning
ID        Internal ID of this nesting relationship.
FromID    ID of parent item of this nesting relationship.
ToID      ID of child item of this nesting relationship.
Ratio     Cardinality between the parent element and child element.
Optional  Used to indicate whether a child element is optional or not.
Index     The schema order of the child element.
Table 4: Nesting DTDM (DTDM-Nesting table)
The two fields FromID and ToID reference a parent item and a child item that
participate in a nesting relationship. The Index field captures the Schema Ordering
Property, denoting the position of this child item in the parent item's definition. If the
children belong to a sequence group, each child item has a different index value. If all
children belong to a choice group, all the index fields have the same value.
The occurrence property for a child element is captured by a combination of the
Ratio and Optional fields. The Ratio field shows the cardinality between the instances of
the parent item and of the child item. Since the nesting relationships are always from one
element type to its sub-elements in the DTD, there are only one-to-one or one-to-many
nesting relationships in the Ratio field. The Optional field has the value true or false
depending on whether this relationship is defined as optional in the DTD.
Fields   Meaning
ID       Internal ID for this attribute.
PID      ID of parent item.
Name     Name of this attribute.
Type     Type of this attribute, e.g., ID, IDREFS.
Default  A keyword or a default literal value of this attribute, e.g., #IMPLIED
Table 5: Attribute DTDM (DTDM-Attribute table)
To better understand how a DTD document is mapped into each of the described
DTDMs, let us revisit the DTD document example given in Figure 2.
DTD:
<!ELEMENT prices (book*)>
<!ELEMENT book (title, source, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT price (#PCDATA)>
This DTD document will be loaded into the three relations as shown in Table 6.
DTDM-Item
ID  Name    Type
1   prices  ELEMENT.ELEMENT
2   book    ELEMENT.ELEMENT
3   title   ELEMENT.MIX
4   source  ELEMENT.MIX
5   price   ELEMENT.MIX
6   PCDATA  PCDATA

DTDM-Nesting
ID  FromID  ToID  Ratio  Optional  Index
7   1       2     1:n    true      0
8   2       3     1:1    false     1
9   2       4     1:1    false     2
10  2       5     1:1    false     3
11  3       6     1:1    false     0
12  4       6     1:1    false     0
13  5       6     1:1    false     0

DTDM-Attribute
ID  PID  Name   Type    Default
14  6    value  PCDATA  #REQUIRED
Table 6: DTDMs for Figure 2’s DTD
The five elements, namely prices, book, title, source, and price, are stored as
tuples in the DTDM-Item relation. The relationships between these elements are stored
as tuples in the DTDM-Nesting relation. For example, the one-to-many relationship
between element prices and element book is recorded in the tuple with ID equal to 7 in
the DTDM-Nesting relation. Lastly, the attributes are stored in the DTDM-Attribute
relation. The three elements title, source, and price each contain PCDATA, so
their relationship with the PCDATA item is stored in the DTDM-Nesting tuples with IDs
11, 12, and 13. The PCDATA information is stored in the Name field of tuple 14 in the
DTDM-Attribute relation.
3.2 Data Schema
DTDMs provide meta information about the structure of an XML document in a
form that can be queried to generate a relational schema. That schema is used to load
the data of XML documents that are compliant with the DTD from which the DTDMs
were generated. The DTDMs are used to generate one relation for each item tuple in the
Item relation. Each generated relation stores the occurrences of the corresponding
item type from loaded XML documents. By default, these relations contain three
columns: an internal identification number (iid), a parent identification
number (pid), and an order number among sibling items. The iid field is used by the
system to identify a particular item when querying. The iid value is the primary key of
each item's relation, so each tuple represents one instance of the item and the iid must
be unique. The pid field references the iid field of the item the instance is nested
within. Finally, the order field identifies the instance's position among other items that
have the same pid number, that is, its sibling items. If a particular item has any
attributes, each attribute of that item becomes a column of the item's relation.
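The default layout described above can be illustrated with a small sketch that derives a CREATE TABLE statement for one item. This is not the Schema Creator's actual code: the class and method names are hypothetical, the column types (INTEGER, VARCHAR(255)) are assumptions, and the order column is called ord here only because ORDER is a reserved word in SQL.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of deriving one item relation from the metadata. */
class ItemSchemaSketch {

    /**
     * Builds the DDL for one item relation: the three default columns
     * (iid, pid, order) followed by one column per attribute of the item.
     */
    static String createTable(String itemName, List<String> attributeNames) {
        StringBuilder sql = new StringBuilder();
        sql.append("CREATE TABLE ").append(itemName).append(" (");
        sql.append("iid INTEGER PRIMARY KEY, "); // unique instance identifier
        sql.append("pid INTEGER, ");             // iid of the enclosing item instance
        sql.append("ord INTEGER");               // position among siblings (assumed name)
        for (String attribute : attributeNames) {
            sql.append(", ").append(attribute).append(" VARCHAR(255)"); // assumed type
        }
        sql.append(")");
        return sql.toString();
    }

    public static void main(String[] args) {
        // Items without attributes get only the three default columns.
        System.out.println(createTable("book", Collections.<String>emptyList()));
        // The PCDATA item carries the "value" attribute from Table 6.
        System.out.println(createTable("PCDATA", Arrays.asList("value")));
    }
}
```

For the running example, the PCDATA item would receive a fourth column for its value attribute, while prices, book, title, source, and price would receive only the three default columns.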
To illustrate the loading of an XML document using the DTDMs, let us continue
with the example from Figure 2.
Compliant XML:
<prices>
  <book>
    <title>Advanced Programming in the Unix environment</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title> TCP/IP Illustrated </title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
</prices>
With the DTD for this XML document already loaded as shown in Table 6, the relational
schema generated from these DTDMs, with the XML data loaded, is shown in Table 7.
prices
iid  pid  order
1    0    1

book
iid  pid  order
2    1    0
3    1    0

title
iid  pid  order
4    2    1
5    3    1

source
iid  pid  order
6    2    2
7    3    2

price
iid  pid  order
8    2    3
9    3    3

PCDATA
iid  pid  order  Value
10   4    1      Advanced Programming in the Unix environment
11   6    2      www.amazon.com
12   8    3      65.95
13   5    1      TCP/IP Illustrated
14   7    2      www.amazon.com
15   9    3      65.95
Table 7: Data Relations for Figure 2’s XML Document
The six relations in Table 7 capture all the information in the XML
document. Each element type has its own relation, and instances of
each element type in the XML document are stored as tuples of the
corresponding relation. Here, the one instance of the element prices contains two
instances of the element book. Each of the two book elements has one instance of each
of the elements title, source, and price. The values of the title, source, and price
elements, as mentioned earlier, are stored in the value field of the PCDATA relation.
3.3 DTD Manager and XML Manager Extensions
When this project started, the DTD Manager and XML Manager were the two
main pieces of functionality already implemented in the Rainbow system. Because
these two subsystems are responsible for loading and exporting DTDs and XML
documents, understanding and extending them was the first step of the implementation
phase.
Utilizing these resources, the team first designed an XML document and
DTD from scratch. Then, using the existing DTD Manager and XML Manager
subsystems, the team loaded the XML document and its DTD into relational tables
stored on the Oracle8i system. The team's initial task was to export a
single DTD from its database store back into the form of a DTD document. This first
task gave the project team the exposure needed to understand
what XML documents and their DTDs are and how the existing managers
work together. Having produced output in the form of a proper DTD, the team completed
its first task and gained a basic understanding of the inner workings of the existing
system.
3.3.1 Original DTD Manager and XML Manager
At the start of this project, the DTD Manager and XML Manager were already in
place. To recap the functions of these two subsystems: the DTD Manager is
responsible for importing DTD documents into, and exporting them out of, an
RDBMS, and the XML Manager handles the importing and exporting of XML
documents into and out of the RDBMS. Together, these two management subsystems
handle the loading of a single DTD and an XML document that is compliant with the
DTD.
The DTD Manager generates DTDMs from a loaded DTD, and the manager's
schema creator component takes the information from these DTDMs and generates XML
data schemas for the loading of XML documents. In turn, the XML Manager uses
a fixed mapping to store XML data into these XML data relations. Figure 6 shows the
XML Manager and DTD Manager subsystems with a separate process, Schema Creator.
In the actual implementation, the DTD Manager incorporates the Schema Creator
process's function of creating XML data schemas.
Figure 6: DTD Manager and XML Manager
[Figure 6 components: XML, XML Manager, DTD, DTD Manager, Schema Creator, XML Data; legend: Process, Data, Subsystem]
3.3.2 Support for Multiple DTDs and XMLs
A useful data management system requires support for multiple
documents. Hence, we needed to provide the ability to load multiple DTDs and XMLs
into the relational database, which allows operations to be performed on several
documents at one time. To support the loading of multiple DTDs and XMLs, an
additional column that stores the ID of the particular document must be added to the
existing DTDMs. Because the data relations that store instances of items from XML
documents have their schemas generated from these DTDMs, this generation must be
extended to add an ID field that identifies the XML document from which each item
instance originated.
An XML data relation for items has a column corresponding to the ID of the XML
document from which that item instance originated, and a DTDM has a field
corresponding to the ID of the DTD document from which that item originated. This ID
field becomes part of the key and is used for differentiating similar items from different
DTDs or XMLs. Two catalog relations store information regarding the DTDs and XMLs:
one storing DTD IDs and their corresponding URIs, and another storing XML IDs and
their corresponding URIs. These catalog relations allow the
database administrator to refer to the documents by URI without needing to use their
internal IDs.
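One possible shape for these extensions is sketched below. The relation names DTD_CATALOG and XML_CATALOG, the column names dtd_id, xml_id, and uri, and all column types are assumptions made for illustration, as is the ord column standing in for the order field; the source does not give the actual definitions.

```java
/** Hypothetical DDL for multi-document support; all names and types are assumed. */
class MultiDocumentSchemaSketch {

    /** Catalog relation mapping each DTD ID to the URI of its DTD document. */
    static String dtdCatalog() {
        return "CREATE TABLE DTD_CATALOG (dtd_id INTEGER PRIMARY KEY, uri VARCHAR(1024) NOT NULL)";
    }

    /** Catalog relation mapping each XML ID to the URI of its XML document. */
    static String xmlCatalog() {
        return "CREATE TABLE XML_CATALOG (xml_id INTEGER PRIMARY KEY, uri VARCHAR(1024) NOT NULL)";
    }

    /** Item relation extended with the document ID, which becomes part of the key. */
    static String itemTable(String itemName) {
        return "CREATE TABLE " + itemName
                + " (xml_id INTEGER, iid INTEGER, pid INTEGER, ord INTEGER,"
                + " PRIMARY KEY (xml_id, iid))";
    }

    /** Lookup that lets an administrator go from a URI to the internal ID. */
    static String lookupXmlId() {
        return "SELECT xml_id FROM XML_CATALOG WHERE uri = ?";
    }

    public static void main(String[] args) {
        System.out.println(dtdCatalog());
        System.out.println(xmlCatalog());
        System.out.println(itemTable("book"));
        System.out.println(lookupXmlId());
    }
}
```

With a composite key of this kind, two documents can each contain an item instance with the same iid without colliding.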
4 Flexible Mapping Support in the Rainbow System
The motivation and architecture of the Rainbow System have already been
discussed in Section 1. Here we present some highlights of the Rainbow System that
pertain to this project as a guideline for our project goals.
Highlights of the system include:
- Rainbow keeps track of DTD documents in the DTDM repository.
- Rainbow automatically generates the table schema from the DTDMs.
- Rainbow has a reversible restructuring feature built in, stored as restructuring
  operators in the restructure operator library.
Java is the language of choice for the implementation of the restructuring
subsystem since it is a platform-independent, object-oriented language.
Because the DTD and XML Managers were already written in Java, the project team
continued to use this language when extending their functionality to support the
loading of multiple documents. The existing subsystems, the DTD and XML Managers,
manage XML data using the commercial Oracle 8i relational database
management system (RDBMS). Oracle8i has many features for the
manipulation of relational tables. This RDBMS provides the services necessary for
querying its data using the standard Structured Query Language (SQL), often
pronounced ‘sequel’. Querying the database provides an efficient
means of accessing and modifying the data stored in it. Because of these advantages,
the team decided to continue using an Oracle8i RDBMS as the database
server.
Because the first two highlights are provided by the DTD Manager and XML Manager,
these were the first subsystems we worked on.
4.1 Restructuring Subsystem
4.1.1 The Restructuring Functionality
The Restructuring subsystem aims to provide services that allow the Rainbow
system to achieve one of its primary goals, query optimization. To achieve this goal,
a system must be in place that enables flexible mapping of the XML
data mapped by the XML Manager. This system performs various restructurings
on the initial fixed mapping to achieve flexible mapping. The
flexible mapping capability of the Rainbow system aims to decrease the query processing
time of specific query loads given by a database administrator. As shown in Figure 1 and
discussed in Section 1, an Optimizer process intelligently selects a set of restructuring
operations to perform on the mapped data, given this query load as input. Figure 7
captures the original components of the Rainbow system that work together to provide
the restructuring functionality.
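Figure 7: Restructure Function
[Figure 7 components: Basic Storage Manager, DTD Manager, Restructure, Optimizer, Restructure Operator Library, Query, Storage Mapping, DBA, XML Query Load, XML Data, Relational Model; legend: Subsystem, Process, Data]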
The possible restructuring operations come from the Restructuring Operator
Library. Later in this section, we describe in further detail the operators constituting
this library. As an overview, a restructuring operator fits into the restructuring
functionality by applying some query to the mapped data to generate a new mapping.
The Restructure process executes a series of restructuring operations to produce flexible
mappings from a fixed data map.
4.1.2 A Prototype Design
The first goal in designing the restructuring functionality was to
define a subsystem for restructuring. This restructuring subsystem has some of the core
functionality that the Rainbow restructuring function requires. The first requirement
identified was a Restructurer process that executes a restructuring mapping. The project
team designed a Restructurer class whose running process takes a mapping as input.
Such a mapping object contains a series of restructuring operations to be
performed on the XML data mapped by the XML Manager in conjunction with the DTD
Manager. The other input for the process is the Restructuring Operator Library. The
contents of this library are discussed later in this section. The library essentially
contains the SQL templates for manipulating the XML data mapped in the RDBMS.
The Restructurer process reads the required restructuring operations from the
mapping object and then calls the corresponding restructuring operators of the
Restructuring Operator Library to perform the necessary restructuring.
Figure 8 shows the Restructuring subsystem breakdown into its components.
Figure 8: Restructuring Subsystem
This Restructuring subsystem does not incorporate the Optimizer process that
takes as input a query load and intelligently generates a flexible mapping that best
optimizes the query performance for that load utilizing information from the mapping,
DTDMs, and the Restructuring Library. This subsystem is instead a simplified version
that assumes the administrator decides upon a good mapping for the XML data and then
calls the Restructurer process to perform restructuring with the mapping object as input.
[Figure 8 components: Mapping, Restructuring Operator Library, Restructurer; legend: Subsystem, Data, Process]
4.1.3 Implementation Details
The implementation of the Restructuring Subsystem follows the UML design that
was created first. Figure 9 shows the Restructuring Subsystem broken down into
classes in UML. Mapping is an object that holds all the restructuring operations. The
OperatorInterface class is a template for all operators to follow. All operators that
implement this OperatorInterface must provide code for the public method Execute().
The 11 operator classes in this figure correspond to the 11 operators that are defined later
in this section. Lastly, the Restructurer class contains a Java Vector container that it
initializes with the method readOperators(), given the operations specified by the
Mapping input file. Its public method runOperators() calls the method Execute() of
each operator in the Vector container.
Figure 9: Restructuring Subsystem Class Diagram
[Class diagram contents]
Mapping: Op1, Op2, ...
Restructurer:
    Vector operators (contains the list of operators)
    private ReadOperators(File inp) (reads operators from the input file and stores them in vector format)
    public runOperators() (matches each operator to the corresponding method and runs execute for that method with the appropriate arguments)
OperatorInterface: <virtual> Execute()
Operator: String OperatorName, String Parameters[ ]
Operator classes, each with public Execute(): RenameAttribute, PushDownAttribute,
    PushUpAttribute, RenameItem, Dereference, Reference, SplitNesting, MergeNesting,
    PushDownNesting, PushUpNesting, SwitchNesting
After breaking the subsystem down into its necessary components, namely
the Mapping object, the Restructuring Operator Library object, and the Restructurer
process, the team developed the Restructurer process first.
The Restructurer process has to read from the Mapping component, so the first of
its tasks is to parse an input file. This input file is essentially the Mapping component. It
contains a series of operators with specified arguments of type item, attribute, or nesting,
intelligently selected by a user to yield a mapping that may be beneficial for particular
kinds of queries. Once these operators are instantiated with the specified arguments, the
project team refers to them as operations. The Restructurer
process parses the series of operations, stores them locally, and instantiates the
corresponding operator classes of the Restructuring Operator Library. Once the entire
series of operations has been parsed and the individual operators of the library have
been instantiated, the Restructurer process calls these operators to execute one by one.
Executing an operator runs the instantiated query templates
of that operator, thereby changing both the DTDMs and the XML mapping.
The Restructuring Operator Library is a set of restructuring operator classes. The
library is implemented around an operator interface that describes the functionality
each operator must provide when called by the Restructurer process. Each operator
must contain a method for instantiation and a
method for execution of the instantiated SQL template. The SQL templates are defined
within the operator classes and are described in detail later in this section.
Once the templates are instantiated, they are stored in the local process space of the
running operator class. When the operator processes are called by the
Restructurer process, they execute the instantiated SQL templates, which by then are
complete SQL statements, to perform the restructuring. The execution of a series of
these operator processes generates the mapping that was specified by the user.
To illustrate how the classes in Figure 9 work together, consider an example.
If Mapping contains the operation “fooOperator(arg1, arg2, arg3)”, the Restructurer class
adds an instance of fooOperator with the arguments arg1, arg2, and arg3 to the Vector
container when its method readOperators() is called. When the
method runOperators() is called, the Restructurer class calls the method Execute() of
each object in the Vector container. In this example, the only object is an instance
of fooOperator, and calling its method Execute() runs the code inside the
fooOperator class. The code in the fooOperator class uses SQL queries that perform the
actual updates to the DTDMs and the restructuring of the XML data that is mapped.
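The control flow of this example can be sketched as follows. The sketch is a simplified stand-in rather than the project's code: it uses a List where the report describes a Vector, lower-case method names, a mapping supplied as a list of lines instead of a file, and a hypothetical LoggingOperator that records each call instead of issuing SQL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Contract every operator follows (Execute() in Figure 9). */
interface OperatorInterface {
    void execute();
}

/** Hypothetical stand-in operator: records its call instead of running SQL. */
class LoggingOperator implements OperatorInterface {
    private final String name;
    private final String[] arguments;
    private final List<String> log;

    LoggingOperator(String name, String[] arguments, List<String> log) {
        this.name = name;
        this.arguments = arguments;
        this.log = log;
    }

    @Override
    public void execute() {
        log.add(name + Arrays.toString(arguments));
    }
}

/** Simplified Restructurer: parses operations, then executes them in order. */
class RestructurerSketch {
    private final List<OperatorInterface> operators = new ArrayList<>();
    private final List<String> log = new ArrayList<>();

    /** Parses lines of the form "operatorName(arg1, arg2, ...)". */
    void readOperators(List<String> mappingLines) {
        for (String line : mappingLines) {
            String text = line.trim();
            if (text.isEmpty()) {
                continue; // skip blank lines in the mapping
            }
            int open = text.indexOf('(');
            int close = text.lastIndexOf(')');
            if (open < 0 || close < open) {
                throw new IllegalArgumentException("Malformed operation: " + line);
            }
            String name = text.substring(0, open).trim();
            String inner = text.substring(open + 1, close).trim();
            String[] arguments = inner.isEmpty() ? new String[0] : inner.split("\\s*,\\s*");
            operators.add(new LoggingOperator(name, arguments, log));
        }
    }

    /** Calls execute() on every stored operator, in mapping order. */
    void runOperators() {
        for (OperatorInterface operator : operators) {
            operator.execute();
        }
    }

    List<String> log() {
        return log;
    }
}
```

In a fuller version, readOperators() would look up the operator class that matches each name, for example PushUpAttribute, instead of always creating the logging stand-in.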
4.2 Restructuring Operators
To support the restructuring functionality that allows the Rainbow System to achieve
flexible mapping, we have developed a set of restructuring operators implemented using
relational views. The restructuring operators restructure the relational data set into
another relational format optimized for query evaluation.
So far, there are 11 restructuring operators defined in the Restructuring Operator
Library. The library stores a collection of reversible restructuring
operators for optimization purposes. Reversible means that the restructuring operators
keep track of the changes, so that it is easy to restore the original data. An optimizer
takes as input a given XML query load specified by a database administrator and the
DTDM tables, which model the current structure of the relational database. It generates
a mapping by applying the restructuring operators provided by the restructuring
operator library. A mapping specifies a sequence of restructuring operators to be applied
to the different element types defined in that DTD. Then, the restructuring manager
actually transforms the initially loaded data into the desired optimized format. The latter
is then used for efficient querying [8].
Reversible restructuring operators include Rename Item, Rename Attribute,
Pushup Attribute, Pushdown Attribute, Pushup Nesting, Pushdown Nesting, Switch
Nesting, Merge Nesting, Split Nesting, Reference, and Dereference. Each operator is
composed of two parts, the DTDM transformation and corresponding relational data
transformation.
Next, we will explain the pushup attribute operator in more detail as an example
to illustrate the general concept of an operator.
Figure 10: Pushup Attribute Operator
[Figure 10 content]
DTD modifications (In, Out): attribute x moves from element B to element A.
Data changes:
    CREATE VIEW out.$A AS
    SELECT p.<all_columns>, c.$x
    FROM in.$A p, in.$B c
    WHERE c.pid = p.iid

    CREATE VIEW out.$B AS
    SELECT <all-columns-but-x>
    FROM in.$B
On the left of Figure 10, the pushup attribute operator pushes attribute X of
element B up to element A as attribute X. The change made to the DTD metadata is that
attribute X's pid (parent ID) field changes from the iid (item ID) of element B to the
iid of element A.
In addition to the changes made to the DTD schema, the operator uses two queries
to restructure the XML data, as depicted on the right of Figure 10. The first query creates
a view on top of the relation corresponding to element A, inserting attribute X as a new
field. The second query creates a view on top of the relation corresponding to element B
that projects every field except the field for attribute X. The remaining operators are
described in the following subsections.
4.2.1 Pushup and Pushdown Attribute Operators
The pushup attribute operator pushes an attribute up from a child item to its
parent item; conversely, the pushdown attribute operator pushes an attribute down from
an item to its child item.
Figure 11: Pushup and Pushdown Attribute
Here are the SQL templates:

pushUpAttribute (ChildItemName, ChildAttributeName, ParentItemName, ParentAttributeName)
    CREATE VIEW <new.ParentItemName> AS
    SELECT p.<all-columns>, c.<ChildAttributeName> AS <ParentAttributeName>
    FROM <old.ChildItemName> c, <old.ParentItemName> p
    WHERE c.pid = p.iid

    CREATE VIEW <new.ChildItemName> AS
    SELECT <all-columns-but-ChildAttributeName>
    FROM <old.ChildItemName>

pushDownAttribute (ParentItemName, ParentAttributeName, ChildItemName, ChildAttributeName)
    CREATE VIEW <new.ParentItemName> AS
    SELECT <all-columns-but-ParentAttributeName>
    FROM <old.ParentItemName>

    CREATE VIEW <new.ChildItemName> AS
    SELECT c.<all-columns>, <ParentAttributeName> AS <ChildAttributeName>
    FROM <old.ParentItemName> p, <old.ChildItemName> c
    WHERE p.iid = c.pid
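
The pushUpAttribute template can be tried out on a toy database. The following is only a minimal sketch, not the Rainbow implementation: it uses Python with an in-memory SQLite database rather than Java and Oracle8i, the prefixes old_ and new_ stand in for the old and new schemas, and the relations A and B, their columns, and the helper push_up_attribute are hypothetical names chosen for illustration.

```python
import sqlite3


def push_up_attribute(conn, child, child_attr, parent, parent_attr):
    """Create the two views of the pushUpAttribute template (illustrative helper)."""
    # Every column of the child relation except the attribute being moved.
    kept = [row[1] for row in conn.execute(f"PRAGMA table_info(old_{child})")
            if row[1] != child_attr]
    # View 1: the parent relation plus the attribute taken from its child.
    conn.execute(
        f"CREATE VIEW new_{parent} AS "
        f"SELECT p.*, c.{child_attr} AS {parent_attr} "
        f"FROM old_{child} c, old_{parent} p WHERE c.pid = p.iid")
    # View 2: the child relation without the moved attribute.
    conn.execute(
        f"CREATE VIEW new_{child} AS SELECT {', '.join(kept)} FROM old_{child}")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical relations: element A is the parent of element B.
    CREATE TABLE old_A (iid INTEGER PRIMARY KEY, pid INTEGER);
    CREATE TABLE old_B (iid INTEGER PRIMARY KEY, pid INTEGER, x TEXT);
    INSERT INTO old_A VALUES (1, 0), (2, 0);
    INSERT INTO old_B VALUES (10, 1, 'red'), (11, 2, 'blue');
""")
push_up_attribute(conn, "B", "x", "A", "x")

parent_rows = conn.execute("SELECT iid, pid, x FROM new_A ORDER BY iid").fetchall()
new_b_columns = [d[0] for d in conn.execute("SELECT * FROM new_B").description]
```

After the call, new_A carries attribute x next to its own columns, and new_B no longer exposes it. The sketch assumes each A has exactly one B child; with several children per parent, the join would produce one row per child.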
4.2.2 Rename Item and Attribute Operators
The rename item and rename attribute operators rename an item and an attribute,
respectively. They can easily be implemented using the DTDM primitives. Here are the
SQL templates:

renameItem (OldItemName, NewItemName)
    CREATE VIEW <new.NewItemName> AS
    SELECT *
    FROM <old.OldItemName>;

renameAttribute (ParentItemName, OldAttributeName, NewAttributeName)
    CREATE VIEW <new.ParentItemName> AS
    SELECT <OldAttributeName> AS <NewAttributeName>, <rest-of-columns>
    FROM <old.ParentItemName>;
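
As a small illustration of the two rename templates, the sketch below instantiates them in an in-memory SQLite database from Python. The relation book, its columns, and the new names publication and name are made up for the example, and the old_/new_ prefixes again stand in for the old and new schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source relation for an item called "book".
    CREATE TABLE old_book (iid INTEGER PRIMARY KEY, pid INTEGER, title TEXT);
    INSERT INTO old_book VALUES (1, 0, 'XML Basics');

    -- renameItem(book, publication): expose the same rows under a new name.
    CREATE VIEW new_publication AS SELECT * FROM old_book;

    -- renameAttribute(book, title, name): alias one column, keep the rest.
    CREATE VIEW new_book AS SELECT title AS name, iid, pid FROM old_book;
""")

renamed_item_rows = conn.execute("SELECT * FROM new_publication").fetchall()
renamed_columns = [d[0] for d in conn.execute("SELECT * FROM new_book").description]
```

Neither view copies data; both only change the name under which the existing relation or column is visible.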
4.2.3 Pushup and Pushdown Nesting Operators
The pushup nesting operator moves an item that is nested under a child item up one
level, so that it becomes a sibling of its former parent. Conversely, the pushdown
nesting operator moves an item down so that it becomes a child of one of its former
sibling items.
Figure 12: Pushup and Pushdown Nesting
Here are the SQL templates:

pushUpNesting (MovedItemName, FromPosition, ChildItemName, ParentPosition, ParentItemName, ToPosition)
Without considering the position, this would correspond to the query given below:
    CREATE VIEW <new.MovedItemName> AS
    SELECT m.<all-columns-but-pid>, c.pid
    FROM <old.MovedItemName> m, <old.ChildItemName> c, <old.ParentItemName> p
    WHERE m.pid = c.iid AND c.pid = p.iid

pushDownNesting (MovedItemName, FromPosition, ChildItemName, ParentPosition, ChildItemName, ToPosition)
Without considering the position, this would correspond to the query given below:
    CREATE VIEW <new.MovedItemName> AS
    SELECT m.<all-columns-but-pid>, c.iid AS pid
    FROM <old.MovedItemName> m, <old.ChildItemName> c, <old.ParentItemName> p
    WHERE m.pid = p.iid AND c.pid = p.iid
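
To show the effect of the pushUpNesting query (ignoring positions, as the template does), here is a minimal sketch in Python with an in-memory SQLite database. The relations P, C, and M and their contents are hypothetical: M starts out nested under C, which is nested under P.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical nesting before restructuring: P contains C, C contains M.
    CREATE TABLE old_P (iid INTEGER PRIMARY KEY, pid INTEGER);
    CREATE TABLE old_C (iid INTEGER PRIMARY KEY, pid INTEGER);
    CREATE TABLE old_M (iid INTEGER PRIMARY KEY, pid INTEGER, val TEXT);
    INSERT INTO old_P VALUES (1, 0);
    INSERT INTO old_C VALUES (10, 1);
    INSERT INTO old_M VALUES (100, 10, 'a'), (101, 10, 'b');

    -- pushUpNesting: M keeps its columns, but its parent id becomes C's parent id.
    CREATE VIEW new_M AS
        SELECT m.iid, m.val, c.pid AS pid
        FROM old_M m, old_C c, old_P p
        WHERE m.pid = c.iid AND c.pid = p.iid;
""")

moved_rows = conn.execute("SELECT iid, val, pid FROM new_M ORDER BY iid").fetchall()
```

In the view, both M rows now point at P (iid 1) as their parent, which makes M a sibling of C.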
4.2.4 Other Operators
Due to time constraints, we were not able to implement the Switch Nesting, Merge
Nesting, Split Nesting, Reference, and Dereference operators. Switch Nesting was
partially implemented but needs further modification and improvement. Switch Nesting
switches two nesting relationships within the same parent. Merge Nesting merges the
nestings of two items. Split Nesting splits the nesting between two items. Reference
breaks a nesting relationship between two items by assigning an ID attribute to the child
item and adding an IDREF(s) attribute to the parent item, which together are used to
represent that nesting relationship. Dereference creates a nesting relationship between
the items that have the ID and IDREF(s) attributes, respectively [8].
4.3 Rainbow Graphical User Interface
The Rainbow Interface allows the administrator to restructure a DTD
and its loaded XML from within a GUI environment by giving access to the functions of
the Rainbow System. The GUI environment eliminates the chore of having to manually
run classes of the Rainbow System. In other words, it gives the administrator a more
convenient way of selecting XML documents for loading, specifying parameters for the
operators, and viewing the tables contained in the database at any time (before or after
the restructuring).
Let us examine the sequence of steps one would take to do a simple restructuring.
The first step, which must be taken before anything else can be done, is to establish a
connection with the Oracle database. Then, an XML document has to be imported into
the database so that it can be restructured. Any imported document can be viewed in a
table format. To do the restructuring, the administrator selects a sequence of
operators and gives each a set of parameters. Once the restructuring is done, the
administrator can choose to export the modified data back into a DTD file on the
administrator's local computer.
The following screenshots of the interface give the main idea of its appearance.
(To switch between the various tabs of the Work Window, the administrator only has
to click on the tab corresponding to the appropriate window.) Screenshot 1 (Figure 13)
is the main window of the Rainbow Interface. Its menu bar contains options for importing
and exporting documents, establishing connections, entering manual queries into the
database, etc. Screenshot 2 (Figure 14) shows the Work Window with
the DB Tab selected. The main purpose of this window is to give the administrator
information about what kind of data is currently in the database. It displays all the tables
in the database and the data of each table. Screenshot 3 (Figure 15) shows the Work Window
with the Operators Tab selected. In this window the administrator does the restructuring
by selecting the desired operators and inputting the appropriate arguments. The main
window lists all the tables the user requested.
The left column shows the names of the tables. The right column shows (in
order) the ID# of the item, the item name, the item type, and the item DTD id.
Figure 13: Screenshot 1
Main window message field.
The administrator is entering a query manually.
Figure 14: Screenshot 2
The administrator selects which table to view.
The data of the selected table appears here.
Figure 15: Screenshot 3
The administrator selects an operator.
An argument is selected and a value is entered.
All the selected operators appear here.
5 Implementation Details
5.1 System Architecture
Before the start of this MQP, the DTD and XML Managers were
implemented to handle only one XML/DTD pair. The project team modified and
extended these modules to support multiple XMLs and their DTDs. The team also designed
and implemented the Restructuring Subsystem. The remaining component of the
architecture shown in Figure 16, the XML Query Engine, is not within the scope of this
project and has not yet been implemented.
Figure 16: Rainbow Architecture with RDBMS
[Figure 16 labels: User, XML Query, XML, DTD, XML Query Engine, XML Manager, DTD Manager, Restructuring Subsystem, RDBMS. Legend: XML Data, Subsystem.]
5.2 Code Facts
The completed Rainbow system totals 44 classes, 17 of which were coded
from scratch by the Rainbow MQP team. In addition to these 17 new classes,
the Rainbow System takes advantage of existing code, much of which was extended to
support new functionality. Eight classes are preexisting and unchanged, and
nineteen are preexisting but extended. Pie charts of the class facts can be seen in Figure
17.
Figure 17: Statistics of the Class Implementation
5.3 Existing System Packages
The implementation of the Rainbow System is contained in 8 packages. The
DTDMObjects package contains classes that encapsulate the DTDMs into objects with
methods for accessing and modifying the data of each of the DTDM relations. The
exportDTD package contains the classes that provide the functionality of exporting a
DTD from the database. The JDBCClient package contains classes that encapsulate
database connections into easy-to-understand objects used by every class that needs
a connection to the database. The MetadataDrivenLoader package contains a class that
allows for the generation of unique identifying numbers for relations in a database. The
Operators package contains the operator interface class and all the restructuring operator
classes. The Restructuring package contains the class that encapsulates the XML Catalog
relation in objects for easy access and modification. It also contains the Restructurer
class. The StoreDTD package contains the classes that generate the DTDM schema and
load multiple DTDs into a database. The XMLRDBMSUpdate package
contains the classes that generate the XML data schema and load multiple
XMLs into a database. Two other packages, namely DTDWrapper and Utils, were used
to facilitate implementation in general.
5.4 Implementation Environment
All class extensions and implementations were programmed in Java 1.2 using
JDK 1.2.2 running on a Digital UNIX 64 terminal on the WPI LAN. The database server
is a PC, PII 300MHz with 256 MB memory, running Microsoft NT Server with Oracle8i
software. The GUI was developed in Visual Café on a PII 400MHz with 128 MB
memory, running Windows 98. It was tested on a Windows NT system and compiled and ran
successfully under various Java development environments (Visual Café, JDeveloper, etc.).
6 Experimental Evaluation
The purpose of the experiments is twofold: first, to evaluate the performance of
loading and restructuring XML data and their DTDs, and second, to evaluate the
performance of queries evaluated against fixed-mapping and restructured data. In
evaluating the outcome of the experiments, one must consider the overhead associated
with loading the data and with building the internal representation of the data in the
RDBMS. When we speak of restructuring data, we refer to one restructuring operator or a
set of restructuring operators applied in sequence. The motivation for using such a set
of restructuring operators is the expectation that it will reduce query time.
This section is divided into two major parts: evaluation of restructuring time
and evaluation of query processing time. These are the two areas
in which we hope our experiments will yield satisfactory
conclusions.
6.1 Experimental Setup
6.1.1 Scope and Design of a Test Plan
The proposal of the Rainbow system is a product of analyses done by many
graduate students and Professor Elke A. Rundensteiner. Our main focus when
designing the test plans was determined by what the system had to achieve: update
propagation capacity and query evaluation optimization.
6.1.2 Designing an Experimental Test Bed
After cycles of testing and debugging had shown the system to be complete and
functional, our goal was to design an evaluation system. The evaluation
system should not only yield conclusive data that outlines the benefits and limitations of
the system in terms of performance versus overhead under varying scenarios; it must also
be reliable. The evaluation system was set up so that it either tolerates un-factored
influences, such as outside processes taking up processor time, or
avoids these unexpected costs. It ran each experiment five
times to reduce the effect of un-factored influences that may distort a particular timing,
such as another CPU-heavy process executing at some point
during the test.
Even with these precautions taken in the design of the evaluation system,
experiments had to be performed under the same conditions. By this we mean that a
dedicated client and server had to be selected, that this pair of machines had to be
used for all experiments, and that the machines could not be reconfigured or modified
in any significant way. The project team chose a PC, Pentium 233MHz with 128 MB
memory, running Microsoft NT Workstation as the database client and a PC, PII
300MHz with 256 MB memory, running Microsoft NT Server with Oracle8i as the
database server. The network between the client and the server PCs remained unchanged
throughout the experiments.
6.2 Performance Considerations
As described in the introductory portion of this section, the performance of
restructuring data for query efficiency can be evaluated by considering two types of
actions, namely restructuring and querying. Each experiment outlined in this report
measures one of the two actions. The following describes in further detail what it means
to evaluate each type of action.
6.2.1 Restructuring Time
Restructuring time includes the loading of data and the restructuring
applied to the data. First we measured the performance of loading a set of documents and
then we measured the performance of applying a set of restructuring operations to the
loaded data.
Two different methods of applying a set of restructuring operations were used
to evaluate the performance of restructuring the data: series (single) restructuring and
batch restructuring. Series restructuring runs one operation on a set of data at a
time. Batch restructuring runs a set of operations on a set of data all at once.
Since this project includes a restructuring component that executes all restructuring
operations, the difference lies in the input given to this component: for series
restructuring, a single operation is provided repeatedly, once for each operation, whereas
for batch restructuring a list of operations is provided. With respect to
Oracle8i, the difference between series and batch restructuring is when the views created
by the restructuring operations are materialized. For series restructuring,
materialization is performed after each operation;
for batch restructuring, materialization is performed after every set of operations.
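
The difference in when materialization happens can be sketched as follows. This is an illustration under stated assumptions rather than the Restructurer's actual code: it uses Python with in-memory SQLite instead of Oracle8i, models every operation as a trivial rename-style view, and treats "CREATE TABLE ... AS SELECT" as the materialization step. The names fresh_db, restructure, t0, v1, and so on are hypothetical.

```python
import sqlite3


def fresh_db():
    """Return a connection holding one small base relation t0."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t0 (iid INTEGER PRIMARY KEY, val TEXT);
        INSERT INTO t0 VALUES (1, 'a'), (2, 'b');
    """)
    return conn


def restructure(conn, n_ops, batch):
    """Apply n_ops view-creating operations; return (final relation, materializations)."""
    source, materializations = "t0", 0
    for i in range(1, n_ops + 1):
        view = f"v{i}"
        conn.execute(f"CREATE VIEW {view} AS SELECT * FROM {source}")
        if batch:
            # Batch: the next operator is defined directly on top of this view.
            source = view
        else:
            # Series: materialize the view before the next operator runs.
            table = f"t{i}"
            conn.execute(f"CREATE TABLE {table} AS SELECT * FROM {view}")
            materializations += 1
            source = table
    if batch:
        # Batch: a single materialization after the whole set of operations.
        conn.execute(f"CREATE TABLE t_final AS SELECT * FROM {source}")
        materializations += 1
        source = "t_final"
    return source, materializations


series_conn, batch_conn = fresh_db(), fresh_db()
series_result = restructure(series_conn, 3, batch=False)
batch_result = restructure(batch_conn, 3, batch=True)
```

With three operations, the series run materializes three times and the batch run once, while both end with the same rows in their final table.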
6.2.2 Query Time
To evaluate the performance of query processing, we measured the time it took
for a set of queries to evaluate on the data before and after restructuring. The
measurement was taken for each query, not for the set of queries as a whole. Each query
thereby yields a query-performance time for a set of data. All queries were designed by
the project team and were therefore not randomly generated or selected from some list.
6.3 Cost Factors
Numerous factors can influence the performance evaluation of the whole concept
of restructuring data for query efficiency.

Parameter    Description
OP#          Number of operations
OP-TYPE      Type of operator
DAT-SIZE     Data size
QY#          Number of queries
DU#          Number of data updates
Table 8: Parameters of Restructuring Evaluations
6.4 Experimental Data
The DTD designed by the project team for the experiment is depicted in Figure
18.
<!ELEMENT one (two+)>
<!ELEMENT two (three)>
<!ELEMENT three (four)>
<!ELEMENT four (five)>
<!ELEMENT five (six)>
<!ELEMENT six (seven)>
<!ELEMENT seven EMPTY>
<!ATTLIST seven attribute #REQUIRED>
Figure 18: Experiment DTD
The project team designed this DTD to yield deep nesting levels for the
experiments. The attribute embedded in the seventh level provides
attribute information that can be queried. XML documents were then randomly
generated from this DTD using IBM’s XML-generator [17]. With the data in place,
the evaluations described in the following sections could be carried out.
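
For readers who want to reproduce data of the same shape, the sketch below builds a document that follows the structure of the experiment DTD. It is a stand-in written in Python with the standard ElementTree module, not IBM's XML-generator; the function name generate_document, the seed, and the range of the random attribute values are assumptions made for illustration.

```python
import random
import xml.etree.ElementTree as ET

# Element names below the root, in nesting order, as in the experiment DTD.
LEVELS = ["two", "three", "four", "five", "six", "seven"]


def generate_document(num_branches, seed=0):
    """Build <one> with num_branches <two> children, each nested down to <seven>."""
    if num_branches < 1:
        raise ValueError("the DTD requires at least one <two> element")
    rng = random.Random(seed)
    root = ET.Element("one")
    for _ in range(num_branches):
        node = root
        for name in LEVELS:
            node = ET.SubElement(node, name)
        # The innermost element <seven> carries the required attribute.
        node.set("attribute", str(rng.randint(0, 9999)))
    return root


doc = generate_document(3)
xml_text = ET.tostring(doc, encoding="unicode")
```

Increasing num_branches scales the data size while keeping the nesting depth fixed at seven levels.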
6.5 Evaluations of Restructuring Setup Time
We performed each experiment five times and averaged the results. The
motivation, data models, and analysis methods for each experiment are discussed in
their respective sections. Note that these experiments are not necessarily mutually
exclusive in their variable settings.
For simplicity, the three experiments discussed in this section do not involve
any data updates; the number of data updates is fixed at 0. Update propagation was
ignored for this set of preliminary experiments.
6.5.1 Experiment 1: Scalability of Increase in Operations
In this experiment, we aim to evaluate the overhead associated with restructuring
for a varying number of operations.
Nine tests were conducted, varying the number of restructuring operations in each
test. Each test also evaluated the performance of the two restructuring methods: for each
test, the fixed set of operations was applied first in series and then in batch.
We aimed to characterize the relation between restructuring
overhead and the number of restructuring operations. Additionally, we aimed to
characterize the performance difference between batch restructuring and
series restructuring in Oracle8i.
The result of this experiment is a plot of the number of renameItem
operations (one of the most important operators) versus the average processing time in
seconds for both batch and series restructuring. For batch, the processing time is the
processing time of the whole set of operations; for series, it is the sum of the
processing times of the individual operations.
To account for the un-factored influences that may obscure the results, as
mentioned earlier in this section, each point on the graph corresponds to
five identical runs, with the greatest and smallest timings discarded and
the remaining three averaged.
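
The averaging rule described above amounts to a trimmed mean. A minimal sketch in Python follows; the function name and the sample timings are hypothetical and are not measurements from our experiments.

```python
def trimmed_average(timings):
    """Average the runs after discarding the single greatest and smallest timing."""
    if len(timings) < 3:
        raise ValueError("need at least three runs")
    ordered = sorted(timings)
    kept = ordered[1:-1]  # drop the smallest and the greatest value
    return sum(kept) / len(kept)


# Hypothetical timings in seconds for five identical runs (not measured data).
sample_runs = [12.0, 9.5, 10.0, 30.0, 10.5]
plot_point = trimmed_average(sample_runs)
```

In this made-up sample, the outlier of 30.0 seconds, which could come from another process competing for the CPU, does not affect the plotted value.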
The fixed-parameter settings for both batch and series restructuring:
OP-TYPE: renameItem
DAT-SIZE: 104KB
QY#: 0
Expectations
1. Since there is an overhead associated with restructuring beyond the direct
modifications of the DTDMs and the materialization of the XML data views
generated by an operation, batch restructuring should take less processing
time than series restructuring. Each operation evaluated in series
accumulates its own overhead.
2. We can expect the processing time to increase linearly as the number of
restructuring operations increases for both batch and series restructuring.
The graph in Figure 19 shows that, although both series and batch restructuring
exhibit linear growth, batch restructuring incurred less
overhead.
Figure 19: Batch versus Serial Restructuring
This result is expected because batch restructuring requires only one
materialization of the views created by a set of operations: materialization occurs
once per mapping. The average processing time is mostly taken
up by the materialization of the views created by the restructuring operations in the
database.
Batch restructuring is used for the remaining experiments.
6.5.2 Experiment 2: Operation Scalability
Since any particular type of operation may be performed many times with
different parameters, evaluating the batch restructuring of many
operations, all of the same operator type, gives some idea of each operator type’s
cost. To get a better grasp of the actual overhead costs, we materialize only
after the set of restructuring operations.
An operator type was tested with an increasing number of operations, using batch
restructuring. The performance of the operator type was determined by measuring the
performance of a batch of operations of that operator type. The
processing time for each set of operations is the sum of the processing times
of its operations. We tested an operator type starting with one operation and
added one operation at a time until we reached a batch restructuring of
six operations.
This experiment yields a graph plotting the number of pushUpAttribute operations
versus the average processing time in seconds for restructuring with a set of that
many operations. The processing time is that of the batch of
operations of the specific operator type, whose size is indicated on the x-axis.
Again, to account for the un-factored influences that may obscure the
results, as mentioned earlier in this section, each point on the graph corresponds to
five identical runs, with the greatest and smallest timings
discarded and the remaining three averaged.
The fixed-parameter settings:
OP-TYPE: pushUpAttribute
DAT-SIZE: 22MB
QY#: 0
Expectations
1. We hope to be able to conclude that the cost of an operator type observes
linear growth over the number of operations of its type.
The Operation Scalability experiment yielded the following results for the
pushUpAttribute operator as depicted in Figure 20.
Figure 20: Restructuring Overhead Results
We had hoped that the restructuring overhead would increase linearly with
the number of restructuring operations. The results in Figure 20, however,
suggest that the overhead cost grew along an exponential or polynomial curve
rather than a linear one. Much of the overhead cost came from the materialization of the
views generated by the series of operations. What we should keep in mind is that
restructuring the XML data yields better query performance, as one can see
in the following query time evaluation. The more queries performed on the restructured
data, the greater the benefits of restructuring become.
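
One way to make this trade-off concrete is a break-even estimate: if restructuring costs R seconds once and reduces the time of each query from t_before to t_after, it pays off once the number of queries N satisfies N * (t_before - t_after) >= R. The sketch below computes that N in Python; the function name and all numbers are hypothetical and are not taken from our measurements.

```python
import math


def break_even_queries(restructuring_cost, query_time_before, query_time_after):
    """Smallest number of queries after which the one-time restructuring cost is recovered."""
    saving_per_query = query_time_before - query_time_after
    if saving_per_query <= 0:
        return None  # restructuring never pays off if queries do not get faster
    return math.ceil(restructuring_cost / saving_per_query)


# Hypothetical values in seconds, chosen only to illustrate the calculation.
example = break_even_queries(restructuring_cost=120.0,
                             query_time_before=2.0,
                             query_time_after=1.5)
```

With these made-up values, each query saves 0.5 seconds, so the 120-second restructuring cost is recovered after 240 queries.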
6.6 Query Time Evaluations for Restructured Schema
The experiment in this section evaluated query performance. The queries used for
evaluation in this section are performed on materialized restructured data.
6.6.1 Experiment 3: Query Performance
This experiment is concerned with the optimization of query performance. The
motivation for this experiment is the general assumption that pushing XML
information up with respect to nesting yields better query evaluation time, because it
reduces the number of joins necessary to find the data.
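
The join reduction can be seen on a shortened, three-level version of the experiment schema. The following is a sketch in Python with in-memory SQLite, not the experimental code: the table names follow the DTD element names, the data is invented, and two_up is a hypothetical name for the materialized result of one pushUpAttribute operation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Shortened schema: one contains two, two contains three, three has the attribute.
    CREATE TABLE one   (iid INTEGER PRIMARY KEY);
    CREATE TABLE two   (iid INTEGER PRIMARY KEY, pid INTEGER);
    CREATE TABLE three (iid INTEGER PRIMARY KEY, pid INTEGER, attribute TEXT);
    INSERT INTO one VALUES (1);
    INSERT INTO two VALUES (10, 1), (11, 1);
    INSERT INTO three VALUES (100, 10, 'x'), (101, 11, 'y');

    -- One pushup, materialized: the attribute now lives on "two".
    CREATE TABLE two_up AS
        SELECT t.iid, t.pid, h.attribute FROM two t, three h WHERE h.pid = t.iid;
""")

# Before restructuring: two joins are needed to reach the attribute from "one".
before = conn.execute(
    "SELECT o.iid, h.attribute FROM one o, two t, three h "
    "WHERE t.pid = o.iid AND h.pid = t.iid ORDER BY h.attribute").fetchall()

# After one pushup: a single join returns the same answer.
after = conn.execute(
    "SELECT o.iid, t.attribute FROM one o, two_up t "
    "WHERE t.pid = o.iid ORDER BY t.attribute").fetchall()
```

Each further pushup removes one more join from the query, which is the effect the experiment measures.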
To evaluate query performance, we applied restructuring operations of the operator
type pushUpAttribute and then measured query performance over a fixed data set,
varying only the number of operations performed on it. The information we tried to
retrieve was the value of a nested attribute, which required joins. The evaluation
ranged from one to six operations and, as discussed, the queries ran on actual tables, not
on non-materialized views.
This experiment yields a graph plotting the number of pushUpAttribute
operations performed versus the average processing time of the join query needed to
retrieve the attribute data as described. The processing time is that of the
query after the number of operations indicated on the x-axis has been applied.
Each point on the graph corresponds to five identical
runs, with the greatest and smallest timings discarded and the remaining three averaged.
The queries are designed to query specifically for the restructured data.
The fixed-parameter settings:
OP-TYPE: pushUpAttribute
DAT-SIZE: 22MB
QY#: 1
QY-TYPE: join
Expectations:
1. As more operations are performed on the data, we should observe a linear
increase in query performance for queries requiring joins.
This experiment yielded the necessary results to evaluate query optimization
provided by the Restructuring subsystem.
Figure 21: Join Query Performance Results
The query results lead to the conclusion that increasing the number of
pushUpAttribute operations performed on the data leads to a linear decrease in
query evaluation time. The greater the number of pushups performed on
the attribute queried, the smaller the number of joins necessary to evaluate the join query.
The linear increase in performance confirms the hypothesis of this experiment and also
presents some preliminary support for flexible mapping.
6.7 Analyses
Given our time constraints, we were not able to evaluate different types of
queries in conjunction with the other operators of the Restructuring Operator Library.
However, these results are preliminary findings that begin to justify the
need for flexible mapping, which is the ultimate goal of this project. At the cost of
some restructuring overhead, an intelligent mapping yields restructured data that
allows for faster query evaluation. This benefit motivates
further evaluations, possibly the development of new operators, and most
importantly, the Optimizer module that is based on a query load from a DBA.
7 Conclusions
7.1 Summary of the Rainbow Project
The Rainbow project itself started with the theories and ideas expressed in Zhang
et al.’s technical reports. The project team started this project by reading and analyzing
these research documents, and understanding what subsystems were in place, what had to
be extended, and what subsystems had to be more thoroughly designed and
developed. Because the project team started with a partially implemented system, the
first task after understanding the existing code was to extend its capability to reflect
more of Rainbow’s architecture.
The project team had to revise some of the existing code in order to ensure
compatibility with the extensions needed to fully develop the DTD and XML manager
subsystems. Once the existing subsystems were properly extended, the next phase
for the team was to thoroughly design a new subsystem called the Restructuring
Subsystem. After learning how to design the classes of
the Restructuring Subsystem using Object Domain, the members laid out the subsystem
components in UML. The team divided the implementation of this subsystem so that
each member had separate tasks to complete in order to meet the required schedule
for completion. All implementations were tested and debugged in the implementation
phase by the individuals working on the particular components.
One of the team members focused on the implementation of the Restructurer
component of the Restructuring Subsystem. This member was also
responsible for any modifications and cleanups necessary for the existing and extended
subsystems to be easily utilized by the team. The team had to use the existing
subsystems to set up the environment necessary for the integration of the Restructuring
Subsystem, since it required XML data and a DTD to be loaded and an input for mapping
data.
The remaining members of the team focused on the implementation of each of the
restructuring operators that make up the Restructuring Operator Library component. In
implementing these operator classes, these members learned to reuse and modify code
that a graduate student had developed in order to access the DTDMs more efficiently.
In parallel with the implementation phase, the team worked on this project
report and on the outlined experiments. The final phase of this MQP was for
the members to conclude the preliminary experimental evaluations, create the
presentation for the Rainbow project, and finalize this project report.
During the implementation phase we stayed in very close contact with the primary
author of the Rainbow technical report (Zhang, 1999), Xin Zhang. As design changes
were made, or if guidance was needed, the Rainbow MQP team would consult Xin for
assistance and keep him updated on the progress of the project in general [7].
Once the implementation phase concluded, the integration phase began. The
integration phase started with importing the source tree into Visual Café and setting
up an environment that allowed for database client/server connections for
experimentation.
Lastly, the team learned the logistics of setting up the experimental test bed,
implemented the experimental code, and conducted each experiment.
7.2 Experience Gained and Lessons Learned
By the conclusion of the Rainbow project, we had learned many different concepts and
practices. The skills that we developed include object-oriented design, UML, the Java
language, database and SQL concepts, software engineering experience, and teamwork
skills. The following sections discuss these concepts.
7.2.1 Object-Oriented Design
The first task given to the group was to read and comprehend the technical report
written by Zhang and others. Understanding these documents was
necessary in order to begin work with the Rainbow system. After reviewing and
analyzing the report, ideas were discussed and decisions were made about how to extend
and design parts of the Rainbow system. It was a practical and rewarding experience to
assist in turning complex technical reports explaining algorithms and modules into
a large, organized, and well-documented system [7].
7.2.2 UML
Creating the design of the Rainbow system through the use of the Unified
Modeling Language was also a practical learning experience. Software engineering
knowledge had to be reviewed, new concepts had to be researched, and an understanding
of the state-of-the-art software, Object Domain, had to be achieved. Using UML helped
deliver a better understanding of how different modules and classes are represented, the
order of processes and events, and how objects cooperate over time. UML is becoming a
very popular tool in the software engineering industry, and our exposure to it will be
beneficial.
7.2.3 The Java Programming Language
Before starting the Rainbow MQP, the team members had limited Java
knowledge. After completing the implementation, integration, and evaluation, all team
members acquired a great deal of Java knowledge and a better understanding of object-
oriented programming. During the project, all team members dealt with concepts
pertaining to inheritance, polymorphism, encapsulation, abstract classes, and how to code
in a visual environment [7].
7.2.4 XML
The project members learned XML at the start of the project to understand what
types of information are contained in XML documents as well as how they can be
mapped into an RDBMS. Because XML is a markup language that is popular for web-based
applications, it is an important technology to learn and to understand.
7.2.5 Database Management Systems
All three members of the Rainbow MQP team had to acquire knowledge of SQL
as well as the Oracle database platform. Additionally, the background knowledge learned
in the introductory databases course taught at WPI helped in dealing with breaking down
queries and having SQL commands embedded into Java code using JDBC connections.
Mostly, the experience with Rainbow helped give a more thorough understanding of the
SQL language and all of its components.
7.2.6 Software Engineering Experience
From taking the software engineering undergraduate class at WPI (CS 3733), all
three members of the Rainbow MQP team had knowledge of the software engineering
process. Some of the software engineering concepts, such as the reuse and integration
of existing code, pertained to this project and were helpful throughout its evolution. The
Rainbow project went through the stages of requirements, design, implementation,
integration, testing, evaluation, and analysis. The Rainbow MQP members gained
hands-on experience in developing a large software system.
7.2.7 Designing the Test Plan
To learn the particulars of what aspects are involved with a test plan, the team
reviewed the DyDa project team’s experimental outline as an example [7]. Our project
team followed that general template as a guide when designing the test plan.
A test plan should begin with an introduction stating what categories the testing aims
to evaluate, for example one-time setup costs versus continuous run-time costs. The test
plan should follow with a comprehensive listing of the cost factors. The cost factors
make up the variables and constants for each experiment, but they identify only the
constituting factors and not the specifics, such as how many there are or whether a
particular factor is observed within a particular experiment. Next, we decided upon each
individual experiment that composed the test plan.
Each experiment stated the hypothesis of the experimenter, described what data
the experiment gathered, and stated whether the results would be presented in the form of
tables or graphs. Having identified what data was to be collected, the cost factor settings
were listed for a better understanding of the limits of what the gathered data could show.
The experimental outline followed with a list of expected conclusions about what one is
likely to see. Following the individual experimental outlines is a conclusion for the test
plan that reflects which observations were conclusive with respect to each evaluated
category.
7.2.8 Working as a Team
A fundamental goal of our project was to distribute the work among the project
team members to complete the project in a timely fashion. Much of the early work was
completed by the team as a whole, because the members lacked prior
knowledge of a majority of the technologies required for the project. The project team
then made a transition from joint work to collaboration on separate tasks to make
better progress and to utilize the strengths of the individuals. Even when the members
worked individually, the team communicated frequently via email and scheduled
meetings to ensure the integrity and consistency of individual progress and to avoid
redundant work. Though some tasks were assigned to one member, all project members
would provide assistance and contribute to the successful completion of such tasks.
Additionally, the MQP team supported a leader who was responsible for the
successful and punctual completion of tasks, beyond recording and updating notes and
websites.
Good teamwork skills are necessary when working on a large project like the
Rainbow system. By working together for several months, the Rainbow MQP team developed
the skills needed to collaborate effectively. The work was split up evenly
and meetings were held weekly. Research and design were done as a team. The final
stages of evaluation and report writing were again split up among the members, but final
modifications were made by the group.
7.3 Future Work
The conclusion of this MQP has left many opportunities for future studies. The
challenging Optimizer process has yet to be developed. With its development and
integration, the goal of query optimization could be realized. Please refer to the
research documents from this project’s website for further details regarding the
Optimizer process.
The XML query engine component has also yet to be developed. Its development would
bring a friendlier interface for users who are concerned only with the XML
technology. The database administrator would provide XML queries and receive results
in the form of XML documents. The details of the various subsystems could be hidden,
and even the interface with the Rainbow subsystems would be abstracted away: the only
user interface points would be the XML queries provided to the XML query engine and the
query loads given to the Optimizer process.
The set of restructuring operators is not exhaustive. To realize all potential
benefits of flexible mapping, further development, evaluation, and analysis should be
performed on the components related to the restructuring of data. The scope of
this MQP was substantial, and the time constraint did not permit a comprehensive study
of the various operators.
Appendixes
Readme for System Environment Setup and Demo
README
===============================================================================
AUTHOR: Tien & John
-------
REQUIREMENTS
--------------
- JDK 1.2 or higher
- Database Server should be running MS Windows NT Server w/Oracle 7 or higher
- username, password, and URI for the database (example: shiba.dsrg)
INSTALLATION
--------------
- This is only a description for Windows NT; for other platforms, use your own judgment.
- Unzip the compressed archive dtdm-dtd-project.zip into a new directory
- This new directory, say 'root', will contain the following directories
  - \src contains .java
  - \data contains .xml
  - \classes contains .class
  - \doc contains .javadoc
  - \lib contains .jar and .zip libraries
- classpath=.;%CD%\lib\jdbc\classes12.zip;%CD%\lib\jdbc\jdbcodbc.zip;%CD%\lib\xml4j.jar;%CD%\lib\xerces.jar;%CD%\classes
  * %CD% is the new directory 'root'.
START POINT
--------------
- Go to the 'root' directory
- You can directly type in: java Demo DBURI <username> <userpassword> to run the system.
  For example: java Demo jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL foo foo
  Or do the following three steps.
  a. First add two entries for your database in the source file
     edu/wpi/cs/DSRG/xmldb/JDBCClient/JDBCClient.java
     - add entry 1: final static String YOURDB_URL = "<your db's URI>";
       example A: final static String SHIBA_URL = "jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL";
       NOTE: insert entry 1 immediately after the example A
     - add entry 2: else if (uri.toUpperCase().equals("YOURDB")) return JDBCClient.YOURDB_URL;
       example B: else if (uri.toUpperCase().equals("SHIBA")) return JDBCClient.SHIBA_URL;
       NOTE: insert entry 2 immediately after the example B
  b. Compile: javac -g -d /classes <.java file here> on all .java files in this directory and its subdirectories
     - example1: javac -g -d /classes edu/wpi/cs/DSRG/xmldb/storeDTD/StoreDTD.java
     - example2: javac -g -d /classes Demo.java
  c. Run: java Demo YOURDB <username> <userpassword>
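The alias lookup described in step a can be sketched as follows. This is a minimal, hypothetical illustration and not the actual JDBCClient source: the class name DbAliasSketch, the method resolve, and the pass-through behavior for full URIs are assumptions. Only the SHIBA URI and the upper-case comparison come from the steps above. A map is used here in place of the if/else chain so that registering a database is a single entry.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the actual JDBCClient class from the Rainbow sources.
class DbAliasSketch {
    // Alias -> JDBC URI. The SHIBA entry is the example given in the README;
    // YOURDB is a placeholder that must be replaced with a real URI.
    private static final Map<String, String> ALIASES = new LinkedHashMap<>();
    static {
        ALIASES.put("SHIBA", "jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL");
        ALIASES.put("YOURDB", "<your db's URI>");
    }

    // Returns the URI registered for an alias (compared case-insensitively).
    // Assumption: anything that is not a known alias is treated as a full
    // URI and returned unchanged, mirroring the "java Demo DBURI ..." form.
    static String resolve(String uriOrAlias) {
        String known = ALIASES.get(uriOrAlias.toUpperCase(Locale.ROOT));
        return known != null ? known : uriOrAlias;
    }

    public static void main(String[] args) {
        System.out.println(resolve("shiba"));
        System.out.println(resolve("jdbc:oracle:thin:@host:1521:SID"));
    }
}
```

With such a lookup, "java Demo SHIBA <username> <userpassword>" and the full-URI form would reach the same database.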
RESULT
-------
- The Demo process will load /data/book.xml and /data/book.dtd into your database
- The following relations should now exist on your account on your database:
+ ALL_DTDS_DTDM_Item contains every element of the dtd
+ ALL_DTDS_DTDM_Nesting contains every nesting relationship between the elements
+ ALL_DTDS_DTDM_Attribute contains every attribute for any of the elements
+ ALL_DTDS_DTD_ID_Mapping contains the dtd URI and its internal id
+ UNIQUEID contains some uniqueid used by the UniqueID.class
+ XML_CATALOG contains the XML URI and its internal id
+ DATAVIEW_CATALOG contains the current view for each individual element's relation
+ other relations that store the XML's individual elements
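One way to confirm that the Demo run succeeded is to compare the tables in the account against the catalog relations listed above. The sketch below is a hypothetical helper, not part of the Rainbow sources: the class and method names are assumptions, and the list of actual table names is assumed to have been obtained separately (for example from Oracle's USER_TABLES view). Names are compared case-insensitively because Oracle stores unquoted identifiers in upper case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the Rainbow sources.
class ExpectedRelationsSketch {
    // Catalog relations named in the RESULT section of the README.
    static final List<String> EXPECTED = Arrays.asList(
        "ALL_DTDS_DTDM_Item", "ALL_DTDS_DTDM_Nesting", "ALL_DTDS_DTDM_Attribute",
        "ALL_DTDS_DTD_ID_Mapping", "UNIQUEID", "XML_CATALOG", "DATAVIEW_CATALOG");

    // Returns the expected relations that do not appear in actualTables,
    // ignoring case, in the order they are listed in EXPECTED.
    static List<String> missing(Collection<String> actualTables) {
        Set<String> present = new HashSet<>();
        for (String table : actualTables) {
            present.add(table.toUpperCase(Locale.ROOT));
        }
        List<String> result = new ArrayList<>();
        for (String expected : EXPECTED) {
            if (!present.contains(expected.toUpperCase(Locale.ROOT))) {
                result.add(expected);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example input only; a real check would pass the names read from the database.
        System.out.println(missing(Arrays.asList("ALL_DTDS_DTDM_ITEM", "UNIQUEID")));
    }
}
```

An empty result means all seven catalog relations are present; the per-element relations depend on the loaded XML and are not checked here.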
DEMO GUI
===============================================================================
AUTHOR: Mirek
-------
RUNNING INSTRUCTIONS:
---------------------
The GUI is started by running "java Rainbow".
OPERATION INSTRUCTIONS:
----------------------------
1. Rainbow Interface
1.1. General Structure
Once the interface is run, a main window pops up. This window is
composed of a main menu and a text box which displays messages to
the administrator.
1.2. Establishing a Connection
In order for any interaction to occur with the database, a connection
must first be established. An administrator selects the "System" option
from the main window menu and clicks on "Connect". A connect window pops
up with three text fields. The database path is entered into the first
field, the user name into the second field, and the user password into
the third. When all information is entered, the administrator clicks
on the "Connect" button and, if successful, a connection is established
with Oracle.
1.3. Sending Manual Queries to the Database
The administrator may send an SQL query to the database by selecting
"System" from the main window menu and clicking "Manual". A window pops
up with one text field. Once the administrator enters the query string
into the field and clicks on the "Send" button, the query is sent to
Oracle and any output received is echoed in the main window text box.
1.4. Importing XML documents
In order to import an XML document into the database, the administrator
selects "Import" from the main window menu. An "open file" window pops
up which allows the administrator to select the XML file.
1.5. Exporting a DTD
In order to export a DTD document from the database and save it as a
file, the administrator selects "Export" from the main window menu. A
"Save file" window pops up which allows the administrator to select
the name and path of the DTD file to create.
1.6. Using the Work Window
1.6.1. The work window is initially invisible. In order for it to become
visible, the administrator must select "Window" from the main window
menu and click "WorkWindow". The work window contains three tabs (each
tab is a separate sub-window). The first tab (DB) brings up the
database data, the second tab (DTD/XML) is not yet implemented and is
intended to display the DTD and XML structure, and the third tab
(Operators) is for restructuring.
1.6.2. Viewing Tables
When the administrator clicks on "Get Table List" on the first
tab (DB), the list of tables contained in the database is displayed.
When the administrator clicks on one of the table names, the data of
that particular table is displayed in the adjacent "Table Data" text box.
1.6.3. Restructuring
In order for a restructuring to be done, the administrator must first
select which operators to run and give the parameters for each of the
operators. The third tab (Operators) contains three text boxes. The
first box is a list of all available operators. The administrator must
first select one of the operators. Once selected, it appears in
the second box. This process may be repeated for as many operators as
are intended to be run. Clicking on a selected operator in the second
box causes the list of parameters for that operator to appear in the
third box. In order to enter a value for one of these parameters, the
administrator must click on that parameter and enter its value in the
text field. Once the operators are ready to be run, the "Run" button is
clicked. The operators then run sequentially and perform the
restructuring.
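The select, parameterize, and run flow of section 1.6.3 can be sketched as a simple queue. This is a hypothetical illustration and not the Rainbow GUI code: the class OperatorQueueSketch, its nested Selected type, and the parameter names used in main are all assumptions. The run method only records what would be executed; a real implementation would apply each operator to the database at that point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are taken from the Rainbow sources.
class OperatorQueueSketch {
    // One operator chosen from the list, with the parameter values entered for it.
    static final class Selected {
        final String operator;
        final Map<String, String> params = new LinkedHashMap<>();

        Selected(String operator) {
            this.operator = operator;
        }

        // Records one parameter value and returns this object for chaining.
        Selected with(String name, String value) {
            params.put(name, value);
            return this;
        }
    }

    // Operators in the order they were selected (the "second box").
    private final List<Selected> queue = new ArrayList<>();

    void select(Selected s) {
        queue.add(s);
    }

    // Visits the queued operators in selection order and returns one log
    // entry per operator; no database work is done in this sketch.
    List<String> run() {
        List<String> log = new ArrayList<>();
        for (Selected s : queue) {
            log.add(s.operator + s.params);
        }
        return log;
    }

    public static void main(String[] args) {
        OperatorQueueSketch q = new OperatorQueueSketch();
        // Operator and parameter names below are illustrative only.
        q.select(new Selected("RenameItem").with("oldName", "author").with("newName", "writer"));
        q.select(new Selected("PushupAttribute").with("attribute", "year"));
        for (String line : q.run()) {
            System.out.println(line);
        }
    }
}
```

Keeping the queue ordered matters because a later operator may depend on the schema produced by an earlier one.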
ADDITIONAL NOTES:
---------------------
The Rainbow system has been tested to successfully compile and run on a PC
running Windows NT 5.01 as well as under Windows 98. The Java development
environments used were JDeveloper by Oracle and Visual Cafe.
FIND OUT MORE:
--------------
- Please look at the javadocs for the source files, in particular the one for src\Demo.java
TELL US ABOUT IT:
---------------------
- If you have any questions or comments, drop us an email at [email protected], noting that it relates to 'Rainbow Project 2000-2001'.