Form Methods Syst Des
DOI 10.1007/s10703-010-0099-4

Theorem prover approach to semistructured data design

Scott Uk-Jin Lee · Gillian Dobbie · Jing Sun · Lindsay Groves

© Springer Science+Business Media, LLC 2010

Abstract The wide adoption of semistructured data has created a growing need for effective ways to ensure the correctness of its organization. One effective way to achieve this goal is through formal specification and automated verification. This paper presents a theorem proving approach towards verifying that a particular design or organization of semistructured data is correct. We formally specify the semantics of the Object Relationship Attribute data model for Semistructured Data (ORA-SS) modeling notation and its correctness criteria for semistructured data normalization using the Prototype Verification System (PVS). The result is that effective verification on semistructured data models and their normalization can be carried out using the PVS theorem prover.

Keywords Formal specification · Automated verification · Semistructured data modeling · Schema normalization · Theorem proving · ORA-SS · PVS

S.U.-J. Lee
CEA, LIST, Laboratory of Model-driven Engineering for Embedded Systems, Point Courrier 94, 91191 Gif sur Yvette, France
e-mail: [email protected]

G. Dobbie · J. Sun
Department of Computer Science, The University of Auckland, Private Bag 92019, Auckland, New Zealand
e-mail: [email protected]

G. Dobbie
e-mail: [email protected]

L. Groves
School of Engineering and Computer Science, Victoria University of Wellington, P.O. Box 600, Wellington, New Zealand
e-mail: [email protected]




1 Introduction

Semistructured data [8] does not have a rigidly defined structure. This property allows data with disparate structure to be integrated effectively. Typical drivers for semistructured data have been worldwide companies and the web, where the usage of semistructured data has increased enormously [4, 46]. For example, semistructured data can be used in a worldwide company to pull together disparate data stored in different repositories for company-wide reports, analysis and decision making. The usage of semistructured data for such applications has increased extensively as vast numbers of international businesses, corporations, and organizations are using semistructured data languages to represent their data, and individual offices are organizing their data to suit local practices. The usage of semistructured data has also increased dramatically through the rapid development of the World Wide Web (WWW) and its technologies, where large amounts of information on the web cannot be easily organized into a structured format. The eXtensible Markup Language (XML), a common representation of semistructured data [7, 21, 23], has been adopted as a new standard to complement the existing Hyper Text Markup Language (HTML) based web pages for data description and exchange over the Internet [1, 30]. In addition, semistructured data provides a means of exchange for the irregularly structured information of web applications and web services. The increase in semistructured data usage is not limited to data integration and web related applications, but expands into various other domains such as digital libraries, biological databases, and multimedia data management systems [45, 51]. In addition, semistructured data has also been used in model transformations between different formal and informal design models [48]. With such a rapid increase in its usage and its availability over the web, semistructured data needs to be modeled, stored, manipulated, and managed properly for effective utilization. For these purposes, various designs and developments of database systems that store and manage large amounts of semistructured data have been proposed [16, 24]. As a result, several database systems have already been developed for XML [10, 31], while traditional database companies, such as Oracle, have provided XML support for their existing database systems [15, 49].

Maintaining the consistency of stored data is very important in any database system. There are constraints on data in the real world, and these constraints must be preserved over time. Typically, in relational databases, the constraints or meaning of the data is initially captured in Entity Relationship (ER) diagrams or similar, used in database design, and when the database is designed some of the real world constraints can be enforced through the schema [13]. The intended meaning of the data can be lost or corrupted during schema design or data population, if the schema does not conform to the real world constraints or the populated data instances do not conform to their schema. Furthermore, it is critical to ensure that no meaning is lost when various algorithms and operations for data manipulation and management, especially those that transform the schema, are applied.

Similar to relational database systems, invalid schema and data instances commonly cause data inconsistencies in XML database systems. In fact the problem is worse because there is a higher likelihood of designing invalid schemas and populating invalid data instances when integrating data with different formats or representing heterogeneous information on the web. Another common cause for data inconsistencies in XML database systems is redundant data [29]. Data redundancies can create various insertion, update, and deletion anomalies. These often introduce inconsistencies in a database system when an insertion, update, or deletion is not applied to every repeated data instance [19, 22]. Adequate validation of the schema and data instances is essential to prevent possible data inconsistencies. However, the database systems currently proposed for semistructured data lack effective reasoning support to validate schemas and data instances.


Fig. 1 The Department-Course-Student XML data instance with redundant data

Consider the data instance examples in Fig. 1, which represent simple XML documents consisting of information about a computer science department in a university, the courses offered by the department, and the students who are enrolled in the courses. These examples have redundant data that could lead to data inconsistency. The XML document in Fig. 1(a) has two instances of data redundancy: (1) the information about the course coordinator who organizes all the courses in the department is repeated for every course, as shown in lines 4 and 18; (2) the information about students (name and netID) is repeated for every course that the student is enrolled in, as shown in lines 5 to 14 and lines 19 to 28. Such redundant data can introduce data inconsistencies when data is inserted, updated, or deleted. For example, Fig. 1(b) shows the XML document that results from attempting to perform each of the following database operations: deleting Mary Brown followed by inserting Linda Jones as the new course coordinator; updating the netID of John Smith; and deleting all the information about Jane Anderson. In the resulting XML data instance, there are three data inconsistencies caused by the anomalies: (1) the department's course coordinator is not consistent for all courses offered by the department, as shown in lines 4 and 18; (2) the netID of John Smith is not consistent across the repeated data, as shown in lines 7 and 21; (3) the data representing information about Jane Anderson still exists, as shown in lines 10 to 14 and lines 24 to 28. These inconsistencies in the database corrupt the meaning of the data. According to the above example, students enrolled in the CS101 course may contact Mary Brown for course related matters even after she has resigned from the course coordinator position. Similarly, John Smith may not be able to access the CS101 course web pages with his new netID. In addition, an invoice for tuition fees could be sent to Jane Anderson even after she has graduated.
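The update anomaly described above can be illustrated with a small sketch. It is not taken from the paper: the dictionary layout, the second course code CS102, and the netID values are hypothetical stand-ins for the redundant XML of Fig. 1, and the check only detects one kind of inconsistency (the same student name carrying different netIDs).

```python
# Hypothetical, simplified stand-in for the redundant layout of Fig. 1(a):
# every course repeats the coordinator and the full details of each student.
courses = {
    "CS101": {"coordinator": "Mary Brown",
              "students": [{"name": "John Smith", "netID": "jsmi001"}]},
    "CS102": {"coordinator": "Mary Brown",
              "students": [{"name": "John Smith", "netID": "jsmi001"}]},
}


def update_net_id(data, course_code, name, new_id):
    """Update the netID in ONE course only, mimicking a partial update."""
    for student in data[course_code]["students"]:
        if student["name"] == name:
            student["netID"] = new_id


def inconsistent_students(data):
    """Return the names that appear with more than one netID across courses."""
    seen = {}
    for course in data.values():
        for student in course["students"]:
            seen.setdefault(student["name"], set()).add(student["netID"])
    return sorted(name for name, ids in seen.items() if len(ids) > 1)


# The update reaches only the CS102 copy, so the repeated data now disagrees.
update_net_id(courses, "CS102", "John Smith", "jsmi002")
print(inconsistent_students(courses))
```

Because the student details are stored once per course, nothing in the data itself forces the two copies to stay in step; that is the situation normalization is meant to remove.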

To overcome the problem, several normalization algorithms have been proposed to minimize redundancies in semistructured data instances. They transform the schema of the semistructured data with redundancy into a better form. For example, currently proposed normalization algorithms include the normal form for semistructured data (NF-SS) developed by Wu et al. [52], XML normal form (XNF) developed by Embley and Mok [20], and normal form for XML documents developed by Arenas and Libkin [3]. Such normalization algorithms should always preserve data integrity. If normalization algorithms are not designed correctly, it may be possible for the normalization process to lose or corrupt the meaning of the data. However, many existing normalization algorithms for semistructured data lack automated verification support that proves the correctness of the normalization and ensures data consistency. The consequences of such inconsistencies could be devastating for many applications that use databases including online banking applications, credit card transactions, company's classified documents, government's legal applications, and any other applications dealing with crucial information.

The above XML example and the possible disasters caused by data inconsistencies clearly indicate the importance of ensuring the correctness and consistency of algorithms for semistructured data. However, current developments of algorithms operating on semistructured data fail to prevent possible data inconsistencies, since there is no adequate validation of schema/data instances or verification of semistructured data normalization provided [29]. In order to address this problem, we need to apply formal specification and verification techniques that enable us to automatically detect when a schema or data instance is invalid and to verify the correctness of normalization algorithms on semistructured data. In this paper, we demonstrate a theorem prover approach to semistructured data design to ensure the consistency of data modeling. This approach consists of the following:

– A formal specification of the data modeling language for semistructured data to provide precise and declarative conceptual level descriptions.

– Automated validation of semistructured data and schema to provide a means for preventing possible data inconsistencies caused by invalid instances of semistructured data and schema.

– A formal specification of the correctness criteria of semistructured data normalization to provide precise and declarative conceptual level descriptions for the conditions and rules that semistructured data normalization must obey to be correct.

– Automated verification of semistructured data normalization to prevent data inconsistencies caused by incorrectly designed schema transformations of normalization.

In order to provide formal validation for schema and data instances and formal verification for the normalization, an adequate data modeling language for semistructured data and formal specification and verification languages must be selected [2, 5, 9, 14]. The former should represent the schema of semistructured data, since the normalization algorithms transform the schema of the data. The latter should enable precise formal definitions and effective verification. For the data modeling language, the Object Relationship Attribute model for Semi-Structured data (ORA-SS) is chosen because it is a semantically enriched notation for semistructured data design [17, 29]. For the formal language, Prototype Verification System (PVS) [36, 38] is chosen because it provides an effective type checker and theorem prover that enables automated verification.

Using the ORA-SS data modeling language and PVS, the proposed approach for ensuring consistency of semistructured data is constructed as shown in Fig. 2. This paper is a substantial extension of our previous work on validating ORA-SS data models [27]. The extension includes: (1) formal criteria for measuring the correctness of semistructured data normalization; (2) semantic encoding of the proposed correctness criteria in PVS; (3) automated verification of ORA-SS data normalization using the PVS theorem prover.

Fig. 2 Overview approach of PVS theorem prover for semistructured data design

As a first step towards ensuring data consistency, the data model, the semantics captured in the data model, and the correctness criteria for semistructured data normalization must be formally described at the conceptual level. In the conceptual level specification, the ORA-SS data modeling language and its semantics were formally specified with PVS, serving as a shallow embedding [6] of the target notation. In addition, the criteria that define the correct normalization of semistructured data are derived and formally specified with PVS. With the PVS definitions of the ORA-SS semantics, we can automatically validate a schema of semistructured data against the semantics of the ORA-SS data modeling language using the PVS type checking facility. Similarly, an instance of semistructured data such as an XML document can be validated against its schema represented according to the semantics of the ORA-SS data modeling language. The PVS definitions of correctness criteria for semistructured data normalization can provide automated verification of the normalization by proving the schema equivalence using the theorem prover. The results of PVS validation and verification can indicate which parts of the semistructured data or its schema are invalid and which parts of the transformed schema do not conform to the correctness criteria for normalization. Here, we would like to point out that although an unsuccessful verification with a theorem prover may result from a failure to construct a correct proof, contradictions in the proof steps can be used to pinpoint the errors in the specification, as we will demonstrate in the later sections of the paper. These results can be useful when automated validation and verification are adopted by applications that need to detect possible causes of data inconsistencies. Furthermore, the formal definitions of verification can be extended and applied to prove the correctness of schema transformation operators, view creation, and normalization algorithms.


The rest of the paper is organized as follows. Section 2 presents background information about ORA-SS, normalization, and PVS. In Sect. 3, we present a formal semantics of the ORA-SS language and its data models in PVS with examples. With the examples presented, automated validation of ORA-SS schema and data instances is also demonstrated. Section 4 presents correctness criteria for semistructured data normalization formally defined in PVS. In this section, the automated verification of semistructured data normalization with the PVS theorem prover is also demonstrated. Section 5 presents a review of related research with a discussion of its strengths and weaknesses. Section 6 concludes the paper and addresses future work.

2 Background

In this section, we provide an overview of the ORA-SS data modeling language, introducing the language in terms of its concepts and diagrammatic notations and outlining the semantic richness of the ORA-SS notation. We also present a brief description of data redundancies and normalization specific to semistructured database systems, explaining how the process of normalization removes the redundancies in the semistructured data context. A short introduction to the PVS formal language is presented in terms of its underlying logic, language constructs, and verification support.

2.1 ORA-SS data modeling language

The Object-Relationship-Attribute model for Semistructured data (ORA-SS) is a semantically rich graphical data modeling language for semistructured data design [17]. It is capable of modeling more complex semistructured data models than other notations, such as XML Schema and DTD [29]. This can be observed from its variety of language constructs, e.g., the n-ary relationship structure, the complexity of constraints among the entities, e.g., the participation constraints, and the variability of data instances, e.g., the disjunctive relationship and attribute. ORA-SS has been widely used in many XML related database applications for various purposes including storage design, normal form definition, view creation, query execution, and the translation of XML to relational schemas [12, 18, 28, 32, 33]. To highlight the main features of ORA-SS, we describe an example along with its real world constraints, and then we show how these constraints are modeled in ORA-SS. Figure 3 represents an XML data instance of a Department-Course data model which consists of information about departments, courses, students, and tutors. The structure of the XML document reflects that a department offers different courses in which students can enrol and to which tutors can be assigned. It also indicates what properties are specified for each entity. For example, the student entity is specified with a student number, net id, and name as its properties. Besides the structure of the data reflected in XML documents, there are various real world constraints that could be imposed on the Department-Course data model. Some of the constraints that the above XML data instance can satisfy are listed below.

– The maximum number of courses in which a student can enrol is 8.

– A student can live either at home or in a hostel but not both.

– A grade is given to each student in a course that belongs to a department.

– A course code is constructed with a department prefix and a course number.

– A hostel where a student lives is identified by a hostel name and a room number.

– Each course in a department can have only one tutor.


Fig. 3 An XML data instance of a Department-Course data model with redundancy

Obviously, these constraints need to be captured in the data model during the design phase. With the constructs of the ORA-SS data modeling language, we can describe the structure of any XML document and capture the associated constraints in an ORA-SS model. ORA-SS models are represented as a forest of tree structured diagrams based on four basic concepts, namely, object classes, relationship types, attributes and references.

– An object class represents an entity in the domain being modeled, such as person, student, or course. In ORA-SS diagrams, object classes are denoted by labeled rectangles.

– A relationship type represents a relationship among object classes. In semistructured data, this relationship is typically a nested relationship. In ORA-SS schema diagrams, a relationship type is described by an edge labeled with a tuple (name, n, p, c), where name is the name of the relationship type, integer n indicates the degree of the relationship type, p represents the participation constraint of the parent object class in the relationship type with min:max notation and c represents the participation constraint of the child object class similarly. A relationship type may also include a disjunction of object classes where child object classes are not homogeneous.

– An attribute represents a property of either an object class or a relationship type. In ORA-SS schema diagrams, attributes are denoted by labeled circles. An attribute may be a key attribute that has a unique value, and is represented as a filled circle. Other types of attributes include single-valued attributes, multi-valued attributes, optional attributes, required attributes, composite attributes, and disjunctive attributes.

– An object class can reference another object class, using key attributes. A reference between two object classes indicates that the referencing object class is extended with the details of the referenced object class. It is similar to the inheritance relationship in the object oriented data model. In ORA-SS schema diagrams, references are represented by dashed edges.
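The four concepts above can be pictured as plain data structures. The following is only an illustrative sketch in Python, not the PVS encoding developed in this paper; all class names, field names, and the helper parse_participation are assumptions introduced here, and the representation of an unbounded maximum as None is a choice made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (min, max) participation; max is None when the upper bound is unbounded.
Participation = Tuple[int, Optional[int]]


@dataclass
class Attribute:
    name: str
    is_key: bool = False  # drawn as a filled circle in the diagram


@dataclass
class ObjectClass:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    references: Optional[str] = None  # name of the referenced object class, if any


@dataclass
class RelationshipType:
    """The edge label (name, n, p, c) plus the classes it connects."""
    name: str
    degree: int
    parent_participation: Participation
    child_participation: Participation
    parent: str
    child: str


def parse_participation(text: str) -> Participation:
    """Parse 'min:max', where a non-numeric max (e.g. 'm') means unbounded."""
    low, high = text.split(":")
    return int(low), (int(high) if high.isdigit() else None)


# Example label (cs, 2, 1:m, 1:8) between a Course parent and a Student child.
cs = RelationshipType("cs", 2,
                      parse_participation("1:m"),
                      parse_participation("1:8"),
                      parent="Course", child="Student")
print(cs)
```

Disjunctive relationships and composite attributes are left out of the sketch to keep it short.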

With the concepts in ORA-SS, we can capture the structure and constraints on the XML document in Fig. 3 in an ORA-SS schema diagram as shown in Fig. 4.

Fig. 4 An ORA-SS schema diagram of a Department-Course data model

As described earlier, an ORA-SS schema diagram essentially consists of different object classes and attributes in a tree structure, where it specifies the relationships, participation and cardinality constraints among the instances of the object classes in a semistructured data model. The relationship between the Course and Student object classes is a typical example of a binary relationship. The optional description of the relationship specifies that the relationship is named cs, has a degree of 2 and has parent and child participation constraints of 1:m and 1:8 respectively. The participation constraints represent that a course must have at least 1 student and a student must take between 1 and 8 courses. The relationship sh is also a binary relationship, but unlike the other binary relationships it is a disjunctive relationship, represented by an edge with '|' in a diamond. A disjunctive relationship allows an object class to be related to either of the object classes in the disjunction. For example, the relationship sh between the Student, Home, and Hostel object classes specifies that a student can live in either a hostel or at home but not both. Besides binary relationships, the ORA-SS data modeling language can model any 'n-ary' relationship with a degree higher than 2 [29]. An 'n-ary' relationship is a nesting relationship amongst 'n' object classes where a child object class is related to a relationship between 'n − 1' object classes. For example, the relationship cst represents a ternary relationship specifying a nesting relationship between the cs relationship and the Tutor object class.
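To make the meaning of the 1:m and 1:8 participation constraints and of the disjunctive relationship concrete, the following sketch checks them on a toy instance. It is a hand-written Python illustration under stated assumptions, not the PVS type checking used in this paper: the function names, the course code CS103, the student identifiers, and the dictionary keys "home" and "hostel" are all hypothetical.

```python
from collections import Counter


def check_participation(parents, children, pairs, parent_range, child_range):
    """Check min:max participation for a binary relationship.

    pairs lists (parent, child) links; each range is (min, max) with
    max None meaning unbounded. Returns a list of violation messages.
    """
    per_parent = Counter(parent for parent, _ in pairs)
    per_child = Counter(child for _, child in pairs)
    violations = []
    for label, items, counts, (low, high) in (
        ("parent", parents, per_parent, parent_range),
        ("child", children, per_child, child_range),
    ):
        for item in items:
            n = counts[item]  # Counter yields 0 for items with no links
            if n < low or (high is not None and n > high):
                bound = "m" if high is None else high
                violations.append(f"{label} {item}: {n} outside {low}:{bound}")
    return violations


def valid_residence(student):
    """Disjunction: exactly one of 'home' or 'hostel' must be present."""
    return ("home" in student) != ("hostel" in student)


courses = ["CS101", "CS102", "CS103"]          # CS103 has no students
students = ["s1", "s2"]
enrolments = [("CS101", "s1"), ("CS102", "s1"), ("CS101", "s2")]

print(check_participation(courses, students, enrolments, (1, None), (1, 8)))
print(valid_residence({"home": "12 Example St"}))
print(valid_residence({"home": "12 Example St", "hostel": "Hall A"}))
```

In the approach of this paper such conditions are discharged by the PVS type checker and theorem prover over the formal semantics; the sketch only shows what the constraints say about an instance.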

In the ORA-SS schema diagram of the Department-Course schema, appropriate attributes are related to each object class to specify properties of the object class. There are also several attributes, e.g., the grade attribute of relationship cs and the feedback attribute of relationship cst, describing the properties associated with relationships. The attributes with a partially filled circle, such as netID, represent the candidate keys of object classes. Similarly, attributes with a filled circle, such as deptName, represent the identifying attributes of object classes, where an identifying attribute is a selected candidate key of an object class. Similar to the concept of primary key in the relational data model [11], the value of an identifying attribute in an object class uniquely identifies an instance of the object class. Two or more attributes can also be joined together to specify properties of object classes or relationships, where these attributes are called composite attributes or composite keys. The attribute code of the Course object class is a composite attribute consisting of the deptPrefix and courseNo attributes to specify the course code. The hostelName and roomNo attributes form a composite key that identifies the Hostel object class. Similar to a disjunctive relationship, the '|' in a circle for attribute examVenue represents a disjunctive attribute such that a course has an exam venue that is either a lecture theatre or a lab. Note that the Tutor object class under the Course object class is referenced by the Tutor object class under the Student object class in the ORA-SS schema diagram. In this example, a Tutor object class in the cst relationship has the same properties as a Tutor object class in the ct relationship, with one additional property, feedback.

2.2 Data redundancy and normalization

An ORA-SS schema diagram represents constraints enforced on the data, and an instance of semistructured data such as an XML document is organized based on its schema. If the schema is poorly designed, it can cause redundant data in the subsequent data instance. For example, in the XML data instance in Fig. 3, we can identify three such examples of data redundancies, i.e., (1) the attribute value Paul Cox for the tutorialCoordinator attribute is repeated for every tutor in the CS101 course, as shown in lines 7, 12, and 17, (2) values for all the attributes of tutor Robert Miller are repeated in different courses, as shown in lines 14 to 18, 26 to 30, and 38 to 42, (3) values for all the attributes of student John Smith are repeated in different courses, as shown in lines 19 to 22, 31 to 34, and 43 to 46. These redundancies not only waste storage but also create anomalies such as inconsistent insertion, incomplete deletion and partial update. This happens mainly because the same data value appears in multiple places. If we revisit the ORA-SS schema diagram in Fig. 4, there are three places that could cause data to be stored redundantly in the schema design, i.e., (1) since a tutorial coordinator is assigned to a course and represented as an attribute of the tutors of the course, the value for the tutorialCoordinator attribute of the object class Tutor is repeated for each tutor in the course, (2) since a tutor can be employed for different courses, the values for the tutorName and degree attributes of the Tutor object class are repeated for each of the Course object classes the tutor is associated with, (3) since a student can enroll in different courses, the values for all attributes of the Student object class are repeated for each instance of the Course object class that the student instance is associated with. These redundancies in the ORA-SS schema of Fig. 4 could cause the following anomalies on a data instance:

– An insertion anomaly can arise if the value of the tutorialCoordinator attribute is not inserted for every tutor in the same course. The result is that recording of the tutorialCoordinator attribute is inconsistent across tutors in courses.

– An update anomaly can arise if the value for tutorName and degree is not updated for every course that a tutor is employed for. The partial update could corrupt the data by having different values for the tutorName and degree attributes for the same tutor in different courses.

– A deletion anomaly can arise if the Student object class is not deleted consistently. The incomplete deletion could result in a course having information about a student even after the student has dropped the course, causing data corruption.

These data inconsistencies or corruptions in the database instance caused by data redundancies can be prevented using a database technique called normalization [19, 22]. Normalization is the process of analyzing and restructuring a database schema in order to minimize redundancies in data instances. A number of normalization processes have been defined for semistructured database systems including [3, 20, 52], which transform a schema based on the keys and functional dependencies. Although the processes are different, there are common operations for transforming the structure, such as creating a reference for an object class, moving an attribute to another object class, and swapping an object class with another object class.
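As a rough illustration of one such operation, creating a reference for an object class, the sketch below moves repeated tutor records into a single shared table and leaves only the identifying attribute behind in each course. It is a simplified data-level analogue written for this explanation, not any of the cited normalization algorithms (which operate on the schema using keys and functional dependencies); the function name create_reference, the dictionary layout, and the tutorID and degree values are hypothetical.

```python
def create_reference(courses, child_key, id_field):
    """Store each child record once, keyed by id_field, and replace the
    nested copies in every parent with the identifier only."""
    shared = {}
    restructured = {}
    for code, course in courses.items():
        refs = []
        for record in course[child_key]:
            ident = record[id_field]
            # Copies that disagree signal an existing update anomaly.
            if ident in shared and shared[ident] != record:
                raise ValueError(f"conflicting copies of {ident}")
            shared[ident] = record
            refs.append(ident)
        restructured[code] = {**course, child_key: refs}
    return restructured, shared


# Hypothetical redundant instance: the same tutor is repeated per course.
miller = {"tutorID": "t1", "tutorName": "Robert Miller", "degree": "MSc"}
courses = {
    "CS101": {"tutors": [dict(miller)]},
    "CS102": {"tutors": [dict(miller)]},
}

restructured, tutors = create_reference(courses, "tutors", "tutorID")
print(restructured)   # each course now holds only the tutor identifier
print(tutors)         # the tutor's attributes are stored exactly once
```

After the restructuring, an update to the tutor's name or degree touches a single record, so the update anomaly can no longer arise for those attributes.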

The functional dependencies for semistructured data reflect the constraints between attributes in the schema using a path notation [3]. The functional dependency in a path notation is of the form pathX → pathY, and reads pathX determines pathY, where pathX represents a set of paths from the root to an attribute or an object class, and pathY represents a set of paths from the root to an attribute. Because different object classes or attributes in the schema diagram may have the same name, paths are used to uniquely identify particular object classes and attributes. For example, the constraints between attributes in the ORA-SS schema of Fig. 4 can be captured in the following set of functional dependencies:

{Department.Course → Department,
Department.Course.@code → Department.Course.@courseTitle,
Department.Course.@code → Department.Course.@examVenue,
Department.Course.@code → Department.Course.Tutor.@tutorialCoordinator,
Department.Course.Tutor.@tutorID → Department.Course.Tutor.@tutorName,
Department.Course.Tutor.@tutorID → Department.Course.Tutor.@degree,
Department.Course.Student → Department.Course.Student.Home,
Department.Course.Student → Department.Course.Student.Hostel,
Department.Course.Student.@studentNo → Department.Course.Student.@studentName,
Department.Course.Student.@studentNo → Department.Course.Student.@netID,
{Department.Course, Department.Course.Student} → Department.Course.Tutor,
{Department.Course, Department.Course.Student} → Department.Course.Student.@grade,
{Department.Course, Department.Course.Student, Department.Course.Student.Tutor} → Department.Course.Student.Tutor.@feedback}.

The above functional dependencies are given separately from the ORA-SS schema to represent the business logic or real world constraints of Department-Course data. For example, the dependency stating that code determines courseTitle is represented in path notation as ‘Department.Course.@code → Department.Course.@courseTitle’, where the ‘@’ symbol is a prefix for attributes. This path functional dependency denotes that the code of a course in a department determines the course title of the same course in the same department. Based on the given path functional dependencies, redundancies in semistructured database instances can be prevented or minimized using various normalization algorithms proposed for semistructured data.
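As an informal illustration (not part of the formal development), the meaning of a path functional dependency can be sketched in Python. The flattened row representation, the helper name holds, and all data values below are hypothetical; only the path names are taken from the dependencies listed above.

```python
from itertools import combinations

# Hypothetical flattened view of a Department-Course instance: one dict per
# (course, tutor) occurrence, keyed by the path names used in the text.
rows = [
    {"Department.Course.@code": "CS101",
     "Department.Course.@courseTitle": "Intro to CS",
     "Department.Course.Tutor.@tutorID": "T1",
     "Department.Course.Tutor.@tutorName": "Ann"},
    {"Department.Course.@code": "CS102",
     "Department.Course.@courseTitle": "Data Structures",
     "Department.Course.Tutor.@tutorID": "T1",
     "Department.Course.Tutor.@tutorName": "Anne"},  # disagrees with the first row
]

def holds(lhs, rhs, rows):
    """Return True if any two rows that agree on every lhs path also agree on rhs."""
    for a, b in combinations(rows, 2):
        if all(a[p] == b[p] for p in lhs) and a[rhs] != b[rhs]:
            return False
    return True

code_fd = holds(["Department.Course.@code"],
                "Department.Course.@courseTitle", rows)
tutor_fd = holds(["Department.Course.Tutor.@tutorID"],
                 "Department.Course.Tutor.@tutorName", rows)
print(code_fd, tutor_fd)  # True False
```

The second dependency fails because the tutor data is repeated under each course and the two copies disagree, which is the kind of inconsistency that normalization is intended to rule out.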

In the process of normalization, the ORA-SS schema in Fig. 4 is restructured into the schema shown in Fig. 5 according to the given functional dependencies, where the redundant data for tutor information is removed by creating a reference for a Tutor object class. A new Tutor object class with all the attributes describing its properties is created and referenced by other Tutor object classes using the identifying attribute, tutorID. The reference enables the attributes of a Tutor object class to be stored independently without causing redundancies and to be extended by other Tutor object classes. With this restructuring of the ORA-SS schema, the data redundancy caused by the Tutor object class in the ct relationship is eliminated. Because we want to use this example in later sections to demonstrate


Fig. 5 An ORA-SS schema diagram of a partially normalized Department-Course data model

further normalization, the schema diagram in Fig. 5 does not show a complete normalization. Figure 6 is an example of the XML data instance corresponding to the restructured schema shown in Fig. 5, where only one of the redundancies is removed. When compared with the XML data instance of the Department-Course schema in Fig. 3 before normalization, it can easily be seen that information about each tutor is no longer repeated. This data appears only once, and the same information is referenced using the tutorID identifying attribute in the Tutor entities in different courses.

2.3 Prototype verification system

The Prototype Verification System (PVS) is a formal verification system based on typed higher-order logic, in which a formal specification language is integrated with supporting tools [36, 38, 42]. It is a research prototype, which means that it evolves and improves as new requirements arise. PVS has a number of language constructs including user-defined types, built-in types, functions, sets, tuples, records, enumerations, and recursively-defined data types such as lists and binary trees. Predicate subtypes and dependent types are included in PVS to introduce constraints on other types. These constraint-carrying types increase the expressiveness of the modeling language, and the constraints can be proved effectively by the theorem prover [34, 39]. With the provided language constructs, PVS specifications are represented in parameterized theories that contain assumptions, definitions, axioms, and theorems. PVS definitions also support recursive functions and inductively-defined relations. PVS expressions provide the usual arithmetic and logic operators, function application, lambda abstraction, and quantifiers. The definitions and expressions of PVS allow the composition of complex specifications, which makes it easier to model real world problems [35, 40]. The problems in turn can be solved and verified using the built-in theories and theorem prover. PVS is considered one of the most popular and effective verification systems, because it provides an automated type


Fig. 6 An XML data instance of a partially normalized Department-Course data model

checker, a rich library of predefined theorems and a powerful theorem prover [41, 44]. It has been adopted to provide formal verification support for system properties of many applications such as fault tolerant flight control systems, railroad crossing controllers, real-time hardware systems, and steam boiler control systems [25, 37, 43, 47, 50].

3 Formal semantics of ORA-SS and validation of semistructured data model

In this section, we present a shallow embedding of the ORA-SS language semantics into PVS and provide automated validation for ORA-SS schemas and data instances using the PVS type checker and theorem prover. Automated validation detects possible data inconsistencies, which can arise from an incorrectly designed schema or incorrect data population. The ORA-SS semantics consist of concepts, diagrammatic notations, and associated constraints for representing the semistructured data model. The concepts and the diagrammatic notation provide a means to construct a data model for semistructured data, whereas the associated constraints define properties that a constructed data model must satisfy to be correct and meaningful. We have already presented an informal description of the concepts and the diagrammatic notation of the ORA-SS data modeling language in Sect. 2.1. To complete the description of the ORA-SS semantics, the associated constraints of the ORA-SS data modeling language must be described. In order to provide adequate validation, the informal description of the ORA-SS semantics is specified formally in PVS, where the concepts and notations are specified declaratively with the associated constraints. Using the formally specified semantics of ORA-SS, an ORA-SS schema and a data instance can be validated against the semantics of semistructured data.

3.1 Constraints on ORA-SS schema

The constraints implicit in an ORA-SS schema, which must be satisfied to define a meaningful data model, are derived from the ORA-SS data modeling language as follows.


1. A relationship type r with degree two relates a parent object class with a child object class.

2. In a relationship type r with degree n greater than two, there must be another relationship type with degree n−1, relating n−1 ancestors with the child object class of relationship type r.

3. A disjunctive relationship type relates a parent object class to two or more child object classes.

4. A participation constraint of a relationship type must not be 0:0. Similarly, the minimum participation of a child participation constraint must not be 0, since a child object must be related to at least one parent object in a relationship instance.

5. A composite attribute or disjunctive attribute relates an attribute to two or more sub-attributes.

6. A candidate key or a composite key of an object class must be a set of attributes selected from the attributes of the object class.

7. There can only be one set of identifying attributes per object class, and it must be either a candidate key or a candidate composite key of the object class.

8. Relationship attributes have to belong to an existing relationship type, and the attribute must be nested in the child object class of the relationship type.

9. An object class can reference only one object class, but an object class can be referenced by multiple object classes. The referencing and referenced object classes in an ORA-SS schema should have the same identifying attributes.

The list above represents the constraints on ORA-SS schemas that are essential to ensure the clear meaning of the data against the ORA-SS semantics. The constraints listed can be considered as a starting point to derive the full set of constraints necessary to model a correct ORA-SS schema. These constraints can also be used to check whether an ORA-SS schema is correct using validation. To illustrate this at the schema level, three semantic errors are deliberately introduced into the schema diagram of the Department-Course data model, as shown in Fig. 7. These errors demonstrate the difficulty of validating an ORA-SS schema by inspection and illustrate how such errors are detected using the constraints listed above. Examining the Department-Course schema diagram in Fig. 7 against these constraints reveals the three semantic errors:

– The degree of relationship dc between object class Department and Course is described as 3, representing a ternary relationship where it is actually binary, which violates the second constraint.

– Two attributes are described as identifying attributes for the object class Student, which violates the seventh constraint.

– The candidate key netID is represented as an attribute of the relationship cs, violating the sixth constraint.

These errors violate the constraints on the schema imposed by the semantics of the ORA-SS data modeling language. In order to detect semantic errors similar to these, constraints on data models are acquired from the ORA-SS data modeling language.

3.2 Constraints on ORA-SS data instance

We consider a semistructured data instance, such as an XML document, consistent if it conforms to the constraints embodied in its corresponding ORA-SS schema. The constraints on the ORA-SS data instance are derived from the ORA-SS data model as follows.


Fig. 7 An ORA-SS schema diagram of a Department-Course data model with semantic errors

1. An object instance can only be associated with one object class.

2. The degree of a relationship type and the degree of its instances must be consistent.

3. An object in the instance of a relationship type should be an instance of the corresponding object classes in the relationship type, i.e., the child object should be an instance of the child object class; and the parent object or relationship type instance should be an instance of the parent object class or relationship type.

4. A relationship instance can only be associated with a single relationship type.

5. A relationship instance must conform to its parent and child participation constraints, i.e., the number of child objects related to a single parent object or relationship instance should be consistent with the parent participation constraints; and the number of parent objects or relationship instances that a single child object relates to should be consistent with the child participation constraints.

6. In a disjunctive relationship type or attribute, only one object class or attribute respectively can be associated with a particular parent instance.

7. The value of a candidate key (single or composite) should uniquely identify an object in its object class.

8. The number of values of an attribute must be limited by the minimum and maximum multiplicity values of the attribute (e.g., single-valued, multi-valued, required, and optional).

9. For a referencing object that is an instance of the referencing object class, the corresponding referenced object must be an instance of the referenced object class and both objects should have the same value for their identifying attributes.

Similar to the list of constraints derived for ORA-SS schemas, the above list comprises the constraints on semistructured data instances that are essential to ensure the correctness of the data against the ORA-SS schema. It acts as a starting point to derive the full set of constraints necessary to model correct data instances. Figure 8 shows an XML document


Fig. 8 An XML data instance for a Department-Course data model

that does not conform to the ORA-SS schema in Fig. 7 (after the semantic errors identified in the previous section are rectified). By inspecting this data instance, we can identify three data inconsistencies:

– In lines 30 to 35, the student Jane Anderson in course CS101 is related to two tutors, violating the fifth constraint, where the parent participation constraint of relationship cst is defined as 1:1.

– In lines 18 and 26, there are two identical values of studentNo for two different students, violating the seventh constraint, where candidate keys, and therefore identifying attributes, must uniquely identify an object in an object class.

– In lines 36 to 39, the student Jane Anderson is related to both Home and Hostel, violating the sixth constraint, where the relationship sh restricts a student to be associated with either a hostel or a home, but not both.

One can observe that it is not trivial to reveal these errors using only the schema diagrams and XML documents. When examining a large database consisting of a number of XML documents that should conform to a complicated ORA-SS schema, it is almost impossible to validate the ORA-SS schema diagram and its XML data instances manually. Therefore, adequate validation support based on the formal specification of ORA-SS semantics can be beneficial in revealing inconsistencies in the schema as well as in corresponding data instances.
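To give a feel for what automating two of these checks involves, the following Python sketch tests candidate key uniqueness (the seventh constraint) and a parent participation constraint (the fifth constraint). It is only an illustration under assumed data structures: the function names, the dictionary layout, and the sample values are hypothetical and are not taken from Fig. 8 or from the PVS specification.

```python
from collections import Counter

def duplicate_keys(objects, key):
    """Return the key values shared by more than one object (constraint 7)."""
    counts = Counter(obj[key] for obj in objects)
    return sorted(value for value, n in counts.items() if n > 1)

def participation_ok(children_per_parent, minimum, maximum):
    """Check every parent's child count against minimum:maximum (constraint 5)."""
    return all(minimum <= n <= maximum for n in children_per_parent.values())

# Hypothetical data, loosely modelled on the inconsistencies described above.
students = [
    {"studentNo": "S1", "studentName": "Jane Anderson"},
    {"studentNo": "S1", "studentName": "John Smith"},
]
tutors_per_student = {("CS101", "Jane Anderson"): 2}

dups = duplicate_keys(students, "studentNo")
cst_ok = participation_ok(tutors_per_student, 1, 1)
print(dups, cst_ok)  # ['S1'] False
```

Each additional constraint needs its own check of this kind, which is one reason for deriving the checks from a single formal specification.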

In order to provide adequate validation, the informal description of the ORA-SS semantics is specified in the PVS formal language, where the concepts and notations of the ORA-SS data modeling language are declaratively specified with the associated constraints. When specifying the ORA-SS semantics in PVS definitions, we have limited ourselves to a subset of the ORA-SS semantics that demonstrates the major concepts of the language, in order to focus on providing clear and simple definitions that are required for automated validation and verification. For example, the ORA-SS semantics are simplified by assuming that an attribute only has a single value. As a result, the multiplicity of attribute values and the attribute types distinguished by multiplicity (single-valued, multi-valued, optional, and required) are omitted from the PVS definitions. This assumption makes the other definitions and their validation simpler and clearer. Even though the definition for multiplicity of attribute values is omitted by the assumption, it does not affect the verification


Fig. 9 Structure of the ORA-SS semantics formally defined in PVS and its validation model

of semistructured data normalization because normalization does not take multiplicity or type of attributes into account. With such an assumption, the PVS formal definitions for the semantics of the ORA-SS data modeling language and the validation models for schemas and instances of semistructured data are defined and structured as shown in Fig. 9.

The PVS definitions for the semantics of the ORA-SS data modeling language consist of three different theories as shown above: one defining ORA-SS schemas, one defining data instances, and one defining the relationship between a schema and a data instance. These three theories allow independent validation of schemas and instances of semistructured data. Specific ORA-SS schema diagrams and possible XML data instances can be translated into the vocabulary provided by these theories, to obtain PVS representations that can then be used for the validation, as described later in Sect. 3.9. In order to provide a concise explanation, some of the PVS definitions for ORA-SS semantics are abbreviated and similar definitions are presented only once. A complete PVS semantics of ORA-SS can be found in [26].

3.3 The ORA-SS schema theory

Initially, the structure and semantics of ORA-SS schemas are defined in a PVS theory to establish the formal definition of the ORA-SS data modeling language. Every ORA-SS schema is built on a set of object classes and a set of attributes, which are declared as non-empty types, OC and ATT, that are provided as arguments to the generic theory named orassSchemaDef.

orassSchemaDef[OC, ATT: TYPE+]: THEORY

3.3.1 Relationship type

The relationship type in an ORA-SS schema is defined as a list of sets of object classes. It lists the object classes participating in the relationship type, starting with the one occurring deepest in the hierarchy, and moving upwards. It uses sets containing more than one object class to represent a disjunctive relationship type.

RelType: TYPE = {ocSetList: list[set[OC]] |
    length(ocSetList) > 1 ∧ no_cycle?(ocSetList)}

no_cycle?(ocSetList: list[set[OC]]): RECURSIVE bool =
  CASES ocSetList OF
    null: TRUE,
    cons(head, tail):
      (∀ (ocSet: set[OC]): member(ocSet, tail) ⇒ disjoint?(head, ocSet))
        ∧ no_cycle?(tail)
  ENDCASES
MEASURE length(ocSetList)

The RelType definition above defines the relationship type in ORA-SS as a list of sets of object classes, using a set comprehension whose predicates represent the associated constraints. The predicate specifies that the list must contain at least two elements, since a relationship type must involve at least two object classes. It also specifies that the list must not contain any cycles, as defined by the no_cycle? function, which recursively checks that each set of object classes in the list is disjoint from every set that follows it, so that no object class is repeated. Note that a list is used to define the relationship type to provide a clearer and simpler definition. It also allows the relationship type to be easily used in other definitions and conveniently processed for validation in PVS by making recursion possible. We use the list structure to represent finite sets in our encoding mainly because of the efficiency of computation and the simplicity of proof construction. Unlike relational data, where all the tables are filled, semistructured data can contain many missing and incomplete fields. This naturally causes the PVS encoding to have partial functions. The list structure is used instead of sets to reduce the high cost of calculating subsets for every partial function represented. In addition, PVS has a rich library of functions and theories for manipulating list structures over sets. We found that it is more efficient to use lists with recursive functions in PVS, after exploring both ways of encoding. Furthermore, the list approach also contributes to automation during the verification stage, as the proofs can be constructed via simple PVS commands such as ‘grind’. However, when a list is substituted for a set in the encoding, additional predicates must be defined to prevent the list from having duplicated values.
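For readers less familiar with PVS, the RelType predicate can be mirrored by a small executable Python sketch. This is an assumption-laden analogue rather than a translation: Python lists of sets stand in for list[set[OC]], object classes are plain strings, and the names no_cycle and is_rel_type are hypothetical.

```python
def no_cycle(oc_set_list):
    """Mirror of no_cycle?: the head must be disjoint from every later set."""
    if not oc_set_list:
        return True
    head, tail = oc_set_list[0], oc_set_list[1:]
    return all(head.isdisjoint(s) for s in tail) and no_cycle(tail)

def is_rel_type(oc_set_list):
    """Mirror of the RelType predicate: more than one set and no cycles."""
    return len(oc_set_list) > 1 and no_cycle(oc_set_list)

# Deepest object class first, then its ancestors; a set with several
# members stands for a disjunctive relationship type.
cs = [{"Student"}, {"Course"}]
sh = [{"Home", "Hostel"}, {"Student"}]
bad = [{"Course"}, {"Student"}, {"Course"}]  # Course occurs twice

print(is_rel_type(cs), is_rel_type(sh), is_rel_type(bad))  # True True False
```

Unlike the PVS subtype, which rejects ill-formed values at type-checking time, this sketch only reports the violation when the function is called.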

In order to fully describe a relationship type, we must record the relevant information such as the related object classes, the degree of the relationship type, and the parent and child participation constraints. Instead of defining the relationship type and its properties in separate definitions, the record type in PVS, which is capable of holding multiple types as its components, is used to represent a relationship type with these four components:

Relationship: TYPE =
  {r: [# rel: RelType, degree: posnat, pConstraint: [nat, posnat],
         cConstraint: [posnat, posnat] #] |
     degree(r) = length(rel(r)) ∧
     (length(rel(r)) > 2 ⇒ (∃ (subRel: RelType): subRel = cdr(rel(r))))}

The above definition defines a relationship type and its properties as a record type (enclosed by the ‘#’ symbols) in a set comprehension with predicates representing associated constraints. The rel component represents a relationship between object classes and is declared as a RelType defined previously. The degree component represents the degree of the relationship type and is declared as a positive integer. The pConstraint and cConstraint components represent the parent and child participation constraints and are declared as ordered pairs of numbers giving the minimum and maximum number of occurrences permitted. The minimum for parent participation is a natural number, which includes zero, since the parent objects in a relationship are not required to have any child objects. The others are all positive integers. The name of the relationship is not included in the type definition since it is represented as the variable name of the relationship type. The predicate of the relationship type specifies that the degree of a relationship type is the same as the length of rel, that is, the number of object class sets participating in the relationship type. It is also implied that the degree must be at least two, since the predicate of the RelType definition only allows relationship types with degree greater than one to be defined. In addition, the predicate specifies that if a relationship type has degree n greater than two, then there must also be a relationship type with degree n − 1 relating the ancestors of the child object class of the relationship type in the same hierarchical order. This ensures that a relationship type has the right number of participants.
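The combined effect of the component types and the predicate of Relationship can be sketched in Python as follows. This is a hedged analogue: the function names and the tuple encoding of participation constraints are hypothetical, and the existential condition on subRel is read here as requiring the tail of rel to satisfy the RelType predicate itself.

```python
def is_rel_type(sets):
    """RelType predicate: at least two sets, pairwise disjoint."""
    pairwise = all(a.isdisjoint(b)
                   for i, a in enumerate(sets) for b in sets[i + 1:])
    return len(sets) > 1 and pairwise

def is_relationship(rel, degree, p_constraint, c_constraint):
    """Analogue of the Relationship type: component types plus predicate."""
    p_min, p_max = p_constraint
    c_min, c_max = c_constraint
    # [nat, posnat] for the parent side, [posnat, posnat] for the child side.
    types_ok = p_min >= 0 and p_max >= 1 and c_min >= 1 and c_max >= 1
    # For degree above two, the tail (the ancestors) must itself be a RelType.
    sub_rel_ok = len(rel) <= 2 or is_rel_type(rel[1:])
    return is_rel_type(rel) and types_ok and degree == len(rel) and sub_rel_ok

cst = [{"Tutor"}, {"Student"}, {"Course"}]
dc = [{"Course"}, {"Department"}]

print(is_relationship(cst, 3, (1, 1), (1, 5)))   # True
print(is_relationship(dc, 3, (1, 10), (1, 1)))   # False: degree 3, two classes
```

The second call corresponds to the kind of degree error introduced for relationship dc in Fig. 7; the participation values used here are illustrative only.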

3.3.2 Attributes of object class and relationship type

In ORA-SS schema diagrams, object classes and relationship types can both have attributes, and these attributes may be composite or disjunctive. The set of attributes for an object class is defined as a record type with two components, giving the object class (oc) and its attributes (attList). The set of attributes for a relationship type can be defined similarly.

ObjectAtt: TYPE = {oAtt: [# oc: OC, attList: list[ATT] #] |
    noAttRepeat?(attList(oAtt))}

RelationshipAtt: TYPE = {rAtt: [# rel: RelType, attList: list[ATT] #] |
    noAttRepeat?(attList(rAtt))}

The ObjectAtt definition above uses a record type to relate attributes to object classes and a list to represent all the attributes that belong to an object class. The attributes of a relationship type are defined similarly, as shown in the RelationshipAtt definition. Where a list is substituted for a set in the encoding, a predicate must be defined to prevent the list from having repeating values. The noAttRepeat? recursive predicate function is defined to check whether the list of attributes has any repeating attributes.

noAttRepeat?(attList: list[ATT]): RECURSIVE bool =
  CASES attList OF
    null: TRUE,
    cons(head, tail): ¬ member(head, tail) ∧ noAttRepeat?(tail)
  ENDCASES
MEASURE length(attList)

The above definition represents the function that recursively traverses the list of attributes and compares each attribute with the rest of the list. It returns a boolean value of false when any attribute in the list also occurs in the rest of the list, and true otherwise. The noAttRepeat? function is highly scalable since the PVS theorem prover can effectively manipulate a long list of attributes in the function. Throughout the PVS definitions of ORA-SS semantics, this kind of recursive predicate function will be used to check for repeating elements whenever a list is used to represent a set. Note that a generic recursive predicate function that detects repeating elements for different lists is not defined, since many lists are constructed with different and complicated types, which often require a separate equality function to be defined.
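The recursion in noAttRepeat? can be followed step by step in this Python analogue. The name no_att_repeat and the use of strings for attributes are assumptions made for the illustration.

```python
def no_att_repeat(att_list):
    """Mirror of noAttRepeat?: no element may reappear later in the list."""
    if not att_list:                       # null: TRUE
        return True
    head, tail = att_list[0], att_list[1:]
    return head not in tail and no_att_repeat(tail)

print(no_att_repeat(["studentNo", "studentName", "netID"]))   # True
print(no_att_repeat(["studentNo", "netID", "studentNo"]))     # False
```

The recursion terminates because each call receives a strictly shorter list, which is the role played by the MEASURE clause in the PVS definition.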

Composite attributes are defined using the CompositeAtt record type that has two components giving the attribute (att) and its component attributes (attList). Disjunctive attributes can be defined similarly.

CompositeAtt: TYPE = {cAtt: [# att: ATT, attList: list[ATT] #] |
    length(attList(cAtt)) > 1 ∧ noAttRepeat?(attList(cAtt)) ∧
    ¬ member(att(cAtt), attList(cAtt))}


The CompositeAtt definition above defines a composite attribute as a record type in a set comprehension with predicates representing associated constraints. The predicate of the CompositeAtt type specifies that a composite attribute must have at least two components and must not have repeated components. It also restricts an attribute from being listed as one of its own components.
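The three conjuncts of the CompositeAtt predicate can be checked directly in a short Python sketch. The function name is_composite_att and the example attribute names are hypothetical and do not come from the Department-Course schema.

```python
def is_composite_att(att, att_list):
    """Analogue of the CompositeAtt predicate for an attribute and its parts."""
    at_least_two = len(att_list) > 1
    no_repeats = len(set(att_list)) == len(att_list)
    not_own_component = att not in att_list
    return at_least_two and no_repeats and not_own_component

print(is_composite_att("address", ["street", "city"]))    # True
print(is_composite_att("address", ["street"]))            # False: one component
print(is_composite_att("address", ["address", "city"]))   # False: contains itself
```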

3.3.3 Candidate key and identifying attributes

In ORA-SS schema diagrams, an object class may have any number of candidate keys and composite candidate keys, each having a unique value for each instance of the object class. The set of candidate keys for an object class is defined as a record type with two components, the object class (oc) and its candidate keys (keyList). It uses lists to represent more than one attribute in a composite candidate key.

CandidateKey: TYPE = {cKey: [# oc: OC, keyList: list[list[ATT]] #] |
    noKeyRepeat?(keyList(cKey)) ∧
    (∀ (attList: list[ATT]):
       member(attList, keyList(cKey)) ⇒ noAttRepeat?(attList))}

The above defines a candidate key as a record type in a set comprehension with predicates representing associated constraints. A list of lists of attributes is used to represent the candidate keys of an object class, where the outer list represents all the candidate keys of the object class and each inner list represents the attributes that make up a (possibly composite) candidate key. The predicate of the CandidateKey type uses the noKeyRepeat? function to specify that the list of candidate keys for an object class must not contain any repeating keys. It also uses the noAttRepeat? function defined earlier to specify that a list of attributes representing a composite key must not contain any repeating attributes.

Besides these predicates defined for the candidate key, the recursive predicate function cKeyCorrect? is defined separately to apply constraints of ORA-SS semantics on a candidate key. The semantics of ORA-SS specifies that a candidate key should be selected from the set of attributes that belong to the object class, hence the cKeyCorrect? function checks whether all the candidate keys for all object classes in a schema satisfy this condition.

cKeyCorrect?(cKeyList: list[CandidateKey], oAttList: list[ObjectAtt]):
    RECURSIVE bool =
  CASES cKeyList OF
    null: TRUE,
    cons(head, tail): correctOAtt4CKey?(oAttList, head) ∧
                      cKeyCorrect?(tail, oAttList)
  ENDCASES
MEASURE length(cKeyList)

correctOAtt4CKey?(oAttList: list[ObjectAtt], cKey: CandidateKey):
    RECURSIVE bool =
  CASES oAttList OF
    null: FALSE,
    cons(head, tail): IF (oc(head) = oc(cKey))
                      THEN correctCKey?(keyList(cKey), attList(head))
                      ELSE correctOAtt4CKey?(tail, cKey) ENDIF
  ENDCASES
MEASURE length(oAttList)

correctCKey?(attListList: list[list[ATT]], attList: list[ATT]):
    RECURSIVE bool =
  CASES attListList OF
    null: TRUE,
    cons(head, tail): isCKeyObjAtt?(head, attList) ∧
                      correctCKey?(tail, attList)
  ENDCASES
MEASURE length(attListList)

isCKeyObjAtt?(cKeyList, objAttList: list[ATT]): RECURSIVE bool =
  CASES cKeyList OF
    null: TRUE,
    cons(head, tail): member(head, objAttList) ∧
                      isCKeyObjAtt?(tail, objAttList)
  ENDCASES
MEASURE length(cKeyList)

The cKeyCorrect? recursive predicate function above has a list of CandidateKey and a list of ObjectAtt as its arguments, which refer to all the candidate keys and all the attributes of the object classes in a schema. The cKeyCorrect? function recursively goes through the candidate keys of each object class in the schema and compares them with the given object class attributes using the correctOAtt4CKey? function. Given the candidate keys of an object class and all the object class attributes in a schema, the correctOAtt4CKey? function finds the attributes of the object class to which the given candidate keys belong. Using the correctCKey? and isCKeyObjAtt? functions, it checks whether the attributes that make up each candidate key belong to the attributes of the object class. Once it has been checked that all the candidate keys in the schema are selected from the attributes of the corresponding object class, the cKeyCorrect? function returns true to indicate that this constraint holds for the given schema.
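The chain of four functions can be followed more easily in a compact Python analogue. Records are modelled as dictionaries with the same field names as the PVS record components; the function names and the sample schema fragments are hypothetical.

```python
def is_ckey_obj_att(key, obj_atts):
    """Every attribute of one candidate key belongs to the object class."""
    return all(att in obj_atts for att in key)

def correct_ckey(key_list, obj_atts):
    """Every candidate key in key_list is drawn from obj_atts."""
    return all(is_ckey_obj_att(key, obj_atts) for key in key_list)

def correct_oatt_for_ckey(o_att_list, ckey):
    """Find the attributes of ckey's object class; False if none is recorded."""
    for o_att in o_att_list:
        if o_att["oc"] == ckey["oc"]:
            return correct_ckey(ckey["keyList"], o_att["attList"])
    return False

def ckey_correct(ckey_list, o_att_list):
    """Analogue of cKeyCorrect? over all candidate keys in a schema."""
    return all(correct_oatt_for_ckey(o_att_list, ck) for ck in ckey_list)

keys = [{"oc": "Student", "keyList": [["studentNo"], ["netID"]]}]
good = [{"oc": "Student", "attList": ["studentNo", "studentName", "netID"]}]
faulty = [{"oc": "Student", "attList": ["studentNo", "studentName"]}]

print(ckey_correct(keys, good), ckey_correct(keys, faulty))  # True False
```

The faulty case resembles the third error described for Fig. 7, where netID is declared as a candidate key of Student but is not among the attributes of that object class.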

In ORA-SS schema diagrams, an object class has identifying attributes that are selected from the set of candidate keys. The identifying attributes are defined as a record type that consists of an object class and a list of attributes that refers to the identifying attributes of the object class.

IdentifyingAtt: TYPE = [# oc: OC, attList: list[ATT] #]

In the above definition, the list of attributes is used to represent composite identifying attributes. Note that a single list of attributes suffices to represent the identifying attributes of an object class, since there can only be one set of identifying attributes for an object class. Similar to the CandidateKey definition, there is a recursive predicate function separately defined to check whether the identifying attributes for each object class in a given schema are selected from candidate keys as specified in ORA-SS semantics.

3.3.4 Object reference

In ORA-SS schema diagrams, an object class can reference another object class to extend its meaning. An object class can reference only one other object class, but an object class can be referenced by many object classes.

Reference: TYPE = {ref: [# referencing: OC, referenced: OC #] |
    ¬ (referencing(ref) = referenced(ref))}

The above definition defines a reference as a record type in a set comprehension with predicates representing associated constraints. The two object classes in the record type represent the referencing and the referenced object classes respectively. A single object class is used to represent the referenced object class, since an object class can reference only one object class. The predicate of the Reference type specifies that an object class cannot reference itself.

Besides the predicate defined for references, the recursive predicate function refCorrect? is defined separately to apply further constraints of ORA-SS semantics on a reference. The semantics of ORA-SS specify that object classes involved in a reference must have the same identifying attributes in order for the referencing object class to extend the attributes that belong to the referenced object class. The refCorrect? function is defined to check whether the references in a schema satisfy this condition.

refCorrect?(refList: list[Reference], idAttList: list[IdentifyingAtt]):
    RECURSIVE bool =
  CASES refList OF
    null: TRUE,
    cons(head, tail):
      attListEqual?(findIdAtt(idAttList, referencing(head)),
                    findIdAtt(idAttList, referenced(head)))
        ∧ refCorrect?(tail, idAttList)
  ENDCASES
MEASURE length(refList)

findIdAtt(idAttList: list[IdentifyingAtt], oc: OC): RECURSIVE list[ATT] =
  CASES idAttList OF
    null: null,
    cons(head, tail): IF (oc(head) = oc) THEN attList(head)
                      ELSE findIdAtt(tail, oc) ENDIF
  ENDCASES
MEASURE length(idAttList)

The refCorrect? recursive predicate function above has a list of Reference and a list of IdentifyingAtt as its arguments. The function recursively goes through each reference in a schema and finds the identifying attributes of the referencing and the referenced object class using the findIdAtt function. It then checks whether the identifying attributes of the referencing and the referenced object class are the same. The function returns true only if every given reference passes this check, which indicates that the constraint is satisfied for the given schema.
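The helper attListEqual? is used by refCorrect? but its definition is not shown here. One plausible reading, given purely as an assumption, is an order-insensitive comparison of the two attribute lists using the prelude function list2set:

```pvs
% Hypothetical sketch: the actual definition of attListEqual? is not given
% in the text. This version treats two attribute lists as equal when they
% contain the same attributes, regardless of order.
attListEqual?(l1, l2: list[ATT]): bool =
  list2set(l1) = list2set(l2)
```

Under this reading, a referencing and a referenced object class pass the check whenever they declare the same set of identifying attributes, even if the attributes are listed in a different order.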

3.3.5 Schema type

To capture the meaning of an ORA-SS schema diagram, we need to describe the relationship types, attributes, keys and references embodied in the diagram. In the PVS formal semantics, a schema is represented using a record type. The components of the schema record type represent the aspects of an ORA-SS schema with relationship types, various kinds of attributes (object class attributes, relationship attributes, composite attributes and disjunctive attributes), keys (candidate keys and identifying attributes) and references represented previously.

    Schema: TYPE =
      {s: [# relList: list[Relationship], oAttList: list[ObjectAtt],
             rAttList: list[RelationshipAtt], cAttList: list[CompositeAtt],
             dAttList: list[DisjunctiveAtt], cKeyList: list[CandidateKey],
             idAttList: list[IdentifyingAtt], refList: list[Reference] #] |
         noRelRepeat?(relList(s)) ∧ noOAttRepeat?(oAttList(s)) ∧
         noRAttRepeat?(rAttList(s)) ∧ noCAttRepeat?(cAttList(s)) ∧
         noDAttRepeat?(dAttList(s)) ∧ noCKeyRepeat?(cKeyList(s)) ∧
         noIdAttRepeat?(idAttList(s)) ∧ noRefRepeat?(refList(s)) ∧
         cKeyCorrect?(cKeyList(s), oAttList(s)) ∧
         idAttCorrect?(idAttList(s), cKeyList(s)) ∧
         refCorrect?(refList(s), idAttList(s))}

The above definition defines a schema as a record type in a set comprehension with predicates representing associated constraints. In the record type of a schema, lists are used to represent sets of relationship types, attributes and keys that make up a schema. Similar to many definitions that were presented earlier, the use of lists has introduced definitions of the functions such as noRelRepeat?, noOAttRepeat?, noCKeyRepeat? and noRefRepeat?, which are predicates of the record type to prevent the same components being repeated multiple times in a list. The predicate of the record type also includes the cKeyCorrect?, the idAttCorrect? and the refCorrect? functions defined earlier that enforce the constraints on keys and references in a schema. In order for a schema to be valid, it must satisfy all the constraints on a schema and all the other constraints for components of the schema defined as predicates.
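The no-repeat predicates all follow the same pattern. As a sketch, and assuming a definition that is not reproduced in the text, noRefRepeat? could be written with the prelude function member as follows:

```pvs
% Hypothetical sketch of one of the no-repeat predicates. The list is
% duplicate free when its head does not occur in its tail and the tail
% is itself duplicate free.
noRefRepeat?(refList: list[Reference]): RECURSIVE bool =
  CASES refList OF
    null: TRUE,
    cons(head, tail): NOT member(head, tail) AND noRefRepeat?(tail)
  ENDCASES
MEASURE length(refList)
```

The same shape would apply to noRelRepeat?, noOAttRepeat? and the other no-repeat predicates, with only the element type changed. This is what allows a list to stand in for a set.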

3.4 Representing ORA-SS schema diagram in PVS

To illustrate how an ORA-SS schema diagram can be represented using the definitions presented above, we show how the Department-Course schema diagram in Fig. 7 can be represented in a PVS theory called schemaEx as follows.

    schemaEx: THEORY
    BEGIN
      IMPORTING OC
      IMPORTING ATT
      IMPORTING orassSchemaDef[OC, ATT]

      dc: RelType = (:singleton(Course), singleton(Department):)
      cs: RelType = (:singleton(Student), singleton(Course):)
      ...
      dcRel: Relationship = (#rel := dc, degree := 3, pConstraint := (1, many),
                              cConstraint := (1, 1)#)
      csRel: Relationship = (#rel := cs, degree := 2, pConstraint := (1, many),
                              cConstraint := (1, 8)#)
      ...
      courseAtt: ObjectAtt = (#oc := Course, attList := (:code, courseTitle, examVenue:)#)
      studentAtt: ObjectAtt = (#oc := Student, attList := (:studentNo, studentName:)#)
      ...
      csRelAtt: RelationshipAtt = (#rel := cs, attList := (:netID, grade:)#)
      cstRelAtt: RelationshipAtt = (#rel := cst, attList := (:feedback:)#)
      codeCAtt: CompositeAtt = (#att := code, attList := (:courseNo, deptPrefix:)#)
      examVenueDAtt: DisjunctiveAtt = (#att := examVenue,
                                        attList := (:lab, lectureTheatre:)#)
      courseCKey: CandidateKey = (#oc := Course, keyList := (:(:code:):)#)
      studentCKey: CandidateKey = (#oc := Student, keyList := (:(:studentNo:),
                                                               (:netID:):)#)
      ...
      courseIdAtt: IdentifyingAtt = (#oc := Course, attList := (:code:)#)
      studentIdAtt: IdentifyingAtt = (#oc := Student, attList := (:studentNo:)#)
      studentIdAtt2: IdentifyingAtt = (#oc := Student, attList := (:studentName:)#)
      ...
      tutorRef1: Reference = (#referencing := TutorInCourse, referenced := Tutor#)
      ...
      departmentCourse: Schema =
        (#relList := (:dcRel, csRel, ctRel, cstRel, shRel:),
          oAttList := (:deptAtt, courseAtt, studentAtt, tutorAtt, ... , hostelAtt:),
          rAttList := (:csRelAtt, cstRelAtt:),
          cAttList := (:codeCAtt:),
          dAttList := (:examVenueDAtt:),
          cKeyList := (:deptCKey, courseCKey, studentCKey, ... , hostelCKey:),
          idAttList := (:deptIdAtt, courseIdAtt, studentIdAtt, ... , hostelIdAtt:),
          refList := (:tutorRef1, tutorRef2:)#)
    END schemaEx

The schemaEx theory imports the OC and ATT data types constructed for the Department-Course schema, and the orassSchemaDef theory. It defines several local variables holding the components for the Department-Course schema, as shown in the diagram in Fig. 7, and finally defines departmentCourse as a schema, using these local variables. Note that the ‘#’ symbols enclose an instance of the record type, whereas the ‘:’ symbols represent an instance of the list structure.

3.5 PVS semantics of ORA-SS data instance

An ORA-SS data instance is an abstract representation of a semistructured data instance, such as an XML document. The general structure and properties of ORA-SS data instances are constructed on non-empty sets of object classes (OC), attributes (ATT), objects (OBJECT) and attribute values (ATTVALUE), which are provided as arguments to the generic theory named orassDataDef.

    orassDataDef[OC, OBJECT, ATT, ATTVALUE: TYPE+]: THEORY

3.5.1 Basic types

In an ORA-SS data instance diagram, an instance of object class and an instance of attribute are described by an object with its object class and an attribute value with its attribute respectively.

    OBJ: TYPE+ = [# class: OC, object: OBJECT #]

    ATTVAL: TYPE+ = [# attribute: ATT, value: ATTVALUE #]

The OBJ type defines an instance of object class as a non-empty record type with components representing an object class and its instance (object). Similarly, an instance of attribute is defined as a non-empty record type, ATTVAL, with components representing an attribute and its instance (attribute value). These basic types are defined prior to constructing a formal specification of ORA-SS semantics for a data instance.

3.5.2 Object relationships

To describe an instance of a relationship type, the relationship between objects and the hierarchy of the related objects in a tree structure must be recorded. Thus two PVS types representing the above two aspects in the ORA-SS data instance are both defined as a list of objects.

    ObjRelationship: TYPE = {oList: list[OBJ] | length(oList) = 2}

    ObjRelTree: TYPE = {oList: list[OBJ] | noObjCycle?(oList)}

The ObjRelationship type represents the relationship between objects, where its constraint specifies that the object relationship in the list is always binary. This constraint is applied to the type, since we cannot determine whether a child object relates to a parent object or to a parent relationship instance in an ORA-SS data instance such as an XML document. The ObjRelTree type is defined to represent the structural information (tree path) of the data instance as a list of objects, where the constraint noObjCycle? specifies that there are no repeated objects in the list. This definition is essential for reconstructing data instances such as XML documents or ORA-SS data instance diagrams from the PVS representation of the data instance.

3.5.3 Instances of attributes, composite attributes, and disjunctive attributes

In an ORA-SS data instance, instances of attributes that belong to a relationship type or an object class are represented as attribute values that belong to an object. The instances of attributes for both object class and relationship type are defined as a single record type with two components, i.e., the objects (objList) and the attribute values (attValList) that belong to them. The instances of a composite attribute are defined as a record type consisting of an attribute value and a list of attribute values representing instances of the attribute and its components respectively. Similarly, the instances of disjunctive attributes are represented as a record type consisting of two attribute values. As usual, there can be no repeated attribute values.

    AttInstance: TYPE = [# objList: list[OBJ], attValList: list[ATTVAL] #]

    CompositeAttVal: TYPE = {cAttVal: [# attVal: ATTVAL, attValList: list[ATTVAL] #] |
                               noAttValRepeat?(attValList(cAttVal))}

    DisjunctiveAttVal: TYPE = [# attVal1: ATTVAL, attVal2: ATTVAL #]

In the AttInstance type definition, a list of objects is used to specify the exact object in the tree structure, because the same object can appear many times in different tree paths. The CompositeAttVal type is defined using the same structure as the definition for composite attributes in the ORA-SS schema diagram. The predicate for the type prevents attribute values from repeating in the list of composite attribute values. The DisjunctiveAttVal type is similar to the definition for disjunctive attributes in the ORA-SS schema diagram. However, a single attribute value is used instead of a list of attribute values because only one attribute can be selected from a set of disjunctive attributes.

3.5.4 Data instances

To capture the meaning of an ORA-SS data instance, we need to describe the instances of relationship types and instances of attributes. The data instance is defined using a record type, with object relationships (oRelList), object tree path (oRelTreeList) representing relationship instances, and with attribute values (attInstList, cAttValList and dAttValList) representing attribute instances. The additional constraints of the definition specify that there can be no repeated elements in the list of each component.

    Data: TYPE =
      {d: [# oRelList: list[ObjRelationship], oRelTreeList: list[ObjRelTree],
             attInstList: list[AttInstance], cAttValList: list[CompositeAttVal],
             dAttValList: list[DisjunctiveAttVal] #] |
         noORelRepeat?(oRelList(d)) ∧
         noORelTreeRepeat?(oRelTreeList(d)) ∧ attInstRepeat?(attInstList(d)) ∧
         cAttValRepeat?(cAttValList(d)) ∧ dAttValRepeat?(dAttValList(d))}

3.6 Representing ORA-SS data instance in PVS

To illustrate how an ORA-SS data instance can be represented using the definitions presented above, we will show how the XML data instance for the Department-Course schema in Fig. 8 can be represented in a PVS theory called dataEx as follows. The dataEx theory imports the OC, ATT, OBJECT and ATTVALUE data types constructed for the instance of the Department-Course schema, and the orassDataDef theory. It then defines several local variables holding the components for the data instance of the Department-Course schema, as shown in the diagram in Fig. 8, and finally defines departmentCourseInst as a data instance, using these local variables.

    dataEx: THEORY
    BEGIN
      IMPORTING OC
      ...
      IMPORTING orassDataDef[OC, OBJECT, ATT, ATTVALUE]

      dept1Obj: OBJ = (#class := Department, object := department1#)
      course1Obj: OBJ = (#class := Course, object := course1#)
      ...
      computerScienceAV: ATTVAL = (#attribute := deptName,
                                    value := computerScience#)
      cs101AV: ATTVAL = (#attribute := code, value := cs101#)
      examVenue1AV: ATTVAL = (#attribute := examVenue, value := examVenue1#)
      ...
      dc1: ObjRelationship = (:course1Obj, dept1Obj:)
      cs1: ObjRelationship = (:student1Obj, course1Obj:)
      ...
      dctTree1: ObjRelTree = (:tutorInCourse1Obj, course1Obj, dept1Obj:)
      dcstTree1: ObjRelTree = (:tutorInCS2Obj, student1Obj, course1Obj, dept1Obj:)
      ...
      d1AttInst: AttInstance = (#objList := (:dept1Obj:),
                                 attValList := (:computerScienceAV:)#)
      d1c1AttInst: AttInstance = (#objList := (:course1Obj, dept1Obj:),
        attValList := (:cs101AV, principlesOfProgrammingAV, examVenue1AV:)#)
      d1c1s1AttInst: AttInstance = (#objList := (:student1Obj, course1Obj, dept1Obj:),
        attValList := (:sNo123456AV, jsmi123AV, johnSmithAV, aAV:)#)
      ...
      cs101cAttVal: CompositeAttVal = (#attVal := cs101AV,
                                        attValList := (:csAV, cNo101AV:)#)
      examVenue1dAttVal: DisjunctiveAttVal = (#attVal1 := examVenue1AV,
                                               attVal2 := pltAV#)
      ...
      departmentCourseInst: Data =
        (#oRelList := (:dc1, dc2, dc3, ct1, ct2, ct3, ct4, cs1, cs2, cs3, ... , sh3:),
          oRelTreeList := (:dctTree1, dctTree2, dctTree3, dctTree4, ... , dcshTree6:),
          attInstList := (:t1AttInst, t2AttInst, t3AttInst, ... , d1c3s3h1AttInst:),
          cAttValList := (:cs101cAttVal, cs111cAttVal, cs131cAttVal:),
          dAttValList := (:examVenue1dAttVal, ..., examVenue3dAttVal:)#)
      ...
    END dataEx

3.7 PVS definitions of relationship between ORA-SS schema and data instance

With the semantics of the ORA-SS schema diagram and data instance defined, we can specify the generic constraints of the mapping between an ORA-SS schema and its data instance in PVS as shown below. The mapping is defined in the generic theory named orassSchemaNDataDef with the arguments OC, OBJECT, ATT, and ATTVALUE.

    orassSchemaNDataDef[OC, OBJECT, ATT, ATTVALUE: TYPE+]: THEORY

The PVS theories previously defined to describe the semantics of ORA-SS schema diagram and data instance (orassSchemaDef and orassDataDef) are imported to define the mapping between an ORA-SS schema and its data instance.

    IMPORTING orassSchemaDef[OC, ATT]

    IMPORTING orassDataDef[OC, OBJECT, ATT, ATTVALUE]

3.7.1 Instances of relationship types

Instances of relationship types are defined in a manner that is different from the way relationship types for ORA-SS schema are defined. They are defined as a list of binary object relationships and a list of objects representing the structure of the objects in the data instance. The mapping for a relationship type and its instances is defined as a record type that consists of a relationship type and a list of lists of object relationships. The inner list is used to represent the ordered collection of binary object relationships that make up the instance of binary, ternary, or n-ary relationships. The outer list is used to represent the set of relationship instances.

    RelInstance: TYPE =
      {rInst: [# rel: RelType, relInst: list[list[ObjRelationship]] #] |
         isRelInstAll?(toObjListAll(relInst(rInst)), rel(rInst)) ∧
         noRepeatingObjRel?(toObjListAll(relInst(rInst)))}

The additional constraints of relationship instances specify that the object in the object relationships is an instance of the corresponding object class in the relationship type. The type constraints of the RelInstance type transform each list of ObjRelationship into a list of objects, where each object corresponds to an object class in its relationship type, and check whether each object in the list is an instance of the object class in the relationship type using the following functions.

    toObjList(objRelList: list[ObjRelationship]): RECURSIVE list[OBJ] =
      CASES objRelList OF
        null: null,
        cons(head, tail): IF (length(objRelList) = 1) THEN append(head, toObjList(tail))
                          ELSE append((:car(head):), toObjList(tail)) ENDIF
      ENDCASES
    MEASURE length(objRelList)

    toObjListAll(objRelListList: list[list[ObjRelationship]]): RECURSIVE list[list[OBJ]] =
      CASES objRelListList OF
        null: null,
        cons(head, tail): cons(toObjList(head), toObjListAll(tail))
      ENDCASES
    MEASURE length(objRelListList)

    isRelInst?(objList: list[OBJ], ocList: list[set[OC]]): RECURSIVE bool =
      CASES objList OF
        null: TRUE,
        cons(head, tail): ¬ (ocList = null) ∧ (class(head) ∈ car(ocList)) ∧
                          isRelInst?(tail, cdr(ocList))
      ENDCASES
    MEASURE length(objList)

    isRelInstAll?(objListList: list[list[OBJ]], ocList: list[set[OC]]): RECURSIVE bool =
      CASES objListList OF
        null: TRUE,
        cons(head, tail): isRelInst?(head, ocList) ∧ isRelInstAll?(tail, ocList)
      ENDCASES
    MEASURE length(objListList)

The toObjListAll function is defined to recursively go through a relationship instance represented in the RelInstance type and transform each list of ObjRelationship into a list of objects using the toObjList function. The isRelInstAll? function is defined to recursively go through the transformed relationship instances and check whether each object in the list is an instance of the corresponding object class in the relationship type using the isRelInst? function. The other constraint of the RelInstance record type specifies that there should not be any repeated relationship instances in the outer list using the noRepeatingObjRel? function.
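To make the flattening performed by toObjList concrete, the following lemma is a sketch that traces it on a ternary relationship instance. The lemma name and the bound variables t, s and c (standing for a tutor, a student and a course object) are hypothetical, and the child-first ordering of the two binary relationships is an assumption based on the examples in the dataEx theory.

```pvs
% Hypothetical worked example. For the two-element input, the length is
% not 1, so only the child car(head) = t of the first binary relationship
% is kept. The recursive call then sees a one-element list and appends the
% whole pair (: s, c :), giving (: t, s, c :).
toObjList_ternary: LEMMA
  FORALL (t, s, c: OBJ):
    toObjList((: (: t, s :), (: s, c :) :)) = (: t, s, c :)
```

Each inner list literal gives rise to a TCC stating that its length is 2, as required by the ObjRelationship type.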

3.7.2 Cardinalities of relationship types and relationship instances

In the ORA-SS data model, a set of relationship instances must satisfy the specified participation constraints of the corresponding relationship type in the ORA-SS schema. The predicate function correctConstraints? is defined to verify whether the relationship instances satisfy both parent and child participation constraints.

    parentSet(objListList: list[list[OBJ]], loParent: list[OBJ]): RECURSIVE nat =
      CASES objListList OF
        null: 0,
        cons(head, tail): (IF (¬ (head = null) ∧ ¬ (loParent = null) ∧
                               loEqual?(cdr(head), cdr(loParent))) THEN 1 ELSE 0 ENDIF) +
                          parentSet(tail, loParent)
      ENDCASES
    MEASURE length(objListList)

    correctPC?(oRelListList: list[list[ObjRelationship]], objListList: list[list[OBJ]],
               pConst: [nat, posnat]): RECURSIVE bool =
      CASES oRelListList OF
        null: TRUE,
        cons(head, tail): (PROJ_1(pConst) ≤ parentSet(objListList, toObjList(head))) ∧
                          (PROJ_2(pConst) ≥ parentSet(objListList, toObjList(head))) ∧
                          correctPC?(tail, objListList, pConst)
      ENDCASES
    MEASURE length(oRelListList)

    correctConstraints?(r: Relationship, oRelListList: list[list[ObjRelationship]]): bool =
      IF (oRelListList = null) THEN (PROJ_1(pConstraint(r)) = 0)
      ELSE (correctPC?(oRelListList, toObjListAll(oRelListList), pConstraint(r)) ∧
            correctCC?(oRelListList, toObjListAll(oRelListList), cConstraint(r)))
      ENDIF

The predicate function correctConstraints? uses two other recursive predicate functions, correctPC? and correctCC?, to verify whether the relationship instances satisfy the participation constraints of the relationship type. The correctPC? function checks whether the number of relationship instances with the same parent object or parent object relationship lies between the minimum and maximum given by the min:max notation of the parent participation constraint of the relationship type. The parentSet function is defined and used in the correctPC? function to calculate the number of relationship instances with the same parent object or parent object relationship existing in a list of relationship instances for a relationship type. The validation of child participation constraints for relationships can be defined similarly.
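Since the definition of correctCC? is not shown, the following is only a sketch of how it might mirror correctPC?. The helper childSet, the parameter name cConst, and the assumption that the child object is the first element of each flattened relationship instance are all hypothetical.

```pvs
% Hypothetical helper: counts the relationship instances that share the
% same child object, assumed to be the head of each flattened instance.
childSet(objListList: list[list[OBJ]], loChild: list[OBJ]): RECURSIVE nat =
  CASES objListList OF
    null: 0,
    cons(head, tail): (IF (NOT (head = null) AND NOT (loChild = null) AND
                           car(head) = car(loChild)) THEN 1 ELSE 0 ENDIF) +
                      childSet(tail, loChild)
  ENDCASES
MEASURE length(objListList)

% Hypothetical sketch of correctCC?, written by analogy with correctPC?.
correctCC?(oRelListList: list[list[ObjRelationship]], objListList: list[list[OBJ]],
           cConst: [nat, posnat]): RECURSIVE bool =
  CASES oRelListList OF
    null: TRUE,
    cons(head, tail): (PROJ_1(cConst) <= childSet(objListList, toObjList(head))) AND
                      (PROJ_2(cConst) >= childSet(objListList, toObjList(head))) AND
                      correctCC?(tail, objListList, cConst)
  ENDCASES
MEASURE length(oRelListList)
```

The sketch uses the ASCII operators AND, NOT, <= and >= in place of the typeset symbols. The guards on head and loChild come first in the conjunction so that the TCCs for car can be discharged.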

3.7.3 Instances of object attributes and relationship attributes

In the ORA-SS model, schema diagrams clearly distinguish between attributes of object classes and attributes of relationship types, but data instances do not distinguish between these two attribute instances. By examining the schema and data instance, instances of object attributes and instances of relationship attributes can be distinguished and defined for the mappings between schema and data.

    ObjAttInstance: TYPE = [# obj: OBJ, objAttInst: list[ATTVAL] #]

    RelAttInstance: TYPE = [# relInst: list[OBJ], relAttInst: list[ATTVAL] #]

    correctAttInst?(attValList: list[ATTVAL], attList: list[ATT]): bool =
      IF (attList = null) THEN FALSE
      ELSE (list2set(attVal2Att(attValList)) ⊆ list2set(attList)) ENDIF

The instances of object attributes and the instances of relationship attributes are distinguished using record types ObjAttInstance and RelAttInstance respectively. The ObjAttInstance record type consists of an object and a list of attribute values representing the set of attribute values that belong to the object. The RelAttInstance record type consists of a list of objects and a list of attribute values representing the set of attribute values that belong to the relationship instance. There is also a correctAttInst? predicate function that checks whether the list of attribute values is a correct instance of a list of attributes. This predicate function can be used for both object and relationship attribute instances, since both record types represent attribute instances as a list of attribute values and attributes as a list of attributes.
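The correctAttInst? function relies on attVal2Att, whose definition is not shown. A minimal sketch, assuming it simply projects each attribute value onto its attribute, could read:

```pvs
% Hypothetical sketch: maps a list of attribute values to the list of
% attributes they instantiate, using the attribute field of ATTVAL.
attVal2Att(attValList: list[ATTVAL]): RECURSIVE list[ATT] =
  CASES attValList OF
    null: null,
    cons(head, tail): cons(attribute(head), attVal2Att(tail))
  ENDCASES
MEASURE length(attValList)
```

With this reading, correctAttInst? holds when every attribute value in the instance belongs to one of the declared attributes. The subset test does not require every declared attribute to have a value.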

3.7.4 Instances of candidate keys

In the ORA-SS model, a value of a candidate key of an object instance is defined to be unique. In the definition of mappings between schema and data, validation of this property must be provided. For this validation, the correctCKeyInst? predicate function is defined to check uniqueness of each attribute value for a given candidate key.

    correctCKeyInst?(attListList: list[list[ATT]], attValList1, attValList2: list[ATTVAL]):
      RECURSIVE bool =
      CASES attListList OF
        null: TRUE,
        cons(head, tail): ¬ attValListEqual?(findAttListInst(head, attValList1),
                                             findAttListInst(head, attValList2)) ∧
                          correctCKeyInst?(tail, attValList1, attValList2)
      ENDCASES
    MEASURE length(attListList)

The correctCKeyInst? recursive predicate function takes the list of candidate keys and the attribute values of two different objects, and checks that the two objects differ in their values for every candidate key. The recursive function findAttListInst is needed to find the instances of candidate keys among the object attribute instances, because ORA-SS data instances do not contain any information about candidate keys. An identifying attribute and its instances inherit this property of uniqueness from the candidate key, as identifying attributes are selected from the candidate keys. Hence, a separate predicate function for checking uniqueness of identifying attribute values is not necessary.
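The definition of findAttListInst is not reproduced in the text. The sketch below is one possible reading, stated as an assumption: it keeps exactly those attribute values whose attribute occurs in the given attribute list, using the prelude function member.

```pvs
% Hypothetical sketch: selects from attValList the values of the
% attributes in attList, preserving their order of appearance.
findAttListInst(attList: list[ATT], attValList: list[ATTVAL]): RECURSIVE list[ATTVAL] =
  CASES attValList OF
    null: null,
    cons(head, tail): IF member(attribute(head), attList)
                      THEN cons(head, findAttListInst(attList, tail))
                      ELSE findAttListInst(attList, tail) ENDIF
  ENDCASES
MEASURE length(attValList)
```

The argument order matches the calls in correctCKeyInst?, where the first argument is a single candidate key and the second is the attribute value list of one object.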

3.7.5 Instances of object references

In the ORA-SS model, a referenced object instance must exist for every referencing object instance, with the same values for their identifying attributes. In the definition of mappings between schema and data, validation of this property must be provided. The following referencedObjExist? and correctRefInst? functions are defined for this purpose.

    referencedObjExist?(referencedOAttInsts: list[ObjAttInstance],
                        referencingOAttInst: ObjAttInstance,
                        idAtts: list[ATT]): RECURSIVE bool =
      CASES referencedOAttInsts OF
        null: FALSE,
        cons(head, tail): attValListEqual?(findAttListInst(idAtts, objAttInst(head)),
                            findAttListInst(idAtts, objAttInst(referencingOAttInst))) ∨
                          referencedObjExist?(tail, referencingOAttInst, idAtts)
      ENDCASES
    MEASURE length(referencedOAttInsts)

    correctRefInst?(referencingOAttInsts, referencedOAttInsts: list[ObjAttInstance],
                    idAtts: list[ATT]): RECURSIVE bool =
      CASES referencingOAttInsts OF
        null: TRUE,
        cons(head, tail): referencedObjExist?(referencedOAttInsts, head, idAtts) ∧
                          correctRefInst?(tail, referencedOAttInsts, idAtts)
      ENDCASES
    MEASURE length(referencingOAttInsts)

The correctRefInst? function checks the consistency of a list of object instances with a corresponding list of instances that are claimed to be referencing them. The first list of object attribute instances represents the instances of the referencing object class with the associated object attribute instances in a reference, whereas the second list represents the same for the referenced object class. The list of attributes represents the identifying attributes of both referencing and referenced object classes in the reference since both object classes have the same identifying attributes. Using the referencedObjExist? function, the correctRefInst? recursive predicate function checks whether a referencing object has the corresponding referenced object for all referencing objects in a reference. The referencedObjExist? recursive predicate function checks whether there is a referenced object which has the same instance of identifying attributes as the referencing object among all the referenced objects in the reference. Similar to the definition of the correctCKeyInst? function, the recursive function findAttListInst is utilized to find the instances of identifying attributes among the object attribute instances, because ORA-SS data instances do not contain any information about identifying attributes.

3.7.6 Mappings between schemas and data instances

Similar to the Schema type and Data type definitions, the mapping between schema and data is defined as a single record type called SchemaData.

    SchemaData: TYPE =
      {sd: [# relInstList: list[RelInstance],
              oAttInstList: list[ObjAttInstance], rAttInstList: list[RelAttInstance] #] |
         noRelInstRepeat?(relInstList(sd)) ∧ noOAttInstRepeat?(oAttInstList(sd)) ∧
         noRAttInstRepeat?(rAttInstList(sd)) ∧
         disjointRelInst?(toObjListAll(allRelInst?(relInstList(sd))))}

The SchemaData type is defined as a record type that consists of the entire mappings between relationship types, attributes, keys and their instances. Similar to the schema type and data type, lists are used and there are type constraints to check whether each of these lists has any repeating elements. There is an extra constraint for mappings between a relationship type and its instances that checks whether a set of instances for a relationship type is disjoint from a set of instances for a different relationship type. This constraint for checking disjointness of relationship instances is necessary because the instance of a relationship can only belong to a single relationship type.

3.7.7 Data validation against schema

In the orassSchemaNDataDef theory, data can be validated against its schema using the mappings between schema and data defined above. This data validation against the schema is defined using a predicate function called correctSchemaNData? with Schema type, Data type, and SchemaData type as its arguments.

    correctSchemaNData?(s: Schema, d: Data, sd: SchemaData): bool =
      relInstCorrectConstraints?(relList(s), relInstList(sd)) ∧
      oAttInstCorrect?(oAttInstList(sd), oAttList(s)) ∧
      rAttInstCorrect?(rAttInstList(sd), rAttList(s)) ∧
      cAttValCorrect?(cAttValList(d), cAttList(s)) ∧
      dAttValCorrect?(dAttValList(d), dAttList(s)) ∧
      cKeyInstCorrect?(cKeyList(s), oAttInstList(sd)) ∧
      refInstCorrect?(refList(s), oAttInstList(sd), idAttList(s))

The correctSchemaNData? predicate function checks whether every relationship instance meets the participation constraints of the relationship using the relInstCorrectConstraints? function, which utilizes the correctConstraints? function. The predicate function also provides validation of attribute instances against the object attributes, relationship attributes, composite attributes, and disjunctive attributes. All the functions that check for validity of the attribute instances use the correctAttInst? function mentioned earlier. In addition, the predicate function checks the uniqueness of each candidate key instance for every object class using the cKeyInstCorrect? function, which utilizes the correctCKeyInst? function. Similarly, the predicate function checks the existence of the associated referenced object for each referencing object of every reference using the refInstCorrect? function, which utilizes the correctRefInst? function. Once all the predicate functions in correctSchemaNData? are shown to hold, the data is validated against its schema.

3.8 Representing mappings between schema diagrams and data instances in PVS

To illustrate how a mapping between a schema diagram and its data instance can be represented using the definitions presented above, the Department-Course schema diagram in Fig. 7 and its XML data instance in Fig. 8 are used as examples. The representation of the mappings imports the schemaEx, dataEx, and orassSchemaNDataDef theories to provide a formal definition for the mappings between the schema diagram example and the XML data example according to the mapping definition. It then defines several local variables holding the components for the mappings and finally defines departmentCourseMap as an entire mapping between schemaEx and dataEx, using these local variables. The schemaNdata theory also includes a conjecture called correctSchemaNdata_con that can be used to validate the dataEx against the schemaEx by proving the conjecture.

    schemaNdata: THEORY
    BEGIN
      IMPORTING schemaEx
      IMPORTING dataEx
      IMPORTING orassSchemaNDataDef[OC, OBJECT, ATT, ATTVALUE]

      dcRelInst: RelInstance = (#rel := dc, relInst := (:(:dc1:), (:dc2:), (:dc3:):)#)
      ctRelInst: RelInstance = (#rel := ct,
                                 relInst := (:(:ct1:), (:ct2:), (:ct3:), (:ct4:):)#)
      ...
      d1OAttInst: ObjAttInstance = (#obj := dept1Obj,
                                     objAttInst := (:computerScienceAV:)#)
      c1OAttInst: ObjAttInstance = (#obj := course1Obj,
        objAttInst := (:cs101AV, principlesOfProgrammingAV, examVenue1AV:)#)
      s1OAttInst: ObjAttInstance = (#obj := student1Obj,
        objAttInst := (:sNo123456AV, jsmi123AV, johnSmithAV:)#)
      ...
      c1s1RAttInst: RelAttInstance = (#relInst := (:student1Obj, course1Obj:),
                                       relAttInst := (:aAV:)#)
      c1s1t2RAttInst: RelAttInstance = (#relInst := (:tutorInCS2Obj, student1Obj,
                                         course1Obj:), relAttInst := (:goodAV:)#)
      ...
      departmentCourseMap: SchemaData =
        (#relInstList := (:dcRelInst, ctRelInst, csRelInst, cstRelInst, shRelInst:),
          oAttInstList := (:t1OAttInst, t2OAttInst, t3OAttInst, ... , h2OAttInst:),
          rAttInstList := (:c1s1RAttInst, c1s2RAttInst, ... , c3s3t1RAttInst:)#)

      correctSchemaNdata_con: CONJECTURE
        correctSchemaNData?(departmentCourse, departmentCourseInst,
                            departmentCourseMap)
    END schemaNdata

3.9 Formal validation of ORA-SS schema and data instances

Using the formally defined ORA-SS semantics, we can perform automatic validation on both schema diagrams and data instances via the PVS system. For example, the Department-Course schema example in Fig. 7 can be validated against the ORA-SS semantics, and the XML data instance of the Department-Course schema in Fig. 8 can be validated against its schema. In order to demonstrate the automated validation, we show that the pre-introduced errors in both Fig. 7 and Fig. 8 can be successfully detected by PVS. For each theory, the PVS type checker uses the theorem prover to automatically prove all the Type Correctness Conditions (TCCs), which refer to type constraints applied to various types in the theories. If there is a complicated type constraint that cannot be verified automatically by the theorem prover, the type checker returns unproved TCCs. The unproved TCCs can be proved interactively using the predefined theorems and lemmas of PVS. For unsuccessful proofs, efforts mainly involve diagnosing and correcting the errors in the schema or data instance. After type checking the theories, the conjectures in the theories must be proven to complete the validation. Like the unproved TCCs, the conjectures can also be proven interactively. Strictly speaking, failing to discover a proof in a theorem prover does not necessarily mean a conjecture is false; the failure may instead stem from the way the proof was constructed. However, if a contradiction is observed during the proof steps, it clearly indicates possible errors in the specification, as the unprovable case is certain. Once all the theories are type checked and all the conjectures are proven, validation of the schema and the data instance is complete, proving that the schema is consistent with respect to the ORA-SS semantics and the data is consistent with respect to its schema.

3.9.1 Validation of ORA-SS schema against ORA-SS semantics

We demonstrate the validation of the Department-Course schema diagram in Fig. 7 as follows. This validation is conducted through type checking, since all the constraints for the schema definition in the ORA-SS semantics are applied using type constraints. The schemaEx theory is verified to be consistent with the semantics of ORA-SS only if the TCCs generated by the type checker are proved. The type checking of the schemaEx theory produces three unproved TCCs. When the TCC for relationship type dc is verified using the theorem prover, it results in a state that cannot be proven. With respect to the first semantic error illustrated in Sect. 3.1, the verification of the TCC for relationship type dc reaches a state where the degree of the dc relationship type is incorrect unless the clause 3 = 2 is proved. This shows a contradiction in the proof steps and clearly indicates that there is an error in the original specification, where the degree of the relationship type dc should be 2 instead of 3.

Fig. 10 Verifying the type correctness condition for identifying attributes

After correcting the above representation, the schema type departmentCourse still results in a state that cannot be proved, as shown in Fig. 10. The noIdAttRepeat? function checks whether there is more than one set of identifying attributes for a single object class. In this case, Fig. 10 indicates that there are two identifying attributes, i.e., studentNo and studentName, defined for the Student object class, which is not allowed by the ORA-SS language. Similarly, the third semantic error illustrated in Sect. 3.1 was detected. After all three introduced errors are corrected, all the TCCs for the schemaEx theory are verified, indicating that the schema diagram example in Fig. 7 is semantically correct.
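The kind of condition that noIdAttRepeat? enforces can be illustrated outside PVS. The following Python sketch is not part of the PVS specification: the function name classes_with_repeated_id and the list-of-pairs layout are hypothetical, chosen only to show the idea of flagging an object class that declares more than one identifier.

```python
from collections import defaultdict

def classes_with_repeated_id(id_decls):
    """Return the object classes that declare more than one identifier.

    id_decls is a list of (object_class, identifying_attribute_set) pairs;
    this layout is an assumption made for the sketch.
    """
    by_class = defaultdict(set)
    for obj_class, id_atts in id_decls:
        by_class[obj_class].add(frozenset(id_atts))
    return sorted(c for c, ids in by_class.items() if len(ids) > 1)

# Identifier declarations loosely modelled on the seeded error in Fig. 7.
seeded = [
    ("Department", {"deptName"}),
    ("Course", {"code"}),
    ("Student", {"studentNo"}),
    ("Student", {"studentName"}),  # second identifier: the seeded error
]
corrected = seeded[:3]

print(classes_with_repeated_id(seeded))     # ['Student']
print(classes_with_repeated_id(corrected))  # []
```

In PVS the same condition is imposed as a type constraint, so a violation surfaces as an unprovable TCC rather than as a returned list.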

3.9.2 Validation of data instance against ORA-SS schema

We demonstrate the validation of the XML data instance in Fig. 8 as follows. This validation is conducted through type checking and the verification of predicate functions represented as conjectures. First, the PVS representation for the XML data example and the mappings between the schema and the XML data must be type checked to prove whether the definitions satisfy the type constraints applied. Then, the correctSchemaNData_con conjecture in the schemaNData theory must be verified through the theorem prover to prove whether the data is valid against its schema. The type checking of the dataEx and schemaNData theories proves all Type Correctness Conditions (TCCs) automatically, since only a small number of type constraints are applied. In contrast, the proof attempt for the correctSchemaNData_con conjecture in the schemaNData theory results in a state that cannot be proved, as shown in Fig. 11. It indicates that the candidate key instances of the Student object class do not satisfy the constraints. A candidate key carries the constraint that its value must be unique, but the findCKeyInst function, used by the correctCkey function to check this constraint, shows that the candidate key value sNo123456 is related to two different students, i.e., student1 and student2. Similarly, the other two instance errors that were illustrated in Sect. 3.2 can be detected effectively.

Fig. 11 Validating the uniqueness of identifying attribute studentNo
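The uniqueness constraint on candidate key values can be pictured with a small Python sketch. It is an illustration under stated assumptions and not the PVS definition of findCKeyInst or correctCkey: the function name duplicated_key_values, the pair layout, and the value given to student3 are hypothetical.

```python
from collections import defaultdict

def duplicated_key_values(key_instances):
    """Map each key value held by more than one object to those objects.

    key_instances is a list of (object, candidate_key_value) pairs.
    """
    owners = defaultdict(set)
    for obj, key_value in key_instances:
        owners[key_value].add(obj)
    return {k: sorted(objs) for k, objs in owners.items() if len(objs) > 1}

student_keys = [
    ("student1", "sNo123456"),
    ("student2", "sNo123456"),  # seeded error: same studentNo as student1
    ("student3", "sNo345678"),  # hypothetical value for illustration
]

print(duplicated_key_values(student_keys))
# {'sNo123456': ['student1', 'student2']}
```

An empty result corresponds to the case in which the uniqueness part of the conjecture can be discharged.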

After all three introduced errors are corrected, the correctSchemaNData_con conjecture is verified. Thus the XML data example in Fig. 8 is verified to be valid against its schema. Using the verification support of PVS with predefined theorems and lemmas, the definitions of the schema and its data instance can be effectively validated. As shown by the validation of the schema and data examples, the automated validation with PVS also provides diagnostic results, under the assumption that counterexamples can be observed, which indicate the parts of the schema or data instances that violate the semantics of ORA-SS. This diagnostic indication is very useful in correcting the semantic errors of the ORA-SS schema and its data instances.


Fig. 12 An ORA-SS schema diagram of an incorrectly normalized Department-Course data model

4 Correctness criteria and verification for semistructured data normalization

In database systems for semistructured data, normalization removes or minimizes data redundancies by transforming the schema according to the keys and functional dependencies. A possible schema transformation includes combinations of relocating attributes, swapping object classes around, adding references, and introducing new object classes. During these transformations, the semantic equivalence between the transformed ORA-SS schema and its original form must be ensured; otherwise, data inconsistencies can arise, especially if the normalization algorithm was incorrectly designed. For example, consider Fig. 12 as the result of normalizing the ORA-SS schema in Fig. 5 in Sect. 2.

As shown earlier, the Department-Course schema in Fig. 5 contains the data redundancies caused by the tutorialCoordinator attribute and the Student object class. The corresponding XML data instance in Fig. 6 clearly shows that the attribute value Paul Cox of the tutorialCoordinator attribute and the values of all the attributes for Student John Smith are repeated. In order to remove these redundancies, the Department-Course schema in Fig. 5 is normalized by transforming the schema into a new form shown in Fig. 12. In this schema diagram, the tutorialCoordinator attribute is relocated to the Department object class. The Student object class with all of its attributes and child object classes is also moved and now belongs to the Department object class. These transformations have removed all the redundancies, as shown in the corresponding XML document in Fig. 13.

Even though the redundancies are removed as demonstrated in the XML data instance, it is clear that the “normalized” schema in Fig. 12 is not semantically equivalent to its original form. Firstly, it is not possible for the identifying attribute code of the Course object class to determine tutorialCoordinator in the “normalized” schema. Consequently, the functional dependency Department.Course.@code → Department.Course.Tutor.@tutorialCoordinator that was given for the original schema is lost. Secondly, the normalized ORA-SS schema in Fig. 12 introduces spurious data. If the XML data in Fig. 13 were queried to find information about students who are enrolled in the cs101 course, information about every student in the Computer Science department would be returned. This information is spurious since, according to the original XML in Fig. 6, not every student is enrolled in the cs101 course. The spurious data is introduced because the relationship between the Course object class and the Student object class is lost. Similarly, the association between a Course object class and a tutorialCoordinator attribute is lost, which produces inconsistent information. The above examples show that data redundancies caused by the design of semistructured data schemas can be removed or minimized by transforming the schema. However, the transformations may not guarantee the preservation of data and its constraints, as some of the semantic meaning in the original schema can be lost or altered incorrectly during the transformations. Therefore, it is essential to have a clear set of criteria for verifying the correctness of semistructured data normalization.

Fig. 13 An XML data instance of an incorrectly normalized Department-Course data model

Fig. 14 Verification of semistructured data normalization

A normalization is considered to be correct if the transformation ensures that the transformed schema is semantically equivalent to its original schema in an appropriate sense. In the context of relational database systems, the result of normalization is considered to be semantically equivalent to the original relation if it satisfies two properties, i.e., dependency preserving and lossless. The former ensures that each functional dependency given for the original schema is enforced by the transformed schema, and the latter guarantees that no information is lost and no spurious information is introduced as a result of the transformation [19, 22]. These two properties can be adapted and extended to define correctness criteria for verifying semistructured data normalization. Figure 14 shows the overall approach for verifying the correctness of semistructured data normalization. The verification checks whether the semantic equivalence between a transformed schema and its original schema is preserved, using the correctness criteria for semistructured data normalization defined in terms of dependency preserving and lossless properties. In Fig. 14, consider that a Normalization Algorithm takes an initial schema (ORA-SS Schema 1) and its given set of Functional Dependencies, and produces a new schema (ORA-SS Schema 2). The given set of Functional Dependencies and the transformed ORA-SS Schema 2 are then used in the Correctness Criteria to verify the semantic equivalence of the transformation. The Dependency Preserving criterion requires that the given set of Functional Dependencies for ORA-SS Schema 1 is still preserved in ORA-SS Schema 2, and the Lossless criterion requires that no data in ORA-SS Schema 1 is lost in ORA-SS Schema 2 and no new data is created. In other words, it requires that every set of data that can be queried from the data instances of the original ORA-SS Schema 1, and no other, can also be queried from the data instances of the transformed ORA-SS Schema 2.

In order to provide adequate and effective verification for semistructured data normalization, the PVS theorem prover is applied to the above approach. On top of the PVS formal semantics of the ORA-SS data modeling language, the correctness criteria are defined in a single PVS theory in terms of dependency preserving and lossless properties. The PVS theory also contains the definitions of functional dependency, Armstrong’s Axioms and closure, since they are required for the dependency preserving and lossless properties to prove the correctness of semistructured data normalization. With the formally defined PVS theories of the correctness criteria and the semantics of the ORA-SS schema, effective verification of semistructured data normalization can be performed using the PVS theorem prover. A transformed ORA-SS schema diagram and the set of functional dependencies given for the original schema can be translated into the vocabulary provided by the defined PVS theories. The PVS representation of the transformed schema and functional dependencies can then be used for verification as described in Sect. 4.7. When a normalization is verified to be incorrect, counterexamples can be used to produce diagnostic results indicating which functional dependency cannot be enforced or what data is lost, and hence to suggest corrections.

In this section, we present the PVS definitions of the correctness criteria for semistructured data normalization together with the desired verification. The example of an incorrectly normalized ORA-SS schema diagram that is introduced in Fig. 12 will be used to demonstrate the verification of the normalization for semistructured data in terms of dependency preserving and lossless properties. In order to provide a concise presentation of the correctness criteria for the normalization of semistructured data, some of the PVS definitions for the correctness criteria are abbreviated and similar definitions are presented only once. A complete PVS definition of the correctness criteria for semistructured data normalization can be found in [26].

4.1 Functional dependencies

In order to verify the semantic equivalence between two ORA-SS schemas, the entire set of functional dependencies enforced on a schema must be derived to see whether the constraints between attributes are preserved. The given set of functional dependencies for a schema is often provided as a subset of the entire set of functional dependencies that are required to be enforced on the schema. This subset acts as a minimal set from which all the other dependencies can be inferred. When verifying a normalization of an ORA-SS schema, the set of all functional dependencies inferred from the given set must be obtained to check the preservation of every possible constraint implied on the schema. The standard way to obtain this is to derive the closure of the given functional dependencies using Armstrong’s Axioms [19, 22]. Armstrong’s Axioms are the collection of essential inference rules defined and verified for functional dependencies. The closure of the functional dependencies consists of the given set of functional dependencies together with all other functional dependencies that can be inferred from the given set. We can use Armstrong’s Axioms and derive the closure to represent the entire set of functional dependencies of a schema. However, Armstrong’s Axioms are defined over functional dependencies specified as relations between sets of attributes, i.e., the ‘X → Y’ form. This representation describes the constraint between data where X, a set of attributes, determines Y, another set of attributes. Thus, for the axioms to be used in the semistructured data context, either a version of Armstrong’s Axioms for functional dependencies in path notation must be defined and verified, or functional dependencies in path notation must be converted into the ‘X → Y’ form. The latter approach was adopted for simplicity.

The conversion can be easily achieved without losing or corrupting constraints among data if the following two assumptions are placed on the ORA-SS schema: (1) every object class in the ORA-SS schema has a set of identifying attributes; (2) every attribute and object class has a unique name, except for identifying attributes that are used in references. The first assumption eliminates the use of object classes in the path functional dependencies, since identifying attributes can be used to uniquely identify the object class. The second assumption eliminates the need for path notations, since every object class and attribute can be distinguished by its name. It also eliminates the need for distinguishing the same attributes and object classes between the original and transformed ORA-SS schemas, since the schema transformation changes the path notation of object classes or attributes in the original schema. It is straightforward to convert any ORA-SS schema into one that satisfies these assumptions. Identifying attributes can be added to the object classes, and a unique name can be given to the attributes and object classes that have the same name, to satisfy the assumptions without corrupting the schema or data. With this simple conversion of the ORA-SS schema, a path functional dependency can be converted into the ‘X → Y’ form. In the conversion of a path functional dependency, every path that constructs the ‘pathX → pathY’ notation must be described by attributes. A path that refers to an attribute can be replaced by the attribute, whereas a path that refers to an object class can be replaced by the identifying attributes of the object class. For example, the path functional dependency ‘Department.Course → Department’ given for the ORA-SS schema in Fig. 12 can be converted into ‘code → deptName’. The Department.Course path is converted into code since the attribute code is an identifying attribute that determines the object class Course, and Department is converted into deptName similarly. The valid conversion allows the utilization of Armstrong’s Axioms to produce the closure from the given set of functional dependencies.
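The conversion rule just described can be sketched in a few lines of Python. This is an illustrative sketch only, assuming the two assumptions above hold so that the last component of a path names its target unambiguously; the function names and the ID_ATTS and ATTRIBUTES tables are hypothetical and are not part of the PVS theories.

```python
def path_to_attributes(path, id_atts, attributes):
    """Replace a path by the attributes it denotes.

    A path ending in an attribute maps to that attribute; a path ending in
    an object class maps to the identifying attributes of that class.
    """
    last = path.split(".")[-1].lstrip("@")
    if last in attributes:
        return frozenset({last})
    return frozenset(id_atts[last])

def convert_path_fd(lhs_paths, rhs_path, id_atts, attributes):
    """Convert 'pathX -> pathY' into a pair (X, Y) of attribute sets."""
    x = frozenset().union(
        *(path_to_attributes(p, id_atts, attributes) for p in lhs_paths))
    y = path_to_attributes(rhs_path, id_atts, attributes)
    return x, y

# Hypothetical lookup tables for part of the Department-Course schema.
ID_ATTS = {"Department": {"deptName"}, "Course": {"code"}}
ATTRIBUTES = {"deptName", "code", "tutorialCoordinator"}

fd1 = convert_path_fd(["Department.Course"], "Department", ID_ATTS, ATTRIBUTES)
fd2 = convert_path_fd(["Department.Course.@code"],
                      "Department.Course.Tutor.@tutorialCoordinator",
                      ID_ATTS, ATTRIBUTES)
print(fd1)  # X = {'code'}, Y = {'deptName'}
print(fd2)  # X = {'code'}, Y = {'tutorialCoordinator'}
```

Because names are assumed to be unique, the prefix of each path plays no role in the lookup, which is exactly why the second assumption removes the need for path notation.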

Armstrong’s Axioms consist of three main inference rules that are sound and complete [19, 22]. This means that any functional dependency inferred by the rules holds in every instance of the schema, and all the possible functional dependencies for the schema can be inferred using these rules. The following are the three inference rules, i.e., the reflexive, augmentation and transitive rules, where X, Y and Z are arbitrary sets of attributes in a schema.

– Reflexive rule: X ⊇ Y |= X → Y,
– Augmentation rule: {X → Y} |= XZ → YZ,
– Transitive rule: {X → Y, Y → Z} |= X → Z.

The reflexive rule states that any set of attributes in a schema determines itself and each of its subsets. The augmentation rule states that a functional dependency can be inferred by adding the same set of attributes to the left and right hand sides of an existing functional dependency. The transitive rule states that if the functional dependencies ‘X determines Y’ and ‘Y determines Z’ exist, then another functional dependency ‘X determines Z’ can be inferred. With these three rules, the closure of the functional dependencies given for a schema can be obtained by repeatedly applying the inference rules to the dependencies until no more dependencies can be inferred.
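As a small worked illustration of what the three rules entail, the Python sketch below uses the standard textbook attribute-closure technique: X → Y follows from a set of dependencies exactly when Y is contained in the set of attributes reachable from X. This is a swapped-in technique for illustration and not the PVS formulation used in this paper; the function names and the abstract attributes A, B, C, Z are assumptions.

```python
def attribute_closure(x, fds):
    """Return all attributes determined by the attribute set x.

    fds is a list of (lhs, rhs) pairs, each a set of attribute names.
    """
    closure = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def implies(fds, x, y):
    """True if X -> Y can be inferred from fds by Armstrong's Axioms."""
    return set(y) <= attribute_closure(x, fds)

fds = [({"A"}, {"B"}), ({"B"}, {"C"})]

print(implies(fds, {"A", "C"}, {"A"}))       # reflexive rule: True
print(implies(fds, {"A", "Z"}, {"B", "Z"}))  # augmentation rule: True
print(implies(fds, {"A"}, {"C"}))            # transitive rule: True
print(implies(fds, {"C"}, {"A"}))            # not inferable: False
```

The loop terminates because the closure can only grow and is bounded by the finite set of attributes mentioned in the dependencies.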

4.2 PVS definitions of functional dependencies

In the verification process for dependency preserving and lossless properties, the functional dependencies given for the original schema and the closure of functional dependencies are utilized to prove the correctness of normalization for semistructured data. The functional dependencies for a schema are often given in the form of a minimal cover. The minimal cover is an irreducible set of functional dependencies that represents the minimal number of functional dependencies required to be enforced on the schema. It is often used to precisely represent the required functional dependencies, since all other functional dependencies can be inferred from the minimal cover. The functional dependencies in a minimal cover can be represented in a standard ‘X → Y’ format where only a single attribute is allowed for the Y component. A functional dependency with a set of attributes in the Y component is reduced to a set of functional dependencies with a single attribute in the Y component. This representation describes the dependencies between attributes more precisely and it is also more effective in deriving inferred functional dependencies. Since the minimal cover is a common and concise representation of dependencies between attributes, it is assumed that the minimal cover is given for ORA-SS schemas. The functional dependencies in a minimal cover, Armstrong’s Axioms, and the closure are defined in the PVS formal language below.

FD: TYPE = [set[ATT], ATT]

The FD type above is defined as a tuple consisting of a set of attributes and an attribute to represent a minimal cover of functional dependencies in the form of ‘X → Y’. In this definition, the set of attributes refers to the X component and the attribute refers to the Y component of a functional dependency. We assume that path functional dependencies used for semistructured data are converted into functional dependencies in the form of ‘X → Y’ and that the minimal cover is given for a schema. Note that path constraints are introduced as PVS data objects, and these data objects are later manipulated to find their closure.
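The shape of the FD type, a set of attributes paired with one attribute, can be mirrored in Python as a (frozenset, str) pair. The sketch below is illustrative only: it shows just the reduction of a set-valued Y component to single attributes, not the removal of redundant dependencies that a full minimal cover also requires. The name split_rhs and the attribute title are hypothetical.

```python
def split_rhs(fds):
    """Rewrite each X -> Y, with Y a set, as one (X, a) pair per a in Y."""
    return {(frozenset(x), a) for x, y in fds for a in y}

given = [({"code"}, {"title", "tutorialCoordinator"})]
single = split_rhs(given)

for x, y in sorted(single, key=lambda fd: fd[1]):
    print(sorted(x), "->", y)
# ['code'] -> title
# ['code'] -> tutorialCoordinator
```

Using frozenset for the X component keeps each pair hashable, so a collection of dependencies can itself be stored in a Python set.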

With the PVS formal definition of functional dependency, Armstrong’s Axioms are defined as a function along with two other auxiliary functions to describe and apply inference rules to a set of functional dependencies. The defined functions go through every combination of two functional dependencies in the given set and derive a functional dependency that can be inferred from them.

findAllY(x: set[ATT], fdList: list[FD]): RECURSIVE set[ATT] =
  CASES fdList OF
    null: emptyset,
    cons(h, t): IF (x = PROJ_1(h)) THEN (findAllY(x, t) ∪ {PROJ_2(h)})
                ELSE findAllY(x, t) ENDIF
  ENDCASES
MEASURE length(fdList)

applyAAxiom(fdList: list[FD], fd1X: set[ATT], fd1Y: ATT, fd1AllY: set[ATT]):
    RECURSIVE list[FD] =
  CASES fdList OF
    null: null,
    cons(h, t): IF (fd1Y ∈ PROJ_1(h))
                THEN cons(((fd1X ∪ (PROJ_1(h) \ fd1AllY)), PROJ_2(h)),
                          applyAAxiom(t, fd1X, fd1Y, fd1AllY))
                ELSE applyAAxiom(t, fd1X, fd1Y, fd1AllY) ENDIF
  ENDCASES
MEASURE length(fdList)

armstrongsAxiom(fdList1, fdList2: list[FD]): RECURSIVE list[FD] =
  CASES fdList1 OF
    null: null,
    cons(h, t): append(cons(h, applyAAxiom(fdList2, PROJ_1(h), PROJ_2(h),
                                           findAllY(PROJ_1(h), fdList2))),
                       armstrongsAxiom(t, fdList2))
  ENDCASES
MEASURE length(fdList1)

The applyAAxiom and armstrongsAxiom functions apply inference rules of Armstrong’s Axioms to a given list of functional dependencies and produce a new list of functional dependencies. The list structure is used instead of the set to recursively apply inference rules. The new list of functional dependencies derived from the armstrongsAxiom function contains the given list of functional dependencies and the functional dependencies that are inferred from it. The armstrongsAxiom function is defined with two lists of given functional dependencies as its input. The two given lists of functional dependencies are used in the definition to apply inference rules to every pair of functional dependencies in the given set. It recursively goes through each functional dependency in the given set and applies inference rules using the applyAAxiom function. For each given functional dependency (fd1), the applyAAxiom function finds another functional dependency (fd2) from the given set of functional dependencies where the Y component of fd1 is contained in the X component of fd2. It then derives an inferred functional dependency using the result of the findAllY function. This function finds all the Y components of functional dependencies with the same X component from the given set of functional dependencies. The inferred functional dependency starts with the set of attributes that belong to the X component of fd2 but are not contained in the set of all Y components. The X component of fd1 is added to the derived set of attributes and assigned as the X component of the inferred functional dependency. The Y component of fd2 is simply assigned as the Y component of the inferred functional dependency.
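To make the data flow of the three functions easier to follow, the sketch below transliterates them into Python. It is a reading aid under stated assumptions and not a replacement for the PVS definitions: functional dependencies are modelled as (frozenset, str) pairs, recursion over lists is replaced by comprehensions and loops, and the snake_case names and the attributes A, B, C, W are hypothetical.

```python
def find_all_y(x, fd_list):
    """All Y components of the dependencies whose X component equals x."""
    return frozenset(y for lhs, y in fd_list if lhs == x)

def apply_a_axiom(fd_list, fd1_x, fd1_y, fd1_all_y):
    """For each fd2 whose X contains fd1's Y, infer
    (fd1_x united with (X of fd2 minus fd1_all_y)) -> Y of fd2."""
    return [(fd1_x | (lhs - fd1_all_y), y)
            for lhs, y in fd_list if fd1_y in lhs]

def armstrongs_axiom(fd_list1, fd_list2):
    """Each dependency of fd_list1 followed by what it infers with fd_list2."""
    result = []
    for x, y in fd_list1:
        result.append((x, y))
        result.extend(apply_a_axiom(fd_list2, x, y, find_all_y(x, fd_list2)))
    return result

A, B, BW = frozenset({"A"}), frozenset({"B"}), frozenset({"B", "W"})

transitive = [(A, "B"), (B, "C")]
print(armstrongs_axiom(transitive, transitive))
# keeps A -> B and B -> C, and infers A -> C between them

pseudo = [(A, "B"), (BW, "C")]
print(armstrongs_axiom(pseudo, pseudo))
# keeps A -> B and BW -> C, and infers AW -> C between them
```

The two examples show the same code path producing a transitive inference when the X component of fd2 is exactly the Y of fd1, and a pseudo-transitive inference when fd2 has extra attributes on its left hand side.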

Note that the inference rule defined in the applyAAxiom function represents all the necessary rules of Armstrong’s Axioms combined into one single rule. The combined rule produces all the inferred functional dependencies required to verify the normalization of semistructured data in terms of the dependency preserving and lossless properties. In the dependency preserving property, only the projected functional dependencies are considered among the set of inferred functional dependencies and compared with the functional dependencies given for the original schema. Similarly, in the lossless property, only the functional dependencies that have both X and Y components contained in the set of attributes related through object classes or relationships are considered.

With this in mind, we can see that the functional dependencies inferred by directly applying the reflexive and augmentation rules of Armstrong’s Axioms are not required, since we assume that the functional dependencies are given in minimal cover where the Y component only consists of a single attribute. However, the transitive rule in Armstrong’s Axioms must be considered in verifying the semistructured data normalization in terms of dependency preserving and lossless properties, since the projection of functional dependencies inferred by the transitive rule is not dependent on the original functional dependencies. In addition, various inference rules such as the union rule ({X → Y, X → Z} |= X → YZ), the decomposition rule (X → YZ |= {X → Y, X → Z}), and the pseudo-transitivity rule ({X → Y, WY → Z} |= WX → Z) can be derived by combining the basic rules [19, 22]. Among these derived rules, the union and decomposition rules do not apply to our approach, since we assume that the functional dependencies are given in minimal cover. But the pseudo-transitivity rule must be considered since it is not dependent on the original functional dependencies, similar to the transitive rule. The combined inference rule in the applyAAxiom function applies the transitive and pseudo-transitivity rules to derive all the inferred functional dependencies that are meaningful to the dependency preserving and lossless properties.

After specifying Armstrong’s Axioms in PVS, the closure of functional dependencies is defined as a function. It recursively applies the combined inference rule of Armstrong’s Axioms to the given set of functional dependencies until no more functional dependencies can be inferred. It generates a set of functional dependencies that includes the given set of functional dependencies and all the functional dependencies that can be inferred from it using the derived inference rule in the applyAAxiom function.

findClosure(fdList: list[FD]): RECURSIVE list[FD] =
  IF listEqual?(fdList, armstrongsAxiom(fdList, fdList)) THEN fdList
  ELSE findClosure(armstrongsAxiom(fdList, fdList)) ENDIF
MEASURE findBound(fdList)

The findClosure function is defined over lists of functional dependencies, where the list in its domain represents the given set of functional dependencies and the list in its range represents the closure. Using the previously defined armstrongsAxiom function, it recursively applies inference rules to the given set of functional dependencies until no more functional dependencies can be inferred. The listEqual? function compares and checks the equality of two lists of functional dependencies regardless of the order in which the functional dependencies are listed. The findClosure function uses the listEqual? function to check whether any additional functional dependencies have been inferred by armstrongsAxiom. The findBound function gives the upper-bound MEASURE value of the recursion. For example, if there are n distinct attributes in the functional dependency set, the maximum number of functional dependency pairs that could be produced is n*(n-1). This value can be used to ensure the termination of the recursive function, since the distance to the bound decreases with each iteration. However, the actual computation still depends on listEqual? to determine whether the closure is complete. As the number of functional dependencies in an ORA-SS schema is usually small, this upper-bound value is sufficient to compute the closure.
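The fixed-point iteration performed by findClosure can be pictured with the following Python sketch. It is an approximation under stated assumptions and not the PVS definition: Python sets replace lists, so the order-insensitive listEqual? test becomes plain set equality, and termination follows from the finite number of possible dependencies over the given attributes rather than from findBound. The names step and find_closure and the attributes A to D are hypothetical.

```python
def step(fds):
    """One pass of the combined transitive / pseudo-transitive rule."""
    inferred = set(fds)
    for x1, y1 in fds:
        # All Y components that share the X component of the first dependency.
        all_y = frozenset(y for x, y in fds if x == x1)
        for x2, y2 in fds:
            if y1 in x2:
                inferred.add((x1 | (x2 - all_y), y2))
    return inferred

def find_closure(fds):
    """Apply step repeatedly until no new dependency is inferred."""
    current = set(fds)
    while True:
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt

A, B, C = frozenset({"A"}), frozenset({"B"}), frozenset({"C"})
given = {(A, "B"), (B, "C"), (C, "D")}
closure = find_closure(given)

for x, y in sorted(closure, key=lambda fd: (sorted(fd[0]), fd[1])):
    print(sorted(x), "->", y)
```

For this chain the first pass adds A → C and B → D, the second pass adds A → D, and the third pass adds nothing, so the loop stops with six dependencies.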

4.3 Dependency preserving property

The concept of the dependency preserving property is to ensure the preservation of dependencies between the data when the schema is transformed during a normalization process. According to this concept, the dependency preserving property is constructed to suit the semistructured data context for verifying a normalization of semistructured data. In the transformation performed by semistructured data normalization, the dependency preserving property is satisfied if the entire set of functional dependencies of the original schema is projected onto the transformed schema. The projection of a functional dependency onto a schema represents the possibility for the dependencies between attributes to exist in the schema. In order to verify the satisfaction of the dependency preserving property in an ORA-SS schema, the projection of functional dependencies specific to the semistructured data context has been defined using the ORA-SS schema diagram. In an ORA-SS schema, the dependencies between attributes can only exist if the attributes are related through object classes or relationships, by belonging to the same object class or to the same relationship. The set of attributes that are related through a relationship includes the identifying attributes of the object classes involved in the relationship along with the relationship attributes, as the identifying attributes distinguish the object classes that make up the relationship from others. A functional dependency is projected onto the transformed ORA-SS schema if both the X and Y components of the functional dependency are present in any set of related attributes in the transformed schema. In order to derive every functional dependency of the original ORA-SS schema projected onto the transformed ORA-SS schema for the dependency preserving property, the projected functional dependencies are derived from the closure of the given set of functional dependencies of the original ORA-SS schema. The transformed schema is verified to be dependency preserving if every functional dependency of the original schema is contained in the closure of the functional dependencies projected onto the transformed schema.
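The projection and the dependency preserving check described above can be sketched as follows. This is a hedged illustration and not the PVS definition presented in Sect. 4.4: membership in the closure of the projected dependencies is tested with the standard attribute-closure technique, the caller is assumed to pass the closure of the given dependencies, and the attribute sets used for the two schemas are simplified, hypothetical stand-ins for the diagrams in this paper.

```python
def project(fds, related_sets):
    """Keep dependencies whose X and Y all lie in one set of related attributes."""
    return {(x, y) for x, y in fds if any(x | {y} <= s for s in related_sets)}

def attribute_closure(x, fds):
    """Attributes determined by x under fds, given as (frozenset, attribute) pairs."""
    closure = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, y in fds:
            if lhs <= closure and y not in closure:
                closure.add(y)
                changed = True
    return closure

def dependency_preserving(original_fds, closure_fds, related_sets):
    """True if every original dependency follows from the projected ones."""
    projected = project(closure_fds, related_sets)
    return all(y in attribute_closure(x, projected) for x, y in original_fds)

# Hypothetical given dependencies; they contain no chains, so the given
# set is already its own closure under the combined rule.
given = {(frozenset({"code"}), "tutorialCoordinator"),
         (frozenset({"studentNo"}), "studentName")}

# Simplified related-attribute sets: tutorialCoordinator moved to Department.
moved = [{"deptName", "tutorialCoordinator"}, {"code", "title"},
         {"studentNo", "studentName"}]
# Simplified alternative in which tutorialCoordinator stays with code.
kept = [{"deptName"}, {"code", "title", "tutorialCoordinator"},
        {"studentNo", "studentName"}]

print(dependency_preserving(given, given, moved))  # False
print(dependency_preserving(given, given, kept))   # True
```

In the first case code and tutorialCoordinator never occur in the same set of related attributes, so the dependency between them is not projected and cannot be recovered, which matches the loss discussed for Fig. 12.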

4.4 PVS definitions of dependency preserving property

The dependency preserving property for semistructured data and its verification have been defined in the PVS formal language below. The dependency preserving property requires a set of functional dependencies to be projected onto the transformed schema to verify whether the dependencies between attributes are preserved. In order to effectively derive projected functional dependencies, the sets of attributes that are related through object classes or relationship types are defined, since they are required to determine the projection of functional dependencies. We defined two functions that extract sets of related attributes from a schema. One is for attributes related through object classes, which derives sets of attributes that belong to the same object class. The other is for attributes related through relationship types, which derives sets of attributes that contain all the identifying attributes of object classes and the attributes that belong to the relationship type.

getAllOAttSet(oAttList: list[ObjectAtt]): RECURSIVE list[set[ATT]] =
  CASES oAttList OF
    null: null,
    cons(head, tail): cons(list2set(attList(head)), getAllOAttSet(tail))
  ENDCASES
MEASURE length(oAttList)

The getAllOAttSet function is defined to produce the sets of related attributes, as a list of sets of attributes, from the given list of ObjectAtt. The list of ObjectAtt represents all the object class attributes in a schema, where the ObjectAtt type is defined to contain an object class and, as a list, all the attributes that belong to the object class. From the list of ObjectAtt, the getAllOAttSet function can easily produce the sets of attributes that belong to the same object classes for all attributes in the schema. It simply extracts the list of attributes from each ObjectAtt in the list and converts it to a set. Each set of attributes in the produced list represents a single set of attributes that are related through an object class. A list is used instead of a set for representing all the sets of related attributes in the schema so that the sets can be traversed recursively to find the projected functional dependencies.
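The extraction step can also be illustrated outside PVS. The following is a minimal Python sketch, not part of the PVS theories: the class ObjectAtt, the function name get_all_o_att_set and the attributes title and name are assumptions introduced only for this illustration.

```python
from typing import FrozenSet, List, NamedTuple


class ObjectAtt(NamedTuple):
    """Hypothetical stand-in for the PVS ObjectAtt type."""
    oc: str              # name of the object class
    att_list: List[str]  # attributes that belong to the object class


def get_all_o_att_set(o_att_list: List[ObjectAtt]) -> List[FrozenSet[str]]:
    """Turn each object class's attribute list into one set of related attributes."""
    return [frozenset(o.att_list) for o in o_att_list]


# Illustrative input (attribute names are assumptions, not the paper's schema).
schema_object_atts = [
    ObjectAtt("course", ["code", "title"]),
    ObjectAtt("student", ["studentID", "name"]),
]
related_sets = get_all_o_att_set(schema_object_atts)
```

The result is a list with one set per object class, in the same order as the input list.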


addRelOCs(ocsList: list[set[OC]]): RECURSIVE set[OC] =
  CASES ocsList OF
    null: emptyset,
    cons(head, tail): (head ∪ addRelOCs(tail))
  ENDCASES
MEASURE length(ocsList)

getRelIdAtts(idAttList: list[IdentifyingAtt], rType: RelType): RECURSIVE set[ATT] =
  CASES idAttList OF
    null: emptyset,
    cons(head, tail):
      IF (oc(head) ∈ addRelOCs(rType))
      THEN (list2set(attList(head)) ∪ getRelIdAtts(tail, rType))
      ELSE getRelIdAtts(tail, rType) ENDIF
  ENDCASES
MEASURE length(idAttList)

getRelAtts(rAttList: list[RelationshipAtt], rType: RelType): RECURSIVE set[ATT] =
  CASES rAttList OF
    null: emptyset,
    cons(head, tail):
      IF (rel(head) = rType)
      THEN (list2set(attList(head)) ∪ getRelAtts(tail, rType))
      ELSE getRelAtts(tail, rType) ENDIF
  ENDCASES
MEASURE length(rAttList)

getAllRAttSet(relList: list[Relationship], s: Schema): RECURSIVE list[set[ATT]] =
  CASES relList OF
    null: null,
    cons(head, tail):
      cons((getRelIdAtts(idAttList(s), rel(head)) ∪ getRelAtts(rAttList(s), rel(head))),
           getAllRAttSet(tail, s))
  ENDCASES
MEASURE length(relList)

The getAllRAttSet function is defined to produce the sets of related attributes, as a list of sets of attributes, from the given list of Relationship type and a schema. For a given relationship type, it uses the addRelOCs function to find all the object classes that make up the relationship type and extracts their identifying attributes using the getRelIdAtts function. The extracted identifying attributes are then combined with the attributes of the relationship type derived from the getRelAtts function and returned as a set. After recursively processing all the relationship types in the list, the getAllRAttSet function produces the sets of attributes that contain all the identifying attributes of object classes and the attributes that belong to the relationship type. Each set of related attributes in the list represents a single set of attributes that are related through the same relationship type. Similar to the getAllOAttSet function, a list is used instead of a set so that the sets can be traversed recursively to find the projected functional dependencies.
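As a rough illustration of how the three helper functions cooperate, the following Python sketch mirrors their structure. It is not the PVS encoding: the representation of a relationship type as a list of sets of object class names, the record classes, and the example object classes and attributes are all assumptions made for this sketch.

```python
from typing import FrozenSet, List, NamedTuple, Set

# Assumption: a relationship type is modelled as a list of sets of object classes.
RelType = List[Set[str]]


class IdentifyingAtt(NamedTuple):
    oc: str              # object class
    att_list: List[str]  # its identifying attributes


class RelationshipAtt(NamedTuple):
    rel: RelType         # relationship type the attributes belong to
    att_list: List[str]  # attributes of that relationship type


def add_rel_ocs(r_type: RelType) -> Set[str]:
    """Union of all object classes that make up the relationship type."""
    result: Set[str] = set()
    for ocs in r_type:
        result |= ocs
    return result


def get_rel_id_atts(id_atts: List[IdentifyingAtt], r_type: RelType) -> Set[str]:
    """Identifying attributes of the object classes involved in r_type."""
    ocs = add_rel_ocs(r_type)
    result: Set[str] = set()
    for ia in id_atts:
        if ia.oc in ocs:
            result |= set(ia.att_list)
    return result


def get_rel_atts(r_atts: List[RelationshipAtt], r_type: RelType) -> Set[str]:
    """Attributes that belong to the relationship type itself."""
    result: Set[str] = set()
    for ra in r_atts:
        if ra.rel == r_type:
            result |= set(ra.att_list)
    return result


def get_all_r_att_set(rel_types, id_atts, r_atts) -> List[FrozenSet[str]]:
    """One set of related attributes per relationship type."""
    return [frozenset(get_rel_id_atts(id_atts, r) | get_rel_atts(r_atts, r))
            for r in rel_types]


# Illustrative data (hypothetical, loosely based on the Department-Course example).
cs: RelType = [{"course"}, {"student"}]
id_atts = [IdentifyingAtt("course", ["code"]),
           IdentifyingAtt("student", ["studentID"]),
           IdentifyingAtt("tutor", ["tutorID"])]
r_atts = [RelationshipAtt(cs, ["grade"])]
rel_sets = get_all_r_att_set([cs], id_atts, r_atts)
```

Here the tutor object class is not part of the relationship, so its identifying attribute does not appear in the resulting set.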

With the functions that derive all the sets of related attributes in a schema, the projection of the functional dependencies onto a schema is defined as a function. A functional dependency is projected onto a schema if its X and Y components belong to a set of related attributes of the schema. According to this definition, the function derives every functional dependency that is projected onto the given schema from the given set of functional dependencies.


isFDProjected?(attsList: list[set[ATT]], fd: FD): RECURSIVE bool =
  CASES attsList OF
    null: FALSE,
    cons(head, tail): ((PROJ_1(fd) ∪ {PROJ_2(fd)}) ⊆ head) ∨ isFDProjected?(tail, fd)
  ENDCASES
MEASURE length(attsList)

findProjectedFD(fdList: list[FD], s: Schema): RECURSIVE list[FD] =
  CASES fdList OF
    null: null,
    cons(head, tail):
      IF (isFDProjected?(getAllOAttSet(oAttList(s)), head) ∨
          isFDProjected?(getAllRAttSet(relList(s), s), head))
      THEN cons(head, findProjectedFD(tail, s))
      ELSE findProjectedFD(tail, s) ENDIF
  ENDCASES
MEASURE length(fdList)

The function isFDProjected? checks whether the given functional dependency is projected onto any of the given sets of related attributes. The findProjectedFD function produces all the given functional dependencies that are projected onto the given ORA-SS schema. It derives every set of related attributes from the given schema using the getAllOAttSet and getAllRAttSet functions. With the sets of related attributes found, the findProjectedFD function uses the isFDProjected? function to check each functional dependency and produce the list of functional dependencies that are projected onto the transformed schema.
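The projection test is a subset check over the sets of related attributes. A minimal Python sketch of the same idea follows; it assumes, as the PVS definition suggests, that a functional dependency is a pair of a set X and a single attribute Y. The function names and the example sets are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumption: an FD is a pair (X, Y) with X a set of attributes and Y one attribute.
FD = Tuple[FrozenSet[str], str]


def is_fd_projected(atts_list: List[FrozenSet[str]], fd: FD) -> bool:
    """True if X together with Y fits inside at least one set of related attributes."""
    x, y = fd
    return any((x | {y}) <= atts for atts in atts_list)


def find_projected_fd(fd_list: List[FD], related_sets: List[FrozenSet[str]]) -> List[FD]:
    """Keep the FDs that are projected onto the given sets, preserving their order."""
    return [fd for fd in fd_list if is_fd_projected(related_sets, fd)]


# Illustrative data: two sets of related attributes and three FDs.
related_sets = [frozenset({"code", "title"}),
                frozenset({"code", "studentID", "grade"})]
fds = [(frozenset({"code"}), "title"),
       (frozenset({"code", "studentID"}), "grade"),
       (frozenset({"code"}), "tutorialCoordinator")]
projected = find_projected_fd(fds, related_sets)
```

In this example the third dependency is dropped, because no set of related attributes contains both code and tutorialCoordinator.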

The dependency preserving property is defined as a function to verify whether the transformation performed by a normalization is dependency preserving by returning the appropriate boolean value.

isDependencyPreserving?(fdList: list[FD], s: Schema): bool =
  (list2set(fdList) ⊆ list2set(findClosure(findProjectedFD(findClosure(fdList), s))))

The isDependencyPreserving? function defined above verifies whether the transformed ORA-SS schema is dependency preserving. From the closure of the given set of functional dependencies, the function generates a list of projected functional dependencies for the transformed schema using the findProjectedFD function. Then it derives the closure of the projected functional dependencies using the findClosure function and compares it with the set of functional dependencies given for the original schema. We assume that the computation of the closure is always complete, which is usually the case. If every functional dependency in the given set is contained in the closure of the projection, the predicate will verify that the transformed schema is dependency preserving.
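The check can be approximated in a few lines of Python. The sketch below is an illustration under stated assumptions rather than a transcription of the PVS definition: it replaces findClosure with the standard attribute-closure computation (an FD X → Y is in the closure of a set G exactly when Y is in the attribute closure of X under G), and it takes the sets of related attributes directly instead of a schema. All names and the example data are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # assumption: (X, Y) with a single attribute Y


def attribute_closure(x: Iterable[str], fds: List[FD]) -> Set[str]:
    """All attributes determined by x under fds (fixpoint iteration)."""
    closure = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and rhs not in closure:
                closure.add(rhs)
                changed = True
    return closure


def project_closure(fds: List[FD], related_sets: List[FrozenSet[str]]) -> List[FD]:
    """FDs in the closure of fds whose X and Y both lie in one set of related attributes.

    Enumerates every subset X of each set, so it is only meant for small examples.
    """
    projected: List[FD] = []
    for atts in related_sets:
        items = sorted(atts)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                x = frozenset(combo)
                for y in (attribute_closure(x, fds) & atts) - x:
                    projected.append((x, y))
    return projected


def is_dependency_preserving(fds: List[FD], related_sets: List[FrozenSet[str]]) -> bool:
    """Every original FD must follow from the projected FDs."""
    projected = project_closure(fds, related_sets)
    return all(y in attribute_closure(x, projected) for x, y in fds)


fds = [(frozenset({"code"}), "title"),
       (frozenset({"code"}), "tutorialCoordinator"),
       (frozenset({"code", "studentID"}), "grade")]
good = [frozenset({"code", "title", "tutorialCoordinator"}),
        frozenset({"code", "studentID", "grade"})]
bad = [frozenset({"code", "title"}),
       frozenset({"tutorialCoordinator"}),
       frozenset({"code", "studentID", "grade"})]
```

With these inputs, is_dependency_preserving(fds, good) holds, while the second decomposition separates code from tutorialCoordinator and the check fails.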

4.5 Lossless property

The lossless property ensures that no data is lost and no spurious data is created when an ORA-SS schema is transformed during normalization. This means that a query on a dataset based on the transformed schema will produce the same result as a query on the dataset based on the original schema. In other words, the transformed schema is verified to be lossless if the joined set of attributes in the transformed schema produces the same joined set of attributes as the original schema. The join of the attributes in the transformed schema makes use of the functional dependencies and the sets of related attributes in the schema. It can be realized either by the attributes belonging to the same object class or relationship, or by the set of attributes that are related through the structure of the schema and joined according to the functional dependencies. Using this correctness criterion and based on the lossless join test theorem proved in the relational data model [19], the verification algorithm of the lossless property for a transformed ORA-SS schema is defined and shown below as pseudo code.

Let S be the transformed ORA-SS schema
Let OA be the set of attributes that belong to an object class
Let RA be the set of attributes that contains relationship attributes and
    the identifying attributes of object classes that belong to a relationship
Let A be the set of every OA for all object classes in S and
    set of every RA for all relationships in S
Let FD be the functional dependencies of the original schema of S in X→Y notation

Repeat until there is no change to A
    For each functional dependency X→Y in the closure of FD
        Let Ax be the sets of attributes in A that contains X
        If some set of attributes in Ax contains Y
            For each set of attributes in Ax
                Add Y to the set

If any set of attributes in A is the same as the set of all attributes in S
    return True

The above pseudo code represents how the satisfaction of the lossless property is verified when sets of attributes in the transformed ORA-SS schema are joined according to their functional dependencies. That is, for each functional dependency ‘X → Y’ in the closure of FD, let Ax be the sets of attributes in A that contain X; if at least one set in Ax contains Y, then Y is added to every set of attributes in Ax. Finally, after all the functional dependencies have been processed and A no longer changes, if there exists at least one set of attributes that contains all the attributes in the schema, the verification returns true, meaning that the transformation is considered to be lossless. Every set of attributes in the transformed schema can be queried if the object classes, relationships and functional dependencies can connect the attributes together. This is because the set of attributes resulting from a data query depends on the associations between the attributes enforced by the functional dependencies and the relationships between attributes. If the queried set of attributes cannot be connected together using functional dependencies and relationships between attributes, spurious data can be retrieved. Hence, the above verification correctly shows whether the normalized schema preserves all data and does not create spurious data.
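The pseudo code translates almost line by line into an executable form. The following Python sketch is one possible reading of it, offered for illustration only: the function name, the representation of a functional dependency as a pair (X, Y) with a single attribute Y, and the example data are assumptions. The sketch iterates over whatever list of functional dependencies it is given, so the caller would pass the closure of FD to follow the pseudo code exactly.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # assumption: (X, Y) with a single attribute Y


def is_lossless(related_sets: Iterable[Iterable[str]],
                fds: List[FD],
                all_atts: Iterable[str]) -> bool:
    """Join the sets of related attributes along the FDs until nothing changes."""
    a: List[Set[str]] = [set(s) for s in related_sets]
    target = set(all_atts)
    changed = True
    while changed:                            # repeat until there is no change to A
        changed = False
        for x, y in fds:
            ax = [s for s in a if x <= s]     # sets of attributes that contain X
            if any(y in s for s in ax):       # some set in Ax contains Y
                for s in ax:
                    if y not in s:
                        s.add(y)              # add Y to every set in Ax
                        changed = True
    return any(s == target for s in a)


fds = [(frozenset({"code"}), "title"),
       (frozenset({"code", "studentID"}), "grade")]
all_atts = {"code", "title", "studentID", "grade"}
joined_ok = [{"code", "title"}, {"code", "studentID", "grade"}]
joined_bad = [{"code", "title"}, {"studentID", "grade"}]
```

In the first decomposition both sets share code, so title can be joined into the second set and the result covers every attribute. In the second decomposition no set contains both code and studentID, so nothing can be joined and the check fails.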

4.6 PVS definitions of lossless property

The lossless property for semistructured data and its verification methodology described in the pseudo code are defined in the PVS formal language below. The lossless property verifies whether the structure of a schema and the dependencies between attributes can join all the attributes in the schema together. In order to verify whether a transformation performed in the normalization process is lossless, the joined set of attributes in a transformed schema can be compared with the set of all attributes in the schema. Initially, functions that derive the set of all attributes in a schema are defined for use in the definition of the lossless property. The defined functions generate all the attributes that belong to object classes and all the attributes that belong to relationship types from the given schema. They then combine the generated attributes to derive all the attributes of a schema, since the attributes of object classes and the attributes of relationship types are defined in different types and organized separately as lists in the PVS definition for the ORA-SS schema.

oAtt2Att(oAttList: list[ObjectAtt]): RECURSIVE list[ATT] =
  CASES oAttList OF
    null: null,
    cons(head, tail): append(attList(head), oAtt2Att(tail))
  ENDCASES
MEASURE length(oAttList)

getAllAtt(s: Schema): set[ATT] =
  list2set(append(oAtt2Att(oAttList(s)), rAtt2Att(rAttList(s))))

The getAllAtt function defined above produces the set that contains all the attributes in the given ORA-SS schema. It uses the functions oAtt2Att and rAtt2Att to obtain the attributes that belong to the object classes and relationship types, respectively. Note that the getAllAtt function includes composite and disjunctive attributes without their components, since composite and disjunctive attributes always belong to either object classes or relationship types without the components. The components are not included because they are not altered during the normalization processes.

After defining the getAllAtt function that derives all attributes in a schema, functions that join the attributes of a schema according to the structure of the schema and the functional dependencies must be defined, so that the joined sets of attributes can be compared with all attributes of the schema to verify the lossless property. These functions find the sets of attributes related through object classes and relationship types and the closure of the projected functional dependencies for the schema. Within the closure of the projected functional dependencies, all the functional dependencies with the same X component are merged into a tuple consisting of the X component and all the Y components of those functional dependencies combined. For each such tuple, all the Y components in the tuple are added to every set of related attributes that contains the X component of the tuple, to produce the joined sets of attributes.

processFD(fdList: list[FD], attsList: list[set[ATT]]):
    RECURSIVE list[[set[ATT], set[ATT]]] =
  CASES fdList OF
    null: null,
    cons(head, tail):
      IF member(PROJ_1(head), attsList)
      THEN processFD(tail, attsList)
      ELSE cons((PROJ_1(head), findAllY(PROJ_1(head), fdList)),
                processFD(tail, cons(PROJ_1(head), attsList))) ENDIF
  ENDCASES
MEASURE length(fdList)

calculateLossless(attsList: list[set[ATT]], attsT: [set[ATT], set[ATT]]):
    RECURSIVE list[set[ATT]] =
  CASES attsList OF
    null: null,
    cons(head, tail):
      IF (PROJ_1(attsT) ⊆ head)
      THEN cons((head ∪ PROJ_2(attsT)), calculateLossless(tail, attsT))
      ELSE cons(head, calculateLossless(tail, attsT)) ENDIF
  ENDCASES
MEASURE length(attsList)

losslessMatrix(attsTList: list[[set[ATT], set[ATT]]], attsList: list[set[ATT]]):
    RECURSIVE list[set[ATT]] =
  CASES attsTList OF
    null: attsList,
    cons(head, tail): losslessMatrix(tail, calculateLossless(attsList, head))
  ENDCASES
MEASURE length(attsTList)

The processFD function produces, for each distinct X component, a tuple that consists of the X component and the set of every Y component of the given functional dependencies that have that X component. This function is defined to reduce the time taken to verify the lossless property by merging all the functional dependencies with the same X component into a single tuple. The calculateLossless function joins the attributes that can be queried using the given functional dependencies and the relationships between attributes in the schema. It adds all the Y components of a tuple to each set of related attributes that contains the X component of the tuple. The losslessMatrix function constructs the sets of attributes that can be queried by joining the Y components of all the projected functional dependencies to each set of related attributes, applying the calculateLossless function recursively.

With functions defined to join related attributes in a schema and generate all attributes of the schema, the lossless property is defined as a function to verify whether the transformation performed by the normalization is lossless by returning the appropriate boolean value.

isAttsMember?(attsList: list[set[ATT]], atts: set[ATT]): RECURSIVE bool =
  CASES attsList OF
    null: FALSE,
    cons(head, tail): empty?((atts \ head)) ∨ isAttsMember?(tail, atts)
  ENDCASES
MEASURE length(attsList)

isLossless?(s: Schema, fdList: list[FD]): bool =
  isAttsMember?(losslessMatrix(processFD(fdList, null),
                               append(getAllOAttSet(oAttList(s)), getAllRAttSet(relList(s), s))),
                getAllAtt(s))

The isLossless? function defined above verifies whether the transformed ORA-SS schema is lossless. The function finds every set of related attributes and the set of projected functional dependencies for the transformed ORA-SS schema using getAllOAttSet, getAllRAttSet and findProjectedFD functions. Then it derives the joined sets of attributes by adding the Y components of projected functional dependencies to the sets of related attributes using the losslessMatrix function. If the entire set of attributes for the transformed schema is contained in one of the joined sets of attributes, the predicate will verify that the transformed schema is lossless.

The process that verifies the correctness of semistructured data normalization consists of the verification of the dependency preserving and lossless properties defined in the PVS language. The PVS definitions of these properties can be applied to a normalized ORA-SS schema represented in the PVS language to verify the correctness of the normalization using the PVS theorem prover. The verification of the transformed ORA-SS schema in Fig. 12 is shown in the next section.

4.7 Formal verification of semistructured data normalization

With the dependency preserving and lossless properties formally defined in PVS, we can perform effective verification on normalization of semistructured data using the PVS theorem prover. The verification uses the dependency preserving and lossless properties to check whether the transformation used in normalization of semistructured data ensures semantic equivalence between the transformed schema and its original. In the verification, we can determine whether the transformation is dependency preserving and lossless by considering only the transformed schema and the functional dependencies given for its original schema. The transformation is dependency preserving if the given functional dependencies hold in the transformed schema, and it is lossless if no data stored in the transformed schema is lost and it creates no spurious data. In order to verify the satisfaction of the dependency preserving property, the proposed verification checks whether every functional dependency given for the original schema is projected onto the transformed schema. For the lossless property, the verification checks whether every attribute in the transformed schema can be joined by the object classes or relationships of the transformed schema together with the functional dependencies given for the original schema. The satisfaction of the lossless property is verified without considering the attributes of the original schema, since the known transformations used in semistructured data normalization do not delete any attributes, so the same set of attributes appears in both the original and the transformed schema.

In order to demonstrate the PVS verification on a semistructured data normalization, the incorrectly normalized Department-Course schema example in Fig. 12 is verified in terms of the dependency preserving and lossless properties. Initially, the transformed ORA-SS schema diagram in Fig. 12 and the functional dependencies given for its original schema are represented according to the vocabulary provided by the PVS definitions. The formally defined schema and functional dependencies are then passed into the isDependencyPreserving? and isLossless? functions by defining a conjecture for each function. The conjectures are defined and verified separately to provide independent verification results for each property. Using the PVS theorem prover and the predefined theorems and lemmas, the conjectures constructed with the isDependencyPreserving? and isLossless? functions are verified interactively with little effort. These properties can be easily proved with the standard PVS proof command ‘grind’, which minimizes the human interaction required during the verification. However, the level of automation of the verification also depends on the success of the proof. If the proof succeeds, i.e., verification returns true, the proof steps can be considered as automated. On the other hand, if the normalization is verified to be incorrect, further effort has to be made in diagnosing the counterexample, correcting the normalization and retrying the proof. More discussion on the verification effort of using our approach can be found in Sect. 4.7.3. The verification for each property provides a diagnostic result showing which functional dependencies or attributes are lost. These results can be used to analyze and correct the transformation performed by the normalization. After the transformation is corrected and the conjectures are proven, verification of the normalization is complete, proving that the transformation ensures the semantic equivalence between the transformed schema and its original. In order for a normalization of semistructured data to be correct, the conjectures for both the dependency preserving and lossless properties must be proven to satisfy the semantic equivalence.


4.7.1 Verification of dependency preserving property

The dependency preserving property specifies that the semantic equivalence between a transformed schema and its original is ensured during a normalization process if the dependencies between attributes are preserved. In the PVS formal definitions of the dependency preserving property, the projected set of functional dependencies is derived from the set of functional dependencies given for an original schema. The closure of the projected set of functional dependencies is compared with the given set of functional dependencies to verify the satisfaction of the dependency preserving property. Using the dependency preserving property formally defined in PVS, the verification of the normalization in terms of the dependency preserving property is possible. We demonstrate the verification for the dependency preserving property of the normalization that transformed the Department-Course schema example in Fig. 7 into the schema in Fig. 12. This verification is conducted by constructing and verifying the conjecture, isDependencyPreserving_Con, for the normalization. The conjecture is defined with the PVS representations of the transformed Department-Course schema and the set of functional dependencies given for the original schema passed into the isDependencyPreserving? function.

When the isDependencyPreserving_Con conjecture is verified using the PVS theorem prover, it produces a state that cannot be proved, as shown in Fig. 15. In the proof step for this part of the conjecture, the line that starts with {−1} indicates that the functional dependency ‘code → tutorialCoordinator’ is not preserved during transformation. Thus, the “normalized” ORA-SS schema in Fig. 12 is not dependency preserving, since not every functional dependency given for the original schema is preserved. After correcting the transformation to preserve the above functional dependency, verifying the isDependencyPreserving_Con conjecture again returns a state that cannot be proved, similar to the one shown in Fig. 15. It indicates that ‘{code, studentID} → tutorID’, ‘{code, studentID} → grade’ and ‘{code, studentID, tutorID} → feedback’ are not preserved. The results of the PVS verification are consistent with our original observations of the example and show that the PVS definitions of the dependency preserving property correctly detect the loss of functional dependencies during normalization.

Fig. 15 Verifying dependency preserving property

4.7.2 Verification of lossless property

The lossless property specifies that the semantic equivalence between a transformed schema and its original is ensured during a normalization process if every attribute in the schema can be queried without producing any spurious data. In the PVS formal definitions of the lossless property, the sets of attributes that are related through object classes or relationship types are derived and joined according to the given set of functional dependencies. The joined sets of attributes are compared with the attributes of the schema to verify the satisfaction of the lossless property. Using the lossless property formally defined in PVS, the verification of the normalization in terms of the lossless property is possible. We demonstrate the verification for the lossless property of the normalization that transformed the Department-Course schema example. This verification is conducted by constructing and verifying the conjecture, isLossless_Con, for the normalization. The conjecture is defined with the PVS representations of the transformed Department-Course schema and the set of functional dependencies given for the original schema passed into the isLossless? function. When the isLossless_Con conjecture is verified using the PVS theorem prover, it produces an unproven state that indicates that the joined set of attributes does not contain all the attributes of the schema. Hence, the “normalized” ORA-SS schema in Fig. 12 is not lossless, since not every attribute in the original schema can be queried from the transformed schema. Using the unprovable state shown, we can analyze which part of the transformed schema has violated semantic equivalence.

The above verification conducted with the PVS definitions of the dependency preserving and lossless properties illustrates that the semantic equivalence between the normalized schema in Fig. 12 and its original form is not ensured. The result of the PVS verification also indicates which parts of the transformed schema caused dependency violation, data loss, or creation of spurious data. Using this information, the normalized schema can be further corrected to produce the schema shown in Fig. 16, which is a correctly normalized ORA-SS schema of the Department-Course example shown in Fig. 7. When verified using the PVS theorem prover, this normalization is verified to be correct.

Fig. 16 A correctly normalized ORA-SS schema diagram

In addition, we evaluated our approach against the normalization examples presented in the papers of Arenas [3] and Wu [52]. These normalization algorithms transform a schema of semistructured data into one that satisfies the associated normal form defined to avoid data redundancies. The normalization guided by these algorithms uses various schema transformations according to the key constraints and the given set of functional dependencies for the schema. Using the normalization examples presented in the papers, we are able to evaluate our approach more completely, as the PVS verification can be performed on normalizations that use different schema notations and various schema transformations. In order to verify the normalization examples, the different notations of semistructured data schema used in the examples are translated into ORA-SS schema diagrams and represented according to the PVS definitions of ORA-SS semantics. Similarly, the given set of functional dependencies for the schema is converted into the ‘X → Y’ notation of minimal cover and represented in PVS. With the PVS representations of the translated ORA-SS schema and the converted functional dependencies, the normalization examples are verified effectively using the PVS theorem prover. Moreover, we also generated examples of incorrectly designed schema transformations and verified them to further evaluate our solution. A transformation that is lossless but not dependency preserving was generated and verified using the PVS theorem prover. Such examples are chosen since normalization algorithms, including the algorithm of Wu [52], do not necessarily guarantee that functional dependencies between attributes are preserved in their schema transformation. Additionally, a transformation that is neither dependency preserving nor lossless was generated and verified. These verifications tested whether our solution can correctly detect schemas that do not satisfy the dependency preserving or lossless property. They also showed that our solution can verify the dependency preserving and lossless properties separately.

4.7.3 Discussion on modeling and verification effort

This subsection summarizes the modeling and verification effort of our approach in verifying semistructured data and its normalization. As illustrated in Fig. 2 of the introduction, our PVS encoding is based on two top level conceptual models, i.e., the formal semantics of the ORA-SS data modeling language and the correctness criteria for semistructured data normalization. The ORA-SS language semantics consists of three separate theories for representing the schema, data and their relationships, whereas the correctness criteria for normalization are captured in a single theory. The information on the model construction of the theories is given below.

Table 1 PVS modeling efforts of the base theories

Theories                                 Types   Functions   Lines
ORA-SS schema                            10      31          220
ORA-SS data instance                     8       14          103
Schema-instance mapping                  4       43          320
Correctness criteria for normalization   4       24          152
Total                                    26      112         895

We can see from Table 1 that our encoding consists of approximately 26 type definitions and 112 functions with a total of 895 lines of specifications. The customized models of the ORA-SS schema, data and normalization are defined as instances of these base theories and verified using the PVS theorem prover. Constructing PVS definitions for particular semistructured data and its schema is straightforward, since the semantics of the ORA-SS language is already defined. As a matter of fact, a simple translator can be developed to automatically transform the graphical representations of ORA-SS models into their corresponding PVS specifications. The validation of the schema and data against ORA-SS semantics can be performed effectively using the type checking facilities of PVS. The number of TCCs generated from the PVS definitions of semistructured data and schema depends on the numbers of relationships and attributes in the model. For example, the validation of the ‘Department-Course’ example presented in the paper generates 32 TCCs from the schema definition, 43 TCCs from the data instance and 6 TCCs from the mapping between the schema and instance. All these TCCs were proved using standard PVS proof commands, such as ‘use "rel_axiom"’, ‘grind’ or ‘M-x tcp’. The time taken for proving a TCC is only around 0.1–0.2 seconds. The actual validation of data against schema is achieved by proving the conjecture defined with the ‘correctSchemaNData?’ function. Similar to the TCCs, this conjecture can be easily proved with the PVS ‘grind’ command. Hence, the effort required to validate the semistructured data using our approach is very light. To further improve the usability and efficiency of the validation, a simple tool can be developed to automatically feed the proof commands to the PVS prover whenever TCCs are generated from the PVS definitions or data instances are validated. However, as we mentioned earlier, human interaction is required for unsuccessful proofs in diagnosing and correcting the errors in the models. A visual interface can be provided as part of the tool support to assist this process. In this way, users only need to focus on the diagrammatic aspects of the ORA-SS language without knowing the underlying PVS representations during the modeling and verification phases.

The normalization of an ORA-SS schema model is verified based on the transformed schema and the original set of functional dependencies. Once the representation of a normalized schema is completed, the associated functional dependencies are the only additional information that needs to be defined. The TCCs generated by the functional dependencies can be automatically proven using the ‘M-x tcp’ command. To verify the correctness of semistructured data normalization, conjectures with the ‘dependencypreserving?’ and ‘lossless?’ functions need to be proved. Each conjecture is proved with the simple PVS proof command ‘grind’. Therefore, only light effort is required to formalize and verify semistructured data normalization using our approach, since no complicated proof engineering is involved during the verification. Similarly, a simple tool can be constructed to automate the translation and proving processes with the aid of a visual interface.
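To make the dependency preserving criterion more concrete, the following is a minimal Python sketch of the classical relational test for dependency preservation. It is only an illustration under stated assumptions: it works on flat attribute sets rather than on ORA-SS schemas, it is not the PVS definition of ‘dependencypreserving?’, and the helper names (`fd`, `closure`, `preserves_dependencies`) and the course/lecturer/office attributes are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

# An FD is a pair (left-hand side, right-hand side) of attribute sets.
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def fd(lhs: str, rhs: str) -> FD:
    """Hypothetical helper: build an FD from space-separated attribute names."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def closure(attrs: Iterable[str], fds: List[FD]) -> FrozenSet[str]:
    """Attribute closure of `attrs` under `fds`, by fixed-point iteration."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def preserves_dependencies(fds: List[FD], fragments: List[FrozenSet[str]]) -> bool:
    """True if every FD follows from the FDs that hold inside single fragments."""
    for lhs, rhs in fds:
        reachable = set(lhs)
        changed = True
        while changed:
            changed = False
            for fragment in fragments:
                # Attributes derivable while staying inside this fragment.
                gained = closure(reachable & fragment, fds) & fragment
                if not gained <= reachable:
                    reachable |= gained
                    changed = True
        if not rhs <= reachable:
            return False
    return True


# Hypothetical example: course -> lecturer, lecturer -> office.
FDS = [fd("course", "lecturer"), fd("lecturer", "office")]
GOOD = [frozenset({"course", "lecturer"}), frozenset({"lecturer", "office"})]
BAD = [frozenset({"course", "lecturer"}), frozenset({"course", "office"})]

print(preserves_dependencies(FDS, GOOD))
print(preserves_dependencies(FDS, BAD))
```

In the second decomposition, lecturer and office never appear in the same fragment, so the dependency from lecturer to office can no longer be enforced locally; this is the kind of loss the verification is meant to detect.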

Here we have briefly presented the modeling and verification efforts of our approach. The following table shows the time taken for verifying the different normalization examples discussed in the paper. Note that the verification time in the table is measured on a MacBook laptop computer with Mac OS X 10.5.8 and a hardware configuration of a 2 GHz Intel Core 2 Duo processor & 2 GB of 667 MHz DDR2 SDRAM. The types of entities in a schema example include object classes, relationships and attributes.

Table 2 Verification time on different normalization examples

Normalization examples               No. of entities   No. of FDs   Time taken
Department-course                    34                14           200 seconds
Course-student in [3]                8                 2            27 seconds
DBLP in [3]                          14                5            139 seconds
Department-course-student in [52]    13                6            128 seconds
Teacher-classroom in [52]            13                3            28 seconds
Objects in [52]                      13                3            48 seconds

As we can see from Table 2, a reasonable-sized normalization, such as the ‘Department-Course’ example, takes about three minutes to verify. Note that this includes the computation of the closure of functional dependency sets, which usually takes up most of the running time. The table shows that the number of functional dependencies in the model directly affects the time taken to verify the normalization. A closer examination of the relationship between the number of FDs and the time taken further indicates that the growth is closer to linear than to exponential. In addition, the complexity of the schema itself also affects the performance of the verification. As the numbers of functional dependencies and the schemas of ORA-SS models are usually kept small to avoid redundant structures among the entities, our approach is reasonably efficient in practice. If the normalized schema is indeed too big, the whole normalization process can be decomposed into smaller transformation steps, where each step only affects a certain part of the model. Thus the verification of the normalization can be carried out incrementally. In this way, we can tackle larger and more complex schemas in semistructured data normalization. Finally, we also believe that with adequate and domain specific tool support the performance of the verification can be further improved. Both of the above two points will be further discussed in the future work section.
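The observation about roughly linear growth can be checked with a simple least-squares fit over the six rows of Table 2. The sketch below is only an illustration: the data points are copied from the table, while the function name `least_squares` and the choice of an ordinary least-squares line are assumptions made here, not part of the original evaluation.

```python
from math import sqrt
from typing import List, Tuple

# (number of FDs, verification time in seconds), copied from Table 2.
TABLE_2: List[Tuple[int, int]] = [
    (14, 200),  # Department-course
    (2, 27),    # Course-student
    (5, 139),   # DBLP
    (6, 128),   # Department-course-student
    (3, 28),    # Teacher-classroom
    (3, 48),    # Objects
]


def least_squares(points: List[Tuple[int, int]]) -> Tuple[float, float, float]:
    """Return (slope, intercept, Pearson r) of the best-fit line y = slope*x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    slope = cov / var_x
    intercept = (sy - slope * sx) / n
    r = cov / sqrt(var_x * var_y)
    return slope, intercept, r


slope, intercept, r = least_squares(TABLE_2)
print(f"time = {slope:.1f} s per FD + {intercept:.1f} s, r = {r:.2f}")
```

With only six data points, such a fit can support the linear reading of the table but cannot rule out mildly superlinear growth; the number of entities also varies between the examples and is ignored here.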

5 Related work

In this section, we review and discuss two bodies of research that directly relate to the work presented in the paper, namely, formal frameworks (semantics) and normalization algorithms (normal forms) for semistructured data.

5.1 Formal semantics of semistructured data

Calvanese et al. [9] proposed a framework for representing and reasoning about structural aspects of XML documents using Description Logic (DL). The framework is developed to improve the efficiency of web information retrieval, where the web is considered a large semistructured database and database querying is used to retrieve information. In the proposed framework, a view of Document Type Definition (DTD), a schema language for XML documents, is formally specified using Description Logic. With the formal specification of DTD, a possible document structure for XML data instances defined in the corresponding DTD can be represented. The represented document structure can be used to validate an XML document against the DTD. The representation of the document structure can also be used to check structural equivalence, inclusion, and disjointness between two different DTDs using inference algorithms provided by DL. The framework developed in this research provides an effective methodology for checking structural equivalence between DTDs, which performs much faster when compared with other known algorithms. It also provides an effective way to validate an XML document against a DTD in terms of document structure. Additionally, the framework provides an efficient way to answer a query where reasoning algorithms of DL can be used to retrieve documents that conform to a certain structure. However, possible representations of document structure provided by DTD are limited since not every aspect of DTD is captured in the formal specification. In particular, common and widely used concepts such as attributes and links in DTD are not modeled by the approach.

In the research presented by Anutariya et al. [2], a formal framework for XML documents, called the XML declarative description (XML-DD) model, is developed using declarative description theory for modeling and managing XML documents. The XML-DD is developed by characterizing XML elements in terms of a mathematical structure called a specialization system, where a formal representation and reasoning support are constructed accordingly. The developed XML-DD model consists of object and relationship descriptions that represent the XML elements and the relationships between the elements respectively. With relationship descriptions, the XML-DD model can describe integrity, path, type, and referential constraints as variables. Additionally, it can represent DTDs and formulate queries by mapping them to relationship descriptions, where a query can be evaluated and an XML document can be validated against a DTD. It is also claimed that the proposed framework provides more efficient modeling and reasoning for XML documents when the XML-DD is integrated with a computational model called Equivalence Transformation. The proposed XML-DD model provides an expressive XML document model that is enabled with descriptions of implicit information and constraints of XML documents. It also provides effective validation of an XML document against a DTD and query evaluations. However, the concepts of uniqueness and referential integrity constraints in DTD are not modeled in the proposed framework. Hence, these constraints cannot be represented and used for validation of an XML document.

Conforti and Ghelli [14] proposed a framework for semistructured data using Spatial Tree Logics (STL), which is a language based on Ambient Logic. The framework is proposed to show that STL is suitable for representing and reasoning about constraints, types, and queries of semistructured data. The proposed STL model for semistructured data can represent semistructured data in a tree like structure, describe different data types used in various schema notations, and represent constraints given for the semistructured data. Based on the proposed STL model, model checking and query answering are also implemented in terms of Tree Query Language (TQL) which uses STL formulas for its expressions. With the TQL logic, constraint satisfaction and type conformance can be checked for a semistructured data instance represented in the STL model. Additionally, queries on semistructured data instances represented in the STL model can be executed and evaluated. However, the representation provided by the proposed STL model is restricted to semistructured data and XML documents in the form of tree like structures that are not ordered. The reasoning support for constraint satisfaction and type conformance in the proposed STL model is limited, since the TQL logic can only express and check constraints and types that can be represented as simple ground logic.

In the research presented by Bidoit et al. [5], a formalization of semistructured data is proposed using hybrid multimodal logic to increase the efficiency of query evaluation. It generalizes the notion of schema for semistructured data and formally represents semistructured data, schema for semistructured data, and integrity constraints. The proposed formalization expresses integrity constraints of semistructured data using the hybrid multimodal logic and derives pattern grammar as notation for expressing schema of semistructured data. The derived pattern grammar along with the expressed integrity constraints are formally represented as formulas of hybrid multimodal logic. The corresponding semistructured data is represented as models of the formula. In the proposed formalization, valid semistructured data can be deduced from the formalized schema of semistructured data using reference typing. The proposed formalization provides a generalized and well defined formal representation of schema for semistructured data. The formalized schema enables the expression of typing constraints on references, which is not supported in other schema notations for semistructured data such as DTD. However, the expressiveness of the constraints is limited in the proposed solution, e.g., natural path constraints of data graph for semistructured data cannot be represented.

The above mentioned research that focused on formal semantics of semistructured data proposed various forms of formal descriptions for semistructured data. The specifications provide the means to formally represent semistructured data, its schema, and associated constraints. They also enable various forms of reasoning on semistructured data using constraint satisfaction and type conformance, in particular validation of a semistructured data instance against its schema. However, all the formal frameworks presented in the related research have limitations in expressing some semantics of semistructured data, and none of them is able to represent all aspects of semistructured data required for defining normal forms. This is because much of the existing research is based on DTD, which is not very semantically expressive, while others developed their own schema representations that did not consider aspects of semistructured data that are required for defining normal forms. In this paper, our formal semantics of semistructured data models is constructed based on the ORA-SS language, which is a semantically richer notation than the others. More importantly, the validation of semistructured data can be effectively carried out by the PVS theorem prover.

5.2 Normalization algorithms for semistructured data

Arenas and Libkin [3] proposed a normalization algorithm for XML documents to prevent update anomalies and data redundancies. A normal form for XML documents (XNF), which generalizes Boyce-Codd normal form and a normal form for nested relations, is derived by examining the concepts of path functional dependencies given for XML documents. The derived XNF disallows data redundancies in XML documents by providing rules and conditions for the implications of path functional dependencies. Using the derived XNF, a normalization algorithm is developed for XML documents in terms of DTD. Given a set of path functional dependencies, the proposed algorithm converts any DTD into one that is in XNF. The algorithm uses two schema transformations, i.e., creating new elements and moving attributes, to remove anomalous functional dependencies that cause data redundancies. In this research, an effective normalization algorithm is developed to remove redundancies from XML documents, where the decompositions used in the algorithm are mathematically proven to preserve data. However, the proposed algorithm is restrictive since it does not consider uniqueness and referential constraints of the DTD schema. The implication of functional dependencies used in the algorithm is also limited to simple path functional dependencies, where a methodology for detecting implication of more complicated functional dependencies is not described.

In the research presented by Wu et al. [52], a normalization algorithm is developed to minimize data redundancies in XML documents. A Normal Form for Semistructured Schemata (NF-SS) is defined using DTD represented in a tree like structure and extended functional dependency (EFD), where EFD is defined by combining functional dependencies and key constraints together. The defined NF-SS avoids data redundancies in XML documents by providing rules and conditions for implication to prevent transitive dependencies, path anomalies, and partial dependencies. With the derived NF-SS, a normalization algorithm is developed for XML documents in terms of the graphical representation of DTD. According to the given set of EFDs, the proposed algorithm uses heuristic schema restructuring rules to transform a DTD into one that is in NF-SS. The heuristic schema restructuring rules used in the algorithm remove transitive dependencies, path anomalies, and partial dependencies by decomposing object classes, splitting paths, and creating new object classes to reorganize attributes respectively. The proposed normalization algorithm is less restrictive when compared with other known normalization algorithms, since EFD can model real world semantics more effectively with incorporated integrity constraints and the ability to represent complicated dependencies. The process of normalization is also effective since the heuristic schema restructuring rules are developed. However, the normalization algorithm is not guaranteed to work for every possible schema of XML documents. The completeness of the heuristic schema restructuring rules is not proven. The algorithm can also lose dependencies between data during the schema transformation, since the preservation of the dependencies is not guaranteed.

Embley and Mok [20] proposed a normalization algorithm for XML documents to remove data redundancies. An XML normal form (XNF) is derived using conceptual-model hypergraphs (CM hypergraphs) and their functional edges. The CM hypergraphs represent the schema of XML documents as collections of scheme-trees, where their functional edges represent functional and multivalued dependencies as well as inclusion constraints. The derived XNF prevents data redundancies in XML documents by providing rules and conditions for constraint satisfaction and finding a minimal number of scheme-trees deduced from CM hypergraphs. Using the derived XNF, a normalization algorithm based on a conceptual model is developed for XML documents to generate DTDs that comply with XNF. In this research, tools are also implemented to apply normalization algorithms to the given CM hypergraph to automatically generate XNF compliant DTDs. However, the proposed algorithm is restrictive since a CM hypergraph has no hierarchical structures, no concept of attribute, and no key constraints. It is also non-deterministic since the algorithm produces different schemas as solutions instead of checking and removing redundancies from a single schema. Additionally, the structure of scheme-trees used in the algorithm does not correspond to the structure of XML documents.

The above mentioned research that focuses on normalization algorithms for semistructured data has proposed various normal forms for semistructured data and corresponding normalization algorithms. The normal forms defined in terms of semistructured schema prevent data redundancies in the data using functional dependency implications. The normalization algorithms transform the schema of semistructured data into one that satisfies the properties of the corresponding normal forms. However, these normalization algorithms for semistructured data lack mechanical proof support to ensure the preservation of the data and its constraints. In this paper, we provide a formal and effective means for the verification of semistructured data normalization using the PVS theorem prover, which verifies whether a schema transformation has resulted in any loss or corruption of data or dependencies between the data.
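What "loss or corruption of data" means for a schema transformation can be illustrated on a single flat data instance: a decomposition is lossy on that instance if joining the decomposed parts back together yields tuples that were not in the original. The Python sketch below is an assumption-laden illustration only. It uses flat relational tuples instead of ORA-SS instances, it tests one instance rather than proving the schema-level ‘lossless?’ property, and all names (`row`, `project`, `natural_join`, `is_lossless_on`, and the enrolment data) are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A row is a tuple of (attribute, value) pairs sorted by attribute name.
Row = Tuple[Tuple[str, str], ...]


def row(**values: str) -> Row:
    """Hypothetical helper: build a row from keyword arguments."""
    return tuple(sorted(values.items()))


def project(rows: Iterable[Row], attrs: Set[str]) -> Set[Row]:
    """Keep only the given attributes of every row (duplicates collapse)."""
    return {tuple(pair for pair in r if pair[0] in attrs) for r in rows}


def natural_join(left: Set[Row], right: Set[Row]) -> Set[Row]:
    """Combine rows that agree on all attributes they share."""
    joined = set()
    for l in left:
        l_map = dict(l)
        for r in right:
            r_map = dict(r)
            shared = l_map.keys() & r_map.keys()
            if all(l_map[a] == r_map[a] for a in shared):
                joined.add(tuple(sorted({**l_map, **r_map}.items())))
    return joined


def is_lossless_on(rows: Iterable[Row], first: Set[str], second: Set[str]) -> bool:
    """True if splitting `rows` into two projections and rejoining gives `rows` back."""
    original = set(rows)
    return natural_join(project(original, first), project(original, second)) == original


# Hypothetical instance in which one lecturer teaches two courses.
ENROLMENT = {
    row(student="s1", course="c1", lecturer="l1"),
    row(student="s2", course="c2", lecturer="l1"),
}

print(is_lossless_on(ENROLMENT, {"student", "course"}, {"course", "lecturer"}))
print(is_lossless_on(ENROLMENT, {"student", "lecturer"}, {"course", "lecturer"}))
```

Splitting on the shared lecturer attribute produces spurious student and course pairings after the join, so the original enrolments can no longer be recovered. A theorem prover establishes the absence of such loss for all instances, which a test on one instance cannot do.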

6 Conclusion

Semistructured data is developed to integrate, exchange and store disparate data from heterogeneous repositories. Its usage has increased extensively, and is driven in particular by international companies and the World Wide Web. Such an increase in its usage has led to the development of database systems for semistructured data to model, store, manipulate and manage the data efficiently and effectively. Several normalization algorithms specific to semistructured data have also been proposed to remove or minimize data redundancies, which could cause various anomalies that result in data inconsistencies. Similar to widely used database systems such as relational databases, semistructured database systems must ensure the consistency of the data, since inconsistency leads to the loss of the original meaning of the data stored, which can cause devastating results. Data inconsistencies can arise during schema design and data population if the modeled schema does not conform to the semantics of the data modeling language or if the populated data does not conform to its schema. In addition, database operations that transform a schema, such as normalization, can cause data inconsistencies if the underlying algorithms are designed incorrectly. In order to ensure consistency of semistructured data, the modeled schema must be validated against the semantics of the data modeling language, the populated data must be validated against its schema, and the correctness of normalization must be verified. However, the current state of semistructured data and its database systems lacks adequate reasoning support to detect such inconsistencies. In this research, we have addressed these problems by providing a PVS theorem prover approach to validate and verify semistructured data models effectively.

Firstly, we provided a formal and precise description for the ORA-SS semistructured data modeling language, which serves as a declarative and rigorous reference to the ORA-SS semantics. We use the PVS type checker and theorem prover to ensure the consistency of ORA-SS schemas and their data models. The validation provides an effective and convenient way to detect data inconsistencies caused by incorrect schema design and data population. Secondly, we established a formal description of the correctness criteria for the normalization of semistructured data. The correctness criteria for semistructured data normalization are derived in terms of the dependency preserving and lossless properties to provide rules and conditions to show that the original and transformed ORA-SS schemas are semantically equivalent. On top of the PVS semantics of the ORA-SS data modeling language, the derived correctness criteria are formally specified in PVS. This provides a better understanding of the correctness of normalization and schema equivalence specific to the semistructured data context. Furthermore, we provided formal verification for the normalization of semistructured data. With the PVS definitions of the correctness criteria, effective verification of semistructured data normalization can be conducted through the PVS theorem prover with a built-in library of theories. According to the correctness criteria defined in PVS, the correctness of semistructured data normalization can be verified in terms of dependency preserving and lossless properties. PVS verification provides an effective and convenient way of detecting dependency or data loss in transformed schemas caused by incorrectly designed normalization algorithms. It supplements the reliability of semistructured data normalization, as it ensures the consistency of the data by detecting possible dependency or data loss. Similar to validation, the verification produces diagnostic results that can be used to analyze and correct the transformation. In addition, the proposed verification can be easily adopted by various applications that require verification of the correctness of schema transformations. It provides a good stepping stone for defining valid transformation operators that can be used in verifying normalization algorithms, view creation, and various other database operations that transform the schema of semistructured data. In summary, the proposed theorem prover approach to semistructured data design provides adequate and effective reasoning support to ensure the consistency of semistructured data.

In the future, we plan to extend our work by developing adequate tool support for integrating the verification with a user friendly interface for semistructured data design. The tool should be constructed with a graphical user interface that supports the underlying PVS definitions of ORA-SS semantics and correctness criteria for semistructured data normalization. It should automatically provide the necessary preprocessing for PVS verification of semistructured data design, such as representing ORA-SS schema and data models, converting functional dependencies, and invoking the PVS theorem prover, to further improve the efficiency of the verification. One of the limitations of the current approach lies in the closure computation on the given functional dependency sets. Although the actual verification of the dependency preserving and lossless properties is quite efficient, the calculation of the closure set may take up most of the running time. This is due to the fact that PVS is a general theorem prover, and its strength is not in raw computation. Thus, integrated tool support that performs the initial computations, such as finding the closures of functional dependency sets, prior to the PVS verification would greatly improve the performance of the proposed approach. Additionally, the tool should integrate diagnostic results with the ORA-SS schema diagram for easier comprehension to increase the applicability of the verification. Finally, user-friendly tool support would also encourage a wider application of the approach.
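The kind of preprocessing suggested above, computing closures outside the prover, could look like the following Python sketch. It is an illustration under assumptions: it treats functional dependencies as pairs of flat attribute sets, uses the standard counter-based attribute closure algorithm, and all names (`fd`, `closure`, `precompute_closures`) are hypothetical. How the resulting table would be handed to PVS is left open here.

```python
from collections import defaultdict, deque
from typing import Dict, FrozenSet, Iterable, List, Tuple

# An FD is a pair (left-hand side, right-hand side) of attribute sets.
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def fd(lhs: str, rhs: str) -> FD:
    """Hypothetical helper: build an FD from space-separated attribute names."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def closure(attrs: Iterable[str], fds: List[FD]) -> FrozenSet[str]:
    """Attribute closure using per-FD counters of unmet left-hand-side attributes."""
    remaining = [len(lhs) for lhs, _ in fds]
    uses = defaultdict(list)  # attribute -> indexes of FDs with it on the left
    for index, (lhs, _) in enumerate(fds):
        for attr in lhs:
            uses[attr].append(index)

    result = set(attrs)
    queue = deque(result)

    def fire(index: int) -> None:
        # Add the right-hand side of a satisfied FD to the closure.
        for attr in fds[index][1]:
            if attr not in result:
                result.add(attr)
                queue.append(attr)

    # FDs with an empty left-hand side hold unconditionally.
    for index, (lhs, _) in enumerate(fds):
        if not lhs:
            fire(index)

    # Each attribute is processed once, when it first enters the closure.
    while queue:
        attr = queue.popleft()
        for index in uses[attr]:
            remaining[index] -= 1
            if remaining[index] == 0:
                fire(index)
    return frozenset(result)


def precompute_closures(fds: List[FD]) -> Dict[FrozenSet[str], FrozenSet[str]]:
    """Closure of every left-hand side, computed once before any proof is attempted."""
    return {lhs: closure(lhs, fds) for lhs, _ in fds}


# Hypothetical FD set: A -> B, B -> C, CD -> E.
FDS = [fd("A", "B"), fd("B", "C"), fd("C D", "E")]
TABLE = precompute_closures(FDS)
for lhs, attrs in TABLE.items():
    print(sorted(lhs), "->", sorted(attrs))
```

A dependency X → Y follows from the set exactly when Y is contained in the closure of X, so a table like this reduces each implication check to a subset test.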

Another possible direction for future research is to develop a PVS verification of normalization algorithms specific to semistructured data using the basic transformation operators. The basic transformation operators should be formally defined in the PVS language by extending the PVS definitions of ORA-SS semantics and should also be verified by extending the PVS definitions of correctness criteria for semistructured data normalization presented in this research. With the formally defined and verified basic transformation operators, normalization algorithms for semistructured data can be decomposed into the basic transformation operators and verified in terms of dependency preserving and lossless properties. This research will provide effective and efficient verification for normalization algorithms specific to semistructured data and increase the reliability of the proposed normalization algorithms for semistructured data. The formally defined basic transformation operators can also be used to compare different normalization algorithms for their effectiveness, reliability and performance. In addition, the basic transformation operators can be used to verify various other database operations that transform the schema of semistructured data such as view creation.

Acknowledgements This work was supported by the Marsden research grant, Foundations of Semistructured Data from the Royal Society of New Zealand. We would also like to thank the numerous anonymous referees who have reviewed the manuscript and whose valuable comments have contributed to the clarification of many of the ideas presented in the paper.

References

1. Abiteboul S, Buneman P, Suciu D (1999) Data on the Web: from relations to semistructured data and XML. Morgan Kaufmann, San Mateo

2. Anutariya C, Wuwongse V, Nantajeewarawat E, Akama K (2000) Towards a foundation for XML document databases. In: EC-Web’00: proceedings of the 1st international conference on electronic commerce and Web technologies, London, UK. Springer, Berlin, pp 324–333

3. Arenas M, Libkin L (2004) A normal form for XML documents. ACM Trans Database Syst 29(1):195–232

4. Baumgartner R, Frölich O, Gottlob G, Herzog M, Lehmann P (2005) Integrating semi-structured data into business applications: a web intelligence example. In: WM’05: proceedings of the 3rd biennial conference on professional knowledge management, Kaiserslautern, Germany. Springer, Berlin, pp 469–482

5. Bidoit N, Cerrito S, Thion V (2004) A first step towards modeling semistructured data in hybrid multimodal logic. J Appl Non-Class Log 14(4):447–475

6. Bowen J, Gordon M (1995) A shallow embedding of Z in HOL. Inf Softw Technol 37(5–6):269–276

7. Bray T, Paoli J, Sperberg-McQueen CM, Maler E, Yergeau F (2006) Extensible markup language (XML) 1.0. http://www.w3.org/TR/2006/REC-xml-20060816/

8. Buneman P (1997) Semistructured data. In: PODS’97: proceedings of the 16th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, Tucson, Arizona, USA. ACM, New York, pp 117–121

9. Calvanese D, De Giacomo G, Lenzerini M (1999) Representing and reasoning on XML documents: a description logic approach. J Log Comput 9(3):295–318

10. Chawathe SS, Garcia-Molina H, Hammer J, Ireland K, Papakonstantinou Y, Ullman JD, Widom J (1994) The TSIMMIS project: integration of heterogeneous information sources. In: IPSJ’94: proceedings of the 10th conference on information processing society of Japan, Tokyo, Japan, pp 7–18

11. Chen PP (1976) The entity-relationship model—toward a unified view of data. ACM Trans Database Syst 1(1):9–36

12. Chen YB, Ling TW, Lee M-L (2002) Designing valid XML views. In: ER’02: proceedings of the 21st international conference on conceptual modeling, Tampere, Finland. Springer, Berlin, pp 463–478

13. Choppella V, Sengupta A, Robertson EL, Johnson SD (2007) Preliminary explorations in specifying and validating entity-relationship models in PVS. In: AFM’07: proceedings of the second workshop on automated formal methods. ACM, New York, pp 1–10

14. Conforti G, Ghelli G (2003) Spatial tree logics to reason about semistructured data. In: SEBD’03: proceedings of the 11th Italian symposium on advanced database systems, Cetraro, Italy. Rubettino Editore, Soveria Mannelli, pp 37–48

15. Deutsch A, Fernández MF, Suciu D (1999) Storing semistructured data with STORED. In: SIGMOD’99: proceedings of ACM SIGMOD international conference on management of data, Philadelphia, Pennsylvania, USA. ACM, New York, pp 431–442

16. Dietrich SW, Urban SD (2004) An advanced course in database systems: beyond relational databases. Prentice Hall, New York

17. Dobbie G, Wu X, Ling TW, Lee ML (2001) ORA-SS: object-relationship-attribute model for semi-structured data. Technical Report TR 21/00, School of Computing, National University of Singapore, Singapore

18. Du W, Lee M-L, Ling TW (2001) XML structures for relational data. In: WISE’01: proceedings of the 2nd international conference on web information systems engineering, Kyoto, Japan. IEEE Computer Society, Los Alamitos, pp 151–160

19. Elmasri R, Navathe SB (2004) Fundamentals of database systems, 4th edn. Addison-Wesley, Reading

20. Embley DW, Mok WY (2001) Developing XML documents with guaranteed “Good” properties. In: ER’01: proceedings of the 20th international conference on conceptual modeling, Yokohama, Japan. Springer, Berlin, pp 426–441

21. Harold ER, Means WS (2004) XML in a nutshell, 3rd edn. O’Reilly, Sebastopol

22. Hoffer JA, Prescott MB, Topi H (2008) Modern database management, 9th edn. Prentice Hall, New York

23. Hunter D, Rafter J, Fawcett J, van der Vlist E, Ayers D, Duckett J, Watt A, McKinnon L (2007) Beginning XML, 4th edn. Wrox Press Ltd., Birmingham

24. Kifer M, Bernstein A, Lewis PM (2006) Database systems: an application-oriented approach, 2nd edn. Addison-Wesley, Reading

25. Lawford M, Wu H (2000) Verification of real-time control software using PVS. In: Proceedings of the 2000 conference on information sciences and systems. Princeton University Press, Princeton, pp TP1–13–TP1–17

26. Lee SU-J (2008) PVS definitions of ORA-SS semantics & PVS definitions of correctness criteria for semistructured data normalization. Technical Report UoA-SE-2008-3, Department of Computer Science, The University of Auckland, Auckland, New Zealand. Available at https://www.se.auckland.ac.nz/uploads/trReports/UoA-SE-2008-3.pdf

27. Lee SU-J, Dobbie G, Sun J, Groves L (2009) Formal verification of semistructured data models in PVS. J Univers Comput Sci 15(1):241–272

28. Ling TW, Lee ML, Dobbie G (2001) Applications of ORA-SS: an object-relationship-attribute data model for semistructured data. In: IIWAS’01: proceedings of the 3rd international conference on information integration and web-based applications and services, Linz, Austria, pp 17–28

29. Ling TW, Lee ML, Dobbie G (2005) Semistructured database design. Springer, New York

30. Ma Z (2005) Fuzzy database modeling with XML. The Kluwer international series on advances in database systems. Springer, New York

31. McHugh J, Abiteboul S, Goldman R, Quass D, Widom J (1997) Lore: a database management system for semistructured data. SIGMOD Rec 26(3):54–66

32. Mo Y, Ling TW (2002) Storing and maintaining semistructured data efficiently in an object-relationaldatabase. In: WISE’02: proceedings of the 3nd international conference on web information systemsengineering. IEEE Computer Society, Los Alamitos, pp 247–256

33. Ni W, Ling TW (2005) Translate graphical XML query language to SQLX. In: DASFAA’05: proceedingsof the 10th international conference on database systems for advanced applications, Beijing, China.Springer, Berlin, pp 907–913

34. Owre S, Shankar N (1993) Abstract datatypes in PVS. Technical Report SRI-CSL-93-9R, ComputerScience Laboratory, SRI International, Menlo Park, CA, USA, December 1993. Extensively revisedJune 1997. Also available as NASA Contractor Report CR-97-206264

35. Owre S, Shankar N (1997) The formal semantics of PVS. Technical Report SRI-CSL-97-2, ComputerScience Laboratory, SRI International, Menlo Park, CA, USA, August 1997

36. Owre S, Rushby JM, Shankar N (1992) PVS: a prototype verification system. In: CADE’92: proceedings of the 11th international conference on automated deduction, Saratoga Springs, NY, USA. Springer, Berlin, pp 748–752

37. Owre S, Rushby J, Shankar N, von Henke F (1995) Formal verification for fault-tolerant architectures: prolegomena to the design of PVS. IEEE Trans Softw Eng 21(2):107–125

38. Owre S, Rushby J, Shankar N, Stringer-Calvert D (1998) PVS: an experience report. In: FM-trends’98: proceedings of international workshop on current trends in applied formal method, Boppard, Germany. Springer, Berlin, pp 338–345

39. Owre S, Shankar N, Rushby JM, Stringer-Calvert DWJ (1999) PVS language reference. Computer Science Laboratory, SRI International, Menlo Park, CA, USA, September 1999

40. Owre S, Shankar N, Rushby JM, Stringer-Calvert DWJ (1999) PVS system guide. Computer Science Laboratory, SRI International, Menlo Park, CA, USA, September 1999

41. Rushby J (2000) Theorem proving for verification. In: MoVEP’00: modelling and verification of parallel processes, Nantes, France. Springer, Berlin, pp 39–57

42. Rushby J, Stringer-Calvert DWJ (1995) A less elementary tutorial for the PVS specification and verification system. Technical Report SRI-CSL-95-10, Computer Science Laboratory, SRI International, Menlo Park, CA, USA, June 1995

43. Shankar N (1993) Verification of real-time systems using PVS. In: CAV’93: proceedings of the 5th international conference on computer aided verification, Elounda, Greece. Springer, Berlin, pp 280–291

44. Shankar N, Owre S, Rushby JM, Stringer-Calvert DWJ (1999) PVS prover guide. Computer Science Laboratory, SRI International, Menlo Park, CA, USA, September 1999

45. Shih TK (2002) Distributed multimedia databases: techniques & applications. Idea Group Publishing, Hershey

46. Simon H (2000) XML: strategic analysis of XML for web application development. Computer Technology Research Corporation, Charleston

47. Srivas M, Rueß H, Cyrluk D (1997) Hardware verification using PVS. In: Kropf T (ed) Formal hardware verification: methods and systems in comparison. Lecture notes in computer science, vol 1287. Springer, Berlin, pp 156–205

48. Sun J, Dong JS, Liu J, Wang HH (2002) A formal object approach to the design of ZML. Ann Softw Eng 13(1–4):329–356

49. Thuraisingham BM (2002) XML databases and the semantic web. CRC Press, Boca Raton

50. Vitt J, Hooman J (1996) Assertional specification and verification using PVS of the steam boiler control system. In: Formal methods for industrial applications: specifying and programming the steam boiler control. Springer, Berlin, pp 453–472

51. Wu X, Ling TW, Lee ML, Dobbie G (2001) Designing semistructured databases using the ORA-SS model. In: WISE’01: proceedings of the 2nd international conference on web information systems engineering, Kyoto, Japan. IEEE Computer Society, Los Alamitos, pp 171–180

52. Wu X, Ling TW, Lee ML, Lee SY, Dobbie G (2001) NF-SS: a normal form for semistructured schemata. In: DASWIS’01: proceedings of international workshop on data semantics in web information systems, Yokohama, Japan. Springer, Berlin, pp 292–305