Diploma Thesis August 3, 2006 XML to RDF Transformation Markus Fehlmann of Aadorf TG, Switzerland (00-912-857) supervised by Prof. Dr. Harald Gall Dr. Gerald Reif Department of Informatics software evolution & architecture lab


  • Diploma Thesis

    Author: Markus Fehlmann
    Period: February 3, 2006 - August 3, 2006

    Software Evolution & Architecture Lab
    Department of Informatics, University of Zurich

  • Acknowledgements

    I am grateful to Gerald Reif whose PhD thesis was an excellent foundation for this work. Many of the now implemented features and ideas originated from fruitful discussions during the last six months. Gerald proved that the sentence used to advertise the thesis, "be best supervised by your advisers", was more than just empty words.

    I also express my gratitude to Professor Harald Gall for giving me the opportunity to write my diploma thesis in the field of the Semantic Web, which I believe will strongly influence the way people access and process information, the main resource of today's information society.

    I thank my parents for making my education possible and for all their encouragement through the years. Needless to say, I could not have done this without them.

    My thanks also go to my fellow students and friends who provided me with useful input and critical suggestions.

  • Abstract

    XML continues to be the primary format for data exchange in distributed systems. However, since several serializations of domain-specific knowledge are possible, XML documents have no immanent semantics. The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. The Resource Description Framework (RDF), which is part of the Semantic Web, formalizes the meaning of information. While many documents are encoded in XML, only few documents are represented in RDF. In his PhD thesis, Reif proposed an algorithm and built a prototype implementation, called WEESA, that generates RDF graphs from arbitrary XML documents by applying processing instructions defined in a mapping.

    In this thesis we propose an object-oriented architecture of the mapping algorithm in order to improve its maintainability, efficiency, and extensibility. In addition, we introduce new mapping directives that simplify the mapping definition process.

    The result of this thesis is a new implementation of the mapping algorithm that incorporates the suggested object-oriented architecture and the additional mapping constructs. Thus, the transformation from XML data to RDF could be simplified to a reasonable extent. A prominent example that benefits from our results is the semantic annotation of Web sites.

  • Zusammenfassung (Summary)

    XML is the principal format for exchanging data in distributed systems. However, XML documents have no immanent semantics, because XML permits different serializations of the same domain-specific knowledge. The Semantic Web offers a framework that allows data to be shared and reused across application and enterprise boundaries. To this end, the Resource Description Framework (RDF), a component of the Semantic Web, formalizes the meaning of information. While many documents are available in XML, only few have an RDF representation so far. In his dissertation, Reif proposed an algorithm that creates RDF representations from arbitrary XML documents by applying processing instructions from a mapping document to the XML document. He also implemented the algorithm as a prototype.

    In this thesis we present an object-oriented architecture of the mapping algorithm in order to improve its maintainability, efficiency, and extensibility. In addition, we extend the mapping vocabulary with directives that simplify the creation of mappings.

    The result of this thesis is a new implementation of the algorithm that combines the object-oriented architecture and the new mapping directives. In this way, the transformation of XML documents into the RDF format could be simplified considerably. An important application area that can benefit from our results is the semantic annotation of Web sites.

  • Contents

    1 Introduction
      1.1 Semantic Web Overview
        1.1.1 Origins and Vision
        1.1.2 Architecture
      1.2 XML based Web Engineering
        1.2.1 Apache Cocoon
      1.3 Problem Statement
      1.4 Structure of the Thesis, Objectives of this Work

    2 Semantic Annotation of XML-based Web Applications
      2.1 Tools for Manual Annotation
      2.2 Embedding and Retrieving Metadata
        2.2.1 GRDDL
        2.2.2 RDFa
      2.3 XML to Metadata Translation
        2.3.1 Bridging the Gap between RDF and XML
        2.3.2 Mapping XML to OWL Ontologies
        2.3.3 Lifting XML Schema to OWL
        2.3.4 Round-tripping between XML and RDF
        2.3.5 XR
      2.4 Conversion of arbitrary document types to RDF
        2.4.1 Data Conversion, Extraction and Record Linkage using XML and RDF Tools in Project SIMILE

    3 Introduction to WEESA
      3.1 WEESA - Web Engineering for Semantic Web Applications
      3.2 Semantic Web Applications with WEESA and Apache Cocoon
        3.2.1 Integration of WEESA in the Apache Cocoon Framework
        3.2.2 WEESA Cocoon Transformer to generate HTML+RDF
        3.2.3 WEESA Cocoon Transformer to generate RDF/XML
      3.3 Building the Knowledge Base of the Semantic Web Application
        3.3.1 Architecture and Maintenance of the WEESA Knowledge Base

    4 Design of the Object-Oriented Architecture
      4.1 General Object-Oriented Design Principles
      4.2 Application of Design Principles to Mapping Algorithm
      4.3 Description of the WEESA Mapping Algorithm
        4.3.1 Resource Dependencies
        4.3.2 Circular References in Resource Definitions
        4.3.3 Mapping Procedure
        4.3.4 Sample Mapping Procedure

    5 WEESA Mapping Features
      5.1 Target Ontology and Source XML File
      5.2 WEESA Mapping Structure
        5.2.1 Method Parameter Attributes
      5.3 Relative Paths
      5.4 Variables
      5.5 If Statement
      5.6 Switch Statement
      5.7 Dictionary
      5.8 Datatype Attribute for Typed Literals
      5.9 Language Section, Language Attributes
        5.9.1 Language Tag Syntax
        5.9.2 Definition of a Default Language
        5.9.3 Language Definition in Triples Section
        5.9.4 Language Precedence and Typed Literals

    6 Conclusions and Future Work
      6.1 Conclusions
      6.2 Summary of Contributions
      6.3 Future Research

    A XML Schemas
      A.1 Mapping Definition
      A.2 WEESA Dictionary Schema
      A.3 Sample Shop Schema

    B XML Files
      B.1 Sample Shop
      B.2 Sample Dictionary
      B.3 Shop Mapping
      B.4 Shop Ontology

    C Java Code of Methods used in Mapping Examples
      C.1 weesa.util.MappingLib.addPrefix
      C.2 weesa.util.MappingLib.avg

    D Content of the CD

    List of Figures

    1.1 Semantic Web Architecture [AvH04]
    1.2 A simple RDF Graph
    1.3 A Blank Node that represents a Shop Item
    1.4 A Cocoon Pipeline using several XML Technologies [Rei05]
    2.1 Recursive Application of the GRDDL Mechanism [Haz05]
    2.2 Operating Sequence of an XML to OWL Transformation as suggested in [BA05]
    2.3 Operating Sequence of an XML to OWL Transformation as suggested in [FZT04]
    3.1 WEESA Design and Instance Levels [Rei05]
    3.2 a) Cocoon Pipeline to integrate RDF into HTML, b) Pipeline to create a separate RDF/XML File [Rei05]
    4.1 Interface Hierarchy of Jena's RDF Nodes
    4.2 Interface Hierarchy of Mapping Elements that generate RDF Nodes and Arcs
    4.3 Triple Class that makes use of Interfaces described in Figure 4.2
    4.4 Const Class Hierarchy that is used similarly for the Const, Method, and XPath Classes
    4.5 ExpressionFactory Class used for Resource, Literal, and Property Creation
    4.6 Integration of new Mapping Directives in Class Hierarchy
    4.7 Extract of the Operator Class Hierarchy
    4.8 Circular Dependencies that are not directly supported by the recursive Mapping Algorithm
    4.9 Simple RDF Graph for the Description of the Mapping Algorithm
    5.1 The Sample Shop Ontology
    5.2 Triple expressing that TrekKing belongs to the Outdoor Sector
    5.3 Triple created using Variables for ID/IDREF Relationship
    6.1 Target RDF Graph that needs careful Consideration with Respect to Variable and XPath Dependencies
    6.2 RDF Graph for Resources with several incoming Edges

    List of Tables

    6.1 Table describing Variable and Relative Path Dependencies of Figure 6.2

    List of Listings

    1.1 Three different XML Representations of the same Fact
    1.2 RDF/XML Serialization Example
    1.3 TriX Serialization Example
    1.4 TriG Serialization Example
    2.1 Inclusion of Metadata for GRDDL Processing
    2.2 RDFa Example
    2.3 XR sample Statement
    3.1 Inclusion of RDF/XML in a Element
    3.2 Linking HTML to Metadata
    4.1 Mapping Algorithm as introduced in [Rei05]
    4.2 Object-Oriented Representation of Mapping Algorithm
    4.3 Triple Definition for Description of Mapping Algorithm
    5.1 Shop.xml Extract
    5.2 WEESA Mapping Structure Example
    5.3 Application of Relative Paths in Section
    5.4 Application of a Relative Path in the Section
    5.5 Variable Definition in Resources
    5.6 Variable Definition in Methods
    5.7 Use of Variables in Triple Definition
    5.8 If Statement enclosing Object only
    5.9 If Statement enclosing Predicate and Object
    5.10 Switch Statement enclosing Predicate and Object
    5.11 Switch Statement enclosing Object only
    5.12 Structure of a Dictionary File
    5.13 Dictionary Statement
    5.14 Datatype Attribute in Expression Element
    5.15 Constant Language Statement
    5.16 XPath Language Statement
    5.17 Switch Language Statement

  • Chapter 1

    Introduction

    1.1 Semantic Web Overview

    This section gives a short overview of the basic principles and mechanisms that the Semantic Web is built on. It follows the "layer cake" approach as it originated from the Semantic Web Roadmap [Ber98].

    1.1.1 Origins and Vision

    The World Wide Web has changed the way people communicate with each other and the way business is conducted [AvH04]. While isolated numerical calculations and information processing were the main tasks for computers for years, this focus has changed with the introduction of the Web. Today, computers can be thought of as entry points to the information highway. Thus, it is not a single machine but rather a collection and the interplay of various heterogeneous systems that satisfy users' demand for information. While it has been advantageous for the propagation and acceptance of the Web, the property described in the following statement also entails some shortcomings: The WWW of today has been developed for the human reader. A machine cannot understand much of the contents of the Web; it can only offer them for people to interpret. However, the automatic interpretation of contents is vital for the development of intelligent Internet applications that are easy to use [Hyv01].

    The lack of data that describes human-readable information in a machine-processable way leads to several challenges that have proved hard to solve using traditional techniques, e.g. from data mining or artificial intelligence. Antoniou and van Harmelen [AvH04] introduce an example that exhibits the shortcomings of the current Web in the domain of search engines:

    High recall, low precision: Even if the main relevant pages are retrieved, they are of little use if too many mildly relevant or irrelevant documents were also retrieved.

    Low or no recall: Often it happens that we do not get any answer for our request, or that important and relevant pages are not retrieved.

    Results are highly sensitive to vocabulary: Often our initial keywords do not get the results we want; in these cases the relevant documents use a different terminology from the original query. This is unsatisfactory because semantically similar queries should return similar results.

    Results are single Web pages: If we need information that is spread over various documents, we must initiate several queries to collect the relevant documents, and then we must manually extract the partial information and put it together.
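The first two shortcomings can be made concrete with the usual definitions of precision (share of retrieved documents that are relevant) and recall (share of relevant documents that are retrieved). The following is a minimal sketch with invented document names and counts; it is not taken from [AvH04].

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 relevant pages exist; the engine returns all of them
# plus 36 irrelevant ones ("high recall, low precision").
relevant_pages = {f"relevant{i}" for i in range(4)}
retrieved_pages = relevant_pages | {f"noise{i}" for i in range(36)}

precision, recall = precision_recall(retrieved_pages, relevant_pages)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented case every relevant page is found, yet only one result in ten is useful to the reader.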

    Due to the aforementioned shortcomings and the growth of the Internet into the world's most important data source, the idea of the Semantic Web was developed and expressed as follows:

    "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." - Tim Berners-Lee, James Hendler, Ora Lassila; May 2001 [BHL01]

    Although this "vision" does not say anything about how the Semantic Web will be implemented, it determines that there is something new added to the existing Web and that this extension adds machine-processable, well-defined meaning to the published data. This something is called explicit metadata.

    Explicit Metadata

    In contrast to HTML tags, which are mainly concerned with the human-readable presentation of content, explicit metadata is concerned with semantics and therefore refers to the meaning of an object. In other words, semantics establishes a relationship between the object represented in the information system and the real-world object. Making this information explicit has the advantage that we do not need to develop superintelligent agents that process natural language in order to be able to reason about data. Metadata is made explicit by RDF, which is the subject of Section 1.1.2. However, some more components are needed to take full advantage of semantic metadata:

    Ontologies

    The term ontology originates from philosophy, where it is used to describe the nature of existence. For example, the classification of things into abstract classes that share a set of properties is a typical ontological commitment [AvH04].

    In IT, however, an ontology is defined as follows:

    An ontology is an explicit and formal specification of a conceptualization. - Thomas R. Gruber; June 1993 [Gru93]

    In general, an ontology formally describes a domain of discourse. This description is made by terms that describe classes of objects of the domain. Classes group resources with similar characteristics. In addition, ontologies define relationships between classes. Several types of information about relationships exist [AvH04]:

    Subclass Relationship: A class C' is a subclass of parent class C if and only if every instance of C' is also an instance of C. In the Semantic Web, classes are allowed to have one or more sub- or parent classes.

    Properties: Properties link two individuals together.

    Value Restrictions: A value restriction restricts the classes applicable to properties with regard to their domain and range.

    Disjointness Statements: Used to define that instances of class C1 and class C2 are disjoint.

    Specification of Logical Relationships between Objects: Introduces the definition of additional constraints, e.g. cardinality constraints.
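Two of these relationship types, subclass relationships and disjointness statements, can be illustrated with a few lines of code. The sketch below is only an illustration: the class names (Tent, OutdoorItem, Item, Book) are hypothetical and not taken from the thesis's shop ontology, and a real ontology language such as OWL offers far more than this.

```python
# Hypothetical class hierarchy: child class -> set of direct parent classes.
SUBCLASS_OF = {
    "Tent": {"OutdoorItem"},
    "OutdoorItem": {"Item"},
    "Book": {"Item"},
}

# Hypothetical disjointness statement: nothing is both a Tent and a Book.
DISJOINT = {frozenset({"Tent", "Book"})}


def superclasses(cls):
    """Return all direct and indirect parent classes of cls."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classes_of(asserted_types):
    """Every instance of a subclass is also an instance of its parents."""
    result = set(asserted_types)
    for cls in asserted_types:
        result |= superclasses(cls)
    return result


def violates_disjointness(asserted_types):
    """True if the instance belongs to two classes declared disjoint."""
    all_types = classes_of(asserted_types)
    return any(pair <= all_types for pair in DISJOINT)


print(sorted(classes_of({"Tent"})))
print(violates_disjointness({"Tent", "Book"}))
```

An individual asserted to be a Tent is implicitly also an OutdoorItem and an Item; deriving such implicit memberships is a very small instance of making implicit knowledge explicit.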

    With the shared understanding that emerges from explicit metadata referring to a certain ontology, one is now able to reason over data provided by Web applications. Reasoning is based on logic, which offers formal languages for expressing knowledge, well-understood formal semantics, and reasoning capabilities to infer conclusions from the given knowledge, thus making implicit knowledge explicit. This knowledge can then be processed by agents. Agents are pieces of software that work autonomously and proactively with regard to the tasks and preferences received from a person [AvH04].

    The next section describes the current Semantic Web technologies that are used to implement the aforementioned concepts.

    1.1.2 Architecture

    The use of layered architectures is a widespread principle in IT to break down a complex problem into subproblems that are easier to solve. Each layer usually relies on parts of the problem that were already processed by lower layers, does some processing, and offers the results to its superior layer. One advantage of this approach is that a partial solution to a complex problem can be offered as soon as the lower levels are implemented. In addition, it is easier to achieve consensus on small steps. Accordingly, the Semantic Web is built on layers, as illustrated in Figure 1.1.

    Figure 1.1: Semantic Web Architecture [AvH04]

    Basic Layers

    Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language [uni06]. Hence, Unicode provides the Semantic Web with the set of characters that matches the requirements of a distributed, platform-independent system.

    A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource [BFM05]. Since the Semantic Web is all about resources, URIs are the means for uniquely identifying them.
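Both basic building blocks can be inspected with standard library functions. The sketch below shows that a character keeps the same Unicode number under different byte encodings, and how the example URI used later in this chapter splits into its components; the choice of the character "ä" is arbitrary.

```python
from urllib.parse import urldefrag, urlsplit

# Unicode: one character, one code point, independent of the byte encoding.
char = "\u00e4"  # the letter a with diaeresis
code_point = ord(char)
utf8_bytes = char.encode("utf-8")
latin1_bytes = char.encode("latin-1")

# URI: a compact character sequence that identifies a resource.
uri = "http://example.com/shop#TrekKing"
parts = urlsplit(uri)
namespace, fragment = urldefrag(uri)

print(hex(code_point), utf8_bytes, latin1_bytes)
print(parts.scheme, parts.netloc, parts.path, parts.fragment)
print(namespace, fragment)
```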

    XML is an eXtensible Markup Language that uses tags to add markup to arbitrary (Unicode) text documents. 'Extensible' means that there is no predefined set of tags, which makes XML a meta markup language. Thus, it is possible to use XML for structuring explicit metadata as well as for modeling ontologies.

    An XML Schema expresses shared vocabularies and allows machines to check whether an XML document conforms to a set of rules regarding the document structure [ST00]. Although XML Schema is not powerful enough to define the structure of statements defined in upper levels of the Semantic Web layer cake, such as RDF or ontologies, it is used to describe the syntax of XML documents that serve as a basis for semantic annotation in this thesis.

    XML namespaces provide a simple method for qualifying element and attribute names used in XML documents by associating them with namespaces identified by URI references [BHL99]. Therefore, namespaces ensure that Semantic Web definitions can be integrated with other XML-based standards.
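The effect of namespace qualification becomes visible when a document is parsed: the parser replaces each prefix with the namespace URI it is bound to. The following sketch uses Python's `xml.etree.ElementTree`; the element names `shop:shop`, `shop:name`, and the mixing with a Dublin Core `dc:title` element are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that mixes two vocabularies via namespaces.
doc = """<shop:shop xmlns:shop="http://example.com/shop#"
                    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <shop:name>TrekKing</shop:name>
  <dc:title>TrekKing</dc:title>
</shop:shop>"""

root = ET.fromstring(doc)

# ElementTree reports qualified names as "{namespace-uri}local-name".
child_tags = [child.tag for child in root]
print(root.tag)
for tag in child_tags:
    print(tag)
```

Although both child elements carry the same text, their qualified names differ, so elements from different vocabularies cannot be confused.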

    The technologies that the basic layers are composed of are recommendations that are accepted by a broad WWW community; thus, many tools are available to create and maintain documents that adhere to them. However, it is up to the superior layers to apply them in such a way that they express statements with inherent, machine-processable semantics.

    RDF and RDF Schema Layer

    In order to 'understand' a language, computers must be able to access it using symbols and structures that are based on an underlying model which defines the semantics of statements. Although XML provides a means for structuring data, it does not provide the reader with its inherent meaning. This is because one fact can be expressed by different serializations. As an example we introduce an online shop called "TrekKing" that sells outdoor products. We further assume that this shop can be uniquely identified by the URI http://example.com/shop#TrekKing. Possible XML representations that assign the name of the shop to the URI that identifies the shop are as follows:

    Listing 1.1: Three different XML Representations of the same Fact
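To illustrate the ambiguity, the following sketch encodes the shop name in three ways and reads it back. The three variants are hypothetical stand-ins chosen for this example (the element and attribute names `shop`, `uri`, and `name` are assumptions), not a reproduction of the listing's original markup.

```python
import xml.etree.ElementTree as ET

# Variant 1: URI as attribute, name as child element.
v1 = '<shop uri="http://example.com/shop#TrekKing"><name>TrekKing</name></shop>'
# Variant 2: URI and name both as attributes.
v2 = '<shop uri="http://example.com/shop#TrekKing" name="TrekKing"/>'
# Variant 3: URI and name both as child elements.
v3 = ('<shop><uri>http://example.com/shop#TrekKing</uri>'
      '<name>TrekKing</name></shop>')


def read_v1(xml_text):
    root = ET.fromstring(xml_text)
    return root.get("uri"), root.findtext("name")


def read_v2(xml_text):
    root = ET.fromstring(xml_text)
    return root.get("uri"), root.get("name")


def read_v3(xml_text):
    root = ET.fromstring(xml_text)
    return root.findtext("uri"), root.findtext("name")


# All three documents state the same fact, but each needs its own reader.
facts = {read_v1(v1), read_v2(v2), read_v3(v3)}
print(facts)
```

The structure alone does not tell a program which reader applies; that knowledge lives outside the document, which is the gap RDF is meant to close.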

    To prevent such ambiguities, metadata of the Semantic Web must be described in a standardized way. This is done by the Resource Description Framework (RDF). RDF provides a means for encoding metadata concerning arbitrary resources. The fundamental concepts of RDF are resources, properties, and statements [AvH04]. A resource can be anything that is identifiable by a URI. Similar to natural language, RDF expressions are based on a subject, i.e. the resource we want to make a statement about, a predicate, and an object. While the predicate defines a property of a resource, the object identifies the value of that property. The statement composed of subject, predicate, and object can also be referred to as a triple.

    The shop example introduced above can be formulated in natural language as follows:

    http://example.com/shop#TrekKing has a shopName with value TrekKing.

    To bring this statement into machine-processable RDF, one needs to be more specific with respect to the property shopName. Thus, the identification of property names is done by URIs defined in an ontology. The term ontology is introduced in Section 1.1.1. In this example, the property shopName is an element of a vocabulary defined at http://example.com/shop. It is therefore good practice (but not imperative) to name the property http://example.com/shop#shopName. The value of the property, in contrast, can be either a resource, identified by a URI, or a constant value, represented by a literal.
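The statement about the shop name can be written down as a data structure with exactly three parts. The sketch below is an assumption-laden illustration: the `Triple` type and the `to_ntriples` helper are hypothetical names, and the output is a simplified N-Triples-like line without any character escaping.

```python
from typing import NamedTuple

SHOP = "http://example.com/shop#"


class Triple(NamedTuple):
    subject: str             # URI of the resource the statement is about
    predicate: str           # URI of the property
    obj: str                 # URI of a resource, or the text of a literal
    object_is_literal: bool  # distinguishes the two kinds of objects


def to_ntriples(triple):
    """Render a triple as one simplified N-Triples-like line (no escaping)."""
    if triple.object_is_literal:
        obj = f'"{triple.obj}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


shop_name = Triple(SHOP + "TrekKing", SHOP + "shopName", "TrekKing", True)
print(to_ntriples(shop_name))
```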

    RDF models statements as nodes and arcs in a graph. In this notation, a statement is represented by a node for the subject, a node for the object, and an arc for the predicate, directed from the subject node to the object node.

    The RDF statement above would be represented by the graph shown in Figure 1.2 [MM04].

    Figure 1.2: A simple RDF Graph

    Notice that the object can either be a literal, as in our example, or a resource. If the object is a resource, it can be the subject of additional statements that can have their own predicates and objects. Thus, cycles between different resources are possible.
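The graph view, including the possibility of cycles, can be sketched with an adjacency list built from triples. The properties `ownedBy` and `owns` and the resource `Alice` are hypothetical and serve only to construct a cycle; literal objects are end points and never start new arcs.

```python
from collections import defaultdict

SHOP = "http://example.com/shop#"

# Each object is tagged as "literal" or "resource".
triples = [
    (SHOP + "TrekKing", SHOP + "shopName", ("literal", "TrekKing")),
    (SHOP + "TrekKing", SHOP + "ownedBy", ("resource", SHOP + "Alice")),
    (SHOP + "Alice", SHOP + "owns", ("resource", SHOP + "TrekKing")),
]


def build_graph(triple_list):
    """Map every subject node to its outgoing arcs (predicate, kind, value)."""
    graph = defaultdict(list)
    for subject, predicate, (kind, value) in triple_list:
        graph[subject].append((predicate, kind, value))
    return graph


def has_cycle(graph):
    """Depth-first search over resource-valued arcs only."""
    unvisited, in_progress, done = 0, 1, 2
    state = defaultdict(int)

    def visit(node):
        state[node] = in_progress
        for _predicate, kind, value in graph.get(node, ()):
            if kind != "resource":
                continue  # literals cannot be subjects
            if state[value] == in_progress:
                return True
            if state[value] == unvisited and visit(value):
                return True
        state[node] = done
        return False

    return any(state[node] == unvisited and visit(node) for node in list(graph))


print(has_cycle(build_graph(triples)))
print(has_cycle(build_graph(triples[:2])))
```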

    Note further that RDF offers quite a few other features, such as container elements, which are used to collect a number of resources or attributes about which we want to make statements as a whole [AvH04], or reification, which is a powerful means for making statements about statements. For an introduction to these RDF concepts, the reader is referred to [MM04, AvH04, BG04].

    However, we present the following three additional RDF concepts because of their relevance for WEESA:

    Blank Nodes

    We now extend the online shop example to the items that the store offers. To make this information available in RDF, one could create triples with the subject http://example.com/shop#TrekKing, the predicate http://example.com/shop#offersItem, and, as object, an item that is identified by a URI. All the properties belonging to the item can then be attached to that resource. However, since the URI of that item will never be referred to directly from outside the graph, a universal identifier is not needed. Thus, RDF offers the concept of blank nodes (also known as anonymous resources). Blank nodes allow the creation of resources without needing a URIref, since the node itself provides the necessary connectivity between the various other parts of the graph [MM04].

    Figure 1.3 shows an application of a blank node. The blank node that represents a shop item exhibits the datatype properties http://example.com/shop#itemName and http://example.com/shop#averageRating. The range of the itemName property is of type string, whereas the averageRating is expressed by float values. In addition, the offered item has the object property http://example.com/shop#category.
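A blank node can be sketched as a node whose label is generated locally and is meaningful only inside one graph. In the example below the `_:b1` label style follows the common N-Triples convention; the helper `new_blank_node` is a hypothetical name, and the item values ("overnighter", 3.5, the tent category) are the ones used for this item later in the chapter.

```python
import itertools

SHOP = "http://example.com/shop#"
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

_ids = itertools.count(1)


def new_blank_node():
    """Create a label that is unique within this graph only."""
    return f"_:b{next(_ids)}"


item = new_blank_node()

# The blank node connects the shop with the properties of the offered item.
triples = [
    (f"<{SHOP}TrekKing>", f"<{SHOP}offersItem>", item),
    (item, f"<{SHOP}itemName>", '"overnighter"'),
    (item, f"<{SHOP}averageRating>", f'"3.5"^^<{XSD_FLOAT}>'),
    (item, f"<{SHOP}category>", f"<{SHOP}tent>"),
]

for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```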


    Figure 1.3: A Blank Node that represents a Shop Item

    Typed Literals

    To ensure that programs know how to interpret literal values, it is necessary to provide them with explicit information about the type they belong to. In the example above, the averageRating property is a float value, which permits operations different from those on the values of the itemName property, which is of type string. In RDF, typed literals are used to provide this kind of information. Although the use of any externally defined data typing scheme is allowed in RDF documents, the most widely used datatyping scheme is the one offered by the XML Schema Datatype Definition [AvH04].
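The practical effect of a datatype can be shown by interpreting the same lexical form with and without type information. This is a minimal sketch: the `PARSERS` table covers only three XML Schema datatypes and the function name `literal_value` is hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumed mapping from a few XML Schema datatypes to Python constructors.
PARSERS = {
    XSD + "string": str,
    XSD + "integer": int,
    XSD + "float": float,
}


def literal_value(lexical_form, datatype=None):
    """Interpret a literal; untyped (plain) literals stay plain strings."""
    if datatype is None:
        return lexical_form
    return PARSERS[datatype](lexical_form)


rating = literal_value("3.5", XSD + "float")
untyped = literal_value("3.5")

print(rating + 1)      # numeric addition
print(untyped + "1")   # string concatenation
```

With the datatype, "3.5" takes part in arithmetic; without it, the only safe interpretation is a character string.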

    Languages: xml:lang

    The XML recommendation [BPS+04] introduces an xml:lang attribute defined as follows: "In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document." This XML attribute is also available in RDF to express the language of untyped literals. The xml:lang attribute can be used on any untyped literal node element to indicate that the included content is in a given language. Typed literals are not affected by this attribute [Bec04]. A more detailed description of the syntax of valid language values is given in Section 5.9.1.
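As a small illustration, the following sketch reads language-tagged literals from an RDF/XML fragment. The property `shop:description` and its values are hypothetical; the only fixed part is the reserved XML namespace URI under which parsers report the `xml:lang` attribute.

```python
import xml.etree.ElementTree as ET

# The xml prefix is bound to this namespace by the XML specifications.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
SHOP = "http://example.com/shop#"

doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:shop="http://example.com/shop#">
  <rdf:Description rdf:about="http://example.com/shop#TrekKing">
    <shop:description xml:lang="en">Outdoor equipment</shop:description>
    <shop:description xml:lang="de">Outdoor-Artikel</shop:description>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(doc)

# Collect the untyped literals of the hypothetical property by language tag.
descriptions = {
    element.get(XML_LANG): element.text
    for element in root.iter(f"{{{SHOP}}}description")
}
print(descriptions)
```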

    So far, we have not said anything about the actual representation of RDF triples in information systems. Since drawing graphs is not always a convenient way to formulate statements, several kinds of serializations exist.

    RDF Serializations

    As indicated by the Semantic Web layer cake, XML provides sufficient concepts for serializing RDF graphs. However, as we have demonstrated in Section 1.1.2, XML allows several representations of the same fact. Thus, it is not surprising that several RDF serializations in XML exist. In addition, there are a number of RDF serializations that are not based on XML. The following list gives a short overview of the most common serializations:

    RDF/XML RDF/XML is the widespread serialization format for RDF graphs. Unlike Triples no-tation (see below), which is intended as a shorthand notation, RDF/XML is the normativesyntax for writing RDF [MM04]. The success of RDF/XML lies in its early availability andthe number of tools that support RDF/XML processing. Therefore, RDF/XML is the recom-mended syntax for applications to exchange RDF information [GB04].

    The basic principle of RDF/XML files is the mapping of RDF nodes and arcs into XML elements, attributes, element content, and attribute values. The XML namespace mechanism is used to abbreviate potentially long URIs. Thus, most properties and object nodes consist of two parts: a namespace prefix that is defined in the RDF/XML root element, and a local name that denotes the actual element or attribute name. The concatenation of the namespace URI and the local name forms the original URI of a node or an edge. URIs that define subject nodes are stored in XML attribute values. Literal values can be encoded either as text content or attribute values of the predicate element. The result of this serialization is a sequence of elements that alternately define RDF nodes and arcs. Therefore, RDF/XML is also called a striped syntax.

    We illustrate the syntax of RDF/XML with the serialization of the sample RDF graph shown in Figure 1.3: A shop whose id is http://example.com/shop#TrekKing offers an item named "overnighter". The average user rating of that item is the float value "3.5" and the category it belongs to has the resource id http://example.com/shop#tent:

    3.5overnighter

    Listing 1.2: RDF/XML Serialization Example

    In spite of its popularity, RDF/XML also has some drawbacks: [CS04]

    • It is not possible to define a subset of RDF/XML that can represent all RDF graphs and can be described by a DTD or an XML Schema.

    • RDF embedded in XHTML and other XML documents is hard (i.e. impossible) to validate.
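    The striped structure described above can be sketched in a few lines of Python. The following is a minimal illustration, not the thesis's original listing: it assumes that the properties offersItem, averageRating, itemName, and category live in the http://example.com/shop# namespace and that the item is a blank node, and it uses only the standard library to emit one possible RDF/XML serialization of the sample graph.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SHOP = "http://example.com/shop#"  # assumed namespace for all shop properties
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

# Bind the prefixes that should appear on the root element.
ET.register_namespace("rdf", RDF)
ET.register_namespace("shop", SHOP)


def q(namespace, local_name):
    """Build an ElementTree qualified name from namespace URI and local name."""
    return "{%s}%s" % (namespace, local_name)


def build_shop_graph():
    """Return the root element of an RDF/XML document for the sample graph."""
    root = ET.Element(q(RDF, "RDF"))
    # Node element: the subject URI is stored in an attribute value.
    shop = ET.SubElement(root, q(RDF, "Description"),
                         {q(RDF, "about"): SHOP + "TrekKing"})
    # Property element (arc) followed by a nested node element: the "stripes".
    offers = ET.SubElement(shop, q(SHOP, "offersItem"))
    item = ET.SubElement(offers, q(RDF, "Description"))  # blank node, no rdf:about
    # Typed literal encoded as text content of the property element.
    rating = ET.SubElement(item, q(SHOP, "averageRating"),
                           {q(RDF, "datatype"): XSD_FLOAT})
    rating.text = "3.5"
    name = ET.SubElement(item, q(SHOP, "itemName"))
    name.text = "overnighter"
    # Object that is a resource: referenced through rdf:resource.
    ET.SubElement(item, q(SHOP, "category"), {q(RDF, "resource"): SHOP + "tent"})
    return root


rdf_xml = ET.tostring(build_shop_graph(), encoding="unicode")
print(rdf_xml)
```

    Node elements (rdf:Description) and property elements alternate along each path from the root, which is the alternation the term "striped" refers to.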

    Notation 3 Notation 3 (N3) is a non-XML-based language that focuses on compactness and readability. It also extends RDF/XML to allow greater expressiveness. Notation 3 was designed to optimize the expression of data and logic in the same language with respect to RDF. In addition to the serialization of RDF graphs, Notation 3 offers means for the definition of rules that allow the integration of RDF statements. A further goal of the language is to be as natural and symmetrical as possible. [Ber]

    Notation 3 achieves the aforementioned goals with the following features: [Ber]

    • URI abbreviation using prefixes which are bound to a namespace (using @prefix), a bit like in XML

    • Repetition of another object for the same subject and predicate using a comma ","

    • Repetition of another predicate for the same subject using a semicolon ";"

    • Formulae allowing N3 graphs to be quoted within N3 graphs using { and }

    • Variables and quantification to allow rules, etc. to be expressed

    Since the huge expressiveness of N3 is not always necessary, several N3 subsets exist:

    N-Triples N-Triples is a line-based, plain text format for encoding an RDF graph [GB04]. The expressiveness of N-Triples is limited to RDF 1.0. Statements cannot be nested, therefore only one triple can be defined per line. However, the simplicity and restrictions of N-Triples facilitate the development of tools that process RDF graphs serialized in N-Triples. Since N-Triples is a subset of N3, tools that process N3 documents can also be used to handle N-Triples documents.

    The basic N-Triples syntax is that full URIs (e.g. needed for identifying the subject resource) are enclosed in angle brackets. The subject is followed by the URI of the predicate. If the object is a literal, it is enclosed in quotation marks; if it is a resource, its full URI is enclosed in angle brackets. Each statement is terminated by a full stop. In addition to full URIs, a shorthand way of writing URIs exists: this shorthand substitutes an XML qualified name without angle brackets as an abbreviation for a full URI reference. A qualified name contains a prefix that has been assigned to a namespace URI, followed by a colon, and then a local name. [MM04]

    Assuming that shop is used as prefix for http://example.com/shop#, the graph illustrated in Figure 1.2 can therefore be written as:

    shop:TrekKing shop:shopName "TrekKing" .

    The definition of a datatype in N-Triples notation is done by appending two circumflexes to the literal, followed by the full URI of the datatype. Thus, the fact that the object belonging to the averageRating property is of datatype float can be made explicit by writing the object of a triple as follows:

    "3.5"^^<http://www.w3.org/2001/XMLSchema#float>

    Similar to the definition of a datatype, the allocation of a language is done by appending an @ symbol to the literal, followed by the language identifier. Assuming that the itemName "overnighter" introduced in the example above is written in English, we can represent that literal in N-Triples notation as follows (_:b1 identifies a blank node):

    _:b1 <http://example.com/shop#itemName> "overnighter"@en .

    Turtle Turtle (Terse RDF Triple Language) is an extension of the N-Triples format. Compared to N-Triples, Turtle supports the N3 features "," and ";" as well as the nesting of statements, i.e. all the triples belonging to a certain subject can be expressed in one statement. Moreover, Turtle allows the definition of collections to summarize statements with the same subject and predicate but several objects in one statement.

    N3 RDF N3 RDF is the subset of N3 that omits logical rules and formulae. Thus, it is neither possible to explicitly define logical implications nor to define statements that have an RDF graph as object.

    N3 Rules Compared to N3 RDF, N3 Rules introduces variables and the possibility to express logical statements such as "If x has a parent y and y is a brother of z, then x has an uncle z". An N3 representation of that rule is written as follows [Ber06]:

    { ?x fam:parent ?y. ?y fam:brother ?z } => { ?x fam:uncle ?z } .
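    The N-Triples conventions described above (angle brackets around full URIs, "^^" for datatypes, "@" for language tags, one statement per line ending in a full stop) can be sketched as a small formatter. This is an illustrative Python sketch under stated assumptions, not a complete N-Triples writer: the helper names are hypothetical, the shop properties are assumed to live in http://example.com/shop#, and only backslashes and quotation marks are escaped.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
SHOP = "http://example.com/shop#"  # assumed namespace of the shop vocabulary


def uri(value):
    """Full URIs are enclosed in angle brackets."""
    return "<%s>" % value


def literal(text, datatype=None, lang=None):
    """Quote a literal and append either ^^<datatype> or @lang, never both."""
    if datatype and lang:
        raise ValueError("a typed literal cannot carry a language tag")
    # Minimal escaping only; control characters are not handled in this sketch.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    quoted = '"%s"' % escaped
    if datatype:
        return quoted + "^^" + uri(datatype)
    if lang:
        return quoted + "@" + lang
    return quoted


def statement(subject, predicate, obj):
    """One triple per line, terminated by a full stop."""
    return "%s %s %s ." % (subject, predicate, obj)


lines = [
    statement("_:b1", uri(SHOP + "averageRating"),
              literal("3.5", datatype=XSD + "float")),
    statement("_:b1", uri(SHOP + "itemName"),
              literal("overnighter", lang="en")),
]
print("\n".join(lines))
```

    Rejecting a literal that has both a datatype and a language tag mirrors the statement above that typed literals are not affected by xml:lang.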

    TriX Triples in XML (TriX) is another RDF serialization that was developed with a focus on simplicity and human-readability. As an additional feature, TriX supports the naming of graphs. Named Graphs allow publishers to sign their graphs whereas information consumers can evaluate specific graphs using task-specific trust policies and act on information from those named graphs that they accept. TriX documents contain an element that names the graph, followed by any number of triple elements as children [CBHS05]. Thus, TriX documents are not "striped" as RDF/XML serializations are. Listing 1.3 shows the example introduced above in TriX syntax: a shop with id http://example.com/shop#TrekKing sells an item (blank node) whose name is "overnighter" and whose average rating is the float value "3.5". The item belongs to the category http://example.com/shop#tent.

    http://example.org/shopGraph

    ex:TrekKing ex:offersItem x

    x ex:averageRating 3.5

    x ex:itemName overnighter

    x ex:category http://example.com/shop#tent

    Listing 1.3: TriX Serialization Example

    An advantage of TriX compared to Notation 3 is that it can easily be processed by XML tools. Thus, XPath expressions can be formulated to identify triples or triple elements, and XSLT transformations can be applied to transform TriX documents. It is even possible to transform TriX documents to RDF/XML using XSLT instructions. Compared to RDF/XML, a further advantage of TriX is the possibility of naming graphs. However, one of its disadvantages is that it is relatively new and therefore not many tools exist that are able to process TriX serializations.
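    The claim that ordinary XML tooling suffices to query TriX can be illustrated with path expressions from Python's standard library. The sample document below is a reconstruction for illustration only: the element names (graph, uri, triple, id, plainLiteral, typedLiteral) and the namespace URI are given as recalled from the TriX specification and should be treated as assumptions, as should the full property URIs.

```python
import xml.etree.ElementTree as ET

# Assumed TriX namespace URI; check it against the TriX specification.
NS = {"t": "http://www.w3.org/2004/03/trix/trix-1/"}

SAMPLE = """<TriX xmlns="http://www.w3.org/2004/03/trix/trix-1/">
  <graph>
    <uri>http://example.org/shopGraph</uri>
    <triple>
      <uri>http://example.com/shop#TrekKing</uri>
      <uri>http://example.com/shop#offersItem</uri>
      <id>x</id>
    </triple>
    <triple>
      <id>x</id>
      <uri>http://example.com/shop#averageRating</uri>
      <typedLiteral datatype="http://www.w3.org/2001/XMLSchema#float">3.5</typedLiteral>
    </triple>
    <triple>
      <id>x</id>
      <uri>http://example.com/shop#itemName</uri>
      <plainLiteral>overnighter</plainLiteral>
    </triple>
    <triple>
      <id>x</id>
      <uri>http://example.com/shop#category</uri>
      <uri>http://example.com/shop#tent</uri>
    </triple>
  </graph>
</TriX>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.split("}", 1)[-1]


def triples(document):
    """Return every triple as a tuple of (term kind, term value) pairs."""
    root = ET.fromstring(document)
    result = []
    for triple in root.findall("./t:graph/t:triple", NS):
        result.append(tuple((local_name(term.tag), term.text) for term in triple))
    return result


def literal_objects(document):
    """Select only the literal terms, using one path expression per kind."""
    root = ET.fromstring(document)
    found = root.findall("./t:graph/t:triple/t:plainLiteral", NS)
    found += root.findall("./t:graph/t:triple/t:typedLiteral", NS)
    return [node.text for node in found]


for triple in triples(SAMPLE):
    print(triple)
```

    Because every triple is a flat element with exactly three children, a single path expression selects all subjects, predicates, or objects, which is much harder to achieve on the nested stripes of RDF/XML.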

    TriG TriG is another plain text format for serializing named graphs and RDF datasets. It combines the possibility of naming graphs with the compactness and readability of Turtle. In addition to the features offered in Turtle, TriG extends Turtle with '{' and '}' to group triples into multiple graphs and to precede named graphs by their names [Biz05]. The fact that I make the statement "The thing with the resource id http://example.com/shop#TrekKing has the http://example.com/shop#shopName TrekKing" on the 1st of July 2006 is expressed in TriG as follows:

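    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix swp: <http://www.w3.org/2004/03/trix/swp-1/> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix : <http://example.com/shop#> .
    :G1 { :TrekKing :shopName "TrekKing"^^xsd:string . }
    :G2 { :G1 swp:assertedBy _:w1 .
          _:w1 swp:authority :Markus .
          _:w1 dc:date "2006-07-01"^^xsd:date . }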

    Listing 1.4: TriG Serialization Example

    RDF Schema

    RDF Schema is a language for describing simple RDF vocabularies. RDF does not make assumptions about any particular application domain, nor does it define the semantics of any domain. It is up to the user to do so in RDF Schema (RDFS). [AvH04]

    RDF Schema itself is built on RDF triples, which means that any RDFS document is also an RDF document that can be represented in RDF/XML notation. The RDF Schema elements that can be used to describe a domain of discourse are the following:¹

    • Classes: Classes specify the things that belong to a certain domain. Given the outdoor shop example above, one could identify Store as the class that defines the type of individual shops. rdfs:Class is the class of resources that are RDF classes.

    • Properties: Once classes are specified they can be linked using properties. RDF Schema makes it possible to impose the following two property restrictions: the domain that the property can be applied to and the range that determines the set of valid values that a property can hold. RDFS restricts property domains with the property rdfs:domain and the range of a property with rdfs:range. rdf:Property is the class of RDF properties.

    If we wanted to represent the fact that TrekKing is an instance of the Store class, we can now define this in two steps:

    1. Creation of an RDF Schema that defines the Store (and other classes and properties) using the rdfs:Class, rdfs:Literal, and rdf:Property classes.²

    2. Application of the created Store class to a concrete resource that belongs to it. This is done by creating the following triple:

    shop:TrekKing rdf:type shop:Store .

    The fact that store instances can have a property shopName is defined in an RDF Schema in Triples notation as follows:

    shop:shopName rdfs:domain shop:Store .

    ¹ In the following examples, the XML namespace prefix rdfs: indicates that a term is part of the RDFS language specification (http://www.w3.org/2000/01/rdf-schema#). The prefix rdf: qualifies terms that are part of the RDF specification (http://www.w3.org/1999/02/22-rdf-syntax-ns#). shop is the prefix for http://example.com/shop#

    ² See [BG04] for the full list of RDF/RDFS classes and properties.

    • Class Hierarchies, Inheritance, Property Hierarchies: RDF Schema provides a means for creating class hierarchies as introduced in Section 1.1.1. The Store class could therefore be subclassed by an OnlineStore and a TraditionalStore class. As is known from object-oriented programming, class properties such as domain and range are inherited by subclasses. This means that the property shopName is applicable to OnlineStores and TraditionalStores once it has been defined for their superclass Store. Subclassing is done by applying the rdfs:subClassOf property to the subclass, with the id of the superclass as its value. An example in Triples notation:

    shop:OnlineStore rdfs:subClassOf shop:Store .

    In addition to that, it is also possible to create property hierarchies. The property onlineShopName could be a subproperty of shopName with the domain OnlineStore. Such a relationship can be established using rdfs:subPropertyOf as follows:

    shop:onlineShopName rdfs:subPropertyOf shop:shopName .

    In general, P is a subproperty of Q if Q(x, y) whenever P(x, y) [AvH04].

    Different from traditional object-oriented class-property relationships, properties are defined globally, that is, they are not encapsulated as attributes in class definitions. Thus, it is possible to define new properties that apply to an existing class without changing that class.

    Since the expressivity of RDF Schema is too limited to create complex ontologies, the need for a more expressive ontology language arose. The superior ontology layer offers solutions to that requirement.
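    What a reasoner can derive from the schema statements above can be sketched with a few inference rules applied until no new triple appears. The following Python sketch is illustrative and deliberately incomplete: it implements only subclass and subproperty transitivity, subproperty and subclass propagation, and the domain rule, it represents terms as plain prefixed strings, and the sample triples beyond those given in the text are assumptions.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"


def rdfs_closure(triples):
    """Apply a handful of RDFS inference rules until nothing new is derived."""
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                # Class and property hierarchies are transitive.
                if p == p2 and p in (SUBCLASS, SUBPROP) and o == s2:
                    new.add((s, p, o2))
                # Q(x, y) holds whenever P(x, y) and P is a subproperty of Q.
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
                # Instances of a subclass are also instances of the superclass.
                if p == RDF_TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, RDF_TYPE, o2))
                # Using a property makes the subject an instance of its domain.
                if p2 == DOMAIN and p == s2:
                    new.add((s, RDF_TYPE, o2))
        if new <= graph:
            return graph
        graph |= new


schema = {
    ("shop:OnlineStore", SUBCLASS, "shop:Store"),
    ("shop:shopName", DOMAIN, "shop:Store"),
    ("shop:onlineShopName", SUBPROP, "shop:shopName"),
    ("shop:onlineShopName", DOMAIN, "shop:OnlineStore"),
}
data = {("shop:TrekKing", "shop:onlineShopName", "TrekKing")}

inferred = rdfs_closure(schema | data)
for triple in sorted(inferred - schema - data):
    print(triple)
```

    From a single data triple the sketch derives that TrekKing has a shopName and is both an OnlineStore and a Store, without the Store class ever having been changed, which is the practical consequence of properties being defined globally.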

    Ontology Layer

    The requirements for ontology languages such as subclass relationships, properties, value restrictions, disjointness statements, and specification of logical relationships between objects were outlined in Section 1.1.1. However, the formal semantics for the primitives defined in RDF Schema are not provided, and the expressivity of these primitives is not enough for full-fledged ontological modeling and reasoning [BKD+00]. In the Semantic Web of today the Web Ontology Language (OWL) is considered the preferred ontology definition language.³ OWL adds more vocabulary for describing properties and classes, among others: relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes. [MvH04]

    The OWL Web Ontology Language Overview characterizes the three types of the language as follows [MvH04]:

    OWL Lite OWL Lite supports those users primarily needing a classification hierarchy and simple constraints. For example, while it supports cardinality constraints, it only permits cardinality values of 0 or 1. It should be simpler to provide tool support for OWL Lite than its more expressive relatives, and OWL Lite provides a quick migration path for thesauri and other taxonomies. OWL Lite also has a lower formal complexity than OWL DL.

    ³ OWL originated from the DAML+OIL Web Ontology Language. Similar to OWL, DAML+OIL is a semantic markup language for Web resources. DAML+OIL is compatible with W3C standards such as RDF and RDF Schema. DAML+OIL was built from the original DAML ontology language DAML-ONT (October 2000) in an effort to combine many of the language components of OIL. [CvHH+01]


    OWL DL OWL DL supports those users who want the maximum expressiveness while retaining computational completeness (all conclusions are guaranteed to be computable) and decidability (all computations will finish in finite time). OWL DL includes all OWL language constructs, but they can be used only under certain restrictions (for example, while a class may be a subclass of many classes, a class cannot be an instance of another class). OWL DL is so named due to its correspondence with description logics, a field of research that has studied the logics that form the formal foundation of OWL.

    OWL Full OWL Full is meant for users who want maximum expressiveness and the syntactic freedom of RDF with no computational guarantees. For example, in OWL Full a class can be treated simultaneously as a collection of individuals and as an individual in its own right. OWL Full allows an ontology to augment the meaning of the pre-defined (RDF or OWL) vocabulary. It is unlikely that any reasoning software will be able to support complete reasoning for every feature of OWL Full.

    Top Layers

    The top layers (Logic, Proof, and Trust) are currently being researched, and simple application demonstrations are being constructed. The Logic layer enables the writing of rules, while the Proof layer executes the rules and, together with the Trust layer, provides a mechanism that lets applications evaluate whether or not to trust a given proof. [Hyv01]

    1.2 XML based Web Engineering

    Web engineering can be described as the "application of systematic, disciplined and quantifiable approaches to the cost-effective development and evolution of high-quality applications in the World Wide Web." [GG00]

    Given the definition above, the following aspects of conventional software engineering can also be found in Web engineering: Methods are used to describe and implement requirements for distribution, user interfaces and communications. Tools support the development, maintenance, and analysis of Web based systems. Standards define the formats and contents for the management and regulation of Web sites. Measurements are used for benchmarking Web specific quantities and qualities, and experiences provide a basis for the qualitatively appropriate development of Web based systems [DLWZ03]. Due to their universal applicability, XML technologies support Web engineering to a large extent. The tools and standards aspects in particular profit from the possibilities that XML processing offers. Prominent examples are the transformation of XML documents using XSLT, the selection of document elements by XPath expressions, and the visualization of graphs in SVG.

    In our opinion, the separation of concerns is crucial for the development and management of dynamic Web applications. Several Web development frameworks exist that make use of XML technologies to process the content, the logic, and the layout of Web pages. This section takes the Apache Cocoon framework as an example to describe how XML technologies can be used to realize the above mentioned Web engineering principles when generating Web applications.

    1.2.1 Apache Cocoon

    Apache Cocoon is a Web development framework built around the concepts of separation of concerns and component-based Web development [coc06]. One of Cocoon's strengths is the strict separation of concerns for defining Web applications: Content, style, logic, and management functions are divided into separate steps of a Cocoon Web application. The underlying mechanism that processes and links the various concerns is a pipeline as depicted in Figure 1.4.

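    [Figure 1.4 diagram: within the Cocoon pipeline, a Generator reads the XML source (a schema-valid XML document); its output passes through Transformers for the business logic (XSP, JSP, Filter, SQL, etc.) and an XSLT Transformer that applies an XSL Stylesheet; a Serializer then produces the HTML page. Adjacent components are connected by streams of SAX events.]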

    Figure 1.4: A Cocoon Pipeline using several XML Technologies [Rei05]

    A Cocoon pipeline consists of the following components:

    Generator The generator reads the XML source file and generates a stream of SAX events from it.

    Transformers The initial stream of SAX events is then passed to the first transformer. A transformer takes the SAX events and transforms them according to business logic rules. The resulting stream of SAX events is then passed to the next transformer. Once all the steps of the business logic have been processed, the stream is passed to the XSLT transformer. The XSLT transformer applies an XSL Stylesheet to the received data to define its graphical appearance.

    Serializer The data stream is then taken by a serializer that transforms the data to the format expected by the client, e.g. HTML or PDF. As a last step, the serializer sends the result to the client.
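    The division of labour between generator, transformers, and serializer can be sketched as a chain of Python generators that pass events along. This is a conceptual sketch only and does not use Cocoon's actual API: the event tuples, the function names, the sample document, and the renaming step that stands in for the XSLT transformer are all assumptions made for illustration.

```python
import xml.sax
from xml.sax.saxutils import escape


class _Recorder(xml.sax.ContentHandler):
    """Collect SAX callbacks as simple (kind, value) event tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))


def generator(xml_source):
    """Generator: read the XML source and emit a stream of events."""
    recorder = _Recorder()
    xml.sax.parseString(xml_source.encode("utf-8"), recorder)
    yield from recorder.events


def text_transformer(events, function):
    """Business-logic step: rewrite text events, pass the rest through."""
    for kind, value in events:
        yield (kind, function(value) if kind == "text" else value)


def rename_transformer(events, mapping):
    """Presentation step standing in for XSLT: map element names to HTML."""
    for kind, value in events:
        if kind in ("start", "end"):
            value = mapping.get(value, value)
        yield (kind, value)


def serializer(events):
    """Serializer: turn the final event stream into the client format."""
    parts = []
    for kind, value in events:
        if kind == "start":
            parts.append("<%s>" % value)
        elif kind == "end":
            parts.append("</%s>" % value)
        else:
            parts.append(escape(value))
    return "".join(parts)


SOURCE = "<shop><shopName>TrekKing</shopName></shop>"
stream = generator(SOURCE)
stream = text_transformer(stream, lambda text: text + " online store")
stream = rename_transformer(stream, {"shop": "body", "shopName": "h1"})
html = serializer(stream)
print(html)
```

    Each stage only sees the event stream of its predecessor, so a further concern can be added by inserting one more transformer into the chain without touching the other stages.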

    Notice that the aforementioned concerns do not take semantic annotation of Web sites into account. Chapter 3 describes how WEESA introduces a new metadata concern and how existing Web development frameworks can be extended to support Semantic Web engineering.

    1.3 Problem Statement

    WEb Engineering for Semantic web Applications (WEESA) aims at combining the benefits of the Semantic Web with the concepts from conventional Web engineering. With this goal in mind, Reif developed a WEESA prototype that facilitates the generation of RDF according to mapping instructions and also incorporates concepts from XML based Web engineering. This prototype implementation proved that an effective annotation of Web sites is possible. Two key elements of Reif's annotation approach are the following: [Rei05]

    • WEESA identifies semantic annotation as a concern on its own. Therefore, it takes the traditional concerns for content, graphical appearance, and application logic and adds semantic annotation to the Web engineering process.

    • In contrast to many approaches described in Chapter 2, WEESA offers powerful mechanisms to overcome gaps that can appear when generating metadata: The so-called granularity problem arises when information required by the ontology cannot be found directly in the XML document but can be computed from the information available [Rei05]. This way, users are not bound to ontologies that are derived from the structure of the source XML document.

    Although the first version of the WEESA prototype was a useful proof of concept, it also exhibited some shortcomings:

    Design aspects In order to quickly achieve a prototypical XML to RDF transformation, early implementations were built on the Java Architecture for XML Binding (JAXB). JAXB allows Java developers to create and edit XML using a representation thereof in Java code. While this is helpful for rapid development, JAXB has the drawback that it requires the XML Schema to remain unchanged. Changes in the XML Schema cause the JAXB binding compiler to generate a new Java content tree. Therefore, one is likely to separate functionality from data processing, which does not comply with the object-oriented principle of encapsulation. Since WEESA aims at being open for extensions that are likely to affect the XML Schema of the mapping definition, a more flexible means for addressing the mapping elements is needed.

    Usability The WEESA method expression (see Section 5.2) offers a powerful means for transforming XPath results according to arbitrary rules. However, the generation of methods requires knowledge of Java and therefore increases the complexity of mapping definitions. The extension of the mapping vocabulary by terms that would otherwise need a method call shifts responsibilities from Java into the mapping process and therefore increases the expressiveness of mapping definitions.

    In order to relate two XPath expressions, the former WEESA implementation introduced variables that can be shared between various resources. However, most of the cases that are covered with variables can also be handled by well-known relative XPath expressions. Therefore, the introduction of relative XPaths would ease the definition of mapping statements.

    Coverage of RDF features Up to this thesis, the WEESA prototype did not take RDF features such as datatypes or the language of a literal into account. Thus, the mapping output exhibited some restrictions with regard to its expressiveness.
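    The benefit of relative XPath expressions mentioned under Usability can be illustrated independently of WEESA. The following Python sketch is not WEESA's mapping syntax: the source document, the element names, the blank node labels, and the function name are hypothetical. It contrasts two document-wide selections, whose results must be paired up afterwards, with paths that are evaluated relative to each item element.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; element names and the second item are invented.
SOURCE = """<shop id="TrekKing">
  <item><name>overnighter</name><rating>3.5</rating></item>
  <item><name>daypack</name><rating>4.0</rating></item>
</shop>"""

root = ET.fromstring(SOURCE)

# Document-wide selection: two independent result lists. Relating the n-th
# name to the n-th rating needs extra bookkeeping outside the expressions.
names = [node.text for node in root.findall("./item/name")]
ratings = [node.text for node in root.findall("./item/rating")]


def item_statements(shop):
    """Evaluate 'name' and 'rating' relative to each item element."""
    statements = []
    for index, item in enumerate(shop.findall("./item")):
        node = "_:item%d" % index  # one blank node per item
        # Relative paths: both values come from the same context element.
        statements.append((node, "shop:itemName", item.findtext("name")))
        statements.append((node, "shop:averageRating", item.findtext("rating")))
    return statements


for statement in item_statements(root):
    print(statement)
```

    With the relative form, the association between a name and its rating follows from the shared context element, so no shared variable is needed to keep the two results aligned.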

    In this thesis we tackle the question of how the shortcomings mentioned above can be eliminated. We resolve the design issues by introducing an object-oriented architecture that is flexible with respect to changes in the mapping structure and open for extensions of the mapping vocabulary. We further introduce new mapping directives and the possibility to use relative XPath expressions in order to increase WEESA's usability. The new mapping directives not only reduce the number of Java method calls, they also cover RDF features that could not be expressed in earlier versions.


    To our knowledge, no other framework exists that unifies Web engineering principles and a flexible XML to RDF transformation for the semantic annotation of Web sites. Thus, we consider it worthwhile to solve the aforementioned issues in order to promote the Semantic Web.

    1.4 Structure of the Thesis, Objectives of this Work

    The remainder of this thesis is structured as follows.

    Chapter 2 gives an overview of related research areas. It briefly describes several approaches to processing XML or HTML sources in order to retrieve metadata from them, and to annotating Web pages with machine-processable metadata.

    Chapter 3 introduces WEESA, which transforms arbitrary XML documents to RDF graphs. In addition to that, this chapter describes how the transformation process can be integrated in the Cocoon Web development framework and how the generated metadata is accumulated in a knowledge base.

    Chapter 4 describes the object-oriented architecture that supports the WEESA mapping algorithm. It also describes the functioning of the mapping algorithm and introduces the rules that have to be followed when defining a mapping.

    Chapter 5 presents all the mapping features that are currently available for WEESA mapping definitions. This chapter is intended to serve as a reference for anyone who wants to take full advantage of the comprehensive collection of mapping directives.

    Chapter 6 concludes the thesis and gives an outlook on future work.

    Chapter 2

    Semantic Annotation of XML-based Web Applications

    In this chapter we describe research areas that are related to Semantic Web engineering as it is offered by WEESA. Section 2.1 introduces approaches that focus on the manual annotation of Web sites. Section 2.2 presents two techniques that embed metadata in HTML. Various approaches that automatically generate metadata from XML documents are the subject of Section 2.3. This chapter closes with the presentation of a project that aims at generating and integrating metadata from heterogeneous sources.

    2.1 Tools for Manual Annotation

    The SHOE Knowledge Annotator provides support for the annotation of HTML pages using the "Simple HTML Ontology Extensions". SHOE's goal was to extend HTML in order to provide tags for defining ontologies as well as for annotating the content of Web sites. The SHOE Knowledge Annotator offers a graphical user interface in order to annotate static Web sites with concepts defined in SHOE ontologies. Since SHOE and its Knowledge Annotator do not take the architecture of the Semantic Web layer cake into account, they are not actively maintained anymore. [Hef, HH00]

    SMORE is an application that allows users to mark up HTML documents in OWL. Its graphical user interface allows users who have only a limited understanding of the concepts of the Semantic Web to create classes, properties, and individuals from existing ontologies. Human-readable elements of a Web site can be dragged and dropped onto concepts of ontologies in order to create individuals. The triples built this way are automatically checked for consistency, and users are assisted in correcting errors if the triples do not match the underlying ontology. In addition to that, SMORE provides a built-in HTML editor that allows users to control and edit the HTML document while generating the semantic markup. However, SMORE does not link the metadata to the HTML document. [SMO05]

    CREAM (CREAting Metadata for the Semantic Web) is an annotation framework that contains methods for several kinds of annotations: In addition to the manual annotation, semi-automatic and "deep" annotation methods are provided. CREAM consists of various components that support the different annotation types: The CREAM ontology guidance/fact browser makes sure that the defined annotations comply with their ontologies. It therefore supports the generation of arbitrary Web site independent metadata, which the authors call annotation by typing. In contrast to annotation by typing, annotation by markup refers to annotating existing conventional Web sites. The CREAM document editor/viewer allows the editing and visualization of HTML documents and makes it possible to link pieces of the displayed data to concepts of ontologies. If a Web site is generated from scratch, the authors propose an annotation by authoring approach that basically inverts annotation by markup: The creation of a new Web site is guided by the ontology guidance/fact browser for content generation.

    In addition to the annotation of static Web sites, CREAM also supports annotation of dynamic sites by storing metadata in databases and providing a mechanism that transforms ontological queries into database queries. The concepts of CREAM are implemented in the OntoMat-Annotizer. [HS03]

    All the manual annotation mechanisms have in common that the metadata generation is potentially error-prone. Inconsistencies are likely to occur when the content of a Web page changes and the annotation is not updated properly. An additional issue is that manual annotation can become complex and time-consuming when granularity issues arise.

    2.2 Embedding and Retrieving Metadata

    2.2.1 GRDDL

    GRDDL is a mechanism for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT [HC05]. One of GRDDL's objectives is to ensure that the meaning of the processed information is preserved. In order to prevent documents that accidentally correspond to a certain dialect from being transformed in a way that does not preserve meaning, GRDDL implementations read in explicit descriptions that define the type of conventions that were used to encode the metadata.

    As an example, we assume that an HTML dialect exists that allows online shops to integrate information about their shop and their products directly into their Web pages. An extract of such a page is given in Listing 2.1. A first mechanism to include metadata that can be transformed into RDF is the use of HTML link elements. <link> elements are not rendered with the document's contents. However, GRDDL implementations can scan a document for link elements of type transformation and use them to find the appropriate processing instructions. Referencing the GRDDL profile profile="http://www.w3.org/2003/g/data-view" indicates that links of type transformation relate the document to transformations that preserve its meaning. [HC05]

    TrekKing online store

    ...

    ...

    Listing 2.1: Inclusion of Metadata for GRDDL Processing
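    The discovery step described above, in which a GRDDL implementation checks the profile and then collects the link elements of type transformation, can be sketched in Python as follows. The sample page is a hypothetical reconstruction (the stylesheet URLs are invented), and the sketch only locates the transformations; it does not fetch or apply them.

```python
import xml.etree.ElementTree as ET

GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"
XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical page; the href values are invented for illustration.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>TrekKing online store</title>
    <link rel="transformation" href="http://example.com/shop/shop2rdf.xsl"/>
    <link rel="stylesheet" href="style.css"/>
  </head>
  <body><p>...</p></body>
</html>"""


def grddl_transformations(xhtml):
    """Return the hrefs of all transformation links, if the profile is set."""
    root = ET.fromstring(xhtml)
    head = root.find("{%s}head" % XHTML)
    if head is None:
        return []
    # Without the GRDDL profile, rel="transformation" has no agreed meaning.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.findall("{%s}link" % XHTML):
        # rel may hold several space-separated link types.
        if "transformation" in link.get("rel", "").split():
            hrefs.append(link.get("href"))
    return hrefs


print(grddl_transformations(PAGE))
```

    Checking the profile first reflects the safeguard described above: a document that merely happens to contain matching markup is not transformed unless it explicitly opts in.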


    Notice that the definition of several <link> elements in one document is possible.

    In addition to that, GRDDL processing can also happen recursively. If, for example, a document (DOC) contains a link to an XFN¹ metadata profile, that XFN profile can contain a transformation link that points to an XSLT sheet (XSLT-1) which is applied to the XFN profile. The result of this first transformation can induce a further link to an XSLT transformation (i.e. a profile transformation) that can then be applied to the original document DOC [HC05, Haz05]. Figure 2.1 depicts this mechanism.

    Figure 2.1: Recursive Application of the GRDDL Mechanism [Haz05]

    Valuation

    GRDDL became a W3C Team Submission in 2005, and considerable interest can be observed in the (Semantic) Web community. A couple of GRDDL implementations already exist. In our opinion, GRDDL solves the problem of how to combine data with its metadata (see also Section 3.2). The introduction of an intermediate step, i.e. the publication of metadata in a predefined dialect, can facilitate the annotation process. However, we think that GRDDL mainly shifts the time-consuming task of metadata generation: Instead of publishing metadata directly in RDF, users are bound to transform their data into a certain dialect that is later transformed to RDF. We think that the direct allocation of metadata in RDF is more desirable since it makes the process of metadata publication independent of the availability of certain dialects or transformation stylesheets.

    2.2.2 RDFa

    RDFa is a syntax for expressing metadata in XHTML. The rendered, hypertext data of XHTML is reused by the RDFa markup, so that publishers don't repeat themselves. The underlying metadata representation is RDF. The metadata is closely tied to the data it describes, so that rendered data can be copied and pasted along with its relevant structure. [AB06]

    ¹ XFN (XHTML Friends Network) is a simple way to represent human relationships using hyperlinks. XFN enables Web authors to indicate their relationship(s) to the people in their Web sites simply by adding a 'rel' attribute to their tags. (http://gmpg.org/xfn/)



    Listing 2.2 shows a sample HTML page of the TrekKing online store. In order to make statements about that store, one includes the information in a paragraph that has an about attribute with the subject of the statements as its value. The predicates can then be defined in property attributes of elements. The text contents of elements are interpreted as object values.

    TrekKing Online Store

    Welcome to TrekKing, your online store for outdoor goods. Online since August, 2006!

    Listing 2.2: RDFa Example

    As can be seen in the example above, in cases where the metadata representation differs from the human-readable one (e.g. different date formats), an element with a property attribute can be applied to define the metadata-compliant representation. [AB06]
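    The annotation idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an RDFa parser: the XHTML fragment, the subject URI, the property names shop:shopName and dc:date, and the use of a content attribute to carry the metadata-compliant date are all hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment; URIs and property names are assumptions.
XHTML = """
<p about="http://www.TrekKing.com">
  Welcome to <span property="shop:shopName">TrekKing</span>, your online
  store for outdoor goods. Online since
  <span property="dc:date" content="2006-08-01">August, 2006</span>!
</p>
"""

def extract_statements(fragment):
    """Return (subject, predicate, object) triples from about/property markup."""
    root = ET.fromstring(fragment)
    triples = []
    # Every element carrying an 'about' attribute names a subject.
    for subject_el in root.iter():
        subject = subject_el.get("about")
        if subject is None:
            continue
        for el in subject_el.iter():
            predicate = el.get("property")
            if predicate is None:
                continue
            # A 'content' attribute overrides the human-readable text.
            obj = el.get("content")
            if obj is None:
                obj = "".join(el.itertext()).strip()
            triples.append((subject, predicate, obj))
    return triples

STATEMENTS = extract_statements(XHTML)
```

    The second statement takes its object from the content attribute rather than from the rendered text "August, 2006", which mirrors the date-format case mentioned above.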

    Compared to the GRDDL approach, RDFa has the advantage that it is more generic. Thus, only one parser is needed in order to glean RDF statements from arbitrary RDFa elements.

    RDFa is currently a W3C Working Draft and is expected to be part of the XHTML2 recommendation. Several RDFa parsers already exist. An additional advantage of XHTML documents marked up with RDFa is that they are valid XHTML documents [AB06]. Therefore, RDFa is believed to allow a gentle extension of the current Web towards the Semantic Web.

    Valuation

    An advantage of reusing the human-readable content of XHTML documents for semantic annotation is that inconsistencies between data and metadata are eliminated. The fact that RDFa is built on RDF provides it with enough flexibility to build new metadata vocabularies, extend others, and evolve vocabularies over time. In addition to that, RDFa eliminates the issue of how to link or embed data and metadata.

    However, from a methodological point of view, the tight binding of data and metadata is questionable. In our opinion, metadata management can be considered a concern of its own, and therefore tying metadata too closely to data breaks the separation of concerns approach that is usually followed in XML based Web engineering.


    2.3 XML to Metadata Translation

    2.3.1 Bridging the Gap between RDF and XML

    In "Bridging the Gap between RDF and XML" [Mel99] Melnik assumes that "every XML document (even those without DTDs) has a default RDF interpretation". Thus, he developed a mechanism that creates RDF descriptions from arbitrary XML documents. In this early approach, Melnik takes neither XML Schema or DTD information nor any mapping directives into account. The mapping rule is as follows: "Every XML tag is regarded as a relationship name, unless an RDF property rdf:instance is used to override this default." Thus, the resulting RDF graphs mainly consist of blank nodes with literal leaf nodes. Although we think that Melnik's approach lacks many of the crucial requirements for Semantic Web applications (e.g. compliance with ontologies), we mention it for the sake of completeness.
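    To make the quoted rule more tangible, the following Python sketch treats every tag as a property name and turns nested elements into blank nodes with literal leaves. It is not Melnik's algorithm: attributes and the rdf:instance override are ignored, and the blank node labels and the sample document are assumptions made for this illustration.

```python
import itertools
import xml.etree.ElementTree as ET

def default_rdf_interpretation(xml_text):
    """Map every tag to a property; nested elements become blank nodes."""
    counter = itertools.count(1)
    triples = []

    def visit(element, subject):
        children = list(element)
        if children:
            # Element with sub-elements: link to a fresh blank node.
            node = f"_:b{next(counter)}"
            triples.append((subject, element.tag, node))
            for child in children:
                visit(child, node)
        else:
            # Leaf element: its text becomes a literal.
            triples.append((subject, element.tag, (element.text or "").strip()))

    visit(ET.fromstring(xml_text), "_:b0")
    return triples

# Hypothetical sample document.
TRIPLES = default_rdf_interpretation(
    "<shop><name>TrekKing</name><address><city>Zurich</city></address></shop>"
)
```

    The result contains no resource URIs at all, which illustrates why such graphs consist mainly of blank nodes and literals.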

    2.3.2 Mapping XML to OWL Ontologies

    In "Mapping XML to OWL Ontologies" [BA05] Bohring and Auer suggest a comprehensive mechanism that takes XML Schemas as the basis for ontology generation. Their approach assumes that the data to be mapped to an ontology originates from relational structures that were serialized to XML documents. Therefore, they aim at representing the relational information in OWL: relations/tables correspond to classes, columns to properties, and rows to instances. However, a difficulty of this approach is the detection of these structures in XML files: nested tags, for example, can represent a "part-of" or a "subtype-of" relationship, which would result in different ontologies. The rules that are applied to XML Schemas can be summarized as follows: [BA05]

    • If one element contains another element that contains not only a literal, a "part-of" relationship is assumed that is mapped to an owl:ObjectProperty.

    • If an element in the source XML tree is always a leaf that contains only a literal and no attributes, the element is mapped to an owl:DatatypeProperty. The domain of the property is the class representing the surrounding element.

    • xsd:elements that contain other elements or have at least one attribute correspond to owl:Class elements that are coupled with owl:ObjectProperties.

    • xsd:simpleType elements correspond to owl:DatatypeProperties.

    • XML attributes are likewise mapped to owl:DatatypeProperties.

    • XML Schema cardinality constraints xsd:minOccurs/xsd:maxOccurs are mapped to the OWL cardinality constraints owl:minCardinality and owl:maxCardinality.

    • xsd:sequence and xsd:all are always mapped to owl:intersectionOf, while xsd:choice corresponds to a combination of owl:intersectionOf, owl:unionOf, and owl:complementOf.
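    A rough Python sketch of the first rules in the list above is given below. It works on a single XML instance rather than on an XML Schema, covers only the class and property distinction (no cardinalities, no xsd:sequence or xsd:choice handling), and is not the XSLT implementation of [BA05]. The sample document and the "@" prefix for attribute names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

def classify(xml_text):
    """Assign each element and attribute name an OWL construct."""
    mapping = {}
    for element in ET.fromstring(xml_text).iter():
        if len(element) == 0 and not element.attrib:
            # Literal-only leaf without attributes; kept only if the tag
            # has never been seen with children or attributes.
            mapping.setdefault(element.tag, "owl:DatatypeProperty")
        else:
            # Nested content or attributes: a class reached via an object property.
            mapping[element.tag] = "owl:Class"
        for name in element.attrib:
            mapping["@" + name] = "owl:DatatypeProperty"
    return mapping

# Hypothetical sample document.
MAPPING = classify(
    '<shop id="1"><name>TrekKing</name><item sku="42"><price>10</price></item></shop>'
)
```

    Because a tag is only kept as a datatype property if it never occurs with children or attributes, the sketch approximates the "always a leaf" condition for the one document it has seen.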

    The above-mentioned rules are implemented in XSLT stylesheets. A first stylesheet converts the XML Schema to the OWL model. In addition to that, a second stylesheet that transforms the XML instance to the OWL instances is automatically generated. In cases where no XML Schema is available, a suitable XML Schema is extracted from the XML instance file. Figure 2.2 depicts the sequence from XML to OWL.


    Figure 2.2: Operating Sequence of an XML to OWL Transformation as suggested in [BA05]

    Valuation

    A disadvantage of the extraction of an XML Schema out of a single XML file is that one XML instance file can only indicate what the full range of valid XML instances may be. In our opinion, the ontologies created from such XML Schemas are likely to suffer from restrictions that are either too narrow or too broad. In addition to that, elements that are declared optional may not be available in a concrete instance file and are therefore missing in the created ontology. However, the mechanism that extracts the XML Schema is currently being improved in order to support multiple source documents, which mitigates these problems.

    An additional issue is inherent in the automatic generation of ontologies: while such mechanisms support the building of ontologies from scratch, they do not tap the full potential of existing domain-relevant knowledge sources. Ontologies that are built from scratch often do not resort to available ontological knowledge on the Web and are therefore tailored to specific application needs, which in turn means that they cannot be reused in different settings. [ont05]

    2.3.3 Lifting XML Schema to OWL

    In "Lifting XML Schema to OWL" [FZT04] Ferdinand, Zirpins, and Trastour propose a twofold approach that is similar to the one described above. Their primary goal is not to create ontologies and RDF graphs in order to offer them to third parties but to support the Web engineering process. Therefore, their mapping concepts do not take existing ontologies into account but create new ones that are closely bound to an XML Schema.

    Similar to the approach described above, the mapping process described in [FZT04] consists of two parts: XML instance documents are taken for RDF generation, whereas XML Schemas are used to create ontologies. The generation of an ontology out of XML instances is not supported. Thus, the mapping procedure can be depicted as shown in Figure 2.3. Ferdinand et al. use Java methods to implement two algorithms. The one that is responsible for the generation of the RDF file distinguishes between two types of nodes: first, element nodes that have subelements and/or attributes and, second, attribute nodes and element nodes that have text content. Elements that have mixed content are not fully supported. The basic idea is that nodes of the first type are mapped to blank nodes, whereas nodes of the second type are mapped to literal nodes. Therefore, the resulting RDF graph exhibits a tree-like structure that is very similar to the source XML document. In addition to that, the RDF graph is extended by datatype information that is retrieved from the Post-Schema Validation Infoset, which is an outcome of the XML Schema validation.

    Figure 2.3: Operating Sequence of an XML to OWL Transformation as suggested in [FZT04]

    Furthermore, a mechanism is proposed to automatically create an OWL ontology from the XML Schema. The main concepts of the ontology generation are as follows: [FZT04]

    • Each XML Schema complexType is mapped to an owl:Class.

    • Each element and attribute declaration is mapped to an OWL property. Elements and attributes of simpleType are mapped to an owl:DatatypeProperty, whereas elements of complexType are mapped to an owl:ObjectProperty.

    • The root element of a schema is mapped to an OWL Class of name 'targetNamespace + #Schema'.
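    The three rules listed above can be sketched as follows. This is a simplified reading of the rules, not the Java implementation of [FZT04]: only named, top-level complex types are recognised, and the schema, its target namespace and all type and element names are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema; names and namespace are illustrative only.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/shop">
  <xs:complexType name="ShopType">
    <xs:sequence>
      <xs:element name="shopName" type="xs:string"/>
      <xs:element name="owner" type="PersonType"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string"/>
  </xs:complexType>
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="fullName" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""

def lift_schema(schema_text):
    """Return a dict from schema names to the OWL construct they map to."""
    root = ET.fromstring(schema_text)
    complex_types = {
        ct.get("name") for ct in root.iter(XS + "complexType") if ct.get("name")
    }
    # Rule 1: every complexType becomes an owl:Class.
    ontology = {name: "owl:Class" for name in complex_types}
    # Rule 3: the schema root becomes the class 'targetNamespace + #Schema'.
    ontology[root.get("targetNamespace") + "#Schema"] = "owl:Class"
    # Rule 2: declarations become properties, depending on their type.
    for decl in root.iter(XS + "element"):
        is_complex = decl.get("type") in complex_types
        ontology[decl.get("name")] = (
            "owl:ObjectProperty" if is_complex else "owl:DatatypeProperty"
        )
    for decl in root.iter(XS + "attribute"):
        ontology[decl.get("name")] = "owl:DatatypeProperty"
    return ontology

ONTOLOGY = lift_schema(SCHEMA)
```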

    Valuation

    Similar to the "Mapping XML to OWL Ontologies" approach, "Lifting XML Schema to OWL" creates XML Schema specific ontologies. Ontologies created like this may serve as a starting point when developing a new ontology for a specific domain from scratch. However, they do not comply with already existing ontologies, nor does the XML to RDF mapping provide means for handling the granularity problem or for making implicit information explicit.

    2.3.4 Round-tripping between XML and RDF

    In "Round-tripping between XML and RDF" [Bat04], Battle describes a 'lift', which is another transformation from XML/XML Schema to RDF. In Battle's opinion, mappings that transform XML directly to RDF/XML using XSLT suffer from the impossibility of a backward mapping. That is because there is no canonical RDF/XML serialization. Thus, one of Battle's goals is to provide a mechanism that allows mappings from XML to RDF as well as mappings from RDF to XML. The round-tripping mechanism is similar to Melnik's approach since it maps XML elements and attributes to RDF properties. However, different from Melnik's approach, the round-tripping mechanism takes the XML Schema into account in order to determine if the properties are datatype or object properties. XML Schema complex types are mapped to resources. In order to facilitate a mapping from XML to RDF and vice versa, the sequencing information of XML needs to be taken into account. Therefore, the round-tripping mechanism maps the XML Schema sequence operator to the RDF rdf:Seq container.
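    Why an rdf:Seq container helps with the backward mapping can be illustrated with a small Python sketch. It is not Battle's implementation: it merely records the order of child elements with the standard RDF container membership properties rdf:_1, rdf:_2, ... and recovers the order from them again. The subject label and the sample element names are assumptions.

```python
import xml.etree.ElementTree as ET

def sequence_to_seq(xml_text, subject="_:seq"):
    """Record child order with rdf:_1, rdf:_2, ... membership properties."""
    root = ET.fromstring(xml_text)
    triples = [(subject, "rdf:type", "rdf:Seq")]
    for position, child in enumerate(root, start=1):
        triples.append((subject, f"rdf:_{position}", child.tag))
    return triples

def seq_to_child_order(triples):
    """Recover the original child order from the membership properties."""
    members = [
        (int(p[len("rdf:_"):]), o) for _, p, o in triples if p.startswith("rdf:_")
    ]
    return [tag for _, tag in sorted(members)]

# Hypothetical sample element with ordered children.
SEQ_TRIPLES = sequence_to_seq("<address><street/><zip/><city/></address>")
```

    Since a set of RDF triples has no inherent order, the numbered membership properties are what allows the element order to be restored when mapping back to XML.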

    In order to support various XML Schemas when mapping RDF to XML, Battle makes use of an existing RDF representation and enriches it using properties that correspond to the target XML Schema. The idea is that every additional property is an owl:equivalentProperty that is equivalent to an already existing one. Thus, RDF graphs potentially contain many equivalent properties that originate from several XML Schemas. When mapping RDF to XML, only the properties that correspond to a certain target XML Schema are taken into account.

    Valuation

    The strength of the round-tripping approach is that it takes the RDF representation of XML documents as a "common denominator" to facilitate data exchange between different XML Schemas. In our opinion, the round-tripping mechanism is applicable for domains that process data from a small number of well-defined XML Schemas. Its advantage is that, in addition to mapping between XML Schemas, a semantic representation of the data becomes available. The disadvantages are again the same as outlined above, i.e. the granularity problem, the failure to take existing ontologies into account, and the missing possibility to make implicit information explicit.

    2.3.5 XR

    XR (XML→RDF) is a transformation format that indicates how RDF can be extracted from a certain XML format. In order to define a transformation for a certain XML format, its structure (e.g. defined by an XML Schema) is taken for the definition of XPath expressions. RDF can then be automatically extracted from every XML document that complies with that format. [Vis05]

    The basic idea behind XR is that an empty template of the RDF statement is first defined manually. The syntax of the statement is very close to RDF/XML, but instead of filling the template with concrete URIs and literals, XPath expressions are entered. In order to generate the RDF/XML document, the XPath expressions are evaluated. Given the XML document listed in Appendix B.1, the following XR definition creates a statement saying that the resource with the resource id http://www.TrekKing.com has the property http://example.com/shop#shopName with the literal value "TrekKing":

    Listing 2.3: XR sample Statement

    http://www.TrekKing.comhttp://example.com/shop#shopName
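    The template idea can be sketched in Python as follows. This is not XR syntax, which is close to RDF/XML; the dictionary-based template, the stand-in shop document and its element names are assumptions made for illustration, and only the limited XPath subset of the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the shop document; element names are assumptions.
SHOP_XML = "<shop><url>http://www.TrekKing.com</url><name>TrekKing</name></shop>"

# Statement template: XPath expressions take the place of concrete values.
TEMPLATE = {
    "subject": "./url",
    "predicate": "http://example.com/shop#shopName",
    "object": "./name",
}

def fill_template(xml_text, template):
    """Evaluate the XPath slots of the template against one XML document."""
    root = ET.fromstring(xml_text)
    return (
        root.findtext(template["subject"]),
        template["predicate"],
        root.findtext(template["object"]),
    )

STATEMENT = fill_template(SHOP_XML, TEMPLATE)
```

    The same template can be applied unchanged to any other document that follows the same structure, which is the point of defining it against the format rather than against a single instance.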


    2.4 Conversion of arbitrary document types to RDF

    2.4.1 Data Conversion, Extraction and Record Linkage using XML and RDF Tools in Project SIMILE

    The project SIMILE (Semantic Interoperability of Metadata In unLike Environments) investigates how Semantic Web tools can support dealing with heterogeneous metadata. An approach that is described in "Data conversion, extraction and record linkage using XML and RDF tools in Project SIMILE" [BGSS04] merges XML data from several sources into one RDF representation in order to facilitate the processing of otherwise distributed data. In a first step, several (possibly already existing) RDF Schemas are chosen and combined to provide an extensive target schema. In a second step, XSLT is used to carry out a syntax to syntax translation from XML to RDF/XML. As described in [BGSS04], it is possible but not user-friendly to use XSLT as a transformation description language. For example, XSLT does not provide a URI encoding mechanism needed to replace characters that are not allowed in URIs. Therefore, developers are urged to implement ad-hoc URI encoding algorithms using the XPath replace function. [BGSS04]
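    For comparison, the URI encoding step that has to be hand-written in XSLT is a single library call in many general-purpose languages. The Python sketch below shows what the step does; the base URI and the label are hypothetical, and the sketch is unrelated to the SIMILE stylesheets.

```python
from urllib.parse import quote

def mint_uri(base, label):
    """Build a URI from free text, percent-encoding disallowed characters."""
    # safe="" also encodes "/" so the label cannot add path segments.
    return base + quote(label, safe="")

# Hypothetical base URI and label.
URI = mint_uri("http://example.com/item/", "Tent & Sleeping Bag (2-person)")
```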

    In addition to the approach mentioned in [BGSS04], the SIMILE project offers a collection of so-called RDFizers. RDFizers are standalone tools that transform existing data into an RDF representation. Several RDFizers are available at http://simile.mit.edu/RDFizers/. As shown by the examples below, the offered transformations originate from various domains: [RDF06]

    JPEG → RDF scans folders for JPEG files and represents the EXIF metadata in RDF/N3.

    BibTeX → RDF transforms BibTeX files into RDF/XML.

    Java → RDF scans Java bytecode for method calls and creates a description of the dependencies between classes and the package/archive encoded in RDF/N3.

    Email → RDF transforms email mbox files into RDF/XML.

    Due to their heterogeneous origins, RDFizers are not built on a standardized technical framework but on several programming languages such as Java, Perl, or Python.


    Chapter 3

    Introduction to WEESA

    Given the gap between the current Web and the goals of the Semantic Web, Reif developed a technique to extend existing XML-based Web engineering methodologies in order to create semantically annotated Web pages [Rei05]. His approach is called "WEESA", which stands for WEb Engineering for Semantic web Applications. One of WEESA's objectives is the transformation of arbitrary XML documents to RDF graphs. Furthermore, WEESA offers mechanisms to automatically annotate Web sites with the generated RDF data. In order to extend the limited view on the metadata offered by a single Web page, WEESA further accumulates and offers metadata in a knowledge base. This chapter gives an overview of all components of WEESA. Section 3.1 is deliberately kept brief, since the mapping procedure is treated in detail in the following chapters.

    3.1 WEESA - Web Engineering for Semantic Web Applications

    As described in Chapter 2, several approaches exist for obtaining metadata from XML sources. This section describes how WEESA transforms XML data to RDF.

    In "Web Design For the Semantic Web" [PT04] Plessers and De Troyer identify the linking between the ontology and the actual data of a Web site to be one of the main difficulties of semantic annotation: most of the approaches described in Chapter 2 lead to a strong weaving of semantics and implementation. As a consequence, the annotation process remains a heavy and time-consuming task. The high coupling between implementation and annotation makes it difficult to adhere to the separation of concerns approach that is usually followed in Web engineering. Reif therefore suggests introducing a new metadata concern in addition to the already existing concerns for content, graphical appearance, and application logic [Rei05]. As described in Section 1.2, XML based Web engineering uses XML and XML Schema for structuring content, whereas XSLT is often used for generating the desired layout. The role of XML Schema is to serve as a contract that the parties involved in the generation of a Web site adhere to. Since WEESA mappings are built against an XML Schema, the introduction of a specific metadata concern is facilitated. This way, separation of concerns and parallel work can be ensured.

    In our opinion, the approaches described in Chapter 2 do not take the granularity problem sufficiently into account. The granularity problem arises when the concepts defined in an ontology do not match the granularity of the data on the Web site [PT04]. WEESA offers several directives to overcome the granularity problem: simple if statements, for example, can be used to map arbitrary values according to certain conditions. The use of Java methods allows the definition of more complex rules to control the mapping in case of granularity issues. A description of all directives can be found in Chapter 5.

    A further challenge of semantic annotation is the consistency between the Web site and its metadata. Most of the supporting tools only allow annotating static Web sites, page by page on an implementation level [PT04]. Because the same information is stored twice, changes to either the human-readable content or the metadata are likely to cause inconsistencies. In order to prevent pages from becoming inconsistent, the work done for one page needs to be repeated for all similarly structured Web pages. In WEESA, the generation of HTML and RDF metadata goes hand in hand. Thus, changes in the source XML document propagate to the HTML representation as well as to the metadata description.

    Figure 3.1 summarizes how Reif [Rei05] takes the issues mentioned above into account. Lifting the mapping definition to the design level, relying on an XML Schema as a "contract" that several parties adhere to, decouples the annotation process from the already existing concerns. The constructs available for WEESA mapping definitions can handle the granularity problem. Using the same XML document for creating the RDF description and the HTML page prevents inconsistencies between them.

    [Figure 3.1 diagram. Design level: XML Schema, WEESA mapping definition, Ontology. Instance level: XML document, HTML Web page, RDF description, Semantic Web page. Edge labels: valid, uses concepts, generate via XSLT, generate via WEESA, associate.]

    Figure 3.1: WEESA Design and Instance Levels [Rei05]

    3.2 Semantic Web Applications with WEESA and Apache Cocoon

    In addition to creating machine-processable RDF data, one needs to associate it with human-readable HTML pages. At present, there is no standardized way to associate RDF/XML data with HTML. However, in "RDF in HTML: Approaches" [Pal02] Palmer describes several ways to provide Web pages with RDF/XML that can be classified as follows:

    Embedding RDF/XML in HTML

    With this association style the RDF description is directly embedded in the HTML page [Rei05]. Some of the suggested approaches are: Embedding RDF/XML directly in the element, inclusion of RDF/XML in a or element, and the extension of the HTML metadata facilities so that the element may appear within the body of the HTML document. In general, Palmer concludes that there is no single embedding method that satisfies all applications and still remains simple. [Pal02]

    WEESA offers a technique that makes use of embedding RDF/XML using a element. An example of an HTML header containing such an element is shown in the listing below.

    TrekKing Online Store

    TrekKing

    Listing 3.1: Inclusion of RDF/XML in a Element

    Linking HTML to the Metadata

    In this approach, the RDF metadata description is stored in an external document and the HTML page references its metadata description. HTML offers two possibilities to link documents: a element that can be applied to the header section of an HTML document and a common HTML element. Listing 3.2 shows two examples of the possible HTML-RDF linking facilities:

    TrekKing Online Store

    TrekKing

    Listing 3.2: Linking HTML to Metadata

    3.2.1 Integration of WEESA in the Apache Cocoon Framework

    The fact that the Semantic Web is an extension of the current Web indicates that current Web engineering techniques need to be extended, too. Hence, WEESA is built for being integrated into XML/XSLT based Web development frameworks such as Apache Cocoon.

    Since WEESA offers the integration of RDF/XML data in the element of the HTML section as well as the linking of RDF/XML to an HTML document, two different extensions to the Cocoon pipeline are available. They are briefly described in the following sections.


    3.2.2 WEESA Cocoon Transformer to generate HTML+RDF

    Since WEESA mapping definitions are written against an XML Schema, the XML document that is passed to the mapping algorithm must comply with the expected structure. To fulfill this requirement, the Cocoon pipeline is split up into one part that is responsible for RDF creation and one part that is responsible for the XSLT transformation. The WriteDOMSessionTransformer, which is inserted just before the XSLT Transformer, takes the XML Schema compliant SAX events and writes a Document Object Model (DOM) representation of them into a servlet session. In addition to that, it passes the SAX events to the XSLT Transformer. Once the result of the XSLT Transformer is generated, it is passed to the WEESA ReadDOMSessionTransformer. The ReadDOMSessionTransformer reads the DOM representation from the servlet session, generates the RDF representation using the mapping algorithm, and inserts the RDF metadata into the element of the HTML page.

    The advantage of the approach described in this section is that the business logic is executed only once. Its disadvantage is that Web sites having RDF/XML descriptions in a element of their section are not HTML 4.0/XHTML compliant [RHJ99, PAA+02].

    3.2.3 WEESA Cocoon Transformer to generate RDF/XML

    In order to associate metadata in an HTML compliant way, WEESA offers the possibility to "link" the HTML document to a separate RDF/XML document. The means of choice regarding the linking of the document is the introduction of a element in the of the HTML document. A possible HTML element that links the Web page and its metadata description is shown below:

    The integration of a element into HTML can take place directly in one of the business logic transformers. In addition to that, WEESA offers an AddRDFLink Transformer, which is inserted between the XSLT Transformer and the Serializer. The AddRDFLink Transformer takes the URL of the incoming request, replaces the ".html" suffix of the path with ".rdf" and adds a element to the of the HTML page [Rei05]. Notice that this approach requires the URL to contain the suffix ".html". In addition to the otherwise untouched pipeline that generates the HTML file, a second pipeline is needed for the creation of the RDF/XML file. Since the WEESA mapping must depend on the same XML data that is also taken for generating the HTML page, the same business logic has to be applied to the source XML file. The WEESA transform