DartGrid: a semantic infrastructure for building database Grid applications

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2006; 18:1811–1828
Published online 17 January 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1031


Huajun Chen∗,†, Zhaohui Wu, Yuxin Mao and Guozhou Zheng

College of Computer Science, Zhejiang University, Hangzhou 310027, People’s Republic of China

SUMMARY

In the presence of a Database Grid where a huge number of highly diverse, widely distributed, autonomously managed databases can be involved in a sharing cycle, database tools and middleware should be well suited for schema mediation and query processing in a semantically meaningful way. In this paper, an implemented system called DartGrid is presented. DartGrid is intended to provide a semantic infrastructure for building database grid applications. We explore the essential and fundamental roles played by Resource Description Framework (RDF) semantics for database grids and implement a set of semantically enabled tools and grid services such as semantic browser, semantic mapping tools, ontology service, semantic query service and semantic registration service. We propose an RDF-View-based approach for relational schema mediation and describe the view-based semantic query rewriting algorithm implemented in DartGrid. DartGrid has been used to build a real database grid application for Traditional Chinese Medicine in China. Copyright © 2006 John Wiley & Sons, Ltd.

KEY WORDS: Grid computing; semantic Web; database Grid; RDF semantics; ontology

1. INTRODUCTION

In many e-Science Grid applications such as medical science, bioinformatics, climate modelling, high-energy physics, etc., advanced instruments and procedures for systematic digital observations, simulations and experiments are generating a wealth of highly diverse, widely distributed, autonomously managed data resources. The data produced and the knowledge derived from it will lose value in the future if the mechanisms for sharing, integration, cataloging, searching, viewing and

∗Correspondence to: Huajun Chen, College of Computer Science, Zhejiang University, Hangzhou 310027, People’s Republic of China.
†E-mail: [email protected]

Contract/grant sponsor: China 973 Project: Fundamental Approach, Model, and Theory of Semantic Grid
Contract/grant sponsor: China 863 Project: TCM Virtual Research Institute
Contract/grant sponsor: China NSF Program; contract/grant number: NSFC60503018

Received 15 May 2004
Revised 1 March 2005
Accepted 1 May 2005


retrieving are not quickly improved. Faced with an impending data crisis, one promising solution is to explicitly and adequately describe the data semantics so that the data can be easily exchanged, shared and integrated by different applications, middleware, sensors, instruments, and even various pervasive devices. The explicitly represented data semantics also make the data more readable and meaningful to people, which would greatly facilitate data sharing between scientists working for different research institutes.

At present, the most popular languages for representing data semantics are the Resource Description Framework (RDF)‡ and the Web Ontology Language (OWL)§, proposed in the Semantic Web research area and standardized by the W3C. RDF is a language for representing Web information in a minimally constraining, extensible, but meaningful way. The RDF structure is generic in the sense that it is based on a directed labeled graph model. RDF is based on the idea of identifying things using Web identifiers (called uniform resource identifiers (URIs)) and describing resources in terms of simple statements about the properties of resources. Each statement is a triple consisting of a subject, a property and a property value (or object). For example, the triple (“http://example.org”, ex:createdBy, “Huajun”) has the meaning ‘http://example.org has a creator whose value is Huajun’. RDF also provides a means of defining classes of resources and properties. These classes are used to build statements that assert facts about resources. While the grammar for XML documents is defined using a Document Type Definition (DTD) or XML Schema, RDF uses its own schema language, RDF Schema (RDFS), for writing a schema for a resource. RDFS is expressive: it includes subclass/superclass relationships as well as constraints on the statements that can be made in a document conforming to the schema. The generic structure of RDF makes data interoperability and evolution easier to handle, as different types of data can be represented using the common graph model, and offers greater value for data integration over disparate Web sources of information. OWL is an extension of RDF/RDFS and supports more sophisticated knowledge representation and inference.
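As a minimal illustration of the statement model described above, the following Python sketch stores RDF statements as (subject, property, object) tuples and looks up property values. The `Triple` type and the `values_of` helper are hypothetical names introduced only for this example; they are not part of DartGrid or of any RDF library.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One RDF statement: a subject, a property and a property value."""
    subject: str
    predicate: str
    obj: str


# The example statement from the text.
graph: List[Triple] = [
    Triple("http://example.org", "ex:createdBy", "Huajun"),
]


def values_of(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object asserted for the given subject and property."""
    return [t.obj for t in graph
            if t.subject == subject and t.predicate == predicate]


creators = values_of(graph, "http://example.org", "ex:createdBy")
```

Because every statement has the same three-part shape, data from different sources can be appended to the same list without changing its structure, which is the interoperability argument made in the paragraph above.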

In this paper, an implemented system called DartGrid is presented. DartGrid is built on several techniques from both the Semantic Web and Grid research areas, and is intended to offer a generic semantic infrastructure for building database Grid applications. Roughly speaking, DartGrid is a set of semantically enabled tools and Grid services such as semantic browser, semantic mapping tools, ontology service, semantic query service and semantic registration service, which support the development of database Grid applications. Within a database Grid application built upon DartGrid: (a) database providers are organized as an ontology-based virtual organization by uniformly defined domain semantics, i.e. domain ontologies, so that databases can be semantically registered and seamlessly integrated to provide a uniform semantic query service; and (b) we raise the level of interaction with the database Grid system to a domain-cognizant model in which query requests are specified in the terminology and controlled vocabularies of the domains, which enables the users to publish, discover, and query databases in a more intuitive way.

Fundamentally speaking, DartGrid mainly deals with the problem of answering queries through a target RDF-based ontology, given a set of semantic mappings between one or more source relational databases and this target ontology. In essence, this is the old problem of uniformly querying many disparate data sources through one common virtual interface. A typical approach, called answering a

‡http://www.w3.org/RDF/.
§http://www.w3.org/TR/owl-features/.


query using a view [1,2], is to describe data sources as precomputed views over a mediated schema and reformulate the user query, posed over the mediated schema, into queries that refer directly to the source schemas by query rewriting. While most of the preceding work has focused on the relational case [1–3] and the XML case [4,5], we consider the problem of rewriting an RDF query using views for RDF-based relational data integration. In DartGrid, the source relational tables are defined as views over the RDF-based shared ontology. These views are called RDF views; they define the semantic mappings from source relational schemas to a shared ontology. RDF views are used by a rewriting algorithm to rewrite an RDF query defined over the shared ontology into a set of SQL queries. In DartGrid, a Visual Semantic Mapping Tool is developed to speed up the process of defining RDF views. A Visual Query Tool is also developed to enable users to visually and quickly construct RDF queries.

Finally, we note that the DartGrid effort was motivated by the application of Web-based data sharing and database integration for Traditional Chinese Medicine (TCM) [6]. DartGrid has been used to develop the TCM database Grid system and is under evaluation by our TCM collaborator.

This paper is organized as follows. Section 2 introduces the architecture and core components of DartGrid from a technical view. Sections 3 and 4 deal with the issues of schema mediation and query processing, respectively. Section 5 mentions some related works. Section 6 gives the summary.

2. ABSTRACT ARCHITECTURE AND CORE COMPONENTS OF DARTGRID

DartGrid consists of a set of Grid services and several convenient client tools. Section 2.1 introduces the layered architecture and core services of DartGrid, and Section 2.2 elaborates on the tools implemented in DartGrid.

2.1. Layered architecture and core services

Figure 1 illustrates the layered architecture of DartGrid. At the basic service layer, three services are implemented.

1. Database Access Service. This supports the typical remote operations on database contents, including query, insertion, deletion and modification.

2. Database Information Service. This supports inquiries about the meta information of databases. The meta information includes: relational schema definitions, Database Management System (DBMS) descriptions, privilege information, statistics information including CPU utilization, available storage space, active session number, etc.

3. Access Control Service. This service is developed for access control in DartGrid. It provides the service of authorizing or authenticating a Grid user to access specific database resources.

We mainly contribute at the semantic service level. The services at this level are mainly designed for RDF-based relational schema mediation and semantic query processing.

1. Ontology Service. This service is used to expose the shared ontologies that are defined using RDF/OWL languages. The ontologies are used to mediate heterogeneous relational databases.



Figure 1. Layered architecture of DartGrid.

2. Semantic Registration Service. Semantic registration establishes the mappings from source relational schema to mediated RDF ontologies. A Semantic Registration Service maintains the mapping information and provides the service of registering and inquiring about this information.

3. Semantic Query Service. This service accepts RDF semantic queries and asks the Semantic Registration Service to determine which databases are capable of providing the answer, then rewrites the RDF queries in terms of relational schema, i.e. the RDF queries will be ultimately converted into a set of SQL queries. The results of SQL queries will be wrapped by RDF/OWL semantics and returned as RDF triples.

2.2. User tools

In general, there are three kinds of user roles in DartGrid: Virtual Organization Administrator (VOA), Local Database Administrator and Normal User. Figure 2 illustrates the relationship between these user roles and the core components of DartGrid.

1. Normal User. For a Normal User, DartGrid offers an intelligent user interface called the Semantic Browser [7]. It is a visual interface enabling the user to graphically browse the RDF/OWL semantics and visually construct an RDF semantic query. Section 4.2 gives an example of how to construct a semantic query using this tool.

2. Local Database Administrator. A database resource can be dynamically added into the sharing cycle of a virtual organization. DartGrid provides the database provider with a semantic mapping tool. After a database Grid service is set up, the local database administrator can use the semantic mapping tool to register his database to the virtual organization. The mapping tool typically



[Figure 2 diagram omitted. Components and roles labelled in the original figure: Normal User, Local DBA, Virtual Organization Administrator, Semantic Browser, Visual Mapping Tool, Console, DartGrid Core Services (Ontology Service, Semantic Registration Service, Semantic Query Service, Query Dispatching Service, Index Service), Database Grid Service, DB. Interaction labels: Browse RDF Semantics; Construct Semantic Query; Submit Semantic Query; Get Schema Mapping Info.; Generate Distributed SQL Query Plan; Get Addr. of Database Services; Semantic Mapping; Semantic Registration; Retrieve Relational Schema Info.; Database Access; Service Aggregation; Service Registration; Service Monitoring; Security Management.]

Figure 2. User tools in DartGrid.

retrieves the RDF ontologies from the ontology service, and gets the relational schema from the database Grid service. Then the database administrator (DBA) can visually map the relational schema to RDF ontologies. Section 3.2 introduces this tool in more detail.

3. Virtual Organization Administrator. The VOA manages the whole database Grid system. DartGrid provides the VOA with a console to configure, monitor and manage the database Grid application. The main component with which the console interacts is the index service. The index service can dynamically aggregate all kinds of service data of database Grid services. These service data contain meta-information such as relational schema definition, DBMS descriptions, privilege information, statistics information including CPU utilization, available storage space, active session number, etc. Using DartConsole, the VOA can monitor, view and configure these service data.

3. SCHEMA MEDIATION IN DARTGRID

In the following sections, we consider two fundamental issues: schema mediation and query processing in DartGrid. We first propose the RDF-View-based approach to schema mediation, then we introduce the semantic mapping tool.



3.1. RDF-View-based schema mediation

DartGrid uses an RDF-based ontology to mediate heterogeneous relational database schemas. Fundamentally, DartGrid takes a local-as-view approach [1] to define the semantic mapping from source relational schema to mediated RDF ontologies, i.e. each relational table is defined as a view over the mediated RDF ontologies.

Consider a simple example: suppose both W3C and Zhejiang University (ZJU) have a legacy relational database of their employees and projects, and we would like to integrate them by the Friend of a Friend (FOAF) ontology¶, so that we can query these relational databases by formulating RDF queries on the FOAF ontology.

The mapping scenario in Figure 3 illustrates two source relational schemas (W3C and ZJU), a target RDF schema (a part of the FOAF ontology) and two mappings between them. Graphically, the mappings are described by the arrows that go between the mapped schema elements. Mappings are often defined as views in conventional data integration systems, in the form of global-as-view (GAV), local-as-view (LAV), or, more generally, global-and-local-as-view (GLAV) assertions. We take the LAV approach, i.e. we define each relational table in the source as a view over the RDF ontologies. We call such views RDF Views. For formal discussion, RDF Views are expressed in a Datalog-like notation. The lower part of Figure 3 illustrates how to represent the semantic mappings as RDF Views. As the examples illustrate, a typical RDF View consists of two parts. The left part is called the view head and is often a relational predicate. The right part is called the view body and is often a set of RDF triples. In general, the body can be viewed as an RDF query over the target RDF ontology and it defines the semantics of the relational predicate from the perspective of the RDF ontology.

The semantics of an RDF View become clearer when we construct a Target RDF Instance G based on the semantic mapping specified by RDF Views. For example, given the relational tuple below, applying the RDF View V1 in Figure 3 to this tuple yields a set of RDF triples. We illustrate these RDF triples using N3 notation:

Relational Tuple:
--------------------
w3c:emp("Dan Brickley","[email protected]","SWAD","http://swad.org","EU");

Yielded RDF triples by Applying V1:
-------------------------------------

_:bn1 rdf:type foaf:Person;
      foaf:name "Dan Brickley";
      foaf:mbox "[email protected]";
      foaf:currentProject _:bn2.

_:bn2 rdf:type foaf:Project;
      foaf:name "SWAD";
      foaf:homepage "http://swad.org";
      foaf:fundedBy _:bn3.

_:bn3 rdf:type foaf:Organization;
      foaf:name "EU".

¶The FOAF project is about creating a Semantic Web of machine-readable homepages describing people, the links between them and the things they create and do; see http://www.foaf-project.org/.



Source Relational Schemas:

W3C source: w3c:emp(?en,?em,?pn,?ph,?fon)
ZJU source: zju:emp(?en,?em), zju:emp_pro(?en,?pn), zju:pro_org(?pn,?fon), zju:org(?fon,?foh)

Target Schema: FOAF ontology (drawn as a graph in the original figure):

?y1 rdf:type foaf:Person;       foaf:name ?en;  foaf:mbox ?em;      foaf:currentProject ?y2
?y2 rdf:type foaf:Project;      foaf:name ?pn;  foaf:homepage ?ph;  foaf:fundedBy ?y3
?y3 rdf:type foaf:Organization; foaf:name ?fon; foaf:homepage ?foh

W3C Source:

V1: w3c:emp(?en,?em,?pn,?ph,?fon) :-
    (?y1,rdf:type,foaf:Person), (?y1,foaf:name,?en), (?y1,foaf:mbox,?em),
    (?y1,foaf:currentProject,?y2),
    (?y2,rdf:type,foaf:Project), (?y2,foaf:name,?pn), (?y2,foaf:homepage,?ph),
    (?y2,foaf:fundedBy,?y3),
    (?y3,rdf:type,foaf:Organization), (?y3,foaf:name,?fon).

ZJU Source:

V2: zju:emp(?en,?em) :-
    (?y1,rdf:type,foaf:Person), (?y1,foaf:name,?en), (?y1,foaf:mbox,?em).

V3: zju:emp_pro(?en,?pn) :-
    (?y1,rdf:type,foaf:Person), (?y1,foaf:name,?en), (?y1,foaf:currentProject,?y2),
    (?y2,rdf:type,foaf:Project), (?y2,foaf:name,?pn).

V4: zju:pro_org(?pn,?fon) :-
    (?y2,rdf:type,foaf:Project), (?y2,foaf:name,?pn), (?y2,foaf:fundedBy,?y3),
    (?y3,rdf:type,foaf:Organization), (?y3,foaf:name,?fon).

V5: zju:org(?fon,?foh) :-
    (?y3,rdf:type,foaf:Organization), (?y3,foaf:name,?fon), (?y3,foaf:homepage,?foh).

Figure 3. RDF View example. The symbols ?en, ?em, ?pn, ?ph, ?fon, ?foh are variables and represent ‘employee name’, ‘mail box’, ‘project name’, ‘project homepage’, ‘funding organization name’ and ‘funding organization homepage’ respectively. The FOAF ontology fragment consists of three classes: foaf:Person, foaf:Project and foaf:Organization.

The key notion is the newly generated blank node ID in the RDF triples. As can be seen, corresponding to each existential variable ?y ∈ Y in the view, we generate a new blank node ID. For example, _:bn1 and _:bn2 are newly generated blank node IDs corresponding to the variables ?y1 and ?y2 in V1. This treatment of existential variables is in accordance with the RDF semantics, as blank nodes can be viewed as existential variables‖.
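The construction of the target RDF instance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `apply_view`, the list-based encoding of V1, and the placeholder e-mail address `dan@example.org` are introduced here and do not come from DartGrid; literals and URIs are not distinguished.

```python
from itertools import count

# View V1 from Figure 3: head arguments and body triple patterns.
V1_HEAD = ["?en", "?em", "?pn", "?ph", "?fon"]
V1_BODY = [
    ("?y1", "rdf:type", "foaf:Person"),
    ("?y1", "foaf:name", "?en"),
    ("?y1", "foaf:mbox", "?em"),
    ("?y1", "foaf:currentProject", "?y2"),
    ("?y2", "rdf:type", "foaf:Project"),
    ("?y2", "foaf:name", "?pn"),
    ("?y2", "foaf:homepage", "?ph"),
    ("?y2", "foaf:fundedBy", "?y3"),
    ("?y3", "rdf:type", "foaf:Organization"),
    ("?y3", "foaf:name", "?fon"),
]


def apply_view(head, body, row, fresh):
    """Instantiate a view body for one relational tuple.

    Distinguished variables are bound to the tuple's values; every
    existential variable receives a fresh blank node ID on first use.
    """
    binding = dict(zip(head, row))

    def term(t):
        if not t.startswith("?"):
            return t  # constant such as foaf:Person
        if t not in binding:
            binding[t] = f"_:bn{next(fresh)}"  # existential variable
        return binding[t]

    return [(term(s), p, term(o)) for s, p, o in body]


# Placeholder mailbox value (assumption); the other values follow the example tuple.
row = ("Dan Brickley", "dan@example.org", "SWAD", "http://swad.org", "EU")
triples = apply_view(V1_HEAD, V1_BODY, row, count(1))
```

Each call with a new tuple keeps drawing from the same counter, so blank nodes generated for different tuples never collide.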

‖W3C Recommendation: RDF Semantics, see http://www.w3.org/TR/rdf-mt/.



Figure 4. Visual semantic mapping tool.

We formally define the RDF View as below.

Definition 1. RDF View. Let Var be a set of variable names. A typical RDF View is of the form:

R(X) :- G(X, Y)

where:

1. R(X) is called the head of the view, and R is a relational predicate;
2. G(X, Y) is called the body of the view and G is an RDF graph with some nodes replaced by variables in Var;
3. X and Y contain either variables or constants; the variables in X are called distinguished variables and the variables in Y are called existential variables (we often denote individual existential variables by ‘?y1, ?y2, . . . ’).
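One possible in-memory representation of Definition 1 is sketched below. The `RDFView` class and its method names are hypothetical and chosen only for illustration; the sketch simply separates distinguished variables (those in the head) from existential ones (those appearing only in the body).

```python
from dataclasses import dataclass
from typing import Set, Tuple

TriplePattern = Tuple[str, str, str]


@dataclass(frozen=True)
class RDFView:
    """R(X) :- G(X, Y): a relational head defined over an RDF graph pattern."""
    head_pred: str                    # R, the relational predicate
    head_args: Tuple[str, ...]        # X
    body: Tuple[TriplePattern, ...]   # G(X, Y)

    def variables(self) -> Set[str]:
        """All variables occurring as subject or object in the body."""
        return {t for s, _, o in self.body for t in (s, o) if t.startswith("?")}

    def distinguished(self) -> Set[str]:
        """Variables of X: those exposed by the relational head."""
        return {a for a in self.head_args if a.startswith("?")}

    def existential(self) -> Set[str]:
        """Variables of Y: body variables that are not in the head."""
        return self.variables() - self.distinguished()


# View V2 from Figure 3.
V2 = RDFView(
    head_pred="zju:emp",
    head_args=("?en", "?em"),
    body=(
        ("?y1", "rdf:type", "foaf:Person"),
        ("?y1", "foaf:name", "?en"),
        ("?y1", "foaf:mbox", "?em"),
    ),
)
```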

3.2. Semantic mapping tool

The task of defining semantic mappings from source relational schema to RDF ontologies is burdensome and error-prone. DartGrid offers a visual tool to facilitate the task of defining RDF Views. As Figure 4 displays, the user can use the registration panel (the right part in the figure) to view the table and column



definition of the relational database, and use the semantic browsing panel (the left part in the figure) to browse the RDF ontologies graphically. The user can then specify which RDF class a table should be mapped onto and which RDF property a table column should be mapped onto. After finishing the mapping, the tool automatically generates a registration entry in RDF/XML format and submits it to the semantic registration service.

4. QUERY PROCESSING IN DARTGRID

This section concerns the issue of semantic query processing in DartGrid. We first introduce a query rewriting algorithm, then present the visual query tool of DartGrid.

4.1. Rewriting semantic queries using RDF Views

As Figure 2 shows, the typical process of query processing in DartGrid can be divided into four steps.

1. Semantic Query Construction. Normally, a user visits an Ontology Service and browses the RDF ontologies, then formulates RDF queries upon the RDF ontologies.

2. Semantic Query Rewriting. The RDF queries are submitted to the Semantic Query Service, and will be rewritten into a set of SQL queries.

3. SQL Query Plan Evaluation. The set of SQL queries are dispatched to separate database Grid services to retrieve data.

4. Query Result Transformation. The raw data retrieved will be transformed into RDF/XML format and returned as RDF triples.

The most difficult step is query rewriting. Based on the RDF View introduced in Section 3.1, we propose an innovative view-based query rewriting algorithm in this section.

4.1.1. Query rewriting problem

Formally, the fundamental problem we want to address is the following: given a set of relations R, RDF ontologies G, a set of RDF View definitions V = {vi | vi = Ri(X) :- Gi(X, Y)} and a query Q defined over the RDF ontologies G, can we find a rewriting r of the query Q using the views in V such that the body of r consists only of relational predicates? Intuitively, the rewriting process will yield a set of SQL queries.

For example, Q1 is a query over the target ontology in Figure 3. The query is written in SPARQL∗∗ query notation and it asks for all tuples with person name (?en), mail box (?em), project name (?pn), project homepage (?ph) and the homepage of the funding organization (?foh). In the following sections, we use this query as a running example to introduce the query rewriting algorithm:

Q1:
select ?en,?em,?y2,?pn,?ph,?foh where
  (?y1 rdf:type foaf:Person) (?y1 foaf:name ?en) (?y1 foaf:mbox ?em)
  (?y1 foaf:currentProject ?y2)
  (?y2 rdf:type foaf:Project) (?y2 foaf:name ?pn) (?y2 foaf:homepage ?ph) (?y2 foaf:fundedBy ?y3)
  (?y3 rdf:type foaf:Organization) OPTIONAL (?y3 foaf:homepage ?foh)

∗∗W3C’s SPARQL query language, see http://www.w3.org/TR/rdf-sparql-query/.



We note there is an Optional Block in Q1. According to the SPARQL specification, the OPTIONAL predicate specifies that if the optional part does not lead to any solutions, the variables in the optional block can be left unbound.
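The effect of an optional block can be pictured as a left outer join between the solutions of the required part and those of the optional part. The following Python sketch illustrates that behaviour only; the function `left_join`, the dictionary encoding of solutions and the sample values are assumptions made for this example and are not taken from the SPARQL specification or from DartGrid.

```python
def left_join(required, optional, shared):
    """Extend each required solution with matching optional solutions.

    A required solution with no compatible optional solution is kept as-is,
    so the variables of the optional block stay unbound for it.
    """
    result = []
    for sol in required:
        matches = [opt for opt in optional
                   if all(opt[v] == sol[v] for v in shared)]
        if matches:
            result.extend({**sol, **opt} for opt in matches)
        else:
            result.append(dict(sol))
    return result


# Two organizations; only the first one has a known homepage (made-up data).
required = [{"?y3": "_:org1"}, {"?y3": "_:org2"}]
optional = [{"?y3": "_:org1", "?foh": "http://example.org/org1"}]

solutions = left_join(required, optional, shared=["?y3"])
```

In the rewritings discussed next, the unbound case shows up as `?foh=null` in the head.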

For Q1, using the RDF views in Figure 3, we can yield a set of rewritings. Two of them are illustrated below as examples††. The rewritings are also in Datalog-like syntax. They can be easily converted into SQL queries:

(r1): H(?en,?em,SF2(?pn),?pn,?ph,?foh=null) :- w3c:emp(?en,?em,?pn,?ph,?fon)
(r2): H(?en,?em,SF2(?pn),?pn,?ph,?foh=null) :- zju:emp(?en,?em), zju:emp_pro(?en,?pn), zju:pro_org(?pn,?fon), w3c:emp(?fon)
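The conversion from a Datalog-like rewriting to SQL mentioned above can be sketched as follows. This is a minimal illustration, not the DartGrid implementation: the function `rewriting_to_sql`, the SQL table names and all column names are hypothetical, and Skolem terms in the head are ignored for brevity. Variables shared between atoms become equality conditions, and variables marked as null become `NULL` columns.

```python
def rewriting_to_sql(select_vars, null_vars, atoms, columns):
    """Turn relational atoms into a single SELECT statement.

    atoms:   list of (table, [variables]) in body order
    columns: table -> column names, position-aligned with the variables
    """
    first_seen = {}              # variable -> "alias.column" of first occurrence
    from_items, conditions = [], []
    for i, (table, args) in enumerate(atoms):
        alias = f"t{i}"
        from_items.append(f"{table} AS {alias}")
        for var, col in zip(args, columns[table]):
            ref = f"{alias}.{col}"
            if var in first_seen:
                conditions.append(f"{first_seen[var]} = {ref}")  # join condition
            else:
                first_seen[var] = ref
    select = [f"{first_seen[v]} AS {v.lstrip('?')}" for v in select_vars]
    select += [f"NULL AS {v.lstrip('?')}" for v in null_vars]
    sql = f"SELECT {', '.join(select)} FROM {', '.join(from_items)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# Hypothetical physical schema for two of the ZJU tables.
columns = {"zju_emp": ["name", "mbox"], "zju_emp_pro": ["emp_name", "project"]}
atoms = [("zju_emp", ["?en", "?em"]), ("zju_emp_pro", ["?en", "?pn"])]

sql = rewriting_to_sql(["?en", "?em", "?pn"], ["?foh"], atoms, columns)
```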

We now elaborate on how to generate the above rewritings using our algorithm.

4.1.2. Rewriting algorithm

Next we describe the basic algorithm for rewriting RDF queries into a set of source SQL queries, based on the RDF View definitions. Basically, there are two phases in the algorithm: Class Mapping Rule Generation and Query Transformation.

• Class Mapping Rule Generation. This phase creates a set of mapping rules for each RDF class based on the RDF views. The goal of this phase is to turn the view definitions into a set of smaller rules, so that target query expressions can be more directly substituted by relational terms.

The algorithm starts by looking at the body of the view and grouping the triples by subject name, i.e. it groups together the triples that have the same subject name. For example, there are three such triple groups for V1. In the first group, four triples share the same subject name ?y1.
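The grouping step can be sketched in Python as below. The helper name `group_by_subject` is an assumption made for this illustration; the body of V1 is copied from Figure 3.

```python
from collections import defaultdict


def group_by_subject(body):
    """Collect the triples of a view (or query) body by their subject term."""
    groups = defaultdict(list)
    for triple in body:
        groups[triple[0]].append(triple)
    return dict(groups)  # insertion order follows first appearance of each subject


V1_BODY = [
    ("?y1", "rdf:type", "foaf:Person"),
    ("?y1", "foaf:name", "?en"),
    ("?y1", "foaf:mbox", "?em"),
    ("?y1", "foaf:currentProject", "?y2"),
    ("?y2", "rdf:type", "foaf:Project"),
    ("?y2", "foaf:name", "?pn"),
    ("?y2", "foaf:homepage", "?ph"),
    ("?y2", "foaf:fundedBy", "?y3"),
    ("?y3", "rdf:type", "foaf:Organization"),
    ("?y3", "foaf:name", "?fon"),
]

groups = group_by_subject(V1_BODY)
```

For V1 this yields three groups keyed by ?y1, ?y2 and ?y3, one per RDF class instance described by the view.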

Next, the algorithm replaces each existential variable ?yn ∈ Y with a Skolem Function Name, which is used to generate a blank node ID. Referring to Section 3.1, to transform a relational tuple into a set of RDF triples, we generate a new blank node ID corresponding to each existential variable in the view definition. As a matter of fact, we associate each RDF class in the target ontology with a unique Skolem Function that can generate a blank node ID of that type. These Skolem Functions will be used to generate blank node IDs in the query result. For instance, we associate the RDF classes in Figure 3 with the following Skolem Functions, respectively:

foaf:Person - SF1(?en); foaf:Project - SF2(?pn); foaf:Organization - SF3(?fon)

Take view V1 as an example: the ?y1, ?y2 and ?y3 in V1 are replaced by the Skolem Function names SF1(?en), SF2(?pn) and SF3(?fon), respectively. The algorithm then constructs a new mapping rule for each triple group. For example, for the set of views in Figure 3, we generate the following new mapping rules:

††We use ‘w3c:emp(?en,?em,?pn)’ as a shortcut for the projection of the table ‘w3c:emp’ onto columns ?en,?em,?pn. Others are similar.



W3C Source:
V1: Rule-1: w3c:emp(?en,?em,?pn) :- (SF1(?en),rdf:type,foaf:Person), (SF1(?en),foaf:name,?en), (SF1(?en),foaf:mbox,?em), (SF1(?en),foaf:currentProject,SF2(?pn))
    Rule-2: w3c:emp(?pn,?ph,?fon) :- (SF2(?pn),rdf:type,foaf:Project), (SF2(?pn),foaf:name,?pn), (SF2(?pn),foaf:homepage,?ph), (SF2(?pn),foaf:fundedBy,SF3(?fon))
    Rule-3: w3c:emp(?fon) :- (SF3(?fon),rdf:type,foaf:Organization), (SF3(?fon),foaf:name,?fon)

ZJU Source:
V2: Rule-4: zju:emp(?en,?em) :- (SF1(?en),rdf:type,foaf:Person), (SF1(?en),foaf:name,?en), (SF1(?en),foaf:mbox,?em)
V3: Rule-5: zju:emp_pro(?en,?pn) :- (SF1(?en),rdf:type,foaf:Person), (SF1(?en),foaf:name,?en), (SF1(?en),foaf:currentProject,SF2(?pn))
    Rule-6: zju:emp_pro(?pn) :- (SF2(?pn),rdf:type,foaf:Project), (SF2(?pn),foaf:name,?pn)
V4: Rule-7: zju:pro_org(?pn,?fon) :- (SF2(?pn),rdf:type,foaf:Project), (SF2(?pn),foaf:name,?pn), (SF2(?pn),foaf:fundedBy,SF3(?fon))
    Rule-8: zju:pro_org(?fon) :- (SF3(?fon),rdf:type,foaf:Organization), (SF3(?fon),foaf:name,?fon)
V5: Rule-9: zju:org(?fon,?foh) :- (SF3(?fon),rdf:type,foaf:Organization), (SF3(?fon),foaf:name,?fon), (SF3(?fon),foaf:homepage,?foh)
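The rule generation just described can be sketched in Python as follows. This is an illustration under explicit assumptions rather than the DartGrid code: the names `SKOLEM`, `skolem_terms` and `mapping_rules` are invented here, the table name is written `zju:emp_pro`, and the argument of each Skolem function is assumed to be the object of the node's foaf:name triple, which matches SF1(?en), SF2(?pn) and SF3(?fon) in the text.

```python
import re

# One Skolem function per RDF class, as listed in the text.
SKOLEM = {"foaf:Person": "SF1", "foaf:Project": "SF2", "foaf:Organization": "SF3"}
VAR = re.compile(r"\?\w+")


def skolem_terms(body):
    """Map each typed subject variable to SFk(<its foaf:name variable>)."""
    cls = {s: o for s, p, o in body if p == "rdf:type"}
    key = {s: o for s, p, o in body if p == "foaf:name"}  # assumption: name is the key
    return {s: f"{SKOLEM[c]}({key[s]})" for s, c in cls.items()}


def mapping_rules(head_pred, head_args, body):
    """Emit one rule per triple group; the head is projected onto the used variables."""
    sk = skolem_terms(body)
    groups = {}
    for s, p, o in body:
        groups.setdefault(s, []).append((sk.get(s, s), p, sk.get(o, o)))
    rules = []
    for triples in groups.values():
        used = {v for s, _, o in triples for v in VAR.findall(s + " " + o)}
        head = (head_pred, tuple(a for a in head_args if a in used))
        rules.append((head, triples))
    return rules


# View V3 from Figure 3.
V3_BODY = [
    ("?y1", "rdf:type", "foaf:Person"),
    ("?y1", "foaf:name", "?en"),
    ("?y1", "foaf:currentProject", "?y2"),
    ("?y2", "rdf:type", "foaf:Project"),
    ("?y2", "foaf:name", "?pn"),
]

rules = mapping_rules("zju:emp_pro", ("?en", "?pn"), V3_BODY)
```

Applied to V3, the sketch produces two rules that correspond to Rule-5 and Rule-6 above.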

Next, for each source, the algorithm merges those rules that are about the same RDF class. For example, for the ‘ZJU’ source, rules 4 and 5 are merged as they are both about foaf:Person. The final mapping rules are as follows:

Rule-4_5: zju:emp(?en,?em), zju:emp_pro(?en,?pn) :- (SF1(?en),rdf:type,foaf:Person), (SF1(?en),foaf:name,?en), (SF1(?en),foaf:mbox,?em), (SF1(?en),foaf:currentProject,SF2(?pn))

Rule-6_7: zju:pro_org(?pn,?fon) :- (SF2(?pn),rdf:type,foaf:Project), (SF2(?pn),foaf:name,?pn), (SF2(?pn),foaf:fundedBy,SF3(?fon))

Rule-8_9: zju:org(?fon,?foh) :- (SF3(?fon),rdf:type,foaf:Organization), (SF3(?fon),foaf:name,?fon), (SF3(?fon),foaf:homepage,?foh)
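A minimal Python sketch of the merge step is given below. The names `rule_class` and `merge_rules` and the triple-of-fields rule encoding are assumptions made for this example. Unlike the listing above, which shows a single head for some merged rules, this sketch simply keeps every contributing head, so it should be read as one possible interpretation of the merge.

```python
def rule_class(body):
    """The RDF class a rule is about: the object of its rdf:type triple."""
    return next(o for _, p, o in body if p == "rdf:type")


def merge_rules(rules):
    """Merge rules of the same source and class; heads and triples are unioned in order."""
    merged = {}
    for source, head, body in rules:
        heads, triples = merged.setdefault((source, rule_class(body)), ([], []))
        if head not in heads:
            heads.append(head)
        for t in body:
            if t not in triples:
                triples.append(t)
    return merged


rule4 = ("zju", "zju:emp(?en,?em)", [
    ("SF1(?en)", "rdf:type", "foaf:Person"),
    ("SF1(?en)", "foaf:name", "?en"),
    ("SF1(?en)", "foaf:mbox", "?em"),
])
rule5 = ("zju", "zju:emp_pro(?en,?pn)", [
    ("SF1(?en)", "rdf:type", "foaf:Person"),
    ("SF1(?en)", "foaf:name", "?en"),
    ("SF1(?en)", "foaf:currentProject", "SF2(?pn)"),
])

merged = merge_rules([rule4, rule5])
```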

• Query Transformation. In this phase, the algorithm transforms the input query Q using the newly generated mapping rules and produces a set of valid rewritings.

Similarly, the algorithm starts by looking at the body of the query and groups the triples by subject name. For example, there are three such groups for Q1. In the first group, four triples share the same subject name ?y1.

Next, the algorithm replaces all variables ?yn with the corresponding Skolem Function Names. For example, the ?y1, ?y2, ?y3 in Q1 will be replaced by the Skolem Function Names SF1(?en), SF2(?pn), SF3(?fon), respectively.

Next, the algorithm begins to look for rewritings for each triple group by trying to find applicable mapping rules. If it finds one, it replaces the triple group by the head of the mapping rule and generates a new partial rewriting. After all triple groups have been replaced, a candidate rewriting is yielded. If a triple t in Q is OPTIONAL and no triple in the mapping rule is mapped to t, the variable in t is set to NULL as the default value. Figure 5 gives the formal descriptions of the algorithms, and Figure 6 illustrates the rewriting process for query Q1.

Definition 2 (Triple Mapping). Given two triples t1, t2, we say t1 maps t2 if there is a variable mapping ϕ from Vars(t1) to Vars(t2) such that t2 = ϕ(t1). Vars(t1) denotes the set of variables in t1.

Definition 3 (Applicable Class Mapping Rule). Given a triple group g of a query Q, a mapping rule m is an Applicable Class Mapping Rule with respect to g, if there is a triple mapping τ that maps every non-optional triple in g to a triple in m.
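Definitions 2 and 3 can be read as a small matching procedure. The Python sketch below is an illustration only: the names `map_term` and `is_applicable` are hypothetical, Skolem terms are assumed to have the textual form `SFk(?var)`, and candidate triples are tried greedily without backtracking, which is enough for the examples in this paper but not a complete matcher in general.

```python
import re

SKOLEM_TERM = re.compile(r"^(\w+)\((\?\w+)\)$")


def map_term(q, r, phi):
    """Try to extend the variable mapping phi so that phi(q) equals r."""
    mq, mr = SKOLEM_TERM.match(q), SKOLEM_TERM.match(r)
    if mq and mr:  # Skolem terms: same function name, then map the argument
        return mq.group(1) == mr.group(1) and map_term(mq.group(2), mr.group(2), phi)
    if q.startswith("?"):  # a variable must map to a variable, consistently
        return r.startswith("?") and phi.setdefault(q, r) == r
    return q == r  # constants must be identical


def is_applicable(group, rule_body, optional=()):
    """Definition 3: every non-optional triple of the group maps to a rule triple."""
    phi = {}
    for t in group:
        if t in optional:
            continue
        for candidate in rule_body:
            trial = dict(phi)
            if all(map_term(a, b, trial) for a, b in zip(t, candidate)):
                phi = trial
                break
        else:
            return False
    return True


# Organization triple group of Q1 after Skolemization.
opt = ("SF3(?fon)", "foaf:homepage", "?foh")
group = [("SF3(?fon)", "rdf:type", "foaf:Organization"), opt]

rule3_body = [("SF3(?fon)", "rdf:type", "foaf:Organization"),
              ("SF3(?fon)", "foaf:name", "?fon")]
rule9_body = rule3_body + [("SF3(?fon)", "foaf:homepage", "?foh")]
```

With the homepage triple marked optional, Rule-3 is applicable to the group even though it says nothing about homepages; if the triple were required, only Rule-9 would qualify.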



Algorithm 1: Class Mapping Rule Generation

1. Input: set of RDF views V
2. Initialize mapping rules list M;
3. For each v in V
4.   Group the triples in v.body by subject name;
5.   Replace variables in v with corresponding Skolem function;
6.   Let L be the set of triple groups of v.body;
7.   For each triple group g in L
8.     create a new mapping rule m;
9.     m.head = v.head;
10.    m.body = g;
11.    add m to M;
12.  EndFor
13. EndFor
14. Merge those rules that are about the same RDF class;
15. Output: mapping rule list M;

Algorithm 2: Query Transformation

1. Input: target query q, set of mapping rules M
2. Initialize rewriting list Q;
3. Group the triples in q.body by subject name;
4. Replace variables in q.body with corresponding Skolem function;
5. Let L be the set of triple groups of q.body;
6. Add q to Q;
7. For each triple group g in L
8.   Let AM = the set of mapping rules applicable to g;
9.   For each q in Q
10.    remove q from Q;
11.    For each rule m in AM
12.      For each OPTIONAL triple t in g that is not mapped to a triple in m
13.        Let x be the variable in t and x in q.head;
14.        q.head = q.head[x/x=null];
15.      EndFor
16.      q′ = q[g/m.head];
17.      Add q′ to Q;
18.    EndFor
19.  EndFor
20. EndFor
21. Output: rewriting list Q;

Figure 5. The algorithms. We use ‘q = q[a/b]’ to denote replacing all occurrences of ‘a’ in ‘q’ with ‘b’, and use ‘q.head’ and ‘q.body’ to denote the head and body of ‘q’.
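The query transformation phase can be approximated by the short Python sketch below. It is a simplification under stated assumptions and not the authors' implementation: the function `rewrite` is a hypothetical name, applicability is reduced to exact triple containment (the query and the rules are assumed to use the same variable names after Skolemization), and the group-by-group loop of Algorithm 2 is expressed as a Cartesian product over the applicable rules of each group, which enumerates the same candidates.

```python
from itertools import product


def rewrite(head_vars, groups, optional, rules):
    """Enumerate candidate rewritings of a Skolemized query.

    groups:   list of triple groups (lists of triples)
    optional: set of triples that belong to an OPTIONAL block
    rules:    list of (head, body) class mapping rules
    """
    choices = []
    for g in groups:
        required = [t for t in g if t not in optional]
        applicable = [(g, h, b) for h, b in rules if all(t in b for t in required)]
        choices.append(applicable)
    rewritings = []
    for combo in product(*choices):
        # Optional triples not covered by the chosen rule leave their variable null.
        nulls = {t[2] for g, _, b in combo for t in g if t in optional and t not in b}
        head = [f"{v}=null" if v in nulls else v for v in head_vars]
        rewritings.append((head, [h for _, h, _ in combo]))
    return rewritings


type_org = ("SF3(?fon)", "rdf:type", "foaf:Organization")
name_org = ("SF3(?fon)", "foaf:name", "?fon")
home_org = ("SF3(?fon)", "foaf:homepage", "?foh")
project = [("SF2(?pn)", "rdf:type", "foaf:Project"),
           ("SF2(?pn)", "foaf:name", "?pn"),
           ("SF2(?pn)", "foaf:fundedBy", "SF3(?fon)")]

rules = [
    ("zju:pro_org(?pn,?fon)", project),              # Rule-7
    ("w3c:emp(?fon)", [type_org, name_org]),         # Rule-3
    ("zju:org(?fon,?foh)", [type_org, name_org, home_org]),  # Rule-9
]

# A two-group fragment of Q1: project name plus optional funder homepage.
result = rewrite(["?pn", "?foh"], [project, [type_org, home_org]], {home_org}, rules)
```

The first rewriting uses the W3C table for the organization and therefore returns a null homepage, while the second one obtains the homepage from zju:org.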

4.2. Visual query tool

DartGrid offers a semantic browser [7] enabling a user to interactively specify a semantic query. Figure 7 illustrates an example from our TCM application. It shows how a user can specify, step by step, a semantic query to find those TCM prescriptions that can cure influenza. Two RDF classes are involved, TCM Prescription and Disease. For clarity, we have translated the Chinese terms into English.

In the first step (the left part of Figure 7), the user selects the TCM Prescription class and its three properties: name, dosage and preparationMethod. In the second step (the right part of Figure 7), the user selects the Disease class and three properties: name, symptom and pathogeny. Finally, a constraint which specifies that the name of the Disease is ‘influenza’ is input. The semantics of this query is to find the TCM Prescriptions that can cure influenza, and the data for the selected properties will be returned.

4.3. Experiments

The first goal of our experiment is to validate that our algorithm can scale up to deal with a large mapping complexity. We consider two general classes of relational schema: chain schema and star schema. In these two cases, we consider queries and views that have the same shape and size. Moreover, we also consider the worst case, in which two parameters are varied: (1) the number of triple groups of a query; and (2) the number of sources. The whole system is implemented in Java and



[Figure 6 diagram, reconstructed as text. Each step replaces one triple group by the head of the named mapping rule; rdf:type triples are omitted in the figure.]

Query Q1:
H(?en,?em,?y2,?pn,?ph,?foh) :-
  (?y1 foaf:name ?en), (?y1 foaf:mbox ?em), (?y1 foaf:currentProject ?y2),
  (?y2 foaf:name ?pn), (?y2 foaf:homepage ?ph), (?y2 foaf:fundedBy ?y3),
  (?y3 foaf:homepage ?foh).

Skolemization:
H(?en,?em,SF2(?pn),?pn,?ph,?foh) :-
  (SF1(?en) foaf:name ?en), (SF1(?en) foaf:mbox ?em), (SF1(?en) foaf:currentProject SF2(?pn)),
  (SF2(?pn) foaf:name ?pn), (SF2(?pn) foaf:homepage ?ph), (SF2(?pn) foaf:fundedBy SF3(?fon)),
  (SF3(?fon) foaf:homepage ?foh).

W3C branch:

Rule-1:
H(?en,?em,SF2(?pn),?pn,?ph,?foh) :-
  w3c:emp(?en,?em),
  (SF2(?pn) foaf:name ?pn), (SF2(?pn) foaf:homepage ?ph), (SF2(?pn) foaf:fundedBy SF3(?fon)),
  (SF3(?fon) foaf:homepage ?foh).

Rule-2:
H(?en,?em,SF2(?pn),?pn,?ph,?foh) :-
  w3c:emp(?en,?em), w3c:emp(?pn,?ph),
  (SF3(?fon) foaf:homepage ?foh).

Rule-3:
H(?en,?em,SF2(?pn),?pn,?ph,?foh=null) :-
  w3c:emp(?en,?em), w3c:emp(?pn,?ph), w3c:emp(?fon)

r1: H(?en,?em,SF2(?pn),?pn,?ph,?foh=null) :-
  w3c:emp(?en,?em,?pn,?ph,?fon)

ZJU branch:

Rule-4_5:
H(?en,?em,SF2(?pn),?pn,?ph,?foh) :-
  zju:emp(?en,?em), zju:emp_pro(?en,?pn),
  (SF2(?pn) foaf:name ?pn), (SF2(?pn) foaf:homepage ?ph), (SF2(?pn) foaf:fundedBy SF3(?fon)),
  (SF3(?fon) foaf:homepage ?foh).

Rule-2:
H(?en,?em,SF2(?pn),?pn,?ph,?foh) :-
  zju:emp(?en,?em), zju:emp_pro(?en,?pn), w3c:emp(?pn,?ph,?fon),
  (SF3(?fon) foaf:homepage ?foh).

Rule-8_9:
r4: H(?en,?em,SF2(?pn),?pn,?ph,?foh) :-
  zju:emp(?en,?em), zju:emp_pro(?en,?pn), w3c:emp(?pn,?ph,?fon), zju:org(?fon,?foh)

Figure 6. The query rewriting example. Only r1 and r4 are illustrated.

all experiments are performed on a PC with a single 1.8 GHz P4 CPU and 512 MB RAM, runningWindows XP(SP2) and JRE 1.4.1.

Chain scenario

In a chain schema, the relational tables form a line in which each table is joined with the next. The chain scenario simulates the case where multiple inter-linked relational tables are mapped to a target RDF ontology with a large number of levels (depth). Figure 8(a) shows the performance in the chain scenario as the length of the chain and the number of views increase. The algorithm scales up to 300 views in under 10 seconds.

Star scenario

In a star schema, there is a unique relational table that is joined with every other table, and there are no joins between the other tables. The star scenario simulates the case where source relational tables are mapped to a target RDF graph with a large branching factor. Figure 8(b) shows the performance in the star scenario as the branching factor of the star and the number of views increase. The algorithm easily scales up to 300 views in under 1 second. The experiments illustrate that the algorithm works better in the star scenario.
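To make the two shapes concrete, the sketch below generates chain-shaped and star-shaped sets of triple patterns and measures their depth and branching factor. It is not the benchmark generator used in the experiments; the function names and the `ex:` prefix are hypothetical.

```python
def chain_query(length):
    """Path ?x0 -> ?x1 -> ... -> ?x<length>, one join per level."""
    return [(f"?x{i}", f"ex:p{i + 1}", f"?x{i + 1}") for i in range(length)]


def star_query(branches):
    """One hub ?x0 linked to `branches` leaves, with no joins among leaves."""
    return [("?x0", f"ex:p{i}", f"?x{i}") for i in range(1, branches + 1)]


def depth(triples, root="?x0"):
    """Length of the longest subject-to-object path starting at `root`."""
    children = {}
    for subject, _, obj in triples:
        children.setdefault(subject, []).append(obj)

    def walk(node):
        return max((1 + walk(c) for c in children.get(node, [])), default=0)

    return walk(root)


def branching_factor(triples):
    """Largest number of triples that share a single subject."""
    counts = {}
    for subject, _, _ in triples:
        counts[subject] = counts.get(subject, 0) + 1
    return max(counts.values(), default=0)
```

A chain of length n has depth n and branching factor 1, whereas a star with n branches has depth 1 and branching factor n, which is the contrast the two scenarios are designed to probe.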



Figure 7. Visually construct an RDF query.

Worst case analysis

The worst case happens when many class-mapping rules are generated for each RDF class and the number of triple groups in the query is also large. In this case, there are many applicable mapping rules for each triple group of the query. Thus, there would be many rewritings, since virtually all combinations produce valid rewritings and a complete algorithm is forced to form an exponential number of rewritings. In the experiment illustrated in Figure 9, we set up 10 sources and, for each source, eight chained tables are mapped to eight RDF classes, respectively. The figure shows that the cost of rewriting increases quickly as the number of triple groups and the number of sources increase. As can be seen, in the case of eight groups, the cost reaches 25 seconds with only four sources.
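The exponential growth can be made explicit with a small counting sketch. It assumes, as a simplification, that every combination of one applicable rule per triple group yields a valid rewriting; if each of s sources contributes exactly one applicable rule per group, a query with g groups then has s^g candidate rewritings. The function names are hypothetical.

```python
from itertools import product
from math import prod


def count_candidate_rewritings(rules_per_group):
    """Number of ways to pick one applicable rule for every triple group."""
    return prod(rules_per_group)


def enumerate_candidate_rewritings(rules_per_group):
    """Yield each combination as a tuple of rule indices, one per group."""
    return product(*(range(n) for n in rules_per_group))
```

Under this assumption, eight groups with four sources already give 4^8 = 65,536 combinations, and ten sources give 10^8, which is consistent with the steep cost increase reported for Figure 9.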

5. RELATED WORK

There are many related works. Within the domain of Grid research, there are many efforts concerning accessing and integrating databases under the Grid framework. Typical examples are EU’s OGSA-DAI efforts‡‡, GGF’s DAIS working group∗ and Oracle’s 10G. The significant difference is the RDF-based and Semantic-Web-oriented approach adopted in DartGrid. DartGrid complements those efforts with a semantic infrastructure for building database Grid applications.

‡‡OGSA-DAI, see http://www.ogsadai.org.uk/index.php.
∗Database Access and Integration Services WG, see http://forge.gridforum.org/projects/dais-wg.



Figure 8. Mapping complexity experiment: (a) chain scenario, (b) star scenario.



Figure 9. Worst-case analysis.

Within the domain of Semantic Web research, the most relevant works are D2RQ [8] and RDF Gateway†. D2RQ is implemented as a plug-in of the Jena framework [9]. This D2RQ plug-in wraps one or more local relational databases into a virtual RDF graph. It rewrites RDQL queries and Jena API calls into SQL queries. The result sets of these SQL queries are transformed into RDF triples that are passed up to the higher layers of the Jena framework. RDF Gateway is a platform for the development and deployment of Semantic Web applications. It connects external database resources to the Semantic Web via its SQL Data Service Interface. The SQL Data Service translates an RDF-based query to an SQL query and returns the results as RDF data. The SQL Data Service exposes the relational schema of the RDBMS as RDF Schema.

Compared with them, DartGrid exhibits several technical differences.

1. First, DartGrid exposes database resources as Grid services. The semantic query interface for processing RDF queries is also implemented as a Grid service, so that the user can develop service-oriented RDF applications.

2. Second, DartGrid provides a convenient visual tool to facilitate the schema mapping from database to RDF. Moreover, the relational schema of the RDBMS is defined as views on the mediated RDF schema. In our experience, this view-based approach is quite convenient for those databases that are not normalized enough. In contrast, for both D2RQ and RDF Gateway, the mapping information must be manually edited in a text mode and there is no consideration of the integration of denormalized relational databases.

†RDF Gateway, see http://www.intellidimension.com.



3. Finally, with DartGrid, a database can be dynamically added into the sharing cycle without any effect on the client application. A UDDI-like service, called the Semantic Registration Service, is developed to dynamically aggregate the service handlers and schema mapping information from highly distributed database resources. Neither D2RQ nor RDF Gateway takes this into consideration. If the user wants to add a new database, they must reprogram the application, because both the database handlers and mapping information are statically specified in the programs.
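The third point, dynamic registration, can be illustrated with a toy in-memory registry. This is a sketch of the idea only, not DartGrid's Semantic Registration Service interface: the class, method and handle names are hypothetical, and a real deployment would expose such a registry as a Grid service.

```python
class SemanticRegistry:
    """Toy registry: RDF class -> list of (service handle, schema mapping)."""

    def __init__(self):
        self._entries = {}

    def register(self, rdf_class, service_handle, mapping):
        """Add a database service and its mapping for one RDF class."""
        self._entries.setdefault(rdf_class, []).append((service_handle, mapping))

    def sources_for(self, rdf_class):
        """Return a copy of the sources currently registered for the class."""
        return list(self._entries.get(rdf_class, []))


def plan_query(registry, rdf_class):
    """Client-side step: look up the handles at query time, not at build time."""
    return [handle for handle, _ in registry.sources_for(rdf_class)]
```

Because the client resolves sources through the registry on every query, registering a further database changes the result of `plan_query` without any change to the client code.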

EDUTELLA [10] also has a subcomponent called SWARD [11] serving the same purpose. However, the approach of SWARD–EDUTELLA is essentially different from ours. SWARD has no mediated schema or domain semantics for heterogeneous database integration; it only exposes the relational schema as RDF descriptions and then queries the databases using those RDF descriptions.

In addition to the projects discussed above, there are many other relevant works that concern mapping Semantic Web data to an RDBMS, such as Jena [12], RDFSuite [13], KAON [14], Sesame [15] and D2RMAP [16]. Some of them deal with the issue of using a relational database management system (RDBMS) as RDF triple storage, such as Sesame, Parka, KAON, Jena, TAP and RDFSuite. Others deal with the issue of exposing relational data in RDF, such as D2RMap, DBVIEW, KAON and Jena. The approach used here is quite different from these examples. Approaches such as D2RMap have to dump all of the relational data into some serialized RDF format before querying them. This kind of approach would be quite inefficient if the RDBMS contains a large volume of data, which is a normal case for Grid applications. A more suitable approach, such as that of DartGrid, rewrites the RDF queries as SQL queries instead of converting all relational data into an RDF format.

Also, RDF/OWL can be regarded as a kind of knowledge representation approach to define the mediated schema. Therefore, in a wider research context, the proposal presented in this paper can be classified into the collection of results on knowledge-based database integration systems such as SIMS [17], OBSERVER [18], TAMBIS [19], etc. A survey on those approaches is given in [20]. The difference lies in the Grid-oriented architecture and the standard-based approach of DartGrid.
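The contrast between dumping and rewriting can be shown with a deliberately small sketch that turns triple patterns sharing one subject into a single SQL SELECT and runs it against an in-memory SQLite table. It is not the rewriting algorithm of DartGrid: the property-to-column mapping, the table `emp` and the sample rows are assumptions made for illustration, and only the single-table case is handled.

```python
import sqlite3

# Hypothetical mapping: RDF property -> (table, column). Illustrative only.
MAPPING = {
    "foaf:name": ("emp", "name"),
    "foaf:mbox": ("emp", "mbox"),
}


def rewrite_to_sql(patterns, mapping):
    """Rewrite (property, object) patterns on one subject into a SELECT.

    An object starting with '?' is a variable to return; anything else is
    a constant that becomes a parameterised WHERE condition.
    """
    tables = {mapping[prop][0] for prop, _ in patterns}
    if len(tables) != 1:
        raise ValueError("this sketch handles a single table only")
    select, where, params = [], [], []
    for prop, obj in patterns:
        column = mapping[prop][1]
        if obj.startswith("?"):
            select.append(f"{column} AS {obj[1:]}")
        else:
            where.append(f"{column} = ?")
            params.append(obj)
    if not select:
        raise ValueError("the query must return at least one variable")
    sql = f"SELECT {', '.join(select)} FROM {tables.pop()}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


def run_demo():
    """Create a tiny table, rewrite one query and execute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, mbox TEXT)")
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?)",
        [("Ann", "ann@example.org"), ("Bob", "bob@example.org")],
    )
    sql, params = rewrite_to_sql(
        [("foaf:name", "Ann"), ("foaf:mbox", "?em")], MAPPING
    )
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sql, rows
```

Only the rows matching the constraint leave the database, whereas a dump-based approach would first serialise every row of `emp` as RDF.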

6. SUMMARY AND FUTURE WORK

In summary, with DartGrid, we have made the following contributions.

1. An implemented Grid system that provides a set of semantic tools and Grid services to support building database Grid applications. On top of this system, we have built a real database Grid for TCM in China.

2. An RDF-View-based approach to define mappings from relational schemas to RDF ontologies. With our approach, the hidden semantics of a relational schema can be represented explicitly, denormalized relations can be normalized and the redundancy can be eliminated.

3. A view-based RDF query rewriting algorithm. Our work complements the research area of rewriting queries using views. The rewriting approach has been applied widely to many data models, such as relational, XML, regular path and description logic, but, to our knowledge, no attempt has been made to apply it to the RDF model.

Some of the open questions that require further study include: further performance improvement of the worst-case scenario; extension to more expressive RDF query features such as inverse roles, recursive queries and predefined constraints (such as subclass and subproperty axioms in RDFS); and integrating inference into query rewriting. Other issues that require further investigation include distributed transactions, database replication, etc.

ACKNOWLEDGEMENTS

This work is funded by the China 973 Project (Fundamental Approach, Model and Theory of Semantic Grid), a subprogram of the China 863 Project (TCM Virtual Research Institute), and China NSF Program NSFC60503018 (Research on Scale-free Network Model for Semantic Web and High Performance Semantic Search Algorithm).

REFERENCES

1. Halevy AY. Answering queries using views: A survey. Journal of Very Large Databases 2001; 10(4):75–102.

2. Abiteboul S. Complexity of answering queries using materialized views. Proceedings of the 17th ACM Symposium on Principles of Database Systems (PODS1998), May 1998. ACM Press: New York, 1998; 254–263.

3. Pottinger R, Halevy AY. MiniCon: A scalable algorithm for answering queries using views. Journal of Very Large Databases 2001; 10(2-3):182–198.

4. Yu C, Popa L. Constraint-based XML query rewriting for data integration. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD2004). ACM Press: New York, 2004; 371–382.

5. Deutsch A, Tannen V. MARS: A system for publishing XML from mixed and redundant storage. Proceedings of the International Conference on Very Large Database (VLDB2003). Morgan Kaufmann: Berlin, 2003.

6. Chen H, Wu Z, Huang C. TCM-Grid: Weaving a medical Grid for traditional Chinese medicine. Proceedings of the International Conference on Computational Science (Lecture Notes in Computer Science, vol. 2659). Springer: Berlin, 2003; 1143–1152.

7. Mao Y, Wu Z, Chen H. Semantic browser: An intelligent client for Dart-Grid. Proceedings of the International Conference on Computational Science (Lecture Notes in Computer Science, vol. 3036). Springer: Berlin, 2004; 470–473.

8. Bizer C, Seaborne A. D2RQ—Treating non-RDF databases as virtual RDF graphs. Proceedings of the 3rd International Semantic Web Conference (ISWC2004), November 2004. Available at: http://www.wiwiss.fu-berlin.de/suhl/bizer/d2rq/ [October 2005].

9. Carroll JJ et al. Jena: Implementing the semantic Web recommendations. Proceedings of the 13th World Wide Web Conference (WWW2004), May 2004. ACM Press: Danvers, MA, 2004; 74–84.

10. Nejdl W et al. EDUTELLA: A P2P networking infrastructure based on RDF. Proceedings of the 11th World Wide Web Conference (WWW2002), May 2002. ACM Press: Danvers, MA, 2002.

11. Petrini J, Risch T. Processing queries over RDF views of wrapped relational databases. Proceedings of the 1st International Workshop on Wrapper Techniques for Legacy Systems (WRAP2004), November 2004.

12. Reynolds D. Jena relational database interface. Part of the documentation for the Jena Semantic Web Toolkit, HP Labs Semantic Web Activity Web site. http://www.hpl.hp.com/semweb/doc/RDB/rdb-performance.html [October 2005].

13. Alexaki S, Christophides V, Karvounarakis G, Plexousakis D, Tolle K. The RDFSuite: Managing voluminous RDF description bases. Technical Report, ICS-FORTH, Heraklion, Greece, 2000.

14. Bozsak E et al. KAON SERVER—A semantic Web management system. Proceedings of the 12th International World Wide Web Conference (WWW2003), May 2003. ACM Press: Danvers, MA, 2003.

15. Broekstra J, Kampman A, van Harmelen F. Sesame: A generic architecture for storing and querying RDF and RDF Schema. Proceedings of the 1st International Semantic Web Conference (Lecture Notes in Computer Science, vol. 2342). Springer: Berlin, 2003; 54–68.

16. Bizer C. D2R MAP—A database to RDF mapping language. Poster at the International World Wide Web Conference (WWW2003), May 2003.

17. Arens Y, Knoblock CA, Shen W-M. Query reformulation for dynamic information integration. Journal of Intelligent Information Systems 1996; 6(2/3):99–130.

18. Mena E, Illarramendi A, Kashyap V, Sheth AP. OBSERVER: An approach for query processing in global information systems based on interoperation across pre-existing ontologies. Distributed and Parallel Databases 2000; 8(2):223–271.

19. Paton NW, Stevens R, Baker P, Goble CA, Bechhofer S, Brass A. Query processing in the TAMBIS bioinformatics source integration system. Proceedings Statistical and Scientific Database Management (SSDBM1999), June 1999. IEEE Computer Society Press: Boston, MA, 1999; 138–147.

20. Wache H, Vogele T, Visser U, Stuckenschmidt H, Schuster G, Neumann H, Hubner S. Ontology-based integration of information—a survey of existing approaches. Proceedings of the Workshop on Ontologies and Information Sharing at IJCAI2001, August 2001; 108–118.

