
Source: people.stern.nyu.edu/vassalos/defense/thesis.pdf



QUERYING AUTONOMOUS, HETEROGENEOUS

INFORMATION SOURCES

a dissertation

submitted to the department of computer science

and the committee on graduate studies

of stanford university

in partial fulfillment of the requirements

for the degree of

doctor of philosophy

By

Vasilios Antoniou Vassalos

September 2000


© Copyright 2000 by Vasilios Antoniou Vassalos

All Rights Reserved



I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Jeffrey D. Ullman (Principal Advisor)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Hector Garcia-Molina

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Yannis Papakonstantinou

Approved for the University Committee on Graduate Studies:



Abstract

A wide variety of information sources are available both in internal networks of organizations and on the Web. These sources are autonomous, have different and limited query capabilities, and usually contain heterogeneous data that only have partial, flexible, or implicit structure, i.e., that are semistructured (e.g., XML, bibliographic, or genomic data). Enabling users to query in an integrated manner the wealth of information in these sources is a crucial requirement for increasing the usefulness of the Web as an information resource and for enabling electronic commerce.

An effective system for online integration of such sources needs to perform two main tasks efficiently in response to a user query: first, devise a query plan that locates and retrieves the relevant pieces of information from the sources, by submitting to the sources localized queries that respect the sources' query capabilities; then, combine the pieces of information to produce a unified answer. This thesis develops powerful query processing techniques and architectures for information integration and studies some of the tradeoffs between the generality of the language and the efficiency of query processing in such an integration system.

The thesis adopts a powerful framework for the construction of an online integration system, proposed by the TSIMMIS project at Stanford University and inspired by the formal, declarative underpinnings of modern database and knowledge base management systems. In this framework, the core of the integration system is a query processor called a mediator, which implements the integrated query processing algorithms. The details of an integration scenario, including the integrated views that describe the way source information is to be combined, and the contents and query capabilities of the sources, are specified declaratively, in a high-level specification language.

The thesis studies logical languages for the specification of integrated views and the description of query capabilities. Both relational and semistructured languages, with and without recursion, are studied from the point of view of expressive power and efficiency.

The thesis presents sound and complete algorithms that solve the key problem of generating query plans that respect query capabilities described in these powerful languages (the capability-based rewriting problem). In particular, the first algorithm solving this problem for a semistructured language is presented.



Nunc est bibendum

Horace, Ode 37

Dedicated to my parents, Antonios and Angeliki, and my sister Leda



Acknowledgments

First and foremost, I would like to thank my advisor, Jeffrey Ullman, for his guidance, advice and support. His door was always open, and his comments on any issue were always insightful. As indispensable as his mentoring was to me in research, his guidance and advice on all other aspects of academic and professional life were just as important. I am grateful.

I would also like to thank Yannis Papakonstantinou. Yannis and I have collaborated extensively; this thesis would not have been possible without him. I thank him for his insight, his enthusiasm, the generous application of his critical skills to our papers (and to this thesis), and for being a friend. I hope our current endeavors are even more successful in their own way.

I thank Hector Garcia-Molina for his leadership and his support of the TSIMMIS project in general and my work in particular. His impeccable taste in research topics and his ability to home in on the flaws and merits of an idea quickly and explain them incisively have been an inspiration.

Serge Abiteboul's brilliant comments and his research style greatly influenced both my own research and my attitude towards research. I thank him for sharing his enthusiasm for good research and his distaste for sloppy, lazy research. I also thank him for selling me a trouble-free car.

I would like to thank the members of my reading and oral defense committees, Hector, Yannis, Gio Wiederhold, and Kincho Law. I would also like to thank all the faculty members of the Stanford database group, Jeff, Hector, Gio and Jennifer Widom, for creating an exciting environment for research and for leading the best database group in the world.

The Stanford database group is full of amazingly intelligent, exciting and friendly people. It has been a pleasure to live and work in such a high-energy milieu, and for that I want to thank every member of the group, and particularly Shiva, Sergey, Wilburt, Jason, Roy, Svetlozar and Tom. Special thanks go to Junghoo and Calvin for interesting discussions in our office, and for putting up with me yelling on the phone occasionally.

The Stanford Computer Science Department, first located in MJH, then in the Gates building, has provided the backdrop for a big part of my life for the past few years. Thanks to Suresh, Ashish, Piotr, Aris and Donald for making it memorable.

I also want to thank my friends Suresh, Yiannis Kontoyiannis, Menelaos, Panos, Yiannis Orginos and Christos for all the wonderful memories and the great times we had.

This thesis is the culmination of a long educational journey. My parents guided my first steps and cultivated in me the pursuit of excellence and love of learning that carried me through that journey. They let me move to the other end of the earth to continue that journey at a time that was difficult for both of them, and that my departure made more difficult. My parents and my sister have always offered me their unwavering support and love. I cannot thank them enough.

Generous support for my work by the NSF, DARPA, the Air Force, the Bodossakis Foundation and the L. Voudouri Foundation is gratefully acknowledged.



Contents

Abstract
Acknowledgments

1 Introduction
  1.1 Information integration and the challenges of autonomy and heterogeneity
  1.2 Semistructured data model
    1.2.1 Equivalence of OEM databases
    1.2.2 OEM and XML
  1.3 Information Integration
  1.4 Modeling and using source query capabilities
    1.4.1 Problem definitions: CBR and query expressibility
    1.4.2 CBR and query rewriting using views
  1.5 TSIMMIS for information integration
  1.6 Thesis Overview

2 Related Work

3 DSL: A Language for Semistructured Data
  3.1 Introduction
  3.2 The DAG Specification Language
    3.2.1 Syntax
    3.2.2 Semantics
    3.2.3 Syntactic restrictions on DSL rules
    3.2.4 Expressive power and complexity
    3.2.5 Normal forms of DSL rules
  3.3 Query composition for DSL
  3.4 Equivalence of DSL queries
    3.4.1 Mappings and containment mappings
    3.4.2 Extending the chase for set variables
    3.4.3 Deciding DSL query equivalence
  3.5 DSL and other semistructured languages
  3.6 Conclusion

4 Query Rewriting for Semistructured Data
  4.1 Introduction
  4.2 DSL Query Rewriting
    4.2.1 Rewriting of Queries with a Single Path Condition
  4.3 Using structural constraints
    4.3.1 General case of query rewriting
    4.3.2 Completeness and Complexity
  4.4 Capability-based rewriting in the TSIMMIS mediator
    4.4.1 Query Translation
    4.4.2 Physical Plans
    4.4.3 Source Capabilities Description: Templates
    4.4.4 Capability Based Plan Generation
    4.4.5 Rewriting algorithm and capability-based plan generation
  4.5 Related work
  4.6 Conclusions

5 The Capability Description Language p-Datalog
  5.1 The p-Datalog Source Description Language
    5.1.1 Formal description of p-Datalog
  5.2 Deciding query expressibility with p-Datalog descriptions
    5.2.1 Expressibility and translation
  5.3 Answering Queries Using p-Datalog Descriptions
    5.3.1 CBR with binding requirements
  5.4 An interesting and more efficient class of p-Datalog descriptions
    5.4.1 Lattice Framework
    5.4.2 QED and Ploop
  5.5 Expressive Power of p-Datalog
  5.6 Related Work
    5.6.1 Describing binding requirements in p-Datalog
  5.7 Conclusions and open problems

6 The Capability Description Language RQDL
  6.1 The RQDL description Language
    6.1.1 Using RQDL for query description
    6.1.2 Semantics of RQDL
  6.2 RQDL and mediator capabilities
  6.3 Reducing RQDL to p-Datalog with function symbols
    6.3.1 Reduction of a database to standard schema database
    6.3.2 Reduction of queries to standard schema queries
    6.3.3 Reduction of RQDL programs to Datalog programs over the standard schema
  6.4 QED and CBR for RQDL descriptions
    6.4.1 The query expressibility problem for RQDL
    6.4.2 The CBR problem for RQDL
  6.5 Conclusions and Related Work

A Enabling Integration: TSIMMIS Wrappers
  A.1 An Example
  A.2 Implemented Wrappers

B Sort program

Bibliography


List of Tables

4.1 Matcher Result

List of Figures

1.1 Example OEM objects
1.2 Text representation of OEM objects
1.3 XML representation of data of Figure 1.2
1.4 A common integration architecture
1.5 Declarative specification of wrappers and mediators
1.6 Mediator architecture
3.1 Result of (Q2) on database of Figure 1.1
3.2 Body graph examples
3.3 Head graph of (P5)
3.4 OEM database and result for Example 3.2.5
4.1 DSL query rewriting algorithm
4.2 TSIMMIS CBR architecture
5.1 Algorithm QED
5.2 Algorithm QED-T
5.3 Supporting set lattice for fact f for a database of size 5
5.4 Supporting sets and least common ancestor
6.1 Default rules for generation of attr tuples
6.2 Extended facts produced by Algorithm QED-T for Example 6.4.7
A.1 Wrapper architecture
A.2 Wrapper components and procedure calls
B.1 A logic program implementing selection sort


Chapter 1

Introduction

1.1 Information integration and the challenges of autonomy and heterogeneity

Information today (2000) resides on a variety of information sources that are increasingly interconnected. File systems, databases, document retrieval systems, workflow systems, ERP (enterprise resource planning) systems, data warehouses, and other sources of valuable information are accessible inside corporate intranets. Moreover, they are also becoming increasingly available to the "outside world," through extranets or the Internet. Being able to access and make sense of the available information is a significant challenge: organizations (and individuals) represent, maintain, and export the information using a variety of formats, data models, interfaces and semantics.

In order to use the information productively, it is important to get integrated access to it, i.e., to be able to request information (that may be found split among various sources) and get a consistent, integrated response, regardless of which information sources store the information in the answer and how they export it.

An instance of this problem is the problem of integrating relational or object-oriented databases. The database community has identified and worked on this problem since the 1980s, making substantial progress on database integration techniques [A+91; Gup89; LMR90; T+90]. But that line of work made a number of assumptions (namely fixed schemas, unrestricted access to the information and, in general, control over the information sources) that are increasingly often invalid: data often resides in autonomous sources containing heterogeneous information. Moreover, these information sources have different and often limited query capabilities.

The challenges of autonomy of sources and heterogeneity of information can be addressed by using a more flexible data model and a nonintrusive, online integration architecture that relies on interpreted, declarative specifications to guide the integration task. Using a more flexible, semistructured data model makes it possible to deal with data that do not conform to a rigid schema, or whose schema is not fully known in advance or evolves rapidly. Using a nonintrusive, online integration architecture respects the autonomy of the information sources, lowers the development time, and provides users with always correct, always up-to-date answers. These points are briefly discussed in the next sections.

1.2 Semistructured data model

Much of today's electronically stored information does not conform to traditional relational or object-oriented data models. Several applications store their data in nonstandard data formats, legacy systems, structured documents like HTML or SGML, etc. These data often have irregular structure: some objects may have missing attributes and others may have multiple occurrences of the same attribute. Moreover, as explained above, traditional data models are not well suited to the task of integrating heterogeneous data sources: often these sources belong to external organizations or partners not under the application's control; even if the data is internally modelled as object-oriented data, their structure is often only partially known, and may change without notice. Finally, data among these sources are usually syntactically heterogeneous: the same attribute may have different types in different objects, and semantically related information may be represented differently in various objects. Data like the above, characterized by the presence of some structure but the absence of a rigid, known schema, have recently been called semistructured data [PGMW95]. Important, desirable properties for a semistructured data model include that the data are self-describing¹ (because of the frequent absence of an a priori schema), that the model is not strongly typed, and that it supports nesting and is not in first normal form [Ull89].

¹ In the sense that it can be parsed without reference to an external schema.

A powerful and intuitive way to model semistructured data that satisfies the above requirements is to represent them as labeled graphs. The database is then self-describing in that the schema of a database instance is moved into the graph, in the form of labels


Figure 1.1: Example OEM objects

<&1,DBPL,set,{&2,&3}>
<&2,Book,set,{&4,&5,&6,&7}>
<&4,Title,string,'Materialized Views'>
<&5,ISBN,integer,999>
<&6,Keyword,string,'Relational'>
<&7,Author,string,'A. Gupta'>
<&3,Article,set,{&7,&8,&9}>
<&8,Title,string,'Constraint Checking'>
<&9,Conference,set,{&10,&11,&12}>
<&10,Name,string,'SIGMOD'>
<&11,Year,integer,1993>
<&12,Location,string,'Washington, DC'>

Figure 1.2: Text representation of OEM objects

attached to the graph nodes or edges. Moreover, graph edges naturally model object-subobject relationships.

The most popular semistructured data model, which is also the semistructured data model used in this thesis, is the Object Exchange Model (OEM), proposed by the TSIMMIS project and originally described in [PGMW95].

In the OEM data model, the data are represented as a rooted graph with labeled nodes² that have unique object ids. Figure 1.1 is an example of bibliographic data modeled as an OEM graph. A textual representation of the same data is shown in Figure 1.2.

Each OEM object consists of an object-id (e.g., &2), a label that explains its meaning (e.g., Title), a type (e.g., string), and a value.³ Labels are strings that are meaningful to applications or end-users. Labels may have different meanings at different information sources.

² In a later version of the model, used in the LORE database system, labels are attached to edges. This approach leads to only minor differences in the description of information and in the corresponding query and view definition languages. The techniques and algorithms described in this thesis apply with little change to the LORE version of the data model.

³ Note that, for simplicity, type information has been omitted from Figure 1.1.

Objects can be either atomic or complex (set objects). The value of an atomic object is of the specified atomic type (e.g., 'SIGMOD'). In the rest of the thesis, we assume that the type of all atomic objects is string and we omit type information from objects. The value of a complex object is a set of objects. Notice that this definition is inherently recursive, since the value of an object is part of the object.
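The object structure just described can be made concrete with a small sketch. The encoding below is hypothetical (not from the thesis): it mirrors the textual form of Figure 1.2, mapping each object id to a (label, value) pair, where an atomic object's value is a string or number and a set object's value is a frozenset of subobject ids.

```python
# Hypothetical in-memory encoding of the OEM objects of Figure 1.2:
# db[oid] = (label, value). Atomic objects carry a string or int;
# set objects carry a frozenset of subobject oids.
db = {
    "&1":  ("DBPL", frozenset({"&2", "&3"})),
    "&2":  ("Book", frozenset({"&4", "&5", "&6", "&7"})),
    "&4":  ("Title", "Materialized Views"),
    "&5":  ("ISBN", 999),
    "&6":  ("Keyword", "Relational"),
    "&7":  ("Author", "A. Gupta"),
    "&3":  ("Article", frozenset({"&7", "&8", "&9"})),
    "&8":  ("Title", "Constraint Checking"),
    "&9":  ("Conference", frozenset({"&10", "&11", "&12"})),
    "&10": ("Name", "SIGMOD"),
    "&11": ("Year", 1993),
    "&12": ("Location", "Washington, DC"),
}

def value(db, oid):
    """The value of a set object is (recursively) the subgraph below it,
    rendered here as a dict of subobject oid -> value; the value of an
    atomic object is its atomic value."""
    label, v = db[oid]
    if isinstance(v, frozenset):
        return {sub: value(db, sub) for sub in v}
    return v

# &7 (Author) is a subobject of both &2 (Book) and &3 (Article):
# shared subobjects are why OEM data is a graph, not a tree, in general.
print(value(db, "&10"))   # -> SIGMOD
```

Note how the recursive definition of "value" from the text appears directly as the recursive call: asking for the value of &1 unfolds the entire subgraph below it.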

Thus in the OEM data graph the nodes are the objects and the edges denote the object-subobject relationships. The leaf nodes have an associated atomic value. The set of objects pointed to by the outgoing edges from o is the value of o; because of the recursive nature of this definition, the value of o is essentially the OEM subgraph rooted at o.⁴ The OEM graph has roots, i.e., distinguished, top-level objects from which all other objects are accessible.

The object ids are typically atomic data. Formally, they are terms from the Herbrand universe [End72] consisting of

• a set of atomic data, which includes, but is not necessarily confined to, the atomic data appearing as labels and values, like &10 and Smith, and

• an arbitrary set of freely interpreted function symbols. For example, f(&10, ashish) is a possible object id, and the function symbol f "defines" the term. The function symbols are interpreted "freely," in the sense that two terms are considered equal only if they are syntactically identical.
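The "free" interpretation of function symbols can be made concrete: if a term object id is encoded as a nested tuple, Herbrand equality is just syntactic identity of the tuples. The helper name make_id below is illustrative, not from the thesis.

```python
# Hypothetical sketch: the term object id f(&10, ashish) encoded as the
# nested tuple ("f", "&10", "ashish"). Under the free (Herbrand)
# interpretation, two terms are equal exactly when they are
# syntactically identical: same symbol, same arguments, same order.
def make_id(symbol, *args):
    return (symbol,) + args

a = make_id("f", "&10", "ashish")
b = make_id("f", "&10", "ashish")
c = make_id("f", "ashish", "&10")

assert a == b   # syntactically identical terms denote the same object id
assert a != c   # same symbol and arguments, different order: distinct ids
```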

Object ids may be symbols with no particular meaning, or they may carry semantic meaning. For example, if the object is a Web page, then it is typically a good idea to have the URL be the object id. Furthermore, meaningful term object ids can facilitate the integration tasks, since they carry information about how they were created. Thus a term object id can describe how an integrated object was constructed. That information can be useful to both human users and the query processor. We will discuss this issue more in Chapter 3.

Note that OEM objects are self-describing, in that semantic and structural information about objects is encoded in the labels, the semantic object ids and the object-subobject relationships. Note also that OEM poses no restrictions on the labels of subobjects. For example, some Article object can have a single Title object, others may not have any Title, and others may have multiple Titles. In this way we allow Articles to have irregular structures, while at the same time regularities in the structure are implicitly reflected in the OEM object. Indeed, even if Articles have a regular structure we still gain by using OEM in integration, since Articles from different information sources will often have different, albeit regular, structures: the regularities can be reflected in the integrated structure, which also accommodates the structural differences gracefully.

Even though OEM can model data that can naturally be represented as an arbitrary graph, in many applications data is naturally represented as a directed acyclic graph, or as a tree. When the underlying graph is a tree, object ids can be omitted from the textual representation without loss of information.

⁴ Excluding o itself.

1.2.1 Equivalence of OEM databases

Two OEM databases D1 and D2 are equivalent if they are isomorphic, i.e., there exists a bijection θ between the object ids of the two databases such that for every pair (o, θ(o)) of object ids, with o ∈ D1 and θ(o) ∈ D2, the two objects identified by o and θ(o): (i) have the same label l; (ii) both have an atomic value or both have a set value; (iii) if they are atomic objects, they have the same atomic value v; and (iv) if they are set objects, they have isomorphic sets of subobjects.

Expressed differently, two OEM databases are equivalent if they are identical up to object id renaming.
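For tree-shaped OEM data, this definition can be checked mechanically. The sketch below (hypothetical code, not from the thesis) canonicalizes each database by recursively erasing object ids; under the tree assumption, two databases are equivalent exactly when their roots canonicalize to the same term. It assumes a dict encoding db[oid] = (label, value); general OEM graphs would require a full graph-isomorphism test instead.

```python
# Sketch of equivalence for tree-shaped OEM databases, assuming the
# hypothetical encoding db[oid] = (label, value): a set object's value
# is a frozenset of subobject oids, an atomic object's value is a
# string or number. Canonicalization replaces every object by a
# (label, sorted-subvalues) term, so object ids disappear and two
# trees compare equal iff they are identical up to oid renaming.
def canonical(db, oid):
    label, v = db[oid]
    if isinstance(v, frozenset):
        # Sort subobject terms (by repr, to tolerate mixed value types)
        # so the result is independent of oid names and of set order.
        return (label, tuple(sorted((canonical(db, s) for s in v), key=repr)))
    return (label, v)

def equivalent(db1, root1, db2, root2):
    return canonical(db1, root1) == canonical(db2, root2)

d1 = {"r": ("Book", frozenset({"a", "b"})),
      "a": ("Title", "Views"), "b": ("Year", "1993")}
# The same data under different object ids:
d2 = {"x": ("Book", frozenset({"y", "z"})),
      "y": ("Year", "1993"), "z": ("Title", "Views")}

assert equivalent(d1, "r", d2, "x")
```

The bijection θ of the definition is implicit here: matching canonical subterms of the two trees pairs up the corresponding object ids.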

1.2.2 OEM and XML

Other semistructured data models that have been proposed [Suc98; BDHS96] are very similar to OEM. Recently, the Extensible Markup Language (XML) [BPSM] has emerged as the new, lightweight standard for the description, exchange and integration of information on the Web.⁵ XML data are self-describing and bear a striking similarity to OEM data, as well as to other semistructured models, as illustrated in Figure 1.3, which presents the OEM data of Figure 1.2 in XML syntax.

⁵ In [MFDG98], John Bosak, a leader in the XML initiative, mentions information-integration applications as a major motivation for the lightweight XML standard.

<DBPL id="&1">
  <Book id="&2">
    <Title id="&4">Materialized Views</Title>
    <ISBN id="&5">999</ISBN>
    <Keyword id="&6">Relational</Keyword>
    <Author id="&7">A. Gupta</Author>
  </Book>
  <Article id="&3" idref="&7">
    <Title id="&8">Constraint Checking</Title>
    <Conference id="&9">
      <Name id="&10">SIGMOD</Name>
      <Year id="&11">1993</Year>
      <Location id="&12">Washington, DC</Location>
    </Conference>
  </Article>
</DBPL>

Figure 1.3: XML representation of data of Figure 1.2

There are a few differences between XML and OEM; most of them stem from the document-oriented nature of SGML [SGM], the standard from which XML is derived. A detailed comparison of the (mostly superficial) differences between OEM/semistructured models and XML can be found in [Suc98]. The following differences are most relevant to mediation applications:

Ordering The subobjects of an XML object (called element) are always ordered, while OEM objects are unordered. Ordering brings determinism to the textual representation of objects or, for that matter, to the representation of an object transmitted over a network. Furthermore, it allows the natural modeling of lists. However, query languages with order semantics have higher complexity and more complicated semantics; it is unnecessary to incur these costs when the order is semantically meaningless. Hence, it is likely that an XML version that combines support for lists and sets will emerge. Many important query processing and optimization problems in information integration are still open when a data model and language with order semantics is used.

Subelements versus Referenced Elements OEM supports only one relationship across elements: an object x may point to an object y. Thus OEM models both an object-subobject and a reference relationship with directed edges between objects. In contrast, in XML (as well as in object-oriented models [AHV95]) there is a distinction between element y being a subelement of x and x referring to an element y. The graph representation of an XML document thus potentially involves two kinds of edges: the "subelement" ones and the "reference" ones.

This distinction may be important from an information modeling point of view. In addition, it provides some benefits for the query processing and storing of semistructured objects. For example, the edges that correspond to the subelement relationship are usually more numerous than the "reference" edges, and the subelement edges form a tree. A query processor may be able to exploit the tree structure. Similarly, a storage manager can exploit the natural nesting of the tree objects.

Nonstring Data XML does not allow nontextual data as values of atomic objects. In-

stead, binary data are linked as external entities. Separating textual from binary data

adds complexity that seems to be useful only in the context of a document model —

as opposed to a data model such as OEM.

Schema XML provides for flexible schema definition languages to capture the islands of

structure existing in semistructured information. OEM can use similar languages for

the same purpose, even though originally it was designed as a schemaless data model.


The benefits and uses of available schema information for semistructured data are

an important topic for information integration that is not addressed in this thesis —

we only discuss briefly one of the uses of flexible schema information to information

integration in Section 4.3. For some recent work on the topic, see [PV00; MS99;

MSV00].

1.3 Information Integration

Information integration systems provide integrated query access to the sources via the

mediator/wrapper architecture of Figure 1.4.

Figure 1.4: A common integration architecture

Conceptually, an integration system must perform three main tasks in response to a

user query:

• identify and locate the pieces of data in the information sources that make up the

answer to the user query,

• create and execute a query plan for correctly and efficiently retrieving the data from

the sources, and


• construct the answer to the user query by piecing together appropriately and manip-

ulating the returned data.

These tasks are performed by the mediator. The mediator is a distributed query process-

ing, optimization and execution engine for distributed information residing on autonomous,

heterogeneous sources. The mediator combines, integrates, and refines source data, provid-

ing applications with a “cleaner” integrated view of the source information. For example,

a web car-shopping mediator provides access to a set of dealers’ pages and even to other

car-shopping mediators. Users accessing the mediator would see a single collection of on-sale cars, with, for example, duplicates removed,[6] format discrepancies resolved, and cars

ranked according to some criterion such as having the current best offers appear first. The

same or different mediators can provide different integrated views of source information,

geared towards different integration uses or applications, either focusing on different parts

of the source information (e.g., new, Japanese sedans with at least one review), or per-

forming different kinds of integration (e.g., summarization versus information fusion versus

schema/structure normalization).

The integration system uses a uniform data representation (most appropriately, a se-

mistructured one, for the reasons explained in Section 1.2) and a common query language.

Wrappers present a logical view of the data of each source, represented in the common data

model. The wrappers also accept queries in the common query language on the exported

logical view. When the wrappers receive a query, they translate it into one or more source-

specific queries or commands that are issued to the source. They also translate the source

result into the common data model. The wrappers abstract away the implementation and

interface details of information sources from the rest of the integration system. Applications

can access data directly through wrappers, although most applications will typically access

them through mediators.
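The wrapper's translation role described above can be sketched concretely. Everything below (class names, the toy "native" interface, field names) is purely illustrative, not the TSIMMIS wrapper API: the wrapper translates a common-language query into the source's native command, then translates the native result back into the common representation.

```python
# Toy wrapper (all names hypothetical): translate a common-language
# selection into the source's native interface, then convert native
# records back into the common model's field names.

class ToyBibSource:
    """Stands in for a native source that only supports lookup by year."""
    def __init__(self, records):
        self.records = records
    def lookup_by_year(self, year):          # the source's native command
        return [r for r in self.records if r["yr"] == year]

class ToyWrapper:
    def __init__(self, source):
        self.source = source
    def answer(self, common_query):
        # common_query: {"select": {"year": v}} in a made-up common language
        year = common_query["select"]["year"]
        native = self.source.lookup_by_year(year)   # translated native call
        # translate native records into the common model's field names
        return [{"title": r["ttl"], "year": r["yr"]} for r in native]

src = ToyBibSource([{"ttl": "Materialized Views", "yr": 1997},
                    {"ttl": "Constraint Checking", "yr": 1993}])
w = ToyWrapper(src)
print(w.answer({"select": {"year": 1993}}))
# [{'title': 'Constraint Checking', 'year': 1993}]
```

The point of the sketch is the separation of concerns: nothing outside the wrapper ever sees the source's native command or record layout.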

1.4 Modeling and using source query capabilities

In the architecture of Figure 1.4, the mediator decomposes incoming client queries, which

refer to the integrated view presented by the mediator and are expressed in some common

query language, into new common-language queries that refer to data found in the individual

[6] Duplicates could have been introduced because multiple sites may be advertising the same car.


sources and are sent to the wrappers. The information sources connected to the wrappers

are accessible through interfaces that have varying query capabilities; the queries emitted

by the mediator must conform to these capabilities, because otherwise it is not possible for

the wrappers to translate them into native queries and commands. Let us use an example

to illustrate the query processing steps followed by the mediator.

Consider a bibliographic mediator that combines the data of multiple bibliographic

sources into a single “union” view. The user query requests all “SIGMOD 97” publications.

The mediator decomposes the user query into multiple “SIGMOD 97” queries, of which

each is source-specific, i.e., it refers to one source only. To do the decomposition correctly

and efficiently, the mediator must figure out how to extract the necessary information from

the sources using their query capabilities. This is the Capability-Based Rewriting (CBR)

problem. In our example, if one source only supports selection queries on “year”, solving

the CBR the mediator will decide that a query that retrieves the “97” publications will be

sent to this source. The rest, i.e., filtering for “SIGMOD,” will be done at the mediator.

After such decisions are made, and the mediator formulates a query plan that respects the

query capabilities of the sources, each query is sent to a wrapper, where it is translated into

the native query language of the corresponding source. Then the individual query results

are collected, the information is filtered appropriately and consolidated into one entity by

the mediator, and the combined result is presented to the user.
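The decomposition in this example can be sketched as follows; the data layout and function names are ours, purely illustrative. The source answers only year selections, so the year condition is pushed down, while the SIGMOD filter runs at the mediator.

```python
# Sketch of capability-based decomposition (hypothetical data layout):
# the source only supports selection on "year", so that selection is
# pushed to the source and the venue filter is applied at the mediator.

def source_query(records, year):
    # The only query shape this source supports: selection on "year".
    return [r for r in records if r["year"] == year]

def mediator(sources, venue, year):
    answer = []
    for records in sources:
        pushed = source_query(records, year)                  # done at the source
        answer += [r for r in pushed if r["venue"] == venue]  # done at the mediator
    return answer

s1 = [{"venue": "SIGMOD", "year": 1997, "title": "a"},
      {"venue": "VLDB",   "year": 1997, "title": "b"}]
s2 = [{"venue": "SIGMOD", "year": 1996, "title": "c"},
      {"venue": "SIGMOD", "year": 1997, "title": "d"}]
print(mediator([s1, s2], "SIGMOD", 1997))   # titles "a" and "d"
```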

In order to be able to perform capability-based rewriting, the mediator needs formal

descriptions of the query capabilities of the information sources. A capability-based rewriter

takes as input these descriptions and the query, and it infers query plans for retrieving the

required data that are compatible with the source query capabilities. Solving the CBR

typically produces more than one candidate plan for the query.

The wrappers also need descriptions of the source capabilities in order to translate

the supported common-language queries into queries and commands understood by the

source interface. Conceptually, descriptions are associated with actions that perform the

translation, in the same style as Yacc [ASU87].
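A minimal sketch of such description/action pairs, in the Yacc spirit; the template syntax and the "native" command language below are invented for illustration. Each supported query shape is paired with an action that emits the corresponding native command, and a query matching no template is not supported by the description.

```python
# Capability descriptions paired with translation actions (Yacc style).
# The template patterns and the FIND-RECORDS command syntax are made up.

import re

# Each entry: (regex over the supported common-language query shapes,
#              action building the native command from the match)
TEMPLATES = [
    (re.compile(r"select \* where year = (\d+)"),
     lambda m: f"FIND RECORDS WITH YR {m.group(1)}"),
    (re.compile(r"select \* where author = '(\w+)'"),
     lambda m: f"FIND RECORDS WITH AU {m.group(1)}"),
]

def translate(common_query):
    for pattern, action in TEMPLATES:
        m = pattern.fullmatch(common_query)
        if m:
            return action(m)   # fire the action attached to the description
    raise ValueError("query not supported by this source's description")

print(translate("select * where year = 1997"))
# FIND RECORDS WITH YR 1997
```

Note that the same machinery answers the membership question: a query is described by the capabilities exactly when some template matches it.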

1.4.1 Problem definitions: CBR and query expressibility

Source query capabilities can be described in terms of the queries that the source supports: a

capability description is a finite encoding of the set of queries (in some query language) that


the source can answer.[7] Therefore, the semantics of the description is a set of queries. A

query is described by the capability description if it belongs in that set. A query is expressible

at the information source, or, equivalently, is expressible by the capability description, if it

is equivalent to a query described by the source’s capability description.

The CBR problem can then be formulated as follows: Given a description of the sources’

capabilities, how can we answer a query using only queries expressible at the sources? The

(related, but simpler) query expressibility problem is as follows: Given a description of the

source’s capabilities and a query, is the query expressible at the source?

1.4.2 CBR and query rewriting using views

The CBR problem is strongly related to the problem of query rewriting using views [Lev;

Ull97], defined as follows: Given a query q accessing some database[8] D and a set of views

V = {V1, . . . , Vn} over D, find rewriting queries. A rewriting query of q given V is a query

that accesses at least one view of V and returns the same result as q (for any D). If the

rewriting query uses views only (i.e., it does not access directly the database D) then it is

called a total rewriting query.
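For a concrete (relational, for simplicity) illustration with made-up data: if q(X,Y) :- r(X,Z), r(Z,Y) and the view v(A,B) :- r(A,B) is available, then q'(X,Y) :- v(X,Z), v(Z,Y) is a total rewriting. A brute-force check on a sample database:

```python
# Brute-force check that q'(X,Y) :- v(X,Z), v(Z,Y) is a total rewriting
# of q(X,Y) :- r(X,Z), r(Z,Y) given the view v(A,B) :- r(A,B).
# The database contents are made up.

from itertools import product

def eval_cq(atoms, db, head_vars):
    """Evaluate a conjunctive query given as a list of (relation, vars) atoms."""
    answers = set()
    for choice in product(*[db[rel] for rel, _ in atoms]):
        binding, ok = {}, True
        for (rel, args), tup in zip(atoms, choice):
            for var, val in zip(args, tup):
                if binding.setdefault(var, val) != val:
                    ok = False          # variable bound to two different values
        if ok:
            answers.add(tuple(binding[v] for v in head_vars))
    return answers

D = {"r": {(1, 2), (2, 3), (3, 1)}}     # a sample database
V = {"v": set(D["r"])}                  # view extension of v(A,B) :- r(A,B)

q  = eval_cq([("r", ("X", "Z")), ("r", ("Z", "Y"))], D, ("X", "Y"))
qv = eval_cq([("v", ("X", "Z")), ("v", ("Z", "Y"))], V, ("X", "Y"))
print(q == qv, sorted(q))
# True [(1, 3), (2, 1), (3, 2)]
```

The rewriting accesses only v, so it is total; since v materializes r exactly, the two answer sets coincide on any D, not just this one.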

If the query capabilities of the information source are modelled as a set of queries, or

equivalently views, over the source contents, then the CBR problem is indeed a rewording

of the problem of query rewriting using views in the context of information integration.

Notice that the problem of answering queries using views [GM99; AD98] is related, but

different:[9] Given a query q, a set of views and the view extensions, find the set of tuples

t such that t is in the answer to q for all the databases that are consistent with the view

extensions (i.e., the certain answers to q).
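A brute-force sketch of certain answers on a tiny example (the relations and values are made up, and we assume exact-view semantics, i.e., a database is consistent when it yields exactly the given view extensions):

```python
# Certain answers by enumeration over a tiny active domain.
# View v(A) :- r(A, A) with extension {(1,)}; query q(X) :- r(X, Y).
# A tuple is certain iff it is in q's answer on EVERY consistent database.

from itertools import chain, combinations

dom = [1, 2]
all_tuples = [(a, b) for a in dom for b in dom]

def powerset(xs):
    return chain.from_iterable(combinations(xs, k) for k in range(len(xs) + 1))

def v(r):   # view definition v(A) :- r(A, A)
    return {(a,) for a, b in r if a == b}

def q(r):   # query q(X) :- r(X, Y)
    return {(a,) for a, b in r}

v_ext = {(1,)}   # the given view extension

# every candidate database over the domain that is consistent with v_ext
consistent = [set(r) for r in powerset(all_tuples) if v(set(r)) == v_ext]
certain = set.intersection(*(q(r) for r in consistent))
print(certain)
# {(1,)}
```

Here (1,) is certain because every consistent database must contain r(1,1), whereas (2,) is not: the database {r(1,1)} is consistent yet does not put (2,) in q's answer.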

1.5 TSIMMIS for information integration

The information integration system developed in the TSIMMIS project follows the general

architecture of Figure 1.4. In good engineering tradition, the project emphasized the au-

tomation of wrappers’ and mediators’ development. In particular, we developed tools that

[7] For finite sets of supported queries, the source capabilities can be described by fully enumerating them.
[8] The database may be distributed over multiple sites.
[9] The distinction between these two problems is drawn in [CGLV99; CGLV00].


Figure 1.5: Declarative specification of wrappers and mediators

allow the implementation of wrappers and mediators from high level specifications of their

functionality, as shown in Figure 1.5.

One can develop a mediator by providing a declarative specification of the integrated

view, expressed in a semistructured language that is the common query and view definition

language of the integration system, to the generic mediator specification interpreter that

was developed in TSIMMIS. Similarly, one develops a wrapper by providing a wrapper

specification to the generic wrapper generator. The wrapper specification shows how queries

expressed in the common query language are translated into queries expressed in the native

query language of the underlying sources.

The TSIMMIS data model is OEM and it also uses a semistructured query and view

definition language. During run time, when the mediator receives a client query, it composes

the query with the mediator view definition. Then the mediator creates a plan that sends

queries to the wrappers. These queries are translated by the wrappers into native queries on

the underlying systems. The wrappers translate the source results into the OEM model and


they ship them to the mediator, where the plan combines them into the client query result.

Detailed end-to-end presentations of query processing in TSIMMIS can be found in [GM+97;

GMPVY; Pap97].

Figure 1.6: Mediator architecture

The TSIMMIS system uses parametrized views to describe source query capabilities.

The Capabilities-Based Rewriter uses the source capability descriptions to adapt to the

query capabilities of the sources (see Figure 1.6). The rewriting algorithm employed by the

CBR module of the mediator is discussed in Chapter 4.

Finally, a cost optimizer provides cost estimates. The TSIMMIS approach is based on

a loose coupling of the CBR with the optimizer. Systems and algorithms where a CBR

module and the optimizer are tightly coupled are described in [HKWY97] and [PGH98].

We are not concerned in this thesis with estimating the cost of the plans. Relevant work

can be found in [ACPS96; DKS92].

1.6 Thesis Overview

Chapter 2 discusses related work in the area of information integration. Chapter 3 discusses

the semistructured language DSL (DAG Specification Language) and presents query con-

tainment and query composition algorithms for it. DSL is a variant of the view and query

language used in the TSIMMIS system, and a DSL-based template language is also used

by TSIMMIS as a source capability description language. Chapter 4 presents a rewriting


algorithm for DSL, which is the first rewriting algorithm for a semistructured language. It

also presents an efficient rewriting heuristic, based on the general-purpose algorithm, that is

used in the TSIMMIS system. Chapter 5 discusses the query capability description language

p-Datalog (including expressibility results) and presents solutions to the CBR problem and

the query expressibility problem for p-Datalog and p-Datalog variants. Chapter 6 presents

algorithms for the CBR and query expressibility problems for the more powerful capabil-

ity description language RQDL. It also presents a reduction of RQDL to p-Datalog with

function symbols. Finally, Appendix A presents the TSIMMIS wrapper architecture.


Chapter 2

Related Work

This chapter summarizes related work on integrated querying of autonomous, heterogeneous

sources. Notice that detailed comparisons between more specific topics addressed in this

thesis and related work are found in the related work section of the corresponding chapters.

Earlier work on database integration [A+91; K+93; BLN86; LMR90; T+90; Gup89;

FLNS88] focused on the integration of well-structured databases, with fixed schemas, that

support powerful query languages. A significant amount of work was devoted to the offline integration of the database schemas, in order to produce a schema for the integrated

database. As described in the introduction, the assumptions made in much of this work are

increasingly invalid. This thesis focuses on technologies for integrating heterogeneous and

autonomous information sources.

Recently, a new generation of systems has focused on the integration of sources that

may not necessarily be structured databases. We briefly describe them here.

The TSIMMIS project has recently focused on various query optimization and modelling

issues in integrated querying, in addition to the work described in this thesis and earlier work

in semistructured models and languages and query processing [Pap97]. In particular, the

problem of automatic computation of mediator capabilities has been studied in [YLGMU99]

(compare with Section 6.2). Integrated querying over sources supporting disjunction is

studied in [GMLY99], while [YLUGM99] study the problem of choosing efficient integrated

query plans, and propose provably efficient heuristics. Finally, [AGMPY98] addresses the

issue of optimizing large fusion queries.

HERMES [S+] attempts to solve the integration problem by a mediator specification

language where literals explicitly specify the parameterized calls that are sent to the sources.


Unfortunately, the HERMES solution reduces the interface between the integration system

and the sources to a limited set of explicitly listed parameterized calls.

Garlic [C+95; HKWY96; ROH99] focuses on integrating heterogeneous databases or

multimedia data stores that are autonomous but cooperative. That means that their data

have a fixed, known schema, either relational or object-oriented, they provide a minimum of

database-like functionality, like support for either general-purpose or media-specific query

operators, and they provide access to a wealth of metadata, such as the schemas, query

plans, cost models for their operators, etc. In that sense, Garlic represents the continua-

tion of research on federated databases [FGL+98]. The focus of Garlic is on effective and

efficient integrated query optimization, making the best use of the metadata and taking

advantage of the media-specific operators at the sources. Our work assumes a looser inte-

gration scenario: information sources are not assumed to be cooperative, nor are the source

data assumed to be strongly typed. Correspondingly, we place considerable emphasis on

describing source contents and capabilities using flexible descriptions that assume minimal

knowledge about source structure and content. The problem of discovering feasible query

plans, a necessary first step before query optimization, receives more attention in our work,

making it complementary to Garlic. Finally, the Garlic wrapper architecture is very similar

to the TSIMMIS wrapper architecture (see Appendix A) [RS97]. Mapping queries and op-

erators from the Garlic data model and language to the native data model and language in the wrappers is, however, a much simpler operation, given that the Garlic data model is essentially

ODL [AHV95] and the mediated data sources are also databases.

The Tukwila data integration system [IFF+99] introduces the concept of adaptive query

optimization and execution: interleaving planning and execution with partial optimization

allows Tukwila to recover from query planning decisions based on inaccurate estimates.

Moreover, Tukwila proposes a new adaptive operator for join, the doubly pipelined join,

that is better suited to the requirements of integrated querying over autonomous information sources. As with Garlic, the Tukwila work is complementary to ours: the feasible query plans we produce can immediately benefit from more efficient and intelligent query engines. Moreover, the Tukwila system was originally designed for processing only relational

data.[1]

The MIX project [LPV00] is following in the steps of TSIMMIS, putting emphasis on

[1] Tukwila is currently (May 2000) being extended to process XML data [Tuk].


query processing in the mediator. The MIX mediator exports a virtual view XML document

into which the client navigates. Client navigation commands are translated online into

source/wrapper sequences of navigation commands and/or queries. MIX also provides a

query-by-example user interface that is driven by the DTD of the virtual view [MP00].

In the SIMS project [ACHK93] and the follow-on project Ariadne [AAB+98] the under-

lying assumption is that for each application there is a unifying domain model that provides

a single ontology for the application. These projects focus on providing powerful knowledge

representation techniques (ontologies, description logics) to create expressive domain mod-

els for the application (see also [CGL+98b]). Each source model is described in terms of the

unifying domain model (the so-called local-as-view approach). Because of the power and

complexity of the knowledge representation techniques used, the use of powerful query languages makes the problem of integrated query planning intractable. Thus SIMS and Ariadne

have focused on limited query languages (the join operator is not supported), as well as on

query planning heuristics rooted in the AI planning literature [AK97]. Other recent work in

query planning heuristics inspired from the AI planning literature includes [KW96; FW97;

AKL97; FLM99]. In contrast, in our work the emphasis is on sound and complete query

processing algorithms, and on description languages that, while expressive, are tractable.

The Information Manifold [LRO96] and Infomaster [GKD97] projects follow the same

local-as-view approach as SIMS and Ariadne. Both projects use expressive yet tractable

description languages for the universal domain model: a simple description logic for the

Information Manifold and Datalog for Infomaster.2 The Information Manifold also ad-

dresses the issue of dealing with sources with limited capabilities by modeling the source

capabilities using capability records. At the core of the Infomaster query planner is a novel,

computationally cheap algorithm for rewriting queries using views [DG97; DL97]. An in-

teresting comparison of the assumptions and relative strengths of the Information Manifold

and TSIMMIS can be found in [Ull97].

The Distributed Information Search Component DISCO [TRV98] describes the capa-

bilities of the sources using context-free grammars appropriately augmented with actions.

DISCO enumerates plans initially ignoring limited wrapper capabilities. It then checks

the queries that appear in the plans against the wrapper grammars and rejects the plans

containing unsupported queries.

[2] Infomaster initially used KIF [G+92], a very general knowledge representation language.


UnQL (Unstructured Query Language) [BDHS96] was among the first languages and

systems (together with TSIMMIS) for semistructured data. UnQL uses a graph-based data

model very similar to OEM and XML. UnQL is a functional language that uses a structural

recursion paradigm. The emphasis in UnQL was on developing mathematical constructs for

querying semistructured data. The structural recursion paradigm of UnQL is very similar

to XSL (XML stylesheet language) [Adl].

The FLORID system [LHL+98] is a deductive, object-oriented system for managing and

integrating semistructured data, based on F-logic [KL89]. Semistructured data are modeled

using an object-oriented data model and queried using F-logic, an object-oriented logic

language that supports object identity through Skolem functions, a concept that proved

useful for semistructured languages for integration (see also Chapter 3). FLORID was used

to define and query the structure and content of Web sites declaratively.

The Strudel system [FFK+98] has applied concepts from information integration to

the task of building complex Web sites that serve information derived from multiple data

sources. Strudel separates site content from structure: a Web site is a declaratively-defined

site graph over the semistructured data graph of the contents of the information sources. If

we only have access to the information through the Web site(s), queries asked over the data

graph need to be rewritten as queries over the Web site structure and contents. The Web site

definitions are just view definitions over the data graph. The system is declaratively defined

using the StruQL language [FFLS97]. StruQL has also been a vehicle for the investigation

of theoretical problems related to semistructured languages, such as query containment in

the presence of regular path expressions and intensional[3] constraint checking.

Tiramisu [ALW99] is a follow-on project to Strudel that separates the implementation

of the site from the design of the site, and supports a top-down development of the site.

Tiramisu allows quick integration with external implementation tools that create web con-

tent. Tiramisu also allows a web master to graphically view the web site as a graph of web

content connected together by hypertext and inclusion links.

The WEAVE system [FLSY99] extends the Strudel work by allowing compile-time pre-

computation and other optimizations to improve querying and navigating the site graph.

The COIN system [SBGJ+97] focuses primarily on issues of semantic integration, namely

how to use limited contextual information to automatically identify and resolve differences

[3] Using only the constraints and the view definitions.


in measurement units, terminology or schema definitions. These issues are orthogonal to

the issues discussed in this thesis.

More optimization issues There has been some preliminary work on utilizing source

overlap and redundancy information for query optimization in a mediation environment

[FKL97; VP98; DL99]. [FLMS99] offers an analysis of the search space for query opti-

mization in the presence of sources with limited capabilities and describes a heuristic for

capability-sensitive query optimization as well as experimental evidence of its performance.

Theoretical work A lot of theoretical work has been done on the problems of rewriting

queries using views and answering queries using views, and the related problems of query

containment in the presence of views, mainly for relational languages [LMSS95; LRU96;

RSU95; VP00; VP97; DG97; DL97; AD98; Qia96; MLF00; PL00; GM99], for description

and higher-order logics [CGL98a; CGL99; BLR97], and recently also for languages with

regular expressions [CGLV99; CGLV00]. As we discussed in Chapter 1, these problems are

at the core of query planning algorithms for integrated querying systems. Some of this work

is discussed in more detail in Chapters 3, 4, 5 and 6. The first paper on rewriting queries

using views for a semistructured language is [PV99]. A comprehensive survey on this topic

is [Lev].

The importance and relevance of research in technology for integrated querying of au-

tonomous, heterogeneous sources is highlighted by the arrival of commercial products in-

corporating results of this research, as well as of companies further developing this tech-

nology. Products include IBM’s DataJoiner [GL94], the DataLinks enhancements to IBM’s

DB2 UDB [PWDN99], and Microsoft’s OLE DB [Bla96]. Companies include Junglee

[GHR98], Mergent [Mer] and Cadabra [Cad], based on technology developed in the Info-

master project, Nimble [Nim], based on research in the Tukwila project, Fetch Technologies

[Fet], founded around technology developed in the SIMS and Ariadne projects at ISI, and

Enosys Markets [Eno], based on technology developed in the TSIMMIS and MIX projects.


Chapter 3

DSL: A Language for Semistructured Data

3.1 Introduction

In this chapter we present DSL (DAG Specification Language), a language for semistruc-

tured data. DSL uses the OEM data model and is an object-oriented, rule-based query

and view definition language in the spirit of Datalog [Ull89]. DSL is a variant of the Me-

diator Specification Language [PGMU96] and, like MSL, it is especially well-suited to the

definition of integrated views over semistructured, heterogeneous sources; the definition of

such views is an essential component of our integrated querying architecture, as explained

in Chapter 1.

The distinguishing characteristics of DSL are its ability to manipulate semistructured

data and its effectiveness in information fusion. In particular, DSL has most of the features

identified as important for semistructured languages in recent literature (e.g., in [FSW99]),

such as support for paths, object nesting, label variables, and restructuring capabilities. As

we will see shortly, DSL rule syntax is similar to XML. Finally, DSL supports object identity

through the definition of semantic object-ids using Skolem functions [End72], in the spirit of

F-Logic [KL89] and ILOG [HY90]. Semantic object-ids enable fusion of heterogeneous query

results, as explained in [PAGM96]. Together, these characteristics make DSL a candidate

for an XML query language, like Lorel [AQM+97] or XML-QL [DFF+].

The purpose of this chapter is not to present a new semistructured query language,

since DSL is a variant of MSL. Its purpose is to describe the formal semantics of DSL


and to identify the syntactic and semantic tradeoffs that had to be made (compared to

other semistructured languages) to enable us to discover solutions to the problems of query

composition and query equivalence for DSL, without significantly affecting the power of

DSL. We also present these solutions, namely query composition and query equivalence

algorithms for DSL.

We start by presenting the syntax and semantics of DSL. We present query composition

and query equivalence algorithms for DSL in Sections 3.3 and 3.4 respectively. Finally, in

Section 3.5, we discuss other semistructured languages.

3.2 The DAG Specification Language

Queries over OEM data need to impose conditions on semistructured data graphs, as well

as to output new OEM data as results. Therefore, a query language over OEM data needs

to have an object-selection component as well as a transformation component that allows

creation of new OEM objects and graphs. To express queries and view definitions over

OEM data, we use the DAG Specification Language (DSL).

3.2.1 Syntax

A DSL rule is a rule of the form head :- body in the style of Datalog [Ull88]. Intuitively, the

head describes the result objects in the answer graph, whereas the body is a conjunction of

one or more conditions that must be satisfied by the source objects. The head and the body

conditions are based on object patterns of the form <object-id label value>. If the object id is

omitted from a rule head object pattern, it is assumed to be a unique constant. If the object

id is omitted from a body object pattern, it is assumed to be a fresh variable. The object-id

is a term: a variable, an atomic constant, or a function symbol followed by a list of terms.

The label can be a variable or a constant and the value can be either a variable, an atomic

constant, or a set value pattern that contains zero or more object patterns. Section 3.2.3

discusses the syntactic restrictions placed on DSL rules.

For example, the following query returns information about SIGMOD conferences held

after 1992.[1]

[1] The information about the unit of measure for year is not present in the query. Specification and integration of the semantics of data is of crucial importance for information integration, but it is an issue orthogonal to the issues we are examining. For more information on semantic interoperability, see [SBGJ+97].


CHAPTER 3. DSL: A LANGUAGE FOR SEMISTRUCTURED DATA 22

(Q1) <cnf(C) sigmod {<f(X) Y Z>}> :-

<D dblp {<A article {<C conference {<N name "SIGMOD"> <year Y> <X Y Z>}>}>}>

AND (Y > 1992)

The head of the query consists of one object pattern, whereas the body of the query is

a conjunction of one or more

1. object patterns, tagged with their originating information source (such as db), and

2. external or built-in predicates applied to variables or constants, such as Y > 1992,

or isatomic(A), where isatomic is a predicate for testing if variable A has an atomic

value.

A DSL program is a collection of DSL rules. DSL queries and DSL view definitions are

DSL programs. If a query (or a view definition) happens to consist of a single rule, then

query (or view definition) will be used interchangeably with rule. Moreover, we will use

view and view definition interchangeably in the remainder of this thesis.

3.2.2 Semantics

DSL rules have minimal model semantics. We illustrate the semantics with the following

example, which is a simplification of (Q1).

(Q2) <cnf(C) sigmod {<f(X) Y Z>}> :-

<D dblp {<A article {<C conference {<N name "SIGMOD"> <X Y Z>}>}>}>

The semantics of the above query are

if there is a tuple of bindings d, a, c, n, x, y and z for the variables D, A, C, N, X, Y,

and Z such that

the data source contains a top-level (root) object with label dblp identified by d,

the d object has an article subobject with object id a,

the a object has a conference subobject with object id c (the object a may also have subobjects other than the c object),

the c object has a name subobject with value "SIGMOD" and object id n, and

the c object has a y subobject with value z and object id x,


Figure 3.1: Result of (Q2) on database of Figure 1.1

then the query result has

a sigmod object whose object id is the term cnf(c), and

the object with object id cnf(c) has a y subobject with value z and object id f(x).

The object cnf(c) may have subobjects other than y, because the result of another rule may "fuse" more subobjects into the object cnf(c).

Note that z could be a subgraph of the data in the source. The answer to the query is

a graph consisting of new objects with fresh, unique object ids and the structure denoted

by the query head. The bindings of the variables “fill in” this explicitly created structure.

The result of applying (Q2) to the database of Figure 1.1 is shown in Figure 3.1.

Formally, for an OEM database D, let PD be the set of all subgraphs2 of D, O be the

set of all object ids in D, and C be the set of all labels and atomic values. Let VO be the

set of all object id variables3 and VC be the set of all other (label and value) variables in

the body of the rule, with VO ∩ VC = ∅. Let V = VO ∪ VC be the set of all variables. The

meaning of the rule body is the set of assignments θ : V → O ∪ C ∪ PD that satisfy all

conditions in the body. Each assignment maps object id variables to O, label variables to

C, and value variables to C ∪ PD.
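For concreteness, the satisfying assignments θ can be computed by structural matching of patterns against the database. The following Python sketch is illustrative only: it represents an OEM object as an (oid, label, value) tuple with set values as lists, variables as capitalized strings, and string constants lowercased (so that the variable convention stays simple); none of these representations are part of DSL itself.

```python
def is_var(t):
    """Variables are capitalized strings (an illustrative convention)."""
    return isinstance(t, str) and t[:1].isupper()

def match(pattern, obj, theta):
    """Yield extensions of assignment `theta` that match `pattern` to `obj`."""
    p_oid, p_label, p_value = pattern
    oid, label, value = obj
    theta = dict(theta)
    for p, v in ((p_oid, oid), (p_label, label)):
        if is_var(p):
            if theta.get(p, v) != v:
                return                        # inconsistent binding
            theta[p] = v
        elif p != v:
            return                            # constant mismatch
    if is_var(p_value):                       # value variable: copy whole value
        if theta.get(p_value, value) != value:
            return
        theta[p_value] = value
        yield theta
    elif isinstance(p_value, list):           # set value pattern: each subpattern
        thetas = [theta]                      # must match some subobject
        for sub_pat in p_value:
            subs = value if isinstance(value, list) else []
            thetas = [t2 for t in thetas for s in subs
                      for t2 in match(sub_pat, s, t)]
        yield from thetas
    elif p_value == value:
        yield theta

# The body of (Q2), with "SIGMOD" lowercased for this sketch:
db = ("d1", "dblp", [("a1", "article",
      [("c1", "conference", [("n1", "name", "sigmod")])])])
pat = ("D", "dblp", [("A", "article",
      [("C", "conference", [("N", "name", "sigmod")])])])
print(next(match(pat, db, {}))["C"])   # c1
```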

The meaning of the rule head is defined as follows. We create and label the new

nodes of the answer graph, by instantiating the object id and label fields of the query

head, and we make the objects resulting from the instantiation of the top-level object pat-

tern of the query the roots of the answer graph. In particular, for each object pattern

<f(X1, . . . , Xm) L V> in the query head, and for each assignment θ above, create a new

object with object id f(θ(X1), . . . , θ(Xm)), label θ(L), and value θ(V). If instead of V, the

object pattern above has {objpattern1, . . . , objpatternn}, the value of the created object is

{θ(objpattern1), . . . , θ(objpatternn)}.

Notice that when two assignments produce the same term as the object id of two objects,

one object is created, and the values of the two objects are “fused,” which means that the

set of outgoing edges from the “fused” object is the union of the sets of outgoing edges from

2 Remember that the value of a set object is the OEM subgraph rooted at that object.
3 Object id variables are variables appearing in the object id field of object patterns in the bodies of rules.


the two objects. Moreover, the incoming edges to any of the two objects will point to the

one fused object.
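The fusion step just described can be sketched as follows: head instantiations that produce the same object-id term are merged into one object whose outgoing edges are the union of the contributed edges. The triple representation below is an illustrative assumption, not the thesis's data structure.

```python
def fuse(instantiations):
    """instantiations: (oid_term, label, child_oid) triples from rule heads.
    Objects with equal oid terms are fused: their edge sets are unioned."""
    objects = {}
    for oid, label, child in instantiations:
        lbl, edges = objects.setdefault(oid, (label, set()))
        assert lbl == label      # an object id determines its label
        edges.add(child)
    return objects

fused = fuse([
    (("cnf", "c1"), "sigmod", ("f", "x1")),
    (("cnf", "c1"), "sigmod", ("f", "x2")),  # same oid term: fused
    (("cnf", "c2"), "sigmod", ("f", "x3")),
])
print(len(fused), sorted(fused[("cnf", "c1")][1]))
# 2 [('f', 'x1'), ('f', 'x2')]
```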

Space of function symbols and object ids Since object ids need to be unique, and

to avoid accidental “fusion” between objects in the query result and the OEM database,

function symbols appearing in the head of a DSL query belong to a different sort4 than

function symbols appearing in the body of the query or in the input database.

A direct consequence of using fresh function symbols in the query result is shown in

Lemma 3.2.1 below. Let us first give the definition of a template of a term.

Definition: For a term or atom T, the template of T, denoted temp(T), is the character
string obtained from T by replacing the i-th variable occurrence in T by the new variable
Vi. For example, temp(f(X, g(X, Y))) = f(V1, g(V2, V3)). □
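The template computation can be sketched directly; representing terms as nested tuples with capitalized-string variables is our illustrative convention, not the thesis's:

```python
def template(term, counter=None):
    """Replace the i-th variable occurrence (left to right) with Vi."""
    counter = counter if counter is not None else [0]
    if isinstance(term, tuple):                        # f(t1, ..., tn)
        return (term[0],) + tuple(template(t, counter) for t in term[1:])
    if isinstance(term, str) and term[:1].isupper():   # a variable occurrence
        counter[0] += 1
        return "V%d" % counter[0]
    return term                                        # an atomic constant

print(template(("f", "X", ("g", "X", "Y"))))
# ('f', 'V1', ('g', 'V2', 'V3'))
```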

Lemma 3.2.1 Two terms T1, T2 appearing in the head of a DSL query can be unified

[GN88] by a satisfying variable assignment θ only if they have the same template.

Proof: If T1 and T2 do not have the same template, then, when parsing them left to

right, we will eventually reach a position where, in one of them we encounter some variable

X, and in the other one we encounter some term, for example (without loss of generality)

f(Y1, . . . , Yn). The variable X can only be unified with f(Y1, . . . , Yn) with a unifier that

maps X to an appropriate f -term. But, as we have explained, the space of function symbols,

such as f , used in the head of the query is disjoint from the space of function symbols in

the database; therefore, a satisfying assignment θ cannot map X to an f-term. □

Safe DSL rules A DSL rule is safe if every variable appearing in the rule head also

appears in the rule body. Thus, the same simple syntactic condition that is used by [Ull88]

to define safety of conjunctive queries can be used to define safety in DSL. In the remainder

of this thesis we consider only safe DSL rules.
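Once the variables of head and body are collected, the safety test is a one-line set comparison. A sketch under the same illustrative term representation (nested tuples, capitalized-string variables, lowercased constants):

```python
def vars_of(t):
    """Collect the variables occurring in a nested-tuple pattern or term."""
    if isinstance(t, tuple):
        return set().union(*map(vars_of, t)) if t else set()
    return {t} if isinstance(t, str) and t[:1].isupper() else set()

def is_safe(head, body):
    """A rule is safe iff every head variable also appears in the body."""
    return vars_of(head) <= vars_of(body)

# (Q2): head variables C, X, Y, Z are all bound in the body, so it is safe.
head = (("cnf", "C"), "sigmod", ((("f", "X"), "Y", "Z"),))
body = ("D", "dblp", (("A", "article", (("C", "conference",
        (("N", "name", "sigmod"), ("X", "Y", "Z"))),)),))
print(is_safe(head, body))                                           # True
print(is_safe((("cnf", "C"), "sigmod", (("W", "Y", "Z"),)), body))   # False
```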

3.2.3 Syntactic restrictions on DSL rules

DSL rules have to obey some simple syntactic restrictions. In particular, cyclical object

conditions are not allowed in the query body, and object id invention is limited to con-

structing DAGs of OEM objects. In what follows, we first describe formally the restriction

4 In other words, they are "picked" from a disjoint set of symbols.


Figure 3.2: Body graph examples. (a) Body graph of (R3); (b) body graph of (R4).

imposed on query bodies and then the restriction on object id invention.

Restricting rule conditions

In order to define the restriction on query bodies, we first define the body graph bgraph(R)

of a rule R.

Definition: For DSL rule R, the body graph of R, denoted bgraph(R), is a labeled graph

(N,E), where N is the set of object ids appearing in the object patterns in the body of R,

each node labeled by the label in its corresponding object pattern, and E is the set of edges

naturally defined by the nesting of object patterns in the body of R. □

Example 3.2.2 The body graph of the following DSL rule (R3) is shown in Figure 3.2(a).

(R3) <bk(T) book {<f(X) Y Z>}> :-

<D dblp {<B book {<T author A> <X Y {<B book Z>}>}>}>
AND <R report {<T author {<N fn F>}>}>

□

We can now state formally the restriction imposed on the bodies of rules: every legal

DSL rule R has an acyclic body graph. Checking acyclicity of bgraph(R) is linear in the size

of R. Given the above restriction, rule (R3) above is not legal DSL. In contrast, rule (R4)

is legal DSL because its body graph is acyclic, as shown in Figure 3.2(b).

(R4) <bk(T) book {<f(X) Y Z>}> :-

<D dblp {<B book {<T author A> <X Y Z>}>}>
AND <R report {<T author {<N fn F>}>}>
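Both acyclicity tests (on bgraph(R) here, and on hgraph(P) in the next subsection) are a standard linear-time depth-first search. A sketch over a plain adjacency map, with the graphs of Figure 3.2 encoded by hand as an assumption of ours:

```python
def is_acyclic(graph):
    """graph: node -> successor list. O(|nodes| + |edges|) DFS cycle check."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(n):
        color[n] = GRAY
        for m in graph.get(n, ()):
            c = color.get(m, WHITE)
            if c == GRAY:                 # back edge: a cycle
                return False
            if c == WHITE and not dfs(m):
                return False
        color[n] = BLACK
        return True
    for n in list(graph):
        if color.get(n, WHITE) == WHITE and not dfs(n):
            return False
    return True

# bgraph(R3): the book object B reappears inside its own value, closing a cycle.
bgraph_r3 = {"D": ["B"], "B": ["T", "X"], "X": ["B"], "R": ["T"], "T": ["N"]}
# bgraph(R4): the same graph without the X -> B edge.
bgraph_r4 = {"D": ["B"], "B": ["T", "X"], "R": ["T"], "T": ["N"]}
print(is_acyclic(bgraph_r3), is_acyclic(bgraph_r4))   # False True
```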


Figure 3.3: Head graph of (P5)

Restricting object id invention

In general, a DSL query constructs answer objects that are restructurings of source data.

Using terms in the object id field of the head object pattern allows semantic object id

invention; the function symbols used are known in the literature as Skolem functions. The

mechanism of inventing object ids using Skolem functions has been proposed in [Mai86;

HY90; KL89]. This object id invention mechanism is very powerful; namely, it allows the

construction of arbitrary graphs in the query output.

The unbridled power of object id invention significantly complicates the composition of

DSL queries, as will be explained in Section 3.3. Moreover, the full power of the object

id invention mechanism is not essential for most applications. That is why DSL imposes

a simple syntactic restriction on invented object ids. In brief, we require that the use of

Skolem terms does not create cycles in the query result; in other words, we require that the

structure created by the query head is acyclic. The formal definition follows. Let us first

give an additional definition to facilitate the rest of the presentation.

Definition: For a DSL program P, the head graph of P, denoted hgraph(P), is a graph
(N,E), where N is the set of templates for the object id terms appearing in the heads of
rules of P, and E is the set of edges naturally defined by the nesting of object patterns in
the heads of rules of P. □

Example 3.2.3 The head graph of the following DSL program (P5) is shown in Figure 3.3.

(P5) <bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>
<f(X) review {<g(A) reviewer C>}> :-

<nyrb {<X review {<A writtenby C>}>}>

□

We can now state formally the restriction on the use of Skolem functions to invent object

ids: For every legal DSL program P, the head graph of P is acyclic. Checking acyclicity of
hgraph(P) is linear in the size of the heads of the rules of P. Given this restriction, we can

show the following:


Figure 3.4: OEM database and result for Example 3.2.5. (a) OEM database D; (b) (Q6)(D).

Theorem 3.2.4 A DSL query constructs answer objects that form a DAG.

Proof: A DSL query Q constructs new objects through object id invention. New edges are

constructed when the query head specifies that a newly constructed object o1 is a subobject

of constructed object o2. A cycle can be created only when newly constructed objects have

the same object id and therefore are fused. The only objects that can be fused are those

whose object ids have the same term templates (from Lemma 3.2.1). Therefore, a cycle in

the constructed answer objects of Q corresponds to a cycle in hgraph(Q). Since hgraph(Q)

is acyclic, the constructed objects in the result of Q cannot have cycles, which proves the

theorem. □

We refer to the result of a DSL query as an answer DAG. Notice that the result of

Q can include cycles, since it can include copied subgraphs from the input database (cf.

Section 3.2.2), as the following example shows.

Example 3.2.5 The result of the following query on the OEM database D of Figure 3.4(a)

is given in Figure 3.4(b).

(Q6) <f(X) new Z> :- <a {<X b Z>}>

□

The generalization of the semantics as defined above to a DSL program (i.e., a collection

of DSL rules) is straightforward.5

3.2.4 Expressive power and complexity

DSL is more expressive than conjunctive queries. In particular, the distinctive features

of DSL, namely function symbols and copying semantics for value variables, give DSL

5 Notice that DSL does not support recursion.


the power of Datalog with limited forms of recursion. Specifically, DSL queries can be

expressed with linear Datalog programs, that is, Datalog with only linear recursion [AHV95;

Ull89], as the following theorem proves.

Theorem 3.2.6 DSL is strictly less expressive than StruQL [FFLS97] and linear Datalog.

Proof: Let L-DATALOG be the set of queries expressible by linear Datalog pro-

grams, and let TC-DATALOG be the set of queries expressible by Datalog programs

with only transitive closure recursive rules. StruQL is a semistructured language described

in [FFLS97], where it is also stated that it is equal to TC-DATALOG. It is shown in

[CM90] that TC-DATALOG = L-DATALOG. StruQL allows regular path expressions in

the query body. Regular path expressions are not expressible in DSL. To see this, notice

that DSL rule bodies can only enforce conditions on objects appearing at a fixed depth in

the input database, whereas, using regular path expressions, conditions can be applied to

objects at arbitrary depth. On the other hand, StruQL includes all the constructs found in

DSL, including, importantly, Skolem functions for defining object ids. □

We know from [CM90] that L-DATALOG ⊂ QNLOGSPACE. An immediate conse-

quence of the theorem above is the following:

Corollary 3.2.7 DSL queries are in QNLOGSPACE.

3.2.5 Normal forms of DSL rules

To simplify the presentation of query composition in Section 3.3 and query rewriting in the

next chapter, we define normal form rules and full normal form programs. Every DSL rule

and program can be easily converted into normal form or full normal form, hence the focus

on normal forms does not limit the power of the language. First, let us define single path

object conditions:

Definition: A single path object condition is an object condition in which all the set-valued
value fields contain at most one object pattern. □

We next define the notion of correspondence between a complete path and a single path

object condition in a rule or body graph.6 We make the definition through the following

example.

6 A complete path is a path from a root to a leaf in the graph.


Example 3.2.8 Let us consider again DSL rule (R4), whose body graph is shown in Fig-

ure 3.2(b). The path

(D, dblp) → (B, book) → (X, Y)

corresponds to object condition

<D dblp {<B book {<X Y Z>}>}>

in the body of (R4) and vice versa.

Let us also consider again (P5), whose head graph appears in Figure 3.3. If we consider

the first rule of this program separately, its head graph contains only the complete path

bk(V) → f(V). This path corresponds to the object condition

<bk(T) book {<f(X) Y Z>}>

in the head of the rule. The path bk(V) → f(V) → g(V) in the head graph of (P5)

corresponds to the object condition

<bk(T) book {<f(X) review {<g(A) reviewer C>}>}>

Notice that in this case the object condition does not appear in the head of any of the rules

of (P5); instead, it is created by unifying an object condition from the first rule of (P5)

with an object condition from the second rule. The intuition is that the path

bk(V) → f(V) → g(V)

is created through fusion of objects returned by the first rule of (P5) with objects returned

by the second rule. □

Finally, let us define the notion of correspondence between a complete path in a head

graph and a rule, again through an example.

Example 3.2.9 Let us consider the first rule of (P5), whose head graph contains only one

complete path, bk(V) → f(V). The rule corresponding to this path is

<bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>


If we consider now both rules of (P5), as explained in Example 3.2.8, the path

bk(V) → f(V) → g(V)

corresponds to the object condition

<bk(T) book {<f(X) review {<g(A) reviewer C>}>}>

which is created by appropriately unifying object conditions in the heads of the two rules.

The rule corresponding to this path is

<bk(T) book {<f(X) review {<g(A) reviewer C>}>}> :-

<D dblp {<B book {<title T>}>}> AND <D dblp {<B book {<X Y Z>}>}>
AND <nyrb {<X review {<A writtenby C>}>}>

that is created as follows:

• The head of the rule is the object condition corresponding to the path.

• The body of the rule is the conjunction of the bodies of the rules whose heads “con-

tribute” to the creation of the object condition that corresponds to the path.7

□

Definition: Normal form DSL rules are DSL rules whose body is a conjunction of

single path object conditions. Additionally, a normal form rule with just one condition in

its body is called a single path rule. □

The query (Q2) can be easily transformed into the following normal form query:

(Q7) <cnf(C) sigmod {<f(X) Y Z>}> :-

<D dblp {<A article {<C conference {<N name "SIGMOD">}>}>}>
AND <D dblp {<A article {<C conference {<X Y Z>}>}>}>

Normalization makes all paths present in the body graph of a rule into separate object

conditions. In particular, to transform a query Q into normal form, it suffices to replace

the query body of Q with a conjunction of the single path object conditions corresponding

to the complete paths of bgraph(Q).
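The transformation can be sketched directly on nested object patterns: each complete path of the body graph becomes one single path condition. The nested-tuple encoding below (with string constants lowercased) is our illustrative assumption:

```python
def single_paths(pattern):
    """Split a nested object pattern into one single-path condition per
    complete (root-to-leaf) path."""
    oid, label, value = pattern
    if not isinstance(value, tuple) or not value:   # atomic value or variable
        return [pattern]                            # -> leaf
    return [(oid, label, (p,)) for sub in value for p in single_paths(sub)]

# The body of (Q2), with "SIGMOD" lowercased for this sketch:
q2_body = ("D", "dblp", (("A", "article", (("C", "conference",
           (("N", "name", "sigmod"), ("X", "Y", "Z"))),)),))
for cond in single_paths(q2_body):
    print(cond)
# Two conditions, matching the two conditions of (Q7).
```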

7 Variables in the bodies may need to be appropriately renamed to avoid accidental variable equalization.


Definition: A full normal form DSL program is a DSL program P consisting of normal

form DSL rules, such that

• All set-valued value fields in the heads of the rules contain at most one object pattern,

and

• Every path present in the head graph hgraph(P ) of the program is present in the

head of some rule of P .

□

The following example explains in detail the translation of a DSL program into full

normal form.

Example 3.2.10 The DSL program (P5), which is repeated below, can be transformed
into the full normal form program (P8), as follows.

• Consider the first rule of (P5) alone. Its head graph has only one complete path,

bk(V) → f(V), which corresponds to rule

<bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>

therefore that rule (which is the first rule in (P5)) is added to (P8).

• Consider the second rule of (P5) alone and similarly add it to (P8).

• Consider both rules of (P5) together. The head graph again has only one complete

path, bk(V) → f(V) → g(V) that, as explained in Example 3.2.9, corresponds to rule

<bk(T) book {<f(X) review {<g(A) reviewer C>}>}> :-

<D dblp {<B book {<title T>}>}>
AND <D dblp {<B book {<X Y Z>}>}>
AND <nyrb {<X review {<A writtenby C>}>}>

We add this rule to (P8).

(P5) <bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>
<f(X) review {<g(A) reviewer C>}> :-

<nyrb {<X review {<A writtenby C>}>}>


(P8) <bk(T) book {<f(X) Y Z>}> :-

<D dblp {<B book {<title T>}>}>
AND <D dblp {<B book {<X Y Z>}>}>

<f(X) review {<g(A) reviewer C>}> :-

<nyrb {<X review {<A writtenby C>}>}>
<bk(T) book {<f(X) review {<g(A) reviewer C>}>}> :-

<D dblp {<B book {<title T>}>}>
AND <D dblp {<B book {<X Y Z>}>}>
AND <nyrb {<X review {<A writtenby C>}>}>

In the case of programs consisting of n rules, all subsets of rules of size k = 1, . . . , n need

to be considered.8 □
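The subset enumeration in this naive full-normalization procedure is straightforward; a sketch (the head-graph path computation for each subset is elided):

```python
from itertools import combinations

def rule_subsets(rules):
    """Yield every nonempty subset of the program's rules, smallest first."""
    for k in range(1, len(rules) + 1):
        yield from combinations(rules, k)

print(list(rule_subsets(["r1", "r2"])))
# [('r1',), ('r2',), ('r1', 'r2')]
```

For an n-rule program this enumerates 2^n − 1 subsets, which is the source of the exponential cost of the naive procedure.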

Because of object fusion, databases that are answers to DSL queries can contain paths

that are not explicitly present in the rule heads of the program (but are present in the head

graph). Full normalization makes these paths explicit by adding rules to the program that

create these paths. Therefore, full normalization extends normalization of rules to programs,

by making all paths in the head graph of a program into separate object conditions.

3.3 Query composition for DSL

The composition of DSL queries Q and V is a query Qc = V ◦ Q, such that for any OEM

database D, Qc(D) = Q(V (D)). Query composition is accomplished by resolving each

condition in the body of Q with the head of V in all possible ways, using unification (which

generalizes [GN88; Ull88]).9 Query composition is easily generalized to multiple queries

V1, . . . , Vn.
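Before the examples, the unification step itself can be sketched as a standard syntactic unifier over nested-tuple terms with capitalized-string variables (our illustrative representation). DSL's actual unification additionally handles set values, which is how multiple most general unifiers arise; that part, and the occurs check, are omitted here:

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def unify(s, t, theta=None):
    """Return a substitution unifying s and t, or None (no occurs check)."""
    theta = dict(theta or {})
    def walk(x):
        while is_var(x) and x in theta:
            x = theta[x]
        return x
    def go(a, b):
        a, b = walk(a), walk(b)
        if a == b:
            return True
        if is_var(a):
            theta[a] = b
            return True
        if is_var(b):
            theta[b] = a
            return True
        if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
            return all(go(x, y) for x, y in zip(a, b))
        return False
    return theta if go(s, t) else None

# Unifying two schematic object-pattern terms:
print(unify(("p", "A", "V"), ("p", ("i", "A2"), "V2")))
# {'A': ('i', 'A2'), 'V': 'V2'}
```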

Let us look at the following two detailed examples:

Example 3.3.1 Let us consider the following query:

(Q9) <f(P) ans {<g(D) m V>}> :- <P p {<A l V>}> AND <Q q {<D m V>}>

8 A simple optimization of the naive full normalization algorithm would consider only subsets of rules whose head graph includes complete paths not considered already.

9 Unification and resolution for DSL are the same as for MSL and are presented formally in Section 5.4.2 of [Pap97].


and two views

(V1) <h(A′,B) p {<i(A′) l V′> <j(B) Y "abc">}> :-

<X label1 {<A′ label2 V′> <B Y W> <C label3 T>}>
(V2) <k(F) L E> :- <G l {<F L E>}>

The reduction of (V1) to full normal form gives

(V3) <h(A′,B) p {<i(A′) l V′> }> :-

<X label1 {<A′ label2 V′>}> AND <X label1 {<B Y W>}>
AND <X label1 {<C label3 T>}>

<h(A′,B) p {<j(B) Y "abc">}> :-

<X label1 {<A′ label2 V′>}> AND <X label1 {<B Y W>}>
AND <X label1 {<C label3 T>}>

There exist two unifiers for the first condition of (Q9) and the heads of (V3):

θ1 = [P ↦ h(A′, B), A ↦ i(A′), V ↦ V′]

θ2 = [P ↦ h(A′, B), A ↦ j(B), Y ↦ l, V ↦ "abc"]

There exists one unifier for the first condition of (Q9) and the head of (V2):

θ3 = [P ↦ k(F), L ↦ p, E ↦ {<A l V>}]

The existence of three unifiers means that the result of resolving the first condition of

(Q9) with the views gives three DSL queries:

(Q10) <f(h(A′,B)) ans {<g(D) m V′>}> :-

<X label1 {<A′ label2 V′>}> AND <X label1 {<B Y W>}>
AND <X label1 {<C label3 T>}>
AND <Q q {<D m V′>}>

(Q11) <f(h(A′,B)) ans {<g(D) m "abc">}> :-

<X label1 {<A′ label2 V′>}> AND <X label1 {<B l W>}>
AND <X label1 {<C label3 T>}>
AND <Q q {<D m "abc">}>

(Q12) <k(F) ans {<g(D) m V>}> :-

<G l {<F p {<A l V>}>}> AND <Q q {<D m V>}>

Page 47: QUERYING AUTONOMOUS, HETEROGENEOUS INFORMATION …people.stern.nyu.edu/vassalos/defense/thesis.pdf · 1.1 Information integration and the challenges of autonomy and heterogeneity

CHAPTER 3. DSL: A LANGUAGE FOR SEMISTRUCTURED DATA 34

For the second condition of each of (Q10), (Q11), and (Q12), there exists one unifier with (V2):

θ4 = [Q ↦ k(F), L ↦ q, E ↦ {<D m V′>}]

θ5 = [Q ↦ k(F), L ↦ q, E ↦ {<D m "abc">}]

θ6 = [Q ↦ k(F), L ↦ q, E ↦ {<D m V>}]

respectively. Therefore, Qc consists of three DSL rules:

(Q13) <f(h(A′,B)) ans {<g(D) m V′>}> :-

<X label1 {<A′ label2 V′> <B Y W> <C label3 T>}>
AND <G l {<F q {<D m V′>}>}>

(Q14) <f(h(A′,B)) ans {<g(D) m "abc">}> :-

<X label1 {<A′ label2 V′> <B l W> <C label3 T>}>
AND <G l {<F q {<D m "abc">}>}>

(Q15) <k(F) ans {<g(D) m V>}> :-

<G l {<F p {<A l V>}>}> AND <G l {<F q {<D m V>}>}>

Of those, (Q14) is subsumed by (Q13), in that obviously every object constructed by

(Q14) is also constructed by (Q13). So Qc consists of (Q13) and (Q15). □

Example 3.3.2 Consider the following view definition:

(V4.1) <trep(RN1) tr {<TID title T>}> :-

<Ro1 r {<RNo1 rn RN1> <TID title T>}>
(V4.2) <tp(FN,LN) pr {<n(FN,LN) name N>

<w(FN,LN) work {<trep(A) tr {<Sid subject S>}>}>}> :-

<P person {<FN1 fn FN> <LN1 ln LN>

<A article {<Sid subject S>}>}>
AND mergename(N,FN,LN)

View rules (V4.1) and (V4.2) contribute information to tr and pr objects. The full

normal form reduction of the view is given below:

(V5.1) <trep(RN1) tr {<TID title T>}> :-

<Ro1 r {<RNo1 rn RN1>}> AND <Ro1 r {<TID title T>}>
(V5.2) <tp(FN,LN) pr {<n(FN,LN) name N>}> :-


<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}>
AND <P person {<A article {<Sid subject S>}>}>
AND mergename(N,FN,LN)

(V5.3) <tp(FN,LN) pr {<w(FN,LN) work {<trep(A) tr {<Sid subject S>}>}>}> :-

<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}>
AND <P person {<A article {<Sid subject S>}>}>
AND mergename(N,FN,LN)

(V5.4) <tp(FN,LN) pr {<w(FN,LN) work {<trep(A) tr {<TID title T>}>}>}> :-

<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}>
AND <P person {<A article {<Sid subject S>}>}>
AND mergename(N,FN,LN)

AND <Ro1 r {<RNo1 rn A> <TID title T>}>

Let us now consider the query (Q16) that asks for all titles of John Smith’s reports.

(Q16) <f(P1,TID1) titles {<g(TID1) title T1>}> :-

<P1 pr {<N1 name "John Smith">}> AND

<P1 pr {<W work {<TR1 tr {<TID1 title T1>}>}>}>

The following unifier is produced for the name condition of (Q16) and the head of (V5.2):

θ1 = [P1 ↦ tp(FN, LN), N1 ↦ n(FN, LN), N ↦ "John Smith"]

Applying θ1 to the query and the view, we produce

(Q17) <f(tp(FN,LN),TID1) titles {<TID1 title T1>}> :-

<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}>
AND <P person {<A article {<Sid subject S>}>}>
AND mergename("John Smith",FN,LN)

AND <tp(FN,LN) pr {<W work {<TR1 tr {<TID1 title T1>}>}>}>

The following unifier is produced for the second source condition (on pr) of (Q17) and

the head of (V5.4):

θ2 = [FN ↦ FN, LN ↦ LN, W ↦ w(FN, LN), TR1 ↦ trep(A), TID1 ↦ TID, T1 ↦ T]


After renaming the variables in (V5.4) and applying θ2 to the query and the view, we

produce:

(Q18) <f(tp(FN,LN),TID) titles {<TID title T>}> :-

<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}>
AND <P person {<A article {<Sid subject S>}>}>
AND mergename("John Smith",FN,LN)

AND <P′ person {<FN1′ fn FN>}> AND <P′ person {<LN1′ ln LN>}>
AND <P′ person {<A′ article {<Sid′ subject S′>}>}>
AND mergename(N,FN,LN)

AND <Ro1 r {<RNo1 rn A> <TID title T>}>

Removing obviously redundant conditions gives:

(Q19) <f(tp(FN,LN),TID) titles {<TID title T>}> :-

<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}>
AND <P person {<A article {<Sid subject S>}>}>
AND mergename("John Smith",FN,LN)

AND <Ro1 r {<RNo1 rn A> <TID title T>}>

□

Notice that in DSL there can be multiple mgus, or most general unifiers. The practical
consequence is that the result of V ◦ Q can be a DSL program consisting of multiple rules.

In particular, Qc could consist of an exponential number of rules (each of at most polynomial

length). This observation gives us the following theorem.

Theorem 3.3.3 (Composition Complexity) Query composition in DSL is in EXPTIME.

Notice that the order of resolving query conditions with view heads does not matter. Also

notice that query composition “implements” view dereferencing: it transforms a query that

refers to the object patterns in a view head to a query that refers to the objects in the

information source(s) that the view is defined over.


Query composition for MSL

A query composition algorithm is presented in Chapter 5 of [Pap97] as the Query Decom-
position algorithm for MSL, which takes as input a query over views and the view definitions

over some information sources, and produces a datamerge program over the information

sources that is equivalent to the original query. The algorithm as presented is incorrect:

it misses possible unifiers, as MSL programs are not in full normal form. In particular,

unifier θ2 in Example 3.3.2 above would be missed. Moreover, MSL programs can construct

cyclical answer graphs, which also defeat the unification algorithm (and consequently the

composition algorithm).

3.4 Equivalence of DSL queries

Two queries Q1, Q2 are equivalent if and only if for all OEM databases D, their results

Q1(D) and Q2(D) are equivalent. In this section, we develop a compile-time test of equiv-

alence of DSL queries, based on a simple extension of containment mappings [CM77].

Deciding query equivalence of DSL queries syntactically is complicated by the distin-

guishing characteristics of a semistructured language, like the restructuring capabilities

of the query language, support for value and label variables, and semantic object id in-

vention. The fact that semantic object ids complicate query processing and optimiza-

tion was observed already in [HY90] for intensional queries in the relational model. Be-

cause of DSL’s restrictions in object id invention, presented in Section 3.2.3, we are able

to come up with a simple query equivalence algorithm that is based on extending the

machinery used in the equivalence test for unions of conjunctive queries [Ull89; CM77;

SY80]. Namely, the algorithm is based on discovering containment mappings between

queries. We extend mappings and containment mappings in Section 3.4.1.

Moreover, object identity introduces key dependencies from the object id to the label and

value. The chase technique [Ull89] is used to deal with these dependencies. The technique

is extended to deal with the case of value variables that can bind to sets of object patterns.

In Section 3.4.2, we present our extension to the chase for the case of key dependencies on

object ids. The extension applies to any functional dependency with value variables on the

right hand side.

In order to decide the equivalence of chased DSL queries, we need to make sure that

all the components (i.e., nodes and edges) of the result graphs are the same. To do that,


we decompose a normal form DSL query into graph component queries that correspond to

the components of the result graph: edges, nodes, and root (i.e., top-level) objects.10 In

particular, every DSL query Q is decomposed into three types of finer-grain rules:

• one top rule corresponding to the top level condition of the head of Q (this query

corresponds to the root of the OEM graph constructed by the head of Q)

• as many member rules as there are object-subobject relationships in the head of Q

(these queries correspond to the edges of the OEM graph constructed by the head of

Q) and

• as many object rules as there are object conditions in the query head of Q (corresponding to the objects of the OEM graph constructed by the head of Q and describing their labels and values).

The decomposition is illustrated by the following example. Note that top, member and

object rule heads depart from DSL syntax: they are simply relational predicates.

Example 3.4.1 Consider the query (Q20); it consists of two rules, and in order to make

the graph component query decomposition clearer, we have let both queries have the same

body. Notice that the first rule of (Q20) creates an l object for every a object and an n′

object for every c object, while the second rule creates an l′ object for every a and an n

object for every c.

(Q20) <l(X) l {<f(Y) m {<n′(Z) n′ V>}>}> :- <X a {<Y b {<Z c V>}>}>
      <l′(X) l′ {<f(Y) m {<n(Z) n V>}>}> :- <X a {<Y b {<Z c V>}>}>

The following rules are the graph component queries of (Q20).

top(l(X)) :- <X a {<Y b {<Z c V>}>}>
top(l′(X)) :- <X a {<Y b {<Z c V>}>}>
member(l(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(l′(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n′(Z)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n(Z)) :- <X a {<Y b {<Z c V>}>}>

10Recall that OEM graphs are rooted.


object(l(X),l,set) :- <X a {<Y b {<Z c V>}>}>
object(l′(X),l′,set) :- <X a {<Y b {<Z c V>}>}>
object(f(Y),m,set) :- <X a {<Y b {<Z c V>}>}>
object(n′(Z),n′,V) :- <X a {<Y b {<Z c V>}>}>
object(n(Z),n,V) :- <X a {<Y b {<Z c V>}>}>

2

Decomposition into graph component queries takes time linear in the size of the heads of the DSL rules.
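The decomposition described above can be sketched in a few lines. The following Python fragment is an illustrative sketch, not the thesis’s implementation: the tuple representation of object patterns (oid, label, value), where a set value is a list of nested patterns, and all function names are assumptions made for the example.

```python
def decompose(head, body):
    """Decompose one DSL rule into top, member, and object component rules.

    Returns three lists of (predicate, args, body) triples, mirroring the
    top/member/object rules of the graph component decomposition.
    """
    top, member, obj = [], [], []

    def visit(pattern, is_root):
        oid, label, value = pattern
        if is_root:
            top.append(("top", (oid,), body))        # root of the result graph
        if isinstance(value, list):                   # set value: nested subobjects
            obj.append(("object", (oid, label, "set"), body))
            for sub in value:
                member.append(("member", (oid, sub[0]), body))  # one edge per subobject
                visit(sub, False)
        else:                                         # atomic value
            obj.append(("object", (oid, label, value), body))

    visit(head, True)
    return top, member, obj
```

Applied to the head of the first rule of (Q20), the sketch yields one top rule, two member rules, and three object rules, matching the decomposition shown in Example 3.4.1.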

3.4.1 Mappings and containment mappings

In this section, we define mappings and containment mappings for DSL rules and graph

component queries.

Definition: A mapping h assigns to each variable appearing in a DSL rule either a variable, a constant, a term, or a set of object patterns. It also maps each function symbol appearing in the head of a rule to a function symbol. The mapping extends naturally to terms and object patterns, with h being the identity on all constants, predicate symbols, and function symbols appearing in the body of a rule. 2

Definition: Valid Application of a Mapping on an OEM Object Pattern The

result of applying a mapping h on an OEM object pattern is a pattern where every variable

V is replaced by h(V ) and every function symbol f is replaced by h(f). The mapping is

applicable to the object pattern if (i) the resulting pattern has valid OEM syntax, i.e., set

patterns do not appear in object-id or label positions, and (ii) it is compatible with key

dependencies imposed by the object ids. 2

Definition: Containment mapping Let R1, R2 be two graph component queries or

normal form DSL rules:

R1 : H : − < object pattern1 > AND . . . AND < object patternn >

R2 : I : − < object pattern′1 > AND . . . AND < object pattern′k >

A mapping h is said to be a containment mapping if h turns R2 into R1; that is if

h(I) = H, and for every i there exists a j such that


h(< object pattern′i >) =< object patternj >

2
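As a sketch of how such mappings can be discovered, the following Python fragment searches for a containment mapping between two conjunctive bodies by backtracking. It is a simplified illustration under two assumptions not made in the text: atoms are flat tuples of terms, and variables are exactly the terms starting with an uppercase letter; set patterns and Skolem terms are not handled.

```python
def find_containment_mapping(from_atoms, to_atoms):
    """Search for a mapping h with h(atom) in to_atoms for every atom
    of from_atoms; return h as a dict, or None if no mapping exists."""

    def unify(atom, target, h):
        # Try to extend h so that h(atom) == target; return the extension or None.
        if len(atom) != len(target):
            return None
        h = dict(h)
        for t, u in zip(atom, target):
            if t[0].isupper():                 # variable: bind it, or check the binding
                if h.setdefault(t, u) != u:
                    return None
            elif t != u:                       # constant: must match exactly
                return None
        return h

    def search(i, h):
        # Backtracking over the choice of target atom for each source atom.
        if i == len(from_atoms):
            return h
        for target in to_atoms:
            h2 = unify(from_atoms[i], target, h)
            if h2 is not None:
                result = search(i + 1, h2)
                if result is not None:
                    return result
        return None

    return search(0, {})
```

Running the test in both directions, as Theorem 3.4.5 later requires, gives the two-way check for equivalence of graph component queries.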

3.4.2 Extending the chase for set variables

Object identity introduces a functional dependency in OEM: a key dependency from the

object id to the label and value. Moreover, structural constraints, such as those described

by a DTD [Gol90], introduce functional dependencies, as we will see in Section 4.3. We use

the chase technique [Ull88] to deal with these dependencies. The technique is extended for

the case of value variables, which can bind to sets of object patterns. In the rest of the

section, we present our extension to the chase for the case of key dependencies on object id.

The extension applies in general to any functional dependency with value variables in the

right hand side. Recall that DSL queries are not allowed to contain cyclic object patterns.

This is necessary for the described simple extension to the chase to terminate.

Chase extension for dependency on object id Let o1, o2 be object patterns in the

body of a query q with the same term in the object id field.

• If o1 and o2 have L1, V1 and L2, V2 in their label and value field respectively, then we

replace all occurrences of L2, V2 in q with L1, V1 respectively.

• If o1 has object patterns {oi, . . . , oj} in its value field and o2 has V2, then replace all

occurrences of V2 in q with {<X Y Z>}, where X, Y, Z are variables not appearing in

q.

• If o1 has {oi, . . . , oj} in its value field and o2 has {ck, . . . , cm}, replace the value fields

of both o1 and o2 with {oi, . . . , oj , ck, . . . , cm}.

• If one of o1, o2 has a constant in one of the fields, and the other has a variable, replace

all occurrences of that variable in q with the constant.

• If both o1 and o2 have constants in one of the fields, then, if the constants are different,

halt with an error (this query cannot be chased to an equivalent query satisfying the

object id key dependency). If the constants are the same, do nothing for this field.

• If o2 is identical to o1, drop o2 from q.
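A minimal sketch of the atomic-field cases among the rules above (unifying labels and values of two patterns with the same object id) might look as follows in Python. The (oid, label, value) tuple representation and the uppercase-variable convention are assumptions made for the example, and set-valued fields are omitted.

```python
def merge_on_oid(p1, p2, subst):
    """Apply the object-id key dependency to two patterns (oid, label, value)
    that share the same object id term: unify the label and value fields,
    recording variable substitutions in subst. Returns False when two
    distinct constants clash, i.e., the chase halts with an error."""
    for a, b in zip(p1[1:], p2[1:]):           # label field, then value field
        a, b = subst.get(a, a), subst.get(b, b)
        if a == b:
            continue
        if a[0].isupper():                      # a is a variable: replace it by b
            subst[a] = b
        elif b[0].isupper():                    # b is a variable: replace it by a
            subst[b] = a
        else:                                   # two different constants: chase fails
            return False
    return True
```

A full chase would apply this step repeatedly, together with the set-variable rules, until no pair of patterns with the same object id remains distinct.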


In order to “chase” functional dependencies that do not involve value variables, we

can use the “regular” chase rule [Ull88]. It is easy to see that our extension to the chase

terminates. It is also easy to see that all terminal chasing sequences [AHV95] are the same.11

Example 3.4.2 Consider query (Q21):

(Q21) <f(P) stan student V> :- <P person {<U university stanford>}>
      AND <P person V>

Using the key dependency on object ids, we infer that V is a set variable and thus chase (Q21) into (Q22). Notice how the value variable is transformed into a set pattern by the chase.

(Q22) <f(P) stan student {<X Y Z>}> :-

<P person {<U university stanford>}> AND <P person {<X Y Z>}>

2

Notice also that all OEM databases satisfy the object id functional dependency by

definition. That means that the following theorem holds.

Theorem 3.4.3 Let Q be a DSL query and chase(Q) be a terminal chasing sequence

of Q. Also let the only dependency be the functional dependency on object id. Then

Q ≡ chase(Q).

A direct consequence of Theorem 3.4.3 is the following.

Corollary 3.4.4 In the presence of only object id dependencies, two DSL queries are equiv-

alent if and only if their chased counterparts are equivalent.12

The next subsection presents the syntactic test for DSL query equivalence.

11A terminal chasing sequence is the query that results from successive applications of the chase rules until no more applications are possible.

12Theorem 3.4.3 and Corollary 3.4.4 also hold in the presence of arbitrary functional dependencies. The proof is analogous to the relational case ([AHV95], pp. 173-177).


3.4.3 Deciding DSL query equivalence

Before presenting the condition for equivalence of DSL queries, let us discuss equivalence of

graph component queries. The condition for equivalence for each of the three types of graph

component queries is a simple generalization of the condition for equivalence of relational

conjunctive queries with object ids [CM77; HY91].13

Theorem 3.4.5 Two (top, member, object) queries Q1 and Q2 are equivalent if and only

if there exists a containment mapping from Q1 to Q2 and another containment mapping

from Q2 to Q1.

The condition for equivalence of sets of graph component queries is then easily derived:

Theorem 3.4.6 Two sets S1 = {P1, . . . , Pn} and S2 = {T1, . . . , Tm} of graph component

queries are equivalent if and only if for each Pi there exists an equivalent Tj , and for each

Ti there exists an equivalent Pj .
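Given any pairwise equivalence test (such as the two-way containment-mapping test of Theorem 3.4.5), the condition of Theorem 3.4.6 is a mutual-coverage check over the two sets. A minimal Python sketch, with the pairwise test passed in as a parameter; the function name is an assumption:

```python
def sets_equivalent(s1, s2, equivalent):
    """Theorem 3.4.6 as a check: every query in s1 has an equivalent
    query in s2, and vice versa. `equivalent` is any pairwise test."""
    return (all(any(equivalent(p, t) for t in s2) for p in s1)
            and all(any(equivalent(t, p) for p in s1) for t in s2))
```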

Theorem 3.4.6 is a generalization of the containment theorem for unions of relational conjunctive queries [SY80; HY91]. The proof is analogous. The condition for DSL equivalence then follows:

Theorem 3.4.7 (DSL query equivalence) Two DSL queries are equivalent if and only

if their decompositions into graph component queries are equivalent.

Proof: The proof for the ONLY IF direction is straightforward. For the IF direction, if the

graph component decompositions are equivalent, then for every OEM database D, the

result databases will have the same objects (because of the equivalence of the object rules)

and the same root objects (because of the equivalence of the top rules). They will also

have the same member tuples, which means that they will have the same object-subobject

relationships. This concludes the proof. 2

Example 3.4.8 Consider again query (Q20) and query (Q23) given below. Notice that

(Q20) and (Q23) have the same query body. Also notice that query (Q23), though it has

different path conditions in its rule heads from query (Q20), creates exactly the same result.

The intuition is that the different query heads create different “parts” of the same answer

graph for (Q20) and (Q23).

13Our notion of query equivalence is a variant of the notion of exposed equivalence in [HY91].


(Q23) <l(X) l {<f(Y) m {<n(Z) n V>}>}> :- <X a {<Y b {<Z c V>}>}>
      <l′(X) l′ {<f(Y) m {<n′(Z) n′ V>}>}> :- <X a {<Y b {<Z c V>}>}>

The graph component queries for (Q23) are shown below:

top(l(X)) :- <X a {<Y b {<Z c V>}>}>
top(l′(X)) :- <X a {<Y b {<Z c V>}>}>
member(l(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(l′(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n′(Z)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n(Z)) :- <X a {<Y b {<Z c V>}>}>
object(l(X),l,set) :- <X a {<Y b {<Z c V>}>}>
object(l′(X),l′,set) :- <X a {<Y b {<Z c V>}>}>
object(f(Y),m,set) :- <X a {<Y b {<Z c V>}>}>
object(n′(Z),n′,V) :- <X a {<Y b {<Z c V>}>}>
object(n(Z),n,V) :- <X a {<Y b {<Z c V>}>}>

It is obvious that (Q20) ≡ (Q23). 2

Example 3.4.9 Consider the following two queries:

(Q24) <f(P) stan student {<X Y Z>}> :-

<P person {<U university stanford>}> AND <P person {<X Y Z>}>

(Q25) <f(P) stan student {<X Y Z>}> :-

<P person {<X Y Z′>}> AND <P person {<X Y′ Z>}>
AND <P person {<U university Z′′>}>
AND <P person {<U Y′′ stanford>}>

Notice that unless we chase (Q25), there is no mapping from the body of the query (Q24)

to the body of (Q25). By chasing (Q25), we infer that Y ≡ Y′, Z ≡ Z′, Y′′ ≡ university,

and Z′′ ≡ stanford, which turns (Q25) into (Q26):

(Q26) <f(P) stan student {<X Y Z>}> :- <P person {<X Y Z>}> AND

<P person {<U university stanford>}>

Queries (Q24) and (Q26) are obviously equivalent. 2


3.5 DSL and other semistructured languages

DSL is a variant of the Mediator Specification Language (MSL) [PGMU96]. MSL supports

recursion and does not impose any restrictions on the query conditions or the invented object

ids. The result is that the query equivalence and composition problems for MSL become

unnecessarily complex: there are no known composition and query equivalence algorithms

for MSL.

StruQL [FFLS97] is a logic-based semistructured language very similar to DSL and

MSL. As mentioned in Section 3.2.4, its distinguishing characteristic compared to DSL is

support for regular path expressions in the query body. The query composition problem

does not have a general solution for StruQL, i.e., for StruQL rules Q, V, there does not

always exist a StruQL query Qc = V ◦Q, such that for any database D, Qc(D) = Q(V (D))

[FFLS97]. The containment and equivalence problems for a subset of StruQL14 are studied

in [FLS98].

XML-QL [DFF+] is a query language for XML that uses a syntax inspired by SQL and Lorel and whose semantics follow those of StruQL and MSL.

Lorel [AQM+97] is an OQL-based semistructured language that is the query language of Lore [MAG+97], the first semistructured database management system, which was developed at Stanford. Lorel uses OEM as its data model. Optimization techniques for Lorel are described in [MW99].

XMAS is a semistructured language for XML that has been proposed by the MIX

project [LPV00]. XMAS does not support the notion of object identity, and can perform

restructuring through a groupby operator. Comparing the restructuring power of an explicit groupby operator with that of object id invention, it was already observed in [AK89] that object-id-based set formation can replace explicit grouping operators.

The reverse holds in a limited fashion: Suciu and Vianu show in [MSV00] that the explicit

groupby operator in XMAS has the same restructuring power as Skolem-based object id

invention in DSL and XML-QL, under some additional constraint on the form of invented

object ids.

The restructuring capabilities of XMAS are a subset of those of DSL, as XMAS can

only construct trees in the query output, whereas DSL can construct DAGs.

Quilt [CRF00] is a proposed XML query language that incorporates the best features

14The subset considered is conjunctive StruQL without restructuring capabilities in the head.


of other XML and semistructured query languages, including the ones mentioned above.

Quilt includes all important constructs for querying XML data and returning XML results,

except a grouping operator or Skolem functions for object invention. Instead, complex XML

results can be constructed using nested queries.

3.6 Conclusion

DSL has features essential for querying and integrating semistructured data, namely the

ability to query and copy arbitrarily nested, schemaless data, the ability to restructure such

data through the use of semantic object ids, and the ability to query the “structure” of

the data through the use of label variables. These features make it a strong choice for

information integration, as well as a good basis for an XML query language.

For a semistructured language to be useful, query optimization techniques must be

available. DSL allows flattening of nested queries through query composition. In the next

chapter, an algorithm for query rewriting using views is presented for DSL that uses the algorithms for query composition and query equivalence developed in this chapter.


Chapter 4

Query Rewriting for Semistructured Data

4.1 Introduction

As we explained in Chapter 1, the capability-based rewriting problem and the related problem of query rewriting using views are at the core of integration systems. Moreover, as described in Chapter 1, integration systems benefit by using a semistructured data model and a semistructured query language. In this chapter, we present a query rewriting algorithm for DSL that is the first rewriting algorithm for a semistructured language. The algorithm builds on the theory developed in the previous chapter. The rewriting algorithm is at the core of the CBR module of a mediator, as shown in Figure 1.6. Moreover, it has various applications beyond information integration:

Rewriting in semistructured repositories A rewriting algorithm can be used to answer queries using materialized views and cached queries of repositories for semistructured data, such as Lore [MAG+97].

For example, if a cached query result contains all “SIGMOD” publications, our rewriting algorithm can create a rewriting query where “SIGMOD 97” publications are obtained by filtering the cached query for “1997” publications. The rewriting algorithm only needs the query and the cached query statements; it does not need to examine the source data. The cached queries play in this case the role of views.1

1Given the autonomy of the bibliographic sources and the mediator, the rewriting query may deliver a


Materialized views and cached queries were the main original motivation for relational query rewriting [YL87], and they are as important for semistructured or XML databases.

Web site management and structured Web search Recent work [FFLS97] has applied concepts from information integration to the task of building complex Web sites that serve information derived from multiple data sources. In this scenario, a Web site is a declaratively-defined site graph over the semistructured data graph of the contents of the information sources. If we only have access to the information through the Web site(s), queries asked over the data graph need to be rewritten as queries over the Web site structure and contents. The Web site definitions are just view definitions over the data graph; the necessary query rewriting can thus be handled by the rewriting algorithm.

Our algorithm solves the problem of rewriting a query Q using views, described in

Section 1.4.2, by outputting a finite set Q of rewriting queries, i.e., queries equivalent to Q

that have at least one condition referring to one of the views.

The algorithm operates in two steps: First it generates candidate rewriting queries by discovering mappings from the views to the original query Q. Then it keeps the candidate rewriting queries that are equivalent to the original query by performing equivalence checks.
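This two-step structure can be summarized as a generate-and-test loop. The sketch below is illustrative only: every helper passed as a parameter is a placeholder for the corresponding procedure of Chapters 3 and 4 (mapping discovery, candidate construction, composition, equivalence), and all names are assumptions.

```python
def rewrite(query, views, find_mappings, candidate, compose, equivalent):
    """Generate-and-test skeleton of the rewriting algorithm: build a
    candidate from each view-to-query mapping, expand it by composition
    with the view, and keep it only if the expansion is equivalent to
    the original query."""
    rewritings = []
    for view in views:
        for h in find_mappings(view, query):   # Step 1A: discover mappings
            qr = candidate(query, view, h)     # Step 1B: build candidate query
            qc = compose(view, qr)             # Step 2A: expand away the view
            if equivalent(qc, query):          # Step 2B: equivalence check
                rewritings.append(qr)
    return rewritings
```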

Testing the equivalence of the candidate rewriting query with the original query is

accomplished in two steps. First, the algorithm composes the rewriting query Qr and the

views to obtain an expanded query Qc that is equivalent to the rewriting query but does

not refer to the views any more. Then Qc is tested for equivalence with the original query

Q.

The rewriting algorithm is extended to make use of structural constraints on the source

data. In particular, we consider constraints that can easily be expressed by standards such

as XML DTDs or XML Schemas. The existence of such constraints allows us to find

rewritings in cases where, in the absence of constraints, the algorithm would fail.

The reader may wonder whether, given a reduction of semistructured data to relations

such as the one presented in [Pap97], the DSL rewriting problem can be fully reduced to

stale result to the user. This result may still be very useful to the user. Furthermore, if an update-propagation system is in place, it can account for the “deltas” between the cache and the sources [ZGMHW95]. In this thesis we will not deal any further with these consistency issues. Instead we focus on the rewriting algorithm.


the well-understood relational conjunctive query rewriting problem. The answer is negative

because DSL queries cannot be reduced to conjunctive relational queries. DSL queries and

views are reducible to Datalog with function symbols and with a limited form of recursion,

as follows from the MSL-to-Datalog reduction presented in [Pap97]. There are no known

applicable results for rewriting of Datalog programs using Datalog views.

Content Section 4.2 restates the rewriting problem for DSL, presents the rewriting

algorithm, proves its correctness and discusses its complexity. Section 4.3 in particular

extends the algorithm to take into account the existence of structural constraints on the

semistructured data that are given in the form of a Document Type Definition (DTD).

Section 4.4 describes the capability-based rewriting heuristic used by the CBR module of

the TSIMMIS mediator. Finally, Section 4.5 discusses related work and Section 4.6 offers

some concluding remarks.

4.2 DSL Query Rewriting

Given a DSL query Q over an OEM database D and conjunctive views V = V1, . . . , Vn over

D, the rewriting problem is to find a DSL query Q′ such that (i) Q′ refers to at least one

of V1, . . . , Vn and (ii) Q is equivalent to Q′.

We call Q′ the rewriting query. In general, there may be more than one rewriting query.

4.2.1 Rewriting of Queries with a Single Path Condition

We informally present an algorithm that decides whether a single-rule, normal form query Q, having a single path condition in its body, can be rewritten using a single-rule, normal form view V. This algorithm, though a special case of the complete rewriting algorithm, illustrates the basic steps of our technique. The general algorithm is presented in Section 4.3.1. It is proven sound and complete for DSL and its complexity is studied in Section 4.3.2.

Step 1: Find Candidate Queries We first find mappings from the view to the condition

and then we develop a candidate query for each mapping.

Step 1A: Find Mappings Find, if one exists, a mapping from the body of V to the

body of Q. Notice that there can be at most one mapping from the body of V


to the one single path condition in the body of Q. However, in the general case

(Section 4.3.1) we may have multiple mappings. If a mapping exists, then we can

be sure that, if there is a variable binding that satisfies the body of Q, then there

is also a binding that satisfies the body of V . Hence mappings are a necessary

condition for the relevance of the view to the query condition. Furthermore, the

mapping indicates which conditions of Q do not appear in V ; these conditions

will have to be checked by the rewriting query.

Example 4.2.1 Consider the view (V6), which restructures the person objects

into objects that “group” their labels in property subobjects, and their values

in value subobjects. Notice that (V6) “loses” information in the sense that it

only shows the labels and values that appear in the source data, but the label-

value correspondence has disappeared. Queries such as (Q27), that ask whether

the value leland appears in the source, can be answered using the view (V6),

because they do not need information on the label-value correspondence. The

example below shows how our algorithm finds a rewriting query for (Q27).

(V6) <g(P′) person {<pp(P′,Y′) property Y′> <h(X′) value Z′>}> :-

<P′ person {<X′ Y′ Z′>}>
(Q27) <f(P) stanford yes> :- <P person {<X Y leland>}>

The only mapping from the body of (V6) to the body of (Q27) is (M1). Intu-

itively, (M1) indicates that the condition Z′ = leland must be enforced on the

view in order to get objects relevant to the query.

(M1) [ P′ ↦ P, X′ ↦ X, Y′ ↦ Y, Z′ ↦ leland ]

2

Step 1B: Generate Candidate Query Apply the mapping to V , resulting in an

“instantiation” of V , namely V ′. Then build the rewriting query Q′ as follows:

The head of Q′ is identical to the head of Q. The body of Q′ is the head of V ′.

Example 4.2.1 continued The only candidate rewriting query (Q28) is created from the head of (Q27) and the result of applying (M1) to the head of (V6).


(Q28) <f(P) stanford yes> :-

<g(P) person {<pp(P,Y) property Y> <h(X) value leland>}>

Step 2: Test Correctness of Candidate Query Check whether the composition of V

and Q′, denoted by V ◦Q′, is equivalent to Q. Step 2 is accomplished in two sub-steps:

Step 2A: Computation of Composition We compute V ◦Q′ using the composition algorithm in Section 3.3.

Step 2B: Testing Equivalence of V ◦Q′ and Q Equivalence testing is done as

described in Section 3.4.

Example 4.2.1 continued We test whether (Q28) is a valid rewriting query by

first transforming it into the normal form (Q28)n, then composing it with (V6), and

finally comparing the resulting query (V6)◦(Q28)n to (Q27). Indeed, (V6)◦(Q28)n is

equivalent to (Q27) because (i) the containment mapping (M2) maps (V6)◦(Q28)n to

(Q27) and (ii) the containment mapping (M3) maps (Q27) to (V6)◦(Q28)n.2

(Q28)n <f(P) stanford yes> :- <g(P) person {<pp(P,Y) property Y>}>
       AND <g(P) person {<h(X) value leland>}>

(V6)◦(Q28)n <f(P) stanford yes> :- <P person {<X′ Y Z′>}> AND

<P person {<X′′ Y′′ leland>}>
(M2) [ P ↦ P, X′ ↦ X, Y ↦ Y, Z′ ↦ leland, X′′ ↦ X, Y′′ ↦ Y ]
(M3) [ P ↦ P, X ↦ X′′, Y ↦ Y′′ ]

Let us look at a second example:

Example 4.2.2 Consider the query (Q29) and the view (V6). It is clear that Z′ must bind

to set values that contain a <Z last stanford> subobject. The algorithm captures this

intuition by finding the mapping (M4) from the body of (V6) to the body of (Q29). Notice

that Z′ is mapped to {<Z last stanford>}.

(Q29) <f(P) stanford yes> :- <P person {<X Y {<Z last stanford>}>}>
(M4) [ P′ ↦ P, X′ ↦ X, Y′ ↦ Y, Z′ ↦ {<Z last stanford>} ]

(Q30) <f(P) stanford yes> :- <g(P) person {<pp(P,Y) property Y>

<h(X) value {<Z last stanford>}>}>

2We skip the step of graph component decomposition because of the simplicity of the example.


(Q30) is the candidate query created from the head of (Q29) and the result of applying

(M4) to the head of (V6). 2

Mappings are necessary but not sufficient for the existence of a rewriting query, as the following example illustrates. That is why the equivalence test in Step 2B of the algorithm is needed.

Example 4.2.3 Consider query (Q31) and view (V6).

(Q31) <f(P) stanford yes> :- <P person {<X name {<Z last stanford>}>}>

Intuitively, there is no rewriting query for (Q31) because the view “loses” the correspondence between labels and values. Hence, if the database contains a name attribute and a

value v containing the <last stanford> subobject, it is impossible for the rewriting query

to discover whether there is a name object with value v or name and v appear in different

objects of the data source. Notice that despite the non-existence of a rewriting query there

is the mapping (M5). Based on this mapping, the algorithm derives the candidate rewriting

query (Q32). However, the composition of the candidate rewriting query with the view

results in the query (Q33) which is not equivalent to the original query (Q31). Notice that

name is the label of the object X′ while <last stanford> is a subobject of another object

X′′.

(M5) [ P′ ↦ P, X′ ↦ X, Y′ ↦ name, Z′ ↦ {<Z last stanford>} ]

(Q32) <f(P) stanford yes> :- <g(P) person {<pp(P,Y) property name>

<h(X) value {<Z last stanford>}>}>
(Q33) <f(P) stanford yes> :- <P person {<X′ name Z′>}> AND

<P person {<X′′ Y′′ {<Z last stanford>}>}>

2

Section 4.3 discusses how the algorithm can exploit structural constraints, such as DTDs,

on source data.

4.3 Using structural constraints

Semistructured data is often accompanied by constraints that partially define the structure

of objects. Such structural constraints can be expressed as a DTD [Gol90], a DataGuide


[GW97], or an XML Schema [TBMM; BM]. For instance, we could know that the data in

the previous examples conform to the following DTD:3

<!ELEMENT person (name, phone, address*)>

<!ELEMENT name (last, first, middle?, alias?)>

<!ELEMENT alias (last, first)>

<!ELEMENT address CDATA>

<!ELEMENT phone CDATA>

<!ELEMENT last CDATA>

<!ELEMENT first CDATA>

<!ELEMENT middle CDATA>

This DTD describes in a flexible way the structure of the source data. For example, it specifies that objects labeled person have exactly one subobject with each of the labels name and phone, and zero or more address subobjects. It also specifies that the phone and address subobjects are atomic. Given such a DTD, we can infer information in the form of dependencies between labels or object ids that allows the rewriting algorithm to discover rewritings in cases where it would otherwise have failed.

Example 4.3.1 Given the above DTD, we can infer automatically that, in a source that

conforms to the DTD, the only subobject of a person object with a last subobject is a

name object. Therefore, if we look at (Q33) in Example 4.2.3, Y′′ has to be name. Moreover,

there exists a “labeled” functional dependency from object id P with label person to object id X with label name, since according to the DTD a person object has exactly one name subobject.

This implies that X′′ has to be X′ (by application of the chase rule). Therefore (Q33) can

be rewritten as

(Q34) <f(P) stanford yes> :- <P person {<X′ name Z′>}> AND

<P person {<X′ name {<Z last stanford>}>}>

It is obvious that (Q34) is equivalent to (Q31), and therefore a valid rewriting query.

2

As illustrated in the previous example, we identify two cases where information can easily

be inferred from a structural description, such as a DTD:

3Since OEM does not support order, we ignore the order in the DTD description as well.


• (label inference) Given a “path expression” of labels a.?.c, if the structural constraint

specifies that the only subobject of an a object with a c subobject is a b subobject,

we can infer that ? = b.

• (functional dependency) If the structural constraint specifies that objects labeled a

have only one subobject labeled b, we can infer the functional dependency between

object id variables Xa → Yb.
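Both inference rules are easy to read off a summarized DTD. The following Python sketch assumes a hypothetical dictionary summary of the example DTD above, mapping each element label to its child labels and their multiplicities; the representation and function names are illustrative, not part of the algorithm as stated.

```python
# Hypothetical summary of the example DTD: for each element label, its child
# labels with multiplicity "1" (exactly one), "?" (optional), or "*" (any number).
DTD = {
    "person": {"name": "1", "phone": "1", "address": "*"},
    "name":   {"last": "1", "first": "1", "middle": "?", "alias": "?"},
    "alias":  {"last": "1", "first": "1"},
}

def infer_label(parent, grandchild):
    """Label inference: for a path a.?.c, return b when b is the unique child
    label of a whose element can contain a c subobject, else None."""
    candidates = [b for b in DTD.get(parent, {}) if grandchild in DTD.get(b, {})]
    return candidates[0] if len(candidates) == 1 else None

def functional_children(parent):
    """Functional dependencies: child labels occurring exactly once under
    parent, so the parent's object id determines the child's object id."""
    return {b for b, mult in DTD.get(parent, {}).items() if mult == "1"}
```

On the example DTD, infer_label("person", "last") resolves the unknown label of Example 4.3.1 to name, and functional_children("person") yields the dependencies that drive the extended chase.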

The rewriting algorithm takes advantage of this information by performing label in-

ference and the chase on the query, the views and the candidate queries, as illustrated in

Example 4.3.1. It is straightforward to show that applying label inference and the chase

always terminates in time polynomial in the length of the queries and the constraints description. Moreover, it is easy to show that label inference and the chase do not affect the

soundness of the rewriting algorithm.

In the presence of structural constraints, there are clearly more opportunities for query

simplification and query rewriting. Identifying these opportunities is an open research

problem.

The next section presents the general algorithm for query rewriting.

4.3.1 General case of query rewriting

We now treat the general case of the query rewriting problem, with any number of views

in V and any number of conditions in the body of the (single-rule) query Q. For the sake

of simplicity, the following example uses a view set with only one view. The method used

generalizes trivially to view sets of any size; the algorithm described in Figure 4.1 covers

the general case.

Example 4.3.2 Consider the following view (V7). Notice that the semantic object ids of

property and value objects retain information about the object that originally had that

property and value. Then consider query (Q35).

(V7) <view(P′) person {<pp(X′) property Y′> <val(X′) value Z′>}> :-
     <P′ person {<X′ Y′ Z′>}>

(Q35) <f(P) stan student {<X Y Z>}> :- <P person {<X Y Z>}>
      AND <P person {<U university stanford>}>


Intuitively, (Q35) can be answered using only (V7) as follows: First use (V7) to find the

P’s that have a university subobject with value stanford. The mapping (M6) from the

body of (V7) to the first condition of (Q35) implies that this is possible. Then for every P

that qualifies, pick all its subobjects <X Y Z>. Mapping (M7) from the body of (V7) to the

second condition of (Q35) implies that this is also possible. Then, the head of the rewriting

query (Q36) is the head of (Q35) and the body of (Q36) is the conjunction of θ6(head(V7)) and θ7(head(V7)).

(M6) θ6 = [P′ 7→ P, X′ 7→ U, Y′ 7→ university, Z′ 7→ stanford]

(M7) θ7 = [P′ 7→ P, X′ 7→ X, Y′ 7→ Y, Z′ 7→ Z]

(Q36) <f(P) stan student {<X Y Z>}> :-

<view(P) person {<pp(X) property Y> <val(X) value Z>}> AND

<view(P) person {<pp(U) property university>

<val(U) value stanford>}>

Let us now check whether (Q36) is a valid rewriting query. Performing the check means

transforming (Q36) and (V7) into full normal form and checking whether

(Q37) = (V7) ◦ (Q36)

is equivalent to (Q35).

(Q37) <f(P) stan student {<X Y Z>}> :-

<P person {<X Y Z′>}> AND <P person {<X Y′ Z>}> AND

<P person {<U university Z′′>}> AND

<P person {<U Y′′ stanford>}>

Notice that unless we make use of the key dependency Oid → LabelValue there is no

mapping from the body of the query (Q35) to the body of (Q37). By chasing (Q37), we

infer that Y ≡ Y′, Z ≡ Z′, Y′′ ≡ university, and Z′′ ≡ stanford. □
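The chase with the key dependency (an object id determines its label and value) can be sketched as a small unification procedure. The following is an illustrative sketch, not the thesis's implementation; it assumes conditions are (oid, label, value) triples and that identifiers starting with an uppercase letter are variables (primes are written as digits here):

```python
# Sketch of the chase with the key dependency Oid -> (Label, Value):
# whenever two conditions share an object id, their labels and values
# are unified. Uppercase-initial strings are variables (an assumption
# of this illustration); everything else is a constant.
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def chase(conditions):
    subst = {}
    def find(t):
        while t in subst:
            t = subst[t]
        return t
    def unify(a, b):
        a, b = find(a), find(b)
        if a == b:
            return
        if is_var(a):
            subst[a] = b
        elif is_var(b):
            subst[b] = a
        else:
            raise ValueError(f"constants clash: {a} vs {b}")
    by_oid = {}
    for oid, label, value in conditions:
        oid = find(oid)
        if oid in by_oid:            # key dependency fires
            l0, v0 = by_oid[oid]
            unify(l0, label)
            unify(v0, value)
        else:
            by_oid[oid] = (label, value)
    return [(find(o), find(l), find(v)) for o, l, v in conditions]

# The four conditions of (Q37), written as (oid, label, value):
q37 = [("X", "Y", "Z1"), ("X", "Y1", "Z"),
       ("U", "university", "Z2"), ("U", "Y2", "stanford")]
print(chase(q37))
```

Running this on the conditions of (Q37) equates Y with Y′ and Z with Z′, and binds Y′′ to university and Z′′ to stanford, matching the inferences in the text.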

We now give the algorithm for the general case of the query rewriting problem. In what

follows, the bodies of the query Q and the views in V are converted into normal form and

label inference and the chase are applied before we apply the algorithm.

Notice that the above algorithm constructs and tests all candidate queries (in Step

1B). The efficiency of the algorithm can be substantially improved with the use of simple

heuristics. A particularly effective heuristic is the following:


Algorithm 4.3.3

Input: A DSL query Q with k single path conditions in the body, and a set of DSL views V = {V1, . . . , Vn}.
Output: A set of rewriting queries.
Step 1A: Find the mappings θij from the body of each Vi ∈ V to the body of Q.
Step 1B: Construct candidate rewriting queries Q′ as follows:

• head(Q′) is head(Q).
• body(Q′) is any conjunction of l conditions, 1 ≤ l ≤ k, where each condition is either a view “instantiation” θij(head(Vi)) or a condition of Q. If the resulting query is unsafe, then continue with the next candidate.

Step 1C: Perform label inference and chase Q′.
Step 2: Test whether each constructed Q′ is correct:

• Construct the composition Q′(V1, . . . , Vn) of Q′ with V1, . . . , Vn.
• Perform label inference and chase Q′(V1, . . . , Vn).
• If Q′(V1, . . . , Vn) is equivalent to Q, include Q′ in the output; else continue with the next candidate.

□

Figure 4.1: DSL query rewriting algorithm

• Keep track of which conditions of the query body each instantiated view θij (head(Vi))

maps into. These are the conditions that are “covered” by θij (head(Vi)).

• Only construct candidate queries Q′ such that the views and conditions in the body

of Q′ “cover” all the conditions in the body of Q.
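The covering heuristic can be sketched as follows, assuming each mapping is summarized by the set of query-condition indices it covers (the names v1–v3 are hypothetical):

```python
# Sketch of the covering heuristic: only build candidates from
# combinations whose covered condition sets jointly equal the full
# set of query conditions.
from itertools import combinations

def covering_candidates(mappings, num_conditions):
    """mappings: dict name -> set of covered condition indices."""
    all_conds = set(range(num_conditions))
    result = []
    names = list(mappings)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            covered = set().union(*(mappings[n] for n in combo))
            if covered == all_conds:
                result.append(combo)
    return result

# Hypothetical mappings for a query with two conditions:
print(covering_candidates({"v1": {0}, "v2": {1}, "v3": {0, 1}}, 2))
```

Combinations that leave some query condition uncovered, such as {v1} alone, are never turned into candidates, which prunes the exhaustive search of Step 1B.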

A variation of the above heuristic is implemented in the capability-based rewriting mod-

ule of the TSIMMIS system, as explained in more detail in Section 4.4.

4.3.2 Completeness and Complexity

The soundness of the algorithm in Figure 4.1 is established by its second step, which checks

the correctness of the rewriting. We will show that the algorithm is complete, i.e., that it

always finds a rewriting query if one exists. For this, we assume that there are no structural

constraints, and therefore no functional dependencies except the key dependencies on object

id.

To prove the completeness of the algorithm, we first observe that if there is no mapping

from a view body to the query body, then the view is not “relevant” to the query.


Lemma 4.3.4 Let Q and V be DSL queries. There is a rewriting query Q′ of Q using view

V only if there is a mapping from the body of V to the body of Q.

Moreover, we can bound both the number of conditions and the variables appearing in

the rewriting.

Lemma 4.3.5 Let Q be a DSL query and V be a set of DSL views. If there exists a

rewriting of Q using V, then there exists such a rewriting consisting of at most k view

heads, where k is the number of single path conditions in the body of the query.4

Lemma 4.3.6 If there exists a rewriting of query Q using the set of views V, then there

exists a rewriting of Q using V that uses only variables that appear in Q.

The above lemmata directly extend the theory of relational query rewriting, presented

in [LMSS95], to DSL.

The following lemma justifies why completeness is not compromised by only constructing

rewriting queries Q′ that have a head identical to the head of the query Q. Notice that this issue is particular to semistructured and nested data models, while it is trivial

in the relational model (where it is easier to see that Q′ must have a head identical, up to

variable renaming, to the head of Q).

Lemma 4.3.7 If there exists a valid rewriting query Q′′ such that head(Q′′) is not the same

as head(Q), then there exists a valid rewriting query Q′ such that head(Q′) = head(Q).

To see that Lemma 4.3.7 holds, notice that if there exists such a query Q′′, then we can

always apply our rewriting algorithm to it, to derive a query Q′ equivalent to Q′′ (and

therefore to Q) whose head is identical to the head of Q.

Theorem 4.3.8 The rewriting algorithm of Figure 4.1 is sound and complete.

Proof: The algorithm is obviously sound, because its last step is a correctness test. It is

complete because it exhaustively searches the space of possible candidate rewriting queries,

as defined by the above lemmata (i.e., it generates all the candidate rewriting queries in

that space). □

4Notice that, since view heads do not have to be single path, the number of single paths in the rewriting can be greater than k.


Complexity of DSL rewriting

The algorithm described in Section 4.3.1 takes exponential time. First, Step 1 can generate

a number of mappings that is exponential in the size of the view bodies. Then Step 2 can

generate an exponential number of candidate rewritings, each of polynomial length. Finally,

the construction of Q′(V1, . . . , Vn) using the query composition algorithm takes exponential

time. Checking the equivalence of each of these queries (that are of polynomial length) to

the original query takes time that is non-deterministic polynomial. Therefore overall the

complexity of the DSL view rewriting algorithm is exponential time.

4.4 Capability-based rewriting in the TSIMMIS mediator

In this section, we describe the implemented capability-based rewriting and plan generation

module of the TSIMMIS mediator.5 In particular, we show how the mediator translates

the user queries into a set of relevant source queries in Sections 4.4.1 and 4.4.2. We also

explain in Section 4.4.3 how the source capabilities are described by using query templates.

The capability-based plan generation algorithm implemented in the TSIMMIS mediator is

a heuristic based on the rewriting algorithm presented in the previous sections. We discuss

its limitations in Section 4.4.5.

4.4.1 Query Translation

As explained in Chapter 1, the mediator encodes the relationship between the integrated

views and the source views with a set of view definitions. Specifically, it uses DSL to define

integrated views. For example, the integrated view (V8) is defined as follows:

(V8) <paper {<title T> <author A> <abs B> <conf C>}> :-

<entry {<title T> <author A> <abs B>}>@s1,

<entry {<title T> <conf C>}>@s2

The above view is essentially a join of the views exported by s1 and s2, with title

being the join attribute.

Suppose the user wants to find the title and abstract of each paper written by ‘Smith’

in ‘SIGMOD-97’. The user formulates the following query, based on the user view paper:

5The design of the described rewriting and plan generation algorithm was done with Ramana Yerneni and Chen Li. The implementation in the TSIMMIS mediator was done by Ramana Yerneni and Chen Li.


(Q38) <ans {<title T> <abs B>}> :-

<paper {<title T> <author "Smith"> <abs B> <conf "SIGMOD-97">}>

When the user query arrives at the mediator, the mediator uses the view definitions to

translate the query on the user views into a logical plan [PGMU96] (i.e., a set of DSL rules

that refer to the source views instead of the integrated views). The following is the logical

plan for the example user query:

(P1) <ans {<title T> <abs B>}> :-

<entry {<title T> <author "Smith"> <abs B>}>@s1,

<entry {<title T> <conf "SIGMOD-97">}>@s2

We refer to the source queries specified on the right hand sides of the logical query plan

rules as conditions. In the logical plan above, there is only one rule with two conditions.

The rule states that answers to the user query can be computed by sending two source

queries. The first one, to s1, gets the title and abstract of each entry (for a paper)

corresponding to ‘Smith’, while the second one, to s2, gets the title of each paper in

conference ‘SIGMOD-97’. From the results of the two source queries, the bindings for

variables T and B are obtained to construct the answers to the user query.

4.4.2 Physical Plans

The logical query plans in TSIMMIS do not specify the order in which the conditions are

processed (e.g., the order in which source queries are sent to the sources). This is done in

the physical plans generated in the subsequent stages of the TSIMMIS mediator.

Three possible physical plans for the logical plan of the example user query are:

• P1: Send query <entry {<title T> <author "Smith"> <abs B>}> to s1;6 send

query <entry {<title T> <conf "SIGMOD-97">}> to s2; join the results of these

source queries on the title attribute.

• P2: Send query <entry {<title T> <author "Smith"> <abs B>}> to s1; for each

returned title, send query <entry {<title T> <conf "SIGMOD-97">}> to s2, with

T bound.

6Strictly speaking, source queries are DSL rules. For simplicity, we just show the tail of the source query rule.


• P3: Send <entry {<title T> <conf "SIGMOD-97">}> to s2; for each returned title,

send query <entry {<title T> <author "Smith"> <abs B>}> to s1, with T bound.
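The difference between an independent-fetch plan such as P1 and a bind-join plan such as P3 can be sketched as below; the source functions and their sample data are simplified stand-ins for wrapper calls, not the TSIMMIS interface:

```python
# Sketch contrasting plan P1 (fetch both sources, join on title) with
# a bind-join plan like P3 (fetch from s2, then probe s1 per title).
def s1_query(title=None):
    data = [{"title": "Q opt", "author": "Smith", "abs": "..."}]
    return [e for e in data if title is None or e["title"] == title]

def s2_query(conf):
    return [{"title": "Q opt", "conf": conf}]

def plan_p1():                 # independent fetches + join on title
    left = s1_query()
    right = {e["title"] for e in s2_query("SIGMOD-97")}
    return [e for e in left if e["title"] in right]

def plan_p3():                 # bind-join: results of s2 drive s1
    out = []
    for e in s2_query("SIGMOD-97"):
        out.extend(s1_query(title=e["title"]))
    return out

print(plan_p1() == plan_p3())  # both plans compute the same answer
```

The plans are equivalent in the answers they compute; they differ only in how source queries are ordered and parameterized, which is exactly what feasibility under source capabilities (next section) constrains.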

4.4.3 Source Capabilities Description: Templates

In order to describe the capabilities of sources, the TSIMMIS system uses templates to

represent sets of queries that can be processed by each source (see also Chapter 1 and

Appendix A). Suppose s1 and s2 have the following templates.

T11: X :- X:<entry {<title $T> <author A> <abs B>}>@s1

T21: X :- X:<entry {<title T> <conf $C>}>@s2

T22: X :- X:<entry {<title $T> <conf C>}>@s2

The first template T11 says that source s1 can return all the information it has about a

paper given its title. T21 says that s2 can return all the information it has about papers

given the conference. T22 says that s2 can also return the information about a paper given

its title. Assume that these are the only templates for s1 and s2. That is, s1 and s2

cannot answer any other kinds of queries. The pattern variable X in this example binds

to the contents of the whole object pattern. The use of pattern variables is essentially a

shortcut.

TSIMMIS templates are DSL views of a limited form that also use placeholders to express the binding patterns required by the query interface of a source. The templates

have a significant limitation: query heads can consist of only one pattern variable. There-

fore, a template, viewed as a DSL view, returns one condition and allows no projections or

restructuring of the result.

Given the above capabilities, P1 is not feasible since s1 cannot answer the query

<entry {<title T> <author "Smith"> <abs B>}>

because the title value is not specified. P2 is also infeasible for the same reason. Only

P3 is feasible, as the mediator first gets the title of each paper in ‘SIGMOD-97’ from s2

and uses this title value to get the corresponding abstract information from s1 and check

that the author is indeed ‘Smith’. Notice that the queries to s1 are now feasible because

they specify the title values.
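This feasibility test can be sketched by modeling each template simply as the set of attributes it requires bound; this is an illustration under that simplification, not the actual TSIMMIS code:

```python
# Sketch of feasibility checking: each source has templates, and each
# template is summarized by the attributes whose values must be
# supplied ($-slots). A query is feasible at a source if some template
# has all its required slots bound.
templates = {
    "s1": [{"T"}],            # T11: title must be bound
    "s2": [{"C"}, {"T"}],     # T21: conf bound; T22: title bound
}

def feasible(source, bound_attrs):
    """True if some template of `source` has its required slots bound."""
    return any(req <= bound_attrs for req in templates[source])

print(feasible("s1", set()))   # False: no title value specified
print(feasible("s2", {"C"}))   # True via T21
print(feasible("s1", {"T"}))   # True once a title value is known
```

This mirrors why P1 and P2 fail (the query to s1 arrives with no title bound) while P3 succeeds (s2 supplies title bindings before s1 is queried).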


Source Query   Source Template   Condition Processed   Binding Requirement
M1             T11               C1                    T
M2             T21               C2                    None
M3             T22               C2                    T

Table 4.1: Matcher Result

In the next section, we show how the plan generation process in TSIMMIS takes into

account the source capabilities in producing feasible query plans.

4.4.4 Capability Based Plan Generation

Figure 4.2: TSIMMIS CBR architecture

A block diagram of the TSIMMIS mediator is shown in Figure 1.6. The logical plan

generated by the query decomposer is passed on to the plan generator module, which computes a feasible physical plan for the query. As shown in Figure 4.2, it accomplishes this in

three stages.

Matcher

The first step in the plan generation process is to find all the templates that represent

source queries that can process parts of the logical plan. Some of these templates have

requirements indicating the list of variables that need to be bound. To illustrate, consider

the logical plan of Section 4.4.1. Let the two conditions of the logical plan be denoted C1

and C2. There is one template to process C1, with the requirement that variable T be bound.

There are two templates that can process C2: one with T bound and another without any

binding requirements.

Table 4.1 describes the result of the Matcher. It has a row for each source query that

processes some conditions of the logical plan. For instance, the first row M1 in the table

indicates that s1 can process C1 by using template T11 with the requirement that variable

T be bound.


Sequencer

The second step in the plan generation process is to piece together the source queries for

processing the conditions of the logical plan in order to construct feasible plans. Here, what

matters is not just the specific source queries chosen to cover all the conditions of the logical

plan but also the sequence of processing these queries.

The Sequencer uses the table output by the Matcher to find the set of feasible sequences

of source queries. Each query in a feasible sequence has the property that the variables

in its binding requirements are exported by the source queries that appear earlier in the

sequence. For instance, in our example logical plan, the Sequencer finds that the only

feasible sequence is < M2, M1 >, with source query M1 being parameterized (variables

bound) from the result of the source query M2. Sequence < M1, M2 >, though it can

also process all the conditions, is not feasible because M1’s binding requirements cannot be

satisfied. Other sequences like < M2, M3 > cannot process all the conditions of the logical

plan.
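A simplified version of this search can be sketched as a greedy loop over the Matcher output of Table 4.1; the real Sequencer enumerates all feasible sequences, so the greedy choice below is only an illustration:

```python
# Sketch of the Sequencer: repeatedly pick a matched source query
# whose binding requirements are already available, mirroring the
# Matcher result of Table 4.1.
matches = {   # name -> (condition processed, required vars, exported vars)
    "M1": ("C1", {"T"}, {"T", "B"}),
    "M2": ("C2", set(), {"T"}),
    "M3": ("C2", {"T"}, {"T"}),
}

def feasible_sequence(conditions):
    available, seq, todo = set(), [], set(conditions)
    while todo:
        pick = next((m for m, (c, req, exp) in matches.items()
                     if c in todo and req <= available), None)
        if pick is None:
            return None          # no feasible sequence found
        cond, req, exp = matches[pick]
        seq.append(pick)
        available |= exp         # later queries may use these bindings
        todo.discard(cond)
    return seq

print(feasible_sequence({"C1", "C2"}))   # ['M2', 'M1']
```

On the example this recovers ⟨M2, M1⟩: M2 has no binding requirements and exports T, which then satisfies M1's requirement.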

Optimizer

Having found the feasible sequences of source queries, the third step of the plan generation

process is to optimize over the set of corresponding feasible plans and choose the most

efficient among these. The Optimizer uses standard optimization techniques to pick the

best feasible plan and translates it into a physical plan. In our example case, there is only

one feasible sequence of queries < M2, M1 > and this leads to the physical plan P3 of

Section 4.4.2.

4.4.5 Rewriting algorithm and capability-based plan generation

The plan generation algorithm described in this section is a heuristic based on the general

purpose rewriting algorithm described in this chapter. Essentially, the algorithm exploits

the simplicity of the templates to construct feasible candidate queries faster. The plan-generation algorithm is sound, meaning the generated plans are indeed equivalent to the user queries, but not complete: it can fail to find a plan even when one exists (the rewriting algorithm, on the other hand, is complete, as discussed in Section 4.3.2).

In particular, the plan generation algorithm treats each object condition appearing on

the right hand side of a user query as an atomic condition: the algorithm tries to find (in


the Matcher) one or more templates that “cover” the condition. Thus, the algorithm can

miss a plan where two conditions can be covered together by one template. For example, if

the logical plan (P1) above were instead expressed as (P2) below, then the Matcher would

not be able to discover any matching templates.

(P2) <ans {<title T> <abs B>}> :-

<E entry {<title T> <author "Smith">}>@s1 AND

<E entry {<title T> <abs B>}>@s1 AND

<entry {<title T> <conf "SIGMOD-97">}>@s2

4.5 Related work

There is little work on the problem of rewriting semistructured queries using views [FLS98;

CGLV99; CGLV00]. In [FLS98], the related problem of query containment in StruQL is

addressed. The paper deals with queries and views containing “wildcards” and regular

path expressions, but it does not deal with the restructuring capabilities of the StruQL

language. Recently, Calvanese et al. [CGLV99; CGLV00] proposed an elegant solution to

the problem of rewriting a regular expression in terms of other regular expressions. The

problem is closely related to the problem of rewriting semistructured queries using views,

but the solution is applicable to a narrow class of queries and views, the ones that consist

of only one regular path expression and return its “endpoints.”

The problem of query rewriting for conjunctive relational views is discussed, among

others, in [LMSS95; DL97] and for recursive queries (but not recursive views) in [DG97].

The problem of query equivalence for relational languages with object ids has been studied

in [HY91]. Our notion of query equivalence corresponds roughly, in the terminology of

[HY91], to exposed equivalence.

Our work is also related to the problem of object oriented query rewriting. Previous

work on the problem of containment and equivalence of object oriented queries [Cha92;

LR96] relies on the existence of a static class hierarchy. The problem of containment of

queries on complex objects has been addressed recently in [LS97].

Finally, there has been some recent work on using structural information about a semi-

structured source (such as graph schemas [BDFS97] or DTDs) in query processing [FS98;

PV00; MS99; MSV00].


4.6 Conclusions

We describe an algorithm that, given a semistructured query q expressed in DSL and a

set of semistructured views V, finds rewriting queries, i.e., queries that access the views

and are equivalent to q. Our algorithm is based on appropriately generalizing containment

mappings, the chase, and unification. The first step uses containment mappings to produce

candidate rewriting queries. The second step composes each candidate rewriting query with

the views and checks whether the composition is equivalent to the original query.

Moreover, we extend the algorithm to use structural constraints to discover rewritings

in cases where, in the absence of constraints, there would be no rewritings.

It is an open problem to extend our algorithm to semistructured languages with regular

path expressions, like StruQL or Lorel [AQM+97].


Chapter 5

The Capability Description

Language p-Datalog

In the previous chapter we studied the capability-based rewriting problem in the context of

the semistructured data model. We have also shown how the salient features of the semi-

structured model and of semistructured languages affect the solution of the CBR problem.

In this chapter and in Chapter 6 we solve the CBR problem for relational data. Using

a simpler data model allows us to focus our attention on the CBR and related problems,

in particular the query expressibility problem introduced in Chapter 1, in the context of

more powerful capability description languages. We propose the use of Datalog variants

as more powerful description languages. We define and study the CBR and query express-

ibility problem for these description languages. Moreover, we present some results on the

expressiveness of these languages.

We focus on sources that support conjunctive queries, i.e., their capabilities are a subset

of the set CQ of all conjunctive queries ([AHV95]). The topics discussed in this chapter are

as follows:

• We introduce the description language p-Datalog. We formally define the semantics

of p-Datalog as a capability description language, and present complete and efficient

algorithms to (i) decide whether a query is described by a p-Datalog description (the

query expressibility problem) and (ii) decide whether a query can be answered by

combining supported queries (the CBR problem). The CBR algorithm runs in nondeterministic exponential time in the size of the query and the description, a substantial


improvement over the algorithm described in [LRU99], which

is doubly exponential.

• We extend the CBR algorithms for descriptions with binding requirements. We also

compare the expressive power of our proposed description language with an alternative

description of binding requirements: Datalog queries with binding patterns [RSU95].

• We study the expressive power of p-Datalog. We show that p-Datalog cannot describe the query capabilities of certain powerful sources. The most important result is that there is no Datalog program that can describe all conjunctive queries over a given schema; indeed, there is no program that describes even all boolean conjunctive queries over the schema.

• We identify an important class of descriptions, Ploop descriptions, covering sources

such as document retrieval systems, lookup catalogs, and object repositories, and we

show that the complexity of the CBR problem for Ploop descriptions is significantly

lower than the complexity for the general case.

The next section introduces the p-Datalog description language. Section 5.2 describes

the query expressibility decision procedure. Section 5.3 describes the CBR algorithm for p-

Datalog. Section 5.4 studies a useful large class of descriptions, for which the CBR problem

has lower computational complexity. Section 5.5 discusses expressive power issues. The

chapter concludes with Section 5.6 that discusses the related work, including the relationship

between the proposed description language and Datalog queries annotated with binding

patterns.

5.1 The p-Datalog Source Description Language

It is well known that the most popular real-life query languages, like SPJ queries [AHV95]

and Web-based query forms, are equivalent to conjunctive queries. A Datalog program is a

natural encoding of many sets of conjunctive queries: the set is described by the expansions

of the Datalog program. First, we describe informally a Datalog-based source description

language and illustrate it with examples. A formal definition follows in the next subsection.

In the simple case, when we deal with a weak information source, the source can be

described using a set of parameterized queries. Parameters, called tokens in this thesis,


specify that some constant is expected in some fixed position in the query [PGGMU95;

PGH96; LRU96; LRO96]. Without loss of generality, we assume the existence of a desig-

nated predicate ans that is the head of all the parametrized queries of the description.

Example 5.1.1 Consider a bibliographic information source, that provides information

about books. This source exports a predicate

books(isbn, author, title, publisher, year, pages)

The source also exports “indexes”:

author_index(author_name, isbn)
publisher_index(publisher, isbn)
title_index(title_word, isbn)

Conceptually, the tuple (X, Y) is in author_index if the string X resembles the actual name of an author and Y is the ISBN of a book by that author. Similarly, (X, Y) is in title_index if X is a word of the actual title and Y is the ISBN of a book with word X in the title. The following parameterized queries describe the wrapper that answers queries specifying an author, a title, or a publisher.

ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), author_index($c, Id)
ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), title_index($c, Id)
ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), publisher_index($c, Id)

where $c denotes a token. The query

ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), author_index("Smith", Id)

can be answered by that source, because it is derived from the first parameterized query by replacing $c with the constant "Smith". □
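The instantiation test implicit here — a concrete query is supported if it results from a parameterized query by replacing each token with a constant — can be sketched as follows; atoms are modeled as (predicate, arguments) tuples and variables are marked with a leading `?`, a representation chosen only for this illustration:

```python
# Sketch of token semantics: a query instantiates a parameterized query
# if it is identical atom-by-atom, except that each "$c" token position
# holds a constant in the query. "?"-prefixed strings are variables.
def instantiates(param_query, query):
    if len(param_query) != len(query):
        return False
    for (p, pargs), (q, qargs) in zip(param_query, query):
        if p != q or len(pargs) != len(qargs):
            return False
        for pa, qa in zip(pargs, qargs):
            if pa == "$c":
                if qa.startswith("?"):   # a token must become a constant
                    return False
            elif pa != qa:
                return False
    return True

pq = [("books", ("?Id", "?Aut", "?Titl", "?Pub", "?Yr", "?Pg")),
      ("author_index", ("$c", "?Id"))]
q  = [("books", ("?Id", "?Aut", "?Titl", "?Pub", "?Yr", "?Pg")),
      ("author_index", ("Smith", "?Id"))]
print(instantiates(pq, q))    # True
```

A query that leaves the token position as a variable (the author unspecified) does not instantiate the parameterized query, so it is not supported by the source.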

In the previous example, the source is described by parameterized conjunctive queries. Note

that if, for instance, the source accepts queries where values for any combination of the three

indexes are specified, we would have to write 2³ = 8 parameterized conjunctive queries. The

next example uses IDB predicates (i.e., predicates that are defined using source predicates


and other IDB predicates) to describe the abilities of such a source more succinctly. Finally,

example 5.1.3 uses recursive rules to describe a source that accepts an infinite set of query

patterns.

Example 5.1.2 Consider the bibliographical source of the previous example. Assume that

the source can answer queries that specify any combination of the three indexes. The

p-Datalog program that describes this source is the following:

(D1)
(R1) ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), ind1(Id), ind2(Id), ind3(Id)
(R2) ind1(Id) ← title_index($c, Id)
(R3) ind1(Id) ← ε
(R4) ind2(Id) ← author_index($c, Id)
(R5) ind2(Id) ← ε
(R6) ind3(Id) ← publisher_index($c, Id)
(R7) ind3(Id) ← ε

ε denotes an empty body, i.e., an ε–rule has an empty expansion. Notice that ε–rules are

unsafe [Ull89]. In general, p-Datalog rules can be unsafe, but that is not a problem under

our semantics, defined in Section 5.1.1. Note also that the number of rules is only polyno-

mial in the number of the available indexes, whereas the number of possible expansions is

exponential.

The query

ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), author_index("Smith", Id)

can be answered by that source, because it is derived by expanding rule (R1) using rules (R3), (R4) and (R7), and by replacing $c with the constant "Smith". We can easily modify the description to require that at least one index be used. □

In general, a p-Datalog program describes all the queries that are expansions of an

ans-rule of the program. In particular, p-Datalog rules that have the ans predicate in

the head can be expanded into a possibly infinite set of conjunctive queries. Among the

expansions generated, some will only refer to source predicates.1 We call these expansions

1We stated that source predicates are the EDB predicates of our descriptions.


terminal expansions. A p-Datalog program can have unsafe terminal expansions. We say

that the p-Datalog program describes the set of conjunctive queries that are its safe terminal

expansions (see formal definitions in the next subsection).

Example 5.1.3 Consider again the bibliographical source of Example 5.1.1. Assume that

there is an abstract index abstract index(abstract word, Id) that indexes books based on

words contained in their abstracts. Consider a source that accepts queries on books given

one or more words from their abstracts. The following p-Datalog program can be used to

describe this source.

(D2) ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), ind(Id)

ind(Id) ← abstract index($c, Id)

ind(Id) ← ind(Id), abstract index($c, Id)

The source describes the following infinite family of conjunctive queries:

ans(I, A, T, P, Y, Pg) ← books(I, A, T, P, Y, Pg), abstract index(c1, I)

ans(I, A, T, P, Y, Pg) ← books(I, A, T, P, Y, Pg), abstract index(c1, I),

abstract index(c2, I)

etc.

which agrees with our conceptual description of the source given above. 2
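Because (D2) has infinitely many terminal expansions, they are naturally produced lazily. A minimal Python sketch, using a generator over hypothetical subgoal strings (the names are illustrative only):

```python
from itertools import count, islice

def abstract_expansions():
    """Yield the terminal expansions of (D2): one expansion per number n >= 1
    of abstract-index subgoals, so the stream never ends."""
    base = "books(I, A, T, P, Y, Pg)"
    for n in count(1):
        yield [base] + [f"abstract index(c{i}, I)" for i in range(1, n + 1)]

# Take the first three members of the infinite family.
first_three = list(islice(abstract_expansions(), 3))
```

The first element reproduces the one-subgoal expansion shown above; each subsequent element adds one more abstract-index subgoal.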

As another example of a recursive source description, we can think of a transportation

company, such as FedEx, that has an information source that contains information about

flights. The source is capable of answering certain queries about flights. In particular,

assume that the source can answer only whether there exists a flight between cities A and

B that makes exactly n stopovers. We can model such a source with the following p-Datalog

program:

(D3) exists(A,B, 1) ← flight(A,B)

exists(A, B, n) ← flight(A, C), exists(C, B, n − 1)

Notice that the content of the source is essentially the relation flight, whereas the

p-Datalog program above describes the query capabilities of the source.


5.1.1 Formal description of p-Datalog

We assume familiarity with Datalog, e.g., [Ull89; AHV95]. Besides the constant and variable

sorts, we use a third disjoint set of symbols, the set of tokens.

Definition: p-Datalog Program Syntax A parametrized Datalog rule or p-Datalog rule

is an expression of the form

p(u)← p1(u1), . . . , pn(un)

where p, p1, p2, . . . , pn are relation names, and u, u1, u2, . . . , un are tuples of constants,

variables and tokens of appropriate arities. A p-Datalog program is a finite set of p-Datalog

rules. 2

Tokens are variables that have to be instantiated to form a query. We now formalize

the semantics of p-Datalog as a source description language.

Definition: Set of Queries Described/Expressible by a p-Datalog Program Let

P be a p-Datalog program with a particular IDB predicate ans. The set of expansions EP of P is the smallest set of rules such that:

• Each rule of P that has ans as the head predicate is in EP ;

• If r1: p ← q1, . . . , qn is in EP , r2: r ← s1, . . . , sm is in P (assume their variables and

tokens are renamed, so that they don’t have variables or tokens in common) and a

substitution θ is the most general unifier of some qi and r then the resolvent

θp ← θq1, . . . , θqi−1, θs1, . . . , θsm, θqi+1, . . . , θqn

of r1 with r2 using θ is in EP .

The set of safe terminal expansions TP of P is the set of all expansions e ∈ EP that contain

only EDB predicates in the body and are safe [Ull89]. The set of queries described by P

is the set of all rules ρ(r), where r ∈ TP and ρ assigns arbitrary constants to all tokens in

r. The set of queries expressible by P is the set of all queries that are equivalent to some

query described by P . 2

Unification extends to tokens in a straightforward manner: a token can be unified with

another token, yielding a token. When unified to a variable, it also yields a token. When

unified to a constant, it yields the constant. The above definitions can easily be extended

to accommodate more than one “designated” predicate (like ans).
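The unification rules for tokens just stated amount to a small case analysis. The following sketch uses a tagged-tuple term representation of our own choosing, not notation from the text:

```python
VAR, TOK, CONST = "var", "tok", "const"  # term tags, a convention of this sketch

def unify_terms(t1, t2):
    """Return the term t1 and t2 unify to, or None if they clash."""
    (k1, v1), (k2, v2) = t1, t2
    if k1 == CONST and k2 == CONST:
        return t1 if v1 == v2 else None   # two constants must be equal
    if CONST in (k1, k2):
        return t1 if k1 == CONST else t2  # token/variable vs. constant -> constant
    if TOK in (k1, k2):
        return t1 if k1 == TOK else t2    # token vs. token/variable -> token
    return t1                             # variable vs. variable -> variable

print(unify_terms((TOK, "$c"), (VAR, "X")))      # ('tok', '$c')
print(unify_terms((TOK, "$c"), (CONST, "Zen")))  # ('const', 'Zen')
```

Note the asymmetry: a token absorbs variables but is itself absorbed by a constant, exactly as described above.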


In the context of the above description semantics, we will use the terms p-Datalog

program and description interchangeably.

Informally, we observe that expansions are generated in a grammar-like fashion, by

using Datalog rules as productions for their head predicates and treating IDB predicates as

“nonterminals” [ASU87]. Resolution is a generalization of non-terminal expansion; rules of

context-free grammars can be thought of simply as Datalog rules with 0 arguments.

Rectification: For deciding expressibility, as well as for solving the CBR problem, the

following rectified form of p-Datalog rules simplifies the algorithms. We assume the following

conditions are satisfied:

• No variable appears twice in subgoals of the query body. Instead, multiple occurrences

of the same variable are handled by using distinct variables and making equalities

explicit with the use of the equality predicate equal.

• No variable appears twice in the head of the query. Again, equalities are made explicit

with use of the predicate equal.

• No constants or tokens appear among the ordinary2 subgoals. Instead, every constant

c or token $c is replaced by a unique variable C, and an equality subgoal equal(C, c)

or equal(C, $c) is added to equate the variable to the constant.

• No variables appear only in an equal subgoal of a query.

Example 5.1.4 Consider the query

(Q39) ans(X, X, Z) ← r(X, Y, Z), p(a, Y)

which contains a join between the second columns of r and p, a selection on the first column

of p, and the same variable in two columns of ans. Its rectified equivalent is

(Q40) ans(X1, X, Z) ← r(X, Y, Z), p(A, Y1), equal(X, X1), equal(Y, Y1), equal(A, a)

2We refer to the EDB and IDB relations and their facts as ordinary, to distinguish them from facts of

the equal relation.


Notice that we treat the equal subgoal not as a built-in predicate, but as a source

predicate. We call rules that obey these conditions rectified rules and the process that

transforms any rule to a rectified rule rectification. We call the inverse procedure (that

would give us (Q39) from (Q40)) de-rectification.
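The body part of rectification can be sketched in a few lines. Assumptions of this sketch (not from the text): a subgoal is a (predicate, argument-list) pair, a name starting with an uppercase letter is a variable, anything else is a constant or token, fresh variables are named V1, V2, . . ., and handling of the head is omitted.

```python
def rectify_body(body):
    """Replace repeated variables and constants/tokens in ordinary subgoals
    by fresh variables, adding explicit `equal` subgoals."""
    seen, equalities, new_body, counter = set(), [], [], 0
    for pred, args in body:
        new_args = []
        for a in args:
            if a[0].isupper() and a not in seen:   # first occurrence of a variable
                seen.add(a)
                new_args.append(a)
            else:      # repeated variable, constant, or token (e.g. "$c")
                counter += 1
                fresh = f"V{counter}"
                equalities.append(("equal", [fresh, a]))
                new_args.append(fresh)
        new_body.append((pred, new_args))
    return new_body + equalities

# Rectifying the body of (Q39): r(X, Y, Z), p(a, Y)
rectified = rectify_body([("r", ["X", "Y", "Z"]), ("p", ["a", "Y"])])
```

This yields r(X, Y, Z), p(V1, V2), equal(V1, a), equal(V2, Y), which matches the body of (Q40) up to renaming of the fresh variables.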

In Sections 5.2 and 5.3 we provide algorithms for deciding whether a query is expressible

by a description and for solving the CBR problem.

5.2 Deciding query expressibility with p-Datalog descriptions

In this section we present an algorithm for query expressibility of p-Datalog descriptions.

In so doing, we develop the techniques that will allow us in the next section to give an

elegant and improved solution to the problem of answering queries using an infinite set of

views described by a p-Datalog program.

Our algorithm, the Query Expressibility Decision algorithm, is an extension of the

classic algorithm for deciding query containment in a Datalog program that appears in

[RSUV89] (also see [Ull89]). The algorithm takes as input a conjunctive query and a

p-Datalog description and it tries to identify one expansion of the p-Datalog program that is

equivalent to our query. We next illustrate the workings of the algorithm with an example.

Example 5.2.1 Let us revisit the bibliographic source of previous examples. Assume

that the source contains a table books(isbn, author, publisher), a word index on titles,

title index(title word, isbn) and an author index au index(au name, isbn). Also assume

that the query capabilities of the source are described by the following p-Datalog program:

(D4) ans(A, P) ← books(Id, A, P), ind1(Id1), ind2(Id2), equal(Id, Id1), equal(Id, Id2)

ind1(Id) ← title index(V, Id), equal(V, $c)

ind1(Id) ← ε

ind2(Id) ← au index(V, Id), equal(V, $c)

ind2(Id) ← ε

Let us consider the query

(Q41) ans(X, Y) ← books(Id, X, Y), title index(“Zen”, Id), au index(“Smith”, Id)

First we produce its rectified equivalent


(Q42) ans(X, Y) ← books(Id, X, Y), title index(V1, Id1), au index(V2, Id2),

equal(V1, “Zen”), equal(V2, “Smith”), equal(Id, Id1), equal(Id, Id2)

Clearly, the above query is expressible by the description. Intuitively, our algorithm

discovers expressibility by “matching” the p-Datalog program rules with the subgoals. In

particular, the “matching” is done as follows: first we create a DB containing a “frozen

fact” for every subgoal of the query. Frozen facts are derived by turning the variables into

unique constants which will be denoted with a bar.

Moreover, we want to capture all the information carried by equal subgoals into the

DB. If, for example, subgoals equal(X, Y ), equal(X, Z) exist in the query, we will generate

“frozen” facts for all implicit equalities as well, i.e., equal(Y, X), equal(Y, Z) etc. In the

interests of space and clarity, we will write equal(X, Y, Z) to mean that all the previously

mentioned facts are in the DB. The DB for our running example is then

books(id, x, y), title index(v1, id1), au index(v2, id2), equal(id, id1, id2),

equal(v1, “Zen”), equal(v2, “Smith”)

We then evaluate the p-Datalog program (D4) on the DB, deriving more facts for the IDBs.3

In addition, we keep track of the set of frozen facts, called the supporting set, that we use

for deriving each fact. Below is the set of facts and supporting sets derived by a particular

3Notice that, because of the existence of unsafe rules in the p-Datalog program, evaluating the p-Datalog program will result in the production of non-ground facts. The semantics of these facts is that they are true for all values of their variables.


evaluation of the Datalog program.4

< ind1(Id), {} >

< ind2(Id), {} >

(1) < ans(x, y), {books(id, x, y), equal(id, id1)} >

< ind1(id1), {title index(v1, id1), equal(v1, “Zen”)} >

< ind2(id2), {au index(v2, id2), equal(v2, “Smith”)} >

(2) < ans(x, y), {books(id, x, y), title index(v1, id1), equal(v1, “Zen”),

au index(v2, id2), equal(v2, “Smith”), equal(id, id1, id2)} >

Every ans fact that is identical to the frozen head of the client query “corresponds” to a

query that contains the client query. Furthermore, we can derive the containing query from

the <fact, supporting set> pair by translating “frozen” facts back into subgoals. In our

running example, the two containing queries5 correspond to (1) and (2). If the supporting

set is identical to the DB that we started with (modulo redundant equality subgoals) then

the corresponding query is equivalent to the client query. Indeed, the query corresponding

to (2) is

ans(X, Y) ← books(Id, X, Y), title index(“Zen”, Id), au index(“Smith”, Id)

which is equivalent (actually identical) to our given query. 2

Algorithm QED starts by mapping the subgoals of the given query into “frozen” facts,

such that every variable maps to a unique constant, thus creating the canonical database

[RSUV89; Ull89] of the query, and then evaluates the p-Datalog program on it, trying to

produce the “frozen” head of the query. Moreover, it keeps track of the different ways

to produce the same fact; that is achieved by “annotating” each produced fact f with its

supporting facts, i.e., the facts of the canonical DB that were used in that derivation of f .

4Notice that the supporting set for fact (1) below contains equal(id, id1). Fact (1) is produced from rule

ans(A, P) ← books(Id, A, P), ind1(Id1), ind2(Id2), equal(Id, Id1), equal(Id, Id2)

and facts books(id, x, y), equal(id, id1) and ind1(Id), ind2(Id). As we explained earlier, the semantics of ind1(Id) is that it stands for ind1(x) for every x, and the same for ind2(Id). Setting x to id1 produces the given supporting set.

5Algorithm QED uses pruning to eliminate (1) from the output.


We next formalize the notion of the canonical database. A formal definition of

supporting facts follows.

Definition: Canonical DB of Query Q Let Q : H ← G1, . . . , Gk, . . . , E1, . . . , Em be a

rectified conjunctive query, where G1, . . . , Gk are the ordinary subgoals and E1, . . . , Em are

the equality subgoals. Select a mapping τ that assigns to every variable X of Q a unique

“frozen” constant τ(X) = x and is the identity mapping on constants and predicate names.

This way we construct k “frozen” ordinary facts: τ(G1), . . . , τ(Gk). We also construct m

“frozen” facts of the EDB predicate equal: τ(E1), . . . , τ(Em). These m facts constitute

an instance of the equal relation. We create additional equal facts so that we get the

smallest set of equal facts that includes this instance and is an equivalence relation. All the

constructed facts constitute the canonical DB of query Q. 2

Notice that this DB contains two “kinds” of constants: “regular” constants and frozen

constants.

Example 5.2.2 Consider the rectified query:

(Q43) ans(Y) ← p(X, X1), q(X2, Y, Z), equal(X, X1), equal(X1, X2), equal(Z, c)

The canonical DB produced by this query is

p(x, x1), q(x2, y, z), equal(x, x1), equal(x1, x2), equal(z, c), equal(x, x), equal(x1, x1),

equal(x2, x2), equal(x, x2), equal(x2, x), equal(x1, x), equal(x2, x1), equal(c, z),

equal(z, z), equal(c, c)

2
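The freezing step, plus the closure of the equal facts into an equivalence relation, can be sketched with a small union-find. The representation choices below (strings for terms, (predicate, argument-tuple) pairs for facts) are this sketch's own:

```python
def canonical_db(ordinary, equalities):
    """Build the canonical DB: freeze variables (uppercase name -> lowercase
    frozen constant) and close the equal facts under reflexivity, symmetry
    and transitivity."""
    def freeze(term):
        return term.lower() if term[0].isupper() else term

    facts = {(p, tuple(freeze(a) for a in args)) for p, args in ordinary}

    parent = {}                            # union-find over frozen terms
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in equalities:
        parent[find(freeze(a))] = find(freeze(b))

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), []).append(x)
    for members in classes.values():       # emit every pair within a class
        facts.update(("equal", (a, b)) for a in members for b in members)
    return facts

# Canonical DB of (Q43): p(X, X1), q(X2, Y, Z) with the listed equalities.
cdb = canonical_db([("p", ["X", "X1"]), ("q", ["X2", "Y", "Z"])],
                   [("X", "X1"), ("X1", "X2"), ("Z", "c")])
```

On (Q43) this produces the 15 facts listed in Example 5.2.2: two ordinary facts and 13 equal facts for the equivalence classes {x, x1, x2} and {z, c}.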

Shorthand notation: Before we proceed, let us formalize the shorthand notation

introduced in Example 5.2.1. It is obvious that if the equal facts form an equivalence relation, the

constants and frozen constants appearing in equal facts are divided into equivalence classes.

Let us look at the canonical DB of some query Q. If variables X1, . . . , Xk appearing

in the canonical DB belong to the same equivalence class, we replace all equal facts

involving X1, . . . , Xk by equal(X1, . . . , Xk). For example, equal(X1, X2, X3) “stands for” all

equal(Xi, Xj), 1 ≤ i, j ≤ 3.

The canonical DB produced by query (Q43) above can be written as

p(x, x1), q(x2, y, z), equal(z, c), equal(x, x1, x2)


It is easy to see that

equal(Y1, . . . , Yl) is a subset of equal(X1, . . . , Xm) iff ∀i ≤ l, Yi ∈ {X1, . . . , Xm}

Definition: Supporting Set of Fact Let h be an ordinary fact produced by an

application of the p-Datalog rule

r : H ← G1, . . . , Gk, E1, . . . , Em

of a p-Datalog description P on a database DB that consists of a canonical database

CDB and other facts, and let µ be a mapping from the rule into the database such that

µ(Gi), µ(Ej) ∈ DB and h = µ(H). The set Sh of supporting facts of h, or supporting set of

h, with respect to P , is the smallest set such that

• if µ(Gi) ∈ CDB, then µ(Gi) ∈ Sh,

• if µ(Gi) ∉ CDB and S′ is the set of supporting facts of µ(Gi), then S′ ⊆ Sh,

• if E is the set of all µ(Ei) ∈ Sh, then the smallest set of equality facts that includes

E and is an equivalence relation is included in Sh.

2

Let us notice that Sh is the set of leaves of a proof tree [Ull89] for h. We can further

annotate the produced fact with the “id” of the rule used in its production, thus generating

the whole proof tree for this fact.

Example 5.2.3 We can apply the rule

ans(X1, Z1) ← author(X1, Z1), publisher(Z2, W), equal(Z1, Z2), equal(W, $w)

on the following canonical DB

author(a, b), author(a, a), publisher(d, f), publisher(g, h), equal(b, d), equal(a, g),

equal(f, “PrenticeHall”)

to produce fact ans(a, b). The supporting set S is

{author(a, b), publisher(d, f), equal(b, d), equal(f, “PrenticeHall”)}


2

We next define the notions of extended facts and extended canonical DB:

Definition: Extended Facts and Extended Canonical DB An extended fact is a pair

of the form < h,Sh >, where h is a fact and Sh is the supporting set for h, with respect to

some description P . Let Q be a rectified conjunctive query. The extended canonical DB of

Q is a database of extended facts < h,Sh >, such that every h belongs to the canonical DB

of Q. 2

Referring to Example 5.2.3, the extended fact “associated” with our production of ans(a, b)

is

< ans(a, b), {author(a, b), publisher(d, f), equal(b, d), equal(f, “PrenticeHall”)} >

We now introduce the notion of the corresponding query for a fact, that makes our intuition

about the supporting set explicit.

Definition: Corresponding Query Let < h,Sh > be an extended fact of the DB.

Then, for every fact gi ∈ Sh, we can define a mapping ρ that is the identity on constants

and predicate names and maps every frozen constant to the variable from which it came.

It is easy to see that this mapping is well-formed. Moreover, it maps Sh into a query

body and the fact h into a query head. The query Q:ρ(h) ← ρ(g1), . . . , ρ(gk) is called the

corresponding query for extended fact < h,Sh >. 2

Intuitively, the corresponding query is an instantiated expansion of the rules of the

description that can prove h and it uses only source and equality predicates.

Algorithm QED produces a set of candidate queries: these are the corresponding queries

for the produced extended facts. Candidate queries are described by the p-Datalog

description; they are the only “interesting” expansions, in that they could be equivalent to the

given query. As we will show later, each candidate query has an important property: its

projection over the empty list of attributes contains the projection over the empty list of

attributes of the given query Q. Said otherwise, the body of a candidate query contains the

body of the given query. That means that if there exists a candidate query whose head is

identical to the head of Q, then obviously this is a containing query for Q with respect to P .

Moreover, Q is expressible by P iff one of the candidate queries in the set is equivalent to

Q.


The algorithm is presented in detail in Figure 5.1. Notice that the algorithm only

generates maximal supporting sets for each produced fact. Therefore, the produced candidate

queries are in a sense “minimal.” We will formalize that notion later in this section.

Algorithm 5.2.4

Input: Minimized [Ull89] (non-rectified) conjunctive query Q of the form
H ← G1, G2, . . . , Gk, where H is of the form ans(X1, . . . , Xn);
(non-rectified) p-Datalog description P .

Output: A set of candidate queries.

Method:
Rectify P and Q.
Construct the extended canonical DB of Q.
Apply the rules of P to the facts in the DB to generate all possible extended facts,
using bottom-up evaluation [Ull89] modified in the following ways
(items 1 and 2 guarantee the generation of extended facts with maximal supporting sets):

1. Populate IDB relations with extended facts, i.e., if fact h is produced by
a rule, compute Sh and then enter < h,Sh > in the database iff
• < h,Sh > is not already in the database and
• no < h,S′h > where Sh ⊆ S′h is present in the DB.

2. When a new fact < h,Sh > is added to the DB, delete from the DB all
facts of the form < h,S′h >, where S′h ⊂ Sh.

3. If a rule is unsafe, i.e., some distinguished variables do not appear
in the rule body, simply leave those variables in the produced fact.

In the end:

4. If < h,Sh > is an extended fact, h is an ans fact and h contains
variables, delete the extended fact.

5. De-rectify the resulting extended facts, and the query Q.

6. Create the corresponding queries of the extended facts.

2

The treatment of unsafe rules is the same as in generalized magic sets [Ull89].

Figure 5.1: Algorithm QED
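The maximal-supporting-set bookkeeping of items 1 and 2 can be isolated in a small helper. A sketch, assuming facts are hashable keys and supporting sets are sets of facts (the function name and data layout are ours):

```python
def add_extended_fact(db, fact, support):
    """Insert <fact, support> into db, enforcing items 1 and 2 of
    Algorithm 5.2.4: reject non-maximal supports, evict dominated ones."""
    support = frozenset(support)
    existing = db.setdefault(fact, [])
    # item 1: skip if an equal or larger supporting set is already stored
    if any(support <= s for s in existing):
        return False
    # item 2: drop stored supporting sets strictly contained in the new one
    db[fact] = [s for s in existing if not s < support] + [support]
    return True

ext_db = {}
add_extended_fact(ext_db, "ans(x, y)", {"books(id, x, y)", "equal(id, id1)"})
add_extended_fact(ext_db, "ans(x, y)", {"books(id, x, y)"})  # pruned: subset
```

After the two calls, only the larger supporting set survives, mirroring the pruning of fact (1) by fact (2) in Example 5.2.1.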

We proceed to give results on the correctness and running time of the algorithm. Before

that, let us just demonstrate with an example why rectification is necessary.

Example 5.2.5 To illustrate why rectification is necessary in identifying the candidate

queries, let us consider the query

ans(X)← p(X, c)


and the p-Datalog description6

ans(A)← p(A,B)

Evaluating the description on the canonical DB {p(x, c)} (without rectification), would

produce the extended fact < ans(x), {p(x, c)} >. The corresponding query is

ans(X)← p(X, c)

which is not a correct candidate query, because it is not expressible (by Definition 5.1.1) by

the given description. If on the other hand we use rectification, we get the canonical DB

{p(x, y), equal(y, c)}. Evaluating the description on it, we get the candidate query

ans(X)← p(X, Y )

which is a containing query for our given query (but not equivalent). 2

Now we are ready to state some formal results about algorithm QED. We ultimately

state and prove its correctness criterion (i.e., that it solves the expressibility problem)

and establish its computational complexity.

Lemma 5.2.6 Algorithm QED produces extended facts with maximal supporting sets.

By maximal, we mean that if < h,Sh > and < h,S′h > are two extended facts for the same

fact h, it cannot be that Sh ⊆ S′h or that S′h ⊆ Sh. Lemma 5.2.6 then follows directly from Algorithm 5.2.4.

Theorem 5.2.7 Soundness and Completeness of Set of Candidate Queries Let Q

be a query, P be a p-Datalog description and {Qi} be the set of candidate queries that is

the result of algorithm QED on Q and P . Then the following are true:

1. For all i, πφQ ⊆ πφQi.

2. For all i the identity mapping can map the body of Qi to the body of Q.

3. If R is a query described by P and is not in {Qi} then

• πφR does not contain πφQ or

6This is obviously the description of a source with a very simple query interface.


• there exists an i such that the heads of R and Qi are identical and Qi ⊆ R.

Moreover, the identity mapping µ is a containment mapping from R to Qi.

4. If R is a query described by P and is not in {Qi}, R ≡ Q only if there exists i such

that Qi ≡ Q.

Proof: (2) is derived directly from the Algorithm and (1) is a direct consequence of the

existence of the mapping. (4) is a direct consequence of (1) and (3). For (3): Algorithm

QED is exhaustive, i.e., it generates all “relevant” (in the sense of (1)) candidate queries,

with the exception of those that are pruned due to Lemma 5.2.6. So let R : HeadR ← BodyR

be “relevant” and not in the candidate set. Then, for the extended fact7 < HeadR, BodyR >,

BodyR is not a maximal supporting set. That means that there exists an extended fact

F : < HeadR,S > such that BodyR ⊆ S. It is then clear from the definition of a corresponding

query that the corresponding query QF to F is contained in R, and that the mapping from

R to QF is the identity. 2

Theorem 5.2.7 says that, for any described query R that is not in the candidate set,

either R is not equivalent to Q, or there already exists a “smaller” query Qi in the candidate

set that still “contains” Q. In the above sense, the candidate set contains “minimal” queries.

Moreover, it says that queries not in the candidate set are not “interesting”: even if R ≡ Q,

there is always a query Qi in the candidate set that is also equivalent to Q.

Algorithm QED produces output that allows us to correctly decide query expressibility.

To that effect, we prove the following:

Lemma 5.2.8 Expressibility Criterion Q is expressible by P iff the set of supporting

facts for some extended fact < h,Sh > of the frozen head h of Q is identical8 to the canonical

DB for Q.

Proof:

IF: It is obvious from the way the “corresponding” query is defined, that if DB ≡ Sh,

then the corresponding query is equivalent to Q.

ONLY IF: The output of algorithm QED contains candidate queries for which

Theorem 5.2.7 holds, i.e., there is no expansion that is a “tighter fit” to the given query than

the queries in the output. If, for every Sh, there exists some fact in the canonical DB that

7Where Head, Body are “frozen.”
8After de-rectification of both.


is not in Sh, then the corresponding query cannot be equivalent to Q: Q is minimized, and

minimization is unique up to isomorphism, so all subgoals (i.e., all facts in the canonical

DB) are necessary for equivalence. 2

The number of extended facts that can be generated per “real” fact is equal to the

number of different maximal supporting sets for the fact, i.e., it is exponential in the size

of the canonical DB. The number of facts is exponential in the size of the description, so

we have the following:

Theorem 5.2.9 Algorithm QED produces an answer in time exponential in the size of the

description and the size of the query.

Notice that the problem of query containment in Datalog is reducible to the problem of

query expressibility described here. Query containment in Datalog is EXPTIME-complete

[RSUV89]. Hence we have the following:

Theorem 5.2.10 Query expressibility is EXPTIME-complete.

Therefore, Algorithm 5.2.4 meets the theoretical lower bound.

5.2.1 Expressibility and translation

Let us consider the case of a wrapper that receives a query. It is easy to see that we

could extend Algorithm 5.2.4 so that it annotates each fact not only with its supporting

set, but also with its proof tree. The wrapper can then use the proof tree to perform the

actual translation of the user query in source-specific queries and commands, by applying

the translating actions that are associated with each rule of the description [PGGMU95;

JBHM+97].

5.3 Answering Queries Using p-Datalog Descriptions

Mediators are faced with a tougher problem than wrappers, as explained in Chapter 1:

Given the descriptions for one or more wrappers, the mediator has to answer the user

query by sending to the wrappers only queries expressible by the wrapper descriptions and

consequently combine the answers to produce the answer to the given query. This is the

Capabilities-Based Rewriting (CBR) problem. The mediator considers rewritings of the

user query that are conjunctive rules, as described below.


Definition: Rewriting of Query Given a conjunctive query Q and a set of queries

{Q1, . . . , Qn}, of the form

ansi ← bodyi, i = 1, . . . , n

a rewriting of Q using {Qi} is a rule Q′ of the form

ans ← ans1, . . . , ansn, optional equalities

such that Q′ ≡ Q. 2

As we have said in previous sections, a source description defines the (possibly infinite)

set of conjunctive queries answerable by the source. So, the CBR problem is equivalent

to the problem of answering the user query using an infinite set of views described by a

Datalog program [LRU96].

Our CBR algorithm proceeds in two steps. The first step finds a finite set of

expansions. The second step uses an algorithm for answering queries using views [LMSS95;

Qia96] to combine some of these expansions to answer the query. The first step uses

Algorithm 5.2.4 to generate a finite set of expansions (see Figure 5.1). We prove that if

we can answer the query using any combination of expressible queries, then we can answer

it using a combination of expansions in our finite set. In [LRU99], a solution is presented

for the problem whose complexity is doubly exponential in the size of the query and the

description. The solution is based on “signatures” for the expansions of the description,

that divide the queries that are expressible by the description into equivalence classes. We

will show that our solution is non-deterministic exponential in the size of the query and the

description. Moreover, the proof of our solution is much more intuitive and simpler.

Given a user query Q and a wrapper description P in p-Datalog, Algorithm QED

produces all9 the candidate queries of Q with respect to P . We can show that there is at most

an exponential number of those:

Lemma 5.3.1 The output of Algorithm 5.2.4 contains at worst an exponential number of

queries, whose length is at most linear in the size of the given user query.

Moreover, we can prove that these are the only queries expressible10 by P that are

“relevant” in answering Q.

9Modulo variable renaming.
10The corresponding queries Qi, which are the output of Algorithm 5.2.4, are actually described by P .


Theorem 5.3.2 (CBR) Assume we have a query Q and a p-Datalog description P without

tokens, and let {Qi} be the result of applying Algorithm 5.2.4 on Q and P . There exists a

rewriting Q′ of Q, such that Q′ ≡ Q, using any {Qj |Qj is expressible by P} if and only if

there exists a rewriting Q′′, such that Q′′ ≡ Q, using only {Qi}.

Proof: The if direction is trivial. For the only if: It must be that πφ(Q) ⊆ πφ(Qj)

[LMSS95]. Since Qj is expressible by P , Qj could be a candidate query. But {Qi}

contains all the “interesting” candidate queries of Q with respect to P by Theorem 5.2.7.

Therefore, for any Qj , either Qj ∈ {Qi} or there exists some “corresponding” Qi such

that Qi ⊆ Qj , and the containment mapping from Qj to Qi is the identity mapping. Let

Q′: Qj1 , . . . , Qjk , . . . , Qjm be the rewritten query. If we replace each Qjk with its

“corresponding” Qik identified above, then Q′′: Qi1 , . . . , Qim is also equivalent to Q. In proof:

• There exists a containment mapping from Q′′ to Q. In particular, the identity mapping

is a containment mapping from Q′′ to Q.

• There exists a containment mapping from Q to Q′ and from Q′ to Q′′, and therefore

also from Q to Q′′.

Therefore, by the containment mapping theorem [CM77], Q′′ and Q are equivalent. 2

Since all we need to solve the rewriting problem is to compute the candidate queries

(using Algorithm 5.2.4), we need an algorithm to combine some of the candidate queries

into a rewriting of the given query. The problem of finding an equivalent rewriting of a

query using a finite number of views is known to be NP-complete in the size of the query

and the view set [LMSS95] and there are known algorithms for solving it in the absence of

tokens [LMSS95; Qia96]. Hence, the total computational complexity of our CBR scheme in

the worst case is

• First stage (QED): Exponential in the size of the query and the description.

• Second stage (answering queries using views): NP in the size of its input. The size of

the input is the cardinality of the candidate set times the size of the largest candidate.

Since the QED algorithm has output of exponential size, the second stage dominates and

the total complexity of the algorithm in the worst case is nondeterministic exponential. In

particular, the cardinality of the candidate set is exponential in the arity of the head of the

candidate queries and, more importantly, in the size of the canonical database. (See also

Section 5.4.2.)


5.3.1 CBR with binding requirements

The discussion in the previous section ignores the presence of tokens. To handle tokens in

the p-Datalog description, we need to modify both steps of our CBR scheme. Let us discuss

what changes are necessary.

To correctly solve the CBR problem in the presence of binding requirements, we first

modify the QED algorithm. Let us consider an example that will show that algorithm

QED, if used unchanged, is inadequate for the solution of the CBR problem with binding

patterns.

Example 5.3.3 Let the “target” query be

(Q44) ans(X)← p(c, Y ), p(Y, X)

and let the description be

(D5) v(X)← p($c,X)

The rectified query is

(Q45) ans(X)← p(A, Y ), p(Y1, X), equal(A, c), equal(Y, Y1)

The rectified p-Datalog description of the source is

(D6) ans(W )← p(B,W ), equal(B, $c)

Algorithm QED produces the following candidate query (after de-rectification):

(Q46) ans(Y )← p(c, Y )

There is no rewriting of (Q44) using only (Q46) that is equivalent to (Q44). But there is a

way to answer (Q44) using our p-Datalog description. To see that, let us rewrite the query

and the view to make the binding patterns explicit:

(Q′44) ansfb(X, A) ← p(A, Y ), p(Y, X)

(D′5) vfb(X, A) ← p(A,X)


Then we can rewrite (Q44) as follows:

(Q′′44) ansfb(X, A)← vfb(Y, A), vfb(X, Y )

This rewriting respects the binding requirements of the views, is processed by passing Y

bindings, and is equivalent to the target query. 2
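The evaluation strategy in Example 5.3.3 can be made concrete by treating the view as a lookup function whose second argument must be bound. A minimal Python sketch over an invented instance of p (the data and function names are ours, not from the thesis):

```python
# Toy EDB instance of the binary relation p; the data is invented for illustration.
p = {("c", "a"), ("c", "b"), ("a", "d"), ("b", "d"), ("d", "e")}

def v(a):
    """The view v^fb(X, A): given a binding for A, return all X with p(A, X)."""
    return {x for (y, x) in p if y == a}

def rewriting(a):
    """(Q''44) ans^fb(X, A) <- v^fb(Y, A), v^fb(X, Y): the Y bindings produced
    by the first view call are passed to the second call."""
    return {x for y in v(a) for x in v(y)}

# Direct evaluation of the target query (Q44): ans(X) <- p(c, Y), p(Y, X).
direct = {x for (a, y) in p if a == "c" for (y2, x) in p if y2 == y}

assert rewriting("c") == direct  # the rewriting computes exactly (Q44)
```

On this toy instance both evaluations return {"d"}, illustrating that the rewriting respects the binding requirement of the view while remaining equivalent to the target query.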

Algorithm 5.3.4

Input: Minimized [Ull89] (non-rectified) conjunctive query Q of the form H ← G1, G2, . . . , Gk, where H is of the form ans(X1, . . . , Xn); (non-rectified) p-Datalog description P .

Output: A set of candidate queries with binding patterns.

Method: Rectify P and Q. Construct the extended canonical DB of Q. Replace tokens in P with variables; annotate rules with binding information. Apply the rules of P to the facts in DB to generate all possible extended facts using bottom-up evaluation [Ull89], modified in the following ways:

1. Populate IDB relations with extended facts, i.e., if fact h is produced by the rule, compute Sh and then enter < h, Sh > in the database iff
   • < h, Sh > is not already in the database, and
   • no < h, S′h > where Sh ⊆ S′h is present in the DB.

2. When a new fact < h, Sh > is added to the DB, delete from the DB all facts of the form < h, S′h >, where S′h ⊂ Sh.

3. If a rule is unsafe, i.e., some distinguished variables do not appear in the rule body, simply leave those variables in the produced fact.

4. Update the bound-variables annotation for the extended fact: a variable gets an annotation when it binds to an already annotated variable.

In the end:

5. If < h, Sh > is an extended fact, h is an ans fact, and h contains variables, delete the extended fact.

6. De-rectify the resulting extended facts, and the query Q.

7. Create the corresponding queries of the extended facts. Use the binding information to construct their binding patterns.

2

The treatment of unsafe rules is the same as in generalized magic sets [Ull89].

Figure 5.2: Algorithm QED-T

Therefore, we need to modify algorithm QED. The necessary change over QED consists

basically of a pre-processing step: replace tokens in the p-Datalog description with variables,


but maintain as an extra annotation the information that these variables need to be bound.

In particular, that information can be attached to each extended fact as an extra annotation.

The modified algorithm QED-T is presented in detail in Figure 5.2.
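Steps 1 and 2 of Algorithm QED-T keep, for each fact, only the extended facts with maximal supporting sets. A sketch of this subsumption-based maintenance (the dictionary representation of the extended-fact database is our own):

```python
def add_extended_fact(db, h, sh):
    """Insert <h, sh> into db (a dict: fact -> list of supporting sets),
    keeping only maximal supporting sets per fact.  Returns True if added."""
    sh = frozenset(sh)
    sets = db.setdefault(h, [])
    # Step 1: reject the fact if its supporting set is already subsumed.
    if any(sh <= s for s in sets):
        return False
    # Step 2: drop existing supporting sets strictly contained in sh.
    db[h] = [s for s in sets if not s < sh] + [sh]
    return True

db = {}
add_extended_fact(db, "ans(d)", {1})
add_extended_fact(db, "ans(d)", {1, 2})   # replaces the subsumed set {1}
add_extended_fact(db, "ans(d)", {2})      # subsumed by {1, 2}: rejected
assert db["ans(d)"] == [frozenset({1, 2})]
```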

Applying that modification to the previous example, (D5) becomes

(D′′5) v(W )← p(B,W )

where B needs to be bound. Algorithm QED-T on this input produces two candidate

queries:

(Q47) ans(X)← p(Y1, X)

where Y1 needs to be bound, and

(Q48) ans(Y )← p(A, Y )

where A needs to be bound. Finally, QED-T uses the binding information to turn the

candidate queries into queries with binding patterns. So, (Q47),(Q48) turn into

(Q49) ansfb(X, Y1)← p(Y1, X)

and

(Q50) ansfb(Y, A)← p(A, Y )

Queries (Q49),(Q50) together with (Q44) are the input to the second stage of our CBR

scheme, which per Section 5.3 is an algorithm for answering queries using views. The algorithms [LMSS95; Qia96] mentioned in the previous section do not deal properly with tokens.

As we have mentioned in Section 5.1, tokens describe binding requirements. Therefore, we

need to take into account the binding requirements of candidate queries. [RSU95] studies

the problem of answering queries using views with binding requirements. The authors use

binding patterns to describe binding requirements. They show that the problem is NP-

complete and they also describe an algorithm for it. The algorithm takes as input a finite

set of conjunctive views with binding patterns and a “target” query with a binding pattern

and rewrites the query using the views in a way that respects the view binding patterns.

Example 5.3.3 is an example of query rewriting using views with binding patterns.

We use this algorithm, henceforth referred to as the AnsBind algorithm, for the second


part of our CBR scheme, that is, to find a rewriting of the user query using the candidate

queries. Using (Q49), (Q50) and (Q44) as input to AnsBind, we obtain the correct rewriting

of (Q44) that is shown in Example 5.3.3.

Theorem 5.3.5 (CBR-tokens) Assume we have a query Q and a p-Datalog description

P with tokens, and let {Qi} be the result of applying Algorithm 5.3.4 on Q and P . There

exists a rewriting Q′ ≡ Q, using any {Qj |Qj is expressible by P} if and only if there exists

a rewriting Q′′ ≡ Q, using only {Qi}.

Proof: The only difference from QED is that QED-T is “missing” some candidate queries

by ignoring tokens. But it is easy to see that any candidate query we are thus “missing”

is identical to one of the queries in the candidate set of QED-T modulo equality subgoals.

Moreover, if there is a rewriting of a query using some candidate Qi with some binding

pattern, then there is also a rewriting of the query using Qi without a binding pattern. The

theorem then follows. 2

The solution for the CBR problem with binding requirements is also non-deterministic

exponential.

5.4 An interesting and more efficient class of p-Datalog de-

scriptions

We identify an interesting class of p-Datalog descriptions with a simple syntactic characteri-

zation, for which the CBR algorithm of Section 5.3 is much more efficient. In particular, for

this class of descriptions the output of the QED algorithm is exponential only in the arity

of the candidate query head, and does not depend on the size of the canonical database.

Hence, the second stage of the CBR scheme is more efficient, since it receives smaller input.

Overall, the CBR scheme for this class is non-deterministic exponential in the arity of the

head predicate.

Definition: A p-Datalog description P belongs in Ploop if and only if

• P contains only one IDB predicate

• If p is the IDB predicate and

R : p(X1, . . . , Xn)← pred1(A1, . . . , Am), . . . , p(Y1, . . . , Yn), . . . , predk(B1, . . . , Bl)


is any recursive rule in which p appears, then Yi is actually Xi for all i.

2

Descriptions in Ploop therefore consist of simple loops and exit rules.

Example 5.4.1 Let us repeat the description of the source of Example 5.1.3. The source

accepts queries on books given one or more words from their abstracts, assuming there exists

an abstract index abstract index(abstract word, Id). The following p-Datalog program is

used to describe this source.

(D7)

ans(Id,Aut, T itl, Pub, Y r, Pg) ← books(Id,Aut, T itl, Pub, Y r, Pg), ind(Id)

ind(Id) ← abstract index($c, Id)

ind(Id) ← ind(Id), abstract index($c, Id)

The above description clearly belongs in Ploop.11 2

We use lattices to help explain why the output of QED on descriptions in Ploop does not

depend on the size of the canonical database but depends solely on the arity of the ans

facts. The next subsection is a short reminder about lattices.

5.4.1 Lattice Framework

Let us consider the subset relation ⊆ between sets.

We denote a lattice with set of elements (supporting sets in this section) L and the

subset relation ⊆ by 〈L,⊆〉. For elements a and b of a lattice 〈L,⊆〉, a ⊂ b means that

a ⊆ b and a ≠ b.

The ancestors and descendants of an element of a lattice 〈L,⊆〉, are defined as follows:

ancestor(a) = {b | a ⊆ b}

descendant(a) = {b | b ⊆ a}

Note that every element of the lattice is its own descendant and its own ancestor. The

immediate proper ancestors of a given element a belong to a set we shall call next(a).

Formally,

next(a) = {b | a ⊂ b, ∄c : a ⊂ c ⊂ b}
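For a small ground set, these lattice notions can be computed directly; the following sketch builds the subset lattice and the ancestor, descendant, and next sets (function names are ours):

```python
from itertools import combinations

def powerset_lattice(universe):
    """All subsets of the universe, ordered by the subset relation."""
    u = list(universe)
    return [frozenset(c) for r in range(len(u) + 1) for c in combinations(u, r)]

def ancestors(L, a):
    return {b for b in L if a <= b}   # every element is its own ancestor

def descendants(L, a):
    return {b for b in L if b <= a}   # ... and its own descendant

def next_of(L, a):
    """next(a): the immediate proper ancestors, i.e. the elements covering a."""
    proper = {b for b in L if a < b}
    return {b for b in proper if not any(a < c < b for c in proper)}

L = powerset_lattice({1, 2, 3})
a = frozenset({1})
assert next_of(L, a) == {frozenset({1, 2}), frozenset({1, 3})}
assert a in ancestors(L, a) and a in descendants(L, a)
```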

11The description also happens to be monadic[AHV95]. Descriptions in Ploop in general don’t have to be.


Figure 5.3: Supporting set lattice for fact f for a database of size 5

Figure 5.4: Supporting sets and least common ancestor

In the terminology of lattices, with ⊂ being the ordering relation, a is covered by the

elements of next(a) [DP90].

It is common to represent a lattice by a lattice diagram, a graph where the lattice

elements are nodes and there is an edge from a below to b above if and only if b is in

next(a) (i.e., if and only if a is covered by b). Thus, for any two lattice elements x and y,

the lattice diagram has a path downward from y to x if and only if x ⊆ y.

Figure 5.3 shows the lattice diagram for the possible supporting sets of a fact f for

a database of size 5. The next subsection discusses the size of the output of the QED

algorithm for the Ploop class of p-Datalog descriptions.

5.4.2 QED and Ploop

The cardinality of the candidate set produced by QED can in general be exponential in the

size of the canonical database. Figure 5.3 gives a graphical explanation: even for a single fact f , the number of incomparable supporting sets can grow exponentially with the size of the database. Therefore, the number of

candidate queries can also be exponential in the size of the canonical database.

For descriptions in Ploop, let us make the following crucial observation: Let Si and Sj

be two supporting sets for fact f that are produced by algorithm QED with a description

P that is in Ploop. Let S be their least common ancestor, as in Figure 5.4. Then, S is also

produced by QED for f . Since QED only keeps extended facts with maximal supporting

sets, the extended fact < f,S > will be kept for f , and it will replace the extended facts

< f,Si > and < f,Sj >.

Thus, it is easy to see that only one extended fact per fact f will be generated, and

therefore just one candidate query. Therefore, the output of the QED algorithm for Ploop,

and thus the complexity of the second stage of the CBR scheme, is only exponential in the

arity of the head of the candidate queries, and not in the size of the canonical database.
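The key property is that for descriptions in Ploop the least common ancestor of two supporting sets, which in the subset lattice is simply their union, is again a supporting set, so the maintenance of maximal sets collapses everything to one extended fact per fact. A toy sketch of this collapse:

```python
def collapse_supporting_sets(sets):
    """For a P_loop description, the least common ancestor (here: set union)
    of any two supporting sets of a fact is also a supporting set, so the
    unique maximal supporting set is the union of all of them (a sketch)."""
    out = frozenset()
    for s in sets:
        out |= frozenset(s)
    return out

# Two supporting sets for the same fact f ...
si, sj = {1, 3}, {2, 3}
s = collapse_supporting_sets([si, sj])
assert s == frozenset({1, 2, 3})                  # their least common ancestor
assert frozenset(si) < s and frozenset(sj) < s    # <f, s> subsumes both
```

Since only the single extended fact with the union set survives, QED produces one candidate query per fact, independent of the canonical database's size.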

The importance of the class lies in our observation that it is expressive

enough to describe a large number of common sources, such as document retrieval systems

and Web-based sources.


5.5 Expressive Power of p-Datalog

We have illustrated the use of p-Datalog programs as a source description language. In

this section, we explore some limits of its description capabilities. It should be noted that

although we focus here on the description of conjunctive queries, similar results hold when

negation and disjunction are introduced.

Clearly, there are sets of conjunctive queries that cannot be described by any p-Datalog

description. Moreover:

Lemma 5.5.1 There exist recursive sets of conjunctive queries that are not expressible by

any p-Datalog description.

Proof: As we have seen in the previous section, the decision procedure for the description

semantics of p-Datalog is exponential. Therefore, any recursive set of conjunctive queries

with a membership function that is superexponential is not expressible by any p-Datalog

description. 2

However, the practical question is whether there are recursive sets of conjunctive queries,

that correspond to “real” sources and cannot be expressed by p-Datalog programs. We show

next that some common sources (intuitively the “powerful” ones) exhibit this behavior.

Before we prove this result, we demonstrate the expressive abilities and limitations of p-

Datalog.

Let us start with an observation: For every p-Datalog description program P , the arity

of the result is exactly the arity of the ans predicate. This restriction is somewhat artificial,

since we can define descriptions with more than one “answer” predicate. However, even

in that case, a given program would still bound the arities of answers. Furthermore, a

more serious bound is the number of variables that occur in any one of the rules of the

program. We will see that this bound imposes severe restrictions on the queries that

can be expressed.

But first, if we bound the number of variables, we can show the following:

Theorem 5.5.2 Let k be some integer. Let p1, . . . , pm be the EDB predicates of a database.

There exists a p-Datalog program P that describes all conjunctive queries with at most k

variables on this database.

Proof: We show the construction for k = 3 and for the case where p1, . . . , pm are each


predicates of arity two. The program that can describe all conjunctive queries is the fol-

lowing:

(D8) ans3(Xi, Xj , Xl) ← temp(X1, X2, X3),∀i, j, l ≤ 3

ans2(Xi, Xj) ← ans3(Xi, Xj , Xl),∀i, j, l ≤ 3

ans1(Xi) ← ans2(Xi, Xj),∀i, j ≤ 3

ans0() ← ans1(X)

temp(X1, X2, X3) ← pl(Xi, Xj), temp(X1, X2, X3), ∀l ≤ m, ∀i, j ≤ 3

temp(X1, X2, X3) ← pl(Xi, $c), temp(X1, X2, X3), ∀l ≤ m, ∀i ≤ 3

temp(X1, X2, X3) ← pl($c,Xj), temp(X1, X2, X3), ∀l ≤ m, ∀j ≤ 3

temp(X1, X2, X3) ← pl($c1, $c2), temp(X1, X2, X3), ∀l ≤ m

temp(X1, X2, X3) ← ε

where X1, X2, X3 are distinct variables. It is easy to see that a similar construction can

provide the program that describes all conjunctive queries for k > 3 and larger arities. 2
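The rule schema (D8) expands into a concrete finite program; the following sketch enumerates its recursive temp rules for k = 3 variables and m binary EDB predicates, conflating the distinct tokens into one symbol for brevity (the textual rule format is ours):

```python
from itertools import product

def d8_temp_rules(m, k=3):
    """Enumerate the recursive temp rules of program (D8) for binary EDB
    predicates p1..pm and k variables.  The distinct tokens $c1, $c2 of the
    schema are conflated into the single symbol $c in this sketch."""
    xs = [f"X{i}" for i in range(1, k + 1)]
    head = f"temp({', '.join(xs)})"
    rules = []
    for l in range(1, m + 1):
        # each argument position is either one of the k variables or a token
        for a, b in product(xs + ["$c"], repeat=2):
            rules.append(f"{head} <- p{l}({a}, {b}), {head}")
    rules.append(f"{head} <- .")   # the exit rule temp <- epsilon
    return rules

rules = d8_temp_rules(m=2)
# per predicate: (k + 1)^2 = 16 recursive rules, plus the single exit rule
assert len(rules) == 2 * 16 + 1
```

The count m(k + 1)^2 matches the four recursive rule families of (D8): m·k^2 variable-variable rules, 2·m·k rules with one token, and m rules with two tokens.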

As mentioned above, a fixed p-Datalog program bounds the arity of the results, but

this bound is not the only cause of limitation. Even if we focus on arity-0 results, i.e.,

queries that answer yes or no and do not provide data, p-Datalog is limited. The limitation

is related to the number of variables. Let FOk be the set of sentences of first order logic

[AHV95] with at most k variables. Note that the same variable can be “reused” as much

as needed using quantification. The following relates the queries described by a p-Datalog

program to formulas expressible in first-order logic with a bounded number of variables.

It states that although one such query may use an arbitrary number of variables, with

appropriate “reuse” only a bounded number of variables suffice.

Lemma 5.5.3 Let P be a p-Datalog program and k the maximum number of variables

occurring in a rule of P . Then for each Q expressible by P , Q is equivalent to a query in

FOk.

Proof: Let x1, . . . , xk be the variables appearing in the rules of description P . Also, let

Q′ : ans(u1)← p1(u2), p2(u3), . . . , pn(un)

be in descr(P ) such that Q ≡ Q′. We will show that Q′ is equivalent to a first order sentence


with only k variables.

The proof is by induction on the number of resolution steps used to construct a rule. If

Q′ is a rule of P , then the claim is true. Otherwise, when doing a step of the resolution, let

qi be the literal that is unified with some rule head. Then, the variables not used in qi can

be reused existentially quantified for the extra variables in the rule. To illustrate, let the

rules be as follows:

(1) p(X, Y )← q(Y, Z), r(Z,X, W )

(2) q(X, Y )← s(X, Y, Z,W, A,B)

The maximum number of variables in the rules is 6. The result of resolving rule (1) with

(2) is the following rule:

p(X, Y )← s(Y, Z, Z ′,W ′, A′, B′), r(Z,X, W )

The logical sentence associated with this rule is

∀X, Y (∃Z,Z ′,W ′, A′, B′,W (s(Y, Z, Z ′,W ′, A′, B′) ∧ r(Z,X, W ))→ p(X, Y ))

which is equivalent to

∀X, Y (∃Z,W (r(Z,X, W ) ∧ ∃X, W,A′, B′(s(Y, Z, X,W, A′, B′)))→ p(X, Y ))

The last statement above uses exactly 6 variables. 2

The limitation on the number of variables of the program prohibits the description of the

set of all conjunctive queries over a schema — a set that is supported by common powerful

sources.

Theorem 5.5.4 Let the database schema S have a relation of arity at least two. For every

p-Datalog description P over S, there exists a boolean query Q over S, such that Q is not

expressible by P . (So, in particular, there is no p-Datalog description that could describe

a source that can answer all conjunctive queries, even if we fix the arity of the answer.)

In order to prove the theorem, we first need to prove the following lemma:

Lemma 5.5.5 There exists a boolean conjunctive query with k variables that is not ex-

pressible as a conjunctive query with k − 1 variables.


Proof: Let us consider the following query Q:

ans() ← G(x1, x2), . . . , G(x1, xk), . . . ,
G(xi, x1), . . . , G(xi, xi−1), G(xi, xi+1), . . . , G(xi, xk), . . . ,
G(xk, x1), . . . , G(xk, xk−1)

This query asks whether G has a self-loop or a clique of size k, and it is in FOk. Q cannot

be expressed by an FOk−1 formula, as can be shown by playing a pebble game [Imm82]

with k−1 pairs of pebbles on the following two structures: G1, a k-clique without self-loop,

and G2, a (k − 1)-clique without self-loop. These two structures are indistinguishable in

the (k − 1)-pebble game. Therefore, because of Lemma 5.5.3, Q cannot be expressed by a

conjunctive query with k − 1 variables. 2

Now we are ready to prove Theorem 5.5.4.

Proof: Let S (without loss of generality) contain the binary predicate G. Suppose such

a description P exists. Let k be the maximum number of variables in a rule of P . Then

each conjunctive query expressible with P is in FOk by Lemma 5.5.3. But by the proof of Lemma 5.5.5, the boolean query for a (k + 1)-clique without self-loops is not expressible in FOk, and therefore not expressible by P , a contradiction. 2

Theorem 5.5.4 points out a rather serious limitation of p-Datalog descriptions.

5.6 Related Work

Many projects have dealt with data integration of structured sources (e.g., [LMR90; A+91;

HM93; K+93; T+90]). These projects ignored the problem of the different and limited

query capabilities of information sources, which is important for integration systems that

deal with heterogeneous sources. In Chapter 2 we discussed the approaches taken by a

newer generation of projects. In what follows, we discuss some theoretical work in this

area.

Papakonstantinou et al. [PGGMU95] suggested a grammar-like approach for describing

query capabilities and Levy et al. [LRU96] used Datalog with tokens for the same purpose.

These works are focused on showing how we can compute a query Q given a capabilities

description P . The algorithm presented in [PGGMU95] only applies to specific classes of

descriptions. [LRO96] proposes using capability records for source capability description.

Capability records are strictly less expressive than p-Datalog descriptions. More recently,


Florescu et al. [FLMS99] discussed performance and implementation issues for plan gener-

ation in the presence of access limitations on data modelled by binding patterns.

The following subsection discusses the use of tokens for the description of binding re-

quirements and compares that approach to the use of binding patterns [RSU95; Ull89].

5.6.1 Describing binding requirements in p-Datalog

As we have already noticed, sources often can answer only queries that have specific bind-

ing requirements. As mentioned in Section 5.1, we are using tokens to specify that some

constant is expected in some fixed position in the query, i.e., to implicitly define the bind-

ing requirements of described queries. In contrast, [RSU95] uses explicit enumeration of

accepted binding patterns [Ull89] for each described query to achieve the same goal.

Example 5.6.1 Let us consider the following p-Datalog rule:

(Q51) ans(X, Y )← p(X, Z, $c1), q(Y, Z, $c2,W )

(Q51) describes a join query that requires two bindings, one for the third argument of

relation p and one for the third argument of relation q. Using the notation of [RSU95], also

used in [Ull89], we could write (Q51) above as follows:

(Q52) ansffbb(X, Y,A, B)← p(X, Z, A), q(Y, Z, B, W )

This query describes the same binding requirements as (Q51). 2
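The translation from token notation, as in (Q51), to the explicit adorned notation of (Q52) can be mechanized for a single rule: each token becomes a fresh bound head variable. A sketch (the textual rule format and helper names are ours):

```python
import re

def adorn(head_vars, body):
    """Turn a conjunctive rule with $-tokens into the [RSU95]-style adorned
    form: tokens become fresh bound head variables (annotated 'b'), while
    the original head variables remain free ('f')."""
    fresh = iter("ABCDEFG")   # fresh variable names for the bound positions
    bound = []

    def repl(_match):
        v = next(fresh)
        bound.append(v)
        return v

    new_body = re.sub(r"\$\w+", repl, body)          # replace each $-token
    pattern = "f" * len(head_vars) + "b" * len(bound)
    head = f"ans{pattern}({', '.join(head_vars + bound)})"
    return f"{head} <- {new_body}"

# (Q51) ans(X, Y) <- p(X, Z, $c1), q(Y, Z, $c2, W)  becomes  (Q52):
q52 = adorn(["X", "Y"], "p(X, Z, $c1), q(Y, Z, $c2, W)")
assert q52 == "ansffbb(X, Y, A, B) <- p(X, Z, A), q(Y, Z, B, W)"
```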

Explicitly specifying accepted binding patterns as in (Q52) presents a number of prob-

lems. In particular, it obscures the distinction between variable and constant in the rule.

This complicates answering the query expressibility question. Moreover, and more im-

portantly, explicit specification of binding patterns does not generalize in the presence of

recursion. When query capabilities are described with a p-Datalog program, it is not even

possible to enumerate all possible binding patterns: the description encodes a possibly

infinite number of described queries that have different bound variables.

On the other hand, using tokens allowed us to naturally extend the description of binding

requirements to the case of p-Datalog programs. The difference is made clearer by the

following example.


Example 5.6.2 Let us revisit Example 5.1.3, that describes a particular bibliographic

source. The p-Datalog description for that source is the following12

ans(I, A, T, P, Y, Pg) ← books(I,A, T, P, Y, Pg), ind(I)

ind(I) ← abstract index($c, I)

ind(I) ← ind(I), abstract index($c, I)

As we saw in Example 5.1.3, the source describes the following infinite family of con-

junctive queries:

ans(I, A, T, P, Y, Pg) ← books(I, A, T, P, Y, Pg), abstract index(c1, I)

ans(I, A, T, P, Y, Pg) ← books(I,A, T, P, Y, Pg), abstract index(c1, I),

abstract index(c2, I)

etc.

The queries in this family have an increasing number of bound variables, so their binding

patterns would look like this:

ffffffb,

ffffffbb,

etc.

The use of tokens allows us to describe the binding requirements succinctly. 2
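The growing family of queries and binding patterns in this example can be generated mechanically; a sketch (the underscored predicate names and string format are ours):

```python
def described_query(n):
    """The n-th member of the family: n abstract_index subgoals, each
    requiring one bound keyword c1..cn."""
    head = "ans(I, A, T, P, Y, Pg)"
    subgoals = ["books(I, A, T, P, Y, Pg)"]
    subgoals += [f"abstract_index(c{i}, I)" for i in range(1, n + 1)]
    return f"{head} <- {', '.join(subgoals)}"

def binding_pattern(n):
    """Six free head arguments plus one bound argument per keyword."""
    return "f" * 6 + "b" * n

assert binding_pattern(1) == "ffffffb"
assert binding_pattern(2) == "ffffffbb"
assert "abstract_index(c2, I)" in described_query(2)
```

No finite enumeration of such patterns covers the whole family, which is exactly why the token notation, rather than explicit binding patterns, is needed for recursive descriptions.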

5.7 Conclusions and open problems

In this chapter we discussed the problems of (i) describing the query capabilities of sources

and (ii) using the descriptions for mediation. We discussed these problems for a capability-

description language that is a Datalog variant, called p-Datalog. We provided algorithms

for solving (i) the expressibility and (ii) the CBR problems.

The first algorithm decides whether a given query is equivalent to one of the queries

described by a p-Datalog program. Within an integration system such as TSIMMIS, a

variant of this algorithm is run by the wrapper to perform the translation of queries issued

by the mediator or the user to queries and commands in the sources’ native language.

12 Variable names are changed.


In particular, as explained in Appendix A, each rule of the capability description can be

associated with a translating action, in the spirit of Yacc. Algorithm 5.1 can be extended

to annotate facts not only with their supporting set, but also with their proof tree. The

wrapper then uses the algorithm to match a submitted query with a query described by the

capability description, and uses the proof tree generated by the algorithm to perform the

translation of the query into source-specific queries and commands, by applying to it the

appropriate translating actions.

The second algorithm is run by the CBR module of the mediators and it finds out if a

given query can be computed using queries which are expressible by a p-Datalog program.

The output of the algorithm is a logical query plan that can be further optimized by a

query optimizer, as shown in Figure 1.6. This conceptually clean separation of the CBR

module from the rest of the query processing modules is challenged in environments where

the number of logical query plans may become huge. When we need to optimize a multiway

join query where the source data can be joined in a large number of permutations, it is

infeasible to have the CBR module generate all possible logical plans and consequently have

the optimizer cost them and pick the least expensive. Instead the CBR module has to aid

the optimizer by pruning the space of plans it generates. In [PGH98], the authors describe

an extension of a CBR algorithm with System R’s dynamic programming pruning technique

[SAC+79]. A more aggressive approach has been followed by the Garlic project [ROH99],

where the capabilities-based rewriting is part of the cost-based query optimizer. Capabilities

are encoded as rewriting rules that describe the subplans (instead of the queries) that are

understood by a wrapper. The interaction of CBR with query optimization, especially

in the context of adaptive query optimization [Lev00], presents fertile ground for further

research.

This chapter also studies the expressive power of p-Datalog. We showed that p-Datalog

is more powerful than using conjunctive queries with binding patterns but we also reached

the important negative result that p-Datalog cannot describe the query capabilities of

certain powerful sources. In particular, we showed that there is no p-Datalog program that

can describe all conjunctive queries over a given schema. Indeed, there is no program that

describes all boolean conjunctive queries over the schema. A direct consequence of our

result is that p-Datalog cannot model a full-fledged relational DBMS.

We have focused exclusively on conjunctive queries. It is an interesting open problem to

extend this work to non-conjunctive queries, e.g., queries involving aggregates and negation.


Chapter 6

The Capability Description

Language RQDL

The limitations of p-Datalog for describing the query capabilities of powerful sources make it

useful to explore the use of more powerful capability-description languages. In this chapter

we study RQDL, a novel language proposed recently [PGH96] for capability-description

in the absence of detailed schema information. RQDL extends p-Datalog by allowing the

representation of attribute lists of arbitrary length through attribute vectors. In particular,

in this chapter

• We formally describe and extend RQDL, and prove that it is more powerful than

p-Datalog.

• We provide an algorithm that allows us to build networks of mediators by exporting

the query capabilities of a mediator in terms of the capabilities of its underlying

sources — as described by the TSIMMIS architecture (Figures 1.5 and 1.6). The

algorithm takes as input descriptions of the query capabilities of sources and outputs

a description of all queries supported by a mediator that accesses these sources.

• We provide a reduction of RQDL descriptions into p-Datalog augmented with function

symbols of a specific form. The reduction has important practical and theoretical

value. From a practical point of view, it reduces the CBR and query expressibility

problems for RQDL to the corresponding problems for p-Datalog, thus giving complete

algorithms that are applicable to all RQDL descriptions. From a theoretical point of


view, it clarifies the difference in expressive power between RQDL and p-Datalog. We

present the reduction, as well as the algorithms for the complete RQDL language.

Section 6.1 introduces RQDL. Section 6.2 discusses the use of RQDL to describe media-

tor capabilities accurately. Section 6.3 describes the reduction of RQDL to p-Datalog with

function symbols and Section 6.4 describes the query expressibility and CBR algorithms for

RQDL.

6.1 The RQDL Description Language

Given the limitations of p-Datalog for the description of powerful information sources, we propose the use of a more powerful query description language. RQDL (Relational Query

Description Language) is a Datalog-based rule language used for the description of query

capabilities. It was first proposed in [PGH96] and used for describing query capabilities

of information sources. [PGH96] shows its advantages over Datalog when it is used for

descriptions that are not schema specific, i.e., the description does not refer to specific

relations or arities in the schema of the specific source. In this way the descriptions are

more concise and they gracefully handle schema evolution.

In this chapter we present a formal specification of extended-RQDL, which provably

allows us to describe large sets of queries. For example, we can prove that extended-RQDL (from now on referred to simply as RQDL), unlike p-Datalog, can describe the set of all conjunctive queries. Furthermore, we reduce RQDL

descriptions to terminating p-Datalog programs with function symbols. Consequently, the

decision on whether a given conjunctive query is expressed by an RQDL description is

reduced to deciding expressibility of the query by the resulting p-Datalog program.

Note that the reduction of RQDL to Datalog with function symbols is important because

• It reduces the comparison between the expressive power of p-Datalog and RQDL to

a comparison between Datalog and Datalog with function symbols.

• It reduces the decision procedure for expressibility to Algorithm 5.2.4. This reduction

allows us to give a complete solution to the CBR problem for RQDL.

Subsections 6.1.1 and 6.1.2 demonstrate the use of RQDL for the description of source

capabilities and define the syntax and semantics of RQDL. Section 6.3 describes the reduc-

tion of RQDL descriptions to p-Datalog programs with function symbols and Section 6.4


proceeds to give algorithms for query expressibility by RQDL descriptions and for the CBR problem for RQDL descriptions.

6.1.1 Using RQDL for query description

To support schema-independent descriptions, RQDL allows the use of predicate tokens[1] in

place of the relation names. Furthermore, to allow tables of arbitrary arity and column

names, RQDL provides special variables called vector variables, or simply vectors, that

match with sets of relation attributes that appear in a query. Vectors can “stand for”

arbitrarily large sets of attributes. It is this property that eventually allows the description

of large, interesting sets of conjunctive queries (like the set of all conjunctive queries).

Example 6.1.1 illustrates RQDL’s ability to describe source capabilities without referring

to a specific schema. Example 6.1.2 demonstrates an RQDL program that describes all

conjunctive queries over any schema. Subsection 6.1.2 describes the formal syntax and

semantics of RQDL. Before we go ahead with the examples, let us introduce some notation.

Named Attributes in Conjunctive Queries: For notational convenience, we slightly

modify the query syntax so that we can refer to the components of tuples by attribute

names instead of column numbers. For example, consider the relation book with schema

book(title, isbn). We will write book subgoals by explicitly mentioning the attribute names;

instead of writing

ans()← book(X, Z), equal(X, DataMarts)

we will write

ans()← book(title : X, isbn : Z), equal(X, DataMarts)

We will be using named attributes in the rest of this chapter. Every predicate will then

have a set of named attributes (and not a list of attributes). The connection of this scheme

to SQL syntax is evident.
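For illustration only (this representation is ours, not part of the thesis), the named-attribute convention can be mirrored in code by modeling each subgoal as a predicate name plus a mapping from attribute names to terms, so that attribute order is immaterial:

```python
# Hypothetical sketch: a subgoal with named attributes, as in
# book(title : X, isbn : Z); the attributes form a set, not a positional list.
from dataclasses import dataclass

@dataclass(frozen=True)
class Subgoal:
    pred: str
    args: tuple  # sorted (attribute, term) pairs

def subgoal(pred, **attrs):
    return Subgoal(pred, tuple(sorted(attrs.items())))

# ans() <- book(title : X, isbn : Z), equal(X, DataMarts)
g = subgoal("book", title="X", isbn="Z")
# attribute order does not matter, unlike positional notation:
assert g == subgoal("book", isbn="Z", title="X")
```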

Example 6.1.1 Consider a source that accepts queries that refer to exactly one relation

and pose exactly one selection condition over the source schema.

ans() ← $r(→V), item(→V, $a, X′), equal(X′, $c)

[1] Predicate tokens belong to the same sort as tokens (see Chapter 5).


In the above RQDL description, $r is a predicate token, “standing in for” any predicate name, →V is an attribute vector, and item is a metapredicate describing an element of →V. The description[2] describes, among others, the query

ans()← books(title : X, isbn : Z), equal(X, DataMarts)

because, intuitively, we can map the predicate token $r to relation books, →V to the set of attribute-variable pairs {title : X, isbn : Z}, X′ to X, and $c to DataMarts. The metapredicate item(→V, $a, X′) declares that the variable X′ maps to one of the variables in the set of attribute-variable pairs that →V is mapped to, i.e., X′ maps to one of the variables of the subgoal $r. The token $a maps to the attribute name of the variable X′ in the mapping of →V. $a can map to any of the attribute names and hence X′ can map to either X or Z.

RQDL descriptions do not have to be completely schema-independent. For example, let

us assume that we can put a selection condition only on the title attribute of the relation.

Then we modify the above RQDL description as follows:

ans() ← $r(→V), item(→V, title, X′), equal(X′, $c)

The replacement of $a by title forces the selection condition to refer to the title attribute

only. □

Next we present the RQDL description PCQ that describes all conjunctive queries over

any schema.

Example 6.1.2

(i) ans(→V1) ← cond(→V), →V1 ⊆ →V

(ii) cond(→V) ← $p(→V1), cond(→V2), →V = →V1 ∪ →V2

(iii) cond(→V) ← item(→V, $a, X), equal(X, $c), cond(→V)

(iv) cond(→V) ← item(→V, $a1, X1), item(→V, $a2, X2), equal(X1, X2), cond(→V)

(v) cond(→V) ← $p(→V)

The description above describes any rectified conjunctive query (without arithmetic). The description works conceptually as an evaluator of an SQL query: the predicate cond captures the “Cartesian product” of the FROM clause, constructed by rules (ii) and (v); rules (iii) and (iv) apply one selection condition and one join condition, respectively, over one variable; and rule (i) describes any projection. The union metapredicate in rule (ii) creates the attribute vector of the “augmented” condition.

□

[2] Notice that both the RQDL descriptions and the queries are rectified.

6.1.2 Semantics of RQDL

An RQDL description is a finite set of RQDL rules. The description semantics of RQDL is

a generalization of the description semantics of p-Datalog, to account for the existence of

vectors and metapredicates. We start by defining an expansion of an RQDL description.

Definition: Let P be an RQDL description with a particular IDB predicate ans. The set

of expansions EP of P is the smallest set of rules such that:

• each rule of P that has ans as the head predicate is in EP ;

• if r1: p ← q1, . . . , qn is in EP, r2: r ← s1, . . . , sm is in P, and the substitution θ is the most general unifier of some qi and r, then the resolvent

θp ← θq1, . . . , θqi−1, θs1, . . . , θsm, θqi+1, . . . , θqn

of r1 with r2 using θ is in EP.

□

Unification: Unification extends to vectors in the following way:

1. a vector can unify with another vector, yielding a vector;

2. a vector can unify with a set of attribute-variable pairs, yielding that set; for example, p(→V) can unify with p(attr1 : X, attr2 : Y), yielding p(attr1 : X, attr2 : Y).
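To make the two cases concrete, here is a hedged Python sketch (our own encoding, not the thesis's: vector variables are strings, attribute-variable sets are frozensets of pairs) of vector unification:

```python
# Illustrative sketch of vector unification: a vector unifies with another
# vector (the two are aliased) or with a set of attribute-variable pairs
# (the vector is bound to that set).
def unify_vector(v, other, subst):
    """Unify vector v with `other` under substitution `subst` (a dict)."""
    v = subst.get(v, v)
    if isinstance(v, frozenset):       # v already bound to a pair set
        return subst if v == other else None
    if isinstance(other, str):         # vector-with-vector: alias them
        subst[other] = v
        return subst
    subst[v] = other                   # vector-with-pair-set: bind the set
    return subst

# p(->V) unifies with p(attr1 : X, attr2 : Y), yielding that pair set:
pairs = frozenset({("attr1", "X"), ("attr2", "Y")})
s = unify_vector("->V", pairs, {})
assert s["->V"] == pairs
```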

Metapredicates: There are three metapredicates, and their argument lists have to be of a specific type. We define

union(→V, →V1, →V2) to mean →V = →V1 ∪ →V2


where →V is a vector and →V1, →V2 can be vectors or sets of attribute-variable pairs. We also define

item(→V, $a, X) to mean →V[$a] = X, and item(→V, a, X) to mean →V[a] = X,

which means that the variable X belongs to the set of attribute-variable pairs that →V maps to, with attribute name $a (or a). Here a is a constant, $a is a token, and X is a variable. →V can be a vector or a set of attribute-variable pairs. Finally, we define

subset(→V, →V1) to mean →V ⊆ →V1

where →V and →V1 can be vectors or sets of attribute-variable pairs, and →V can only appear in the head of the rule (in addition to the subset subgoal). The intuition behind subset is that

it allows us to do arbitrary projections. In any RQDL program P , subset can only appear

in rules whose head predicate does not appear in the body of any rule of P .

We call a metapredicate that does not contain any vectors ground.

Safety: Metapredicates must observe some binding pattern constraints. In particular, all

vectors that appear in metapredicates must be safe as defined below:

• If a vector appears in an EDB or IDB subgoal then it is safe.

• If a vector →V appears in a subgoal union(→V, →V1, →V2), and →V1 and →V2 are safe, then →V is also safe.

• If a vector →V appears in a subgoal subset(→V, →V1) and →V1 is safe, then →V is also safe.

Following the definition of description semantics of Section 5.1, we now define the de-

scription semantics of RQDL.

Definition (Set of Queries Described/Expressible by an RQDL Program): The

set of terminal expansions TP of P is the subset of all expansions e ∈ EP containing only

EDB predicates or predicate tokens in the body. A valid terminal expansion is a terminal

expansion where all ground metapredicates evaluate to true.


The set of instantiated terminal expansions IP of RQDL description P is the set of all

(rectified) conjunctive queries τ(r), where r belongs to the set of terminal expansions of P ,

and τ is a mapping of the RQDL rule r to a conjunctive query that:

1. maps every token $c to a constant c. (Note, we consider relation names to be of

constant type.)

2. maps every vector →V to a set of attribute-variable pairs {(a1 : X1), . . . , (an : Xn)} through a mapping σ such that

(a) after we replace every predicate subgoal p(→V) with p(a1 : X1, . . . , an : Xn), no variable appears in more than one predicate subgoal,

(b) for every subgoal of the form union(→V, →V1, →V2), σ(→V) = σ(→V1) ∪ σ(→V2),

(c) for every subgoal of the form item(→V, a, X), σ(→V) includes a pair (a : X),

(d) for every subgoal of the form item(→V, $a, X), σ(→V) includes a pair (a : X), for some a,

(e) for every subgoal of the form subset(→V, →V1), σ maps →V to a subset of σ(→V1).

3. and drops all metapredicate subgoals.

The set of described queries of an RQDL description P with “designated” predicate ans

(when ans is understood) is the set of safe instantiated terminal expansions of P. □

Example 6.1.3 Let us refer to the RQDL description PCQ of Example 6.1.2. The RQDL

rule

R : ans(→V′) ← $p1(→V1), $p2(→V2), union(→V, →V1, →V2), item(→V, $a1, X1), item(→V, $a2, X2), equal(X1, X2), subset(→V′, →V)

is a terminal expansion of that RQDL description. In particular, this rule is derived from the

RQDL description PCQ by using rules (i), (iv), (ii) and (v) in that order. The conjunctive

query

Ri : ans(a1 : X, a2 : Y )← p(a1 : X, b : Z), q(a2 : Y, c : Z ′), equal(Z,Z ′)

is an instantiated terminal expansion of the RQDL description, since it is an instantiation

of rule R. In particular,


• $p1, $p2 map to predicate names p, q respectively.

• $a1, $a2 map to attribute names b, c respectively.

• →V1 maps to (a1 : X, b : Z), →V2 maps to (a2 : Y, c : Z′), and →V maps necessarily to their union, namely to (a1 : X, a2 : Y, b : Z, c : Z′).

• X1, X2 map to Z, Z′ respectively.

• →V′ maps to (a1 : X, a2 : Y).

All metapredicate subgoals are dropped. □

If Q is a conjunctive query with head predicate ans and P is an RQDL description, we say that Q is expressible by P if there exists a Q′ described by P such that Q ≡ Q′.

Referring to Example 6.1.3, query

Q : ans(a1 : A, a2 : B)← p(a1 : A, b : Z), q(a2 : B, c : Z ′), q(a2 : W, c : U), equal(Z,Z ′)

is expressible by the description PCQ, since it is equivalent to Ri.

Note here that RQDL can be easily extended (e.g., allowing not only tokens but also

variables in place of predicate names) to describe the capabilities of information sources

that understand and can process higher order logics, for example sources that understand

HiLog [CKW93] or F-Logic [KL89]. We do not pursue this issue further in this thesis.

The next section explains how to use RQDL to describe the capabilities of networks of

mediators.

6.2 RQDL and mediator capabilities

Let us revisit the mediation architecture of Figure 1.4. In a dynamic environment such as the

Internet, or the intranet of a big organisation, when integrating information, we would like

to be able to leverage existing integration infrastructure [Wie92]. Specifically, if a mediator

exists that offers an integrated view of some information we want to access, we would like to

be able to use that mediator, instead of accessing each one of the sources it integrates. Using

a mediator as an integrated information source means creating a network of mediators, as

in Figure 1.4. Using a mediator as a “source” to another mediator also means that we must

be able to describe the mediator capabilities. As explained in the introduction, mediators


often have query processing capabilities that allow them to “handle” every conjunctive

query over the data that they integrate.

Given the expressiveness results of Section 5.5, p-Datalog cannot describe the capabilities

of such a mediator. However, RQDL is powerful enough for that task. Let us consider a

mediator M that integrates sources S1, . . . , Sn and let the descriptions of these sources be

D1, . . . , Dn. Also, assume that each wrapper understands one answer predicate, and let

these be ans1, . . . , ansn. Then, the RQDL program DM that describes the capabilities of

the mediator is the following:

ans(→V1) ← cond(→V), subset(→V1, →V)

cond(→V) ← choose(→V1), cond(→V2), union(→V, →V1, →V2)

cond(→V) ← item(→V, $a, X), equal(X, $c), cond(→V)

cond(→V) ← item(→V, $a1, X1), item(→V, $a2, X2), equal(X1, X2), cond(→V)

cond(→V) ← choose(→V)

choose(→V) ← ans1(→V)

...

choose(→V) ← ansn(→V)

D1

...

Dn

The similarity of this description to PCQ of Example 6.1.2 is evident. DM describes all

conjunctive queries that the mediator can answer, that is, any conjunctive query that combines results from queries accepted by the sources the mediator integrates; hence the concatenation of D1, . . . , Dn in DM. Given D1, . . . , Dn, the description DM obviously can be generated automatically.
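Since DM has a fixed shape, its generation can be sketched as follows (an illustrative helper of our own, not from the thesis; rules are written in plain ASCII, with ->V standing for a vector):

```python
# Hypothetical generator for the mediator description DM: the fixed rule
# template above, plus one choose rule per source answer predicate, followed
# by the concatenated source descriptions D1..Dn.
def mediator_description(n, source_descriptions):
    rules = [
        "ans(->V1) <- cond(->V), subset(->V1, ->V)",
        "cond(->V) <- choose(->V1), cond(->V2), union(->V, ->V1, ->V2)",
        "cond(->V) <- item(->V, $a, X), equal(X, $c), cond(->V)",
        "cond(->V) <- item(->V, $a1, X1), item(->V, $a2, X2), "
        "equal(X1, X2), cond(->V)",
        "cond(->V) <- choose(->V)",
    ]
    rules += [f"choose(->V) <- ans{i}(->V)" for i in range(1, n + 1)]
    return rules + list(source_descriptions)  # concatenate D1..Dn

dm = mediator_description(2, ["<D1 rules>", "<D2 rules>"])
assert "choose(->V) <- ans2(->V)" in dm and dm[-1] == "<D2 rules>"
```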

Next we will discuss an efficient algorithm for deciding whether a query is expressible

by an RQDL description. The algorithm is based on a reduction of both the query and

the description into a simple standard schema that facilitates reasoning about relations and

attribute names.


6.3 Reducing RQDL to p-Datalog with function symbols

Deciding whether a query is expressible by an RQDL description requires “matching” the

RQDL description with the query. This matching is a challenging problem because vectors have to match with nonatomic entities, i.e., sets of variables.

We present an algorithm that addresses this problem by reducing query expressibility

by RQDL descriptions to the problem of query expressibility by p-Datalog with function

symbols, i.e., we reduce the RQDL description into a corresponding description in p-Datalog

with function symbols. The reduction is based on the idea that every database DB can be

reduced to an equivalent database DB′ such that the attribute names and relation names

of DB appear in the data (and not the schema) of DB′. We call DB′ a standard schema

database. We then rewrite the query so that it refers to the schema of DB′ (i.e., the

standard schema) and we also rewrite the description into a p-Datalog description with

function symbols which refers to the standard schema as well.

Subsection 6.3.1 presents the conceptual reduction of a database into a standard schema

database. Subsection 6.3.2 presents the rewriting of queries and Subsection 6.3.3 presents

the rewriting of RQDL descriptions. Each of the subsections starts with one or two examples

and continues with a formal definition of the reduction, which can be skipped at the first

reading.

6.3.1 Reduction of a database to standard schema database

In order to reason with the relation names and attribute names of the queries, we conceptu-

ally reduce the original database into a standard schema database where the relation names

and the attribute names appear as data and hence can be manipulated without the need

for higher order syntax. First we present a reduction example and then we formally define

the reduction of a database into its standard schema counterpart.

Example 6.3.1 Consider the following database DB with schema

b(au, isbn) and f(subj , isbn).


b:
au        isbn
Smith     123
Jones     345

f:
subj      isbn
Logic     123
Theology  345

The corresponding standard schema database DB′ consists of two relations, tuple(table name, tuple id) and attr(tuple id, attr name, value), which are common to all standard schema databases. In the running example, DB′ is

tuple:
table name   tuple id
b            b(au,Smith,isbn,123)
b            b(au,Jones,isbn,345)
f            f(subj,Logic,isbn,123)
f            f(subj,Theology,isbn,345)

attr:
tuple id                   attr name   value
b(au,Smith,isbn,123)       au          Smith
b(au,Smith,isbn,123)       isbn        123
b(au,Jones,isbn,345)       au          Jones
b(au,Jones,isbn,345)       isbn        345
f(subj,Logic,isbn,123)     subj        Logic
f(subj,Logic,isbn,123)     isbn        123
f(subj,Theology,isbn,345)  subj        Theology
f(subj,Theology,isbn,345)  isbn        345

Notice above how we mechanically created one tuple id for each tuple of the original database. □

Definition: Given a database DB, we say that the standard schema database correspond-

ing to DB is the smallest database DB′ such that

1. its schema is tuple(table name, tuple id) and attr(tuple id , attr name, value), and


2. for every tuple t(a1 : v1, . . . , an : vn) in DB, there is a tuple tuple(t, t(a1, v1, . . . , an, vn))

in DB′ and for every attribute ai, i = 1, . . . , n there is also a tuple

attr(t(a1, v1, . . . , an, vn), ai, vi) in DB′.

□
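The reduction of the definition above is mechanical, and can be sketched in Python (our own encoding, not the thesis's: a database is a list of (relation, {attribute: value}) pairs; attributes are sorted for determinism, so the attribute order inside a generated tuple id may differ from the example above):

```python
# Illustrative sketch of the database reduction: build the tuple and attr
# relations of the standard schema, with one mechanically generated tuple id
# per original tuple.
def to_standard_schema(db):
    """db: list of (relation_name, {attr: value}); returns (tuple_rel, attr_rel)."""
    tuple_rel, attr_rel = [], []
    for rel, t in db:
        items = sorted(t.items())  # deterministic attribute order
        tid = f"{rel}({','.join(f'{a},{v}' for a, v in items)})"
        tuple_rel.append((rel, tid))
        for a, v in items:
            attr_rel.append((tid, a, v))
    return tuple_rel, attr_rel

tup, att = to_standard_schema([("b", {"au": "Smith", "isbn": "123"})])
assert tup == [("b", "b(au,Smith,isbn,123)")]
assert ("b(au,Smith,isbn,123)", "isbn", "123") in att
```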

6.3.2 Reduction of queries to standard schema queries

The RQDL expressibility algorithm first reduces a given conjunctive query Q over some

database DB into a corresponding query Q′ over the standard schema database DB′. The

reduction is correct in the following sense: the result of asking query Q′ on DB′ is equivalent,

modulo tuple-id naming, to the reduction into standard schema of the result of Q on DB.

To illustrate the query reduction, let us consider the following examples. We first con-

sider a boolean query Q over the schema of Example 6.3.1.

ans()← b(au : X, isbn : S1), f(subj : A, isbn : S2), equal(S1, S2), equal(A, Theology)

Query Q is reduced into the following query Q′:

tuple(ans, ans()) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1), attr(F, isbn, S2),

equal(S1, S2), attr(F, subj, A), equal(A, Theology)

Notice that for every ordinary subgoal we introduce a tuple subgoal and mechanically create a tuple id. For every attribute we introduce an attr subgoal. The tuple id for the

result relation ans is simply ans() because the result relation has no attributes. When the

query head has attributes, a single conjunctive query is reduced to a nonrecursive Datalog

program. For example, consider the following query that returns the authors and ISBNs of

books if their subject is Theology.

ans(au : X, isbn : S1) ← b(au : X, isbn : S1), f(subj : A, isbn : S2), equal(S1, S2),

equal(A, Theology)

This query is reduced to the following program Q′ where the first rule defines the tuple part

of the standard schema answer and the last two rules describe the attr part.


tuple(ans, ans(au,X , isbn,S1 )) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1),

attr(F, isbn, S2), equal(S1, S2), attr(B, au, X),

attr(F, subj, A), equal(A, Theology)

attr(ans(au,X , isbn,S1 ), au, X) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1),

attr(F, isbn, S2), equal(S1, S2), attr(B, au, X),

attr(F, subj, A), equal(A, Theology)

attr(ans(au,X , isbn,S1 ), isbn, S1) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1),

attr(F, isbn, S2), equal(S1, S2), attr(B, au, X),

attr(F, subj, A), equal(A, Theology)

In general, the reduction is accomplished by the following procedure:

Procedure 6.3.2 (Reduction) If Q’s head is ans(a1 : V1, . . . , an : Vn), generate a program

with n + 1 rules such that

1. One rule has head tuple(ans, ans(a1, V1, . . . , an, Vn)),

2. For every attribute ai, i = 1, . . . , n there is a rule with head

attr(ans(a1, V1, . . . , an, Vn), ai, Vi), and

3. All rules have the same body which is constructed by the following steps:

(a) For every subgoal of Q of the form r(a1 : X1, . . . , am : Xm), invent and associate

to it a unique variable T . The variables such as T bind to tuple id’s of the

standard schema database and hence we call them tuple id variables.

(b) Include in the standard schema query body the subgoal tuple(r, T ).

(c) For every attribute ai, i = 1, . . . ,m include in the standard schema query the

subgoal attr(T, ai, Xi).

(d) Add to the body all equality subgoals of the original query.

□

Here Xi can be a variable, a token, or a constant. It is easy to see that, under a few obvious constraints, the reduction is invertible.
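Procedure 6.3.2 can be sketched in Python as follows (an illustrative encoding of our own: a query is a head-attribute dict, a list of (relation, {attribute: term}) subgoals, and a list of equalities; standard-schema atoms are emitted as plain tuples):

```python
# Illustrative sketch of Procedure 6.3.2: one tuple rule for the head, one
# attr rule per head attribute, all sharing the same reduced body.
def reduce_query(head_attrs, subgoals, equalities):
    head_items = sorted(head_attrs.items())
    ans_tid = f"ans({','.join(f'{a},{v}' for a, v in head_items)})"
    body = []
    for i, (rel, attrs) in enumerate(subgoals):
        t = f"T{i}"                        # fresh tuple-id variable per subgoal
        body.append(("tuple", rel, t))     # step (b): the tuple subgoal
        for a, x in sorted(attrs.items()):
            body.append(("attr", t, a, x)) # step (c): one attr subgoal per attr
    body += [("equal",) + e for e in equalities]  # step (d)
    rules = [(("tuple", "ans", ans_tid), body)]
    for a, v in head_items:                # one attr rule per head attribute
        rules.append((("attr", ans_tid, a, v), body))
    return rules

rules = reduce_query({}, [("b", {"au": "X"})], [("X", "Smith")])
assert len(rules) == 1                     # boolean query: only the tuple rule
assert rules[0][0] == ("tuple", "ans", "ans()")
```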

Next we show how we reduce RQDL descriptions into p-Datalog descriptions over stan-

dard schema databases.


6.3.3 Reduction of RQDL programs to Datalog programs over the standard schema

In the previous sections we showed how schema information, i.e., relation and attribute

names, becomes data in standard schema databases. Based on this idea, we will reduce

RQDL descriptions to p-Datalog descriptions that do not use higher order features such

as metapredicates and vectors. In particular, we “reduce” vectors to tuple identifiers. In-

tuitively, if a vector matches with the arguments of a subgoal, then the tuple identifier

associated with this subgoal is enough for finding all the attr-variable pairs that the vector

will match to. Otherwise, if a vector →V is the result of a union of two other vectors →V1 and →V2, then we associate with it a newly constructed tuple id, the term u(T1, T2), where T1 and T2 are the tuple ids that correspond to →V1 and →V2. As we will see later, the reduction carefully produces a program that terminates despite the use of the u function.

Example 6.3.3 Let us first consider a simple but interesting one-rule description:

ans(→V) ← $p(→V), item(→V, name, X)

This RQDL rule describes all selection-projection queries that refer to any schema over one

relation, with the constraint that the schema of the relation contains an attribute “name.”

This description reduces to the following p-Datalog description:

tuple(ans, ans(T )) ← tuple($p, T ), attr(T1, name,X), equal(T, T1)

attr(ans(T ), $a,X) ← tuple(ans, ans(T )), attr(T1, $a,X), equal(T, T1)

The vector variable →V is reduced to the variable T, which matches with a tuple id. The metapredicate item(→V, name, X) is reduced to the predicate attr(T, name, X). □

Example 6.3.4 The description of Example 6.1.2 describes all boolean conjunctive queries.

It reduces into the following p-Datalog description (with function symbols):


tuple(ans, ans(T )) ← tuple(cond , cond(T )) (1)

attr(ans(T ), $a,X) ← attr(T1, $a,X), tuple(ans, ans(T2)), equal(T, T1, T2)

tuple(cond , cond(T )) ← tuple($p, T1), tuple(cond , cond(T2)), valid(T, T1, T2)

attr(cond(T ), $a,X) ← attr(T1, $a,X), tuple(cond , cond(T2)), equal(T, T1, T2)

tuple(cond , cond(T )) ← attr(T ′, $a,X), equal(X, $c), tuple(cond , cond(T )),

equal(T ′, T )

tuple(cond , cond(T )) ← attr(T1, $a1, X1), attr(T2, $a2, X2), equal(X1, X2),

tuple(cond , cond(T )), equal(T, T1, T2)

tuple(cond , cond(T )) ← tuple($p, T )

and subset flag(1) = 1.

The reduction of each rule is independent of the reduction of other rules. Notice that the

metapredicate subset in the first rule was reduced into a subset flag set on the rule. In

the third rule, notice that we reduced →V to T, which is “produced” by the predicate valid,

given T1 and T2. The predicate valid is defined by the simple rule

valid(T, T1, T2)← sort(T, u(T1, T2)) (6.1)

and the rules for sort, which are given in Appendix B. Sort is a standard list-sorting routine that takes a list (in the form of an arbitrary u-term) as input, sorts it, and returns the sorted

list (in the form of a right-deep u-term).

The predicate valid constructs a valid, new tuple id for the vector union that has “associated” with it all the attributes associated with the union of →V1 and →V2. The new tuple id is uniquely determined by the tuple ids of →V1 and →V2. In particular, it is the right-deep ordered binary tree whose leaves are the tuple ids in the unioned vectors. For the union of the attribute lists of three relations with attribute vectors →V1, →V2 and →V3, which reduce to tuple ids T1, T2 and T3 respectively,[3] the tuple id generated would be u(T1, u(T2, T3)); no other u-term of “length 3” is produced by sort and, consequently, valid. For another example, valid(T, u(t2, u(t3, t4)), u(t3, t5)) will bind T (by calling sort) to the sorted, right-deep u-term with leaves t2, . . . , t5, that is, u(t2, u(t3, u(t4, t5))). T will be bound to the same u-term by valid(T, u(t3, u(t4, t5)), u(t2, t5)) also.

[3] Without loss of generality, we assume that T1 < T2 < T3.
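The behavior of sort on u-terms can be sketched as follows (our own encoding of u-terms as nested ("u", left, right) Python tuples; the actual Datalog rules are in Appendix B): the leaves are collected, deduplicated, sorted, and rebuilt into a right-deep term, as in the examples above.

```python
# Illustrative sketch of sort on u-terms: collect the leaf tuple ids,
# deduplicate and order them, and rebuild a right-deep u-term.
def leaves(t):
    if isinstance(t, tuple):            # a ("u", left, right) node
        return leaves(t[1]) | leaves(t[2])
    return {t}                          # a leaf tuple id

def right_deep(ids):
    ids = sorted(ids)
    term = ids[-1]
    for x in reversed(ids[:-1]):
        term = ("u", x, term)
    return term

def sort_union(t1, t2):
    return right_deep(leaves(("u", t1, t2)))

# valid(T, u(t2, u(t3, t4)), u(t3, t5)) binds T to u(t2, u(t3, u(t4, t5))):
r = sort_union(("u", "t2", ("u", "t3", "t4")), ("u", "t3", "t5"))
assert r == ("u", "t2", ("u", "t3", ("u", "t4", "t5")))
```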


The above definition guarantees that the number of valid tuples generated from executing

the valid rules on any canonical database is bounded, and determined by the number of

distinct tuple ids appearing in the database.

Finally, the description also has to include the rules[4] of Figure 6.1, which make sure that all attributes of tuples with ids T1 and T2 are also attributes of tuples with id T, constructed from T1 and T2.

attr(T, $a, X) ← attr(T1, $a, X), valid(T, T1, T2)
attr(T, $a, X) ← attr(T2, $a, X), valid(T, T1, T2)

Figure 6.1: Default rules for generation of attr tuples

□

Formally, an RQDL description P is reduced to a p-Datalog description P ′ by reducing

each rule r of the description into p-Datalog with functions as follows:

Procedure 6.3.5 (Reduction)

1. If r has a subgoal of the form union(→V, →V1, →V2), include in the reduction the valid rule valid(T, T1, T2) ← sort(T, u(T1, T2)) and the rules for sort, as well as the rules of Figure 6.1.

2. Reduce predicates that do not involve vectors as described in Section 6.3.2.

3. For each subgoal of the form r(→V), where r is not a recursive predicate, include in the reduced rule a subgoal tuple(r, T). T is the reduction of →V.

4. For each subgoal of the form r(→V), where r is a recursive predicate, include in the reduced rule a subgoal tuple(r, r(T)).

5. For each subgoal of the form item(→V, a, X), where a is a token or a constant, include in the reduced rule the subgoal attr(T, a, X), where T is the reduction of →V.

6. For each subgoal of the form union(→V, →V1, →V2), replace in the reduced rule all instances of →V with T and include the subgoal valid(T, T1, T2), where T1 and T2 are the reductions of →V1 and →V2.

7. For each subgoal of the form subset(→V1, →V), let T1 be the reduction of →V1 and T be the reduction of →V. Replace T1 by T in the rule where subset appears, set the subset flag for the rule to 1 (see below), and drop the subset subgoal.

8. If the head is of the form p(→V), then reduce it to tuple(p, p(T)). Moreover, add the following rule to the reduction:

attr(p(T), $a, X) ← attr(T1, $a, X), tuple(p, p(T2)), equal(T, T1, T2)

9. If the head of r is of the form p(attr-var set), then follow Procedure 6.3.2 to generate all the p-Datalog rules that r reduces to.

□

[4] Note that we did not need to include these rules in Example 6.3.3.

The use of the subset flag of the head of a rule will be explained in Section 6.4.1. The intuition behind it is as follows: Assume the existence of a subgoal subset(→V1, →V) in rule r. As we have said earlier, →V1 must appear in the rule head, so let the head of r be p(→V1). Also, →V must appear in an ordinary subgoal, e.g., q(→V). The subset subgoal means that the RQDL rule r describes all conjunctive queries whose head attribute set is any projection of the attribute set of relation q. In the reduction, we replace T1 (the reduction of →V1) by T (the reduction of →V), saying effectively that the attribute set of p must be the same as the attribute set of q. That is why we set a flag on the rule, the subset flag: to remember, when deciding query expressibility and query description, to also consider as described those conjunctive queries that include projections on q.

Theorem 6.3.6 Let P be an RQDL description and P′ its reduction in p-Datalog with functions. Let also DB be a canonical standard schema database of a query Q. Then P′ applied on DB terminates.

Proof: It suffices to see that the generation of u-terms cannot fall into an infinite loop. For every two tuple ids in the canonical database, a new u-term is generated by a call to valid, which in turn calls sort. By a simple inductive argument on the number of calls to sort, it follows that for every two tuple ids in the canonical database, a unique u-term is generated. Consequently, if n tuple ids are present in the canonical database, at most exponentially many (in n) u-terms can be generated. □
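The ordering argument in the proof can be illustrated with a small sketch of ours (not the thesis implementation): if a u-term is represented canonically by the sorted set of base tuple ids it unions, then unioning in any order yields the same term, so for n base tuple ids at most exponentially many distinct u-terms can ever be invented.

```python
# Sketch: canonical u-terms. A u-term is represented by the sorted
# tuple of base ids it covers; unioning in any order yields the same
# term, so for n base ids at most 2**n - 1 distinct u-terms can arise.

def base_ids(term):
    # a base id is a string like "t0"; a u-term is ("u", id1, id2, ...)
    return set(term[1:]) if isinstance(term, tuple) else {term}

def valid(t1, t2):
    """Invent the canonical u-term for the union of t1 and t2."""
    return ("u",) + tuple(sorted(base_ids(t1) | base_ids(t2)))

left = valid(valid("t0", "t1"), "t2")    # (t0 ∪ t1) ∪ t2
right = valid("t0", valid("t1", "t2"))   # t0 ∪ (t1 ∪ t2)
```

Since both association orders produce the identical term, the bottom-up evaluation cannot keep inventing new tuple ids forever, which is the heart of the termination argument.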


The next section explains the semantics of p-Datalog with functions, and shows how to solve

the CBR problem for RQDL using the algorithms developed for p-Datalog in Sections 5.2

and 5.3.

6.4 QED and CBR for RQDL descriptions

The reduction presented in the previous section allows us to formulate a solution to the expressibility and CBR problems for RQDL descriptions.

In particular, we show in the next section that we can use QED with small changes for p-Datalog with function symbols; we prove that the modified QED is sound and complete over the fragment of p-Datalog with function symbols that is generated by the RQDL reduction. In the remaining sections, we denote that fragment of p-Datalog with function symbols by p-Datalogf.

The result of applying the QED algorithm to a query Q and an RQDL description P is a

finite set of expansions of P , i.e., a finite set of queries (over the original schema) described

by P . We show in Section 6.4.2 that these expansions are the only relevant ones in trying

to answer Q using queries expressible by P . Therefore, as in Section 5.3, the CBR problem

for RQDL is immediately reduced to the problem of rewriting a conjunctive query using a

finite set of conjunctive views.

6.4.1 The query expressibility problem for RQDL

We first illustrate QED for RQDL with an example. Notice that there are now two “designated” predicates, the predicates tuple and attr.

Example 6.4.1 Consider the query Q: ans(a : X) ← books(au : X, titl : Y ) and the description

ans(a : X) ← $r(au : X, titl : Y )
ans(b : Y ) ← $r(au : X, titl : Y )

The reduction of the query is

tuple(ans, ans(a, X)) ← tuple(books, T0), attr(T1, au, X), attr(T2, titl, Y ), equal(T0, T1), equal(T0, T2)
attr(ans(a, X), a, X) ← tuple(books, T0), attr(T1, au, X), attr(T2, titl, Y ), equal(T0, T1), equal(T0, T2)


The canonical DB is

tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)

The reduction of the description (after rectification) is

tuple(ans, ans(a, X)) ← tuple($r, T ), attr(T1, au, X), attr(T2, titl, Y ), equal(T, T1), equal(T, T2)
attr(ans(a, X), a, X) ← tuple($r, T ), attr(T1, au, X), attr(T2, titl, Y ), equal(T, T1), equal(T, T2)
tuple(ans, ans(b, Y )) ← tuple($r, T ), attr(T1, au, X), attr(T2, titl, Y ), equal(T, T1), equal(T, T2)
attr(ans(b, Y ), b, Y ) ← tuple($r, T ), attr(T1, au, X), attr(T2, titl, Y ), equal(T, T1), equal(T, T2)

Notice that we did not include the rules of Figure 6.1 or the valid rules in the reduced description, since the original description did not contain any metapredicates.

If we run Algorithm 5.2.4 on the canonical DB, the following extended facts are produced:

(1) < tuple(ans, ans(a, x)), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >

(2) < attr(ans(a, x), a, x), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >

< tuple(ans, ans(b, y)), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >

< attr(ans(b, y), b, y), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >

The output of the algorithm includes extended facts with the same tuple id, and we “group” together the extended facts that share a tuple id. We notice that the group consisting of the extended facts (1) and (2) corresponds exactly to the two conjunctive queries that are the reduction of Q. Therefore Q is expressible by our description. □
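The grouping step of the example can be sketched as follows (our own simplified encoding of extended facts, not the thesis implementation): facts are grouped by the query id appearing in their head, and Q is expressible when some group matches the reduction of Q exactly.

```python
# Sketch of the final grouping step: extended facts are grouped by the
# query id occurring in their head; Q is expressible when some group's
# frozen heads coincide with the heads of Q's standard-schema reduction.

from collections import defaultdict

def query_id(head):
    # head is ("tuple", "ans", qid) or ("attr", qid, const, var)
    return head[2] if head[0] == "tuple" else head[1]

def group_by_id(extended_facts):
    groups = defaultdict(set)
    for head, _support in extended_facts:
        groups[query_id(head)].add(head)
    return groups

support = frozenset({"canonical-db"})  # placeholder supporting set
facts = [(("tuple", "ans", ("ans", "a", "x")), support),    # fact (1)
         (("attr", ("ans", "a", "x"), "a", "x"), support),  # fact (2)
         (("tuple", "ans", ("ans", "b", "y")), support)]
groups = group_by_id(facts)
```

Here the group keyed by the query id ans(a, x) contains the two facts that mirror the reduction of Q, while the group for ans(b, y) does not.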

Before presenting the theorem that states the condition for RQDL expressibility, let us

make the following important observations:

• The only combination of function symbols and recursion is for the invention of tuple

ids (u-terms) for unioned attribute vectors.

• The valid program produces the same tuple id for the union of attribute vectors →V1, . . . , →Vn, regardless of the ordering of the unions, merely by ordering the tuple ids of the unioned attribute vectors.

• Lemmata 5.2.6 and 5.2.8 and Theorem 5.2.7 hold for p-Datalogf .

• Let Q be a conjunctive query and let {Qi | 1 ≤ i ≤ n} be the set of standard schema queries it reduces to. Let Hi be the heads of those queries. As we pointed out in Section 6.3.2, all Qi have the same body. Moreover, for Q1, H1 is of the form tuple(ans, T ), where T is a term that denotes a tuple id, and for all Qi, i ≠ 1, Hi is of the form attr(T, ci, Xi) for the same T . We call T the query id. In reference to the previous example, the query id is ans(a, X).

Theorem 6.4.2 Let Q be a conjunctive query and {Qi | 1 ≤ i ≤ n} be its standard schema reduction. Q is expressible by an RQDL description P without the subset metapredicate if and only if there exists a maximal set {Q′i | 1 ≤ i ≤ n} of queries described by the reduced description P′, where all Q′i have the same id, such that Q′i ≡ Qi, ∀1 ≤ i ≤ n.⁵

Referring again to Example 6.4.1, the maximal set {Q′i} is the set of queries corresponding to extended facts (1) and (2).

Note that the exact “value” of tuple ids is not important: their use is to identify com-

ponents (i.e., attributes) of the same relation. Therefore, we say that a reduced query Q in

p-Datalogf is expressible by a reduced p-Datalogf description P if and only if there exists

Q′ equivalent to Q up to tuple-id naming that is described by P .

Proof: The above theorem is easy to see in the case where the RQDL description contains no vectors. When the RQDL description contains vectors, the intuition is as follows: Let Q be a conjunctive query, and let {Qi | 1 ≤ i ≤ n} be the set of standard schema queries it reduces to. Also let P be the RQDL description and P_red be the reduced p-Datalogf description.

For the ONLY IF direction: The reduction directly maps the RQDL rules to rules “producing” tuple subgoals, so it ensures that if Q is expressible by P, then Q1 is expressible by P_red. Because of the expressibility of Q1 and of rules 8 and 1 in the reduction, all {Qi} are also expressible.

The IF direction follows from the definition of RQDL expressibility, Procedure 6.3.5, and the fact that all Q′i have the same query id. □

⁵“Maximal” means that {Q′i} includes all described queries with that same query id.


Because of Theorems 6.3.6 and 6.4.2, we can use Algorithm QED (see Section 5.2) to

answer the expressibility question in RQDL. QED generates all possible extended facts for

tuple and attr. We then check whether (i) all and only the necessary “frozen” tuple and attr

facts are produced and have the same id, and (ii) their corresponding queries are equivalent

to the Qi’s.

For the algorithm to work properly, a change needs to be made to the definition⁶ of the supporting set of a fact: due to the reduction introduced in Sections 6.3.2 and 6.3.3, there is an implicit connection between a fact tuple(const1, T ) and facts attr(T, const2, X), i.e., between the tuple fact and the attribute facts that are created by the reduction. We make that connection explicit by modifying the definition of “supporting set” as follows:

Definition: Supporting Set (Modified) Let h be an ordinary fact produced by an application of the p-Datalogf rule

r : H ← G1, . . . , Gk, E1, . . . , Em

of a (reduced) p-Datalogf description P on a database DB that consists of a canonical database CDB and other facts, and let µ be a mapping from the rule into DB such that µ(Gi), µ(Ej) ∈ DB and h = µ(H). The set Sh of supporting facts of h, or supporting set of h, with respect to P, is the smallest set such that

• if µ(Gi) ∈ CDB, then µ(Gi) ∈ Sh,

• if µ(Gi) ∉ CDB and S′ is the set of supporting facts of µ(Gi), then S′ ⊆ Sh,

• if tuple(c, t) ∈ Sh for some⁷ c and t, then for all c′, x, if attr(t, c′, x) is in the canonical DB, then attr(t, c′, x) ∈ Sh,

• if E is the set of all µ(Ei) ∈ Sh, then the smallest set of equality facts that includes E and is an equivalence relation is included in Sh. □

⁶We could have the same effect by correspondingly changing the RQDL to p-Datalogf reduction procedure.

⁷The constant c can be a frozen or regular constant; t can be a frozen or regular constant or a ground term.
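The closure conditions of the modified definition can be sketched as a fixpoint computation. This is our own simplified fact encoding; for brevity it folds the equality facts into the tuple-id lookup, which matches the augmentation behavior illustrated in Example 6.4.4 rather than the literal bullet-by-bullet definition:

```python
# Sketch of the supporting-set closure: starting from the facts a rule
# application directly used, pull in every canonical attr fact whose
# tuple id is (equal to) the id of some tuple fact already in the set,
# and repeat until nothing changes.

def equiv_classes(facts):
    """Merge the id sets of equal(...) facts into equivalence classes."""
    classes = []
    for f in facts:
        if f[0] == "equal":
            ids = set(f[1:])
            for c in [c for c in classes if c & ids]:
                classes.remove(c)
                ids |= c
            classes.append(ids)
    return classes

def supporting_closure(used, canonical_db):
    s = set(used)
    classes = equiv_classes(canonical_db)
    changed = True
    while changed:
        changed = False
        tids = {f[2] for f in s if f[0] == "tuple"}
        for c in classes:              # extend ids by equality
            if c & tids:
                tids |= c
        for f in canonical_db:
            if f[0] == "attr" and f[1] in tids and f not in s:
                s.add(f)
                changed = True
    return s

cdb = {("tuple", "p", "t0"), ("attr", "t1", "au", "x"),
       ("attr", "t2", "subj", "y"), ("equal", "t0", "t1", "t2")}
# A rule application that used only the au attribute:
used = {("tuple", "p", "t0"), ("attr", "t1", "au", "x"),
        ("equal", "t0", "t1", "t2")}
closed = supporting_closure(used, cdb)
```

On this input the closure adds attr(t2, subj, y), exactly the augmentation performed for extended fact (1) of Example 6.4.4.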


Modifications in the presence of subset subgoals: We have already explained that a subset(→V , →V ′) subgoal present in a rule r is reduced into a subset flag attached to r. Let P be an RQDL description and P_red be its p-Datalogf reduction. Let Q be an expansion of P_red (cf. Section 5.1.1). We say that the subset flag attaches to Q if Q is the result of resolving rule r of P_red with rule s and the subset flag is attached to r. Then Theorem 6.4.2 can be restated more generally as follows:

Theorem 6.4.3 Let Q be a conjunctive query and {Qi | 1 ≤ i ≤ n} be its standard schema reduction. Q is expressible by an RQDL description P if and only if there exists a maximal set {Q′i | 1 ≤ i ≤ m} of queries described by the reduced description P′, where all Q′i have the same id, such that Q′i ≡ Qi, ∀1 ≤ i ≤ n. We have n ≤ m if the subset flag is attached to Q1 and n = m otherwise. Maximal means that {Q′i} includes all described queries with that same query id.

During execution of the QED algorithm, whenever a tuple fact is generated from a rule that has the subset flag attached, a subset flag annotation is set on its tuple id. That annotation is used, after the execution is complete, together with Theorem 6.4.3 to determine expressibility.

Let us consider the following example.

Example 6.4.4 If our RQDL description is

ans(→V ) ← p(→V ), item(→V , au, X)

as in Example 6.3.3, then the query Q : ans(au : X) ← p(au : X, subj : Y ) is not expressible by our description. The reduction of the description is

tuple(ans, ans(T )) ← tuple(p, T ), attr(T1, au, X), equal(T, T1)
attr(ans(T ), $a, X) ← attr(T1, $a, X), tuple(ans, ans(T1)), equal(T, T1)

and the reduction of the query (i.e., the set {Qi}) is

tuple(ans, ans(au, X)) ← tuple(p, T ), attr(T, au, X), attr(T, subj, Y )
attr(ans(au, X), au, X) ← tuple(p, T ), attr(T, au, X), attr(T, subj, Y )

The canonical DB is then

tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)


The extended facts produced by Algorithm 5.2.4, taking into account the modification of

the definition of supporting sets introduced above, are

(1) < tuple(ans, ans(t0)), {tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)} >

(2) < attr(ans(t0), au, x), {tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)} >

(3) < attr(ans(t0), subj, y), {tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)} >

Let us look in more detail into how extended fact (1) was produced. Application of the

first rule of the p-Datalogf program generates

< tuple(ans, ans(t0)), {tuple(p, t0), attr(t1, au, x), equal(t0, t1)} >

The second rule of the program consequently fires and gives

< attr(ans(t0), au, x), {tuple(p, t0), attr(t1, au, x), equal(t0, t1)} >

and

< attr(ans(t0), subj, y), {tuple(p, t0), attr(t2, subj, y), equal(t0, t2)} >

Then, according to the modified definition of supporting set, we need to augment the supporting set of tuple(ans, ans(t0)) to include attr(t2, subj, y), thus obtaining extended fact (1). Performing the augmentation step cannot take more than an exponential amount of time. Finally, the second rule of the program fires again, to generate (2) and (3).

Even though both standard-schema queries of the reduction are expressible by our reduced description, the original query, as pointed out, is not expressible by the RQDL description. That is because the only maximal set of described queries produced (consisting of the queries corresponding to (1), (2), and (3)) is larger than the set of reduced queries.

On the other hand, if the description were

ans(→V ) ← p(→V1), item(→V , au, X), subset(→V , →V1)

then Q is described by the modified description. The reduction of the description would be exactly the same, but we would set the subset flag on the rule. Then, following Theorem 6.4.3, the algorithm would decide correctly that Q is described by the modified description. □


Let us consider a more complicated example of QED.

Example 6.4.5 The following source can accept queries that perform a join between relation q and any other relation over any set of attributes. The description of this source is a simplification of description PCQ of Example 6.1.2.

ans(→V ) ← cond(→V )
cond(→V ) ← q(→V1), union(→V , →V1, →V2), cond(→V2)
cond(→V ) ← item(→V , $a1, X1), item(→V , $a2, X2), equal(X1, X2), cond(→V )
cond(→V ) ← $r(→V )

The reduction of the description, after rectification, is

tuple(ans, ans(T )) ← tuple(cond, cond(T ))
attr(ans(T ), $a, X) ← attr(T1, $a, X), tuple(ans, ans(T2)), equal(T, T1, T2)
tuple(cond, cond(T )) ← tuple(q, T1), tuple(cond, cond(T2)), valid(T, T3, T4), equal(T1, T3), equal(T2, T4)
attr(cond(T ), $a, X) ← attr(T1, $a, X), tuple(cond, cond(T2)), equal(T, T1, T2)
tuple(cond, cond(T )) ← attr(T1, $a1, X1), attr(T2, $a2, X2), equal(X1, X2), tuple(cond, cond(T )), equal(T, T1, T2)
tuple(cond, cond(T )) ← tuple($r, T )
attr(T, $a, X) ← attr(T1, $a, X), valid(T, T2, T3), equal(T1, T2)
attr(T, $a, X) ← attr(T1, $a, X), valid(T, T2, T3), equal(T1, T3)

plus the valid rules.

The user query submitted to the source is the following:

ans(au : X, ln : X, subj : Z) ← q(au : X, subj : Z), s(ln : X)

(where ln stands for last name), which produces the extended canonical DB

tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)


The standard schema reduction of the user query is

tuple(ans, ans(au, X, ln, X, subj, Z)) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X), attr(Q2, subj, Z), attr(S1, ln, X1), equal(S, S1), equal(X, X1), equal(Q, Q1, Q2)
attr(ans(au, X, ln, X, subj, Z), au, X) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X), attr(Q2, subj, Z), attr(S1, ln, X1), equal(S, S1), equal(X, X1), equal(Q, Q1, Q2)
attr(ans(au, X, ln, X, subj, Z), ln, X) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X), attr(Q2, subj, Z), attr(S1, ln, X1), equal(S, S1), equal(X, X1), equal(Q, Q1, Q2)
attr(ans(au, X, ln, X, subj, Z), subj, Z) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X), attr(Q2, subj, Z), attr(S1, ln, X1), equal(S, S1), equal(X, X1), equal(Q, Q1, Q2)

Running Algorithm 5.2.4 on the canonical DB produces the following extended facts after augmentation:⁸

⁸We are only showing some of the extended facts produced.


< valid(u(t0, t3), t0, t3), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2)} >
< tuple(cond, cond(t0)), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), equal(t0, t1, t2)} >
< tuple(cond, cond(t3)), {tuple(s, t3), attr(t4, ln, x1), equal(t3, t4)} >
< tuple(cond, cond(u(t0, t3))), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4)} >
< tuple(cond, cond(u(t0, t3))), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2), equal(x, x1), equal(t3, t4)} >
(1) < tuple(ans, ans(u(t0, t3))), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >
< attr(u(t0, t3), au, x), {attr(t1, au, x), equal(t0, t1)} >
< attr(u(t0, t3), subj, z), {attr(t2, subj, z), equal(t0, t2)} >
< attr(u(t0, t3), ln, x1), {attr(t4, ln, x1), equal(t3, t4)} >
(2) < attr(ans(u(t0, t3)), au, x), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >
(3) < attr(ans(u(t0, t3)), ln, x1), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >
(4) < attr(ans(u(t0, t3)), subj, z), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >

The maximal set of described queries with query id u(t0, t3) (corresponding to (1), (2), (3), and (4)) is equal to the set of the standard schema queries that are the reduction of the user query. Therefore, the user query is expressible by our RQDL description, by Theorem 6.4.2. □

6.4.2 The CBR problem for RQDL

We solve the CBR problem for a given query and a given RQDL description in two steps:

• We generate the set of relevant described queries from the output of Algorithm 5.2.4, by “gluing” together the tuple and attr subgoals that have the same supporting set. In other words, we create the corresponding standard schema queries for the extended facts and then perform the inverse reduction on the sets of those that have the same id and body (thus ending up with queries on the original schema). These are the relevant queries of the description with respect to the given query.


• Given the original query and the relevant queries (or views) that are expressible by the given RQDL description, we can apply an appropriate algorithm for rewriting queries using views, e.g., [Qia96; LMSS95] or [RSU95], to that problem.
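The “gluing” step of the first bullet can be sketched as follows (our simplified illustration, with an invented fact encoding): tuple and attr facts that share a query id and supporting set are collected into one candidate view whose body is that shared supporting set.

```python
# Sketch of the gluing step: tuple/attr extended facts with the same
# (query id, supporting set) pair are combined into a single candidate
# view; its heads are the glued facts and its body is the shared
# supporting set. The resulting views are the input to a conventional
# rewriting-using-views algorithm.

from collections import defaultdict

def glue(extended_facts):
    views = defaultdict(set)
    for head, support in extended_facts:
        qid = head[2] if head[0] == "tuple" else head[1]
        views[(qid, support)].add(head)
    # each entry: (heads of the view, shared body)
    return [(heads, key[1]) for key, heads in views.items()]

shared_body = frozenset({("tuple", "books", "t"),
                         ("attr", "t", "au", "y")})
facts = [(("tuple", "ans", "qid1"), shared_body),
         (("attr", "qid1", "au", "y"), shared_body)]
views = glue(facts)
```

Each glued view then plays the role of one conjunctive view in the rewriting step of the second bullet.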

The algorithm is correct because of the following theorem.

Theorem 6.4.6 (RQDL-CBR) Assume we have a query Q and an RQDL description P, and let {Qi} be the result of applying Algorithm 5.2.4 on Q and P. There exists a rewriting Q′ of Q, such that Q′ ≡ Q, using any {Qj | Qj is expressible by P} if and only if there exists a rewriting Q′′, such that Q′′ ≡ Q, using only {Qi}.

The proof follows directly from the proof of Theorem 5.3.2. Therefore, solving the query expressibility problem for RQDL immediately reduces the CBR problem to the familiar problem of answering the given query using a finite set of conjunctive views. The complexity of the whole procedure is nondeterministic exponential in the input size.

Let us note that, in the presence of subset subgoals in the RQDL description, the QED algorithm produces candidate queries that can have the subset flag annotation set. In principle, these annotations can be ignored for the solution of the CBR problem, since we assume that the mediator has the capability to perform projections locally (i.e., projections can always be handled by the final rewriting at the mediator). Finally, it should be obvious that the discussion of Subsection 5.3.1 about binding requirements holds for RQDL as well.

Example 6.4.7 We consider a source that expects a selection condition on attribute au or on attribute subj, but not both. The RQDL description for this source is

ans(→V ) ← $r(→V ), item(→V , au, $c)
ans(→V ) ← $r(→V ), item(→V , subj, $c)

The description reduces to

tuple(ans, ans(T )) ← tuple($r, T ), attr(T1, au, X), equal(T, T1)

tuple(ans, ans(T )) ← tuple($r, T ), attr(T1, subj, X), equal(T, T1)

attr(ans(T ), $a,X) ← attr(T1, $a,X), tuple(ans, ans(T2)), equal(T, T1, T2)

Let the user query be

Q : ans(subj : X, au : Y, isbn : Z) ← books(subj : X, au : Y, isbn : Z), equal(X, Logic), equal(Y, Smith)


It is obvious that Q can be answered with a combination of queries expressible by the

description: First send the selection condition on au, then on subj and finally intersect the

two results. Q reduces to

tuple(ans, ans(subj, X, au, Y, isbn, Z)) ← tuple(books, T ), attr(T, subj, X), attr(T, au, Y ), attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)
attr(ans(subj, X, au, Y, isbn, Z), subj, X) ← tuple(books, T ), attr(T, subj, X), attr(T, au, Y ), attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)
attr(ans(subj, X, au, Y, isbn, Z), au, Y ) ← tuple(books, T ), attr(T, subj, X), attr(T, au, Y ), attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)
attr(ans(subj, X, au, Y, isbn, Z), isbn, Z) ← tuple(books, T ), attr(T, subj, X), attr(T, au, Y ), attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)

The canonical DB is then⁹

tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(x, Logic), equal(y, Smith)

The extended facts that are generated by algorithm QED-T are shown in Figure 6.2.¹⁰

The result (after the inverse reduction) is two candidate conjunctive queries, with binding information:

C1 : ans^bff (subj : X, au : Y, isbn : Z) ← books(subj : X, au : Y, isbn : Z)

and

C2 : ans^fbf (subj : X, au : Y, isbn : Z) ← books(subj : X, au : Y, isbn : Z)

Using Q and C1, C2 as input to algorithm AnsBind, we get the expected answer. □
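The rewriting strategy of the example, issuing the au selection and the subj selection as separate source queries and intersecting the results, can be sketched over a toy books instance (the data and helper names are invented for illustration):

```python
# Sketch of the rewriting for Example 6.4.7: the source answers one
# selection at a time, so the mediator issues both restricted queries
# and intersects the results locally. The books data is invented.

books = [("Logic", "Smith", "isbn-1"),
         ("Logic", "Jones", "isbn-2"),
         ("Algebra", "Smith", "isbn-3")]   # (subj, au, isbn)

def source_query(attr, value):
    """The source accepts exactly one selection (on au or subj)."""
    i = {"subj": 0, "au": 1}[attr]
    return {t for t in books if t[i] == value}

answer = source_query("au", "Smith") & source_query("subj", "Logic")
```

The intersection implements the conjunction of the two selection conditions that the source cannot evaluate together.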

6.5 Conclusions and Related Work

We described and extended RQDL, which is a provably more expressive language than p-Datalog. The extra power is mainly a result of vector variables, which can match sets of attributes of arbitrary length. The existence of vector variables makes a direct, brute-force implementation of query expressibility and CBR algorithms for RQDL very hard. In [PGH96], a brute-force approach is proposed for a query expressibility algorithm; it tries to

⁹For brevity we are not doing full rectification. ¹⁰The figure only shows the extended facts of interest.


< tuple(ans, ans(t)), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(x, Logic)} >
< tuple(ans, ans(t)), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(y, Smith)} >
< attr(ans(t), subj, x), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(x, Logic)} >
< attr(ans(t), au, y), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(x, Logic)} >
< attr(ans(t), isbn, z), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(x, Logic)} >
< attr(ans(t), subj, x), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(y, Smith)} >
< attr(ans(t), au, y), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(y, Smith)} >
< attr(ans(t), isbn, z), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(y, Smith)} >

Figure 6.2: Extended facts produced by Algorithm QED-T for Example 6.4.7

generate instantiated terminal expansions bottom up, so that vectors match with sets during the derivation. This approach soon leads to complicated problems that force [PGH96] to restrict the applicability of the matching algorithms to a subset of RQDL descriptions. Consequently, the query expressibility algorithm proposed in [PGH96] is not applicable to all RQDL descriptions.

We provide a reduction of RQDL descriptions into p-Datalog augmented with function symbols that construct unique tuple ids, similar to the invention of semantic object ids in DSL (see Chapter 3) and to the ideas in [Mai86; KL89]. Using this reduction, we provide complete algorithms for solving the expressibility and CBR problems. Moreover, we demonstrate how to automatically derive an RQDL description of the capabilities of a mediator, given the descriptions of the capabilities of the sources it accesses.


Appendix A

Enabling Integration: TSIMMIS Wrappers

In order to access information from a variety of heterogeneous information sources, TSIMMIS must translate queries and data from one data model into another. This functionality is provided by source wrappers [Lew91; Wel] that convert queries into one or more commands/queries understandable by the underlying source and transform the native results into a format understood by the application. Anyone who has built a wrapper (and as part of the TSIMMIS project we developed hard-coded wrappers for a variety of sources, such as the Sybase DBMS, WWW pages, and legacy systems like Folio) can attest that wrapper development is a heavy task. In situations where it is important or desirable to gain access to new sources quickly, this is a major drawback. However, we have also observed that only a relatively small part of the code deals with the specific access details of the source. The rest of the code is either common among wrappers or implements query and data transformations that could be expressed in a high-level, declarative fashion.

Based on these observations, I built a wrapper implementation toolkit for quickly building

wrappers. The toolkit contains a library for commonly used functions, such as for receiving

queries from the application and packaging results. It also contains a facility for translating

queries into source-specific commands, and for translating results into a model useful to

the application. The philosophy behind the “template-based” translation methodology

is as follows. The wrapper implementor specifies a set of templates (rules) written in a

high-level, declarative language that describe the queries accepted by the wrapper. If an

application query matches a template, an implementor-provided action associated with the


template is executed to provide the native query for the underlying source. The native query

is not necessarily a string of a well-structured query language, e.g., SQL. In general, it is

any program used to access and retrieve information from the underlying source. When

the source returns the results of the query, the wrapper transforms the answer objects, which are represented in the data model of the source, into a representation used by the application. Using this toolkit one can quickly design a simple wrapper with a few templates that cover some of the desired functionality, probably the functionality that is most urgently needed. However, templates can be added gradually as more functionality is required later on. In addition, the libraries in the toolkit provide the ability to perform post-processing on the result if required, e.g., when the incoming query did not match exactly any of the templates. In a sense, this post-processing capability allows the wrapper builder to enhance the usefulness of the source by adding query capabilities to the wrapper that are not natively supported.
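The template mechanism can be illustrated with a toy matcher. The template syntax, the $-placeholders, and the SQL action string below are inventions of this sketch, not the toolkit's actual template language:

```python
# Toy illustration of template-based translation (template and action
# syntax invented for this sketch). A template is a query pattern with
# $-placeholders; matching an incoming query binds the placeholders and
# instantiates the associated native-query action.

import re

def match_template(template, action, query):
    # Turn "select papers where au = $X" into a regex capturing $X.
    pattern = re.escape(template)
    pattern = re.sub(r"\\\$(\w+)", r"(?P<\1>\\S+)", pattern)
    m = re.fullmatch(pattern, query)
    if m is None:
        return None                       # no template match
    native = action
    for var, val in m.groupdict().items():
        native = native.replace("$" + var, val)
    return native

native = match_template("select papers where au = $X",
                        "SELECT * FROM papers WHERE author = '$X'",
                        "select papers where au = Jones")
```

A query that matches no template would return None here; in the real toolkit that case is where the filter-query post-processing described below comes into play.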

Figure A.1: Wrapper architecture

Another important use of wrappers is in extending the query capabilities of a source.

For instance, some sources may not be capable of answering queries that have multiple

predicates. In such cases, it is necessary to pose a native query to such a source using only


Figure A.2: Wrapper components and procedure calls

predicates that the source is able to handle. The rest of the predicates are automatically separated from the query and handled locally through a filter query. When the wrapper receives the results, a post-processing engine applies the filter query. This engine supports a set of built-in predicates based on the comparison operators =, ≠, <, >, etc. In addition, the engine can support more complex predicates that can be specified as part of the filter query. The post-processing engine is common to wrappers of all sources and is part of the wrapper toolkit. It is simply a lightweight version of the mediator query engine shown in Figure 1.6. As we noted for mediators, the post-processing engine gives the wrapper the ability to handle a much larger class of queries than those that exactly match the templates it had been given.
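The filter-query evaluation can be sketched as a tiny predicate evaluator (our illustration; the toolkit's actual engine is richer and supports user-specified predicates):

```python
# Toy post-processing filter: the native query handled only part of the
# selection, and the built-in comparison predicates filter the returned
# objects locally. The operator set mirrors the text (=, !=, <, >).

import operator

OPS = {"=": operator.eq, "!=": operator.ne,
       "<": operator.lt, ">": operator.gt}

def apply_filter(results, conditions):
    """conditions: list of (attribute, op, value) triples."""
    def ok(obj):
        return all(OPS[op](obj[a], v) for a, op, v in conditions)
    return [obj for obj in results if ok(obj)]

# Source answered the author selection; year predicate is filtered here,
# as in the Jones/1984 query of Section A.1.
native_results = [{"author": "Jones", "year": 1979},
                  {"author": "Jones", "year": 1990}]
filtered = apply_filter(native_results, [("year", "<", 1984)])
```

This mirrors how the lt(Y, 1984) predicate of query Q53 in Section A.1 would be applied locally when the source can only evaluate the author selection.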

Figure A.1 shows an overview of the wrapper architecture as it is currently implemented in our TSIMMIS testbed. Rounded components are provided by the toolkit; the rectangular native component is source-specific and must be written by the implementor. The driver component controls the translation process and invokes the following services of the toolkit:

the parser that parses the templates as well as the incoming queries into internal data


structures, the matcher that matches a query against the set of templates and creates a

filter query for post-processing if necessary, the native component that submits the generated

action string and receives the native result, and the engine that transforms and packages the

result and applies a post-processing filter if one has been created by the matcher. Figure A.2

shows in detail the procedure calls made between the various components. It also shows

that the native component is well-encapsulated, consisting simply of four procedures that

need to be implemented by the wrapper implementor, and a native template, describing the

structure of the native results. We now describe the sequence of events that occur at the

wrapper during the translation of a query and its result using a concrete example. Queries

are formulated using a simple extension of DSL, and the wrapper turns the native results

into OEM objects.

A.1 An Example

Let us assume that the source is a relational database containing bibliographic information

about papers and books. Suppose that the user is interested in all papers authored by

“Jones” and published before 1984. The corresponding DSL query is:

(Q53)  P :- P:<&O book {<year Y> <author “Jones”>}> AND lt(Y, 1984)

The predicate lt(Y,1984) specifies that the < comparison operator be used. The

pattern variable P in this example binds to the contents of the whole object pattern. The

use of pattern variables is essentially a shortcut.

Upon receipt by the wrapper, the query is sent to the driver component which invokes

the parser. After the query is successfully parsed, the driver invokes the matcher to match

the query against a set of template rules. These rules describe the queries that are accepted

by the wrapper and are expressed in a simple extension of DSL with pattern variables and

tokens (see Chapter 5). Associated with each rule is an action string that describes the

corresponding native query. In our scenario, the action string is a parametrized SQL query.

In order to give an example of postprocessing using a filter query, we have chosen not to

include any predicates on year in the templates. That way, we are essentially acting as

if the source does not support the < predicate on year. In a “production” version of the

wrapper it is usually beneficial to make use of all the natively supported query facilities in


order to maximize efficiency (see discussion on query capabilities in Chapter 1).

Here is an example of a template (without the associated action) that matches the above

query:

(D9)  B :- B:<&I book {<author $X>}>

The above template matches the given query because the substitutions B ← P, I ← O, and

$X ← “Jones” transform the template into a DSL expression that is contained in the given

query. Note that we could have designed a template that matches the input query exactly,

had we decided to let the source execute the year predicate.

Using the following associated action

// $$ = "select * from book where author = " $X //

and the substitution $X ← “Jones”, the matcher produces the following native SQL query:

select *

from book

where author = "Jones"
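The matching and instantiation steps just described can be sketched as follows. This is a simplified Python illustration; the real matcher operates on parsed DSL structures, and the function names here are assumptions for the example.

```python
# Sketch: match a query constant against a template's $-placeholder,
# then instantiate the parametrized action string to obtain the
# native SQL query. Illustrative only.

def match(template_author, query_author):
    """Return the substitution {placeholder: constant} if the
    template's author slot is a $-placeholder; None means no match."""
    if template_author.startswith("$"):
        return {template_author: query_author}
    return None

def instantiate(action, substitution):
    """Replace each placeholder in the parametrized action string."""
    for var, value in substitution.items():
        action = action.replace(var, value)
    return action

subst = match("$X", "Jones")                       # {"$X": "Jones"}
action = 'select * from book where author = "$X"'
sql = instantiate(action, subst)
# sql -> 'select * from book where author = "Jones"'
```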

The driver then invokes the query processing part of the native component, which

submits the native query to the source. When the result is returned, the driver invokes the

query engine to perform the necessary post-processing: the wrapper must remove all those

publications from the answer that were published after 1984 (since the original query was

for publications before 1984). This is done by applying the following DSL filter query to

the result:

B : −B :< book {< year Y >} > AND lt(Y, 1984)

Specifically, the postprocessing engine takes each answer object in the native query

result, extracts the year field of the object, and checks whether it is less than 1984. If so, the object

is included in the result constructed by the engine. After the post-processing, the engine

creates an OEM answer object containing the desired publications. Finally, the driver

component returns the OEM result to the application that issued the query.
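The post-processing step above can be sketched in a few lines. The sketch assumes answer objects are plain Python dictionaries rather than real OEM objects; it only illustrates the effect of the filter lt(Y, 1984).

```python
# Sketch: apply the local filter to the native query result,
# keeping only objects whose year field is below the bound.

def apply_filter(objects, field, bound):
    """Mirror the DSL filter lt(Y, bound): keep objects whose
    `field` value exists and is strictly less than `bound`."""
    return [o for o in objects
            if o.get(field) is not None and o[field] < bound]

native_result = [
    {"author": "Jones", "year": 1980, "title": "A"},
    {"author": "Jones", "year": 1990, "title": "B"},
]
answer = apply_filter(native_result, "year", 1984)
# answer contains only the 1980 publication
```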


A.2 Implemented Wrappers

The toolkit has been used to wrap the following four different types of sources containing

bibliographic data in heterogeneous formats.

1. A University-owned legacy system called folio, which is accessible through an

interactive front-end (called inspec).

2. A Sybase relational DBMS, which is accessible through SQL.

3. A collection of UNIX files, which are accessible through a PERL script file.

4. A World-Wide Web source which is accessible through a Python script file.

Although all four sources support different access methods, the wrappers hide all

source-specific details from the application/end-user by exporting a common interface to

the underlying data independently of where and how it is stored. By adding new templates

or modifying existing ones, it is easy to quickly enhance the query capabilities of a wrapper

as well as the structure of the resulting answers without writing a single line of code.
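To illustrate, a hypothetical additional template and action, written in the style of template (D9) and its action string, might extend a wrapper to also handle equality predicates on year. The rule name, pattern, and SQL below are illustrative only, not part of the implemented wrappers:

```
(D10)  B :- B:<&I book {<author $X> <year $Y>}>
// $$ = "select * from book where author = " $X " and year = " $Y //
```

Installing such a rule in the template file would let the wrapper answer author/year equality queries natively, with no filter query needed.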


Appendix B

Sort program

Sort is an almost-standard list-sorting routine that takes a list (in the form of an arbitrary

u-term) as input, turns it into a right-deep u-term and then sorts it, deletes duplicates, and

returns the sorted list (in the form of a right-deep u-term). The sorting algorithm used is

selection sort [CLR90]. The rules to merge two right-deep u-terms are omitted.

sort(T, u(T1, T2))                    ← uniquesort(T, U), rd(U, u(T1, T2))
rd(T, T)                              ← tuple(N, T)
rd(U, u(T1, T2))                      ← rd(U1, T1), rd(U2, T2), merge(U, U1, U2)
uniquesort(T, u(T1, u(T1, T)))        ← selsort(u(T1, u(T1, T)), U)
uniquesort(T, T)                      ← selsort(T, T), tuple(N, T)
selsort(T, T)                         ← tuple(N, T)
selsort(u(Min, T), u(T1, T2))         ← findMin(Min, Rest, u(T1, T2)), selsort(T, Rest)
findMin(T1, T2, u(T1, T2))            ← tuple(N1, T1), tuple(N2, T2), T1 < T2
findMin(Min, u(T1, Rest), u(T1, T2))  ← findMin(Min, Rest, T2), tuple(N, T1), T1 > Min

Figure B.1: A logic program implementing selection sort
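For an operational reading of Figure B.1, here is a rough Python equivalent of the sort/uniquesort behavior, operating on ordinary Python lists rather than right-deep u-terms. It is purely illustrative and does not model the u-term representation or the merge rules.

```python
# Sketch: selection sort followed by duplicate elimination,
# mirroring the selsort and uniquesort predicates.

def selection_sort(items):
    """Sort by repeatedly extracting the minimum element,
    as the selsort/findMin rules do."""
    items = list(items)          # do not mutate the caller's list
    result = []
    while items:
        m = min(items)
        items.remove(m)
        result.append(m)
    return result

def uniquesort(items):
    """Sort and drop adjacent duplicates, as uniquesort does."""
    out = []
    for x in selection_sort(items):
        if not out or out[-1] != x:
            out.append(x)
    return out

# uniquesort([3, 1, 2, 1]) -> [1, 2, 3]
```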


Bibliography

[A+91] R. Ahmed et al. The Pegasus heterogeneous multidatabase system. IEEE Computer,

24:19–27, 1991.

[AAB+98] J. L. Ambite, N. Ashish, G. Barish, C. A. Knoblock, S. Minton, P. J. Modi, Ion

Muslea, A. Philpot, and S. Tejada. ARIADNE: A system for constructing mediators

for internet sources. In Proc. SIGMOD Conf., pages 561–563, 1998.

[ACHK93] Y. Arens, C.Y. Chee, C.-N. Hsu, and C.A. Knoblock. Retrieving and integrating

data from multiple information sources. Intl Journal of Intelligent and Cooperative

Information Systems, 2:127–158, June 1993.

[ACPS96] S. Adali, K. S. Candan, Y. Papakonstantinou, and V. S. Subrahmanian. Query caching

and optimization in distributed mediator systems. In Proc. SIGMOD, pages 137–48,

1996.

[AD98] S. Abiteboul and O. Duschka. Complexity of answering queries using views. In Proc.

PODS Conf., 1998.

[Adl] S. Adler et al. Extensible Stylesheet Language (XSL) 1.0. W3C Working

Draft. Available at http://www.w3.org/TR/xsl. More information at

http://www.w3.org/Style/XSL.

[AGMPY98] S. Abiteboul, H. Garcia-Molina, Y. Papakonstantinou, and R. Yerneni. Fusion query

optimization. In Proc. EDBT Conf., pages 57–71, 1998.

[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley,

1995.

[AK89] S. Abiteboul and P.C. Kanellakis. Object identity as a query language primitive. In

Proc. ACM SIGMOD Conference, pages 159–73, Portland, OR, May 1989.

[AK97] J. L. Ambite and C. A. Knoblock. Planning by rewriting: Efficiently generating high-

quality plans. In Proc. AAAI Conf., pages 706–713, 1997.


[AKL97] N. Ashish, C. A. Knoblock, and A. Levy. Information gathering plans with sensing

actions. In Fourth European Conference on Planning, 1997.

[ALW99] C. R. Anderson, A. Y. Levy, and D. S. Weld. Declarative web site management with

tiramisu. In Informal Proc. WebDB Workshop, pages 19–24, 1999.

[AQM+97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query

language for semistructured data. International Journal on Digital Libraries, 1(1):68–

88, April 1997.

[ASU87] A. Aho, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniques, and Tools.

Addison-Wesley, 1987.

[BDFS97] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstruc-

tured data. In Proc. ICDT Conf., 1997.

[BDHS96] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and opti-

mization techniques for unstructured data. In Proc. ACM SIGMOD, 1996.

[Bla96] J. Blakeley. Data access for the masses through OLE DB. In Proc. ACM SIGMOD

Conf., pages 161–72, 1996.

[BLN86] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies

for database schema integration. ACM Computing Surveys, 18:323–364, 1986.

[BLR97] C. Beeri, A. Y. Levy, and M.-C. Rousset. Rewriting queries using views in description

logics. In Proc. PODS Conf., pages 99–108, 1997.

[BM] P. V. Biron and A. Malhotra. XML Schema Part 2: Datatypes. W3C Working Draft.

Latest version available at http://www.w3.org/TR/xmlschema-2/.

[BPSM] T. Bray, J. Paoli, and C. Sperberg-McQueen. Extensible Markup Language (XML) 1.0.

W3C Recommendation. Latest version available at http://www.w3.org/TR/REC-xml.

[C+95] M.J. Carey et al. Towards heterogeneous multimedia information systems: The Garlic

approach. In Proc. RIDE-DOM Workshop, pages 124–31, 1995.

[Cad] Cadabra.com. At http://www.cadabra.com.

[CGL98a] D. Calvanese, G. De Giacomo, and M. Lenzerini. On the decidability of query con-

tainment under constraints. In Proc. PODS Conf., pages 149–158, 1998.

[CGL+98b] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati. Description

logic framework for information integration. In Proc. of the 6th Int. Conf. on the

Principles of Knowledge Representation and Reasoning (KR’98), pages 2–13, 1998.


[CGL99] D. Calvanese, G. De Giacomo, and M. Lenzerini. Answering queries using views in

description logics. In Proc. of the 6th Int. Workshop on Knowledge Representation

meets Databases (KRDB’99), pages 6–10, 1999.

[CGLV99] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Vardi. Rewriting of regular

expressions and regular path queries. In Proc. PODS Conf., 1999.

[CGLV00] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. Query processing using

views for regular path queries with inverse. In Proc. PODS Conf., 2000.

[Cha92] E.P. Chan. Containment and minimization of positive conjunctive queries in OODB’s.

In Proc. PODS Conf., 1992.

[CKW93] W. Chen, M. Kifer, and D.S. Warren. HiLog: a foundation for higher-order logic

programming. Journal of Logic Programming, 15:187–230, February 1993.

[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms.

McGraw-Hill, 1990.

[CM77] A.K. Chandra and P.M. Merlin. Optimal implementation of conjunctive queries in

relational databases. In Proceedings of the Ninth Annual ACM Symposium on Theory

of Computing, pages 77–90, 1977.

[CM90] M. Consens and A. Mendelzon. GraphLog: a visual formalism for real life recursion.

In Proc. PODS Conf., pages 404–416, 1990.

[CRF00] D. Chamberlin, J. Robie, and D. Florescu. Quilt: An XML query language for het-

erogeneous data sources. In Proc. SIGMOD WebDB Workshop, 2000.

[DFF+] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL:

A query language for XML. Submission to W3C. Latest version available at

http://www.w3.org/TR/NOTE-xml-ql.

[DG97] O. Duschka and M. Genesereth. Answering queries using recursive views. In Proc.

PODS Conf., 1997.

[DKS92] W. Du, R. Krishnamurthy, and M.-C. Shan. Query optimization in heterogeneous

DBMS. In Proc. VLDB Conference, pages 277–91, Vancouver, Canada, August 1992.

[DL97] O. Duschka and A. Levy. Recursive plans for information gathering. In Proceedings

of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.

[DL99] A. Doan and A. Levy. Efficiently ordering query plans for data integration. In Proc.

AAAI Conf., pages 67–73, 1999.

[DP90] B.A. Davey and H. A. Priestley. Introduction to lattices and order. Cambridge Math-

ematical Textbooks, 1990.


[End72] H. Enderton. A Mathematical Introduction to Logic. Academic Press, 1972.

[Eno] Enosys Markets, inc. At http://www.enosysmarkets.com.

[Fet] Fetch Technologies. At http://www.fetch.com.

[FFK+98] M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu. Catching the boat with

Strudel: Experiences with a web-site management system. In Proc. SIGMOD Conf.,

1998.

[FFLS97] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language and processor

for a web-site management system. In Workshop on Management of Semistructured

Data, ACM SIGMOD Conf., 1997.

[FGL+98] P. Fankhauser, G. Gardarin, M. Lopez, J. Muñoz, and A. Tomasic. Experiences in

federated databases: From IRO-DB to MIRO-Web. In Proc. VLDB Conf., pages

655–658, 1998.

[FKL97] D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integra-

tion. In Proc. VLDB Conf., 1997.

[FLM99] M. Friedman, A. Levy, and T. Millstein. Navigational plans for data integration. In

Proc. AAAI Conf., pages 67–73, 1999.

[FLMS99] D. Florescu, A. Levy, I. Manolescu, and D. Suciu. Query optimization in the presence

of limited access patterns. In Proc. SIGMOD Conf., pages 311–322, 1999.

[FLNS88] P. Fankhauser, W. Litwin, E.J. Neuhold, and M. Schrefl. Global view definition and

multidatabase languages: two approaches to database integration. In Research into

Networks and Distributed Applications. European Teleinformatics Conf., pages 1069–

1082, Vienna, Austria, April 1988.

[FLS98] D. Florescu, A. Levy, and D. Suciu. Query containment for conjunctive queries with

regular expressions. In Proc. PODS Conf., 1998.

[FLSY99] D. Florescu, A. Y. Levy, D. Suciu, and K. Yagoub. Optimization of run-time man-

agement of data intensive web-sites. In Proc. VLDB Conf., pages 627–638, 1999.

[FS98] M. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas.

In Proc. ICDE Conf., 1998.

[FSW99] M. Fernandez, J. Simeon, and P. Wadler. XML query languages: Experiences and

exemplars, 1999. With contributions from S. Cluet et al. Available from

http://www-db.research.bell-labs.com/user/simeon/xquery.html.

[FW97] M. Friedman and D. S. Weld. Efficiently executing information-gathering plans. In

Proc. IJCAI Conf., 1997.


[G+92] M.R. Genesereth et al. Knowledge Interchange Format. Version 3.0. Reference Man-

ual. Technical Report Logic-92-1, Stanford University, 1992. Also available by URL

http://logic.stanford.edu/kif.html.

[GHR98] A. Gupta, V. Harinarayan, and A. Rajaraman. Virtual database technology. In Proc.

ICDE Conf., pages 297–301, 1998.

[GKD97] M. R. Genesereth, A. M. Keller, and O. Duschka. Infomaster: An information inte-

gration system. In Proc. SIGMOD Conf., 1997.

[GL94] P. Gupta and E. Lin. DataJoiner: a practical approach to multidatabase access. In

Proc. PDIS Conf., page 264, 1994.

[GM+97] H. Garcia-Molina et al. The TSIMMIS approach to mediation: data models and

languages. Journal of Intelligent Information Systems, 8:117–132, 1997.

[GM99] G. Grahne and A. O. Mendelzon. Tableau techniques for querying information sources

through global schemas. In Proc. ICDT Conf., pages 332–347, 1999.

[GMLY99] H. Garcia-Molina, W. Labio, and R. Yerneni. Capability-sensitive query processing

on internet sources. In Proc. ICDE Conf., pages 50–59, 1999.

[GMPVY] H. Garcia-Molina, Y. Papakonstantinou, V. Vassalos, and R. Yerneni. A tsimmis

retrospective. Working paper. Draft available from

http://www.stern.nyu.edu/~vassalos/retro-draft.ps.

[GN88] M.R. Genesereth and N.J. Nilsson. Logical Foundations of Artificial Intelligence.

Morgan Kaufmann, 1988.

[Gol90] C. Goldfarb. The SGML Handbook. Oxford University Press, 1990.

[Gup89] A. Gupta. Integration of Information Systems: Bridging Heterogeneous Databases.

IEEE Press, 1989.

[GW97] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization

in semistructured databases. In Proc. VLDB Conf., 1997.

[HKWY96] Laura Haas, Donald Kossman, Edward Wimmers, and Jun Yang. An optimizer for

heterogeneous systems with non-standard data and search capabilities. Special Issue

on Query Processing for Non-Standard Data, IEEE Data Engineering Bulletin, 19:37–

43, December 1996.

[HKWY97] L. Haas, D. Kossman, E. Wimmers, and J. Yang. Optimizing queries across diverse

data sources. In Proc. VLDB, 1997.


[HM93] J. Hammer and D. McLeod. An approach to resolving semantic heterogeneity in a

federation of autonomous, heterogeneous database systems. Intl Journal of Intelligent

and Cooperative information Systems, 2:51–83, 1993.

[HY90] R. Hull and M. Yoshikawa. ILOG: Declarative creation and manipulation of object

identifiers. In Proc. VLDB Conference, pages 455–68, Brisbane, Australia, August

1990.

[HY91] R. Hull and M. Yoshikawa. On the equivalence of data restructurings involving object

identifiers. In Proc. PODS Conference, 1991.

[IFF+99] Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld. An adaptive query

execution system for data integration. In Proc. SIGMOD Conf., pages 299–310, 1999.

[Imm82] N. Immerman. Upper and lower bounds for first-order expressibility. Journal of

Computer and System Sciences, 25(1):76–98, August 1982.

[JBHM+97] J. Hammer, M. Breunig, H. Garcia-Molina, S. Nestorov, V. Vassalos, and R. Yerneni.

Template-based wrappers in the tsimmis system. In Proc. ACM SIGMOD, pages

532–535, 1997.

[K+93] W. Kim et al. On resolving schematic heterogeneity in multidatabase systems. Dis-

tributed And Parallel Databases, 1:251–279, 1993.

[KL89] M. Kifer and G. Lausen. F-logic: a higher-order language for reasoning about objects,

inheritance, and scheme. In Proc. ACM SIGMOD Conf., pages 134–46, Portland, OR,

June 1989.

[KW96] C. T. Kwok and D. S. Weld. Planning to gather information. In Proc. AAAI Conf.,

1996.

[Lev] A. Levy. Answering queries using views: a survey. Available from

www.cs.washington.edu/homes/alon/site/files/view-survey.ps.

[Lev00] Special issue on adaptive query processing. Bulletin of the Technical Committee on

Data Engineering, 23(2), June 2000.

[Lew91] J.W. Lewis. Wrappers: integrating utilities and services for the DICE architecture.

In Proceedings of the Second National Symposium on Concurrent Engineering, pages

445–457, 1991.

[LHL+98] B. Ludäscher, R. Himmeröder, G. Lausen, W. May, and C. Schlepphorst. Managing

semistructured data with FLORID: A deductive object-oriented perspective. Infor-

mation Systems, 23(8):589–613, 1998.


[LMR90] W. Litwin, L. Mark, and N. Roussopoulos. Interoperability of multiple autonomous

databases. ACM Computing Surveys, 22:267–293, 1990.

[LMSS95] A. Levy, A. Mendelzon, Y. Sagiv, and D. Srivastava. Answering queries using views.

In Proc. PODS Conf., pages 95–104, 1995.

[LPV00] B. Ludäscher, Y. Papakonstantinou, and P. Velikhov. Navigation-driven evaluation of

virtual mediated views. In Proc. EDBT Conf., 2000.

[LR96] A. Levy and M.-C. Rousset. CARIN: a representation language integrating rules and

description logics. In Proceedings of the European Conference on Artificial Intelligence,

Budapest, Hungary, 1996.

[LRO96] A. Levy, A. Rajaraman, and J. Ordille. Querying heterogeneous information sources

using source descriptions. In Proc. VLDB, pages 251–262, 1996.

[LRU96] A. Levy, A. Rajaraman, and J. Ullman. Answering queries using limited external

processors. In Proc. PODS, pages 227–37, 1996.

[LRU99] A. Levy, A. Rajaraman, and J. Ullman. Answering queries using limited external

processors. Journal of Computer and System Sciences, 58(1):69–82, February 1999.

[LS97] A. Levy and D. Suciu. Deciding containment for queries with complex objects. In

Proc. PODS Conf., 1997.

[MAG+97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database

management system for semistructured data. SIGMOD Record, 26(3):54–66, 1997.

[Mai86] D. Maier. A logic for objects. In J. Minker, editor, Preprints of Workshop on Founda-

tions of Deductive Database and Logic Programming, Washington, DC, USA, August

1986.

[Mer] Mergent Systems. At http://www.mergent.com.

[MFDG98] S. Mace, U. Flohr, R. Dobson, and T. Graham. Weaving a better Web. BYTE

Magazine, March 1998. Cover Story.

[MLF00] T. D. Millstein, A. Y. Levy, and M. Friedman. Query containment for data integration

systems. In Proc. PODS Conf., 2000.

[MP00] K. Munroe and Y. Papakonstantinou. BBQ: A visual interface for browsing and

querying XML. In Proc. Visual Database Systems, 2000.

[MS99] T. Milo and D. Suciu. Type inference for queries on semistructured data. In Proc.

PODS Conf., pages 215–226, 1999.


[MSV00] T. Milo, D. Suciu, and V. Vianu. Typechecking for XML transformers. In Proc. PODS

Conf., 2000.

[MW99] J. McHugh and J. Widom. Query optimization for an XML query language. In Proc.

SIGMOD Conf., 1999.

[Nim] Nimble.com. At http://www.nimble.com.

[PAGM96] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator

systems. In Proc. VLDB Conf., 1996.

[Pap97] Y. Papakonstantinou. Query processing in heterogeneous information sources.

Technical report, Stanford University Thesis, 1997. Available from

www-cse.ucsd.edu/~yannis/papers/.

[PGGMU95] Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. A query translation

scheme for the rapid implementation of wrappers. In Proc. DOOD Conf., pages 161–

86, 1995.

[PGH96] Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in

mediator systems. In Proc. PDIS Conf., 1996.

[PGH98] Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in

mediator systems. Distributed and Parallel Databases, 6:73–110, 1998.

[PGMU96] Y. Papakonstantinou, H. Garcia-Molina, and J. Ullman. Medmaker: A mediation

system based on declarative specifications. In Proc. ICDE Conf., pages 132–41, 1996.

[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across

heterogeneous information sources. In Proc. ICDE Conf., pages 251–60, 1995.

[PL00] R. Pottinger and A. Levy. A scalable algorithm for answering queries using views. In

Proc. VLDB Conf., 2000.

[PV99] Y. Papakonstantinou and V. Vassalos. Query rewriting for semistructured data. In

Proc. SIGMOD Conf., pages 455–466, 1999.

[PV00] Y. Papakonstantinou and V. Vianu. DTD inference for views of XML data. In Proc.

PODS Conf., 2000.

[PWDN99] M. Papiani, J. Wason, A. Dunlop, and D. Nicole. A distributed scientific data archive

using the Web, XML and SQL/MED. SIGMOD Record, 28(3), 1999.

[Qia96] Xiaolei Qian. Query folding. In Proc. ICDE, pages 48–55, 1996.

[ROH99] M. Tork Roth, F. Ozcan, and L. M. Haas. Cost models DO matter: Providing cost

information for diverse data sources in a federated system. In Proc. VLDB Conf.,

pages 599–610, 1999.


[RS97] M. Tork Roth and P. Schwarz. Don’t Scrap It, Wrap It! An architecture for legacy

data sources. In Proc. VLDB Conf., pages 266–275, 1997.

[RSU95] A. Rajaraman, Y. Sagiv, and J. Ullman. Answering queries using templates with

binding patterns. In Proc. PODS Conf., pages 105–112, 1995.

[RSUV89] R. Ramakrishnan, Y. Sagiv, J.D. Ullman, and M.Y. Vardi. Proof tree transformations

and their applications. In Proc. PODS Conf, pages 172–182, 1989.

[S+] V.S. Subrahmanian et al. HERMES: A heterogeneous reasoning and mediator system.

Available at http://www.cs.umd.edu/projects/hermes/overview/paper.

[SAC+79] P. G. Selinger, M. Astrahan, D. Chamberlin, R. A. Lorie, and T. G. Price. Access

path selection in a relational database management system. In Proc. SIGMOD Conf.,

pages 23–34, 1979.

[SBGJ+97] S. Bressan, K. Fynn, C. Goh, M. Jakobisiak, K. Hussein, H. Kon, T. Lee, S. Madnick,

T. Pena, J. Qu, A. Shum, and M. Siegel. The context interchange mediator prototype.

In Proc. SIGMOD Conf., pages 525–527, 1997.

[SGM] Overview of SGML resources. At http://www.w3.org/MarkUp/SGML/.

[Suc98] D. Suciu. Semistructured data and XML. In Proc. FODO Conf., 1998.

[SY80] Y. Sagiv and M. Yannakakis. Equivalences among relational expressions with the union

and difference operators. JACM, 27:633–55, 1980.

[T+90] G. Thomas et al. Heterogeneous distributed database systems for production use.

ACM Computing Surveys, 22:237–266, 1990.

[TBMM] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML

Schema Part 1: Structures. W3C Working Draft. Latest version available at

http://www.w3.org/TR/xmlschema-1/.

[TRV98] A. Tomasic, L. Raschid, and P. Valduriez. Scaling access to heterogeneous data sources

with DISCO. Transactions on Knowledge and Data Engineering, 10(5):808–823, 1998.

[Tuk] Tukwila data integration system. At

http://data.cs.washington.edu/integration/tukwila.

[Ull88] J.D. Ullman. Principles of Database and Knowledge-Base Systems, Vol. I & II. Com-

puter Science Press, New York, NY, 1988.

[Ull89] J.D. Ullman. Principles of Database and Knowledge-Base Systems, Vol. II: The New

Technologies. Computer Science Press, New York, NY, 1989.


[Ull97] J.D. Ullman. Information integration using logical views. In Proc. ICDT Conf., pages

19–40, 1997.

[VP97] V. Vassalos and Y. Papakonstantinou. Describing and Using Query Capabilities of

Heterogeneous Sources. In Proc. VLDB Conf., pages 256–266, 1997.

[VP98] V. Vassalos and Y. Papakonstantinou. Using knowledge of redundancy for query

optimization in mediators. In Proceedings of the AAAI’98 Workshop on AI and Infor-

mation Integration, 1998. Available from www.stern.nyu.edu/~vassalos/publications/.

[VP00] V. Vassalos and Y. Papakonstantinou. Expressive capabilities description languages

and query rewriting algorithms. Journal of Logic Programming, 43(1):75–122, 2000.

[Wel] D. Wells. Wrappers survey. Available from http://www.objs.com/survey/wrap.htm.

[Wie92] G. Wiederhold. Mediators in the architecture of future information systems. IEEE

Computer, 25:38–49, 1992.

[YL87] H. Z. Yang and P.-Å. Larson. Query transformation for PSJ-queries. In Proc. VLDB

Conf., pages 245–254, 1987.

[YLGMU99] R. Yerneni, C. Li, H. Garcia-Molina, and J. D. Ullman. Computing capabilities of

mediators. In Proc. SIGMOD Conf., pages 443–454, 1999.

[YLUGM99] R. Yerneni, C. Li, J. D. Ullman, and H. Garcia-Molina. Optimizing large join queries

in mediation systems. In Proc. ICDT Conf., pages 348–364, 1999.

[ZGMHW95] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View maintenance in a

warehousing environment. In Proc. SIGMOD Conference, pages 316–327, 1995.