Describing data sources. Outline Overview Schema mapping languages

Describing data sources

Outline

Overview Schema mapping languages

Source descriptions

• Which sources are available

• What data exists in each source

• How each source can be accessed

Components of a data integration system

User query is reformulated into a query over the data sources

Working exampleMediated schema:Movie(title, director, year, genre), Actors(title, name)Plays(movie, location, startTime)Reviews(title, rating, description)Data sources:S1:Movie(MID, title)Actor(AID, firstName, lastName, nationality, yearOfBirth)ActorPlays(AID, MID)MovieDetail(MID, director, genre, year)S2: S3:Cinemas(place,movie,start) NYCCinemas(name, title,

startTime)S4: S5:Reviews(title,date,grade,review) MovieGenres(title, genre)S6: S7:MovieDirectors(title, dir) MovieYears(title, year)

Components of source descriptions Schema mappings

What data exists in sources How to map terms used in source schemata with

terms used in the mediated schema Information used to optimize queries

Access pattern limitations Because data sources may differ on the access patterns

supported Source completeness

Components of source descriptions Schema mappings

What data exists in sources How to map terms used in source schemata

with terms used in the mediated schema Information used to optimize queries to the

sources and to avoid illegal access patterns Access pattern limitations

Because data sources may differ on the access patterns supported

Source completeness

Schema mapping Main component of a source description Specification of:

What data exists in the source How the terms used in the source schema relate to the

terms used in the mediated schema Needs to handle semantic heterogeneity:

discrepancies between the source schemata and the mediated schema Relation and attribute names Tabular organization Domain coverage Data-level variations

Query reformulation

Besides schema mappings, source descriptions specify information: To enable the data integration system to optimize

queries posed to the sources Knowing that a data source is known to be complete

saves work by not accessing other data sources that have overlapping data

To avoid illegal access patterns Data sources may differ on which access patterns they

support

Schema mapping languages

Schema mapping: set of expressions that describe a relationship between a set of schemata (typically two). In our case, mediator schema and the schema of the sources Used to reformulate a query formulated in terms of the mediated

schema into appropriate queries on the sources. Result is called logical query plan (query expression that refers

only to the relations in the data sources) It will not be always possible to generate a query plan that produces

all the certain answers Two types of algorithms involved:

To find the best possible logical plan To find all the certain answers

Schema mapping languages based on: query expressions

Semantics of schema mappings A semantic mapping M defines a relation MR over:

I(G) X I(S1) X .... X I(Sn)

Where: I(G) denotes the possible instances of the mediated

schema I(S1), ..., I(Sn) denote the possible instances of the source

relations S1, ..., Sn, respectively

If (g, s1, ..., sn) MR, then g is a possible instance of the mediated schema when the source relation instances are s1, ..., sn

Certain answers

Let M be a schema mapping between a mediated schema G and source schemata S1, ..., Sn that defines the relation MR over I(G) X I(S1) X... X I(Sn).

Let Q be a query over G, and let s1,..., sn be instances of the source relations.

We say that t is a certain answer of Q wrt M and s1, ..., sn

if t Q(g) for every instance g of G s.t.

(g, s1, ..., sn) MR

Properties of schema mapping languages Flexibility: the formalism should be able to express

a wide variety of relationships between schemata. Efficient reformulation: reformulation algorithms

should have well understood properties and be efficient Trade-off: flexibility/expressivness vs efficiency

Easy update: Must be easy to add and remove sources

Schema mapping languages: Global-As-View (GAV) Local-As-View (LAV) Global-Local-As-View.(GLAV)

Two systems (GAV) TSIMMIS [Garcia-Molina+97] – Stanford

Focus: semistructured data (OEM), OQL-based language (Lorel) Creates a mediated schema as a view over the sources Spawned a UCSD project called MIX, which led to a company now

owned by BEA Systems Other important systems of this vein: Kleisli/K2 @ Penn

(LAV) Information Manifold [Levy+96] – AT&T Research Focus: local-as-view mappings, relational model Sources defined as views over mediated schema Led to peer-to-peer integration approaches (Piazza, etc.)

Focus: Web-based queriable sources

Global-As-View (GAV) Defines the mediated schema as a set of

views over the data sources Mediated schema also referred as global schema

Let G be a mediated schema, and let S = {S1, ..., Sn} be schemata of n data sources,A Global-As-View schema mapping M is a set of expressions of the form:

Gi(X) Q(S) or Gi(X) = Q(S), where Gi is a relation in G, and appears in at most one

expression in M, and Q(S) is a query over the relations in S

Working exampleMediated schema:Movie(title, director, year, genre), Actors(title, name)Plays(movie, location, startTime)Reviews(title, rating, description)Data sources:S1:Movie(MID, title)Actor(AID, firstName, lastName, nationality, yearOfBirth)ActorPlays(AID, MID)MovieDetail(MID, director, genre, year)S2: S3:Cinemas(place,movie,start) NYCCinemas(name, title,


Example of a GAV schema mappingMovie(title, director, year, genre) S1.Movie(MID, title), S1.MovieDetail(MID, director, genre, year)

Movie(title, director, year, genre) S5.MovieGenres(title, genre),S6.MovieDirectors(title, director),S7.MovieYears(title, year)

Plays(movie, location, startTime) S2.Cinemas(location, movie, startTime)

Plays(movie, location, startTime) S3.NYCCinemas(location, movie, startTime)

GAV semantics Let M = M1, ..., Ml be a GAV schema

mapping between G and S = {S1, ..., Sn}, where Mi is of the form Gi(X) Qi(S), or Gi(X) = Qi(S).

Let g be an instance of the mediated schema G, and let s = s1, ..., sn be instances of S1, ...Sn, respectively. The tuple of instances (g, s1, ..., sn) is in MR if for every 1<=i<=l, the following holds: If Mi is a = expression, then the extension of Gi in

g is equal to the result of evaluating Qi on s, If Mi is a expression, then the extension of Gi

in g is a superset of the result of evaluating Qi on s

Reformulation in GAV

To reformulate a query posed over the mediated schema, simply unfold the query with the view definitions

The reformulation resulting from the unfolding is guaranteed to find all the certain answers

Example The query Q, over the mediated schema, asks for comedies starting after 8pm:

Q(title,location,startTime) :- Movie(title,director,year,“comedy”),Plays(title, location, st), st >= 8pm

Reformulating Q with the source descriptions would yield thefollowing four logical query plans:

Q’(title, location, startTime) :- S1.Movie(MID, title),S1.MovieDetail(MID, director, “comedy”,

year),S2.Cinemas(location, movie, st), st >= 8pm

Q’(title, location, startTime) :- S1.Movie(MID, title),MovieDetail(MID, director, “comedy”, year),S3.NYCCinemas(location, title, st), st >= 8pm

Q’(title, location, startTime) :- S5.MovieGenres(title, “comedy”), S6.MovieDirectors(title, director),S7.MovieYears(title, year), S2.Cinemas(location, title, st), st >= 8pm

Q’(title, location, startTime) :- S5.MovieGenres(title, “comedy”), S6.MovieDirectors(title, director),S7.MovieYears(title, year), S3.NYCCinemas(location, title, st), st >= 8pm

Limitations The reformulation may not be the most efficient

method to answer the query Some subgoals may be redundant

In the last two reformulations, the subgoals: S6.MovieDirectors and S7.MovieYears are not needed, since what is really needed for the Movies relations is the genre of the movie.

But there is no way of concluding this in GAV descriptions Adding and removing sources involves considerable

work and knowledge of the sources -> potentially not scalable Ex: if we discover another source that includes only movie

directors To update the source descriptions we need to specify

exactly which sources it needs to be joined with in order to produce tuples of Movie

TSIMMIS [Garcia-Molina+97]

One of the first systems to support semi-structured data according to the OEM data model, which predated XML by several years.

Mediator Specification Language (MSL): logic-based OO language used as a view definition language targeted to the OEM data model and to the integration of heterogeneous data sources Based on Datalog, among others

Wrappers accept queries expressed in MSL and compare them with the patterns (MSL templates) given in the wrapper specification file

An instance of a GAV mediation system We define our global schema as views over the sources

XML vs. Object Exchange Model<book> <author>Bernstein</author> <author>Newcomer</author> <title>Principles of TP</title></book>

<book> <author>Chamberlin</author> <title>DB2 UDB</title></book>

O1: book { O2: author { Bernstein } O3: author { Newcomer } O4: title { Principles of TP }

}

O5: book { O6: author { Chamberlin } O7: title { DB2 UDB }}

User queries in TSIMMISSpecified in OQL-style language called Lorel

OQL was an object-oriented query language that looks like SQL Lorel is, in many ways, a predecessor to XQuery

Based on path expressions over OEM structures: select book where book.title = “DB2 UDB” and book.author =

“Chamberlin”

This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated:

for $b in AllData()/bookwhere $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”return $b

Query Answering in TSIMMIS

Basically, it’s view unfolding, i.e., composing a query with a view

The query is the one being asked The views are the MSL templates for the wrappers Some of the views may actually require

parameters, e.g., an author name, before they’ll return answers Common for web forms (see Amazon, Google, …) XQuery functions (XQuery’s version of views) support

parameters as well, so we’ll see these in action

Recall SQL View Unfolding/Expansion A view consisting of branches and their customers

Find all customers of the Perryridge branch

create view all_customer as (select branch_name, customer_name from depositor, account where depositor.account_number =

account.account_number ) union (select branch_name, customer_name from borrower, loan where borrower.loan_number = loan.loan_number )

select customer_namefrom all_customerwhere branch_name = 'Perryridge'

A Wrapper Definition in MSLWrappers have templates and binding patterns ($X) in MSL:

B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X //

If the template is matched by the query issued to the mediator, an SQL query is issued over Book(author, year, title), which is the relation stored in the data source

In XQuery, this might look like:define function GetBook($x AS xsd:string) as book {

for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x

+”’”)return <book>{$b/title}<author>$x</author></book>

}

…

The GetBook’s results is unioned with others to form the view Mediator()

How to Answer the Query

Given our query:for $b in Mediator()/bookwhere $b/title/text() = “DB2 UDB” and

$b/author/text() = “Chamberlin”return $b

Find all wrapper definitions that: Contain enough “structure” to match the

conditions of the query Or have already tested the conditions for us!

Query Composition with Views

We find all views that define book with author and titleas output, and we compose the query with each:

define function GetBook($x AS xsd:string) as book {for $b in

sql(“Amazon.DB”, “select * from book where author=‘” + $x +

“’”)return <book> {$b/title} <author>{$x}</author></book>

}

for $b in Mediator()/bookwhere $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”return $b

book

title author

…

…

Matching View Output to Our Query’s Conditions

Determine that $b/author/text() $x by matching the pattern on the function’s output:define function GetBook($x AS xsd:string) as book {

for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x +

“’”)return <book>{ $b/title } <author>{$x}</author> </book>

}

let $x := “Chamberlin”for $b in GetBook($x)/bookwhere $b/title/text() = “DB2 UDB” return $b

book

title author

… …

The Final Step: Unfoldinglet $x := “Chamberlin”

for $b in ( for $b’ in sql(“Amazon.com”,

“select * from book where author=‘” + $x + “’”) return <book>{ $b/title }<author>{$x}</author></book> )/bookwhere $b/title/text() = “DB2 UDB” return $b

This can be simplified into:

for $b in sql(“Amazon.com”,“select * from book where author=‘Chamberlin’”)

where $b/title/text() = “DB2 UDB” return $b

Virtues of TSIMMIS

Early adopter of semistructured data, greatly predating XML Can support data from many different kinds of

sources Obviously, doesn’t fully solve heterogeneity

problem Presents a mediated schema that is the union

of multiple views Query answering based on view unfolding

Easily composed in a hierarchy of mediators

Big limitation of TSIMMIS

Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema

Local-As-View

Opposite approach to GAV Focus on describing each data source as

precisely as possible and independently of any other sources Instead of specifying how to compute tuples of the

mediated system LAV expressions describe data sources as

queries over the mediated schema

The Local-as-View Model

The basic model is the following: Local sources are views over the mediated schema Sources have the data – mediated schema is virtual Sources may not have all the data from the domain –

“open-world assumption”

The system must use the sources (views) to answer queries over the mediated schema

LAV schema mappings

Let G be a mediated schema and

let S = {S1, ..., Sn} be schemata of n data sources.

A Local-As-View schema mapping M is a set of expressions of the form Si(X) Qi(G) or Si(X) = Qi(G), where: Qi is a query over the mediated schema G, and Si is a source relation and it appears in at most

one expression in M

Recap. exampleMediated schema:Movie(title, director, year, genre), Actors(title, name)Plays(movie, location, startTime)Reviews(title, rating, description)Data sources:S1:Movie(MID, title)Actor(AID, firstName, lastName, nationality, yearOfBirth)ActorPlays(AID, MID)MovieDetail(MID, director, genre, year)S2: S3:Cinemas(place,movie,start) NYCCinemas(name, title,


LAV example In LAV, sources S5-S7 would be described as projection

queries over the Movie relation in the mediated schemaS5.MovieGenres(title, genre) Movie(title, director, year, genre)

S6.MovieDirectors(title, dir) Movie(title, director, year, genre)

S7.MovieYears(title, year) Movie(title, director, year, genre)

In LAV, we can express constraints on the contents of data sources

S9(title, year, “comedy”) Movie(title, director, year, “comedy”), year >= 1970

LAV semantics Let M= M1, ..., Ml be a LAV schema mapping

between G and S ={S1, ..., Sn}, where Mi is of the form Si(X) Qi(G) or Si(X) = Qi(G).

Let g be an instance of the mediated schema G, and let s = s1, ..., sn be instances of S1, ..., Sn, respectively. The tuple of instances (g, s1, ..., sn) is in MR if for every 1<=i<=l, the following holds: If Mi is an expression, then the result of evaluating Qi over

g is equal to si If Mi is a expression, then the result of evaluating Qi over

g is a subset of si

Reformulation in LAVMain advantages: flexibility + enables expressing

incomplete information data sources are described in isolation => the system, and not the

designer, will find ways of combining data from multiple sources Easier for a designer to add/remove sources

Example: Q(title) :- Movie(title, director, year, “comedy”), year >= 1960

Using sources S5-S7, we obtain the reformulation:Q’(title) :- S5.MovieGenres(title, “comedy”), S7.MovieYears(title, year), year >= 1960

Using source S9, we obtain the reformulation:Q’(title) :- S9(title, year, “comedy”)

The Information Manifold [Levy+96] When you integrate something, you have some conceptual model of the integrated domain

Define that as a basic frame of reference, everything else as a view over it

Local as View

May have overlapping/incomplete sources Define each source as the subset of a query over the

mediated schema We can use selection or join predicates to specify that a

source contains a range of values:ComputerBooks(…) Books(Title, …, Subj), Subj = “Computers”

Advantages and Shortcomings of LAV Enables expressing incomplete information More robust way of defining mediated schemas

and sources Mediated schema is clearly defined, less likely to

change Sources can be more accurately described

Computationally more expensive!

References Chapter 4, Draft of the book on “Principles of Data Integration” by

AnHai Doan, Alon Halevy, Zachary Ives (in preparation). Sudarshan Chawathe, Hector Garcia-Molina, Joachim

Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom.The TSIMMIS project: Integration of heterogeneous information sources. In proceedings of IPSJ, Tokyo, Japan, October 1994.

Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In Proceedings of the International Conference on Very Large Databases (VLDB), 1996.

Zach Ives, slides of the course: “Database and Information Systems”, Fall 2007, available at: http://www.seas.upenn.edu/~zives/07f/cis550/

Documents

Describing data sources. Outline Overview Schema mapping languages