Scaling Heterogeneous Databases and Design of DISCO Anthony Tomasic Louiqa Raschid Patrick Valduriez Presented by: Nazia Khatir Texas A&M University


Scaling Heterogeneous Databases and Design of DISCO

Anthony Tomasic
Louiqa Raschid
Patrick Valduriez

Presented by:
Nazia Khatir
Texas A&M University


Distributed Information Search COmponent (DISCO)

The distributed mediator architecture of DISCO

• Query processing semantics
• Data models
• The interface to underlying data sources


Introduction

Access to a large number of heterogeneous, distributed data sources introduces new problems:

End users and application programmers

Unavailable data sources
• To answer a query involving n databases, all n databases must be available; otherwise either no answer or only a partial answer is returned.
• The availability of answers in the system declines as the number of databases rises.
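As a rough illustration of the last point, assume (the slides do not state this) that each database is available independently with the same probability p. A query that needs all n databases then gets a complete answer with probability p^n. A minimal Python sketch; the function name and the value 0.99 are illustrative only:

```python
def prob_all_available(p: float, n: int) -> float:
    """Probability that all n sources respond, assuming each one is
    independently available with probability p (an assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Availability of a complete answer for a growing number of sources.
    for n in (1, 10, 100):
        print(n, round(prob_all_available(0.99, n), 3))
```

Under this assumption, with p = 0.99 the chance of a complete answer drops to roughly 0.37 once a query spans 100 databases.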


Introduction (Cont.)

Access to a large number of heterogeneous, distributed data sources introduces new problems:

Database Administrators (DBA)

Incorporating new sources into the model
• Schemas must be changed
• Catalogs must be updated
• New definitions must be added


Introduction (Cont.)

Access to a large number of heterogeneous, distributed data sources introduces new problems:

Database Implementors (DBI)

Translation of queries between query languages and schemas
• New code must be written


DISCO Architecture

A : Application

M : Mediator

C : Catalog

W : Wrapper

D : Data Source

Arcs represent exchange of queries and answers


Applications (A)

• Written by application programmers
• Access a uniform representation of the underlying sources through a uniform query language


Mediators (M)

• Permit a collection of databases to be accessed in a uniform way
• Accept queries and transform them into sub-queries
• Keep summary state information about their associated databases


Catalogs (C)

• Special mediators
• Keep track of the collection of databases, wrappers, and mediators
• Provide an overview of the entire system


Wrappers (W)

• Deal with the heterogeneous nature of databases
• Transform sub-queries: map from the general query language used by mediators to the source query language
• Reformat answers (data) into the form appropriate for each mediator


Features of DISCO

For application programmers
• Provides new query processing semantics that ease dealing with unavailable data sources


Features of DISCO

For the DBA
• Models data sources as objects, which permits powerful modeling capabilities
• Supports type transformations to ease the incorporation of new data sources into a mediator


Features of DISCO

For the DBI
• Provides a flexible wrapper interface to ease the construction of wrappers


Data Model

Figure: person (type) and person (extent), with the extents person0, person1, and person2 over the data sources r0 (Mary, 200), r1 (Sam, 150), and r2.

select x.name
from x in person
where x.salary > 10

The answer is Bag("Mary", "Sam"), of type Bag.

• (Programmer viewpoint) The same query would access the third data source as well.
• (DBA viewpoint) The model supports dissimilar structures.


Wrapper Interface

DISCO provides a flexible wrapper interface for the DBI.

The interface to wrappers is at the level of an abstract algebraic machine (AM) of logical operators. The DBI implements the logical operators and a call in the wrapper interface that returns the grammar of the logical expressions the wrapper accepts.

• During query processing, the mediator generates a logical expression.
• The mediator calls the wrapper interface to get the grammar and checks that the logical expression matches the grammar.

Figure: the Mediator communicates with the Wrapper through the Interface (Algebraic Machine).
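The grammar check can be sketched as follows. This is a simplified illustration, not DISCO's actual interface: the "grammar" is reduced to a set of supported operator names, logical expressions are encoded as nested tuples, and the class and function names are hypothetical.

```python
from typing import Set, Tuple, Union

# A logical expression is either a plain name (str) or a tuple
# (operator, operand, ...). This encoding is an assumption of the sketch.
Expr = Union[str, Tuple]


class PostgresLikeWrapper:
    """Hypothetical wrapper that supports scan, select and project."""

    def grammar(self) -> Set[str]:
        # Stand-in for the wrapper call that returns the grammar.
        return {"scan", "select", "project"}


def matches(expr: Expr, operators: Set[str]) -> bool:
    """True if every operator used in expr is supported by the wrapper."""
    if isinstance(expr, str):  # leaf: a source name, attribute or predicate
        return True
    op, *operands = expr
    return op in operators and all(matches(o, operators) for o in operands)


w = PostgresLikeWrapper()
pushable = ("project", ("select", ("scan", "person0"), "salary > 10"), "name")
not_pushable = ("join", ("scan", "person0"), ("scan", "person1"))
print(matches(pushable, w.grammar()))
print(matches(not_pushable, w.grammar()))
```

In this sketch the first expression passes the check, while the second is rejected because the wrapper does not list a join operator, so the mediator would have to evaluate that operator itself.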


Mediator Data Model
Extensions to the ODMG standard

ODMG (Object Data Management Group)
• Object Data Model
• Object Definition Language (ODL)
• Object Query Language (OQL)
• Language binding


Mediator Data Model
Extensions to the ODMG standard

Object Data Model
• interface: defines a type signature for an object
• extent: automatically maintains the collection of objects of the interface, i.e. an extent is a named variable whose value is the collection of all objects of the associated interface. When objects are created or destroyed, the extent is updated automatically.


Mediator Data Model
Extensions to the ODMG standard

Object Definition Language (ODL)
• wrapper: models wrappers
• repository: the address of a database or some other type of repository. A repository may contain several data sources; each data source in a repository is associated with an extent.


Mediator Data Model
Extensions to the ODMG standard

Define access to a data source

1. Create an instance of the repository type:

r0 := Repository (host = "rodin.inria.fr", name = "db", address = "123.45.6.7")

2. Locate the wrapper (written by a database implementor):

w0 := WrapperPostgres ( );

3. Define the interface (type) in the mediator that corresponds to the data source objects, e.g. the Person type corresponds to the objects in data sources r0 and r1:

interface Person { attribute String name; attribute Short salary; }

4. Specify the extent of this mediator type, which accesses r0 through the wrapper w0:

extent person0 of Person wrapper w0 repository r0;

Each DISCO extent represents a collection of data in one data source.


Mediator Data Model
Extensions to the ODMG standard

Data access from the data source

The query

select x.name
from x in person0
where x.salary > 10

returns the answer Bag("Mary").

Addition of a new extent of the Person type:

extent person1 of Person wrapper w0 repository r1;

To access objects in both data sources, the query

select x.name
from x in union (person0, person1)
where x.salary > 10

returns the answer Bag("Mary", "Sam").

Advantage: the query refers to the extents explicitly.
Disadvantage: queries are difficult to express when the extents are not explicitly specified.
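The two queries above can be mimicked in Python to make the bag semantics concrete. This is only a sketch: the dataclass, the helper names, and the use of `collections.Counter` as a bag are choices made for illustration, while the sample rows (Mary, 200 in r0 and Sam, 150 in r1) come from the slides.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Person:
    name: str
    salary: int


# Contents of the two extents, taken from the slides' example data.
person0 = [Person("Mary", 200)]  # extent over repository r0
person1 = [Person("Sam", 150)]   # extent over repository r1


def union(*extents: Iterable[Person]) -> List[Person]:
    """Bag union: concatenate the extents, keeping duplicates."""
    return [obj for extent in extents for obj in extent]


def names_with_salary_above(extent: Iterable[Person], limit: int) -> Counter:
    """select x.name from x in extent where x.salary > limit, as a bag."""
    return Counter(x.name for x in extent if x.salary > limit)


only_r0 = names_with_salary_above(person0, 10)
both = names_with_salary_above(union(person0, person1), 10)
print(only_r0)
print(both)
```

Note that the caller has to list person0 and person1 by name, which is exactly the disadvantage the slide points out.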


Mediator Data Model
Extensions to the ODMG standard

Solution: MetaExtent keeps the details of the extents of all the mediator types. General format of the MetaExtent type, which is created automatically:

interface MetaExtent (extent metaextent) {
    attribute String name;
    attribute Extent e;
    attribute Type interface;
    attribute Wrapper wrapper;
    attribute Repository repository;
    attribute Map map;
}

Definition of the extent person:

interface Person (extent person) { attribute String name; attribute Short salary; }

The extent person is then defined by the query expression:

define person as
flatten (select x.e from x in metaextent where x.interface = Person)

Thus, a query over person dynamically accesses all the extents defined for the type Person.
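A small Python sketch of the same idea, under stated assumptions: the registry is a plain list, `declare_extent` stands in for an `extent ... of ... wrapper ... repository ...` declaration, the `map` attribute is omitted, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class MetaExtentEntry:
    name: str        # extent name, e.g. "person0"
    e: List[Any]     # the objects in the extent
    interface: str   # mediator type the extent belongs to
    wrapper: str
    repository: str


metaextent: List[MetaExtentEntry] = []


def declare_extent(name, objects, interface, wrapper, repository) -> None:
    """Stand-in for an extent declaration: register it in the metaextent."""
    metaextent.append(
        MetaExtentEntry(name, objects, interface, wrapper, repository)
    )


def extent_of(interface: str) -> List[Any]:
    """flatten(select x.e from x in metaextent where x.interface = interface)"""
    return [obj for x in metaextent if x.interface == interface for obj in x.e]


declare_extent("person0", [("Mary", 200)], "Person", "w0", "r0")
before = extent_of("Person")
declare_extent("person1", [("Sam", 150)], "Person", "w0", "r1")
after = extent_of("Person")
print(before, after)
```

Because `extent_of` consults the registry on every call, a newly declared extent is picked up without changing any query that ranges over person.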


Mediator Data Model
Matching similar and dissimilar structures or substructures

The DBA defines how data is aggregated when multiple data sources are accessed:

• Matching similar substructures: subtype
• Matching similar structures: map
• Matching dissimilar structures: view


Mediator Data Model
Matching similar substructures

Subtyping (ODMG standard)

Example: the DBA defines the Student interface as a subtype of Person, together with two extents:

interface Student: Person { }
extent student0 of Student wrapper w0 repository r2
extent student1 of Student wrapper w0 repository r3

The person extent still contains only person0 and person1. It does not automatically reference the extents of its subtypes in the subtype hierarchy.

DISCO solution: the special syntax person*, which also references the extents of the subtypes.
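The difference between person and person* can be sketched in Python. The two dictionaries below are a hypothetical stand-in for the mediator's type hierarchy and extent declarations; only the interface and extent names are taken from the slides.

```python
from typing import Dict, List, Optional

# Hypothetical registry: direct supertype of each interface (None for roots).
supertype: Dict[str, Optional[str]] = {"Person": None, "Student": "Person"}

# Extents declared for each interface, as in the slides.
extents: Dict[str, List[str]] = {
    "Person": ["person0", "person1"],
    "Student": ["student0", "student1"],
}


def is_subtype(t: Optional[str], ancestor: str) -> bool:
    """True if t is ancestor itself or a (transitive) subtype of it."""
    while t is not None:
        if t == ancestor:
            return True
        t = supertype[t]
    return False


def plain_extent(interface: str) -> List[str]:
    """person: only the extents declared on the interface itself."""
    return list(extents[interface])


def star_extent(interface: str) -> List[str]:
    """person*: extents of the interface and of all its subtypes."""
    return [e for t in extents if is_subtype(t, interface) for e in extents[t]]


print(plain_extent("Person"))
print(star_extent("Person"))
```

Here `plain_extent("Person")` yields only person0 and person1, while `star_extent("Person")` also includes student0 and student1.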


Mediator Data Model
Matching similar structures

Mapping example:

interface PersonPrime { attribute String n; attribute Short s; }
extent personprime0 of PersonPrime wrapper w0 repository r0;

Since the objects returned from r0 are of type Person, the extent personprime0 has a type conflict with the returned objects. To avoid a run-time error, DISCO allows the DBA to resolve this type conflict.


Mediator Data Model
Matching similar structures

Mapping example (Cont.): the type conflict is resolved by specifying a mapping between a mediator type and a data source type. The mapping function is called the local transformation map.

extent personprime0 of PersonPrime wrapper w0 repository r0
    map ((person0=personprime0), (name = n), (salary = s));
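The effect of a local transformation map can be illustrated with a short Python sketch. Representing objects as dictionaries and the map as an attribute-renaming table is an assumption made for illustration; `apply_map` is a hypothetical helper, not part of DISCO.

```python
from typing import Dict, List

# Local transformation map from the slide:
# source attribute -> mediator attribute.
attribute_map: Dict[str, str] = {"name": "n", "salary": "s"}


def apply_map(source_object: Dict[str, object],
              mapping: Dict[str, str]) -> Dict[str, object]:
    """Rename the attributes of one source object to the mediator's names."""
    missing = [a for a in mapping if a not in source_object]
    if missing:
        raise KeyError(f"source object lacks attributes: {missing}")
    return {mapping[a]: source_object[a] for a in mapping}


# Objects as r0 returns them (type Person), using the slides' sample data.
person0: List[Dict[str, object]] = [{"name": "Mary", "salary": 200}]
personprime0 = [apply_map(obj, attribute_map) for obj in person0]
print(personprime0)
```

With the map in place, the objects seen through personprime0 carry the attributes n and s expected by PersonPrime, so the type conflict does not surface at run time.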


Mediator Data Model
Matching dissimilar structures

View in DISCO

Example:

interface PersonTwo {
    attribute String name;
    attribute Short regular;
    attribute Short consult;
}

extent persontwo0 of PersonTwo wrapper w0 repository r5;

View definition to aggregate over the data sources:

define personnew as
bag (select struct (name : x.name, salary : x.salary)
     from x in person,
     select struct (name : x.name, salary : x.regular + x.consult)
     from x in persontwo0)

Views can reference other views but are not updatable.
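The view can be sketched in Python as a function that reconciles the two structures into one. The person rows come from the slides; the persontwo0 row (Ann, 100, 20) is invented purely for illustration, and dictionaries stand in for structs.

```python
from typing import Dict, List

# person rows are from the slides; the persontwo0 row is made up.
person: List[Dict] = [
    {"name": "Mary", "salary": 200},
    {"name": "Sam", "salary": 150},
]
persontwo0: List[Dict] = [{"name": "Ann", "regular": 100, "consult": 20}]


def personnew() -> List[Dict]:
    """View over both sources, recomputed on each call."""
    from_person = [{"name": x["name"], "salary": x["salary"]} for x in person]
    from_persontwo = [
        # Dissimilar structure: salary is the sum of two source attributes.
        {"name": x["name"], "salary": x["regular"] + x["consult"]}
        for x in persontwo0
    ]
    return from_person + from_persontwo


print(personnew())
```

Each result row has the same (name, salary) shape regardless of which source it came from.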


Mediator Query Processing


Query Processing With Unavailable Data

There are three possibilities if a data source does not respond:

1. The system waits.
2. The system assumes that the unavailable source does not exist, or treats it as having no matching tuples.
3. The system returns a partial answer.

DISCO applies partial evaluation semantics to queries: it processes as much of the query as possible from the information that is available. Thus, the answer to a query may be another query.


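Query Processing With Unavailable Data (Cont.)

Assume r0 does not respond. The query

select x.name
from x in person
where x.salary > 10

returns the partial answer

union (select y.name from y in person0 where y.salary > 10, Bag("Sam"))

The first argument of union is the partial answer (query): the part of the query that could not be evaluated because r0 is unavailable. The second argument, Bag("Sam"), is the partial answer (data).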
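A minimal Python sketch of partial evaluation for this example. It is not DISCO's algorithm: the convention that an unavailable source returns `None`, the `Fetch` type, and the function `partial_eval` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Each extent is fetched through a callable that returns its rows, or
# None when the source does not respond (a hypothetical convention).
Fetch = Callable[[], Optional[List[Dict]]]


def partial_eval(sources: Dict[str, Fetch],
                 limit: int) -> Tuple[List[str], Counter]:
    """Evaluate `select x.name from x in person where x.salary > limit`.

    Returns (residual sub-queries for unavailable extents, bag of names).
    """
    residual: List[str] = []
    data: Counter = Counter()
    for name, fetch in sources.items():
        rows = fetch()
        if rows is None:
            # Source unavailable: keep this part of the query for later.
            residual.append(
                f"select y.name from y in {name} where y.salary > {limit}"
            )
        else:
            data.update(r["name"] for r in rows if r["salary"] > limit)
    return residual, data


sources = {
    "person0": lambda: None,                              # r0 does not respond
    "person1": lambda: [{"name": "Sam", "salary": 150}],  # r1 responds
}
residual, data = partial_eval(sources, 10)
print(residual, data)
```

The residual list plays the role of the partial answer (query) and the bag plays the role of the partial answer (data); resubmitting the residual once r0 is back would complete the answer.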


Conclusion

The design of DISCO provides solutions to some of the problems encountered when scaling up the number of data sources in heterogeneous distributed databases:

• Partial evaluation query semantics: for application programmers (AP)
• Data modeling tools: for the DBA
• Flexible wrapper interface: for the DBI