
Data Integration Systems overview


2005 Integration-intro 1

Data Integration Systems overview

The architecture of a data integration system: Components and their interaction Tasks Concepts


Main components of a DI system

(I) Mediator (Hebrew: מתווך)

Supports in its user interface:

• The global data model

• The integrated / global / mediated schema / world view

• A query language

Manages the interaction with sources

• Posing queries

• Receiving answers, transforming and showing them

Is responsible for query execution strategies

• planning

• carrying out


(II) Wrapper (Hebrew: עוטף)

Serves as the interface to a source

• Receive queries from a mediator

• Plan and execute how to retrieve the data from its source

• Transform data to global data model

• Send to mediator

For an SQL source, these are rather easy

For a restricted capability source, may require

• A series of queries on the source, or

• A program to be executed (on a non-db source)

• Filtering results obtained from the source
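The restricted-capability case can be sketched as follows. This is only an illustration under stated assumptions: `lookup_by_course` and its data are hypothetical stand-ins for a source that can be probed with a single course at a time and cannot filter by teacher itself.

```python
# Hypothetical restricted source: it can only be probed with one course at a time.
_SOURCE_DATA = {"DB": ["Beeri"], "OS": ["Cohen"], "AI": ["Beeri", "Levi"]}


def lookup_by_course(course):
    """The only access method the (assumed) source offers."""
    return _SOURCE_DATA.get(course, [])


def wrapper_teaches(courses, teacher=None):
    """Answer Teaches(C, T) for the given courses, optionally for one teacher.

    The wrapper issues a series of single-course queries, filters the
    results locally, and returns tuples in the global data model.
    """
    answers = []
    for course in courses:                          # a series of queries on the source
        for t in lookup_by_course(course):
            if teacher is None or t == teacher:     # filtering done by the wrapper
                answers.append((course, t))         # transform to (C, T) tuples
    return answers


print(wrapper_teaches(["DB", "OS", "AI"], teacher="Beeri"))
```

The mediator sees only `wrapper_teaches`; how many probes the source needed stays hidden behind the wrapper.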


A simple architecture:

Arrows represent query and data flow

[Figure: one mediator, two wrappers, and two sources.]


A more complex architecture:

Mediators can serve as wrapped sources for other mediators

[Figure: two mediators, three wrappers, and three sources.]


Important:

• The global database is virtual – it contains no data

• The data resides in the sources

• The users pose queries as if the data resided in the global db

• Users may/may not be aware that the data actually comes from the sources


Main tasks & activities:

At mediator:

• Query reformulation & decomposition

  – express queries in terms of the sources’ schemas

  – decompose into queries on sources

• Planning query execution, including optimization

  – a declarative query may be executed in various ways (even in a single centralized db)

  – different sources may provide the same data at different costs (money, communication time, response time, delays, …)

  – if data is associated with user priorities, we may want to retrieve some answers before others

• When answers arrive, fuse them

  – a full answer is not a simple union of partial answers; data on an entity must be combined (fused) into a single record
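The fusion step can be illustrated with a small sketch. The record layout, the `key` attribute, and the conflict rule (keep the first non-missing value seen) are assumptions made for this example, not part of the slides.

```python
def fuse(partial_answers, key):
    """Merge partial records about the same entity into one record.

    partial_answers: iterable of dicts, possibly coming from different sources.
    key: the attribute that identifies an entity (assumed shared by all sources).
    Assumption: on conflict, the first non-missing value seen is kept.
    """
    fused = {}
    for record in partial_answers:
        target = fused.setdefault(record[key], {})
        for attr, value in record.items():
            if value is not None and target.get(attr) is None:
                target[attr] = value
    return list(fused.values())


# Made-up partial answers from two sources about the same person.
from_source_a = [{"name": "Dana", "office": "R12", "phone": None}]
from_source_b = [{"name": "Dana", "phone": "555-0101"},
                 {"name": "Avi", "email": "avi@example.org"}]

print(fuse(from_source_a + from_source_b, key="name"))
```

A plain union would return two separate records for Dana; fusion returns one record carrying both the office and the phone.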


Requirements: (from mediator, wrapper, system)

Ability to handle:

• Incomplete information (data may be missing from available sources)

• Heterogeneity – in data model, schema, contents

• Both data and meta-data

Ability to describe sources:

• Capabilities

  – what queries can a source answer?

  – what mechanisms does it offer for data retrieval?

• Coverage, which allows us to know

  – where data relevant to a query can be found

  – whether there is overlap between sources


The relationship between source and global data

The global data is virtual; the mediated schema describes data that

• resides in the sources

• is described by source schemas

The relationship between the mediated and source data determines how queries are answered

Two main approaches

(a combination of the two – later)


Global as View – GAV

The global db is defined as a view on the sources

In relational model:

Each global relation defined as a view, by a query on sources

Obvious advantage:

simplicity of query answering

Given Q on global relations,

• expand it: replace each atom R(x) by an expression on sources, using the definition of R

• Then send appropriate sub-queries to sources


Simple example: a university database

Source A:

• Dept(D, C) – departments and their courses

• Teaches(C,T) – teachers of courses

Source B:

• Enroll(S, C) – student enrollment in courses

Integrated schema & its definition:

• Stud(S, D, T) :- Dept(D, C), Teaches(C, T), Enroll(S, C)

Query Q:

• Stud(S, ‘CS’, ‘Beeri’)

Expand body to Dept(‘CS’, C), Teaches(C, ‘Beeri’), Enroll(S, C)

Then use one of (at least) two execution strategies on sources A,B
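The expansion step for this example can be sketched in a few lines. The tuple encoding of atoms and the convention that constants are written with quotes are assumptions of this sketch; it handles a single view definition and does not rename body variables.

```python
# GAV definition from the example: Stud(S, D, T) :- Dept(D, C), Teaches(C, T), Enroll(S, C)
VIEW_HEAD = ("Stud", ("S", "D", "T"))
VIEW_BODY = [("Dept", ("D", "C")), ("Teaches", ("C", "T")), ("Enroll", ("S", "C"))]


def expand(atom, head, body):
    """Replace a global atom by the body of its GAV definition.

    Head variables are substituted by the terms used in the query atom.
    Caveat: body-only variables (here C) are kept as they are; a full
    implementation would rename them if the query already used them.
    """
    name, args = atom
    head_name, head_vars = head
    assert name == head_name and len(args) == len(head_vars)
    subst = dict(zip(head_vars, args))              # head variable -> query term
    return [(rel, tuple(subst.get(t, t) for t in terms)) for rel, terms in body]


# Query Q: Stud(S, 'CS', 'Beeri'); constants are written with quotes.
query_atom = ("Stud", ("S", "'CS'", "'Beeri'"))
for rel, terms in expand(query_atom, VIEW_HEAD, VIEW_BODY):
    print(f"{rel}({', '.join(terms)})")
```

The printed atoms are the sub-queries that would then be routed to sources A and B.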


Local as View – LAV

The global database is viewed as the “real world”

Each source is defined as a view on it

Example (revisited):

Global schema: Univ(D, C, T, S)

Source A:

• Dept(D, C) :- Univ(D, C, T, S)

• Teaches(C, T) :- Univ(D, C, T, S)

Source B:

• Enroll(S, C) :- Univ(D, C, T, S)
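To make the meaning of these definitions concrete, the sketch below computes the three source relations from a tiny, made-up instance of Univ. The instance is hypothetical (in a real system the global database is virtual), and the sketch assumes that each source contains all the data in its defining view.

```python
# A tiny, made-up instance of the global relation Univ(D, C, T, S).
univ = {
    ("CS", "DB", "Beeri", "Dana"),
    ("CS", "DB", "Beeri", "Avi"),
    ("Math", "Algebra", "Levi", "Dana"),
}

# Each source relation is a projection of Univ, following its LAV definition.
dept = {(d, c) for (d, c, t, s) in univ}      # Dept(D, C)    :- Univ(D, C, T, S)
teaches = {(c, t) for (d, c, t, s) in univ}   # Teaches(C, T) :- Univ(D, C, T, S)
enroll = {(s, c) for (d, c, t, s) in univ}    # Enroll(S, C)  :- Univ(D, C, T, S)

print(sorted(dept))
print(sorted(teaches))
print(sorted(enroll))
```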


Possible assumptions on sources:

• A source contains all data in its defining view

• A source contains some of the data in its view, usually not all

The second assumption is more realistic

Example:

Global database describes cars for sale

A source may contain:

• only some of the attributes of cars present in the global schema

(e.g., it may not contain history, or owner-contact)

• Only some of the cars for sale

The two assumptions correspond to full views and contained views, respectively

Obviously, the more sources we have, the more cars we can find


Query answering in LAV:

• Expansion is not possible

• An approach: answering queries using views

  – practically: rewriting queries using views (differences explored later)

Only the views have data, so rewrite the query to an expression over the views. The expression must be (explained in more detail later):

• Full views: equivalent to the query

• Contained views: contained in the query

A solution may or may not exist (in contrast to expansion), and finding it is more difficult

This problem was explored in many contexts, e.g., query optimization using views / previous answers


Why prefer LAV to GAV?

• Ease of expanding a system:

  – In GAV, adding a source may require re-definition of the global schema, which makes it difficult to add sources

  – In LAV, just define the new source as a view; given an algorithm for using views to answer queries, it automatically uses the new source

As for expanding queries vs. using views:

Even in GAV, when sources have restricted capabilities, query answering requires using views


• Typically, a global schema reflects a real ‘world’, as we know it; each source materializes only a fragment

  – Horizontal – not all entity types or attributes are present

  – Vertical – not all entities of a type are present

Thus, it is natural to define the sources as (contained) views

Examples:

• Cars for sale:

  – the global db reflects our understanding and requirements

  – a source provides only some info, only on the cars it has

• Looking for personal information using UNIX facilities

  – we know about: name, office, phone, email, …

  – each facility may offer only some of the above


LAV is a natural approach in the presence of

• The WWW, with its diversity and dynamicity of sources

• Legacy systems

Most research efforts & systems are LAV


On rewriting queries using views:

It is not clear (for now) how to obtain a rewriting, given Q

But, given v1(..), v2(..), …, vn(..) as a candidate, we may

• expand each vi using its definition in terms of the global schema

• check whether the resulting expression is equivalent to or contained in Q

(both Q and the expansion are in terms of global schema relations)

Equivalence and containment of queries are fundamental problems for data integration


Example (our LAV example):

Q: ans(S, ‘CS’, ‘Beeri’) :- Univ(‘CS’, C, ‘Beeri’, S)

Guess an answer in terms of views:

ans′(S, ‘CS’, ‘Beeri’) :-

Dept(‘CS’, C), Teaches(C, ‘Beeri’), Enroll(S, C)

(Note: must use distinct variables in different expansions for all non-join variables)

Is the query equivalent to this expansion?

Is the expansion contained in the query?

Univ(‘CS’, C, T1, S1), Univ(D2, C, ‘Beeri’, S2), Univ(D3, C, T3, S)
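The two questions can be explored with a brute-force containment test for conjunctive queries. The sketch relies on the standard containment-mapping criterion (q1 is contained in q2 iff q2's variables can be mapped to q1's terms so that q2's head becomes q1's head and each body atom of q2 becomes a body atom of q1). The tuple encoding and the quoted-constant convention are assumptions of this example, and the exhaustive search is only suitable for tiny queries.

```python
from itertools import product


def is_var(term):
    # Assumed convention: constants are written with quotes, e.g. "'CS'".
    return not term.startswith("'")


def contained_in(q1, q2):
    """Return True if conjunctive query q1 is contained in q2.

    A query is a pair (head_terms, body), where body is a list of
    (relation_name, terms) atoms over the same schema.
    """
    (head1, body1), (head2, body2) = q1, q2
    terms1 = sorted({t for _, args in body1 for t in args} | set(head1))
    vars2 = sorted({t for _, args in body2 for t in args if is_var(t)}
                   | {t for t in head2 if is_var(t)})
    atoms1 = set(body1)
    for image in product(terms1, repeat=len(vars2)):    # try every mapping
        h = dict(zip(vars2, image))

        def m(t):
            return h.get(t, t)                           # constants map to themselves

        if tuple(map(m, head2)) != tuple(head1):
            continue
        if all((rel, tuple(map(m, args))) in atoms1 for rel, args in body2):
            return True
    return False


HEAD = ("S", "'CS'", "'Beeri'")
Q = (HEAD, [("Univ", ("'CS'", "C", "'Beeri'", "S"))])
EXPANSION = (HEAD, [("Univ", ("'CS'", "C", "T1", "S1")),
                    ("Univ", ("D2", "C", "'Beeri'", "S2")),
                    ("Univ", ("D3", "C", "T3", "S"))])

print("Q contained in expansion:", contained_in(Q, EXPANSION))
print("expansion contained in Q:", contained_in(EXPANSION, Q))
```

Under this test, Q is contained in the expansion but the expansion is not contained in Q: no single atom of the expansion carries ‘CS’, ‘Beeri’, and S together, so the guessed rewriting may return students that Q does not.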