Upload
stormy
View
33
Download
2
Embed Size (px)
DESCRIPTION
Data Integration Systems overview. The architecture of a data integration system: Components and their interaction Tasks Concepts. Main components of a DI system. (I) Mediator מתווך Supports in its user interface : - PowerPoint PPT Presentation
Citation preview
2005 Integration-intro 1
Data Integration Systemsoverview
The architecture of a data integration system: Components and their interaction Tasks Concepts
2005 Integration-intro 2
Main components of a DI system
(I) Mediator מתווך
Supports in its user interface :
• The global data model
• The integrated / global / mediated schema / world view
• A query language
Manages the interaction with sources
• Posing queries
• Receiving answers, transforming and showing them
Is responsible for query execution strategies
• planning
• carrying out
2005 Integration-intro 3
(II) Wrapper עוטף
Serves as the interface to a source
• Receive queries from a mediator
• Plan and execute how to retrieve the data from its source
• Transform data to global data model
• Send to mediator
For an SQL source, these are rather easy
For a restricted capability source, may require
• A series of queries on the source, or• A program to be executed (on a non-db source)
• Filtering results obtained from the source
2005 Integration-intro 4
A simple architecture:
Arrows represent query and data flow
source source
wrapper wrapper
mediator
2005 Integration-intro 5
A more complex architecture:
Mediators can serves as wrapped sources for other mediators
source
source
wrapper
wrapper
mediator
mediator
source
wrapper
2005 Integration-intro 6
Important:
• The global database is virtual – contain no data
• The data reside in the sources
• The users pose queries as if the data resides in the global db
• Users may/may not be aware that the data actually comes from the sources
2005 Integration-intro 7
Main tasks & activities:
At mediator:
• Query reformulation & decomposition – express queries in terms of the sources’ schemas
– decompose into queries on sources
• Planning query execution, including optimization – a declarative query may be executed in various ways (even in a single
centralized db)
– different sources may provide same data at different costs (money, communication time, response time, delays, …)
– If data is associated with user priorities, we may want to retrieve some answers before others
• When answers arrive – fuse them – a full answer is not a simple union of partial answers; data on an entity must be combined (fused) into a single record
2005 Integration-intro 8
Requirements: (from mediator, wrapper, system)
Ability to handle • Incomplete information (data may be missing from available
sources)
• Heterogeneity – in data model, schema, contents
• Both data and meta-data
ability to describe sources:
• Capabilities – what queries can a source answer
– What mechanisms does it offer for data retrieval
• Coverage – allows to know – Where can data relevant to a query be found
– Is there overlap between sources?
2005 Integration-intro 9
The relationship between source and global data
The global data is virtual the mediated schema describes data that • resides in the sources• is described by source schemas
The relationship between the mediated and source data determines how queries are answered
Two main approaches
(a combination of the two – later)
2005 Integration-intro 10
Global as View – GAV
The global db is defined as a view on the sources
In relational model:
Each global relation defined as a view, by a query on sources
Obvious advantage:
simplicity of query answering
Given Q on global relations,
• expand it: replace each atom R(x) by an expression on sources, using the definition of R
• Then send appropriate sub-queries to sources
2005 Integration-intro 11
Simple example: a university database
Source A:
• Dept(D, C) – departments and their courses
• Teaches(C,T) – teachers of courses
Source B:
• Enroll(S, C) – student enrollment to courses
Integrated schema & its definition:
• Stud(S, D, T) :- Dept(D, C), Teach(C, T), Enroll(S, C)
Query Q:
• Stud(S, ‘CS’, ‘Beeri’)
Expand body to Dept(‘CS’, C), Teach(C, ‘Beeri’), Enroll(S, C)
Then use one of (at least) two execution strategies on sources A,B
2005 Integration-intro 12
Local as View – LAV
The global database is viewed as the “real world”
Each source is defined as a view on it
Example (revisited):
Global schema: Univ(D, C, T, S)
Source A:
• Dept(D, C) :- Univ(D, C, T, S)
• Teaches(C, T) :- Univ(D, C, T, S)
Source B:
• Enroll(S, C) :- Univ(D, C, T, S)
2005 Integration-intro 13
Possible assumptions on sources:
• A source contains all data in its defining view
• A source contains some of the data in its view, usually not all
2nd is more realistic
Example:
Global database describes cars for sale
A source may contain :
• only some of the attributes of cars present in the global schema
(e.g., it may not contain history, or owner-contact)
• Only some of the cars for sale
full view / contained view
Obviously, the more sources we have, the more cars
2005 Integration-intro 14
Query answering in LAV:• Expansion is not possible• An approach: answering queries using views Practically: rewriting queries using views (differences explored later)
Only the views have data rewrite query to an expression over the views expression must be (explained in more detail later)
• Full views: equivalent to query• Contained views: contained in query
Solution may/may not exist (contrast to expansion) Finding it is more difficultThis problem was explored in many contexts, e.g.:Query optimization using views/previous answers
2005 Integration-intro 15
Why prefer LAV to GAV?
• Ease of expanding a system:– In GAV, adding a source may require re-definition of global schema
– makes it difficult to add sources
– In LAV, just define the new source as a view
given an algorithm for using views to answer queries, it automatically uses the new source
As for expanding queries vs. using views:
Even in GAV, when sources have restricted capabilities, query answering requires using views
2005 Integration-intro 16
• Typically, a global schema reflects a real ‘world’, as we know it; each source materializes only a fragment– Horizontal – not all entity types or attributes are present
– Vertical – not all entities of a type are present
Thus, it is natural to define the sources as (contained) views
Examples:
• Cars for sale: – global db reflects our understanding and requirements
– A source provides only some info, only on the cars it has
• Looking for personal information using UNIX facilities – we know about: name, office, phone, email, …
– Each facility may offer only some of the above
2005 Integration-intro 17
LAV is a natural approach in the presence of
• www and its diversity & dynamicity of source
• Legacy systems
Most research efforts & systems are LAV
2005 Integration-intro 18
On rewriting queries using views:
It is not clear (now) how to obtain a rewriting, given Q
But, given v1(..), v2(..), …, vn(..) as a candidate, we may
• expand each vi using its definition in terms of the global schema
• Check whether the resulting expression is
equivalent to or contained in Q
(both Q and the expansion are in terms of global schema relations)
Equivalence and containment of queries are fundamental problems for data integration
2005 Integration-intro 19
Example (our LAV example):
Q: ans(S,’CS’,’Beeri’) :- Univ(’CS, C, ’Beeri’,S)
Guess an answer in terms of views:
ans`(S, ’CS’ , ‘Beeri’ ) :-
Dept(‘CS’,C), Teach(C, ‘Beeri’), Enroll(S,C)
(Note: must use distinct variables in different expansions for all non-join variables)
Is the query equivalent to this expansion?
Is the expansion contained in the query?
Univ(‘CS’,C, T1, S1) Univ(D2,C, ‘Beeri’, S2) Univ(D3,C, T3, S)