I. Khalil Ibrahim 1 Data Integration in Digital Libraries: Approaches and Challenges Bringing Digital Libraries together Dr. Ismail Khalil Ibrahim ismail.khalil- [email protected] +43 7236 3343 852 www.scch.at


I. Khalil Ibrahim 1

Data Integration in Digital Libraries: Approaches and Challenges

Bringing Digital Libraries together

Dr. Ismail Khalil Ibrahim

[email protected]

+43 7236 3343 852

www.scch.at

I. Khalil Ibrahim 2

Biography

Dr. Ismail Khalil Ibrahim is a senior software developer and AgenCom project manager at the Software Competence Center Hagenberg, Austria. He worked at the University of Technology, Baghdad, Iraq, from 1985 to 1990 as a lecturer; at the Human Resources Training and Development Institute, Iraq, from 1990 to 1996 as head of the academic studies department; and at Gadjah Mada University from 1996 to 2000 as a teaching and research assistant.

His main research interests lie in the fields of E-commerce & I-Commerce; database applications and techniques for the Web; practical experience and applications in information integration systems; logic programming for information integration; agents for information retrieval and knowledge discovery; XML and semistructured data management; information systems management and development; and information technology (impact, economic analysis).

Ismail is a member of ACM, SIGMOD, SIGKDD, and SIGecom; General Secretary of the Indonesian Information Society Initiative (IISI); a member of the Iraqi Engineers Association (IEA); an overseas collaborator in the E-commerce Lab at the National University of Singapore; a member of the editorial board of the Colombian Journal of Computing, "Revista Colombiana de Computación"; chairman of the organizing committee of the 1st and 2nd International Workshop on Information Integration and Web-based Applications & Services (IIWAS'99, IIWAS'00), Yogyakarta, Indonesia; and chairman of the organizing committee of the 3rd International Conference on Information Integration and Web-based Applications & Services (IIWAS'2001), Linz, Austria.

Ismail holds a B.Sc. in Electrical Engineering from the University of Technology, Iraq (1985), and an M.Sc. and a Ph.D. in Computer Engineering and Information Systems from Gadjah Mada University (1998, 2001).

I. Khalil Ibrahim 3

Outline

Data Integration

What is it?

What does a data integration system look like?

What are some data integration challenges?

I. Khalil Ibrahim 4

What Is Data Integration?

Providing uniform access to multiple, autonomous, heterogeneous, unstructured information sources:

uniform: the sources are transparent to the user

access: query, and eventually updates

multiple: even two is a problem

autonomous: integration must not affect the behavior of the sources

heterogeneous: different data models and schemas

unstructured: at least semi-structured

information sources: not only databases

I. Khalil Ibrahim 5

Example Scenario

http://www.amazon.com

s1(Title, Author, Subject)

http://www.book-a-million.com

s2(ISBN, Title, Publisher)

http://……...

I. Khalil Ibrahim 6

Example Scenario (cont.)

Retrieve the titles and subjects of all the technical reports written by Stephane Bressan and published by MIT Press

q1: amazon(Title, "Stephane Bressan", Subject)

q2: book-a-million(ISBN, Title, "MIT Press")

Join the results
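
The plan on this slide (one query per source, then a join) can be sketched in Python. This is only an illustration: the rows are invented sample data, and `query_s1` and `query_s2` are hypothetical stand-ins for wrappers around the two sites. Only the schemas s1(Title, Author, Subject) and s2(ISBN, Title, Publisher) come from the slides. Title is the only attribute the two sources share, so the sketch joins on it and assumes titles are spelled identically in both sources.

```python
# Invented sample rows; only the two schemas come from the slides.
S1 = [  # s1(Title, Author, Subject)
    ("Report A", "Stephane Bressan", "Databases"),
    ("Report B", "Stephane Bressan", "Networks"),
    ("Report C", "Another Author", "Databases"),
]
S2 = [  # s2(ISBN, Title, Publisher)
    ("isbn-1", "Report A", "MIT Press"),
    ("isbn-2", "Report B", "Other Press"),
    ("isbn-3", "Report C", "MIT Press"),
]


def query_s1(author):
    """q1: (Title, Subject) pairs from s1 for the given author."""
    return [(title, subject) for title, a, subject in S1 if a == author]


def query_s2(publisher):
    """q2: set of titles from s2 for the given publisher."""
    return {title for _isbn, title, p in S2 if p == publisher}


def titles_and_subjects(author, publisher):
    """Join the results of q1 and q2 on Title."""
    published = query_s2(publisher)
    return [(t, s) for t, s in query_s1(author) if t in published]


print(titles_and_subjects("Stephane Bressan", "MIT Press"))
```

With the sample rows above, only "Report A" satisfies both the author and the publisher condition.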

I. Khalil Ibrahim 7

So What is the Problem?

Virtual vs. materialized architectures

Access: query only, or query and update?
  the problem is similar to updating through views
  updates need distributed transactional services

Mediated schema: yes or no?
  without a mediated schema we lose its advantages
  a mediated schema requires schema integration
  schema integration needs query transformation
  query transformation needs query optimization

I. Khalil Ibrahim 8

Additional Dimensions

How many sources are we accessing?

How autonomous are the sources?

How much knowledge do we have about the sources?

How structured are the data in the sources?

Requirements from responses:
  accuracy
  completeness
  machine readable vs. human readable
  handling inconsistencies
  speed

Closed World Assumption vs. Open World Assumption

I. Khalil Ibrahim 9

Related Technologies / Issues

Distributed databases
  sources are homogeneous
  data is distributed a priori
  sources are not autonomous
  similarities at the optimization and execution level

Information retrieval
  keyword search
  no semantics

Data mining: discovering properties and patterns in data

I. Khalil Ibrahim 10

Current Applications

Intranets
  enterprise data integration
  web-site construction

World Wide Web
  digital libraries
  comparison shopping (Netbot, Junglee)
  portals integrating data from multiple sources
  XML integration

Science & Culture
  medical genetics: integrating genomic data
  astrophysics: monitoring events in the sky
  environment: Puget Sound regional synthesis model
  culture: uniform access to all the cultural databases

I. Khalil Ibrahim 11

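Paradigms of Data Integration

[Figure: integration paradigms (database schema integration, data warehousing, mediation) classified by whether the global schema is defined from the local schemas or is "independent" of them, and by closed-world (CWA) vs. open-world (OWA) assumption. View styles shown: global-schema-as-view, global-as-view-of-local, local-as-view-of-global.]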

I. Khalil Ibrahim 12

Paradigms of Data Integration II

Data Warehousing (materialization architecture)

data of interest is collected in a central place and a web site is built on top of it

queries are applied to the data warehouse

easy to support queries, transactions

hard to modify; the warehouse is not connected to the providers of information, etc.

I. Khalil Ibrahim 13

Data Warehousing Architecture

[Figure: three data sources, each behind its own wrapper; data extraction loads the wrapped data into the data warehouse, which the application queries.]
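
A minimal sketch of the materialized approach, assuming two invented sources with different native formats and using Python's standard `sqlite3` module as a stand-in for the warehouse. The names `SOURCE_A`, `SOURCE_B`, `wrap_a`, `wrap_b`, and the `book` table are hypothetical. It shows both properties from the previous slide: queries are easy because they run against one central store, and the warehouse goes stale because it is not connected to the sources.

```python
import sqlite3

# Invented sources with different native formats.
SOURCE_A = [{"title": "Report A", "author": "Author 1", "year": 1999}]
SOURCE_B = [("Report B", "Author 2", "2000")]  # year stored as text


def wrap_a(rows):
    """Wrapper for source A: dicts -> (title, author, year) tuples."""
    return [(r["title"], r["author"], r["year"]) for r in rows]


def wrap_b(rows):
    """Wrapper for source B: casts the textual year to an integer."""
    return [(t, a, int(y)) for t, a, y in rows]


def build_warehouse():
    """Data extraction: copy every wrapped source into one central table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE book (title TEXT, author TEXT, year INTEGER)")
    for rows in (wrap_a(SOURCE_A), wrap_b(SOURCE_B)):
        db.executemany("INSERT INTO book VALUES (?, ?, ?)", rows)
    db.commit()
    return db


warehouse = build_warehouse()

# Queries run against the warehouse only; the sources are not contacted.
recent = warehouse.execute(
    "SELECT title FROM book WHERE year >= 2000 ORDER BY title"
).fetchall()
print(recent)

# Staleness: a later change at a source is invisible until the next load.
SOURCE_A.append({"title": "Report C", "author": "Author 3", "year": 2001})
count = warehouse.execute("SELECT COUNT(*) FROM book").fetchone()[0]
print(count)
```

After the append, the warehouse still holds two rows; picking up "Report C" would require running the extraction again.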

I. Khalil Ibrahim 14

Paradigms of Data Integration III

Information Mediation (virtual architecture)

data remains in web sources

rules relate the external data to the internal application

data is not replicated and is guaranteed to be up to date

query optimization and execution is more complex

I. Khalil Ibrahim 15

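Mediation Architecture

[Figure: the application works against the global data model; a query execution engine with a catalog sits between the application and the wrappers; each wrapper exposes one data source, which is expressed in its local data model.]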
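
The components named in this architecture (wrapper, catalog, query execution engine) can be sketched as follows. This is a toy illustration under stated assumptions, not the design of any particular mediator: the sources, the `Wrapper` class, the `CATALOG` dictionary, and the `execute` function are all hypothetical. The point is that data stays in the sources and is fetched at query time, so answers reflect the current source contents.

```python
class Wrapper:
    """Translates one source's native rows into the global data model."""

    def __init__(self, source, to_global):
        self.source = source        # live reference to the source, not a copy
        self.to_global = to_global  # row translation function

    def fetch(self):
        return [self.to_global(row) for row in self.source]


# Invented sources in two different local data models.
SOURCE_A = [{"title": "Report A", "year": 1999}]
SOURCE_B = [("2000", "Report B")]  # (year as text, title)

# Catalog: which wrappers can supply each relation of the global model.
CATALOG = {
    "BookYear": [
        Wrapper(SOURCE_A, lambda r: (r["title"], r["year"])),
        Wrapper(SOURCE_B, lambda r: (r[1], int(r[0]))),
    ],
}


def execute(relation, predicate=lambda row: True):
    """Query execution engine: contact the sources at query time."""
    rows = []
    for wrapper in CATALOG[relation]:
        rows.extend(row for row in wrapper.fetch() if predicate(row))
    return sorted(rows)


before = execute("BookYear")
SOURCE_B.append(("2001", "Report C"))  # a source changes...
after = execute("BookYear")            # ...and the next query sees it
print(before)
print(after)
```

Nothing is copied ahead of time, which is why the second query already includes "Report C". The cost is that every query pays for source access, translation, and merging.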

I. Khalil Ibrahim 16

Running Example

World Relations:

Book(title, year, author, subject)

BookYear(title, year)

BookRev(title, author, review)

Source Relations:

DB1(title, author, year)

DB2(title, author, year)

DB3(title, review)

Mappings between world and source relations: GAV, LAV

I. Khalil Ibrahim 17

Global As View (GAV)

Define a global schema of objects and write down rules to collect these objects

for each relation R in the mediated schema, we write a query over the sources' relations specifying how to obtain R's tuples from the sources (query unfolding)

traditional query processing applies

requires the right sources to be available and compliant
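
A small sketch of GAV over the running example. The source schemas DB1, DB2, and DB3 and the mediated relations BookYear and BookRev come from the slides; the concrete mappings and the sample rows are assumptions made for illustration. Each mediated relation is written as a query over the sources, and a user query is answered by substituting those definitions (unfolding).

```python
# Invented source contents, using the source schemas of the running example.
DB1 = [("Report A", "Author 1", 1999)]                # DB1(title, author, year)
DB2 = [("Report B", "Author 2", 2000)]                # DB2(title, author, year)
DB3 = [("Report A", "solid"), ("Report B", "dense")]  # DB3(title, review)


# GAV mappings (assumed): each mediated relation is a query over the sources.
def book_year():
    """BookYear(title, year) := DB1 union DB2, projected on (title, year)."""
    return {(t, y) for t, _a, y in DB1 + DB2}


def book_rev():
    """BookRev(title, author, review) := (DB1 union DB2) join DB3 on title."""
    return {(t, a, r) for t, a, _y in DB1 + DB2 for t3, r in DB3 if t == t3}


def reviews_since(year):
    """User query over the mediated schema: reviews of books from `year` on.

    Unfolding is plain substitution: calling book_year() and book_rev()
    replaces each mediated relation with its definition over the sources.
    """
    recent = {t for t, y in book_year() if y >= year}
    return sorted((t, r) for t, _a, r in book_rev() if t in recent)


print(reviews_since(2000))
```

Because the mappings are ordinary queries, the unfolded query is itself an ordinary query over DB1, DB2, and DB3. The flip side is that the mappings name specific sources, so adding or dropping a source means revisiting them.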

I. Khalil Ibrahim 18

Local As View (LAV)

For every information source S, we write a query over the relations in the mediated schema that describes which tuples are found in S (query folding, or answering queries using views)

may be able to answer a query based on the available partial information

generally, may not be able to answer the query

needs non standard query processing techniques

potentially high complexity
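
A sketch of the LAV idea on the running example. The source descriptions, the sample data, and the rewriting are assumptions made for illustration; in particular, the rewriting is derived by hand and no rewriting algorithm is implemented here. The hypothetical world instance exists only so the sketch can check the result, since a real mediator never sees it.

```python
# Hypothetical "true" instance of the mediated schema, used only for checking.
BOOK = {("Report A", 1999, "Author 1", "Databases"),
        ("Report B", 2000, "Author 2", "Networks")}  # Book(title, year, author, subject)
BOOK_REV = {("Report A", "Author 1", "solid"),
            ("Report B", "Author 2", "dense")}       # BookRev(title, author, review)


# LAV source descriptions (assumed): each source is a view over the mediated schema.
def view_db1():
    """DB1(title, author, year) :- Book(title, year, author, subject)"""
    return {(t, a, y) for t, y, a, _s in BOOK}


def view_db3():
    """DB3(title, review) :- BookRev(title, author, review)"""
    return {(t, r) for t, _a, r in BOOK_REV}


# Sources are sound but possibly incomplete: each holds a subset of its view.
DB1 = {row for row in view_db1() if row[0] != "Report B"}  # DB1 misses Report B
DB3 = view_db3()


def query_over_world():
    """q(title, review) :- Book(title, _, _, _), BookRev(title, _, review)"""
    return {(t, r) for t, _y, _a, _s in BOOK
            for t2, _a2, r in BOOK_REV if t == t2}


def rewriting_over_sources():
    """Hand-derived rewriting: q'(title, review) :- DB1(title, _, _), DB3(title, review)"""
    return {(t, r) for t, _a, _y in DB1 for t3, r in DB3 if t == t3}


answers = rewriting_over_sources()
print(sorted(answers))
print(answers <= query_over_world())  # every returned answer is correct
```

The rewriting returns only the "Report A" review: each answer is correct, but the answer set is incomplete because DB1 does not list "Report B". This matches the slide's point that a query may be answerable from partial information without being answerable in full.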

I. Khalil Ibrahim 19

Challenges

Complexity over traditional DBs: heterogeneous, autonomous, network-bounded sources

Query reformulation is now understood

map queries over mediated schemas to "wrapped" sources (heterogeneity)

Issues remain in query processing

few statistics (autonomous sources)

unanticipated delays and failures (network-bounded sources)
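
One common way to cope with delays and failures is to query the sources in parallel under a shared deadline and return whatever arrives in time. The sketch below uses Python's standard `concurrent.futures` module; the three sources and the `query_all` function are hypothetical, with delay and failure simulated by `time.sleep` and a raised `ConnectionError`. It illustrates the problem rather than a technique taken from the slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def fast_source():
    return [("Report A", 1999)]


def slow_source():
    time.sleep(1.0)  # simulated network delay
    return [("Report B", 2000)]


def failing_source():
    raise ConnectionError("source unreachable")  # simulated failure


def query_all(sources, timeout):
    """Query every source in parallel; keep only what arrives before the deadline."""
    rows, skipped = [], []
    # Note: leaving the `with` block still waits for slow threads to finish;
    # a real mediator would need a way to abandon them.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn) for name, fn in sources.items()}
        deadline = time.monotonic() + timeout
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                rows.extend(future.result(timeout=remaining))
            except (TimeoutError, ConnectionError):
                skipped.append(name)  # late or unreachable: answer without it
    return rows, skipped


rows, skipped = query_all(
    {"fast": fast_source, "slow": slow_source, "failing": failing_source},
    timeout=0.2,
)
print(rows)
print(skipped)
```

The answer is partial by construction, so a caller would normally report which sources were skipped alongside the rows.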

I. Khalil Ibrahim 20

Conclusions

Data integration handles many of the problems that arise in embedded systems applications

Many data sources

Easy addition and deletion of sources

Different source capabilities

Dealing with network delays

Easy for user

I. Khalil Ibrahim 21

Publications

• Semantic Query Transformation for the Integration of Autonomous Information Sources (INAP'99 – Tokyo)

• IKA: Unity in Heterogeneity (IIWAS'99 – Yogyakarta)

• Information Retrieval Agents for the Intelligent Integration of Information Sources (MulNet 2000 – Bandung)

• A Multilingual Natural Language Interface for Mediating E-Commerce Product Catalogs (INAP2000 – Tokyo)

• Semantic Query Transformation for the Intelligent Integration of Information Sources over the Web (WIIW2001 – Rio de Janeiro)

• Rewriting Rules for Semantic Query Transformation in E-Commerce Applications (DS9 – Hong Kong)

• Data Integration in Digital Libraries: Challenges and Approaches (IndonesiaDL – Bandung)

I. Khalil Ibrahim 22

Thank you for your attention!