Emanoel Carlos Data Integration: The Teenage Years (Alon Halevy, Anand Rajaraman, Joann Ordille) ecgfs


Why Data Integration?

o Consider FullServe, a company that provides Internet access to homes, but also sells a few products that support the home computing infrastructure.

o FullServe decided to extend its reach to Europe. To do so, FullServe acquired a European company, EuroCard, which is mainly a credit card provider, but has recently started leveraging its customer base to enter the Internet market.


Why Data Integration?

Some of the databases of FullServe, by department: Human Resources, Training and Development, Sales, and Customer Care.


Why Data Integration?

Some of the databases of EuroCard.


Why Data Integration?

o The Human Resources Department needs to be able to query for all of its employees.

o FullServe has a single customer support hotline, which customers can call about any service or product they obtain from the company.

o FullServe wants to build a Web site to complement its telephone customer service line.

o Combining data from multiple sources can offer opportunities for a company to obtain a competitive advantage and find opportunities for improvement.

• For example, suppose we find that in a particular area of the country FullServe is receiving an unusual number of calls about malfunctions in its service.



Why Data Integration?

o When integrating data on the Web, we first face the challenge of schema heterogeneity, but on a much larger scale: millions of tables created by independent authors and in over 100 languages.

o Second, extracting the data is quite difficult.

o Data integration is a key challenge for the advancement of science in fields such as biology, ecosystems, and water management, where groups of scientists are independently collecting data and trying to collaborate with one another.

o Data integration is a challenge for governments who want their different agencies to be better coordinated.

o And lastly, mash-ups are now a popular paradigm for visualizing information on the Web.


Why Data Integration?

o Query: The focus of most data integration systems is on querying disparate data sources. However, updating the sources is certainly of interest.

o Number of sources: Data integration is already a challenge for a small number of sources (fewer than 10 and often even 2!), but the challenges are exacerbated when the number of sources grows. Transparency.

o Heterogeneity: A typical data integration scenario involves data sources that were developed independently of each other. As a consequence, the data sources run on different systems: some of them are databases, but others may be content management systems or simply files residing in a directory.

o Autonomy: The sources do not necessarily belong to a single administrative entity.

• We cannot assume that we have full access to the data in a source or that we can access the data whenever we want.

• Furthermore, the sources can change their data formats and access patterns at any time, without having to notify any central administrative entity.


Data Integration Architectures

Warehouse vs. Virtual Integration


Data Integration Architectures

Data Broker


Data Integration Architectures: Virtual approach to integration

Components of the architecture diagram, from top to bottom:

o Mediated schema (or warehouse)

o Source descriptions (or transforms)

o Query reformulation (or query over materialized data)

o Wrappers (or extractors), one per source

o Sources, for example RDBMS1 and RDBMS2


Data Integration Architectures: Warehousing approach

Components of the architecture diagram, from top to bottom:

o Mediated schema (or warehouse)

o Source descriptions (or transforms)

o Query reformulation (or query over materialized data)

o Wrappers (or extractors), one per source

o Sources, for example RDBMS1 and RDBMS2


Example

Mediated schema:

o Movie(title, director, year, genre)

o Actors(title, actor)

o Plays(movie, location, startTime)

o Review(title, rating, description)

Sources:

o S1: Movies(name, actors, director, genre)

o S2: Cinemas(place, movie, start)

o S3: CinemasInNYC(cinema, title, startTime)

o S4: CinemasInSF(location, movie, startingTime)

o S5: Reviews(title, date, grade, review)



Query Processing

o Query over mediated schema -> Query Reformulator -> logical query plan over sources

o Logical query plan -> Query Optimizer -> physical query plan over sources

o Physical query plan -> Execution Engine -> subquery or fetch request per source

o Each request goes through a wrapper/extractor to its source (for example RDBMS1, RDBMS2)

Query Processing

Tuples for Movie can be obtained from source S1, but the attribute title needs to be reformulated to name.

Tuples for Plays can be obtained from source S2, S3, or S4. Since S3 is complete for showings in New York City (as S4 is for San Francisco), we choose it over S2.

Since source S3 requires the title of a movie as input, and such a title is not specified in the query, the query plan must first access source S1 and then feed the movie titles returned from S1 as inputs to S3.
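The plan described on this slide can be sketched in a few lines of Python. This is only an illustration: the rows, the function names `scan_s1`, `fetch_s3`, and `plan`, and the filter on director are all assumptions made for the example, not part of the original system.

```python
# Made-up rows standing in for S1: Movies(name, actors, director, genre).
S1_MOVIES = [
    {"name": "Film A", "actors": "Actor X", "director": "Director 1", "genre": "comedy"},
    {"name": "Film B", "actors": "Actor Y", "director": "Director 2", "genre": "drama"},
]

# Made-up rows standing in for S3: CinemasInNYC(cinema, title, startTime).
S3_SHOWINGS = [
    {"cinema": "Cinema 1", "title": "Film A", "startTime": "20:30"},
    {"cinema": "Cinema 2", "title": "Film B", "startTime": "18:00"},
    {"cinema": "Cinema 3", "title": "Film C", "startTime": "21:00"},
]


def scan_s1(director):
    """Access S1 and rename its attribute `name` to the mediated attribute `title`."""
    for row in S1_MOVIES:
        if row["director"] == director:
            yield {"title": row["name"], "director": row["director"], "genre": row["genre"]}


def fetch_s3(title):
    """S3 needs a movie title as input, so it is only ever called with one bound."""
    return [row for row in S3_SHOWINGS if row["title"] == title]


def plan(director):
    """Dependent join: titles produced by S1 are fed as inputs to S3."""
    results = []
    for movie in scan_s1(director):
        for showing in fetch_s3(movie["title"]):
            results.append((movie["title"], showing["startTime"]))
    return results


print(plan("Director 1"))
```

The order of access is forced by S3's input requirement: S1 has to be read first because it is the only place the plan can get titles from.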


Query Processing


In our example, the optimizer will decide which join algorithm to use to combine results from S1 and S3.

For example, the join algorithm may stream movie titles arriving from S1 and input them into S3, or it may batch them up before sending them to S3.
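The two physical strategies mentioned here, streaming titles one at a time or batching them, might look as follows. This is a minimal sketch: `fetch_s3`, the `calls` log, and the sample showings are hypothetical stand-ins for a real wrapper.

```python
from itertools import islice

calls = []  # records every request sent to the hypothetical S3 wrapper


def fetch_s3(titles):
    """Stand-in for the S3 wrapper: takes a list of titles, returns showings."""
    calls.append(list(titles))
    showings = {"Film A": "20:30", "Film B": "18:00"}  # made-up data
    return [(t, showings[t]) for t in titles if t in showings]


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def streaming_join(titles):
    """Send each title to S3 as soon as it arrives from S1."""
    for title in titles:
        yield from fetch_s3([title])


def batched_join(titles, size):
    """Collect `size` titles before sending one request to S3."""
    for chunk in chunks(titles, size):
        yield from fetch_s3(chunk)
```

Both strategies return the same tuples. Streaming produces first results sooner, while batching issues fewer requests to the source, and that trade-off is the kind of choice the optimizer has to make.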


In Proceedings of the AAAI 1995 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments


Information Manifold

o The main contribution of the Information Manifold was the way it described the contents of the data sources it knew about.

o The Information Manifold proposed the method that later became known as the Local-as-View approach (LAV): an information source is described as a view expression over the mediated schema.

o Previous approaches employed the Global-as-View (GAV) approach.


GAV: Global-as-view

o GAV defines the mediated schema as a set of views over the data sources.


GAV: Global-as-view

• The first expression shows how to obtain tuples for the Movie relation by joining relations in S1.

• The second expression obtains tuples for the Movie relation by joining data from sources S5, S6, and S7. Hence, the tuples that would be computed for Movie are the result of the union of the first two expressions.

• Also note that the second expression requires that we know the director, genre, and year of a movie. If one of these is missing, we will not have a tuple for the movie in the relation Movie.

• The third and fourth expressions generate tuples for the Plays relation by taking the union of S2 and S3.
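Since the GAV expressions themselves are not reproduced in the text, the following Python sketch only illustrates the behavior the bullets describe. The source relation shapes are assumptions: S1 is taken to hold `Movie(MID, title)` and `MovieDetail(MID, director, genre, year)`, and S5, S6, S7 are taken to hold `(title, genre)`, `(title, director)`, and `(title, year)` pairs, with one value per title for simplicity.

```python
def gav_movie(s1_movie, s1_detail, s5_genres, s6_directors, s7_years):
    """Mediated Movie(title, director, year, genre) as a union of two views."""
    tuples = set()

    # First expression (assumed shape): join two S1 relations on MID.
    detail_by_mid = {mid: (director, genre, year) for mid, director, genre, year in s1_detail}
    for mid, title in s1_movie:
        if mid in detail_by_mid:
            director, genre, year = detail_by_mid[mid]
            tuples.add((title, director, year, genre))

    # Second expression (assumed shape): join S5, S6, and S7 on title.
    genres, directors, years = dict(s5_genres), dict(s6_directors), dict(s7_years)
    for title in genres.keys() & directors.keys() & years.keys():
        tuples.add((title, directors[title], years[title], genres[title]))

    return tuples


def gav_plays(s2_rows, s3_rows):
    """Mediated Plays as the union of the tuples contributed by S2 and S3."""
    return set(s2_rows) | set(s3_rows)
```

A title that appears in S5 and S6 but not in S7 never reaches `Movie`, which is the limitation pointed out in the third bullet.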


Query over the mediated schema

o Suppose we have the following query over the mediated schema, asking for comedies starting after 8 pm:

o Reformulating Q with the source descriptions would yield the following four logical query plans:


LAV: Local-as-View

o Instead of specifying how to compute tuples of the mediated schema, LAV focuses on describing each data source as precisely as possible and independently of any other sources.


LAV: Local-as-View

o In LAV, sources S5–S7 would be described simply as projection queries over the Movie relation in the mediated schema.

o With LAV we can also model the source S8 as a join over the mediated schema:

o Furthermore, we can also express constraints on the contents of data sources. For example, we can describe the following source that includes movies produced after 1970 that are all comedies:


LAV: Local-as-View

o Consider the following query asking for comedies produced in or after 1960:

o Using the sources S5–S7, we would generate the following reformulation from the LAV source descriptions:
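The reformulation itself is not reproduced in the text, so the sketch below shows one plausible version under stated assumptions: S5 is taken to expose `(title, genre)` pairs and S7 `(title, year)` pairs, each described as a projection of Movie. The function name and sample data are hypothetical.

```python
def comedies_since(s5_genres, s7_years, min_year=1960):
    """Answer 'comedies produced in or after min_year' using only source relations.

    Joining the two projections on title is only safe if title identifies a
    movie, which is an assumption of this sketch.
    """
    comedy_titles = {title for title, genre in s5_genres if genre == "comedy"}
    return sorted({title for title, year in s7_years
                   if title in comedy_titles and year >= min_year})


# Made-up source contents.
S5 = [("Film A", "comedy"), ("Film B", "drama"), ("Film C", "comedy")]
S7 = [("Film A", 1965), ("Film B", 1970), ("Film C", 1950)]
print(comedies_since(S5, S7))
```

Under these assumptions S6, which would hold directors, is not needed, because the query asks for neither the director nor a condition on it.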


Directions


Generating Schema Mappings

o One of the major bottlenecks in setting up a data integration application is the effort required to create the source descriptions, and more specifically, to write the semantic mappings between the sources and the mediated schema.

o Semi-automatically generating schema mappings:

• Based on clues that can be obtained from the schemas themselves, such as linguistic similarities between schema elements and overlaps in data values or data types of columns.

• Schema mapping tasks are often repetitive. Hence, we could use Machine Learning techniques that consider the manually created schema mappings as training data, and generalize from them to predict mappings between unseen schemas.

• Fully automatic schema mapping is considered an AI-complete problem, so these techniques assist the designer rather than replace them.
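The first kind of clue, linguistic similarity between schema elements, can be sketched with Python's standard `difflib` module. The function names and the 0.6 threshold are assumptions chosen for the example. A real matcher would combine several clues and leave the final decision to a person.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two attribute names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_matches(source_attrs, mediated_attrs, threshold=0.6):
    """For each source attribute, suggest the most similar mediated attribute."""
    suggestions = {}
    for source_attr in source_attrs:
        best = max(mediated_attrs, key=lambda m: name_similarity(source_attr, m))
        if name_similarity(source_attr, best) >= threshold:
            suggestions[source_attr] = best
    return suggestions


# CinemasInSF(location, movie, startingTime) against Plays(movie, location, startTime).
print(suggest_matches(["location", "movie", "startingTime"],
                      ["movie", "location", "startTime"]))
```

Name similarity alone misses pairs such as `name` and `title`, which mean the same thing in the movie example but share almost no characters. That is where overlaps in data values and learned matchers come in.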


Adaptive query processing

o The context in which a data integration system operates is very dynamic, and the optimizer has much less information than in a traditional database setting.

o As a result, two things happen:

• The optimizer may not have enough information to decide on a good plan.

• A plan that looks good at optimization time may be arbitrarily bad if the sources do not respond exactly as expected.


Model Management

o The goal of Model Management is to provide an algebra for manipulating schemas and mappings.

o With such an algebra, complex operations on data sources are described as simple sequences of operators in the algebra and optimized and processed using a general system.

o Some of the operators that have been considered include the creation of mappings, inverting and composing mappings, merging schemas and schema differencing.
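To give a feel for two of these operators, the sketch below treats a mapping as a plain attribute-to-attribute dictionary. That is a strong simplification made for illustration only, since real schema mappings are query expressions, and the names `compose` and `invert` are hypothetical.

```python
def compose(first, second):
    """Compose A->B with B->C into A->C, dropping attributes with no image."""
    return {a: second[b] for a, b in first.items() if b in second}


def invert(mapping):
    """Invert an attribute mapping; only defined when it is one-to-one."""
    inverse = {}
    for source, target in mapping.items():
        if target in inverse:
            raise ValueError(f"mapping is not invertible: {target!r} has two sources")
        inverse[target] = source
    return inverse


# Made-up mappings: source S1 -> mediated schema -> some third schema.
s1_to_mediated = {"name": "title", "director": "director"}
mediated_to_other = {"title": "film_title"}
print(compose(s1_to_mediated, mediated_to_other))
print(invert(s1_to_mediated))
```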


XML

Querying XML

o Recent

o No underlying algebra

o Languages:

• XPath

• XQuery

• Others: XSLT, XLink, XPointer

XPath (path + conditions)

o XML ~ tree

Bookstore.xml: the root bookstore has two book children. The book shown has a title with the attribute @lang = "en" and the text "Harry Potter", an author ("J K. Rowling"), a year ("2005"), and a price ("29.99").

XPath (constructs)

nodename           Selects the nodes named "nodename"
/                  Separator, or selects the root node
//                 Selects any descendant
.                  Selects the current node
..                 Selects the parent node
@                  Selects attributes

Examples:

bookstore          Nodes named "bookstore"
/bookstore         Selects the root "bookstore"
bookstore/book     Selects all "book" children of "bookstore"
//book             Selects all "book" elements, wherever they are
bookstore//book    Selects all "book" elements that are descendants of "bookstore", at any level
//@lang            Selects all "lang" attributes

XPath (constructs)

/                  Separator
element            Accesses the element
*                  Accesses any subelement
@attribute         Accesses the attribute
//                 Accesses any descendant
[@attribute>70]    Tests the condition
[3]                Accesses the 3rd child
contains(s1, s2)   Tests whether string s1 contains string s2

XPath demonstration
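The live demonstration is not part of this text. As a stand-in, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The first book follows the Bookstore.xml tree described earlier. The second book is made up so that the filters have something to exclude.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="pt">Example Book</title>
    <author>Example Author</author>
    <year>2010</year>
    <price>39.90</price>
  </book>
</bookstore>
"""

root = ET.fromstring(BOOKSTORE_XML)  # root is the <bookstore> element

# bookstore/book/title: titles of all book children of the root.
titles = [t.text for t in root.findall("book/title")]

# //title[@lang='en']: any descendant title whose lang attribute is "en".
english = [t.text for t in root.findall(".//title[@lang='en']")]

# book[2]/title: the title of the second book child.
second = root.find("book[2]/title").text

# //@lang is not supported by ElementTree, so attributes are read in Python.
langs = [t.get("lang") for t in root.iter("title")]

# Numeric conditions such as [price>30] are also evaluated in Python here.
expensive = [b.find("title").text for b in root.findall("book")
             if float(b.find("price").text) > 30]

print(titles, english, second, langs, expensive)
```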

XQuery

o XQuery is the language for querying XML data.

o XQuery is to XML what SQL is to relational databases.

o XQuery is built on XPath expressions.

o XQuery is supported by most major database systems.

o XQuery is a W3C recommendation.

XQuery (FLWOR expression)

for $variable in expr      <- iteration
let $variable := expr      <- assignment
where condition            <- condition
order by expr              <- ordering
return expr

o Only return is mandatory, together with at least one for or let clause.

XQuery demonstration
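The demonstration itself is not included in this text. To show what each FLWOR clause does, the sketch below mirrors a query of the form "for each book, let p be its price, where p is above a threshold, order by title, return the title" in plain Python. It is an analogy rather than XQuery, and the XML content and the function name `flwor_titles` are made up for the example.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """
<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Example Book B</title><price>39.90</price></book>
  <book><title>Example Book A</title><price>49.50</price></book>
</bookstore>
"""


def flwor_titles(root, min_price):
    selected = []
    for book in root.findall("book"):             # for:   iterate over the books
        price = float(book.find("price").text)    # let:   bind a value to a variable
        if price > min_price:                     # where: keep only matching books
            selected.append(book.find("title").text)
    selected.sort()                               # order by: sort by title
    return selected                               # return: build the result


root = ET.fromstring(BOOKS_XML)
print(flwor_titles(root, 30))
```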

Peer-to-Peer Data Management

o In the architectures described so far, one must create a mediated schema or centralized database for the domain of the data integration application.

o Consider data sharing in a scientific context, where data may involve scientific findings from multiple disciplines, such as genomic data, diagnosis and clinical trial data, bibliographic data, and drug-related data.

o The owners of the data might not want to explicitly create a mediated schema that defines the entire scope of the collaboration and a set of terms to which every data source owner needs to map.

o The basic approach of peer data management systems, or PDMSs, is to eliminate the reliance on a central, authoritative mediated schema.


Semantic Integration

o Group and combine data from different sources on the basis of explicit semantics

o Ontology

Semantic Integration

Approaches: single ontology, multiple ontologies, and hybrid.

Semantic Integration

o A simple ontology of movies, with subclasses for comedies and documentaries.

o Given this ontology, the system can reason about the relevance of individual data sources to a query.

Semantic Integration

o S1 is relevant to a query asking for all movies, but S2 is not relevant to a query asking for comedies.

o Sources S3 and S4 are relevant to queries asking for all movies that won awards, and S4 is relevant to a query asking for movies that won an Oscar.
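The class-based part of this reasoning can be sketched as follows. The figure with the source descriptions is not reproduced in the text, so the sketch assumes that S1 holds comedies and S2 holds documentaries, and that sibling classes are disjoint. The award-related sources S3 and S4 involve property restrictions and are left out.

```python
# Hypothetical ontology: child class -> parent class.
SUBCLASS_OF = {"Comedy": "Movie", "Documentary": "Movie"}

# Assumed source descriptions: the class of objects each source contains.
SOURCES = {"S1": "Comedy", "S2": "Documentary"}


def ancestors(cls):
    """Return cls together with all of its superclasses."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def is_relevant(source_class, query_class):
    """A source can contribute if its class is a subclass or superclass of the query class."""
    return query_class in ancestors(source_class) or source_class in ancestors(query_class)


def relevant_sources(query_class):
    return sorted(s for s, cls in SOURCES.items() if is_relevant(cls, query_class))


print(relevant_sources("Movie"))   # query for all movies
print(relevant_sources("Comedy"))  # query for comedies
```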


Data Warehouse

o Data from several operational sources (on-line transaction processing systems, OLTP) are extracted, transformed, and loaded (ETL) into a data warehouse.

o Then, analysis, such as online analytical processing (OLAP), can be performed on cubes of integrated and aggregated data.


Data Warehouse: ETL

o Extract

o Transform

• Simple mapping

• Aggregation and normalization

• Calculation

o Load

• Updating extracted data is frequently done on a daily, weekly, or monthly basis.
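The three steps can be sketched end to end with Python's built-in `sqlite3` module playing the warehouse. Everything specific here is an assumption for the example: the order rows, the column names, and the target table `sales_by_country`.

```python
import sqlite3


def extract():
    """Extract: made-up order rows standing in for an operational (OLTP) source."""
    return [
        {"cust_country": "br", "qty": 2, "unit_price_cents": 1500},
        {"cust_country": "BR", "qty": 1, "unit_price_cents": 3000},
        {"cust_country": "us", "qty": 3, "unit_price_cents": 1000},
    ]


def transform(rows):
    """Transform: mapping, normalization, calculation, and aggregation."""
    totals = {}
    for row in rows:
        country = row["cust_country"].upper()               # mapping + normalization
        amount = row["qty"] * row["unit_price_cents"]       # calculation
        totals[country] = totals.get(country, 0) + amount   # aggregation
    return sorted(totals.items())


def load(conn, facts):
    """Load: write the transformed facts into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_country "
        "(country TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales_by_country VALUES (?, ?)", facts)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
print(conn.execute(
    "SELECT country, total_cents FROM sales_by_country ORDER BY country"
).fetchall())
```

In a periodic refresh, the same three functions would simply be run again on the new extract. `INSERT OR REPLACE` keeps the load step repeatable.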

Operations: Roll Up

o Example of a Roll Up operation using the Time dimension.

Operations: Drill Down

o Example of a Drill Down operation using the geographic location dimension.
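The example figures are not reproduced in this text. As a rough illustration, both operations can be seen as re-aggregating the same facts at a coarser or finer level of a dimension. The fact rows and the helper `aggregate` below are made up for the sketch.

```python
from collections import defaultdict

# Made-up sales facts with a time hierarchy (year > quarter)
# and a location hierarchy (state > city).
FACTS = [
    {"year": 2013, "quarter": "Q1", "state": "PE", "city": "Recife", "sales": 10},
    {"year": 2013, "quarter": "Q2", "state": "PE", "city": "Recife", "sales": 20},
    {"year": 2013, "quarter": "Q2", "state": "PE", "city": "Olinda", "sales": 5},
    {"year": 2014, "quarter": "Q1", "state": "SP", "city": "Campinas", "sales": 7},
]


def aggregate(facts, dims):
    """Sum sales grouped by the given dimension levels."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in dims)] += fact["sales"]
    return dict(totals)


# Roll up on time: from (year, quarter) to the coarser level year.
by_quarter = aggregate(FACTS, ["year", "quarter"])
by_year = aggregate(FACTS, ["year"])

# Drill down on location: from state to the finer level (state, city).
by_state = aggregate(FACTS, ["state"])
by_city = aggregate(FACTS, ["state", "city"])

print(by_year)
print(by_city)
```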

Operations: Drill Across

o Occurs when the user skips an intermediate level within the same dimension.

o The user performs a Drill Across when going from year directly to quarter or month.

Operations: Drill Through

o Occurs when the user moves from information in one dimension to information in another.

o Example: the user is in the time dimension and, in the next step, starts analyzing the information by region.

Operations: Slice and Dice

o Slice is the operation that cuts the cube while keeping the same perspective for viewing the data.

Operations: Slice and Dice

o Table 1 shows the sales of cell phones and pagers.

o Table 2 represents a slice of the data (an operation that shows only the figures for one product type: cell phones).

Operations: Slice and Dice

o Dice is a change in the perspective from which the data is viewed.

o It is the extraction of a subcube, or the intersection of several slices.

Operations: Slice and Dice

o We were viewing and analyzing the data in the order state, city, year, product model, and product (Table 1).

o Dice changes the perspective to product model, product, year, state, and city (Table 2).
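A small sketch of the two operations over a cube stored as a list of cells. It models slice as fixing a single dimension value and dice as extracting a subcube, and does not model reordering the dimensions for display. The cell values and function names are made up for illustration.

```python
# Made-up cube cells: one row per (state, year, product) combination.
CUBE = [
    {"state": "PE", "year": 2013, "product": "cell phone", "sales": 10},
    {"state": "PE", "year": 2014, "product": "pager", "sales": 4},
    {"state": "SP", "year": 2013, "product": "cell phone", "sales": 7},
    {"state": "SP", "year": 2014, "product": "cell phone", "sales": 9},
]


def slice_cube(cube, dim, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    return [cell for cell in cube if cell[dim] == value]


def dice_cube(cube, allowed):
    """Dice: keep the subcube where every listed dimension takes an allowed value."""
    return [cell for cell in cube
            if all(cell[dim] in values for dim, values in allowed.items())]


cell_phones = slice_cube(CUBE, "product", "cell phone")
subcube = dice_cube(CUBE, {"product": {"cell phone"}, "year": {2013}})
print(len(cell_phones), len(subcube))
```

The subcube here equals the intersection of the "cell phone" slice and the "2013" slice, which matches the description of dice as an intersection of several slices.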

The Future of Data Integration

o Crowdsourcing and “Human Computing”

• Some conditions are hard for a computer to evaluate.

• For example, the quality of extractions from the Web can be verified by humans, and schemas and data can be matched by crowds.

o Lightweight Integration

• We often face a task where we need to integrate data from multiple sources to answer a question that will be asked only once or twice. However, the integration needs to be done quickly and by people who may not have much technical expertise.

• For example, consider a disaster response situation in which reports are coming from multiple data sources in the field, and the goal is to corroborate them and quickly share them with the affected public.

The Future of Data Integration

o Visualizing Integrated Data

• Users do not want to view rows of data but rather visualizations that highlight the important patterns in the data and offer flexible exploration.

o Cluster- and Cloud-Based Parallel Processing and Caching

• The ultimate vision of the data integration field is to be able to integrate large numbers of sources with large amounts of data - ultimately approaching the scale of the structured part of the Web.


References

o HALEVY, Alon; RAJARAMAN, Anand; ORDILLE, Joann. Data integration: the teenage years. In: Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006. p. 9-16.

o ZIEGLER, Patrick; DITTRICH, Klaus R. Three Decades of Data Integration - all Problems Solved?. In: Building the Information Society. Springer US, 2004. p. 3-12.

o REEVE, April. Managing Data in Motion: Data Integration Best Practice Techniques and Technologies. Newnes, 2013.

o DOAN, AnHai; HALEVY, Alon; IVES, Zachary. Principles of data integration. Elsevier, 2012.

o LÓSCIO, Bernadette. Integração de Dados: Ontem, hoje e sempre. Available at: http://pt.slideshare.net/bernafarias/integracao-dados-ontem-hoje-e-sempre. Accessed 6 Sep. 2014.

o LÓSCIO, Bernadette. Integração de Informações no Governo Eletrônico. Available at: http://pt.slideshare.net/bernafarias/integrao-de-informaes-no-governo-eletrnico. Accessed 6 Sep. 2014.

o OLIVEIRA, Stanley R de M. OLAP: Online Analitical Processing. Available at: http://pt.slideshare.net/Valldo/olap-1p. Accessed 6 Sep. 2014.

o WIDOM, Jennifer. Introduction to Databases. Available at: https://class.stanford.edu/courses/Engineering/db/2014_1/. Accessed 6 Sep. 2014.