Emanoel Carlos Data Integration: The Teenage Years (Alon Halevy, Anand Rajaraman, Joann Ordille) ecgfs

Emanoel Carlos

Data Integration: The Teenage Years(Alon Halevy, Anand Rajaraman, Joann Ordille)

ecgfs

2

3

Why Data Integration?

o Consider FullServe, a company that provides Internet access to homes, but also sells a few products that support the home computing infrastructure.

o FullServe decided to extend its reach to Europe. To do so, FullServe acquired a European company, EuroCard, which is mainly a credit card provider, but has recently started leveraging its customer base to enter the Internet market.

4

Why Data Integration?Human Resources Department

Training and Development Department

Sales Department Costumer Care Department

5


Some of the databases of EuroCard.

6


o Human Resources Department needs to be able to query for all of its employees

o FullServe has a single customer support hotline, which customers can call about any service or product they obtain from the company.

o FullServe wants to build a Web site to complement its telephone customer service line.

o Combining data from multiple sources can offer opportunities for a company to obtain a competitive advantage and find opportunities for improvement.

• suppose we find that in a particular area of the country FullServe is receiving an unusual number of calls about malfunctions in their service

7


8


o First, we face the challenge of schema heterogeneity, but on a much larger scale: millions of tables created by independent authors and in over 100 languages.

o Second, extracting the data is quite difficult

o Data integration is a key challenge for the advancement of science in fields such as biology, ecosystems, and water management, where groups of scientists are independently collecting data and trying to collaborate with one another.

o Data integration is a challenge for governments who want their different agencies to be better coordinated.

o And lastly, mash-ups are now a popular paradigm for visualizing information on the Web.

9

Why Data Integration?o Query: The focus of most data integration systems is on querying

disparate data sources. However, updating the sources is certainly of interest.

o Number of sources: Data integration is already a challenge for a small number of sources (fewer than 10 and often even 2!), but the challenges are exacerbated when the number of sources grows. Transparency.

o Heterogeneity: A typical data integration scenario involves data sources that were developed independently of each other. As a consequence, the data sources run on different systems: some of them are databases, but others may be content management systems or simply files residing in a directory.

o Autonomy: The sources do not necessarily belong to a single administrative entity.

• We cannot assume that we have full access to the data in a source or that we can access the data whenever we want.

• Furthermore, the sources can change their data formats and access patterns at any time, without having to notify any central administrative entity.

10

Data Integration Architectures

WarehouseVirtual Integration

11

Data Integration Architectures

Data Broker

12

Data Integration Architectures: Virtual approach to integration

Mediated Schemaor Warehouse

Source descriptions/Transforms

Query reformulation /Query over materialized data

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

RDBMS1

RDBMS2

13

Data Integration Architectures: Warehousing approach

Mediated Schemaor Warehouse

Source descriptions/Transforms

Query reformulation /Query over materialized data

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

RDBMS1

RDBMS2

14

Example

Movie (title, director, year, genre)Actors (title, actor)Plays (movie, location, startTime)Review (title, rating, description)

S1 S2 S3 S4 S4

Movies(name, actors, director, genre)

Cinemas(place, movie, start)

CinemasInNYC(cinema, title, startTime)

CinemasInSF(location, movie, startingTime)

Reviews(Title, date, grade, review)

15

Example

Movie (title, director, year, genre)Actors (title, actor)Plays (movie, location, startTime)Review (title, rating, description)

S1 S2 S3 S4 S4

16

Query Processing

QueryReformulator

QueryOptimizer

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

ExecutionEngine

RDBMS1

RDBMS2

Query over mediated schema

Logical query plan over sources

Physical query plan over sources

Subquery or fetchRequest per source

17

Query ProcessingMovie (title, director, year, genre)Actors (title, actor)Plays (movie, location, startTime)Review (title, rating, description)

S1 S2 S3 S4 S4

Movies(name, actors, director, genre)

Cinemas(place, movie, start)

CinemasInNYC(cinema, title, startTime)

CinemasInSF(location, movie, startingTime)

Reviews(Title, date, grade, review)

Tuples for Movie can be obtained from source S1, but the attribute title needs to be reformulated to name.

Since source S3 requires the title of a movie as input, and such a title is not specified in the query, the query plan must first access source S1 and then feed the movie titles returned from S1 as inputs to S3.

Tuples for Plays can be obtained from either source S2, S3 or S4. Since the latter is complete for showings in New York City (same for S4), we choose it over S2.

18

Query Processing

QueryReformulator

QueryOptimizer

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

ExecutionEngine

RDBMS1

RDBMS2





In our example, the optimizer will decide which join algorithm to use to combine results from S1 and S3.

For example, the join algorithm may stream movie titles arriving from S1 and input them into S3, or it may batch them up before sending them to S3.

19

Query Processing

QueryReformulator

QueryOptimizer

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

Wrapper /Extractor

ExecutionEngine

RDBMS1

RDBMS2





20In Proceedings of the AAAI 1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed Enviroments

Information Manifold

o The main contribution of the Information Manifold was the way it described the contents of the data sources it knew about.

o The Information Manifold proposed the method that later became known as the Local-as-View approach (LAV): an information source is described as a view expression over the mediated schema.

o Previous approaches employed the Global-as-View (GAV).

22

GAV: Global-as-view

o GAV defines the mediated schema as a set of views over the data sources.

23

GAV: Global-as-view

• The first expression shows how to obtain tuples for the Movie relation by joining relations in S1.

• The second expression obtains tuples for the Movie relation by joining data from sources S5, S6, and S7. Hence, the tuples that would be computed for Movie are the result of the union of the first two expressions.

• Also note that the second expression requires that we know the director, genre, and year of a movie. If one of these is missing, we will not have a tuple for the movie in th relation Movie.

• The third and fourth expressions generate tuples for the Plays relation by taking the union of S2 and S3.

24

Query over the mediated schema

o Suppose we have the following query over the mediated schema, asking for comedies starting after 8 pm:

o Reformulating Q with the source descriptions would yield the following four logical query plans:

25

LAV: Local-as-View

o Instead of specifying how to compute tuples of the mediated schema, LAV focuses on describing each data source as precisely as possible and independently of any other sources.

26

LAV: Local-as-View

o In LAV, sources S5–S7 would be described simply as projection queries over the Movie relation in the mediated schema.

o With LAV we can also model the source S8 as a join over the mediated schema:

o Furthermore, we can also express constraints on the contents of data sources. For example, we can describe the following source that includes movies produced after 1970 that are all comedies:

27

LAV: Local-as-View

o Consider the following query asking for comedies produced in or after 1960:

o Using the sources S5–S7, we would generate the following reformulation from the LAV source descriptions:

28

Directions

29

Generating Schema Mappingso One of the major bottlenecks in setting up a data integration

application is the effort required to create the source descriptions, and more specifically, writing the semantic mappings between the sources and the mediated schema.

o Semi-automatically generating schema mappings

• Based on clues that can be obtained from the schemas themselves, such as linguistic similarities between schema elements and overlaps in data values or data types of columns.

• Schema mapping tasks are often repetitive. Hence, we could use Machine Learning techniques that consider the manually created schema mappings as training data, and generalize from them to predict mappings between unseen schemas.

• Automatic schema mapping is an AI-Complete problem.

30

Adaptive query processing

o The context in which a data integration system operates is very dynamic and the optimizer has much less information than the traditional setting.

o As a result, two things happen:

• The optimizer may not have enough information to decide on a good plan

• A plan that looks good at optimization time may be arbitrarily bad if the sources do not respond exactly as expected.

31

Model Management

o The goal of Model Management is to provide an algebra for manipulating schemas and mappings.

o With such an algebra, complex operations on data sources are described as simple sequences of operators in the algebra and optimized and processed using a general system.

o Some of the operators that have been considered include the creation of mappings, inverting and composing mappings, merging schemas and schema differencing.

32

XML

33

Querying XML

o Recente

o Falta álgebra

o Linguagens:

• XPath• XQuery• Outras: XLST, Xlink, XPointer

34

Xpath (Caminho + condições)o XML ~ árvore

Bookstore.xml

bookstore

book book

titleautho

ryear

“en”

@lang

“Harry

Potter”

“J K. Rowlin

g”

“2005”

price

“29.99”

35

Xpath (construtores) nodename

Seleciona os nós “nodename”

/ Separador ou Seleciona o nó raíz

// Seleciona qualquer descendente

. Seleciona o nó corrente

.. Seleciona o nó pai

@ Seleciona atributos bookstore Nós com nome “bookstore”

/bookstore Seleciona a raíz “bookstore”

bookstore/book

Seleciona todos os “book” filhos de “bookstore”

book Seleciona todos os “book”, não importa onde

bookstore//book

Seleciona todos os “book”, que são descendentes de “bookstore”, não importa o nível

//@lang Seleciona todos os atributos “lang”

36

Xpath (construtores)Contains(s1, s2) Separador

elemento Acessa o elemento

* Acessa qualquer sub elemento

@elemento Acessa o atributo

// Acessa qualquer descendente

[@elemento>70]

Verifica a condição

[3] Acessa o 3º filho

37

Demonstração XPath

38

XQuery

o XQuery é a linguagem para busca de dados XML

o XQuery está para XML como SQL está para banco de dados

o XQuery é construída em expressões XPath

o XQuery é suportada pela maioria dos banco de dados

o XQuery é uma recomendação da W3C

39

XQuery (Expressão FLWOR)

F or $variavel in expr

L et $variavel := expr

W here condition

O rder by expr

R eturn expr

o Apenas return é obrigatório.

<- iterator

<- atribuição

<- condição

<- ordenamento

40

Demonstração XQuery

41

Peer-to-Peer Data Managemento So far, one must create a mediated schema or centralized

database for the domain of the data integration application.

o Consider data sharing in a scientific context, where data may involve scientific findings from multiple disciplines, such as genomic data, diagnosis and clinical trial data, bibliographic data, and drug-related and clinical trial data.

o The owners of the data might not want to explicitly create a mediated schema that defines the entire scope of the collaboration and a set of terms to which every data source owner needs to map.

o The basic approach of peer data management systems, or PDMSs, is to eliminate the reliance on a central, authoritative mediated schema.

42

Peer-to-Peer Data Management

43

Integração semântica

o Agrupar e combinar dados de diferentes fontes considerando uma semântica explícita

o Ontologia

44


Ontologia única Múltiplas ontologias

Híbrido

45


o A simple ontology of movies, with subclasses for comedies and documentaries.

o Given this ontology, the system can reason about the relevance of individual data sources to a query.

46


o S1 is relevant to a query asking for all movies, but S2 is not relevant to a query asking for comedies.

o Sources S3 and S4 are relevant to queries asking for all movies that won awards, and S4 is relevant to a query asking for movies that won an Oscar.

47

Data Warehouse

o Data from several operational sources (on-line transaction processing systems, OLTP) are extracted, transformed, and loaded (ETL) into a data warehouse.

o Then, analysis, such as online analytical processing (OLAP), can be performed on cubes of integrated and aggregated data.

48

Data Warehouse: ETL

o Extract

o Transform

• Simple mapping• Aggregation and normalization• Calculation

o Load

• Updating extracted data is frequently done on a daily, weekly, or monthly basis.

49

Operações: Roll Up

o Exemplo de uma operação de Roll Up utilizando a dimensão Tempo.

50

Operações: Drill Down

o Exemplo de uma operação de Drill Down utilizando a dimensão localização geográfica.

51

Operações: Drill Across

o Ocorre quando o usuário pula um nível intermediário dentro de uma mesma dimensão.

o O usuário executa um Drill Across quando ele passar de ano direto para trimestre ou mês.

52

Operações: Drill Through

o Ocorre quando um usuário passa de uma informação contida em uma dimensão para uma outra.

o Exemplo: um usuário está na dimensão tempo e, no próximo passo, começa a analisar a informação por região.

53

Operações: Slice and Dice

o Slice é operação que corta o cubo, mas mantém a mesma perspectiva de visualização dos dados

54


o Tabela 1apresenta as vendas de celulares e pagers.

o Tabela 2representa uma fatia dos dados (operação que visualiza somente a produção de um tipo de produto –celulares)

55


o Dice é a mudança de perspectiva da visão.

o É a extração de um subcubo ou a interseção de vários slices.

56


o Estávamos visualizando e analisando no sentido estado, cidade, ano, modelo de produto eproduto (Tabela 1).

o Dice é a mudança de perspectiva para modelo de produto,

o produto, ano, estado e cidade (Tabela2)

57

The Future of Data Integration o Crowdsourcing and “Human Computing”

• Some conditions are hard for a computer to evaluate.• For example, the quality of extractions from the Web can be

verified by humans, and schemas and data can be matched by crowds.

o Lightweight Integration

• We often face a task where we need to integrate data from multiple sources to answer a question that will be asked only once or twice. However, the integration needs to be done quickly and by people who may not have much technical expertise.

• For example, consider a disaster response situation in which reports are coming from multiple data sources in the field, and the goal is to corroborate them and quickly share them with the affected public.

58

The Future of Data Integration o Visualizing Integrated Data

• Users do not want to view rows of data but rather visualizations that highlight the important patterns in the data and offer flexible exploration.

o Cluster- and Cloud-Based Parallel Processing and Caching

• The ultimate vision of the data integration field is to be able to integrate large numbers of sources with large amounts of data - ultimately approaching the scale of the structured part of the Web.

Emanoel Carlos

Data Integration: The Teenage Years(Alon Halevy, Anand Rajaraman, Joann Ordille)

ecgfs

Referênciaso HALEVY, Alon; RAJARAMAN, Anand; ORDILLE, Joann. Data integration: the teenage

years. In: Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment, 2006. p. 9-16.

o ZIEGLER, Patrick; DITTRICH, Klaus R. Three Decades of Data Integration - all Problems Solved?. In: Building the Information Society. Springer US, 2004. p. 3-12.Livro 1

o REEVE, April. Managing Data in Motion: Data Integration Best Practice Techniques and Technologies. Newnes, 2013.

o DOAN, AnHai; HALEVY, Alon; IVES, Zachary. Principles of data integration. Elsevier, 2012.

o LÓSCIO, Bernadette. Integração de Dados: Ontem, hoje e sempre. Disponível em: http://pt.slideshare.net/bernafarias/integracao-dados-ontem-hoje-e-sempre. Acesso em 6 de set de 2014.

o LÓSCIO, Bernadette. Integração de Informações no Governo Eletrônico. Disponível em: http://pt.slideshare.net/bernafarias/integrao-de-informaes-no-governo-eletrnico. Acesso em 6 d set de 2014.

o OLIVEIRA, Stanley R de M. OLAP: Online Analitical Processing. Disponível em: http://pt.slideshare.net/Valldo/olap-1p. Acesso em 6 de set de 2014.

o WIDOM, Jennifer. Introduction to Databases . Disponível em: https://class.stanford.edu/courses/Engineering/db/2014_1/. Acesso em 6 de set de 2014.

Documents

Emanoel Carlos Data Integration: The Teenage Years (Alon Halevy, Anand Rajaraman, Joann Ordille) ecgfs