43
Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

  • View
    226

  • Download
    3

Embed Size (px)

Citation preview

Page 1: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Data Cleaning and Transformation - Tools

Helena GalhardasDEI IST

Page 2: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Agenda

ETL/Data Quality tools Generic functionalities Categories of tools

The AJAX data cleaning and transformation framework

Page 3: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Typical architecture of a DQ system

HumanKnowledge

HumanKnowledge

DataExtraction

DataLoading

DataTransformation

Metadata Dictionaries DataAnalysis

SchemaIntegration

... ...

SOURCE DATA

TARGET DATA

DataTransformation

Page 4: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Existing technology

How to perform data transformation and cleaning to achieve high quality data Ad-hoc programs

Programs difficult to optimize and maintain

RDBMS mechanisms to guarantee the verification of integrity constraints

Do not address important data instance problems

Data transformation scripts using an ETL or data quality tool

Page 5: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

(ETL and) Data Quality tools

Page 6: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Generic functionalities (1/2) Data sources – extract data from different and

heterogeneous data sources Extraction capabilities – schedule extracts; select

data; merge data from multiple sources Loading capabilities – multiple types,

heterogeneous targets; refresh and append; automatically create tables

Incremental updates – update instead of rebuild from scratch

User interface – graphical vs non-graphical Metadata repository – stores data schemas and

transformations

Page 7: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Generic functionalities (2/2) Performance techniques – partitioning; parallel

processing; threads; clustering Versioning – version control of data quality programs Function library – pre-built functions Language binding – language support to define new

functions Debugging and tracing – document the execution;

tracking and analyzing to detect problems and refine specifications

Exception detection and handling – set of input rules for which the execution of data quality program fails

Data lineage – identifies the input records that produced a given data item

Page 8: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Classification of tools (2005)

Commercial

Research

Page 9: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Categories of data quality tools

Data analysis – statistical evaluation, logical study and application of data mining algorithms to define data patterns and rules

Data profiling – analysing data sources to identify data quality problemsData transformation– set of operations that source data must undergo to fit target schemaData cleaning– detecting, removing and correcting dirty dataDuplicate elimination– identifies and merges approximate duplicate recordsData enrichement– use of additional information to improve data quality

Page 10: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Categories

    

    

    

    

    

    

Ajax     Y Y Y Y

Arktos     Y Y    

Clio     Y      

Flamingo Project         Y  

FraQL     Y Y Y  

IntelliClean         Y  

Kent State University Y Y        

Potter's Wheel Y   Y      

Transcm     Y      

Page 11: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

The AJAX data transformation and cleaning framework

http://dmir.inesc-id.pt/dmir/ajax/

Page 12: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Problems of ETL/data quality solutions (1)

The semantics of some data transformations is defined in terms of their implementation algorithms

App. Domain 1

App. Domain 2

App. Domain 3

Data transformations

...

Page 13: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

There is a lack of interactive facilities to tune a data cleaning application program

Problems of ETL/data quality solutions (2)

Dirty Data

Data Cleaning and Transformation

Clean data Rejected data

Page 14: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

14

Motivating example (1)

DirtyData(paper:String)

Data Cleaning & Transformation

Events(eventKey, name)

Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)

Authors(authorKey, name)

PubsAuthors(pubKey, authorKey)

Page 15: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

15

Motivating example (2)

[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95

DirtyData

Data Cleaning & Transformation

PDIS | Conference on Parallel and Distributed Information Systems

Events

QGMW96| Making Views Self-Maintainablefor Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996

PublicationsAuthors

DQua | Dallan Quass

AGup | Ashish Gupta

JWid | Jennifer Widom…..

QGMW96 | DQua

QGMW96 | AGup….

PubsAuthors

Page 16: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Modeling an ETL/data quality process

A data quality process is modeled by a directed acyclic graph of data transformations

DirtyData

DirtyAuthors

Authors

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles... DirtyEvents

CitiesTags

Page 17: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

AJAX features

An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms

A declarative language for logical operators SQL extension

A debugger facility for tuning a data cleaning program application Based on a mechanism of exceptions

Page 18: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

AJAX features

An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms

A declarative language for logical operators SQL extension

A debugger facility for tuning a data cleaning program application Based on a mechanism of exceptions

Page 19: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

19

Logical level: parametric operators

View: arbitrary SQL query Map: iterator-based one-to-many mapping with

arbitrary user-defined functions Match: iterator-based approximate join Cluster: uses an arbitrary clustering function Merge: extends SQL group-by with user-defined

aggregate functions Apply: executes an arbitrary user-defined

algorithm

Map Match

Merge

ClusterView

Apply

Page 20: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Logical level

DirtyData

DirtyAuthors

Authors

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles... DirtyEvents

CitiesTags

Page 21: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Logical

DirtyData

DirtyAuthors

Authors

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles... DirtyEvents

CitiesTags

Match

Cluster

Merge

Map

Map

Map

DirtyAuthorsDirtyTitles...

NL

TC

Java

DirtyData

Authors

DirtyEvents

CitiesTags

Java scan

SQL Scan

Java Scan

Physical

Page 22: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

22

Match Input: 2 relations Finds data records that correspond to the same

real object Calls distance functions for comparing field values

and computing the distance between input tuples Output: 1 relation containing matching tuples and

possibly 1 or 2 relations containing non-matching tuples

Page 23: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

23

Example

Cluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

Page 24: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

24

ExampleCREATE MATCH MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET distance = editDistance(da1.name, da2.name)

WHERE distance < maxDist

INTO MatchAuthorsCluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

Page 25: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

25

ExampleCREATE MATCH MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET distance = editDistance(da1.name, da2.name)

WHERE distance < maxDist

INTO MatchAuthors

Input:

DirtyAuthors(authorKey, name)861|johann christoph freytag

822|jc freytag

819|j freytag

814|j-c freytag

Output:

MatchAuthors(authorKey1, authorKey2, name1, name2)861|822|johann christoph freytag| jc freytag

822|814|jc freytag|j-c freytag ...

Cluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

Page 26: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Implementation of the match operator

s1 S1, s2 S2

(s1, s2) is a match if

editDistance (s1, s2) < maxDist

Page 27: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Nested loop

S1 S2

...

Very expensive evaluation when handling large amounts of data

Need alternative execution algorithms for the same logical specification

editDistance

Page 28: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

A database solution

CREATE TABLE MatchAuthors ASSELECT authorKey1, authorKey2, distanceFROM (SELECT a1.authorKey authorKey1,

a2.authorKey authorKey2, editDistance (a1.name, a2.name) distance FROM DirtyAuthors a1, DirtyAuthors a2)

WHERE distance < maxDist;

No optimization supported for a Cartesian product with external function calls

Page 29: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Window scanning

S

n

Page 30: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Window scanning

S

n

Page 31: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Window scanning

S

n

May loose some matches

Page 32: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

32

String distance filtering

S1 S2

maxDist = 1

John Smith

John Smit

Jogn Smith

John Smithe

length

length- 1

length

length + 1

editDistance

Page 33: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

AJAX features

An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms

A declarative language for logical operators SQL extension

A debugger facility for tuning a data cleaning program application Based on a mechanism of exceptions

Page 34: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

DEFINE FUNCTIONS ASChoose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerExceptionGenerate.generateId(INTEGER) RETURN STRINGNormal.removeCitationTags(STRING) RETURN STRING (600)

DEFINE ALGORITHMS ASTransitiveClosureSourceClustering(STRING)

DEFINE INPUT DATA FLOWS ASTABLE DirtyData(paper STRING (400));TABLE City(city STRING (80),citysyn STRING (80))KEY city,citysyn;

DEFINE TRANSFORMATIONS AS

CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paperKEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper}

Declarative specification

Page 35: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

DEFINE FUNCTIONS ASChoose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerExceptionGenerate.generateId(INTEGER) RETURN STRINGNormal.removeCitationTags(STRING) RETURN STRING (600)

DEFINE ALGORITHMS ASTransitiveClosureSourceClustering(STRING)

DEFINE INPUT DATA FLOWS ASTABLE DirtyData(paper STRING (400));TABLE City(city STRING (80),citysyn STRING (80))KEY city,citysyn;

DEFINE TRANSFORMATIONS AS

CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paperKEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper}

Graph of data transformations

Declarative specification

Page 36: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

AJAX features

An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms

A declarative language for logical operators SQL extension

A debugger facility for tuning a data cleaning program application Based on a mechanism of exceptions

Page 37: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

37

Management of exceptions

Problem: to mark tuples not handled by the cleaning criteria of an operator

Solution: to specify the generation of exception tuples within a logical operator exceptions are thrown by external functions output constraints are violated

Page 38: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Debugger facility

Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions

Supports the interactive data modification and, in the future, the incremental execution of logical operators

Page 39: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Architecture

Page 40: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

References

Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001: 371-380

Page 41: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

References

“A Survey of Data Quality tools”, J. Barateiro, H. Galhardas, Datenbank-Spektrum 14: 15-21, 2005.

“Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001: 371-380

Page 42: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

AJAX features

An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms

A declarative language for logical operators SQL extension

A debugger facility for tuning a data cleaning program application Based on a mechanism of exceptions

Page 43: Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Annotation-based optimization

The user specifies types of optimization The system suggests which algorithm to

use

Ex:

CREATE MATCHING MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET dist = editDistance(da1.name, da2.name)

WHERE dist < maxDist

% distance-filtering: map= length; dist = abs %

INTO MatchAuthors