Lineage Tracing for General Data Warehouse Transformations Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University Presentation by Aaron St.Clair


Page 1

Lineage Tracing for General Data Warehouse Transformations

Yingwei Cui and Jennifer Widom

Computer Science Department, Stanford University

Presentation by Aaron St.Clair

Page 2

Outline
• What is lineage tracing?
• Why is tracing lineage data important?
• How can we find lineage data?
• Performance results

Page 3

Data Warehouses
• Integrate data from multiple sources
• Data undergoes a series of transformations
• Transformations vary in complexity

[Figure: Data Source 1, Data Source 2, …, Data Source N → Transformation → Summarized Data]

Page 4

Lineage Tracing
• Identifying the specific data items in the sources from which a given data item in the warehouse was derived
• Allows
– In-depth data analysis
– Data mining
– Authorization management
– View update
– Efficient warehouse recovery

Page 5

An Example

Selects items whose last-quarter sales are more than twice the average of the last three quarters’ sales
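
The selection described above could be sketched in Python roughly as follows. This is only an illustration: the record layout (`ItemSales` with four quarterly figures), the function name `sales_jump`, the sample data, and the reading of "the last three quarters" as the three quarters preceding the most recent one are all assumptions, since the slide does not spell them out.

```python
from typing import NamedTuple


class ItemSales(NamedTuple):
    """Hypothetical record: one item with four quarters of sales."""
    name: str
    q1: float
    q2: float
    q3: float
    q4: float  # most recent ("last") quarter


def sales_jump(items):
    """Keep items whose last-quarter sales exceed twice the 3-quarter average."""
    selected = []
    for item in items:
        # Assumption: the average is taken over the three preceding quarters.
        avg3 = (item.q1 + item.q2 + item.q3) / 3
        if item.q4 > 2 * avg3:
            selected.append(item)
    return selected


# Made-up sample data, for illustration only.
items = [
    ItemSales("apple", 10, 10, 10, 25),  # average 10, and 25 > 20: selected
    ItemSales("pear", 10, 10, 10, 15),   # average 10, and 15 <= 20: dropped
]
jumped = sales_jump(items)
```

Tracing the lineage of a selected item then means asking which source records fed into its quarterly figures.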

Page 6

An Example

Page 7

Lineage Granularity
• Coarse-grained
– Schema-level, attribute mapping
• Fine-grained
– Set of source data items

Page 8

Existing Work
• Mostly coarse-grained lineage
• Existing methods for fine-grained lineage
– Extra annotation
– Developer-defined weak inverses
– Statistical estimation
• Can’t handle complex procedural transformations

Page 9

Tracing Lineage - Definitions
• Data set: set of data items without duplicates
• Transformation: any procedure that takes data sets as input and produces data sets as output
– Stable (no spurious output)
– Deterministic (under some conditions)
• Lineage of a data item: set of input data items that contribute to that item

Page 10

Determining Contributions

• Need to find relevant data items
– Easy for simple relational operators
– Difficult for procedural transformations
• Select positives vs. aggregation and sum
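
The contrast between a "select positives" filter and a sum aggregation can be made concrete with a small Python sketch. The function names and sample values here are hypothetical, and the sketch assumes the simplest reading of "contributes": a filtered item derives only from itself, while a sum derives from every input item.

```python
def select_positives(values):
    """Filter transformation: keep only the positive items."""
    return {v for v in values if v > 0}


def lineage_select_positives(values, output_item):
    """A filtered output item derives only from the identical input item."""
    return {v for v in values if v > 0 and v == output_item}


def sum_all(values):
    """Aggregate transformation: a one-item data set holding the total."""
    return {sum(values)}


def lineage_sum(values, output_item):
    """Every input item contributes to the total (assumed reading)."""
    return set(values) if output_item == sum(values) else set()


# Made-up source data set (no duplicates, so a Python set fits).
source = {-2, 1, 3}
```

With this data, the lineage of the filtered item `3` is just `{3}`, while the lineage of the sum `2` is the whole source set.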

Page 11

Lineage Tracing

• Use of hierarchical model
– Transformation classes
– Schema mappings
– Defined inverses

Page 12

Transformation Classes

• The transformation class defines the procedure for lineage determination
• For a dispatcher:
– Apply the transformation to each input item I in turn
– If T(I) is in the output set, add I to the lineage of the output set
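
The dispatcher procedure above might look roughly like this in Python. It is a sketch under assumptions: `trace_dispatcher` and `positive_squares` are hypothetical names, data sets are modeled as Python sets, and "T(I) is in the output set" is read as "T applied to the single item I produces at least one of the outputs being traced".

```python
def trace_dispatcher(transform, input_set, traced_outputs):
    """Collect the input items whose individual output hits the traced outputs."""
    lineage = set()
    for item in input_set:
        # Apply the transformation to the single-item data set {item}.
        if transform({item}) & traced_outputs:
            lineage.add(item)
    return lineage


def positive_squares(items):
    """Hypothetical dispatcher: each input item is processed independently."""
    return {x * x for x in items if x > 0}


inputs = {-3, 1, 2, 3}
outputs = positive_squares(inputs)
lineage = trace_dispatcher(positive_squares, inputs, {9})
```

Note that `-3` is not in the lineage of `9` here, because the transformation drops non-positive items before squaring. The cost is one call to the transformation per input item, which is why the cheaper techniques on the following slides matter.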

Page 13

Schema Mappings
• Defined schema for input and output of a transformation
• Backward key-maps
– A_key ← g(B)
– T1
• Forward key-maps
– f(A) → B_key
– T4
• Backward total-maps
– A ← g(B)
– T5
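
As a rough illustration of how key-maps can drive tracing, the Python sketch below matches input items to an output item through the mapped attributes. The semantics are an assumed reading of the slide: with a forward key-map, the lineage of an output item is every input item whose mapped attribute value f(A) equals the output's key B; with a backward key-map, it is the input item whose key A equals g(B). All names, attributes, and records are hypothetical.

```python
def trace_forward_key_map(inputs, output_item, f, a, b_key):
    """Lineage under a forward key-map: inputs i with f(i[a]) == output[b_key]."""
    return [i for i in inputs if f(i[a]) == output_item[b_key]]


def trace_backward_key_map(inputs, output_item, g, a_key, b):
    """Lineage under a backward key-map: inputs i with i[a_key] == g(output[b])."""
    wanted = g(output_item[b])
    return [i for i in inputs if i[a_key] == wanted]


# Made-up input data set.
orders = [
    {"order_id": 1, "product": "apple", "qty": 2},
    {"order_id": 2, "product": "pear", "qty": 5},
    {"order_id": 3, "product": "apple", "qty": 4},
]

# Hypothetical output keyed by the upper-cased product name.
total = {"PRODUCT": "APPLE", "total": 6}
forward = trace_forward_key_map(orders, total, str.upper, "product", "PRODUCT")

# Hypothetical output carrying a reference string that encodes the input key.
receipt = {"order_ref": "ORD-2", "qty": 5}
backward = trace_backward_key_map(
    orders, receipt, lambda ref: int(ref.split("-")[1]), "order_id", "order_ref"
)
```

Neither function calls the transformation itself, which is the appeal of schema mappings: the lineage is found by a lookup on the input data set.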

Page 14

Provided Inverses/Tracing Procedures
• Best case: someone has defined a function mapping output items to the input items they derive from
• Know nothing about the efficiency of the function

Page 15

Property Hierarchy

Page 16

Finding Lineage

• Recursively apply algorithms based on the transformation type until we reach top level

Page 17

Optimizations
• Indexing the input data set improves performance
• A functional index using the schema optimizes queries of the form F(i) = v
• Store auxiliary or intermediate views in the warehouse
– Reduce their number by composing transformations
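
One way to picture a functional index for queries of the form F(i) = v is a dictionary keyed by F(i). The Python sketch below is illustrative only: `build_functional_index`, the choice of F, and the data are assumptions, and a real warehouse would use the index facilities of its database system instead.

```python
from collections import defaultdict


def build_functional_index(items, func):
    """Group items by func(item) so that F(i) = v becomes a dictionary lookup."""
    index = defaultdict(list)
    for item in items:
        index[func(item)].append(item)
    return index


# Made-up data set; F maps each word to its first letter.
words = ["apple", "avocado", "pear", "plum", "fig"]
index = build_functional_index(words, lambda w: w[0])

# Answer the query F(i) = "p" without scanning the whole data set.
matches = index.get("p", [])
```

The index is built once in a single pass, after which each lookup avoids a full scan of the input data set.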

Page 18

Transformation Graphs
• Create a tracing sequence for each path from input to output in the graph
• Combine the results of each sequence
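
A minimal Python sketch of tracing along paths and combining the results is shown below. It assumes that each transformation comes with a tracing function and that its input data set was stored as an intermediate view. The names `trace_sequence` and `trace_graph`, the per-source dictionary used to combine results, and the toy transformations are all hypothetical.

```python
def trace_sequence(steps, traced_outputs):
    """Trace backward through one path.

    steps: list of (tracer, stored_input) pairs, ordered from the first
    transformation on the path to the last.
    """
    current = set(traced_outputs)
    for tracer, stored_input in reversed(steps):
        current = tracer(stored_input, current)
    return current


def trace_graph(paths, traced_outputs):
    """Trace every path and keep one lineage set per source (assumed layout)."""
    return {name: trace_sequence(steps, traced_outputs)
            for name, steps in paths.items()}


# Hypothetical path 1: source_a -> keep evens -> multiply by 10.
source_a = {1, 2, 3, 4}
evens = {x for x in source_a if x % 2 == 0}  # stored intermediate view


def trace_keep_evens(inp, outs):
    return {x for x in inp if x in outs}


def trace_times_ten(inp, outs):
    return {x for x in inp if x * 10 in outs}


# Hypothetical path 2: source_b -> keep values above 10.
source_b = {5, 40}


def trace_above_ten(inp, outs):
    return {x for x in inp if x > 10 and x in outs}


paths = {
    "source_a": [(trace_keep_evens, source_a), (trace_times_ten, evens)],
    "source_b": [(trace_above_ten, source_b)],
}
lineage = trace_graph(paths, {40})
```

Here the output item `40` traces back to `4` in `source_a` and to `40` in `source_b`.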

Page 19

Performance
• 1GB warehouse
• Schema mapping better than transformation class-specific algorithms
• Indexing helps
• Combining attributes reduces trace time

Page 20

Questions?