CS 595 - Hot topics in database systems: Data Provenance Boris Glavic August 22, 2012

CS 595 - Hot topics in database systems: Data Provenance

Boris Glavic

August 22, 2012

me Course Overview Course Info

Outline

1 Instructor

2 Course Overview

3 Course Details and Administrative Information

Boris Glavic

• Assistant Professor for Database Systems (aka the new guy)

• Office: Stuart Building, room 226C

• Office hours: Thursday, 1:00 pm - 2:00 pm

• Webpage: www.cs.iit.edu/~glavic

• Phone: 312 567 5205

Slide 1 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

CS 595-06 - Data Provenance

Administrative Info

• Hours: Mon + Wed 3:15 - 4:30 PM

• Room: Stuart Building in room 106

• Course Webpage: Will be linked on www.cs.iit.edu/~glavic soon!

What the heck is Data Provenance?

Data Provenance

Information about the creation process and origin of data

Why do we call it Provenance?

Origin of the Term

• From art dealing for pieces of art

Alternative Terms

• Lineage for kings

• Data Pedigree for dogs

Provenance in Art

Given a piece of art

• How do we know . . .
  • if it is authentic?
  • who created it?
  • if it has been altered?

Provenance

• French provenir, "to come from"

• Chronology of the ownership or location of an historical object

Example

Jan van Eyck - Arnolfini Portrait

• 1434 - Painting dated by van Eyck; presumably owned by the sitters.

• before 1516 - In possession of Don Diego de Guevara (d. Brussels 1520), a Spanish career courtier of the Habsburgs (himself the subject of a fine portrait by Michael Sittow in the National Gallery of Art). He lived most of his life in the Netherlands, and may have known the Arnolfinis in their later years. By 1516 he had given the portrait to Margaret of Austria, Habsburg Regent of the Netherlands.

• 1516 - Painting is the first item in an inventory of Margaret's paintings, made in her presence at Mechelen. The item says (in French): "a large picture which is called Hernoul le Fin with his wife in a chamber, which was given to Madame by Don Diego, whose arms are on the cover of the said picture; done by the painter Johannes." A note in the margin says "It is necessary to put on a lock to close it: which Madame has ordered to be done."

• 1523-4 - In another Mechelen inventory, a similar description, this time the name of the subject is given as "Arnoult Fin".

• 1558 - In 1530 the painting was inherited by Margaret's niece Mary of Hungary, who in 1556 went to live in Spain. It is clearly described in an inventory taken after her death in 1558, when it was inherited by Philip II of Spain. A painting of two of his young daughters commissioned by Philip clearly copies the pose of the figures (Prado).[1]

• 1599 - A German visitor saw it in the Alcazar Palace in Madrid. Now it had verses from Ovid painted on the frame: "See that you promise: what harm is there in promises? In promises anyone can be rich." It is very likely that Velazquez knew the painting, which may have influenced his Las Meninas, which shows a room in the same palace.

• 1700 - In an inventory after the death of Carlos II it was still in the palace, with shutters and the verses from Ovid.

• 1794 - Now in the Palacio Nuevo in Madrid.

• 1816 - The painting is now in London, in the possession of Colonel James Hay, a Scottish soldier. He claimed that after being seriously wounded at the Battle of Waterloo the previous year, the painting hung in the room where he convalesced in Brussels. He fell in love with it, and persuaded the owner to sell. More relevant to the real facts is no doubt Hay's presence at the Battle of Vitoria (1813) in Spain, where a large coach loaded by King Joseph Bonaparte with easily portable artworks from the Spanish royal collections was first plundered by British troops, before what was left was recovered by their commanders and returned to the Spanish. Hay offered the painting to the Prince Regent, later George IV of England, via Sir Thomas Lawrence. The Prince had it on approval for two years at Carlton House before eventually returning it in 1818.

Provenance in Data Processing

Given a piece of data

• How do we know . . .
  • which data it is derived from?
  • which transformations (SQL) were used to create it?
  • who created it?
  • . . .

Example

Compute the revenue for each shop as the sum of prices of items sold

result

     shop    rev
t1   Migros  125
t2   Coop    25

↑

SELECT shop, sum(price) AS rev
FROM sales, items
WHERE itemId = id
GROUP BY shop

↑

sales

     shop    itemId
s1   Migros  1
s2   Migros  3
s3   Coop    3

items

     id  price
i1   1   100
i2   2   10
i3   3   25
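The example can be reproduced with a minimal sketch, assuming SQLite (through Python's built-in sqlite3 module) as a stand-in database. The column types and the ORDER BY clause are assumptions added for the sketch; the table contents and the query come from the slide.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the course's database system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (shop TEXT, itemId INTEGER);
    CREATE TABLE items (id INTEGER, price INTEGER);
    INSERT INTO sales VALUES ('Migros', 1), ('Migros', 3), ('Coop', 3);
    INSERT INTO items VALUES (1, 100), (2, 10), (3, 25);
""")

# The query from the slide; ORDER BY is added only to make the output order fixed.
revenue = conn.execute("""
    SELECT shop, sum(price) AS rev
    FROM sales, items
    WHERE itemId = id
    GROUP BY shop
    ORDER BY shop
""").fetchall()

print(revenue)  # [('Coop', 25), ('Migros', 125)]
```

The Migros revenue of 125 combines item 1 (price 100) and item 3 (price 25). Item 2 is never sold, so it contributes to no result tuple.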

Definition (Data Provenance)

Information about the origin and creation process of data.

A more Complex Example

Scenario

• You are an analyst for a garden supply shop

• You have to compute the first-quarter revenue for each shop location

• Data warehouse with sales data

• Use SQL to compute the required information from the warehouse

Example (Input Data)

Employee

SSN  Name             WorksFor
123  Peter Peterson   New York
342  Jane Janeson     New York
555  Heinz Heinzmann  Wuppertal

Shop

Location   Budget
New York   1.000.000
Wuppertal  4.000

Item

Id  Description  Price
1   Lawnmower    199
2   Fertilizer   32
3   Rake         9

Sales

Employee  Item  Amount  Month
123       1     1       1
342       2     64      1
342       3     2       3
555       3     1       5

Example (SalesTotal Query)

CREATE VIEW SalesTotal AS
SELECT Location AS Shop, Month, SSN AS Employee,
       Price * Amount AS Totalprice
FROM Employee E, Shop H, Item I, Sales S
WHERE E.WorksFor = H.Location
  AND E.SSN = S.Employee
  AND I.Id = S.Item

Example (Results)

SalesTotal

Shop       Month  Employee  Totalprice
New York   1      123       199
New York   1      342       2048
New York   3      342       18
Wuppertal  5      555       9

Example (MonthlyRevenue Query)

CREATE VIEW MonthlyRevenue AS
SELECT Shop, Month, sum(Totalprice) AS Revenue
FROM SalesTotal
GROUP BY Shop, Month

Example (Results)

MonthlyRevenue

Shop       Month  Revenue
New York   1      2247
New York   3      18
Wuppertal  5      9

Example (RevenueFirstQ Query)

CREATE VIEW RevenueFirstQ AS
SELECT Shop, sum(Revenue) AS Revenue
FROM MonthlyRevenue
WHERE Month < 4
GROUP BY Shop

Example (Results)

RevenueFirstQ

Shop      Revenue
New York  2265
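The three stacked views can be checked end to end with a small sketch, assuming SQLite via Python's sqlite3 module. The column types are assumptions, the budgets are stored as plain integers, and the first quarter is assumed to mean months 1 to 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base tables from the input data slide (column types are assumptions).
    CREATE TABLE Employee (SSN INTEGER, Name TEXT, WorksFor TEXT);
    CREATE TABLE Shop (Location TEXT, Budget INTEGER);
    CREATE TABLE Item (Id INTEGER, Description TEXT, Price INTEGER);
    CREATE TABLE Sales (Employee INTEGER, Item INTEGER, Amount INTEGER, Month INTEGER);

    INSERT INTO Employee VALUES (123, 'Peter Peterson', 'New York'),
                                (342, 'Jane Janeson', 'New York'),
                                (555, 'Heinz Heinzmann', 'Wuppertal');
    INSERT INTO Shop VALUES ('New York', 1000000), ('Wuppertal', 4000);
    INSERT INTO Item VALUES (1, 'Lawnmower', 199), (2, 'Fertilizer', 32), (3, 'Rake', 9);
    INSERT INTO Sales VALUES (123, 1, 1, 1), (342, 2, 64, 1), (342, 3, 2, 3), (555, 3, 1, 5);

    -- View 1: one row per sale, with the shop and the total price.
    CREATE VIEW SalesTotal AS
    SELECT Location AS Shop, Month, SSN AS Employee, Price * Amount AS Totalprice
    FROM Employee E, Shop H, Item I, Sales S
    WHERE E.WorksFor = H.Location AND E.SSN = S.Employee AND I.Id = S.Item;

    -- View 2: revenue per shop and month.
    CREATE VIEW MonthlyRevenue AS
    SELECT Shop, Month, sum(Totalprice) AS Revenue
    FROM SalesTotal
    GROUP BY Shop, Month;

    -- View 3: revenue per shop over months 1 to 3 (assumed first quarter).
    CREATE VIEW RevenueFirstQ AS
    SELECT Shop, sum(Revenue) AS Revenue
    FROM MonthlyRevenue
    WHERE Month < 4
    GROUP BY Shop;
""")

monthly = conn.execute(
    "SELECT Shop, Month, Revenue FROM MonthlyRevenue ORDER BY Shop, Month"
).fetchall()
first_q = conn.execute("SELECT Shop, Revenue FROM RevenueFirstQ").fetchall()

print(monthly)  # [('New York', 1, 2247), ('New York', 3, 18), ('Wuppertal', 5, 9)]
print(first_q)  # [('New York', 2265)]
```

Wuppertal does not appear in the final result because its only sale happened in month 5.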

Compute First Quarter Revenue

[Figure: dependencies between the views and the base tables]

RevenueFirstQ ← MonthlyRevenue ← SalesTotal ← Employee, Shop, Item, Sales

Example Data

Example

SalesTotal

Employee

SSN  Name             WorksFor
123  Peter Peterson   New York
342  Jane Janeson     New York
555  Heinz Heinzmann  Wuppertal

EmployeeShop

SSN  Name             WorksFor   Location   Budget
123  Peter Peterson   New York   New York   1.000.000
342  Jane Janeson     New York   New York   1.000.000
555  Heinz Heinzmann  Wuppertal  Wuppertal  4.000

Tracing an Error

Problem

• One result tuple of your query looks suspicious

• You suspect the input data to be the culprit

• How do you know which input data affected which output data?

This is Data Provenance

• But how to get at the data provenance?
  • Manually?
  • Not reasonable for big data or complex queries!

• Need a system that tracks it automatically!
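To illustrate what the manual approach looks like, here is a sketch of a hand-written tracing query for the suspicious first-quarter revenue of New York, assuming SQLite via Python's sqlite3 module. The helper name trace_first_quarter, the column types, and the use of SQLite's implicit rowid as a tuple identifier are assumptions for illustration; this is not how a provenance system such as Perm computes provenance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (SSN INTEGER, Name TEXT, WorksFor TEXT);
    CREATE TABLE Shop (Location TEXT, Budget INTEGER);
    CREATE TABLE Item (Id INTEGER, Description TEXT, Price INTEGER);
    CREATE TABLE Sales (Employee INTEGER, Item INTEGER, Amount INTEGER, Month INTEGER);
    INSERT INTO Employee VALUES (123, 'Peter Peterson', 'New York'),
                                (342, 'Jane Janeson', 'New York'),
                                (555, 'Heinz Heinzmann', 'Wuppertal');
    INSERT INTO Shop VALUES ('New York', 1000000), ('Wuppertal', 4000);
    INSERT INTO Item VALUES (1, 'Lawnmower', 199), (2, 'Fertilizer', 32), (3, 'Rake', 9);
    INSERT INTO Sales VALUES (123, 1, 1, 1), (342, 2, 64, 1), (342, 3, 2, 3), (555, 3, 1, 5);
""")

def trace_first_quarter(shop):
    """Hypothetical helper: list the Sales rows that feed one shop's
    first-quarter revenue by repeating the joins and filters of the
    view chain without the aggregation."""
    return conn.execute("""
        SELECT S.rowid, E.Name, I.Description, S.Amount, S.Month,
               I.Price * S.Amount AS Totalprice
        FROM Employee E, Shop H, Item I, Sales S
        WHERE E.WorksFor = H.Location
          AND E.SSN = S.Employee
          AND I.Id = S.Item
          AND H.Location = ?
          AND S.Month < 4
        ORDER BY S.rowid
    """, (shop,)).fetchall()

lineage = trace_first_quarter("New York")
for row in lineage:
    print(row)
# (1, 'Peter Peterson', 'Lawnmower', 1, 1, 199)
# (2, 'Jane Janeson', 'Fertilizer', 64, 1, 2048)
# (3, 'Jane Janeson', 'Rake', 2, 3, 18)
```

The three contributions add up to 2265. The query had to be rewritten by hand from the three view definitions, which is exactly the step that does not scale to large data or complex queries.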

Why should I care?

Use Cases

• Debugging (tracking the sources of errors)

• Propagating annotations

• Gain deeper understanding of data and transformations
  • Estimate quality, trust

• Improvement of other data processing technologies
  • Probabilistic databases
  • Deletion propagation
  • Testing

Application Domains

• Complex database queries, e.g., data warehousing

• E-science and curated databases

• Data integration/exchange

• Workflow systems

• ⇒ Application domains with complex, multi-stage data processing
  • Map-Reduce style processing and its "frontends" like Pig
  • Simulations
  • . . .

Debugging

Example

SalesTotal

Annotation Propagation

Example

EnzymeProduce

Enzyme       Gene
EC 1.1.1.1   ALB   ??
EC 1.97.1.6  ALB   ??

Gene

Id        Name
4q11-q13  ALB   {a4}
18q21.3   BCL2  {}

Enzyme

Enzyme       Weight  ProducedBy
EC 1.1.1.1   45      4q11-q13    {a1, a2}
EC 1.97.1.6  12      4q11-q13    {a2, a3}

CREATE VIEW EnzymeProduce AS
SELECT Enzyme, Name AS Gene
FROM Gene G, Enzyme E
WHERE G.Id = E.ProducedBy

Annotations

a1  Necessary for red blood cells
a2  Produced in liver
a3  Unhealthy
a4  Discovered by Edmond Hillary
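The slide leaves open which annotations the result tuples should carry. One possible policy, assumed here purely for illustration and not taken from the slides, is to give each result tuple the union of the annotations of the input tuples that were joined to produce it. A minimal sketch in plain Python, with tuple-level annotation sets as an assumed representation:

```python
# Annotated input tuples: the last component is the set of annotation ids.
gene = [
    ("4q11-q13", "ALB", {"a4"}),
    ("18q21.3", "BCL2", set()),
]
enzyme = [
    ("EC 1.1.1.1", 45, "4q11-q13", {"a1", "a2"}),
    ("EC 1.97.1.6", 12, "4q11-q13", {"a2", "a3"}),
]

# EnzymeProduce as a join on Gene.Id = Enzyme.ProducedBy.
# Assumed policy: a result tuple inherits the union of the annotations
# of the two input tuples it was derived from.
enzyme_produce = [
    (e_name, g_name, e_ann | g_ann)
    for (g_id, g_name, g_ann) in gene
    for (e_name, _weight, produced_by, e_ann) in enzyme
    if g_id == produced_by
]

for e_name, g_name, annotations in enzyme_produce:
    print(e_name, g_name, sorted(annotations))
# EC 1.1.1.1 ALB ['a1', 'a2', 'a4']
# EC 1.97.1.6 ALB ['a2', 'a3', 'a4']
```

Other policies are possible, for example propagating annotations per attribute value instead of per tuple, and they would fill in the "??" differently.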

Course Topics

We will study...

• Several models of database provenance

• Approaches for automatically tracking provenance

• Query languages and storage mechanisms for provenance

• Real systems that generate provenance data

• Outlook into other research areas that use provenance

Prerequisites

• Some database background (mainly query languages)
  • SQL
  • Relational algebra
  • Datalog

• Ideally you have taken one of the following courses:
  • CS 425
  • CS 520
  • CS 525

Course Material

• No textbook is required

• Research papers that are available online

Course Organization

Course will consist of ...

• Lectures

• Research paper reviews (oral presentations done by students)

• A major project
  • Implementation or extension of a real system
  • Written report
  • Oral presentation

Paper Reviews

Papers

• The list of papers will be on the course website soon

• Topics cover the content of the course and extend it

Process

• Students will have to pick papers to review
  • By the end of August or the first week of September

• Some of the classes will be used for oral presentations on the topics

• Each presentation will be followed by a discussion

• Short written reviews (4-6 pages) will be due by Nov 1st

Course Project

• Implementation or extension of a provenance system

• Topics:
  • Adding new provenance types to existing systems
    • Add causality based provenance to Perm
    • Add Why-provenance to Perm
    • Extend the How-provenance implementation of Perm to derive new information
  • Build a visualization tool for provenance in Perm
  • . . .
  • Your topic

Course Project cont.

Organisation

• Choose a project by Sep 15th
  • Meetings to discuss the projects with me

• Project implementation
  • Progress reports during semester

• Written report: Due by Nov 15th

• Oral presentations: Classes at the end of the semester

Stuff to get familiar with

• Get practical experience with a database system
  • PostgreSQL: open-source, quite standards-compliant, Perm is based on it
  • Use Perm, which is a provenance-enabled version of Postgres

• Query languages:
  • SQL
  • Relational Algebra
  • Datalog

• Programming skills: C, C++, Java

• *nix operating system or Cygwin for Windows users

Detailed Course Outline

• Introduction to Data Provenance
  • What is Data Provenance?
  • Why do we need it?
  • Understanding different types of data provenance

• Database Provenance
  • Provenance Models and Systems
    • Why-provenance
    • Where-provenance and the DBNotes system
    • Lineage and the WHIPS prototype
    • Witness-list semantics and Perm
    • Provenance semirings and Orchestra
    • Causality and Responsibility models
  • Storage mechanisms
  • Query languages

Detailed Course Outline cont.

• Extensions of the Provenance concept
  • Provenance for missing answers
  • Provenance for past queries
  • Provenance for updates

• Beyond Database Provenance
  • Scientific workflows
  • Provenance in the operating system context
  • Connection with Dataflow analysis in programming languages

Grading Policies

• Course Project (Implementation, Report, Presentation): 60%

• Paper reviews: written review (15%) and oral presentation (15%)

• Participation in the paper discussions: 10%

Questions?

Further Information

• Office hours: Thursday, 1:00 pm - 2:00 pm, room 226C

• Webpage: www.cs.iit.edu/~glavic

• Course webpage will be linked there soon.
