Outline
1 Instructor
2 Course Overview
3 Course Details and Administrative Information
Boris Glavic
• Assistant Professor for Database Systems (aka the new guy)
• Office: Stuart Building, room 226C
• Office hours: Thursday, 1:00 pm - 2:00 pm
• Webpage: www.cs.iit.edu/~glavic
• Phone: 312 567 5205
Slide 1 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance
Outline
1 Instructor
2 Course Overview
  What the heck is Data Provenance?
  Why should I care?
3 Course Details and Administrative Information
CS 595-06 - Data Provenance
Administrative Info
• Hours: Mon + Wed 3:15 - 4:30 PM
• Room: Stuart Building, room 106
• Course Webpage: Will be linked on www.cs.iit.edu/~glavic soon!
What the heck is Data Provenance?
Data Provenance
Information about the creation process and origin of data
Why do we call it Provenance?
Origin of the Term
• From art dealing for pieces of art
Alternative Terms
• Lineage for kings
• Data Pedigree for dogs
Provenance in Art
Given a piece of art
• How do we know . . .
  • if it is authentic?
  • who created it?
  • if it has been altered?
Example
Jan van Eyck - Arnolfini Portrait
Provenance
• French provenir, "to come from"
• Chronology of the ownership or location of an historical object
Example
Jan van Eyck - Arnolfini Portrait
• 1434 - Painting dated by van Eyck; presumably owned by the sitters.
• before 1516 - In possession of Don Diego de Guevara (d. Brussels 1520), a Spanish career courtier of the Habsburgs (himself the subject of a fine portrait by Michael Sittow in the National Gallery of Art). He lived most of his life in the Netherlands, and may have known the Arnolfinis in their later years. By 1516 he had given the portrait to Margaret of Austria, Habsburg Regent of the Netherlands.
• 1516 - Painting is the first item in an inventory of Margaret's paintings, made in her presence at Mechelen. The item says (in French): "a large picture which is called Hernoul le Fin with his wife in a chamber, which was given to Madame by Don Diego, whose arms are on the cover of the said picture; done by the painter Johannes." A note in the margin says "It is necessary to put on a lock to close it: which Madame has ordered to be done."
• 1523-4 - In another Mechelen inventory, a similar description, this time the name of the subject is given as "Arnoult Fin".
• 1558 - In 1530 the painting was inherited by Margaret's niece Mary of Hungary, who in 1556 went to live in Spain. It is clearly described in an inventory taken after her death in 1558, when it was inherited by Philip II of Spain. A painting of two of his young daughters commissioned by Philip clearly copies the pose of the figures (Prado).[1]
• 1599 - A German visitor saw it in the Alcazar Palace in Madrid. Now it had verses from Ovid painted on the frame: "See that you promise: what harm is there in promises? In promises anyone can be rich." It is very likely that Velazquez knew the painting, which may have influenced his Las Meninas, which shows a room in the same palace.
• 1700 - In an inventory after the death of Carlos II it was still in the palace, with shutters and the verses from Ovid.
• 1794 - Now in the Palacio Nuevo in Madrid.
• 1816 - The painting is now in London, in the possession of Colonel James Hay, a Scottish soldier. He claimed that after being seriously wounded at the Battle of Waterloo the previous year, the painting hung in the room where he convalesced in Brussels. He fell in love with it, and persuaded the owner to sell. More relevant to the real facts is no doubt Hay's presence at the Battle of Vitoria (1813) in Spain, where a large coach loaded by King Joseph Bonaparte with easily portable artworks from the Spanish royal collections was first plundered by British troops, before what was left was recovered by their commanders and returned to the Spanish. Hay offered the painting to the Prince Regent, later George IV of England, via Sir Thomas Lawrence. The Prince had it on approval for two years at Carlton House before eventually returning it in 1818.
Provenance in Data Processing
Given a piece of data
• How do we know . . .
  • which data it is derived from?
  • which transformations (SQL) were used to create it?
  • who created it?
  • . . .
Example
Compute the revenue for each shop as the sum of prices of items sold.

result
     shop    rev
t1   Migros  125
t2   Coop    25

↑

SELECT shop,
       sum(price) AS rev
FROM sales, items
WHERE itemId = id
GROUP BY shop

↑

sales
     shop    itemId
s1   Migros  1
s2   Migros  3
s3   Coop    3

items
     id  price
i1   1   100
i2   2   10
i3   3   25
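To make "which data is it derived from?" concrete, here is a minimal sketch that loads the two example tables into an in-memory SQLite database, runs the slide's query, and then lists the input tuple pairs behind each result row. The `tid` columns (holding the labels s1, i1, ...) and the second "witness" query are assumptions added for illustration; they are one simple way to expose contributing tuples, not a specific provenance model from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- tid stores the tuple labels used on the slide (an added helper column)
    CREATE TABLE sales (tid TEXT, shop TEXT, itemId INTEGER);
    CREATE TABLE items (tid TEXT, id INTEGER, price INTEGER);
    INSERT INTO sales VALUES ('s1', 'Migros', 1), ('s2', 'Migros', 3), ('s3', 'Coop', 3);
    INSERT INTO items VALUES ('i1', 1, 100), ('i2', 2, 10), ('i3', 3, 25);
""")

# The query from the slide (ORDER BY added only to make the output deterministic).
result = conn.execute("""
    SELECT shop, sum(price) AS rev
    FROM sales, items
    WHERE itemId = id
    GROUP BY shop
    ORDER BY shop
""").fetchall()

# Same join without the aggregation: each row names one pair of input
# tuples that contributed to the group of its shop.
witnesses = conn.execute("""
    SELECT s.shop, s.tid, i.tid
    FROM sales s, items i
    WHERE s.itemId = i.id
    ORDER BY s.shop, s.tid
""").fetchall()

provenance = {}
for shop, sales_tid, items_tid in witnesses:
    provenance.setdefault(shop, []).append((sales_tid, items_tid))

print(result)
print(provenance)
```

Under this sketch the Migros row is derived from the pairs (s1, i1) and (s2, i3), and item i2 contributes to no result at all.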
Definition (Data Provenance)
Information about the origin and creation process of data.
A more Complex Example
Scenario
• You are an analyst for a garden supply shop
• You have to compute the first quarter revenue for each shop location
• Data warehouse with sales data
• Use SQL to compute the required information from the warehouse
Example (Input Data)

Employee
SSN  Name             WorksFor
123  Peter Peterson   New York
342  Jane Janeson     New York
555  Heinz Heinzmann  Wuppertal

Shop
Location   Budget
New York   1.000.000
Wuppertal  4.000

Item
Id  Description  Price
1   Lawnmower    199
2   Fertilizer   32
3   Rake         9

Sales
Employee  Item  Amount  Month
123       1     1       1
342       2     64      1
342       3     2       3
555       3     1       5
Example (SalesTotal Query)
CREATE VIEW SalesTotal AS
SELECT Location AS Shop, Month, SSN AS Employee,
       Price * Amount AS Totalprice
FROM Employee E, Shop H, Item I, Sales S
WHERE E.WorksFor = H.Location
  AND E.SSN = S.Employee
  AND I.Id = S.Item

Example (Results)
SalesTotal
Shop       Month  Employee  Totalprice
New York   1      123       199
New York   1      342       2048
New York   3      342       18
Wuppertal  5      555       9
Example (MonthlyRevenue Query)
CREATE VIEW MonthlyRevenue AS
SELECT Shop, Month, sum(Totalprice) AS Revenue
FROM SalesTotal
GROUP BY Shop, Month

Example (Results)
MonthlyRevenue
Shop       Month  Revenue
New York   1      2247
New York   3      18
Wuppertal  5      9
Example (RevenueFirstQ Query)
CREATE VIEW RevenueFirstQ AS
SELECT Shop, sum(Revenue) AS Revenue
FROM MonthlyRevenue
WHERE Month <= 3
GROUP BY Shop

Example (Results)
RevenueFirstQ
Shop      Revenue
New York  2265
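The three views can be checked end to end with a small script. The sketch below loads the example input data into an in-memory SQLite database and stacks the views on top of each other. Two things are assumptions here: the budgets are stored as plain integers, and the first quarter is taken to mean months 1 to 3 (`Month <= 3`). SQLite is only a convenient stand-in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (SSN INTEGER, Name TEXT, WorksFor TEXT);
    CREATE TABLE Shop (Location TEXT, Budget INTEGER);
    CREATE TABLE Item (Id INTEGER, Description TEXT, Price INTEGER);
    CREATE TABLE Sales (Employee INTEGER, Item INTEGER, Amount INTEGER, Month INTEGER);

    INSERT INTO Employee VALUES
        (123, 'Peter Peterson', 'New York'),
        (342, 'Jane Janeson', 'New York'),
        (555, 'Heinz Heinzmann', 'Wuppertal');
    INSERT INTO Shop VALUES ('New York', 1000000), ('Wuppertal', 4000);
    INSERT INTO Item VALUES (1, 'Lawnmower', 199), (2, 'Fertilizer', 32), (3, 'Rake', 9);
    INSERT INTO Sales VALUES (123, 1, 1, 1), (342, 2, 64, 1), (342, 3, 2, 3), (555, 3, 1, 5);

    -- One row per sale, priced and attributed to a shop.
    CREATE VIEW SalesTotal AS
    SELECT Location AS Shop, Month, SSN AS Employee, Price * Amount AS Totalprice
    FROM Employee E, Shop H, Item I, Sales S
    WHERE E.WorksFor = H.Location AND E.SSN = S.Employee AND I.Id = S.Item;

    -- Revenue per shop and month.
    CREATE VIEW MonthlyRevenue AS
    SELECT Shop, Month, sum(Totalprice) AS Revenue
    FROM SalesTotal
    GROUP BY Shop, Month;

    -- Revenue per shop over months 1 to 3 (assumed meaning of "first quarter").
    CREATE VIEW RevenueFirstQ AS
    SELECT Shop, sum(Revenue) AS Revenue
    FROM MonthlyRevenue
    WHERE Month <= 3
    GROUP BY Shop;
""")

monthly = conn.execute(
    "SELECT Shop, Month, Revenue FROM MonthlyRevenue ORDER BY Shop, Month"
).fetchall()
first_q = conn.execute(
    "SELECT Shop, Revenue FROM RevenueFirstQ ORDER BY Shop"
).fetchall()

print(monthly)
print(first_q)
```

Wuppertal has no row in the final result because its only sale falls in month 5.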
[Figure: Compute First Quarter Revenue. The base tables Employee, Shop, Item, and Sales feed the view SalesTotal, which feeds MonthlyRevenue, which feeds RevenueFirstQ.]
Example Data
Example
SalesTotal

Employee
SSN  Name             WorksFor
123  Peter Peterson   New York
342  Jane Janeson     New York
555  Heinz Heinzmann  Wuppertal

EmployeeShop
SSN  Name             WorksFor   Location   Budget
123  Peter Peterson   New York   New York   1.000.000
342  Jane Janeson     New York   New York   1.000.000
555  Heinz Heinzmann  Wuppertal  Wuppertal  4.000
Tracing an Error
Problem
• One result tuple of your query looks suspicious
• You expect the input data to be the culprit
• How do we know which input data affected which output data?
This is Data Provenance
• But how to get at the data provenance?
  • Manually?
  • Not reasonable for big data or complex queries!
• Need a system that tracks it automatically!
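As a rough illustration of what tracing by hand involves, the sketch below takes the garden supply example, treats the New York first-quarter revenue as the suspicious result, and re-runs the joins with the grouping removed to list the input rows behind it. The function name `trace_first_quarter` is hypothetical, integer budgets and the `Month <= 3` cutoff are assumptions, and a provenance-enabled system would derive such a query automatically rather than have the analyst write it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (SSN INTEGER, Name TEXT, WorksFor TEXT);
    CREATE TABLE Shop (Location TEXT, Budget INTEGER);
    CREATE TABLE Item (Id INTEGER, Description TEXT, Price INTEGER);
    CREATE TABLE Sales (Employee INTEGER, Item INTEGER, Amount INTEGER, Month INTEGER);

    INSERT INTO Employee VALUES
        (123, 'Peter Peterson', 'New York'),
        (342, 'Jane Janeson', 'New York'),
        (555, 'Heinz Heinzmann', 'Wuppertal');
    INSERT INTO Shop VALUES ('New York', 1000000), ('Wuppertal', 4000);
    INSERT INTO Item VALUES (1, 'Lawnmower', 199), (2, 'Fertilizer', 32), (3, 'Rake', 9);
    INSERT INTO Sales VALUES (123, 1, 1, 1), (342, 2, 64, 1), (342, 3, 2, 3), (555, 3, 1, 5);
""")


def trace_first_quarter(conn, shop):
    """Return the input rows that feed the first-quarter revenue of one shop.

    Hand-written trace: the same joins and filters as the view stack,
    but without aggregation, so every contributing sale stays visible.
    """
    return conn.execute(
        """
        SELECT E.SSN, E.Name, I.Description, I.Price, S.Amount, S.Month
        FROM Employee E, Shop H, Item I, Sales S
        WHERE E.WorksFor = H.Location
          AND E.SSN = S.Employee
          AND I.Id = S.Item
          AND S.Month <= 3
          AND H.Location = ?
        ORDER BY S.Month, E.SSN
        """,
        (shop,),
    ).fetchall()


contributions = trace_first_quarter(conn, "New York")
for ssn, name, item, price, amount, month in contributions:
    print(f"month {month}: {name} ({ssn}) sold {amount} x {item} = {price * amount}")
```

Even for three stacked views the trace query has to be rewritten by hand for every result of interest, which is the motivation for automatic tracking.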
Why should I care?
Use Cases
• Debugging (tracking the sources of errors)
• Propagating annotations
• Gain deeper understanding of data and transformations
  • Estimate quality, trust
• Improvement of other data processing technologies
  • Probabilistic databases
  • Deletion propagation
  • Testing
Application Domains
• Complex database queries, e.g., data warehousing
• E-science and curated databases
• Data integration/exchange
• Workflow systems
• ⇒ Application domains with complex, multi-stage data processing
  • Map-Reduce style processing and its "frontends" like Pig
  • Simulations
  • . . .
Debugging
Example
SalesTotal
Annotation Propagation
Example
EnzymeProduce
Enzyme       Gene
EC 1.1.1.1   ALB   ??
EC 1.97.1.6  ALB   ??

Gene
Id        Name
4q11-q13  ALB    {a4}
18q21.3   BCL2   {}

Enzyme
Enzyme       Weight  ProducedBy
EC 1.1.1.1   45      4q11-q13    {a1, a2}
EC 1.97.1.6  12      4q11-q13    {a2, a3}

CREATE VIEW EnzymeProduce AS
SELECT Enzyme, Name AS Gene
FROM Gene G, Enzyme E
WHERE G.Id = E.ProducedBy

Annotations
a1  Necessary for red blood cells
a2  Produced in liver
a3  Unhealthy
a4  Discovered by Edmond Hillary
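The slide leaves the annotations of the view rows open ("??"). One possible propagation rule, shown here purely as an assumption, is to give each result row the union of the annotations of the input rows it was joined from. The sketch below applies that rule in plain Python; the data layout and the function name `enzyme_produce` are hypothetical, and other rules (for example, propagating only from the columns that were copied) are equally conceivable.

```python
# Each relation is a list of (tuple, annotation set) pairs, as on the slide.
gene = [
    (("4q11-q13", "ALB"), {"a4"}),
    (("18q21.3", "BCL2"), set()),
]
enzyme = [
    (("EC 1.1.1.1", 45, "4q11-q13"), {"a1", "a2"}),
    (("EC 1.97.1.6", 12, "4q11-q13"), {"a2", "a3"}),
]


def enzyme_produce(gene, enzyme):
    """Evaluate the EnzymeProduce join and carry annotations along.

    Assumed rule: a result row inherits the union of the annotations
    of the Gene row and the Enzyme row that were joined to produce it.
    """
    result = []
    for (enzyme_name, _weight, produced_by), enzyme_ann in enzyme:
        for (gene_id, gene_name), gene_ann in gene:
            if gene_id == produced_by:
                result.append(((enzyme_name, gene_name), enzyme_ann | gene_ann))
    return result


annotated = enzyme_produce(gene, enzyme)
for row, annotations in annotated:
    print(row, sorted(annotations))
```

Under this rule both result rows pick up a4 from the ALB gene row, in addition to the annotations of their own enzyme row.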
Course Topics
We will study...
• Several models of database provenance
• Approaches for automatically tracking provenance
• Query languages and storage mechanisms for provenance
• Real systems that generate provenance data
• Outlook into other research areas that use provenance
Prerequisites
• Some database background (mainly query languages)
  • SQL
  • Relational algebra
  • Datalog
• Ideally you have taken one of the following courses:
  • CS 425
  • CS 520
  • CS 525
Course Material
• No text book is required
• Research papers that are available online
Course Organization
Course will consist of ...
• Lectures
• Research paper reviews (oral presentations done by students)
• A major project
  • Implementation or extension of a real system
  • Written report
  • Oral presentation
Paper Reviews
Papers
• The list of papers will be on the course website soon
• Topics cover the content of the course and extend it
Process
• Students will have to pick papers to review
  • By end of August, first week of September
• Some of the classes will be used for oral presentations on the topics
• Each presentation will be followed by a discussion
• Short written reviews (4-6 pages) will be due by Nov 1st
Course Project
• Implementation or extension of a provenance system
• Topics:
  • Adding new provenance types to existing systems
    • Add causality-based provenance to Perm
    • Add Why-provenance to Perm
    • Extend the How-provenance implementation of Perm to derive new information
  • Build a visualization tool for provenance in Perm
  • . . .
  • Your topic
Course Project cont.
Organisation
• Choose a project by Sep 15th
  • Meetings to discuss the projects with me
• Project implementation
  • Progress reports during semester
• Written report: Due by Nov 15th
• Oral presentations: Classes at the end of the semester
Stuff to get familiar with
• Get practical experience with a database system
  • PostgreSQL: open-source, quite standard-compliant, Perm is based on it
  • Use Perm, which is a provenance-enabled version of Postgres
• Query languages:
  • SQL
  • Relational Algebra
  • Datalog
• Programming skills: C, C++, Java
• *nix operating system or Cygwin for Windows users
Detailed Course Outline
• Introduction to Data Provenance
  • What is Data Provenance?
  • Why do we need it?
  • Understanding different types of data provenance
• Database Provenance
  • Provenance Models and Systems
    • Why-provenance
    • Where-provenance and the DBNotes system
    • Lineage and the WHIPS prototype
    • Witness-list semantics and Perm
    • Provenance semirings and Orchestra
    • Causality and Responsibility models
  • Storage mechanisms
  • Query languages
Detailed Course Outline cont.
• Extensions of the Provenance concept
  • Provenance for missing answers
  • Provenance for past queries
  • Provenance for updates
• Beyond Database Provenance
  • Scientific workflows
  • Provenance in the operating system context
  • Connection with Dataflow analysis in programming languages
Grading Policies
• Course Project (Implementation, Report, Presentation): 60%
• Paper reviews: written review (15%) and oral presentation (15%)
• Participation in the paper discussions: (10%)
Questions?
Further Information
• Office hours: Thursday, 1:00 pm - 2:00 pm, room 226C
• Webpage: www.cs.iit.edu/~glavic
• Course webpage will be linked there soon.