67
Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University

Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

Embed Size (px)

DESCRIPTION

UNCERTAINTY. LINEAGE. DATA. Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS. Martin Theobald Jennifer Widom Stanford University. Outline. Motivation and Applications Working Model for ULDBs Trio-One System Architecture Efficient Confidence Computations. Motivation. - PowerPoint PPT Presentation

Citation preview

Page 1: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

Martin Theobald

Jennifer Widom

Stanford University

Page 2: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

2

Outline

• Motivation and Applications

• Working Model for ULDBs

• Trio-One System Architecture

• Efficient Confidence Computations

Page 3: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

3

Motivation

• Many applications involve data that is uncertain

(approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...)

• Many of the same applications need to track the lineage of their data

Neither uncertainty nor lineage are supported by conventional Database Management Systems (DBMSs)

Coincidence or Fate?Coincidence or Fate?

Page 4: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

4

Sample Applications

Deduplication / Entity Resolution• Uncertainty: Match and merge• Lineage: Source records

Information extraction• Uncertainty: Extracted labels and values• Lineage: Original context

Information integration• Uncertainty: Inconsistent information• Lineage: Original sources

Page 5: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

5

Sample Applications

Scientific experiments• Uncertainty: Captured (and derived) data

• Lineage: Layers of views

Sensor data• Uncertainty: Sensor values, aggregations,

missing readings

• Lineage: Original readings, views

Page 6: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

6

Claim

The connection between uncertainty and lineage goes deeper than just a shared need by several applications

Coincidence or Fate?Coincidence or Fate?

Page 7: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

7

Lineage and Uncertainty

Lineage...• Enables simple and consistent representation of

uncertain data

• Allows for intuitive and complete working models

• Correlates uncertainty in query results with uncertainty in the input data

• Can make confidence computations over uncertain data more efficient

Applications use lineage to reduce or resolve uncertainty

Page 8: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

8

Goal

A new kind of DBMS in which:

1. Data2. Uncertainty3. Lineage

are all first-class interrelated concepts

Trio Trio }

Page 9: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

9

The Trio Trio

1. Data Model Simplest extension to relational model that’s

sufficiently expressive

2. Query Language Simple extension to SQL with well-defined

semantics and intuitive behavior

3. System A complete open-source DBMS that people

want to use

Page 10: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

10

The Present

1. Data Model Uncertainty-Lineage Databases (ULDBs)

2. Query Language TriQL

3. System Trio-One: Layered prototype deployable

on top of any standard relational DBMS

Page 11: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

11

Running Example: Crime-Solving

Saw(witness,car) // may be uncertain

Drives(person,car) // may be uncertain

Suspects(person) = πperson(Saw ⋈car Drives)

Page 12: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

12

Data Model: Uncertainty

An uncertain database represents a set ofpossible instances of data values

• Amy saw either a Honda or a Toyota

• Jimmy drives a Toyota, a Mazda, or both

• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3

• Hank is a suspect with confidence 0.7

Page 13: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

13

Our Model for Uncertainty

1. Alternatives (mutually exclusive)

2. ‘?’ (Maybe) Annotations

3. Confidences

Page 14: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

14

Our Model for Uncertainty

1. Alternatives: uncertainty about data value(s)

2. ‘?’ (Maybe) Annotations

3. ConfidencesSaw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

witness car

Amy { Honda, Toyota, Mazda }

=

Three possibleinstances

Page 15: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

15

Six possibleinstances

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about presence

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura)?

Page 16: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

16

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances, each with a probability

Page 17: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

17

Models for Uncertainty

• Our model (so far) is not especially new

• We spent some time exploring the space of models for uncertainty

• Tension between understandability and expressiveness– Our model is understandable

– But it is not complete, or even closed under common operations

Page 18: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

18

Closure and Completeness

Completeness Can represent any finite set of possible

instances

Closure Can represent any result of relational

operators

Note: Completeness Closure

Page 19: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

19

Models for Uncertainty

• A-Tuples – Uncertainty on attribute values• (Amy, {Toyota, Mazda})• Not closed

• X-Tuples – Uncertainty on entire tuples• { (Amy, Toyota), (Amy, Mazda) }

• Still not closed

• C-Tables – Tuples + Boolean constraints• { (Amy, Toyota, X=1),

(Amy, Mazda, X<>1),

(Cathy, Honda, X<>1) }• Closed and complete!• Understandable?

Page 20: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

20

Our Model is Not Closed

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda) ∥ (Frank, Honda)

(Hank, Honda)

Suspects

Jimmy

Billy ∥ Frank

Hank

Suspects = πperson(Saw ⋈car Drives)

???

Does not correctlycapture possibleinstances in theresult

CANNOT

Page 21: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

21

to the Rescue

Lineage (provenance): “where data came from”• Internal lineage – pointers to base data

• External lineage – files, URLs, web portals, etc.

In Trio: A Boolean function λ from alternatives to other alternatives (or external sources)

Lineage

Page 22: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

22

Example with Lineage

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

???

Suspects = πperson(Saw ⋈car Drives)

λ(31) = (11,2),(21,2)λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)

λ(33) = (11,1), 23

Correctly captures possible instances inthe result

Page 23: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

23

Trio Data Model

1. Alternatives (mutually exclusive)

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete

Uncertainty-Lineage Databases (ULDBs)Uncertainty-Lineage Databases (ULDBs)

Page 24: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

24

ULDBs: Lineage

Conjunctive lineage sufficient for most operations

• Disjunctive lineage for duplicate-elimination

• Negative lineage for difference

Page 25: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

25

ULDBs: Minimality

A ULDB relation R represents a set of possible

instancesDoes every tuple in R appear in some

possible instance? (no extraneous tuples)

Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s)

Also

Data-minimalityData-minimality

Lineage-minimalityLineage-minimality

Page 26: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

26

Data Minimality Examples

Extraneous ‘?’

. . .

10

(Billy, Honda) ∥ (Frank, Honda)

. . .

. . .

40

Billy ∥ Frank

. . .

?λ(40,1)=(10,1); λ(40,2)=(10,2)

extraneous

. . .

20

(Billy, Honda)

. . .

?. . .

30

(Frank, Honda)

. . .

?

Page 27: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

27

Data Minimality Examples

Extraneous tuple

(Diane, Mazda) ∥ (Diane, Acura)

Dianeextraneous

(Diane, Mazda)

(Diane, Acura)

?

??

Page 28: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

28

ULDBs: Membership Questions

Does a given tuple t appear in some (all) possible instance(s) of R ?

Is a given table T one of (all of) the possible instances of R ?

Polynomial algorithms based on data-minimizationPolynomial algorithms based on data-minimization

NP-HardNP-Hard

Page 29: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

29

Non-Theorists...

Wake Back Up!

Page 30: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

30

Querying ULDBs

• Simple extension to SQL

• Formal semantics, intuitive meaning

• Query uncertainty, confidences, and lineage

TriQLTriQL

Page 31: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

31

Initial TriQL Example

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)SELECT Drives.person INTO SuspectsFROM Saw, DrivesWHERE Saw.car = Drives.car

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

???

λ(31) = (11,2),(21,2)λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2)

λ(33) = (11,1), 23

Page 32: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

32

Formal Semantics

Query Q on ULDB D

DD

D1, D2, …, DnD1, D2, …, Dn

possibleinstances

Q on eachinstance

representationof instances

Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)

D’D’implementation of Q

operational semanticsD + ResultD + Result

Page 33: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

33

Operational Semantics

Over conventional relational database:

For each tuple in cross-product of X1, X2, ..., Xn

1. Evaluate the predicate

2. If true, project attr-list to create result tuple

3. If INTO clause, insert into table

SELECT attr-list [ INTO table ]FROM X1, X2, ..., Xn

WHERE predicate

Page 34: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

34

Operational Semantics

Over ULDB:For each tuple in cross-product of X1, X2, ..., Xn

1. Create “super tuple” T from all combinations of alternatives

2. Evaluate predicate on each alternative in T ; keep only the true ones

3. Project attr-list on each alternative to create result tuple

4. Details: ‘?’, lineage, confidences

SELECT attr-list [ INTO table ]FROM X1, X2, ..., Xn

WHERE predicate

Page 35: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

35

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

Page 36: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

36

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

Page 37: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

37

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

Page 38: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

38

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

Page 39: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

39

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Page 40: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

40

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Page 41: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

41

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Page 42: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

42

Operational Semantics: Example

SELECT Drives.person INTO SuspectsFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

Suspects

Jim ∥ Bill

Hank

??

λ( ) = ...λ( ) = ...

Page 43: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

43

Confidences

Confidences supplied with base data

Trio computes confidences on query results• Default probabilistic interpretation• Can choose to plug in different arithmetic

Saw (witness,car)

(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person,car)

(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6

(Hank, Honda) ): 1.0

Suspects

Jim: 0.12 ∥ Bill: 0.24

Hank: 0.6

?

?

?

0.3 0.4

0.6

ProbabilisticProbabilistic Min

Page 44: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

44

TriQL: Querying Confidences

Built-in function: Conf()

SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) >

0.8

SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(*) > 0.4

Page 45: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

45

TriQL: Querying Lineage

Built-in join predicate: Lineage()

SELECT Saw.witness INTO AccusesHank FROM Suspects, Saw WHERE Lineage(Suspects,Saw) AND Suspects.person = ‘Hank’

Also works for transitive lineage:

SELECT witness FROM AccusesHank WHERE Lineage(Saw,AccusesHank)

Page 46: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

46

Additional Query Constructs

• “Horizontal subqueries”Refer to tuple alternatives as a relation

• Unmerged (allow horizontal duplicates)

• Flatten, GroupAlts

• NoLineage, NoConf, NoMaybe

• Query-computed confidences

• SQL-like data modification statements

Page 47: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

47

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

List suspects with conf values based on accuser credibility

Page 48: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

48

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

Page 49: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

49

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

Page 50: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

50

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

“Scaled as conf”

“Uniform as conf”

“Scaled as conf”

“Uniform as conf”

Page 51: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

51

The Trio System

Version 1Entirely on top of a conventional DBMS

Surprisingly easy and complete, reasonably efficient

Page 52: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

52

The Trio System: Version 1

Lin:RaidF

r tableaidT

o

R aid xid C

Relational DBMS

create trio table T(A,B)

select C into R ...

TrioMetadat

a

Trio APITrio API

SQL commands

• Result cursors• Traverse lineage

T aid xid A B

Page 53: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

53

Relation Encoding

aid xid

conf

person

car

21 1 2 Jim Mazda

22 1 2 Bill Mazda

23 2 1 Hank Honda

Suspects

Jim ∥ Bill

Hank

aid xid conf person

31 1 3 Jim

32 1 3 Bill

33 2 2 Hank

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

aid xid conf witness

car

11 1 2 Cathy Honda

12 1 2 Cathy Mazda

??

aidFr

table aidTo

31 Saw 12

31 Drives 21

32 Saw 12

32 Drives 22

33 Saw 11

33 Drives 23

SawDrives

SuspectsLin:Suspects

Page 54: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

54

Relation Encoding with Confidences

aid xid

conf

person car

21 1 0.3 Jim Mazda

22 1 0.6 Bill Mazda

23 2 1.0 Hank Honda

aid xid conf person

31 1 NULL Jim

32 1 NULL Bill

33 2 NULL Hank

Saw (witness,car)

(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person,car)

(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6

(Hank, Honda)

aid xid conf witness

car

11 1 0.6 Cathy Honda

12 1 0.4 Cathy Mazda

SawDrives

Suspects

0.12

0.24

0.6

Page 55: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

55

Query Translation

Persistent results: Query Q into result table R

1. Run query Q’ to produce “super-result” R Q’ ≈ Q but adds aid/xid’s of source tuples, joins lineage tables for lineage() predicates

2. Group R into alternatives, generate new xid’s3. Materialze lineage data into lin:R

4. Compute confidences? (optional)5. Add/update metadata: schemas, confidence info, lineage structure

Transient results: stop at 2, return cursor over new xid’s

Page 56: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

56

The Trio System: Version 1

R aid xid C

Relational DBMS

create trio table T(A,B)

select C into R ...

TrioMetadat

a

Trio APITrio API

SQL commands

• Result cursors• Traverse lineage

T aid xid A B

Lin:RaidF

r tableaidT

o

Page 57: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

57

The Trio System: Version 1

Trio API and translator(TriQL parser in Python)

Trio API and translator(TriQL parser in Python)

TrioMetadat

a

TrioExplorer(GUI client)

TrioExplorer(GUI client)

Trio Stored

Procedures

EncodedData

TablesLineageTables

Standard SQL

• Data & system columns• A-IDs for alternatives• “Verticalized” through shared X-IDs• Confidences

• One lineage table per derived table• Connecting A-IDs

• Table types• Schema-level lineage structure

• conf()• lineage()

• TriQL queries• DDL commands• Schema browsing• Table browsing• Explore lineage• On-demand confidence computation

Command-lineclient

Command-lineclient

Relational DBMS

SPI-Plugin for

PostgreSQLReplacing

“Fetch-All”

SPI-Plugin for

PostgreSQLReplacing

“Fetch-All”

Page 58: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

58

The Trio System: Version 2

Trio API and translatorTrio API and translator

TrioExplorer(GUI client)

TrioExplorer(GUI client)Command-line

client

Command-lineclient

Trio DBMS

SpecializedTrio Data Structures

SpecializedTrio Processing & Optimizing

Standard SQL Direct function calls

Page 59: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

59

Efficient Confidence Computation

Previous approach (probabilistic databases)• Each operator computes confidences during

query execution• Only certain (“safe”) query plans allowed

In ULDBs Confidence of alternative A is function of

confidences in λ(A)

Our approach• Decoupled data and confidence computation

• Allow any query plan for data computation

• Compute confidences on-demand based on lineage

Page 60: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

60

CIDs & Confidence Memoization

Identify set of Closest Independent Descendants (CIDs) for each relation in the schema-level lineage DAG

relations with no shared ancestors

Batched confidence computations in SQL• Select unions of distinct base alternatives from

all root-to-leaf lineage paths• With CIDs and memoization enabled

– if confidences are memoized then stop at CIDs

– else traverse down to the leafs (base data) and memoize at CIDs

(ongoing work)

Page 61: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

61

“Probabilistic TPC-H” Setup

• Horizontal split of TPC-H tables

• Uniform-random number of alternatives [1-10]

• Uniform confidences

PartsuppsPartsuppsLineitemsLineitemsOrdersOrders

O2O2O1

O1

OO LL PP

OLOL LPLP

OLPOLP1.5M

1.0M 0.1M

1.0M 0.5M 0.7M

O3O3 L2

L2L1L1 L3

L3 L4L4 P2

P2P1P1 P3

P3

OO LL PP

CID(OLP) = {O, L, P}

0.8M6M1.5M

Page 62: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

62

Confidence Computation Times

0

500

1,000

1,500

2,000

2,500

3,000

3,500

4,000

4,500

5,000

Exe

cuti

on

Tim

e (s

eco

nd

s)

O L P OL LP OLP Total

No CIDs

CIDs

Page 63: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

63

Per-Tuple vs. Batching (no CIDs)

Page 64: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

64

Memoization & Join Multiplicity

Page 65: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

65

Current Topics

System• Full query language• Nice interface• Performance experiments• Demo applications

Algorithms: confidence & lineage computation,

extraneous data, membership questions• Minimize lineage traversal

• Confidence memoization

• Efficient batch computations

Page 66: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

66

Future Directions

Theory, Model, Algorithms• Unlimited opportunities

System• Storage, indexing, partitioning• Statistics and query optimization

More features• Top-k by confidence • Incomplete relations; continuous uncertainty;

correlated uncertainty• External lineage; update lineage; versioning

Page 67: Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

Search “stanford trio”

Current Trio team:Parag Agrawal, Anish Das Sarma, Raghotham

Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara,

Martin Theobald, Jennifer Widom

Special thanks to:Ashok Chandra, Alon Halevy, Jeff Ullman,

Omar Benjelloun

but don’t forgetthe lineage…