Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

Preview:

DESCRIPTION

UNCERTAINTY. LINEAGE. DATA. Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS. Martin Theobald Jennifer Widom Stanford University. Outline. Motivation and Applications Working Model for ULDBs Trio-One System Architecture Efficient Confidence Computations. Motivation. - PowerPoint PPT Presentation

Citation preview

Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

Martin Theobald

Jennifer Widom

Stanford University

2

Outline

• Motivation and Applications

• Working Model for ULDBs

• Trio-One System Architecture

• Efficient Confidence Computations

3

Motivation

• Many applications involve data that is uncertain

(approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...)

• Many of the same applications need to track the lineage of their data

Neither uncertainty nor lineage are supported by conventional Database Management Systems (DBMSs)

Coincidence or Fate?Coincidence or Fate?

4

Sample Applications

Deduplication / Entity Resolution• Uncertainty: Match and merge• Lineage: Source records

Information extraction• Uncertainty: Extracted labels and values• Lineage: Original context

Information integration• Uncertainty: Inconsistent information• Lineage: Original sources

5

Sample Applications

Scientific experiments• Uncertainty: Captured (and derived) data

• Lineage: Layers of views

Sensor data• Uncertainty: Sensor values, aggregations,

missing readings

• Lineage: Original readings, views

6

Claim

The connection between uncertainty and lineage goes deeper than just a shared need by several applications

Coincidence or Fate?Coincidence or Fate?

7

Lineage and Uncertainty

Lineage...• Enables simple and consistent representation of

uncertain data

• Allows for intuitive and complete working models

• Correlates uncertainty in query results with uncertainty in the input data

• Can make confidence computations over uncertain data more efficient

Applications use lineage to reduce or resolve uncertainty

8

Goal

A new kind of DBMS in which:

1. Data2. Uncertainty3. Lineage

are all first-class interrelated concepts

Trio Trio }

9

The Trio Trio

1. Data Model Simplest extension to relational model that’s

sufficiently expressive

2. Query Language Simple extension to SQL with well-defined

semantics and intuitive behavior

3. System A complete open-source DBMS that people

want to use

10

The Present

1. Data Model Uncertainty-Lineage Databases (ULDBs)

2. Query Language TriQL

3. System Trio-One: Layered prototype deployable

on top of any standard relational DBMS

11

Running Example: Crime-Solving

Saw(witness,car) // may be uncertain

Drives(person,car) // may be uncertain

Suspects(person) = πperson(Saw ⋈car Drives)

12

Data Model: Uncertainty

An uncertain database represents a set ofpossible instances of data values

• Amy saw either a Honda or a Toyota

• Jimmy drives a Toyota, a Mazda, or both

• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3

• Hank is a suspect with confidence 0.7

13

Our Model for Uncertainty

1. Alternatives (mutually exclusive)

2. ‘?’ (Maybe) Annotations

3. Confidences

14

Our Model for Uncertainty

1. Alternatives: uncertainty about data value(s)

2. ‘?’ (Maybe) Annotations

3. ConfidencesSaw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

witness car

Amy { Honda, Toyota, Mazda }

=

Three possibleinstances

15

Six possibleinstances

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about presence

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura)?

16

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances, each with a probability

17

Models for Uncertainty

• Our model (so far) is not especially new

• We spent some time exploring the space of models for uncertainty

• Tension between understandability and expressiveness– Our model is understandable

– But it is not complete, or even closed under common operations

18

Closure and Completeness

Completeness Can represent any finite set of possible

instances

Closure Can represent any result of relational

operators

Note: Completeness Closure

19

Models for Uncertainty

• A-Tuples – Uncertainty on attribute values• (Amy, {Toyota, Mazda})• Not closed

• X-Tuples – Uncertainty on entire tuples• { (Amy, Toyota), (Amy, Mazda) }

• Still not closed

• C-Tables – Tuples + Boolean constraints• { (Amy, Toyota, X=1),

(Amy, Mazda, X<>1),

(Cathy, Honda, X<>1) }• Closed and complete!• Understandable?

20

Our Model is Not Closed

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda) ∥ (Frank, Honda)

(Hank, Honda)

Suspects

Jimmy

Billy ∥ Frank

Hank

Suspects = πperson(Saw ⋈car Drives)

???

Does not correctlycapture possibleinstances in theresult

CANNOT

21

to the Rescue

Lineage (provenance): “where data came from”• Internal lineage – pointers to base data

• External lineage – files, URLs, web portals, etc.

In Trio: A Boolean function λ from alternatives to other alternatives (or external sources)

Lineage

22

Example with Lineage

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

???

Suspects = πperson(Saw ⋈car Drives)

λ(31) = (11,2),(21,2)λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)

λ(33) = (11,1), 23

Correctly captures possible instances inthe result

23

Trio Data Model

1. Alternatives (mutually exclusive)

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete

Uncertainty-Lineage Databases (ULDBs)Uncertainty-Lineage Databases (ULDBs)

24

ULDBs: Lineage

Conjunctive lineage sufficient for most operations

• Disjunctive lineage for duplicate-elimination

• Negative lineage for difference

25

ULDBs: Minimality

A ULDB relation R represents a set of possible

instancesDoes every tuple in R appear in some

possible instance? (no extraneous tuples)

Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s)

Also

Data-minimalityData-minimality

Lineage-minimalityLineage-minimality

26

Data Minimality Examples

Extraneous ‘?’

. . .

10

(Billy, Honda) ∥ (Frank, Honda)

. . .

. . .

40

Billy ∥ Frank

. . .

?λ(40,1)=(10,1); λ(40,2)=(10,2)

extraneous

. . .

20

(Billy, Honda)

. . .

?. . .

30

(Frank, Honda)

. . .

?

27

Data Minimality Examples

Extraneous tuple

(Diane, Mazda) ∥ (Diane, Acura)

Dianeextraneous

(Diane, Mazda)

(Diane, Acura)

?

??

28

ULDBs: Membership Questions

Does a given tuple t appear in some (all) possible instance(s) of R ?

Is a given table T one of (all of) the possible instances of R ?

Polynomial algorithms based on data-minimizationPolynomial algorithms based on data-minimization

NP-HardNP-Hard

29

Non-Theorists...

Wake Back Up!

30

Querying ULDBs

• Simple extension to SQL

• Formal semantics, intuitive meaning

• Query uncertainty, confidences, and lineage

TriQLTriQL

31

Initial TriQL Example

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)SELECT Drives.person INTO SuspectsFROM Saw, DrivesWHERE Saw.car = Drives.car

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

???

λ(31) = (11,2),(21,2)λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2)

λ(33) = (11,1), 23

32

Formal Semantics

Query Q on ULDB D

DD

D1, D2, …, DnD1, D2, …, Dn

possibleinstances

Q on eachinstance

representationof instances

Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)

D’D’implementation of Q

operational semanticsD + ResultD + Result

33

Operational Semantics

Over conventional relational database:

For each tuple in cross-product of X1, X2, ..., Xn

1. Evaluate the predicate

2. If true, project attr-list to create result tuple

3. If INTO clause, insert into table

SELECT attr-list [ INTO table ]FROM X1, X2, ..., Xn

WHERE predicate

34

Operational Semantics

Over ULDB:For each tuple in cross-product of X1, X2, ..., Xn

1. Create “super tuple” T from all combinations of alternatives

2. Evaluate predicate on each alternative in T ; keep only the true ones

3. Project attr-list on each alternative to create result tuple

4. Details: ‘?’, lineage, confidences

SELECT attr-list [ INTO table ]FROM X1, X2, ..., Xn

WHERE predicate

35

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

36

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

37

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

38

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

39

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

40

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

41

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

42

Operational Semantics: Example

SELECT Drives.person INTO SuspectsFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

Suspects

Jim ∥ Bill

Hank

??

λ( ) = ...λ( ) = ...

43

Confidences

Confidences supplied with base data

Trio computes confidences on query results• Default probabilistic interpretation• Can choose to plug in different arithmetic

Saw (witness,car)

(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person,car)

(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6

(Hank, Honda) ): 1.0

Suspects

Jim: 0.12 ∥ Bill: 0.24

Hank: 0.6

?

?

?

0.3 0.4

0.6

ProbabilisticProbabilistic Min

44

TriQL: Querying Confidences

Built-in function: Conf()

SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) >

0.8

SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(*) > 0.4

45

TriQL: Querying Lineage

Built-in join predicate: Lineage()

SELECT Saw.witness INTO AccusesHank FROM Suspects, Saw WHERE Lineage(Suspects,Saw) AND Suspects.person = ‘Hank’

Also works for transitive lineage:

SELECT witness FROM AccusesHank WHERE Lineage(Saw,AccusesHank)

46

Additional Query Constructs

• “Horizontal subqueries”Refer to tuple alternatives as a relation

• Unmerged (allow horizontal duplicates)

• Flatten, GroupAlts

• NoLineage, NoConf, NoMaybe

• Query-computed confidences

• SQL-like data modification statements

47

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

List suspects with conf values based on accuser credibility

48

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

49

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

50

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

“Scaled as conf”

“Uniform as conf”

“Scaled as conf”

“Uniform as conf”

51

The Trio System

Version 1Entirely on top of a conventional DBMS

Surprisingly easy and complete, reasonably efficient

52

The Trio System: Version 1

Lin:RaidF

r tableaidT

o

R aid xid C

Relational DBMS

create trio table T(A,B)

select C into R ...

TrioMetadat

a

Trio APITrio API

SQL commands

• Result cursors• Traverse lineage

T aid xid A B

53

Relation Encoding

aid xid

conf

person

car

21 1 2 Jim Mazda

22 1 2 Bill Mazda

23 2 1 Hank Honda

Suspects

Jim ∥ Bill

Hank

aid xid conf person

31 1 3 Jim

32 1 3 Bill

33 2 2 Hank

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

aid xid conf witness

car

11 1 2 Cathy Honda

12 1 2 Cathy Mazda

??

aidFr

table aidTo

31 Saw 12

31 Drives 21

32 Saw 12

32 Drives 22

33 Saw 11

33 Drives 23

SawDrives

SuspectsLin:Suspects

54

Relation Encoding with Confidences

aid xid

conf

person car

21 1 0.3 Jim Mazda

22 1 0.6 Bill Mazda

23 2 1.0 Hank Honda

aid xid conf person

31 1 NULL Jim

32 1 NULL Bill

33 2 NULL Hank

Saw (witness,car)

(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person,car)

(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6

(Hank, Honda)

aid xid conf witness

car

11 1 0.6 Cathy Honda

12 1 0.4 Cathy Mazda

SawDrives

Suspects

0.12

0.24

0.6

55

Query Translation

Persistent results: Query Q into result table R

1. Run query Q’ to produce “super-result” R Q’ ≈ Q but adds aid/xid’s of source tuples, joins lineage tables for lineage() predicates

2. Group R into alternatives, generate new xid’s3. Materialze lineage data into lin:R

4. Compute confidences? (optional)5. Add/update metadata: schemas, confidence info, lineage structure

Transient results: stop at 2, return cursor over new xid’s

56

The Trio System: Version 1

R aid xid C

Relational DBMS

create trio table T(A,B)

select C into R ...

TrioMetadat

a

Trio APITrio API

SQL commands

• Result cursors• Traverse lineage

T aid xid A B

Lin:RaidF

r tableaidT

o

57

The Trio System: Version 1

Trio API and translator(TriQL parser in Python)

Trio API and translator(TriQL parser in Python)

TrioMetadat

a

TrioExplorer(GUI client)

TrioExplorer(GUI client)

Trio Stored

Procedures

EncodedData

TablesLineageTables

Standard SQL

• Data & system columns• A-IDs for alternatives• “Verticalized” through shared X-IDs• Confidences

• One lineage table per derived table• Connecting A-IDs

• Table types• Schema-level lineage structure

• conf()• lineage()

• TriQL queries• DDL commands• Schema browsing• Table browsing• Explore lineage• On-demand confidence computation

Command-lineclient

Command-lineclient

Relational DBMS

SPI-Plugin for

PostgreSQLReplacing

“Fetch-All”

SPI-Plugin for

PostgreSQLReplacing

“Fetch-All”

58

The Trio System: Version 2

Trio API and translatorTrio API and translator

TrioExplorer(GUI client)

TrioExplorer(GUI client)Command-line

client

Command-lineclient

Trio DBMS

SpecializedTrio Data Structures

SpecializedTrio Processing & Optimizing

Standard SQL Direct function calls

59

Efficient Confidence Computation

Previous approach (probabilistic databases)• Each operator computes confidences during

query execution• Only certain (“safe”) query plans allowed

In ULDBs Confidence of alternative A is function of

confidences in λ(A)

Our approach• Decoupled data and confidence computation

• Allow any query plan for data computation

• Compute confidences on-demand based on lineage

60

CIDs & Confidence Memoization

Identify set of Closest Independent Descendants (CIDs) for each relation in the schema-level lineage DAG

relations with no shared ancestors

Batched confidence computations in SQL• Select unions of distinct base alternatives from

all root-to-leaf lineage paths• With CIDs and memoization enabled

– if confidences are memoized then stop at CIDs

– else traverse down to the leafs (base data) and memoize at CIDs

(ongoing work)

61

“Probabilistic TPC-H” Setup

• Horizontal split of TPC-H tables

• Uniform-random number of alternatives [1-10]

• Uniform confidences

PartsuppsPartsuppsLineitemsLineitemsOrdersOrders

O2O2O1

O1

OO LL PP

OLOL LPLP

OLPOLP1.5M

1.0M 0.1M

1.0M 0.5M 0.7M

O3O3 L2

L2L1L1 L3

L3 L4L4 P2

P2P1P1 P3

P3

OO LL PP

CID(OLP) = {O, L, P}

0.8M6M1.5M

62

Confidence Computation Times

0

500

1,000

1,500

2,000

2,500

3,000

3,500

4,000

4,500

5,000

Exe

cuti

on

Tim

e (s

eco

nd

s)

O L P OL LP OLP Total

No CIDs

CIDs

63

Per-Tuple vs. Batching (no CIDs)

64

Memoization & Join Multiplicity

65

Current Topics

System• Full query language• Nice interface• Performance experiments• Demo applications

Algorithms: confidence & lineage computation,

extraneous data, membership questions• Minimize lineage traversal

• Confidence memoization

• Efficient batch computations

66

Future Directions

Theory, Model, Algorithms• Unlimited opportunities

System• Storage, indexing, partitioning• Statistics and query optimization

More features• Top-k by confidence • Incomplete relations; continuous uncertainty;

correlated uncertainty• External lineage; update lineage; versioning

Search “stanford trio”

Current Trio team:Parag Agrawal, Anish Das Sarma, Raghotham

Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara,

Martin Theobald, Jennifer Widom

Special thanks to:Ashok Chandra, Alon Halevy, Jeff Ullman,

Omar Benjelloun

but don’t forgetthe lineage…