Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom et al Stanford University


2

Depiction of Trio Project

3

Some Context

Trio Project: They’re building a new kind of DBMS in which:

1. Data
2. Accuracy
3. Lineage

are all first-class interrelated concepts

4

The “Trio” in Trio

1. Data: Student #123 is majoring in Econ:
   (123, Econ) in Major

2. Uncertainty: Student #123 is majoring in Econ or CS:
   (123, Econ ∥ CS) in Major
   With confidence 60% student #456 is a CS major:
   (456, CS 0.6) in Major

3. Lineage: (456) in HardWorker derived from:
   • (456, CS) in Major
   • “CS is hard” from some web page

5

Depiction

Data

Uncertainty

Lineage (“sourcing”)

6

Original Motivation for the Project

New Application Domains

• Many involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...)

• Many of the same ones need to track the lineage (provenance) of their data

Neither uncertainty nor lineage is supported in current database systems


7

Importance of Lineage

[technical] Lineage:

1. Enables simple and consistent representation of uncertain data

2. Correlates uncertainty in query results with uncertainty in the input data

3. Can make computation over uncertain data more efficient

[fluffy] Applications use lineage to reduce or resolve uncertainty

8

Sample Applications

Information extraction
• Find & label entities in unstructured text
• Often probabilistic

Information integration
• Combine data from multiple sources
• Inconsistencies

Scientific experiments
• Inexact/incomplete data
• Many levels of “derived data products”

9

Sample Applications

Sensor data management
• Approximate readings
• Missing readings
• Levels of data aggregation

Deduplication (“data cleaning”)
• Object linkage, entity resolution
• Often heuristic/probabilistic

Approximate query processing
• Fast but inexact answers

10

The “Usual” DBMS Features

(From first lecture of any Intro to Databases class)

1. Efficient,

2. Convenient,

3. Safe,

4. Multi-User storage of and access to

5. Massive amounts of

6. Persistent data

11

Completeness vs. Closure

[Figure: the representable sets-of-instances (blue) drawn inside all sets-of-instances (yellow), with arrows for operations Op1 and Op2]

Completeness: blue = yellow

Closure: arrow stays in blue

Proposition: An incomplete representation is still interesting if it’s expressive enough and closed under all required operations

12

Operations: Semantics

Easy and natural (re)definition for any standard database operation (call it Op)

[Diagram: D → (possible instances) → I1, I2, …, In → (Op on each instance) → J1, J2, …, Jm → (rep. of instances) → D’; Op’ is the direct implementation from D to D’]

Closure: up-arrow always exists

Note: Completeness ⇒ Closure

13

Incompleteness

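Set of possible instances to represent:

Instance1
person     day
Jennifer   Monday
Mike       Tuesday

Instance2
person     day
Jennifer   Monday

Instance3
person     day
Mike       Tuesday

Attempted representation with ‘?’:

person     day
Jennifer   Monday    ?
Mike       Tuesday   ?

The two ‘?’ annotations generate a 4th instance: the empty relation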
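The point above can be checked with a minimal sketch, assuming each ‘?’ tuple is independently present or absent (the function and variable names are illustrative, not part of Trio):

```python
from itertools import product

def instances_of_maybe_tuples(tuples):
    """Enumerate the instances produced when every tuple carries a '?'."""
    result = []
    for mask in product([True, False], repeat=len(tuples)):
        # Keep exactly the tuples whose mask bit says "present".
        result.append(frozenset(t for t, keep in zip(tuples, mask) if keep))
    return result

maybe_tuples = [("Jennifer", "Monday"), ("Mike", "Tuesday")]
instances = instances_of_maybe_tuples(maybe_tuples)

# The three instances we actually want to represent.
wanted = {
    frozenset(maybe_tuples),
    frozenset(maybe_tuples[:1]),
    frozenset(maybe_tuples[1:]),
}
# Whatever the '?' encoding produces beyond the wanted set.
extra = set(instances) - wanted
```

Here `extra` holds only the empty relation, which is the unwanted 4th instance.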

14

Non-Closure Under Join

person   day
Mike     {Monday, Tuesday}

⋈

day       food
Monday    chicken
Tuesday   fish

Result has two possible instances:

Instance1
person   day       food
Mike     Monday    chicken

Instance2
person   day       food
Mike     Tuesday   fish

Not representable with or-sets and ?
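A small sketch of the join above, assuming the or-set semantics "pick one value per or-set, then run the ordinary join on each resulting instance" (the data layout and helper names are illustrative):

```python
from itertools import product

# Or-set relation: Mike's day is one of {Monday, Tuesday}.
person_day = [("Mike", {"Monday", "Tuesday"})]
day_food = [("Monday", "chicken"), ("Tuesday", "fish")]

def possible_instances(or_relation):
    """Expand an or-set relation into its ordinary (certain) instances."""
    choices = [[(person, day) for day in sorted(days)]
               for person, days in or_relation]
    return [list(combo) for combo in product(*choices)]

def join(left, right):
    """Natural join on the shared 'day' attribute."""
    return {(p, d, f) for p, d in left for d2, f in right if d == d2}

# Apply the join to each possible instance of the input.
results = [join(inst, day_food) for inst in possible_instances(person_day)]
```

The two results differ in day and food together. Independent or-sets on day and on food would also admit (Mike, Monday, fish) and (Mike, Tuesday, chicken), so they do not capture exactly these two instances.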

15

Another “Trio” in Trio

1. Data Model: Simplest extension to relational model that’s sufficiently expressive

2. Query Language: Simple extension to SQL with well-defined semantics and intuitive behavior

3. System: A complete open-source DBMS that people want to use

16

Another “Trio” in Trio

1. Data Model Uncertainty-Lineage Databases (ULDBs)

2. Query Language TriQL

3. System Trio-One — built on top of standard DBMS

17

Remainder of Talk

1. Data Model Uncertainty-Lineage Databases (ULDBs)

2. Query Language TriQL

3. System Trio-One — built on top of standard DBMS

4. Demo

18

Quote from Jennifer

We are not about machine learning or probabilistic reasoning!

We are about efficient and convenient storage, manipulation, and retrieval of large data sets (with uncertainty and lineage in them)

19

Running Example: Crime-Solving

Saw (witness, color, car) // may be uncertain

Drives (person, color, car) // may be uncertain

Suspects (person) = πperson(Saw ⋈ Drives)

20

In Standard Relational DBMS

Drives (person, color, car)
Jimmy   red     Toyota
Billy   blue    Honda
Frank   red     Mazda
Frank   green   Mazda

Saw (witness, color, car)
Amy     red     Mazda
Betty   blue    Honda
Carol   green   Toyota

Create Table Suspects as
Select person
From Saw, Drives
Where Saw.color = Drives.color And Saw.car = Drives.car

Suspects
Frank
Billy
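The standard relational version can be reproduced with a short sketch using Python's built-in sqlite3 module and an in-memory database (SQLite is a stand-in here, not the DBMS used in the talk):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Drives (person TEXT, color TEXT, car TEXT);
    CREATE TABLE Saw (witness TEXT, color TEXT, car TEXT);
    INSERT INTO Drives VALUES
        ('Jimmy', 'red', 'Toyota'), ('Billy', 'blue', 'Honda'),
        ('Frank', 'red', 'Mazda'), ('Frank', 'green', 'Mazda');
    INSERT INTO Saw VALUES
        ('Amy', 'red', 'Mazda'), ('Betty', 'blue', 'Honda'),
        ('Carol', 'green', 'Toyota');
    -- Same query as on the slide.
    CREATE TABLE Suspects AS
        SELECT person FROM Saw, Drives
        WHERE Saw.color = Drives.color AND Saw.car = Drives.car;
""")
# Sort because SQL does not guarantee row order.
suspects = sorted(row[0] for row in conn.execute("SELECT person FROM Suspects"))
```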

21

Data Model: Uncertainty

An uncertain database represents a set of possible instances

• Amy saw either a Honda or a Toyota

• Jimmy drives a Toyota, a Mazda, or both

• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3

• Hank is a suspect with confidence 0.7

22

Their Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

23

Our Model for Uncertainty

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations

3. Confidences

Saw (witness, color, car)
Amy   red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possible instances

24

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about presence

3. Confidences

Saw (witness, color, car)
Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty   blue, Acura   ?

Six possible instances

25

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness, color, car)
Amy     red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty   blue, Acura 0.6   ?

Six possible instances, each with a probability
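A minimal sketch of how the six weighted instances arise. It assumes x-tuples are independent and that any confidence mass missing from an x-tuple is the probability that the tuple is absent (‘?’); the dictionary encoding is hypothetical:

```python
from itertools import product

# Each x-tuple: a list of (alternative, confidence) pairs.
saw = {
    "Amy": [(("red", "Honda"), 0.5), (("red", "Toyota"), 0.3),
            (("orange", "Mazda"), 0.2)],
    "Betty": [(("blue", "Acura"), 0.6)],
}

def weighted_instances(relation):
    """Enumerate possible instances with their probabilities."""
    options = []
    for key, alts in relation.items():
        opts = [((key,) + alt, conf) for alt, conf in alts]
        leftover = 1.0 - sum(conf for _, conf in alts)
        if leftover > 1e-9:
            opts.append((None, leftover))  # '?': the tuple is absent
        options.append(opts)
    out = []
    for combo in product(*options):
        tuples = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, conf in combo:
            prob *= conf  # independence assumption across x-tuples
        out.append((tuples, prob))
    return out

instances = weighted_instances(saw)
```

Amy contributes three choices and Betty two (present or absent), giving six instances whose probabilities sum to 1.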

26

Models for Uncertainty

Our model (so far) is not especially new

We spent some time exploring the space of models for uncertainty

Tension between understandability and expressiveness
– Our model is understandable
– But it is not complete, or even closed under common operations

27

Our Model is Not Complete or Closed

Saw (witness, car)
Cathy   Honda ∥ Mazda

Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

Suspects
Jimmy           ?
Billy ∥ Frank   ?
Hank            ?

CANNOT correctly capture possible instances in the result

28

Lineage to the Rescue

Lineage (provenance): “where data came from”
• Internal lineage
• External lineage

In Trio: A function λ from data elements to other data elements (or external sources)

29

Example with Lineage

ID   Saw (witness, car)
11   Cathy   Honda ∥ Mazda

ID   Drives (person, car)
21   Jimmy, Toyota ∥ Jimmy, Mazda
22   Billy, Honda ∥ Frank, Honda
23   Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

ID   Suspects
31   Jimmy           ?
32   Billy ∥ Frank   ?
33   Hank            ?

λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23

Correctly captures possible instances in the result
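A rough sketch of why lineage fixes the result, assuming a derived alternative is present exactly when all alternatives in its lineage are chosen. The encoding is illustrative, and λ(31), λ(33), and source 23 are read as referring to alternative 1:

```python
from itertools import product

# Base x-tuples: id -> list of alternatives (alternative numbers start at 1).
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}
# Derived alternatives: (id, alt) -> (value, lineage as a set of base (id, alt)).
suspects = {
    (31, 1): ("Jimmy", {(11, 2), (21, 2)}),
    (32, 1): ("Billy", {(11, 1), (22, 1)}),
    (32, 2): ("Frank", {(11, 1), (22, 2)}),
    (33, 1): ("Hank", {(11, 1), (23, 1)}),
}

def suspect_instances():
    """Collect the distinct Suspects instances over all base choices."""
    base = {**saw, **drives}
    ids = sorted(base)
    found = set()
    for picks in product(*(range(1, len(base[i]) + 1) for i in ids)):
        chosen = set(zip(ids, picks))  # one alternative per base x-tuple
        # A derived value appears only if its whole lineage was chosen.
        found.add(frozenset(v for v, lin in suspects.values() if lin <= chosen))
    return found

instances = suspect_instances()
```

The result has four distinct instances: {Billy, Hank}, {Frank, Hank}, {Jimmy}, and the empty relation. Combinations such as {Jimmy, Hank}, which independent ‘?’ annotations would allow, never occur.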


31

Trio Data Model

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete

32

Formal Definition of ULDB

33

Formal definition of Completeness

34

Proof of Completeness

Proof: Construct R with x-relations S1, ..., Sn, corresponding to R1, ..., Rn.

Construct an extra relation PW that encodes the possible instances. PW contains exactly one x-tuple, with one alternative (j) for each possible instance Pj: (1) ∥ (2) ∥ ... ∥ (m).

Each Si is constructed as follows:

For every Pj, each tuple t in Ri becomes a maybe x-tuple in Si with a single alternative whose value is t.

Duplicates within and across possible instances are preserved in Si.

We add (j) in PW to the lineage of the alternatives in tuples copied from Pj.

This now exactly encodes the data in each of the possible instances.

35

Proof, Continued

The correct lineage is obtained as follows:

We look at the lineage λj in Pj and mimic it in the x-tuples it contributes to S1 through Sn.

For example, if λj(t1) = {t2} in Pj, where t1 is a tuple in R1 and t2 is a tuple in R2, then the x-tuple that t2 produced in S2 is added to the lineage of the x-tuple that t1 produced in S1.

As a final step, we remove the extra relation PW but retain its symbols as external lineage.

Therefore, each possible LDB of D now has the same schema as each Pj, and represents exactly the same data and internal lineage.

36

ULDBs: Minimality

A ULDB relation R represents a set of possible instances

Data-minimality
• Does every tuple in R appear in some possible instance? (no extraneous tuples)
• Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s)

Also: Lineage-minimality

37

Data Minimality Examples

Extraneous ‘?’

. . .
10   Billy, Honda ∥ Frank, Mazda
. . .

. . .
20   Billy ∥ Frank   ?   (extraneous)
. . .

λ(20,1)=(10,1); λ(20,2)=(10,2)

38

Data Minimality Examples

Extraneous tuple

Diane   Mazda ∥ Acura

Diane Mazda   ?
Diane Acura   ?

Diane   ?   (extraneous)

39

Querying ULDBs

Simple extension to SQL

Formal semantics, intuitive meaning

Query uncertainty, confidences, and lineage

TriQL

40

Simple TriQL Example

ID   Saw (witness, car)
11   Cathy   Honda ∥ Mazda

ID   Drives (person, car)
21   Jimmy, Toyota ∥ Jimmy, Mazda
22   Billy, Honda ∥ Frank, Honda
23   Hank, Honda

Create Table Suspects as
Select person
From Saw, Drives
Where Saw.car = Drives.car

ID   Suspects
31   Jimmy           ?
32   Billy ∥ Frank   ?
33   Hank            ?

λ(31)=(11,2),(21,2)
λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2)
λ(33)=(11,1),23

41

Formal Semantics

Relational (SQL) query Q on ULDB D

[Diagram: D → (possible instances) → D1, D2, …, Dn → (Q on each instance) → Q(D1), Q(D2), …, Q(Dn) → (representation of instances) → D’; the implementation of Q (operational semantics) goes directly from D to D’ (D + Result)]

42

TriQL: Querying Confidences

Built-in function: Conf()

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8

43

TriQL: Querying Lineage

Built-in join predicate: Lineage()

SELECT Suspects.person
FROM Suspects, Saw
WHERE Lineage(Suspects,Saw) AND Saw.witness = 'Amy'

X ==> Y shorthand for Lineage(X,Y)

44

Operational Semantics

Over standard relational database:

For each tuple in cross-product of X1, X2, ..., Xn

1. Evaluate the predicate

2. If true, project attr-list to create result tuple

SELECT attr-list
FROM X1, X2, ..., Xn
WHERE predicate

45

Operational Semantics

Over ULDB:

For each tuple in cross-product of X1, X2, ..., Xn

1. Create “super tuple” T from all combinations of alternatives

2. Evaluate predicate on each alternative in T; keep only the true ones

3. Project attr-list on each alternative to create result tuple

4. Details: ‘?’, lineage, confidences

SELECT attr-list
FROM X1, X2, ..., Xn
WHERE predicate

46

Operational Semantics: Example

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car

Saw (witness, car)
Cathy   Honda ∥ Mazda

Drives (person, car)
Jim ∥ Bill   Mazda
Hank         Honda

47

Operational Semantics: Example

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car

Saw (witness, car)
Cathy   Honda ∥ Mazda

Drives (person, car)
Jim ∥ Bill   Mazda
Hank         Honda

Super tuple for Cathy × (Jim ∥ Bill):
(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)
Predicate keeps: (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)
Projection on person: Jim ∥ Bill

Super tuple for Cathy × Hank:
(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)
Predicate keeps: (Cathy,Honda,Hank,Honda)
Projection on person: Hank
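Steps 1 to 3 of the ULDB operational semantics can be sketched on this example as follows. This is an illustrative sketch, not Trio's implementation: the list-of-alternatives encoding is an assumption, and step 4 (‘?’, lineage, confidences) is left out.

```python
from itertools import product

# Each relation is a list of x-tuples; each x-tuple is a list of alternatives.
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [[("Jim", "Mazda"), ("Bill", "Mazda")], [("Hank", "Honda")]]

def select_person_where_car_matches(saw, drives):
    """SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car (steps 1-3)."""
    result = []
    for s_xt, d_xt in product(saw, drives):  # cross-product of x-tuples
        # Step 1: super tuple from all combinations of alternatives.
        super_tuple = [s + d for s, d in product(s_xt, d_xt)]
        # Step 2: keep alternatives where Saw.car (index 1) = Drives.car (index 3).
        kept = [alt for alt in super_tuple if alt[1] == alt[3]]
        # Step 3: project the person attribute (index 2).
        projected = [alt[2] for alt in kept]
        if projected:
            result.append(projected)
    return result

result = select_person_where_car_matches(saw, drives)
```

Each inner list of `result` is one result x-tuple with its alternatives: Jim ∥ Bill, then Hank.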

53

Additional TriQL Constructs

• “Horizontal subqueries”: refer to tuple alternatives as a relation

• Aggregations: low, high, expected

• Unmerged (horizontal duplicates)

• Flatten, GroupAlts

• NoLineage, NoConf, NoMaybe

• Query-computed confidences

• Data modification statements

54

Trio-Specific Additional Features

• Lineage tracing

• On-demand confidence computation

• Coexistence checks

• Extraneous data removal

Interrelated algorithms

55

The Trio System

Version 1 (“Trio-One”)

On top of standard DBMS

Surprisingly easy and complete, reasonably efficient

56

Trio-One Overview

[Architecture diagram, top to bottom]

Clients: Command-line client; TrioExplorer (GUI client)
• DDL commands
• TriQL queries
• Schema browsing
• Table browsing
• Explore lineage
• On-demand confidence computation

Trio API and translator (Python)

Standard SQL (between the translator and the DBMS)

Standard relational DBMS, containing:

Encoded Data Tables
• Partition and “verticalize”
• Shared IDs for alternatives
• Columns for confidence, “?”

Lineage Tables
• One per result table
• Uses unique IDs
• Encodes formulas

Trio Metadata
• Table types
• Schema-level lineage structure

Trio Stored Procedures
• Conf()
• Lineage()
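One plausible reading of the encoded data tables, as a sketch only: each alternative becomes one row in an ordinary table, alternatives of the same x-tuple share an ID, and extra columns hold the confidence and the ‘?’ flag. The table and column names (`Saw_enc`, `xid`, `aid`, `conf`, `maybe`) are hypothetical and not taken from Trio-One; SQLite is used as a stand-in DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical encoding: one row per alternative.
    CREATE TABLE Saw_enc (
        xid INTEGER,   -- shared by all alternatives of one x-tuple
        aid INTEGER,   -- alternative number within the x-tuple
        witness TEXT, color TEXT, car TEXT,
        conf REAL,     -- confidence column
        maybe INTEGER  -- 1 if the x-tuple carries '?'
    );
    INSERT INTO Saw_enc VALUES
        (1, 1, 'Amy', 'red', 'Honda', 0.5, 0),
        (1, 2, 'Amy', 'red', 'Toyota', 0.3, 0),
        (1, 3, 'Amy', 'orange', 'Mazda', 0.2, 0),
        (2, 1, 'Betty', 'blue', 'Acura', 0.6, 1);
""")
# Plain SQL can now answer questions about alternatives, e.g. how many each x-tuple has.
alts_per_xtuple = dict(
    conn.execute("SELECT xid, COUNT(*) FROM Saw_enc GROUP BY xid")
)
```

With an encoding along these lines, a translator layer can rewrite TriQL into standard SQL over the encoded tables, which matches the "on top of standard DBMS" design described above.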

57

Strengths

First DBMS with uncertainty and lineage

Has many applications, as shown earlier

Done by Stanford!

58

Weaknesses

Paper has lots of definitions rather than explanations.

Proofs are written as running text rather than as multi-line derivations with math symbols.

59

Future Work

More forms of uncertainty
• Continuous uncertainty (intervals, Gaussians)
• Correlated uncertainty
• Incomplete relations

More forms of lineage
• External lineage
• Update lineage

Confidence-based queries
• Threshold; “Top-K”

60

Future Work

Conjunctive lineage sufficient for most operations
• Disjunctive lineage for duplicate-elimination
• Negative lineage for difference

General case after several queries:
• Boolean formula

Search “stanford trio”
