Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS
Martin Theobald
Jennifer Widom
Stanford University
2
Outline
• Motivation and Applications
• Working Model for ULDBs
• Trio-One System Architecture
• Efficient Confidence Computations
3
Motivation
• Many applications involve data that is uncertain
(approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...)
• Many of the same applications need to track the lineage of their data
Neither uncertainty nor lineage is supported by conventional Database Management Systems (DBMSs)
Coincidence or Fate?
4
Sample Applications
Deduplication
• Uncertainty: Match and merge
• Lineage: Source records
Information extraction
• Uncertainty: Extracted labels and values
• Lineage: Original context
Information integration
• Uncertainty: Inconsistent information
• Lineage: Original sources
5
Sample Applications
Scientific experiments
• Uncertainty: Captured (and derived) data
• Lineage: Layers of views
Sensor data
• Uncertainty: Sensor values, aggregations, missing readings
• Lineage: Original readings, views
6
Claim
The connection between uncertainty and lineage goes deeper than just a shared need by several applications
Coincidence or Fate?
7
Lineage and Uncertainty
Lineage...
• Enables simple and consistent representation of uncertain data
• Allows for intuitive and complete working models
• Correlates uncertainty in query results with uncertainty in the input data
• Can make computation over uncertain data more efficient
Applications use lineage to reduce or resolve uncertainty
8
Goal
A new kind of DBMS in which:
1. Data
2. Uncertainty
3. Lineage
are all first-class interrelated concepts (the "Trio")
9
The Trio Trio
1. Data Model
   Simplest extension to relational model that’s sufficiently expressive
2. Query Language
   Simple extension to SQL with well-defined semantics and intuitive behavior
3. System
   A complete open-source DBMS that people want to use
10
The Present
1. Data Model
   Uncertainty-Lineage Databases (ULDBs)
2. Query Language
   TriQL
3. System
   Trio-One: Layered prototype deployable on top of any standard relational DBMS
11
Running Example: Crime-Solving
Saw(witness,car) // may be uncertain
Drives(person,car) // may be uncertain
Suspects(person) = πperson(Saw ⋈car Drives)
12
Data Model: Uncertainty
An uncertain database represents a set of possible instances of data values
• Amy saw either a Honda or a Toyota
• Jimmy drives a Toyota, a Mazda, or both
• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3
• Hank is a suspect with confidence 0.7
13
Our Model for Uncertainty
1. Alternatives (mutually exclusive)
2. ‘?’ (Maybe) Annotations
3. Confidences
14
Our Model for Uncertainty
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
=
witness   car
Amy       { Honda, Toyota, Mazda }
Three possible instances
15
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
(Betty, Acura) ?
Six possible instances
16
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty
Saw (witness,car)
(Amy, Honda): 0.5 ∥ (Amy, Toyota): 0.3 ∥ (Amy, Mazda): 0.2
(Betty, Acura): 0.6 ?
Six possible instances, each with a probability
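The count of six instances and their probabilities can be checked by brute force. The sketch below is illustrative only and is not part of Trio: it assumes x-tuples are independent, and it reads a confidence sum below 1 (Betty's 0.6) as leaving the remaining mass for "tuple absent". The function name `possible_instances` is hypothetical.

```python
from itertools import product

# Each x-tuple is a list of (alternative, confidence) pairs.
# Assumption: any probability mass missing from 1.0 means the tuple may be absent ('?').
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],  # sums to 0.6 < 1, so this is a maybe-tuple
]

def possible_instances(xtuples):
    """Return a list of (instance, probability), assuming independent x-tuples."""
    choices = []
    for alts in xtuples:
        options = list(alts)
        absent = 1.0 - sum(conf for _, conf in alts)
        if absent > 1e-9:
            options.append((None, absent))  # None stands for "tuple not present"
        choices.append(options)
    instances = []
    for combo in product(*choices):
        instance = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, conf in combo:
            prob *= conf
        instances.append((instance, prob))
    return instances

instances = possible_instances(saw)
for instance, prob in instances:
    print(sorted(instance), round(prob, 2))
```

Three choices for Amy times two for Betty (present or absent) gives the six instances, and their probabilities sum to 1.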
17
Models for Uncertainty
• Our model (so far) is not especially new
• We spent some time exploring the space of models for uncertainty
• Tension between understandability and expressiveness
  – Our model is understandable
  – But it is not complete, or even closed under common operations
18
Closure and Completeness
Completeness
   Can represent any finite set of possible instances
Closure
   Can represent any result of relational operators
Note: Completeness ⇒ Closure
19
Models for Uncertainty
• A-Tuples – Uncertainty on attribute values
  • (Amy, {Toyota, Mazda})
  • Not closed
• X-Tuples – Uncertainty on entire tuples
  • { (Amy, Toyota), (Amy, Mazda) }
  • Still not closed
• C-Tables – Tuples + Boolean constraints
  • { (Amy, Toyota, X=1), (Amy, Mazda, X<>1), (Cathy, Honda, X<>1) }
  • Closed and complete!
  • Understandable?
20
Our Model is Not Closed
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda) ∥ (Frank, Honda)
(Hank, Honda)
Suspects = πperson(Saw ⋈car Drives)
Suspects
Jimmy ?
Billy ∥ Frank ?
Hank ?
Does not (and CANNOT) correctly capture possible instances in the result
21
Lineage to the Rescue
Lineage (provenance): “where data came from”
• Internal lineage
• External lineage
In Trio: A function λ from alternatives to other alternatives (or external sources)
22
Example with Lineage
ID   Saw (witness,car)
11   (Cathy, Honda) ∥ (Cathy, Mazda)

ID   Drives (person,car)
21   (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22   (Billy, Honda) ∥ (Frank, Honda)
23   (Hank, Honda)

Suspects = πperson(Saw ⋈car Drives)

ID   Suspects
31   Jimmy ?
32   Billy ∥ Frank ?
33   Hank ?

λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23
Correctly captures possible instances in the result
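To see how lineage restricts the result, one can enumerate every choice of base alternatives and keep a Suspects alternative only when all of its lineage was chosen. This is a small brute-force sketch, not Trio code; it writes λ(33)'s reference to tuple 23 as (23,1), and the name `suspects_instances` is made up for the example.

```python
from itertools import product

# Base x-tuples from the slide: id -> number of alternatives (none is a maybe-tuple).
base = {11: 2, 21: 2, 22: 2, 23: 1}

# Derived alternatives: (tuple id, alt index) -> (value, lineage).
# Lineage is a set of base (tuple id, alt index) pairs; (23, 1) stands for tuple 23.
suspects = {
    (31, 1): ("Jimmy", {(11, 2), (21, 2)}),
    (32, 1): ("Billy", {(11, 1), (22, 1)}),
    (32, 2): ("Frank", {(11, 1), (22, 2)}),
    (33, 1): ("Hank", {(11, 1), (23, 1)}),
}

def suspects_instances():
    """Distinct Suspects instances over all choices of base alternatives."""
    ids = sorted(base)
    seen = set()
    for picks in product(*(range(1, base[i] + 1) for i in ids)):
        chosen = set(zip(ids, picks))
        # An alternative is present exactly when its whole lineage was chosen.
        seen.add(frozenset(v for v, lin in suspects.values() if lin <= chosen))
    return seen

result = suspects_instances()
for instance in sorted(result, key=lambda s: sorted(s)):
    print(sorted(instance))
```

Only four instances survive: {}, {Jimmy}, {Billy, Hank}, and {Frank, Hank}. Treating the three Suspects tuples as independent maybe-tuples would instead admit twelve combinations, including impossible ones such as {Jimmy, Hank}.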
23
Trio Data Model
1. Alternatives (mutually exclusive)
2. ‘?’ (Maybe) Annotations
3. Confidences
4. Lineage
ULDBs are closed and complete
Uncertainty-Lineage Databases (ULDBs)
24
ULDBs: Lineage
Conjunctive lineage sufficient for most operations
• Disjunctive lineage for duplicate-elimination
• Negative lineage for difference
25
ULDBs: Minimality
A ULDB relation R represents a set of possible instances
Data-minimality
• Does every tuple in R appear in some possible instance? (no extraneous tuples)
• Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s)
Also: Lineage-minimality
26
Data Minimality Examples
Extraneous ‘?’
. . .
10   (Billy, Honda) ∥ (Frank, Honda)
. . .
. . .
40   Billy ∥ Frank ?   (‘?’ is extraneous)
. . .
λ(40,1)=(10,1); λ(40,2)=(10,2)
. . .
20   (Billy, Honda) ?
. . .
. . .
30   (Frank, Honda) ?
. . .
27
Data Minimality Examples
Extraneous tuple
(Diane, Mazda) ∥ (Diane, Acura)
Diane   (extraneous)
(Diane, Mazda)
(Diane, Acura)
?
??
28
ULDBs: Membership Questions
Does a given tuple t appear in some (all) possible instance(s) of R ?
Is a given table T one of (all of) the possible instances of R ?
Polynomial algorithms based on data-minimization
NP-Hard
29
Non-Theorists...
Wake Back Up!
30
Querying ULDBs
• Simple extension to SQL
• Formal semantics, intuitive meaning
• Query uncertainty, confidences, and lineage
TriQL
31
Initial TriQL Example
ID   Saw (witness,car)
11   (Cathy, Honda) ∥ (Cathy, Mazda)

ID   Drives (person,car)
21   (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22   (Billy, Honda) ∥ (Frank, Honda)
23   (Hank, Honda)

SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car

ID   Suspects
31   Jimmy ?
32   Billy ∥ Frank ?
33   Hank ?

λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23
32
Formal Semantics
Query Q on ULDB D
[Diagram: D represents the possible instances D1, D2, …, Dn. Applying Q on each instance gives Q(D1), Q(D2), …, Q(Dn). The implementation of Q (operational semantics) maps D to D’ = D + Result, and D’ is a representation of the instances Q(D1), Q(D2), …, Q(Dn).]
33
Operational Semantics
Over conventional relational database:
For each tuple in cross-product of X1, X2, ..., Xn
1. Evaluate the predicate
2. If true, project attr-list to create result tuple
3. If INTO clause, insert into table
SELECT attr-list [ INTO table ]
FROM X1, X2, ..., Xn
WHERE predicate
34
Operational Semantics
Over ULDB:
For each tuple in cross-product of X1, X2, ..., Xn
1. Create “super tuple” T from all combinations of alternatives
2. Evaluate predicate on each alternative in T ; keep only the true ones
3. Project attr-list on each alternative to create result tuple
4. Details: ‘?’, lineage, confidences
SELECT attr-list [ INTO table ]
FROM X1, X2, ..., Xn
WHERE predicate
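The four steps above can be sketched for a two-table join as follows. This is an illustration under stated assumptions, not Trio's implementation: the tuple IDs (11, 21, 22), the function name `uldb_join_project`, and the rule for step 4 (mark ‘?’ when an input is a maybe-tuple or any combination is filtered out, which can leave extraneous ‘?’s) are all choices made for this example. Confidences are left out.

```python
from itertools import product

# An x-tuple is (list of alternatives, is_maybe), keyed by a hypothetical tuple id.
saw = {11: ([("Cathy", "Honda"), ("Cathy", "Mazda")], False)}
drives = {
    21: ([("Jim", "Mazda"), ("Bill", "Mazda")], False),
    22: ([("Hank", "Honda")], False),
}

def uldb_join_project(left, right, predicate, project):
    """Sketch of SELECT project FROM left, right WHERE predicate over x-tuples."""
    results = []
    for (lid, (lalts, lmaybe)), (rid, (ralts, rmaybe)) in product(left.items(), right.items()):
        # Step 1: the "super tuple" holds every combination of alternatives.
        super_tuple = [
            ((lid, i), la, (rid, j), ra)
            for (i, la), (j, ra) in product(enumerate(lalts, 1), enumerate(ralts, 1))
        ]
        # Step 2: keep only the combinations that satisfy the predicate.
        kept = [c for c in super_tuple if predicate(c[1], c[3])]
        if not kept:
            continue
        # Step 3: project, recording conjunctive lineage for each result alternative.
        alts = [(project(la, ra), [lref, rref]) for lref, la, rref, ra in kept]
        # Step 4 (simplified): '?' if an input was a maybe-tuple or a combination was dropped.
        maybe = lmaybe or rmaybe or len(kept) < len(super_tuple)
        results.append((alts, maybe))
    return results

suspects = uldb_join_project(
    saw, drives,
    predicate=lambda s, d: s[1] == d[1],  # Saw.car = Drives.car
    project=lambda s, d: d[0],            # Drives.person
)
for alts, maybe in suspects:
    print(" ∥ ".join(value for value, _ in alts), "?" if maybe else "")
```

On the Saw and Drives data used in the example slides, this yields the x-tuples Jim ∥ Bill ? and Hank ?, each alternative carrying references to the Saw and Drives alternatives it came from.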
35
Operational Semantics: Example
SELECT Drives.person
FROM Saw, Drives
WHERE Saw.car = Drives.car
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jim, Mazda) ∥ (Bill, Mazda)
(Hank, Honda)
36
Operational Semantics: Example
SELECT Drives.person
FROM Saw, Drives
WHERE Saw.car = Drives.car
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jim, Mazda) ∥ (Bill, Mazda)
(Hank, Honda)
(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)
39
Operational Semantics: Example
(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)
SELECT Drives.person
FROM Saw, Drives
WHERE Saw.car = Drives.car
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jim, Mazda) ∥ (Bill, Mazda)
(Hank, Honda)
(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)
42
Operational Semantics: Example
SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jim, Mazda) ∥ (Bill, Mazda)
(Hank, Honda)
Suspects
Jim ∥ Bill ?
Hank ?
λ( ) = ...   λ( ) = ...
43
Confidences
Confidences supplied with base data
Trio computes confidences on query results
• Default probabilistic interpretation
• Can choose to plug in different arithmetic
Saw (witness,car)
(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4
Drives (person,car)
(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 ?
(Hank, Honda): 1.0
Suspects (Probabilistic)
Jim: 0.12 ∥ Bill: 0.24 ?
Hank: 0.6 ?
Suspects (Min)
Jim: 0.3 ∥ Bill: 0.4 ?
Hank: 0.6 ?
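Both sets of Suspects confidences follow from combining the confidences of each alternative's lineage. A minimal sketch, assuming independent base x-tuples; the `combine` helper and its `mode` argument are hypothetical and only illustrate the idea of pluggable arithmetic:

```python
from math import prod

# Confidences of the base alternatives in each result alternative's lineage,
# taken from the Saw and Drives tables on the slide.
lineage_confs = {
    "Jim": [0.4, 0.3],   # (Cathy, Mazda) and (Jim, Mazda)
    "Bill": [0.4, 0.6],  # (Cathy, Mazda) and (Bill, Mazda)
    "Hank": [0.6, 1.0],  # (Cathy, Honda) and (Hank, Honda)
}

def combine(confs, mode="probabilistic"):
    """Combine lineage confidences; 'probabilistic' assumes independence."""
    if mode == "probabilistic":
        return prod(confs)
    if mode == "min":
        return min(confs)
    raise ValueError(f"unknown mode: {mode}")

probabilistic = {p: round(combine(c), 2) for p, c in lineage_confs.items()}
minimum = {p: combine(c, "min") for p, c in lineage_confs.items()}
print(probabilistic)
print(minimum)
```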
44
TriQL: Querying Confidences
Built-in function: Conf()
SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car
  AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8

SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car AND Conf(*) > 0.4
45
TriQL: Querying Lineage
Built-in join predicate: Lineage()
SELECT Saw.witness INTO AccusesHank
FROM Suspects, Saw
WHERE Lineage(Suspects,Saw) AND Suspects.person = 'Hank'
Also works for transitive lineage:
SELECT witness
FROM AccusesHank
WHERE Lineage(Saw,AccusesHank)
46
Additional Query Constructs
• “Horizontal subqueries”: refer to tuple alternatives as a relation
• Unmerged (horizontal duplicates)
• Flatten, GroupAlts
• NoLineage, NoConf, NoMaybe
• Query-computed confidences
• Data modification statements
47
Final Example Query
PrimeSuspect (crime#, accuser, suspect)
(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)
(2, Cathy, Frank) ∥ (2, Betty, Freddy)

Credibility
person   score
Amy      10
Betty    15
Cathy    5

Suspects
Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166
Frank: 0.25 ∥ Freddy: 0.75

List suspects with conf values based on accuser credibility
48
Final Example Query
SELECT suspect, score/[sum(score)] as conf
FROM (SELECT suspect,
             (SELECT score FROM Credibility C
              WHERE C.person = P.accuser)
      FROM PrimeSuspect P)
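The Suspects confidences in this example can be reproduced outside TriQL by dividing each accuser's credibility score by the sum of scores within the same x-tuple, which is what the horizontal subquery [sum(score)] expresses. In this plain-Python sketch the crime# value happens to identify the x-tuple, and the function name `scaled_confidences` is hypothetical.

```python
from collections import defaultdict

# Alternatives of PrimeSuspect, flattened; crime# identifies the x-tuple here.
prime_suspect = [
    (1, "Amy", "Jimmy"), (1, "Betty", "Billy"), (1, "Cathy", "Hank"),
    (2, "Cathy", "Frank"), (2, "Betty", "Freddy"),
]
credibility = {"Amy": 10, "Betty": 15, "Cathy": 5}

def scaled_confidences(rows, scores):
    """Per x-tuple, scale each alternative's accuser score by the total score."""
    by_crime = defaultdict(list)
    for crime, accuser, suspect in rows:
        by_crime[crime].append((suspect, scores[accuser]))
    result = {}
    for crime, alts in by_crime.items():
        total = sum(score for _, score in alts)  # the horizontal sum(score)
        result[crime] = {suspect: score / total for suspect, score in alts}
    return result

confs = scaled_confidences(prime_suspect, credibility)
for crime, alts in confs.items():
    print(crime, {s: round(c, 3) for s, c in alts.items()})
```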
“Scaled as conf”
“Uniform as conf”
…
51
The Trio System
Version 1
Entirely on top of conventional DBMS
Surprisingly easy and complete, reasonably efficient
52
The Trio System: Version 1
Trio API
  create trio table T(A,B)
  select C into R ...
  • Result cursors
  • Traverse lineage
SQL commands
Relational DBMS
  T (aid, xid, A, B)
  R (aid, xid, C)
  Lin:R (aidFr, table, aidTo)
  Trio Metadata
53
Relation Encoding
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)

Saw (encoded)
aid  xid  conf  witness  car
11   1    2     Cathy    Honda
12   1    2     Cathy    Mazda

Drives (person,car)
(Jim, Mazda) ∥ (Bill, Mazda)
(Hank, Honda)

Drives (encoded)
aid  xid  conf  person  car
21   1    2     Jim     Mazda
22   1    2     Bill    Mazda
23   2    1     Hank    Honda

Suspects
Jim ∥ Bill ?
Hank ?

Suspects (encoded)
aid  xid  conf  person
31   1    3     Jim
32   1    3     Bill
33   2    2     Hank

Lin:Suspects
aidFr  table   aidTo
31     Saw     12
31     Drives  21
32     Saw     12
32     Drives  22
33     Saw     11
33     Drives  23
54
Relation Encoding with Confidences
Saw (witness,car)
(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Saw (encoded)
aid  xid  conf  witness  car
11   1    0.6   Cathy    Honda
12   1    0.4   Cathy    Mazda

Drives (person,car)
(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6
(Hank, Honda)

Drives (encoded)
aid  xid  conf  person  car
21   1    0.3   Jim     Mazda
22   1    0.6   Bill    Mazda
23   2    1.0   Hank    Honda

Suspects (encoded)
aid  xid  conf  person
31   1    NULL  Jim
32   1    NULL  Bill
33   2    NULL  Hank

Computed confidences for Suspects: 0.12 (Jim), 0.24 (Bill), 0.6 (Hank)
55
Query Translation
Persistent results: Query Q into result table R
1. Run query Q’ to produce “super-result” R
   Q’ ≈ Q but adds aid/xid’s of source tuples, joins lineage tables for lineage() predicates
2. Group R into alternatives, generate new xid’s
3. Materialize lineage data into lin:R
4. Compute confidences? (optional)
5. Add/update metadata: schemas, confidence info, lineage structure
Transient results: stop at 2, return cursor over new xid’s
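Steps 1 to 3 can be tried out on the encoded Saw and Drives tables with any relational engine. The sketch below uses Python's built-in sqlite3 module purely for illustration (the PostgreSQL-specific parts of Trio-One are not involved). The exact shape of Q’, the grouping key, and the choice to start new aids at 31 are assumptions made to reproduce the relation-encoding slide; confidences and metadata (steps 4 and 5) are omitted.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Saw    (aid INTEGER, xid INTEGER, witness TEXT, car TEXT);
CREATE TABLE Drives (aid INTEGER, xid INTEGER, person TEXT, car TEXT);
INSERT INTO Saw VALUES (11, 1, 'Cathy', 'Honda'), (12, 1, 'Cathy', 'Mazda');
INSERT INTO Drives VALUES (21, 1, 'Jim', 'Mazda'), (22, 1, 'Bill', 'Mazda'),
                          (23, 2, 'Hank', 'Honda');
""")

# Step 1: Q' is Q plus the aid/xid columns of the source tuples.
super_result = con.execute("""
    SELECT Drives.person, Saw.aid, Saw.xid, Drives.aid, Drives.xid
    FROM Saw, Drives
    WHERE Saw.car = Drives.car
    ORDER BY Saw.xid, Drives.xid, Drives.aid
""").fetchall()

# Step 2: rows from the same combination of source x-tuples form one result
# x-tuple; hand out new xids per group and a new aid per alternative.
suspects, lineage, xids = [], [], {}
for new_aid, (person, saw_aid, saw_xid, drv_aid, drv_xid) in enumerate(super_result, start=31):
    xid = xids.setdefault((saw_xid, drv_xid), len(xids) + 1)
    suspects.append((new_aid, xid, person))
    # Step 3: one lineage row per source alternative (aidFr, table, aidTo).
    lineage += [(new_aid, "Saw", saw_aid), (new_aid, "Drives", drv_aid)]

print(suspects)
print(lineage)
```

The resulting rows match the encoded Suspects table and the Lin:Suspects table shown on the relation-encoding slide.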
57
The Trio System: Version 1
TrioExplorer (GUI client), Command-line client
• TriQL queries
• DDL commands
• Schema browsing
• Table browsing
• Explore lineage
• On-demand confidence computation
Trio API and translator (TriQL parser in Python)
Standard SQL
Relational DBMS
Encoded Data Tables
• Data & system columns
• A-IDs for alternatives
• “Verticalized” through shared X-IDs
• Confidences
Lineage Tables
• One lineage table per derived table
• Connecting A-IDs
Trio Metadata
• Table types
• Schema-level lineage structure
Trio Stored Procedures
• conf()
• lineage()
SPI-Plugin for PostgreSQL: Replacing “Fetch-All”
58
The Trio System: Version 2
TrioExplorer (GUI client), Command-line client
Trio API and translator
Standard SQL, Direct function calls
Trio DBMS
• Specialized Trio Data Structures
• Specialized Trio Processing & Optimizing
59
Efficient Confidence Computation
Previous approach (probabilistic databases)
• Each operator computes confidences during query execution
• Only certain (“safe”) query plans allowed
In ULDBs
   Confidence of alternative A is function of confidences in λ(A)
Our approach
• Decoupled data and confidence computation
• Allow any query plan for data computation
• Compute confidences on-demand based on lineage
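A rough sketch of on-demand computation: walk an alternative's lineage down to the base data, collect the distinct base alternatives, and only then multiply their confidences. It assumes purely conjunctive lineage and independent base x-tuples; the second-level alternative 41 and the function names are hypothetical, and the memoization here uses `functools.lru_cache` rather than anything Trio-specific.

```python
from functools import lru_cache
from math import prod

# Base alternatives: aid -> (x-tuple key, confidence), as in the encoding slides.
base = {
    11: (("Saw", 1), 0.6), 12: (("Saw", 1), 0.4),
    21: (("Drives", 1), 0.3), 22: (("Drives", 1), 0.6), 23: (("Drives", 2), 1.0),
}
# Conjunctive lineage: derived aid -> aids it was derived from.
lineage = {
    31: (12, 21), 32: (12, 22), 33: (11, 23),
    41: (33, 11),  # hypothetical alternative derived from Suspects 33 and Saw 11
}

@lru_cache(maxsize=None)
def base_alternatives(aid):
    """Distinct base alternatives reachable from aid (memoized per aid)."""
    if aid in base:
        return frozenset([aid])
    return frozenset().union(*(base_alternatives(a) for a in lineage[aid]))

def confidence(aid):
    leaves = base_alternatives(aid)
    xtuples = [base[a][0] for a in leaves]
    if len(set(xtuples)) < len(xtuples):
        return 0.0  # two alternatives of one x-tuple can never co-occur
    return prod(base[a][1] for a in leaves)

print(round(confidence(31), 2), round(confidence(41), 2))
```

Alternative 41 reaches Saw alternative 11 along two paths, but it is counted once, so its confidence is 0.6 × 1.0 = 0.6. Multiplying operator by operator would give 0.6 × 0.6 = 0.36, which illustrates why a lineage-based computation does not depend on the query plan.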
60
CIDs & Confidence Memoization
Find set of Closest Independent Descendants (CIDs) for each relation in the schema-level lineage DAG (relations with no shared ancestors)
Batched confidence computations in SQL
• Select unions of distinct base alternatives from all root-to-leaf lineage paths
• With CIDs and memoization enabled
  – if confidences are memoized then stop at CIDs
  – else traverse down to the leaves (base data) and memoize at CIDs
(ongoing work)
61
“Probabilistic TPC-H” Setup
• Horizontal split of TPC-H tables
• Uniform-random number of alternatives [1-10]
• Uniform confidences
[Figure: schema-level lineage DAG. Base tables Partsupps (0.8M), Lineitems (6M), and Orders (1.5M) are split horizontally into P1–P3, L1–L4, and O1–O3. Derived relations: O, L, P; OL, LP; OLP. Other sizes shown in the figure: 1.5M, 1.0M, 0.1M, 1.0M, 0.5M, 0.7M.]
CID(OLP) = {O, L, P}
62
Confidence Computation Times
[Figure: bar chart of execution time in seconds (axis 0 to 5,000) for O, L, P, OL, LP, OLP, and Total, comparing "No CIDs" with "CIDs"]
63
Per-Tuple vs. Batching (no CIDs)
64
Memoization & Join Multiplicity
65
Current Topics
System
• Full query language
• Nice interface
• Performance experiments
• Demo applications
Algorithms: confidence & lineage computation, extraneous data, membership questions
• Minimize lineage traversal
• Confidence memoization
• Efficient batch computations
66
Future Directions
Theory, Model, Algorithms
• Unlimited opportunities
System
• Storage, indexing, partitioning
• Statistics and query optimization
More features
• Top-K by confidence
• Incomplete relations; continuous uncertainty; correlated uncertainty
• External lineage; update lineage; versioning
Search “stanford trio”
Current Trio team: Parag Agrawal, Anish Das Sarma, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara, Martin Theobald, Jennifer Widom
Special thanks to: Ashok Chandra, Alon Halevy, Jeff Ullman, Omar Benjelloun
but don’t forget the lineage…