Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Scorpion Explaining Away Outliers in Aggregate Queries
eugene wu and sam madden
MIT
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
Table
Split Aggregate Visualize
USA Italy
Exp
ense
s
China
SELECT sum(cost) FROM expenses GROUPBY country
USA Italy China
Exp
ense
s
SELECT sum(cost) FROM expenses GROUPBY country
USA Italy China
Exp
ense
s
SELECT sum(cost) FROM expenses GROUPBY country
USA Italy China
Exp
ense
s
SELECT sum(cost) FROM expenses GROUPBY country
USA Italy China
Given Outlier and normal results
Understand Why
Exp
ense
s
SELECT sum(cost) FROM expenses GROUPBY country
Given Outlier and normal results
caused the outliers? most caused the outliers? caused outliers but didn’t affect normal outputs?
USA Italy China
What input properties
Exp
ense
s
SELECT sum(cost) FROM expenses GROUPBY country
Can’t Touch This
Provenance
Data!
Provenance $$$
SELECT SUM(cost) FROM sam’s bank account
Provenance
SELECT SUM(cost) FROM sam’s bank account
$$$
Provenance
SELECT SUM(cost) FROM sam’s bank account
$$$
Provenance
Darn! Ya caught me
SELECT SUM(cost) FROM sam’s bank account
$$$
Provenance
http://weknowmemes.com/2012/04/whats-the-point/
Filter for “most influential”
Provenance
SELECT SUM(cost) FROM sam’s bank account
Provenance
Provenance
Faceting
http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance
Faceting
http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance
Faceting
Dimensionality :(
Dealing with multiple outliers?
http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance
Faceting
Provenance
Faceting
Scorpion!
USA Italy
Understand Why
Given Outlier and normal results
China
Exp
ense
s
USA Italy China
Predicates correlated with outliers
Find
Given Outlier and normal results
Desc = “toilets”
Exp
ense
s
USA Italy China
Removing predicate from inputs “fixes” outliers & maintains normal results
Predicates correlated with outliers
Find
s.t.
Given Outlier and normal results
Exp
ense
s
Desc = “toilets”
USA Italy China
Removing predicate from inputs “fixes” outliers & maintains normal results
Predicates correlated with outliers
Find
s.t.
Given Outlier and normal results
Exp
ense
s
Desc = “toilets”
Removing predicate from inputs “fixes” outliers & maintains normal results
Predicates correlated with outliers
Find
s.t.
USA Italy China
Given Outlier and normal results
Exp
ense
s
Desc = “toilets”
Formalize “influence” as metric Predicate search heuristics
Some results
25
50
12pm
T
p(T)
25
50
12pm
T
Desc = “toilet”
p(T)
25
50
12pm
T
T – p(T)
p(T)
25
50
12pm
25
50
12pm
T
25
50
12pm
25
50
12pm
p(T)
25
50
12pm
25
50
12pm
p(T)
25
50
12pm
25
50
12pm
p(T)
Δoutput
25
50
12pm
25
50
12pm
|p(T)| p(T)
Δoutput
25
50
12pm
25
50
12pm
|p(T)| p(T)
Δoutput
Δoutput |p(T)|
Influence Metric
Δoutput |p(T)|
Δf(x) Δx
Sensitivity Analysis
Influence Metric
Δoutput
|p(T)| ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput
|p(T)|
Δoutput � V
|p(T)|
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput � V
|p(T)|
Δoutput � V
|p(T)|c
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput
|p(T)|
Δoutlier � V
|p(T)|c ΔNormal
Δoutput � V
|p(T)|c
-
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput
|p(T)|
Δoutput � V
|p(T)|
Δoutlier � V
|p(T)|c ΔNormal -
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs Δoutlier � V
|p(T)|c mean ΔNormal max
outlier normal -
Δoutput
|p(T)|
Δoutput � V
|p(T)|c
Δoutput � V
|p(T)|
Δoutlier � V
|P(T)|c ΔHold-out
Δoutlier
|P(T)|
Δoutlier � V
|P(T)|
Δoutlier � V
|P(T)|c
-
Δoutput
“High vs Low”
|P(T)|
ΔNormal
Multiple Outputs Δoutlier � V
|P(T)|c mean ΔHold-out max
outlier normal -
influence(p)
Formalize “influence” as metric Predicate search heuristics
Some results
influence(p) argmax p ∈ predicates p* =
influence(p) argmax p ∈ predicates p* =
O(agg(T-p(T)))
influence(p) argmax p ∈ predicates p* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15
influence(p) argmax p ∈ predicates p* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15
p
influence(p) argmax p ∈ predicates p* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15 - {4,5}
p
influence(p) argmax p ∈ predicates p* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15
SUM({1,2,3}) = 6
- {4,5}
p
influence(p) argmax p ∈ predicates p* =
O(exponential) O(agg(T-p(T)))
influence(p) argmax p ∈ predicates p* =
O(exponential)
Operator Properties
O(agg(T-p(T)))
influence(p) argmax p ∈ predicates p* =
O(exponential) O(agg(p(T)))
Operator Properties
Incrementally removable
influence(p) argmax p ∈ predicates p* =
O(exponential) O(agg(p(T))) Incrementally removable
SUM({1,2,3,4,5}) = 15
p
influence(p) argmax p ∈ predicates p* =
O(exponential) O(agg(p(T))) Incrementally removable
15 - SUM({ 4,5}) = 6
SUM({1,2,3,4,5}) = 15
p
influence(p) argmax p ∈ predicates p* =
O(exponential) O(agg(p(T)))
SUM COUNT AVG STDDEV
Incrementally removable
influence(p) argmax p ∈ predicates p* =
O(exponential) O(agg(p(T)))
SUM COUNT AVG STDDEV
MEDIAN MODE
Incrementally removable
influence(p) argmax p ∈ predicates p* =
Independent
Incrementally removable
O(agg(p(T))) O(exponential)
Least influence
Most influence
influence(p) argmax p ∈ predicates p* =
Independent
Incrementally removable
O(agg(p(T))) O(exponential)
Least influence
Most influence
influence(p) argmax p ∈ predicates p* =
Independent
Incrementally removable
O(agg(p(T))) O(exponential)
Least influence
Most influence
influence(p) argmax p ∈ predicates p* =
Independent
Incrementally removable
O(agg(p(T))) O(exponential)
Least influence
Most influence
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
Anti-monotonic
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
Anti-monotonic
p’⊂p
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
Anti-monotonic
p’⊂p
influence(p’) ≤ influence(p)
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
Bottom Up Anti-monotonic
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
Bottom Up Anti-monotonic
influence(p) argmax p ∈ predicates p* =
Top Down Independent Incrementally removable
O(agg(p(T))) O(exponential)
Bottom Up Anti-monotonic
Formalize “influence” as metric Predicate search heuristics
Some results
SELECT sum(Y) GROUPBY X S
um(Y
)
X
SELECT sum(Y) GROUPBY X S
um(Y
)
X
Z
Y
SELECT sum(Y) GROUPBY X S
um(Y
)
X
Z
Y
Z
SELECT sum(Y) GROUPBY X S
um(Y
)
X
Z
Y
Z
1K 5K 10K
thousand tuples / group
1K 5K 10K
100
1000
10
thousand tuples / group
cost
sec
ond
s
1K 5K 10K
100
1000
10
thousand tuples / group
cost
sec
ond
s Naive
1K 5K 10K
100
1000
10
thousand tuples / group
cost
sec
ond
s Naive
Top down
1K 5K 10K
100
1000
10
thousand tuples / group
cost
sec
ond
s Naive
Top down
Bottom up
influence metric that is
accessible to end-users for
Data cleaning Data exploration Provenance reduction
scorpion
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
scorpion
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
scorpion
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
C-parameter
Δoutput � V |p(T)|c
Y
Z
Low C High C
Z