Sequential Dependencies
Flip Korn, AT&T
Lukasz Golab, AT&T
Howard Karloff, AT&T
Avishek Saha, University of Utah
Divesh Srivastava, AT&T
Data Quality
• Cost to business: $600 billion
• Problems prevalent in measurement data
– Equipment failure
– Calibration/systemic errors
– Configuration errors
– Management errors
• Our goal: detection, not cleaning
Data Quality
• Principled approach: integrity constraints
– Assert semantics
– Deviations = quality issues
– Allow approximation for real-world data
– Condition tableau discovery
• Domain often gives rise to semantics
– missing, extraneous, out-of-order
Sequential Dependency Definition
• Sequential dependency $X \to_g Y$
– $(y_i - y_{i-1}) \in g$, y's sorted w.r.t. x-values
– extension of functional dependency, $g = (0, \infty)$
• $X \to Y$: $\forall t_1, t_2:\ t_1[X] < t_2[X] \Rightarrow t_1[Y] < t_2[Y]$
• Example #1: date $\to_{[20, \infty)}$ price
– Prices increasing by at least 20 units
• Example #2: poll# $\to_{[4,6]}$ time
– Consecutive polls 4-6 mins apart
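The definition above can be checked directly on a small relation. The following is a minimal sketch, not the authors' code: the function name `satisfies_sd`, the (x, y) pair representation, and the closed-interval convention are assumptions made for illustration.

```python
import math

def satisfies_sd(pairs, lo, hi=math.inf):
    """Return True if every consecutive y-gap lies in the closed interval [lo, hi].

    pairs: iterable of (x, y) tuples; the y's are examined in x-order.
    Hypothetical helper: closed bounds are an assumption of this sketch.
    """
    ys = [y for _, y in sorted(pairs, key=lambda p: p[0])]
    for prev, cur in zip(ys, ys[1:]):
        gap = cur - prev
        if gap < lo or gap > hi:
            return False
    return True

# Example #2 style: poll# -> time with g = [4, 6] (times in minutes, made-up data)
polls_ok = [(1, 0), (2, 5), (3, 9), (4, 15)]
polls_bad = [(1, 0), (2, 5), (3, 12)]   # gap of 7 violates [4, 6]
print(satisfies_sd(polls_ok, 4, 6), satisfies_sd(polls_bad, 4, 6))
```

An exact check like this is all-or-nothing, which is why the later slides replace it with a confidence measure.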
Approx Sequential Dependency
[Figure: example sequence from Start to End checked against $g = (0, \infty)$; confidence = 67%]
Conditional Sequential Dependencies
[Figure: tableau with Start/End columns listing the intervals [1,6], [2,11], [7,12]; confidence ≥ 80%]
Conditional Sequential Dependencies
Example #1: $g = (0, \infty)$
Conditional Sequential Dependencies
Example #2: $g = [9,11]$; $g = [20, \infty)$
Contributions
• Introduce sequential dependencies (SDs)
– algorithm for computing confidence
• Tableau discovery for CSDs
– problem definition
– fast approximation algorithm
Approx Sequential Dependency
• Confidence: (N − OPS)/N, where OPS = min # of insertions + deletions
– Edit distance
• Ex: <5,9,12,25,31,30,34,40> with [4,6]
– del 12, ins 15, ins 20, del 31 ⇒ conf = 4/8
• Doesn't overpenalize for rare drops
– E.g., <5, 10, 15, 20, 30, 35, 40, 45> with [5,5]
• Penalize large gaps
– E.g., [3,5] with gap of 6 vs. 1000
Approx Sequential Dependency
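• How to compute OPS for $g = [G_1, G_2]$?
• dcost(d) = # insertions (or $\infty$) to extend <0> so that it ends in d
– E.g., [4,6]: dcost(6) = 1, dcost(7) = $\infty$, dcost(8) = 2
– $\lceil d/G_2 \rceil$ when $\lceil (d+1)/G_1 \rceil > \lceil d/G_2 \rceil$; else $\infty$
• Let $T(i)$ := min OPS to turn $\langle a_1, a_2, \ldots, a_i \rangle$ into a valid sequence $\langle \ldots, v \rangle$ ending in $v = a_i$
• Suppose $T(1), T(2), \ldots, T(i-1)$ already computed
• $T(i) = \min_j \{\, T(j) + (i-1-j) + [\mathrm{dcost}(a_i - a_j) - 1] \,\}$
– $O\!\left(\frac{G_2}{G_2 - G_1}\, N \log N\right)$ algorithm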
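The recurrence can be turned into a direct dynamic program. The sketch below is a plain quadratic-time version, not the faster algorithm cited on the slide. It assumes integer values and integer bounds with 1 ≤ G1 ≤ G2; the names `dcost`, `min_ops`, and `confidence` are illustrative. Two details are added as assumptions: a base case in which all earlier elements are deleted, and a final step that deletes everything after the last kept element.

```python
import math

def dcost(d, g1, g2):
    """Fewest steps, each of size in [g1, g2], that sum to exactly d (inf if impossible)."""
    if d <= 0:
        return math.inf               # assumes g1 >= 1, so gaps must be positive
    k = -(-d // g2)                   # ceil(d / g2): fewest steps that can reach d
    return k if k * g1 <= d else math.inf

def min_ops(a, g1, g2):
    """Minimum insertions + deletions making all consecutive gaps lie in [g1, g2]."""
    n = len(a)
    if n == 0:
        return 0
    T = [0] * n                       # T[i]: cost for prefix a[0..i] with a[i] kept
    for i in range(n):
        best = i                      # base case: delete every earlier element
        for j in range(i):
            steps = dcost(a[i] - a[j], g1, g2)
            if steps != math.inf:
                # keep a[j] and a[i], delete those between, insert steps - 1 values
                best = min(best, T[j] + (i - 1 - j) + (steps - 1))
        T[i] = best
    # delete whatever follows the last kept element
    return min(T[i] + (n - 1 - i) for i in range(n))

def confidence(a, g1, g2):
    """(N - OPS) / N for a non-empty sequence a."""
    return (len(a) - min_ops(a, g1, g2)) / len(a)

print(confidence([5, 9, 12, 25, 31, 30, 34, 40], 4, 6))
```

On the earlier example sequence with [4,6] this reproduces OPS = 4, and a gap of 1000 under [3,5] costs `dcost(1000, 3, 5) - 1` = 199 insertions versus 1 for a gap of 6.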
Tableau Discovery
• Assume underlying SD given
– Data often suggest ordering semantics
• Good tableau = small set of intervals
– Each interval satisfies confidence threshold
– Union satisfies support threshold
• Find maximal time intervals [i,j] s.t.
– Confidence satisfied in [i,j]
• Can we do better than testing all [i,j]'s?
Tableau Discovery: Candidates
• Relax constraint: confidence ≥ ĉ/(1+ε)
• For any interval I, there exists J s.t.
– (a) I ⊆ J and
– (b) |J| ≤ (1+ε)|I|
• ⇒ conf(J) ≥ conf(I)/(1+ε)
[Figure: interval I contained in a slightly longer interval J]
Tableau Discovery: Candidates
• Test just enough intervals:
– (a) lengths $1, (1+\delta), (1+\delta)^2, \ldots$
– (b) starting points $\delta, \delta(1+\delta), \delta(1+\delta)^2, \ldots$
Tableau Discovery: Candidates
• Processing cost:
– Intervals at level h have length $(1+\delta)^h$
– $N/(\delta(1+\delta)^h)$ intervals at level h
– $\log_{1+\delta} N$ total levels
– sum of lengths = $O((N/\delta)\log_{1+\delta} N) = O((N/\delta^2)\lg N)$
• Improvement:
– Interval lengths in [A,2A] start at δA, 2δA, 3δA, …
– Prefix property
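To make the candidate scheme concrete, here is a rough sketch that enumerates intervals with geometrically growing lengths and start points spaced proportionally to the length. The rounding rules, the half-open (start, end) index convention, and the name `candidate_intervals` are assumptions of this sketch; the slides do not specify them, and the [A,2A] improvement is not implemented.

```python
import math

def candidate_intervals(n, delta):
    """Enumerate candidate (start, end) index ranges over n points, end exclusive.

    Level h uses length ~ (1+delta)^h and start points spaced ~ delta*(1+delta)^h.
    Rounding to integers is an assumption made for illustration.
    """
    out = set()
    h = 0
    while True:
        raw = (1 + delta) ** h
        length = min(n, math.ceil(raw))
        step = max(1, int(delta * raw))      # spacing between start points
        for start in range(0, n, step):
            out.add((start, min(n, start + length)))
        if length >= n:                      # longest level reached
            break
        h += 1
    return sorted(out)

cands = candidate_intervals(1000, 0.5)
print(len(cands), "candidates vs", 1000 * 1001 // 2, "possible intervals")
```

The point of the comparison printed at the end is that the candidate set grows roughly linearly in N (for fixed δ) rather than quadratically.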
Tableau Discovery: Assembly
• Optimal solution in quadratic time
• Greedy partial set cover
• Can implement in linear time
• Constant performance ratio
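The greedy partial set cover step can be sketched as follows. This is a simple, unoptimized illustration, not the linear-time implementation mentioned above. It assumes the input intervals already satisfy the confidence threshold, uses half-open 0-based (start, end) ranges, and the name `greedy_tableau` is hypothetical.

```python
import math

def greedy_tableau(intervals, n, support):
    """Pick intervals greedily until at least support * n points are covered.

    intervals: list of (start, end) pairs, end exclusive, over points 0..n-1.
    Returns the chosen intervals in pick order, or None if the target is unreachable.
    """
    need = math.ceil(support * n)
    covered = [False] * n
    chosen, n_cov = [], 0
    while n_cov < need:
        best, gain = None, 0
        for s, e in intervals:
            g = sum(1 for k in range(s, e) if not covered[k])
            if g > gain:                     # strict: first interval wins ties
                best, gain = (s, e), g
        if best is None:                     # no interval adds new coverage
            return None
        chosen.append(best)
        for k in range(best[0], best[1]):
            covered[k] = True
        n_cov += gain
    return chosen

# Three overlapping candidate intervals over 12 points (made-up data)
print(greedy_tableau([(0, 6), (1, 11), (6, 12)], 12, 0.8))
```

With a support threshold of 0.8 a single interval suffices here; raising it to 1.0 forces all three to be picked.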
Summary of Results
• Tableau almost identical at small δ
• Significant speedup at small δ
• “Inflating” ĉ to (1+δ)ĉ works well
Experiments: Sample Tableau
Data: WeatherDates, conf ≥ 0.995, support ≥ 0.5, δ = 0.05
Experiments: Tableau Size
[Figure: tableau size on DowJones data, support 0.5; panels: gaps in [0,∞) and gaps in [0,5]]
Experiments: Scalability
[Figure: scalability; panels: gaps in [0,∞) and gaps in [4,6]]
Network data: support 0.5, conf 0.99
WeatherDates: support 0.5, conf 0.9
Case Study: Polled Data
conf ≥ 0.995, support ≥ 0.5, δ=0.05
Case Study: Stock Data
conf ≥ 0.995, support ≥ 0.5, δ=0.05
[Figure: Dow Jones 2-week moving average; axis ticks at $10^2$, $10^3$, $10^4$]
Conclusions
• Constraint-driven approach
– Define, discover, detect
• Use whatever semantics available
– Domain knowledge, expectation, etc.
• Model errors carefully
– Confidence measure
• Tableaux useful for summary
The End
Background
• Functional Dependency
– $X \to Y$: $\forall t_1, t_2:\ t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y]$
• Example
– title → salary
– What happens when data merged?
Background
ssn|name|title|salary
123|alice|manager|50
456|bob|sales|40
789|cathy|manager|50
• title → salary
ssn|name|company|title|salary
123|alice|ATT|manager|50
456|bob|ATT|sales|40
789|cathy|ATT|manager|50
012|david|IBM|engineer|30
345|emily|IBM|engineer|35
• [title,company] → salary?
– 100% support, 80% confidence
Hold Tableau (60% support, 100% confidence)
Company|Title|Salary
ATT|*|*
Fail Tableau (40% support, 50% confidence)
Company|Title|Salary
IBM|*|*
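The support and confidence figures for the merged table can be reproduced with a short computation. This sketch assumes confidence means the largest fraction of matching tuples that can be kept so that the FD holds, and support means the fraction of all tuples matching the pattern; the function name `support_confidence` and the dict-based row format are illustrative.

```python
from collections import Counter, defaultdict

ROWS = [
    {"ssn": "123", "name": "alice", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "456", "name": "bob", "company": "ATT", "title": "sales", "salary": 40},
    {"ssn": "789", "name": "cathy", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "012", "name": "david", "company": "IBM", "title": "engineer", "salary": 30},
    {"ssn": "345", "name": "emily", "company": "IBM", "title": "engineer", "salary": 35},
]

def support_confidence(rows, lhs, rhs, pattern=None):
    """Support and confidence of the FD lhs -> rhs on rows matching pattern.

    pattern maps column -> required constant; omitted columns act as wildcards.
    Assumes at least one row matches the pattern.
    """
    pattern = pattern or {}
    matched = [r for r in rows if all(r[c] == v for c, v in pattern.items())]
    groups = defaultdict(Counter)
    for r in matched:
        groups[tuple(r[c] for c in lhs)][r[rhs]] += 1
    # per LHS group, keep only the tuples sharing the most common RHS value
    kept = sum(max(counts.values()) for counts in groups.values())
    return len(matched) / len(rows), kept / len(matched)

fd = (["title", "company"], "salary")
print(support_confidence(ROWS, *fd))                      # whole table
print(support_confidence(ROWS, *fd, {"company": "ATT"}))  # hold tableau row
print(support_confidence(ROWS, *fd, {"company": "IBM"}))  # fail tableau row
```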
CFD Results
• Given FD, discover tableau:
– min tableau size
– subj. to support and confidence constraints
• Hardness:
– global conf: inapproximable
– local conf: NP-hard, fast approx algo