
Sequential Dependencies

Flip Korn, AT&T
Lukasz Golab, AT&T
Howard Karloff, AT&T
Avishek Saha, University of Utah
Divesh Srivastava, AT&T

Data Quality

• Cost to business: $600 billion
• Problems prevalent in measurement data
  – Equipment failure
  – Calibration/systemic errors
  – Configuration errors
  – Management errors

• Our goal: detection, not cleaning

Data Quality

• Principled approach: integrity constraints
  – Assert semantics
  – Deviations = quality issues
  – Allow approx for real-world data
  – Condition tableau discovery
• Domain often gives rise to semantics
  – missing, extraneous, out-of-order

Sequential Dependency Definition

• Sequential dependency $X \to_g Y$
  – $(y_i - y_{i-1}) \in g$, with the $y$'s sorted w.r.t. $x$-values
  – extension of functional dependency, $g = (0,\infty)$
• $X \to Y$: $\forall t_1, t_2$: $t_1[X] < t_2[X] \Rightarrow t_1[Y] < t_2[Y]$
• Example #1: $\text{date} \to_{[20,\infty)} \text{price}$
  – Prices increasing by at least 20 units
• Example #2: $\text{poll\#} \to_{[4,6]} \text{time}$
  – Consecutive polls within 4-6 mins
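
The definition above can be checked directly on a small relation. The following is a minimal sketch, not the authors' implementation: the function name `sd_violations` is hypothetical, and the gap set g is assumed to be a closed interval [lo, hi] (an open bound such as (0,∞) would need a strict comparison instead).

```python
import math
from typing import Iterable, List, Tuple


def sd_violations(rows: Iterable[Tuple[float, float]],
                  lo: float, hi: float = math.inf) -> List[int]:
    """Return positions i (in x-sorted order) where y_i - y_{i-1} is outside [lo, hi].

    Assumption of this sketch: g is treated as the closed interval [lo, hi].
    """
    # Sort by the X attribute, then look only at consecutive Y values.
    ys = [y for _, y in sorted(rows, key=lambda r: r[0])]
    return [i for i in range(1, len(ys)) if not (lo <= ys[i] - ys[i - 1] <= hi)]


# Example #2 style data: (poll#, time in minutes); the gap 10 -> 17 is 7 minutes.
polls = [(1, 0), (2, 5), (3, 10), (4, 17), (5, 21)]
print(sd_violations(polls, 4, 6))  # [3]
```

An empty result means the sequential dependency holds exactly; a non-empty result lists where consecutive gaps fall outside g.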

Approx Sequential Dependency

[Figure: example sequence with Start and End labels; g = (0,∞); Confidence = 67%]

Conditional Sequential Dependencies

[Figure: tableau with Start and End columns and intervals [1,6], [2,11], [7,12]; Confidence ≥ 80%]


Example #1: g = (0,∞)

Example #2: g = [9,11], g = [20,∞)

Contributions

• Introduce sequential dependencies (SDs)
  – algorithm for computing confidence
• Tableau Discovery for CSDs
  – problem definition
  – fast approximation algorithm


• Confidence: $(N - \mathit{OPS})/N$, where $\mathit{OPS}$ = min #(insertions + deletions)
  – Edit distance
• Ex: <5,9,12,25,31,30,34,40> with [4,6]
  – del 12, ins 15, ins 20, del 31 $\Rightarrow$ conf = 4/8
• Doesn’t overpenalize for rare drops
  – E.g., <5, 10, 15, 20, 30, 35, 40, 45> with [5,5]
• Penalize large gaps
  – E.g., [3,5] with gap of 6 vs. 1000


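• How to compute OPS for $g = [G_1, G_2]$?
• $\mathrm{dcost}(d)$ = # insertions (or $\infty$) to <0> to end in $d$
  – E.g., [4,6]: dcost(6) = 1, dcost(7) = $\infty$, dcost(8) = 2
  – $\lceil d/G_2 \rceil$ when $\lceil (d+1)/G_1 \rceil > \lceil d/G_2 \rceil$; else $\infty$
• Let $T(i)$ := OPS made to $\langle a_1, a_2, \dots, a_i \rangle \to \langle \dots, v \rangle$
• Suppose $T(1), T(2), \dots, T(i-1)$ already computed
• $T(i) = \min_j \{\, T(j) + (i-1-j) + [\mathrm{dcost}(a_i - a_j) - 1] \,\}$
  – $O\!\left(\frac{G_2}{G_2 - G_1}\, N \log N\right)$ algorithm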
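
The recurrence above can be sketched as a plain quadratic-time dynamic program. This is an illustrative sketch only, not the faster algorithm whose running time is quoted on the slide. Assumptions: values and gap bounds are integers with 1 ≤ G1 ≤ G2, each T(i) is taken to be the cost of a repaired prefix that keeps a_i as its last element, and the names `dcost`, `min_ops` and `confidence` are chosen here for illustration.

```python
import math
from typing import List, Union


def dcost(d: int, g1: int, g2: int) -> Union[int, float]:
    """Fewest steps, each in [g1, g2], that add up to d; math.inf if impossible."""
    if d <= 0:
        return math.inf
    k = -(-d // g2)  # ceil(d / g2)
    # k steps can reach d only if k * g1 <= d, i.e. k <= floor(d / g1).
    return k if k <= d // g1 else math.inf


def min_ops(a: List[int], g1: int, g2: int) -> int:
    """Minimum insertions + deletions so that all consecutive gaps lie in [g1, g2]."""
    n = len(a)
    if n == 0:
        return 0
    t = [0] * n  # t[i]: cost of repairing a[0..i] with a[i] kept as the last element
    for i in range(n):
        best = i  # option: delete everything before a[i]
        for j in range(i):
            c = dcost(a[i] - a[j], g1, g2)
            if not math.isinf(c):
                # keep a[j], delete a[j+1..i-1], insert c - 1 values in between
                best = min(best, t[j] + (i - 1 - j) + int(c) - 1)
        t[i] = best
    # Any elements after the last kept one are deleted.
    return min(t[i] + (n - 1 - i) for i in range(n))


def confidence(a: List[int], g1: int, g2: int) -> float:
    return (len(a) - min_ops(a, g1, g2)) / len(a)


print(min_ops([5, 9, 12, 25, 31, 30, 34, 40], 4, 6))     # 4
print(confidence([5, 9, 12, 25, 31, 30, 34, 40], 4, 6))  # 0.5
```

On the slide's second example, <5, 10, 15, 20, 30, 35, 40, 45> with [5,5], the same code needs a single insertion, which illustrates why a rare drop is not overpenalized.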

Tableau Discovery

• Assume underlying SD given
  – Data often suggest ordering semantics
• Good tableau = small set of intervals
  – Each interval satisfies confidence threshold
  – Union satisfies support threshold
• Find maximal time intervals [i,j] s.t.
  – Confidence satisfied in [i,j]
• Can we do better than testing all [i,j]’s?

Tableau Discovery: Candidates

• Relax constraint: confidence ≥ ĉ/(1+ε)
• For any interval I, there exists J s.t.
  – (a) I ⊆ J and
  – (b) |J| ≤ (1+ε)|I|
• conf(J) ≥ conf(I)/(1+ε)

[Figure: interval I contained in a slightly longer interval J]


• Test just enough intervals:
  – (a) lengths $1, (1+\delta), (1+\delta)^2, \dots$
  – (b) starting points at multiples of $\delta, \delta(1+\delta), \delta(1+\delta)^2, \dots$
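
The geometric family of candidates can be enumerated as in the sketch below. It is a rough illustration under stated assumptions: positions are the integers 0..n-1, lengths and start spacings are rounded to integers (the slides do not say how rounding is handled), and `candidate_intervals` is a hypothetical name.

```python
import math
from typing import List, Tuple


def candidate_intervals(n: int, delta: float) -> List[Tuple[int, int]]:
    """Half-open candidate intervals [start, end) over positions 0..n-1.

    Level h uses length about (1+delta)^h and start points spaced about
    delta*(1+delta)^h apart; integer rounding is an assumption of this sketch.
    """
    if n <= 0 or delta <= 0:
        raise ValueError("need n > 0 and delta > 0")
    out: List[Tuple[int, int]] = []
    seen = set()
    length = 1.0
    while True:
        size = min(n, math.ceil(length))
        step = max(1, math.floor(delta * length))
        for start in range(0, n - size + 1, step):
            iv = (start, start + size)
            if iv not in seen:  # rounding can repeat a size at small levels
                seen.add(iv)
                out.append(iv)
        if size >= n:
            break
        length *= 1 + delta
    return out


ivs = candidate_intervals(1000, 0.1)
print(len(ivs), "candidates instead of", 1000 * 1001 // 2)
```

Each candidate would then be tested against the relaxed confidence threshold; only the survivors are passed on to tableau assembly.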


• Processing cost:
  – Intervals at level $h$ have length $(1+\delta)^h$
  – $N/(\delta(1+\delta)^h)$ intervals at level $h$
  – $\log_{1+\delta} N$ total levels
  – sum of lengths = $O((N/\delta)\log_{1+\delta} N) = O((N/\delta^2)\lg N)$
• Improvement:
  – Interval lengths in $[A, 2A]$ start at $\delta A, 2\delta A, 3\delta A, \dots$
  – Prefix property
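
The processing-cost bound on this slide follows by multiplying, at each level, the number of intervals by their length and summing over the levels. A short derivation, assuming $0 < \delta \le 1$ so that $\ln(1+\delta) \ge \delta \ln 2$:

```latex
\sum_{h=0}^{\log_{1+\delta} N} \frac{N}{\delta(1+\delta)^{h}} \cdot (1+\delta)^{h}
  = \frac{N}{\delta}\,\bigl(\log_{1+\delta} N + 1\bigr)
  = O\!\left(\frac{N}{\delta}\cdot\frac{\ln N}{\ln(1+\delta)}\right)
  = O\!\left(\frac{N}{\delta^{2}}\,\lg N\right)
```

Every level contributes the same total length $N/\delta$, so the cost is driven by the number of levels.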

Tableau Discovery: Assembly

• Optimal solution in quadratic time
• Greedy partial set cover
• Can implement in linear time
• Constant performance ratio
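
The greedy partial set cover step can be illustrated with the naive sketch below. It is not the linear-time implementation mentioned on the slide. Assumptions: candidates are half-open integer intervals that already pass the confidence test, support is a fraction of the n positions, and `greedy_tableau` is a hypothetical name.

```python
import math
from typing import List, Set, Tuple


def greedy_tableau(intervals: List[Tuple[int, int]], n: int,
                   support: float) -> List[Tuple[int, int]]:
    """Repeatedly pick the interval covering the most uncovered positions
    until at least support * n positions are covered."""
    target = math.ceil(support * n)
    covered: Set[int] = set()
    chosen: List[Tuple[int, int]] = []
    remaining = list(intervals)
    while len(covered) < target and remaining:
        best = max(remaining, key=lambda iv: len(set(range(*iv)) - covered))
        gain = set(range(*best)) - covered
        if not gain:
            break  # nothing left can extend the cover
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen


print(greedy_tableau([(0, 6), (1, 11), (6, 12)], 12, 0.5))  # [(1, 11)]
```

Raising the support threshold forces the greedy loop to add further intervals, which is the trade-off between tableau size and coverage.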

Summary of Results

• Tableau almost identical at small δ
• Significant speedup at small δ
• “Inflating” ĉ to (1+δ)ĉ works well

Experiments: Sample Tableau

Data: WeatherDates, conf ≥ 0.995, support ≥ 0.5, δ = 0.05

Experiments: Tableau Size

[Figure: tableau size, panels "Gaps in [0,∞)" and "Gaps in [0,5]"; DowJones data, support 0.5]

Experiments: Scalability

[Figure: scalability, panels "Gaps in [0,∞)" and "Gaps in [4,6]"; Network data (support 0.5, conf 0.99) and WeatherDates (support 0.5, conf 0.9)]

Case Study: Polled Data

conf ≥ 0.995, support ≥ 0.5, δ=0.05

Case Study: Stock Data

conf ≥ 0.995, support ≥ 0.5, δ=0.05

[Figure: Dow Jones 2-week moving average; axis ticks at $10^2$, $10^3$, $10^4$]

Conclusions

• Constraint-driven approach
  – Define, discover, detect
• Use whatever semantics available
  – Domain knowledge, expectation, etc.
• Model errors carefully
  – Confidence measure

• Tableaux useful for summary

The End

Background

• Functional Dependency
  – $X \to Y$: $\forall t_1, t_2$: $t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y]$
• Example
  – $\text{title} \to \text{salary}$
  – What happens when data merged?


Background

ssn|name|title|salary
123|alice|manager|50
456|bob|sales|40
789|cathy|manager|50

• $\text{title} \to \text{salary}$

ssn|name|company|title|salary
123|alice|ATT|manager|50
456|bob|ATT|sales|40
789|cathy|ATT|manager|50
012|david|IBM|engineer|30
345|emily|IBM|engineer|35

• $[\text{title}, \text{company}] \to \text{salary}$?
  – 100% support, 80% confidence

Hold Tableau

Company|Title|Salary
ATT|*|*

60% support, 100% confidence

Fail Tableau

Company|Title|Salary
IBM|*|*

40% support, 50% confidence
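
The support and confidence figures on these background slides can be reproduced with a short sketch. Assumptions, labeled here because the slides do not spell them out: confidence is the largest fraction of rows that can be kept without any violation of the dependency, support is the fraction of rows matched by a tableau pattern, and the helper names `fd_confidence` and `pattern_stats` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]

ROWS: List[Row] = [
    {"ssn": "123", "name": "alice", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "456", "name": "bob", "company": "ATT", "title": "sales", "salary": 40},
    {"ssn": "789", "name": "cathy", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "012", "name": "david", "company": "IBM", "title": "engineer", "salary": 30},
    {"ssn": "345", "name": "emily", "company": "IBM", "title": "engineer", "salary": 35},
]


def fd_confidence(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> float:
    """Largest fraction of rows that can be kept so that lhs -> rhs holds exactly."""
    groups: Dict[tuple, Counter] = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    # Within each lhs group, keep only the most frequent rhs value.
    kept = sum(max(c.values()) for c in groups.values())
    return kept / len(rows)


def pattern_stats(rows: Sequence[Row], lhs: Sequence[str], rhs: str,
                  pattern: Dict[str, object]) -> Tuple[float, float]:
    """(support, confidence) of the dependency restricted to rows matching pattern."""
    matching = [r for r in rows if all(r[k] == v for k, v in pattern.items())]
    return len(matching) / len(rows), fd_confidence(matching, lhs, rhs)


LHS = ["title", "company"]
print(fd_confidence(ROWS, LHS, "salary"))                      # 0.8
print(pattern_stats(ROWS, LHS, "salary", {"company": "ATT"}))  # (0.6, 1.0)
print(pattern_stats(ROWS, LHS, "salary", {"company": "IBM"}))  # (0.4, 0.5)
```

The ATT pattern corresponds to the hold tableau and the IBM pattern to the fail tableau.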

CFD Results

• Given FD, discover tableau:
  – min tableau size
  – subj. to support and confidence constraints
• Hardness:
  – global conf: inapproximable
  – local conf: NP-hard, fast approx algo