The Impact of Duality on Data Representation Problems

Panagiotis Karras, HKU, June 14th, 2007


Page 2: Introduction

• Many data representation problems require the optimization of one parameter under a bound on one or more others.

• Classical approaches treat them in a direct manner, producing complicated solutions, and sometimes resorting to heuristics.

• Parameters involved have a monotonic relationship.

• Hence, an alternative approach is possible, based on dual problems.
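One way to write this duality down is sketched below. The notation is chosen here for illustration and does not appear on the slide: S(ε) stands for the minimum space needed to meet an error bound ε, and ε*(B) for the best error achievable within a space budget B. Monotonicity of S is what lets the space-bounded problem be answered by searching over error bounds.

```latex
% Assumed notation: S(\varepsilon) = minimum space that achieves error at most \varepsilon
\varepsilon_1 \le \varepsilon_2 \;\Longrightarrow\; S(\varepsilon_1) \ge S(\varepsilon_2),
\qquad
\varepsilon^*(B) = \min \{\, \varepsilon : S(\varepsilon) \le B \,\}
```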

Page 3: Outline

• Histograms.

• Restricted Haar Wavelet Synopses.

• Unrestricted Haar and Haar+ Synopses.

• l-Diversification in 1D.

• Compact Hierarchical Histograms.

Page 4: Histograms

• Approximate a data set [d1, d2, …, dn] with B buckets, si = [bi, ei, vi], so that a maximum-error metric is minimized.

• Classical solution: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005

$$E(i, b) = \min_{1 \le j < i} \max\{\, E(j, b-1),\; \varepsilon(j+1, i) \,\}$$

where $\varepsilon(j+1, i)$ is the error of a single bucket spanning $d_{j+1}, \ldots, d_i$.

Complexity: $O(nB \log^2 n)$

• Recent solutions:

Buragohain et al. ICDE 2007: $O(n \log n \log U)$

Guha and Shim TKDE 19(7) 2007: $O(n + B^2 \log^3 n)$ (linear for $n \ge B^2 \log^3 n$)

Example: $n = 2^{30} = 1{,}073{,}741{,}824 \Rightarrow B \le 199$
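As a quick arithmetic check of the linear-time condition, the sketch below finds the largest B for which B² log³ n ≤ n when n = 2^30. It assumes the logarithm is base 2 (so log n = 30); the function name is hypothetical.

```python
def largest_linear_budget(n: int, log_n: int) -> int:
    """Largest B with B^2 * log_n^3 <= n, using integer arithmetic only."""
    b = 0
    # Grow B while the next candidate still satisfies the condition.
    while (b + 1) ** 2 * log_n ** 3 <= n:
        b += 1
    return b


n = 2 ** 30
# Assumption: base-2 logarithm, so log n = 30 for n = 2^30.
print(n, largest_linear_budget(n, 30))  # prints "1073741824 199"
```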

Page 5: Histograms

• Solve the error-bounded problem.

Example with maximum absolute error bound ε = 2:

Data: 4 5 6 2 | 15 17 | 3 6 | 9 12 …

Buckets: [ 4 ] [ 16 ] [ 4.5 ] [ …

• Generalized to any weighted maximum-error metric.

Each value $d_i$ defines a tolerance interval $\left[ d_i - \frac{\varepsilon}{w_i},\; d_i + \frac{\varepsilon}{w_i} \right]$.

A bucket is closed when the running intersection of the tolerance intervals becomes empty.

Complexity: $O(n)$
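The greedy pass described on this slide can be sketched as follows. This is a minimal illustration, not the author's code: the function name, the 0-based bucket tuples, and the choice of the interval midpoint as the bucket value are assumptions, and weights are assumed to be positive.

```python
from math import inf
from typing import List, Optional, Sequence, Tuple

Bucket = Tuple[int, int, float]  # (begin index, end index, bucket value), 0-based


def error_bounded_histogram(
    data: Sequence[float],
    eps: float,
    weights: Optional[Sequence[float]] = None,
) -> List[Bucket]:
    """Fewest buckets such that w_i * |d_i - v| <= eps for every item."""
    if not data:
        return []
    if weights is None:
        weights = [1.0] * len(data)  # unit weights: maximum absolute error
    buckets: List[Bucket] = []
    begin, lo, hi = 0, -inf, inf  # running intersection of tolerance intervals
    for i, (d, w) in enumerate(zip(data, weights)):
        t_lo, t_hi = d - eps / w, d + eps / w  # tolerance interval of d_i
        new_lo, new_hi = max(lo, t_lo), min(hi, t_hi)
        if new_lo > new_hi:
            # Intersection became empty: close the bucket before item i.
            buckets.append((begin, i - 1, (lo + hi) / 2))
            begin, new_lo, new_hi = i, t_lo, t_hi
        lo, hi = new_lo, new_hi
    buckets.append((begin, len(data) - 1, (lo + hi) / 2))
    return buckets


print(error_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2))
# [(0, 3, 4.0), (4, 5, 16.0), (6, 7, 4.5), (8, 9, 10.5)]
```

The first three bucket values match the slide's example (4, 16, 4.5); the single pass over the data is what gives the linear running time.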

Page 6: Histograms

• Apply to the space-bounded problem.

Perform binary search in the domain of the error bound ε.

Complexity: $O(n \log \varepsilon^*)$, independent of the number of buckets B.

For an error value $\varepsilon$ requiring space $B' \le B$, with actual error $\varepsilon' \le \varepsilon$, run an optimality test:

Run the error-bounded algorithm under the constraint error $< \varepsilon'$ instead of error $\le \varepsilon'$.

If this requires space $\tilde{B} > B$, then the optimal solution has been reached.
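A minimal sketch of the binary search over the error bound follows. It is a simplification rather than the slide's procedure: it assumes integer data with unit weights, where a bucket's best error is (max − min)/2, so the candidate errors are multiples of 0.5 and the search can run over the integer k = 2ε without the separate optimality test. Function names are hypothetical.

```python
from typing import Sequence


def min_buckets(data: Sequence[int], k: int) -> int:
    """Fewest buckets such that max - min <= k in each (error bound k / 2)."""
    count, lo, hi = 0, 0, 0
    for d in data:
        if count == 0 or max(hi, d) - min(lo, d) > k:
            count += 1          # open a new bucket at this item
            lo = hi = d
        else:
            lo, hi = min(lo, d), max(hi, d)
    return count


def optimal_max_error(data: Sequence[int], budget: int) -> float:
    """Smallest maximum absolute error achievable with at most `budget` buckets."""
    if not data or budget < 1:
        raise ValueError("need non-empty data and a budget of at least 1")
    low, high = 0, max(data) - min(data)  # k = high always fits in one bucket
    while low < high:
        mid = (low + high) // 2
        if min_buckets(data, mid) <= budget:
            high = mid          # feasible: try a tighter error bound
        else:
            low = mid + 1       # infeasible: the bound must be relaxed
    return low / 2


data = [4, 5, 6, 2, 15, 17, 3, 6, 9, 12]
print(optimal_max_error(data, 4), optimal_max_error(data, 5))  # 2.0 1.5
```

The search relies on the monotonic relationship between error bound and required space: relaxing the bound never increases the number of buckets the greedy pass needs.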

Page 7: Restricted Haar Wavelet Synopses

[Figure: Haar wavelet error tree for the data 34 16 2 20 20 0 36 16. Pairwise averages: 25 11 10 26, then 18 18; overall average 18. Detail coefficients by level: 0; 7 −8; 9 −9 10 10.]

• Select a subset of the Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized.

• Classical solution: Garofalakis and Kumar PODS 2004; Guha VLDB 2005

$$E(i, v, b) = \min \begin{cases} \min_{0 \le b' \le b} \max\{\, E(i_L, v, b'),\; E(i_R, v, b - b') \,\} \\ \min_{0 \le b' \le b - 1} \max\{\, E(i_L, v + z_i, b'),\; E(i_R, v - z_i, b - b' - 1) \,\} \end{cases}$$

Complexity: $O(n^2)$
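The decomposition shown in the slide's figure can be reproduced with a short sketch. It assumes the unnormalized Haar transform (pairwise averages and half-differences) and an input whose length is a power of two; the function name is hypothetical.

```python
from typing import List, Sequence


def haar_decompose(data: Sequence[float]) -> List[float]:
    """Return [overall average, detail coefficients from top level to bottom]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(data)
    details: List[float] = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level = [(a - b) / 2 for a, b in pairs]      # half-differences
        averages = [(a + b) / 2 for a, b in pairs]   # pairwise averages
        details = level + details                    # coarser levels go first
    return [averages[0]] + details


print(haar_decompose([34, 16, 2, 20, 20, 0, 36, 16]))
# [18.0, 0.0, 7.0, -8.0, 9.0, -9.0, 10.0, 10.0]
```

With this sign convention, a detail coefficient is added on the way to a left child and subtracted on the way to a right child, which matches the v + z and v − z terms in the recurrence.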

Page 8: Restricted Haar Wavelet Synopses

• Solve the error-bounded problem. Muthukrishnan FSTTCS 2005

$$S(i, v) = \min\{\, S(i_L, v) + S(i_R, v),\; S(i_L, v + z_i) + S(i_R, v - z_i) + 1 \,\}$$

Local search within each of the $n / \log n$ subtrees in the bottom $\log \log n$ Haar tree levels.

Complexity: $O\!\left(\frac{n^2}{\log n}\right)$

• Apply to the space-bounded problem.

Complexity: $O\!\left(\frac{n^2}{\log n} \log \varepsilon^*\right)$: no significant advantage
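The error-bounded recurrence S(i, v) can be sketched directly with memoization. This is the plain recurrence only, not the subquadratic local-search technique cited on the slide. Assumptions made here: coefficients are given in error-tree order [c0, c1, …, c(n−1)], node i has children 2i and 2i+1, leaves n … 2n−1 map to the data, and the error metric is maximum absolute error. The function name is hypothetical.

```python
from functools import lru_cache
from math import inf
from typing import Sequence


def min_coefficients(coeffs: Sequence[float], data: Sequence[float], eps: float):
    """Fewest retained Haar coefficients with max absolute error <= eps."""
    n = len(data)

    @lru_cache(maxsize=None)
    def s(i: int, v: float):
        if i >= n:  # leaf: v is the reconstructed value of data[i - n]
            return 0 if abs(data[i - n] - v) <= eps else inf
        z = coeffs[i]
        drop = s(2 * i, v) + s(2 * i + 1, v)              # coefficient discarded
        keep = 1 + s(2 * i, v + z) + s(2 * i + 1, v - z)  # coefficient retained
        return min(drop, keep)

    # The overall average c0 sits above node 1 and is either dropped or kept.
    return min(s(1, 0), 1 + s(1, coeffs[0]))


coeffs = [18, 0, 7, -8, 9, -9, 10, 10]
data = [34, 16, 2, 20, 20, 0, 36, 16]
print(min_coefficients(coeffs, data, 18))  # 1: keeping only the average suffices
print(min_coefficients(coeffs, data, 0))   # 7: every non-zero coefficient is needed
```

A space-bounded answer could then be obtained by binary search over eps, calling this routine at each step, which is the pattern the slide describes.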

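Page 9: Unrestricted Haar and Haar+ Synopses

• Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized.

• Classical solutions: Guha and Harb KDD 2005, SODA 2006

$$E(i, v, b) = \min_{z \in S_i^v,\; 0 \le b' \le b - [z \ne 0]} \max\{\, E(i_L, v + z, b'),\; E(i_R, v - z, b - b' - [z \ne 0]) \,\}$$

[Figure: Haar+ tree over data values d0, d1, d2, d3, with root coefficient c0 and triads C1, C2, C3 holding coefficients c1 to c9; edges carry + and − signs.]

• Haar+: Karras and Mamoulis ICDE 2007

$$E(i, v, b) = \min \begin{cases} \min_{z_h \in S_{i_H}^v,\; 0 \le b' \le b - [z_h \ne 0]} \max\{\, E(i_L, v + z_h, b'),\; E(i_R, v - z_h, b - b' - [z_h \ne 0]) \,\} \\ \min_{z_l \in S_{i_L}^v,\; 0 \le b' \le b - [z_l \ne 0]} \max\{\, E(i_L, v + z_l, b'),\; E(i_R, v, b - b' - [z_l \ne 0]) \,\} \\ \min_{z_r \in S_{i_R}^v,\; 0 \le b' \le b - [z_r \ne 0]} \max\{\, E(i_L, v, b'),\; E(i_R, v + z_r, b - b' - [z_r \ne 0]) \,\} \end{cases}$$

Complexity: time $O(R^2 n \log n \log^2 B)$, space $O\!\left(RB \log \frac{n}{B} + n\right)$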

Page 10: Unrestricted Haar and Haar+ Synopses

• Solve the error-bounded problem.

Unrestricted Haar:

$$S(i, v) = \min_{z \in S_i^v} \{\, S(i_L, v + z) + S(i_R, v - z) + [z \ne 0] \,\}$$

Haar+:

$$S(i, v) = \min \begin{cases} \min_{z_h \in S_{i_H}^v} \{\, S(i_L, v + z_h) + S(i_R, v - z_h) + [z_h \ne 0] \,\} \\ \min_{z_l \in S_{i_L}^v} \{\, S(i_L, v + z_l) + S(i_R, v) + [z_l \ne 0] \,\} \\ \min_{z_r \in S_{i_R}^v} \{\, S(i_L, v) + S(i_R, v + z_r) + [z_r \ne 0] \,\} \end{cases}$$

Complexity: time $O(R^2 n \log n)$, space $O(R \log n + n)$

• Apply to the space-bounded problem.

Complexity: time $O(R^2 n \log n \log \varepsilon^*)$: significant time & space advantage

Page 11: l-Diversification in 1D

• Given a database table T(A1, A2, …, An), a quasi-identifier attribute set QT is a subset of attributes which can reveal the personal identity of records.

• An equivalence class with respect to the quasi-identifier attribute set QT is a set of records indistinguishable in the projection of T on QT.

• A database table T with quasi-identifier set QT and sensitive attribute S conforms to the l-diversity property iff each equivalence class in T with respect to QT has at least l well-represented values of S [Machanavajjhala et al. ICDE 2006].

• Utility metric: extent of equivalence class (group).

• Other parameter: outliers, records whose quasi-identifier values are suppressed.
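The l-diversity definition can be illustrated with a small check. This sketch assumes the simplest reading of "well-represented", namely at least l distinct sensitive values per equivalence class; the record layout, attribute names, and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Sequence


def is_l_diverse(
    records: Iterable[Dict[str, Hashable]],
    quasi_identifiers: Sequence[str],
    sensitive: str,
    l: int,
) -> bool:
    """True iff every equivalence class has at least l distinct sensitive values."""
    classes = defaultdict(set)
    for record in records:
        # Records with equal quasi-identifier values are indistinguishable.
        key = tuple(record[a] for a in quasi_identifiers)
        classes[key].add(record[sensitive])
    return all(len(values) >= l for values in classes.values())


# Hypothetical table with a generalized Age attribute as the quasi-identifier.
table = [
    {"age": "10-30", "disease": "Flu"},
    {"age": "10-30", "disease": "Parkinson's"},
    {"age": "50-70", "disease": "Flu"},
    {"age": "50-70", "disease": "Flu"},
]
print(is_l_diverse(table, ["age"], "disease", 2))  # False: the 50-70 class has one value
print(is_l_diverse(table, ["age"], "disease", 1))  # True
```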

Page 12: l-Diversification in 1D

• A two-dimensional example.

[Figure: two plots of records over the attributes Age and Postcode (tick values 10 to 90 and 1 to 7), labeled with the sensitive values Lead Poisoning, Parkinson's, Flu, and Hyperthyroidism.]

Page 13: l-Diversification in 1D

[Figure: axes quasi-identifier and sensitive value.]

• Study the problem in one dimension (a single quasi-identifier).

• Total order exists.

• Similar to histogram construction.

• Polynomially tractable.

Page 14: l-Diversification in 1D

[Figure: records r1 to r6 plotted by quasi-identifier against sensitive value, with sensitive value domains D1 to D4.]

• Groups are consecutive in each sensitive value domain.

• Groups are ordered the same in each domain.

• Example for l = 3.


Page 16: l-Diversification in 1D

[Figure: axes quasi-identifier and sensitive value, with extents e and E marked.]

• Given an interval I of extent E, which includes c items with m different sensitive values, the number of possible boundaries in I is

$$B_c^m = \begin{cases} O\!\left(\left(\frac{c}{m}\right)^{m}\right), & c > m \\ O(2^c), & c \le m \end{cases}$$

and the number of possible groups in I is

$$C_c^m = \begin{cases} O\!\left(\left(\frac{c}{m}\right)^{2m}\right), & c > m \\ O(3^c), & c \le m \end{cases}$$

Page 17: l-Diversification in 1D

• Solve the outlier minimization problem.

[Equation: recurrence defining N, the minimum number of outliers, as a minimum over terms indexed by a, b, and c.]

[Complexity: time and space expressions in C, B, w, m, c, and n log n.]

• Apply to the accuracy maximization problem.

[Complexity: time expression as above with an additional logarithmic factor.]

• Apply to the privacy maximization problem.

[Complexity: time expression as above with an additional logarithmic factor.]

Page 18: Compact Hierarchical Histograms

• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.

[Figure: CHH tree with coefficients c0 to c6 over data values d0, d1, d2, d3.]

• Heuristic solutions: Reiss et al. VLDB 2006

[Complexity: time and space expressions in B, n, log n, and log B.]

"The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al. VLDB 2006]

Page 19: Compact Hierarchical Histograms

• Solve the error-bounded problem. Next-to-bottom level case

[Equation: S(i, v) defined by cases 0, 1, and 2 over the values a, b, c, d and the incoming value v, together with the associated s*.]

[Figure: node c_i with children c_2i and c_2i+1 over the value pairs (a, b) and (c, d), showing the assigned values z and 0.]

Compact Hierarchical Compact Hierarchical HistogramsHistograms• Solve the error-bounded problem. General, recursive case

0000

00000000

0000

**

**

**

,2

,1

,

,

RLRL

RLRLRLRL

RLRL

v

vv

v

ss

ss

ss

viS

RL

RL

RL

ii

ii

ii

*0

*0 ,,

RL iRiL sviSvsviSv RL

Complexity: nnOn

On 2log

0 1log

22

time

space

• Apply to the space-bounded problem.

Complexity: Polynomially Tractable

nOOn

log

02

nnnO logloglog *2

Page 21: Conclusions

• Offline data representation problems under constraints are more easily solvable through their counterparts optimizing another parameter.

• Dual-problem-based algorithms are simpler, more scalable, more elegant, and more memory-parsimonious than the direct ones.

• In the CHH case, the dual-problem-based algorithm achieves an optimal solution to the maximum-error longest-prefix-match CHH partitioning problem, which was considered intractable.

• Future: assessment of privacy and CHH solutions.

Page 22: Related Work

• H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998

• S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004

• M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004

• S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005

• S. Guha and B. Harb. Wavelet Synopses for Data Streams: Minimizing Non-Euclidean Error. KDD 2005

• S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005

• S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006

• F. Reiss, M. Garofalakis, and J. M. Hellerstein. Compact histograms for hierarchical identifiers. VLDB 2006

• A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ICDE 2006

• P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007

Page 23: Thank you! Questions?