On the Anonymization of Sparse High-Dimensional Data
Gabriel Ghinita¹, Yufei Tao², Panos Kalnis¹
¹ National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg
² Chinese University of Hong Kong, [email protected]
Publishing Transaction Data
Retail chain-owned shopping cart data
Infer consumer spending patterns
Correlations among purchased items
e.g., 90% of cereal buyers also buy milk
What about privacy?
Privacy Threat
[Figure: example transactions with quasi-identifying items and sensitive items]
Privacy Paradigm
ℓ-diversity: prevent association between quasi-identifier and sensitive attributes
Create groups of transactions such that the frequency of any sensitive attribute (SA) value in a group is < 1/p
Objective
Enforce privacy
Preserve correlations among items
Challenge: high data dimensionality
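The group condition above can be checked mechanically. The following is a minimal sketch, assuming each transaction is a set of item names and the sensitive items are given as a separate set; the function name `satisfies_privacy` and the item names are hypothetical. It follows the strict bound `< 1/p` as written on the slide; the comparison sits on a single line so it is easy to change if a non-strict bound is intended.

```python
from collections import Counter
from fractions import Fraction


def satisfies_privacy(group, sensitive_items, p):
    """Return True if every sensitive item occurs in fewer than a 1/p
    fraction of the transactions in `group` (strict bound, as on the slide)."""
    counts = Counter()
    for transaction in group:
        # Count each sensitive item at most once per transaction.
        for item in set(transaction) & sensitive_items:
            counts[item] += 1
    limit = Fraction(1, p)
    # Exact rational arithmetic avoids floating-point edge cases at the bound.
    return all(Fraction(c, len(group)) < limit for c in counts.values())


# Hypothetical example data.
sensitive = {"pregnancy_test", "viagra"}
group = [{"wine", "meat"}, {"milk", "pregnancy_test"}, {"cream"}]
print(satisfies_privacy(group, sensitive, 2))
print(satisfies_privacy(group, sensitive, 3))
```

In this example the sensitive item appears in one of three transactions, so the group passes for p = 2 (1/3 < 1/2) and fails the strict test for p = 3.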
Data Re-organization
Band Matrix Organization
PRESERVES CORRELATIONS!
Published Data
[Figure: published groups, each with a summary of sensitive items]
Contributions
Novel data representation
Preserves correlation among items
Efficient heuristic for group formation
Linear time in the data size
Supports multiple sensitive items
State-of-the-art: Mondrian [FWR06]
Generalization-based data-space partitioning similar to k-d-trees
Split recursively as long as the privacy condition still holds
Constrained global recoding
[Figure: Mondrian partitioning of an Age (20 to 60) by Weight (40 to 100) data space, k = 2]
[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.
GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
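To make the k-d-tree style partitioning concrete, here is a rough sketch of a Mondrian-like recursive median split for numeric records. It is a simplification written for illustration, not the published algorithm: the function name `mondrian`, the choice of the widest dimension first, and the sample (Age, Weight) points are all assumptions. Each final group is published as a bounding box, which is where the information loss comes from when the number of dimensions grows.

```python
def mondrian(records, k):
    """Recursively split numeric records at the median of one dimension,
    as long as both halves keep at least k records.
    Returns a list of (bounding_box, records) pairs."""
    dims = range(len(records[0]))

    def width(d):
        values = [r[d] for r in records]
        return max(values) - min(values)

    # Try dimensions from the widest value range to the narrowest.
    for d in sorted(dims, key=width, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)

    # No allowable split: generalize the group to its bounding box.
    box = [(min(r[d] for r in records), max(r[d] for r in records)) for d in dims]
    return [(box, records)]


# Hypothetical (Age, Weight) points.
points = [(20, 40), (25, 60), (40, 80), (45, 100), (60, 50), (58, 90)]
for box, members in mondrian(points, k=2):
    print(box, members)
```

With only two dimensions the boxes stay reasonably tight; with thousands of item columns, as in transaction data, each box has to cover nearly the full range of most dimensions.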
State-of-the-art: Anatomy [XT06]
Permutation-based method
Discloses exact QID values
Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia
"Anatomized" table (|G|! permutations)
Age  ZipCode
42   52000
47   43000
51   32000
62   41000
55   27000
67   55000
Disease
Ulcer(1)
Pneumonia(1)
Flu(1)
Dyspepsia(1)
Gastritis(1)
Dyspepsia(1)
RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS
[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.
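The permutation-based idea can be sketched as follows: the quasi-identifier columns are published exactly, and the sensitive column is published separately as per-group counts, so that within a group any matching between rows and sensitive values is possible. This is an illustrative sketch only. The function name `anatomize` is hypothetical, and forming groups from consecutive rows is an assumption made to keep the example short; it is not how Anatomy selects its groups.

```python
from collections import Counter


def anatomize(records, group_size):
    """Split (age, zipcode, disease) records into a QID table and a
    sensitive table that are linked only through a group id."""
    qid_table, sensitive_table = [], []
    for start in range(0, len(records), group_size):
        group_id = start // group_size + 1
        group = records[start:start + group_size]
        # Exact quasi-identifier values are disclosed, tagged with the group id.
        for age, zipcode, _disease in group:
            qid_table.append((age, zipcode, group_id))
        # Sensitive values are disclosed only as counts per group.
        counts = Counter(disease for _, _, disease in group)
        for disease, count in counts.items():
            sensitive_table.append((group_id, disease, count))
    return qid_table, sensitive_table


# Rows of the table shown on the slide.
table = [
    (42, 52000, "Ulcer"),
    (47, 43000, "Pneumonia"),
    (51, 32000, "Flu"),
    (55, 27000, "Gastritis"),
    (62, 41000, "Dyspepsia"),
    (67, 55000, "Dyspepsia"),
]
qid, sens = anatomize(table, group_size=3)
for row in qid:
    print(row)
for row in sens:
    print(row)
```

Because group membership here ignores what the rows have in common, rows placed in the same group need not resemble each other, which is the weakness the slide points out.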
Band Matrix Representation
Bandwidth = U + L + 1 (U: upper bandwidth, L: lower bandwidth)
Minimizing bandwidth is NP-hard
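As a small worked example of the bandwidth formula, the sketch below takes U as the largest distance of a nonzero entry above the main diagonal and L as the largest distance below it, then returns U + L + 1. The function name `bandwidth` and the two sample matrices are assumptions made for illustration.

```python
def bandwidth(matrix):
    """Bandwidth = U + L + 1 for a matrix given as a list of rows."""
    upper = lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)  # distance above the diagonal
                lower = max(lower, i - j)  # distance below the diagonal
    return upper + lower + 1


banded = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
scattered = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
print(bandwidth(banded))     # 3
print(bandwidth(scattered))  # 7
```

The first matrix keeps all nonzero entries next to the diagonal (U = L = 1), while the second has entries in the far corners (U = L = 3).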
Reverse Cuthill-McKee (RCM)
Heuristic for bandwidth minimization
Solves the corresponding graph labeling problem
Permutes rows and columns
Complexity: O(N · D · log D)
N = number of matrix rows (# transactions)
D = maximum degree of any vertex
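The graph labeling view can be sketched in a few lines: run a breadth-first search from a low-degree vertex, visit neighbours in order of increasing degree, and reverse the resulting order. The sketch below is a plain textbook version on an undirected graph given as an adjacency dictionary; how the graph is derived from the transaction matrix is not covered here, and the function names and the sample graph are hypothetical.

```python
from collections import deque


def reverse_cuthill_mckee(adjacency):
    """adjacency: dict mapping each vertex to the set of its neighbours
    (undirected graph). Returns the vertices in RCM order."""
    degree = {v: len(neighbours) for v, neighbours in adjacency.items()}
    visited, order = set(), []
    # Cover every connected component, starting from a lowest-degree vertex.
    for start in sorted(adjacency, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours by increasing degree.
            for w in sorted(adjacency[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    order.reverse()
    return order


def max_label_distance(adjacency, order):
    """Largest distance between the new labels of two adjacent vertices."""
    position = {v: i for i, v in enumerate(order)}
    return max(abs(position[u] - position[v]) for u in adjacency for v in adjacency[u])


# A path 0 - 3 - 1 - 4 - 2 whose original labels are scrambled.
graph = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
original = sorted(graph)
relabeled = reverse_cuthill_mckee(graph)
print(max_label_distance(graph, original))   # 3
print(max_label_distance(graph, relabeled))  # 1
```

Sorting the neighbours of each vertex by degree is where the log D factor in the stated complexity comes from.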
Group Formation
Correlation-aware Anonymization of High-Dimensional Data (CAHD)
Use the order given by RCM
Consecutive transactions are highly correlated
O(pN) complexity
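A simplified sketch of order-based group formation is given below. It is not the CAHD procedure itself: it only illustrates the idea of grouping each sensitive transaction with nearby rows of the RCM order that do not repeat a sensitive item already in the group. The function name `form_groups`, the `window` parameter that bounds how far the scan looks, and the sample data are all assumptions. Each group ends up with p transactions and every sensitive item at most once, so a sensitive item's frequency in a group is at most 1/p; meeting the strict `< 1/p` bound would need one more member per group.

```python
def form_groups(transactions, p, window):
    """transactions: list of (qid_items, sensitive_items) pairs of sets,
    already sorted in RCM order. Returns (groups, leftover) as index lists."""
    n = len(transactions)
    free = [True] * n
    groups = []
    for i, (_, sensitive) in enumerate(transactions):
        if not free[i] or not sensitive:
            continue
        members, used = [i], set(sensitive)
        # Walk outwards from row i so that the nearest rows are tried first.
        for distance in range(1, window + 1):
            for j in (i - distance, i + distance):
                if len(members) == p:
                    break
                # Accept a free row that repeats no sensitive item of the group.
                if 0 <= j < n and free[j] and not (transactions[j][1] & used):
                    members.append(j)
                    used |= transactions[j][1]
            if len(members) == p:
                break
        if len(members) == p:
            for j in members:
                free[j] = False
            groups.append(sorted(members))
    # Rows that were not grouped; the caller still has to validate them.
    leftover = [i for i in range(n) if free[i]]
    return groups, leftover


# Hypothetical transactions in RCM order: (QID items, sensitive items).
rows = [
    ({"a", "b"}, set()),
    ({"a", "b"}, {"s1"}),
    ({"b", "c"}, set()),
    ({"c", "d"}, {"s1"}),
    ({"c", "d"}, set()),
    ({"d"}, set()),
]
groups, leftover = form_groups(rows, p=2, window=2)
print(groups)    # [[0, 1], [2, 3]]
print(leftover)  # [4, 5]
```

Bounding the scan by `window` keeps the work per sensitive transaction constant, which is the kind of locality that makes a running time linear in N plausible.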
Group Formation
Experimental Evaluation
RCM Visualization
Experimental Setting
BMS dataset
Compare with hybrid PermMondrian (PM)
Combines Mondrian with Anatomy
Query workload
Reconstruction error
Reconstruction Error vs. p
Execution Time
Conclusions
Anonymizing transaction data
High dimensionality
Preserving correlation
Future work
Different encodings for data representation
Enhance correlation among consecutive rows