View
216
Download
0
Embed Size (px)
Citation preview
Top-k Queries on Uncertain Data: On score Distribution and Typical Answers
Presented by Qian Wan, HKUSTBased on [1][2]
Introduction: Uncertain Data Management
Modeling Uncertain Data Possible Worlds Model
Uncertain data management Top-k, Join, kNN, Skyline, Indexing,
etc. Uncertain Data Mining
Clustering, Classification, Frequent Pattern, Outlier Detection
Introduction: Data Introduction: Data RepresentationRepresentation
A simple way to representing probabilistic data
Each tuple has a confidence Pr(instance)=
∏Pr(attendance) x ∏Pr(absence)
Mutual Exclusion Constraints for each tuple*
Scoring function*
Introduction: Other WorksIntroduction: Other Works
K tuples that co-exist in a possible world U-Topk
Returning tuples according to marginal distribution of top-k results U-kRanks and PT-k
Introduction: Other Works Introduction: Other Works (Example)(Example)
Introduction: Other Works Introduction: Other Works (drawback)(drawback)
The top-k result may be atypical The distribution of scores is not used
Introduction: c-Typical-Top kIntroduction: c-Typical-Top k
3-Typical-Top 2 scores of this example is {118, 183, 235}
Expected distance is 6.6 The vectors are {(t2, t6),
(T7,T6), (T7,T3)}
Algorithm
Distribution of top-2 tuples’
scores
Algorithm – Naïve approach
INPUT: tuples with membership probabilities
OUTPUT: Top-k scores distribution IDEA: recursively go through all
possible worlds to calculate all probabilities, until reaching a threshold
Algorithm – a DP approach D(i,j): score
distribution of top-j starting at Ti.
The main problem is D(1,k) (?)
Algorithm – a DP approach Transformation:
D(i,j) = TF[D(i+1,j),D(i+1,j-1)]
D(i+1,j): For each (v,p) add (v,
p(1-pi)) D(i+1,j-1):
For each (v,p) add (v+si, p*pi)
Merge duplicate items Bottom up DP Approximation
Handling More Real Scenarios
Handling Mutually Exclusive Rules Compress the ME group Refine by lead tuple region
Handling Ties When two tuples have the same
score, rank them according to probability
Algorithm
118, 0.2
183, 0.15
235, 0.12
0
0.05
0.1
0.15
0.2
0.25
0 50 100 150 200 250
Series1
3-Typical-Top 2 scores
c-Typical-Top kc-Typical-Top k
3-Typical-Top 2 scores of this example is {118, 183, 235}
Expected distance is 6.6 The vectors are {(t2, t6),
(T7,T6), (T7,T3)}
Computing c-Typical-Top kComputing c-Typical-Top k
Define F^a(j) to be the optimal objective over {sj, …, sn} where a is the number of typical scores.
G^a(j) means the same
Computing c-Typical-Top kComputing c-Typical-Top k
Just solve the two function optimization problem, using DP
Boundary conditions
Empirical Study 3 -Typical VS U-Topk
Empirical Study
Empirical Study
Q&A
Reference [1] Charu C. Aggarwal, Philip S. Yu “A Survey
of Uncertain Data Algorithms and Applications”, IEEE Transactions on Knowledge and Data Engineering, 2009
[2] Tingjian Ge, Stan Zdonik, Samuel Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. SIGMOD,
2009