Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers

Presented by Qian Wan, HKUST. Based on [1][2]


Introduction: Uncertain Data Management

Modeling Uncertain Data: Possible Worlds Model

Uncertain Data Management: Top-k, Join, kNN, Skyline, Indexing, etc.

Uncertain Data Mining: Clustering, Classification, Frequent Pattern, Outlier Detection

Introduction: Data Representation

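A simple way to represent probabilistic data

Each tuple has a confidence (membership probability)

Pr(instance) = ∏ Pr(attendance) × ∏ Pr(absence), i.e. the product of the membership probabilities of the tuples present in the instance, times the product of one minus the membership probability of each absent tuple

Mutual Exclusion Constraints for each tuple*

Scoring function*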
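The instance probability above can be illustrated with a minimal Python sketch. It assumes tuples are independent (mutual exclusion constraints are ignored here); the function name `world_probability` and the three-tuple table are hypothetical and not taken from the slides.

```python
from math import prod

def world_probability(probs, present):
    """Probability of one possible world, assuming independent tuples.

    probs:   dict mapping tuple id -> membership probability
    present: set of tuple ids that appear in this world
    """
    # Product over tuples that attend the world.
    p_present = prod(probs[t] for t in probs if t in present)
    # Product over tuples that are absent from the world.
    p_absent = prod(1.0 - probs[t] for t in probs if t not in present)
    return p_present * p_absent

# Hypothetical table; the values are illustrative only.
probs = {"T1": 0.5, "T2": 0.4, "T3": 1.0}
print(world_probability(probs, {"T1", "T3"}))
```

Summing this quantity over every subset of tuples gives 1, which is a quick sanity check for any table.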

Introduction: Other Works

U-Topk: the k tuples that co-exist in a possible world

U-kRanks and PT-k: return tuples according to the marginal distribution of top-k results
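As a rough illustration of the U-Topk idea, the sketch below enumerates all possible worlds of a small table and returns the top-k vector with the largest total probability. It assumes independent tuples and is exponential in the number of tuples, so it is only meant for toy inputs; `u_topk` and the sample data are hypothetical names and values, not the published algorithm.

```python
from collections import defaultdict
from itertools import product

def u_topk(tuples, k):
    """tuples: list of (id, score, probability), assumed independent.

    Returns (top-k vector, probability) for the vector that is the top-k
    answer with the highest total probability over all possible worlds.
    Assumes at least one world contains k or more tuples.
    """
    vector_prob = defaultdict(float)
    for presence in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        alive = []
        for (tid, score, prob), here in zip(tuples, presence):
            world_prob *= prob if here else 1.0 - prob
            if here:
                alive.append((score, tid))
        if len(alive) < k:
            continue  # this world has no top-k vector
        top = tuple(tid for _, tid in sorted(alive, reverse=True)[:k])
        vector_prob[top] += world_prob
    return max(vector_prob.items(), key=lambda kv: kv[1])

# Hypothetical data; the values are illustrative only.
table = [("T1", 100, 0.5), ("T2", 80, 0.4), ("T3", 60, 1.0)]
best_vector, best_prob = u_topk(table, 2)
```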

Introduction: Other Works (Example)

Introduction: Other Works (Drawback)

The top-k result may be atypical

The distribution of scores is not used

Introduction: c-Typical-Top k

The 3-Typical-Top 2 scores of this example are {118, 183, 235}

Expected distance is 6.6

The vectors are {(T2, T6), (T7, T6), (T7, T3)}

Algorithm

Distribution of top-2 tuples' scores

Algorithm – Naïve approach

INPUT: tuples with membership probabilities

OUTPUT: top-k score distribution

IDEA: recursively go through all possible worlds to calculate all probabilities, until reaching a threshold
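The naïve idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: tuples are independent, every possible world is enumerated iteratively rather than recursively, and the threshold-based early stop mentioned above is omitted. The name `topk_score_distribution_naive` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product

def topk_score_distribution_naive(tuples, k):
    """tuples: list of (score, probability), assumed independent.

    Returns {total score of the top-k tuples: probability}. Worlds with
    fewer than k tuples have no top-k vector and contribute nothing, so
    the probabilities may sum to less than 1.
    """
    dist = defaultdict(float)
    for presence in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        scores = []
        for (score, prob), here in zip(tuples, presence):
            world_prob *= prob if here else 1.0 - prob
            if here:
                scores.append(score)
        if world_prob == 0.0 or len(scores) < k:
            continue  # impossible world, or no top-k vector in it
        dist[sum(sorted(scores, reverse=True)[:k])] += world_prob
    return dict(dist)

# Hypothetical data; the values are illustrative only.
naive_dist = topk_score_distribution_naive([(100, 0.5), (80, 0.4), (60, 1.0)], 2)
```

The cost is exponential in the number of tuples, which is the motivation for a dynamic programming formulation.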

Algorithm – a DP approach

D(i,j): score distribution of the top-j tuples starting at T_i

The main problem is to compute D(1,k)

Algorithm – a DP approach

Transformation: D(i,j) = TF[D(i+1,j), D(i+1,j-1)], where s_i and p_i are the score and membership probability of T_i

From D(i+1,j): for each (v, p), add (v, p(1 - p_i))

From D(i+1,j-1): for each (v, p), add (v + s_i, p * p_i)

Merge duplicate items

Bottom-up DP

Approximation
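The transformation can be sketched as a bottom-up dynamic program in Python. This is a minimal sketch, assuming independent tuples sorted by descending score, with D(i,0) = {(0, 1)} and D(n+1,j) empty for j > 0 as boundary cases (these boundaries are an assumption, not stated on the slide). The approximation step and mutual exclusion rules are not covered. The name `topk_score_distribution_dp` and the sample data are hypothetical.

```python
from collections import defaultdict

def topk_score_distribution_dp(tuples, k):
    """tuples: list of (score, probability), assumed independent.

    Returns D(1, k) as {total top-k score: probability}.
    """
    ordered = sorted(tuples, key=lambda t: t[0], reverse=True)
    # nxt[j] holds D(i+1, j); start past the last tuple.
    nxt = [{0: 1.0}] + [{} for _ in range(k)]
    for s_i, p_i in reversed(ordered):
        cur = [{0: 1.0}]  # D(i, 0): the empty top-0 has score 0
        for j in range(1, k + 1):
            d = defaultdict(float)
            for v, p in nxt[j].items():      # T_i absent
                d[v] += p * (1.0 - p_i)
            for v, p in nxt[j - 1].items():  # T_i present, so it is in the top-j
                d[v + s_i] += p * p_i
            cur.append(dict(d))              # dict keys merge duplicate scores
        nxt = cur
    return nxt[k]

# Hypothetical data; the values are illustrative only.
dp_dist = topk_score_distribution_dp([(100, 0.5), (80, 0.4), (60, 1.0)], 2)
```

Only two rows of the table are kept at a time, since D(i, ·) depends only on D(i+1, ·).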

Handling More Real Scenarios

Handling Mutually Exclusive Rules

Compress the ME group

Refine by lead tuple region

Handling Ties

When two tuples have the same score, rank them according to probability

Algorithm

[Figure: distribution of top-2 scores (x-axis: score, 0 to 250; y-axis: probability, 0 to 0.25), with three labeled points (118, 0.2), (183, 0.15), (235, 0.12): the 3-Typical-Top 2 scores]

c-Typical-Top k

The 3-Typical-Top 2 scores of this example are {118, 183, 235}

Expected distance is 6.6

The vectors are {(T2, T6), (T7, T6), (T7, T3)}

Computing c-Typical-Top k

Define F^a(j) to be the optimal objective over {sj, …, sn} where a is the number of typical scores.

G^a(j) means the same

Computing c-Typical-Top k

Solve the two-function optimization problem using DP

Boundary conditions
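To make the selection problem concrete, here is a hedged Python sketch that picks c typical scores from a score distribution by minimizing the probability-weighted distance from each score to its nearest typical score. It is not the two-function F/G recurrence of the slides: it swaps in a plain interval-partition dynamic program, assumes typical scores are chosen among the scores in the distribution, assumes c >= 1, and does not renormalize probabilities. The name `typical_scores` and the sample distribution are hypothetical.

```python
def typical_scores(dist, c):
    """dist: {score: probability}. Returns (expected distance, typical scores)."""
    pts = sorted(dist.items())
    n = len(pts)
    if c >= n:
        return 0.0, [s for s, _ in pts]

    def segment(l, r):
        # Best single representative for pts[l:r] and its weighted distance.
        best = None
        for m in range(l, r):
            cost = sum(p * abs(s - pts[m][0]) for s, p in pts[l:r])
            if best is None or cost < best[0]:
                best = (cost, pts[m][0])
        return best

    INF = float("inf")
    # dp[a][j]: (cost, chosen scores) covering pts[:j] with a representatives.
    dp = [[(INF, [])] * (n + 1) for _ in range(c + 1)]
    dp[0][0] = (0.0, [])
    for a in range(1, c + 1):
        for j in range(a, n + 1):
            for l in range(a - 1, j):
                prev_cost, prev_sel = dp[a - 1][l]
                if prev_cost == INF:
                    continue
                cost, rep = segment(l, j)
                if prev_cost + cost < dp[a][j][0]:
                    dp[a][j] = (prev_cost + cost, prev_sel + [rep])
    return dp[c][n]

# Hypothetical distribution; the values are illustrative only.
cost1, picks1 = typical_scores({140: 0.2, 160: 0.3, 180: 0.2}, 1)
cost2, picks2 = typical_scores({140: 0.2, 160: 0.3, 180: 0.2}, 2)
```

The sketch recomputes segment costs on every call, which keeps it short but slow; precomputing them would be the obvious refinement for larger distributions.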

Empirical Study

3-Typical vs. U-Topk

Empirical Study

Empirical Study

Q&A

References

[1] Charu C. Aggarwal, Philip S. Yu. A Survey of Uncertain Data Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, 2009.

[2] Tingjian Ge, Stan Zdonik, Samuel Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. SIGMOD, 2009.
