Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. Presented by Qian Wan, HKUST. Based on [1][2].


Page 2

Introduction: Uncertain Data Management

Modeling uncertain data: Possible Worlds Model

Uncertain data management: Top-k, Join, kNN, Skyline, Indexing, etc.

Uncertain data mining: Clustering, Classification, Frequent Pattern, Outlier Detection

Page 3

Introduction: Data Representation

A simple way to represent probabilistic data

Each tuple has a confidence

Pr(instance) = ∏ Pr(attendance) × ∏ Pr(absence), taken over the tuples present in the instance and the tuples absent from it, respectively

Mutual Exclusion Constraints for each tuple*

Scoring function*
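The instance probability above can be sketched in a few lines. This is a minimal illustration that assumes independent tuples (no mutual exclusion rules); the function name `world_probability` and the tuple ids are hypothetical, not taken from the slides.

```python
def world_probability(confidence, present):
    """Probability of one possible-world instance.

    confidence: dict mapping tuple id -> membership probability
    present:    set of tuple ids that appear in this instance
    Assumption: tuples are independent (no mutual exclusion rules).
    """
    prob = 1.0
    for tid, p in confidence.items():
        # A present tuple contributes p, an absent one contributes (1 - p).
        prob *= p if tid in present else (1.0 - p)
    return prob


if __name__ == "__main__":
    conf = {"t1": 0.5, "t2": 0.25}
    print(world_probability(conf, {"t1"}))
```

With mutual exclusion rules the per-tuple factors are no longer independent, so this sketch would need to multiply per-group probabilities instead.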

Page 4

Introduction: Other Works

U-Topk: k tuples that co-exist in a possible world

U-kRanks and PT-k: returning tuples according to the marginal distribution of top-k results

Page 5

Introduction: Other Works (Example)

Page 6

Introduction: Other Works (Drawbacks)

The top-k result may be atypical

The distribution of scores is not used

Page 7

Introduction: c-Typical-Top k

The 3-Typical-Top 2 scores of this example are {118, 183, 235}

The expected distance is 6.6

The vectors are {(T2, T6), (T7, T6), (T7, T3)}
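One way to read the expected distance is as the probability-weighted distance from each possible top-k total score to its nearest typical score. The sketch below assumes that reading; the function name `expected_distance` and the toy distribution are illustrative and are not the slide's example data.

```python
def expected_distance(distribution, typical):
    """Expected distance from a random top-k score to the nearest typical score.

    distribution: iterable of (score, probability) pairs
    typical:      non-empty list of typical scores
    """
    return sum(p * min(abs(v - t) for t in typical) for v, p in distribution)


if __name__ == "__main__":
    toy = [(10, 0.5), (20, 0.25), (40, 0.25)]  # hypothetical distribution
    print(expected_distance(toy, [10, 40]))
```

A smaller value means the chosen typical scores represent the distribution more closely.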

Page 8

Algorithm

Distribution of top-2 tuples’ scores

Page 9

Algorithm – Naïve approach

INPUT: tuples with membership probabilities

OUTPUT: top-k score distribution

IDEA: recursively go through all possible worlds to calculate all probabilities, until reaching a threshold
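The naïve idea can be sketched as a recursion over tuples in descending score order. This is only an illustration under stated assumptions: tuples are independent, the "threshold" is interpreted as a minimum branch probability below which a branch is pruned, and worlds with fewer than k tuples are dropped. The name `naive_topk_distribution` is hypothetical.

```python
def naive_topk_distribution(tuples, k, threshold=0.0):
    """Enumerate possible worlds to get the distribution of top-k total scores.

    tuples: list of (score, membership probability), assumed independent
    Returns a dict: total score of the top-k tuples -> probability.
    """
    ts = sorted(tuples, key=lambda t: t[0], reverse=True)
    dist = {}

    def walk(i, picked, total, prob):
        if prob == 0.0 or prob < threshold:
            return  # impossible branch, or pruned by the (assumed) threshold
        if picked == k:
            # The top-k is fixed; lower-scored tuples no longer matter.
            dist[total] = dist.get(total, 0.0) + prob
            return
        if i == len(ts):
            return  # this world has fewer than k tuples
        score, p = ts[i]
        walk(i + 1, picked + 1, total + score, prob * p)  # tuple present
        walk(i + 1, picked, total, prob * (1.0 - p))      # tuple absent

    walk(0, 0, 0, 1.0)
    return dist


if __name__ == "__main__":
    print(naive_topk_distribution([(10, 0.5), (8, 0.5), (5, 1.0)], k=2))
```

The recursion is exponential in the number of tuples in the worst case, which is what motivates a dynamic-programming formulation.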

Page 10

Algorithm – a DP approach

D(i,j): score distribution of the top-j tuples, starting at T_i (that is, over tuples T_i, …, T_n in descending score order)

The main problem is to compute D(1,k)

Page 11

Algorithm – a DP approach

Transformation: D(i,j) = TF[D(i+1,j), D(i+1,j-1)]

From D(i+1,j): for each (v,p), add (v, p(1-p_i))

From D(i+1,j-1): for each (v,p), add (v+s_i, p*p_i)

Merge duplicate items

Bottom-up DP

Approximation
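The transformation can be sketched as a bottom-up table over i, keeping only the row for i+1. The boundary conditions here are assumptions, since the slides do not state them: D(i,0) = {(0, 1)}, and D(n+1, j) is empty for j > 0, so worlds with fewer than k tuples are dropped. Tuples are assumed independent; the approximation step is not shown.

```python
from collections import defaultdict


def topk_score_distribution(tuples, k):
    """Bottom-up DP for the distribution of the top-k total score.

    tuples: list of (score s_i, probability p_i), assumed independent
    Returns a dict: total score -> probability, i.e. D(1, k).
    """
    ts = sorted(tuples, key=lambda t: t[0], reverse=True)
    # nxt[j] holds D(i+1, j). Assumed boundary: top-0 has score 0 with
    # probability 1, and top-j (j > 0) of no tuples is empty.
    nxt = [{0: 1.0}] + [{} for _ in range(k)]
    for s_i, p_i in reversed(ts):
        cur = [{0: 1.0}]
        for j in range(1, k + 1):
            d = defaultdict(float)
            for v, p in nxt[j].items():       # T_i absent
                d[v] += p * (1.0 - p_i)
            for v, p in nxt[j - 1].items():   # T_i present
                d[v + s_i] += p * p_i
            cur.append(dict(d))               # dict keys merge duplicates
        nxt = cur
    return nxt[k]


if __name__ == "__main__":
    print(topk_score_distribution([(10, 0.5), (8, 0.5), (5, 1.0)], k=2))
```

Using a dictionary keyed by total score performs the "merge duplicate items" step implicitly.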

Page 12

Handling More Real Scenarios

Handling mutually exclusive (ME) rules: compress the ME group; refine by lead tuple region

Handling ties: when two tuples have the same score, rank them according to probability

Page 13

Algorithm

[Figure: chart titled "3-Typical-Top 2 scores", with labeled points (118, 0.2), (183, 0.15), (235, 0.12); horizontal axis from 0 to 250, vertical axis from 0 to 0.25]

Page 14

c-Typical-Top k

The 3-Typical-Top 2 scores of this example are {118, 183, 235}

The expected distance is 6.6

The vectors are {(T2, T6), (T7, T6), (T7, T3)}

Page 15

Computing c-Typical-Top k

Define F^a(j) to be the optimal objective over {s_j, …, s_n}, where a is the number of typical scores.

G^a(j) means the same

Page 16

Computing c-Typical-Top k

Solve the two-function optimization problem using DP

Boundary conditions
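The slides do not spell out the F/G recursion or its boundary conditions, so the sketch below swaps in a standard technique instead: a dynamic program for the weighted one-dimensional c-median problem. It assumes typical scores are chosen from the scores in the distribution and that the objective is the expected distance to the nearest typical score. The name `typical_scores` is hypothetical.

```python
def typical_scores(distribution, c):
    """Pick c typical scores minimizing expected distance to the nearest one.

    distribution: list of (score, probability) pairs with distinct scores
    Returns (expected distance, sorted list of chosen scores).
    Not the slides' F/G recursion: a generic 1-D c-median DP that relies on
    optimal groups being contiguous in sorted score order.
    """
    pts = sorted(distribution)
    n = len(pts)
    c = min(c, n)

    def segment(a, b):
        # Best single representative for pts[a..b]: (cost, score).
        return min(
            (sum(p * abs(v - pts[m][0]) for v, p in pts[a:b + 1]), pts[m][0])
            for m in range(a, b + 1)
        )

    inf = float("inf")
    # best[j][b]: cheapest way to cover the first b points with j scores.
    best = [[(inf, [])] * (n + 1) for _ in range(c + 1)]
    best[0][0] = (0.0, [])
    for j in range(1, c + 1):
        for b in range(1, n + 1):
            for a in range(j, b + 1):  # last group covers points a..b
                prev_cost, prev_reps = best[j - 1][a - 1]
                if prev_cost == inf:
                    continue
                cost, rep = segment(a - 1, b - 1)
                if prev_cost + cost < best[j][b][0]:
                    best[j][b] = (prev_cost + cost, prev_reps + [rep])
    return best[c][n]


if __name__ == "__main__":
    toy = [(10, 0.5), (20, 0.25), (40, 0.25)]  # hypothetical distribution
    print(typical_scores(toy, 2))
```

This version recomputes segment costs naively, which is fine for small distributions; prefix sums would be the natural next step for larger ones.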

Page 17

Empirical Study

3-Typical vs. U-Topk

Page 18

Empirical Study

Page 19

Empirical Study

Page 20

Q&A

Page 21

References

[1] Charu C. Aggarwal, Philip S. Yu. A Survey of Uncertain Data Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, 2009.

[2] Tingjian Ge, Stan Zdonik, Samuel Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. SIGMOD, 2009.