Mining Uncertain Data (Sebastiaan van Schaaik)


Slide 1/26: Seminar web data extraction: Mining uncertain data
Sebastiaan van Schaik (Sebastiaan.van.Schaik@comlab.ox.ac.uk)
20 January 2011

Slide 2/26: Introduction
Focus of this presentation: mining of frequent patterns and association rules from (uncertain) data.
Example applications:
  - discover regularities in customer transactions;
  - analysing log files: determine how visitors use a website.
Based on:
  - Mining Uncertain Data with Probabilistic Guarantees [9] (KDD 2010);
  - Frequent Pattern Mining with Uncertain Data [1] (KDD 2009);
  - A Tree-Based Approach for Frequent Pattern Mining from Uncertain Data [6] (PAKDD 2008).

Slide 3/26: Introduction & running example
Frequent pattern (itemset): items that occur sufficiently often. Example: {fever, headache}
Association rule: a set of items implying another set of items. Example: {fever, headache} → {nausea}

Running example: patient diagnosis database
  t1  Cheng   {severe cold}
  t2  Andrey  {yellow fever, haemochromatosis}
  t3  Omer    {schistosomiasis, syringomyelia}
  t4  Tim     {Wilson's disease}
  t5  Dan     {Hughes-Stovin syndrome}   (Yellow fever?)
  t6  Bas     {Henoch-Schönlein purpura}

Slide 4/26: Measuring interestingness: support & confidence
Support of an itemset X:
  sup(X): the number of entries (rows, transactions) that contain X
Confidence of an association rule X → Y:
  conf(X → Y) = sup(X ∪ Y) / sup(X)
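The two measures on the support & confidence slide are easy to state in code. A minimal Python sketch (the toy transactions and the function names are illustrative, not from the slides):

```python
# Certain (non-probabilistic) transactions, each a set of items.
transactions = [
    {"fever", "headache", "nausea"},
    {"fever", "headache"},
    {"fever"},
]

def support(itemset, db):
    """sup(X): number of transactions that contain every item of X."""
    return sum(1 for t in db if itemset <= t)

def confidence(x, y, db):
    """conf(X -> Y) = sup(X u Y) / sup(X)."""
    return support(x | y, db) / support(x, db)

print(support({"fever", "headache"}, transactions))                 # 2
print(confidence({"fever", "headache"}, {"nausea"}, transactions))  # 0.5
```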
Slide 5/26: Finding association rules: Apriori (1)
Agrawal et al. introduced Apriori in 1994 [2] to mine association rules:
1. Find all frequent itemsets X_i in database D (X_i is frequent iff sup(X_i) > minsup):
   a. Candidate generation: generate all possible itemsets of length k (starting with k = 1) from the frequent itemsets of length k - 1;
   b. Test the candidates, discard infrequent itemsets;
   c. Repeat with k = k + 1.
Important observation: all subsets X' of a frequent itemset X are frequent (the Apriori property). This is used to prune candidates before the test step (b).
Example: if X' = {fever} is not frequent in database D, then X = {fever, headache} cannot be frequent.

Slide 6/26: Finding association rules: Apriori (2)
Apriori continued:
2. Extract association rules from the frequent itemsets X. For each X_i ∈ X:
   a. Generate all non-empty subsets S of X_i. For each S:
   b. Test the confidence of the rule S → (X_i \ S).
Example: itemset X = {fever, headache, nausea} is frequent, so test:
  {fever, headache} → {nausea}
  {fever, nausea} → {headache}
  {nausea, headache} → {fever}
  {fever} → {headache, nausea}
  (...)

Slide 7/26: Introduction to uncertain data
Data might be uncertain, for example:
  - location detection using multiple RFID sensors (triangulation);
  - sensor readings (temperature, humidity) are noisy;
  - face recognition;
  - patient diagnosis.
Challenge: how do we model uncertainty and take it into account when mining frequent itemsets and association rules?
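The level-wise loop of steps 1.a-1.c, including the Apriori pruning of candidates that have an infrequent subset, can be sketched as follows (a toy database; the slide's strict sup(X) > minsup is written as >= here, a cosmetic difference):

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent-itemset mining in the style of Agrawal's Apriori.

    Returns a dict mapping each frequent itemset (a frozenset) to its support.
    """
    freq = {}
    level = [frozenset([i]) for i in {i for t in db for i in t}]
    k = 1
    while level:
        # Test candidates of length k, discard infrequent ones.
        survivors = []
        for cand in level:
            s = sum(1 for t in db if cand <= t)
            if s >= minsup:
                freq[cand] = s
                survivors.append(cand)
        # Generate candidates of length k + 1; Apriori pruning: every
        # k-subset of a candidate must itself already be frequent.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == k + 1
                      and all(frozenset(sub) in freq
                              for sub in combinations(a | b, k))})
        k += 1
    return freq

db = [{"fever", "headache", "nausea"},
      {"fever", "headache"},
      {"fever", "nausea"},
      {"headache"}]
freq = apriori(db, minsup=2)
print(freq[frozenset({"fever", "headache"})])   # 2
```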
Slide 8/26: Existential probabilities
Existential probability: a probability is associated with each item in a tuple, expressing the odds that the item belongs to that tuple.
Important assumption: tuple and item independence!

Simplified probabilistic diagnosis database (adapted from [6]):
  t1  Cheng   { 0.9 : a,   0.72 : d,   0.718 : e,  0.8 : f }
  t2  Andrey  { 0.9 : a,   0.81 : c,   0.718 : d,  0.72 : e }
  t3  Omer    { 0.875 : b, 0.857 : c }
  t4  Tim     { 0.9 : a,   0.72 : d,   0.718 : e }
  t5  Dan     { 0.875 : b, 0.857 : c,  0.05 : d }
  t6  Bas     { 0.875 : b, 0.1 : f }

Slide 9/26: Possible worlds
D = {t_1, t_2, ..., t_n} (n transactions)
t_j = {(p_(j,1), i_1), ..., (p_(j,m), i_m)} (m items in each transaction)
D can be expanded to possible worlds: W = {W_1, ..., W_(2^nm)}.
For one specific world W_x (in which, for example, t_1 contains only item d):
  Pr[W_x] = (1 - p_(1,a)) · p_(1,d) · (1 - p_(1,e)) · (1 - p_(1,f)) · p_(2,a) · ... · p_(6,f)
          ≈ 0.1 · 0.72 · 0.28 · 0.2 · 0.9 · ... · 0.1 ≈ 0.00000021
(one of the 2^18 possible worlds of the example database)
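Under the tuple- and item-independence assumption, the probability of a single possible world is a product over all item entries: p for every item the world keeps, 1 - p for every item it drops. A minimal sketch using the diagnosis table above (the dictionary layout and function name are illustrative assumptions):

```python
# Probabilistic database: transaction id -> {item: existential probability}.
pdb = {
    "t1": {"a": 0.9, "d": 0.72, "e": 0.718, "f": 0.8},
    "t2": {"a": 0.9, "c": 0.81, "d": 0.718, "e": 0.72},
    "t3": {"b": 0.875, "c": 0.857},
    "t4": {"a": 0.9, "d": 0.72, "e": 0.718},
    "t5": {"b": 0.875, "c": 0.857, "d": 0.05},
    "t6": {"b": 0.875, "f": 0.1},
}

def world_probability(world, pdb):
    """Pr[W]: multiply p for every item present in world W and (1 - p)
    for every item absent, across all transactions (independence)."""
    pr = 1.0
    for tid, items in pdb.items():
        present = world.get(tid, set())
        for item, p in items.items():
            pr *= p if item in present else (1.0 - p)
    return pr

# The example database has 18 item entries, hence 2^18 possible worlds.
n_items = sum(len(items) for items in pdb.values())
print(2 ** n_items)   # 262144

# One specific world, e.g. the one where t1 keeps only item d:
wx = {tid: set(items) for tid, items in pdb.items()}
wx["t1"] = {"d"}
print(world_probability(wx, pdb))
```

Summing `world_probability` over all 2^18 worlds would give exactly 1, which is a handy sanity check on any such model.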
Slide 10/26: Introduction to mining uncertain data
Approaches to mining frequent itemsets from uncertain data:
  - U-Apriori [4] and p-Apriori [9]
  - UF-growth [6]
  - UFP-tree [1]
  - ...
Further focus:
  - UF-growth: mining without candidate generation;
  - p-Apriori: pruning using Chernoff bounds.

Slide 11/26: Expected support
The support of an itemset X turns into a random variable:
  E[sup(X)] = Σ_{W_i ∈ W} Pr[W_i] · sup_{W_i}(X)
Enumerating all possible worlds is infeasible; however, because of the independence assumptions, this collapses to
  E[sup(X)] = Σ_{t_j ∈ D} Π_{x ∈ X} Pr[x, t_j]
(see [7, 6])

Slide 12/26: Expected support (2)
Expected support of the itemset X = {a, d} in the patient diagnosis database. In any single world W_i, sup_{W_i}(X) is an ordinary count (e.g. sup_{W_x}(X) = 2 in the example world). Over all worlds:
  E[sup(X)] = Σ_{W_i ∈ W} Pr[W_i] · sup_{W_i}(X)
            = Σ_{t_j ∈ D} Π_{x ∈ X} Pr[x, t_j]
            = 0.9 · 0.72 + 0.9 · 0.718 + 0 · 0 + 0.9 · 0.72 + 0 · 0.05 + 0 · 0 = 1.9422
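Thanks to independence, the sum over all worlds collapses to one pass over the transactions. A sketch computing E[sup({a, d})] with the probabilities as given in the diagnosis table (using 0.718 for Andrey's d, the sum comes to about 1.94; the layout and function name are illustrative):

```python
# Probabilistic database: transaction id -> {item: existential probability}.
pdb = {
    "t1": {"a": 0.9, "d": 0.72, "e": 0.718, "f": 0.8},
    "t2": {"a": 0.9, "c": 0.81, "d": 0.718, "e": 0.72},
    "t3": {"b": 0.875, "c": 0.857},
    "t4": {"a": 0.9, "d": 0.72, "e": 0.718},
    "t5": {"b": 0.875, "c": 0.857, "d": 0.05},
    "t6": {"b": 0.875, "f": 0.1},
}

def expected_support(itemset, pdb):
    """E[sup(X)] = sum over transactions t of prod_{x in X} Pr[x, t].

    A transaction missing any item of X contributes a zero term.
    """
    total = 0.0
    for items in pdb.values():
        prod = 1.0
        for x in itemset:
            prod *= items.get(x, 0.0)
        total += prod
    return total

# 0.9*0.72 + 0.9*0.718 + 0 + 0.9*0.72 + 0 + 0
print(round(expected_support({"a", "d"}, pdb), 4))   # 1.9422
```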
Slide 13/26: Frequent itemsets in probabilistic databases
An itemset X is frequent iff:
  - UF-growth: E[sup(X)] > minsup (also used in [4, 1] and many others);
  - p-Apriori: Pr[sup(X) > minsup] ≥ minprob.

Slide 14/26: Introduction to UF-growth
Apriori versus UF-growth:
  - Apriori-like algorithms generate and test candidate itemsets;
  - UF-growth [6] (based on FP-growth [5]) grows a tree from a probabilistic database.
Outline of the procedure (example follows):
1. First scan: determine the expected support of all items;
2. Second scan: create a branch for each transaction (merging identical nodes when possible).
Each node contains: an item; its probability; its occurrence count. Example: (a, 0.9, 2)
An itemset X is frequent iff E[sup(X)] > minsup.
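The two-scan construction outlined above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the first scan accumulates expected supports, the second inserts each transaction as a branch, and nodes are merged only when both item and probability coincide, which is exactly why a node also carries an occurrence count, as in the slide's (a, 0.9, 2):

```python
# Probabilistic database: transaction id -> {item: existential probability}.
pdb = {
    "t1": {"a": 0.9, "d": 0.72, "e": 0.718, "f": 0.8},
    "t2": {"a": 0.9, "c": 0.81, "d": 0.718, "e": 0.72},
    "t3": {"b": 0.875, "c": 0.857},
    "t4": {"a": 0.9, "d": 0.72, "e": 0.718},
    "t5": {"b": 0.875, "c": 0.857, "d": 0.05},
    "t6": {"b": 0.875, "f": 0.1},
}

class UFNode:
    """A UF-growth tree node: an item, its probability, an occurrence count."""
    def __init__(self, item, prob):
        self.item, self.prob, self.count = item, prob, 0
        self.children = {}   # (item, prob) -> UFNode

def build_uf_tree(pdb, minsup):
    # First scan: expected support of every single item.
    esup = {}
    for items in pdb.values():
        for item, p in items.items():
            esup[item] = esup.get(item, 0.0) + p
    frequent = {i for i, s in esup.items() if s > minsup}

    # Second scan: insert each transaction as a branch, keeping only
    # frequent items (ordered by descending expected support); two nodes
    # are merged only if both item AND probability are identical.
    root = UFNode(None, None)
    for items in pdb.values():
        node = root
        for item, p in sorted(items.items(), key=lambda kv: -esup[kv[0]]):
            if item not in frequent:
                continue
            key = (item, p)
            if key not in node.children:
                node.children[key] = UFNode(item, p)
            node = node.children[key]
            node.count += 1
    return root, esup

root, esup = build_uf_tree(pdb, minsup=1.0)
a_node = root.children[("a", 0.9)]
print(a_node.item, a_node.prob, a_node.count)   # a 0.9 3
```

With this database and minsup = 1.0, item f (expected support 0.9) is filtered out in the first scan, and the identical branches of t1 and t4 merge node by node, so the (d, 0.72) child under (a, 0.9) ends up with count 2.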