Data Mining Apriori FP Growth Arafat


Mining Frequent Patterns, Associations and Correlations
Md. Yasser Arafat
MS Student, Dept. of CSE, DU
April 6, 2015

Topics Covered
- Frequent Patterns
- Association
- Correlation
- Support & Confidence
- Closed Patterns and Max-Patterns
- Apriori Algorithm
- FP-Growth
- Comparison between Apriori and FP-Growth
- Correlation Analysis

Frequent Patterns
Frequent patterns are patterns, such as itemsets, subsequences, or substructures, that appear in a dataset frequently. They help in data classification, clustering, and in mining associations, correlations, and other interesting relationships among data. Frequent pattern mining has become an important data mining task and a focused theme in data mining research.

Association
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository.

Correlation
Correlation is a mutual relationship or connection between two or more things. The main goal is to find correlated itemsets of interest.

Support & Confidence
Find all the rules A => B with minimum support and confidence:
- support, s: the probability that a transaction contains A ∪ B
- confidence, c: the conditional probability that a transaction containing A also contains B

support(A => B) = P(A ∪ B)
confidence(A => B) = P(B|A) = support(A ∪ B) / support(A)

Example (min_sup = 50%)
Consider the following four transactions:

TID | Items
 1  | A, B, C
 2  | A, C
 3  | A, D
 4  | B, D

Frequent itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{D}              | 50%
{A, C}           | 50%

For the rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
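These numbers are easy to verify programmatically. A minimal Python sketch (the helper names support and confidence are my own, not from the slides):

def support(itemset, transactions):
    # Fraction of transactions containing every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    # confidence(A => B) = support(A ∪ B) / support(A)
    return support(a | b, transactions) / support(a, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "D"}]
print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.666...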

Closed Patterns and Max-Patterns
An itemset X is a closed frequent itemset in a data set D if X is frequent and there exists no proper super-itemset Y such that Y has the same support count as X in D.
An itemset X is a maximal frequent itemset in a data set D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.

Example
Exercise: Suppose there are only two transactions, {a1, ..., a100} and {a1, ..., a50}, and let min_sup = 1.
What is the set of closed itemsets?
- {a1, ..., a100}: 1
- {a1, ..., a50}: 2
What is the set of max-patterns?
- {a1, ..., a100}: 1
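The two definitions can be checked mechanically. A small, illustrative sketch (it assumes a support map restricted to the itemsets of interest; the function name is my own):

def closed_and_maximal(supports, min_count):
    # supports: {frozenset(itemset): support count}; assumed to contain
    # every frequent itemset of interest.
    freq = {s: c for s, c in supports.items() if c >= min_count}
    closed = {s for s, c in freq.items()
              if not any(s < t and c == freq[t] for t in freq)}
    maximal = {s for s in freq if not any(s < t for t in freq)}
    return closed, maximal

a100 = frozenset(f"a{i}" for i in range(1, 101))
a50 = frozenset(f"a{i}" for i in range(1, 51))
closed, maximal = closed_and_maximal({a100: 1, a50: 2}, min_count=1)
# closed == {a100, a50}; maximal == {a100}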

Apriori Algorithm
Apriori finds frequent itemsets by candidate generation.
Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
Apriori pruning principle: if any pattern is infrequent, its supersets should not be generated or tested.

Process:
1. Scan the database once to get the frequent 1-itemsets.
2. For each level k, generate length-(k+1) candidates from the length-k frequent patterns.
3. Scan the database and remove the infrequent candidates.
4. Terminate when no candidate set can be generated.

Pseudo-code (a runnable Python sketch follows):
1: Find all large 1-itemsets
2: for (k = 2; Lk-1 is non-empty; k++)
3: {  Ck = apriori-gen(Lk-1)
4:    for each c in Ck, initialise c.count to zero
5:    for all records r in the DB
6:    {  Cr = subset(Ck, r); for each c in Cr, c.count++  }
7:    set Lk := all c in Ck whose count >= minsup
8: }  /* end -- return all of the Lk sets */
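A compact Python rendering of this pseudo-code, as a sketch rather than an optimized implementation (all names are my own; transactions are assumed to be sets):

from itertools import combinations

def apriori(transactions, min_count):
    # L1: frequent 1-itemsets (one database scan).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent, k = dict(L), 2
    while L:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Count the surviving candidates in one scan of the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(L)
        k += 1
    return frequent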

Example
Consider a database, D, consisting of 9 transactions:

TID  | Items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).

Let the minimum confidence required be 70%.

Generating 1-itemset frequent patterns
Scan D for the count of each candidate in C1, then compare each candidate's support count with the minimum support count to obtain L1. Here C1 and L1 coincide:

Itemset | Sup. count
{I1}    | 6
{I2}    | 7
{I3}    | 6
{I4}    | 2
{I5}    | 2

Generating 2-itemset frequent patterns
Generate the C2 candidates from L1, scan D for the count of each candidate, and compare the counts with the minimum support count.

C2 with support counts:
{I1, I2}: 4, {I1, I3}: 4, {I1, I4}: 1, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2, {I3, I4}: 0, {I3, I5}: 1, {I4, I5}: 0

L2 (candidates meeting the minimum support count):
{I1, I2}: 4, {I1, I3}: 4, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2

Generating 3-itemset frequent patterns
To find C3, we compute L2 join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. The join step is now complete, and the prune step is used to reduce the size of C3; pruning helps avoid heavy computation due to a large Ck. Every candidate with a 2-item subset not in L2 is removed, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}. Scanning D gives a support count of 2 for each, so L3 = {{I1, I2, I3}: 2, {I1, I2, I5}: 2}.
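Feeding this database to the apriori sketch above reproduces the walkthrough (a usage sketch; D and the printed values follow the counts derived above):

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
frequent = apriori(D, min_count=2)
print(len(frequent))                            # 13 frequent itemsets in total
print(frequent[frozenset({"I1", "I2", "I5"})])  # 2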

Association rule generation
Procedure (see the sketch below):
- For each frequent itemset l, generate all nonempty proper subsets of l.
- For every nonempty subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= the minimum confidence threshold.
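A direct translation of this procedure, reusing the support-count map returned by the apriori sketch above (a sketch; generate_rules is my own name):

from itertools import combinations

def generate_rules(frequent, min_conf):
    # Yield (antecedent, consequent, confidence) for each strong rule.
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(l, size)):
                conf = l_count / frequent[s]  # support(l) / support(s)
                if conf >= min_conf:
                    yield s, l - s, conf

for rule in generate_rules(frequent, min_conf=0.7):
    print(rule)  # prints every rule with confidence >= 70%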

Example
From the previous example, the frequent itemsets are {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I1, I2, I3}, {I1, I2, I5}}.
Let us take l = {I1, I2, I5}. Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Now we can calculate the confidence of the corresponding association rules:
- {I1, I2} => I5, confidence = 2/4 = 50%
- {I1, I5} => I2, confidence = 2/2 = 100%
- {I2, I5} => I1, confidence = 2/2 = 100%
- I1 => {I2, I5}, confidence = 2/6 = 33%
- I2 => {I1, I5}, confidence = 2/7 = 29%
- I5 => {I1, I2}, confidence = 2/2 = 100%
Since the minimum confidence threshold is 70%, the 2nd, 3rd, and 6th rules are the strong association rules.

Apriori Algorithm: Efficiency Improvements
Many variations of the algorithm have been proposed to improve the efficiency of the original. Methods to improve the efficiency of the Apriori algorithm include:
- Hash-based itemset counting
- Transaction reduction
- Partitioning
- Sampling

Bottlenecks of Apriori
- Generates a huge number of candidate sets.
- Repeatedly scans the whole database.
- Checks a large set of candidates by pattern matching.

FP-Growth
FP-growth (Frequent Pattern growth) adopts a divide-and-conquer strategy:
- Compresses the database into an FP-tree.
- Divides the compressed database into a set of conditional databases, each associated with one pattern fragment.
- Examines the associated data set for each fragment.

Example: building the FP-tree
Using the same 9-transaction database D, the first scan derives the frequent 1-itemsets and sorts them by descending support count, giving the header table:

Item | Support count | Node link
I2   | 7             | ->
I1   | 6             | ->
I3   | 6             | ->
I4   | 2             | ->
I5   | 2             | ->

The second scan inserts each transaction, with its items reordered by descending support (e.g. T100 becomes I2, I1, I5), into a tree rooted at null, sharing common prefixes and incrementing the node counts along each path. Inserting the nine transactions one at a time yields the completed FP-tree:

null {}
+- I2:7
|  +- I1:4
|  |  +- I5:1
|  |  +- I4:1
|  |  +- I3:2
|  |     +- I5:1
|  +- I4:1
|  +- I3:2
+- I1:2
   +- I3:2

Example: mining the FP-tree
Mining starts from the least frequent item in the header table. For each item, collect its prefix paths (the conditional pattern base), build the conditional FP-tree, and generate the frequent patterns.

I5: The branches containing I5 are (I2, I1, I5: 1) and (I2, I1, I3, I5: 1), so the conditional pattern base is {(I2, I1: 1), (I2, I1, I3: 1)}. I3 drops out (count 1 < 2), leaving the conditional FP-tree <I2: 2, I1: 2> and the frequent patterns {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.

I4: The branches are (I2, I4: 1) and (I2, I1, I4: 1), so the conditional pattern base is {(I2: 1), (I2, I1: 1)}, the conditional FP-tree is <I2: 2>, and the frequent pattern is {I2, I4: 2}.

I3: The branches are (I2, I1, I3: 2), (I2, I3: 2), and (I1, I3: 2), so the conditional pattern base is {(I2, I1: 2), (I2: 2), (I1: 2)}, the conditional FP-tree is <I2: 4, I1: 2>, <I1: 2>, and the frequent patterns are {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}.

I1: The conditional pattern base is {(I2: 4)}, the conditional FP-tree is <I2: 4>, and the frequent pattern is {I2, I1: 4}.

Summary:

Item | Conditional pattern base        | Conditional FP-tree     | Frequent patterns generated
I5   | {(I2, I1: 1), (I2, I1, I3: 1)}  | <I2: 2, I1: 2>          | {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4   | {(I2: 1), (I2, I1: 1)}          | <I2: 2>                 | {I2, I4: 2}
I3   | {(I2, I1: 2), (I2: 2), (I1: 2)} | <I2: 4, I1: 2>, <I1: 2> | {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1   | {(I2: 4)}                       | <I2: 4>                 | {I2, I1: 4}

Pros of FP-Growth
- No candidate generation and no candidate testing.
- Uses a compact data structure.
- Eliminates repeated database scans.
- The basic operations are counting and FP-tree building.
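The whole procedure fits in a short sketch. Below is a minimal, illustrative FP-growth implementation (class and function names are my own): it builds the FP-tree in two scans and recursively mines the conditional pattern bases.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_count):
    # Scan 1: count item supports and keep only the frequent items.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Scan 2: insert each transaction with items in descending support order.
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])  # node link
            node = node.children[item]
            node.count += 1
    return header, freq

def fp_growth(transactions, min_count, suffix=()):
    header, freq = build_fptree(transactions, min_count)
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):  # least frequent first
        patterns[frozenset((item,) + suffix)] = freq[item]
        # Conditional pattern base: the prefix path of each node labelled
        # `item`, repeated once per occurrence counted at that node.
        cond_base = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path] * node.count)
        patterns.update(fp_growth(cond_base, min_count, (item,) + suffix))
    return patterns

patterns = fp_growth(D, min_count=2)            # D as defined earlier
print(patterns[frozenset({"I2", "I1", "I5"})])  # 2
print(patterns[frozenset({"I2", "I3"})])        # 4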

Comparison between Apriori and FP-Growth

Parameter          | Apriori algorithm | FP-growth algorithm
Technique          | Uses the Apriori property with join and prune steps | Constructs conditional pattern bases and conditional FP-trees from the database that satisfy the minimum support
Memory utilization | Requires a large memory space due to the large number of candidates generated | Requires less memory due to its compact structure and the absence of candidate generation
Number of scans    | Multiple scans to generate the candidate sets | Scans the database only twice
Execution time     | Longer, since time is spent on candidate generation at every level | Shorter than that of the Apriori algorithm

Correlation Analysis
Correlation analysis provides an alternative framework for finding interesting relationships, or for improving the understanding of the meaning of some association rules. Correlation measures:
- Lift
- χ² measure

Correlation measure: Lift
Two itemsets A and B are independent if and only if P(A ∪ B) = P(A) P(B).

Otherwise, A and B are dependent and correlated.

The measure of correlation, or lift, between A and B is given by the formula:

lift(A, B) = P(A ∪ B) / (P(A) P(B))

40April 6, 2015Correlation measure : Liftlift(A,B) >1 means that A and B are positively correlated

lift(A,B) < 1 means that the occurrence of A is negatively correlated with B.

lift(A,B) =1 means that A and B are independent and there is no correlation between them.
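For instance, using the nine-transaction database above, {I1, I2} appears in 4 of 9 transactions, I1 in 6, and I2 in 7, so:

lift(I1, I2) = P(I1 ∪ I2) / (P(I1) P(I2)) = (4/9) / ((6/9)(7/9)) ≈ 0.444 / 0.519 ≈ 0.86

Since the lift is below 1, the occurrence of I1 is slightly negatively correlated with the occurrence of I2.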

Correlation measure: χ² measure
The χ² (chi-square) measure compares the observed and expected frequencies of itemset combinations over a contingency table:

χ² = Σ (observed − expected)² / expected

Reference
Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, Third Edition, Chapter 6.

Thank You