Mining Frequent Patterns, Associations, and Correlations

Compiled by: Umair Yaqub, Lecturer, Govt. Murray College Sialkot

Frequent Pattern Mining - Basic Concepts

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

Frequent pattern mining: finding frequent associations or correlations among sets of items or objects in transaction databases, relational databases, and other information repositories.

Let I = {i1, i2, …, im} be a set of items, and let D be a database of transactions, where each transaction T is a set of items from I (for example, the items purchased by a customer in one visit).

An association rule is an implication of the form A → B, where A and B are subsets of I, and A ∩ B = Ø.

Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F

[Figure: customers who buy A (computer), customers who buy B (software), and customers who buy both]

Association Mining-Basic Concepts (contd…)

Find all the rules A → B that meet a minimum support and a minimum confidence:

    support, s: the probability that a transaction contains both A and B
    confidence, c: the conditional probability that a transaction containing A also contains B

Rules satisfying both a minimum support threshold and a minimum confidence threshold are called strong.

A set of items is referred to as an itemset. An itemset containing k items is a k-itemset.

The occurrence frequency of an itemset is the number of transactions that contain the itemset (also called its frequency, support count, or count).

An itemset satisfying the minimum support (count) is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.
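
The definitions of support count, support, and confidence can be sketched in a few lines of Python. This is only an illustration: the function names (`support_count`, `support`, `confidence`) are assumptions made for this sketch, and the data is the four-transaction table from the basic-concepts slide.

```python
from fractions import Fraction

# Transactions from the example table (transaction ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)


def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return Fraction(support_count(itemset, db), len(db))


def confidence(antecedent, consequent, db):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent.

    Raises ZeroDivisionError if no transaction contains the antecedent.
    """
    both = set(antecedent) | set(consequent)
    return Fraction(support_count(both, db), support_count(antecedent, db))
```

For instance, `support({"A", "C"}, transactions)` evaluates to 1/2 and `confidence({"A"}, {"C"}, transactions)` to 2/3, because two of the three transactions containing A also contain C. Exact fractions are used here only to keep the arithmetic easy to check by hand.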

Association Mining-Basic Concepts (contd…)

Association rule mining is a two-step process:

    1. Find all frequent itemsets
    2. Generate strong association rules from the frequent itemsets

Overall performance is determined by the first step.

Association Rule Mining: A Road Map

Based on the completeness of mined patterns: complete set of frequent itemsets, constrained frequent itemsets

Based on levels of abstraction: single-level vs. multiple-level analysis
    age(x, “30..39”) → buys(x, “computer”)
    age(x, “30..39”) → buys(x, “laptop”)

Based on number of data dimensions: single-dimensional vs. multidimensional associations

Based on the types of values handled: Boolean vs. quantitative associations
    buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
    age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]

Based on kinds of rules to be mined: association rules, correlation rules

Based on the kinds of patterns to be mined: frequent itemset mining, sequential pattern mining, structured pattern mining

Mining Association Rules: An Example

Min. support 50%, min. confidence 50%

Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F

Frequent Itemset    Support
{A}                 75%
{B}                 50%
{C}                 50%
{A,C}               50%
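
The example can be checked with a brute-force sketch that first lists every itemset meeting the 50% support threshold and then derives the rules meeting the 50% confidence threshold. The helper names (`frequent_itemsets`, `strong_rules`) are assumptions of this sketch, and exhaustive enumeration is used only because the data set is tiny; it is not the Apriori method.

```python
from itertools import combinations

# The four transactions of the example (items bought per transaction).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.5


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets():
    """Step 1: enumerate all itemsets and keep those meeting MIN_SUPPORT."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= MIN_SUPPORT:
                result[frozenset(combo)] = s
    return result


def strong_rules(frequent):
    """Step 2: split each frequent itemset into antecedent -> consequent."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent,
                # so frequent[a] is always present.
                conf = s / frequent[a]
                if conf >= MIN_CONFIDENCE:
                    rules.append((tuple(sorted(a)), tuple(sorted(itemset - a)), s, conf))
    return rules


if __name__ == "__main__":
    for lhs, rhs, s, conf in strong_rules(frequent_itemsets()):
        print(lhs, "->", rhs, f"support={s:.0%}", f"confidence={conf:.1%}")
```

On this data the only frequent itemset with more than one item is {A,C}, so the sketch yields two rules: A → C with support 50% and confidence 66.7%, and C → A with support 50% and confidence 100%.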

The Apriori Algorithm

Method:

Initially, scan DB once to get the frequent 1-itemsets

Generate length (k+1) candidate itemsets from length k frequent itemsets

Test the candidates against DB

Terminate when no frequent or candidate set can be generated

Use the frequent itemsets to generate association rules.

The Apriori principle:

All nonempty subsets of a frequent itemset must be frequent

The Apriori Algorithm — Example

Database D

TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D → C1

itemset    sup.
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

L1

itemset    sup.
{1}        2
{2}        3
{3}        3
{5}        3

C2 (generated from L1)

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with support counts

itemset    sup
{1 2}      1
{1 3}      2
{1 5}      1
{2 3}      2
{2 5}      3
{3 5}      2

L2

itemset    sup
{1 3}      2
{2 3}      2
{2 5}      3
{3 5}      2

C3 (generated from L2)

itemset
{2 3 5}

Scan D → L3

itemset    sup
{2 3 5}    2

The Apriori Algorithm

Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
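
The pseudo-code can be turned into a short runnable Python sketch. It is a minimal illustration, not a tuned implementation: the function name `apriori` and the `min_count` parameter (an absolute support count) are assumptions of this sketch, and candidates are built by taking unions of pairs of frequent k-itemsets rather than by an ordered self-join. The usage data is the four-transaction database D from the Apriori example, with a minimum support count of 2 inferred from which itemsets that example keeps.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # L1: one scan of the database to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}

    frequent = dict(current)
    k = 1
    while current:  # loop while Lk is not empty
        # C(k+1): unions of two frequent k-itemsets that have k+1 items
        # and whose k-item subsets are all frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in current for s in combinations(union, k)
            ):
                candidates.add(union)

        # Scan the database and count each candidate contained in a transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that meet the minimum support count.
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Database D from the Apriori example.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
```

With this input, `result` holds nine frequent itemsets: the four 1-itemsets {1}, {2}, {3}, {5}, the four 2-itemsets {1 3}, {2 3}, {2 5}, {3 5}, and the single 3-itemset {2 3 5} with support count 2, which agrees with the worked example.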

Important Details of Apriori

How to generate candidates?

Step 1: self-joining Lk

Step 2: pruning

How to count supports of candidates?

Example of candidate generation

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}

How to Generate Candidates?

Suppose the items within each itemset of Lk-1 are listed in a fixed order

Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
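
The join and prune steps can be sketched in Python as follows. The function name `apriori_gen` and the choice to represent each itemset as a sorted tuple are assumptions made for this illustration; the function returns both the joined set and the pruned set so that the effect of each step is visible.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Build Ck from L(k-1).

    `prev_frequent` is a collection of sorted tuples, all of length k-1.
    Returns (joined, candidates): the result of the self-join, and the
    candidates that survive pruning.
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    size = len(prev[0]) if prev else 0  # k-1

    # Step 1: self-join. Two itemsets join when their first k-2 items
    # agree and the last item of p sorts before the last item of q.
    joined = []
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))

    # Step 2: prune. Drop any candidate with a (k-1)-subset not in L(k-1).
    candidates = [
        c for c in joined
        if all(s in prev_set for s in combinations(c, size))
    ]
    return joined, candidates


# L3 from the candidate-generation example: abc, abd, acd, ace, bcd.
L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
joined, C4 = apriori_gen(L3)
```

For this L3 the join produces abcd (from abc and abd) and acde (from acd and ace); pruning then removes acde because its subset ade is not in L3, leaving C4 = {abcd}.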

Example – Transaction DB

Adapted from slides by Han and Kamber http://www-faculty.cs.uiuc.edu/~hanj/bk2/

Example – Finding Frequent Patterns (1)

Example – Finding Frequent Patterns (2)
