
Associative Learning


Page 1: Associative Learning
Page 2: Associative Learning

Content
1. Introduction
2. Association Rule Learning
3. Apriori Algorithm
4. Proposed Work

Page 3: Associative Learning

Introduction
Data mining is the analysis of large quantities of data to extract interesting patterns such as:
- groups of data records (cluster analysis)
- unusual records (anomaly detection)
- dependencies (association rules)

Association rule mining, first proposed in [2], is a popular and well-researched data mining method for discovering interesting relations between variables in large databases.

Page 4: Associative Learning

Association Rule Learning
The problem of association rule mining [2] is defined as follows:

Let I = {i1, i2, ..., in} be a set of n attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction t in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = Ø.

An example rule for a supermarket could be {butter, bread} ⇒ {milk}: if butter and bread are bought, then customers also buy milk.

Page 5: Associative Learning

Constraints
The best-known constraints are minimum thresholds on support and confidence [3].
The support of an item-set X is defined as the number of transactions in the data set which contain the item-set; it is written as supp(X).
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).

Association rule generation [16,17] can be split into two steps:
i) First, a user-defined minimum support threshold is applied to the database to find all the frequent item-sets.
ii) Second, these frequent item-sets and a user-defined minimum confidence threshold are used to form the rules.

For the purpose of finding the frequent item-sets we use the Apriori algorithm [4][5].

Page 6: Associative Learning

An Example

supp(milk) = 2/5, supp(bread) = 3/5, supp(butter) = 2/5, supp(beer) = 1/5

The rule {milk, bread} ⇒ {butter} has confidence = supp(milk, bread, butter) / supp(milk, bread) = (1/5) / (2/5) = 50% (recomputed in the sketch after the table below).

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0
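To make the computation concrete, here is a minimal Python sketch (ours, not from the slides) that recomputes these supports and the rule confidence from the table above:

# Toy supermarket database from the table above; each transaction is a set of items.
transactions = [
    {"milk", "bread"},             # TID 1
    {"butter"},                    # TID 2
    {"beer"},                      # TID 3
    {"milk", "bread", "butter"},   # TID 4
    {"bread"},                     # TID 5
]

def supp(itemset):
    # supp(X): fraction of transactions that contain every item of X.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(antecedent, consequent):
    # conf(X => Y) = supp(X u Y) / supp(X).
    return supp(set(antecedent) | set(consequent)) / supp(antecedent)

print(supp({"milk"}), supp({"bread"}), supp({"butter"}), supp({"beer"}))
# 0.4 0.6 0.4 0.2  (i.e. 2/5, 3/5, 2/5, 1/5)
print(conf({"milk", "bread"}, {"butter"}))   # 0.5 -> 50%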

Page 7: Associative Learning

Applications

Market Analysis

Telecommunication

Credit Cards/ Banking Services

Medical Treatments

Basketball-Game Analysis

Page 8: Associative Learning

Apriori Algorithm
Apriori [11] is a classic algorithm for finding the frequent item-sets over transactional databases.

It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item-sets, as long as those item-sets appear sufficiently often in the database, i.e. they satisfy the minimum support threshold.

• Frequent Item-set Property: Any subset of a frequent item-set is frequent.

This algorithm is divided into two parts: i) generating the candidate item-sets, and ii) generating the large (frequent) item-sets.

Page 9: Associative Learning

Apriori Algorithm Contd.
Lk: set of frequent item-sets of size k (those meeting min support)
Ck: set of candidate item-sets of size k (potentially frequent item-sets)

L1 = {frequent item-sets of size 1};
for (k = 1; Lk != Ø; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
return ∪k Lk;
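As an illustration, here is a compact, runnable Python sketch of this level-wise loop. Names such as apriori and min_support are ours; candidate generation here simply unions pairs of frequent k-item-sets and then applies the subset-based prune.

from itertools import combinations

def apriori(transactions, min_support):
    # Return a dict {frozenset(itemset): support count} of all frequent item-sets.
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        return sum(itemset <= t for t in transactions)

    # L1: frequent 1-item-sets.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if support_count(frozenset([i])) >= min_support}
    frequent = {s: support_count(s) for s in current}

    k = 1
    while current:
        # C(k+1): union pairs of frequent k-item-sets that give a (k+1)-item-set,
        # then prune candidates having an infrequent k-subset (Apriori property).
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # One scan of the database counts each surviving candidate.
        current = {c for c in candidates if support_count(c) >= min_support}
        frequent.update({c: support_count(c) for c in current})
        k += 1
    return frequent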

Page 10: Associative Learning

How it Works

Database D (min support = 2):
TID   Items
T1    1 3 4
T2    2 3 5
T3    1 2 3 5
T4    2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (support >= 2): {1}:2, {2}:3, {3}:3, {5}:3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2 (support >= 2): {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (candidates from L2): {2 3 5}
Scan D -> L3 (support >= 2): {2 3 5}:2
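Running the apriori sketch from the previous slide on this database (assuming that function is in scope) reproduces the frequent item-sets above:

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # Database D
freq = apriori(D, min_support=2)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# [1] 2, [2] 3, [3] 3, [5] 3, [1, 3] 2, [2, 3] 2, [2, 5] 3, [3, 5] 2, [2, 3, 5] 2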

Page 11: Associative Learning

Generation of Candidates
Input: Li-1: set of frequent item-sets of size i-1
Output: Ci: set of candidate item-sets of size i

Ci = empty set;
for each item-set J in Li-1 do
    for each item-set K in Li-1 s.t. K ≠ J do
        if i-2 of the elements in J and K are equal then
            if all (i-1)-subsets of K ∪ J are in Li-1 then
                Ci = Ci ∪ {K ∪ J}
return Ci;
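A possible Python reading of this join-and-prune step (a sketch under the assumption, consistent with the worked example on the next slide, that "i-2 of the elements are equal" means the two item-sets share their first i-2 items when kept sorted; the name generate_candidates is ours):

from itertools import combinations

def generate_candidates(L_prev, i):
    # Build Ci from L(i-1): join item-sets sharing their first i-2 items, then prune.
    L_prev = {tuple(sorted(s)) for s in L_prev}
    Ci = set()
    for J in L_prev:
        for K in L_prev:
            # Join step: first i-2 items equal, and K's last item larger than J's
            # (so each candidate is generated exactly once).
            if J[:-1] == K[:-1] and J[-1] < K[-1]:
                candidate = J + (K[-1],)
                # Prune step: every (i-1)-subset of the candidate must be in L(i-1).
                if all(sub in L_prev for sub in combinations(candidate, i - 1)):
                    Ci.add(candidate)
    return Ci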

Page 12: Associative Learning

Example of Finding Candidates
Say L3 consists of the item-sets {abc, abd, acd, ace, bcd}.

Now to Generate C4 from L3

abcd from abc and abd

acde from acd and ace

Pruning the candidate set :

acde is removed because ade is not in L3

Hence C4 will have only {abcd}
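Feeding this L3 to the generate_candidates sketch from the previous slide (assuming that function is in scope) reproduces the result:

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(generate_candidates(L3, 4))
# {('a', 'b', 'c', 'd')} -- acde is produced by the join but pruned, since ade is not in L3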

Page 13: Associative Learning

Discovering Rules
for each frequent item-set I do
    for each rule C ⇒ I-C do
        if (support(I) / support(C) >= min_conf) then    [ since C ∪ (I-C) = I ]
            output the rule C ⇒ I-C, with confidence = support(I) / support(C)
            and support = support(I)
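A Python sketch of this rule-generation loop (function and variable names are ours), reusing the support counts returned by the earlier apriori sketch:

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: {frozenset(itemset): support count}, as returned by apriori().
    # Returns (antecedent, consequent, confidence, support) for each rule C => I-C.
    rules = []
    for I, supp_I in frequent.items():
        if len(I) < 2:
            continue                    # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(I)):
            for C in map(frozenset, combinations(I, r)):
                confidence = supp_I / frequent[C]   # conf(C => I-C) = support(I) / support(C)
                if confidence >= min_conf:
                    rules.append((set(C), set(I - C), confidence, supp_I))
    return rules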

Page 14: Associative Learning

Example of Discovering Rules
Let us consider the 3-itemset {I2, I3, I5}:

Support of {I2, I3, I5} = 2

{I2, I3} ⇒ I5    confidence = 2/2 = 100%
{I2, I5} ⇒ I3    confidence = 2/3 = 67%
{I3, I5} ⇒ I2    confidence = 2/2 = 100%
I2 ⇒ {I3, I5}    confidence = 2/3 = 67%
I3 ⇒ {I2, I5}    confidence = 2/3 = 67%
I5 ⇒ {I2, I3}    confidence = 2/3 = 67%

Database D:
TID   Items
T1    1 3 4
T2    2 3 5
T3    1 2 3 5
T4    2 5
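Assuming the apriori and generate_rules sketches above are in scope, the six confidences on this slide can be recomputed (here I2, I3, I5 are read as items 2, 3 and 5 of Database D):

freq = apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], min_support=2)
for C, Y, confidence, support in generate_rules(freq, min_conf=0.0):
    if C | Y == {2, 3, 5}:                    # keep only rules built from the 3-itemset
        print(sorted(C), "=>", sorted(Y), f"confidence = {confidence:.0%}")
# Prints the six rules above with confidences 100%, 67%, 100%, 67%, 67%, 67%
# (output order may vary).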

Page 15: Associative Learning

Advantages:
i) The Apriori method is very useful when the data size is huge, as it uses a level-wise search to find the frequent item-sets.
ii) Apriori uses breadth-first search to count candidate item-sets efficiently.

Disadvantages:
i) The Apriori algorithm needs to scan the entire database on every pass.

ii) The computational cost increases as the number and size of the candidate item-sets grow.

Page 16: Associative Learning

Proposed Work

1. Modified Search Algorithm

2. Modified Association Rule Generation for Classification of Data

Page 17: Associative Learning

Modified Search Algorithm

1. Add a tag field to each transaction in the database. Format: if a transaction is <T1>, then it is modified to <T1, tag>.

2. The tag contains the first, middle and last item of the transaction.

3. Example: if a transaction is <I4, I5, I6, I9, I11, I12>, then the tag field will be <I4, I6, I12>.

Page 18: Associative Learning

Modified Search Algorithm Contd.
Step 1: First create a TAG field for each transaction in the dataset. The TAG field will contain 3 values: <Starting Value, Middle Value, End Value>.

Step 2: For each item to be searched in the dataset, first check whether the item is greater than or equal to the starting value and less than or equal to the end value.

Step 3: If the item does not satisfy the conditions in Step 2, do not search that particular transaction. If it satisfies both conditions, go to Step 4.

Step 4: Check whether the item to be searched matches the middle element. If it matches, go to Step 6. If it does not match, go to Step 5.

Step 5: Calculate the difference of the item to be searched from the starting, middle and end values. Use the smallest of these three differences to reduce the search range and go to Step 4; if the difference from any element is 0, go to Step 6.

Step 6: Increase the count by 1 for that item when it is found in the transaction. (One possible implementation of these steps is sketched below.)
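Steps 4-5 are stated informally, so the following Python sketch is only one possible reading, not the authors' implementation. It assumes numeric item IDs (so differences can be computed), uses the tag from Steps 1-3 to skip transactions, and realizes the difference-guided narrowing as an interpolation-style probe rather than always probing the exact middle. Names such as tag and count_item are ours.

def tag(transaction):
    # Step 1: the tag holds the first, middle and last item of the sorted transaction.
    items = sorted(transaction)
    return items[0], items[(len(items) - 1) // 2], items[-1]

def count_item(item, transactions):
    # Count how many transactions contain `item` (Steps 2-6).
    count = 0
    for t in transactions:
        items = sorted(t)
        first, middle, last = tag(t)
        if not (first <= item <= last):   # Steps 2-3: the item cannot be in this transaction
            continue
        if item == middle:                # Step 4: the item matches the middle element
            count += 1
            continue
        lo, hi = 0, len(items) - 1
        while lo <= hi and items[lo] <= item <= items[hi]:
            if items[hi] == items[lo]:
                pos = lo
            else:
                # Steps 4-5 (our reading): probe where the value differences point,
                # instead of always probing the exact middle as binary search does.
                pos = lo + (item - items[lo]) * (hi - lo) // (items[hi] - items[lo])
            if items[pos] == item:        # Step 6: found, increase the count
                count += 1
                break
            if items[pos] < item:
                lo = pos + 1
            else:
                hi = pos - 1
    return count

numbers = [10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, 54, 57, 61,
           67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127]
print(count_item(51, [numbers]))          # 1 -- 51 occurs in this one "transaction"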

Page 19: Associative Learning

Example: We randomly take 30 numbers for the example

(10,11,12,21,22,31,33,37,39,41,45,46,49,51,54,57,61,67,69,71,78,79,81,101,103,105,107,109,111,127)

We need to find 51 among these data.

1st Iteration:
Middle element = 54:
10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, [54], 57, 61, 67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127

Differences from the item (51): |51 - 10| = 41 and |51 - 54| = 3. Since 51 < 54, the range would normally become 10-51. But from the differences we can see that the item (51) is much closer to 54 than to 10, so the range can be narrowed to 33-51, since at most the middle position of the range 10-51 could be equal to the item (51).

Page 20: Associative Learning

Example: 2nd Iteration:
Middle element of the current range (33-51) = 45:
10, 11, 12, 21, 22, 31, 33, 37, 39, 41, [45], 46, 49, 51, 54, 57, 61, 67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127

Since 51 > 45, the range would become 46-51. But again we calculate the differences: the difference of the item (51) from 45 is 6, and from the end value 51 it is 0. So the search ends, and the counter for the item is increased by 1.
So we can see that in only 2 iterations we can find the data we need.

Page 21: Associative Learning

Example: Comparison with Binary Search:

10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, 54, 57, 61, 67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127

For binary search we would have the following iterations:
1st iteration: (check 51 <, >, = 54)  result: 51 < 54, search in the range 10 to 51
2nd iteration: (check 51 <, >, = 33)  result: 51 > 33, search in the range 37 to 51
3rd iteration: (check 51 <, >, = 45)  result: 51 > 45, search in the range 46 to 51
4th iteration: (check 51 <, >, = 49)  result: 51 > 49, search in the range 51 to 51
5th iteration: (check 51 <, >, = 51)  result: 51 = 51, search ends, data found

Conclusion:
From the comparison it is clear that the proposed search algorithm can find the desired data in fewer iterations, and hence in less time.
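For reference, here is a plain binary search with a probe counter (a standard routine, independent of the proposed method); on the 30-number list it needs exactly the five probes 54, 33, 45, 49, 51 listed above:

def binary_search_steps(items, target):
    # Standard binary search that also reports how many probes it needed.
    lo, hi, steps = 0, len(items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

numbers = [10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, 54, 57, 61,
           67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127]
print(binary_search_steps(numbers, 51))   # (13, 5): found at index 13 after 5 probes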

Page 22: Associative Learning

Modified Association Rule Generation for Classification of Data

Issues: a) generating a minimal number of rules, b) classifying the maximum amount of data correctly.

Example :

For item value 1 there are 3 decisions: 1, 2 and 3. We calculate count(1,1), count(1,2) and count(1,3), and support(1) = max(count(1,1), count(1,2), count(1,3)). (A sketch of this computation follows the table below.)

I1   I2   I3   I4   DECISION
1    2    3    4    1
1    2    6    7    1
1    3    5    8    2
2    5    6    9    2
1    2    3    6    3
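A small Python sketch of the count and support computation described above, applied to the five-row example table (the function names and the row encoding are ours):

rows = [                      # (I1, I2, I3, I4, DECISION) from the example table
    (1, 2, 3, 4, 1),
    (1, 2, 6, 7, 1),
    (1, 3, 5, 8, 2),
    (2, 5, 6, 9, 2),
    (1, 2, 3, 6, 3),
]

def count(item_value, decision):
    # count(v, d): rows containing item value v (in any attribute column) with decision d.
    return sum(item_value in r[:-1] and r[-1] == decision for r in rows)

def support(item_value):
    # support(v) = max over all decisions d of count(v, d).
    decisions = {r[-1] for r in rows}
    return max(count(item_value, d) for d in decisions)

print(count(1, 1), count(1, 2), count(1, 3))   # 2 1 1
print(support(1))                              # 2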

Page 23: Associative Learning

Modified Association Rule Generation for Classification of Data

Algorithm:
Step 1: Let k = 1.
Step 2: Generate frequent item-sets of length 1 (GOTO STEP 11).
Step 3: Repeat until no new frequent item-sets are identified:

(i) Generate length (k+1) candidate item-sets from the length-k frequent item-sets.

(ii) Prune candidate item-sets containing subsets of length k that are infrequent.

(iii) Count the support of each candidate by scanning the DB (GOTO STEP 11).

(iv) Eliminate candidates that are infrequent, leaving only those that are frequent.

Step 11: For each item in the dataset, calculate the number of times the item is present in the whole data-set together with each corresponding decision value (for example I2 ⇒ D1, I2 ⇒ D2 or I2 ⇒ D3).

Step 12: Find the maximum of the calculated counts for each item.
Step 13: Return this maximum as the support for the item.

Page 24: Associative Learning

Experimental Results
Algorithms compared: DECISION TABLE, PART, One-R, and the proposed algorithm.
We use the IRIS data-set from the UCI Machine Learning Repository.
Total number of instances: 148
Classes available: 3 (Iris Setosa (A), Iris Versicolour (B), Iris Virginica (C))
We first classify this data-set using the existing algorithms in the Weka tool.

Page 25: Associative Learning

Conclusion
Comparative Study:

From this comparative study we can say that, using our proposed algorithm, we can classify the data-set more accurately than the existing algorithms.

ALGORITHMS                              DECISION TABLE   ONE-R   PART   Proposed Method
Correctly classified                    134              136     134    138
Incorrectly classified                  13               11      13     10
Total number of instances classified    147              147     147    148

Page 26: Associative Learning

Future Scope

In future work we will try to further optimize the searching technique for the Apriori algorithm.

We will also try to optimize the generated rule set so that it contains fewer rules.

Page 27: Associative Learning

References
1. Piatetsky-Shapiro, Gregory (1991). Discovery, analysis, and presentation of strong rules. In Piatetsky-Shapiro, Gregory and Frawley, William J. (eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
2. Agrawal, R.; Imieliński, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pp. 207.
3. Liu, B., Hsu, W., Ma, Y. (1998). Integrating Classification and Association Rule Mining. American Association for Artificial Intelligence.
4. Agrawal, R., Faloutsos, C. and Swami, A. N. (1994). Efficient similarity search in sequence databases.
5. Lomet, D. (ed.), Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO), Chicago, Illinois, pp. 69-84. Springer Verlag.
6. www.en.wikipedia.org/wiki/Binary_search_algorithm
7. Press, William H.; Flannery, Brian P.; Teukolsky, Saul A.; Vetterling, William T. (1988). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, pp. 98-99.
8. Hipp, J., Güntzer, U. and Nakhaeizadeh, G. (2000). Algorithms for association rule mining - a general survey and comparison. SIGKDD Explorations Newsletter 2(1) (Jun. 2000), 58-64.
9. Pingping, W., Cuiru, W., Baoyi, W., Zhenxing, Z. "Data Mining Technology and Its Application in University Education System". Computer Engineering, June 2003, pp. 87-89.
10. Taorong, Q., Xiaoming, B., Liping, Z. "An Apriori algorithm based on granular computing and its application in library management systems". Control & Automation, 2006, pp. 218-221.

Page 28: Associative Learning

References Contd.
11. Agrawal, R. and Srikant, R. "Fast Algorithms for Mining Association Rules". In Proc. VLDB 1994, pp. 487-499.
12. Chai, S., Jia, Y. and Yang, C. "The research of improved Apriori algorithm for mining association rules". Service Systems and Service Management, 2007 International Conference on. IEEE, 2007.
13. Kumar, K. Saravana and Manicka Chezian, R. "A Survey on Association Rule Mining using Apriori Algorithm". International Journal of Computer Applications 45(5) (2012): 47-50.
14. Saggar, M., Agrawal, A. K. and Lad, A. (2004, October). "Optimization of association rule mining using improved genetic algorithms". In Systems, Man and Cybernetics, 2004 IEEE International Conference on (Vol. 4, pp. 3725-3729). IEEE.
15. Christian, A. J. and Martin, G. P. (2010, November). "Optimization of association rules with genetic algorithms". In Chilean Computer Science Society (SCCC), 2010 XXIX International Conference of the (pp. 193-197). IEEE.
16. Hipp, J., Güntzer, U. and Nakhaeizadeh, G. (2000). "Algorithms for association rule mining - a general survey and comparison". ACM SIGKDD Explorations Newsletter, 2(1), 58-64.
17. Mitra, S. and Acharya, T. (2003). "Data Mining: Multimedia, Soft Computing, and Bioinformatics". Wiley-Interscience, pp. 7-8.

Page 29: Associative Learning

Thank You