A Performance Comparison Of High Utility Based Mining Algorithms

Embed Size (px)

Citation preview

  • 7/27/2019 A Performance Comparison Of High Utility Based Mining Algorithms

    1/5

    Internat ional Journal of Engineering Trends and Technology (IJETT) Volume 4 Issue5- May 2013

    ISSN: 2231-5381 http://www.ijettjournal.org Page 1796

    A Performance Comparison Of High Utility BasedMining Algorithms

    S Dileep*, Satyanarayana Mummana#

    *Final M.Tech Student, Department of Computer Science & Engineering, Avanthi Institute ofEngineering& Technology, Narsipatnam, Andhra Pradesh.

    #Asst Professor, Dept of Computer Science and Engineering, Avanthi Institute of Engineering &

    Technology, Narsipatnam, Andhra Pradesh.

    Abstract:-Finding interesting frequent patterns from thetransaction tables is still an important research issue in the field

    of data mining, For producing Itemsets which are frequentlypurchasing & Association rules from huge amount of data

    sources to improving the efficiency of business applications. But,Frequent Pattern data structure tree produces only frequentpatterns and there is a need to generate high utility itemsets to

    improve the performance of market basket analysis. Producingthe Itemsets with high utility means, searching the Itemset with

    huge Interestedness from the data source. Even though we had

    number of traditional frequent pattern algorithms, there aresome problems with those existing mechanisms proposed in past.Searching of the frequent itemsets when the data base havinglarge amount of transactions in datasource. The major problem

    is impacting on the performance & efficiency is depends on largenumber of candidates. In this , we are comparing utility baseditemset mining algorithms which are having high utility FPGrowth algorithm, UP Growth algorithm for searching frequentitemsets from high utility based itemset mining.

    Keywords: Itemset Mining, high utility, candidate elimination,

    interestedness

    I. INTRODUCTION

    Mining of Frequent Itemsets plays vital role in miningapplications. The Apriori algorithm which are used forassociation rule mining in market basket analysis, all thefrequent itemsets are retrieved from transactional databasewhich are often coming together in transactions. In past, someprevious studies are utilized and producing with candidate andwithout candidates. But it is not producing the customerrequirement like profit, sales in particular item. In the view ofabove, itemset mining with high interestedness or utility plays

    vital role in mining area. The unit profits and purchasedquantities of items are not considered in the framework ofmining frequent itemset. Hence, it cannot satisfy therequirement of the user who is interested in discovering theitemsets with high sales profits. In view of this, utility mining[2, 3, 4, 6, 7, 8, 9, 10] emerges as an important topic in datamining for discovering the itemsets with high utility likeprofits. The basic meaning of utility is theinterestedness/importance/profitability of items to the users.The utility of items in a transaction database consists of twoaspects: (1) the importance of distinct items, which is called

    external utility, and (2) the importance of the items in thetransaction, which is called internal utility. The itemset of autility means the external utility multiplied by the internalutility. An itemset is called a high utility itemsetif its utility isno less than a user specified threshold; otherwise, the itemsetis called a low utility itemset. Mining high utility itemsets

    from databases is an important task which is essential to awide range of applications such as website click streaminganalysis, cross-marketing in retail stores, business promotionin chain hypermarkets and even biomedical applications.. The deficiency of this approach is that it does notconsider the statistical quality of itemsets. The measuresbased utility should take in user-defined utility as well as rawstatistical aspects of data. Consequently, it is meaningful todefine a specialized form of high utility itemsets, utility-frequent itemsets, which are a subset of high utility itemsetsas well as frequent itemsets. The following example indicatesdifferences between itemsets of high utility, frequent, andutility-frequent.

    Example 1. As an example let us analyze sales in a large retailstore. We can find that itemset {bread, milk} is frequent,itemset {caviar, champagne} is of high utility and itemset{beer} is utility frequent. A smart manager should pay specialattention to itemset {beer} as it is frequent and of high utility.On the other side, itemset {bread, milk} is frequent but not ofhigh utility and itemset {caviar, champagne} gives high utilitybut is not frequent.To address this issue, we propose in this paper a novelalgorithm with a compact data structure for efficientlydiscovering high utility itemsets from transactional databases.The major contributions of this work are summarized asfollows:

    1. A novel algorithm, called UP-Growth (Utility PatternGrowth), is proposed for discovering high utility itemsets.Correspondingly, a compact tree structure, called UP-Tree(Utility Pattern Tree), isproposed to maintain the important information of thetransaction database related to the utility patterns. High utilityitemsets are then generated from the UP-Tree efficiently withonly two scans of the database.2. Four strategies are proposed for efficient construction ofUP-Tree and the processing in UP-Growth. By thesestrategies, the estimated utilities of candidates can be ell

  • 7/27/2019 A Performance Comparison Of High Utility Based Mining Algorithms

    2/5

    Internat ional Journal of Engineering Trends and Technology (IJETT) Volume 4 Issue5- May 2013

    ISSN: 2231-5381 http://www.ijettjournal.org Page 1797

    reduced by discarding the utilities of the items which areimpossible to be high utility or not involved in the searchspace. The Strategies were proposed cannot only efficientlydecrease the estimated utilities of the potential high utilityitemsets but also effectively reduce the number of candidates.

    II. PROBLEM DEFINITION

    A frequent itemset is a set of items that appears at least in apre-specified number of transactions. Formally, let I = {i1, i2, .. . , im} be a set of items and DB = {T1, T2, ..., Tn} a set oftransactions where every transaction is also a set of items (i.e.itemset). Given a minimum support threshold minSup anitemset S is frequent iff: |{T|S T, T DB, S I} / |DB|minSup . Frequent itemset mining is the first and the mosttime consuming step of mining association rules. Meanwhileof the search for find itemsets which are frequent, the anti-monotone property is used. Let D be the domain of a functionf. f has the anti-monotone property when V x, y D : x yf(y) f(x). In the case of mining frequent itemsets the anti-monotone property assures that no superset of an infrequentitemset is frequent. Consequently, infrequent candidates canbe discarded during the candidate generation phase.

    Definition 1. The utility of an item ip in the transaction Tdisdenoted as u(ip, Td) and defined asp(ip) q(ip, Td). Forexample, in Table 1, u({A}, T1) = 5 1 = 5.

    Definition 2. The utility of an itemset X in Tdis denoted asu(X, Td) and defined as ipX^Xsubset ofTd u(ip,Td)For example, u({AC}, T1) = u({A}, T1) + u({C}, T1) = 5 + 1 =6.

    Table 2. Profit TableItem A B C D E F GProfit 5 2 1 2 3 1 1

    A. Related Work

    In association with rules mining, Apriori (Agrawal andSrikant, 1995), DHP (Park et al., 1997) and partition-basedones (Lin and Dunham, 1998; Savasere et al., 1995) wereproposed to find frequent itemsets. The number ofapplications has called for the need of incremental mining due

    to the increasing use of record-based databases to which dataare being added continuously. Many algorithms like FUP(Cheung et al., 1996), FUP2 (Cheung et al., 1997) and UWEP(Ayn et al., 1999;Ayn et al., 1999) have been proposed to findfrequent itemsets in databases.The tree-based approaches such as FP-Growth [5] wereafterward proposed. Its widely recognized that FP-Growthachieves a better performance than Apriori-based approachessince it finds frequent itemsets without generating anycandidate itemset and it scans database just

    Table1.An Example databaseTID Transaction TU

    T1 (A,1)(C,1)(D,1) 8T2 (A,2)(c,6)(E,2)(G,5) 27T3 (A,1)(B,2)(C,1)(D,6)(E,1)(F,5) 30T4 (B,4)(c,3)(D,3)(E,1) 20T5 (B,2)(C,2)(E,1)(G,2) 11

    twice. A formal definition of utility mining and theoreticalmodel was proposed in Yao et al. (2004), namely MEU,where the utility is defined as the combination of utilityinformation in each transaction and remaining resources.Since it cannot depend on downward closure property ofApriori to restrict the number of itemsets to be examined, aheuristic is used to predict whether an itemset should beadded to the candidate set. So, the prediction generally highlyestimates, specially at the starting stages, where the number ofcandidates approaches the number of all the possibilities ofitems. The verification of all the possibilities is not adopt foruse, either in computing cost or in memory space cost,whenever the maximum of items is huge or the thresholdsutility is low. However this algorithm is not suitable in theterms of efficiency or scalable, it is by far the best one tosolve this specific problem.To efficiently generate HTWUIs in phase I and avoidscanning database many times, Ahmed et al. [2] proposed acompact data structure tree-based algorithm, calledIncremental High Utility Pattern [IHUP], for mining itemsetswith high utiltiy. They use an IHUP-Tree to maintain theinformation of high utility itemsets and transactions. Each andEvery node in this tree having an item name, a support count,and a TWU value. The framework of the algorithm consists of

    three steps: (1) The construction of IHUP-Tree, (2) thegeneration of HTWUIs and (3) identification of itemsets withhigh utility. The phase I of this tree In First step , items in thetransaction are rearranged in a fixed order such aslexicographic order, support descending order or TWU fromhigh to low values. Then, again the transactions are arrangedand inserted into the Incremental High Utility Pattern Tree.The diagram (figure 1) shows a global Incremental HighUtility Pattern Tree for the data source in above transactiontable, where items are arranged in the descending order ofTransaction Weighted Units. In the second step, HighTransaction Weighted Utility Itemsets are generated from theIHUP-Tree by applying the FP-Growth Mining frequent

    patterns without candidate generation algorithm [5]. Thus,High Transaction Weighted Utility Itemsets in first phase canget more efficiently without generating candidates forHTWUIs. In step 3, high utility itemsets and their utilities areidentified from the set of HTWUIs by scanning the originaldatabase once.

    III. PROPOSED METHOD

    To make easier the mining performance and keep awayscanning original database repeatedly, we use a compact tree

  • 7/27/2019 A Performance Comparison Of High Utility Based Mining Algorithms

    3/5

    Internat ional Journal of Engineering Trends and Technology (IJETT) Volume 4 Issue5- May 2013

    ISSN: 2231-5381 http://www.ijettjournal.org Page 1798

    structure, called UP-Tree to maintain the information oftransactions and high utility itemsets.

    A.The elements in UP-TreeIn UP-Tree, each node N includes N.name, N.count, N.nu,N.parent, N.hlink and a set of child nodes. The details areintroduced as follows. N.name is the item name of the node.N.count is the support count of the node [5]. N.nu is called

    node utility which is an estimate utility value of the node..parentrecords the parent node of the node. N.hlinkis a nodelink which points to a node whose item name is the same asN.name.Header table is employed to facilitate the traversal ofUP-Tree. In the header table, each entry is composed of anitem name, an estimate utility value, and a link. The linkpoints to the last occurrence of the node which has the sameitem as the entry in the UP-Tree. By following the link in theheader table and the nodes in UP-Tree, the nodes whose itemnames are the same can be traversed efficiently.

    B.Discarding global unpromising items during the

    construction of a global UP-Tree:

    The construction of UP-Tree can be performed with two scansof the actual data base. In the 1st scan of data source,Item Twu Link

    C 96E 88

    A 65

    F 61

    D 58

    the transaction utility of each transaction is computed. At thesame time, TWU of each single item is also accumulated.After scanning database once, items and their TWUs areobtained. By TWDC property, if the TWU of an item is lessthan minimum utility threshold, its supersets are unpromisingto be high utility itemsets. The item is called unpromisingitems. The below Promising item & unpromising item gives aformal definition for unpromising items and promising items.Promising item and unpromising item: An item ipis called apromising item ifTWU(ip) min_util. Otherwise, the item iscalled an unpromising item.Example 1. Consider the transaction database in Table 1 andthe profit table in Table 2. Suppose the minimum utility

    threshold(min_util) is 40. In the 1st scan of data source,Transaction Utility of the transactions and the TWUs of theitems are computed. They are shown in the last column ofTable 1 and in Table 3, respectively. As shown in Table 3,{F} and {G} are unpromising items since their TWUs are lessthan min_util. The promising items are reorganized in theheader table in the descending order of TWU. Table 4 showsthe reorganized transactions and their RTUs for the databasein Table 1. As shown in Table 4, unpromising items {F} and{G} are removed from the transactions second, third and fifthtransactions(T2, T3 andT5) respectively. The utilities of {F}

    and {G} are eliminated from the TUs of T2, T3 and T5,respectively. The remaining promising items {A}, {B}, {C},{D} and {E} in the transaction are sorted in the descendingorder of TWU. Then, we insert reorganized transactions intothe UP-Tree by the same processes as IHUP-Tree [2]. We usethe following example to describe the operation of insertion.Strategy1. Discarding global unpromising items (DGU).

    The unpromising items and their utilities are eliminated fromthe transaction utilities during the construction of a globalUP-Tree.The principle of Discarding Global Utility is to removes theinformation of having more value in Transaction WeightedUtility (TWU(ip)>min_util.) from the data base since anunpromising item plays no role in high utility itemsets andonly the supersets of promising items are likely to be highutility.

    Table 3 Items and their TWUs

    Item A B C D E F GProfit 65 61 96 58 83 30 38

    Table 4 Re-Organized transactions and their RTUsTID Transaction TU

    T1 (C,1) (A,1) (D,1) 8T2 (C,6)(E,2)(A,2) 22T3 (C,1) (E,1)( (A,1)(B,2)(D,6) 25T4 (C,3) (E,1) (B,4) (D,3) 20T5 (C,2) (E,1) (B,2) 9

    C. Generating PHUIs from the global UP-Tree by FP-

    Growth:Each node in the fig. 2., UP-Tree is associated with twonumbers: the first one is support count and the second one is

    node utility. Besides, the nodes which have the same itemnames are linked in a sequence by their node links.Comparing with the Incremental High Utility Pattern Tree infirst figure, the NU (node utilities) of the nodes in UtilityPattern Tree are less than the NU of the nodes in IHUP-Treesince reorganized transactions are inserted with RTUs insteadof TWUs.In the UP-Tree, each node {ai} to the root forms a path ({ai} {ai+1} {an}). Each path represents a commonprefix thatshared by multiple reorganized transactions. Besides,{ai}.countis the number of reorganized transactions that sharethe path and {ai}.nu is an estimate utility value for the path.

    Similar to [2], PHUIs can be generated from the UP-Tree byapplying FP-Growth .

    D. The Proposed Mining Method: UP-Growth:The details of UP-Growth for efficiently generating PHUIsfrom the global UP-Tree with two strategies, namely DLU(Discarding local unpromising items) andDLN (Decreasing local node utilities). Although strategiesDGU and DGN described in previous section can effectivelyreduce the number of candidates in phase I, they are appliedduring the construction of the global UP-Tree and cannot be

    {R}

    {C}:4, 96

    {E}:4, 88

    {A}:2, 57

    {B}:1, 30

    {D}:1, 30

    {B}:2, 31

    {D}:1, 20{A}:1, 8

    {A}:1, 8

  • 7/27/2019 A Performance Comparison Of High Utility Based Mining Algorithms

    4/5

    Internat ional Journal of Engineering Trends and Technology (IJETT) Volume 4 Issue5- May 2013

    ISSN: 2231-5381 http://www.ijettjournal.org Page 1799

    applied during the construction of the local UP-Tree. Thereason is that the individual items and their utilities are notmaintained in the conditional pattern base (sub database)referred as CPB. We cannot get the values of utility to theunpromising items in the conditional pattern base. Toovercome this problem, a nave approach is to maintain theutilities of the items in the conditional pattern base.

    Minimum item utility of an item: The utility of item ip intransaction Tdis called the minimum item utility ofip if theredoesnt exist a transaction Td such that u(ip, Td) < u(ip, Td).The minimum item utility ofipis denoted as miu(ip).

    Strategy 2. Discarding local unpromising items (DLU).The minimum item utilities of unpromising items arediscarded from path utilities of the paths during theconstruction of a local UP-Tree.

    Fig3 A UP Tree by applying strategies DGU and DGNItem Twu Link

    C 96

    E 88A 65

    F 61

    D 58

    Table 5 Minimum item utility tableItem A B C D EMinimum item utility 5 4 1 2 3

    Rationale: By the rationale of DGU strategy, in a CPB, localhigh TWU values (unpromising items) and theirutilities can be discarded. Since the minimum item utility of alocal unpromising item in a path is always equal to or lessthan its real utility in the path, we can also discard itsminimum item utility from the paths of the conditional patterntree without losing any PHUI.

    IV. EXPERIMENTAL EVALUATIONThe experiments were done on a 1.83 GHz Intel(R)Atom(TM) Processor with 1 GB RAM, and performing onWindows XP. In this paper, the proposed algorithms are

    implemented in Java language. Both synthetic and realdatasets are used to evaluate the performance of thealgorithms. Synthetic datasets were generated from the datagenerator in the parameters are described as follows: D is thetotal number of transactions; T is the average size oftransactions;Nis the number of distinct items;Iis the averagesize of maximal potential frequent itemsets.

    A. Evaluation on Synthetic DatasetsAs shown in below Table, UP+FPG generates fewercandidates than IHUP+FPG since the node utilities of thenodes in the UP-Tree are less than the IHUP-Tree. This shows

    the effectiveness of strategies DGU and DGN. By applyingstrategy DGU, global unpromising items and their utilities arediscarded from the transactions and TUs. By applying DGNstrategy, the node utilities of the nodes in a global UP-Treeare effectively decreased since the utilities of theirdescendants are discarded. In this, when the minimum utilitythreshold is less than 0.6%, the number of candidates

    generated by UP+UPG becomes smaller than UP+FPGobviously. This indicates that strategies DLU and DLN workwell and more itemsets which are impossible to be high utilityare reduced than FP-Growth when the minimum utilitythreshold is low. By applying DLU strategy, localunpromisingitems are removed from the paths of conditional pattern baseand their minimum item utilities are eliminated from the pathutilities. By applying DLN, the node utilities of the nodes in alocal UP-Tree are decreased since they discard the minimumitem utilities of their descendants.

    Table: Number of candidates on T10I6D100KMinimum Utility IHUP+FPG UP+FPG UP+UPG

    0.2% 20,651 15,057 10,6640.4% 4,003 2,347 1,9900.6% 1,684 910 8440.8% 873 527 5211.0% 566 411 411

    V. CONCLUSIONS

    In this paper, we have proposed an efficientalgorithm named UP-Growth for mining high utility itemsetsfrom transaction databases. A data structure named UP-Tree

    {R}

    {C}:4, 96

    {E}:4, 88

    {A}:2, 57

    {B}:1, 30

    {D}:1, 30

    {B}:2, 31

    {D}:1, 20{A}:1, 8

    {A}:1, 8

  • 7/27/2019 A Performance Comparison Of High Utility Based Mining Algorithms

    5/5

    Internat ional Journal of Engineering Trends and Technology (IJETT) Volume 4 Issue5- May 2013

    ISSN: 2231-5381 http://www.ijettjournal.org Page 1800

    is proposed for maintaining the information of high utilityitemsets. Hence, the potential high utility itemsets can beefficiently generated from the UP-Tree with only two scans ofthe database. Besides, we develop four strategies to decreasethe estimated utility value and enhance the miningperformance in utility mining. In the experiments, both ofsynthetic and real datasets are used to evaluate the

    performance of our algorithm. The mining performance isenhanced significantly since both the search space and thenumber of candidates are effectively reduced by the proposedstrategies. The experimental results show that UP-Growthoutperforms the state-of-the-art algorithms significant, whenthe data source contains huge amount of transactions

    REFERENCES

    [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules.In Proc. of the 20th Int'l Conf. on VeryLarge Data Bases, pp. 487-499, 1994.[2] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee.Efficient tree structures for high utility pattern mining in incremental

    databases. InIEEE Transactions on Knowledge and Data Engineering, Vol.21, Issue 12, pp. 1708-1721, 2009.[3] R. Chan, Q. Yang, and Y. Shen. Mining high utility itemsets. In Proc. ofThird IEEE Int'l Conf. on Data Mining , pp. 19-26, Nov., 2003.[4] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of highutility itemsets from large datasets. In Proc. of PAKDD 2008, LNAI 5012,pp.554-561.[5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidategeneration. In Proc. of the ACM-SIGMOD Int'l Conf. on Management ofData, pp. 1-12, 2000.[6] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategyfor discovering high utility itemsets. InData & Knowledge Engineering , Vol.64, Issue 1, pp. 198-217, Jan., 2008.[7] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets miningalgorithm. In Proc. of the Utility-Based Data Mining Workshop, 2005.[8] B.-E. Shie, V. S. Tseng, and P. S. Yu. Online mining of temporal maximal

    utility itemsets from data streams. In Proc. of the 25th Annual ACMSymposium on Applied Computing, Switzerland, Mar., 2010.[9] H. Yao, H. J. Hamilton, L. Geng, A unified framework for utility-basedmeasures for mining itemsets. In Proc. of ACM SIGKDD 2nd Workshop onUtility-Based Data Mining, pp. 28-37, USA, Aug., 2006.[10] S.-J. Yen and Y.-S. Lee. Mining high utility quantitative associationrules. In Proc. of 9th Int'l Conf. on Data Warehousing and KnowledgeDiscovery, Lecture Notes in Computer Science 4654, pp. 283-292, Sep.,2007.[11] Frequent itemset mining implementations repository,http://fimi.cs.helsinki.fi/ 262

    BIOGRAPHIES

    Mr.Satyanarayana Mummana is workingas an Asst. Professor in Avanthi Institute ofEngineering & Technology,Visakhapatnam, Andhra Pradesh. He hasreceived his Masters degree (MCA) fromGandhi Institute of Technology andManagement (GITAM), Visakhapatnamand M.Tech (CSE) from Avanthi Institute

    of Engineering & Technology, Visakhapatnam. AndhraPradesh. His research areas include Image Processing,Computer Networks, Data Mining, Distributed Systems,Cloud Computing.

    Mr. S.Dileep, Completed his B.tech in SriPrakash College of Engineering. Later he isstudying M.Tech(Software Engineering) inAvanthi Institute of Engineering andtechnology. His interested areas are datamining and My SQL Database with .net

    technologies