

A Novel Technique For Mining Closed Frequent Itemsets Using Variable Sliding Window

Vikas Kumar, Department of Computer Science and Engineering,

Ajay Kumar Garg Engineering College, Ghaziabad, India

Sangita Rani Satapathy, Department of Computer Science and Engineering,

Ajay Kumar Garg Engineering College, Ghaziabad, India

Abstract— Frequent itemset mining over dynamic data is an important problem in the context of knowledge discovery and data mining. Various data stream models are used for mining frequent itemsets. In a data stream model the data arrive at high speed, so the mining algorithms must process them under strict constraints of time and space. Owing to its emphasis on recent data and its bounded memory requirement, the sliding window model is widely used for mining frequent itemsets over data streams.

In this paper we propose an algorithm named Variable-Moment for mining both frequent and closed frequent itemsets over a data stream. The algorithm is well suited to capturing recent changes in the set of frequent itemsets because its window size is variable and is determined dynamically from the extent of concept drift occurring within the arriving data stream. The window expands when there is no concept drift in the arriving stream and shrinks when a concept change occurs. Relative support, rather than absolute support, is used to make the variable window effective. The algorithm uses an in-memory data structure to store frequent itemsets; the structure is updated whenever a batch of transactions is added to or deleted from the sliding window, so that exact frequent itemsets are produced. Extensive experiments on both real and synthetic data show that our algorithm effectively detects concept changes and adapts to the new concept along the data stream by adjusting its window size.

Keywords— data stream; concept drift; closed frequent itemset; variable window.

I. INTRODUCTION Today we are all overloaded with information. Many organizations produce an electronic record of essentially every transaction they are involved in, resulting in tens or hundreds of millions of records being produced every day. The quantity of information available to us is so huge that it hinders rather than helps us. For example, in a single day Wal-Mart records 20 million sales transactions, Google handles 200 million searches, and AT&T produces 270 million call records. Scientific data collection (e.g., by earth-sensing satellites or astronomical observations) routinely produces gigabytes of data per day. By providing and handling information,

computers were expected to ease our lives, but instead they are causing an information glut. Out of this data morass has emerged the discipline of data mining; that is, data mining can be thought of as a family of algorithms for discovering valuable information in huge databases. Current algorithms for mining complex models from huge databases cannot mine even a fraction of this data in useful time. Overcoming this state of affairs requires a shift in our frame of mind from mining databases to mining data streams. In this paper we focus on discovering frequent itemsets from data streams. Mining frequent itemsets [1] is one of the fundamental tasks in data mining, and many other data mining theories and tasks stem from this idea. First, what is an itemset? The term was coined for data mining and denotes a set of items that together represent a single entity. A frequent itemset is an itemset that occurs frequently in a data stream; how frequent is frequent enough is set by a parameter called the support threshold. Support can be expressed in two ways: absolute support, the number of transactions in the data stream in which an item or itemset occurs, and relative support, the fraction of transactions in which it occurs. Hence a frequent itemset is one that occurs in at least a user-specified percentage of transactions. An important property of frequent itemsets is anti-monotonicity, which states that every subset of a frequent itemset is itself frequent. In a huge transaction database containing a large number of frequent itemsets, mining every frequent itemset is not a good option: a frequent itemset of size n alone gives rise to 2^n - 1 non-empty frequent subsets. In many cases we only want to know whether an itemset is frequent or not. In such scenarios we may discover only the maximal frequent itemsets (MFIs) [12], i.e. itemsets none of whose proper supersets are frequent. The single downside of mining only MFIs is that, although we know all their subsets are frequent, we do not know the exact support of those subsets. To address this problem, the closed frequent itemset (CFI) was proposed: an itemset is a CFI if none of its immediate supersets has the same support as it does.
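To make these notions concrete, the following small Python sketch (ours, purely illustrative; the toy transactions and helper names are assumptions, not from the paper) enumerates the frequent, closed and maximal itemsets of a four-transaction window:

from itertools import combinations

transactions = [{1, 2, 5}, {2, 5}, {1, 2, 5}, {2, 5, 7}]   # hypothetical window
min_sup = 0.5                                              # relative support

def support(itemset):
    # relative support: fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        s = support(frozenset(c))
        if s >= min_sup:
            frequent[frozenset(c)] = s

# closed: no proper superset has the same support; maximal: no frequent superset
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
maximal = [x for x in frequent if not any(x < y for y in frequent)]
# here closed = [{2,5}, {1,2,5}] with supports 1.0 and 0.5, maximal = [{1,2,5}]

The sketch illustrates that every maximal itemset is closed, while the closed itemsets additionally retain the exact supports of all frequent itemsets.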


Mining frequent itemsets in a data stream poses many challenges. Stream data is continuous and potentially unbounded. The data elements arrive at rapid rates, and each element can be examined only once. Moreover, although data are generated continuously, the available memory is limited. Finally, mining results must be produced as quickly as possible. Because of the dynamic nature of the incoming data, the phenomenon of concept change arises in stream processing: concept change (or drift) occurs when the underlying distribution of the data changes, which in turn changes the set of frequent itemsets.

II. RELATED WORK A number of representative state-of-the-art algorithms for mining frequent itemsets and closed frequent itemsets over data streams have been proposed. In this paper we organize these algorithms into three categories based on the window model they adopt: the landmark window model, the time-fading model or the sliding window model. The algorithms also fall into two categories, exact or approximate, and approximate algorithms are further classified as false-positive or false-negative approaches.

A. Mining over landmark window The problem of frequent itemset mining over data streams was first introduced by Manku and Motwani [10]. They proposed two algorithms: Sticky Sampling, a probabilistic algorithm, and Lossy Counting, a deterministic one, for computing an approximate set of FIs over the entire history of a stream. The output consists of all frequent itemsets with no false negatives, and a relaxed minimum support threshold is used to control the quality of the approximation. Li et al. [16] proposed another single-pass frequent itemset mining algorithm, called DSM-FI (Data Stream Mining for Frequent Itemsets). They developed an in-memory data structure, the Item-suffix Frequent Itemset forest (IsFI-forest), for storing essential information about the frequent itemsets of the streaming data seen so far. DSM-FI outperforms Lossy Counting in both execution time and memory usage. Yu et al. [18] later proposed an algorithm derived from the Chernoff bound. The two algorithms above, by Manku and Motwani and by Li et al., are false-positive approaches, i.e. they return a set of itemsets that includes all frequent itemsets but also some infrequent ones, which leads to the problem of exponential explosion. Yu et al.'s FDPM algorithm instead uses a false-negative-oriented approach to control the memory requirements and the error in the estimated frequency of mined itemsets.

B. FIs mining over time-fading model Since the effect of newer transactions should outweigh that of older ones, Chang and Lee [3] proposed the estDec algorithm, which uses a decay rate r (0 < r < 1) to diminish the effect of obsolete transactions on the mining results. A prefix data structure maintains all the itemsets generated from the data stream. However, the algorithm does not output the exact mining result and is a false-positive approximation. Lee and Lee extended Chang and Lee's algorithm to approximate the set of maximal frequent itemsets, proposing another compressed data structure, the CP-tree, which

reduces memory consumption. Giannella et al. [6] proposed another model to approximate the counts of frequent itemsets. They use a tilted-time window model, which assigns coarser granularity to older frames and finer granularity to recent frames; a data structure called the FP-stream maintains the itemsets. The tilted-time window model allows users to answer more expressive time-sensitive queries. A drawback of the approach is that the FP-stream can become very large over time, and updating and scanning such a large structure may degrade the mining throughput.

C. FIs mining over sliding window model Chang and Lee [14] proposed the estWin algorithm to maintain frequent itemsets over a sliding window. A prefix data structure maintains the generated frequent itemsets. The algorithm does not produce exact output and is a false-positive approximation. Later, Chi et al. [5] proposed the Moment algorithm to effectively mine closed frequent itemsets from data streams; its merit is that it produces exact output with no approximation. To maintain a dynamically selected set of itemsets over the sliding window, they designed an in-memory prefix-tree-based structure called the CET (Closed Enumeration Tree), which also monitors the itemsets that form the boundary between closed frequent itemsets and other itemsets. The algorithm suffers from a number of shortcomings: first, it stores the sliding window transactions in an FP-tree, which requires a considerable amount of memory; second, the ratio of closed frequent nodes to the total number of CET nodes is very low. To overcome these limitations, Fatemeh Nori et al. [17] proposed the TMoment algorithm for mining closed frequent itemsets. They designed an effective and efficient data structure, the TCET (Transaction Translate Closed Enumeration Tree), which maintains both itemset and transaction information.

III. PROBLEM STATEMENT Traditional mining techniques give good results in static environments; however, owing to the overwhelming volume of data streams and their distinctive feature, concept drift, they fail to process data streams successfully. The recognition of these features has led to sliding-window approaches that model a forgetting process, which limits the amount of data to be processed and makes it possible to react to changes. A sliding window model should be able to react to a changing concept by forgetting outdated data while learning the new class descriptions. The data range considered is determined by the window size, which is usually supplied by the user and remains fixed during the mining process. This approach, based on a fixed window size, is caught in a trade-off between flexibility and stability: a larger window may lead to a sluggish but stable and well-trained model, while a small window makes the system react quickly to changes but may yield low accuracy because the window contains insufficient data. In this paper, by contrast, we define a sliding window model with a variable window size that changes with the content of the window. We now provide some key terms for frequent itemset mining over a data stream. Let I = {i1, i2, i3, ..., in} be a set of literals, called items. An itemset may be defined as a


set X = {im, ..., io} ⊆ I, where m ≤ o and m, o ∈ [1, n]. An itemset with k items is called a k-itemset. A transaction t = (tid, C) is a tuple where tid is a transaction id and C is an itemset; if X ⊆ C, itemset X is said to be contained in transaction (tid, C). A transactional data stream TDS is an infinite sequence of transactions, TDS = (T1, T2, ..., Tm), with Tm the latest transaction. A window W is the set of all transactions between the xth and yth (y > x) arrivals, and the number of transactions between these arrivals is the size of the window. Within W, transactions are assumed to arrive one at a time, so the sliding window moves forward by inserting the newest transaction at one end and deleting the oldest transaction from the other. For efficiency, however, instead of sliding by a single transaction we slide by panes (batches), i.e. a fixed number of transactions, so that each window W is composed of a number of equal-sized, non-overlapping sets of transactions called panes. The current window over the transactional data stream is defined as SW(m-|W|+1) = [T(m-|W|+1), T(m-|W|+2), ..., Tm], where Ti is the ith received transaction and m - |W| + 1 is the window's identifier.

TABLE 1. Processing a transactional data stream using the sliding window model with window size 6 and pane size 3

Tid   Items
 1    2,5,7,9
 2    1,2,5
 3    1,3,5,8
 4    2,3,5
 5    3,5,8,9
 6    1,2,3,5,8
 7    5,8
 8    1,5
 9    8
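For illustration, a minimal Python sketch of the pane-based window of Table 1 (the variable names and the deque-based layout are our own assumptions, not the paper's implementation):

from collections import deque

PANE_SIZE, PANES_PER_WINDOW = 3, 2        # window size 6, as in Table 1
window = deque(maxlen=PANES_PER_WINDOW)   # appending drops the oldest pane

stream = [(1, {2, 5, 7, 9}), (2, {1, 2, 5}), (3, {1, 3, 5, 8}),
          (4, {2, 3, 5}), (5, {3, 5, 8, 9}), (6, {1, 2, 3, 5, 8}),
          (7, {5, 8}), (8, {1, 5}), (9, {8})]

pane = []
for tid, items in stream:
    pane.append((tid, items))
    if len(pane) == PANE_SIZE:            # a full pane slides into the window
        window.append(pane)
        pane = []
# SW1 covered tids 1-6 (panes 1 and 2); after the third pane arrives,
# SW2 covers tids 4-9 and pane 1 has been forgotten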

IV. VARIABLE-MOMENT ALGORITHM In this section we describe the proposed algorithm, Variable-Moment. It uses an in-memory prefix tree data structure that maintains both itemset and transaction information; the structure is updated whenever a pane of transactions is added to or deleted from the sliding window. The addition of a pane not only yields new frequent itemsets but may also turn a previously frequent itemset non-frequent. Likewise, the deletion of a pane can cause both types of node change, from non-frequent to frequent and vice versa.

A. Window initialization and computing support using tidlists of items

The window is set up by defining the initial window size along with the other parameters: the pane size, the minimum change threshold and the minimum support threshold. Tidlists, proposed by Zaki [11], are maintained for the items of the arriving data stream. The support of an item is calculated from the tidlist of that item, which holds the tids of the transactions within the window that contain it. In Variable-Moment the relative support of an item is computed instead of its absolute support: since the window size is not fixed, the support is calculated with respect to the window size at that moment, i.e. as the fraction of transactions containing the item. Likewise, the support of a k-itemset is calculated from its (k-1)-subsets by intersecting their tidlists.
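As an illustrative sketch (assuming tidlists are kept as Python sets; the item tidlists below are read off Table 1):

# item tidlists read off Table 1 (window of 9 transactions)
tidlists = {1: {2, 3, 6, 8}, 2: {1, 2, 4, 6}, 5: {1, 2, 3, 4, 5, 6, 7, 8}}
window_size = 9

def rel_support(itemset):
    # intersect the tidlists of the items; relative support is the size of
    # the intersection divided by the current (variable) window size
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids) / window_size

print(rel_support({1, 2}))   # {1,2} occurs in tids 2 and 6 -> 2/9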

B. Building Prefix Tree Data Structure After the window is initialized, the frequent itemsets are mined and stored in the prefix tree data structure, which is then updated after every pane insertion and deletion to keep mining efficient. The prefix tree is constructed after window initialization using the transaction lists of single items. To build the tree, we first create a root node ø. At the first level of the tree we store every single item present in the window, together with its tidlist, regardless of whether it is frequent or not. Two hash tables store the closed frequent itemsets and the frequent itemsets respectively. Building the data structure is a recursive procedure that, at each call, computes the support of an itemset by intersecting the transaction lists of its items; the procedure builds the prefix tree for the first window of the data stream. In this procedure, ni is the node to be processed, |W| is the size of the initial window and s is the minimum support threshold set by the user. As shown in the sub-steps of 4.1, for frequent itemsets new children are generated, each with a transaction list formed by intersecting the list of a sibling with the list of its parent (steps 4.1.1 to 4.1.4). For every new itemset, the procedure Build is called recursively to insert nodes at the next lower level (steps 4.1.5 and 4.1.6). In the last four steps, closed frequent and frequent itemsets are distinguished and stored in the hash tables.
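The recursive expansion can be sketched as follows (an illustrative outline under our own class layout; it mirrors steps 4.1.2 to 4.1.6 of Fig. 1 but is not the authors' exact structure):

class Node:
    def __init__(self, itemset, tids):
        self.itemset, self.tids, self.children = itemset, tids, []

def build(node, siblings, window_size, min_sup):
    # steps 4.1.2-4.1.4: join the node with each frequent sibling; the
    # child's tidlist is the intersection Tid(ni) ∩ Tid(nj)
    for sib in siblings:
        tids = node.tids & sib.tids
        if len(tids) / window_size >= min_sup:   # keep frequent children only
            node.children.append(Node(node.itemset | sib.itemset, tids))
    # steps 4.1.5-4.1.6: recurse one level down, each child joined with the
    # siblings that follow it
    for i, child in enumerate(node.children):
        build(child, node.children[i + 1:], window_size, min_sup)
    # the closed/frequent classification and hash-table bookkeeping of
    # steps 4.1.7-4.1.10 are omitted from this sketch

# usage sketch: level1 holds one node per single item with its tidlist;
# for i, n in enumerate(level1): build(n, level1[i + 1:], window_size, min_sup)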

C. Inserting new pane of transactions After window initialization is complete, each newly arrived transaction is added to the window by updating the supports of all related frequent patterns. For a new item contained in a newly arrived transaction, a new node is created and the tid of the transaction is added to its list. To correctly identify all new frequent itemsets, we must monitor all single items present within the existing window by updating their support counts. The prefix tree is scanned and updated after a full pane insertion. Based on the window size and the minimum support threshold, all non-frequent itemsets, besides infrequent single items, are pruned. The procedure for adding new transactions is shown in steps 6-8 of the algorithm. In steps 6 and 7, a new pane is added and, after each transaction of the new pane is inserted, the current set of frequent itemsets is updated. In step 8, the Add procedure is called to update the prefix tree. Steps a to f of the Add procedure are like the Build procedure; the difference lies in lines g to o, in which the supports of the previously frequent and closed frequent itemsets are updated in their corresponding hash tables.
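A minimal sketch of the per-transaction tidlist update (illustrative names; the pruning and tree re-scan are only indicated in comments):

def insert_transaction(tid, items, tidlists):
    # append the new tid to the tidlist of every item in the transaction;
    # a previously unseen item gets a fresh tidlist (i.e. a new node)
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

tidlists = {}
new_pane = [(10, {1, 5}), (11, {2, 5}), (12, {5, 8})]   # hypothetical pane
for tid, items in new_pane:
    insert_transaction(tid, items, tidlists)
# after the full pane, the prefix tree is re-scanned, supports are updated
# against the current window size, and non-frequent itemsets are pruned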

D. Window resizing and elimination of stale transactions Concept drift, which changes the set of frequent itemsets because the underlying distribution of the data stream changes, plays a vital role in determining the window size. For a mining model to reflect recent changes, the old concept must be replaced by the new concept as soon as drift occurs. The difference between two concepts is determined by the amount of change in the set of frequent itemsets, and a minimum change threshold, defined by the user at the start, helps determine whether concept drift occurs or not.

[Figure: sliding windows SW1 and SW2 formed from panes 1 and 2 along the time axis, cf. Table 1]

The concept drift between two sets of frequent itemsets, Ft1 and Ft2, observed at times t1 and t2 (t2 > t1), is calculated as follows:

Concept drift = (|F+| + |F-|) / (|F+| + |F|)

where F+ is the set of newly emerged frequent itemsets at time t2 with respect to time t1, i.e. Ft2 - Ft1; F- is the set of itemsets that were frequent at time t1 but are infrequent at time t2; and F is the set of itemsets frequent at time t1, so that the denominator counts the itemsets frequent at either time. As the data stream passes, the window shrinks or expands depending on the observed change in the set of frequent itemsets. If the change is greater than the specified minimum change threshold, then the size of the

window shrinks: all transactions before a specific point, called the checkmark, are eliminated by removing the corresponding information from the tidlists. The checkmark is initially positioned at the last tid of the initialized window, but it moves forward to wherever a new concept drift is detected. Furthermore, all information related to the eliminated transactions is removed from the prefix tree. Otherwise, if the change does not exceed the threshold, the checkmark stays where it is, no transactions are eliminated, the prefix tree is left unchanged, and processing of the data stream continues.
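The drift test and the stale-transaction elimination can be sketched as follows (illustrative Python; we read |F| in the formula above as |Ft1|, and all names are placeholders):

def drift_ratio(F_t1, F_t2):
    # F+ = Ft2 - Ft1 (newly frequent), F- = Ft1 - Ft2 (no longer frequent);
    # the denominator |F+| + |Ft1| counts itemsets frequent at either time
    f_plus, f_minus = len(F_t2 - F_t1), len(F_t1 - F_t2)
    return (f_plus + f_minus) / (f_plus + len(F_t1))

def eliminate_stale(tidlists, checkmark):
    # drop every tid up to and including the checkmark from all tidlists
    for item, tids in tidlists.items():
        tidlists[item] = {t for t in tids if t > checkmark}

# placeholder usage: if drift_ratio(F_t1, F_t2) > MTh, the window shrinks:
# eliminate_stale(tidlists, checkmark); checkmark = latest tid of the window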

INPUT:
W: initial window size
P: pane size
s: minimum support threshold
MTh: minimum change threshold
TD: transactional database
OUTPUT:
FI: frequent items/itemsets
CFI: closed frequent itemsets (after every concept drift occurrence)

Begin
Step 1: Initialize the window by setting window size W and pane size P.
Step 2: Build the tidlists and compute supports.
Step 3: Mark the last transaction tid of the initialized window as the checkmark.
Step 4: Mine the frequent and closed frequent itemsets and store them in the prefix tree data structure.
Step 4.1: Build(ni, |W|, s)
Step 4.1.1: if support(ni) > s then
Step 4.1.2: for every frequent sibling nj of ni
Step 4.1.3: nk = node(ni, nj)
Step 4.1.4: Tid(nk) = Tid(ni) ∩ Tid(nj)
Step 4.1.5: for every child ni' of ni
Step 4.1.6: Build(ni', |W|, s)
Step 4.1.7: if support(ni') = support(ni) then
Step 4.1.8: ni and ni' belong to the closed frequent itemsets
Step 4.1.9: insert ni and ni' into hash table 1
Step 4.1.10: else insert ni into hash table 2 and ni' into hash table 1
Step 5: Repeat the process of adding and deleting transactions until the end of the data stream.
Step 6: for each transaction T of the new pane (1 to P)
Step 7: update the tidlists: W = W ∪ {T}
Step 8: Update the data structure and the corresponding frequent and closed frequent itemsets.
Step 8.1: Add(ni, |W|, s)
a. if support(ni) > s then
b. for each frequent sibling nj of ni
c. Tid(nk) = Tid(ni) ∩ Tid(nj)
d. for each child ni' of ni do
e. Add(ni', |W|, s)
f. if support(ni') = support(ni)
g. if ni is a frequent itemset then
h. ni' is a closed frequent itemset
i. insert ni' into hash table 1
j. else
k. ni belongs to the closed frequent itemsets
l. remove ni from hash table 1 and insert ni' into hash table 1 as a closed frequent itemset
m. else (support(ni) <= s)
n. if ni was a closed frequent or frequent itemset
o. ni becomes non-frequent; eliminate ni from its corresponding hash table
end for
Step 9: Itemsets that become non-frequent as a result of the pane insertion are stored in NIF: NIF = Prune(FItemsets, |W|, s)
Step 10: Newly infrequent itemsets are added to F-: F- = F- ∪ NIF
Step 11: Newly emerged frequent itemsets are stored in NF: NF = Insert(FItemsets, |W|, s)
Step 12: Newly emerged frequent itemsets are added to F+: F+ = F+ ∪ NF
Step 13: if (|F+| + |F-|) / (|F+| + |F|) > MTh then
Step 14: resize the window by eliminating all transactions before the checkmark and updating the tidlists: Eliminate = Delete(CP, W)
Step 15: move the checkmark to the latest transaction: checkmark = last Tid(W)
Step 16: update the frequent and closed frequent itemsets after eliminating stale transactions: Delete(FItemsets, stale)
Step 17: Delete(ni, |W|, s)
a. if support(ni) > s
b. for each frequent sibling nj of ni
c. Tid(nk) = Tid(ni) ∩ Tid(nj)
d. for each child ni' of ni do
e. Delete(ni', |W|, s)
f. else if support(ni) < s
g. if ni belongs to the frequent or closed frequent itemsets
h. ni becomes non-frequent; eliminate ni from its corresponding hash table
Step 18: set F+ and F- to ø
end if

Fig. 1. The Variable-Moment algorithm

V. EXPERIMENTAL RESULTS In this section we report three sets of experiments, using real and synthetic datasets, that empirically evaluate the performance of the proposed algorithm. We implemented Variable-Moment in Java on a 2.4 GHz CPU with 2 GB of main memory running Windows 8. In the first experiment, using a synthetic dataset, we show how the window resizes on detecting a concept change within the input data stream; for this we need a dataset with a concept drift at a specific point. The initial window size, pane size, minimum support threshold and minimum change threshold were set to 20K transactions, 10K transactions, 0.3 and 0.1 respectively. The first checkmark is placed after window initialization (i.e. at 20K transactions). Figure 2 shows the amount of concept change detected after each pane insertion with respect to the number of transactions. The window size shown in the figure keeps increasing until a concept change is detected. As shown in figure 2a, at the 150Kth and 200Kth transactions the amount of concept change exceeds the given minimum change threshold and a considerable amount of change is detected. The window is resized by eliminating all obsolete transactions: at the 150Kth transaction the window is resized to 130K (150 - 20) by deleting all transactions before the checkmark, and the checkmark is moved to the new position where the concept change was detected, i.e. the 150Kth transaction. In the same manner, another concept change is detected at the 200Kth transaction and the window is reduced to 50K (180 - 130) by deleting all obsolete information before the checkmark. In this experiment the concept change is detected as the new concept is gradually introduced into the window; deleting the obsolete transactions twice, at the 150Kth and 200Kth transactions, completely replaces the old concept with the new one. A question may arise: do the values of the initial window size and pane size affect the mining process and window resizing? To examine this, we repeated the above experiment with an initial window size of 50K and a pane size of 25K transactions, keeping the same minimum support and change threshold values. As shown in figure 2b, the concept changes are again detected at the 150Kth and 200Kth transactions and ultimately lead to the same window size.


In the second experiment we use a real dataset to see how the algorithm resizes the window based on the detected concept change value. To perform this experiment we need to know a specific position in the dataset where a concept change occurs; since no real dataset provides this information, we instead verify the window resizing of our algorithm for different values of the minimum change threshold. The initial window size, pane size and minimum support were set to 10K transactions, 5K transactions and 0.3 respectively, and three different values of the change threshold were used to show its effect on concept change detection and window size. Figure 3 shows the effect of different minimum change thresholds on the window size and mining results. In sub-figure 3a, a low minimum change threshold of 0.1 causes the window to be resized after every pane insertion, resulting in a small window of 5K, i.e. equal to the pane size. In sub-figure 3b, the threshold is set to 0.3; the amount of change exceeds this value twice and the window is resized accordingly. In sub-figure 3c, with a change threshold of 0.5, no concept change is detected, resulting in a larger window. This experiment shows that window resizing and concept change detection depend on the underlying properties of the data stream: a higher change threshold leads to fewer concept change detections and a larger window, whereas a lower threshold leads to more concept changes and a smaller window during data stream mining.

Fig. 2a. Pattern of change ratio and window size with pane size = 10 K and initial window size = 20 K

Fig. 2b. Pattern of change ratio and window size with pane size = 25 K and initial window size = 50 K

Fig. 3a. Change threshold = 0.1


In our third experiment we show the adaptability of our proposed algorithm with respect to recent frequent itemsets. To measure how the algorithm adapts to a new concept as new transactions are added, we use the coverage rate of frequent itemsets. Given two mutually exclusive sets of items A and B, the coverage rate CR(A) of part A is defined as follows [13]:

CR(A) = (|F(A)| / |F(A ∪ B)|) × 100

Here |F(A)| represents the total number of frequent itemsets composed of items in A. For this experiment we used the dataset T5.I4.D1000K-AB, composed of two consecutive parts A and B with no common item between them; each part contains 500K transactions, and the two parts together contain 1000 items. We compared the adaptability of Variable-Moment, with its variable window size, against the estDec [3] algorithm, which uses a time-decay model, and the SWIM [15] algorithm, which uses a fixed-size sliding window. For all algorithms the minimum support was set to 0.001. The initial window size for Variable-Moment was set to 400K transactions with a pane size of 1000 transactions; as the unit of insertion in SWIM is also a pane, its pane size was likewise 1000 transactions. In each algorithm the coverage rate is governed by one parameter: the decay rate for estDec (defined by a decay base b, set to 2, and a decay base life h, for which different values were used), the window size for the fixed sliding window model, and the minimum change threshold for our proposed Variable-Moment algorithm. Figure 4 shows the coverage rate of all three algorithms, with the X axis giving the number of passed transactions and the Y axis the coverage rate. All algorithms eventually reach a 100% coverage rate, but the number of transactions required to achieve it differs. Variable-Moment reaches a 100% coverage rate more quickly when the minimum change threshold is low; as the threshold increases, more transactions are required.
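Under this definition the coverage rate can be computed as follows (illustrative sketch; since A and B partition the items, F(A ∪ B) is simply the whole current frequent collection):

def coverage_rate(frequent_itemsets, A):
    # |F(A)|: currently frequent itemsets drawn entirely from the items of A;
    # dividing by the size of the whole collection gives CR(A) as a percentage
    from_A = [x for x in frequent_itemsets if set(x) <= A]
    return 100.0 * len(from_A) / len(frequent_itemsets)

# as the window slides from part A into part B of T5.I4.D1000K-AB,
# coverage_rate(current_frequents, items_of_B) rises toward 100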

For the estDec algorithm, the lower the decay base life h, the higher the coverage rate: the algorithm adapts to the new concept more rapidly as h becomes smaller. Comparing Variable-Moment with estDec, it can be seen that estDec is less adaptive than our proposed algorithm for all values of the decay base life h and the minimum change threshold MTh. Figure 4c shows the coverage rate behavior of the sliding window algorithm SWIM. SWIM is superior to the other algorithms in terms of coverage rate when the window is small (100K). In this scenario, however, once the active window resides completely on part B, the deletion of transactions does not stop even though the concept is stable; the mining result is therefore only an approximation with respect to the current concept. In Variable-Moment, by contrast, after a change is detected and a 100% coverage rate is achieved, no transaction of the new concept is neglected or deleted, so the mining result is more accurate and better represents the current concept.

Fig. 3b. Change threshold = 0.3

Fig. 3c. Change threshold = 0.5

Fig. 4a. Coverage rate of frequent itemsets of Variable-Moment for different change thresholds

Fig. 4b. Coverage rate of frequent itemsets of estDec for different decay base life

Fig. 4c. Coverage rate of frequent itemsets of SWIM for different window sizes


VI. CONCLUSION AND FUTURE WORK Given the continuous nature of a data stream, traditional techniques for finding frequent itemsets in conventional data mining may not be valid for a data stream: we cannot consider the whole data and must identify when data become obsolete or invalid, since old information in a data stream may no longer be useful, or may even be invalid, at present. To support the various requirements of mining data streams, the mining window, i.e. the interesting recent range of a data stream, should not be defined statically but must be flexible; based on this range, a mining method can identify when a transaction becomes stale and should be disregarded. In this paper we have investigated the problem of mining frequent itemsets over a data stream using a flexible-size sliding window model and have proposed a new algorithm for this problem. The size of the sliding window is adaptively adjusted based on the amount of concept change observed in the underlying properties of the incoming data stream: the window enlarges when no significant change is observed, and it shrinks when a considerable amount of concept change, i.e. a significant change in the set of frequent itemsets, occurs. The window size is controlled by the minimum change threshold given by the user. After every pane insertion the set of frequent itemsets is updated and the amount of concept change is calculated; if it exceeds the given minimum change threshold, the window is shrunk by deleting all obsolete information before a defined point called the checkmark. Experimental results show that our algorithm tracks concept change efficiently while mining the data stream and is more adaptive to recent frequent itemsets than fixed-size sliding window models or time-fading window models. For future work, we intend to enhance the approach by using fuzzy sets for the minimum change threshold, so that values such as low, medium, high and very high can be used instead of a crisp value in the range 0 to 1.

REFERENCES
[1] C. Aggarwal, "A framework for diagnosing changes in evolving data streams," in Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 575-586, 2003.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. VLDB Int. Conf. on Very Large Data Bases, pp. 487-499, 1994.
[3] J. Chang and W. S. Lee, "Finding recently frequent itemsets adaptively over online transactional data streams," Information Systems, 31(8), pp. 849-869, 2006.
[4] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Data Mining and Knowledge Discovery, 15(1), pp. 55-86, 2007.
[5] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, "Catch the moment: maintaining closed frequent itemsets over a data stream sliding window," Knowledge and Information Systems, 10(3), pp. 265-294, 2006.
[6] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, "Mining frequent patterns in data streams at multiple time granularities," in H. Kargupta et al. (eds.), Data Mining: Next Generation Challenges and Future Directions, MIT/AAAI Press, 2004.
[7] V. Kumar and S. Satapathy, "A review on algorithms for mining frequent itemsets over data stream," IJARCSSE, 3(4), 2013.
[8] S. K. Tanbeer, C. F. Ahmed, B.-S. Jeong, and Y.-K. Lee, "Sliding window-based frequent pattern mining over data streams," Information Sciences, 179(22), pp. 3843-3865, 2009.
[9] P. S. M. Tsai, "Mining frequent itemsets in data streams using the weighted sliding window model," Expert Systems with Applications, 36(9), pp. 11617-11625, 2009.
[10] G. S. Manku and R. Motwani, "Approximate frequency counts over data streams," in Proc. VLDB Int. Conf. on Very Large Data Bases, pp. 346-357, 2002.
[11] M. Zaki, "Scalable algorithms for association mining," IEEE Transactions on Knowledge and Data Engineering, 12(3), pp. 372-390, 2000.
[12] H. J. Woo and W. S. Lee, "estMax: tracing maximal frequent itemsets instantly over online transactional data streams," IEEE Transactions on Knowledge and Data Engineering, 21(10), pp. 1418-1431, 2009.
[13] J.-L. Koh and C.-Y. Lin, "Concept shift detection for frequent itemsets from sliding window over data streams," in Database Systems for Advanced Applications (DASFAA Int. Workshops), LNCS, Springer-Verlag, pp. 334-348, 2009.
[14] J. H. Chang and W. S. Lee, "estWin: adaptively monitoring the recent change of frequent itemsets over online data streams," in Proc. CIKM, 2003.
[15] B. Mozafari, H. Thakkar, and C. Zaniolo, "Verifying and mining frequent patterns from large windows over data streams," in Proc. ICDE, pp. 179-188, 2008.
[16] H. Li, S. Lee, and M. Shan, "An efficient algorithm for mining frequent itemsets over the entire history of data streams," in Proc. First Int. Workshop on Knowledge Discovery in Data Streams, 2004.
[17] F. Nori, M. Deypir, and M. Sadreddini, "A sliding window based algorithm for frequent closed itemset mining over data streams," Journal of Systems and Software, 2012.
[18] J. Yu, Z. Chong, H. Lu, and A. Zhou, "False positive or false negative: mining frequent itemsets from high speed transactional data streams," in Proc. VLDB, 2004.
