A New Incremental Data Mining Algorithm Using Pre-large Itemsets*

Tzung-Pei Hong**
Department of Information Management
I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
[email protected]
http://www.nuk.edu.tw/tphong

Ching-Yao Wang
Institute of Computer and Information Science
National Chiao-Tung University
Hsinchu, 300, Taiwan, R.O.C.
[email protected]

Yu-Hui Tao
Department of Information Management
I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
[email protected]

--------------------------------------



* This is a modified and expanded version of the paper "Incremental data mining based on two support thresholds," presented at The Fourth International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, 2000, England.

** Corresponding author.


Abstract

Due to the increasing use of very large databases and data warehouses, mining

useful information and helpful knowledge from transactions is evolving into an

important research area. In the past, researchers usually assumed databases were static

to simplify data mining problems. Thus, most of the classic algorithms proposed

focused on batch mining, and did not utilize previously mined information in

incrementally growing databases. In real-world applications, however, developing a

mining algorithm that can incrementally maintain discovered information as a

database grows is quite important. In this paper, we propose the concept of pre-large

itemsets and design a novel, efficient, incremental mining algorithm based on it. Pre-

large itemsets are defined by a lower support threshold and an upper support

threshold. They act as gaps to avoid the movements of itemsets directly from large to

small and vice-versa. The proposed algorithm does not need to rescan the original

database until a number of transactions have been newly inserted. If the database has

grown larger, then the number of new transactions allowed will be larger too.

Keywords: data mining, association rule, large itemset, pre-large itemset,

incremental mining.


1. Introduction

Years of effort in data mining have produced a variety of efficient techniques.

Depending on the type of databases processed, these mining approaches may be

classified as working on transaction databases, temporal databases, relational

databases, and multimedia databases, among others. On the other hand, depending on

the classes of knowledge derived, the mining approaches may be classified as finding

association rules, classification rules, clustering rules, and sequential patterns [4],

among others. Among them, finding association rules in transaction databases is most

commonly seen in data mining [1][3][5][9][10][12][13][15][16].

In the past, many algorithms for mining association rules from transactions were

proposed, most of which were executed in level-wise processes. That is, itemsets

containing single items were processed first, then itemsets with two items were

processed, then the process was repeated, continuously adding one more item each

time, until some criteria were met. These algorithms usually considered the database

size static and focused on batch mining. In real-world applications, however, new

records are usually inserted into databases, and designing a mining algorithm that can

maintain association rules as a database grows is thus critically important.

When new records are added to databases, the original association rules may

become invalid, or new implicitly valid rules may appear in the resulting updated

databases [7][8][11][14][17]. In these situations, conventional batch-mining

algorithms must re-process the entire updated databases to find final association rules.

Two drawbacks may exist for conventional batch-mining algorithms in maintaining


database knowledge:

(a) Nearly the same computation time as that spent in mining from the original

database is needed to cope with each new transaction. If the original database

is large, much computation time is wasted in maintaining association rules

whenever new transactions are generated.

(b) Information previously mined from the original database, such as large

itemsets and association rules, provides no help in the maintenance process.

Cheung and his co-workers proposed an incremental mining algorithm, called

FUP (Fast UPdate algorithm) [7], for incrementally maintaining mined association

rules and avoiding the shortcomings mentioned above. The FUP algorithm modifies

the Apriori mining algorithm [3] and adopts the pruning techniques used in the DHP

(Direct Hashing and Pruning) algorithm [13]. It first calculates large itemsets mainly

from newly inserted transactions, and compares them with the previous large itemsets

from the original database. According to the comparison results, FUP determines

whether re-scanning the original database is needed, thus saving some time in

maintaining the association rules. Although the FUP algorithm can indeed improve

mining performance for incrementally growing databases, original databases still need

to be scanned when necessary. In this paper, we thus propose a new mining algorithm

based on two support thresholds to further reduce the need for rescanning original

databases. Since rescanning the database takes much computation time, the

maintenance cost can thus be reduced in the proposed algorithm.


The remainder of this paper is organized as follows. The data mining process is

introduced in section 2. The maintenance of association rules is described in section 3.

The FUP algorithm is reviewed in section 4. A new incremental mining algorithm is

proposed in section 5. An example is also given there to illustrate the proposed

algorithm. Conclusions are summarized in section 6.

2. The Data Mining Process Using Association Rules

Data mining plays a central role in knowledge discovery. It involves applying

specific algorithms to extract patterns or rules from data sets in a particular

representation. Because data mining is important to KDD, many researchers in

database and machine-learning fields are interested in this new research topic since it

offers opportunities to discover useful information and important relevant patterns in

large databases, thus helping decision-makers analyze data easily and make good

decisions regarding the domains in question.

One application of data mining is to induce association rules from transaction

data, such that the presence of certain items in a transaction will imply the presence of

certain other items. To achieve this purpose, Agrawal and his co-workers proposed

several mining algorithms based on the concept of large itemsets to find association

rules in transaction data [1][3][5]. They divided the mining process into two phases.

In the first phase, candidate itemsets were generated and counted by scanning the

transaction data. If the count of an itemset appearing in the transactions was larger

than a pre-defined threshold value (called the minimum support), the itemset was

considered a large itemset. Itemsets containing only one item were processed first.


Large itemsets containing only single items were then combined to form candidate

itemsets containing two items. This process was repeated until all large itemsets had

been found. In the second phase, association rules were induced from the large

itemsets found in the first phase. All possible association combinations for each large

itemset were formed, and those with calculated confidence values larger than a

predefined threshold (called the minimum confidence) were output as association

rules. We may summarize the data mining process we focus on as follows:

1. Determine user-specified thresholds, including the minimum support

value and the minimum confidence value.

2. Find large itemsets in an iterative way. The count of a large itemset must

exceed or equal the minimum support value.

3. Utilize the large itemsets to generate association rules, whose confidence

must exceed or equal the minimum confidence value.

Below, we use a simple example to illustrate the mining process. Suppose a

database with five transactions shown in Table 1 is to be mined. The database has two

features, transaction identification (TID) and transaction description (Items).

Table 1. An example of a transaction database

TID   Items
100   B E
200   A B D
300   A D
400   B C E
500   A B D E

Assume the user-specified minimum support and minimum confidence are 40%

and 80%, respectively. The transaction database is first scanned to count the candidate


1-itemsets. The results are shown in Table 2.

Table 2. Candidate 1-itemsets

Item   Count
A      3
B      4
C      1
D      3
E      3

Since the counts of the items A, B, D and E are no less than 2 (5×40% = 2), they are

put into the set of large 1-itemsets. The candidate 2-itemsets are then formed from

these large 1-itemsets as shown in Table 3.

Table 3. Candidate 2-itemsets with counts

Items   Count
AB      2
AD      3
AE      1
BD      2
BE      3
DE      1

AB, AD, BD and BE then form the set of large 2-itemsets. In a similar way,

ABD can be found to be a large 3-itemset.

Next, the large itemsets are used to generate association rules. According to the
conditional probabilities, the possible association rules generated are shown in Table 4.


Table 4. Possible association rules

Rule             Confidence
IF AB, Then D    Count(ABD)/Count(AB) = 1
IF AD, Then B    Count(ABD)/Count(AD) = 2/3
IF BD, Then A    Count(ABD)/Count(BD) = 1
IF A, Then B     Count(AB)/Count(A) = 2/3
IF B, Then A     Count(AB)/Count(B) = 2/4
IF A, Then D     Count(AD)/Count(A) = 1
IF D, Then A     Count(AD)/Count(D) = 1
IF B, Then D     Count(BD)/Count(B) = 2/4
IF D, Then B     Count(BD)/Count(D) = 2/3
IF B, Then E     Count(BE)/Count(B) = 3/4
IF E, Then B     Count(BE)/Count(E) = 1

Since the user-specified minimum confidence is 80%, the final association rules

are shown in Table 5.

Table 5. The final association rules for this example

Rule             Confidence
IF AB, Then D    Count(ABD)/Count(AB) = 1
IF BD, Then A    Count(ABD)/Count(BD) = 1
IF A, Then D     Count(AD)/Count(A) = 1
IF D, Then A     Count(AD)/Count(D) = 1
IF E, Then B     Count(BE)/Count(E) = 1
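The two-phase mining process above can be sketched in code. The following is a minimal, illustrative Python sketch (the variable names and structure are ours, not from the paper) that reproduces the rules of Table 5 from the Table 1 data:

```python
from itertools import combinations

# Transactions of Table 1; thresholds mirror the example's 40% and 80%.
transactions = [
    {"B", "E"},
    {"A", "B", "D"},
    {"A", "D"},
    {"B", "C", "E"},
    {"A", "B", "D", "E"},
]
MIN_SUP, MIN_CONF = 0.4, 0.8
n = len(transactions)

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Phase 1: level-wise search for large itemsets.
large = {}
current = list({frozenset([i]) for t in transactions for i in t})
k = 1
while current:
    level = {}
    for cand in current:
        s = count(cand)
        if s >= MIN_SUP * n:        # count must reach the minimum support
            level[cand] = s
    large.update(level)
    k += 1
    # Join step: k-item candidates from pairs of large (k-1)-itemsets.
    current = list({a | b for a in level for b in level if len(a | b) == k})

# Phase 2: association rules whose confidence reaches MIN_CONF.
rules = []
for itemset, sup in large.items():
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), r)):
            conf = sup / large[lhs]   # lhs is large by the Apriori property
            if conf >= MIN_CONF:
                rules.append((set(lhs), set(itemset - lhs), conf))
```

Running this yields exactly the five rules of Table 5, e.g. {E} → {B} with confidence 1.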

3. Maintenance of Association Rules

In real-world applications, transaction databases grow over time and the

association rules mined from them must be re-evaluated because new association rules

may be generated and old association rules may become invalid when the new entire

databases are considered.

Conventional batch-mining algorithms, such as Apriori [1] and DHP [13], solve


this problem by re-processing entire new databases when new transactions are

inserted into the original databases. These algorithms do not, however, use previously

mined information and require nearly the same computational time they needed to

mine from the original databases. If new transactions appear often and the original

databases are large, these algorithms are thus inefficient in maintaining association

rules.

Considering an original database and newly inserted transactions, the following

four cases (illustrated in Figure 1) may arise:

Case 1: An itemset is large in the original database and in the newly inserted

transactions.

Case 2: An itemset is large in the original database, but is not large in the newly

inserted transactions.

Case 3: An itemset is not large in the original database, but is large in the newly

inserted transactions.

Case 4: An itemset is not large in either the original database or the newly inserted

transactions.

                               New transactions
                        Large itemset   Small itemset
Original     Large        Case 1          Case 2
database     Small        Case 3          Case 4


Figure 1: Four cases arising from adding new transactions to existing databases

Since itemsets in Case 1 are large in both the original database and the new

transactions, they will still be large after the weighted average of the counts.

Similarly, itemsets in Case 4 will still be small after the new transactions are inserted.

Thus Cases 1 and 4 will not affect the final association rules. Case 2 may remove

existing association rules, and case 3 may add new association rules. A good rule-

maintenance algorithm should thus accomplish the following.

1. Evaluate large itemsets in the original database and determine whether they

are still large in the updated database;

2. Find out whether any small itemsets in the original database may become

large in the updated database;

3. Seek itemsets that appear only in the newly inserted transactions and

determine whether they are large in the updated database.

These are accomplished by the FUP algorithm and by our proposed algorithm.

4. Review of the Fast Update Algorithm (FUP)

Cheung et al. proposed the FUP algorithm to incrementally maintain association

rules when new transactions are inserted [7][8]. Using FUP, large itemsets with their

counts in preceding runs are recorded for later use in maintenance. As new

transactions are added, FUP first scans them to generate candidate 1-itemsets (only for

these transactions), and then compares these itemsets with the previous ones. FUP


partitions candidate 1-itemsets into two parts according to whether they are large for

the original database. If a candidate 1-itemset from the newly inserted transactions is

also among the large 1-itemsets from the original database, its new total count for the

entire updated database can easily be calculated from its current count and previous

count since all previous large itemsets with their counts are kept by FUP. Whether an

original large itemset is still large after new transactions are inserted is determined

from its support ratio as its total count over the total number of transactions. By

contrast, if a candidate 1-itemset from the newly inserted transactions does not exist

among the large 1-itemsets in the original database, one of two possibilities arises. If

this candidate 1-itemset is not large for the new transactions, then it cannot be large

for the entire updated database, which means no action is necessary. If this candidate

1-itemset is large for the new transactions but not among the original large 1-itemsets,

the original database must be re-scanned to determine whether the itemset is actually

large for the entire updated database. Using the processing tactics mentioned above,

FUP is thus able to find all large 1-itemsets for the entire updated database. After that,

candidate 2-itemsets from the newly inserted transactions are formed and the same

procedure is used to find all large 2-itemsets. This procedure is repeated until all large

itemsets have been found.

Below, we use a simple example to illustrate the FUP algorithm. Suppose a

database with eight transactions such as the one shown in Table 6 is to be mined. The

minimum support threshold s is set at 50%.

Table 6. An original database with TID and Items

TID   Items
100   A C D
200   B C E
300   A B C E
400   A B E
500   A B E
600   A C D
700   B C D E
800   B C E

Using a conventional mining algorithm such as the Apriori algorithm, all large

itemsets with counts larger than or equal to 4 (8×50%) are found, as shown in Table 7.

These large itemsets and their counts are retained by the FUP algorithm.

Table 7. All large itemsets from an original database with s=50%

Large itemsets
1 item   Count   2 items   Count   3 items   Count
A        5       BC        4       BCE       4
B        6       BE        6
C        6       CE        4
E        6

Next, assume two new transactions, as shown in Table 8 appear.

Table 8. New transactions for the example

New transactions
TID    Items
900    A B C D
1000   D E F

The FUP algorithm processes them as follows. First, the final large 1-itemsets for

the entire updated database are found. This process is shown in Figure 2. The same


process is then repeated until no new candidate itemsets are generated.

[Figure 2, summarized: the candidate 1-itemsets from the two new transactions (A 1, B 1, C 1, D 2, E 1, F 1) are split into those originally large ({A}{B}{C}{E}) and those originally small ({D}{F}). The counts of the originally large ones are added to their previous counts, giving A 6, B 7, C 7, E 7. {D} and {F} are large for the new transactions, so the original database is rescanned; {D} (total count 5) is found large, and the final large 1-itemsets for the updated database are A 6, B 7, C 7, D 5, E 7.]


Figure 2: The FUP process of finding large 1-itemsets

A summary of the four cases and their FUP results is given in Table 9.

Table 9. Four cases and their FUP results

Case                       Result
Case 1: Large – Large      Always large
Case 2: Large – Small      Determined from existing information
Case 3: Small – Large      Determined by rescanning the original database
Case 4: Small – Small      Always small

FUP is thus able to handle cases 1, 2 and 4 more efficiently than conventional

batch mining algorithms. It must, however, reprocess the original database to handle

case 3.
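Table 9 amounts to a small decision procedure. As a hypothetical sketch (the function name and the returned action strings are ours), the four cases dispatch as follows:

```python
def fup_action(large_in_original: bool, large_in_new: bool) -> str:
    """Action FUP takes for a candidate itemset (the four cases of Table 9)."""
    if large_in_original and large_in_new:
        return "always large"                      # Case 1
    if large_in_original:
        return "decide from existing information"  # Case 2
    if large_in_new:
        return "rescan original database"          # Case 3
    return "always small"                          # Case 4
```

Case 3 is the only branch forcing a rescan, which is precisely the branch the pre-large approach of the next section targets.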

5. Maintenance of Association Rules Based on Pre-large Itemsets

Although the FUP algorithm focuses on the newly inserted transactions and thus

saves much processing time by incrementally maintaining rules, it must still scan the

original database to handle case 3, in which a candidate itemset is large for new

transactions but is not recorded in large itemsets already mined from the original

database. This situation may occur frequently, especially when the number of new

transactions is small. In an extreme situation, if only one new transaction is added

each time, then all items in this transaction are large since their support ratios are

100% for the new transaction. Thus, if case 3 could be efficiently handled, the

maintenance time could be further reduced.


5.1 Definition of Pre-large Itemsets

In this paper, we propose the concept of pre-large itemsets to solve the problem

represented by case 3. A pre-large itemset is not truly large, but promises to be large

in the future. A lower support threshold and an upper support threshold are used to

realize this concept. The upper support threshold is the same as that used in the

conventional mining algorithms. The support ratio of an itemset must be larger than

the upper support threshold in order to be considered large. On the other hand, the

lower support threshold defines the lowest support ratio for an itemset to be treated as

pre-large. An itemset with its support ratio below the lower threshold is thought of as

a small itemset. Pre-large itemsets act like buffers in the incremental mining process

and are used to reduce the movements of itemsets directly from large to small and

vice-versa.

Considering an original database and transactions newly inserted using the two

support thresholds, itemsets may thus fall into one of the following nine cases

illustrated in Figure 3.

Figure 3: Nine cases arising from adding new transactions to existing databases

                                   New transactions
                        Large       Pre-large     Small
Original    Large       Case 1      Case 2        Case 3
database    Pre-large   Case 4      Case 5        Case 6
            Small       Case 7      Case 8        Case 9


Cases 1, 5, 6, 8 and 9 above will not affect the final association rules according

to the weighted average of the counts. Cases 2 and 3 may remove existing association

rules, and cases 4 and 7 may add new association rules. If we retain all large and
pre-large itemsets with their counts after each pass, then cases 2, 3 and 4 can be

handled easily. Also, in the maintenance phase, the ratio of new transactions to old

transactions is usually very small. This is more apparent when the database is growing

larger. An itemset in case 7 cannot possibly be large for the entire updated database as

long as the number of new transactions is small compared to the number of transactions in

the original database. This point is proven below. A summary of the nine cases and

their results is given in Table 10.

Table 10. Nine cases and their results

Case                           Result
Case 1: Large – Large          Always large
Case 2: Large – Pre-large      Large or pre-large, determined from existing information
Case 3: Large – Small          Large, pre-large or small, determined from existing information
Case 4: Pre-large – Large      Pre-large or large, determined from existing information
Case 5: Pre-large – Pre-large  Always pre-large
Case 6: Pre-large – Small      Pre-large or small, determined from existing information
Case 7: Small – Large          Pre-large or small when the number of new transactions is small
Case 8: Small – Pre-large      Small or pre-large
Case 9: Small – Small          Always small
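With two thresholds, an itemset's status and its case number in Table 10 follow mechanically. A small sketch (the function names are ours) assuming the row-major numbering of Figure 3:

```python
def classify(support_ratio, s_l, s_u):
    """Status of an itemset under the lower/upper thresholds Sl < Su."""
    if support_ratio >= s_u:
        return "large"
    if support_ratio >= s_l:
        return "pre-large"
    return "small"

def case_number(status_original, status_new):
    """Case index of Figure 3: rows = original database, columns = new
    transactions, both ordered large, pre-large, small."""
    order = {"large": 0, "pre-large": 1, "small": 2}
    return 3 * order[status_original] + order[status_new] + 1
```

For instance, an itemset small in the original database but large in the new transactions falls into case 7, the only case that may require a rescan.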

5.2 Notation

The notation used in this paper is defined below.


D : the original database;

T : the set of new transactions;

U : the entire updated database, i.e., D ∪ T;

d : the number of transactions in D;

t : the number of transactions in T;

Sl : the lower support threshold for pre-large itemsets;

Su : the upper support threshold for large itemsets, Su > Sl;

L_k^D : the set of large k-itemsets from D;

L_k^T : the set of large k-itemsets from T;

L_k^U : the set of large k-itemsets from U;

P_k^D : the set of pre-large k-itemsets from D;

P_k^T : the set of pre-large k-itemsets from T;

P_k^U : the set of pre-large k-itemsets from U;

Ck : the set of all candidate k-itemsets from T;

I : an itemset;

SD(I) : the number of occurrences of I in D;

ST(I) : the number of occurrences of I in T;

SU(I) : the number of occurrences of I in U.

5.3 Theoretical Foundation

As mentioned above, if the number of new transactions is small compared to the

number of transactions in the original database, an itemset that is small (neither large

nor pre-large) in the original database but is large in the newly inserted transactions

cannot possibly be large for the entire updated database. This is proven in the


following theorem.

Theorem 1: Let Sl and Su be respectively the lower and the upper support
thresholds, and let d and t be respectively the numbers of the original and new
transactions. If t ≤ (Su − Sl)d / (1 − Su), then an itemset that is small (neither large nor pre-
large) in the original database but is large in newly inserted transactions is not large
for the entire updated database.

Proof:

The following derivation can be obtained from t ≤ (Su − Sl)d / (1 − Su):    (1)

    t(1 − Su) ≤ (Su − Sl)d

    t − tSu ≤ dSu − dSl

    t + dSl ≤ Su(d + t)

    (t + dSl)/(d + t) ≤ Su.

If an itemset I is small (neither large nor pre-large) in the original database D,
then its count SD(I) must be less than Sl·d; therefore,

    SD(I) < dSl.

If I is large in the newly inserted transactions T, then:

    tSu ≤ ST(I) ≤ t.

The entire support ratio of I in the updated database U is SU(I)/(d + t), which can be
further expanded to:

    SU(I)/(d + t) = (SD(I) + ST(I))/(d + t) < (dSl + t)/(d + t) ≤ Su.

I is thus not large for the entire updated database. This completes the proof.

Example 1: Assume d = 100, Sl = 50% and Su = 60%. The number of new
transactions within which the original database need not be rescanned for rule
maintenance is:

    f = (Su − Sl)d / (1 − Su) = (0.6 − 0.5) × 100 / (1 − 0.6) = 25.

Thus, if the number of newly inserted transactions is equal to or less than 25,
then such an itemset I cannot be large for the entire updated database.

From Theorem 1, the number of new transactions allowed for efficient handling
of case 7 is determined by Sl, Su, and d. It can easily be seen from Formula (1) that if d

grows larger, then t can grow larger too. Therefore, as the database grows, our

proposed approach becomes increasingly efficient. This characteristic is especially

useful for real-world applications.

From Theorem 1, the allowable ratio of new transactions to previous transactions
for the proposed approach can easily be derived as follows.

Corollary 1: Let r denote the ratio of new transactions t to old transactions d. If
r ≤ (Su − Sl)/(1 − Su), then an itemset that is small (neither large nor pre-large) in the original
database but is large in the newly inserted transactions cannot be large for the entire
updated database.

Example 2: Assume Sl = 50% and Su = 60%. The ratio of new transactions to old
transactions within which the original database need not be rescanned for rule
maintenance is:

    r = (Su − Sl)/(1 − Su) = (0.6 − 0.5)/(1 − 0.6) = 1/4.

Thus, if the number of newly inserted transactions is equal to or less than 1/4 of
the number of original transactions, then I cannot be large for the entire updated
database.

It is easily seen from corollary 1 that if the range between Sl and Su is large, then

the ratio r can also be large, meaning that the number of new transactions will be large

for a fixed d. However, a large range between Sl and Su will also create a large set of

pre-large itemsets, which will represent an additional overhead in maintenance.
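The bounds in Theorem 1 and Corollary 1 can be checked numerically. A minimal sketch (function names are ours; exact arithmetic via `Fraction` avoids floating-point surprises near the floor):

```python
import math
from fractions import Fraction

def safety_number(d, s_l, s_u):
    """f = floor((Su - Sl) * d / (1 - Su)): how many new transactions may
    accumulate before the original database must be rescanned (Theorem 1)."""
    s_l, s_u = Fraction(str(s_l)), Fraction(str(s_u))
    return math.floor((s_u - s_l) * d / (1 - s_u))

def safety_ratio(s_l, s_u):
    """r = (Su - Sl) / (1 - Su): the allowed ratio of new to old transactions
    (Corollary 1), independent of the database size d."""
    s_l, s_u = Fraction(str(s_l)), Fraction(str(s_u))
    return (s_u - s_l) / (1 - s_u)

print(safety_number(100, 0.5, 0.6))  # 25, as in Example 1
print(safety_ratio(0.5, 0.6))        # 1/4, as in Example 2
```

Note that `safety_number` grows linearly with d, which is the property exploited above: the larger the database, the more new transactions can be absorbed without a rescan.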

5.4 Presentation of the Algorithm


In the proposed algorithm, the large and pre-large itemsets with their counts in

preceding runs are recorded for later use in maintenance. As new transactions are

added, the proposed algorithm first scans them to generate candidate 1-itemsets (only

for these transactions), and then compares these itemsets with the previously retained

large and pre-large 1-itemsets. It partitions candidate 1-itemsets into three parts

according to whether they are large or pre-large for the original database. If a

candidate 1-itemset from the newly inserted transactions is also among the large or

pre-large 1-itemsets from the original database, its new total count for the entire

updated database can easily be calculated from its current count and previous count

since all previous large and pre-large itemsets with their counts have been retained.

Whether an originally large or pre-large itemset is still large or pre-large after new

transactions have been inserted is determined from its new support ratio, as derived

from its total count over the total number of transactions. On the contrary, if a

candidate 1-itemset from the newly inserted transactions does not exist among the

large or pre-large 1-itemsets in the original database, then it is absolutely not large for

the entire updated database as long as the number of newly inserted transactions is

within the safety threshold derived from Theorem 1. In this situation, no action is

needed. When transactions are incrementally added and the total number of new

transactions exceeds the safety threshold, the original database is re-scanned to find

new pre-large itemsets in a way similar to that used by the FUP algorithm. The

proposed algorithm can thus find all large 1-itemsets for the entire updated database.

After that, candidate 2-itemsets from the newly inserted transactions are formed and

the same procedure is used to find all large 2-itemsets. This procedure is repeated

until all large itemsets have been found. The details of the proposed maintenance

algorithm are described below. A variable, c, is used to record the number of new


transactions since the last re-scan of the original database.

The proposed maintenance algorithm:

INPUT: A lower support threshold Sl, an upper support threshold Su, a set of large

itemsets and pre-large itemsets in the original database consisting of (d+c)

transactions, and a set of t new transactions.

OUTPUT: A set of final association rules for the updated database.

STEP 1: Calculate the safety number f of new transactions according to Theorem 1 as
follows:

    f = ⌊(Su − Sl)d / (1 − Su)⌋.

STEP 2: Set k =1, where k records the number of items in itemsets currently being

processed.

STEP 3: Find all candidate k-itemsets Ck and their counts from the new transactions.

STEP 4: Divide the candidate k-itemsets into three parts according to whether they are

large, pre-large or small in the original database.

STEP 5: For each itemset I in the originally large k-itemsets L_k^D, do the following

substeps:

Substep 5-1: Set the new count SU(I) = ST(I)+ SD(I).

Substep 5-2: If SU(I)/(d + t + c) ≥ Su, then assign I as a large itemset, set SD(I) =
SU(I) and keep I with SD(I);

otherwise, if SU(I)/(d + t + c) ≥ Sl, then assign I as a pre-large itemset, set
SD(I) = SU(I) and keep I with SD(I);

otherwise, neglect I.

STEP 6: For each itemset I in the originally pre-large k-itemsets P_k^D, do the following


substeps:

Substep 6-1: Set the new count SU(I) = ST(I)+ SD(I).

Substep 6-2: If SU(I)/(d + t + c) ≥ Su, then assign I as a large itemset, set SD(I) =
SU(I) and keep I with SD(I);

otherwise, if SU(I)/(d + t + c) ≥ Sl, then assign I as a pre-large itemset, set
SD(I) = SU(I) and keep I with SD(I);

otherwise, neglect I.

STEP 7: For each itemset I in the candidate itemsets Ck that is not in the originally large
itemsets L_k^D or pre-large itemsets P_k^D, do the following substeps:

Substep 7-1: If I is in the large itemsets L_k^T or pre-large itemsets P_k^T from the

new transactions, then put it in the rescan-set R, which is used

when rescanning in Step 8 is necessary.

Substep 7-2: If I is small for the new transactions, then do nothing.

STEP 8: If t + c ≤ f or R is empty, then do nothing; otherwise, rescan the original

database to determine whether the itemsets in the rescan-set R are large or

pre-large.

STEP 9: Form candidate (k+1)-itemsets Ck+1 from the finally large and pre-large k-
itemsets (L_k^U ∪ P_k^U) that appear in the new transactions.

STEP 10: Set k = k+1.

STEP 11: Repeat STEPs 4 to 10 until no new large or pre-large itemsets are found.

STEP 12: Modify the association rules according to the modified large itemsets.

STEP 13: If t +c > f, then set d=d+t+c and set c=0; otherwise, set c=t+c.

After Step 13, the final association rules for the updated database have been

determined.
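For concreteness, the per-level work of Steps 4 to 8 can be sketched in code. This is an illustrative reading of the algorithm, not the authors' implementation; all identifiers are ours:

```python
def maintain_level(cand_counts_T, large_D, pre_D, d, t, c, s_l, s_u):
    """One level k of the maintenance algorithm (Steps 4-8), sketched.

    cand_counts_T : {itemset: count in the t new transactions} (the set Ck)
    large_D, pre_D: {itemset: retained count from the original d (+c) transactions}
    Returns the new large and pre-large k-itemsets plus the rescan-set R."""
    total = d + t + c
    new_large, new_pre, rescan = {}, {}, set()
    # Steps 5-6: originally large or pre-large itemsets; existing counts suffice.
    for I, cnt_T in cand_counts_T.items():
        if I in large_D or I in pre_D:
            s = cnt_T + large_D.get(I, 0) + pre_D.get(I, 0)
            if s >= s_u * total:
                new_large[I] = s
            elif s >= s_l * total:
                new_pre[I] = s
        # Step 7: neither large nor pre-large originally, but large or
        # pre-large for the new transactions -> candidate for rescanning.
        elif cnt_T >= s_l * t:
            rescan.add(I)
    # Originally large/pre-large itemsets absent from the new transactions
    # keep their old counts but are re-checked against the new total.
    for I, cnt_D in {**large_D, **pre_D}.items():
        if I not in cand_counts_T:
            if cnt_D >= s_u * total:
                new_large[I] = cnt_D
            elif cnt_D >= s_l * total:
                new_pre[I] = cnt_D
    # Step 8 (left to the caller): rescan the original database for the
    # itemsets in `rescan` only once t + c exceeds the safety number f.
    return new_large, new_pre, rescan
```

On the running example (d = 8, t = 2, c = 0, Sl = 30%, Su = 50%) at level 1, this yields {A}{B}{C}{D}{E} as large, no pre-large itemsets, and R = {F}, matching Steps 4 to 8 of Section 5.5.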


5.5 An Example

In this section, an example is given to illustrate the proposed incremental data

mining algorithm. Assume the initial data set includes 8 transactions, which are the

same as those shown in Table 6. For Sl=30% and Su=50%, the sets of large itemsets

and pre-large itemsets for the given data are shown in Tables 11 and 12, respectively.

Table 11. The large itemsets for the original database

Large itemsets
1 item   Count   2 items   Count   3 items   Count
A        5       BC        4       BCE       4
B        6       BE        6
C        6       CE        4
E        6

Table 12. The pre-large itemsets for the original database

Pre-large itemsets
1 item   Count   2 items   Count   3 items   Count
D        3       AB        3       ABE       3
                 AC        3
                 AE        3
                 CD        3

Assume the two new transactions shown in Table 13 are inserted after the initial

data set is processed.

Table 13. Two new transactions

New transactions
TID    Items
900    A B C D
1000   D E F

The proposed incremental mining algorithm proceeds as follows. The variable c

is initially set at 0.

STEP 1: The safety number f for new transactions is calculated as:

    f = ⌊(Su − Sl)d / (1 − Su)⌋ = ⌊(0.5 − 0.3) × 8 / (1 − 0.5)⌋ = ⌊3.2⌋ = 3.

STEP 2: k is set to 1, where k records the number of items in itemsets currently being

processed.

STEP 3: All candidate 1-itemsets C1 and their counts from the two new transactions

are found, as shown in Table 14.

Table 14. All candidate 1-itemsets with counts from the two new transactions

Items  Count
A      1
B      1
C      1
D      2
E      1
F      1

STEP 4: From Table 14, all candidate 1-itemsets {A}{B}{C}{D}{E}{F} are divided into three parts, {A}{B}{C}{E}, {D}, and {F}, according to whether they are large, pre-large, or small in the original database. Results are shown in Table 15, where the counts are only from the new transactions.


Table 15. Three partitions of all candidate 1-itemsets from the two new transactions

Originally large     Originally pre-large     Originally small
Items  Count         Items  Count             Items  Count
A      1             D      2                 F      1
B      1
C      1
E      1
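Step 4's three-way split is a straightforward lookup against the sets recorded for the original database. A sketch (the helper and its argument names are hypothetical, chosen for illustration):

```python
def partition(candidates, large, prelarge):
    """Split candidate itemsets, with their counts from the new
    transactions, by their status in the original database."""
    orig_large, orig_prelarge, orig_small = {}, {}, {}
    for itemset, count in candidates.items():
        if itemset in large:
            orig_large[itemset] = count
        elif itemset in prelarge:
            orig_prelarge[itemset] = count
        else:
            orig_small[itemset] = count
    return orig_large, orig_prelarge, orig_small

# Table 15: the candidate 1-itemsets from the two new transactions.
c1 = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 1, "F": 1}
parts = partition(c1, large={"A", "B", "C", "E"}, prelarge={"D"})
print(parts)  # ({'A': 1, 'B': 1, 'C': 1, 'E': 1}, {'D': 2}, {'F': 1})
```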

STEP 5: The following substeps are done for each of the originally large 1-itemsets

{A}{B}{C}{E}:

Substep 5-1: The total counts of the candidate 1-itemsets {A}{B}{C}{E} are calculated using ST(I) + SD(I). Table 16 shows the results.

Table 16. The total counts of {A}{B}{C}{E}

Items  Count
A      6
B      7
C      7
E      7

Substep 5-2: The new support ratios of {A}{B}{C}{E} are calculated. For example, the new support ratio of {A} is 6/(8+2+0) = 0.6 ≥ 0.5. {A} is thus still a large itemset. In this example, {A}{B}{C}{E} are all large. {A}{B}{C}{E} with their new counts are retained in the large 1-itemsets for the entire updated database.
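The check in Substeps 5-1 and 5-2 amounts to adding the two counts and comparing against the upper threshold over the updated database size. A minimal sketch using the example's numbers (the function name is ours):

```python
def still_large(count_db, count_new, d, t, c, s_u):
    """Substep 5-2: an originally large itemset stays large in the
    updated database iff its total count reaches s_u of the
    d + t + c transactions."""
    return (count_db + count_new) / (d + t + c) >= s_u

# {A}: 5 occurrences in the original database, 1 in the new transactions.
print(still_large(5, 1, 8, 2, 0, 0.5))  # True: 6/10 = 0.6 >= 0.5
```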

STEP 6: The following substeps are done for itemset {D}, which is originally pre-large:

Substep 6-1: The total count of the candidate 1-itemset {D} is calculated using ST(I) + SD(I) (= 5).

Substep 6-2: The new support ratio of {D} is 5/(8+2+0) = 0.5 ≥ 0.5. {D} thus becomes a large 1-itemset for the whole updated database. {D} with its new count is retained in the large 1-itemsets for the entire updated database.

STEP 7: Since the itemset {F}, which was originally neither large nor pre-large, is large for the new transactions, it is put in the rescan-set R, which is used when rescanning in Step 8 is necessary.

STEP 8: Since t + c = 2 + 0 = 2 ≤ f (= 3), rescanning the database is unnecessary, so nothing is done.

STEP 9: From Steps 5, 6, and 7, the final large 1-itemsets and pre-large 1-itemsets for the entire updated database are {A}{B}{C}{D}{E}. All candidate 2-itemsets generated from them are shown in Table 17.

Table 17. All candidate 2-itemsets for the new transactions

AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
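Step 9's candidate generation is the usual Apriori-style join over the surviving k-itemsets: union pairs of kept k-itemsets and retain those (k+1)-itemsets whose every k-subset was itself kept. A sketch under that assumption (our own helper, not the paper's code):

```python
from itertools import combinations

def gen_candidates(kept_itemsets, k):
    """Form candidate (k+1)-itemsets from the finally large and
    pre-large k-itemsets, Apriori-style."""
    kept = set(kept_itemsets)
    cands = set()
    for a in kept:
        for b in kept:
            u = a | b
            # Keep unions of size k+1 whose k-subsets are all kept.
            if len(u) == k + 1 and all(frozenset(s) in kept
                                       for s in combinations(u, k)):
                cands.add(u)
    return cands

# The surviving 1-itemsets {A}{B}{C}{D}{E} yield the ten candidate
# 2-itemsets of Table 17.
kept1 = [frozenset(x) for x in "ABCDE"]
print(len(gen_candidates(kept1, 1)))  # 10
```

Applied again to the seven 2-itemsets of Table 18, the same routine yields exactly the candidate 3-itemsets ABC, ABE, ACE, and BCE used in the next pass.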

STEP 10: k = k + 1 = 2.

STEP 11: Steps 4 to 10 are repeated to find large or pre-large 2-itemsets. Results are

shown in Table 18.


Table 18. All large 2-itemsets and pre-large 2-itemsets for the whole updated database

Large 2-itemsets     Pre-large 2-itemsets
Items  Count         Items  Count
BC     5             AB     4
BE     6             AC     4
                     AE     3
                     CD     4
                     CE     4

Large or pre-large 3-itemsets are found in the same way. No large 3-itemsets are found in this example.

STEP 12: The association rules derived from the newly found large itemsets are:

B → C (confidence = 5/7),
C → B (confidence = 5/7),
B → E (confidence = 6/7), and
E → B (confidence = 6/7).
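Each confidence above comes directly from the final counts, e.g. count(BC)/count(B) = 5/7 for B → C. A sketch of that computation (the helper and its data layout are ours):

```python
def confidence(counts, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, computed
    from the final itemset counts of the updated database."""
    union = frozenset(antecedent) | frozenset(consequent)
    return counts[union] / counts[frozenset(antecedent)]

# Final counts for the updated database (Tables 16 and 18).
counts = {frozenset("B"): 7, frozenset("C"): 7, frozenset("E"): 7,
          frozenset("BC"): 5, frozenset("BE"): 6}
print(confidence(counts, "B", "C"))  # 5/7, about 0.714
```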

STEP 13: c = t + c = 2 + 0 = 2.

After Step 13, the final association rules for the updated database can then be

found. Note that the final value of c is 2 in this example and f-c=1. This means that

one more new transaction can be added without rescanning the original database. The

whole process of finding large itemsets for this example is illustrated in Figures 4, 5

and 6.


Figure 4: Our process of finding large 1-itemsets

[Figure 4 is a flowchart. It finds all candidate 1-itemsets from the two new transactions, splits them into originally large ({A}{B}{C}{E}), originally pre-large ({D}), and unrecorded ({F}) itemsets, adds the new counts to the original ones, and, since t + c = 2 + 0 < f = 3, inserts {F} into the rescan-set, giving R = {F}. The resulting large or pre-large 1-itemsets with counts are A 6, B 7, C 7, D 5, and E 7.]


Figure 5: Our process of finding large 2-itemsets and pre-large 2-itemsets

[Figure 5 is a flowchart. It finds all candidate 2-itemsets from the two new transactions, splits them into originally large (BC, BE, CE), originally pre-large (AB, AC, AE, CD), and unrecorded (AD, BD, DE) itemsets, adds the new counts to the original ones, and, since t + c = 2 + 0 < f = 3, inserts AD, BD, and DE into the rescan-set, giving R = {F, AD, BD, DE}. The resulting large or pre-large 2-itemsets with counts are AB 4, AC 4, AE 3, BC 5, BE 6, CD 4, and CE 4.]


Figure 6: Our process of finding large 3-itemsets and pre-large 3-itemsets

[Figure 6 is a flowchart. It finds all candidate 3-itemsets (ABC, ABE, ACE, BCE) from the two new transactions, splits them into originally large (BCE), originally pre-large (ABE), and unrecorded (ABC, ACE) itemsets, adds the new counts to the original ones, and, since t + c = 2 + 0 < f = 3, inserts ABC into the rescan-set, giving R = {F, AD, BD, DE, ABC}. The resulting pre-large 3-itemsets with counts are ABE 3 and BCE 4.]

In Pass 1 of this example, the candidate 1-itemsets {D} and {F} can easily be processed by our proposed algorithm; in the FUP algorithm, however, they are processed by rescanning the whole database.

6. Conclusions



In this paper, we have proposed the concept of pre-large itemsets, and designed a

novel, efficient, incremental mining algorithm based on it. Using two user-specified

upper and lower support thresholds, the pre-large itemsets act as a gap to avoid small

itemsets becoming large in the updated database when transactions are inserted. Our

proposed algorithm also retains the following features of the FUP algorithm [7][14]:

1. It avoids re-computing large itemsets that have already been discovered.

2. It focuses on newly inserted transactions, thus greatly reducing the number of

candidate itemsets.

3. It uses a simple check to further filter the candidate itemsets in inserted

transactions.

Moreover, the proposed algorithm can effectively handle cases in which itemsets are small in an original database but large in newly inserted transactions,

although it does need additional storage space to record the pre-large itemsets. Note

that the FUP algorithm needs to rescan databases to handle such cases. The proposed

algorithm does not require rescanning of the original databases until a number of new

transactions determined from the two support thresholds and the size of the database

have been processed. If the size of the database grows larger, then the number of new

transactions allowed before rescanning will be larger too. Therefore, as the database

grows, our proposed approach becomes increasingly efficient. This characteristic is

especially useful for real-world applications.

Acknowledgment


The authors would like to thank the anonymous referees for their very

constructive comments. This research was supported by the National Science Council

of the Republic of China under contract NSC 89-2213-E-214-056.

References

[1] R. Agrawal, T. Imielinski and A. Swami, “Mining association rules between sets of items in large databases,” The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993.

[2] R. Agrawal, T. Imielinski and A. Swami, “Database mining: a performance perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.

[3] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” The International Conference on Very Large Data Bases, pp. 487-499, 1994.

[4] R. Agrawal and R. Srikant, “Mining sequential patterns,” The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.

[5] R. Agrawal, R. Srikant and Q. Vu, “Mining association rules with item

constraints,” The Third International Conference on Knowledge Discovery in

Databases and Data Mining, pp. 67-73, Newport Beach, California, 1997.

[6] M.S. Chen, J. Han and P.S. Yu, “Data mining: An overview from a database

perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8,

No. 6, pp. 866-883, 1996.

[7] D.W. Cheung, J. Han, V.T. Ng, and C.Y. Wong, “Maintenance of discovered

association rules in large databases: An incremental updating approach,” The


Twelfth IEEE International Conference on Data Engineering, pp. 106-114, 1996.

[8] D.W. Cheung, S.D. Lee, and B. Kao, “A general incremental technique for

maintaining discovered association rules,” In Proceedings of Database Systems

for Advanced Applications, pp. 185-194, Melbourne, Australia, 1997.

[9] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, "Mining optimized

association rules for numeric attributes," The ACM SIGACT-SIGMOD-SIGART

Symposium on Principles of Database Systems, pp. 182-191, 1996.

[10] J. Han and Y. Fu, “Discovery of multiple-level association rules from large databases,” The Twenty-first International Conference on Very Large Data Bases, pp. 420-431, Zurich, Switzerland, 1995.

[11] M.Y. Lin and S.Y. Lee, “Incremental update on sequential patterns in large

databases,” The Tenth IEEE International Conference on Tools with Artificial

Intelligence, pp. 24-31, 1998.

[12] H. Mannila, H. Toivonen, and A.I. Verkamo, “Efficient algorithms for discovering association rules,” The AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, 1994.

[13] J.S. Park, M.S. Chen and P.S. Yu, “Using a hash-based method with transaction trimming for mining association rules,” IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.

[14] N.L. Sarda and N.V. Srinivas, “An adaptive algorithm for incremental mining of association rules,” The Ninth International Workshop on Database and Expert Systems Applications, pp. 240-245, 1998.

[15] R. Srikant and R. Agrawal, “Mining generalized association rules,” The Twenty-

first International Conference on Very Large Data Bases, pp. 407-419, Zurich,

Switzerland, 1995.


[16] R. Srikant and R. Agrawal, “Mining quantitative association rules in large

relational tables,” The 1996 ACM SIGMOD International Conference on

Management of Data, pp. 1-12, Montreal, Canada, 1996.

[17] S. Zhang, “Aggregation and maintenance for database mining,” Intelligent Data

Analysis, Vol. 3, No. 6, pp. 475-490, 1999.
