A New Incremental Data Mining Algorithm
Using Pre-large Itemsets*
Tzung-Pei Hong**
Department of Information Management
I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
http://www.nuk.edu.tw/tphong
Ching-Yao Wang
Institute of Computer and Information Science
National Chiao-Tung University
Hsinchu, 300, Taiwan, R.O.C.
Yu-Hui Tao
Department of Information Management
I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
--------------------------------------
* This is a modified and expanded version of the paper "Incremental data mining
based on two support thresholds," presented at The Fourth International Conference
on Knowledge-Based Intelligent Engineering Systems & Allied Technologies,
2000, England.
** Corresponding author.
Abstract
Due to the increasing use of very large databases and data warehouses, mining
useful information and helpful knowledge from transactions is evolving into an
important research area. In the past, researchers usually assumed databases were static
to simplify data mining problems. Thus, most of the classic algorithms proposed
focused on batch mining, and did not utilize previously mined information in
incrementally growing databases. In real-world applications, however, developing a
mining algorithm that can incrementally maintain discovered information as a
database grows is quite important. In this paper, we propose the concept of pre-large
itemsets and design a novel, efficient, incremental mining algorithm based on it. Pre-
large itemsets are defined by a lower support threshold and an upper support
threshold. They act as a buffer to avoid itemsets moving directly from large to
small and vice versa. The proposed algorithm does not need to rescan the original
database until a certain number of new transactions have been inserted; the larger
the database grows, the more new transactions are allowed before a rescan is needed.
Keywords: data mining, association rule, large itemset, pre-large itemset,
incremental mining.
1. Introduction
Years of effort in data mining have produced a variety of efficient techniques.
Depending on the type of databases processed, these mining approaches may be
classified as working on transaction databases, temporal databases, relational
databases, and multimedia databases, among others. On the other hand, depending on
the classes of knowledge derived, the mining approaches may be classified as finding
association rules, classification rules, clustering rules, and sequential patterns [4],
among others. Among them, finding association rules in transaction databases is most
commonly seen in data mining [1][3][5][9][10][12][13][15][16].
In the past, many algorithms for mining association rules from transactions were
proposed, most of which were executed in level-wise processes. That is, itemsets
containing single items were processed first, then itemsets with two items were
processed, then the process was repeated, continuously adding one more item each
time, until some criteria were met. These algorithms usually considered the database
size static and focused on batch mining. In real-world applications, however, new
records are usually inserted into databases, and designing a mining algorithm that can
maintain association rules as a database grows is thus critically important.
When new records are added to databases, the original association rules may
become invalid, or new implicitly valid rules may appear in the resulting updated
databases [7][8][11][14][17]. In these situations, conventional batch-mining
algorithms must re-process the entire updated databases to find final association rules.
Two drawbacks may exist for conventional batch-mining algorithms in maintaining
database knowledge:
(a) Nearly the same computation time as that spent in mining from the original
database is needed to cope with each new transaction. If the original database
is large, much computation time is wasted in maintaining association rules
whenever new transactions are generated.
(b) Information previously mined from the original database, such as large
itemsets and association rules, provides no help in the maintenance process.
Cheung and his co-workers proposed an incremental mining algorithm, called
FUP (Fast UPdate algorithm) [7], for incrementally maintaining mined association
rules and avoiding the shortcomings mentioned above. The FUP algorithm modifies
the Apriori mining algorithm [3] and adopts the pruning techniques used in the DHP
(Direct Hashing and Pruning) algorithm [13]. It first calculates large itemsets mainly
from newly inserted transactions, and compares them with the previous large itemsets
from the original database. According to the comparison results, FUP determines
whether re-scanning the original database is needed, thus saving some time in
maintaining the association rules. Although the FUP algorithm can indeed improve
mining performance for incrementally growing databases, original databases still need
to be scanned when necessary. In this paper, we thus propose a new mining algorithm
based on two support thresholds to further reduce the need for rescanning original
databases. Since rescanning the database consumes considerable computation time,
the proposed algorithm can greatly reduce the maintenance cost.
The remainder of this paper is organized as follows. The data mining process is
introduced in section 2. The maintenance of association rules is described in section 3.
The FUP algorithm is reviewed in section 4. A new incremental mining algorithm is
proposed in section 5. An example is also given there to illustrate the proposed
algorithm. Conclusions are summarized in section 6.
2. The Data Mining Process Using Association Rules
Data mining plays a central role in knowledge discovery in databases (KDD). It
involves applying specific algorithms to extract patterns or rules from data sets in a
particular representation. Because data mining is important to KDD, many researchers in
database and machine-learning fields are interested in this new research topic since it
offers opportunities to discover useful information and important relevant patterns in
large databases, thus helping decision-makers analyze data easily and make good
decisions regarding the domains in question.
One application of data mining is to induce association rules from transaction
data, such that the presence of certain items in a transaction will imply the presence of
certain other items. To achieve this purpose, Agrawal and his co-workers proposed
several mining algorithms based on the concept of large itemsets to find association
rules in transaction data [1][3][5]. They divided the mining process into two phases.
In the first phase, candidate itemsets were generated and counted by scanning the
transaction data. If the count of an itemset appearing in the transactions was larger
than a pre-defined threshold value (called the minimum support), the itemset was
considered a large itemset. Itemsets containing only one item were processed first.
Large itemsets containing only single items were then combined to form candidate
itemsets containing two items. This process was repeated until all large itemsets had
been found. In the second phase, association rules were induced from the large
itemsets found in the first phase. All possible association combinations for each large
itemset were formed, and those with calculated confidence values larger than a
predefined threshold (called the minimum confidence) were output as association
rules. We may summarize the data mining process we focus on as follows:
1. Determine user-specified thresholds, including the minimum support
value and the minimum confidence value.
2. Find large itemsets in an iterative way. The count of a large itemset must
exceed or equal the minimum support value.
3. Utilize the large itemsets to generate association rules, whose confidence
must exceed or equal the minimum confidence value.
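The level-wise first phase of this process can be sketched in a few lines of Python (a minimal illustration only, not the optimized Apriori or DHP implementations cited above; the function name and the set-of-items representation of transactions are ours):

```python
from itertools import combinations

def find_large_itemsets(transactions, min_support):
    """Level-wise search: count candidate k-itemsets over the transactions,
    keep those whose count is at least min_support * n, and join the
    survivors into candidate (k+1)-itemsets until no candidates remain."""
    n = len(transactions)
    min_count = min_support * n
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    large = {}  # itemset -> count, accumulated over all levels
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: v for c, v in counts.items() if v >= min_count}
        large.update(level)
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        # Join step: unions of two large k-itemsets that have k+1 items.
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == size})
    return large
```

On the Table 1 data with a 40% minimum support this yields the large itemsets A, B, D, E, AB, AD, BD, BE and ABD, matching the walk-through that follows.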
Below, we use a simple example to illustrate the mining process. Suppose a
database with five transactions shown in Table 1 is to be mined. The database has two
features, transaction identification (TID) and transaction description (Items).
Table 1. An example of a transaction database
TID  Items
100  BE
200  ABD
300  AD
400  BCE
500  ABDE
Assume the user-specified minimum support and minimum confidence are 40%
and 80%, respectively. The transaction database is first scanned to count the candidate
1-itemsets. The results are shown in Table 2.
Table 2. Candidate 1-itemsets
Item  Count
A     3
B     4
C     1
D     3
E     3
Since the counts of the items A, B, D and E are larger than or equal to 2 (5×40%), they are
put into the set of large 1-itemsets. The candidate 2-itemsets are then formed from
these large 1-itemsets as shown in Table 3.
Table 3. Candidate 2-itemsets with counts
Items  Count
AB     2
AD     3
AE     1
BD     2
BE     3
DE     1
AB, AD, BD and BE then form the set of large 2-itemsets. In a similar way,
ABD can be found to be a large 3-itemset.
Next, the large itemsets are used to generate association rules. The possible
association rules, with confidences computed as conditional probabilities, are shown in Table 4.
Table 4. Possible association rules
Rule            Confidence
IF AB, Then D   Count(ABD)/Count(AB) = 1
IF AD, Then B   Count(ABD)/Count(AD) = 2/3
IF BD, Then A   Count(ABD)/Count(BD) = 1
IF A, Then B    Count(AB)/Count(A) = 2/3
IF B, Then A    Count(AB)/Count(B) = 2/4
IF A, Then D    Count(AD)/Count(A) = 1
IF D, Then A    Count(AD)/Count(D) = 1
IF B, Then D    Count(BD)/Count(B) = 2/4
IF D, Then B    Count(BD)/Count(D) = 2/3
IF B, Then E    Count(BE)/Count(B) = 3/4
IF E, Then B    Count(BE)/Count(E) = 1
Since the user-specified minimum confidence is 80%, the final association rules
are shown in Table 5.
Table 5. The final association rules for this example
Rule            Confidence
IF AB, Then D   Count(ABD)/Count(AB) = 1
IF BD, Then A   Count(ABD)/Count(BD) = 1
IF A, Then D    Count(AD)/Count(A) = 1
IF D, Then A    Count(AD)/Count(D) = 1
IF E, Then B    Count(BE)/Count(E) = 1
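The second phase can likewise be sketched, assuming the itemsets and counts from Tables 2 and 3 are held in a dictionary (the function name and representation are illustrative, not from the paper):

```python
from itertools import combinations

def derive_rules(counts, min_confidence):
    """For every stored large itemset with two or more items, form each
    antecedent/consequent split and keep the rules whose confidence
    (count of the whole itemset over count of the antecedent) reaches
    the minimum confidence threshold."""
    rules = []
    for itemset, count in counts.items():
        for r in range(1, len(itemset)):  # skipped entirely for 1-itemsets
            for left in map(frozenset, combinations(sorted(itemset), r)):
                confidence = count / counts[left]
                if confidence >= min_confidence:
                    rules.append((left, itemset - left, confidence))
    return rules
```

With an 80% threshold and the counts of this example, it reproduces exactly the five rules of Table 5.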
3. Maintenance of Association Rules
In real-world applications, transaction databases grow over time and the
association rules mined from them must be re-evaluated because new association rules
may be generated and old association rules may become invalid when the new entire
databases are considered.
Conventional batch-mining algorithms, such as Apriori [1] and DHP [13], solve
this problem by re-processing entire new databases when new transactions are
inserted into the original databases. These algorithms do not, however, use previously
mined information and require nearly the same computational time they needed to
mine from the original databases. If new transactions appear often and the original
databases are large, these algorithms are thus inefficient in maintaining association
rules.
Considering an original database and newly inserted transactions, the following
four cases (illustrated in Figure 1) may arise:
Case 1: An itemset is large in the original database and in the newly inserted
transactions.
Case 2: An itemset is large in the original database, but is not large in the newly
inserted transactions.
Case 3: An itemset is not large in the original database, but is large in the newly
inserted transactions.
Case 4: An itemset is not large in the original database and in the newly inserted
transactions.
Figure 1: Four cases arising from adding new transactions to existing databases
Since itemsets in Case 1 are large in both the original database and the new
transactions, they will still be large after the weighted average of the counts.
Similarly, itemsets in Case 4 will still be small after the new transactions are inserted.
Thus Cases 1 and 4 will not affect the final association rules. Case 2 may remove
existing association rules, and case 3 may add new association rules. A good rule-
maintenance algorithm should thus accomplish the following.
1. Evaluate large itemsets in the original database and determine whether they
are still large in the updated database;
2. Find out whether any small itemsets in the original database may become
large in the updated database;
3. Seek itemsets that appear only in the newly inserted transactions and
determine whether they are large in the updated database.
These are accomplished by the FUP algorithm and by our proposed algorithm.
4. Review of the Fast Update Algorithm (FUP)
Cheung et al. proposed the FUP algorithm to incrementally maintain association
rules when new transactions are inserted [7][8]. Using FUP, large itemsets with their
counts in preceding runs are recorded for later use in maintenance. As new
transactions are added, FUP first scans them to generate candidate 1-itemsets (only for
these transactions), and then compares these itemsets with the previous ones. FUP
partitions candidate 1-itemsets into two parts according to whether they are large for
the original database. If a candidate 1-itemset from the newly inserted transactions is
also among the large 1-itemsets from the original database, its new total count for the
entire updated database can easily be calculated from its current count and previous
count since all previous large itemsets with their counts are kept by FUP. Whether an
original large itemset is still large after new transactions are inserted is determined
from its support ratio as its total count over the total number of transactions. By
contrast, if a candidate 1-itemset from the newly inserted transactions does not exist
among the large 1-itemsets in the original database, one of two possibilities arises. If
this candidate 1-itemset is not large for the new transactions, then it cannot be large
for the entire updated database, which means no action is necessary. If this candidate
1-itemset is large for the new transactions but not among the original large 1-itemsets,
the original database must be re-scanned to determine whether the itemset is actually
large for the entire updated database. Using the processing tactics mentioned above,
FUP is thus able to find all large 1-itemsets for the entire updated database. After that,
candidate 2-itemsets from the newly inserted transactions are formed and the same
procedure is used to find all large 2-itemsets. This procedure is repeated until all large
itemsets have been found.
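The per-itemset logic of this comparison can be summarized as follows (a sketch of the decision only, with hypothetical parameter names; the actual FUP algorithm also manages candidate generation across levels):

```python
def fup_decision(orig_large, orig_count, new_count, d, t, min_support):
    """FUP's decision for one candidate itemset from the new transactions.
    Itemsets large in the original database carry stored counts, so their
    updated support is computed directly; an itemset large only in the new
    transactions forces a rescan of the original database."""
    if orig_large:
        # Cases 1 and 2: the total count is known without rescanning.
        total = orig_count + new_count
        return "large" if total >= min_support * (d + t) else "small"
    if new_count >= min_support * t:
        return "rescan"  # Case 3: the original count is unknown.
    return "small"       # Case 4: cannot be large in the updated database.
```

For the example below (d = 8, t = 2, s = 50%), item A (originally large, counts 5 and 1) stays large without a rescan, while item D (originally small, count 2 in the new transactions) triggers one.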
Below, we use a simple example to illustrate the FUP algorithm. Suppose a
database with eight transactions such as the one shown in Table 6 is to be mined. The
minimum support threshold s is set at 50%.
Table 6. An original database with TID and Items
TID  Items
100  ACD
200  BCE
300  ABCE
400  ABE
500  ABE
600  ACD
700  BCDE
800  BCE
Using a conventional mining algorithm such as the Apriori algorithm, all large
itemsets with counts larger than or equal to 4 (8×50%) are found, as shown in Table 7.
These large itemsets and their counts are retained by the FUP algorithm.
Table 7. All large itemsets from an original database with s=50%
Large itemsets
1 item  Count    2 items  Count    3 items  Count
A       5        BC       4        BCE      4
B       6        BE       6
C       6        CE       4
E       6
Next, assume two new transactions, as shown in Table 8 appear.
Table 8. New transactions for the example
New transactions
TID   Items
900   ABCD
1000  DEF
The FUP algorithm processes them as follows. First, the final large 1-itemsets for
the entire updated database are found. This process is shown in Figure 2. The same
process is then repeated until no new candidate itemsets are generated.
Figure 2: The FUP process of finding large 1-itemsets
A summary of the four cases and their FUP results is given in Table 9.
Table 9. Four cases and their FUP results
Cases: Original – New    Results
Case 1: Large – Large    Always large
Case 2: Large – Small    Determined from existing information
Case 3: Small – Large    Determined by rescanning the original database
Case 4: Small – Small    Always small
FUP is thus able to handle cases 1, 2 and 4 more efficiently than conventional
batch mining algorithms. It must, however, reprocess the original database to handle
case 3.
5. Maintenance of Association Rules Based on Pre-large Itemsets
Although the FUP algorithm focuses on the newly inserted transactions and thus
saves much processing time by incrementally maintaining rules, it must still scan the
original database to handle case 3, in which a candidate itemset is large for new
transactions but is not recorded in large itemsets already mined from the original
database. This situation may occur frequently, especially when the number of new
transactions is small. In an extreme situation, if only one new transaction is added
each time, then all items in this transaction are large since their support ratios are
100% for the new transaction. Thus, if case 3 could be efficiently handled, the
maintenance time could be further reduced.
5.1 Definition of Pre-large Itemsets
In this paper, we propose the concept of pre-large itemsets to solve the problem
represented by case 3. A pre-large itemset is not truly large, but promises to be large
in the future. A lower support threshold and an upper support threshold are used to
realize this concept. The upper support threshold is the same as that used in the
conventional mining algorithms. The support ratio of an itemset must be larger than
the upper support threshold in order to be considered large. On the other hand, the
lower support threshold defines the lowest support ratio for an itemset to be treated as
pre-large. An itemset with its support ratio below the lower threshold is thought of as
a small itemset. Pre-large itemsets act like buffers in the incremental mining process
and are used to reduce the movements of itemsets directly from large to small and
vice-versa.
When an original database and newly inserted transactions are considered under
the two support thresholds, an itemset falls into one of the nine cases
illustrated in Figure 3.
Figure 3: Nine cases arising from adding new transactions to existing databases
Cases 1, 5, 6, 8 and 9 above will not affect the final association rules according
to the weighted average of the counts. Cases 2 and 3 may remove existing association
rules, and cases 4 and 7 may add new association rules. If we retain all large and
pre-large itemsets with their counts after each pass, then cases 2, 3 and 4 can be
handled easily. Also, in the maintenance phase, the ratio of new transactions to old
transactions is usually very small, and becomes ever smaller as the database grows.
An itemset in case 7 cannot possibly be large for the entire updated database as
long as the number of new transactions is small compared to the number of
transactions in the original database. This point is proven below. A summary of the
nine cases and their results is given in Table 10.
their results is given in Table 10.
Table 10. Nine cases and their results
Cases: Original – New            Results
Case 1: Large – Large            Always large
Case 2: Large – Pre-large        Large or pre-large, determined from existing information
Case 3: Large – Small            Large, pre-large or small, determined from existing information
Case 4: Pre-large – Large        Pre-large or large, determined from existing information
Case 5: Pre-large – Pre-large    Always pre-large
Case 6: Pre-large – Small        Pre-large or small, determined from existing information
Case 7: Small – Large            Pre-large or small when the number of new transactions is small
Case 8: Small – Pre-large        Small or pre-large
Case 9: Small – Small            Always small
5.2 Notation
The notation used in this paper is defined below.
D : the original database;
T : the set of new transactions;
U : the entire updated database, i.e., D ∪ T;
d : the number of transactions in D;
t : the number of transactions in T;
Sl : the lower support threshold for pre-large itemsets;
Su : the upper support threshold for large itemsets, Su > Sl;
L_k^D : the set of large k-itemsets from D;
L_k^T : the set of large k-itemsets from T;
L_k^U : the set of large k-itemsets from U;
P_k^D : the set of pre-large k-itemsets from D;
P_k^T : the set of pre-large k-itemsets from T;
P_k^U : the set of pre-large k-itemsets from U;
Ck : the set of all candidate k-itemsets from T;
I : an itemset;
SD(I) : the number of occurrences of I in D;
ST(I) : the number of occurrences of I in T;
SU(I) : the number of occurrences of I in U.
5.3 Theoretical Foundation
As mentioned above, if the number of new transactions is small compared to the
number of transactions in the original database, an itemset that is small (neither large
nor pre-large) in the original database but is large in the newly inserted transactions
cannot possibly be large for the entire updated database. This is proven in the
following theorem.
Theorem 1: Let Sl and Su be respectively the lower and the upper support
thresholds, and let d and t be respectively the numbers of the original and new
transactions. If t ≤ (Su − Sl)d/(1 − Su), then an itemset that is small (neither large nor pre-
large) in the original database but is large in newly inserted transactions is not large
for the entire updated database.
Proof:
The following derivation can be obtained from t ≤ (Su − Sl)d/(1 − Su):
t ≤ (Su − Sl)d/(1 − Su)    (1)
t(1 − Su) ≤ (Su − Sl)d
t − tSu ≤ dSu − dSl
t + dSl ≤ Su(d + t)
(t + dSl)/(d + t) ≤ Su.
If an itemset I is small (neither large nor pre-large) in the original database D,
then its count SD(I) must be less than Sld, therefore,
SD(I) < dSl.
If I is large in the newly inserted transactions T, then:
tSu ≤ ST(I) ≤ t.
The entire support ratio of I in the updated database U is SU(I)/(d + t), which can be
further expanded to:
SU(I)/(d + t) = (SD(I) + ST(I))/(d + t) < (dSl + t)/(d + t) ≤ Su.
I is thus not large for the entire updated database. This completes the proof.
Example 1: Assume d = 100, Sl = 50% and Su = 60%. The number of new
transactions within which the original database need not be scanned for rule
maintenance is:
(Su − Sl)d/(1 − Su) = (0.6 − 0.5)×100/(1 − 0.6) = 25.
Thus, if the number of newly inserted transactions is equal to or less than 25,
then I cannot be large for the entire updated database.
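The bound of Theorem 1 translates directly into code; with the thresholds given as integer percentages, the arithmetic stays exact (a sketch; the function name is ours):

```python
def safety_number(d, sl_pct, su_pct):
    """Largest number t of new transactions for which, by Theorem 1, an
    itemset small in the original database cannot become large in the
    updated database: floor((Su - Sl) * d / (1 - Su)). Thresholds are
    integer percentages so the floor is exact integer division."""
    return (su_pct - sl_pct) * d // (100 - su_pct)
```

safety_number(100, 50, 60) gives 25, matching Example 1; the setting d = 8, Sl = 30%, Su = 50% used later in Section 5.5 gives 3.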
From theorem 1, the number of new transactions required for efficient handling
of case 7 is determined by Sl, Su, and d. It can easily be seen from Formula 1 that if d
grows larger, then t can grow larger too. Therefore, as the database grows, our
proposed approach becomes increasingly efficient. This characteristic is especially
useful for real-world applications.
From Theorem 1, the ratio of new transactions to old transactions within which the
proposed approach remains valid can easily be derived as follows.
Corollary 1: Let r denote the ratio of new transactions t to old transactions d. If
r ≤ (Su − Sl)/(1 − Su), then an itemset that is small (neither large nor pre-large) in the original
database but is large in the newly inserted transactions cannot be large for the entire
updated database.
Example 2: Assume Sl=50% and Su=60%. The ratio of new transactions to old
transactions within which the original database need not be scanned for rule
maintenance is:
(Su − Sl)/(1 − Su) = (0.6 − 0.5)/(1 − 0.6) = 1/4.
Thus, if the number of newly inserted transactions is equal to or less than 1/4 of
the number of original transactions, then I cannot be large for the entire updated
database.
It is easily seen from corollary 1 that if the range between Sl and Su is large, then
the ratio r can also be large, meaning that the number of new transactions will be large
for a fixed d. However, a large range between Sl and Su will also create a large set of
pre-large itemsets, which will represent an additional overhead in maintenance.
5.4 Presentation of the Algorithm
In the proposed algorithm, the large and pre-large itemsets with their counts in
preceding runs are recorded for later use in maintenance. As new transactions are
added, the proposed algorithm first scans them to generate candidate 1-itemsets (only
for these transactions), and then compares these itemsets with the previously retained
large and pre-large 1-itemsets. It partitions candidate 1-itemsets into three parts
according to whether they are large or pre-large for the original database. If a
candidate 1-itemset from the newly inserted transactions is also among the large or
pre-large 1-itemsets from the original database, its new total count for the entire
updated database can easily be calculated from its current count and previous count
since all previous large and pre-large itemsets with their counts have been retained.
Whether an originally large or pre-large itemset is still large or pre-large after new
transactions have been inserted is determined from its new support ratio, as derived
from its total count over the total number of transactions. On the contrary, if a
candidate 1-itemset from the newly inserted transactions does not exist among the
large or pre-large 1-itemsets in the original database, then it cannot be large for
the entire updated database as long as the number of newly inserted transactions is
within the safety threshold derived from Theorem 1. In this situation, no action is
needed. When transactions are incrementally added and the total number of new
transactions exceeds the safety threshold, the original database is re-scanned to find
new pre-large itemsets in a way similar to that used by the FUP algorithm. The
proposed algorithm can thus find all large 1-itemsets for the entire updated database.
After that, candidate 2-itemsets from the newly inserted transactions are formed and
the same procedure is used to find all large 2-itemsets. This procedure is repeated
until all large itemsets have been found. The details of the proposed maintenance
algorithm are described below. A variable, c, is used to record the number of new
transactions since the last re-scan of the original database.
The proposed maintenance algorithm:
INPUT: A lower support threshold Sl, an upper support threshold Su, a set of large
itemsets and pre-large itemsets in the original database consisting of (d+c)
transactions, and a set of t new transactions.
OUTPUT: A set of final association rules for the updated database.
STEP 1: Calculate the safety number f of new transactions according to theorem 1 as
follows:
f = ⌊(Su − Sl)d/(1 − Su)⌋.
STEP 2: Set k =1, where k records the number of items in itemsets currently being
processed.
STEP 3: Find all candidate k-itemsets Ck and their counts from the new transactions.
STEP 4: Divide the candidate k-itemsets into three parts according to whether they are
large, pre-large or small in the original database.
STEP 5: For each itemset I in the originally large k-itemsets L_k^D, do the following
substeps:
Substep 5-1: Set the new count SU(I) = ST(I)+ SD(I).
Substep 5-2: If SU(I)/(d+t+c) ≥ Su, then assign I as a large itemset, set SD(I) =
SU(I) and keep I with SD(I);
otherwise, if SU(I)/(d+t+c) ≥ Sl, then assign I as a pre-large itemset, set
SD(I) = SU(I) and keep I with SD(I);
otherwise, neglect I.
STEP 6: For each itemset I in the originally pre-large k-itemsets P_k^D, do the following
substeps:
Substep 6-1: Set the new count SU(I) = ST(I)+ SD(I).
Substep 6-2: If SU(I)/(d+t+c) ≥ Su, then assign I as a large itemset, set SD(I) =
SU(I) and keep I with SD(I);
otherwise, if SU(I)/(d+t+c) ≥ Sl, then assign I as a pre-large itemset, set
SD(I) = SU(I) and keep I with SD(I);
otherwise, neglect I.
STEP 7: For each itemset I in the candidate itemsets Ck that is not in the originally
large itemsets L_k^D or pre-large itemsets P_k^D, do the following substeps:
Substep 7-1: If I is in the large itemsets L_k^T or pre-large itemsets P_k^T from the
new transactions, then put it in the rescan-set R, which is used
when rescanning in Step 8 is necessary.
Substep 7-2: If I is small for the new transactions, then do nothing.
STEP 8: If t + c ≤ f or R is null, then do nothing; otherwise, rescan the original
database to determine whether the itemsets in the rescan-set R are large or
pre-large.
STEP 9: Form candidate (k+1)-itemsets Ck+1 from the final large and pre-large k-
itemsets (L_k^U ∪ P_k^U) that appear in the new transactions.
STEP 10: Set k = k+1.
STEP 11: Repeat STEPs 4 to 10 until no new large or pre-large itemsets are found.
STEP 12: Modify the association rules according to the modified large itemsets.
STEP 13: If t +c > f, then set d=d+t+c and set c=0; otherwise, set c=t+c.
After Step 13, the final association rules for the updated database have been
determined.
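Steps 5 to 7 apply the same decision to every candidate itemset; that per-itemset logic can be sketched as follows (illustrative names only; the counts and statuses come from the retained large and pre-large sets):

```python
def maintain_itemset(status, s_d, s_t, d, t, c, s_l, s_u):
    """Decide one candidate itemset's fate in the proposed algorithm.
    `status` is its standing in the original database ('large',
    'pre-large' or 'small'); s_d is its stored original count and
    s_t its count in the new transactions."""
    if status in ("large", "pre-large"):
        # Steps 5 and 6: the updated count is known exactly.
        total = s_d + s_t
        ratio = total / (d + t + c)
        if ratio >= s_u:
            return "large", total
        if ratio >= s_l:
            return "pre-large", total
        return "small", None
    # Step 7: originally small; defer to the rescan-set R only if it is
    # large or pre-large for the new transactions alone.
    if s_t >= s_l * t:
        return "rescan-set", None
    return "small", None
```

In the example of Section 5.5, {A} (originally large, counts 5 + 1 over 10 transactions) stays large, {D} (originally pre-large, 3 + 2) becomes large, and {F} goes to the rescan-set.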
5.5 An Example
In this section, an example is given to illustrate the proposed incremental data
mining algorithm. Assume the initial data set includes 8 transactions, which are the
same as those shown in Table 6. For Sl=30% and Su=50%, the sets of large itemsets
and pre-large itemsets for the given data are shown in Tables 11 and 12, respectively.
Table 11. The large itemsets for the original database
Large itemsets
1 item  Count    2 items  Count    3 items  Count
A       5        BC       4        BCE      4
B       6        BE       6
C       6        CE       4
E       6
Table 12. The pre-large itemsets for the original database
Pre-large itemsets
1 item  Count    2 items  Count    3 items  Count
D       3        AB       3        ABE      3
                 AC       3
                 AE       3
                 CD       3
Assume the two new transactions shown in Table 13 are inserted after the initial
data set is processed.
Table 13. Two new transactions
New transactions
TID   Items
900   ABCD
1000  DEF
The proposed incremental mining algorithm proceeds as follows. The variable c
is initially set at 0.
STEP 1: The safety number f for new transactions is calculated as:
f = ⌊(Su − Sl)d/(1 − Su)⌋ = ⌊(0.5 − 0.3)×8/(1 − 0.5)⌋ = ⌊3.2⌋ = 3.
STEP 2: k is set to 1, where k records the number of items in itemsets currently being
processed.
STEP 3: All candidate 1-itemsets C1 and their counts from the two new transactions
are found, as shown in Table 14.
Table 14. All candidate 1-itemsets with counts from the two new transactions
Candidate 1-itemsets
Items  Count
A      1
B      1
C      1
D      2
E      1
F      1
STEP 4: From Table 14, all candidate 1-itemsets {A}{B}{C}{D}{E}{F} are divided
into three parts, {A}{B}{C}{E}, {D}, and {F} according to whether they
are large, pre-large or small in the original database. Results are shown in
Table 15, where the counts are only from the new transactions.
Table 15. Three partitions of all candidate 1-itemsets from the two new transactions
Originally large 1-itemsets    Originally pre-large 1-itemsets    Originally small 1-itemsets
Items  Count                   Items  Count                       Items  Count
A      1                       D      2                           F      1
B      1
C      1
E      1
STEP 5: The following substeps are done for each of the originally large 1-itemsets
{A}{B}{C}{E}:
Substep 5-1: The total counts of the candidate 1-itemsets {A}{B}{C}{E}
are calculated using ST(I)+ SD(I). Table 16 shows the results.
Table 16. The total counts of {A}{B}{C}{E}
Items  Count
A      6
B      7
C      7
E      7
Substep 5-2: The new support ratios of {A}{B}{C}{E} are calculated. For
example, the new support ratio of {A} is 6/(8+2+0) = 0.6 ≥ 0.5.
{A} is thus still a large itemset. In this example, {A}{B}{C}
{E} are all large. {A}{B}{C}{E} with their new counts are
retained in the large 1-itemsets for the entire updated database.
STEP 6: The following substeps are done for itemset {D}, which is originally pre-
large:
Substep 6-1: The total count of the candidate 1-itemset {D} is calculated
using ST(I)+ SD(I) (= 5).
Substep 6-2: The new support ratio of {D} is 5/(8+2+0) = 0.5 ≥ 0.5. {D} thus
becomes a large 1-itemset for the whole updated database. {D}
with its new count is retained in the large 1-itemsets for the
entire updated database.
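Substeps 5-1 and 5-2 (and likewise Step 6) add an itemset's count in the new transactions to its stored count and re-check its status against both thresholds. A minimal sketch, assuming Su = 0.5 and Sl = 0.3 as in this example (function and argument names are illustrative):

```python
def classify(stored_count, new_count, total_size, su=0.5, sl=0.3):
    """Classify an itemset for the whole updated database from its
    count in the original database plus its count in the new
    transactions, over total_size transactions."""
    ratio = (stored_count + new_count) / total_size
    if ratio >= su:
        return 'large'
    if ratio >= sl:
        return 'pre-large'
    return 'small'

# {A}: count 5 in the original database, 1 in the new transactions.
print(classify(5, 1, 10))  # -> large (6/10 = 0.6 >= 0.5)
# {D}: originally pre-large with count 3, twice in the new transactions.
print(classify(3, 2, 10))  # -> large (5/10 = 0.5 >= 0.5)
```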
STEP 7: Since the itemset {F}, which was originally neither large nor pre-large, is
large for the new transactions, it is put in the rescan-set R, which is used
when rescanning in Step 8 is necessary.
STEP 8: Since t + c = 2 + 0 ≤ f (= 3), rescanning the database is unnecessary, so
nothing is done.
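The bookkeeping behind Steps 7 and 8 is simple: itemsets that were neither large nor pre-large originally are remembered in the rescan-set, and a rescan is triggered only once the accumulated new transactions exceed the safety number. A sketch with this example's values (variable names are illustrative):

```python
# Rescan-set from Step 7: {F} was small originally but large
# in the new transactions.
R = {frozenset('F')}

t, c, f = 2, 0, 3  # new transactions, accumulated count, safety number
if t + c > f:
    print('rescan the original database, re-checking itemsets in R:', R)
else:
    print('no rescan needed')  # taken here, since 2 + 0 <= 3
```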
STEP 9: From Steps 5, 6 and 7, the final large 1-itemsets and pre-large 1-itemsets for
the entire updated database are {A}{B}{C}{D}{E}. All candidate 2-itemsets
generated from them are shown in Table 17.
Table 17. All candidate 2-itemsets for the new transactions
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
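The candidate 2-itemsets of Table 17 are simply all pairs drawn from the large and pre-large 1-itemsets surviving Step 9; a minimal sketch:

```python
from itertools import combinations

# Large and pre-large 1-itemsets for the updated database (Step 9).
survivors = ['A', 'B', 'C', 'D', 'E']

# Candidate 2-itemsets: every pair of surviving items.
candidates = [frozenset(pair) for pair in combinations(survivors, 2)]
print(len(candidates))  # -> 10, matching Table 17
```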
STEP 10: k = k+1=2.
STEP 11: Steps 4 to 10 are repeated to find large or pre-large 2-itemsets. Results are
shown in Table 18.
Table 18. All large 2-itemsets and pre-large 2-itemsets for the whole updated database
Large 2-itemsets        Pre-large 2-itemsets
Items  Count            Items  Count
BC     5                AB     4
BE     6                AC     4
                        AE     3
                        CD     4
                        CE     4
Large or pre-large 3-itemsets are found in the same way. No large 3-itemsets
were found in this example.
STEP 12: The association rules derived from the newly found large itemsets are:
B → C (confidence = 5/7),
C → B (confidence = 5/7),
B → E (confidence = 6/7), and
E → B (confidence = 6/7).
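Each rule's confidence is the support count of the combined itemset divided by that of the antecedent. A minimal sketch using the counts found above (the dictionary layout is illustrative):

```python
# Support counts for the whole updated database (Tables 16 and 18).
support = {frozenset('B'): 7, frozenset('C'): 7, frozenset('E'): 7,
           frozenset('BC'): 5, frozenset('BE'): 6}

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support[antecedent | consequent] / support[antecedent]

print(confidence(frozenset('B'), frozenset('C')))  # -> 5/7, about 0.714
print(confidence(frozenset('E'), frozenset('B')))  # -> 6/7, about 0.857
```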
STEP 13: c=t+c=2+0=2.
After Step 13, the final association rules for the updated database can then be
found. Note that the final value of c is 2 in this example and f-c=1. This means that
one more new transaction can be added without rescanning the original database. The
whole process of finding large itemsets for this example is illustrated in Figures 4, 5
and 6.
Figure 4: Our process of finding large 1-itemsets
Figure 5: Our process of finding large 2-itemsets and pre-large 2-itemsets (after this pass, the rescan-set has grown to R = {F, AD, BD, DE})
Figure 6: Our process of finding large 3-itemsets and pre-large 3-itemsets (after this pass, the rescan-set has grown to R = {F, AD, BD, DE, ABC})
In Pass 1 of this example, the candidate 1-itemsets {D} and {F} can easily be
processed by our proposed algorithm; in the FUP algorithm, however, they are
processed by rescanning the whole database.
6. Conclusions
In this paper, we have proposed the concept of pre-large itemsets, and designed a
novel, efficient, incremental mining algorithm based on it. Using two user-specified
upper and lower support thresholds, the pre-large itemsets act as a gap to avoid small
itemsets becoming large in the updated database when transactions are inserted. Our
proposed algorithm also retains the following features of the FUP algorithm [7][14]:
1. It avoids re-computing large itemsets that have already been discovered.
2. It focuses on newly inserted transactions, thus greatly reducing the number of
candidate itemsets.
3. It uses a simple check to further filter the candidate itemsets in inserted
transactions.
Moreover, the proposed algorithm can effectively handle cases in which
itemsets are small in an original database but large in newly inserted transactions,
although it does need additional storage space to record the pre-large itemsets. Note
that the FUP algorithm needs to rescan databases to handle such cases. The proposed
algorithm does not require rescanning of the original databases until a number of new
transactions determined from the two support thresholds and the size of the database
have been processed. If the size of the database grows larger, then the number of new
transactions allowed before rescanning will be larger too. Therefore, as the database
grows, our proposed approach becomes increasingly efficient. This characteristic is
especially useful for real-world applications.
Acknowledgment
The authors would like to thank the anonymous referees for their very
constructive comments. This research was supported by the National Science Council
of the Republic of China under contract NSC 89-2213-E-214-056.
References
[1] R. Agrawal, T. Imielinski and A. Swami, “Mining association rules between sets
of items in large databases,” The ACM SIGMOD Conference, pp. 207-216,
Washington DC, USA, 1993.
[2] R. Agrawal, T. Imielinski and A. Swami, “Database mining: a performance
perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 5,
No. 6, pp. 914-925, 1993.
[3] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” The
International Conference on Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, ”Mining sequential patterns,” The Eleventh IEEE
International Conference on Data Engineering, pp. 3-14, 1995.
[5] R. Agrawal, R. Srikant and Q. Vu, “Mining association rules with item
constraints,” The Third International Conference on Knowledge Discovery in
Databases and Data Mining, pp. 67-73, Newport Beach, California, 1997.
[6] M.S. Chen, J. Han and P.S. Yu, “Data mining: An overview from a database
perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8,
No. 6, pp. 866-883, 1996.
[7] D.W. Cheung, J. Han, V.T. Ng, and C.Y. Wong, “Maintenance of discovered
association rules in large databases: An incremental updating approach,” The
Twelfth IEEE International Conference on Data Engineering, pp. 106-114, 1996.
[8] D.W. Cheung, S.D. Lee, and B. Kao, “A general incremental technique for
maintaining discovered association rules,” In Proceedings of Database Systems
for Advanced Applications, pp. 185-194, Melbourne, Australia, 1997.
[9] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, "Mining optimized
association rules for numeric attributes," The ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, pp. 182-191, 1996.
[10] J. Han and Y. Fu, “Discovery of multiple-level association rules from large
database,” The Twenty-first International Conference on Very Large Data Bases,
pp. 420-431, Zurich, Switzerland, 1995.
[11] M.Y. Lin and S.Y. Lee, “Incremental update on sequential patterns in large
databases,” The Tenth IEEE International Conference on Tools with Artificial
Intelligence, pp. 24-31, 1998.
[12] H. Mannila, H. Toivonen, and A.I. Verkamo, “Efficient algorithms for
discovering association rules,” The AAAI Workshop on Knowledge Discovery in
Databases, pp. 181-192, 1994.
[13] J.S. Park, M.S. Chen, and P.S. Yu, “Using a hash-based method with transaction
trimming for mining association rules,” IEEE Transactions on Knowledge and
Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.
[14] N.L. Sarda and N.V. Srinivas, “An adaptive algorithm for incremental mining of
association rules,” The Ninth International Workshop on Database and Expert
Systems, pp. 240-245, 1998.
[15] R. Srikant and R. Agrawal, “Mining generalized association rules,” The Twenty-
first International Conference on Very Large Data Bases, pp. 407-419, Zurich,
Switzerland, 1995.
[16] R. Srikant and R. Agrawal, “Mining quantitative association rules in large
relational tables,” The 1996 ACM SIGMOD International Conference on
Management of Data, pp. 1-12, Montreal, Canada, 1996.
[17] S. Zhang, “Aggregation and maintenance for database mining,” Intelligent Data
Analysis, Vol. 3, No. 6, pp. 475-490, 1999.