
Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications

Sunita Sarawagi    Shiby Thomas*    Rakesh Agrawal
[email protected]    [email protected]    ragrawal@almaden.ibm.com

IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120

Abstract

Data mining on large data warehouses is becoming increasingly important. In support of this trend, we consider a spectrum of architectural alternatives for coupling mining with database systems. These alternatives include: loose-coupling through a SQL cursor interface; encapsulation of a mining algorithm in a stored procedure; caching the data to a file system on-the-fly and mining; tight-coupling using primarily user-defined functions; and SQL implementations for processing in the DBMS. We comprehensively study the option of expressing the mining algorithm in the form of SQL queries using association rule mining as a case in point. We consider four options in SQL-92 and six options in SQL enhanced with object-relational extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that from a performance perspective, the Cache-Mine option is superior, although the performance of the SQL-OR option is within a factor of two. Both the Cache-Mine and the SQL-OR approaches incur a higher storage penalty than the loose-coupling approach, which performance-wise is a factor of 3 to 4 worse than Cache-Mine. The SQL-92 implementations were too slow to qualify as a competitive option. We also compare these alternatives on the basis of qualitative factors like automatic parallelization, development ease, portability and inter-operability.

1 Introduction

An ever increasing number of organizations are installing large data warehouses using relational database technology. There is a huge demand for mining nuggets of knowledge from these data warehouses.

The initial research on data mining was concentrated on defining new mining operations and developing algorithms for them. Most early mining systems were developed largely on file systems, and specialized data structures and buffer management strategies were devised for each algorithm. Coupling with database systems was at best loose, and access to data in a DBMS was provided through an ODBC or SQL cursor interface (e.g. [14, 1, 9, 12]).

*Current affiliation: Dept. of Computer & Information Science & Engineering, University of Florida, Gainesville

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD '98 Seattle, WA, USA
© 1998 ACM 0-89791-995-5/98/006...$5.00

Researchers of late have started to focus on issues related to integrating mining with databases. There have been language proposals to extend SQL to support mining operators. For instance, the query language DMQL [9] extends SQL with a collection of operators for mining characteristic rules, discriminant rules, classification rules, association rules, etc. The M-SQL language [13] extends SQL with a special unified operator Mine to generate and query a whole set of propositional rules. Another example is the mine rule [17] operator for a generalized version of the association rule discovery problem. Query flocks for association rule mining using a generate-and-test model have been proposed in [25].

The issue of tightly coupling a mining algorithm with a relational database system from the systems point of view was addressed in [5]. This proposal makes use of user-defined functions (UDFs) in SQL statements to selectively push parts of the computation into the database system. The objective was to avoid one-at-a-time record retrieval from the database, saving both the copying and process context switching costs. The SETM algorithm [10] for finding association rules was expressed in the form of SQL queries. However, as shown in [3], SETM is not efficient and there are no results reported on running it against a relational DBMS. Recently, the problem of expressing the association rules algorithm in SQL has been explored in [20]. We discuss this work later in the paper.

1.1 Goal

This paper is an attempt to understand the implications of various architectural alternatives for coupling data mining with relational database systems. In particular, we are interested in studying how competitive a mining computation expressed in SQL can be compared to a specialized implementation of the same mining operation.

There are several potential advantages of a SQL implementation. One can make use of the database indexing and query processing capabilities, thereby leveraging more than a decade of effort spent in making these systems robust, portable, scalable, and concurrent. One can also exploit the underlying SQL parallelization, particularly in an SMP environment. The DBMS support for checkpointing and space management can be valuable for long-running mining algorithms.

The architecture we have in mind is schematically shown in Figure 1. We visualize that the desired mining operation will be expressed in some extension of SQL or a graphical language.


    Figure 1: SQL architecture for mining in a DBMS

A preprocessor will generate an appropriate SQL translation for this operation. We consider translations that can be executed on a SQL-92 [16] relational engine, as well as translations that require some of the newer object-relational capabilities being designed for SQL [15]. Specifically, we assume availability of blobs, user-defined functions, and table functions [19].

We compare the performance of the above SQL architecture with the following alternatives:

Read directly from DBMS: Data is read tuple by tuple from the DBMS to the mining kernel using a cursor interface. Data is never copied to a file system. We consider two variations of this approach. One is the loose-coupling approach, where the DBMS runs in a different address space from the mining process. This is the approach followed by most existing mining systems. A potential problem with this approach is the high context switching cost between the DBMS and the mining process [5]. In spite of the block-read optimization present in many systems (e.g. Oracle [18], DB2 [7]) where a block of tuples is read at a time, the performance could suffer. The second is the stored-procedure approach, where the mining algorithm is encapsulated as a stored procedure [7] that runs in the same address space as the DBMS. The main advantage of both these approaches is greater programming flexibility and no extra storage requirement. The mined results are stored back into the DBMS.

Cache-mine: This option is a variation of the Stored-procedure approach where, after reading the entire data once from the DBMS, the mining algorithm temporarily caches the relevant data in a lookaside buffer on a local disk. The cached data could be transformed to a format that enables efficient future accesses. The cached data is discarded when the execution completes. This method has all the advantages of the stored procedure approach plus it promises to have better performance. The disadvantage is that it requires additional disk space for caching. Note that the permanent data continues to be managed by the DBMS.

User-defined function (UDF): The mining algorithm is expressed as a collection of user-defined functions (UDFs) [7] that are appropriately placed in SQL data scan queries. Most of the processing happens in the UDF and the DBMS is used primarily to provide tuples to the UDFs. Little use is made of the query processing capability of the DBMS. The UDFs are run in the unfenced mode (same address space as the database). Such an implementation was presented in [5]. The main attraction of this method over Stored-procedure is performance, since passing tuples to a stored procedure is slower than passing them to a UDF. Otherwise, the processing happens in almost the same manner as in the stored procedure case. The main disadvantage is the development cost, since the entire mining algorithm has to be written as UDFs involving significant code rewrites [5]. This option can be viewed as an extreme case of the SQL-OR approach where UDFs do all the processing.
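To make the UDF option concrete, the query below is a minimal sketch of this style, not the implementation from [5]: apriori_pass_k is a hypothetical unfenced UDF that receives one (tid, item) pair per invocation and maintains the candidate counts in its own scratchpad memory.

    -- the DBMS merely scans T; all counting happens inside the UDF
    select apriori_pass_k(t.tid, t.item)
    from T t

After the scan, the accumulated frequent itemsets would be read out of the UDF's state and written back into a result table.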

1.2 Methodology

We do both quantitative and qualitative comparisons of the architectures stated above with respect to the problem of discovering association rules [2] against IBM DB2 Universal Server [11].

For the loose-coupling and Stored-procedure architectures, we use the implementation of the Apriori algorithm [3] for finding association rules provided with the IBM data mining product, Intelligent Miner [14]. For the Cache-Mine architecture, we used the space option provided in Intelligent Miner that caches the data in a binary format after the first pass. For the UDF architecture, we use the UDF implementation of the Apriori algorithm described in [5]. For the SQL architecture, we consider two classes of implementations: one uses only the features supported in SQL-92 and the other uses object-relational extensions to SQL (henceforth referred to as SQL-OR). We consider four different implementations in the first case and six in the second. These implementations differ in the way they exploit different features of SQL. We compare the performance of these different approaches using four real-life datasets. We also use synthetic datasets at various points to better understand the behavior of different algorithms.

1.3 Paper Layout

The rest of the paper is organized as follows. In Section 2, we cover background material. In Section 3, we present the overview of the SQL implementations. In Sections 4 and 5, we elaborate on different ways of doing the support counting phase of associations in SQL: Section 4 presents SQL-92 implementations and Section 5 gives implementations in SQL-OR. In Section 6 we present a qualitative and quantitative comparison of the different architectural alternatives. We present conclusions in Section 7. This paper is an abbreviated version of the full paper that appears in [21].

2 Background

2.1 Association Rules

Given a set of transactions, where each transaction is a set of items, an association rule [2] is an expression X→Y, where X and Y are sets of items. The intuitive meaning of such a rule is that the transactions that contain the items in X tend to also contain the items in Y. An example of such a rule might be that 30% of transactions that contain beer also contain diapers; 2% of all transactions contain both these items. Here 30% is called the confidence of the rule, and 2% the support of the rule. The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.

The association rule mining problem can be decomposed into two subproblems [2]:

• Find all combinations of items, called frequent itemsets, whose support is greater than minimum support.

• Use the frequent itemsets to generate the desired rules. The idea is that if, say, ABCD and AB are frequent, then the rule AB→CD holds if the ratio of support(ABCD) to support(AB) is at least as large as the minimum confidence. Note that the rule will have minimum support because ABCD is frequent.

The first part, the generation of frequent itemsets, is the most time-consuming part and we concentrate on this part in the paper. In [21] we also discuss rule generation.
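Stated as a formula (a restatement of the definition above):

\[
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
\]

For instance, with illustrative numbers not taken from the paper's datasets: if support(ABCD) = 2% and support(AB) = 5%, then AB→CD has confidence 0.4 and is reported only if the user-specified minimum confidence is at most 40%.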


2.2 Apriori Algorithm

We use the Apriori algorithm [3] as the basis for our presentation. There are recent proposals for improving the Apriori algorithm by reducing the number of data passes [24, 6]. They all have the same basic dataflow structure as the Apriori algorithm. Our goal in this work is to understand how best to integrate this basic structure within a database system. In [21], we discuss how our conclusions extrapolate to these algorithms.

The Apriori algorithm for finding frequent itemsets makes multiple passes over the data. In the kth pass it finds all itemsets having k items, called the k-itemsets. Each pass consists of two phases. Let Fk represent the set of frequent k-itemsets, and Ck the set of candidate k-itemsets (potentially frequent itemsets). First is the candidate generation phase, where the set of all frequent (k-1)-itemsets, Fk-1, found in the (k-1)th pass, is used to generate the candidate itemsets Ck. The candidate generation procedure ensures that Ck is a superset of the set of all frequent k-itemsets. A specialized in-memory hash-tree data structure is used to store Ck. Then, data is scanned in the support counting phase. For each transaction, the candidates in Ck contained in the transaction are determined using the hash-tree data structure and their support count is incremented. At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding Fk. The algorithm terminates when Fk or Ck+1 becomes empty.

2.3 Input format

The transaction table T has two column attributes: transaction identifier (tid) and item identifier (item). The number of items per tid is variable and unknown at table creation time. Thus, alternatives such as [20], where all items of a tid appear as different columns of a single tuple, may not be practical. Often the number of items per transaction can be more than the maximum number of columns that the DBMS supports. For instance, for one of our real-life datasets the maximum number of items per transaction is 872 and for another it is 700. In contrast, the corresponding average number of items per transaction is only 9.6 and 4.4 respectively.

3 Associations in SQL

In Section 3.1 we present the candidate generation procedure in SQL and in Section 3.2 we present the support counting procedure.

3.1 Candidate generation in SQL

Each pass k of the Apriori algorithm first generates a candidate itemset set Ck from the frequent itemsets Fk-1 of the previous pass. In the join step, a superset of the candidate itemsets Ck is generated by joining Fk-1 with itself:

    insert into Ck
    select I1.item1, ..., I1.itemk-1, I2.itemk-1
    from Fk-1 I1, Fk-1 I2
    where I1.item1 = I2.item1 and
          ...
          I1.itemk-2 = I2.itemk-2 and
          I1.itemk-1 < I2.itemk-1

For example, let F3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}}.

Next, in the prune step, all itemsets c ∈ Ck where some (k-1)-subset of c is not in Fk-1 are deleted. Continuing with the example above, the prune step will delete the itemset {1 3 4 5} because the subset {1 4 5} is not in F3. We will then be left with only {1 2 3 4} in C4.

We can perform the prune step in the same SQL statement as the join step by writing it as a k-way join, as shown in Figure 2. A k-way join is used since for any k-itemset there are k subsets of length (k-1) for which Fk-1 needs to be checked for membership. The join predicates on I1 and I2 remain the same. After the join between I1 and I2 we get a k-itemset consisting of (I1.item1, ..., I1.itemk-1, I2.itemk-1). For this itemset, two of its (k-1)-length subsets are already known to be frequent, since it was generated from two itemsets in Fk-1. We check the remaining k-2 subsets using additional joins. The predicates for these joins are enumerated by skipping one item at a time from the k-itemset as follows: we first skip item1 and check if the subset (I1.item2, ..., I1.itemk-1, I2.itemk-1) belongs to Fk-1, as shown by the join with I3 in the figure. In general, for a join with Ir, we skip item r-2. We construct a primary index on (item1, ..., itemk-1) of Fk-1 to efficiently process these k-way joins using index probes.

Ck need not always be materialized before the counting phase. Instead, the candidate generation can be pipelined with the subsequent SQL queries used for support counting.

    Figure 2: Candidate generation for any k
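As an illustration of the k-way join in Figure 2, the following sketch spells out the combined join and prune steps for k = 4, under the assumption that F3 has columns (item1, item2, item3):

    insert into C4
    select I1.item1, I1.item2, I1.item3, I2.item3
    from F3 I1, F3 I2, F3 I3, F3 I4
    where I1.item1 = I2.item1 and          -- join step
          I1.item2 = I2.item2 and
          I1.item3 < I2.item3 and
          I3.item1 = I1.item2 and          -- prune: subset skipping item1
          I3.item2 = I1.item3 and
          I3.item3 = I2.item3 and
          I4.item1 = I1.item1 and          -- prune: subset skipping item2
          I4.item2 = I1.item3 and
          I4.item3 = I2.item3

On the example F3 above, the join with I4 is what eliminates {1 3 4 5}: its subset {1 4 5}, obtained by skipping item2, finds no matching row in F3.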

3.2 Counting support to find frequent itemsets

This is the most time-consuming part of the association rules algorithm. We use the candidate itemsets Ck and the data table T to count the support of the itemsets in Ck. We consider two different categories of SQL implementations:

(A) The first one is based purely on SQL-92. We discuss four approaches in this category in Section 4.

(B) The second utilizes object-relational extensions like UDFs, BLOBs (binary large objects) and table functions. Table functions [19] are virtual tables associated with a user-defined function which generate tuples on the fly. They have pre-defined schemas like any other table. The function associated with a table function can be implemented as a UDF. Thus, table functions can be viewed as UDFs that return a collection of tuples instead of scalar values.

We discuss six approaches in this category in Section 5. UDFs in this approach are lightweight and do not require extensive memory allocation and coding, unlike the UDF architectural option (Section 1.1).
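As a syntax illustration, a table function appears in the from clause wrapped in table(...) (DB2-style syntax [19]); the sketch below uses the Gather function described in Section 5.1 and relies on the fact that a table function may reference columns of tables appearing earlier in the from clause:

    -- each output row of Gather is one transaction: (tid, item_list)
    select t1.tid, t1.item_list
    from T,
         table (Gather(T.tid, T.item)) as t1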


4 Support counting using SQL-92

We studied four approaches in this category; we present the two better ones here. The other two are discussed in [21].

4.1 K-way joins

In each pass k, we join the candidate itemsets Ck with k transaction tables T and follow it up with a group by on the itemsets, as shown in Figure 3. The figure also shows a tree diagram of the query. These tree diagrams are not to be confused with the plan trees, which could look quite different.

    insert into Fk
    select item1, ..., itemk, count(*)
    from Ck, T t1, ..., T tk
    where t1.item = Ck.item1 and
          ...
          tk.item = Ck.itemk and
          t1.tid = t2.tid and
          ...
          tk-1.tid = tk.tid
    group by item1, item2, ..., itemk
    having count(*) > :minsup

Figure 3: Support counting by K-way join

This SQL computation, when merged with the candidate generation step, is similar to the one proposed in [25] as a possible mechanism to implement query flocks.

For pass 2 we use a special optimization where, instead of materializing C2, we replace it with the 2-way join between the F1s as shown in the candidate generation phase in Section 3.1. This saves the cost of materializing C2 and also provides early filtering of the Ts based on F1, instead of the larger C2 which is almost a Cartesian product of the F1s. In contrast, for the other passes corresponding to k > 2, Ck could be smaller than Fk-1 because of the prune step.
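Concretely, pass 2 with this optimization might look as follows (a sketch, assuming F1 has a column item1):

    insert into F2
    select i1.item1, i2.item1, count(*)
    from F1 i1, F1 i2, T t1, T t2
    where i1.item1 < i2.item1 and     -- self-join of F1 stands in for C2
          t1.item = i1.item1 and
          t2.item = i2.item1 and
          t1.tid = t2.tid
    group by i1.item1, i2.item1
    having count(*) > :minsup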

4.2 Subquery-based

This approach makes use of common prefixes between the itemsets in Ck to reduce the amount of work done during support counting. The support counting phase is split into a cascade of k subqueries. The l-th subquery Ql (see Figure 4) finds all tids that match the distinct itemsets formed by the first l columns of Ck (call it dl). The output of Ql is joined with T and dl+1 (the distinct itemsets formed by the first l+1 columns of Ck) to get Ql+1. The final output is obtained by a group-by on the k items to count support, as above.

    Datasets    #Records     #Transactions   #Items        Avg. #items
                (millions)   (millions)      (thousands)   (R/T)
    Dataset-A   4.4          1               85            4.4
    Dataset-B   7.5          2.5             15.8          2.62
    Dataset-C   6.6          0.21            15.8          31
    Dataset-D   14           1.44            480           9.62

Table 1: Description of different real-life datasets.

Note that the final select distinct operation on the Ck when l = k is not necessary. For pass 2, the special optimization of the KwayJoin approach is used.

    insert into Fk
    select item1, ..., itemk, count(*)
    from (Subquery Qk) t
    group by item1, item2, ..., itemk
    having count(*) > :minsup

    Subquery Ql (for any l between 1 and k):
    select item1, ..., iteml, tid
    from T tl,
         (Subquery Ql-1) as rl-1,
         (select distinct item1, ..., iteml from Ck) as dl
    where rl-1.item1 = dl.item1 and ... and
          rl-1.iteml-1 = dl.iteml-1 and
          rl-1.tid = tl.tid and
          tl.item = dl.iteml

    Subquery Q0: no subquery Q0.

[Tree diagram for Subquery Ql omitted.]

Figure 4: Support counting using subqueries
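Purely to illustrate the cascade, here is the query unrolled for k = 2 (a sketch; Q0 is empty, so Q1 reduces to a join of T with the distinct first items of C2 — recall that in practice pass 2 uses the KwayJoin optimization instead):

    insert into F2
    select item1, item2, count(*)
    from (
      -- Subquery Q2
      select d2.item1, d2.item2, t2.tid
      from T t2,
           -- Subquery Q1
           (select d1.item1, t1.tid
            from T t1,
                 (select distinct item1 from C2) as d1
            where t1.item = d1.item1) as r1,
           (select distinct item1, item2 from C2) as d2
      where r1.item1 = d2.item1 and
            r1.tid = t2.tid and
            t2.item = d2.item2
    ) t
    group by item1, item2
    having count(*) > :minsup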

4.3 Performance comparison of SQL-92 approaches

We now briefly compare the different SQL-92 approaches; detailed results are available in [21].

Our experiments were performed on Version 5 of IBM DB2 Universal Server installed on an RS/6000 Model 140 with a 200 MHz CPU, 256 MB main memory and a 9 GB disk with a measured transfer rate of 8 MB/sec.

We selected four real-life datasets obtained from mail-order companies and retail stores for the experiments. These datasets differ in the values of parameters like the number of (tid, item) pairs, the number of transactions (tids), the number of items and the average number of items per transaction. Table 1 summarizes the characteristics of these datasets.


We found that the best SQL-92 approach was the Subquery approach, which was often more than an order of magnitude better than the other three approaches. However, this approach was comparable to the Loose-coupling approach only in some cases, whereas for several others it did not complete even after taking ten times more time than the Loose-coupling approach.

The important conclusion we drew from this study, therefore, is that implementations based on pure SQL-92 are too slow to be considered an alternative to the existing Loose-coupling approach.

5 Support counting using SQL with object-relational extensions

In this section, we study approaches that use object-relational features in SQL to improve performance. We first consider an approach we call GatherJoin and its three variants in Section 5.1. Next we present a very different approach called Vertical in Section 5.2. We do not discuss the sixth approach, called SBF and based on SQL-bodied functions, because of its inferior performance (see [21]). For each approach, we also outline a cost-based analysis of the execution time to choose between these different approaches. In Section 5.3 we present performance comparisons.

5.1 GatherJoin

The GatherJoin approach (see Figure 5) generates all possible k-item combinations of items contained in a transaction, joins them with the candidate table Ck, and counts the support of the itemsets by grouping the join result. It uses two table functions, Gather and Comb-K. The data table T is scanned in the (tid, item) order and passed to the table function Gather, which collects all the items of a transaction in memory and outputs a record for each transaction. Each record consists of two attributes: the tid and item-list, which is a collection of all the items in a field of type VARCHAR or BLOB. The output of Gather is passed to another table function Comb-K, which returns all k-item combinations formed out of the items of a transaction. A record output by Comb-K has k attributes T_itm1, ..., T_itmk, which can be used to probe into the Ck table. An index is constructed on all the items of Ck to make the probe efficient.

This approach is analogous to the KwayJoin approach except that we have replaced the k-way self-join of T with the table functions Gather and Comb-K. These table functions are easy to code and do not require a large amount of memory. It is also possible to merge them into a single table function GatherComb-K, which is what we did in our implementation. Note that the Gather function is not required when the data is already in a horizontal format where each tid is followed by a collection of all its items.

Special pass 2 optimization: For k = 2, the 2-candidate set C2 is simply a join of F1 with itself. Therefore, we can optimize pass 2 by replacing the join with C2 by a join with F1 before the table function (see Figure 6). The table function now gets only frequent items and generates significantly fewer 2-item combinations. We apply this optimization to other passes too. However, unlike pass 2, we still have to do the final join with Ck and therefore the benefit is not as significant.

    insert into Fk
    select item1, ..., itemk, count(*)
    from Ck,
         (select t2.T_itm1, ..., t2.T_itmk
          from T,
               table (Gather(T.tid, T.item)) as t1,
               table (Comb-K(t1.tid, t1.item_list)) as t2)
    where t2.T_itm1 = Ck.item1 and
          ...
          t2.T_itmk = Ck.itemk
    group by Ck.item1, ..., Ck.itemk
    having count(*) > :minsup


    Figure 5: Support Counting by GatherJoin

[Figure 6: tree diagram of the special pass-2 optimization of GatherJoin.]
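In SQL, the pass-2 variant of Figure 5 might look roughly like the sketch below (GatherComb-2 stands for the merged table function; the join with F1 restricts the scan to frequent items before combinations are generated, and the input must reach the function in (tid, item) order):

    insert into F2
    select t2.T_itm1, t2.T_itm2, count(*)
    from (select T.tid, T.item
          from T, F1
          where T.item = F1.item1) as f,
         table (GatherComb-2(f.tid, f.item)) as t2
    group by t2.T_itm1, t2.T_itm2
    having count(*) > :minsup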


GatherPrune: A problem with the GatherJoin approach is the high cost of joining the large number of item combinations with Ck. We can push the join with Ck inside the table function and thus reduce the number of such combinations. Ck is converted to a BLOB and passed as an argument to the table function (a concrete sketch appears below). The cost of passing the BLOB for each of the R tuples can be high. In general, we can reduce the parameter passing cost by using a smaller BLOB that only approximates the real Ck. The trade-off is increased cost for other parts, notably grouping, because not as many combinations are filtered. A problem with this approach is the increased coding complexity of the table function.

Horizontal: This is another variation of GatherJoin that first uses the Gather function to transform the data to the horizontal format, but is otherwise similar to the GatherJoin approach. Rajamani et al. [20] propose finding associations using a similar approach augmented with some pruning based on a variation of the GatherPrune approach. Their results assume that the data is already in a horizontal format, which is often not true in practice. They report that their SQL implementation is two to six times slower than a UDF implementation.
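Returning to GatherPrune, a pass-3 instantiation might look like this sketch (GatherCombPrune-3 is a hypothetical merged table function; :C3_blob is the candidate set C3 encoded as a BLOB and passed on every invocation):

    insert into F3
    select t2.T_itm1, t2.T_itm2, t2.T_itm3, count(*)
    from T,
         table (GatherCombPrune-3(T.tid, T.item, :C3_blob)) as t2
    -- the function emits only combinations present in the BLOB-encoded C3,
    -- so no separate join with C3 is needed
    group by t2.T_itm1, t2.T_itm2, t2.T_itm3
    having count(*) > :minsup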

    R             number of records in the input transaction table
    T             number of transactions
    N             average number of items per transaction = R/T
    F1            number of frequent items
    S(C)          sum of the support of each itemset in set C
    Rf            number of records out of R involving frequent items = S(F1)
    Nf            average number of frequent items per transaction = Rf/T
    Ck            number of candidate k-itemsets
    C(n, k)       number of combinations of size k possible out of a set of
                  size n: n!/(k!(n-k)!)
    sk            cost of generating a k-item combination using table
                  function Comb-K
    group(n, m)   cost of grouping n records out of which m are distinct
    join(n, m, r) cost of joining two relations of size n and m to get a
                  result of size r
    blob(n)       cost of passing a BLOB of size n integers as an argument

Table 2: Notations used for cost analysis of different approaches

5.1.2 Cost analysis of GatherJoin and its variants

The relative performance of these variants depends on a number of data characteristics like the number of items, the total number of transactions, the average length of a transaction, etc. We express the costs in each pass in terms of parameters that are known or can be estimated after the candidate generation step of each pass. The purpose of this analysis is to help us choose between the different options. Therefore, instead of including all I/O and CPU costs, we include only those terms that help us distinguish between different options. We use the notations of Table 2 in the cost analysis.

The cost of GatherJoin includes the cost of generating k-item combinations, joining with Ck and grouping to count the support. The number of k-item combinations generated, Tk, is C(N, k) * T. The join with Ck filters out the non-candidate item combinations. The size of the join result is the sum of the support of all the candidates, denoted by S(Ck). The actual value of the support of a candidate itemset will be known only after the support counting phase. However, we approximate it by the minimum of the supports of all its (k-1)-subsets in Fk-1. The total cost of the GatherJoin approach is:

    Tk * sk + join(Tk, Ck, S(Ck)) + group(S(Ck), Ck),  where Tk = C(N, k) * T

The above cost formula needs to be modified to reflect the special optimization of joining with F1 to consider only frequent items. We need a new term join(R, F1, Rf) and need to change the formula for Tk to include only the frequent items Nf instead of N.

For the second pass, we do not need the outer join with Ck. The total cost of GatherJoin in the second pass is:

    join(R, F1, Rf) + T2 * s2 + group(T2, C2),  where T2 = C(Nf, 2) * T ≈ (Nf^2 / 2) * T

The cost of GatherCount in the second pass is similar to that for basic GatherJoin except for the final grouping cost:

    join(R, F1, Rf) + groupinternal(T2, C2) + F2 * s2

In this formula, groupinternal denotes the cost of doing the support counting inside the table function. Cost formulas for the GatherPrune and Horizontal approaches can be derived similarly and appear in [21].

5.2 Vertical

We first transform the data table into a vertical format by creating, for each item, a BLOB containing all tids that contain that item (the tid-list creation phase), and then count the support of itemsets by merging together these tid-lists (the support counting phase). This approach is similar to the approaches in [26]. For creating the tid-lists we use a table function Gather. This is the same as the Gather function in GatherJoin except that we create the tid-list for each frequent item. The data table T is scanned in the (item, tid) order and passed to the function Gather. The function collects the tids of all tuples of T with the same item in memory and outputs an (item, tid-list) tuple for items that meet the minimum support criterion. The tid-lists are represented as BLOBs and stored in a new TidTable with attributes (item, tid-list).

In the support counting phase, for each itemset in Ck we want to collect the tid-lists of all k items and use a UDF to count the number of tids in the intersection of these k lists. The tids are in the same sorted order in all the tid-lists, and therefore the intersection can be done efficiently by a single pass over the k lists. This step can be improved by decomposing the intersect operation to share these operations across itemsets having common prefixes, as follows. We first select distinct (item1, item2) pairs from Ck. For each distinct pair we first perform the intersect operation to get a new result-tidlist, then find distinct triples (item1, item2, item3) from Ck with the same first two items, intersect result-tidlist with the tid-list of item3 for each triple, and continue with item4 and so on until all k tid-lists per itemset


are intersected. This approach is analogous to the Subquery approach presented for SQL-92.

The above sequence of operations can be written as a single SQL query for any k, as shown in Figure 7. The final intersect operation can be merged with the count operation to return a count instead of the tid-list; we do not show this optimization in the query of Figure 7 for simplicity.

    insert into Fk
    select item1, ..., itemk, count(tid-list) as cnt
    from (Subquery Qk) t
    where cnt > :minsup

    Subquery Ql (for any l between 2 and k):
    select item1, ..., iteml,
           Intersect(rl-1.tid-list, tl.tid-list) as tid-list
    from TidTable tl,
         (Subquery Ql-1) as rl-1,
         (select distinct item1, ..., iteml from Ck) as dl
    where rl-1.item1 = dl.item1 and ... and
          rl-1.iteml-1 = dl.iteml-1 and
          tl.item = dl.iteml

    Subquery Q1: (select * from TidTable)

[Tree diagram for Subquery Ql omitted.]

Figure 7: Support counting using UDF

Special pass 2 optimization: For pass 2 we need not generate C2 and join the TidTable with C2. Instead, we perform a self-join on the TidTable using the predicate t1.item < t2.item, count the size of the intersection of the two tid-lists directly, and keep the pairs whose count exceeds :minsup.
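Spelled out, this pass-2 self-join might look like the following sketch (CountIntersect is a hypothetical scalar UDF returning the number of common tids in two BLOB tid-lists):

    insert into F2
    select item1, item2, cnt
    from (select t1.item as item1, t2.item as item2,
                 CountIntersect(t1.tid_list, t2.tid_list) as cnt
          from TidTable t1, TidTable t2
          where t1.item < t2.item) as t
    where cnt > :minsup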

5.2.1 Cost analysis

The cost of the Vertical approach during support counting is dominated by the cost of invoking the UDFs and intersecting the tid-lists. The UDF is first called for each distinct item pair in Ck, then for each distinct item triple, and so on. Let dj be the number of distinct j-item tuples in Ck; then the number of UDF invocations is the sum of the dj for j = 2 to k. In each invocation two BLOB tid-lists are passed as arguments. The UDF intersects the tid-lists by a merge pass, and hence the cost is proportional to twice the average length of a tid-list, which can be approximated as Rf/F1. Note that with each intersect the tid-list keeps shrinking; however, we ignore such effects for simplicity.

The total cost of the Vertical approach is:

    (Σ_{j=2..k} dj) * (2 * blob(Rf/F1) + Intersect(2 * Rf/F1))

In the above formula, Intersect(n) denotes the cost of intersecting two tid-lists with a combined size of n. We are not including the join costs in this analysis because they accounted for only a small fraction of the total cost.

5.3 Performance comparison of SQL-OR approaches

We studied the performance of six SQL-OR approaches using the datasets summarized in Table 1. Figure 8 shows the results for only four approaches: GatherJoin, GatherCount, GatherPrune and Vertical. For the other two approaches (Horizontal and SBF) the running times were so large that we had to abort the runs in many cases. The reason why the Horizontal approach was significantly worse than the GatherJoin approach was the time to transform the data to the horizontal format.

We first concentrate on the overall comparison between the different approaches. Then we compare the approaches based on how they perform in each pass of the algorithm.

The Vertical approach has the best overall performance, and it is sometimes more than an order of magnitude better than the other three approaches.

The majority of the time of the Vertical approach is spent in transforming the data to the vertical format in most cases (shown as prep in Figure 8). The vertical representation is like an index on the item attribute. If we think of this time as a one-time activity like index building, then performance looks even better. The time to transform the data to the vertical format was much smaller than the time for the horizontal format, although both formats write almost the same amount of data. The reason is the difference in the number of records written. The number of frequent items is often two to three orders of magnitude smaller than the number of transactions.

Between GatherJoin and GatherPrune, neither strictly dominates the other. The special pass-2 optimization in GatherJoin had a big impact on performance. With this optimization, for Dataset-B with support 0.1%, the running time for pass 2 was reduced from 5.2 hours to 10 minutes.

When we compare these approaches based on the time spent in each pass, no single approach emerges as the best for all passes over all datasets. For pass three onwards, Vertical is often two or more orders of magnitude better than the other approaches. For higher passes, the performance degrades dramatically for GatherJoin, because the table function GatherComb-K generates a large number of combinations. GatherPrune is better than GatherJoin for the third and later passes. For pass 2, GatherPrune is worse because the overhead of passing a large object as an argument dominates the cost.

The Vertical approach sometimes spends too much time in the second pass. In some of these cases the GatherJoin approach was better in the second pass (for instance, for low support values of Dataset-B), whereas in other cases (for instance, Dataset-C with minimum support 0.25%) GatherCount was the only good option. In the latter case, both GatherPrune and GatherJoin did not complete even after more than six hours for pass 2. Further, they caused a storage overflow error because of the large size of the intermediate results to be sorted. We had to divide the dataset into four


Figure 8: Comparison of four SQL-OR approaches (Vertical, GatherPrune, GatherJoin and GatherCount) on four datasets for different support values. The time taken is broken down by each pass and an initial prep stage where any one-time data transformation cost is included.

equal parts and ran the second pass independently on each partition to avoid this problem.

Figure 9: Effect of increasing transaction length

Two factors affect the choice amongst the Vertical, GatherJoin and GatherCount approaches in different passes, and pass 2 in particular: the number of frequent items (F1) and the average number of frequent items per transaction (Nf). From Figure 8 we notice that as the value of the support is decreased for each dataset, causing the size of F1 to increase, the performance of pass 2 of the Vertical approach degrades rapidly. This trend is also clear from our cost formulae: the cost of the Vertical approach increases quadratically with F1. GatherJoin depends more critically on the number of frequent items per transaction. For Dataset-B, even when the size of F1 increases by a factor of 10, the value of Nf remains close to 2; therefore the time taken by GatherJoin does not increase as much. However, for Dataset-C the size of Nf increases from 3.2 to 10 as the support is decreased from 2.0% to 0.25%, causing GatherJoin to deteriorate rapidly. From the cost formula for GatherJoin we notice that the total time for pass 2 increases almost quadratically with Nf.

We validated this observation further by running experiments on synthetic datasets for varying values of the number of frequent items per transaction. We used the synthetic dataset generator described in [3] for this purpose. We varied the transaction length, the number of transactions and the support values while keeping the total number of records and the number of frequent items fixed. In Figure 9 we show the total time spent in pass 2 of the Vertical and GatherJoin approaches. As the number of items per transaction (transaction length) increases, the cost of Vertical remains almost unchanged whereas the cost of GatherJoin increases.

5.4 Final hybrid approach

The previous performance section helps us draw the following conclusions: overall, the Vertical approach is the best option, especially for higher passes. When the size of the candidate itemsets is too large, the performance of the Vertical approach could suffer. In such cases, GatherJoin is a good option as long as the number of frequent items per transaction (Nf) is not too large. When Nf is large, GatherCount may be the only good option, even though it may not easily parallelize. The hybrid scheme chooses the best of the three approaches GatherJoin, GatherCount and Vertical for each pass based on the cost estimates outlined in Sections 5.1.2 and 5.2.1.


• ... because of closer coupling with the database.

• The SQL approach comes second in performance after the Cache-Mine approach for low support values, and is even somewhat better for high support values. The cost of converting the data to the vertical format for SQL is typically lower than the cost of transforming data to binary format outside the DBMS for Cache-Mine. However, after the initial transformation, subsequent passes take negligible time for Cache-Mine. For the second pass, SQL takes significantly more time than Cache-Mine, particularly when we decrease support. For subsequent passes even the SQL approach does not spend too much time. Therefore, the difference between Cache-Mine and SQL is not very sensitive to the number of passes, because both approaches spend negligible time in higher passes.

• The SQL approach is 1.8 to 3 times better than the Stored-procedure or Loose-coupling approaches. As we decrease the support value so that the number of passes over the dataset increases, the gap widens.

Figure 11: Scale-up with increasing number of transactions

Figure 12: Scale-up with increasing transaction length

6.1.1 Scale-up experiment

Our experiments with the four real-life datasets above have shown the scaling property of the different approaches with decreasing support value and increasing number of frequent itemsets. We experiment with synthetic datasets to study other forms of scaling: increasing number of transactions and increasing average length of transactions. Figure 11 shows how Stored-procedure, Cache-Mine and SQL scale with increasing number of transactions. UDF and Loose-coupling have similar scale-up behavior to Stored-procedure; therefore we do not show these approaches in the figure.

We used a dataset with 10 as the average number of items per transaction, 100 thousand total items and a default pattern length (defined in [3]) of 4. Thus, the size of the dataset is 10 times the number of transactions. As the number of transactions is increased from 10K to 3000K, the time taken increases proportionately. The largest frequent itemset was 5 long. This explains the five-fold difference in performance between the Stored-procedure and the Cache-Mine approach. Figure 12 shows the scaling when the transaction length changes from 3 to 50 while keeping the number of transactions fixed at 100K. All three approaches scale linearly with increasing transaction length.

6.2 Space overhead of different approaches

We summarize the space required for different options. We assume that the tids and items are integers. The space requirements for UDF and Loose-coupling are the same as that for Stored-procedure, which in turn is less than the space needed by the Cache-Mine and SQL approaches. The Cache-Mine and SQL approaches have comparable storage overheads. For Stored-procedure and UDF we do not need any extra storage for caching. However, all three options Cache-Mine, Stored-procedure and UDF require the data in each pass to be grouped on the tid. In a relational DBMS we cannot assume any order on the physical layout of a table, unlike in a file system. Therefore, we need either an index on the data table or we need to sort the table every time to ensure a particular order. Let R denote the total number of (tid, item) pairs in the data table. Either option has a space overhead of 2 * R integers. The Cache-Mine approach caches the data in an alternative binary format where each tid is followed by all the items it contains. Thus, the size of the cached data in Cache-Mine is at most R + T integers, where T is the number of transactions. For SQL we use the hybrid Vertical option. This requires creation of an initial TidTable of size at most I + R, where I is the number of items. Note that this is slightly less than the cache required by the Cache-Mine approach. The SQL approach needs to sort the data in pass 1 in all cases, and in pass 2 in some cases where we used the GatherJoin approach instead of the Vertical approach.

In summary, the UDF and Stored-procedure approaches require the least amount of space, followed by the Cache-Mine and the SQL approaches, which require roughly as much extra storage as the data. When the item-ids or tids are character strings instead of integers, the extra space needed by Cache-Mine and SQL is a much smaller fraction of the total data size, because before caching we always convert item-ids to their compact integer representation and store them in binary format. Details on how to do this conversion for SQL are presented in [21].
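Plugging in Dataset-B's figures (R = 7.5 million (tid, item) pairs, T = 2.5 million transactions, I = 15.8 thousand items) and assuming 4-byte integers, purely as an illustration:

    index or sort overhead for tid ordering:  2 * R = 15M integers ≈ 60 MB
    Cache-Mine cache:                         ≤ R + T = 10M integers ≈ 40 MB
    SQL (initial TidTable):                   ≤ I + R ≈ 7.5M integers ≈ 30 MB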

6.3 Summary of comparison between different architectures

We present a summary of the pros and cons of the different architectures on each of the following yardsticks: (a) performance (execution time); (b) storage overhead; (c) potential for automatic parallelization; (d) development and maintenance ease; (e) portability; (f) inter-operability.

In terms of performance, the Cache-Mine approach is the best option. The SQL approach is a close second: it is always within a factor of two of Cache-Mine for all of our experiments and is sometimes even slightly better. The UDF approach is better than the Stored-procedure approach by 30 to 50%.


Between Stored-procedure and Cache-Mine, the performance difference is a function of the number of passes made on the data: if we make four passes of the data, the Stored-procedure approach is four times slower than Cache-Mine. Some of the recent proposals [24, 6] that attempt to minimize the number of data passes to 2 or 3 might be useful in reducing the gap between the Cache-Mine and the Stored-procedure approach.

In terms of space requirements, the Cache-Mine and the SQL approaches lose to the UDF or the Stored-procedure approach. The Cache-Mine and SQL approaches have similar storage requirements.

The SQL implementation has the potential for automatic parallelization, particularly on an SMP machine. Parallelization could come for free for SQL-92 queries. Unfortunately, the SQL-92 option is too slow to be a candidate for parallelization. The stumbling block for automatic parallelization using SQL-OR could be queries involving UDFs that use scratch pads. The only such function in our queries is the Gather table function. This function essentially implements a user-defined aggregate, and would have been easy to parallelize if the DBMS provided support for user-defined aggregates or allowed explicit control from the application over how to partition the data amongst different parallel instances of the function. On an MPP machine, although one could rely on the DBMS to come up with a data partitioning strategy, it might be possible to better tune performance if the application could provide hints about the best partitioning [4]. Further experiments are required to assess how the performance of these automatic parallelizations would compare with algorithm-specific parallelizations (e.g. [4]).

The development time and code size using SQL could be shorter if one can get efficient implementations out of expressing the mining algorithms declaratively using a few SQL statements. Thus, one can avoid writing and debugging code for memory management, indexing and space management, all of which are already provided in a database system. (Note that these same code-reuse advantages can be obtained from a well-planned library of mining building blocks.) However, there are some detractors to easy development using the SQL alternative. First, any attached UDF code will be harder to debug than stand-alone C++ code due to the lack of debugging tools. Second, stand-alone code can be debugged and tested faster when run against flat file data. Running against flat files is typically a factor of five to ten faster compared to running against data stored in DBMS tables. Finally, some mining algorithms (e.g. neural-net based) might be too awkward to express in SQL.

The ease of porting of the SQL alternative depends on the kind of SQL used. Within the same DBMS, porting from one OS platform to another requires porting only the small UDF code and hence is easy. In contrast, the Stored-procedure and Cache-Mine alternatives require porting a much larger body of code. Porting from one DBMS to another could get hard for the SQL approach if non-standard DBMS-specific features are used. For instance, our preferred SQL implementation relies on the availability of DB2's table functions, for which the interface is still not standardized across other major DBMS vendors. Also, if different features have different performance characteristics on different database systems, considerable tuning would be required. In contrast, the Stored-procedure and Cache-Mine approaches are not tied to any DBMS-specific features. The UDF implementation has the worst of both worlds: it is large and is tied to a DBMS.

One attraction of the SQL implementation is inter-operability and usage flexibility. The ad hoc querying support provided by the DBMS enables flexible usage and exposes potential for pipelining the input and output operators of the mining process with other operators in the DBMS. However, to exploit this feature one needs to implement the mining operators inside the DBMS. This would require major rework in existing database systems. The SQL approach presented here is based on embedded SQL and as such cannot provide operator pipelining and inter-operability. Queries on the mined results are possible with all four alternatives as long as the mined results are stored back in the DBMS.

7 Conclusion and future work

We explored various architectural alternatives for integrating mining with a relational database system. As an initial step in that direction, we studied the association rules algorithms with the twin goals of finding the trade-offs between architectural options and the extensions needed in a DBMS to efficiently support mining. We experimented with different ways of implementing the association rules mining algorithm in SQL to find if it is at all possible to get competitive performance out of SQL implementations.

We considered two categories of SQL implementations. First, we experimented with four different implementations based purely on SQL-92. Experiments with real-life datasets showed that it is not possible to get good performance out of pure SQL-92 based approaches alone. We next experimented with a collection of approaches that made use of the new object-relational extensions like UDFs, BLOBs, table functions, etc. With this extended SQL we got orders of magnitude improvement over the SQL-92 based implementations.

We compared the SQL implementation with different architectural alternatives. We concluded that, based just on performance, the Cache-Mine approach is the winner. A close second is the SQL-OR approach, which was sometimes slightly better than Cache-Mine and was never worse than a factor of two on our datasets. Both these approaches require additional storage for caching, however. The Stored-procedure approach does not require any extra space (except possibly for initially sorting the data in the DBMS) and can perhaps be made to be within a factor of two to three of Cache-Mine with the recent algorithms [24, 6]. The UDF approach is a factor of 0.3 to 0.5 faster than Stored-procedure but is significantly harder to code. The SQL approach offers some secondary advantages like easier development and maintenance and the potential for automatic parallelization. However, it might not be as portable as the Cache-Mine approach across different database management systems.

The work presented in this paper points to several directions for future research. A natural next step is to experiment with other kinds of mining operations (e.g. clustering and classification [8]) to verify if our conclusions about associations hold for these other cases too. We experimented with the generalized association rules [22] and sequential patterns [23] problems and found similar results. In some ways associations is the easiest to integrate, as the frequent itemsets can be viewed as generalized group-bys. Another useful direction is to explore what kind of support is needed for answering short, interactive, ad hoc queries involving a mix of mining and relational operations. How much can we leverage from existing relational engines? What data model and language extensions are needed? Some of these questions are orthogonal to whether the bulky mining operations are implemented using SQL or not. Nevertheless, these are important in providing analysts with a well-integrated platform where mining and relational operations can be inter-mixed in flexible ways.


Acknowledgements: We wish to thank Cliff Leung, Guy Lohman, Eric Louie, Hamid Pirahesh, Eugene Shekita, Dave Simmen, Amit Somani, Ramakrishnan Srikant, George Wilson and Swati Vora for useful discussions and help with DB2.

References

[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, and R. Srikant. The Quest Data Mining System. In Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.

[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.

[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI/MIT Press, 1996.

[4] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), December 1996.

[5] R. Agrawal and K. Shim. Developing tightly-coupled data mining applications on a relational database system. In Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.

[6] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Conference on Management of Data, May 1997.

[7] D. Chamberlin. Using the New DB2: IBM's Object-Relational Database System. Morgan Kaufmann, 1996.

[8] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[9] J. Han, Y. Fu, K. Koperski, W. Wang, and O. Zaiane. DMQL: A data mining query language for relational databases. In Proc. of the 1996 SIGMOD workshop on research issues on data mining and knowledge discovery, Montreal, Canada, May 1996.

[10] M. Houtsma and A. Swami. Set-oriented mining of association rules. In Int'l Conference on Data Engineering, Taipei, Taiwan, March 1995.

[11] IBM Corporation. DB2 Universal Database Application Programming Guide, Version 5, 1997.

[12] T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58-64, November 1996.

[13] T. Imielinski, A. Virmani, and A. Abdulghani. Discovery Board Application Programming Interface and Query Language for Database Mining. In Proc. of the 2nd Int'l Conference on Knowledge Discovery and Data Mining, Portland, Oregon, August 1996.

[14] International Business Machines. IBM Intelligent Miner User's Guide, Version 1 Release 1, SH12-6213-00 edition, July 1996.

[15] K. Kulkarni. Object oriented extensions in SQL3: a status report. SIGMOD Record, 1994.

[16] J. Melton and A. Simon. Understanding the New SQL: A Complete Guide. Morgan Kaufmann, 1992.

[17] R. Meo, G. Psaila, and S. Ceri. A new SQL like operator for mining association rules. In Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.

[18] Oracle. Oracle RDBMS Database Administrator's Guide, Volumes I, II (Version 7.0), May 1992.

[19] H. Pirahesh and B. Reinwald. SQL table function open architecture and data access middleware. In SIGMOD, 1998.

[20] K. Rajamani, B. Iyer, and A. Chaddha. Using DB2's object relational extensions for mining association rules. Technical Report TR 03.690, Santa Teresa Laboratory, IBM Corporation, September 1997.

[21] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. Research Report RJ 10107 (91923), IBM Almaden Research Center, San Jose, CA 95120, March 1998. Available from http://www.almaden.ibm.com/cs/quest.

[22] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995.

[23] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.

[24] H. Toivonen. Sampling large databases for association rules. In Proc. of the 22nd Int'l Conference on Very Large Databases, pages 134-145, Mumbai (Bombay), India, September 1996.

[25] D. Tsur, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association rule mining. In SIGMOD, 1998. To appear.

[26] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast Discovery of Association Rules. In Proc. of the 3rd Int'l Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 1997.
