
School of IT Technical Report

GEOMETRICALLY INSPIRED ITEMSET MINING
TECHNICAL REPORT 588

FLORIAN VERHEIN AND SANJAY CHAWLA

JULY, 2006


Geometrically Inspired Itemset Mining

Florian Verhein, Sanjay Chawla
School of Information Technologies, The University of Sydney, Australia

{fverhein,chawla}@it.usyd.edu.au

Abstract

In our geometric view, an itemset is a vector (itemvector) in the space of transactions. The support of an itemset is the generalized dot product of the participating items. Linear and potentially non-linear transformations can be applied to the itemvectors before mining patterns. Aggregation functions can be applied to the transformed vectors and pushed inside the mining process. We show that interesting itemset mining can be carried out by instantiating four abstract functions: a transformation (g), an algebraic aggregation operator (◦) and measures (f and F). For frequent itemset mining (FIM), g and F are identity transformations, ◦ is intersection and f is the cardinality. Based on the geometric view we present a novel algorithm that uses space linear in the number of 1-itemsets to mine all interesting itemsets in a single pass over this data, with no candidate generation. It scales (roughly) linearly in running time with the number of interesting itemsets. FIM experiments show that it outperforms FP-Growth on realistic datasets above a small support threshold (0.29% and 1.2% in our experiments).

1 Introduction

Traditional Association Rule Mining (ARM) considers a set of transactions T containing items I. Each transaction t ∈ T is a subset of the items, t ⊂ I. The most time-consuming task of ARM is Frequent Itemset Mining (FIM), whereby all itemsets i ⊂ I that occur in a sufficient number of transactions are generated. Specifically, an itemset i is frequent if σ(i) ≥ minSup, where σ(i) = |{t : i ⊂ t}| is the number of transactions containing i (known as the support of i).

For item enumeration type algorithms, each transaction has generally been recorded as a row in the dataset. These algorithms make two or more passes, reading the dataset one transaction at a time.

We consider the same data in its transposed format: each row, xi (corresponding to an item i), contains the set of transaction identifiers (tids) of the transactions containing i. Specifically, xi = {t.tid : t ∈ T ∧ i ∈ t}. We call xi an itemvector because it represents an item in the space spanned by the transactions1. An example is provided in Figure 1(a).

Just as we can represent an item as an itemvector, we can represent an itemset I′ ⊂ I as an itemvector: xI′ = {t.tid : t ∈ T ∧ I′ ⊂ t}. Figure 1(b) shows all itemsets that have support greater than one, represented as vectors in transaction space. For example, consider x4 = {t2, t3} located at g and x2 = {t1, t2, t3} located at f. x{2,4} can be obtained using x{2,4} = x2 ∩ x4 = {t2, t3}, and so is located at g. It should be clear that σ(xI′) = |xI′| = |∩i∈I′ xi|.
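To make the set view concrete, here is a minimal Java sketch (our own illustration; the class and variable names are not from the report) that computes x{2,4} and its support exactly as above:

```java
import java.util.*;

public class ItemvectorDemo {
    public static void main(String[] args) {
        // Itemvectors from Figure 1(a): each item maps to the set of tids of
        // the transactions containing it.
        Set<String> x2 = new HashSet<>(Arrays.asList("t1", "t2", "t3"));
        Set<String> x4 = new HashSet<>(Arrays.asList("t2", "t3"));

        // The itemvector of the itemset {2,4} is the intersection x2 ∩ x4.
        Set<String> x24 = new HashSet<>(x2);
        x24.retainAll(x4);

        // The support is the cardinality of the resulting itemvector.
        System.out.println("x{2,4} = " + x24 + ", support = " + x24.size());
        // prints the two-element set {t2, t3}, so support = 2
    }
}
```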

There are three important things to note from the above: (1) We can represent an item by a vector (we used a set representation to encode its location in transaction space, but we could equally well have used an alternate representation in some other space). (2) We can create itemvectors that represent itemsets by performing a simple operation on the itemvectors (in the case above, set intersection). (3) We can evaluate a measure using a simple function on the itemvector (in the above case, we used set size and the support measure). These fundamental operations are all that are required for a mining algorithm. In Section 3, we generalise them to functions g(·), f(·), operator ◦ and an additional function F(·) for more complicated measures.

So far we have considered itemvectors as binary. There is no reason for this restriction. Provided that we have functions f(·) and F(·) and an operator ◦ that obey the requirements set out in Section 3, we can map the original itemvector into some other space via g(·).

1 For simplicity we will use transactions and their tids interchangeably.


Figure 1. Running itemvector example. (a) Transposing a dataset of three transactions (tid ∈ {t1, t2, t3}) containing items I = {1, 2, 3, 4, 5}. (b) Itemsets with support greater than 1 in transaction space.

This has potential for dimensionality and noise reduction or sketch algorithms. Suppose it is possible to perform Singular Value Decomposition (SVD) or Random Projections [1] to reduce the noise and dimensionality (number of transactions) before mining itemsets. This has previously been impossible since the transformation creates real valued vectors, and hence cannot be mined using existing algorithms. Using our framework, all that is required are suitable ◦, f(·) and F(·). We can also use measures other than support. Any anti-monotonic (or prefix or weakly anti-monotonic) measure that fits into our framework can be used.

We briefly illustrate some of the ideas used in our algorithm using Figure 1(a). We simply wish to convey the importance of the transpose view to our technique, and introduce some of the challenges we solved. To keep things simple, we use the instantiation of g(·), ◦, f(·) and F(·) required for traditional FIM. Our algorithm scans the transposed dataset row by row. Suppose we scan it bottom up2 so we first read x5 = {t1, t3}. Assume minSup = 1. We can immediately say that σ({5}) = 2 ≥ minSup and so itemset {5} is frequent. We then read the next row, x4 = {t2, t3}, and find that {4} is frequent. Since we now have both x5 and x4, we can create x{4,5} = x4 ∩ x5 = {t3}. We have now checked all possible itemsets containing items 4 and 5. To progress, we read x3 = {t2} and find that {3} is frequent. We can also check more itemsets: x{3,5} = x3 ∩ x5 = ∅ and x{3,4} = x3 ∩ x4 = {t2}, so {3, 4} is frequent. Since {3, 5} is not frequent, neither is {3, 4, 5} by the anti-monotonic property of support [2]. We next read x2 and continue the process. It should be clear from the above that (1) a single pass over the dataset is sufficient to mine all frequent itemsets, (2) having processed any n itemvectors corresponding to items in J = {1, ..., n}, we can generate all itemsets L ⊂ J, and (3) having the dataset in transpose format and using the itemvector concept allows this to work.

2 This is just so the ordering of items in the data structure of our algorithm is increasing.
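The single-pass idea can be mimicked in a few lines of Java (a toy sketch only, with our own naming; GLIMIT itself manages which itemvectors stay in memory far more carefully, as discussed next): keep the frequent itemsets found so far and intersect each newly read row with them.

```java
import java.util.*;

public class SinglePassToy {
    public static void main(String[] args) {
        // Transposed dataset read bottom up: x5, x4, x3 (tid-sets from Figure 1(a)).
        Map<String, Set<String>> rows = new LinkedHashMap<>();
        rows.put("5", new HashSet<>(Arrays.asList("t1", "t3")));
        rows.put("4", new HashSet<>(Arrays.asList("t2", "t3")));
        rows.put("3", new HashSet<>(Arrays.asList("t2")));
        int minSup = 1;

        // Frequent itemsets found so far, keyed by their item lists.
        Map<String, Set<String>> frequent = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> row : rows.entrySet()) {
            // Snapshot the itemsets known before this row, so the new item is
            // combined only with previously completed itemsets.
            List<Map.Entry<String, Set<String>>> known = new ArrayList<>(frequent.entrySet());
            if (row.getValue().size() >= minSup)
                frequent.put("{" + row.getKey() + "}", row.getValue());
            for (Map.Entry<String, Set<String>> old : known) {
                Set<String> combined = new HashSet<>(old.getValue());
                combined.retainAll(row.getValue());   // intersect itemvectors
                if (combined.size() >= minSup)
                    frequent.put(old.getKey() + "∪{" + row.getKey() + "}", combined);
            }
        }
        System.out.println(frequent.keySet());
        // [{5}, {4}, {5}∪{4}, {3}, {4}∪{3}] with minSup = 1; {3,5} is pruned.
    }
}
```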

Each itemvector could take up significant space, we may need many of them, and operations on them could be expensive. We generate at least as many itemvectors as there are frequent itemsets3. Since the number of itemsets is at worst 2^|I| − 1, clearly it is not feasible to keep all these in memory, nor do we need to. On the other hand, we do not want to recompute them as this is expensive. If there are n items we could use n itemvectors of space and create all itemsets, but we must then recompute most itemvectors multiple times, so the time is not linear in the number of frequent itemsets - it will be exponential. For example, suppose we have created x{1,2,3}. When we later need x{1,2,3,4}, we do not want to have to recalculate it as x1 ∩ x2 ∩ x3 ∩ x4. Instead, we would like to use the previously calculated x{1,2,3}: x{1,2,3,4} = x{1,2,3} ∩ x4 being one option. The challenge is to use as little space as necessary, while avoiding re-computations.

We present an algorithm that uses time roughly linear in the number of interesting itemsets and at worst n′ + ⌈l/2⌉ itemvectors of space, where n′ ≤ n is the number of interesting 1-itemsets and l is the size of the largest interesting itemset. This worst case scenario is only reached with extremely low support, and most practical situations require only a small fraction of n′. Based on these facts and the geometric inspiration provided by the itemvectors, we call it Geometrically inspired Linear Itemset Mining In the Transpose, or GLIMIT.

It is widely recognised that FP-Growth type algorithms are the fastest known algorithms. We show experimentally that GLIMIT outperforms FP-Growth [5] when the support is above a small threshold.

3 It is 'at least' because some itemsets are not frequent, but we only know this once we have calculated their itemvectors.

GLIMIT is more than “just another ARM algorithm” because it departs significantly from previous algorithms. It is a new, and fast, class of algorithm. Furthermore, it opens possibilities for useful preprocessing techniques based on our itemvector framework, as well as new geometrically inspired interestingness measures.

We make the following contributions:

• We show interesting consequences of viewing transaction data as itemvectors in transaction-space. We develop a theoretical framework for operating on itemvectors. This abstraction allows a new class of algorithm to be developed, gives great flexibility in the measures used, and opens up the potential for useful transformations on the data that were previously impossible.

• We present GLIMIT, a new class of algorithm that uses our framework to mine itemsets in one pass without candidate generation. It uses linear space and time linear in the number of interesting itemsets. It significantly departs from existing algorithms. Experiments show it beats FP-Growth above small support thresholds.

In Section 2 we put our framework and GLIMIT in the context of previous work. Section 3 presents our itemvector framework. Section 4 gives the two data structures that can be used by GLIMIT. In Section 5 we first give the main facts exploited by GLIMIT and follow up with a comprehensive example. We prove the space complexity and give the pseudo-code. Section 6 contains our experiments. We conclude in Section 7.

2 Previous Work

Many itemset mining algorithms have been proposed since association rules were introduced [2]. Recent advances can be found in [3] and [4]. Most algorithms can be broadly classified into two groups, the item enumeration (such as [2, 5, 9]) and the row enumeration (such as [7, 12]) techniques. Broadly speaking, item enumeration algorithms are most effective for datasets where |T| >> |I|, while row enumeration algorithms are effective for datasets where |T| << |I|, such as for microarray data [7].

Item enumeration algorithms mine subsets of an itemset I′ before mining I′. Only those itemsets for which all subsets are frequent are generated - making use of the anti-monotonic property of support.

Apriori-like algorithms [2] do this in a breadth first manner and use a candidate generation step. They use multiple passes, at most equal to the length of the longest frequent itemset. Our algorithm does not perform candidate generation, and generates association rules in a depth first fashion using a single pass over the transposed dataset.

FP-Growth type algorithms [5, 9] generate a compressed summary of the dataset using two passes in a highly cross referenced tree, the FP-tree, before mining itemsets by traversing the tree. Like our algorithm they do not perform candidate generation, and they mine the itemsets in a depth first manner while still mining all subsets of an itemset I′ before mining I′. Reading from the FP-tree is very fast, but the downside is that the FP-tree can become very large and is expensive to generate, so this investment does not always pay off. Our algorithm uses only as much space as is required.

Row enumeration techniques effectively intersect transactions and generate supersets of I′ before mining I′. Although it is much more difficult for these algorithms to make use of the anti-monotonic property for pruning, they exploit the fact that searching the row space in data with |T| << |I| becomes cheaper than searching the itemset-space. GLIMIT is similar to row enumeration algorithms since both search using the transpose of the dataset. However, where row enumeration intersects transactions (rows), we effectively intersect itemvectors (columns). But this similarity is tenuous at best. Furthermore, existing algorithms use the transpose for counting convenience rather than for any insight into the data, as we do in our itemvector framework. Since GLIMIT searches through the itemset space, it is classified as an item enumeration technique and is suited to the same types of data. However, it scans the original data column-wise (by scanning the transpose row-wise), while all other item enumeration techniques scan it row-wise. The transpose has never, to our best knowledge, been used in an item enumeration algorithm. In summary, we think it is about as similar to other item enumeration techniques as FP-Growth is to Apriori.

Efforts to create a framework for support exist. Steinbach et al. [11] present one such generalisation, but their goal is to extend support to cover continuous data. This is very different to transforming the original (non-continuous) data into a real vector-space (which is one of our motivations). Their work is geared toward existing item enumeration algorithms and so their “pattern evaluation vector” summarises transactions (that is, rows). Our framework operates on columns of the original data matrix. Furthermore, rather than generalising the support measure so as to cover more types of datasets, we generalise the operations on itemvectors and the transformations on the same dataset that can be used to enable a wide range of measures, not just support.

To our best knowledge, Ratio Rules are the closest attempt at combining SVD (or similar techniques such as Principal Component Analysis) and rule mining. Korn et al. [6] consider transaction data where items have continuous values associated with them, such as price. A transaction is considered a point in the space spanned by the items. By performing SVD on such datasets, they observe that the axes (orthogonal basis vectors) produced define ratios between single items. We consider items (and itemsets) in transaction space (not the other way around), so when we talk of performing SVD, the new axes are linear combinations of transactions - not items. Hence I is unchanged. Secondly, we talk about mining itemsets, not just ratios between single items. Finally, SVD is just one possible instantiation of g(·).

By considering items as vectors in transaction space, we can interpret itemsets geometrically, which we do not believe has been considered previously. As well as inspiring our algorithm, this geometric view has the potential to lead to very useful preprocessing techniques, such as dimensionality reduction of the transaction space. Since GLIMIT uses only this framework, it will enable us to use such techniques - which are impossible using existing FIM algorithms.

3 Theoretical Itemvector Framework

In Section 1 we considered itemvectors as sets of transactions. We used the intersection operator to create itemvectors corresponding to itemsets. We then used the cardinality function on itemvectors to evaluate the support of the itemsets. That is, we used three main functions. These functions can clearly be abstracted, and we do so with g(·), ◦ and f(·) respectively. We also include an additional function F(·) for more complicated measures. But why do this? The short answer is that by viewing each itemset as a vector in transaction space, we can think of a few different instances of these functions that have merit. Before going into this we will formally define our functions. First, recall that we use xI′ to denote the itemvector for itemset I′ ⊂ I, and for simplicity we abbreviate single itemvectors x{i} to xi. Recall also that xI′ is the set of transactions that contain the itemset I′ ⊂ I. Call X the space spanned by all possible xI′.

Definition 1 g : X → Y is a transformation on the original itemvector to a different representation yI′ = g(xI′) in a new space Y.

Even though g(·) is a transformation, its output still 'represents' the itemvector. To avoid too many terms, we also refer to yI′ as an itemvector.

Definition 2 ◦ is an operator on the transformed itemvectors so that yI′∪I″ = yI′ ◦ yI″ = yI″ ◦ yI′.

That is, ◦ is a commutative operator for combining itemvectors to create itemvectors representing larger itemsets. We do not require that yI′ = yI′ ◦ yI′.4

Definition 3 f : Y → R is a measure on itemsets, evaluated on transformed itemvectors. We write mI′ = f(yI′).

Suppose we have a measure of interestingness of an itemset that depends only on that itemset. We can represent this as follows, where I′ = {i1, ..., iq}:

interestingness(I′) = f(g(xi1) ◦ g(xi2) ◦ ... ◦ g(xiq))    (1)

So the challenge is, given an interestingness measure, find suitable and useful g, ◦ and f so that the above holds. For support, we know we can use ◦ = ∩, f = |·| and g as the identity function.
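In code, the four abstract functions might be captured by a small interface; the following Java sketch (our own shape, not the report's implementation) shows the FIM instantiation where g is the identity, ◦ is intersection, f is cardinality and F is trivial:

```java
import java.util.*;

// One possible shape for the abstraction: V is the (transformed) itemvector type.
interface ItemsetMeasure<V> {
    V g(Set<String> tids);        // transformation on the raw tid-set
    V combine(V a, V b);          // the ◦ operator
    double f(V y);                // measure on a single itemvector
    double F(double... m);        // measure composed from subset measures
}

// Frequent itemset mining: g and F are identities, ◦ = ∩, f = |·|.
class SupportMeasure implements ItemsetMeasure<Set<String>> {
    public Set<String> g(Set<String> tids) { return tids; }
    public Set<String> combine(Set<String> a, Set<String> b) {
        Set<String> r = new HashSet<>(a);
        r.retainAll(b);           // set intersection
        return r;
    }
    public double f(Set<String> y) { return y.size(); }
    public double F(double... m) { return m[0]; }   // trivial F: k = 1
}
```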

We now return to our motivation. First assume that g(·) trivially maps xI′ to a binary vector. Using x1 = {t1, t2} and x5 = {t1, t3} from Figure 1(a) we have y1 = g(x1) = 110 and y5 = g(x5) = 101. It should be clear that we can use bitwise AND as ◦ and f = sum(), the number of set bits. But notice that sum(y1 AND y2) = sum(y1 .∗ y2) = y1 · y2, the dot product (.∗ is the element-wise product5). That is, the dot product of two itemvectors is the support of the 2-itemset. What makes this interesting is that it holds for any rotation about the origin. Suppose we have an arbitrary 3 × 3 matrix R defining a rotation about the origin (so R is orthogonal). This means we can define g(x) = Rx^T because the dot product is preserved by R (and hence by g(·)). For example, σ({1, 5}) = y1 · y5 = (Rx1^T) · (Rx5^T). So we can perform an arbitrary rotation of our itemvectors before mining 2-itemsets. Of course this is much more expensive than bitwise AND, so why would we want to do this? Consider Singular Value Decomposition. If normalisation is skipped, it becomes a rotation about the origin, projecting the original data onto a new set of basis vectors pointing in the direction of greatest variance (incidentally, the covariance matrix calculated in SVD also defines the support of all 2-itemsets6). If we additionally use it for dimensionality reduction, it has the property that it roughly preserves the dot product. This means we should be able to use SVD for dimensionality and/or noise reduction prior to mining frequent 2-itemsets without introducing too much error. The drawback is that the dot product applies only to two vectors. That is, we cannot use it for larger itemsets because the 'generalised dot product' satisfies sum(Rx1^T .∗ Rx2^T .∗ ... .∗ Rxq^T) = sum(x1 .∗ x2 .∗ ... .∗ xq) only for q = 2. However, this does not mean that there are not other useful ◦, f(·), F(·) and interestingness measures that satisfy Equation 1 and use g(·) = SVD, some that perhaps will be motivated by this observation. Note that the transpose operation is crucial in applying dimensionality or noise reduction because it keeps the items intact. If we did not transpose the data, the item space would be reduced, and the results would be in terms of linear combinations of the original items, which cannot be interpreted. It also makes more sense to reduce noise in the transactions.

4 Equivalently, ◦ may have the restriction that I′ ∩ I″ = ∅.
5 (a .∗ b)[i] = a[i] ∗ b[i] for all i, where [] indexes the vectors.
6 That is, CM[i, j] = σ({i, j}).
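The rotation invariance is easy to verify numerically. This Java sketch (our own example, using a rotation about one axis) checks that the dot product of y1 = 110 and y5 = 101, and hence σ({1, 5}), is unchanged by an orthogonal R:

```java
public class RotationDemo {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
    static double[] multiply(double[][] m, double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < m.length; i++) r[i] = dot(m[i], v);
        return r;
    }
    public static void main(String[] args) {
        double[] y1 = {1, 1, 0};   // item 1 occurs in t1, t2
        double[] y5 = {1, 0, 1};   // item 5 occurs in t1, t3
        double c = Math.cos(0.7), s = Math.sin(0.7);
        double[][] R = {{c, -s, 0}, {s, c, 0}, {0, 0, 1}}; // orthogonal rotation
        System.out.println(dot(y1, y5));                            // 1.0 = σ({1,5})
        System.out.println(dot(multiply(R, y1), multiply(R, y5)));  // also 1.0
    }
}
```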

We could also choose g(·) as a set compression function or use approximate techniques, such as sketches, to give estimates rather than exact values of support or other measures. However, we think new measures inspired by our view of itemvectors in transaction space will be the most promising. For example, angles between itemvectors are linked to the correlation between itemsets. Of course, we can also translate existing measures into our framework.

To complete our framework we now define F(·) and give an example.

Definition 4 F : R^k → R is a measure on an itemset I′ that supports any composition of measures (provided by f(·)) on any number of subsets of I′. That is, MI′ = F(mI′₁, mI′₂, ..., mI′ₖ) where mI′ᵢ = f(yI′ᵢ) and I′₁, ..., I′ₖ are k arbitrary subsets of I′.

Note that k is not fixed. We can now support more complicated interestingness functions that require more than a simple measure on one itemset:

interestingness(I′) = F(mI′₁, mI′₂, ..., mI′ₖ)    (2)

where the mI′ᵢ are evaluated by f(·) as before. That is, MI′ = F(·) is evaluated over some k measures mI′ᵢ where all I′ᵢ ⊂ I′. We call F(·) trivial if k = 1 and MI′ = F(mI′). In this case the function of F(·) can be performed by f(·) alone, as was the case in the examples we considered before introducing F(·).

Example 1 The minPI of an itemset I′ = {1, ..., q} is minPI(I′) = minᵢ{σ(I′)/σ({i})}. This measure is anti-monotonic and gives a high value to itemsets where each member predicts the itemset with high probability. It is used in part for spatial co-location mining [10]. Using the data in Figure 1(a), minPI({1, 2, 3}) = min{1/2, 1/3, 1/1} = 1/3. In terms of our framework, g(·) is the identity function, ◦ = ∩ and f = |·|, so that mI′ = σ(I′) and MI′ = F(mI′, m1, ..., mq) = minᵢ{mI′/mᵢ}.
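A direct computation of Example 1 (a small sketch with our own variable names, using the supports read off Figure 1(a)):

```java
public class MinPIDemo {
    public static void main(String[] args) {
        // From Figure 1(a): σ({1,2,3}) = 1, σ({1}) = 2, σ({2}) = 3, σ({3}) = 1.
        double supItemset = 1;
        double[] supItems = {2, 3, 1};

        // minPI(I′) = min_i { σ(I′) / σ({i}) }
        double minPI = Double.MAX_VALUE;
        for (double s : supItems) minPI = Math.min(minPI, supItemset / s);
        System.out.println(minPI);  // 1/3 ≈ 0.333
    }
}
```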

Our algorithm uses only the framework described above for computations on itemvectors. It also provides the arguments for the operators and functions very efficiently, so it is flexible and fast. Because GLIMIT generates all subsets of an itemset I′ before it generates the itemset I′, an anti-monotonic property enables it to prune the search space. Therefore, to avoid exhaustive searches, our algorithm requires7 that the function F(·) be anti-monotonic8 in the underlying itemsets over which it operates (in conjunction with ◦, g(·) and f(·)9).

Definition 5 F(·) is anti-monotonic if I′ ⊂ I″ ⟹ MI′ ≥ MI″, where MI′ = F(·) is evaluated as per Definition 4.

In the spirit of this restriction, an itemset I′ is considered interesting if MI′ ≥ minMeasure, a threshold. We call such itemsets F-itemsets.

4 Data Structures

In this section we outline two data structures thatour algorithm (optionally) generates and uses.

We use the PrefixTree to efficiently store and build frequent itemsets. We represent an itemset I′ = {i1, ..., ik} as a sequence ⟨i1, ..., ik⟩ by choosing an ordering of the items (in this case i1 < ... < ik), and store the sequence in the tree. Since each node represents a sequence (ordered itemset) we can use the terms prefix node, itemset and sequence interchangeably. The prefix tree is built of PrefixNodes. Each PrefixNode is a tuple (parent, depth, m, M, item), where parent points to the parent of the node (so n.parent represents the prefix of n), depth is the depth of the node and therefore the length of the itemset at that node, m (M) is the measure of the itemset represented by the node as evaluated by f(·) (F(·)), and item is the last item in the sequence represented by the node. ε is the empty item so that {ε} ∪ α = α where α is an itemset. The sequence (in reverse) represented by any node can be recovered by traversing toward the root. To make the link with our itemvector framework clear, suppose the itemset represented at a PrefixNode p is I′ = {i1, i2, ..., ik}. Then p.m = mI′ = f(g(xi1) ◦ g(xi2) ◦ ... ◦ g(xik)) and p.M = F(·), where F is potentially a function of the m's of PrefixNodes corresponding to subsets of p.

7 If there are few items then this constraint is not needed.
8 Actually, prefix anti-monotonic [8] is necessary, which is a weaker constraint. With some modification, weakly anti-monotonic F(·) can also be supported.
9 Note that f(·) does not have to be anti-monotonic.
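As a Java sketch (field types and the helper method are our own guesses, consistent with the tuple description above), a PrefixNode might look like this:

```java
import java.util.LinkedList;
import java.util.List;

// A node in the PrefixTree. It stores only a parent pointer (no child list),
// so prefixes are never duplicated and lookup is left to the SequenceMap.
class PrefixNode {
    final PrefixNode parent; // prefix of this node's sequence (null at the root)
    final int depth;         // length of the itemset represented by this node
    final int item;          // last item in the sequence
    double m;                // f(·) measure of the itemset
    double M;                // F(·) measure of the itemset

    PrefixNode(PrefixNode parent, int item, double m, double M) {
        this.parent = parent;
        this.depth = (parent == null) ? 0 : parent.depth + 1;
        this.item = item;
        this.m = m;
        this.M = M;
    }

    // Recover the itemset (stored in reverse) by traversing toward the root.
    List<Integer> itemset() {
        LinkedList<Integer> items = new LinkedList<>();
        for (PrefixNode n = this; n.parent != null; n = n.parent)
            items.addFirst(n.item);
        return items;
    }
}
```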

The tree has the property that if s is in the PrefixTree, then so are all subsequences s′ ⊂ s, by the anti-monotonic property of F(·). Hence we save a lot of space because the tree never duplicates prefixes. In fact, it contains exactly one node per F-itemset.

The PrefixTree is designed for efficient storage, and not for lookup purposes. This is why there are no references to child nodes. To facilitate the generation of association rules, a set of all nodes that have no children is maintained. We call this the fringe. An example of a PrefixTree is shown in Figure 2(a).

The fringe is useful because (1) it contains all maximal itemsets10, and (2) it can be used to efficiently generate all association rules:

Lemma 1 Let s = ⟨i1, ..., ik⟩ = αβγ be the sequence corresponding to a prefix node n, where α, β ≠ ∅. All association rules can be generated by creating all rules α ⇒ β and β ⇒ α for each fringe node.

Proof: (Sketch) We do not generate all possible association rules that can be generated from itemset {i1, ..., ik} by considering only n. Specifically, we miss (1) any rules α′ ⇒ β′ or β′ ⇒ α′ where α′ is not a prefix of s, and (2) any such rules where there is a gap between α′ and β′. However, by the construction of the tree there exists another node n′ corresponding to the sequence s′ = ⟨α′, β′⟩ (since s′ ⊂ s). If n′ is not in the fringe, then by definition s′ ⊂ s″ where s″ = ⟨α′, β′, γ′⟩ for some γ′ ≠ ∅ and n″ (the node for s″) is in the fringe. Hence α′ ⇒ β′ and β′ ⇒ α′ will be generated from node(s) other than n. Finally, the longest sequences are guaranteed to be in the fringe, hence all rules will be generated by induction.

10 A maximal itemset is an F-itemset for which no superset is interesting (has measure above the threshold minMeasure).

We use a SequenceMap to index the nodes in the PrefixTree so we can retrieve them for the following purposes: (1) to check all subsequences of a potential itemset for pruning (we automatically check two subsets without using the sequence map11), (2) to find the measures (m, M) for β in Lemma 1 when generating association rules12, and (3) to find the m's when we evaluate a nontrivial F(·).

First we introduce an abstract type - Sequence - which represents a sequence of items. Equality testing and hash-code calculation are done using the most efficient iteration direction. A PrefixNode can be considered a Sequence13, and reverse iteration is the most efficient. By performing equality comparisons in reverse sequence order, we can map any Sequence to a PrefixNode by using a hash-table that stores the PrefixNode both as the key and the value. Hence we can search using any Sequence implementation (including a list of items), and the hash-table maps Sequences to the PrefixNodes that represent them. The space required is only that of the hash-table's bucket array, so it is very space efficient.
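For instance, a reverse-order hash over a PrefixNode (a sketch assuming the hypothetical PrefixNode class above, not the report's code) needs no materialised item list, because the parent chain is the sequence in reverse:

```java
// A sketch of reverse-order hashing over a PrefixNode's parent chain.
final class SequenceHashing {
    // Hash the sequence a PrefixNode represents by walking parent pointers,
    // i.e. iterating the items in reverse (last item first). Any other
    // Sequence implementation must hash in the same direction so that equal
    // sequences land in the same bucket.
    static int sequenceHash(PrefixNode n) {
        int h = 1;
        for (; n != null && n.parent != null; n = n.parent)
            h = 31 * h + n.item;
        return h;
    }
}
```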

Finally, we can avoid the use of both the PrefixTree and SequenceMap without alteration to GLIMIT if (1) we output frequent itemsets when they are mined, (2) we do not care if not all subsets are checked before we calculate a new itemvector, and (3) we have a trivial F(·).

11 Fact 3 in Section 5.
12 Since β is not a prefix of s, its measures are not stored along the path of PrefixNodes corresponding to s. So to get its measures we need to find its PrefixNode - by looking up β in the SequenceMap.
13 It can be considered as a reversed, singly linked list.

5 Algorithm

In this section we outline the main principles we use in GLIMIT and follow up with an illustrative example. We prove space complexity bounds before giving the algorithm in pseudo-code.

We exploit the following facts in order to use minimum space while avoiding any re-computations:

1. We can construct all itemvectors yI′ by incrementally applying the rule yI′∪{i} = yI′ ◦ yi. That is, we only ever ◦ itemvectors corresponding to single items onto the end of an itemvector. This means that given a PrefixNode p that is not the root, we only ever need to keep a single itemvector in memory for any child of p at a time. If p is the root, we will need to keep its children's itemvectors in memory (the yi).

2. It also means we use the least space if we perform a depth first search. Then for any depth (p.depth), we will at most have only one itemvector in memory at a time.

3. We only ever check a new sequence by 'joining' siblings. That is, we check ⟨ia, ib, ..., ii, ij, ik⟩ only if siblings ⟨ia, ib, ..., ii, ij⟩ and ⟨ia, ib, ..., ii, ik⟩, k > j, are in the prefix tree. Hence, we only try to expand nodes which have one or more siblings below them.

4. Suppose the items are I = {i1, i2, ..., in}. If we have read in k itemvectors yij, j ∈ {n, n−1, ..., n−k−1}, then we can have completed all nodes corresponding to all subsets of {in−k−1, ..., in}. So if we use the depth first procedure, when a PrefixNode p is created all PrefixNodes corresponding to subsets of p's itemset will already have been generated. As well as being most space efficient, this is required to evaluate nontrivial F(·) and for more thorough (than Fact 3) pruning using the anti-monotonic requirement. This is what we call the 'bottom up' order of building the PrefixTree.

5. When a PrefixNode p with p.depth > 1 (or whose p.item is the top-most item) cannot have any children (because it has no siblings, by Fact 3), its itemvector will no longer be needed.

6. When a topmost sibling (the topmost child of a node) is created (or we find its itemset is not frequent and hence don't need to create it), the itemvector corresponding to its parent p can be deleted. That is, we have just created the topmost (last) immediate child of p. This applies only when p.depth > 1 or when p.item is the top-most item14. This is because we only ever need y{ia,ib,...,ii,ij} until we generate y{ia,ib,...,ii,ij,ik} = y{ia,ib,...,ii,ij} ◦ yik, where ia < ib < ... < ii < ij < ik (eg: b − a and a may both be greater than 1, etc.) and {ia, ib, ..., ii, iq} : j < q < k is not frequent. Indeed, we can write the result of y{ia,ib,...,ij} ◦ yik directly into the itemvector holding y{ia,ib,...,ij}. Conversely, while there is still a child to create (or test) we cannot delete p's corresponding itemvector.

7. When we create a PrefixNode p on the topmost branch (eg: when all itemsets are frequent, p will correspond to ⟨i1, i2, ..., ik⟩ for any k ≥ 1), we can delete the itemvector corresponding to the single item p.item (eg: ik). Fact 6 will always apply in this case too (eg: we can also delete ik−1 if k > 1). The reason behind this is that by using the bottom up method (and the fact our itemsets are ordered), we know that if we have y{i1,...,ik} we can only ever ◦ a yij with j > k onto the end.

14 By Fact 1 we cannot apply this to nodes with p.depth = 1 (unless it is the topmost node) as they correspond to single items and are still needed.

We now present an example of our algorithm to illustrate some of these facts.

Suppose we have the items {1, 2, 3, 4} and the minMeasure (we will use minSup) threshold is such that all itemsets are interesting (frequent). Figure 2 shows the target prefix tree and the steps in mining it. This example serves to show how we manage the memory while avoiding any re-computations. For now, consider the frontier list in the figure as a list of PrefixNodes that have not been completed. It should be clear that we use a bottom up and depth first procedure to mine the itemsets, as motivated by Facts 2 and 4. We complete all subtrees before moving to the next item. In (d) we calculate y{3,4} = y3 ◦ y4 as per Fact 1. Note we are also making use of Fact 3 - {3} and {4} are siblings. Once we have created the node for {3, 4} in (d), we can delete y{3,4} by Fact 5. It has no possible children because of the ordering of the sequences. The same holds for {2, 4} in (f). In (g), the node for {2, 3} is the topmost sibling (child). Hence we can apply Fact 6 in (h). Note that by Fact 1 we calculate y{2,3,4} as y{2,3,4} = y{2,3} ◦ y4. Note also that because we need the itemvectors of the single items in memory, we have not been able to use Fact 7 yet. Similarly, Fact 6 is also applied in (l), (m), (o) and (p). However, note that in (m), (o) and (p) we also use Fact 7 to delete y2, y3, and y4. In (l) we deleted y1 for two reasons: Facts 6 and 7 (it is a special case in Fact 6). Finally, to better illustrate Fact 3, suppose {2, 4} is not frequent. This means that {2, 3} will have no siblings anymore, so we do not even need to consider {2, 3, 4} by Fact 3.

We know already that the time complexity is roughly linear in the number of frequent itemsets, because we avoid re-computations of itemvectors. So the question now is, what is the maximum number of itemvectors that we have in memory at any time? There are two main factors that influence this. First, we need to keep the itemvectors for individual items in memory until we have completed the node for the top-most item (Facts 1 and 7). Hence, the 'higher' up in the tree we are, the more this contributes. Secondly, we need to keep itemvectors in memory until

Figure 2. Building the PrefixTree (mining itemsets) example. Nodes are labeled with their item value. Shaded nodes have their corresponding itemvector in memory. Dotted nodes have not been mined yet. Solid lines are the parts of the tree that have been created. Panels: (a) complete PrefixTree; (b)-(p) Steps 1-15.

Figure 3. Maximum number of itemvectors. Two cases: n even or odd.

we complete their respective nodes. That is, until we check all their children (Fact 6) or find they can't have children (Fact 5). Now, the further up we are in the tree, or any subtree for that matter, without completing the node, the longer the sequence of incomplete nodes is, and hence the more itemvectors we need to keep. Considering both these factors leads to the situation in Figure 3 - that is, we are up to the top item and the topmost path from that item, so that no node along the path is completed. If we have n items, the worst case itemvector usage is just the number of coloured nodes in Figure 3. There are n itemvectors yi : i ∈ {1, ..., n} corresponding to the single items (children of the root). There are a further ⌈n/2⌉ itemvectors along the path from node {1} (inclusive) to the last coloured node (these are the uncompleted nodes)15. Therefore the total space required is just n + ⌈n/2⌉ − 1, where the −1 is so that we do not double count the itemvector for {1}.

This is for the worst case when all itemsets are frequent. Clearly, a closer bound is obtained if we let n′ ≤ n be the number of frequent items. Hence, we need space linear in the number of frequent items. The multiplicative constant (1.5) is low, and in practice (with non-pathological support thresholds) we use far fewer than n itemvectors. If we know that the longest frequent itemset has size l, then we can additionally bound the space by n′ + ⌈l/2⌉ − 1. Furthermore, since the frontier contains all uncompleted nodes, we know from the above that its upper bound is ⌈l/2⌉. We have sketched the proof of:

15 When n is even, the last node is {1, 3, 5, ..., n−3, n−1} and when n is odd it is {1, 3, 5, ..., n−2, n}. The cardinality of both these sets, equal to the number of nodes along the path, is ⌈n/2⌉. Note that in the even case, the next step to that shown will use the same memory (the itemvector for node {1, 3, 5, ..., n−3, n−1} is no longer needed once we create {1, 3, 5, ..., n−3, n−1, n} by Fact 6, and we write y{1,3,5,...,n−3,n−1,n} directly into y{1,3,5,...,n−3,n−1} as we compute it, so both need never be in memory at the same time).

Lemma 2 Let n be the number of items, and n′ ≤ n be the number of frequent items. Let l ≤ n′ be the size of the largest frequent itemset. GLIMIT uses at most n′ + ⌈l/2⌉ − 1 itemvectors of space. Furthermore, |frontier| ≤ ⌈l/2⌉.

As an aside, note that we could perform the transpose operation in memory before mining while still remaining within the worst case space complexity. However, on average and for practical levels of minMeasure (eg: minSup), this would require more memory.

The algorithm is a depth first traversal through the PrefixTree. Any search can be implemented either recursively or using the frontier method, whereby a list (priority queue) of states (each containing a node that has yet to be completely expanded) is maintained. The general construct is to retrieve the first state, evaluate it for the search criteria, expand it (create some child nodes), and add states corresponding to the child nodes to the frontier. Using different criteria and frontier orderings leads to different search techniques. Our frontier contains any nodes that have not yet been completed, wrapped in State objects. Algorithm 1 describes16 the additional types we use (such as State) and shows the initialisation and the main loop - which calls step(·). It also describes the check(·) and calculateF(·) methods, used by step(·).

6 Experiments

We evaluated our algorithm on two publicly available datasets from the FIMI repository17 - T10I4D100K and T40I10D100K. These datasets have 100,000 transactions and a realistic skewed histogram of items. They have 870 and 942 items respectively. To apply GLIMIT we first transpose the dataset as a preprocessing step18.

We compared GLIMIT to a publicly available implementation of FP-Growth and Apriori. We used the algorithms from ARtool19 as it is written in Java, like our implementation, and it has been available for some time.

16 The pseudo-code in our algorithms is java-like and we assume a garbage collector, which simplifies it. Indentation defines blocks and we ignore type casts.

17 http://fimi.cs.helsinki.fi/data/
18 This is cheap, especially for sparse matrices - precisely what the datasets in question typically are. Our data was transposed in 8 and 15 seconds respectively using a naive Java implementation, and without using sparse techniques.

19 http://www.cs.umb.edu/~laur/ARtool/. It was not used via the supplied GUI. The underlying classes were invoked directly.

Algorithm 1 Data-types, initialisation, main loop and methods.
Input: (1) Dataset (inputFile) in transpose format (may have g(·) already applied). (2) f(·), ◦, F(·) and minMeasure.
Output: Completed PrefixTree (prefixTree) and SequenceMap (map) containing all F-itemsets.

Pair : (Itemvector yi, Item item)
    /* yi is the itemvector for item and corresponds to yi in Fact 1. We keep reusing them through buffer. */
State : (PrefixNode node, Itemvector yI′, Iterator itemvectors, boolean top, Pair newPair, List buffer)
    /* yI′ is the itemvector corresponding to node (and yI′ in Fact 1). buffer is used to create the Iterators (such as itemvectors) for the States created to hold the children of node. We need buffer to make use of Fact 3. itemvectors provides the yi to join with yI′, and newPair helps us do this. */

/* Initialisation: initialise prefixTree with its root. Initialise map and frontier as empty. Create the initial state: */
Iterator itemvectors = new AnnotatedItemvectorIterator(inputFile); /* Iterator over Pair objects. Reads input one row at a time and annotates the itemvector with the item it corresponds to. Could also apply g(·). */
frontier.add(new State(prefixTree.getRoot(), null, itemvectors, false, null, new LinkedList()));

/* Main loop */
while (!frontier.isEmpty()) step(frontier.getFirst());

/* Perform one expansion. state.node is the parent of the new PrefixNode (newNode) that we create if newNode.M ≥ minMeasure. localTop is true iff we are processing the top sibling of any subtree. nextTop becomes newNode.top and is set so that top is true only for a node that is along the topmost branch of the prefix tree. */
void step(State state)
    if (state.newPair ≠ null) /* see end of method ♣ */
        state.buffer.add(state.newPair); state.newPair = null; /* so we don't add it again */
    Pair p = state.itemvectors.next(); boolean localTop = !state.itemvectors.hasNext();
    if (localTop) /* Remove state from the frontier (and hence delete state.yI′) as we are creating... */
        frontier.removeFirst(); /* ...the top child of node in this step. Fact 6 */
    Itemvector yI′∪{i} = null; double m, M; boolean nextTop; /* top in the next State we create. */
    if (state.node.isRoot()) /* we are dealing with itemsets of length 1 (so I′ = {ε}) */
        m = f(p.yi); M = calculateF(null, p.item, m); yI′∪{i} = p.yi;
        state.top = localTop; nextTop = localTop; /* initialise tops. */
    else
        nextTop = localTop && state.top;
        if (check(state.node, p.item)) /* make use of the anti-monotonic pruning property */
            if (localTop && (state.node.getDepth() > 1 || state.top)) /* Fact 6 or 7 */
                /* No longer need state.yI′ as this is the last child we can create under
                 * state.node (and it is not a single item other than perhaps the topmost). */
                yI′∪{i} = state.yI′; yI′∪{i} ◦= p.yi; /* can write the result directly into yI′∪{i} */
            else
                yI′∪{i} = state.yI′ ◦ p.yi; /* need additional memory for the child (yI′∪{i}). */
            m = f(yI′∪{i}); M = calculateF(state.node, p.item, m);
        else
            m = M = −∞; /* don't need to calculate; we know M < minMeasure */
    if (M ≥ minMeasure) /* Found an interesting itemset - create newNode for it. */
        PrefixNode newNode = prefixTree.createChildUnder(state.node);
        newNode.item = p.item; newNode.m = m; newNode.M = M;
        sequenceMap.put(newNode);
        if (state.buffer.size() > 0) /* there is potential to expand newNode. Fact 5 */
            State newState = new State(newNode, yI′∪{i}, state.buffer.iterator(), nextTop, null, new LinkedList());
            /* add to the front of the frontier (ie: in front of state if it's still present) for depth first search. Fact 2. */
            frontier.addFront(newState);
        state.newPair = p; /* if state.node is not complete, we will add p to state.buffer after newState has been completed. See ♣ */

/* Let α be the itemset corresponding to node. α ∪ {item} is the itemset represented by a child p of node so that p.item = item; m would be p.m. This method calculates p.M by using map to look up the PrefixNodes corresponding to the k required subsets of α ∪ {item} to get their m values, m1, ..., mk. It then returns F(m1, ..., mk). */
double calculateF(PrefixNode node, Item item, double m) /* details depend on F(·) */

/* Check whether the itemset α ∪ {item} could be interesting by exploiting the anti-monotonic property of F(·): use map to check whether subsets of α ∪ {item} (except α and (α − node.item) ∪ {item}, by Fact 3) exist. */
boolean check(PrefixNode node, Item item) /* details omitted */

Figure 4. Results. (a) Runtime and frequent itemsets, T10I4D100K; inset shows detail for low support. (b) Runtime and frequent itemsets, T40I10D100K.

There are some implementation differences that at worst give ARtool a small constant factor advantage20. This is fine since in this section we really only want to show that GLIMIT is quite fast and efficient when compared to existing algorithms on the traditional FIM problem. Our contribution is the itemvector framework that allows operations that previously could not be considered, and a flexible and new class of algorithm that uses this framework to efficiently mine data cast into different and useful spaces. The fact that it is also fast when applied to traditional FIM is secondary. To represent itemvectors for traditional FIM, we used bit-vectors21 so that each bit is set if the corresponding transaction contains the item(set). Therefore g creates the bit-vector, ◦ = AND, f(·) = sum(·) and F(m) = m.

20 We use ASCII input format, so we need to parse the transaction numbers. ARtool uses a binary format, where items are integers encoded in 4 bytes. We use counting algorithms via an abstract framework and generally make use of many layers of inheritance and object level operations for flexibility, while the ARtool algorithms are rigid and use low level operations.

21 We used the Colt (http://dsd.lbl.gov/~hoschek/colt/) BitVector implementation.
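For this FIM instantiation, the framework operations reduce to bit operations. Below is a minimal sketch with java.util.BitSet (the authors used the Colt BitVector; the names and toy data are ours):

```java
import java.util.BitSet;

public class BitVectorSupport {
    public static void main(String[] args) {
        int numTransactions = 100_000;
        // g: set bit t iff transaction t contains the item (toy data here).
        BitSet y1 = new BitSet(numTransactions);
        BitSet y5 = new BitSet(numTransactions);
        y1.set(0); y1.set(1);   // item 1 in t1, t2
        y5.set(0); y5.set(2);   // item 5 in t1, t3

        // ◦ = AND (done in place, as GLIMIT does when the old itemvector
        // is no longer needed).
        BitSet y15 = (BitSet) y1.clone();
        y15.and(y5);

        // f = number of set bits; F(m) = m.
        System.out.println("support of {1,5} = " + y15.cardinality());  // 1
    }
}
```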

Figure 4(a) shows the runtime22 of FP-Growth, GLIMIT and Apriori23 on T10I4D100K, as well as the number of frequent itemsets. The analogous graph for T40I10D100K is shown in Figure 4(b) - we did not run Apriori as it is too slow. These graphs clearly show that when the support threshold is below a small value (about 0.29% and 1.2% for the respective datasets), FP-Growth is superior to GLIMIT. However, above this threshold GLIMIT outperforms FP-Growth significantly. Figure 5(a) shows this more explicitly by presenting the runtime ratios for T40I10D100K. FP-Growth takes at worst 19 times as long as GLIMIT. We think it is clear that GLIMIT is superior above the threshold. Furthermore, this threshold is very small, and practical applications usually mine with much larger thresholds than this.

GLIMIT scales roughly linearly in the number of frequent itemsets. Figure 5(b) demonstrates this experimentally by showing the average time to mine a single frequent itemset. The value for GLIMIT is quite stable, rising slowly toward the end (as there we still need to check itemsets, but very few of these turn out to be frequent). FP-Growth, on the other hand, clearly does not scale linearly. The reason behind these differences is that FP-Growth first builds an FP-tree. This effectively stores the entire dataset (minus infrequent single items) in memory. The FP-tree is also highly cross-referenced so that searches are fast. The downside is that this takes significant time and a lot of space. This pays off extremely well when the support threshold is very low, as the frequent itemsets can be read from the tree very quickly. However, when minSup is larger, much of the time and space is wasted. GLIMIT uses time and space as needed, so it does not waste as many resources, making it fast. The downside is that the operations on bit-vectors (in our experiments, of length 100,000) can be time consuming when compared to the search on the FP-tree, which is why GLIMIT cannot keep up when minSup is very small. Figure 5(c) shows the maximum and average24 number of itemvectors our algorithm uses as a percentage of the number of items. At worst, this can be interpreted as the percentage of the dataset in memory. Although the worst case space is 1.5 times the number of items, n (Lemma 2), the figure clearly shows this is never reached in our experiments. Our maximum was approximately 0.82n.

22 Pentium 4, 2.4GHz with 1GB RAM running Windows XP Pro.
23 Apriori was not run for extremely low support as it takes longer than 30 minutes for minSup ≤ 0.1%.
24 Over the calls to step(·).

Figure 5. Results. (a) Runtime ratios, T10I4D100K. (b) Average time taken per frequent itemset, shown on two scales, T10I4D100K. (c) Number of itemvectors needed and maximum frontier size, T10I4D100K.

By the time it gets close to 1.5n, minSup would be so small that the runtime would be unfeasibly large anyhow. Furthermore, the space required drops quite quickly as minSup is increased (and hence the number of frequent items decreases). Figure 5(c) also shows that the maximum frontier size is very small.

Finally, we reiterate that we can avoid using the prefix tree and sequence map, so the only space required is that of the itemvectors and the frontier. That is, the space required is truly linear.

7 Conclusion and Future Work

We showed interesting consequences of viewing transaction data as itemvectors in transaction-space, and developed a framework for operating on itemvectors. This abstraction gives great flexibility in the measures used and opens up the potential for useful transformations on the data. Our future work will focus on finding useful geometric measures and transformations for itemset mining. One problem is to find a way to use SVD prior to mining for itemsets larger than 2. We also presented GLIMIT, a novel algorithm that uses our framework and significantly departs from existing algorithms. GLIMIT mines itemsets in one pass without candidate generation, in linear space and time linear in the number of interesting itemsets. Experiments showed that it beats FP-Growth above small support thresholds. Most importantly, it allows the use of transformations on the data that were previously impossible.

References

[1] D. Achlioptas. Database-friendly random projections. In Symposium on Principles of Database Systems, 2001.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, 1994.
[3] Workshop on frequent itemset mining implementations 2003. http://fimi.cs.helsinki.fi/fimi03.
[4] Workshop on frequent itemset mining implementations 2004. http://fimi.cs.helsinki.fi/fimi04.
[5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In 2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, May 2000.
[6] F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Quantifiable data mining using ratio rules. VLDB Journal: Very Large Data Bases, 8(3–4):254–266, 2000.
[7] F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. Carpenter: Finding closed patterns in long biological datasets. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Morgan Kaufmann, 2003.
[8] J. Pei, J. Han, and L. Lakshmanan. Pushing convertible constraints in frequent itemset mining. Data Mining and Knowledge Discovery: An International Journal, 8:227–252, May 2004.
[9] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21–30, 2000.
[10] S. Shekhar and Y. Huang. Discovering spatial co-location patterns: A summary of results. Lecture Notes in Computer Science, 2121:236+, 2001.
[11] M. Steinbach, P.-N. Tan, H. Xiong, and V. Kumar. Generalizing the notion of support. In The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), 2004.
[12] J. Wang and G. Karypis. Harmony: Efficiently mining the best rules for classification. In SIAM International Conference on Data Mining, pages 205–215, 2005.


School of Information Technologies, J12
The University of Sydney NSW 2006 AUSTRALIA

T +61 2 9351 4917  F +61 2 9351 3838

www.it.usyd.edu.au

ISBN 1 86487 854 1