
Parallel Hierarchical Clustering on Market Basket Data

Baoying Wang Waynesburg University

Waynesburg, PA 15370

[email protected]

Qin Ding

East Carolina University

Greenville, NC 27858

[email protected]

Imad Rahal

College of Saint Benedict

& Saint John's University

Collegeville, MN 56321

[email protected]

Abstract

Data clustering has been proven to be a promising

data mining technique. Recently, there have been

many attempts at clustering market-basket data. In

this paper, we propose a parallelized hierarchical

clustering approach on market-basket data (PH-

Clustering), which is implemented using MPI. Based

on the analysis of the major clustering steps, we

adopt a partially local and partially global approach to decrease computation time while keeping communication time at a minimum. Load balancing is always considered, especially at the data partitioning

stage. Our experimental results demonstrate that PH-

Clustering speeds up sequential clustering by a large margin. The larger the data size, the more

significant the speedup when the number of

processors is large. Our results also show that the

number of items has more impact on the performance

of PH-Clustering than the number of transactions.

1 Introduction

Clustering techniques partition a data set into

groups such that similar items fall into the same

group [8]. Data clustering is a common data mining

technique. Market-basket data analysis has been well

studied in the literature for finding associations

among items in large groups of transactions in order

to discover hidden business trends [3][2], which

could aid business in decision-making processes.

Market-basket data is usually organized horizontally

in the form of transactions each containing a list of

items bought by a customer during a single visit to a

store. Unlike traditional data, market-basket data are

known to be high-dimensional and sparse and to

contain attributes of categorical nature [14].

Recently, there have been many attempts at

clustering market-basket data. There are different

clustering approaches but the two major ones are

partitioning clustering and hierarchical clustering.

Partitioning clustering approaches require at least one

input parameter such as the minimum intra-cluster

similarity/affinity or the desired number of clusters.

Hierarchical clustering is more flexible than

partitioning clustering but has a higher time

complexity. In this paper, we propose a parallelized

hierarchical clustering approach on market-basket

data (PH-Clustering), which is implemented using

MPI (message passing interface). This work is an

extension of our previous work [22], where we

developed a weighted confidence affinity function

between clusters (or itemsets) to minimize the impact

of low support items and we showed that our method

was more accurate than other contemporary affinity

measures in the literature. This paper will focus on

parallelizing the clustering process to make our method

not only more accurate but also more efficient. Our

contributions are outlined below:

(1) Our approach is hierarchical and hence

nonparametric; i.e., there is no need for the

user to specify any input parameters.

(2) We use a weighted confidence affinity measure to calculate the similarity between

items/clusters to minimize the impact of low

support items.

(3) We employ vertical data structures by representing each item as a bit vector and

resort to logical operations (such as AND, OR,

etc.) in order to speed up the computation of

itemset supports as well as the merging

process.

(4) We implement the clustering method using parallel machines to overcome the high time

complexity of hierarchical clustering.

Our experimental results show that PH-Clustering

speeds up the sequential clustering dramatically with

the same clustering results. The speedup is significant

especially when the data set size (transactions and

items) is large. We also conducted experiments on

data sets with different numbers of transactions or different numbers of items and found that the number

of items has more impact on the performance of PH-

Clustering than the number of transactions.

The rest of the paper is organized as follows. An

overview of related work is given in section 2.

We present our parallelized hierarchical clustering


method (PH-Clustering) in section 3 and show our

experimental results in section 4. Finally, we

conclude the paper in section 5.

2 Literature Review

Related work in several areas is discussed in this section: clustering on market basket data, parallel clustering, and vertical data structures.

2.1 Clustering on Market Basket Data

Clustering of market basket data can be done

either on transactions [13][1][15] or on items [14][7].

Both approaches share many challenges, such as high

dimensionality, sparsity, categorical attributes, and

substantial noise; yet, they are different in several

ways. First, they have different objectives: transaction

clustering is usually aimed at marketing by

discovering groups of customers sharing similar

shopping profiles while item clustering is mainly

used for catalog design. Second, they use different

similarity measures: transaction clustering is based on

the content of each transaction while item clustering

is based on the frequency/support of each

item/itemset among all transactions.

Our approach is similar to the Hypergraph [7] and Hyperclique [14] approaches in that it focuses on

item clustering only. Hypergraph develops a

hypergraph by building hyperedges among items, each weighted by the average confidence of the corresponding association rules, and then partitions the

hypergraph in such a way that the data items in every

partition are highly related and the weight of the

hyperedges cut is minimized. Hyperclique [14]

partitions the itemset into hyperclique patterns which

are clusters based on the “all-confidence” measure

[9]. However, [14] and [7] are partitional clustering

methods. They both need an input parameter which is

the minimum confidence value.

2.2 Parallel Clustering

Parallel clustering is a high-performance data mining technique. Scalable parallel computers can be used to speed up the clustering process. Much work has been done on parallel implementations of data clustering algorithms [17][18][23][21]. The main

issues and challenges in parallel clustering include

data locality, load balancing, network

communication, and parallel I/O. Most existing

parallel clustering approaches have been developed

for traditional agglomerative clustering [19][20]. In

most cases, the number of processors required for

clustering is determined by the data size, which is not

flexible or adaptable for various parallel environments.

2.3 Vertical Data Structures

The concept of vertical partitioning for relations

has been well studied in data analysis fields.

Recently, a number of vertical models for association

rule mining have been proposed in the literature

[5][4][6][10][12][16]. The most basic vertical

structure is a bitmap [5], where every <transaction–

item> intersection is represented by a bit in an index

bitmap. Consequently, each item is represented by a

bit vector. The AND logical operation can then be

used to merge items and itemsets into larger itemset

patterns. The support of an itemset is calculated by

counting the number of 1-bits in the bit vector. It has

been demonstrated in the literature that vertical

approaches which utilize logical operations are more

efficient than traditional horizontal approaches

[5][10][12][16]. Moreover, vertical approaches can

be implemented easily in parallel environments in

order to speed up the data mining process further.
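For illustration, the sketch below (a minimal example rather than code from any of the cited systems) stores each item's bit vector as an array of 64-bit words, ANDs two vectors word by word, and counts the surviving 1-bits to obtain the support of the two-item itemset; the popcount builtin assumes a GCC/Clang-style compiler.

    #include <stdint.h>
    #include <stddef.h>

    /* Support of the itemset {Ii, Ij}: AND the two item bit vectors
       word by word and count the surviving 1-bits. nwords is the
       number of 64-bit words needed to cover all transactions. */
    unsigned itemset_support(const uint64_t *bi, const uint64_t *bj,
                             size_t nwords)
    {
        unsigned count = 0;
        for (size_t w = 0; w < nwords; w++)
            count += (unsigned)__builtin_popcountll(bi[w] & bj[w]);
        return count;
    }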

3 The PH-Clustering Method

This section presents PH-Clustering, our

parallelized hierarchical clustering method on

market-basket data using vertical data structures. We

assume that market-basket data is represented by a set

of transactions, denoted by D = {T1, T2, …, Tn},

where n is the total number of transactions. Each

transaction Ti consists of a subset of the item space

{I1, I2, …, Im}, where m is the total number of

available items. A sample transaction set is shown in

Figure 1. There are ten transactions and each

transaction has an id number.

TID Itemsets

T100 I2, I4

T110 I1, I2, I4, I5

T120 I2, I3, I5

T130 I4, I5, I6

T140 I1, I5

T150 I2, I6

T160 I1, I2, I6

T170 I2, I5, I6

T180 I4, I5

T190 I2, I6

Figure 1. An Example of Market Basket Data

3.1 Vertical Data Representation

We first transform the transaction set to an array

of vertical bit vectors. Each item Ik is associated with

a vector bk. The ith bit of bk is 1 if transaction Ti

contains item Ik; it is 0 otherwise. Figure 2 shows the

six vertical bit vectors for the six items from the data

in Figure 1. Each vector has a size of 10. For


instance, b1 has three 1-bits in the 2nd, 5th, and 7th positions because I1 occurs in transactions T110, T140, and T160.

b1 b2 b3 b4 b5 b6
0  1  0  1  0  0
1  1  0  1  1  0
0  1  1  0  1  0
0  0  0  1  1  1
1  0  0  0  1  0
0  1  0  0  0  1
1  1  0  0  0  1
0  1  0  0  1  1
0  0  0  1  1  0
0  1  0  0  0  1

Figure 2. Vertical Data Representation
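For illustration, the following sketch (a minimal construction with hypothetical names, assuming transactions are given as arrays of 0-based item ids) performs this horizontal-to-vertical transformation:

    #include <stdint.h>
    #include <string.h>

    /* Build one bit vector per item from a horizontal transaction set.
       bv[k] is the bit vector of item k; its t-th bit is 1 iff
       transaction t contains item k. Each bv[k] must point to
       (n_trans + 63) / 64 preallocated 64-bit words. */
    void build_bit_vectors(uint64_t **bv, int n_items, int n_trans,
                           int **trans, const int *trans_len)
    {
        size_t nwords = ((size_t)n_trans + 63) / 64;
        for (int k = 0; k < n_items; k++)
            memset(bv[k], 0, nwords * sizeof(uint64_t));
        for (int t = 0; t < n_trans; t++)
            for (int p = 0; p < trans_len[t]; p++) {
                int k = trans[t][p];  /* item id bought in transaction t */
                bv[k][t / 64] |= (uint64_t)1 << (t % 64);
            }
    }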

3.2 Affinity Function between Items

In our previous work [22], we developed an

efficient weighted confidence affinity function

between clusters (or itemsets) to minimize the impact

of low support items. In this section, we briefly

review the affinity function which is used in PH-

Clustering. The weighted affinity function is a

weighted summation of the two confidences as the

affinity measure between two items, i.e.

A(Ii, Ij) = wi * conf({Ii} → {Ij}) + wj * conf({Ij} → {Ii}),   (1)

where

wi = supp({Ii}) / (supp({Ii}) + supp({Ij}))   (2)

wj = supp({Ij}) / (supp({Ii}) + supp({Ij}))   (3)

By replacing wi and wj in equation (1) using

equations (2) and (3), replacing confidence variables

with their formulas respectively, and simplifying

equation (1), we get the following affinity measure

function:

A(Ii, Ij) = 2 * supp({Ii, Ij}) / (supp({Ii}) + supp({Ij}))   (4)

Similarly, we define an affinity function based on

the support of the clusters. Given two clusters Ci and

Cj and assuming that supp(Ci), supp(Cj) and supp(Ci,

Cj) are calculated according to the above, the affinity

function between the two clusters is calculated as

follows:

A(Ci, Cj) = 2 * supp(Ci, Cj) / (supp(Ci) + supp(Cj))   (5)

We can use equation (5) to calculate affinity

measures between two items, between two clusters,

and even between a cluster and an item.
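With the vertical representation of Section 3.1, equation (5) can be evaluated directly on bit vectors. The sketch below is for illustration only; it assumes each cluster is represented by a single bit vector, so that each support is a 1-bit count and the joint support is the 1-bit count of the bitwise AND (again using the GCC/Clang popcount builtin).

    #include <stdint.h>
    #include <stddef.h>

    /* Weighted confidence affinity of equation (5):
       A(Ci, Cj) = 2 * supp(Ci, Cj) / (supp(Ci) + supp(Cj)). */
    double cluster_affinity(const uint64_t *ci, const uint64_t *cj,
                            size_t nwords)
    {
        unsigned si = 0, sj = 0, sij = 0;
        for (size_t w = 0; w < nwords; w++) {
            si  += (unsigned)__builtin_popcountll(ci[w]);         /* supp(Ci) */
            sj  += (unsigned)__builtin_popcountll(cj[w]);         /* supp(Cj) */
            sij += (unsigned)__builtin_popcountll(ci[w] & cj[w]); /* supp(Ci, Cj) */
        }
        return (si + sj) ? 2.0 * sij / (si + sj) : 0.0;
    }

On the data of Figure 1, for example, this yields A(b2, b6) = 2*4/(7+5) ≈ 0.67, the highest pairwise affinity, which is why C2 and C6 are merged first in Section 3.3.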

3.3 The PH-Clustering Process

The PH-Clustering process starts with single-item

clusters, i.e. each item is treated as its own cluster

initially. The parallel processors first divide the data

set and calculate the affinity between each pair of

items/clusters. The affinity values are then broadcast

to every processor. With the affinity values, each

processor finds the closest cluster pair among its data

portion, which has the highest affinity value. The

closest pairs found by the processors (the partially local closest pairs) are collected and compared to find the closest among them, which is the global closest cluster pair. The global closest cluster pair is then broadcast to every processor and merged.

The processors continue updating the affinity values

globally and merging the closest cluster pair at the next

level until all items are in one cluster. The major

steps are outlined below.

Step 1: Partitioning the data set among parallel processors. The way the data set is distributed among processors is critical to the entire parallel clustering process. Since the clustering is on items rather than on transactions and the data set is in the form of vertical bit sets, the most efficient way is to split

the bit sets among the processors. Intuitively, the bit

sets ought to be divided into n continuous chunks

among n processors, such as {b1, b2, … bk}, {bk+1,

bk+2, … b2k}, … {b(n-1)k+1, b(n-1)k+2, … bnk}, where k is

the number of bit sets assigned to each machine and n is the number of processors. The total number of bit sets is I = n*k. However, this approach proved to lead to poor load balance. Take the calculation of affinities as an example: fewer than half of the affinities need to be calculated because the weighted confidence affinity is symmetric. In other words, in an I×I affinity matrix, only the affinities in the upper triangle (excluding the diagonal) need to be calculated. If the bit sets were divided into continuous chunks, the first processor would do the most calculation while the last would do the least (see Figure 3a). To solve this problem, we adopt a

modular approach. For example, in the case of four processors, the bit sets are first divided into groups of four, and partitioning among the processors is carried out group by group, with each bit set of a group assigned to the corresponding processor (see Figure 3b).
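In code, the modular assignment reduces to taking bit set indices modulo the number of processors. A minimal sketch (for illustration; the helper name is hypothetical):

    /* Modular (round-robin) partitioning: bit set k is local to the
       processor with rank k mod n_procs, so each processor's local bit
       sets are spread evenly over the affinity upper triangle. */
    static int is_local(int k, int rank, int n_procs)
    {
        return k % n_procs == rank;
    }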

Step 2: Calculation of affinities between cluster

pairs on parallel processors. We use equation (5) to

calculate the affinity between each pair of clusters on

every parallel processor. Calculation of affinities is

one of the expensive steps, with a time complexity of O(N*I^2), where N is the number of transactions and I is the number of items. Although more than half of the

time is reduced when only the upper triangle of the

affinity matrix is calculated, we can speed up the

process even more with parallel processors. In the

upper triangle of the affinity matrix, the ith bit set is paired with the (i+1)th bit set, the (i+2)th bit set, and so on until the last bit set (called the following bit sets of the ith bit set). Our parallel approach slices the affinity triangle across the processors as shown in Figure 3b. That is, the bit sets assigned to each processor (called local bit sets) must be paired with their following bit sets globally. To save communication time, we store the entire set of bit vectors on each processor. Thus, the time complexity of the parallel affinity calculation is decreased by a factor of n, where n is the number of parallel processors.

(a) Bit sets divided into continuous chunks

(b) Bit sets divided into even groups

Figure 3. Partitioning of the bit sets

At this point, each processor has completed its local affinity calculations, but the entire affinity calculation process is not yet finished, since each processor calculates only part of the affinities. The

affinities calculated locally need to be collected and

broadcast to all processors. Finally, the affinities in

the upper triangle have to be mirrored to the lower

triangle to complete the affinity calculation process.
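A sketch of this collect-and-broadcast step with MPI (for illustration; it assumes each rank has packed its locally computed upper-triangle entries into a contiguous buffer, with per-rank counts and displacements derived from the modular partitioning):

    #include <mpi.h>

    /* Gather every rank's locally computed affinities so that all ranks
       end up with the complete set; MPI_Allgatherv combines the gather
       and the broadcast in a single collective call. */
    void exchange_affinities(const double *local_aff, int local_n,
                             double *all_aff, int *counts, int *displs)
    {
        MPI_Allgatherv(local_aff, local_n, MPI_DOUBLE,
                       all_aff, counts, displs, MPI_DOUBLE,
                       MPI_COMM_WORLD);
    }

Mirroring the upper triangle into the lower triangle is then a purely local pass over the assembled matrix on each rank.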

Step 3: Finding the closest cluster pair. To find

the closest cluster pair is to find the pair which has

the highest affinity value. This is another time-consuming step, so we parallelize it as well. Simply finding the closest pair locally on each machine and picking the closest one among the locally closest pairs is incorrect, because the two closest clusters often reside on different processors. Therefore, we adopt

a partially local and partially global approach. Each

processor is in charge of finding the closest pair

which contains at least one of its local bit sets. After

that, the closest pairs found in all processors are

collected and compared to find the global closest

cluster pair, which is then broadcast to every processor.
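This collect-and-compare step maps naturally onto MPI's MAXLOC reduction. A sketch, assuming each rank has already found its partially local closest pair (best_i, best_j) with affinity best_aff:

    #include <mpi.h>

    /* Find the global closest pair: MPI_MAXLOC picks the rank holding
       the highest local affinity, and that winning rank then broadcasts
       its pair of cluster indices to everyone. */
    void global_closest_pair(double best_aff, int best_i, int best_j,
                             int rank, int *out_i, int *out_j)
    {
        struct { double val; int rank; } local = { best_aff, rank }, best;
        MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                      MPI_COMM_WORLD);

        int pair[2] = { best_i, best_j };
        MPI_Bcast(pair, 2, MPI_INT, best.rank, MPI_COMM_WORLD);
        *out_i = pair[0];
        *out_j = pair[1];
    }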

Step 4: Merging the closest cluster pair into a new

cluster. The clustering process then proceeds to

merge the closest cluster pair into a new cluster. In

the example in Figure 1, C2 and C6 are found to be

the closest pair. Therefore, the process merges C2 and

C6 into a new cluster C7. The current level of clusters

becomes {C1, C3, C4, C5, C7}. The number of current

level clusters is reduced from 6 to 5. The merging

process is done simultaneously on all parallel

processors no matter where the closest cluster pair is.

In this way, we avoid the communication cost of broadcasting the updates that occur during merging. After merging, the process updates the

affinities accordingly to conform to the new current

cluster set.
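Because Step 2 leaves a full copy of the bit vectors on every processor, each rank can replay the merge identically with no communication. A sketch (for illustration; following Section 2.3, the merged cluster's bit vector is the AND of its members', so its 1-bit count gives the merged itemset's support):

    #include <stdint.h>
    #include <stddef.h>

    /* Merge the global closest pair (i, j): cluster i becomes the merged
       cluster and cluster j is retired. Every rank performs this same
       update on its own full copy, so no broadcast of the result is
       needed. */
    void merge_clusters(uint64_t **bv, int *active, int i, int j,
                        size_t nwords)
    {
        for (size_t w = 0; w < nwords; w++)
            bv[i][w] &= bv[j][w];   /* bitwise AND, as in Section 2.3 */
        active[j] = 0;              /* j leaves the current cluster level */
    }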

The process continues iterating through steps 2, 3

and 4 until only one cluster bit vector remains, which

implies that all items have been merged into a single

cluster. In summary, our PH-Clustering process

involves three steps at each level: (1) calculating

affinities, (2) finding the closest pair, and (3)

merging. For the first two steps, the parallel

processors perform calculation locally and then

broadcast the results to all processors. But at the

merging step, we let all parallel processors carry out the same process in order to avoid a large amount of communication.

4 Experimental Results

We have implemented both PH-Clustering and its

sequential version (SH-Clustering) in C/C++ on the

BigBen system at the Pittsburgh Supercomputer Center. PH-Clustering is implemented using MPI. BigBen is a

Cray XT3 MPP system with 2068 computer nodes.

Each computer node has two 2.6 GHz AMD Opteron

processors and 2 GB of memory. The experiments

on affinity measure comparison were discussed in our

previous work to show that our weighted affinity

function produces better clustering results than other

standard affinity measures [22]. In this section, we

will focus on comparing the execution times.

Our experiments are conducted on standard

benchmark market-basket data sets, such as

mushroom, retail, and IBM (downloaded from the

Frequent Itemset Mining Repository at

http://fimi.cs.helsinki.fi/data/) where IBM stands for

data set T10I4D100K. The mushroom is a relatively


small data set with 8,124 transactions and 119 items;

the retail data has 88,162 transactions and 16,470

items, and the IBM data has 100,000 transactions and

870 items.

4.1 Parallel vs. Sequential Clustering

In the experiments, we compared the execution

time of SH-Clustering and PH-Clustering. Figure 4

shows the run time comparison between SH-

Clustering (denoted as “SH”) and PH-Clustering,

which runs on different numbers of nodes from

2 (denoted as 2P) to 32 (denoted as 32P).

Run time (s)   SH        2P       4P       8P       16P     32P
mushroom       25.40     10.40    5.98     4.09     3.01    2.60
retail         24668     8575.3   4067.6   1787.3   833.71  516.71
IBM            14345.53  5596.62  3110.08  1540.37  829.64  482.57

Figure 4. Parallel vs. Sequential Time

Comparing SH-Clustering with PH-Clustering

running on two nodes (2P), we can see that the run

time for all data sets is reduced dramatically. In the case of the mushroom data set, the run time is reduced by a factor of 2.5, and in the case of the retail and IBM data sets, by a factor of almost 3. Since there are two processors, one would expect the run time to be roughly halved, but the results prove to be better. Recall that some of the steps in

PH-Clustering, such as the calculation of affinities and finding the closest clusters, have time complexities of O(I^2), where I is the number of items. If I is halved, the complexities of those steps are reduced to between O(I^2/4) and O(I^2/2), where O(I^2/4) refers to an ideal case in which all the calculation is done locally and O(I^2/2) refers to a semi-parallel case (partially local and partially global).

However, as the number of nodes increases, the

speedup rate slows down. The reason is that the parallel process involves both computation time and

communication time. When the number of nodes

increases, communication time increases although

computation time decreases. There is a trade-off

between communication time and computation time.

For small data sets, such a case occurs when the

number of nodes is still small. Take the mushroom data set for example: its run time starts to decrease only slowly once the number of machines reaches 8. However, for the large

data sets, such as retail and IBM, run time continues

to decrease as the number of nodes increases until a

fairly large number. It is clear that beyond a certain

point (threshold), increasing the number of nodes will

no longer decrease the run time of PH-Clustering.

Such a point/threshold can be different for different

data sets.

4.2 The Impact of Number of Transactions

In order to see the impact of the number of

transactions, we extract three data sets from the original

IBM data, which have fewer transactions but the same

number of items as in the original IBM data set. We

name them IBM_50K, IBM_25K, and IBM_10K,

where IBM_xK refers to a partial IBM data set with

xK transactions and 870 items. The original IBM data

is designated as IBM_100K. Figure 5 shows the run

times (in seconds) of the four data sets with different numbers of transactions on different numbers of nodes.

      IBM_10K  IBM_25K  IBM_50K  IBM_100K
SH    1577.52  3792.82  7383.10  14345.53
2P    560.14   1042.66  2741.56  5596.62
4P    285.47   540.88   1073.18  3110.08
8P    152.81   294.46   591.53   1540.37
16P   85.31    171.34   324.71   829.64
32P   53.14    94.45    179.31   482.57

Figure 5. Impact of the number of transactions

It can be seen from Figure 5 that for the same

number of parallel processors, the run time is roughly proportional to the number of transactions. However, the speedup rates are similar across the different transaction sizes; for example, the speedup rates of 2 processors on the four transaction sizes are 2.82, 3.64, 2.69, and 2.56, respectively.

4.3 The Impact of Number of Items

Similarly, we conducted experiments on the impact of the

number of items. We also extracted three data sets

from the original IBM data but this time the data sets

have fewer items while keeping the same number of

transactions as in the original IBM data. We name

them IBM_100i, IBM_200i, and IBM_400i, where

IBM_xi refers to a partial IBM data set with 100,000

transactions and x items. The original IBM data is


designated as IBM_870i. Figure 6 shows the run time

of the four data sets with different numbers of items

on different numbers of nodes.

      IBM_100i  IBM_200i  IBM_400i  IBM_870i
SH    204.51    771.24    3045.80   14345.53
2P    83.92     312.30    1179.45   5596.62
4P    56.25     172.51    631.11    3110.08
8P    33.21     102.48    351.62    1540.37
16P   26.34     67.25     209.58    829.64
32P   22.32     60.74     137.27    482.57

Figure 6. Impact of the number of items

It can be seen from Figure 6 that the speedups get larger from the left columns to the right columns of the table. This means the number of items has an impact on parallelization. The reason is that data partitioning is done vertically, i.e., on items.

In summary, our parallel PH-Clustering speeds up the clustering process by a large margin. When

the data set size (transactions and items) is large,

speedup continues to be significant until the number

of processors is fairly large, while for small data sets,

speedup slows down quickly as the number of

processors increases. The speedup rates are greater

for data sets with more items than those with fewer

items. That means the number of items has more

impact on parallelization than the number of

transactions.

5 Conclusions

In this paper, we proposed a parallel hierarchical

clustering method (PH-Clustering) on market-basket

data items for catalog design purposes. PH-Clustering

is implemented using MPI. Based on the analysis of

the major clustering steps, we adopt a partially local and partially global approach to decrease computation time while minimizing communication time. Load balancing is always considered, especially at the data partitioning stage.

Our experimental results demonstrate that PH-Clustering speeds up the clustering process by a large margin. The larger the data size, the more

significant the speedup when the number of

processors is large. The number of items has more

impact on parallelization than the number of

transactions. Our future work is to find the optimal number of parallel processors for PH-Clustering on different data sets, since increasing the number of processors does not always speed up the clustering due to the tradeoff between computation time and communication time.

6 Acknowledgments

This research was supported in part by the

National Science Foundation through TeraGrid resources and the Cyberinfrastructure Program provided

by the Pittsburgh Supercomputer Center. Thanks to

Dr. Shawn Brown and other staff members from the

Pittsburgh Supercomputer Center for providing the

MPI training workshop and much other technical support. We are also grateful to the Center for

Research & Economic Development, Waynesburg

University, for the research grant. Thanks also go to

Erik Murphy and Richard Janicki who helped with

the conversion of the sequential programs.

7 References

[1] Aggarwal, CC, Magdalena, C and Yu, PS (2002). Finding localized associations in market basket data. IEEE Trans. Knowledge and Data Eng., 14(1), 51-62.

[2] Agrawal, R and Srikant, R (1994). Fast Algorithms for Mining Association Rules. Proc. VLDB, 487-499.

[3] Agrawal, R, Imielinski, T and Swami, A (1993). Mining Association Rules Between Sets of Items in Large Databases. Proc. ACM SIGMOD Conf. Management of Data, 207-216.

[4] Coenen, F, Leng, P and Ahmed, S (2003). T-Trees, Vertical Partitioning and Distributed Association Rule Mining. Proc. IEEE ICDM, 154-161.

[5] Ding, Qin, Ding, Qiang and Perrizo, W (2002). Association Rule Mining on Remotely Sensed Images Using P-trees. Proc. PAKDD 2002, 232-238.

[6] Gardarin, G, Pucheral, P and Wu, F (1998). Bitmap Based Algorithms for Mining Association Rules. Actes des journées Bases de Données Avancées (BDA'98), Hammamet, Tunisie, 124-132.

[7] Han, EH, Karypis, G, Kumar, V and Mobasher, B (1998). Hypergraph based clustering in high-dimensional data sets: A summary of results. Bulletin of the Technical Committee on Data Engineering, 21(1).

[8] Han, J and Kamber, M (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.

[9] Omiecinski, E (2003). Alternative interest measures for mining associations. IEEE TKDE, Jan/Feb.

[10] Rahal, I, Ren, D and Perrizo, W (2006). A Scalable Vertical Model for Mining Association Rules. Journal of Information & Knowledge Management (JIKM), Vol. 3, No. 4, 317-329. iKMS & World Scientific Publishing Co.

[11] Rijsbergen, CJV (1979). Information Retrieval (2nd Edition). Butterworths, London.

[12] Shenoy, P, Haritsa, JR, Sudarshan, S, Bhalotia, G, Bawa, M and Shah, D (2000). Turbo-charging vertical mining of large databases. Proc. ACM SIGMOD Conf. Management of Data, 22-29.

[13] Strehl, A and Ghosh, J (2000). A Scalable Approach to Balanced, High-dimensional Clustering of Market-baskets. Proc. 7th International Conference on High Performance Computing.

[14] Xiong, H, Tan, PN and Kumar, V (2003). Mining Strong Affinity Association Patterns in Data Sets with Skewed Support Distribution. Third IEEE International Conference on Data Mining, Melbourne, Florida, 19-22.

[15] Yun, C, Chuang, C and Chen, M (2002). Using Category-Based Adherence to Cluster Market-Basket Data. Second IEEE International Conference on Data Mining (ICDM'02), 546.

[16] Zaki, MJ and Hsiao, C (2002). CHARM: An efficient algorithm for closed itemset mining. SIAM International Conference on Data Mining, 457-473.

[17] Foti, D, Lipari, D, Pizzuti, C and Talia, D (2000). Scalable parallel clustering for data mining on multicomputers. Proc. Workshop on High Performance Data Mining, IPDPS'00.

[18] Judd, D, McKinley, PK and Jain, AK (1998). Large-Scale Parallel Data Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 871-876.

[19] Nagesh, H, Goil, S and Choudhary, A (2001). Parallel algorithms for clustering high-dimensional large-scale datasets. In Grossman, R, Kamath, C, Kegelmeyer, P, Kumar, V and Namburu, R (eds.), Data Mining for Scientific and Engineering Applications, 335-356. Kluwer Academic Publishers.

[20] Olson, CF (1995). Parallel Algorithms for Hierarchical Clustering. Parallel Computing, 21:1313-1325.

[21] Stoffel, K and Belkoniene, A (1999). Parallel K-Means Clustering for Large Data Sets. Proc. Euro-Par '99, LNCS 1685, 1451-1454.

[22] Wang, B and Rahal, I (2007). WC-Clustering: Hierarchical Clustering Using the Weighted Confidence Affinity Measure. Proc. High Performance Data Mining Workshop (IEEE ICDM'07), Omaha, NE, 355-360.

[23] Zhou, B, Shen, J and Peng, Q (2003). PARCLE: A parallel clustering algorithm for cluster system. Proc. 2003 International Conference on Machine Learning and Cybernetics, Vol. 1, 4-8.