Parallel Hierarchical Clustering on Market Basket Data
Baoying Wang Waynesburg University
Waynesburg, PA 15370
ewang@waynesburg.edu
Qin Ding
East Carolina University
Greenville, NC 27858
dingq@ecu.edu
Imad Rahal
College of Saint Benedict
& Saint John's University
Collegeville, MN 56321
irahal@csbsju.edu
Abstract
Data clustering has been proven to be a promising
data mining technique. Recently, there have been
many attempts at clustering market-basket data. In
this paper, we propose a parallelized hierarchical
clustering approach on market-basket data (PH-
Clustering), which is implemented using MPI. Based
on the analysis of the major clustering steps, we
adopt a partially local, partially global approach that
decreases computation time while keeping
communication time to a minimum. Load balancing is
considered throughout, especially at the data-partitioning
stage. Our experimental results demonstrate that PH-
Clustering substantially speeds up sequential clustering;
the larger the data set, the more
significant the speedup when the number of
processors is large. Our results also show that the
number of items has more impact on the performance
of PH-Clustering than the number of transactions.
1 Introduction
Clustering techniques partition a data set into
groups such that similar items fall into the same
group [8]. Data clustering is a common data mining
technique. Market-basket data analysis has been well
studied in the literature for finding associations
among items in large groups of transactions in order
to discover hidden business trends [3][2], which
could aid business in decision-making processes.
Market-basket data is usually organized horizontally
in the form of transactions each containing a list of
items bought by a customer during a single visit to a
store. Unlike traditional data, market-basket data are
known to be high-dimensional and sparse and to
contain attributes of a categorical nature [14].
Recently, there have been many attempts at
clustering market-basket data. There are different
clustering approaches but the two major ones are
partitioning clustering and hierarchical clustering.
Partitioning clustering approaches require at least one
input parameter such as the minimum intra-cluster
similarity/affinity or the desired number of clusters.
Hierarchical clustering is more flexible than
partitioning clustering but has a higher time
complexity. In this paper, we propose a parallelized
hierarchical clustering approach on market-basket
data (PH-Clustering), which is implemented using
MPI (message passing interface). This work is an
extension of our previous work [22], where we
developed a weighted confidence affinity function
between clusters (or itemsets) to minimize the impact
of low support items and we showed that our method
was more accurate than other contemporary affinity
measures in the literature. This paper will focus on
parallelizing the clustering process to make our method
not only more accurate but also more efficient. Our
contributions are outlined below:
(1) Our approach is hierarchical and hence
nonparametric; i.e., there is no need for the
user to specify any input parameters.
(2) We use a weighted confidence affinity measure to calculate the similarity between
items/clusters to minimize the impact of low
support items.
(3) We employ vertical data structures by representing each item as a bit vector and
resort to logical operations (such as AND, OR,
etc.) in order to speed up the computation of
itemset supports as well as the merging
process.
(4) We implement the clustering method using parallel machines to overcome the high time
complexity of hierarchical clustering.
Our experimental results show that PH-Clustering
speeds up the sequential clustering dramatically with
the same clustering results. The speedup is significant
especially when the data set size (transactions and
items) is large. We also conducted experiments on
data sets with different numbers of transactions or
different numbers of items and found that the number
of items has more impact on the performance of PH-
Clustering than the number of transactions.
The rest of the paper is organized as follows. An
overview of the related work is discussed in section 2.
We present our parallelized hierarchical clustering
2008 IEEE International Conference on Data Mining Workshops
978-0-7695-3503-6/08 $25.00 © 2008 IEEE
DOI 10.1109/ICDM.Workshops.2008.34
method (PH-clustering) in section 3 and show our
experimental results in section 4. Finally, we
conclude the paper in section 5.
2 Literature Review
Related work in several areas is discussed in this
section: clustering on market basket data, parallel
clustering, and vertical data structures.
2.1 Clustering on Market Basket Data
Clustering of market basket data can be done
either on transactions [13][1][15] or on items [14][7].
Both approaches share many challenges, such as high
dimensionality, sparsity, categorical attributes, and
substantial noise; yet, they are different in several
ways. First, they have different objectives: transaction
clustering is usually aimed at marketing by
discovering groups of customers sharing similar
shopping profiles while item clustering is mainly
used for catalog design. Second, they use different
similarity measures: transaction clustering is based on
the content of each transaction while item clustering
is based on the frequency/support of each
item/itemset among all transactions.
Our approach is similar to the Hypergraph [14]
and Hyperclique [7] approaches in that it focuses on
item clustering only. Hypergraph develops a
hypergraph by building hyperedges among all items
whose weight is equal to the average confidence of
their association rules and then partitions the
hypergraph in such a way that the data items in every
partition are highly related and the weight of the
hyperedges cut is minimized. Hyperclique [7]
partitions the itemset into hyperclique patterns which
are clusters based on the “all-confidence” measure
[9]. However, [14] and [7] are partitional clustering
methods. Both need an input parameter, namely
the minimum confidence value.
2.2 Parallel Clustering
Parallel clustering belongs to high-performance
data mining techniques. Scalable parallel computers
can be used to speed up the clustering process. Much
work has been done in parallel implementations of
data clustering algorithms [17][18][23][21]. The main
issues and challenges in parallel clustering include
data locality, load balancing, network
communication, and parallel I/O. Most existing
parallel clustering approaches have been developed
for traditional agglomerative clustering [19][20]. In
most cases, the number of processors required for
clustering is determined by the data size, which is not
flexible or adaptable to various parallel environments.
2.3 Vertical Data Structures
The concept of vertical partitioning for relations
has been well studied in data analysis fields.
Recently, a number of vertical models for association
rule mining have been proposed in the literature
[5][4][6][10][12][16]. The most basic vertical
structure is a bitmap [5], where every <transaction–
item> intersection is represented by a bit in an index
bitmap. Consequently, each item is represented by a
bit vector. The AND logical operation can then be
used to merge items and itemsets into larger itemset
patterns. The support of an itemset is calculated by
counting the number of 1-bits in the bit vector. It has
been demonstrated in the literature that vertical
approaches which utilize logical operations are more
efficient than traditional horizontal approaches
[5][10][12][16]. Moreover, vertical approaches can
be implemented easily in parallel environments in
order to speed up the data mining process further.
3 The PH-Clustering Method
This section presents PH-Clustering, our
parallelized hierarchical clustering method on
market-basket data using vertical data structures. We
assume that market-basket data is represented by a set
of transactions, denoted by D = {T1, T2, …, Tn},
where n is the total number of transactions. Each
transaction Ti consists of a subset of the item space
{I1, I2, …, Im}, where m is the total number of
available items. A sample transaction set is shown in
Figure 1. There are ten transactions and each
transaction has an id number.
TID Itemsets
T100 I2, I4
T110 I1, I2, I4, I5
T120 I2, I3, I5
T130 I4, I5, I6
T140 I1, I5
T150 I2, I6
T160 I1, I2, I6
T170 I2, I5, I6
T180 I4, I5
T190 I2, I6
Figure 1. An Example of Market Basket Data
3.1 Vertical Data Representation
We first transform the transaction set to an array
of vertical bit vectors. Each item Ik is associated with
a vector bk. The ith bit of bk is 1 if transaction Ti
contains item Ik; it is 0 otherwise. Figure 2 shows the
six vertical bit vectors for the six items from the data
in Figure 1. Each vector has a size of 10. For
instance, b1 has three 1-bits, in the 2nd, 5th, and 7th
positions because I1 occurs in transactions T110, T140,
and T160.
b1 b2 b3 b4 b5 b6
0  1  0  1  0  0
1  1  0  1  1  0
0  1  1  0  1  0
0  0  0  1  1  1
1  0  0  0  1  0
0  1  0  0  0  1
1  1  0  0  0  1
0  1  0  0  1  1
0  0  0  1  1  0
0  1  0  0  0  1
Figure 2. Vertical Data Representation
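The transformation of Figure 1 into the bit vectors of Figure 2, together with the support computation via AND and bit counting described in section 2.3, can be sketched as follows. This is an illustrative Python sketch (the paper's implementation is in C/C++); each item's vector is stored as a Python integer used as a bit set.

```python
# Build the vertical bit vectors of Figure 2 from the transactions of Figure 1.
transactions = [
    {2, 4},            # T100
    {1, 2, 4, 5},      # T110
    {2, 3, 5},         # T120
    {4, 5, 6},         # T130
    {1, 5},            # T140
    {2, 6},            # T150
    {1, 2, 6},         # T160
    {2, 5, 6},         # T170
    {4, 5},            # T180
    {2, 6},            # T190
]

def vertical_bit_vectors(transactions, num_items):
    """Return {item: bit vector}; bit i is set iff transaction i contains the item."""
    vectors = {k: 0 for k in range(1, num_items + 1)}
    for i, t in enumerate(transactions):
        for item in t:
            vectors[item] |= 1 << i
    return vectors

def support(v):
    """Support of an item(set) = number of 1-bits in its bit vector."""
    return bin(v).count("1")

vectors = vertical_bit_vectors(transactions, 6)
# supp({I1}) = 3 (T110, T140, T160); supp({I2, I6}) via bitwise AND = 4.
print(support(vectors[1]))               # 3
print(support(vectors[2] & vectors[6]))  # 4
```

The bitwise AND of two item vectors yields the vector of transactions containing both items, so an itemset's support is just a population count, as described in section 2.3.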
3.2 Affinity Function between Items
In our previous work [22], we developed an
efficient weighted confidence affinity function
between clusters (or itemsets) to minimize the impact
of low support items. In this section, we briefly
review the affinity function which is used in PH-
Clustering. The weighted affinity function is a
weighted summation of the two confidences as the
affinity measure between two items, i.e.
A(Ii, Ij) = wi * conf({Ii} → {Ij}) + wj * conf({Ij} → {Ii}),   (1)

where

wi = supp({Ii}) / (supp({Ii}) + supp({Ij}))   (2)

wj = supp({Ij}) / (supp({Ii}) + supp({Ij}))   (3)
By replacing wi and wj in equation (1) using
equations (2) and (3), replacing confidence variables
with their formulas respectively, and simplifying
equation (1), we get the following affinity measure
function:
A(Ii, Ij) = 2 * supp({Ii, Ij}) / (supp({Ii}) + supp({Ij}))   (4)
Similarly, we define an affinity function based on
the support of the clusters. Given two clusters Ci and
Cj and assuming that supp(Ci), supp(Cj) and supp(Ci,
Cj) are calculated as above, the affinity function
between the two clusters is calculated as follows:
A(Ci, Cj) = 2 * supp(Ci, Cj) / (supp(Ci) + supp(Cj))   (5)
We can use equation (5) to calculate affinity
measures between two items, between two clusters,
and even between a cluster and an item.
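Equation (5) can be computed directly on top of the bit-vector representation, with the joint support obtained via a bitwise AND as in section 2.3. The following illustrative Python sketch checks it on the Figure 1 data: I2 and I6 have supports 7 and 5 and joint support 4, giving affinity 2/3.

```python
def support(v):
    """Number of 1-bits in a bit vector stored as a Python int."""
    return bin(v).count("1")

def affinity(vi, vj):
    """Weighted confidence affinity, equation (4)/(5), from bit vectors."""
    si, sj = support(vi), support(vj)
    return 2.0 * support(vi & vj) / (si + sj) if si + sj else 0.0

# I2 and I6 from Figure 1 (bit i of each vector = transaction T(100 + 10*i)):
b2 = 0b1011100111   # I2 occurs in T100, T110, T120, T150, T160, T170, T190
b6 = 0b1011101000   # I6 occurs in T130, T150, T160, T170, T190
print(round(affinity(b2, b6), 4))   # 0.6667 = 2*4 / (7+5)
```

This pair turns out to have the highest affinity among all item pairs in Figure 1, which is why C2 and C6 are merged first in the example of Step 4 below.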
3.3 The PH-Clustering Process
The PH-Clustering process starts with single-item
clusters, i.e. each item is treated as its own cluster
initially. The parallel processors first divide the data
set and calculate the affinity between each pair of
items/clusters. The affinity values are then broadcast
to every processor. With the affinity values, each
processor finds the closest cluster pair among its data
portion, which has the highest affinity value. The
closest pairs found by the processors (the partially
local closest pairs) are collected and compared to
determine the global closest cluster pair, which is
then broadcast to every processor. After that, the
global closest cluster pair is merged on every
processor. The processors continue updating the
affinity values globally and merging the closest
cluster pair at the next level until all items are in
one cluster. The major
steps are outlined below.
Step 1: Partitioning data set among parallel
processors. The way the data set is distributed among
the processors is critical to the entire parallel clustering
process. Since the clustering is on items rather than
on transactions and the data set is in the form of
vertical bit sets, the most efficient way is to split
the bit sets among the processors. Intuitively, the bit
sets ought to be divided into n continuous chunks
among n processors, such as {b1, b2, … bk}, {bk+1,
bk+2, … b2k}, … {b(n-1)k+1, b(n-1)k+2, … bnk}, where k is
the size of bit sets assigned to each machine and n is
the number of processors; the total number of bit sets
is I = n*k. However, this approach leads to poor load
balance. Take the calculation of affinities as an
example: because the weighted confidence affinity is
symmetric, fewer than half of the affinities need to be
calculated. In other words, in an
I×I affinity matrix, only the affinities in the upper
triangle (not including the diagonal line) need to be
calculated. If the bit sets were divided into continuous
chunks, the first processor would do the most
calculation while the last would do the least
(see Figure 3a). To solve this problem, we adopt the
modular approach. For example, in case of four
processors, the bit sets are first divided into groups of
four, and partitioning is then carried out group by
group, with each bit set in a group assigned to the
corresponding processor (see Figure 3b).
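The load-balance argument above can be checked with a short sketch (illustrative Python, with hypothetical helper names): count how many upper-triangle pairs each processor is responsible for under contiguous chunking versus the modular assignment, given that bit set i must later be paired with every following bit set.

```python
# Compare per-processor workloads for the two partitioning schemes of Step 1.
# Bit set i contributes I - 1 - i upper-triangle pairs (it pairs with all
# following bit sets), so workload depends heavily on which i's a rank owns.

def pair_counts(assignment, I, n):
    """Number of upper-triangle pairs owned by each of n processors."""
    counts = [0] * n
    for i in range(I):
        counts[assignment(i)] += I - 1 - i
    return counts

I, n = 16, 4                       # 16 bit sets, 4 processors
chunk = lambda i: i // (I // n)    # contiguous chunks: {b1..b4}, {b5..b8}, ...
modular = lambda i: i % n          # modular (round-robin) groups of four

print(pair_counts(chunk, I, n))    # [54, 38, 22, 6]  -- badly skewed
print(pair_counts(modular, I, n))  # [36, 32, 28, 24] -- nearly balanced
```

The contiguous scheme makes the first processor do nine times the work of the last, while the modular scheme keeps the counts within a narrow band, matching Figure 3.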
Step 2: Calculation of affinities between cluster
pairs on parallel processors. We use equation (5) to
calculate the affinity between each pair of clusters on
every parallel processor. Calculation of affinities is
one of the expensive steps with the time complexity
of O(N*I^2), where N is the number of transactions and
I is the number of items. Although more than half of the
time is reduced when only the upper triangle of the
affinity matrix is calculated, we can speed up the
process even more with parallel processors. In the
affinity upper triangle, the ith bit set must be paired
with the (i+1)th, (i+2)th, and so on up to the last bit
set (called the following bit sets of the ith bit set).
Our parallel approach slices the affinity triangle for
different processors, as shown in Figure 3b. That is to
say, the bit sets assigned to each processor (called its
local bit sets) have to be paired with their following
bit sets globally. To save communication time, we store the
entire bit sets on each processor. Thus, the time
complexity of the parallel affinity calculation is
decreased by a factor of n, where n is the number of
parallel processors.
(a) Bit sets divided into continuous chunks
(b) Bit sets divided into even groups
Figure 3. Partitioning of the bit sets
At this point, each processor has computed only its
share of the affinities. The locally calculated
affinities need to be collected and broadcast to all
processors. Finally, the affinities in the upper
triangle are mirrored to the lower triangle to
complete the affinity calculation process.
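Step 2 can be simulated sequentially. The following illustrative Python sketch (not the paper's C/C++/MPI code) has each simulated rank compute the upper-triangle affinities whose first index is one of its round-robin-assigned bit sets; the dictionary update stands in for the MPI gather-and-broadcast.

```python
# Single-process simulation of Step 2 on the Figure 2 data.

def support(v):
    """Number of 1-bits in a bit vector stored as a Python int."""
    return bin(v).count("1")

def affinity(vi, vj):
    """Weighted confidence affinity, equation (5)."""
    s = support(vi) + support(vj)
    return 2.0 * support(vi & vj) / s if s else 0.0

def local_affinities(vectors, rank, n):
    """Upper-triangle affinities for pairs (i, j), j > i, where bit set i
    is assigned to this rank by the modular (round-robin) partitioning."""
    I = len(vectors)
    return {(i, j): affinity(vectors[i], vectors[j])
            for i in range(rank, I, n)
            for j in range(i + 1, I)}

# The six bit vectors b1..b6 of Figure 2 as integers, bit i = T(100 + 10*i).
vectors = [82, 743, 4, 267, 414, 744]

merged = {}                     # stands in for the gather + broadcast
for rank in range(2):           # two simulated processors
    merged.update(local_affinities(vectors, rank, 2))

best = max(merged, key=merged.get)
print(len(merged), best)   # 15 (1, 5): every pair covered once; I2-I6 closest
```

Each pair is computed by exactly one rank, so combining the slices reproduces the full upper triangle without duplicated work.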
Step 3: Finding the closest cluster pair. To find
the closest cluster pair is to find the pair which has
the highest affinity value. This step is also time-
consuming, so we use a parallel approach here as
well. Finding the closest pair locally on each machine
and picking the closest among the locally closest
pairs is incorrect, because the two closest items often
reside on different processors. Therefore, we adopt
a partially local, partially global approach. Each
processor is in charge of finding the closest pair
which contains at least one of its local bit sets. After
that, the closest pairs found in all processors are
collected and compared to find the global closest
cluster pair, which is then broadcast to every processor.
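The partially local, partially global selection of Step 3 can be sketched as follows (illustrative Python with hypothetical affinity values; in the real implementation the local winners would travel via MPI collectives). Each simulated rank scans only the pairs that contain at least one of its local bit sets, and the local winners are then compared.

```python
# Simulate Step 3: local closest-pair search followed by a global reduction.

def closest_local_pair(affinities, local_items):
    """Best pair (by affinity) containing at least one local bit set."""
    candidates = {p: a for p, a in affinities.items()
                  if p[0] in local_items or p[1] in local_items}
    return max(candidates, key=candidates.get)

# Hypothetical affinity matrix over 4 items, 2 processors, round-robin ownership.
affinities = {(0, 1): 0.40, (0, 2): 0.55, (0, 3): 0.10,
              (1, 2): 0.25, (1, 3): 0.70, (2, 3): 0.35}
local = {0: {0, 2}, 1: {1, 3}}        # rank -> bit sets it owns

local_best = [closest_local_pair(affinities, local[r]) for r in (0, 1)]
global_best = max(local_best, key=affinities.get)
print(local_best, global_best)   # [(0, 2), (1, 3)] (1, 3)
```

Note that rank 0's local winner (0, 2) is not the global winner; naively taking the best pair among pairs wholly inside each partition could miss (1, 3) entirely, which is exactly why each rank must consider every pair touching its local bit sets.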
Step 4: Merging the closest cluster pair into a new
cluster. The clustering process then proceeds to
merge the closest cluster pair into a new cluster. In
the example in Figure 1, C2 and C6 are found to be
the closest pair. Therefore, the process merges C2 and
C6 into a new cluster C7. The current level of clusters
becomes {C1, C3, C4, C5, C7}. The number of current
level clusters is reduced from 6 to 5. The merging
process is performed simultaneously on all parallel
processors regardless of where the closest cluster
pair resides. In this way, we avoid the considerable
communication cost of broadcasting the updates that
occur during merging. After merging, the process
updates the affinities accordingly to conform to the
new current cluster set.
The process continues iterating through steps 2, 3
and 4 until only one cluster bit vector remains, which
implies that all items have been merged into a single
cluster. In summary, our PH-Clustering process
involves three steps at each level: (1) calculating
affinities, (2) finding the closest pair, and (3)
merging. For the first two steps, the parallel
processors perform calculations locally and then
broadcast the results to all processors. At the merging
step, however, all parallel processors carry out the
same process in order to avoid a large amount of
communication time.
4 Experimental Results
We have implemented both PH-Clustering and its
sequential version (SH-Clustering) in C/C++ on the
BigBen system at the Pittsburgh Supercomputer Center.
PH-Clustering is implemented using MPI. BigBen is a
Cray XT3 MPP system with 2,068 compute nodes,
each with two 2.6 GHz AMD Opteron
processors and 2 GB of memory. The experiments
on affinity measure comparison were discussed in our
previous work to show that our weighted affinity
function produces better clustering results than other
standard affinity measures [22]. In this section, we
will focus on comparison of the execution time.
Our experiments are conducted on standard
benchmark market-basket data sets, such as
mushroom, retail, and IBM (downloaded from the
Frequent Itemset Mining Repository at
http://fimi.cs.helsinki.fi/data/), where IBM denotes the
data set T10I4D100K. The mushroom data is a relatively
small data set with 8,124 transactions and 119 items;
the retail data has 88,162 transactions and 16,470
items, and the IBM data has 100,000 transactions and
870 items.
4.1 Parallel vs. Sequential Clustering
In the experiments, we compared the execution
time of SH-Clustering and PH-Clustering. Figure 4
shows the run time comparison between SH-
Clustering (denoted as “SH”) and PH-Clustering,
which runs on different numbers of nodes from
2 (denoted as 2P) to 32 (denoted as 32P).
Run time (s):

            SH        2P       4P       8P       16P      32P
mushroom    25.40     10.40    5.98     4.09     3.01     2.60
retail      24668     8575.3   4067.6   1787.3   833.7    516.71
IBM         14345.53  5596.62  3110.08  1540.37  829.64   482.57
Figure 4. Parallel vs. Sequential Time
Comparing SH-Clustering with PH-Clustering
running on two nodes (2P), we can see that the run
time for all data sets is reduced dramatically. For
the mushroom data set, the run time is reduced by a
factor of 2.5, and for the retail and IBM data sets it
is reduced by a factor of almost 3. Since there are two
processors, one would expect the run time to be
roughly halved, but the results
prove to be better. Recall that some of the steps in
PH-Clustering have time complexities of O(I^2), where
I is the number of items, such as the calculation of
affinities and finding the closest clusters. If I is
halved, the complexities of those steps are reduced to
between O(I^2/4) and O(I^2/2), where O(I^2/4) refers
to the ideal case when all the calculation is done
locally and O(I^2/2) to a semi-parallel case
(partially local and partially global).
However, as the number of nodes increases, the
speedup rate slows down. The reason is that a parallel
process involves both computation time and
communication time: as the number of nodes
increases, communication time grows while
computation time shrinks, so there is a trade-off
between the two. For small data sets, this occurs
while the number of nodes is still small. Take
mushroom for example: its run time starts to decrease
only slowly once the number of machines reaches 8.
For the large data sets, such as retail and IBM, run
time continues to decrease as the number of nodes
increases until a fairly large number. Eventually, at a
certain point (threshold), increasing the number of
nodes will no longer decrease the run time of
PH-Clustering.
Such a point/threshold can be different for different
data sets.
4.2 The Impact of Number of Transactions
In order to see the impact of the number of
transactions, we extracted three data sets from the
original IBM data that have fewer transactions but the
same number of items as the original. We name
them IBM_50K, IBM_25K, and IBM_10K,
where IBM_xK refers to a partial IBM data set with
xK transactions and 870 items. The original IBM data
is designated IBM_100K. Figure 5 shows the run
times (in seconds) of the four data sets for different
numbers of transactions and different numbers of nodes.
IBM_10K IBM_25K IBM_50K IBM_100K
SH 1577.52 3792.82 7383.10 14345.53
2P 560.14 1042.66 2741.56 5596.62
4P 285.47 540.88 1073.18 3110.08
8P 152.81 294.46 591.53 1540.37
16P 85.31 171.34 324.71 829.64
32P 53.14 94.45 179.31 482.57
Figure 5. Impact of the number of transactions
It can be seen from Figure 5 that, for the same
number of parallel processors, the run time is roughly
proportional to the number of transactions. The
speedup rates, however, are similar across the
different transaction sizes; for example, the speedup
rates of 2 processors on the four transaction sizes are
2.82, 3.64, 2.69, and 2.56, respectively.
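The quoted 2-processor speedup rates follow directly from the Figure 5 numbers; a quick check:

```python
# Speedup = sequential (SH) run time / parallel run time, from Figure 5.
sh = {"IBM_10K": 1577.52, "IBM_25K": 3792.82,
      "IBM_50K": 7383.10, "IBM_100K": 14345.53}
p2 = {"IBM_10K": 560.14, "IBM_25K": 1042.66,
      "IBM_50K": 2741.56, "IBM_100K": 5596.62}

speedups = {d: round(sh[d] / p2[d], 2) for d in sh}
print(speedups)
# {'IBM_10K': 2.82, 'IBM_25K': 3.64, 'IBM_50K': 2.69, 'IBM_100K': 2.56}
```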
4.3 The Impact of Number of Items
Similarly, we conducted experiments on the impact of
the number of items. We again extracted three data
sets from the original IBM data, this time with fewer
items but the same number of transactions as the
original. We name them IBM_100i, IBM_200i, and
IBM_400i, where IBM_xi refers to a partial IBM data
set with 100,000 transactions and x items. The
original IBM data is
designated as IBM_870i. Figure 6 shows the run time
of the four data sets with different numbers of items
on different numbers of nodes.
IBM_100i IBM_200i IBM_400i IBM_870i
SH 204.51 771.24 3045.80 14345.53
2P 83.92 312.30 1179.45 5596.62
4P 56.25 172.51 631.11 3110.08
8P 33.21 102.48 351.62 1540.37
16P 26.34 67.25 209.58 829.64
32P 22.32 60.74 137.27 482.57
Figure 6. Impact of the number of items
It can be seen from Figure 6 that the speedups grow
larger from the left columns to the right columns of
the table. This means the number of items has an
impact on parallelization. The reason is that data
partitioning is done vertically, i.e., on items.
In summary, our parallel PH-Clustering substantially
speeds up the clustering process. When the data set
size (transactions and items) is large, the speedup
remains significant until the number of processors is
fairly large, while for small data sets the speedup
tapers off quickly as the number of processors
increases. The speedup rates are greater for data sets
with more items than for those with fewer items,
which means the number of items has more impact on
parallelization than the number of transactions.
5 Conclusions
In this paper, we proposed a parallel hierarchical
clustering method (PH-Clustering) on market-basket
data items for catalog design purposes. PH-Clustering
is implemented using MPI. Based on the analysis of
the major clustering steps, we adopt a partially local,
partially global approach to decrease computation
time while minimizing communication time. Load
balancing is considered throughout, especially at the
data-partitioning stage.
Our experimental results demonstrate that PH-
Clustering substantially speeds up the clustering
process. The larger the data size, the more
significant the speedup when the number of
processors is large. The number of items has more
impact on parallelization than the number of
transactions. Our future work is to find the optimal
number of parallel processors for PH-Clustering on
different data sets, since increasing the number of
processors does not always speed up the clustering
due to the trade-off between computation time and
communication time.
6 Acknowledgments
This research was supported in part by the
National Science Foundation through TeraGrid
resources and Cyberinfrastructure Program provided
by the Pittsburgh Supercomputer Center. Thanks to
Dr. Shawn Brown and other staff members from the
Pittsburgh Supercomputer Center for providing the
MPI training workshop and much other technical
support. We are also grateful to the Center for
Research & Economic Development, Waynesburg
University, for the research grant. Thanks also go to
Erik Murphy and Richard Janicki who helped with
the conversion of the sequential programs.
7 References
[1] Agarwal, CC, Magdalena, C and Yu, PS (2002). Finding localized associations in market basket
data. IEEE Trans. Knowledge and Data Eng.,
14(1), 51-62
[2] Agrawal R and Srikant R (1994). Fast Algorithms for Mining Association Rules. Proc.
VLDB, 487-499
[3] Agrawal R, Imielinski T and Swami A (1993). Mining Association Rules Between Sets of Items
in Large Databases. Proceedings of the ACM
SIGMOD, Conf. Management of Data, 207-216
[4] Coenen, F, Leng, P and Ahmed, S (2003). T-Trees, Vertical Partitioning and Distributed
Association Rule Mining. Proc. IEEE ICDM,
154-161
[5] Ding, Qin, Ding, Qiang and Perrizo, W (2002). Association Rule Mining on Remotely Sensed
Images Using P-trees. Proc. PAKDD2002, 232-
238
[6] Gardarin, G, Pucheral, P and Wu, F (1998). Bitmap Based Algorithms for Mining
Association Rules. In Actes des journes Bases de
Donnes Avances (BDA'98), Hammamet,
Tunisie,124-132
[7] Han, EH , Karypis, G, Kumar, V and Mobasher, B (1998). Hypergraph based clustering in
high-dimensional data sets: A summary of results.
Bulletin of the Technical Committee on Data
Engineering, 21(1)
[8] Han, J and Kamber, M (2006). Data Mining, Concepts and Techniques. Morgan Kaufmann
[9] Omiecinski, E (2003). Alternative interest measures for mining associations. In IEEE
TKDE, Jan/Feb
[10] Rahal, I, Ren, D and Perrizo, W (2006). A Scalable Vertical Model for Mining Association
Rules. Journal of Information & Knowledge
Management (JIKM), Vol.3, No.4, 317-329.
iKMS & World Scientific Publishing Co.
[11] Rijsbergen, CJV (1979). Information Retrieval (2nd Edition). Butterworths, London
[12] Shenoy, P, Haristsa, JR, Sudatsham, S, Bhalotia, G, Baqa, M and Shah, D (2000). Turbo-charging
vertical mining of large databases. In Proc. ACM
Inter. Conf. Management of Data (SIGMOD), W
Chen, JF Naughton and PA Bernstein (eds.), pp.
22-29. ACM 2000.
[13] Strehl, A and Ghosh, J (2000). A Scalable Approach to Balanced, High-dimensional
Clustering of Market-baskets. Proceedings of the
7th International Conference on High
Performance Computing
[14] Xiong, H, Tan, PN and Kumar, V (2003). Mining Strong Affinity Association Patterns in Data Sets
with Skewed Support Distribution. Third IEEE
International Conference on Data Mining,
Melbourne, Florida, 19 - 22
[15] Yun, C., Chuang, C., and Chen, M. (2002). Using Category-Based Adherence to Cluster
Market-Basket Data. Second IEEE International
Conference on Data Mining (ICDM'02), 546
[16] Zaki, MJ and Hsiao, C (2002). Charm: An efficient algorithm for closed itemset mining. In
SIAM International Conference on Data Mining,
457-473.
[17] Foti, D., Lipari, D., Pizzutti, C. and Talia D (2000). Scalable parallel clustering for data
mining on multicomputers. In Proceedings of the
Workshop on High Performance Data Mining,
IPDPS’00.
[18] Judd, D., McKinley, P. K., and Jain, A. K. (1998). "Large-Scale Parallel Data Clustering,"
IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 20, no. 8, pp. 871-
876.
[19] Nagesh, H., Goil, S. and Choudhary, A (2001) Parallel algorithms for clustering high-
dimensional large-scale datasets. In Robert
Grossman, Chandrika Kamath, Philip
Kegelmeyer, Vipin Kumar, and Raju Namburu,
editors, Data Mining for Scientific and
Engineering Applications, pages 335–356.
Kluwer Academic Publishers.
[20] Olson, C. F. (1995) Parallel Algorithms for Hierarchical Clustering. Parallel Computing,
21:1313-1325.
[21] Stoffel, K. and Belkoniene, A (1999). “Parallel K-Means Clustering for Large Data Sets”.
Proceedings Euro-Par '99, LNCS 1685, pp. 1451-
1454.
[22] Wang, B. and Rahal, I (2007). “WC-clustering: Hierarchical Clustering using the Weighted
Confidence Affinity Measure” Proceedings of
the High Performance Data Mining Workshop
(IEEE ICDM’07), Omaha, NE. October. pp. 355-
360
[23] Zhou, B. Shen, J., and Peng Q (2003). “PARCLE: a parallel clustering algorithm for
cluster system” Machine Learning and
Cybernetics, 2003 International Conference on,
Volume 1, pp. 4-8