kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in
Unsupervised Learning Hossein Estiri,1,2,3 Behzad A Omran,4 Shawn N Murphy1,2,3
1Harvard Medical School; 2Massachusetts General Hospital; 3Partners Healthcare, Boston, MA; 4Construction System Management, The Ohio State University, Columbus, OH
Corresponding Author: Hossein Estiri
E-mail: hestiri at mgh dot harvard dot edu
PRE-PRINT. Download the final publication from:
https://doi.org/10.1016/j.bdr.2018.05.003
Abstract
The majority of the clinical observation data stored in large-scale Electronic Health
Record (EHR) research data networks are unlabeled. Unsupervised clustering can provide
invaluable tools for studying patient sub-groups in these data. Many of the popular
unsupervised clustering algorithms are dependent on identifying the number of
clusters. Multiple statistical methods are available to approximate the number of
clusters in a dataset. However, available methods are computationally inefficient
when applied to large amounts of data. Scalable analytical procedures are needed to
extract knowledge from large clinical datasets. Using simulated, clinical, and
public data, we developed and tested the kluster procedure for approximating the
number of clusters in a large clinical dataset. The kluster procedure iteratively
applies four statistical cluster number approximation methods to small subsets of
data drawn randomly with replacement and recommends the most frequent and
mean numbers of clusters resulting from the iterations as the potential optimum
number of clusters. Our results showed that kluster’s most frequent product, which
iteratively applies a model-based clustering strategy using the Bayesian Information
Criterion (BIC) to samples of 200-500 data points through 100 iterations, offers a
reliable and scalable solution for approximating the number of clusters in
unsupervised clustering. We provide the kluster procedure as an R package.
1. Introduction
The high throughput of Electronic Health Records (EHR) from multi-site clinical data
repositories provides numerous opportunities for novel data-driven healthcare
discovery. EHR data contain unlabeled clinical observations (e.g., laboratory result
values) that can be used to characterize patients with similar phenotypic
characteristics, using unsupervised learning. In healthcare research, unsupervised
learning has been applied for clustering and/or dimensionality reduction. In
unsupervised learning, the machine develops a formal framework to build
representations of the input data to facilitate further prediction and/or decision
making [1]. The goal in unsupervised clustering is to partition data points into
clusters with high intra-class similarities and low inter-class similarities [2,3].
Unsupervised learning is widely used for applications in computer vision, and in
particular for image segmentation. In healthcare research, unsupervised clustering
has been applied to tasks such as image/tissue segmentation, disease/tumor subtype
clustering, and dimensionality reduction, but most commonly it has been used in
genomics for gene/cell expression and RNA sequencing analyses.
Many of the popular unsupervised clustering algorithms (e.g., k-means) are dependent
on setting initial parameters, most importantly the number of clusters, k. Initial
parameters play a key role in determining intra-cluster cohesion (compactness) and
inter-cluster separation (isolation) of an unsupervised clustering algorithm.
Initializing the number of clusters for the unsupervised clustering algorithm to begin
with is a challenging problem, for which available solutions are often ad hoc or based
on expert judgment [4–6]. Over the past few decades, the statistics literature has
presented different solutions that apply different quantitative indices to this
problem. Some of the notable statistical solutions are the Calinski and Harabasz index
[7], silhouette statistic [8,9], gap statistic [10], and the model-based approach
using approximate Bayes factor [6,11]. In addition, iterative clustering algorithms
such as the Affinity Propagation algorithm [12], PAM (Partitioning Around Medoids)
[8], and Gaussian-means (G-means) [5] have also been used in the Machine Learning
community for identifying the number of clusters in a dataset.
These statistical approaches primarily compare clustering results obtained with
different cluster numbers and recommend the best number of clusters for a dataset. Further,
these techniques were mostly developed for conventional statistical analysis, where
the number of data points does not often exceed a few thousand. As a result, available
statistical approaches either involve making strong parametric assumptions, are
computation-intensive, or both [4]. Especially dealing with large amounts of data,
available statistical solutions are computationally inefficient. Although this is a
general issue across the board, intensive computing requirements to conduct
unsupervised clustering becomes a more prominent issue in clinical research
settings (e.g., research data networks and academic institutions), where
computational capacities are often limited. Applying unsupervised clustering to
large amounts of clinical observations data requires scalable methods for identifying
the number of clusters. In this work, we test and present an efficient scalable
procedure, kluster, for approximating the number of clusters in unsupervised
clustering. We have made kluster available as an R package.
2. Material and methods
Selection of number of clusters, k, directly impacts the clustering “accuracy”
criteria, including intra-cluster cohesion and inter-cluster separation. Intuitively,
increasing the initial number of clusters should decrease the clustering error. The
highest error is obtained when the data is clustered into only one partition (i.e.,
maximum compression, minimum accuracy) and the lowest error is when k equals the
number of data points (i.e., minimum compression, maximum accuracy). When prior expert
judgement is unavailable, the optimal choice for the number of clusters can be obtained
at a balanced representation of the data between the minimum and maximum compression
[13]. Multiple statistical approaches have been developed to approximate the number
of clusters. Almost all of the available methods use different statistics for
evaluating clustering performances iteratively over different cluster numbers.
The silhouette coefficient is a measure of cluster assignment accuracy based on comparing
“tightness” (how far a point is from other points in the same cluster) and
“separation” (how close a point is to its neighboring clusters) [14]. Through
iterative clustering over a range of cluster numbers, an optimal k should maximize
the average silhouette coefficient. [9] The Elbow method is another iterative approach
for identifying the optimal k. In the Elbow method, the total within-cluster sum of
square (WSS) is calculated for each k through iterative clustering with different
cluster numbers. The optimum number of clusters is identified by plotting the WSS
versus the cluster numbers and finding the location of a bend (knee) in the plot. The
Elbow method has two main problems beyond being computationally intensive: it still
requires expert judgement, and it is often genuinely ambiguous to identify the number
of clusters visually, since sometimes there is no clear elbow. The gap statistic method was developed
to standardize comparison in the Elbow method. The gap statistic iteratively compares
normalized within-cluster sums of squares against a null data distribution with no
obvious clustering, and for each k, it compares the WSS with their expected values
under null distribution. The optimal k is where the difference is farthest below the
null distribution curve [10].
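As an illustration of the within-cluster sum of squares (WSS) computation that underlies the Elbow method, the following is a minimal pure-Python sketch. The paper’s experiments use R; the `naive_kmeans` helper and the toy data here are our own, purely illustrative:

```python
import random

def naive_kmeans(points, k, iters=20, seed=0):
    """A minimal Lloyd-style k-means for 1-D data; returns centers and labels."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        labels = [min(range(k), key=lambda c: (p - centers[c]) ** 2) for p in points]
        # update step: each center moves to the mean of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

def wss(points, centers, labels):
    """Total within-cluster sum of squares."""
    return sum((p - centers[lab]) ** 2 for p, lab in zip(points, labels))

# toy 1-D data with two well-separated groups
data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
curve = [wss(data, *naive_kmeans(data, k)) for k in range(1, 5)]
print(curve)
```

On data like this, the WSS drops sharply from k = 1 to k = 2 and only marginally afterwards, producing the “elbow” at the true number of groups.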
The Bayesian Information Criterion (BIC) and the Calinski and Harabasz index (CAL)
methods are also popular index-based methods, meaning that they aim to maximize
indices computed through iterative clustering. BIC is another index that can be used
iteratively to approximate the number of clusters [6,11]. This method is part of a
comprehensive model-based clustering strategy, in which a maximum number of clusters
and an initial set of mixture models are applied to hierarchical agglomeration and
expectation–maximization algorithms. The BIC from the resulting models are computed
and an optimal number of clusters, k, is identified from the model with a decisive
maximum of the BIC [6]. We used the implementation of BIC algorithm in R package
‘mclust’ [15]. The Calinski and Harabasz index [7], which is also known as variance
ratio criterion, approximates the optimal number of clusters by maximizing CH from Equation 1.
Equation 1: CH(k) = [BGSS / (k − 1)] / [WGSS / (N − k)]
where, k is the number of clusters and N is the total number of data points, BGSS is
the overall between-group dispersion and WGSS is the sum of within-cluster dispersions
for all the clusters. We used the implementation of the CAL algorithm in the R
package ‘vegan’ [16].
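For illustration, Equation 1 can be computed directly from cluster assignments. The following pure-Python sketch for 1-D data is our own and is not the ‘vegan’ implementation:

```python
def ch_index(points, labels, k):
    """Calinski-Harabasz index: (BGSS / (k - 1)) / (WGSS / (N - k))."""
    n = len(points)
    grand_mean = sum(points) / n
    clusters = {c: [p for p, lab in zip(points, labels) if lab == c] for c in set(labels)}
    # BGSS: size-weighted squared distances of cluster means to the grand mean
    bgss = sum(len(m) * ((sum(m) / len(m)) - grand_mean) ** 2 for m in clusters.values())
    # WGSS: squared distances of points to their own cluster mean
    wgss = sum(sum((p - sum(m) / len(m)) ** 2 for p in m) for m in clusters.values())
    return (bgss / (k - 1)) / (wgss / (n - k))

# two tight, well-separated 1-D clusters yield a large CH value
data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
labels = [0, 0, 0, 1, 1, 1]
print(ch_index(data, labels, k=2))
```

Candidate values of k are then ranked by their CH value, and the k with the maximum is selected.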
In addition to the index-based methods that can be iteratively computed and optimized,
there are iterative clustering algorithms that do not rely on the initial
approximation of the optimal number of clusters. These algorithms can be used to
identify k for use in other clustering algorithms. Partitioning Around Medoids (PAM)
[9] is a clustering algorithm that can self-identify k, and thus, can be used to
identify the optimal number of clusters. The PAM algorithm searches for a sequence of
centroids for clusters (called medoids) to reduce the effect of outliers. Each
observation is then assigned to its nearest medoid to generate k clusters,
and a dissimilarity matrix is computed as the basis to re-adjust the medoids. The
process is iterated until there is no change in the medoids [9]. We used the
implementation of PAM algorithm in R package ‘fpc’ [17]. The Affinity Propagation
(AP) algorithm uses measures of similarity between pairs of data points in search of
a “high-quality set of exemplar” data points, by iteratively exchanging real-valued
messages between them. In the AP algorithm, all the data points are considered
potential exemplars and viewed as nodes in the networks. These nodes exchange messages
with each other to generate a better cluster [12]. We used the implementation of the
AP algorithm in the R package ‘apcluster’ [18].
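To make the medoid assignment and update steps concrete, here is a simplified k-medoids sketch in Python (a Voronoi-iteration variant; our own illustration rather than the full PAM swap algorithm implemented in ‘fpc’):

```python
def k_medoids(points, k, iters=10):
    """Simplified k-medoids: assign each point to its nearest medoid, then
    replace each medoid with the member minimizing total dissimilarity."""
    medoids = points[:k]  # deterministic initialization, for illustration only
    for _ in range(iters):
        # assignment step: each point joins its nearest medoid's cluster
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: abs(p - m))
            clusters[nearest].append(p)
        # update step: the new medoid minimizes total dissimilarity to its cluster
        new_medoids = [min(members, key=lambda c: sum(abs(c - p) for p in members))
                       for members in clusters.values() if members]
        if sorted(new_medoids) == sorted(medoids):
            break  # no medoid changed: converged
        medoids = new_medoids
    return sorted(medoids)

data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
print(k_medoids(data, 2))  # → [1.0, 5.0]
```

Unlike k-means centers, medoids are always actual data points, which is what makes the method robust to outliers.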
Regardless of the accuracy of the k optimized by these methods, we argue that they
may not scale to large datasets in their original form. Electronic Health Records
(EHR) on clinical observations for a single patient can add up to hundreds of rows of
data. In an average provider organization, clinical observation data often reach into
the tens (or even thousands) of millions of rows. A recommended solution for this issue is to apply
the method to a random sample of the data (as a training set) and use discriminant
analysis to make inferences about the full population [6]. In this paper we develop,
test, and present kluster, a procedure that uses iterative sampling with replacement
to approximate k with application to clinical data.
3. kluster procedure
The Bayesian Information Criterion (BIC), Calinski and Harabasz index (CAL),
Partitioning Around Medoids (PAM), and Affinity Propagation (AP) are popular methods
for approximating the number of clusters, k, which also have well-maintained
implementations in R. We use these four methods as representatives of the available
statistical methods for the purpose of testing the principal hypothesis of this study
and employ them as baseline algorithms to develop and test our proposed procedure.
We argue that applying cluster number approximation methods to an entire dataset
is computationally inefficient and, more importantly, does not scale up to large
datasets. As a result, it is computationally expensive (or currently
impossible) to incorporate such algorithms within recurring unsupervised
learning pipelines in most clinical research institutions. It is also possible
that applying these methods to the entirety of the data points will increase the
likelihood of overfitting, and therefore impact the precision of the
recommended cluster number approximation. We hypothesize that employing a
sampling strategy can scale up the cluster number approximation processes
without significantly diminishing performance. To evaluate this hypothesis, we
conducted experimental analyses using the BIC, PAM, AP, and CAL methods. To
conduct simulations for the experimental analyses, we developed functions in
the R statistical language, in which we also implemented a version of each method
based on iterative sampling. We call this package kluster.
Through kluster, we relax the computational requirements by applying a cluster number
approximation method in iterations to samples of data that were drawn at random and
with replacement. Suppose a population parameter k, the optimal number of clusters,
is sought. The kluster procedure produces an estimate of k as follows:
1. Collect a random sample of size n with replacement from the database, which
yields the data (X_1, X_2, …, X_n).
Drawing random samples of n data points from the data results in the X_i being i.i.d.
samples from the distribution of a random variable X, which we hypothesize diminishes
the chance of over-fitting.
2. Apply a cluster number approximation algorithm ω to the sampled data (X_1, X_2, …, X_n)
to identify the number of clusters, k̂, in the sample. Currently, ω ∈ {BIC, AP, CAL, PAM}.
3. Repeat steps 1 and 2 i times to produce a vector of estimates (k̂_1, k̂_2, …, k̂_i).
4. Calculate the mean and the most frequent value (mode) of (k̂_1, k̂_2, …, k̂_i).
After the kluster procedure is completed, it provides two products for each of the
four cluster number approximation methods: (1) the most frequent approximated number
of clusters, and (2) the mean approximated number of clusters. We refer to these as
kluster’s most frequent and mean products, respectively.
Equation 1: kluster’s mean product on ω = mean(k̂_1, k̂_2, …, k̂_i)
Equation 2: kluster’s most frequent product on ω = mode(k̂_1, k̂_2, …, k̂_i)
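The four steps can be sketched as the following Python skeleton. This is illustrative only: the actual kluster implementation is an R package, and `gap_count_k` is a made-up toy estimator standing in for BIC, AP, CAL, or PAM:

```python
import random
from collections import Counter

def kluster(data, estimate_k, n=100, i=100, seed=0):
    """Steps 1-4: repeatedly sample n points with replacement, estimate k on
    each sample, and return the mode and mean of the i estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(i):
        sample = rng.choices(data, k=n)        # step 1: sample with replacement
        estimates.append(estimate_k(sample))   # step 2: approximate k on the sample
    mode = Counter(estimates).most_common(1)[0][0]  # step 4: most frequent product
    mean = sum(estimates) / len(estimates)          # step 4: mean product
    return mode, mean

# toy estimator: counts gaps larger than 1.0 in sorted 1-D data (illustrative only)
def gap_count_k(sample):
    s = sorted(set(sample))
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b - a > 1.0)

# two tight groups of 30 points each, far apart
data = [0.01 * j for j in range(30)] + [10 + 0.01 * j for j in range(30)]
print(kluster(data, gap_count_k, n=20, i=50))
```

Because each iteration works on only n points, the cost of the procedure is governed by the sample size and iteration count rather than the full dataset size.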
For example, when kluster applies the BIC method to samples of user-defined size
over a user-defined number of iterations, it will produce a most frequent product
on BIC and a mean product on BIC. Through the next sections, we describe results of
comparing the approximated number of clusters by each of the four methods and their
corresponding kluster’s products on simulated, clinical, and public datasets.
3.1. Data
We argued that applying the original cluster number approximation methods on the
entire database is computationally inefficient, and therefore, does not scale up to
large amounts of data. Our hypothesis was that employing a sampling strategy
would scale up cluster number approximation without significantly diminishing
performance. We evaluated this hypothesis and developed an efficient scalable
procedure, kluster, to optimize cluster number approximation for unsupervised
clustering. We generated two sets of simulated datasets (first set contains small
datasets and second set contains large datasets) with different cluster compositions
– i.e., different numbers of clusters and separation values – using the clusterGeneration
package in R [19] that provides functions for generating random clusters with specific
degrees of separation (value for separation index between any cluster and its nearest
neighboring cluster) and numbers of clusters. Each set of simulation datasets consists
of 91 datasets in comma separated values (csv) format (total of 182 csv files) with
3-15 clusters and 0.1 to 0.7 separation values. Separation values can range between
(−0.999, 0.999), where a higher separation value indicates cluster structure with
more separable clusters (Figure 1). Both the simulated datasets and results are
provided as supplementary files and on the Harvard Dataverse Network and Mendeley
Data (links will be provided after the peer review).
Figure 1: Seven example simulated datasets with five clusters and different
separation values.
We then tested our proposed procedure on clinical data from Partners HealthCare’s
Research Patient Data Registry (RPDR) [20], as well as four public datasets. In
the following sections, we first describe the results from experimental analyses and
then proceed to the results from application of the kluster procedure to clinical and
public data.
4. Results
4.1. Experimental Results on First Set of Simulation Data
The first set of simulation data contained small datasets, with the number of rows
ranging from 600 to 3,000. We used these datasets to evaluate the performance of the four
cluster number approximation methods as well as their corresponding kluster
implementation.
4.1.1. Processing time for original algorithms
To examine computational intensiveness of running statistical methods for
approximating the optimal number of clusters in data, we stored the processing time
requirement for applying the four methods to the first set of datasets. We used the
results to estimate processing time requirement for running each algorithm on datasets
of up to 100,000 data points, using a third degree regression model (Equation 3).
Equation 3: y = β_0 + β_1·n + β_2·n^2 + β_3·n^3 + ε
where y is the processing time in minutes and n is the number of data points in the
dataset (Figure 2). As the figure shows, the processing time drastically increased
for three of the four examined algorithms as the size of the database increased. The
BIC, AP, and PAM methods respectively required the longest time to approximate the
cluster numbers on the simulated data. According to our estimates, even if enough
memory was available, it would take about 400 minutes for BIC method, and 200 minutes
for the AP and PAM methods to approximate the optimal number of clusters in a 2-
dimensional dataset with 100,000 rows. Nevertheless, the BIC algorithm cannot handle
datasets larger than 50,000 data points. The CAL algorithm was less sensitive to the
size of the data, although it would still take more than 30 minutes to apply the CAL
method to the 100k dataset – we later confirmed this processing time by applying the
CAL method to a dataset with 90,000 data points.
Figure 2: Estimated processing time for running the four cluster number
approximation methods over the size of the database.
4.1.2. Accuracy of cluster number approximation
To evaluate our hypothesis and test our recommended kluster procedure, we conducted
further analyses on the first set of simulated datasets. For convenience, we specified
the kluster procedure parameters as samples of 100 data points (n = 100) and
100 iterations (i = 100). We repeated the entire process 25 more times, which resulted
in a total of 25,000 simulation iterations (25 × 100). For evaluating the performance
of each method and the kluster procedure across the 91 simulation datasets, we first
looked at the distribution of a normalized index for estimation error. We created the
normalized index by taking the difference between the estimated cluster number and
the actual cluster number and dividing it by the actual cluster number (Equation 4).
Equation 4: err_ij = (η_ij − N_j) / N_j
where err_ij is the ratio of the error to the actual number of clusters for algorithm i
on dataset j, η_ij is the estimated number of clusters by algorithm i on dataset j, and
N_j is the actual number of clusters in dataset j. For example, if a method approximates
12 clusters in a 10-cluster dataset, the error ratio (𝑒𝑟𝑟) will be 0.2.
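Equation 4 amounts to a one-line computation; the worked example from the text, as a sketch:

```python
def err_ratio(estimated_k, actual_k):
    """Signed error of the estimate relative to the actual number of clusters."""
    return (estimated_k - actual_k) / actual_k

# the paper's example: 12 clusters estimated in a 10-cluster dataset
print(err_ratio(12, 10))  # → 0.2
```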
Figure 3 demonstrates the density function for the error ratio of the estimated to
the actual number of clusters (err). Vertical lines on the plots show the boundary for
95 percent accuracy. Among the original methods (plots on the left), the CAL and BIC
algorithms had the highest probability of estimating the number of clusters with
utmost accuracy. Distribution of the error ratio results from the kluster procedures
(except for kluster’s mean product on CAL) also peaked at 𝑒𝑟𝑟 = 0.
* dotted vertical lines delineate better-than-95% approximation of the number of
clusters
Figure 3: Density functions for the ratio of error to the actual number of clusters
(err) by method.
To further evaluate the error ratios, we created a heatmap of the frequency of
results with better than 95 percent estimation of the number of clusters across
numbers of clusters and separation values (Figure 4). Results showed that
kluster’s most frequent product on BIC provided the best approximation of the number
of clusters across the datasets with different numbers of clusters and separation
values – methods in Figure 4 are ordered by performance, i.e., frequency of
better-than-95% approximation.
* The heatmap on the left shows the ratio of cluster number approximation accuracies that were better than 95% over datasets with a known cluster number and different separation values – the ratio denominator for each cell is 7. The heatmap on the right shows the ratio of more-than-90% accuracy over datasets with a given separation value and different cluster numbers – the ratio denominator for each cell equals 13.
Figure 4: Goodness of cluster number approximation by method, cluster number,
and separation values.
The heatmap plot across separation values in Figure 4 (plot on the right) shows
that kluster’s most frequent product on BIC also held the best overall
performance in approximating the number of clusters in datasets with different
cluster separation values. Nevertheless, to statistically evaluate the differences
in performances obtained from each algorithm, we performed non-parametric
hypothesis testing.
4.1.3. Non-parametric hypothesis testing
To compare the results obtained from implementing the methods on simulation
datasets, we applied non-parametric and post-hoc tests based on the machine learning
experimental scenarios presented by García et al. (2009 and 2010) [21,22] and Santafe
et al. (2015) [23]. We used the ‘scmamp’ package [24] in R to perform hypothesis
testing. The goal is to evaluate whether the error indices (𝑒𝑟𝑟) we obtained in the evaluation process of each algorithm would provide enough statistical evidence that
the algorithms have different performances.
We first applied the Iman and Davenport omnibus test to analyze all the pair-wise
comparisons in order to detect whether at least one of the algorithms performed
differently than the others. The test resulted in a corrected Friedman's chi-
squared of 39.514 and a p-value < 0.0001, indicating that at least one algorithm
performed differently. We therefore proceeded with the post-hoc analysis of the
results. Second, we applied the Wilcoxon signed-ranks test [25], with Holm p-value
adjustment method [26], for pair-wise comparison of the kluster procedure results,
and the Friedman post-hoc test [27], with Bergmann and Hommel’s correction [28],
for comparing and ranking all algorithms [24,29].
Applying the Wilcoxon signed-ranks test to the normalized error index (err) showed
that kluster’s mean and most frequent products are statistically indistinguishable for
all four algorithms. On BIC and PAM, we found that kluster’s most frequent products
were also statistically indistinguishable from their corresponding original algorithms’
results – at p-value < 0.01 for the BIC method with Holm adjustment. Wilcoxon
signed-ranks test results are provided in the Appendix table.
For ranking the algorithms, we computed an absolute accuracy index (acc) by normalizing
the error ratios (err) to the range 0 to 1 (Equation 5) – performance improves as the
accuracy index approaches 1:
Equation 5: acc_i = 1 − (|err_i| − min(|err|)) / (max(|err|) − min(|err|))
where err = (err_1, …, err_n) and acc_i is the i-th absolute accuracy index.
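Equation 5 is a min-max normalization of the absolute error ratios; a small illustrative sketch (it assumes the errors are not all identical, to avoid division by zero):

```python
def accuracy_index(errs):
    """Map |err| values to [0, 1] via min-max normalization (Equation 5);
    1 corresponds to the smallest absolute error observed."""
    abs_errs = [abs(e) for e in errs]
    lo, hi = min(abs_errs), max(abs_errs)
    return [1 - (a - lo) / (hi - lo) for a in abs_errs]

errs = [0.0, 0.2, -0.5, 1.0]
print(accuracy_index(errs))  # → [1.0, 0.8, 0.5, 0.0]
```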
Before performing the Friedman test, we ran Nemenyi’s test on the accuracy index
acc to perform an initial ranking, identify the Critical Difference (CD), and create
a ranking diagram. CD diagrams effectively summarize the ranking of the algorithms,
the magnitude of the differences between them, and the significance of the observed
differences. Any two algorithms whose average performance rankings differ by more
than the CD are significantly different [29]. Figure 5 presents the CD diagram of the
algorithms. We obtained a Critical Difference of 1.7507 for the average performance
rankings between the algorithms.
* CD: Critical Difference
Figure 5: CD diagram of the average algorithm performance rankings.
On the CD diagram, each algorithm is placed on an axis according to its average
performance ranking. Those algorithms that exhibit insignificant differences in
their average performance ranking are grouped together using a horizontal line.
According to Figure 5, although Calinski (CAL) algorithm has the best average
performance, its performance is not statistically different from kluster’s most
frequent product on BIC and BIC, which respectively had the second and third best
average performances. Although Nemenyi’s test is simple, it is less powerful than
its alternatives, such as the Friedman test, and hence is not often recommended in
practice [24]. We used this test for visualization purposes and to filter the top
algorithms for the Friedman test. The Friedman test’s implementation in the ‘scmamp’
package only takes nine variables at a time, so we used the Nemenyi’s test results to
select the top-performing kluster procedures for the Friedman test.
The Friedman test (a.k.a., Friedman two-way analysis of variances by ranks) is a
widely-used non-parametric testing procedure for comparing and ranking observations
obtained from more than two related samples [22]. We applied the Friedman post-hoc test with
Bergmann and Hommel’s correction to the 𝑎𝑐𝑐 accuracy index obtained for the top six algorithms, plus the original PAM and AP algorithms. The summary of the average
performance ranking of each algorithm over all the dataset is presented in Table
1.
Table 1: Average performance ranking from Friedman post-hoc test

Rank  Algorithm              Average performance ranking
1     BIC kluster frequent   3.835
2     CAL                    3.879
3     BIC                    4.005
4     PAM kluster frequent   4.225
5     BIC kluster mean       4.296
6     PAM kluster mean       4.351
7     PAM                    5.197
8     AP                     6.208
The Friedman post-hoc test showed that kluster’s most frequent product on BIC has
the best average performance, but confirmed the rankings obtained from the CD diagram
for the remainder of the eight selected algorithms. To further evaluate the significance of
the average performance ranking differences, we studied the Bergmann and Hommel’s
corrected p-values of the Friedman test (Table 2).
Table 2: Pair-wise Bergmann and Hommel’s corrected p-values from the Friedman test

                      CAL     BIC     BIC k.f.  BIC k.m.  PAM k.f.  PAM k.m.  PAM
BIC                   1.0000
BIC kluster frequent  1.0000  1.0000
BIC kluster mean      1.0000  1.0000  1.0000
PAM kluster frequent  1.0000  1.0000  1.0000    1.0000
PAM kluster mean      1.0000  1.0000  1.0000    1.0000    1.0000
PAM                   0.0042  0.0113  0.0037    0.1439    0.0859    0.2178
AP                    0.0000  0.0000  0.0000    0.0000    0.0000    0.0000    0.0859

* k.f. = kluster frequent; k.m. = kluster mean. Insignificant p-values (pairwise
differences) are highlighted in bold.
The p-values from the Friedman test also confirm that the average performance
ranking for the top six algorithms are not significantly different from the top-
performing algorithm and each other at p-value <0.05. These results provide support
for the second segment of our hypothesis that applying a sampling strategy does not
diminish the performance of cluster number approximation methods – indeed, we found
that it can improve cluster number approximation.
4.2. Experimental Results on the Second Set of Simulation Data
To evaluate the first segment of our hypothesis that applying a sampling strategy can
scale up the cluster number approximation, we applied the kluster procedure to large
datasets. The second set of simulation data contained large datasets, with the number
of rows ranging from 90,000 to 2,250,000. We used these datasets exclusively to
re-evaluate the performance of the kluster procedure. We performed non-parametric hypothesis
testing to rank kluster performances and evaluate its sensitivity to sample size.
4.2.1. Re-evaluating the kluster procedure
We used the results from applying the kluster procedures to large datasets to
re-evaluate their performance. Following the non-parametric
hypothesis procedure from the previous sub-section, we first performed the Nemenyi’s
test on the accuracy index 𝑎𝑐𝑐 to perform an initial ranking, identify the Critical Difference (CD), and create a ranking diagram. Figure 6 presents the CD diagram of
the kluster procedures. We obtained a Critical Difference of 1.1039 for the average
performance rankings between the kluster procedures.
* CD: Critical Difference
Figure 6: CD diagram of the average kluster procedure performance rankings.
Similar to the results we obtained from the first set of datasets, kluster’s most
frequent products on BIC and PAM, and their corresponding mean products, were
respectively the top four cluster number approximation procedures – none were
significantly different from each other. We then applied the Friedman post-hoc test
with Bergmann and Hommel’s correction to the 𝑎𝑐𝑐 accuracy index obtained for the kluster procedures (Table 3).
Table 3: Average performance ranking of kluster procedures from Friedman post-hoc test

Rank  Algorithm              Average performance ranking
1     BIC kluster frequent   3.230
2     PAM kluster frequent   3.368
3     BIC kluster mean       3.813
4     PAM kluster mean       3.884
5     CAL kluster frequent   4.065
6     CAL kluster mean       5.587
7     AP kluster frequent    5.637
8     AP kluster mean        6.412
The Friedman test verified the performance rankings produced by the Nemenyi’s test.
The kluster’s most frequent product on BIC still holds the best performance,
although the very same product on PAM is not significantly worse.
4.2.2. Processing time for the kluster procedure
As we expected, the kluster procedures were fast. For example, on a 2,250k dataset,
the kluster procedure on BIC took 36.99 seconds with 100 samples, 176.44 seconds
with 500 samples, and 444.6 seconds with 1,000 samples. We evaluated
the processing time for the two kluster procedures with the best accuracy
performance, on BIC and PAM algorithms. Because we are comparing two algorithms
across multiple datasets, we first ran Wilcoxon signed-ranks test [25] with Holm
p-value adjustment or processing time recorded from each procedures across the 91
large datasets. A p-value < 0.0001 suggests that the two procedures are
significantly different from each other. On 66 datasets (out of 91, i.e., 72.527
percent) the kluster procedure on BIC was faster than the kluster procedure on PAM.
The adjusted p-value (<0.0001) suggests a significant difference between processing
time when the kluster procedure on BIC is faster than its equivalence on PAM. On
25 datasets (27.472 percent of the large datasets), the kluster procedure on PAM
was faster than the kluster procedure on BIC, in which the difference was also
significant at p-value < 0.0001. Overall, the kluster procedure on BIC is slightly
faster than the kluster procedure on PAM.
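A minimal sketch of this paired comparison, using synthetic timings rather than the study's measurements (the sizes and distributions below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic paired processing times (seconds) on the same 91 datasets;
# PAM is made slower than BIC on most of them, mimicking our results.
rng = np.random.default_rng(1)
time_bic = rng.uniform(30, 200, size=91)
time_pam = time_bic * rng.uniform(0.8, 1.6, size=91)

# Paired, non-parametric comparison; with a single comparison the Holm
# adjustment leaves the p-value unchanged.
stat, p = wilcoxon(time_bic, time_pam)
share_bic_faster = float(np.mean(time_bic < time_pam))
print(f"W = {stat:.1f}, p = {p:.3g}; BIC faster on {share_bic_faster:.1%} of datasets")
```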
4.2.3. Sensitivity of the kluster procedure to sample size
Results of our experiments on the first set of datasets showed that kluster procedures
on BIC and PAM performed as well as or better than their corresponding methods applied
to the entire dataset. These results were based on 25,000 iterations (100 sampling
iterations × 25 simulation iterations) of samples of 100 data points drawn with
replacement. Sensitivity of the kluster procedure to the size of the samples taken
from the data is important for setting up a generalizable specification of the kluster
procedure. To test the procedure's sensitivity to sample size, we ran kluster
procedures with sample sizes of 100, 200, 300, 400, 500, and 1,000 on the second set
of datasets.
We focused on the kluster procedure on BIC, as our most efficient and recommended
implementation of the kluster procedure. Following the hypothesis testing procedure
used in the previous sub-sections, we began by running Nemenyi's test on the accuracy
index acc. Figure 7 presents the CD diagram of the kluster procedure on BIC across
different sample sizes. We obtained a Critical Difference of 0.793 for the average
performance rankings between the kluster procedures on different sample sizes.
* CD: Critical Difference
Figure 7: CD diagram of average performance rankings of the kluster procedure on
BIC over sample sizes.
As the CD diagram shows, there is essentially no difference in the performance ranking
of the kluster procedure on BIC across sample sizes of 200 to 500. The sample size of
100, which we used in the experimental analyses on the first set of datasets, had the
lowest performance, although the difference was not statistically significant. The
Friedman test also confirmed these results. Based on these results, we recommend
kluster's most frequent product with a sample size between 200 and 500 for
approximating the number of clusters in large datasets.
4.3. Results on Clinical Data
Our results on simulated data suggested that the kluster's most frequent product on
BIC was promising for application to large amounts of data. To test the utility of
kluster's products for unsupervised clustering of clinical data from EHRs, we applied
the kluster procedure (100 to 500 random samples with 100 sampling iterations) to
over 320 million rows of data representing 25 clinical observations extracted from
Partners HealthCare's Research Patient Data Registry (RPDR) [20]. Table 4 presents
the results of kluster's most frequent product on BIC, along with the processing time
and number of rows for each observation. The number of rows per observation ranged
from 1,226 to 34,341,494. Twenty of the 25 observations had more than 1,000,000 rows
of data (24 had more than 100,000 rows), which made running the cluster number
identification algorithms on the entire dataset virtually impossible. Setting
accuracy aside for the moment, we were able to complete the procedure and approximate
the number of clusters in datasets with over 30 million rows of data in less than
two minutes.
Table 4: kluster results (most frequent product on BIC) on RPDR observations data.

Observation                                 kluster*   Processing time (seconds)   Rows of data
Human serum albumin**                       2          42.347                      15,079,716
Calcium                                     2          54.701                      26,975,428
Bicarbonate (HCO3)                          2          20.207                      740,440
Carbon dioxide, total                       2          38.327                      29,547,864
Chloride                                    2          30.91                       29,412,938
HDL cholesterol                             2          38.735                      152,706
LDL cholesterol                             3          23.84                       108,075
Total cholesterol                           2          38.557                      5,822,564
Potassium                                   2          47.937                      30,351,592
Albumin                                     2          102.727                     124,714
Sodium                                      2          22.712                      2,421,092
Hemoglobin                                  2          54.458                      1,257,063
Basophils [#/Volume]                        2          12.07                       5,312,086
Basophils/100 Leukocytes                    3          28.006                      5,327,076
Hemoglobin                                  2          68.058                      34,341,494
Lymphocytes/100 Leukocytes                  4          45.786                      2,100,438
Platelet count                              2          7.121                       1,226
Mean corpuscular hemoglobin concentration   1          65.273                      30,651,862
Mean cell volume (MCV)                      2          39.978                      30,651,826
Red blood cells (RBC)                       1          92.181                      30,652,302
BMI                                         2          41.861                      1,255,943
Diastolic Blood Pressure                    1          26.354                      12,699,441
eGFR                                        15         15.627                      4,374,748
Systolic Blood Pressure                     2          31.576                      12,699,441
Weight                                      3          44.092                      11,909,932

* kluster's most frequent product on BIC. ** Plots for bolded observations are provided in Figure 8.
The processing times are even more remarkable given that the resulting cluster number
approximations also passed the eye test. Unlike the simulated data, we did not have
a preset gold standard number of clusters for EHR observations from which to calculate
acceptable boundaries. Figure 8 illustrates the [mirrored] probability distribution
of eight of the observations from Table 4. Horizontal axes in the Figure 8 plots were
transformed to the square root of observation values for better visualization of the
often-skewed distributions. It appears that clinical observation data often have a
uniform distribution. Relating these distributions to the simulated datasets, clusters
in clinical observation data were often tight – i.e., separation values were small.
We found that kluster's most frequent product on BIC approximated two clusters in the
majority of the observations. These observations often consist of a main body and a
few outliers (e.g., RBC, Human serum albumin, and HCO3). The largest number of
approximated clusters belonged to eGFR, which had a long distribution across the
horizontal axis.
* Horizontal axes represent square-rooted observation values. Vertical axes are
probabilities mirrored around 0 for better visualization.
Figure 8: Density plots of 8 selected observations from RPDR Electronic Health Records
(Human serum albumin, Bicarbonate (HCO3), Hemoglobin, Basophils/100 Leukocytes,
Lymphocytes/100 Leukocytes, Red blood cells (RBC), Weight, and eGFR).
4.4. Results on Public Data
We applied the kluster procedure to four public datasets. Because these datasets are
relatively small (between 150 and 3,168 data samples), we were able to apply the
original methods, as well as their kluster procedures, to each dataset. Given the
small sample sizes, we used the same specification for kluster as on our small-size
simulation data (i.e., sample size = 100, iterations = 100).
The first dataset was Breast Cancer Wisconsin (Diagnostic) [30]. Features of this
dataset are computed from a digitized image of a fine needle aspirate (FNA) of a
breast mass, describing characteristics of the cell nuclei present in the image.
Figure 9 shows a scatter plot of mean area versus mean texture (standard deviation
of gray-scale values) classified by diagnosis (M = malignant, B = benign). Although
the malignant and benign diagnoses are not well-separated, we can distinguish two
clusters in this dataset when it is organized by mean area and texture.
Figure 9. Scatter plot of mean area versus mean texture by diagnosis (M = malignant, B = benign) in Breast Cancer Wisconsin (Diagnostic) dataset.
We applied the kluster procedure with 100 random samples taken from the data in 100
iterations; the small sample size reflects the small dataset. Results are presented
in Table 5: kluster's products on PAM were the fastest and approximated k perfectly.
Kluster's products on BIC were also perfect in approximating k, but slower than
kluster's products on PAM. Among the original methods, applying PAM and BIC to the
entire dataset produced results comparable to their respective kluster's products;
however, obtaining the same results took 63 times longer for the PAM method and more
than 12 times longer for the BIC method.
Table 5. Results of applying the four cluster number approximation methods and the kluster procedure on the Breast Cancer Wisconsin dataset.

Method                                        Approximated k   Processing time (seconds)   ε*
Kluster's mean product on PAM                 2                0.02324                     0
Kluster's most frequent product on PAM        2                0.02324                     0
Kluster's mean product on AP                  8                0.03762                     6
Kluster's most frequent product on AP         8                0.03762                     6
Kluster's mean product on BIC                 2                0.59856                     0
Kluster's most frequent product on BIC        2                0.59856                     0
Kluster's mean product on Calinski            15               0.90375                     13
Kluster's most frequent product on Calinski   15               0.90375                     13
PAM algorithm                                 2                1.466                       0
AP algorithm                                  17               1.793                       15
Calinski algorithm                            15               2.024                       13
BIC algorithm                                 2                7.566                       0

* The actual number of clusters in the Breast Cancer Wisconsin dataset is 2 and n = 569. Kluster's products are based on 100 iterations and samples of 100 data points. Data are sorted by processing time. Method(s) with the best results are bolded.
The second public dataset was the famed Iris Species dataset [31], which contains
150 data samples (50 for each of the three species) and their properties. Figure 10
presents a scatter plot of the data based on petal length (cm) and petal width (cm).
The three Iris species, Iris Setosa, Iris Versicolour, and Iris Virginica, are
distinguishable in the plot.
Figure 10: Scatter plot of petal length (cm) versus petal width (cm) by Iris
species in the Iris Species dataset
The kluster procedure with the same setting as before (100 random samples taken from
the data in 100 iterations) was applied to the Iris dataset. Table 6 shows the results
of this analysis. For this dataset, the BIC method and its kluster's products all
produced a perfect approximation. The PAM method and its kluster's products all had
similar results (two clusters), which were the next best cluster approximations for
this dataset. Overall, every method except Calinski approximated k as well as or
better when the kluster procedure was applied. As expected, kluster's products had
significantly shorter processing times than their original methods, except for the
Calinski method, which is unsurprising given the limited total sample size (150).
Table 6: Results of applying the four cluster number approximation methods and the kluster procedure on the Iris Species dataset.

Method                                        Approximated k   Processing time (seconds)   ε*
Kluster's mean product on PAM                 2                0.0231                      -1
Kluster's most frequent product on PAM        2                0.0231                      -1
Kluster's mean product on AP                  4                0.0298                      1
Kluster's most frequent product on AP         4                0.0298                      1
AP algorithm                                  5                0.04                        2
PAM algorithm                                 2                0.06                        -1
Kluster's mean product on BIC                 3                0.4544                      0
Kluster's most frequent product on BIC        3                0.4544                      0
Calinski algorithm                            10               0.5                         7
Kluster's mean product on Calinski            15               0.7439                      12
Kluster's most frequent product on Calinski   15               0.7439                      12
BIC algorithm                                 3                0.81                        0

* The actual number of clusters in the Iris Species dataset is 3 and n = 150. Kluster's products are based on 100 iterations and samples of 100 data points. Data are sorted by processing time. Method(s) with the best results are bolded.
The third public dataset we used was the Voice Gender dataset [32]. This dataset
consists of 3,168 data samples created to identify the gender of a voice as either
male or female according to acoustic properties of the voice and speech. Figure 11
shows a scatter plot of meanfun (average fundamental frequency measured across
acoustic signals) versus modindx (modulation index, calculated as the accumulated
absolute difference between adjacent measurements of fundamental frequencies divided
by the frequency range). The two clusters of males and females are easily
recognizable in this figure.
Figure 11: Scatter plot of meanfun versus modindx by gender in the Voice Gender
dataset
Similar to the previous public datasets, we applied the kluster procedure using 100
random samples taken from the data in 100 iterations; the results are presented in
Table 7. We found that the PAM algorithm, as well as its two kluster's products,
predicted the number of clusters perfectly. The BIC method had a very poor
approximation, with nine clusters instead of two; however, its kluster's most
frequent product had a perfect approximation. Calinski was the only method with a
better approximation of k than its kluster's products. In terms of processing time,
the AP, BIC, Calinski, and PAM algorithms took respectively 2,880, 102, 30, and
6,538 times longer than their kluster's products.
Table 7. Results of applying the four cluster number approximation methods and the kluster procedure on the Voice Gender dataset.

Method                                        Approximated k   Processing time (seconds)   ε*
Kluster's mean product on PAM                 2                0.0343                      0
Kluster's most frequent product on PAM        2                0.0343                      0
Kluster's mean product on AP                  11               0.0346                      9
Kluster's most frequent product on AP         11               0.0346                      9
Kluster's mean product on BIC                 3                0.7218                      1
Kluster's most frequent product on BIC        2                0.7218                      0
Kluster's mean product on Calinski            14               1.1214                      12
Kluster's most frequent product on Calinski   15               1.1214                      13
Calinski algorithm                            4                33.77                       2
BIC algorithm                                 9                73.34                       7
AP algorithm                                  66               99.64                       64
PAM algorithm                                 2                224.25                      0

* The actual number of clusters in the Voice Gender dataset is 2 and n = 3,168. Kluster's products are based on 100 iterations and samples of 100 data points. Data are sorted by processing time. Method(s) with the best results are bolded.
The last dataset is the Pima Indians Diabetes Database [33]. Obtained from the
National Institute of Diabetes and Digestive and Kidney Diseases, the dataset has
768 samples, all from female patients who are at least 21 years old and of Pima
Indian heritage. The scatter plot of Glucose (plasma glucose concentration) versus
BMI (body mass index) is presented in Figure 12. Similar to the Breast Cancer
Wisconsin dataset, this dataset does not show a clear separation between the two
outcome clusters in a 2-D plot, but the two classes are somewhat recognizable.
Figure 12: Scatter plot of glucose versus BMI by outcome in the Pima Indians
Diabetes Database
On the Pima Indians Diabetes Database, we found that the PAM method, its kluster's
products, and BIC's kluster's products approximated k perfectly (Table 8). In
addition, the kluster procedure improved the clustering accuracy of the BIC and AP
methods. Calinski was the only method with a better approximation of k than its
kluster's products. The AP, BIC, Calinski, and PAM algorithms took respectively 92,
10, 4, and 120 times longer than their kluster's products.
Table 8. Results of applying the four cluster number approximation methods and the kluster procedure on the Pima Indians Diabetes Database.

Method                                        Approximated k   Processing time (seconds)   ε*
Kluster's mean product on PAM                 2                0.0355                      0
Kluster's most frequent product on PAM        2                0.0355                      0
Kluster's mean product on AP                  8                0.0397                      6
Kluster's most frequent product on AP         8                0.0397                      6
Kluster's mean product on BIC                 2                0.9902                      0
Kluster's most frequent product on BIC        2                0.9902                      0
Kluster's mean product on Calinski            13               1.0267                      11
Kluster's most frequent product on Calinski   15               1.0267                      13
AP algorithm                                  24               3.64                        22
Calinski algorithm                            6                3.87                        4
PAM algorithm                                 2                4.27                        0
BIC algorithm                                 3                10.25                       1

* The actual number of clusters in the Pima Indians Diabetes Database is 2 and n = 768. Kluster's products are based on 100 iterations and samples of 100 data points. Data are sorted by processing time. Method(s) with the best results are bolded.
5. Discussion
Unsupervised learning has been applied to a variety of dimensionality reduction or
clustering tasks in Clinical Informatics. The high throughput of unlabeled data from
multi-site clinical repositories offers new opportunities to apply unsupervised
clustering for characterizing patients into groups with similar phenotypic
characteristics. Applying some of the most popular unsupervised clustering algorithms
(e.g., k-means and its many derivatives) to clinical data is dependent on
initializations, most importantly setting up the number of clusters, k. There are
multiple statistical solutions for approximating the number of clusters in a dataset.
We argued that these methods are computationally inefficient when dealing with large
amounts of clinical data, due to the high likelihood of over-fitting (which results
in over- or under-estimation of the number of clusters) and extensive computing
requirements. We hypothesized and showed that applying a sampling strategy can scale
up the cluster number approximation while improving accuracy. Based on our hypothesis,
we developed a procedure, kluster, which iteratively applies statistical cluster
number approximation methods to samples of data.
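The core idea of the procedure can be sketched as follows. This is not the kluster R implementation: the function name, parameters, and the use of scikit-learn's Gaussian mixtures with BIC (standing in for mclust's model-based clustering) are illustrative assumptions.

```python
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture

def kluster_bic_sketch(data, sample_size=300, iterations=100, k_max=10, seed=0):
    """Approximate k on many small random samples (drawn with replacement)
    and report the most frequent and mean estimates across iterations."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(iterations):
        sample = data[rng.choice(len(data), size=sample_size, replace=True)]
        # Choose the k in 1..k_max that minimizes BIC on this sample.
        bics = [GaussianMixture(n_components=k, random_state=0)
                .fit(sample).bic(sample) for k in range(1, k_max + 1)]
        estimates.append(int(np.argmin(bics)) + 1)
    most_frequent = Counter(estimates).most_common(1)[0][0]
    return most_frequent, float(np.mean(estimates))

# Toy data: three well-separated 2-D clusters.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(500, 2))
                  for c in (0.0, 5.0, 10.0)])
k_freq, k_mean = kluster_bic_sketch(data, sample_size=200, iterations=20, k_max=6)
print("most frequent k:", k_freq, "| mean k:", round(k_mean, 2))
```

Because each iteration clusters only a few hundred points, the cost grows with the sample size and the iteration count rather than with the size of the full dataset, which is what makes the procedure scalable.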
Bootstrap methods have been applied to various clustering problems, including
approximation of the number of clusters. For example, bootstrapping has been used
for estimating clustering instability and then selecting the number of clusters
that maximizes clustering stability [34–37]. In the case of Big Data, however, this
approach would still require clustering to happen on large amounts of data. In
addition, bootstrapping has been applied to defining the optimal number of clusters
using statistical criteria such as Hubert's gamma statistic [38]. However, in dealing
with unlabeled clinical observation data, extracting a representative training
dataset (e.g., with less than 30 percent of the entire dataset) may still result in
a large dataset. Iterative sampling can provide a scalable solution to this problem.
We tested the kluster procedure on four cluster number approximation methods with
simulated data, as well as on clinical observation data. Our results showed that
the kluster's products were as good as or better than applying their corresponding
cluster number approximation methods to the entire dataset. Taking processing time
and accuracy into account, we found that the kluster's most frequent product on the
BIC method (the most frequent number of clusters approximated through kluster's
iterations applying the BIC method to samples of data) performed better than any of
the other methods on almost any cluster structure. Testing the kluster procedure on
clinical observation data also verified the reliability of the kluster's most
frequent product on BIC. We also evaluated the sensitivity of the kluster procedure
to sample size. Based on these analyses, we recommend the kluster's most frequent
product on BIC, with sample sizes between 200 and 500 and 100 iterations, as an
efficient procedure for unsupervised clustering of large-scale clinical datasets.
Currently, we have embedded the kluster procedure functions into an unsupervised
learning pipeline for large clinical observation datasets from the Research Patient
Data Registry (RPDR) that is being utilized by a multi-site clinical data research
network, Accessible Research Commons for Health (ARCH). Using the kluster procedure
has significantly reduced the computational requirements for developing and applying
a variety of unsupervised clustering algorithms that, given the computational
resources available in clinical settings, would not otherwise have been possible.
6. Conclusion
Due to computational limitations, scalable analytical procedures are needed to
extract knowledge from large clinical datasets. Many of the popular unsupervised
clustering algorithms are dependent on pre-identification of the number of clusters,
k. Over the past few decades, the statistics literature has presented different
solutions that apply different quantitative indices to this problem. In the context
of emerging large scale clinical data networks, however, available statistical
methods are computationally inefficient. In this paper we present a simple efficient
procedure, kluster, for identifying the number of clusters in unsupervised
learning. Using two sets of simulation datasets, along with clinical and public
datasets, we showed that kluster's most frequent product using the BIC method on
random samples of 200-500 data points, with 100 iterations, provides a reliable and
scalable solution for approximating the number of clusters in large clinical
datasets. Together, the sampling strategy (i.e., the number of data points to sample)
and the simulation iterations applied in the experimental analyses provided
sufficient information to test the principal hypothesis of this study. Although we
found that the choice of sample size between 100 and 1,000 data points may not play
a significant role, further work is required to establish (or test the existence of)
best practices for the number of simulation iterations and the sample size.
Although kluster results are promising, the generalizability of its results may
require further evaluation due to two limitations. First, we have applied only four
of the available cluster number identification algorithms in kluster (BIC, PAM, AP,
CAL). Implementations of the four algorithms are available in different R packages,
as cited in this paper. We plan to incorporate more algorithms into the kluster R
package in the near future. Nevertheless, with the current four algorithms we were
able to find a scalable solution. Second, the simulation data used for our
assessment of kluster was 2-dimensional and the clinical observation data was
1-dimensional. Further evidence may be needed to verify the effectiveness of the
kluster procedure on datasets with higher dimensions.
In addition to the main kluster functions, we have also developed functions to
compare accuracy and processing time for the kluster procedure and the four cluster
number approximation methods. We provide all of these functions as an R package,
named kluster, through GitHub: https://github.com/hestiri/kluster. We have also
made all the code and results of the simulations conducted in this study publicly
available on GitHub: https://github.com/hestiri/klusterX. Simulation data generated
in this study is also available on the Harvard Dataverse Network and Mendeley Data
(links will be provided after the peer review).
Acknowledgements
This work was supported in part by the Patient-Centered Outcomes Research Institute
(PCORI) Award (CDRN-1306-04608) for development of the National Patient-Centered
Clinical Research Network, known as PCORnet, NIH R01-HG009174, and the NLM training
grant T15LM007092. The authors are very grateful to the anonymous reviewers for
their valuable suggestions and comments to improve the quality of this paper.
References
[1] Z. Ghahramani, Unsupervised Learning, in: O. Bousquet, U. von Luxburg, G. Rätsch
    (Eds.), Advanced Lectures on Machine Learning, ML Summer Schools 2003, Canberra,
    Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised
    Lectures, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004: pp. 72–112.
    doi:10.1007/978-3-540-28650-9_5.
[2] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2009. doi:10.1007/b94608.
[3] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31
(2010) 651–666. doi:10.1016/j.patrec.2009.09.011.
[4] C.A. Sugar, G.M. James, Finding the Number of Clusters in a Dataset, J. Am.
Stat. Assoc. 98 (2003) 750–763. doi:10.1198/016214503000000666.
[5] G. Hamerly, C. Elkan, Learning the k in k-means, Adv. Neural Inf. Process.
    Syst. 17 (2004) 1–8.
[6] C. Fraley, A.E. Raftery, Model-Based Clustering, Discriminant Analysis, and
Density Estimation, J. Am. Stat. Assoc. 97 (2002) 611–631.
doi:10.1198/016214502760047131.
[7] T. Caliński, J.A. Harabasz, A dendrite method for cluster analysis, Commun.
Stat. 3 (1974) 1–27. doi:10.1080/03610927408827101.
[8] L. Kaufman, P.J. Rousseeuw, Clustering by means of medoids, in: Stat. Data Anal.
    Based on the L1-Norm and Related Methods, First Int. Conf., 1987: pp. 405–416.
[9] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster
    Analysis (Wiley Series in Probability and Statistics), 1990.
[10] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a
     data set via the gap statistic, J. R. Stat. Soc. Ser. B (Statistical Methodol.)
     63 (2001) 411–423. doi:10.1111/1467-9868.00293.
[11] C. Fraley, A.E. Raftery, How Many Clusters? Which Clustering Method? Answers Via
Model-Based Cluster Analysis, Comput. J. 41 (1998) 578–588.
doi:10.1093/comjnl/41.8.578.
[12] B.J. Frey, D. Dueck, Clustering by passing messages between data points.,
Science. 315 (2007) 972–976. doi:10.1126/science.1136800.
[13] T. Pinto, G. Santos, L. Marques, T.M. Sousa, I. Praça, Z. Vale, S.L. Abreu,
     Solar intensity characterization using data-mining to support solar forecasting,
     in: Adv. Intell. Syst. Comput., 2015: pp. 193–201.
     doi:10.1007/978-3-319-19638-1_22.
[14] P.J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation
of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65. doi:10.1016/0377-
0427(87)90125-7.
[15] L. Scrucca, M. Fop, T.B. Murphy, A.E. Raftery, mclust 5: Clustering,
     Classification and Density Estimation Using Gaussian Finite Mixture Models,
     R J. 8 (2016) 289–317.
[16] J. Oksanen, F.G. Blanchet, R. Kindt, P. Legendre, P.R. Minchin, R.B. O'Hara,
     G.L. Simpson, P. Solymos, M.H.H. Stevens, H. Wagner, vegan: Community Ecology
     Package, R package version 2.4-1, 2017. https://cran.r-project.org/package=vegan.
[17] C. Hennig, fpc: Flexible Procedures for Clustering, 2018.
     https://cran.r-project.org/package=fpc.
[18] U. Bodenhofer, A. Kothmeier, S. Hochreiter, Apcluster: An R package for affinity
propagation clustering, Bioinformatics. 27 (2011) 2463–2464.
doi:10.1093/bioinformatics/btr406.
[19] W. Qiu, H. Joe, clusterGeneration: Random Cluster Generation (with Specified
     Degree of Separation), 2015. https://cran.r-project.org/package=clusterGeneration.
[20] R. Nalichowski, D. Keogh, H.C. Chueh, S.N. Murphy, Calculating the benefits of
a Research Patient Data Repository., AMIA Annu. Symp. Proc. (2006) 1044.
[21] S. García, D. Molina, M. Lozano, F. Herrera, A study on the use of non-parametric
tests for analyzing the evolutionary algorithms’ behaviour: A case study on the
CEC’2005 Special Session on Real Parameter Optimization, J. Heuristics. 15 (2009)
617–644. doi:10.1007/s10732-008-9080-4.
[22] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for
multiple comparisons in the design of experiments in computational intelligence
and data mining: Experimental analysis of power, Inf. Sci. (Ny). 180 (2010)
2044–2064. doi:https://doi.org/10.1016/j.ins.2009.12.010.
[23] G. Santafe, I. Inza, J.A. Lozano, Dealing with the evaluation of supervised
classification algorithms, Artif. Intell. Rev. 44 (2015) 467–508.
doi:10.1007/s10462-015-9433-y.
[24] B. Calvo, G. Santafé, scmamp: Statistical Comparison of Multiple Algorithms in
Multiple Problems, R J. XX (2015) 8.
[25] F. Wilcoxon, Individual Comparisons by Ranking Methods, Biometrics Bull. 1 (1945)
80. doi:10.2307/3001968.
[26] S. Holm, A simple sequential rejective multiple test procedure, Scand. J. Stat.
6 (1979) 65–70. doi:10.2307/4615733.
[27] M. Friedman, The Use of Ranks to Avoid the Assumption of Normality Implicit in
the Analysis of Variance, J. Am. Stat. Assoc. 32 (1937) 675–701.
doi:10.1080/01621459.1937.10503522.
[28] B. Bergmann, G. Hommel, Improvements of General Multiple Test Procedures for
Redundant Systems of Hypotheses, in: P. Bauer, G. Hommel, E. Sonnemann (Eds.),
Mult. Hypothesenprüfung / Mult. Hypotheses Test., Springer Berlin Heidelberg,
Berlin, Heidelberg, 1988: pp. 100–115.
[29] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets,
     J. Mach. Learn. Res. 7 (2006) 1–30.
[30] W.H. Wolberg, W.N. Street, O.L. Mangasarian, Breast Cancer Wisconsin
(Diagnostic) Data Set, UCI Mach. Learn. Repos. (1992).
[31] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen.
7 (1936) 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x.
[32] K. Becker, Identifying the Gender of a Voice using Machine Learning, Primary
     Objects, 2016. http://www.primaryobjects.com/2016/06/22/identifying-the-gender-of-a-voice-using-machine-learning/
     (accessed August 20, 2017).
[33] J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, R.S. Johannes, Using the
ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus, Proc. Annu.
Symp. Comput. Appl. Med. Care. (1988) 261–265.
[34] Y.X. Fang, J.H. Wang, Selection of the number of clusters via the bootstrap
method, Comput. Stat. Data Anal. 56 (2012) 468–477.
doi:10.1016/j.csda.2011.09.003.
[35] A.K. Jain, J. V. Moreau, Bootstrap technique in cluster analysis, Pattern
Recognit. 20 (1987) 547–568. doi:10.1016/0031-3203(87)90081-1.
[36] C. Garcia, BoCluSt: Bootstrap clustering stability algorithm for community
detection, PLoS One. 11 (2016). doi:10.1371/journal.pone.0156576.
[37] M.K. Kerr, G.A. Churchill, Bootstrapping cluster analysis: Assessing the
reliability of conclusions from microarray experiments, Proc. Natl. Acad. Sci.
98 (2001) 8961–8965. http://www.pnas.org/cgi/content/abstract/98/16/8961.
[38] M.A. Newell, D. Cook, H. Hofmann, J.L. Jannink, An algorithm for deciding the
number of clusters and validation using simulated data with application to
exploring crop population structure, Ann. Appl. Stat. 7 (2013) 1898–1916.
doi:10.1214/13-AOAS671.