The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
EXPLOITING SPARSITY, STRUCTURE, AND GEOMETRY FOR
KNOWLEDGE DISCOVERY
A Dissertation in
Computer Science and Engineering
by
Anirban Chatterjee
© 2011 Anirban Chatterjee
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
December 2011
The dissertation of Anirban Chatterjee was reviewed and approved* by the following:
Padma Raghavan
Professor of Computer Science and Engineering
Dissertation Adviser
Chair of Committee

Mahmut Taylan Kandemir
Professor of Computer Science and Engineering

Suzanne M. Shontz
Assistant Professor of Computer Science and Engineering

Kateryna Makova
Associate Professor of Biology

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering
*Signatures are on file in the Graduate School.
Abstract
Data-driven discovery seeks to obtain a computational model of the underlying process
using observed data on a large number of variables. Observations can be viewed as points in
a high-dimensional space with coordinates given by values of the variables. It is common for
observations to have nonzero values in only a few dimensions, i.e., the data are sparse. We
seek to exploit the sparsity of the data by interpreting the observations as a sparse graph and
manipulating its geometry (or embedding) in a high-dimensional space. Our goal is to obtain al-
gorithms that demonstrate high accuracy for key problems in data analysis and parallel scientific
computing.
The first part of this dissertation focuses on combining geometry and sparse graph struc-
ture to yield more accurate classification algorithms for data mining. We have developed a
feature subspace transformation (FST) scheme that transforms the data iteratively and provides
a better selection of features to improve unsupervised classification, also known as clustering.
FST utilizes the combinatorial structure of the entity relationship graph and high-dimensional
geometry of the sparse data to iteratively bring related entities closer, which enhances cluster-
ing. Our approach improves clustering quality relative to established schemes such as K-Means
and multilevel K-Means (GraClus). Next we consider transformations to enhance supervised
classification with prelabeled data. We obtain similarity graph neighborhoods (SGN) in the
high-dimensional feature subspace of the training data and transform it by determining displace-
ments for each entity. Our SGN classifier is a supervised learning scheme that is trained on these
transformed data. The goal of our SGN transform is to increase the separation between means
of different classes, such that the classifier learns a better boundary. Our results indicate that
a linear discriminant and support vector machine classification on these SGN transformed data
improves accuracy by 5.0% and 4.52%, respectively.
The second part of this dissertation focuses on utilizing geometry and the structure of
sparse graphs to enhance the quality and performance of algorithms for parallel scientific com-
puting. We develop a parallel scheme, ScalaPart, for partitioning a large sparse graph into k
subgraphs such that the number of cross edges is reduced. Our scheme combines a parallel
graph embedding with a parallel geometric partitioning scheme to yield a scalable approach suit-
able for multicore-multiprocessors. Our analysis of ScalaPart demonstrates its scalability, and
our empirical evaluation indicates that the performance of ScalaPart and the quality of its cuts
compare well with established schemes such as ParMetis and PT-Scotch. Next, we consider scalable
sparse linear system solution which plays a key role in many applications including data mining
with support vector machines and partial differential equation-based modeling and simulation.
Our graph partitioning algorithm can be used to yield a nested dissection fill-reducing ordering
and a tree for structuring numeric computations for the solution of sparse linear systems. We
develop a hybrid linear solver that couples tree-structured direct and iterative solves for repeated
right-hand side solutions. Our results indicate that our hybrid solver is 1.87 times faster than
preconditioned conjugate gradients (PCG) based methods at the same levels of accuracy.
In conclusion, this dissertation demonstrates that by combining geometric and combina-
torial properties of sparse graphs and matrices, we can enhance algorithms that are commonly
used in knowledge discovery through modeling and simulation.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
I Combining High-dimensional Geometry with Sparsity for Improving Ac-
curacy of Data Mining 6
Chapter 2. Combining Geometry and Combinatorics for Enhanced Unsupervised and Su-
pervised Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Background on Unsupervised Classification using K-Means . . . . . . . . 8
2.3 Background on Supervised Classification . . . . . . . . . . . . . . . . . . 9
2.3.1 Linear Discriminant Classifier (LD) . . . . . . . . . . . . . . . . . 9
2.3.2 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . 10
2.4 Metrics of Classification Quality . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 3. Feature Subspace Transformation for Enhancing K-Means Clustering . . . 14
3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 FST-K-Means: Feature Subspace Transformations for Enhanced Classifi-
cation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Toward optimal classification . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 4. Similarity Graph Neighborhoods for Enhanced Supervised Classification . 41
4.1 Exploiting Similarity Graph Neighborhoods for Enhancing SVM Accuracy 42
4.1.1 Determining γ-Neighborhoods in Similarity Graph G(B,A) . . . . 43
4.1.2 Transforming Training Data through Entity Displacement Vectors. . 45
4.1.3 Training an LDA or SVM on Transformed Data. . . . . . . . . . . 48
4.1.3.1 An Example of SGN Transformation using the Radial
Basis Function . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.4 Classifying Test Data . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Experimental Setup and Metrics for Evaluation. . . . . . . . . . . . 50
4.2.2 Artificial Dataset Results . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.3 Empirical Results on Benchmark Datasets. . . . . . . . . . . . . . 52
4.2.4 Why does classification improve with our SGN transformation? . . 53
4.2.5 Sensitivity of SGN-SVM Classification Accuracy . . . . . . . . . . 55
4.2.5.1 Effect of Growing or Shrinking the Neighborhood on Clas-
sification Accuracy . . . . . . . . . . . . . . . . . . . . 55
4.2.5.2 Effect of Changing Sparsity of G(B,A) on Classification
Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
II Scalable Geometric Embedding and Partitioning for Parallel Scientific Com-
puting 67
Chapter 5. Background on Sparse Graph Partitioning and Sparse Linear System Solution 68
5.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Graph Embedding and Graph Partitioning . . . . . . . . . . . . . . . . . . 69
5.2.1 Graph Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.2 Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.3 The Charm++ Parallel Framework . . . . . . . . . . . . . . . . . . 70
5.3 Sparse Linear Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.1 Sparse Direct Solvers . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.2 Preconditioned Conjugate Gradient . . . . . . . . . . . . . . . . . 71
5.3.3 Incomplete Cholesky Preconditioning . . . . . . . . . . . . . . . . 72
5.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 6. Parallel Geometric Partitioning through Sparse Graph Embedding . . . . . 75
6.1 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 ScalaPart: A Parallel Graph Embedding enabled Scalable Geometric Parti-
tioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.1 Structure of our Charm++ Parallel Graph Embedding . . . . . . . . 79
6.2.2 Implementation of a Data Parallel Geometric Partitioning using Charm++ 83
6.2.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.1 Experimental Setup and Evaluation Metrics . . . . . . . . . . . . . 86
6.3.2 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.3 Discussion on observed quality and performance . . . . . . . . . . 88
6.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 7. A Multilevel Cholesky Conjugate Gradients Hybrid Solver for Linear Systems
with Multiple Right-hand Sides . . . . . . . . . . . . . . . . . . . . . . 99
7.1 A New Multilevel Sparse Cholesky-PCG Hybrid Solver . . . . . . . . . . . 100
7.1.1 A One-level Hybrid . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.1.1 Obtaining a Tree-structured Aggregate of the Coefficient
Matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.1.2 Constructing a Hybrid Solution Scheme using the Tree-
structure . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.1.2 A Multilevel Tree-based Hybrid Solver . . . . . . . . . . . . . . . 106
7.1.3 Computational Costs of our Hybrid Solver . . . . . . . . . . . . . 107
7.2 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.2 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . 109
7.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Chapter 8. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
List of Tables
3.1 Test suite of datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Accuracy of classification of K-Means, GraClus and FST-K-Means. . . . . . . 30
3.3 Accuracy of classification for PCA (top three principal components) and FST-
K-Means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Improvement or degradation (negative values) percentage of accuracy of FST-
K-Means relative to K-Means and GraClus. . . . . . . . . . . . . . . . . . . . 31
3.5 Cluster cohesiveness of K-Means, GraClus, PCA (top three principal compo-
nents), and FST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Improvement (positive values) or degradation (negative values) percentage of
cohesiveness of FST-K-Means relative to K-Means and GraClus. . . . . . . . . 33
3.7 The lower bound, range and upper bound of cohesiveness across 100 runs of
FST-K-Means (top half) and K-Means in the original feature space. Observe
that FST-K-Means consistently satisfies the optimality bounds while K-Means
fails to do so for most datasets. The highlighted numbers in the lower table
indicate datasets for which the minimum value of cohesiveness exceeds the
upper bound on optimality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Details of the fourclass dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Classification accuracy (percentage) of LDA, SGN-LDA, SVM, and SGN-SVM
for the fourclass dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Description of UCI datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Classification accuracy (as a percentage) and F1-Score for LDA and SGN-LDA
on benchmark datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Classification accuracy (as a percentage) and F1-Score for SVM and SGN-
SVM on benchmark datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1 Details of benchmark graphs indicating the number of nodes (|V |) and edges
(|E|) in the graph with a brief description of the application domain. . . . . . . 92
7.1 Benchmark matrices from the University of Florida Sparse Matrix Collection [1]. 109
7.2 Operation counts (in millions) for 10 repeated right-hand side vectors and PCG
with IC Level-of-fill (IC0) and drop-threshold (ICT) preconditioner using (a)
natural ordering (NAT), (b) RCM ordering, and (c) nested dissection (ND) or-
dering, and (d) minimum degree (MMD) ordering. Hybrid-DI represents our
tree-based hybrid solver framework. Values in bold represent the best perfor-
mance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.3 Relative error for 10 repeated right-hand side vectors and PCG with IC Level-
of-fill (IC0) and drop-threshold (ICT) preconditioner using (a) natural ordering
(NAT), (b) RCM ordering, and (c) nested dissection (ND) ordering, and (d)
minimum degree (MMD) ordering. Hybrid-DI represents our tree-based hybrid
solver framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
List of Figures
1.1 Raw data (for example, blogs, newsgroup posts, gene sequences, census data)
are collected and subsequently processed (for example, using tokenizers) to
obtain a dataset, where rows represent samples and columns indicate variables
(or features). A cross symbol (×) indicates a nonzero feature in the observation. 3
1.2 (a) A matrix indicating relationships among different observations. (b) An em-
bedding of the relationship graph in two dimensions providing a geometry to
the graph structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Illustration of the three main steps of FST-K-Means.
(1) A is a sparse matrix representing a dataset with 6 entities and 3 features.
B ≈ AAᵀ is the adjacency matrix of the weighted graph G(B, A) with 6 ver-
tices and 7 edges.
(2a) FST is applied on G(B, A) to transform the coordinates of the vertices.
Observe that the final embedded graph G(B, Â) has the same sparsity structure
as G(B, A).
(2b) The sparse matrix Â represents the dataset with the transformed feature
space.
(3) K-Means is applied to the dataset Â to produce high-quality clustering. . . . 18
3.2 Plots of classification accuracy and the 1-norm of feature variance vector across
FST iterations. FST iterations are continued until feature variance decreases
relative to the previous iteration. . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Layout of entities in splice in the original (top) and transformed (bottom) fea-
ture space, projected onto the first three principal components. Observe that
two clusters are more distinct after FST. . . . . . . . . . . . . . . . . . . . . . 38
3.4 Layout of entities in two dimensional synthetic dataset in the original (top) and
transformed (bottom) feature space. . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Sensitivity of classification accuracy (P) of K-Means to number of principal
components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Forming the sparse γ-neighborhood similarity graph G(B,A) from A. F (A)
represents the transformation described in Section 4.1.1. G(B,A) is a weighted
graph representation of matrix B. . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Entities a_p, a_q, a_u, a_j, and a_k represent the immediate neighbors of a_i. ∆_ij
denotes the difference between two entity vectors a_i and a_j. . . . . . . . . . . 59
4.3 (a) Transforming the training data A using G(B, A) to a new graph G(B, Â).
(b) We obtain new entity coordinates Â from the graph G(B, Â). . . . . . . . . 60
4.4 Training an SVM on the transformed data matrix A to obtain separating hyper-
planes (shown in white). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Percentage improvements in accuracy and F1-score with SGN-LDA over LDA. 61
4.6 Percentage improvements in accuracy and F1-score with SGN-SVM over SVM. 62
4.7 Illustration of the fourclass dataset after performing our SGN transformation.
Observe that the two classes separate out while elements maintain their relative
position within their class. A better separating boundary is obtained on this
transformed data using LDA (as shown above). . . . . . . . . . . . . . . . . . 63
4.8 Support vectors obtained (a) using SVM, and (b) using SGN-SVM training for
the fourclass dataset in the training data indicating two classes with support
vectors shown in black. Illustration of the (c) SVM separating plane, and (d)
the SGN-SVM separating plane on testing data. . . . . . . . . . . . . . . . . . 64
4.9 Illustration of two four-dimensional Gaussian processes (a) before transforma-
tion and (b) after SGN transformation that have been projected to two PCA
dimensions for the purpose of visualization. In both cases, the boundary is
obtained using LDA. Alongside are the correlation matrices before and after
the transformation. The dashed boxes indicate the feature pairs that showed a
significant change in the correlation value after the SGN transformation. . . . . 65
4.10 (a) Effect of increasing the entity neighborhood parameter γ in G(B,A) (con-
sidering larger neighborhoods) on classification accuracy. When the sparse
neighborhood is a good approximation indicating similarity, adding elements
to the neighborhood does not alter classification accuracy. Effect of varying
number of common features q to create the similarity graph B on (b) sparsity of
B relative to A and (c) impact on overall classification accuracy. . . . . . . . . 66
6.1 Average layout time per iteration for benchmark graphs using 8, 16 and 32
cores. We include results for 8 cores to demonstrate the effect of doubling the
number of cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Normalized edge cuts for 16 partitions with ParMetis as base set to 1. . . . . . 94
6.3 Normalized edge cuts for 32 partitions with ParMetis as base set to 1. . . . . . 94
6.4 Normalized time for (a) 16 partitions and (b) 32 partitions on 1 core with
ParMetis as base set to 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Normalized time for (a) 16 partitions on 16 cores and (b) 32 partitions on 32
cores with ParMetis as base set to 1. . . . . . . . . . . . . . . . . . . . . . . . 96
6.6 Normalized time for 32 partitions on 32 cores with ParMetis as base set to 1.
The above normalized times represent the total time, including time for a coars-
ening phase, embedding of the coarsened graph, partitioning of the coarsened
graph, and subsequent refinement of the cut. . . . . . . . . . . . . . . . . . . . 97
6.7 A two partition illustration of a graph from a heat exchanger flow problem using
(a) ParMetis, (b) PT-Scotch, and (c) our framework. This example demonstrates
that a geometric partitioning scheme clearly achieves a competitive edgecut
with a good underlying embedding of the graph. . . . . . . . . . . . . . . . . . 98
7.1 (a) Matrix bcsstk11 with natural ordering; (b) a one-level nested dissection or-
dering of bcsstk11; (c) a supernodal tree representation of the one-level ordering. 113
7.2 (a) Matrix bcsstk11 with natural ordering; (b) a two-level nested dissection or-
dering of bcsstk11; (c) a supernodal tree representation of the two-level ordering. 114
7.3 Speedup obtained by our method over (a) best PCG and ordering combination,
and (b) average PCG performance across different variants for 10 right-hand
sides. A speedup greater than 1 indicates improvement. . . . . . . . . . . . . . 116
7.4 Impact on performance with increase in tree levels and number of right-hand
sides for the bcsstk10 matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.5 Effect of increasing (a) tree levels, and (b) right-hand sides on operations count
(in millions) of our Hybrid solver compared to the best PCG performance for
10 right-hand sides. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Acknowledgments
I would like to acknowledge the National Science Foundation (NSF) for funding my graduate
research.
I am indebted to my thesis advisor, Professor Padma Raghavan, for her guidance, support
and constant encouragement. I am grateful to my committee members Professor Mahmut Kan-
demir, Dr. Suzanne Shontz, and Dr. Kateryna Makova for their invaluable contributions towards
my research. I would also like to thank my friends and colleagues Dr. Sanjukta Bhowmick
(Assistant Professor, Department of Computer Science, University of Nebraska), Manu Shan-
tharam (Graduate Student, The Pennsylvania State University), Michael Frasca (Graduate Stu-
dent, The Pennsylvania State University), Joshua Booth (Graduate Student, The Pennsylvania
State University), and Shad Kirmani (Graduate Student, The Pennsylvania State University) for
their insightful contributions towards our joint research activities.
Most importantly, I would like to thank my family for instilling into me the importance of
sincerity in all my endeavors, and my wife, Kolika, for being my support and inspiration during
my graduate student life at Penn State.
Chapter 1
Introduction
Large-scale simulation and modeling of processes in many application areas, such as
text mining [2], computational biology [3], astrophysics [4], and materials processing [5], rely
on data-driven techniques involving a large number of feature variables (or dimensions). Obser-
vations can be viewed as points embedded in a d-dimensional (d ≥ 2) subspace, which provides
a geometry to the observations. In particular, for a high-dimensional space in which d is equal
to the number of variables, the coordinates are specified by values of the feature variables. It
is common for observations to have nonzero values in only a few dimensions, which makes the
data extremely sparse. A sparse graph representation of the observations provides a structure,
indicating relationships in the data. Our goal is to obtain algorithms that demonstrate improved
accuracy for key problems in data analysis and parallel scientific computing by combining geo-
metric and structural properties of the data.
We elaborate further on sparsity, geometry, and structure of high-dimensional data us-
ing an example of text data analysis. Text data is typically obtained from web crawls of social,
economic, and scientific web media, such as newsgroups, blogs, journals, and similar webpages.
The crawled data files are processed and each document is represented as a vector comprising
frequency of the occurrence of words from a predefined dictionary. Only a few words from the
dictionary occur in each document, which accounts for the sparsity of the data. Figure 1.1 illus-
trates this using a simple example comprising twelve documents and nine features (or words).
These twelve documents can be viewed as data points embedded in a nine-dimensional geo-
metric space of words. Figure 1.2(a) presents a sparse graph representation of the relationship
between these documents. In our example, a document i is related to another document j if they
share at least one word; Figure 1.2(a) shows the matrix representation of this sparse graph struc-
ture. The relationships between documents are better understood when this structural informa-
tion is augmented with a two-dimensional spatial representation of the same sparse matrix, as
shown in Figure 1.2(b). This example demonstrates that a better understanding of processes can
be obtained by combining geometric and structural properties of the underlying sparse represen-
tations.
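The document-word construction just described can be sketched in a few lines; the tiny corpus below is invented for illustration and merely stands in for the twelve documents of Figure 1.1:

```python
import numpy as np

# A tiny invented corpus standing in for the documents of Figure 1.1.
docs = ["sparse graph partition", "graph embedding geometry",
        "gene sequence data", "sequence alignment data"]

# Build the dictionary and the document-by-word frequency matrix A;
# each row is a document, each column a dictionary word.
words = sorted({w for d in docs for w in d.split()})
A = np.zeros((len(docs), len(words)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        A[i, words.index(w)] += 1

# Documents i and j are related if they share at least one word: the
# nonzero off-diagonal entries of B = A Aᵀ give the relationship graph.
B = A @ A.T
related = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))
           if B[i, j] > 0]
```

Here only documents 0 and 1 (sharing "graph") and documents 2 and 3 (sharing "sequence" and "data") are connected, and each row of A has just three nonzeros out of nine dictionary columns, illustrating the sparsity discussed above.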
This dissertation is organized in two parts. Part I manipulates high-dimensional geometry
using sparse graph structure to improve classification problems in data mining. Part II utilizes
geometric and structural properties of sparse graphs to develop improved algorithms in parallel
scientific computing.
In the first part of this dissertation, comprising Chapters 2, 3, and 4, we focus on en-
hancing classification by combining high-dimensional geometry with sparse graph structure. In
Chapter 2, we provide background material on unsupervised classification, supervised classifi-
cation, and force-directed graph embedding. In Chapter 3, we consider improving unsupervised
classification using our Feature Subspace Transformation (FST) scheme [6]. Traditional data
clustering methods use either geometry based (K-Means [7]) or combinatorial (multilevel K-
Means [8–10]) measures to cluster data in a high-dimensional feature space. Our FST scheme
Fig. 1.1. Raw data (for example, blogs, newsgroup posts, gene sequences, census data) are collected and subsequently processed (for example, using tokenizers) to obtain a dataset, where rows represent samples and columns indicate variables (or features). A cross symbol (×) indicates a nonzero feature in the observation.
Fig. 1.2. (a) A matrix indicating relationships among different observations. (b) An embedding of the relationship graph in two dimensions providing a geometry to the graph structure.
brings related entities iteratively closer in the feature space by combining geometry with combi-
natorial structure using a force based method [11]. We apply traditional K-Means to the trans-
formed data to obtain FST-K-Means. Our results indicate that, on average, FST-K-Means im-
proves the internal quality metric (cluster cohesiveness) by 20.2% relative to K-Means and by
6.6% relative to Multilevel K-Means (GraClus). More significantly, FST-K-Means improves the
external quality (accuracy) by 14.9% relative to K-Means and by 23.6% relative to GraClus.
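The intuition behind this attraction can be illustrated with a toy step. The update below, the example coordinates, and the similarity graph B are all invented for illustration; this is not the actual FST iteration, which is developed in Chapter 3:

```python
import numpy as np

def attract_step(X, B, step=0.5):
    """Move each entity part of the way toward the weighted centroid of its
    neighbors in the similarity graph B. A simplified, intuition-level
    stand-in for a force-based attraction step, not the FST scheme itself."""
    W = B.astype(float).copy()
    np.fill_diagonal(W, 0.0)                      # ignore self-similarity
    deg = W.sum(axis=1, keepdims=True)
    centroids = (W @ X) / np.where(deg == 0, 1.0, deg)
    disp = np.where(deg > 0, centroids - X, 0.0)  # isolated entities stay put
    return X + step * disp

# Four entities in two groups; B links entities within each group.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
B = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
              [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
X1 = attract_step(X, B)          # related entities move closer together
```

After one step each related pair contracts (entities 0 and 1 meet at x = 0.5, entities 2 and 3 at x = 10.5) while the two unrelated groups remain far apart, which is the geometric effect that benefits subsequent clustering.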
In Chapter 4, we develop our Similarity Graph Neighborhoods (SGN) based data transforma-
tion scheme for supervised classification [12]. Unlike FST, SGN is a one-step transform that is
applied to the data before the training phase of a supervised classifier. Our SGN approach trans-
forms the geometry of entities in the training data by considering similarity graph neighborhoods
in the high-dimensional feature space. An SGN classifier is a supervised classification scheme,
such as the support vector machine (SVM) [13, 14] or linear discriminant (LD) [12] classifier,
that has been trained on this transformed data. We demonstrate the accuracy of our classifier
on a suite of benchmark data. Our results indicate that SGN-LD improves accuracy by 5.0%
compared to LD, and SGN-SVM improves it by 4.62% compared to SVM.
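As a rough geometric intuition for why such a transform helps a supervised learner, consider the toy one-step displacement below. It pulls entities toward their class mean, shrinking within-class spread; this is a deliberate simplification (the data and step size alpha are invented), and the actual SGN transform of Chapter 4 derives displacements from similarity graph neighborhoods rather than from class labels:

```python
import numpy as np

def toy_displacement(X, y, alpha=0.5):
    """Displace each training entity toward the mean of its own class,
    shrinking within-class spread while leaving class means unchanged, so
    a linear boundary between classes becomes easier to learn. Illustration
    only; the SGN transform uses similarity graph neighborhoods instead."""
    Xt = X.astype(float).copy()
    for c in np.unique(y):
        mask = y == c
        Xt[mask] += alpha * (X[mask].mean(axis=0) - X[mask])
    return Xt

# Two classes of three training entities each.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [2.0, 2.0], [3.0, 3.0], [2.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xt = toy_displacement(X, y)      # class means unchanged, spread halved
```

Because each class contracts around an unchanged mean, the gap between the classes grows relative to their spread, which is the effect a classifier trained on the transformed data exploits.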
In the second part of this dissertation, comprising Chapters 5, 6 and 7, we focus on com-
bining geometry with sparse graph structure to improve two key problems in parallel scientific
computing, namely, parallel graph partitioning and sparse linear system solution. In Chapter 5,
we present background material on graph partitioning, graph layout algorithms, and sparse linear
systems solutions. Sparse graph partitioning is an important component of many parallel scien-
tific and data mining applications, especially domain decomposition. However, scaling existing
graph partitioning algorithms in distributed multicore environments remains an open challenge.
In Chapter 6, we develop a scalable geometric partitioning scheme, enabled by a tree-structured
parallel graph embedding, using the Charm++ [15] parallel programming system. Our parallel graph
embedding and geometric partitioning scheme, ScalaPart, first generates a geometric layout for a
sparse graph by using structural properties of the graph. This layout is subsequently partitioned
using a scalable parallel geometric partitioning scheme. Parallel geometric partitioning [16, 17]
is inherently more scalable than adjacency graph-based schemes due to a significantly reduced
need for communication. However, the partition quality can depend largely on the quality of
the geometric coordinates. Our analysis and empirical evaluation demonstrate that the partitioning
quality of ScalaPart compares well with leading schemes, such as ParMetis and PT-Scotch, and is
highly scalable. In Chapter 7, we develop a multilevel tree-structured hybrid solver for symmet-
ric positive definite systems of the form Ax = b with multiple right-hand sides [18]. A hybrid
solution technique that interleaves phases of direct and iterative solutions serves as an effective
alternative to preconditioned conjugate gradients (PCG) [19]. The initial tree-structured domain
decomposition [20] takes advantage of the sparsity structure of the matrix A to find small dense
sub-blocks at the leaves of the tree that can be solved using a sparse direct Cholesky [21] solver.
Subsequently, these partial solutions can be combined at the separator blocks by traversing the
tree recursively. We provide an empirical evaluation on a suite of 12 benchmark matrices from
the University of Florida sparse matrix collection [22]. Our results indicate that our tree-based
hybrid solver is 1.87 times faster than PCG using different combinations of ordering, level-of-fill,
or threshold methods with an incomplete Cholesky preconditioner.
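The leaf/separator structure just described can be illustrated with a dense, one-level sketch. Everything below (the random block-arrow test system, its sizes, and the plain NumPy Cholesky calls) is invented for illustration; the actual solver of Chapter 7 operates on sparse multilevel trees and interleaves iterative PCG phases:

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    """Random symmetric positive definite block (illustrative only)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

# One-level block-arrow system: two independent leaf blocks coupled
# only through a small separator block, as nested dissection produces.
n1, n2, ns = 5, 5, 2
A11, A22, As = spd(n1), spd(n2), spd(ns) + 10 * np.eye(ns)
E1 = rng.standard_normal((n1, ns))
E2 = rng.standard_normal((n2, ns))
A = np.block([[A11, np.zeros((n1, n2)), E1],
              [np.zeros((n2, n1)), A22, E2],
              [E1.T, E2.T, As]])

# Factor each leaf once (Cholesky), form the separator Schur complement,
# then reuse all factors for every new right-hand side.
L1, L2 = np.linalg.cholesky(A11), np.linalg.cholesky(A22)

def leaf_solve(L, b):
    """Forward/back substitution with a Cholesky factor L."""
    return np.linalg.solve(L.T, np.linalg.solve(L, b))

S = As - E1.T @ leaf_solve(L1, E1) - E2.T @ leaf_solve(L2, E2)
Ls = np.linalg.cholesky(S)

def solve(b):
    b1, b2, bs = b[:n1], b[n1:n1 + n2], b[n1 + n2:]
    y1, y2 = leaf_solve(L1, b1), leaf_solve(L2, b2)
    xs = leaf_solve(Ls, bs - E1.T @ y1 - E2.T @ y2)   # separator solve
    x1 = y1 - leaf_solve(L1, E1 @ xs)                 # back-substitute leaves
    x2 = y2 - leaf_solve(L2, E2 @ xs)
    return np.concatenate([x1, x2, xs])

b = rng.standard_normal(n1 + n2 + ns)
x = solve(b)
```

The point of the structure is that L1, L2, and Ls are computed once, after which each additional right-hand side costs only triangular solves; this is the property that makes tree-structured hybrids attractive for repeated right-hand sides.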
In Chapter 8, we conclude this dissertation with a discussion of our results, open prob-
lems, and possible future research directions.
Part I
Combining High-dimensional Geometry with Sparsity for
Improving Accuracy of Data Mining
Chapter 2
Combining Geometry and Combinatorics for Enhanced Unsupervised and Supervised Classification
Many data analysis applications typically obtain a geometry for the data by representing
sample points in a high-dimensional feature space. Additionally, a combinatorial representation
of the data is constructed in the form of a similarity graph [23] where edges indicate relation-
ships amongst data points and edge-weights indicate strength of these relationships. In this part
of the dissertation, our key focus is to utilize the geometry of the data in conjunction with its
combinatorial representation to improve the quality of unsupervised and supervised classifica-
tion in data mining [24]. In Chapter 3, we present a feature subspace transformation scheme
that combines geometric and combinatorial measures to improve quality of unsupervised clas-
sification. In Chapter 4, we develop a high-dimensional transformation scheme that improves
quality of supervised classification by utilizing local neighborhoods in the similarity graph of
the training data. This chapter provides the necessary background that is required to understand
key concepts in Chapters 3 and 4.
The rest of this chapter is organized as follows. Section 2.1 introduces our notation for
Chapters 3 and 4. Section 2.2 provides a brief overview of K-Means clustering. Section 2.3
presents a brief overview of supervised classification using Support Vector Machine (SVM)
and Linear Discriminant (LD) classifiers. Section 2.4 presents key metrics used to evaluate the
quality of unsupervised and supervised classifiers.
2.1 Notation
We represent matrices using bold upper-case letters, e.g., A. A lower-case letter with an arrow denotes a vector, e.g., x⃗. The notation x_{i,j} refers to the j-th component of the i-th vector in a set of n vectors {x⃗_i}_{i=1}^{n}. We use ∥ · ∥₂ to represent the two-norm of a vector. R denotes the set of real numbers.
2.2 Background on Unsupervised Classification using K-Means
The most common implementations of the K-Means algorithm are by Steinhaus [25],
Lloyd [26], Ball and Hall [27], and MacQueen [28]. In the first step, the four independent
implementations of K-Means initialize the positions of the centroids (the number of centroids is
equal to the number of pre-determined classes). Then the entities are assigned to their closest
centroid to form the initial clusters. The centroids are then recalculated based on the center of
mass of these clusters, and the entities are reassigned to clusters according to the new centroids.
These steps are repeated until convergence is achieved.
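These steps can be sketched as a minimal Lloyd-style K-Means in Python (an illustrative sketch only; the function and variable names are our own):

```python
import math
import random

def kmeans(points, k, max_iter=25, seed=0):
    """Minimal Lloyd-style K-Means: assign each entity to its nearest
    centroid, recompute centroids as centers of mass, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initialize centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2: assign each entity to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # step 3: recompute centroids as centers of mass of the clusters
        new_centroids = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups, the loop converges in a few iterations regardless of which points are sampled as the initial centroids.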
Graph approaches to classification have also become popular, especially multilevel K-Means (GraClus) [8], which performs unsupervised classification through graph clustering. The
dataset is represented as a weighted graph, where each entity is a vertex and vertices are con-
nected by weighted edges. High edge weights denote greater similarity between corresponding
entities. Similar entities are identified by clustering of vertices with heavy edge weights. Gra-
Clus uses multilevel graph coarsening algorithms. The initial partition of the graph is generated
using a spectral algorithm [29], and the subsequent coarsening stages use a weighted kernel
K-Means.
2.3 Background on Supervised Classification
Supervised classification techniques use prelabeled data to learn models that can subse-
quently be applied to classify unlabeled data. In this section, we provide a brief overview of
two popular supervised learning schemes: (i) Linear Discriminant Analysis, and (ii) Support
Vector Machine. LDA and SVM represent two distinct types of classifiers; the former relies
on constructing a covariance-based discriminant function, and the latter learns the coefficients of a separating hyperplane in high dimensions to classify future data.
2.3.1 Linear Discriminant Classifier (LD)
The goal of the Linear Discriminant classifier (LD) is to express a predictor variable as a linear combination of various features in the data. Consider a set of n samples x_i, i = 1, …, n, and their corresponding class vector y, where y_i ∈ {0, 1} for binary classification. LD analysis is based on the assumption that the class-conditional probability distributions are normally distributed with means and covariances (µ₀, Σ₀) and (µ₁, Σ₁). Additionally, LD analysis imposes the condition that all classes of data in the training set have the same covariance parameter, that is, Σ₀ is equal to Σ₁. Classification using LD is obtained by computing the log-likelihood ratio for the two classes and assigning a sample to the first class if this ratio is greater than a threshold τ, or to the second class if it is less than τ. Equation 2.1 represents the LD classification criterion, where Σ₀ = Σ₁:

    (x − µ₀)ᵀ Σ₀⁻¹ (x − µ₀) − (x − µ₁)ᵀ Σ₁⁻¹ (x − µ₁) < τ          (2.1)
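Equation 2.1 can be evaluated directly once the class means and the shared inverse covariance are available. The following Python sketch illustrates this; the function names are ours, and Σ⁻¹ is assumed to be supplied:

```python
def ld_classify(x, mu0, mu1, sigma_inv, tau=0.0):
    """Evaluate the LD criterion of Eq. 2.1 under a shared covariance:
    assign class 0 when the quadratic-difference score falls below tau,
    class 1 otherwise."""
    def quad(mu):
        # (x - mu)^T * Sigma^{-1} * (x - mu)
        d = [xi - mi for xi, mi in zip(x, mu)]
        n = len(d)
        return sum(d[i] * sigma_inv[i][j] * d[j]
                   for i in range(n) for j in range(n))
    return 0 if quad(mu0) - quad(mu1) < tau else 1
```

With an identity covariance this reduces to comparing squared Euclidean distances to the two class means.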
2.3.2 Support Vector Machine (SVM)
Supervised learning using support vector machines was first proposed by Vladimir Vap-
nik as a support vector regression [13, 14] method. Subsequently, SVM gained popularity as a
classifier that requires only a subset of the training points to build maximally separated hyper-
planes to classify the data.
Consider a training set of data (x₁, y₁), (x₂, y₂), …, (x_n, y_n) with x_i ∈ Rⁿ and y_i ∈ {−1, +1}. Define ξ_i as the training error associated with the i-th data point and C as a constant defining a bound on the training error. We formulate and solve an optimization problem that seeks to maximize the separation between the supporting hyperplanes. The goal is to find w, the vector of weights that defines the separating plane wᵀx = 0. In a binary SVM classification problem, a simple classification rule h(x) = sign(wᵀx + b) is used to classify the given test data, where b can be easily modeled. Equation (2.2) represents the primal form of the problem:

    min_{w, ξ_i ≥ 0}  (1/2) wᵀw + (C/n) Σ_{i=1}^{n} ξ_i          (2.2)

    s.t.  y_i (wᵀx_i) ≥ 1 − ξ_i,  ∀i ∈ {1, …, n}
Alternatively, recent implementations of SVM [30] solve the equivalent dual formulation of the primal problem (stated in Equation 2.2). Equation 2.3 presents the dual formulation of the primal SVM optimization problem. The variable α_i denotes the i-th Lagrange multiplier, and c_i ∈ {0, 1} indicates whether the corresponding constraint is active (c_i = 1) or inactive (c_i = 0).
    max_{α ≥ 0}  (1/n) Σ_i ∥c_i∥₁ α_i − (1/2) Σ_i Σ_j α_i α_j x_iᵀ x_j          (2.3)

    s.t.  Σ_i α_i ≤ C
There are multiple implementations of the SVM classifier available [30–34]. For our research, we use Joachims' SVM-Perf [30] package, which solves the SVM dual optimization problem using the active set method [57].
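To make the primal objective in Equation 2.2 concrete, the sketch below minimizes it with plain stochastic subgradient descent. This is a toy substitute for the active set method used by SVM-Perf, and all names are illustrative:

```python
import random

def svm_train(X, y, C=1.0, epochs=500, lr=0.01, seed=0):
    """Minimize (1/2) w^T w + C * hinge loss by stochastic subgradient steps,
    where the hinge term max(0, 1 - y_i * w^T x_i) plays the role of xi_i."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    w = [0.0] * dim
    for _ in range(epochs):
        i = rng.randrange(n)
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        for j in range(dim):
            g = w[j]                      # gradient of the regularizer (1/2) w^T w
            if margin < 1:                # constraint y_i w^T x_i >= 1 - xi_i violated
                g -= C * y[i] * X[i][j]   # subgradient of the hinge term
            w[j] -= lr * g
    return w

def svm_predict(w, x):
    """Classification rule h(x) = sign(w^T x), with any bias folded into x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

On linearly separable data the learned weight vector separates the two classes after a few hundred updates.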
2.4 Metrics of Classification Quality
The quality of unsupervised classification is typically evaluated using external and internal metrics. For supervised classification, precision and recall are measured in addition to external quality.
The external metric evaluates the accuracy of classification (the higher the value, the
better the classification) as the ratio of correctly classified entities to the total number of entities.
This metric is dependent on a subjective pre-determined labeling. This measure, though an
indicator of correctness, cannot be incorporated into the design of classification algorithms. We
henceforth refer to this metric as “accuracy,” denoted by P and defined as

    P = (number of correctly classified entities) / (total number of entities).          (2.4)
The internal metric measures the cohesiveness of the clusters, i.e., the sum of the square
of the distance of the clustered points from their centroids. We henceforth refer to this metric as
“cohesiveness,” denoted by J. If M₁, …, M_k are the k clusters and µ_h represents the centroid of cluster M_h, the cohesiveness J is defined as:

    J = Σ_{h=1}^{k} Σ_{x ∈ M_h} ∥x − µ_h∥²          (2.5)
If similar entities are clustered closely, then they are generally easier to classify and
so a lower value of J is preferred. The cohesiveness is an internal metric because it can be
incorporated into the design of the classification algorithm; several classification methods seek
to minimize the objective function given by J .
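For instance, J can be computed directly from a clustering (an illustrative Python sketch; the names are ours):

```python
def cohesiveness(clusters, centroids):
    """J (Eq. 2.5): sum of squared distances of clustered points from
    their cluster centroids; lower is better."""
    return sum(
        sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        for points, mu in zip(clusters, centroids)
        for x in points
    )
```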
We evaluate the classification quality in further detail using two additional metrics: (a) precision and (b) recall. In order to understand precision and recall, we first define four categories: true positive, true negative, false positive, and false negative. True positive (tp) indicates a scenario where the class was positive and the prediction was positive. True negative (tn) indicates a scenario where the class was negative and the prediction was negative. False positive (fp) indicates a scenario where the class was negative and the prediction was positive. False negative (fn) indicates a scenario where the class was positive and the prediction was negative. Precision is the ratio of true positives to the sum of true positives and false positives, as shown in Equation 2.6. Recall is the ratio of true positives to the sum of true positives and false negatives, as shown in Equation 2.7.

    precision = tp / (tp + fp)          (2.6)

    recall = tp / (tp + fn)          (2.7)
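Equations 2.6 and 2.7 can be computed from paired label lists as follows (an illustrative sketch; the names are ours):

```python
def precision_recall(actual, predicted, positive=1):
    """Count tp, fp, fn over paired labels and return (precision, recall)
    as defined in Equations 2.6 and 2.7."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fp), tp / (tp + fn)
```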
2.5 Chapter Summary
In this chapter, we briefly discussed necessary background on data analysis using clas-
sification techniques. In Section 2.2, we discussed the basic K-Means clustering algorithm and
its popular combinatorial variant GraClus. Subsequently, we discussed the algorithmic steps of
two important supervised classification techniques, namely, support vector machines and linear
discriminant classifiers. Finally, we discussed metrics that are used to evaluate unsupervised and
supervised classification techniques.
Chapter 3
Feature Subspace Transformation for Enhancing K-Means Clustering
Unsupervised classification is used to identify similar entities in a dataset and is extensively used in many application domains such as spam filtering [35], medical diagnosis [36], and demographic research [37]. Unsupervised classification using K-Means generally clusters data based on (i) distance-based attributes of the dataset [25–28], or (ii) combinatorial properties of a weighted graph representation of the dataset [8].
Classification schemes, such as K-Means [7], that use distance-based attributes view
entities of the dataset as existing in an n-dimensional feature space. The value of the i-th feature
of an entity determines its coordinate in the i-th dimension of the feature space. The distance
between the entities is used as a classification metric. The entities that lie close to each other are
assigned to the same cluster.
Combinatorial techniques for clustering, such as multilevel K-Means (GraClus) [8], rep-
resent the dataset as a weighted graph, where the entities are represented by vertices. The edge
weights of the graph indicate the degree of similarity between the entities. A highly weighted
subgraph forms a class, and its vertices (entities) are given the same label.
In this chapter, we present a feature subspace transformation (FST) scheme to transform
the dataset before the application of K-Means (or other distance-based clustering schemes). A
unique attribute of our FST-K-Means method is that it utilizes both distance-based and com-
binatorial attributes of the original dataset to seek improvements in the internal and external
quality metrics of unsupervised classification. FST-K-Means starts by forming a weighted graph
with the entities as vertices that are connected by weighted edges indicating a measure of hav-
ing shared features. The vertices of the graph are initially viewed as being embedded in the
high-dimensional feature subspace, i.e., the coordinates of each vertex (entity) are given by the
values of its feature vector in the original dataset. This initial layout of the weighted graph is
transformed by a special form of a force-directed graph embedding algorithm that attracts similar
entities. The nodal coordinates of the embedded graph provide a feature subspace transformation
of the original dataset; K-Means is then applied to the transformed dataset.
The remainder of this chapter is organized as follows. In Section 3.1, we provide a brief
review of related classification schemes and a graph layout algorithm that we adapt for use in
our FST-K-Means. In Section 3.2, we develop FST-K-Means including its feature subspace
transformation scheme and efficient implementation. In Section 3.3, we provide an empirical
evaluation of the effectiveness of FST-K-Means using a test suite of datasets from a variety
of domains. In Section 3.4, we attempt to quantitatively characterize the performance of FST-
K-Means by demonstrating that the clustering obtained from our method satisfies optimality
constraints for the internal quality. We present brief concluding remarks in Section 3.5. We
would like to note that content in this chapter is mainly obtained from our research that appears
in [6].
3.1 Preliminaries
We now provide a brief overview of the Fruchterman-Reingold (FR) [11] graph layout
method that we later adapt in Section 3.2 for use as a component in our FST-K-Means.
Graph layouts with the Fruchterman and Reingold scheme. Graph layout methods are de-
signed to produce aesthetically pleasing graphs embedded in a two-(or three)-dimensional space.
A popular technique is the Fruchterman and Reingold (FR) [11] algorithm that considers a graph
as a collection of objects (vertices) connected together by springs (edges). Initially, the vertices
in the graph are placed at randomly assigned coordinates. The FR model assumes that there are
two kinds of forces acting on the vertices: (i) attraction only between connected vertices due
to the springs, and (ii) repulsion between all vertices due to mutually charged objects. If the
Euclidean distance between two vertices u and v is d_{uv}, and k is a constant proportional to the square root of the ratio of the embedding area to the number of vertices, then the attractive force F_A^{FR} and the repulsive force F_R^{FR} are calculated as:

    F_A^{FR}(u, v) = −k² / d_{uv}   and   F_R^{FR}(u, v) = d_{uv}² / k.
At each iteration of FR, the vertices are moved in proportion to the calculated attractive or re-
pulsive forces until the desired layout is achieved.
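The two force laws, in the sign convention used above (negative values pull vertices together), can be written as a small helper (an illustrative sketch; the names are ours):

```python
def fr_forces(d_uv, k):
    """FR force pair for two vertices at Euclidean distance d_uv with layout
    constant k: (attractive force between connected vertices, repulsive force
    between all vertex pairs)."""
    return -(k ** 2) / d_uv, (d_uv ** 2) / k
```

Note that attraction grows as vertices drift apart only through the repulsive term shrinking; the attractive term itself strengthens as connected vertices get closer is not the case here — it weakens with distance, which is what lets the layout settle.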
In Section 3.2, we adapt the FR algorithm as a component in our feature subspace trans-
formation in high dimensions. Through this transformation, we essentially seek embeddings that
enhance cluster cohesiveness while improving the accuracy of classification.
3.2 FST-K-Means: Feature Subspace Transformations for Enhanced Classifica-
tion
We now develop our feature subspace transformation scheme which seeks to produce
a transformed dataset to which K-Means can be applied to yield high quality classifications.
Our FST-K-Means algorithm seeks to utilize both distance-based and combinatorial measures
through a geometric interpretation of the entities of the dataset in its high-dimensional feature
space.
Consider a dataset of N entities and R features represented by an N×R sparse matrix A.
The i-th row, ai,∗ of the matrix A, represents the feature vector of the i-th entity in the dataset.
Thus, we can view entity i as being embedded in the high-dimensional feature space (dimension
≤ R) with coordinates given by the feature vector ai,∗.
The N × N matrix B ≈ AAᵀ represents the entity-to-entity relationship. B forms the adjacency matrix of the undirected graph G(B, A), where b_{i,j} is the edge weight between vertices i and j in the graph, and a_{i,k} is the coordinate of vertex i in the k-th dimension.
FST is then applied to G(B, A) to transform the coordinates of the vertices, producing the graph G(B, Ã). It should be noted that the structure of the adjacency matrix, i.e., the set of edges in the graph, remains unchanged. The transformed coordinates of the vertices are now represented by the N × R sparse matrix Ã. The transformed dataset, or equivalently its matrix representation Ã, has exactly the same sparsity structure as the original, i.e., ã_{i,j} = 0 if and only if a_{i,j} = 0. However, typically ã_{i,j} ≠ a_{i,j} as a result of the feature subspace transformation, thus changing the embedding of entity i. Now the i-th entity is represented in a transformed feature subspace (dimension ≤ R) by the feature vector ã_{i,∗}. K-Means is applied to this transformed dataset, represented by Ã, to yield FST-K-Means.

FST-K-Means comprises three main steps: (i) forming an entity-to-entity sparse, weighted embedded graph G(B, A); (ii) feature subspace transformation (FST) of G(B, A) to yield the transformed embedded graph G(B, Ã); and (iii) applying K-Means to à for classification.
Fig. 3.1. Illustration of the three main steps of FST-K-Means.
(1) A is a sparse matrix representing a dataset with 6 entities and 3 features. B ≈ AAᵀ is the adjacency matrix of the weighted graph G(B, A) with 6 vertices and 7 edges.
(2a) FST is applied on G(B, A) to transform the coordinates of the vertices. Observe that the final embedded graph G(B, Ã) has the same sparsity structure as G(B, A).
(2b) The sparse matrix à represents the dataset with the transformed feature space.
(3) K-Means is applied to the dataset à to produce a high-quality clustering.
Algorithm 1 Procedure FST(A)
  n ← number of entities
  B ← AAᵀ
  for i = 1 to MAX_ITER do
      Initialize displacement ∆ ← (0₁ᵀ, …, 0_nᵀ)
      {Compute displacements}
      for all edges (u, v) in G(B, A) do
          F_A^{uv} ← −(k² × w_{uv}) / (d_{uv} × i)
          dist ← ∥A_v − A_u∥
          ∆_u ← ∆_u + ((A_v − A_u) / dist) × F_A^{uv}
          ∆_v ← ∆_v − ((A_v − A_u) / dist) × F_A^{uv}
      end for
      {Update entity positions}
      for j = 1 to n do
          A_j ← A_j + ∆_j
      end for
  end for
Figure 3.1 illustrates these steps using a simple example.
Forming an entity-to-entity, weighted embedded graph G(B, A). Consider the dataset represented by the N × R sparse matrix A, and form B ≈ AAᵀ. Although B could be computed exactly as AAᵀ, approximations that compute only a subset of representative values may also be used. Observe that b_{i,j}, which represents the relationship between entities i and j, is given by a_{i,∗} · a_{j,∗}, i.e., the dot product of the feature vectors of the i-th and j-th entities; thus, b_{i,j} is proportional to their cosine similarity in the feature space. Next, view the matrix B as the adjacency matrix of the undirected weighted graph G(B), where vertices (entities) u and v are connected by edge (u, v) if b_{u,v} is nonzero; the weight of the edge (u, v) is set to b_{u,v}. Finally, consider the weighted graph G(B) of entities as being located in the high-dimensional feature space of A, i.e., vertex v has coordinates a_{v,∗}. Thus, G(B, A) represents the combinatorial entity-to-entity relationship, similar to graph clustering methods like GraClus. However, G(B, A) also uses the distance attributes in A to add geometric information in the form of the coordinates of vertices (entities).
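A minimal sketch of this setup step, with rows of A stored as sparse dictionaries (the representation and names are ours):

```python
def build_graph(A):
    """Given a sparse dataset A as a list of {feature_index: value} dicts,
    form the nonzero entries of B = A A^T and return them as a weighted
    edge list: edge (i, j) carries weight b_ij = a_i . a_j, the dot product
    of the i-th and j-th feature vectors."""
    edges = {}
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            w = sum(v * A[j].get(f, 0.0) for f, v in A[i].items())
            if w != 0.0:
                edges[(i, j)] = w
    return edges
```

Entities sharing no features produce no edge, so the graph inherits the sparsity of the dataset.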
Feature Subspace Transformation of G(B, A) to G(B, Ã). We develop FST as a variant of the FR graph layout algorithm [11], described in Section 3.1, to obtain G(B, Ã) from G(B, A).
Although FST is motivated by the FR graph layout algorithm, it differs significantly in the following aspects.
(i) FST operates in the high-dimensional feature subspace unlike the FR scheme which seeks
layouts in two or three dimensions. Thus, unlike FR which begins by randomly assigning coor-
dinates in 2 or 3 dimensional space to the vertices of a graph, we begin with an embedding of
the graph in the feature subspace.
(ii) The original FR scheme assumes that the vertices can be moved freely in any dimension of
the embedding space. However, for our purposes, it is important to restrict their movement to
prevent entities from developing spurious relationships to features, i.e., relationships that were
not present in the original A. We therefore allow vertices to move only in the dimensions where
their original feature values are nonzero.
(iii) The goal of FST is to bring highly connected vertices closer in the feature space, in contrast
to FR objectives that aim to obtain a visually pleasing layout. Therefore at each iteration of
FST, we move the vertices based only on the attractive force (between connected vertices) and
eliminate the balancing effect of the repulsive force. Furthermore, we scale the attractive force
by the edge weights to reflect higher attraction from greater similarity between the correspond-
ing entities. Together, these modifications can cause heavily connected vertices to converge to
nearly the same position. While this effect might be desirable for classification, it hampers the
computation of attractive forces for the next iteration of FST. In particular, very small distances
between vertices cause an overflow error due to division by zero. We mitigate this problem by
scaling the force by the number of iterations to ensure that at higher iterations, the effect of the
displacement is less pronounced. In summary, at FST iteration i, the attractive force between two vertices u and v with edge weight w_{u,v} and Euclidean distance d_{u,v} is given by:

    F_i^{FST} = −(k² · w_{u,v}) / (d_{u,v} · i).
In the expression above, k is a constant proportional to the square root of the ratio of the embedding area to the number of vertices, as in the original FR scheme.
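One possible Python rendering of an FST sweep under these rules (attraction only, force damped by the iteration count, movement restricted to originally nonzero coordinates; the helper names are ours):

```python
import math

def fst_sweep(coords, edges, k=1.0, max_iter=10):
    """Iteratively pull connected vertices together with force magnitude
    k^2 * w_uv / (d_uv * i) at iteration i (the text writes this with a
    negative sign to denote attraction). Vertices move only along
    coordinates that are nonzero in the current embedding."""
    n, dim = len(coords), len(coords[0])
    for it in range(1, max_iter + 1):
        disp = [[0.0] * dim for _ in range(n)]
        for (u, v), w in edges.items():
            delta = [coords[v][d] - coords[u][d] for d in range(dim)]
            dist = math.sqrt(sum(x * x for x in delta))
            if dist == 0.0:
                continue                       # coincident vertices: nothing to do
            pull = (k ** 2) * w / (dist * it)  # damped attractive magnitude
            for d in range(dim):
                step = (delta[d] / dist) * pull
                disp[u][d] += step             # u moves toward v
                disp[v][d] -= step             # v moves toward u
        for j in range(n):
            for d in range(dim):
                if coords[j][d] != 0.0:        # restrict movement to nonzero dims
                    coords[j][d] += disp[j][d]
    return coords
```

A single sweep on two connected vertices halves their separation along the shared nonzero dimension while leaving the zero coordinates untouched.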
Applying K-Means to à for Classification. à forms the matrix representation of the dataset in the transformed feature space, where ã_{i,∗} represents the new coordinates of vertex (entity) i. The sparsity structure of à is identical to that of A, the original dataset. K-Means is then applied to Ã, the transformed dataset.
Computational Costs and Stopping Criteria of FST. In FST-K-Means, the cost per iteration of K-Means is unaffected because it operates on Ã, which has exactly the same nonzero pattern as the original A, i.e., ã_{i,j} = 0 if and only if a_{i,j} = 0. Additionally, although FST may change the number of K-Means iterations, it should not affect the worst-case complexity, which is superpolynomial with a lower bound of 2^{Ω(√N)} iterations for a dataset of N entities [38]. Thus the main overheads in FST-K-Means are those for FST.
The first step, that of forming G(B, A), is similar to the graph setup step in GraClus and other graph clustering methods. Even if B is computed exactly, its cost is no more than Σ_{i=1}^{n} nnz_i², where nnz_i is the number of nonzero feature values of the i-th entity. The cost of FST is given by the number of iterations multiplied by the cost per iteration, which is proportional to the number of edges in the graph of B.
Depending on the degree of similarity between the entities, G(B,A) can be dense, thereby
increasing the cost per iteration of FST. Consequently, an implementation of FST could benefit
from using sparse approximations to the graph through sampling, though this could adversely
affect the classification results.
Stopping Criteria for FST. The number of iterations needed for an ideal embedding using FST varies with the feature values in the dataset. In this subsection, we describe how we identify a convergence criterion that promotes improved classification.
An ideal embedding would simultaneously satisfy the following two properties: (i) similar entities are close together, and (ii) dissimilar entities are far apart. The first property is incorporated into the design of FST. We therefore seek to identify a near-ideal embedding based on the second property, i.e., the inter-cluster distance [39], which is estimated by the distance of the entities from their global mean.
We next show in Lemma 3.1 below how the distance of the entities from their global
mean is related to feature variance. We use this relation to determine our convergence test for
terminating FST iterations.
Lemma 3.1. Consider a dataset of N entities and R features represented as a sparse matrix A; entity i can be viewed as embedded in the feature space at the coordinates given by the i-th feature vector, a_{i,∗} = [a_{i,1}, …, a_{i,R}]. Let f_q be the feature variance vector, and let d_q be the vector of distances of each entity from the global mean, i.e., the centroid of all entities, at iteration q of FST. Then the f and d vectors satisfy:

    ∥f_q∥₁ = (1/N) ∥d_q∥₂².
Proof. Let a_{i,j} denote feature j of entity i; a_{i,j} is also the j-th coordinate of entity i. Let the mean of the feature vector a_{∗,j} be τ_j = (1/N) Σ_{k=1}^{N} a_{k,j}, and let the variance of the feature vector a_{∗,j} be ϕ_j = (1/N) Σ_{k=1}^{N} (a_{k,j} − τ_j)².

Let f_q = [ϕ₁, ϕ₂, …, ϕ_R] be the vector of feature variances. Then ∥f_q∥₁ = Σ_{j=1}^{R} |ϕ_j| = Σ_{j=1}^{R} ϕ_j, because ϕ_j ≥ 0.

Let the global mean of the entities be represented by µ. The i-th coordinate of µ is µ_i = (1/N) Σ_{k=1}^{N} a_{k,i} = τ_i.

Let δ_i be the distance of entity i from µ, so δ_i = (Σ_{k=1}^{R} (a_{i,k} − µ_k)²)^{1/2} = (Σ_{k=1}^{R} (a_{i,k} − τ_k)²)^{1/2}. Let d_q = [δ₁, δ₂, …, δ_N] be the vector representing the distances of the entities from the global mean µ. The squared 2-norm of d_q is given by:

    ∥d_q∥₂² = Σ_{i=1}^{N} δ_i²
            = Σ_{i=1}^{N} Σ_{k=1}^{R} (a_{i,k} − τ_k)²
            = Σ_{k=1}^{R} Σ_{i=1}^{N} (a_{i,k} − τ_k)²
            = Σ_{k=1}^{R} N ϕ_k
            = N ∥f_q∥₁.
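The identity can be checked numerically (an illustrative Python sketch; the names are ours):

```python
def variance_identity(A):
    """Return (N * ||f||_1, ||d||_2^2) for a dataset A given as a list of
    dense rows; Lemma 3.1 says the two quantities are equal."""
    N, R = len(A), len(A[0])
    tau = [sum(row[j] for row in A) / N for j in range(R)]                   # feature means
    phi = [sum((row[j] - tau[j]) ** 2 for row in A) / N for j in range(R)]   # feature variances
    f1 = sum(phi)                                                            # ||f||_1 (phi_j >= 0)
    d2sq = sum(sum((row[j] - tau[j]) ** 2 for j in range(R)) for row in A)   # ||d||_2^2
    return N * f1, d2sq
```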
A high value of ∥f_q∥₁ implies that there is high variance among the features of the entities in the transformed space. High variance among features indicates easily distinguishable entities. Lemma 3.1 shows that ∥f_q∥₁ = (1/N) ∥d_q∥₂², i.e., high feature variance is proportional to the squared distances of the entities from their global mean. The value of ∥d_q∥₂ can be easily computed, and we base our stopping criterion on it.

Our heuristic for terminating FST is as follows: continue layout iterations while ∥d_{i+1}∥₂ > ∥d_i∥₂. That is, we terminate FST when an iteration fails to increase the norm of the entity distances from their global mean. As illustrated in Figure 3.2, this heuristic produces an embedding that can be classified with high accuracy for a sample dataset.
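The stopping test then reduces to comparing ∥d∥₂ across successive iterations (an illustrative Python sketch; the names are ours):

```python
import math

def distance_norm(coords):
    """||d||_2: 2-norm of the entity distances from the global mean (centroid)."""
    N, R = len(coords), len(coords[0])
    mu = [sum(c[j] for c in coords) / N for j in range(R)]
    return math.sqrt(sum(sum((c[j] - mu[j]) ** 2 for j in range(R)) for c in coords))

def should_stop(prev_coords, next_coords):
    """Terminate FST once an iteration fails to increase ||d||_2."""
    return distance_norm(next_coords) <= distance_norm(prev_coords)
```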
Impact of FST on clustering. We now consider the impact of FST on a sample dataset,
namely splice [40], with two clusters. Figure 3.3 shows the layout of the entities of splice, in both
the original and the transformed feature space. For ease of viewing, the entities are projected to
the first three principal components. Observe that the clusters are not apparent in the original
dataset. However, FST repositions the entities in the feature space into two distinct clusters,
thereby potentially facilitating classification.
3.3 Evaluation and Discussion
We now provide an empirical evaluation of the quality of classification using FST-K-
Means. We define external and internal quality metrics for evaluating classification results. We
use these metrics for a comparative evaluation of the performance of FST-K-Means, K-Means,
and GraClus on a test suite of eight datasets from a variety of applications.
Experimental setup. Our experiments compare the accuracy of our feature subspace
transformation method coupled with K-Means clustering versus a commercial implementation
of K-Means. We first apply K-Means clustering to the unmodified collection of feature values as
obtained from the dataset. We refer to the use of the algorithm on this dataset as K-Means. We
then apply K-Means clustering to the dataset transformed using FST as described in Section 3.2.
We refer to this scheme as FST-K-Means in our experimental evaluation.
In both sets of experiments, we use K-Means based on Lloyd's algorithm (as implemented in the MATLAB [41] Statistical Toolbox), where the centroids are initialized randomly and K-Means is limited to a maximum of 25 iterations. The quality of clustering using K-Means varies
depending on the initial choice of centroids. To mitigate this effect, we execute 100 runs of K-
Means, each time with a different set of initial centroids. To ensure a fair comparison, the initial
centroids for each execution of FST-K-Means are selected to be exactly the same as those for the
corresponding K-Means execution.
The value of the cohesiveness metric is dependent on the feature vector of the entities.
Therefore, for the set FST-K-Means where the values of the features have been modified, we
compute cohesiveness by first identifying the elements in the cluster and then transforming them
back to the original space. We thereby ensure that the modified datasets are used exclusively
for cluster identification and that the values of cohesiveness are not affected by the transformed
subspace.
The GraClus scheme is executed on the weighted graph representation. The results of
GraClus do not significantly change across runs. Consequently, one execution of this scheme is
sufficient for our empirical evaluation.
The computational cost of FST per iteration is proportional to the number of nonzeros in the matrix B, which is less than O(N²), where N is the number of entities in the dataset.
Evaluation of FST-K-Means on a synthetic dataset. We first evaluate our model using arti-
ficially generated data. Our goal is to easily demonstrate and explain the steps of our FST-K-
Means algorithm using this data before we test our method on benchmark datasets from different
repositories.
Figure 3.4 represents our test case in which three Gaussian clusters are formed in two
dimensions using three different cluster means and covariance matrices. Figure 3.4(a) represents
the original synthetic data. Figure 3.4(b) represents the same data points after the FST transfor-
mation has been applied to it. Even before we proceed to formally clustering the two datasets
(original and transformed), a visual inspection of the layout suggests that clusters obtained using
the transformed dataset are likely to be distinct. Finally, on clustering the two datasets using
K-Means, we observe that K-Means on the transformed dataset is 3% more accurate than K-Means on the original data. Matrices Z_{K-Means} and Z_{FST-K-Means} in Equations 3.1 and 3.2, respectively, are the confusion matrices obtained using the two datasets. In our example, the cluster label of each entity is known a priori; therefore, after clustering the original or transformed data, we compare the result to the true cluster labels. The diagonal element Z_{K-Means}^{(ii)} of matrix Z_{K-Means} indicates the number of entities that were assigned correctly to cluster i, and the off-diagonal element Z_{K-Means}^{(ij)}, where i ≠ j, indicates the number of elements of cluster i that were mapped to cluster j (Z_{FST-K-Means} is formed in the same way as Z_{K-Means}). We observe that our FST transformation scheme clearly obtains higher precision and accuracy.
Z_{K-Means} =
    ⎡ 59  16  25 ⎤
    ⎢  6  78  16 ⎥          (3.1)
    ⎣  6   3  91 ⎦

Z_{FST-K-Means} =
    ⎡ 68  11  21 ⎤
    ⎢ 10  82   8 ⎥          (3.2)
    ⎣  7   8  85 ⎦
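The reported accuracies follow directly from the confusion matrices (an illustrative Python sketch):

```python
def accuracy_from_confusion(Z):
    """Accuracy P (Eq. 2.4): correctly classified entities (the trace of Z)
    divided by the total number of entities."""
    correct = sum(Z[i][i] for i in range(len(Z)))
    total = sum(sum(row) for row in Z)
    return correct / total

# Confusion matrices from Equations 3.1 and 3.2
Z_kmeans = [[59, 16, 25], [6, 78, 16], [6, 3, 91]]
Z_fst = [[68, 11, 21], [10, 82, 8], [7, 8, 85]]
```

Here the count of correctly classified entities rises from 228 to 235 out of 300, a relative gain of about 3%.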
Evaluation of FST-K-Means on benchmark datasets. We test the K-Means and FST-K-Means classification algorithms on datasets from the UCI [42], Delve [40], Statlog [43], SMART [44], and Yahoo 20Newsgroup [45] repositories. The parameters of each dataset, along with their application domain, are shown in Table 3.1. Datasets dna, 180txt, and 300txt have three classes; the rest of the datasets represent binary classification problems. We select two classes, comp.graphics and alt.atheism, from the Yahoo 20Newsgroup dataset to form the 20news binary classification set.
Name            Samples   Features   Source (Type)
adult a2a         2,265        123   UCI (Census)
australian          690         14   UCI (Credit Card)
breast-cancer       683         10   UCI (Census)
dna               2,000        180   Statlog (Medical)
splice            1,000         60   Delve (Medical)
180txt              180     19,698   SMART (Text)
300txt              300     53,914   SMART (Text)
20news            1,061     16,127   Yahoo Newsgroup (Text)

Table 3.1. Test suite of datasets.
Comparative evaluation of K-Means, GraClus, and FST-K-Means. We now compare the quality of classification using the accuracy and cohesiveness metrics for the datasets in Table 3.1 with K-Means, GraClus, and FST-K-Means.
Accuracy. Table 3.2 reports the accuracy of classification of the three schemes. The
values for K-Means and FST-K-Means are the mode (most frequently occurring value) over
100 executions. These results indicate that with the exception of the breast-cancer dataset, the
accuracy of FST-K-Means is either the highest value (5 out of 8 datasets) or is comparable to
the highest value. In general, GraClus has lower accuracy than either FST-K-Means or K-Means
with comparable values for only two of the datasets, namely, dna and 180txt.
Table 3.4 shows the improvement in accuracy of FST-K-Means relative to K-Means and GraClus. The improvement in accuracy of method A relative to method B, as a percentage, is defined as (P(A) − P(B)) × 100 / P(B), where P(A) and P(B) denote the accuracies of methods A and B, respectively; positive values represent improvements, while negative values indicate degradation.
We see that the improvement in accuracy for FST-K-Means can be as high as 57.6% relative to
K-Means (for 20news) and as high as 47.7% relative to GraClus (for 300txt). On average, the
accuracy metric obtained from FST-K-Means shows an improvement of 14.9% relative to K-
Means and 23.6% relative to GraClus.
Comparison of FST-K-Means with K-Means after PCA dimension reduction. Table 3.3 compares the accuracy obtained by K-Means using the top three principal components of the dataset (PCA-K-Means) with that of FST-K-Means. We observe that in seven out of eight datasets FST-K-Means performs better than PCA-K-Means, and on average it achieves 24.27% better accuracy
Datasets          Classification Accuracy (P)
                  K-Means   GraClus   FST-K-Means
adult a2a           70.60     52.49     74.17
australian          85.51     74.20     85.36
breast-cancer       93.70     69.69     83.16
dna                 72.68     70.75     70.75
splice              55.80     53.20     69.90
180txt              73.33     91.67     91.67
300txt              78.67     64.33     95.00
20news              46.74     54.85     73.70

Table 3.2. Accuracy of classification of K-Means, GraClus, and FST-K-Means.
relative to PCA-K-Means. This is empirical evidence that K-Means benefits from the high-dimensional embedding of FST. We also observe that the relative improvement in accuracy between PCA-K-Means and FST-K-Means is particularly high for the text datasets (180txt, 300txt, and 20news). Typically, feature selection in noisy text data is a difficult problem, and the right set of principal components that can provide high accuracy is often unknown [46]. Our feature subspace transformation brings related entities closer; in effect, common features gain more importance than unrelated features.
Additionally, in Figure 3.5 we present a study of the sensitivity of K-Means accuracy to the
use of up to 50 principal components. Table 3.3 reports results for the top three principal components,
while Figure 3.5 presents the accuracy of PCA-K-Means for up to 50 principal components. It is
clear that FST-K-Means performs better than PCA-K-Means.
Cohesiveness. Table 3.5 compares cluster cohesiveness, the internal quality metric, across
all three schemes. Once again, the values for K-Means and FST-K-Means are the mode (most
Datasets        Classification Accuracy (P)   Relative
                PCA       FST-K-Means         Improvement
adult a2a       65.81     74.17               +12.70
australian      71.17     85.36               +19.93
breast-cancer   96.05     83.16               -13.42
dna             60.41     70.75               +17.12
splice          65.10     69.90               +7.37
180txt          61.78     91.67               +48.38
300txt          62.00     95.00               +53.22
20news          49.50     73.70               +48.49

Table 3.3. Accuracy of classification for PCA (top three principal components) and FST-K-Means.
Datasets        Relative Improvement in Accuracy (percentage)
                K-Means   GraClus
adult a2a       5         41.3
australian      -1        15
breast-cancer   -11       19
dna             -2        0
splice          25.2      31.4
180txt          25.0      0
300txt          20.75     47.7
20news          57.6      34.3
Average         14.9      23.6

Table 3.4. Improvement or degradation (negative values) percentage of accuracy of FST-K-Means relative to K-Means and GraClus.
frequently occurring value) over 100 executions. Recall that a lower value of cohesiveness is
better than a higher value. The cohesiveness measure of FST-K-Means is the lowest in four
out of eight datasets, and it is comparable to the lowest values (obtained by GraClus) for the
remaining four datasets.
Table 3.6 shows the improvement of the cohesiveness metric of FST-K-Means relative to
K-Means and GraClus. The improvement in cohesiveness, as a percentage, of method A relative
to method B is defined as (J(B) − J(A)) × 100 / J(B), where J(A) and J(B) denote the cohesiveness of
methods A and B, respectively; positive values represent improvements, while negative values
indicate degradation. FST-K-Means achieves as much as a 44.8% improvement in cohesiveness
relative to K-Means and a 37.9% improvement relative to GraClus. On average, FST-K-Means
realizes an improvement in cohesiveness of 20.2% relative to K-Means and 6.6% relative to
GraClus.
Datasets        Cluster Cohesiveness (J)
                K-Means     GraClus     PCA         FST
Adult a2a       24,013      16,665      15,522      16,721
australian      4,266       3,034       2,791       2,638
breast-cancer   2,475       2,203       980         1,366
dna             84,063      65,035      65,029      65,545
splice          31,883      31,618      30,830      31,205
180txt          25,681      23,776      23,806      24,131
300txt          47,235      44,667      44,539      45,052
20news          3,851,900   3,483,591   3,029,614   3,341,400

Table 3.5. Cluster cohesiveness of K-Means, GraClus, PCA (top three principal components), and FST.
Datasets        Relative Improvement in Cohesiveness (percentage)
                K-Means   GraClus
adult a2a       30.3      -0.33
australian      38.1      13.0
breast-cancer   44.8      37.9
dna             22.0      -0.78
splice          2.1       1.3
180txt          6.0       -1.4
300txt          4.6       -0.86
20news          13.2      4.0
Average         20.2      6.6

Table 3.6. Improvement (positive values) or degradation (negative values) percentage of cohesiveness of FST-K-Means relative to K-Means and GraClus.
In summary, the results in Tables 3.2 through 3.6 clearly demonstrate that FST-K-Means
is successful in improving accuracy beyond K-Means at cohesiveness measures that are signif-
icantly better than K-Means and comparable to GraClus. The superior accuracy of K-Means
is related to the effectiveness of the distance-based measure for clustering. Likewise, the su-
perior cohesiveness of GraClus derives from using the combinatorial connectivity measure. We
conjecture that FST-K-Means shows superior accuracy and cohesiveness from a successful com-
bination of both distance and combinatorial measures through FST.
3.4 Toward optimal classification
Unsupervised classification ideally seeks 100% accuracy in labeling entities. However,
as discussed in Section 3.3, accuracy is a subjective measure based on external user specifica-
tions. It is therefore difficult to analyze algorithms based on this metric. On the other hand, the
internal metric, i.e., the cluster cohesiveness, provides an objective function dependent only on
the coordinates (features) of the entities and centroids. Thus, this metric is more amenable for
use in the analysis of clustering algorithms.
Let the dataset be represented by the N × R sparse matrix A, with N entities and R
features. Let the i-th row vector be denoted by a_{i,*} = [a_{i,1}, a_{i,2}, ..., a_{i,R}]. Now the global
mean of the entities (i.e., the centroid over all entities) is given by the R-dimensional vector
ā = (1/N) Σ_{i=1}^{N} a_{i,*}, where the j-th component of ā is ā_j = (1/N) Σ_{i=1}^{N} a_{i,j} for j = 1, ..., R.
Let Y be the centered data matrix, with y_{i,*} = (a_{i,*} − ā)ᵀ, or equivalently
y_{i,j} = a_{i,j} − ā_j, for i = 1, ..., N and j = 1, ..., R.
It has been shown in [47] that for a classification with k clusters, the optimal value of the
cohesiveness can be bounded as shown below, where N ȳ² is the trace of YᵀY and λ_i is the
i-th principal eigenvalue of YᵀY:

    N ȳ² − Σ_{i=1}^{k−1} λ_i  ≤  J  ≤  N ȳ².
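The bound is easy to evaluate on a small example. The sketch below is our illustration (not part of the dissertation's code): it restricts the data to two features so that the eigenvalues of the 2 × 2 matrix YᵀY have a closed form, computes the trace (N ȳ² in the notation above), and subtracts the k − 1 principal eigenvalues to obtain the lower bound.

```python
import math

def cohesiveness_bounds(A, k):
    """Bounds on the optimal K-Means cohesiveness J for k clusters (after [47]):
        trace(Y^T Y) - sum of top (k-1) eigenvalues of Y^T Y  <=  J  <=  trace(Y^T Y),
    where Y is the centered data matrix. Restricted to R = 2 features so the
    2x2 symmetric eigenproblem has a closed form."""
    n = len(A)
    mean = [sum(row[j] for row in A) / n for j in (0, 1)]
    Y = [[row[j] - mean[j] for j in (0, 1)] for row in A]
    # Entries of the 2x2 scatter matrix S = Y^T Y.
    s00 = sum(y[0] * y[0] for y in Y)
    s11 = sum(y[1] * y[1] for y in Y)
    s01 = sum(y[0] * y[1] for y in Y)
    trace = s00 + s11                      # equals N * ybar^2 in the text
    disc = math.sqrt((s00 - s11) ** 2 + 4 * s01 * s01)
    eigs = sorted([(trace + disc) / 2, (trace - disc) / 2], reverse=True)
    lower = trace - sum(eigs[: k - 1])
    return lower, trace

# Two well-separated groups of points; with k = 2 the lower bound is small
# because the principal eigenvalue captures the between-group spread.
A = [[0.0, 0.1], [0.1, -0.1], [5.0, 0.0], [5.1, 0.2]]
lo, hi = cohesiveness_bounds(A, k=2)
print(lo <= hi)  # True
```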
Table 3.7 shows the lower and upper bounds for the cohesiveness measure in the original
feature space along with the range observed for this measure (minimum and maximum values)
over 100 trials of FST-K-Means and K-Means. The bounds for the 20news dataset were excluded
because the size of the data makes the eigenvalue computation prohibitively expensive. These
results indicate that unlike K-Means, FST-K-Means always achieves near-optimal values for J
and it consistently satisfies the bounds for the optimal cohesiveness. Thus FST-K-Means moves
K-Means toward cohesiveness optimality and consequently toward enhanced classification ac-
curacy.
Cluster Cohesiveness: FST-K-Means
Datasets        Lower Bound   Min      Max      Upper Bound
Adult a2a       15,250        15,866   17,208   17,380
australian      2,442         2,638    3,008    3,493
breast-cancer   747           1,366    1,366    5,169
dna             63,890        65,525   65,865   67,190
splice          30,389        31,205   31,205   31,942
180txt          23,188        23,765   24,178   25,290
300txt          43,708        44,800   45,194   46,512

Cluster Cohesiveness: K-Means
Datasets        Lower Bound   Min      Max      Upper Bound
Adult a2a       15,250        24,013   24,409   17,380
australian      2,442         4,266    4,458    3,493
breast-cancer   747           2,475    2,475    5,169
dna             63,890        84,062   84,123   67,190
splice          30,389        31,882   31,884   31,942
180txt          23,188        25,651   25,730   25,290
300txt          43,708        47,220   47,288   46,512

Table 3.7. The lower bound, range, and upper bound of cohesiveness across 100 runs of FST-K-Means (top half) and K-Means (bottom half) in the original feature space. Observe that FST-K-Means consistently satisfies the optimality bounds, while K-Means fails to do so for most datasets. The highlighted numbers in the lower table indicate datasets for which the minimum value of cohesiveness exceeds the upper bound on optimality.
3.5 Chapter Summary
We have developed and evaluated a feature subspace transformation scheme to combine
the advantages of distance-based and graph-based clustering methods to enhance K-Means clus-
tering. However, our transformation is general, and as part of our future work we propose to
adapt FST to improve other unsupervised clustering methods.
We empirically demonstrate that our method, FST-K-Means, improves both the accuracy
and cohesiveness relative to popular classification methods like K-Means and GraClus.
We also show that the values of the cohesiveness metric for FST-K-Means consistently
satisfy the optimality criterion (i.e., lie within the theoretical bounds for the optimal value) for
datasets in our test suite. Thus, FST-K-Means can be viewed as moving K-Means toward cohe-
siveness optimality to achieve improved classification.
Fig. 3.2. Plots of classification accuracy and the 1-norm of the feature variance vector across FST iterations. FST iterations are continued until feature variance decreases relative to the previous iteration.
Fig. 3.3. Layout of entities in splice in the original (top) and transformed (bottom) feature space, projected onto the first three principal components. Observe that two clusters are more distinct after FST.
Fig. 3.4. Layout of entities in a two-dimensional synthetic dataset in the original (top) and transformed (bottom) feature space.
Fig. 3.5. Sensitivity of classification accuracy (P) of K-Means to the number of principal components.
Chapter 4
Similarity Graph Neighborhoods for Enhanced Supervised Classification
Supervised classification using linear discriminant analysis (LDA) [48–50] or support vector
machines (SVMs) [30, 31, 33, 34] is an effective machine learning approach to data classification.
Supervised learning algorithms like LDA and SVM utilize similarity in the data as a criterion to
build effective classification rules [12]. Such schemes typically use a measure of distance in the
high-dimensional space to quantify this similarity [51]. However, when the number of features
is large, distance measures can be ambiguous and represent false relationships between
entities [52]. In such scenarios, we need to understand the strength of the similarity to be able to
classify entities correctly. Usually, the neighborhood of an entity plays a critical role in decid-
ing the strength of a similarity [53]. The similarity between entities is typically represented by
the similarity graph whose vertices represent entities and the edge weights indicate the strength
of the similarity between entities. In forming the similarity graph, we are faced with a two-
fold problem: (i) the similarity graph is dense [23], and (ii) computing the similarity graph is
quadratic in the number of observations (N), which is expensive when N is large. Therefore,
our objective is to determine a sparse neighborhood of an entity that represents the similarity
relationships most relevant for achieving high classification accuracy.
In this chapter, we propose a data transformation scheme to enhance accuracy of two pop-
ular supervised classification schemes, LDA and SVM. We determine a sparse similarity graph
G from the labeled training data A to reveal a connectivity in the data based on their features.
Subsequently, we utilize the local neighborhoods of entities in this sparse similarity graph
neighborhood (SGN) of the training data to transform the high-dimensional feature space embedding of A.
The position of an entity is transformed by a resultant displacement computed over its local
neighborhood. We train a supervised classifier, namely LDA or SVM, on the transformed data
to obtain the classification boundary, which is subsequently used for classification of unlabeled
data in the testing phase. We refer to this combination of SGN data transformation with the LDA
or SVM classifier as SGN-LDA or SGN-SVM, respectively. We evaluate the accuracy of our
schemes SGN-LDA and SGN-SVM on a suite of seven well-known UCI datasets [54] and com-
pare it with a popular implementation of LDA [55] and SVM [30]. Furthermore, we empirically
characterize the sensitivity of classification accuracy to choices of parameter values in SGN-
SVM and present measures related to the sparsity of the similarity graph neighborhoods. These
measures indicate that the computational costs of our SGN transformations remain proportional
to the size of the original data set.
The remainder of this chapter is organized as follows. In Section 4.1, we describe the
main steps of our SGN transformation and its application to LDA or SVM. We present and
discuss our empirical results in Section 4.2. Section 4.3 presents related research, and Section 4.4
includes a summary of our findings. We would like to note that contents of this chapter are
motivated by our research in [56].
4.1 Exploiting Similarity Graph Neighborhoods for Enhancing SVM Accuracy
In this section, we develop our main contribution, namely the SGN-LDA and SGN-SVM
methods. The key steps in SGN-LDA or SGN-SVM are: (i) determining a parametrizable γ
entity neighborhood in the high-dimensional feature space of the training dataset A through
specification of a sparse graph G(B,A), which is our similarity graph; (ii) a training data
transformation scheme to obtain the transformed data Ā from the original A through displacements
of entities using γ entity neighborhoods in the similarity graph G(B,A); (iii) training an LDA or
SVM classifier on the transformed data matrix Ā; and (iv) classifying the test data. These steps
are presented in detail in the remainder of this section.
4.1.1 Determining γ-Neighborhoods in Similarity Graph G(B,A)
Consider a training dataset A ∈ R^{n×r} with n entities and r features, represented by an
n × r sparse matrix, whose entities have preassigned labels t_i ∈ {−1, 1} for all i = 1, ..., n,
where −1 and 1 each represent a class. The i-th row a_i of the matrix A represents the feature
vector of the i-th entity in the dataset. Thus, we can view entity i as being embedded in the
high-dimensional feature space (of dimension ≤ r) with coordinates given by the feature vector a_i.

We consider the training data A and form a matrix B ∈ R^{n×n} whose nonzero entries
express the similarity strength between two entities a_i and a_j in A. Each element b_{ij} of B is the
similarity strength of the pair (a_i, a_j), computed as shown in Equation 4.1, where F(a_i, a_j)
represents a similarity function of a_i and a_j. As an example, consider replacing F(a_i, a_j)
by the Euclidean distance ||a_i − a_j||_2.

    b_{ij} = F(a_i, a_j)    (4.1)
We view the similarity graph G(B,A) as the adjacency graph representation of B obtained from
dataset A. Computing the similarity function for all pairs (a_i, a_j) results in a dense graph; hence,
we focus on maintaining the sparsity of B by preserving only the nonzero elements b_{ij} that satisfy
the following two conditions. First, |b_{ij}| is greater than φ_i, where φ_i is a threshold related to the
mean over all elements of row i of B. Second, the entities associated with b_{ij}, i.e., a_i and a_j,
have q or more features in common. With appropriate choices for φ and q, the matrix B will be
relatively sparse. It can then be stored in the compressed sparse row (CSR) format [21], which
stores only nonzero edges, such that neighborhood information can be retrieved in constant,
O(1), time.
In practice, F(a_i, a_j) can be replaced by appropriate domain-specific similarity measures
that better model similarity in the training data. For example, in text mining, document
similarity is typically represented as an inner-product distance [12]. Therefore, we use the
generic term "similarity" in our description of the method, as our SGN transformation is not
limited to any particular similarity measure. Figure 4.1 illustrates the process of constructing
the neighborhood graph using a simple example.
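The two sparsification conditions above can be sketched as follows. This is our illustration (helper names are ours, not the dissertation's), assuming an RBF similarity function and rows stored as sparse lists of (feature, value) pairs.

```python
import math

def build_sparse_similarity(A, q):
    """Sketch of forming a sparse B from the rows of A (each row a list of
    (feature, value) pairs). An edge b_ij is kept only if
    (1) |b_ij| exceeds phi_i, the mean similarity of row i, and
    (2) entities i and j share at least q nonzero features."""
    n = len(A)
    dense = [dict(row) for row in A]

    def rbf(u, v):
        keys = set(u) | set(v)
        d2 = sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys)
        return math.exp(-d2)

    # Candidate similarities for pairs sharing >= q features (condition 2).
    cand = [[(j, rbf(dense[i], dense[j]))
             for j in range(n)
             if j != i and len(set(dense[i]) & set(dense[j])) >= q]
            for i in range(n)]
    # Row-wise threshold phi_i = mean similarity of row i (condition 1).
    B = {}
    for i, row in enumerate(cand):
        if not row:
            continue
        phi = sum(s for _, s in row) / len(row)
        for j, s in row:
            if abs(s) > phi:
                B[(i, j)] = s
    return B

A = [[(0, 1.0), (1, 1.0)], [(0, 1.1), (1, 0.9)], [(0, 3.0), (1, 3.0)], [(2, 5.0)]]
B = build_sparse_similarity(A, q=1)
print(sorted(B))  # [(0, 1), (1, 0), (2, 0)] -- the isolated row 3 gets no edges
```

Note that the resulting B need not be symmetric, since each row applies its own threshold φ_i.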
Next, we define an entity neighborhood for a_i in the similarity graph G(B,A) as the set
of entities a_j such that a_i and a_j share an edge in G(B,A). We further generalize the
notion of neighborhood by defining a reach set, γ-neighbor(a_i), for an entity a_i as follows:

    γ-neighbor(a_i) = {a_j : reach(a_i, a_j) ≤ γ, ∀ j = 1, ..., n}.    (4.2)
In the equation above, reach(a_i, a_j) indicates a path from a_i to a_j in G(B,A) whose length is
no more than γ. Increasing the reach with a higher γ will typically result in larger neighborhoods;
γ is viewed as a tunable parameter in our method. Figure 4.2 represents a sample γ = 1 entity
neighborhood of an entity a_i.
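The reach set of Equation 4.2 can be computed with a breadth-first search truncated at depth γ. The sketch below is our illustration (the adjacency-list representation and function name are ours), not the dissertation's implementation.

```python
from collections import deque

def gamma_neighbors(adj, i, gamma):
    """Entities a_j with reach(a_i, a_j) <= gamma: breadth-first search
    from i, truncated once paths exceed gamma edges."""
    seen = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if seen[u] == gamma:
            continue  # do not expand past depth gamma
        for v in adj.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    seen.pop(i)  # the entity itself is not its own neighbor
    return set(seen)

# Chain 0-1-2-3: widening gamma grows the neighborhood one level at a time.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(gamma_neighbors(adj, 0, 1))  # {1}
print(gamma_neighbors(adj, 0, 2))  # {1, 2}
```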
Fig. 4.1. Forming the sparse γ-neighborhood similarity graph G(B,A) from A. F(A) represents the transformation described in Section 4.1.1. G(B,A) is a weighted graph representation of matrix B.
4.1.2 Transforming Training Data through Entity Displacement Vectors.

In this step, we consider transforming the training data A to Ā using a vector measure
obtained from γ-neighborhoods in G(B,A). Each entity a_i ∈ R^r is viewed as embedded in
an r-dimensional feature space, where r ≥ 2. The geometric information for a_i is obtained
from A, where a_{ik} represents the coordinate of the i-th entity in the k-th dimension (k ≤ r).
Using this information, we first construct the displacement vector Δ_{ij} between a_i and a_j,
for all a_j ∈ γ-neighbor(a_i). The displacement vector Δ_{ij} is computed as the difference between
the vectors a_j and a_i, as shown in Equation 4.3.

    Δ_{ij} = a_j − a_i    (4.3)
We multiply the displacement vector by the product t_i t_j, where t_i, t_j ∈ {−1, 1} indicate the
classes of a_i and a_j, respectively. Equation 4.4 represents the modified displacement vector.

    Δ_{ij} = t_i t_j Δ_{ij}    (4.4)

When a_i and a_j belong to the same class, the product t_i t_j equals 1, while t_i t_j equals −1
when a_i and a_j belong to different classes. Therefore, entities belonging to the same class are
displaced towards each other, and entities belonging to different classes are displaced away
from each other.
We compute the resultant displacement vector for entity a_i as the weighted sum of all
displacement vectors Δ_{ij}, j ∈ γ-neighbor(a_i). Thus the resultant displacement ρ_i is

    ρ_i = Σ_{j ∈ γ-neighbor(a_i)} b_{ij} Δ_{ij}.    (4.5)

For example, in Figure 4.2, ρ_i equals b_{ik}Δ_{ik} + b_{ij}Δ_{ij} + b_{iu}Δ_{iu} + b_{ip}Δ_{ip} + b_{iq}Δ_{iq} for γ equal
to 1. For γ > 1, we would also include in the resultant calculation other entities that are within
a path length of less than or equal to γ from a_i.

Subsequently, we seek to update the position of a_i in the direction governed by the unit
vector ρ̂_i = ρ_i / ||ρ_i||_2. The resultant direction for a_i is typically directed towards the entities with
stronger similarity to a_i; therefore, we claim that this is the direction in which the similarity
between related entities increases. We also introduce a normalizing factor K = √N to scale ρ_i
such that the step size for updating a_i does not increase arbitrarily. We compute the step size
β_i by dividing the magnitude of the resultant displacement ρ_i by the normalizing factor K, as
shown in Equation 4.6.

    β_i = ||ρ_i||_2 / K    (4.6)

Therefore, β_i is essentially the magnitude of the resultant displacement scaled by a constant K
that limits the displacement from growing arbitrarily large.
We update the position of entity a_i in the direction of increasing similarity ρ̂_i with a step
size of β_i. Equation 4.7 represents the updated position of entity a_i.

    ā_i = a_i + β_i ρ̂_i    (4.7)
In our formulation, the rate at which similar entities are moved closer to each other is high;
therefore, we seek to observe an enhanced separation amongst unrelated entities. Furthermore,
similar entities move close to each other while maintaining their positions relative to each other.
Figure 4.3(a) shows the result of our SGN data transformation scheme when applied to
a simple graph. Figure 4.3(b) illustrates obtaining the transformed dataset Ā from the positions in
the transformed high-dimensional feature space.
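Equations 4.3 through 4.7 can be combined into a single pass over the entities. The following sketch is our illustration (dense vectors, with the weights b_ij and neighbor lists supplied as input), not the dissertation's implementation.

```python
import math

def sgn_update(A, t, B, neighbors):
    """One SGN displacement pass (Eqs. 4.3-4.7).
    A: list of feature vectors; t: labels in {-1, +1};
    B: dict (i, j) -> similarity weight b_ij;
    neighbors: list of gamma-neighbor index lists."""
    n = len(A)
    K = math.sqrt(n)  # normalizing factor K = sqrt(N)
    out = []
    for i, ai in enumerate(A):
        # Resultant rho_i = sum_j b_ij * t_i * t_j * (a_j - a_i)   (Eqs. 4.3-4.5)
        rho = [0.0] * len(ai)
        for j in neighbors[i]:
            w = B[(i, j)] * t[i] * t[j]
            for d in range(len(ai)):
                rho[d] += w * (A[j][d] - ai[d])
        norm = math.sqrt(sum(x * x for x in rho))
        if norm == 0.0:
            out.append(list(ai))
            continue
        beta = norm / K  # step size (Eq. 4.6)
        # Move along the unit resultant with step beta (Eq. 4.7).
        out.append([ai[d] + beta * rho[d] / norm for d in range(len(ai))])
    return out

A = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
t = [-1, -1, 1, 1]
nbrs = [[1, 2], [0, 2], [1, 3], [2]]
B = {(i, j): 1.0 for i, js in enumerate(nbrs) for j in js}
A2 = sgn_update(A, t, B, nbrs)
# Same-class pairs are pulled together, opposite-class pairs pushed apart:
print(A2[1][0] < A[1][0] and A2[2][0] > A[2][0])  # True
```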
4.1.3 Training an LDA or SVM on Transformed Data.
We consider the process of training two supervised classifiers, i.e., LDA and SVM, on the
transformed data.
LDA training. We obtain the class mean and covariance parameter estimates from the trans-
formed data. These estimates are required to construct the discriminant function for the LDA
classifier.
SVM training. We train an SVM [30] (as shown in Equations 2.2 and 2.3) on Ā to produce
separating hyperplanes characterized by a set of support vectors S. The set S is a subset of
the transformed training data Ā and marks the separating boundary between the two classes.
Typically, the size of the set S is small relative to n, the size of the training set A.
Using matrix notation, we rewrite Equation 2.3 as shown in Equation 4.8, where Q ∈
R^{n×n} is typically referred to as the kernel matrix [12], c = {c_i : c_i ∈ (0, 1), ∀ i = 1, ..., n},
and α is the vector of Lagrange multipliers [57]. Equation 4.8 seeks to maximize a quadratic
optimization function with respect to the Lagrange multipliers α.

    max_{α ≥ 0} (1/n) cᵀα − (1/2) αᵀQα    (4.8)
In Equation 4.8, the (i, j)-th entry q_{ij} of Q is often represented as the inner product of entities
a_i and a_j (q_{ij} = a_i · a_j). However, q_{ij} can be formed using other functions of a_i and a_j, such as
the radial basis kernel and many others. This technique, referred to as the kernel trick [12], is useful
in many situations where we have prior knowledge about the application that generates the data.
Our selection of the similarity function F(a_i, a_j) (or equivalently b_{ij}) depends on
the choice of SVM kernel function. This ensures that the transformation uses the same
similarity measure (kernel) as the SVM optimization problem, to avoid any form of
ambiguity. Therefore, our transformation seeks to transform the entity space such that it directly
impacts the kernel matrix Q in the SVM formulation.
4.1.3.1 An Example of SGN Transformation using the Radial Basis Function
In the context of SVM learning, our SGN data transform modifies Q by changing the
position of entities in the high-dimensional feature space. Our SGN transformation brings related
entities closer, thereby decreasing the quantity ||a_i − a_j||_2. To explain the impact of SGN
on Q, we consider that the radial basis function (RBF) kernel is used to form Q. Equation 4.9
presents an example in which each element of matrix Q is computed using a simple radial
basis function.

    q_{ij} = e^{−||a_i − a_j||_2²}    (4.9)

We observe that as ||a_i − a_j||_2 → 0, the quantity q_{ij} → 1, and as ||a_i − a_j||_2 → ∞, the
quantity q_{ij} → 0. Modifying Q using SGN therefore impacts the properties of Q and the
convex optimization problem defined in Equation 4.8. Figure 4.4 illustrates the SVM training
process using our simple example.
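These limits are easy to check numerically; the helper below is our illustration of Equation 4.9, not the dissertation's code.

```python
import math

def rbf_entry(ai, aj):
    """q_ij = exp(-||a_i - a_j||_2^2), the RBF kernel entry of Eq. 4.9."""
    d2 = sum((x - y) ** 2 for x, y in zip(ai, aj))
    return math.exp(-d2)

# As entities coincide, q_ij -> 1; as they separate, q_ij -> 0.
print(rbf_entry([1.0, 2.0], [1.0, 2.0]))           # 1.0
print(rbf_entry([0.0, 0.0], [10.0, 0.0]) < 1e-40)  # True
```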
4.1.4 Classifying Test Data
In the case of LDA, the discriminant function that determines the separating boundary is
used to classify test data. In a two-class problem, a test data sample is assigned to the second
class if the log-likelihood ratio (as shown in Equation 2.1) is below a certain threshold τ .
In the case of SVM, the support vectors obtained in the transformed training data Ā are
directly mapped to the corresponding entities in the original training data A. These mapped
entities form the new support vectors for the training data A. For the testing phase, consider a
test set X ∈ R^{n×r}. The similarity F(x_i, s_j) of each test sample x_i is computed with the set of k
modified support vectors S = {s_j : j = 1, ..., k}. Since the set of support vectors is small and
sparse, computing the similarities can be performed in constant time. We determine the support
vector s_k that has maximum similarity with x_i and assign x_i the same class as s_k.
4.2 Evaluation and Discussion
In this section, we first describe our experimental setup, followed by an evaluation of SGN-
LDA and SGN-SVM, first on an artificial dataset and then on a set of benchmark datasets. We also
report on sensitivity of our method to the input parameter γ (the number of levels in G(B,A)
used to determine the neighborhood), the sparsity of B, and the dimensions used to form the
dataset.
4.2.1 Experimental Setup and Metrics for Evaluation.
We implemented our SGN scheme using Matlab [55], and a sample implementation is
available for academic purposes. For our experiments, we use radial basis similarity for building
the neighborhood similarity graph. The threshold δ_i is set for each entity a_i as the mean of all
neighbor similarities, and any radial basis similarity less than this value is considered to be zero.
We use the LDA implementation provided in the Matlab Statistics Toolbox [55] and SVM-
Perf [30], a state-of-the-art two-class support vector machine, for obtaining the SVM hyperplane.
The base case, that is, training LDA and SVM-Perf on the original training data and recording
the accuracy of classification is referred to as LDA and SVM, respectively. The training error
bound parameter is set to 20. Finally, we use classification accuracy and F1-Score as metrics for
comparing the potential benefits of SGN-SVM over SVM for six benchmark datasets described
in Table 4.3. The classification accuracy is computed as the percentage of correctly classified
test data samples out of the total testing data. The F1-Score is computed as the harmonic mean
of the precision and recall.
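These two metrics can be stated compactly. The helper below is our illustration, not the dissertation's Matlab code.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy = percentage of correct predictions; F1-Score = harmonic mean
    of precision and recall for the positive class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    acc = 100.0 * correct / len(y_true)
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1

y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(round(acc, 2), round(f1, 2))  # 66.67 0.67
```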
4.2.2 Artificial Dataset Results
We now discuss the impact of SGN-LDA and SGN-SVM using an example of the four-
class dataset [58]. This dataset, created by Kleinberg et al. [58], represents artificially created
data in two dimensions, and each attribute is linearly scaled to [−1, 1]. This dataset, that had
four classes originally, was transformed to two classes that are not easily separable as they are
irregularly spread over the space. Table 4.1 provides details of the dataset. Table 4.2 presents
the accuracy obtained by LDA, SGN-LDA, SVM, and SGN-SVM on the fourclass dataset.
Dataset     Observations   Features
fourclass   862            2

Table 4.1. Details of the fourclass dataset.
Dataset     LDA     SGN-LDA   SVM     SGN-SVM
fourclass   73.20   77.91     75.87   94.47

Table 4.2. Classification accuracy (percentage) of LDA, SGN-LDA, SVM, and SGN-SVM for the fourclass dataset.
4.2.3 Empirical Results on Benchmark Datasets.
We use a suite of six benchmark datasets chosen from the UCI repository [54] for our
experiments. Table 4.3 presents the datasets with the numbers of observations and features in
each dataset. Each dataset is randomly split to form data for training (60%) and for testing
(40%).
Dataset      Observations   Features
australian   690            14
heart        270            13
liver        345            6
sonar        208            60
splice       1,000          60
german       1,000          24

Table 4.3. Description of UCI datasets.
Tables 4.4 and 4.5 present classification accuracy and F1-Score results for our six bench-
mark datasets. Figures 4.5 and 4.6 present percentage improvements in classification accuracy
obtained using SGN-LDA and SGN-SVM on the artificial and benchmark datasets. SGN-LDA
improves LDA classification accuracy on average by 5% and the F1-Score by 3.44%. SGN-SVM
performs at par or better than SVM in five out of six datasets with an average improvement in
Dataset      Accuracy             F1-Score
             LDA      SGN-LDA     LDA      SGN-LDA
australian   86.23    89.86       86.74    89.91
heart        84.63    87.96       84.27    88.13
liver        64.93    67.25       64.84    66.40
sonar        68.67    74.70       70.58    74.80
splice       78.25    80.50       78.33    80.50
german       71.75    75.50       70.67    71.30

Table 4.4. Classification accuracy (as a percentage) and F1-Score for LDA and SGN-LDA on benchmark datasets.
accuracy of 4.52% and an improvement in F1-Score of 8.09%. In the case of the australian dataset,
SVM outperforms our SGN-SVM. This dataset has a heavily unequal distribution of labeled
data, with one class much more heavily represented than the other. This adversely affects the
support vectors that SGN-SVM learns, as they are biased towards a particular class of data.
4.2.4 Why does classification improve with our SGN transformation?
Based on our empirical evaluation of SGN-LDA and SGN-SVM presented in Section 4.2,
we observe that our SGN transformation improves classification accuracy. We intuitively believe
that our SGN transformation could potentially help separate clusters that may appear very close
in a high-dimensional feature space. In this regard, the question we pose is: does increasing the
separation between the class centers improve classification accuracy?
Earlier, Hopcroft et al. [59] showed that two points generated by the same Gaussian
process in d dimensions would be a distance √(2d) apart, and the distance between two points
belonging to different Gaussian processes would be √(2d + δ²), where δ is the separation between
Dataset      Accuracy             F1-Score
             SVM      SGN-SVM     SVM      SGN-SVM
australian   85.87    82.25       85.88    82.15
heart        77.78    79.63       77.51    78.94
liver        57.25    63.62       50.26    62.82
sonar        79.52    86.27       81.75    86.63
splice       73.50    73.75       69.54    74.62
german       68.75    69.50       61.51    63.86

Table 4.5. Classification accuracy (as a percentage) and F1-Score for SVM and SGN-SVM on benchmark datasets.
the two centers. We conjecture that our SGN transformation increases the separation between
the means of the different classes, thus increasing the initial separation δ to a larger value.
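The √(2d) behavior is easy to verify numerically. The Monte Carlo sketch below is our illustration (stdlib only; the sample sizes and tolerances are our choices), drawing pairs from unit-variance Gaussians whose centers are δ apart.

```python
import math
import random

def mean_pair_distance(d, delta, trials=2000, seed=7):
    """Average distance between points drawn from two unit-variance Gaussians
    in d dimensions whose centers are delta apart (along the first axis)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y[0] += delta
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return total / trials

d = 200
same = mean_pair_distance(d, delta=0.0)
diff = mean_pair_distance(d, delta=10.0)
# Concentration predicts ~sqrt(2d) and ~sqrt(2d + delta^2):
print(abs(same - math.sqrt(2 * d)) < 0.5)          # True
print(abs(diff - math.sqrt(2 * d + 100.0)) < 0.5)  # True
```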
We first demonstrate this using the artificially generated fourclass dataset with respect
to SGN-LDA and SGN-SVM. Figure 4.7 illustrates the separation produced in the training data
by our SGN transformation and the corresponding change in LDA decision boundary. Observe
that elements within a class approximately have the same relative positions. This is an evidence
that our method constructively modifies the training data to obtain better separation and hence
better support vectors. Figures 4.8(a) and (c)(left top, and bottom) illustrate the original training
data, original testing data and the support vectors obtained by SVM. The two classes are rep-
resented by circles and crosses with (green,blue) representing training data and (magenta,cyan)
representing testing data. The black markers represent the support vectors. Figures 4.8(b) and
(d) (right top and bottom) illustrate the original training data, original testing data, and the new
support vectors obtained by SGN-SVM. We observe that our SGN-SVM captures the separating
boundary better than SVM, and our SGN-SVM finds support vectors in critical regions of the
dataset where testing data are prone to be misclassified. Additionally, SGN-SVM also identi-
fies regions where the data are primarily from one class and few support vectors are needed in
this region of the data. In the case of SGN-SVM, the separation in the two classes of data
affects the similarity values between different pairs of samples, consequently changing the decision
boundary. In contrast, LDA relies on the sample covariance and mean of the training data.
To explore this in greater detail, we construct training and testing sets from two Gaussian pro-
cesses in four-dimensions. Figure 4.9 visualizes the data in two principal component directions.
Additionally, we show the heatmap of the correlation matrix of the four features. We observe
that for certain pairs of features (indicated by dashed boxes), the correlation between features
changes during the SGN transformation, indicating that some features become more influential
than others, which explains the change in the decision boundary.
4.2.5 Sensitivity of SGN-SVM Classification Accuracy
We study the effect of varying the value of SGN input parameters on the classification
accuracy of SVM.
4.2.5.1 Effect of Growing or Shrinking the Neighborhood on Classification Accuracy
We study the effect of increasing the neighborhood levels γ on the classification accuracy.
Figure 4.10(a) presents, for three datasets, the accuracy of classification as the γ parameter
increases. We observe that accuracy improvements are obtained with a relatively small number
of levels γ in the range 1 to 3 in G(B,A). This observation indicates that the local structure in a
dataset can lead to improved training and thus improved classification.
4.2.5.2 Effect of Changing Sparsity of G(B,A) on Classification Accuracy
We next study the effect of changing the sparsity of B and the effects a sparser B has on
the classification accuracy for three datasets from our benchmark suite.
In our similarity graph G(B,A), an edge exists between entities ai
and aj
only if q or
more common features are shared between them. As expected, in Figure 4.10(b) we observe that
the number of nonzeros in B, relative to those in A, decreases as q increases. That is, B becomes
sparser for larger values of q. Plots in Figure 4.10(c) indicate that even with higher values of q,
and thus a sparser B, the accuracy of SGN-SVM is not adversely affected.
These observations indicate that our SGN-SVM is not overly sensitive to relatively small
parameter changes.
4.3 Related Research
In this section, we present briefly an overview of related research that uses the influence
of node neighborhood in graphs to improve classification. S. Chakrabarti et al. [60, 61] pro-
pose a greedy graph labeling algorithm that iteratively corrects the labeling by considering the
neighborhood around nodes. In an initial phase, it computes class probabilities for a node us-
ing a Naive Bayes classifier and then re-evaluates the probabilities after the probabilities of the
neighboring nodes are known. Angelova et al. [62] extend and generalize this method using the
theory of Markov random fields to propose a relaxation-based graph labeling technique. This
technique uses different levels of trust in different neighbors by assigning node weights based
on the similarity between neighbors.
Oh et al. [63] propose a method that assigns labels to each node of a graph by considering
the popularity of the node among all of its immediate neighbors.
In a regression model proposed by Lu and Getoor [64], the text features of a document
are used in combination with the labels of immediate neighbors to improve classification. The
feature vector for a node is modified using feature information from either all neighboring nodes
or the most popular node amongst the neighbors.
Castillo et al. [65] address the problem of web spam detection in which they improve
classification accuracy using dependencies among labels of neighboring hosts in the web graph.
Our SGN data transformation derives motivation from the above earlier work and pro-
poses a novel entity feature update technique based on resultant-displacement using neighbor-
hood connectivity. Additionally, we unify our approach with the learning phase of an LDA or
SVM classifier to improve classification.
4.4 Chapter Summary
In this chapter, our main contribution is a γ-neighborhood based training data transforma-
tion scheme (SGN) using similarity graphs to displace entities in high-dimensional feature space.
We unify our SGN data transformation with the learning phase of either a linear discriminant
analysis classifier or a support vector machine to form SGN-LDA or SGN-SVM, respectively.
We show that, on average, our SGN-LDA obtains 5.00% better accuracy than traditional LDA,
and SGN-SVM obtains 4.52% better accuracy than traditional SVM on a set of seven datasets.
Additionally, we present analysis on two aspects of our algorithm: (i) the effects of growing
or shrinking the neighborhood size γ, and (ii) the effects of sparsity of the γ-neighborhoods on
classification accuracy. Our analysis shows that our SGN-SVM algorithm improves accuracy
even with relatively sparse similarity graphs and small γ-neighborhoods.
Fig. 4.2. Entities ap, aq, au, aj, and ak represent the immediate neighbors of ai. ∆ij denotes the difference between two entity vectors ai and aj.
Fig. 4.3. (a) Transforming the training data A using G(B,A) to a new graph G(B, (A)). (b) We obtain new entity coordinates A from the graph G(B, A).
Fig. 4.4. Training an SVM on the transformed data matrix A to obtain separating hyperplanes (shown in white).
Fig. 4.5. Percentage improvements in accuracy and F1-score with SGN-LDA over LDA.
Fig. 4.6. Percentage improvements in accuracy and F1-score with SGN-SVM over SVM.
Fig. 4.7. Illustration of the fourclass dataset after performing our SGN transformation. Observe that the two classes separate out while elements maintain their relative position within their class. A better separating boundary is obtained on this transformed data using LDA (as shown above).
Fig. 4.8. Support vectors obtained (a) using SVM, and (b) using SGN-SVM training for the fourclass dataset in the training data indicating two classes with support vectors shown in black. Illustration of the (c) SVM separating plane, and (d) the SGN-SVM separating plane on testing data.
Fig. 4.9. Illustration of two four-dimensional Gaussian processes (a) before transformation and (b) after SGN transformation that have been projected to two PCA dimensions for the purpose of visualization. In both cases, the boundary is obtained using LDA. Alongside are the correlation matrices before and after the transformation. The dashed boxes indicate the feature pairs that showed a significant change in the correlation value after the SGN transformation.
Fig. 4.10. (a) Effect of increasing the entity neighborhood parameter γ in G(B,A) (considering larger neighborhoods) on classification accuracy. When the sparse neighborhood is a good approximation indicating similarity, adding elements to the neighborhood does not alter classification accuracy. Effect of varying the number of common features q used to create the similarity graph B on (b) the sparsity of B relative to A and (c) the overall classification accuracy.
Part II
Scalable Geometric Embedding and Partitioning for Parallel
Scientific Computing
Chapter 5
Background on Sparse Graph Partitioning and Sparse Linear System Solution
In this part of the dissertation we utilize geometric and combinatorial properties of sparse
graphs to improve quality and performance of parallel scientific computing applications. In
particular, we consider the key problem of parallel sparse graph partitioning and its application in
designing a hybrid solver for sparse linear systems with multiple right-hand sides. In Chapter 6
we develop a scalable parallel geometric partitioner for sparse graphs using a tree-structured
geometric embedding and parallel partitioning approach. In Chapter 7 we show that our tree-
structured geometric embedding and partitioning approach can be used to obtain a fill-reducing
ordering for sparse linear systems. Additionally, in Chapter 7 we show that such a tree-structured
reordering enables developing a direct-iterative hybrid solution approach that improves solver
performance.
The rest of this chapter presents background material required to understand Chapters 6
and 7. Section 5.1 presents notation used in Chapters 6 and 7. Section 5.2.1 presents background
on graph embedding and partitioning. Section 5.3 presents a brief overview of sparse direct and
iterative solvers.
5.1 Notation
We represent a sparse graph with a vertex set V and edge set E as G(V,E). The sizes
of the vertex set and edge set are represented as |V | and |E|, respectively. A vector quantity is
represented as v, a scalar quantity as f , and the two-norm of a vector v as ||v||.
5.2 Graph Embedding and Graph Partitioning
We first provide a brief background on graph embedding and graph partitioning for Chap-
ter 6. Subsequently, we introduce the Charm++ parallel framework that is used to develop our
parallel geometric graph embedding and partitioning scheme.
5.2.1 Graph Embedding
In an early work Eades [66] provides a heuristic for drawing graphs using the force-
directed approach. Kamada and Kawai [67] present a force-directed approach where all ver-
tices are connected by springs with length proportional to the graph distance between vertices.
Fruchterman and Reingold [68] in their method use two types of forces. Edges in the graph
are represented by springs and exert attractive forces, while simultaneously vertices represent
charged particles and experience repulsion. Battista et al. [69] provide a detailed overview of
graph embedding algorithms in their work. In a recent work, Hu [70] proposed a Barnes-Hut
structure of repulsive force computation for the spring-electric model that would reduce compu-
tational cost from O(|V|^2) to O(|V| log |V|). In an interesting work, Walshaw [71] proposed a
multilevel solution to the graph embedding problem. Harel and Koren [72] have also proposed
a graph embedding algorithm for high-dimensional spaces. Godiyal et al. [73] proposed a fast
implementation of the spring-electric model on GPU architectures using multipole expansions.
5.2.2 Graph Partitioning
Graph partitioning schemes have been studied under three main categories, namely, spec-
tral schemes, multilevel combinatorial schemes, and geometric schemes. Lipton and Tarjan [74]
in an early work proposed a separator theorem for partitioning planar graphs. Hendrickson and
Leland [75, 76] proposed a multilevel spectral bisection based graph partitioning algorithm that
recursively divides the graph based on eigenvalues and eigenvectors (the Fiedler vector). Karypis
and Kumar [77–79] proposed Metis, a multilevel scheme to partition irregular graphs that implements
a V-cycle comprising a coarsening of the source graph, followed by an initial separation
and then successive refinement [80] and projection. This method was later implemented for
distributed systems and called ParMetis. Heath and Raghavan [17] in their work proposed a
geometric nested dissection based partitioning scheme for sparse graphs. Miller et al. [16] proposed
a geometric partitioning scheme that projects the data to a higher-dimensional coordinate
space, obtains a separation in this space, and projects the separator back to the original space.
Additionally, they proved a theorem to show that the separator obtained in the higher-dimensional
space obtains better separation quality.
5.2.3 The Charm++ Parallel Framework
The Charm++ framework for large scale parallel application development provides ab-
stractions that an application programmer can use to decompose the underlying problem into a
set of objects and subsequently model the interactions between them. Objects in a Charm++ system,
also called chares, are mapped by the runtime system onto physical processors. This allows
creating a number of chares that is large relative to, and independent of, the number of processors,
thus facilitating processor virtualization. Additionally, the Charm++ runtime system provides
support for sophisticated runtime prefetching and load balancing.
5.3 Sparse Linear Solvers
In this section, we first provide an overview of sparse direct and iterative schemes for
solving sparse linear systems. Subsequently, we provide an overview of the Incomplete Cholesky
preconditioning scheme.
5.3.1 Sparse Direct Solvers
Sparse direct solvers [81–85] compute robust linear system solutions through sparse
matrix factorizations. Often during factorization, fill-in occurs when zeros in the coefficient ma-
trix become nonzeros. The efficiency of a sparse direct solver is primarily decided by its ability
to control and manage fill-in. Typically, matrix reorderings are applied during factorization to
preserve sparsity of the resulting factor [86–88]. However, the overall memory and computational
costs can grow superlinearly with the dimension of the matrix, making direct solvers unsuitable for
three-dimensional models [85].
5.3.2 Preconditioned Conjugate Gradient
The Conjugate Gradient (CG) [89] method is a popular iterative method used to solve
linear systems where the coefficient matrix A is symmetric positive definite. In practice, CG
works more effectively when used with a preconditioning scheme, since its convergence is
governed by the conditioning [90–92] of the matrix A. That is, in spite of theoretical guarantees
on convergence [93], CG can fail in practice if A is ill-conditioned.
Preconditioning schemes [19, 90, 94–96] can improve overall convergence by representing the
original linear system Ax = b as M^{-1}Ax = M^{-1}b, where M is the preconditioner matrix. The
quality of a preconditioner depends mainly on ease of its construction, its application, and its
ability to accelerate CG convergence.
5.3.3 Incomplete Cholesky Preconditioning
Incomplete Cholesky (IC) [19, 97] preconditioning computes an incomplete factor as an approximation to
the sparse Cholesky factor L, where A = LL^T. The Cholesky factor L is substantially dense, as
it incurs fill-in, i.e., there are zeros in the original matrix A that become nonzeros during factorization
[85]. The incomplete factor is obtained by eliminating fill-in from L using one of two popular
methods: (i) Incomplete Cholesky with level-of-fill (IC(k)), and (ii) Incomplete Cholesky with
drop-threshold (ICT). An IC(k) level-of-fill preconditioner [19] retains those nonzeros in the
incomplete factor that are within a path length of k + 1 from any node in the adjacency graph representation
of A. An ICT [19] preconditioner uses a drop-threshold δ to decide which elements to
drop in the factor L; that is, for δ = 10^{-2}, all nonzero elements in L smaller than δ in magnitude are dropped to
form the incomplete factor. The ICT preconditioner provides greater flexibility in creating an incomplete Cholesky
preconditioner. The IC preconditioner is applied within CG through two successive triangular solutions
in every CG iteration.
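A toy sketch of the ICT dropping rule follows. It is illustrative only: it computes a full dense factor and drops small entries afterwards, whereas a genuine ICT preconditioner drops elements during the incomplete factorization; the helper name is ours:

```python
import numpy as np

def ict_like_factor(A, delta):
    """Illustrative only: compute the full (dense) Cholesky factor of A,
    then zero entries with magnitude below the drop-threshold delta.
    A genuine ICT preconditioner drops elements during factorization."""
    L = np.linalg.cholesky(A)
    return np.where(np.abs(L) >= delta, L, 0.0)

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
L_keep = ict_like_factor(A, delta=1e-2)  # nothing is small enough to drop
L_drop = ict_like_factor(A, delta=0.6)   # the 0.5 off-diagonal is dropped
```

With a small δ the product of the kept factor with its transpose reproduces A exactly; larger δ yields a sparser, rougher approximation.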
Algorithm 2 procedure PCG(A, b, M, x_0, tol)
  r_0 = b − A x_0
  z_0 = M^{-1} r_0
  p_0 = z_0
  k = 0
  repeat
    α_k = (r_k^T z_k) / (p_k^T A p_k)
    x_{k+1} = x_k + α_k p_k
    r_{k+1} = r_k − α_k A p_k
    if ||r_{k+1}|| < tol then
      exit
    end if
    z_{k+1} = M^{-1} r_{k+1}
    β_k = (z_{k+1}^T r_{k+1}) / (z_k^T r_k)
    p_{k+1} = z_{k+1} + β_k p_k
    k = k + 1
  until ||r_{k+1}|| < tol
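Algorithm 2 can be sketched in Python as follows, with a Jacobi (diagonal) preconditioner standing in for an incomplete Cholesky factor; the function names are ours:

```python
import numpy as np

def pcg(A, b, M_inv, x0, tol=1e-8, max_iter=200):
    """Preconditioned CG following Algorithm 2. M_inv applies the
    preconditioner (an approximation to A^{-1}) to a vector."""
    x = x0.copy()
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return x
        z_new = M_inv(r_new)
        beta = (z_new @ r_new) / (z @ r)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

# Small SPD test system with a Jacobi (diagonal) preconditioner.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
d = np.diag(A)
x = pcg(A, b, lambda v: v / d, np.zeros(2))
```

In exact arithmetic CG converges for this 2 × 2 system in at most two iterations; the preconditioner only changes how fast the residual shrinks, not the solution.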
5.4 Chapter Summary
In this chapter, we presented background on specific scientific computing algorithms.
Section 5.2.3 provides an overview of the Charm++ system. In Sections 5.3.1, 5.3.2, and 5.3.3
we provided a brief overview of sparse direct methods, the preconditioned conjugate gradient
scheme, and incomplete Cholesky preconditioning.
Chapter 6
Parallel Geometric Partitioning through Sparse Graph Embedding
Many problems in large scale scientific simulations involving partial differential equa-
tions can be modeled using sparse graphs. Typically, solving such problems at scale in a dis-
tributed environment requires a domain decomposition scheme that can reduce communication
across processing elements. Sparse graph partitioning, a key step in domain decomposition,
seeks to divide the sparse representation of a computationally intensive problem into a set of
subproblems with low interdependencies.
Graph partitioning research can be grouped into three broad categories, (i) geometric
techniques [16, 98], (ii) combinatorial multilevel techniques [77–79, 99, 100], and (iii) parallel
distributed graph partitioning [78, 101] techniques that balance tradeoffs between quality, par-
allelism, and performance. Parallel combinatorial schemes are effective in reducing edge cuts
between partitions but scaling them in the petascale and exascale regimes with thousands of pro-
cessors can be a challenging problem. Alternatively, geometric schemes are considered highly
data parallel and amenable to scaling, but their partition quality is largely dependent on the initial
coordinate layout of the graph. Although, modeling and simulation applications involving finite
elements or finite-differences possess an underlying vertex coordinate structure, it is not nec-
essary for sparse graphs arising from other applications. Thus, existing geometric partitioning
methods cannot be applied for applications that do not possess an underlying geometry for the
sparse graph. The goal of our framework is to develop a scalable graph partitioning scheme for a
generic class of graph problems by combining sparse graph embedding with parallel geometric
partitioning.
In this chapter, we develop ScalaPart, a parallel graph embedding enabled geometric par-
titioning scheme using the Charm++ [15] parallel programming system. In particular, ScalaPart
is a tree-structured graph embedding combined with a geometric partitioning scheme designed to
address the challenge of scalable graph partitioning in multicore, multiprocessor environments.
First, ScalaPart obtains a coordinate structure for a given sparse graph through a tree-structured
parallel graph embedding. Subsequently, this embedding is provided as input to a data parallel
geometric partitioning scheme that seeks to partition G into k subdomains, such that the number
of edges connecting subdomains G1, . . . , Gk is reduced. In our approach, we augment our par-
allel embedding with our Charm++ data parallel implementation of the geometric partitioning
in [98]. We compare the quality and performance of our method to two popular parallel graph
partitioning schemes, namely PT-Scotch [100] and ParMetis [77–79], on 16 and 32 cores for
a suite of benchmark graphs. Our results indicate that, on average, we improve partition quality
by 7.4% relative to ParMetis and are within 3.0% of PT-Scotch. Our ScalaPart approach is
(i) 92.6% faster than ParMetis and 25.2% faster than PT-Scotch on 16 cores, and is (ii) 97.2%
faster than ParMetis and 11.4% faster than PT-Scotch on 32 cores. Additionally, our embedding
results on 8, 16, and 32 cores show that, on average, our embedding implementation obtains a
relative speedup of 1.73 as the number of cores doubles.
The remainder of this chapter is organized as follows. Section 6.1 provides background
on graph embedding and graph partitioning. Section 6.2 presents our key contribution, a paral-
lel embedding enabled geometric partitioning. We present our experimental evaluation in Sec-
tion 6.3 followed by a brief conclusion in Section 6.4.
6.1 Background and Related Work
In this section, we discuss previous research in the areas of graph embedding and graph
partitioning. In an early paper Eades [66] provides a heuristic for drawing graphs using the
force-directed approach. Kamada and Kawai [67] present a force-directed approach where all
vertices are connected by springs with lengths proportional to the graph distance between ver-
tices. Fruchterman and Reingold [68] use two types of forces in their method. Edges in the
graph are represented by springs and exert attractive forces, while simultaneously vertices repre-
sent charged particles and experience repulsion. Battista et al. [69] provide a detailed overview
of graph embedding algorithms in their paper. In a recent paper, Hu [70] proposed a Barnes-Hut
structure of repulsive force computation for the spring-electric model that would reduce compu-
tational cost from O(|V|^2) to O(|V| log |V|). In an interesting paper, Walshaw [71] proposed a
multilevel solution to the graph embedding problem. Harel and Koren [72] have also proposed
a graph embedding algorithm for high-dimensional spaces. Godiyal et al. [73] proposed a fast
implementation of the spring-electric model on GPU architectures using multipole expansions.
Graph partitioning schemes have been studied under three main categories, namely, spec-
tral schemes, multilevel combinatorial schemes, and geometric schemes. Lipton and Tarjan [74]
in an early paper, proposed a separator theorem for partitioning planar graphs. Hendrickson and
Leland [75, 76] proposed a multilevel spectral bisection based graph partitioning algorithm that
recursively divides the graph based on eigenvalues and eigenvectors (namely, the Fiedler vector).
Karypis and Kumar [77–79] proposed Metis, a multilevel scheme to partition an irregular graph
that implements a V-cycle comprising a coarsening of the source graph, followed by an initial
separation, and then successive refinement [80] and projection. This method was later imple-
mented for distributed systems and called ParMetis. Heath and Raghavan [17] in their paper,
proposed a geometric nested dissection-based partitioning scheme for sparse graphs. Miller et
al. [16] proposed a geometric partitioning scheme that projects the data to a higher-dimensional
coordinate space, obtains a separation in this space, and projects the separator back to the original
space. Additionally, they proved a theorem to show that the separator obtained in the higher
dimensional space obtains better separation quality than existing adjacency based schemes.
Our scheme is motivated by ChaNGa [102, 103] that develops a massively parallel cos-
mological simulation using the Charm++ parallel programming libraries. Our framework im-
plements a Barnes-Hut tree-structured spring-electric model for highly distributed environments
thus making large-scale graph embedding feasible. Additionally, our framework provides a so-
phisticated interface that makes it easy to develop distributed tree-structured graph embedding
applications. We show that our embedding can potentially help develop scalable geometric graph
partitioning schemes.
6.2 ScalaPart: A Parallel Graph Embedding Enabled Scalable Geometric Partitioning
In this section, we present our contribution in the form of a parallel graph embedding
enabled geometric partitioning that first generates a coordinate structure for the vertex set V of a
sparse graph G(V,E) and, subsequently, partitions it through a scalable data-parallel geometric
graph partitioning scheme. Our embedding and the geometric partitioning scheme are designed
using the Charm++ parallel libraries enabling shared and distributed memory parallelism.
Consider a sparse graph G(V,E) with a vertex set V and an edge set E. We define
a |V| × d sparse set X, in which each row xi of X represents the d-dimensional (d ≥ 2)
coordinate of vertex vi. We develop ScalaPart by combining a scalable graph embedding
framework to obtain a d-dimensional geometry for the vertex set V and a data-parallel geometric
partitioning scheme. We first seek to obtain the vertex coordinates through a parallel tree-structured
graph embedding approach. We next seek to partition the vertex set V of G(V,E)
into k subgraphs G1, . . . , Gk using the embedded coordinate structure, such that the number
of edges connecting these subgraphs is reduced. We develop a data-parallel geometric parti-
tioning scheme [98], also using Charm++, to partition the vertex geometry obtained using our
embedding. Section 6.2.1 presents details of this parallel graph embedding framework, and Sec-
tion 6.2.2 describes the parallel implementation of the geometric partitioning scheme.
6.2.1 Structure of our Charm++ Parallel Graph Embedding
We adapt the Fruchterman-Reingold spring-electric graph embedding model [68] to de-
velop our parallel graph embedding approach. In this model, the vertices are associated to
charged electric particles and edges represent springs that connect these charged particles. The
traditional sequential model, as shown in Algorithm 3, computes two different types of forces
on the vertices: (i) attractive spring forces between adjacent neighbors (fa) and (ii) repulsive
Coulombic forces between all pairs of vertices (fr). Equations 6.1 and 6.2 calculate the attrac-
tive and repulsive forces, where K and C are algorithm parameters.
fa(i, j) = ||xi − xj||^2 / K    (6.1)

fr(i, j) = −C K^2 / ||xi − xj||    (6.2)
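The two force laws can be written directly in code. This is a minimal sketch with function names of our choosing; K and C are the algorithm parameters from the text, and each function returns the signed force magnitude along the line between the two vertex positions:

```python
import numpy as np

def attractive_force(xi, xj, K):
    """f_a(i, j) of Equation 6.1: spring attraction between adjacent
    vertices at positions xi and xj, scaled by the parameter K."""
    d = np.linalg.norm(xi - xj)
    return d ** 2 / K

def repulsive_force(xi, xj, K, C):
    """f_r(i, j) of Equation 6.2: Coulombic repulsion between any two
    vertices; note the negative sign, so the force pushes them apart."""
    d = np.linalg.norm(xi - xj)
    return -C * K ** 2 / d
```

Attraction grows quadratically with distance while repulsion decays with distance, so an equilibrium layout balances the two.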
As the number of vertices |V| increases, the repulsive force computation, which has a complexity
of O(|V|^2), dominates the performance of every iteration of the classical sequential
model. Therefore, to decrease overall complexity, tree-structured approaches similar to the
Barnes-Hut algorithm [104] have been proposed [70, 73]. A tree-structured approach reduces
the complexity of the all-pairs repulsive force computation from O(|V|^2) to O(|V| log |V|).
Additionally, such a tree-structured approach naturally enables a parallel implementation of the
algorithm.
We develop a parallel tree-structured implementation of the Fruchterman-Reingold spring-
electric model for embedding vertices of a graph G(V,E) in d-dimensions (d = 3) using the
Charm++ parallel implementation framework [15]. Our implementation derives motivation from
earlier work on cosmological simulation using Charm++ [102,103,105,106]. We define a chare
as a distributed object in the Charm++ system that is capable of performing force computations
on its local set of vertices. Our implementation of the spring-electric model using Charm++
involves three main steps at each iteration: (i) distributing the vertices across multiple Charm++
chares or objects, (ii) building an octree decomposition at each chare using vertex positions, and
(iii) calculating forces on vertices using near neighbor and remote neighbor force computations.
Distributing vertices across Charm++ chares. In the Charm++ programming paradigm,
distributed objects are referred to as chares. Consider a distributed programming environment
with P processing elements and T chares with shared and distributed memory parallelism. The
number of chares T is a parameter that is set to at least the number of processing elements P
(T ≥ P ). In our framework, execution initiates through a main chare that is responsible for
creating and allocating resources to other chares. In the first step, the main chare creates a set of
worker chares that are responsible for a set of vertices. Subsequently, each worker chare loads
the adjacency structure and associates a uniformly random starting position in d dimensions to
each vertex in that chare. Henceforth, we restrict d to 3 dimensions; however, our framework
extends with relative ease to higher-dimensional embeddings.
Assigning keys to vertices. Once the initial three-dimensional vertex positions are
loaded, each worker assigns a 64-bit key to each vertex. In our model, we define the three
coordinate directions as x, y, and z. The 64-bit key is generated as a function of the x, y, and z
coordinates. This facilitates the use of coordinate positions to locate
any vertex of G(V,E). These keys are used in the next step to redistribute and load balance
vertices across worker chares.
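The dissertation does not give its exact key function; a standard choice for octree codes is a Morton (Z-order) key that interleaves the bits of the quantized x, y, and z coordinates, sketched below with names of our choosing:

```python
def morton_key(x, y, z, bits=21):
    """Interleave `bits` bits of quantized x, y, z coordinates (each
    assumed in [0, 1)) into one integer key; 3 * 21 = 63 bits fits a
    64-bit word. Nearby points tend to receive nearby keys, which is
    what the sorting and redistribution step relies on."""
    def quantize(c):
        return min(int(c * (1 << bits)), (1 << bits) - 1)
    qx, qy, qz = quantize(x), quantize(y), quantize(z)
    key = 0
    for i in range(bits):
        key |= ((qx >> i) & 1) << (3 * i)
        key |= ((qy >> i) & 1) << (3 * i + 1)
        key |= ((qz >> i) & 1) << (3 * i + 2)
    return key
```

Sorting vertices by such a key groups spatially close vertices onto the same worker chare, the property the splitter-based sort exploits.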
Sorting and redistributing vertices. Before we start building the octree representation
of the three-dimensional space, vertices need to be sorted based on their positions. The sorting
process has a two-fold purpose: it ensures that (a) a particular worker chare is responsible for
vertices in a particular region of the geometry, and (b) each worker has approximately the same
number of vertices. Therefore, the sorting procedure iteratively decides a set of splitter keys that
seek to divide the vertex set into T balanced subsets. At each iteration, this procedure evaluates
a set of splitter keys until it finds a set that meets the required termination criteria. Gioachin et
al. [102] implemented the splitter-based technique in Charm++ for sorting a set of particles in
the context of scalable cosmological simulations.
Building an octree decomposition at each chare. Each worker chare constructs an
octree decomposition of the three-dimensional coordinate space of the vertices. In the octree
representation, a node represents a group of vertices that are bounded within a cube. Each chare
can have two main types of nodes in the octree representation, namely, local and nonlocal. A
local node resides on the chare and is immediately accessible, while a nonlocal node is a
placeholder for a remote node and requires communication to fetch the remote node's data.
Calculating forces on particles. We compute attractive forces simply using direct neighbor
interactions. The repulsive force computation, which requires all-pairs computations, involves
traversing the octree in a depth-first manner to reduce computational costs. Each chare begins
traversal of the octree at the root node. Consider a vertex Vi for which we need to compute
repulsive forces due to all other vertices. In the octree method, the position of each non-leaf
node is represented by the center of mass of its group of vertices. If the distance of Vi from the
node position is within a user-defined limit θ, then this node is traversed further. Otherwise, this
node stands in for its group of vertices, and repulsive forces are computed between this node and Vi.
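The opening test described above can be sketched as follows. It follows the text's distance-limit criterion literally (the classic Barnes-Hut test instead compares cell size over distance against θ); the function name is ours:

```python
import numpy as np

def open_node(vertex_pos, node_center, theta):
    """Decide whether to traverse an octree node further. Per the text:
    if the vertex lies within distance theta of the node's center of
    mass, the node is opened; otherwise the node is treated as a single
    aggregate and one approximate repulsive interaction is computed.
    (Classic Barnes-Hut compares cell_size / distance against theta.)"""
    return np.linalg.norm(vertex_pos - node_center) < theta
```

A far-away cluster thus costs one interaction instead of one per member, which is the source of the O(|V| log |V|) bound.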
Update vertex positions. Each worker updates the positions of the vertices that it owns using the
resultant force information computed in the previous step.
6.2.2 Implementation of a Data Parallel Geometric Partitioning using Charm++
We develop a scalable data parallel geometric partitioning based on [16] using the Charm++
parallel framework. The vertex coordinates obtained from our graph embedding, discussed in
Section 6.2.1, form the input to the geometric partitioning algorithm, which obtains a separator
through a series of seven steps.
Shift coordinates. The first step involves moving the vertices such that they are centered
around the origin. This operation is highly parallel and is performed by every worker chare.
Project vertex positions up. We next obtain a stereographic projection of the points
from R^d to R^{d+1} centered around the origin. This operation can be performed by each worker
chare independently.
Compute centerpoint. We compute a center point for the newly projected points. Each
worker chare first computes its local sum of the vertex positions. Subsequently, a collective op-
eration at the main chare finds a global sum and average using the local values.
Obtain conformal map. Finding a conformal mapping involves two steps. First, we rotate
the projected points about the origin such that the centerpoint becomes a point (0, . . . , 0, r)
on the (d+1)-th axis. Subsequently, we obtain a stereographic projection of the rotated points
down to R^d. We scale all points in R^d by a factor of sqrt((1 − r)/(1 + r)). Finally, we project the
scaled points back to R^{d+1}. This part of the computation involves computing a special point
called the Radon point [98]. The Radon point is computed in a distributed setting by first creating
a random sample of the points at each worker chare. Each worker chare then computes its own
value of a Radon point. These local values are communicated to the main chare, which computes
the final value of the Radon point.
Compute great circle. The main chare obtains a random great circle to divide the set of
points in two sets. This great circle is communicated to all worker chares.
Project vertices down. Each worker chare next converts the great circle to a circle in ℜd
by reversing the dilation, rotation, and projection.
Compute separator from circle. Since our goal is to find an edge separator, the circle
obtained in the previous step can now be used to separate points into two subsets. Points that
intersect with the circle are assigned proportionally to each subset.
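The stereographic up/down maps used in the steps above can be sketched as follows, assuming the standard unit-sphere stereographic projection; the actual implementation also applies the shift, rotation, and dilation described above, and the function names are ours:

```python
import numpy as np

def project_up(X):
    """Map each row of X (a point in R^d) onto the unit sphere in
    R^{d+1}: x -> (2x, |x|^2 - 1) / (|x|^2 + 1)."""
    sq = np.sum(X ** 2, axis=1, keepdims=True)
    return np.hstack([2 * X, sq - 1]) / (sq + 1)

def project_down(P):
    """Inverse map from the sphere in R^{d+1} back to R^d; singular only
    at the north pole (last coordinate equal to 1)."""
    return P[:, :-1] / (1 - P[:, -1:])

X = np.array([[0.0, 0.0], [1.0, 2.0]])
P = project_up(X)
```

Every projected point lands exactly on the unit sphere, and projecting back down recovers the original coordinates, which is what lets the great circle computed on the sphere be converted into a separating circle in the original space.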
The shifting, scaling, higher-dimensional projection operations, and obtaining the cut
require minimal to no communication between chares. Therefore, our geometric graph partitioning
can scale easily to large problem sizes. The next section provides an analysis of the
algorithmic time complexity.
6.2.3 Complexity Analysis
As discussed in Sections 6.2.1 and 6.2.2, our approach consists of a parallel graph layout
and a parallel geometric partitioning.
The parallel algorithmic complexity of the graph layout (T_layout) is obtained as the summation
of the time required to compute (a) the attractive forces and (b) the repulsive forces on P
processing elements. The attractive force on each particle, which is calculated for each neighbor
of that particle, requires O(µ⌈|V|/P⌉) time, where µ is the average number of neighbors per
particle. The repulsive force computation performs a Barnes-Hut type of tree-based force
calculation in parallel that requires O(⌈|V|/P⌉ log⌈|V|/P⌉) time. Equation 6.3 presents T_layout,
where c_1 is a constant.

T_layout = c_1 ⌈|V|/P⌉ log⌈|V|/P⌉ + µ⌈|V|/P⌉    (6.3)
In our parallel geometric partitioning scheme [16], shifting and scaling the node positions
such that they are centered around a zero mean requires O(µ⌈|V|/P⌉) time on P processing
elements. The centerpoint computation in parallel involves a d-ary tree (d = 6) that requires
O(log P) time. Subsequently, the conformal mapping and the inertial matrix computation require
a total of 2·O(µ⌈|V|/P⌉) time. Equation 6.4 presents the total parallel time complexity of the
partitioning algorithm, where c_2 is a constant.

T_partition = c_2 ⌈|V|/P⌉ + log P    (6.4)
6.3 Experiments and Evaluation
In this section, we evaluate the quality of cut and performance of ScalaPart. We first
provide details of our experimental setup and the evaluation metrics used to compare quality and
performance. Subsequently, we present our empirical evaluation of the method and compare it
to two popular graph partitioning schemes.
6.3.1 Experimental Setup and Evaluation Metrics
Our approach is implemented using the Charm++ parallel programming libraries on an
Intel Nehalem cluster. We report quality and performance results on a suite of 26 benchmark
graphs from the University of Florida sparse matrix collection [22]. We compare the quality and
performance of our method with two popular graph partitioning schemes, i.e., ParMetis [78] and
PT-Scotch [101].
Benchmark Matrices. We report our empirical evaluation on a set of 26 benchmark
graphs from the University of Florida sparse matrix collection. Table 6.1 presents the details
of these 26 benchmark graphs.
Evaluation Metrics. Consider a graph G(V, E) that is partitioned into k subgraphs G1, . . . , Gk. Our first evaluation metric, edgecut, determines the quality of the partition by counting the number of edges that cross between the k subgraphs. We define outdegree(Gi, Gj) as the number of edges crossing from subgraph Gi to subgraph Gj. Equation 6.5 presents a formal definition of the edgecut Edgecut_G^k for k partitions of a graph G.

Edgecut_G^k = (1/2) Σ_{i=1}^{k} Σ_{j=1}^{k} outdegree(Gi, Gj)    (6.5)
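The edgecut of Equation 6.5 is straightforward to compute from an adjacency-list representation. The following Python sketch (an illustration, separate from our Charm++ implementation) counts each cut edge once per direction and halves the total:

```python
def edgecut(adjacency, part):
    """Edgecut per Equation 6.5: half the number of ordered cross-subgraph
    edge traversals (each undirected cut edge is seen once per direction)."""
    cut = 0
    for u, nbrs in adjacency.items():
        for v in nbrs:
            if part[u] != part[v]:
                cut += 1
    return cut // 2

# 4-node cycle with edges (0,1), (1,2), (2,3), (3,0), split into two halves:
# the two edges joining the halves give an edgecut of 2.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
part = {0: 0, 1: 0, 2: 1, 3: 1}
assert edgecut(adj, part) == 2
```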
6.3.2 Empirical Results
In this section, we evaluate three main aspects of our framework: (i) the performance of
our graph embedding framework, (ii) the quality of partition obtained, and (iii) the performance
of the parallel partitioning also implemented using Charm++. We consider performance on 16
and 32 cores for 16 and 32 partitions.
Graph embedding performance. Figure 6.1 presents the average times required per
iteration to embed a sparse graph using our framework on 8, 16, and 32 cores. Typically, we
execute 200 iterations to obtain a layout that can be used by our partitioning implementation.
Our results indicate that, using our embedding scheme, we obtain an average relative speedup of 1.73 each time the number of cores doubles. Both the buildtree operation and the force computations have low communication volume; as a result, with an increase in the number of cores, individual processing elements perform less work.
Partition quality. In graph partitioning, it is desired that the edgecut obtained after par-
titioning is low. Figures 6.2 and 6.3 report the edgecuts (normalized) obtained using ParMetis,
PT-Scotch, and our scheme for 16 and 32 partitions. In order to normalize the edgecuts we
choose ParMetis as our base method (set to 1) and scale the other two methods accordingly. In
these figures, a lower value indicates a better partition quality. We observe that, on average, our
scheme is 7.4% better than ParMetis and is competitive with PT-Scotch. Partition quality of our
method on average is within 3% of PT-Scotch.
Partitioning performance. We evaluate the performance of ParMetis, PT-Scotch and
our scheme by measuring the execution time of each scheme. Figures 6.4(a) and (b) present the
normalized performance of the three schemes on a single core of a multiprocessor with ParMetis
representing the base method that is set to 1. On a single processor and 16 cuts we observe that
ParMetis is 13.5% and 2.9% faster than PT-Scotch and our scheme, respectively. For 32 cuts, we
observe that ParMetis is 17.8% and 3.8% faster than PT-Scotch and our scheme, respectively.
Figures 6.5(a) and (b) indicate the normalized time (in seconds) of the 3 schemes. ParMetis
is again the base method (set to 1), and PT-Scotch and our scheme are scaled accordingly. We
observe that, on average, our scheme is 92.6% faster than ParMetis and is 25.2% faster than PT-
Scotch on 16 cores. When we double the number of cores to 32, our method is 97.2% faster than
ParMetis and is 11.4% faster than PT-Scotch. Typically, a geometric scheme is more scalable
than a combinatorial scheme, as it is more data parallel and incurs less communication overhead.
6.3.3 Discussion on observed quality and performance
The success of a graph partitioning algorithm is based on its ability to balance the trade-
offs between quality and performance. Earlier partitioning schemes, like ParMetis and PT-
Scotch, take advantage of aggressive coarsening and subsequent refinement strategies to achieve
competitive cuts with reasonable performance. We demonstrate that our scheme can take advantage of such a multilevel partitioning strategy to reduce the overall time. Our multilevel
approach involves four main steps: (a) coarsen the graph using heavy-edge matching, (b) embed
the coarsened graph in three-dimensions, (c) obtain an initial geometric cut on the coarsened
graph using the embedded coordinates, and (d) project the cut to the finest level and then refine
it. In Figure 6.6 we report the total time required to coarsen the graph, partition the coarsened
graph, and subsequently refine the obtained cut. We observe that ScalaPart is 87.14% faster than
ParMetis and 3.45% faster than PT-Scotch.
However, we show with an example that, for a large class of problems with planar ge-
ometry, multilevel schemes, such as ParMetis and PT-Scotch, may not necessarily result in high
quality cuts. Figures 6.7(a) to (c) present surface cuts obtained using ParMetis, PT-Scotch and
our scheme. The mesh in Figure 6.7 originates from a heat exchanger flow problem. A combinatorial partitioning scheme, such as ParMetis, performs aggressive coarsening based on the graph structure. However, Figure 6.7(a) clearly indicates that adjacency-based schemes can skew the cut and result in a higher edgecut. In Figure 6.7(b), although the PT-Scotch cut is also skewed, it is marginally better than that of ParMetis. In both these cases, the geometry is not known to the algorithm. In Figure 6.7(c), we focus on our scheme, which provides a geometry based on a 3D embedding of the sparse graph. We observe that geometric partitioning schemes can benefit from an efficient geometry that reduces the size of the edge cut. In particular, our geometric partitioning scheme, which is modeled using Charm++ based on [16], obtains a better cut, as it has a sense of the distribution of the points in three dimensions and benefits by selecting a split point based on this layout.
6.4 Chapter Summary
In this chapter, we developed a tree-structured parallel graph embedding and geometric
partitioning scheme using the Charm++ parallel programming system. We show that, on average,
our scheme improves edgecuts by 7.4% over ParMetis and is within 3.0% of PT-Scotch. Our
ScalaPart approach is (i) 92.6% faster than ParMetis and is 25.2% faster than PT-Scotch on
16 cores, and is (ii) 97.2% faster than ParMetis and 11.4% faster than PT-Scotch on 32 cores.
Additionally, our embedding results on 8, 16, and 32 cores show that, on average, our embedding
implementation obtains a relative speedup of 1.73 as the number of cores doubles.
Algorithm 3 SequentialSpringElectricModel(G, x, δ)
  converged = FALSE
  step = initial step length
  t = step scaling constant
  E = ∞  /* Energy */
  while converged == FALSE do
    x0 = x
    E0 = E
    E = 0
    for i ∈ V do
      f = 0
      for j ∈ adjacency(i) do
        f = f + (f_a(i, j) / ||x_j − x_i||) (x_j − x_i)
      end for
      for j ≠ i, j ∈ V do
        f = f + (f_r(i, j) / ||x_j − x_i||) (x_j − x_i)
      end for
      x_i = x_i + step · (f / ||f||)
      E = E + ||f||²
    end for
    if E < E0 then
      counter = counter + 1
      if counter >= 5 then
        counter = 0
        step = step / t
      end if
    else
      counter = 0
      step = step · t
    end if
    if ||x − x0|| < δ then
      converged = TRUE
    end if
  end while
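As an illustration, the following pure-Python sketch implements the sequential spring-electric iteration of Algorithm 3 for 2D layouts. The force laws f_a(i, j) = ||x_j − x_i||²/K and f_r(i, j) = −K²/||x_j − x_i|| are the standard spring-electric choices and are assumed here for illustration only; the two loops of the algorithm are fused, since the repulsive sum runs over all j ≠ i and the attractive term is added only for neighbors.

```python
import math

def spring_electric_layout(adj, pos, K=1.0, delta=1e-4, t=0.9, max_iter=300):
    """Sequential spring-electric layout sketch (2D positions).

    adj: dict node -> list of neighbors; pos: dict node -> [x, y].
    Force laws are the usual spring-electric choices (an assumption here):
    attractive f_a = d^2/K over neighbors, repulsive f_r = -K^2/d over all pairs.
    """
    step, counter, E = 0.1 * K, 0, float("inf")
    for _ in range(max_iter):
        x0 = {i: list(p) for i, p in pos.items()}   # previous layout
        E0, E = E, 0.0
        for i in pos:                                # Gauss-Seidel style update
            f = [0.0, 0.0]
            for j in pos:
                if j == i:
                    continue
                dx = [pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]]
                d = math.hypot(dx[0], dx[1]) or 1e-12
                coef = -K * K / (d * d)              # repulsion, all j != i
                if j in adj[i]:
                    coef += d / K                    # attraction, neighbors only
                f[0] += coef * dx[0]
                f[1] += coef * dx[1]
            norm = math.hypot(f[0], f[1]) or 1e-12
            pos[i][0] += step * f[0] / norm          # move by step along f
            pos[i][1] += step * f[1] / norm
            E += norm * norm
        if E < E0:                                   # adaptive step length
            counter += 1
            if counter >= 5:
                counter, step = 0, step / t
        else:
            counter, step = 0, step * t
        if max(math.hypot(pos[i][0] - x0[i][0], pos[i][1] - x0[i][1])
               for i in pos) < delta:                # convergence test
            break
    return pos

# Two connected particles placed too close: repulsion pushes them apart
# toward the natural edge length K.
out = spring_electric_layout({0: [1], 1: [0]}, {0: [0.0, 0.0], 1: [0.05, 0.0]})
d = math.hypot(out[1][0] - out[0][0], out[1][1] - out[0][1])
assert 0.3 < d < 2.0
```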
Algorithm 4 ParallelSpringElectricModel(G, x, δ)
  /* Charm++ implementation of the spring-electric model */
  step = initial step length
  t = step scaling constant
  E = ∞
  load particles (Worker Chare)
  assign keys (Worker Chare)
  for iter = 1, 2, . . . , MAX_ITER do
    sort particles (Worker Chare)
    build octree (Worker Chare)
    compute forces (Worker Chare):
      for i ∈ V do
        f = 0
        for j ∈ adjacency(i) do
          f = f + (f_a(i, j) / ||x_j − x_i||) (x_j − x_i)
        end for
        for j ≠ i, j ∈ V do
          if x_j is a near neighbor of x_i then
            f = f + (f_r(i, j) / ||x_j − x_i||) (x_j − x_i)
          else
            x_cm = center of mass of the far-neighbor tree node
            f = f + (f_r(i, j) / ||x_cm − x_i||) (x_cm − x_i)
          end if
        end for
        x_i = x_i + step · (f / ||f||)
        E = E + ||f||²
      end for
    update positions (Worker Chare)
    update step (Main Chare):
      gather E from the Worker Chares
      if E < E0 then
        counter = counter + 1
        if counter >= 5 then
          counter = 0
          step = step / t
        end if
      else
        counter = 0
        step = step · t
      end if
      E0 = E; E = 0
  end for
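The far-neighbor branch above is the Barnes-Hut idea: the aggregate repulsion from a distant tree node is approximated by a single interaction with its center of mass. A toy one-dimensional Python illustration, with the assumed force law f_r = −K²/d and unit particle masses:

```python
# Far-field approximation: replace the summed repulsive forces from a
# distant cluster by one interaction with its center of mass.
# (Toy 1D illustration of the Barnes-Hut idea, not the octree code.)
K = 1.0
xi = 0.0
cluster = [10.0, 10.5, 11.0]             # far-away particles
exact = sum(-K * K / (xj - xi) for xj in cluster)
xcm = sum(cluster) / len(cluster)        # center of mass (unit masses)
approx = len(cluster) * (-K * K / (xcm - xi))
assert abs(exact - approx) / abs(exact) < 0.01   # within 1% for far clusters
```

The approximation error shrinks as the cluster gets farther away relative to its diameter, which is why the tree traversal applies it only to far neighbors.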
#    Graph         |V|      |E|         Description
1    linverse      11,999   95,977      statistical
2    crystm02      13,965   322,905     materials
3    Pres Poisson  14,822   715,804     CFD
4    olafu         16,146   1,015,156   structural
5    gyro          17,361   1,021,159   optimization
6    msc23052      23,052   1,142,686   structural
7    aug3d         24,300   69,984      2D/3D
8    aug2d         29,008   76,832      2D/3D
9    wathen120     36,441   565,761     random
10   mario001      38,434   204,912     2D/3D
11   jnlbrng1      40,000   199,200     optimization
12   gridgena      48,962   512,084     optimization
13   oilpan        73,752   2,148,558   structural
14   finan512      74,752   596,992     economic
15   cont-201      80,595   438,795     optimization
16   denormal      89,400   1,156,224   optimization
17   s3dkq4m2      90,449   4,427,725   structural
18   s3dkt3m2      90,449   3,686,223   structural
19   shipsec1      140,874  3,568,176   structural
20   Dubcova3      146,689  3,636,643   2D/3D
21   cont-300      180,895  988,195     optimization
22   d pretok      182,730  1,641,672   2D/3D
23   pwtk          217,918  11,524,432  structural
24   Lin           256,000  1,766,400   eigenvalue
25   mario002      389,874  2,097,566   2D/3D
26   helm2d03      392,257  2,741,935   2D/3D

Table 6.1. Details of benchmark graphs indicating the number of nodes (|V|) and edges (|E|) in the graph with a brief description of the application domain.
Fig. 6.1. Average layout time per iteration for benchmark graphs using 8, 16, and 32 cores. We include results for 8 cores to demonstrate the effect of doubling the number of cores.
Fig. 6.2. Normalized edge cuts for 16 partitions with ParMetis as base set to 1.
Fig. 6.3. Normalized edge cuts for 32 partitions with ParMetis as base set to 1.
(a)
(b)
Fig. 6.4. Normalized time for (a) 16 partitions and (b) 32 partitions on 1 core with ParMetis as base set to 1.
(a)
(b)
Fig. 6.5. Normalized time for (a) 16 partitions on 16 cores and (b) 32 partitions on 32 cores with ParMetis as base set to 1.
Fig. 6.6. Normalized time for 32 partitions on 32 cores with ParMetis as base set to 1. The above normalized times represent the total time, including time for a coarsening phase, embedding of the coarsened graph, partitioning of the coarsened graph, and subsequent refinement of the cut.
(a)
(b)
(c)
Fig. 6.7. A two-partition illustration of a graph from a heat exchanger flow problem using (a) ParMetis, (b) PT-Scotch, and (c) our framework. This example demonstrates that a geometric partitioning scheme clearly achieves a competitive edgecut with a good underlying embedding of the graph.
Chapter 7
A Multilevel Cholesky Conjugate Gradients Hybrid Solver for Linear Systems with
Multiple Right-hand Sides
The computational simulation of partial differential equation-based models using finite
difference or finite element methods involves the solution of large sparse linear systems [107].
The cost of the linear system solution often dominates the overall execution time of such applications,
and in many instances a linear system with the same coefficient matrix is solved for a sequence
of right-hand side vectors. In this chapter, we specifically consider the multiple right-hand side
case and seek efficient alternatives to preconditioned conjugate gradient-based solutions when
the coefficient matrix is symmetric positive definite and sparse.
Consider a linear system Ax = b, where A is the coefficient matrix. Sparse solvers
for such systems can be grouped into two broad categories, namely, direct [90,108] using sparse
Cholesky, and iterative [19,94] using conjugate gradients and its preconditioned variants. Sparse
direct solvers are designed to be robust but are mainly limited by large arithmetic and memory
costs that grow superlinearly [85]. Conversely, sparse iterative solvers, such as conjugate gradients [89] and its preconditioned forms, require only enough memory to store the coefficient matrix and a few additional vectors, but they lack robustness.
We propose a multilevel hybrid of preconditioned conjugate gradients [89] and Cholesky,
where Cholesky is applied on leaf submatrices and PCG is applied to subsystems corresponding
to higher levels in the tree with partial solutions aggregated and corrected to provide the overall
solution. We expect the setup costs to be higher than traditional PCG; however, these higher
setup costs can be amortized over faster solutions for multiple right-hand sides. In particular,
we seek a hybrid solver that can provide faster solutions than PCG for multiple right-hand sides.
We propose a tree-based substructuring of A in which leaf nodes of the tree represent subma-
trices of A. Nodes at higher levels of the tree represent the recursive coupling between these
submatrices. Our tree-structured partitioning, discussed in Chapter 6, can be used to obtain such
a substructuring of A.
The remainder of this chapter is organized as follows. Section 7.1 develops our key con-
tribution, our tree-structured hybrid solver. Section 7.2 presents our experiments and empirical
evaluation, with a brief conclusion in Section 7.3. We note that the contents of this
chapter have been obtained from our research in [18].
7.1 A New Multilevel Sparse Cholesky-PCG Hybrid Solver
In this section, we develop a new tree-based sparse hybrid solver. We present the basic
idea by first developing a one-level hybrid solver. Subsequently, we generalize the one-level to a
multilevel hybrid solver with multiple levels of nested solves. In particular, we elaborate on the
solver design for solving a symmetric positive definite sparse linear system.
Consider a sparse linear system Au = b, where A is an n× n sparse symmetric positive
definite matrix (A ∈ Rn×n), b ∈ Rn is an n × 1 column vector, and u ∈ Rn is the n × 1
solution vector. Direct and indirect schemes can both be used to solve this sparse linear system;
however, the choice depends on the tradeoffs between memory demands and robustness. Our
goal is to develop a tree-structured hybrid solution framework that benefits from both direct and
indirect schemes to obtain efficient solutions with low memory overheads for different multiple
right-hand side vectors b. Additionally, a tree-based multilevel method extends easily to parallel
execution environments. In developing this framework, we refer to earlier work on domain
decomposition schemes [20] for sparse direct [109] and sparse iterative solvers [110, 111].
7.1.1 A One-level Hybrid
The zero-nonzero structure of the sparse matrix A enables splitting A into a set of two disjoint subdomains Ω1 and Ω2 that are separated by another smaller subdomain Γ. Each subdomain is a supernode comprising multiple nodes, and this partitioning of A can be interpreted as a supernodal tree, where Ω1 and Ω2 represent leaf nodes rooted at Γ. The decomposition of
A into a blocked system and its supernodal tree representation form the core of our tree-based
hybrid solver. Section 7.1.1.1 presents an in-depth description of our one-level supernodal tree
construction. Furthermore, in our tree-based representation, smaller subsystems at the leaf nodes
could potentially be solved using a sparse direct solver to obtain a part of the solution vector.
Consequently, the partial solutions could be coupled at the root node of the tree using an iterative
scheme to obtain the complete solution. We provide a detailed description of this scheme for a
one-level tree in Section 7.1.1.2.
7.1.1.1 Obtaining a Tree-structured Aggregate of the Coefficient Matrix A
We split A into two disjoint subdomains connected by a separating subdomain. We
apply the nested dissection [112–115] algorithm to obtain this split; however, a split could be
obtained using any efficient geometric [116] or combinatorial graph partitioning scheme [114].
Earlier such partitioning has been used for developing hybrid preconditioners [117] for sparse
linear systems. We obtain a reordering of A such that (a) rows and columns belonging to each
subdomain are numbered contiguously, and (b) those comprising the separating subdomain are
numbered higher than the disjoint subdomains.
Figures 7.1(a) to (c) represent a one-level nested dissection ordering of a sample matrix
A and the corresponding supernodal tree. In Figure 7.1(c), Σ(0) and Σ(1) represent the disjoint
subdomains. Σ(0 : 1) indicates the subdomain block that separates Σ(0) through Σ(1). Nodes
in Σ(0 : 1) are numbered higher than nodes in both Σ(0) and Σ(1).
Consider a permutation matrix P that represents this tree-structured reordering of A. An
n× n symmetric sparse matrix B is obtained by permuting A using P . Equation 7.1 represents
the permuted matrix B.
B = P A P^T    (7.1)
7.1.1.2 Constructing a Hybrid Solution Scheme using the Tree-structure
In earlier research on domain decomposition solvers, Mansfield [118] showed that for a
sparse linear system Au = b, if matrix A is reordered to an equivalent blocked representation (as
shown in Equation 7.2) using domain decomposition, then it can be solved efficiently by solving
a set of three linear systems.
B = | B11  B21^T |
    | B21  B22   | ,    (7.2)
Consider the blockwise equivalent linear system Bx = f, where x and f are permutations of u and b corresponding to the permutation used to derive B from A. It is possible to solve this system efficiently blockwise by making a simple assumption: x1, i.e., the first block component of x, can be expressed as the sum of two parts, x1 = x1^S + x1^D. The block equation under this assumption can be written as shown in Equation 7.3.

| B11  B21^T |  | x1 = x1^S + x1^D |     | f1 |
| B21  B22   |  | x2               |  =  | f2 | .    (7.3)
This system can be solved for x by first solving for x1^D as shown in Equation 7.5. Equation 7.4 presents the coefficient matrix S for the subsystem corresponding to x2.

S = B22 − B21 B11⁻¹ B21^T    (7.4)

Subsequently, using x1^D, we solve for x2 and x1^S as shown in Equations 7.6 and 7.7, respectively.

B11 x1^D = f1,    (7.5)
S x2 = f2 − B21 x1^D,    (7.6)
B11 x1^S = −B21^T x2.    (7.7)
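Mansfield's three-solve sequence can be traced on a minimal Python example in which every block is a 1×1 scalar (a sketch only; in the solver each line corresponds to a sparse block solve, and the values are illustrative):

```python
# Scalar stand-ins for the blocks of Equation 7.2 (illustrative values only).
B11, B21, B22 = 4.0, 1.0, 3.0      # B21^T == B21 for scalars
f1, f2 = 5.0, 4.0

xD1 = f1 / B11                     # Eq. 7.5: B11 xD1 = f1
S = B22 - B21 * (1.0 / B11) * B21  # Eq. 7.4: Schur complement of B11
x2 = (f2 - B21 * xD1) / S          # Eq. 7.6: S x2 = f2 - B21 xD1
xS1 = -B21 * x2 / B11              # Eq. 7.7: B11 xS1 = -B21^T x2
x1 = xS1 + xD1                     # x1 = x1^S + x1^D

# The recombined solution satisfies the original blocked system B x = f.
assert abs(B11 * x1 + B21 * x2 - f1) < 1e-12
assert abs(B21 * x1 + B22 * x2 - f2) < 1e-12
```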
We derive our motivation from this approach and propose a scheme that is based on our tree-based restructuring of A (discussed in Section 7.1.1.1). Consider a one-level tree with one separator as the root and two disjoint subdomains as leaf nodes of the tree. In matrix notation, we represent this tree-structured linear system using Equation 7.8, with the assumptions that B11 x1^S = −B31^T x3 and B22 x2^S = −B32^T x3.

| B11  0    B31^T |  | x1 = x1^S + x1^D |     | f1 |
| 0    B22  B32^T |  | x2 = x2^S + x2^D |  =  | f2 |
| B31  B32  B33   |  | x3               |     | f3 |    (7.8)
Since B is a symmetric positive definite matrix, a Cholesky decomposition of B can be represented as B = LL^T, where L is the sparse lower triangular factor. Equation 7.9 indicates the blockwise Cholesky factorization of B.

| B11  0    B31^T |     | L11  0    0   |  | L11^T  0      L31^T |
| 0    B22  B32^T |  =  | 0    L22  0   |  | 0      L22^T  L32^T |
| B31  B32  B33   |     | L31  L32  L33 |  | 0      0      L33^T |    (7.9)
The Cholesky factor L can be computed blockwise as shown in Equations 7.10a to 7.10e.

B11 = L11 L11^T,    (7.10a)
B31 = L31 L11^T,    (7.10b)
B22 = L22 L22^T,    (7.10c)
B32 = L32 L22^T,    (7.10d)
B33 = L31 L31^T + L32 L32^T + L33 L33^T.    (7.10e)
In our supernodal representation, L11 and L22 are computed at the leaf nodes and are reused to compute L31 and L32 as a sequence of linear system solutions as shown in Equations 7.11 and 7.12.

L11 L31^T = B31^T    (7.11)
L22 L32^T = B32^T    (7.12)
The final solution x is obtained using a series of intermediate steps. We first compute x1^D and x2^D as shown in Equations 7.13 and 7.14.

L11 L11^T x1^D = f1    (7.13)
L22 L22^T x2^D = f2    (7.14)
Equation 7.15a computes the coefficient matrix S for the separator linear system. Rewriting this expression using the Cholesky factors yields a simpler system, as shown in Equation 7.15b.

S = B33 − B31 B11⁻¹ B31^T − B32 B22⁻¹ B32^T    (7.15a)
  = L33 L33^T    (7.15b)

We then solve Equation 7.16 to compute x3 using the PCG scheme with an incomplete Cholesky preconditioner.

S x3 = f3 − B31 x1^D − B32 x2^D    (7.16)
Subsequently, we compute x1^S and x2^S using Equations 7.17 and 7.18.

L11 L11^T x1^S = −B31^T x3    (7.17)
L22 L22^T x2^S = −B32^T x3    (7.18)

The final solution x is computed by aggregating x1^D, x1^S, x2^D, x2^S, and x3.
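The complete one-level sequence of Equations 7.10 through 7.18 can be traced the same way on scalar stand-ins for the blocks. In the actual solver the leaf systems are solved with sparse Cholesky and the separator system of Equation 7.16 with PCG; the sketch below (with illustrative values) replaces the PCG solve by a scalar division:

```python
import math

# Scalar stand-ins for the blocks of Equation 7.8 (illustrative values only).
B11, B22, B31, B32, B33 = 4.0, 9.0, 2.0, 3.0, 7.0
f1, f2, f3 = 6.0, 12.0, 10.0

# Blockwise Cholesky, Equations 7.10a-7.10e (7.11 and 7.12 reduce to divisions).
L11 = math.sqrt(B11)
L31 = B31 / L11
L22 = math.sqrt(B22)
L32 = B32 / L22
L33 = math.sqrt(B33 - L31**2 - L32**2)

# Leaf (direct) solves, Equations 7.13 and 7.14.
xD1 = f1 / (L11 * L11)
xD2 = f2 / (L22 * L22)

# Separator system, Equations 7.15 and 7.16 (PCG in the real solver).
S = B33 - B31**2 / B11 - B32**2 / B22
assert abs(S - L33 * L33) < 1e-12          # Eq. 7.15b
x3 = (f3 - B31 * xD1 - B32 * xD2) / S

# Correction solves, Equations 7.17 and 7.18, and aggregation.
xS1 = -B31 * x3 / B11
xS2 = -B32 * x3 / B22
x1, x2 = xD1 + xS1, xD2 + xS2

# The aggregated solution satisfies the original system of Equation 7.8.
assert abs(B11 * x1 + B31 * x3 - f1) < 1e-12
assert abs(B22 * x2 + B32 * x3 - f2) < 1e-12
assert abs(B31 * x1 + B32 * x2 + B33 * x3 - f3) < 1e-12
```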
7.1.2 A Multilevel Tree-based Hybrid Solver
We apply our tree-structured reordering procedure recursively to each disjoint subdomain
until the desired number of levels η is reached. The above recursion can be represented in the
form of a supernodal tree where each level in the tree corresponds to one step in the recursion.
Therefore, the disjoint submatrices at the lowest level of the recursion, which form the leaf nodes
of the supernodal tree, are connected by a structured hierarchy of supernodal separators.
In a supernodal tree with multiple levels, we take a bottom-up approach and evaluate
the solution (as discussed above in Section 7.1.1.2) at each subtree. We aggregate the solution
at each subtree and pass it on to one level higher as the tree folds during the solution process.
Figures 7.2(a) to (c) represent a two-level nested dissection ordering of the same sample matrix A
as discussed in Figure 7.1 and the corresponding supernodal tree. In Figure 7.2(c), Σ(0) through Σ(3) represent the disjoint subdomains. Σ(r : s) indicates the subdomain block that separates Σ(r) through Σ(s) for r, s ∈ {0, 1, 2, 3}. Nodes in Σ(r : s) are numbered higher than nodes in
both Σ(r) and Σ(s). Therefore, Σ(0 : 3) contains the nodes with the highest numbering.
7.1.3 Computational Costs of our Hybrid Solver
Consider a sparse linear system Bx = f, where B is an n × n sparse matrix. We define µ(B) as the number of nonzeros in matrix B. The computational costs of our hybrid solver include (a) the cost to set up the Cholesky factors of each subdomain and (b) the cost to solve each subdomain.

Setup cost (Φ_setup). The setup cost is computed as the sum of the squares of the number of nonzeros in each column of the Cholesky factor L of A. L∗,j represents the j-th column of L. Equation 7.19 provides an estimate of the setup cost.

Φ_setup = Σ_{j=1}^{n} µ²(L∗,j)    (7.19)
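For instance, given the column sparsity of L, Equation 7.19 is a direct tally; the pattern below is hypothetical and serves only to illustrate the computation:

```python
# Columns of a toy lower-triangular factor L, given as the row indices of
# the nonzeros in each column (a hypothetical pattern, for illustration).
L_columns = {0: [0, 2, 3], 1: [1, 3], 2: [2, 3], 3: [3]}

# Equation 7.19: sum of squared per-column nonzero counts.
phi_setup = sum(len(rows) ** 2 for rows in L_columns.values())
assert phi_setup == 9 + 4 + 4 + 1
```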
Solution cost (Φ_solve). Consider a supernodal tree representation of A with k levels, where level zero is at the root of the tree and level k indicates the leaf nodes. We define n_d, d = 1, . . . , 2^k, as the number of nodes in the d-th subdomain and n_s, s = 1, . . . , (2^k − 1), as the number of nodes in the s-th separator in our supernodal tree. We solve the d-th subdomain at level k using a sparse direct solver in time τ_d. For sparse matrices, τ_d is O(n_d^1.5) when the application domain is two-dimensional and O(n_d²) in three dimensions. We solve the s-th separator block B_s using PCG with incomplete Cholesky preconditioning in O(µ(B_s)) + 2µ(L_s) time, where L_s is the IC factor of B_s. Equation 7.20 presents the cost for the tree-structured solution of B.

Φ_solve = Σ_{d=1}^{2^k} τ_d + Σ_{s=1}^{2^k−1} [O(µ(B_s)) + 2µ(L_s)]    (7.20)
Therefore, the total cost Φ_hybrid is the summation of the setup cost and the total solution cost for m right-hand side vectors.

Φ_hybrid = Φ_setup + m Φ_solve
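This cost model makes the amortization argument concrete: the hybrid wins once the number of right-hand sides m exceeds the break-even point Φ_setup / (Φ_pcg − Φ_solve). The sketch below uses illustrative operation counts, not measured values:

```python
# Illustrative per-solve costs (millions of operations); not measurements.
phi_setup = 100.0          # one-time hybrid setup (Eq. 7.19)
phi_solve_hybrid = 2.0     # hybrid cost per right-hand side (Eq. 7.20)
phi_solve_pcg = 6.0        # PCG cost per right-hand side

def hybrid_total(m):
    return phi_setup + m * phi_solve_hybrid   # Phi_hybrid = setup + m * solve

# Break-even point: setup divided by the per-solve savings.
m_break = phi_setup / (phi_solve_pcg - phi_solve_hybrid)
assert m_break == 25.0
assert hybrid_total(30) < 30 * phi_solve_pcg  # hybrid wins beyond break-even
assert hybrid_total(10) > 10 * phi_solve_pcg  # PCG wins for few right-hand sides
```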
7.2 Experiments and Evaluation
In this section, we present our experimental setup, an empirical evaluation of our solver
framework using a suite of benchmark datasets, and discussion of our results.
7.2.1 Experimental Setup
We implemented our hybrid framework in Matlab [41] using Metis [114] to obtain the
nested dissection ordering. We used the sparse Cholesky direct solver to compute the direct
solves and the PCG solver with Incomplete Cholesky preconditioner for iterative solves. We
report results on a suite of benchmark matrices from the University of Florida Sparse Matrix
Collection [1]. We report statistics for a total of 10 different random repeated right-hand side
vectors b.
Metrics for evaluation. We evaluate solver performance using the number of floating
point operations and the solution accuracy using relative error. Additionally, we analyze the sensitivity of our hybrid solver performance to the number of levels in the supernodal
tree and the number of right-hand sides.
Benchmark Matrices. We evaluate our solver on a suite of 12 benchmark matrices from
the University of Florida Sparse Matrix collection [1]. Table 7.1 lists the details of these matrices.
These matrices are typically obtained from discretization of partial differential equations using
finite element or finite difference methods.
Table 7.1. Benchmark matrices from the University of Florida Sparse Matrix Collection [1].

Matrix     N       Nonzeros  Description
nos7       729     4,617     Poisson's Equation in Unit Cube
bcsstk09   1,083   18,437    Stiffness Matrix - Square Plate Clamped
bcsstk10   1,086   22,070    Stiffness Matrix - Buckling of Hot Washer
bcsstk11   1,473   34,241    Stiffness Matrix - Ore Car (Lumped Masses)
bcsstk27   1,224   56,126    Stiffness Matrix - Buckling Problem (Andy Mera)
bcsstk14   1,806   63,454    Stiffness Matrix - Roof of Omni Coliseum, Atlanta
bcsstk18   11,948  149,090   Stiffness Matrix - R.E. Ginna Nuclear Power Station
bcsstk16   4,884   290,378   Stiffness Matrix - Corps of Engineers Dam
crystm01   4,875   105,339   FEM Crystal Free Vibration Mass Matrix
crystm02   13,965  322,905   FEM Crystal Free Vibration Mass Matrix
s1rmt3m1   5,489   217,651   Matrix from a Static Analysis of a Cylindrical Shell
s1rmq4m1   5,489   262,411   Matrix from a Static Analysis of a Cylindrical Shell
7.2.2 Evaluation and Discussion
In Table 7.2 we report the operations count (in millions) required by PCG for our bench-
mark matrices and 10 repeated right-hand side vectors b. A lower count indicates higher per-
formance, and the absence of a value indicates that the solution did not converge to the desired
tolerance of 10−8. We report our results for four matrix orderings (Natural, RCM, Nested Dis-
section, and MMD) and two preconditioners (Incomplete Cholesky with zero level-of-fill (IC0)
and drop-threshold (ICT) of 10−2). In Table 7.2, our hybrid direct-iterative solver (Hybrid-DI)
has the lowest operations count in eight out of twelve matrices. On average, our method is 1.87 times faster than the best PCG and ordering combination. For the bcsstk10 matrix,
the gain in performance is as high as 7.36 times compared to the best PCG performance. How-
ever, for some matrices, such as crystm02, we observe a degradation in performance. For this
class of matrices, it is often difficult to find a compact node separator. Consequently, this leads
to very small blocks (leaf nodes) that are solved using Cholesky and very large separators that
are solved using PCG. Therefore, the coupling between these is not effective.
Table 7.3 presents the relative error in the final solution. We compute relative error
defined as ||x∗ − x||/||x∗||, where x∗ is the true solution of the linear system. We
observe that PCG does not converge for a majority of the orderings even with the use of a
preconditioner. However, our tree-based hybrid method is able to produce a solution within an
accuracy of 10−4 or higher.
For a sparse matrix A, we define OpsHybridDI(A) as the operation count for our Hybrid-DI method, OpsBestPCG(A) as the operation count for the best PCG performance, and OpsAvgPCG(A) as the average PCG operation count across the different variants of PCG used in Table 7.2. We compute
Speedupbest(A) as the ratio of OpsBestPCG(A) and OpsHybridDI(A). Additionally, we com-
pute Speedupavg(A) as the ratio of OpsAvgPCG(A) and OpsHybridDI(A).
Figure 7.3(a) presents Speedupbest(A), the speedup obtained by using our hybrid solver
compared to the best PCG performance. Figure 7.3(b) presents Speedupavg(A), the speedup
using Hybrid-DI compared to an average PCG performance across different variants. We ob-
serve that certain matrices, such as bcsstk10, can obtain over seven times the performance gain
by using our tree-based hybrid scheme. However, there are also matrices, such as crystm01 and
crystm02, that are not good candidates for our hybrid scheme and may not benefit from this
method. In particular, this suggests that the success of our methodology is related to the un-
derlying structure and properties of the matrix A, which is an attribute of the application. We
anticipate that setup cost will decrease with multiple levels; however, convergence will be slower
for a given right-hand side vector. Consequently, the overall operation count may increase. How-
ever, for a specific number of right-hand sides, this tradeoff could be exploited to obtain the right
balance. For a problem with few right-hand sides, it may be important to consider a hybrid
solver with a relatively small number of levels that also incurs low memory overheads. Figure 7.4
illustrates this using the example of the bcsstk10 matrix. For fewer levels in the tree, the memory
overheads due to sparse factorization could increase setup costs.
Effect of increasing tree-levels on solver performance. We perform a sensitivity study
on six of our best performing matrices from our benchmark suite to understand solver perfor-
mance as we increase tree levels. Figure 7.5(a) indicates the solver performance for 10 right-hand
sides and levels increasing from 1 to 5. We observe that the floating point operations increase
asymptotically with the number of levels. Therefore, it is desirable to set the number of levels such
that the memory demands of the largest subdomain block are satisfied while maintaining lower
setup costs.
Effect of increasing right-hand sides on solver performance. Figure 7.5(b) presents
sensitivity of increasing the number of right-hand sides on the hybrid solver performance and
compares it to the best PCG result for our six best performing matrices. We observe that as the
number of right-hand sides increases, the solution cost dominates the performance. Additionally,
our method sustains higher performance compared to PCG, and speedup increases with a greater
number of right-hand sides.
7.3 Chapter Summary
In this chapter, we developed a multilevel tree-based Cholesky-PCG hybrid solver that
solves a linear system Ax = b with the same coefficient matrix but multiple right-hand sides.
For our test matrices, we show that our hybrid approach for obtaining a sparse linear system
solution is 1.87 times faster than the best performing PCG solution for multiple right-hand sides.
We believe that such a hybrid solution framework could maintain the accuracy of a sparse direct
solution while reducing memory demands. Additionally, a tree-structured approach enables us
to consider the development of a parallel implementation of our hybrid approach.
(a)
(b)
(c)
Fig. 7.1. (a) Matrix bcsstk11 with natural ordering; (b) a one-level nested dissection ordering of bcsstk11; (c) a supernodal tree representation of the one-level ordering.
(a)
(b)
(c)
Fig. 7.2. (a) Matrix bcsstk11 with natural ordering; (b) a two-level nested dissection ordering of bcsstk11; (c) a supernodal tree representation of the two-level ordering.
Table 7.2. Operation counts (in millions) for 10 repeated right-hand side vectors and PCG with IC level-of-fill (IC0) and drop-threshold (ICT) preconditioners using (a) natural ordering (NAT), (b) RCM ordering, (c) nested dissection (ND) ordering, and (d) minimum degree (MMD) ordering. Hybrid-DI represents our tree-based hybrid solver framework. Values in bold represent the best performance.
Matrix     NAT-IC0  NAT-ICT  RCM-IC0   RCM-ICT   ND-IC0  ND-ICT    MMD-IC0  MMD-ICT   Hybrid-DI
nos7       7.73     1.97     7.72      12.21     -       32.37     -        32.56     2.95
bcsstk09   -        11.60    -         583.60    -       687.51    -        -         10.42
bcsstk10   -        35.54    -         -         -       -         -        -         4.83
bcsstk11   -        -        -         -         -       -         -        -         16.08
bcsstk27   -        11.51    -         579.48    -       -         -        -         10.42
bcsstk14   -        55.21    -         -         -       -         -        -         24.15
bcsstk18   -        985.15   -         -         -       -         -        -         247.75
bcsstk16   986.89   144.01   1,180.21  1,257.05  -       2,128.73  -        2,005.01  275.54
crystm01   30.51    5.27     30.51     24.13     -       58.63     -        69.66     104.10
crystm02   93.42    10.82    93.43     73.46     -       193.60    -        190.17    911.31
s1rmt3m1   -        432.19   -         5,795.08  -       -         -        -         228.93
s1rmq4m1   -        507.84   -         2,079.73  -       -         -        -         311.53
Table 7.3. Relative error for 10 repeated right-hand side vectors and PCG with IC level-of-fill (IC0) and drop-threshold (ICT) preconditioners using (a) natural ordering (NAT), (b) RCM ordering, (c) nested dissection (ND) ordering, and (d) minimum degree (MMD) ordering. Hybrid-DI represents our tree-based hybrid solver framework.
Matrix     NAT-IC0  NAT-ICT  RCM-IC0  RCM-ICT  ND-IC0  ND-ICT   MMD-IC0  MMD-ICT  Hybrid
nos7       4.8E-06  1.3E-06  5.2E-06  2.0E-06  -       7.0E-06  -        1.6E-06  1.7E-04
bcsstk09   -        0.0E+00  -        0.0E+00  -       -        -        -        0.0E+00
bcsstk10   -        9.5E-06  -        -        -       -        -        -        0.0E+00
bcsstk11   -        -        -        -        -       -        -        -        2.6E-05
bcsstk27   -        0.0E+00  -        0.0E+00  -       -        -        -        0.0E+00
bcsstk14   -        6.1E-06  -        -        -       -        -        -        1.9E-06
bcsstk16   0.0E+00  0.0E+00  0.0E+00  0.0E+00  -       0.0E+00  -        0.0E+00  0.0E+00
bcsstk18   -        4.2E-05  -        -        -       -        -        -        3.1E-05
crystm01   0.0E+00  0.0E+00  0.0E+00  0.0E+00  -       0.0E+00  -        0.0E+00  0.0E+00
crystm02   0.0E+00  0.0E+00  0.0E+00  0.0E+00  -       0.0E+00  -        0.0E+00  0.0E+00
s1rmt3m1   -        4.1E-06  -        6.2E-04  -       -        -        -        2.8E-06
s1rmq4m1   -        4.1E-06  -        5.0E-06  -       -        -        -        2.4E-06
Fig. 7.3. Speedup obtained by our method over (a) the best PCG and ordering combination, and (b) average PCG performance across different variants for 10 right-hand sides. A speedup greater than 1 indicates improvement.
Fig. 7.4. Impact on performance with increase in tree levels and number of right-hand sides for the bcsstk10 matrix.
Fig. 7.5. Effect of increasing (a) tree levels, and (b) right-hand sides on the operation counts (in millions) of our Hybrid solver compared to the best PCG performance for 10 right-hand sides.
Chapter 8
Discussion
We conclude this dissertation by discussing our major research findings, associated open
problems, and possible extensions to our work on combining geometry and structure for large
scale data analysis and parallel scientific computing. Our work is motivated by the fact that
research in many areas of science and engineering is increasingly relying on data-driven computational modeling and analysis. The underlying data in these applications is typically sparse and
possesses a high-dimensional geometry. Additionally, the sparsity of the data can be represented
as a graph. The key challenge is to develop hybrid scalable approaches that can utilize the spar-
sity of the graph representation of the data to manipulate the high-dimensional geometry. In this
dissertation, we approached this problem in two parts with an application perspective. In the
first part, we considered improvements for the classification problem in data analysis. In the
second part, we focused on scalable domain decomposition for parallel scientific computing on
multicore-multiprocessors.
In the first part of this dissertation, we begin by focusing on unsupervised classification
and then on supervised classification. Typically, applications that do not possess prelabeled ob-
servations have to rely on unsupervised classification to understand a grouping in the data. In
Chapter 3, we developed a new feature subspace transformation scheme that iteratively combines
geometry-based distance measures with graph-based measures to enhance K-Means clustering.
Our FST scheme first forms a sparse entity-to-entity weighted graph indicating sparse relation-
ships in the data. We then transform the geometry of the data iteratively by bringing related
entities in the graph close together. This transformed data is subsequently clustered using a
traditional K-Means implementation. The strength of a relationship between two entities is di-
rectly proportional to the edge weight in the entity-to-entity graph and depends inversely on the
geometric separation between related entities in the high-dimensional space.
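The iterative pull of related entities toward one another can be sketched in a few lines. This is a minimal illustrative variant under assumed parameters (`alpha`, `iters`) and a simple linear pull rule; it is not the FST update of Chapter 3.

```python
import numpy as np

def fst_transform(X, W, alpha=0.1, iters=5):
    """Illustrative feature-subspace-style transformation.
    X : (n, d) data matrix; W : (n, n) symmetric edge-weight matrix.
    Each iteration displaces every entity toward its graph neighbors;
    the pull grows with edge weight and with current separation, so
    far-apart related entities move the most. alpha/iters are assumed."""
    X = X.astype(float).copy()
    for _ in range(iters):
        disp = np.zeros_like(X)
        rows, cols = np.nonzero(W)
        for i, j in zip(rows, cols):
            # displacement toward neighbor j, scaled by edge weight
            disp[i] += alpha * W[i, j] * (X[j] - X[i])
        X += disp
    return X
```

The transformed matrix would then be handed to an off-the-shelf K-Means implementation, as the text describes.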
Since we seek to bring related entities closer, the strength of the relationship is higher
when related entities are far apart and gradually decreases as the algorithm iterates. It is an open
research problem to determine a domain-specific similarity function that best describes the entity
relationships. Our results indicate that on average FST-K-Means improves accuracy by 14.9%
relative to K-Means and by 23.6% relative to multilevel K-Means (GraClus). Furthermore, FST-
K-Means achieves up to a 44.8% improvement in cluster cohesiveness relative to K-Means and
up to a 37.9% improvement relative to GraClus. Additionally, we investigate the cohesiveness of
the obtained clusters and show that our FST-K-Means algorithm consistently satisfies the opti-
mality criterion for cluster cohesiveness (i.e., it lies within theoretical upper and lower bounds).
Another interesting open research problem is whether FST can be combined with other cluster-
ing methods to achieve similar gains. We plan to address open problems concerning this research
in our future work.
We next consider transformations for enhancing supervised classification with prelabeled
data. In Chapter 4, we developed a data transformation for prelabeled data that utilizes local
neighborhoods in the similarity neighborhood graph of the training data to manipulate the high-
dimensional geometry. Our Similarity Graph Neighborhood (SGN) approach transforms the
geometry of the training data by applying displacements to entities based on their neighborhood.
This procedure can be viewed as a smoothing of the training data in order to obtain a better
separation. In the presence of nonlinearities in the data, an open problem associated with this
approach is determining appropriate sparse entity neighborhoods. In our current scheme, we
consider level-based nearest neighbors (a γ-neighborhood) to define the neighborhood of an entity.
However, it would be interesting to consider neighborhoods adaptively based on approaches, such
as the node degree of the entity or density of nodes in different regions of the high-dimensional
space. In our SGN classifier, the boundary obtained in the transformed space is mapped to
the original space for subsequently classifying test data. Our SGN transformation, on average,
enhances the quality of a Linear Discriminant (LD) classifier by 5.0% and a Support Vector Ma-
chine (SVM) classifier by 4.52%. We believe that our SGN approach would be more effective
if the test data is projected and tested in the transformed space. However, developing such a
projection for our SGN transformation is an open research problem. We plan to address these
problems related to the SGN transformation in our future work.
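One concrete reading of the neighborhood-based displacement described above is a neighborhood-mean smoothing, sketched below. The `neighbors` map and step size `beta` are illustrative assumptions, not the γ-neighborhood construction or displacement rule of Chapter 4.

```python
import numpy as np

def sgn_smooth(X, neighbors, beta=0.5):
    """Illustrative similarity-graph-neighborhood smoothing.
    X : (n, d) training data; neighbors : dict mapping entity index
    to a list of neighbor indices from the similarity graph.
    Each listed entity is displaced a fraction beta of the way toward
    the mean of its neighborhood; unlisted entities are left in place."""
    X = np.asarray(X, dtype=float)
    Y = X.copy()
    for i, nbrs in neighbors.items():
        if nbrs:
            centroid = X[list(nbrs)].mean(axis=0)
            Y[i] = X[i] + beta * (centroid - X[i])  # move toward neighborhood mean
    return Y
```

A classifier such as an SVM would then be trained on the smoothed coordinates, with its decision boundary mapped back to the original space as the text describes.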
In the second part of this dissertation, we address the challenge of developing a scalable
graph partitioning scheme by combining geometric and structural properties of sparse graphs.
In Chapter 6, we develop ScalaPart as a parallel graph embedding approach coupled with a
scalable geometric partitioning scheme. In earlier chapters in Part I, the data originally had a
high-dimensional geometry that we could manipulate using the sparse graph structure. How-
ever, ScalaPart begins with a sparse graph structure and a uniformly random layout and itera-
tively converges to a meaningful geometric layout. This layout can subsequently be partitioned
by a scalable geometric partitioning scheme. We developed the embedding enabled partitioning
approach using the Charm++ parallel programming system to achieve scalability. Our embedding
results indicate that on average we achieve a speedup of 1.78 each time the number of cores
doubles. The partition quality obtained using our implementation of the geometric partitioning
scheme in Charm++ is, on average, 7.4% better than ParMetis and remains within 3.0% of PT-
Scotch. Our geometric partitioning scheme in ScalaPart performs 92.6% faster than ParMetis
and 25.2% faster than PT-Scotch on 16 cores. On 32 cores, our scheme is 97.2% faster than
ParMetis and 11.4% faster than PT-Scotch. We believe that as the number of cores increases, our
scheme will be more scalable than established schemes such as ParMetis or PT-Scotch, since a
geometric partitioning approach is highly data-parallel. However, we would need to verify our
claim through experiments on a larger number of cores, which remains an open question. Since
our framework is developed using the Charm++ libraries, it is also independent of the underly-
ing MPI (or OpenMP) implementation, thus making it easily portable to other architectures. We
plan to address open problems related to ScalaPart in our future work.
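The two phases, iterative embedding from a uniformly random layout followed by geometric partitioning of the resulting coordinates, can be sketched as below. The attraction-only update and median bisection are simplifications for illustration; ScalaPart's force model and partitioner are considerably more elaborate.

```python
import numpy as np

def embed_and_bisect(adj, dims=2, iters=100, step=0.1, seed=0):
    """Illustrative two-phase scheme: (1) start from a random layout and
    repeatedly pull each vertex toward the mean of its neighbors, so the
    graph converges to a meaningful geometry; (2) split the coordinates
    at the median along the axis of widest spread.
    adj : list of neighbor lists; all parameters are assumed values."""
    n = len(adj)
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1, 1, size=(n, dims))  # uniformly random layout
    for _ in range(iters):
        new = pos.copy()
        for v, nbrs in enumerate(adj):
            if nbrs:
                # attraction only: move toward the neighbor centroid
                new[v] += step * (pos[list(nbrs)].mean(axis=0) - pos[v])
        pos = new
    axis = np.argmax(pos.max(axis=0) - pos.min(axis=0))  # widest axis
    part = (pos[:, axis] > np.median(pos[:, axis])).astype(int)
    return pos, part
```

Because the second phase operates only on coordinates, it is highly data-parallel, which is the property the scalability argument above rests on.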
In Chapter 7, we developed a multilevel tree-based Cholesky-PCG hybrid solver that
solves a linear system Ax = b with the same coefficient matrix but multiple right-hand sides.
The first step in our hybrid approach obtains a tree-structured recursive partitioning of A into
groups of blocks and their separators. We believe that our ScalaPart framework could be easily
adapted to obtain this tree-structured domain decomposition using the geometry of A. However,
an important research problem is to determine an optimal depth of this tree that can balance
tradeoffs between computation and communication. We demonstrate, on a suite of benchmark
matrices, that our hybrid approach for obtaining a sparse linear system solution is 1.87 times
faster than the best performing PCG solution for multiple right-hand sides. We believe that
such a hybrid solution framework could maintain the accuracy of a sparse direct solution while
reducing memory demands on multicore-multiprocessors. We address open problems related to
this problem in our future work.
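The motivating arithmetic, pay for one factorization and amortize it over many cheap triangular solves, can be sketched on a dense SPD system. The sparse, tree-structured hybrid of Chapter 7 is far more involved; this is only the underlying cost argument.

```python
import numpy as np

def solve_many(A, B):
    """Factor-once, solve-many sketch for an SPD matrix A and a matrix B
    whose columns are the repeated right-hand sides. The Cholesky factor
    is computed once (O(n^3) dense); each additional right-hand side then
    costs only two triangular solves (O(n^2) dense), instead of a full
    iterative solve per right-hand side."""
    L = np.linalg.cholesky(A)                 # one-time factorization
    X = np.empty_like(B, dtype=float)
    for k in range(B.shape[1]):
        y = np.linalg.solve(L, B[:, k])       # forward solve  L y = b_k
        X[:, k] = np.linalg.solve(L.T, y)     # backward solve L^T x = y
    return X
```

In the hybrid framework, only the diagonal blocks are factored directly while the separator system is handled iteratively, trading some of this reuse for lower memory demand.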
Over the next few years, we plan to continue doing research on sparsity-aware algo-
rithms for data mining and parallel scientific computing in multicore-multiprocessor environ-
ments. We believe that scientific algorithms for high-dimensional data will need to utilize both
sparse structure and high-dimensional geometry to achieve precise scalable models. Some in-
teresting research problems that we would like to investigate include scalable data mining for
heterogeneous architectures, analysis of large time varying sparse graphs, and scalable solvers
for exascale platforms.
References
[1] T. Davis. University of Florida Sparse Matrix Collection. URL:
http://www.cise.ufl.edu/research/sparse/matrices/.
[2] M. W. Berry. Survey of Text Mining I: Clustering, Classification, and Retrieval. Springer,
2003.
[3] N. D. Lawrence, M. Girolami, M. Rattray, and G. Sanguinetti. Learning and Inference in
Computational Systems Biology. MIT Press, 2010.
[4] T. J. Loredo. The promise of Bayesian inference for astrophysics, 1992.
[5] K. Rajan. Combinatorial materials sciences: Experimental strategies for accelerated
knowledge discovery. Annual Review of Materials Research, 38(1):299–322, 2008.
[6] A. Chatterjee, S. Bhowmick, and P. Raghavan. Feature subspace transformations for
enhancing k-means clustering. In CIKM, pages 1801–1804, 2010.
[7] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Applied Statistics, 28(1),
1979.
[8] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors a mul-
tilevel approach. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
29(11):1944–1957, Nov. 2007.
[9] I. S. Dhillon, Y. Guan, and B. Kulis. A fast kernel-based multilevel algorithm for graph
clustering. In KDD ’05: Proceedings of the eleventh ACM SIGKDD international con-
ference on Knowledge discovery in data mining, pages 629–634, New York, NY, USA,
2005. ACM.
[10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized
cuts. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 551–556, New York, NY, USA, 2004. ACM
Press.
[11] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement.
Software - Practice and Experience, 21(11):1129–1164, 1991.
[12] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and
Statistics). Springer, August 2006.
[13] V. N. Vapnik. The Nature of Statistical Learning Theory (Information Science and Statis-
tics). Springer, November 1999.
[14] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2:121–167, 1998.
[15] L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object oriented system
based on C++. In Proceedings of the eighth annual conference on Object-oriented pro-
gramming systems, languages, and applications, OOPSLA ’93, pages 91–108, New York,
NY, USA, 1993. ACM.
[16] J. R. Gilbert, G. L. Miller, and S. Teng. Geometric mesh partitioning: Implementation and
experiments. In In Proceedings of International Parallel Processing Symposium, pages
418–427, 1995.
[17] M. T. Heath and P. Raghavan. A cartesian parallel nested dissection algorithm, 1995.
[18] J. D. Booth, A. Chatterjee, P. Raghavan, and M. Frasca. A multilevel cholesky conju-
gate gradients hybrid solver for linear systems with multiple right-hand sides. Procedia
Computer Science, 4:2307 – 2316, 2011. Proceedings of the International Conference on
Computational Science, ICCS 2011.
[19] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Co., Boston,
MA, 1996.
[20] T. F. Chan and B. Smith. Domain decomposition and multigrid algorithms for elliptic
problems on unstructured meshes. Technical Report CAM 93–42, University of California
at Los Angeles, 1993.
[21] A. George and J. Liu. Computer Solution of Large Sparse Positive Definite Systems.
Prentice Hall, 1981.
[22] T. A. Davis. University of Florida sparse matrix collection. NA Digest, 92, 1994.
[23] J. C. Platt. Fast embedding of sparse music similarity graphs. In Advances in Neural
Information Processing Systems. MIT Press, 2004.
[24] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning:Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
[25] H. Steinhaus. Sur la division des corps matériels en parties. Bulletin of Acad. Polon. Sci,
IV (C1. III):801–804, 1956.
[26] S. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory,
28:129–137, 1982.
[27] G. Ball and D. Hall. Isodata, a novel method of data analysis and pattern classification.
Tech. report NTIS AD 699616, Stanford Research Institute, Stanford, CA, 1965.
[28] J. MacQueen. Some methods for classification and analysis of multivariate observations.
Fifth Berkeley Symposium on Mathematics, Statistics and Probability, University of Cali-
fornia Press, 1967.
[29] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm, 2001.
[30] Thorsten Joachims. Training linear svms in linear time. In KDD ’06: Proceedings of the
12th ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 217–226, New York, NY, USA, 2006. ACM.
[31] C. Chang and C. J. Lin. LIBSVM: A library for support vector machines, 2011.
[32] M. C. Ferris and T. S. Munson. Interior point methods for massive support vector ma-
chines. SIAM Journal on Optimization, 13:783–804, 2003.
[33] O. L. Mangasarian and David R. Musicant. Lagrangian support vector machines. J. Mach.
Learn. Res., 1:161–177, 2001.
[34] E. Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. PSVM: Parallelizing
support vector machines on distributed computers. In NIPS, 2007.
[35] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, August 1999.
[36] I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet allocation in web spam filtering. In
AIRWeb ’08: Proceedings of the 4th international workshop on Adversarial information
retrieval on the web, pages 29–32, New York, NY, USA, 2008. ACM.
[37] X. Li. A volume segmentation algorithm for medical image based on k-means cluster-
ing. In IIH-MSP ’08: Proceedings of the 2008 International Conference on Intelligent
Information Hiding and Multimedia Signal Processing, pages 881–884, Washington, DC,
USA, 2008. IEEE Computer Society.
[38] K. Kim and H. Ahn. A recommender system using ga k-means clustering in an online
shopping market. Expert Systems with Applications, 34(2):1200 – 1209, 2008.
[39] D. Arthur and S. Vassilvitskii. How slow is the k-means method? In SCG ’06: Proceed-
ings of the twenty-second annual symposium on Computational geometry, pages 144–153,
New York, NY, USA, 2006. ACM.
[40] S. M. Savaresi, D. L. Boley, S. Bittanti, and G. Gazzaniga. Cluster selection in divisive
clustering algorithms. In proceeding SIAM Datamining Conference, Arlington, VA, 2002.
[41] R. Neal. Assessing relevance determination methods using delve. In Neural Networks
and Machine Learning, pages 97–129. Springer-Verlag, 1998.
[42] The MathWorks Inc. Matlab and simulink for technical computing, 2007.
http://www.mathworks.com.
[43] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.
http://www.ics.uci.edu/∼mlearn/MLRepository.html.
[44] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine learning, neural and statistical
classification, 1994.
[45] G. Salton. Smart data set, 1971. ftp://ftp.cs.cornell.edu/pub/smart.
[46] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth Interna-
tional Conference on Machine Learning, pages 331–339, 1995.
[47] L. Douglas Baker and Andrew Kachites McCallum. Distributional clustering of words
for text classification. In SIGIR ’98: Proceedings of the 21st annual international ACM
SIGIR conference on Research and development in information retrieval, pages 96–103,
New York, NY, USA, 1998. ACM.
[48] C. Ding and X. He. K-means clustering via principal component analysis. pages 225–232.
ACM Press, 2004.
[49] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals Eugen.,
7:179–188, 1936.
[50] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical
Association, 84:165–175, 1989.
[51] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers. Fisher discriminant
analysis with kernels. pages 41–48. IEEE, 1999.
[52] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential
function method in pattern recognition learning. Automation and Remote Control, pages
821–837, 1964.
[53] T. Oommen, D. Misra, N. K. C. Twarakavi, A. Prakash, B. Sahoo, and S. Bandopadhyay.
An objective analysis of support vector machine based classification for remote sensing.
Mathematical Geosciences, 40(4):409, 2008.
[54] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly
detection in bipartite graphs. In proceedings ICDM, pages 418–425, 2005.
[55] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.
[56] MathWorks. Matlab and simulink for technical computing, 2007.
[57] A. Chatterjee and P. Raghavan. Similarity graph neighborhoods for enhanced supervised
classification. In Review, 2011.
[58] T. K. Ho and E. M. Kleinberg. Building projectable classifiers of arbitrary complexity.
Pattern Recognition, International Conference on, 2:880, 1996.
[59] J. E. Hopcroft, S. Soundarajan, and L. Wang. The future of computer science. Interna-
tional Journal of Software and Informatics (IJSI), 2011.
[60] Soumen Chakrabarti. Mining the Web: Discovering Knowledge from HyperText Data.
Sci. & Tech. Books, 2002.
[61] Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced hypertext categorization
using hyperlinks. In SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international
conference on Management of data, pages 307–318, New York, NY, USA, 1998. ACM.
[62] R. Angelova and G. Weikum. Graph-based text classification: Learn from your neighbors.
In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 485–492, New York, NY, USA,
2006. ACM.
[63] H. J. Oh, S. H. Myaeng, and M. H. Lee. A practical hypertext catergorization method
using links and incrementally available class information. In SIGIR ’00: Proceedings of
the 23rd annual international ACM SIGIR conference on Research and development in
information retrieval, pages 264–271, New York, NY, USA, 2000. ACM.
[64] L. Getoor. Link mining: A new data mining challenge. SIGKDD Explor. Newsl., 5(1):84–
89, 2003.
[65] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors:
Web spam detection using the web topology. In SIGIR ’07: Proceedings of the 30th
annual international ACM SIGIR conference on Research and development in information
retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
[66] P. A. Eades. A heuristic for graph drawing. In Congressus Numerantium, volume 42,
pages 149–160, 1984.
[67] T. Kamada and S. Kawai. An algorithm for drawing general undirected graphs. Inf.
Process. Lett., 31:7–15, April 1989.
[68] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement.
Software - Practice and Experience, 21(11):1129–1164, 1991.
[69] G. D. Battista, P. Eades, R. Tamassia, and I. G. Tollis. Graph drawing: Algorithms for the
visualization of graphs. 1999.
[70] Y. F. Hu. Efficient, high-quality force-directed graph drawing. The Mathematica Journal,
10:37–71, 2006.
[71] C. Walshaw. A multilevel algorithm for force-directed graph drawing. In Proceedings of
the 8th International Symposium on Graph Drawing, GD ’00, pages 171–182, London,
UK, 2001. Springer-Verlag.
[72] D. Harel and Y. Koren. Graph drawing by high-dimensional embedding. In Revised
Papers from the 10th International Symposium on Graph Drawing, GD ’02, pages 207–
219, London, UK, 2002. Springer-Verlag.
[73] A. Godiyal, J. Hoberock, M. Garland, and J. C. Hart. Graph drawing. chapter Rapid Mul-
tipole Graph Drawing on the GPU, pages 90–101. Springer-Verlag, Berlin, Heidelberg,
2009.
[74] R.J. Lipton and R.E. Tarjan. A separator theorem for planar graphs. SIAM J. Appl. Math.,
36:177–199, 1979.
[75] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. Technical
Report SAND93-1301, Sandia National Laboratories, Albuquerque, NM 87185, 1993.
[76] B. Hendrickson and R. Leland. An improved spectral graph partitioning algorithm for
mapping parallel computations. SIAM Journal on Scientific Computing, 16.
[77] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning
irregular graphs. SIAM Journal on Scientific Computing.
[78] George Karypis and Vipin Kumar. Parallel multilevel k-way partitioning scheme for ir-
regular graphs. In Supercomputing Conference.
[79] G. Karypis and V. Kumar. A fast and highly quality multilevel scheme for partitioning
irregular graphs. SIAM Journal on Scientific Computing, 1998.
[80] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network
partitions. In Proceedings of the 19th Design Automation Conference, DAC ’82, pages
175–181, Piscataway, NJ, USA, 1982. IEEE Press.
[81] J. Demmel, S. C. Eisenstat, J. R. Gilbert, Xiaoye Sherry Li, and Joseph W. H. Liu. A
supernodal approach to sparse partial pivoting. Technical Report CSL–94–14, Xerox
Palo Alto Research Center, 1995.
[82] A. Gupta, F. Gustavson, M. Joshi, G. Karypis, and V. Kumar. PSPASES: An efficient
and scalable parallel sparse direct solver, 1999. See http://www-users.cs.umn.
edu/˜mjoshi/pspases.
[83] Anshul Gupta and Vipin Kumar. A scalable parallel algorithm for sparse matrix factoriza-
tion. Technical Report 94-19, Department of Computer Science, University of Minnesota,
Minneapolis, MN, 1994. A short version submitted for Supercomputing ’94.
[84] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Clarendon
Press, Oxford, 1986.
[85] J. A. George and J. W-H. Liu. Computer Solution of Large Sparse Positive Definite Sys-
tems. Prentice-Hall Inc., Englewood Cliffs, NJ, 1981.
[86] P. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering
algorithm. SIAM J. Matrix Anal. Appl., 17:886–905, 1996.
[87] B. Hendrickson and E. Rothberg. Improving the runtime and quality of nested dissection
ordering. Technical report, Sandia National Laboratories, Albuquerque, NM 87185, 1996.
[88] I. Lee, P. Raghavan, and E. G. Ng. Ordering schemes for preconditioning with sparse
incomplete factors. In Proceedings of the International Conference On Preconditioning
Techniques For Large Sparse Matrix Problems In Scientific And Industrial Applications.
2003.
[89] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems.
J. Res. Nat. Bur. Standards., 49:409–436, 1952.
[90] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[91] A. Greenbaum and Z. Strakos. Predicting the behavior of finite precision lanczos and
conjugate gradient computations. SIAM J. Matrix Anal. Appl., 13:121–137, 1992.
[92] A. Van Der Sluis and H. A. Van Der Vorst. The rate of convergence of conjugate gradients.
Numer. Math., 48:543–560, 1986.
[93] D. Luenberger. Introduction to Linear and Nonlinear Programming. Addison Wesley,
second edition, 1984.
[94] O. Axelsson. A survey of preconditioned iterative methods for linear systems of equations.
BIT, 25:166–187, 1987.
[95] P. Concus, G. Golub, and D. O’Leary. A generalized conjugate gradient method for the
numerical solution of elliptic partial differential equations. In J. R. Bunch and D. J. Rose,
editors, Sparse Matrix Computations, pages 309–332. Academic Press, 1976.
[96] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University
Press, 1989.
[97] T. A. Manteuffel. An incomplete factorization technique for positive definite linear sys-
tems. Math. Comput., 34:473–497, 1980.
[98] Gary L. Miller, Shang-Hua Teng, and Stephen A. Vavasis. A unified geometric approach
to graph separators. In IEEE Symposium on Foundations of Computer Science, pages
538–547, 1991.
[99] C. Walshaw and M. Cross. Parallel optimisation algorithms for multilevel mesh partition-
ing. Parallel Computing, 26:1635–1660, 2000.
[100] F. Pellegrini and J. Roman. Scotch: A software package for static mapping by dual recur-
sive bipartitioning of process and architecture graphs. In High-Performance Computing
and Networking, pages 493–498, 1996.
[101] C. Chevalier and F. Pellegrini. PT-scotch: A tool for efficient parallel graph ordering.
Computing Research Repository, 2009.
[102] F. Gioachin, A. Sharma, S. Chakravorty, C. Mendes, L. V. Kale, and T. R. Quinn. Scalable
cosmology simulations on parallel machines. In VECPAR 2006, LNCS 4395, pp. 476-489,
2007.
[103] F. Gioachin, P. Jetley, C. L. Mendes, L. V. Kale, and T. R. Quinn. Toward petascale
cosmological simulations with ChaNGa. Technical Report 07-08, 2007.
[104] J. Barnes and P. Hut. A hierarchical o(n log n) force calculation algorithm. Nature, page
324, 1986.
[105] Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale, and Thomas R. Quinn.
Massively parallel cosmological simulations with ChaNGa. In Proceedings of IEEE In-
ternational Parallel and Distributed Processing Symposium 2008, pages 1–12, 2008.
[106] P. Jetley, L. Wesolowski, F. Gioachin, L. V. Kale, and T. R. Quinn. Scaling Hierarchical
N -body Simulations on GPU Clusters. In Proceedings of the ACM/IEEE Supercomputing
Conference 2010, 2010.
[107] M. A. Heroux, P. Raghavan, and H. D. Simon. Parallel Processing for Scientific Comput-
ing (Software, Environments and Tools). SIAM, Philadelphia, PA, USA, 2006.
[108] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
Baltimore, MD, third edition, 1996.
[109] D. Rixen, C. Farhat, R. Tezaur, and J. Mandel. Theoretical Comparison of the FETI and
Algebraically Partitioned FETI Methods, and Performance Comparisons with a Direct
Sparse Solver. Int. J. Numer. Meth. Engrg., 46:501–534, 1999.
[110] C. Farhat and J. Li. An iterative domain decomposition method for the solution of a
class of indefinite problems in computational structural dynamics. Appl. Numer. Math.,
54(2):150–166, 2005.
[111] J. Sun, P. Michaleris, A. Gupta, and P. Raghavan. A fast implementation of the FETI-
DP method: FETI-DP-RBS-LNA and applications on large scale problems with localized
nonlinearities. International Journal for Numerical Methods in Engineering, 60(4):833–
858, 2005.
[112] M. T. Heath and P. Raghavan. A Cartesian nested dissection algorithm. SIAM J. Matrix
Anal. Appl., 16(1):235–253, 1995.
[113] A. George and J. W-H Liu. An automatic nested dissection algorithm for irregular finite
element problems. SIAM J. Numer. Anal., 15:1053–1069, 1978.
[114] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning
irregular graphs. SIAM Journal on Scientific Computing, 20:359–392, 1998.
[115] R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection. SIAM J. Numer.
Anal., 16:346–358, 1979.
[116] J. R. Gilbert, G. L. Miller, and S. H. Teng. Geometric mesh partitioning: Implementation
and experiments. Technical Report CSL–94–13, Xerox Palo Alto Research Center, 1994.
A short version appears in the proceedings of IPPS 1995.
[117] P. Raghavan and K. Teranishi. Parallel hybrid preconditioning: Incomplete factorization
with selective sparse approximate inversion. SIAM J. Sci. Comput., 32:1323–1345, May
2010.
[118] Lois Mansfield. On the conjugate gradient solution of the schur complement system
obtained from domain decomposition. SIAM J. Numer. Anal., 27:1612–1620, November
1990.
Vita
Anirban Chatterjee is a graduate student in the Department of Computer Science and
Engineering at The Pennsylvania State University, working in the Scalable Scientific Computing
Laboratory, under the guidance of Professor Padma Raghavan.
Anirban came to Penn State with Bachelor and Master of Science degrees in Computer
Science from the University of Pune. While working on his Master's program, he worked with Dr.
Govind Swarup at the Tata Institute of Fundamental Research (TIFR) on computational modeling
in radio astronomy, where he developed an algorithm for locating radio frequency interference
using the Giant Metrewave Radio Telescope.
While at Penn State, Anirban worked on data mining and computational modeling problems.
Between 2006 and 2009, he also worked on software engineering of phase-field codes for the
Center for Computational Materials Design (CCMD) with Professor Long-Qing Chen (Depart-
ment of Material Science and Engineering). Since 2006, Anirban has been an active student
member of IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Com-
puting Machinery), and SIAM (Society for Industrial and Applied Mathematics).
In the Summer of 2009, Anirban was awarded a Master of Engineering Degree from
the Department of Computer Science and Engineering. During Fall 2010, he was a graduate
instructor for CMPSC 450 (Concurrent Scientific Computing).
In his free time, Anirban enjoys photography and traveling besides spending time with
his family and rabbit, Purple.